## An Introduction to the MNIST Data Set

The present article elaborates on the essential features of the MNIST data set for training machine learning models. Within this *Topics of Machine Learning* series, we elaborated extensively on the various machine learning model categories that exist. Our first article to this series provided an overview to all of these model categories, an investigation which you may find here. After providing this broad introduction, we began diving individually into the specificities and algorithms associated with each of these models. This began with an analysis of supervised learning models, particularly its classifier and regression algorithms.

Next, we followed this with a discussion of the features of unsupervised learning models. We then differentiated the methodologies of batch and online learning, which you may investigate here. We finally followed up with an elaboration of the differences between instance based and model based learning, found here. With the individual algorithms explored and discussed in depth, we arrive at the present discussion, which seeks to dig into the fundamentals of classification algorithms. Along that line of thought, we begin by exploring the MNIST data set and how it may be used in machine learning classification algorithms. Let’s begin.

## Conceptualizing the MNIST Dataset

The MNIST represents one of the most basic machine learning databases. It consists of over 70,000 images of handwritten digits and the labels associated with each of these. This database is quite essential for training image processing systems. This dataset includes 60,000 images which represent the training set, while the left-over 10,000 images denote the test set. This data set lends itself in particular to analysis by machine learning classifiers.

Historically speaking, a variety of machine learning algorithms have been trained on this data set. Perhaps the least accurate of these algorithms manifests in the implementation of a pairwise linear classifier which exhibits an error rate of 7.6%. Alternatively, use of a K-Nearest Neighbor algorithm in a classifier procured an error rate of 0.52%, an exceptionally low rate of error for non-neural network based classifiers. Remarkably, a support vector machine (SVM) algorithm derived machine learning model successfully achieved a similar error rate of 0.56%. On the side of neural network based machine learning models, the least effective of these (1.6% error rate) procured from a two-layer deep neural network. Alternatively, around six convolutional neural networks have acquired error rates below 0.3% (as low as 0.17% error rate in one case).

## Importing the MNIST Data Set

The SciKitLearn library, perhaps the most ubiquitous exogenous library used in machine learning, has a wide variety of machine learning data sets available to users. We can access the MNIST data set in particular first by importing the ‘fetch_openml’ object from the ‘scikitlearn’ library. We then use the fetch_openml method and import the MNIST data set by passing in the argument ‘MNIST original’. Check out the code for doing this:

Once upon a time, the method we would have imported was the ‘fetch_mldata’ method. However, this stored data sets in a now defunct website, so we now utilize the ‘openml’ database. We specify the ‘mnist_784’ data set, in particular the first version. We then are able to print out the response. This response has several components. Firstly, we retrieve an arrat which represents the different digits contained in the data set. This is far too large to print out in its entirey, but briefly, it appears as:

There’s also an additional array object which stores the pixel information for each pixel, which looks like:

In totality, each image consists of 784 pixels. Finally, part of the information which prints out represents background on the data set itself.

## Background on the MNIST Data Set

According to the documentation on the MNIST data set, found at the following link, “The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples.” As we have already elaborated, and this text confirms, we have a totality of seventy-thousand data objects, 60,000 of which are training items and 10,000 test items.

*Notes From External Doc String*

The text continues by saying “The original black and white (bilevel) images from NIST were size normalized to fit in a 20×20 pixel box while preserving their aspect ratio.” This information confers the notion that the data objects are reduced to 20×20 bits of information. However, as these researchers note, “the images were centered in a 28×28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28×28 field.” This identity of images being constructed with 28×28 bits of information accounts for the 784 pixels which construct each image.

Their document provides an additional golden egg of information which may pass as unrecognized. However, giving this item deserved attention can drastically improve machine learning models developed with this data set in mind. These authors contend, “With some classification methods (particuarly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass.” Taking this note of information into consideration, it’s important to keep in mind that, if we decide to employ SVM or KNN learning algorithms, our efforts greatly improve if we center the digits on this proverbial bounding box instead of its center of mass.

*Features of SciKitLearn Data Sets*

Before moving on, it’s incumbent upon us that we mention three important features of SciKitLearn datasets. These datasets have included with them a DESCR object, which provides a description of the dataset. In the case of our discussion of the MNIST data set, the website we just extrapolated on appears in the DESCR of the SciKitLearn data set.

SciKitLearn data sets also exhibit the data key, which confers the structure of the data wherein there is a single row for each instance and a single column for each feature of the data. This feature was represented by the following image:

We can observe the ‘data’ key which immediately precedes this essential array. In addition to the data array, we have a target array which combines the data array with the array possessing the labels. This image demonstrates the ‘pixels’ array:

## Splitting the Data Set

For creating a machine learning model, the first order of business obliges us to separate the training data from the testing data. We can separate and acquire these sets of information by separately calling the ‘data’ and ‘target’ arrays. These data sets are significantly different. The ‘data’ has a shape of 70,000×784 because each image consists of 784 pixels. The ‘target’ array, alternatively, is linear and only has 70,000 objects as these represents the labels. Keep in mind that for both of these array objects, we must also further separate them into training and testing data. Let’s take a look at how we might best organize these groups of data:

We can see from the code from above that we first separated the MNIST data into its two individual arrays, the data array and the labeled array. We then separated these two arrays based upon whether they were training objects or testing objects. This leaves us with four different groups of data we can use for training our machine learning models.

## Training a Classifier

A simple example of training a machine learning classifier will be made with this MNIST data set. This machine learning model specifically identifies the digit ‘five’ in our data set. This model divides data objects into one of two groups (the five or non-five group); thus a binary classifier. We present here an example of a binary classifier on the MNIST data set using a stochastic gradient descent classifier.

To describe in detail the features of stochastic gradient descent algorithms is well beyond the scope of this article. However, we can provide a brief introduction to this to a sufficient degree of specificity. The stochastic gradient descent algorithm is a method for optimizing a function based on the smoothness of the object. The SGD algorithm randomly (stochastically) approximates the gradient descent of the function. Thus, SGD replaces the actual gradient for the entire data set with its own prediction. In this manner, values for data objects assign based upon the predictions made by the approximation.

*Consideration of Stochastic Gradient Descent*

An article provided for here stipulates a helpful explanation to the functionality of the stochastic gradient descent algorithm:

“The word ‘*stochastic*‘ means a system or a process that is linked with a random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total number of samples from a dataset that is used for calculating the gradient for each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Although, using the whole dataset is really useful for getting to the minima in a less noisy and less random manner, but the problem arises when our datasets gets big.”

For training this stochastic gradient descent algorithm in our binary classifier, we must first establish the proper data sets. We first need to acquire the five labels from our label data set for the training and the testing data. Once we do this, all we must do is call the SGD function from scikitlearn. Then we may pass in training data and the training labels for the five. The code for executing this functionality appears as follows:

We can then simply use the ‘predict’ function and pass in the position of some particular digit. If that digit is five, to some degree of accuracy, the function will return true. If that digit is not five, to some degree of accuracy, the function will return false.

## The Take Away

The Topics of Machine Learning series elaborates on a variety of subject matters, beginning first with the specificities of a variety of machine learning models. In this article, we efficaciously deviated from this premise to dig into, with greater detail, more precise features of machine learning. This discussion revolved around the features of the MNIST data set and its ability to train an image processing model. We explored the background aspects of the MNIST data set, the features of the data itself, how and why it is essential to split the data, and how we might be able to train a stochastic gradient descent algorithm that processes the data. Our subsequent article explores in greater detail the various means for analyzing the accuracy of these models as well as certain performance aspects of the models. We look forward to seeing you there.