## Introduction to Stochastic Gradient Descent

Our previous articles investigated the importance of linear regression and batch gradient descent in machine learning modeling, as well as the intricacies of gradient descent models. Here we extend that discussion of gradient descent to linear regression as it relates to stochastic gradient descent. Before embarking on this discussion, we would like to provide a brief overview of the analyses explored in the *Topics of Machine Learning* series.

Much of our investigative effort has centered on one of two topics: the machine learning models themselves and the performance metrics of those models. With respect to the various machine learning models investigated, several articles have been published. We began with an overview of the primary machine learning models and algorithms in frequent use. This was immediately followed by a series of four articles expanding on that overview, devoted respectively to supervised learning models, unsupervised learning models, a comparison of batch and online learning, and instance-based versus model-based learning. We then implemented all of this knowledge in succinct examples, the first of which addressed classification algorithms on the MNIST data set, as well as multi-class classification.

Having discussed the intricacies of the machine learning models themselves, we then examined the metrics which allow us to quantify the performance of our models and algorithms. These discussions of performance measures revolved around techniques of cross-validation, the confusion matrix, distinctions between precision and recall, and the utility of the ROC curve. All of these are readily available as analytical measures of our machine learning models. Here, we move towards a more theoretical discussion and the applicability of fundamental algorithms. In this case, let us explore perhaps the most ubiquitous machine learning algorithm: stochastic gradient descent.

## Before We Get Started

I want to let you all in on a secret before we proceed with the modeling of machine-learning-derived data. One of the trickiest things about working in machine learning is that you sometimes need to create your own programming language features. While we plan on coming out with a series on this in the future, I myself learned from the folks at ProLang. They offer an excellent book that goes into the details of creating language features that best support machine learning. Follow the link provided here; it will take you to ProLang, where you can sign up and get their book. Getting this book was one of the best decisions I ever made, and it drastically improved my coding experience.

## Conceptualizing Stochastic Gradient Descent

*MNIST Data Set*

The MNIST database is one of the most basic machine learning data sets. It consists of 70,000 images of handwritten digits along with the label associated with each of them. This database is a staple for training image-processing systems: 60,000 of its images form the training set, while the remaining 10,000 form the test set. This data set lends itself particularly well to analysis by machine learning classifiers.

The Scikit-Learn library, perhaps the most ubiquitous external library used in machine learning, makes a wide variety of machine learning data sets available to users. We can access the MNIST data set in particular by first importing the ‘fetch_openml’ function from the ‘sklearn.datasets’ module. We then call fetch_openml and load the MNIST data set by passing in the argument ‘mnist_784’. Check out the code for doing this:
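A minimal sketch of that fetch, wrapped in a helper function (the function name `load_mnist` is our own; the download itself can take a while on first call):

```python
from sklearn.datasets import fetch_openml

def load_mnist():
    # Downloads MNIST from OpenML on the first call, then uses a local cache.
    # 'mnist_784' names the flattened 28x28 (= 784-pixel) version of MNIST;
    # version=1 pins the original upload. as_frame=False returns NumPy arrays
    # rather than a pandas DataFrame.
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    return mnist.data, mnist.target
```

Calling `load_mnist()` should yield a (70000, 784) pixel array and a (70000,) label array.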

Once upon a time, the function we would have imported was ‘fetch_mldata’. However, that function pulled data sets from a now-defunct website, so we now use the ‘openml’ database instead. We specify the ‘mnist_784’ data set, in particular its first version. We can then print out the response, which has several components. First, there is an array holding the different digit labels contained in the data set; it is far too large to print out in its entirety.

There is also an additional array object which stores the pixel intensities of each image.

In total, each image consists of 784 pixels. Finally, part of the information which prints out provides background on the data set itself.

According to the documentation on the MNIST data set, found at the following link, “The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples.” As we have already noted, and this text confirms, we have a total of 70,000 data objects, 60,000 of which are training items and 10,000 of which are test items.

The text continues by saying, “The original black and white (bilevel) images from NIST were size normalized to fit in a 20×20 pixel box while preserving their aspect ratio.” This tells us that the source data objects were reduced to 20×20 pixels of information. However, as the researchers note, “the images were centered in a 28×28 image by computing the center of mass of the pixels, and translating the image so as to position this point at the center of the 28×28 field.” This construction of the final images on a 28×28 grid accounts for the 784 pixels which make up each image.
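The 28×28 grid and the flat 784-pixel rows are two views of the same data; the relationship can be checked with a quick sketch (using a stand-in vector rather than a real MNIST row):

```python
import numpy as np

# Each MNIST row is a flat vector of 784 pixel intensities; reshaping it
# to 28x28 recovers the original image grid described in the documentation.
flat = np.arange(784)           # stand-in for one image row
image = flat.reshape(28, 28)

print(image.shape)              # (28, 28)
print(28 * 28)                  # 784
```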

Their document provides an additional golden egg of information which may pass unrecognized. However, giving this item its deserved attention can drastically improve machine learning models developed with this data set in mind. The authors note, “With some classification methods (particularly template-based methods, such as SVM and K-nearest neighbors), the error rate improves when the digits are centered by bounding box rather than center of mass.” Taking this into consideration, if we decide to employ SVM or KNN learning algorithms, our models improve markedly if we center the digits by this bounding box rather than by their center of mass.

*Introducing the Data*

Before moving on, we should mention three important features of Scikit-Learn datasets. Each dataset includes a DESCR attribute, which provides a description of the dataset. In the case of the MNIST data set, text from the website we just quoted appears in the DESCR of the Scikit-Learn data set.

Scikit-Learn data sets also exhibit a ‘data’ key, which holds the structure of the data: a single row for each instance and a single column for each feature of the data.

In addition to the data array, we have a ‘target’ array, which holds the label associated with each row of the data array.
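The same DESCR/data/target layout can be inspected offline with the small 8×8 digits set that ships with Scikit-Learn, used here as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits

# load_digits is bundled with scikit-learn (8x8 digit images), so it works
# offline; the Bunch it returns has the same DESCR/data/target layout as
# the MNIST Bunch discussed above.
digits = load_digits()

print(digits.DESCR[:60])    # free-text description of the data set
print(digits.data.shape)    # (1797, 64): one row per image, one column per pixel
print(digits.target.shape)  # (1797,): one label per image
```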

To create a machine learning model, the first order of business is to separate the training data from the testing data. We can acquire these sets of information by separately calling the ‘data’ and ‘target’ arrays. The two arrays differ significantly in shape: ‘data’ has a shape of 70,000×784, because each image consists of 784 pixels, while the ‘target’ array is one-dimensional with only 70,000 entries, as these represent the labels. Keep in mind that both of these arrays must be further separated into training and testing data. Let’s take a look at how we might best organize these groups of data:
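A minimal sketch of that split; the zero-filled arrays below are stand-ins with MNIST's shapes (in practice `X` and `y` come from the fetched data set's ‘data’ and ‘target’ arrays):

```python
import numpy as np

# Stand-in arrays with MNIST's shapes; in practice these would be
# mnist.data and mnist.target as fetched earlier.
X = np.zeros((70000, 784))
y = np.zeros(70000)

# MNIST's conventional split: the first 60,000 rows form the training
# set and the remaining 10,000 form the test set.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

print(X_train.shape, X_test.shape)  # (60000, 784) (10000, 784)
print(y_train.shape, y_test.shape)  # (60000,) (10000,)
```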

We can see from the code above that we first separated the MNIST data into its two individual arrays, the data array and the label array. We then split each of these arrays into training objects and testing objects. This leaves us with four groups of data we can use for training and evaluating our machine learning models.

*Consideration of Stochastic Gradient Descent*

An article, provided here, offers a helpful explanation of the functionality of the stochastic gradient descent algorithm:

“The word ‘*stochastic*‘ means a system or a process that is linked with a random probability. Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole data set for each iteration. In Gradient Descent, there is a term called “batch” which denotes the total number of samples from a dataset that is used for calculating the gradient for each iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is taken to be the whole dataset. Although, using the whole dataset is really useful for getting to the minima in a less noisy and less random manner, but the problem arises when our datasets gets big.”

To train this stochastic gradient descent algorithm as a binary classifier, we must first establish the proper data sets. We need to derive “is this a five?” labels from our label arrays for both the training and the testing data. Once we do this, all we must do is instantiate the SGDClassifier from Scikit-Learn and pass in the training data along with the training labels for the fives. The code for executing this functionality appears as follows:
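A minimal sketch of that binary five-detector; to keep it self-contained and offline, it uses the bundled 8×8 digits set as a stand-in for MNIST (the split sizes below are our own choice, not MNIST's):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Offline stand-in for MNIST: the bundled 8x8 digits set.
digits = load_digits()
X_train, y_train = digits.data[:1500], digits.target[:1500]
X_test, y_test = digits.data[1500:], digits.target[1500:]

# Binary labels: True where the image is a 5, False otherwise.
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

# random_state makes the stochastic shuffling reproducible.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# Predict on one test image: returns an array holding True or False.
print(sgd_clf.predict(X_test[:1]))
```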

We can then simply use the ‘predict’ function and pass in some particular digit image. If that digit is a five, the function will, to some degree of accuracy, return true; if it is not a five, the function will, to some degree of accuracy, return false.

## Example of Stochastic Gradient Descent

As we saw in our discussion of batch gradient descent, one consequence of that methodology is that it must use the entire training set to compute the gradients of the cost function, which ultimately yields a rather slow model. Stochastic gradient descent occupies the opposite side of the spectrum: it selects a random instance of the data set and computes the gradient as a function of that single instance. This makes it practical to operate on much larger data sets.
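The contrast can be sketched on toy linear-regression data with the mean-squared-error cost: the batch gradient touches every instance, the stochastic gradient touches one, and the per-instance gradients agree with the batch gradient on average, which is why SGD still heads toward the minimum:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy data: 100 instances, 3 features
y = rng.normal(size=100)
theta = rng.normal(size=3)      # arbitrary current parameter vector

# Batch gradient of the MSE cost: uses every instance at once.
batch_grad = 2 / len(X) * X.T @ (X @ theta - y)

# Stochastic gradient: one randomly chosen instance per step.
i = rng.integers(len(X))
sgd_grad = 2 * X[i] * (X[i] @ theta - y[i])

# Averaged over all instances, the per-instance gradients recover
# the batch gradient exactly.
avg_grad = np.mean([2 * x * (x @ theta - yi) for x, yi in zip(X, y)], axis=0)
print(np.allclose(avg_grad, batch_grad))  # True
```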

The stochastic gradient descent algorithm thus differs from batch gradient descent in that it picks a randomized instance of the data at each step. As a consequence, the cost decreases only on average, not continuously. The algorithm therefore eventually gets close to the minimum of the cost function but never settles down; rather, it bounces around the minimum.

When a cost function is irregular, with multiple local minima, the randomness of the stochastic gradient descent algorithm can be particularly useful for escaping those local minima. The drawback is that the algorithm never arrives exactly at the global minimum.

In a process akin to simulated annealing, the algorithm takes smaller and smaller steps as it moves closer and closer to the minimum. This helps the algorithm settle on parameter values that minimize the cost function.

The learning schedule defines the learning rate of the algorithm at every iteration of the model. If the learning rate decreases too rapidly, the algorithm may get stuck in a local minimum. If it decreases too slowly, the algorithm will keep jumping around the minimum and may end with a suboptimal solution.

Take a look at the code we might use to model a stochastic gradient descent algorithm:
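A minimal from-scratch sketch of that model on toy linear-regression data; the schedule hyperparameters `t0` and `t1` and the epoch count are illustrative choices, not fixed by the algorithm:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy linear-regression data: y = 4 + 3x + noise.
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))
X_b = np.c_[np.ones((m, 1)), X]      # prepend a bias column of ones

n_epochs = 50
t0, t1 = 5, 50                       # learning-schedule hyperparameters

def learning_schedule(t):
    # Step size shrinks as training proceeds (the simulated-annealing idea).
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))      # random initialization
for epoch in range(n_epochs):
    for i in range(m):
        idx = rng.integers(m)        # pick one random training instance
        xi = X_b[idx:idx + 1]
        yi = y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta -= eta * gradients

print(theta.ravel())   # should land close to the true parameters [4, 3]
```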

When we execute this code, we acquire the final result for the parameter values.

## Comparing Batch and Stochastic Gradient Descent

As Wilson contends in one of his academic publications, “Gradient descent training of neural networks can be done in either a *batch* or *on-line* manner. A widely held myth in the neural network community is that batch training is as fast or faster and/or more ‘correct’ than on-line training because it supposedly uses a better approximation of the true gradient for its weight updates.”

Wilson proceeds to argue, “As the size of the training set grows, the accumulated weight changes for batch training become large. This leads batch training to use unreasonably large steps, which in turn leads to unstable learning and to the overshooting of curves and local minima in the error landscape. Therefore, a reduction in the size of the learning rate (or, equivalently, using the average weight change instead of the sum) is required for batch training to be stable as the size of the training set grows. However, this means that the larger the training set is, the more computation is required for a weight change of the same magnitude.”

He summarizes the dilemma in the following manner: “On the other hand, on-line training uses the ‘local’ gradient of each training instance to determine what direction to go. These local gradients can be noisy and contradict each other, but will on average move in the direction of the true gradient. Since on-line training updates weights after each instance, it is able to follow curves in the gradient during the course of an epoch. It can also handle any size of training set without having to reduce the learning rate.”

Consequently, we can see that stochastic gradient descent overcomes many of these setbacks.

## The Take Away

The stochastic gradient descent algorithm is one of the most fundamental optimization methods programmers employ in machine learning, particularly for training linear regression models. Gradient descent provides a succinct means of optimizing the parameter values such that the cost function, whichever is being used, is minimized. We briefly addressed the cost function, particularly the mean squared error cost function. We then got into the important facets of the gradient descent algorithm, followed by the nuances of the learning rate. In our next article, we begin to look at the features of polynomial regression as we move away from linear regression.