Introduction to Gradient Descent
Our previous article investigated the importance of linear regression in machine learning modeling, but here we focus on implementing this technique. In particular, we investigate linear regression as it relates to gradient descent machine learning. Before embarking on this discussion, we would like to provide a brief overview on the analyses explored in the Topics of Machine Learning series.
Much of our investigative efforts have centered on one of two topics: actual machine learning models and performance metrics of the models. With respect to the various machine learning models investigated, several articles have been published. We began first with an overview of the primary machine learning models and algorithms frequently utilized. This was immediately followed by a series of four articles expanding on this, including devotion to supervised learning models, unsupervised learning models, a comparison of batch and online learning, and instance vs model based learning. We implemented all of this knowledge into two succinct examples. The first addressed classification algorithms on the MNIST data set, as well as multi-class classification.
Having discussed the intricacies of machine learning models themselves, we further extrapolated on the metrics which allow us to quantify performance of our models and algorithms. These discussions of performance measures revolve around techniques of cross-validation, the confusion matrix, distinctions between precision and recall, as well as the utility of the ROC curve. All of these are readily available as analytical measures of our machine learning models. Here, we move towards more theoretical discussions and applicability of fundamental algorithms. In this case, let us explore perhaps the most ubiquitous machine learning algorithm, the linear regression model.
Before We Get Started
I want to let you all in on a secret before we proceed with the modeling of machine learning derived data. Sometimes one of the trickiest things to do when working in machine learning is that you sometimes need to kind of create your own programming language. While we plan on coming out with a series on this in the future, I myself learned from the folks at ProLang. They offer an awesome book that goes into the details of creating language features that best support machine learning. Follow the link provided here. This will take you to ProLang where you can sign up and get their book. Getting this book was the best decision I ever made, and it drastically improved my coding experience. All you need to do is click the link and order a copy of the book. They even let you get it for free. Fill out an order and begin creating your own sophisticated machine learning models today.
Conceptualizing Gradient Descent
Recall that a linear regression model may be be trained by two separate methodologies. Firstly, we may use a closed for function which computes the model parameters that yield the most optimal fit. Furthermore, the closed form function operates by minimizing the cost function as parameter values are modulated. We can also use iterative methodologies, like the Gradient Descent (GD) approach. This model gradually alters the parameter values progressively, also endeavoring to minimize the cost function as parameters are tweaked.
For training a machine learning linear regression model, our primary endeavor is setting parameters that create a best fit algorithm. In order to do this, one obstacle is that we need a means by which we can quantify how well our model performs. One of the most common means of quantifying the performance of the linear regression model is through the use of the root mean squared error cost function. In this manner, the parameter theta is found which minimizes the root mean squared error. However, a much simpler mechanism is utilizing the Mean Squared Error (MSE). The Mean Squared Error function may be modeled as follows:
Therefore, for every parameter theta and ‘x’, including the value of ‘y’, we seek to minimize the MSE over the entire data set. If a vector of predictions is generated from a sample of n data points on all variables, and is the vector of observed values of the variable being predicted, with being the predicted values (e.g. as from a least-squares fit), then the within-sample MSE of the predictor is computed as the function above.
The MSE can also be computed on q data points that were not used in estimating the model, either because they were held back for this purpose or because these data have been newly obtained. In this process, which is known as cross-validation, the MSE is often called the mean squared prediction error. When we utilize this function, we are able to get an optimized prediction of data values.
Features of Gradient Descent
Gradient descent is an available algorithm that may be applied to optimize the parameters of a data set while also minimizing the cost function. Gradient descent optimizes the model such that it measures the local gradient of the cost function (which may be computed with partial derivatives) and gradually alters the weight of the parameter until the cost function is minimized.
Initially, the theta parameter is filled with random values, a step known as random intialization. Once this is done, the algorithm moves in a particular direction as the cost function improves gradually. Eventually, the algorithm will converge on a minimum value. In an article by Sebastian Ruder, gradient descent is conceptualized as, “a way to minimize an objective function J(θ) parameterized by a model’s parameters θ ∈ Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. to the parameters.”
One of the most important parameters of the gradient descent algorithm is the learning rate, which must be exogenously established. The learning rate defines how rapidly the theta parameter value descends to that optimized value. The greater the learning rate, the faster that the algorithm arrives at the optimized value, and if small, the longer it takes.
There are important set backs to take note of. If the learning rate is too large, the algorithm may bounce back and forth to alternating sides of the cost function gradient. On the other hand, if the learning rate is too small, it may take an infinitely long amount of time. Ruder conceptualizes the learning rate as follows, “The learning rate η determines the size of the
steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.” Thus, when executing a gradient descent algorithm, it is important to establish an intermediate learning rate to ensure these errors are avoided.
Dealing With Complex Cost Functions
Not all cost functions appear parabolic and only have a single minimum, a global minimum. In some cases, they may be more erratic with multiple local minima and maxima, as well as a global minimum. Additionally, there may be plateau regions which are relatively flat, and thus the algorithm will be optimized very slowly. One consequence of having a cost function that has a local minimum is the fact that the algorithm may get stuck in one of these valleys even though this is not the most optimized theta value. Fortunately, in gradient descent, the MSE cost function is a parabolic curve, and thus won’t cause any significant issues with respect to optimizing the parameter value.
Challenges of Gradient Descent
Ruder points out several important challenges posed in gradient descent, which should be noted:
• Choosing a proper learning rate can be difficult. A learning rate that is too small leads to painfully slow convergence, while a learning rate that is too large can hinder convergence and cause the loss function to fluctuate around the minimum or even to diverge.
• Learning rate schedules try to adjust the learning rate during training by e.g. annealing, i.e. reducing the learning rate according to a pre-defined schedule or when the change in objective between epochs falls below a threshold. These schedules and thresholds, however, have to be defined in advance and are thus unable to adapt to a dataset’s characteristics.
• Additionally, the same learning rate applies to all parameter updates. If our data is sparse and our features have very different frequencies, we might not want to update all of them to the same extent, but perform a larger update for rarely occurring features.
• Another key challenge of minimizing highly non-convex error functions common for neural networks is avoiding getting trapped in their numerous suboptimal local minima. Dauphin et al. argue that the difficulty arises in fact not from local minima but from saddle points, i.e. points where one dimension slopes up and another slopes down. These saddle points are
usually surrounded by a plateau of the same error, which makes it notoriously hard for SGD to escape, as the gradient is close to zero in all dimensions.
The Take Away
The gradient descent algorithm is one of the most fundamental linear regression models programmers employ in machine learning. Gradient descent provides a succinct means of optimizing the parameter values such that the cost function, whichever is being used, is minimized. We briefly addressed the cost function, particularly the mean square error cost function. We then got into the important facets of the gradient descent algorithm, followed by the nuances of the learning rate. In our next article, we dig deeper into the subject of gradient descent by focusing specifically on batch gradient descent.