Day 3 Of 100 Days of Code
Our third day of 100 Days of Code was filled with an in-depth analysis of linear regression and an elaboration on gradient descent algorithms. First, it should be noted that we successfully published two deeply interwoven articles. The first analyzed linear regression algorithms in depth: their utility and mode of implementation. The second followed up on this, diving into gradient descent algorithms, their design, and associated concepts such as the cost function and learning rate. These articles were thoroughly written, totaling around 3,000 words between them.
In addition to writing these articles, we rigorously investigated the academic literature on this subject, which we will expand upon below. Let's begin.
Before We Get Started
I want to let you all in on a secret before we proceed with modeling machine-learning-derived data. One of the trickiest things about working in machine learning is that you sometimes need to create your own programming language. While we plan on releasing a series on this in the future, I myself learned from the folks at ProLang. They offer an awesome book that goes into the details of creating language features that best support machine learning. Follow the link provided here; it will take you to ProLang, where you can sign up and get their book. They even let you get it for free. Getting this book was one of the best decisions I have made, and it drastically improved my coding experience. Fill out an order and begin creating your own sophisticated machine learning models today.
Linear Regression Via the Art of Better Programming
A linear regression model may be trained by two separate methodologies. First, we may use a closed-form solution, which directly computes the model parameters that yield the optimal fit, that is, the parameter values that minimize the cost function. Alternatively, we can use an iterative methodology such as Gradient Descent (GD), which progressively alters the parameter values, likewise endeavoring to minimize the cost function as the parameters are tweaked.
We first begin by examining the closed-form approach to training the model. A closed-form model is a linear function that predicts the value of one parameter of a data object based on the values of its other parameters; in essence, it is a linear combination of those values. Take a look at the following closed-form linear regression model:
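In its standard form (where x1 through xn are the known parameter values, the thetas are the weights, and θ0 is an intercept term), the model reads:

```latex
y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n
```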
Here, ‘y’ represents the parameter of the data object to be predicted. The known parameters of the object are x1 through xn, while ‘y’ is the single unknown. The theta values are weights assigned to the known parameters, where some carry greater significance in determining the value of ‘y’ than others.
Because a given data object has a series of known parameters, we can collect these known parameters in a vector x. Each parameter of x is associated with a weight ‘θ‘, and thus we may model the thetas in their own vector, whose transpose is θT. If the vector x has the form:
and the vector θT is modeled as:
then a vectorized version of the closed form linear regression algorithm may be displayed as:
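In standard notation (with a dummy entry x0 = 1 so the intercept θ0 folds into the dot product), the two vectors and the resulting vectorized model take the form:

```latex
\mathbf{x} =
\begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{pmatrix},
\qquad
\theta^{T} =
\begin{pmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{pmatrix},
\qquad
y = \theta^{T} \mathbf{x}
```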
Theta and ‘x’ here are both vectors, so the parameter value ‘y’ is computed by their dot product. If you need a refresher on this computation, check out our article which discusses this subject specifically. If you’d like to take a look at the code needed for encoding these models, check out the following article.
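As a quick sketch of this dot-product prediction in NumPy (the feature and weight values here are our own toy numbers, with the leading 1.0 standing in for the x0 = 1 intercept entry):

```python
import numpy as np

# Known parameter values of one data object, with x0 = 1 for the intercept term.
x = np.array([1.0, 2.0, 3.0])

# Weights theta, one per known parameter.
theta = np.array([0.5, 0.1, -0.2])

# The prediction y is the dot product theta^T x.
y = theta.dot(x)
print(y)  # 0.5*1.0 + 0.1*2.0 + (-0.2)*3.0, i.e. approximately 0.1
```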
Cost Functions Via the Art of Better Programming
When training a machine learning linear regression model, our primary endeavor is finding the parameters that produce the best-fitting function. One obstacle is that we need a means of quantifying how well our model performs. One of the most common measures for a linear regression model is the Root Mean Squared Error (RMSE) cost function: we seek the parameter vector theta that minimizes the RMSE. In practice, however, it is simpler to minimize the Mean Squared Error (MSE), which yields the same optimal parameters. The Mean Squared Error function may be modeled as follows:
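Written out, with m training examples, x(i) the feature vector of the i-th example, and y(i) its known target value, the standard MSE is:

```latex
\mathrm{MSE}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( \theta^{T} \mathbf{x}^{(i)} - y^{(i)} \right)^{2}
```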
In other words, given the full set of training examples, each with known parameters ‘x’ and target value ‘y’, we seek the theta that minimizes the MSE over the entire data set.
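A minimal sketch of this MSE computation in NumPy; the `mse` helper and the tiny data set are our own illustrations (the first column of X is the x0 = 1 intercept entry):

```python
import numpy as np

def mse(theta, X, y):
    """Mean Squared Error of the predictions X @ theta against the targets y."""
    errors = X @ theta - y          # prediction error for every example
    return np.mean(errors ** 2)     # average of the squared errors

# Toy data set: 3 examples, 2 parameters (first column is x0 = 1).
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta_perfect = np.array([1.0, 1.0])   # fits y = 1 + x exactly
print(mse(theta_perfect, X, y))        # 0.0 -- a perfect fit has zero cost
```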
Gradient Descent Via the Art of Better Programming
Gradient descent is a generic optimization algorithm that may be applied to tune the parameters of a model so as to minimize the cost function. It works by measuring the local gradient of the cost function (which may be computed with partial derivatives) and gradually altering the parameter weights until the cost function is minimized.
Initially, the theta parameters are filled with random values, a step known as random initialization. The algorithm then repeatedly steps in the direction in which the cost function decreases, improving gradually until it converges on a minimum. In an article by Sebastian Ruder, gradient descent is conceptualized as “a way to minimize an objective function J(θ) parameterized by a model’s parameters θ ∈ Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. to the parameters.”
One of the most important parameters of the gradient descent algorithm is the learning rate, which must be set in advance as a hyperparameter. The learning rate defines how rapidly theta descends toward the optimized value: the greater the learning rate, the faster the algorithm arrives at that value; the smaller the learning rate, the longer it takes.
There are important setbacks to take note of. If the learning rate is too large, the algorithm may overshoot and bounce back and forth between alternating sides of the cost function’s valley. On the other hand, if the learning rate is too small, the algorithm may take an impractically long time to converge. Ruder conceptualizes the learning rate as follows: “The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.” Thus, when executing a gradient descent algorithm, it is important to establish an intermediate learning rate so that these errors are avoided.
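The full loop can be sketched in NumPy as follows; this is a minimal batch gradient descent on the MSE cost, with the random initialization and fixed learning rate described above (the function name, step count, and toy data are our own choices):

```python
import numpy as np

def gradient_descent(X, y, learning_rate=0.1, n_steps=1000):
    """Batch gradient descent on the MSE cost for linear regression."""
    m, n = X.shape
    theta = np.random.randn(n)                       # random initialization
    for _ in range(n_steps):
        # Gradient of the MSE w.r.t. theta: (2/m) * X^T (X theta - y)
        gradient = (2.0 / m) * X.T @ (X @ theta - y)
        theta -= learning_rate * gradient            # step opposite the gradient
    return theta

# Toy data set: y = 1 + x, with the first column of X being x0 = 1.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])

theta = gradient_descent(X, y)
print(theta)  # converges near [1.0, 1.0], recovering y = 1 + x
```

Raising `learning_rate` well past this value makes the updates diverge, while shrinking it toward zero makes the same 1,000 steps land far from the minimum, which is exactly the trade-off described above.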
Academic Literature of the Day
Simple Linear Regression, Correlation, and Calibration
Title: “Simple Linear Regression, Correlation, and Calibration”
Author: Thomas P. Ryan
Publisher: John Wiley & Sons
Discussion: Well, this turned out to be a bit more than an academic journal article; it is actually a section from a textbook we found. Nevertheless, it gave a quite thorough introduction to and analysis of the use of linear regression and correlation in machine learning models. Ryan describes linear regression in the following manner: “In (univariate) regression there is always a single “dependent” variable, and one or more “independent” variables.” Consider checking it out.
The Take Away
Linear regression has been on our minds for quite some time now. Tomorrow, on day four, we will press further into this subject as we explore variants of gradient descent, like batch gradient descent, stochastic gradient descent, mini-batch gradient descent, and more. We look forward to seeing you there.