# Batch Gradient Descent Algorithms: Topics of Machine Learning

## Introduction to Batch Gradient Descent

Our previous article investigated the importance of linear regression in machine learning modeling, as well as the intricacies of Gradient Descent models. With respect to gradient descent, here we investigate linear regression as it relates to batch gradient descent machine learning. Before embarking on this discussion, we would like to provide a brief overview on the analyses explored in the Topics of Machine Learning series.

Much of our investigative efforts have centered on one of two topics: actual machine learning models and performance metrics of the models. With respect to the various machine learning models investigated, several articles have been published. We began first with an overview of the primary machine learning models and algorithms frequently utilized. This was immediately followed by a series of four articles expanding on this, including devotion to supervised learning modelsunsupervised learning models, a comparison of batch and online learning, and instance vs model based learning. We implemented all of this knowledge into two succinct examples. The first addressed classification algorithms on the MNIST data set, as well as multi-class classification.

Having discussed the intricacies of machine learning models themselves, we further extrapolated on the metrics which allow us to quantify performance of our models and algorithms. These discussions of performance measures revolve around techniques of cross-validation, the confusion matrix, distinctions between precision and recall, as well as the utility of the ROC curve. All of these are readily available as analytical measures of our machine learning models. Here, we move towards more theoretical discussions and applicability of fundamental algorithms. In this case, let us explore perhaps the most ubiquitous machine learning algorithm, the linear regression model.

## Before We Get Started

I want to let you all in on a secret before we proceed with the modeling of machine learning derived data. Sometimes one of the trickiest things to do when working in machine learning is that you sometimes need to kind of create your own programming language. While we plan on coming out with a series on this in the future, I myself learned from the folks at ProLang. They offer an awesome book that goes into the details of creating language features that best support machine learning. Follow the link provided here. This will take you to ProLang where you can sign up and get their book. Getting this book was the best decision I ever made, and it drastically improved my coding experience. All you need to do is click the link and order a copy of the book. They even let you get it for free. Fill out an order and begin creating your own sophisticated machine learning models today.

Recall that a linear regression model may be be trained by two separate methodologies. Firstly, we may use a closed for function which computes the model parameters that yield the most optimal fit. Furthermore, the closed form function operates by minimizing the cost function as parameter values are modulated. We can also use iterative methodologies, like the Gradient Descent (GD) approach. This model gradually alters the parameter values progressively, also endeavoring to minimize the cost function as parameters are tweaked.

Gradient descent is an available algorithm that may be applied to optimize the parameters of a data set while also minimizing the cost function. Gradient descent optimizes the model such that it measures the local gradient of the cost function (which may be computed with partial derivatives) and gradually alters the weight of the parameter until the cost function is minimized.

Initially, the theta parameter is filled with random values, a step known as random intialization. Once this is done, the algorithm moves in a particular direction as the cost function improves gradually. Eventually, the algorithm will converge on a minimum value. In an article by Sebastian Ruder, gradient descent is conceptualized as, “a way to minimize an objective function J(θ) parameterized by a model’s parameters θ ∈ Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. to the parameters.”

##### The Batch Gradient Descent Algorithm

As with any gradient descent model, the algorithm must quantify the gradient of the cost function with respect to the theta parameter. The theta parameter represents the weight associated with a particular attribute of a data object. In other words, what we are doing in effect is computing the value of the cost function for a minuscule change in the parameter value. Mathematically speaking, this is effectively computing the partial derivative.

##### Partial Derivatives

For those who may benefit from a refresher on partial derivatives, check out the following the link. We will only supply a brief overview here, but you will need a thorough comprehension on the subject. Sometimes, we may want to observe only how only one particular variable affects the function’s behavior alone. In this case, we may want to let one of these variables vary while one variable remains constant. For example, if we would like to observe function behavior of the function as x varies and y remains constant, we first say y=b where ‘b’ is a constant. In this case, we are really considering a new function of the form g(x)=f(x,b). If the function ‘g’ has a derivative at the point ‘a’, then we might say that the derivative of ‘g’ is a partial derivative of ‘f’ at (a,b). We can model this rule as:

f_x(a,b)=g'(a)\enspace \text{where}\enspace g(x)=f(x,b)

Here is where the importance of understanding limits becomes apparent. We can use our understanding of limits to procure a much more robust conceptualization of partial derivatives. Recall that our definition of limits as they pertain to derivatives takes the form:

g'(a)=\lim\limits_{h \to 0}\frac{g(a+h)-g(a)}{h}

However, this definition only works for derivatives of functions with a single variable. Therefore, we can modify this to work for a function consisting of two variables, particularly when one of the variables is set to the constant ‘b’. The conversion is rather simple, and takes the form:

f_x(a,b)=\lim\limits_{h \to 0}\frac{f(a+h,b)-f(a,b)}{h}

We can similarly acquire the derivative of ‘f’ with respect to ‘y’ at (a,b) through a similar form:

f_y(a,b)=\lim\limits_{h \to 0}\frac{f(a,b+h)-f(a,b)}{h}

We may sum up the rules for partial derivatives as follows. If a function ‘f’ is comprised of two variables ‘x’ and ‘y’, then the partial derivatives are defined by the functions fx and fy such that:

f_x(x,y)=\lim\limits_{h \to 0}\frac{f(x+h,y)-f(x-y)}{h} \newline f_y(x,y)=\lim\limits_{h \to 0}\frac{f(x,y+h)-f(x-y)}{h} \newline

There are several different notations to indicate the taking of a partial derivative, and they are really up to the user which one you prefer. If we have a function of the form f(x,y)=z, then we can represent the taking of a partial derivative with respect to ‘x’ or ‘y’ as:

f_x(x,y)=f_x=\frac{\partial f}{\partial x}=\frac{\partial}{\partial x}f(x,y)=\frac{\partial z}{\partial x}\newline \cdot \newline f_y(x,y)=f_y=\frac{\partial f}{\partial y}=\frac{\partial}{\partial y}f(x,y)=\frac{\partial z}{\partial y}

We can also simplify the taking of partial derivatives for a function down to two simple rules:

1. If we are taking the partial derivative of f(x,y) such that we are looking for fx, then keep y constant and differentiate f(x,y) with respect to x.
2. If we are taking the partial derivative of f(x,y) such that we are looking for fy, then keep x constant and differentiate f(x,y) with respect to y.

With these two rules in mind, taking the partial derivatives of a multi-variable function is made significantly easier.

##### The Cost Function

Recall the mean square error cost function is represented in the following manner:

This function represents the cost function in its vector form. In order to compute the gradient of the cost function with respect to the parameter theta, we must take the partial derivative of the function with respect to theta. Doing so yields the following function:

\frac{\partial}{\partial\theta_j}MSE(\theta)=\frac{2}{m}\sum_{i=1}^m(\theta^T\cdot x^{(i)}-y^{(i)})x^{(i)}_j

Now recall that in addition to the closed form of the function, we also have a vectorized version of the function. As a result, using partial derivatives, we can define a gradient vector of the MSE cost function. Doing so appears as follows:

∇_\theta MSE(\theta)=\begin{pmatrix} \frac{\partial}{\partial\theta_0}MSE(\theta) \\ \\ \frac{\partial}{\partial\theta_1}MSE(\theta) \\ \cdot\\ \cdot\\ \frac{\partial}{\partial\theta_n}MSE(\theta) \end{pmatrix} = \frac{2}{m}\bold{X}^T\cdot (\bold{X}\cdot \theta-\bold{y})

Recall from your calculus lessons that the gradient vector points in the direction of the steepest upward gradient. Therefore, to walk down the steepest downward gradient to find the minima of the cost function, we need only to move in the opposite direction.

##### Learning Rate

One of the most important parameters of the gradient descent algorithm is the learning rate, which must be exogenously established. The learning rate defines how rapidly the theta parameter value descends to that optimized value. The greater the learning rate, the faster that the algorithm arrives at the optimized value, and if small, the longer it takes.

There are important set backs to take note of. If the learning rate is too large, the algorithm may bounce back and forth to alternating sides of the cost function gradient. On the other hand, if the learning rate is too small, it may take an infinitely long amount of time. Ruder conceptualizes the learning rate as follows, “The learning rate η determines the size of the
steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.” Thus, when executing a gradient descent algorithm, it is important to establish an intermediate learning rate to ensure these errors are avoided.

In light of our gradient vector of the cost function, we can use the learning rate to discern the magnitude of the downward step. This allows us to discern how rapidly the parameter move towards the minimization of the cost function. All we must do is multiply the gradient vector by the learning rate to discern this magnitude. Check out the following function which demonstrates this fact:

\theta^{\text{(next step)}}=\theta – η∇_{\theta}MSE(\theta)

## Encoding Batch Gradient Descent Algorithms

Encoding the batch gradient descent algorithm is actually a rather simple endeavor. Firstly, let’s review how we begin to establish the data for the gradient descent algorithm. Take a look at how exactly we can do this:

Now, when we do this, we can easily establish the batch-gradient descent algorithm which appears as follows:

When we execute this code, we discover the following values for the theta parameters:

Evidently, we quite efficaciously of using batch gradient descent for identifying the parameter values that minimize the value of the cost function.

## Draw Backs

As purported by Wilson in one of his academic publications, “Gradient descent training of neural networks can be done in either a batch or on-line manner. A widely held myth in the neural network community is that batch training is as fast or faster and/or more ‘correct’ than on-line training because it supposedly uses a better approximation of the true gradient for its weight updates.”

Wilson proceeds to argue, “As the size of the training set grows, the accumulated weight changes for batch training become large. This leads batch training to use unreasonably large steps, which in turn leads to unstable learning and to the overshooting of curves and local minima in the error landscape. Therefore, a reduction in the size of the learning rate (or, equivalently, using the average weight change instead of the sum) is required for batch training to be stable as the size of the training set grows. However, this means that the larger the training set is, the more computation is required for a weight change of the same magnitude.”

He summarizes the dillema in the following manner, “On the other hand, on-line training uses the ‘local’ gradient of each training instance to determine what direction to go. These local gradients can be noisy and contradict each other, but will on average move in the direction of the true gradient. Since on-line training updates weights after each instance, it is able to follow curves in the gradient during the course of an epoch. It can also handle any size of training set without having to reduce the learning rate.”

## The Take Away

The batch gradient descent algorithm is one of the most fundamental linear regression models programmers employ in machine learning. Gradient descent provides a succinct means of optimizing the parameter values such that the cost function, whichever is being used, is minimized. We briefly addressed the cost function, particularly the mean square error cost function. We then got into the important facets of the gradient descent algorithm, followed by the nuances of the learning rate. In our next article, we dig deeper into the subject of gradient descent by focusing specifically on batch gradient descent.