# Ridge Regression Linear Models: Topics of Machine Learning

## Introduction to Ridge Regression

This article moves away from traditional linear and polynomial regression to a more nuanced form of regression known as ridge regression. Previously, we spent a great deal of time reviewing traditional machine learning models and the performance measures associated with them. We then examined the regression algorithms at length, in particular linear regression and polynomial regression. For linear regression, we discussed gradient descent, batch gradient descent, and stochastic gradient descent. We also covered polynomial regression in detail. Here, we change direction a bit and turn to regularized linear models, beginning with ridge regression.

## Conceptualizing Regularized Linear Models

##### Sources For Investigating Ridge Regression

I’d like to take a moment to introduce one of the best resources I have used for learning the mathematics that underlie machine learning models. This resource is called The Probabilistic Perspective to Machine Learning. This book has revolutionized my ability to create sophisticated machine learning algorithms and, as I have previously stated, it is the one to which I attribute the most significant role in helping me land a full-time job as a data scientist. Furthermore, I managed to acquire it at a significantly reduced price. While I will present some of the book’s insights in this article, if you want to explore this topic in greater detail, I highly recommend that you get a copy. You can find it here.

##### Linear Models

When applying linear models to a set of data, we run a linear regression when we expect the target value to be a linear combination of the features. In that case, we expect the prediction to take the following form:

\hat{y}(w,x)=w_0+w_1x_1+\dots+w_px_p

Here, ‘w’ is the vector w=(w1,…,wp) of coefficients, one per feature, and w0 is the intercept of the model. Linear models of this form can also serve as the basis for classification algorithms.
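To make the formula concrete, here is a minimal NumPy sketch that evaluates the linear model for a couple of samples. The coefficient values and the sample data are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical intercept and coefficients, chosen only for illustration.
w0 = 1.0                        # intercept w_0
w = np.array([2.0, -0.5, 3.0])  # coefficients w_1..w_p (p = 3)

# Two hypothetical samples, one per row, with p = 3 features each.
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 2.0, 1.0]])

# y_hat = w_0 + w_1*x_1 + ... + w_p*x_p, evaluated for every row at once.
y_hat = w0 + X @ w
print(y_hat)
```

The matrix product `X @ w` computes the weighted sum of features for each row, so a single expression covers the whole data set.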

##### Brief Discussion of Ordinary Least Squares

According to one of our primary toolkits, SciKit-Learn, the LinearRegression class fits the previous linear model by choosing the coefficients that minimize a particular cost function. In ordinary least squares, these coefficients minimize the residual sum of squares between the observed targets in the data set and the targets predicted by the linear approximation. Mathematically, it solves a problem of the following form:

\min_w||\bold{X}w-y||_2^2

In practice, a LinearRegression object takes the arrays X and y in its fit method and stores the coefficients ‘w’ of the linear model in its coef_ attribute.
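As a minimal sketch of that workflow, the example below fits scikit-learn's `LinearRegression` to a tiny synthetic data set. The data is an assumption made for illustration: the targets are generated without noise from a known intercept and known coefficients, so the fit should recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration):
# y = 1 + 2*x1 + 3*x2
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

reg = LinearRegression()
reg.fit(X, y)

print(reg.coef_)       # estimated coefficients w_1..w_p
print(reg.intercept_)  # estimated intercept w_0
```

Because the targets lie exactly on a plane, the least squares solution matches the generating values up to floating point error.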

According to one of our contributing sources, “The coefficient estimates for Ordinary Least Squares rely on the independence of the features. When features are correlated and the columns of the design matrix X have an approximate linear dependence, the design matrix becomes close to singular and as a result, the least-squares estimate becomes highly sensitive to random errors in the observed target, producing a large variance. This situation of multicollinearity can arise, for example, when data are collected without an experimental design.”
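One way to see what "close to singular" means is to compare the condition number of a design matrix with independent columns against one with nearly duplicated columns. The sketch below uses randomly generated, purely illustrative data; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

x1 = rng.normal(size=n)
x_indep = rng.normal(size=n)                  # unrelated to x1
x_collinear = x1 + 1e-6 * rng.normal(size=n)  # almost an exact copy of x1

X_indep = np.column_stack([x1, x_indep])
X_coll = np.column_stack([x1, x_collinear])

# A large condition number signals a nearly singular design matrix,
# which makes the least-squares solution sensitive to noise in y.
cond_indep = np.linalg.cond(X_indep)
cond_coll = np.linalg.cond(X_coll)
print(cond_indep, cond_coll)
```

The nearly collinear matrix has a condition number that is orders of magnitude larger, which is the numerical symptom of the multicollinearity described in the quote.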

##### Regularized Linear Models

A problem we ran into in the past when tuning machine learning models against performance measures was the potential for overfitting or underfitting the data. In previous discussions, we saw that one effective means of reducing overfitting is to reduce the degrees of freedom of the model. For example, when working with data that exhibits polynomial tendencies, we can reduce overfitting by lowering the degree of the polynomial model. This, however, represents only one means of constraining a machine learning model. For a linear model, regularization is typically achieved by constraining the weights of the model. Here, we investigate ridge regression.

##### An Overview of Ridge Regression

The following source linked here provides keen insight into the fundamental features of the ridge regression algorithm. Ridge regression is a regularized version of linear regression. This means that a regularization term is added to the cost function. Including this term forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible. Note, however, that the regularization term should only be added to the cost function during training. Once the model is trained, its performance should be evaluated with the unregularized performance measure.

Furthermore, it is quite common for the cost function used during training to differ from the performance measure used for testing. One reason is regularization, as just described: the penalty on the weights belongs to training only. Another reason is that a good training cost function should lend itself well to optimization, while the performance measure used for testing should be as close as possible to the final objective. For example, we may train a classifier using a log loss cost function, but then evaluate it on novel data using precision/recall performance metrics.

## The Ridge Regression Model

##### Mathematical Features

According to this well produced study of regression in machine learning, ridge regression imposes a penalty on the size of the coefficients. The ridge coefficients therefore minimize a penalized residual sum of squares:

\min_w||\bold{X}w-y||^2_2+\alpha||w||^2_2

Here, alpha is a non-negative hyperparameter that specifies how strongly the model should be regularized. In other words, as a complexity parameter, alpha controls the amount of shrinkage in the model. If alpha is 0, ridge regression reduces to ordinary linear regression. The larger the value of alpha, the greater the shrinkage. When alpha is very large, all of the parameter weights end up very close to zero, and the result is a flat line passing through the mean of the data.
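The effect of alpha can be observed directly with scikit-learn's `Ridge` estimator. The following is a minimal sketch on randomly generated data; the data, the generating coefficients, and the list of alpha values are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)

# Synthetic data (illustrative only): 50 samples, 3 features, small noise.
X = rng.normal(size=(50, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

alphas = [0.01, 1.0, 100.0, 1e6]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # Track the l2 norm of the coefficient vector for each alpha.
    norms.append(np.linalg.norm(model.coef_))
    print(alpha, model.coef_, model.intercept_)

# 'model' now holds the fit for the largest alpha.
largest_alpha_intercept = model.intercept_
```

As alpha grows, the norm of the coefficient vector shrinks toward zero, while the unpenalized intercept approaches the mean of `y`, which is the flat-line behavior described above.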

##### In Accordance With the Cost Function

If you’ve been following along with our series using this source as we advise, then you are aware that almost all of our machine learning models rely on some type of cost function. With ridge regression, we must keep the parameter weights small while also minimizing the cost function. In particular, we have often employed the mean squared error (MSE) cost function. With MSE, the ridge regression cost function can be written as follows:

J(\theta)=\text{MSE}(\theta)+\alpha \frac{1}{2} \sum_{i=1}^n\theta^2_i

In this case, θ0 plays the same role as the weight w0 above: it is the bias term (intercept), and it is not regularized in ridge regression. This is why the sum starts at i=1 rather than i=0. If ‘w’ is the vector of feature weights (θ1,…,θn), which excludes θ0, then the regularization term equals α · ½ (‖w‖₂)², where ‖w‖₂ is the ℓ2 norm of the weight vector.
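The cost function above translates almost line for line into NumPy. This is a minimal sketch: the function name `ridge_cost` and the toy data are hypothetical, and it assumes that `theta[0]` holds the bias term and that `X` does not contain a bias column.

```python
import numpy as np

def ridge_cost(theta, X, y, alpha):
    """J(theta) = MSE(theta) + alpha * 1/2 * sum_{i>=1} theta_i^2.

    Assumes theta[0] is the bias term and X has no bias column.
    """
    predictions = theta[0] + X @ theta[1:]
    mse = np.mean((predictions - y) ** 2)
    # The penalty skips theta[0], so the bias is not regularized.
    penalty = 0.5 * alpha * np.sum(theta[1:] ** 2)
    return mse + penalty

# Toy data (illustrative only): y = 2 * x exactly.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.array([0.0, 2.0])  # bias 0, slope 2: a perfect fit

cost_unregularized = ridge_cost(theta, X, y, alpha=0.0)
cost_regularized = ridge_cost(theta, X, y, alpha=1.0)
print(cost_unregularized, cost_regularized)
```

With a perfect fit the MSE is zero, so the second value consists entirely of the penalty, 0.5 · 1 · 2² = 2.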

##### Closed Form Ridge Regression Models

Recall from our previous discussions that some machine learning models can be solved with a closed-form solution instead of an iterative method such as stochastic gradient descent. An alternative means of performing ridge regression is therefore to compute the closed-form solution. As presented in the following book, the closed-form ridge regression solution is:

\hat{\theta}=(\bold{X}^T\cdot \bold{X}+\alpha \bold{A})^{-1} \cdot \bold{X}^T \cdot \bold{y}

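Here, the remaining symbols carry the same meaning as in the earlier formulas, with X assumed to include a leading column of ones for the bias term. The matrix A is the (n+1) × (n+1) identity matrix, except with a 0 in the top-left cell, which corresponds to the bias term and keeps it unregularized.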
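The closed-form expression can be sketched in a few lines of NumPy. The example below makes several assumptions for illustration: the data is randomly generated, X is given a leading column of ones so that the first parameter is the bias, and A is the identity matrix with its top-left entry set to 0 so the bias is not penalized (consistent with the cost function, whose sum starts at i=1). It also uses `np.linalg.solve` instead of forming the inverse explicitly, which is a numerical convenience rather than a change to the formula.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): m samples, n features.
m, n = 40, 2
X = rng.normal(size=(m, n))
y = 3.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=m)

alpha = 1.0

# Prepend a column of ones so theta[0] acts as the bias term.
X_b = np.column_stack([np.ones(m), X])

# Identity matrix with a 0 in the top-left cell: the bias is not penalized.
A = np.identity(n + 1)
A[0, 0] = 0.0

# Solve (X^T X + alpha * A) theta = X^T y rather than inverting the matrix.
theta_hat = np.linalg.solve(X_b.T @ X_b + alpha * A, X_b.T @ y)
print(theta_hat)

# At the solution, the gradient of ||X theta - y||^2 + alpha * theta^T A theta
# (up to a factor of 2) should vanish.
gradient = X_b.T @ (X_b @ theta_hat - y) + alpha * (A @ theta_hat)
```

Checking that the gradient is numerically zero is a quick way to confirm that the computed vector really is the minimizer of the penalized least squares objective.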

## Ridge Classification

While this article focuses primarily on ridge regression, we take a moment to describe ridge classification, since it is also part of our primary source, Probabilistic Perspective of Machine Learning.

The Ridge regressor has a classifier variant: RidgeClassifier. This classifier first converts binary targets to {-1, 1} and then treats the problem as a regression task, optimizing the same objective as above. The predicted class corresponds to the sign of the regressor’s prediction. For multiclass classification, the problem is treated as multi-output regression, and the predicted class corresponds to the output with the highest value.

It might seem questionable to use a (penalized) Least Squares loss to fit a classification model instead of the more traditional logistic or hinge losses. However in practice all those models can lead to similar cross-validation scores in terms of accuracy or precision/recall, while the penalized least squares loss used by the RidgeClassifier allows for a very different choice of the numerical solvers with distinct computational performance profiles.
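The classifier variant described above can be tried with a minimal sketch like the following. The tiny two-class data set is hypothetical and deliberately well separated; `decision_function` exposes the underlying regression output so it can be compared with the predicted labels.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier

# Hypothetical, well-separated binary data set (illustrative only).
X = np.array([[-2.0, -1.0], [-1.0, -1.5], [-1.5, -2.0],
              [1.0, 1.5], [2.0, 1.0], [1.5, 2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = RidgeClassifier(alpha=1.0)
clf.fit(X, y)

scores = clf.decision_function(X)  # output of the underlying ridge regression
pred = clf.predict(X)              # class picked from the sign of each score
print(scores)
print(pred)
```

In the binary case, a positive score maps to the second class and a non-positive score to the first, which mirrors the sign rule described in the text.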