## Prologue

Before we begin, take a second. Open this link in a new tab. Did you do it? I’ll wait. Oh, you opened it? Great. Sorry for the forwardness, but this book is one of the greatest tools I have used in taking myself from a beginner in programming to a full time data scientist. Not only is this book an exceptional resource for bolstering your skills and making yourself a more marketable and employable programmer, but it also revolves around the subject which this article focuses on: regression. I bring this resource to your attention because I have struck a deal and managed to make it 50% off (a lot less than I had to pay for it, let me tell you). At the very least, just take a look at it, read a preview of it, or something. You have my word that this tool has great potential for making you a titan in the coding community.

## Introduction to Mathematical Regression

For quite some time now in our machine learning series, we have examined various aspects of regression algorithms. Our first look at the concept came in our initial article on supervised machine learning models, which may be found here. After exploring a variety of machine learning models and working through several examples, we investigated regression in detail as it pertains to these models. We began with a succinct overview of linear regression models in machine learning. We then covered the fitting algorithms in depth, beginning with gradient descent, followed by batch gradient descent and stochastic gradient descent. In our most recent article, we discussed the features of polynomial regression. In this article, we focus specifically on the mathematics of regression as it may be applied in machine learning.

## Before We Get Started

I want to let you all in on a secret before we proceed with the modeling of machine learning derived data. One of the trickiest parts of working in machine learning is that you sometimes need to create your own programming language. While we plan on coming out with a series on this in the future, I myself learned from the folks at ProLang. They offer an awesome book that goes into the details of creating language features that best support machine learning. Follow the link provided here. This will take you to ProLang where you can sign up and get their book. Getting this book was the best decision I ever made, and it drastically improved my coding experience. All you need to do is click the link and order a copy of the book. They even let you get it for free. Fill out an order and begin creating your own sophisticated machine learning models today.

## Conceptualizing Regression

In the past, we examined the use of correlation as a measure of how strongly individual data points adhered to a line of best fit for the data. However, regression differs from this topic in that regression specifically describes how two variables relate to each other. Consider the regression model we explore below:

This particular example comes from a book known as Regression Analysis With R. This book is one of the leaders in the field explaining how regression can be included in machine learning and data science models to obtain some statistical insight into the data at hand. It is by far one of the best resources I have used. I have been able to get it to you for almost 50% off, so if you’d like to get an overview of the book, check it out here. With great confidence I can say that furthering your academic pursuits with this resource will greatly increase your learning and bolster the level of sophistication in your models.

The data here compare the height of an individual (on the x-axis) with their weight (on the y-axis). The dashed line is the standard deviation (SD) line, along which one standard deviation in the x-dimension corresponds to one standard deviation in the y-dimension. The correlation for these data is r ≈ 0.5, so there is quite a bit of scatter about this line.

The solid line is the regression line for the data, calculated from the weights observed at particular values of x. We will get into how this is done later on. For now, take note of the vertical strip centered on x = 73 inches ± 0.5 inches. Individuals who fall within this strip are at an extreme of the data with respect to height. Even so, their weights vary considerably, as the broad vertical spread of the points shows. Most of the individuals within this strip fall below the SD line, which is to say that at the extremes the SD line over-estimates weight based upon height. With the regression line, however, the data points are distributed much more evenly on either side of the line. Why is this?

The line of regression represents the average value of ‘y’ for a particular value of ‘x’, whereas the SD line is really a prediction based upon the entirety of the data set. The SD line is always linear, but the regression line here appears to be linear simply due to the fact that x and y are related to each other linearly. In a more mathematical definition of regression, we might say that the regression model for ‘y’ is a continuum of estimates based on the average value of ‘y’ for each value of ‘x’.

Along the SD line, for every SD traversed in the x-dimension, the line also traverses one SD in the y-dimension. However, the correlation of the data is only r = 0.5. Therefore, for every SD traversed in the x-dimension, the regression line traverses only 0.5 SD in the y-dimension. In this fashion, the regression method uses the correlation coefficient to predict the value of the dependent variable from the average value of the dependent variable at that particular value of ‘x’. More mathematically speaking, for every SD traversed in the x-dimension, the regression line traverses ‘r’ SDs in the y-dimension, so its slope is r × (SD of y) / (SD of x).

We may consider several unique reasons which explain why the regression method relies on the correlation coefficient. Firstly, when ‘r’ is zero, there is no association between ‘x’ and ‘y’, and therefore, for every SD in the ‘x’ dimension, there is a zero-SD increase in the y-dimension, which yields a horizontal line because r=0. Now, if ‘r’ is one, all of the points lie on the line because a correlation coefficient of one reflects perfect correlation. Furthermore, for every SD traversed in the x-dimension, there will also be a one SD traversal in the y-dimension. Finally, the same is true for a situation wherein the correlation coefficient is negative 1. The only difference is that the line has a negative slope and for every SD traversed in the x-dimension, there is a negative one SD traversal in the y-dimension.
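The relationship between the two lines can be checked numerically. The sketch below is a minimal illustration, not the book's own example: the height and weight values are synthetic and were chosen only so that the correlation lands near 0.5, and the function names `sd_line` and `regression_line` are made up for this article. Both lines pass through the point of averages; they differ only in slope.

```python
import random
import statistics

random.seed(0)

# Hypothetical data (not from the figure): heights in inches, weights in
# pounds, generated so that the correlation comes out near 0.5.
heights = [random.gauss(68, 3) for _ in range(2000)]
weights = [160 + 5 * (h - 68) + random.gauss(0, 26) for h in heights]

mean_x = statistics.fmean(heights)
mean_y = statistics.fmean(weights)
sd_x = statistics.pstdev(heights)
sd_y = statistics.pstdev(weights)

# Pearson correlation: the average product of the variables in standard units.
r = statistics.fmean(
    [((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
     for x, y in zip(heights, weights)]
)

sd_line_slope = sd_y / sd_x          # one SD of y for every SD of x
regression_slope = r * sd_y / sd_x   # only r SDs of y for every SD of x


def sd_line(x):
    """Weight read off the SD line at height x."""
    return mean_y + sd_line_slope * (x - mean_x)


def regression_line(x):
    """Weight read off the regression line at height x."""
    return mean_y + regression_slope * (x - mean_x)


tall = 73  # a height well above average, as in the vertical strip
print(f"r = {r:.2f}")
print(f"SD line at {tall} in:         {sd_line(tall):.1f} lb")
print(f"regression line at {tall} in: {regression_line(tall):.1f} lb")
```

For a height above the average, the SD line gives the larger weight, which matches the over-estimation described for the vertical strip.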

## Exploring the Graph of Averages

The graph of averages represents what the regression line is, in theory, supposed to be: the average value in the y-dimension for each x-value. The graph of averages has a bit more variability, and the regression line is simply a line of best fit for this graph. Take a look at the figure below which represents the graph of averages:

This graph of averages was made from the height and weight example used earlier. Note that there are far fewer data points, only one point for every primary value of ‘x’. We also see much more variability at the extremes, which is what we would expect. The averages there are based on small samples, because only a few individuals have heights in these ranges, and averages of small samples are subject to extreme variation. The regression line here represents a line of best fit for the estimated values of ‘y’ based upon their average values for each value of ‘x’.

Sometimes, a graph of averages is used to show a pattern between the *y* and *x* variables. In a graph of averages, the *x*-axis is divided up into intervals. The averages of the *y* values in those intervals are plotted against the midpoints of the intervals. If we needed to summarize the *y* values whose *x* values fall in a certain interval, the point plotted on the graph of averages would be good to use.

The points on a graph of averages do not usually line up in a straight line, making it different from the least-squares regression line. The graph of averages plots a *typical* *y* value in each interval: some of the points fall above the least-squares regression line, and some of the points fall below that line.
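The construction described above can be sketched in a few lines. This is an illustrative sketch only: the data are synthetic, and the helper name `graph_of_averages` and the one-inch interval width are assumptions made for the example. Each returned point is the midpoint of an x-interval, the average of the y values in that interval, and the number of observations behind that average.

```python
import math
import random
import statistics
from collections import defaultdict

random.seed(1)

# Hypothetical height (inches) and weight (pounds) data for illustration.
heights = [random.gauss(68, 3) for _ in range(2000)]
weights = [160 + 5 * (h - 68) + random.gauss(0, 26) for h in heights]


def graph_of_averages(xs, ys, width=1.0):
    """Return (interval midpoint, mean of y, count) for each x-interval."""
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        # Interval k covers [k * width, (k + 1) * width).
        bins[math.floor(x / width)].append(y)
    return [
        ((k + 0.5) * width, statistics.fmean(values), len(values))
        for k, values in sorted(bins.items())
    ]


points = graph_of_averages(heights, weights)
for midpoint, average, count in points:
    print(f"{midpoint:5.1f} in  average weight {average:6.1f} lb  (n = {count})")
```

The counts in the outermost intervals are small, which is why the averages at the extremes jump around more than those near the middle.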

## Regression Fallacy

When using regression to understand a statistical model, there is one particular pitfall that many people fall into. When the same sample is measured on two occasions, individuals at the lower extreme of the first distribution tend to show improved performance on the second, while individuals at the upper extreme tend to show decreased performance. This happens because an extreme result on the first occasion is partly due to chance and is therefore difficult to replicate, and this applies to the upper and lower extremes alike. This is called the regression effect. Attributing real significance to this change in behavior, rather than recognizing it as the regression effect, is known as the regression fallacy, and one must be vigilant to avoid this mistake.
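A small simulation makes the regression effect concrete. The setup below is an assumption chosen for illustration: each person has a fixed underlying skill, and each test score is that skill plus independent chance error, which gives a test-retest correlation of about 0.5. Nothing is done to anyone between the two tests, yet the extreme groups still move toward the average.

```python
import random
import statistics

random.seed(2)

n = 5000
# Hypothetical model: score = stable skill + independent chance error.
skill = [random.gauss(0, 1) for _ in range(n)]
test1 = [s + random.gauss(0, 1) for s in skill]
test2 = [s + random.gauss(0, 1) for s in skill]

# Rank people by their first score and pick out the two extreme groups.
order = sorted(range(n), key=lambda i: test1[i])
bottom = order[: n // 10]    # lowest 10% on the first test
top = order[-(n // 10):]     # highest 10% on the first test


def mean_of(scores, indices):
    """Average score of the people identified by indices."""
    return statistics.fmean([scores[i] for i in indices])


top_first, top_second = mean_of(test1, top), mean_of(test2, top)
bottom_first, bottom_second = mean_of(test1, bottom), mean_of(test2, bottom)
overall_change = statistics.fmean(test2) - statistics.fmean(test1)

print(f"top 10%:    first {top_first:.2f}, second {top_second:.2f}")
print(f"bottom 10%: first {bottom_first:.2f}, second {bottom_second:.2f}")
print(f"average change for everyone: {overall_change:.2f}")
```

The top group appears to get worse and the bottom group appears to improve, while the sample as a whole barely changes. Reading a cause into those shifts would be the fallacy.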

## Regression For Interpolation and Extrapolation

Regression models predict a value of the *Y* variable given known values of the *X* variables. This feature was discussed thoroughly from the beginning of our article. Here, however, we take a moment to understand the conceptualization of regression from the perspective of extrapolation and interpolation.

According to the following resource, “Interpolation refers to the endeavor of accurate prediction *within* the range of values in the dataset used for model-fitting. Alternatively, extrapolation represents the prediction *outside* this range of the data.” Performing extrapolation relies strongly on the regression assumptions. The further the extrapolation goes outside the data, the more room there is for the model to fail due to differences between the assumptions and the sample data or the true values.

When extrapolating, one should accompany the estimated value of the dependent variable with a prediction interval that represents the uncertainty. Such intervals tend to expand rapidly as the values of the independent variable(s) move outside the range covered by the observed data. For this reason and others, some say that it may be unwise to extrapolate at all.
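The widening can be seen from the standard error of a new prediction in simple linear regression, s × sqrt(1 + 1/n + (x0 − mean of x)² / Sxx), where s is the residual standard deviation and Sxx is the sum of squared deviations of x. The sketch below uses synthetic data and made-up names (`prediction_se`, `prediction_interval`). As a simplifying assumption it uses a multiplier of 2 for a rough 95% interval in place of the exact t quantile.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical observations: x is only observed between 60 and 76.
xs = [random.uniform(60, 76) for _ in range(50)]
ys = [-180 + 5 * x + random.gauss(0, 20) for x in xs]

n = len(xs)
mean_x = statistics.fmean(xs)
mean_y = statistics.fmean(ys)
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares fit of y = intercept + slope * x.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation, with n - 2 degrees of freedom.
residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(residual_ss / (n - 2))


def prediction_se(x0):
    """Standard error of a single new observation predicted at x0."""
    return s * math.sqrt(1 + 1 / n + (x0 - mean_x) ** 2 / sxx)


def prediction_interval(x0, multiplier=2.0):
    """Rough interval: fitted value plus or minus multiplier * standard error."""
    center = intercept + slope * x0
    half_width = multiplier * prediction_se(x0)
    return center - half_width, center + half_width


for x0 in (68, 76, 90, 120):
    low, high = prediction_interval(x0)
    print(f"x = {x0:3d}: interval width {high - low:.1f}")
```

The interval is narrowest near the mean of the observed x values and grows as x0 moves away from them. Note that this calculation still assumes the straight-line form holds outside the data.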

However, this does not cover the full set of modeling errors that may be made: in particular, the assumption of a particular form for the relation between *Y* and *X*. A properly conducted regression analysis will include an assessment of how well the assumed form is matched by the observed data, but it can only do so within the range of values of the independent variables actually available. This means that any extrapolation is particularly reliant on the assumptions being made about the structural form of the regression relationship.

Best-practice advice here is that a linear-in-variables and linear-in-parameters relationship should not be chosen simply for computational convenience, but that all available knowledge should be deployed in constructing a regression model. If this knowledge includes the fact that the dependent variable cannot go outside a certain range of values, this can be made use of in selecting the model – even if the observed dataset has no values particularly near such bounds. The implications of this step of choosing an appropriate functional form for the regression can be great when extrapolation is considered. At a minimum, it can ensure that any extrapolation arising from a fitted model is “realistic” (or in accord with what is known).
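As an illustration of how the functional form affects extrapolation, suppose the dependent variable is a proportion, so it must stay between 0 and 1. The sketch below compares a straight-line fit with a fit that is linear on the logit scale. The logit transform is one possible choice made for this example, not a method prescribed by the source, and the data and helper names (`fit_line`, `linear_prediction`, `bounded_prediction`) are hypothetical.

```python
import math
import random
import statistics

random.seed(4)


def logistic(z):
    """Map any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


def logit(p):
    """Inverse of logistic: map a proportion in (0, 1) to the real line."""
    return math.log(p / (1 - p))


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical proportions observed for x between 0 and 3; none of the
# observed values is close to the bounds 0 and 1.
xs = [i * 0.1 for i in range(31)]
ys = [min(0.99, max(0.01, logistic(-1 + x) + random.gauss(0, 0.02)))
      for x in xs]

a_lin, b_lin = fit_line(xs, ys)                          # straight line in y
a_logit, b_logit = fit_line(xs, [logit(y) for y in ys])  # line in logit(y)


def linear_prediction(x):
    return a_lin + b_lin * x


def bounded_prediction(x):
    return logistic(a_logit + b_logit * x)


for x in (1.5, 10, -10):
    print(f"x = {x:5.1f}: linear {linear_prediction(x):6.2f}, "
          f"bounded {bounded_prediction(x):.4f}")
```

Inside the observed range the two fits are close. Far outside it, the straight line produces proportions above 1 or below 0, while the bounded form cannot leave the permitted range.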

If you would like to explore the role of interpolation and extrapolation in regression-based machine learning models in greater detail, check out this book. It goes into great depth on the subject, and goes to great lengths to explain the code it presents.

## Improving Your Models

The textbooks below were essential in my development as a programmer. I used all four of these throughout my career and recommend each of them to you. Knowledge is by far the greatest asset you can afford yourself, and the content in these books grants you this, and more. Check them out:

## The Take Away

Regression is one of the most useful means of demonstrating a statistical relationship between different variables in a data set. This article has covered the theory underlying regression, as well as the mathematics that supports its use. We also took a moment to interpret graphs which have regression as their primary component. We then used this knowledge to understand the significance of the regression fallacy, as well as the utility of interpolation and extrapolation based on regression. Hopefully this article has been of some use to you, and I look forward to seeing you in our next discussion. Nevertheless, if you’d like to examine regression analysis in greater detail, check out this resource, which has exceptionally improved my machine learning models.