## Introduction to the Correlation Coefficient

Our discussion in this series on data modeling focuses on statistical tools for representing biological data. The first articles introduced the topic, addressing the uses of statistics and offering some insight into the concept of sampling. Following this, we surveyed the different types of models that populate this series. In a more recent article, we covered the computational side of statistics through a discussion of descriptive statistics, and our most recent article discussed in detail the methods for developing estimates from biological data. The logical next step is an examination of correlation. The correlation coefficient summarizes in a single number how strongly two variables are linearly related, and with it a researcher may better understand the relationship between those variables.

## The Scatter Plot

There’s a time and place for univariate analysis and modeling, but working with two variables at once brings a greater degree of sophistication. We can thank Sir Francis Galton and his great insights for developing this feature of statistics. In particular, Galton endeavored to relate the height of a father to the height of his son. He displayed the relationship between these variables with a now widely known figure, the scatter plot. Along one dimension of the plot, Galton represented the height of the father, while the other dimension represented the son’s height. Galton then plotted one point for each father-son pair. You can take a look at Galton’s scatter plot below:

*Discussion of Galton’s Plot*

In Galton’s plot, the father’s height is represented in the x-dimension, while the son’s height occupies the y-dimension. Each point on the plot is therefore a coordinate of the form (x,y), which we may read as (‘Father’s Height’, ‘Son’s Height’).

Of primary importance to the data model is the oval-shaped cluster where the majority of the data points are plotted. Through this cloud, we can observe a line that cuts across the data set. This line is defined by the equation y=x, and makes a 45˚ angle with the x and y axes. Because the ‘x’ variable equals the ‘y’ variable everywhere on this line, it represents sons who are exactly as tall as their fathers: if the father is 72 inches tall, the line predicts that the son is 72 inches tall as well. In the scatter plot above, we see that there is a lot of variation around this line. We might say that the variables do not associate closely as a result. However, we can observe that the relationship is a positive association: as the father’s height increases, so too does the height of the son.

## Introducing the Correlation Coefficient

The data in the scatter plot above occupies an oval-shaped cloud. At the center of this oval lies a point called the point of averages. Its coordinates are the average of the x-values and the average of the y-values. Because the oval-shaped cloud represents the data in two dimensions, we can examine the standard deviation in both the x and y-dimensions. These parameters, the point of averages, the standard deviation in the x-dimension, and the standard deviation in the y-dimension, allow us to describe the center and spread of the data. Such components are insightful, yet we still lack a means of measuring the association between the two variables.
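As a minimal sketch of these three summaries, the snippet below computes the point of averages and the two standard deviations with NumPy. The `father` and `son` arrays are hypothetical values made up for illustration; they are not Galton’s measurements.

```python
import numpy as np

# Hypothetical father/son heights in inches (illustrative only, not Galton's data).
father = np.array([65.0, 67.0, 68.0, 70.0, 72.0, 74.0])
son = np.array([66.0, 68.0, 67.0, 71.0, 70.0, 73.0])

# Point of averages: (average of the x-values, average of the y-values).
point_of_averages = (father.mean(), son.mean())

# Spread in each dimension. np.std defaults to the population SD (ddof=0).
sd_x = father.std()
sd_y = son.std()

print("point of averages:", point_of_averages)
print("SD of x:", sd_x, "SD of y:", sd_y)
```

Note that none of these numbers says anything yet about how tightly the points follow a line; two clouds can share the same center and spreads and still differ in association.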

In the two graphs, both data sets possess the same line of best fit. However, the proximity of individual data points to the line varies appreciably. In the graph on the left, the data points lie much closer to the line, which allows us to say that the association of the data to the line is much stronger. This strength of association is what the correlation coefficient, often written as the variable ‘r’, measures.

*Mathematical Insights*

The correlation coefficient can take on any value from -1 to 1. With a perfect correlation, wherein the correlation coefficient equals 1 or -1, every point lies exactly on a line: a rising line when r is 1 and a falling line when r is -1. In this case, there exists a perfect linear relationship between the two variables. Perfect correlations, in both the positive and negative sense, tend not to manifest in nature. Rather, it is most common to observe a value that lies strictly between -1 and 1. The closer the correlation lies to either extreme, the stronger the association. At a correlation coefficient of zero, there is no linear association between the two variables.
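A short numerical check can make these boundary cases concrete. The sketch below uses NumPy’s `np.corrcoef` on small made-up data sets: points on a rising line, points on a falling line, and a symmetric curve. The last case is a caveat worth seeing once: a coefficient of zero rules out a linear trend, not every kind of dependence.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_up = np.corrcoef(x, 2 * x + 1)[0, 1]      # every point on a rising line
r_down = np.corrcoef(x, -3 * x + 10)[0, 1]  # every point on a falling line

# A symmetric curve: y is fully determined by x, yet r is (numerically) zero,
# because r only measures linear association.
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_curve = np.corrcoef(x_sym, x_sym ** 2)[0, 1]

print("rising line:", r_up)
print("falling line:", r_down)
print("symmetric curve:", r_curve)
```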

## The Standard Deviation Line

In the data set, data points tend to cluster around a line known as the standard deviation line. The standard deviation line passes through the point of averages and through every point that lies the same number of standard deviations from the average in both variables. For example, when the association is positive, the point one standard deviation above the average in the x-dimension and one standard deviation above the average in the y-dimension lies on this line.
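The standard deviation line can be sketched in a few lines of code. The example below assumes a positive association, so the line rises one standard deviation of y for each standard deviation of x. The data arrays and the helper name `sd_line` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical father/son heights in inches (illustrative only).
father = np.array([65.0, 67.0, 68.0, 70.0, 72.0, 74.0])
son = np.array([66.0, 68.0, 67.0, 71.0, 70.0, 73.0])

mean_x, mean_y = father.mean(), son.mean()
sd_x, sd_y = father.std(), son.std()  # population SDs (ddof=0)

# Assumption: positive association, so the slope is +SD_y / SD_x.
slope = sd_y / sd_x

def sd_line(x):
    """Height of the SD line at x; it passes through the point of averages."""
    return mean_y + slope * (x - mean_x)

# One SD above average in x lands one SD above average in y.
print(sd_line(mean_x + sd_x), mean_y + sd_y)
```

For a negative association, the same construction would use a slope of -SD_y / SD_x instead.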

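When computing the correlation coefficient directly from the data, the first order of business is to convert the values into standard units. Taking the product of each pair of standardized values and then averaging those products gives the correlation coefficient. We may say that:

$$
r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{SD_x} \right) \left( \frac{y_i - \bar{y}}{SD_y} \right)
$$

Here $n$ is the number of (x,y) pairs, $\bar{x}$ and $\bar{y}$ are the averages of the x-values and y-values, and $SD_x$ and $SD_y$ are their standard deviations.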

It may be of some use to consider this in a step-by-step procedure:

- Convert the x-values into standard units. To convert to standard units, subtract the average of the x-values from each x-value and divide by the standard deviation of the x-values. Execute this procedure for each x-value.
- In the same manner, convert the y-values into standard units. Refer to step one for the procedure.
- For each (x,y) pair, multiply the x-value in standard units by the y-value in standard units.
- Take the average of the products.

This procedure yields the correlation coefficient of the data set.
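The four steps translate almost line for line into code. The following is a minimal sketch, not a definitive implementation: the function name `correlation` and the sample heights are hypothetical, and the standard deviation is assumed to be the population form (dividing by n), which is what makes the plain average of the products equal to r.

```python
import numpy as np

def correlation(x, y):
    """Correlation coefficient via the average product of standard units."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Steps 1 and 2: convert each variable to standard units.
    # np.std defaults to the population SD (ddof=0).
    x_su = (x - x.mean()) / x.std()
    y_su = (y - y.mean()) / y.std()

    # Step 3: take the product of each pair of standardized values.
    products = x_su * y_su

    # Step 4: take the average of the products.
    return products.mean()

# Hypothetical father/son heights in inches (illustrative only).
father = [65.0, 67.0, 68.0, 70.0, 72.0, 74.0]
son = [66.0, 68.0, 67.0, 71.0, 70.0, 73.0]

r = correlation(father, son)
print("step-by-step r:", r)
print("np.corrcoef r: ", np.corrcoef(father, son)[0, 1])
```

The two printed values should agree up to floating-point rounding, which serves as a quick sanity check on the hand-rolled version.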

## The Take Away From Correlation

The present article has elaborated on the significance, derivation, and computation of the correlation coefficient. From our investigation, we have found that the correlation coefficient is a purely mathematical construct that expresses the strength of the linear association between two variables. Hopefully this will serve as a solid introduction to the nuances of correlation. Our subsequent article will focus on some of the deeper facets of this statistic. These include the scalability of the correlation coefficient, the consequences of altering the standard deviation, and several examples in nature. Until then, if you would like an opportunity to review the correlation coefficient from the perspective of alternative resources, consider checking out this article. Otherwise, we hope to see you at our next article, or perhaps indulging in some of the other content of this statistical series, found below.

Topics of Data Modeling Series: