Descriptive Statistics with Biological Data: Topics of Machine Learning

Introduction to Descriptive Statistics with Biological Data

Our series on statistical data modeling has emerged from its introductory era into something with a bit more substance. Our first articles offered an initial foray into the topic, addressing the uses of statistics and the concept of sampling. Following this, we surveyed the general array of models that populate this series, including frequency distributions, histograms, and the variables we encounter when modeling biological data. Now we turn to descriptive statistics, concepts that are essential to drawing insights about a larger population. When studying biological data, our end goal is typically to glean insights into some aspect of a population under investigation. Descriptive statistics serve as the primary tool by which we extract these insights from the numbers. Let us begin.

Descriptive Statistics with Biological Data

Descriptive statistics is the part of statistical modeling that quantifies features of a frequency distribution. While frequency distributions are built from numbers, they serve a largely qualitative function by allowing visual interpretation of a population. Descriptive statistics give a quantitative interpretation to this visual perspective.

Descriptive statistics primarily measure two features of a frequency distribution: its location and its spread. The location is summarized by a measure such as the mean or the median, while the spread reflects the variation among individual measurements. Additionally, proportions are a useful description: they give the fraction of measurements that fall into a particular category. All of these are fundamental concepts in descriptive statistics.

Conceptualizing Arithmetic Mean

Sample Mean

The mean of a sample is the average value of a variable across the set of measurements in that sample. It is computed as the sum of all values divided by the number of values.

Computationally, the sample mean can be modeled as:

\bar{x}\enspace =\enspace \frac{\displaystyle\sum_{i=1}^nx_i}{n}

Let’s break down what this formula represents. The x-bar symbol on the left side of the ‘equals’ sign represents the average of ‘x’. The Greek sigma symbol denotes a sum, where the ‘n’ superscript is the number of observations in the sample and ‘i=1’ indicates that the sum starts at the first observation, so every value of ‘x’ is included. The symbol ‘x_i’ represents the ith value in the sample ‘x’. Thus, the numerator is the sum of all values in the sample ‘x’. This is placed over the denominator ‘n’, the number of individuals in the sample. Applying this formula gives the mean of a sample.
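
To make the formula concrete, here is a minimal Python sketch that sums the values and divides by their count. The function name sample_mean and the body-length measurements are hypothetical, chosen only for illustration.

```python
from statistics import mean


def sample_mean(values):
    """Return the sum of the values divided by the number of values."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    return sum(values) / n


# Hypothetical sample: body lengths (mm) of five individuals.
lengths_mm = [12.0, 15.0, 11.0, 14.0, 13.0]

x_bar = sample_mean(lengths_mm)
print(x_bar)             # 13.0
print(mean(lengths_mm))  # the standard library gives the same answer
```

In practice the standard library's statistics.mean does the same job; the hand-written version is shown only to mirror the formula term by term.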

Conceptualizing Spread with Variance and Standard Deviation

The concepts of variance and standard deviation are staples for quantifying and interpreting the spread of a given sample. Standard deviation measures how far from the mean the measurements tend to be. The standard deviation is small when most measurements are close to the mean, and large when measurements tend to be far from the mean. In this way, the standard deviation gives us a good idea of whether the spread of a sample is tight or wide.

The standard deviation is generally computed from the variance, which is itself a measure of the spread of a sample. Because the variance is expressed in squared units, it is harder to interpret directly; taking its square root gives the standard deviation, which is in the same units as the measurements and so gives more direct insight into the spread.

The variance and, by extension, the standard deviation are both measures of deviation: they summarize how far individual values lie from the mean. To compute the sample variance, we take the sum of the squared deviations of every value from the sample mean and divide it by ‘n-1’. The mathematical relationship is represented as follows:

s^2=\frac{\sum(x_i-\bar{x})^2}{n-1}

The symbol ‘s^2’ is the mathematical symbol for the sample variance, the square of the standard deviation ‘s’. The numerator is the sum of the squared differences between each value ‘x_i’ in the sample and the sample mean. We divide this by ‘n-1’, which represents the degrees of freedom. If ‘n’ were used instead, the result would tend to underestimate the population variance (a negative bias), so dividing by the degrees of freedom gives a better estimate of the variance.
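
The variance formula can be checked by hand on a small sample. The sketch below follows it step by step and then takes the square root to obtain the standard deviation. The function names and the measurements are hypothetical; the comparison against statistics.variance relies on that function also dividing by n-1.

```python
import math
import statistics


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two values")
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std(values):
    """Standard deviation: the square root of the sample variance."""
    return math.sqrt(sample_variance(values))


# Hypothetical sample: body lengths (mm) of five individuals.
lengths_mm = [12.0, 15.0, 11.0, 14.0, 13.0]

s2 = sample_variance(lengths_mm)
s = sample_std(lengths_mm)
print(s2)  # 2.5 (squared deviations 1 + 4 + 4 + 1 + 0 = 10, divided by 4)
print(s)   # about 1.58

# Cross-check against the standard library, which also uses n - 1.
print(math.isclose(s2, statistics.variance(lengths_mm)))
```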

From this computation of the variance, we may also derive the standard deviation, which is the square root of the variance:

s=\sqrt{s^2}

Conceptualizing Coefficient of Variation

We have now elaborated on the concepts of mean, standard deviation, and variance. We conclude this discussion with the coefficient of variation, which serves as a bridge between the mean and the standard deviation. In particular, the coefficient of variation expresses the standard deviation as a percentage of the mean such that:

CV=100\% \enspace \times \enspace \frac{s}{\bar{x}}
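
As a rough illustration of the CV formula above, the sketch below computes it with the standard library's mean and stdev. The function name and the data are hypothetical. The second call shows one practical consequence of dividing by the mean: rescaling every measurement (for example, changing units) leaves the CV unchanged.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the mean."""
    x_bar = statistics.mean(values)
    if x_bar == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    s = statistics.stdev(values)  # sample standard deviation (n - 1)
    return 100.0 * s / x_bar


# Hypothetical sample: body lengths of five individuals, in mm and in cm.
lengths_mm = [12.0, 15.0, 11.0, 14.0, 13.0]
lengths_cm = [x / 10 for x in lengths_mm]

cv_mm = coefficient_of_variation(lengths_mm)
cv_cm = coefficient_of_variation(lengths_cm)
print(round(cv_mm, 2))             # about 12.16 (percent)
print(math.isclose(cv_mm, cv_cm))  # the unit change does not affect the CV
```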

The Median

Conceptualizing the Median

The median of a sample is the middle value of its ordered measurements: half of the individuals lie below the median while the other half lie above it.

To calculate the median, order the values from least to greatest. If the total number of individuals in a sample is odd, then the median is the middle measurement of the ordered list. Alternatively, if the number of individuals in a sample is even, then the median is the average of the middle two values. In both cases, half of the values lie below the median and half lie above it.
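
The odd and even cases can be sketched in a few lines of Python. The function name sample_median and the example values are hypothetical.

```python
def sample_median(values):
    """Middle value of the ordered sample (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle measurement exists.
        return ordered[mid]
    # Even n: average the two middle measurements.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [7, 3, 9, 1, 5]  # ordered: 1, 3, 5, 7, 9
even_sample = [7, 3, 9, 1]    # ordered: 1, 3, 7, 9

print(sample_median(odd_sample))   # 5
print(sample_median(even_sample))  # 5.0, the average of 3 and 7
```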

Conceptualizing Interquartile Range

In our previous article, we elaborated at length on the significance of quantiles and percentiles in modeling data. Quartiles are quantiles that partition the data into quarters. The first quartile is the middle value of the observations lying below the median (the 25th percentile). The second quartile is the median itself. The third quartile is the middle value of the observations lying above the median (the 75th percentile).

We can conceptualize the interquartile range as the difference between the third quartile and the first quartile. This value captures the spread of the middle 50% of observations around the median.
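
The sketch below follows the definition given here literally: it splits the ordered sample at the median and takes the median of each half. The function names and data are hypothetical, and leaving the middle value out of both halves when the sample size is odd is an assumption, since the text does not settle that detail. Library routines such as statistics.quantiles use interpolation rules that can give slightly different quartiles.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) as the medians of the lower half, whole sample, and upper half."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to form two halves")
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: when n is odd, the median itself belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower_half), median(ordered), median(upper_half)


def interquartile_range(values):
    """Third quartile minus first quartile."""
    q1, _, q3 = quartiles(values)
    return q3 - q1


# Hypothetical sample of nine measurements.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]

print(quartiles(sample))            # (4.0, 7, 10.0)
print(interquartile_range(sample))  # 6.0
```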

Conceptualizing the Box Plot

Box plots are a graphical representation of the median and interquartile range. The lower boundary of the box represents the first quartile, while the upper boundary of the box represents the third quartile. A line inside the box marks the median. Most box plots also depict whiskers that project above and below the box, which represent the extreme values above and below the median.

The essence of the box plot is that it shows where the bulk of a sample’s values lie: the box spans the middle 50% of values, and the whiskers show how far the remaining values extend.
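
The values a box plot draws can be computed without any plotting library. The sketch below returns them as a dictionary, assuming the whiskers run to the smallest and largest observations as described here; many plotting tools instead limit whisker length (commonly to 1.5 times the interquartile range) and draw points beyond that individually. The function name, dictionary keys, and data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return the five values a simple box plot draws."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: when n is odd, the median itself belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return {
        "lower_whisker": ordered[0],   # smallest observation
        "q1": median(lower_half),      # bottom edge of the box
        "median": median(ordered),     # line inside the box
        "q3": median(upper_half),      # top edge of the box
        "upper_whisker": ordered[-1],  # largest observation
    }


# Hypothetical sample of nine measurements.
sample = [2, 4, 4, 5, 7, 8, 9, 11, 12]
summary = five_number_summary(sample)

for name, value in summary.items():
    print(f"{name}: {value}")
```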

Conceptualizing Distribution Location and Spread

The presentation of these computations raises some questions for us. For example, in terms of understanding the location of the distribution, is it better to use the mean or the median? Additionally, is the standard deviation or the interquartile range a better measurement of the spread? The answer is that the choice depends on the situation at hand.

When comparing the locations of two samples, if their distributions are both symmetric, then comparing their means is particularly effective. For samples with asymmetric distributions, the median is the more informative measure, because it is less affected by extreme values.

On the other side of the coin, when it comes to quantifying the spread, we have at our disposal the standard deviation and the interquartile range. The standard deviation uses every measurement and is therefore sensitive to extreme values; it suits roughly symmetric distributions in which those values are a genuine part of the spread. The interquartile range depends only on the central half of the sample, so it is preferable when the distribution is skewed or contains outliers and we want a measure that describes the central values.
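
A small experiment helps show how these measures respond to an extreme value. The sketch below compares two hypothetical samples that differ only in their largest observation. The helper names are illustrative, and the interquartile range uses the same "median of each half" convention assumed for the quartiles.

```python
from statistics import mean, median, stdev


def iqr(values):
    """Interquartile range using the median of each half of the ordered sample."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    # Assumption: when n is odd, the median itself belongs to neither half.
    upper_half = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(upper_half) - median(lower_half)


def summarize(values):
    """Collect the two location and two spread measures discussed above."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sd": stdev(values),
        "iqr": iqr(values),
    }


# Hypothetical samples: identical except for the largest value.
typical = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
with_outlier = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 60.0]

a = summarize(typical)
b = summarize(with_outlier)

for key in a:
    print(f"{key}: {a[key]:.2f} -> {b[key]:.2f}")
```

In this example the mean and standard deviation both increase when the extreme value is introduced, while the median and interquartile range stay the same.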

Conceptualizing Proportions

The proportion of a category is the fraction of observations that fall into it. It is computed as the number of observations in the category divided by the total number of observations in the sample.
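
As a minimal sketch, the proportion of each category can be found by counting occurrences and dividing by the sample size. The function name and the genotype labels below are hypothetical.

```python
from collections import Counter


def proportions(observations):
    """Map each category to its count divided by the total number of observations."""
    total = len(observations)
    if total == 0:
        raise ValueError("the sample must contain at least one observation")
    counts = Counter(observations)
    return {category: count / total for category, count in counts.items()}


# Hypothetical sample: genotypes recorded for eight individuals.
genotypes = ["AA", "Aa", "Aa", "aa", "AA", "Aa", "Aa", "aa"]

result = proportions(genotypes)
print(result)  # {'AA': 0.25, 'Aa': 0.5, 'aa': 0.25}
```

Because every observation belongs to exactly one category, the proportions sum to 1.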

Summarizing Descriptive Statistics

  • Descriptive statistics with biological data gives quantitative perspective of mathematical relationships in a biological population
  • Descriptive statistics revolves around conceptualizing the computation of a distribution’s location and spread
  • Sample mean is a measurement of location that represents the average of the values in a sample
  • Sample median is also a measurement of location, representing the middle value of a sample
  • Variance and standard deviation are measurements of spread that rely on the sum of squared differences between each measurement and the mean, divided by the degrees of freedom
  • The interquartile range measures the spread of a distribution, computed as the difference between the first quartile and third quartile
  • The box plot is a graphical model of the median, first and third quartiles, and extreme values of a sample, showing where the bulk of its values lie
  • Proportions lend insight into the ratio of the number of values in a particular category to the total number of observations

For more insights into comprehension of descriptive statistics with biological data, consider checking out this article from Investopedia which thoroughly investigates this subject.
