## Modeling Biological Data

Statistics is a cornerstone of science, and data is the material on which statistics operates. However, numerical computations are seldom useful unless the data is modeled effectively. The purpose of data modeling is to organize data and statistical information in a succinct form, so that observing the model itself provides insight into the entity under investigation. Modeling data therefore serves a two-fold purpose: (1) rearranging information in an explanatory context, and (2) organizing data. If you haven’t had any previous experience with statistics as a subject, check out our previous article on some important introductory concepts in the field; it will be extremely helpful.

## Frequency Distributions

The previous article on biological data discusses the intricacies of frequency distributions; if you haven’t worked with them before, I refer you to that article. Briefly, a frequency describes how often a particular value appears in a sample, and a frequency distribution is a model that represents the frequencies of all the values in a given sample.

Relative frequency is the proportion of entries in a data set that have a particular value; thus, the relative frequency of a measurement is the number of entries with that value divided by the total number of entries in the sample. From this, we obtain a relative frequency distribution, which models the proportions of all the measurements in a sample. We provide an example of relative frequency below:
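As a minimal sketch of the calculation described above, the following Python snippet counts each value and divides by the sample size. The function name `relative_frequencies` and the flower-color sample are hypothetical and serve only as illustration.

```python
from collections import Counter


def relative_frequencies(values):
    """Return {value: count / total} for every distinct value in the sample."""
    counts = Counter(values)   # frequency of each value
    total = len(values)        # total number of entries in the sample
    return {value: count / total for value, count in counts.items()}


# Hypothetical sample: flower colors recorded for 10 plants.
colors = ["red", "white", "red", "pink", "red",
          "white", "pink", "red", "white", "red"]

rel = relative_frequencies(colors)
for color, fraction in rel.items():
    print(f"{color}: {fraction:.2f}")
```

The relative frequencies of all the values always sum to 1, which is a quick way to check the result.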

*Computing Frequencies*

Computing the frequencies or relative frequencies of measurements permits the creation of frequency tables and bar graphs, both of which facilitate effective data modeling. Frequency tables are textual models that list the number of occurrences for each category. Bar graphs, alternatively, show the frequency of each category by the height of a vertical bar associated with that category. These two models are the most common representations for categorical data.

Frequency tables can also model numerical data, in conjunction with histograms. As with categorical data, the frequency table lists the frequency of each measurement in the table. Histograms, on the other hand, denote the frequency of a numerical value by the area of a rectangle. Consider the exemplar frequency table below:

The histogram differs from the bar graph in that the x-axis denotes numerical data rather than categorical data. Furthermore, it typically displays continuous data rather than discrete data (for more on this distinction, consider this article). Finally, histograms have no gaps between rectangles; the rectangles sit directly adjacent to one another. Consider the histogram below:
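To make the idea of adjacent rectangles concrete, here is a rough sketch of how continuous measurements might be sorted into equal-width bins before drawing a histogram. The function `histogram_counts`, the bin boundaries, and the wing-length sample are all assumptions chosen for illustration.

```python
def histogram_counts(values, low, high, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            continue  # ignore values outside the plotted range
        # A value equal to `high` is placed in the last bin.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical sample: wing lengths in millimeters.
lengths = [10.2, 11.5, 12.1, 12.8, 13.0, 13.4, 13.9, 14.2, 15.7, 17.3]

counts = histogram_counts(lengths, low=10, high=18, n_bins=4)

# Print a simple text histogram; the bins share edges, so there are no gaps.
for i, count in enumerate(counts):
    left = 10 + i * 2
    print(f"{left:>2}-{left + 2:<2} {'#' * count}")
```

Because the bins here are equally wide, the height of each rectangle is proportional to its area, so either one can be read as the frequency.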

## Interpreting Histogram Shape

The shape of a histogram conveys valuable information about the data in a sample. First and foremost, the peak of a histogram represents the mode of the sample, that is, the measurement with the greatest frequency. In some cases, a sample may be bimodal, meaning it has two frequently occurring values and therefore two peaks.
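For discrete measurements, the mode can be found directly by counting. The sketch below uses `multimode` from Python's standard `statistics` module (available in Python 3.8 and later); the sample values are hypothetical. For continuous data, one would instead look for the tallest bin of the histogram.

```python
from statistics import multimode

# Hypothetical sample in which two values are tied for the greatest frequency.
sample = [3, 4, 4, 4, 5, 6, 7, 8, 8, 8, 9]

# multimode returns every value that shares the highest frequency,
# so a bimodal sample yields two entries.
modes = multimode(sample)
print(modes)

if len(modes) == 2:
    print("The sample is bimodal.")
```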

In addition to the mode, we can gain great insights from the shape of the histogram. Symmetric histograms exhibit mirror-like frequencies on either side of the mode. This symmetric attribute is characteristic of the normal distribution (bell-shaped curve) as well as the uniform distribution, wherein frequencies for all values are essentially the same.

Histograms may also exhibit an attribute known as skew. A skewed histogram is asymmetric: the mode lies closer to one extreme than the other, and a longer tail extends toward the opposite side. Bimodal distributions can also appear asymmetric when their modes occur away from the center at unequal distances or differ in height, although a bimodal distribution is not necessarily skewed. Consider the bimodal histogram below:

In addition to the shape of the histogram, individual values have properties that must be considered. For example, histograms often reveal outliers: data points that lie at the extremes of a histogram, far from the bulk of the data. While they may not be representative of the population on their own, outliers help us understand how unusual observations can influence statistical behavior.

## Percentiles and Quantiles

When working with distributions, it is often useful to examine the data in terms of percentiles and quantiles. Doing so removes the need to interpret the units of the x-axis, standardizing the data along a predictable continuum. A percentile refers to the percent of values lying below a particular value. For example, the 75th percentile is the point that lies above 75% of the measurements in a distribution; the remaining 25% of measurements are above this point. A quantile expresses the same concept as a fraction. For example, the 75th percentile could just as easily be referred to as the 0.75 (75/100) quantile. Consider the percentiles and quantiles of the normal distribution below:
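The relationship between percentiles and quantiles can be sketched with Python's standard `statistics.NormalDist` class. The choice of the standard normal distribution (mean 0, standard deviation 1), the helper `fraction_below`, and the small sample are assumptions made for illustration.

```python
from statistics import NormalDist

# Standard normal distribution, chosen here purely as an example.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# inv_cdf maps a quantile (a fraction between 0 and 1) to the point on the
# x-axis below which that fraction of the distribution lies.
for q in (0.25, 0.50, 0.75):
    x = standard_normal.inv_cdf(q)
    print(f"{q:.2f} quantile = {round(q * 100)}th percentile, at x = {x:+.3f}")


def fraction_below(values, x):
    """Fraction of the sample lying strictly below x (its quantile in the sample)."""
    return sum(v < x for v in values) / len(values)


# Hypothetical sample: the value 4 lies above 3 of the 4 measurements.
measurements = [1, 2, 3, 4]
print(fraction_below(measurements, 4))
```

For the standard normal distribution, the 0.75 quantile falls at roughly 0.674 standard deviations above the mean, and the 0.25 quantile falls the same distance below it.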

These concepts are especially useful when creating a cumulative frequency distribution, which shows, for each numerical value in a data set, the quantile that the value corresponds to.

*Steps of Producing a Cumulative Frequency Distribution*

There are several steps in the construction of a cumulative frequency distribution model. For a given set of data:

- Sort the measurements by magnitude, from least to greatest
- For each value, calculate the fraction of measurements less than or equal to that value
- Plot this fraction, the cumulative relative frequency, as the height of the curve at that value
- Connect the points with a line, which ascends to a value of 1.0
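The first two steps above can be sketched in a few lines of Python. The function name `cumulative_relative_frequency` and the sample are hypothetical; the snippet returns the points to plot rather than drawing the curve itself.

```python
from bisect import bisect_right


def cumulative_relative_frequency(values):
    """Return (value, fraction of measurements <= value) for each distinct value."""
    ordered = sorted(values)           # step 1: least to greatest
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_right counts how many sorted measurements are <= x.
        points.append((x, bisect_right(ordered, x) / n))  # step 2
    return points


# Hypothetical sample of five measurements, one of them repeated.
sample = [4.1, 2.3, 3.7, 2.3, 5.0]

points = cumulative_relative_frequency(sample)
for x, height in points:
    print(f"{x}: {height:.1f}")
```

Each returned pair gives an x position and the height of the curve there; the repeated measurement produces a jump of 2/n instead of 1/n, and the last height is always 1.0.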

Several aspects of the cumulative frequency distribution must be noted:

First, the curve is ascending and appears to rise in steps: it jumps by 1/n for every measurement, where n is the number of observations. Second, the y-axis typically denotes the cumulative relative frequency, which ranges from 0 to 1; these values correspond to quantiles. Therefore, the point where the distribution crosses a cumulative relative frequency of 0.5 is the 0.5 (50/100) quantile, where 50% of measurements lie above and 50% of measurements lie below. Consider the cumulative frequency distribution below:

## Modeling Categorical Variables

*Contingency Tables*

Contingency tables model the frequencies of multiple categorical variables at once. The usefulness of this model, and the source of its name, is that it shows how one categorical variable depends on (is contingent upon) another.

In the contingency table, the explanatory variable is placed as a column label while the response variable is placed as a row label. Then the frequencies of these combinations are presented within the table. Consider the contingency table below:
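As a rough sketch, a contingency table can be built by counting each combination of explanatory and response values. The treatment and outcome labels below are hypothetical data invented for illustration.

```python
from collections import Counter

# Hypothetical observations: (explanatory value, response value) per individual.
observations = [
    ("control", "survived"), ("control", "died"), ("control", "died"),
    ("treated", "survived"), ("treated", "survived"), ("treated", "died"),
    ("control", "survived"), ("treated", "survived"),
]

counts = Counter(observations)  # frequency of each combination
explanatory = sorted({e for e, _ in observations})  # column labels
response = sorted({r for _, r in observations})     # row labels

# table[row][column]: response variable as rows, explanatory variable as columns.
table = {r: {e: counts[(e, r)] for e in explanatory} for r in response}

# Print the table with the explanatory variable across the top.
print(f"{'':<10}" + "".join(f"{e:>10}" for e in explanatory))
for r in response:
    print(f"{r:<10}" + "".join(f"{table[r][e]:>10}" for e in explanatory))
```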

*Grouped Bar Graphs*

Grouped bar graphs are quite similar to standard bar graphs, and must not be confused with histograms. Like standard bar graphs, they represent data by the height of a bar. The difference is that the bars are not evenly spaced, but are clustered into groups. Typically, each group corresponds to one value of the explanatory variable, and the bars within a group correspond to the values of the response variable. This model makes it easy to compare samples from multiple populations with respect to both the response variable and the explanatory variable. Consider the grouped bar graph below:

*Mosaic Plot*

The mosaic plot is quite similar to the grouped bar graph. It also consists of bars, but the response categories of a given group are stacked within a single bar rather than drawn side by side. Color coding the segments of each bar lets us see how the results are divided among the response categories. Because the bars for the explanatory variable sit side by side, we can then infer differences between samples from the differences in their segments. Consider the mosaic plot below:
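The segment heights within each bar of such a plot are the relative frequencies of the response categories inside that group. A minimal sketch of that calculation follows; the counts and the function name `segment_heights` are hypothetical, and no plotting is attempted.

```python
# Hypothetical counts: explanatory group -> response category -> frequency.
group_counts = {
    "control": {"survived": 2, "died": 2},
    "treated": {"survived": 3, "died": 1},
}


def segment_heights(response_counts):
    """Relative frequency of each response category within one group."""
    total = sum(response_counts.values())
    return {category: count / total for category, count in response_counts.items()}


# One stacked bar per explanatory group; its segments always sum to 1.
heights = {group: segment_heights(c) for group, c in group_counts.items()}
for group, segments in heights.items():
    print(group, segments)
```

Comparing the segment for one response category across the bars shows how the outcome differs between the groups.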

## Modeling Numerical Variables

*Comparative Histograms*

The principles and intricacies of histograms have already been discussed; if you need a refresher, check out this article. Recall that histograms reflect the relative frequency of a particular measurement by the area of a rectangle. If we are comparing different systems on the same measurement, we can use grouped histograms, in which the histograms are stacked together on the same x-axis. By comparing the modes, shapes, and other statistical properties of the histograms, we can infer differences between the populations.

*Comparative Cumulative Frequencies*

We covered cumulative frequencies extensively in a previous section. Recall that the cumulative frequency distribution is an ascending graph showing, for each measurement, the fraction of observations at or below it, and that the final step of constructing it is to connect the data points with lines. At its core, therefore, the cumulative frequency model is a line graph. To compare systems, we simply overlay the cumulative frequency graphs of multiple systems; it is generally useful to color code these lines and label them in a legend. We can then infer differences between the populations directly from the behavior of these graphs.

## Establishing Numerical Variable Relationships

*Scatter Plots*

Scatter plots are perhaps the bedrock of establishing relationships between numerical variables. In a scatter plot, each point on the graph represents one data point. The x-axis denotes the explanatory variable while the y-axis denotes the response variable, and the position of each point depends on its values for these two variables. When we plot all of the points of a data set, the pattern they form reveals the relationship between the response and explanatory variables: there may be a positive relationship, a negative relationship, or no relationship at all. Consider the scatter plot below which depicts a positive correlation between the variables:
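One way to put a number on the direction of such a relationship is the Pearson correlation coefficient. The article does not name a specific measure, so this choice, the function `pearson_r`, and the body-mass data are all assumptions made for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between paired observations xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: body mass (explanatory, x) and food intake (response, y).
body_mass = [20, 25, 30, 35, 40, 45]
food_intake = [3.1, 3.9, 4.2, 5.0, 5.4, 6.3]

r = pearson_r(body_mass, food_intake)
if r > 0:
    print("positive relationship")
elif r < 0:
    print("negative relationship")
else:
    print("no linear relationship")
```

A value near +1 or -1 indicates points that fall close to a rising or falling straight line, while a value near 0 indicates no linear trend; it does not rule out a curved relationship.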

*Line Graphs*

Line graphs are useful for documenting relationships between data that may not be linear. As with the scatter plot, the x-axis denotes the explanatory variable and the y-axis denotes the response variable, and each point is plotted according to its values for these two variables. Once the data is plotted, lines connect all of the points together. For more, check out this article on biological data modeling.

## Summary

- Frequency distributions document the number of observations within a range of a particular measurement
- The frequency distribution gives rise to percentiles and quantiles, which indicate the proportion of values that lie below a particular point
- Histograms model continuous numerical data and help indicate statistical properties of a sample, such as the mode and the shape of the distribution
- Categorical and numerical data are modeled with different techniques, each of which provides insight into the effect of explanatory variables
- Comparing models of multiple samples allows us to observe differences between populations