# A Guide to Modeling Biological Data: Introductory Statistics and Sampling

## The Premise

Working with biological systems can generate an enormous amount of data. Whether we are studying an animal population, modeling protein interactions, or assessing gene identities, biological systems are hotbeds for data. This data is useless unless we have some way to understand it. Modeling biological data through computation is critical to understanding biological systems; this is the premise of statistics. Furthermore, this modeling is often carried out by software. For more on modeling biological data, check out this article.

Furthermore, biological data is messy. While we often seek a precise answer, individuals within a system often deviate from each other. For that reason, we must have ways of managing error as well as modeling this data; again, this is where statistics comes in. Statistics collectively refers to the methodology by which we measure and analyze properties of biological systems. This analysis revolves around the principle of estimation, wherein the value of an unknown quantity is inferred from the data. The quantities that describe a population are parameters, and our estimates are approximations of these parameters calculated from a sample.

Why do we go to such lengths to model biological data in this manner? Much of biology revolves around hypotheses: our best predictions about some attribute of a biological system. We support or reject a hypothesis depending on the data, which we interpret through statistics. Thus, managing biological data revolves around hypothesis testing.

## Sample Populations

Perhaps the most essential quality control step is the process of sampling populations to obtain data. When acquiring samples, the first step is identifying a population to study. The population is the collection of individuals under investigation. However, under most circumstances, it is impractical to study an entire population.

For this reason, we settle for a sample of the population, which serves as a cross-section to be studied. We assume that a sufficiently large sample, provided that it has been randomly acquired, gives a reasonably representative view of the overall population. The fundamental component of a sample is the individual. Consider the image below:

## Choosing a Proper Sample

##### Purpose of Sampling

When working with data, we expect the estimates to deviate, at least partially, from the true behavior of the population. This variability is a consequence of sampling error. The variation in measurements caused by sampling error gives insight into the precision of our estimates. Precision describes the consistency of our results. Accuracy, on the other hand, refers to how close our estimates are to the true value of a population’s parameter. If there is a consistent error in the data, such that the value has been systematically underestimated or overestimated, we may infer that a bias is present.

In consideration of these potential errors, proper sampling of a population is critical to curtail the influence of sampling error and bias. If our sampling has been undertaken with great care, we hope to observe measurements which are both accurate and precise.
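To make the difference between precision, accuracy, and bias concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the population of 10,000 body lengths is simulated, and the `offset` parameter is a hypothetical stand-in for a consistent measurement error such as a miscalibrated instrument.

```python
import random
import statistics


def sample_means(population, n, repeats, rng, offset=0.0):
    """Return the mean of each of `repeats` random samples of size `n`.

    `offset` is a hypothetical constant measurement error added to every
    value, standing in for something like a miscalibrated instrument.
    """
    means = []
    for _ in range(repeats):
        sample = rng.sample(population, n)  # random sample without replacement
        means.append(statistics.mean([x + offset for x in sample]))
    return means


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Made-up population: 10,000 body lengths in millimetres.
population = [rng.gauss(50.0, 5.0) for _ in range(10_000)]
true_mean = statistics.mean(population)  # the parameter we want to estimate

small = sample_means(population, n=30, repeats=500, rng=rng)
large = sample_means(population, n=300, repeats=500, rng=rng)
skewed = sample_means(population, n=30, repeats=500, rng=rng, offset=2.0)

for label, means in [("n=30", small), ("n=300", large), ("n=30, biased", skewed)]:
    bias = statistics.mean(means) - true_mean  # accuracy: distance from the parameter
    spread = statistics.stdev(means)           # precision: scatter of the estimates
    print(f"{label:>13}: bias = {bias:+.2f} mm, spread = {spread:.2f} mm")
```

In this sketch, the scatter of the sample means reflects sampling error (precision), while the consistent shift introduced by `offset` reflects bias (a loss of accuracy). Note that taking a larger sample tightens the scatter but does nothing to remove the shift.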

##### Insights of Random Sampling

The most effective type of sampling is random sampling. In a random sample, each individual in a population should have an equal chance of being included in the sample. Furthermore, when taking individuals from the population, the choices must be independent of each other. This is to say that the selection of one individual must not influence the selection of another individual. A random sample that satisfies both conditions, equal chance and independent selection, minimizes bias and provides the best opportunity for keeping sampling error in check.

##### Steps of Taking a Random Sample

We may execute a random sample with four simple steps:

1. Assign a number to each individual in the population, ranging from one to the size of the population
2. Determine the sample size ‘n’, which is the number of individuals that will be sampled from the population
3. Use a random number generator to pick ‘n’ distinct random integers between one and the size of the population
4. Sample the individuals whose numbers match the ‘n’ integers drawn
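The four steps above can be sketched in Python using only the standard library. This is an illustrative sketch rather than a prescribed method: the function name `simple_random_sample`, the `seed` argument, and the list of frog labels are all hypothetical.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of `n` individuals from `population`."""
    rng = random.Random(seed)

    # Step 1: assign each individual a number from 1 to the population size.
    numbered = {number: individual
                for number, individual in enumerate(population, start=1)}

    # Step 2: the sample size n cannot exceed the population size.
    if not 0 <= n <= len(numbered):
        raise ValueError("n must be between 0 and the population size")

    # Step 3: pick n distinct random integers between 1 and the population size.
    # rng.sample draws without replacement, so no number is picked twice.
    chosen_numbers = rng.sample(range(1, len(numbered) + 1), n)

    # Step 4: collect the individuals whose numbers were drawn.
    return [numbered[number] for number in chosen_numbers]


# Hypothetical population of 200 labelled frogs.
frogs = [f"frog_{i:03d}" for i in range(1, 201)]
sample = simple_random_sample(frogs, n=20, seed=1)
print(sample)
```

Drawing without replacement means no individual can appear in the sample twice, and fixing `seed` makes the draw repeatable, which is handy when a sampling plan needs to be documented.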

Now, this is a tried and true method. However, it is sometimes impractical to assign numbers to every individual in a population. Some populations are just too large. For that reason, we can divide a population into groups of equal size, with individuals randomly assigned to each group.

##### Convenience Samples

A random sample should always be the go-to sample when it comes to statistical computation. However, for a variety of reasons, this may not always be feasible. In this case, a researcher uses a sample of convenience. The sample of convenience is simply a sample of individuals who are easily accessible to the researcher. One of the primary issues with a sample of convenience is that it is extremely difficult to guarantee that the sample is unbiased. However, when a random sample is not feasible, this is the best available option. We present an example of convenience sampling below:
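As a rough illustration of why a sample of convenience can be biased, the Python sketch below simulates a hypothetical field of 2,000 plants. The scenario is entirely assumed: plants far from a footpath are made taller on average, and the convenience sample simply takes the plants closest to the path.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical field: each plant is (distance from path in m, height in cm).
# By assumption, plants further from the path grow taller on average.
plants = []
for _ in range(2000):
    distance = rng.uniform(0, 100)
    height = 20 + 0.2 * distance + rng.gauss(0, 3)
    plants.append((distance, height))

true_mean = statistics.mean([height for _, height in plants])

n = 50
# Sample of convenience: the 50 plants that are easiest to reach.
convenience = sorted(plants, key=lambda plant: plant[0])[:n]
# Random sample: every plant has the same chance of being chosen.
randomized = rng.sample(plants, n)

convenience_mean = statistics.mean([height for _, height in convenience])
random_mean = statistics.mean([height for _, height in randomized])

print(f"population mean height:  {true_mean:.1f} cm")
print(f"convenience sample mean: {convenience_mean:.1f} cm")
print(f"random sample mean:      {random_mean:.1f} cm")
```

Under these assumptions, the convenience sample underestimates the mean height because accessibility is tied to the very attribute being measured. Real convenience samples may be biased in less obvious ways.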

## Investigating Different Types of Data and Variables

If our data has been properly sampled, we may now measure various attributes of the sample. The variables under investigation are the characteristics that differ between individuals, and our data consists of the measurements made for those variables. We will now investigate several different properties of data and variables. For more depth on modeling these variables, check out the following article.

##### Categorical and Numerical Variables

Categorical variables are attributes that assign an individual to a particular group. Such a variable describes a qualitative attribute of the individual and does not have a magnitude. An example of this would be sex chromosomes assigning sex to an individual. There are two subtypes of categorical variables. Nominal variables assign categories that have no particular order. Ordinal variables assign categories that follow a specific order. Consider the chart below for efficient differentiation:

Numerical variables are assigned by measuring a particular attribute, and they have a quantitative magnitude. Numerical variables may be either continuous or discrete. Continuous numerical variables can take on any value within a particular range. Discrete numerical variables, alternatively, can only take on distinct, countable values within a particular range, such as whole numbers.
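The four kinds of variables can be sketched in Python as follows. The records, the `Sex` and `LifeStage` types, and the choice of spot count and body mass as example variables are all hypothetical; the point is only that each kind of variable supports different operations.

```python
from collections import Counter
from enum import Enum, IntEnum
from statistics import mean


class Sex(Enum):
    """Nominal: categories with no inherent order."""
    FEMALE = "female"
    MALE = "male"


class LifeStage(IntEnum):
    """Ordinal: categories with a meaningful order."""
    LARVA = 1
    JUVENILE = 2
    ADULT = 3


# Hypothetical records: (sex, life stage, spot count, body mass in grams).
records = [
    (Sex.FEMALE, LifeStage.ADULT, 7, 4.31),
    (Sex.MALE, LifeStage.JUVENILE, 5, 2.08),
    (Sex.FEMALE, LifeStage.LARVA, 0, 0.47),
    (Sex.MALE, LifeStage.ADULT, 9, 3.95),
]

# Nominal variables can be counted per category, but not ranked.
sex_counts = Counter(record[0] for record in records)

# Ordinal variables can additionally be placed in order.
stages_in_order = sorted(record[1] for record in records)

# Discrete numerical variables take countable values (here, whole numbers).
total_spots = sum(record[2] for record in records)

# Continuous numerical variables can take any value within a range.
mean_mass = mean([record[3] for record in records])

print(sex_counts)
print([stage.name for stage in stages_in_order])
print(total_spots, mean_mass)
```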

##### Explanatory and Response Variables

Beyond their type, variables can also be described by their relationship to one another. A response variable, also called a dependent variable, is one whose value depends on the value of another variable. The explanatory variable, also known as an independent variable, is one whose value does not depend on the other variable and ‘explains’ the output of the response variable.

## An Overview of Distribution Models

Because there are variations in the measurements of different individuals in a sample, these individuals are said to be distributed according to these measurements. A frequency distribution helps demonstrate how often a particular value for a variable is represented in a given sample. The frequency describes how often a single particular measurement appears, while the frequency distribution describes the frequencies across the whole range of measurements. For a particular variable, its distribution across an entire population is its probability distribution. The distributions of many variables follow the normal distribution, though this will be investigated in a later article.
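A frequency distribution is straightforward to tabulate in code. The sketch below is a minimal Python example with made-up data: the clutch sizes are hypothetical, and `frequency_distribution` is a helper name chosen for this illustration.

```python
from collections import Counter


def frequency_distribution(values):
    """Map each observed value to (frequency, relative frequency)."""
    counts = Counter(values)
    total = len(values)
    return {value: (counts[value], counts[value] / total)
            for value in sorted(counts)}


# Hypothetical sample: number of eggs found in each of 12 nests.
clutch_sizes = [3, 4, 4, 5, 3, 4, 6, 4, 5, 3, 4, 5]
distribution = frequency_distribution(clutch_sizes)

# Print a small text histogram: one '#' per nest with that clutch size.
for value, (count, proportion) in distribution.items():
    print(f"{value} eggs: {count:2d} nests ({proportion:.2f}) {'#' * count}")
```

The count for any one value is its frequency; the full table across all observed values is the frequency distribution. The relative frequencies sum to one, which is what connects a frequency distribution from a sample to a probability distribution for the population.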

## Types of Statistical Studies

While a variety of study types exist, covering all of them is beyond the scope of this article. We investigate these individually later. Nevertheless, we here address properties of experimental studies and observational studies. In experimental studies, the researcher assigns different groups to a particular treatment or modification. These studies revolve around variations in the explanatory variable. Alternatively, observational studies involve groups which nature assigns rather than the researcher. In this case, the explanatory variable cannot be directly controlled by the researcher. Furthermore, the researcher has no control over which groups the individuals belong to. To acquire more depth on statistical studies, check out the following article. Consider the chart below for a brief overview of study types:

## Summary

1. Statistics is the study of measurements for a particular population under investigation, and it is the basis of modeling biological data.
2. Statistical analysis of a population is subject to various inconsistencies and errors which may alter results.
3. Statistics provides opportunities for minimizing error.
4. One of the best opportunities for minimizing error is proper sampling, which limits the influence of sampling error and bias.
5. Random sampling gives the best chance of obtaining estimates that are both accurate and precise.
6. With random samples, each individual in a population has an equal chance of being included in the sample.
7. Statistics involves a variety of data types and variables.
8. We can effectively model variables by considering their distribution.
9. A variety of distributions provide insight into the behavior of a particular attribute in a population.
10. Basic study types include observational and experimental studies, although many other types exist; understanding them is critical to modeling biological data.