Introduction to Precision and Recall
The Topics of Machine Learning series has steadily moved from the overarching principles of machine learning models to the individual components that make these models work. With our previous article having investigated the implementation of the confusion matrix, we now turn to the underlying theory of precision and recall. Thus far in the series, we have elaborated extensively on the various categories of machine learning models. Our first article in this series provided an overview of all of these model categories, an investigation which you may find here. After this broad introduction, we began examining the specifics and algorithms associated with each of these models. This began with an analysis of supervised learning models, particularly their classification and regression algorithms.
Next, we followed this with a discussion of the features of unsupervised learning models. We then differentiated the methodologies of batch and online learning, which you may investigate here. We finally followed up with an elaboration of the differences between instance-based and model-based learning, found here.
With the individual algorithms explored and discussed in depth, in one of our most recent articles we investigated the fundamentals of classification algorithms. Along that line of thought, we began by exploring the MNIST data set and its use in machine learning classification algorithms. Having developed this classifier in one of our preceding articles, we followed up with an analysis of one particular performance measure for machine learning algorithms. In that discussion, we elaborated on the accuracy metric for assessing model performance using the cross-validation technique. This was then followed by another discussion of performance measurement through the use of the confusion matrix.
One of the primary features of the confusion matrix is that it supports the computation of precision and recall. These computations were discussed at great length in our previous article. Here, however, we focus on the underlying theory of precision and recall in order to provide a better understanding of what these metrics actually represent. Let’s begin.
Conceptualizing Precision and Recall
Why Measure Performance?
Machine learning models rapidly assign values to the data items in a set. How well these models perform this task remains unclear unless we have some means of quantifying their performance. Performance measures allow us to examine quantitative aspects of our model, including its accuracy, precision, recall, and efficiency.
In “A systematic analysis of performance measures for classification tasks”, Marina Sokolova contends that “Empirical evaluation remains the most used approach for the algorithm assessment, although ML algorithms can be evaluated through empirical assessment or theory.” Thus, measurement of performance is the most commonly used tool at our disposal for understanding our model.
The validity of performance measures is confirmed in a variety of contexts, but bioinformatician Yasen Jiao presents it elegantly in his publication “Performance measures in evaluating machine learning based bioinformatics predictors for classifications.” Jiao presents the importance of machine learning performance measurement as a consequence of the potential for an algorithm to over-fit or under-fit data. The author contends, “Because machine learning algorithms usually use existing data to establish predictive models, it is possible that a predictive model is over-fitted or over-optimized on the existing data.” Here, Jiao points out a consequence of the manner in which the model is built from existing data: it can create inconsistencies which we must account for. Jiao clarifies what over-fitted or over-optimized means by stating that the algorithm performs well on existing data but lacks the flexibility to perform well on incoming data.
Performance measures are important because they bear so heavily on the validity of our data and, thus, our insights. Jiao supports this argument in stating that “the prediction performance drops drastically when it is applied in practical studies with novel data.” In order to correct for the bias that we may introduce into our model if over-optimization has not been accounted for, Jiao finally argues that “when applying these computational predictors, it is vitally important to understand their mechanisms and the conditions of their performances in the first place.” We share this conviction as well.
Implementing Performance Measures
Sokolova’s article supplies a brief overview of the computation of a variety of performance measures using a confusion matrix. She states, “The correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and examples that either were incorrectly assigned to the class (false positives) or that were not recognized as class examples (false negatives).” We can then use these four counts to quantify a variety of metrics, including precision and recall.
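The four counts in Sokolova’s description can be tallied directly from a list of true labels and a list of predictions. The following is a minimal sketch in plain Python. The function name confusion_counts and the toy labels are our own illustrative choices and do not come from Sokolova’s article or from any library.

```python
def confusion_counts(y_true, y_pred):
    """Tally the four outcomes of a binary classifier.

    y_true and y_pred are equal-length sequences of booleans,
    where True means "belongs to the positive class".
    """
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual and predicted:
            tp += 1  # correctly recognized class example
        elif not actual and not predicted:
            tn += 1  # correctly recognized as not belonging to the class
        elif predicted:
            fp += 1  # incorrectly assigned to the class
        else:
            fn += 1  # class example that was not recognized
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn}


# Made-up labels, for illustration only
y_true = [True, True, True, False, False, False, False, True]
y_pred = [True, False, True, False, True, False, False, True]

counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
```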
In everyday usage, precision is understood as the quality of being exact or accurate. However, we know from our previous discussions that precision is not the same thing as accuracy. In measurement, precision describes the closeness among a series of measurements. In classification, it plays a related role: it is not sufficient for the model to flag positive instances, as we also need those positive predictions to be reliably correct. As stated in this article, “Precision, or the positive predictive value, refers to the fraction of relevant instances among the total retrieved instances.” Therefore, in computing precision, through whatever mechanism we decide, we really calculate the fraction of positive predictions that are correct.
Sokolova defines precision in the following manner: “Precision: the number of correctly classified positive examples divided by the number of examples labeled by the system as positive.” Thus, precision is understood as the ratio of the correctly classified positives to the total number of objects classified as positive.
Whereas precision measures how reliable the positive predictions are, recall, also known as the true positive rate, measures how many of the actual positives are found. The true positive rate is the ratio of correctly identified positive instances to the sum of true positives and false negatives. Recall is particularly important because it tells us how often a positive instance is detected rather than missed. According to this article, “Recall, also known as sensitivity, refers to the fraction of relevant instances retrieved over the total amount of relevant instances.” As such, recall allows us to understand the ability of our model to correctly identify positive instances.
Sokolova defines recall in the following manner: “Recall: the number of correctly classified positive examples divided by the number of positive examples in the data.” Consequently, we can understand recall as the ratio of the number of correctly classified positives to the total number of positives that exist in the data set.
Deriving Precision and Recall From Confusion Matrix
Consider the following description of the confusion matrix. Each row of a confusion matrix represents the actual class of the objects we are examining. The columns of that row list all of the classes that such an object may be classified as. Therefore, if we are looking at class ‘m’, we first go to the ‘mth’ row of the confusion matrix. If we then look at the ‘mth’ column of the ‘mth’ row, our hope is that this will be the greatest value in the row (if our algorithm has decent recall). We can then look at the other columns in the row to see what class ‘m’ is most often misclassified as.
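The row-and-column reading described above can be sketched in a few lines of plain Python. The helper names build_confusion_matrix and most_confused_with, as well as the three-class toy labels, are hypothetical and serve only as an illustration. Rows hold the actual class and columns hold the predicted class.

```python
def build_confusion_matrix(y_true, y_pred, n_classes):
    """Return a matrix where matrix[actual][predicted] is a count."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


def most_confused_with(matrix, m):
    """Return (class, count) for the largest off-diagonal entry of row m."""
    off_diagonal = [(count, col) for col, count in enumerate(matrix[m]) if col != m]
    count, col = max(off_diagonal)
    return col, count


# Made-up labels for three classes (0, 1, 2), for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 1, 2, 0, 2]

matrix = build_confusion_matrix(y_true, y_pred, n_classes=3)
for row in matrix:
    print(row)

m = 1
row_recall = matrix[m][m] / sum(matrix[m])  # diagonal entry over the row total
print(row_recall)                     # 0.75
print(most_confused_with(matrix, m))  # (2, 1): class 1 was mistaken for class 2 once
```

The diagonal entry divided by the row total is the recall for that class, which is why a dominant diagonal entry in row ‘m’ signals decent recall.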
According to Sokolova, the correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives). In conjunction with true positives, the number of correctly recognized examples that do not belong to the class (true negatives), and examples that either were incorrectly assigned to the class (false positives) or that were not recognized as class examples (false negatives) can all be used for the quantification of precision and recall. For example, precision can be calculated from true positives and false positives. Likewise, recall can be computed from true positives and false negatives. This makes the confusion matrix particularly useful for assessing performance measures.
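To make the arithmetic concrete, here is a small sketch that derives both metrics from confusion-matrix counts. The function names and the counts are invented for illustration. Returning 0.0 when the denominator is zero is a convention we assume here, since the ratio is undefined in that case.

```python
def precision_from_counts(tp, fp):
    """Correct positive predictions over all positive predictions."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0


def recall_from_counts(tp, fn):
    """Correct positive predictions over all actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Made-up counts, for illustration only
tp, fp, fn = 30, 10, 20

precision = precision_from_counts(tp, fp)
recall = recall_from_counts(tp, fn)
print(precision)  # 0.75 (30 of the 40 positive predictions were right)
print(recall)     # 0.6  (30 of the 50 actual positives were found)
```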
Computing Precision and Recall From Confusion Matrix
In our previous article, we addressed how precision could be computed for our stochastic gradient descent classifier. We previously described precision as the reliability of the positive predictions our algorithm makes. A simple mathematical formula lets us compute the precision of the algorithm. The formula looks like:
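Written with the confusion-matrix counts, where TP denotes the number of true positives and FP the number of false positives, the standard definition of precision is:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}
\]
```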
From this formula, we acquire a much more robust understanding of what precision represents. Here, we see that precision is the quotient of the number of true positives and the sum of true positives and false positives. By this logic, we may understand precision as the ratio of true positives to the total number of instances identified as fives.
The precision of a machine learning algorithm is readily available through the scikit-learn function ‘precision_score’. All we must do is pass in the list containing the true labels (whether each digit is a five) and the list containing the predictions for these values. Let us take a look at how we encode this functionality in our Python script:
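The call itself is a one-liner. The sketch below runs it on a small made-up example rather than on MNIST. The names y_true_5 and y_pred_5 are hypothetical stand-ins for the true five/not-five labels and the classifier’s predictions, so the printed value is not the score of our MNIST classifier.

```python
from sklearn.metrics import precision_score

# Hypothetical toy data: 1 means "this digit is a five", 0 means "not a five"
y_true_5 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_5 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# 3 true positives and 1 false positive give 3 / (3 + 1)
precision = precision_score(y_true_5, y_pred_5)
print(precision)  # 0.75
```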
When we execute this call on our MNIST classifier, we obtain a precision score of approximately 0.837, or 83.7%. This suggests that when the classifier labels a digit as a five, it is correct ~84% of the time.
In addition to the formula for computing precision from the confusion matrix, we also have one for computing the recall of our machine learning algorithm. Let’s take a look at the mathematical formula for the computation of recall:
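With TP denoting the number of true positives and FN the number of false negatives, the standard definition of recall is:

```latex
\[
\text{recall} = \frac{TP}{TP + FN}
\]
```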
From this formula, we may acquire a better understanding of what recall denotes. We may understand recall as the quotient of the true positives and the total number of actual fives. Recall can also be computed directly with the recall_score function from scikit-learn. Let’s take a look at how we can encode this in our Python script:
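As a minimal sketch, here is recall_score applied to a small made-up example. The names y_true_5 and y_pred_5 are hypothetical stand-ins for the true labels and the predictions, and the result is not the score of our MNIST classifier.

```python
from sklearn.metrics import recall_score

# Hypothetical toy data: 1 means "this digit is a five", 0 means "not a five"
y_true_5 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_5 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# 3 true positives and 2 false negatives give 3 / (3 + 2)
recall = recall_score(y_true_5, y_pred_5)
print(recall)  # 0.6
```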
As with the precision score function, the recall score function takes as input the true labels for the fives and the predictions. Executed on our MNIST classifier, it returns a recall score of 0.651, a bit lower than our precision score. This implies that the classifier correctly identifies ~65% of all of the fives in the data set.
Relationships Between Precision and Recall
Computing F1 Score
Precision and recall each carry their own statistical significance. From these different performance measures, we obtain different insights into the quality of our machine learning model. However, it is sometimes useful to combine these metrics into a single number, known as the F1 score. The F1 score is the harmonic mean of precision and recall. Because the harmonic mean gives more weight to low values, the F1 score is high only when both precision and recall are high. A simple mathematical relationship between precision and recall gives this computation. It appears as follows:
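Written out, the harmonic mean of precision and recall takes the following equivalent forms, where TP, FP, and FN are the confusion-matrix counts:

```latex
\[
F_1 = \frac{2}{\dfrac{1}{\text{precision}} + \dfrac{1}{\text{recall}}}
    = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{TP}{TP + \dfrac{FN + FP}{2}}
\]
```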
From this, we can see that the F1 score is defined as twice the product of precision and recall divided by the sum of precision and recall. We can also easily compute the F1 score using the f1_score function from scikit-learn. This makes computing this metric as easy as computing precision and recall themselves.
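The following sketch checks, on a small made-up example, that f1_score agrees with the harmonic mean computed by hand. The toy labels are hypothetical and unrelated to our MNIST results.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical toy data: 1 means "this digit is a five", 0 means "not a five"
y_true_5 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred_5 = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true_5, y_pred_5)  # 0.75
recall = recall_score(y_true_5, y_pred_5)        # 0.6

# Harmonic mean of precision and recall, computed by hand
f1_manual = 2 * precision * recall / (precision + recall)

# The same quantity from scikit-learn
f1 = f1_score(y_true_5, y_pred_5)

print(f1_manual, f1)  # both are approximately 0.667
```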
Dynamics of Precision and Recall
An interesting facet of the precision and recall relationship is that increasing precision generally decreases recall, and vice versa. This phenomenon is known as the precision/recall tradeoff.
For example, in the case of our stochastic gradient descent classifier, the model makes its classification based on its decision function. The decision function computes a score for each data object, and this score is compared against a threshold to assign the object to a class. Below the threshold, a data object is assigned to the negative class, while above the threshold, it is assigned to the positive class.
In scikit-learn, we cannot set this threshold directly, but we do have access to the decision scores. The classifier’s decision_function method returns a score for each instance, and with these scores, we can choose our own threshold and assign classes based on it. Take a look at the code below:
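A minimal sketch of this idea follows. It assumes a synthetic data set from make_classification in place of MNIST, and the helper predict_with_threshold is a hypothetical name of our own. For simplicity the scores are computed on the training data, which is fine for showing the mechanics but not for a fair performance estimate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score

# Synthetic binary data standing in for the five / not-five task (assumption)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

clf = SGDClassifier(random_state=42)
clf.fit(X, y)

# One decision score per instance; predict() uses a threshold of 0
scores = clf.decision_function(X)


def predict_with_threshold(scores, threshold):
    """Assign the positive class to every score above the threshold."""
    return (scores > threshold).astype(int)


default_pred = predict_with_threshold(scores, 0.0)

# A stricter, arbitrary threshold: the median of the positive scores
high_threshold = np.median(scores[scores > 0])
strict_pred = predict_with_threshold(scores, high_threshold)

recall_default = recall_score(y, default_pred)
recall_strict = recall_score(y, strict_pred)

print("threshold 0:", precision_score(y, default_pred), recall_default)
print("threshold %.2f:" % high_threshold, precision_score(y, strict_pred), recall_strict)
```

Raising the threshold can only shrink the set of instances labeled positive, so recall cannot go up. Precision usually rises in exchange, although that is not guaranteed for every data set.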
Precision and Recall Curve
Graphing precision and recall for our machine learning model can give us a better understanding of the precision/recall tradeoff and its relationship to the threshold. We may plot this relationship quite easily using Matplotlib and scikit-learn’s built-in precision_recall_curve function. Let’s take a look at the code for producing this plot:
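One way such a plot might be produced is sketched below. It assumes a synthetic data set in place of MNIST, obtains out-of-fold decision scores with cross_val_predict, and writes the figure to a file whose name is our own choice. It is an illustration of the technique and not necessarily the exact script behind the graph in this article.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Synthetic binary data standing in for the five / not-five task (assumption)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
clf = SGDClassifier(random_state=42)

# Out-of-fold decision scores rather than hard class predictions
scores = cross_val_predict(clf, X, y, cv=3, method="decision_function")

# Precision and recall for every candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y, scores)

# The returned precision and recall arrays have one more entry than
# thresholds, so the final element is dropped when plotting
fig, ax = plt.subplots()
ax.plot(thresholds, precisions[:-1], "b--", label="Precision")
ax.plot(thresholds, recalls[:-1], "g-", label="Recall")
ax.set_xlabel("Threshold")
ax.set_ylabel("Score")
ax.legend(loc="best")
fig.savefig("precision_recall_vs_threshold.png")
```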
When we execute this code, we obtain a graph that appears as follows:
The Take Away
The present article discussed how to quantify classifier performance using the metrics of precision and recall. In particular, we used the confusion matrix as the basis for computing precision and recall. Precision and recall are rooted in statistics, yet they apply effectively to assessing the performance of a machine learning model. Scikit-learn provides several helpful functions for computing the confusion matrix, precision, and recall of our model, making this task quite simple.
We employed our code from our previous article, which involved the creation of a binary classifier via the MNIST set. Here, we established the precision and recall of the model using the confusion matrix technique. From the values represented within the confusion matrix, the precision and recall can be readily computed. Our next article in the Topics of Machine Learning series explores the ROC curve. We look forward to seeing you there.