Performance Measurement in Machine Learning: 100 Days of Code (1/100)

The Motivation

The Art of Better Programming has grown beyond what we had ever anticipated. With several collaborators now working together to bring programmers unique material to follow along with in their training endeavors, the sky’s the limit right now. With that being said, for those who are just now joining us on this journey, we want to provide you with a succinct means of staying up to date with the progress of our efforts, which has prompted us to hop on the #100DaysOfCode trend. We will use this not only to demonstrate what we have published, but also to summarize the texts we are using to teach ourselves, as well as other projects we’re working on as our growth proceeds. We’re thrilled to have you here with us, so let us elaborate on Day 1 of our 100 Days of Code journey.

Publications of the Day

Overview

To date, we’ve been really hammering out machine learning tutorials, attempting to add some meat to the bones of this area of our material. Today, all of our articles have focused on performance measurements associated with machine learning models. Measuring performance is an essential facet of machine learning, as we need some quantitative means of determining the efficacy of our model. Furthermore, performance measurements allow us to comprehend how tweaks to our system affect its overall classification performance, and finally, allow us to compare different models to find the optimal one for our task. Let’s take a look at some of our findings and discussions.

Overarching Training Model

Before we get into the publications made, we should first introduce the model that ties all of these publications together. We first developed a stochastic gradient descent algorithm on the MNIST data set and used this model to compare performance measures. Our code for creating this model appears as follows:
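A minimal sketch of this model, assuming scikit-learn’s fetch_openml loader, the conventional 60,000/10,000 MNIST split, and illustrative variable names such as X_train, y_train_5, and sgd_clf (rather than a verbatim listing), looks like this:

```python
# A minimal sketch of the training pipeline: load MNIST, split the data,
# build binary "five" labels, and fit a stochastic gradient descent classifier.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier

# 'data' holds the 70,000 x 784 pixel array, 'target' holds the 70,000 labels.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target'].astype(int)

# Split both arrays into training and testing sets.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# Binary labels for the five-detector: True for fives, False otherwise.
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

# Train the stochastic gradient descent classifier as a binary classifier.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# Predict whether a particular digit (here, the first training image) is a five.
some_digit = X_train[0]
print(sgd_clf.predict([some_digit]))
```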

Our first step in approaching the MNIST data set was splitting the data into training and testing sets, and subdividing these groups on the basis of whether they contained the images themselves or their labels. We separate and acquire these sets of information by separately calling the ‘data’ and ‘target’ arrays. These arrays exhibit significant differences: ‘data’ has a shape of 70,000×784 because each image consists of 784 pixels, while the ‘target’ array is one-dimensional and holds only 70,000 entries, as these represent the labels. Keep in mind that both of these arrays must be further separated into training and testing data.

After splitting the data into these groups, we trained a machine learning model that operates as a binary classifier, differentiating images which are fives from those which are not. To train this stochastic gradient descent algorithm as our binary classifier, we first established the proper data sets: we need to derive the binary five labels from the label arrays for both the training and the testing data.

Once we do this, all we must do is instantiate the SGDClassifier from scikit-learn. Then we fit it with the training data and the binary five labels. We can then simply use the ‘predict’ function and pass in some particular digit, selected by its position in the data set. If that digit is a five, to some degree of accuracy, the function will return true. If that digit is not a five, to some degree of accuracy, the function will return false.

The Confusion Matrix

Our first posted article addressed the utility of the confusion matrix in measuring a machine learning model’s performance. This was published on the coat-tails of a previous article that began with the accuracy measurement, but it deviated from that territory in that the confusion matrix provides an avenue for deriving precision and recall, and from those, the F1 score as well.

Creating the confusion matrix is actually rather simple, as we noted in the article. The first tool we implement in this calculation is the cross_val_predict function. It may seem quite similar to the cross_val_score function used for computing cross-validation scores on a data set. The difference is that, though cross_val_predict also executes K-fold cross-validation, it does not return evaluation scores, but rather returns the predictions made on each test fold.

With the test fold predictions acquired, the confusion matrix may now be calculated using scikit-learn’s confusion_matrix function. Let’s take a look at how we might code the confusion matrix:
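Here is a minimal sketch, reusing the sgd_clf, X_train, and y_train_5 names assumed in the earlier snippet:

```python
# Compute cross-validated predictions and the confusion matrix for the
# five-detector; sgd_clf, X_train, and y_train_5 come from the earlier sketch.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# K-fold cross-validation with cv=3 that returns the test-fold predictions
# rather than evaluation scores.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

# Rows are the actual classes (not-five, five); columns are the predictions.
print(confusion_matrix(y_train_5, y_train_pred))
```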

We first utilize the cross_val_predict function. Here, we specify the stochastic gradient descent classifier, the labeled input training data, and the training labels for images of fives. Additionally, we set the ‘cv’ argument equal to three, which specifies the number of cross-validation folds.

Next, we import the confusion_matrix function from the scikit-learn library. We then call this function with the training labels for the five images and the predictions from cross_val_predict, and print out the output.

While the confusion matrix alone will not exactly give us the insights we need for verifying our algorithm’s performance (we need precision and recall for this), as long as the largest numbers inhabit the main diagonal, we can rest assured that the algorithm is for the most part functioning as expected.

Precision and Recall

Having made a publication about the confusion matrix, it was only sensible to dive into the intricacies of precision and recall.

Sokolova’s article supplies a brief overview of the computation of a variety of performance measures using a confusion matrix. She states, “The correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives), the number of correctly recognized examples that do not belong to the class (true negatives), and examples that either were incorrectly assigned to the class (false positives) or that were not recognized as class examples (false negatives).” We can then use these parameters to quantify a variety of metrics, including precision and recall.

The concept of precision is colloquially understood as the quality of being exact or accurate. However, we know from our previous discussions that precision is not the same thing as accuracy. In measurement terms, precision represents the closeness between a series of measurements. Precision is of great importance to machine learning performance assessments, as it is not sufficient to simply acquire a correct answer; we must also ensure that the correct answer is a consistent output. As stated in this article, “Precision, or the positive predictive value, refers to the fraction of relevant instances among the total retrieved instances.” Therefore, in computing precision, through whatever mechanism we decide, we are really calculating what fraction of the instances identified as positive are truly positive.

Whereas precision represents the consistency of output in terms of correctly identified positive values, recall represents a metric also known as the true positive rate. The true positive rate is the ratio of positive instances correctly identified to the sum of true positives and false negatives. Recall is of particular importance as it allows us to understand how often the correct output is obtained relative to how often it is missed. According to this article, “Recall, also known as sensitivity, refers to the fraction of relevant instances retrieved over the total amount of relevant instances.” As such, recall allows us to understand the ability of our model to correctly identify positive instances.

From the confusion matrix, we are capable of computing these features. Consider the following definition of the confusion matrix: each row represents the actual class of the objects we are looking at, while the columns of that row enumerate all possible classes that a data item may be assigned to. Therefore, if we are looking at the value ‘m’, we first go to the ‘mth’ row of the confusion matrix. If we then look at the ‘mth’ column of that row, our hope is that this will be the greatest value in the row (if our algorithm has decent recall). We can then look at the other columns in the row to see what the ‘mth’ class is most often mischaracterized as.

(Figure: confusion matrix, from Applied Deep Learning with Keras)
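As a purely hypothetical illustration (the numbers here are invented for the sake of the example), suppose a three-class problem produced the following confusion matrix:

\begin{pmatrix} 50 & 3 & 2 \\ 4 & 45 & 6 \\ 1 & 7 & 48 \end{pmatrix}

Reading the second row, the diagonal entry 45 is the largest value, so most instances of the second class are classified correctly; the 6 in the third column tells us that when the second class is mischaracterized, it is most often mistaken for the third class.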

According to Sokolova, the correctness of a classification can be evaluated by computing the number of correctly recognized class examples (true positives). In conjunction with the true positives, the number of correctly recognized examples that do not belong to the class (true negatives), and the examples that either were incorrectly assigned to the class (false positives) or that were not recognized as class examples (false negatives), can all be used for the quantification of precision and recall. For example, precision can be calculated from true positives and false positives, while recall can be computed from true positives and false negatives. This makes the confusion matrix particularly useful for assessing performance measures.
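Written out in terms of these counts, the two measures take their standard forms:

precision=\frac{TP}{TP+FP}

recall=\frac{TP}{TP+FN}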

Furthermore, our computation of precision and recall allows for the computation of a single summary metric, the F1 score. Precision and recall each confer their own significance in terms of their statistical impact, and from these different performance measurements we obtain different insights into the efficacy of our machine learning model. However, it is sometimes useful for us to combine these metrics into a single computation, known as the F score. The F1 score represents the harmonic mean between precision and recall, and we have a readily available mathematical relationship between precision and recall for this computation. It appears as follows:

F_1=2\times\frac{precision \times recall}{precision+recall}

From this, we can see that the F1 score is defined as twice the quotient of the product of precision and recall and the sum of precision and recall. However, we can easily compute the F1 score using the f1_score function from scikit-learn. This makes computation of this metric as easy as computing the precision and recall themselves.
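As a sketch, assuming the y_train_5 labels and y_train_pred predictions from the confusion matrix snippet above, all three metrics can be computed in a few lines:

```python
# Precision, recall, and the F1 score for the five-detector, reusing the
# y_train_5 and y_train_pred arrays from the confusion matrix sketch above.
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_train_5, y_train_pred))  # TP / (TP + FP)
print(recall_score(y_train_5, y_train_pred))     # TP / (TP + FN)
print(f1_score(y_train_5, y_train_pred))         # harmonic mean of the two
```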

The ROC Curve

Our last publication of the day was a discussion of the relevance of the ROC curve. This was a predictable follow-up to our analysis of precision and recall, as the ROC curve is built from similar quantities.

As our analysis demonstrated, the ROC curve refers to the plot of the receiver operating characteristic. It is of particular use in binary classification models, which separate data objects into one group or another. Recall that in our previous article, in our implementation of precision and recall, we were able to plot recall and precision versus the decision threshold. The receiver operating characteristic (ROC) curve has similar properties, but rather than plotting precision and recall, it plots the true positive rate versus the false positive rate. The true positive rate (TPR) represents the ratio of true positives identified to the total number of positive instances in the data set. Alternatively, the false positive rate (FPR) reflects the fraction of negative instances which are incorrectly classified as positive.

Take note of the fact that the true negative rate (TNR) denotes the specificity of the model, and the FPR is computed as 1 - specificity. Thus, because the TPR demonstrates the sensitivity (recall) of the model, the ROC curve reflects a plot of sensitivity versus 1 - specificity.
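One way to obtain such a plot, assuming the sgd_clf, X_train, and y_train_5 objects from the earlier sketches, is to collect cross-validated decision scores and let scikit-learn’s roc_curve sweep the threshold:

```python
# Plot the ROC curve for the five-detector: decision scores are gathered with
# cross_val_predict and swept over every threshold by roc_curve.
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve

# Decision scores (not class predictions) are needed to vary the threshold.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# fpr is 1 - specificity; tpr is the sensitivity (recall) at each threshold.
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

plt.plot(fpr, tpr, label="SGD classifier")
plt.plot([0, 1], [0, 1], "k--", label="random classifier")  # diagonal reference
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.legend()
plt.show()
```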

One utility of the ROC curve is its ability to be used for the computation of the likelihood ratio. The likelihood ratio for a positive test result denotes the ratio of the probability of a positive result among positive outcomes (true positives) to the probability of a positive result among negative outcomes (false positives). Thus the likelihood ratio represents the increase in the odds favoring the outcome given a positive test result.
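Expressed as a formula, the positive likelihood ratio is:

LR_{+}=\frac{TPR}{FPR}=\frac{sensitivity}{1-specificity}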

We have effectively demonstrated via the ROC curve that the higher the recall (TPR), the more false positives (FPR) the machine learning algorithm produces. Initially, this may seem counterintuitive. However, think about it: if you have an algorithm that is so eager to flag positives that it captures every single positive data item in the set, it is likely because it classifies nearly every data item as positive.

Consequently, one alternative means of using the ROC curve for measuring the performance of a machine learning model is the area under the curve. Integrating the ROC curve delivers the area under the curve (AUC), which corresponds to the c statistic of the model. In this manner, a perfect binary classifier will have an AUC equal to one.
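As a sketch, reusing the y_scores array from the ROC snippet above, scikit-learn’s roc_auc_score computes this directly:

```python
# Area under the ROC curve (the c-statistic for a binary outcome); a perfect
# classifier would return 1.0 and a random one roughly 0.5.
from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_train_5, y_scores))
```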

The concordance (c) statistic is the most commonly used statistic for measuring discrimination. For binary outcomes, the c-statistic is the area under the receiver operating characteristic curve (which is constructed by plotting the true positive rate against the false positive rate on the test data set). The c-statistic can be interpreted as the proportion of all pairs of patients, where one patient experienced the event of interest and the other did not, in which the patient with the lower risk score was the one who did not experience the event.

Generally, the c-statistic lies between 0.5 and 1. When the c-statistic is close to 0.5, the percentage of patient pairs (where one patient experienced the event and the other did not) in which the patient with the higher risk score experienced the event is around 50%, meaning the model discriminates no better than chance.

Plan For Day 2

Thank you for following along with our first day of the 100 Days of Code. Tomorrow, we plan to focus more on some academic literature in our area of publication to support greater depth in our analyses. We hope you enjoy, and we look forward to seeing you in our next articles.
