## Introduction to the ROC Curve

*Previous Efforts*

The *Topics of Machine Learning* series has steadily moved from the overarching principles of machine learning models to the individual components that make these models work. Our previous article investigated the implementation of the confusion matrix and the role of precision and recall; we now turn to the ROC curve as a measure of machine learning performance. Thus far in the series, we have elaborated extensively on the various categories of machine learning models. Our first article in this series provided an overview of all of these model categories, which you may find here. After this broad introduction, we examined the specifics and algorithms associated with each kind of model individually. This began with an analysis of supervised learning models, particularly their classification and regression algorithms.

Next, we discussed the features of unsupervised learning models. We then differentiated the methodologies of batch and online learning, which you may investigate here. Finally, we elaborated on the differences between instance-based and model-based learning, found here.

*Current Approach*

With the individual algorithms explored and discussed in depth, one of our more recent articles investigated the fundamentals of classification algorithms. Along that line of thought, we began by exploring the MNIST data set and its use in machine learning classification. Having developed a classifier in one of our preceding articles, we followed up with an analysis of one particular performance measurement: the accuracy metric, estimated using the cross-validation technique. This was then followed by a discussion of performance measurement through the confusion matrix.

One of the primary uses of the confusion matrix is the computation of precision and recall. These computations were discussed at great length in our previous article, which also covered the underlying theory in order to provide a better understanding of what these metrics actually represent. This article focuses on measuring the performance of our machine learning model using the ROC curve. Let’s begin.

## Conceptualizing the ROC Curve

*What is the ROC Curve?*

The ROC curve is the plot of the receiver operating characteristic. It is particularly useful for binary classification models, which separate data objects into one of two groups. Recall from our previous article that, in our implementation of precision and recall, we were able to plot both against the decision threshold. The ROC curve is similar, but rather than plotting precision and recall, it plots the true positive rate against the false positive rate as the threshold varies. The true positive rate (TPR) is the ratio of correctly identified positives to all actual positives in the data set, TP / (TP + FN). The false positive rate (FPR) is the ratio of negative instances incorrectly classified as positive to all actual negatives, FP / (FP + TN).

Note that the TPR is the sensitivity (or recall) of the model, while the true negative rate (TNR) is its specificity. The FPR is equal to 1 - specificity. The ROC curve is therefore a plot of sensitivity versus 1 - specificity.
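To make these definitions concrete, the following minimal sketch computes the rates from the four cells of a confusion matrix. The counts and the helper name `confusion_rates` are hypothetical and chosen only for illustration; they are not results from our MNIST classifier.

```python
def confusion_rates(tp, fn, fp, tn):
    """Return (TPR, FPR, specificity) from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # sensitivity, also called recall
    fpr = fp / (fp + tn)          # share of actual negatives flagged as positive
    specificity = tn / (tn + fp)  # true negative rate
    return tpr, fpr, specificity


# Hypothetical counts, used only to illustrate the arithmetic.
tpr, fpr, specificity = confusion_rates(tp=80, fn=20, fp=30, tn=870)
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}  1-specificity={1 - specificity:.3f}")
```

The last two printed values agree, which is the identity FPR = 1 - specificity.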

*The Likelihood Ratio*

One use of the ROC curve is the computation of the likelihood ratio. The positive likelihood ratio is the probability of a positive test result among actual positives (the true positive rate) divided by the probability of a positive test result among actual negatives (the false positive rate). We may therefore write the likelihood ratio as:

$$LR^{+} = \frac{TPR}{FPR} = \frac{\text{sensitivity}}{1 - \text{specificity}}$$

Thus the likelihood ratio represents the increase in odds favoring the outcome given a positive test result.
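As a small worked example of that odds interpretation, the sketch below computes the positive likelihood ratio as TPR / FPR and applies it to a prior probability through the standard odds form of Bayes' rule. The function names and the numeric inputs are hypothetical, chosen only to show the arithmetic.

```python
def positive_likelihood_ratio(tpr, fpr):
    """LR+ = TPR / FPR = sensitivity / (1 - specificity)."""
    if fpr == 0:
        raise ValueError("FPR is zero, so the ratio is unbounded")
    return tpr / fpr


def posterior_probability(prior, likelihood_ratio):
    """Update a prior probability after a positive result, working in odds."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Hypothetical operating point and prior, for illustration only.
lr_plus = positive_likelihood_ratio(tpr=0.8, fpr=0.1)
posterior = posterior_probability(prior=0.2, likelihood_ratio=lr_plus)
print(f"LR+ = {lr_plus:.2f}, posterior probability = {posterior:.3f}")
```

With these inputs, a positive result multiplies the odds by roughly eight, which moves a prior of 0.2 to a posterior of about two thirds.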

*Area Under the Curve (AUC)*

The ROC curve shows that the higher the recall (TPR), the more false positives (FPR) the machine learning algorithm produces. Initially, this may seem counterintuitive. However, consider a classifier tuned so aggressively that it captures every single positive item in the set: it most likely does so by classifying nearly every item as positive.

Consequently, one alternative means of using the ROC curve to measure the performance of a machine learning model is the area under the curve (AUC). Integrating the ROC curve over the FPR yields this area, which is the c-statistic of the model. A perfect binary classifier has an AUC equal to one, whereas a classifier that guesses at random has an AUC of about 0.5.
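The integration can be sketched in plain Python: sweep the decision threshold over the scores to collect (FPR, TPR) points, then sum the trapezoids beneath them. The labels, scores, and function names here are hypothetical; this is an illustration of the idea, not the routine a library such as scikit-learn uses internally.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_auc(points):
    """Integrate TPR over FPR with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels (1 = positive) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc_value = trapezoid_auc(roc_points(labels, scores))

# A classifier that ranks every positive above every negative.
perfect_auc = trapezoid_auc(roc_points([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))
print(f"AUC = {auc_value:.3f}, perfect AUC = {perfect_auc:.3f}")
```

In the first data set one negative (score 0.7) outranks one positive (score 0.6), so the area falls short of one; in the second no such inversion exists and the area is one.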

The concordance (c) statistic is the most commonly used measure of discrimination. For binary outcomes, the c-statistic equals the area under the receiver operating characteristic curve (which is constructed by plotting the true positive rate against the false positive rate of the test data set). In clinical terms, the c-statistic can be interpreted as the proportion of all pairs of patients, in which one patient experienced the event of interest and the other did not, where the patient with the lower risk score was the one who did not experience the event.

In practice, the c-statistic of a useful model lies between 0.5 and 1. When the c-statistic is close to 0.5, the patient with the higher risk score experienced the event in only about 50% of patient pairs (where one patient experienced the event and the other did not), which is no better than chance. A value below 0.5 indicates that the scores rank cases in the wrong direction.
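The pairwise reading of the c-statistic can be checked directly by counting concordant pairs. The sketch below is a minimal illustration with hypothetical labels and scores; counting tied pairs as half is a common convention that we assume here.

```python
def c_statistic(labels, scores):
    """Share of (positive, negative) pairs in which the positive scores higher."""
    positive_scores = [s for y, s in zip(labels, scores) if y == 1]
    negative_scores = [s for y, s in zip(labels, scores) if y == 0]
    concordant = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # assumed convention: a tie counts as half
    return concordant / (len(positive_scores) * len(negative_scores))


# Hypothetical labels (1 = event occurred) and risk scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
c_value = c_statistic(labels, scores)

# Reversing the ranking mirrors the statistic around 0.5.
c_reversed = c_statistic(labels, [-s for s in scores])
print(f"c = {c_value:.3f}, reversed ranking c = {c_reversed:.3f}")
```

Eight of the nine positive-negative pairs are ordered correctly, so the statistic is 8/9, and flipping the sign of every score yields 1/9.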

*Why Measure Performance?*

Machine learning models rapidly assign values to the data items within a set. How well these models perform that task remains unclear unless we have some means of quantifying machine performance. Performance measures let us quantify aspects of our model such as its accuracy, precision, recall, and efficiency.

In *“A systematic analysis of performance measures for classification tasks”*, Marina Sokolova contends that “Empirical evaluation remains the most used approach for the algorithm assessment, although ML algorithms can be evaluated through empirical assessment or theory.” Thus, measurement of performance is the best tool we have at our disposal for understanding our model.

The validity of performance measures is confirmed in a variety of contexts, but bioinformatician Yasen Jiao presents it elegantly in the publication “Performance measures in evaluating machine learning based bioinformatics predictors for classifications.” Jiao frames the importance of performance measurement as a consequence of the potential for an algorithm to over-fit or under-fit data. The author contends, “Because machine learning algorithms usually use existing data to establish predictive models, it is possible that a predictive model is over-fitted or over-optimized on the existing data.” Here, Jiao points out that because a model is built from existing data, it can absorb peculiarities of that data, creating inconsistencies which we must account for. Jiao clarifies the meaning of over-fitted or over-optimized by stating that the algorithm performs well on existing data but does not have sufficient flexibility to operate well on incoming data.

Performance measures matter because they bear so heavily on the validity of our results and, thus, our insights. Jiao supports this argument in stating that “the prediction performance drops drastically when it is applied in practical studies with novel data.” To correct for the bias we may introduce into our model when over-optimization has not been accounted for, Jiao finally argues that “when applying these computational predictors, it is vitally important to understand their mechanisms and the conditions of their performances in the first place.” We share this conviction as well.

## Encoding the ROC Curve

*The Machine Learning Model*

Our previous article trained a basic stochastic gradient descent (SGD) classifier on the MNIST data set. The steps we followed for this purpose are summarized below.

Our first action in approaching the MNIST data set was to split the data into training and testing sets, and to separate the images from their labels. We acquire these two sets of information by separately calling the ‘data’ and ‘target’ arrays. These arrays differ significantly. The ‘data’ array has a shape of 70,000×784 because each of the 70,000 images consists of 784 pixels. The ‘target’ array, by contrast, is one-dimensional and holds only 70,000 entries, as these represent the labels. Keep in mind that both of these arrays must also be further separated into training and testing data.

After splitting the data into the different groups, we trained a machine learning model that operates as a binary classifier, differentiating images which are fives from those which are not. To train this stochastic gradient descent classifier, we first established the proper data sets: we derived the “is five” labels from the label arrays of both the training and the testing data.

Once we do this, all we must do is instantiate the SGD classifier from scikit-learn and fit it on the training data and the training labels for the five. We can then simply use the ‘predict’ function and pass in the image of some particular digit. If that digit is a five, the function will return true, to some degree of accuracy. If that digit is not a five, the function will return false, again to some degree of accuracy.
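The steps above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the listing from our previous article: to stay small and runnable offline, it substitutes scikit-learn's bundled 8×8 digits set (1,797 images of 64 pixels) for the 70,000-image MNIST set, and it uses `train_test_split` for the split. The variable names are likewise our own choices.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Assumption: the small bundled digits set stands in for MNIST here.
digits = load_digits()
X, y = digits.data, digits.target  # X: (1797, 64) pixels, y: (1797,) labels

# Hold out a test set; stratify so each digit keeps its share in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Binary targets: True where the digit is a five, False otherwise.
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

# Fit the stochastic gradient descent classifier on the "is five" labels.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

# predict expects a 2-D array, so wrap the single image in a list.
some_digit = X_test[0]
prediction = sgd_clf.predict([some_digit])
print("Predicted five?", bool(prediction[0]), "| actual label:", y_test[0])
```

The same pattern should carry over to MNIST itself by loading its ‘data’ and ‘target’ arrays in place of the stand-in set.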

*Coding the ROC Curve*

As described above, the ROC curve and the area under that curve can be used to assess the performance of our machine learning model. Furthermore, these metrics provide quantitative and graphical comparisons between different machine learning models.

We can create both the ROC curve and the area under the curve using functions from scikit-learn. In particular, we can use the `roc_curve` function to obtain the FPR and TPR values that make up the curve, and the `roc_auc_score` function to compute the area under the curve. Both functions take the true labels and the classifier's decision scores.
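A minimal sketch of how these two functions might be used is shown here. As assumptions of our own, it again substitutes scikit-learn's bundled 8×8 digits set for MNIST so that it runs offline, and it obtains out-of-fold decision scores with `cross_val_predict` using three folds.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_predict

# Assumption: the small bundled digits set stands in for MNIST here.
digits = load_digits()
X = digits.data
y_5 = (digits.target == 5)  # True for fives, False for every other digit

sgd_clf = SGDClassifier(random_state=42)

# Decision scores (not hard predictions), each produced by a model
# that never saw the corresponding image during training.
y_scores = cross_val_predict(sgd_clf, X, y_5, cv=3, method="decision_function")

# FPR and TPR at each candidate threshold, plus the area under the curve.
fpr, tpr, thresholds = roc_curve(y_5, y_scores)
auc = roc_auc_score(y_5, y_scores)
print(f"AUC = {auc:.3f}")
```

Passing `fpr` and `tpr` to a plotting call such as matplotlib's `plt.plot(fpr, tpr)` draws the curve. Because this sketch runs on a different, smaller data set, the area it prints will generally differ from the value reported for the MNIST classifier.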

Plotting the TPR against the FPR returned by `roc_curve` yields the ROC curve for our classifier.

Furthermore, we find that the area under the curve is approximately 0.96.

## The Take Away

Our recent articles discussed at length how to quantify classifier performance through precision and recall. In particular, we used the confusion matrix as the basis for computing both. Precision and recall are rooted in statistics, yet they apply directly to evaluating a machine learning model. scikit-learn provides several helpful functions for computing the confusion matrix of our model, making this task quite simple.

In this article, we reused the code from our previous article, which built a binary classifier on the MNIST set and established its precision and recall from the values in the confusion matrix. Building on that classifier, we explored the ROC curve and the area under it as a means of validating the efficacy of our machine learning model. Our next article moves away from performance quantification and focuses on more complex methods of classification, particularly multiclass classification. We look forward to seeing you there.