## Introduction to Cross-Validation

With the *Topics of Machine Learning* series in mind, our present discussion addresses a particular performance-measurement technique: the cross-validation methodology. Within this series, we have elaborated extensively on the various categories of machine learning models. Our first article provided an overview of all of these model categories, an investigation which you may find here. After providing this broad introduction, we began diving individually into the specifics and algorithms associated with each of these models. This began with an analysis of supervised learning models, particularly their classifier and regression algorithms.

Next, we followed this with a discussion of the features of unsupervised learning models. We then differentiated the methodologies of batch and online learning, which you may investigate here, and followed up with an elaboration of the differences between instance-based and model-based learning, found here. With the individual algorithms explored and discussed in depth, our most recent article investigated the fundamentals of classification algorithms, beginning with the MNIST data set and its use in machine learning classification. Having developed a classifier in our preceding article, we now press forward to address one particular performance measurement of machine learning algorithms. More precisely, here, we assess the accuracy of machine learning algorithms using the cross-validation methodology. Let's begin.

## Conceptualizing Cross-Validation

*Performance Measures*

Machine learning models rapidly delineate values of data items within a set. How efficiently these models execute this functionality remains enigmatic unless we have some means of quantifying machine performance. Performance measures allow us to quantify aspects of our model, including its accuracy, precision, and facets of its efficiency.

The validity of performance measures is confirmed in a variety of contexts, but bioinformatician Yasen Jiao presents it elegantly in his publication “Performance measures in evaluating machine learning based bioinformatics predictors for classifications.” Jiao presents the importance of machine learning performance measurement as a consequence of the potential for an algorithm to over-fit or under-fit data. The author contends, “Because machine learning algorithms usually use existing data to establish predictive models, it is possible that a predictive model is over-fitted or over-optimized on the existing data.” Here, Jiao points out that the manner in which the model is built from existing data can create inconsistencies which we must account for. Jiao clarifies his definition of over-fit or over-optimized by stating that such an algorithm performs well on existing data but lacks sufficient flexibility to operate well on incoming data.

The importance of performance measures follows from the large impact they have on the validity of our data and, thus, our insights. Jiao supports this argument in stating that “the prediction performance drops drastically when it is applied in practical studies with novel data.” To correct for the bias that we may introduce into our model if the optimization has not been accounted for, Jiao finally argues that “when applying these computational predictors, it is vitally important to understand their mechanisms and the conditions of their performances in the first place.” We share this conviction as well.

*What About Accuracy?*

The traditional definition of the word ‘accuracy’ is the quality of being correct or precise. In machine learning, however, we need a more functional definition that we can apply to the efficiency of our algorithms. Accuracy can be understood as the number of correctly predicted data points out of all the data points.

A more mathematical understanding of accuracy comes from a computational calculation: accuracy is defined as the number of true positives and true negatives divided by the total number of data points. True positives and true negatives represent correct classifications; false positives and false negatives are data points that the algorithm incorrectly classifies.
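As a toy illustration of this calculation (the counts below are purely hypothetical):

```python
# Hypothetical classification counts: true/false positives and negatives.
tp, tn, fp, fn = 40, 45, 5, 10

# Accuracy = correct classifications divided by the total number of data points.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```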

*What is Cross-Validation?*

The technique of cross-validation stems from theory rooted in statistical analysis. According to the SciKitLearn manual, “Learning the parameters of a prediction function and testing it on the same data is a methodological mistake.” SciKitLearn validates this claim in stating that “a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data.” This inability of the model to predict new data items, even though values of previously observed data are easily predicted, is a consequence of over-fitting. This is exactly what Jiao addressed in his article.

Cross-validation allows us to circumvent this potential pitfall by determining the ability of our machine learning model to adaptively respond to exogenous data. Essentially, the cross-validation technique allows us to predict how accurately our model will perform in practice.

*k-fold Cross-Validation*

In this article, Jeff Schneider emphasizes two specific methodologies of cross-validation: k-fold cross-validation and leave-one-out cross-validation. In k-fold cross-validation, the data set is divided into *k* subsets, and the holdout method is repeated *k* times. Each time, one of the *k* subsets is used as the test set and the other *k-1* subsets are put together to form a training set. Then the average error across all *k* trials is computed. The particular utility of the k-fold cross-validation technique is that the manner in which the data is separated does not affect the accuracy of prediction, since each data object is accounted for at least once.
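This splitting scheme can be sketched with scikit-learn's `KFold` iterator on a toy data set of ten objects (the data here is purely illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)    # toy data set of 10 objects
kf = KFold(n_splits=5)  # k = 5 folds of 2 objects each

# Each iteration holds out a different fold as the test set,
# with the other k-1 folds put together as the training set.
for train_index, test_index in kf.split(data):
    print("train:", train_index, "test:", test_index)
```

Note that across the five iterations, every one of the ten objects appears in a test fold exactly once.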

*Leave-One-Out Cross-Validation*

On the other hand, leave-one-out cross-validation takes k-fold cross-validation to an extreme by making ‘k’ equal to the number of data objects in the data set. That means that N separate times, the function approximator is trained on all the data except for one point, and a prediction is made for that point. As before, the average error is computed and used to evaluate the model.
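Sketched with scikit-learn's `LeaveOneOut` iterator on a toy data set:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

data = np.arange(4)  # toy data set of N = 4 objects
loo = LeaveOneOut()  # equivalent to k-fold with k = N

# The model would be trained N times, each time leaving exactly one point out.
for train_index, test_index in loo.split(data):
    print("train:", train_index, "test:", test_index)
```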

## Encoding Cross-Validation Measures

*A Review of Training the Model*

Our previous article created a basic stochastic gradient descent machine learning model of the MNIST data set. The code we used for this purpose is demonstrated as follows:
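A minimal sketch of that setup, assuming scikit-learn's `fetch_openml` loader for MNIST:

```python
from sklearn.datasets import fetch_openml

# Fetch the MNIST data set: 70,000 images of handwritten digits.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
X, y = mnist['data'], mnist['target']  # X: 70,000x784 pixels, y: 70,000 labels

# The conventional MNIST split: first 60,000 for training, last 10,000 for testing.
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]
```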

Our first step in approaching the MNIST data set was splitting the data into training and testing sets, and subdividing these groups on the basis of whether they contained the images or the labels. We separate and acquire these sets of information by separately calling the ‘data’ and ‘target’ arrays. These data sets exhibit significant differences. The ‘data’ array has a shape of 70,000×784 because each image consists of 784 pixels. The ‘target’ array, alternatively, is linear and has only 70,000 entries, as these represent the labels. Keep in mind that for both of these array objects, we must also further separate them into training and testing data.

After splitting the data into the different groups, we trained a machine learning model that operates as a binary classifier, differentiating images which are fives from those which are not. To train this stochastic gradient descent algorithm as our binary classifier, we first established the proper data sets: we need to acquire the five labels from our label data set for both the training and the testing data.

Once we do this, all we must do is call the SGD classifier from SciKitLearn and pass in the training data and the training labels for fives. We can then simply use the ‘predict’ function and pass in some particular digit image. If that digit is a five, to some degree of accuracy, the function will return true; if it is not, the function will return false.

*Coding the Cross-Validation Mechanism*

In order to evaluate the accuracy of a machine learning model, we have available to us the cross-validation feature in SciKitLearn. The ‘cross_val_score’ function executes k-fold cross-validation, which we previously delineated. This function randomly splits the data set into ‘k’ distinct subsets which we call folds. It then trains the model k times, each time picking a different fold for evaluation with the other k-1 serving as the training folds, and yields an array of length ‘k’. Let’s take a look at how we encode this in Python:

Here, we specify our model, input data, the manner of scoring, and the number of folds. This returns the cross-validation score for each fold of our k-fold cross-validation.

Rather than using the ‘cross_val_score’ function, we can assert greater control over cross-validation by encoding our own mechanism. Let’s take a look at how we do this:

*SKLearn Functions*

Here, we use the StratifiedKFold function from SciKitLearn. According to the SciKitLearn manual, “this cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.”

Subsequently, we apply a for-loop that iterates over the index positions of the items in the training and testing folds. We then use the ‘clone’ function on the model. Clone performs a deep copy of the estimator without actually copying attached data, yielding a new estimator with the same parameters that has not been fit on any data.

Using these functions, we effectively split the data set and create copies of the model. The StratifiedKFold function executes stratified sampling, so each fold contains a representative ratio of each class. The clone function, alternatively, creates a copy of the classifier itself.

*Output*

When we execute this code, we obtain our cross-validation scores, one per fold.

Each cross-validation score proxies the accuracy of our model. By computing the number of correct predictions for the fold divided by the number of data objects in the fold, we obtain the accuracy. Each instance of our machine learning model attains an accuracy greater than 95%.

## Drawbacks of Accuracy

While an accuracy of 95% may seem high and may give the impression that our model is performing well, let’s reconsider what our model does. The model is a binary classifier, separating images of fives from all other images. Fives represent only about 10% of the data set, so a model that always predicted that the image was not a five would be correct roughly 90% of the time. Our classifier now appears far less impressive, considering it improves on that naive baseline by only about five percentage points. Thus, accuracy is not always the preferred performance metric, especially when working with skewed data sets such as this one.

## The Take Away

The present article discussed quantifying classifier performance with the metric of accuracy. In particular, we implemented the technique of cross-validation as a proxy for accuracy. Cross-validation has its roots in statistics, but it may be effectively applied to compute the accuracy of a machine learning model. SciKitLearn provides several helpful avenues for computing the cross-validation scores of our model, making this task quite simple.

We employed our code from our previous article, which involved the creation of a binary classifier via the MNIST set. Here, we established the accuracy of the model using the cross-validation technique. However, as we noted, this metric is not entirely helpful: accuracy can give us a misrepresentation of how effective our model actually is in comparison to simple guessing. Therefore, our next article in the Topics of Machine Learning series explores validation of machine learning performance using the confusion matrix as an analysis of precision. We look forward to seeing you there.