Multi-Class Classification and Literature: 100 Days of Code (2/100)

Day 2 Of 100 Days of Code

We are now underway on this journey of better understanding the coding ecosystem we inhabit. If you’ve followed The Art of Better Programming thus far, you know that our primary focus to date has been the implementation of machine learning models in our coding projects. Today we published another article on this subject, focusing on multi-class classification algorithms in machine learning. While we did not dissect the algorithms themselves, we did discuss the code required to execute this functionality, as well as the various techniques available for multi-class classification.

In addition to our publication endeavors, we read several papers to make sure we brought some of the most important features of multi-class classification to your attention. Today’s discussion in Day 2 of the 100 Days of Code series summarizes our sole publication of the day and breaks down several of the academic pieces we investigated. Enjoy.

Before We Get Started

I want to let you in on a secret before we proceed with modeling machine-learning-derived data. One of the trickiest things about working in machine learning is that you sometimes need to create something like your own programming language. We plan to publish a series on this in the future; in the meantime, I learned from the folks at ProLang, who offer a book that goes into the details of creating language features that best support machine learning. Follow the link provided here if you would like a copy.

Multi-Class Classification Brought By The Art of Better Programming

Our previous investigation of classification algorithms focused on binary classifiers. These, however, are rather simple models, insufficient for complex modeling. Multi-class classification relies on algorithms capable of differentiating more than two classes of data. Several algorithms can handle these higher numbers of classes natively, including the Random Forest classifier and the naive Bayes classifier. There are also ensemble methods, which provide powerful means of performing multi-class classification on complex data.

Interestingly, a series of binary classifiers can also operate together to perform multi-class classification. For example, our previous stochastic gradient descent classifier was a binary classifier that differentiated fives from non-fives. We can use a series of such binary classifiers to cover the entire data set: one that differentiates ones from non-ones, another twos from non-twos, and so on. In the case of our MNIST data set, we would therefore have a total of ten binary classifiers. This strategy is known as the one-versus-all (OvA) strategy.
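The OvA strategy described above can be sketched with scikit-learn’s `OneVsRestClassifier`. This is a minimal sketch, not the full article’s code; it uses scikit-learn’s small built-in digits data set as a stand-in for MNIST so the example is self-contained:

```python
# Minimal OvA sketch: one binary SGD classifier per digit class.
# The digits data set here is an assumed stand-in for MNIST.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# OneVsRestClassifier trains one binary classifier per class (ten here),
# each distinguishing its digit from all the others.
ova = OneVsRestClassifier(SGDClassifier(random_state=42))
ova.fit(X_train, y_train)

print(len(ova.estimators_))       # ten binary classifiers, one per digit
print(ova.score(X_test, y_test))  # overall multi-class accuracy
```

Under the hood, prediction simply asks all ten binary classifiers to score an observation and picks the class whose classifier is most confident.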

Creating a series of binary classifiers for multi-class classification can be a particularly tedious task, especially if you train a binary classifier for every category: the ones, twos, threes, and so on. Alternatively, we can keep our binary classifiers but use them to compare pairs of digits. For example, we can have one binary classifier that distinguishes zeros from ones, another zeros from twos, another ones from twos, and so on.

This particular strategy is known as the one-versus-one (OvO) strategy. While it relies on binary classifiers, it accomplishes multi-class classification. For an OvO approach, the number of classifiers required is N × (N − 1)/2, where N is the number of classes. For our previous classifier of the MNIST data set, with ten classes, this means we would need forty-five binary classifiers.
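The OvO count is easy to verify in code with scikit-learn’s `OneVsOneClassifier`. Again a hedged sketch with the digits data set standing in for MNIST:

```python
# Minimal OvO sketch: one binary classifier per pair of digit classes,
# N * (N - 1) / 2 classifiers in total. Digits stands in for MNIST.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ovo = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo.fit(X_train, y_train)

n = len(set(y))
# both values are 45 for N = 10 classes
print(len(ovo.estimators_), n * (n - 1) // 2)
```

Each pairwise classifier only ever sees training data from its two classes, which is precisely the source of the “non-competence” issue Zhou addresses below.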

If you’d like to observe the code for creating these multi-class classification machine learning algorithms, check out our full article.

Academic Literature of the Day


It may seem dense and uninteresting, but grappling with the progress of the field and the jargon used by leading data scientists and programmers will give you a much more robust understanding of software engineering. Furthermore, it is an incredible way of comprehending complex systems and topics in significant depth. Let’s explore some of the articles we read today. Feel free to open up the attached links so you can follow along.

Discussion of OVO Ensemble Methodologies

Title: “One versus one multi-class classification fusion using optimizing decision directed acyclic graph for predicting listing status of companies”

Author: Ligang Zhou

Journal: Elsevier, 2016


Discussion: Zhou’s article addresses the use of OVO as an ensemble strategy for multi-class classification. From the outset, he writes that, “Most existing research has demonstrated the success of different decomposition and ensemble strategies for solving multi-class classification problems.” The particular focus of this solution is the implementation of the OVO strategy. More specifically, Zhou discusses the optimizing decision directed acyclic graph (ODDAG), which maximizes fitness on the training set rather than following pre-specified procedures.

Zhou notes a particular drawback of OVO, which stems from the non-competence of its binary classifiers. He elaborates on this point by stating that, “A binary classifier can only distinguish the observations from the two classes used in training set, and therefore the binary classifier has no capability to discriminate the observations from other classes.” If an observation belonging to neither of a classifier’s two training classes reaches that classifier, its output is meaningless for that observation. For this reason, Zhou set out to establish a methodology that limits the influence of non-competent classifiers.

The premise of the DDAG method, as described by Zhou, is that it “starts at the top node of the hierarchy and successively discard certain classes by nodes that have been accessed until reaching the bottom of the structure, where the final leaf node returns the class of the estimated observation.” Consequently, each individual node of the hierarchy is its own binary classifier. One drawback of the DDAG method is that it has no self-validation mechanism and is, as a result, prone to error. The premise of Zhou’s effort, therefore, was to optimize the DDAG method and yield ODDAG.

Before getting into the ODDAG methodology, Zhou first describes decomposition by OvO. For a data set with K different classes, the number of binary classifiers needed is given by the formula K × (K − 1)/2. Suppose a binary classifier differentiates between two classes, i and j. If the output of the binary classifier is represented by the variable ‘p’, then we may model the output with a function:

p_{ij} = f_{ij}(y = i|X), i < j, p_{ij} ∈ [0, 1]

In this function, f_{ij} is a mapping of the form:

f_{ij}: X → {i, j}

The value p_{ij} may be understood as the confidence of our binary classifier that an observation with feature vector X belongs to class i.

From this, we acquire a matrix representing the predictions of every binary classifier in our OVO ensemble.
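The pairwise prediction matrix can be illustrated with a short sketch. This is not Zhou’s code; the iris data set and logistic regression are stand-ins chosen only to keep the example runnable, and entry (i, j) holds p_{ij}, the confidence of classifier f_{ij} that the observation belongs to class i rather than class j:

```python
# Illustrative sketch of the OVO pairwise prediction matrix for one
# observation x. Iris and LogisticRegression are stand-in assumptions.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
classes = np.unique(y)
K = len(classes)
x = X[0:1]  # a single observation to score

P = np.zeros((K, K))
for a in range(K):
    for b in range(a + 1, K):
        # train classifier f_ab only on observations from classes a and b
        mask = np.isin(y, [classes[a], classes[b]])
        clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
        # probability that x belongs to class a, within the (a, b) pair
        p_ab = clf.predict_proba(x)[0, list(clf.classes_).index(classes[a])]
        P[a, b] = p_ab
        P[b, a] = 1.0 - p_ab  # complementary confidence for class b

print(np.round(P, 3))
```

Note that P[j, i] = 1 − P[i, j], so only the K(K − 1)/2 upper-triangular entries need to be computed.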

The DDAG brings about a slightly nuanced approach to its classification method. For a data set with four classes, DDAG will appear as follows:

The root node here is a binary classifier comparing class one and class four. If the observation is judged not to be a one, class one is discarded and the observation moves to the two-versus-four classifier; if it is not a two, it moves to the three-versus-four classifier; and if it is then not a three, it gets classified as a four. It is interesting to note that this method provides a bit of a fail-safe. Suppose an observation passes into the two-versus-four classifier believed to be a four. There it gets checked again, in case it was actually not a four, so a mistaken decision higher up can still be corrected lower in the graph. This makes the method quite accurate.

Zhou proposes an algorithm conceptualizing the DDAG method:

Input: K(K − 1)/2 binary classifiers inferred from the training set; an observation x.
Output: the structure of the DDAG and the predicted class of x.

  1. Generate the initial list: {1, 2, …, K}.
  2. Select the first element and last element in the list to form the
    class pair (i, j). If x is classified into class i by the binary
    classifier for pair (i, j), remove j from the list. Otherwise, remove
    i from the list.
  3. If the number of elements in the list at Step 2 is more than one,
    go to Step 2. Otherwise, classify x into the class associated with
    the only element in the list and stop.

Each decision node with pairwise classes in the DDAG can be any binary classification model trained on the training data set. For a problem with K classes, K − 1 decision nodes will be evaluated in order to derive an answer.

Zhou offers a simple idea for optimizing the DDAG method. He contends that if two classes are easily differentiated with minimal error, their classifier should be placed at the top node. His new ODDAG is defined by the following algorithm.

Input: training set S; feature set f = {f_{12}, f_{13}, …, f_{K(K−1)}}
Output: the optimized structure of DDAG

  1. Examine all possible binary splits on every feature with regard
    to all observations in training set S.
  2. Select the split of the feature with the best optimization criterion.
  3. If the split leads to a child node having too few observations,
    select a split with the best optimization criterion subject to the
    minimum leaf size constraint.
  4. Impose the split.
  5. Repeat recursively for the two child nodes until a stopping
    criterion is triggered.

The optimization criterion used to split nodes is the Twoing rule.
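For reference, the Twoing rule is the split criterion from CART. The formulation below is our sketch of that standard criterion, not code from Zhou’s paper: a candidate split of a node into left and right children is scored as (P_L · P_R / 4) · (Σ_j |p(j|left) − p(j|right)|)², and the split maximizing this score is chosen.

```python
# Sketch of the Twoing rule: scores how well a candidate split separates
# the class distributions of the two child nodes. Higher is better.
import numpy as np

def twoing_score(y_left, y_right):
    y_left, y_right = np.asarray(y_left), np.asarray(y_right)
    n_l, n_r = len(y_left), len(y_right)
    p_l, p_r = n_l / (n_l + n_r), n_r / (n_l + n_r)
    classes = np.union1d(y_left, y_right)
    # total absolute difference in per-class proportions between children
    diff = sum(abs(np.mean(y_left == c) - np.mean(y_right == c))
               for c in classes)
    return (p_l * p_r / 4.0) * diff ** 2

# A perfect two-class split scores 0.25; an uninformative split scores 0.
print(twoing_score([0, 0], [1, 1]))  # 0.25
print(twoing_score([0, 1], [0, 1]))  # 0.0
```

Intuitively, the rule rewards splits whose children have very different class mixtures, which is exactly what pushes easily separated classes toward the top of the ODDAG.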

The ODDAG as conceptualized by Zhou appears as follows:

Tomorrow’s Goals

The analysis of Zhou’s ODDAG method ended up being a bit longer than anticipated, so we will leave it for you to soak up on your own time. It is considerably dense, and we recommend spending some time considering its implications. One of our next goals is to code this method into a script. With that said, Day 3 is all about regression, particularly linear and polynomial regression. We will pick apart another piece of academic literature tomorrow as well. We look forward to seeing you there.
