## Introduction to Supervised Machine Learning

We now turn to the individual methodologies of machine learning in more depth. The first article of this series presented the general categories of machine learning and their implications, along with the mechanisms and algorithms associated with them: supervised, unsupervised, batch, online, instance-based, and model-based learning. This article focuses on supervised machine learning in particular. In addition to discussing the two tasks it is capable of executing, classification and regression, we delve into the algorithms associated with each. Let’s get to it.

## Before We Get Started

I want to let you in on a secret before we proceed to modeling with machine learning. One of the trickiest parts of working in machine learning is that you sometimes need to create your own programming language. While we plan to publish a series on this in the future, I learned from the folks at ProLang. They offer a book that goes into the details of creating language features that best support machine learning. Follow the link provided here. This will take you to ProLang, where you can sign up and get their book. Getting this book was the best decision I ever made, and it drastically improved my coding experience. All you need to do is click the link and order a copy of the book. They even let you get it for free. Fill out an order and begin creating your own sophisticated machine learning models today.

## Supervised and Unsupervised Learning

As we previously discussed in our first article, the premise of machine learning methods relies on three fundamental questions. These include:

- Is the system trained with human supervision or not?
- Does the system learn in batches or on the fly?
- Does the system compare new data to training data, or identify patterns to build a model?

Supervised and unsupervised learning provide different answers to the first question. The two methods differ in whether the training data is labeled. With supervised learning, each training example is associated with a label that gives the system the correct response, so the model can learn how to derive the correct output from the input. Unsupervised learning, by contrast, does not use labels. Instead, the model learns by identifying patterns in the training data on its own. Let’s get into the nuts and bolts.

## Supervised Learning

*Conceptualizing Supervised Learning*

With supervised machine learning, the training data is delivered to the machine together with labels. Thus, the machine is given not only the input but also the solution. The model uses the parameter values of the training data and associates them with the correct solution. When the system receives new data, the model takes the parameter values of that data and uses what it learned from the labeled examples to assign the new data a particular value.

The primary applications of supervised machine learning are classification and regression. With classification, the task is to differentiate data points based on the values of their parameters: according to those values, each incoming data point is assigned to a particular class. The algorithms used for supervised classification include:

- Linear Classification
- Support Vector Machines (SVMs)
- Decision Trees
- K-Nearest Neighbor
- Random Forest

The other supervised learning task is regression. Regression uses the numerical values of the parameters in the training data to predict a numerical value for incoming data. Between these two tasks, most supervised learning problems can be addressed. The algorithms associated with supervised regression include:

- Linear Regression
- Polynomial Regression
- Quadratic Regression
- Logistic Regression
- Symlog Regression

Each of these tasks can be accomplished with several different algorithms, which we describe in more depth here.

*Classification Algorithms*

**1. K-Nearest Neighbor**

As a supervised classification algorithm, K-Nearest Neighbor is one of the most widely used. Like other classification methods, it separates data into groups based on their parameter values. It differs from the other classification algorithms in that it relies on computing the distance from the input point to the nearest points in the training data.

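This might seem rather mundane, but it may be executed in a variety of ways. Our standard understanding of distance is the Euclidean, or straight-line, distance. For two points $p$ and $q$ with $n$ coordinates each, it takes the form:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$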

However, several other measures may be used to compute the distance. One of them is the Manhattan distance, which does not measure the straight line between the points but instead sums the distances travelled along each axis, such as the x-axis and the y-axis. For two points $p$ and $q$ with $n$ coordinates each, the Manhattan distance function takes the form:

$$d(p, q) = \sum_{i=1}^{n} \lvert p_i - q_i \rvert$$

The Euclidean and Manhattan distances work well for continuous numerical variables. However, sometimes we work exclusively with categorical variables. If we encode the categorical variables as values such as one and zero, then we can use the Hamming distance, which counts the positions at which two entries differ. For two entries $p$ and $q$ with $n$ positions each, the Hamming distance function takes the form:

$$d(p, q) = \sum_{i=1}^{n} [\,p_i \neq q_i\,]$$

where each bracket equals one when the two values differ and zero when they match.

Depending on the circumstances, we may opt for one of these distance functions over the others. Whichever we use, the fundamental purpose is the same: identify the training items nearest to the input (the k nearest, hence the name) and assign the input to the class that is most common among them.
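The distance functions and the nearest-neighbor vote described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names, the default `k=3`, and the toy `train` data are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def euclidean(p, q):
    """Straight-line distance between two numeric points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """Sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(p, q):
    """Number of positions at which two sequences differ."""
    return sum(a != b for a, b in zip(p, q))


def knn_predict(train, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query.

    train is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two features per point, two classes.
train = [
    ((1.0, 1.0), "a"),
    ((1.5, 2.0), "a"),
    ((2.0, 1.5), "a"),
    ((6.0, 6.0), "b"),
    ((6.5, 7.0), "b"),
    ((7.0, 6.5), "b"),
]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.4, 6.2), distance=manhattan))
```

Passing the distance function as an argument keeps the voting logic identical whichever measure suits the data.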

**2. Decision Trees**

The Decision Tree algorithm supports classification of data in a unique way, and it readily applies to continuous, discrete, and categorical variables. According to an article on DZone, this algorithm divides the data into groups based on the values of the data’s primary parameters.

With decision trees, the root node represents the entire sample population. The Decision Tree algorithm, whichever version is used, splits the root node into a series of sub-nodes. These sub-nodes may be left alone, or may be further subdivided, in which case they are called decision nodes. Nodes that do not split further are known as leaves.

Decision tree models are relatively easy to program because a trained tree reduces to nested conditional statements such as ‘if’, ‘elif’, and ‘else’. This also makes it clear why they apply so readily to both categorical and numerical variables.
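To make the root, decision node, and leaf structure concrete, here is a small hand-written tree expressed as conditionals. The features, the threshold, and the class labels are hypothetical and chosen only for illustration. A real decision tree algorithm would learn its splits from the training data rather than have them written by hand.

```python
def classify_fruit(weight_grams, color):
    """Walk a small hand-written decision tree and return a class label.

    The features, threshold, and labels are hypothetical and serve only
    to show the root / decision node / leaf structure.
    """
    # Root node: split the whole sample on a categorical feature.
    if color == "yellow":
        # Decision node: split again, this time on a numerical feature.
        if weight_grams < 150:
            return "lemon"       # leaf
        else:
            return "grapefruit"  # leaf
    elif color == "red":
        return "apple"           # leaf
    else:
        return "unknown"         # leaf for every other color


print(classify_fruit(100, "yellow"))
print(classify_fruit(180, "red"))
```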

**3. Linear Classifiers: Naive-Bayes**

Linear classifiers are based on linear relationships between the parameters of the incoming data. One such linear classifier is the Naive-Bayes classifier. It assigns data objects to groups under the assumption that the predictors are independent of one another within each group. This assumption is combined with Bayes’ theorem, which is written as:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Bayes’ theorem gives the probability that some event A will occur given that event B is true. The Naive-Bayes classification model is particularly useful for working with large amounts of data. Furthermore, such models are easy to implement, since the category to which a data entry is assigned is simply the one with the highest probability as computed from its parameter values.
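As a rough sketch of how this works in practice, the following pure-Python example counts how often each feature value appears in each class, then scores a new entry by multiplying the class prior by the per-feature likelihoods. The function names and the weather-style toy data are assumptions made for this example. The add-one smoothing, which assumes each feature has exactly two possible values, is likewise a choice of this sketch and not something the text above prescribes.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies per class."""
    class_counts = Counter(labels)
    # value_counts[label][feature_index][value] -> number of occurrences
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
    return class_counts, value_counts


def predict_naive_bayes(model, row):
    """Return the class with the highest unnormalized posterior score."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Likelihood P(value | class) with add-one smoothing, assuming
            # two possible values per feature (true for this toy data only).
            score *= (value_counts[label][i][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (outlook, windy) -> decision.
rows = [
    ("sunny", "no"), ("sunny", "no"), ("rainy", "yes"),
    ("rainy", "yes"), ("sunny", "yes"), ("rainy", "no"),
]
labels = ["play", "play", "stay", "stay", "play", "stay"]

model = train_naive_bayes(rows, labels)
print(predict_naive_bayes(model, ("sunny", "no")))
print(predict_naive_bayes(model, ("rainy", "yes")))
```

The denominator of Bayes’ theorem is the same for every class, so the sketch compares unnormalized scores instead of computing it.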

**4. Random Forests**

The random forest builds on the decision tree algorithm and operates as an ensemble algorithm. It constructs many different decision trees, each from a random sample of the training data, and combines their outputs rather than relying on any single tree. For classification, each tree votes for a class, and the class with the most votes becomes the prediction for the test object.

Based on this description, it might seem that the algorithm delivers a very precise method for finding a solution. Precision can be detrimental to the flexibility of a machine learning model, because a tree that weighs the parameters of the training data too precisely may overfit. Combining many randomized trees generally reduces this risk compared with a single tree, although it does not eliminate it.
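The ensemble idea can be sketched with the standard library alone. To keep the example short, each "tree" below is a one-split stump that splits a randomly chosen feature at its mean value. That is a deliberate simplification of this sketch: real random forests grow deeper trees and select splits with an impurity measure. The function names, the toy `data`, and the defaults `n_trees=25` and `seed=0` are all assumptions.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(sample, feature):
    """Fit a one-split tree on one feature, splitting at the feature's mean."""
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[feature] <= threshold]
    right = [y for x, y in sample if x[feature] > threshold]
    fallback = majority([y for _, y in sample])
    return (
        feature,
        threshold,
        majority(left) if left else fallback,
        majority(right) if right else fallback,
    )


def fit_forest(data, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and a randomly chosen feature."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]  # sample with replacement
        feature = rng.randrange(n_features)        # random feature choice
        forest.append(fit_stump(sample, feature))
    return forest


def predict_forest(forest, x):
    """Combine the individual trees by majority vote."""
    votes = [left if x[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical toy data: two well-separated classes.
data = [
    ((1, 1), "a"), ((2, 1), "a"), ((1, 2), "a"), ((2, 2), "a"),
    ((8, 8), "b"), ((9, 8), "b"), ((8, 9), "b"), ((9, 9), "b"),
]
forest = fit_forest(data)
print(predict_forest(forest, (1.5, 1.5)))
print(predict_forest(forest, (8.5, 8.5)))
```

Because every tree sees a different bootstrap sample, a single unlucky tree can be outvoted by the rest.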

**5. Support Vector Machines (SVMs)**

While support vector machines may also be applied to regression, they are particularly useful for classification. With SVMs, the training data is plotted in space based on several select parameters, where the points fall into groups according to the values of these parameters. The algorithm then creates a line or series of lines (hyperplanes, in higher dimensions) that separates the groups with the widest possible margin. When new data enters the machine learning system, the side of the line it falls on determines the group it belongs to.
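The final step, deciding which side of the line a new point falls on, is simple to illustrate. The sketch below covers only that prediction step and assumes the line has already been found. The `weights` and `bias` values are hypothetical stand-ins for trained parameters, and the training procedure that maximizes the margin is not shown.

```python
def decision_value(weights, bias, x):
    """Signed score w . x + b; its sign tells which side of the line x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def svm_classify(weights, bias, x):
    """Assign class +1 or -1 according to the side of the separating line."""
    return 1 if decision_value(weights, bias, x) >= 0 else -1


# Hypothetical, already-trained parameters: the boundary is x1 + x2 = 6.
weights = (1.0, 1.0)
bias = -6.0

print(svm_classify(weights, bias, (5.0, 4.0)))
print(svm_classify(weights, bias, (1.0, 2.0)))
```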

*Regression Algorithms*

**1. Linear Regression**

Whereas classification algorithms attempt to assign incoming data to different groups, regression algorithms seek to predict the value of an incoming data entry based on the values of certain parameters. In his article on linear regression, Jason Brownlee contends that the purpose of linear regression is to minimize prediction error, that is, to make the most accurate predictions possible.

There are a variety of ways to perform linear regression, but the simplest relates a single input parameter to some output. The value of the input parameter predicts the value of the output for a data entry. The model is the line that best fits the relationship between input and output across all the training data.

The simplest linear regression model takes the form:

$$y = b_0 + b_1 x$$

This is the slope-intercept form of a linear function, where $b_1$ is the slope and $b_0$ is the intercept.
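As a minimal sketch of how such a line can be fitted, the example below uses ordinary least squares, the standard closed-form approach for a single input. It finds the intercept `b0` and slope `b1` that minimize the squared prediction error. The function name and the toy data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

b0, b1 = fit_line(xs, ys)
print(b0, b1)
print(b0 + b1 * 10)  # prediction for a new input x = 10
```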

**2. Logistic Regression**

Logistic regression is particularly useful when working with categorical variables, especially when they have been encoded as binary data. For example, when a data entry is either one thing or another, it may be represented as a one or a zero. This type of logistic regression is called binary logistic regression.

While binary logistic regression is the simplest form, logistic regression supports several other models. For example, multinomial logistic regression handles outcomes with three or more categories and creates models in the same manner. Alternatively, ordinal logistic regression models categories that fall in a natural order along a scale.
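For the binary case, a prediction can be sketched as follows. Logistic regression passes a linear combination of the inputs through the logistic (sigmoid) function to obtain a probability between zero and one, then applies a cutoff. The coefficient values and the 0.5 threshold below are hypothetical, and fitting the coefficients from data is not shown.

```python
from math import exp


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_probability(b0, b1, x):
    """Estimated probability that the outcome is 1 for input x."""
    return sigmoid(b0 + b1 * x)


def predict_class(b0, b1, x, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_probability(b0, b1, x) >= threshold else 0


# Hypothetical coefficients: the probability crosses 0.5 at x = 4.
b0, b1 = -4.0, 1.0

print(predict_probability(b0, b1, 4.0))
print(predict_class(b0, b1, 6.0))
print(predict_class(b0, b1, 2.0))
```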

**3. Multivariate Regression Algorithms**

Multivariate regression algorithms are particularly useful for predicting data values when multiple parameters of significant weight control the behavior of the data entries. These algorithms assign a weight to each parameter based on the correlation of its values with the true value of the data. Then, as incoming data enters the system, the model uses the parameter values to predict the value of the data entry.

For two parameters, the multivariate regression model takes the form:

$$y = b_0 + b_1 x_1 + b_2 x_2$$

Thus, if the parameter values ‘x1’ and ‘x2’ are known and the weights ‘b0’, ‘b1’, and ‘b2’ are known, then the value of ‘y’ may be predicted from the model.
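Once the weights are known, prediction is a weighted sum. The short sketch below evaluates the model for one data entry. The function name and the numeric values are hypothetical, and estimating the weights from training data is outside its scope.

```python
def predict(coefficients, features):
    """Evaluate y = b0 + b1*x1 + b2*x2 + ... for one data entry.

    coefficients is [b0, b1, b2, ...]; features is [x1, x2, ...].
    """
    b0, weights = coefficients[0], coefficients[1:]
    if len(weights) != len(features):
        raise ValueError("need exactly one weight per feature")
    return b0 + sum(b * x for b, x in zip(weights, features))


# Hypothetical weights b0 = 1.0, b1 = 2.0, b2 = 0.5 and inputs x1 = 3, x2 = 4.
y = predict([1.0, 2.0, 0.5], [3.0, 4.0])
print(y)
```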

**4. Multiple Regression**

Finally, multiple regression resembles multivariate regression in that a variety of parameters exist within the model to predict the data value. However, it differs in that it uses only one dependent variable, which is the variable we seek to predict. The multiple regression model assumes that the relationship between the multiple independent variables and the dependent variable is linear. The issue with this model, however, is that it does not deal well with outliers in the data set.

## Summary

With supervised machine learning, two primary methodologies apply: classification models and regression models. These methodologies have different goals, and thus they use different algorithms, which we have described in this article. Classification algorithms apply to data sets that we seek to organize according to group identity: upon receiving incoming data, we assign each data entry to a particular category. Regression algorithms, by contrast, use training data to create a model that predicts the value of a data entry. Which methodology we use depends on the nature of the data we work with and the goal of our supervised machine learning efforts.