Introduction to Unsupervised Machine Learning
The first article in this series gave an overview of the main kinds of machine learning systems and the functions they perform: supervised, unsupervised, batch, online, instance-based, and model-based learning. Our subsequent articles have examined each of these in more detail. The previous article in the series covered the algorithms used in supervised machine learning, in particular classification and regression. In this article, we do the same for unsupervised machine learning.
Unsupervised Machine Learning
What is Unsupervised Machine Learning?
Unsupervised machine learning is another answer to the question of how much human supervision is involved in training a model. As you may recall, supervised machine learning relies on labels that pair each training example with its correct answer. In unsupervised machine learning, by contrast, the training data is unlabeled. The system receives no explicit instructions and must instead discover structure in the data on its own.
Techniques of Unsupervised Machine Learning
Several unsupervised machine learning techniques exist. One of these is clustering, in which the system divides data into groups based on the features of each data point. Some clustering algorithms can further divide these groups into subgroups for finer granularity, which is especially useful with large data sets.
Unsupervised machine learning also supports visualization tasks, which produce a visual representation of unlabeled data. The output is typically a two-dimensional or three-dimensional representation of the data that can be plotted.
Finally, association rule learning is one task that may prove especially useful. It takes multi-dimensional data as input and identifies relationships between the attributes of the data.
We take an opportunity in this article to explicate the algorithms associated with each of these unsupervised machine learning methodologies.
What is Clustering?
We previously described clustering as an unsupervised technique that uses the features of data entries to identify groups and subgroups in the data. Several algorithms can achieve this, and a subsequent article will cover them more thoroughly. The primary algorithms include:
- K-Means Clustering
- Hierarchical Clustering Analysis (HCA)
- Expectation Maximization
K-Means Clustering
The purpose of K-Means clustering is to discover patterns in data entries and use their similarities to separate the entries into groups. Dr. Michael Garbade notes in one of his articles on K-Means Clustering that clusters are aggregates of data points grouped together because they have similar feature values.
When running K-Means clustering, the user defines a value ‘k’, the number of clusters to find in the data set. Each cluster has a centroid, which is the mean of the points in that cluster. Each data point is assigned to the cluster whose centroid is nearest.
The K-Means algorithm operates by choosing initial centroids at random, assigning each data point to its nearest centroid, and then moving each centroid to the mean of the points assigned to it. These two steps repeat until the centroids stop moving. Once the centroids are fixed, new data inputs can be assigned to the cluster with the nearest centroid.
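The K-Means procedure can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names, the sample points, the fixed random seed, and the choice of squared Euclidean distance are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def k_means(points, k, iterations=100, seed=0):
    """Alternate assignment and centroid updates until the centroids settle."""
    rng = random.Random(seed)           # fixed seed: an assumption, for repeatability
    centroids = rng.sample(points, k)   # random initial centroids drawn from the data
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

# Hypothetical data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids = k_means(points, k=2)

# A new data input is assigned to the cluster with the nearest centroid.
new_label = nearest_centroid((8.2, 8.8), centroids)
```

Because the initial centroids are random, different seeds can give different results on less clearly separated data, which is why K-Means is often run several times in practice.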
Hierarchical Clustering Analysis (HCA)
Hierarchical clustering analysis is an alternative to K-Means clustering. Unlike K-Means, it does not require the number of clusters to be specified in advance. Furthermore, its output is easy to present graphically as a tree diagram known as a dendrogram.
The two types of hierarchical clustering commonly used in unsupervised machine learning are agglomerative and divisive hierarchical clustering. Agglomerative clustering is the more commonly used of the two. The algorithm works from the bottom up: it begins with every data point in its own cluster, then repeatedly merges the two most similar clusters until a single cluster remains. The result is a tree of clusters running from the individual data points to a single all-encompassing cluster.
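The bottom-up merging can be illustrated with a short sketch on one-dimensional values. The function name and the sample values are hypothetical, and the choice to measure the distance between two clusters by their closest pair of members is an assumption made to keep the example small.

```python
def agglomerate(values):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [[v] for v in values]   # every value starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest distance between them.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = sorted(clusters[i] + clusters[j])
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical data: five values, so four merges lead to a single cluster.
merges = agglomerate([1.0, 2.0, 6.0, 7.0, 15.0])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The recorded merge distances are what a tree diagram of the clustering would plot on its vertical axis.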
Divisive hierarchical clustering differs from agglomerative clustering in that it operates from the top down. Rather than beginning with many individual clusters, it begins with a single cluster containing all the data points. The algorithm then progressively divides this cluster into smaller, more specific clusters. The process continues until each data object occupies its own cluster.
With both of these methods, the similarity between clusters is inferred from the distance between them. This distance can be computed in several ways. Maximum (complete) linkage computes the distance between every pair of elements drawn from the two clusters and takes the largest value as the dissimilarity between the clusters. Minimum (single) linkage computes the same pairwise distances but takes the smallest value. Centroid linkage instead computes the distance between the two cluster centroids. The method chosen depends on the purpose of the unsupervised machine learning model.
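The three ways of measuring the dissimilarity between two clusters can be compared directly. The following sketch assumes Euclidean distance and uses two small made-up clusters; the function names are illustrative only.

```python
import math

def maximum_linkage(a, b):
    """Largest distance between any element of a and any element of b."""
    return max(math.dist(p, q) for p in a for q in b)

def minimum_linkage(a, b):
    """Smallest distance between any element of a and any element of b."""
    return min(math.dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters lying on a line, so the distances are easy to check by hand.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

print(minimum_linkage(cluster_a, cluster_b))   # 3.0
print(maximum_linkage(cluster_a, cluster_b))   # 6.0
print(centroid_linkage(cluster_a, cluster_b))  # 4.5
```

The same pair of clusters yields three different dissimilarities, which is why the choice of linkage can change which clusters are merged or split first.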
Expectation Maximization
The expectation maximization algorithm computes maximum likelihood estimates for the parameters of a model. It is particularly useful for finding the model that best fits the data in scenarios where some data values are missing or unobserved.
To compute the maximum likelihood estimate, the algorithm starts from an initial, often random, guess at the parameters and uses the current best fit to estimate the missing values (the expectation step). Given those estimates, it then updates the parameters, and the weights assigned to them, to maximize the likelihood of the data (the maximization step). The two steps repeat until the estimates settle. In this manner, predictions for incoming data can be made from a model fitted to incomplete data sets.
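A common textbook instance of expectation maximization, assumed here for illustration, is fitting a mixture of two one-dimensional Gaussians. The "missing" value for each data point is which of the two groups it belongs to. The function names, the sample data, and the deterministic starting guess (used instead of a random one so the result is repeatable) are assumptions of this sketch.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture with expectation maximization."""
    means = [min(data), max(data)]   # crude starting guess (assumption)
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # Expectation step: estimate the missing group membership of each point
        # as the probability that each component generated it.
        resp = []
        for x in data:
            likes = [weights[k] * normal_pdf(x, means[k], variances[k])
                     for k in range(2)]
            total = sum(likes)
            resp.append([like / total for like in likes])
        # Maximization step: re-estimate weights, means, and variances
        # so that they maximize the likelihood given those memberships.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)   # floor keeps the variance positive
    return weights, means, variances

# Hypothetical data drawn around two centers, near 1 and near 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(data)
print(weights, means, variances)
```

The fitted weights play the role of the parameter weights described above: they say how much of the data each component is believed to explain.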
What is Anomaly Detection?
Anomaly detection is perhaps one of the simplest tasks we can tackle with unsupervised learning. In many cases, especially with large data sets, it is much easier to identify differences than to identify similarities. One reason for its simplicity is that anomalies can often be identified with rather straightforward statistical concepts such as the standard deviation or the correlation coefficient. Susan Li, another author writing for Towards Data Science, presents a variety of anomaly detection methods in one of her many articles.
Applications of Anomaly Detection
We may build on this idea with an additional assumption: if a human is readily able to identify outliers in a set, then a machine learning system ought to be capable of doing so with a much greater degree of specificity. While anomaly detection algorithms differ in how they identify anomalies, rest assured that they all revolve around exploiting significant differences in feature values. We discuss several of these algorithms in depth here. They include:
- Univariate Anomaly Detection
- Multivariate Anomaly Detection
- Isolation Forest
1. Univariate Anomaly Detection
As you may infer from the name, univariate anomaly detection considers only one variable of the data set when identifying outliers. Though seemingly narrow, it can operate in a variety of ways. Point anomaly detection identifies individual data points that deviate from the norm of the data. Contextual anomaly detection takes into account the data surrounding an individual point to decide whether it is anomalous. Finally, collective anomaly detection identifies a group of data points that are anomalous together. The mechanism employed depends on the purpose of the machine learning model.
Univariate anomaly detection can be carried out with a variety of computational methods. One could fit a univariate Gaussian distribution to the data and identify outliers by their z-score, that is, the number of standard deviations a value lies from the mean.
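A z-score check takes only a few lines. In this sketch the function name, the sample readings, and the threshold of 2.5 standard deviations are assumptions chosen for illustration; the appropriate threshold depends on the data.

```python
import statistics

def z_score_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)   # sample standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical sensor readings with one obviously unusual value.
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.7, 10.3, 25.0]
outliers = z_score_outliers(readings)
print(outliers)
```

One caveat with small samples is that an extreme value inflates the standard deviation it is measured against, so its z-score cannot grow arbitrarily large. With ten readings the largest possible z-score is below 3, which is why a lower threshold is used here.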
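If the univariate data comes from a normally distributed data set, Grubbs' test can identify individual outliers. As outliers are identified, they may be separated from the original data set. The Grubbs' test statistic is defined as:
$$G = \frac{\max_{i} \lvert x_i - \bar{x} \rvert}{s}$$
where $\bar{x}$ is the sample mean and $s$ is the sample standard deviation of the $N$ data points. At significance level $\alpha$, the two-sided test rejects the most extreme data point as an outlier if:
$$G > \frac{N - 1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N - 2 + t_{\alpha/(2N),\,N-2}^{2}}}$$
where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t-distribution with $N - 2$ degrees of freedom.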
2. Multivariate Anomaly Detection
In some data sets, multiple different parameters make significant contributions to the values of data objects. In this case, univariate analysis may be insufficient for efficient identification of outliers. Thus, it is preferable to execute a multivariate analysis to identify such compound outliers.
A variety of multivariate analyses may be undertaken to identify these instances. One of the principal methods uses multivariate Gaussian distributions. In this manner, outliers may be identified through deviations that occur across several variables at once.
A multivariate normal distribution may be thought of as a vector of several jointly distributed variables. Such a vector is multivariate normal if every linear combination of its component variables is itself normally distributed.
If the multivariate normal vector is written as ‘x’ with components $x_1, \ldots, x_k$, then ‘x’ has a mean vector given by:
$$\mu = E[x] = \left(E[x_1], E[x_2], \ldots, E[x_k]\right)^{T}$$
Here, ‘E’ denotes the expected value of each component of ‘x’. Individual data objects are then judged by how far they deviate from this vector.
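One way to score such deviations, assumed here for illustration, is the squared Mahalanobis distance. It measures how far a point lies from the mean vector after accounting for how the variables vary together through the covariance matrix. The sketch below is restricted to two variables so that the matrix inverse can be written out by hand; the function names and the data are hypothetical.

```python
def mean_vector(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def covariance_2d(points, mu):
    """Sample covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    sxx = sum((p[0] - mu[0]) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mu[1]) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_squared(point, mu, cov):
    """Squared Mahalanobis distance, using the closed-form inverse of a 2x2 matrix."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical training data in which the second variable is roughly twice the first.
normal_points = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2), (5.0, 9.8), (6.0, 12.1)]
mu = mean_vector(normal_points)
cov = covariance_2d(normal_points, mu)

# Both coordinates of this point are individually within the observed ranges,
# but the combination breaks the relationship between the two variables.
suspicious = mahalanobis_squared((3.0, 12.0), mu, cov)
typical = mahalanobis_squared((3.5, 7.0), mu, cov)
print(suspicious, typical)
```

The suspicious point would pass a univariate check on either variable alone, which is the situation in which multivariate analysis is needed.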
3. Isolation Forest
Finally, the isolation forest frequently appears in unsupervised machine learning. This method is quite similar to the Decision Tree technique seen with the supervised machine learning models. In this case, each tree in the forest randomly selects a feature and a split value for that feature, then partitions the data accordingly.
Isolation forests identify anomalies as the data points that become isolated after unusually few splits, since points that differ strongly from the rest are easy to separate. When running the isolation forest method, the first order of business is instantiating the model. As part of this, we specify the number of trees that populate the forest. The algorithm then uses the feature values to identify anomalies.
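The idea of isolating points with random splits can be sketched without any library. This is a simplified illustration rather than the full isolation forest algorithm: it follows a single point through random partitions instead of building and storing trees, it does not subsample the data, and it reports a raw average depth instead of a normalized anomaly score. The function names, data, and parameter defaults are assumptions.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=10):
    """Number of random splits needed before `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(point))        # pick a feature at random
    lo = min(row[feature] for row in data)
    hi = max(row[feature] for row in data)
    if lo == hi:
        return depth                           # no split is possible on this feature
    split = rng.uniform(lo, hi)                # pick a split value at random
    # Keep only the side of the split that contains the point.
    if point[feature] < split:
        side = [row for row in data if row[feature] < split]
    else:
        side = [row for row in data if row[feature] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=100, seed=0):
    """Average isolation depth over many random trees; lower means more anomalous."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

# Hypothetical data: a tight 4x4 grid of points plus one distant point.
data = [(x * 0.1, y * 0.1) for x in range(4) for y in range(4)] + [(10.0, 10.0)]

outlier_depth = average_depth((10.0, 10.0), data)
inlier_depth = average_depth((0.1, 0.1), data)
print(outlier_depth, inlier_depth)
```

The distant point is usually separated by the very first split, while a point inside the grid needs several splits, so its average depth is larger.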
When it comes to unsupervised machine learning, two of the primary tasks are clustering and anomaly detection. The premise of both is to take unlabeled data and create a model that can be applied to new data. However, they pursue this goal in different ways. Clustering identifies groups in the data and characterizes data objects by the group they belong to. Anomaly detection instead identifies the data objects that do not belong to any group. Either way, these approaches take unlabeled training data and build models that can make predictions about new data.