Introduction to Histograms in MatPlotLib
By this point, we are rather deep in our series on the intricacies of MatPlotLib. We began with the basic configuration of MatPlotLib and its functions, which may be found addressed here. We then elaborated on the concepts fundamental to plot construction and customization, which may be investigated here. In our third and fourth installments, we turned to more complex plotting mechanisms: one focuses on scatter plots, while the other addresses density and contour plots. Check these out before moving forward. Herein, we will discuss at length the essential features of histograms.
Before modeling data with histograms, it helps to understand what these tools are used for. We elaborate greatly on this subject in our series on statistics, so I will not take it to great depth here. Suffice it to say that a histogram shows how often the values in a dataset fall within each of a series of intervals, called bins. Each bin is drawn as a rectangle, and when all bins share the same width, both the height and the area of the rectangle are proportional to the number of values in that bin.
The histogram is one of the most useful plot types for understanding a wide range of features of our data. From it we can judge the shape of the distribution and get a rough visual sense of its center (mean, median, and mode) and its spread (standard deviation). All of these features give us some fundamental insight into the nature of the sample we work with.
Constructing a Histogram
The code for creating a basic histogram is rather simple. First, we need a data source with which to populate the histogram. This can be a one-dimensional data set, which we can create with NumPy’s random number functions. All we then have to do is call the ‘plt.hist’ function with our one-dimensional data set as input. The code altogether appears as follows:
As we can see, the code is quite simple. The histogram produced by this code appears as:
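A minimal sketch of such a listing is given below. The sample size, the random seed, and the choice of a standard normal distribution are our own illustrative assumptions; any one-dimensional array would do.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative one-dimensional data: 1,000 draws from a standard normal
# distribution (seed and sample size are arbitrary choices).
rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)

# plt.hist sorts the values into bins (10 equal-width bins by default)
# and draws one bar per bin. It returns the per-bin counts, the bin
# edges, and the drawn patches.
counts, bin_edges, patches = plt.hist(data)
```

In a script, finish with plt.show() to open the figure window; in a notebook the figure is typically rendered automatically.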
Customizing Histograms in MatPlotLib
As with other plot types in MatPlotLib, the library offers a wide variety of customization features that give us greater control over the histogram. We elaborate on some of these below.
Bins are the intervals into which the range of the data is divided. If a value falls within the interval of a bin, the count for that bin increases by one. Increasing the number of bins gives a finer-grained look at the data, although with a small sample too many bins can make the plot noisy. We can specify the number of bins by passing an integer to the ‘bins’ argument. Let’s see the code when we increase the number of bins to 30:
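A sketch of the call with 30 bins follows. The data are the same illustrative normal sample as before. The histtype setting is our assumption, included only because the article goes on to say that its examples use the step-filled type; it can be omitted without affecting the binning.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same illustrative data as in the basic example.
rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)

# bins=30 splits the data range into 30 equal-width intervals.
# histtype="stepfilled" (an assumption, see above) draws the result as a
# single filled outline rather than as separate bars.
counts, bin_edges, patches = plt.hist(data, bins=30, histtype="stepfilled")
```

Note that 30 bins always produce 31 bin edges, since each bin is bounded on both sides.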
When we increase the number of bins to 30, we find that the histogram takes on the form:
MatPlotLib provides four histogram types that alter the structure of the plotted result: ‘bar’ (the default), ‘barstacked’, ‘step’, and ‘stepfilled’. The type may be changed by specifying the ‘histtype’ keyword argument when plotting the histogram. We have primarily been utilizing the step-filled histogram type. However, let’s create some histograms of alternative types, including ‘bar’ and ‘step’. Let us begin with the ‘step’ histogram type. The code for doing so appears as follows:
The ‘step’ histogram type does not fill the interior of the histogram; it draws only the outline. It creates a plot that appears as follows:
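A short sketch of the ‘step’ type is shown below, again on illustrative normal data of our own choosing. Only the histtype value differs from the earlier calls.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data (seed and sample size are arbitrary choices).
rng = np.random.default_rng(seed=0)
data = rng.normal(size=1000)

# histtype="step" draws an unfilled outline of the histogram.
# Swapping in "bar", "barstacked", or "stepfilled" selects the other types.
counts, bin_edges, patches = plt.hist(data, bins=30, histtype="step")
```

The counts and bin edges do not depend on the histogram type; only the way they are drawn changes.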
Plotting Multiple Histograms
When working with multiple sets of data, it is often desirable to compare their distributions by plotting multiple histograms simultaneously. Suppose that we have three normal distributions with different means and different standard deviations. We can plot each of these data sets on the same axes and alter their transparency using the ‘alpha’ keyword argument. This takes a number between 0 (fully transparent) and 1 (fully opaque). Let’s take a look at how to create an overlaid multi-histogram plot:
Here, we specify three separate data sets, ‘x1’, ‘x2’, and ‘x3’, which are normal distributions with different means and standard deviations. By setting the transparency via alpha to 0.3, we obtain a plot that appears as follows:
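The overlay can be sketched as follows. The names x1, x2, and x3 and the alpha value of 0.3 come from the article; the particular means, standard deviations, sample sizes, and bin count are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=0)

# Three normal samples with different means and standard deviations
# (the specific values are illustrative, not taken from the article).
x1 = rng.normal(0, 0.8, 1000)
x2 = rng.normal(-2, 1, 1000)
x3 = rng.normal(3, 2, 1000)

# Shared styling: a low alpha keeps overlapping regions visible.
kwargs = dict(histtype="stepfilled", alpha=0.3, bins=40)

# Each call draws onto the same axes, so the histograms overlap.
n1, edges1, _ = plt.hist(x1, **kwargs)
n2, edges2, _ = plt.hist(x2, **kwargs)
n3, edges3, _ = plt.hist(x3, **kwargs)
```

Because each call computes its own bin edges from its own data, the three histograms use different bin widths. Passing an explicit array of edges to bins would make them directly comparable.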
Creating Two-Dimensional Histograms
In some cases, we may work with two-dimensional data that must be plotted as a histogram. Fortunately, MatPlotLib provides efficient means of representing such data by creating histograms in two-dimensional space. In these cases, the bins are two-dimensional rather than one-dimensional, and the frequency of data within each bin is reflected by the intensity of the color in that bin. Let us attempt to create one of these two-dimensional histograms.
First, we must create two-dimensional data that represent the (x, y) pairs on the plane where we draw our histogram. We can draw samples from a two-dimensional normal distribution using NumPy’s random ‘multivariate_normal’ function. Because each sample has two coordinates, we need to specify a mean for each coordinate along with a 2×2 covariance matrix, which holds the variance of each coordinate and the covariance between them. Let’s first take a look at how we create the data for this two-dimensional histogram.
Once we have this two-dimensional data, we can create the two-dimensional histogram using the ‘plt.hist2d’ function. This function accepts the x and y arrays, the number of bins, and a color map. We can also include a color bar as a reference for the frequency. Let us attempt to create this with the following code:
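A sketch of the data generation is given below. The mean vector, covariance matrix, and sample size are illustrative assumptions on our part; the article does not state the values it used.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Mean of each coordinate and a 2x2 covariance matrix. The diagonal holds
# the variances of x and y; the off-diagonal entries hold their covariance.
mean = [0, 0]
cov = [[1, 1], [1, 2]]

# Draw 10,000 (x, y) pairs. The result has shape (10000, 2), so
# transposing lets us unpack the two coordinates into separate arrays.
x, y = rng.multivariate_normal(mean, cov, 10000).T
```

With this covariance matrix the two coordinates are positively correlated, so the point cloud is elongated along a diagonal rather than circular.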
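A minimal sketch of the two-dimensional histogram follows. The data parameters, the bin count of 30, and the ‘Blues’ color map are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative correlated two-dimensional normal data.
rng = np.random.default_rng(seed=0)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

# bins=30 gives a 30 x 30 grid of rectangular bins; cmap sets the color map.
# The call returns the 2D array of counts, the bin edges along each axis,
# and the drawn image.
counts, xedges, yedges, image = plt.hist2d(x, y, bins=30, cmap="Blues")

# The color bar maps color intensity back to the number of points per bin.
cb = plt.colorbar(image)
cb.set_label("counts in bin")
```

If only the counts are needed without a plot, np.histogram2d performs the same binning.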
This code creates a two-dimensional histogram which appears as follows:
Customizing Two-Dimensional Histograms
Specifying Bin Shape
In a standard two-dimensional histogram, the bins are rectangles, often squares. However, the plane can also be tiled with regular hexagons, and MatPlotLib supports this bin shape through the plt.hexbin function. We pass in our two-dimensional data as input and specify the grid size and the color map. The code for executing this appears as follows:
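The hexagonal version can be sketched as follows, on the same kind of illustrative data. The gridsize of 30 and the ‘Blues’ color map are our assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative correlated two-dimensional normal data.
rng = np.random.default_rng(seed=0)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

# gridsize sets the number of hexagons across the x direction.
# The returned collection stores one count per hexagon.
hexes = plt.hexbin(x, y, gridsize=30, cmap="Blues")
per_hex = hexes.get_array()

cb = plt.colorbar(hexes)
cb.set_label("counts in bin")
```

A larger gridsize gives smaller hexagons and therefore finer resolution, in the same way that a larger number of bins does for a rectangular grid.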
This code creates a two-dimensional histogram of the form:
Kernel Density Estimation
Kernel density estimation (KDE) in effect smears out each data point in space and adds up the results, so that the bins are no longer explicitly discrete. Rather, there is a smooth gradient of color intensities across the plot. One implementation of this method is found in the scipy.stats package. If you have not used scipy before, do not worry, as this package will be elaborated on extensively in another series. Let the following code suffice for demonstrating the process of kernel density estimation.
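One way to sketch this is with scipy.stats.gaussian_kde, evaluated on a regular grid and drawn with plt.imshow. The choice of a Gaussian kernel, the grid limits, the 40 x 40 grid resolution, and the data parameters are all illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Illustrative correlated two-dimensional normal data.
rng = np.random.default_rng(seed=0)
x, y = rng.multivariate_normal([0, 0], [[1, 1], [1, 2]], 10000).T

# gaussian_kde expects an array of shape (n_dimensions, n_samples).
samples = np.vstack([x, y])
kde = gaussian_kde(samples)

# Evaluate the estimated density on a regular 40 x 40 grid.
xgrid = np.linspace(-3.5, 3.5, 40)
ygrid = np.linspace(-6, 6, 40)
Xgrid, Ygrid = np.meshgrid(xgrid, ygrid)
Z = kde.evaluate(np.vstack([Xgrid.ravel(), Ygrid.ravel()]))
density = Z.reshape(Xgrid.shape)

# Draw the density as an image; origin="lower" puts the smallest y value
# at the bottom, and extent labels the axes in data coordinates.
image = plt.imshow(density, origin="lower", aspect="auto",
                   extent=[-3.5, 3.5, -6, 6], cmap="Blues")
cb = plt.colorbar(image)
cb.set_label("density")
```

Unlike the histograms above, the color bar here shows an estimated probability density rather than a count. The smoothness is governed by the kernel bandwidth, which gaussian_kde selects automatically unless bw_method is given.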
This creates a two-dimensional plot of the form:
The Take Away
Histograms are a ubiquitous chart type that holds a critical position in data science. These plots reveal statistical features such as the center and shape of a distribution, which provide global insight into the nature of a data set. Fortunately, MatPlotLib supplies a variety of means for creating and customizing them. Furthermore, the ability to plot two-dimensional histograms allows us to model significantly more complex data. Applied concisely in a Python script, these MatPlotLib functions allow users to create highly specific charts that are exceptional at conveying statistical information. In our next article, we will discuss in great depth the variety of customization features in MatPlotLib. In the meantime, if you desire to explore histograms in greater depth, check out the MatPlotLib manual here.