Pandas DataFrames and Indexing: Organizing COVID-19 Data

The Pandas DataFrames

A recent article elaborated on the use of web scraping to acquire data from the internet, specifically COVID-19 statistics. Some of that data is used here to examine how Pandas supports coherent data modeling. Pandas DataFrames and Series are the primary structures covered.

Check out this article to observe the code employed for generating the data we will be using herein.

Pandas Structures

Perhaps you’ve arrived here having no background in the use of Pandas, but at the very least, you are here with the intention of learning. This article describes the basic structures utilized in Pandas (Series and DataFrames), and explores the various methods for extracting data from them. The official Pandas documentation provides useful descriptions of its functions. But if you’re looking for something with less jargon, you’ve come to the right place.

Pandas Series

The most fundamental structure of Pandas is the Series object. This entity is a one-dimensional array that stores linear data. A Pandas Series can be created by manual input as follows:
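The original listing is not reproduced here, so the following is a minimal sketch of manual input. It assumes the numbers one through ten, which is the list the indexing examples later in this article refer to; the variable name numbers is illustrative.

```python
import pandas as pd

# Build a Series from a plain Python list of the numbers 1 through 10.
# With no index given, Pandas assigns the default integer index 0..9.
numbers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

print(numbers)
```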

Alternatively, we can use our COVID-19 data to create a one-dimensional Series of the number of cases:
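A rough sketch of this step follows. The scraped statistics are not reproduced in this article, so the list cases below holds placeholder values (not real case counts) standing in for the scraped numbers.

```python
import pandas as pd

# Placeholder case counts, NOT real statistics. In practice this list
# would come from the scraped COVID-19 data.
cases = [500, 400, 300, 200, 100]

# Each list item becomes a value; the index defaults to 0, 1, 2, ...
cases_series = pd.Series(cases)

print(cases_series)
```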

Note that in both cases, the pd.Series() function takes a list object as input. The output is a linear structure associating a value with an index.

Conceptualizing Pandas Series
  • Series as a NumPy Array: You may observe that the Pandas Series object is quite similar to the one-dimensional NumPy array. The essential difference between the two is that the NumPy array has an implicitly defined integer index, while the Pandas Series index is explicitly defined. This means the programmer has the option to set the index to any labels they like, such as arbitrary numbers or strings.
  • Series as a Dictionary: The Pandas Series can also be viewed as a specialization of the Python dictionary. A dictionary maps keys to values, and a Series maps index labels to values in the same way. Consequently, a Python dictionary can be converted directly into a Pandas Series, which may be observed using our COVID-19 data:

In line 33, we see that there is a direct association between the case counts and the label “COVID-19 cases”. Take note that we could, however, make the key another list, such as states, or some other arbitrary entity of the same length. These entities associate item by item.
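To make the item-by-item association concrete, here is a small sketch that uses states as the keys. The case counts are placeholders (not real statistics), and New Jersey and California are assumed names for the two states this article does not identify.

```python
import pandas as pd

# Hypothetical top-five list; only New York, Illinois, and Massachusetts
# are named in this article. The counts are placeholders.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
cases = [500, 400, 300, 200, 100]

# zip pairs the two lists item by item; dict turns the pairs into a mapping.
cases_by_state = pd.Series(dict(zip(states, cases)))

# The dictionary keys become the Series index.
print(cases_by_state)
```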

Pandas DataFrames

The other core Pandas object is the DataFrame structure. As we considered with Series objects, Pandas DataFrames are quite similar to NumPy arrays, particularly the multidimensional arrays. Series are linear arrays with one value set and one index set. Pandas DataFrames, on the other hand, support a variety of compositions, including multiple indices and multiple data inputs. Let’s briefly explore several different structures we created with our COVID-19 data.

DataFrame From Dictionary

Suppose we desire to create a Pandas DataFrame exhibiting the number of deaths and cases in the five most afflicted states. In this format, we have three components we must account for: two items specifying the data (cases and deaths) and the index (the five most afflicted states). The easiest way to organize this is to create a dictionary that labels the number of cases and deaths from our two lists. We then must set the index to the states, which we organize as a list. The code for this DataFrame looks like this:
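A minimal sketch of this construction, under a few assumptions: the numbers are placeholders rather than the scraped statistics, the column labels "Cases" and "Deaths" are illustrative, and New Jersey and California stand in for the two states not named in this article.

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
cases = [500, 400, 300, 200, 100]
deaths = [50, 40, 30, 20, 10]

# Dictionary keys become column labels; the states list becomes the row index.
covid_df = pd.DataFrame({"Cases": cases, "Deaths": deaths}, index=states)

print(covid_df)
```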

As you can see, this simple code readily organizes our data into two columns exhibiting the number of cases and number of deaths, and associates these values to a particular state.

Pandas DataFrames From List

Consider the possibility that we desire an alternative design. For example, what if we want a column for each state and data exhibited in rows? We thus need to specify the column as the list containing states, as well as an index labeling the rows of data. Then we need a list containing our two data inputs. The code will look and produce the content as follows:
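The layout described above might be sketched as follows, again with placeholder numbers, illustrative row labels, and New Jersey and California as assumed state names.

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
cases = [500, 400, 300, 200, 100]
deaths = [50, 40, 30, 20, 10]

# A list of lists: each inner list becomes one row of the DataFrame.
wide_df = pd.DataFrame([cases, deaths], columns=states, index=["Cases", "Deaths"])

print(wide_df)
```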

These are just two of many ways in which a Pandas DataFrame can be constructed. Suffice it to say that there are four general ways of building these structures:

  1. Dictionary Input
  2. From Pandas Series
  3. Two-Dimensional Array
  4. Numpy Structured Array
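The dictionary route is covered earlier in this article. For the remaining three, the sketch below builds the same small table each way. All numbers are placeholders and the names are illustrative.

```python
import numpy as np
import pandas as pd

labels = ["New York", "New Jersey"]  # second state name is an assumption

# From Pandas Series: each Series becomes a column, aligned on its index.
cases = pd.Series([500, 400], index=labels)
deaths = pd.Series([50, 40], index=labels)
from_series = pd.DataFrame({"Cases": cases, "Deaths": deaths})

# From a two-dimensional array: rows and columns are labeled explicitly.
array_2d = np.array([[500, 50], [400, 40]], dtype="int64")
from_array = pd.DataFrame(array_2d, columns=["Cases", "Deaths"], index=labels)

# From a NumPy structured array: field names become the column labels.
structured = np.array(
    [(500, 50), (400, 40)],
    dtype=[("Cases", "i8"), ("Deaths", "i8")],
)
from_structured = pd.DataFrame(structured, index=labels)

print(from_series)
print(from_array)
print(from_structured)
```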

Indexing Pandas Structures

While Pandas structures are useful for organizing large aggregates of data, their usefulness is limited unless we are able to access specific pieces of that data. The rest of this article elaborates on the methods of indexing and data selection that Pandas data structures provide for exactly this purpose.

Indexing Series

Recall from our previous conceptualization of Pandas Series that we can consider these structures from the perspective of dictionaries or as linear NumPy arrays. For this reason, we may use several different methods for indexing Series, depending on the data we seek to procure. In consideration of the relationship between Pandas Series and NumPy arrays, let’s begin by indexing our initial Series of numbers.

Recall that our initial Series takes a list input of numbers from one through ten, and indexes these values accordingly. We can index this Series in a fashion quite similar to how we index Python lists. For example, if we desire to index the number ‘1’ from the list, all we must do is index the Series at position ‘0’.

We may also slice from this Series the first three items by changing the indexing of position 0 to [0:3].
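Both operations can be sketched as below, assuming the Series of numbers one through ten with its default integer index (the variable names are illustrative).

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# With the default index, the label 0 is also the first position.
first = numbers[0]

# Slicing works as it does for Python lists: the stop position is excluded.
first_three = numbers[0:3]

print(first)        # 1
print(first_three)  # the values 1, 2, 3 with index 0, 1, 2
```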

Indexing Series as Dictionaries

Because Pandas Series may be conceptualized as a dictionary, we may utilize some dictionary-based methods to return desired data. For example, we can check whether a certain label is in the index, we can return the index itself, and we can return a list of tuples containing each key:value association. Let us see how this may be coded:
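A minimal sketch of these three operations, assuming the Series of numbers one through ten with its default index. Note that the in operator tests the index labels, not the stored values.

```python
import pandas as pd

numbers = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Membership is checked against the index labels (0..9), like dict keys.
has_label_2 = 2 in numbers    # True: 2 is an index label
has_label_10 = 10 in numbers  # False: 10 is a value, not a label

# keys() returns the index object itself.
keys = numbers.keys()

# items() yields (index label, value) pairs, like dict.items().
pairs = list(numbers.items())

print(has_label_2, has_label_10)
print(keys)       # RangeIndex(start=0, stop=10, step=1)
print(pairs[:3])
```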

Observe from the Boolean output that 2 is in fact in the index. Because the .keys() function in this case returns the implicitly defined index, we see a range described by its start and stop values rather than a list of labels. Finally, we obtain a list of the key:value pairs where the first item represents the index position and the second item is the value at that position.

Indexing Pandas DataFrames

Greater data structure complexity brings greater flexibility in the data we can access. With Pandas DataFrames, we are able to index specific columns of data by specifying the column we desire.
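As a sketch, column selection on a DataFrame of cases and deaths could look like this. The numbers are placeholders, the column labels are illustrative, and New Jersey and California are assumed state names.

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
covid_df = pd.DataFrame(
    {"Cases": [500, 400, 300, 200, 100], "Deaths": [50, 40, 30, 20, 10]},
    index=states,
)

# Indexing with a column label returns that column as a Series.
cases_column = covid_df["Cases"]

print(cases_column)
```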

Additionally, we can index for the data of a particular state. For example, suppose we desire to index for New York’s cases and deaths. Because plain bracket indexing on a DataFrame selects columns by label, and the states are part of the index, we must execute a slice to reach a row. All we must do is specify the state in our index call and write it as a slice:
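One way this slice might be written, assuming the same placeholder DataFrame of cases and deaths indexed by state (illustrative names and numbers throughout):

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
covid_df = pd.DataFrame(
    {"Cases": [500, 400, 300, 200, 100], "Deaths": [50, 40, 30, 20, 10]},
    index=states,
)

# A label slice inside plain brackets selects rows; both endpoints are
# included, so starting and stopping at "New York" returns just that row.
new_york = covid_df["New York":"New York"]

print(new_york)
```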

Using loc, iloc, and ix

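Pandas data structures come with three indexers for specialized forms of indexing: loc, iloc, and ix. Although they are often described as functions, they are attributes used with square brackets, as in loc[...]. Note that ix was deprecated in Pandas 0.20 and removed in Pandas 1.0, so current code should rely on loc and iloc. When understood, these indexers provide great versatility and flexibility in indexing various Pandas data structures. However, their use differs between Series and DataFrames. Let’s examine the use of these.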

Index Functions in Pandas Series

The loc, iloc, and ix indexers support special indexing schemes useful for more sophisticated data acquisition. The loc indexer allows for indexing according to the explicit index defined by the user.

To demonstrate, we build a new Series from the COVID-19 data, using the top five states and the cases in those states. First, set the case counts for the top five states as the values and the states as the index. Then use loc to index for New York, which returns the number of cases in New York. We can also take every second item in the Series with the syntax loc["New York":"Massachusetts":2], which returns a new Series with New York, Illinois, and Massachusetts associated with their respective cases:
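A minimal sketch of these steps. The case counts are placeholders, and New Jersey and California are assumed names for the second and fourth states, which this article does not identify.

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
cases = [500, 400, 300, 200, 100]

top_five = pd.Series(cases, index=states)

# loc looks up values by explicit index label.
ny_cases = top_five.loc["New York"]

# A label slice with a step of 2 keeps every second item; unlike
# positional slices, the stop label is included.
every_other = top_five.loc["New York":"Massachusetts":2]

print(ny_cases)
print(every_other)
```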

We may also use iloc to index via the implicitly defined index, which is purely numerical:
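The positional equivalent might look like the sketch below, using the same placeholder Series of cases indexed by state (numbers and two of the state names are assumptions).

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
top_five = pd.Series([500, 400, 300, 200, 100], index=states)

# iloc ignores the labels and indexes by integer position.
first_state_cases = top_five.iloc[0]

# Positional slices exclude the stop position, as with Python lists.
every_other = top_five.iloc[0:5:2]

print(first_state_cases)
print(every_other)
```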

Together, these two functions make indexing via a Series index much more efficient. Additionally, they grant the programmer greater flexibility in terms of how they acquire data.

Pandas DataFrames and Index Functions

In the previous section, we observed how loc and iloc apply to indexing with respect to Series. In DataFrames, these indexers operate similarly. Let’s again explore them with respect to our COVID-19 DataFrame.

As with Pandas Series, we can use the loc function to index the explicit index associated with the Pandas DataFrame. To illustrate, let us attempt to use this to extract the same data we accessed from the Pandas Series:
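As a sketch, assuming a DataFrame of placeholder cases and deaths indexed by state (illustrative column labels, with New Jersey and California as assumed state names):

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
covid_df = pd.DataFrame(
    {"Cases": [500, 400, 300, 200, 100], "Deaths": [50, 40, 30, 20, 10]},
    index=states,
)

# A single row label returns that row as a Series of its column values.
ny_row = covid_df.loc["New York"]

# A label slice with a step returns a DataFrame holding every second row.
every_other = covid_df.loc["New York":"Massachusetts":2]

# A row label plus a column label returns a single value.
ny_cases = covid_df.loc["New York", "Cases"]

print(ny_row)
print(every_other)
print(ny_cases)
```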

We may code similar functions using the iloc method to index via the implicit index of the DataFrame:
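A positional sketch of the same selections, again on a placeholder DataFrame with illustrative labels:

```python
import pandas as pd

# Placeholder data, NOT real statistics.
states = ["New York", "New Jersey", "Illinois", "California", "Massachusetts"]
covid_df = pd.DataFrame(
    {"Cases": [500, 400, 300, 200, 100], "Deaths": [50, 40, 30, 20, 10]},
    index=states,
)

# Row 0 by position, returned as a Series.
first_row = covid_df.iloc[0]

# Every second row by position; the stop position 5 is excluded.
every_other = covid_df.iloc[0:5:2]

# Row position 0, column position 0: a single value.
first_cases = covid_df.iloc[0, 0]

print(first_row)
print(every_other)
print(first_cases)
```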

Note that executing these functions produces a new, modified DataFrame object that appears as:

The Take-Home

Pandas data structures provide flexible methods for building these structures as well as extracting their content. However, this is only one of their functions. Future articles will focus on wielding the power of Pandas data structures for complex data modeling as well as rigorous computation. If you’d like to consider the use of vector functions and their applicability to Pandas, check out this article. Stay tuned for future updates on Pandas functionality.
