If you’ve been following along with The Art of Data Science blog, you may have seen our previous tutorials on parsing data from the internet. The first involved the construction of a rather basic web scraper; the second developed a slightly more sophisticated web crawler. Both were built to acquire data on COVID-19. The web scraper accessed the Worldometer page for the United States and parsed out the data for total cases and total deaths. The web crawler was a bit more complex: it accessed the Worldometer page listing the data for all countries, then visited each country’s individual page and parsed out its total cases and total deaths.
Neither of our previous tutorials discussed data modeling and storage in any significant depth. The purpose of this article, and the program associated with it, is therefore to unify web scraping/crawling with data storage and data modeling. Before we begin, if you have not checked out our tutorials on Pandas data structures, we highly recommend doing so, as our storage and modeling efforts revolve around this library. While higher-level plots are often produced with Matplotlib or ggplot, because we are using Pandas for storage, we will use it for modeling as well. Definitely check out the following articles for insights and assistance.
- Constructing Pandas Data Structures with COVID-19 Data
- Operating on Pandas Data Structures with Ufuncs
- Using Pandas With Web Scraping
- Vectorized String Operations with Pandas
Accessing Data Storage Location
As with our previous endeavors, our data source is the Worldometer website. For our purposes, we intend to access only the data for the United States. The website content is structured as HTML, and thus we employ BeautifulSoup to parse it. If you need a refresher on either HTML tags or BeautifulSoup, check out this article, as we do not expand on those nuances below.
Firstly, take note of the libraries needed for executing the code of this program. We utilize ‘urllib’, BeautifulSoup, and Pandas.
To begin, when parsing HTML with BeautifulSoup, we use our standard base for accessing the source content and storing it. Firstly, we specify the URL we seek to access as the United States coronavirus web page at Worldometer. Secondly, we request access to the specified URL and open it using urllib’s ‘urlopen’ function. Finally, we store the content of the HTML page source using BeautifulSoup. In only a few simple lines of code, we have effectively specified, accessed, and stored the entirety of the content we seek to parse. Observe the code which executes this functionality below:
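A minimal sketch of those steps appears below. To keep the sketch runnable without a network connection, the live fetch is wrapped in a function and the call is left commented out; the browser-style User-Agent header is an extra precaution of ours, since some sites reject urllib’s default agent.

```python
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

# The Worldometer page for the United States targeted throughout this article
url = "https://www.worldometers.info/coronavirus/country/us/"

def fetch_soup(page_url):
    """Request the page, open it, and store its HTML source as a soup."""
    # A browser-like User-Agent helps avoid rejection of urllib's default agent
    req = Request(page_url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as page:
        return BeautifulSoup(page.read(), "html.parser")

# soup = fetch_soup(url)  # uncomment to fetch the live page
```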
Initial Data Acquisition
Perhaps the most essential feature of data modeling is establishing an index. Doing this at the front end of data acquisition, rather than the back end, makes organizing the data, as well as modeling it, much simpler. We may have to do some brainstorming about how we want to organize the data in advance, but for our purposes, we can reasonably organize it by the states comprising the USA. So, our first order of business is acquiring the states from the data table.
On the Worldometer website, all of the data we need is stored in a data table, and thus can be accessed via the ‘table’ tag. However, there are multiple tables on the page, so we must make sure we grab the proper one by specifying its respective ID.
Once we have access to the data table, we can pull out the states it lists. If we want to acquire only the states, we must exploit some stylistic feature of the state entries that distinguishes them from the rest of the data. We might note that the font of the states differs from that of the numerical data; therefore, we can specify the ‘td’ tag and target cells with this unique font size. We must also include an extra step, because we only want the text of each cell.
When we extract these chunks of information, they arrive in a messy form because they carry a leading newline character (‘\n’). So, we want to make sure we strip this useless artifact using the replace function. Finally, having cleaned up the information, we append it to a list reserved for the states. The code used to execute this function is presented below:
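Here is a minimal sketch of that routine. Because the live page’s markup changes, it runs on a small stand-in snippet shaped like the Worldometer table; the table id and the exact ‘font-size’ style are placeholders for whatever the live page actually exposes.

```python
from bs4 import BeautifulSoup

# Stand-in for the live Worldometer page; the table id and inline
# styles here are placeholders, so inspect the live page for the
# true values
SAMPLE_HTML = """
<table id="usa_table_countries_today">
  <tr><td style="font-size:15px">\nNew York</td><td>430,000</td></tr>
  <tr><td style="font-size:15px">\nCalifornia</td><td>410,000</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Several tables sit on the page, so grab the right one by its id
table = soup.find("table", id="usa_table_countries_today")

states = []
# The state names carry a font size the numeric cells lack; take only
# the text of each matching cell and strip the leading newline
for cell in table.find_all("td", style="font-size:15px"):
    states.append(cell.text.replace("\n", ""))
```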
Acquiring The Daily New Data
Now that we have collected and specified our index, we can move on to acquiring the data we desire. On the Worldometer website, there are two types of numerical data we might want. One is cumulative data such as total cases, cases per million, deaths per million, and total tests. The other is daily updated data such as new daily cases and new daily deaths. It is best to tackle these separately, as it’s difficult to access all of them together but rather simple to do so individually. Let us begin with the new daily data.
As we did with the states, we acquire the new daily data by exploiting some unique aspect of its structure. For the states, it was the font size; that is not the case here. If we look at the Worldometer website, we rapidly observe that the new daily case numbers are highlighted in yellow while the new daily deaths are highlighted in red, whereas the rest of the data has a white background. So, we can parse out these pieces of information by targeting their styles.
As we did with the states, we specify the data table as our initial tag. Then, as in most HTML tables, the data is stored in ‘td’ tags, so we specify this next. Finally, we target the style attribute of each cell. We then clean up the acquired data with the replace function and append it to a list. We demonstrate the code utilized to execute this below:
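A sketch of this routine, including the zero-padding discussed next, appears below. Again it runs on a stand-in snippet: the yellow and red background styles are placeholders, and one state is left without new data to mimic the live site early in the day.

```python
from bs4 import BeautifulSoup

# Stand-in snippet: new daily cases carry a yellow highlight and new
# daily deaths a red one (placeholder styles). California has no new
# data yet, mimicking the live site before its first daily update.
SAMPLE_HTML = """
<table id="usa_table_countries_today">
  <tr><td style="font-size:15px">\nNew York</td>
      <td style="background-color:#FFEEAA">+700</td>
      <td style="background-color:red">+30</td></tr>
  <tr><td style="font-size:15px">\nCalifornia</td>
      <td style="background-color:#FFEEAA"></td>
      <td style="background-color:red"></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="usa_table_countries_today")

new_cases, new_deaths = [], []
for cell in table.find_all("td", style="background-color:#FFEEAA"):
    text = cell.text.strip().replace("+", "").replace(",", "")
    if text:                      # keep only cells that hold data
        new_cases.append(text)
for cell in table.find_all("td", style="background-color:red"):
    text = cell.text.strip().replace("+", "").replace(",", "")
    if text:
        new_deaths.append(text)

NUM_STATES = 2  # length of the states list built earlier

# At certain times of day some cells are still empty, so pad each list
# with zeros until it holds one entry per state
while len(new_cases) < NUM_STATES:
    new_cases.append("0")
while len(new_deaths) < NUM_STATES:
    new_deaths.append("0")
```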
Now, if you look at the code for acquiring the cases and deaths, you’ll notice two separate while loops. This is where understanding the mechanics of the website you’re working with becomes very important. Worldometer is a live-updated website: new daily data is added periodically throughout the day, which means the values at these locations change frequently. The significance of this is that at certain times of day, some of these cells may contain no data. Therefore, we want to identify the number of states with empty cells and fill those spaces in with zeros.
Accessing Cumulative Data
Worldometer offers a wide variety of data, but we don’t want all of it. In fact, what do we want? In the wise words of the econometrician Ardian Harri, “As a data person, more is often better.” I agree with this conviction, so we want to extract as much usefulness as we can from what Worldometer has to offer. At the same time, we don’t want to extract items we can derive from our own calculations. These include tests per million, total deaths (which can be computed from deaths per million, a figure that offers more information), active cases, and more. For that reason, we settle for total cases, cases per million, deaths per million, and total tests.
With that being said, we execute code similar to the previous example, making sure to specify the desired items with their respective tags. However, when we execute the code, it’s not going to stop at each state and loop through; rather, it acquires the data from the entire table at once. So, after we put all of this new data into a list, we need to convert it into a list of sublists with four items in each: total cases, cases per million, deaths per million, and total tests for each state. In this way, associating a state with its respective data is as simple as aligning the sublists with the state list. The code to execute this appears as follows:
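The sketch below illustrates the regrouping on stand-in data; the ‘cumulative’ class is a placeholder selector, since the live table distinguishes these cells differently, and the numbers are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in rows: each state row carries four cumulative cells in a
# fixed order (total cases, cases/1M, deaths/1M, total tests). The
# 'cumulative' class and all values are placeholders.
SAMPLE_HTML = """
<table id="usa_table_countries_today">
  <tr><td style="font-size:15px">\nNew York</td>
      <td class="cumulative">430,000</td><td class="cumulative">22,000</td>
      <td class="cumulative">1,650</td><td class="cumulative">5,000,000</td></tr>
  <tr><td style="font-size:15px">\nCalifornia</td>
      <td class="cumulative">410,000</td><td class="cumulative">10,300</td>
      <td class="cumulative">220</td><td class="cumulative">7,100,000</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="usa_table_countries_today")

# The parse walks the whole table at once, so the values arrive as one
# flat list rather than stopping at each state
flat = [cell.text.replace("\n", "")
        for cell in table.find_all("td", class_="cumulative")]

# Regroup into sublists of four so each sublist aligns with one state
cumulative = [flat[i:i + 4] for i in range(0, len(flat), 4)]
```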
An Organizational Note
Now, as a matter of convenience, the Pandas library is very efficient at storing data supplied in the form of dictionaries. So, we could have created a dictionary associating each state with its respective information.
However, if you haven’t had much experience with BeautifulSoup, you may have missed something very significant. All of our parsing code has utilized the ‘.text’ attribute, which allows us to extract the text from the HTML tags. The catch is that ‘.text’ returns the information as a string. This is fine and dandy for our state information; the states are words, and we wouldn’t expect to conduct any computations on them.
On the other hand, this is quite a problem for our numerical data. If these various numbers come back to us as strings, it will be impossible to execute any kind of computation or modeling on them. Therefore, we need to make sure to convert these items to integers. While we execute this string-to-integer conversion, we might as well bolster the transparency of our program by dividing our items of information into their own lists.
This means that rather than having one conglomerate list of nested sublists containing total case, case per million, death per million, and total test data, we will create separate lists for each of these items while also converting their entries to integers. The code to do this appears as follows:
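A sketch of the conversion, using hypothetical values in place of the scraped strings:

```python
# Example data as it would arrive from the parse: strings with comma
# separators, grouped four per state (values are hypothetical)
cumulative = [
    ["430,000", "22,000", "1,650", "5,000,000"],
    ["410,000", "10,300", "220", "7,100,000"],
]

def to_int(value):
    """Strip comma separators and convert a scraped string to an int."""
    return int(value.replace(",", ""))

# Unpack each sublist into its own typed list for transparency
total_cases = [to_int(row[0]) for row in cumulative]
cases_per_million = [to_int(row[1]) for row in cumulative]
deaths_per_million = [to_int(row[2]) for row in cumulative]
total_tests = [to_int(row[3]) for row in cumulative]
```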
Developing the DataFrame
Now, the nuts and bolts of data storage and organization will not be discussed here. Extensive attention was paid to this subject in the previous series on Pandas, so I will leave it to the reader’s discretion whether to take another look at that material.
First things first: when creating a Pandas data frame from multi-dimensional data, we want to specify the index. As discussed at the outset, this will be the states list. Once we’ve done this, it’s as simple as specifying a dictionary where the column labels are the keys and each respective list is the corresponding value. The code for this function appears as follows:
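A sketch of the data frame construction, with hypothetical parsed lists standing in for the scraped data; the column labels are illustrative choices:

```python
import pandas as pd

# Hypothetical parsed results standing in for the scraped lists
states = ["New York", "California"]
total_cases = [430000, 410000]
cases_per_million = [22000, 10300]
deaths_per_million = [1650, 220]
total_tests = [5000000, 7100000]
new_cases = [700, 0]
new_deaths = [30, 0]

# Column labels as keys, the parsed lists as values, states as the index
df = pd.DataFrame(
    {
        "Total Cases": total_cases,
        "Cases/1M": cases_per_million,
        "Deaths/1M": deaths_per_million,
        "Total Tests": total_tests,
        "New Cases": new_cases,
        "New Deaths": new_deaths,
    },
    index=states,
)
```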
The Essence of Data Modeling
When it comes down to modeling data, the first question you need to answer is: “What do I want to convey?” The way you answer this depends on the level of complexity you are seeking to undertake, as well as the audience for whom your program is intended. Because we are working with numerical data, a series of bar graphs should provide a reasonable comparison between states. A later article focuses exclusively on more sophisticated methods of modeling, so let the bar graphs suffice for now.
Once we have answered the question of what to model, we want to determine the platform for our modeling. The simplest platform allowing easy export from Pandas is certainly Excel, especially in consideration of Pandas’ built-in ‘.to_excel’ function. So, we will create four bar graphs showing the total cases, deaths per million, new daily cases, and new daily deaths in each state.
Generating these bar graphs is really a repetition of the same code with different titles and inputs. We invoke the Pandas data frame containing our data and specify the items we want for each bar graph, setting them as the respective ‘x’ and ‘y’ axes. Subsequently, we make sure to specify the type of graph we use (a column plot, not a histogram). Then we specify the location in the Excel file to place the graph, as well as its size, followed by the title and axis labels. Once this is done, we do the same for the rest of the graphs, followed by a rapid export. The code to do so looks as follows:
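The sketch below builds one of the four graphs using the XlsxWriter engine behind ‘.to_excel’. The file name, sheet name, chart placement, and data are illustrative; the same chart block would be repeated with different columns and titles for the other three graphs.

```python
import pandas as pd

# A small stand-in for the scraped data frame built earlier
df = pd.DataFrame(
    {"Total Cases": [430000, 410000]},
    index=["New York", "California"],
)

# XlsxWriter is an engine that supports native Excel charts
writer = pd.ExcelWriter("covid_model.xlsx", engine="xlsxwriter")
df.to_excel(writer, sheet_name="Data")
workbook = writer.book
worksheet = writer.sheets["Data"]

# A column plot (not a histogram)
chart = workbook.add_chart({"type": "column"})
n = len(df)
chart.add_series({
    "name": "Total Cases",
    "categories": ["Data", 1, 0, n, 0],  # state names in column A
    "values": ["Data", 1, 1, n, 1],      # totals in column B
})
chart.set_title({"name": "Total Cases Per State"})
chart.set_x_axis({"name": "State"})
chart.set_y_axis({"name": "Total Cases"})
chart.set_size({"width": 720, "height": 480})

# Place the chart on the sheet and export the workbook
worksheet.insert_chart("D2", chart)
writer.close()
```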
The COVID-19 Data Model
The simple code we just executed produces rather neat and organized data. A piece of the data frame created appears as follows:
Furthermore, let’s take a look at one of the bar plots we created:
This bar plot shows the total cases per state, as indicated by the chart title. Each state is associated with its own bar, whose height represents the total number of cases in that state. If this graph seems boring, worry not: our next series explores methodologies for customizing data models, in addition to more sophisticated models. This article sought simply to give you a taste of the process all the way from scraping to storage to modeling. If you found it helpful, consider checking out the article on creating a universal parser, as this can make your life much simpler when you approach parsing data outside the realm of HTML, such as XML or JSON.