What is a Web Scraper?
Web scrapers are among the most ubiquitous tools in data science; if you pursue a career in data science, or simply want to work with data, a solid understanding of how to build them is essential. Web scraping is what grants access to the wealth of data available on the internet. Here I will walk through the process of designing a web scraper from start to finish. This particular web scraper uses Python to extract data on COVID-19 and model it.
An Introduction to HTML
No worries, one need not be an expert in HTML in order to access the data encoded therein. However, a basic understanding of what HTML is provides a concrete foundation upon which to build your very own web scraper!
HTML stands for HyperText Markup Language, and it is the structure that lies beneath the content of most websites, perhaps even the one you are reading now. HTML organizes a page's content according to the tags associated with each piece of it. If this does not make much sense yet, have no fear: as we build up this web scraper, the workings of HTML will become clearer. Suffice it to say that HTML is the form of content we work with when constructing our web scraper.
Here is an example of HTML if you have never seen it before:
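A minimal page might look like this (a toy example, not taken from any particular site); notice how each piece of content sits between an opening and a closing tag:

```html
<!-- tags such as <h1> and <p> label each piece of content -->
<html>
  <head>
    <title>My Page</title>
  </head>
  <body>
    <h1>A Heading</h1>
    <p>A paragraph of text.</p>
  </body>
</html>
```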
Accessing HTML Content With a Web Scraper
I will first walk through the process of obtaining HTML-based content. Suppose a given project requires a representation of COVID-19 statistics. We may begin by looking for useful data sources online. With a bit of research, we encounter Worldometer, a database devoted to a variety of subjects, which has a sub-page devoted to COVID-19 statistics for the US. It is found here.
If we visit this website, we find it is designed so that a user can easily interpret the data. That presentation, however, is driven by structural code: when we pull the data down, it arrives in a raw format.
Let’s begin by accessing the HTML source code of this website. Open a Python session; I prefer using IPython for interactive work and Atom as a text editor. For requesting URLs in Python, the standard library urllib is particularly useful.
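A minimal sketch of the request step might look like the following; the `Mozilla/5.0` user-agent string is an assumption standing in for whatever browser identifier the original code used:

```python
from urllib.request import Request, urlopen

URL = "https://www.worldometers.info/coronavirus/country/us/"

# Some sites reject urllib's default user agent, so we send a browser-like one.
request = Request(URL, headers={"User-Agent": "Mozilla/5.0"})

# Uncomment to fetch the page (requires a network connection):
# html = urlopen(request).read().decode("utf-8")
# print(html[:300])  # the first few hundred characters of raw HTML
```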
Executing this code returns the full HTML content of the web page. What we just did was acquire all of the HTML structural information, as well as the data itself. This is roughly what it looks like:
Using BeautifulSoup for Web Scrapers
We successfully acquired access to the content of the web page we desired, but now we must separate the useful from the useless. To do this we must decide what type of data we want. For simplicity's sake, we will acquire the state name, state population, number of COVID-19 cases, and number of COVID-19 deaths.
If you’ve never used BeautifulSoup before, let alone heard of it, this library is used for parsing HTML data. Because it is not part of the Python standard library, it must be installed separately with the following Terminal command:
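Assuming you use pip, the command would be something like this (note that the package name, beautifulsoup4, differs from the import name, bs4):

```shell
pip install beautifulsoup4
```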
With BeautifulSoup installed, we may now attempt to acquire our desired data by parsing the HTML. First, we create a BeautifulSoup object which stores the HTML content. We do this by modifying our previous code:
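A sketch of this step, using a short stand-in string in place of the HTML returned by urlopen so it runs offline (the real page is far larger, but the parsing step is identical):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML string returned by urlopen in the previous step.
html = "<html><body><table><tr><td>42</td></tr></table></body></html>"

# The BeautifulSoup object stores the parsed HTML and lets us search by tag.
soup = BeautifulSoup(html, "html.parser")
print(soup.find("td").get_text())
```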
The input argument ‘headers’ carries a ‘User-Agent’ value, which can be thought of as a quality-control step for gaining access to the website. A discussion of the intricacies of BeautifulSoup is well beyond the scope of this article; a future article will break it down. Suffice it to say our HTML content is organized according to various tags, and we index those tags to acquire the content we desire.
Web Scraper Execution
Note that in the HTML file, we have a tag called ‘<tbody>’, which marks the start of the data table. Beneath this tag are sub-tags, most notably the ‘td’ tags. Within these ‘td’ tags lives the data we desire, so we write code that parses them out.
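A sketch of that line, run against a toy fragment mimicking the Worldometer table so it is self-contained; the exact style attribute is an assumption standing in for whatever font styling the real cells carry:

```python
from bs4 import BeautifulSoup

# Toy fragment standing in for the real data table.
sample_html = """
<table><tbody>
<tr><td style="font-weight: bold">1,500,000</td>
    <td>plain cell, not matched</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Index into the table, then findAll the 'td' cells whose style marks the font.
tag = soup.find("table").findAll("td", attrs={"style": "font-weight: bold"})
print(tag)
```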
The line is long, but let’s break down what this code is doing. First, call the BeautifulSoup object which stores the HTML content and index into the ‘table’ tag. Second, use the “findAll” function to find all instances of the sub-tag ‘td’. Finally, add a restriction: we want only ‘td’ tags with specific attributes, in this case characteristics of the font. All of these are stored as individual items in the variable tag.
This output is just a snapshot of hundreds of lines that have our ‘td’ tags stored. Now, this is great, but the extra markup is useless to us; we want the data. So, we next extract the data within these lines. To do so, write a quick text parser:
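A runnable sketch of that parser, again using a toy table in place of the live page; note the leading ‘\n’ baked into each cell, matching what the real page produces:

```python
from bs4 import BeautifulSoup

# Toy cells carrying the leading '\n' the real page includes on each entry.
sample_html = ("<table><tr><td>\n1,500,000</td>"
               "<td>\nAlabama</td><td>\n40,000</td></tr></table>")
soup = BeautifulSoup(sample_html, "html.parser")
tag = soup.findAll("td")

data = []
for item in tag:
    text = item.get_text().strip("\n")   # drop the leading newline
    if text.replace(",", "").isdigit():  # keep only numeric entries
        data.append(text)

print(data)
```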
BeautifulSoup content comes in a variety of forms, but the content held between two tags is text. We iterate through the tag object and grab the text with the item.get_text() function. If we did this alone, we would acquire a mess of data with empty strings and other useless garbage. So, we add a condition to keep only items that are numbers.
Because of the way this particular HTML is designed, each item we pull down has a new-line symbol ‘\n’ at its leading edge, so we must rid ourselves of this pest. We are then ready to append the data to our list. Let’s see what we get:
As you can see, we have a neat and tidy list storing all of our data. Time to clean up.
Cleaning Up Data
One of the most tedious, but necessary, steps of working with data is cleaning it into a usable format. Note that we only want the number of cases and number of deaths in each state; in this list, we have a bit more than that. If we look at the Worldometer website, we see that each state has five different counters associated with it, and we only want two of them. For that reason, we convert our current list into a nested list that associates each state with its own numbers. This is how we execute it:
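A sketch of the grouping step, with toy numbers standing in for the scraped values:

```python
# Two states' worth of counters, five per state (toy numbers).
data = ["500,000", "8,000", "10,000", "490", "4,900,000",
        "120,000", "900", "600", "40", "730,000"]

# Slice the flat list into one five-item sub-list per state.
states = [data[i:i + 5] for i in range(0, len(data), 5)]
print(states)
```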
This code creates a sub-list for every five items in the list. Because each state has five counters associated with it, each sub-list holds one state's data. Let’s see what this looks like:
As you can see, each item in our nested list holds its own five items, but recall that we have no use for some of these. On the Worldometer website, the counter for cases comes before the counter for deaths, with another counter in between; so in every sub-list, we can index the first position to obtain the cases in that state and the third position to obtain the deaths. We can also put these items in their own lists. Let’s see how this is done:
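A sketch of the extraction, again with toy numbers; cases sit at index 0 and deaths at index 2 of each sub-list, matching the counter order described above:

```python
# One five-item sub-list per state (toy numbers).
states = [["500,000", "8,000", "10,000", "490", "4,900,000"],
          ["120,000", "900", "600", "40", "730,000"]]

cases, deaths = [], []
for state in states:
    for i, item in enumerate(state):
        if i == 0:          # first counter: total cases
            cases.append(item)
        elif i == 2:        # third counter: total deaths
            deaths.append(item)

print(cases, deaths)
```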
As you can see, we have an empty list for cases and an empty list for deaths. We iterate through each item of each state with a nested for-loop: if the item occurs at index position 0, we append it to the cases list; if it occurs at index position 2, we append it to the deaths list. Although there are more items associated with each state, we simply ignore the rest.
Modeling With Pandas
Now, data in and of itself is not that useful; we must consider ways to store and represent it. Much of this will be expanded upon in later articles, but here I will provide a brief way to do both. We create a dictionary that associates the label “Cases” with the cases list and the label “Deaths” with the deaths list, then use a Pandas dataframe (to be discussed later) to organize our data. The code to do so is as follows:
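A sketch of that step; the values and the state-name row labels are toy placeholders, since in the full scraper these lists come from the parsing steps above:

```python
import pandas as pd

# Toy values; the real lists come from the parsing step.
cases = [500000, 120000]
deaths = [10000, 600]
state_names = ["Alabama", "Alaska"]  # hypothetical row labels

# Map each column label to its list, then build the dataframe.
table = {"Cases": cases, "Deaths": deaths}
df = pd.DataFrame(table, index=state_names)
print(df)
```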
Now, this produces a pretty nifty dataframe to look at:
Now, we could have put the state names at index values 0, 1, 2, etc., but note that the topic of Pandas for data modeling will be discussed extensively later.
The importance of mathematics cannot be overstated when it comes to working with data. In particular, data of many dimensions may require some higher-level mathematics. Check out this article to see how vector calculus can make you a better programmer. Believe me when I say that implementing higher-level mathematics is truly a game changer for developing an efficient web scraper.
The Web Scraper Take Away
This is not the end-all, be-all of web scraping. As with any coding project, there is more than one way to skin a cat, perhaps too many. Nevertheless, this will help you get your feet wet and give you a basic understanding of how to code web scrapers. Later discussions will give a thorough breakdown of HTML, BeautifulSoup, other website structures such as XML, and data modeling with Pandas. Hopefully this example will help you on your way to scraping the internet in no time.