Web Crawling: An Introduction to Automated Internet Surfing

What is Web Crawling?

If you read our previous article on web scraping, you might recall that we designed a web scraper that drilled into a website containing COVID-19 data. We wrote another article that describes the process of designing a universal parser for any of your web scraping desires. In both of these, you may have noticed that all of the data was contained within a single page. However, data is seldom so neatly confined to a single location. Rather, it often spreads across multiple sources which must be individually accessed. This is where the task of web crawling becomes involved.

Designing a Web Crawler

In this article, we develop a web crawler which goes into each individual country on the COVID-19 data site and brings back the data for each of them. The main coronavirus data website can be found here. I wish to bring to your attention the differences between web scraping and web crawling. They are few, but they result in starkly different behaviors. In both web scraping and web crawling, a program accesses a web page and extracts a particular object. Where web crawling differs from web scraping is that the piece of data the program extracts from the page is often a URL. The purpose of this is so that the acquired URL can then be visited in turn, and data may be extracted therefrom, whether that data is yet another URL or an actual piece of information.

This is exactly what we intend to do. The COVID-19 website possesses a worldwide page where it lists the statistics for every country. In this table, there also exists a specific URL for each country. If a user follows this link, they arrive at a page full of more in-depth statistics and models. The plan is to begin at the worldwide page and access each country's link. Once the program reaches an individual country's page, it drills into that page's data. From this page, our crawler extracts the total number of cases and the total number of deaths for that country. The crawler then moves on to the next country on the list. Let us begin with an initial design for our novel web crawler.

The Base of the Web Crawler

As was the case with our original web scraper, the web crawler relies on the same imports and the same use of BeautifulSoup. If you desire a more in-depth analysis of utilizing BeautifulSoup, check out this article. First, we establish the starting point, which is the primary web page housing the URLs for all of the countries. We invoke urllib to open the URL and store the content of the HTML page as a BeautifulSoup object. As you may observe from the code below, there is no difference between this code for web crawling and the code employed for web scraping.
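
A minimal sketch of that setup is shown below. The starting URL assumes the Worldometers coronavirus page, and the variable names are assumptions rather than a definitive listing.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Starting point: the worldwide page that lists every country
# (assumed here to be the Worldometers coronavirus site).
base_url = "https://www.worldometers.info/coronavirus/"

# Open the URL and store the HTML content as a BeautifulSoup object.
html = urlopen(base_url)
soup = BeautifulSoup(html, "html.parser")
```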

Parsing the Primary Website

The first task for our web crawler is to access the primary website containing the URLs we need. Let us first take into consideration the HTML content we must parse.

The first thing one notices is the 'table' tag, which marks the beginning of the data table. Associated with this tag is the attribute "id=main_table_countries_today". We use this attribute to target this table specifically. Now, if we dig deeper into the table, we must find the URLs associated with each country. Let's take a look at the structure of one of these individual country blocks.

Early in the block for the United States, we observe a 'td' tag. Within this tag, there is a sub-tag, 'a', which has an attribute 'href'. The 'href' attribute is a universal label that specifies a link or link attachment. So, if we begin by targeting the 'td' tags and single out those possessing an 'a' sub-tag, then we may specifically parse out the 'href' attribute for each country. The code that performs these functions is shown below.
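
Here is a minimal sketch of that step, reusing the soup object from the setup above; the loop structure is an assumption, and the markup shown in the comment is purely illustrative.

```python
# Target the worldwide table by its id attribute.
table = soup.find("table", attrs={"id": "main_table_countries_today"})

# Each country row contains a cell shaped roughly like:
#   <td ...><a href="country/us/">USA</a></td>   (illustrative only)
countries = []
for td in table.find_all("td"):
    link = td.find("a", href=True)
    # Keep only cells whose anchor actually carries an 'href' attribute.
    if link is not None:
        countries.append(link["href"])
```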

Looking at the loop, you'll see a conditional statement. This check verifies that the 'href' attribute is present within the parsed-out item. If it is, then we append the URL attachment to our list.

Modifying URLs for Web Crawling

The preceding code brings back the URL terminators that specify each respective country's page. However, these are not full URLs. We must attach the URL prefix in order to construct fully accessible URLs. To do so, we incorporate the following code:
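
A sketch of that step is shown below, building on the 'countries' list gathered above; the dictionary name and the way the country key is derived from the URL fragment are assumptions.

```python
# Attach the URL prefix to each terminator and store the full address
# in a dictionary keyed by the country it belongs to.
country_urls = {}
for fragment in countries:
    # e.g. a fragment such as "country/us/" would yield the key "us".
    name = fragment.strip("/").split("/")[-1]
    country_urls[name] = base_url + fragment
```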

This code takes the URL terminators from the 'countries' list and attaches each one to the URL prefix. The full URL is then added to a dictionary, where it is associated with the country for which it is defined. This is purely an organizational step that may be useful for later processing.

Web Crawling and Generating Data

With the functional URLs now established, we must actually access these pages and drill into them, returning the number of cases and deaths for each one. Let's look at the HTML for one of these individual country pages so we may discern what it is we desire to extract.

Note that for the cases, there is an 'h1' tag which denotes the coronavirus cases. This is followed by a 'div' sub-tag with an 'id' attribute. Within this 'div' tag is a 'span' sub-tag. If we parse this 'div' tag specifically, then parsing the 'span' sub-tag will deliver the number of cases. In a similar fashion, we may see another 'h1' tag which denotes the number of deaths. If we target this one and parse out its 'span', we will obtain the number of deaths. This is what the code would look like:
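
The sketch below reuses the country_urls dictionary from the previous step. The heading text ('Coronavirus Cases:', 'Deaths:') and the illustrative markup in the comments are assumptions drawn from the description above, not verified page source.

```python
cases = {}
deaths = {}

for name, url in country_urls.items():
    # In practice you would likely throttle these requests.
    page = BeautifulSoup(urlopen(url), "html.parser")

    # Each counter block looks roughly like (illustrative only):
    #   <h1>Coronavirus Cases:</h1>
    #   <div id="..."><span>1,234,567</span></div>
    for heading in page.find_all("h1"):
        title = heading.get_text(strip=True)
        counter = heading.find_next("div")
        if counter is None or counter.span is None:
            continue
        value = counter.span.get_text(strip=True)
        if "Cases" in title:
            cases[name] = value
        elif "Deaths" in title:
            deaths[name] = value
```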

Note the syntax of these parsing actions. We must use two different syntaxes in order to parse out the two items of information. Also take note of the fact that we store these items in dictionaries associated with their respective countries. While this is an optional step, it makes it significantly easier to put our items into a Pandas data frame.

Organizing Data From Web Crawling

Data is useless unless we have a way to store it and model it. While the intricacies of Pandas and data modeling are beyond the scope of this article, I will demonstrate how this particular data is to be organized. Take note of the fact that this has been made much easier simply by storing our data in a dictionary.
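
What follows is a sketch of that organization step, assuming the cases and deaths dictionaries produced by the crawl above; the column labels are placeholders.

```python
import pandas as pd

# Index the frame with a tuple of the crawled country names; because the
# cases and deaths dictionaries share these keys, pandas aligns them.
country_index = tuple(country_urls)
covid_df = pd.DataFrame(
    {
        "Total Cases": pd.Series(cases),
        "Total Deaths": pd.Series(deaths),
    },
    index=country_index,
)

print(covid_df.head())
```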

With this code, we specify the number of cases from the cases dictionary and the number of deaths from the deaths dictionary. We then index the frame using the list of countries, converted to a tuple. The Pandas DataFrame knows which data to use because all of these share an overlapping index. If you're looking for more insight on using Pandas data frames, check out this article.

The Takeaway

Web crawling executes a function that differs starkly from web scraping. However, it does not take much more effort to properly code it and extract the information desired. All that is really needed is a bit of handiness with HTML, and you're on your way to coding a web crawler. One particular aspect of web crawlers is the fact that they all function in relatively similar fashions. The primary difference is the content of the information being extracted. While we just built a web crawler from scratch, if you create a utilities class for a web crawler, a lot of this code can be recycled, making it much easier in the future to simply tweak a few items and get your code running.

I would like to mention that our previous article on web scraping is very useful if one desires more insight into the intricacies of BeautifulSoup. Furthermore, this article will provide a more robust discussion of the applications and various perspectives of parsing HTML. It may be found here.

If you desire to learn more about the Pandas data structures and how to use them, I provide links for them below.

  1. Constructing/Indexing Pandas Data Structures
  2. Operating On Pandas Data Structures With Ufuncs
  3. Designing Universal Parsers

Checking these out will help to provide you with a very thorough background in scraping and crawling, accessing and storing data, as well as operating on large data objects. If you’ve found this present article to be helpful, I highly recommend these.
