Implementing Parsers For Web Scraping
Web scraping is one of the most ubiquitous methods in data science: before you can operate on data, you must first acquire it. Whatever the source, parsing is the critical step of web scraping, and different parsers and techniques are required depending on the situation, primarily the format and type of the data you are accessing. This article explores several styles of parsers that you can implement in a web scraper to acquire the data you need.
If you’re curious about the process of web scraping in general, check out this article, which demonstrates how to build scraping into your design. It would also be prudent to read the articles on Pandas data structures and on applying operations to Pandas, as these will be essential when designing your complete web scraper.
Perhaps the most basic parsing task is parsing the HTML files found on the internet. If you have no prior background in HTML, check out this article. For now, suffice it to say that HTML organizes a page’s source code with content tags. It standardizes the structure of the website and sits beneath the graphical interface that the website’s user sees.
So, what do you do if the website you are parsing is in HTML? First, you can verify this by right-clicking the page and selecting “View Page Source”. In the top left you will see a label that identifies the document as HTML, indicating that the page’s content is structured in HTML format. HTML has a very particular form, consisting of groups of tags at varying levels of indentation. Here is an example of HTML data:
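A minimal illustration of that nested tag structure (a hypothetical snippet for illustration, not the markup of any actual page):

```html
<html>
  <body>
    <table>
      <tr>
        <td style="font-weight: bold; text-align:right">1,000,000</td>
        <td style="font-weight: bold; text-align:right">50,000</td>
      </tr>
    </table>
  </body>
</html>
```

Note how the `table` tag encloses rows, the rows enclose `td` cells, and each cell carries `style` attributes that modify its display.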
Parsing HTML content revolves primarily around specifying the tags that enclose the information you are targeting. Sub-tags nested beneath these larger tags associate information with a particular level of the content, and parsing by these sub-tags gives your parser greater specificity. Finally, tags carry attributes that modify how the content is displayed, changing the color of the text or the font, among other things. You can parse directly for these attributes as well, which provides even finer control.
Example HTML Parser
So, you want to build a web scraper that operates as an HTML parser. The most common approach is to use BeautifulSoup. If you have not used this library before, check out this article for more insight into HTML and BeautifulSoup; it peels back the layers of this example in greater detail.
First, we will generate a BeautifulSoup object which stores the HTML content of the website being parsed. Write the code in the following format:
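A minimal sketch of that setup, assuming the `bs4` library. An inline HTML snippet stands in for the downloaded page here so the example is self-contained; in the real scraper, the HTML would be fetched first (for instance with `requests.get(url).content`):

```python
from bs4 import BeautifulSoup

# In the live scraper the HTML comes from the network, e.g.:
#   import requests
#   html = requests.get(url).content
# Here an inline snippet stands in for the downloaded page.
html = """
<table>
  <tr><td style="font-weight: bold; text-align:right">1,000</td></tr>
</table>
"""

# The BeautifulSoup object stores the page's HTML content for parsing
soup = BeautifulSoup(html, "html.parser")
```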
Now, for this example, I specify the URL to be parsed as a website that stores COVID-19 data. However, I don’t want all of the data stored on the website; all I want are the values for total case numbers and deaths. For that reason, anyone coding a parser needs to spend time inspecting the page’s HTML to find a unique tag or attribute that targets the specific data. Let me show you the tag target I used to pull back the specific data.
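A sketch of that targeting step, with a small stand-in table rather than the live page (the real markup is what you would inspect in the page source):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real page's table is what you would inspect
html = """
<table>
  <tr>
    <td style="font-weight: bold; text-align:right">1,000</td>
    <td>ignored</td>
    <td style="font-weight: bold; text-align:right">50</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# 'table' marks the beginning of the site's data table
table = soup.find("table")

# A lambda over the style attribute keeps only the bold, right-aligned cells
cells = table.findAll(
    "td",
    style=lambda s: s is not None
    and "font-weight: bold" in s
    and "text-align:right" in s,
)
values = [cell.get_text() for cell in cells]
```

The lambda receives each cell’s `style` attribute (or `None` when the cell has none) and returns `True` only for the cells whose styling matches the targeted data.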
I’ll break this down. The BeautifulSoup object stores the totality of the HTML content. The ‘table’ variable that follows it is a tag marking the beginning of the website’s data table, which is where the data is stored. The ‘findAll’ function then parses out all of the items specified within its parentheses. The first argument is ‘td’, another specific tag, which holds the individual integer values in the data table. A lambda function specifies which attributes of the ‘td’ tag to target; in particular, it keeps values that are bold-face and aligned to the right of their cell.
Organizing Data Output
Now, note that for every state in the data table, this function returns all of the values associated with that state. However, recall that this parser seeks to return only the total cases and total deaths. So, we must create an individual sub-list for each state and then parse out those two specific items. We can execute this with the following code:
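A sketch of that grouping step. The column count and positions below are assumptions made for illustration; the real parser would use whatever layout the site’s table actually has:

```python
# Flat list of cell values as the parser might return them; one run of
# columns per state (values and positions here are hypothetical)
values = [
    "California", "4,000,000", "60,000", "3,900,000",
    "Texas",      "3,000,000", "55,000", "2,900,000",
]

COLUMNS_PER_STATE = 4  # assumed number of columns in each state's row

# Split the flat list into one sub-list per state...
states = [values[i:i + COLUMNS_PER_STATE]
          for i in range(0, len(values), COLUMNS_PER_STATE)]

# ...then keep only the two columns we want: total cases and total deaths
cases_and_deaths = [[row[1], row[2]] for row in states]
```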
By doing this, we have successfully collected our data. Storing data is beyond the scope of the present article; check out this article, which describes the process of data storage with Pandas. Nevertheless, we can create a quick Pandas DataFrame to hold our collected data.
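A quick sketch of that DataFrame, using placeholder numbers (not real figures) for five states:

```python
import pandas as pd

# Illustrative placeholder values only -- not real case counts
data = {
    "State": ["California", "Texas", "Florida", "New York", "Ohio"],
    "Total Cases": [4_000_000, 3_000_000, 2_300_000, 2_100_000, 1_100_000],
    "Total Deaths": [60_000, 55_000, 38_000, 53_000, 20_000],
}

# Build the DataFrame from the parsed results
df = pd.DataFrame(data)
print(df)
```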
This DataFrame is simply used to reveal the data associated with five specific states. If you wish to view this example in greater depth, you may do so here.
Designing a JSON Parser
Now, the technique demonstrated here employs somewhat more sophisticated coding. This is not obligatory; the same technique can be implemented with plain functions. Nevertheless, the parser presented here relies on object-oriented classes.
The Accessing_URL class takes a list of URLs that the user wants to scrape. Its ‘access_method_a’ function then opens each one individually. ‘json.loads’ and ‘json.dumps’ are the functions that organize the content, which is then stored in a data structure. The code appears as follows:
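A sketch of such a class, under assumptions: the class and method names mirror the article, but the `load_text` helper and the exact storage layout are hypothetical:

```python
import json
from urllib.request import urlopen

class Accessing_URL:
    """Fetches a list of JSON endpoints and stores their parsed content.
    A sketch; the helper `load_text` is a hypothetical addition."""

    def __init__(self, url_list):
        self.url_list = url_list
        self.json_data = []  # parsed objects accumulate here

    def load_text(self, raw_text):
        # json.loads turns the raw text into Python objects;
        # json.dumps can round-trip them back to a normalized string
        obj = json.loads(raw_text)
        self.json_data.append(obj)
        return obj

    def access_method_a(self):
        # Open each URL individually and store its parsed content
        for url in self.url_list:
            with urlopen(url) as response:
                self.load_text(response.read().decode("utf-8"))
        return self.json_data
```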
Now that the content of the JSON website is stored in a data structure, the desired content can be parsed out. The Parsing_File class performs this operation. It takes as arguments the JSON object and an output limit (if one is desired). All the programmer must do is specify the keys associated with the desired content. The code appears as follows:
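A sketch of that class, assuming the JSON object is a list of records (dicts); the `parse_keys` method name is a hypothetical stand-in:

```python
class Parsing_File:
    """Pulls the values behind chosen keys out of parsed JSON.
    Takes the JSON object and an optional output limit."""

    def __init__(self, json_obj, limit=None):
        self.json_obj = json_obj  # expected: a list of dicts
        self.limit = limit

    def parse_keys(self, keys):
        # Collect the value behind each requested key, record by record,
        # stopping once the optional output limit is reached
        results = []
        for record in self.json_obj:
            results.append([record.get(key) for key in keys])
            if self.limit is not None and len(results) >= self.limit:
                break
        return results
```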
Executing this code on the JSON object created will return all of the values found in the object.
Parsing Messy Data
While HTML and JSON represent highly organized data, not all data arrives in a useful format. Sometimes it’s, frankly, a jumbled mess. This example presents a general-purpose parser for loosely structured files. Let’s consider how it works.
The access_method_b function takes a list of URLs and accesses them with urllib. The function decodes the content from UTF-8 and stores the content of each website in a data structure. The code appears as follows:
Once we fill this content object with data, the code manually parses it. The code to do this is rather ad hoc, a consequence of the poorly organized data. To parse out a desired value, the user specifies the item that precedes it. The code finds the position of that item in a list of tokens and indexes the position one past it, which returns the desired value. The code appears as follows:
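A sketch of that position-plus-one lookup; the sample content and labels below are hypothetical:

```python
def parse_after(tokens, marker):
    # Find the marker's position, then return the element one past it
    position = tokens.index(marker)
    return tokens[position + 1]

# Split raw, loosely structured text into tokens, then pull the value
# that follows a known label (content here is made up for illustration)
content = "Total Cases: 1000 Total Deaths: 50"
tokens = content.split()
cases = parse_after(tokens, "Cases:")  # -> "1000"
```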
Here is an easy BeautifulSoup workaround to parsing XML and LXML. It’s simply a minor alteration to the HTML parser:
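A sketch of that alteration, with an inline XML document standing in for a fetched page (note that BeautifulSoup’s `"xml"` feature requires the lxml library):

```python
from bs4 import BeautifulSoup

# A small inline XML document stands in for the fetched page
xml_doc = "<states><state><name>Ohio</name><cases>5</cases></state></states>"

# The only change from the HTML parser: pass features="xml"
# (requires lxml; pass features="lxml" to parse with lxml's HTML parser)
soup = BeautifulSoup(xml_doc, features="xml")
name = soup.find("name").get_text()
```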
The only modification made to the HTML parser is specifying the feature; everything else is the same.
Parsing and the Coder
Hopefully the examples provided here are sufficient for your programming needs. As stated previously, if greater depth is desired, articles specific to JSON and XML are soon to come. A detailed web scraper for parsing COVID-19 data can be found here. Additionally, we have several articles demonstrating data modeling and manipulation with Pandas. Check these out for greater insight into programming your own parsers.