Introduction to XML Data
The Topics of Data Analytics series aims to cover the most essential tools for analyzing data. We are still early in this investigation, but have so far covered two important subjects. We began by exploring the features of CSV files and CSV-derived data, and followed with an analysis of JSON and its associated content. In both articles, we examined the file format structure, importing the data, extracting individual items from the data, storing the data, and modeling. We now conduct the same inspection for XML files and their associated information.
This article provides the essential tools for working with XML-derived data by working through an example of cancer clinical trial information in the state of New York. The data is published on data.gov, if you’d like to follow along yourself. We will use the cancer clinical trial information to examine how to understand the XML structure, import the data, parse the data, and store the data. Let’s begin.
Conceptualizing XML Data Structure
As with JSON and CSV, XML is a file format with its own distinctive structure. XML stands for Extensible Markup Language and, as a markup language, it is designed to be readily interpreted by both humans and computers.
An XML document typically begins with a prolog, an optional declaration such as <?xml version="1.0" encoding="UTF-8"?> that identifies the document as XML. Following the prolog come the XML elements. XML organizes its contents with element tags, specifically start tags and end tags. For example, consider the start and end tags of the ‘Data’ element below:
<Data> "Data_Object" </Data>
The Data tag on the left is the start tag, while the tag on the right, marked with a forward slash, is the end tag. Between these two tags is “Data_Object”, the content of the ‘Data’ element.
In some cases, an element may be empty and carry no content. In such instances, the empty element can be written either as a start tag immediately followed by its end tag, <Data></Data>, or as a single self-closing tag, <Data/>.
A tag may also carry one or more attributes, which describe some particular aspect of the element. For example, one attribute of a specific tag may be the category the element belongs to. Suppose an XML document seeks to encode the address of an individual, and this particular address is the residential address of said individual. The XML code for such an endeavor appears as:
<address category='residence'>
    <name>Alan Turing</name>
    <street>1347 E Parkway</street>
    <phone>(888)-535-8255</phone>
</address>
Here, the ‘address’ tag has the ‘category’ attribute, which specifies the type of address. The address tag contains sub-tags for the name, street, and phone number associated with the address. This is the basic structure of XML documents. It’s important to note that while JSON typically stores data as key/value pairs, as in a dictionary, XML stores data as nested elements grouped under tags.
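To see how tags, attributes, and content map onto code, here is a minimal sketch that reads the address example with Python’s standard xml.etree.ElementTree module. The variable names (ADDRESS_XML, address, record) are illustrative choices, not part of any fixed API.

```python
import xml.etree.ElementTree as ET

# The address example from the text, held in a string.
ADDRESS_XML = """
<address category='residence'>
    <name>Alan Turing</name>
    <street>1347 E Parkway</street>
    <phone>(888)-535-8255</phone>
</address>
""".strip()

# fromstring() parses the string and returns the outermost element.
address = ET.fromstring(ADDRESS_XML)

print(address.tag)                 # address
print(address.attrib["category"])  # residence
print(address.find("name").text)   # Alan Turing

# Iterating over an element yields its sub-tags, so the group can be
# flattened into JSON-style key/value pairs when that is more convenient.
record = {child.tag: child.text for child in address}
print(record)
```

The last step shows the contrast drawn above: the XML groups its values under the ‘address’ tag, while the resulting dictionary holds the same values as key/value pairs.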
Viewing the XML File
As with data from any other file type, we must first view the actual file structure from which the information derives. The XML file comes from the data.gov website and stores all of the available clinical trial information for the state of New York. This data includes the particular cancer the clinical trial targets, the ID number of the trial, the date the trial commenced, the individual who runs the trial, the phase of the trial, and the name of the trial.
Note that the primary tag of the XML file is the ‘response’ tag. Beneath this primary tag are the ‘row’ sub-tags, each of which denotes an individual clinical trial and most likely corresponds to a row in a data table. Each clinical trial has several additional sub-tags representing the elements of the trial discussed in the previous paragraph:
<primary_site></primary_site>
<protocol></protocol>
<principal_investigator></principal_investigator>
<date_opened></date_opened>
<study_phase></study_phase>
<title></title>
These tags represent the data items for each individual clinical trial which we seek to access here.
Importing XML Data
Importing content from XML files takes a bit more code than the formats we covered previously. As before, we rely on a library that ships with Python. With JSON we used the ‘json’ library, and with CSV files, we employed the ‘csv’ library. With XML, importing and parsing is handled by the ‘xml.etree’ package. In particular, we use its ‘ElementTree’ module.
The ElementTree module has several useful functions for accessing and parsing an XML file. First, we use the ‘parse’ function, specifying the XML file as input. Here, we investigate the MarylandCancerClinicalTrials.xml file, and thus, this file serves as input to the parse function. We then call the ‘getroot’ method on the resulting tree to acquire the main tag of the XML file. The code for executing this functionality appears as follows:
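A minimal sketch of this import step might look like the following. The load_root helper and the tiny in-memory placeholder document are assumptions added so the example runs without the real file on disk; the nested ‘row’ layout is inferred from the description in this article.

```python
import io
import xml.etree.ElementTree as ET


def load_root(source):
    """Parse an XML file (a path or a file-like object) and return its root element."""
    tree = ET.parse(source)   # builds an ElementTree from the file
    return tree.getroot()     # the outermost element of the document


# Placeholder stand-in for the real file; the values are not real trial data.
SAMPLE = io.BytesIO(
    b"<response><row><row><title>PLACEHOLDER TITLE</title></row></row></response>"
)

root = load_root(SAMPLE)
print(root.tag)  # response

# With the article's file saved locally, the call would instead be:
# root = load_root("MarylandCancerClinicalTrials.xml")
```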
Parsing XML Data
In the code above, the ‘root’ variable stores the main tag from the XML file, which in this case is the ‘response’ tag. Once we access the primary tag, we use the ‘find’ method, which returns the first sub-tag with a given name, to acquire the sub-tag that contains the data we desire. In this case, we desire the ‘row’ sub-tag, which stores all of the clinical trial information. We thus specify this with the following code:
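As a hedged illustration, the sketch below applies ‘find’ to a small placeholder document. It assumes the layout implied by the text, an outer ‘row’ tag that wraps one inner ‘row’ tag per trial; the sample values and the name trials are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder document; assumed layout: response > row > row (one per trial).
SAMPLE_XML = """<response>
  <row>
    <row><protocol>PLACEHOLDER-1</protocol></row>
    <row><protocol>PLACEHOLDER-2</protocol></row>
  </row>
</response>"""

root = ET.fromstring(SAMPLE_XML)

# find() returns the first direct child with the given tag name, or None.
trials = root.find("row")
if trials is None:
    raise ValueError("no <row> element found under the root")

print(trials.tag)   # row
print(len(trials))  # 2
```

Checking the result against None guards against a file whose structure differs from the one assumed here.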
After executing this, we can then focus on acquiring the data we desire. Each block of information denotes one particular clinical trial. We can iterate through the data object with a for loop. This will allow us to parse out information from each clinical trial individually.
From there, we must consider which information may be significant to us. Such essential data includes the clinical trial ID, the type of cancer it is associated with, the phase of the trial, the investigator leading the trial, the date of its commencement, and the name of the trial. Because the content we desire lies between the start and end tags, we must use the ‘text’ attribute to access it. Let’s take a look at the code which best accesses this content:
We specify the tags that lie within the ‘row’ tag, and use the text attribute to access the content between the start and end tags. Subsequently, we store all of these items in a list, except for the type of cancer (primary site). We store the primary site separately because it serves as the index for the data frame. We then append the list for this particular clinical trial to a larger nested list, which will become the primary content of the data frame.
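One way to sketch this loop is shown below, using the tag names listed earlier. The sample document holds placeholder values rather than real trial data, and the names FIELDS, sites, and records are illustrative. The sketch calls findtext(), which reads the text of a sub-tag, in place of find(...).text so that a missing sub-tag yields None rather than an error.

```python
import xml.etree.ElementTree as ET

# Placeholder document; the second trial deliberately omits some sub-tags.
SAMPLE_XML = """<response><row>
  <row>
    <primary_site>SITE A</primary_site>
    <protocol>ID-1</protocol>
    <principal_investigator>INVESTIGATOR A</principal_investigator>
    <date_opened>DATE A</date_opened>
    <study_phase>PHASE A</study_phase>
    <title>TITLE A</title>
  </row>
  <row>
    <primary_site>SITE B</primary_site>
    <protocol>ID-2</protocol>
    <title>TITLE B</title>
  </row>
</row></response>"""

# Sub-tags kept for each trial, in column order (primary_site is handled separately).
FIELDS = ["protocol", "principal_investigator", "date_opened", "study_phase", "title"]

trials = ET.fromstring(SAMPLE_XML).find("row")

sites = []    # primary site of each trial, later used as the index
records = []  # one list of field values per trial

for trial in trials:
    sites.append(trial.findtext("primary_site"))
    records.append([trial.findtext(field) for field in FIELDS])

print(sites)
print(records)
```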
Storing XML Data
Storing XML data with Python requires little discussion beyond what we covered for CSV and JSON. With XML, as with other data types, a convenient means of storing imported data is the Pandas data frame. Note that in order to use this functionality, we must first import the Pandas library into our Python script. Once this has been done, we can use the DataFrame constructor to build the data storage structure. All we must do is pass the primary content of the structure and use the ‘index’ keyword argument to specify the index, which is the cancer type associated with each trial. We can then use the ‘columns’ keyword argument to specify headings for the data in the data frame. The code for executing this spans only a couple of lines, and appears as:
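A minimal sketch of this step follows. The nested list and index hold placeholder values standing in for the parsed trial data, and the column headings and variable names are illustrative choices rather than the ones used in the original script.

```python
import pandas as pd

# Placeholder stand-ins for the parsed output: one index entry and one row per trial.
sites = ["SITE A", "SITE B"]
records = [
    ["ID-1", "INVESTIGATOR A", "DATE A", "PHASE A", "TITLE A"],
    ["ID-2", "INVESTIGATOR B", "DATE B", "PHASE B", "TITLE B"],
]

# Hypothetical column headings, one per value in each record.
COLUMNS = ["Protocol", "Principal Investigator", "Date Opened", "Study Phase", "Title"]

# The nested list supplies the content, the sites supply the index.
trials_df = pd.DataFrame(records, index=sites, columns=COLUMNS)
trials_df.index.name = "Primary Site"

print(trials_df)
```

Because the index is the cancer type, all trials for one site can be selected with trials_df.loc["SITE A"].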
The Take Away
The present article has covered the tools necessary for working with XML data, using an example built on the cancer clinical trial information derived from New York. In this manner, we examined the structure of XML-derived data, the importing of XML data into our program, the parsing of said data from the file, and its storage in Pandas.
To date, within this series, we’ve looked at the mechanics of working with CSV, JSON, and now, XML. We previously covered the subject of HTML at length, and thus will not review it in this series. If you wish to take a look at working with HTML-derived data, see our earlier article on that subject. Our next project elaborates on the process of working with databases and their intricacies. We look forward to seeing you there.