Introduction to CSV Data
The Data Analytics series intends to provide individuals with helpful tools for approaching data analysis and manipulation. Our introduction to this series focuses specifically on working with CSV data, as these files are perhaps the simplest in terms of malleability. CSV data structures are principally organize such that data associates directly with specific tags and attributes. Being the first article in this series, we intend to get rolling on a high note by diving deep into the nuances of CSV files. This includes discussions of their uses, their accession, and unique attributes that differentiate them from alternative file types. Let us begin.
What is CSV Data?
Firstly, we ought to explore what exactly the category ‘CSV’ denotes as a file type. CSV stands for a comma-separated value file wherein a comma operates as a delimiter between data objects. In this manner, CSV files generally co-localize tabular data and differentiate cells of data by items occurring between commas. CSV files can also use different types of delimiters such as semicolons.
In addition to CSV files globally, a sub-type of CSV file includes the TSV file. The TSV file stands for tab-separated value files wherein data differentiates from each other as a function of their spatial distribution. These file types differ from the CSV file in that they have a .tsv suffix rather than a .csv suffix.
Viewing CSV Files
If you have downloaded a CSV file to your computer, you can view the CSV file structure by opening up the document in a general text editor. Here we demonstrate Florida insurance data stored in a CSV file. Take a look at the file below:
We can begin by looking at the first three lines, which defines the header attributes associated with the data table. The individual columns of data are defined by the object bound by commas. These columns include data on the policy ID, the state, county, and monetary values associated with the insurance policies.
It may prove helpful to observe how this CSV file manifests in Excel. Take a look:
Notice that when viewing this CSV file in Excel, it automatically correctly organizes itself. This is a consequence of the comma delimited structure of the CSV file which readily converts directly into Excel.
Importing CSV Data
If we want to work with CSV data within our program, we begin importing the ‘csv’ library which provides functions for modifying this file. We then open the file and read through it with the ‘csv.reader’ function. Here’s a look at the code required for executing this functionality:
Firstly, we import the CSV library of functions that permit modification of the CSV file we intend to work with. Subsequently, we open the CSV file by calling the name of the file in the ‘open’ function. The ‘open’ function is a built-in method in Python and takes the file name as an argument. It also takes an additional argument which specifies how we intend to use the file. For example, we use ‘r’ as an additional argument which demonstrates our intention to ‘read’ the file. Alternatively, passing in a ‘w’ as this argument would specify our intention of writing to the file.
After opening the file, we use the ‘csv.reader’ function which iterates through the content of the CSV file which we save to the ‘content’ variable. We then iterate through the content object with a for-loop and print out each line of the document. The output of this code appears as follows:
Altering CSV Data
Note that in the output above, the content of each row from the CSV file appears as a list of strings. While this may be useful on occasion, it may actually be preferable to store the CSV data as a dictionary. This can make our system more organized and also make it easier to access specific items of information. Converting the CSV data to a dictionary is rather simple as the ‘csv’ library comes prepared with a built-in function known as ‘DictReader’.
Like the ‘csv.reader’ function, ‘csv.DictReader’ iterates through the content of the CSV file. However, rather than storing the data linearly, it stores that data as a dictionary, associating the data with its particular key. The code for executing this dictionary conversion appears as follows:
Once we have created our dictionary, we can iterate through it again with a for-loop. Even more interesting, however, is the fact that we can extract information from the dictionary on the basis of the key with which it associates. For example, we can iterate through the dictionary with a for-loop and pull out the policy ID for each item in the data set. The code to do this appears as follows:
Once we have done this, the output of the for-loop is every single policy ID found in the CSV file, which looks something like this:
Storing CSV Data
Creating the Dictionary
We have gone through the manners in which we may view the data in a CSV file, open the CSV file in Python, iterate through it, and extract the data within. What we have not yet expanded upon is how we might store this data and model it. Suppose we might want to associate the policy ID with the point longitude and point latitude.
Because the content of the CSV has been converted into the form of a dictionary, constructing a new dictionary for our desired components is rather simple. We can also use our for-loop to do so. Furthermore, we can iterate through each line of the content and for each data item and extract the policy ID, latitude, and longitude. We can then create a nested dictionary, using the policy ID as the key and the latitudes and longitudes as values. We can then use the update function to append each dictionary item to the global new dictionary. The code for executing this functionality appears as follows:
When we print out the dictionary, it appears as follows:
Storing in Pandas
Now, this dictionary is quite disorganized, as well as repetitive, as the latitude and longitude labels appear quite often throughout. Therefore, it could be of great convenience to convert this dictionary into a Pandas DataFrame.
Working with Pandas is a critical component of data analytics, so if you aren’t too familiar with computations utilizing this tool, I advise you to check out this series of tutorials on working with Pandas:
- Constructing Pandas Data Structures with COVID-19 Data
- Operating on Pandas Data Structures with Ufuncs
- Using Pandas With Web Scraping
The code for creating the Pandas data frame is actually quite simple to program as Pandas has a built-in function for creating data frames from dictionaries called ‘from_dict’. The code looks like this:
We can now check out the Pandas data frame, and observe that it appears significantly more organized. Check it out:
Plotting the Pandas DataFrame
Doing this is quite useful as it makes it quite easy to plot the data. The code to do this appears as:
This code yields the following plot:
The Take Away
This article has really gotten deep into the intricacies of working with CSV data. We began by describing the nature of CSV files as organized through comma separated items. Subsequently, we discussed the methods of importing this data into a Python program and iterating through it. We then took a detour to discuss the methodologies of extracting and reorganizing the CSV data as a dictionary. This was ultimately followed by a discussion of storing the data and finally plotting it. Hopefully this has proven helpful in your efforts of improving data analytics techniques. Our subsequent article endeavors to explore the technique of working with JSON data in a similar manner. However, if you endeavor to dive deeper into the subject of working with CSV files, check out this resource. Nevertheless, see you next time around.