Accessing and Manipulating Data in R
As you may have noted in following this series from the Art of Better Programming, we have gone to great lengths to provide you with knowledge and resources for learning data science. We began with several series on data science with Python, covering mathematical utilities as well as machine learning. The present series builds on that knowledge from the perspective of programming with R. While our previous article in this series focused on the different object types used in R, this article discusses the mechanics of accessing data from objects and subsequently manipulating it.
Before we get started, I want to take a moment to mention one of the best opportunities for learning the underlying features of R and data science using the following course. This comprehensive, twenty hour class on the intricacies of data science will vastly improve your skills in data science and give you essential credentials for advancing your career. Furthermore, I’ve managed to acquire the course for you for 97% off, making this an incredible opportunity for you to get your journey started. You can find the discounted version of the course here. Additionally, one of the primary resources we consult as we investigate the role of examining R programming in data science may be found here. I highly recommend this tool as it presents a myriad of coding examples and the underlying mathematical theory to statistical implementation of R in data science. For that reason, you can acquire the discounted version of the book here.
Conceptualizing Data Accession in R
Our previous article discussed the various data objects we are able to create in our R-based programming. While the objects themselves are important, they are for the most part useless unless we have a means of accessing the data stored therein. Fortunately, in a manner similar to the way we do in Python, we can slice data out of various data objects in R. This matter of data accession is discussed extensively in a publication made by Chapman and Hall, which you can follow along with here.
Let’s take a look at how we might encode a vector or list, and access individual elements therefrom. The code looks a bit as follows:
> X <- 1:5
> X
[1] 1 2 3 4 5
> X[3]
[1] 3
> X[3] <- 20
> X
[1]  1  2 20  4  5
In the first line, we created a sequence object using the colon symbol, which produces a vector with numeric elements ranging from one to five. We then indexed a specific item by placing its position in square brackets, X[3], which returned a value of three. Finally, we altered the value at this position by assigning twenty to X[3]. This example demonstrates how to create sequences, extract data from them, and manipulate that data.
In addition to indexing by a single position, we can index with a sequence of positions, a process known as slicing. We can create a vector with the function ‘c’ and use it to specify the positions of the items we want. Let’s compare how slicing works between Python and R:
#Slicing in Python
>>> X = [1, 2, 3, 4, 5, 6, 7, 8, 9] #Establishes the List X
>>> Slice = slice(0, 3)
>>> X[Slice]
[1, 2, 3]
#Slicing in R
> X <- 1:9
> Slice <- c(1:3)
> X[Slice]
[1] 1 2 3
The methodologies of the two programming languages are quite similar, but with important syntax nuances. Most notably, R positions start at one and the slice 1:3 includes both endpoints, whereas Python positions start at zero and a slice excludes its upper bound. According to one of the leading resources on R programming, which you can find here, the indexing system is an efficient and flexible way to selectively access the elements of an object; it can be either numeric or logical. This flexibility is one of the conveniences of object indexing in R. We highly recommend that you check this resource out to observe this flexibility in action.
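To make the "numeric or logical" distinction concrete, here is a minimal sketch. The vector name scores and its values are invented for illustration and do not come from the quoted resource.

```r
# Hypothetical example vector; the values are illustrative only.
scores <- c(12, 7, 30, 18, 25)

# Numeric indexing: pick elements by position (R positions start at 1).
by_position <- scores[c(1, 3)]

# Logical indexing: keep the elements whose matching flag is TRUE.
keep <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
by_flag <- scores[keep]

print(by_position)
print(by_flag)
```

Both calls select the first and third elements; the only difference is whether the index names positions or flags each element as kept or dropped.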
Accessing Matrix Data in R
If you recall from our previous article, matrices are multidimensional data structures with rows and columns for data storage. As stipulated by Harris in Statistics With R (definitely check out this text for examining the complexities of statistical analysis with R), if ‘x’ is a matrix or a data frame, the value in the ith row and jth column is accessed with x[i, j]. To access all values of a given row or column, one simply has to omit the appropriate index. Check out the code which executes this functionality:
> MatrixA <- matrix(data=1:6, nrow=2, ncol=3, byrow=TRUE)
> MatrixA
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
#Indexing a Single Item:
> MatrixA[1,2] #Access First Row, Second Column
[1] 2
#Indexing Multiple Items
> MatrixA[1,2:3] #Access First Row, Second and Third Column
[1] 2 3
> MatrixA[2,] #Access Entire Second Row
[1] 4 5 6
> MatrixA[,3] #Access Entire Third Column
[1] 3 6
You have certainly noticed that the last result is a vector and not a matrix. The default behavior of R is to return an object of the lowest dimension possible. The previous example demonstrates how we can index a single item from the matrix by specifying one row and one column. Furthermore, we can select multiple items from a row or column with a slice. Finally, we can access an entire row or an entire column by specifying only the row or only the column and leaving the other index empty. All of these methods make for quick accession of matrix data, and prove to be quite similar to the methodologies employed in Python.
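If you would rather keep the matrix shape when pulling out a single row or column, base R's bracket operator accepts a drop argument. The sketch below is our own illustration (the variable names are made up) of how the default and drop = FALSE differ.

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Default behavior: the third column collapses to a plain vector.
third_col_vector <- m[, 3]

# drop = FALSE keeps the result as a matrix with 2 rows and 1 column.
third_col_matrix <- m[, 3, drop = FALSE]

print(dim(third_col_vector))  # a plain vector has no dimensions
print(dim(third_col_matrix))
```

This matters mostly when later code expects a matrix, for example when the selection is passed on to matrix arithmetic.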
Manipulating Matrix Data
Matrices are among the most common data structures when working with big data, as they have incredible potential for organizing data in a succinct format. The resource Big Data Analytics With R covers this subject at length, providing a myriad of interesting perspectives and coding techniques for working with matrices, which, personally, have vastly improved the means by which I work with large data. I highly recommend you check it out, as I attribute much of my knowledge to this resource. You can find it for a significantly discounted price by following this link.
Just as we altered particular items in a vector through insertion, we can follow the same mechanics when working with matrices. The only difference is that we need to specify the location of insertion using both a row and a column. Furthermore, when inserting multiple values into the matrix, you have to make sure the number of values inserted matches the number of positions being replaced. Take a look at how we can execute this functionality with some code:
> MatrixA <- matrix(data=1:6, nrow=2, ncol=3, byrow=TRUE)
> MatrixA
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
#Inserting a Single Value
> MatrixA[1,1] <- 6 #Change the value in position (1,1) to 6
> MatrixA
     [,1] [,2] [,3]
[1,]    6    2    3
[2,]    4    5    6
#Inserting Multiple Values
> MatrixA[1,] <- c(3, 4, 5) #Insert multiple values with vector
> MatrixA
     [,1] [,2] [,3]
[1,]    3    4    5
[2,]    4    5    6
As you can see, and as corroborated by Harris here, multiple values may be inserted into a matrix through the insertion of a vector.
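The length requirement is easy to check for yourself. The following sketch is our own illustration rather than something taken from Harris: it replaces a column, shows that a single value is recycled across a whole row, and uses tryCatch to capture the error R raises when the replacement length does not fit.

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Replace an entire column with a vector of matching length (2 rows).
m[, 2] <- c(20L, 50L)

# A single value is recycled across every selected position.
m[2, ] <- 0L

# Two values cannot fill three positions; tryCatch captures the error
# so the script keeps running and m is left unchanged by this step.
result <- tryCatch(
  {
    m[1, ] <- c(1L, 2L)
    "assigned"
  },
  error = function(e) "error"
)

print(m)
print(result)
```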
Special Matrix Insertion
The text Practical Statistics for Data Science is one of the best tools for examining the various tools for data manipulation in R as well as Python. This tool provides a variety of insights into the more sophisticated topics of data analysis, so if you’d like to get into the deeper knowledge of statistical analysis with R, definitely check it out here.
Nevertheless, we take a moment to embark on one of the more useful nuances of data manipulation, which we can execute with logical operators. According to Practical Statistics for Data Science, for vectors, matrices and arrays, it is possible to access the values of an element with a comparison expression as the index. Take a look at how we might do this for a single vector:
> X <- 1:10
> X[X >= 5]
[1]  5  6  7  8  9 10
The first line of the code, as we have previously noted, creates a sequence of values that range from one to ten. The second line uses the logical operator for greater than or equal to. The comparison X >= 5 produces a TRUE or FALSE flag for every element, and the index keeps only the elements whose value is five or greater, returning them as a vector. We can use similar code for inserting values into the sequence as well:
> L <- 1:10
> L[L >= 5] <- 50
> L
[1]  1  2  3  4 50 50 50 50 50 50
The present example demonstrates that inserting data into a unidimensional sequence can be done with ease, but there is not much in the way of complexity here. We can raise the level of sophistication quite a bit just by executing this functionality on a matrix. Take a look at how we encode these features:
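> MatrixA <- matrix(data=1:9, nrow=3, ncol=3, byrow=TRUE)
> MatrixA
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
> MatrixA[MatrixA > 2] <- 20
> MatrixA
     [,1] [,2] [,3]
[1,]    1    2   20
[2,]   20   20   20
[3,]   20   20   20
When inserting data into a matrix this way, the comparison MatrixA > 2 produces a logical matrix of the same shape, and that single logical matrix serves as the index, so every value greater than two is replaced. Placing the full comparison in both the row slot and the column slot does not work, because each slot expects at most one flag per row or per column. If you do want to select by row and by column, supply one logical vector of length three for the rows and another for the columns.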
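As a rough sketch of the two styles of logical indexing on a matrix, the code below contrasts a single logical matrix with a pair of per-dimension logical vectors. The names mask, row_flags, and col_flags are our own, and the choice to build the flags from the first column and first row is purely illustrative.

```r
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

# The comparison returns a logical matrix with the same shape as m.
mask <- m > 2

# Option 1: a single logical matrix replaces every matching cell.
all_cells <- m
all_cells[mask] <- 20L

# Option 2: one logical vector per dimension selects whole rows and columns.
row_flags <- m[, 1] > 2   # rows whose first entry exceeds 2
col_flags <- m[1, ] > 2   # columns whose first entry exceeds 2
block <- m
block[row_flags, col_flags] <- 20L

print(all_cells)
print(block)
```

Option 1 changes individual cells wherever the condition holds, while Option 2 changes the rectangular block where the selected rows and columns intersect.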
Object Accession By Name
Our previous examples focused on accessing and manipulating objects by specifying a position. These techniques are well documented in a variety of sources, not the least of which is our preferred tool. Another feature covered well in this text is the accession of values from a data object by name. This technique is particularly useful when the elements of an object carry meaningful character labels.
According to another text we have not yet consulted, the names are labels of the elements of an object, and thus of mode character. They are generally optional attributes. There are several kinds of names (names, colnames, rownames, dimnames). The names of a vector are stored in a vector of the same length as the object, and can be accessed with the function names. Take a look at an example of working with names:
> X <- 1:3
> names(X)
NULL
> names(X) <- c("A", "B", "C")
> X
A B C 
1 2 3 
> names(X)
[1] "A" "B" "C"
In this example, we assign names to the values in our sequence X. By that token, each value within the data object is associated with its own particular ‘label’ or name.
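Once the labels exist, they can be used inside the square brackets in place of positions. The following is a minimal sketch of that idea using the same labels as above; the variable names single and several are our own.

```r
x <- 1:3
names(x) <- c("A", "B", "C")

# Access one element by its label.
single <- x["B"]

# Access several elements with a character vector of labels.
several <- x[c("A", "C")]

# Labels work for replacement as well.
x["C"] <- 30L

print(single)
print(several)
print(x)
```

Note that the results keep their labels; unname() strips them if only the bare values are wanted.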
Again, as stipulated in another one of our consulted texts, for matrices and data frames, colnames and rownames are labels of the columns and rows, respectively. They can be accessed either with their respective functions, or with dimnames which returns a list with both vectors. Take a look at how we work with names in this manner with the following example:
> X <- matrix(1:4, nrow=2, ncol=2, byrow=TRUE)
> X
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> rownames(X) <- c("X", "x")
> colnames(X) <- c("Y", "y")
> X
  Y y
X 1 2
x 3 4
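With row and column labels in place, the same x[i, j] pattern accepts names instead of positions. This sketch reuses the labels from the example above; the variable names cell, row_X, and labels are our own.

```r
m <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)
rownames(m) <- c("X", "x")
colnames(m) <- c("Y", "y")

# Access a single cell by row label and column label.
cell <- m["x", "Y"]

# Access an entire row by its label; the column labels carry over.
row_X <- m["X", ]

# dimnames returns a list holding the row labels, then the column labels.
labels <- dimnames(m)

print(cell)
print(row_X)
print(labels)
```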
Here is one extra token of knowledge to take with you as you proceed to work with data accession and manipulation. As described by one of the most helpful tools we have consulted to date, it is possible to use a graphical, spreadsheet-like editor to edit a “data” object.
For example, if X is a matrix, the command data.entry(X) will open a graphic editor and one will be able to modify some values by clicking on the appropriate cells, or to add new columns or rows. The function data.entry modifies directly the object given as argument without needing to assign its result.
On the other hand, the function ‘de’ returns a list with the objects given as arguments and possibly modified. This result is displayed on the screen by default, but, as for most functions, can be assigned to an object. The details of using the data editor depend on the operating system.
The Take Away
This article has explored the most important features for working with data in R, which may be some of the most valuable tools you apply as you pursue data science. Not only is the ability to access data in R an essential skill, but so too is the deliberate alteration of data in different objects. These are the subjects elaborated on in this article. Hopefully this has been a useful tutorial for you, but given the breadth of techniques that may be used in R, other tools should be consulted as well.
Out of all of the available resources out there, my first recommendation would be the following course. It goes for twenty hours, providing you the most helpful techniques for landing a data science job. Furthermore, as I previously stated, I have managed to get it to you for a 97% discount. Definitely follow this course if you’re serious about boosting your status as an up and coming data scientist. Furthermore, I highly recommend the following book, which we consulted on one occasion in this article. You can find it here, and we also managed to get the discounted version to you as well. Books are one of the best investments you can make as there is so much condensed knowledge you can get easily and truly deepen your understanding.
Definitely consider getting at least one of these tools and watch your knowledge increase exponentially. Besides waiting for the next Art of Better Programming articles to come out, this can also be one of the best decisions you can make. Regardless, we look forward to seeing you at our next discussion of R programming in data science and machine learning.