06| Data Structures

Miles Robertson, 12.23.23 (edited 01.16.24)

Introduction

So far, all the data we have discussed has been stored in atomic vectors (and factors). However, R offers several ways to organize atomic vectors into more complex and more powerful forms. These forms, called data structures, each have different applications and ways of being manipulated.

In this chapter, we will discuss matrices/arrays, data frames, and lists, and how to manipulate each to be most useful to you. Unfortunately, this is where some of the uglier parts of R start to show, and I will discuss how R is not as consistent as other languages. However, given R's prominence amongst data scientists, it is still a mountain worth conquering.

Matrices and Arrays

Atomic vectors are one-dimensional, meaning that the position of each element is determined by a single index. However, it is often useful to have data that is two-dimensional. For example, say you have three species of plants, and you want to measure the height of each plant when grown under three different environmental conditions: hot, normal, and cold. You could store this data in a matrix, where the rows represent the plant species and the columns represent the environmental conditions:

spp\temp	hot	normal	cold
species A	1.2	1.5	1.1
species B	1.3	1.4	1.2
species C	1.1	1.3	1.0

You could represent this data in R with a single vector: c(1.2, 1.5, 1.1, 1.3, 1.4, 1.2, 1.1, 1.3, 1.0). However, this is difficult to interpret, and it is not clear which values correspond to which species or temperature. This is where matrices can be of use.

In R, matrices have very similar properties to vectors, with only a few minor differences. They also can only be of one type, and length() can be used to determine the total number of entries. They are created with the matrix() function, where the first argument is a vector of data to be "reshaped" into a matrix, and other arguments can be used to specify the number of rows and columns of the matrix. When you use the matrix() function, it by default fills the matrix column by column, not row by row like you might expect.

my.matrix <- matrix(1:9, nrow=3) # note the nrow argument
# my.matrix is a 3x3 matrix that looks like this:
# 1 4 7
# 2 5 8
# 3 6 9

my.matrix2 <- matrix(1:12, ncol=3) # note the ncol argument
# my.matrix2 is a 4x3 matrix that looks like this:
# 1 5 9
# 2 6 10
# 3 7 11
# 4 8 12

my.matrix3 <- matrix(1:9, nrow=3, byrow=TRUE)
# the byrow argument lets you fill in the matrix
# row by row instead of column by column (the default):
# 1 2 3
# 4 5 6
# 7 8 9

You can also create matrices by combining vectors with the rbind() and cbind() functions. rbind() combines vectors into a matrix by row, and cbind() combines vectors into a matrix by column. These functions can also be used to add rows or columns to an existing matrix, as long as the dimensions are compatible.

my.matrix4 <- rbind(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
# my.matrix4 is a 3x3 matrix that looks like this:
# 1 2 3
# 4 5 6
# 7 8 9

my.matrix4 <- cbind(my.matrix4, c(11, 22, 44))
# my.matrix4 is now a 3x4 matrix that looks like this:
# 1 2 3 11
# 4 5 6 22
# 7 8 9 44

To find how many rows and columns a matrix has, you can use nrow(), ncol(), and dim() (the last one returns a two-element vector, where the first element is the number of rows and the second is the number of columns). Unless you need both the number of rows and the number of columns at once, you should use nrow() and ncol() instead of dim() because they are easier to read, as seen here:

nrow(some.matrix)   # this is easier to read than...
dim(some.matrix)[1] # ...this

You can also use rownames() and colnames() to get the names of the rows and columns of a matrix, respectively. These functions return a vector of the row/column names, which you can then manipulate like any other vector. You can also set the row/column names with rownames() and colnames() by assigning a vector of names to them. Using the example from the table above, we can create a matrix with row and column names like this:

my.matrix5 <- matrix(
    c(1.2, 1.5, 1.1, 
      1.3, 1.4, 1.2, 
      1.1, 1.3, 1.0), 
    nrow = 3,
    byrow = TRUE
) # note that I broke up this line 
# into multiple lines for readability

rownames(my.matrix5) <- c("sp A", "sp B", "sp C")
colnames(my.matrix5) <- c("hot", "normal", "cold")
# my.matrix5 is now a 3x3 matrix that looks like this:
#      hot normal cold
# sp A 1.2    1.5  1.1
# sp B 1.3    1.4  1.2
# sp C 1.1    1.3  1.0

If you use the class() function on a matrix, you will find that it returns "matrix" "array", which is a helpful way to identify its type. Even better, use the is.matrix() and is.array() functions to determine if an object is a matrix or an array, respectively. They return TRUE if the object is a matrix/array and FALSE otherwise.

So far, we have only discussed two-dimensional vectors, i.e., matrices. However, R also allows for vectors of more than two dimensions. In general, these are called arrays. In fact, matrices are just a special case of an array with two dimensions. Given that arrays are just a generalization of matrices, they have many of the same properties, so we will not discuss them here.
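Although arrays are not covered further here, a brief sketch may make the idea concrete. The name my.array and the choice of dimensions below are my own assumptions for illustration, not part of the running examples:

```r
# a 2x3x4 array: 2 rows, 3 columns, and 4 "layers"
my.array <- array(1:24, dim = c(2, 3, 4))

dim(my.array)        # one length per dimension
my.array[1, 2, 3]    # one index per dimension, separated by commas
is.array(my.array)   # TRUE
is.matrix(my.array)  # FALSE, since it has three dimensions, not two
```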

Indexing Matrices

Indexing matrices is very similar to indexing vectors, with the major exception being that you need to specify both the row and column index. You can do this by using a comma to separate the row index from the column index, like this:

my.matrix[1, 2] 
# returns the value in the first row and second column

Besides this difference, the features available for vector indexing are largely identical for matrices, once for row selection and once for column selection (e.g., negative indices, multiple indices as a vector, logical vectors):

my.matrix[-1, c(1, 3)] 
# returns entries in every row except the first row 
# and in columns 1 and 3

my.matrix[c(TRUE, FALSE, TRUE), 1:3]
# returns entries in first and third rows
# and in columns 1 thru 3

One addition to this pattern is that a logical matrix of the same shape, such as the result of a comparison on the matrix itself, can be used as a single index with no comma:

my.matrix[my.matrix > 5]
# returns all entries in my.matrix that are greater than 5.
# Note no comma is needed here, unlike previous examples

Additionally, if you only need to select by rows or columns but not both, you can leave the other index blank. For example, if you only want the first row of my.matrix, you can use my.matrix[1, ]. Similarly, if you only want the second and third column of my.matrix, you can use my.matrix[, 2:3].
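As a small sketch of leaving one index blank (example.matrix is a stand-in name chosen for this illustration):

```r
example.matrix <- matrix(1:9, nrow = 3)

example.matrix[1, ]    # the whole first row, returned as a vector
example.matrix[, 2:3]  # the second and third columns, returned as a 3x2 matrix
```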

Matrix Operations

Matrices can be manipulated in many of the same ways as vectors. For example, you can add, subtract, multiply, and divide between matrices of the same size or between a matrix and a single number. There are several other operations that are specific to matrices that I will introduce here, but unless you have an interest in linear algebra, you will probably not need to use them.
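Here is a minimal sketch of that entry-by-entry arithmetic. The matrices a and b are made up for this example:

```r
a <- matrix(1:4, nrow = 2)
b <- matrix(c(10, 20, 30, 40), nrow = 2)

a + b  # adds the entries in matching positions
a * b  # multiplies entry by entry
a * 2  # doubles every entry
```

Note that * here works on matching positions only, so both matrices need the same dimensions.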

Matrix multiplication is a special type of multiplication that is specific to matrices. There are terrific resources online to explain the concept of matrix multiplication, so I will defer explanation to them. Matrix multiplication can only be performed between matrices of compatible dimensions, meaning that the number of columns in the first matrix must be equal to the number of rows in the second matrix. In R, matrix multiplication is performed with the %*% operator.

my.matrix <- matrix(1:12, ncol=4)
my.matrix2 <- matrix(13:24, nrow=4)
my.matrix %*% my.matrix2
# this matrix product gives a 3x3 matrix

The transpose of a matrix (switching rows with columns) can be found with the t() function. The inverse of a square matrix (i.e., same number of rows and columns) can be found with the solve() function, and the determinant with the det() function.

my.matrix <- matrix(c(1, 8, 3, 4), nrow=2)
t(my.matrix)     # transpose
det(my.matrix)   # determinant
solve(my.matrix) # inverse

Data Frames

By far the most common data structure in R, data frames are a way to group several vectors of different types (but the same length) into a single object. Think of data frames like a table: each column is some measurement, and each row is an observation. For example, a data frame could be used to store the name, age, height and marital status of several people:

my.df <- data.frame(
    name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
    age = c(35, 19, 23, 61, 10),
    height = c(5.9, 6.1, 5.5, 5.4, 3.5),
    is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

As you can see here, the data.frame() function is used to create a data frame. This function takes named arguments, where the name of the argument is the name of the column, and the value of the argument is the vector of data for that column. As seen here, each vector does not need to be of the same type, but they must be the same length.

The names of the columns can be changed with colnames() in the same way that they can be changed for matrices. It is rare to use rownames() for data frames.
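For instance, a single column can be renamed by indexing into colnames(). This is only a sketch: example.df and the new name age.in.years are hypothetical.

```r
example.df <- data.frame(
    name = c("Sarah", "Jake"),
    age = c(35, 19)
)

colnames(example.df)                       # the current column names
colnames(example.df)[2] <- "age.in.years"  # rename only the second column
colnames(example.df)                       # the second name has changed
```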

To identify a data frame, you can use the class() function, which will return "data.frame". However, there are some special cases where this won't work as expected, so it is better to use the is.data.frame() function, which will return TRUE if the object is a data frame and FALSE otherwise.

Indexing Data Frames

You can index data frames in the same way as matrices. However, it is not common to index data frames by their row and column indices. Instead, you will usually want to index data frames by the names of the columns. This can be done by using the $ operator, like this:

my.df$name
# returns the "name" column of my.df

my.df$name[4]
# returns the fourth element of the "name" column

As you saw above, R allows you to name columns when you create a data frame by typing the name of the column, followed by an equals sign, followed by the vector of data for that column. Once the data frame is created, however, the names of the columns are stored as strings (in a character vector). This is irregular compared to other languages, since the names of the columns are sometimes written like variables and sometimes treated as strings. This becomes a problem when the names of the columns are not valid variable names. For example, if you have a column named speed (m/s), you cannot use the $ operator directly to access that column because speed (m/s) has a space, parentheses, and a slash, none of which are allowed in variable names. In this case, you can either use the [[]] operator or you can use $ with backticks (`) around the name to get that column:

my.df[["speed (m/s)"]]
# returns the "speed (m/s)" column of my.df

my.df$`speed (m/s)`
# also returns the "speed (m/s)" column of my.df

Although I highly discourage its use, you can use backticks more broadly to make variable names that would otherwise be invalid. I expose you to this because it unfortunately shows up in various packages. The following is technically valid R code, but if you use it, I will not be coming to your birthday party any more.

`please don't use this` <- c(7, 4, 3, 1, 5)
`please don't use this`[`please don't use this` >= 5] <- 0

Above, I mentioned the [[]] operator. This operator is used to access elements of a data frame by name. It behaves the same way as the $ operator. Here, I want to briefly demonstrate the difference between [[]] and []. Try out the code below to see the difference. You'll see that [[]] returns a vector, while [] returns a data frame that contains that same vector. The difference between the two is subtle, but it can be important in some situations. It also applies to lists as discussed below.

my.df["name"]
# returns a data frame with only the "name" column

my.df[["name"]]
# returns the vector that the "name" column had
# (not inside a data frame)

It is frequently of interest to select a subset of a data frame based on some condition. This can be done using the subset() function. Here, the first argument is the data frame to be subsetted, and the second argument is a condition that must be met for a row to be included in the subset. The condition can use column names from the data frame, without the need to use the $ operator.

subset(my.df, age > 20)
# returns a subset of my.df where the age is greater than 20

subset(my.df, age > 20, c(name, height))
# same as above, but only returns the 
# columns for "name" and "height"

Although being able to refer to the age column without the $ in the code above is convenient, it is important to remember that age is not a variable that can be accessed outside the context of the my.df data frame. For example, on a separate line, age > 20 would not make sense to R, and it would give an error that age was not defined. This is one of the shortcomings of R: shorthands are sometimes only applicable in limited contexts. Other languages make great efforts to avoid such inconsistencies.
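For comparison, the same selection can be written with ordinary bracket indexing, where the data frame has to be named each time a column is used. This is a sketch on a small stand-in data frame (example.df is a hypothetical name), and the two approaches match here because age has no missing values:

```r
example.df <- data.frame(
    name = c("Sarah", "Jake", "Ava"),
    age = c(35, 19, 23),
    height = c(5.9, 6.1, 5.5)
)

# subset() looks up `age`, `name` and `height` inside example.df for you
with.subset <- subset(example.df, age > 20, c(name, height))

# with brackets, the data frame must be spelled out explicitly
with.brackets <- example.df[example.df$age > 20, c("name", "height")]
```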

If you want to add a column to a data frame, the preferred method is to use the $ operator, as seen below:

my.df$height.in.m <- my.df$height / 3.281
my.df$eye.color <- as.factor(c("blue", "brown", "green", "brown", "blue"))

In the code above, we have added a new column to my.df called height.in.m that is the height in meters, which uses arithmetic on the height column. We have also added a new column called eye.color that is a factor vector of the eye colors of each person. Remember from earlier chapters that factors have nearly identical properties to character vectors, but they are used to represent categorical data that only has a few possible values.

For matrices above, I mentioned that, when indexing, you can leave one of the indices blank if you only want to select by rows or columns. This is also true for data frames. For example, if you only want the first row of my.df, you can use my.df[1, ]. However, since matrices can only be of one type, this operation will only ever return a vector for matrices. In data frames, where columns can be of different types, this operation will return a data frame that contains that row.
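You can check this difference yourself with a sketch like the following, where example.matrix and example.df are stand-in names for illustration:

```r
example.matrix <- matrix(1:6, nrow = 2)
example.df <- data.frame(name = c("Sarah", "Jake"), age = c(35, 19))

is.vector(example.matrix[1, ])  # TRUE: a row of a matrix is a plain vector
is.data.frame(example.df[1, ])  # TRUE: a row of a data frame is still a data frame
```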

This presents a problem when it comes to adding new rows to an existing data frame that does not occur when adding new columns: the entries in a data frame across a row are not necessarily of the same type. This is a problem, because that means that vectors can't help us here, such that the following code will not work:

my.df <- data.frame(
    name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
    age = c(35, 19, 23, 61, 10),
    height = c(5.9, 6.1, 5.5, 5.4, 3.5),
    is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)

my.df[nrow(my.df) + 1, ] <- c("Tim", 63, 6.0, FALSE)
# this will NOT work as intended: c() forces
# its contents to be of the same type (here, character),
# so the assignment runs without an error but
# silently converts every column of my.df to character

One way to fix this issue is to create another data frame that contains only this new row of information, but it must have the same column names as the original data frame. However, this is not a very elegant solution. Instead, we can use lists to solve this problem. These will be explained more in the next section, but for now, know that lists are a data structure that can contain objects of different types. We can use a list to add a new row to a data frame like this:

my.df[nrow(my.df) + 1, ] <- list("Tim", 63, 6.0, FALSE)
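For completeness, here is a sketch of the first approach mentioned above, where the new row is built as its own one-row data frame with matching column names and then attached with rbind(). The names example.df and new.row are hypothetical:

```r
example.df <- data.frame(
    name = c("Sarah", "Jake"),
    age = c(35, 19),
    is.married = c(TRUE, FALSE)
)

# a one-row data frame with the same column names
new.row <- data.frame(name = "Tim", age = 63, is.married = FALSE)
example.df <- rbind(example.df, new.row)

nrow(example.df)       # one more row than before
class(example.df$age)  # the column keeps its original type
```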

Manipulating Data Frames

Most of your time working with data frames will be spent preparing them for statistical analysis. This might mean removing rows or columns that are not needed, or it might mean sorting the data in a particular way. I will show an example where we sort the columns by their sums, from greatest to least, and then remove the column with the smallest sum. This is a contrived example, but it demonstrates the kind of manipulation that is useful when preparing data for statistical analysis.

First, we will create a data frame with 100 rows and 10 columns, where each entry is a random number between 0 and 1. Note that I am making the data frame by giving the data.frame() function a matrix. You won't usually create data frames this way, but it is useful for demonstration purposes. Then, I name the columns of the data frame with the colnames() function, using R's built-in vector called letters (which is all the lower case letters in order) to name the columns a through j.

set.seed(100)
column.count <- 10
row.count <- 100

my.df <- data.frame(
    matrix(
        runif(column.count * row.count), 
        nrow=row.count
    )
)

colnames(my.df) <- letters[1:column.count]

This next part will look confusing, but it is just a combination of things that we have already seen. I am going to present it as a single line, not because it is good practice to do so, but because this is unfortunately the kind of bad code you will run across in the wild. Below, we'll rewrite this single line into something more readable.

my.df <- my.df[, order(colSums(my.df), decreasing=TRUE)[1:(column.count-1)]]

When you see something confusing like this, it is best to work inside-out. At the very center of this code is colSums(my.df), which returns a vector of the sums of each column of my.df. This will be needed to sort the columns by their sums. The order() function, discussed briefly last chapter, then takes this vector and returns a vector of indices that would sort the vector in descending order (since decreasing=TRUE). Then, since we are interested in keeping all but the lowest-sum column, we trim off the last result of order() with [1:(column.count-1)]. Finally, the [, ] (note the comma) operator is used to reorder and select the columns of my.df using the indices returned by order(). This is a lot to take in, so let's rewrite this code to be more readable:

sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
selected.columns <- sorted.column.indices[1:(column.count-1)]
my.df <- my.df[, selected.columns]
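If order() still feels opaque, it can help to try it on something tiny first. The sketch below uses a made-up named vector (sums) standing in for the result of colSums():

```r
sums <- c(a = 3, b = 1, c = 2)

sorted.indices <- order(sums, decreasing = TRUE)
sorted.indices        # positions of the largest, then next largest, and so on
sums[sorted.indices]  # the same values, now in descending order

# dropping the last index drops the smallest value
sums[sorted.indices[1:(length(sums) - 1)]]
```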

Importing Data Frames

In the previous section, I showed you how to create a data frame from scratch. However, this is not how you will usually create data frames. Instead, you will usually import data from a file. There are many ways to import data into R, but the most common ways are to import a CSV file or an Excel file. CSV stands for "comma-separated values", and it is a common file format for storing data. In a CSV file, which is simply a text file with a specific format, each row of data is on its own line, and the columns are separated by commas. For example, the following CSV file contains the same data as the data frame of people we created earlier:

name,age,height,is.married
Sarah,35,5.9,TRUE
Jake,19,6.1,FALSE
Ava,23,5.5,FALSE
Elija,61,5.4,TRUE
Sophia,10,3.5,FALSE

If this file is saved as my_data.csv, then it can be imported into R with the read.csv() function, like this (note the file name has to be a full path or the correct relative path, as discussed in a previous chapter):

my.df <- read.csv("my_data.csv")

This function returns a data frame whose column names are taken from the first line of the CSV file. The data frame is stored under whatever variable name you assign it to (my.df here). For Excel files, you can use the read_excel() function from the readxl package. The readxl package is not included with base R, so you will need to install it before you can use it. Simply type install.packages("readxl") in the console and hit enter. This only has to be run one time, and need not be included in any R scripts. However, every time you want to use the read_excel() function, you will need to load the readxl package with library(readxl), and this should be included at the beginning of any R scripts that use the read_excel() function. read_excel() works nearly identically to read.csv(), except if there are multiple sheets, in which case you will need to specify which sheet to import with the sheet argument.
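To try read.csv() without hunting for a file, here is a self-contained sketch that first writes a small CSV to a temporary file and then reads it back. The names csv.path and imported.df are assumptions made for this example, and the temporary file stands in for your own data file:

```r
# write a small CSV to a temporary file so the example is self-contained
csv.path <- tempfile(fileext = ".csv")
writeLines(
    c("name,age,height,is.married",
      "Sarah,35,5.9,TRUE",
      "Jake,19,6.1,FALSE"),
    csv.path
)

imported.df <- read.csv(csv.path)

colnames(imported.df)  # taken from the first line of the file
str(imported.df)       # shows the type read.csv() chose for each column
```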

Lists

As mentioned above, lists are a data structure that can contain objects of different types. They are similar to vectors in that they are one-dimensional, but can have anything at each index. They are created with the list() function, where each argument is an object to be added to the list. For example, the following code creates a list with a matrix, a data frame, a numeric vector, and a character vector:

my.list <- list(
    matrix(1:9, nrow=3),
    data.frame(
        name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
        age = c(35, 19, 23, 61, 10),
        height = c(5.9, 6.1, 5.5, 5.4, 3.5),
        is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
    ),
    c(1, 8, 3, 2, 5),
    "Hello"
)

As you can see, lists are created in a similar way to data frames, where each argument is an object to be added to the list. However, unlike data frames, list objects need not be vectors, nor do they need to be of the same length. In fact, they can be any type of object, including other lists. This makes lists a very flexible data structure.

Lists can be identified by using the class() function, which usually returns "list" if it is one. However, you will have better luck using the is.list() function to determine if an object is a list.
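A short sketch of both checks follows. The names example.list and example.df are made up for illustration. One thing worth knowing is that data frames are built on top of lists, so is.list() returns TRUE for them as well:

```r
example.list <- list(1, "a", TRUE)
class(example.list)    # "list"
is.list(example.list)  # TRUE

example.df <- data.frame(x = 1:3)
class(example.df)      # "data.frame"
is.list(example.df)    # also TRUE
```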

Indexing Lists

Lists can be indexed in a similar way to the columns of data frames (i.e., with single brackets [] or double brackets [[]]):

my.list[1]
# returns the first element of my.list (a matrix in this case),
# itself "wrapped" in a one-element list

my.list[[1]]
# returns the same matrix, with no list involved

my.list[[1]] <- "a"
# replaces the first element of the list with "a"

Lists can also be indexed by name using the $ operator if the entries are given names when the list is created, as seen in data frames:

my.list <- list(
    matrix.entry = matrix(1:9, nrow=3),
    data.frame.entry = data.frame(
        name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
        age = c(35, 19, 23, 61, 10),
        height = c(5.9, 6.1, 5.5, 5.4, 3.5),
        is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
    ),
    some.numbers = c(1, 8, 3, 2, 5),
    friendly.salutation = "Hello"
)

my.list$friendly.salutation
# returns "Hello"

As will be discussed later, tons of functions in R return lists, with named entries that contain various results of the function. For example, the rle() function used in a previous chapter returns a list with two entries: lengths and values. We accessed the lengths entry by using the $.
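As a quick reminder of what that looks like, here is a sketch with a made-up input vector (runs is a hypothetical variable name):

```r
runs <- rle(c("a", "a", "b", "b", "b", "a"))

runs$lengths   # how long each run is
runs$values    # which value each run holds
is.list(runs)  # TRUE
```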

One important thing to note is that adding additional entries to lists is computationally slow, because R has to copy the entire list to a new location in memory. This is not a problem for small lists, but it can be a problem for large lists. If you know the size of the list beforehand, it is better to create a list of that size and then fill it in, rather than creating an empty list and adding to it.
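One way to create a list of a known size is vector("list", n), which makes a list with n empty slots. The sketch below uses a hypothetical list called results with three slots:

```r
# create a list with three empty (NULL) slots up front...
results <- vector("list", 3)
length(results)  # 3

# ...then fill each slot in place, without changing the list's length
results[[1]] <- "first result"
results[[2]] <- c(1, 2, 3)
results[[3]] <- TRUE
```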

Conclusion

R has some very powerful data structures that can be used to store data in a variety of ways. You have seen that R has some unfortunate quirks that make it somewhat difficult to learn. Unfortunately, this means that you will likely struggle to feel that you have a firm grasp on what is happening with R. For this reason, it is helpful to learn other languages, such as Python or Julia, to get a better understanding of how programming languages work. However, given that most scientists do not have the drive to learn multiple programming languages, it is currently in your best interest to learn R given its prominence in the scientific community. Understanding the quirks of R is a much better state to be in than constantly shrugging your shoulders and blindly following what others tell you to do with the language.

Practice

Make and Manipulate a Data Frame

Create a data frame called practice.data with the following features (note that you will have to rely on knowledge from a previous chapter to complete this exercise):

practice.data has 20 rows and 3 columns.
Its first column is called species.by.letter and contains the first 20 letters of the alphabet (Hint: use the built-in letters vector).
Its second column is called average.mass.kg and contains 20 normally distributed random numbers with a mean and standard deviation of your choice.
Its third column is called abundance and contains 20 randomly selected integers from the vector 1:100.

After making this data frame, complete the following steps, using the resulting data frame from each step in the subsequent step. In some cases, there are multiple approaches to complete the step.

Get all rows of practice.data whose average.mass.kg value is above the mean used for the normally distributed random numbers.
Sort the rows of the data frame by the abundance column in descending order.
Add a column called total.biomass.kg that holds the product of the two existing numeric columns.
Remove the average.mass.kg and the abundance columns.
Remove the first row.
Print the species.by.letter column to the console.

Make and Manipulate a List

Create a list called practice.list with the following features:

The list has 4 entries.
The first entry is a character vector of the first 10 letters of the alphabet.
The second entry is a data frame with 10 rows and 2 columns. Any data frame entries are fine.
The third entry is a matrix with 3 rows and 3 columns. Any matrix entries are fine.
The fourth entry is some integer vector created using the : operator.

After making this list, complete the following steps, using the variable name practice.list in each step. Each can be completed in a single line of code. Unlike the previous exercise, there is no need to overwrite the list with these steps, specifically steps 1 through 3.

Get the first and second entries of the list. These two elements will be contained by a list.
Get the third entry of the list, ensuring that it is contained in a one-element list.
Get the third entry of the list, ensuring that it is not contained in a list.
Replace the second entry of the list with a single number.
Replace one of the numbers in the vector held at list index four with a different number.