So far, all the data we have discussed has been stored in atomic vectors (and factors). However, R offers several ways to organize these atomic vectors into more complex and more powerful forms. These forms, called data structures, each have different applications and ways of being manipulated.
In this chapter, we will discuss matrices/arrays, data frames, and lists, and how to manipulate each to be most useful to you. Unfortunately, this is where some of the uglier parts of R start to show, and I will discuss how R is not as consistent as other languages. However, given R's prominence amongst data scientists, it is still a mountain worth conquering.
Atomic vectors are one-dimensional, meaning that the position of each element is determined by a single index. However, it is often useful to have data that is two-dimensional. For example, say you have three species of plants, and you want to measure the height of each plant when grown under three different environmental conditions: hot, normal, and cold. You could store this data in a matrix, where the rows represent the plant species and the columns represent the environmental conditions:
spp\temp | hot | normal | cold |
---|---|---|---|
species A | 1.2 | 1.5 | 1.1 |
species B | 1.3 | 1.4 | 1.2 |
species C | 1.1 | 1.3 | 1.0 |
You could represent this data in R with a single vector: c(1.2, 1.5, 1.1, 1.3, 1.4, 1.2, 1.1, 1.3, 1.0). However, this is difficult to interpret, and it is not clear which values correspond to which species or temperature. This is where matrices can be of use.
In R, matrices have very similar properties to vectors, with only a few minor differences. Like vectors, they can only hold one type of data, and length() can be used to determine the total number of entries. They are created with the matrix() function, where the first argument is a vector of data to be "reshaped" into a matrix, and other arguments can be used to specify the number of rows and columns of the matrix. By default, matrix() fills the matrix column by column, not row by row like you might expect.
my.matrix <- matrix(1:9, nrow=3) # note the nrow argument
# my.matrix is a 3x3 matrix that looks like this:
# 1 4 7
# 2 5 8
# 3 6 9
my.matrix2 <- matrix(1:12, ncol=3) # note the ncol argument
# my.matrix2 is a 4x3 matrix that looks like this:
# 1 5 9
# 2 6 10
# 3 7 11
# 4 8 12
my.matrix3 <- matrix(1:9, nrow=3, byrow=TRUE)
# the byrow argument lets you fill in the matrix
# row by row instead of column by column (the default):
# 1 2 3
# 4 5 6
# 7 8 9
You can also create matrices by combining vectors with the rbind() and cbind() functions. rbind() combines vectors into a matrix by row, and cbind() combines vectors into a matrix by column. These functions can also be used to add rows or columns to an existing matrix, as long as the dimensions are compatible.
my.matrix4 <- rbind(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))
# my.matrix4 is a 3x3 matrix that looks like this:
# 1 2 3
# 4 5 6
# 7 8 9
my.matrix4 <- cbind(my.matrix4, c(11, 22, 44))
# my.matrix4 is now a 3x4 matrix that looks like this:
# 1 2 3 11
# 4 5 6 22
# 7 8 9 44
To find how many rows and columns a matrix has, you can use nrow(), ncol(), and dim() (the last one returns a two-element vector, where the first element is the number of rows and the second is the number of columns). Unless you need both the number of rows and the number of columns at once, you should use nrow() and ncol() instead of dim() because they are easier to read, as seen here:
nrow(some.matrix) # this is easier to read than...
dim(some.matrix)[1] # ...this
You can also use rownames() and colnames() to get the names of the rows and columns of a matrix, respectively. These functions return a vector of the row/column names, which you can then manipulate like any other vector. You can also set the row/column names with rownames() and colnames() by assigning a vector of names to them. Using the example from the table above, we can create a matrix with row and column names like this:
my.matrix5 <- matrix(
c(1.2, 1.5, 1.1,
1.3, 1.4, 1.2,
1.1, 1.3, 1.0),
nrow = 3,
byrow = TRUE
) # note that I broke up this line
# into multiple lines for readability
rownames(my.matrix5) <- c("sp A", "sp B", "sp C")
colnames(my.matrix5) <- c("hot", "normal", "cold")
# my.matrix5 is now a 3x3 matrix that looks like this:
# hot normal cold
# sp A 1.2 1.5 1.1
# sp B 1.3 1.4 1.2
# sp C 1.1 1.3 1.0
If you use the class() function on a matrix, you will find that it returns "matrix" "array", which is a helpful way to identify its type. Even better, use the is.matrix() and is.array() functions to determine if an object is a matrix or an array, respectively. They return TRUE if the object is a matrix/array and FALSE otherwise.
So far, we have only discussed two-dimensional vectors, i.e., matrices. However, R also allows for vectors of more than two dimensions. In general, these are called arrays. In fact, matrices are just a special case of an array with two dimensions. Given that arrays are just a generalization of matrices, they have many of the same properties, so we will not discuss them here.
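Although arrays are not covered in detail here, a minimal sketch may help make the idea concrete. The variable names my.array and first.layer are made up for this illustration, and I call the third dimension a "layer" only for convenience:

```r
# A 2 x 3 x 4 array: 2 rows, 3 columns, and 4 "layers"
my.array <- array(1:24, dim = c(2, 3, 4))

dim(my.array)        # 2 3 4
is.array(my.array)   # TRUE
is.matrix(my.array)  # FALSE, because it has three dimensions

# Indexing takes one index per dimension, separated by commas
my.array[1, 2, 3]    # row 1, column 2, layer 3

# Leaving an index blank selects everything along that dimension,
# so a single layer of this array is an ordinary matrix
first.layer <- my.array[, , 1]
is.matrix(first.layer)  # TRUE
```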
Indexing matrices is very similar to indexing vectors, with the major exception being that you need to specify both the row and column index. You can do this by using a comma to separate the row index from the column index, like this:
my.matrix[1, 2]
# returns the value in the first row and second column
Besides this difference, the features available for vector indexing (e.g., negative indices, multiple indices as a vector, logical vectors) work the same way for matrices, applied separately to the row index and the column index:
my.matrix[-1, c(1, 3)]
# returns entries in every row except the first row
# and in columns 1 and 3
my.matrix[c(TRUE, FALSE, TRUE), 1:3]
# returns entries in first and third rows
# and in columns 1 thru 3
One exception to this pattern is indexing with a logical condition on the matrix itself. The condition only needs to be written once inside the brackets, rather than once for rows and once for columns:
my.matrix[my.matrix > 5]
# returns all entries in my.matrix that are greater than 5.
# Note no comma is needed here, unlike previous examples
Additionally, if you only need to select by rows or columns but not both, you can leave the other index blank. For example, if you only want the first row of my.matrix, you can use my.matrix[1, ]. Similarly, if you only want the second and third column of my.matrix, you can use my.matrix[, 2:3].
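As a small sketch of leaving one index blank, here is the same idea on a fresh 3x3 matrix (the names example.matrix, first.row, and last.two.columns are made up for this illustration):

```r
example.matrix <- matrix(1:9, nrow = 3)

first.row <- example.matrix[1, ]
# first.row is a plain vector: 1 4 7

last.two.columns <- example.matrix[, 2:3]
# last.two.columns is a 3x2 matrix that looks like this:
# 4 7
# 5 8
# 6 9
```

Note that selecting a single row gives back a vector rather than a one-row matrix, while selecting two columns gives back a matrix.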
Matrices can be manipulated in many of the same ways as vectors. For example, you can add, subtract, multiply, and divide between matrices of the same size or between a matrix and a single number. There are several other operations that are specific to matrices that I will introduce here, but unless you have an interest in linear algebra, you will probably not need to use them.
Matrix multiplication is a special type of multiplication that is specific to matrices. There are terrific resources online to explain the concept of matrix multiplication, so I will defer explanation to them. Matrix multiplication can only be performed between matrices of compatible dimensions, meaning that the number of columns in the first matrix must be equal to the number of rows in the second matrix. In R, matrix multiplication is performed with the %*% operator.
my.matrix <- matrix(1:12, ncol=4)
my.matrix2 <- matrix(13:24, nrow=4)
my.matrix %*% my.matrix2
# this matrix product gives a 3x3 matrix
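Because the ordinary * operator also works on matrices, it is easy to confuse it with %*%. The following sketch, using a made-up 2x2 matrix called a, contrasts element-wise arithmetic with the matrix product:

```r
a <- matrix(1:4, nrow = 2)
# a looks like this:
# 1 3
# 2 4

doubled <- a + a           # element-wise addition
scaled <- a * 10           # every entry multiplied by a single number
elementwise <- a * a       # element-wise product (NOT matrix multiplication)
# elementwise looks like this:
# 1  9
# 4 16

matrix.product <- a %*% a  # true matrix multiplication
# matrix.product looks like this:
#  7 15
# 10 22
```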
The transpose of a matrix (switching rows with columns) can be found with the t() function. The inverse of a square matrix (i.e., same number of rows and columns) can be found with the solve() function, and the determinant with the det() function.
my.matrix <- matrix(c(1, 8, 3, 4), nrow=2)
t(my.matrix) # transpose
det(my.matrix) # determinant
solve(my.matrix) # inverse
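One way to convince yourself that solve() really returns the inverse is to multiply the result by the original matrix, which should give the identity matrix. This is only a sketch, and the names square.matrix, inverse, product, and singular.matrix are made up for the illustration:

```r
square.matrix <- matrix(c(1, 8, 3, 4), nrow = 2)
inverse <- solve(square.matrix)

# Multiplying a matrix by its inverse should give the identity matrix
product <- square.matrix %*% inverse

# all.equal() compares with a small tolerance, which is needed here
# because of floating-point rounding; diag(2) is the 2x2 identity matrix
is.identity <- isTRUE(all.equal(product, diag(2)))

# A matrix whose determinant is zero has no inverse
singular.matrix <- matrix(c(1, 2, 2, 4), nrow = 2)
det(singular.matrix)
# solve(singular.matrix) would stop with an error
```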
By far the most common data structure in R, data frames are a way to group several vectors of different types (but the same length) into a single object. Think of data frames like a table: each column is some measurement, and each row is an observation. For example, a data frame could be used to store the name, age, height and marital status of several people:
my.df <- data.frame(
name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
age = c(35, 19, 23, 61, 10),
height = c(5.9, 6.1, 5.5, 5.4, 3.5),
is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)
As you can see here, the data.frame() function is used to create a data frame. This function takes named arguments, where the name of the argument is the name of the column, and the value of the argument is the vector of data for that column. The vectors do not need to be of the same type, but they must be the same length.
The names of the columns can be changed with colnames() in the same way that they can be changed for matrices. It is rare to use rownames() for data frames.
To identify a data frame, you can use the class() function, which will return "data.frame". However, there are some special cases where this won't work as expected, so it is better to use the is.data.frame() function, which will return TRUE if the object is a data frame and FALSE otherwise.
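As one illustration of such a special case (my own assumption about the kind of case meant here), an object can carry more than one class. In the sketch below, the class name "my.special.table" is entirely made up, and it is attached by hand only to show what happens:

```r
people <- data.frame(name = c("Sarah", "Jake"), age = c(35, 19))

class(people)
# "data.frame"

# Hypothetical extra class, added here only for illustration
class(people) <- c("my.special.table", "data.frame")

class(people) == "data.frame"
# FALSE TRUE (one answer per class, which is awkward to use in a condition)

is.data.frame(people)
# TRUE (a single, unambiguous answer)
```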
You can index data frames in the same way as matrices. However, it is not common to index data frames by their row and column indices. Instead, you will usually want to index data frames by the names of the columns. This can be done by using the $ operator, like this:
my.df$name
# returns the "name" column of my.df
my.df$name[4]
# returns the fourth element of the "name" column
As you saw above, R allows you to name columns when you create a data frame by typing the name of the column, followed by an equals sign, followed by the vector of data for that column. Once the data frame is created, however, the names of the columns are stored as strings in a character vector. This is irregular compared to other languages, since the names of the columns are sometimes treated as variables and sometimes treated as strings. This becomes a problem when the names of the columns are not valid variable names. For example, if you have a column named speed (m/s), you cannot use the $ operator on its own to access that column because speed (m/s) has a space, parentheses, and a slash, none of which are allowed in variable names. In this case, you can either use the [[]] operator or you can use $ with backticks (`) to get that column:
my.df[["speed (m/s)"]]
# returns the "speed (m/s)" column of my.df
my.df$`speed (m/s)`
# also returns the "speed (m/s)" column of my.df
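To try this out yourself, you need a data frame that actually has such a column. The sketch below builds one; the data frame speed.df and its values are made up for illustration, and check.names = FALSE is the data.frame() argument that keeps the column name exactly as written instead of converting it to a valid variable name:

```r
speed.df <- data.frame(
  trial = c("a", "b"),
  `speed (m/s)` = c(1.5, 2.0),
  check.names = FALSE  # keep the column name exactly as written
)

colnames(speed.df)
# "trial" "speed (m/s)"

speed.df[["speed (m/s)"]]
# returns the "speed (m/s)" column as a vector

speed.df$`speed (m/s)`
# returns the same vector
```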
Although I highly discourage its use, you can use backticks more broadly to make variable names that would otherwise be invalid. I expose you to this because it unfortunately shows up in various packages. The following is technically valid R code, but if you use it, I will not be coming to your birthday party any more.
`please don't use this` <- c(7, 4, 3, 1, 5)
`please don't use this`[`please don't use this` >= 5] <- 0
Above, I mentioned the [[]] operator. This operator is used to access elements of a data frame by name. It behaves the same way as the $ operator. Here, I want to briefly demonstrate the difference between [[]] and []. Try out the code below to see the difference. You'll see that [[]] returns a vector, while [] returns a data frame that contains that same vector. The difference between the two is subtle, but it can be important in some situations. It also applies to lists as discussed below.
my.df["name"]
# returns a data frame with only the "name" column
my.df[["name"]]
# returns the vector that the "name" column had
# (not inside a data frame)
It is frequently of interest to select a subset of a data frame based on some condition. This can be done using the subset() function. Here, the first argument is the data frame to be subset, and the second argument is a condition that must be met for a row to be included in the subset. The condition can use column names from the data frame, without the need to use the $ operator.
subset(my.df, age > 20)
# returns a subset of my.df where the age is greater than 20
subset(my.df, age > 20, c(name, height))
# same as above, but only returns the
# columns for "name" and "height"
Although being able to refer to the age column without the $ in the code above is convenient, it is important to remember that age is not a variable that can be accessed outside the context of the my.df data frame. For example, on a separate line, age > 20 would not make sense to R, and it would give an error that age was not defined. This is one of the shortcomings of R: shorthands are sometimes only applicable in limited contexts. Other languages make great efforts to avoid such inconsistencies.
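For comparison, the same selection can be written without the shorthand, using the bracket indexing and $ operator covered earlier. This is a sketch on a made-up data frame called people; the names with.subset and with.brackets are also just for illustration:

```r
people <- data.frame(
  name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
  age = c(35, 19, 23, 61, 10),
  height = c(5.9, 6.1, 5.5, 5.4, 3.5)
)

# Shorthand: column names are looked up inside the data frame
with.subset <- subset(people, age > 20, c(name, height))

# Explicit: a logical vector selects rows, and a character vector selects columns
with.brackets <- people[people$age > 20, c("name", "height")]

# Both approaches keep the rows for Sarah, Ava, and Elija
all.equal(with.subset, with.brackets)
```

The explicit form is longer, but every name in it is an ordinary variable or string, so it behaves the same way everywhere.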
If you want to add a column to a data frame, the preferred method is to use the $ operator, as seen below:
my.df$height.in.m <- my.df$height / 3.281
my.df$eye.color <- as.factor(c("blue", "brown", "green", "brown", "blue"))
In the code above, we have added a new column to my.df called height.in.m that is the height in meters, which uses arithmetic on the height column. We have also added a new column called eye.color that is a factor vector of the eye colors of each person. Remember from earlier chapters that factors have nearly identical properties to character vectors, but they are used to represent categorical data that only has a few possible values.
For matrices above, I mentioned that, when indexing, you can leave one of the indices blank if you only want to select by rows or columns. This is also true for data frames. For example, if you only want the first row of my.df, you can use my.df[1, ]. However, since matrices can only be of one type, this operation will only ever return a vector for matrices. In data frames, where columns can be of different types, this operation will return a data frame that contains that row.
This presents a problem when it comes to adding new rows to an existing data frame that does not occur when adding new columns: the entries in a data frame across a row are not necessarily of the same type. That means vectors can't help us here, so the following code will not work as intended:
my.df <- data.frame(
name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
age = c(35, 19, 23, 61, 10),
height = c(5.9, 6.1, 5.5, 5.4, 3.5),
is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
)
my.df[nrow(my.df) + 1, ] <- c("Tim", 63, 6.0, FALSE)
# this will NOT work as intended because c() will force
# its contents to be of the same type (here, character),
# even though the columns of my.df are of different types.
# R does not give an error, but every column becomes character
One way to fix this issue is to create another data frame that contains only this new row of information, but it must have the same column names as the original data frame. However, this is not a very elegant solution. Instead, we can use lists to solve this problem. These will be explained more in the next section, but for now, know that lists are a data structure that can contain objects of different types. We can use a list to add a new row to a data frame like this:
my.df[nrow(my.df) + 1, ] <- list("Tim", 63, 6.0, FALSE)
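To see the difference between these approaches, you can check the type of each column afterwards. The sketch below uses a smaller, made-up data frame called people and assumes R 4.0 or later, where text columns are stored as character vectors rather than factors. It also shows the one-row data frame approach mentioned above, using rbind():

```r
people <- data.frame(
  name = c("Sarah", "Jake"),
  age = c(35, 19),
  is.married = c(TRUE, FALSE)
)

# Approach 1: assign a list to a new row
with.list <- people
with.list[nrow(with.list) + 1, ] <- list("Tim", 63, FALSE)

# Approach 2: build a one-row data frame with the same column names,
# then stack it underneath the original with rbind()
new.row <- data.frame(name = "Tim", age = 63, is.married = FALSE)
with.rbind <- rbind(people, new.row)

# For contrast: assigning a vector made with c()
with.vector <- people
with.vector[nrow(with.vector) + 1, ] <- c("Tim", 63, FALSE)

sapply(with.list, class)    # "character" "numeric" "logical"
sapply(with.rbind, class)   # "character" "numeric" "logical"
sapply(with.vector, class)  # all three columns are now "character"
```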
Most of your time working with data frames will be spent preparing them for statistical analysis. This might mean removing rows or columns that are not needed, or it might mean sorting the data in a particular way. I will show an example where we sort columns by their sums, from greatest to smallest, and then remove the column with the smallest sum. This is a contrived example, but it demonstrates how to manipulate data frames in a way that is useful for statistical analysis.
First, we will create a data frame with 100 rows and 10 columns, where each entry is a random number between 0 and 1. Note that I am making the data frame by giving the data.frame() function a matrix. You won't usually create data frames this way, but it is useful for demonstration purposes. Then, I name the columns of the data frame with the colnames() function, using R's built-in vector called letters (which is all the lower case letters in order) to name the columns a through j.
set.seed(100)
column.count <- 10
row.count <- 100
my.df <- data.frame(
matrix(
runif(column.count * row.count),
nrow=row.count
)
)
colnames(my.df) <- letters[1:column.count]
This next part will look confusing, but it is just a combination of things that we have already seen. I am going to present it as a single line, not because it is good practice to do so, but because this is unfortunately the kind of bad code you will run across in the wild. Below, we'll rewrite this single line into something more readable.
my.df <- my.df[, order(colSums(my.df), decreasing=TRUE)[1:(column.count-1)]]
When you see something confusing like this, it is best to work inside-out. At the very center of this code is colSums(my.df), which returns a vector of the sums of each column of my.df. This will be needed to sort the columns by their sums. The order() function, discussed briefly last chapter, then takes this vector and returns a vector of indices that would sort the vector in descending order (since decreasing=TRUE). Then, since we are interested in keeping all but the lowest-sum column, we trim off the last result of order() with [1:(column.count-1)]. Finally, the [, ] (note the comma) operator is used to select and reorder the columns of my.df by the indices returned by order(). This is a lot to take in, so let's rewrite this code to be more readable:
sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
selected.columns <- sorted.column.indices[1:(column.count-1)]
my.df <- my.df[, selected.columns]
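If the role of order() is still unclear, it may help to try it on a tiny named vector first. The sketch below uses three made-up column sums; the names sums, descending.indices, and kept.indices are only for illustration:

```r
sums <- c(a = 3.2, b = 9.1, c = 5.5)

order(sums)
# 1 3 2 (indices that would sort sums from smallest to largest)

descending.indices <- order(sums, decreasing = TRUE)
# 2 3 1 (indices that would sort sums from largest to smallest)

sums[descending.indices]
# the entries in the order b, c, a

# Dropping the last index removes the entry with the smallest sum
kept.indices <- descending.indices[1:(length(sums) - 1)]
sums[kept.indices]
# only b and c remain
```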
In the previous section, I showed you how to create a data frame from scratch. However, this is not how you will usually create data frames. Instead, you will usually import data from a file. There are many ways to import data into R, but the most common ways are to import a CSV file or an Excel file. CSV stands for "comma-separated values", and it is a common file format for storing data. A CSV file is simply a text file with a specific format: each row of data is on its own line, and the columns within a row are separated by commas. For example, the following CSV file contains the same data as the data frame of people we created earlier:
name,age,height,is.married
Sarah,35,5.9,TRUE
Jake,19,6.1,FALSE
Ava,23,5.5,FALSE
Elija,61,5.4,TRUE
Sophia,10,3.5,FALSE
If this file is saved as my_data.csv, then it can be imported into R with the read.csv() function, like this (note the file name has to be a full path or the correct relative path, as discussed in a previous chapter):
my.df <- read.csv("my_data.csv")
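If you do not have a CSV file on hand, the following sketch writes a small one to a temporary location and reads it back, so you can see what read.csv() produces. The names csv.path and imported are made up for this illustration, and I assume R 4.0 or later, where text columns are read as character vectors:

```r
# Write a small CSV file to a temporary location
csv.path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "name,age,height,is.married",
    "Sarah,35,5.9,TRUE",
    "Jake,19,6.1,FALSE"
  ),
  csv.path
)

# Read it back in as a data frame
imported <- read.csv(csv.path)

# str() shows the type that read.csv() chose for each column
str(imported)

# Clean up the temporary file
unlink(csv.path)
```

Notice that the first line of the file became the column names, and that read.csv() picked a type for each column based on its contents.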
The read.csv() function returns a data frame, which you store under whatever variable name you choose (my.df in this case). The column names are taken from the first line of the CSV file. For Excel files, you can use the read_excel() function from the readxl package. This function is very similar to read.csv(), but it requires the readxl package to be installed and loaded. The readxl package is not included with base R, so you will need to install it before you can use it. Simply type install.packages("readxl") in the console and hit enter. This only has to be run one time, and need not be included in any R scripts. However, every time you hope to use the read_excel() function, you will need to load the readxl package with library(readxl), and this should be included at the beginning of any R scripts that use the read_excel() function. This function works nearly identically to read.csv(), except if there are multiple sheets, in which case you will need to specify which sheet to import with the sheet argument.
As mentioned above, lists are a data structure that can contain objects of different types. They are similar to vectors in that they are one-dimensional, but can have anything at each index. They are created with the list() function, where each argument is an object to be added to the list. For example, the following code creates a list with a matrix, a data frame, a numeric vector, and a character vector:
my.list <- list(
matrix(1:9, nrow=3),
data.frame(
name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
age = c(35, 19, 23, 61, 10),
height = c(5.9, 6.1, 5.5, 5.4, 3.5),
is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
),
c(1, 8, 3, 2, 5),
"Hello"
)
As you can see, lists are created in a similar way to data frames, where each argument is an object to be added to the list. However, unlike data frames, list objects need not be vectors, nor do they need to be of the same length. In fact, they can be any type of object, including other lists. This makes lists a very flexible data structure.
Lists can be identified by using the class() function, which usually returns "list" if it is one. However, you will have better luck using the is.list() function to determine if an object is a list. Lists can be indexed in a similar way to the columns of data frames (i.e., with single brackets [] or double brackets [[]]):
my.list[1]
# returns the first element of my.list (a matrix in this case),
# itself "wrapped" in a one-element list
my.list[[1]]
# returns the same matrix, with no list involved
my.list[[1]] <- "a"
# replaces whatever is the first element of the list with "a"
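A quick way to check the difference between the two kinds of brackets is to ask R what each one returns. This is a minimal sketch on a made-up two-entry list called example.list:

```r
example.list <- list(matrix(1:4, nrow = 2), "Hello")

class(example.list[1])
# "list" (single brackets always give back a list)

is.matrix(example.list[[1]])
# TRUE (double brackets give back the entry itself)

length(example.list[1:2])
# 2 (single brackets can select several entries at once)
```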
Lists can also be indexed by name using the $ operator if the entries are given names when the list is created, as seen in data frames:
my.list <- list(
matrix.entry = matrix(1:9, nrow=3),
data.frame.entry = data.frame(
name = c("Sarah", "Jake", "Ava", "Elija", "Sophia"),
age = c(35, 19, 23, 61, 10),
height = c(5.9, 6.1, 5.5, 5.4, 3.5),
is.married = c(TRUE, FALSE, FALSE, TRUE, FALSE)
),
some.numbers = c(1, 8, 3, 2, 5),
friendly.salutation = "Hello"
)
my.list$friendly.salutation
# returns "Hello"
As will be discussed later, tons of functions in R return lists, with named entries that contain various results of the function. For example, the rle() function used in a previous chapter returns a list with two entries: lengths and values. We accessed the lengths entry by using the $ operator.
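As a reminder of what that looks like, here is a short sketch using rle() on a made-up character vector (the name runs is only for illustration):

```r
runs <- rle(c("a", "a", "b", "c", "c", "c"))

is.list(runs)
# TRUE

names(runs)
# "lengths" "values"

runs$lengths
# 2 1 3

runs$values
# "a" "b" "c"
```

This is also a case where class() does not return "list" (it returns "rle" here), while is.list() still returns TRUE.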
One important thing to note is that adding additional entries to lists is computationally slow, because R has to copy the entire list to a new location in memory. This is not a problem for small lists, but it can be a problem for large lists. If you know the size of the list beforehand, it is better to create a list of that size and then fill it in, rather than creating an empty list and adding to it.
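The two approaches can be sketched as follows. The names entry.count, grown.list, and filled.list are made up for this illustration, and vector("list", n) is one way to create a list of n empty (NULL) entries. Both loops produce the same list; the difference is only in how the list gets its size:

```r
entry.count <- 1000

# Growing: start with an empty list and add one entry at a time
grown.list <- list()
for (i in seq_len(entry.count)) {
  grown.list[[length(grown.list) + 1]] <- i^2
}

# Preallocating: create a list of the final size, then fill it in
filled.list <- vector("list", entry.count)
for (i in seq_len(entry.count)) {
  filled.list[[i]] <- i^2
}

# Both approaches give the same result
identical(grown.list, filled.list)
```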
R has some very powerful data structures that can be used to store data in a variety of ways. You have seen that R has some unfortunate quirks that make it somewhat difficult to learn. Unfortunately, this means that you will likely struggle to feel that you have a firm grasp on what is happening with R. For this reason, it is helpful to learn other languages, such as Python or Julia, to get a better understanding of how programming languages work. However, given that most scientists do not have the drive to learn multiple programming languages, it is currently in your best interest to learn R given its prominence in the scientific community. Understanding the quirks of R is a much better state to be in than constantly shrugging your shoulders and blindly following what others tell you to do with the language.
Create a data frame called practice.data with the following features (note that you will have to rely on knowledge from a previous chapter to complete this exercise):
- practice.data has 20 rows and 3 columns.
- One column is named species.by.letter and contains the first 20 letters of the alphabet (Hint: use the built-in letters vector).
- One column is named average.mass.kg and contains 20 normally distributed random numbers with a mean and standard deviation of your choice.
- One column is named abundance and contains 20 randomly selected integers from the vector 1:100.
After making this data frame, complete the following steps, using the resulting data frame from each step in the subsequent step. In some cases, there are multiple approaches to complete the step.
1. [...] practice.data whose average.mass.kg value is above the mean used for the normally distributed random numbers.
2. [...] abundance column in descending order.
3. [...] total.biomass.kg that holds the product of the two existing numeric columns.
4. [...] average.mass.kg and the abundance columns.
5. [...] species.by.letter column to the console.
Create a list called practice.list with the following features:
- [...] : operator.
After making this list, complete the following steps,
using the variable name practice.list
in each step. Each can be completed
in a single line of code. Unlike the previous exercise,
there is no need to overwrite the list with these steps, specifically steps 1 through 3.