Sometimes, you have a function that you want to apply to every
element in a vector or list. For example, in the chapter about for loops,
we discussed a case where we wanted to capitalize every element in a
vector of names and make sure it had a certain suffix. As we did there,
we can always apply this function to each element by using a for loop.
However, there is an easier and more concise way to do this. In many
languages, this is referred to as mapping a function over a vector; in R,
it is called applying. In this chapter, we will discuss how to apply
functions to vectors and lists.
In R, there are a few different functions that can be used to apply
a function to each element in a vector or list. These functions include
lapply(), sapply(), vapply(), and replicate(), and they essentially do
the same thing. Their differences are largely in the format of the output.
In fact, sapply() is just a wrapper for lapply(), meaning that it calls
lapply() under the hood, and replicate() is in turn a wrapper for sapply().
Generally, you will use only lapply() (the apply function that returns a
list) and sapply() (the apply function that usually returns a vector,
discussed more below).
Below, we will discuss a couple of ways that you can pass a function as an argument to these apply functions.
The first way to pass a function as an argument to an apply function
is to simply pass the name of the function. For example, if we wanted
to apply the sqrt() (square root) function to each element in a vector,
we could do the following:
sapply(1:10, sqrt)
In this case, it would have been even simpler to just run sqrt(1:10),
but this is a good starting point. Note that we did not include the
parentheses after sqrt when we used it in sapply(). This is because we
are not calling the function right then and there, but rather telling
sapply() which function to call on each element of the vector.
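To make the point about passing a function by name concrete, here is a minimal sketch that compares the sapply() call with the direct, vectorized call. The variable names applied and direct are only illustrative.

```r
# Passing the function by name: sapply() calls sqrt() once per element.
applied <- sapply(1:10, sqrt)

# sqrt() is vectorized, so calling it directly gives the same values.
direct <- sqrt(1:10)

print(all.equal(applied, direct))

# A function name without parentheses is an ordinary object
# that can be handed to another function.
print(is.function(sqrt))
```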
As mentioned before, lapply() returns a list of results, while sapply()
usually returns a vector of results. Which you pick depends on what you
want. For example, say you want a list of 10 vectors of random uniform
numbers, but in a special structure where the first vector has 5 random
numbers, the second has 6, and so on. This can be done quickly with lapply():
lapply(5:14, runif)
As you can see here, the output is a list of 10 vectors. In the sapply()
example, each call to sqrt() returned a single number, so the results fit
neatly into one vector. Here, each call to runif() returns a vector of a
different length, so a list of vectors is the more appropriate output.
In fact, if we tried to use sapply() in this case, it would just give up on
trying to return a vector and return a list of vectors instead.
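The following sketch checks both claims: that each vector from lapply() has the requested length, and that sapply() falls back to a list here. The seed value and the variable names are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the draws are reproducible

# Each element of 5:14 becomes the first argument of runif(),
# which is the number of values to draw.
uniform.list <- lapply(5:14, runif)
print(lengths(uniform.list))

# The vectors have different lengths, so sapply() cannot simplify
# the result and returns a list.
uniform.sapply <- sapply(5:14, runif)
print(is.list(uniform.sapply))
```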
These functions also work with user-defined functions. For example, we can borrow the function from the previous chapter where we wanted to capitalize every element in a vector of names, and make sure it had a certain suffix:
fixDoctorName <- function (doctor.name) {
  # make all letters capitalized
  doctor.name <- toupper(doctor.name)
  # if the name doesn't end with " MD"...
  if (!endsWith(doctor.name, " MD")) {
    # ...add " MD" to the end
    doctor.name <- paste(doctor.name, "MD")
  }
  return(doctor.name)
}
Now, we can define a vector of names to fix (this vector can be copied below):
names <- c("olivia bennett MD", "ETHAN HAYES MD", "Mia Rodriguez")
We can apply this function to each element in the vector using sapply():
sapply(names, fixDoctorName)
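When sapply() is given a character vector, it labels each result with the original element, which can make the printed output look busy. The sketch below repeats the function from above and shows one way to drop those labels with unname(). The variable names doctor.names and fixed.names are illustrative; doctor.names is used here instead of names only to avoid masking the built-in names() function.

```r
fixDoctorName <- function (doctor.name) {
  doctor.name <- toupper(doctor.name)
  if (!endsWith(doctor.name, " MD")) {
    doctor.name <- paste(doctor.name, "MD")
  }
  return(doctor.name)
}

doctor.names <- c("olivia bennett MD", "ETHAN HAYES MD", "Mia Rodriguez")

# The result is a character vector whose labels are the original inputs.
fixed.names <- sapply(doctor.names, fixDoctorName)
print(fixed.names)

# unname() keeps the values and drops the labels.
print(unname(fixed.names))
```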
Note that in all the examples above, the function being applied received
only one argument: the element of the vector. However, you can also pass
additional arguments to the function you are applying. For example, say we
wanted to apply the round() function to each element in a vector, but we
wanted to round to 3 decimal places. We can do this by passing the round()
argument digits=3 to sapply():
sapply(1:10*pi, round, digits=3)
In case this line of code is confusing, I will write this same line of code out the "long way" below:
c(
  round(pi, digits=3),
  round(2*pi, digits=3),
  round(3*pi, digits=3),
  round(4*pi, digits=3),
  round(5*pi, digits=3),
  round(6*pi, digits=3),
  round(7*pi, digits=3),
  round(8*pi, digits=3),
  round(9*pi, digits=3),
  round(10*pi, digits=3)
)
As you hopefully see here, adding the digits=3 argument to sapply()
has the effect of giving this argument to round() each time it is called.
In these examples, we used the keyword digits to specify the argument,
but you can also use the argument's position in the function. The round()
function can have the number of decimals specified by position only, as
described in the functions chapter: round(10.12345, 3). This means that
the above sapply() call could also be written as follows:
sapply(1:10*pi, round, 3)
Perhaps this will be unsurprising to you, but I recommend that you specify this argument by name, rather than by position, for readability. You are also not limited to one extra argument: multiple arguments can be passed to the function you are applying, by position or by keyword.
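As a small illustration of both points, the sketch below confirms that the keyword and positional forms agree, and then passes two extra arguments at once. The paste() example and the variable names are my own choices for illustration, not taken from the examples above.

```r
# Keyword and positional forms give identical results.
by.keyword <- sapply(1:10 * pi, round, digits = 3)
by.position <- sapply(1:10 * pi, round, 3)
print(identical(by.keyword, by.position))

# Two extra arguments at once: "x" is passed by position and sep by keyword,
# so each call is effectively paste(element, "x", sep = "-").
joined <- sapply(c("a", "b"), paste, "x", sep = "-", USE.NAMES = FALSE)
print(joined)
```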
In all the previous examples, each element of the vector or list was
passed as the first argument to the function being applied. That is a
constraint of the apply functions: the elements of the vector or list are
always passed as the first argument of the function being applied.
However, there is a way around this constraint. You can use an anonymous
function, which is a function that is not given a name. This is done by
defining a function "on the fly" using the function keyword, and not
assigning that function to any variable name. For example, perhaps we want
to make a list of vectors of random normal numbers, but we want to change
the mean for each vector. We can do this as follows:
lapply(1:10, function (x) {rnorm(10, mean=x)})
This successfully gives us a list of ten vectors, each with ten random
normal numbers, but with a different mean each time. Note that we did not
assign the function function (x) {rnorm(10, mean=x)} to a variable name,
which makes it an anonymous function. This function takes one argument, x,
and returns a vector of ten random normal numbers with mean x. Since the
first (and only) argument of the anonymous function is the element of the
vector or list, this is an appropriate work-around for needing to pass each
element of the vector as the first argument.
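To see why the work-around is needed, the sketch below first passes rnorm directly, which uses each element as the number of draws, and then shows that the anonymous function behaves the same as a named equivalent. The name drawWithMean and the seed are hypothetical choices for illustration.

```r
# Passing rnorm directly uses each element as n, the number of draws.
draws.by.n <- lapply(1:3, rnorm)
print(lengths(draws.by.n))

# The anonymous function routes each element to the mean argument instead.
set.seed(1)  # arbitrary seed so both runs draw the same numbers
anonymous.result <- lapply(1:10, function (x) {rnorm(10, mean=x)})

# An equivalent named version (drawWithMean is a hypothetical name).
drawWithMean <- function (x) {rnorm(10, mean=x)}
set.seed(1)
named.result <- lapply(1:10, drawWithMean)

print(identical(anonymous.result, named.result))
```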
As mentioned in a previous chapter, there is a shorthand for creating
functions in R (available since R 4.1.0), where \ replaces the word function.
Perhaps the only place where this shorthand is helpful is when using
functions like lapply(), where you are passing a function as an argument.
For example, the above code could be written as follows:
lapply(1:10, \(x) {rnorm(10, mean=x)})
Above, I mentioned that using sapply() instead of lapply() for one example
ended with sapply() just giving up on trying to return a vector and
returning a list of vectors instead. Interestingly, in the case of the code
shown immediately above, sapply() will actually return a matrix of values,
where each column is a vector of random normal numbers with a different
mean. This is because the "s" in sapply() stands for simplify: this apply
function first tries to simplify the output to a vector; if it cannot, it
tries to return a matrix; and if it cannot do that either, it returns a
list. In this case, it cannot return the output as a single vector, so
instead it simplifies the output to a matrix.
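The three outcomes of this simplification can be sketched side by side. The inputs and variable names below are illustrative only.

```r
# Every result has length 1: simplified to a plain vector.
vector.case <- sapply(1:3, function (x) {x^2})

# Every result has the same length (here 10): simplified to a matrix
# with one column per element of the input.
matrix.case <- sapply(1:10, function (x) {rnorm(10, mean=x)})

# Results have different lengths: no simplification, so a list comes back.
list.case <- sapply(1:3, function (x) {seq_len(x)})

print(vector.case)
print(dim(matrix.case))
print(is.list(list.case))
```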
To give one more example of an anonymous function used as an argument of the apply functions, say we want to delete the leading digit of each number in a vector. This can be done as follows:
sapply(c(101, 34, 555), \(x) {as.numeric(substr(x, 2, nchar(x)))})
The anonymous function uses the substr() and nchar() functions, which
implicitly convert each number to a string, to chop off the first
character. Finally, the as.numeric() function converts the string back to
a number, which now is missing the leading digit. Technically, this example
does not require sapply() at all, as the substr() and nchar() functions can
be applied to a vector of numbers directly. However, this example is a good
demonstration of how to use an anonymous function as an argument to an
apply function.
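For comparison, here is a short sketch of the same computation with and without sapply(). The variable names are illustrative.

```r
numbers <- c(101, 34, 555)

# Element by element, using an anonymous function.
with.sapply <- sapply(numbers, function (x) {as.numeric(substr(x, 2, nchar(x)))})

# Directly on the whole vector, since substr() and nchar() are vectorized.
vectorized <- as.numeric(substr(numbers, 2, nchar(numbers)))

# 101 becomes "01", which converts back to the number 1.
print(with.sapply)
print(identical(with.sapply, vectorized))
```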
In this chapter, we discussed how to apply functions to vectors and lists.
We discussed the lapply() and sapply() functions, and how they can be used
to apply functions to vectors and lists. We also discussed how to pass
additional arguments to the function being applied, and how to use
anonymous functions as arguments to the apply functions.
I will note that there are other ways to apply functions to vectors and
lists, with one noteworthy omission being the mapply() function, which
allows you to apply a function to multiple vectors or lists at once.
However, I imagine that this chapter will get you started with
understanding how to apply functions to vectors and lists, and you can
learn more about the other ways to do this in the future through
documentation and online research.
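As a brief taste of mapply(), the sketch below walks along two vectors in parallel. The function and the inputs are made up for illustration; see the mapply() documentation for the full set of options.

```r
# mapply() takes the function first, then the vectors to step through together.
# The calls are effectively 2^1, 3^2, and 4^3.
powers <- mapply(function (base, exponent) {base^exponent}, 2:4, 1:3)
print(powers)
```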
Use the functions discussed above to complete the following prompts.
Apply the seq() function to make a list of vectors, where each vector is a sequence of numbers from 1 to 10, with each vector having a length of 10 plus the index number.
Apply the rep() function to make a 5-by-5 matrix of characters, where each column is a vector of the same character repeated 5 times, and each column has a different character. Get these characters from the letters vector.
Apply the rnorm() function to make a list of 5 vectors, where each vector has 10 random normal numbers with the mean equal to the index number of the entry and a standard deviation of 1. Then, apply the mean() function to this list to get a vector that shows what the means of each vector actually turned out to be.
In a practice problem of an earlier chapter, I asked you to create a list where each element is a row of Pascal's triangle. Here, we will redo this problem using apply functions.
To begin, I will introduce the choose() function. This function takes two
arguments, n and k, and returns the number of ways to choose k items from
a set of n items. For example, say you have five friends and you want to
choose two of them to go to the movies with you. There are choose(5, 2)
ways to do this, or 10 ways. This is connected to Pascal's triangle, as the
kth entry of the nth row of Pascal's triangle is choose(n, k), when both
rows and entries are counted starting from zero.
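To make the connection concrete, here is a small sketch that checks the movie example and writes out one row of the triangle the "long way". The variable name row.four is illustrative.

```r
# Number of ways to choose 2 friends out of 5.
print(choose(5, 2))

# Row 4 of Pascal's triangle, counting rows and entries from zero.
row.four <- c(
  choose(4, 0),
  choose(4, 1),
  choose(4, 2),
  choose(4, 3),
  choose(4, 4)
)
print(row.four)
```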
First, write a function called getPascalRow that takes one argument, n,
and returns a vector of length n + 1 whose entries are choose(n, k) for
k from 0 to n. You can do this by applying the choose() function to a
vector of numbers from 0 to n.
Now apply getPascalRow to a vector of numbers from 1 to 10 to get a list
of vectors, where each vector is a row of Pascal's triangle.
Imagine you have a data frame where each column is a basketball team and each of the 10 rows is the number of points scored in a game. I made a fake data set with this structure that you can copy below:
points.data <- data.frame(
  team1 = c(62, 110, 93, 121, 105, 83, 92, 101, 110, 109),
  team2 = c(91, 93, 65, 102, 93, 92, 98, 83, 95, 175),
  team3 = c(81, 96, 107, 111, 53, 92, 94, 106, 92, 120),
  team4 = c(110, 80, 92, 86, 78, 85, 87, 63, 72, 73),
  team5 = c(90, 93, 25, 50, 120, 92, 93, 92, 93, 92)
)
Using boxplot(points.data), you will be able to have a rough visualization
of the data. You may be interested in finding each team's average score.
This can be done by simply running colMeans(points.data), which will return
a vector of the average score for each team. This is equivalent to running
sapply(points.data, mean), which applies the mean() function to each column
of the data frame.
However, the box plot shows that there are outliers for each team, and
averages are very sensitive to outliers. Thus, an average may not be
appropriate for our purposes, most notably since team 2 scored 175 points
in one game and team 5 scored 25 points in another.
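The equivalence between colMeans() and sapply() works because a data frame is a list of columns, so sapply() visits one column at a time. A minimal sketch, repeating the data from above so it runs on its own:

```r
points.data <- data.frame(
  team1 = c(62, 110, 93, 121, 105, 83, 92, 101, 110, 109),
  team2 = c(91, 93, 65, 102, 93, 92, 98, 83, 95, 175),
  team3 = c(81, 96, 107, 111, 53, 92, 94, 106, 92, 120),
  team4 = c(110, 80, 92, 86, 78, 85, 87, 63, 72, 73),
  team5 = c(90, 93, 25, 50, 120, 92, 93, 92, 93, 92)
)

# Both approaches return a vector of averages labeled by team.
column.means <- colMeans(points.data)
applied.means <- sapply(points.data, mean)

print(applied.means)
print(all.equal(column.means, applied.means))
```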
There is a concept of the trimmed mean, which is the mean of a vector after
removing the top and bottom x percent of the values. This is useful for
removing the influence of outliers on a measurement that aims to establish
what is a typical value for the data set. For example, say you have a
vector of 100 numbers called number.data, and you are looking to calculate
the 5% trimmed mean. You could do so as follows:
percent.trim <- 0.05
number.data <- sort(number.data)
data.length <- length(number.data)
trim.length <- round(data.length * percent.trim)
lower.bound <- trim.length + 1
upper.bound <- data.length - trim.length
trimmed.mean <- mean(number.data[lower.bound:upper.bound])
This code first sorts the vector, then calculates the number of values to
trim from the top and bottom of the vector. Then, it calculates the indices
of the values to include in the trimmed mean, and finally calculates the
trimmed mean. This code can be wrapped in a function called getTrimmedMean
as follows:
getTrimmedMean <- function (number.data, percent.trim) {
  number.data <- sort(number.data)
  data.length <- length(number.data)
  trim.length <- round(data.length * percent.trim)
  lower.bound <- trim.length + 1
  upper.bound <- data.length - trim.length
  trimmed.mean <- mean(number.data[lower.bound:upper.bound])
  return(trimmed.mean)
}
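As a quick sanity check of the function, the sketch below runs it on a small made-up vector with one outlier at each end. It also compares the result with the trim argument of the built-in mean(); note that mean() rounds the number of trimmed values down rather than to the nearest integer, so the two can disagree for other lengths or trim fractions.

```r
getTrimmedMean <- function (number.data, percent.trim) {
  number.data <- sort(number.data)
  data.length <- length(number.data)
  trim.length <- round(data.length * percent.trim)
  lower.bound <- trim.length + 1
  upper.bound <- data.length - trim.length
  trimmed.mean <- mean(number.data[lower.bound:upper.bound])
  return(trimmed.mean)
}

# Made-up data: eight typical values plus one very low and one very high value.
example.data <- c(1, 50, 51, 52, 53, 54, 55, 56, 57, 1000)

# The plain mean is pulled far upward by the 1000.
print(mean(example.data))

# A 10% trim on 10 values drops one value from each end.
trimmed <- getTrimmedMean(example.data, 0.1)
print(trimmed)

# Cross-check against the built-in trim argument (agrees for this input).
print(all.equal(trimmed, mean(example.data, trim = 0.1)))
```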
Apply the getTrimmedMean function with a 10% trim to calculate the trimmed
mean for each team in points.data. Compare these results to the output of
colMeans(points.data) to see if it made a meaningful difference in the
ranking between teams.