So far, I have introduced several functions, but have not discussed functions generally. Functions are a way to reuse code quickly, and they can make code much more readable and easier to edit. Some functions are used so often that they come ready to use without any need to bring in other code; these are called built-in functions. Other functions come with packages that you import into your code, like the read_excel() function used in the previous chapter. Finally, you can write your own functions, called user-defined functions, which are essential for writing readable and reusable code. Here, I'll discuss the general structure of functions, cover some important built-in functions, and show how to define your own functions.
This chapter will show the inconsistencies of naming conventions in R. Although R is not careful about how it names functions, it is pivotal that you name things clearly and consistently. Doing so makes code easier to read and edit, which helps science progress more rapidly and helps you avoid mistakes.
Functions can have inputs and outputs. The inputs are called arguments, and the outputs are called return values. We'll talk about each of these below. One vital tool to be aware of is the documentation for functions. You can access the documentation for a function by typing a question mark followed by the name of the function. Try it out below.
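As a small sketch of what this looks like in practice (the variable name sample.arguments is mine, chosen only for illustration), the lines below show two equivalent ways to open a help page, along with formals(), which lists a function's arguments and their default values without opening the help page:

```r
# Either of these opens the help page when typed into the console:
#   ?sample
#   help("sample")

# formals() returns the arguments of a function along with any default values.
sample.arguments <- formals(sample)
names(sample.arguments)    # the argument names, in order
sample.arguments$replace   # the default value of the replace argument
```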
Here, we will show an example of how to find and read the documentation for a function, using sample(). I will mention up front that the documentation for many R functions is riddled with jargon and special cases, so it can be difficult to understand. However, it is nonetheless some of the best documentation available for these functions, and is more reliable than some information you might find on the internet.
In a previous chapter, I introduced the sample() function, which is used to randomly pick elements from a vector. We will now look at the official documentation for this function. To see it, type ?sample into the console and press enter. You will see the documentation for the function appear in the bottom right window of RStudio. You will see a part of the documentation that looks like this under the "Usage" heading:
sample(x, size, replace = FALSE, prob = NULL)
When you used this function previously, I only showed you how to use the first three arguments: x, size, and replace. However, you can see that there is a fourth available argument, prob. Scroll down to the "Arguments" section and read the description of how to use this argument. Then scroll down to the "Details" section, to the paragraph that discusses the prob argument.
You will have read that the prob argument is used to specify the weighted probability of each element of x being chosen. This means that you can use this argument to make some elements of x more likely to be chosen than others. Try the following bit of code to see how this works.
set.seed(100)
sample.of.integers <- sample(50:52, 100, replace=TRUE, prob=c(4, 2, 1))
# the prob vector makes it so 50 is twice as likely to be
# selected as 51, and 51 is twice as likely to be
# selected as 52. Use the code below to see if that is
# reflected in the results.
hist(sample.of.integers)
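If you would rather check the numbers than read them off a histogram, one option is to tabulate the draws. The sketch below uses a larger sample than the example above (10,000 draws, an arbitrary choice of mine) so that the observed proportions settle close to the expected 4/7, 2/7, and 1/7. The variable names are illustrative only.

```r
set.seed(100)
prob.weights <- c(4, 2, 1)
draws <- sample(50:52, 10000, replace=TRUE, prob=prob.weights)

# proportion of draws that landed on each value
observed.proportions <- table(draws) / length(draws)
# the weights do not need to sum to one, so rescale them for comparison
expected.proportions <- prob.weights / sum(prob.weights)

observed.proportions
expected.proportions
```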
Arguments are the inputs to a function. They are the values that you pass to the function to use. Looking at the example in the box above, you can see that some arguments are required, and some are optional. For example, neither replace nor prob needs to be included for the sample() function to give output. When they are not provided, the documentation tells us that they will be set to the default values of FALSE and NULL, respectively. The size argument is not shown to have a default value, but the "Details" section explains that if size is not provided, it will be set to the length of x. Therefore, the only required argument for the sample() function is x, the vector to be sampled. A bit of terminology: when you give a function an argument, it is called passing the argument to the function, as in passing the salt at the dinner table. So, in the code sample(1:100, 10), we are passing the arguments 1:100 and 10 to the sample() function.
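To see the default for size in action, here is a minimal sketch (the vector and variable names are made up for illustration). When only x is passed, sample() returns all of its elements in a random order:

```r
set.seed(1)
plot.ids <- c(4, 8, 15, 16)

# size is not given, so it defaults to length(plot.ids)
shuffled.ids <- sample(plot.ids)

length(shuffled.ids)   # same length as plot.ids
sort(shuffled.ids)     # the same values, once put back in order
```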
In R, arguments that are required are usually placed first in the list of arguments, and optional arguments are placed after them. These required arguments (e.g., x) are called positional arguments, because their position in the list of arguments determines their meaning. Optional arguments (e.g., replace) are called keyword arguments, because they are identified by a keyword, or name, that is used to specify their value. Keyword arguments can be given without specifying the keyword as long as they are given in the correct order, but it is generally better to specify the keyword to make the code more readable. Additionally, when arguments are specified by their keyword, they can be given in any order, which is useful since it is generally harder to remember the order of keyword arguments than the names of those arguments. In the following code chunk, the first five calls to sample() execute the same command (although the output will be slightly different for each in this case because the seed is not reset each time), but the last two calls, marked DOES NOT WORK, will fail to execute:
numbers <- c(1, 9, 2, 3)
sample(numbers, 10, TRUE, c(3, 1, 1, 1))
sample(numbers, 10, replace=TRUE, prob=c(3, 1, 1, 1))
sample(numbers, 10, prob=c(3, 1, 1, 1), replace=TRUE)
sample(x=numbers, size=10, prob=c(3, 1, 1, 1), replace=TRUE)
sample(size=10, x=numbers, prob=c(3, 1, 1, 1), replace=TRUE)
sample(numbers, 10, c(3, 1, 1, 1), TRUE) # DOES NOT WORK
sample(10, numbers, TRUE, c(3, 1, 1, 1)) # DOES NOT WORK
It should be noted that not all functions take arguments. For example, the date() function does not require any arguments, and simply returns the current date and time as a character string, e.g., "Sat Dec 23 20:40:35 2023". This example shows that even when functions have no arguments, you are still required to include the parentheses after the function name to execute the function. If you omit the parentheses, you will simply get the code that defines the function, rather than running the function.
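A quick way to convince yourself of the difference, sketched here with two illustrative variable names: storing date without parentheses stores the function itself, while storing date() stores the text that the function returns.

```r
without.parentheses <- date     # the function itself, not its result
with.parentheses <- date()      # the character string that the function returns

is.function(without.parentheses)   # TRUE
is.character(with.parentheses)     # TRUE
```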
Functions return values that can be assigned to variables or used directly in other functions. For example, the sample() function returns a vector whose length equals the value of the size argument, with each element being a randomly selected element of the x argument. As discussed above, the date() function returns a character string containing the current date. Many functions, especially statistical ones, return a list of values that result from the function. For example, the lm() function returns a list of values that describe the linear model that was fit to the data. It is helpful to be aware of what is being returned from a function in order to know how to handle the output.
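As a rough illustration of a list being returned, the sketch below fits a linear model to cars, a small data set that ships with R (my choice of example data, not something used elsewhere in this chapter), and then looks at what came back:

```r
# stopping distance modeled as a function of speed
model <- lm(dist ~ speed, data=cars)

is.list(model)       # TRUE: the result is a list under the hood
names(model)         # the named pieces stored in that list
model$coefficients   # one of those pieces: the fitted intercept and slope
```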
In this tutorial, we have already covered many functions that are very important for coding purposes. Here, I will try to make a list of some of the most important built-in functions, which will mention many of the functions that have already been introduced. Although it is tempting to skip this section, pay special attention to the functions that you are unfamiliar with by looking up their documentation and giving them a try.
cat()/print(): print the value of an object to the console. Usually, cat() is better for printing messages to the console, and print() is better for seeing the value of an object.
paste()/paste0(): combine strings into a single string. paste() separates the strings with a space by default, and paste0() does not separate the strings. These are discussed in more detail in a Practice problem below.
max()/min(): return the maximum/minimum value of a vector.
sum()/prod(): return the sum/product of the elements of a vector.
mean()/median(): return the mean/median of the elements of a vector.
sd()/var(): return the standard deviation/variance of the elements of a vector.
length(): return the length of a vector.
seq()/rep(): return vectors from patterns, as discussed in a previous chapter.
sort(): return a vector of the elements of a vector, sorted in ascending order by default.
rev(): return a vector of the elements of a vector, in reverse order.
colSums()/rowSums(): return a vector of the sums of the columns/rows of a matrix or data frame.
colMeans()/rowMeans(): return a vector of the means of the columns/rows of a matrix or data frame.
cbind()/rbind(): combine vectors into a matrix or data frame by column/row.
set.seed(): set the seed for the random number generator.
runif()/rnorm(): return a vector of uniformly/normally distributed random numbers.
sample(): return a vector of randomly selected elements from a vector.
read.csv()/read_excel(): read a csv/excel file into R (read_excel() comes from the readxl package).
write.csv()/write_xlsx(): write a data frame to a csv/excel file (write_xlsx() comes from the writexl package).
plot(): make a scatter or line plot from two vectors (the x-coordinates and y-coordinates of the points).
hist(): make a histogram of a vector.
boxplot(): make a boxplot of a vector.
lines(): add lines to an existing plot with a series of x-coordinates and y-coordinates of the points of the line.
abline(): add a line to an existing plot. This is different from lines() because it adds a line with a specified slope and intercept, rather than adding a line that is defined by a set of points.
par(): set graphical parameters. Most commonly used to fit multiple plots on one screen, such as setting the number of rows and columns of plots, e.g., par(mfrow=c(2, 2)).
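Here is a short sketch that strings a few of these functions together on made-up data (all object names are illustrative):

```r
measurements <- c(3, 9, 4, 1)
sorted.measurements <- sort(measurements)           # ascending by default
reversed.measurements <- rev(sorted.measurements)   # now descending

# cbind() combines vectors as columns; colSums() then adds up each column
site.counts <- cbind(site.a=c(1, 2, 3), site.b=c(10, 20, 30))
column.totals <- colSums(site.counts)

cat("Largest measurement:", max(measurements), "\n")
print(column.totals)
```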
There are many cases where things are tedious to code and re-code. This is especially true in R, where some notation is cumbersome. You have already used many functions that are built into R, and now you are going to learn how to make your own functions from scratch. User-defined functions are used to package complex or "reusable" code into a single function call.
It is important to know the end goal of how you will use your function before making it. I will share one example that is discussed in much more detail below (see Example 3): say that you want to sort the columns of a data frame from highest to lowest sum, and drop a specified number of columns from the end of the data frame. Imagine that such a function already existed, and could be used like this:
my.df <- sortAndTrimColumns(my.df, 4)
# sort columns of my.df by sum and delete the last 4
Of course, if you run the code above right now on some data frame called my.df, you'll get an error saying could not find function "sortAndTrimColumns". However, imagining this function in use is an important step in defining the function. This step of planning how you will use the function in real life gives context to everything else you will do when defining the function.
Now that we have a clear goal for the use of this function, we can make it. It needs to be able to take in two things (a data frame and a number of columns to remove), then perform the sorting and trimming operation, and finally return the newly processed data frame. With this in mind, we already know that when we define sortAndTrimColumns, we need to specify to the computer that it will take two arguments, and that it will need to return a data frame. If you are understanding the process so far, you have already understood the most important parts of defining a function. The rest is just coding like you always do.
In this chapter, I will show the general format of how to define a function, and then show some examples of code that can be turned into functions.
In a previous chapter, I strongly encouraged you to be consistent about how you name variables, and to avoid abbreviations. Here, I will recommend the same for function names, but using a different style. It is very helpful to use a different style for function names than for variable names, so that you can easily tell the difference between the two. For function names, I generally use what is called "camel case", where words are placed end-to-end without periods or underscores, and the first letter of every word except the first is capitalized. In addition, functions should be named to sound like verbs, so that it is clear what they do. Here are a few examples of good and bad function names:
# bad function names:
div.hnd # divides each entry of a vector by 100
last3 # get last three elements of a vector
relabund # get relative species abundance from counts
del.small # delete small values from a vector
getSmall # get small values from a vector
# good function names, matching descriptions from above:
divideByHundred
getLastThree
getRelativeAbundance
deleteSmallValues
getSmallValues
To define a function in R, you will follow this general structure:
name <- function (arg1, arg2, arg3=TRUE) {
  # code to run
  # ...
  return(return.value)
}
Functions you make need not have any positional or keyword arguments, nor do they need to return any value, but most functions you write will have both. In the example above, there are three arguments, two positional (arg1 and arg2) and one keyword (arg3). The keyword argument has a default value of TRUE, so if the user does not specify a value for arg3, it will be set to TRUE.
The return() function specifies what should be given as the result of the function, and no code after the return() call will be executed. It should be noted that if return() is not used in a function definition, the function will return the value of the last expression evaluated in the function body. Omitting return() is generally not recommended for functions that are made to return things, as it can make the function definition less clear to readers. For functions that are meant to perform some action and not return a value, it is generally better to omit return().
After the function is defined and run in R, it can be used like any other function. Just remember that edits to the function definition must be followed by re-running the function definition in RStudio in order for the edits to take effect.
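As a sketch of this template in use, here is one way the getRelativeAbundance function named earlier might be written. The body and the as.percent keyword argument are my own assumptions, added for illustration:

```r
getRelativeAbundance <- function (counts, as.percent=FALSE) {
  proportions <- counts / sum(counts)
  if (as.percent) {
    # nothing after this return() runs when as.percent is TRUE
    return(proportions * 100)
  }
  return(proportions)
}

getRelativeAbundance(c(1, 1, 2))                    # as.percent defaults to FALSE
getRelativeAbundance(c(1, 1, 2), as.percent=TRUE)   # keyword argument given by name
```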
As a side note, R (version 4.1.0 and later) has a shorthand for the function definition: you can replace the word function with a single backslash (\). For example, the following code defines the function giveMessage in one line:
giveMessage <- \() {cat("Hello!")}
# identical to:
giveMessage <- function () {cat("Hello!")}
# also identical to:
giveMessage <- function () {
  cat("Hello!")
}
This \ shorthand is useful in some cases, e.g., functions you need to fit onto one line and that you want to keep short, but in most cases you should type out the word function for readability. You will note that I omitted return() in this example, and that is because the purpose of this function is not to return a value, but to print a message to the console. Technically, the cat() function returns NULL, and since it is the "last" (and only) line of giveMessage, giveMessage will return NULL if you were to assign its output to a variable. However, to reiterate, the purpose of this function is to print a message, not to return a value, so return() is omitted.
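You can confirm this yourself with a small sketch. The variable name message.result is illustrative, and I have added a newline to the message so the console output ends cleanly:

```r
giveMessage <- function () {
  cat("Hello!\n")
}

# the message is still printed, and the value returned by cat() is stored
message.result <- giveMessage()
is.null(message.result)   # TRUE
```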
What code chunk merits making a function? Let me put it this way: If you repeat code in your program, you should make it a function instead. If you find yourself copying and pasting code, you should make it a function instead. Generally, you should follow the DRY principle: don't repeat yourself. There are simple reasons for this rule:
In my experience, my peers have resisted using functions because they perceive functions to be complicated and error-prone. However, I wonder if this is an issue of perception. Functions are a great way to package complex code, and to "abstract" the use of that code into a simple function call. It makes the code easier to read and fix, which is what you spend most of your time doing when coding. Below, I'll give some examples of code that can be turned into functions, and hope you are convinced of the utility of user-defined functions.
In mathematics, there are several ways to calculate an average. There is the common definition (the arithmetic mean), where you sum all values and divide by the total number of values. This can be easily achieved in R with the mean() function. Alternatively, you can calculate what is called the harmonic mean, which is calculated by taking the reciprocal of each value (i.e., one divided by each value), taking the average of those values, and then taking the reciprocal of that result. In other words, the harmonic mean is calculated as follows:
$$\text{Harmonic Mean} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$
One example where the harmonic mean is useful is when calculating the average of rates, e.g., the average speed of a car over a trip. If a car drives 50 mph for the first mile of the trip, and 2 mph for the second mile of the trip, what was its average speed over the whole trip? The answer is not 26 mph like you might guess. The car spent much less time driving at 50 mph than it did at 2 miles per hour, so the average speed is closer to 2 mph than 50 mph. The harmonic mean is the correct way to calculate this average, by taking the reciprocal of the two values (0.02 and 0.5), averaging those values (0.26), and then taking the reciprocal of that value (3.85 mph). How might we use functions to calculate the harmonic mean? Look at the code below:
getHarmonicMean <- function (vector) {
  reciprocals <- 1 / vector
  mean.of.reciprocals <- mean(reciprocals)
  harmonic.mean <- 1 / mean.of.reciprocals
  return(harmonic.mean)
}
getHarmonicMean(c(50, 2))
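As a sanity check on the car example, the average speed can also be worked out directly as total distance divided by total time. The sketch below does this with illustrative variable names and compares it to the harmonic mean written out in one line. The two agree here because both legs of the trip cover the same distance.

```r
speeds.mph <- c(50, 2)
distances.miles <- c(1, 1)

# time = distance / speed for each leg of the trip
hours.per.leg <- distances.miles / speeds.mph
average.speed <- sum(distances.miles) / sum(hours.per.leg)

# the harmonic mean of the two speeds
harmonic.mean <- 1 / mean(1 / speeds.mph)

round(average.speed, 2)
round(harmonic.mean, 2)
```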
In ecology, one common index used to measure the diversity of a community is the Shannon Index. For a given list of species abundances, the Shannon Index is calculated by first converting the abundances into proportions, and then applying the following formula:
$$\text{Shannon Index} = -\sum p_i \ln p_i$$
Here, $p_i$ is the proportion of the $i$th species in the community. How might we use functions to calculate the Shannon Index? Look at the code below:
getShannonIndex <- function (vector) {
  proportions <- vector / sum(vector)
  inner.product <- proportions * log(proportions)
  shannon.index <- -sum(inner.product)
  return(shannon.index)
}
getShannonIndex(c(50, 100, 20, 1, 3, 8))
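One edge case is worth knowing about: if any species has an abundance of zero, the formula asks R to compute 0 * log(0), which is NaN, and the whole sum becomes NaN. The sketch below shows the problem and one possible workaround, a hypothetical variant named getShannonIndexIgnoringZeros that drops absent species before summing. Treat it as one reasonable convention rather than the only correct one.

```r
abundances <- c(50, 0, 20)
proportions <- abundances / sum(abundances)
naive.index <- -sum(proportions * log(proportions))
naive.index   # NaN, because 0 * log(0) is NaN

getShannonIndexIgnoringZeros <- function (abundances) {
  proportions <- abundances / sum(abundances)
  present <- proportions[proportions > 0]   # keep only species that were observed
  return(-sum(present * log(present)))
}

# four equally abundant species (plus one absent species) give log(4)
getShannonIndexIgnoringZeros(c(10, 10, 0, 10, 10))
```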
In the two above examples, we used functions to package code that was meant to make mathematical calculations. However, functions can also help sidestep repeated tedium in coding.
For example, say that you have to perform the operation described in the previous chapter, where you sort the columns of a data frame from highest to lowest sum, omitting the final column. Given that this goal is hard to conceptualize without context, I will briefly describe a situation where such a need may arise. Say you are carrying out an experiment where you are measuring the abundance of several species across eight monitored sites. You create a data frame for each of the following groups of species: birds, mammals, reptiles, grasses, and trees. You are interested in analyzing the density of species in these groups. However, you notice that for each species group there is a single site where the density is abnormally low. Site 2 has incredibly low reptile density, site 4 has incredibly low tree density, etc. Perhaps the analysis you wish to perform for each species group is incapable of making conclusions when species density is abnormally low. Therefore, for each data frame (i.e., each species group), you want to remove the column that represents the site with abnormally low density. In addition, sorting columns from high to low total density will make the results of your analysis come out in the same order.
Although this may seem like a contrived example, it is just a few steps removed from real-life problems that you will inevitably face in your own data analysis. To achieve this end goal, you might write the following code to perform this action with a function:
sortAndTrimColumns <- function (my.df) {
  column.count <- ncol(my.df)
  sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
  selected.columns <- sorted.column.indices[1:(column.count-1)]
  # drop=FALSE keeps the result a data frame even if only one column remains
  return(my.df[, selected.columns, drop=FALSE])
}
You might want to adjust how many columns are removed from the data frame. This can be done easily here by adding a keyword argument to the function called columns.to.remove:
sortAndTrimColumns <- function (my.df, columns.to.remove=1) {
  column.count <- ncol(my.df)
  sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
  selected.columns <- sorted.column.indices[1:(column.count-columns.to.remove)] # changed this line
  # drop=FALSE keeps the result a data frame even if only one column remains
  return(my.df[, selected.columns, drop=FALSE])
}
This updated function shows that if the user does not specify a value for columns.to.remove, it will be set to 1 by default. This is helpful for things that might need to be changed often but not always. Such a function, after being run in RStudio, can be used like this:
set.seed(100)
my.df <- data.frame(matrix(runif(1000), nrow=10))
my.df <- sortAndTrimColumns(my.df, columns.to.remove=2)
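With 100 random columns it is hard to tell whether the function did the right thing, so here is a sketch on a tiny, made-up data frame where the answer can be checked by eye. The function is copied in so the block runs on its own, and the drop=FALSE in the last line is my addition so that the result stays a data frame even when a single column is left.

```r
sortAndTrimColumns <- function (my.df, columns.to.remove=1) {
  column.count <- ncol(my.df)
  sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
  selected.columns <- sorted.column.indices[1:(column.count-columns.to.remove)]
  return(my.df[, selected.columns, drop=FALSE])
}

# column sums are 2, 10, 6, and 0, so site.2 and site.3 should survive
site.counts <- data.frame(
  site.1=c(1, 1),
  site.2=c(5, 5),
  site.3=c(3, 3),
  site.4=c(0, 0)
)

trimmed.counts <- sortAndTrimColumns(site.counts, columns.to.remove=2)
names(trimmed.counts)
```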
Functions are a great way to make code more readable and reusable. They can be used to package complex code into a simple function call, and can be used to avoid repeating code. In this chapter, I discussed common built-in functions, introduced the general structure of functions, and showed some examples of code that can be turned into functions.
paste and paste0 with Arguments
As explained above, these two functions are used to combine multiple items into a single string. Their usage can be awkward, so I give a few examples here. Most importantly, note the use of the sep and collapse arguments:
paste("I", "saw", 1, "cat")
# returns "I saw 1 cat"
paste("I", "saw", 1, "cat", sep="")
# returns "Isaw1cat"
paste("I", "saw", 1, "cat", sep=".")
# returns "I.saw.1.cat"
paste(c("A", "B", "C"), collapse="_")
# returns "A_B_C" (note the use of the collapse
# argument when dealing with a vector input)
paste0("Hi", "world")
# returns "Hiworld" (same as saying sep=""
# with the paste function)
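The two arguments can also be combined. In the sketch below (the labels are made up for illustration), sep controls what goes between the pieces within each element, and collapse controls what goes between the elements once they are joined into one string:

```r
# without collapse, the result is a vector with one string per number
plot.labels <- paste("plot", 1:3, sep="_")
plot.labels

# with collapse, those strings are joined into a single string
single.label <- paste("plot", 1:3, sep="_", collapse=" and ")
single.label
```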
Given these examples, use paste or paste0 to complete the following tasks:
Create the string "1, 2, 3, 4, 5, 6, 7, 8" using 1:8 in the paste function. Make sure there is a comma followed by a space between each number.
R has a built-in letters vector that contains all the lowercase letters in order. Use this vector with paste or paste0 to create the string "abcdefgh".
Create two variables called observation1 and observation2. Give them the values 5 and 15, respectively. Create the string "The average of 5 and 15 is 10" using these variables and mean(c(observation1, observation2)) in the paste function. Be sure that your result has only a single space between words.
Create two variables called name and height_in_feet. Give them the values "Jacquelin" and 4.5, respectively. Create the string "Jacquelin is relatively short (4.5 feet tall)" using these variables in the paste0 function. Be sure that your result matches the expected result exactly.
Follow the descriptions below to make your own functions. Give them good names!
5 and 3 should return 2.
5, 3, and 7 should return 7.
letter.count that allows you to set the number of letters of the alphabet, and this argument should have a default value of 26. E.g., passing "~" should return "a~b~c~d~e~f~g~h~i~j~k~l~m~n~o~p~q~r~s~t~u~v~w~x~y~z", and "CAT" with letter.count=5 should return "aCATbCATcCATdCATe". Remember to use the built-in letters vector to easily access the letters of the alphabet.
c(1, 2, 3) to the function should return 14 ($1^2 + 2^2 + 3^2$).
lower and upper, and returns a vector of all the integers between lower and upper, inclusive. If lower is not provided, it should default to 1, and if upper is not provided, it should default to 10. E.g., passing lower=5 and upper=8 should return c(5, 6, 7, 8), passing nothing should return c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), and passing only upper=3 should return c(1, 2, 3).
In the previous chapter, I mentioned that data frames are frequently manipulated in R. For any repetitive tasks in coding, functions are the way to go. For example, say you need to add a row to your data frame several times. To do so, you could write the function in the following code:
animal.data <- data.frame(
  animal = c("Dog", "Woodpecker", "Sponge"),
  is.mammal = c(TRUE, FALSE, FALSE),
  weight.kg = c(20, 0.2, 3)
)
addAnimalRow <- function (df, animal, is.mammal, weight.kg) {
  new.row <- list(animal, is.mammal, weight.kg)
  df[nrow(df) + 1,] <- new.row
  return(df)
}
animal.data <- addAnimalRow(animal.data, "Cat", TRUE, 5)
animal.data <- addAnimalRow(animal.data, "Catfish", FALSE, 0.5)
animal.data <- addAnimalRow(animal.data, "Pig", TRUE, 10)
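It is good practice to check that a function like this did what you expected. The sketch below repeats the setup so that it runs on its own, adds a single row, and then inspects the result. It assumes R 4.0 or later, where text columns in a data frame stay as character strings by default.

```r
animal.data <- data.frame(
  animal = c("Dog", "Woodpecker", "Sponge"),
  is.mammal = c(TRUE, FALSE, FALSE),
  weight.kg = c(20, 0.2, 3)
)
addAnimalRow <- function (df, animal, is.mammal, weight.kg) {
  new.row <- list(animal, is.mammal, weight.kg)
  df[nrow(df) + 1,] <- new.row
  return(df)
}

animal.data <- addAnimalRow(animal.data, "Cat", TRUE, 5)

nrow(animal.data)   # one more row than before
str(animal.data)    # column types should be unchanged
```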
Write functions that can complete the following tasks. Test out your functions with examples to show that they are working as intended.
A function called removeLastRow that takes a data frame as the only argument and returns the data frame with the last row removed.
A function called combineDataFrames that takes two data frames that have the same column names and returns a single data frame that has all the rows of the first data frame followed by all the rows of the second data frame.
A function called sortDataFrameBySum that takes a single data frame and returns the same data frame with columns sorted by order of decreasing sum.
A function called makeResultList that takes three arguments: description, value, and is.significant. The last argument should have a default value of FALSE. This function should return a list with three elements, each with the same name as its corresponding argument. This function would be a way to consistently present results from statistical tests.