07| Functions

Miles Robertson, 12.24.23 (edited 01.16.24)

Introduction

So far, I have introduced several functions but have not discussed functions generally. Functions are a way to reuse code quickly, and they can make code much more readable and easier to edit. Some functions are used so often that they come ready to use with R, without any need to bring in other code; these are called built-in functions. Other functions come with packages that you import into your code, like the read_excel function used in the previous chapter. Finally, you can write your own functions, called user-defined functions, which are essential for writing readable and reusable code. Here, I'll discuss the general structure of functions, cover some important built-in functions, and show how to define your own functions.

This chapter will show the inconsistencies of naming conventions in R. Although R is not careful about how it names functions, it is important that you name things clearly and consistently. Doing so makes code easier to read and edit, which helps science progress more rapidly and helps you avoid mistakes.

The Structure of Functions

Functions can have inputs and outputs. The inputs are called arguments, and the outputs are called return values. We'll talk about each of these below. One vital tool to be aware of is the documentation for functions. You can access the documentation for a function by typing a question mark followed by the name of the function. Try it out below.

TODO: look at function documentation

Here, we will show an example of how you would go about finding and reading the documentation for a function, using sample() as an example. I will mention up front that the documentation for many R functions is riddled with jargon and special cases, so it can be difficult to understand. However, it is nonetheless some of the best documentation available for these functions, and is more reliable than some information you might find on the internet.

In a previous chapter, I introduced the sample() function, which is used to randomly pick elements from a vector. We will now look at the official documentation for this function. To see it, type ?sample into the console and press enter. You will see the documentation for the function appear in the bottom right window of RStudio. You will see a part of the documentation that looks like this under the "Usage" heading:

sample(x, size, replace = FALSE, prob = NULL)

When you used this function previously, I only showed you how to use the first three arguments: x, size, and replace. However, you can see that there is a fourth available argument, prob. Scroll down to the "Arguments" section and read the description of how to use this argument. Then continue to the "Details" section and find the paragraph that discusses the prob argument.

You will have read that the prob argument is used to specify the weighted probability of each element of x being chosen. This means that you can use this argument to make some elements of x more likely to be chosen than others. Try the following bit of code to see how this works.

set.seed(100)
sample.of.integers <- sample(50:52, 100, replace=TRUE, prob=c(4, 2, 1))
# the prob vector makes it so 50 is twice as likely to be
# selected as 51, and 51 is twice as likely to be
# selected as 52. Use the code below to see if that is
# reflected in the results.

hist(sample.of.integers)
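
A histogram gives a visual impression, but the counts can also be compared with the expected proportions directly. The following is a minimal sketch, assuming the same seed and weights as above; the names observed.counts, observed.proportions, and expected.proportions are my own and only for illustration.

```r
set.seed(100)
sample.of.integers <- sample(50:52, 100, replace=TRUE, prob=c(4, 2, 1))

# count how many times each integer was drawn, then convert to proportions
observed.counts <- table(sample.of.integers)
observed.proportions <- observed.counts / sum(observed.counts)

# the weights 4, 2, 1 correspond to probabilities 4/7, 2/7, and 1/7
expected.proportions <- c(4, 2, 1) / sum(c(4, 2, 1))

print(observed.proportions)
print(round(expected.proportions, 2))
```

With only 100 draws, the observed proportions should be close to the expected ones, but they will rarely match exactly.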

Arguments

Arguments are the inputs to a function. They are the values that you pass to the function to use. Looking at the example in the box above, you can see that some arguments are required and some are optional. For example, neither replace nor prob needs to be included for sample() to give output. When they are not provided, the documentation tells us that they will be set to the default values of FALSE and NULL, respectively. The size argument is not shown to have a default value, but the "Details" section explains that if size is not provided, it will be set to the length of x. Therefore, the only required argument for the sample() function is x, the vector to be sampled. A bit of terminology: when you give a function an argument, it is called passing the argument to the function, as in passing the salt at the dinner table. So, in the code sample(1:100, 10), we are passing the arguments 1:100 and 10 to the sample() function.
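
If you want to check argument names and defaults without opening the full help page, the base R functions args() and formals() can help. This is only a small sketch; the variable names argument.names, numbers, and shuffled.numbers are my own.

```r
# print the argument list of sample(), including default values
args(sample)

# formals() returns the arguments as a named list
argument.names <- names(formals(sample))
print(argument.names)

# only x is passed here, so size defaults to the length of x,
# replace defaults to FALSE, and prob defaults to NULL
numbers <- c(1, 9, 2, 3)
shuffled.numbers <- sample(numbers)
print(length(shuffled.numbers))
```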

In R, required arguments are usually placed first in the list of arguments, and optional arguments are placed after them. These required arguments (e.g., x) are called positional arguments, because their position in the list of arguments determines their meaning. Optional arguments (e.g., replace) are called keyword arguments, because they are identified by a keyword, or name, that is used to specify their value. Keyword arguments can be given without the keyword as long as they are given in the correct order, but it is generally better to include the keyword to make the code more readable. Additionally, when arguments are specified by keyword, they can be given in any order, which is useful because the names of arguments are generally easier to remember than their order. R also lets you name positional arguments, as in x=numbers. In the following code chunk, lines 3 through 7 execute the same command (although the output will differ between them because the seed is not reset each time), but lines 9 and 10 will fail to execute because unnamed arguments are given in the wrong order:

numbers <- c(1, 9, 2, 3)

sample(numbers, 10, TRUE, c(3, 1, 1, 1))
sample(numbers, 10, replace=TRUE, prob=c(3, 1, 1, 1))
sample(numbers, 10, prob=c(3, 1, 1, 1), replace=TRUE)
sample(x=numbers, size=10, prob=c(3, 1, 1, 1), replace=TRUE)
sample(size=10, x=numbers, prob=c(3, 1, 1, 1), replace=TRUE)

sample(numbers, 10, c(3, 1, 1, 1), TRUE) # DOES NOT WORK
sample(10, numbers, TRUE, c(3, 1, 1, 1)) # DOES NOT WORK
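
To see that the positional and keyword forms really are the same command, you can reset the seed before each call and compare the results. The sketch below assumes an arbitrary seed of 42 and uses tryCatch() to capture the error from one of the failing calls instead of stopping; the variable names are illustrative.

```r
numbers <- c(1, 9, 2, 3)

# same seed before each call, so both calls draw the same random numbers
set.seed(42)
positional.result <- sample(numbers, 10, TRUE, c(3, 1, 1, 1))
set.seed(42)
named.result <- sample(size=10, x=numbers, prob=c(3, 1, 1, 1), replace=TRUE)
print(identical(positional.result, named.result))

# capture the error message instead of stopping the script
failed.call <- tryCatch(
    sample(numbers, 10, c(3, 1, 1, 1), TRUE),
    error=function (e) conditionMessage(e)
)
print(failed.call)
```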

It should be noted that not all functions take arguments. For example, the date() function does not require any arguments and simply returns the current date and time as a character string, e.g., "Sat Dec 23 20:40:35 2023". This example shows that even when a function has no arguments, you must still include the parentheses after the function name to run it. If you omit the parentheses, R will show you the function itself (usually the code that defines it) rather than running it.
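
The difference between calling a function and merely naming it can be checked directly. This is a short sketch; current.date is just an illustrative variable name.

```r
# with parentheses: the function runs and returns a character string
current.date <- date()
print(is.character(current.date))

# without parentheses: you get the function object itself, not its result
print(is.function(date))
```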

Return Value

Functions return values that can be assigned to variables or used directly in other functions. For example, the sample() function returns a vector whose length is given by the size argument, with each element being a randomly selected element of the x argument. As discussed above, the date() function returns a character string containing the current date. Many functions, especially statistical ones, return a list of values that result from the function. For example, the lm() function returns a list of values that describe the linear model that was fit to the data. It is helpful to be aware of what a function returns in order to know how to handle its output.
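
One way to find out what a function returns is to assign its output to a variable and inspect it. The sketch below uses a small made-up data set (toy.data, with hypothetical dose and response columns) purely for illustration.

```r
# sample() returns a vector with as many elements as requested by size
picked.numbers <- sample(1:100, 10)
print(length(picked.numbers))

# made-up data for illustration only
toy.data <- data.frame(
    dose = c(1, 2, 3, 4, 5),
    response = c(2.1, 3.9, 6.2, 7.8, 10.1)
)
fitted.model <- lm(response ~ dose, data=toy.data)

# the return value is a list, so its elements can be accessed by name
print(is.list(fitted.model))
print(names(fitted.model))
print(fitted.model$coefficients)
```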

Important Built-In Functions

In this tutorial, we have already covered many functions that are very important for coding purposes. Here, I will try to make a list of some of the most important built-in functions, which will mention many of the functions that have already been introduced. Although it is tempting to skip this section, pay special attention to the functions that you are unfamiliar with by looking up their documentation and giving them a try.

Printing and character concatenation:
- cat()/print(): print the value of an object to the console. Usually, cat() is better for printing messages to the console, and print() is better for seeing the value of an object.
- paste()/paste0(): combine strings into a single string. paste() separates the strings with a space by default, and paste0() does not separate the strings. These are discussed in more detail in a Practice problem below.
Basic statistical functions:
- max()/min(): return the maximum/minimum value of a vector.
- sum()/prod(): return the sum/product of the elements of a vector.
- mean()/median(): return the mean/median of the elements of a vector.
- sd()/var(): return the standard deviation/variance of the elements of a vector.
Vector manipulation:
- length(): return the length of a vector.
- seq()/rep(): return vectors from patterns, as discussed in a previous chapter.
- sort(): return a vector of the elements of a vector, sorted in ascending order by default.
- rev(): return a vector of the elements of a vector, in reverse order.
Matrix and data frame manipulation:
- colSums()/rowSums(): return a vector of the sums of the columns/rows of a matrix or data frame.
- colMeans()/rowMeans(): return a vector of the means of the columns/rows of a matrix or data frame.
- cbind()/rbind(): combine vectors into a matrix or data frame by column/row.
Random number generation and use:
- set.seed(): set the seed for the random number generator.
- runif()/rnorm(): return a vector of uniformly/normally distributed random numbers.
- sample(): return a vector of randomly selected elements from a vector.
Data import and export:
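- read.csv()/read_excel(): read a csv/excel file into R. Note that read_excel() is not built in; it comes from the readxl package.
- write.csv()/write_xlsx(): write a data frame to a csv/excel file. Note that write_xlsx() is not built in; it comes from the writexl package.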
Basic graphing functions (see this chapter for more details):
- plot(): make a scatter or line plot from two vectors (the x-coordinates and y-coordinates of the points).
- hist(): make a histogram of a vector.
- boxplot(): make a boxplot of a vector.
- lines(): add lines to an existing plot with a series of x-coordinates and y-coordinates of the points of the line.
- abline(): add a line to an existing plot. This is different from lines() because it adds a line with a specified slope and intercept, rather than adding a line that is defined by a set of points.
- par(): set graphical parameters. Most commonly used to fit multiple plots on one screen, such as setting the number of rows and columns of plots, e.g., par(mfrow=c(2, 2)).
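
To get a feel for a few of the functions in this list, here is a short sketch using made-up numbers. The variable names measurements and count.matrix are my own.

```r
measurements <- c(4, 8, 15, 16, 23, 42)

max(measurements)        # 42
sum(measurements)        # 108
mean(measurements)       # 18
median(measurements)     # 15.5
rev(sort(measurements))  # 42 23 16 15 8 4

# matrix() fills column by column, so the columns are (1, 2), (3, 4), (5, 6)
count.matrix <- matrix(1:6, nrow=2)
colSums(count.matrix)    # 3 7 11
rowSums(count.matrix)    # 9 12

paste0("n=", length(measurements))  # "n=6"
```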

Defining Your Own Functions

There are many cases where things are tedious to code and re-code. This is especially true in R, where some notation is cumbersome. You have already used many functions that are built into R, and now you are going to learn how to make your own functions from scratch. User-defined functions are used to package complex or "reusable" code into a single function call.

It is important to know the end goal of how you will use your function before making it. I will share one example that is discussed in much more detail below (see Example 3): say that you want to sort the columns of a data frame from highest to lowest sum, and drop a specified number of columns from the end of the data frame. Imagine that such a function already existed, and could be used like this:

my.df <- sortAndTrimColumns(my.df, 4)
# sort columns of my.df by sum and delete the last 4

Of course, if you run the code above right now on some data frame called my.df, you'll get an error saying could not find function "sortAndTrimColumns". However, imagining this function in use is an important step in defining the function. This step of planning how you will use the function in real life gives context to everything else you will do when defining the function.

Now that we have a clear goal for the use of this function, we can make it. It needs to be able to take in two things (a data frame and a number of columns to remove), then perform the sorting and trimming operation, and finally return the newly processed data frame. With this in mind, we already know that when we define sortAndTrimColumns, we need to specify to the computer that it will take two arguments, and that it will need to return a data frame. If you are understanding the process so far, you have already understood the most important parts of defining a function. The rest is just coding like you always do.

In this chapter, I will show the general format of how to define a function, and then show some examples of code that can be turned into functions.

Function Structure

In a previous chapter, I strongly encouraged you to be consistent about how you name variables and to avoid abbreviations. Here, I will recommend the same for function names, but using a different style. It is very helpful to use a different style for function names than for variable names, so that you can easily tell the difference between the two. For function names, I generally use what is called "camel case", where words are placed end-to-end without periods or underscores, and all first letters are capitalized except the initial one. In addition, functions should be named to sound like verbs, so that it is clear what they do. Here are a few examples of good and bad function names:

# bad function names:
div.hnd   # divides each entry of a vector by 100
last3     # get last three elements of a vector
relabund  # get relative species abundance from counts
del.small # delete small values from a vector
getSmall  # get small values from a vector

# good function names, matching descriptions from above:
divideByHundred
getLastThree
getRelativeAbundance
deleteSmallValues
getSmallValues

To define a function in R, you will follow this general structure:

name <- function (arg1, arg2, arg3=TRUE) {
    # code to run
    # ...
    return(return.value)
}

Functions you make need not have any positional or keyword arguments, nor do they need to return any value, but most functions you write will have both. In the example above, there are three arguments, two positional (arg1 and arg2) and one keyword (arg3). The keyword argument has a default value of TRUE, so if the user does not specify a value for arg3, it will be set to TRUE. The return() function specifies what should be given as the result of the function, and no code after the return() function will be executed.
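
As a concrete instance of this structure, here is a minimal sketch of a hypothetical function, divideValues, with one positional argument and one keyword argument. The function and its argument names are invented for illustration.

```r
# values is positional; divisor is a keyword argument with a default of 100
divideValues <- function (values, divisor=100) {
    quotients <- values / divisor
    return(quotients)
    cat("This line is never reached.\n")  # code after return() is skipped
}

divideValues(c(250, 50))              # 2.5 0.5
divideValues(c(250, 50), divisor=10)  # 25 5
```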

It should be noted that if return() is not used in a function definition, the function will return the value of the last expression it evaluates, which is usually the last line of the function definition. Omitting return() is generally not recommended for functions that are made to return things, as it can make the function definition less clear to readers. For functions that are meant to perform some action and not return a value, it is generally better to omit return().
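
The two styles can be compared side by side. In this sketch, addWithReturn and addWithoutReturn are hypothetical names for two otherwise identical functions.

```r
addWithReturn <- function (first, second) {
    return(first + second)
}

# no return(): the value of the last evaluated expression is returned
addWithoutReturn <- function (first, second) {
    first + second
}

addWithReturn(2, 3)     # 5
addWithoutReturn(2, 3)  # 5
```

Both give the same result, but the first version states its intent more clearly to a reader.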

After the function is defined and run in R, it can be used like any other function. Just remember that edits to the function definition must be followed by re-running the function definition in RStudio in order for the edits to take effect.

As a side note, R (version 4.1.0 and later) has a shorthand for the function definition: you can replace the word function with a single backslash (\). For example, the following code defines the function giveMessage in one line:

giveMessage <- \() {cat("Hello!")}

# identical to:
giveMessage <- function () {cat("Hello!")}

# also identical to:
giveMessage <- function () {
    cat("Hello!")
}

This \ is useful in some cases, e.g., functions you need to fit onto one line and that you want to keep short, but in most cases you should type out the word function for readability. You will note that I omitted return() in this example, and that is because the purpose of this function is not to return a value, but to print a message to the console. Technically, the cat() function returns NULL, and since it is the "last" (and only) line of giveMessage, then giveMessage will return NULL if you were to assign its output to a variable. However, to reiterate, the purpose of this function is to print a message, not to return a value, so return() is omitted.
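
You can confirm the NULL return value yourself. This sketch redefines giveMessage with the full function keyword so that it also runs on older versions of R; message.result is an illustrative variable name.

```r
giveMessage <- function () {
    cat("Hello!")
}

# the message is still printed while the return value is stored
message.result <- giveMessage()
print(is.null(message.result))  # TRUE
```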

Why Use Functions?

What code chunk merits making a function? Let me put it this way: If you repeat code in your program, you should make it a function instead. If you find yourself copying and pasting code, you should make it a function instead. Generally, you should follow the DRY principle: don't repeat yourself. There are simple reasons for this rule:

- Code is generally hard to read, especially R code. If you "wrap" code into a function, you can give it a name that describes what it does, and then you can use that name instead of the block of code that it contains.
- If the same code is written in several places, and you want to change even a small part of that code, you have to find each occurrence of that code to apply that change. If you have put that code into a function, you only need to change the code in one place, and the change is applied everywhere you use the function.
- If you copy and paste code, you are likely to make mistakes between the different copies, which makes debugging confusing and frustrating. Put it in a function instead to avoid this problem altogether!

In my experience, my peers have resisted using functions because they perceive functions to be complicated and error-prone. However, I wonder if this is an issue of perception. Functions are a great way to package complex code, and to "abstract" the use of that code into a simple function call. It makes the code easier to read and fix, which is what you spend most of your time doing when coding. Below, I'll give some examples of code that can be turned into functions, and hope you are convinced of the utility of user-defined functions.

Example Function 1: Harmonic Mean

In mathematics, there are several ways to calculate an average. There is the common definition, where you sum all values and divide by the total number of values. This can be easily achieved in R with the mean() function. Alternatively, you can calculate what is called the harmonic mean, which is calculated by taking the reciprocal of each value (i.e., one divided by each value), taking the average of those reciprocals, and then taking the reciprocal of that result. In other words, the harmonic mean is calculated as follows:

$$\text{Harmonic Mean} = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

One example where the harmonic mean is useful is when calculating the average of rates, e.g., the average speed of a car over a trip. If a car drives 50 mph for the first mile of the trip, and 2 mph for the second mile of the trip, what was its average speed over the whole trip? The answer is not 26 mph like you might guess. The car spent much less time driving at 50 mph than it did at 2 miles per hour, so the average speed is closer to 2 mph than 50 mph. The harmonic mean is the correct way to calculate this average, by taking the reciprocal of the two values (0.02 and 0.5), averaging those values (0.26), and then taking the reciprocal of that value (3.85 mph). How might we use functions to calculate the harmonic mean? Look at the code below:

getHarmonicMean <- function (vector) {
    reciprocals <- 1 / vector
    mean.of.reciprocals <- mean(reciprocals)
    harmonic.mean <- 1 / mean.of.reciprocals
    return(harmonic.mean)
}

getHarmonicMean(c(50, 2))
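
One way to convince yourself that the harmonic mean is the right average here is to compute the average speed from first principles, as total distance divided by total time. The sketch below repeats a compact version of getHarmonicMean so that it runs on its own; the variables total.distance, total.time, and average.speed are my own names.

```r
getHarmonicMean <- function (vector) {
    reciprocals <- 1 / vector
    return(1 / mean(reciprocals))
}

# the trip from the text: one mile at 50 mph, then one mile at 2 mph
total.distance <- 1 + 1              # miles
total.time <- (1 / 50) + (1 / 2)     # hours
average.speed <- total.distance / total.time

print(average.speed)                 # about 3.85
print(getHarmonicMean(c(50, 2)))     # about 3.85
print(mean(c(50, 2)))                # 26, the misleading arithmetic mean
```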

Example Function 2: Shannon Index

In ecology, one common index used to measure the diversity of a community is the Shannon Index. For a given list of species abundances, the Shannon Index is calculated by first converting the abundances into proportions and then applying the following formula:

$$\text{Shannon Index} = -\sum p_i \ln p_i$$

Here, $p_i$ is the proportion of the $i$th species in the community. How might we use functions to calculate the Shannon Index? Look at the code below:

getShannonIndex <- function (vector) {
    # note: assumes every abundance is greater than zero,
    # since 0 * log(0) evaluates to NaN in R
    proportions <- vector / sum(vector)
    inner.product <- proportions * log(proportions)
    shannon.index <- -sum(inner.product)
    return(shannon.index)
}

getShannonIndex(c(50, 100, 20, 1, 3, 8))
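
A useful habit when writing a function like this is to test it on cases where you already know the answer. For the Shannon Index, a community of n equally abundant species should give ln(n), and a community with a single species should give 0. The sketch below repeats a compact version of getShannonIndex so that it runs on its own; even.community is an illustrative name.

```r
getShannonIndex <- function (vector) {
    proportions <- vector / sum(vector)
    return(-sum(proportions * log(proportions)))
}

# four equally abundant species: the index should equal log(4)
even.community <- c(10, 10, 10, 10)
print(getShannonIndex(even.community))
print(log(4))

# a single species: no diversity, so the index should be 0
print(getShannonIndex(c(25)))
```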

Example Function 3: Manipulating Data Frames

In the two above examples, we used functions to package code that was meant to make mathematical calculations. However, functions can also help sidestep repeated tedium in coding.

For example, say that you have to perform the operation described in the previous chapter, where you sort the columns of a data frame from highest to lowest sum, omitting the final column. Given that this goal is hard to conceptualize without context, I will briefly describe a situation where such a need may arise. Say you are carrying out an experiment where you are measuring the abundance of several species across eight monitored sites. You create a data frame for each of the following groups of species: birds, mammals, reptiles, grasses, and trees. You are interested in analyzing the density of species in these groups. However, you notice that for each species group there is a single site where the density is abnormally low. Site 2 has incredibly low reptile density, site 4 has incredibly low tree density, etc. Perhaps the analysis you wish to perform for each species group is incapable of making conclusions when species density is abnormally low. Therefore, for each data frame (i.e., each species group), you want to remove the column that represents the site with abnormally low density. In addition, sorting columns from high to low total density will make the results of your analysis come out in the same order.

Although this may seem like a contrived example, it is just a few steps removed from real-life problems that you will inevitably face in your own data analysis. To achieve this end goal, you might write the following code to perform this action with a function:

sortAndTrimColumns <- function (my.df) {
    column.count <- ncol(my.df)
    sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
    selected.columns <- sorted.column.indices[1:(column.count-1)]
    return(my.df[, selected.columns])
}

You might want to adjust how many columns are removed from the data frame. This can be done easily here by adding a keyword argument to the function called columns.to.remove:

sortAndTrimColumns <- function (my.df, columns.to.remove=1) {
    column.count <- ncol(my.df)
    sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
    selected.columns <- sorted.column.indices[1:(column.count-columns.to.remove)] # changed this line
    return(my.df[, selected.columns])
}

In this updated function, if the user does not specify a value for columns.to.remove, it will be set to 1 by default. Default values are helpful for settings that need to be changed sometimes but not always. Such a function, after being run in RStudio, can be used like this:

set.seed(100)
my.df <- data.frame(matrix(runif(1000), nrow=10))
my.df <- sortAndTrimColumns(my.df, columns.to.remove=2)
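
It is worth checking that the function did what was intended. The sketch below repeats the function definition so that it runs on its own, and stores the result in a separate variable (trimmed.df, a name I chose) so that the data frame can be compared before and after.

```r
sortAndTrimColumns <- function (my.df, columns.to.remove=1) {
    column.count <- ncol(my.df)
    sorted.column.indices <- order(colSums(my.df), decreasing=TRUE)
    selected.columns <- sorted.column.indices[1:(column.count-columns.to.remove)]
    return(my.df[, selected.columns])
}

set.seed(100)
my.df <- data.frame(matrix(runif(1000), nrow=10))  # 10 rows, 100 columns
trimmed.df <- sortAndTrimColumns(my.df, columns.to.remove=2)

print(ncol(my.df))       # 100
print(ncol(trimmed.df))  # 98

# the column sums should never increase from left to right
column.sums <- colSums(trimmed.df)
print(all(diff(column.sums) <= 0))  # TRUE
```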

Conclusion

Functions are a great way to make code more readable and reusable. They can be used to package complex code into a simple function call, and can be used to avoid repeating code. In this chapter, I discussed common built-in functions, introduced the general structure of functions, and showed some examples of code that can be turned into functions.

Practice

Use `paste` and `paste0` with Arguments

As explained above, these two functions are used to combine multiple items into a single string. Their usage can be awkward, so I give a few examples here. Most importantly, note the use of the sep and collapse arguments:

paste("I", "saw", 1, "cat")
# returns "I saw 1 cat"

paste("I", "saw", 1, "cat", sep="") 
# returns "Isaw1cat"

paste("I", "saw", 1, "cat", sep=".")
# returns "I.saw.1.cat"

paste(c("A", "B", "C"), collapse="_")
# returns "A_B_C" (note the use of the collapse 
# argument when dealing with a vector input)

paste0("Hi", "world")
# returns "Hiworld" (same as saying sep="" 
# with the paste function)
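
One more behavior is worth knowing: when paste is given a vector alongside single values, it works element by element and returns a vector, unless collapse is also given. The "Site" labels in this short sketch are made up for illustration.

```r
paste("Site", 1:3)
# returns "Site 1" "Site 2" "Site 3" (a vector of three strings)

paste("Site", 1:3, sep="-", collapse=" & ")
# returns "Site-1 & Site-2 & Site-3" (a single string)
```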

Given these examples, use paste or paste0 to complete the following tasks:

Create the string "1, 2, 3, 4, 5, 6, 7, 8" using 1:8 in the paste function. Make sure there is a comma followed by a space between each number.
R has a built-in letters vector that contains all the lowercase letters in order. Use this vector with paste or paste0 to create the string "abcdefgh".
Make two variables, observation1 and observation2. Give them the values 5 and 15, respectively. Create the string "The average of 5 and 15 is 10" using these variables and mean(c(observation1, observation2)) in the paste function. Be sure that your result has only a single space between words.
Make two variables, name and height_in_feet. Give them the values "Jacquelin" and 4.5, respectively. Create the string "Jacquelin is relatively short (4.5 feet tall)" using these variables in the paste0 function. Be sure that your result matches the expected result exactly.

Write Simple Functions

Follow the descriptions below to make your own functions. Give them good names!

Write a function that takes two numbers as arguments and returns the difference between the two numbers. E.g., passing 5 and 3 should return 2.
Write a function that takes three numbers as arguments and returns the maximum number. E.g., passing 5, 3, and 7 should return 7.
Write a function that takes a string as its first argument and returns the letters of the alphabet interspersed with that string. There should be an optional argument letter.count that allows you to set the number of letters of the alphabet, and this argument should have a default value of 26. E.g., passing "~" should return "a~b~c~d~e~f~g~h~i~j~k~l~m~n~o~p~q~r~s~t~u~v~w~x~y~z", and "CAT" with letter.count=5 should return "aCATbCATcCATdCATe". Remember to use the built-in letters vector to easily access the letters of the alphabet.
Write a function that takes a vector as its only argument and returns the sum of the squares of the elements of the vector. E.g., passing c(1, 2, 3) to the function should return 14 (1² + 2² + 3²).
Write a function that takes two optional arguments, lower and upper, and returns a vector of all the integers between lower and upper, inclusive. If lower is not provided, it should default to 1, and if upper is not provided, it should default to 10. E.g., passing lower=5 and upper=8 should return c(5, 6, 7, 8), passing nothing should return c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), and passing only upper=3 should return c(1, 2, 3).

Write Functions for Manipulating Data Frames

In the previous chapter, I mentioned that data frames are frequently manipulated in R. For any repetitive tasks in coding, functions are the way to go. For example, say you need to add a row to your data frame several times. To do so, you could write the function in the following code:

animal.data <- data.frame(
    animal = c("Dog", "Woodpecker", "Sponge"),
    is.mammal = c(TRUE, FALSE, FALSE),
    weight.kg = c(20, 0.2, 3)
)

addAnimalRow <- function (df, animal, is.mammal, weight.kg) {
    new.row <- list(animal, is.mammal, weight.kg)
    df[nrow(df) + 1,] <- new.row
    return(df)
}

animal.data <- addAnimalRow(animal.data, "Cat", TRUE, 5)
animal.data <- addAnimalRow(animal.data, "Catfish", FALSE, 0.5)
animal.data <- addAnimalRow(animal.data, "Pig", TRUE, 10)

Write functions that can complete the following tasks. Test out your functions with examples to show that they are working as intended.

Write a function called removeLastRow that takes a data frame as the only argument and returns the data frame with the last row removed.
Write a function called combineDataFrames that takes two data frames that have the same column names and returns a single data frame that has all the rows of the first data frame followed by all the rows of the second data frame.
Write a function called sortDataFrameBySum that takes a single data frame and returns the same data frame with columns sorted by order of decreasing sum.
Write a function called makeResultList that takes three arguments: description, value, and is.significant. The last argument should have a default value of FALSE. This function should return a list with three elements, each with the same name as their corresponding arguments. This function would be a way to consistently present results from statistical tests.