Graphing is a terrific way to visualize data. It can help demonstrate relationships between variables that are not possible to identify in a list of data. In this chapter, we will learn how to create graphs in R using the base R graphics package. It is my understanding that many people prefer to use the ggplot2 package, and there is room in this wide world for multiple opinions. However, knowing how to use the base R graphics is important given its continued prominence amongst data scientists.
Let's start with a simple example. We will create a scatter plot of
the relationship between the number of hours studied and the grade
received on a test. We will use the plot()
function to
create the plot. The first argument is the x-axis variable, and the
second argument is the y-axis variable. Both are provided as vectors
of data. We will also add a title to
the plot using the main
argument.
set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- rnorm(100, 60, 2)
plot(hours.studied, grade, main="Hours Studied vs. Grade")
Since both variables in this example were created randomly with
no relationship between them, we observe that there is no clear
relationship between the two variables. If there is any relationship,
it is by chance alone. We can quantify the linear relationship between
the two variables using the lm()
statistical function:
linear.fit <- lm(grade ~ hours.studied)
Here, we are fitting a linear model to the data. The ~
symbol is used to indicate the dependent variable on the left and
the independent variable on the right. To see the results of this
model, we can use the summary()
function:
summary(linear.fit)
This shows that the numbers of hours studied has virtually no predictive
ability in explaining the grade received on the test. I will quickly note that
passing linear.fit
to the plot
function will
give a series of plots related to statistical analysis, which are revealed
one at a time as the message "Hit <Return> to see next plot:"
is
printed to the console. These plots may be of interest to you, so I mention them here.
However, I will focus this chapter on plots that you have more contol over.
Technically, instead of plot(hours.studied, grade)
, you can alternatively
use plot(grade ~ hours.studied)
. This is because the ~
symbol
is used to indicate the dependent variable (vertical axis) on the left and the
independent variable (horizontal axis) on
the right. This is the reverse order of the two-argument case, which is confusing.
In this case, grade ~ hours.studied
is a single argument, which
is a formula. This can be verified using the class()
function:
class(grade ~ hours.studied)
This formula object can be used for both lm()
and plot()
.
However, the two-argument approach, i.e., plot(hours.studied, grade)
,
is easier to read for the plot()
function,
and it is my opinion that you should reserve the use of formulas for statistical functions.
Now that we have performed statistical analysis on the data, we can add the
regression line to the plot. We can do this using the abline()
function. This function can be used a few different ways, the most basic way being that you can
provide the slope and the intercept of the line you want to add to the plot (e.g.,
abline(a = 5, b = 3)
will add a line with a slope of 3 and an intercept of 5).
No matter the way it is used, this function is always used
to put a straight line on an already-existing plot. Here, we can pass in the
linear model object, linear.fit
, to add the regression line, since
linear.fit
contains information about the slope and intercept of the line:
abline(linear.fit)
You will see that the regression line is almost completely horizontal. This is unsurprising, since variables with virtually no correlation will mean that the best guess of the value of the dependent variable is its mean in the absence of any other information.
We can adjust this generated data to actually include a relationship between
the two variables. We will do this by adding a small amount of noise to the
dependent variable, grade
, that is correlated with the independent
variable, hours.studied
. See the code below for how this is done:
set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- 3 * hours.studied + rnorm(100, 50, 10)
plot(hours.studied, grade, main="Hours Studied vs. Grade")
Now, generally, the trend is that the average number of hours studied is 10, and the
grade that people receive is roughly 3 times the number of hours studied plus 50.
Plotting this shows a clear linear relationship between the two variables. We can add
the regression line to the plot using the abline()
function, as before:
linear.fit <- lm(grade ~ hours.studied)
abline(linear.fit)
Adding a confidence interval to the regression line is also possible, but in true R style,
it is messy. We can do this using the predict()
function. This function takes the
linear model object as its first argument, and the interval="confidence"
argument to indicate that we want a confidence interval. The level
argument is used to indicate the confidence level, which is 95% by default,
so we will not change it. Given that we need to draw new lines to this plot, we have to
specify the horizontal axis values for which the confidence intervals will be plotted,
which we will do by creating a vector of values from the minimum to the maximum of the
hours.studied
variable, with a total of 100 values. We will store this vector in
the horizontal.axis.values
variable. Annoyingly, the predict()
function requires that the horizontal axis values be given in a data frame, so we will
use the data.frame()
function to create a data frame with the horizontal
axis values as the hours.studied
variable. Note that the name of the column
in the data frame must match the name of the original horizontal axis data vector.
horizontal.axis.values <- data.frame(
hours.studied = seq(
min(hours.studied),
max(hours.studied),
length.out = 100
)
)
confidence.bounds <- predict(
linear.fit,
interval = "confidence",
newdata = horizontal.axis.values
)
This makes confidence.bounds
a matrix with three columns.
The first column is the predicted
value of the dependent variable, the second column is the lower bound of the
confidence interval, and the third column is the upper bound of the confidence
interval. We can use this information to add the confidence interval to the
plot. We will use the lines()
function to add the confidence
interval to the plot. The lines()
function is similar to the
abline()
function, but its output need not be a straight line.
Its first argument is the horizontal axis variable, and the second
argument is the vertical axis variable, much like the plot()
function.
Since we want to plot both the lower and upper bounds of the confidence interval,
we can use a for
loop to reduce code repetition.
Here, we will use the horizontal.axis.values$hours.studied
vector for the first argument, and either the second or third column of the
confidence.bounds
matrix for the second argument. We will use
the lty = "dashed"
argument to indicate that we want a dashed line, and
the col = "red"
argument to indicate the color we want for the lines.
for (i in 2:3) {
lines(
horizontal.axis.values$hours.studied,
confidence.bounds[, i],
lty = "dashed",
col = "red"
)
}
Above, we used the seq()
function to create a vector of values
from the minimum to the maximum of the hours.studied
variable.
Here, we will use a similar concept to plot mathematical functions. We will
plot the function f(x) = sin(x)
from
x = -10
to x = 10
.
x.values <- seq(-10, 10, length.out = 100)
y.values <- sin(x.values)
Now, we can plot the function using the plot()
function:
plot(x.values, y.values)
As you can see, by default, the plot()
function plots the
points as circles. However, we can change this to a straight line using
the type
argument:
plot(x.values, y.values, type = "l") # l stands for line
Doing this gives us a much better idea of the shape of the function. You can see near the
peaks and valleys of the function that there are slightly jagged edges. This is because
the plot()
function is just drawing straight lines between the 100 points
we provided. We can increase the number of points to make the function appear smoother.
On the other hand, we can decrease the number of points to make the function appear
more jagged. We can change the length.out
argument to adjust the number of
points.
The function below "wraps" the code we used above to plot the function f(x) = sin(x)
.
Its only argument is the number of points to use. The function returns nothing, and instead
just plots the function. Try calling the function with different numbers of points to see
how the function changes.
plotSin <- function (point.count) {
x.values <- seq(-10, 10, length.out = point.count)
y.values <- sin(x.values)
plot(x.values, y.values, type = "l")
}
We can also plot multiple functions on the same plot. However, as you have seen above,
every time you call the plot()
function, it creates a new plot. To plot
additional functions on an existing plot, we can use the lines()
function
as shown above. We will plot the functions f(x) = sin(x)
and
f(x) = 0.25 * x^2 - 1
on the same plot to demonstrate this:
x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(x.values, sin.values, type = "l")
lines(x.values, square.values, col = "red")
As seen above, it is somewhat awkward to have to switch the function used to plot data from the initial
plot()
function to the lines()
function. However, there is an
important reason for this: the plot()
function is used to create the axes
and the plot area, while the lines()
function is used to add lines to the
axes. You will notice that even though our second function, f(x) = 0.25 * x^2 - 1
,
has a range of y = [-10, 10]
, the y-axis of the plot is still y = [-1, 1]
.
This is because the plot()
function automatically adjusts the axes to fit the
initial data. This is helpful for most cases, but sometimes, you might want more control
over the bounds of the axes. For example, the code below sets the x-axis to be
x = [-7, 7]
and the
y-axis to be y = [-10, 10]
and plots the same two functions as above:
x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(
x.values,
sin.values,
type = "l",
xlim = c(-7, 7),
ylim = c(-10, 10)
)
lines(x.values, square.values, col = "red")
You will note that by default, R borrows what you actually typed for the arguments of the x and y data for their respective axis titles. This feature is a grave sin in the eyes of the rest of the coding world. All other major programming languages simply evaluate what is typed, and does not keep track of how it was typed in, which improves memory management and program speed. However, R always finds a way to be special, so the code typed for arguments are used to label the axes by default. This is usually not a critical issue if you use a good variable name for the argument, but is disastrous if you put something more complicated in as the arguments of the function:
plot(c(1, 2, 3, 5, 4), tail(1:6 * 3, n=5))
Here, the x-axis is labeled c(1, 2, 3, 5, 4)
, and the y-axis is labeled
tail(1:6 * 3, n=5)
, which hurts me in my bones. We can
specify the labels of the axes using the xlab
and ylab
arguments:
plot(
c(1, 2, 3, 5, 4),
tail(1:6 * 3, n=5),
xlab = "X Values",
ylab = "Y Values"
)
As shown above, the main
argument is used to specify the title of the plot,
which is always a good idea for anything other than a quick plot. There are many other things
you can adjust about the plot, which you can learn about by typing ?plot
into the
console, and then clicking the "Generic X-Y Plotting" link that appears.
Personally, I love making my code tidy and simple. So, when I have to plot several functions
onto one plot, I like to use the plot()
function only to create the axes, axis titles,
and main title, and then use the lines()
function to plot all the functions.
For example, the code below plots the first few Taylor expansions
of f(x) = sin(x)
programmatically by using a for
loop.
I will first provide the function that makes the curves, which you can copy,
followed by the code to make the plots, which you should type out.
No need to understand what is happening in
the sinExpansion
function, just focus on how it is used.
sinExpansion <- function (curve.number, x.values) {
total.expansion <- 0
for (i in 1:curve.number) {
term.num <- 2 * i - 1
coefficient <- (-1) ^ (i + 1) / factorial(term.num)
total.expansion <- total.expansion +
coefficient * x.values ^ term.num
}
return (total.expansion)
}
x.values <- seq(-10, 10, length.out = 100)
number.of.curves <- 8
curve.colors <- hcl.colors(number.of.curves, "BluGrn")
plot(
NULL, # indicates that we do not want to plot any data
xlim = c(-15, 10), # required if above is NULL
ylim = c(-2, 2), # required if above is NULL
xlab = "X Values",
ylab = "Y Values",
main = "Taylor Expansions of sin(x)"
)
lines(x.values, sin(x.values), lty = "dashed", lwd = 2)
for (i in 1:number.of.curves) {
lines(
x.values,
sinExpansion(i, x.values),
col = curve.colors[i]
)
}
legend(
"bottomleft",
legend = c("sin(x)", paste("Curve", 1:number.of.curves)),
col = c("black", curve.colors),
lty = c("dashed", rep("solid", number.of.curves)),
lwd = c(2, rep(1, number.of.curves)),
cex = 0.7
)
There are a few important things that I used here that I want to discuss:
hcl.colors()
function is used to generate a vector of colors
that are evenly spaced from each other. The first argument is the number
of colors to generate, and the second argument is the color palette. I used
the "BluGrn"
color, which is a blue-green color scale. To find other
palettes, you can type hcl.pals()
, which returns all the possible
color palette names. There are many other ways to control color in R, so do
more research if you are unhappy with the color options.
plot()
function call
doesn't need to have any data plotted (i.e., NULL
can be given instead of data).
However, if you do give NULL
to the plot
function, it
does need to have the
xlim
and ylim
arguments specified. Otherwise, R
will not know what the bounds of the plot should be. It is not required
to do plots this way, but I like the layout when I need to plot several lines
on one graph.
lty
argument is used to specify the line type. The default
is a "solid"
line, but you can also use "dashed"
,
or several other options. You can also use a number from 0 to 6 to specify a
different type of line, but I can never remember what each number means,
so I prefer to use the string options, and find them more readable.
lwd
argument is used to specify the line width. The default
is 1
, but you can use any number greater than 0. I used 2
for the dashed line to make it more visible.
legend()
function is used to add a legend to the plot. Unlike other
languages, you must re-specify the colors and line types for each line. Since
I had one line that was dashed, thickened, and black, I had to specify the
col
, tly
, and lwd
values for that line, and
then specify the values for the other lines using rep()
. If they were
all the same, no such repetition would be necessary. E.g., if I removed the
sin(x)
line, I could have alternatively just used col = curve.colors
,
lty = "solid"
, and lwd = 1
for the legend()
function.
legend()
function, the cex
argument (i.e., the character
expansion factor) is used to specify the size of the text in the legend.
The default is 1
, but I wanted the text in the legend to be smaller
so everything would fit.
In this chapter, I exposed you to many different features of the base plot()
function in R. There are many other features that I did not cover, but you can find how
to change your plot to your needs with a little research. Of course, ggplot2
is popular for many R users, and you should feel free to use it if you feel more comfortable
with it. However, it is important to know how to use the base R graphics package, since
it is still widely used.
Use plot()
to show the following functions within the given x-axis range.
For each, rerun the code to have 5, 50, 500 and 5000 steps used.
To do this, use a for loop with iteration variable number.steps
and the vector c(5, 50, 500, 5000)
. At the end of each iteration, use the line
readline("Press Enter to see next plot: ")
. As implied, this will stop execution
of the script until you press enter in the console. This will allow you to see the plot before the next one is
generated. As an added challenge, make the title of the plot include how many steps were used.
Complete the following prompts using the plot()
and lines()
function.
for
loop. Add a legend with the legend
function that
shows which curve color represents which $b$ value. Make sure that the y-axis goes from $-6$ to $6$.for
loop and the text()
function to add text that says the x-value of each peak just above
itself. Inside the for
loop, since the peaks always occur at multiples of $2\pi$ and at
$1$ unit high, such text can be added as follows: text(2*pi, 1, "2pi")
.
Include it for all 6 peaks visible on the plot. Note: the text overlaps the line slightly
since the text()
function directly centers the text at the given coordinates on the plot.
To avoid this issue, you may want to increase the y-axis bounds to be from $-1$ to $1.5$, and use
1.1
for the y-coordinate of the text.
There are several plotting functions that are worth knowing about in addition to the
plot()
function. Each of the following prompts will introduce one of these
functions and give you a prompt where you can practice using it.
hist()
: This function plots a histogram, which is good for showing the distribution of
one-dimensional data. Start by plotting a histogram of 100 random normal numbers like this:
random.numbers <- rnorm(100, 5, 2)
hist(random.numbers)
By default, this function just guesses how many bars to break the data into. However, you can
specify the number of bars using the breaks
argument. Try using breaks = 10
and breaks = 20
to see how the plot changes. You can alternatively specify the
start and end points of each bar by giving a vector, like this: breaks = c(-6, -3, 0, 3, 6, 9, 12)
.
With this information, use a for
loop to plot a histogram for three sets:
10, 100, and 1000 random normal numbers.
Use breaks from -10 to 15 with a step size of 1. As before, to avoid progression to subsequent graphs before
you've seen the current one, use readline("Press Enter to see next plot: ")
at the end of each
iteration of the for
loop.boxplot()
: This function plots a box-and-whisker plot,
which is good for showing the distribution of
one-dimensional data, but in a different way than histograms.
Start by plotting a boxplot of 100 random normal numbers like this:
random.numbers <- rnorm(100, 5, 2)
boxplot(random.numbers, )
This shows how the data is distributed in terms of the median, quartiles, and outliers.
To plot multiple box-and-whisker plots on the same plot, you can give a data frame or list
of vectors to the boxplot()
function. Use this function to plot the following
two sets of data, with one plot for each data set:
data.set1 <- list(
item1 = rnorm(10, 5, 2),
item2 = rnorm(100, 5, 2),
item3 = rnorm(1000, 5, 2)
)
data.set2 <- data.frame(
col1 = rnorm(100, 5, 2),
col2 = rnorm(100, 10, 4),
col3 = rnorm(100, 15, 8)
)
barplot()
: This function plots a bar for each value of a categorical variable in a data set.
The type of data used for this plot type is different from that of the previous two: it must use two-dimensional
data where one of those dimensions is a categorical variable. Here is an example of the function being used:
categorical.variable <- c("Green", "Brown", "Blue")
counts <- c(15, 30, 20)
barplot(
counts,
names.arg = categorical.variable,
xlab = "Eye Color",
ylab = "Frequency"
)
Use this function to plot the following data set, where the color
column
is the categorical variable (i.e., used for names.arg
) and the count
column is the frequency:
shirt.color.data <- data.frame(
color = c("red", "green", "blue", "black"),
count = c(11, 21, 15, 34)
)
In many cases, there may be multiple measurements that you wish to report for the same categorical variable.
For example, you may wish to report the height of three individual trees across four years. The following
code shows such a data set:
tree.heights <- data.frame(
tree1 = c(3.9, 4.0, 4.2, 4.5),
tree2 = c(4.1, 4.1, 4.0, 4.1),
tree3 = c(3.8, 4.0, 4.5, 4.9)
)
barplot(as.matrix(tree.heights), beside=TRUE)
Note that the data in this case must be a matrix, thus requiring the as.matrix()
function,
and that the beside=TRUE
argument is used to indicate that the bars should be placed
beside each other (as opposed to stacked on top of each other). Using this code snippet as a guide,
make a barplot of a data set that tracks the number of people who like four different flavors of ice cream
across three years. Use round(runif(3, 10, 1000))
to generate the data for each flavor you include.
As an added challenge, add a legend that shows which color of bar represents which year.