Humans are usually pretty bad at generating random numbers by hand. In one classroom, students were asked to write down a sequence of 100 coin flips, but only half of them were instructed to actually flip a coin to collect their data. The other half were told to just make up the results. The teacher then looked at the results and was able to tell instantly which students had actually flipped the coin and which students had faked their results. What was the most common give-away? The students who had faked their results generally avoided long streaks of heads or tails. One student said, "I was thinking if we [made streaks too long], maybe she would recognize that we were actually doing that on purpose."
In fact, the greater the number of coinflips, the longer you would expect the longest streak of heads or tails to be. This is because random number sequences, by definition, should have no patterns. That means that each sequence of heads and tails are equally likely to occur, and a majority of those possible sequences have long streaks in them. Our pattern-seeking monkey brains simply have a poor sense of what randomness looks like.
So, in 100 coin flips, what do you expect the longest streak of heads or tails to be? This can be answered with probability theory, but we will answer this question in this section by writing a program that simulates a 100-flip coin toss thousands of times and identifying the average length of the longest streak.
You may have been exposed to the idea of computers generating random numbers. However, in combination with the fact that computers only ever do exactly what we tell them to do, you may be wondering how our bias against randomness can be avoided in computations. In reality, our computers do not generate random numbers. Instead, they generate pseudo-random numbers. As the name implies, these numbers are not truly random, but they are so unpredictable that they are effectively random for most applications.
Although there are a variety of ways that pseudo-random numbers can be generated, many methods rely on modular arithmetic. This name might not sound familiar, but you are familiar with the concept: adding an hour to a 12-hour clock that shows 1:00 will give 2:00, and an additional hour gives 3:00, and so forth, up until 12:00 where the clock resets to 1:00. Using the terminology of modular arithmetic, we would say that the clock is modulo 12. In other words, the clock "rolls over" and restarts every 12 hours.
How does modular arithmetic help us generate random numbers? Let's return to the clock example: imagine that we are trying to generate three random numbers between 1 and 12. We start by positioning the clock's hour hand arbitrarily to some hour, which we call the seed. Say the seed is 2:00. Then, to generate three random numbers, we follow this rule: to get the next random number, we take the current hour of the clock, raise that number to the 23rd power, and then add an additional 515 hours. So, the first random number starts with the seed to follow this rule: 223 + 515 = 8389123. Then, we take the result, 8389123, and move the hour hand forward that many hours. Do you have any guess as to where the hour hand will end up? Probably not, which is exactly the point of this operation. In this case, the hour hand will end up at 9:00. Then, we repeat the process to get the second random number: 923 + 515 = 8862938119652501096444. That number is huge, of course, but computers can easily figure out where the hour hand will end up in this case: 5:00. Finally, we repeat the process one more time to get the third random number: 523 + 515 = 11920928955078640. Moving forward this many hours, we end up at 9:00. So, our three pseudo-random numbers are 9, 5, and 9.
In reality, the process described above is actually far too simple to produce sufficiently random numbers. However, if you take this concept and make it more complicated, you can generate numbers that are plenty random for most purposes. This idea is important to understand the concept of a seed: the seed is the starting point for the pseudo-random number generator. If you start with the same seed, you will always get the same sequence of pseudo-random numbers that follows. This is pivotal to make reproducible results in simulations.
In this chapter, we'll learn about using random numbers in R, the various distributions it offers, and the general structure of simulations. I will prematurely introduce some features of R that will be discussed in more detail later, so don't be worried if some of the code doesn't make sense yet.
In the introduction, I posed the question: in 100 coin flips, what do you expect the longest streak of heads or tails to be? One way to answer this question is to write a program that simulates a 100-flip coin toss thousands of times, and calculate the average length of the longest streak. To do this, I'll introduce a few functions:
set.seed()
: As described in the introduction, setting the seed
allows us to generate the same sequence of pseudo-random numbers every time
we run the program. You may be wondering how this is useful, since it seems to
stop being random this way, but it simply guarantees that we'll receive the
same random numbers every time we run the program. If you don't use this function,
R uses the current time as the seed, which means that every time you run the program,
you'll get a different result, which is undesirable for simulations where we want
to report our results. You can give this function any number you want, and it
will use that number as the seed. Let's use 123
as the seed:
set.seed(123)
sample()
: This function takes a vector and randomly picks elements
from it a specified number of times. This can be used to simulate
100 coin flips quickly: we can give it the vector c("H", "T")
.
By default, it assumes that once an element is picked, it should not be
picked again (i.e., without replacement). Of course, we want to allow
for more than one heads or tails in our 100 flips, so we need to specify
that we want to sample with replacement. Pseudo-random numbers are used by this
function to decide what element is picked each time. Altogether, we can write
the following to generate 100 coin flips:
sample(c("H", "T"), 100, replace = TRUE)
rle()
: This function is short for "run length encoding". It is
quite possible that you will likely never use this function outside of this
problem, so I'll keep the explanation simple: it takes a vector and returns
a list of the lengths of each run (i.e., streak) of consecutive elements.
For example,
if we give it the vector c("H", "H", "H", "T", "H", "H")
,
it will return the vector c(3, 1, 2)
. The first run is 3 heads,
the second run is 1 tail, and the third run is 2 heads. If we have our
100 coin flips stored in a variable called flips
, we can
use the following to find the lengths of each run:
rle(flips)$lengths
We'll discuss the $
seen here in a later chapter.
max()
: This function simply takes a vector and returns the largest
element in it. If the lengths of each run found by rle()
are stored
in a variable called run.lengths
, we can use the following to find
the length of the longest run:
max(run.lengths)
Create a new R file. Use the functions described above to write a program that simulates 100 coin flips and finds the longest streak. Run it a few times to see that the results are the same because of the seed that was set. Change the seed a few times to see different results.
Hopefully, you were able to piece the functions together to get the program working. I very much recommend that in general you try to write code yourself before looking at an answer, even if you're unsuccessful in your attempt. People only get better at things by practicing, and programming is no exception.
Here is the code I wrote to solve this problem:
set.seed(123)
number.of.flips <- 100
flips <- sample(c("H", "T"), number.of.flips, replace = TRUE)
run.lengths <- rle(flips)$lengths
max(run.lengths)
The output of the last line of this program is 4. So, for this seed, it seems the intuition of the students in the introduction was correct: the longest streak of heads or tails in this set of 100 coin flips is 4. However, this is only one result. If we want to be more confident in our answer, we should run the program many, many times and take the average of the longest streaks. This concept is the foundation of stochastic simulations.
To run this bit of code many times, I need to use a few features of R that we have not yet discussed. These features will be explained in much more detail in later chapters, but for now, just try to connect your understanding of the structure of the simulation (i.e., running one block of code many times) to what you see, and get a general sense for what is going on.
set.seed(123)
number.of.flips <- 100
number.of.trials <- 10000
longest.streaks <- c() # an empty vector we'll add results to
for (i in 1:number.of.trials) {
flips <- sample(c("H", "T"), number.of.flips, replace=TRUE)
run.lengths <- rle(flips)$lengths
longest.streaks <- c(longest.streaks, max(run.lengths))
}
# plot a histogram of the results in
# bottom right panel of RStudio
hist(longest.streaks)
# Find the average and quartiles
# of the longest streak
summary(longest.streaks)
# Find the proportion of trials
# where the longest streak was 4 or less
sum(longest.streaks <= 4) / number.of.trials
To briefly explain line 22, I used the logical operator <=
to find if each entry was less than or equal to 4
. This returns
a vector of the same length with either TRUE
or FALSE
for each entry. The sum()
function counts the
number of entries that are TRUE
(this is because TRUE
becomes 1
if treated like a number, while FALSE
becomes
0
).
Dividing this number by the total number of trials gives the proportion of trials
where the longest streak was 4 or less.
Type the above code into your R document and run the code. Answer the following questions:
In probability theory, a random variable is something that can take on one of several values, each with their own probability of occurring. For example, if we flip a coin, the random variable is the result of the coin flip, which can be either heads or tails (perhaps encoded as 1 and 0, respectively). The probability of getting either is 0.5. This is an example of a discrete random variable, which means that it can only take on a finite number of values. Specifically, this is a binary (or Bernoulli) random variable, since it can only take on two values.
In contrast, a continuous random variable can take on any value within a range. For example, the amount of time it takes for a person to turn off or snooze their alarm clock in the morning is a continuous random variable, since it can take on any positive value, at least in theory. In reality, I imagine that the vast majority of people turn off their morning alarm between 1 and 60 seconds of it going off. This means that there is a very high probability of the random variable taking on a value between 1 and 60 seconds, and a very low probability of it taking on a value outside that range. Assuming our measurements are in whole seconds, this is likely well represented by a Poisson random variable, with the bulk of people responding to the alarm around 10 seconds or so, and an increasingly small number of people responding for longer wait times.
In the two above examples, the terms "Bernoulli" and "Poisson" are the probability distributions of those random variables. This term refers to the probability of each possible value of the random variable, as described for these specific cases above. Probability distributions are specified with parameters, which are values that determine the exact shape of the distribution. For example, all Bernoulli random variables can only take on the values 0 and 1, but a coin being heads does not have the same probability as your car tire popping on the way to work, which are both Bernoulli random variables. The single parameter that Bernoulli random variables have is the probability of a 1. In the case of coin flips showing heads, this parameter is 0.5, but in the case of your car tire popping, this parameter is much smaller, perhaps in the neighborhood of 0.001.
Perhaps the most common continuous probability distribution used in science is the Normal distribution, which is often called the "bell curve". This distribution has two parameters, the mean (the center of the distribution) and the standard deviation (how spread out the distribution is). Although you're almost certainly familiar with this concept, fewer are familiar with how to use it in R.
The rnorm()
function can be used to generate random numbers from
a normal distribution. The first argument is the number of random numbers to
generate, the second is the mean, and the third is the standard
deviation. For example, the following code generates 10 random numbers from a
normal distribution with mean 2 and standard deviation 1.5:
rnorm(n = 10, mean = 2, sd = 1.5) # or, equivalently,
rnorm(10, 2, 1.5)
This function has several other sister functions (dnorm()
,
pnorm()
, and qnorm()
) that give other ways to
make calculations with the normal distribution. For example, the code below
uses the dnorm()
function to plot the density of the normal distribution
with mean 5 and standard deviation 1.2, and the code below it uses the
pnorm()
function to find the probability of getting a value
less than 6.5 from a normal distribution with mean 6 and standard deviation 2.
However, when it comes to generating random numbers, only the rnorm()
function is needed.
x.values <- 1:100/10
y.values <- dnorm(x.values, 5, 1.2)
plot(x.values, y.values, type="l")
pnorm(6.5, 6, 2)
Generating random numbers from a normal distribution are useful for many simulations, including to simulate noise in data.
The runif()
function (short for "random uniform", not "run if") can be used to generate random numbers from
a Uniform distribution. This distribution is the simplest continuous
distribution, since each value between its two parameters are equally likely.
For this function, the first argument is the number of random numbers to generate,
and the second and third are the minimum and maximum values, respectively. For example, the
following code generates 10 random numbers from a uniform distribution between
4 and 8:
runif(n = 10, min = 4, max = 8) # or, equivalently,
runif(10, 4, 8)
Like the normal distribution, the uniform distribution has several sister functions
(dunif()
, punif()
, and qunif()
) that have similar
purposes to their normal counterparts. Uniform distributions are useful when
the occurrence of some event needs to be simulated. For example, if there is a 60%
chance a predator successfully captures a prey, you can simulate such an event by
generating a random number from a uniform distribution between 0 and 1, and if the
number is less than 0.6, the predator successfully "captures" the prey.
There are many other probability distributions that are useful for simulations,
with functions for each built into R (e.g., rexp()
, rbinom()
,
rpois()
, etc.). When to use each is a system-specific question, but
there are bounteous resources online to help you learn about these probability distributions
and how they apply to scientific systems and simulations.
Remember that for all of these functions, the set.seed()
function
can be used to make the results reproducible.
In this chapter, we learned about how to use random numbers in R, and how to use them to simulate a system. We also learned about probability distributions, and introduced ideas of how they can be used. In future chapters, we will use these random number functions to see what else is possible with simulations in R.
In the coin-flipping simulation, I set the seed on the first line of the program, instead of setting it right above where the samples were drawn, between lines 6 and 7. What is the benefit of this? What would happen if I set the seed right above where the samples were drawn?
One difficulty with simulations is that it is often unclear how many trials are needed to get a good estimate of the answer. I'll include a modified version of the coin-flipping simulation here:
set.seed(123)
number.of.flips <- 100
number.of.trials <- 10000
longest.streaks <- c()
for (i in 1:number.of.trials) {
flips <- sample(c("H", "T"), number.of.flips, replace=TRUE)
run.lengths <- rle(flips)$lengths
longest.streaks <- c(longest.streaks, max(run.lengths))
}
mean(longest.streaks)
The code now only reports the average length of the longest streak across trials. Because of the increased number of trials, the code may take more time to run (about 10-30 seconds instead of an immediate response). In RStudio, you will see a small stop sign in the top right corner of the console while the code is running. Once the code is finished, the stop sign will disappear, and the console will show the simulation average.
On a piece of paper, make a table with two columns: the number of
trials and the resulting simulation average. Write down the results
of the simulation you just ran, recording up to 5 decimal places.
Then, run the code several
more times, decreasing the value of number.of.trials
each
time, and record their results.
You might try decreasing by a thousand trials each time, and then including
500, 100 and 10 trials on the tail end.
After completing your table, create two vectors in R to store your two
columns, making sure that the two vectors are typed out in the same order
as in your table.
Use the plot
function as seen below to show how
the simulation average changes with the number of trials (the
type="l"
argument makes it a line plot instead of a scatter plot).
The abline
function can also be used to draw a horizontal line
to show the average for the simulation with the largest number of trials.
simulation.size <- c(10:1*1000, 500, 100, 10)
simulation.averages <- c() # enter data from your table here
plot(simulation.size, simulation.averages, type="l")
abline(simulation.averages[1], 0, col="blue")
What do you notice about how the simulation average changes with the number of trials? What do you think is the minimum number of trials needed to get a good estimate of the answer? How do you expect this minimum number would change if we increased the number of coin flips per trial from 100 to 1000?
The rnorm()
and runif()
functions can both take three arguments:
the number of desired randomly-generated numbers and the distribution's two parameters.
However, it is very common to use normal curves with mean 0 and standard deviation 1,
as it is to use uniform distributions with minimum 0 and maximum 1. Given these facts,
these two functions have default parameter values, which means that omission of the
parameter values will result in the default values being used. For example,
for runif(20)
, R assumes that you want 20 random numbers from a uniform
distribution between 0 and 1. For rnorm(20)
, R assumes that you want
20 random numbers from a normal distribution with mean 0 and standard deviation 1.
Generate random numbers given the following specifications, omitting parameters where
applicable as described above. Visualize each of your results by putting them
into a histogram using the hist()
function. How does the sample size affect the shape of the histogram?