When first learning coding, it is somewhat difficult to comprehend what coding languages even do. Robotics, a field closely related to computer science, has clearer operations: a computer controlling a robot might turn a motor, light up an LED, or otherwise interact with the physical world. But what does a computer do when it is not controlling a robot? This question is more difficult to intuitively understand.
For questions like this, it is helpful to look at simple examples.
One simple computer that we're all familiar with is a calculator. A calculator is a computer that takes in numbers and operators (e.g., add, subtract, multiply, divide) and outputs another number. If we hit the buttons 1, +, 2, and = in succession, the calculator will output 3 on its screen.
The calculator has code that gives it instructions on what to do with the inputs
it receives. Calculators have no sense of what numbers are, but when given instructions,
they are able to execute them. The code tells the calculator how to receive data
(e.g., numbers), how to manipulate that data (e.g., how to add numbers), and how to
present this manipulated data to a user (e.g., putting the result on the screen).
Although our laptops, desktop computers and phones are all much more complicated, the basic principle is the same. Everything coding languages can do can be summarized as data input, data manipulation, and data output. Calculators only deal with numbers and operators, but computers can deal with several other types of data. If coding languages only input, output, and manipulate data, you may question how they can do anything that is all that interesting. However, data can be manipulated in intricate ways that produce highly organized and complex results, from a spreadsheet of measurements to a video game.
In this section, I will introduce the most basic types of data that the R coding language uses, and how variables are used to do things with these data. In many coding languages, the simplest data types are called primitive data, in the sense that they are uncomplicated and have very little structure. Although R documentation usually does not use this term, it is a useful way to think about these basic data types. All other data types are simply organized collections of primitive data, so primitive data are the building blocks of all data.
As discussed in the practice section of the Setup chapter, RStudio has a console (bottom left window) that can be used to run code. The console is a place where you can type in code and see its results immediately. This is different from .R files, usually edited in a window directly above the console, which are simply text files containing R code that is only run when instructed to do so. The console is a great place to experiment with code and see what it does.
In the console, you'll notice a > symbol. This is called a prompt, and indicates that R is ready to receive code.
Follow along with the code below in your console.
We'll start very simple. Type this into the console after the > and press Enter:
1 + 2
You'll see [1] 3 appear in the console, which is the result of the code you just ran. The [1] is not important for now, but know that it accompanies output in the console. Here, the data input is 1 and 2, the data manipulation is +, and the data output is 3.
In this code, we found the answer to 1 + 2, but we did not save the result anywhere. If we wanted to use this result later, we would have to re-run the code. This is where variables come in. Variables are a way to store data so that it can be used later. Variables are assigned using the <- operator, which is an assignment operator and is often read as "gets" (e.g., x <- 3 is read as "x gets 3"). It is used to give a value to a variable. For example, the following code assigns the value 3 to the variable x:
x <- 3
You can also use = to assign variables, but to be consistent, it is recommended to use <-. Now, if you type x into the console and press Enter, you'll see 3 appear.
These variables can be used in the same way as the data input in the previous example. For example, the following code will output 5:
x + 2
Now that we have gone through this simple example, we'll talk about the specific types of data that R uses, and expound on variables.
R has six basic data types. One of them (raw, which stores individual bytes) is obscure and will rarely be of use to you, but the other five are of interest: (1) numeric, (2) integer, (3) complex, (4) logical and (5) character. We'll go through each of these in turn, and I'll give some examples of how they are used.
R is unique in that all of its simplest data types are always in atomic vectors (more commonly just referred to as vectors). Here, a "vector" is just a bunch of things put in some order, and "atomic" means that the data in the vector are the most basic data types. These vectors only contain data that are all of the same type. We'll talk about why vectors make R unique later, but here I give an early warning: R has some oddities that can be confusing or frustrating. Nonetheless, understanding the "behavior" of primitive data will give you a foundation for understanding more complex data types. It is a worthy goal to try to get a firm mental grasp on the primitive data types.
There are three different types of numbers in R, but they largely behave the same way, and only rarely will you need to distinguish between them. Let's start with the example in the box below.
In the console, type c(1,2.5,3) and press Enter. You'll see [1] 1.0 2.5 3.0 appear in the console. This is an atomic vector of length 3, containing the numbers 1, 2.5, and 3. The c stands for combine, and makes a vector from what is given in the parentheses.
The vector created in the box above is a numeric vector, the most common of R's primitive data types. Unsurprisingly, this indicates that the vector contains numbers. Even when you don't add decimals to the numbers in a vector, R still treats the whole numbers typed out in your code (e.g., 1, 179, -14) like they're decimal numbers. If you want to make a vector of integers, you can use the L suffix. For example, c(1L, 2L, 3L) creates a vector of integers. Generally speaking, you'll mostly deal with numeric vectors in code you write, with one strange exception described in the next paragraph.
In some cases, you need to make a vector of sequential numbers. For example, you might want to make a vector of the numbers 5 through 10. You could type out c(5,6,7,8,9,10), but this is tedious and error-prone. Instead, you can use the : operator to make a vector of sequential numbers. 5:10 accomplishes the goal much more succinctly and readably than the previous code. This, oddly, creates an integer vector, even though whole numbers you type yourself need the L suffix to be integers. Luckily, integers and numerics act the same in almost all cases, so this is not a big deal. If you want to know what the type of a vector is, you can use the class() function. For example, class(5:10) returns "integer" and class(c(1,2,3)) returns "numeric".
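As a quick check of the two number types described so far, the short sketch below asks class() about a few values. The variable names are made up for this illustration and do not appear elsewhere in the chapter.

```r
# Hypothetical variable names, chosen only for this illustration
typed.number   <- 7      # a whole number typed without the L suffix
typed.integer  <- 7L     # the same value with the L suffix
colon.sequence <- 5:10   # built with the : operator

class(typed.number)    # "numeric"
class(typed.integer)   # "integer"
class(colon.sequence)  # "integer"

# A numeric and an integer with the same value compare as equal
typed.number == typed.integer  # TRUE
```

The last line is one small piece of evidence for why the distinction rarely matters in practice.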
As you read through the previous paragraphs, you may have wondered about special cases with vectors that I did not mention. Perhaps you wondered one of the following:
What happens if I mix different types of numbers in one vector (e.g., c(1L, 1.5, -20))?
What happens if I use negative numbers with the : operator (e.g., -5:5)? What about if I put the same number on both sides? If I put the larger number on the left side? If I try to use decimal numbers on one or both sides?
What happens if I give c() nothing at all (e.g., c())?
If you ever have questions like these, you're in luck! It is incredibly easy to test out these curiosities in the console. Understanding how R handles these edge cases can help you gain confidence in how to use the language. Test out the above questions in the console and come up with answers for yourself.
The next primitive data type is complex. Complex numbers are numbers that have both a real and imaginary part. For example, 1 + 2i is a complex number. You are unlikely to use these yourself, but they are good to be aware of.
You can use the standard mathematical operators (+, -, *, /, ^) with numeric, integer or complex vectors.
Try out the following examples in the console to see how they work (note that these operations
are performed element-by-element when there is more than one number in the vector):
2 + 6
2 - 6
2 * 6
2 / 6
2 ^ 6
c(1, 2, 3) + 6
c(1, 2, 3) - 6
c(1, 2, 3) * 6
c(1, 2, 3) / 6
c(1, 2, 3) ^ 6
c(1, 2, 3) + c(4, 5, 6)
c(1, 2, 3) - c(4, 5, 6)
c(1, 2, 3) * c(4, 5, 6)
c(1, 2, 3) / c(4, 5, 6)
c(1, 2, 3) ^ c(4, 5, 6)
The next primitive data type is logical. Logical data are either TRUE or FALSE (alternatively, T or F, but the full name is more clear). In most other languages, these are referred to as "booleans". These are used in many ways in R. For example, if you want to compare two values, like 1 and 2, you can use the "less than" operator (<) to see if the first value is less than the second with 1 < 2. This will return TRUE or FALSE, depending on whether the statement is true or false (it is true in this case, of course). You can also use the >, <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to) operators to compare values. See the code below for some examples, where comments indicate the return value of each line.
1 < 1 # FALSE
1 <= 1 # TRUE
1 >= 5 # FALSE
1 == 1 # TRUE
1 != 1 # FALSE
1 == 2 # FALSE
1 != 2 # TRUE
You can also use the & (and) and | (or) operators to combine logical values. These take two logical values and return a single logical value. The & operator returns TRUE if both values are TRUE, and FALSE otherwise. The | operator returns TRUE if either value is TRUE, and FALSE otherwise. See the code below for some examples, where comments indicate the return value of each line.
TRUE & TRUE # TRUE
TRUE & FALSE # FALSE
FALSE & TRUE # FALSE
FALSE & FALSE # FALSE
TRUE | TRUE # TRUE
TRUE | FALSE # TRUE
FALSE | TRUE # TRUE
FALSE | FALSE # FALSE
These may seem unhelpful at first, but are pivotal for making more complex logical statements, which are common in coding. These generally make more sense in the context of variables, so I'll lean on the simple example above to explain them in the next box.
Begin by running the following code in the console to create the variable x and give it a value of 3:
x <- 3
Try to predict the outcome of each of the following lines. Check to see if your predictions were correct by typing them in the console.
0 < x & x < 20
0 < x | x < 20
x < 2 & x < 8
x < 2 | x < 8
-10 <= x | x == 21
x == 0 | x == 1
As a final note, you will have instances where it is useful to know if a value is inside another vector. For example, you may want to know if the number 2 is in the vector c(2,1,5,4). It is obvious in this case, but if the vector is very long, it can be difficult to tell. You can use the %in% operator to check if a value is in a vector. For example, 2 %in% c(2,1,5,4) returns TRUE, and 3 %in% c(2,1,5,4) returns FALSE.
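Like the arithmetic operators, %in% also works when the left side has more than one value. The sketch below stores the vector from the paragraph above in a variable (the name allowed.values is made up for this example) and checks several values at once.

```r
# Hypothetical variable name holding the vector from the text
allowed.values <- c(2, 1, 5, 4)

2 %in% allowed.values   # TRUE
3 %in% allowed.values   # FALSE

# The left side can itself be a vector; each element is checked in turn
c(2, 3, 5) %in% allowed.values   # TRUE FALSE TRUE
```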
The final primitive data type is character. Character data usually just look like words inside of quotation marks, but as the name suggests, these data are just any typed characters between quotation marks. Try out the following code in the console to see some examples:
"Hello, world!"
c("abc", "d", "&", "1", "2!!!", " ")
The first line has a single character value, and the second line is a vector of character values. Note that the second line has "1" in it, and not 1. The former is a character value, where it's treated as the character 1, and the latter is a numeric value, treated as the number 1. Additionally, the second line has " ", which is a character value that is nothing but a space. In most other languages, this data type is called a string, as in "a string of characters", but R refers to this type just as "character".
These are used in many ways in R. Most commonly, you will use character values to label things, like columns in a data set. Let's look at how we could join two character values together to make a single character value. We can do this with the paste() function:
paste("Hello", "world!")
This returns "Hello world!", which is a single character value. The paste() function takes any number of values, of any type, and combines them into a single character value. You'll notice it automatically added a space between the two character values. This can be controlled, but that will be discussed later.
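To illustrate the "of any type" part, here is a small sketch that mixes character, numeric and logical values in one paste() call. The variable names and the message are invented for this example.

```r
trial.number <- 3   # hypothetical value

# paste() converts each piece to character before joining them
label <- paste("Trial", trial.number, "finished:", TRUE)

label          # "Trial 3 finished: TRUE"
class(label)   # "character"
```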
In some cases, you may want to put a message in the console as you're running code. For example, say your code is running a simulation that takes a long time to complete. You may want to show a message in the console to let you know how far along the code is. You can do this with the cat() function. For example, the following code shows the message "Hello, world!" in the console:
cat("Hello, world!")
This function does the trick in most cases, and even works with special characters, other data types or multiple values:
cat(13)
cat("\U1F600") # The \ indicates that what follows is
# a special character, in this case a smiley face emoji
cat("This line will print\non two lines")
# The \n creates a new line
cat("I was born in", 1903)
There is another similar function in R, called print(), which can show more complicated data types than cat() can. However, print() displays only one object at a time, and it shows escape sequences such as \n exactly as typed instead of interpreting them, so it is less useful for printing messages to the console.
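The difference is easiest to see side by side. In the sketch below, the same character value (stored under a made-up variable name) is sent to both functions.

```r
message.text <- "first line\nsecond line"   # hypothetical message

cat(message.text)     # the \n becomes a real line break
print(message.text)   # the \n is shown as typed, inside quotation marks

# print() shows a whole vector, including the [1] label
print(c(1.5, 2, 3))
```

Running these one at a time in the console makes it clear which function is meant for messages and which is meant for inspecting data.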
There are a few values that come up with these primitive data types that are worth mentioning. The first is Inf, which is short for infinity. Running class(Inf) shows that it is a numeric value. In other words, R treats Inf as a special type of number. This is useful for indicating when a mathematical operation returns infinity, or an incredibly large number. Try out each of the following lines to see how Inf can appear:
1 / 0
-1 / 0
1.4e1000 # This is how you type 1.4 x 10^1000,
# which is too big for R to handle, so it converts it to Inf
Similar to Inf, NaN, which stands for "not a number", is also a numeric value. This is used to indicate when a mathematical operation returns a value that is not a number. The following line shows how NaN can appear:
0 / 0
If you see Inf or NaN in a result, it usually indicates that you've done some incorrect math, often by dividing by zero.
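If you want to check for these values rather than spot them by eye, base R provides the is.finite() and is.nan() functions, which are not covered elsewhere in this chapter. The sketch below applies them to a made-up vector of results.

```r
# Hypothetical calculation results, including the special values
results <- c(1 / 0, -1 / 0, 0 / 0, 5)

is.finite(results)   # FALSE FALSE FALSE TRUE
is.nan(results)      # FALSE FALSE TRUE FALSE
```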
It is helpful to have a placeholder value that indicates that a value is missing. R uses NA for this purpose. This is a logical value (see class(NA)), and will be a frequent occurrence when you import data sets with empty cells. Note that NA looks very similar to NaN, but they have very different meanings.
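The following sketch shows how a missing value behaves, using a made-up vector of measurements. The is.na() function, which comes with base R but is not introduced above, reports which entries are missing.

```r
# Hypothetical data with one empty cell
measurements <- c(4.2, NA, 3.9)

is.na(measurements)   # FALSE TRUE FALSE
measurements + 1      # 5.2 NA 4.9 (a missing value stays missing)

# NaN counts as missing for is.na(), but NA is not "not a number"
is.na(NaN)    # TRUE
is.nan(NA)    # FALSE
```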
Finally, NULL is a special value that indicates that a variable has no value. Conceptually, you may think of this example to help understand: If I am not wearing a hat, you might say that my hat is NULL. This value is different from NA, which indicates that a variable has a missing value. If you run class(NULL), you'll see that it is of type "NULL", and is in fact the only value of this type. Using this is not common for statistical purposes, but being able to distinguish between this and the other values above will leave you less confused when you run across them.
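One way to see the difference between NULL and NA is to look at lengths. The sketch below borrows the hat from the example above as a variable name; is.null() and length() are base R functions.

```r
hat <- NULL     # hypothetical variable with no value at all

is.null(hat)    # TRUE
length(hat)     # 0
length(NA)      # 1 (NA is a value, just an unknown one)

# NULL vanishes when combined into a vector; NA keeps its place
length(c(1, NULL, 3))   # 2
length(c(1, NA, 3))     # 3
```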
There are many instances where you will want to convert between data types. For example, you may want to convert a numeric vector to a character vector, or a character vector to a numeric vector. This is easy to do with functions such as as.character(), as.numeric(), and as.integer(). For example, the following code converts the numeric vector c(1,2,3) to a character vector:
as.character(c(1,2,3))
I'll leave it to you in the Practice section below to try out these functions, and to see how they handle weird cases, like trying to convert "hello" to a numeric value.
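Before you get to the weird cases, here is a sketch of a few well-behaved conversions. The variable name is made up for this example.

```r
# Hypothetical character data that happens to look like numbers
number.text <- c("1.5", "20", "-3")

as.numeric(number.text)          # 1.5 20.0 -3.0
class(as.numeric(number.text))   # "numeric"

# Logical values can be converted too
as.numeric(c(TRUE, FALSE))       # 1 0
as.character(TRUE)               # "TRUE"
```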
In some cases, scientific data might take on a limited number of values. For example, an experiment might have categorical treatments, like different food types for an herbivore's diet. In this case, the data are characters, but are still limited to a few values (e.g., "hay", "corn", or "alfalfa"). R has a special way to indicate this type of data, called a factor. Factors are a special type of vector that can only take on a limited number of values. For example, the following code creates a factor with the three values mentioned above:
as.factor(c("hay", "corn", "alfalfa"))
Running this in R shows that the factor has three "levels", which are the three values that the factor can take on. Factors are useful for statistical analyses, and are used in many of R's built-in functions.
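If you store a factor in a variable, you can ask for its levels directly with the base R functions levels() and nlevels(). In this sketch, the variable name diet and the repeated "hay" entry are my own additions for illustration.

```r
# Hypothetical data: four animals, three distinct diets
diet <- as.factor(c("hay", "corn", "alfalfa", "hay"))

levels(diet)    # "alfalfa" "corn" "hay"
nlevels(diet)   # 3
class(diet)     # "factor"
```

Notice that the repeated value does not create a new level, and that the levels are listed alphabetically rather than in the order they were typed.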
At the beginning of this section, I said that R is unique when it comes to vectors. I mentioned that all of R's basic data types come in vectors. In fact, even solo numerics, logicals, characters, etc. are vectors of length 1. That means that there is no such thing as something that is of type numeric but is not a vector. All of the following examples are one-element vectors:
1
TRUE
"hello"
Other languages make the distinction between primitive data and vectors of that data, but R does not. In other languages, you can create vectors that themselves contain more vectors, but in R, you can only create atomic vectors that contain primitive data. The c() function mentioned before collects all the vectors it receives and makes a single vector out of them. All of the following examples create the same vector:
c(1, 2, 3)
c(c(1, 2), 3)
c(1, c(2, 3))
There are upsides and downsides to this behavior. However, it is generally useful for statistical purposes, and is one of the reasons R is so popular for statistics.
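You can confirm both of these claims in the console. The sketch below uses length() and the base R functions is.vector() and identical(); the variable names are made up for this example.

```r
# Single values are vectors of length 1
length(1)          # 1
length("hello")    # 1
is.vector(TRUE)    # TRUE

# Nesting c() calls does not nest the result
nested <- c(c(1, 2), 3)   # hypothetical variable names
flat   <- c(1, 2, 3)

identical(nested, flat)   # TRUE
length(nested)            # 3
```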
As discussed above in the initial example, variables are a way to store data so that it can be used later. They can store all types of data, and can be used in place of data in operations. Here, I'll discuss how variables work and the conventions you should follow when using them. Try out the following to see how variables work:
Execute the following lines of code in the console, one by one. Some will have output to the console, and some will not. It may be helpful for you to pay attention to the Global Environment panel in the upper right window of RStudio, which shows the variables you have created. You'll be able to see how the value of x changes as you run the code.
x <- 3
x + 2
x <- 10
x + 2
x <- x + 1
Answer the following questions:
What is the value of x after each line?
What happened to x when line 3 was run?
What happened to x when line 5 was run?
As you were hopefully able to determine, variables in R hold their value until they are changed. In addition, a variable can be used in its own assignment (e.g., line 5 in the box above). This shows that R will first calculate whatever is on the right side of the assignment operator (<-), and then assign that value to the variable on the left side, which overrides the previous value.
Below, I will discuss several aspects of variables, including how to name them and use them.
R maintains a global environment, which RStudio displays. This term refers to the set of variables currently available to R. Every time you assign a new variable, it is added to the global environment (e.g., the line x <- 3 adds x to the global environment, and then x can be accessed at any time). As mentioned in the box above, you can see what is in your global environment by looking at the top right window of RStudio, and can clear it by pressing the brush icon at the top of that window.
Having a global environment means that if you run a line of code that creates a variable, that variable will be stored until you close RStudio or clear the environment. This can be useful, but it also means that code in your .R document may appear complete while actually relying on a variable that you created at some point in the console and that is not defined anywhere in the document. This is a common problem for new R users, and it is important to understand how the environment works in order to avoid it.
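You can also inspect and tidy the global environment with code instead of the RStudio panel. The sketch below uses three base R functions that are not covered elsewhere in this chapter: ls() lists the variables that currently exist, exists() checks for one by name, and rm() removes one. The variable name is made up for this example.

```r
demo.value <- 3          # hypothetical variable name

exists("demo.value")     # TRUE
ls()                     # lists every variable, including demo.value

rm(demo.value)           # remove just this one variable
exists("demo.value")     # FALSE
```

Checking that a variable is gone, and then re-running your .R file from the top, is one way to catch code that silently depends on something you typed into the console earlier.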
Variables allow you to write code that is more adaptable. If you are doing a lot of calculations with one quantity, you're better off assigning it to a variable and using that variable in your calculations than typing the quantity every time you need it. This way, if you need to change the value of that quantity, you only need to change it in one place, and the rest of your code will automatically use the new value. This is much faster and less error-prone than having to change the value in every place it is used. Even if you don't plan on changing the value of a quantity, it is still good practice to assign it to a variable, and it makes your code more readable.
For example, say you are doing a calculation with the number of units of a product you are selling, the cost per unit, and the total cost. You could write the following code:
quantity <- 10
cost.per.unit <- 5
total.cost <- quantity * cost.per.unit
cat("Total cost for", quantity, "units: $", total.cost, "\n")
grams.conversion.factor <- 1000
grams.quantity <- quantity * grams.conversion.factor
cat(quantity, "units is equal to", grams.quantity, "grams\n")
pounds.conversion.factor <- 0.00220462 # pounds per gram
pounds.quantity <- grams.quantity * pounds.conversion.factor
cat(quantity, "units is equal to", pounds.quantity, "pounds\n")
This code is easy to edit, and easy to understand, even if you don't know R. There are no "magic numbers" (meaning numbers that appear in code without context or explanation), and any change in the quantity will automatically update the total cost, grams, and pounds. The alternative, retyping the number in each place quantity is used, is much more error-prone and difficult.
Variables in R are named using letters, numbers, underscores and periods, but cannot start with a number or an underscore. For example, my.variable is a valid variable name, but 1variable is not. Generally, it is a good idea in R to use all lowercase letters with variables, with words separated by periods.
Although it can be a bit of a challenge, it is pivotal to give variables names that are descriptive of what they represent. For example, temp is a bad variable name. What does it represent? Temperature? Temporary? If you knew it meant temporary, would you know what it was temporary for? An example of a better name would be total.cost, which is descriptive of what it represents. This is a simple example, but it is important to give variables descriptive names, especially when you are working with more complicated code.
There are generally sets of conventions that people follow when coding, often referred to as a style guide (Google's R style guide is one example). Although you don't have to follow the exact recommendations I gave above, I sincerely urge you to follow these rules when creating variables:
Many, many people ignore conventions of coding, and instead aim to just make code that works. This is a bad idea. It is painful to read and edit messy code, and science is riddled with poor code that slows down scientific progress. The effort it takes to make code readable and easy to edit is well worth it. Even if you are the only person who will ever see your code, you will thank yourself later for using consistent conventions.
In this chapter, I introduced the most basic data types in R, and how to use variables to manipulate them. Although these data types are simple, they are the building blocks of all data in R, so it is important to understand them well. In the section below, you'll get a chance to practice using these data types and variables.
I will briefly introduce some functionality of vectors that will be expounded upon in later chapters: (1) indexing with [], which gets the entry of a vector at the specified index, and (2) the length() function, which returns the length of a vector. With this information, run this code in your console:
vector.of.numbers <- c(8, 1, 8, 2)
length(vector.of.numbers)
vector.of.numbers[1]
vector.of.numbers[2]
vector.of.numbers[3]
vector.of.numbers[4]
vector.of.numbers[5]
vector.of.numbers[1] <- 4
length(vector.of.numbers)
vector.of.numbers[5] <- 10
length(vector.of.numbers)
Answer the following questions:
As discussed above, the : operator is used to make a vector of sequential numbers, and operations (+, -, etc.) can be used on vectors to make new vectors. Using these two facts, make a vector of the numbers 1 through 100, and assign it to the variable large.vector. Then, use multiplication to change this vector to be all the even numbers up to 200. Finally, use the length() function to check that the vector is still of length 100.
Now, follow the same concept to generate a 101-element vector that contains the numbers 0 through 25, with steps of 0.25.
The following code is trying to find the perimeter and area of a circle given a radius of 5. The code works, but is poorly written. Edit it so that it is more readable. Specifically:
When you're done, the code should be easy to read and understand, even for someone who doesn't know R. This sort of clean-up editing is often referred to as refactoring.
circrad <- 5
circlePerm<-2*pi* 5
A <- 3.14 *5 ^ 2
cat("The perimeter of a circle with radius", circrad, "is", circlePerm, ".\n")
cat("The area of this circle is", A, ".\n")