Sets are a powerful concept in mathematics. A set is a collection of unique elements, where unique indicates that there are no duplicates. For example, {1, 2, 3} is a set, but {1, 2, 2, 3} is not a set because it contains a duplicate element. Set operations, specifically union, intersection, and difference, are useful in many applications, including data analysis and probability. In this chapter, I will cover how to execute set operations in R.
Unlike some other languages, R has no dedicated set type. Instead, you can remove all duplicates from a vector, effectively making the vector a set. This is done with the unique() function. See below for an example:
set.to.be <- c(1, 2, 2, 3)
unique(set.to.be)
The output of the above code is 1 2 3, which is a set. Note that the original vector is not changed, and the set version is not saved unless you assign it to a variable.
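To keep the deduplicated result, assign it to a variable. The short sketch below reuses set.to.be from the example above; the name my.set is only an illustrative choice, not something the chapter defines.

```r
set.to.be <- c(1, 2, 2, 3)

# unique() returns a new vector; assign it to keep the result
my.set <- unique(set.to.be)

set.to.be                    # the original still contains the duplicate 2
my.set                       # the deduplicated copy
anyDuplicated(my.set) == 0   # TRUE when no duplicates remain
```

Here anyDuplicated() is used only as a quick check that the vector now behaves like a set.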
In mathematics, sets do not have a specific order. However, in R, since sets are just vectors, they do have an order. The outputs of the functions in this chapter are vectors, but their order should not be relied upon. If you need to use a set in a specific order, you should sort these vectors per your specifications.
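As a small illustration of the point about order, the sketch below sorts a deduplicated vector and compares two vectors as sets. The vector a is made up for this example, and setequal() is a base R function that this chapter does not otherwise cover.

```r
a <- unique(c(3, 1, 3, 2))   # order follows first appearance in the input

# Impose an explicit order when it matters
sort(a)                      # increasing order
sort(a, decreasing = TRUE)   # decreasing order

# Compare as sets, ignoring the order of the elements
setequal(a, c(1, 2, 3))
```

Sorting is a separate, explicit step; none of the set functions promise a particular order on their own.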
The union of two sets is the set of all elements that exist in either set (or in both). In R, the union of two sets can be found with the union() function. See below for an example:
set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)
union(set1, set2)
The output of the above code is 1 2 3 4 5, which is the union of the two sets. Note that the order of the elements in the output is not necessarily the same as the order of the elements in the original sets.
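The inputs to union() do not have to be deduplicated first. The sketch below, using two made-up vectors a and b, suggests that for plain numeric vectors like these the result matches combining the vectors and then calling unique().

```r
a <- c(1, 2, 2, 3)   # contains a duplicate
b <- c(3, 4, 4)      # contains a duplicate

union(a, b)          # duplicates are dropped from the result

# For these vectors, union() agrees with combine-then-deduplicate
identical(union(a, b), unique(c(a, b)))
```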
The intersection of two sets is the set of all elements that exist in both sets. In R, the intersection of two sets can be found with the intersect() function. See below for an example:
set1 <- c(4, 2, 3)
set2 <- c(3, 4, 5)
intersect(set1, set2)
The output of the above code is 4 3, which is the intersection of the two sets.
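One way to build intuition for intersect() is to compare it with membership testing via the %in% operator. This is a sketch for duplicate-free numeric vectors such as the ones above, not a description of how intersect() is implemented.

```r
set1 <- c(4, 2, 3)
set2 <- c(3, 4, 5)

set1 %in% set2          # is each element of set1 also in set2?
set1[set1 %in% set2]    # keep only the elements of set1 found in set2

# For these vectors the two approaches agree
identical(intersect(set1, set2), set1[set1 %in% set2])

# Swapping the arguments gives the same set, possibly in another order
setequal(intersect(set1, set2), intersect(set2, set1))
```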
The difference of two sets is the set of all elements that exist in the first set but not the second set. You might have a certain degree of intuition about the above functions, but this operation might throw you off a bit at first. In R, the difference of two sets can be found with the setdiff() function. See below for an example:
set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)
setdiff(set1, set2)
setdiff(set2, set1)
The output of line 3 is 1 2, which is the difference of set1 and set2: the elements of set1 that are not in set2. The output of line 4 is 4 5, which is the difference of the two sets in the opposite order. As you can see, unlike the other operations above, the order of the sets matters in this operation.
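The three operations fit together: every element of the union is either only in the first set, only in the second set, or in both. A minimal sketch of that relationship, where only1, only2, and both are names chosen for this illustration:

```r
set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)

only1 <- setdiff(set1, set2)    # in set1 but not in set2
only2 <- setdiff(set2, set1)    # in set2 but not in set1
both  <- intersect(set1, set2)  # in both sets

# The three pieces together rebuild the union
pieces <- union(union(only1, only2), both)
setequal(pieces, union(set1, set2))
```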
Set operations are helpful in many cases in coding. In this chapter, I covered how to execute these set operations in R. These functions might not be the most commonly used functions in R, but you will certainly need them in some cases.
There are many data sets that come built-in with R. They are often used for pedagogical or testing purposes. To see the list of built-in data sets, run help(package = "datasets").
The following prompts will use those built-in data sets to practice using set functions: unique(), union(), intersect(), and setdiff().
To find out more about any data set, use a question mark followed by its name, e.g., ?airquality.
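Besides the help page, a few base R functions give a quick look at a data set from the console. The sketch below is one possible way to get oriented before starting the prompts.

```r
names(airquality)     # column names of the data frame
head(airquality, 3)   # first three rows
str(quakes)           # compact summary of columns and their types
```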
You may benefit from reviewing the chapter about how to handle data structures.
The airquality data set contains information about air quality in New York in a few months of 1973. One of its columns is Month, which contains the month of the year as a number. In a single line, make a vector that lists all the unique months that show up in the data set. E.g., if the Month column of the data set were c(4, 4, 4, 5, 5, 6, 6, 6), the vector sought after is c(4, 5, 6). You should find five unique months.
The quakes data set contains information about earthquakes near Fiji since 1964. The stations column contains the number of stations that reported each earthquake. In a single line, make a vector that lists all the unique values of stations that show up in the data set. You should find 102 unique values.
The CO2 data set contains information about the uptake rate of CO2 by grass plants, six from Quebec and six from Mississippi. The Treatment column indicates if the plant was chilled or not, while the uptake column indicates the uptake rate of CO2 for each plant. Begin by creating an additional column called rounded.uptake that uses the round() function on the uptake column to round the uptake rates to the nearest integer. Then, find what integer-valued uptake rates occurred for both "chilled" and "nonchilled" plants. Note that the set functions require vectors for input. You should find 13 unique rounded uptake rate values that are found for both treatment types.
The mdeaths and fdeaths data sets record the number of deaths of males and females, respectively, from diseases of the lung in the UK during the 1970s. These data sets look like matrices when printed, but is.matrix(mdeaths) gives FALSE. In fact, they are time series objects, or ts as named in R (note that is.ts(mdeaths) is TRUE). These are one-dimensional arrays that are positioned across months and years.
For both data sets, entries 1 through 12 are the number of deaths across the twelve months of 1974. Access them using mdeaths[1:12] and fdeaths[1:12], and save them in the variables male.deaths and female.deaths, respectively.
For both data sets, find the months that had above-average mortality for the year. To do this, begin by using [] to find the months that were above average, and using that with the built-in month.name vector to get the name of the month.
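Then, complete the following: find the months that were above average for both males and females using intersect(); for either males or females using union(); for males but not females using setdiff(); and for females but not males using setdiff().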
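The indexing step described in this prompt can be sketched on made-up data. The vector rainfall below is hypothetical (twelve invented monthly values, not taken from any built-in data set); only month.name is a real built-in.

```r
# Hypothetical monthly values, one per calendar month
rainfall <- c(10, 20, 30, 5, 8, 40, 12, 3, 25, 7, 9, 11)

# Logical vector: TRUE where the month is above the yearly average
above <- rainfall > mean(rainfall)

# Use the logical vector inside [] to pick the matching month names
wet.months <- month.name[above]
wet.months
```

The same pattern, applied separately to male.deaths and female.deaths, gives the two vectors of month names that the set functions then compare.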