2 A, B, C
2.1 The first code
We can start exploring the big potential of R by typing some small basic mathematical operations. My suggestion is to always work on a source file[1] and run your commands line by line using the “Run” button (or Control+Enter on Windows, Command+Enter on Mac).
Now we can create some “objects”. This means storing numbers or series of numbers in the memory of our computer (the Environment), so that we can use these objects in our analysis. We can also create a vector by assigning more than one value to an object. For this purpose, we need the combine function c(), putting all the values between the parentheses, separated by commas. If what we want is a vector comprising an ordered series of integers, there is a shortcut: we can use the colon symbol between the starting and the ending numbers. We can also do operations between values and objects. Remember that if we assign new content to an existing object, we will lose the original content. In R there is no “undo” button.
Below we can see the syntax used for this purpose. As you can see, the code contains some comments. Comments are a great tool for explaining what the code does, so that other people (and our future selves) can understand it better. When we place a hash sign (#) in the code, everything that follows it on that line becomes a comment, and R skips it when running the code.
Code
# = and <- are the same, but I prefer <-
x = 3
x <- 5
# with the == sign you ask R if a condition is true
x == 7
# single value assignment
x <- "hello"
x <- 2 * 5
x
# vector assignment
y <- c(3, 4, 2, 4, 5, 1)
y == 4
# character vector
t <- c("hello", "how", "are you", "?")
t
# ordered integer series
y <- c(1:10)
The notations <- and = are equivalent for assigning values to an object; however, my suggestion is to always use the former, in order to avoid confusion with the equal sign. As you can see, when we store something in the Environment, R does not show it to us. We need to “call” the object by its name in order to see what is inside it.
Read the code below by yourself, and guess the result before running the code.
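The remarks above on operations between objects and on overwriting can be tried with a short sketch like the following. The object names `a`, `b` and `s` and their values are illustrative assumptions, not part of the original listing.

```r
# operations between a value and an object
a <- 1:5                       # ordered integer series 1, 2, 3, 4, 5
a * 2                          # every element is multiplied by 2

# operations between two objects of the same length
b <- c(10, 20, 30, 40, 50)
s <- a + b                     # element-by-element sum
s

# assigning new content to an existing name overwrites the old content
a <- "overwritten"
a                              # the integer series is gone, there is no undo
```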
2.2 Indexing
The indexing system allows us to select a specific part of our data. To do so, we first call the name of the object we want to search (y), and then put the selection instruction between square brackets [ ] right after it. For example, we can select the first, second and fourth values with y[c(1,2,4)], or all the values bigger than 3 with y[y>3]; we can also combine multiple conditions, as in y[y>3 & y<10], and so on. Another common and useful instruction is to select all the values in one object \(y\) that are present in another object \(z\): y[y %in% z].[2]
Try the following code by yourself and guess the result before running it.
Code
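As a minimal sketch of the indexing expressions described above, the following can be run line by line. The values of `y` and `z` are illustrative assumptions, chosen only so that every selection returns something.

```r
y <- c(3, 4, 2, 4, 5, 1)   # illustrative numeric vector
z <- c(4, 5, 6)            # hypothetical second vector

y[2]                # the second value
y[c(1, 2, 4)]       # the first, second and fourth values
y[-1]               # everything except the first value
y[y > 3]            # all the values bigger than 3
y[y > 1 & y < 5]    # values bigger than 1 AND smaller than 5
y[y < 2 | y > 4]    # values smaller than 2 OR bigger than 4
y[y %in% z]         # values of y that are also present in z
```

Note that the expression inside the brackets (for example `y > 3`) is itself a logical vector; R keeps the positions where it is TRUE.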
I recognize that the indexing system is not intuitive and may be confusing at first. My suggestion is to practice a lot with it. Don’t be shy about creating objects of all sorts and selecting whatever you want from them. Later in this chapter and in Advanced Data Manipulation and Plotting we will see two friendlier methods for subsetting (the subset() function and the package dplyr), but, please, do not underestimate the power of the indexing system.
Now that we have mastered numeric vectors, it is time to learn that R can manage many different types of data. More precisely, a vector can be character, numeric (real numbers, with or without decimals), integer, logical, or complex. The most used types, however, are character (like the vector called t in our first code chunk), numeric, and logical (TRUE or FALSE). Another interesting type is the factor, but we will see it in action in Data Cleaning. Be aware that if a vector is composed of both characters and numbers, it will automatically be treated as a character vector! The same happens if our numbers use a comma instead of a dot as the decimal separator!
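The behaviour described above can be checked with class(). The following is a small sketch with made-up vectors (all object names are assumptions); the last lines show one possible way to repair numbers written with a decimal comma.

```r
num <- c(1.5, 2, 3)
int <- 1:3
lgl <- c(TRUE, FALSE, TRUE)
chr <- c("a", "b", "c")

class(num)    # "numeric"
class(int)    # "integer"
class(lgl)    # "logical"
class(chr)    # "character"

# mixing numbers and characters: everything becomes character
mixed <- c(1, "two", 3)
class(mixed)  # "character"

# numbers typed with a decimal comma are read as text
with_comma <- c("3,5", "2,25")
as.numeric(with_comma)   # NA NA, with a coercion warning

# replacing the comma with a dot before converting
fixed <- as.numeric(sub(",", ".", with_comma, fixed = TRUE))
fixed
```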
2.3 The first function
The first R function of our life is mean(). Before running it, I suggest exploring the Help documentation for this function by typing ?mean in the Console. The help page, which appears on the right side of our workspace, contains all the instructions for using the function. The “options” specified within the round parentheses () are called arguments. Some of them have a default value, while others must be supplied in order to run the function. In this case only x is required: the object whose mean we want to compute. The help provides a thorough explanation of how each function works, plus some references and examples. Most importantly, the help is available for any function or package and, in almost all cases, provides all the information mentioned above.
Functions work by calling them by name (be aware that R is case sensitive) and putting the required arguments between round parentheses.
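To make the idea of required arguments, default values and case sensitivity concrete, here is a small sketch. The vector `v` is an illustrative assumption; `trim` is one of the optional arguments listed in ?mean.

```r
v <- c(1, 2, 3, 100)        # illustrative values with one extreme number

mean(v)                     # only the required argument x: 26.5
mean(x = v, trim = 0.25)    # drops the lowest and highest quarter first: 2.5

# R is case sensitive: Mean() is not the same as mean().
# tryCatch() is used here only to capture the error message
# instead of stopping the script.
result <- tryCatch(Mean(v), error = function(e) conditionMessage(e))
result
```

Arguments can be matched by position (`mean(v)`) or by name (`mean(x = v)`); naming them makes the code easier to read when a function has many options.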
Try the following code by yourself and try to guess the result.
Code
mean(y)
mean(z)
# what will happen here?
mean(y[1:4])
# calculate the mean manually
sum(y)/length(y)
p <- c(1, NA, 2, 4, NA, 7, 9, 30, NA)
mean(p)
# why is it NA?
mean(p, na.rm = TRUE)
# check in the help what na.rm does!
mean(p[!is.na(p)])
# we can see how many values p has, and how many of them are NA (Please note that
# for NA we do not use the normal equality expressions)
length(p)
length(p[is.na(p)])
# the function class tells us which is the type of data in the vector
class(y)
class(t)
What happens when we have to compute the mean of a vector with some missing observations (NA)? Missing observations are not zeros, so the sum cannot be computed and the result is NA. If we add the na.rm=TRUE argument to our mean function, R strips the missing observations before applying the function and gives us a result. Be aware that, after stripping the missing observations, the vector has a smaller length; this may have implications if, for example, we want a per-capita average.
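The point about the shorter vector can be seen by comparing the two possible denominators. This is only a sketch that reuses the vector p from the code above; whether dividing by the full length is meaningful depends on whether the missing observations can be treated as zeros, which is an assumption to be justified case by case.

```r
p <- c(1, NA, 2, 4, NA, 7, 9, 30, NA)

sum(p, na.rm = TRUE)               # sum of the observed values: 53
sum(!is.na(p))                     # number of observed values: 6
length(p)                          # number of values including NA: 9

mean(p, na.rm = TRUE)              # 53 divided by 6 observed values
sum(p, na.rm = TRUE) / length(p)   # 53 divided by all 9 positions
```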
2.4 Dataset Exploration
A dataset is a matrix of dimension \(r \times c\), with \(r\) rows representing the observations and \(c\) columns representing the variables. In other words, a dataset is composed of \(r\) horizontal vectors, or \(c\) vertical vectors. With this in mind, we can now load a dataset and start exploring its dimensions.
For this purpose we will use a dataset built into R that is called mtcars. In order to better understand what mtcars refers to, type ?mtcars in the Console and read carefully what the help says about it.
I know that working with data about the design and performance of 1973-74 car models is probably not your favorite hobby, but this R dataset is clean, ready to use, and has excellent documentation attached, so it is perfect for our learning purposes.
The following code presents some exploratory activities that should be carried out whenever we get our hands on new data, in order to understand its shape.
Code
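A minimal sketch of such an exploration, using only base R functions, could look like this. The selection of functions is a suggestion rather than a fixed recipe.

```r
str(mtcars)        # internal structure: type, dimensions, variables
dim(mtcars)        # number of rows and columns
nrow(mtcars)       # number of observations
ncol(mtcars)       # number of variables
names(mtcars)      # variable names
head(mtcars)       # first six rows
tail(mtcars, 3)    # last three rows
summary(mtcars)    # basic descriptive statistics for each variable
```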
We first explore the internal structure of the dataset (str(mtcars)), meaning: which kind of dataset it is (yes, there are multiple types), the number of observations (rows) and variables (columns), and the name and type of each variable with a small preview. A dataset could be a “matrix”, a “data frame”, a “tibble”, etc. I will not go through the details of each type of dataset; my suggestion is to always use a data frame, as this is one of the most flexible and complete forms of two-dimensional data. To convert any kind of dataset to a data frame, use the as.data.frame() function.[3]
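As a small illustration of the conversion, the sketch below builds a matrix and turns it into a data frame. The objects `m` and `d` are made up for this example.

```r
m <- matrix(1:6, nrow = 2)   # a 2 x 3 matrix of integers
class(m)                     # a matrix

d <- as.data.frame(m)        # the same values stored as a data frame
class(d)                     # "data.frame"
str(d)                       # default column names are assigned automatically
```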
Another important skill to acquire is the ability to create a dataset from scratch. Again, there are many ways to do it. Below are some examples that create the same dataset.
Code
# by specifying it columnwise
people <- data.frame(name = c("Mary", "Mike", "Greg"),
                     age = c(44, 52, 46),
                     IQ = c(160, 95, 110))
# by creating some vectors and then binding them together column-wise
name <- c("Mary", "Mike", "Greg")
age <- c(44, 52, 46)
IQ <- c(160, 95, 110)
# data.frame() binds the vectors as columns and keeps each column's type;
# wrapping them in cbind() first would turn age and IQ into character
people2 <- data.frame(name, age, IQ)
2.5 Subsetting
Now that we can create complex objects, we need to be able to “destroy” them, meaning to split them into subsets according to some characteristics of interest (James et al. 2021). Subsetting is the heart of data manipulation and the basis of data analysis. Via subsetting we can analyze the data of only one group of observations within our dataset. As an example, we can compare the fuel consumption (mpg) of manual cars with that of automatic cars.
To do so, we will use the same indexing expressions we used in the Indexing section, but this time we have to specify two dimensions (rows and columns), separated by a comma within the square brackets: [rows,columns]. Remember that if we leave either rows or columns unspecified, R will take the whole set of the corresponding dimension. Explore the code below. Use the str() function on the new object created in each line of code, in order to understand what happened.
Code
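A minimal sketch of two-dimensional indexing on mtcars follows. The object names on the left are illustrative assumptions, and each result can be inspected with str().

```r
first_rows <- mtcars[1:5, ]               # rows 1 to 5, all columns
first_cols <- mtcars[, 1:3]               # all rows, columns 1 to 3
corner     <- mtcars[1:5, c(1, 4)]        # rows 1 to 5, columns 1 and 4
by_name    <- mtcars[, c("mpg", "hp")]    # columns selected by name
one_cell   <- mtcars[3, 2]                # third row, second column

str(first_rows)
str(first_cols)
str(corner)
str(by_name)
str(one_cell)
```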
If we want to extract one single column from a dataset, we can use the $ sign.[4] This opens up opportunities for specifying particular characteristics that we want to retain in our dataset.
Code
# to call a single variable we use the $ sign
mtcars$cyl
# it is the same as
mtcars[,2]
# equality subsetting: we want all the data of the cars with 4 cylinders
mtcars[mtcars$cyl == 4,]
# we want the values of the first column (mpg) of cars with 4 cylinders
mtcars[mtcars$cyl == 4, 1]
# it is the same as
mtcars$mpg[mtcars$cyl == 4] # I prefer this
# subsetting consumption per type of transmission
# automatic cars
mpg.at <- mtcars$mpg[mtcars$am == 0]
# manual cars
mpg.mt <- mtcars$mpg[mtcars$am == 1]
There is also an easier (more readable) way: the subset() function. This function takes 3 arguments: the data frame we want to subset, the condition that selects the rows we want to keep, and the columns we want returned. The argument drop=TRUE allows us to drop the row names and get a vector as the final output. Of course, we can supply multiple conditions and select multiple columns.
The code below leads to the same result as in the last lines of the previous code chunk.
Code
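A sketch of how subset() can reproduce the mpg vectors by transmission type is given below. The names `mpg.at2`, `mpg.mt2` and `small_manual` are assumptions introduced here so that the results can be compared with the indexing version.

```r
# automatic cars: rows where am == 0, only the mpg column, returned as a vector
mpg.at2 <- subset(mtcars, am == 0, select = mpg, drop = TRUE)
# manual cars
mpg.mt2 <- subset(mtcars, am == 1, select = mpg, drop = TRUE)

# check that the result matches the indexing approach
identical(mpg.at2, mtcars$mpg[mtcars$am == 0])
identical(mpg.mt2, mtcars$mpg[mtcars$am == 1])

# multiple conditions and multiple columns: the output stays a data frame
small_manual <- subset(mtcars, cyl == 4 & am == 1, select = c(mpg, hp, wt))
str(small_manual)
```

Inside subset() the column names can be written directly (am, mpg) without the mtcars$ prefix, which is what makes this form easier to read.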
2.6 Importing and exporting data
Of course no one expects us to work only with built-in datasets, nor to keep the data only on our computer, so we need to know how to import and export files containing data. While R is a very powerful tool for cleaning data, it is important that the data we import follow at least the basic rule of one observation per row and one variable per column (meaning that there should not be variables such as “income2021”, “income2020”, etc., but rather one column for the year and one for the income). More rules on how to clean the data are available in Data Cleaning.
The first thing to do when working with external files is to set the working directory within the code (function setwd()). This tells R where to look for files and where to put the new ones. Imagine that we have to share our code with someone else: that person will only have to change the working directory path before running our code successfully.
In order to import external data into R, we can go to File -> Import Dataset and select the format of the dataset to import. Through this menu we can read the most common data formats (text-based files like csv or txt, Excel, Stata, SAS, SPSS) (Wickham and Bryan 2022), and through additional packages we can extend R's importing capacity to spatial data or other formats. After we select the appropriate format, a new window will appear asking us to select the file and set the options needed to read it properly. Finally, my suggestion is to copy the code that appears in the Console and paste it into the Source file. This way we will be able to reload the file at any time without wasting time on windows and clicks.
We have seen in the Installation chapter how to install packages; however, in order to avoid overloading our computers, R activates only a small set of default packages when we open it. This means that before using any content (function, data, object) of an additional package, we need to activate it using the function library(). Once the package has been activated, it remains active for the whole work session, so there is no need to call it again.
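The effect of library() can be observed with search(), which lists the packages currently active. The sketch below uses the tools package only because it ships with every R installation; any other installed package would work the same way.

```r
search()                         # packages active in this session

library(tools)                   # activate an additional package
"package:tools" %in% search()    # TRUE: it is now active

# once the package is active, its functions can be called directly
file_ext("Wine.csv")

# a single function can also be called without activating the package,
# using the package::function notation
tools::file_path_sans_ext("Village.xlsx")
```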
During my teaching career, I have seen many students call a huge list of packages at the beginning of their code (most of them useless for their purposes). This habit spared them from remembering the purpose of each package, but it also overloaded their computers and often led to software crashes and data losses. So, please, avoid it!
Download the wine dataset, replace the directory below with the folder where you put the downloaded file, and try to import it (it is a text-based csv file). Try the same with the village dataset; be aware that this one is in Excel format.
Code
# setting the working directory
setwd("/Users/federicoroscioli/Desktop")
# or setting the working directory as the same where the Source file is stored
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
# for text based datasets (csv)
wine <- read.csv("Wine.csv")
# for excel datasets you need the readxl package
library(readxl)
village <- read_excel("Village.xlsx")
To end our first chapter, we also have to talk about how to save the final output of our work. The most common way is to export our data frame as an Excel file. In order to do so, we need the package “xlsx” (Dragulescu and Arendt 2020). Note that it is always important to set the working directory in order to tell R where to put our exported file. If we did so at the beginning of our code, there is no need to repeat it.
Code
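A minimal sketch of an export is shown below. It assumes the people data frame from earlier in the chapter, and the file names are illustrative. The xlsx package needs to be installed (and requires Java), so the Excel export is guarded here; the csv export with base R is added only as an alternative that needs no extra package.

```r
people <- data.frame(name = c("Mary", "Mike", "Greg"),
                     age = c(44, 52, 46),
                     IQ = c(160, 95, 110))

# Excel export with the xlsx package, run only if the package is installed
if (requireNamespace("xlsx", quietly = TRUE)) {
  xlsx::write.xlsx(people, "people.xlsx", row.names = FALSE)
}

# text based alternative with base R
write.csv(people, "people.csv", row.names = FALSE)
```

Both files are written to the current working directory; row.names = FALSE avoids adding an extra column with the row numbers.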
2.7 Exercises
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (Vol. 103). Springer New York. Available here. Chapter 2.4 exercises 8, 9, and 10.
- R playground, A, B, C
Bibliography
[1] If you don’t know what it means, please review the Installation chapter.
[2] For a complete list, please check the references of this chapter.
[3] This is not applicable to mtcars, as it is already a data frame, but we will see this function in action later on in the book.
[4] This is equivalent to subsetting by the index number of the same column (see the code below).