2 A, B, C

 

 

 

2.1 The first code

We can start exploring the great potential of R by typing some basic mathematical operations. My suggestion is to always work on a source file¹ and run your commands line by line using the “Run” button (or Control+Enter on Windows, Command+Enter on Mac).

Code
1 + 1
2 * 5

Now we can create some “objects”. This means storing numbers or series of numbers in the memory of our computer (the Environment), so that we can use these objects in our analysis. We can also create a vector by assigning more than one value to an object. For this purpose, we need the combine function c(), placing all the values between the parentheses, separated by commas. If we want a vector made of an ordered series of integers, there is a shortcut: we can put the colon symbol between the starting and the ending number. We can also perform operations between values and objects. Remember that if we assign new content to an existing object, we will lose the original content. In R there is no “undo” button.

Below we can see the syntax used for this purpose. As you can see, the code contains some comments. Comments are a great tool for explaining what the code does, so that other people (and our future selves) can understand it better. When we place one (or more) hash signs (#) in the code, everything that follows on that line becomes a comment and R will skip it when running.

Code
# = and <- are the same, but I prefer <- 
x = 3
x <- 5

# with the == sign you ask R if a condition is true
x == 7

# single value assignment
x <- "hello"

x <- 2 * 5
x

# vector assignment
y <- c(3, 4, 2, 4, 5, 1)

y == 4

# character vector
t <- c("hello", "how", "are you", "?")
t

# ordered integer series (the colon already returns a vector, so c() is optional)
y <- 1:10

The notations <- and = are equivalent when assigning values to an object; however, my suggestion is to always use the former, in order to avoid confusion with the equality test (==). As you can see, when we store something in the Environment, R does not show it to us. We need to “call” the object by its name in order to see what is inside it.
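
As a small illustration of the two points above (assignment is silent, and re-assigning overwrites), here is a minimal sketch. The object name score is hypothetical and used only for this example.

```r
# 'score' is a hypothetical object name, used only for illustration
score <- 10         # stored in the Environment, nothing is printed
score               # calling the object prints its content

score <- score + 5  # the old value (10) is overwritten and cannot be recovered
score

# wrapping an assignment in parentheses stores AND prints in one step
(score <- c(1, 2, 3))
```

The parentheses trick is only a convenience; calling the object on a separate line, as done throughout this chapter, gives the same result.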

Read the code below by yourself, and guess the result before running the code.

Code
# operations
x * 5

# what will happen here?
x * y
x - y
y - x
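
Once you have made your guess, it may help to know that R applies these operations element by element. Below is a minimal sketch with the hypothetical objects k and v (they are not part of the chapter's code), including what R does when two vectors have different lengths.

```r
# hypothetical objects, used only for this illustration
k <- 2
v <- c(10, 20, 30)

k * v   # the single value is reused for every element of v
v - k
k - v   # the order matters for subtraction

# two vectors of the same length are combined position by position
v + c(1, 2, 3)

# a shorter vector is recycled to match the longer one
recycled <- c(1, 2, 3, 4) * c(10, 100)
recycled
```

Recycling is convenient, but it can hide mistakes, so it is worth checking the length of the vectors involved when a result looks odd.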

 

 

 

2.2 Indexing

The indexing system allows us to select a specific part of our data. To do so, we first call the name of the object we want to search (y), and then put the search instruction between square brackets [ ] right after it. For example, we can select the first, second and fourth values with y[c(1,2,4)], or all the values bigger than 3 with y[y>3]; we can combine multiple conditions, as in y[y>3 & y<10]; and so on. Another common and useful instruction selects all the values in one object \(y\) that are also present in another object \(z\): y[y %in% z].²

Try the following code by yourself and guess the result before running it.

Code
y <- c(30:70)
# partial selection
y[c(1,3,15)]
y[c(1:3,5)]
z <- y[1:4]
# subtractive selection
y[-4]
y[-c(1:3,5)]

# assigning values only to a subset of observations
y[c(1,3,15)] <- 3
y

# equality selection
y[y==3]
y[y==3 & y==45]
y[y==3 | y==45]

z <- c(30:50)
y[y %in% z]
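
It may help to look at what sits inside the brackets: a condition such as v > 3 is itself a logical vector, and R keeps the positions where it is TRUE. Below is a minimal sketch with the hypothetical vectors v and w; the function which() is not used elsewhere in this chapter and is shown only as a complement.

```r
# hypothetical vectors, used only for this illustration
v <- c(5, 1, 8, 3, 12, 7)
w <- c(1, 7, 100)

v > 3               # a logical vector, one TRUE/FALSE per element
v[v > 3]            # keeps only the positions that are TRUE

v[v > 3 & v < 10]   # both conditions must hold
v[v < 3 | v > 10]   # at least one condition must hold

v %in% w            # is each element of v present in w?
v[v %in% w]

which(v > 3)        # the positions (not the values) that satisfy the condition
```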

I recognize that the indexing system is not intuitive and may be confusing at first. My suggestion is to practice a lot with it. Don’t be shy about creating objects of all sorts and selecting whatever you want. Later in this chapter and in Advanced Data Manipulation and Plotting we will see two friendlier methods for subsetting (the subset() function and the package dplyr), but, please, do not underestimate the power of the indexing system.

Now that we have mastered numeric vectors, it is time to learn that R can manage many different types of data. More precisely, a vector can be character, numeric (real or decimal), integer, logical, or complex. The most used types are, however, character (like the vector called t in our first code chunk), numeric and logical (TRUE or FALSE). Another interesting type is the factor, which we will see in action in Data Cleaning. Be aware that if a vector is composed of both characters and numbers, it will automatically be considered a character vector! The same applies if our numbers use a comma instead of a dot as the decimal separator!
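
The two warnings above can be checked with a minimal sketch. The vector names are hypothetical, and the repair through gsub() and as.numeric() is only one possible approach, not a method prescribed by this chapter.

```r
# hypothetical vectors, used only for this illustration
mixed <- c(1, 2, "three")
class(mixed)               # "character": the numbers became text

# a decimal comma makes R treat the value as text
prices <- c("3,5", "4,2")
class(prices)

# one possible repair: swap the comma for a dot, then convert to numeric
prices_num <- as.numeric(gsub(",", ".", prices, fixed = TRUE))
prices_num
class(prices_num)

# logical values are stored as TRUE/FALSE but count as 1/0 in arithmetic
flags <- c(TRUE, FALSE, TRUE)
sum(flags)
```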

 

   

2.3 The first function

The first R function of our life is mean(). Before running it, I suggest exploring the Help documentation for this function by typing ?mean in the Console. The help page, which will appear on the right side of our workspace, contains all the instructions for using the function. The “options” specified within the round parentheses () are called arguments. Some of them have a default value, while others are required in order to run the function. In this case only x is required: the object of which we want to compute the mean. The help provides a thorough explanation of how each function works, and usually cites some references and examples. Most importantly, the help is available for any function or package and, in nearly all cases, it will provide you with all the above-mentioned information.

Functions work by calling their name (be aware that R is case sensitive) and putting the required arguments between round parentheses.
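
To make the idea of arguments concrete, here is a minimal sketch that uses round(), a base R function chosen only as an illustration (it is not discussed elsewhere in this chapter). It shows arguments passed by position and by name, a default value, and the effect of case sensitivity.

```r
# arguments can be matched by position or by name
round(3.14159, 2)
round(3.14159, digits = 2)   # same result, but easier to read

# 'digits' has a default value (0), so it can be omitted
round(3.14159)

# R is case sensitive: Round() is not defined, so the call fails.
# try() prints the error and lets the script continue.
try(Round(3.14159))
```

Writing the argument names explicitly takes a few more keystrokes, but it makes the code easier to read when we come back to it later.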

Try the following code by yourself and try to guess the result.

Code
mean(y)
mean(z)

# what will happen here?
mean(y[1:4])

# calculate the mean manually
sum(y)/length(y)

p <- c(1, NA, 2, 4, NA, 7, 9, 30, NA)
mean(p)
# why is it NA?

mean(p, na.rm = TRUE)
# check in the help what na.rm does!
mean(p[!is.na(p)])

# we can see how many values p has, and how many of them are NA (Please note that 
# for NA we do not use the normal equality expressions)
length(p)
length(p[is.na(p)])

# the function class tells us which is the type of data in the vector
class(y)
class(t)

What happens when we have to compute the mean of a vector with some missing observations (NA)? Missing observations are not zeros, so the sum cannot be computed and the result is NA. If we add the na.rm = TRUE argument to our mean function, R strips the missing observations before applying the function, and gives us a result. Be aware that, after stripping the missing observations, the vector has a smaller length; this may have implications if, for example, we want a per-capita average.
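
The per-capita caveat can be illustrated with a minimal sketch. The income vector below is invented for this example: it represents five household members, two of whom did not report their income.

```r
# hypothetical data: income of 5 household members, 2 of them not reported
income <- c(1000, NA, 500, NA, 1500)

sum(income)                  # NA: the total is unknown
sum(income, na.rm = TRUE)    # 3000, computed on the 3 reported values

# mean over the reported values only (3000 / 3)
mean(income, na.rm = TRUE)

# "per-capita" over all 5 members (3000 / 5): a different quantity,
# which implicitly treats the missing incomes as zero
sum(income, na.rm = TRUE) / length(income)

# number of non-missing observations actually used by mean()
sum(!is.na(income))
```

Which of the two averages is appropriate depends on the question we are asking, so it is worth deciding this explicitly rather than relying on na.rm alone.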

 

   

2.4 Dataset Exploration

A dataset is a matrix of size \(r \times c\), with \(r\) rows representing the observations and \(c\) columns representing the variables. In other words, a dataset is composed of \(r\) horizontal vectors, or \(c\) vertical vectors. With this in mind, we can now load a dataset and start exploring its dimensions.

For this purpose we will use a dataset built into R called mtcars. In order to better understand what mtcars refers to, type ?mtcars in the Console and read carefully what the help says about it.

I know that working with data about the design and performance of cars from 1973-74 is not your favorite hobby, but this R dataset is clean, ready to use, and comes with excellent documentation, so it is perfect for our learning purposes.

The following code shows some exploratory steps that should be carried out whenever we get our hands on new data, in order to understand its shape.

Code
# loading an internal dataset
data("mtcars")

# structure of the dataset
str(mtcars)

# visualizing the first 6 rows and the last 6
head(mtcars)
tail(mtcars)

# variables names
colnames(mtcars)

# number of rows and number of columns
nrow(mtcars)
ncol(mtcars)
dim(mtcars)

We first explore the internal structure of the dataset (str(mtcars)), meaning: which kind of dataset it is (yes, there are multiple types), the number of observations (rows) and variables (columns), and the name and type of each variable with a small preview. A dataset could be a “matrix”, a “data frame”, a “tibble”, etc. I will not go through the details of each type; my suggestion is to always use a data frame, as this is one of the most flexible and complete forms of two-dimensional data. To convert any kind of dataset to a data frame, use the as.data.frame() function.³
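
As a minimal sketch of such a conversion, the example below builds a small matrix and turns it into a data frame. The object names m and m_df and the column names are hypothetical, invented only for this illustration.

```r
# a hypothetical 3 x 2 numeric matrix, used only for this illustration
m <- matrix(1:6, nrow = 3, ncol = 2)
colnames(m) <- c("height", "weight")
class(m)

# convert it to a data frame
m_df <- as.data.frame(m)
class(m_df)
str(m_df)

# the column names and the dimensions are preserved
colnames(m_df)
dim(m_df)
```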

Another important skill to acquire is the ability to create a dataset from scratch. Again, there are many ways to do it. Below are two ways to create the same dataset.

Code
# by specifying it columnwise
people <- data.frame(name = c("Mary", "Mike", "Greg"),
                     age = c(44, 52, 46),
                     IQ = c(160, 95, 110))

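# by creating some vectors and then joining them together column-wise
# (data.frame() keeps the type of each vector; wrapping the vectors in cbind()
# first would turn age and IQ into text)
name <- c("Mary", "Mike", "Greg")
age <- c(44, 52, 46)
IQ <- c(160, 95, 110)
people2 <- data.frame(name, age, IQ)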
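
A word of caution on combining cbind() with data.frame(): cbind() first builds a matrix, and a matrix can hold only one type of data, so mixing text and numbers turns every column into text. Below is a minimal sketch that repeats the three vectors so that it runs on its own; the object names via_cbind and direct are hypothetical.

```r
name <- c("Mary", "Mike", "Greg")
age <- c(44, 52, 46)
IQ <- c(160, 95, 110)

# hypothetical object names, used only for this comparison
via_cbind <- data.frame(cbind(name, age, IQ))  # goes through a character matrix
direct <- data.frame(name, age, IQ)            # keeps the type of each vector

# age and IQ are stored as text (character, or factor in R versions before 4.0)
str(via_cbind)

# age and IQ are numeric
str(direct)
```

Checking the result with str() right after creating a dataset is a cheap way to catch this kind of silent conversion.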

 

 

 

2.5 Subsetting

Now that we can create complex objects, we need to be able to “destroy” them, meaning to split them into subsets according to some characteristics of interest (James et al. 2021). Subsetting is the heart of data manipulation and the basis for data analysis. Via subsetting we can analyze the data of only one group of observations within our dataset, selected according to characteristics we define. As an example, we can compare the consumption (mpg) of manual cars with that of automatic cars.

To do so, we will use the same indexing expressions we used in the Indexing section, but this time we have to specify two dimensions (rows and columns), separated by a comma within the square brackets: [rows,columns]. Remember that if we leave either rows or columns unspecified, R will take the whole set of the corresponding dimension.

Explore the code below. Use the str() function, specifying the new object created in each line of code, in order to understand what happened.

Code
# to subset we use [rows,columns]
a <- mtcars[2, 5]
str(a)

# what will happen here?
b <- mtcars[3,]
str(b)

# and here? (the object is named d rather than c, to avoid confusion with the c() function)
d <- mtcars[,3]
str(d)

# we want to see the first 5 rows and columns 3 to 8
mtcars[1:5, 3:8]
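
One detail that str() reveals in the code above is that selecting a single column returns a plain vector, while selecting a single row returns a data frame. Below is a minimal sketch of how to control this behaviour with the drop argument of the square brackets; the object names col_vec and col_df are hypothetical.

```r
# selecting one column with [rows, columns] returns a plain vector...
col_vec <- mtcars[, 3]
class(col_vec)
is.data.frame(col_vec)

# ...unless we ask R not to drop the second dimension
col_df <- mtcars[, 3, drop = FALSE]
class(col_df)
dim(col_df)

# columns can also be selected by name instead of position
mtcars[1:5, c("mpg", "hp")]

# a single row, in contrast, stays a data frame
is.data.frame(mtcars[3, ])
```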

If we want to extract one single column from a dataset, we can use the $ sign.⁴ Combined with indexing, this allows us to specify the characteristics that we want to retain in our dataset.

Code
# to call a single variable we use the $ sign
mtcars$cyl
# it is the same as
mtcars[,2]

# equality subsetting: we want all the data of the cars with 4 cylinders
mtcars[mtcars$cyl == 4,]

# we want the values of the first column (mpg) of cars with 4 cylinders
mtcars[mtcars$cyl == 4, 1]
# it is the same as
mtcars$mpg[mtcars$cyl == 4] # I prefer this

# subsetting consumption per type of transmission
# automatic cars
mpg.at <- mtcars$mpg[mtcars$am == 0]
# manual cars
mpg.mt <- mtcars$mpg[mtcars$am == 1]
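
With the two subsets in hand, the comparison mentioned at the beginning of this section takes a couple of lines. The sketch below recreates both objects so that it runs on its own; keep in mind that mpg measures miles per gallon, so a higher value means lower fuel consumption.

```r
# self-contained: recreate the two subsets from the built-in mtcars data
mpg.at <- mtcars$mpg[mtcars$am == 0]   # automatic
mpg.mt <- mtcars$mpg[mtcars$am == 1]   # manual

# how many cars are in each group?
length(mpg.at)
length(mpg.mt)

# average miles per gallon in each group
mean(mpg.at)
mean(mpg.mt)

# difference between the two group means (manual minus automatic)
mean(mpg.mt) - mean(mpg.at)
```

This is only a descriptive comparison of two averages; it says nothing yet about whether the difference is statistically meaningful.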

There is also an easier (more readable) way: the subset() function. This function takes 3 main arguments: the data frame we want to subset, the condition that the rows must satisfy, and the columns we want returned (select). The argument drop=TRUE allows us to drop the data frame structure when a single column is selected and obtain a vector as the final output. Of course we can input multiple conditions and select multiple columns.

The code below leads to the same result as in the last lines of the previous code chunk.

Code
# subsetting consumption per type of transmission
mpg.at2 <- subset(mtcars, am == 0, select = "mpg", drop=TRUE)
mpg.mt2 <- subset(mtcars, am == 1, select = "mpg", drop=TRUE)

# multiple conditions and columns
mix <- subset(mtcars, am==0 & cyl==4, select=c("mpg", "hp"))
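
If you want to verify that the two approaches really return the same thing, identical() is a handy check. The sketch below is self-contained; the objects mpg.at3 and mix2 are hypothetical names added only for the comparison.

```r
# bracket version
mpg.at <- mtcars$mpg[mtcars$am == 0]
# subset() version
mpg.at2 <- subset(mtcars, am == 0, select = "mpg", drop = TRUE)

# identical() returns TRUE when two objects are exactly the same
identical(mpg.at, mpg.at2)

# without drop = TRUE the result stays a one-column data frame
mpg.at3 <- subset(mtcars, am == 0, select = "mpg")
class(mpg.at3)
dim(mpg.at3)

# multiple conditions written with brackets, for comparison
mix2 <- mtcars[mtcars$am == 0 & mtcars$cyl == 4, c("mpg", "hp")]
mix2
```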

 

 

 

2.6 Importing and exporting data

Of course no one expects us to work only with internal datasets, nor to keep the data only on our computer, so we need to know how to import and export files containing data. While R is a very powerful tool for cleaning data, it is important that the data we import follow at least the basic rule of one observation per row and one variable per column (meaning that there should not be variables such as “income2021”, “income2020”, etc., but rather one column for the year and one for the income). More rules on how to clean the data are available in Data Cleaning.
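
To visualize the rule, here is a minimal sketch with invented data. The data frame wide breaks the rule (one income column per year), while long follows it. The manual rearrangement shown here is just one simple way to do it with the tools seen so far, not the method used in Data Cleaning.

```r
# hypothetical "wide" data: one column per year (it breaks the rule)
wide <- data.frame(id = c(1, 2, 3),
                   income2020 = c(100, 200, 300),
                   income2021 = c(110, 210, 310))

# the same information with one column for the year and one for the income
long <- data.frame(id = rep(wide$id, times = 2),
                   year = rep(c(2020, 2021), each = nrow(wide)),
                   income = c(wide$income2020, wide$income2021))
long

# each row is now one observation: one person in one year
nrow(long)
```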

The first thing to do when working with external files is to set the working directory within the code (function setwd()). This tells R where to look for files and where to put new ones. If we have to share our code with someone else, that person will only have to change the working directory path before running our code successfully.

In order to import external data into R, we can go to File -> Import Dataset and select the format of the dataset to import. Base R reads text-based formats (such as csv or txt) by itself, while this menu relies on packages to read Excel, Stata, SAS and SPSS files (Wickham and Bryan 2022); with further packages we can extend the importing capacity to spatial data or other formats. After we select the appropriate format, a new window will appear asking us to select the file and set the options needed to read it properly. Finally, my suggestion is to copy the code that appears in the Console and paste it into the Source file. This way we will be able to reload the file anytime without wasting time on windows and clicks.

We have seen in the Installation chapter how to install packages. However, in order to avoid overloading our computers, R activates only a small set of default packages when we open it. This means that before using any content (function, data, object) of an additional package, we need to activate it using the function library(). Once the package has been activated, it remains active for the whole work session, so there is no need to call it again.
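
A minimal sketch of these mechanics follows. It uses the stats package, which ships with R, so that the code runs on any installation; search(), requireNamespace() and the package::function notation are base R tools not covered elsewhere in this chapter and are shown here only as complements to library().

```r
# packages that are active in the current session
search()

# check whether a package is installed (this does not activate it);
# readxl is used here only because it appears later in this chapter
has_readxl <- requireNamespace("readxl", quietly = TRUE)
has_readxl

# a single function can be called with package::function,
# without activating the whole package
stats::median(c(1, 5, 9))

# library() activates the package for the rest of the session;
# 'stats' ships with R, so this line works on any installation
library(stats)
median(c(1, 5, 9))
```

The package::function notation is handy when we need just one function from a package, since it keeps the list of active packages short.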

During my teaching career, I have seen many students calling a huge list of packages at the beginning of their code (most of them useless for their purposes). This habit spared them from remembering the purpose of each package, but it also overloaded their computers and often led to software crashes and data losses. So, please, avoid it!

Download the wine dataset, replace the directory below with the folder where you put the downloaded file, and try to import it (it is a text-based csv file). Try the same with the village dataset; be aware that this one is in Excel format.

Code
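# setting the working directory (replace this path with a folder on your computer)
setwd("/Users/federicoroscioli/Desktop")
# or setting the working directory to the folder where the Source file is stored
# (this works only inside RStudio, with the rstudioapi package installed)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))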

# for text based datasets (csv)
wine <- read.csv("Wine.csv")

# for excel datasets you need the readxl package
library(readxl)
village <- read_excel("Village.xlsx")

To end our first chapter, we also have to talk about how to save the final output of our work. The most common way is to export our data frame as an Excel file. In order to do so, we need the package “xlsx” (Dragulescu and Arendt 2020). Note that it is always important to set the working directory, in order to tell R where to put the exported file. If we did it at the beginning of our code, there is no need to repeat it.

Code
# exporting to excel
library("xlsx")
write.xlsx(mtcars, file = "mtcars.xlsx")

# exporting to csv
write.csv(mtcars, file = "mtcars.csv")
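
A quick way to check an export is to read the file back in. The sketch below does this with a temporary file created by tempfile(), so it does not depend on the working directory; the object names tmp and back are hypothetical. It also shows how the row names (the car models) travel through a csv file.

```r
# write to a temporary file so that nothing is left on the computer
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, file = tmp)

# by default write.csv() stores the row names (the car models) as a first,
# unnamed column; row.names = 1 tells read.csv() to turn it back into row names
back <- read.csv(tmp, row.names = 1)

dim(back)
head(rownames(back))

# the re-imported data has the same shape and names as the original
identical(dim(back), dim(mtcars))
identical(colnames(back), colnames(mtcars))

# row.names = FALSE skips the row names when they carry no information
write.csv(mtcars, file = tmp, row.names = FALSE)
unlink(tmp)   # delete the temporary file
```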

 

 

 

2.7 Exercises

  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning (Vol. 103). Springer New York. Available here. Chapter 2.4 exercises 8, 9, and 10.
  • R playground, A, B, C

Bibliography

Dragulescu, Adrian, and Cole Arendt. 2020. “Package ‘Xlsx’.” https://cran.r-project.org/web/packages/xlsx/xlsx.pdf.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Second edition. Springer Texts in Statistics. Springer.
Wickham, Hadley, and Jennifer Bryan. 2022. “Readxl: Read Excel Files.” https://readxl.tidyverse.org.

  1. If you don’t know what it means, please review the Installation chapter.

  2. For a complete list of please check within the references of this chapter.

  3. This is not applicable to mtcars, as it is already a data frame, but we will see this function in action later on in the book.

  4. This is equivalent to subsetting with the index number of the same column (see the code below).