3 Data Cleaning

Happy families are all alike; every unhappy family is unhappy in its own way.
Lev Tolstoy

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data cleaning is not just one step, but must be repeated many times over the course of analysis as new problems arise or new data is collected. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together.

A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Most statistical datasets are rectangular tables made up of rows and columns. The columns are almost always labeled and the rows are sometimes labeled. Values are organized in two ways. Every value belongs to a variable (column) and an observation (row). A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.

In tidy data (Wickham 2014):

1. Each variable forms a column.

2. Each observation forms a row.

3. Each type of observational unit forms a table.

Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. They are also particularly well suited for vectorized programming languages like R, because the layout ensures that values of different variables from the same observation are always paired. While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values.

The five most common problems with messy datasets are:

Column names are values or codes and not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table (i.e. people and household).
A single observational unit is stored in multiple tables.
Variables types are miss-specified due to inputational errors (comma before decimals, or written comments within numeric columns).

3.1 Variable Names

Working with external code or data, I often see variable names or object names such as \(X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, ii\). This creates a problem when we need to analyze a specific variable, or understand the results we get. In fact, we will have to go back to the code-book to understand the content of each variable. To put it frankly, people (myself included) are terrible at naming objects, especially when we need to invent a lot of names, but a bigger effort is needed (Koehrsen 2019).

There are three fundamental ideas to keep in mind when naming variables:

The variable name must describe the information represented by the variable. A variable name should tell us specifically in words what the variable stands for.
Our code will be read by others or us in the future. So we should prioritize how easy it is understood, rather than how quickly it is written.
Adopt standard conventions for naming so we can make one global decision instead of multiple local decisions throughout the process.

In order to change the names of the variables we can use the following code. Basically the function colnames() retrieves the names of the variables as a character vector and we can change it as we have seen in Indexing.

Code

# retreiving variable names
colnames(mtcars)

# changing the name of the variable mpg to consumption
colnames(mtcars)[1] <- "consumption"

3.2 Variable Types

Variable types are important because they tell R how to interpret a variable. As an example, we never want R to treat a numeric variable as a character one because we would not be able to use that variable for calculations.

Fortunately, when we import a dataset R-Studio automatically recognizes the type of each variable. However, sometimes there are some inputation mistakes that the software cannot overcome by itself. I refer to cases such as a number with a comma instead of a dot before decimals, or a cell with some comments in a numeric variable, etc. In all these cases R will read the variable as a character variable. The function str(), allows us to recognize the possible problem by listing all the variables and their types. If we see a missclassified variable we can immediately explore it via summary() and/or table(), find the error, and correct it. Finally we will need to tell R which is the right type for that variable, if we don’t get any error, it means that we fixed it (look at the example below).

Code

# creating an age vector
age <- c(22, 12, "didn't answer", 30)

# we need it numeric, but it is character (chr)
str(age)

# we find the problem: there is a comment!
table(age)

# we change the value where there was the comment and input a missing value
age[age=="didn't answer"] <- NA

# we change the vector type to numeric
age <- as.numeric(age)

Remember that all data types in R have an apposite function that converts the data to its type. To list the most used: as.character(), as.numeric(), as.factor(), as.data.frame(), etc…

3.2.1 Factor variables

In some cases (especially when plotting data), it is important that factor variables are recognized by R and treated accordingly. Factor variables are dichotomous (female and male, or 1 and 0, or yes and no) or categorical (hair color, rooftop type, vehicle type, etc…).

A dummy variable is a variable that indicates whether an observation has a particular characteristic. A dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the property, and 1 indicates the presence of the same. The values 0/1 can be seen as no/yes or off/on.

The following code creates a character variable composed by ten random “yes” and “no”. It, then, turns the variable into a factor variable and specifies the levels. By comparing the summary of the variables, we see immediately how differently R behaves when it is dealing with a factor as compared to a character variable. The relevel() function reorders the variable as we want.

Code

yesno <- sample(c("yes","no"), size=10, replace=TRUE)
str(yesno)
summary(yesno)

# transform the variable in factor
yesnofac = factor(yesno, levels=c("yes","no"))
str(yesnofac)
summary(yesnofac)
levels(yesnofac)

# relevel the variable
relevel(yesnofac, ref="no")
as.numeric(yesnofac)

3.2.2 Dates and times

With R we can do calculations using dates and times too. This may sound wired, but sometimes we need to subtract 180 days to a date variable and without this functionality this would be quite a complex task. Dates, however, are not automatically recognized by the software and there is the need for an external package called lubridate()(Yarberry 2021).

Following a small example. The functions today() and now() retrieve the current date in two different levels of detail. The function ymd() transforms a numeric or character variables containing year, month, and day into a date variable. Note that the lubridate() package has a long list of functions according to the format of the date needed (check the Lubridate cheatsheet).

Code

library(lubridate)
# today's date
today()
now()

# creating a numeric vector of dates with year, month and day
x <- c(20220321, 20220322, 20220320)
str(x)

# transforming it in a date type vector 
y <- ymd(x)
str(y)

# subtracting 180 days to y and x
x-180
y-180

3.3 Row Names

Row names are often not important. In most of the cases we will have a dataset with a column representing the id of the observation. However, when it comes to some Multivariate Analysis methods, we will need to have the id of the observation not in one column, but as row name. In this way R will be able to plot the data automatically referring to the name of the row.

The code below gives us a way to copy the first column of a dataset to the row names and then delete the column itself, in order to have a fully numeric dataset. We will use the same dataset we created in Dataset Exploration.

Code

# creatng the dataset
people <- data.frame(name = c("Mary", "Mike", "Greg"),
                     age = c(44, 52, 46),
                     IQ = c(160, 95, 110))

# inputting rownames from the column "name"
rownames(people) <- people$name

# deleting the clumn "name"
people <- people[,-1]

Bibliography

Koehrsen, Will. 2019. “Data Scientists: Your Variable Names Are Awful. Here’s How to Fix Them.” https://towardsdatascience.com/data-scientists-your-variable-names-are-awful-heres-how-to-fix-them-89053d2855be.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

Yarberry, William. 2021. “Lubridate: Date and Time Processing.” In CRAN Recipes, 109–60. Apress. https://doi.org/10.1007/978-1-4842-6876-6_3.