3 Data Cleaning
Happy families are all alike; every unhappy family is unhappy in its own way.
Lev Tolstoy
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. Data cleaning is not just one step, but must be repeated many times over the course of analysis as new problems arise or new data is collected. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together.
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Most statistical datasets are rectangular tables made up of rows and columns. The columns are almost always labeled and the rows are sometimes labeled. Values are organized in two ways. Every value belongs to a variable (column) and an observation (row). A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
In tidy data (Wickham 2014):
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. They are also particularly well suited for vectorized programming languages like R, because the layout ensures that values of different variables from the same observation are always paired. While the order of variables and observations does not affect analysis, a good ordering makes it easier to scan the raw values.
The five most common problems with messy datasets are:
Column names are values or codes and not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table (i.e. people and household).
A single observational unit is stored in multiple tables.
Variables types are miss-specified due to inputational errors (comma before decimals, or written comments within numeric columns).
3.1 Variable Names
Working with external code or data, I often see variable names or object names such as \(X, y, xs, x1, x2, tp, tn, clf, reg, xi, yi, ii\). This creates a problem when we need to analyze a specific variable, or understand the results we get. In fact, we will have to go back to the code-book to understand the content of each variable. To put it frankly, people (myself included) are terrible at naming objects, especially when we need to invent a lot of names, but a bigger effort is needed (Koehrsen 2019).
There are three fundamental ideas to keep in mind when naming variables:
The variable name must describe the information represented by the variable. A variable name should tell us specifically in words what the variable stands for.
Our code will be read by others or us in the future. So we should prioritize how easy it is understood, rather than how quickly it is written.
Adopt standard conventions for naming so we can make one global decision instead of multiple local decisions throughout the process.
In order to change the names of the variables we can use the following
code. Basically the function colnames()
retrieves the names of the
variables as a character vector and we can change it as we have seen in
Indexing.
Code
3.2 Variable Types
Variable types are important because they tell R how to interpret a variable. As an example, we never want R to treat a numeric variable as a character one because we would not be able to use that variable for calculations.
Fortunately, when we import a dataset R-Studio automatically recognizes
the type of each variable. However, sometimes there are some inputation
mistakes that the software cannot overcome by itself. I refer to cases
such as a number with a comma instead of a dot before decimals, or a
cell with some comments in a numeric variable, etc. In all these cases R
will read the variable as a character variable. The function str()
,
allows us to recognize the possible problem by listing all the variables
and their types. If we see a missclassified variable we can immediately
explore it via summary()
and/or table()
, find the error, and correct
it. Finally we will need to tell R which is the right type for that
variable, if we don’t get any error, it means that we fixed it (look at
the example below).
Code
# creating an age vector
age <- c(22, 12, "didn't answer", 30)
# we need it numeric, but it is character (chr)
str(age)
# we find the problem: there is a comment!
table(age)
# we change the value where there was the comment and input a missing value
age[age=="didn't answer"] <- NA
# we change the vector type to numeric
age <- as.numeric(age)
Remember that all data types in R have an apposite function that
converts the data to its type. To list the most used: as.character()
,
as.numeric()
, as.factor()
, as.data.frame()
, etc…
3.2.1 Factor variables
In some cases (especially when plotting data), it is important that factor variables are recognized by R and treated accordingly. Factor variables are dichotomous (female and male, or 1 and 0, or yes and no) or categorical (hair color, rooftop type, vehicle type, etc…).
A dummy variable is a variable that indicates whether an observation has a particular characteristic. A dummy variable can only assume the values 0 and 1, where 0 indicates the absence of the property, and 1 indicates the presence of the same. The values 0/1 can be seen as no/yes or off/on.
The following code creates a character variable composed by ten random
“yes” and “no”. It, then, turns the variable into a factor variable and
specifies the levels. By comparing the summary of the variables, we see
immediately how differently R behaves when it is dealing with a factor
as compared to a character variable. The relevel()
function reorders
the variable as we want.
Code
3.2.2 Dates and times
With R we can do calculations using dates and times too. This may sound
wired, but sometimes we need to subtract 180 days to a date variable and
without this functionality this would be quite a complex task. Dates,
however, are not automatically recognized by the software and there is
the need for an external package called lubridate()
(Yarberry 2021).
Following a small example. The functions today()
and now()
retrieve
the current date in two different levels of detail. The function ymd()
transforms a numeric or character variables containing year, month, and
day into a date variable. Note that the lubridate()
package has a long
list of functions according to the format of the date needed (check the
Lubridate cheatsheet).
Code
3.3 Row Names
Row names are often not important. In most of the cases we will have a dataset with a column representing the id of the observation. However, when it comes to some Multivariate Analysis methods, we will need to have the id of the observation not in one column, but as row name. In this way R will be able to plot the data automatically referring to the name of the row.
The code below gives us a way to copy the first column of a dataset to the row names and then delete the column itself, in order to have a fully numeric dataset. We will use the same dataset we created in Dataset Exploration.