# 5 Exploratory Data Analysis

From this point forward, we will stop focusing on R, and we will finally start acting as statisticians. We will perform various levels of data analysis by applying different methodologies. Of course, R will be our tool, our facilitator, but it will no longer be at center stage. We will fully concentrate on the statistical methods, their application, and their interpretation.

Following the six steps of data analysis, this section explores steps 4 to 6, given that the first two are out of the scope of this book and the third was treated in the previous section.

Exploratory data analysis is usually performed when we have our first contact with __clean__ new data. Of course, the first impression is not always the right one, but if we give it a proper chance it can save us a lot of work and some headaches later on. During this phase we will compute some summary statistics (central tendency and variability measures), basic plots, inequality measures, and correlations. Exploratory analysis differs from the dataset exploration we performed in Chapter 1, because the latter focuses on the shape of the data (type, dimensions, etc.) while the former focuses on the content.

## 5.1 Central Tendency Measures

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The mean, median and mode are all valid measures of central tendency, but under different conditions, some measures of central tendency become more appropriate to use than others (Lane et al. 2003; Manikandan 2011; Laerd Statistics, n.d.).

The **arithmetic** **mean** (or **average**) is the most popular and
well known measure of central tendency. The mean is equal to the sum of
all values divided by the number of observations (Equation
(5.1)).

\[\begin{equation} \bar x = \frac{\sum{x}}{n} \tag{5.1} \end{equation}\]

One of its important properties is that it minimizes error in the prediction of any value in your data set, given a big enough sample size (more details on this property will be discussed in [The Normal Distribution] chapter). Another important property of the mean is that it includes every value in your data set in the calculation. This also implies that it is sensitive to the presence of outliers or skewed distributions (i.e., when the frequency distribution of our data is skewed). In those cases the median is a better indicator of central location.

The **geometric mean** is defined as the \(n^{th}\) root of the product of
\(n\) values (Equation (5.2)).

\[\begin{equation} \bar x_{g} = \sqrt[n]{x_1 \cdot x_2 \cdot \ldots \cdot x_n} \tag{5.2} \end{equation}\]

The geometric mean cannot be used if any of the values considered are less than or equal to zero. It is often used for a set of numbers whose values are meant to be multiplied together or are exponential in nature, such as a set of growth figures: values of the human population, or interest rates of a financial investment over time.

The **median** is the middle score for a set of data that has been arranged in order of magnitude. This means, also, that the median is equal to the value corresponding to the 50th percentile of the distribution. If the number of observations is even, then the median is the simple average of the middle two numbers.

The **mode** is the most frequent score in our data set. On a histogram it represents the highest point (we will see more about it later in this chapter). You can, therefore, sometimes consider the mode as being the most popular option. This implies that the mode is not always a unique value; in fact (particularly in continuous data) we can have bi-modal, tri-modal, or multi-modal distributions. Moreover, the mode will not provide us with a very good measure of central tendency when the most common value is far away from the rest of the data.

The **middle value** is a less used measure. It represents the value exactly at the center between the minimum and the maximum values (Equation (5.3)), regardless of the distribution of the data. Thus it is sensitive to outliers.

\[\begin{equation} \frac{max+min}{2} \tag{5.3} \end{equation}\]

The function `summary()` gives us a brief overview of the minimum, maximum, first and third quartile, median and mean of each variable, or of a single variable. Of course these measures can also be computed independently using dedicated functions.

The following code computes summary statistics, the arithmetic mean, the geometric mean (there is no dedicated function for it), the median, the mode (the function from `DescTools` also allows us to retrieve multi-modal values) (Signorelli 2021), and the middle value. For more options about all the functions below, please look at the help.

## Code

```
# statistical summaries of the whole dataset or of a single variable
summary(mtcars)
summary(mtcars$cyl)
# arithmetic mean
mean(mtcars$mpg)
# geometric mean (no dedicated function: exponentiate the mean of the logs)
exp(mean(log(mtcars$mpg)))
# median
median(mtcars$mpg)
# mode
library(DescTools)
Mode(mtcars$mpg)
# middle value
(max(mtcars$mpg) + min(mtcars$mpg)) / 2
```

## 5.2 Variability Measures

The terms variability, spread, and dispersion are synonyms and refer to how dispersed a distribution is. These measures complete the information given by central tendency measures, helping us better understand the data we are analyzing. The example below gives us an idea of how different distributions may have the same central tendency measures, and thus how considering only those measures would lead to mistaken evaluations.

| distribution | mean | median | mode |
|---|---|---|---|
| a | 3 | 3 | 3 |
| b | 3 | 3 | 3 |

The **range** is the simplest measure of variability to calculate, and
one you have probably encountered many times in your life. The range is
the maximum value minus the minimum value (Equation (5.4)).

\[\begin{equation} range=max-min \tag{5.4} \end{equation}\]

The **interquartile range** (IQR) is the range of the central 50% of the
values in a distribution (Equation (5.5)).

\[\begin{equation} IQR = 75^{th}\ \text{percentile} - 25^{th}\ \text{percentile} \tag{5.5} \end{equation}\]

But variability can also be defined in terms of how close the values in the distribution are to the middle of the distribution. Using the mean as the reference for the middle of the distribution, the **variance** is defined as the average squared difference of the scores from the mean (Equation (5.6)).

\[\begin{equation} \sigma^2=\frac{\sum{(X-\bar x)^2}}{N} \tag{5.6} \end{equation}\]

The **standard deviation** is simply the square root of the variance
(Equation (5.7)). It is an especially useful measure of variability when
the distribution is normal or approximately normal (see [The Normal
Distribution]) because the proportion of the distribution within a given
number of standard deviations from the mean can be calculated.

\[\begin{equation} \sigma=\sqrt{\sigma^2} \tag{5.7} \end{equation}\]

The **coefficient of variation** (CV) represents the ratio of the standard deviation to the mean (Equation (5.8)), and it is a useful statistic for comparing the degree of variation between data measured in different units. In fact, the CV is the only variability measure (of those mentioned here) that is standardized, and thus does not depend on the unit of measure.

\[\begin{equation} CV=\frac{\sigma}{\bar{x}} \tag{5.8} \end{equation}\]

The following code computes the above-mentioned variability measures for the consumption variable of the mtcars dataset. In the last lines you can also find the code to compile a frequency table with the function `table()`.

## Code

```
# range
max(mtcars$mpg) - min(mtcars$mpg)
# quantile distribution
quantile(mtcars$mpg)
# interquartile range
quantile(mtcars$mpg, probs = .75) - quantile(mtcars$mpg, probs = .25)
# variance (note that var() uses the sample formula, dividing by n - 1)
var(mtcars$mpg)
# standard deviation
sd(mtcars$mpg)
# coefficient of variation
sd(mtcars$mpg)/mean(mtcars$mpg)
# frequency tables
table(mtcars$cyl)
table(mtcars$cyl, mtcars$am)
```

Below is a simple piece of code that creates a dataframe, which is also a way to create a summary table. The `data.frame()` function has to be filled column-wise, specifying the column/variable name and then the content. You can play with it in order to create a personalized table. The `stargazer()` function, instead, presents the same statistics as the `summary()` function, but in a cooler style (Torres-Reyna 2014; Hlavac 2022).

## Code

```
# create a table storing your values
data.frame(obs = c("Miles per Gallon", "Number of Cylinders"),
           mean = c(mean(mtcars$mpg), mean(mtcars$cyl)),
           median = c(median(mtcars$mpg), median(mtcars$cyl)),
           sd = c(sd(mtcars$mpg), sd(mtcars$cyl)))
# create a table storing basic summary statistics
library(stargazer)
stargazer(mtcars, type = "text", digits = 2)
```

## 5.3 Inequality Measures

Another set of analysis methods that are related to variability are inequality measures. Inequality can be defined as: “a scalar numerical representation of the interpersonal differences in income (for example) within a given population.” The use of the word “scalar” implies that all the different features of inequality are compressed into a single number or a single point on a scale.

But inequality of what? The strict statistical rule says that it must be a quantitative variable transferable among the population of interest, such as income, apples, cars, etc. However, in economics, inequality measurements are also used for other items, like carbon emissions.

The **Gini** coefficient, or Gini Index, is the most widely used measure of inequality in policy-related discussions. It is a measure of statistical dispersion most prominently used as a measure of inequality of income distribution or inequality of wealth distribution. It is defined as a ratio with values between 0 and 1: a low Gini coefficient indicates a more equal income or wealth distribution, while a high Gini coefficient indicates a more unequal distribution. Zero corresponds to perfect equality (everyone having exactly the same income) and 1 corresponds to perfect inequality (where one person has all the income, while everyone else has zero income). *Note that the Gini coefficient requires that no one have a negative net income or wealth.*

In geometrical terms, the Gini coefficient can be thought of as the ratio of the area that lies between the line of perfect equality (the diagonal) and the Lorenz curve over the total area under the line of equality (see Figure 5.3). Here is the formula used to compute the Gini coefficient:

\[\begin{equation} G(x)=1-\frac{2}{N-1}\sum^{N-1}_{i=1}{q_i} \tag{5.9} \end{equation}\]

where *N* is the population size, and *q* is the cumulative relative
income.
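As a sketch, Equation (5.9) can be implemented directly in base R. The example vectors below are illustrative assumptions, not data from the book:

```r
# Gini coefficient following Equation (5.9):
# q_i are the cumulative relative income shares of the sorted incomes
gini_manual <- function(x) {
  N <- length(x)
  q <- cumsum(sort(x)) / sum(x)   # cumulative relative income
  1 - (2 / (N - 1)) * sum(q[-N])  # drop q_N, which is always 1
}
gini_manual(c(10, 10, 10, 10))  # perfect equality (approximately 0)
gini_manual(c(0, 0, 0, 40))     # one unit holds all the income: 1
```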

Run the code below and calculate the Gini coefficient for both the x and y objects. You can either use the function from the package `DescTools` (Signorelli 2021), or the function from the package `labstatR` (Iacus and Masarotto 2020) (remember that R is case-sensitive!). Did you expect this value of Gini?

Try to understand how the function `rep()` works using the help.

## Code
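Since the original definitions of the `x` and `y` objects are not shown here, the sketch below assumes three illustrative income distributions built with `rep()` (the `z` vector is also referenced by the Palma Ratio code later in this section):

```r
# illustrative income distributions (assumptions, built with rep())
x <- rep(100, 10)             # perfectly equal incomes
y <- c(rep(0, 9), 1000)       # one unit holds all the income
z <- c(rep(50, 8), 200, 400)  # moderately unequal incomes

# Gini via DescTools (note the capital G)
library(DescTools)
Gini(x)   # 0: perfect equality
Gini(y)   # close to 1: extreme inequality
# Gini via labstatR (lowercase g); it returns a list that
# includes the cumulative relative frequencies Q
library(labstatR)
gini(z)
```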

As mentioned, the Gini Index is nowadays a recognized standard; nevertheless, it has some limitations. In fact, Gini is more sensitive to changes in the middle of the distribution than to changes in the tails. In economics, when we study inequality, we are often more interested in the behavior of the tails (namely the top and bottom 10% of the distribution).

The **Palma Ratio** is a particular specification within a family of
inequality measures known as inter-decile ratios (Cobham, Schlögl, and Sumner 2016). More
specifically, it is the ratio of national income shares of the top 10%
of households to the bottom 40%. It, thus, tells us how many times the
income of the top 10% of the population is higher than that of the
bottom 40% (Equation (5.10)).

\[\begin{equation} Palma = \frac {top10}{bottom40} \tag{5.10} \end{equation}\]

The code below computes the Palma Ratio for the \(x\), \(y\), and \(z\) distributions created previously. This has to be done manually, by computing the top 10% income and the bottom 40% income from the cumulative frequencies returned by the function `gini()` of the package `labstatR` (Iacus and Masarotto 2020). Compare the Gini Index and the Palma Ratio: do you find different results? Why?

## Code

```
library(labstatR)
# extracting the cumulative frequencies from the function gini by labstatR
q <- gini(x)$Q
# computing the Palma Ratio on the cumulative frequencies
(quantile(q, probs = 1, type=3)-quantile(q, probs = .9, type=3))/
quantile(q, probs = .4, type=3)
# extracting the cumulative frequencies from the function gini by labstatR
q2 <- gini(y)$Q
# computing the Palma Ratio on the cumulative frequencies
(quantile(q2, probs = 1, type=3)-quantile(q2, probs = .9, type=3))/
quantile(q2, probs = .4, type=3)
# extracting the cumulative frequencies from the function gini by labstatR
q3 <- gini(z)$Q
# computing the Palma Ratio on the cumulative frequencies
(quantile(q3, probs = 1, type=3)-quantile(q3, probs = .9, type=3))/
quantile(q3, probs = .4, type=3)
```

## 5.4 Data visualization

Visualizing data is another way to explore the data, or better, a complement to the quantitative exploratory analysis carried out above. In fact, as we saw in the example in Figure 5.4, a simple histogram can tell us whether the arithmetic mean gives us a realistic representation of the data, or whether more analysis is needed. In this section we will not see the beautiful graphs for which R is recognized worldwide (for those I invite you to explore ggplot2), but quick and dirty graphs have their potential too, at least for the exploratory analysis phase.

A **histogram** is a plot that allows the inspection of the data for its underlying distribution (e.g., normal distribution), outliers, skewness, etc. If the bars of the histogram are equally spaced, the height of each bar reflects the frequency of the corresponding value of the distribution.

A **box plot**, also called a “box and whisker plot”, is a way to show the spread and centers of a data set. A box plot is a way to show a five number summary in a chart. The main part of the chart (the “box”) shows where the middle portion of the data is: the interquartile range. At the ends of the box, you'll find the first quartile (the 25% mark) and the third quartile (the 75% mark). The far bottom of the chart is the minimum (the smallest number in the set) and the far top is the maximum (the largest number in the set). Finally, the median __(__**not the mean!**__)__ is represented by a horizontal bar in the center of the box.

A **scatter plot** is a bi-dimensional plot in which the dots represent
values for two different numeric variables in the Cartesian space. The
position of each dot indicates each individual data point with respect
to the two variables selected. Scatter plots are used to observe
relationships between two numeric variables (we will deepen their use
later in this chapter).

The code provided for this part may look a bit more complex than what we have seen so far. We first draw a histogram and a density plot (which is a linear representation of the same distribution). Then, we draw some scatter plots, adding some stylish (but still basic) arguments and a linear regression line. Note that the formula used in the `abline()` function is something that we will explore further in Linear Regression; for now just notice that `y ~ x + z + e` stands for `y = x + z + e` in classic linear algebra notation.^{5} We then subset consumption (mpg) data for automatic and manual cars from the `mtcars` dataset in order to study the differences between them with a box plot. We can see how manual cars (1 in the plot) present higher miles per gallon (thus lower consumption) than automatic cars (0 in the plot), and the two distributions overlap only in their tails. In the following graph, we appreciate how a continuous variable (horse power) is distributed among cars grouped by a discrete variable (cylinders). In this way, we are studying a relationship between variables of mixed types. Note that, in order to tell R that the variable is discrete, we used the function `as.factor()` (more details on it in Factor variables). Moreover, we specified the \(x\) and \(y\) axes labels using the `xlab` and `ylab` arguments.

Try the code, personalize it, and check in the help for supplementary options.

## Code

```
# histogram
hist(mtcars$mpg)
# density plot
plot(density(mtcars$mpg), type = "l")
# scatterplot of cyl vs hp
plot(mtcars$cyl, mtcars$hp)
# why do we get an error here?
plot(mpg, hp)
# stylish scatterplot
plot(mtcars$hp, mtcars$mpg,
     # adding titles and axes names
     ylab = "Miles per Gallon",
     xlab = "Horse Power",
     main = "Is consumption related to Horse Power?")
# adding a regression line on the scatterplot
abline(lm(mpg ~ hp, data = mtcars), col = "blue")
# subset automatic cars consumption
mpg.at <- mtcars$mpg[mtcars$am == 0]
# and manual cars consumption
mpg.mt <- mtcars$mpg[mtcars$am == 1]
# boxplot
boxplot(mpg.mt, mpg.at)
# boxplot with a categorical variable
plot(as.factor(mtcars$cyl), mtcars$hp,
     # adding axes names
     xlab = "Cylinders", ylab = "Horse Power")
```

## 5.5 Scaling data

Scaling, or standardizing, data is a useful technical trick that statisticians and economists use in order to make better use of data. Some of the most common situations where scaling is required are visualization and interpretation, but most of all comparison.

**Log scaling** is typically used for plotting and regression analysis. It is a way of displaying numerical data spanning a very wide range of values in a compact way. In fact, typically, when log scaling is needed, the largest numbers in the data are hundreds or even thousands of times larger than the smallest numbers. As an example, exponential growth curves are often displayed on a log scale; otherwise they would increase too quickly to fit within a small graph and allow a complete analysis of their variation. R has the function `log()` to do that.

## Code
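A minimal sketch of log scaling (the `pop` vector is an illustrative assumption):

```r
# a series spanning several orders of magnitude
pop <- c(1e3, 1e4, 1e5, 1e6, 1e7)
log(pop)    # natural logarithm
log10(pop)  # base-10 logarithm
# base R plotting can also log-scale an axis directly
plot(pop, type = "b", log = "y")
```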

**Ranking** replaces the value assumed by each unit with the order number (rank) by which the unit is placed on the list according to a specific indicator. If two or more units assume the same value, they are given the average rank of the positions that they would have had in case of different values. The transformation into ranks purifies the indicators from the unit of measure, but it does not preserve the relative distance between the different units.

The basic form of the `rank()` function produces a vector that contains the rank of the values in the vector that was evaluated, such that the lowest value has a rank of 1 and the second-lowest value has a rank of 2.

Relative indices with respect to the range (**Min-Max**) purify the data from their unit of measure such that the features fall within a specific range (i.e., \([0, 1]\)) (Equation (5.11)). Min-Max scaling is important when we want to retain the relative distance between the data points.

\[\begin{equation} r_{ij}=\frac{x_{ij}-\min_ix_{ij}}{\max_ix_{ij}-\min_ix_{ij}} \tag{5.11} \end{equation}\]

## Code
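A short sketch of both transformations (the vector `v` is an illustrative assumption):

```r
v <- c(12, 5, 5, 20, 8)
# ranking: tied values receive the average of the ranks they would occupy
rank(v)
# Min-Max scaling to the [0, 1] interval (Equation (5.11))
(v - min(v)) / (max(v) - min(v))
```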

The point of **z-score standardization** is to rescale your data so that it has the key properties of a normal distribution: mean 0 and standard deviation 1 (Equation (5.12)). The normal distribution, or Gaussian distribution, is a specific statistical distribution where equal numbers of observations fall above and below the mean, the mean and the median are the same, and there are more observations closer to the mean (for more details see [The Normal Distribution]).

\[\begin{equation} z_{ij}=\frac{x_{ij}-\bar x_j}{\sigma_j} \tag{5.12} \end{equation}\]

The `scale()` function, with default settings, will calculate the mean and standard deviation of the entire vector, then normalize each element by subtracting the mean and dividing by the standard deviation (Equation (5.12)). The resulting distribution will have a mean equal to 0 and a standard deviation equal to 1.

## Code
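For example, using the consumption variable employed throughout this chapter:

```r
# scale() returns a one-column matrix; [, 1] drops the matrix attributes
z <- scale(mtcars$mpg)[, 1]
mean(z)  # approximately 0
sd(z)    # 1
# equivalent manual computation (Equation (5.12))
(mtcars$mpg - mean(mtcars$mpg)) / sd(mtcars$mpg)
```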

In **index numbers**, the value assumed by each unit is divided by a reference value belonging to the same distribution or calculated on it, generally the mean or the maximum (Equation (5.13)). This normalization allows us to remove the unit of measure and to keep the relative distance among the units. If the denominator is the maximum, then we obtain values less than or equal to 100.

\[\begin{equation} I_{ij}=\frac{x_{ij}}{x^*_{oj}}*100 \tag{5.13} \end{equation}\]

## Code
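A brief sketch with the mtcars consumption variable:

```r
x <- mtcars$mpg
# index numbers with the mean as reference value (Equation (5.13))
x / mean(x) * 100
# with the maximum as reference: all values are less than or equal to 100
x / max(x) * 100
```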

In the **percentage** transformation the value of each unit is divided by the sum of the values (Equation (5.14)). The sum of the normalized values is equal to 100.

\[\begin{equation} p_{ij}=\frac{x_{ij}}{\sum^n_{i=1}x_{ij}}*100 \tag{5.14} \end{equation}\]
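A brief sketch, again with the mtcars consumption variable:

```r
x <- mtcars$mpg
# percentage transformation (Equation (5.14))
p <- x / sum(x) * 100
sum(p)  # the normalized values add up to 100
```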

## 5.6 Probability Sampling

Sampling allows statisticians to draw conclusions about a whole by examining a part. It enables us to estimate characteristics of a population by directly observing a portion of the entire population. Researchers are not interested in the sample itself, but in what can be learned from the survey—and how this information can be applied to the entire population.

**Simple random sampling** is a powerful tool: we can use it to randomly subset data, or to generate random distributions with precise characteristics. Each member of a population has an equal chance of being included in the sample. Also, each combination of members of the population has an equal chance of composing the sample. Those two properties are what define simple random sampling. Simple random sampling can be done *with or without replacement*. A sample with replacement means that a sampled observation may be selected twice or more. Usually, the simple random sampling approach is conducted without replacement because it is more convenient and gives more precise results.

Simple random sampling is the easiest method of sampling and it is the most commonly used. Advantages of this technique are that it does not require any additional information on the sampled unit, nor on the context (i.e. the geographic area, gender, age, etc…). Also, since simple random sampling is a simple method and the theory behind it is well established, standard formulas exist to determine the sample size, the estimates and so on, and these formulas are easy to use.

On the other hand, by making no use of context information this method may sometimes be less efficient at equally representing some strata of the population, particularly for smaller samples. Finally, if we are planning a survey and not just sampling already collected data, simple random sampling may be an expensive and unfeasible method for large populations, because all elements must be identified and labeled prior to sampling. It can also be expensive if personal interviewers are required, since the sample may be geographically spread out across the population.

We can calculate the probability of a given observation being selected. Since we know the sample size (\(n\)) and the total population (\(N\)), calculating the probability of being included in the sample becomes a simple matter of division (Equation (5.15)).

\[\begin{equation} p=\frac{n}{N}*100 \tag{5.15} \end{equation}\]
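For instance, when sampling 4 cars out of the 32 in mtcars:

```r
n <- 4   # sample size
N <- 32  # population size
n / N * 100  # 12.5: each car has a 12.5% chance of entering the sample
```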

In R we will use the `sample()` function. This function requires us to specify the total population to be sampled (i.e., a vector between 1 and 1 million), then the sample size we are interested in, and finally whether we want replacement to happen (`TRUE`) or not (`FALSE`).

In order to have reproducible code (this is one of the main reasons for using R as our working tool), we want to be able to select the same random sample over and over. It may sound complex, but it is not. If we are not able to select the same random sample today and tomorrow, our analysis will change slightly every time we run the code. In order to solve this problem, we will use the `set.seed()` function at the beginning of our work. In fact, by setting the seed, we tell R which sequence of pseudo-random numbers to use. **Please be aware that different versions of the Random Number Generator will select different samples given the same seed. If you collaborate with other people, specify the same version of it** (see the code below)**.**

In the code below, run the first line multiple times and you will see different samples drawn. If, instead, you run the set.seed line plus the sampling, you will always get the same sample. The last line of code gives us an example of how to randomly select 4 rows out of the mtcars dataset (32 rows) using a combination of the indexing system and the `sample()` function. In fact, between the square brackets there are: a vector of row numbers (generated by the `sample()` function), then a comma followed by nothing (which means that we want to keep all the columns of the dataset). You can run just the sample function first to see the resulting vector (`sample(1:32, 4, replace=FALSE)`); after this, run the whole line (`samp_data <- mtcars[sample(1:32, 4, replace=FALSE),]`). The code subsets the randomly sampled mtcars rows.

## Code

```
# random sampling 4 numbers out of the series from 1 to 32
sample(1:32, 4, replace=FALSE)
# set seed allows us to reproduce the random operations we do locally
set.seed(1234)
sample(1:32, 4, replace=FALSE)
# specifying the Random Number Generator version, allows everyone to have the same sample.
set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion")
# sampling the random observations selected
set.seed(1234)
samp_data <- mtcars[sample(1:32, 4, replace=FALSE),]
```

In **stratified sampling**, the population is divided into homogeneous,
mutually exclusive groups called strata, and then independent samples
are selected from each stratum. Why do we need to create strata? There
are many reasons, the main one being that it can make the sampling
strategy more efficient. In fact, it was mentioned earlier that we need
a larger sample to get a more accurate estimation of a characteristic
that varies greatly from one unit to the other than for a characteristic
that does not. For example, if every person in a population had the same
salary, then a sample of one individual would be enough to get a precise
estimate of the average salary.

This is the idea behind the efficiency gain obtained with stratification. If we create strata within which units share similar characteristics (e.g., income) and are considerably different from units in other strata (e.g., occupation, type of dwelling) then we would only need a small sample from each stratum to get a precise estimate of total income for that stratum. Then we could combine these estimates to get a precise estimate of total income for the whole population. If we were to use a simple random sampling approach in the whole population without stratification, the sample would need to be larger than the total of all stratum samples to get an estimate of total income with the same level of precision.

Any of the sampling methods mentioned in this section (and others that
exist) can be used to sample within each stratum. The sampling method
and size can vary from one stratum to another, since each stratum
becomes an independent population. When simple random sampling is used
to select the sample within each stratum, the sample design is called
**stratified simple random sampling**. A population can be stratified by
any variable that is available for all units on the sampling frame prior
to sampling (i.e., age, gender, province of residence, income, etc.).

The code below applies stratified simple random sampling to the Star Wars dataset. Using the `dplyr` package (Wickham and Bryan 2022), we are able to stratify the data by one variable (eye color), and sample randomly 40% of the available population in each stratum (function `sample_frac()`), or 2 observations per stratum (function `sample_n()`).

## Code
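A sketch of what this could look like (assuming the `starwars` data frame shipped with `dplyr`; note that `replace = TRUE` is added in `sample_n()` because some eye-color strata contain fewer than 2 observations):

```r
library(dplyr)
set.seed(1234)
# stratify by eye color and sample 40% of each stratum
starwars %>%
  group_by(eye_color) %>%
  sample_frac(0.4)
# or sample 2 observations per stratum
starwars %>%
  group_by(eye_color) %>%
  sample_n(2, replace = TRUE)
```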

## 5.7 Exercises

- R playground, Exploratory data analysis

### Bibliography

Cobham, Alex, Lukas Schlögl, and Andy Sumner. 2016. *Global Policy* 7 (1): 25–36. https://doi.org/10.1111/1758-5899.12320.

Hlavac, Marek. 2022. *The Comprehensive R Archive Network*. https://cran.r-project.org/web/packages/stargazer/vignettes/stargazer.pdf.

Iacus, Stefano M., and Guido Masarotto. 2020. *The Comprehensive R Archive Network*. https://cran.r-project.org/web/packages/labstatR/labstatR.pdf.

Lane, David M., et al. 2003. *Introduction to Statistics*. https://onlinestatbook.com/Online_Statistics_Education.pdf.

Manikandan, S. 2011. *Journal of Pharmacology and Pharmacotherapeutics* 2 (3): 214. https://doi.org/10.4103/0976-500X.83300.

In order to type the symbol ~ (tilde), a different combination of keys is needed depending on the operating system and keyboard.↩︎