EDA with ‘ggplot2’

EDA = Exploratory Data Analysis

R
plotting
Author

Pedro J. Aphalo

Published

2023-05-24

Modified

2023-07-27

Abstract

EDA and ‘ggplot2’ have a common ancestry. EDA as originally proposed by John W. Tukey included data visualization as a key tool. In this page I show examples of some of the data visualizations that are used in EDA, together with the R code I used to create them. I also briefly describe EDA’s position in the scientific research process and mention its origins.

Keywords

ggplot2 pkg, data visualization, dataviz

Note

To see the source of this document click on “</> CODE” to the right of the page title. The page is written using Quarto which is an enhanced version of R Markdown. The diagrams are created with Mermaid, a language inspired by the simplicity of Markdown.

Warning

Package ‘ggplot2’ has gained new features over its long life, and although few changes have been ‘code breaking’ you should be aware that the examples in this page have been tested with version (==3.4.2).

Introduction

Exploratory Data Analysis (EDA) as proposed by John W. Tukey is strongly based on data visualization. There is a contrast in how a plot and a numerical summary present the data to us. A well-designed data display shows the data in its context, while an isolated numerical summary does not. The Data visualization handbook (Koponen and Hildén (2019)), and its Finnish language version, Tieto näkyväksi (Koponen, Hildén, and Vapaasalo. (2016)), provides a detailed account of how to produce accurate and effective graphs and diagrams based on data.

In William S. Cleveland’s words

Graphs are exceptionally powerful tools for data analysis. The reason is nicely encapsulated in a sentence from a 1982 letter written to me by W. Edwards Deming: “Graphical methods can retain the information in the data.” Numerical data analytic procedures-such as means, standard deviations, correlation coefficients, and t-tests-are essentially data reduction techniques. Graphical methods complement such numerical techniques. Graphical methods tend to show data sets as a whole, allowing us to summarize the general behavior and to study the detail. This leads to much more thorough data analyses. (Cleveland 1985)

Figure 1 demonstrates how misleading it can be to rely only on numerical summaries of data.

Code
# we rearrange the data
library(ggplot2)
library(ggpmisc)

my.mat <- matrix(as.matrix(anscombe), ncol=2)
my.anscombe <- 
  data.frame(x = my.mat[ , 1],
             y = my.mat[ , 2],
             case=factor(rep(1:4, rep(11,4))))
ggplot(my.anscombe, aes(x,y)) +
  geom_point(shape=21, fill="orange", size=3) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2", "P"))) +
  facet_wrap(~case, ncol=2) +
  theme_bw(16)
Figure 1: Anscombe’s linear regression examples. The artificial data sets (Anscombe 1973) plotted here demonstrate that even if very different they yield exactly the same numerical estimates when a linear regression model is fit to them. These plots highlight how much information numerical estimates can hide by reducing the observed data into a few numerical values and why extensive use of plots as a tool in data analysis is encouraged.

The paper Remembrances of Things EDA Friendly (2022) recounts the history of EDA and the people involved in its development. In the early 1970’s emphasis was on hand-drafted plots, given that the predominant way of interacting with computers was still based on text-only consoles and displays. Plots like boxplots are easy to draw by hand and can be constructed from values that are easy to calculate. Although EDA is frequently associated with such plots, John W. Tukey’s ideas were broader, as reflected in the quotations below.

Exploratory data analysis is detective work — numerical detective work or counting detective work — or graphical detective work. (Mosteller and Tukey 1977)

There is nothing better than a picture for making you think of questions you had forgotten to ask (even mentally). (Tukey and Tukey 1985).

Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone—as the first step. (Mosteller and Tukey 1977)

The Layered Grammar of Graphics implemented in R package ggplot2 derives from this tradition initiated by John Tukey, which encourages data exploration during analysis and the ad-hoc

The role of EDA in research

Exploratory data analysis can play two distinct roles in research, depending on the type of experiment or survey. In both approaches EDA plays a crucial role (Figure 2).

  1. EDA = Detective work for Quality Control (QC). Some research is based on testing of hypotheses set a priori (before the the experiment is designed, or at least independently of the data collected in the current study), in which case EDA can be seen mainly as a quality control step asking whether there is anything unusual in the data that suggests problems during its collection or in the assumptions on which the design of the study was based.

  2. EDA = Detective work searching for interesting relationships and patterns in data that may reveal how the system under study works or is structured, combined withQC. In other cases research aims at describing the system under study, in most cases, searching for meaningful relationships among variables. In such research EDA plays an additional role, it conveys information about the system under study. In such cases it is in most cases necessary to divide the data into a set used to generate testable hypotheses during EDA and a set used for validation (= testing of hypotheses) during Confirmatory Data Analysis (CDA).

  3. EDA-like utilitarian data analysis. It is also possible to have as an aim to develop a purely empirical way of achieving a prediction. This is almost never the aim in basic scientific research because science aims at improving understanding. However this tends to be the norm when artificial intelligence and machine learning are used in practical applications.

Tip

Even if an experiment or survey is designed to test hypotheses set a priori, EDA can suggest new hypotheses to be tested in future experiments. Even if not testable with the same data, such hypotheses can be discussed during communication of results as something to be considered in future studies. However, in all cases it is imperative to clearly distinguish statements supported by strong evidence from new hypotheses suggested by the results of an experiment or survey.

The diagram in Figure 2 shows the position of EDA in the research process. Think solid arrows show the most direct flow of information, while the thin and dotted arrows show information-dependent connections that are also importantl. The dot “arrow” heads indicate that the connection imposes restrictions. For example the design of an experiment determines how the data collected can be analysed and interpreted. Decisions about what can be considered an outlier are informed by anomalies observed during the course of an experiment, and sometimes even design decisions may need to be modified during the realization of a study. I have included Tests of Hypotheses and Model Selection as alternatives, but wording is rather unsatisfactory. The distinction is between testing hypotheses set a priori, versus searching for a model (or explanation) that describes the data well. In the first case we are interested in the P-values themselves, in the second case in the actual parameters estimates, such as the slope and intercept of a linear regression. In the first case, our a priori hypotheses are hypotheses still about the values of parameter estimates from a fit of data to a model, so, in both cases we may rely on fitting one or more models to the observed data for our statistical analysis.

%%{init: {"htmlLabels": true} }%%

flowchart TD
  
  Z([background<br>information]) ==> Y(Hypothesis)
  Y ==> A(Design) ==> Aa(Planning) ==> B(Realization) ==> H(Data collection) ==<font color=blue><strong>2.</strong>==> C
  C[<font color=blue><i>full</i> <strong>EDA</strong>] ==> D(<font color=blue>Model\nSelection) =="<font color=blue><i>R</i><sup>2</sup>, <i>f(x<sub>i</sub>)</i>, AIC, BIC"==> E(Interpretation) ==> F([communication])
  H ==<font color=green><strong>1.</strong>==> I[<font color=green><i>QC</i> <strong>EDA</strong>]
  H ==deposit<br>data+metadata=====> X([data<br>repository])
  I ==> G(<font color=green><strong>CDA</strong>\nTests of\nHypothesis) ==<font color=green><i>P</i>-value==> E
  E --follow up<br>study--> Y
  C <--<font color=blue>new/modified<br>hypothesis--> Y
  C --improved design--> A 
  I --improved design--> A
  B <-.-> H
  A -.-o D
  A -.-o E
  A -.-o G
  F --scientific<br>literature--> Z
  X --"open data"--> Z
  Z ==> E
  linkStyle 5,6,7,14,15 stroke: blue
  linkStyle 9,11,12,16 stroke: green
Figure 2: A diagram showing the steps of scientific studies. The thick arrows describe the sequence of events/actions, connecting the design of an experiment to the communication of the results. Two paths, 1. for hypothesis based research and 2. for descriptive studies, are highlighted (see main text). The dotted arrows with round heads indicate constraints imposed by design-related decisions. The double headed dotted arrow describes that the realization of a study can be influenced by data observed during its course, especially when data are collected repeatedly. Thin arrows indicate how one study can affect subsequent studies. QC= Quality Control or sanity checks of data. Even when no hypothesis testing is done, a hypothesis of what variables are of interest is involved in deciding what data are going to be used or collected. Only if no formal hypothesis testing is involved, we can revise this weaker hypothesis during data analysis. This abstraction can be applied to empirical research, but with small changes (not shown) also to simulation studies.

In most cases EDA relies heavily on the graphical display of data. It is however important to remember that different types of plots make it easy to visualize different aspects of a data set. In this page I show some examples using R and package ‘ggplot2’. These are very far from being exhaustive. The detective work in most cases involves imagining and implementing a data visualization able to reveal the information we are interested in.

Quality control

Before any plotting it is good to start by tabulating the data. table() is useful when we have categorical variables or factors, as we can use it to get counts of observations.

table(iris$Species)

    setosa versicolor  virginica 
        50         50         50 

With two factors we get a two-way table, and with additional factors, additional 2D tables. From this very simple table we can see that we have 50 observations per species. Had we had wrongly typed the species name in one cell we would get a different table.

# Make version of iris with a wrongly encoded value
bad_iris <- iris
bad_iris$Species <- as.character(bad_iris$Species)
bad_iris$Species[5] <- "acetosa"
bad_iris$Species <- factor(bad_iris$Species)
table(bad_iris$Species)

   acetosa     setosa versicolor  virginica 
         1         49         50         50 

We now know that there is an incorrectly encoded value, and we can find in which row it is.

which(bad_iris$Species == "acetosa")
[1] 5

We can also replace it, if we are sure that this is a mistake. The code below will replace all instances of "acetosa" by "setosa".

bad_iris$Species[bad_iris$Species == "acetosa"] <- "setosa" # replaces the value
bad_iris$Species <- factor(bad_iris$Species) # removes the level
table(bad_iris$Species)

    setosa versicolor  virginica 
        50         50         50 

In a case like this with 150 rows, I find this approach easier than editing a file. With thousands of observations all other viable options are based on scripts.

The organization of the data frame seems to be fine, so we may want to use summary() for a quick inspection.

summary(iris)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Summary is useful in that it provides a quick view of all the columns of the data frame and will also report the number of NA values in each colum.

I am not showing an example, but when reading data from a text file (e.g., .csv file) if the decimal mark (, or .) is not recognized as such, numbers are read as text (character values) instead numbers (numeric values). It is enough that a single value in a whole column has the wrong decimal marker, for the whole column to be read as text. summary() can also help in such a case, as the summary for the affected column will not be numeric.

Tip

NA is an abbreviation for Not Available, it is used to indicate missing data. Observations that should have been available but were lost or are missing because of an “accident”. They propagate in arithmetic operations. One or more missing values make the result that would have been obtained had the value(s) been available, also not available or unknown. In general, in R, removal of NA values is not the default, but can be explicitly requested.

NA + 10
[1] NA

NaN is an abbreviation for Not a Number, NaN values are generated by undefined arithmetic operations.

log(-1) # is undefined
[1] NaN
1/0 # is a valid operation!
[1] Inf
-5/0 # is a valid operation!
[1] -Inf
1000 / Inf # is a valid operation!
[1] 0

Plots of single variables

The plots we will create today are for data exploration, not communication of results. If variable names are self explanatory as in the iris data set we do not need to change axis labels or change how plots look, except to ensure that the features of the data are clearly displayed.

library(ggplot2)
theme_set(theme_bw(16))

Box plot

A very popular type of plot for EDA are box plots (more descriptively, also called box and whiskers plots). John Tukey’s most important and original contribution was on EDA methods, including box plots, not Tukey’s HSD. In the early 1970’s easy of drawing and easy of computation were still crucial for method to be widely used, and this one of the goals in most of Tukey’s work on EDA. In the iris data sets we have data from measurements of the size of different flower parts, petals and sepals. For each of them, width and length are available.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()

Caution

The advice is not to use box plots with small data sets, say with fewer than 10 to 15 observations per box plot (e.g., 50 observations for each of species in the figure above). The shape of the boxes and whiskers is quite reasonable with 50 observations per species, with very few observations the box plots are less informative than showing all the observations individually.

An enhancement to simple box plots is to show an approximate 95% confidence interval as a notch in the box plot. We can also make them narrower.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(notch = TRUE, width = 0.3)

Dot plot

For exploration one would think that plotting all observations would be ideal, but how can we ensure that all observations are visible rather than occluded by other observations. We can use transparency, but this works well only with few overlaps, so works well in those situations when we have too few observations for a more complex visual like a boxplot or density plots. In the example I use alpha = 0.25 meaning that a point is 1/4 opaque and 3/4 transparent. For a point to look black, four or more observations have to overlap.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_point(alpha = 0.25)

Another approach is randomly displace points along the x axis to avoid overlaps.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_point(position = position_jitter(width = 0.25))

Violin plot

Surely we can do better than this with modern computers! Box plots aim at describing the distribution of observations based on the median, quartiles and range, and possible “outliers”. Violin plots are like box plots but describe the density distribution with a fitted curve.

ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin()

Beeswarm plot

A final alternative is to combine violin plots with showing the observations.

library(ggbeeswarm)
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_quasirandom(width = 0.25)

To any of the plots shown above we can add numerical summaries, either numerically, graphically or both.

library(ggbeeswarm)
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_quasirandom(width = 0.25) +
  stat_summary(fun.data = mean_cl_boot, colour = "red", linewidth = 1) +
  stat_group_counts(geom = "text",
                    label.y = 0.1,
                    label.x = "factor")

These are some of the common variations around the theme of boxplots, or more accurately plots that give information not only on the central tendency and spread of the data but also on the shape the empirical density distribution. This approach differs from the approach of showing means and standard errors, or confidence intervals, that implicitly assume that the data have been sampled from a population that follows a Normal distribution.

What do these plots tell us

In the case of variable Petal.Length from the iris data set it is clear that the spread of the values is less for I. setosa than for the other two species. The distributions are rather symmetrical around the median, and no unusual or unexpected observations, or outliers can be seen. What is considered an outliers in box plots and plotted individually should be taken with a grain of salt, as it can be affected by the number of observations plotted. For example, I would not worry about the highlighted “outliers” in the data for I. setosa. In this case, what would need to be considered for a data analysis is the heterosedasticity or differences in spread or variance.

We would make similar plots for other variables and explore their properties.

What type of plot to use?

We are exploring the data, so essentially we can use what works best for us and our data. However, one key consideration is the number of observations. If we have many hundreds of observations, then violin plots are the best option, as anything else would become cluttered. The beeswarm plot is useful with several tens of observations up to a couple of hundreds if one adjusts the size of the points and the width of the swarm. When using points to represent individual observations, the upper limit depends on the spread of the data compared to the plotting area (we can see this effect for Iris setosa compared to the other species). I would avoid jitter, as I personally find it difficult to visualize the density in such plots, although it is used quite frequently.

Exploring relationships

In the iris data set we could be interested in the relationships among the measured variables. Even if when plotted individually the data for each measured variable do not show anything unusual, a more subtle kind of outlier would be an unexpected relationship between petal length and width.

Scatter plot

A scatter plot is the simplest and usually first approach to try. As now there is variation both along x and y axes, overlaps are less likely for the same number of observation than in the case of dot plots (it is anyway important to use transparency to make sure overlaps do not remain unnoticed). We have three species, so I would start by using separate panels. We are not much interested in the actual results or interpretation, we are mainly focusing on quality of the data and properties that will affect how to analyse it.

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25) +
  facet_wrap(facets = vars(Species), scales = "free")

We can two interesting things: 1) the real world width and length are continuous variables, but measurements have been done to the nearest 0.1 units, and 2) because of this there are quite many overlapping observations, specially for I. setosa because of the smaller spread of the values: we see only 20 points but we have 50 observations. This is a rather uncommon situation, but as earlier we can use jitter, but now along both x and y. Using as maximum displacement half the resolution of the measured quantities gives a more reasonable set of plots. Some partial overlaps remain, but these are clearly visible.

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species), scales = "free")

Now that we have dealt with the overlaps, we can use a common scale, to be able to compare the species.

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species))

ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species)) +
  scale_x_log10() +
  scale_y_log10()

With many observations we face again the problem of overlaps, and the best option is again to describe the 2D empirical distribution of the data. As with violin plots outliers can disappear from our view. One approach is to add the outliers as individual observations to a density plot. We start with a plain density plot. Package ‘ggplot2’ provides stat_density2D(), however, stat_hdr() presents the areas in a more useful way. For this particular data set stat_density2D(), using default arguments fails. I use the iris data set, although this approach is specially well suited to large data sets (I have used it with over 500000 observations per plot).

library(ggdensity)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  stat_hdr() +
  expand_limits(x = c(-0.5, 3), y = c(0, 7.5)) +
  facet_wrap(facets = vars(Species))

To understand the relationship between the density estimates and the observations, we can add the observations as semitransparent points.

library(ggdensity)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  stat_hdr() +
  geom_point(colour ="yellow",
             alpha = 0.5, 
             position = position_jitter(width = 0.05, height = 0.05)) +
  expand_limits(x = c(-0.5, 3), y = c(0, 7.5)) +
  facet_wrap(facets = vars(Species))

Bad data are not always obvious

To end I will freely tell from memory what is I think, I wonderful demonstration, that plotting can show patterns that otherwise remain hidden. More so, different types of plots can reveal different features, including problems. Data set barley included in package ‘lattice’ includes data for yield in bushels per acre of 10 barley cultivars in Minnesota field trials at six sites on years 1931 and 1932. The question was revisited most recently by Wright (Wright, Kevin (2013). Revisiting Immer’s Barley Data. The American Statistician, 67(3), 129–133.)

The story goes like this, there is what looks like an error in the data coding, that was discovered only in 1993, or nearly 60 years after the first publication of the data. The first report about the problem was by Cleveland (1993), who popularized “trellis” plots, which are equivalent to what are called facets in ‘ggplot2’. Most astonishing is that the data had been used in the interim in examples in statistics books, with the error unnoticed. This just, highlights the need to be careful about EDA, and not assuming that it is very easy, that errors will easily be in sight in the first plot done. The time spent checking the data is very well spent, as it avoids difficulties at all later stages and protects us from reaching wrong conclusions.

The plots below are what I would have used nowadays without consulting the publications by Cleveland and collaborators of which I first became aware of in the early 2000’s.

library(lattice)
ggplot(barley, aes(x = year, y = yield)) +
  geom_col(width = 0.67) +
  expand_limits(y = 0) +
  facet_grid(rows = vars(variety), cols = vars(site)) +
  theme_bw(12)

library(lattice)
ggplot(barley, aes(x = year, y = yield)) +
  stat_summary(fun = median, geom = "col", 
               width = 0.67) +
  expand_limits(y = 0) +
  facet_grid(cols = vars(site)) +
  theme_bw(12)

We can make the difference even more visible by better distinguishing the years. This plot is more refined graphically than the usual EDA plots, but tries to demonstrate that the selection of graphical representation affects affects what patterns we recognize easily.

library(lattice)
ggplot(barley, aes(x = year, y = yield, fill = year)) +
  stat_summary(fun = median, geom = "col", 
               colour = "black", width = 0.67) +
  expand_limits(y = 0) +
  scale_fill_grey() +
  facet_grid(cols = vars(site)) +
  theme(legend.position = "top") +
  theme_bw(12)

Errors become obvious when we discover them, but before we plot the data in a revealing way they are not obvious at all! In this case, it is difficult after some many years to get to the root of the problem. We may guess a miscoding of years for the data from Morris, but as far as I know, this is just a hypothesis. Had the odd results been noticed by those collecting the data, and either corrected or described as being a real feature when they reported them in a publication, we could be sure of what had happened.

References

Anscombe, F. J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1): 17. https://doi.org/10.2307/2682899.
Cleveland, William S. 1985. The Elements of Graphing Data. Wadsworth, Inc.
Friendly, Michael. 2022. “Remembrances of Things EDA.” Nightingale, The Journal of the Data Visualization Society, June. https://nightingaledvs.com/remembrances-of-things-eda/.
Koponen, Juuso, and Jonatan Hildén. 2019. Data Visualization Handbook. Espoo, Finland: Aalto University.
Koponen, Juuso, Jonatan Hildén, and Tapio Vapaasalo. 2016. Tieto Näkyväksi: Informaatiomuotoilun Perusteet. Aalto yliopisto.
Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression. Reading, Massachusetts: Addison-Wesley Publishing Company.
Tukey, John W, and P Tukey. 1985. “Computer Graphics and Exploratory Data Analysis: An Introduction.” In Proceedings of the Sixth Annual Conference and Exposition: Computer Graphics85, III:773–85. Fairfax, VA.

Reuse