EDA and ‘ggplot2’ have a common ancestry. EDA as originally proposed by John W. Tukey included data visualization as a key tool. In this page I show examples of some of the data visualizations that are used in EDA, together with the R code I used to create them. I also briefly describe EDA’s position in the scientific research process and mention its origins.
Keywords
ggplot2 pkg, data visualization, dataviz
Note
To see the source of this document click on “</> CODE” to the right of the page title. The page is written using Quarto which is an enhanced version of R Markdown. The diagrams are created with Mermaid, a language inspired by the simplicity of Markdown.
Warning
Package ‘ggplot2’ has gained new features over its long life and, although few changes have been ‘code breaking’, you should be aware that the examples in this page have been tested with version 3.4.2.
1 Introduction
Exploratory Data Analysis (EDA) as proposed by John W. Tukey is strongly based on data visualization. There is a contrast in how a plot and a numerical summary present the data to us: a well-designed data display shows the data in its context, while an isolated numerical summary does not. The Data visualization handbook (Koponen and Hildén 2019), and its Finnish-language version, Tieto näkyväksi (Koponen, Hildén, and Vapaasalo 2016), provide a detailed account of how to produce accurate and effective graphs and diagrams based on data.
In William S. Cleveland’s words
Graphs are exceptionally powerful tools for data analysis. The reason is nicely encapsulated in a sentence from a 1982 letter written to me by W. Edwards Deming: “Graphical methods can retain the information in the data.” Numerical data analytic procedures-such as means, standard deviations, correlation coefficients, and t-tests-are essentially data reduction techniques. Graphical methods complement such numerical techniques. Graphical methods tend to show data sets as a whole, allowing us to summarize the general behavior and to study the detail. This leads to much more thorough data analyses. (Cleveland 1985)
Figure 1 demonstrates how misleading it can be to rely only on numerical summaries of data.
Figure 1: Anscombe’s linear regression examples. The artificial data sets (Anscombe 1973) plotted here demonstrate that, even though they are very different, they yield exactly the same numerical estimates when a linear regression model is fitted to them. These plots highlight how much information numerical estimates can hide by reducing the observed data to a few values, and why extensive use of plots as a tool in data analysis is encouraged.
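The code used for Figure 1 rearranges the built-in anscombe data frame into long form and adds the fitted line and its equation with stat_poly_line() and stat_poly_eq() from package ‘ggpmisc’:
library(ggplot2)
library(ggpmisc)
# rearrange the anscombe data into long form: x, y, and a factor identifying the case
my.mat <- matrix(as.matrix(anscombe), ncol = 2)
my.anscombe <- data.frame(x = my.mat[ , 1],
                          y = my.mat[ , 2],
                          case = factor(rep(1:4, rep(11, 4))))
ggplot(my.anscombe, aes(x, y)) +
  geom_point(shape = 21, fill = "orange", size = 3) +
  stat_poly_line() +
  stat_poly_eq(use_label(c("eq", "R2", "P"))) +
  facet_wrap(~case, ncol = 2) +
  theme_bw(16)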
The paper Remembrances of Things EDA (Friendly 2022) recounts the history of EDA and the people involved in its development. In the early 1970s the emphasis was on hand-drafted plots, given that the predominant way of interacting with computers was still based on text-only consoles and displays. Plots like boxplots are easy to draw by hand and can be constructed from values that are easy to calculate. Although EDA is frequently associated with such plots, John W. Tukey’s ideas were broader, as reflected in the quotations below.
Exploratory data analysis is detective work — numerical detective work or counting detective work — or graphical detective work. (Mosteller and Tukey 1977)
There is nothing better than a picture for making you think of questions you had forgotten to ask (even mentally). (Tukey and Tukey 1985)
Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone—as the first step. (Mosteller and Tukey 1977)
The Layered Grammar of Graphics implemented in R package ggplot2 derives from this tradition initiated by John Tukey, which encourages data exploration during analysis and the ad-hoc construction of graphical displays tailored to the questions at hand.
2 The role of EDA in research
Exploratory data analysis can play two distinct roles in research, depending on the type of experiment or survey. In both approaches EDA plays a crucial role (Figure 2).
1. EDA = Detective work for Quality Control (QC). Some research is based on testing hypotheses set a priori (before the experiment is designed, or at least independently of the data collected in the current study), in which case EDA can be seen mainly as a quality-control step asking whether there is anything unusual in the data that suggests problems during its collection or in the assumptions on which the design of the study was based.
2. EDA = Detective work searching for interesting relationships and patterns in data that may reveal how the system under study works or is structured, combined with QC. In other cases research aims at describing the system under study, in most cases searching for meaningful relationships among variables. In such research EDA plays an additional role: it conveys information about the system under study. In such cases it is usually necessary to divide the data into a set used to generate testable hypotheses during EDA and a set used for validation (= testing of hypotheses) during Confirmatory Data Analysis (CDA); a minimal sketch of such a split is shown after this list.
3. EDA-like utilitarian data analysis. It is also possible to have as an aim the development of a purely empirical way of obtaining predictions. This is almost never the aim in basic scientific research, because science aims at improving understanding. However, this tends to be the norm when artificial intelligence and machine learning are used in practical applications.
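A minimal sketch of such a random split (the 50/50 proportion and the use of the iris data here are only illustrative choices, not part of the original text):
set.seed(42)
explore_rows <- sample(nrow(iris), size = nrow(iris) %/% 2)
iris_explore <- iris[explore_rows, ]  # used during EDA to generate hypotheses
iris_confirm <- iris[-explore_rows, ] # held back for confirmatory data analysis (CDA)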
Tip
Even if an experiment or survey is designed to test hypotheses set a priori, EDA can suggest new hypotheses to be tested in future experiments. Even if not testable with the same data, such hypotheses can be discussed during communication of results as something to be considered in future studies. However, in all cases it is imperative to clearly distinguish statements supported by strong evidence from new hypotheses suggested by the results of an experiment or survey.
The diagram in Figure 2 shows the position of EDA in the research process. Thick solid arrows show the most direct flow of information, while the thin and dotted arrows show information-dependent connections that are also important. The round “arrow” heads indicate that the connection imposes restrictions. For example, the design of an experiment determines how the data collected can be analysed and interpreted. Decisions about what can be considered an outlier are informed by anomalies observed during the course of an experiment, and sometimes even design decisions may need to be modified during the realization of a study. I have included Tests of Hypotheses and Model Selection as alternatives, but the wording is rather unsatisfactory. The distinction is between testing hypotheses set a priori, versus searching for a model (or explanation) that describes the data well. In the first case we are interested in the P-values themselves, in the second case in the actual parameter estimates, such as the slope and intercept of a linear regression. In the first case, our a priori hypotheses are still hypotheses about the values of parameter estimates from a fit of a model to the data, so in both cases we may rely on fitting one or more models to the observed data for our statistical analysis.
%%{init: {"htmlLabels": true} }%%
flowchart TD
Z([background
information]) ==> Y(Hypothesis)
Y ==> A(Design) ==> Aa(Planning) ==> B(Realization) ==> H(Data collection) =={2.}==> C
C[_full_ **EDA**] ==> D(Model
Selection) == "_R2, f(xi)_, AIC, BIC" ==> E(Interpretation) ==> F([communication])
H == {1.} ==> I[_QC_ **EDA**]
H ==deposit
data+metadata=====> X([data
repository])
I ==> G(**CDA**
Tests of
Hypothesis) == _P_-value ==> E
E --follow up
study--> Y
C <-- new/modified
hypothesis --> Y
C --improved
design--> A
I --improved
design--> A
B <-.-> H
A -.-o D
A -.-o E
A -.-o G
F --scientific
literature--> Z
X --"open data"--> Z
Z ==> E
linkStyle 5,6,7,14,15 stroke: blue
linkStyle 9,11,12,16 stroke: green
Figure 2: A diagram showing the steps of scientific studies. The thick arrows describe the sequence of events/actions, connecting the design of an experiment to the communication of the results. Two paths, 1. for hypothesis-based research and 2. for descriptive studies, are highlighted (see main text). The dotted arrows with round heads indicate constraints imposed by design-related decisions. The double-headed dotted arrow describes how the realization of a study can be influenced by data observed during its course, especially when data are collected repeatedly. Thin arrows indicate how one study can affect subsequent studies. QC = Quality Control or sanity checks of data. Even when no hypothesis testing is done, a hypothesis of what variables are of interest is involved in deciding what data are going to be used or collected. Only if no formal hypothesis testing is involved can we revise this weaker hypothesis during data analysis. This abstraction can be applied to empirical research, but with small changes (not shown) also to simulation studies.
In most cases EDA relies heavily on the graphical display of data. It is however important to remember that different types of plots make it easy to visualize different aspects of a data set. In this page I show some examples using R and package ‘ggplot2’. These are very far from being exhaustive. The detective work in most cases involves imagining and implementing a data visualization able to reveal the information we are interested in.
3 Quality control
Before any plotting it is good to start by tabulating the data. table() is useful when we have categorical variables or factors, as we can use it to get counts of observations.
table(iris$Species)
setosa versicolor virginica
50 50 50
With two factors we get a two-way table, and with additional factors, additional 2D tables. From this very simple table we can see that we have 50 observations per species. Had we wrongly typed the species name in one cell, we would get a different table.
# Make a version of iris with a wrongly encoded value
bad_iris <- iris
bad_iris$Species <- as.character(bad_iris$Species)
bad_iris$Species[5] <- "acetosa"
bad_iris$Species <- factor(bad_iris$Species)
table(bad_iris$Species)
acetosa setosa versicolor virginica
1 49 50 50
We now know that there is an incorrectly encoded value, and we can find in which row it is.
which(bad_iris$Species == "acetosa")
[1] 5
We can also replace it, if we are sure that this is a mistake. The code below will replace all instances of "acetosa" by "setosa".
bad_iris$Species[bad_iris$Species == "acetosa"] <- "setosa" # replaces the value
bad_iris$Species <- factor(bad_iris$Species) # removes the unused level
table(bad_iris$Species)
setosa versicolor virginica
50 50 50
In a case like this, with 150 rows, I find this approach easier than editing the data file. With thousands of observations, script-based approaches like this are the only viable option.
The organization of the data frame seems to be fine, so we may want to use summary() for a quick inspection.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
summary() is useful in that it provides a quick view of all the columns of the data frame and also reports the number of NA values in each column.
I am not showing an example, but when reading data from a text file (e.g., a .csv file), if the decimal mark (, or .) is not recognized as such, numbers are read as text (character values) instead of as numbers (numeric values). It is enough for a single value in a column to have the wrong decimal marker for the whole column to be read as text. summary() can also help in such a case, as the summary for the affected column will not be numeric.
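As an illustration (the small data frame below is my own, not part of the original page), a single value with a comma as decimal mark forces the whole column to be read as character; a substitution followed by conversion repairs it:
d <- data.frame(x = c("1.2", "3,4", "5.6")) # one value uses a comma as decimal mark
summary(d)                                  # x is summarised as character, not numeric
d$x <- as.numeric(sub(",", ".", d$x, fixed = TRUE)) # replace the comma, then convert
summary(d)                                  # x is now numeric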
Tip
NA is an abbreviation for Not Available; it is used to indicate missing data: observations that should have been available but were lost or are missing because of an “accident”. NA values propagate in arithmetic operations: one or more missing values in the input make the result, which would have been obtained had the value(s) been available, also not available or unknown. In general, in R, removal of NA values is not the default, but it can be explicitly requested.
NA+10
[1] NA
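Many summary functions accept the argument na.rm to request such removal explicitly; a small example (mine, not from the original page):
mean(c(1, NA, 3))               # NA, because one value is missing
mean(c(1, NA, 3), na.rm = TRUE) # 2, computed after dropping the NA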
NaN is an abbreviation for Not a Number, NaN values are generated by undefined arithmetic operations.
log(-1) # is undefined
[1] NaN
1/0 # is a valid operation!
[1] Inf
-5/0 # is a valid operation!
[1] -Inf
1000/Inf # is a valid operation!
[1] 0
4 Plots of single variables
The plots we will create today are for data exploration, not communication of results. If variable names are self-explanatory, as in the iris data set, we do not need to change axis labels or how the plots look, except to ensure that the features of the data are clearly displayed.
library(ggplot2)
theme_set(theme_bw(16))
4.1 Box plot
A very popular type of plot for EDA is the box plot (more descriptively also called box-and-whiskers plot). John Tukey’s most important and original contribution was on EDA methods, including box plots, not Tukey’s HSD. In the early 1970s ease of drawing and ease of computation were still crucial for a method to be widely used, and this was one of the goals in most of Tukey’s work on EDA. In the iris data set we have measurements of the size of different flower parts, petals and sepals. For each of them, width and length are available.
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot()
Caution
The advice is not to use box plots with small data sets, say with fewer than 10 to 15 observations per box (here we have 50 observations for each species). The shape of the boxes and whiskers is quite reasonable with 50 observations per species; with very few observations box plots are less informative than showing all the observations individually.
An enhancement to simple box plots is to show an approximate 95% confidence interval as a notch in the box plot. We can also make them narrower.
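This is done with the notch and width arguments of geom_boxplot():
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot(notch = TRUE, width = 0.3)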
4.2 Dot plot
For exploration one would think that plotting all observations would be ideal, but how can we ensure that all observations are visible rather than occluded by other observations? We can use transparency, but this works well only with few overlaps, i.e., in those situations where we have too few observations for a more complex visual summary like a box plot or a density plot. In the example I use alpha = 0.25, meaning that a point is 1/4 opaque and 3/4 transparent; for a point to look black, four or more observations have to overlap.
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_point(alpha = 0.25)
Another approach is to randomly displace the points along the x axis to avoid overlaps.
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_point(position = position_jitter(width = 0.25))
4.3 Violin plot
Surely we can do better than this with modern computers! Box plots aim at describing the distribution of observations based on the median, quartiles and range, and possible “outliers”. Violin plots are like box plots but describe the density distribution with a fitted curve.
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_violin()
4.4 Beeswarm plot
A final alternative is to combine the idea behind violin plots with showing the individual observations: the points are arranged so that their outline approximates the density.
library(ggbeeswarm)
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_quasirandom(width = 0.25)
To any of the plots shown above we can add summaries, displayed either as numbers, graphically, or both.
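For example, we can overlay bootstrap confidence intervals for the means with stat_summary() and add the number of observations per group with stat_group_counts() from package ‘ggpmisc’ (mean_cl_boot() additionally requires package ‘Hmisc’ to be installed):
library(ggbeeswarm)
library(ggpmisc)
ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_quasirandom(width = 0.25) +
  stat_summary(fun.data = mean_cl_boot, colour = "red", linewidth = 1) +
  stat_group_counts(geom = "text", label.y = 0.1, label.x = "factor")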
These are some of the common variations on the theme of box plots or, more accurately, plots that give information not only on the central tendency and spread of the data but also on the shape of the empirical density distribution. This approach differs from showing means and standard errors, or confidence intervals, which implicitly assumes that the data have been sampled from a population that follows a Normal distribution.
4.5 What do these plots tell us
In the case of variable Petal.Length from the iris data set it is clear that the spread of the values is smaller for I. setosa than for the other two species. The distributions are rather symmetrical around the median, and no unusual or unexpected observations, or outliers, can be seen. What is considered an outlier in box plots, and plotted individually, should be taken with a grain of salt, as it can be affected by the number of observations plotted. For example, I would not worry about the highlighted “outliers” in the data for I. setosa. In this case, what would need to be considered for a data analysis is the heteroscedasticity, i.e., the differences in spread or variance.
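A quick numerical check of the per-species spread (my own addition, using base R) complements what the plots show:
aggregate(Petal.Length ~ Species, data = iris, FUN = sd) # standard deviation per species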
We would make similar plots for other variables and explore their properties.
4.6 What type of plot to use?
We are exploring the data, so essentially we can use whatever works best for us and our data. However, one key consideration is the number of observations. If we have many hundreds of observations, violin plots are the best option, as anything else becomes cluttered. The beeswarm plot is useful from several tens of observations up to a couple of hundred, if one adjusts the size of the points and the width of the swarm. When using points to represent individual observations, the upper limit depends on the spread of the data compared to the plotting area (we can see this effect for Iris setosa compared to the other species). I would avoid jitter, as I personally find it difficult to visualize the density in such plots, although it is used quite frequently.
5 Exploring relationships
In the iris data set we could be interested in the relationships among the measured variables. Even if, when plotted individually, the data for each measured variable do not show anything unusual, a more subtle kind of outlier would be an unexpected relationship between petal length and width.
5.1 Scatter plot
A scatter plot is the simplest, and usually the first, approach to try. As there is now variation along both the x and y axes, overlaps are less likely for the same number of observations than in the case of dot plots (it is anyway important to use transparency to make sure overlaps do not remain unnoticed). We have three species, so I would start by using separate panels. We are not much interested in the actual results or interpretation; we are mainly focusing on the quality of the data and on properties that will affect how to analyse it.
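A first attempt, with semitransparent points and free scales in each panel:
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25) +
  facet_wrap(facets = vars(Species), scales = "free")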
We can see two interesting things: 1) in the real world width and length are continuous variables, but the measurements have been made to the nearest 0.1 units, and 2) because of this there are quite a few overlapping observations, especially for I. setosa because of the smaller spread of its values: we see only 20 points but we have 50 observations. This is a rather uncommon situation, but as before we can use jitter, now along both x and y. Using as maximum displacement half the resolution of the measured quantities gives a more reasonable set of plots. Some partial overlaps remain, but they are clearly visible.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species), scales = "free")
Now that we have dealt with the overlaps, we can use a common scale, to be able to compare the species.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species))
The same plot can also be drawn using logarithmic scales on both axes.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  geom_point(alpha = 0.25, position = position_jitter(width = 0.05, height = 0.05)) +
  facet_wrap(facets = vars(Species)) +
  scale_x_log10() +
  scale_y_log10()
With many observations we face again the problem of overlaps, and the best option is again to describe the 2D empirical distribution of the data. As with violin plots, outliers can disappear from our view; one approach is to add the outliers as individual observations on top of a density plot. We start with a plain density plot. Package ‘ggplot2’ provides stat_density_2d(); however, stat_hdr() from package ‘ggdensity’ presents the areas in a more useful way (and for this particular data set stat_density_2d() fails when used with its default arguments). I use the iris data set, although this approach is especially well suited to large data sets (I have used it with over 500,000 observations per plot).
library(ggdensity)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  stat_hdr() +
  expand_limits(x = c(-0.5, 3), y = c(0, 7.5)) +
  facet_wrap(facets = vars(Species))
To understand the relationship between the density estimates and the observations, we can add the observations as semitransparent points.
library(ggdensity)
ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +
  stat_hdr() +
  geom_point(colour = "yellow", alpha = 0.5,
             position = position_jitter(width = 0.05, height = 0.05)) +
  expand_limits(x = c(-0.5, 3), y = c(0, 7.5)) +
  facet_wrap(facets = vars(Species))
6 Bad data are not always obvious
To end, I will retell from memory what I think is a wonderful demonstration that plotting can reveal patterns that otherwise remain hidden; moreover, different types of plots can reveal different features, including problems. Data set barley, included in package ‘lattice’, contains the yield in bushels per acre of 10 barley cultivars grown in Minnesota field trials at six sites in the years 1931 and 1932. The question was most recently revisited by Wright (Wright, Kevin. 2013. “Revisiting Immer’s Barley Data.” The American Statistician 67 (3): 129–133).
The story goes like this: there is what looks like an error in the data coding that was discovered only in 1993, nearly 60 years after the first publication of the data. The first report of the problem was by Cleveland (1993), who popularized “trellis” plots, which are equivalent to what are called facets in ‘ggplot2’. Most astonishing is that the data had been used in the interim in examples in statistics books, with the error unnoticed. This just highlights the need to be careful about EDA, not assuming that it is very easy or that errors will be in plain sight in the first plot we make. The time spent checking the data is very well spent, as it avoids difficulties at all later stages and protects us from reaching wrong conclusions.
The plots below are what I would use nowadays, without consulting the publications by Cleveland and collaborators, of which I first became aware in the early 2000’s.
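First the individual yields per variety, site and year, and then the site medians per year (the barley data set is loaded from package ‘lattice’):
library(lattice) # provides the barley data set
library(ggplot2)
ggplot(barley, aes(x = year, y = yield)) +
  geom_col(width = 0.67) +
  expand_limits(y = 0) +
  facet_grid(rows = vars(variety), cols = vars(site)) +
  theme_bw(12)

ggplot(barley, aes(x = year, y = yield)) +
  stat_summary(fun = median, geom = "col", width = 0.67) +
  expand_limits(y = 0) +
  facet_grid(cols = vars(site)) +
  theme_bw(12)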
We can make the difference even more visible by better distinguishing the years. The plot below is more refined graphically than the usual EDA plots, but it tries to demonstrate that the choice of graphical representation affects which patterns we recognize easily.
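Mapping year to the fill aesthetic and using a grey fill scale makes the two years easy to tell apart:
library(lattice)
ggplot(barley, aes(x = year, y = yield, fill = year)) +
  stat_summary(fun = median, geom = "col", colour = "black", width = 0.67) +
  expand_limits(y = 0) +
  scale_fill_grey() +
  facet_grid(cols = vars(site)) +
  theme_bw(12) +
  theme(legend.position = "top") # applied after theme_bw() so it is not reset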
Errors become obvious when we discover them, but before we plot the data in a revealing way they are not obvious at all! In this case, it is difficult after so many years to get to the root of the problem. We may guess a miscoding of years for the data from Morris, but as far as I know this is just a hypothesis. Had the odd results been noticed by those collecting the data, and either corrected or described as a real feature when they reported them in a publication, we could be sure of what had happened.
References
Anscombe, F. J. 1973. “Graphs in Statistical Analysis.”The American Statistician 27 (1): 17. https://doi.org/10.2307/2682899.
Cleveland, William S. 1985. The Elements of Graphing Data. Wadsworth, Inc.
Koponen, Juuso, and Jonatan Hildén. 2019. Data Visualization Handbook. Espoo, Finland: Aalto University.
Koponen, Juuso, Jonatan Hildén, and Tapio Vapaasalo. 2016. Tieto Näkyväksi: Informaatiomuotoilun Perusteet. Aalto yliopisto.
Mosteller, F., and J. W. Tukey. 1977. Data Analysis and Regression. Reading, Massachusetts: Addison-Wesley Publishing Company.
Tukey, John W., and P. Tukey. 1985. “Computer Graphics and Exploratory Data Analysis: An Introduction.” In Proceedings of the Sixth Annual Conference and Exposition: Computer Graphics 85, III:773–85. Fairfax, VA.
---title: "EDA with 'ggplot2'"subtitle: "EDA = Exploratory Data Analysis"author: "Pedro J. Aphalo"date: 2023-05-24date-modified: 2023-07-27categories: [R, plotting]keywords: [ggplot2 pkg, data visualization, dataviz]format: html: code-fold: false code-tools: truebibliography: design-exp.bibabstract: | EDA and 'ggplot2' have a common ancestry. EDA as originally proposed by John W. Tukey included data visualization as a key tool. In this page I show examples of some of the data visualizations that are used in EDA, together with the R code I used to create them. I also briefly describe EDA's position in the scientific research process and mention its origins.---::: callout-noteTo see the source of this document click on "</> CODE" to the right of thepage title. The page is written using [Quarto](https://quarto.org/) which is an enhanced version of [R Markdown](https://rmarkdown.rstudio.com/). The diagrams are created with [Mermaid](https://mermaid.js.org/), a language inspired by the simplicity of [Markdown](https://daringfireball.net/projects/markdown/syntax).:::::: callout-warningPackage ['ggplot2'](https://ggplot2.tidyverse.org/) has gained new features over its long life, and although few changes have been 'code breaking' you should be aware that the examples in this page have been tested with version (==3.4.2).:::# IntroductionExploratory Data Analysis (EDA) as proposed by John W. Tukey is strongly basedon data visualization. There is a contrast in how a plot and a numerical summary present the data to us. A well-designed data display shows the data in its context, while an isolated numerical summary does not. The _Data visualization handbook_ (@Koponen2019), and its Finnish language version, _Tieto näkyväksi_ (@Koponen2016), provides a detailed account of how to produce accurate and effective graphs and diagrams based on data. ::: callout-tip# In William S. Cleveland's wordsGraphs are exceptionally powerful tools for data analysis. The reason is nicely encapsulated in a sentence from a 1982 letter written to me by W. Edwards Deming: "Graphical methods can retain the information in the data." Numerical data analytic procedures-such as means, standard deviations, correlation coefficients, and t-tests-are essentially data reduction techniques. Graphical methods complement such numerical techniques. Graphical methods tend to show data sets as a whole, allowing us to summarize the general behavior and to study the detail. This leads to much more thorough data analyses. [@Cleveland1985]:::@fig-Anscombe demonstrates how misleading it can be to rely only on numerical summaries of data.```{r setup, include=FALSE}knitr::opts_chunk$set(echo = TRUE, warning = FALSE, fig.height = 8, fig.width = 8)``````{r, message=FALSE}#| label: fig-Anscombe#| fig-cap: "**Anscombe's linear regression examples.** The artificial data sets [@Anscombe1973] plotted here demonstrate that even if very different they yield exactly the same numerical estimates when a linear regression model is fit to them. 
These plots highlight how much information numerical estimates can hide by reducing the observed data into a few numerical values and why extensive use of plots as a tool in data analysis is encouraged."#| code-fold: true# we rearrange the datalibrary(ggplot2)library(ggpmisc)my.mat <- matrix(as.matrix(anscombe), ncol=2)my.anscombe <- data.frame(x = my.mat[ , 1], y = my.mat[ , 2], case=factor(rep(1:4, rep(11,4))))ggplot(my.anscombe, aes(x,y)) + geom_point(shape=21, fill="orange", size=3) + stat_poly_line() + stat_poly_eq(use_label(c("eq", "R2", "P"))) + facet_wrap(~case, ncol=2) + theme_bw(16)```The paper [Remembrances of Things EDA](https://nightingaledvs.com/remembrances-of-things-eda/) @Friendly2022 recounts the history of EDA and the people involved in its development. In the early 1970's emphasis was on hand-drafted plots, given that the predominant way of interacting with computers was still based on text-only consoles and displays. Plots like boxplots are easy to draw by hand and can be constructed from values that are easy to calculate. Although EDA is frequently associated with such plots, John W. Tukey's ideas were broader, as reflected in the quotations below._Exploratory data analysis is detective work — numerical detective work or counting detective work — or graphical detective work._ [@Mosteller1977]_There is nothing better than a picture for making you think of questions you had forgotten to ask (even mentally)._ [@Tukey1985]._Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone—as the first step._ [@Mosteller1977]The Layered Grammar of Graphics implemented in R package [ggplot2]() derives from this tradition initiated by John Tukey, which encourages data exploration during analysis and the ad-hoc # The role of EDA in researchExploratory data analysis can play two distinct roles in research, depending on the type of experiment or survey. In both approaches EDA plays a crucial role (@fig-EDA-flowchart).1. _EDA = Detective work for Quality Control (QC)_. Some research is based on testing of hypotheses set _a priori_ (before the the experiment is designed, or at least independently of the data collected in the current study), in which case EDA can be seen mainly as a quality control step asking whether there is anything unusual in the data that suggests problems during its collection or in the assumptions on which the design of the study was based.2. _EDA = Detective work searching for interesting relationships and patterns in data that may reveal how the system under study works or is structured, combined withQC._ In other cases research aims at describing the system under study, in most cases, searching for meaningful relationships among variables. In such research EDA plays an additional role, it conveys information about the system under study. In such cases it is in most cases necessary to divide the data into a set used to generate testable hypotheses during EDA and a set used for validation (= testing of hypotheses) during Confirmatory Data Analysis (CDA). 3. _EDA-like utilitarian data analysis._ It is also possible to have as an aim to develop a purely empirical way of achieving a prediction. This is almost never the aim in basic scientific research because science aims at improving understanding. 
However this tends to be the norm when artificial intelligence and machine learning are used in practical applications.::: callout-tipEven if an experiment or survey is designed to test hypotheses set _a priori_, EDA can suggest new hypotheses to be tested in future experiments. Even if not testable with the same data, such hypotheses can be discussed during communication of results as something to be considered in future studies. However, in all cases it is imperative to clearly distinguish statements _supported by strong evidence_ from _new hypotheses suggested_ by the results of an experiment or survey.:::The diagram in @fig-EDA-flowchart shows the position of EDA in the research process. Think solid arrows show the most direct flow of information, while the thin and dotted arrows show information-dependent connections that are also importantl. The dot "arrow" heads indicate that the connection imposes restrictions. For example the design of an experiment determines how the data collected can be analysed and interpreted. Decisions about what can be considered an outlier are informed by anomalies observed during the course of an experiment, and sometimes even design decisions may need to be modified during the realization of a study. I have included _Tests of Hypotheses_ and _Model Selection_ as alternatives, but wording is rather unsatisfactory. The distinction is between testing hypotheses set _a priori_, versus searching for a model (or explanation) that describes the data well. In the first case we are interested in the _P_-values themselves, in the second case in the actual parameters estimates, such as the slope and intercept of a linear regression. In the first case, our _a priori_ hypotheses are hypotheses still about the values of parameter estimates from a fit of data to a model, so, in both cases we may rely on fitting one or more models to the observed data for our statistical analysis.```{mermaid}%%| label: fig-EDA-flowchart%%| fig-cap: A diagram showing the steps of scientific studies. The thick arrows describe the sequence of events/actions, connecting the design of an experiment to the communication of the results. Two paths, <font color=green>**1.** for hypothesis based research</font> and <font color=blue>**2.** for descriptive studies</font>, are highlighted (see main text). The dotted arrows with round heads indicate constraints imposed by design-related decisions. The double headed dotted arrow describes that the realization of a study can be influenced by data observed during its course, especially when data are collected repeatedly. Thin arrows indicate how one study can affect subsequent studies. _QC_= Quality Control or sanity checks of data. Even when no hypothesis testing is done, a hypothesis of what variables are of interest is involved in deciding what data are going to be used or collected. 
_Only if no formal hypothesis testing is involved, we can revise this weaker hypothesis during data analysis._ This abstraction can be applied to empirical research, but with small changes (not shown) also to simulation studies.%%{init: {"htmlLabels": true} }%%flowchart TD Z([background information]) ==> Y(Hypothesis) Y ==> A(Design) ==> Aa(Planning) ==> B(Realization) ==> H(Data collection) =={2.}==> C C[_full_ **EDA**] ==> D(Model Selection) == "_R2, f(xi)_, AIC, BIC" ==> E(Interpretation) ==> F([communication]) H == {1.} ==> I[_QC_ **EDA**] H ==deposit data+metadata=====> X([data repository]) I ==> G(**CDA** Tests of Hypothesis) == _P_-value ==> E E --follow up study--> Y C <-- new/modified hypothesis --> Y C --improved design--> A I --improved design--> A B <-.-> H A -.-o D A -.-o E A -.-o G F --scientific literature--> Z X --"open data"--> Z Z ==> E linkStyle 5,6,7,14,15 stroke: blue linkStyle 9,11,12,16 stroke: green```In most cases EDA relies heavily on the graphical display of data. It is however important to remember that different types of plots make it easy to visualize different aspects of a data set. In this page I show some examples using R and package 'ggplot2'. These are very far from being exhaustive. The detective work in most cases involves imagining and implementing a data visualization able to reveal the information we are interested in.# Quality controlBefore any plotting it is good to start by tabulating the data. `table()` is useful when we have categorical variables or factors, as we can use it to get counts of observations.```{r}table(iris$Species)```With two factors we get a two-way table, and with additional factors, additional 2D tables. From this very simple table we can see that we have 50 observations per species. Had we had wrongly typed the species name in one cell we would get a different table.```{r}# Make version of iris with a wrongly encoded valuebad_iris <- irisbad_iris$Species <-as.character(bad_iris$Species)bad_iris$Species[5] <-"acetosa"bad_iris$Species <-factor(bad_iris$Species)``````{r}table(bad_iris$Species)```We now know that there is an incorrectly encoded value, and we can find in which row it is.```{r}which(bad_iris$Species =="acetosa")```We can also replace it, if we are sure that this is a mistake. The code below will replace all instances of `"acetosa"` by `"setosa"`.```{r}bad_iris$Species[bad_iris$Species =="acetosa"] <-"setosa"# replaces the valuebad_iris$Species <-factor(bad_iris$Species) # removes the level``````{r}table(bad_iris$Species)```In a case like this with 150 rows, I find this approach easier than editing a file. With thousands of observations all other viable options are based on scripts.The organization of the data frame seems to be fine, so we may want to use `summary()` for a quick inspection.```{r}summary(iris)```Summary is useful in that it provides a quick view of all the columns of the data frame and will also report the number of `NA` values in each colum.I am not showing an example, but when reading data from a text file (e.g., `.csv` file) if the decimal mark (`,` or `.`) is not recognized as such, numbers are read as text (`character` values) instead numbers (`numeric` values). It is enough that a single value in a whole column has the wrong decimal marker, for the whole column to be read as text. `summary()` can also help in such a case, as the summary for the affected column will not be numeric. ::: callout-tip`NA` is an abbreviation for _Not Available_, it is used to indicate missing data. 
Observations that should have been available but were lost or are missing because of an "accident". They propagate in arithmetic operations. One or more missing values make the result that would have been obtained had the value(s) been available, also not available or unknown. In general, in R, removal of `NA` values is not the default, but can be explicitly requested.```{r}NA+10````NaN` is an abbreviation for _Not a Number_, `NaN` values are generated by undefined arithmetic operations.```{r}log(-1) # is undefined1/0# is a valid operation!-5/0# is a valid operation!1000/Inf# is a valid operation!```:::# Plots of single variablesThe plots we will create today are for data exploration, not communication of results. If variable names are self explanatory as in the `iris` data set we do not need to change axis labels or change how plots look, except to ensure that the features of the data are clearly displayed.```{r}library(ggplot2)theme_set(theme_bw(16))```## Box plotA very popular type of plot for EDA are box plots (more descriptively, also called box and whiskers plots). John Tukey's most important and original contribution was on EDA methods, including box plots, not Tukey's HSD.In the early 1970's easy of drawing and easy of computation were still crucial for method to be widely used, and this one of the goals in most of Tukey's work on EDA. In the `iris` data sets we have data from measurements of the size of different flower parts, petals and sepals. For each of them, width and length are available.```{r}ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_boxplot()```::: callout-cautionThe advice is not to use box plots with small data sets, say with fewer than 10 to 15 observations per box plot (e.g., 50 observations for each of species in the figure above). The shape of the boxes and whiskers is quite reasonable with 50 observations per species, with very few observations the box plots are less informative than showing all the observations individually.:::An enhancement to simple box plots is to show an approximate 95% confidence interval as a notch in the box plot. We can also make them narrower.```{r}ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_boxplot(notch =TRUE, width =0.3)```## Dot plotFor exploration one would think that plotting all observations would be ideal, but how can we ensure that all observations are visible rather than occluded by other observations. We can use transparency, but this works well only with few overlaps, so works well in those situations when we have too few observations for a more complex visual like a boxplot or density plots. In the example I use `alpha = 0.25` meaning that a point is 1/4 opaque and 3/4 transparent. For a point to look black, four or more observations have to overlap.```{r}ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_point(alpha =0.25)```Another approach is randomly displace points along the _x_ axis to avoid overlaps.```{r}ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_point(position =position_jitter(width =0.25))```## Violin plotSurely we can do better than this with modern computers! Box plots aim at describing the distribution of observations based on the median, quartiles and range, and possible "outliers". 
Violin plots are like box plots but describe the density distribution with a fitted curve.```{r}ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_violin()```## Beeswarm plotA final alternative is to combine violin plots with showing the observations.```{r}library(ggbeeswarm)ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_quasirandom(width =0.25)```To any of the plots shown above we can add numerical summaries, either numerically, graphically or both.```{r}library(ggbeeswarm)ggplot(iris, aes(x = Species, y = Petal.Length)) +geom_quasirandom(width =0.25) +stat_summary(fun.data = mean_cl_boot, colour ="red", linewidth =1) +stat_group_counts(geom ="text",label.y =0.1,label.x ="factor")```These are some of the common variations around the theme of boxplots, or more accurately plots that give information not only on the _central tendency_ and _spread_ of the data but also on the _shape_ the empirical density distribution. This approach differs from the approach of showing means and standard errors, or confidence intervals, that implicitly assume that the data have been sampled from a population that follows a Normal distribution.## What do these plots tell usIn the case of variable `Petal.Length` from the `iris` data set it is clear that the spread of the values is less for _I. setosa_ than for the other two species. The distributions are rather symmetrical around the median, and no unusual or unexpected observations, or _outliers_ can be seen. What is considered an outliers in box plots and plotted individually should be taken with a grain of salt, as it can be affected by the number of observations plotted. For example, I would not worry about the highlighted "outliers" in the data for _I. setosa_. In this case, what would need to be considered for a data analysis is the heterosedasticity or differences in spread or variance.We would make similar plots for other variables and explore their properties.## What type of plot to use?We are exploring the data, so essentially we can use what works best for us and our data. However, one key consideration is the number of observations. If we have many hundreds of observations, then violin plots are the best option, as anything else would become cluttered. The _beeswarm_ plot is useful with several tens of observations up to a couple of hundreds if one adjusts the size of the points and the width of the swarm. When using points to represent individual observations, the upper limit depends on the spread of the data compared to the plotting area (we can see this effect for _Iris setosa_ compared to the other species). I would avoid jitter, as I personally find it difficult to visualize the density in such plots, although it is used quite frequently.# Exploring relationshipsIn the `iris` data set we could be interested in the relationships among the measured variables. Even if when plotted individually the data for each measured variable do not show anything unusual, a more subtle kind of outlier would be an unexpected relationship between petal length and width.## Scatter plotA scatter plot is the simplest and usually first approach to try. As now there is variation both along _x_ and _y_ axes, overlaps are less likely for the same number of observation than in the case of dot plots (it is anyway important to use transparency to make sure overlaps do not remain unnoticed). We have three species, so I would start by using separate panels. 
We are not much interested in the actual results or interpretation, we are mainly focusing on quality of the data and properties that will affect how to analyse it.```{r}ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +geom_point(alpha =0.25) +facet_wrap(facets =vars(Species), scales ="free")```We can two interesting things: 1) the real world width and length are continuous variables, but measurements have been done to the nearest 0.1 units, and 2) because of this there are quite many overlapping observations, specially for _I. setosa_ because of the smaller spread of the values: we see only 20 points but we have 50 observations. This is a rather uncommon situation, but as earlier we can use jitter, but now along both _x_ and _y_. Using as maximum displacement half the resolution of the measured quantities gives a more reasonable set of plots. Some partial overlaps remain, but these are clearly visible.```{r}ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +geom_point(alpha =0.25, position =position_jitter(width =0.05, height =0.05)) +facet_wrap(facets =vars(Species), scales ="free")```Now that we have dealt with the overlaps, we can use a common scale, to be able to compare the species.```{r}ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +geom_point(alpha =0.25, position =position_jitter(width =0.05, height =0.05)) +facet_wrap(facets =vars(Species))``````{r}ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +geom_point(alpha =0.25, position =position_jitter(width =0.05, height =0.05)) +facet_wrap(facets =vars(Species)) +scale_x_log10() +scale_y_log10()```With many observations we face again the problem of overlaps, and the best option is again to describe the 2D empirical distribution of the data. As with violin plots outliers can disappear from our view. One approach is to add the outliers as individual observations to a density plot. We start with a plain density plot. Package 'ggplot2' provides `stat_density2D()`, however, `stat_hdr()` presents the areas in a more useful way. For this particular data set `stat_density2D()`, using default arguments fails. I use the `iris` data set, although this approach is specially well suited to large data sets (I have used it with over 500000 observations per plot).```{r}library(ggdensity)ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +stat_hdr() +expand_limits(x =c(-0.5, 3), y =c(0, 7.5)) +facet_wrap(facets =vars(Species))```To understand the relationship between the density estimates and the observations, we can add the observations as semitransparent points.```{r}library(ggdensity)ggplot(iris, aes(x = Petal.Width, y = Petal.Length)) +stat_hdr() +geom_point(colour ="yellow",alpha =0.5, position =position_jitter(width =0.05, height =0.05)) +expand_limits(x =c(-0.5, 3), y =c(0, 7.5)) +facet_wrap(facets =vars(Species))```# Bad data are not always obviousTo end I will freely tell from memory what is I think, I wonderful demonstration, that plotting can show patterns that otherwise remain hidden. More so, different types of plots can reveal different features, including problems. Data set `barley` included in package 'lattice' includes data for yield in bushels per acre of 10 barley cultivars in Minnesota field trials at six sites on years 1931 and 1932. The question was revisited most recently by Wright (Wright, Kevin (2013). Revisiting Immer's Barley Data. 
_The American Statistician_, **67**(3), 129–133.)The story goes like this, there is what looks like an error in the data coding, that was discovered only in 1993, or nearly 60 years after the first publication of the data. The first report about the problem was by Cleveland (1993), who popularized "trellis" plots, which are equivalent to what are called facets in 'ggplot2'. Most astonishing is that the data had been used in the interim in examples in statistics books, with the error unnoticed. This just, highlights the need to be careful about EDA, and not assuming that it is very easy, that errors will easily be in sight in the first plot done. The time spent checking the data is very well spent, as it avoids difficulties at all later stages and protects us from reaching wrong conclusions.The plots below are what I would have used nowadays without consulting the [publications by Cleveland and collaborators](https://hci.stanford.edu/courses/cs448b/papers/becker-trellis-jcgs.pdf) of which I first became aware of in the early 2000's.```{r}#| fig-width: 5#| fig-height: 10library(lattice)ggplot(barley, aes(x = year, y = yield)) +geom_col(width =0.67) +expand_limits(y =0) +facet_grid(rows =vars(variety), cols =vars(site)) +theme_bw(12)``````{r}#| fig-width: 5#| fig-height: 4library(lattice)ggplot(barley, aes(x = year, y = yield)) +stat_summary(fun = median, geom ="col", width =0.67) +expand_limits(y =0) +facet_grid(cols =vars(site)) +theme_bw(12)```We can make the difference even more visible by better distinguishing the years. This plot is more refined graphically than the usual EDA plots, but tries to demonstrate that the selection of graphical representation affects affects what patterns we recognize easily.```{r}#| fig-width: 5#| fig-height: 4library(lattice)ggplot(barley, aes(x = year, y = yield, fill = year)) +stat_summary(fun = median, geom ="col", colour ="black", width =0.67) +expand_limits(y =0) +scale_fill_grey() +facet_grid(cols =vars(site)) +theme(legend.position ="top") +theme_bw(12)```Errors become obvious when we discover them, but before we plot the data in a revealing way they are not obvious at all! In this case, it is difficult after some many years to get to the root of the problem. We may guess a miscoding of years for the data from Morris, but as far as I know, this is just a hypothesis. Had the odd results been noticed by those collecting the data, and either corrected or described as being a real feature when they reported them in a publication, we could be sure of what had happened.