To see the source of this document click on “</> CODE” to the right of the page title. The page is written using Quarto which is an enhanced version of R Markdown. The diagrams are created with Mermaid, a language inspired by the simplicity of Markdown.
Package ‘ggplot2’ has gained new features over its long life, and although few changes have been ‘code breaking’, you should be aware that the examples on this page have been tested with version (==3.4.2).
1 Grammar of Graphics
The Grammar of Graphics (GoG) was devised by Lee Wilkinson (1999) as a mathematical theory applicable to the construction of any data visualization. It is most relevant to the design of software for data visualization, and does not concern visual and perceptual aspects of plotting data. It can be depicted as a sequence of steps connected by operations (Figure 1).
Basing plot construction on a grammar gives users the freedom to create new types of data visualizations including those not imagined by the designers of software. This is similar to how a natural language like English allows writers and poets to express original ideas within a context that makes them understandable to others and amenable to be processed by existing software and hardware.
A recent article by J. Friedly (2022) in The Nightingale summarises a debate between L. Wilkinson and M. Friendly about what the GoG actually is.
The GoG is conceptual and identifies the key classes of objects starting from data and the classes of methods used to transform one to the next, giving a final graphic object that still needs to be rendered (J. Friedly 2022). To be used to actually construct data visualizations the GoG needs to be implemented as computer software. Use of the GoG as a paradigm affects how users of software will interact with a computer to construct plots, but designers of software user interfaces based on the GoG remain responsible for many decisions affecting how plots are in practice constructed.
One very successful software implementation of the GoG is R package ‘ggplot2’ and its extensions. I have myself published four R packages extending the types of plots that can be constructed using the GoG as implemented in ‘ggplot2’. These packages, although respecting Wilkinson’s original GoG, extend the implementation of this GoG in R.
2 Package ‘ggplot2’
It is best to think of ‘ggplot2’ and its extensions as a language used to specify how to build a plot. This language gives access to the GoG abstraction about the structure of data plots and how to build this structure step by step.
Understanding the GoG abstraction is the most important aspect of learning to create ggplots. Once one grasps the general picture, what remains is just the choice among different “building blocks”, as building blocks of the same “kind” can in most cases replace each other, differing drastically in the graphical output, but minimally in how they are combined into a plot-construction description.
I consider packages like ‘ggrepel’, ‘ggpmisc’, and ‘ggbeeswarm’, which provide extensions to the grammar, as extensions to package ‘ggplot2’. Some packages like ‘ggpubr’ mostly define functions that build whole plots using ‘ggplot2’ and return "gg" objects, but are designed to be used on their own. Such functions do not extend the grammar of graphics; they define their own user interface. Package ‘ggspectra’ has a double personality, as it extends the grammar but also defines special autoplot() methods for spectra. It remains consistent with ‘ggplot2’ because the generic method autoplot() is defined in package ‘ggplot2’.
3 The main steps
Unlike many other data plotting approaches, ggplots are constructed as R objects. So plotting data consists of constructing an object of class "gg" followed by its rendering. Like any R object, "gg" objects can be stored in variables and can be printed to display them. When printed, ggplots are rendered by default as a graphical representation; however, they can also be displayed as text to reveal their structure. That both representations can be obtained at each step of their construction and that "gg" objects can be built by the successive addition of components has two main advantages: 1) we can save parts of a "gg" object and reuse them, and 2) we can build a plot bit by bit and check the effect of each addition to the "gg" object.
It is possible, but infrequently needed, to edit an existing ggplot object: one can add, remove and replace components and also change the order of the layers. In most cases it is easier to edit the code used to create the plot, but if one does not have access to the original code or data, editing a ggplot can save the day. Editing is also an effective way of learning the internals of ggplots if one is interested in them. My package ‘gginnards’ makes editing easier. I developed this package to help me learn how ‘ggplot2’ works and to debug the code in my other packages with extensions to ‘ggplot2’.
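The first advantage mentioned above, saving and reusing parts of a plot, can be sketched roughly as follows. The variable names (base_plot, my_style and so on) are made up for this illustration; only the ‘ggplot2’ functions themselves come from the package.

```r
library(ggplot2)

# Store the shared part of the plot (data + default mapping) in a variable.
base_plot <- ggplot(data = mtcars, aes(x = disp, y = mpg))

# Reuse it to build two different plots; base_plot itself is not modified.
p_points <- base_plot + geom_point()
p_lines  <- base_plot + geom_line()

# Several components can be saved together in a list and added in one step.
my_style <- list(theme_bw(),
                 labs(x = "Displacement", y = "Miles per gallon"))
p_styled <- p_points + my_style

# Printing renders the plot graphically; summary() shows its structure as text.
print(p_styled)
summary(p_styled)
```

As base_plot is never altered by the additions, it can be safely reused any number of times.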
The steps needed to create a plot using the grammar of graphics are:
Build an R object with data and instructions for making the plot.
(Possibly add to or even “edit” the R object).
“Render” the plot (convert it into a graphical object).
Display the graphical object or save it in a file.
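The four steps listed above can be made explicit in code. This is only a sketch: normally printing a "gg" object does the rendering and display in one go, and the output file used here is a temporary file chosen for the example.

```r
library(ggplot2)

# 1. Build the object: data plus instructions; nothing is drawn yet.
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point()

# 2. Optionally add to the object.
p <- p + theme_bw()

# 3. Render: convert the "gg" object into a graphical object
#    (a gtable that collects the grobs).
g <- ggplotGrob(p)

# 4. Display on the current graphics device, or save to a file.
grid::grid.newpage()
grid::grid.draw(g)
out_file <- tempfile(fileext = ".pdf")  # hypothetical output location
ggsave(out_file, plot = p, width = 5, height = 4)
```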
4 The layers
We can use different abstractions to describe a plot, both static and dynamic. Structurally, a plot can be thought of as a stack of graphic layers, each drawn on a transparent imaginary substrate. Thus, just as when drafting a plot with ink on paper, what we draw first can be occluded by something we draw on top of it. When we build complex plots we construct a "gg" object layer by layer; these layers, even if not drawn at the time we add them, will be rendered into graphical objects in the order in which we have added them to the "gg" object.
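A small sketch of how the order of addition determines what is drawn on top. The sizes and line width are exaggerated on purpose so that the occlusion is easy to see; the variable names are arbitrary.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = disp, y = mpg))

# Points are added first, so the thick line is drawn on top of them.
p_line_on_top <- base_plot +
  geom_point(size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 3)

# Same layers in the opposite order: now the points occlude the line.
p_points_on_top <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 3) +
  geom_point(size = 4)

# Layers are stored in the object in the order in which they were added.
order_1 <- sapply(p_line_on_top$layers, function(l) class(l$geom)[1])
order_2 <- sapply(p_points_on_top$layers, function(l) class(l$geom)[1])
order_1
order_2
```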
The abstraction based on layers is the key to the flexibility of ggplots: we can build an almost infinite variety of plots by combining different layers, each of them quite simple and with an easy to understand role. In reality, we can also adjust things to an extent within each layer. Importantly, not only layer functions defined in ‘ggplot2’ but also layer functions defined in other R packages or in a user script can be used to add layers to ggplots, further expanding the range of available types of layers.
Defining new layer functions is fairly simple, as layer functions defined in extension packages or scripts can rely on ‘ggplot2’ to do most of the work. This suggests that the overwhelming success of ‘ggplot2’ is, similarly to the success of R itself, supported by the ease with which new types of plots can be implemented as extensions.
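As a rough illustration of how little code a layer function defined in a script may need, the sketch below wraps stat_summary() from ‘ggplot2’. The function stat_group_mean() is hypothetical, invented for this example, and is the simplest possible kind of "extension"; real extension packages usually define new statistics or geometries as ggproto objects.

```r
library(ggplot2)

# Hypothetical layer function: plots the mean of y for each value of x.
# It only wraps stat_summary(), so 'ggplot2' does all the real work.
stat_group_mean <- function(..., size = 4, shape = "diamond") {
  stat_summary(fun = mean, geom = "point", size = size, shape = shape, ...)
}

# Used like any other layer function.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_group_mean(colour = "red")
p

# The data reaching the geometry of the second layer: one row per group.
group_means <- layer_data(p, 2)
```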
5 The data flow
I now consider a dynamic abstraction: the data flow, which describes what transformations are applied to the data in different components of a ggplot (Figure 2). These transformations take place when the plot is rendered, not when it is built, and take place separately in each layer. A ggplot object contains data but also functions that describe the operations to be carried out on the data during rendering.
The data plotted in a ggplot can be shared among all (Figure 3) or some layers or be different for each layer (Figure 4). During rendering each layer generates graphical objects (grobs for short) and other code in ‘ggplot2’ creates the ancillary grobs such as those for the axes, grid, background and legends or keys. When a plot is rendered, all these grobs are collected into an R object, i.e., what code within ‘ggplot2’ creates are instructions to draw all graphical features in the final plot. This object is not yet the plot itself, which is rendered in a final step by R’s graphic devices into any of the formats supported by R.
As with all abstractions, the simple diagrams and explanations above ignored the real complexity. One of the ignored steps is crucial: how information in the data is encoded as graphical elements drawn in the plot, and how we can control this step or mapping. In the next section we will build a plot one step at a time.
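The data flow just described can be inspected with functions exported by ‘ggplot2’. The sketch below, with made-up variable names, uses one layer with the default data, one layer with its own data, and one layer whose statistic replaces the observations with fitted values.

```r
library(ggplot2)

# Layers 1 and 3 use the default (shared) data; layer 2 is given its own.
big_engines <- subset(mtcars, disp > 300)
p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  geom_point(data = big_engines, colour = "red", size = 3) +
  stat_smooth(method = "lm", formula = y ~ x)

# Building the object does not transform the data: it is stored as passed.
nrow(p$data)

# The transformations run at rendering time, separately for each layer.
built <- ggplot_build(p)
rows_per_layer <- sapply(built$data, nrow)  # rows reaching each geometry
rows_per_layer

# Last step before drawing: all the grobs collected into a single object.
g <- ggplot_gtable(built)
class(g)
```

Comparing nrow(p$data) with rows_per_layer shows that the statistic in the third layer passes to its geometry a data frame different from the one stored in the plot.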
6 Building a plot step by step
Building a plot one step at a time, and printing it at each step demonstrates how the grammar of graphics and ‘ggplot2’ work. The first step is to attach the packages we will use.
library(scales)
library(ggplot2)
library(gginnards)
An empty "gg" object can be rendered as a plot.
ggplot()
We pass a data frame containing the data to be plotted. As we pass it as an argument to ggplot(), it becomes the default data for the individual layers we will later add. With data alone the plot remains empty: only once a mapping is present does the range of values mapped to each aesthetic become known, and x and y axes get added.
ggplot(data = mtcars)
The mapping of variables in the data to plot aesthetics is done with function aes(). As we pass the value returned by function aes() as an argument to ggplot(), this mapping becomes the default mapping for the individual layers we will later add.
ggplot(data = mtcars,
aes(x = disp, y = mpg))
Geometries are layer functions. geom_point(), used here, creates a graphical representation of the data, as mapped to the x and y aesthetics, as symbols or points on the drawing area of the plot.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point()
Above, the shape and colour of the points are the default ones. We can, instead of mapping variables to aesthetics, assign constant values to aesthetics. This is best done directly as arguments to layer functions, as shown here, rather than using aes().
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point(color = "red", shape = "square open", size = 3)
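The difference between setting a constant and mapping through aes() can be checked in the data that reaches the geometry. A sketch, with arbitrary variable names:

```r
library(ggplot2)

# Constant passed as an argument: every point is drawn in red, no legend.
p_set <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(colour = "red")

# The same string inside aes() is treated as data: a variable with a single
# value mapped through the colour scale. A legend appears, and the colour
# used is chosen by the scale instead of being red.
p_mapped <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point(aes(colour = "red"))

colour_set    <- unique(layer_data(p_set)$colour)
colour_mapped <- unique(layer_data(p_mapped)$colour)
colour_set
colour_mapped
```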
The default statistic of geom_point() is stat_identity(), which does not alter the data. So by default geom_point() behaves as if no statistic was present, thus above the observations were plotted as is. All other statistics modify the data before it reaches the geometry. We add as an example stat_smooth(), which fits a smoother to the data. We override the default geometry of stat_smooth(), setting it to geom_line() with geom = "line".
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x)
The correspondence between values in the data and values of an aesthetic is controlled by the corresponding scale. Here we replace the scale used by default (scale_y_continuous()) by scale_y_log10() so that the y axis uses a logarithmic scale. The legends and tick labels still show the values before the transformation, which in most cases makes the plot easy to read. Note, however, that scale transformations are applied before the statistics are computed, so here the regression is fitted to the log10()-transformed values of mpg.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
scale_y_log10()
Coordinates are applied after the statistics “see” the data, so changing the limits with them is similar to zooming into a finalized plot based on all the data. This is very important to remember when statistics are used. In a plot like this one, using scale limits to zoom in would result in the regression being fitted only to the observations that fall within the limits, while using coordinate limits ensures that the regression is fitted to the whole data set.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_cartesian(ylim = c(15, 25))
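One way to verify the difference between the two kinds of limits is to recover the slope of the plotted regression line from the data computed for the layer. In this sketch line_slope() is a hypothetical helper written for the example; it is not part of ‘ggplot2’.

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

# Coordinate limits: zoom in, the fit uses all observations.
p_coord <- base_plot + coord_cartesian(ylim = c(15, 25))
# Scale limits: out-of-range observations are discarded (with a warning)
# before the statistic fits the regression.
p_scale <- base_plot + scale_y_continuous(limits = c(15, 25))

# Hypothetical helper: slope of the straight line drawn by layer i,
# recovered from the points along the line (missing values are skipped).
line_slope <- function(p, i = 2) {
  d <- suppressWarnings(layer_data(p, i))
  coef(lm(y ~ x, data = d))[["x"]]
}

slope_all   <- coef(lm(mpg ~ disp, data = mtcars))[["disp"]]
slope_coord <- line_slope(p_coord)
slope_scale <- line_slope(p_scale)
c(all_data = slope_all, coord_limits = slope_coord, scale_limits = slope_scale)
```

With coordinate limits the slope matches that of a regression fitted to the whole of mtcars, while with scale limits it does not.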
A transformation applied through a coordinate affects the values after the statistic has computed them, thus in this plot the linear regression is represented by a curve. This is in contrast to the example above with scale_y_log10(), where the linear regression was fitted to the log10()-transformed data and thus graphically represented by a straight line in spite of the transformed y scale.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_trans(y = "log10")
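The order in which the two kinds of transformation are applied can also be seen in the values returned by the statistic. A sketch, with arbitrary variable names:

```r
library(ggplot2)

base_plot <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

# Scale transformation: applied before the statistic, so the fitted
# values are expressed in log10 units.
fit_scale <- layer_data(base_plot + scale_y_log10(), 2)

# Coordinate transformation: applied after the statistic, so the fitted
# values remain in the original units (miles per gallon).
fit_coord <- layer_data(base_plot + coord_trans(y = "log10"), 2)

range(fit_scale$y)
range(fit_coord$y)
```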
Themes are similar to style sheets, and they control the appearance and position of only those graphical elements that are not created by layer functions.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
theme_classic()
Many elements in themes are defined hierarchically, and for example text sizes are by default set relative to a base size. Here we increase the size of text elements and change the base font family. However, the size of the points is not affected.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
theme_classic(base_size = 20, base_family = "serif")
We can modify individual theme settings, instead of, or in addition to, replacing the theme as a whole.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
theme(axis.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
In a plot for publication, axis and legend labels usually need to be clearer and more elegant than simple variable names. labs() is a convenience function that makes setting these texts straightforward. Embedding of new line characters (\n) within the character strings is supported, and in some cases very useful.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
labs(x = "Engine displacement (cubic inches)",
y = "Fuel use efficiency\n(miles per gallon)",
title = "Motor Trend Car Road Tests",
subtitle = "Source: 1974 Motor Trend US magazine")
Finally, as shown here, mappings can be to R expressions, not just variables in data. For example, here we plot mpg (miles per gallon) vs. disp / cyl (the displacement of individual engine cylinders).
ggplot(data = mtcars,
aes(x = disp / cyl, y = mpg, colour = factor(cyl))) +
geom_point()
7 The internals
For most users of ‘ggplot2’ and its extensions it is crucial to understand the grammar of graphics. The internals of "gg" objects can be ignored by most users, although a rough idea of how ‘ggplot2’ works can be useful when facing error messages and “code that does not work”. It can also be useful in cases when modifying an existing "gg" object is the only available or easiest approach.
Above we have implicitly printed the plots into their graphical representation. Here we save the "gg" object into variable p and then explore the structure of the object, which reveals how the different components of the “plot drawing recipe” are stored. For example, one can see that “actions” are stored in the object as function definitions.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
class [234x11]
mapping: x = ~displ, y = ~hwy
faceting: <ggproto object: Class FacetNull, Facet, gg>
compute_layout: function
draw_back: function
draw_front: function
draw_labels: function
draw_panels: function
finish_data: function
init_scales: function
map_data: function
params: list
setup_data: function
setup_params: function
shrink: TRUE
train_scales: function
vars: function
super: <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity
str(p, max.level = 1, list.len = 4)
Object size: 30.2 kB
List of 11
$ data : tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
$ layers :List of 1
$ scales :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
add: function
add_defaults: function
add_missing: function
backtransform_df: function
clone: function
find: function
get_scales: function
has_scale: function
input: function
map_df: function
n: function
non_position_scales: function
scales: list
train_df: function
transform_df: function
super: <ggproto object: Class ScalesList, gg>
$ guides :Classes 'Guides', 'ggproto', 'gg' <ggproto object: Class Guides, gg>
add: function
assemble: function
build: function
draw: function
get_custom: function
get_guide: function
get_params: function
get_position: function
guides: NULL
merge: function
missing: <ggproto object: Class GuideNone, Guide, gg>
add_title: function
arrange_layout: function
assemble_drawing: function
available_aes: any
build_decor: function
build_labels: function
build_ticks: function
build_title: function
draw: function
draw_early_exit: function
elements: list
extract_decor: function
extract_key: function
extract_params: function
get_layer_key: function
hashables: list
measure_grobs: function
merge: function
override_elements: function
params: list
process_layers: function
setup_elements: function
setup_params: function
train: function
transform: function
super: <ggproto object: Class GuideNone, Guide, gg>
package_box: function
print: function
process_layers: function
setup: function
subset_guides: function
train: function
update_params: function
super: <ggproto object: Class Guides, gg>
[list output truncated]
str(p$layers, max.level = 1)
List of 1
$ :Classes 'LayerInstance', 'Layer', 'ggproto', 'gg' <ggproto object: Class LayerInstance, Layer, gg>
aes_params: list
compute_aesthetics: function
compute_geom_1: function
compute_geom_2: function
compute_position: function
compute_statistic: function
computed_geom_params: NULL
computed_mapping: NULL
computed_stat_params: NULL
constructor: call
data: waiver
draw_geom: function
finish_statistics: function
geom: <ggproto object: Class GeomPoint, Geom, gg>
aesthetics: function
default_aes: uneval
draw_group: function
draw_key: function
draw_layer: function
draw_panel: function
extra_params: na.rm
handle_na: function
non_missing_aes: size shape colour
optional_aes:
parameters: function
rename_size: FALSE
required_aes: x y
setup_data: function
setup_params: function
use_defaults: function
super: <ggproto object: Class Geom, gg>
geom_params: list
inherit.aes: TRUE
layer_data: function
map_statistic: function
mapping: NULL
position: <ggproto object: Class PositionIdentity, Position, gg>
compute_layer: function
compute_panel: function
required_aes:
setup_data: function
setup_params: function
super: <ggproto object: Class Position, gg>
print: function
setup_layer: function
show.legend: NA
stat: <ggproto object: Class StatIdentity, Stat, gg>
aesthetics: function
compute_group: function
compute_layer: function
compute_panel: function
default_aes: uneval
dropped_aes:
extra_params: na.rm
finish_layer: function
non_missing_aes:
optional_aes:
parameters: function
required_aes:
retransform: TRUE
setup_data: function
setup_params: function
super: <ggproto object: Class Stat, gg>
stat_params: list
super: <ggproto object: Class Layer, gg>
Package ‘gginnards’ makes it rather easy to modify "gg" objects. I occasionally find it useful to alter the order of layers and to insert layers out-of-order in an existing "gg" object. Deleting those variables that have not been mapped to aesthetics from the data stored within an existing "gg" object can sometimes be simpler than constructing the plot object again with a subset of the data.
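A brief sketch of this kind of editing, assuming that move_layers() and delete_layers() from ‘gginnards’ accept the class name of the geometry as the selector of layers, as I understand their documentation to describe. The plot and variable names are arbitrary.

```r
library(ggplot2)
library(gginnards)

p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

# Helper for this example only: class of the geometry of each layer.
layer_geoms <- function(p) {
  sapply(p$layers, function(l) class(l$geom)[1])
}
layer_geoms(p)

# Move the smoother below the points in an already built "gg" object.
p_moved <- move_layers(p, "GeomSmooth", position = "bottom")
layer_geoms(p_moved)

# Remove the smoother altogether.
p_deleted <- delete_layers(p, "GeomSmooth")
layer_geoms(p_deleted)
```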
This brief introduction only touched on the basic aspects of the grammar of graphics as implemented in R package ‘ggplot2’. This is enough to get many different types of plots done successfully. In most cases, doing different types of plots requires one to find a suitable layer function. While in many cases default arguments to these functions will yield a usable plot, normally, studying the help page of a layer function will make clear its features and how to use them effectively. Thus, it is not necessary, and a waste of time, to try to become an expert across all possible types of plots. It is enough to understand how plots are assembled, and learn when the need arises, how to use individual layer functions.
A good place to start looking for layer functions available in extension packages is the site ggplot2 extensions and its gallery. The site is especially effective for packages that are specialized and export a single or a few layer functions, because the gallery displays a single example plot per package, and the list of extensions only very few examples per package.
Several of the pages here listed under galleries contain many examples of the use of the extensions to ‘ggplot2’ that I have published in packages ‘ggpp’, ‘ggpmisc’, ‘gginnards’ and ‘ggspectra’.