Learn R: As a Language

My new book was published on 28 July. Within the next few days I will make available on-line supplementary material and explain how I created in R the word cloud on the front cover of the book. The word list I used is that of the book's R index. I typeset the book using LaTeX. The book is currently available from the publisher through its web page.

Word cloud figure from LaTeX index entries

I created the word cloud on the cover of “Learn R: As a Language” using an R script that takes as input the index file generated when building the PDF from the LaTeX source files. This input file contains quite a lot of additional information, such as font changes and page numbers, that needed to be stripped out to obtain a clean list of words (a sample of the file is shown further below). Now I realize that it might have been easier to produce a cleaner word list to start with.

The script is rather rough, but it is what I actually used. I started with base R functions when defining get_words(), but later switched to functions from packages ‘tidytext’ and ‘stringr’, combined in a pipe built with the dot-pipe operator from package ‘wrapr’.
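
For readers unfamiliar with ‘wrapr’, its dot pipe %.>% resembles the magrittr pipe but requires the dot placeholder to be written explicitly on the right-hand side; a minimal example:

library(wrapr)
# the dot must appear explicitly in each pipe stage
4 %.>% sqrt(.)
#> [1] 2

The full script follows.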

library(ngram)
library(ggplot2)
library(ggwordcloud)
library(dplyr)
library(tidytext)
library(stringr)
library(wrapr)

# clean working environment
rm(list = ls())

# current working directory
getwd()
# find all LaTeX index files in current directory and read them into an R list
list.files(path = ".", pattern = "*.idx$")
indexed.words <- multiread(extension=".idx", prune.empty = FALSE)
# check that we got one member string per file
str(indexed.words)

# define an R function to do cleaning of the different index files
get_words <- function(x) {
  # remove LaTeX commands
  gsub("\\\\textsf|\\\\textit|\\\\texttt|\\\\indexentry|\\\\textbar|\\\\ldots", "", x) -> temp
  # replace escaped characters with their plain equivalents
  gsub("\\_", "_", temp, fixed = TRUE) -> temp
  gsub('\\"', '"', temp, fixed = TRUE) -> temp
  gsub("\\%", "%", temp, fixed = TRUE) -> temp
  gsub("\\$", "$", temp, fixed = TRUE) -> temp
  gsub("\\&", "&", temp, fixed = TRUE) -> temp
  gsub("\\^", "^", temp, fixed = TRUE) -> temp
  # remove index categories
  gsub("[{]functions and methods!|[{]classes and modes!|[{]data objects!|[{]operators!|[{]control of execution!|[{]names and their scope!|[{]constant and special values!", "", temp) -> temp
  # remove page numbers
  gsub("[{][0-9]*[}]", "", temp) -> temp
  # remove LaTeX-formatted versions of index entries
  gsub("@  [{][a-zA-Z_.:0-9$<-]*[(][])][}][}]", "", temp) -> temp
  gsub("@  [{][-a-zA-Z_.:0-9$<+*/>&^\\]*[}][}]", "", temp) -> temp
  gsub("@  [{][\\<>.!=, \"%[]*]*[}][}]", "", temp)
}

# we save the first index word list to an object named after the file name
assign(sub("./", "", names(indexed.words)[1]), get_words(indexed.words[[1]]))

# check that we got the word list
string.summary(rcatsidx.idx)
# we can see that get_words() left behind some garbage
cat(rcatsidx.idx)

# the next steps seemed easier to do using the tidyverse
str_replace(rcatsidx.idx, "@", "") %.>%
  str_replace(., '\\"|\\\\"|\"', "") %.>%
  str_replace(., '\\\\$"', "$") %.>%
  str_replace(., "^[{]", "") %.>%
  str_replace(., "[}][}]$", "") %.>%
  str_split(., " ") %.>%
  unlist(.) %.>%
  sort(.) %.>%
  rle(.) %.>%
  tibble(lengths = .$lengths, values = .$values) %.>%
  filter(., !values %in% c("", "NA", "\\$")) %.>%
  mutate(., values = ifelse(values %in% c("{%in%}}", "{%in%}", "%in%@"), 
                            "%in%", values)) %.>%
  mutate(., values = ifelse(values %in% 
                         c("{levels()<-}}", "{levels()<-}", "levels()<-@"), 
                            "levels()<-", values)) %.>%
  group_by(., values) %.>%
  summarise(., lengths = sum(lengths)) %.>%
  dplyr::arrange(., desc(lengths)) -> word_counts.tb

# the number of distinct index entries
nrow(word_counts.tb)
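
The sort() plus rle() steps in the pipe above do the counting: run-length encoding of a sorted vector returns each distinct value together with the number of times it appears. A tiny standalone example:

x <- c("b", "a", "b", "b", "a")
rle(sort(x))
#> Run Length Encoding
#>   lengths: int [1:2] 2 3
#>   values : chr [1:2] "a" "b"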

word_cloud.fig0 <-
  ggplot(word_counts.tb[1:180, ], # we use the 180 most frequent entries
         aes(label = values,
             size = lengths,
             color = lengths)) +
  geom_text_wordcloud(family = "mono",
                      fontface = "bold",
                      area_corr = TRUE,
                      grid_margin = 2,
                      seed = 42) +
  scale_size_area(max_size = 11) +
  scale_color_viridis_c() +
  theme_minimal() +
  theme(aspect.ratio = 5/7)

# the word cloud used
# note that the background color is set when opening the device
png("CRC/web-examples/learnrbook-cover-image-300-0.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig0)
dev.off()

# two examples using different palettes
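# note: for a ggplot object, %+% behaves like +, so the colour scale added
# below replaces the one already present (ggplot2 prints a message about this)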
word_cloud.fig1 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "B")

png("CRC/web-examples/learnrbook-cover-image-300-1.png",
    width = 2100, height = 1500, res = 300, bg = "grey25")
print(word_cloud.fig1)
dev.off()

word_cloud.fig2 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "C")

png("CRC/web-examples/learnrbook-cover-image-300-2.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig2)
dev.off()

This script was used only once, so I did not mind mixing base R functions and tidyverse extensions. To keep this example “true” I did not edit the code before pasting it here, except for adding a few comments and deleting several variations that I tried but that either I or the cover designer rejected, apart from the two included above. These two examples give some idea of how much the choice of palette and background affects how the word cloud looks. I also fixed a bug that inflated the count for %in%, resulting in a slightly different figure.

The bitmap from the last example might also have made it onto the cover: it is somewhat warmer in tone, while still matching the usual colors of the covers of books in The R Series.
word_cloud.fig2

The file used as input, rcatsidx.idx, is 1014 lines long. Here I list the first 15 lines.

\indexentry{functions and methods!print()@\texttt  {print()}}{6}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!citation()@\texttt  {citation()}}{12}
\indexentry{classes and modes!numeric@\texttt  {numeric}}{18}
\indexentry{operators!+@\texttt  {+}}{18}
\indexentry{operators!-@\texttt  {-}}{18}
\indexentry{operators!*@\texttt  {*}}{18}
\indexentry{operators!/@\texttt  {/}}{18}
\indexentry{functions and methods!exp()@\texttt  {exp()}}{18}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{18}
\indexentry{constant and special values!pi@\texttt  {pi}}{18}
\indexentry{functions and methods!sqrt()@\texttt  {sqrt()}}{19}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{19}
...

In retrospect I realized that I could have used a much simpler R script had I produced an index file just for this purpose. I would only have needed to edit the few LaTeX macros I used for adding words to the index of the book, so as to create an additional index of plain words, with no font changes or categories. This extra index would not have been included in the book, but would have been much easier to convert into a word list suitable for constructing a word cloud.
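
With such a plain word list, the whole cleaning stage would collapse into a few lines. A sketch, assuming a hypothetical file rcloud.idx containing one plain index entry per line:

# sketch: count entries from a hypothetical plain-text index,
# one entry per line, with no LaTeX markup to strip
words <- readLines("rcloud.idx")
word_counts.tb <- dplyr::count(tibble::tibble(values = words),
                               values, name = "lengths", sort = TRUE)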

Performance of package ‘photobiology’

In recent updates I have been trying to remove performance bottlenecks in the package. For plotting spectra with ‘ggspectra’, an obvious bottleneck has been the computation of color definitions from wavelengths; the solution was to use pre-computed color definitions for the most common case, that of human vision. Many functions, operators, and assignments were repeatedly checking the validity of spectral data, and depending on the logic of the code several of these checks were redundant. It is now possible to enable and disable the checks, both internally and in users’ code. Within the package this is used to skip redundant or unnecessary checks when the logic of the computations ensures that results remain valid.
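
The general pattern is to gate validation behind a flag that both package internals and user code can toggle. Below is a minimal sketch of the idea; the option and function names are made up for illustration and are not the actual ‘photobiology’ API:

# sketch: gate expensive validation behind an R option
check_enabled <- function() {
  isTRUE(getOption("mypkg.check.spct", default = TRUE))
}
validate_spct <- function(x) {
  if (!check_enabled()) {
    return(x)  # trusted code path: skip the checks
  }
  stopifnot(is.data.frame(x), "w.length" %in% names(x))
  x
}
# user code can disable checks around a block of trusted operations
old <- options(mypkg.check.spct = FALSE)
# ... operations known to preserve validity ...
options(old)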

In addition, changes in some of the ‘tidyverse’ packages, such as ‘tibble’, ‘dplyr’, ‘vctrs’ and ‘rlang’, seem to have also improved the performance of ‘photobiology’ very significantly. If we take the time needed to run the test cases as an indication of performance, the gain has been massive, with runtime decreasing to nearly 1/3 of what it was a few months ago, in spite of an increase in the number of test cases from about 3900 to 4270. Currently the 4270 test cases run on my laptop in 23.4 s. The updates to ‘rlang’ (0.4.7) and/or ‘tibble’ (3.0.3) that appeared this week on CRAN seem to have reduced runtime by about 30% compared to the previous versions.
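
A rough but convenient way to follow such changes is simply to time the whole test suite before and after updating packages; for example, with ‘devtools’:

# time the package's test suite; the elapsed time is the figure of interest
system.time(devtools::test())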

The take-home message is that even though there is a small risk of package updates breaking existing scripts, it is usually advantageous to keep your installed R packages, and R itself, up to date. If some results change after an update, it is important to investigate which version is correct, as it is possible both that earlier bugs have been fixed and that new ones have been introduced. When needed, it is possible, although slightly more cumbersome, to install superseded versions from the source-package archive at CRAN, which keeps every version of each package earlier distributed through CRAN. As for R itself, multiple versions can coexist on the same computer, so it is not necessary to uninstall the version currently in use to test another one, either older or newer.
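
For example, package ‘remotes’ can install a specific superseded version directly from the CRAN archive (the version number below is only an example):

# install an older version of a package from the CRAN source archive
remotes::install_version("tibble", version = "3.0.2")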

R 4.0.1 and R 4.0.2

Two minor updates to R were released recently. R 4.0.1 contained a major bug under MS-Windows (affecting, for example, R Commander), so it was very soon followed by R 4.0.2, which fixes the problem. As always, it is recommended to keep R up to date. The R for Photobiology packages and their dependencies are not affected and pass checks on R (>= 3.6.0).