Word cloud figure from LaTeX index entries

I created the word cloud on the cover of “Learn R as a Language” using an R script that takes as input the book’s index file, as generated when building the PDF from the LaTeX source files. This input file contains quite a lot of additional information, such as font changes and page numbers, that had to be stripped out to obtain a clean list of words. I now realize that it would have been easier to produce a cleaner word list to start with.

The script is rather rough, but it is what I actually used. I started with base R functions when defining the function get_words(), but later switched to functions from the tidyverse packages ‘tidytext’ and ‘stringr’, chained in a pipe using the dot-pipe operator from package ‘wrapr’.

library(ngram)
library(ggplot2)
library(ggwordcloud)
library(dplyr)
library(tidytext)
library(stringr)
library(wrapr)

# clean working environment
rm(list = ls(pattern = "*"))

# current working directory
getwd()
# find all LaTeX index files in current directory and read them into an R list
list.files(path = ".", pattern = "*.idx$")
indexed.words <- multiread(extension=".idx", prune.empty = FALSE)
# check that we got one member string per file
str(indexed.words)

# define an R function to do cleaning of the different index files
get_words <- function(x) {
  # remove LaTeX commands
  gsub("\\\\textsf|\\\\textit|\\\\textsf|\\\\texttt|\\\\indexentry|\\\\textbar|\\\\ldots", "", x) -> temp
  # replace escaped characters
  gsub("\\\\_", "_", temp) -> temp
  gsub('\\\\"|\\"|\"', '"', temp) -> temp
  gsub("\\\\%", "%", temp) -> temp
  gsub("\\\\$|\\$", "$", temp) -> temp
  gsub("\\\\&|\\&", "&", temp) -> temp
  gsub("\\\\^|\\^", "^", temp) -> temp
  # remove index categories
  gsub("[{]functions and methods!|[{]classes and modes!|[{]data objects!|[{]operators!|[{]control of execution!|[{]names and their scope!|[{]constant and special values!", "", temp) -> temp
  # remove page numbers
  gsub("[{][0-9]*[}]", "", temp) -> temp
  # remove LaTeX formatted versions of index entries
  gsub("@  [{][a-zA-Z_.:0-9$<-]*[(][])][}][}]", "", temp) -> temp
  gsub("@  [{][-a-zA-Z_.:0-9$<+*/>&^\\]*[}][}]", "", temp) -> temp
  gsub("@  [{][\\<>.!=, \"%[]*]*[}][}]", "", temp)
}

# we save the first index word list to an object named after the file name
assign(sub("./", "", names(indexed.words)[1]), get_words(indexed.words[[1]]))

# check that we got the word list
string.summary(rcatsidx.idx)
# we can see that get_words() left behind some garbage
cat(rcatsidx.idx)

# the next steps seemed easier to do using the tidyverse
str_replace(rcatsidx.idx, "@", "") %.>%
  str_replace(., '\\"|\\\\"|\"', "") %.>%
  str_replace(., '\\\\$"', "$") %.>%
  str_replace(., "^[{]", "") %.>%
  str_replace(., "[}][}]$", "") %.>%
  str_split(., " ") %.>%
  unlist(.) %.>%
  sort(.) %.>%
  rle(.) %.>%
  tibble(lengths = .$lengths, values = .$values) %.>%
  filter(., !values %in% c("", "NA", "\\$")) %.>%
  mutate(., values = ifelse(values %in% c("{%in%}}", "{%in%}", "%in%@"), 
                            "%in%", values)) %.>%
  mutate(., values = ifelse(values %in% 
                         c("{levels()<-}}", "{levels()<-}", "levels()<-@"), 
                            "levels()<-", values)) %.>%
  group_by(., values) %>%
  summarise(., lengths = sum(lengths)) %>%
  dplyr::arrange(., desc(lengths)) -> word_counts.tb

# the number of distinct index entries
nrow(word_counts.tb)

word_cloud.fig0 <-
  ggplot(word_counts.tb[1:180, ], # we use the 180 most frequent entries
         aes(label = values,
             size = lengths,
             color = lengths)) +
  geom_text_wordcloud(family = "mono",
                      fontface = "bold",
                      area_corr = TRUE,
                      grid_margin = 2,
                      seed = 42) +
  scale_size_area(max_size = 11) +
  scale_color_viridis_c() +
  theme_minimal() +
  theme(aspect.ratio = 5/7)

# the word cloud used
# note that the background color is set when opening the device
png("CRC/web-examples/learnrbook-cover-image-300-0.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig0)
dev.off()

# two examples using different palettes
word_cloud.fig1 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "B")

png("CRC/web-examples/learnrbook-cover-image-300-1.png",
    width = 2100, height = 1500, res = 300, bg = "grey25")
print(word_cloud.fig1)
dev.off()

word_cloud.fig2 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "C")

png("CRC/web-examples/learnrbook-cover-image-300-2.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig2)
dev.off()

This script was used only once, so I didn’t mind mixing base R functions and tidyverse extensions. To keep this example “true”, I didn’t edit the code before pasting it here, except that I added a few comments and deleted the code for several variations that I tried but decided not to keep, or that the cover designer did not choose. Only two of those variations remain. These two examples give some idea of how much the choice of palette and background affects the look of the word cloud. I also fixed a bug that inflated the count for %in%, which results in a slightly different figure.

The bitmap from the last example might have made it to the cover: it is somewhat warmer in tone, but overall it still matches the usual colors on the covers of books in The R Series.
word_cloud.fig2
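
The counting step in the script relies on sort() followed by rle(): once equal words sit next to each other, the run lengths are the word frequencies. Below is a minimal base R sketch of that idea, using a few made-up words rather than the book’s index. The object names `words`, `runs` and `counts` are only illustrative.

```r
# a small, made-up vector of index words (not the real index)
words <- c("print()", "help()", "help()", "sin()", "help()", "sin()")

# after sorting, equal words form consecutive runs;
# rle() returns each distinct value together with the length of its run
runs <- rle(sort(words))
counts <- data.frame(values = runs$values, lengths = runs$lengths)

# most frequent entries first
counts <- counts[order(-counts$lengths), ]
counts
```

The same frequencies could be obtained with table(words). The rle() approach simply mirrors what the pipe in the script does.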

The file used as input, rcatsidx.idx, is 1014 lines long. Here I list its first 15 lines.

\indexentry{functions and methods!print()@\texttt  {print()}}{6}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!citation()@\texttt  {citation()}}{12}
\indexentry{classes and modes!numeric@\texttt  {numeric}}{18}
\indexentry{operators!+@\texttt  {+}}{18}
\indexentry{operators!-@\texttt  {-}}{18}
\indexentry{operators!*@\texttt  {*}}{18}
\indexentry{operators!/@\texttt  {/}}{18}
\indexentry{functions and methods!exp()@\texttt  {exp()}}{18}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{18}
\indexentry{constant and special values!pi@\texttt  {pi}}{18}
\indexentry{functions and methods!sqrt()@\texttt  {sqrt()}}{19}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{19}
...
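
In lines like these, the entry itself sits between the `!` that ends the category and the `@` that starts the typeset version. As a rough sketch, and assuming every line follows exactly the pattern shown above (the real file also contains escaped characters, which this ignores), a single base R regular expression can pull the entry out. The vector `idx_lines` is a hypothetical stand-in for the file contents.

```r
# a few lines copied from the listing above, with backslashes escaped for R
idx_lines <- c(
  "\\indexentry{functions and methods!print()@\\texttt  {print()}}{6}",
  "\\indexentry{functions and methods!help()@\\texttt  {help()}}{11}",
  "\\indexentry{classes and modes!numeric@\\texttt  {numeric}}{18}",
  "\\indexentry{operators!+@\\texttt  {+}}{18}",
  "\\indexentry{constant and special values!pi@\\texttt  {pi}}{18}"
)

# keep only the text between the first "!" and the "@" that follows it
entries <- sub("^\\\\indexentry\\{[^!]*!([^@]*)@.*$", "\\1", idx_lines)
entries
```

This is not what the script above does. It only illustrates which part of each line ends up in the word list.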

In retrospect, I realized that I could have used a much simpler R script had I produced an index file just for this purpose. I would only have needed to edit a few of the LaTeX macros I used for adding words to the book’s index, so as to create an additional index of plain words with no font changes or categories. This index would not have been included in the book, but it would have been easier to convert into a word list suitable for constructing a word cloud.
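
To give an idea of how short such a script could be, here is a sketch that assumes a hypothetical second index whose entries are plain words, for example `\indexentry{print()}{6}`. Both this file layout and the names `plain_idx` and `word_counts` are assumptions made for illustration. The book’s build did not produce such a file.

```r
# hypothetical contents of a plain-word index file, one entry per line;
# in practice they would be read with readLines() from the extra .idx file
plain_idx <- c(
  "\\indexentry{print()}{6}",
  "\\indexentry{help()}{11}",
  "\\indexentry{help()}{11}",
  "\\indexentry{numeric}{18}",
  "\\indexentry{sin()}{18}",
  "\\indexentry{sin()}{19}",
  "\\indexentry{help()}{19}"
)

# drop the macro name and the trailing page number, keeping only the word
words <- sub("^\\\\indexentry\\{(.*)\\}\\{[0-9]+\\}$", "\\1", plain_idx)

# count occurrences and order from most to least frequent
word_counts <- as.data.frame(table(values = words), stringsAsFactors = FALSE)
names(word_counts)[2] <- "lengths"
word_counts <- word_counts[order(-word_counts$lengths), ]
word_counts
```

The resulting data frame uses the same column names, `values` and `lengths`, as the object passed to ggplot() in the script above, so it could in principle be plotted the same way.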

Learn R: As a Language

My new book was published on 28 July. Within the next few days I will make supplementary material available on-line and explain how I created in R the word cloud on the front cover of the book. The word list I used is that of the R index from the book. I typeset the book using LaTeX. The book is currently available from the publisher through the book’s web page.

from Data to Viz (external link)

from Data to Viz is a new website related to data analysis and R. Its aim is to make it easier to choose among different types of data visualizations. It looks beautiful, is easy to navigate, and includes “trees” displaying a classification of visualizations, as well as multiple individual examples with the corresponding R code. Highly recommended!

To access the website or to buy the printed poster, visit from Data to Viz.