Word cloud figure from LaTeX index entries

I created the word cloud on the cover of “Learn R as a Language” using an R script that takes as input the index file for the book, as generated when building the PDF from the LaTeX source files. This input file contains quite a lot of additional information, such as font changes and page numbers, that had to be stripped away to leave a clean list of words. In hindsight, it may have been easier to produce a cleaner word list to start with.

The script is rather rough, but it is what I actually used. I started with base R functions when defining get_words(), but later used functions from the tidyverse-style packages ‘tidytext’ and ‘stringr’, combined in a pipe built with the dot-pipe operator (%.>%) from package ‘wrapr’.

library(ngram)
library(ggplot2)
library(ggwordcloud)
library(dplyr)
library(tidytext)
library(stringr)
library(wrapr)

# clean working environment
rm(list = ls())

# current working directory
getwd()
# find all LaTeX index files in current directory and read them into an R list
list.files(path = ".", pattern = "\\.idx$")
indexed.words <- multiread(extension=".idx", prune.empty = FALSE)
# check that we got one member string per file
str(indexed.words)

# define an R function to do cleaning of the different index files
get_words <- function(x) {
  # remove LaTeX commands
  gsub("\\\\textsf|\\\\textit|\\\\texttt|\\\\indexentry|\\\\textbar|\\\\ldots", "", x) -> temp
  # replace escaped characters
  gsub("\\\\_", "_", temp) -> temp
  gsub('\\\\"|\\"|\"', '"', temp) -> temp
  gsub("\\\\%", "%", temp) -> temp
  gsub("\\\\$|\\$", "$", temp) -> temp
  gsub("\\\\&|\\&", "&", temp) -> temp
  gsub("\\\\^|\\^", "^", temp) -> temp
  # remove index categories
  gsub("[{]functions and methods!|[{]classes and modes!|[{]data objects!|[{]operators!|[{]control of execution!|[{]names and their scope!|[{]constant and special values!", "", temp) -> temp
  # remove page numbers
  gsub("[{][0-9]*[}]", "", temp) -> temp
  # remove LaTeX formated versions of index entries
  gsub("@  [{][a-zA-Z_.:0-9$<-]*[(][])][}][}]", "", temp) -> temp
  gsub("@  [{][-a-zA-Z_.:0-9$<+*/>&^\\]*[}][}]", "", temp) -> temp
  gsub("@  [{][\\<>.!=, \"%[]*]*[}][}]", "", temp)
}
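As a quick, self-contained illustration (not part of the original script) of what these substitutions achieve, the same kind of cleaning can be applied to a single entry copied from rcatsidx.idx:

```r
# one raw entry copied from rcatsidx.idx (backslashes doubled in R source)
entry <- "\\indexentry{functions and methods!print()@\\texttt  {print()}}{6}"
# drop the LaTeX commands
x <- gsub("\\\\indexentry|\\\\texttt", "", entry)
# drop the index category
x <- gsub("[{]functions and methods!", "", x)
# drop the page number
x <- gsub("[{][0-9]*[}]", "", x)
# drop the LaTeX-formatted duplicate that follows the "@"
x <- gsub("@  [{][a-zA-Z_.:0-9$<-]*[(][])][}][}]", "", x)
x # "print()"
```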

# we save the first index word list to an object named after the file name
assign(sub("./", "", names(indexed.words)[1]), get_words(indexed.words[[1]]))

# check that we got the word list
string.summary(rcatsidx.idx)
# we can see that get_words() left behind some garbage
cat(rcatsidx.idx)

# the next steps seemed easier to do using the tidyverse
str_replace(rcatsidx.idx, "@", "") %.>%
  str_replace(., '\\"|\\\\"|\"', "") %.>%
  str_replace(., '\\\\$"', "$") %.>%
  str_replace(., "^[{]", "") %.>%
  str_replace(., "[}][}]$", "") %.>%
  str_split(., " ") %.>%
  unlist(.) %.>%
  sort(.) %.>%
  rle(.) %.>%
  tibble(lengths = .$lengths, values = .$values) %.>%
  filter(., !values %in% c("", "NA", "\\$")) %.>%
  mutate(., values = ifelse(values %in% c("{%in%}}", "{%in%}", "%in%@"), 
                            "%in%", values)) %.>%
  mutate(., values = ifelse(values %in% 
                         c("{levels()<-}}", "{levels()<-}", "levels()<-@"), 
                            "levels()<-", values)) %.>%
  group_by(., values) %.>%
  summarise(., lengths = sum(lengths)) %.>%
  dplyr::arrange(., desc(lengths)) -> word_counts.tb
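The sort() plus rle() combination in the pipe above is a base-R idiom for counting occurrences: rle() collapses runs of identical values, so on a sorted vector the run lengths are the counts of each distinct word. A minimal sketch with made-up words (not part of the original script):

```r
# hypothetical word vector standing in for the cleaned index entries
words <- c("print()", "help()", "print()", "sqrt()", "help()", "help()")
# sort first so that identical words form contiguous runs
runs <- rle(sort(words))
# run lengths become the word counts: help() 3, print() 2, sqrt() 1
counts <- data.frame(values = runs$values, lengths = runs$lengths)
counts[order(counts$lengths, decreasing = TRUE), ]
```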

# the number of distinct index entries
nrow(word_counts.tb)

word_cloud.fig0 <-
  ggplot(word_counts.tb[1:180, ], # we use the 180 most frequent entries
         aes(label = values,
             size = lengths,
             color = lengths)) +
  geom_text_wordcloud(family = "mono",
                      fontface = "bold",
                      area_corr = TRUE,
                      grid_margin = 2,
                      seed = 42) +
  scale_size_area(max_size = 11) +
  scale_color_viridis_c() +
  theme_minimal() +
  theme(aspect.ratio = 5/7)

# the word cloud used
# note that the background color is set when opening the device
png("CRC/web-examples/learnrbook-cover-image-300-0.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig0)
dev.off()

# two examples using different palettes
word_cloud.fig1 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "B")

png("CRC/web-examples/learnrbook-cover-image-300-1.png",
    width = 2100, height = 1500, res = 300, bg = "grey25")
print(word_cloud.fig1)
dev.off()

word_cloud.fig2 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "C")

png("CRC/web-examples/learnrbook-cover-image-300-2.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig2)
dev.off()

This script was used only once, so I did not mind mixing base R functions and tidyverse extensions. To keep this example “true” I did not edit the code before pasting it here, except that I added a few comments and deleted the code for several variations that I tried but discarded, either because I decided not to keep them or because the cover designer did not choose them; two of those variations are kept above. These two examples give some idea of how much the choice of palette and background affects how the word cloud looks. I did also fix a bug that inflated the count for %in%, resulting in a slightly different figure.

The bitmap from the last example could have made it onto the cover: it is somewhat warmer in tone, while still matching the usual colors on the covers of books in The R Series.
word_cloud.fig2

The file used as input, rcatsidx.idx, is 1014 lines long. Here I list its first 15 lines.

\indexentry{functions and methods!print()@\texttt  {print()}}{6}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!citation()@\texttt  {citation()}}{12}
\indexentry{classes and modes!numeric@\texttt  {numeric}}{18}
\indexentry{operators!+@\texttt  {+}}{18}
\indexentry{operators!-@\texttt  {-}}{18}
\indexentry{operators!*@\texttt  {*}}{18}
\indexentry{operators!/@\texttt  {/}}{18}
\indexentry{functions and methods!exp()@\texttt  {exp()}}{18}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{18}
\indexentry{constant and special values!pi@\texttt  {pi}}{18}
\indexentry{functions and methods!sqrt()@\texttt  {sqrt()}}{19}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{19}
...

In retrospect, I realized that I could have used a much simpler R script had I produced an index file just for this purpose. I would only have needed to edit a few of the LaTeX macros I used for adding words to the index of the book, so as to create an additional index of plain words with no font changes or categories. This extra index would not have been included in the book, but it would have been far easier to convert into a word list suitable for constructing a word cloud.
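For example (a sketch of what I could have done, not what I actually used), package ‘imakeidx’ supports several named indexes, so a small wrapper macro could write each term both to the formatted book index and to a plain word-list index whose .idx file would need almost no cleaning:

```latex
\usepackage{imakeidx}
\makeindex                     % the regular, formatted book index
\makeindex[name=wordlist]      % extra index with plain entries only

% hypothetical wrapper for indexing a function name in both indexes
\newcommand{\indexfun}[1]{%
  \index{functions and methods!#1()@\texttt{#1()}}%
  \index[wordlist]{#1()}}
```

The wordlist.idx file generated this way would contain only plain entries and page numbers, with no font commands or categories to strip.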

Learn R: As a Language

My new book was published on 28 July. Within the next few days I will make the on-line supplementary material available, and explain how I created in R the word cloud on the front cover of the book. The word list I used is that of the book’s R index; I typeset the book using LaTeX. The book is currently available from the publisher through its web page.

Book and package: Learn R …as you learnt your mother tongue

Book cover

Book draft of 2017-05-14 and ‘learnrbook’ package 0.0.2.

My book titled Learn R …as you learnt your mother tongue is gradually approaching completion, but the text is not yet polished. Today’s milestone is that the first version of the companion R package has been accepted for distribution through CRAN.

In the draft version I uploaded today, published as earlier ones through Leanpub, approximately 95% of the book content is included. Today’s version differs from earlier ones in that the original datasets and files used in the examples are now in package ‘learnrbook’, and the example code has been edited to make use of them. The package reached CRAN yesterday. The updated version of the book also tracks improvements in version 0.2.15 of my package ‘ggpmisc’, which reached CRAN today.

The PDF of this updated draft of the book is available at http://leanpub.com/learnr for whatever price you would like to pay for it (including for free). However, paying even as little as 1€ for your copy would go a long way towards supporting the expenses of running this web site.

Book: Learn R …as you learnt your mother tongue

Book cover

Draft of 2017-03-26

My book titled Learn R …as you learnt your mother tongue is nearly complete, but the text is not yet polished.

In the fifth draft version, published through Leanpub, approximately 95% of the book content is included.

The PDF of this updated draft of the book is available at http://leanpub.com/learnr for whatever price you would like to pay for it (including for free). However, paying even as little as 1€ for your copy would go a long way towards supporting the expenses of running this web site.

Status as of 2017-03-26. Except for the chapter on data manipulation, all chapters are close to their final form.

Draft of 2017-04-08

Two updates were released to Leanpub on 4 April and 8 April, respectively. The chapter on data manipulation is now complete, but the text is still rather rough.

Draft of 2017-04-14

More updates to the files at Leanpub. Indexing of functions is mostly done. Polishing of the text and code examples has started, and a few additional code examples have been added. The R package containing the original data used in the book’s examples is nearly ready for its first release.