Word cloud figure from LaTeX index entries

I created the word cloud on the cover of “Learn R as a Language” using an R script that takes as input the file for the book index, as generated when creating the PDF from the LaTeX source files. This input file contained quite a lot of additional information, such as font changes and page numbers, that had to be stripped out to obtain a clean list of words. Only later did I realize that it would have been easier to produce a cleaner word list to start with. So, I first present the code revised to work with a simpler word list; I have tested this revised code with the book files. If you want to do something similar for your own book, follow the revised code in the first section below. If you want to see the “hacked-up” code I actually used for the cover included in the book, it is in the second section below.

Polished approach and reusable code

First I need to explain how I encoded index entries in the LaTeX/Rnw source files. I did not use \index directly but instead defined macros wrapping this command. In the case of macros for indexing R-related words, the macros also added the markup used in the main text in the same operation. In fact, I defined different macros for functions, classes, etc., even if some of the definitions were initially identical. This added a lot of flexibility, which also helped greatly when implementing this simpler code.

I show here only the macro for R functions as an example.

\newcommand{\Rfunction}[1]{\texttt{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\xspace}

Those familiar with LaTeX will notice that the macro as defined above adds the argument to two different indexes, and encodes the argument in a “typewriter” font both in the main text and the two indexes.

To get a cleaner list of words as an additional index file named cloudindex.idx, we can add \index[cloudindex]{#1} to this and similar LaTeX macros. We also need to add \makeindex[name=cloudindex] to the preamble of the book source file, but we should not add a \printindex for this index, so that the book PDF remains unchanged.

\newcommand{\Rfunction}[1]{\code{#1}\index[rindex]{#1@\texttt{#1}}\index[rcatsidx]{functions and methods!#1@\texttt{#1}}\index[cloudindex]{#1}\xspace}

The contents of the file cloudindex.idx created when building the book PDF look as follows, with rows ordered by page number and one row for each call to \index (only the first few of more than 1000 lines are shown):
\indexentry{print()}{6}
\indexentry{help()}{11}
\indexentry{help()}{11}
\indexentry{help()}{11}
\indexentry{citation()}{12}
\indexentry{numeric}{18}
\indexentry{+}{18}
\indexentry{-}{18}
\indexentry{*}{18}
\indexentry{/}{18}
...

The R code used for extracting the words, counting the number of entries for each word, and assembling a tibble suitable as the data argument for ggplot() is shown next.
library(ngram)
library(ggplot2)
library(ggwordcloud)
library(dplyr)
library(tidytext)
library(stringr)
library(wrapr)

clean_words <- function(x) {
  x %.>%
  # remove LaTeX commands
  gsub("\\\\indexentry|\\\\textbar|\\\\ldots", "", x = .) %.>%
  # remove page numbers
  gsub("[{][0-9]*[}]", "", x = .) %.>%
  # remove all quotation marks
  gsub('\\\\\"|\\\"|\\"', '', x = .) %.>%
  # replace escaped characters
  gsub("\\\\_", "_", x = .) %.>%
  gsub("\\\\%", "%", x = .) %.>%
  gsub("\\\\[$]", "$", x = .) %.>% # [$] matches a literal $
  gsub("\\\\&|\\&", "&", x = .) %.>%
  gsub("\\\\\\^", "^", x = .) %.>% # \\^ matches a literal ^
  # remove brackets
  gsub("[{]|[}]", "", x = .)
}

# read all index files, each one into a single character string
getwd()
list.files(path = ".", pattern = "\\.idx$") # pattern is a regular expression
indexed.words <- multiread(extension=".idx", prune.empty = FALSE)
names(indexed.words)

# we grab the first one (edit index as needed)
my.idx <- clean_words(indexed.words[[1]])

# check what we have got
string.summary(my.idx)

my.idx %.>%
  str_split(., " ") %.>%
  unlist(.) %.>%
  sort(.) %.>%
  rle(.) %.>%
  tibble(lengths = .$lengths, values = .$values) %.>%
  filter(., !values %in% c("", "NA")) %.>% # to be excluded
  dplyr::arrange(., desc(lengths)) -> word_counts.tb

# number of distinct index entries
nrow(word_counts.tb)
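
As a cross-check of the counting step, here is a minimal base R sketch that works line by line instead of on a single string. It assumes that every line has the form \indexentry{word}{page}, as in the listing above, and that in real use the lines would be read with readLines("cloudindex.idx"). The names idx_lines, entries and word_counts_sketch are made up for this illustration, and this is not the code used for the cover.

```r
# a few lines copied from the cloudindex.idx listing shown above;
# in real use they could be read with readLines("cloudindex.idx")
idx_lines <- c(
  "\\indexentry{print()}{6}",
  "\\indexentry{help()}{11}",
  "\\indexentry{help()}{11}",
  "\\indexentry{help()}{11}",
  "\\indexentry{citation()}{12}",
  "\\indexentry{numeric}{18}",
  "\\indexentry{+}{18}"
)

# keep only the text inside the first pair of braces;
# the trailing {page} group is discarded
entries <- sub("^\\\\indexentry[{](.*)[}][{][0-9]+[}]$", "\\1", idx_lines)

# count how many times each entry was indexed, most frequent first
counts <- sort(table(entries), decreasing = TRUE)

# same column names as in word_counts.tb
word_counts_sketch <- data.frame(
  values = names(counts),
  lengths = as.vector(counts),
  stringsAsFactors = FALSE
)
word_counts_sketch
```

Unlike splitting on spaces, this keeps index entries that contain spaces as a single entry, which may or may not be what is wanted in a word cloud.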

We plot the data as a word cloud using ‘ggplot2’ and ‘ggwordcloud’. The values used as arguments for grid_margin, max_size, and the number of words plotted were selected by trial and error.
word_cloud.fig0 <-
  ggplot(head(word_counts.tb, 180), # we use the 180 most frequent entries
         aes(label = values,
             size = lengths,
             color = lengths)) +
  geom_text_wordcloud(family = "mono",
                      fontface = "bold",
                      area_corr = TRUE,
                      grid_margin = 2,
                      seed = 42) +
  scale_size_area(max_size = 11) +
  scale_color_viridis_c() +
  theme_minimal() +
  theme(aspect.ratio = 5/7)

We next give examples of how to create PNG files and of how style variations can also be produced by “editing” a ggplot. It is important to be aware that in these examples the background color is set when calling the png() device (equivalent to feeding paper of a different color to a printer) and is not coded as part of the ggplot. Of course, other R graphic devices can be used as well.
# the word cloud used
# note that the background color is set when opening the device
png("CRC/web-examples/learnrbook-cover-image-300-0.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig0)
dev.off()

# two examples using different palettes
word_cloud.fig1 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "B")

png("CRC/web-examples/learnrbook-cover-image-300-1.png",
    width = 2100, height = 1500, res = 300, bg = "grey25")
print(word_cloud.fig1)
dev.off()

word_cloud.fig2 <-
  word_cloud.fig0 %+% scale_color_viridis_c(option = "C")

png("CRC/web-examples/learnrbook-cover-image-300-2.png",
    width = 2100, height = 1500, res = 300, bg = "black")
print(word_cloud.fig2)
dev.off()
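
If one prefers the background to travel with the plot rather than with the device, it can be set in the theme instead. The following is only a sketch of that alternative, not what was done for the cover: toy_counts is a made-up data frame holding the counts from the first few lines of the index listing, and word_cloud_bg is a hypothetical name.

```r
library(ggplot2)
library(ggwordcloud)

# counts taken from the first few lines of the index listing shown earlier
toy_counts <- data.frame(
  values = c("help()", "print()", "citation()", "numeric"),
  lengths = c(3, 1, 1, 1)
)

word_cloud_bg <-
  ggplot(toy_counts,
         aes(label = values, size = lengths, color = lengths)) +
  geom_text_wordcloud(family = "mono", fontface = "bold", seed = 42) +
  scale_size_area(max_size = 11) +
  scale_color_viridis_c() +
  theme_minimal() +
  # the background is now part of the plot, not of the device
  theme(plot.background = element_rect(fill = "black", colour = NA))
```

With this approach the black background should appear on whatever device the plot is printed to, at the cost of having to edit the plot itself to try a different background.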

In fact, once we realize what needs to be done and which tools are the most appropriate, a simple script can get the job done elegantly.

The bitmap from the last example, which might have made it to the cover, is somewhat warmer in tone, but overall still matches the usual colors on the covers of books in The R Series. It also differs slightly from the one on the book cover because a bug in the original script resulted in too high a count for %in%.
word_cloud.fig2

“Hacked” approach and rough code

I spent quite a lot of time thinking about what could be a good cover image… rather unsuccessfully. At the last minute, I realized that, as the book is about the language, a word cloud built from the R “words” listed in the book index could be a nice way to emphasize this. The question was then how to quickly create such an image with the deadline looming, less than 24 h away. I had not used package ‘ggwordcloud’ before, but I had seen it described in a blog post. As I am familiar with ‘ggplot2’, this seemed to be the way to go. As I was not thinking out of the box at the time, I followed the obvious path of using the intermediate data file used by LaTeX to create the index printed in the book. Parsing this file turned out to be tricky and resulted in a long and inelegant script that did get the job done: I generated a bitmap that I was able to send to the publisher.

The file used as input, rcatsidx.idx, is over 1000 lines long. Here I list the first 15 lines.

\indexentry{functions and methods!print()@\texttt  {print()}}{6}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!help()@\texttt  {help()}}{11}
\indexentry{functions and methods!citation()@\texttt  {citation()}}{12}
\indexentry{classes and modes!numeric@\texttt  {numeric}}{18}
\indexentry{operators!+@\texttt  {+}}{18}
\indexentry{operators!-@\texttt  {-}}{18}
\indexentry{operators!*@\texttt  {*}}{18}
\indexentry{operators!/@\texttt  {/}}{18}
\indexentry{functions and methods!exp()@\texttt  {exp()}}{18}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{18}
\indexentry{constant and special values!pi@\texttt  {pi}}{18}
\indexentry{functions and methods!sqrt()@\texttt  {sqrt()}}{19}
\indexentry{functions and methods!sin()@\texttt  {sin()}}{19}
...

This script is shown below. It is rough, but it is what I actually used. I started out using base R functions when defining the function get_words(), but later used functions from the tidyverse packages ‘tidytext’ and ‘stringr’ in a pipe built with the dot-pipe operator from package ‘wrapr’. Comparing this code with that presented above demonstrates that thinking out of the box, and including the generation of the data in the search for a solution to the problem, leads to a simpler and reusable script.

library(ngram)
library(ggplot2)
library(ggwordcloud)
library(dplyr)
library(tidytext)
library(stringr)
library(wrapr)

# clean working environment
rm(list = ls(pattern = "*"))

# current working directory
getwd()
# find all LaTeX index files in current directory and read them into an R list
list.files(path = ".", pattern = "*.idx$")
indexed.words <- multiread(extension=".idx", prune.empty = FALSE)
# check that we got one member string per file
str(indexed.words)

# define an R function to do cleaning of the different index files
get_words <- function(x) {
  # remove LaTeX commands
  gsub("\\\\textsf|\\\\textit|\\\\textsf|\\\\texttt|\\\\indexentry|\\\\textbar|\\\\ldots", "", x) -> temp
  # replace escaped characters
  gsub("\\\\_", "_", temp) -> temp
  gsub('\\\\"|\\"|\"', '"', temp) -> temp
  gsub("\\\\%", "%", temp) -> temp
  gsub("\\\\$|\\$", "$", temp) -> temp
  gsub("\\\\&|\\&", "&", temp) -> temp
  gsub("\\\\^|\\^", "^", temp) -> temp
  # remove index categories
  gsub("[{]functions and methods!|[{]classes and modes!|[{]data objects!|[{]operators!|[{]control of execution!|[{]names and their scope!|[{]constant and special values!", "", temp) -> temp
  # remove page numbers
  gsub("[{][0-9]*[}]", "", temp) -> temp
  # remove LaTeX formatted versions of index entries
  gsub("@  [{][a-zA-Z_.:0-9$<-]*[(][])][}][}]", "", temp) -> temp
  gsub("@  [{][-a-zA-Z_.:0-9$<+*/>&^\\]*[}][}]", "", temp) -> temp
  gsub("@  [{][\\<>.!=, \"%[]*]*[}][}]", "", temp)
}

# we save the first index word list to an object named after the file name
assign(sub("./", "", names(indexed.words)[1]), get_words(indexed.words[[1]]))

# check that we got the word list
string.summary(rcatsidx.idx)
# we can see that get_words() left behind some garbage
cat(rcatsidx.idx)

# the next steps seemed easier to do using the tidyverse
str_replace(rcatsidx.idx, "@", "") %.>%
  str_replace(., '\\"|\\\\"|\"', "") %.>%
  str_replace(., '\\\\$"', "$") %.>%
  str_replace(., "^[{]", "") %.>%
  str_replace(., "[}][}]$", "") %.>%
  str_split(., " ") %.>%
  unlist(.) %.>%
  sort(.) %.>%
  rle(.) %.>%
  tibble(lengths = .$lengths, values = .$values) %.>%
  filter(., !values %in% c("", "NA", "\\$")) %.>%
  mutate(., values = ifelse(values %in% c("{%in%}}", "{%in%}", "%in%@"), 
                            "%in%", values)) %.>%
  mutate(., values = ifelse(values %in% 
                         c("{levels()<-}}", "{levels()<-}", "levels()<-@"), 
                            "levels()<-", values)) %.>%
  group_by(., values) %>%
  summarise(., lengths = sum(lengths)) %>%
  dplyr::arrange(., desc(lengths)) -> word_counts.tb

# the number of distinct index entries
nrow(word_counts.tb)

See the previous section for the actual plotting and export of bitmaps.

This script was used only once, so I didn’t mind mixing base R functions and tidyverse extensions. To keep this example “true”, I didn’t edit the code before pasting it here, except for adding a few comments. I did also fix a bug that inflated the count for %in%, which results in a slightly different figure.
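
As a side note, much of the stripping done in get_words() could probably be replaced by a single capture, if one works line by line. The sketch below assumes that every line has the form \indexentry{category!word@\texttt  {word}}{page}, as in the listing of rcatsidx.idx above. The names idx_lines and keys are made up for this illustration, and I have not run this sketch on the full file.

```r
# a few lines copied from the rcatsidx.idx listing shown above;
# in real use they could be read with readLines("rcatsidx.idx")
idx_lines <- c(
  "\\indexentry{functions and methods!print()@\\texttt  {print()}}{6}",
  "\\indexentry{functions and methods!help()@\\texttt  {help()}}{11}",
  "\\indexentry{classes and modes!numeric@\\texttt  {numeric}}{18}",
  "\\indexentry{operators!+@\\texttt  {+}}{18}",
  "\\indexentry{constant and special values!pi@\\texttt  {pi}}{18}"
)

# capture the sort key: the text after "category!" and before "@\texttt"
keys <- sub("^\\\\indexentry[{][^!]*!(.*)@\\\\texttt.*$", "\\1", idx_lines)
keys
```

Entries whose word itself contains characters that are special in index files, such as ! or @, are written with a quoting character in the .idx file and would still need the kind of special-casing seen in the script above.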

In retrospect I realized that I could have used a much simpler R script, had I produced an index file just for this purpose. As shown above, I would have only needed to edit a few LaTeX macros I used for adding words to the index of the book, so as to create an additional index with plain words with no font changes or categories.
