The source code and content come from the book Text Mining with R. Enjoy!

- The unnest_tokens function
| text <- c("Because I could not stop for Death -", |
| "He kindly stopped for me -", |
| "The Carriage held but just Ourselves -", |
| "and Immortality") |
| |
| text |
| |
| library(dplyr) |
| |
| text_df <- tibble(line = 1:4, text = text) |
| text_df |
- What does it mean that this data frame has printed out as a “tibble”? A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
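As a small sketch of those tibble properties (the object name `poem` is made up for illustration), the check below confirms that a string column stays a character vector and that no row names are attached:

```r
library(tibble)

# A two-line tibble; `poem` is a hypothetical name used only for this sketch
poem <- tibble(line = 1:2,
               text = c("Because I could not stop for Death -",
                        "He kindly stopped for me -"))

class(poem)              # a tibble is still a data frame underneath
is.character(poem$text)  # strings are kept as character, not factor
has_rownames(poem)       # tibbles do not carry row names
```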
| |
| library(tidytext) |
| |
| text_df %>% unnest_tokens(word, text) |
The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.
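To make the two arguments concrete, here is a minimal sketch on a made-up two-row data frame (`toy` is not part of the original code). The first name becomes the new column and the second is the column that gets split. By default the tokenizer also lowercases the words and drops punctuation; `to_lower = FALSE` keeps the original case.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy input: one row per line of text
toy <- tibble(line = 1:2,
              text = c("He kindly stopped for me -", "and Immortality"))

# Output column `word`, input column `text`; other columns (line) are kept
words <- toy %>% unnest_tokens(word, text)
words$word

# Same call, but without converting tokens to lowercase
words_cased <- toy %>% unnest_tokens(word, text, to_lower = FALSE)
words_cased$word
```

Each token ends up on its own row, and the `line` column records which input row it came from.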
Having the text data in this format lets us manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2, as shown in Figure 1.1.
knitr::include_graphics("img/Figure_A_FlowChart.png")

- Tidying the works of Jane Austen
| |
| library(janeaustenr) |
| library(dplyr) |
| library(stringr) |
| |
| original_books <- austen_books() %>% |
| group_by(book) %>% |
| mutate(linenumber = row_number(), |
| chapter = cumsum(str_detect(text, |
| regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>% |
| ungroup() |
| |
| original_books %>% head() |
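The chapter column above relies on a small trick: str_detect() flags the lines that look like chapter headings, and cumsum() turns those flags into a running chapter number. A minimal sketch on a few made-up lines (`toy_lines` is a hypothetical vector, not the Austen data):

```r
library(stringr)

# Hypothetical lines standing in for the text column of one book
toy_lines <- c("PRIDE AND PREJUDICE",
               "CHAPTER 1",
               "It is a truth universally acknowledged,",
               "Chapter II",
               "Mr. Bennet was among the earliest")

# TRUE where a line starts with "chapter" followed by a digit or roman numeral
is_heading <- str_detect(toy_lines,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# Running total of headings seen so far; lines before the first heading get 0
chapter <- cumsum(is_heading)
chapter
```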
To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
| |
| library(tidytext) |
| tidy_books <- original_books %>% |
| unnest_tokens(word, text) |
| |
| tidy_books %>% head() |
- Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
| data(stop_words) |
| |
| tidy_books <- tidy_books %>% |
| anti_join(stop_words) |
## Joining, by = "word"
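The "Joining" message appears because anti_join() had to guess the join column. As a minimal sketch on a made-up word list (`toy_words` is hypothetical), passing `by = "word"` makes the column explicit and silences the message; the join keeps only the rows of the left table that have no match in stop_words.

```r
library(dplyr)
library(tidytext)

# Hypothetical one-token-per-row input
toy_words <- tibble(word = c("the", "carriage", "of", "immortality"))

# Keep only rows whose word does not appear in the stop_words lexicons
kept <- toy_words %>% anti_join(stop_words, by = "word")
kept$word
```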
We can also use dplyr’s count() to find the most common words in all the books as a whole.
| tidy_books %>% |
| count(word, sort = TRUE) |
| library(ggplot2) |
| |
| tidy_books %>% |
| count(word, sort = TRUE) %>% |
| filter(n > 600) %>% |
| mutate(word = reorder(word, n)) %>% |
| ggplot(aes(word, n)) + |
| geom_col() + |
| xlab(NULL) + |
| coord_flip() + |
| labs(caption = "Figure 1.2 TOP words in Jane Austen’s novels") |
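The counting and reordering steps used above can be sketched on a tiny made-up word column (`toy_words` is hypothetical). count(word, sort = TRUE) gives the same result here as grouping, summarising, and sorting by hand, and reorder() sets the factor levels by n so that the bars in the plot come out ordered.

```r
library(dplyr)

# Hypothetical tokens, one per row
toy_words <- tibble(word = c("time", "miss", "time", "day", "time", "miss"))

# Shortcut: tally each word and sort by descending count
counts <- toy_words %>% count(word, sort = TRUE)

# The same tally written out step by step
manual <- toy_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

# reorder() turns word into a factor whose levels follow increasing n
ordered_words <- counts %>% mutate(word = reorder(word, n))
levels(ordered_words$word)
```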

- Word Frequencies
We’re going to use the gutenbergr package, which provides access to the public domain works in the Project Gutenberg collection. We can access works using gutenberg_download(), passing the Project Gutenberg ID number of each work.
| library(gutenbergr) |
| hgwells <- gutenberg_download(c(35, 36, 5230, 159)) |
| library(tidytext) |
| |
| tidy_hgwells <- hgwells %>% |
| unnest_tokens(word, text) %>% |
| anti_join(stop_words) |
## Joining, by = "word"
| tidy_hgwells %>% |
| count(word, sort = TRUE) |
Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a rather different style. Let’s get Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, and Agnes Grey.
| library(gutenbergr) |
| |
| bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767)) |
| |
| library(tidyverse) |
| library(tidytext) |
| |
| tidy_bronte <- bronte %>% |
| unnest_tokens(word, text) %>% |
| anti_join(stop_words) |
## Joining, by = "word"
| tidy_bronte %>% |
| count(word, sort = TRUE) |
- Interesting that “time”, “eyes”, and “hand” are in the top 10 for both H.G. Wells and the Brontë sisters.
Let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together. We can use spread() and gather() from tidyr to reshape our data frame so that it is just what we need for plotting and comparing the three sets of novels.
| library(tidyr) |
| |
| frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"), |
| mutate(tidy_hgwells, author = "H.G. Wells"), |
| mutate(tidy_books, author = "Jane Austen")) %>% |
| mutate(word = str_extract(word, "[a-z']+")) %>% |
| count(author, word) %>% |
| group_by(author) %>% |
| mutate(proportion = n / sum(n)) %>% |
| select(-n) %>% |
| spread(author, proportion) %>% |
| gather(author, proportion, `Brontë Sisters`:`H.G. Wells`) |
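For readers on a recent tidyr, the same reshape can be sketched with pivot_wider() and pivot_longer(), which tidyr now recommends over spread() and gather(). This is only an illustration on made-up tokens (`toy` and its contents are hypothetical), assuming tidyr 1.0.0 or later, and is not the code used in the book.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens from two authors
toy <- tibble(
  author = c("Jane Austen", "Jane Austen", "H.G. Wells", "H.G. Wells", "H.G. Wells"),
  word   = c("time", "miss", "time", "time", "day")
)

# Share of each word within each author's tokens
proportions <- toy %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup() %>%
  select(-n)

# One column per author; words an author never uses become NA
wide <- proportions %>%
  pivot_wider(names_from = author, values_from = proportion)

# Stack the non-Austen column back up, keeping `Jane Austen` alongside for comparison
long <- wide %>%
  pivot_longer(cols = `H.G. Wells`, names_to = "author", values_to = "proportion")
long
```

Keeping Jane Austen as its own column is what lets every other author be plotted against her on a shared axis.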
We use str_extract() here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (like italics). The tokenizer treated these as words, but we don’t want to count “_any_” separately from “any” as we saw in our initial data exploration before choosing to use str_extract().
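A quick sketch of what that pattern does, on a made-up vector of tokens (`tokens` is hypothetical): the character class matches only lowercase letters and apostrophes, so the surrounding underscores are dropped while contractions survive.

```r
library(stringr)

# Hypothetical tokens, including one wrapped in emphasis underscores
tokens <- c("_any_", "any", "don't")

# Keep the first run of lowercase letters and apostrophes in each token
cleaned <- str_extract(tokens, "[a-z']+")
cleaned
```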
- Visualization
Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “brother” at the high frequency end).
| library(ggplot2) |
| library(scales) # provides percent_format() |
| |
| ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) + |
| geom_abline(color = "gray40", lty = 2) + |
| geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) + |
| geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) + |
| scale_x_log10(labels = percent_format()) + |
| scale_y_log10(labels = percent_format()) + |
| scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") + |
| facet_wrap(~author, ncol = 2) + |
| theme(legend.position="none") + |
| labs(y = "Jane Austen", |
| x = NULL, |
| caption = "Figure 1.3 Comparing the word frequencies of Jane Austen, the Brontë sisters, and H.G. Wells") |

- Correlation
Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?
| cor.test(data = frequency[frequency$author == "Brontë Sisters",], |
| ~ proportion + `Jane Austen`) |
| cor.test(data = frequency[frequency$author == "H.G. Wells",], |
| ~ proportion + `Jane Austen`) |
Just as we saw in the plots, the word frequencies are more correlated between the Austen and Brontë novels than between Austen and H.G. Wells.
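One detail worth knowing when reading these results: many words appear in only one set of texts, so their proportion is NA in the other, and cor.test() uses only the rows where both values are present. A minimal sketch on made-up numbers (the data frame `toy` and its column names are hypothetical):

```r
# Hypothetical word proportions with some missing values
toy <- data.frame(
  other  = c(0.010, 0.020, NA,    0.040, 0.050, 0.030),
  austen = c(0.012, 0.018, 0.030, NA,    0.055, 0.020)
)

# Formula interface, as in the calls above; incomplete rows are dropped
result <- cor.test(~ other + austen, data = toy)

r  <- unname(result$estimate)   # Pearson correlation coefficient
df <- unname(result$parameter)  # degrees of freedom: complete rows minus 2

# The same coefficient computed directly on the complete rows
r_check <- cor(toy$other, toy$austen, use = "complete.obs")
c(r = r, df = df, r_check = r_check)
```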