The source code and contents come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

Intro

  • The next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.

1. To be, or not to be

  • The shakespeare dataset contains three columns:
    • title, the title of a Shakespearean play,
    • type, the type of play, either tragedy or comedy, and
    • text, a line from that play. This data frame contains the entire texts of six plays.
library(tidytext)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
load("data/shakespeare.rda")
glimpse(shakespeare)
## Observations: 25,888
## Variables: 3
## $ title <chr> "The Tragedy of Romeo and Juliet", "The Tragedy of Romeo...
## $ type  <chr> "Tragedy", "Tragedy", "Tragedy", "Tragedy", "Tragedy", "...
## $ text  <chr> "The Complete Works of William Shakespeare", "", "The Tr...
# Use count to find out how many titles/types there are
shakespeare %>% 
  count(title, type)
## # A tibble: 6 x 3
##   title                           type        n
##   <chr>                           <chr>   <int>
## 1 A Midsummer Night's Dream       Comedy   3459
## 2 Hamlet, Prince of Denmark       Tragedy  6776
## 3 Much Ado about Nothing          Comedy   3799
## 4 The Merchant of Venice          Comedy   4225
## 5 The Tragedy of Macbeth          Tragedy  3188
## 6 The Tragedy of Romeo and Juliet Tragedy  4441

2. Unnesting from text to word

The shakespeare dataset is not yet compatible with tidy tools. You first need to break the text into individual tokens (the process of tokenization); a token is a meaningful unit of text for analysis, in many cases just synonymous with a single word. You also need to transform the text to a tidy data structure with one token per row. You can use tidytext’s unnest_tokens() function to accomplish all of this at once.

library(tidytext)

tidy_shakespeare <- shakespeare %>%
  group_by(title) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% # Transform the non-tidy text data to tidy text data
  ungroup()

tidy_shakespeare %>% 
  count(word, sort = TRUE)
## # A tibble: 10,736 x 2
##    word      n
##    <chr> <int>
##  1 the    4651
##  2 and    4170
##  3 i      3296
##  4 to     3047
##  5 of     2645
##  6 a      2511
##  7 you    2287
##  8 my     1913
##  9 in     1836
## 10 that   1721
## # ... with 10,726 more rows
  • Notice how the most common words in the data frame are words like “the”, “and”, and “i” that have no sentiments associated with them. In the next exercise, you’ll join the data with a lexicon to implement sentiment analysis.

3. Sentiment analysis of Shakespeare

After transforming the text of these Shakespearean plays to a tidy text dataset in the last exercise, the resulting data frame tidy_shakespeare is ready for sentiment analysis with a tidy approach. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.

shakespeare_sentiment <- tidy_shakespeare %>%
  inner_join(get_sentiments("bing"))  # Implement sentiment analysis with the "bing" lexicon
## Joining, by = "word"
shakespeare_sentiment %>%
  count(title, sentiment) # Find how many positive/negative words each play has
## # A tibble: 12 x 3
##    title                           sentiment     n
##    <chr>                           <chr>     <int>
##  1 A Midsummer Night's Dream       negative    681
##  2 A Midsummer Night's Dream       positive    773
##  3 Hamlet, Prince of Denmark       negative   1323
##  4 Hamlet, Prince of Denmark       positive   1223
##  5 Much Ado about Nothing          negative    767
##  6 Much Ado about Nothing          positive   1127
##  7 The Merchant of Venice          negative    740
##  8 The Merchant of Venice          positive    962
##  9 The Tragedy of Macbeth          negative    914
## 10 The Tragedy of Macbeth          positive    749
## 11 The Tragedy of Romeo and Juliet negative   1235
## 12 The Tragedy of Romeo and Juliet positive   1090
  • Passing two variables to count() returns the count n for each unique combination of the two variables. In this case, you have 6 plays and 2 sentiments, so count() returns 6 x 2 = 12 rows.

4. Tragedy or comedy?

  • Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?
sentiment_counts <- tidy_shakespeare %>%
    inner_join(get_sentiments("bing")) %>% 
    # Count the number of words by title, type, and sentiment
    count(title, type, sentiment)
## Joining, by = "word"
sentiment_counts %>%
    group_by(title) %>% # Group by the titles of the plays
    mutate(total = sum(n), # Find the total number of words in each play
           percent = n / total) %>% # Calculate the number of words divided by the total
    filter(sentiment == "negative") %>% # Filter the results for only negative sentiment
    arrange(percent)
## # A tibble: 6 x 6
## # Groups:   title [6]
##   title                           type    sentiment     n total percent
##   <chr>                           <chr>   <chr>     <int> <int>   <dbl>
## 1 Much Ado about Nothing          Comedy  negative    767  1894   0.405
## 2 The Merchant of Venice          Comedy  negative    740  1702   0.435
## 3 A Midsummer Night's Dream       Comedy  negative    681  1454   0.468
## 4 Hamlet, Prince of Denmark       Tragedy negative   1323  2546   0.520
## 5 The Tragedy of Romeo and Juliet Tragedy negative   1235  2325   0.531
## 6 The Tragedy of Macbeth          Tragedy negative    914  1663   0.550
  • Looking at the percent column of your output, you can see that tragedies do in fact have a higher percentage of negative words!

5. Most common positive and negative words

  • Now you can explore which specific words are driving these sentiment scores. Which are the most common positive and negative words in these plays?
word_counts <- tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment)
## Joining, by = "word"
top_words <- word_counts %>%
  group_by(sentiment) %>% # Group by sentiment
  top_n(10) %>%  # Take the top 10 for each sentiment
  ungroup() %>%  # Make word a factor in order of n
  mutate(word = reorder(word, n))
## Selecting by n
# Use aes() to put words on the x-axis and n on the y-axis
library(ggplot2)
ggplot(top_words, aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +  
  coord_flip()

 

  • Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare’s time? Do you see a word that the lexicon has misidentified?
  • The word "wilt" was used differently in Shakespeare’s time and was not negative; the lexicon has misidentified it. For example, from Romeo and Juliet: "For thou wilt lie upon the wings of night". It is important to explore the details of how words were scored when performing sentiment analyses.
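If you want to look into a word like this yourself, one minimal check (using the tidy_shakespeare data frame built above) is to count how often each play uses it:

tidy_shakespeare %>%
  filter(word == "wilt") %>%   # the word the Bing lexicon labels as negative
  count(title, sort = TRUE)    # how often each play uses it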

6. Word contributions by play

  • You will also practice using a different sentiment lexicon, the “afinn” lexicon in which words have a score from -5 to 5. Different lexicons take different approaches to quantifying the emotion/opinion content of words.
  • Which words contribute to the overall sentiment in which plays?
tidy_shakespeare %>%
  count(title, word, sort = TRUE) %>% # Count by title and word
  inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
  filter(title == "The Tragedy of Macbeth", score < 0) # Filter to only examine the scores for Macbeth that are negative
## Joining, by = "word"

## # A tibble: 237 x 4
##    title                  word        n score
##    <chr>                  <chr>   <int> <int>
##  1 The Tragedy of Macbeth no         73    -1
##  2 The Tragedy of Macbeth fear       35    -2
##  3 The Tragedy of Macbeth death      20    -2
##  4 The Tragedy of Macbeth bloody     16    -3
##  5 The Tragedy of Macbeth poor       16    -2
##  6 The Tragedy of Macbeth strange    16    -1
##  7 The Tragedy of Macbeth dead       14    -3
##  8 The Tragedy of Macbeth leave      14    -1
##  9 The Tragedy of Macbeth fight      13    -1
## 10 The Tragedy of Macbeth charges    11    -2
## # ... with 227 more rows
  • Notice the use of words specific to Macbeth like “bloody”.

7. Calculating a contribution score

  • You can calculate a relative contribution for each word in each play: multiply each word’s score by the number of times it is used in the play, then divide by the total count of scored words in that play.
sentiment_contributions <- tidy_shakespeare %>%
  count(title, word, sort = TRUE) %>%  # Count by title and word
  inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
  group_by(title) %>%  # Group by title
  mutate(contribution = (n * score) / sum(n)) %>% # Calculate a contribution for each word in each title
  ungroup()
## Joining, by = "word"
sentiment_contributions
## # A tibble: 2,366 x 5
##    title                           word      n score contribution
##    <chr>                           <chr> <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark       no      143    -1      -0.0652
##  2 The Tragedy of Romeo and Juliet love    140     3       0.213 
##  3 Much Ado about Nothing          no      132    -1      -0.0768
##  4 Much Ado about Nothing          hero    114     2       0.133 
##  5 A Midsummer Night's Dream       love    110     3       0.270 
##  6 Hamlet, Prince of Denmark       good    109     3       0.149 
##  7 The Tragedy of Romeo and Juliet no      102    -1      -0.0518
##  8 Much Ado about Nothing          good     93     3       0.162 
##  9 The Merchant of Venice          no       92    -1      -0.0630
## 10 Much Ado about Nothing          love     91     3       0.159 
## # ... with 2,356 more rows
  • Notice that “hero” shows up in your results there; that is the name of one of the characters in “Much Ado About Nothing”.

8. Alas, poor Yorick!

  • It’s time to explore some of your results! Look at Hamlet and The Merchant of Venice to see what negative and positive words are important in these two plays.

  • Arrange the most negative words

sentiment_contributions %>%
  # Filter for Hamlet
  filter(title == "Hamlet, Prince of Denmark") %>%
  # Arrange to see the most negative words
  arrange(contribution)
## # A tibble: 493 x 5
##    title                     word        n score contribution
##    <chr>                     <chr>   <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark no        143    -1      -0.0652
##  2 Hamlet, Prince of Denmark dead       33    -3      -0.0451
##  3 Hamlet, Prince of Denmark death      38    -2      -0.0347
##  4 Hamlet, Prince of Denmark madness    22    -3      -0.0301
##  5 Hamlet, Prince of Denmark mad        21    -3      -0.0287
##  6 Hamlet, Prince of Denmark fear       21    -2      -0.0192
##  7 Hamlet, Prince of Denmark poor       20    -2      -0.0182
##  8 Hamlet, Prince of Denmark hell       10    -4      -0.0182
##  9 Hamlet, Prince of Denmark grave      17    -2      -0.0155
## 10 Hamlet, Prince of Denmark ghost      32    -1      -0.0146
## # ... with 483 more rows
  • Arrange the most positive words
sentiment_contributions %>%
  # Filter for The Merchant of Venice
  filter(title == "The Merchant of Venice") %>%
  # Arrange to see the most positive words
  arrange(desc(contribution))
## # A tibble: 344 x 5
##    title                  word        n score contribution
##    <chr>                  <chr>   <int> <int>        <dbl>
##  1 The Merchant of Venice good       63     3       0.129 
##  2 The Merchant of Venice love       60     3       0.123 
##  3 The Merchant of Venice fair       35     2       0.0479
##  4 The Merchant of Venice like       34     2       0.0466
##  5 The Merchant of Venice true       24     2       0.0329
##  6 The Merchant of Venice sweet      23     2       0.0315
##  7 The Merchant of Venice pray       42     1       0.0288
##  8 The Merchant of Venice better     21     2       0.0288
##  9 The Merchant of Venice justice    17     2       0.0233
## 10 The Merchant of Venice welcome    17     2       0.0233
## # ... with 334 more rows
  • These are definitely characteristic words for these two plays.

9. Sentiment changes through a play

  • We will start by implementing sentiment analysis using inner_join(), and then use count() with four arguments:
    • title,
    • type,
    • an index that will section together lines of the play, and
    • sentiment.
  • After these lines of code, you will have the number of positive and negative words used in each indexed section of the play. These sections will be 70 lines long in your analysis here. You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines work well for these plays.
tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>% # Implement sentiment analysis using "bing" lexicon
  count(title, 
        type, 
        index = linenumber %/% 70, 
        sentiment)
## Joining, by = "word"

## # A tibble: 744 x 5
##    title                     type   index sentiment     n
##    <chr>                     <chr>  <dbl> <chr>     <int>
##  1 A Midsummer Night's Dream Comedy     0 negative      4
##  2 A Midsummer Night's Dream Comedy     0 positive     11
##  3 A Midsummer Night's Dream Comedy     1 negative      7
##  4 A Midsummer Night's Dream Comedy     1 positive     19
##  5 A Midsummer Night's Dream Comedy     2 negative     20
##  6 A Midsummer Night's Dream Comedy     2 positive     23
##  7 A Midsummer Night's Dream Comedy     3 negative     12
##  8 A Midsummer Night's Dream Comedy     3 positive     18
##  9 A Midsummer Night's Dream Comedy     4 negative      9
## 10 A Midsummer Night's Dream Comedy     4 positive     27
## # ... with 734 more rows
  • This is the first step in looking at narrative arcs.

10. Calculating net sentiment

The next steps involve spread() from the tidyr package. After these lines of code, you will have the net sentiment in each indexed section of the play; net sentiment is the negative sentiment subtracted from the positive sentiment.

# Load the tidyr package
library(tidyr)

tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% # Spread sentiment and n across multiple columns
  mutate(sentiment = positive - negative) # Use mutate to find net sentiment
## Joining, by = "word"

## # A tibble: 373 x 6
##    title                     type   index negative positive sentiment
##    <chr>                     <chr>  <dbl>    <dbl>    <dbl>     <dbl>
##  1 A Midsummer Night's Dream Comedy     0        4       11         7
##  2 A Midsummer Night's Dream Comedy     1        7       19        12
##  3 A Midsummer Night's Dream Comedy     2       20       23         3
##  4 A Midsummer Night's Dream Comedy     3       12       18         6
##  5 A Midsummer Night's Dream Comedy     4        9       27        18
##  6 A Midsummer Night's Dream Comedy     5       11       21        10
##  7 A Midsummer Night's Dream Comedy     6       12       16         4
##  8 A Midsummer Night's Dream Comedy     7        9        6        -3
##  9 A Midsummer Night's Dream Comedy     8        6       12         6
## 10 A Midsummer Night's Dream Comedy     9       19       12        -7
## # ... with 363 more rows

You are closer to plotting the sentiment through these plays.

11. Visualizing narrative arcs

You will continue to build on your manipulations of this text dataset and visualize the results of this sentiment analysis.

library(tidyr)
library(ggplot2)

tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(x = index, # Put index on x-axis
             y = sentiment, # Put sentiment on y-axis
             fill = type)) +  # map comedy/tragedy to fill
  geom_col() + # Make a bar chart with geom_col()
  facet_wrap(~ title, scales = "free_x") # Separate panels for each title with facet_wrap()
## Joining, by = "word"


  • These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment than the tragedies.



The source code and contents come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

Intro

  • Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

1. Sentiment Lexicons

There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course that are available in the tidytext package:

  • afinn from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer.

# Load dplyr and tidytext
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidytext)

# Choose the bing lexicon
get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
# Choose the nrc lexicon
get_sentiments("nrc") %>%
  count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
##    sentiment        n
##    <chr>        <int>
##  1 anger         1247
##  2 anticipation   839
##  3 disgust       1058
##  4 fear          1476
##  5 joy            689
##  6 negative      3324
##  7 positive      2312
##  8 sadness       1191
##  9 surprise       534
## 10 trust         1231
  • While the “bing” lexicon classifies words into 2 sentiments, positive or negative, there are 10 sentiments conveyed in the “nrc” lexicon.

2. Implement an inner join

In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr will identify which words are in both the sentiment lexicon and the text dataset you are examining; see the dplyr documentation to learn more about joining data frames. The geocoded_tweets dataset is taken from Quartz and contains three columns: state, word, and freq, the average frequency of that word in tweets from that state.

load("data/geocoded_tweets.rda")

# Access bing lexicon: bing
bing <- get_sentiments("bing")

# Use data frame with text data
geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing)
## Joining, by = "word"

## # A tibble: 64,303 x 4
##    state   word             freq sentiment
##    <chr>   <chr>           <dbl> <chr>    
##  1 alabama abuse           7186. negative 
##  2 alabama abused          3073. negative 
##  3 alabama accomplish      5957. positive 
##  4 alabama accomplished   13121. positive 
##  5 alabama accomplishment  3036. positive 
##  6 alabama accurate       28262. positive 
##  7 alabama ache            7306. negative 
##  8 alabama aching          5080. negative 
##  9 alabama addict          5441. negative 
## 10 alabama addicted       40389. negative 
## # ... with 64,293 more rows
  • You can see the average frequency and the sentiment associated with each word that exists in both data frames.

3. What are the most common sadness words?

After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?

tweets_nrc <- geocoded_tweets %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
tweets_nrc %>% 
  filter(sentiment == "sadness") %>% 
  group_by(word) %>% 
  summarise(freq = mean(freq)) %>% 
  arrange(desc(freq))
## # A tibble: 585 x 2
##    word        freq
##    <chr>      <dbl>
##  1 hate    1253840.
##  2 bad      984943.
##  3 bitch    787774.
##  4 hell     486259.
##  5 crazy    447047.
##  6 feeling  407562.
##  7 leave    397806.
##  8 mad      393559.
##  9 music    373608.
## 10 sick     362023.
## # ... with 575 more rows

4. What are the most common joy words?

You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.

joy_words <- tweets_nrc %>%
  filter(sentiment == "joy") %>%
  group_by(word) %>%
  summarise(freq = mean(freq)) %>%
  arrange(desc(freq))    

library(ggplot2)

joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col() +
  coord_flip() 
## Selecting by freq



5. Do people in different states use different words?

So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?

tweets_nrc %>%
  filter(state == "utah",
      sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 326 x 4
##    state word          freq sentiment
##    <chr> <chr>        <dbl> <chr>    
##  1 utah  love      4207322. joy      
##  2 utah  good      3035114. joy      
##  3 utah  happy     1402568. joy      
##  4 utah  pretty     902947. joy      
##  5 utah  fun        764045. joy      
##  6 utah  birthday   663439. joy      
##  7 utah  beautiful  653061. joy      
##  8 utah  friend     627522. joy      
##  9 utah  hope       571841. joy      
## 10 utah  god        536687. joy      
## # ... with 316 more rows
tweets_nrc %>%
  filter(state == "louisiana",
      sentiment == "joy") %>%
    arrange(desc(freq))
## # A tibble: 290 x 4
##    state     word         freq sentiment
##    <chr>     <chr>       <dbl> <chr>    
##  1 louisiana love     3764157. joy      
##  2 louisiana good     2758699. joy      
##  3 louisiana baby     1184392. joy      
##  4 louisiana happy    1176291. joy      
##  5 louisiana god       882457. joy      
##  6 louisiana birthday  740497. joy      
##  7 louisiana money     677899. joy      
##  8 louisiana hope      675559. joy      
##  9 louisiana pretty    581242. joy      
## 10 louisiana feeling   486367. joy      
## # ... with 280 more rows
  • Words like “baby” and “money” are popular in Louisiana but not in Utah.

6. Which states have the most positive Twitter users?

  • You will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created just the same way that you did earlier. Check out what tweets_bing looks like in the console.
library(tidyr)
tweets_bing <- geocoded_tweets %>% 
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
tweets_bing %>% 
  group_by(state, sentiment) %>% 
  summarise(freq = mean(freq)) %>% 
  spread(sentiment, freq) %>% 
  ungroup() %>% 
  mutate(ratio = positive / negative, 
         state = reorder(state, ratio)) %>% 
  ggplot(aes(x = state, y = ratio)) + 
    geom_point() + 
    coord_flip()


  • Combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. Looks like Missouri tops the list for this one!



The source code and contents come from the book Text Mining with R. Enjoy!

Intro

  • Let’s address the topic of opinion mining or sentiment analysis. When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. The flow chart is shown in Figure 2.1.
knitr::include_graphics("img/Figure_2.1_Sentiment_Analysis.png")

2.1 FlowChart

  • One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

  1. The Sentiment DataSet

library(tidytext)
sentiments
## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

The three general-purpose lexicons are:

  • AFINN from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

Common

  • All three of these lexicons are based on unigrams, i.e., single words.
  • These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.

Differences

  • The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.

  • The bing lexicon categorizes words in a binary fashion into positive and negative categories.

  • The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

  • tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.

get_sentiments("afinn")
## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Considerations

  • They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data.
  • Thus, we need to be cautious when applying these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago.
  • Moreover, there are also domain-specific sentiment lexicons available, constructed to be used with text from a specific content area.
  • In conclusion, the aim of this tutorial is to learn the workflow of sentiment analysis and to gain insight into applying it in your own context.

  2. Sentiment Analysis with inner join

  • Let’s look at the words with a joy score from the NRC lexicon. What are the most common joy words in Emma?
library(janeaustenr)
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(tidy_books)
## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen
  • Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis. What are the most common joy words in Emma? Let’s use count() from dplyr.
nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")

tidy_books %>% 
  filter(book == "Emma") %>% 
  inner_join(nrc_joy) %>% 
  count(word, sort = TRUE)
## Joining, by = "word"

## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

We see mostly positive, happy words about hope, friendship, and love here. Now we can also examine how sentiment changes throughout each novel.

  1. First, we find a sentiment score for each word using the Bing lexicon and inner_join().
  2. Next, we count up how many positive and negative words there are in defined sections of each book. We define an index here to keep track of where we are in the narrative; this index (using integer division) counts up sections of 80 lines of text.

The %/% operator does integer division (x %/% y is equivalent to floor(x/y)) so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in.
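As a quick illustration of the integer division used to build the index:

175 %/% 80  # for example, line 175 falls in the third 80-line section (index 2)
## [1] 2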

  3. We then use spread() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).
library(tidyr)

jane_austen_sentiment <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
  • Now we can plot these sentiment scores across the plot trajectory of each novel. Notice that we are plotting against the index on the x-axis that keeps track of narrative time in sections of text.
library(ggplot2)

ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

2.2 Sentiment through the narratives of Jane Austen’s novels

Comparing the three sentiment dictionaries

  • Let’s do it more with other sentiment dictionaries and the narrative arc of Pride and Prejudice.

  • First, let’s use filter() to choose only the words from the one novel we are interested in.

pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")

head(pride_prejudice)
## # A tibble: 6 x 4
##   book              linenumber chapter word     
##   <fct>                  <int>   <int> <chr>    
## 1 Pride & Prejudice          1       0 pride    
## 2 Pride & Prejudice          1       0 and      
## 3 Pride & Prejudice          1       0 prejudice
## 4 Pride & Prejudice          3       0 by       
## 5 Pride & Prejudice          3       0 jane     
## 6 Pride & Prejudice          3       0 austen
  • Second, we can use inner_join() to calculate the sentiment in different ways.

Remember from above that the AFINN lexicon measures sentiment with a numeric score between -5 and 5, while the other two lexicons categorize words in a binary fashion, either positive or negative. To find a sentiment score in chunks of text throughout the novel, we will need to use a different pattern for the AFINN lexicon than for the other two.

  • Third, build the afinn data frame.
afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(score)) %>% 
  mutate(method = "AFINN")
## Joining, by = "word"
head(afinn)
## # A tibble: 6 x 3
##   index sentiment method
##   <dbl>     <int> <chr> 
## 1     0        29 AFINN 
## 2     1         0 AFINN 
## 3     2        20 AFINN 
## 4     3        30 AFINN 
## 5     4        62 AFINN 
## 6     5        66 AFINN
  • Fourth, build the bing_and_nrc data frame.
bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>% 
                            mutate(method = "Bing et al."), 
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", "negative"))) %>% 
                            mutate(method = "NRC")
                          ) %>% 
  count(method, index = linenumber %/% 80, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
head(bing_and_nrc)
## # A tibble: 6 x 5
##   method      index negative positive sentiment
##   <chr>       <dbl>    <dbl>    <dbl>     <dbl>
## 1 Bing et al.     0        7       21        14
## 2 Bing et al.     1       20       19        -1
## 3 Bing et al.     2       16       20         4
## 4 Bing et al.     3       19       31        12
## 5 Bing et al.     4       23       47        24
## 6 Bing et al.     5       15       49        34
  • Now, we have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let’s visualize all of them.
bind_rows(afinn, 
          bing_and_nrc) %>% 
  ggplot(aes(index, sentiment, fill = method)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~method, ncol = 1, scales = "free_y")

2.3 Comparing three sentiment lexicons using Pride and Prejudice

  • The three plots show that the three different lexicons for calculating sentiment give results that are different in an absolute sense but have similar relative trajectories through the novel.
  • Question: why is the result for the NRC lexicon biased so high in sentiment compared to the Bing et al. result? Let’s look briefly at how many positive and negative words are in these lexicons.
get_sentiments("nrc") %>% 
     filter(sentiment %in% c("positive", 
                             "negative")) %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312
get_sentiments("bing") %>% 
  count(sentiment)
## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4782
## 2 positive   2006
  • Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than the NRC lexicon (a quick check of this ratio follows this list). This will contribute to the effect we see in the plot above, as will any systematic difference in word matches, e.g. if the negative words in the NRC lexicon do not match the words that Jane Austen uses very well. Whatever the source of these differences, we see similar relative trajectories across the narrative arc, with similar changes in slope, but marked differences in absolute sentiment from lexicon to lexicon. This is all important context to keep in mind when choosing a sentiment lexicon for analysis.

  • Point 1. This comment makes me ponder whether to use the given lexicons or packages directly.

  • Point 2. This comment motivates me to develop a specific lexicon for each different context, building on the given lexicons.
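Using the counts printed above, a quick arithmetic check makes the ratio difference concrete (no new data, just the numbers from the output above):

c(nrc = 3324 / 2312, bing = 4782 / 2006)  # roughly 1.44 negative words per positive word for NRC vs 2.38 for Bing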

Most common positive and negative words

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

bing_word_counts <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()
## Joining, by = "word"
head(bing_word_counts)
## # A tibble: 6 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639

This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.

bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n

2.4 Words that contribute to positive and negative sentiment in Jane Austen’s novels

  • Point 1. Let us look at the word “miss”. The word is coded as negative, but it is often used as a title for unmarried women in Jane Austen’s works. So, if it were appropriate for our purposes, we could easily add “miss” to a custom stop-words list using bind_rows().
custom_stop_words <- bind_rows(data_frame(word = c("miss"), 
                                          lexicon = c("custom")), 
                               stop_words)

custom_stop_words
## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Wordclouds

library(wordcloud)
## Loading required package: RColorBrewer
tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100)) 
## Joining, by = "word"

The most common words in Jane Austen’s novels

library(reshape2)
## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths
tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)
## Joining, by = "word"

Most common positive and negative words in Jane Austen’s novels

Looking at units beyond just words

library(reshape2)

PandP_sentences <- data_frame(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")

PandP_sentences$sentence[2]
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."

Bonus: Looking at units beyond just words

  1. Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()

austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())
## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25
  2. We can use tidy text analysis to ask questions such as: what are the most negative chapters in each of Jane Austen’s novels?
    1. First, let’s get the list of negative words from the Bing lexicon.
    2. Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters.
    3. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter.
    4. For each book, which chapter has the highest proportion of negative words?
bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")

wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())

tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()
## Joining, by = "word"

## Selecting by ratio

## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

These are the chapters with the most sad words in each book, normalized for the number of words in the chapter. What is happening in these chapters?

  • In Chapter 43 of Sense and Sensibility, Marianne is seriously ill, near death.
  • In Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!).
  • Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry’s scandalous adultery.
  • Chapter 15 of Emma is when horrifying Mr. Elton proposes, and in Chapter 21 of Northanger Abbey Catherine is deep in her Gothic faux fantasy of murder, etc.
  • Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth and how sad she was and what a terrible mistake she realized it to be.

The source code and contents come from the book Text Mining with R. Enjoy!




  1. The unnest_tokens function

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

text
## [1] "Because I could not stop for Death -"  
## [2] "He kindly stopped for me -"            
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
library(dplyr)
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
text_df <- data_frame(line = 1:4, text = text)
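Printing text_df shows the tibble structure referred to in the next point (output as R prints it for this small data frame):

text_df
## # A tibble: 4 x 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality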
  • What does it mean that this data frame has printed out as a “tibble”? A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
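One of these properties is easy to check directly: the text column stays a plain character vector rather than being converted to a factor.

class(text_df$text)
## [1] "character"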
# install.packages("tidytext", dependencies = TRUE)
library(tidytext)

text_df %>% unnest_tokens(word, text)
## # A tibble: 20 x 2
##     line word       
##    <int> <chr>      
##  1     1 because    
##  2     1 i          
##  3     1 could      
##  4     1 not        
##  5     1 stop       
##  6     1 for        
##  7     1 death      
##  8     2 he         
##  9     2 kindly     
## 10     2 stopped    
## 11     2 for        
## 12     2 me         
## 13     3 the        
## 14     3 carriage   
## 15     3 held       
## 16     3 but        
## 17     3 just       
## 18     3 ourselves  
## 19     4 and        
## 20     4 immortality
  • The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.

  • Having the text data in this format lets us manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2, as shown in Figure 1.1.

knitr::include_graphics("img/Figure_A_FlowChart.png")

1.1 FlowChart

  2. Tidying the works of Jane Austen

# install.packages("janeaustenr")
library(janeaustenr)
library(dplyr)
library(stringr)

original_books <- austen_books() %>% 
    group_by(book) %>% 
    mutate(linenumber = row_number(), 
           chapter = cumsum(str_detect(text, 
                            regex("^chapter [\\divxlc]",                       ignore_case = TRUE)))) %>% 
    ungroup()

original_books %>% head()
## # A tibble: 6 x 4
##   text                  book                linenumber chapter
##   <chr>                 <fct>                    <int>   <int>
## 1 SENSE AND SENSIBILITY Sense & Sensibility          1       0
## 2 ""                    Sense & Sensibility          2       0
## 3 by Jane Austen        Sense & Sensibility          3       0
## 4 ""                    Sense & Sensibility          4       0
## 5 (1811)                Sense & Sensibility          5       0
## 6 ""                    Sense & Sensibility          6       0

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.

# install.packages("janeaustenr")
library(tidytext)
tidy_books <- original_books %>% 
    unnest_tokens(word, text)

tidy_books %>% head()
## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen
  • Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join()
data(stop_words)

tidy_books <- tidy_books %>%
  anti_join(stop_words)
## Joining, by = "word"

We can also use dplyr’s count() to find the most common words in all the books as a whole.

tidy_books %>%
  count(word, sort = TRUE)
## # A tibble: 13,914 x 2
##    word       n
##    <chr>  <int>
##  1 miss    1855
##  2 time    1337
##  3 fanny    862
##  4 dear     822
##  5 lady     817
##  6 sir      806
##  7 day      797
##  8 emma     787
##  9 sister   727
## 10 house    699
## # ... with 13,904 more rows
library(ggplot2)

tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() + 
  labs(caption = "Figure 1.2 TOP words in Jane Austen’s novels")

  3. Word Frequencies

We’re going to use gutenbergr package that provides access to the public domain works from the Project Gutenberg collection. We can access works using gutenberg_download(). The numbers below are followed as well.
library(gutenbergr)
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
## Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest

## Using mirror http://aleph.gutenberg.org
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  1.4.2     ✔ readr   1.3.0
## ✔ tidyr   0.8.2     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(tidytext)

tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_hgwells %>% 
    count(word, sort = TRUE)
## # A tibble: 11,769 x 2
##    word       n
##    <chr>  <int>
##  1 time     454
##  2 people   302
##  3 door     260
##  4 heard    249
##  5 black    232
##  6 stood    229
##  7 white    222
##  8 hand     218
##  9 kemp     213
## 10 eyes     210
## # ... with 11,759 more rows

Now let’s get some well-known works of the Brontë sisters, whose lives overlapped with Jane Austen’s somewhat but who wrote in a rather different style. Let’s get Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, and Agnes Grey.

library(gutenbergr)

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))

library(tidyverse)
library(tidytext)

tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
tidy_bronte %>%
  count(word, sort = TRUE)
## # A tibble: 23,050 x 2
##    word       n
##    <chr>  <int>
##  1 time    1065
##  2 miss     855
##  3 day      827
##  4 hand     768
##  5 eyes     713
##  6 night    647
##  7 heart    638
##  8 looked   601
##  9 door     592
## 10 half     586
## # ... with 23,040 more rows
  • Interesting that “time”, “eyes”, and “hand” are in the top 10 for both H.G. Wells and the Brontë sisters.

Let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together. We can use spread and gather from tidyr to reshape our dataframe so that it is just what we need for plotting and comparing the three sets of novels.

library(tidyr)

frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
                       mutate(tidy_hgwells, author = "H.G. Wells"), 
                       mutate(tidy_books, author = "Jane Austen")) %>% 
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>% 
  select(-n) %>% 
  spread(author, proportion) %>% 
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)

We use str_extract() here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (like italics). The tokenizer treated these as words, but we don’t want to count “_any_” separately from “any”, as we saw in our initial data exploration before choosing to use str_extract().
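As a small illustration of what that regular expression keeps (stringr is already loaded earlier in this post):

str_extract("_any_", "[a-z']+")  # drops the underscores, keeping only the letters
## [1] "any"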

  4. Visualization

Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “brother” at the high frequency end).

library(scales)
## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggplot2)

# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position="none") +
  labs(y = "Jane Austen", 
       x = NULL, 
       caption = "Figure 1.3 Comparing the word frequencies of Jane Austen, the Brontë sisters, and H.G. Wells")
## Warning: Removed 41357 rows containing missing values (geom_point).

## Warning: Removed 41359 rows containing missing values (geom_text).

  5. Correlation

Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?

cor.test(data = frequency[frequency$author == "Brontë Sisters",],
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 119.65, df = 10404, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7527869 0.7689641
## sample estimates:
##       cor 
## 0.7609938
cor.test(data = frequency[frequency$author == "H.G. Wells",], 
         ~ proportion + `Jane Austen`)
## 
##  Pearson's product-moment correlation
## 
## data:  proportion and Jane Austen
## t = 36.441, df = 6053, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4032800 0.4445987
## sample estimates:
##       cor 
## 0.4241601

Just as we saw in the plots, the word frequencies are more correlated between the Austen and Brontë novels than between Austen and H.G. Wells.
