Sentiment_Analysis_with_tidy_data

2018. 12. 20. 13:51

The source codes and contens come from the book Text Mining with R Enjoy

Intro

Let’s address the topic of opinion mining or sentiment analysis. When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. The flow chart is shown in Figure 2.1.

knitr::include_graphics("img/Figure_2.1_Sentiment_Analysis.png")

2.1 FlowChart

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

The Sentiment DataSet

 library(tidytext)
sentiments

 ## # A tibble: 27,314 x 4
##    word        sentiment lexicon score
##    <chr>       <chr>     <chr>   <int>
##  1 abacus      trust     nrc        NA
##  2 abandon     fear      nrc        NA
##  3 abandon     negative  nrc        NA
##  4 abandon     sadness   nrc        NA
##  5 abandoned   anger     nrc        NA
##  6 abandoned   fear      nrc        NA
##  7 abandoned   negative  nrc        NA
##  8 abandoned   sadness   nrc        NA
##  9 abandonment anger     nrc        NA
## 10 abandonment fear      nrc        NA
## # ... with 27,304 more rows

The three general-purpose lexicons are + AFINN from Finn Årup Nielsen, + bing from Bing Liu and collaborators, and + nrc from Saif Mohammad and Peter Turney.

Common

All three of these lexicons are based on unigrams, i.e., single words.
These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.

Differences

The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
The bing lexicon categorizes words in a binary fashion into positive and negative categories.
The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.

get_sentiments("afinn")

 ## # A tibble: 2,476 x 2
##    word       score
##    <chr>      <int>
##  1 abandon       -2
##  2 abandoned     -2
##  3 abandons      -2
##  4 abducted      -2
##  5 abduction     -2
##  6 abductions    -2
##  7 abhor         -3
##  8 abhorred      -3
##  9 abhorrent     -3
## 10 abhors        -3
## # ... with 2,466 more rows

get_sentiments("bing")

 ## # A tibble: 6,788 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faced     negative 
##  2 2-faces     negative 
##  3 a+          positive 
##  4 abnormal    negative 
##  5 abolish     negative 
##  6 abominable  negative 
##  7 abominably  negative 
##  8 abominate   negative 
##  9 abomination negative 
## 10 abort       negative 
## # ... with 6,778 more rows

get_sentiments("nrc")

 ## # A tibble: 13,901 x 2
##    word        sentiment
##    <chr>       <chr>    
##  1 abacus      trust    
##  2 abandon     fear     
##  3 abandon     negative 
##  4 abandon     sadness  
##  5 abandoned   anger    
##  6 abandoned   fear     
##  7 abandoned   negative 
##  8 abandoned   sadness  
##  9 abandonment anger    
## 10 abandonment fear     
## # ... with 13,891 more rows

Considerations

They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data.
Thus, we need to consider to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago.
Moreover, There are also some domain-specific sentiment lexicon available, constructed to be used with text from a specific content area.
In conclusion, in this tutorial, it’s better to find the workflow of sentiment analysis and to get an insight of using this tutorial for your future context.

Sentiment Analysis with inner join

Let’s look at the words with a joy score from the NRC lexicon. What are the most common joy words in Emma?

 library(janeaustenr)
library(dplyr)

 ## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

 library(stringr)
 
tidy_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text)
head(tidy_books)

 ## # A tibble: 6 x 4
##   book                linenumber chapter word       
##   <fct>                    <int>   <int> <chr>      
## 1 Sense & Sensibility          1       0 sense      
## 2 Sense & Sensibility          1       0 and        
## 3 Sense & Sensibility          1       0 sensibility
## 4 Sense & Sensibility          3       0 by         
## 5 Sense & Sensibility          3       0 jane       
## 6 Sense & Sensibility          3       0 austen

Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis. What are the most common joy words in Emma? Let’s use count() from dplyr.

 nrc_joy <- get_sentiments("nrc") %>% 
  filter(sentiment == "joy")
 
tidy_books %>% 
  filter(book == "Emma") %>% 
  inner_join(nrc_joy) %>% 
  count(word, sort = TRUE)

 ## Joining, by = "word"
## # A tibble: 303 x 2
##    word        n
##    <chr>   <int>
##  1 good      359
##  2 young     192
##  3 friend    166
##  4 hope      143
##  5 happy     125
##  6 love      117
##  7 deal       92
##  8 found      92
##  9 present    89
## 10 kind       82
## # ... with 293 more rows

We see mostly positive, happy words about hope, friendship, and love here. Now, We can also examine how sentiment changes throughout each novel. + 1. we find a sentiment score for each word using the Bing lexicon and inner_join(). + 2. Next, we count up how many positive and negative words there are in defined sections of each book. We define an index here to keep track of where we are in the narrative; this index (using integer division) counts up sections of 80 lines of text.

The %/% operator does integer division (x %/% y is equivalent to floor(x/y)) so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in.

1. We then use spread() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).

 library(tidyr)
 
jane_austen_sentiment <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)

## Joining, by = "word"

Now we can plot these sentiment scores across the plot trajectory of each novel. Notice that we are plotting against the index on the x-axis that keeps track of narrative time in sections of text.

 library(ggplot2)
 
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")

2.2 Sentiment through the narratives of Jane Austen’s novels

Comparing the three sentiment dictionaries

Let’s do it more with other sentiment dictionaries and the narrative arc of Pride and Prejudice.
First, let’s use filter() to choose only the words from the one novel we are interested in.

 pride_prejudice <- tidy_books %>% 
  filter(book == "Pride & Prejudice")
 
head(pride_prejudice)

 ## # A tibble: 6 x 4
##   book              linenumber chapter word     
##   <fct>                  <int>   <int> <chr>    
## 1 Pride & Prejudice          1       0 pride    
## 2 Pride & Prejudice          1       0 and      
## 3 Pride & Prejudice          1       0 prejudice
## 4 Pride & Prejudice          3       0 by       
## 5 Pride & Prejudice          3       0 jane     
## 6 Pride & Prejudice          3       0 austen

Second, Now, we can use inner_join() to calculate the sentiment in different ways.

Remember from above that the AFINN lexicon measures sentiment with a numeric score between -5 and 5, while the other two lexicons categorize words in a binary fashion, either positive or negative. To find a sentiment score in chunks of text throughout the novel, we will need to use a different pattern for the AFINN lexicon than for the other two.

Third, dataset afinn

 afinn <- pride_prejudice %>% 
  inner_join(get_sentiments("afinn")) %>% 
  group_by(index = linenumber %/% 80) %>% 
  summarise(sentiment = sum(score)) %>% 
  mutate(method = "AFINN")

## Joining, by = "word"

head(afinn)

 ## # A tibble: 6 x 3
##   index sentiment method
##   <dbl>     <int> <chr> 
## 1     0        29 AFINN 
## 2     1         0 AFINN 
## 3     2        20 AFINN 
## 4     3        30 AFINN 
## 5     4        62 AFINN 
## 6     5        66 AFINN

Fourth, bing_and_nrc

 bing_and_nrc <- bind_rows(pride_prejudice %>% 
                            inner_join(get_sentiments("bing")) %>% 
                            mutate(method = "Bing et al."), 
                          pride_prejudice %>% 
                            inner_join(get_sentiments("nrc") %>% 
                                         filter(sentiment %in% c("positive", "negative"))) %>% 
                            mutate(method = "NRC")
                          ) %>% 
  count(method, index = linenumber %/% 80, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative)

 ## Joining, by = "word"
## Joining, by = "word"

head(bing_and_nrc)

 ## # A tibble: 6 x 5
##   method      index negative positive sentiment
##   <chr>       <dbl>    <dbl>    <dbl>     <dbl>
## 1 Bing et al.     0        7       21        14
## 2 Bing et al.     1       20       19        -1
## 3 Bing et al.     2       16       20         4
## 4 Bing et al.     3       19       31        12
## 5 Bing et al.     4       23       47        24
## 6 Bing et al.     5       15       49        34

Now, we have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let’s visualize all of them.

 bind_rows(afinn, 
          bing_and_nrc) %>% 
  ggplot(aes(index, sentiment, fill = method)) + 
  geom_col(show.legend = FALSE) + 
  facet_wrap(~method, ncol = 1, scales = "free_y")

2.3 Comparing three sentiment lexicons using Pride and Prejudice

Three plots are saying that the three different lexicons for calculating sentiment give results that are different in an absolute sense but have similar relative trajectories through the novel.
Question, why is, the result for the NRC lexicon biased so high in sentiment compared to the Bing et al. result? Let’s look briefly at how many positive and negative words are in these lexicon.

 get_sentiments("nrc") %>% 
     filter(sentiment %in% c("positive", 
                             "negative")) %>% 
  count(sentiment)

 ## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   3324
## 2 positive   2312

 get_sentiments("bing") %>% 
  count(sentiment)

 ## # A tibble: 2 x 2
##   sentiment     n
##   <chr>     <int>
## 1 negative   4782
## 2 positive   2006

Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than the NRC lexicon. This will contribute to the effect we see in the plot above, as will any systematic difference in word matches, e.g. if the negative words in the NRC lexicon do not match the words that Jane Austen uses very well. Whatever the source of these differences, we see similar relative trajectories across the narrative arc, with similar changes in slope, but marked differences in absolute sentiment from lexicon to lexicon. This is all important context to keep in mind when choosing a sentiment lexicon for analysis.
point 1. This comment makes me to ponder the direct usage of the given lexicons or packages.
point 2. This comment makes me to motivate develop a specific lexcions for the each different context, using given lexicons

Most common positive and negative words

One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.

 bing_word_counts <- tidy_books %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort = TRUE) %>% 
  ungroup()

## Joining, by = "word"

head(bing_word_counts)

 ## # A tibble: 6 x 3
##   word   sentiment     n
##   <chr>  <chr>     <int>
## 1 miss   negative   1855
## 2 well   positive   1523
## 3 good   positive   1380
## 4 great  positive    981
## 5 like   positive    725
## 6 better positive    639

This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.

 bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()

## Selecting by n

2.4 Words that contribute to positive and negative sentiment in Jane Austen’s novels

Point 1. let us see the word “miss”. The word is coded as negative but it is often used unmarried women n Jane Austen’s works. So, if it were appropriate for our purposes, we could easily add “miss” to a custom stop-words list using bind_rows().

 custom_stop_words <- bind_rows(data_frame(word = c("miss"), 
                                          lexicon = c("custom")), 
                               stop_words)
 
custom_stop_words

 ## # A tibble: 1,150 x 2
##    word        lexicon
##    <chr>       <chr>  
##  1 miss        custom 
##  2 a           SMART  
##  3 a's         SMART  
##  4 able        SMART  
##  5 about       SMART  
##  6 above       SMART  
##  7 according   SMART  
##  8 accordingly SMART  
##  9 across      SMART  
## 10 actually    SMART  
## # ... with 1,140 more rows

Wordclouds

library(wordcloud)

## Loading required package: RColorBrewer

 tidy_books %>%
  anti_join(stop_words) %>%
  count(word) %>%
  with(wordcloud(word, n, max.words = 100))

## Joining, by = "word"

The most common words in Jane Austen’s novels

library(reshape2)

 ## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths

 tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("gray20", "gray80"),
                   max.words = 100)

## Joining, by = "word"

$\label{fig:figs}Most common positive and negative words in Jane Austen’s novels$

Looking at units beyond just words

 library(reshape2)
 
PandP_sentences <- data_frame(text = prideprejudice) %>% 
  unnest_tokens(sentence, text, token = "sentences")
 
PandP_sentences$sentence[2]

## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."

Bonus: Looking at units beyond just words

Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.

 austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex", 
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
 
austen_chapters %>% 
  group_by(book) %>% 
  summarise(chapters = n())

 ## # A tibble: 6 x 2
##   book                chapters
##   <fct>                  <int>
## 1 Sense & Sensibility       51
## 2 Pride & Prejudice         62
## 3 Mansfield Park            49
## 4 Emma                      56
## 5 Northanger Abbey          32
## 6 Persuasion                25

We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels?

1. First, let’s get the list of negative words from the Bing lexicon.
1. Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters.
1. Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter.
1. For each book, which chapter has the highest proportion of negative words?

 bingnegative <- get_sentiments("bing") %>% 
  filter(sentiment == "negative")
 
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
 
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords/words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()

 ## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
##   book                chapter negativewords words  ratio
##   <fct>                 <int>         <int> <int>  <dbl>
## 1 Sense & Sensibility      43           161  3405 0.0473
## 2 Pride & Prejudice        34           111  2104 0.0528
## 3 Mansfield Park           46           173  3685 0.0469
## 4 Emma                     15           151  3340 0.0452
## 5 Northanger Abbey         21           149  2982 0.0500
## 6 Persuasion                4            62  1807 0.0343

These are the chapters with the most sad words in each book, normalized for number of words in the chapter. What is happening in these chapters? + In Chapter 43 of Sense and Sensibility Marianne is seriously ill, near death. + In Chapter 34 of Pride and Prejudice Mr. Darcy proposes for the first time (so badly!). + In Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry’s scandalous adultery + In Chapter 15 of Emma is when horrifying Mr. Elton proposes, and in Chapter 21 of Northanger Abbey Catherine is deep in her Gothic faux fantasy of murder, etc. + In Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth and how sad she was and what a terrible mistake she realized it to be.

'R > [R] Text Mining' 카테고리의 다른 글

Sentiment Analysis - Shakespeare gets Sentimental (0)	2018.12.21
Text Mining: Tweets Across the Unisted States From DataCamp (0)	2018.12.20
The tidy text format (0)	2018.12.18

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

cozyDS