The source code and content come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!
Intro
- The next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.
1. To be, or not to be
- The shakespeare dataset contains three columns:
- title, the title of a Shakespearean play,
- type, the type of play, either tragedy or comedy, and
- text, a line from that play. This data frame contains the entire texts of six plays.
library(tidytext)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
load("data/shakespeare.rda")
glimpse(shakespeare)
## Observations: 25,888
## Variables: 3
## $ title <chr> "The Tragedy of Romeo and Juliet", "The Tragedy of Romeo...
## $ type <chr> "Tragedy", "Tragedy", "Tragedy", "Tragedy", "Tragedy", "...
## $ text <chr> "The Complete Works of William Shakespeare", "", "The Tr...
# Use count to find out how many titles/types there are
shakespeare %>%
count(title, type)
## # A tibble: 6 x 3
## title type n
## <chr> <chr> <int>
## 1 A Midsummer Night's Dream Comedy 3459
## 2 Hamlet, Prince of Denmark Tragedy 6776
## 3 Much Ado about Nothing Comedy 3799
## 4 The Merchant of Venice Comedy 4225
## 5 The Tragedy of Macbeth Tragedy 3188
## 6 The Tragedy of Romeo and Juliet Tragedy 4441
2. Unnesting from text to word
The shakespeare dataset is not yet compatible with tidy tools. You first need to break the text into individual tokens (the process of tokenization); a token is a meaningful unit of text for analysis, in many cases simply a single word. You also need to transform the text to a tidy data structure with one token per row. You can use tidytext’s unnest_tokens() function to accomplish all of this at once.
library(tidytext)
tidy_shakespeare <- shakespeare %>%
group_by(title) %>%
mutate(linenumber = row_number()) %>%
unnest_tokens(word, text) %>% # Transform the non-tidy text data to tidy text data
ungroup()
tidy_shakespeare %>%
count(word, sort = TRUE)
## # A tibble: 10,736 x 2
## word n
## <chr> <int>
## 1 the 4651
## 2 and 4170
## 3 i 3296
## 4 to 3047
## 5 of 2645
## 6 a 2511
## 7 you 2287
## 8 my 1913
## 9 in 1836
## 10 that 1721
## # ... with 10,726 more rows
- Notice how the most common words in the data frame are words like “the”, “and”, and “i” that have no sentiments associated with them. In the next exercise, you’ll join the data with a lexicon to implement sentiment analysis.
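To make the tokenization step concrete, here is a minimal sketch on a two-line toy data frame (the toy tibble is an assumption for illustration, not part of the shakespeare dataset). It shows that unnest_tokens() lowercases the text, strips punctuation, and returns one word per row while keeping the other columns.

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line input, made up for illustration
toy <- tibble(
  title = c("Hamlet", "Hamlet"),
  text  = c("To be, or not to be,", "that is the question.")
)

tidy_toy <- toy %>%
  mutate(linenumber = row_number()) %>%  # remember which line each word came from
  unnest_tokens(word, text)              # one lowercase word per row, punctuation removed

tidy_toy

# Word frequencies, most common first
tidy_toy %>%
  count(word, sort = TRUE)
```

Each row of tidy_toy carries the linenumber of the line it came from, which is what makes the later sectioning by line number possible.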
3. Sentiment analysis of Shakespeare
After transforming the text of these Shakespearean plays to a tidy text dataset in the last exercise, the resulting data frame tidy_shakespeare is ready for lexicon-based sentiment analysis: joining each word against a sentiment lexicon. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.
shakespeare_sentiment <- tidy_shakespeare %>%
inner_join(get_sentiments("bing")) # Implement sentiment analysis with the "bing" lexicon
## Joining, by = "word"
shakespeare_sentiment %>%
count(title, sentiment) # Find how many positive/negative words each play has
## # A tibble: 12 x 3
## title sentiment n
## <chr> <chr> <int>
## 1 A Midsummer Night's Dream negative 681
## 2 A Midsummer Night's Dream positive 773
## 3 Hamlet, Prince of Denmark negative 1323
## 4 Hamlet, Prince of Denmark positive 1223
## 5 Much Ado about Nothing negative 767
## 6 Much Ado about Nothing positive 1127
## 7 The Merchant of Venice negative 740
## 8 The Merchant of Venice positive 962
## 9 The Tragedy of Macbeth negative 914
## 10 The Tragedy of Macbeth positive 749
## 11 The Tragedy of Romeo and Juliet negative 1235
## 12 The Tragedy of Romeo and Juliet positive 1090
- Passing two variables to count() returns the count n for each unique combination of the two variables. In this case, you have 6 plays and 2 sentiments, so count() returns 6 x 2 = 12 rows.
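The join-then-count pattern can be sketched on a tiny example. The words and the two-column toy_lexicon below are made up for illustration (an assumption standing in for get_sentiments("bing")); the point is that inner_join() silently drops words that are not in the lexicon, and count() then returns one row per title/sentiment combination.

```r
library(dplyr)

# Hypothetical tokenized words from two plays
words <- tibble(
  title = c("Play A", "Play A", "Play A", "Play A", "Play B", "Play B", "Play B"),
  word  = c("love",   "love",   "death",  "the",    "sweet",  "fear",   "fear")
)

# Hypothetical stand-in for a sentiment lexicon with word and sentiment columns
toy_lexicon <- tibble(
  word      = c("love", "sweet", "death", "fear"),
  sentiment = c("positive", "positive", "negative", "negative")
)

scored <- words %>%
  inner_join(toy_lexicon, by = "word")  # "the" has no match, so it is dropped

counts <- scored %>%
  count(title, sentiment)               # 2 plays x 2 sentiments = 4 rows

counts
```

Passing by = "word" explicitly also avoids the "Joining, by" message seen in the output above.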
4. Tragedy or comedy?
- Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?
sentiment_counts <- tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
# Count the number of words by title, type, and sentiment
count(title, type, sentiment)
## Joining, by = "word"
sentiment_counts %>%
group_by(title) %>% # Group by the titles of the plays
mutate(total = sum(n), # Find the total number of sentiment words in each play
percent = n / total) %>% # Calculate the number of words divided by the total
filter(sentiment == "negative") %>% # Filter the results for only negative sentiment
arrange(percent)
## # A tibble: 6 x 6
## # Groups: title [6]
## title type sentiment n total percent
## <chr> <chr> <chr> <int> <int> <dbl>
## 1 Much Ado about Nothing Comedy negative 767 1894 0.405
## 2 The Merchant of Venice Comedy negative 740 1702 0.435
## 3 A Midsummer Night's Dream Comedy negative 681 1454 0.468
## 4 Hamlet, Prince of Denmark Tragedy negative 1323 2546 0.520
## 5 The Tragedy of Romeo and Juliet Tragedy negative 1235 2325 0.531
## 6 The Tragedy of Macbeth Tragedy negative 914 1663 0.550
- Looking at the percent column of your output, you can see that tragedies do in fact have a higher percentage of negative words!
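As a check on the arithmetic, the sketch below recomputes the percentage for two plays using the positive and negative counts reported in the earlier count(title, sentiment) output. Only those four counts are typed in by hand; everything else follows the same group_by() and mutate() steps.

```r
library(dplyr)

# Counts copied from the count(title, sentiment) output above
sentiment_counts <- tibble(
  title     = rep(c("Much Ado about Nothing", "The Tragedy of Macbeth"), each = 2),
  type      = rep(c("Comedy", "Tragedy"), each = 2),
  sentiment = rep(c("negative", "positive"), times = 2),
  n         = c(767, 1127, 914, 749)
)

result <- sentiment_counts %>%
  group_by(title) %>%
  mutate(total   = sum(n),        # sentiment words per play (positive + negative)
         percent = n / total) %>% # share of each sentiment within the play
  ungroup() %>%
  filter(sentiment == "negative") %>%
  arrange(percent)

result
```

Note that total counts only words matched by the lexicon, so percent is the share of negative words among sentiment-bearing words, not among all words in the play.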
5. Most common positive and negative words
- Now you can explore which specific words are driving these sentiment scores. Which are the most common positive and negative words in these plays?
word_counts <- tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment)
## Joining, by = "word"
top_words <- word_counts %>%
group_by(sentiment) %>% # Group by sentiment
top_n(10) %>% # Take the top 10 for each sentiment
ungroup() %>%
mutate(word = reorder(word, n)) # Make word a factor in order of n
## Selecting by n
# Use aes() to put words on the x-axis and n on the y-axis
library(ggplot2)
ggplot(top_words, aes(x = word, y = n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~sentiment, scales = "free") +
coord_flip()
- Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare’s time? Do you see a word that the lexicon has misidentified?
- The word "wilt" was used differently in Shakespeare’s time and was not negative; the lexicon has misidentified it. For example, from Romeo and Juliet: "For thou wilt lie upon the wings of night". It is important to explore the details of how words were scored when performing sentiment analyses.
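One possible way to handle a misidentified word such as "wilt" is to exclude it with anti_join() before counting or plotting. This step is not part of the original exercise; it is a sketch under that assumption, and the counts and the custom_exclusions table below are made up for illustration.

```r
library(dplyr)

# Hypothetical word counts after joining with a sentiment lexicon
word_counts <- tibble(
  word      = c("death", "wilt", "love", "fair"),
  sentiment = c("negative", "negative", "positive", "positive"),
  n         = c(20L, 15L, 30L, 12L)  # made-up counts
)

# Hypothetical list of words whose lexicon label does not fit this corpus
custom_exclusions <- tibble(word = "wilt")

cleaned <- word_counts %>%
  anti_join(custom_exclusions, by = "word")  # keep only rows with no match

cleaned
```

Whether to drop such a word or relabel it depends on the analysis; dropping is simply the smallest change.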
6. Word contributions by play
- You will also practice using a different sentiment lexicon, the “afinn” lexicon in which words have a score from -5 to 5. Different lexicons take different approaches to quantifying the emotion/opinion content of words.
- Which words contribute to the overall sentiment in which plays?
tidy_shakespeare %>%
count(title, word, sort = TRUE) %>% # Count by title and word
inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
filter(title == "The Tragedy of Macbeth", score < 0) # Filter to only examine the scores for Macbeth that are negative
## Joining, by = "word"
## # A tibble: 237 x 4
## title word n score
## <chr> <chr> <int> <int>
## 1 The Tragedy of Macbeth no 73 -1
## 2 The Tragedy of Macbeth fear 35 -2
## 3 The Tragedy of Macbeth death 20 -2
## 4 The Tragedy of Macbeth bloody 16 -3
## 5 The Tragedy of Macbeth poor 16 -2
## 6 The Tragedy of Macbeth strange 16 -1
## 7 The Tragedy of Macbeth dead 14 -3
## 8 The Tragedy of Macbeth leave 14 -1
## 9 The Tragedy of Macbeth fight 13 -1
## 10 The Tragedy of Macbeth charges 11 -2
## # ... with 227 more rows
- Notice the use of words specific to Macbeth like “bloody”.
7. Calculating a contribution score
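- You can calculate a relative contribution for each word in each play. This contribution is found by multiplying the score for each word by the number of times it is used in each play, then dividing by the total number of lexicon-matched words in that play (sum(n) is computed after the inner_join(), so words absent from the lexicon are not counted).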
sentiment_contributions <- tidy_shakespeare %>%
count(title, word, sort = TRUE) %>% # Count by title and word
inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
group_by(title) %>% # Group by title
mutate(contribution = (n * score) / sum(n)) %>% # Calculate a contribution for each word in each title
ungroup()
## Joining, by = "word"
sentiment_contributions
## # A tibble: 2,366 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 Hamlet, Prince of Denmark no 143 -1 -0.0652
## 2 The Tragedy of Romeo and Juliet love 140 3 0.213
## 3 Much Ado about Nothing no 132 -1 -0.0768
## 4 Much Ado about Nothing hero 114 2 0.133
## 5 A Midsummer Night's Dream love 110 3 0.270
## 6 Hamlet, Prince of Denmark good 109 3 0.149
## 7 The Tragedy of Romeo and Juliet no 102 -1 -0.0518
## 8 Much Ado about Nothing good 93 3 0.162
## 9 The Merchant of Venice no 92 -1 -0.0630
## 10 Much Ado about Nothing love 91 3 0.159
## # ... with 2,356 more rows
- Notice that “hero” shows up in your results there; that is the name of one of the characters in “Much Ado About Nothing”.
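The contribution formula can be verified by hand on a toy play. The word counts and the three-word toy_afinn lexicon below are assumptions for illustration; the scores for "good", "dead", and "no" match those shown in the output above. The column is named score to agree with that output; recent tidytext releases, as far as I know, name it value instead.

```r
library(dplyr)

# Hypothetical word counts for a single play
word_counts <- tibble(
  title = "Toy Play",
  word  = c("good", "dead", "no", "the"),
  n     = c(3L, 1L, 2L, 10L)
)

# Hypothetical stand-in for the "afinn" lexicon
toy_afinn <- tibble(
  word  = c("good", "dead", "no"),
  score = c(3L, -3L, -1L)
)

contributions <- word_counts %>%
  inner_join(toy_afinn, by = "word") %>%  # "the" is dropped here
  group_by(title) %>%
  mutate(contribution = (n * score) / sum(n)) %>%  # sum(n) = 3 + 1 + 2 = 6
  ungroup()

contributions

# The contributions add up to the average score per matched word
sum(contributions$contribution)
```

Because the denominator is the number of matched words, the contributions within a play sum to the mean lexicon score of those words, here (9 - 3 - 2) / 6.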
8. Alas, poor Yorick!
It’s time to explore some of your results! Look at Hamlet and The Merchant of Venice to see what negative and positive words are important in these two plays.
- Arrange the most negative words
sentiment_contributions %>%
# Filter for Hamlet
filter(title == "Hamlet, Prince of Denmark") %>%
# Arrange to see the most negative words
arrange(contribution)
## # A tibble: 493 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 Hamlet, Prince of Denmark no 143 -1 -0.0652
## 2 Hamlet, Prince of Denmark dead 33 -3 -0.0451
## 3 Hamlet, Prince of Denmark death 38 -2 -0.0347
## 4 Hamlet, Prince of Denmark madness 22 -3 -0.0301
## 5 Hamlet, Prince of Denmark mad 21 -3 -0.0287
## 6 Hamlet, Prince of Denmark fear 21 -2 -0.0192
## 7 Hamlet, Prince of Denmark poor 20 -2 -0.0182
## 8 Hamlet, Prince of Denmark hell 10 -4 -0.0182
## 9 Hamlet, Prince of Denmark grave 17 -2 -0.0155
## 10 Hamlet, Prince of Denmark ghost 32 -1 -0.0146
## # ... with 483 more rows
- Arrange the most positive words
sentiment_contributions %>%
# Filter for The Merchant of Venice
filter(title == "The Merchant of Venice") %>%
# Arrange to see the most positive words
arrange(desc(contribution))
## # A tibble: 344 x 5
## title word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 The Merchant of Venice good 63 3 0.129
## 2 The Merchant of Venice love 60 3 0.123
## 3 The Merchant of Venice fair 35 2 0.0479
## 4 The Merchant of Venice like 34 2 0.0466
## 5 The Merchant of Venice true 24 2 0.0329
## 6 The Merchant of Venice sweet 23 2 0.0315
## 7 The Merchant of Venice pray 42 1 0.0288
## 8 The Merchant of Venice better 21 2 0.0288
## 9 The Merchant of Venice justice 17 2 0.0233
## 10 The Merchant of Venice welcome 17 2 0.0233
## # ... with 334 more rows
- These are definitely characteristic words for these two plays.
9. Sentiment changes through a play
- We will start by first implementing sentiment analysis using inner_join(), and then use count() with four arguments:
- title,
- type,
- an index that will section together lines of the play, and
- sentiment.
- After these lines of code, you will have the number of positive and negative words used in each index-ed section of the play. These sections will be 70 lines long in your analysis here. You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines work well for these plays.
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>% # Implement sentiment analysis using "bing" lexicon
count(title,
type,
index = linenumber %/% 70,
sentiment)
## Joining, by = "word"
## # A tibble: 744 x 5
## title type index sentiment n
## <chr> <chr> <dbl> <chr> <int>
## 1 A Midsummer Night's Dream Comedy 0 negative 4
## 2 A Midsummer Night's Dream Comedy 0 positive 11
## 3 A Midsummer Night's Dream Comedy 1 negative 7
## 4 A Midsummer Night's Dream Comedy 1 positive 19
## 5 A Midsummer Night's Dream Comedy 2 negative 20
## 6 A Midsummer Night's Dream Comedy 2 positive 23
## 7 A Midsummer Night's Dream Comedy 3 negative 12
## 8 A Midsummer Night's Dream Comedy 3 positive 18
## 9 A Midsummer Night's Dream Comedy 4 negative 9
## 10 A Midsummer Night's Dream Comedy 4 positive 27
## # ... with 734 more rows
- This is the first step in looking at narrative arcs.
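The sectioning relies on integer division. A minimal base R sketch, using a made-up play of 210 lines, shows how linenumber %/% 70 maps line numbers to section indices.

```r
# Hypothetical play with 210 lines, numbered from 1 as row_number() does
linenumber <- 1:210

# Integer division assigns each line to a section
index <- linenumber %/% 70

# A few boundary cases: lines 1 and 69 fall in section 0, line 70 starts section 1
c(1, 69, 70, 139, 140) %/% 70

# Number of lines per section
table(index)
```

Because line numbers start at 1, section 0 holds 69 lines (1 to 69), the middle sections hold 70 each, and the last section holds whatever remains.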
10. Calculating net sentiment
The next steps involve spread() from the tidyr package. After these lines of code, you will have the net sentiment in each index-ed section of the play; net sentiment is the negative sentiment subtracted from the positive sentiment.
# Load the tidyr package
library(tidyr)
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, type, index = linenumber %/% 70, sentiment) %>%
spread(sentiment, n, fill = 0) %>% # Spread sentiment and n across multiple columns
mutate(sentiment = positive - negative) # Use mutate to find net sentiment
## Joining, by = "word"
## # A tibble: 373 x 6
## title type index negative positive sentiment
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A Midsummer Night's Dream Comedy 0 4 11 7
## 2 A Midsummer Night's Dream Comedy 1 7 19 12
## 3 A Midsummer Night's Dream Comedy 2 20 23 3
## 4 A Midsummer Night's Dream Comedy 3 12 18 6
## 5 A Midsummer Night's Dream Comedy 4 9 27 18
## 6 A Midsummer Night's Dream Comedy 5 11 21 10
## 7 A Midsummer Night's Dream Comedy 6 12 16 4
## 8 A Midsummer Night's Dream Comedy 7 9 6 -3
## 9 A Midsummer Night's Dream Comedy 8 6 12 6
## 10 A Midsummer Night's Dream Comedy 9 19 12 -7
## # ... with 363 more rows
You are closer to plotting the sentiment through these plays.
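The role of fill = 0 in spread() is easiest to see when a section is missing one sentiment. In the toy counts below (made up for illustration), section 1 has no negative words, so without a fill value its negative cell would be NA and the net sentiment could not be computed.

```r
library(dplyr)
library(tidyr)

# Hypothetical per-section counts; section 1 has no negative words
section_counts <- tibble(
  index     = c(0, 0, 1),
  sentiment = c("negative", "positive", "positive"),
  n         = c(4L, 11L, 5L)
)

net <- section_counts %>%
  spread(sentiment, n, fill = 0) %>%        # one column per sentiment, missing cells become 0
  mutate(sentiment = positive - negative)   # net sentiment per section

net
```

Newer tidyr versions also provide pivot_wider() for the same reshaping; spread() is used here to stay consistent with the code above.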
11. Visualizing narrative arcs
You will continue to build on your manipulations of this text dataset and visualize the results of this sentiment analysis.
library(tidyr)
library(ggplot2)
tidy_shakespeare %>%
inner_join(get_sentiments("bing")) %>%
count(title, type, index = linenumber %/% 70, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative) %>%
ggplot(aes(x = index, # Put index on x-axis
y = sentiment, # Put sentiment on y-axis
fill = type)) + # map comedy/tragedy to fill
geom_col() + # Make a bar chart with geom_col()
facet_wrap(~ title, scales = "free_x") # Separate panels for each title with facet_wrap()
## Joining, by = "word"
- These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment than the tragedies.