The source code and contents come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

Intro

  • Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

1. Sentiment Lexicons

There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course, all available in the tidytext package:

  • afinn from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer.

# Load dplyr and tidytext
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
# Choose the bing lexicon
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
# Choose the nrc lexicon
get_sentiments("nrc") %>%
count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
  • While the “bing” lexicon classifies words into 2 sentiments, positive or negative, there are 10 sentiments conveyed in the “nrc” lexicon.
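The third lexicon listed above, afinn, is not shown in the course output, but you can inspect it the same way. A minimal sketch: note that recent versions of tidytext fetch afinn through the textdata package (you may be prompted to download it on first use) and name its numeric column value, while older versions used score.

# Choose the afinn lexicon, which scores each word from -5
# (most negative) to 5 (most positive)
get_sentiments("afinn")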

2. Implement an inner join

In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr identifies which words appear in both the sentiment lexicon and the text dataset you are examining. To learn more about joining data frames, see the dplyr documentation. The geocoded_tweets dataset is taken from Quartz and contains three columns: state, word, and freq (how often each word is used per state, on average).

load("data/geocoded_tweets.rda")
# Access bing lexicon: bing
bing <- get_sentiments("bing")
# Use data frame with text data
geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing)
## Joining, by = "word"
## # A tibble: 64,303 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abuse 7186. negative
## 2 alabama abused 3073. negative
## 3 alabama accomplish 5957. positive
## 4 alabama accomplished 13121. positive
## 5 alabama accomplishment 3036. positive
## 6 alabama accurate 28262. positive
## 7 alabama ache 7306. negative
## 8 alabama aching 5080. negative
## 9 alabama addict 5441. negative
## 10 alabama addicted 40389. negative
## # ... with 64,293 more rows
  • You can see the average frequency and the sentiment associated with each word that exists in both data frames.

3. What are the most common sadness words?

After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?

tweets_nrc <- geocoded_tweets %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
tweets_nrc %>%
  filter(sentiment == "sadness") %>% # Keep only words tagged as sadness
  group_by(word) %>%
  summarise(freq = mean(freq)) %>% # Average frequency across states
  arrange(desc(freq))
## # A tibble: 585 x 2
## word freq
## <chr> <dbl>
## 1 hate 1253840.
## 2 bad 984943.
## 3 bitch 787774.
## 4 hell 486259.
## 5 crazy 447047.
## 6 feeling 407562.
## 7 leave 397806.
## 8 mad 393559.
## 9 music 373608.
## 10 sick 362023.
## # ... with 575 more rows

4. What are the most common joy words?

You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.

joy_words <- tweets_nrc %>%
  filter(sentiment == "joy") %>%
  group_by(word) %>%
  summarise(freq = mean(freq)) %>%
  arrange(desc(freq))
library(ggplot2)
joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col() + # geom_col() already uses stat = "identity"; passing stat explicitly only triggers a warning
  coord_flip()
## Selecting by freq


5. Do people in different states use different words?

So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?

tweets_nrc %>%
  filter(state == "utah",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 326 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 utah love 4207322. joy
## 2 utah good 3035114. joy
## 3 utah happy 1402568. joy
## 4 utah pretty 902947. joy
## 5 utah fun 764045. joy
## 6 utah birthday 663439. joy
## 7 utah beautiful 653061. joy
## 8 utah friend 627522. joy
## 9 utah hope 571841. joy
## 10 utah god 536687. joy
## # ... with 316 more rows
tweets_nrc %>%
  filter(state == "louisiana",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 290 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 louisiana love 3764157. joy
## 2 louisiana good 2758699. joy
## 3 louisiana baby 1184392. joy
## 4 louisiana happy 1176291. joy
## 5 louisiana god 882457. joy
## 6 louisiana birthday 740497. joy
## 7 louisiana money 677899. joy
## 8 louisiana hope 675559. joy
## 9 louisiana pretty 581242. joy
## 10 louisiana feeling 486367. joy
## # ... with 280 more rows
  • Words like “baby” and “money” are popular in Louisiana but not in Utah.
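To line the two states up side by side instead of scanning two separate tables, one option (a sketch, not part of the course code) is to spread the joy-word frequencies into one column per state:

# Hypothetical side-by-side comparison of the two states' joy words
library(tidyr) # for spread()
tweets_nrc %>%
  filter(state %in% c("utah", "louisiana"),
         sentiment == "joy") %>%
  select(state, word, freq) %>%
  spread(state, freq) %>% # one frequency column per state
  arrange(desc(utah))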

6. Which states have the most positive Twitter users?

  • You will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created in just the same way as earlier. Check out what tweets_bing looks like in the console.
library(tidyr)
tweets_bing <- geocoded_tweets %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
tweets_bing %>%
  group_by(state, sentiment) %>%
  summarise(freq = mean(freq)) %>%
  spread(sentiment, freq) %>%
  ungroup() %>%
  mutate(ratio = positive / negative,
         state = reorder(state, ratio)) %>%
  ggplot(aes(x = state, y = ratio)) +
  geom_point() +
  coord_flip()


  • By combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. It looks like Missouri tops the list for this one!
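A side note on the spread() step above: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the equivalent reshaping step:

# Equivalent reshaping with pivot_wider() (tidyr >= 1.0.0)
tweets_bing %>%
  group_by(state, sentiment) %>%
  summarise(freq = mean(freq)) %>%
  pivot_wider(names_from = sentiment, values_from = freq)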

