The source code and contents come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

Intro

  • Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.

1. Sentiment Lexicons

There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course, all available in the tidytext package:

  • afinn from Finn Årup Nielsen,
  • bing from Bing Liu and collaborators, and
  • nrc from Saif Mohammad and Peter Turney.

You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer.

# Load dplyr and tidytext
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
# Choose the bing lexicon
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
# Choose the nrc lexicon
get_sentiments("nrc") %>%
count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
  • While the “bing” lexicon classifies words into 2 sentiments, positive or negative, there are 10 sentiments conveyed in the “nrc” lexicon.
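The third lexicon listed above, afinn, is not shown in the course output, but you can inspect it the same way. A minimal sketch: note that recent versions of tidytext fetch afinn through the textdata package (you may be prompted to download it on first use) and name its numeric column value, while older versions used score.

# Choose the afinn lexicon, which scores each word from -5
# (most negative) to 5 (most positive)
get_sentiments("afinn")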

2. Implement an inner join

In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr identifies which words appear in both the sentiment lexicon and the text dataset you are examining. To learn more about joining data frames, see the dplyr documentation. The geocoded_tweets dataset is taken from Quartz and contains three columns: state, word, and freq (how often each word is used per state, on average).

load("data/geocoded_tweets.rda")
# Access bing lexicon: bing
bing <- get_sentiments("bing")
# Use data frame with text data
geocoded_tweets %>%
  # With inner join, implement sentiment analysis using `bing`
  inner_join(bing)
## Joining, by = "word"
## # A tibble: 64,303 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abuse 7186. negative
## 2 alabama abused 3073. negative
## 3 alabama accomplish 5957. positive
## 4 alabama accomplished 13121. positive
## 5 alabama accomplishment 3036. positive
## 6 alabama accurate 28262. positive
## 7 alabama ache 7306. negative
## 8 alabama aching 5080. negative
## 9 alabama addict 5441. negative
## 10 alabama addicted 40389. negative
## # ... with 64,293 more rows
  • You can see the average frequency and the sentiment associated with each word that exists in both data frames.

3. What are the most common sadness words?

After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?

tweets_nrc <- geocoded_tweets %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
tweets_nrc %>%
  filter(sentiment == "sadness") %>% # Keep only words tagged as sadness
  group_by(word) %>%
  summarise(freq = mean(freq)) %>% # Average frequency across states
  arrange(desc(freq))
## # A tibble: 585 x 2
## word freq
## <chr> <dbl>
## 1 hate 1253840.
## 2 bad 984943.
## 3 bitch 787774.
## 4 hell 486259.
## 5 crazy 447047.
## 6 feeling 407562.
## 7 leave 397806.
## 8 mad 393559.
## 9 music 373608.
## 10 sick 362023.
## # ... with 575 more rows

4. What are the most common joy words?

You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.

joy_words <- tweets_nrc %>%
  filter(sentiment == "joy") %>%
  group_by(word) %>%
  summarise(freq = mean(freq)) %>%
  arrange(desc(freq))
library(ggplot2)
joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col() + # geom_col() already uses stat = "identity"; passing stat explicitly only triggers a warning
  coord_flip()
## Selecting by freq


5. Do people in different states use different words?

So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?

tweets_nrc %>%
  filter(state == "utah",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 326 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 utah love 4207322. joy
## 2 utah good 3035114. joy
## 3 utah happy 1402568. joy
## 4 utah pretty 902947. joy
## 5 utah fun 764045. joy
## 6 utah birthday 663439. joy
## 7 utah beautiful 653061. joy
## 8 utah friend 627522. joy
## 9 utah hope 571841. joy
## 10 utah god 536687. joy
## # ... with 316 more rows
tweets_nrc %>%
  filter(state == "louisiana",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 290 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 louisiana love 3764157. joy
## 2 louisiana good 2758699. joy
## 3 louisiana baby 1184392. joy
## 4 louisiana happy 1176291. joy
## 5 louisiana god 882457. joy
## 6 louisiana birthday 740497. joy
## 7 louisiana money 677899. joy
## 8 louisiana hope 675559. joy
## 9 louisiana pretty 581242. joy
## 10 louisiana feeling 486367. joy
## # ... with 280 more rows
  • Words like “baby” and “money” are popular in Louisiana but not in Utah.
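To line the two states up side by side instead of scanning two separate tables, one option (a sketch, not part of the course code) is to spread the joy-word frequencies into one column per state:

# Hypothetical side-by-side comparison of the two states' joy words
library(tidyr) # for spread()
tweets_nrc %>%
  filter(state %in% c("utah", "louisiana"),
         sentiment == "joy") %>%
  select(state, word, freq) %>%
  spread(state, freq) %>% # one frequency column per state
  arrange(desc(utah))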

6. Which states have the most positive Twitter users?

  • You will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created in just the same way as earlier. Check out what tweets_bing looks like in the console.
library(tidyr)
tweets_bing <- geocoded_tweets %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
tweets_bing %>%
  group_by(state, sentiment) %>%
  summarise(freq = mean(freq)) %>%
  spread(sentiment, freq) %>%
  ungroup() %>%
  mutate(ratio = positive / negative,
         state = reorder(state, ratio)) %>%
  ggplot(aes(x = state, y = ratio)) +
  geom_point() +
  coord_flip()


  • By combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. It looks like Missouri tops the list for this one!
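A side note on the spread() step above: in tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). A minimal sketch of the equivalent reshaping step:

# Equivalent reshaping with pivot_wider() (tidyr >= 1.0.0)
tweets_bing %>%
  group_by(state, sentiment) %>%
  summarise(freq = mean(freq)) %>%
  pivot_wider(names_from = sentiment, values_from = freq)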

