The source code and contents come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!
Intro
- Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
1. Sentiment Lexicons
There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course, all available in the tidytext package:
- afinn from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer.
| library(tidytext) |
| library(dplyr)    # count() and the joins used below come from dplyr |
| |
| # The bing lexicon labels words as positive or negative |
| get_sentiments("bing") |
| |
| # How many words carry each sentiment in the nrc lexicon? |
| get_sentiments("nrc") %>% |
|   count(sentiment) |
- While the “bing” lexicon classifies words into two sentiments, positive or negative, the “nrc” lexicon conveys ten different sentiments.
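The afinn lexicon is mentioned above but not shown in the course snippet. As a minimal sketch (assuming a current tidytext release, where the numeric column is named value and the lexicon is downloaded via the textdata package on first use), you could inspect it like this:
| # afinn scores words on an integer scale from -5 (negative) to 5 (positive) |
| get_sentiments("afinn") |
| |
| # Confirm the range of scores |
| get_sentiments("afinn") %>% |
|   summarise(min = min(value), max = max(value)) |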
2. Implement an inner join
In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr identifies which words are in both the sentiment lexicon and the text dataset you are examining. To learn more about joining data frames, see the dplyr documentation on two-table verbs. The geocoded_tweets dataset is taken from Quartz and contains three columns: state, word, and freq (the average frequency of that word in tweets from that state).
| load("data/geocoded_tweets.rda") |
| |
| # Access the bing lexicon |
| bing <- get_sentiments("bing") |
| |
| # Keep only the words that appear in both data frames |
| geocoded_tweets %>% |
|   inner_join(bing) |
- You can see the average frequency and the sentiment associated with each word that exists in both data frames.
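Two optional variations, sketched here rather than taken from the course: passing by = "word" explicitly silences the joining message, and anti_join() shows which tweet words carry no sentiment label at all.
| # Explicit join key: same result, without the "Joining, by = ..." message |
| geocoded_tweets %>% |
|   inner_join(bing, by = "word") |
| |
| # Words in the tweets that are NOT in the bing lexicon |
| geocoded_tweets %>% |
|   anti_join(bing, by = "word") |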
3. What are the most common sadness words?
After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?
| # Join the tweets to the nrc lexicon |
| tweets_nrc <- geocoded_tweets %>% |
|   inner_join(get_sentiments("nrc")) |
## Joining, by = "word"
| # Most common sadness words, averaged across states |
| tweets_nrc %>% |
|   filter(sentiment == "sadness") %>% |
|   group_by(word) %>% |
|   summarise(freq = mean(freq)) %>% |
|   arrange(desc(freq)) |
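As a side note (my own sketch, not course code): if you only want the top few rows, current dplyr (>= 1.0) offers slice_max() as a compact alternative to the arrange(desc()) pattern.
| # Top 10 sadness words, assuming dplyr >= 1.0 |
| tweets_nrc %>% |
|   filter(sentiment == "sadness") %>% |
|   group_by(word) %>% |
|   summarise(freq = mean(freq)) %>% |
|   slice_max(freq, n = 10) |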
4. What are the most common joy words?
You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.
| joy_words <- tweets_nrc %>% |
|   filter(sentiment == "joy") %>% |
|   group_by(word) %>% |
|   summarise(freq = mean(freq)) %>% |
|   arrange(desc(freq)) |
| |
| library(ggplot2) |
| |
| joy_words %>% |
|   top_n(20, freq) %>% |
|   mutate(word = reorder(word, freq)) %>% |
|   ggplot(aes(x = word, y = freq)) + |
|   # geom_col() already uses stat = "identity", so no stat argument is needed |
|   geom_col() + |
|   coord_flip() |

5. Do people in different states use different words?
So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?
| # Joy words in Utah |
| tweets_nrc %>% |
|   filter(state == "utah", |
|          sentiment == "joy") %>% |
|   arrange(desc(freq)) |
| |
| # Joy words in Louisiana |
| tweets_nrc %>% |
|   filter(state == "louisiana", |
|          sentiment == "joy") %>% |
|   arrange(desc(freq)) |
- Words like “baby” and “money” are popular in Louisiana but not in Utah.
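To put the two states side by side, here is a sketch of my own (not from the course) that spreads the joy-word frequencies into one row per word:
| # One row per joy word, with a frequency column per state |
| library(tidyr) |
| |
| tweets_nrc %>% |
|   filter(state %in% c("utah", "louisiana"), |
|          sentiment == "joy") %>% |
|   select(state, word, freq) %>% |
|   spread(state, freq) %>% |
|   arrange(desc(louisiana)) |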
6. How does sentiment vary from state to state?
In this exercise you will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created just the same way that you did earlier. Check out what tweets_bing looks like in the console.
| library(tidyr) |
| |
| # Join the tweets to the bing lexicon |
| tweets_bing <- geocoded_tweets %>% |
|   inner_join(get_sentiments("bing")) |
## Joining, by = "word"
| # Ratio of positive to negative word frequency, plotted by state |
| tweets_bing %>% |
|   group_by(state, sentiment) %>% |
|   summarise(freq = mean(freq)) %>% |
|   spread(sentiment, freq) %>% |
|   ungroup() %>% |
|   mutate(ratio = positive / negative, |
|          state = reorder(state, ratio)) %>% |
|   ggplot(aes(x = state, y = ratio)) + |
|   geom_point() + |
|   coord_flip() |

- By combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. It looks like Missouri tops the list for this one!
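To read the extremes off the data rather than the plot, here is a small sketch of my own (the state_ratios name is hypothetical, and slice_max()/slice_min() assume dplyr >= 1.0):
| # Positive/negative ratio per state (state_ratios is a hypothetical name) |
| state_ratios <- tweets_bing %>% |
|   group_by(state, sentiment) %>% |
|   summarise(freq = mean(freq)) %>% |
|   spread(sentiment, freq) %>% |
|   ungroup() %>% |
|   mutate(ratio = positive / negative) |
| |
| state_ratios %>% slice_max(ratio, n = 5)   # most positive states |
| state_ratios %>% slice_min(ratio, n = 5)   # most negative states |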