Text datasets are diverse and ubiquitous, and sentiment analysis provides an approach to understand the attitudes and opinions expressed in these texts. In this course, you will develop your text mining skills using tidy data principles. You will apply these skills by performing sentiment analysis in several case studies, on text data from Twitter to TV news to Shakespeare. These case studies will allow you to practice important data handling skills, learn about the ways sentiment analysis can be applied, and extract relevant insights from real-world data.
1. Sentiment Lexicons
There are several different sentiment lexicons available for sentiment analysis. You will explore three in this course that are available in the tidytext package:
+ afinn from Finn Årup Nielsen,
+ bing from Bing Liu and collaborators, and
+ nrc from Saif Mohammad and Peter Turney.
You will see how these lexicons can be used as you work through this course. The decision about which lexicon to use often depends on what question you are trying to answer.
# Load dplyr and tidytext
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
# Choose the bing lexicon
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
# Choose the nrc lexicon
get_sentiments("nrc") %>%
count(sentiment) # Count words by sentiment
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
While the “bing” lexicon classifies words into two sentiments, positive and negative, the “nrc” lexicon conveys ten sentiments.
2. Implement an inner join
In this exercise you will implement sentiment analysis using an inner join. The inner_join() function from dplyr will identify which words are in both the sentiment lexicon and the text dataset you are examining. (See the dplyr documentation on two-table verbs to learn more about joining data frames.) The geocoded_tweets dataset is taken from Quartz and contains three columns: state, word, and freq.
load("data/geocoded_tweets.rda")
# Access bing lexicon: bing
bing <- get_sentiments("bing")
# Use data frame with text data
geocoded_tweets %>%
# With inner join, implement sentiment analysis using `bing`
inner_join(bing)
## Joining, by = "word"
## # A tibble: 64,303 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 alabama abuse 7186. negative
## 2 alabama abused 3073. negative
## 3 alabama accomplish 5957. positive
## 4 alabama accomplished 13121. positive
## 5 alabama accomplishment 3036. positive
## 6 alabama accurate 28262. positive
## 7 alabama ache 7306. negative
## 8 alabama aching 5080. negative
## 9 alabama addict 5441. negative
## 10 alabama addicted 40389. negative
## # ... with 64,293 more rows
You can see the average frequency and the sentiment associated with each word that appears in both data frames.
3. What are the most common sadness words?
After you have implemented sentiment analysis using inner_join(), you can use dplyr functions such as group_by() and summarize() to understand your results. For example, what are the most common words related to sadness in this Twitter dataset?
tweets_nrc <- geocoded_tweets %>%
  inner_join(get_sentiments("nrc"))
## Joining, by = "word"
tweets_nrc %>%
  filter(sentiment == "sadness") %>%
  group_by(word) %>%
  summarise(freq = mean(freq)) %>%
  arrange(desc(freq))
## # A tibble: 585 x 2
## word freq
## <chr> <dbl>
## 1 hate 1253840.
## 2 bad 984943.
## 3 bitch 787774.
## 4 hell 486259.
## 5 crazy 447047.
## 6 feeling 407562.
## 7 leave 397806.
## 8 mad 393559.
## 9 music 373608.
## 10 sick 362023.
## # ... with 575 more rows
4. What are the most common joy words?
You can use the same approach from the last exercise to find the most common words associated with joy in these tweets. Use the same pattern of dplyr verbs to find a new result.
joy_words <- tweets_nrc %>%
  filter(sentiment == "joy") %>%
  group_by(word) %>%
  summarise(freq = mean(freq)) %>%
  arrange(desc(freq))
library(ggplot2)
joy_words %>%
  top_n(20) %>%
  mutate(word = reorder(word, freq)) %>%
  ggplot(aes(x = word, y = freq)) +
  geom_col() +
  coord_flip()
## Selecting by freq
5. Do people in different states use different words?
So far you have looked at the United States as a whole, but you can use this dataset to examine differences in word use by state. In this exercise, you will examine two states and compare their use of joy words. Do they use the same words associated with joy? Do they use these words at the same rate?
tweets_nrc %>%
  filter(state == "utah",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 326 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 utah love 4207322. joy
## 2 utah good 3035114. joy
## 3 utah happy 1402568. joy
## 4 utah pretty 902947. joy
## 5 utah fun 764045. joy
## 6 utah birthday 663439. joy
## 7 utah beautiful 653061. joy
## 8 utah friend 627522. joy
## 9 utah hope 571841. joy
## 10 utah god 536687. joy
## # ... with 316 more rows
tweets_nrc %>%
  filter(state == "louisiana",
         sentiment == "joy") %>%
  arrange(desc(freq))
## # A tibble: 290 x 4
## state word freq sentiment
## <chr> <chr> <dbl> <chr>
## 1 louisiana love 3764157. joy
## 2 louisiana good 2758699. joy
## 3 louisiana baby 1184392. joy
## 4 louisiana happy 1176291. joy
## 5 louisiana god 882457. joy
## 6 louisiana birthday 740497. joy
## 7 louisiana money 677899. joy
## 8 louisiana hope 675559. joy
## 9 louisiana pretty 581242. joy
## 10 louisiana feeling 486367. joy
## # ... with 280 more rows
Words like “baby” and “money” are popular in Louisiana but not in Utah.
6. Which states have the most positive Twitter users?
In this exercise, you will determine how overall Twitter sentiment varies from state to state. You will use a dataset called tweets_bing, which is the output of an inner join created in just the same way as earlier. Check out what tweets_bing looks like in the console.
library(tidyr)
tweets_bing <- geocoded_tweets %>%
  inner_join(get_sentiments("bing"))
## Joining, by = "word"
tweets_bing %>%
  group_by(state, sentiment) %>%
  summarise(freq = mean(freq)) %>%
  spread(sentiment, freq) %>%
  ungroup() %>%
  mutate(ratio = positive / negative,
         state = reorder(state, ratio)) %>%
  ggplot(aes(x = state, y = ratio)) +
  geom_point() +
  coord_flip()
Combining your data with a sentiment lexicon, you can do all sorts of exploratory data analysis. Looks like Missouri tops the list for this one!
The source code and contents come from the book Text Mining with R. Enjoy!
Intro
Let’s address the topic of opinion mining or sentiment analysis. When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. The flow chart is shown in Figure 2.1.
One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.
All three of these lexicons are based on unigrams, i.e., single words.
These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth.
Differences
The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
The bing lexicon categorizes words in a binary fashion into positive and negative categories.
The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.
tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.
get_sentiments("afinn")
## # A tibble: 2,476 x 2
## word score
## <chr> <int>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
## # ... with 2,466 more rows
get_sentiments("bing")
## # A tibble: 6,788 x 2
## word sentiment
## <chr> <chr>
## 1 2-faced negative
## 2 2-faces negative
## 3 a+ positive
## 4 abnormal negative
## 5 abolish negative
## 6 abominable negative
## 7 abominably negative
## 8 abominate negative
## 9 abomination negative
## 10 abort negative
## # ... with 6,778 more rows
get_sentiments("nrc")
## # A tibble: 13,901 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
## 7 abandoned negative
## 8 abandoned sadness
## 9 abandonment anger
## 10 abandonment fear
## # ... with 13,891 more rows
Considerations
They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data.
Thus, we need to be careful when applying these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago.
Moreover, there are also some domain-specific sentiment lexicons available, constructed to be used with text from a specific content area.
In conclusion, focus on the overall workflow of sentiment analysis in this tutorial, and think about how you would adapt it to your own context.
Sentiment Analysis with inner join
Let’s look at the words with a joy score from the NRC lexicon. What are the most common joy words in Emma?
library(janeaustenr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now that the text is in a tidy format with one word per row, we are ready to do the sentiment analysis. First, let’s use the NRC lexicon and filter() for the joy words. Next, let’s filter() the data frame with the text from the books for the words from Emma and then use inner_join() to perform the sentiment analysis. What are the most common joy words in Emma? Let’s use count() from dplyr.
nrc_joy <- get_sentiments("nrc") %>%
  filter(sentiment == "joy")
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 303 x 2
## word n
## <chr> <int>
## 1 good 359
## 2 young 192
## 3 friend 166
## 4 hope 143
## 5 happy 125
## 6 love 117
## 7 deal 92
## 8 found 92
## 9 present 89
## 10 kind 82
## # ... with 293 more rows
We see mostly positive, happy words about hope, friendship, and love here. We can also examine how sentiment changes throughout each novel:
1. First, we find a sentiment score for each word using the Bing lexicon and inner_join().
2. Next, we count up how many positive and negative words there are in defined sections of each book. We define an index here to keep track of where we are in the narrative; this index (using integer division) counts up sections of 80 lines of text.
The %/% operator does integer division (x %/% y is equivalent to floor(x/y)) so the index keeps track of which 80-line section of text we are counting up negative and positive sentiment in.
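For instance:
# Lines 0-79 get index 0, lines 80-159 get index 1, and so on
c(0, 79, 80, 159, 160) %/% 80
## [1] 0 0 1 1 2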
We then use spread() so that we have negative and positive sentiment in separate columns, and lastly calculate a net sentiment (positive - negative).
library(tidyr)
jane_austen_sentiment <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
Now we can plot these sentiment scores across the plot trajectory of each novel. Notice that we are plotting against the index on the x-axis that keeps track of narrative time in sections of text.
library(ggplot2)
ggplot(jane_austen_sentiment, aes(index, sentiment, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~book, ncol = 2, scales = "free_x")
Comparing the three sentiment dictionaries
Let’s compare the three sentiment dictionaries using the narrative arc of Pride and Prejudice.
First, let’s use filter() to choose only the words from the one novel we are interested in.
pride_prejudice <- tidy_books %>%
  filter(book == "Pride & Prejudice")
head(pride_prejudice)
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Pride & Prejudice 1 0 pride
## 2 Pride & Prejudice 1 0 and
## 3 Pride & Prejudice 1 0 prejudice
## 4 Pride & Prejudice 3 0 by
## 5 Pride & Prejudice 3 0 jane
## 6 Pride & Prejudice 3 0 austen
Second, we can use inner_join() to calculate the sentiment in different ways.
Remember from above that the AFINN lexicon measures sentiment with a numeric score between -5 and 5, while the other two lexicons categorize words in a binary fashion, either positive or negative. To find a sentiment score in chunks of text throughout the novel, we will need to use a different pattern for the AFINN lexicon than for the other two.
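The AFINN pipeline itself is not shown in this excerpt; a sketch following the book (older tidytext releases name the AFINN column score, as in the printout earlier, while newer releases call it value):
afinn <- pride_prejudice %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(index = linenumber %/% 80) %>%
  summarise(sentiment = sum(score)) %>%   # use sum(value) with newer tidytext
  mutate(method = "AFINN")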
bing_and_nrc <- bind_rows(pride_prejudice %>% inner_join(get_sentiments("bing")) %>% mutate(method = "Bing et al."),
                          pride_prejudice %>% inner_join(get_sentiments("nrc") %>% filter(sentiment %in% c("positive", "negative"))) %>% mutate(method = "NRC")) %>%
  count(method, index = linenumber %/% 80, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
## Joining, by = "word"
## Joining, by = "word"
head(bing_and_nrc)
## # A tibble: 6 x 5
## method index negative positive sentiment
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bing et al. 0 7 21 14
## 2 Bing et al. 1 20 19 -1
## 3 Bing et al. 2 16 20 4
## 4 Bing et al. 3 19 31 12
## 5 Bing et al. 4 23 47 24
## 6 Bing et al. 5 15 49 34
Now, we have an estimate of the net sentiment (positive - negative) in each chunk of the novel text for each sentiment lexicon. Let’s visualize all of them.
bind_rows(afinn,
          bing_and_nrc) %>%
  ggplot(aes(index, sentiment, fill = method)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~method, ncol = 1, scales = "free_y")
The three plots show that the three lexicons give results that differ in an absolute sense but have similar relative trajectories through the novel.
Why, then, is the result for the NRC lexicon biased so high in sentiment compared to the Bing et al. result? Let's look briefly at how many positive and negative words are in these lexicons.
get_sentiments("nrc") %>%
filter(sentiment %in%c("positive",
"negative")) %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2006
Both lexicons have more negative than positive words, but the ratio of negative to positive words is higher in the Bing lexicon than the NRC lexicon. This will contribute to the effect we see in the plot above, as will any systematic difference in word matches, e.g. if the negative words in the NRC lexicon do not match the words that Jane Austen uses very well. Whatever the source of these differences, we see similar relative trajectories across the narrative arc, with similar changes in slope, but marked differences in absolute sentiment from lexicon to lexicon. This is all important context to keep in mind when choosing a sentiment lexicon for analysis.
Point 1. This makes me reconsider using the given lexicons or packages directly, without further validation.
Point 2. It also motivates building a context-specific lexicon for each different domain, starting from the given lexicons.
Most common positive and negative words
One advantage of having the data frame with both sentiment and word is that we can analyze word counts that contribute to each sentiment. By implementing count() here with arguments of both word and sentiment, we find out how much each word contributed to each sentiment.
bing_word_counts <- tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
## Joining, by = "word"
head(bing_word_counts)
## # A tibble: 6 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
This can be shown visually, and we can pipe straight into ggplot2, if we like, because of the way we are consistently using tools built for handling tidy data frames.
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(y = "Contribution to sentiment",
       x = NULL) +
  coord_flip()
## Selecting by n
Point 1. Let us look at the word “miss”. The word is coded as negative, but it is often used for unmarried women in Jane Austen’s works. So, if it were appropriate for our purposes, we could easily add “miss” to a custom stop-words list using bind_rows(), as sketched below.
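A sketch of that custom stop-word list, following the book:
custom_stop_words <- bind_rows(data_frame(word = c("miss"),
                                          lexicon = c("custom")),
                               stop_words)
The quoted sentence below previews the next section: it comes from tokenizing Pride and Prejudice by sentences rather than by words. A sketch of the call that produces it (prideprejudice is a character vector shipped with janeaustenr):
p_and_p_sentences <- data_frame(text = prideprejudice) %>%
  unnest_tokens(sentence, text, token = "sentences")
p_and_p_sentences$sentence[2]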
## [1] "however little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters."
Bonus: Looking at units beyond just words
Another option in unnest_tokens() is to split into tokens using a regex pattern. We could use this, for example, to split the text of Jane Austen’s novels into a data frame by chapter.
austen_chapters <- austen_books() %>%
  group_by(book) %>%
  unnest_tokens(chapter, text, token = "regex",
                pattern = "Chapter|CHAPTER [\\dIVXLC]") %>%
  ungroup()
austen_chapters %>%
  group_by(book) %>%
  summarise(chapters = n())
## # A tibble: 6 x 2
## book chapters
## <fct> <int>
## 1 Sense & Sensibility 51
## 2 Pride & Prejudice 62
## 3 Mansfield Park 49
## 4 Emma 56
## 5 Northanger Abbey 32
## 6 Persuasion 25
We can use tidy text analysis to ask questions such as what are the most negative chapters in each of Jane Austen’s novels?
First, let’s get the list of negative words from the Bing lexicon.
Second, let’s make a data frame of how many words are in each chapter so we can normalize for the length of chapters.
Then, let’s find the number of negative words in each chapter and divide by the total words in each chapter.
For each book, which chapter has the highest proportion of negative words?
bingnegative <- get_sentiments("bing") %>%
  filter(sentiment == "negative")
wordcounts <- tidy_books %>%
  group_by(book, chapter) %>%
  summarize(words = n())
tidy_books %>%
  semi_join(bingnegative) %>%
  group_by(book, chapter) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("book", "chapter")) %>%
  mutate(ratio = negativewords / words) %>%
  filter(chapter != 0) %>%
  top_n(1) %>%
  ungroup()
## Joining, by = "word"
## Selecting by ratio
## # A tibble: 6 x 5
## book chapter negativewords words ratio
## <fct> <int> <int> <int> <dbl>
## 1 Sense & Sensibility 43 161 3405 0.0473
## 2 Pride & Prejudice 34 111 2104 0.0528
## 3 Mansfield Park 46 173 3685 0.0469
## 4 Emma 15 151 3340 0.0452
## 5 Northanger Abbey 21 149 2982 0.0500
## 6 Persuasion 4 62 1807 0.0343
These are the chapters with the most sad words in each book, normalized for the number of words in the chapter. What is happening in these chapters?
+ In Chapter 43 of Sense and Sensibility, Marianne is seriously ill, near death.
+ In Chapter 34 of Pride and Prejudice, Mr. Darcy proposes for the first time (so badly!).
+ Chapter 46 of Mansfield Park is almost the end, when everyone learns of Henry’s scandalous adultery.
+ Chapter 15 of Emma is when the horrifying Mr. Elton proposes, and in Chapter 21 of Northanger Abbey Catherine is deep in her Gothic faux fantasy of murder.
+ Chapter 4 of Persuasion is when the reader gets the full flashback of Anne refusing Captain Wentworth, how sad she was, and what a terrible mistake she realized it to be.
The source code and contents come from the book Text Mining with R. Enjoy!
The unnest_tokens function
text <-c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text
## [1] "Because I could not stop for Death -"
## [2] "He kindly stopped for me -"
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
text_df <- data_frame(line = 1:4, text = text)
What does it mean that this data frame has printed out as a “tibble”? A tibble is a modern class of data frame within R, available in the dplyr and tibble packages, that has a convenient print method, will not convert strings to factors, and does not use row names. Tibbles are great for use with tidy tools.
The two basic arguments to unnest_tokens used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (text, in this case). Remember that text_df above has a column called text that contains the data of interest.
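The call being described is not shown above; following the book, it would be:
text_df %>%
  unnest_tokens(word, text)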
Having the text data in this format lets us manipulate, process, and visualize the text using the standard set of tidy tools, namely dplyr, tidyr, and ggplot2, as shown in Figure 1.1.
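The preview below shows a data frame original_books that is not constructed in this excerpt; a sketch following the book, assuming janeaustenr and stringr are available:
library(janeaustenr)
library(stringr)
original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup()
head(original_books)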
## # A tibble: 6 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 SENSE AND SENSIBILITY Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 by Jane Austen Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 (1811) Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which as we saw earlier is done with the unnest_tokens() function.
# install.packages("janeaustenr")
library(tidytext)
tidy_books <- original_books %>%
  unnest_tokens(word, text)
tidy_books %>% head()
## # A tibble: 6 x 4
## book linenumber chapter word
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
Often in text analysis, we will want to remove stop words; stop words are words that are not useful for an analysis, typically extremely common words such as “the”, “of”, “to”, and so forth in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join().
data(stop_words)
tidy_books <- tidy_books %>%
  anti_join(stop_words)
## Joining, by = "word"
We can also use dplyr’s count() to find the most common words in all the books as a whole.
tidy_books %>%
  count(word, sort = TRUE)
## # A tibble: 13,914 x 2
## word n
## <chr> <int>
## 1 miss 1855
## 2 time 1337
## 3 fanny 862
## 4 dear 822
## 5 lady 817
## 6 sir 806
## 7 day 797
## 8 emma 787
## 9 sister 727
## 10 house 699
## # ... with 13,904 more rows
library(ggplot2)
tidy_books %>%
  count(word, sort = TRUE) %>%
  filter(n > 600) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(caption = "Figure 1.2 TOP words in Jane Austen’s novels")
Word Frequencies
We’re going to use the gutenbergr package, which provides access to public domain works from the Project Gutenberg collection. We can download works by their Project Gutenberg ID numbers using gutenberg_download(); a sketch of that step follows below.
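The download step is not shown in this excerpt; a sketch following the book, assuming these Project Gutenberg ID numbers (verify them at gutenberg.org):
library(gutenbergr)
# H.G. Wells: The Time Machine, The War of the Worlds, The Invisible Man, The Island of Doctor Moreau
hgwells <- gutenberg_download(c(35, 36, 5230, 159))
tidy_hgwells <- hgwells %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
# Brontë sisters: Jane Eyre, Wuthering Heights, The Tenant of Wildfell Hall, Villette, Agnes Grey
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))
tidy_bronte <- bronte %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)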
Interesting that “time”, “eyes”, and “hand” are in the top 10 for both H.G. Wells and the Brontë sisters.
Let’s calculate the frequency for each word for the works of Jane Austen, the Brontë sisters, and H.G. Wells by binding the data frames together. We can use spread and gather from tidyr to reshape our dataframe so that it is just what we need for plotting and comparing the three sets of novels.
library(tidyr)
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),
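# (sketch) The pipeline is truncated in this excerpt; following the book, the
# remaining steps would be roughly (str_extract() comes from stringr):
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%
  count(author, word) %>%
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(author, proportion) %>%
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)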
We use str_extract() here because the UTF-8 encoded texts from Project Gutenberg have some examples of words with underscores around them to indicate emphasis (like italics). The tokenizer treated these as words, but we don't want to count "_any_" separately from "any", as we saw in our initial data exploration before choosing to use str_extract().
Visualization
Words that are close to the line in these plots have similar frequencies in both sets of texts, for example, in both Austen and Brontë texts (“miss”, “time”, “day” at the upper frequency end) or in both Austen and Wells texts (“time”, “day”, “brother” at the high frequency end).
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
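# (sketch) The call above is cut off; the remaining layers, roughly following the book:
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  facet_wrap(~author, ncol = 2) +
  theme(legend.position = "none") +
  labs(y = "Jane Austen", x = NULL)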
Let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Austen and the Brontë sisters, and between Austen and Wells?
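The correlation test itself is not shown in this excerpt; a sketch following the book, assuming the frequency data frame built above:
# Austen vs. the Brontë sisters
cor.test(data = frequency[frequency$author == "Brontë Sisters", ],
         ~ proportion + `Jane Austen`)
# Austen vs. H.G. Wells
cor.test(data = frequency[frequency$author == "H.G. Wells", ],
         ~ proportion + `Jane Austen`)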
Statistical modeling serves three main purposes. First, to make predictions about an outcome. Second, to run experiments that study relationships between variables. Third, to explore data to identify relationships among variables.
Basic Choices in model architecture
If the response variable is categorical (e.g. yes or no, infected or not), use rpart(). If the response variable is numerical (e.g. an unemployment rate), use lm() or rpart(): when the relationship is gradual or proportional, lm() is the natural choice; when it is dichotomous or discontinuous, rpart() is the better choice. A minimal sketch follows below.
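Here is a minimal sketch of these choices with hypothetical variables (infected, rate, age, sex in a made-up data frame my_data); it illustrates the rule of thumb above rather than any code from the course.
library(rpart)
# my_data, infected, and rate are hypothetical placeholders
# Categorical response: a classification tree
tree_model <- rpart(infected ~ age + sex, data = my_data, method = "class")
# Numerical response that changes gradually / proportionally: a linear model
lin_model <- lm(rate ~ age + sex, data = my_data)
# Numerical response with discontinuous, step-like behaviour: a regression tree
step_model <- rpart(rate ~ age + sex, data = my_data)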
The Prediction Errors?
When you train and test a model, you use data with values for the explanatory variables as well as the response variable.
If the model is good, then when provided with the inputs from the testing data, the outputs from the function will give results “close” to the response variable in the testing data. Here is one question: how do we measure “close”?
The first step is to subtract the function output from the actual response values in the testing data. The result is called the prediction error and there will be one such error for every case in the testing data. You then summarize that set of prediction errors.
The way to summarise the prediction errors is to calculate the mean of the squared prediction errors. Since the errors are squared, none of them are negative, so the sum reflects the magnitude of the errors rather than their sign.
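As a minimal sketch with made-up numbers (not from the course data):
# Hypothetical response values in the testing data and the model's outputs
actual    <- c(80, 95, 102, 110)
predicted <- c(85, 90, 100, 120)
errors <- actual - predicted      # one prediction error per testing case
mean(errors^2)                    # mean square error: squaring removes the sign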
As you can see from the result, adding previous as an explanatory variable changes the model outputs.
Step 2. Prediction Performance
One point: knowing that the models make different predictions doesn't tell you which model is better. In this exercise, you'll compare the models' predictions to the actual values of the response variable. The term prediction error, or just error, is often used rather than difference. So, rather than speaking of the mean square difference, we'll say mean square error.
# Build and evaluate the base model on Runners_100
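The code for this step is not shown in this excerpt; a sketch of what the comparison presumably looked like, assuming Runners_100 contains a net finishing time plus age and sex, and adding a purely random bogus column:
# bogus is a hypothetical, meaningless explanatory variable
Runners_100$bogus <- rnorm(nrow(Runners_100))
base_model  <- lm(net ~ age + sex, data = Runners_100)
bogus_model <- lm(net ~ age + sex + bogus, data = Runners_100)
# In-sample MSE: the expanded model always looks at least as good on the training data
mean(residuals(base_model)^2)
mean(residuals(bogus_model)^2)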
Surprisingly, even though bogus is not a genuine explanatory variable, the MSE is smaller in the expanded model than in the base model. In-sample error rewards extra variables regardless of their merit, and this is what motivates the use of cross validation.
Step 4. Testing and training datasets
The code in the editor uses a style that will give you two prediction error results: one for the training cases and one for the testing cases. Your goal is to see whether there is a systematic difference between prediction accuracy on the training and on the testing cases.
Since the split is being done at random, the results will vary somewhat each time you do the calculation.
# Generate a random TRUE or FALSE for each case in Runners_100
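# (sketch, following the course) split the cases at random into training and testing
Runners_100$training_cases <- rnorm(nrow(Runners_100)) > 0
# Fit the base model on the training cases only
base_model <- lm(net ~ age + sex, data = subset(Runners_100, training_cases))
# The messages below presumably came from installing the package from GitHub, e.g.:
# devtools::install_github("dtkaplan/statisticalModeling")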
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
## Use `force = TRUE` to force installation
library(statisticalModeling)
Preds <- evaluate_model(base_model, data = subset(Runners_100, !training_cases))
# Calculate the MSE on the testing data
with(data = Preds, mean((net - model_output)^2))
## [1] 136.3962
Step 5. Repeating Random Trials
To simplify things, the cv_pred_error() function in the statisticalModeling package will carry out this repetitive process for you. The function will do all the work of creating training and testing sets for each trial and calculating the mean square error for each trial.
The context for this exercise is to see whether the prediction error calculated from the training data is consistently different from the cross-validated prediction error. To that end, you’ll calculate the in-sample error using only the training data. Then, you’ll do the cross validation and use a t-test to see if the in-sample error is statistically different from the cross-validated error.
# The model
model <- lm(net ~ age + sex, data = Runners_100)
# Find the in-sample error (using the training data)
in_sample <- evaluate_model(model, data = Runners_100)
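# (sketch) The objects used by the t-test below are not created in this excerpt;
# following the course, they would be computed roughly like this:
in_sample_error <- with(in_sample, mean((net - model_output)^2, na.rm = TRUE))
trials <- cv_pred_error(model)   # several cross-validation trials, one mse per trial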
# Find a confidence interval on the cross-validation trials and compare it to the in-sample error
mosaic::t.test(~mse, mu = in_sample_error, data = trials)
##
## One Sample t-test
##
## data: mse
## t = 7.0898, df = 4, p-value = 0.00209
## alternative hypothesis: true mean is not equal to 131.5594
## 95 percent confidence interval:
## 137.6805 145.5606
## sample estimates:
## mean of x
## 141.6206
The error based on the training data is below the 95% confidence interval representing the cross-validated prediction error.
Step 6. To add or not to add (an explanatory variable)?
In this exercise, you're going to use cross validation to find out whether adding a new explanatory variable improves the prediction performance of a model. Remember that models are biased to perform well on the training data. Cross validation gives a fair indication of the prediction error on new data.
# The base model
base_model <- lm(net ~ age + sex, data = Runners_100)
# An augmented model adding previous as an explanatory variable
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)
# Run cross validation trials on the two models
trials <- cv_pred_error(base_model, aug_model)
trials
## mse model
## 1 142.0409 base_model
## 2 136.6856 base_model
## 3 143.4253 base_model
## 4 137.9605 base_model
## 5 134.6230 base_model
## 6 143.2968 aug_model
## 7 142.2867 aug_model
## 8 143.6135 aug_model
## 9 140.3348 aug_model
## 10 145.8624 aug_model
# Compare the two sets of cross-validated errors
t.test(mse ~model, data = trials)
##
## Welch Two Sample t-test
##
## data: mse by model
## t = 2.1983, df = 6.1923, p-value = 0.06887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4327813 8.6963517
## sample estimates:
## mean in group aug_model mean in group base_model
## 143.0788 138.9471
Notice that cross validation reveals that the augmented model makes worse predictions (larger prediction error) than the base model. Bigger is not necessarily better when it comes to modeling!
In conclusion, it is not worth adding the previous variable; just use base_model in this case.