How to draw graph for two numeric variables in R

2019. 1. 17. 22:53

Exploring Two or More Variables

Evan Jung January 17, 2019

Intro to Multivariate Analysis

Key Terms

Contingency Tables - A tally of counts between two or more categorical variables
Hexagonal Binning - A plot of two numeric variables with the records binned into hexagons
Contour plots - A plot showing the density of two numeric variables like a topographical map.
Violin plots - Similar to a boxplot but showing the density estimate.
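
As a small illustration of the first key term, a contingency table can be tallied in base R with table(). This is only a sketch: the data frame and its column names are made up for the example and are not part of kc_tax.

```r
# Hypothetical data: property type and whether the home has a garage
homes <- data.frame(
  type   = c("house", "house", "condo", "condo", "house"),
  garage = c("yes", "no", "no", "no", "yes")
)

# Tally the counts for every combination of the two categorical variables
tab <- table(homes$type, homes$garage)
print(tab)
```

Each cell holds the number of records that share that combination of categories.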

Multivariate Analysis depends on the nature of data: numeric versus categorical.

Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)

kc_tax contains the tax-assessed values for residential properties in King County, Washington.

 library(readr)
library(dplyr)
kc_tax <- read_csv("kc_tax.csv")
glimpse(kc_tax)

 ## Observations: 498,249
## Variables: 3
## $ TaxAssessedValue <dbl> NA, 206000, 303000, 361000, 459000, 223000, 2...
## $ SqFtTotLiving    <dbl> 1730, 1870, 1530, 2000, 3150, 1570, 1770, 115...
## $ ZipCode          <dbl> 98117, 98002, 98166, 98108, 98108, 98032, 981...

The problem of scatterplots

Scatterplots are fine when the number of data values is relatively small. With very large data sets, however, the points overlap so heavily that the plot becomes too dense and the relationship is hard to see. We will compare a scatterplot with a hexagon binning plot later.

Hexagon binning plot

This plot visualizes the relationship between finished square feet (SqFtTotLiving) and tax-assessed value (TaxAssessedValue).

 library(ggplot2)
library(gridExtra)

 ## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

 library(hexbin)
kc_tax0 <- kc_tax %>% 
  filter(TaxAssessedValue < 750000, SqFtTotLiving > 100, SqFtTotLiving < 3500)
p1 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  stat_binhex(colour = "white") + 
  theme_bw() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Hexagon Binning Plot")
p2 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  geom_point(colour = "blue") + 
  theme_bw() + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Scatter Plot")
grid.arrange(p1, p2, nrow = 2)

Let’s compare the two plots. Rather than drawing every point, as the scatter plot does, the hexagon binning plot groups the records into hexagonal bins and colors each hexagon by the number of records in that bin. Now we can clearly see the positive relationship between the two variables.
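
The binning idea itself can be sketched in base R. Note the swap: this example uses square bins (via cut() and table()) instead of hexagons, and simulated data instead of kc_tax, so it only illustrates the counting step, not what stat_binhex does internally.

```r
set.seed(42)
n <- 10000

# Simulated stand-ins for SqFtTotLiving and TaxAssessedValue (assumed values)
sqft  <- runif(n, min = 500, max = 3500)
value <- 100 * sqft + rnorm(n, sd = 50000)

# Cut each axis into 10 intervals and count the records in every cell
x_bin <- cut(sqft, breaks = 10)
y_bin <- cut(value, breaks = 10)
bin_counts <- table(x_bin, y_bin)

# Every record falls into exactly one bin
sum(bin_counts)
```

A binned plot then maps each count to a color, which is why it stays readable even when the raw points overlap.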

Density2d

The geom_density2d function overlays contours on a scatterplot to visualize the relationship between two variables. The contours are essentially a topographical map of the two variables. Each contour band represents a specific density of points, increasing as one nears a “peak”.

 ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  theme_bw() + 
  geom_point(alpha = .1) + 
  geom_density2d(colour = "white") + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Density 2D")
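
According to the ggplot2 documentation, geom_density2d computes a two-dimensional kernel density estimate with MASS::kde2d(). A rough sketch of the same idea outside ggplot2 follows; it assumes simulated data rather than kc_tax, and the grid size of 50 is an arbitrary choice.

```r
library(MASS)  # recommended package shipped with standard R installations

set.seed(1)
# Simulated stand-ins for the two numeric variables (assumed values)
sqft  <- rnorm(2000, mean = 2000, sd = 500)
value <- 150 * sqft + rnorm(2000, sd = 60000)

# Estimate the 2D density on a 50 x 50 grid
dens <- kde2d(sqft, value, n = 50)

# Each contour line joins grid points with the same estimated density
contour(dens)
title(xlab = "Finished Square Feet", ylab = "Tax Assessed Value")
```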

Conclusion

These plots are closely related to correlation analysis. So, when we plot two variables against each other, we have to think ahead about which two variables are likely to be related.

All content comes from the book Practical Statistics for Data Scientists.


Correlation in R

2019. 1. 9. 20:41

Correlation

Evan Jung January 09, 2019

1. Intro

Case 1: high values of X go with high values of Y; X and Y are positively correlated.
Case 2: low values of X go with low values of Y; X and Y are positively correlated.
Case 3: high values of X go with low values of Y, and vice versa; the variables are negatively correlated.

2. Key Terms

Correlation Coefficient is a metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1).
+1 means perfect positive correlation.
0 indicates no correlation.
-1 means perfect negative correlation.

To compute Pearson’s correlation coefficient, we multiply the deviations from the mean for variable 1 by those for variable 2, sum the products, and divide by N - 1 times the product of the standard deviations:

$r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)s_{x}s_{y}}$
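
The formula can be checked against R's built-in cor() on a few numbers. This is a minimal sketch; the vectors x and y are invented for the example.

```r
# Small made-up sample
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
n <- length(x)

# Pearson's r written out term by term, following the formula above
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in implementation (Pearson is the default method)
r_builtin <- cor(x, y)

all.equal(r_manual, r_builtin)
```

The two values agree because sd() uses the same N - 1 denominator that appears in the formula.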

Correlation Matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between variables.
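
For a quick look at what a correlation matrix contains, here is a small sketch on the built-in mtcars data set (chosen only because it needs no external files; it is not the S&P 500 data used in this post).

```r
# Three numeric columns from the built-in mtcars data
vars <- mtcars[, c("mpg", "hp", "wt")]

# Pairwise Pearson correlations between all columns
cor_mat <- cor(vars)
round(cor_mat, 2)
```

The diagonal is always 1 (each variable against itself), and the matrix is symmetric, so the upper and lower triangles carry the same information.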

 # data import
sp500_px <- read.csv("data/sp500_px.csv", stringsAsFactors = F)
sp500_sym <- read.csv("data/sp500_sym.csv", stringsAsFactors = F)
# date conversion
sp500_px$X <- as.Date(sp500_px$X)
# data rows and cols extraction
etfs <- sp500_px[sp500_px$X > "2012-07-01", 
                 sp500_sym[sp500_sym$sector == "etf", "symbol"]]
# install.packages("corrplot")
library(corrplot)
corrplot(cor(etfs), method = "ellipse")

(Interpreting this plot is left to you!)

The orientation of the ellipse indicates whether two variables are positively or negatively correlated.
The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipses correspond to stronger relationships.

2.1. Other Correlation Estimates

Statisticians long ago proposed other correlation coefficients, such as Spearman’s rho and Kendall’s tau. These are based on the ranks of the data rather than on the values themselves. Because they work with ranks, these estimates are robust to outliers and can handle certain types of nonlinearities.

However, data scientists can generally stick to Pearson’s correlation coefficient, and its robust alternatives, for exploratory analysis. The appeal of rank-based estimates is mostly for smaller data sets and specific hypothesis tests.
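
To see how the rank-based estimates handle a nonlinearity, consider a relationship that is perfectly monotonic but not linear. The data below are made up for the sketch; cor() and its method argument are base R.

```r
x <- 1:10
y <- x^3  # strictly increasing, but far from a straight line

pearson  <- cor(x, y)                       # below 1: the relationship is not linear
spearman <- cor(x, y, method = "spearman")  # 1: the ranks agree perfectly
kendall  <- cor(x, y, method = "kendall")   # 1: every pair is ordered the same way

# Spearman's rho is Pearson's r computed on the ranks
rank_pearson <- cor(rank(x), rank(y))

c(pearson = pearson, spearman = spearman, kendall = kendall)
```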

Scatterplot: A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

 telecom <- sp500_px[, c("T", "VZ")]
plot(telecom$T, telecom$VZ, xlab = "T", ylab = "VZ", main = "The Correlation between T(ATT) and VZ(Verizon)")

The returns have a strong positive relationship: on most days, both stocks go up or go down in tandem. There are very few days where one stock goes down significantly while the other stock goes up (and vice versa).

3. Key Ideas for Correlation

The correlation coefficient measures the extent to which two variables are associated with one another.
When high values of v1 go with high values of v2, v1 and v2 are positively associated.
When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.
The correlation coefficient is a standardized metric so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
A correlation coefficient of 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.
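
The last point is easy to check by simulation. In this sketch, the sample size of 20 and the 1,000 repetitions are arbitrary choices; the two variables are generated independently, so any nonzero correlation is due to chance alone.

```r
set.seed(123)

# Correlation between two independent random samples, repeated 1,000 times
r <- replicate(1000, cor(rnorm(20), rnorm(20)))

range(r)  # both negative and positive values appear
mean(r)   # yet the average is close to 0
```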

4. Further Reading

Statistics, 4th ed., by David Freedman, Robert Pisani, and Roger Purves (W.W. Norton, 2007), has an excellent discussion of correlation.


[R] Beginner Graph for Communication - Label

2019. 1. 3. 11:07

[Beginner] Graph for Communication - Label

Evan Jung 1/3/2019

1. Introduction

In EDA (Exploratory Data Analysis), there are many ways to build plots as tools for exploration. Most of the time, though, you make a plot for a specific reason and audience, such as a client. To help clients understand your data and your graph, you need to know how to communicate your thoughts and findings to others.

We will review R for Data Science, written by Garrett Grolemund & Hadley Wickham.

This short article is aimed at R users who already know how to draw graphs with the tidyverse package. If you are new to R, click the link above and read chapter 3 carefully.

We are on chapter 28, though.

2. Label

The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. You add labels with the labs() function.

(1) labs() function

 # install.packages("tidyverse", dependencies = TRUE)
library(tidyverse)
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(aes(color=class)) + 
  geom_smooth(se = FALSE) + 
  labs(title = "Fuel efficiency generally decreases with engine size")

What is a title? In general, the title gives the big picture of the graph by summarizing the main finding. A bare description, e.g. “A scatterplot of displ and hwy”, does not make a good title because it contains no finding from the analyst.

(2) Subtitle & Caption

If you want to add more text to give clients more information, consider a subtitle and a caption.

 ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE) +
  labs(
    title = "Fuel efficiency generally decreases with engine size",
    subtitle = "Two seaters (sports cars) are an exception because of their light weight",
    caption = "Data from fueleconomy.gov"
  )

Look!

Subtitle adds additional detail in a smaller font beneath the title
caption adds text at the bottom right of the plot, often used to describe the source of the data

(3) x and y title replacement

Look at graphs 1 and 2. The axis titles are hard to read: clients can’t understand abbreviations such as displ and hwy, so you want to spell them out in full. On the other hand, when you write R code, short variable names are easier to type, and renaming the columns would make the code more cumbersome.

The labs() function serves both needs: the code keeps the short variable names while the plot shows readable labels.

 ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  geom_smooth(se = FALSE) +
  labs(
    x = "Engine displacement (L)",
    y = "Highway fuel economy (mpg)",
    colour = "Car type"
  )

(4) Formula on X and Y

This is optional. If you work with math, physics, or similar fields, you may want to put formulas on the x-axis and y-axis. Here is how.

 df <- tibble(
  x = runif(10),
  y = runif(10)
)
ggplot(df, aes(x, y)) +
  geom_point() +
  labs(
    x = quote(sum(x[i] ^ 2, i == 1, n)),
    y = quote(alpha + beta + frac(delta, theta))
  )

It looks difficult, but the R documentation can help. Type ?plotmath in a code chunk, as below.

?plotmath

Here are a few plotmath expressions:

Syntax	Meaning
x + y	x plus y
x - y	x minus y
sqrt(x)	square root of x
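
The same plotmath syntax also works in base graphics, which makes it easy to try an expression from the table without ggplot2. A minimal sketch, with made-up data:

```r
x <- seq(0, 10, by = 0.5)

# expression() marks the label as a plotmath formula instead of plain text
x_label <- expression(x)
y_label <- expression(sqrt(x))

# The y-axis label is drawn with a real square-root sign
plot(x, sqrt(x), type = "l", xlab = x_label, ylab = y_label)
```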


Sentiment Analysis - Shakespeare gets Sentimental

2018. 12. 21. 22:37

The source code and content come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

Intro

The next real-world text exploration uses tragedies and comedies by Shakespeare to show how sentiment analysis can lead to insight into differences in word use. You will learn how to transform raw text into a tidy format for further analysis.

1. To be, or not to be

The shakespeare dataset contains three columns:
- title, the title of a Shakespearean play,
- type, the type of play, either tragedy or comedy, and
- text, a line from that play. This data frame contains the entire texts of six plays.

 library(tidytext)
library(dplyr)

 ## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

 load("data/shakespeare.rda")
glimpse(shakespeare)

 ## Observations: 25,888
## Variables: 3
## $ title <chr> "The Tragedy of Romeo and Juliet", "The Tragedy of Romeo...
## $ type  <chr> "Tragedy", "Tragedy", "Tragedy", "Tragedy", "Tragedy", "...
## $ text  <chr> "The Complete Works of William Shakespeare", "", "The Tr...

 # Use count to find out how many titles/types there are
shakespeare %>% 
  count(title, type)

 ## # A tibble: 6 x 3
##   title                           type        n
##   <chr>                           <chr>   <int>
## 1 A Midsummer Night's Dream       Comedy   3459
## 2 Hamlet, Prince of Denmark       Tragedy  6776
## 3 Much Ado about Nothing          Comedy   3799
## 4 The Merchant of Venice          Comedy   4225
## 5 The Tragedy of Macbeth          Tragedy  3188
## 6 The Tragedy of Romeo and Juliet Tragedy  4441

2. Unnesting from text to word

The shakespeare dataset is not yet compatible with tidy tools. You first need to break the text into individual tokens (the process of tokenization); a token is a meaningful unit of text for analysis, in many cases just synonymous with a single word. You also need to transform the text to a tidy data structure with one token per row. You can use tidytext’s unnest_tokens() function to accomplish all of this at once.
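
To make the "one token per row" idea concrete, here is a rough base R approximation on two lines of text. It is a simplified stand-in, not how unnest_tokens() is implemented: the splitting rule (lowercase, then split on anything that is not a letter or apostrophe) is an assumption made for the sketch, and the real function treats punctuation differently.

```r
lines <- c("To be, or not to be, that is the question",
           "Whether 'tis nobler in the mind to suffer")

text_df <- data.frame(linenumber = seq_along(lines),
                      text = lines,
                      stringsAsFactors = FALSE)

# Lowercase each line and split it into word tokens
tokens <- strsplit(tolower(text_df$text), "[^a-z']+")

# One row per token, keeping track of the line it came from
tidy_df <- data.frame(
  linenumber = rep(text_df$linenumber, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
tidy_df <- tidy_df[tidy_df$word != "", ]

head(tidy_df)
```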

 library(tidytext)
 
tidy_shakespeare <- shakespeare %>%
  group_by(title) %>%
  mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% # Transform the non-tidy text data to tidy text data
  ungroup()
 
tidy_shakespeare %>% 
  count(word, sort = TRUE)

 ## # A tibble: 10,736 x 2
##    word      n
##    <chr> <int>
##  1 the    4651
##  2 and    4170
##  3 i      3296
##  4 to     3047
##  5 of     2645
##  6 a      2511
##  7 you    2287
##  8 my     1913
##  9 in     1836
## 10 that   1721
## # ... with 10,726 more rows

Notice how the most common words in the data frame are words like “the”, “and”, and “i” that have no sentiments associated with them. In the next exercise, you’ll join the data with a lexicon to implement sentiment analysis.

3. Sentiment analysis of Shakespeare

After transforming the text of these Shakespearean plays to a tidy text dataset in the last exercise, the resulting data frame tidy_shakespeare is ready for lexicon-based sentiment analysis. Once you have performed the sentiment analysis, you can find out how many negative and positive words each play has with just one line of code.

 shakespeare_sentiment <- tidy_shakespeare %>%
  inner_join(get_sentiments("bing"))  # Implement sentiment analysis with the "bing" lexicon

## Joining, by = "word"

 shakespeare_sentiment %>%
  count(title, sentiment) # Find how many positive/negative words each play has

 ## # A tibble: 12 x 3
##    title                           sentiment     n
##    <chr>                           <chr>     <int>
##  1 A Midsummer Night's Dream       negative    681
##  2 A Midsummer Night's Dream       positive    773
##  3 Hamlet, Prince of Denmark       negative   1323
##  4 Hamlet, Prince of Denmark       positive   1223
##  5 Much Ado about Nothing          negative    767
##  6 Much Ado about Nothing          positive   1127
##  7 The Merchant of Venice          negative    740
##  8 The Merchant of Venice          positive    962
##  9 The Tragedy of Macbeth          negative    914
## 10 The Tragedy of Macbeth          positive    749
## 11 The Tragedy of Romeo and Juliet negative   1235
## 12 The Tragedy of Romeo and Juliet positive   1090

Passing two variables to count() returns the count n for each unique combination of the two variables. In this case, you have 6 plays and 2 sentiments, so count() returns 6 x 2 = 12 rows.
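
What the inner join and the two-variable count do can be seen on a toy example in base R. The mini lexicon and the play names below are invented for the sketch; merge() is used here as a base R counterpart of inner_join().

```r
# Hypothetical tidy text: one word per row
words <- data.frame(
  title = c("Play A", "Play A", "Play A", "Play B", "Play B"),
  word  = c("love", "death", "the", "sweet", "fear")
)

# Hypothetical mini lexicon
lexicon <- data.frame(
  word      = c("love", "sweet", "death", "fear"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Inner join: words that are not in the lexicon (here "the") are dropped
joined <- merge(words, lexicon, by = "word")

# Count every title / sentiment combination: 2 plays x 2 sentiments = 4 rows
counts <- as.data.frame(table(title = joined$title, sentiment = joined$sentiment))
counts
```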

4. Tragedy or comedy?

Which plays have a higher percentage of negative words? Do the tragedies have more negative words than the comedies?

 sentiment_counts <- tidy_shakespeare %>%
    inner_join(get_sentiments("bing")) %>% 
    # Count the number of words by title, type, and sentiment
    count(title, type, sentiment)

## Joining, by = "word"

 sentiment_counts %>%
    group_by(title) %>% # Group by the titles of the plays
    mutate(total = sum(n), # Find the total number of words in each play
           percent = n / total) %>% # Calculate the number of words divided by the total
    filter(sentiment == "negative") %>% # Filter the results for only negative sentiment
    arrange(percent)

 ## # A tibble: 6 x 6
## # Groups:   title [6]
##   title                           type    sentiment     n total percent
##   <chr>                           <chr>   <chr>     <int> <int>   <dbl>
## 1 Much Ado about Nothing          Comedy  negative    767  1894   0.405
## 2 The Merchant of Venice          Comedy  negative    740  1702   0.435
## 3 A Midsummer Night's Dream       Comedy  negative    681  1454   0.468
## 4 Hamlet, Prince of Denmark       Tragedy negative   1323  2546   0.520
## 5 The Tragedy of Romeo and Juliet Tragedy negative   1235  2325   0.531
## 6 The Tragedy of Macbeth          Tragedy negative    914  1663   0.550

Looking at the percent column of your output, you can see that tragedies do in fact have a higher percentage of negative words!
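
The percent column can be verified by hand from the counts printed earlier. This sketch recomputes it in base R for two of the plays, using the numbers from the output above.

```r
# Counts copied from the count(title, sentiment) output
counts <- data.frame(
  title = c("The Tragedy of Macbeth", "The Tragedy of Macbeth",
            "Much Ado about Nothing", "Much Ado about Nothing"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(914, 749, 767, 1127)
)

# Total sentiment words per play, repeated on every row of that play
counts$total <- ave(counts$n, counts$title, FUN = sum)
counts$percent <- counts$n / counts$total

negative <- counts[counts$sentiment == "negative", ]
round(negative$percent, 3)
```

Note that total counts only the words matched by the lexicon, not every word in the play.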

5. Most common positive and negative words

Now you can explore which specific words are driving these sentiment scores. Which are the most common positive and negative words in these plays?

 word_counts <- tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment)

## Joining, by = "word"

 top_words <- word_counts %>%
  group_by(sentiment) %>% # Group by sentiment
  top_n(10) %>%  # Take the top 10 for each sentiment
  ungroup() %>%  # Make word a factor in order of n
  mutate(word = reorder(word, n))

## Selecting by n

 # Use aes() to put words on the x-axis and n on the y-axis
library(ggplot2)
ggplot(top_words, aes(x = word, y = n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free") +  
  coord_flip()

Death is pretty negative and love is positive, but are there words in that list that had different connotations during Shakespeare’s time? Do you see a word that the lexicon has misidentified?

The word "wilt" was used differently in Shakespeare’s time and was not negative; the lexicon has misidentified it. For example, from Romeo and Juliet, "For thou wilt lie upon the wings of night". It is important to explore the details of how words were scored when performing sentiment analyses.

6. Word contributions by play

You will also practice using a different sentiment lexicon, the “afinn” lexicon in which words have a score from -5 to 5. Different lexicons take different approaches to quantifying the emotion/opinion content of words.
Which words contribute to the overall sentiment in which plays?

 tidy_shakespeare %>%
  count(title, word, sort = TRUE) %>% # Count by title and word
  inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
  filter(title == "The Tragedy of Macbeth", score < 0) # Filter to only examine the scores for Macbeth that are negative

 ## Joining, by = "word"
## # A tibble: 237 x 4
##    title                  word        n score
##    <chr>                  <chr>   <int> <int>
##  1 The Tragedy of Macbeth no         73    -1
##  2 The Tragedy of Macbeth fear       35    -2
##  3 The Tragedy of Macbeth death      20    -2
##  4 The Tragedy of Macbeth bloody     16    -3
##  5 The Tragedy of Macbeth poor       16    -2
##  6 The Tragedy of Macbeth strange    16    -1
##  7 The Tragedy of Macbeth dead       14    -3
##  8 The Tragedy of Macbeth leave      14    -1
##  9 The Tragedy of Macbeth fight      13    -1
## 10 The Tragedy of Macbeth charges    11    -2
## # ... with 227 more rows

Notice the use of words specific to Macbeth like “bloody”.

7. Calculating a contribution score

You can calculate a relative contribution for each word in each play. This contribution is the score for each word multiplied by the number of times it is used in the play, divided by the total number of scored words in that play (sum(n) is computed after the join, so only words found in the “afinn” lexicon are counted).

 sentiment_contributions <- tidy_shakespeare %>%
  count(title, word, sort = TRUE) %>%  # Count by title and word
  inner_join(get_sentiments("afinn")) %>% # Implement sentiment analysis using the "afinn" lexicon
  group_by(title) %>%  # Group by title
  mutate(contribution = (n * score) / sum(n)) %>% # Calculate a contribution for each word in each title
  ungroup()

## Joining, by = "word"

sentiment_contributions

 ## # A tibble: 2,366 x 5
##    title                           word      n score contribution
##    <chr>                           <chr> <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark       no      143    -1      -0.0652
##  2 The Tragedy of Romeo and Juliet love    140     3       0.213 
##  3 Much Ado about Nothing          no      132    -1      -0.0768
##  4 Much Ado about Nothing          hero    114     2       0.133 
##  5 A Midsummer Night's Dream       love    110     3       0.270 
##  6 Hamlet, Prince of Denmark       good    109     3       0.149 
##  7 The Tragedy of Romeo and Juliet no      102    -1      -0.0518
##  8 Much Ado about Nothing          good     93     3       0.162 
##  9 The Merchant of Venice          no       92    -1      -0.0630
## 10 Much Ado about Nothing          love     91     3       0.159 
## # ... with 2,356 more rows

Notice that “hero” shows up in your results there; that is the name of one of the characters in “Much Ado About Nothing”.
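
The contribution formula is simple enough to work through on a toy table. The counts and scores below are invented for the sketch; ave() plays the role of group_by() plus sum() in base R.

```r
# Hypothetical word counts that already carry a lexicon score
counts <- data.frame(
  title = c("Play A", "Play A", "Play A", "Play B", "Play B"),
  word  = c("love", "death", "good", "fear", "sweet"),
  n     = c(10, 5, 5, 6, 4),
  score = c(3, -2, 3, -2, 2)
)

# Total number of scored words per play (20 for Play A, 10 for Play B)
total_by_title <- ave(counts$n, counts$title, FUN = sum)

# contribution = (times used * score) / total scored words in the play
counts$contribution <- counts$n * counts$score / total_by_title
counts
```

For example, "love" in Play A contributes 10 * 3 / 20 = 1.5, while "death" contributes 5 * (-2) / 20 = -0.5.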

8. Alas, poor Yorick!

It’s time to explore some of your results! Look at Hamlet and The Merchant of Venice to see what negative and positive words are important in these two plays.
Arrange the most negative words

 sentiment_contributions %>%
  # Filter for Hamlet
  filter(title == "Hamlet, Prince of Denmark") %>%
  # Arrange to see the most negative words
  arrange(contribution)

 ## # A tibble: 493 x 5
##    title                     word        n score contribution
##    <chr>                     <chr>   <int> <int>        <dbl>
##  1 Hamlet, Prince of Denmark no        143    -1      -0.0652
##  2 Hamlet, Prince of Denmark dead       33    -3      -0.0451
##  3 Hamlet, Prince of Denmark death      38    -2      -0.0347
##  4 Hamlet, Prince of Denmark madness    22    -3      -0.0301
##  5 Hamlet, Prince of Denmark mad        21    -3      -0.0287
##  6 Hamlet, Prince of Denmark fear       21    -2      -0.0192
##  7 Hamlet, Prince of Denmark poor       20    -2      -0.0182
##  8 Hamlet, Prince of Denmark hell       10    -4      -0.0182
##  9 Hamlet, Prince of Denmark grave      17    -2      -0.0155
## 10 Hamlet, Prince of Denmark ghost      32    -1      -0.0146
## # ... with 483 more rows

Arrange the most positive words

 sentiment_contributions %>%
  # Filter for The Merchant of Venice
  filter(title == "The Merchant of Venice") %>%
  # Arrange to see the most positive words
  arrange(desc(contribution))

 ## # A tibble: 344 x 5
##    title                  word        n score contribution
##    <chr>                  <chr>   <int> <int>        <dbl>
##  1 The Merchant of Venice good       63     3       0.129 
##  2 The Merchant of Venice love       60     3       0.123 
##  3 The Merchant of Venice fair       35     2       0.0479
##  4 The Merchant of Venice like       34     2       0.0466
##  5 The Merchant of Venice true       24     2       0.0329
##  6 The Merchant of Venice sweet      23     2       0.0315
##  7 The Merchant of Venice pray       42     1       0.0288
##  8 The Merchant of Venice better     21     2       0.0288
##  9 The Merchant of Venice justice    17     2       0.0233
## 10 The Merchant of Venice welcome    17     2       0.0233
## # ... with 334 more rows

These are definitely characteristic words for these two plays.

9. Sentiment changes through a play

We will start by first implementing sentiment analysis using inner_join(), and then use count() with four arguments:
- title,
- type,
- an index that will section together lines of the play, and
- sentiment.
After these lines of code, you will have the number of positive and negative words used in each indexed section of the play. These sections will be 70 lines long in your analysis here. You want a chunk of text that is not too small (because then the sentiment changes will be very noisy) and not too big (because then you will not be able to see plot structure). In an analysis of this type you may need to experiment with what size chunks to make; sections of 70 lines work well for these plays.

 tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>% # Implement sentiment analysis using "bing" lexicon
  count(title, 
        type, 
        index = linenumber %/% 70, 
        sentiment)

 ## Joining, by = "word"
## # A tibble: 744 x 5
##    title                     type   index sentiment     n
##    <chr>                     <chr>  <dbl> <chr>     <int>
##  1 A Midsummer Night's Dream Comedy     0 negative      4
##  2 A Midsummer Night's Dream Comedy     0 positive     11
##  3 A Midsummer Night's Dream Comedy     1 negative      7
##  4 A Midsummer Night's Dream Comedy     1 positive     19
##  5 A Midsummer Night's Dream Comedy     2 negative     20
##  6 A Midsummer Night's Dream Comedy     2 positive     23
##  7 A Midsummer Night's Dream Comedy     3 negative     12
##  8 A Midsummer Night's Dream Comedy     3 positive     18
##  9 A Midsummer Night's Dream Comedy     4 negative      9
## 10 A Midsummer Night's Dream Comedy     4 positive     27
## # ... with 734 more rows

This is the first step in looking at narrative arcs.
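
The index in the code above comes from integer division: %/% drops the remainder, so consecutive line numbers collapse into the same section number. A quick sketch with a few sample line numbers:

```r
linenumber <- c(1, 69, 70, 139, 140)

# Integer division by 70: lines 1-69 get index 0, 70-139 get 1, and so on
index <- linenumber %/% 70
index
```

Because line numbers start at 1, the first section (index 0) holds 69 lines and every later section holds 70.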

10. Calculating net sentiment

The next steps involve spread() from the tidyr package. After these lines of code, you will have the net sentiment in each index-ed section of the play; net sentiment is the negative sentiment subtracted from the positive sentiment.

 # Load the tidyr package
library(tidyr)
 
tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  spread(sentiment, n, fill = 0) %>% # Spread sentiment and n across multiple columns
  mutate(sentiment = positive - negative) # Use mutate to find net sentiment

 ## Joining, by = "word"
## # A tibble: 373 x 6
##    title                     type   index negative positive sentiment
##    <chr>                     <chr>  <dbl>    <dbl>    <dbl>     <dbl>
##  1 A Midsummer Night's Dream Comedy     0        4       11         7
##  2 A Midsummer Night's Dream Comedy     1        7       19        12
##  3 A Midsummer Night's Dream Comedy     2       20       23         3
##  4 A Midsummer Night's Dream Comedy     3       12       18         6
##  5 A Midsummer Night's Dream Comedy     4        9       27        18
##  6 A Midsummer Night's Dream Comedy     5       11       21        10
##  7 A Midsummer Night's Dream Comedy     6       12       16         4
##  8 A Midsummer Night's Dream Comedy     7        9        6        -3
##  9 A Midsummer Night's Dream Comedy     8        6       12         6
## 10 A Midsummer Night's Dream Comedy     9       19       12        -7
## # ... with 363 more rows

You are closer to plotting the sentiment through these plays.
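
The reshaping step can be mimicked in base R to see what happens to a section that has no negative words. This is a sketch, not the tidyr implementation: xtabs() stands in for spread(), the first four rows reuse counts from the output above, and the fifth row is made up.

```r
# Long format: one row per section and sentiment
long <- data.frame(
  index     = c(0, 0, 1, 1, 2),
  sentiment = c("negative", "positive", "negative", "positive", "positive"),
  n         = c(4, 11, 7, 19, 5)
)

# Wide format: one row per section, one column per sentiment.
# Missing combinations (section 2 has no negative words) become 0,
# which is what fill = 0 does in spread().
wide <- xtabs(n ~ index + sentiment, data = long)

# Net sentiment: positive minus negative
net <- wide[, "positive"] - wide[, "negative"]
net
```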

11. Visualizing narrative arcs

You will continue to build on your manipulations of this text dataset and visualize the results of this sentiment analysis.

 library(tidyr)
library(ggplot2)
 
tidy_shakespeare %>%
  inner_join(get_sentiments("bing")) %>%
  count(title, type, index = linenumber %/% 70, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative) %>%
  ggplot(aes(x = index, # Put index on x-axis
             y = sentiment, # Put sentiment on y-axis
             fill = type)) +  # map comedy/tragedy to fill
  geom_col() + # Make a bar chart with geom_col()
  facet_wrap(~ title, scales = "free_x") # Separate panels for each title with facet_wrap()

## Joining, by = "word"

These plots show how sentiment changes through these plays. Notice how the comedies have happier endings and more positive sentiment than the tragedies.


	library(readr)
	library(dplyr)
	kc_tax <- read_csv("kc_tax.csv")
	glimpse(kc_tax)

	## Observations: 498,249
	## Variables: 3
	## $ TaxAssessedValue <dbl> NA, 206000, 303000, 361000, 459000, 223000, 2...
	## $ SqFtTotLiving <dbl> 1730, 1870, 1530, 2000, 3150, 1570, 1770, 115...
	## $ ZipCode <dbl> 98117, 98002, 98166, 98108, 98108, 98032, 981...

	##
	## Attaching package: 'gridExtra'

	## The following object is masked from 'package:dplyr':
	##
	## combine

	library(hexbin)

	kc_tax0 <- kc_tax %>%
	filter(TaxAssessedValue < 750000, SqFtTotLiving > 100, SqFtTotLiving < 3500)

	p1 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
	stat_binhex(colour = "white") +
	theme_bw() +
	scale_fill_gradient(low = "white", high = "blue") +
	labs(x = "Finished Square Feet",
	y = "Tax Assessed Value",
	title = "Hexagon Binning Plot")

	p2 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
	geom_point(colour = "blue") +
	theme_bw() +
	labs(x = "Finished Square Feet",
	y = "Tax Assessed Value",
	title = "Scatter Plot")
	grid.arrange(p1, p2, nrow = 2)

	ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
	theme_bw() +
	geom_point(alpha = .1) +
	geom_density2d(colour = "white") +
	labs(x = "Finished Square Feet",
	y = "Tax Assessed Value",
	title = "Density 2D")

	# data import
	sp500_px <- read.csv("data/sp500_px.csv", stringsAsFactors = F)
	sp500_sym <- read.csv("data/sp500_sym.csv", stringsAsFactors = F)

	# date conversion
	sp500_px$X <- as.Date(sp500_px$X)

	# data rows and cols extraction
	etfs <- sp500_px[sp500_px$X > "2012-07-01",
	sp500_sym[sp500_sym$sector == "etf", "symbol"]]

	# install.packages("corrplot")
	library(corrplot)
	corrplot(cor(etfs), method = "ellipse")

	telecom <- sp500_px[, c("T", "VZ")]
	plot(telecom$T, telecom$VZ, xlab = "T", ylab = "VZ", main = "The Correlation betwen T(ATT) and VZ(Verizon)")

	# install.packages("tidyverse", dependencies = TRUE)
	library(tidyverse)
	ggplot(mpg, aes(displ, hwy)) +
	geom_point(aes(color=class)) +
	geom_smooth(se = FALSE) +
	labs(title = "Fuel efficiency generally decreases with engine size")

	ggplot(mpg, aes(displ, hwy)) +
	geom_point(aes(color = class)) +
	geom_smooth(se = FALSE) +
	labs(
	title = "Fuel efficiency generally decreases with engine size",
	subtitle = "Two seaters (sports cars) are an exception because of their light weight",
	caption = "Data from fueleconomy.gov"
	)

	ggplot(mpg, aes(displ, hwy)) +
	geom_point(aes(colour = class)) +
	geom_smooth(se = FALSE) +
	labs(
	x = "Engine displacement (L)",
	y = "Highway fuel economy (mpg)",
	colour = "Car type"
	)

	df <- tibble(
	x = runif(10),
	y = runif(10)
	)
	ggplot(df, aes(x, y)) +
	geom_point() +
	labs(
	x = quote(sum(x[i] ^ 2, i == 1, n)),
	y = quote(alpha + beta + frac(delta, theta))
	)

	##
	## Attaching package: 'dplyr'

	## The following objects are masked from 'package:stats':
	##
	## filter, lag

	## The following objects are masked from 'package:base':
	##
	## intersect, setdiff, setequal, union

	## Observations: 25,888
	## Variables: 3
	## $ title <chr> "The Tragedy of Romeo and Juliet", "The Tragedy of Romeo...
	## $ type <chr> "Tragedy", "Tragedy", "Tragedy", "Tragedy", "Tragedy", "...
	## $ text <chr> "The Complete Works of William Shakespeare", "", "The Tr...

	# Use count to find out how many titles/types there are
	shakespeare %>%
	count(title, type)

	## # A tibble: 6 x 3
	## title type n
	## <chr> <chr> <int>
	## 1 A Midsummer Night's Dream Comedy 3459
	## 2 Hamlet, Prince of Denmark Tragedy 6776
	## 3 Much Ado about Nothing Comedy 3799
	## 4 The Merchant of Venice Comedy 4225
	## 5 The Tragedy of Macbeth Tragedy 3188
	## 6 The Tragedy of Romeo and Juliet Tragedy 4441

	library(tidytext)

	tidy_shakespeare <- shakespeare %>%
	group_by(title) %>%
	mutate(linenumber = row_number()) %>%
	unnest_tokens(word, text) %>% # Transform the non-tidy text data to tidy text data
	ungroup()

	tidy_shakespeare %>%
	count(word, sort = TRUE)

	## # A tibble: 10,736 x 2
	## word n
	## <chr> <int>
	## 1 the 4651
	## 2 and 4170
	## 3 i 3296
	## 4 to 3047
	## 5 of 2645
	## 6 a 2511
	## 7 you 2287
	## 8 my 1913
	## 9 in 1836
	## 10 that 1721
	## # ... with 10,726 more rows
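To see what `unnest_tokens()` does to a single line, here is a small sketch on a hypothetical two-line tibble (the `toy` object and its contents are made up for illustration and are not part of the `shakespeare` data set).

```r
library(tibble)
library(tidytext)

# Hypothetical two-line input, not taken from the shakespeare data set
toy <- tibble(
  linenumber = 1:2,
  text = c("To be, or not to be,", "That is the question:")
)

# unnest_tokens() puts one word on each row, lowercases it,
# and strips punctuation by default; the original text column is dropped
toy_tidy <- unnest_tokens(toy, word, text)
toy_tidy
```

Each word keeps the `linenumber` of the line it came from, which is why the line number is added before tokenizing in the code above.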

	shakespeare_sentiment <- tidy_shakespeare %>%
	inner_join(get_sentiments("bing")) # Implement sentiment analysis with the "bing" lexicon

	shakespeare_sentiment %>%
	count(title, sentiment) # Find how many positive/negative words each play has

	## # A tibble: 12 x 3
	## title sentiment n
	## <chr> <chr> <int>
	## 1 A Midsummer Night's Dream negative 681
	## 2 A Midsummer Night's Dream positive 773
	## 3 Hamlet, Prince of Denmark negative 1323
	## 4 Hamlet, Prince of Denmark positive 1223
	## 5 Much Ado about Nothing negative 767
	## 6 Much Ado about Nothing positive 1127
	## 7 The Merchant of Venice negative 740
	## 8 The Merchant of Venice positive 962
	## 9 The Tragedy of Macbeth negative 914
	## 10 The Tragedy of Macbeth positive 749
	## 11 The Tragedy of Romeo and Juliet negative 1235
	## 12 The Tragedy of Romeo and Juliet positive 1090

The source code and content come from the DataCamp e-learning course Sentiment Analysis in R: The Tidy Way. Enjoy!

	sentiment_counts <- tidy_shakespeare %>%
	inner_join(get_sentiments("bing")) %>%
	# Count the number of words by title, type, and sentiment
	count(title, type, sentiment)

	sentiment_counts %>%
	  group_by(title) %>%                     # Group by the titles of the plays
	  mutate(total = sum(n),                  # Total number of sentiment words matched in each play
	         percent = n / total) %>%         # Share of each sentiment among those words
	  filter(sentiment == "negative") %>%     # Keep only the negative sentiment rows
	  arrange(percent)                        # Sort from the least to the most negative play

	## # A tibble: 6 x 6
	## # Groups: title [6]
	## title type sentiment n total percent
	## <chr> <chr> <chr> <int> <int> <dbl>
	## 1 Much Ado about Nothing Comedy negative 767 1894 0.405
	## 2 The Merchant of Venice Comedy negative 740 1702 0.435
	## 3 A Midsummer Night's Dream Comedy negative 681 1454 0.468
	## 4 Hamlet, Prince of Denmark Tragedy negative 1323 2546 0.520
	## 5 The Tragedy of Romeo and Juliet Tragedy negative 1235 2325 0.531
	## 6 The Tragedy of Macbeth Tragedy negative 914 1663 0.550
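As a quick check of how `percent` is computed, the sketch below redoes the calculation for one play by hand. The two counts are copied from the bing sentiment table shown earlier in this post; the `macbeth` data frame itself is only an illustrative stand-in for `sentiment_counts`.

```r
library(dplyr)

# Counts copied from the positive/negative table printed earlier in this post
macbeth <- data.frame(
  title = "The Tragedy of Macbeth",
  sentiment = c("negative", "positive"),
  n = c(914L, 749L),
  stringsAsFactors = FALSE
)

macbeth_pct <- macbeth %>%
  group_by(title) %>%
  mutate(total = sum(n),          # 914 + 749 sentiment words in total
         percent = n / total) %>% # share of each sentiment
  ungroup()

macbeth_pct
```

The negative share is 914 / 1663, which rounds to the 0.550 shown in the last row of the table above.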

	word_counts <- tidy_shakespeare %>%
	inner_join(get_sentiments("bing")) %>%
	count(word, sentiment)

	top_words <- word_counts %>%
	  group_by(sentiment) %>%           # Group by sentiment
	  top_n(10, n) %>%                  # Take the 10 most frequent words for each sentiment
	  ungroup() %>%
	  mutate(word = reorder(word, n))   # Make word a factor in order of n

	# Use aes() to put words on the x-axis and n on the y-axis
	library(ggplot2)
	ggplot(top_words, aes(x = word, y = n, fill = sentiment)) +
	geom_col(show.legend = FALSE) +
	facet_wrap(~sentiment, scales = "free") +
	coord_flip()

	# Note: newer releases of the AFINN lexicon name the score column `value` instead of `score`
	tidy_shakespeare %>%
	  count(title, word, sort = TRUE) %>%            # Count by title and word
	  inner_join(get_sentiments("afinn")) %>%        # Implement sentiment analysis using the "afinn" lexicon
	  filter(title == "The Tragedy of Macbeth",      # Keep only Macbeth
	         score < 0)                              # and only the words with a negative score

	## Joining, by = "word"

	## # A tibble: 237 x 4
	## title word n score
	## <chr> <chr> <int> <int>
	## 1 The Tragedy of Macbeth no 73 -1
	## 2 The Tragedy of Macbeth fear 35 -2
	## 3 The Tragedy of Macbeth death 20 -2
	## 4 The Tragedy of Macbeth bloody 16 -3
	## 5 The Tragedy of Macbeth poor 16 -2
	## 6 The Tragedy of Macbeth strange 16 -1
	## 7 The Tragedy of Macbeth dead 14 -3
	## 8 The Tragedy of Macbeth leave 14 -1
	## 9 The Tragedy of Macbeth fight 13 -1
	## 10 The Tragedy of Macbeth charges 11 -2
	## # ... with 227 more rows

	sentiment_contributions <- tidy_shakespeare %>%
	  count(title, word, sort = TRUE) %>%                 # Count by title and word
	  inner_join(get_sentiments("afinn")) %>%             # Implement sentiment analysis using the "afinn" lexicon
	  group_by(title) %>%                                 # Group by title
	  mutate(contribution = (n * score) / sum(n)) %>%     # Contribution of each word, relative to the play's matched words
	  ungroup()

	sentiment_contributions                               # Print the result

	## # A tibble: 2,366 x 5
	## title word n score contribution
	## <chr> <chr> <int> <int> <dbl>
	## 1 Hamlet, Prince of Denmark no 143 -1 -0.0652
	## 2 The Tragedy of Romeo and Juliet love 140 3 0.213
	## 3 Much Ado about Nothing no 132 -1 -0.0768
	## 4 Much Ado about Nothing hero 114 2 0.133
	## 5 A Midsummer Night's Dream love 110 3 0.270
	## 6 Hamlet, Prince of Denmark good 109 3 0.149
	## 7 The Tragedy of Romeo and Juliet no 102 -1 -0.0518
	## 8 Much Ado about Nothing good 93 3 0.162
	## 9 The Merchant of Venice no 92 -1 -0.0630
	## 10 Much Ado about Nothing love 91 3 0.159
	## # ... with 2,356 more rows
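The contribution formula is easier to follow on a tiny example. The sketch below uses a hypothetical three-word play (the words, counts, and scores are made up for illustration) and applies the same `(n * score) / sum(n)` calculation.

```r
library(dplyr)

# Hypothetical counts and AFINN-style scores for a single made-up play
toy_play <- data.frame(
  title = "Toy Play",
  word  = c("love", "no", "death"),
  n     = c(5, 3, 2),
  score = c(3, -1, -2),
  stringsAsFactors = FALSE
)

toy_contrib <- toy_play %>%
  group_by(title) %>%
  mutate(contribution = (n * score) / sum(n)) %>% # sum(n) is 10 here
  ungroup()

toy_contrib
```

Here "love" contributes 5 * 3 / 10 = 1.5, "no" contributes 3 * -1 / 10 = -0.3, and "death" contributes 2 * -2 / 10 = -0.4, so a frequent word with a mild score can matter as much as a rare word with a strong one.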