Exploring Two or More Variables
Evan Jung January 17, 2019
Intro to Multivariate Analysis
Key Terms
Contingency Tables - A tally of counts between two or more categorical variables
Hexagonal Binning - A plot of two numeric variables with the records binned into hexagons
Contour plots - A plot showing the density of two numeric variables like a topographical map.
Violin plots - Similar to a boxplot, but showing the density estimate.
The appropriate type of multivariate analysis depends on the nature of the data: numeric versus categorical.
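As a small illustration of the categorical case, a contingency table can be built with base R's table(). The data frame below is a made-up toy example (the names homes, zip_group and has_garage are hypothetical and do not come from kc_tax); it only sketches what "a tally of counts between two categorical variables" looks like.

```r
# Hypothetical toy data; none of these values come from kc_tax.
homes <- data.frame(
  zip_group  = c("A", "A", "B", "B", "B", "C"),
  has_garage = c("yes", "no", "yes", "yes", "no", "no")
)

# Cross-tabulate the two categorical variables: rows are zip groups,
# columns are garage status, and each cell is a count of records.
tab <- table(homes$zip_group, homes$has_garage)
tab

# Row proportions are often easier to compare than raw counts.
prop.table(tab, margin = 1)
```

For two numeric variables, which the rest of this post covers, a table like this is not practical, so binning or density plots are used instead.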
Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)
- kc_tax contains the tax-assessed values for residential properties in King County, Washington.
## Observations: 498,249
## Variables: 3
## $ TaxAssessedValue <dbl> NA, 206000, 303000, 361000, 459000, 223000, 2...
## $ SqFtTotLiving <dbl> 1730, 1870, 1530, 2000, 3150, 1570, 1770, 115...
## $ ZipCode <dbl> 98117, 98002, 98166, 98108, 98108, 98032, 981...
The problem of scatterplots
Scatterplots work well when the number of data values is relatively small. With very large data sets, however, the points overlap so heavily that the plot becomes too dense to show the relationship clearly. We will compare a scatterplot with other plot types later.
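One rough way to see the overplotting problem is to count how many points compete for the same small region of the plot. The sketch below uses simulated data rather than kc_tax, and the 100 x 100 grid of cells is an arbitrary assumption standing in for "pixels" on the plot.

```r
# Simulated stand-in for a large data set (not the kc_tax data).
set.seed(1)
n <- 100000
x <- rnorm(n, mean = 2000, sd = 500)   # hypothetical square footage
y <- 150 * x + rnorm(n, sd = 50000)    # hypothetical assessed value

# Divide the plotting area into a 100 x 100 grid of cells.
x_cell <- cut(x, breaks = 100, labels = FALSE)
y_cell <- cut(y, breaks = 100, labels = FALSE)
cell_counts <- table(x_cell, y_cell)

# Share of points that land in a cell holding more than one point.
occupied <- cell_counts[cell_counts > 0]
overlap_share <- sum(occupied[occupied > 1]) / n
overlap_share
```

With far more points than cells, nearly every point shares its cell with others, so a plain scatterplot cannot show how many records sit in each region.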
Hexagon binning plot
This plot visualizes the relationship between finished square feet (SqFtTotLiving) and tax-assessed value (TaxAssessedValue).
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
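library(dplyr)      # filter() and %>%
library(ggplot2)    # ggplot() and the geoms/stats used below
library(gridExtra)  # grid.arrange()
library(hexbin)     # required by stat_binhex()

# Keep residences assessed below 750,000 with 100 to 3,500 finished square feet
kc_tax0 <- kc_tax %>%
  filter(TaxAssessedValue < 750000, SqFtTotLiving > 100, SqFtTotLiving < 3500)

# Hexagon binning: fill color encodes the number of records per hexagon
p1 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  stat_binhex(colour = "white") +
  theme_bw() +
  scale_fill_gradient(low = "white", high = "blue") +
  labs(x = "Finished Square Feet",
       y = "Tax Assessed Value",
       title = "Hexagon Binning Plot")

# Plain scatter plot of the same data, for comparison
p2 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  geom_point(colour = "blue") +
  theme_bw() +
  labs(x = "Finished Square Feet",
       y = "Tax Assessed Value",
       title = "Scatter Plot")

# Stack the two plots vertically
grid.arrange(p1, p2, nrow = 2)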
Let’s compare the two plots. Rather than drawing every record as a point, as the scatter plot does, the hexagon binning plot groups the records into hexagonal bins and colors each hexagon by the number of records in that bin. The positive relationship between the two variables is now clearly visible.
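To make the binning step concrete, the counts behind such a plot can be computed directly with the hexbin() function from the hexbin package. This is only a sketch on simulated data (the variables sqft and value and the choice of xbins = 30 are assumptions for illustration, not the settings used in the plot above).

```r
library(hexbin)

# Simulated stand-in for the two numeric variables (not kc_tax).
set.seed(1)
n <- 10000
sqft  <- runif(n, 100, 3500)
value <- 150 * sqft + rnorm(n, sd = 50000)

# Assign every record to a hexagonal cell; xbins sets the number of
# hexagons across the x axis.
hb <- hexbin(sqft, value, xbins = 30)

# Far fewer non-empty hexagons than records, and every record is counted once.
length(hb@count)
sum(hb@count) == n

# Hexagon centers with their counts: this is what the fill color encodes.
head(data.frame(hcell2xy(hb), count = hb@count))
```

Each row of the resulting data frame is one hexagon, so the plot draws a few hundred shapes instead of thousands of overlapping points.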
Density2d
The geom_density2d function overlays contours on a scatterplot to visualize the relationship between two variables. The contours are essentially a topographical map of the two variables. Each contour band represents a specific density of points, increasing as one nears a “peak”.
# Semi-transparent points with white density contours drawn on top
ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  theme_bw() +
  geom_point(alpha = .1) +
  geom_density2d(colour = "white") +
  labs(x = "Finished Square Feet",
       y = "Tax Assessed Value",
       title = "Density 2D")
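The contours come from a two-dimensional kernel density estimate; the ggplot2 documentation describes this as being computed with MASS::kde2d(). The sketch below calls kde2d() directly on simulated data (not kc_tax; the variable names and the 50 x 50 grid are assumptions) and draws the contours with base graphics, to show what the "topographical map" is made of.

```r
library(MASS)  # provides kde2d()

# Simulated stand-in for the two numeric variables (not kc_tax).
set.seed(1)
n <- 5000
sqft  <- rnorm(n, mean = 2000, sd = 500)
value <- 150 * sqft + rnorm(n, sd = 50000)

# Estimate the joint density on a 50 x 50 grid of (x, y) locations.
dens <- kde2d(sqft, value, n = 50)
dim(dens$z)  # one density value per grid location

# Draw the points, then add contour lines of equal estimated density.
plot(sqft, value, pch = ".", col = "grey60",
     xlab = "Finished Square Feet (simulated)",
     ylab = "Tax Assessed Value (simulated)")
contour(dens$x, dens$y, dens$z, add = TRUE, col = "blue")
```

Each contour line connects grid locations with the same estimated density, so the innermost rings mark where the records are most concentrated.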
Conclusion
These plots are related to correlation analysis. So, when we draw a graph of two variables, we have to think ahead about which two variables are related.
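A numeric companion to these plots is the correlation coefficient, which base R computes with cor(). The example below is a sketch on simulated data (sqft and value are hypothetical variables, not kc_tax); with the real data, the same call would take the SqFtTotLiving and TaxAssessedValue columns instead.

```r
# Simulated stand-in for two positively related numeric variables.
set.seed(1)
n <- 1000
sqft  <- runif(n, 100, 3500)
value <- 150 * sqft + rnorm(n, sd = 50000)

# Pearson correlation: values near +1 indicate a strong positive
# linear relationship, values near 0 indicate little linear relationship.
r <- cor(sqft, value)
r

# With real data that contains missing values (kc_tax has NAs), a call
# would need to skip incomplete rows, for example:
# cor(kc_tax0$SqFtTotLiving, kc_tax0$TaxAssessedValue, use = "complete.obs")
```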
All content comes from the book below.