
Exploring Two or More Variables

Evan Jung January 17, 2019

Intro to Multivariate Analysis

Key Terms

  • Contingency Tables - A tally of counts between two or more categorical variables

  • Hexagonal Binning - A plot of two numeric variables with the records binned into hexagons

  • Contour plots - A plot showing the density of two numeric variables like a topographical map.

  • Violin plots - Similar to a boxplot but showing the density estimate.

The appropriate type of multivariate analysis depends on the nature of the data: numeric versus categorical.

Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)

  • kc_tax contains the tax-assessed values for residential properties in King County, Washington.

## Observations: 498,249
## Variables: 3
## $ TaxAssessedValue <dbl> NA, 206000, 303000, 361000, 459000, 223000, 2...
## $ SqFtTotLiving    <dbl> 1730, 1870, 1530, 2000, 3150, 1570, 1770, 115...
## $ ZipCode          <dbl> 98117, 98002, 98166, 98108, 98108, 98032, 981...

The problem with scatterplots

Scatterplots are fine when the number of data values is relatively small. With a very large data set, such as the roughly 500,000 records in kc_tax, the points overlap so heavily that the plot becomes too dense to show the relationship clearly. We will compare the scatterplot with other types of plots later.

Hexagon binning plot

This plot visualizes the relationship between finished square feet (SqFtTotLiving) and tax-assessed value (TaxAssessedValue).


Let’s compare the two plots. Rather than drawing every record as an individual point, as a scatterplot does, a hexagon binning plot groups the records into hexagonal bins and colors each hexagon according to the number of records in that bin. Now we can clearly see the positive relationship between the two variables.
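The comparison above can be sketched roughly as follows. This is a minimal illustration rather than the code that produced the original figures: because the kc_tax data is not bundled with this post, it uses a simulated stand-in called kc_sim (a hypothetical name) with the same two column names, and it assumes the ggplot2 and hexbin packages are installed.

```r
library(ggplot2)  # stat_binhex() additionally needs the hexbin package installed

# ASSUMPTION: simulated stand-in for kc_tax (not the real King County data).
set.seed(42)
n <- 5000
sqft <- round(rlnorm(n, meanlog = log(2000), sdlog = 0.4))
kc_sim <- data.frame(
  SqFtTotLiving    = sqft,
  TaxAssessedValue = round(sqft * 200 * rlnorm(n, meanlog = 0, sdlog = 0.3))
)

# Plain scatterplot: every record is drawn, so dense regions blur together.
p_scatter <- ggplot(kc_sim, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  geom_point() +
  theme_bw() +
  labs(x = "Finished Square Feet", y = "Tax Assessed Value")

# Hexagonal binning: records are grouped into hexagons and the fill
# colour encodes how many records fall into each hexagon.
p_hex <- ggplot(kc_sim, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  stat_binhex(colour = "white") +
  theme_bw() +
  scale_fill_gradient(low = "white", high = "black") +
  labs(x = "Finished Square Feet", y = "Tax Assessed Value")

print(p_scatter)
print(p_hex)
```

With the real data, kc_sim would be replaced by kc_tax. If the gridExtra package is available, the two plots could be placed side by side with gridExtra::grid.arrange(p_scatter, p_hex, ncol = 2).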

Density2d

The geom_density2d function overlays contours on a scatterplot to visualize the relationship between two variables. The contours are essentially a topographical map of the two variables: each contour band represents a specific density of points, and the density increases as one nears a “peak”.
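A contour plot of this kind might be drawn as in the sketch below. It is an illustration under assumptions, not the original code: kc_sim is a simulated stand-in (a hypothetical name) for kc_tax with the same two column names, and geom_density2d relies on the MASS package for the density estimate, which is assumed to be installed.

```r
library(ggplot2)  # geom_density2d() uses MASS::kde2d() for the 2D density

# ASSUMPTION: simulated stand-in for kc_tax (not the real King County data).
set.seed(42)
n <- 5000
sqft <- round(rlnorm(n, meanlog = log(2000), sdlog = 0.4))
kc_sim <- data.frame(
  SqFtTotLiving    = sqft,
  TaxAssessedValue = round(sqft * 200 * rlnorm(n, meanlog = 0, sdlog = 0.3))
)

# Semi-transparent points with density contours drawn on top of them.
p_contour <- ggplot(kc_sim, aes(x = SqFtTotLiving, y = TaxAssessedValue)) +
  theme_bw() +
  geom_point(alpha = 0.1) +
  geom_density2d(colour = "white") +
  labs(x = "Finished Square Feet", y = "Tax Assessed Value")

print(p_contour)
```

The low alpha value keeps the points faint so that the white contour lines remain visible where the data is densest.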


Conclusion

These plots are related to correlation analysis. So, when we draw a graph of two variables, we have to think ahead about which two variables are likely to be related.
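As a small illustration of that link, the strength of a linear relationship can be summarized with a correlation coefficient before or after plotting. The sketch below uses base R's cor function on a simulated stand-in, kc_sim (a hypothetical name, not the real kc_tax data), and sets one value to NA to mimic the missing value visible in the kc_tax preview.

```r
# ASSUMPTION: simulated stand-in for kc_tax (not the real King County data).
set.seed(42)
n <- 5000
sqft <- round(rlnorm(n, meanlog = log(2000), sdlog = 0.4))
kc_sim <- data.frame(
  SqFtTotLiving    = sqft,
  TaxAssessedValue = round(sqft * 200 * rlnorm(n, meanlog = 0, sdlog = 0.3))
)
kc_sim$TaxAssessedValue[1] <- NA  # mimic a missing assessed value

# Pearson correlation, skipping rows where either value is missing.
r <- cor(kc_sim$SqFtTotLiving, kc_sim$TaxAssessedValue, use = "complete.obs")
print(r)
```

A clearly positive value here would agree with the upward trend seen in the plots; the exact number depends on the simulated data and says nothing about the real kc_tax values.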


All content comes from the book Practical Statistics for Data Scientists.
