Correlation

Evan Jung January 09, 2019

1. Intro

  • Case 1: when high values of X go with high values of Y, X and Y are positively correlated.
  • Case 2: when low values of X go with low values of Y, X and Y are also positively correlated.
  • Case 3: when high values of X go with low values of Y, and vice versa, the variables are negatively correlated.

2. Key Terms

Correlation Coefficient is a metric that measures the extent to which numeric variables are associated with one another; it ranges from -1 to +1. A value of +1 means perfect positive correlation, 0 indicates no correlation, and -1 means perfect negative correlation.

To compute Pearson’s correlation coefficient, we multiply deviations from the mean for variable 1 by those for variable 2, and divide by the product of the standard deviations:

r = Σ (x_i − x̄)(y_i − ȳ) / ((n − 1) · s_x · s_y)
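As a quick illustration, the coefficient can be computed directly in base R and checked against the built-in cor() function. The vectors x and y below are made-up example data, not from the text:

```r
# Made-up example vectors
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
n <- length(x)

# Pearson's r: sum of products of deviations, divided by (n - 1) * sd(x) * sd(y)
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

r_manual   # manual computation
cor(x, y)  # built-in Pearson correlation; the two agree
```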



Correlation Matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between variables.

(The explanation of this plot is left to the reader!)

  1. The orientation of the ellipse indicates whether two variables are positively correlated or negatively correlated.

  2. The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipses correspond to stronger relationships.

2.1. Other Correlation Estimates

Spearman’s rho and Kendall’s tau were proposed by statisticians long ago. These estimates are based on the ranks of the data. Because they work with ranks rather than raw values, they are robust to outliers and can handle certain types of nonlinearities.

But data scientists can generally stick to Pearson’s correlation coefficient, and its robust alternatives, for exploratory analysis. The appeal of rank-based estimates is mostly for smaller data sets and specific hypothesis tests.
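As a minimal sketch of the difference, all three estimates are available through the method argument of cor(). The data below are made up, with one extreme value in x:

```r
# Made-up data: x contains one extreme outlier, but the x-y relationship is monotone
x <- c(1, 2, 3, 4, 100)
y <- c(2, 4, 6, 8, 10)

cor(x, y, method = "pearson")   # pulled around by the outlier's magnitude
cor(x, y, method = "spearman")  # rank-based: equals 1 for any monotone relationship
cor(x, y, method = "kendall")   # rank-based as well
```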

Scatterplot A plot in which the x-axis is the value of one variable, and the y-axis the value of another.

The returns have a strong positive relationship: on most days, both stocks go up or go down in tandem. There are very few days where one stock goes down significantly while the other stock goes up (and vice versa).

3. Key Ideas for Correlation

  • The correlation coefficient measures the extent to which two variables are associated with one another.

  • When high values of v1 go with high values of v2, v1 and v2 are positively associated.

  • When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.

  • The correlation coefficient is a standardized metric, so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).

  • A correlation coefficient of 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.

4. Further Reading

Statistics, 4th ed., by David Freedman, Robert Pisani, and Roger Purves (W.W. Norton, 2007), has an excellent discussion of correlation.



What is Applying Statistical Models?

First, it is to make predictions about an outcome. Second, it is to run experiments to study relationships between variables. Third, it is to explore data to identify relationships among variables.

Basic Choices in model architecture

If the response variable is categorical (e.g. yes or no, infected or not), use rpart(). If the response variable is numerical (e.g. the unemployment rate), use lm() or rpart(): when the relationship is gradual or proportional, use lm(); when it is dichotomous or discontinuous, use rpart().
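As a minimal sketch of that choice, using the built-in mtcars data (a stand-in dataset, not from the text):

```r
library(rpart)

# Numerical response with a gradual relationship: lm()
num_model <- lm(mpg ~ wt, data = mtcars)

# Dichotomous response (automatic vs. manual transmission): rpart()
cat_model <- rpart(factor(am) ~ wt + hp, data = mtcars)
```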

The Prediction Errors?

When you train and test a model, you use data with values for the explanatory variables as well as the response variable.

If the model is good, when provided with the inputs from the testing data, the outputs from the function will give results “close” to the response variable in the testing data. Here is one question. How to measure “close”?

The first step is to subtract the function output from the actual response values in the testing data. The result is called the prediction error and there will be one such error for every case in the testing data. You then summarize that set of prediction errors.

The way to summarize the prediction errors is to calculate the mean of the squares of the prediction errors. Since the errors are squared, none of them are negative, so the summary reflects the magnitude of the errors and not their sign.
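As a minimal sketch of this summary (actual and predicted below are made-up stand-ins for testing-data values):

```r
# Made-up response values and model outputs for four testing cases
actual    <- c(10, 12, 9, 15)
predicted <- c(11, 10, 9, 14)

errors <- actual - predicted   # one prediction error per case
mse <- mean(errors^2)          # squaring removes the sign before averaging
mse                            # 1.5
```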

Step 1. Build Model

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.0     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Data Import 
Runners_100 <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Runners100.csv") %>% 
    select(-orig.id) %>% 
    sample_n(100)
## Parsed with column specification:
## cols(
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   previous = col_double(),
##   nruns = col_double(),
##   start_position = col_character(),
##   orig.id = col_double()
## )
# Data Overview
glimpse(Runners_100)
## Observations: 100
## Variables: 8
## $ age            <dbl> 45, 36, 33, 28, 33, 53, 30, 32, 27, 31, 24, 33,...
## $ net            <dbl> 88.52, 101.20, 113.10, 69.70, 99.67, 76.88, 81....
## $ gun            <dbl> 89.20, 103.40, 117.80, 70.17, 104.50, 77.85, 82...
## $ sex            <chr> "M", "F", "M", "M", "F", "M", "M", "F", "F", "F...
## $ year           <dbl> 2000, 2000, 2002, 2006, 2003, 2006, 2005, 2004,...
## $ previous       <dbl> 1, 1, 1, 2, 0, 0, 4, 2, 2, 1, 1, 2, 3, 1, 2, 1,...
## $ nruns          <dbl> 4, 9, 3, 4, 3, 3, 6, 3, 4, 3, 3, 4, 4, 7, 5, 3,...
## $ start_position <chr> "eager", "calm", "mellow", "eager", "mellow", "...
# Build a model of net running time
base_model <- lm(net ~ age + sex, data = Runners_100)

# Evaluate base_model on the training data
base_model_output <- predict(base_model, newdata = Runners_100)

# Build the augmented model
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)

# Evaluate aug_model on the training data
aug_model_output <- predict(aug_model, newdata = Runners_100)

# How much do the model outputs differ?
mean((base_model_output - aug_model_output) ^ 2, na.rm = TRUE)
## [1] 0.5157921

As you can see the result, adding previous as an explanatory variable changes the model outputs.

Step 2. Prediction Performance

One point. Knowing that the models make different predictions doesn’t tell you which model is better. In this exercise, you’ll compare the models’ predictions to the actual values of the response variable. The term prediction error or just error is often used, rather than difference. So, rather than speaking of the mean square difference, we’ll say mean square error.

# Build and evaluate the base model on Runners_100
base_model <- lm(net ~ age + sex, data = Runners_100)
base_model_output <- predict(base_model, newdata = Runners_100)

# Build and evaluate the augmented model on Runners_100
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)
aug_model_output <- predict(aug_model, newdata = Runners_100)

# Find the case-by-case differences
base_model_differences <- with(Runners_100, net - base_model_output)
aug_model_differences <- with(Runners_100, net - aug_model_output)

# Calculate mean square errors
mean(base_model_differences ^ 2)
## [1] 131.5594
mean(aug_model_differences ^ 2)
## [1] 131.0436

The augmented model gives slightly better predictions. But is it fair to compare the two models on the data they were trained on?

Step 3. The importance of Statistics

data(CPS85, package = "mosaicData")

# Add bogus column to CPS85
CPS85$bogus <- rnorm(nrow(CPS85)) > 0

# Make the base model
base_model <- lm(wage ~ educ + sector + sex, data = CPS85)

# Make the bogus augmented model
aug_model <- lm(wage ~ educ + sector + sex + bogus, data = CPS85)

# Find the MSE of the base model
mean((CPS85$wage - predict(base_model, newdata = CPS85)) ^ 2)
## [1] 19.73308
# Find the MSE of the augmented model
mean((CPS85$wage - predict(aug_model, newdata = CPS85)) ^ 2)
## [1] 19.68466

Surprisingly, even though bogus adds no genuine explanatory information to the model, the MSE is smaller in the expanded model than in the base model. In-sample error almost always goes down when an explanatory variable is added, no matter how meaningless that variable is. This is why cross validation is used.

Step 4. Testing and training datasets

The code in the editor uses a style that will give you two prediction error results: one for the training cases and one for the testing cases. Your goal is to see whether there is a systematic difference between prediction accuracy on the training and on the testing cases.

Since the split is being done at random, the results will vary somewhat each time you do the calculation.

# Generate a random TRUE or FALSE for each case in Runners_100
Runners_100$training_cases <- rnorm(nrow(Runners_100)) > 0

# Build base model net ~ age + sex with training cases
base_model <- lm(net ~ age + sex, data = subset(Runners_100, training_cases))

# Evaluate the model for the testing cases
# Install devtools if necessary
# install.packages("devtools")
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)
Preds <- evaluate_model(base_model, data = subset(Runners_100, !training_cases))

# Calculate the MSE on the testing data
with(data = Preds, mean((net - model_output)^2))
## [1] 136.3962

Step 5. Repeating Random Trials

  • To simplify things, the cv_pred_error() function in the statisticalModeling package will carry out this repetitive process for you. The function does all the work of creating training and testing sets for each trial and calculating the mean square error for each trial.
  • The context for this exercise is to see whether the prediction error calculated from the training data is consistently different from the cross-validated prediction error. To that end, you’ll calculate the in-sample error using only the training data. Then, you’ll do the cross validation and use a t-test to see if the in-sample error is statistically different from the cross-validated error.
# The model
model <- lm(net ~ age + sex, data = Runners_100)

# Find the in-sample error (using the training data)
in_sample <- evaluate_model(model, data = Runners_100)
in_sample_error <- 
  with(in_sample, mean((net - model_output)^2, na.rm = TRUE))

# Calculate MSE for many different trials
trials <- cv_pred_error(model)

# View the cross-validated prediction errors
trials
##        mse model
## 1 137.5116 model
## 2 145.9868 model
## 3 142.5793 model
## 4 142.1681 model
## 5 139.8571 model
# Find confidence interval on trials and compare to training_error
mosaic::t.test(~ mse, mu = in_sample_error, data = trials)
## 
##  One Sample t-test
## 
## data:  mse
## t = 7.0898, df = 4, p-value = 0.00209
## alternative hypothesis: true mean is not equal to 131.5594
## 95 percent confidence interval:
##  137.6805 145.5606
## sample estimates:
## mean of x 
##  141.6206

The error based on the training data is below the 95% confidence interval representing the cross-validated prediction error.

Step 6. To add or not to add (an explanatory variable)?

You’re going to use cross validation to find out whether adding a new explanatory variable improves the prediction performance of a model. Remember that models are biased to perform well on the training data. Cross validation gives a fair indication of the prediction error on new data.

# The base model
base_model <- lm(net ~ age + sex, data = Runners_100)

# An augmented model adding previous as an explanatory variable
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)

# Run cross validation trials on the two models
trials <- cv_pred_error(base_model, aug_model)
trials
##         mse      model
## 1  142.0409 base_model
## 2  136.6856 base_model
## 3  143.4253 base_model
## 4  137.9605 base_model
## 5  134.6230 base_model
## 6  143.2968  aug_model
## 7  142.2867  aug_model
## 8  143.6135  aug_model
## 9  140.3348  aug_model
## 10 145.8624  aug_model
# Compare the two sets of cross-validated errors
t.test(mse ~ model, data = trials)
## 
##  Welch Two Sample t-test
## 
## data:  mse by model
## t = 2.1983, df = 6.1923, p-value = 0.06887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4327813  8.6963517
## sample estimates:
##  mean in group aug_model mean in group base_model 
##                 143.0788                 138.9471
  • Notice that cross validation reveals that the augmented model makes worse predictions (larger prediction error) than the base model. Bigger is not necessarily better when it comes to modeling!

  • In conclusion, it’s not good to add the previous variable; just use base_model in this case.

Designing_model

What is modeling?

  • Ref. DataCamp, Statistical Modeling in R (Part 1)

Modeling is a process. 1st step: a team gets an idea to build a model. 2nd step: the team designs the model. 3rd step: with data, the team trains the model. 4th step: the team evaluates the model. 5th step: the team tests model performance. 6th step: the team interprets how the model challenges the team's idea.

These steps are repeated recursively to find an optimized model.

What is 'training a model'? It is the automatic process carried out by the computer, and it requires "fitting" the model to the data. More importantly, a model generally represents both your (variable) choices and the data.

Step 1. Data Import

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.0     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
Runners <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Runners100.csv") %>% 
    select(-orig.id)
## Parsed with column specification:
## cols(
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   previous = col_double(),
##   nruns = col_double(),
##   start_position = col_character(),
##   orig.id = col_double()
## )
names(Runners)
## [1] "age"            "net"            "gun"            "sex"           
## [5] "year"           "previous"       "nruns"          "start_position"

Step 2. Build Model

# Build models: handicap_model_1, handicap_model_2, handicap_model_3 
handicap_model_1 <- lm(net ~ age, data = Runners)
handicap_model_2 <- lm(net ~ sex, data = Runners)
handicap_model_3 <- lm(net ~ age + sex, data = Runners)

# For now, here's a way to visualize the models
# Install devtools if necessary
# install.packages("devtools")
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)
## 
## Attaching package: 'statisticalModeling'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Runners
fmodel(handicap_model_1)

fmodel(handicap_model_2)

fmodel(handicap_model_3)

- Since age is quantitative, graphs of models of net versus age are continuous curves. But sex is categorical. The third plot, with both age and sex as explanatory variables, shows two continuous curves, one for each sex.

Step 3. Build new model, rpart

  • What is cp? cp (the complexity parameter) allows you to dial up or down the complexity of the model being built. cp is like the "minimum benefit" that a split must add to the tree.
# Load rpart
library(rpart)

# Build rpart model: model_2
model_2 <- rpart(net ~ age + sex, data = Runners, cp = 0.002)

# Examine graph of model_2
fmodel(model_2, ~ age + sex)


- In the recursive partitioning architecture, the model functions have 'steps'. It seems unlikely that real people change in the specific way indicated by the model. Presumably, the real change is more gradual and steady. The architecture you choose has consequences for the kinds of relationships that can be depicted.

Step 4. Build new model with new data Ran_twice

Ran_twice <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Ran_twice.csv") %>% select(-X1, -X)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X = col_double(),
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   nruns = col_double(),
##   runs_again = col_logical()
## )
glimpse(Ran_twice)
## Observations: 5,977
## Variables: 7
## $ age        <dbl> 33, 30, 29, 28, 33, 32, 22, 28, 34, 57, 33, 39, 44,...
## $ net        <dbl> NA, 82.37, 69.32, 79.70, NA, 111.80, 104.20, 76.57,...
## $ gun        <dbl> 92.40, 85.70, 69.68, 80.42, 105.40, 121.20, 104.20,...
## $ sex        <chr> "M", "M", "M", "M", "F", "M", "F", "F", "M", "M", "...
## $ year       <dbl> 2007, 2006, 2004, 2002, 2008, 2006, 2005, 2003, 200...
## $ nruns      <dbl> 4, 3, 4, 8, 3, 3, 5, 4, 8, 4, 4, 6, 4, 4, 3, 4, 3, ...
## $ runs_again <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, ...
# Create run_again_model
run_again_model <- rpart(runs_again ~ age + sex + net, data = Ran_twice, cp = 0.005) 

# Visualize the model (don't change)
fmodel(run_again_model, ~ age + net, data = Ran_twice)

  • Previously, the response variable was "net", treated as a numerical variable. In this model, the response variable is "runs_again", which is categorical. Surprisingly, the rpart() architecture works for both numerical and categorical responses.
  • There's a somewhat complicated pattern being shown by the model. Runners over age 40 with two races under their belts have about a 50% probability of running for a third time. For runners in their thirties, the model suggests that those with fast times (e.g. 60 minutes) are the most likely to run again. It also suggests that runners with intermediate times (e.g. 80 minutes) are much less likely to run a third time. Perhaps this is because such runners are hoping for a fast time and are discouraged by their intermediate time. Or perhaps the model has over-reached, finding a pattern that just happens to show up in these data. Techniques for assessing this second possibility will be covered in later chapters of this course.

Step 5. From Inputs to Outputs

# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)

# load data from statisticalModeling
data(AARP)

# Display the variable names
names(AARP)
## [1] "Age"      "Sex"      "Coverage" "Cost"
# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)

# Construct a data frame: example_vals 
example_vals <- data.frame(Age = 60, Sex = "F", Coverage = 200)

# Predict insurance cost using predict()
predict(insurance_cost_model, newdata = example_vals)
##       1 
## 363.637
# Load statisticalModeling
library(statisticalModeling)

# Calculate model output using evaluate_model()
evaluate_model(insurance_cost_model, example_vals)
  • The statisticalModeling package provides an alternative to the predict() function called evaluate_model(). evaluate_model() has certain advantages, such as formatting the output as a data frame alongside the inputs, and takes two arguments: the model and a data argument containing the data frame of model inputs.

Step 6. Extrapolation

One purpose for evaluating a model is extrapolation: finding the model output for inputs that are outside the range of the data used to train the model.

Extrapolation makes sense only for quantitative explanatory variables. For example, given a variable x that ranges from 50 to 100, any value greater than 100 or smaller than 50 is an extrapolation.

# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)

# Create a data frame: new_inputs_1
new_inputs_1 <- data.frame(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

new_inputs_1
# Use expand.grid(): new_inputs_2
new_inputs_2 <- expand.grid(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

new_inputs_2
# Use predict() for new_inputs_1 and new_inputs_2
predict(insurance_cost_model, newdata = new_inputs_1)
##         1         2 
## -99.98726 292.88435
predict(insurance_cost_model, newdata = new_inputs_2)
##         1         2         3         4         5         6         7 
## -99.98726 101.11503 -89.75448 111.34781  81.54928 282.65157  91.78206 
##         8 
## 292.88435
# Use evaluate_model() for new_inputs_1 and new_inputs_2
evaluate_model(insurance_cost_model, data = new_inputs_1)
evaluate_model(insurance_cost_model, data = new_inputs_2)
  • Notice how predict() produces only the model output, not the inputs used to generate that output. evaluate_model() helps you keep track of what the inputs were. Returning to a modeling perspective for a moment… Note that the cost of a policy with zero coverage is actually negative for younger people. This kind of thing can happen when extrapolating outside the domain of the data used for training the model. In this case, you didn't have any AARP data for zero coverage. The moral of the story: beware of extrapolation.

Step 7. Typical values of data

  • Sometimes you want to make a very quick check of what the model output looks like for "typical" inputs. When you use evaluate_model() without the data argument, the function will use the data on which the model was trained to select some typical levels of the inputs. evaluate_model() provides a tabular display of inputs and outputs.
# Evaluate insurance_cost_model
evaluate_model(insurance_cost_model)
# Use fmodel() to reproduce the graphic
fmodel(insurance_cost_model, ~ Coverage + Age + Sex)

# A new formula to highlight difference in sexes
new_formula <- ~ Age + Sex + Coverage

# Make the new plot (don't change)
fmodel(insurance_cost_model, new_formula)

  • The syntax for fmodel() is

    fmodel(model_object, ~ x_var + color_var + facet_var)
  • The choices you make in constructing a graphic are important. Using a given variable in a different role in a plot, or omitting it, can highlight or suppress different aspects of the story.



What is model?

  • Ref. DataCamp, Statistical Modeling in R (Part 1)

Well, I would like to quote from the DataCamp course: a model is a representation for a purpose. "Representation" here means standing for something in some situation.

How about the purpose? The main purpose is to use the model in a real situation. What kinds of modeling are mainly discussed?

Most users (Excel users, for example) do mathematical modeling, which constructs something out of mathematical entities: numbers, formulas, and so on.

Statisticians do statistical modeling, which builds a special type of mathematical model informed by data and incorporating uncertainty and randomness. Perhaps ML practitioners are included in this group, in my opinion.

Since this document is about modeling, I will assume that readers already know how to (1) import data, (2) transform data, and (3) visualize data, so I won't explain such code in detail. If you want to study importing through visualizing data, please read R for Data Science.

Step 1. Data Import

library(mosaic)
data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Step 2. Data Overview

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Step 3. Calculating Statistics with a Formula

Formulas such as (Sepal.Length ~ Species) describe the form of relationships among variables.

# the formula interface to mean() is provided by the mosaic package
library(mosaic)
mean(Sepal.Length ~ Species, data = iris)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
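If you prefer not to load mosaic, the same group means can be computed with base R's aggregate(), which accepts the same formula; a minimal equivalent sketch:

```r
# base-R equivalent: mean Sepal.Length within each Species group
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
```

aggregate() returns the result as a data frame rather than a named vector, but the values are identical.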

Step 4. Visualizing Data with a Formula

Formulas can also describe graphics in each of the three popular graphics systems: base graphics, lattice, and ggplot2 (via the gf_ functions from the ggformula package).

# Create a boxplot using base, lattice, or ggformula
boxplot(Sepal.Length ~ Species, data = iris)      # base graphics

bwplot(Sepal.Length ~ Species, data = iris)       # lattice

gf_boxplot(Sepal.Length ~ Species, data = iris)   # ggformula (ggplot2)


# Make a scatterplot using base, lattice, or ggformula
# (note: when the x variable is a factor, base plot() draws boxplots instead)
plot(Sepal.Length ~ Species, data = iris)         # base graphics

xyplot(Sepal.Length ~ Species, data = iris)       # lattice

gf_point(Sepal.Length ~ Species, data = iris)     # ggformula (ggplot2)
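The same formula style extends to the correlation coefficient that this post is about. A short sketch, assuming the mosaic package is loaded (its cor() accepts a formula; the matrix call below is plain base R):

```r
library(mosaic)

# Pearson correlation between two numeric variables, formula style
cor(Sepal.Length ~ Petal.Length, data = iris)

# base R: full correlation matrix of the four numeric columns
cor(iris[, 1:4])
```

The matrix form is what a correlation-matrix plot (like the ellipse chart shown earlier) is built from.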

