What is Applying Statistical Models?

Applying statistical models serves three main purposes: first, to make predictions about an outcome; second, to run experiments to study the relationship between variables; and third, to explore data to identify relationships among variables.

Basic Choices in Model Architecture

If the response variable is categorical (e.g. yes or no, infected or not), use rpart(). If the response variable is numerical (e.g. the unemployment rate), use lm() or rpart(): choose lm() when the relationship is gradual or proportional, and rpart() when it is dichotomous or discontinuous.
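As a minimal sketch of these two choices (using the built-in mtcars and iris datasets for illustration, not the running data introduced later in this post):

# Choosing between lm() and rpart(), illustrated on built-in datasets
library(rpart)

# Numerical response with a gradual / proportional relationship: linear model
lm(mpg ~ wt + hp, data = mtcars)

# Categorical response: recursive partitioning (classification tree)
rpart(Species ~ ., data = iris)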

What Are Prediction Errors?

When you train and test a model, you use data with values for the explanatory variables as well as the response variable.

If the model is good, then when it is given the inputs from the testing data, its outputs will be “close” to the response values in the testing data. This raises a question: how do we measure “close”?

The first step is to subtract the function output from the actual response values in the testing data. The result is called the prediction error and there will be one such error for every case in the testing data. You then summarize that set of prediction errors.

The usual way to summarize the prediction errors is to calculate the mean of their squares. Since the errors are squared, none of the squared values are negative, so the summary reflects the magnitude of the errors rather than their sign.
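As a quick toy illustration (made-up numbers, not the course data): errors of −1 and +1 contribute the same amount once squared.

# Toy illustration of mean square error (made-up numbers)
actual    <- c(10, 12, 9, 15)
predicted <- c(11, 11, 10, 14)

errors <- actual - predicted        # -1  1 -1  1
mean(errors ^ 2)                    # mean square error
## [1] 1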

Step 1. Build Model

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.0     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# Data Import 
Runners_100 <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Runners100.csv") %>% 
    select(-orig.id) %>% 
    sample_n(100)
## Parsed with column specification:
## cols(
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   previous = col_double(),
##   nruns = col_double(),
##   start_position = col_character(),
##   orig.id = col_double()
## )
# Data Overview
glimpse(Runners_100)
## Observations: 100
## Variables: 8
## $ age            <dbl> 45, 36, 33, 28, 33, 53, 30, 32, 27, 31, 24, 33,...
## $ net            <dbl> 88.52, 101.20, 113.10, 69.70, 99.67, 76.88, 81....
## $ gun            <dbl> 89.20, 103.40, 117.80, 70.17, 104.50, 77.85, 82...
## $ sex            <chr> "M", "F", "M", "M", "F", "M", "M", "F", "F", "F...
## $ year           <dbl> 2000, 2000, 2002, 2006, 2003, 2006, 2005, 2004,...
## $ previous       <dbl> 1, 1, 1, 2, 0, 0, 4, 2, 2, 1, 1, 2, 3, 1, 2, 1,...
## $ nruns          <dbl> 4, 9, 3, 4, 3, 3, 6, 3, 4, 3, 3, 4, 4, 7, 5, 3,...
## $ start_position <chr> "eager", "calm", "mellow", "eager", "mellow", "...
# Build a model of net running time
base_model <- lm(net ~ age + sex, data = Runners_100)

# Evaluate base_model on the training data
base_model_output <- predict(base_model, newdata = Runners_100)

# Build the augmented model
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)

# Evaluate aug_model on the training data
aug_model_output <- predict(aug_model, newdata = Runners_100)

# How much do the model outputs differ?
mean((base_model_output - aug_model_output) ^ 2, na.rm = TRUE)
## [1] 0.5157921

As the result shows, adding previous as an explanatory variable changes the model outputs.

Step 2. Prediction Performance

One point: knowing that the models make different predictions doesn’t tell you which model is better. In this exercise, you’ll compare the models’ predictions to the actual values of the response variable. The term prediction error, or just error, is often used rather than difference, so rather than speaking of the mean square difference, we’ll say mean square error.

# Build and evaluate the base model on Runners_100
base_model <- lm(net ~ age + sex, data = Runners_100)
base_model_output <- predict(base_model, newdata = Runners_100)

# Build and evaluate the augmented model on Runners_100
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)
aug_model_output <- predict(aug_model, newdata = Runners_100)

# Find the case-by-case differences
base_model_differences <- with(Runners_100, net - base_model_output)
aug_model_differences <- with(Runners_100, net - aug_model_output)

# Calculate mean square errors
mean(base_model_differences ^ 2)
## [1] 131.5594
mean(aug_model_differences ^ 2)
## [1] 131.0436

The augmented model gives slightly better predictions. But is it fair to compare the two models this way?

Step 3. The Importance of Statistics

data(CPS85, package = "mosaicData")

# Add bogus column to CPS85
CPS85$bogus <- rnorm(nrow(CPS85)) > 0

# Make the base model
base_model <- lm(wage ~ educ + sector + sex, data = CPS85)

# Make the bogus augmented model
aug_model <- lm(wage ~ educ + sector + sex + bogus, data = CPS85)

# Find the MSE of the base model
mean((CPS85$wage - predict(base_model, newdata = CPS85)) ^ 2)
## [1] 19.73308
# Find the MSE of the augmented model
mean((CPS85$wage - predict(aug_model, newdata = CPS85)) ^ 2)
## [1] 19.68466

Surprisingly, even though bogus adds no genuine explanatory information to the model, the MSE is smaller for the expanded model than for the base model. In-sample error almost always shrinks when you add an explanatory variable, even a meaningless one, so comparing models on the training data alone is misleading. This is why cross validation is needed.
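A fairer comparison would cross-validate both CPS85 models. The sketch below previews cv_pred_error() from the statisticalModeling package, which is installed and loaded in Step 4; the exact numbers will vary from run to run.

# Preview sketch: cross-validated comparison of the two CPS85 models
library(statisticalModeling)
trials <- cv_pred_error(base_model, aug_model)
t.test(mse ~ model, data = trials)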

Step 4. Testing and Training Datasets

The code in the editor uses a style that will give you two prediction error results: one for the training cases and one for the testing cases. Your goal is to see whether there is a systematic difference between prediction accuracy on the training and on the testing cases.

Since the split is being done at random, the results will vary somewhat each time you do the calculation.
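If you want the split, and therefore the error estimate, to be reproducible across runs, you can fix the random seed first (an optional sketch, not part of the original exercise).

# Optional: fix the random number generator so the train/test split is reproducible
set.seed(2018)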

# Generate a random TRUE or FALSE for each case in Runners_100
Runners_100$training_cases <- rnorm(nrow(Runners_100)) > 0

# Build base model net ~ age + sex with training cases
base_model <- lm(net ~ age + sex, data = subset(Runners_100, training_cases))

# Evaluate the model for the testing cases
# Install devtools if necessary
# install.packages("devtools")
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)
Preds <- evaluate_model(base_model, data = subset(Runners_100, !training_cases))

# Calculate the MSE on the testing data
with(data = Preds, mean((net - model_output)^2))
## [1] 136.3962
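For comparison, the same pattern gives the MSE on the training cases (a sketch following the code above; like the testing error, the value changes with each random split).

# Sketch: MSE on the training cases, to compare with the testing MSE above
Preds_train <- evaluate_model(base_model, data = subset(Runners_100, training_cases))
with(data = Preds_train, mean((net - model_output)^2))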

Step 5. Repeating Random Trials

  • To simplify things, the cv_pred_error() function in the statisticalModeling package will carry out this repetitive process for you: it does all the work of creating training and testing sets for each trial and calculating the mean square error for each trial.
  • The context for this exercise is to see whether the prediction error calculated from the training data is consistently different from the cross-validated prediction error. To that end, you’ll calculate the in-sample error using only the training data. Then, you’ll do the cross validation and use a t-test to see if the in-sample error is statistically different from the cross-validated error.
# The model
model <- lm(net ~ age + sex, data = Runners_100)

# Find the in-sample error (using the training data)
in_sample <- evaluate_model(model, data = Runners_100)
in_sample_error <- 
  with(in_sample, mean((net - model_output)^2, na.rm = TRUE))

# Calculate MSE for many different trials
trials <- cv_pred_error(model)

# View the cross-validated prediction errors
trials
##        mse model
## 1 137.5116 model
## 2 145.9868 model
## 3 142.5793 model
## 4 142.1681 model
## 5 139.8571 model
# Find confidence interval on trials and compare to training_error
mosaic::t.test(~ mse, mu = in_sample_error, data = trials)
## 
##  One Sample t-test
## 
## data:  mse
## t = 7.0898, df = 4, p-value = 0.00209
## alternative hypothesis: true mean is not equal to 131.5594
## 95 percent confidence interval:
##  137.6805 145.5606
## sample estimates:
## mean of x 
##  141.6206

The error based on the training data falls below the 95% confidence interval for the cross-validated prediction error, which confirms that the in-sample error systematically understates the error to expect on new data.

Step 6. To add or not to add (an explanatory variable)?

You’re going to use cross validation to find out whether adding a new explanatory variable improves the prediction performance of a model. Remember that models are biased to perform well on the training data; cross validation gives a fair indication of the prediction error on new data.

# The base model
base_model <- lm(net ~ age + sex, data = Runners_100)

# An augmented model adding previous as an explanatory variable
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)

# Run cross validation trials on the two models
trials <- cv_pred_error(base_model, aug_model)
trials
##         mse      model
## 1  142.0409 base_model
## 2  136.6856 base_model
## 3  143.4253 base_model
## 4  137.9605 base_model
## 5  134.6230 base_model
## 6  143.2968  aug_model
## 7  142.2867  aug_model
## 8  143.6135  aug_model
## 9  140.3348  aug_model
## 10 145.8624  aug_model
# Compare the two sets of cross-validated errors
t.test(mse ~ model, data = trials)
## 
##  Welch Two Sample t-test
## 
## data:  mse by model
## t = 2.1983, df = 6.1923, p-value = 0.06887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4327813  8.6963517
## sample estimates:
##  mean in group aug_model mean in group base_model 
##                 143.0788                 138.9471
  • Notice that cross validation reveals that the augmented model makes worse predictions (larger prediction error) than the base model. Bigger is not necessarily better when it comes to modeling!

  • In conclusion, it is not worth adding the previous variable; just use base_model in this case.
