Designing a Model
Evan Jung
December 14, 2018
What is modeling?
- Ref. DataCamp, Statistical Modeling in R (Part 1)
Modeling is a process:
1. A team gets an idea for building a model.
2. The team designs the model.
3. With data, the team trains the model.
4. The team evaluates the model.
5. The team tests the model's performance.
6. The team interprets how the model challenges its original idea.
These steps are repeated until an optimized model is found.
What is 'training a model'? It is the automatic process, carried out by the computer, of "fitting" the model to the data. More importantly, a model generally represents both your (variable) choices and the data.
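The idea of "fitting" can be shown with a minimal sketch on invented data (the true relationship below is made up for illustration): lm() chooses the intercept and slope that minimize the sum of squared residuals on the training data.

```r
# Invented data with a known relationship: y = 3 + 2x + noise
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)

# "Fitting" = the computer picks the intercept and slope that minimize
# the sum of squared residuals on this training data
fit <- lm(y ~ x)
coef(fit)  # close to the true values 3 and 2
```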
Step 1. Data Import
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.3.0 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Runners <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Runners100.csv") %>%
select(-orig.id)
## Parsed with column specification:
## cols(
## age = col_double(),
## net = col_double(),
## gun = col_double(),
## sex = col_character(),
## year = col_double(),
## previous = col_double(),
## nruns = col_double(),
## start_position = col_character(),
## orig.id = col_double()
## )
names(Runners)
## [1] "age" "net" "gun" "sex"
## [5] "year" "previous" "nruns" "start_position"
Step 2. Build Model
# Build models: handicap_model_1, handicap_model_2, handicap_model_3
handicap_model_1 <- lm(net ~ age, data = Runners)
handicap_model_2 <- lm(net ~ sex, data = Runners)
handicap_model_3 <- lm(net ~ age + sex, data = Runners)
# For now, here's a way to visualize the models
# Install devtools if necessary
# install.packages("devtools")
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
## Use `force = TRUE` to force installation
library(statisticalModeling)
##
## Attaching package: 'statisticalModeling'
## The following object is masked _by_ '.GlobalEnv':
##
## Runners
fmodel(handicap_model_1)
fmodel(handicap_model_2)
fmodel(handicap_model_3)
- Since age is quantitative, graphs of models of net versus age are continuous curves. Sex, however, is categorical. The third plot, with both age and sex as explanatory variables, shows two continuous curves, one for each sex.
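What fmodel() draws can be approximated by hand: evaluate the model over a grid of inputs and trace one curve per sex. This sketch uses invented data standing in for Runners, so the numbers are for illustration only.

```r
# Synthetic stand-in for Runners (invented values, for illustration only)
set.seed(2)
runners <- data.frame(
  age = sample(20:60, 200, replace = TRUE),
  sex = sample(c("F", "M"), 200, replace = TRUE)
)
runners$net <- 60 + 0.5 * runners$age +
  ifelse(runners$sex == "F", 8, 0) + rnorm(200, sd = 5)
mod <- lm(net ~ age + sex, data = runners)

# Evaluate the model over a grid of inputs: one curve per sex
grid <- expand.grid(age = 20:60, sex = c("F", "M"))
grid$net <- predict(mod, newdata = grid)
head(grid)
# Plotting grid with, e.g., ggplot(grid, aes(age, net, color = sex)) +
# geom_line() reproduces the two parallel curves that fmodel() shows.
```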
Step 3. Build a new model with rpart
- What is cp? cp (the complexity parameter) lets you dial the complexity of the model being built up or down. It acts as the "minimum benefit" that a split must add to the tree.
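The effect of cp can be seen on synthetic data (invented here for illustration): a lower cp admits splits with smaller benefit, so the tree grows larger.

```r
library(rpart)

# Invented one-predictor data for illustration
set.seed(3)
d <- data.frame(x = runif(300))
d$y <- sin(2 * pi * d$x) + rnorm(300, sd = 0.2)

# Low cp: splits with small benefit are accepted, so the tree grows large.
# High cp: only splits with a large benefit survive.
big   <- rpart(y ~ x, data = d, cp = 0.001)
small <- rpart(y ~ x, data = d, cp = 0.1)

# Count the internal (non-leaf) nodes, i.e. the number of splits
c(big = sum(big$frame$var != "<leaf>"),
  small = sum(small$frame$var != "<leaf>"))
```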
# Load rpart
library(rpart)
# Build rpart model: model_2
model_2 <- rpart(net ~ age + sex, data = Runners, cp = 0.002)
# Examine graph of model_2
fmodel(model_2, ~ age + sex)
- In the recursive partitioning architecture, the model functions have 'steps'. It seems unlikely that real people change in the specific way the model indicates; presumably the real change is more gradual and steady. The architecture you choose has consequences for the kinds of relationships that can be depicted.
Step 4. Build a new model with new data, Ran_twice
Ran_twice <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Ran_twice.csv") %>% select(-X1, -X)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## X = col_double(),
## age = col_double(),
## net = col_double(),
## gun = col_double(),
## sex = col_character(),
## year = col_double(),
## nruns = col_double(),
## runs_again = col_logical()
## )
glimpse(Ran_twice)
## Observations: 5,977
## Variables: 7
## $ age <dbl> 33, 30, 29, 28, 33, 32, 22, 28, 34, 57, 33, 39, 44,...
## $ net <dbl> NA, 82.37, 69.32, 79.70, NA, 111.80, 104.20, 76.57,...
## $ gun <dbl> 92.40, 85.70, 69.68, 80.42, 105.40, 121.20, 104.20,...
## $ sex <chr> "M", "M", "M", "M", "F", "M", "F", "F", "M", "M", "...
## $ year <dbl> 2007, 2006, 2004, 2002, 2008, 2006, 2005, 2003, 200...
## $ nruns <dbl> 4, 3, 4, 8, 3, 3, 5, 4, 8, 4, 4, 6, 4, 4, 3, 4, 3, ...
## $ runs_again <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, ...
# Create run_again_model
run_again_model <- rpart(runs_again ~ age + sex + net, data = Ran_twice, cp = 0.005)
# Visualize the model (don't change)
fmodel(run_again_model, ~ age + net, data = Ran_twice)
- Previously, the response variable was net, treated as a numeric variable. In this model, the response variable is runs_again, which is categorical. Surprisingly, the rpart() architecture works for both numerical and categorical responses.
- There's a somewhat complicated pattern being shown by the model. Runners over age 40 with two races under their belts have about a 50% probability of running a third time. For runners in their thirties, the model suggests that those with fast times (e.g. 60 minutes) are the most likely to run again. It also suggests that runners with intermediate times (e.g. 80 minutes) are much less likely to run a third time. Perhaps this is because such runners were hoping for a fast time and were discouraged by their intermediate time. Or perhaps the model has over-reached, finding a pattern that just happens to show up in these data. Techniques for assessing this second possibility will be covered in later chapters of this course.
Step 5. From Inputs to Outputs
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
## Use `force = TRUE` to force installation
library(statisticalModeling)
# load data from statisticalModeling
data(AARP)
# Display the variable names
names(AARP)
## [1] "Age" "Sex" "Coverage" "Cost"
# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)
# Construct a data frame: example_vals
example_vals <- data.frame(Age = 60, Sex = "F", Coverage = 200)
# Predict insurance cost using predict()
predict(insurance_cost_model, newdata = example_vals)
## 1
## 363.637
# Load statisticalModeling
library(statisticalModeling)
# Calculate model output using evaluate_model()
evaluate_model(insurance_cost_model, example_vals)
- The statisticalModeling package provides an alternative to the predict() function called evaluate_model(). evaluate_model() has certain advantages, such as formatting the output as a data frame alongside the inputs. It takes two arguments: the model and a data argument containing the data frame of model inputs.
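The convenience evaluate_model() provides can be approximated in base R by binding the inputs and the predict() output into one data frame. This sketch uses a hypothetical stand-in for the AARP data, with invented coefficients, so the predicted value will not match the real output above.

```r
# Hypothetical stand-in for the AARP data (invented values)
set.seed(4)
aarp_like <- data.frame(
  Age = sample(50:80, 100, replace = TRUE),
  Sex = sample(c("F", "M"), 100, replace = TRUE),
  Coverage = sample(c(10, 20, 50), 100, replace = TRUE)
)
aarp_like$Cost <- with(aarp_like, 2 * Age + 1.5 * Coverage + rnorm(100))

mod <- lm(Cost ~ Age + Sex + Coverage, data = aarp_like)
inputs <- data.frame(Age = 60, Sex = "F", Coverage = 200)

# Bind inputs and model output together, as evaluate_model() does
out <- cbind(inputs, model_output = predict(mod, newdata = inputs))
out
```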
Step 6. Extrapolation
One purpose for evaluating a model is extrapolation: finding the model output for inputs that are outside the range of the data used to train the model.
Extrapolation makes sense only for quantitative explanatory variables. For example, given a variable x that ranges from 50 to 100, any value greater than 100 or smaller than 50 is an extrapolation.
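A small sketch of the 50-to-100 example above, with invented linear data: the model happily returns an output for x = 150, even though no training data lies beyond 100.

```r
# Train on x in [50, 100] only (invented data: y = 10 + 0.3x + noise)
set.seed(5)
train <- data.frame(x = 50:100)
train$y <- 10 + 0.3 * train$x + rnorm(51, sd = 1)
mod <- lm(y ~ x, data = train)

# x = 75 is interpolation; x = 150 is extrapolation.
# predict() gives no warning for either.
p <- predict(mod, newdata = data.frame(x = c(75, 150)))
p
```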
# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)
# Create a data frame: new_inputs_1
new_inputs_1 <- data.frame(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))
new_inputs_1
# Use expand.grid(): new_inputs_2
new_inputs_2 <- expand.grid(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))
new_inputs_2
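The difference between the two constructions is worth making explicit: data.frame() pairs its arguments row by row, while expand.grid() builds every combination of them.

```r
# data.frame() pairs values positionally; expand.grid() crosses them
new_inputs_1 <- data.frame(Age = c(30, 90), Sex = c("F", "M"),
                           Coverage = c(0, 100))
new_inputs_2 <- expand.grid(Age = c(30, 90), Sex = c("F", "M"),
                            Coverage = c(0, 100))
nrow(new_inputs_1)  # 2 rows: (30, F, 0) and (90, M, 100)
nrow(new_inputs_2)  # 8 rows: 2 x 2 x 2 combinations
```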
# Use predict() for new_inputs_1 and new_inputs_2
predict(insurance_cost_model, newdata = new_inputs_1)
## 1 2
## -99.98726 292.88435
predict(insurance_cost_model, newdata = new_inputs_2)
## 1 2 3 4 5 6 7
## -99.98726 101.11503 -89.75448 111.34781 81.54928 282.65157 91.78206
## 8
## 292.88435
# Use evaluate_model() for new_inputs_1 and new_inputs_2
evaluate_model(insurance_cost_model, data = new_inputs_1)
evaluate_model(insurance_cost_model, data = new_inputs_2)
- Notice how predict() produces only the model output, not the inputs used to generate that output. evaluate_model() helps you keep track of what the inputs were. Returning to a modeling perspective for a moment… Note that the cost of a policy with zero coverage is actually negative for younger people. This kind of thing can happen when extrapolating outside the domain of the data used for training the model. In this case, you didn't have any AARP data for zero coverage. The moral of the story: beware of extrapolation.
Step 7. Typical values of data
- Sometimes you want to make a very quick check of what the model output looks like for "typical" inputs. When you use evaluate_model() without the data argument, the function will use the data on which the model was trained to select some typical levels of the inputs. evaluate_model() provides a tabular display of inputs and outputs.
# Evaluate insurance_cost_model
evaluate_model(insurance_cost_model)
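How evaluate_model() chooses "typical" levels is not shown here, but one plausible sketch (an assumption about the idea, not the package's actual rule) takes the median of each quantitative input and the observed levels of each categorical input, then crosses them:

```r
# Hypothetical helper (not statisticalModeling's actual rule):
# median for quantitative inputs, observed levels for categorical ones
typical_inputs <- function(data) {
  expand.grid(lapply(data, function(v) {
    if (is.numeric(v)) median(v) else unique(v)
  }))
}

d <- data.frame(Age = c(50, 60, 70), Sex = c("F", "M", "F"),
                Coverage = c(10, 20, 50))
typical_inputs(d)  # one row per combination of typical levels
```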
# Use fmodel() to reproduce the graphic
fmodel(insurance_cost_model, ~ Coverage + Age + Sex)
# A new formula to highlight difference in sexes
new_formula <- ~ Age + Sex + Coverage
# Make the new plot (don't change)
fmodel(insurance_cost_model, new_formula)
The syntax for fmodel() is
fmodel(model_object, ~ x_var + color_var + facet_var)
The choices you make in constructing a graphic are important. Using a given variable in a different role in a plot, or omitting it, can highlight or suppress different aspects of the story.