Designing_model

What is modeling?

  • Ref. DataCamp, Statistical Modeling in R (Part 1)

Modeling is a process:

  • Step 1: The team comes up with an idea for a model.
  • Step 2: The team designs the model.
  • Step 3: The team trains the model on data.
  • Step 4: The team evaluates the model.
  • Step 5: The team tests the model's performance.
  • Step 6: The team interprets how the model challenges its original idea.

These steps are repeated iteratively until a well-optimized model is found.

What is 'training a model'? It is an automatic process, carried out by the computer, that "fits" the model to the data. More importantly, a model generally reflects both your choices (of variables) and the data.
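As a minimal sketch of what "fitting" means, the example below uses R's built-in mtcars data (an assumption made for illustration; it is not the Runners data used in this post). The two formulas are the modeler's choices, and lm() does the automatic part.

```r
# The formula encodes your choice of variables; the data supplies the rest.
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Training ("fitting") is the automatic part: lm() estimates the coefficients.
coef(fit_small)
coef(fit_large)

# Same data, different choices, different models.
length(coef(fit_small))  # 2: intercept and wt
length(coef(fit_large))  # 3: intercept, wt and hp
```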

Step 1. Data Import

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.3.0     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
Runners <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Runners100.csv") %>% 
    select(-orig.id)
## Parsed with column specification:
## cols(
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   previous = col_double(),
##   nruns = col_double(),
##   start_position = col_character(),
##   orig.id = col_double()
## )
names(Runners)
## [1] "age"            "net"            "gun"            "sex"           
## [5] "year"           "previous"       "nruns"          "start_position"

Step 2. Build Model

# Build models: handicap_model_1, handicap_model_2, handicap_model_3 
handicap_model_1 <- lm(net ~ age, data = Runners)
handicap_model_2 <- lm(net ~ sex, data = Runners)
handicap_model_3 <- lm(net ~ age + sex, data = Runners)

# For now, here's a way to visualize the models
# Install devtools if necessary
# install.packages("devtools")
# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)
## 
## Attaching package: 'statisticalModeling'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Runners
fmodel(handicap_model_1)

fmodel(handicap_model_2)

fmodel(handicap_model_3)

- Since age is quantitative, the graph of a model of net versus age is a continuous curve. Sex, however, is categorical, so a model of net versus sex gives one value per category. The third plot, with both age and sex as explanatory variables, shows two continuous curves, one for each sex.
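The "one curve per category" behaviour can be checked numerically. The sketch below is an illustration on the built-in mtcars data (an assumption; the names cars_df, grid and gap are made up here), with wt playing the role of age and am the role of sex. In an additive lm() the two lines are parallel, so the gap between the groups is the same at every x value.

```r
# A quantitative variable (wt) plus a categorical one (am), like age + sex.
cars_df <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
fit <- lm(mpg ~ wt + am, data = cars_df)

# Evaluate the model on a grid: three weights for each transmission type.
grid <- expand.grid(wt = c(2, 3, 4), am = levels(cars_df$am))
grid$mpg_hat <- predict(fit, newdata = grid)

# Gap between the two lines at each weight; it equals the "ammanual" coefficient.
gap <- grid$mpg_hat[grid$am == "manual"] - grid$mpg_hat[grid$am == "automatic"]
gap
coef(fit)["ammanual"]
```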

Step 3. Build new model, rpart

  • What is cp? The complexity parameter (cp) lets you dial the complexity of the model up or down. You can think of cp as the "minimum benefit" that a split must add to the tree.
# Load rpart
library(rpart)

# Build rpart model: model_2
model_2 <- rpart(net ~ age + sex, data = Runners, cp = 0.002)

# Examine graph of model_2
fmodel(model_2, ~ age + sex)

 

- In the recursive partitioning architecture, the model functions have 'steps'. It seems unlikely that real people change in the specific way indicated by the model. Presumably, the real change is more gradual and steady. The architecture you choose has consequences for the kinds of relationships that can be depicted.
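Both points, the effect of cp and the step-shaped output, can be seen in a small sketch. It uses the built-in iris data rather than Runners (an assumption for illustration), and n_leaves() is a helper defined here, not an rpart function.

```r
library(rpart)

# Same formula and data; only the complexity parameter differs.
coarse <- rpart(Sepal.Length ~ Petal.Length, data = iris, cp = 0.1)
fine   <- rpart(Sepal.Length ~ Petal.Length, data = iris, cp = 0.002)

# Count terminal nodes: rows of the tree's frame marked "<leaf>".
n_leaves <- function(tree) sum(tree$frame$var == "<leaf>")
n_leaves(coarse)
n_leaves(fine)

# The model output is a step function: every observation in a leaf
# gets the same prediction, so there are at most n_leaves distinct values.
length(unique(predict(fine)))
```

A smaller cp accepts splits with a smaller benefit, so the tree can only grow or stay the same size; each extra leaf adds one more "step" to the curve.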

Step 4. Build new model with new data Ran_twice

Ran_twice <- read_csv("https://assets.datacamp.com/production/course_1585/datasets/Ran_twice.csv") %>% select(-X1, -X)
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   X = col_double(),
##   age = col_double(),
##   net = col_double(),
##   gun = col_double(),
##   sex = col_character(),
##   year = col_double(),
##   nruns = col_double(),
##   runs_again = col_logical()
## )
glimpse(Ran_twice)
## Observations: 5,977
## Variables: 7
## $ age        <dbl> 33, 30, 29, 28, 33, 32, 22, 28, 34, 57, 33, 39, 44,...
## $ net        <dbl> NA, 82.37, 69.32, 79.70, NA, 111.80, 104.20, 76.57,...
## $ gun        <dbl> 92.40, 85.70, 69.68, 80.42, 105.40, 121.20, 104.20,...
## $ sex        <chr> "M", "M", "M", "M", "F", "M", "F", "F", "M", "M", "...
## $ year       <dbl> 2007, 2006, 2004, 2002, 2008, 2006, 2005, 2003, 200...
## $ nruns      <dbl> 4, 3, 4, 8, 3, 3, 5, 4, 8, 4, 4, 6, 4, 4, 3, 4, 3, ...
## $ runs_again <lgl> TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, ...
# Create run_again_model
run_again_model <- rpart(runs_again ~ age + sex + net, data = Ran_twice, cp = 0.005) 

# Visualize the model (don't change)
fmodel(run_again_model, ~ age + net, data = Ran_twice)

  • Previously, the response variable was "net", a numeric variable. In this model, the response variable is "runs_again", which is categorical. Surprisingly, the rpart() architecture works for both numerical and categorical responses.
  • The model shows a somewhat complicated pattern. Runners over age 40 with two races under their belts have about a 50% probability of running a third time. For runners in their thirties, the model suggests that those with fast times (e.g. 60 minutes) are the most likely to run again. It also suggests that runners with intermediate times (e.g. 80 minutes) are much less likely to run a third time. Perhaps this is because such runners were hoping for a fast time and are discouraged by their intermediate time. Or perhaps the model has over-reached, finding a pattern that just happens to show up in these data. Techniques for assessing this second possibility will be covered in later chapters of this course.
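To illustrate rpart() with a categorical response, here is a minimal sketch on the built-in iris data (an assumption; it is not the Ran_twice data). With a factor response rpart() fits a classification tree, and predict() can return either the predicted class or the class probabilities.

```r
library(rpart)

# Species is a factor, so this is a classification tree.
# method = "class" is what rpart() picks for a factor response; it is spelled out here.
species_model <- rpart(Species ~ Petal.Length + Petal.Width,
                       data = iris, method = "class")

# Predicted class for each row, and the class probabilities behind it.
pred_class <- predict(species_model, newdata = iris, type = "class")
pred_prob  <- predict(species_model, newdata = iris, type = "prob")

# Share of training rows classified correctly (optimistic: same data as training).
mean(pred_class == iris$Species)
head(pred_prob, 3)
```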

Step 5. From Inputs to Outputs

# Install statisticalModeling
devtools::install_github("dtkaplan/statisticalModeling")
## Skipping install of 'statisticalModeling' from a github remote, the SHA1 (4c5383d3) has not changed since last install.
##   Use `force = TRUE` to force installation
library(statisticalModeling)

# load data from statisticalModeling
data(AARP)

# Display the variable names
names(AARP)
## [1] "Age"      "Sex"      "Coverage" "Cost"
# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)

# Construct a data frame: example_vals 
example_vals <- data.frame(Age = 60, Sex = "F", Coverage = 200)

# Predict insurance cost using predict()
predict(insurance_cost_model, newdata = example_vals)
##       1 
## 363.637
# Load statisticalModeling
library(statisticalModeling)

# Calculate model output using evaluate_model()
evaluate_model(insurance_cost_model, example_vals)
  • The statisticalModeling package provides an alternative to the predict() function called evaluate_model(). evaluate_model() has certain advantages, such as formatting the output as a data frame alongside the inputs. It takes two arguments: the model, and a data argument containing the data frame of model inputs.
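If the statisticalModeling package is not available, the idea of keeping inputs and outputs side by side can be approximated in base R. The helper with_output() below is hypothetical and is not the package's implementation; the model is fitted on the built-in mtcars data purely for illustration.

```r
# Hypothetical helper: bind the model output to the inputs that produced it.
with_output <- function(model, inputs) {
  cbind(inputs, model_output = predict(model, newdata = inputs))
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
inputs <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# One row per input combination, with the prediction in the last column.
result <- with_output(fit, inputs)
result
```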

Step 6. Extrapolation

One purpose for evaluating a model is extrapolation: finding the model output for inputs that are outside the range of the data used to train the model.

Extrapolation makes sense only for quantitative explanatory variables. For example, given a variable x that ranges from 50 to 100, any value greater than 100 or smaller than 50 is an extrapolation.
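The range rule above is easy to turn into a check. The function outside_range() and the vector train_x are made up for this sketch; they mirror the "50 to 100" example in the text.

```r
# Flag values that fall outside the range seen in the training data.
outside_range <- function(new_x, train_x) {
  new_x < min(train_x, na.rm = TRUE) | new_x > max(train_x, na.rm = TRUE)
}

train_x <- c(50, 62, 75, 88, 100)   # training values range from 50 to 100
outside_range(c(40, 50, 75, 100, 120), train_x)
# [1]  TRUE FALSE FALSE FALSE  TRUE
```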

# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)

# Create a data frame: new_inputs_1
new_inputs_1 <- data.frame(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

new_inputs_1
# Use expand.grid(): new_inputs_2
new_inputs_2 <- expand.grid(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

new_inputs_2
# Use predict() for new_inputs_1 and new_inputs_2
predict(insurance_cost_model, newdata = new_inputs_1)
##         1         2 
## -99.98726 292.88435
predict(insurance_cost_model, newdata = new_inputs_2)
##         1         2         3         4         5         6         7 
## -99.98726 101.11503 -89.75448 111.34781  81.54928 282.65157  91.78206 
##         8 
## 292.88435
# Use evaluate_model() for new_inputs_1 and new_inputs_2
evaluate_model(insurance_cost_model, data = new_inputs_1)
evaluate_model(insurance_cost_model, data = new_inputs_2)
  • Notice how predict() produces only the model output, not the inputs used to generate that output. evaluate_model() helps you keep track of what the inputs were. Returning to a modeling perspective for a moment… Note that the cost of a policy with zero coverage is actually negative for younger people. This kind of thing can happen when extrapolating outside the domain of the data used for training the model. In this case, you didn't have any AARP data for zero coverage. The moral of the story: beware of extrapolation.
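The two input tables in this step differ in size, which is why predict() returned 2 values for one and 8 for the other. A short sketch (the names paired and crossed are made up here) shows the difference: data.frame() pairs the values position by position, while expand.grid() builds every combination.

```r
# data.frame(): row i takes the i-th value of each argument.
paired <- data.frame(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

# expand.grid(): all 2 x 2 x 2 combinations, first argument varying fastest.
crossed <- expand.grid(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))

nrow(paired)   # 2
nrow(crossed)  # 8
```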

Step 7. Typical values of data

  • Sometimes you want to make a very quick check of what the model output looks like for "typical" inputs. When you use evaluate_model() without the data argument, the function will use the data on which the model was trained to select some typical levels of the inputs. evaluate_model() provides a tabular display of inputs and outputs.
# Evaluate insurance_cost_model
evaluate_model(insurance_cost_model)
# Use fmodel() to reproduce the graphic
fmodel(insurance_cost_model, ~ Coverage + Age + Sex)

# A new formula to highlight difference in sexes
new_formula <- ~ Age + Sex + Coverage

# Make the new plot (don't change)
fmodel(insurance_cost_model, new_formula)

  • The syntax for fmodel() is

    fmodel(model_object, ~ x_var + color_var + facet_var)
  • The choices you make in constructing a graphic are important. Using a given variable in a different role in a plot, or omitting it, can highlight or suppress different aspects of the story.
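The idea of "typical" inputs from this step can be sketched in base R. The rule used below (quartiles for numeric variables, the most frequent levels for categorical ones) is an assumption made for illustration and is not necessarily how evaluate_model() chooses its values; typical_levels() is a hypothetical helper, and the model is fitted on the built-in mtcars data.

```r
# Hypothetical helper: pick a few representative values of a variable.
typical_levels <- function(x, n = 3) {
  if (is.numeric(x)) {
    # Evenly spaced quantiles between the 25th and 75th percentiles.
    unname(quantile(x, probs = seq(0.25, 0.75, length.out = n), na.rm = TRUE))
  } else {
    # The n most frequent categories.
    names(sort(table(x), decreasing = TRUE))[seq_len(min(n, length(unique(x))))]
  }
}

df <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = df)

# Cross the typical values of each input and evaluate the model on the grid.
inputs <- expand.grid(wt = typical_levels(df$wt),
                      cyl = typical_levels(df$cyl, n = 2),
                      stringsAsFactors = FALSE)
inputs$model_output <- predict(fit, newdata = inputs)
inputs
```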



What is a model?

  • Ref. DataCamp, Statistical Modeling in R (Part 1)

To quote the DataCamp course: a model is a representation for a purpose. Here, "representation" means something that stands for something else in some situation.


What about the purpose? The main purpose is to use the model in a real situation. What kinds of modeling are mainly discussed?


Most Excel users do mathematical modeling, which builds a model out of mathematical entities such as numbers and formulas.


Statisticians do statistical modeling, which builds a special type of mathematical model that is informed by data and incorporates uncertainty and randomness. In my opinion, ML practitioners probably belong to this group too.


Since this document is about modeling, I will assume that readers already know how to (1) import data, (2) transform data, and (3) visualize data, so I won't explain that code in detail. If you want to study everything from importing to visualizing data, please read R for Data Science.

Step 1. Data Import

library(mosaic)
data("iris")
iris
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Step 2. Data Overview

names(iris)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
## [5] "Species"

Step 3. Calculating Data with formula

Formulas such as (Sepal.Length ~ Species) are used to describe the form of relationships among variables.

mean(Sepal.Length ~ Species, data = iris)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
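The formula version of mean() comes from the mosaic package. As a sketch of the same calculation without mosaic, base R offers aggregate(), which also accepts a formula, and tapply(); the name group_means is made up here.

```r
# Formula interface in base R: mean of Sepal.Length within each Species.
aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)

# The same group means as a named vector.
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)
group_means
```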

Step 4. Visualizing Data with formula

Formulas can be used to describe graphics in each of the three popular graphics systems: base graphics, lattice graphics, and ggplot2 (here through the gf_ functions from ggformula, which mosaic loads).

# Create a boxplot using base, lattice, or ggplot2
boxplot(Sepal.Length ~ Species, data = iris)



bwplot(Sepal.Length ~ Species, data = iris)


gf_boxplot(Sepal.Length ~ Species, data = iris)


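# Make a scatterplot using base, lattice, or ggplot2
# (note: with a factor such as Species on the right-hand side, base plot() draws box plots instead)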
plot(Sepal.Length ~ Species, data = iris)


xyplot(Sepal.Length ~ Species, data = iris)


gf_point(Sepal.Length ~ Species, data = iris)


R with MySQL


Send Data to SQL Table

This is simple code for writing a data frame to a MySQL database and reading it back.
# install.packages("DBI", dependencies = TRUE)
library(DBI)
# System settings (only needed if RMySQL fails to build from source on macOS)
# Sys.setenv(PKG_CPPFLAGS = "-I//usr/local/mysql-8.0.12-macos10.13-x86_64/include/")
# Sys.setenv(PKG_LIBS="-L/usr/local/mysql-8.0.12-macos10.13-x86_64/lib/ -lmysqlclient")
install.packages("RMySQL", type = "source")  # run once
library(RMySQL)
# Connect to a local MySQL server.
# Replace the placeholder credentials and dbname with your own.
con <- dbConnect(RMySQL::MySQL(),
                 username = "root",
                 password = "password",
                 host = "localhost",
                 port = 3306,
                 dbname = "dbname"
)
data("iris")
# append = TRUE adds rows to an existing table, so re-running this duplicates the data
dbWriteTable(con,
             name = "iris",
             value = iris,
             append = TRUE,
             row.names = FALSE)
# If dbWriteTable() fails with:
#   Error in .local(conn, statement, ...) : could not run statement: The used command is not allowed with this MySQL version
# open a mysql session in the terminal and run:
#   SET GLOBAL local_infile = 1;
# (temporary: it resets when the server restarts, and it relaxes security)
# If dbConnect() fails with an error like the one below, the client library
# could not load caching_sha2_password, the default authentication plugin in MySQL 8:
# Error in .local(drv, ...) :
# Failed to connect to database: Error: Plugin caching_sha2_password could not be loaded: dlopen(lib/mariadb/plugin/caching_sha2_password.so, 2): image not found
# Run a query
dbGetQuery(con, "SELECT * FROM iris")
# It's polite to let the database know when you're done
dbDisconnect(con)
#> [1] TRUE
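Because the code above uses append = TRUE, every re-run adds another copy of iris to the table. One possible variation is sketched below: a hypothetical helper, write_and_read(), that replaces the table instead and always closes the connection. It uses only standard DBI calls (dbWriteTable, dbReadTable, dbDisconnect), but it has not been run against the database in this post, so treat it as a starting point.

```r
# Hypothetical helper: replace a table with `df`, read it back, then disconnect.
write_and_read <- function(con, name, df) {
  # Close the connection when the function exits, even if a step fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  # overwrite = TRUE replaces the table, so re-running does not duplicate rows.
  DBI::dbWriteTable(con, name = name, value = df,
                    overwrite = TRUE, row.names = FALSE)
  DBI::dbReadTable(con, name)
}

# Usage (needs a reachable MySQL server and the same placeholder credentials as above):
# con <- DBI::dbConnect(RMySQL::MySQL(), username = "root", password = "password",
#                       host = "localhost", port = 3306, dbname = "dbname")
# iris_from_db <- write_and_read(con, "iris", iris)
```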
