Supervised Learning: Classification Trees
Evan Jung January 28, 2019
1. Intro
Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.
2. Lending Club Dataset
Lending Club is a US-based peer-to-peer lending company. The loans dataset contains 39,732 randomly selected applicants who applied for and later received loans from Lending Club.
library(tidyverse)
# data url
url <- "https://assets.datacamp.com/production/repositories/718/datasets/7805fceacfb205470c0e8800d4ffc37c6944b30c/loans.csv"
# data import & transformation
loans <- read_csv(url) %>%
  mutate_if(is_character, as.factor) %>%
  mutate(outcome = factor(default,
                          levels = c(0, 1),
                          labels = c("repaid", "default"))) %>%
  select(-default, -keep, -rand)
# data overview
glimpse(loans)
## Observations: 39,732
## Variables: 14
## $ loan_amount <fct> LOW, LOW, LOW, MEDIUM, LOW, LOW, MEDIUM, LO...
## $ emp_length <fct> 10+ years, < 2 years, 10+ years, 10+ years,...
## $ home_ownership <fct> RENT, RENT, RENT, RENT, RENT, RENT, RENT, R...
## $ income <fct> LOW, LOW, LOW, MEDIUM, HIGH, LOW, MEDIUM, M...
## $ loan_purpose <fct> credit_card, car, small_business, other, ot...
## $ debt_to_income <fct> HIGH, LOW, AVERAGE, HIGH, AVERAGE, AVERAGE,...
## $ credit_score <fct> AVERAGE, AVERAGE, AVERAGE, AVERAGE, AVERAGE...
## $ recent_inquiry <fct> YES, YES, YES, YES, NO, YES, YES, YES, YES,...
## $ delinquent <fct> NEVER, NEVER, NEVER, MORE THAN 2 YEARS AGO,...
## $ credit_accounts <fct> FEW, FEW, FEW, AVERAGE, MANY, AVERAGE, AVER...
## $ bad_public_record <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ credit_utilization <fct> HIGH, LOW, HIGH, LOW, MEDIUM, MEDIUM, HIGH,...
## $ past_bankrupt <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ outcome <fct> repaid, default, repaid, repaid, repaid, re...
3. UpSampling & DownSampling
table(loans$outcome)
##
##  repaid default
##   34078    5654
In classification problems, as the table above shows, a disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issue. Sampling methods for this purpose include:
- down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are in the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.
- up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.
- hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
# Load the caret package
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
loans <- downSample(x = loans %>% dplyr::select(-outcome),
                    y = loans$outcome,
                    yname = "outcome")
table(loans$outcome)
##
## repaid default
## 5654 5654
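For comparison, up-sampling works the same way through caret's upSample(); here is a minimal sketch on a small synthetic data frame (the toy data below is illustrative only, not part of the loans dataset):

```r
library(caret)

# A toy imbalanced dataset: 8 "repaid" vs. 2 "default"
toy <- data.frame(x = 1:10,
                  y = factor(c(rep("repaid", 8), rep("default", 2))))

# upSample() resamples the minority class (with replacement)
# until both classes are the same size
up <- upSample(x = toy["x"], y = toy$y, yname = "y")
table(up$y)  # both classes now have 8 rows
```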
4. Building a simple decision tree
(1) Modeling Building
You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.
Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.
# Load the rpart package
library(rpart)
# modeling building
loan_model <- rpart(outcome ~ loan_amount + credit_score,
                    data = loans,
                    method = "class",
                    control = rpart.control(cp = 0))
# model overview
summary(loan_model)
## Call:
## rpart(formula = outcome ~ loan_amount + credit_score, data = loans,
## method = "class", control = rpart.control(cp = 0))
## n= 11308
##
## CP nsplit rel error xerror xstd
## 1 0.10771135 0 1.0000000 1.0231694 0.009401356
## 2 0.01317651 1 0.8922886 0.8922886 0.009349171
## 3 0.00000000 3 0.8659356 0.8659356 0.009318988
##
## Variable importance
## credit_score loan_amount
## 89 11
##
## Node number 1: 11308 observations, complexity param=0.1077114
## predicted class=repaid expected loss=0.5 P(node) =1
## class counts: 5654 5654
## probabilities: 0.500 0.500
## left son=2 (1811 obs) right son=3 (9497 obs)
## Primary splits:
## credit_score splits as RLR, improve=121.92300, (0 missing)
## loan_amount splits as RLL, improve= 29.13543, (0 missing)
##
## Node number 2: 1811 observations
## predicted class=repaid expected loss=0.3318609 P(node) =0.1601521
## class counts: 1210 601
## probabilities: 0.668 0.332
##
## Node number 3: 9497 observations, complexity param=0.01317651
## predicted class=default expected loss=0.4679372 P(node) =0.8398479
## class counts: 4444 5053
## probabilities: 0.468 0.532
## left son=6 (7867 obs) right son=7 (1630 obs)
## Primary splits:
## credit_score splits as L-R, improve=42.17392, (0 missing)
## loan_amount splits as RLL, improve=19.24674, (0 missing)
##
## Node number 6: 7867 observations, complexity param=0.01317651
## predicted class=default expected loss=0.489386 P(node) =0.6957022
## class counts: 3850 4017
## probabilities: 0.489 0.511
## left son=12 (5397 obs) right son=13 (2470 obs)
## Primary splits:
## loan_amount splits as RLL, improve=20.49803, (0 missing)
##
## Node number 7: 1630 observations
## predicted class=default expected loss=0.3644172 P(node) =0.1441457
## class counts: 594 1036
## probabilities: 0.364 0.636
##
## Node number 12: 5397 observations
## predicted class=repaid expected loss=0.486196 P(node) =0.4772727
## class counts: 2773 2624
## probabilities: 0.514 0.486
##
## Node number 13: 2470 observations
## predicted class=default expected loss=0.4360324 P(node) =0.2184294
## class counts: 1077 1393
## probabilities: 0.436 0.564
(2) Predict()
good_credit <- data.frame(
loan_amount = "LOW",
emp_length = "10+ years",
home_ownership = "MORTGAGE",
income = "HIGH",
loan_purpose = "major_purchase",
debt_to_income = "AVERAGE",
credit_score = "HIGH",
recent_inquiry = "NO",
delinquent = "NEVER",
credit_accounts = "MANY",
bad_public_record = "NO",
credit_utilization = "LOW",
past_bankrupt = "NO",
outcome = "repaid"
)
bad_credit <- data.frame(
loan_amount = "LOW",
emp_length = "6 - 9 years",
home_ownership = "RENT",
income = "MEDIUM",
loan_purpose = "car",
debt_to_income = "LOW",
credit_score = "LOW",
recent_inquiry = "YES",
delinquent = "NEVER",
credit_accounts = "FEW",
bad_public_record = "NO",
credit_utilization = "HIGH",
past_bankrupt = "NO",
outcome = "repaid"
)
# Predict the outcome for the good-credit applicant
predict(loan_model, good_credit, type = "class")
## 1
## repaid
## Levels: repaid default
# Predict the outcome for the bad-credit applicant
predict(loan_model, bad_credit, type = "class")
## 1
## default
## Levels: repaid default
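Beyond hard class labels, predict() on an rpart model can also return the underlying class probabilities with type = "prob"; this sketch continues from the loan_model and applicant profiles built above:

```r
# Probability of each outcome rather than a hard class label;
# assumes loan_model, good_credit, and bad_credit exist as above
predict(loan_model, good_credit, type = "prob")
predict(loan_model, bad_credit, type = "prob")
```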
5. Visualizing classification trees
Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.
The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.
# Load the rpart.plot package
library(rpart.plot)
# Plot the loan_model with default settings
rpart.plot(loan_model)
# Plot the loan_model with customized settings
rpart.plot(loan_model,
           type = 3,
           box.palette = c("red", "green"),
           fallen.leaves = TRUE)
Based on this tree structure, which of the following applicants would be predicted to repay the loan?
Someone with a low requested loan amount and a high credit score. Using the tree structure, you can clearly see how the tree makes its decisions.
6. Creating Random Test Datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.
You can use 75% of the observations for training and the remaining 25% for testing the model.
# 75% of the 11,308 rows in the down-sampled data
nrow(loans) * 0.75
## [1] 8481
# Create a random sample of row IDs
sample_rows <- sample(11308, 8481)
# Create the training dataset
loans_train <- loans[sample_rows, ]
# Create the test dataset
loans_test <- loans[-sample_rows, ]
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.
Use the resulting vector of row IDs to subset the loans into training and testing datasets.
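Note that sample() draws a different random subset on each run. If you want a reproducible split, one option is to fix the RNG seed before sampling; a sketch (the seed value 123 is arbitrary, and loans is the down-sampled data from above):

```r
set.seed(123)  # arbitrary seed, fixed for reproducibility

# Same 75/25 split as above, but derived from nrow(loans)
sample_rows <- sample(nrow(loans), 0.75 * nrow(loans))
loans_train <- loans[sample_rows, ]
loans_test  <- loans[-sample_rows, ]
```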
7. Building and evaluating a larger tree
Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.
Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.
Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
loan_model <- rpart(outcome ~ .,
                    data = loans_train,
                    method = "class",
                    control = rpart.control(cp = 0))
# Make predictions on the test dataset
loans_test$pred <- predict(loan_model,
loans_test,
type = "class")
# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##
##           repaid default
##   repaid     781     626
##   default    636     784
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5535904
The accuracy on the test dataset is disappointingly low; adding more predictors did not, by itself, improve the model's performance.
8. Conducting a fair performance evaluation
Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.
Which of these is NOT true about the evaluation of decision tree performance?
- Decision trees sometimes overfit the training data.
- The model’s accuracy is unaffected by the rarity of the outcome.
- Performance on the training dataset can overestimate performance on future data.
- Creating a test dataset simulates the model’s performance on unseen data.
The answer is (2): accuracy is very much affected by the rarity of the outcome, and rare events cause problems for many machine learning approaches.
9. Preventing overgrown trees
The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.
Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
loan_model <- rpart(outcome ~ .,
                    data = loans_train,
                    method = "class",
                    control = rpart.control(cp = 0,
                                            maxdepth = 6))  # or: minsplit = 500
# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5723382
Compared to the previous model, the new model achieves better accuracy. But there is still another technique for building a better model.
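The minsplit option mentioned in the code comment is an alternative stopping rule: it forbids splitting any node containing fewer than that many observations. A sketch of the same pre-pruning step with minsplit instead of maxdepth (the resulting accuracy will vary with the random train/test split):

```r
library(rpart)

# Pre-prune with a minimum node size instead of a maximum depth;
# assumes loans_train and loans_test exist as above
loan_model2 <- rpart(outcome ~ ., data = loans_train, method = "class",
                     control = rpart.control(cp = 0, minsplit = 500))
loans_test$pred2 <- predict(loan_model2, loans_test, type = "class")
mean(loans_test$pred2 == loans_test$outcome)
```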
10. Creating a nicely pruned tree
Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.
By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.
In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.
# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Examine the complexity plot
plotcp(loan_model)
# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)
# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5928546
As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.
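Rather than reading a cp value off the plot by eye, a common heuristic is to take the cp with the lowest cross-validated error from the model's cptable; a sketch, continuing from the overly complex loan_model above:

```r
# cptable columns include CP and the cross-validated error (xerror);
# pick the CP that minimizes xerror, then prune with it
cp_table <- loan_model$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
loan_model_pruned <- prune(loan_model, cp = best_cp)
```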
11. Why do trees benefit from pruning?
Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.
Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.
Why, then, are pre-pruning and post-pruning almost always used?
- Simpler trees are easier to interpret
- Simpler trees using early stopping are faster to train
- Simpler trees may perform better on the testing data
12. Building a random forest model
In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.
Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.
Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.
# Load the randomForest package
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)
# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5854262
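To see which applicant features the forest relies on most, randomForest provides variable-importance measures (mean decrease in Gini by default); this continues from the loan_model forest fit above:

```r
# Importance scores for each predictor, plus a dot-chart of them;
# assumes loan_model is the randomForest fit from above
importance(loan_model)
varImpPlot(loan_model)
```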