Supervised Learning: Classification Trees
Evan Jung January 28, 2019
1. Intro
Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.
2. Lending Club Dataset
Lending Club is a US-based peer-to-peer lending company. The loans dataset contains 39,732 randomly selected applicants who applied for and later received loans from Lending Club.
library(tidyverse)
# data url
url <- "https://assets.datacamp.com/production/repositories/718/datasets/7805fceacfb205470c0e8800d4ffc37c6944b30c/loans.csv"
# data import & transformation
loans <- read_csv(url) %>%
  mutate_if(is_character, as.factor) %>%
  mutate(outcome = factor(default,
                          levels = c(0, 1),
                          labels = c("repaid", "default"))) %>%
  select(-default, -keep, -rand)
# data overview
glimpse(loans)
## Observations: 39,732
## Variables: 14
## $ loan_amount <fct> LOW, LOW, LOW, MEDIUM, LOW, LOW, MEDIUM, LO...
## $ emp_length <fct> 10+ years, < 2 years, 10+ years, 10+ years,...
## $ home_ownership <fct> RENT, RENT, RENT, RENT, RENT, RENT, RENT, R...
## $ income <fct> LOW, LOW, LOW, MEDIUM, HIGH, LOW, MEDIUM, M...
## $ loan_purpose <fct> credit_card, car, small_business, other, ot...
## $ debt_to_income <fct> HIGH, LOW, AVERAGE, HIGH, AVERAGE, AVERAGE,...
## $ credit_score <fct> AVERAGE, AVERAGE, AVERAGE, AVERAGE, AVERAGE...
## $ recent_inquiry <fct> YES, YES, YES, YES, NO, YES, YES, YES, YES,...
## $ delinquent <fct> NEVER, NEVER, NEVER, MORE THAN 2 YEARS AGO,...
## $ credit_accounts <fct> FEW, FEW, FEW, AVERAGE, MANY, AVERAGE, AVER...
## $ bad_public_record <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ credit_utilization <fct> HIGH, LOW, HIGH, LOW, MEDIUM, MEDIUM, HIGH,...
## $ past_bankrupt <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ outcome <fct> repaid, default, repaid, repaid, repaid, re...
3. UpSampling & DownSampling
table(loans$outcome)
##
##  repaid default
##   34078    5654
In classification problems, as the table above shows, a disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issue. Sampling methods for this purpose include:
- down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are in the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.
- up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.
- hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
# Load the caret package
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
loans <- downSample(x = loans %>% dplyr::select(-outcome),
                    y = loans$outcome,
                    yname = "outcome")
table(loans$outcome)
##
## repaid default
## 5654 5654
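For comparison, up-sampling works the same way through caret's upSample(); here is a minimal sketch on a small synthetic data frame (the toy data below is illustrative only, not part of the loans dataset):

```r
library(caret)

# A toy imbalanced dataset: 8 "repaid" vs. 2 "default"
toy <- data.frame(x = 1:10,
                  y = factor(c(rep("repaid", 8), rep("default", 2))))

# upSample() resamples the minority class (with replacement)
# until both classes are the same size
up <- upSample(x = toy["x"], y = toy$y, yname = "y")
table(up$y)  # both classes now have 8 rows
```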
4. Building a simple decision tree
(1) Modeling Building
You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.
Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.
# Load the rpart package
library(rpart)
# modeling building
loan_model <- rpart(outcome ~ loan_amount + credit_score,
                    data = loans,
                    method = "class",
                    control = rpart.control(cp = 0))
# model overview
summary(loan_model)
## Call:
## rpart(formula = outcome ~ loan_amount + credit_score, data = loans,
## method = "class", control = rpart.control(cp = 0))
## n= 11308
##
## CP nsplit rel error xerror xstd
## 1 0.10771135 0 1.0000000 1.0231694 0.009401356
## 2 0.01317651 1 0.8922886 0.8922886 0.009349171
## 3 0.00000000 3 0.8659356 0.8659356 0.009318988
##
## Variable importance
## credit_score loan_amount
## 89 11
##
## Node number 1: 11308 observations, complexity param=0.1077114
## predicted class=repaid expected loss=0.5 P(node) =1
## class counts: 5654 5654
## probabilities: 0.500 0.500
## left son=2 (1811 obs) right son=3 (9497 obs)
## Primary splits:
## credit_score splits as RLR, improve=121.92300, (0 missing)
## loan_amount splits as RLL, improve= 29.13543, (0 missing)
##
## Node number 2: 1811 observations
## predicted class=repaid expected loss=0.3318609 P(node) =0.1601521
## class counts: 1210 601
## probabilities: 0.668 0.332
##
## Node number 3: 9497 observations, complexity param=0.01317651
## predicted class=default expected loss=0.4679372 P(node) =0.8398479
## class counts: 4444 5053
## probabilities: 0.468 0.532
## left son=6 (7867 obs) right son=7 (1630 obs)
## Primary splits:
## credit_score splits as L-R, improve=42.17392, (0 missing)
## loan_amount splits as RLL, improve=19.24674, (0 missing)
##
## Node number 6: 7867 observations, complexity param=0.01317651
## predicted class=default expected loss=0.489386 P(node) =0.6957022
## class counts: 3850 4017
## probabilities: 0.489 0.511
## left son=12 (5397 obs) right son=13 (2470 obs)
## Primary splits:
## loan_amount splits as RLL, improve=20.49803, (0 missing)
##
## Node number 7: 1630 observations
## predicted class=default expected loss=0.3644172 P(node) =0.1441457
## class counts: 594 1036
## probabilities: 0.364 0.636
##
## Node number 12: 5397 observations
## predicted class=repaid expected loss=0.486196 P(node) =0.4772727
## class counts: 2773 2624
## probabilities: 0.514 0.486
##
## Node number 13: 2470 observations
## predicted class=default expected loss=0.4360324 P(node) =0.2184294
## class counts: 1077 1393
## probabilities: 0.436 0.564
(2) Predict()
good_credit <- data.frame(
loan_amount = "LOW",
emp_length = "10+ years",
home_ownership = "MORTGAGE",
income = "HIGH",
loan_purpose = "major_purchase",
debt_to_income = "AVERAGE",
credit_score = "HIGH",
recent_inquiry = "NO",
delinquent = "NEVER",
credit_accounts = "MANY",
bad_public_record = "NO",
credit_utilization = "LOW",
past_bankrupt = "NO",
outcome = "repaid"
)
bad_credit <- data.frame(
loan_amount = "LOW",
emp_length = "6 - 9 years",
home_ownership = "RENT",
income = "MEDIUM",
loan_purpose = "car",
debt_to_income = "LOW",
credit_score = "LOW",
recent_inquiry = "YES",
delinquent = "NEVER",
credit_accounts = "FEW",
bad_public_record = "NO",
credit_utilization = "HIGH",
past_bankrupt = "NO",
outcome = "repaid"
)
# Predict the outcome for the good-credit applicant
predict(loan_model, good_credit, type = "class")
## 1
## repaid
## Levels: repaid default
# Predict the outcome for the bad-credit applicant
predict(loan_model, bad_credit, type = "class")
## 1
## default
## Levels: repaid default
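Beyond hard class labels, predict() on an rpart model can also return the underlying class probabilities with type = "prob"; this sketch continues from the loan_model and applicant profiles built above:

```r
# Probability of each outcome rather than a hard class label;
# assumes loan_model, good_credit, and bad_credit exist as above
predict(loan_model, good_credit, type = "prob")
predict(loan_model, bad_credit, type = "prob")
```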
5. Visualizing classification trees
Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.
The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.
# Load the rpart.plot package
library(rpart.plot)
# Plot the loan_model with default settings
rpart.plot(loan_model)
# Plot the loan_model with customized settings
rpart.plot(loan_model,
           type = 3,
           box.palette = c("red", "green"),
           fallen.leaves = TRUE)
Based on this tree structure, which of the following applicants would be predicted to repay the loan?
Someone with a low requested loan amount and a high credit score. Using the tree structure, you can clearly see how the tree makes its decisions.
6. Creating Random Test Datasets
Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.
You can use 75% of the observations for training and the remaining 25% for testing the model.
# 75% of the 11,308 rows in the down-sampled data
nrow(loans) * 0.75
## [1] 8481
# Create a random sample of row IDs
sample_rows <- sample(11308, 8481)
# Create the training dataset
loans_train <- loans[sample_rows, ]
# Create the test dataset
loans_test <- loans[-sample_rows, ]
The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.
Use the resulting vector of row IDs to subset the loans into training and testing datasets.
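Note that sample() draws a different random subset on each run. If you want a reproducible split, one option is to fix the RNG seed before sampling; a sketch (the seed value 123 is arbitrary, and loans is the down-sampled data from above):

```r
set.seed(123)  # arbitrary seed, fixed for reproducibility

# Same 75/25 split as above, but derived from nrow(loans)
sample_rows <- sample(nrow(loans), 0.75 * nrow(loans))
loans_train <- loans[sample_rows, ]
loans_test  <- loans[-sample_rows, ]
```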
7. Building and evaluating a larger tree
Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.
Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.
Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
loan_model <- rpart(outcome ~ .,
                    data = loans_train,
                    method = "class",
                    control = rpart.control(cp = 0))
# Make predictions on the test dataset
loans_test$pred <- predict(loan_model,
loans_test,
type = "class")
# Examine the confusion matrix
table(loans_test$pred, loans_test$outcome)
##
##           repaid default
##   repaid     781     626
##   default    636     784
# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5535904
The accuracy on the test dataset is disappointingly low; adding more predictors did not, by itself, improve the model's performance.
8. Conducting a fair performance evaluation
Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.
Which of these is NOT true about the evaluation of decision tree performance?
- Decision trees sometimes overfit the training data.
- The model’s accuracy is unaffected by the rarity of the outcome.
- Performance on the training dataset can overestimate performance on future data.
- Creating a test dataset simulates the model’s performance on unseen data.
The answer is (2): accuracy is very much affected by the rarity of the outcome, and rare events cause problems for many machine learning approaches.
9. Preventing overgrown trees
The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.
Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
loan_model <- rpart(outcome ~ .,
                    data = loans_train,
                    method = "class",
                    control = rpart.control(cp = 0,
                                            maxdepth = 6))  # or: minsplit = 500
# Make a class prediction on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")
# Compute the accuracy of the simpler tree
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5723382
Compared to the previous model, the new model achieves better accuracy. But there is still another technique for building a better model.
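The minsplit option mentioned in the code comment is an alternative stopping rule: it forbids splitting any node containing fewer than that many observations. A sketch of the same pre-pruning step with minsplit instead of maxdepth (the resulting accuracy will vary with the random train/test split):

```r
library(rpart)

# Pre-prune with a minimum node size instead of a maximum depth;
# assumes loans_train and loans_test exist as above
loan_model2 <- rpart(outcome ~ ., data = loans_train, method = "class",
                     control = rpart.control(cp = 0, minsplit = 500))
loans_test$pred2 <- predict(loan_model2, loans_test, type = "class")
mean(loans_test$pred2 == loans_test$outcome)
```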
10. Creating a nicely pruned tree
Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.
By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.
In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.
# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class", control = rpart.control(cp = 0))
# Examine the complexity plot
plotcp(loan_model)
# Prune the tree
loan_model_pruned <- prune(loan_model, cp = 0.0014)
# Compute the accuracy of the pruned tree
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5928546
As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.
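Rather than reading a cp value off the plot by eye, a common heuristic is to take the cp with the lowest cross-validated error from the model's cptable; a sketch, continuing from the overly complex loan_model above:

```r
# cptable columns include CP and the cross-validated error (xerror);
# pick the CP that minimizes xerror, then prune with it
cp_table <- loan_model$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
loan_model_pruned <- prune(loan_model, cp = best_cp)
```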
11. Why do trees benefit from pruning?
Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.
Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.
Why, then, are pre-pruning and post-pruning almost always used?
- Simpler trees are easier to interpret
- Simpler trees using early stopping are faster to train
- Simpler trees may perform better on the testing data
12. Building a random forest model
In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.
Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.
Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.
# Load the randomForest package
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Build a random forest model
loan_model <- randomForest(outcome ~ ., data = loans_train)
# Compute the accuracy of the random forest
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)
## [1] 0.5854262
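To see which applicant features the forest relies on most, randomForest provides variable-importance measures (mean decrease in Gini by default); this continues from the loan_model forest fit above:

```r
# Importance scores for each predictor, plus a dot-chart of them;
# assumes loan_model is the randomForest fit from above
importance(loan_model)
varImpPlot(loan_model)
```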