Supervised Learning, Logistic Regression
Evan Jung January 18, 2019
1. Data import
Suppose we receive the dataset below from an NGO.
url <- "https://assets.datacamp.com/production/repositories/718/datasets/9055dac929e4515286728a2a5dae9f25f0e4eff6/donors.csv"
library(readr)
library(dplyr)
donors <- read_csv(url) %>%
  mutate_if(is.character, as.factor)
glimpse(donors)
## Observations: 93,462
## Variables: 13
## $ donated <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ veteran <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bad_address <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ age <dbl> 60, 46, NA, 70, 78, NA, 38, NA, NA, 65, NA, ...
## $ has_children <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ wealth_rating <dbl> 0, 3, 1, 2, 1, 0, 2, 3, 1, 0, 1, 2, 1, 0, 2,...
## $ interest_veterans <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ interest_religion <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pet_owner <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ catalog_shopper <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ recency <fct> CURRENT, CURRENT, CURRENT, CURRENT, CURRENT,...
## $ frequency <fct> FREQUENT, FREQUENT, FREQUENT, FREQUENT, FREQ...
## $ money <fct> MEDIUM, HIGH, MEDIUM, MEDIUM, MEDIUM, MEDIUM...
Here the target variable is donated. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise.
2. Building a model
When building a model, it is usually not a good idea to throw in every available variable. It is better to start with a hypothesis about which independent variables will be predictive of the dependent variable. In this case, the bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.
# Build the donation model
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans,
data = donors, family = "binomial")
# Summarize the model results
summary(donation_model)
##
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans,
## family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3480 -0.3192 -0.3192 -0.3192 2.5678
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.95139 0.01652 -178.664 <2e-16 ***
## bad_address -0.30780 0.14348 -2.145 0.0319 *
## interest_religion 0.06724 0.05069 1.327 0.1847
## interest_veterans 0.11009 0.04676 2.354 0.0186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37316 on 93458 degrees of freedom
## AIC: 37324
##
## Number of Fisher Scoring iterations: 5
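Since the raw coefficients are on the log-odds scale, a quick way to read them more intuitively (a sketch, not part of the original post) is to exponentiate them into odds ratios:
# Convert log-odds coefficients to odds ratios; values above 1 raise the odds of donating
exp(coef(donation_model))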
3. Prediction
As with other machine learning methods in R, we apply predict(). By default, predict() outputs predictions on the log-odds scale unless type = "response" is specified, which converts the log odds to probabilities.
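For instance (a small sketch, not in the original post), the two scales can be compared side by side using the model fitted above:
# Default: predictions on the log-odds scale
head(predict(donation_model))
# Probabilities: equivalent to 1 / (1 + exp(-log_odds))
head(predict(donation_model, type = "response"))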
Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive.
For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.
# estimate the donation probabilities
donors$donation_prob <- predict(donation_model, type = "response")
# find the donation probability of the average prospect
mean(donors$donated)
## [1] 0.05040551
The actual probability that an average person donates is about 0.05.
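Note that with only three weak binary predictors, the fitted probabilities stay low; the aggressive 99% cutoff mentioned above would solicit no one. A quick check (a sketch, not in the original post):
# The largest fitted probability is far below 0.99, so that cutoff selects nobody
max(donors$donation_prob)
sum(donors$donation_prob > 0.99)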
# Predict a donation if probability of donation is greater than average
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)
# Calculate the model's accuracy
mean(donors$donated == donors$donation_pred)
## [1] 0.794815
4. Limitation of Accuracy
Although the model's accuracy is almost 80%, the result is misleading due to the rarity of the outcome being predicted. What would the accuracy have been if a model had simply predicted "no donation" for every person? It would be about 95%, as the table below shows.
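The chunk producing the table below is hidden in the original post; it was presumably something like:
# Proportion of non-donors (0) and donors (1), rounded to two digits
round(prop.table(table(donors$donated)), 2)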
##
## 0 1
## 0.95 0.05
5. Calculating ROC Curves and AUC
We know that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.
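The chunk that produced the output below is hidden in the original post; it was presumably along these lines, reusing the donation_prob computed earlier:
# Load pROC, plot the ROC curve of the simple model, and compute its AUC
library(pROC)
ROC <- roc(donors$donated, donors$donation_prob)
plot(ROC, col = "blue")
auc(ROC)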
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
## Area under the curve: 0.5102
How should we interpret the AUC and the plot? With an AUC of about 0.51, the model is barely doing better than the baseline of a model that does nothing but make predictions at random (AUC = 0.5).
6. Dummy Coding Categorical Data
Sometimes a dataset contains numeric values that actually represent a categorical feature. Here, wealth_rating is one such column, so we recode it as a factor.
# Convert the wealth rating to a factor
donors$wealth_rating <- factor(donors$wealth_rating,
                               levels = c(0, 1, 2, 3),
                               labels = c("Unknown", "Low", "Medium", "High"))
# Use relevel() to change reference category
donors$wealth_rating <- relevel(donors$wealth_rating, ref = "Medium")
# See how our factor coding impacts the model
summary(glm(donated ~ wealth_rating, data = donors, family = "binomial"))
##
## Call:
## glm(formula = donated ~ wealth_rating, family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3320 -0.3243 -0.3175 -0.3175 2.4582
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.91894 0.03614 -80.772 <2e-16 ***
## wealth_ratingUnknown -0.04373 0.04243 -1.031 0.303
## wealth_ratingLow -0.05245 0.05332 -0.984 0.325
## wealth_ratingHigh 0.04804 0.04768 1.008 0.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37323 on 93458 degrees of freedom
## AIC: 37331
##
## Number of Fisher Scoring iterations: 5
7. Handling Missing Data
Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.
One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.
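The summary below comes from a hidden chunk; it was presumably just:
# Summarize age to see how many values are missing
summary(donors$age)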
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 48.00 62.00 61.65 75.00 98.00 22546
There are 22,546 missing values of age, so we need to handle them before modeling.
# Impute missing age values with mean(age)
donors$imputed_age <- ifelse(is.na(donors$age),
                             round(mean(donors$age, na.rm = TRUE), digits = 2),
                             donors$age)
# Create missing value indicator for age
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)
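As a quick sanity check (not in the original post), one can verify that the imputed column is complete and that the indicator counts the gaps:
# imputed_age should contain no NAs; missing_age should flag the 22,546 missing cases
sum(is.na(donors$imputed_age))
table(donors$missing_age)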
8. Building a more sophisticated model
One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M: Recency, Frequency, Money. Donors that have given both recently and frequently may be especially likely to give again, so the model below includes an interaction between recency and frequency.
# Build a recency, frequency, and money (RFM) model
rfm_model <- glm(donated ~ recency * frequency + money, data = donors, family = "binomial")
# Summarize the RFM model to see how the parameters were coded
summary(rfm_model)
##
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial",
## data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3696 -0.3696 -0.2895 -0.2895 2.7924
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.01142 0.04279 -70.375 <2e-16 ***
## recencyLAPSED -0.86677 0.41434 -2.092 0.0364 *
## frequencyINFREQUENT -0.50148 0.03107 -16.143 <2e-16 ***
## moneyMEDIUM 0.36186 0.04300 8.415 <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT 1.01787 0.51713 1.968 0.0490 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 36938 on 93457 degrees of freedom
## AIC: 36948
##
## Number of Fisher Scoring iterations: 6
This model improves on the previous one (AIC 36948 versus 37324). Based on the significant interaction term, the combined impact of recency and frequency is greater than the sum of their separate effects.
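To see this super-additive effect directly (a sketch, not in the original post), compare the sum of the separate effects with the combined effect for a lapsed, infrequent donor:
# Additive effect vs. effect including the interaction term (less negative = higher odds)
coefs <- coef(rfm_model)
coefs["recencyLAPSED"] + coefs["frequencyINFREQUENT"]
coefs["recencyLAPSED"] + coefs["frequencyINFREQUENT"] + coefs["recencyLAPSED:frequencyINFREQUENT"]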
# Compute predicted probabilities for the RFM model
rfm_prob <- predict(rfm_model, type = "response")
# Plot the ROC curve for the new model
library(pROC)
ROC <- roc(donors$donated, rfm_prob)
plot(ROC, col = "red")
# Compute the AUC
auc(ROC)
## Area under the curve: 0.5785
Based on the ROC curve, you’ve confirmed that past giving patterns are certainly predictive of future giving.
9. The dangers of stepwise regression
In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. First, it is not guaranteed to find the best possible model. Second, the stepwise procedure violates some statistical assumptions. Third, it can result in a model that makes little sense in the real world.
10. Building a stepwise regression model
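The chunks defining the null and full models are hidden in the original post; they were presumably:
# Specify a null model containing only an intercept
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
summary(null_model)
# Specify the full model using all of the potential predictors
full_model <- glm(donated ~ ., data = donors, family = "binomial")
summary(full_model)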
##
## Call:
## glm(formula = donated ~ 1, family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3216 -0.3216 -0.3216 -0.3216 2.4444
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.93593 0.01495 -196.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 37330 on 93461 degrees of freedom
## Residual deviance: 37330 on 93461 degrees of freedom
## AIC: 37332
##
## Number of Fisher Scoring iterations: 5
##
## Call:
## glm(formula = donated ~ ., family = "binomial", data = donors)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6111 -0.3642 -0.3080 -0.2866 2.7436
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.742e+01 1.066e+01 1.634 0.10222
## veteran -2.071e-02 5.151e-01 -0.040 0.96793
## bad_address -5.442e+00 2.802e+00 -1.942 0.05208 .
## age 1.094e-03 1.093e-03 1.001 0.31702
## has_children -1.561e-01 5.156e-02 -3.028 0.00247 **
## wealth_ratingUnknown -1.196e-02 4.819e-02 -0.248 0.80404
## wealth_ratingLow -4.901e-02 5.773e-02 -0.849 0.39594
## wealth_ratingHigh 1.270e-01 5.079e-02 2.500 0.01243 *
## interest_veterans 2.429e+00 1.214e+00 2.001 0.04535 *
## interest_religion 1.491e+00 7.507e-01 1.986 0.04704 *
## pet_owner 5.060e-02 4.895e-02 1.034 0.30128
## catalog_shopper 6.686e-02 5.980e-02 1.118 0.26353
## recencyLAPSED -1.678e-01 2.565e-01 -0.654 0.51297
## frequencyINFREQUENT -4.645e-01 3.523e-02 -13.185 < 2e-16 ***
## moneyMEDIUM 3.734e-01 4.893e-02 7.631 2.34e-14 ***
## donation_prob -4.131e+02 2.146e+02 -1.926 0.05416 .
## donation_pred -1.185e-01 1.222e-01 -0.970 0.33189
## imputed_age NA NA NA NA
## missing_age NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28714 on 70915 degrees of freedom
## Residual deviance: 28405 on 70899 degrees of freedom
## (22546 observations deleted due to missingness)
## AIC: 28439
##
## Number of Fisher Scoring iterations: 6
# Use a forward stepwise algorithm to build a parsimonious model
step_model <- step(null_model, scope = list(lower = null_model, upper = full_model), direction = "forward")
## Start: AIC=37332.13
## donated ~ 1
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + frequency 1 28502 37122
## + money 1 28621 37241
## + has_children 1 28705 37326
## + age 1 28707 37328
## + imputed_age 1 28707 37328
## + wealth_rating 3 28704 37328
## + interest_veterans 1 28709 37330
## + donation_prob 1 28710 37330
## + donation_pred 1 28710 37330
## + catalog_shopper 1 28710 37330
## + pet_owner 1 28711 37331
## <none> 28714 37332
## + interest_religion 1 28712 37333
## + recency 1 28713 37333
## + bad_address 1 28714 37334
## + veteran 1 28714 37334
##
## Step: AIC=37024.77
## donated ~ frequency
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + money 1 28441 36966
## + wealth_rating 3 28490 37019
## + has_children 1 28494 37019
## + donation_prob 1 28498 37023
## + interest_veterans 1 28498 37023
## + catalog_shopper 1 28499 37024
## + donation_pred 1 28499 37024
## + age 1 28499 37024
## + imputed_age 1 28499 37024
## + pet_owner 1 28499 37024
## <none> 28502 37025
## + interest_religion 1 28501 37026
## + recency 1 28501 37026
## + bad_address 1 28502 37026
## + veteran 1 28502 37027
##
## Step: AIC=36949.71
## donated ~ frequency + money
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + wealth_rating 3 28427 36942
## + has_children 1 28432 36943
## + interest_veterans 1 28438 36948
## + donation_prob 1 28438 36949
## + catalog_shopper 1 28438 36949
## + donation_pred 1 28438 36949
## + age 1 28438 36949
## + imputed_age 1 28438 36949
## + pet_owner 1 28439 36949
## <none> 28441 36950
## + interest_religion 1 28440 36951
## + recency 1 28440 36951
## + bad_address 1 28441 36951
## + veteran 1 28441 36952
##
## Step: AIC=36945.48
## donated ~ frequency + money + wealth_rating
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + has_children 1 28416 36937
## + age 1 28424 36944
## + imputed_age 1 28424 36944
## + interest_veterans 1 28424 36945
## + donation_prob 1 28424 36945
## + catalog_shopper 1 28424 36945
## + donation_pred 1 28425 36945
## <none> 28427 36945
## + pet_owner 1 28425 36946
## + interest_religion 1 28426 36947
## + recency 1 28426 36947
## + bad_address 1 28427 36947
## + veteran 1 28427 36947
##
## Step: AIC=36938.4
## donated ~ frequency + money + wealth_rating + has_children
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## + pet_owner 1 28413 36937
## + donation_prob 1 28413 36937
## + catalog_shopper 1 28413 36937
## + interest_veterans 1 28413 36937
## + donation_pred 1 28414 36938
## <none> 28416 36938
## + interest_religion 1 28415 36939
## + age 1 28416 36940
## + imputed_age 1 28416 36940
## + recency 1 28416 36940
## + bad_address 1 28416 36940
## + veteran 1 28416 36940
##
## Step: AIC=36932.25
## donated ~ frequency + money + wealth_rating + has_children +
## pet_owner
## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit
## Df Deviance AIC
## <none> 28413 36932
## + donation_prob 1 28411 36932
## + interest_veterans 1 28411 36932
## + catalog_shopper 1 28412 36933
## + donation_pred 1 28412 36933
## + age 1 28412 36933
## + imputed_age 1 28412 36933
## + recency 1 28413 36934
## + interest_religion 1 28413 36934
## + bad_address 1 28413 36934
## + veteran 1 28413 36934
# Estimate the stepwise donation probability
step_prob <- predict(step_model, type = "response")
# Plot the ROC of the stepwise model
library(pROC)
ROC <- roc(donors$donated, step_prob)
plot(ROC, col = "red")
# Compute the AUC
auc(ROC)
## Area under the curve: 0.5849
Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!
All the contents are from DataCamp.