Adjusting figure options, optimizing them for mobile devices.


You might have already noticed it: The dot plot you produced in the last chapter still needs some tweaks. There doesn’t seem to be enough space between the arrows, and the last label (Netherlands) doesn’t even show. Also, you want the image to fit the aspect ratio of a mobile device better, so you’re going to change this with another set of chunk options.
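In an R Markdown report, these tweaks are made through the chunk header. A minimal sketch of the kind of chunk options this refers to (the specific values are assumptions chosen to give a portrait-oriented figure, not the course solution):

```{r dot-plot, fig.height = 6, fig.width = 4.5, fig.align = "center"}
# chunk body: the ggplot() call that produces the dot plot goes here
```

A taller-than-wide aspect ratio leaves more vertical space between the arrows and keeps all country labels, including Netherlands, inside the plotting area.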

 

Summary


The International Labour Organization (ILO) has many data sets on working conditions. For example, one can look at how weekly working hours have been decreasing in many countries of the world, while monetary compensation has risen. In this report, the reduction in weekly working hours in European countries is analysed, and a comparison between 1996 and 2006 is made. All analysed countries have seen a decrease in weekly working hours since 1996 – some more than others.

Preparations


library(dplyr)
library(ggplot2)
library(forcats)

Analysis


Data

The data used here can be found in the statistics database of the ILO. For the purpose of this course, it has been slightly preprocessed.

load(url("http://s3.amazonaws.com/assets.datacamp.com/production/course_5807/datasets/ilo_data.RData"))

The loaded data contains 380 rows.

# Some summary statistics
ilo_data %>%
  group_by(year) %>%
  summarize(mean_hourly_compensation = mean(hourly_compensation),
            mean_working_hours = mean(working_hours))
## # A tibble: 27 x 3
##    year  mean_hourly_compensation mean_working_hours
##    <fct>                    <dbl>              <dbl>
##  1 1980                      9.27               34.0
##  2 1981                      8.69               33.6
##  3 1982                      8.36               33.5
##  4 1983                      7.81               33.9
##  5 1984                      7.54               33.7
##  6 1985                      7.79               33.7
##  7 1986                      9.70               34.0
##  8 1987                     12.1                33.6
##  9 1988                     13.2                33.7
## 10 1989                     13.1                33.5
## # … with 17 more rows

As can be seen from the above table, the average weekly working hours in European countries have been decreasing since 1980.

Preprocessing

The data is now filtered so it only contains the years 1996 and 2006 – a good time range for comparison.

ilo_data <- ilo_data %>%
  filter(year == "1996" | year == "2006")

# Reorder country factor levels
ilo_data <- ilo_data %>%
  # Arrange data frame first, so last is always 2006
  arrange(year) %>%
  # Use the fct_reorder function inside mutate to reorder countries by working hours in 2006
  mutate(country = fct_reorder(country,
                               working_hours,
                               last))

Results

In the following, a plot that shows the reduction of weekly working hours from 1996 to 2006 in each country is produced.

First, a custom theme is defined. Then, the plot is produced.
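The definition of the custom theme is not shown in this copy of the report. A minimal sketch of what a theme_ilo() function like the one applied below might contain (the specific settings are assumptions, not the original definition):

theme_ilo <- function() {
  # Start from a minimal theme and adjust fonts, colors, and margins
  theme_minimal() +
    theme(
      text = element_text(family = "AppleGothic", color = "gray25"),
      plot.subtitle = element_text(size = 12),
      plot.caption = element_text(color = "gray30"),
      plot.background = element_rect(fill = "gray95"),
      plot.margin = margin(5, 10, 5, 10, "mm")
    )
}

Because theme_ilo() returns a theme object, it can simply be added to the plot with +, as done below.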

# Compute temporary data set for optimal label placement
median_working_hours <- ilo_data %>%
  group_by(country) %>%
  summarize(median_working_hours_per_country = median(working_hours)) %>%
  ungroup()

# Have a look at the structure of this data set
str(median_working_hours)
## Classes 'tbl_df', 'tbl' and 'data.frame':    17 obs. of  2 variables:
##  $ country                         : Factor w/ 30 levels "Netherlands",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ median_working_hours_per_country: num  27 27.8 28.4 31 30.9 ...
# Plot
ggplot(ilo_data) +
  geom_path(aes(x = working_hours, y = country),
            arrow = arrow(length = unit(1.5, "mm"), type = "closed")) +
  # Add labels for values (both 1996 and 2006)
  geom_text(
        aes(x = working_hours,
            y = country,
            label = round(working_hours, 1),
            hjust = ifelse(year == "2006", 1.4, -0.4)
          ),
        # Change the appearance of the text
        size = 3,
        family = "AppleGothic",
        color = "gray25"
   ) +
  # Add labels for country
  geom_text(data = median_working_hours,
            aes(y = country,
                x = median_working_hours_per_country,
                label = country),
            vjust = 2,
            family = "AppleGothic",
            color = "gray25") +
  # Add titles
  labs(
    title = "People work less in 2006 compared to 1996",
    subtitle = "Working hours in European countries, development since 1996",
    caption = "Data source: ILO, 2017"
  ) +
  # Apply your theme 
  theme_ilo() +
  # Remove axes and grids
  theme(
    axis.ticks = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.grid = element_blank(),
    # Also, let's reduce the font size of the subtitle
    plot.subtitle = element_text(size = 9)
  ) +
  # Reset coordinate system
  coord_cartesian(xlim = c(25, 41))

An interesting correlation

The results of another analysis are shown here, even though they cannot be reproduced with the data at hand.

As you can see, there's also an interesting relationship: the more people work, the less hourly compensation they seem to receive, which seems rather unfair. This is quite possibly related to other proxy variables such as a country's overall economic stability and performance.

Chapter 4. Cleaning Data for Analysis


All contents are from DataCamp.

Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!

Chapter 1. Converting data types

In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.

The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.

In [1]:
import pandas as pd
In [2]:
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [5]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(1), float64(2), int64(1), object(3)
memory usage: 11.8+ KB
None
In [6]:
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None

Interestingly, by converting sex and smoker to categorical variables, the memory usage of the DataFrame went down from 13.4 KB to 10.3 KB. This may seem like a small difference here, but when you're dealing with large datasets, the reduction in memory usage can be very significant!

Chapter 2. Working with numeric data

If you expect the data type of a column to be numeric (int or float), but instead it is of type object, this typically means that there is a non-numeric value in the column, which is also a sign of bad data.

You can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or coerce the value into a missing value, NaN.

A modified version of the tips dataset has been pre-loaded into a DataFrame called tips. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the .info() method to explore this. You'll note that the total_bill and tip columns, which should be numeric, are instead of type object. Your job is to fix this.

In [7]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None
In [8]:
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None

Chapter 3. String parsing with regular expressions

In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.

When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field in a dataset that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx.

The regular expression module in Python is re. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it's better to compile the pattern first using re.compile(), and then use the compiled pattern to match values.

In [9]:
# Import the regular expression module
import re
In [10]:
# Compile the pattern: prog
prog = re.compile('\d{3}-\d{3}-\d{4}')
print(prog)
re.compile('\\d{3}-\\d{3}-\\d{4}')
In [11]:
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))
True
In [12]:
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
False

Chapter 4. Extracting numerical values from strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to re.findall(), and it will return a list of the matches.

In [13]:
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall('\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)
['10', '1']

Chapter 5. Pattern matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

In [16]:
# Write the first pattern
pattern1 = bool(re.match(pattern='\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
True
In [17]:
# Write the second pattern
pattern2 = bool(re.match(pattern='\$\d*.\d{2}', string='$123.45'))
print(pattern2)
True
In [18]:
# Write the third pattern
pattern3 = bool(re.match(pattern='\w*', string='Australia'))
print(pattern3)
True

Chapter 6. Custom functions to clean data

You'll now practice writing functions to clean data.

The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of 'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the 'sex' column.

In [19]:
import numpy as np
In [20]:
## Define recode_gender()
def recode_gender(gender): 
    
    # Return 0 if gender is "Female"
    if gender == 'Female':
        return 0
    
    # Return 1 if gender is 'Male'
    elif gender == 'Male': 
        return 1
    
    # Return np.nan
    else:
        return np.nan
In [21]:
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size recode
0       16.99  1.01  Female     No  Sun  Dinner     2      0
1       10.34  1.66    Male     No  Sun  Dinner     3      1
2       21.01  3.50    Male     No  Sun  Dinner     3      1
3       23.68  3.31    Male     No  Sun  Dinner     2      1
4       24.59  3.61  Female     No  Sun  Dinner     4      0

For simple recodes like this, you can also use the .replace() method, or convert the column into a categorical type.

Chapter 7. Lambda functions

You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an .apply() method:

def my_square(x):
    return x ** 2

df.apply(my_square)

The equivalent code using a lambda function is:

df.apply(lambda x: x ** 2)

The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to clean its 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: With the .replace() method, and with regular expressions. The regular expression module re has been pre-imported.

In [24]:
import re
In [34]:
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [35]:
tips["total_dollar"] = '$' + tips['total_bill'].astype(str)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59
In [36]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Print the head of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68   
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59   

  total_dollar_replace  
0                16.99  
1                10.34  
2                21.01  
3                23.68  
4                24.59  
In [38]:
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall('\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68   
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59   

  total_dollar_replace total_dollar_re  
0                16.99           16.99  
1                10.34           10.34  
2                21.01           21.01  
3                23.68           23.68  
4                24.59           24.59  

Notice how the 'total_dollar_re' and 'total_dollar_replace' columns are identical.

Chapter 8. Dropping duplicate data

Duplicate data causes a variety of problems. From a performance point of view, duplicates use up unnecessary amounts of memory and cause unneeded calculations to be performed when processing data. In addition, they can also bias any analysis results.

In the example below, a small DataFrame with one duplicated row is constructed; your job is to drop all duplicate rows.

In [31]:
import pandas as pd
import numpy as np
from pandas import DataFrame
In [32]:
raw_data = {'col0': [np.nan, 2, 3, 4, 4],
            'col1': [10, 20, np.nan, 40, 40],
            'col2': [100, 200, 300, 400, 400]}

data = DataFrame(raw_data)
print(data)
   col0  col1  col2
0   NaN  10.0   100
1   2.0  20.0   200
2   3.0   NaN   300
3   4.0  40.0   400
4   4.0  40.0   400
In [27]:
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
col0    4 non-null float64
col1    4 non-null float64
col2    5 non-null int64
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes
None
In [25]:
data_no_duplicates = data.drop_duplicates()
print(data_no_duplicates)
   col0  col1  col2
0   NaN  10.0   100
1   2.0  20.0   200
2   3.0   NaN   300
3   4.0  40.0   400
In [26]:
print(data_no_duplicates.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
col0    3 non-null float64
col1    3 non-null float64
col2    4 non-null int64
dtypes: float64(2), int64(1)
memory usage: 128.0 bytes
None

Chapter 9. Filling missing data

Here, you'll return to the airquality dataset from Chapter 2. It has been pre-loaded into the DataFrame airquality, and it has missing values for you to practice filling in. Explore airquality in the IPython Shell to check out which columns have missing values.

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values.

Also, understanding how much missing data you have, and thinking about where it comes from is crucial to making unbiased interpretations of data.

In [30]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None
In [29]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None

Chapter 10. Testing your data with asserts

Here, you'll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. The dataset has been pre-loaded into a DataFrame called ebola.

In the video, you saw Dan use the .all() method together with the .notnull() DataFrame method to check for missing values in a column. The .all() method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another .all() method so that you return only one True or False value. When using these within an assert statement, nothing will be returned if the assert statement is true: This is how you can confirm that the data you are checking are valid.

Note: You can use pd.notnull(df) as an alternative to df.notnull().

In [39]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [44]:
# delete all na
ebola_drop = ebola.dropna(axis = 0)
print(ebola_drop.head())
          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
19  11/18/2014  241        2047.0         7082.0             6190.0   

    Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
19           20.0            1.0                 4.0          1.0         6.0   

    Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
19         1214.0          2963.0              1267.0             8.0   

    Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
19             0.0                  1.0           0.0          6.0  
In [45]:
# Assert that there are no missing values
assert pd.notnull(ebola_drop).all().all()
In [47]:
# Assert that all values are >= 0
assert (ebola_drop >= 0).all().all()

Since the assert statements did not throw any errors, you can be sure that there are no missing values in the data and that all values are >= 0!


Classification Trees

Supervised Learning: Classification Trees

Evan Jung January 28, 2019

1. Intro

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.


2. Lending Club Dataset

Lending Club is a US-based peer-to-peer lending company. The loans dataset used here contains 39,732 people who applied for and later received loans from Lending Club.


## Observations: 39,732
## Variables: 14
## $ loan_amount        <fct> LOW, LOW, LOW, MEDIUM, LOW, LOW, MEDIUM, LO...
## $ emp_length         <fct> 10+ years, < 2 years, 10+ years, 10+ years,...
## $ home_ownership     <fct> RENT, RENT, RENT, RENT, RENT, RENT, RENT, R...
## $ income             <fct> LOW, LOW, LOW, MEDIUM, HIGH, LOW, MEDIUM, M...
## $ loan_purpose       <fct> credit_card, car, small_business, other, ot...
## $ debt_to_income     <fct> HIGH, LOW, AVERAGE, HIGH, AVERAGE, AVERAGE,...
## $ credit_score       <fct> AVERAGE, AVERAGE, AVERAGE, AVERAGE, AVERAGE...
## $ recent_inquiry     <fct> YES, YES, YES, YES, NO, YES, YES, YES, YES,...
## $ delinquent         <fct> NEVER, NEVER, NEVER, MORE THAN 2 YEARS AGO,...
## $ credit_accounts    <fct> FEW, FEW, FEW, AVERAGE, MANY, AVERAGE, AVER...
## $ bad_public_record  <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ credit_utilization <fct> HIGH, LOW, HIGH, LOW, MEDIUM, MEDIUM, HIGH,...
## $ past_bankrupt      <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ outcome            <fct> repaid, default, repaid, repaid, repaid, re...


3. UpSampling & DownSampling


## 
##  repaid default 
##   34078    5654

In classification problems, a disparity in the frequencies of the observed classes, as in the class counts above, can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issue. Examples of sampling methods for this purpose are:

  • down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.

  • up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.

  • hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift
## 
##  repaid default 
##    5654    5654
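The code that produced the balanced class counts above was not shown in the post. A minimal sketch using caret::downSample (the column name outcome is taken from the glimpse output above):

library(caret)

# Down-sample the majority class so that repaid and default occur equally often
loans <- downSample(x = loans[, setdiff(names(loans), "outcome")],
                    y = loans$outcome,
                    yname = "outcome")

# Check the new class balance
table(loans$outcome)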


4. Building a simple decision tree

(1) Model Building

You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.
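The model-fitting code itself was hidden in the original post. Based on the Call shown at the top of the summary output below, it would have looked roughly like this:

library(rpart)

# Grow an unpruned classification tree (cp = 0 disables complexity-based pruning)
loan_model <- rpart(outcome ~ loan_amount + credit_score,
                    data = loans, method = "class",
                    control = rpart.control(cp = 0))

# Examine the fitted tree
summary(loan_model)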

## Call:
## rpart(formula = outcome ~ loan_amount + credit_score, data = loans, 
##     method = "class", control = rpart.control(cp = 0))
##   n= 11308 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.10771135      0 1.0000000 1.0231694 0.009401356
## 2 0.01317651      1 0.8922886 0.8922886 0.009349171
## 3 0.00000000      3 0.8659356 0.8659356 0.009318988
## 
## Variable importance
## credit_score  loan_amount 
##           89           11 
## 
## Node number 1: 11308 observations,    complexity param=0.1077114
##   predicted class=repaid   expected loss=0.5  P(node) =1
##     class counts:  5654  5654
##    probabilities: 0.500 0.500 
##   left son=2 (1811 obs) right son=3 (9497 obs)
##   Primary splits:
##       credit_score splits as  RLR, improve=121.92300, (0 missing)
##       loan_amount  splits as  RLL, improve= 29.13543, (0 missing)
## 
## Node number 2: 1811 observations
##   predicted class=repaid   expected loss=0.3318609  P(node) =0.1601521
##     class counts:  1210   601
##    probabilities: 0.668 0.332 
## 
## Node number 3: 9497 observations,    complexity param=0.01317651
##   predicted class=default  expected loss=0.4679372  P(node) =0.8398479
##     class counts:  4444  5053
##    probabilities: 0.468 0.532 
##   left son=6 (7867 obs) right son=7 (1630 obs)
##   Primary splits:
##       credit_score splits as  L-R, improve=42.17392, (0 missing)
##       loan_amount  splits as  RLL, improve=19.24674, (0 missing)
## 
## Node number 6: 7867 observations,    complexity param=0.01317651
##   predicted class=default  expected loss=0.489386  P(node) =0.6957022
##     class counts:  3850  4017
##    probabilities: 0.489 0.511 
##   left son=12 (5397 obs) right son=13 (2470 obs)
##   Primary splits:
##       loan_amount splits as  RLL, improve=20.49803, (0 missing)
## 
## Node number 7: 1630 observations
##   predicted class=default  expected loss=0.3644172  P(node) =0.1441457
##     class counts:   594  1036
##    probabilities: 0.364 0.636 
## 
## Node number 12: 5397 observations
##   predicted class=repaid   expected loss=0.486196  P(node) =0.4772727
##     class counts:  2773  2624
##    probabilities: 0.514 0.486 
## 
## Node number 13: 2470 observations
##   predicted class=default  expected loss=0.4360324  P(node) =0.2184294
##     class counts:  1077  1393
##    probabilities: 0.436 0.564


(2) Predict()

##      1 
## repaid 
## Levels: repaid default
##       1 
## default 
## Levels: repaid default
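The prediction calls were also omitted. Assuming one-row data frames good_credit and bad_credit describing the two hypothetical applicants (these names are illustrative, not from the post), the predictions above would be produced by:

# Predict the outcome for an applicant with good credit
predict(loan_model, newdata = good_credit, type = "class")

# Predict the outcome for an applicant with bad credit
predict(loan_model, newdata = bad_credit, type = "class")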


5. Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.



Based on this tree structure, which of the following applicants would be predicted to repay the loan?

Someone with a low requested loan amount and high credit. Using the tree structure, you can clearly see how the tree makes its decisions.
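The tree diagram itself did not survive in this copy of the post. A typical way to draw it, assuming the rpart.plot package is available:

library(rpart.plot)

# Plot the fitted tree with default settings
rpart.plot(loan_model)

# A more customized plot: type changes the node labelling style,
# box.palette colors the boxes, fallen.leaves aligns the leaves at the bottom
rpart.plot(loan_model, type = 3, box.palette = c("red", "green"), fallen.leaves = TRUE)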


6. Creating Random Test Datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.



You can use 75% of the observations for training and 25% for testing the model.

## [1] 8481

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets.
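The splitting code was not shown. A sketch consistent with the 8481 rows reported above (75% of the 11,308 balanced loans):

# Determine the number of rows for training
nrow(loans) * 0.75

# Create a random sample of row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Create the training and test datasets
loans_train <- loans[sample_rows, ]
loans_test  <- loans[-sample_rows, ]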


7. Building and evaluating a larger tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
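The model-building and evaluation code was again hidden. A sketch consistent with the confusion matrix and accuracy shown below (assuming the loans_train / loans_test split from the previous step):

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train,
                    method = "class", control = rpart.control(cp = 0))

# Make predictions on the test dataset and cross-tabulate them with the actual outcomes
loans_test$pred <- predict(loan_model, loans_test, type = "class")
table(loans_test$pred, loans_test$outcome)

# Compute the accuracy on the test dataset
mean(loans_test$pred == loans_test$outcome)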

##          
##           repaid default
##   repaid     781     626
##   default    636     784
## [1] 0.5535904

The accuracy on the test dataset seems low. How, then, did adding more predictors change the model's performance compared to the simple two-predictor tree?


8. Conducting a fair performance evaluation

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.

Which of these is NOT true about the evaluation of decision tree performance?

  1. Decision trees sometimes overfit the training data.
  2. The model’s accuracy is unaffected by the rarity of the outcome.
  3. Performance on the training dataset can overestimate performance on future data.
  4. Creating a test dataset simulates the model’s performance on unseen data.

The answer is (2). Rare events cause problems for many machine learning approaches.


9. Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
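The pre-pruning code was not shown. A sketch of the two control options the text refers to (the specific values of maxdepth and minsplit are assumptions, not the course solution):

# Grow a tree with a maximum depth of 6
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class",
                    control = rpart.control(cp = 0, maxdepth = 6))

# Alternatively, require at least 500 observations in a node before attempting a split
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class",
                    control = rpart.control(cp = 0, minsplit = 500))

# Evaluate on the test dataset
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)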

## [1] 0.5723382

Compared to the previous model, the new model shows better accuracy. Still, there is another technique that can produce an even better model.


10. Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.

In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.
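Again the code was omitted. A sketch of the usual post-pruning workflow with rpart (the cp value passed to prune() is an assumption; in practice it is read off the complexity plot):

# Grow an overly complex tree
loan_model <- rpart(outcome ~ ., data = loans_train,
                    method = "class", control = rpart.control(cp = 0))

# Examine the complexity plot: cross-validated error versus tree complexity
plotcp(loan_model)

# Prune the tree back to a simpler size using a cp value chosen from the plot
loan_model_pruned <- prune(loan_model, cp = 0.0014)

# Compute the accuracy of the pruned tree on the test dataset
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)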


## [1] 0.5928546

As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.


11. Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

  1. Simpler trees are easier to interpret.
  2. Simpler trees using early stopping are faster to train.
  3. Simpler trees may perform better on the testing data.


12. Building a random forest model

In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.
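The forest-building code was hidden as well. A sketch consistent with the package-loading messages and the accuracy reported below:

library(randomForest)

# Build a random forest of classification trees from the training data
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Compute the accuracy of the random forest on the test dataset
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)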

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
## [1] 0.5854262

