Chapter 3. Combining Data for Analysis

All of the content below comes from DataCamp.

The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. Or if you are dealing with time series data, for example, you may have a new dataset for each day. No matter the reason, it is important to be able to combine datasets so you can either clean a single dataset, or clean each dataset separately and then combine them later so you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

Chapter 1. Combining rows of data

The dataset you'll be working with here relates to NYC Uber data. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

Begin by exploring the structure of these three DataFrames in the IPython Shell using methods such as .head().

Code from DataCamp is below.

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())
In [1]:
# my code below
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/nyc_uber_2014.csv"
row_concat = pd.read_csv(url)
print(row_concat.head())
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Chapter 2. Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.

You'll return to the Ebola dataset you worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient are contained in a single column. This column has been parsed into a new DataFrame, status_country, with separate columns for status and country.

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [3]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
In [4]:
# Use the 'str_split' column as a stand-in for the status_country DataFrame
status_country = ebola_melt["str_split"]
In [5]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis = 1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)
(1952, 8)
In [6]:
# Print the head of ebola_tidy
print(ebola_tidy.head())
         Date  Day  type_country  counts        str_split   type country  \
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea   
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea   
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea   
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea   
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea   

         str_split  
0  [Cases, Guinea]  
1  [Cases, Guinea]  
2  [Cases, Guinea]  
3  [Cases, Guinea]  
4  [Cases, Guinea]  

Chapter 3. Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

As Dan showed you in the video, the glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the pattern is part_, followed by a single-digit number, and then .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.).

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard matches exactly one character, and the * wildcard matches any number of characters.

In [7]:
# Import necessary modules
import pandas as pd
import glob
import os
In [8]:
# check working directory
os.getcwd()
Out[8]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data'
In [10]:
os.chdir("/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data")
os.getcwd()
Out[10]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data'
In [11]:
# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)
In [12]:
# Print the file names
print(csv_files)
['uber-raw-data-apr14.csv', 'uber-raw-data-jun14.csv', 'uber-raw-data-may14.csv']
In [13]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
In [14]:
# Print the head of csv2
print(csv2.head())
          Date/Time      Lat      Lon    Base
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01:00  40.7131 -74.0097  B02512
2  6/1/2014 0:04:00  40.3461 -74.6610  B02512
3  6/1/2014 0:04:00  40.7555 -73.9833  B02512
4  6/1/2014 0:07:00  40.6880 -74.1831  B02512

Chapter 4. Iterating and concatenating all matches

Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

  1. iterate through each of the filenames
  2. read each filename into a DataFrame, and then
  3. append it to the frames list.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [15]:
# Create an empty list: frames
frames = []
In [20]:
# Iterate over csv_files
for csv in csv_files: 
    
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)
    print('succeed?: ', csv, ' succeeded')
succeed?:  uber-raw-data-apr14.csv  succeeded
succeed?:  uber-raw-data-jun14.csv  succeeded
succeed?:  uber-raw-data-may14.csv  succeeded
In [17]:
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
In [18]:
# Print the shape of uber
print(uber.shape)
(1880795, 4)
In [19]:
# Print the head of uber
print(uber.head())
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Prerequisites. Programming with Databases - Python

To close, let’s have a look at how to access a database from a general-purpose programming language like Python. Other languages use almost exactly the same model: library and function names may differ, but the concepts are the same.

Here’s a short Python program that selects latitudes and longitudes from an SQLite database stored in a file called survey.db:

Please check here: Software Carpentry SQL lesson.

In [27]:
import sqlite3

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute("SELECT Site.lat, Site.long FROM Site;")
results = cursor.fetchall()
for r in results:
    print(r)
cursor.close()
connection.close()
(-49.85, -128.57)
(-47.15, -126.72)
(-48.87, -123.4)
In [28]:
connection = sqlite3.connect("survey.db")

# site dataframe
site = pd.read_sql_query("select name, lat, long from Site", connection)
print(site.head())
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
In [29]:
connection = sqlite3.connect("survey.db")

# visited dataframe
visited = pd.read_sql_query("select id, site, dated from Visited", connection)
print(visited.head())
    id  site       dated
0  619  DR-1  1927-02-08
1  622  DR-1  1927-02-10
2  734  DR-3  1930-01-07
3  735  DR-3  1930-01-12
4  751  DR-3  1930-02-26
In [31]:
connection = sqlite3.connect("survey.db")

# survey dataframe
survey = pd.read_sql_query("select * from Survey", connection)
print(survey.head())
   taken person quant  reading
0    619   dyer   rad     9.82
1    619   dyer   sal     0.13
2    622   dyer   rad     7.80
3    622   dyer   sal     0.09
4    734     pb   rad     8.41

Chapter 5. 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and 1930s while they were on an expedition towards Antarctica. The dataset was taken from an SQLite database from the Software Carpentry SQL lesson.

Two DataFrames have been pre-loaded: site and visited. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the 'name' column of site and the 'site' column of visited.

In [30]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)
    name    lat    long   id   site       dated
0   DR-1 -49.85 -128.57  619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57  622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57  844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72  734   DR-3  1930-01-07
4   DR-3 -47.15 -126.72  735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72  751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72  752   DR-3        None
7  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14

Chapter 6. Many-to-Many data merge

The final merging scenario occurs when both DataFrames do not have unique keys for a merge. What happens here is that for each duplicated key, every pairwise combination will be created.

Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help you develop your intuition for many-to-many merges.

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame. Your task is to merge site and visited as you did in the earlier exercises. You will then merge this merged DataFrame with survey.

Begin by exploring the site, visited, and survey DataFrames in the IPython Shell.

In [35]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='id', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))
     name    lat    long   id   site       dated  taken person quant  reading
0    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   rad     9.82
1    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   sal     0.13
2    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   rad     7.80
3    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   sal     0.09
4    DR-1 -49.85 -128.57  844   DR-1  1932-03-22    844    roe   rad    11.25
5    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb   rad     8.41
6    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734   lake   sal     0.05
7    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb  temp   -21.50
8    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735     pb   rad     7.22
9    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None   sal     0.06
10   DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None  temp   -26.00
11   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb   rad     4.35
12   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb  temp   -18.50
13   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751   lake   sal     0.10
14   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   rad     2.19
15   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   sal     0.09
16   DR-3 -47.15 -126.72  752   DR-3        None    752   lake  temp   -16.00
17   DR-3 -47.15 -126.72  752   DR-3        None    752    roe   sal    41.60
18  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   rad     1.46
19  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   sal     0.21
Chapter 2. Tidying Data for Analysis

Here, you'll learn about the principles of tidy data and, more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first-hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

Chapter 1. Recognizing tidy data

For data to be tidy, it must have:

  • Each variable as a separate column.
  • Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.

Chapter 2. Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently (e.g., this could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

In [3]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
In [5]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame = airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN

Chapter 3. Customizing melted data

When melting DataFrames, it would be better to have column names more meaningful than variable and value (the default names used by pd.melt()).

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. pandas (as pd) and the DataFrame airquality have been pre-loaded for you.

In [6]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')


# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN

Chapter 4. Pivot data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). The melted DataFrame airquality_melt has been pre-loaded for you.

In [7]:
# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
In [8]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3

Chapter 5. Resetting the index of a DataFrame

After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).

Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas. In essence, they allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from airquality_pivot, which has been pre-loaded.

In [9]:
# Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
           names=['Month', 'Day'])
In [10]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
In [11]:
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)
In [12]:
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3

Chapter 6. Pivoting duplicate values

So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using .reset_index().

NumPy and pandas have been imported as np and pd respectively.

In [24]:
# Duplicate the dataset by stacking it on top of itself
# (pd.concat replaces the now-deprecated DataFrame.append)
airquality_dup = pd.concat([airquality_melt, airquality_melt], ignore_index=True)
airquality_dup
Out[24]:
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
5 5 6 Ozone 28.0
6 5 7 Ozone 23.0
7 5 8 Ozone 19.0
8 5 9 Ozone 8.0
9 5 10 Ozone NaN
10 5 11 Ozone 7.0
11 5 12 Ozone 16.0
12 5 13 Ozone 11.0
13 5 14 Ozone 14.0
14 5 15 Ozone 18.0
15 5 16 Ozone 14.0
16 5 17 Ozone 34.0
17 5 18 Ozone 6.0
18 5 19 Ozone 30.0
19 5 20 Ozone 11.0
20 5 21 Ozone 1.0
21 5 22 Ozone 11.0
22 5 23 Ozone 4.0
23 5 24 Ozone 32.0
24 5 25 Ozone NaN
25 5 26 Ozone NaN
26 5 27 Ozone NaN
27 5 28 Ozone 23.0
28 5 29 Ozone 45.0
29 5 30 Ozone 115.0
... ... ... ... ...
1194 9 1 Temp 91.0
1195 9 2 Temp 92.0
1196 9 3 Temp 93.0
1197 9 4 Temp 93.0
1198 9 5 Temp 87.0
1199 9 6 Temp 84.0
1200 9 7 Temp 80.0
1201 9 8 Temp 78.0
1202 9 9 Temp 75.0
1203 9 10 Temp 73.0
1204 9 11 Temp 81.0
1205 9 12 Temp 76.0
1206 9 13 Temp 77.0
1207 9 14 Temp 71.0
1208 9 15 Temp 71.0
1209 9 16 Temp 78.0
1210 9 17 Temp 67.0
1211 9 18 Temp 76.0
1212 9 19 Temp 68.0
1213 9 20 Temp 82.0
1214 9 21 Temp 64.0
1215 9 22 Temp 71.0
1216 9 23 Temp 81.0
1217 9 24 Temp 69.0
1218 9 25 Temp 63.0
1219 9 26 Temp 70.0
1220 9 27 Temp 77.0
1221 9 28 Temp 75.0
1222 9 29 Temp 76.0
1223 9 30 Temp 68.0

1224 rows × 4 columns

In [28]:
import numpy as np
# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], 
                                              columns='measurement', 
                                              values='reading', 
                                              aggfunc=np.mean)

print(airquality_pivot)
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
      6       28.0      NaN  66.0  14.9
      7       23.0    299.0  65.0   8.6
      8       19.0     99.0  59.0  13.8
      9        8.0     19.0  61.0  20.1
      10       NaN    194.0  69.0   8.6
      11       7.0      NaN  74.0   6.9
      12      16.0    256.0  69.0   9.7
      13      11.0    290.0  66.0   9.2
      14      14.0    274.0  68.0  10.9
      15      18.0     65.0  58.0  13.2
      16      14.0    334.0  64.0  11.5
      17      34.0    307.0  66.0  12.0
      18       6.0     78.0  57.0  18.4
      19      30.0    322.0  68.0  11.5
      20      11.0     44.0  62.0   9.7
      21       1.0      8.0  59.0   9.7
      22      11.0    320.0  73.0  16.6
      23       4.0     25.0  61.0   9.7
      24      32.0     92.0  61.0  12.0
      25       NaN     66.0  57.0  16.6
      26       NaN    266.0  58.0  14.9
      27       NaN      NaN  57.0   8.0
      28      23.0     13.0  67.0  12.0
      29      45.0    252.0  81.0  14.9
      30     115.0    223.0  79.0   5.7
...            ...      ...   ...   ...
9     1       96.0    167.0  91.0   6.9
      2       78.0    197.0  92.0   5.1
      3       73.0    183.0  93.0   2.8
      4       91.0    189.0  93.0   4.6
      5       47.0     95.0  87.0   7.4
      6       32.0     92.0  84.0  15.5
      7       20.0    252.0  80.0  10.9
      8       23.0    220.0  78.0  10.3
      9       21.0    230.0  75.0  10.9
      10      24.0    259.0  73.0   9.7
      11      44.0    236.0  81.0  14.9
      12      21.0    259.0  76.0  15.5
      13      28.0    238.0  77.0   6.3
      14       9.0     24.0  71.0  10.9
      15      13.0    112.0  71.0  11.5
      16      46.0    237.0  78.0   6.9
      17      18.0    224.0  67.0  13.8
      18      13.0     27.0  76.0  10.3
      19      24.0    238.0  68.0  10.3
      20      16.0    201.0  82.0   8.0
      21      13.0    238.0  64.0  12.6
      22      23.0     14.0  71.0   9.2
      23      36.0    139.0  81.0  10.3
      24       7.0     49.0  69.0  10.3
      25      14.0     20.0  63.0  16.6
      26      30.0    193.0  70.0   6.9
      27       NaN    145.0  77.0  13.2
      28      14.0    191.0  75.0  14.3
      29      18.0    131.0  76.0   8.0
      30      20.0    223.0  68.0  11.5

[153 rows x 4 columns]
In [29]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
In [30]:
# Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

Chapter 7. Splitting a column with .str

The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.

In [37]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/tb.csv"
tb = pd.read_csv(url)
print(tb.head())
  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  
In [38]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.head())
  country  year variable  value
0      AD  2000     m014    0.0
1      AE  2000     m014    2.0
2      AF  2000     m014   52.0
3      AG  2000     m014    0.0
4      AL  2000     m014    2.0
In [40]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
print(tb_melt.head())
  country  year variable  value gender
0      AD  2000     m014    0.0      m
1      AE  2000     m014    2.0      m
2      AF  2000     m014   52.0      m
3      AG  2000     m014    0.0      m
4      AL  2000     m014    2.0      m
In [41]:
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head())
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014

Chapter 8. Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a dataset consisting of Ebola cases and death counts by state and country. It has been pre-loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore serves as a delimiter between the first part (cases or deaths) and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method .split(). By default, this method splits a string on whitespace. However, in this case you want it to split on an underscore. You can do this on 'Cases_Guinea', for example, using 'Cases_Guinea'.split('_'), which returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.

In [42]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [43]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
print(ebola_melt.head())
         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
In [44]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]
In [45]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases
In [46]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
Classification Trees

Supervised Learning: Classification Trees

Evan Jung January 28, 2019

1. Intro

Classification trees use flowchart-like structures to make decisions. Because humans can readily understand these tree structures, classification trees are useful when transparency is needed, such as in loan approval. We’ll use the Lending Club dataset to simulate this scenario.


2. Lending Club Dataset

Lending Club is a US-based peer-to-peer lending company. The loans dataset contains 11,312 randomly-selected people who applied for and later received loans from Lending Club.


## Observations: 39,732
## Variables: 14
## $ loan_amount        <fct> LOW, LOW, LOW, MEDIUM, LOW, LOW, MEDIUM, LO...
## $ emp_length         <fct> 10+ years, < 2 years, 10+ years, 10+ years,...
## $ home_ownership     <fct> RENT, RENT, RENT, RENT, RENT, RENT, RENT, R...
## $ income             <fct> LOW, LOW, LOW, MEDIUM, HIGH, LOW, MEDIUM, M...
## $ loan_purpose       <fct> credit_card, car, small_business, other, ot...
## $ debt_to_income     <fct> HIGH, LOW, AVERAGE, HIGH, AVERAGE, AVERAGE,...
## $ credit_score       <fct> AVERAGE, AVERAGE, AVERAGE, AVERAGE, AVERAGE...
## $ recent_inquiry     <fct> YES, YES, YES, YES, NO, YES, YES, YES, YES,...
## $ delinquent         <fct> NEVER, NEVER, NEVER, MORE THAN 2 YEARS AGO,...
## $ credit_accounts    <fct> FEW, FEW, FEW, AVERAGE, MANY, AVERAGE, AVER...
## $ bad_public_record  <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ credit_utilization <fct> HIGH, LOW, HIGH, LOW, MEDIUM, MEDIUM, HIGH,...
## $ past_bankrupt      <fct> NO, NO, NO, NO, NO, NO, NO, NO, NO, NO, NO,...
## $ outcome            <fct> repaid, default, repaid, repaid, repaid, re...


3. UpSampling & DownSampling


## 
##  repaid default 
##   34078    5654

In classification problems, as the class counts above show, a disparity in the frequencies of the observed classes can have a significant negative impact on model fitting. One technique for resolving such a class imbalance is to subsample the training data in a manner that mitigates the issues. Examples of sampling methods for this purpose are listed below (a code sketch follows the list):

  • down-sampling: randomly subset all the classes in the training set so that their class frequencies match the least prevalent class. For example, suppose that 80% of the training set samples are the first class and the remaining 20% are in the second class. Down-sampling would randomly sample the first class to be the same size as the second class (so that only 40% of the total training set is used to fit the model). caret contains a function (downSample) to do this.

  • up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class. caret contains a function (upSample) to do this.

  • hybrid methods: techniques such as SMOTE and ROSE down-sample the majority class and synthesize new data points in the minority class. There are two packages (DMwR and ROSE) that implement these procedures.
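Only the package startup messages and the balanced class counts appear below; a minimal sketch of the down-sampling step with caret, assuming the raw data frame is called loans and the target column is outcome, might look like this:

library(caret)

# Down-sample the majority class so both outcome levels have 5,654 rows
loans <- downSample(x = loans[, names(loans) != "outcome"],
                    y = loans$outcome,
                    yname = "outcome")

# Check that the class frequencies are now balanced
table(loans$outcome)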

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift
## 
##  repaid default 
##    5654    5654


4. Building a simple decision tree

(1) Model Building

You will use a decision tree to try to learn patterns in the outcome of these loans (either repaid or default) based on the requested loan amount and credit score at the time of application.

Then, see how the tree’s predictions differ for an applicant with good credit versus one with bad credit.
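The rpart call is visible in the summary below; a minimal sketch of the model-building and prediction steps, assuming good_credit and bad_credit are single-row data frames describing the two hypothetical applicants, might be:

library(rpart)

# Grow a classification tree from loan amount and credit score only
loan_model <- rpart(outcome ~ loan_amount + credit_score, data = loans,
                    method = "class", control = rpart.control(cp = 0))
summary(loan_model)

# Compare predictions for a good-credit and a bad-credit applicant
predict(loan_model, good_credit, type = "class")
predict(loan_model, bad_credit, type = "class")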

## Call:
## rpart(formula = outcome ~ loan_amount + credit_score, data = loans, 
##     method = "class", control = rpart.control(cp = 0))
##   n= 11308 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.10771135      0 1.0000000 1.0231694 0.009401356
## 2 0.01317651      1 0.8922886 0.8922886 0.009349171
## 3 0.00000000      3 0.8659356 0.8659356 0.009318988
## 
## Variable importance
## credit_score  loan_amount 
##           89           11 
## 
## Node number 1: 11308 observations,    complexity param=0.1077114
##   predicted class=repaid   expected loss=0.5  P(node) =1
##     class counts:  5654  5654
##    probabilities: 0.500 0.500 
##   left son=2 (1811 obs) right son=3 (9497 obs)
##   Primary splits:
##       credit_score splits as  RLR, improve=121.92300, (0 missing)
##       loan_amount  splits as  RLL, improve= 29.13543, (0 missing)
## 
## Node number 2: 1811 observations
##   predicted class=repaid   expected loss=0.3318609  P(node) =0.1601521
##     class counts:  1210   601
##    probabilities: 0.668 0.332 
## 
## Node number 3: 9497 observations,    complexity param=0.01317651
##   predicted class=default  expected loss=0.4679372  P(node) =0.8398479
##     class counts:  4444  5053
##    probabilities: 0.468 0.532 
##   left son=6 (7867 obs) right son=7 (1630 obs)
##   Primary splits:
##       credit_score splits as  L-R, improve=42.17392, (0 missing)
##       loan_amount  splits as  RLL, improve=19.24674, (0 missing)
## 
## Node number 6: 7867 observations,    complexity param=0.01317651
##   predicted class=default  expected loss=0.489386  P(node) =0.6957022
##     class counts:  3850  4017
##    probabilities: 0.489 0.511 
##   left son=12 (5397 obs) right son=13 (2470 obs)
##   Primary splits:
##       loan_amount splits as  RLL, improve=20.49803, (0 missing)
## 
## Node number 7: 1630 observations
##   predicted class=default  expected loss=0.3644172  P(node) =0.1441457
##     class counts:   594  1036
##    probabilities: 0.364 0.636 
## 
## Node number 12: 5397 observations
##   predicted class=repaid   expected loss=0.486196  P(node) =0.4772727
##     class counts:  2773  2624
##    probabilities: 0.514 0.486 
## 
## Node number 13: 2470 observations
##   predicted class=default  expected loss=0.4360324  P(node) =0.2184294
##     class counts:  1077  1393
##    probabilities: 0.436 0.564


(2) Predict()

##      1 
## repaid 
## Levels: repaid default
##       1 
## default 
## Levels: repaid default


5. Visualizing classification trees

Due to government rules to prevent illegal discrimination, lenders are required to explain why a loan application was rejected.

The structure of classification trees can be depicted visually, which helps to understand how the tree makes its decisions.
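The plot itself is not reproduced in this post; a minimal sketch using the rpart.plot package (an assumption, since the plotting code is not shown) would be:

library(rpart.plot)

# Draw the fitted tree; type and fallen.leaves only tweak the layout
rpart.plot(loan_model)
rpart.plot(loan_model, type = 3, fallen.leaves = TRUE)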



Based on this tree structure, which of the following applicants would be predicted to repay the loan?

Someone with a low requested loan amount and high credit. Using the tree structure, you can clearly see how the tree makes its decisions.


6. Creating Random Test Datasets

Before building a more sophisticated lending model, it is important to hold out a portion of the loan data to simulate how well it will predict the outcomes of future loan applicants.



You can use 75% of the observations for training and 25% for testing the model.

## [1] 8481

The sample() function can be used to generate a random sample of rows to include in the training set. Simply supply it the total number of observations and the number needed for training.

Use the resulting vector of row IDs to subset the loans into training and testing datasets.
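A minimal sketch of the split, assuming the balanced loans data frame from earlier (75% of its 11,308 rows is the 8,481 shown above):

# Number of rows for the training set
nrow(loans) * 0.75

# Randomly sample 75% of the row IDs
sample_rows <- sample(nrow(loans), nrow(loans) * 0.75)

# Split into training and testing datasets
loans_train <- loans[sample_rows, ]
loans_test  <- loans[-sample_rows, ]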


7. Building and evaluating a larger tree

Previously, you created a simple decision tree that used the applicant’s credit score and requested loan amount to predict the loan outcome.

Lending Club has additional information about the applicants, such as home ownership status, length of employment, loan purpose, and past bankruptcies, that may be useful for making more accurate predictions.

Using all of the available applicant data, build a more sophisticated lending model using the random training dataset created previously. Then, use this model to make predictions on the testing dataset to estimate the performance of the model on future loan applications.
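A minimal sketch of the larger model and the evaluation that produced the confusion matrix and accuracy below, assuming the loans_train and loans_test splits from the previous section:

# Grow a tree using all of the available applicant data
loan_model <- rpart(outcome ~ ., data = loans_train,
                    method = "class", control = rpart.control(cp = 0))

# Predict on the held-out test set and evaluate
loans_test$pred <- predict(loan_model, loans_test, type = "class")
table(loans_test$pred, loans_test$outcome)
mean(loans_test$pred == loans_test$outcome)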

##          
##           repaid default
##   repaid     781     626
##   default    636     784
## [1] 0.5535904

The accuracy on the test dataset seems low. How did adding more predictors change the model's performance?


8. Conducting a fair performance evaluation

Holding out test data reduces the amount of data available for growing the decision tree. In spite of this, it is very important to evaluate decision trees on data they have not seen before.

Which of these is NOT true about the evaluation of decision tree performance?

  1. Decision trees sometimes overfit the training data.
  2. The model’s accuracy is unaffected by the rarity of the outcome.
  3. Performance on the training dataset can overestimate performance on future data.
  4. Creating a test dataset simulates the model’s performance on unseen data.

The answer is (2). Rare events cause problems for many machine learning approaches.


9. Preventing overgrown trees

The tree grown on the full set of applicant data grew to be extremely large and extremely complex, with hundreds of splits and leaf nodes containing only a handful of applicants. This tree would be almost impossible for a loan officer to interpret.

Using the pre-pruning methods for early stopping, you can prevent a tree from growing too large and complex. See how the rpart control options for maximum tree depth and minimum split count impact the resulting tree.
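A minimal sketch of pre-pruning via rpart.control; the maxdepth and minsplit values are illustrative assumptions, not the exact settings used for the output below:

# Stop the tree early: cap its depth and require larger nodes before splitting
loan_model <- rpart(outcome ~ ., data = loans_train, method = "class",
                    control = rpart.control(cp = 0, maxdepth = 6, minsplit = 500))

# Accuracy of the simpler tree on the test set
loans_test$pred <- predict(loan_model, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)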

## [1] 0.5723382

Compared to the previous model, the new model shows better accuracy. Still, there are further techniques for building an even better model.


10. Creating a nicely pruned tree

Stopping a tree from growing all the way can lead it to ignore some aspects of the data or miss important trends it may have discovered later.

By using post-pruning, you can intentionally grow a large and complex tree then prune it to be smaller and more efficient later on.

In this exercise, you will have the opportunity to construct a visualization of the tree’s performance versus complexity, and use this information to prune the tree to an appropriate level.
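A minimal sketch of post-pruning; the cp value passed to prune() is an illustrative assumption that would normally be read off the complexity plot:

# Grow an overly complex tree, then inspect the complexity/error trade-off
loan_model <- rpart(outcome ~ ., data = loans_train,
                    method = "class", control = rpart.control(cp = 0))
plotcp(loan_model)

# Prune back to a simpler tree and re-evaluate on the test set
loan_model_pruned <- prune(loan_model, cp = 0.0014)
loans_test$pred <- predict(loan_model_pruned, loans_test, type = "class")
mean(loans_test$pred == loans_test$outcome)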


## [1] 0.5928546

As with pre-pruning, creating a simpler tree actually improved the performance of the tree on the test dataset.


11. Why do trees benefit from pruning?

Classification trees can grow indefinitely, until they are told to stop or run out of data to divide-and-conquer.

Just like trees in nature, classification trees that grow overly large can require pruning to reduce the excess growth. However, this generally results in a tree that classifies fewer training examples correctly.

Why, then, are pre-pruning and post-pruning almost always used?

  1. Simpler trees are easier to interpret.
  2. Simpler trees using early stopping are faster to train.
  3. Simpler trees may perform better on the testing data.


12. Building a random forest model

In spite of the fact that a forest can contain hundreds of trees, growing a decision tree forest is perhaps even easier than creating a single highly-tuned tree.

Using the randomForest package, build a random forest and see how it compares to the single trees you built previously.

Keep in mind that due to the random nature of the forest, the results may vary slightly each time you create the forest.
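A minimal sketch of the forest, assuming the same loans_train and loans_test splits; ntree is left at its default, and the package startup messages appear below:

library(randomForest)

# Grow a random forest from all predictors in the training data
loan_model <- randomForest(outcome ~ ., data = loans_train)

# Evaluate the forest on the test set
loans_test$pred <- predict(loan_model, loans_test)
mean(loans_test$pred == loans_test$outcome)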

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin
## [1] 0.5854262


Logistic Regression

Supervised Learning, Logistic Regression

Evan Jung January 18, 2019

1. Data import

Suppose we receive the dataset below from an NGO.

## Observations: 93,462
## Variables: 13
## $ donated           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ veteran           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bad_address       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ age               <dbl> 60, 46, NA, 70, 78, NA, 38, NA, NA, 65, NA, ...
## $ has_children      <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ wealth_rating     <dbl> 0, 3, 1, 2, 1, 0, 2, 3, 1, 0, 1, 2, 1, 0, 2,...
## $ interest_veterans <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ interest_religion <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pet_owner         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ catalog_shopper   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ recency           <fct> CURRENT, CURRENT, CURRENT, CURRENT, CURRENT,...
## $ frequency         <fct> FREQUENT, FREQUENT, FREQUENT, FREQUENT, FREQ...
## $ money             <fct> MEDIUM, HIGH, MEDIUM, MEDIUM, MEDIUM, MEDIUM...

Here the target variable is donated. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise.


2. Building a model

In most cases, it's not a good idea to put all the variables into a model. It is better to start with a hypothesis about which independent variables will be predictive of the dependent variable. In this case, the bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.
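The glm() call in the summary below could be reproduced with a sketch like this (the object name donation_model is an assumption):

# Fit a logistic regression with the three hypothesized predictors
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans,
                      data = donors, family = "binomial")

# Summarize the model
summary(donation_model)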

## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5


3. Prediction

As with other machine learning methods in R, we apply predict(). By default, predict() outputs predictions in terms of log odds unless type = "response" is specified, which converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive.

For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.
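A minimal sketch of the prediction step that produced the two numbers below; the column names donation_prob and donation_pred are assumptions, although the same names show up later in the stepwise output:

# Estimate the donation probability for every prospect
donors$donation_prob <- predict(donation_model, type = "response")

# The average donation rate serves as the prediction cutoff
mean(donors$donated)
donors$donation_pred <- ifelse(donors$donation_prob > mean(donors$donated), 1, 0)

# Overall accuracy of the model
mean(donors$donation_pred == donors$donated)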

## [1] 0.05040551

The actual probability that an average person donates is about 0.05; this average can serve as the cutoff when converting predicted probabilities into predictions.

## [1] 0.794815


4. Limitation of Accuracy

Although the model's accuracy is almost 80%, the result is misleading due to the rarity of the outcome being predicted. What would the accuracy have been if a model had simply predicted "no donation" for every person? It would be 95%. See below.

## 
##    0    1 
## 0.95 0.05


5. Calculating ROC Curves and AUC

We know that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.
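A minimal sketch using the pROC package, whose startup messages appear below:

library(pROC)

# Build the ROC curve from the actual outcomes and the predicted probabilities
ROC <- roc(donors$donated, donors$donation_prob)

# Plot the curve and compute the area under it
plot(ROC, col = "blue")
auc(ROC)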

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Area under the curve: 0.5102


How can we interpret the AUC value and the plot? Based on this visualization, the model isn't doing much better than baseline, that is, a model doing nothing but making predictions at random.


6. Dummy Coding Categorical Data

Sometimes a dataset contains numeric values that represent a categorical feature.
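Here wealth_rating is coded 0 through 3. A minimal sketch of converting it to a labeled factor with Medium as the reference level; the labels are inferred from the coefficient names in the summary below:

# Convert the numeric wealth rating into a labeled factor
donors$wealth_rating <- factor(donors$wealth_rating,
                               levels = c(0, 1, 2, 3),
                               labels = c("Unknown", "Low", "Medium", "High"))

# Use "Medium" as the reference category, then refit the model
donors$wealth_rating <- relevel(donors$wealth_rating, ref = "Medium")
summary(glm(donated ~ wealth_rating, data = donors, family = "binomial"))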

## 
## Call:
## glm(formula = donated ~ wealth_rating, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_ratingUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_ratingLow     -0.05245    0.05332  -0.984    0.325    
## wealth_ratingHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5


7. Handling Missing Data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546

There are 22,546 missing values for age, so we need to decide how to handle the missing data.
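A minimal sketch of mean imputation plus a missing-value indicator; the column names imputed_age and missing_age are assumptions, though the same names appear later in the stepwise output:

# Replace missing ages with the rounded mean age
donors$imputed_age <- ifelse(is.na(donors$age),
                             round(mean(donors$age, na.rm = TRUE), 2),
                             donors$age)

# Flag the records where age was originally missing
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)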


8. Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M: Recency, Frequency, Money. Donors that haven't given both recently and frequently may be especially unlikely to give again; in other words, the combined impact of recency and frequency may be greater than the sum of the separate effects.
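The glm() call in the summary below uses an interaction term; a sketch of how it might have been fit (the object name rfm_model is an assumption):

# Model donations with a recency-by-frequency interaction plus money
rfm_model <- glm(donated ~ recency * frequency + money,
                 data = donors, family = "binomial")
summary(rfm_model)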

## 
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6

This model performs better than the previous one. Based on the result, the combined impact of recency and frequency may indeed be greater than the sum of the separate effects.


## Area under the curve: 0.5785

Based on the ROC curve, you’ve confirmed that past giving patterns are certainly predictive of future giving.


9. The dangers of stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. First, it is not guaranteed to find the best possible model. Second, the stepwise regression procedure violates some statistical assumptions. Third, it can result in a model that makes little sense in the real world.


10. Building a stepwise regression model
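No code is shown for this section; a minimal sketch of forward stepwise selection with step(), consistent with the null model, full model, and step trace below (the object names are assumptions, and pROC is assumed to be attached for the AUC at the end):

# Intercept-only model and full model define the search space
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
full_model <- glm(donated ~ ., data = donors, family = "binomial")

# Forward stepwise selection from the null model toward the full model
step_model <- step(null_model,
                   scope = list(lower = null_model, upper = full_model),
                   direction = "forward")

# Estimate donation probabilities and the AUC of the stepwise model
step_prob <- predict(step_model, type = "response")
ROC <- roc(donors$donated, step_prob)
auc(ROC)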

## 
## Call:
## glm(formula = donated ~ 1, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3216  -0.3216  -0.3216  -0.3216   2.4444  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.93593    0.01495  -196.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37330  on 93461  degrees of freedom
## AIC: 37332
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = donated ~ ., family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6111  -0.3642  -0.3080  -0.2866   2.7436  
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.742e+01  1.066e+01   1.634  0.10222    
## veteran              -2.071e-02  5.151e-01  -0.040  0.96793    
## bad_address          -5.442e+00  2.802e+00  -1.942  0.05208 .  
## age                   1.094e-03  1.093e-03   1.001  0.31702    
## has_children         -1.561e-01  5.156e-02  -3.028  0.00247 ** 
## wealth_ratingUnknown -1.196e-02  4.819e-02  -0.248  0.80404    
## wealth_ratingLow     -4.901e-02  5.773e-02  -0.849  0.39594    
## wealth_ratingHigh     1.270e-01  5.079e-02   2.500  0.01243 *  
## interest_veterans     2.429e+00  1.214e+00   2.001  0.04535 *  
## interest_religion     1.491e+00  7.507e-01   1.986  0.04704 *  
## pet_owner             5.060e-02  4.895e-02   1.034  0.30128    
## catalog_shopper       6.686e-02  5.980e-02   1.118  0.26353    
## recencyLAPSED        -1.678e-01  2.565e-01  -0.654  0.51297    
## frequencyINFREQUENT  -4.645e-01  3.523e-02 -13.185  < 2e-16 ***
## moneyMEDIUM           3.734e-01  4.893e-02   7.631 2.34e-14 ***
## donation_prob        -4.131e+02  2.146e+02  -1.926  0.05416 .  
## donation_pred        -1.185e-01  1.222e-01  -0.970  0.33189    
## imputed_age                  NA         NA      NA       NA    
## missing_age                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28714  on 70915  degrees of freedom
## Residual deviance: 28405  on 70899  degrees of freedom
##   (22546 observations deleted due to missingness)
## AIC: 28439
## 
## Number of Fisher Scoring iterations: 6
## Start:  AIC=37332.13
## donated ~ 1

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + frequency          1    28502 37122
## + money              1    28621 37241
## + has_children       1    28705 37326
## + age                1    28707 37328
## + imputed_age        1    28707 37328
## + wealth_rating      3    28704 37328
## + interest_veterans  1    28709 37330
## + donation_prob      1    28710 37330
## + donation_pred      1    28710 37330
## + catalog_shopper    1    28710 37330
## + pet_owner          1    28711 37331
## <none>                    28714 37332
## + interest_religion  1    28712 37333
## + recency            1    28713 37333
## + bad_address        1    28714 37334
## + veteran            1    28714 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + money              1    28441 36966
## + wealth_rating      3    28490 37019
## + has_children       1    28494 37019
## + donation_prob      1    28498 37023
## + interest_veterans  1    28498 37023
## + catalog_shopper    1    28499 37024
## + donation_pred      1    28499 37024
## + age                1    28499 37024
## + imputed_age        1    28499 37024
## + pet_owner          1    28499 37024
## <none>                    28502 37025
## + interest_religion  1    28501 37026
## + recency            1    28501 37026
## + bad_address        1    28502 37026
## + veteran            1    28502 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + wealth_rating      3    28427 36942
## + has_children       1    28432 36943
## + interest_veterans  1    28438 36948
## + donation_prob      1    28438 36949
## + catalog_shopper    1    28438 36949
## + donation_pred      1    28438 36949
## + age                1    28438 36949
## + imputed_age        1    28438 36949
## + pet_owner          1    28439 36949
## <none>                    28441 36950
## + interest_religion  1    28440 36951
## + recency            1    28440 36951
## + bad_address        1    28441 36951
## + veteran            1    28441 36952
## 
## Step:  AIC=36945.48
## donated ~ frequency + money + wealth_rating

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + has_children       1    28416 36937
## + age                1    28424 36944
## + imputed_age        1    28424 36944
## + interest_veterans  1    28424 36945
## + donation_prob      1    28424 36945
## + catalog_shopper    1    28424 36945
## + donation_pred      1    28425 36945
## <none>                    28427 36945
## + pet_owner          1    28425 36946
## + interest_religion  1    28426 36947
## + recency            1    28426 36947
## + bad_address        1    28427 36947
## + veteran            1    28427 36947
## 
## Step:  AIC=36938.4
## donated ~ frequency + money + wealth_rating + has_children

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + pet_owner          1    28413 36937
## + donation_prob      1    28413 36937
## + catalog_shopper    1    28413 36937
## + interest_veterans  1    28413 36937
## + donation_pred      1    28414 36938
## <none>                    28416 36938
## + interest_religion  1    28415 36939
## + age                1    28416 36940
## + imputed_age        1    28416 36940
## + recency            1    28416 36940
## + bad_address        1    28416 36940
## + veteran            1    28416 36940
## 
## Step:  AIC=36932.25
## donated ~ frequency + money + wealth_rating + has_children + 
##     pet_owner

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## <none>                    28413 36932
## + donation_prob      1    28411 36932
## + interest_veterans  1    28411 36932
## + catalog_shopper    1    28412 36933
## + donation_pred      1    28412 36933
## + age                1    28412 36933
## + imputed_age        1    28412 36933
## + recency            1    28413 36934
## + interest_religion  1    28413 36934
## + bad_address        1    28413 36934
## + veteran            1    28413 36934


## Area under the curve: 0.5849

Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!


All of the content comes from DataCamp.
