Chapter3_Combining_data_for_analysis

This all contents come from DataCamp.

The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. Or if you are dealing with time series data, for example, you may have a new dataset for each day. No matter the reason, it is important to be able to combine datasets so you can either clean a single dataset, or clean each dataset separately and then combine them later so you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

Chapter 1. Combining rows of data

The dataset you'll be working with here relates to NYC Uber data. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

Begin by exploring the structure of these three DataFrames in the IPython Shell using methods such as .head().

Code from DataCamp is below.

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())
In [1]:
# my code below
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/nyc_uber_2014.csv"
row_concat = pd.read_csv(url)
print(row_concat.head())
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Chapter 2. Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.

You'll return to the Ebola dataset you worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient is contained in a single column. This column has been parsed into a new DataFrame, status_country, where there are separate columns for status and country.

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [3]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
In [4]:
status_country = ebola_melt["str_split"]
In [5]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis = 1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)
(1952, 8)
In [6]:
# Print the head of ebola_tidy
print(ebola_tidy.head())
         Date  Day  type_country  counts        str_split   type country  \
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea   
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea   
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea   
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea   
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea   

         str_split  
0  [Cases, Guinea]  
1  [Cases, Guinea]  
2  [Cases, Guinea]  
3  [Cases, Guinea]  
4  [Cases, Guinea]  

Chapter 3. Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

As Dan showed you in the video, the glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the pattern is part single digit number .csv, you can write the pattern as 'part?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.)

Similarly, you can find all .csv files with '.csv', or all parts with 'part_'. The ? wildcard represents any 1 character, and the * wildcard represents any number of characters.

In [7]:
# Import necessary modules
import pandas as pd
import glob
import os
In [8]:
# check working directory
os.getcwd()
Out[8]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data'
In [10]:
os.chdir("/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data")
os.getcwd()
Out[10]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data'
In [11]:
# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)
In [12]:
# Print the file names
print(csv_files)
['uber-raw-data-apr14.csv', 'uber-raw-data-jun14.csv', 'uber-raw-data-may14.csv']
In [13]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
In [14]:
# Print the head of csv2
print(csv2.head())
          Date/Time      Lat      Lon    Base
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01:00  40.7131 -74.0097  B02512
2  6/1/2014 0:04:00  40.3461 -74.6610  B02512
3  6/1/2014 0:04:00  40.7555 -73.9833  B02512
4  6/1/2014 0:07:00  40.6880 -74.1831  B02512

Chapter 4. Iterating and concatenating all matches

Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

  1. iterate through each of the filenames
  2. read each filename into a DataFrame, and then
  3. append it to the frames list.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [15]:
# Create an empty list: frames
frames = []
In [20]:
# Iterate over csv_files
for csv in csv_files: 
    
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frame
    frames.append(df)
    print('succeed?: ', csv, ' succeded')
succeed?:  uber-raw-data-apr14.csv  succeded
succeed?:  uber-raw-data-jun14.csv  succeded
succeed?:  uber-raw-data-may14.csv  succeded
In [17]:
# Concatenate frames into a single DataFrame: ubser
uber = pd.concat(frames)
In [18]:
# Print the shape of uber
print(uber.shape)
(1880795, 4)
In [19]:
# Print the head of uber
print(uber.head())
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Prerequisites. Programming with Databases - Python

To close, let’s have a look at how to access a database from a general-purpose programming language like Python. Other languages use almost exactly the same model: library and function names may differ, but the concepts are the same.

Here’s a short Python program that selects latitudes and longitudes from an SQLite database stored in a file called survey.db:

Please check here: Software Carpentry SQL lesson.

In [27]:
import sqlite3

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute("SELECT Site.lat, Site.long FROM Site;")
results = cursor.fetchall()
for r in results:
    print(r)
cursor.close()
connection.close()
(-49.85, -128.57)
(-47.15, -126.72)
(-48.87, -123.4)
In [28]:
connection = sqlite3.connect("survey.db")

# site dataframe
site = pd.read_sql_query("select name, lat, long from Site", connection)
print(site.head())
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
In [29]:
connection = sqlite3.connect("survey.db")

# visited dataframe dataframe
visited = pd.read_sql_query("select id, site, dated from Visited", connection)
print(visited.head())
    id  site       dated
0  619  DR-1  1927-02-08
1  622  DR-1  1927-02-10
2  734  DR-3  1930-01-07
3  735  DR-3  1930-01-12
4  751  DR-3  1930-02-26
In [31]:
connection = sqlite3.connect("survey.db")

# visited dataframe dataframe
survey = pd.read_sql_query("select * from Survey", connection)
print(survey.head())
   taken person quant  reading
0    619   dyer   rad     9.82
1    619   dyer   sal     0.13
2    622   dyer   rad     7.80
3    622   dyer   sal     0.09
4    734     pb   rad     8.41

Chapter 5. 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920 and 1930 while they were on an expedition towards Antarctica. The dataset was taken from a sqlite database from the Software Carpentry SQL lesson.

Two DataFrames have been pre-loaded: site and visited. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the 'name' column of site and the 'site' column of visited.

In [30]:
# Merge the DataFrames: o2o
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(m2m)
    name    lat    long   id   site       dated
0   DR-1 -49.85 -128.57  619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57  622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57  844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72  734   DR-3  1930-01-07
4   DR-3 -47.15 -126.72  735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72  751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72  752   DR-3        None
7  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14

Chapter 6. Many-to-Many data merge

The final merging scenario occurs when both DataFrames do not have unique keys for a merge. What happens here is that for each duplicated key, every pairwise combination will be created.

Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help you develop your intuition for many-to-many merges.

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame. Your task is to merge site and visited as you did in the earlier exercises. You will then merge this merged DataFrame with survey.

Begin by exploring the site, visited, and survey DataFrames in the IPython Shell.

In [35]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='id', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))
     name    lat    long   id   site       dated  taken person quant  reading
0    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   rad     9.82
1    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   sal     0.13
2    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   rad     7.80
3    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   sal     0.09
4    DR-1 -49.85 -128.57  844   DR-1  1932-03-22    844    roe   rad    11.25
5    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb   rad     8.41
6    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734   lake   sal     0.05
7    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb  temp   -21.50
8    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735     pb   rad     7.22
9    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None   sal     0.06
10   DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None  temp   -26.00
11   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb   rad     4.35
12   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb  temp   -18.50
13   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751   lake   sal     0.10
14   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   rad     2.19
15   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   sal     0.09
16   DR-3 -47.15 -126.72  752   DR-3        None    752   lake  temp   -16.00
17   DR-3 -47.15 -126.72  752   DR-3        None    752    roe   sal    41.60
18  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   rad     1.46
19  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   sal     0.21

+ Recent posts