All contents are from DataCamp.
Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!
Chapter 1. Converting data types
In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.
The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.
Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')
# Print the info of tips
print(tips.info())
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')
# Print the info of tips
print(tips.info())
Interestingly, by converting sex and smoker to categorical variables, the memory usage of the DataFrame went down from 13.4 KB to 10.1 KB. This may seem like a small difference here, but when you're dealing with large datasets, the reduction in memory usage can be very significant!
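If you want a more precise figure than the .info() summary, the .memory_usage() method reports per-column usage in bytes; passing deep=True makes it measure the contents of object (string) columns as well. A minimal follow-up sketch, not part of the original exercise:
# Per-column memory usage in bytes; deep=True also measures string contents
print(tips.memory_usage(deep=True))
print(tips.memory_usage(deep=True).sum())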
Chapter 2. Working with numeric data
If you expect the data type of a column to be numeric (int or float), but instead it is of type object, this typically means that there is a non-numeric value in the column, which also signifies bad data.
You can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or coerce the value into a missing value, NaN.
A modified version of the tips dataset has been pre-loaded into a DataFrame called tips. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the .info() method to explore this. You'll note that the total_bill and tip columns, which should be numeric, are instead of type object. Your job is to fix this.
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')
# Print the info of tips
print(tips.info())
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')
# Print the info of tips
print(tips.info())
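With errors='coerce', any value that cannot be parsed as a number silently becomes NaN. If you want to inspect which rows were affected, a hypothetical follow-up (not part of the exercise) is to filter on the resulting missing values:
# Show the rows where coercion produced missing values
bad_rows = tips[tips['total_bill'].isnull() | tips['tip'].isnull()]
print(bad_rows)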
Chapter 3. String parsing with regular expressions
In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.
When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx.
The regular expression module in Python is re. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it's better to compile the pattern first using re.compile(), and then use the compiled pattern to match values.
# Import the regular expression module
import re
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
print(prog)
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
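One caveat worth knowing: re.match() only anchors the pattern at the start of the string, so trailing extra characters still count as a match. For strict whole-string validation you can use the compiled pattern's .fullmatch() method (Python 3.4+) or add a $ anchor to the pattern; a small sketch, not part of the exercise:
# match() ignores trailing text, fullmatch() requires the whole string to match
print(bool(prog.match('123-456-7890 ext. 12')))      # True
print(bool(prog.fullmatch('123-456-7890 ext. 12')))  # False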
Chapter 4. Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.
Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'. It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.
When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: you pass in a pattern and a string to re.findall(), and it will return a list of the matches.
# Import the regular expression module
import re
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
# Print the matches
print(matches)
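Note that re.findall() returns the matches as strings. To use them as numbers, for example to compute the strawberry-to-banana ratio mentioned above, convert them first; a minimal sketch using the matches list from this exercise:
# Convert the matched strings to integers before doing arithmetic
counts = [int(m) for m in matches]
print(counts)                  # [10, 1]
print(counts[0] / counts[1])   # ratio of the first number to the second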
Chapter 5. Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)
# Write the third pattern
pattern3 = bool(re.match(pattern=r'\w*', string='Australia'))
print(pattern3)
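Keep in mind that re.match() with an unanchored pattern like \w* succeeds as soon as the start of the string matches, even if later characters would not. Adding a $ anchor makes the check apply to the whole string; a small sketch, not part of the exercise:
# Without an end anchor the trailing '!' is simply ignored
print(bool(re.match(pattern=r'\w*', string='Australia!')))   # True
print(bool(re.match(pattern=r'\w*$', string='Australia!')))  # False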
Chapter 6. Custom functions to clean data
You'll now practice writing functions to clean data.
The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of 'sex' that are neither 'Female' nor 'Male'.
Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.
As Dan showed you in the videos, you can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series, and functions can also be applied across Series. Here, you will apply your function over the 'sex' column.
import numpy as np
# Define recode_gender()
def recode_gender(gender):
    # Return 0 if gender is 'Female'
    if gender == 'Female':
        return 0
    # Return 1 if gender is 'Male'
    elif gender == 'Male':
        return 1
    # Return np.nan for anything else
    else:
        return np.nan
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)
# Print the first five rows of tips
print(tips.head())
For simple recodes like this, you can also use the .replace() method, or convert the column into a categorical type.
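As a sketch of that alternative (not the exercise solution), the same recode can be written with .map(), which sends any value missing from the mapping to NaN, matching the behaviour of recode_gender(); the 'recode_map' column name is just illustrative:
# Equivalent recode using a mapping; unmapped values become NaN
tips['recode_map'] = tips['sex'].map({'Female': 0, 'Male': 1})
# Storing the column as a categorical also reduces memory usage
tips['sex'] = tips['sex'].astype('category')
print(tips.head())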
Chapter 7. Lambda functions
You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.
For example, here's a function that squares a variable used in an .apply() method:
def my_square(x):
    return x ** 2
df.apply(my_square)
The equivalent code using a lambda function is:
df.apply(lambda x: x ** 2)
The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.
The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to clean its 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: with the .replace() method, and with regular expressions. The regular expression module re has been pre-imported.
import re
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
tips["total_dollar"] = '$' + tips['total_bill'].astype(str)
print(tips.head())
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))
# Print the head of tips
print(tips.head())
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])
# Print the head of tips
print(tips.head())
Notice how the 'total_dollar_re' and 'total_dollar_replace' columns are identical.
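One follow-up worth noting (not part of the exercise): both cleaned columns still hold strings, because .replace() and re.findall() operate on text. If you need to do arithmetic with the values, convert them back to a numeric dtype:
# The cleaned values are still strings; convert them for numeric work
tips['total_dollar_replace'] = pd.to_numeric(tips['total_dollar_replace'])
print(tips['total_dollar_replace'].dtype)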
Chapter 8. Dropping duplicate data
Duplicate data causes a variety of problems. From the point of view of performance, duplicate rows use up unnecessary amounts of memory and cause unneeded calculations to be performed when processing data. In addition, they can also bias any analysis results.
A small example DataFrame, data, is constructed below. Your job in this exercise is to check out its columns and then drop all duplicate rows.
import pandas as pd
import numpy as np
from pandas import DataFrame
raw_data = {'col0': [np.nan, 2, 3, 4, 4],
            'col1': [10, 20, np.nan, 40, 40],
            'col2': [100, 200, 300, 400, 400]}
data = DataFrame(raw_data)
print(data)
print(data.info())
data_no_duplicates = data.drop_duplicates()
print(data_no_duplicates)
print(data_no_duplicates.info())
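Before dropping anything, it can be useful to know how many duplicate rows there are in the first place. The .duplicated() method flags every row that repeats an earlier one, so summing it gives a count; a small sketch on the example data above:
# Count the rows flagged as duplicates of an earlier row
print(data.duplicated().sum())   # 1 for this example DataFrame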
Chapter 9. Filling missing data
Here, you'll return to the airquality dataset from Chapter 2. It has been pre-loaded into the DataFrame airquality, and it has missing values for you to practice filling in. Explore airquality in the IPython Shell to check which columns have missing values.
It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values while some calculations will, by default, skip over any missing values.
Also, understanding how much missing data you have, and thinking about where it comes from is crucial to making unbiased interpretations of data.
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.info())
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()
# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)
# Print the info of airquality
print(airquality.info())
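Beyond .info(), a quick way to quantify how much missing data remains per column is to sum the Boolean result of .isnull(); a small follow-up sketch, not part of the exercise:
# Count the remaining missing values in each column
print(airquality.isnull().sum())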
Chapter 10. Testing your data with asserts
Here, you'll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. The dataset has been pre-loaded into a DataFrame called ebola.
In the video, you saw Dan use the .all() method together with the .notnull() DataFrame method to check for missing values in a column. The .all() method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another .all() method so that you return only one True or False value. When using these within an assert statement, nothing will be returned if the assert statement is true: This is how you can confirm that the data you are checking are valid.
Note: You can use pd.notnull(df) as an alternative to df.notnull().
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
# Drop rows that contain any missing values
ebola_drop = ebola.dropna(axis=0)
print(ebola_drop.head())
# Assert that there are no missing values
assert pd.notnull(ebola_drop).all().all()
# Assert that all numeric values are >= 0 (the Date column holds strings, so restrict the check to numeric columns)
assert (ebola_drop.select_dtypes(include='number') >= 0).all().all()
Since the assert statements did not throw any errors, you can be sure that there are no missing values in the data and that all numeric values are >= 0!
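A bare assert gives no explanation when it fails. If you want a more informative failure, a small sketch (not part of the exercise) is to attach a message to the assert statement:
# A failed assert raises AssertionError with the attached message
assert pd.notnull(ebola_drop).all().all(), 'ebola_drop still contains missing values'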