Chapter5_Case_Study

ALL CONTENTS ARE FROM DATACAMP

You'll apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using the power of Python!

Chapter 1. Exploratory analysis

Whenever you obtain a new dataset, your first task should always be to do some exploratory analysis to get a better understanding of the data and diagnose it for any potential issues.

The Gapminder data for the 19th century has been loaded into a DataFrame called g1800s. In the IPython Shell, use pandas methods such as .head(), .info(), and .describe(), and DataFrame attributes like .columns and .shape to explore it.

In [91]:
import pandas as pd
url = 'https://assets.datacamp.com/production/course_2023/datasets/gapminder.csv'
gapminder = pd.read_csv(url, index_col=0)
print(gapminder.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 218 entries, 1800 to Life expectancy
dtypes: float64(217), object(1)
memory usage: 1.3+ MB
None
In [92]:
print(gapminder.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

Chapter 2. Visualizing your data

Since 1800, life expectancy around the globe has been steadily going up. You would expect the Gapminder data to confirm this.

The DataFrame g1800s has been pre-loaded. Your job in this exercise is to create a scatter plot with life expectancy in '1800' on the x-axis and life expectancy in '1899' on the y-axis.

Here, the goal is to visually check the data for insights as well as errors. When looking at the plot, pay attention to whether the scatter plot takes the form of a diagonal line, and which points fall below or above the diagonal line. This will inform how life expectancy in 1899 changed (or did not change) compared to 1800 for different countries. If points fall on a diagonal line, it means that life expectancy remained the same!

In [93]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
In [94]:
# Create the scatter plot
gapminder.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

Excellent work! As you can see, there are a surprising number of countries that fall on the diagonal line. In fact, examining the DataFrame reveals that the life expectancy for 140 of the 260 countries did not change at all in the 19th century! This is possibly a result of not having access to the data for all the years back then. In this way, visualizing your data can help you uncover insights as well as diagnose it for errors.
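
One quick way to verify that claim against the wide gapminder DataFrame loaded above is a minimal sketch like the following (the exact count may differ slightly between data versions):

# Select the 19th-century year columns
century = gapminder.loc[:, '1800':'1899']

# A country is "unchanged" if it has exactly one distinct value
# (all-NaN rows have zero distinct values and are excluded)
unchanged = century.nunique(axis=1) == 1
print(unchanged.sum())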

Chapter 3. Thinking about the question at hand

Since you are given life expectancy level data by country and year, you could ask questions about how much the average life expectancy changes over each year.

Before continuing, however, it's important to make sure that the following assumptions about the data are true:

  • 'Life expectancy' is the last column of the DataFrame.
  • The other columns contain either null or numeric values.
  • The numeric values are all greater than or equal to 0.
  • There is only one instance of each country.

You can write a function that you can apply over the entire DataFrame to verify some of these assumptions. Note that spending the time to write such a script will help you when working with other datasets as well.

In [95]:
import pandas as pd

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0
In [96]:
# Check whether the last column is 'Life expectancy'
assert gapminder.columns[217] == 'Life expectancy'
In [97]:
# Check whether the values in the year columns are valid
# (columns 0-216 are the 217 year columns; column 217 is 'Life expectancy')
assert gapminder.iloc[:, 0:217].apply(check_null_or_valid, axis=1).all().all()
In [98]:
# Check that there is only one instance of each country
assert gapminder['Life expectancy'].value_counts()[0] == 1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-98-5d9f300af706> in <module>
      1 # Check that there is only one instance of each country
----> 2 assert gapminder['Life expectancy'].value_counts()[0] == 1

AssertionError: 
In [99]:
# If you get an error, check the top count to see why
print(gapminder['Life expectancy'].value_counts()[0])
3

Getting into the habit of testing your code like this is an important skill. Here the assertion fails because this particular CSV already stacks the three century blocks, so each country appears three times, which is why the top count printed above is 3.

Chapter 4. Disassembling & Assembling your data

Here, three DataFrames have been pre-loaded: g1800s, g1900s, and g2000s. These contain the Gapminder life expectancy data for, respectively, the 19th century, the 20th century, and the 21st century.

Your task in this exercise is to concatenate them into a single DataFrame called gapminder. This is a row-wise concatenation, similar to how you concatenated the monthly Uber datasets in Chapter 3.

In [104]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1800':'1899'])
g1800s = gapminder.loc[:, cols]
print(g1800s.columns)
Index(['Life expectancy', '1800', '1801', '1802', '1803', '1804', '1805',
       '1806', '1807', '1808',
       ...
       '1890', '1891', '1892', '1893', '1894', '1895', '1896', '1897', '1898',
       '1899'],
      dtype='object', length=101)
In [105]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1900':'1999'])
g1900s = gapminder.loc[:, cols]
print(g1900s.columns)
Index(['Life expectancy', '1900', '1901', '1902', '1903', '1904', '1905',
       '1906', '1907', '1908',
       ...
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999'],
      dtype='object', length=101)
In [106]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'2000':'2016'])
g2000s = gapminder.loc[:, cols]
print(g2000s.columns)
Index(['Life expectancy', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
       '2015', '2016'],
      dtype='object')
In [107]:
# Concatenate the DataFrames row-wise
gapminder2 = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder2.shape)
(2340, 218)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  
In [108]:
# Print the head of gapminder
print(gapminder2.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

All the Gapminder data, from 1800 to 2016, is now contained in one DataFrame.

Chapter 5. Reshaping your data

Now that you have all the data combined into a single DataFrame, the next step is to reshape it into a tidy data format.

Currently, the gapminder DataFrame has a separate column for each year. What you want instead is a single column that contains the year, and a single column that represents the average life expectancy for each year and country. By having year in its own column, you can use it as a predictor variable in a later analysis.

You can convert the DataFrame into the desired tidy format by melting it.

In [109]:
import pandas as pd

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars = 'Life expectancy')
In [110]:
# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
In [111]:
# Print the head of gapminder_melt
print(gapminder_melt.head())
                 country  year  life_expectancy
0               Abkhazia  1800              NaN
1            Afghanistan  1800            28.21
2  Akrotiri and Dhekelia  1800              NaN
3                Albania  1800            35.40
4                Algeria  1800            28.82

Chapter 6. Checking the data types

Now that your data are in the proper shape, you need to ensure that the columns are of the proper data type. That is, you need to ensure that country is of type object, year is of type int64, and life_expectancy is of type float64.

The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython Shell using the .info() method. Notice that the column 'year' is of type object. This is incorrect, so you'll need to use the pd.to_numeric() function to convert it to a numeric data type.

NumPy and pandas have been pre-imported as np and pd.

In [114]:
# Convert the year column to numeric
gapminder = gapminder_melt
gapminder.year = pd.to_numeric(gapminder.year)
In [116]:
# Test if country is of type object
import numpy as np
assert gapminder.country.dtypes == object
In [117]:
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
In [118]:
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Excellent work! Since the assert statements did not throw any errors, you can be sure that your columns have the correct data types!

Chapter 7. Looking at country spellings

Having tidied your DataFrame and checked the data types, your next task in the data cleaning process is to look at the 'country' column to see if there are any special or invalid characters you may need to deal with.

It is reasonable to assume that country names will contain:

  • The set of lower and upper case letters.
  • Whitespace between words.
  • Periods for any abbreviations.

To confirm that this is the case, you can leverage the power of regular expressions again. For common operations like this, pandas has a built-in string method, str.contains(), which takes a regular expression pattern and applies it to a Series, returning True if there is a match, and False otherwise.

Since here you want to find the values that do not match, you have to invert the boolean, which can be done using ~. This Boolean series can then be used to get the Series of countries that have invalid names.

In [119]:
# Create the series of countries: countries
countries = gapminder['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)
49            Congo, Dem. Rep.
50                 Congo, Rep.
53               Cote d'Ivoire
73      Falkland Is (Malvinas)
93               Guinea-Bissau
98            Hong Kong, China
118    United Korea (former)\n
131               Macao, China
132             Macedonia, FYR
145      Micronesia, Fed. Sts.
161            Ngorno-Karabakh
187             St. Barthélemy
193     St.-Pierre-et-Miquelon
225                Timor-Leste
251      Virgin Islands (U.S.)
252       North Yemen (former)
253       South Yemen (former)
258                      Åland
Name: country, dtype: object

As you can see, not all these country names are actually invalid, so maybe the assumptions need to be tweaked a little. However, there certainly are a few cases worth further investigation, such as St. Barthélemy. Whenever you are dealing with columns of raw data consisting of strings, it is important to check them for consistency like this.

Chapter 8. More data cleaning and processing

It's now time to deal with the missing data. There are several strategies for this: You can drop them, fill them in using the mean of the column or row that the missing value is in (also known as imputation), or, if you are dealing with time series data, use a forward fill or backward fill, in which you replace missing values in a column with the most recent known value in the column. See pandas Foundations for more on forward fill and backward fill.

In general, it is not the best idea to drop missing values, because in doing so you may end up throwing away useful information. In this data, the missing values refer to years where no estimate for life expectancy is available for a given country. You could fill in, or guess what these life expectancies could be by looking at the average life expectancies for other countries in that year, for example. Whichever strategy you go with, it is important to carefully consider all options and understand how they will affect your data.
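
As a minimal sketch of those alternative strategies on the tidy gapminder frame from above (these lines are illustrations, not part of the original exercise):

# Imputation: replace missing life expectancies with the column mean
mean_filled = gapminder['life_expectancy'].fillna(gapminder['life_expectancy'].mean())

# Forward fill within each country after sorting by year, so a gap is
# filled with that country's most recent known value (use .bfill() for
# a backward fill)
ffilled = (gapminder.sort_values(['country', 'year'])
                    .groupby('country')['life_expectancy']
                    .ffill())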

In this exercise, you'll practice dropping missing values. Your job is to drop all the rows that have NaN in the life_expectancy column. Before doing so, it would be valuable to use assert statements to confirm that year and country do not have any missing values.

Begin by printing the shape of gapminder in the IPython Shell prior to dropping the missing values. Complete the exercise to find out what its shape will be after dropping the missing values!

In [120]:
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
In [121]:
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
In [122]:
# Drop the missing values
gapminder = gapminder.dropna(how='any')

# Print the shape of gapminder
print(gapminder.shape)
(43857, 3)

After dropping the missing values from 'life_expectancy', the number of rows in the DataFrame has gone down from 169260 to 43857. In general, you should avoid dropping too much of your data, but if there is no reasonable way to fill in or impute missing values, then dropping the missing data may be the best solution.

Chapter 9. Wrapping up

Now that you have a clean and tidy dataset, you can do a bit of visualization and aggregation. In this exercise, you'll begin by creating a histogram of the life_expectancy column. You should not get any values under 0 and you should see something reasonable on the higher end of the life_expectancy age range.

Your next task is to investigate how average life expectancy changed over the years. To do this, you need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. You can achieve this using the .groupby() method. This .groupby() method is covered in greater depth in Manipulating DataFrames with pandas.

Finally, you can save your tidy and summarized DataFrame to a file using the .to_csv() method.

matplotlib.pyplot and pandas have been pre-imported as plt and pd. Go for it!

In [123]:
# Add first subplot
plt.subplot(2, 1, 1) 
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1195bc780>
In [124]:
# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d0a2978>
In [125]:
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
In [126]:
# Print the head of gapminder_agg
print(gapminder_agg.head())
year
1800    31.486020
1801    31.448905
1802    31.463483
1803    31.377413
1804    31.446318
Name: life_expectancy, dtype: float64
In [127]:
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
year
2012    71.663077
2013    71.916106
2014    72.088125
2015    72.321010
2016    72.556635
Name: life_expectancy, dtype: float64
In [128]:
# Add second subplot
plt.subplot(2, 1, 2)
Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab18a20>
In [129]:
# Create a line plot of life expectancy per year
gapminder_agg.plot()
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x1086d77b8>
In [130]:
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
Out[130]:
Text(0.5, 0, 'Year')
In [131]:
# Display the plots
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
In [132]:
# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

You've stepped through each stage of the data cleaning process and your data is now ready for serious analysis! Looking at the line plot, it seems like life expectancy has, as expected, increased over the years. There is a surprising dip around 1920 that may be worth further investigation!

Chapter4_Cleaning_data_for_analysis


All Contents are from DataCamp

Here, you'll dive into some of the grittier aspects of data cleaning. You'll learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable!

Chapter 1. Converting data types

In this exercise, you'll see how ensuring all categorical variables in a DataFrame are of type category reduces memory usage.

The tips dataset has been loaded into a DataFrame called tips. This data contains information about how much a customer tipped, whether the customer was male or female, a smoker or not, etc.

Look at the output of tips.info() in the IPython Shell. You'll note that two columns that should be categorical - sex and smoker - are instead of type object, which is pandas' way of storing arbitrary strings. Your job is to convert these two columns to type category and note the reduced memory usage.

In [1]:
import pandas as pd
In [2]:
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [5]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null object
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(1), float64(2), int64(1), object(3)
memory usage: 11.8+ KB
None
In [6]:
# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None

Interestingly, by converting sex and smoker to categorical variables, the memory usage of the DataFrame went down with each conversion (11.8 KB after converting sex, 10.3 KB after also converting smoker). This may seem like a small difference here, but when you're dealing with large datasets, the reduction in memory usage can be very significant!
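
To quantify the savings yourself, you can ask pandas for per-column memory usage; deep=True also counts the string contents of object columns:

# Per-column memory usage in bytes
print(tips.memory_usage(deep=True))

# Total memory usage in kilobytes
print(tips.memory_usage(deep=True).sum() / 1024)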

Chapter 2. Working with numeric data

If you expect the data type of a column to be numeric (int or float), but instead it is of type object, this typically means that there is a non-numeric value in the column, which also signifies bad data.

You can use the pd.to_numeric() function to convert a column into a numeric data type. If the function raises an error, you can be sure that there is a bad value within the column. You can either use the techniques you learned in Chapter 1 to do some exploratory data analysis and find the bad value, or you can choose to ignore or coerce the value into a missing value, NaN.

A modified version of the tips dataset has been pre-loaded into a DataFrame called tips. For instructional purposes, it has been pre-processed to introduce some 'bad' data for you to clean. Use the .info() method to explore this. You'll note that the total_bill and tip columns, which should be numeric, are instead of type object. Your job is to fix this.

In [7]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None
In [8]:
# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
total_bill    244 non-null float64
tip           244 non-null float64
sex           244 non-null category
smoker        244 non-null category
day           244 non-null object
time          244 non-null object
size          244 non-null int64
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.3+ KB
None

Chapter 3. String parsing with regular expressions

In the video, Dan introduced you to the basics of regular expressions, which are powerful ways of defining patterns to match strings. This exercise will get you started with writing them.

When working with data, it is sometimes necessary to write a regular expression to look for properly entered values. Phone numbers in a dataset are a common field that needs to be checked for validity. Your job in this exercise is to define a regular expression to match US phone numbers that fit the pattern of xxx-xxx-xxxx.

The regular expression module in Python is re. When performing pattern matching on data, since the pattern will be used for a match across multiple rows, it's better to compile the pattern first using re.compile(), and then use the compiled pattern to match values.

In [9]:
# Import the regular expression module
import re
In [10]:
# Compile the pattern: prog
prog = re.compile(r'\d{3}-\d{3}-\d{4}')
print(prog)
re.compile('\\d{3}-\\d{3}-\\d{4}')
In [11]:
# See if the pattern matches
result = prog.match('123-456-7890')
print(bool(result))
True
In [12]:
# See if the pattern matches
result2 = prog.match('1123-456-7890')
print(bool(result2))
False

Chapter 4. Extracting numerical values from strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to re.findall(), and it will return a list of the matches.

In [13]:
# Import the regular expression module
import re

# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

# Print the matches
print(matches)
['10', '1']

Chapter 5. Pattern matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

In [16]:
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
True
In [17]:
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)
True
In [18]:
# Write the third pattern
pattern3 = bool(re.match(pattern=r'\w*', string='Australia'))
print(pattern3)
True

Chapter 6. Custom functions to clean data

You'll now practice writing functions to clean data.

The tips dataset has been pre-loaded into a DataFrame called tips. It has a 'sex' column that contains the values 'Male' or 'Female'. Your job is to write a function that will recode 'Female' to 0, 'Male' to 1, and return np.nan for all entries of 'sex' that are neither 'Female' nor 'Male'.

Recoding variables like this is a common data cleaning task. Functions provide a mechanism for you to abstract away complex bits of code as well as reuse code. This makes your code more readable and less error prone.

As Dan showed you in the videos, you can use the .apply() method to apply a function across entire rows or columns of DataFrames. However, note that each column of a DataFrame is a pandas Series. Functions can also be applied across Series. Here, you will apply your function over the 'sex' column.

In [19]:
import numpy as np
In [20]:
## Define recode_gender()
def recode_gender(gender): 
    
    # Return 0 if gender is "Female"
    if gender == 'Female':
        return 0
    
    # Return 1 if gender is 'Male'
    elif gender == 'Male': 
        return 1
    
    # Return np.nan
    else:
        return np.nan
In [21]:
# Apply the function to the sex column
tips['recode'] = tips.sex.apply(recode_gender)

# Print the first five rows of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size recode
0       16.99  1.01  Female     No  Sun  Dinner     2      0
1       10.34  1.66    Male     No  Sun  Dinner     3      1
2       21.01  3.50    Male     No  Sun  Dinner     3      1
3       23.68  3.31    Male     No  Sun  Dinner     2      1
4       24.59  3.61  Female     No  Sun  Dinner     4      0

For simple recodes, you can also use the replace method. You can also convert the column into a categorical type.
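
As a minimal sketch of those two alternatives (recode_map and the new column names are illustrative, not part of the exercise):

# Recode with .replace() instead of .apply(); note that, unlike
# recode_gender(), values missing from the map are left unchanged
# rather than becoming np.nan
recode_map = {'Female': 0, 'Male': 1}
tips['recode_replace'] = tips.sex.astype(str).replace(recode_map)

# Or store sex as an explicit categorical type
tips['sex_cat'] = tips.sex.astype('category')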

Chapter 7. Lambda functions

You'll now be introduced to a powerful Python feature that will help you clean your data more effectively: lambda functions. Instead of using the def syntax that you used in the previous exercise, lambda functions let you make simple, one-line functions.

For example, here's a function that squares a variable used in an .apply() method:

def my_square(x):
    return x ** 2

df.apply(my_square)

The equivalent code using a lambda function is:

df.apply(lambda x: x ** 2)

The lambda function takes one parameter - the variable x. The function itself just squares x and returns the result, which is whatever the one line of code evaluates to. In this way, lambda functions can make your code concise and Pythonic.

The tips dataset has been pre-loaded into a DataFrame called tips. Your job is to clean its 'total_dollar' column by removing the dollar sign. You'll do this using two different methods: With the .replace() method, and with regular expressions. The regular expression module re has been pre-imported.

In [24]:
import re
In [34]:
url = "https://assets.datacamp.com/production/course_2023/datasets/tips.csv"
tips = pd.read_csv(url)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
In [35]:
tips["total_dollar"] = '$' + tips['total_bill'].astype(str)
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59
In [36]:
# Write the lambda function using replace
tips['total_dollar_replace'] = tips.total_dollar.apply(lambda x: x.replace('$', ''))

# Print the head of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68   
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59   

  total_dollar_replace  
0                16.99  
1                10.34  
2                21.01  
3                23.68  
4                24.59  
In [38]:
# Write the lambda function using regular expressions
tips['total_dollar_re'] = tips.total_dollar.apply(lambda x: re.findall(r'\d+\.\d+', x)[0])

# Print the head of tips
print(tips.head())
   total_bill   tip     sex smoker  day    time  size total_dollar  \
0       16.99  1.01  Female     No  Sun  Dinner     2       $16.99   
1       10.34  1.66    Male     No  Sun  Dinner     3       $10.34   
2       21.01  3.50    Male     No  Sun  Dinner     3       $21.01   
3       23.68  3.31    Male     No  Sun  Dinner     2       $23.68   
4       24.59  3.61  Female     No  Sun  Dinner     4       $24.59   

  total_dollar_replace total_dollar_re  
0                16.99           16.99  
1                10.34           10.34  
2                21.01           21.01  
3                23.68           23.68  
4                24.59           24.59  

Notice how the 'total_dollar_re' and 'total_dollar_replace' columns are identical.
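
You can confirm that with a quick sanity check in the spirit of the assert statements used elsewhere in these notes:

# Both cleaning approaches should agree on every row
assert (tips.total_dollar_replace == tips.total_dollar_re).all()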

Chapter 8. Dropping duplicate data

Duplicate data causes a variety of problems. From the point of view of performance, duplicates use up unnecessary amounts of memory and cause unneeded calculations to be performed when processing data. In addition, they can also bias any analysis results.

A small example DataFrame, data, is constructed below. Your job in this exercise is to drop all of its duplicate rows.

In [31]:
import pandas as pd
import numpy as np
from pandas import DataFrame
In [32]:
raw_data = {'col0': [np.nan, 2, 3, 4, 4],
            'col1': [10, 20, np.nan, 40, 40],
            'col2': [100, 200, 300, 400, 400]}

data = DataFrame(raw_data)
print(data)
   col0  col1  col2
0   NaN  10.0   100
1   2.0  20.0   200
2   3.0   NaN   300
3   4.0  40.0   400
4   4.0  40.0   400
In [27]:
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
col0    4 non-null float64
col1    4 non-null float64
col2    5 non-null int64
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes
None
In [25]:
data_no_duplicates = data.drop_duplicates()
print(data_no_duplicates)
   col0  col1  col2
0   NaN  10.0   100
1   2.0  20.0   200
2   3.0   NaN   300
3   4.0  40.0   400
In [26]:
print(data_no_duplicates.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 3 columns):
col0    3 non-null float64
col1    3 non-null float64
col2    4 non-null int64
dtypes: float64(2), int64(1)
memory usage: 128.0 bytes
None

Chapter 9. Filling missing data

Here, you'll return to the airquality dataset from Chapter 2. It has been pre-loaded into the DataFrame airquality, and it has missing values for you to practice filling in. Explore airquality in the IPython Shell to check out which columns have missing values.

It's rare to have a (real-world) dataset without any missing values, and it's important to deal with them because certain calculations cannot handle missing values, while others will, by default, skip over them.

Also, understanding how much missing data you have, and thinking about where it comes from is crucial to making unbiased interpretations of data.

In [30]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      116 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None
In [29]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
Ozone      153 non-null float64
Solar.R    146 non-null float64
Wind       153 non-null float64
Temp       153 non-null int64
Month      153 non-null int64
Day        153 non-null int64
dtypes: float64(3), int64(3)
memory usage: 7.2 KB
None

Chapter 10. Testing your data with asserts

Here, you'll practice writing assert statements using the Ebola dataset from previous chapters to programmatically check for missing values and to confirm that all values are positive. The dataset has been pre-loaded into a DataFrame called ebola.

In the video, you saw Dan use the .all() method together with the .notnull() DataFrame method to check for missing values in a column. The .all() method returns True if all values are True. When used on a DataFrame, it returns a Series of Booleans - one for each column in the DataFrame. So if you are using it on a DataFrame, like in this exercise, you need to chain another .all() method so that you return only one True or False value. When using these within an assert statement, nothing will be returned if the assert statement is true: This is how you can confirm that the data you are checking are valid.

Note: You can use pd.notnull(df) as an alternative to df.notnull().

In [39]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [44]:
# Drop all rows that contain any missing values
ebola_drop = ebola.dropna(axis = 0)
print(ebola_drop.head())
          Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
19  11/18/2014  241        2047.0         7082.0             6190.0   

    Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
19           20.0            1.0                 4.0          1.0         6.0   

    Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
19         1214.0          2963.0              1267.0             8.0   

    Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
19             0.0                  1.0           0.0          6.0  
In [45]:
# Assert that there are no missing values
assert pd.notnull(ebola_drop).all().all()
In [47]:
# Assert that all values are >= 0
assert (ebola_drop >= 0).all().all()

Since the assert statements did not throw any errors, you can be sure that there are no missing values in the data and that all values are >= 0!


Chapter3_Combining_data_for_analysis

All contents are from DataCamp.

The ability to transform and combine your data is a crucial skill in data science, because your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. Or if you are dealing with time series data, for example, you may have a new dataset for each day. No matter the reason, it is important to be able to combine datasets so you can either clean a single dataset, or clean each dataset separately and then combine them later so you can run your analysis on a single dataset. In this chapter, you'll learn all about combining data.

Chapter 1. Combining rows of data

The dataset you'll be working with here relates to NYC Uber data. The original dataset has all the originating Uber pickup locations by time and latitude and longitude. For didactic purposes, you'll be working with a very small portion of the actual data.

Three DataFrames have been pre-loaded: uber1, which contains data for April 2014, uber2, which contains data for May 2014, and uber3, which contains data for June 2014. Your job in this exercise is to concatenate these DataFrames together such that the resulting DataFrame has the data for all three months.

Begin by exploring the structure of these three DataFrames in the IPython Shell using methods such as .head().

Code from DataCamp is below.

# Concatenate uber1, uber2, and uber3: row_concat
row_concat = pd.concat([uber1, uber2, uber3])

# Print the shape of row_concat
print(row_concat.shape)

# Print the head of row_concat
print(row_concat.head())
In [1]:
# my code below
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/nyc_uber_2014.csv"
row_concat = pd.read_csv(url)
print(row_concat.head())
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Chapter 2. Combining columns of data

Think of column-wise concatenation of data as stitching data together from the sides instead of the top and bottom. To perform this action, you use the same pd.concat() function, but this time with the keyword argument axis=1. The default, axis=0, is for a row-wise concatenation.

You'll return to the Ebola dataset you worked with briefly in the last chapter. It has been pre-loaded into a DataFrame called ebola_melt. In this DataFrame, the status and country of a patient are contained in a single column. This column has been parsed into a new DataFrame, status_country, where there are separate columns for status and country.

Explore the ebola_melt and status_country DataFrames in the IPython Shell. Your job is to concatenate them column-wise in order to obtain a final, clean DataFrame.

In [3]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
In [4]:
status_country = ebola_melt["str_split"]
In [5]:
# Concatenate ebola_melt and status_country column-wise: ebola_tidy
ebola_tidy = pd.concat([ebola_melt, status_country], axis = 1)

# Print the shape of ebola_tidy
print(ebola_tidy.shape)
(1952, 8)
In [6]:
# Print the head of ebola_tidy
print(ebola_tidy.head())
         Date  Day  type_country  counts        str_split   type country  \
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea   
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea   
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea   
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea   
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea   

         str_split  
0  [Cases, Guinea]  
1  [Cases, Guinea]  
2  [Cases, Guinea]  
3  [Cases, Guinea]  
4  [Cases, Guinea]  

Chapter 3. Finding files that match a pattern

You're now going to practice using the glob module to find all csv files in the workspace. In the next exercise, you'll programmatically load them into DataFrames.

As Dan showed you in the video, the glob module has a function called glob that takes a pattern and returns a list of the files in the working directory that match that pattern.

For example, if you know the pattern is part_ followed by a single-digit number and .csv, you can write the pattern as 'part_?.csv' (which would match part_1.csv, part_2.csv, part_3.csv, etc.)

Similarly, you can find all .csv files with '*.csv', or all parts with 'part_*'. The ? wildcard represents any 1 character, and the * wildcard represents any number of characters.

In [7]:
# Import necessary modules
import pandas as pd
import glob
import os
In [8]:
# check working directory
os.getcwd()
Out[8]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data'
In [10]:
os.chdir("/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data")
os.getcwd()
Out[10]:
'/Users/jihoonjung/Documents/python_e_learning/DataCamp/Cleaning_Data/data'
In [11]:
# Write the pattern: pattern
pattern = '*.csv'

# Save all file matches: csv_files
csv_files = glob.glob(pattern)
In [12]:
# Print the file names
print(csv_files)
['uber-raw-data-apr14.csv', 'uber-raw-data-jun14.csv', 'uber-raw-data-may14.csv']
In [13]:
# Load the second file into a DataFrame: csv2
csv2 = pd.read_csv(csv_files[1])
In [14]:
# Print the head of csv2
print(csv2.head())
          Date/Time      Lat      Lon    Base
0  6/1/2014 0:00:00  40.7293 -73.9920  B02512
1  6/1/2014 0:01:00  40.7131 -74.0097  B02512
2  6/1/2014 0:04:00  40.3461 -74.6610  B02512
3  6/1/2014 0:04:00  40.7555 -73.9833  B02512
4  6/1/2014 0:07:00  40.6880 -74.1831  B02512

Chapter 4. Iterating and concatenating all matches

Now that you have a list of filenames to load, you can load all the files into a list of DataFrames that can then be concatenated.

You'll start with an empty list called frames. Your job is to use a for loop to:

  1. iterate through each of the filenames
  2. read each filename into a DataFrame, and then
  3. append it to the frames list.

You can then concatenate this list of DataFrames using pd.concat(). Go for it!

In [15]:
# Create an empty list: frames
frames = []
In [20]:
# Iterate over csv_files
for csv in csv_files: 
    
    # Read csv into a DataFrame: df
    df = pd.read_csv(csv)
    
    # Append df to frames
    frames.append(df)
    print('succeed?: ', csv, ' succeeded')
succeed?:  uber-raw-data-apr14.csv  succeeded
succeed?:  uber-raw-data-jun14.csv  succeeded
succeed?:  uber-raw-data-may14.csv  succeeded
In [17]:
# Concatenate frames into a single DataFrame: uber
uber = pd.concat(frames)
In [18]:
# Print the shape of uber
print(uber.shape)
(1880795, 4)
In [19]:
# Print the head of uber
print(uber.head())
          Date/Time      Lat      Lon    Base
0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4  4/1/2014 0:33:00  40.7594 -73.9722  B02512

Prerequisites. Programming with Databases - Python

To close, let’s have a look at how to access a database from a general-purpose programming language like Python. Other languages use almost exactly the same model: library and function names may differ, but the concepts are the same.

Here’s a short Python program that selects latitudes and longitudes from an SQLite database stored in a file called survey.db:

Please check here: Software Carpentry SQL lesson.

In [27]:
import sqlite3

connection = sqlite3.connect("survey.db")
cursor = connection.cursor()
cursor.execute("SELECT Site.lat, Site.long FROM Site;")
results = cursor.fetchall()
for r in results:
    print(r)
cursor.close()
connection.close()
(-49.85, -128.57)
(-47.15, -126.72)
(-48.87, -123.4)
In [28]:
connection = sqlite3.connect("survey.db")

# site dataframe
site = pd.read_sql_query("select name, lat, long from Site", connection)
print(site.head())
    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
In [29]:
connection = sqlite3.connect("survey.db")

# visited dataframe
visited = pd.read_sql_query("select id, site, dated from Visited", connection)
print(visited.head())
    id  site       dated
0  619  DR-1  1927-02-08
1  622  DR-1  1927-02-10
2  734  DR-3  1930-01-07
3  735  DR-3  1930-01-12
4  751  DR-3  1930-02-26
In [31]:
connection = sqlite3.connect("survey.db")

# survey dataframe
survey = pd.read_sql_query("select * from Survey", connection)
print(survey.head())
   taken person quant  reading
0    619   dyer   rad     9.82
1    619   dyer   sal     0.13
2    622   dyer   rad     7.80
3    622   dyer   sal     0.09
4    734     pb   rad     8.41

Chapter 5. 1-to-1 data merge

Merging data allows you to combine disparate datasets into a single dataset to do more complex analysis.

Here, you'll be using survey data that contains readings that William Dyer, Frank Pabodie, and Valentina Roerich took in the late 1920s and early 1930s while they were on an expedition towards Antarctica. The dataset was taken from an SQLite database from the Software Carpentry SQL lesson.

Two DataFrames have been pre-loaded: site and visited. Explore them in the IPython Shell and take note of their structure and column names. Your task is to perform a 1-to-1 merge of these two DataFrames using the 'name' column of site and the 'site' column of visited.

In [30]:
# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Print o2o
print(o2o)
    name    lat    long   id   site       dated
0   DR-1 -49.85 -128.57  619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57  622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57  844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72  734   DR-3  1930-01-07
4   DR-3 -47.15 -126.72  735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72  751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72  752   DR-3        None
7  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14

Chapter 6. Many-to-Many data merge

The final merging scenario occurs when both DataFrames do not have unique keys for a merge. What happens here is that for each duplicated key, every pairwise combination will be created.

Two example DataFrames that share common key values have been pre-loaded: df1 and df2. Another DataFrame df3, which is the result of df1 merged with df2, has been pre-loaded. All three DataFrames have been printed - look at the output and notice how pairwise combinations have been created. This example is to help you develop your intuition for many-to-many merges.
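
Since df1, df2, and df3 are not reproduced in these notes, here is a minimal made-up sketch of how duplicated keys multiply:

import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'a', 'b'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'a', 'b'], 'y': [10, 20, 30]})

# Each 'a' row in df1 pairs with each 'a' row in df2,
# so key 'a' yields 2 x 2 = 4 rows in the result
df3 = pd.merge(df1, df2, on='key')
print(df3)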

Here, you'll work with the site and visited DataFrames from before, and a new survey DataFrame. Your task is to merge site and visited as you did in the earlier exercises. You will then merge this merged DataFrame with survey.

Begin by exploring the site, visited, and survey DataFrames in the IPython Shell.

In [35]:
# Merge site and visited: m2m
m2m = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2m and survey: m2m
m2m = pd.merge(left=m2m, right=survey, left_on='id', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))
     name    lat    long   id   site       dated  taken person quant  reading
0    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   rad     9.82
1    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   sal     0.13
2    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   rad     7.80
3    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   sal     0.09
4    DR-1 -49.85 -128.57  844   DR-1  1932-03-22    844    roe   rad    11.25
5    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb   rad     8.41
6    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734   lake   sal     0.05
7    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb  temp   -21.50
8    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735     pb   rad     7.22
9    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None   sal     0.06
10   DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None  temp   -26.00
11   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb   rad     4.35
12   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb  temp   -18.50
13   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751   lake   sal     0.10
14   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   rad     2.19
15   DR-3 -47.15 -126.72  752   DR-3        None    752   lake   sal     0.09
16   DR-3 -47.15 -126.72  752   DR-3        None    752   lake  temp   -16.00
17   DR-3 -47.15 -126.72  752   DR-3        None    752    roe   sal    41.60
18  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   rad     1.46
19  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14    837   lake   sal     0.21

Chapter2_Tidying_data_for_analysis

Here, you'll learn about the principles of tidy data and more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

Chapter 1. Recognizing tidy data

For data to be tidy, it must have:

  • Each variable as a separate column.
  • Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.
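
As a tiny made-up illustration, the same data in untidy (wide) and tidy (long) form:

import pandas as pd

# Untidy: one column per treatment, values spread across columns
untidy = pd.DataFrame({'name': ['Daniel', 'John'],
                       'treatment a': [18, 12],
                       'treatment b': [42, 31]})

# Tidy: one column per variable, one row per observation
tidy = pd.melt(untidy, id_vars='name',
               var_name='treatment', value_name='result')
print(tidy)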

Chapter 2. Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently (e.g., this could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

In [3]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
In [5]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame = airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN

Chapter 3. Customizing melted data

When melting DataFrames, it would be better to have column names more meaningful than variable and value (the default names used by pd.melt()).

The default names may work in certain situations, but it's best to always have data that is self explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. pandas (as pd) and the DataFrame airquality have been pre-loaded for you.

In [6]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')


# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN

Chapter 4. Pivot data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns them into a single column, pivoting creates a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). The melted DataFrame airquality_melt has been pre-loaded for you.

In [7]:
# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
In [8]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
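As an aside (not part of the exercise), when every (Month, Day, measurement) combination is unique, as it is here, recent pandas versions (1.1+) also let you use the plainer .pivot() method with a list of index columns; unlike .pivot_table(), it raises an error on duplicates instead of aggregating them:

# .pivot() requires unique index/column combinations
airquality_pivot_alt = airquality_melt.pivot(index=['Month', 'Day'],
                                             columns='measurement',
                                             values='reading')

# Should print True: the results should match once columns are sorted,
# since .pivot_table() sorts its column labels alphabetically
print(airquality_pivot_alt.sort_index(axis=1).equals(airquality_pivot))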

Chapter 5. Resetting the index of a DataFrame

After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).

Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas. In essence, they allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from airquality_pivot, which has been pre-loaded.

In [9]:
# Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
           names=['Month', 'Day'])
In [10]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
In [11]:
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)
In [12]:
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
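To confirm the round trip, you could compare this result against the original airquality after reordering the columns and restoring the dtypes (a hedged sketch: Temp was an integer column originally and became float during the melt):

# Reorder columns to match the original layout
roundtrip = airquality_pivot_reset[['Ozone', 'Solar.R', 'Wind', 'Temp',
                                    'Month', 'Day']]
# Restore Temp's original integer dtype (it has no missing values)
roundtrip = roundtrip.astype({'Temp': 'int64'})

# Should print True: same values, same dtypes, same row order
print(roundtrip.equals(airquality))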

Chapter 6. Pivoting duplicate values

So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the hierarchical index of the pivoted DataFrame using .reset_index().

NumPy and pandas have been imported as np and pd respectively.

In [24]:
# Duplicate every row (DataFrame.append() was removed in pandas 2.0, so pd.concat is used here)
airquality_dup = pd.concat([airquality_melt, airquality_melt], ignore_index=True)
airquality_dup
Out[24]:
      Month  Day measurement  reading
0         5    1       Ozone     41.0
1         5    2       Ozone     36.0
2         5    3       Ozone     12.0
3         5    4       Ozone     18.0
4         5    5       Ozone      NaN
5         5    6       Ozone     28.0
6         5    7       Ozone     23.0
7         5    8       Ozone     19.0
8         5    9       Ozone      8.0
9         5   10       Ozone      NaN
10        5   11       Ozone      7.0
11        5   12       Ozone     16.0
12        5   13       Ozone     11.0
13        5   14       Ozone     14.0
14        5   15       Ozone     18.0
15        5   16       Ozone     14.0
16        5   17       Ozone     34.0
17        5   18       Ozone      6.0
18        5   19       Ozone     30.0
19        5   20       Ozone     11.0
20        5   21       Ozone      1.0
21        5   22       Ozone     11.0
22        5   23       Ozone      4.0
23        5   24       Ozone     32.0
24        5   25       Ozone      NaN
25        5   26       Ozone      NaN
26        5   27       Ozone      NaN
27        5   28       Ozone     23.0
28        5   29       Ozone     45.0
29        5   30       Ozone    115.0
...      ...  ...         ...      ...
1194      9    1        Temp     91.0
1195      9    2        Temp     92.0
1196      9    3        Temp     93.0
1197      9    4        Temp     93.0
1198      9    5        Temp     87.0
1199      9    6        Temp     84.0
1200      9    7        Temp     80.0
1201      9    8        Temp     78.0
1202      9    9        Temp     75.0
1203      9   10        Temp     73.0
1204      9   11        Temp     81.0
1205      9   12        Temp     76.0
1206      9   13        Temp     77.0
1207      9   14        Temp     71.0
1208      9   15        Temp     71.0
1209      9   16        Temp     78.0
1210      9   17        Temp     67.0
1211      9   18        Temp     76.0
1212      9   19        Temp     68.0
1213      9   20        Temp     82.0
1214      9   21        Temp     64.0
1215      9   22        Temp     71.0
1216      9   23        Temp     81.0
1217      9   24        Temp     69.0
1218      9   25        Temp     63.0
1219      9   26        Temp     70.0
1220      9   27        Temp     77.0
1221      9   28        Temp     75.0
1222      9   29        Temp     76.0
1223      9   30        Temp     68.0

[1224 rows x 4 columns]

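Before pivoting, comparing the two shapes confirms the duplication: the melted DataFrame has 612 rows (153 days × 4 measurements), and the duplicated one has twice that:

# Confirm that every row was duplicated
print(airquality_melt.shape)  # (612, 4)
print(airquality_dup.shape)   # (1224, 4)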
In [28]:
import numpy as np
# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], 
                                              columns='measurement', 
                                              values='reading', 
                                              aggfunc=np.mean)

print(airquality_pivot)
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
      6       28.0      NaN  66.0  14.9
      7       23.0    299.0  65.0   8.6
      8       19.0     99.0  59.0  13.8
      9        8.0     19.0  61.0  20.1
      10       NaN    194.0  69.0   8.6
      11       7.0      NaN  74.0   6.9
      12      16.0    256.0  69.0   9.7
      13      11.0    290.0  66.0   9.2
      14      14.0    274.0  68.0  10.9
      15      18.0     65.0  58.0  13.2
      16      14.0    334.0  64.0  11.5
      17      34.0    307.0  66.0  12.0
      18       6.0     78.0  57.0  18.4
      19      30.0    322.0  68.0  11.5
      20      11.0     44.0  62.0   9.7
      21       1.0      8.0  59.0   9.7
      22      11.0    320.0  73.0  16.6
      23       4.0     25.0  61.0   9.7
      24      32.0     92.0  61.0  12.0
      25       NaN     66.0  57.0  16.6
      26       NaN    266.0  58.0  14.9
      27       NaN      NaN  57.0   8.0
      28      23.0     13.0  67.0  12.0
      29      45.0    252.0  81.0  14.9
      30     115.0    223.0  79.0   5.7
...            ...      ...   ...   ...
9     1       96.0    167.0  91.0   6.9
      2       78.0    197.0  92.0   5.1
      3       73.0    183.0  93.0   2.8
      4       91.0    189.0  93.0   4.6
      5       47.0     95.0  87.0   7.4
      6       32.0     92.0  84.0  15.5
      7       20.0    252.0  80.0  10.9
      8       23.0    220.0  78.0  10.3
      9       21.0    230.0  75.0  10.9
      10      24.0    259.0  73.0   9.7
      11      44.0    236.0  81.0  14.9
      12      21.0    259.0  76.0  15.5
      13      28.0    238.0  77.0   6.3
      14       9.0     24.0  71.0  10.9
      15      13.0    112.0  71.0  11.5
      16      46.0    237.0  78.0   6.9
      17      18.0    224.0  67.0  13.8
      18      13.0     27.0  76.0  10.3
      19      24.0    238.0  68.0  10.3
      20      16.0    201.0  82.0   8.0
      21      13.0    238.0  64.0  12.6
      22      23.0     14.0  71.0   9.2
      23      36.0    139.0  81.0  10.3
      24       7.0     49.0  69.0  10.3
      25      14.0     20.0  63.0  16.6
      26      30.0    193.0  70.0   6.9
      27       NaN    145.0  77.0  13.2
      28      14.0    191.0  75.0  14.3
      29      18.0    131.0  76.0   8.0
      30      20.0    223.0  68.0  11.5

[153 rows x 4 columns]
In [29]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
In [30]:
# Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
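Note that aggregating with aggfunc is only one way to handle the duplicates; an alternative not shown in the course is to drop the exact duplicate rows first and then pivot as before, without needing aggfunc:

# Remove exact duplicate rows up front
airquality_dedup = airquality_dup.drop_duplicates()
print(airquality_dedup.shape)  # should be (612, 4), the original melted shape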

Chapter 7. Splitting a column with .str

The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.

In [37]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/tb.csv"
tb = pd.read_csv(url)
print(tb.head())
  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  
In [38]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.head())
  country  year variable  value
0      AD  2000     m014    0.0
1      AE  2000     m014    2.0
2      AF  2000     m014   52.0
3      AG  2000     m014    0.0
4      AL  2000     m014    2.0
In [40]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
print(tb_melt.head())
  country  year variable  value gender
0      AD  2000     m014    0.0      m
1      AE  2000     m014    2.0      m
2      AF  2000     m014   52.0      m
3      AG  2000     m014    0.0      m
4      AL  2000     m014    2.0      m
In [41]:
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head())
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014
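As a variation (not part of the exercise), the same split can be done in one pass with a regular expression via .str.extract() and named capture groups; the 'mu' and 'fu' columns come through with 'u' (unknown) as the age group, just as with slicing:

# Extract gender and age_group in a single step
parts = tb_melt.variable.str.extract(r'(?P<gender>[mf])(?P<age_group>.+)')
print(parts.head())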

Chapter 8. Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a dataset consisting of Ebola cases and death counts by state and country. It has been pre-loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore serves as a delimiter between the first part (cases or deaths) and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method .split(). By default, this method splits a string into parts separated by whitespace. In this case, however, you want it to split by an underscore. You can do this on Cases_Guinea, for example, using 'Cases_Guinea'.split('_'), which returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the element at index 0 or 1, depending on the part you want.

In [42]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [43]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
print(ebola_melt.head())
         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
In [44]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]
In [45]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases
In [46]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
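Alternatively (a variation not used in the exercise), .str.split() with expand=True returns the two parts as separate columns directly, skipping the intermediate column of lists:

# Split into two real columns in one step
ebola_melt[['type', 'country']] = ebola_melt.type_country.str.split('_', expand=True)
print(ebola_melt[['type_country', 'type', 'country']].head())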
