1. Intro

What we have learned so far has been about converting data into a DataFrame. Being able to convert data into a DataFrame ultimately means being able to reshape it to fit the researcher's or the analysis's requirements, and that is the ultimate goal of learning Pandas.

If you need some background knowledge on DataFrames, please visit the post below.

https://chloevan.github.io/python/pandas/pandas_dataframe/


Since this closely parallels what we covered with dictionaries, please read carefully.
First, let's create a DataFrame.

import pandas as pd

# Scores for four students in 국어 (Korean), 영어 (English), 수학 (Math), 과학 (Science)
exam_dic = {'국어': [80, 70, 90, 87], '영어': [100, 80, 50, 90],
            '수학': [85, 75, 65, 100], '과학': [89, 67, 90, 85]}

data = pd.DataFrame(exam_dic, index = ['Evan', 'Chloe', 'Alice', 'John'])

print(data)
       국어   영어   수학  과학
Evan   80  100   85  89
Chloe  70   80   75  67
Alice  90   50   65  90
John   87   90  100  85

2. Working with Columns

There are several ways to work with columns; broadly speaking, there are three:
- Column selection (Selection)
- Column addition (Addition)
- Column deletion (Deletion)

Now let's go through each of them in turn.

2.1 Column Selection

There are two basic ways to select a column: bracket notation, where you pass the column name as in data['column_name'], and dot notation, as in data.column_name. One thing to keep in mind: selecting a single column returns a Series.

data['국어']
Evan     80
Chloe    70
Alice    90
John     87
Name: 국어, dtype: int64
data.국어
Evan     80
Chloe    70
Alice    90
John     87
Name: 국어, dtype: int64

To select two or more columns, pass a list of names, as in data[['col1', 'col2', ..., 'colN']]; this time the result comes back as a DataFrame rather than a Series.

Let's confirm this with code.

data[['국어', '과학']]
       국어  과학
Evan   80  89
Chloe  70  67
Alice  90  90
John   87  85

 

2.2 Column Addition

Adding a column is easy if you picture how you would do it in Excel. The syntax is simply data['new_column_name'] = values, and the key is handling those values correctly. First, let's add two columns, 언어영역합계 (language-area total) and 수리영역합계 (math-area total).

data['언어영역합계'] = data['국어'] + data['영어']
print(data)
       국어   영어   수학  과학  언어영역합계
Evan   80  100   85  89     180
Chloe  70   80   75  67     150
Alice  90   50   65  90     140
John   87   90  100  85     177
data['수리영역합계'] = data['수학'] + data['과학']
print(data)
       국어   영어   수학  과학  언어영역합계  수리영역합계
Evan   80   100   85  89    180        174
Chloe  70    80   75  67    150        142
Alice  90    50   65  90    140        155
John   87    90  100  85    177        185

This time, let's assume every student receives a class-attitude score (수업태도점수) of 10 and assign it in one step. The code is as follows.

data['수업태도점수'] = 10
print(data)
       국어   영어   수학  과학  언어영역합계  수리영역합계  수업태도점수
Evan   80    100    85   89  180       174        10
Chloe  70     80    75   67  150       142        10
Alice  90     50    65   90  140       155        10
John   87     90   100   85  177       185        10
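
For reference, a scalar assigned like this is broadcast to every row. If each row needs a different value, you can assign a list (or a Series) whose length matches the number of rows. A minimal sketch with a hypothetical 보너스 (bonus) column, done on a copy so that data itself is unchanged:

tmp = data.copy()               # work on a copy; the examples below keep using data
tmp['보너스'] = [5, 3, 4, 2]     # hypothetical values; the list length must be 4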

2.3 Column Deletion

There are three main ways to delete a column:
- using del
- using pop
- using drop

First, del. Let's delete 수업태도점수.

del data['수업태도점수']
print(data)
       국어   영어   수학  과학  언어영역합계  수리영역합계
Evan   80    100   85  89    180       174
Chloe  70     80   75  67    150       142
Alice  90     50   65  90    140       155
John   87     90  100  85    177       185

Next, use pop to delete 수리영역합계.

data.pop('수리영역합계')
print(data)
       국어   영어   수학  과학  언어영역합계
Evan   80    100   85   89   180
Chloe  70     80   75   67   150
Alice  90     50   65   90   140
John   87     90  100   85   177
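
Note that, unlike del, pop also returns the deleted column as a Series, so you can keep it around. A minimal sketch on a copy, so data is unchanged:

tmp = data.copy()
removed = tmp.pop('언어영역합계')   # removes the column and returns it
print(type(removed))               # <class 'pandas.core.series.Series'>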

Finally, use drop to delete 언어영역합계. This time we pass the options axis=1 and inplace=True, which means: modify the data object in place, deleting along the column axis. For reference, setting axis=0 deletes rows by index label instead.

data.drop('언어영역합계', axis=1, inplace=True)
print(data)
       국어   영어   수학  과학
Evan   80    100   85  89
Chloe  70     80   75  67
Alice  90     50   65  90
John   87     90  100  85
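
Since axis=0 was mentioned above, here is a quick sketch of dropping a row by its index label, again on a copy:

tmp = data.copy()
tmp.drop('Evan', axis=0, inplace=True)   # axis=0 drops the row labeled 'Evan'
print(tmp.index)                          # Index(['Chloe', 'Alice', 'John'], dtype='object')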

So far we have covered the most basic ways of working with DataFrame columns. In practice, more advanced techniques are used, but the basics always matter, so make sure you are comfortable with how these fundamental functions work.

I hope you found this helpful.

A DataFrame is organized as a two-dimensional array of rows and columns. If you think of tools most people already know, such as Microsoft's EXCEL or a SQL table, the DataFrame is easy to understand. It is the most heavily used object in pandas, and it is essential knowledge if you want to do data analysis with Python. While Python is often described as a language well suited to matrix operations, the pandas library is known to have originated from R's data frame.

[Figure 1-1] Example of a table in Excel

 

A quick aside! One of the things beginners and newcomers are most curious about is how R compares to Python. Speaking from experience as someone who majored in neither statistics nor computer science: if you have no programming background at all, and your main goal is data preprocessing that beats Excel plus some simple graphs, R's data frame is still a powerful weapon. To put it briefly, the data frame is effectively the default object in most R packages. Python, by contrast, is a general-purpose language heavily used for web development, so it can be harder to approach: it has many kinds of objects, such as indexes, dictionaries, and arrays, and you need to start out aware of this, which can make things somewhat difficult and confusing. But if you want to build a program around your data, Python can be the most powerful weapon.

 

Coming back to the main topic: as the figure below illustrates, a DataFrame is a structure in which multiple Series are gathered together.

[Figure 1-2] Series and DataFrame structure

In other words, Series combine to form a DataFrame, and each column of a DataFrame is itself a Series object. Let's start by using pandas to create a simple DataFrame.

 

1. Converting a Dictionary to a DataFrame

Let's convert a dictionary to a DataFrame. Check the sample code below.

import pandas as pd

dic_data = {'country': ['벨기에', '인도', '브라질'],
        'capital': ['브뤼셀', '뉴델리', '브라질리아'],
        'population': [11190846, 1303171035, 207847528]}

df = pd.DataFrame(dic_data)
print(df)
  country capital  population
0     벨기에     브뤼셀    11190846
1      인도     뉴델리  1303171035
2     브라질   브라질리아   207847528
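
Incidentally, this also confirms the point above that each DataFrame column is a Series; a quick check on the df we just built:

print(type(df['country']))   # <class 'pandas.core.series.Series'>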

You can see that 'country', 'capital', and 'population' become the column names, and that a row index starting from 0 is generated automatically.

2. Converting Series to a DataFrame

This time, let's convert Series objects to a DataFrame. Check the sample code below.

series = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']), 
          'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(series)
print(df)
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

One notable point: even though the columns have different lengths, no error occurs; the missing position is simply filled with NaN.

3. Converting ndarrays & Lists to a DataFrame

A lot of Python code is written with arrays and lists. When converting a dictionary of ndarrays or lists to a DataFrame, an error is raised if any column is missing values, that is, if the lists have different lengths. First, let's look at code that works.

ndArrays = {'one': [1., 2., 3., 4.], 
            'two': [4., 3., 2., 1.]}

pd.DataFrame(ndArrays)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

Next, let's look at the failing case. As shown below, the error ValueError: arrays must all be same length occurs. (Run it and confirm the error yourself!)

ndArrays = {'one': [1., 2., 3., 4.], 
            'two': [4., 3., 2.]}

pd.DataFrame(ndArrays)

[Figure 1-3] Error when converting ndarrays to a DataFrame

As the code above shows, converting to a DataFrame from a dictionary of Series objects fills in NaN and still succeeds, but with unequal-length ndarrays or lists the DataFrame cannot be created. So before converting, first check what kind of object you are dealing with.
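
If you do run into unequal-length lists, one simple workaround is to wrap each list in a Series first, so pandas can align them and fill NaN, just as in section 2. A sketch of that approach:

import pandas as pd

ndArrays = {'one': [1., 2., 3., 4.],
            'two': [4., 3., 2.]}

# Wrapping each list in a Series lets pandas align the lengths and fill NaN
df = pd.DataFrame({key: pd.Series(value) for key, value in ndArrays.items()})
print(df)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  NaN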

 

Next time, we will cover extracting, deleting, and adding rows and columns in the DataFrame we have created.

Intro

MongoDB is a NoSQL database program that stores JSON-like documents with flexible schemas. It is an open-source, cross-platform database and the representative NoSQL database engine. As for me, I started learning Python for a few reasons; one of them is that I wanted to insert web-crawling datasets, including text data, image URLs, and so on, into a NoSQL database.

MongoDB Installation

I highly recommend using MongoDB Atlas. Since the cloud is rapidly coming to dominate the IT industry, it is good practice to use a solid cloud product like MongoDB Atlas. It is free to use and connects to the major cloud providers: AWS, GCP, and Azure. The details are at the following link: https://www.mongodb.com/cloud/atlas

Note - Network Access

Following the instructions, it is not very difficult to build a MongoDB cluster. But when setting up Network Access, it can be confusing where to click. For educational purposes, please click ALLOW ACCESS FROM ANYWHERE.

With this setting, you can freely access MongoDB in the cloud.

Connecting Python to the MongoDB Cluster

There are many ways to connect, in many different languages. Fortunately for Python users, the platform provides sample code.

Now, it's time to code in Python.

Python Module Installation

In Python, install the modules below from the terminal:

$ pip install pymongo
$ pip install dnspython
$ pip install motor

If you are using Python 3, use pip3 instead of pip.

The details are explained at https://pypi.org/project/pymongo/

Via pymongo.MongoClient("your_uri"), you can reach MongoDB.

Error Handling 1 - dnspython

For me, the most difficult part was dealing with dnspython. If you face a dnspython error, you can find a solution under Driver Compatibility.

Error Handling 2 - Python Error Certificate Verify Failed

When executing a query to get data, a Certificate Verify Failed message may pop up. If so, see How To Fix Python Error Certificate Verify Failed: Unable To Get Local Issuer Certificate In Mac OS.

import pymongo
import dns      # dnspython; needed for mongodb+srv:// connection strings
import pprint
import motor    # async driver, installed above; not used in this snippet

# Replace <user> and <password> with your Atlas credentials
url_path = "mongodb+srv://<user>:<password>@tafuncluster-rmc7j.gcp.mongodb.net/test?retryWrites=true&w=majority"
client = pymongo.MongoClient(url_path)
db = client['sample_dataset_R']   # select the database
collection = db['iris']           # select the collection

pprint.pprint(collection.find_one())
{'Petal_Length': 1.4,
 'Petal_Width': 0.2,
 'Sepal_Length': 5.1,
 'Sepal_Width': 3.5,
 'Species': 'setosa',
 '_id': ObjectId('5db94dbfe6188c5426265283')}

From MongoDB to Pandas

The major work of a data scientist or data analyst is to get data as a DataFrame, not as JSON. So converting JSON to a DataFrame using Pandas is another important step. Let's try the sample code below, and start by printing the first document again.

pprint.pprint(collection.find_one())
{'Petal_Length': 1.4,
 'Petal_Width': 0.2,
 'Sepal_Length': 5.1,
 'Sepal_Width': 3.5,
 'Species': 'setosa',
 '_id': ObjectId('5db94dbfe6188c5426265283')}

At this moment, we don't need the _id field. So, we will exclude that column.

import pandas as pd

exclude_column = {'_id': False} 
mong_data = list(collection.find({}, projection=exclude_column)) 
iris = pd.DataFrame(mong_data)
print(iris)
     Sepal_Length  Sepal_Width  Petal_Length  Petal_Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
..            ...          ...           ...          ...        ...
595           6.7          3.0           5.2          2.3  virginica
596           6.3          2.5           5.0          1.9  virginica
597           6.5          3.0           5.2          2.0  virginica
598           6.2          3.4           5.4          2.3  virginica
599           5.9          3.0           5.1          1.8  virginica

[600 rows x 5 columns]
  • Why do we need list()? collection.find() returns a cursor; after this line, mong_data holds the documents as a list of dictionaries (JSON-like records).

  • From that JSON-like format, we can convert to a pandas DataFrame.
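
The same list() pattern works for filtered queries. A small sketch using the iris collection above (the 'setosa' filter is just an example value):

# Fetch only the setosa documents, still excluding _id, then convert to a DataFrame
setosa_docs = list(collection.find({'Species': 'setosa'}, projection={'_id': False}))
setosa = pd.DataFrame(setosa_docs)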

Creating Database and Collection

MongoDB will create a database very quickly and automatically if it does not exist, and make a connection to it. Let's create a database called my1stdatabase.

import pymongo
my1stdatabase = client['my1stdatabase']
my1stcollection = my1stdatabase['my1stcollection']

These two lines are enough to create the database and the collection.
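
Strictly speaking, the creation is lazy: the new database will not appear on the server until the first document is inserted. A quick check (a sketch):

# Before any insert, the database is not yet listed server-side
print('my1stdatabase' in client.list_database_names())   # expected: False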

Insert iris data into Collection

Inserting data into a collection is not difficult at all. We will use the iris DataFrame we already imported.

my1stcollection.insert_many(iris.to_dict('records'))
<pymongo.results.InsertManyResult at 0x116afc280>

Instead of iris, the collection appears under the name my1stcollection in the MongoDB cloud UI.
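
To double-check from Python rather than the cloud UI, you can count the documents; a short sketch:

# Should match the number of rows in the iris DataFrame (600 here)
print(my1stcollection.count_documents({}))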

Conclusion

In this post, you have learned that you can set up a MongoDB server very quickly, and that MongoDB is a breeze!

I like the cloud. The cloud is very powerful when working with others. As a data scientist, building infrastructure is genuinely painful. Docker is powerful, but it frustrated me whenever I had to study networking, bridges, and similar concepts, so I gave up and instead looked for a good cloud-based solution, which is how I found MongoDB's cloud. I still need to learn more about MongoDB, but I can see myself becoming a big fan of a database that can import and export unstructured data, such as image URLs and text, alongside relational data.

 

I wanted to stick to a simple example, and I’m sure that it will already be very useful to you. In future posts, we’ll see how to access the server from a remote machine, and how to fill and read the database asynchronously, for an interactive display with bokeh.

 

For R users, please see my other post: http://rpubs.com/Evan_Jung/r_mongodb

Chapter 5. Case Study

ALL CONTENTS ARE FROM DATACAMP

You'll apply all of the data cleaning techniques you've learned in this course to tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using the power of Python!

Chapter 1. Exploratory analysis

Whenever you obtain a new dataset, your first task should always be to do some exploratory analysis to get a better understanding of the data and diagnose it for any potential issues.

The Gapminder data is loaded below into a DataFrame called gapminder (the original exercise used a 19th-century subset called g1800s). In the IPython Shell, use pandas methods such as .head(), .info(), and .describe(), and DataFrame attributes like .columns and .shape to explore it.

In [91]:
import pandas as pd
url = 'https://assets.datacamp.com/production/course_2023/datasets/gapminder.csv'
gapminder = pd.read_csv(url, index_col=0)
print(gapminder.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 218 entries, 1800 to Life expectancy
dtypes: float64(217), object(1)
memory usage: 1.3+ MB
None
In [92]:
print(gapminder.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

Chapter 2. Visualizing your data

Since 1800, life expectancy around the globe has been steadily going up. You would expect the Gapminder data to confirm this.

The original exercise used a pre-loaded g1800s; here we use the full gapminder DataFrame loaded above. Your job in this exercise is to create a scatter plot with life expectancy in '1800' on the x-axis and life expectancy in '1899' on the y-axis.

Here, the goal is to visually check the data for insights as well as errors. When looking at the plot, pay attention to whether the scatter plot takes the form of a diagonal line, and which points fall below or above the diagonal line. This will inform how life expectancy in 1899 changed (or did not change) compared to 1800 for different countries. If points fall on a diagonal line, it means that life expectancy remained the same!

In [93]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
In [94]:
# Create the scatter plot
gapminder.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

Excellent work! As you can see, there are a surprising number of countries that fall on the diagonal line. In fact, examining the DataFrame reveals that the life expectancy for 140 of the 260 countries did not change at all in the 19th century! This is possibly a result of not having access to the data for all the years back then. In this way, visualizing your data can help you uncover insights as well as diagnose it for errors.

Chapter 3. Thinking about the question at hand

Since you are given life expectancy level data by country and year, you could ask questions about how much the average life expectancy changes over each year.

Before continuing, however, it's important to make sure that the following assumptions about the data are true:

  • 'Life expectancy' is the last column of the DataFrame.
  • The other columns contain either null or numeric values.
  • The numeric values are all greater than or equal to 0.
  • There is only one instance of each country.

You can write a function that you can apply over the entire DataFrame to verify some of these assumptions. Note that spending the time to write such a script will help you when working with other datasets as well.

In [95]:
import pandas as pd

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0
In [96]:
# Check whether the last column is 'Life expectancy'
assert gapminder.columns[217] == 'Life expectancy'
In [97]:
# Check whether the values in the row are valid
assert gapminder.iloc[:, 0:217].apply(check_null_or_valid, axis=1).all().all()
In [98]:
# Check that there is only one instance of each country
assert gapminder['Life expectancy'].value_counts()[0] == 1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-98-5d9f300af706> in <module>
      1 # Check that there is only one instance of each country
----> 2 assert gapminder['Life expectancy'].value_counts()[0] == 1

AssertionError: 
In [99]:
# If you got the error, check the actual count
print(gapminder['Life expectancy'].value_counts()[0])
3

Getting into the habit of testing your code like this is an important skill.

Chapter 4. Disassembling & Assembling your data

Here, three DataFrames are needed: g1800s, g1900s, and g2000s. These contain the Gapminder life expectancy data for, respectively, the 19th century, the 20th century, and the 21st century (below they are reconstructed from the gapminder DataFrame).

Your task in this exercise is to concatenate them into a single DataFrame called gapminder. This is a row-wise concatenation, similar to how you concatenated the monthly Uber datasets in Chapter 3.

In [104]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1800':'1899'])
g1800s = gapminder.loc[:, cols]
print(g1800s.columns)
Index(['Life expectancy', '1800', '1801', '1802', '1803', '1804', '1805',
       '1806', '1807', '1808',
       ...
       '1890', '1891', '1892', '1893', '1894', '1895', '1896', '1897', '1898',
       '1899'],
      dtype='object', length=101)
In [105]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1900':'1999'])
g1900s = gapminder.loc[:, cols]
print(g1900s.columns)
Index(['Life expectancy', '1900', '1901', '1902', '1903', '1904', '1905',
       '1906', '1907', '1908',
       ...
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999'],
      dtype='object', length=101)
In [106]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'2000':'2016'])
g2000s = gapminder.loc[:, cols]
print(g2000s.columns)
Index(['Life expectancy', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
       '2015', '2016'],
      dtype='object')
In [107]:
# Concatenate the DataFrames row-wise
gapminder2 = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder2.shape)
(2340, 218)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  
In [108]:
# Print the head of gapminder
print(gapminder2.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

All the Gapminder data, from 1800 to 2016, is now contained in one DataFrame.

Chapter 5. Reshaping your data

Now that you have all the data combined into a single DataFrame, the next step is to reshape it into a tidy data format.

Currently, the gapminder DataFrame has a separate column for each year. What you want instead is a single column that contains the year, and a single column that represents the average life expectancy for each year and country. By having year in its own column, you can use it as a predictor variable in a later analysis.

You can convert the DataFrame into the desired tidy format by melting it.

In [109]:
import pandas as pd

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars = 'Life expectancy')
In [110]:
# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
In [111]:
# Print the head of gapminder_melt
print(gapminder_melt.head())
                 country  year  life_expectancy
0               Abkhazia  1800              NaN
1            Afghanistan  1800            28.21
2  Akrotiri and Dhekelia  1800              NaN
3                Albania  1800            35.40
4                Algeria  1800            28.82

Chapter 6. Checking the data types

Now that your data are in the proper shape, you need to ensure that the columns are of the proper data type. That is, you need to ensure that country is of type object, year is of type int64, and life_expectancy is of type float64.

The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython Shell using the .info() method. Notice that the column 'year' is of type object. This is incorrect, so you'll need to use the pd.to_numeric() function to convert it to a numeric data type.

NumPy and pandas have been pre-imported as np and pd.

In [114]:
# Convert the year column to numeric
gapminder = gapminder_melt
gapminder.year = pd.to_numeric(gapminder.year)
In [116]:
# Test if country is of type object
import numpy as np
assert gapminder.country.dtypes == object   # np.object is deprecated; plain object works
In [117]:
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
In [118]:
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Excellent work! Since the assert statements did not throw any errors, you can be sure that your columns have the correct data types!

Chapter 7. Looking at country spellings

Having tidied your DataFrame and checked the data types, your next task in the data cleaning process is to look at the 'country' column to see if there are any special or invalid characters you may need to deal with.

It is reasonable to assume that country names will contain:

  • The set of lower and upper case letters.
  • Whitespace between words.
  • Periods for any abbreviations.

To confirm that this is the case, you can leverage the power of regular expressions again. For common operations like this, Pandas has a built-in string method - str.contains() - which takes a regular expression pattern, and applies it to the Series, returning True if there is a match, and False otherwise.

Since here you want to find the values that do not match, you have to invert the boolean, which can be done using ~. This Boolean series can then be used to get the Series of countries that have invalid names.

In [119]:
# Create the series of countries: countries
countries = gapminder['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)
49            Congo, Dem. Rep.
50                 Congo, Rep.
53               Cote d'Ivoire
73      Falkland Is (Malvinas)
93               Guinea-Bissau
98            Hong Kong, China
118    United Korea (former)\n
131               Macao, China
132             Macedonia, FYR
145      Micronesia, Fed. Sts.
161            Ngorno-Karabakh
187             St. Barthélemy
193     St.-Pierre-et-Miquelon
225                Timor-Leste
251      Virgin Islands (U.S.)
252       North Yemen (former)
253       South Yemen (former)
258                      Åland
Name: country, dtype: object

As you can see, not all of these country names are actually invalid, so the assumptions may need to be tweaked a little. However, there certainly are a few cases worth further investigation, such as St. Barthélemy. Whenever you are dealing with columns of raw data consisting of strings, it is important to check them for consistency like this.

Chapter 8. More data cleaning and processing

It's now time to deal with the missing data. There are several strategies for this: You can drop them, fill them in using the mean of the column or row that the missing value is in (also known as imputation), or, if you are dealing with time series data, use a forward fill or backward fill, in which you replace missing values in a column with the most recent known value in the column. See pandas Foundations for more on forward fill and backward fill.

In general, it is not the best idea to drop missing values, because in doing so you may end up throwing away useful information. In this data, the missing values refer to years where no estimate for life expectancy is available for a given country. You could fill in, or guess what these life expectancies could be by looking at the average life expectancies for other countries in that year, for example. Whichever strategy you go with, it is important to carefully consider all options and understand how they will affect your data.
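
To make those alternatives concrete, here is a sketch (not part of the exercise) of a per-country forward fill and of imputing with the per-year mean, using the tidy column names from above:

# Forward fill: within each country, carry the last known value forward in time
filled = gapminder.sort_values(['country', 'year'])
filled['life_expectancy'] = filled.groupby('country')['life_expectancy'].ffill()

# Imputation: replace each NaN with the average life expectancy of that year
year_mean = gapminder.groupby('year')['life_expectancy'].transform('mean')
imputed = gapminder.copy()
imputed['life_expectancy'] = imputed['life_expectancy'].fillna(year_mean)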

In this exercise, you'll practice dropping missing values. Your job is to drop all the rows that have NaN in the life_expectancy column. Before doing so, it would be valuable to use assert statements to confirm that year and country do not have any missing values.

Begin by printing the shape of gapminder in the IPython Shell prior to dropping the missing values. Complete the exercise to find out what its shape will be after dropping the missing values!

In [120]:
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
In [121]:
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
In [122]:
# Drop the missing values
gapminder = gapminder.dropna(how='any')

# Print the shape of gapminder
print(gapminder.shape)
(43857, 3)

After dropping the missing values from 'life_expectancy', the number of rows in the DataFrame has gone down from 169260 to 43857. In general, you should avoid dropping too much of your data, but if there is no reasonable way to fill in or impute missing values, then dropping the missing data may be the best solution.

Chapter 9. Wrapping up

Now that you have a clean and tidy dataset, you can do a bit of visualization and aggregation. In this exercise, you'll begin by creating a histogram of the life_expectancy column. You should not get any values under 0 and you should see something reasonable on the higher end of the life_expectancy age range.

Your next task is to investigate how average life expectancy changed over the years. To do this, you need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. You can achieve this using the .groupby() method. This .groupby() method is covered in greater depth in Manipulating DataFrames with pandas.

Finally, you can save your tidy and summarized DataFrame to a file using the .to_csv() method.

matplotlib.pyplot and pandas have been pre-imported as plt and pd. Go for it!

In [123]:
# Add first subplot
plt.subplot(2, 1, 1) 
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1195bc780>
In [124]:
# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d0a2978>
In [125]:
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
In [126]:
# Print the head of gapminder_agg
print(gapminder_agg.head())
year
1800    31.486020
1801    31.448905
1802    31.463483
1803    31.377413
1804    31.446318
Name: life_expectancy, dtype: float64
In [127]:
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
year
2012    71.663077
2013    71.916106
2014    72.088125
2015    72.321010
2016    72.556635
Name: life_expectancy, dtype: float64
In [128]:
# Add second subplot
plt.subplot(2, 1, 2)
Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab18a20>
In [129]:
# Create a line plot of life expectancy per year
gapminder_agg.plot()
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x1086d77b8>
In [130]:
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
Out[130]:
Text(0.5, 0, 'Year')
In [131]:
# Display the plots
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
In [132]:
# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

You've stepped through each stage of the data cleaning process and your data is now ready for serious analysis! Looking at the line plot, it seems like life expectancy has, as expected, increased over the years. There is a surprising dip around 1920 that may be worth further investigation!
