1. Intro

Last time, we learned how to work with columns: specifically, how to select them, add them, and delete them. This time, let's learn how to work with rows.


[Previous post]

2019/11/12 - [Python/[Python] 데이터 전처리] - [파이썬] 판다스 활용 데이터 전처리 - 행 다루기


First, as in the previous chapter, let's build a practice dataset. We add the students' names as a variable.

import pandas as pd
exam_dic = {'이름' : ['Evan', 'Chloe', 'Alice', 'John'],
            '국어' : [80, 70, 90, 87], 
            '영어' : [100, 80, 50, 90], 
            '수학' : [85, 75, 65, 100], 
            '과학' : [89, 67, 90, 85]}

data = pd.DataFrame(exam_dic)

print(data)
      이름  국어   영어   수학  과학
0   Evan  80  100   85  89
1  Chloe  70   80   75  67
2  Alice  90   50   65  90
3   John  87   90  100  85


2. Working with Rows

There are several ways to work with rows, but broadly speaking three stand out:
- Row selection
- Row addition
- Row deletion

Now let's go through them one by one.

2.1 Selecting Rows

In pandas, rows are selected with `loc[]`. First, let's use `loc[]` to select Chloe.

data.loc[1]
이름    Chloe
국어       70
영어       80
수학       75
과학       67
Name: 1, dtype: object

Now let's select Alice.

data.loc[2]
이름    Alice
국어       90
영어       50
수학       65
과학       90
Name: 2, dtype: object

The problem above is that rows have to be selected by number rather than by name. As the data grows, pulling out a specific row this way becomes very difficult, so let's set the '이름' column as the new index. This is done with `set_index('new_index_column', inplace=True)`.

data.set_index('이름', inplace=True)
print(data)
       국어   영어   수학  과학
이름                     
Evan   80  100   85  89
Chloe  70   80   75  67
Alice  90   50   65  90
John   87   90  100  85

You can see that 이름 is now the index. Let's select Chloe and Alice again.

data.loc['Chloe']
국어    70
영어    80
수학    75
과학    67
Name: Chloe, dtype: int64
data.loc['Alice']
국어    90
영어    50
수학    65
과학    90
Name: Alice, dtype: int64

To select two or more rows, pass a list of labels such as `data.loc[['row1', 'row2', ..., 'rowN']]`; this returns a DataFrame rather than a Series.

Let's select Chloe and Alice at the same time. Check it in code.

data.loc[['Chloe', 'Alice']]
       국어  영어  수학  과학
이름
Chloe  70   80   75   67
Alice  90   50   65   90

As shown above, the result is returned as a DataFrame. There are also more advanced techniques such as range slicing, but I'll save those for when we apply them to real-world data.
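As a small preview of that range slicing, `loc` also accepts label slices, and unlike positional slicing both endpoints are included. A minimal sketch, rebuilding a small version of the data above:

```python
import pandas as pd

# Small stand-in for the exam data, with 이름 already set as the index
data = pd.DataFrame({'국어': [80, 70, 90, 87],
                     '영어': [100, 80, 50, 90]},
                    index=['Evan', 'Chloe', 'Alice', 'John'])

# Label slices with loc include BOTH endpoints
subset = data.loc['Chloe':'John']
print(subset.index.tolist())  # ['Chloe', 'Alice', 'John']
```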

2.2 Adding Rows

Adding a row is easy if you picture doing it in Excel. The basic syntax is `data.loc['new_row_name'] = values`, and the values have to be handled with care.

Let's add a student named Chris. The point to watch is that a new row is created by assigning to `data.loc['new_index_name']`, and you should also check that the values line up with the existing columns.

data.loc['Chris'] = [90, 85, 80, 50]
print(data)
       국어   영어   수학  과학
이름                     
Evan   80  100   85  89
Chloe  70   80   75  67
Alice  90   50   65  90
John   87   90  100  85
Chris  90   85   80  50
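If you worry about getting the value order wrong, one hedge is to assign a Series keyed by column name; pandas then aligns the values by label instead of by position. A sketch with toy data (verify the alignment behavior on your pandas version):

```python
import pandas as pd

data = pd.DataFrame({'국어': [80], '영어': [100]}, index=['Evan'])

# A Series keyed by column name is aligned by label,
# so the order of the values does not matter
data.loc['Chris'] = pd.Series({'영어': 85, '국어': 90})
print(data.loc['Chris'])
```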

What if the new rows arrive as a separate DataFrame? In that case, use `pd.concat([data, new_row])`. Let's build a DataFrame with two rows and then concatenate it.

exam_dic2 = {'국어' : [80, 70], 
            '영어' : [100, 80], 
            '수학' : [85, 75], 
            '과학' : [89, 67]}

new_row = pd.DataFrame(exam_dic2, index=['대한이', '민국'])

data2 = pd.concat([data, new_row])
print(data2)
       국어   영어   수학  과학
Evan   80  100   85  89
Chloe  70   80   75  67
Alice  90   50   65  90
John   87   90  100  85
Chris  90   85   80  50
대한이    80  100   85  89
민국     70   80   75  67

2.3 Deleting Rows

Rows are deleted with the drop() function: pass the index label(s) of the rows to remove (with the default integer index, the label happens to be the row number). Let's remove the Evan and Chloe rows.

data.drop(["Evan", "Chloe"], inplace = True) 
  
data 
       국어  영어   수학  과학
이름
Alice  90   50    65  90
John   87   90   100  85
Chris  90   85    80  50
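Note that `drop()` matches index labels, not positions. To delete by position, one sketch is to look the labels up through `data.index` first (toy data below):

```python
import pandas as pd

data = pd.DataFrame({'국어': [90, 87, 90]},
                    index=['Alice', 'John', 'Chris'])

# Translate positions 0 and 2 into their labels, then drop them
result = data.drop(data.index[[0, 2]])
print(result.index.tolist())  # ['John']
```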

So far we have covered the most basic operations on DataFrame rows. In practice, more advanced techniques are used heavily, but the fundamentals always matter, so make sure you are comfortable with these basic functions.


Next time, we will learn how to work with rows and columns together.

Let's add subgroups to a line plot using Python's Seaborn.

Let's continue to look at the mpg dataset. We've seen that the average miles per gallon for cars has increased over time, but how has the average horsepower for cars changed over time? And does this trend differ by country of origin?

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

url = 'https://assets.datacamp.com/production/repositories/3996/datasets/e0b285b89bdbfbbe8d81123e64727ff150d544e0/mpg.csv'
mpg = pd.read_csv(url)
print(mpg)
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0   
1    15.0          8         350.0       165.0    3693          11.5   
2    18.0          8         318.0       150.0    3436          11.0   
3    16.0          8         304.0       150.0    3433          12.0   
4    17.0          8         302.0       140.0    3449          10.5   
5    15.0          8         429.0       198.0    4341          10.0   
6    14.0          8         454.0       220.0    4354           9.0   
7    14.0          8         440.0       215.0    4312           8.5   
8    14.0          8         455.0       225.0    4425          10.0   
9    15.0          8         390.0       190.0    3850           8.5   
10   15.0          8         383.0       170.0    3563          10.0   
11   14.0          8         340.0       160.0    3609           8.0   
12   15.0          8         400.0       150.0    3761           9.5   
13   14.0          8         455.0       225.0    3086          10.0   
14   24.0          4         113.0        95.0    2372          15.0   
15   22.0          6         198.0        95.0    2833          15.5   
16   18.0          6         199.0        97.0    2774          15.5   
17   21.0          6         200.0        85.0    2587          16.0   
18   27.0          4          97.0        88.0    2130          14.5   
19   26.0          4          97.0        46.0    1835          20.5   
20   25.0          4         110.0        87.0    2672          17.5   
21   24.0          4         107.0        90.0    2430          14.5   
22   25.0          4         104.0        95.0    2375          17.5   
23   26.0          4         121.0       113.0    2234          12.5   
24   21.0          6         199.0        90.0    2648          15.0   
25   10.0          8         360.0       215.0    4615          14.0   
26   10.0          8         307.0       200.0    4376          15.0   
27   11.0          8         318.0       210.0    4382          13.5   
28    9.0          8         304.0       193.0    4732          18.5   
29   27.0          4          97.0        88.0    2130          14.5   
..    ...        ...           ...         ...     ...           ...   
368  27.0          4         112.0        88.0    2640          18.6   
369  34.0          4         112.0        88.0    2395          18.0   
370  31.0          4         112.0        85.0    2575          16.2   
371  29.0          4         135.0        84.0    2525          16.0   
372  27.0          4         151.0        90.0    2735          18.0   
373  24.0          4         140.0        92.0    2865          16.4   
374  23.0          4         151.0         NaN    3035          20.5   
375  36.0          4         105.0        74.0    1980          15.3   
376  37.0          4          91.0        68.0    2025          18.2   
377  31.0          4          91.0        68.0    1970          17.6   
378  38.0          4         105.0        63.0    2125          14.7   
379  36.0          4          98.0        70.0    2125          17.3   
380  36.0          4         120.0        88.0    2160          14.5   
381  36.0          4         107.0        75.0    2205          14.5   
382  34.0          4         108.0        70.0    2245          16.9   
383  38.0          4          91.0        67.0    1965          15.0   
384  32.0          4          91.0        67.0    1965          15.7   
385  38.0          4          91.0        67.0    1995          16.2   
386  25.0          6         181.0       110.0    2945          16.4   
387  38.0          6         262.0        85.0    3015          17.0   
388  26.0          4         156.0        92.0    2585          14.5   
389  22.0          6         232.0       112.0    2835          14.7   
390  32.0          4         144.0        96.0    2665          13.9   
391  36.0          4         135.0        84.0    2370          13.0   
392  27.0          4         151.0        90.0    2950          17.3   
393  27.0          4         140.0        86.0    2790          15.6   
394  44.0          4          97.0        52.0    2130          24.6   
395  32.0          4         135.0        84.0    2295          11.6   
396  28.0          4         120.0        79.0    2625          18.6   
397  31.0          4         119.0        82.0    2720          19.4   

     model_year  origin                               name  
0            70     usa          chevrolet chevelle malibu  
1            70     usa                  buick skylark 320  
2            70     usa                 plymouth satellite  
3            70     usa                      amc rebel sst  
4            70     usa                        ford torino  
5            70     usa                   ford galaxie 500  
6            70     usa                   chevrolet impala  
7            70     usa                  plymouth fury iii  
8            70     usa                   pontiac catalina  
9            70     usa                 amc ambassador dpl  
10           70     usa                dodge challenger se  
11           70     usa                 plymouth 'cuda 340  
12           70     usa              chevrolet monte carlo  
13           70     usa            buick estate wagon (sw)  
14           70   japan              toyota corona mark ii  
15           70     usa                    plymouth duster  
16           70     usa                         amc hornet  
17           70     usa                      ford maverick  
18           70   japan                       datsun pl510  
19           70  europe       volkswagen 1131 deluxe sedan  
20           70  europe                        peugeot 504  
21           70  europe                        audi 100 ls  
22           70  europe                           saab 99e  
23           70  europe                           bmw 2002  
24           70     usa                        amc gremlin  
25           70     usa                          ford f250  
26           70     usa                          chevy c20  
27           70     usa                         dodge d200  
28           70     usa                           hi 1200d  
29           71   japan                       datsun pl510  
..          ...     ...                                ...  
368          82     usa           chevrolet cavalier wagon  
369          82     usa          chevrolet cavalier 2-door  
370          82     usa         pontiac j2000 se hatchback  
371          82     usa                     dodge aries se  
372          82     usa                    pontiac phoenix  
373          82     usa               ford fairmont futura  
374          82     usa                     amc concord dl  
375          82  europe                volkswagen rabbit l  
376          82   japan                 mazda glc custom l  
377          82   japan                   mazda glc custom  
378          82     usa             plymouth horizon miser  
379          82     usa                     mercury lynx l  
380          82   japan                   nissan stanza xe  
381          82   japan                       honda accord  
382          82   japan                     toyota corolla  
383          82   japan                        honda civic  
384          82   japan                 honda civic (auto)  
385          82   japan                      datsun 310 gx  
386          82     usa              buick century limited  
387          82     usa  oldsmobile cutlass ciera (diesel)  
388          82     usa         chrysler lebaron medallion  
389          82     usa                     ford granada l  
390          82   japan                   toyota celica gt  
391          82     usa                  dodge charger 2.2  
392          82     usa                   chevrolet camaro  
393          82     usa                    ford mustang gl  
394          82  europe                          vw pickup  
395          82     usa                      dodge rampage  
396          82     usa                        ford ranger  
397          82     usa                         chevy s-10  

[398 rows x 9 columns]

Step 1. Turn off Confidence Intervals on the Plot

Use relplot() and the mpg DataFrame to create a line plot with "model_year" on the x-axis and "horsepower" on the y-axis. Turn off the confidence intervals on the plot.


# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create line plot of model year vs. horsepower
sns.relplot(x='model_year', 
            y='horsepower', 
            data=mpg, 
            kind='line', 
            ci=None)

# Show plot
plt.show()

Step 2. Add Style and Color

Create different lines for each country of origin ("origin") that vary in both line style and color.

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Change to create subgroups for country of origin
sns.relplot(x="model_year",
            y="horsepower",
            data=mpg, kind="line",
            ci=None, style="origin", hue="origin")

# Show plot
plt.show()

Step 3. Add Markers

Add markers for each data point to the lines.

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add markers for each data point
sns.relplot(x="model_year", y="horsepower", 
            data=mpg, kind="line", 
            ci=None, style="origin", 
            hue="origin", 
            markers=True)

# Show plot
plt.show()

Now that we've added subgroups, we can see that this downward trend in horsepower was more pronounced among cars from the USA.


All the contents are from DataCamp

Chapter 2: Tidying data for analysis

Here, you'll learn about the principles of tidy data and more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

Chapter 1. Recognizing tidy data

For data to be tidy, it must have:

  • Each variable as a separate column.
  • Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.
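To make the two rules concrete, here is a sketch of an untidy table and its tidy equivalent (the toy data is made up):

```python
import pandas as pd

# Untidy: the 'treatment' variable is spread across two columns
untidy = pd.DataFrame({'name': ['Anna', 'Bob'],
                       'treatment_a': [3, 4],
                       'treatment_b': [5, 6]})

# Tidy: each variable is a column, each row is one observation
tidy = untidy.melt(id_vars='name', var_name='treatment', value_name='result')
print(tidy)
```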

Chapter 2. Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently (e.g., this could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep it in its current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

In [3]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
In [5]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame = airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN

Chapter 3. Customizing melted data

When melting DataFrames, it would be better to have column names more meaningful than variable and value (the default names used by pd.melt()).

The default names may work in certain situations, but it's best to always have data that is self explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. Pandas as pd and the DataFrame airquality has been pre-loaded for you.

In [6]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')


# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN

Chapter 4. Pivot data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). The melted DataFrame airquality_melt has been pre-loaded for you.

In [7]:
# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
In [8]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3

Chapter 5. Resetting the index of a DataFrame

After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).

Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas. In essence, they allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'.
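Before resetting it, note that a MultiIndex can also be used directly with .loc; a sketch with a few hand-built rows:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([(5, 1), (5, 2), (6, 1)],
                                names=['Month', 'Day'])
df = pd.DataFrame({'Ozone': [41.0, 36.0, 28.0]}, index=idx)

# Select one observation with a (Month, Day) tuple
row = df.loc[(5, 2)]

# Select every row for an outer level (all of Month 5)
month5 = df.loc[5]
```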

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from airquality_pivot, which has been pre-loaded.

In [9]:
# Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
           names=['Month', 'Day'])
In [10]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
In [11]:
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)
In [12]:
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3

Chapter 6. Pivoting duplicate values

So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using .reset_index().

NumPy and pandas have been imported as np and pd respectively.

In [24]:
# Duplicate every row (DataFrame.append was removed in pandas 2.0, so use pd.concat)
airquality_dup = pd.concat([airquality_melt, airquality_melt], ignore_index=True)
airquality_dup
Out[24]:
      Month  Day measurement  reading
0         5    1       Ozone     41.0
1         5    2       Ozone     36.0
2         5    3       Ozone     12.0
3         5    4       Ozone     18.0
4         5    5       Ozone      NaN
...     ...  ...         ...      ...
1219      9   26        Temp     70.0
1220      9   27        Temp     77.0
1221      9   28        Temp     75.0
1222      9   29        Temp     76.0
1223      9   30        Temp     68.0

[1224 rows x 4 columns]

In [28]:
import numpy as np
# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], 
                                              columns='measurement', 
                                              values='reading', 
                                              aggfunc=np.mean)

print(airquality_pivot)
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
      6       28.0      NaN  66.0  14.9
      7       23.0    299.0  65.0   8.6
      8       19.0     99.0  59.0  13.8
      9        8.0     19.0  61.0  20.1
      10       NaN    194.0  69.0   8.6
      11       7.0      NaN  74.0   6.9
      12      16.0    256.0  69.0   9.7
      13      11.0    290.0  66.0   9.2
      14      14.0    274.0  68.0  10.9
      15      18.0     65.0  58.0  13.2
      16      14.0    334.0  64.0  11.5
      17      34.0    307.0  66.0  12.0
      18       6.0     78.0  57.0  18.4
      19      30.0    322.0  68.0  11.5
      20      11.0     44.0  62.0   9.7
      21       1.0      8.0  59.0   9.7
      22      11.0    320.0  73.0  16.6
      23       4.0     25.0  61.0   9.7
      24      32.0     92.0  61.0  12.0
      25       NaN     66.0  57.0  16.6
      26       NaN    266.0  58.0  14.9
      27       NaN      NaN  57.0   8.0
      28      23.0     13.0  67.0  12.0
      29      45.0    252.0  81.0  14.9
      30     115.0    223.0  79.0   5.7
...            ...      ...   ...   ...
9     1       96.0    167.0  91.0   6.9
      2       78.0    197.0  92.0   5.1
      3       73.0    183.0  93.0   2.8
      4       91.0    189.0  93.0   4.6
      5       47.0     95.0  87.0   7.4
      6       32.0     92.0  84.0  15.5
      7       20.0    252.0  80.0  10.9
      8       23.0    220.0  78.0  10.3
      9       21.0    230.0  75.0  10.9
      10      24.0    259.0  73.0   9.7
      11      44.0    236.0  81.0  14.9
      12      21.0    259.0  76.0  15.5
      13      28.0    238.0  77.0   6.3
      14       9.0     24.0  71.0  10.9
      15      13.0    112.0  71.0  11.5
      16      46.0    237.0  78.0   6.9
      17      18.0    224.0  67.0  13.8
      18      13.0     27.0  76.0  10.3
      19      24.0    238.0  68.0  10.3
      20      16.0    201.0  82.0   8.0
      21      13.0    238.0  64.0  12.6
      22      23.0     14.0  71.0   9.2
      23      36.0    139.0  81.0  10.3
      24       7.0     49.0  69.0  10.3
      25      14.0     20.0  63.0  16.6
      26      30.0    193.0  70.0   6.9
      27       NaN    145.0  77.0  13.2
      28      14.0    191.0  75.0  14.3
      29      18.0    131.0  76.0   8.0
      30      20.0    223.0  68.0  11.5

[153 rows x 4 columns]
In [29]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
In [30]:
# Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

Chapter 7. Splitting a column with .str

The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14 years of age. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.

In [37]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/tb.csv"
tb = pd.read_csv(url)
print(tb.head())
  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  
In [38]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.head())
  country  year variable  value
0      AD  2000     m014    0.0
1      AE  2000     m014    2.0
2      AF  2000     m014   52.0
3      AG  2000     m014    0.0
4      AL  2000     m014    2.0
In [40]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
print(tb_melt.head())
  country  year variable  value gender
0      AD  2000     m014    0.0      m
1      AE  2000     m014    2.0      m
2      AF  2000     m014   52.0      m
3      AG  2000     m014    0.0      m
4      AL  2000     m014    2.0      m
In [41]:
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head())
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014

Chapter 8. Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a dataset consisting of Ebola cases and death counts by state and country. It has been pre-loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore serves as a delimiter between the first part (cases or deaths), and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method called .split(). By default, this method will split a string into parts separated by a space. However, in this case you want it to split by an underscore. You can do this on Cases_Guinea, for example, using 'Cases_Guinea'.split('_'), which returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.
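Before applying this to the full dataset, the split-then-get pattern can be tried on a small standalone Series (the example values are illustrative):

```python
import pandas as pd

s = pd.Series(['Cases_Guinea', 'Deaths_Liberia'])

# .str.split('_') turns each string into a list of its parts
parts = s.str.split('_')

# .str.get(i) pulls the i-th element out of each list
type_col = parts.str.get(0)
country_col = parts.str.get(1)

print(type_col.tolist())     # ['Cases', 'Deaths']
print(country_col.tolist())  # ['Guinea', 'Liberia']
```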

In [42]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [43]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
print(ebola_melt.head())
         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
In [44]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]
In [45]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases
In [46]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
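As a variation on the steps above, `str.split()` accepts an `expand=True` argument that returns the parts as separate DataFrame columns directly, skipping the intermediate list column. A minimal sketch on a toy frame (values are illustrative, not the full ebola data):

```python
import pandas as pd

# Two sample rows in the shape of ebola_melt
ebola_melt = pd.DataFrame({
    'type_country': ['Cases_Guinea', 'Deaths_Liberia'],
    'counts': [2776.0, 3496.0],
})

# expand=True yields a DataFrame with one column per split part,
# so there is no need for a helper 'str_split' column
ebola_melt[['type', 'country']] = ebola_melt['type_country'].str.split('_', expand=True)

print(ebola_melt)
```

The three-step version in the exercise makes each operation explicit, while `expand=True` is the more compact idiom once the pattern is familiar.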