Changing the size of scatter plot points

In this exercise, we'll explore Seaborn's mpg dataset, which contains one row per car model and includes information such as the year the car was made, its fuel efficiency in miles per gallon ("mpg"), its engine power ("horsepower"), and its country of origin.

What is the relationship between the power of a car's engine ("horsepower") and its fuel efficiency ("mpg")? And how does this relationship vary by the number of cylinders ("cylinders") the car has? Let's find out.

Let's continue to use relplot() instead of scatterplot(), since relplot() offers more flexibility: in particular, it can split the data into subplots.

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

url = 'https://assets.datacamp.com/production/repositories/3996/datasets/e0b285b89bdbfbbe8d81123e64727ff150d544e0/mpg.csv'
mpg = pd.read_csv(url)
print(mpg)
      mpg  cylinders  displacement  horsepower  weight  acceleration  \
0    18.0          8         307.0       130.0    3504          12.0   
1    15.0          8         350.0       165.0    3693          11.5   
2    18.0          8         318.0       150.0    3436          11.0   
3    16.0          8         304.0       150.0    3433          12.0   
4    17.0          8         302.0       140.0    3449          10.5   
5    15.0          8         429.0       198.0    4341          10.0   
6    14.0          8         454.0       220.0    4354           9.0   
7    14.0          8         440.0       215.0    4312           8.5   
8    14.0          8         455.0       225.0    4425          10.0   
9    15.0          8         390.0       190.0    3850           8.5   
10   15.0          8         383.0       170.0    3563          10.0   
11   14.0          8         340.0       160.0    3609           8.0   
12   15.0          8         400.0       150.0    3761           9.5   
13   14.0          8         455.0       225.0    3086          10.0   
14   24.0          4         113.0        95.0    2372          15.0   
15   22.0          6         198.0        95.0    2833          15.5   
16   18.0          6         199.0        97.0    2774          15.5   
17   21.0          6         200.0        85.0    2587          16.0   
18   27.0          4          97.0        88.0    2130          14.5   
19   26.0          4          97.0        46.0    1835          20.5   
20   25.0          4         110.0        87.0    2672          17.5   
21   24.0          4         107.0        90.0    2430          14.5   
22   25.0          4         104.0        95.0    2375          17.5   
23   26.0          4         121.0       113.0    2234          12.5   
24   21.0          6         199.0        90.0    2648          15.0   
25   10.0          8         360.0       215.0    4615          14.0   
26   10.0          8         307.0       200.0    4376          15.0   
27   11.0          8         318.0       210.0    4382          13.5   
28    9.0          8         304.0       193.0    4732          18.5   
29   27.0          4          97.0        88.0    2130          14.5   
..    ...        ...           ...         ...     ...           ...   
368  27.0          4         112.0        88.0    2640          18.6   
369  34.0          4         112.0        88.0    2395          18.0   
370  31.0          4         112.0        85.0    2575          16.2   
371  29.0          4         135.0        84.0    2525          16.0   
372  27.0          4         151.0        90.0    2735          18.0   
373  24.0          4         140.0        92.0    2865          16.4   
374  23.0          4         151.0         NaN    3035          20.5   
375  36.0          4         105.0        74.0    1980          15.3   
376  37.0          4          91.0        68.0    2025          18.2   
377  31.0          4          91.0        68.0    1970          17.6   
378  38.0          4         105.0        63.0    2125          14.7   
379  36.0          4          98.0        70.0    2125          17.3   
380  36.0          4         120.0        88.0    2160          14.5   
381  36.0          4         107.0        75.0    2205          14.5   
382  34.0          4         108.0        70.0    2245          16.9   
383  38.0          4          91.0        67.0    1965          15.0   
384  32.0          4          91.0        67.0    1965          15.7   
385  38.0          4          91.0        67.0    1995          16.2   
386  25.0          6         181.0       110.0    2945          16.4   
387  38.0          6         262.0        85.0    3015          17.0   
388  26.0          4         156.0        92.0    2585          14.5   
389  22.0          6         232.0       112.0    2835          14.7   
390  32.0          4         144.0        96.0    2665          13.9   
391  36.0          4         135.0        84.0    2370          13.0   
392  27.0          4         151.0        90.0    2950          17.3   
393  27.0          4         140.0        86.0    2790          15.6   
394  44.0          4          97.0        52.0    2130          24.6   
395  32.0          4         135.0        84.0    2295          11.6   
396  28.0          4         120.0        79.0    2625          18.6   
397  31.0          4         119.0        82.0    2720          19.4   

     model_year  origin                               name  
0            70     usa          chevrolet chevelle malibu  
1            70     usa                  buick skylark 320  
2            70     usa                 plymouth satellite  
3            70     usa                      amc rebel sst  
4            70     usa                        ford torino  
5            70     usa                   ford galaxie 500  
6            70     usa                   chevrolet impala  
7            70     usa                  plymouth fury iii  
8            70     usa                   pontiac catalina  
9            70     usa                 amc ambassador dpl  
10           70     usa                dodge challenger se  
11           70     usa                 plymouth 'cuda 340  
12           70     usa              chevrolet monte carlo  
13           70     usa            buick estate wagon (sw)  
14           70   japan              toyota corona mark ii  
15           70     usa                    plymouth duster  
16           70     usa                         amc hornet  
17           70     usa                      ford maverick  
18           70   japan                       datsun pl510  
19           70  europe       volkswagen 1131 deluxe sedan  
20           70  europe                        peugeot 504  
21           70  europe                        audi 100 ls  
22           70  europe                           saab 99e  
23           70  europe                           bmw 2002  
24           70     usa                        amc gremlin  
25           70     usa                          ford f250  
26           70     usa                          chevy c20  
27           70     usa                         dodge d200  
28           70     usa                           hi 1200d  
29           71   japan                       datsun pl510  
..          ...     ...                                ...  
368          82     usa           chevrolet cavalier wagon  
369          82     usa          chevrolet cavalier 2-door  
370          82     usa         pontiac j2000 se hatchback  
371          82     usa                     dodge aries se  
372          82     usa                    pontiac phoenix  
373          82     usa               ford fairmont futura  
374          82     usa                     amc concord dl  
375          82  europe                volkswagen rabbit l  
376          82   japan                 mazda glc custom l  
377          82   japan                   mazda glc custom  
378          82     usa             plymouth horizon miser  
379          82     usa                     mercury lynx l  
380          82   japan                   nissan stanza xe  
381          82   japan                       honda accord  
382          82   japan                     toyota corolla  
383          82   japan                        honda civic  
384          82   japan                 honda civic (auto)  
385          82   japan                      datsun 310 gx  
386          82     usa              buick century limited  
387          82     usa  oldsmobile cutlass ciera (diesel)  
388          82     usa         chrysler lebaron medallion  
389          82     usa                     ford granada l  
390          82   japan                   toyota celica gt  
391          82     usa                  dodge charger 2.2  
392          82     usa                   chevrolet camaro  
393          82     usa                    ford mustang gl  
394          82  europe                          vw pickup  
395          82     usa                      dodge rampage  
396          82     usa                        ford ranger  
397          82     usa                         chevy s-10  

[398 rows x 9 columns]
# Create scatter plot of horsepower vs. mpg
sns.relplot(x="horsepower", y="mpg",
            data=mpg,
            kind="scatter",
            size="cylinders",  # point size varies by cylinder count
            hue="cylinders")   # point color varies by cylinder count

# Show plot
plt.show()
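Because "cylinders" is numeric, seaborn renders it with a continuous color gradient. If you prefer one discrete color per cylinder count, cast the column to string first. Here is a minimal sketch, using a small hypothetical sample in place of the full dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical mini-sample standing in for the full mpg dataset
mpg = pd.DataFrame({
    "horsepower": [130.0, 165.0, 95.0, 88.0, 97.0, 85.0],
    "mpg": [18.0, 15.0, 24.0, 27.0, 18.0, 21.0],
    "cylinders": [8, 8, 4, 4, 6, 6],
})

# Casting "cylinders" to string makes seaborn treat it as categorical,
# giving each cylinder count a distinct color instead of a blue gradient
mpg["cylinders"] = mpg["cylinders"].astype(str)

g = sns.relplot(x="horsepower", y="mpg",
                data=mpg,
                kind="scatter",
                size="cylinders",
                hue="cylinders")
plt.show()
```
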

All the contents are from DataCamp
Learn R, Python & Data Science Online
www.datacamp.com

Creating two-factor subplots

Let's continue looking at the student_data dataset of students in secondary school. Here, we want to answer the following question: does a student's first semester grade ("G1") tend to correlate with their final grade ("G3")?

There are many aspects of a student's life that could result in a higher or lower final grade in the class. For example, some students receive extra educational support from their school ("schoolsup") or from their family ("famsup"), which could result in higher grades. Let's try to control for these two factors by creating subplots based on whether the student received extra educational support from their school or family.

Seaborn has been imported as sns and matplotlib.pyplot has been imported as plt.

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd 

url = 'https://assets.datacamp.com/production/repositories/3996/datasets/61e08004fef1a1b02b62620e3cd2533834239c90/student-alcohol-consumption.csv'
student_data = pd.read_csv(url)
print(student_data)
     Unnamed: 0 school sex  age famsize Pstatus  Medu  Fedu  traveltime  \
0             0     GP   F   18     GT3       A     4     4           2   
1             1     GP   F   17     GT3       T     1     1           1   
2             2     GP   F   15     LE3       T     1     1           1   
3             3     GP   F   15     GT3       T     4     2           1   
4             4     GP   F   16     GT3       T     3     3           1   
5             5     GP   M   16     LE3       T     4     3           1   
6             6     GP   M   16     LE3       T     2     2           1   
7             7     GP   F   17     GT3       A     4     4           2   
8             8     GP   M   15     LE3       A     3     2           1   
9             9     GP   M   15     GT3       T     3     4           1   
10           10     GP   F   15     GT3       T     4     4           1   
11           11     GP   F   15     GT3       T     2     1           3   
12           12     GP   M   15     LE3       T     4     4           1   
13           13     GP   M   15     GT3       T     4     3           2   
14           14     GP   M   15     GT3       A     2     2           1   
15           15     GP   F   16     GT3       T     4     4           1   
16           16     GP   F   16     GT3       T     4     4           1   
17           17     GP   F   16     GT3       T     3     3           3   
18           18     GP   M   17     GT3       T     3     2           1   
19           19     GP   M   16     LE3       T     4     3           1   
20           20     GP   M   15     GT3       T     4     3           1   
21           21     GP   M   15     GT3       T     4     4           1   
22           22     GP   M   16     LE3       T     4     2           1   
23           23     GP   M   16     LE3       T     2     2           2   
24           24     GP   F   15     GT3       T     2     4           1   
25           25     GP   F   16     GT3       T     2     2           1   
26           26     GP   M   15     GT3       T     2     2           1   
27           27     GP   M   15     GT3       T     4     2           1   
28           28     GP   M   16     LE3       A     3     4           1   
29           29     GP   M   16     GT3       T     4     4           1   
..          ...    ...  ..  ...     ...     ...   ...   ...         ...   
365         365     MS   M   18     GT3       T     1     3           2   
366         366     MS   M   18     LE3       T     4     4           2   
367         367     MS   F   17     GT3       T     1     1           3   
368         368     MS   F   18     GT3       T     2     3           2   
369         369     MS   F   18     GT3       T     4     4           3   
370         370     MS   F   19     LE3       T     3     2           2   
371         371     MS   M   18     LE3       T     1     2           3   
372         372     MS   F   17     GT3       T     2     2           1   
373         373     MS   F   17     GT3       T     1     2           1   
374         374     MS   F   18     LE3       T     4     4           2   
375         375     MS   F   18     GT3       T     1     1           4   
376         376     MS   F   20     GT3       T     4     2           2   
377         377     MS   F   18     LE3       T     4     4           1   
378         378     MS   F   18     GT3       T     3     3           1   
379         379     MS   F   17     GT3       T     3     1           1   
380         380     MS   M   18     GT3       T     4     4           1   
381         381     MS   M   18     GT3       T     2     1           2   
382         382     MS   M   17     GT3       T     2     3           2   
383         383     MS   M   19     GT3       T     1     1           2   
384         384     MS   M   18     GT3       T     4     2           2   
385         385     MS   F   18     GT3       T     2     2           2   
386         386     MS   F   18     GT3       T     4     4           3   
387         387     MS   F   19     GT3       T     2     3           1   
388         388     MS   F   18     LE3       T     3     1           1   
389         389     MS   F   18     GT3       T     1     1           2   
390         390     MS   M   20     LE3       A     2     2           1   
391         391     MS   M   17     LE3       T     3     1           2   
392         392     MS   M   21     GT3       T     1     1           1   
393         393     MS   M   18     LE3       T     3     2           3   
394         394     MS   M   19     LE3       T     1     1           1   

     failures  ... goout Dalc Walc health absences  G1  G2  G3  location  \
0           0  ...     4    1    1      3        6   5   6   6     Urban   
1           0  ...     3    1    1      3        4   5   5   6     Urban   
2           3  ...     2    2    3      3       10   7   8  10     Urban   
3           0  ...     2    1    1      5        2  15  14  15     Urban   
4           0  ...     2    1    2      5        4   6  10  10     Urban   
5           0  ...     2    1    2      5       10  15  15  15     Urban   
6           0  ...     4    1    1      3        0  12  12  11     Urban   
7           0  ...     4    1    1      1        6   6   5   6     Urban   
8           0  ...     2    1    1      1        0  16  18  19     Urban   
9           0  ...     1    1    1      5        0  14  15  15     Urban   
10          0  ...     3    1    2      2        0  10   8   9     Urban   
11          0  ...     2    1    1      4        4  10  12  12     Urban   
12          0  ...     3    1    3      5        2  14  14  14     Urban   
13          0  ...     3    1    2      3        2  10  10  11     Urban   
14          0  ...     2    1    1      3        0  14  16  16     Urban   
15          0  ...     4    1    2      2        4  14  14  14     Urban   
16          0  ...     3    1    2      2        6  13  14  14     Urban   
17          0  ...     2    1    1      4        4   8  10  10     Urban   
18          3  ...     5    2    4      5       16   6   5   5     Urban   
19          0  ...     3    1    3      5        4   8  10  10     Urban   
20          0  ...     1    1    1      1        0  13  14  15     Urban   
21          0  ...     2    1    1      5        0  12  15  15     Urban   
22          0  ...     1    1    3      5        2  15  15  16     Urban   
23          0  ...     4    2    4      5        0  13  13  12     Urban   
24          0  ...     2    1    1      5        2  10   9   8     Rural   
25          2  ...     2    1    3      5       14   6   9   8     Urban   
26          0  ...     2    1    2      5        2  12  12  11     Urban   
27          0  ...     4    2    4      1        4  15  16  15     Urban   
28          0  ...     3    1    1      5        4  11  11  11     Urban   
29          0  ...     5    5    5      5       16  10  12  11     Urban   
..        ...  ...   ...  ...  ...    ...      ...  ..  ..  ..       ...   
365         0  ...     4    2    4      3        4  10  10  10     Rural   
366         0  ...     2    2    2      5        0  13  13  13     Urban   
367         1  ...     1    1    2      1        0   7   6   0     Rural   
368         0  ...     3    1    2      4        0  11  10  10     Urban   
369         0  ...     2    4    2      5       10  14  12  11     Rural   
370         2  ...     2    1    1      3        4   7   7   9     Urban   
371         0  ...     3    2    3      3        3  14  12  12     Rural   
372         0  ...     3    1    1      3        8  13  11  11     Urban   
373         0  ...     5    1    3      1       14   6   5   5     Rural   
374         0  ...     4    1    1      1        0  19  18  19     Rural   
375         0  ...     2    1    2      4        2   8   8  10     Rural   
376         2  ...     3    1    1      3        4  15  14  15     Urban   
377         0  ...     3    3    4      2        4   8   9  10     Rural   
378         0  ...     3    1    2      1        0  15  15  15     Urban   
379         0  ...     4    2    3      1       17  10  10  10     Rural   
380         0  ...     4    1    4      2        4  15  14  14     Urban   
381         0  ...     3    1    3      5        5   7   6   7     Rural   
382         0  ...     3    1    1      3        2  11  11  10     Urban   
383         1  ...     2    1    3      5        0   6   5   0     Rural   
384         1  ...     3    4    3      3       14   6   5   5     Rural   
385         0  ...     3    1    3      4        2  10   9  10     Rural   
386         0  ...     3    2    2      5        7   6   5   6     Rural   
387         1  ...     2    1    2      5        0   7   5   0     Rural   
388         0  ...     4    1    1      1        0   7   9   8     Urban   
389         1  ...     1    1    1      5        0   6   5   0     Urban   
390         2  ...     4    4    5      4       11   9   9   9     Urban   
391         0  ...     5    3    4      2        3  14  16  16     Urban   
392         3  ...     3    3    3      3        3  10   8   7     Rural   
393         0  ...     1    3    4      5        0  11  12  10     Rural   
394         0  ...     3    3    3      5        5   8   9   9     Urban   

        study_time  
0     2 to 5 hours  
1     2 to 5 hours  
2     2 to 5 hours  
3    5 to 10 hours  
4     2 to 5 hours  
5     2 to 5 hours  
6     2 to 5 hours  
7     2 to 5 hours  
8     2 to 5 hours  
9     2 to 5 hours  
10    2 to 5 hours  
11   5 to 10 hours  
12        <2 hours  
13    2 to 5 hours  
14   5 to 10 hours  
15        <2 hours  
16   5 to 10 hours  
17    2 to 5 hours  
18        <2 hours  
19        <2 hours  
20    2 to 5 hours  
21        <2 hours  
22    2 to 5 hours  
23    2 to 5 hours  
24   5 to 10 hours  
25        <2 hours  
26        <2 hours  
27        <2 hours  
28    2 to 5 hours  
29    2 to 5 hours  
..             ...  
365   2 to 5 hours  
366  5 to 10 hours  
367       <2 hours  
368       <2 hours  
369   2 to 5 hours  
370   2 to 5 hours  
371       <2 hours  
372  5 to 10 hours  
373       <2 hours  
374  5 to 10 hours  
375  5 to 10 hours  
376  5 to 10 hours  
377   2 to 5 hours  
378   2 to 5 hours  
379   2 to 5 hours  
380   2 to 5 hours  
381       <2 hours  
382   2 to 5 hours  
383       <2 hours  
384       <2 hours  
385  5 to 10 hours  
386       <2 hours  
387  5 to 10 hours  
388   2 to 5 hours  
389   2 to 5 hours  
390   2 to 5 hours  
391       <2 hours  
392       <2 hours  
393       <2 hours  
394       <2 hours  

[395 rows x 30 columns]
# Adjust to add subplots based on school support
sns.relplot(x="G1", y="G3",
            data=student_data,
            kind="scatter",
            col="schoolsup",
            col_order=["yes", "no"])

# Show plot
plt.show()

# Adjust further to add subplots based on family support
sns.relplot(x="G1", y="G3", 
            data=student_data,
            kind="scatter", 
            col="schoolsup",
            col_order=["yes", "no"],
            row="famsup",
            row_order=["yes", "no"])

# Show plot
plt.show()
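Combining col and row gives one subplot per combination of the two variables, here a 2x2 grid. A quick sketch with hypothetical sample data shows the shape of the resulting FacetGrid:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical mini-sample covering all four schoolsup/famsup combinations
student_data = pd.DataFrame({
    "G1": [5, 15, 7, 14, 6, 12, 10, 8],
    "G3": [6, 15, 10, 14, 6, 11, 9, 10],
    "schoolsup": ["yes", "yes", "no", "no", "yes", "yes", "no", "no"],
    "famsup": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
})

g = sns.relplot(x="G1", y="G3",
                data=student_data,
                kind="scatter",
                col="schoolsup", col_order=["yes", "no"],
                row="famsup", row_order=["yes", "no"])

# col and row together yield one axes per (row, col) combination
print(g.axes.shape)  # → (2, 2)
plt.show()
```
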

 


Creating subplots with col and row

We've seen in prior exercises that students with more absences ("absences") tend to have lower final grades ("G3"). Does this relationship hold regardless of how much time students study each week?

To answer this, we'll look at the relationship between the number of absences that a student has in school and their final grade in the course, creating separate subplots based on each student's weekly study time ("study_time").

Seaborn has been imported as sns and matplotlib.pyplot has been imported as plt.

# Import Matplotlib and Seaborn
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd 

url = 'https://assets.datacamp.com/production/repositories/3996/datasets/61e08004fef1a1b02b62620e3cd2533834239c90/student-alcohol-consumption.csv'
student_data = pd.read_csv(url)
print(student_data)
     Unnamed: 0 school sex  age famsize Pstatus  Medu  Fedu  traveltime  \
0             0     GP   F   18     GT3       A     4     4           2   
1             1     GP   F   17     GT3       T     1     1           1   
2             2     GP   F   15     LE3       T     1     1           1   
3             3     GP   F   15     GT3       T     4     2           1   
4             4     GP   F   16     GT3       T     3     3           1   
5             5     GP   M   16     LE3       T     4     3           1   
6             6     GP   M   16     LE3       T     2     2           1   
7             7     GP   F   17     GT3       A     4     4           2   
8             8     GP   M   15     LE3       A     3     2           1   
9             9     GP   M   15     GT3       T     3     4           1   
10           10     GP   F   15     GT3       T     4     4           1   
11           11     GP   F   15     GT3       T     2     1           3   
12           12     GP   M   15     LE3       T     4     4           1   
13           13     GP   M   15     GT3       T     4     3           2   
14           14     GP   M   15     GT3       A     2     2           1   
15           15     GP   F   16     GT3       T     4     4           1   
16           16     GP   F   16     GT3       T     4     4           1   
17           17     GP   F   16     GT3       T     3     3           3   
18           18     GP   M   17     GT3       T     3     2           1   
19           19     GP   M   16     LE3       T     4     3           1   
20           20     GP   M   15     GT3       T     4     3           1   
21           21     GP   M   15     GT3       T     4     4           1   
22           22     GP   M   16     LE3       T     4     2           1   
23           23     GP   M   16     LE3       T     2     2           2   
24           24     GP   F   15     GT3       T     2     4           1   
25           25     GP   F   16     GT3       T     2     2           1   
26           26     GP   M   15     GT3       T     2     2           1   
27           27     GP   M   15     GT3       T     4     2           1   
28           28     GP   M   16     LE3       A     3     4           1   
29           29     GP   M   16     GT3       T     4     4           1   
..          ...    ...  ..  ...     ...     ...   ...   ...         ...   
365         365     MS   M   18     GT3       T     1     3           2   
366         366     MS   M   18     LE3       T     4     4           2   
367         367     MS   F   17     GT3       T     1     1           3   
368         368     MS   F   18     GT3       T     2     3           2   
369         369     MS   F   18     GT3       T     4     4           3   
370         370     MS   F   19     LE3       T     3     2           2   
371         371     MS   M   18     LE3       T     1     2           3   
372         372     MS   F   17     GT3       T     2     2           1   
373         373     MS   F   17     GT3       T     1     2           1   
374         374     MS   F   18     LE3       T     4     4           2   
375         375     MS   F   18     GT3       T     1     1           4   
376         376     MS   F   20     GT3       T     4     2           2   
377         377     MS   F   18     LE3       T     4     4           1   
378         378     MS   F   18     GT3       T     3     3           1   
379         379     MS   F   17     GT3       T     3     1           1   
380         380     MS   M   18     GT3       T     4     4           1   
381         381     MS   M   18     GT3       T     2     1           2   
382         382     MS   M   17     GT3       T     2     3           2   
383         383     MS   M   19     GT3       T     1     1           2   
384         384     MS   M   18     GT3       T     4     2           2   
385         385     MS   F   18     GT3       T     2     2           2   
386         386     MS   F   18     GT3       T     4     4           3   
387         387     MS   F   19     GT3       T     2     3           1   
388         388     MS   F   18     LE3       T     3     1           1   
389         389     MS   F   18     GT3       T     1     1           2   
390         390     MS   M   20     LE3       A     2     2           1   
391         391     MS   M   17     LE3       T     3     1           2   
392         392     MS   M   21     GT3       T     1     1           1   
393         393     MS   M   18     LE3       T     3     2           3   
394         394     MS   M   19     LE3       T     1     1           1   

     failures  ... goout Dalc Walc health absences  G1  G2  G3  location  \
0           0  ...     4    1    1      3        6   5   6   6     Urban   
1           0  ...     3    1    1      3        4   5   5   6     Urban   
2           3  ...     2    2    3      3       10   7   8  10     Urban   
3           0  ...     2    1    1      5        2  15  14  15     Urban   
4           0  ...     2    1    2      5        4   6  10  10     Urban   
5           0  ...     2    1    2      5       10  15  15  15     Urban   
6           0  ...     4    1    1      3        0  12  12  11     Urban   
7           0  ...     4    1    1      1        6   6   5   6     Urban   
8           0  ...     2    1    1      1        0  16  18  19     Urban   
9           0  ...     1    1    1      5        0  14  15  15     Urban   
10          0  ...     3    1    2      2        0  10   8   9     Urban   
11          0  ...     2    1    1      4        4  10  12  12     Urban   
12          0  ...     3    1    3      5        2  14  14  14     Urban   
13          0  ...     3    1    2      3        2  10  10  11     Urban   
14          0  ...     2    1    1      3        0  14  16  16     Urban   
15          0  ...     4    1    2      2        4  14  14  14     Urban   
16          0  ...     3    1    2      2        6  13  14  14     Urban   
17          0  ...     2    1    1      4        4   8  10  10     Urban   
18          3  ...     5    2    4      5       16   6   5   5     Urban   
19          0  ...     3    1    3      5        4   8  10  10     Urban   
20          0  ...     1    1    1      1        0  13  14  15     Urban   
21          0  ...     2    1    1      5        0  12  15  15     Urban   
22          0  ...     1    1    3      5        2  15  15  16     Urban   
23          0  ...     4    2    4      5        0  13  13  12     Urban   
24          0  ...     2    1    1      5        2  10   9   8     Rural   
25          2  ...     2    1    3      5       14   6   9   8     Urban   
26          0  ...     2    1    2      5        2  12  12  11     Urban   
27          0  ...     4    2    4      1        4  15  16  15     Urban   
28          0  ...     3    1    1      5        4  11  11  11     Urban   
29          0  ...     5    5    5      5       16  10  12  11     Urban   
..        ...  ...   ...  ...  ...    ...      ...  ..  ..  ..       ...   
365         0  ...     4    2    4      3        4  10  10  10     Rural   
..        ...  ...   ...  ...  ...    ...      ...  ..  ..  ..       ...   
394         0  ...     3    3    3      5        5   8   9   9     Urban   

        study_time  
0     2 to 5 hours  
1     2 to 5 hours  
..             ...  
394       <2 hours  

[395 rows x 30 columns]
# Create a scatter plot of absences vs. final grade (G3)
sns.relplot(x="absences", y="G3", 
            data=student_data,
            kind="scatter")

# Show plot
plt.show()

# Modify the code to create one scatter plot for each level of the variable "study_time", arranged in columns.
sns.relplot(x="absences", y="G3", 
            data=student_data,
            kind="scatter", col="study_time")

# Show plot
plt.show()

# Adapt your code to create one scatter plot for each level of a student's weekly study time, this time arranged in rows.
# Change this scatter plot to arrange the plots in rows instead of columns
sns.relplot(x="absences", y="G3", 
            data=student_data,
            kind="scatter", 
            row="study_time")

# Show plot
plt.show()
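The `row` and `col` arguments can also be combined to facet by two variables at once. A minimal sketch, using a small made-up DataFrame in place of `student_data` (the `location` column here is hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small made-up dataset standing in for student_data
df = pd.DataFrame({
    "absences": [0, 2, 4, 6, 1, 3, 5, 7],
    "G3": [15, 12, 10, 8, 14, 11, 9, 7],
    "study_time": ["<2 hours", "<2 hours",
                   "2 to 5 hours", "2 to 5 hours"] * 2,
    "location": ["Urban"] * 4 + ["Rural"] * 4,
})

# Facet by two variables at once: one row per location,
# one column per study-time level -> a 2x2 grid of subplots
g = sns.relplot(x="absences", y="G3", data=df,
                kind="scatter", row="location", col="study_time")
plt.close("all")
```

Each subplot then shows one (location, study_time) combination, which makes it easy to compare the absences-grade relationship across both variables at once.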

All contents are from DataCamp.


Chapter5_Case_Study


In this case study, you'll apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation. Once you're done, not only will you have a clean and tidy dataset, you'll also be ready to start working on your own data science projects using the power of Python!

Chapter 1. Exploratory analysis

Whenever you obtain a new dataset, your first task should always be to do some exploratory analysis to get a better understanding of the data and diagnose it for any potential issues.

The full Gapminder data has been loaded into a DataFrame called gapminder. In the IPython Shell, use pandas methods such as .head(), .info(), and .describe(), and DataFrame attributes like .columns and .shape to explore it.

In [91]:
import pandas as pd
url = 'https://assets.datacamp.com/production/course_2023/datasets/gapminder.csv'
gapminder = pd.read_csv(url, index_col=0)
print(gapminder.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 780 entries, 0 to 259
Columns: 218 entries, 1800 to Life expectancy
dtypes: float64(217), object(1)
memory usage: 1.3+ MB
None
In [92]:
print(gapminder.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

Chapter 2. Visualizing your data

Since 1800, life expectancy around the globe has been steadily going up. You would expect the Gapminder data to confirm this.

The DataFrame g1800s has been pre-loaded. Your job in this exercise is to create a scatter plot with life expectancy in '1800' on the x-axis and life expectancy in '1899' on the y-axis.

Here, the goal is to visually check the data for insights as well as errors. When looking at the plot, pay attention to whether the scatter plot takes the form of a diagonal line, and which points fall below or above the diagonal line. This will inform how life expectancy in 1899 changed (or did not change) compared to 1800 for different countries. If points fall on a diagonal line, it means that life expectancy remained the same!

In [93]:
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
In [94]:
# Create the scatter plot
gapminder.plot(kind='scatter', x='1800', y='1899')

# Specify axis labels
plt.xlabel('Life Expectancy by Country in 1800')
plt.ylabel('Life Expectancy by Country in 1899')

# Specify axis limits
plt.xlim(20, 55)
plt.ylim(20, 55)

# Display the plot
plt.show()

Excellent work! As you can see, there are a surprising number of countries that fall on the diagonal line. In fact, examining the DataFrame reveals that the life expectancy for 140 of the 260 countries did not change at all in the 19th century! This is possibly a result of not having access to the data for all the years back then. In this way, visualizing your data can help you uncover insights as well as diagnose it for errors.
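The on-the-diagonal observation can also be checked numerically rather than visually. A minimal sketch of the idea on a toy DataFrame (the real check would compare the '1800' and '1899' columns of the loaded gapminder data):

```python
import pandas as pd

# Toy stand-in for the 19th-century columns
df = pd.DataFrame({
    "Life expectancy": ["A", "B", "C", "D"],
    "1800": [30.0, 28.0, None, 35.0],
    "1899": [30.0, 31.0, None, 35.0],
})

# Countries whose life expectancy is identical in both years,
# i.e. points that fall on the diagonal of the scatter plot.
# Note NaN == NaN is False, so missing rows are excluded.
unchanged = df[df["1800"] == df["1899"]]
print(len(unchanged))  # -> 2
```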

Chapter 3. Thinking about the question at hand

Since you are given life expectancy level data by country and year, you could ask questions about how much the average life expectancy changes over each year.

Before continuing, however, it's important to make sure that the following assumptions about the data are true:

  • 'Life expectancy' is the last column of the DataFrame.
  • The other columns contain either null or numeric values.
  • The numeric values are all greater than or equal to 0.
  • There is only one instance of each country.

You can write a function that you can apply over the entire DataFrame to verify some of these assumptions. Note that spending the time to write such a script will help you when working with other datasets as well.

In [95]:
import pandas as pd

def check_null_or_valid(row_data):
    """Function that takes a row of data,
    drops all missing values,
    and checks if all remaining values are greater than or equal to 0
    """
    no_na = row_data.dropna()
    numeric = pd.to_numeric(no_na)
    ge0 = numeric >= 0
    return ge0
In [96]:
# Check whether the last column is 'Life expectancy'
assert gapminder.columns[217] == 'Life expectancy'
In [97]:
# Check whether the values in each row are valid (all year columns)
assert gapminder.iloc[:, 0:217].apply(check_null_or_valid, axis=1).all().all()
In [98]:
# Check that there is only one instance of each country
assert gapminder['Life expectancy'].value_counts()[0] == 1
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-98-5d9f300af706> in <module>
      1 # Check that there is only one instance of each country
----> 2 assert gapminder['Life expectancy'].value_counts()[0] == 1

AssertionError: 
In [99]:
# If the assertion fails, inspect the offending count
print(gapminder['Life expectancy'].value_counts()[0])
3

Getting into the habit of testing your code like this is an important skill. Here the assertion fails because the raw file stacks the 19th-, 20th-, and 21st-century tables on top of each other (260 countries × 3 = 780 rows), so each country appears three times.
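When a uniqueness assertion like this fails, .value_counts() can show exactly which entries repeat. A small sketch with a toy Series:

```python
import pandas as pd

# Toy 'Life expectancy' column with a repeated country name
countries = pd.Series(
    ["Albania", "Algeria", "Albania", "Angola", "Albania"],
    name="Life expectancy",
)

# Keep only the names that occur more than once
counts = countries.value_counts()
duplicated = counts[counts > 1]
print(duplicated)  # Albania appears 3 times, so uniqueness fails
```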

Chapter 4. Disassembling & Assembling your data

Here, three DataFrames have been pre-loaded: g1800s, g1900s, and g2000s. These contain the Gapminder life expectancy data for, respectively, the 19th century, the 20th century, and the 21st century.

Your task in this exercise is to concatenate them into a single DataFrame called gapminder. This is a row-wise concatenation, similar to how you concatenated the monthly Uber datasets in Chapter 3.

In [104]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1800':'1899'])
g1800s = gapminder.loc[:, cols]
print(g1800s.columns)
Index(['Life expectancy', '1800', '1801', '1802', '1803', '1804', '1805',
       '1806', '1807', '1808',
       ...
       '1890', '1891', '1892', '1893', '1894', '1895', '1896', '1897', '1898',
       '1899'],
      dtype='object', length=101)
In [105]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'1900':'1999'])
g1900s = gapminder.loc[:, cols]
print(g1900s.columns)
Index(['Life expectancy', '1900', '1901', '1902', '1903', '1904', '1905',
       '1906', '1907', '1908',
       ...
       '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998',
       '1999'],
      dtype='object', length=101)
In [106]:
cols = ['Life expectancy'] + list(gapminder.loc[:,'2000':'2016'])
g2000s = gapminder.loc[:, cols]
print(g2000s.columns)
Index(['Life expectancy', '2000', '2001', '2002', '2003', '2004', '2005',
       '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014',
       '2015', '2016'],
      dtype='object')
In [107]:
# Concatenate the DataFrames row-wise
gapminder2 = pd.concat([g1800s, g1900s, g2000s])

# Print the shape of gapminder
print(gapminder2.shape)
(2340, 218)
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  
In [108]:
# Print the head of gapminder
print(gapminder2.head())
    1800   1801   1802   1803   1804   1805   1806   1807   1808   1809  \
0    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
1  28.21  28.20  28.19  28.18  28.17  28.16  28.15  28.14  28.13  28.12   
2    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN   
3  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40  35.40   
4  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82  28.82   

           ...            2008  2009  2010  2011  2012  2013  2014  2015  \
0          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4          ...             NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   

   2016        Life expectancy  
0   NaN               Abkhazia  
1   NaN            Afghanistan  
2   NaN  Akrotiri and Dhekelia  
3   NaN                Albania  
4   NaN                Algeria  

[5 rows x 218 columns]

All the Gapminder data, from 1800 to 2016, is now contained in one DataFrame.

Chapter 5. Reshaping your data

Now that you have all the data combined into a single DataFrame, the next step is to reshape it into a tidy data format.

Currently, the gapminder DataFrame has a separate column for each year. What you want instead is a single column that contains the year, and a single column that represents the average life expectancy for each year and country. By having year in its own column, you can use it as a predictor variable in a later analysis.

You can convert the DataFrame into the desired tidy format by melting it.

In [109]:
import pandas as pd

# Melt gapminder: gapminder_melt
gapminder_melt = pd.melt(gapminder, id_vars = 'Life expectancy')
In [110]:
# Rename the columns
gapminder_melt.columns = ['country', 'year', 'life_expectancy']
In [111]:
# Print the head of gapminder_melt
print(gapminder_melt.head())
                 country  year  life_expectancy
0               Abkhazia  1800              NaN
1            Afghanistan  1800            28.21
2  Akrotiri and Dhekelia  1800              NaN
3                Albania  1800            35.40
4                Algeria  1800            28.82

Chapter 6. Checking the data types

Now that your data are in the proper shape, you need to ensure that the columns are of the proper data type. That is, you need to ensure that country is of type object, year is of type int64, and life_expectancy is of type float64.

The tidy DataFrame has been pre-loaded as gapminder. Explore it in the IPython Shell using the .info() method. Notice that the column 'year' is of type object. This is incorrect, so you'll need to use the pd.to_numeric() function to convert it to a numeric data type.

NumPy and pandas have been pre-imported as np and pd.

In [114]:
# Convert the year column to numeric
gapminder = gapminder_melt  # continue with the tidy (melted) DataFrame
gapminder.year = pd.to_numeric(gapminder.year)
In [116]:
# Test if country is of type object
import numpy as np
assert gapminder.country.dtypes == object  # np.object is deprecated; plain object works
In [117]:
# Test if year is of type int64
assert gapminder.year.dtypes == np.int64
In [118]:
# Test if life_expectancy is of type float64
assert gapminder.life_expectancy.dtypes == np.float64

Excellent work! Since the assert statements did not throw any errors, you can be sure that your columns have the correct data types!
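Here every entry in 'year' was cleanly convertible. When a column also contains stray non-numeric strings, pd.to_numeric() raises by default; passing errors='coerce' turns the unparseable entries into NaN instead. A small sketch:

```python
import pandas as pd

# A year column with one dirty entry
years = pd.Series(["1800", "1801", "n/a", "1803"])

# errors='coerce' converts unparseable values to NaN instead of raising
numeric_years = pd.to_numeric(years, errors="coerce")
print(numeric_years.dtype)         # float64 (NaN forces a float dtype)
print(numeric_years.isna().sum())  # -> 1
```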

Chapter 7. Looking at country spellings

Having tidied your DataFrame and checked the data types, your next task in the data cleaning process is to look at the 'country' column to see if there are any special or invalid characters you may need to deal with.

It is reasonable to assume that country names will contain:

  • The set of lower and upper case letters.
  • Whitespace between words.
  • Periods for any abbreviations.

To confirm that this is the case, you can leverage the power of regular expressions again. For common operations like this, pandas has a built-in string method - str.contains() - which takes a regular expression pattern and applies it to each element of the Series, returning a Boolean Series that is True where there is a match and False otherwise.

Since here you want to find the values that do not match, you have to invert the boolean, which can be done using ~. This Boolean series can then be used to get the Series of countries that have invalid names.

In [119]:
# Create the series of countries: countries
countries = gapminder['country']

# Drop all the duplicates from countries
countries = countries.drop_duplicates()

# Write the regular expression: pattern
pattern = r'^[A-Za-z\.\s]*$'

# Create the Boolean vector: mask
mask = countries.str.contains(pattern)

# Invert the mask: mask_inverse
mask_inverse = ~mask

# Subset countries using mask_inverse: invalid_countries
invalid_countries = countries.loc[mask_inverse]

# Print invalid_countries
print(invalid_countries)
49            Congo, Dem. Rep.
50                 Congo, Rep.
53               Cote d'Ivoire
73      Falkland Is (Malvinas)
93               Guinea-Bissau
98            Hong Kong, China
118    United Korea (former)\n
131               Macao, China
132             Macedonia, FYR
145      Micronesia, Fed. Sts.
161            Ngorno-Karabakh
187             St. Barthélemy
193     St.-Pierre-et-Miquelon
225                Timor-Leste
251      Virgin Islands (U.S.)
252       North Yemen (former)
253       South Yemen (former)
258                      Åland
Name: country, dtype: object

As you can see, not all of these country names are actually invalid, so the assumptions may need to be tweaked a little. However, there certainly are a few cases worth further investigation, such as St. Barthélemy. Whenever you are dealing with columns of raw data consisting of strings, it is important to check them for consistency like this.

Chapter 8. More data cleaning and processing

It's now time to deal with the missing data. There are several strategies for this: you can drop them; fill them in using the mean of the column or row that the missing value is in (also known as imputation); or, if you are dealing with time series data, use a forward fill (replace a missing value with the most recent known value in the column) or a backward fill (replace it with the next known value). See pandas Foundations for more on forward fill and backward fill.

In general, it is not the best idea to drop missing values, because in doing so you may end up throwing away useful information. In this data, the missing values refer to years where no estimate for life expectancy is available for a given country. You could fill in, or guess what these life expectancies could be by looking at the average life expectancies for other countries in that year, for example. Whichever strategy you go with, it is important to carefully consider all options and understand how they will affect your data.
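The strategies above can be compared side by side on a tiny Series. A minimal sketch:

```python
import pandas as pd

# A short life-expectancy series with two gaps
s = pd.Series([28.2, None, 28.0, None, 27.5])

# Strategy 1: drop the missing values entirely
print(s.dropna().tolist())          # -> [28.2, 28.0, 27.5]

# Strategy 2: impute with the column mean (27.9 here)
print(s.fillna(s.mean()).tolist())

# Strategy 3: forward fill - carry the last known value forward
print(s.ffill().tolist())           # -> [28.2, 28.2, 28.0, 28.0, 27.5]
```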

In this exercise, you'll practice dropping missing values. Your job is to drop all the rows that have NaN in the life_expectancy column. Before doing so, it would be valuable to use assert statements to confirm that year and country do not have any missing values.

Begin by printing the shape of gapminder in the IPython Shell prior to dropping the missing values. Complete the exercise to find out what its shape will be after dropping the missing values!

In [120]:
# Assert that country does not contain any missing values
assert pd.notnull(gapminder.country).all()
In [121]:
# Assert that year does not contain any missing values
assert pd.notnull(gapminder.year).all()
In [122]:
# Drop the missing values
gapminder = gapminder.dropna(how='any')

# Print the shape of gapminder
print(gapminder.shape)
(43857, 3)

After dropping the missing values from 'life_expectancy', the number of rows in the DataFrame has gone down from 169260 to 43857. In general, you should avoid dropping too much of your data, but if there is no reasonable way to fill in or impute missing values, then dropping the missing data may be the best solution.
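As an alternative to dropping, the year-average imputation suggested earlier can be done with .groupby() and .transform(). A sketch on a toy DataFrame (an illustration of the idea, not what the exercise itself does):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["A", "B", "C", "A", "B", "C"],
    "year": [1800, 1800, 1800, 1801, 1801, 1801],
    "life_expectancy": [30.0, None, 34.0, 31.0, 33.0, None],
})

# For each row, compute the mean life expectancy of its year
# (NaN values are skipped by the group mean), then fill gaps with it
year_mean = df.groupby("year")["life_expectancy"].transform("mean")
df["life_expectancy"] = df["life_expectancy"].fillna(year_mean)
print(df["life_expectancy"].tolist())
# -> [30.0, 32.0, 34.0, 31.0, 33.0, 32.0]
```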

Chapter 9. Wrapping up

Now that you have a clean and tidy dataset, you can do a bit of visualization and aggregation. In this exercise, you'll begin by creating a histogram of the life_expectancy column. You should not get any values under 0 and you should see something reasonable on the higher end of the life_expectancy age range.

Your next task is to investigate how average life expectancy changed over the years. To do this, you need to subset the data by each year, get the life_expectancy column from each subset, and take an average of the values. You can achieve this using the .groupby() method. This .groupby() method is covered in greater depth in Manipulating DataFrames with pandas.

Finally, you can save your tidy and summarized DataFrame to a file using the .to_csv() method.

matplotlib.pyplot and pandas have been pre-imported as plt and pd. Go for it!

In [123]:
# Add first subplot
plt.subplot(2, 1, 1) 
Out[123]:
<matplotlib.axes._subplots.AxesSubplot at 0x1195bc780>
In [124]:
# Create a histogram of life_expectancy
gapminder.life_expectancy.plot(kind='hist')
Out[124]:
<matplotlib.axes._subplots.AxesSubplot at 0x11d0a2978>
In [125]:
# Group gapminder: gapminder_agg
gapminder_agg = gapminder.groupby('year')['life_expectancy'].mean()
In [126]:
# Print the head of gapminder_agg
print(gapminder_agg.head())
year
1800    31.486020
1801    31.448905
1802    31.463483
1803    31.377413
1804    31.446318
Name: life_expectancy, dtype: float64
In [127]:
# Print the tail of gapminder_agg
print(gapminder_agg.tail())
year
2012    71.663077
2013    71.916106
2014    72.088125
2015    72.321010
2016    72.556635
Name: life_expectancy, dtype: float64
In [128]:
# Add second subplot
plt.subplot(2, 1, 2)
Out[128]:
<matplotlib.axes._subplots.AxesSubplot at 0x11ab18a20>
In [129]:
# Create a line plot of life expectancy per year
gapminder_agg.plot()
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0x1086d77b8>
In [130]:
# Add title and specify axis labels
plt.title('Life expectancy over the years')
plt.ylabel('Life expectancy')
plt.xlabel('Year')
Out[130]:
Text(0.5, 0, 'Year')
In [131]:
# Display the plots
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
In [132]:
# Save both DataFrames to csv files
gapminder.to_csv('gapminder.csv')
gapminder_agg.to_csv('gapminder_agg.csv')

You've stepped through each stage of the data cleaning process and your data is now ready for serious analysis! Looking at the line plot, it seems like life expectancy has, as expected, increased over the years. There is a surprising dip around 1920 that may be worth further investigation!
