Chapter 2. Tidying data for analysis

Here, you'll learn about the principles of tidy data and, more importantly, why you should care about them and how they make subsequent data analysis more efficient. You'll gain first-hand experience with reshaping and tidying your data using techniques such as pivoting and melting.

Exercise 1. Recognizing tidy data

For data to be tidy, it must have:

  • Each variable as a separate column.
  • Each row as a separate observation.

As a data scientist, you'll encounter data that is represented in a variety of different ways, so it is important to be able to recognize tidy (or untidy) data when you see it.
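To make the two criteria concrete, here is a small hand-made example (the names and numbers are invented) of the same data stored both ways:

```python
import pandas as pd

# Untidy: the treatment variable is hidden in the column headers,
# so each row mixes several observations.
untidy = pd.DataFrame({
    'name': ['Ann', 'Bob'],
    'treatment_a': [16, 3],
    'treatment_b': [2, 11],
})

# Tidy: each variable (name, treatment, result) is its own column,
# and each row is exactly one observation.
tidy = pd.DataFrame({
    'name': ['Ann', 'Ann', 'Bob', 'Bob'],
    'treatment': ['a', 'b', 'a', 'b'],
    'result': [16, 2, 3, 11],
})
print(tidy)
```

Both frames hold the same six values; only the tidy one has a column per variable and a row per observation.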

Exercise 2. Reshaping your data using melt

Melting data is the process of turning columns of your data into rows of data. Consider the DataFrames from the previous exercise. In the tidy DataFrame, the variables Ozone, Solar.R, Wind, and Temp each had their own column. If, however, you wanted these variables to be in rows instead, you could melt the DataFrame. In doing so, however, you would make the data untidy! This is important to keep in mind: Depending on how your data is represented, you will have to reshape it differently (e.g., this could make it easier to plot values).

In this exercise, you will practice melting a DataFrame using pd.melt(). There are two parameters you should be aware of: id_vars and value_vars. The id_vars represent the columns of the data you do not want to melt (i.e., keep them in their current shape), while the value_vars represent the columns you do wish to melt into rows. By default, if no value_vars are provided, all columns not set in the id_vars will be melted. This could save a bit of typing, depending on the number of columns that need to be melted.
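The effect of value_vars is easiest to see on a toy frame (the values below are made up to mimic airquality):

```python
import pandas as pd

# Toy stand-in for airquality (made-up values).
df = pd.DataFrame({'Month': [5, 5], 'Day': [1, 2],
                   'Ozone': [41.0, 36.0], 'Wind': [7.4, 8.0]})

# Melt only 'Ozone'; 'Wind' disappears from the result because it
# is listed in neither id_vars nor value_vars.
melted = pd.melt(df, id_vars=['Month', 'Day'], value_vars=['Ozone'])
print(melted)
```

Omitting value_vars entirely would instead melt both Ozone and Wind, which is what the exercise below relies on.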

The (tidy) DataFrame airquality has been pre-loaded. Your job is to melt its Ozone, Solar.R, Wind, and Temp columns into rows. Later in this chapter, you'll learn how to bring this melted DataFrame back into a tidy form.

In [3]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/airquality.csv"
airquality = pd.read_csv(url)
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5
In [5]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])

# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN

Exercise 3. Customizing melted data

When melting DataFrames, it would be better to have column names more meaningful than variable and value (the default names used by pd.melt()).

The default names may work in certain situations, but it's best to always have data that is self-explanatory.

You can rename the variable column by specifying an argument to the var_name parameter, and the value column by specifying an argument to the value_name parameter. You will now practice doing exactly this. pandas (as pd) and the DataFrame airquality have been pre-loaded for you.

In [6]:
# Melt airquality: airquality_melt
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'], var_name='measurement', value_name='reading')


# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN

Exercise 4. Pivoting data

Pivoting data is the opposite of melting it. Remember the tidy form that the airquality DataFrame was in before you melted it? You'll now begin pivoting it back into that form using the .pivot_table() method!

While melting takes a set of columns and turns it into a single column, pivoting will create a new column for each unique value in a specified column.

.pivot_table() has an index parameter which you can use to specify the columns that you don't want pivoted: It is similar to the id_vars parameter of pd.melt(). Two other parameters that you have to specify are columns (the name of the column you want to pivot), and values (the values to be used when the column is pivoted). The melted DataFrame airquality_melt has been pre-loaded for you.

In [7]:
# Print the head of airquality_melt
print(airquality_melt.head())
   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN
In [8]:
# Pivot airquality_melt: airquality_pivot
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3

Exercise 5. Resetting the index of a DataFrame

After pivoting airquality_melt in the previous exercise, you didn't quite get back the original DataFrame.

What you got back instead was a pandas DataFrame with a hierarchical index (also known as a MultiIndex).

Hierarchical indexes are covered in depth in Manipulating DataFrames with pandas. In essence, they allow you to group columns or rows by another variable - in this case, by 'Month' as well as 'Day'.

There's a very simple method you can use to get back the original DataFrame from the pivoted DataFrame: .reset_index(). Dan didn't show you how to use this method in the video, but you're now going to practice using it in this exercise to get back the original DataFrame from airquality_pivot, which has been pre-loaded.
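As a minimal sketch (with made-up numbers) of what a ('Month', 'Day') MultiIndex looks like and how .reset_index() flattens it:

```python
import pandas as pd

# A small frame mimicking airquality_pivot (made-up values),
# indexed hierarchically by Month and Day.
df = pd.DataFrame({'Month': [5, 5, 6], 'Day': [1, 2, 1],
                   'Ozone': [41.0, 36.0, 28.0]})
pivoted = df.set_index(['Month', 'Day'])

# Rows are addressed by a (Month, Day) tuple...
print(pivoted.loc[(5, 2), 'Ozone'])   # 36.0

# ...and .reset_index() turns the index levels back into columns.
flat = pivoted.reset_index()
print(flat.columns.tolist())          # ['Month', 'Day', 'Ozone']
```

The same .reset_index() call is exactly what recovers the original flat shape from airquality_pivot below.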

In [9]:
# Print the index of airquality_pivot
print(airquality_pivot.index)
MultiIndex(levels=[[5, 6, 7, 8, 9], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
           labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
           names=['Month', 'Day'])
In [10]:
# Reset the index of airquality_pivot: airquality_pivot_reset
airquality_pivot_reset = airquality_pivot.reset_index()
In [11]:
# Print the new index of airquality_pivot_reset
print(airquality_pivot_reset.index)
RangeIndex(start=0, stop=153, step=1)
In [12]:
# Print the head of airquality_pivot_reset
print(airquality_pivot_reset.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3

Exercise 6. Pivoting duplicate values

So far, you've used the .pivot_table() method when there are multiple index values you want to hold constant during a pivot. In the video, Dan showed you how you can also use pivot tables to deal with duplicate values by providing an aggregation function through the aggfunc parameter. Here, you're going to combine both these uses of pivot tables.

Let's say your data collection method accidentally duplicated your dataset. Such a dataset, in which each row is duplicated, has been pre-loaded as airquality_dup. In addition, the airquality_melt DataFrame from the previous exercise has been pre-loaded. Explore their shapes in the IPython Shell by accessing their .shape attributes to confirm the duplicate rows present in airquality_dup.

You'll see that by using .pivot_table() and the aggfunc parameter, you can not only reshape your data, but also remove duplicates. Finally, you can then flatten the columns of the pivoted DataFrame using .reset_index().
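As a minimal sketch of the idea (with made-up readings), here is a frame in which every row appears twice, pivoted with an aggregation function:

```python
import pandas as pd

# Toy melted frame where each (Month, Day) row is duplicated.
dup = pd.DataFrame({'Month': [5, 5, 5, 5],
                    'Day': [1, 1, 2, 2],
                    'measurement': ['Ozone'] * 4,
                    'reading': [41.0, 41.0, 36.0, 36.0]})

# aggfunc='mean' averages the duplicated readings, so each
# (Month, Day) pair contributes a single row to the result.
pivoted = dup.pivot_table(index=['Month', 'Day'],
                          columns='measurement',
                          values='reading',
                          aggfunc='mean')
print(pivoted)
```

Because the duplicates are exact copies, averaging them reproduces the original readings while halving the row count.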

NumPy and pandas have been imported as np and pd respectively.

In [24]:
# DataFrame.append was removed in pandas 2.0; use pd.concat instead
airquality_dup = pd.concat([airquality_melt, airquality_melt], ignore_index=True)
airquality_dup
Out[24]:
Month Day measurement reading
0 5 1 Ozone 41.0
1 5 2 Ozone 36.0
2 5 3 Ozone 12.0
3 5 4 Ozone 18.0
4 5 5 Ozone NaN
5 5 6 Ozone 28.0
6 5 7 Ozone 23.0
7 5 8 Ozone 19.0
8 5 9 Ozone 8.0
9 5 10 Ozone NaN
10 5 11 Ozone 7.0
11 5 12 Ozone 16.0
12 5 13 Ozone 11.0
13 5 14 Ozone 14.0
14 5 15 Ozone 18.0
15 5 16 Ozone 14.0
16 5 17 Ozone 34.0
17 5 18 Ozone 6.0
18 5 19 Ozone 30.0
19 5 20 Ozone 11.0
20 5 21 Ozone 1.0
21 5 22 Ozone 11.0
22 5 23 Ozone 4.0
23 5 24 Ozone 32.0
24 5 25 Ozone NaN
25 5 26 Ozone NaN
26 5 27 Ozone NaN
27 5 28 Ozone 23.0
28 5 29 Ozone 45.0
29 5 30 Ozone 115.0
... ... ... ... ...
1194 9 1 Temp 91.0
1195 9 2 Temp 92.0
1196 9 3 Temp 93.0
1197 9 4 Temp 93.0
1198 9 5 Temp 87.0
1199 9 6 Temp 84.0
1200 9 7 Temp 80.0
1201 9 8 Temp 78.0
1202 9 9 Temp 75.0
1203 9 10 Temp 73.0
1204 9 11 Temp 81.0
1205 9 12 Temp 76.0
1206 9 13 Temp 77.0
1207 9 14 Temp 71.0
1208 9 15 Temp 71.0
1209 9 16 Temp 78.0
1210 9 17 Temp 67.0
1211 9 18 Temp 76.0
1212 9 19 Temp 68.0
1213 9 20 Temp 82.0
1214 9 21 Temp 64.0
1215 9 22 Temp 71.0
1216 9 23 Temp 81.0
1217 9 24 Temp 69.0
1218 9 25 Temp 63.0
1219 9 26 Temp 70.0
1220 9 27 Temp 77.0
1221 9 28 Temp 75.0
1222 9 29 Temp 76.0
1223 9 30 Temp 68.0

1224 rows × 4 columns

In [28]:
import numpy as np
# Pivot airquality_dup: airquality_pivot
airquality_pivot = airquality_dup.pivot_table(index=['Month', 'Day'], 
                                              columns='measurement', 
                                              values='reading', 
                                              aggfunc=np.mean)

print(airquality_pivot)
measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3
      6       28.0      NaN  66.0  14.9
      7       23.0    299.0  65.0   8.6
      8       19.0     99.0  59.0  13.8
      9        8.0     19.0  61.0  20.1
      10       NaN    194.0  69.0   8.6
      11       7.0      NaN  74.0   6.9
      12      16.0    256.0  69.0   9.7
      13      11.0    290.0  66.0   9.2
      14      14.0    274.0  68.0  10.9
      15      18.0     65.0  58.0  13.2
      16      14.0    334.0  64.0  11.5
      17      34.0    307.0  66.0  12.0
      18       6.0     78.0  57.0  18.4
      19      30.0    322.0  68.0  11.5
      20      11.0     44.0  62.0   9.7
      21       1.0      8.0  59.0   9.7
      22      11.0    320.0  73.0  16.6
      23       4.0     25.0  61.0   9.7
      24      32.0     92.0  61.0  12.0
      25       NaN     66.0  57.0  16.6
      26       NaN    266.0  58.0  14.9
      27       NaN      NaN  57.0   8.0
      28      23.0     13.0  67.0  12.0
      29      45.0    252.0  81.0  14.9
      30     115.0    223.0  79.0   5.7
...            ...      ...   ...   ...
9     1       96.0    167.0  91.0   6.9
      2       78.0    197.0  92.0   5.1
      3       73.0    183.0  93.0   2.8
      4       91.0    189.0  93.0   4.6
      5       47.0     95.0  87.0   7.4
      6       32.0     92.0  84.0  15.5
      7       20.0    252.0  80.0  10.9
      8       23.0    220.0  78.0  10.3
      9       21.0    230.0  75.0  10.9
      10      24.0    259.0  73.0   9.7
      11      44.0    236.0  81.0  14.9
      12      21.0    259.0  76.0  15.5
      13      28.0    238.0  77.0   6.3
      14       9.0     24.0  71.0  10.9
      15      13.0    112.0  71.0  11.5
      16      46.0    237.0  78.0   6.9
      17      18.0    224.0  67.0  13.8
      18      13.0     27.0  76.0  10.3
      19      24.0    238.0  68.0  10.3
      20      16.0    201.0  82.0   8.0
      21      13.0    238.0  64.0  12.6
      22      23.0     14.0  71.0   9.2
      23      36.0    139.0  81.0  10.3
      24       7.0     49.0  69.0  10.3
      25      14.0     20.0  63.0  16.6
      26      30.0    193.0  70.0   6.9
      27       NaN    145.0  77.0  13.2
      28      14.0    191.0  75.0  14.3
      29      18.0    131.0  76.0   8.0
      30      20.0    223.0  68.0  11.5

[153 rows x 4 columns]
In [29]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

# Print the head of airquality_pivot
print(airquality_pivot.head())
measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
In [30]:
# Print the head of airquality
print(airquality.head())
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5

Exercise 7. Splitting a column with .str

The dataset you saw in the video, consisting of case counts of tuberculosis by country, year, gender, and age group, has been pre-loaded into a DataFrame as tb.

In this exercise, you're going to tidy the 'm014' column, which represents males aged 0-14. In order to parse this value, you need to extract the first letter into a new column for gender, and the rest into a column for age_group. Here, since you can parse values by position, you can take advantage of pandas' vectorized string slicing by using the str attribute of columns of type object.

Begin by printing the columns of tb in the IPython Shell using its .columns attribute, and take note of the problematic column.

In [37]:
import pandas as pd
url = "https://assets.datacamp.com/production/course_2023/datasets/tb.csv"
tb = pd.read_csv(url)
print(tb.head())
  country  year  m014  m1524  m2534  m3544  m4554  m5564   m65  mu  f014  \
0      AD  2000   0.0    0.0    1.0    0.0    0.0    0.0   0.0 NaN   NaN   
1      AE  2000   2.0    4.0    4.0    6.0    5.0   12.0  10.0 NaN   3.0   
2      AF  2000  52.0  228.0  183.0  149.0  129.0   94.0  80.0 NaN  93.0   
3      AG  2000   0.0    0.0    0.0    0.0    0.0    0.0   1.0 NaN   1.0   
4      AL  2000   2.0   19.0   21.0   14.0   24.0   19.0  16.0 NaN   3.0   

   f1524  f2534  f3544  f4554  f5564   f65  fu  
0    NaN    NaN    NaN    NaN    NaN   NaN NaN  
1   16.0    1.0    3.0    0.0    0.0   4.0 NaN  
2  414.0  565.0  339.0  205.0   99.0  36.0 NaN  
3    1.0    1.0    0.0    0.0    0.0   0.0 NaN  
4   11.0   10.0    8.0    8.0    5.0  11.0 NaN  
In [38]:
# Melt tb: tb_melt
tb_melt = pd.melt(tb, id_vars=['country', 'year'])
print(tb_melt.head())
  country  year variable  value
0      AD  2000     m014    0.0
1      AE  2000     m014    2.0
2      AF  2000     m014   52.0
3      AG  2000     m014    0.0
4      AL  2000     m014    2.0
In [40]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]
print(tb_melt.head())
  country  year variable  value gender
0      AD  2000     m014    0.0      m
1      AE  2000     m014    2.0      m
2      AF  2000     m014   52.0      m
3      AG  2000     m014    0.0      m
4      AL  2000     m014    2.0      m
In [41]:
# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]
# Print the head of tb_melt
print(tb_melt.head())
  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014

Exercise 8. Splitting a column with .split() and .get()

Another common way multiple variables are stored in columns is with a delimiter. You'll learn how to deal with such cases in this exercise, using a dataset consisting of Ebola cases and death counts by state and country. It has been pre-loaded into a DataFrame as ebola.

Print the columns of ebola in the IPython Shell using ebola.columns. Notice that the data has column names such as Cases_Guinea and Deaths_Guinea. Here, the underscore serves as a delimiter between the first part (cases or deaths) and the second part (country).

This time, you cannot directly slice the variable by position as in the previous exercise. You now need to use Python's built-in string method .split(). By default, this method splits a string into parts separated by whitespace. However, in this case you want it to split on an underscore. You can do this on Cases_Guinea, for example, using 'Cases_Guinea'.split('_'), which returns the list ['Cases', 'Guinea'].

The next challenge is to extract the first element of this list and assign it to a type variable, and the second element of the list to a country variable. You can accomplish this by accessing the str attribute of the column and using the .get() method to retrieve the 0 or 1 index, depending on the part you want.

In [42]:
url = "https://assets.datacamp.com/production/course_2023/datasets/ebola.csv"
ebola = pd.read_csv(url)
print(ebola.head())
         Date  Day  Cases_Guinea  Cases_Liberia  Cases_SierraLeone  \
0    1/5/2015  289        2776.0            NaN            10030.0   
1    1/4/2015  288        2775.0            NaN             9780.0   
2    1/3/2015  287        2769.0         8166.0             9722.0   
3    1/2/2015  286           NaN         8157.0                NaN   
4  12/31/2014  284        2730.0         8115.0             9633.0   

   Cases_Nigeria  Cases_Senegal  Cases_UnitedStates  Cases_Spain  Cases_Mali  \
0            NaN            NaN                 NaN          NaN         NaN   
1            NaN            NaN                 NaN          NaN         NaN   
2            NaN            NaN                 NaN          NaN         NaN   
3            NaN            NaN                 NaN          NaN         NaN   
4            NaN            NaN                 NaN          NaN         NaN   

   Deaths_Guinea  Deaths_Liberia  Deaths_SierraLeone  Deaths_Nigeria  \
0         1786.0             NaN              2977.0             NaN   
1         1781.0             NaN              2943.0             NaN   
2         1767.0          3496.0              2915.0             NaN   
3            NaN          3496.0                 NaN             NaN   
4         1739.0          3471.0              2827.0             NaN   

   Deaths_Senegal  Deaths_UnitedStates  Deaths_Spain  Deaths_Mali  
0             NaN                  NaN           NaN          NaN  
1             NaN                  NaN           NaN          NaN  
2             NaN                  NaN           NaN          NaN  
3             NaN                  NaN           NaN          NaN  
4             NaN                  NaN           NaN          NaN  
In [43]:
# Melt ebola: ebola_melt
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
print(ebola_melt.head())
         Date  Day  type_country  counts
0    1/5/2015  289  Cases_Guinea  2776.0
1    1/4/2015  288  Cases_Guinea  2775.0
2    1/3/2015  287  Cases_Guinea  2769.0
3    1/2/2015  286  Cases_Guinea     NaN
4  12/31/2014  284  Cases_Guinea  2730.0
In [44]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]
In [45]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases
In [46]:
# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)
print(ebola_melt.head())
         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea

Integrating Jupyter Notebook with R


Lately I've been facing more and more situations where I need to teach. I had run classes with RStudio, which I'd used before, but from the students' point of view there were quite a few inconveniences.

RStudio is certainly good as a development environment, but while I was wondering whether there was a better tool, I heard that Jupyter Notebook, which Python data analysts mainly use, can be integrated with R, so I went straight into setting up the environment.

I did consult several sites, but the main reason I felt I should write this up is the StackOverflow thread below, where I found and fixed the root cause of my problem; I'm writing this in the hope that others can avoid the same experience.

rJava not loading in Jupyter Notebook with R kernel

For reference, my development environment is as follows.

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.2

Matrix products: default

BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base
1. Things to watch out for when setting up the environment

Warning!

If you already use R, do not download Anaconda just to use Jupyter Notebook.

  • People typically use Anaconda to get Jupyter Notebook. I, too, used Jupyter Notebook through Anaconda when I first learned Python.

  • But then Anaconda has to be reconnected to your existing R installation, which is not easy for beginners.

  • To install R inside Anaconda, you would enter the following command:

$ conda install -c r r-essentials

WHY?

  • If you only run simple statistical analyses, it really doesn't matter.

  • However, to use the rJava package, you have to reconfigure the Java path to suit Anaconda.

  • In other words, a package that installs without problems in your existing R and RStudio can error out in Anaconda's Jupyter Notebook.

  • The reason is that Anaconda's Jupyter Notebook has not been configured with the Java path.

  • Put another way, this part can be very difficult for system beginners.

  • In my case, too, I could not get rJava to connect to Anaconda's Jupyter Notebook, and in the end I deleted Anaconda entirely.

2. Installing Jupyter Notebook

For instructions on installing Jupyter, follow the Jupyter link above.


(1) Check your Python version

First, open a terminal and check the version:


$ python --version
Python 3.7.1

(2) Install according to your Python version

Jupyter is installed using pip, Python's package-installation tool.

$ python3 -m pip install --upgrade pip
$ python3 -m pip install jupyter

Once the installation completes, launch Jupyter Notebook:

$ jupyter notebook

3. Installing the IRkernel package

You can think of the IRkernel package as what allows R to be registered with Jupyter Notebook. Detailed installation instructions for each OS (Windows, macOS, Linux) are on the IRkernel homepage.

Since I use a MacBook, I followed the macOS installation steps below.

$ xcode-select --install
$ brew install zmq
# or upgrade
$ brew update
$ brew upgrade zmq

(1) For those who don't have brew

brew is a package manager for macOS. For detailed installation instructions, please refer to the Homebrew homepage.

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

(2) MacPorts

I mainly use brew, but I'm leaving these commands here for those who use MacPorts.

$ xcode-select --install
$ sudo port install zmq
$ export CPATH=/opt/local/include
$ export LIBRARY_PATH=/opt/local/lib

4. Installing the R packages

Warning!

  • Do not install these packages from the R app or RStudio.

  • The IRkernel homepage warns as follows:

    On OS X, be sure to execute this in R started from the Terminal, not the R App!
    (This is because the R app doesn’t honor $PATH changes in ~/.bash_profile)

  • Simply put, this means the R app does not have the path configuration these packages need.

  • Launch R from the Terminal:

$ R

If R starts and shows its usual startup banner, it has launched correctly.

Now install the R packages from that R session:

> install.packages(c('repr', 'IRdisplay', 'IRkernel'), type = 'source')

5. Registering the R kernel with Jupyter

This is the last step. Enter the command below; you should see output like the following. Once it completes successfully, use the q() function to return to the terminal.

> IRkernel::installspec()
[InstallKernelSpec] Removing existing kernelspec in /Users/jihoonjung/Library/Jupyter/kernels/ir
[InstallKernelSpec] Installed kernelspec ir in /Users/jihoonjung/Library/Jupyter/kernels/ir
> q()

6. Checking the R kernel in Jupyter Notebook

Check whether the R kernel was actually registered:

$ jupyter notebook

The R kernel appeared as expected, and rJava was also loaded without any problems.



[Want to try deep learning in R?] https://cozydatascientist.tistory.com/77


+ Recent posts