# ------------------
# This chapter is a hands-on walkthrough of Keras in R.
# It does not cover deep learning theory in detail.
# Key concepts will be covered when a dedicated deep learning course opens later.
# ------------------


Step 1: Install the package

Installing the package in R proceeds as shown below.

If Python is not installed, the installer sets it up for you automatically, so there is no need to worry.

 

# Unlike H2O, installing this package is easy.
devtools::install_github("rstudio/keras")
library(keras)

Unlike most other R packages, keras has one more setup function to run.

# Install TensorFlow: keras uses tensorflow as its backend.
install_keras()

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



==> WARNING: A newer version of conda exists. <==
  current version: 4.7.12
  latest version: 4.8.2

Please update conda by running

    $ conda update -n base -c defaults conda


Requirement already up-to-date: tensorflow==2.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (2.0.0)
Requirement already up-to-date: keras in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (2.3.1)
Requirement already up-to-date: tensorflow-hub in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (0.7.0)
Requirement already up-to-date: h5py in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (2.10.0)
Requirement already up-to-date: pyyaml in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (5.3)
Requirement already up-to-date: requests in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (2.23.0)
Requirement already up-to-date: Pillow in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (7.0.0)
Requirement already up-to-date: scipy in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (1.4.1)
Requirement already satisfied, skipping upgrade: wheel>=0.26 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (0.34.2)
Requirement already satisfied, skipping upgrade: gast==0.2.2 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (0.2.2)
Requirement already satisfied, skipping upgrade: keras-preprocessing>=1.0.5 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.1.0)
Requirement already satisfied, skipping upgrade: opt-einsum>=2.3.2 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (3.2.0)
Requirement already satisfied, skipping upgrade: termcolor>=1.1.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.1.0)
Requirement already satisfied, skipping upgrade: google-pasta>=0.1.6 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (0.1.8)
Requirement already satisfied, skipping upgrade: tensorboard<2.1.0,>=2.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (2.0.2)
Requirement already satisfied, skipping upgrade: tensorflow-estimator<2.1.0,>=2.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (2.0.1)
Requirement already satisfied, skipping upgrade: wrapt>=1.11.1 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.12.1)
Requirement already satisfied, skipping upgrade: grpcio>=1.8.6 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.27.2)
Requirement already satisfied, skipping upgrade: keras-applications>=1.0.8 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.0.8)
Requirement already satisfied, skipping upgrade: numpy<2.0,>=1.16.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.18.1)
Requirement already satisfied, skipping upgrade: six>=1.10.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (1.14.0)
Requirement already satisfied, skipping upgrade: astor>=0.6.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (0.8.1)
Requirement already satisfied, skipping upgrade: absl-py>=0.7.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (0.9.0)
Requirement already satisfied, skipping upgrade: protobuf>=3.6.1 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorflow==2.0.0) (3.11.3)
Requirement already satisfied, skipping upgrade: chardet<4,>=3.0.2 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from requests) (3.0.4)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from requests) (2.9)
Requirement already satisfied, skipping upgrade: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from requests) (1.25.8)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from requests) (2019.11.28)
Requirement already satisfied, skipping upgrade: setuptools>=41.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (45.2.0.post20200210)
Requirement already satisfied, skipping upgrade: markdown>=2.6.8 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (3.2.1)
Requirement already satisfied, skipping upgrade: werkzeug>=0.11.15 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (1.0.0)
Requirement already satisfied, skipping upgrade: google-auth-oauthlib<0.5,>=0.4.1 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (0.4.1)
Requirement already satisfied, skipping upgrade: google-auth<2,>=1.6.3 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (1.11.2)
Requirement already satisfied, skipping upgrade: requests-oauthlib>=0.7.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (1.3.0)
Requirement already satisfied, skipping upgrade: rsa<4.1,>=3.1.4 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (4.0)
Requirement already satisfied, skipping upgrade: pyasn1-modules>=0.2.1 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (0.2.8)
Requirement already satisfied, skipping upgrade: cachetools<5.0,>=2.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (4.0.0)
Requirement already satisfied, skipping upgrade: oauthlib>=3.0.0 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (3.1.0)
Requirement already satisfied, skipping upgrade: pyasn1>=0.1.3 in /Users/jihoonjung/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages (from rsa<4.1,>=3.1.4->google-auth<2,>=1.6.3->tensorboard<2.1.0,>=2.0.0->tensorflow==2.0.0) (0.4.8)

Installation complete.

Once you see output like the above, keras is ready to use in R.
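
If you need a specific backend version or install method, install_keras() accepts arguments for both. A minimal sketch, assuming you want to pin TensorFlow 2.0.0 and use conda (see ?install_keras for the full set of options):

# Optional: pin the TensorFlow version and force a conda-based install
install_keras(method = "conda", tensorflow = "2.0.0")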


Step 2: Load the data

We will keep things simple and use the iris data. The reason: R is a language optimized for data frames, and most users are comfortable with them. Since you are probably curious how to feed a data frame into a deep learning model, I deliberately chose an easy dataset.

 

Of course, image classification and similar tasks are also possible in R.

#### Step 2: Load the data ####
iris <- iris

Step 3: Draw a simple correlation plot

Strictly speaking, this step is optional. Still, I recommend that analysts always carry out EDA (exploratory data analysis), visualizing the data to check relationships between variables.

#### Step 3: Draw a correlation plot (optional) #### 
library(corrplot)

# Compute correlation coefficients for the numeric columns
M <- cor(iris[,1:4])

# Draw the correlation plot
corrplot(M, method="circle")

 

If your plot looks like this, you are all set.


Step 4: Data preprocessing and transformation

 

This is the key step. Although we are building a classification model, deep learning operates on matrices, so the data frame must be converted into matrix form. We also normalize the data. This chapter does not explain in detail why these steps are necessary.

 

A summary of deep learning theory will be posted later.

# (1) Convert the data frame to a matrix -------------------
# Categorical values must be recoded as numbers.
# Deep learning takes its data as a matrix, not a data frame.
iris[,5] <- as.numeric(as.factor(unlist(iris[,5]))) - 1
iris <- as.matrix(iris)
dimnames(iris) <- NULL

# (2) Normalize the data -------------------
# The keras package provides a normalize() function.
iris_x <- normalize(iris[,1:4])

# Re-attach the (untouched) 5th column to the normalized features.
iris_mat <- cbind(iris_x, iris[,5])
head(iris_mat)

# The output looks like this:
          [,1]      [,2]      [,3]       [,4] [,5]
[1,] 0.8037728 0.5516088 0.2206435 0.03152050    0
[2,] 0.8281329 0.5070201 0.2366094 0.03380134    0
[3,] 0.8053331 0.5483119 0.2227517 0.03426949    0
[4,] 0.8000302 0.5391508 0.2608794 0.03478392    0
[5,] 0.7909650 0.5694948 0.2214702 0.03163860    0
[6,] 0.7841750 0.5663486 0.2468699 0.05808704    0

Step 5: Split the dataset for the deep learning model

# Index for splitting the dataset
ind <- sample(2, nrow(iris_mat), replace=TRUE, prob=c(0.67, 0.33))

# Model design matrix (features)
iris.training <- iris_mat[ind==1, 1:4]
iris.test <- iris_mat[ind==2, 1:4]

# Model targets (response variable)
iris.trainingtarget <- iris_mat[ind==1, 5]
iris.testtarget <- iris_mat[ind==2, 5]

# One-hot encoding: training targets
iris.trainLabels <- to_categorical(iris.trainingtarget)

# One-hot encoding: test targets
iris.testLabels <- to_categorical(iris.testtarget)
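
As a quick check, to_categorical() turns each class index into a one-hot row:

# Illustrative only: class indices 0, 1, 2 become one-hot rows
to_categorical(c(0, 1, 2))
#      [,1] [,2] [,3]
# [1,]    1    0    0
# [2,]    0    1    0
# [3,]    0    0    1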

Step 6: Build the deep learning model

To fit a deep learning model appropriate to the data, you have to decide how many layers the network will have, how many nodes per layer, which activation function to use, and how to set hyperparameters such as the learning rate. All of these combinations affect the model's performance. In that respect, Keras is a great help because it lets the developer concentrate on the model itself.

 

At first, we honestly don't know the answers.

 

Let's just build one and see.

set.seed(777)  # seeds R's RNG (note: TensorFlow keeps its own, separate seed)
# (1) Initialize the model
model <- keras_model_sequential()

# First, build the network layers as below.
# There are 3 output values (versicolor, virginica, setosa).
# We use 8 hidden nodes and input_shape = c(4),
# because the data has 4 feature columns.

model %>% 
  layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 
  layer_dense(units = 3, activation = 'softmax')
2020-03-12 22:46:36.235402: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-12 22:46:36.297796: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fa12eb3fe50 executing computations on platform Host. Devices:
2020-03-12 22:46:36.297819: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version

# Model summary
summary(model)
Model: "sequential"
_______________________________________________________________________________
Layer (type)                       Output Shape                    Param #     
===============================================================================
dense (Dense)                      (None, 8)                       40          
_______________________________________________________________________________
dense_1 (Dense)                    (None, 3)                       27          
===============================================================================
Total params: 67
Trainable params: 67
Non-trainable params: 0
_______________________________________________________________________________

# Model configuration
get_config(model)
{'name': 'sequential', 'layers': ...} # check the full output directly in your console

# Layer configuration
get_layer(model, index = 1)
<tensorflow.python.keras.layers.core.Dense>

# List the model's layers
model$layers
[[1]]
<tensorflow.python.keras.layers.core.Dense>

[[2]]
<tensorflow.python.keras.layers.core.Dense>

# List the input tensors
model$inputs
[[1]]
Tensor("dense_input:0", shape=(None, 4), dtype=float32)

# List the output tensors
model$outputs
[[1]]
Tensor("dense_1/Identity:0", shape=(None, 3), dtype=float32)

Step 7: Compile the deep learning model

  • categorical_crossentropy is for multi-class classification. 
  • For binary classification, use 'binary_crossentropy' instead.
  • The most widely used optimizers are SGD (Stochastic Gradient Descent), Adam, and RMSprop. 
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
)
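
For reference, a binary classification variant would pair a single sigmoid output unit with binary_crossentropy. This is a sketch only; it is not used in this tutorial, and binary_model is a hypothetical name:

# Hypothetical two-class setup: one sigmoid unit + binary_crossentropy
binary_model <- keras_model_sequential() %>% 
  layer_dense(units = 8, activation = 'relu', input_shape = c(4)) %>% 
  layer_dense(units = 1, activation = 'sigmoid')

binary_model %>% compile(
  loss = 'binary_crossentropy',
  optimizer = 'adam',
  metrics = 'accuracy'
)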

Step 8: Fit the deep learning model

At last, the part we have been working toward. It takes a little while, and you will get to watch a nice animated training plot as it runs. I found it quite fascinating the first time.

model %>% fit(
  iris.training, 
  iris.trainLabels, 
  epochs = 500, 
  batch_size = 5,
  validation_split = 0.1
)
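
If you assign the result of fit() to an object, you can redraw the training curves at any time. A sketch (note that running it trains the model further):

# Keep the training history so the curves can be re-plotted later
history <- model %>% fit(
  iris.training, 
  iris.trainLabels, 
  epochs = 500, 
  batch_size = 5,
  validation_split = 0.1
)
plot(history)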

For the curious: training takes about 3 minutes 45 seconds. If the tiny iris dataset takes that long, larger datasets would hardly be feasible on a local machine. (For reference, I am working on a MacBook Air!)


Step 9: Test the model's predictions

# Generate predictions from the test data
classes <- model %>% predict_classes(iris.test, batch_size = 128)

# Confusion matrix
table(iris.testtarget, classes)

# Result table
               classes
iris.testtarget  0  1  2
              0 19  0  0
              1  0 13  2
              2  0  1 13

Step 10: Evaluate the model

# Evaluate the model on the test data.
score <- model %>% evaluate(iris.test, iris.testLabels, batch_size = 128)

# The output includes both loss and accuracy.
print(score)

$loss
[1] 0.1443771

$accuracy
[1] 0.9375

 

For more details, see the keras homepage: https://keras.io/

 



A bit of advertising from here on.

 

- If you need an instructor, feel free to contact me anytime. (Available from June 2020)

- Online course production is also possible.

- Looking for co-authors for a book focused on Kaggle data visualization.

- Unfortunately, I do not offer private tutoring!

 

Contact: j2hoon85@gmail.com

 

--- The curriculum is as follows. ---

* Source code and lecture slides (PPT) are fully prepared

Day 1: Data preprocessing and visualization

- Installing R & RStudio and an overview of the CRAN ecosystem

- Data preprocessing with dplyr

- Data visualization: the ggplot2 package & other interactive visualization packages

* Curricula on web crawling, MongoDB integration, and shiny dashboard integration are also in the works.

 

Day 2: Basic statistics and regression ML models

- Statistical analysis: descriptive statistics, cross-tabulation, between-group comparisons, regression analysis, and report-writing tips

- Regression modeling: simple and multiple linear regression using the caret package

* The statistics content will be updated continuously.

 

Day 3: Classification models and an introduction to the h2o package

- Handling outliers and missing values

- Binary and multi-class classification

- Model development using the caret & h2o packages

 

Day 4: Improving model performance and deep learning examples

- Introduction to statistical preProcessing procedures

- Parameter tuning procedure and sample walkthroughs

- Introduction to deep learning and a keras and tensorflow 2.0 sample tutorial

* A module on deploying machine learning models from shiny (using a GCP server) is also in preparation for the first half of this year.

 

chapter2_Plotting_Points

All the contents are from DataCamp


In chapter 2 students will build on the leaflet map they created in chapter 1 to create an interactive web map of every four year college in California. After plotting hundreds of points on an interactive leaflet map, students will learn to customize the markers on their leaflet map. This chapter will also show how to color-code markers based on a factor variable.

Chapter 1. Cleaning up the Base Map

If you are storing leaflet maps in objects, there will come a time when you need to remove markers or reset the view. You can accomplish these tasks with the following functions.

clearMarkers() - Remove one or more features from a map
clearBounds() - Clear bounds and automatically determine bounds based on map elements

To remove the markers and to reset the bounds of our m map we would:

{r}
m <- m  %>% 
        addMarkers(lng = dc_hq$lon, lat = dc_hq$lat) %>% 
        setView(lat = 50.9, lng = 4.7, zoom = 5)

m  %>% 
    clearMarkers() %>% 
    clearBounds()

The leaflet map of DataCamp's headquarters has been printed for you.

{r}
# Store leaflet hq map in an object called map
# Plot DataCamp's NYC HQ
pkgs <- c("tidyverse", "leaflet", "htmlwidgets", "webshot")
sapply(pkgs, require, character.only = TRUE)

dc_hq <- data.frame(hq = c("DataCamp - NYC", "DataCamp - Belgium"),
                    lon = c(-74.0, 4.72),
                    lat = c(40.7, 50.9))

map <- leaflet() %>% 
      addProviderTiles("CartoDB") %>%
      # Use dc_hq to add the hq column as popups
      addMarkers(lng = dc_hq$lon, lat = dc_hq$lat,
                 popup = dc_hq$hq)

# Center the view of map on the Belgium HQ with a zoom of 5
map_zoom <- map %>% setView(lat = 50.881363, lng = 4.717863, zoom = 5)

{r}
# Remove markers, reset bounds, and store the updated map in the m object
map_clear <- map %>%
        clearMarkers() %>% 
        clearBounds()

# Print the cleared map
map_clear

Chapter 2. Exploring the IPEDS Data

In Chapters 2 and 3, we will be using a subset of the IPEDS data that focuses on public, private, and for-profit four-year institutions. The United States also has many institutions that are classified as two-year colleges or vocational institutions, which are not included in this course. Our subset has five variables on 3,146 colleges.

The sector_label column in the ipeds data frame indicates whether a college is public, private, or for-profit. In the console, use the group_by() and the count() functions from the dplyr package to determine which sector of college is most common.

The tidyverse package, which includes dplyr, has been loaded for you. In your workspace, you also have access to the ipeds dataframe.

Which sector of college is most common in the IPEDS data?

The data comes from Tableau Public; you can download it directly there.

In [63]:
# data cleansing function
# this code I built
data_cleansing <- function(data) {
  library(dplyr)
  
  data <- data %>% select(Name, 'Longitude location of institution', 'Latitude location of institution', 'State abbreviation', 'Sector of institution')
  
  names(data) <- c("name", "lng", "lat", "state", "sector_label")
  
  data$sector_label[grepl('Private', data$sector_label)] <- 'private'
  data$sector_label[grepl('Public', data$sector_label)] <- 'public'
  
  return(data)
}
In [11]:
library(rio)
# step 1. data import
data <- import("data/IPEDS_data.xlsx")

# step 2. data cleansing
ipeds <- data_cleansing(data = data)
glimpse(ipeds)
Observations: 1,534
Variables: 5
$ name         <chr> "Alabama A & M University", "University of Alabama at Bi…
$ lng          <dbl> -86.56850, -86.80917, -86.17401, -86.63842, -86.29568, -…
$ lat          <dbl> 34.78337, 33.50223, 32.36261, 34.72282, 32.36432, 33.214…
$ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
$ sector_label <chr> "public", "public", "private", "public", "public", "publ…

Chapter 3. Exploring the IPEDS Data II

Most analyses require data wrangling. Luckily, there are many functions in the tidyverse that facilitate data frame cleaning. For example, the drop_na() function will remove observations with missing values. By default, drop_na() will check all columns for missing values and will remove all observations with one or more missing values.

{r}
miss_ex <- tibble(
             animal = c("dog", "cat", "rat", NA),
             name   = c("Woodruf", "Stryker", NA, "Morris"),
             age    = c(1:4))
miss_ex

miss_ex %>% 
     drop_na() %>% 
     arrange(desc(age))

# A tibble: 2 x 3
  animal    name   age
   <chr>   <chr> <dbl>
1    cat Stryker     2
2    dog Woodruf     1
In [14]:
# Remove colleges with missing sector information
library(tidyverse)
ipeds2 <- ipeds %>% drop_na()
glimpse(ipeds2)
Observations: 1,534
Variables: 5
$ name         <chr> "Alabama A & M University", "University of Alabama at Bi…
$ lng          <dbl> -86.56850, -86.80917, -86.17401, -86.63842, -86.29568, -…
$ lat          <dbl> 34.78337, 33.50223, 32.36261, 34.72282, 32.36432, 33.214…
$ state        <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
$ sector_label <chr> "public", "public", "private", "public", "public", "publ…
In [18]:
# Count the number of four-year colleges in each state
ipeds2 %>% group_by(state) %>% count() %>% head(6)
state          n
Alabama       28
Alaska         4
Arizona        8
Arkansas      20
California    93
Colorado      20
In [23]:
# Create a list of US States in descending order by the number of colleges in each state
ipeds2 %>% 
    group_by(state) %>% 
    count() %>% 
    arrange(desc(n)) %>% 
    head(6)
state            n
New York       122
Pennsylvania   114
California      93
Texas           70
Ohio            60
Massachusetts   59

4. California Colleges

Now it is your turn to map all of the colleges in a state. In this exercise, we'll apply our example of mapping Maine's colleges to California's colleges. The first step is to set up your data by filtering the ipeds data frame to include only colleges in California. For reference, you will find how we accomplished this with the colleges in Maine below.

{r}
maine_colleges <- 
    ipeds %>% 
        filter(state == "ME")

maine_colleges

# A tibble: 21 x 5
                     name       lng      lat state sector_label
                    <chr>     <dbl>    <dbl> <chr>        <chr>
1           Bates College -70.20333 44.10530    ME      Private
2         Bowdoin College -69.96524 43.90690    ME      Private
In [28]:
## Create Dataframe called 'ca' with data on only colleges in California
ca <- ipeds2 %>% filter(state == "California")
glimpse(ca)
Observations: 93
Variables: 5
$ name         <chr> "Azusa Pacific University", "Biola University", "Califor…
$ lng          <dbl> -117.8880, -118.0173, -122.4165, -117.4259, -118.1257, -…
$ lat          <dbl> 34.13087, 33.90482, 37.77477, 33.92857, 34.13927, 34.225…
$ state        <chr> "California", "California", "California", "California", …
$ sector_label <chr> "private", "private", "private", "private", "private", "…
In [40]:
# Use `addMarkers` to plot all of the colleges in `ca` on the `m` leaflet map
library(leaflet)
map <- leaflet() %>% addProviderTiles("CartoDB")
map %>% 
    addMarkers(lng = ca$lng, lat = ca$lat)

Chapter 5. The City of Colleges

Based on our map of California colleges it appears that there is a cluster of colleges in and around the City of Angels (i.e., Los Angeles). Let's take a closer look at these institutions on our leaflet map.

The coordinates for the center of LA are provided for you in the la_coords data frame.

{r}
la_coords <- data.frame(lat = 34.05223, lon = -118.2437)

Once you create a map focused on LA, try panning and zooming the map. Can you find the cluster of colleges East of LA known as the Claremont Colleges?

When there are hundreds of markers, do you find the pin markers helpful or do they get in your way?

The coordinates of LA have been provided in the la_coords data frame and the ca data frame of California colleges and the map have been loaded for you.

{r}
la_coords <- data.frame(lat = 34.05223, lon = -118.2437) 

# Center the map on LA 
map %>% 
   addMarkers(data = ca) %>% 
   setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 12)
{r}
# Set the zoom level to 8 and store in the m object
map_zoom <-
    map %>%
    addMarkers(data = ca) %>%
     setView(lat = la_coords$lat, lng = la_coords$lon, zoom = 8)

map_zoom

Chapter 6. Circle Markers

Circle markers are notably different from pin markers:

- We can control their size
- They do not "stand up" on the map
- We can more easily change their color

There are many ways to customize circle markers and the design of your leaflet map. To get started we will focus on the following arguments.

{r}
addCircleMarkers(map, lng = NULL, lat = NULL, 
                 radius = 10, color = "#03F", popup = NULL)

The first argument map takes a leaflet object, which we will pipe directly into addCircleMarkers(). lng and lat are the coordinates we are mapping. The other arguments can customize the appearance and information presented by each marker.

The ca data frame and the leaflet object map have been loaded for you.

{r}
# Clear the markers from the map 
map2 <- map %>% 
            clearMarkers()
{r}
# Use addCircleMarkers() to plot each college as a circle
map2 %>%
    addCircleMarkers(lng = ca$lng, lat = ca$lat)
{r}
# Change the radius of each circle to be 2 pixels and the color to red
map2 %>% 
    addCircleMarkers(lng = ca$lng, lat = ca$lat,
                     radius = 2, color = "red")

7. Making our Map Pop

Similar to building a plot with ggplot2 or manipulating data with dplyr, your map needs to be stored in an object if you want to reference it later in your code.

Speaking of dplyr, the %>% operator can pipe data into the function chain that creates a leaflet map.

{r}
ipeds %>% 
    leaflet()  %>% 
        addTiles() %>% 
        addCircleMarkers(popup = ~name, color = "#FF0000")

Piping makes our code more readable and allows us to refer to variables using the ~ operator rather than repeatedly specifying the data frame.

The color argument in addCircleMarkers() takes the name of a color or a hex code. For example, red or #FF0000.

map has been printed for you. Notice the circle markers are gone!

{r}
# Add circle markers with popups for college names
map %>% 
    addCircleMarkers(data = ca, radius = 2, popup = ~name)
{r}
# Change circle color to #2cb42c and store map in map_color object
map_color <- map %>% 
    addCircleMarkers(data = ca, radius = 2, color = "#2cb42c", popup = ~name)

# Print map_color
map_color

Chapter 8. Building a Better Pop-up

With the paste0() function and a few html tags, we can customize our popups. paste0() converts its arguments to characters and combines them into a single string without separating the arguments.
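
For example:

{r}
# paste0() concatenates with no separator
paste0("Bates College", " - ", "Private")   # returns "Bates College - Private"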

{r}
addCircleMarkers(popup = ~paste0(name,
                                 "<br/>",
                                 sector_label))

We can use the <br/> tag to create a line break so that each element appears on a separate line.

To distinguish different data elements, we can make the name of each college italic by wrapping the name variable in <i> and </i> tags:

{r}
addCircleMarkers(popup = ~paste0("<i>",
                                 name,
                                 "</i>", 
                                 "<br/>", 
                                 sector_label))
In [51]:
# Clear the bounds and markers on the map object and store in map2
map2 <- map %>% 
        clearMarkers() %>% 
        clearBounds()
{r}
# Add circle markers with popups that display both the institution name and sector
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0(name, "<br/>", sector_label))
{r}
# Make the institution name in each popup bold
map2 %>% 
    addCircleMarkers(data = ca, radius = 2, 
                     popup = ~paste0("<b>", name, "</b>", "<br/>", sector_label))

9. Swapping Popups for Labels

Popups are great, but they require a little extra effort. That is when labels come to our aid. Using the label argument in the addCircleMarkers() function we can get more information about one of our markers with a simple hover!

{r}
ipeds %>% 
    leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    addCircleMarkers(label = ~name, radius = 2)

Labels are especially helpful when mapping more than a few locations as they provide quick access to detail about what each marker represents.

{r}
# Add circle markers with labels identifying the name of each college
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~name)
In [55]:
# Use paste0 to add sector information to the label inside parentheses 
map %>% 
    addCircleMarkers(data = ca, radius = 2, label = ~paste0(name, " (", sector_label, ")"))
Assuming "lng" and "lat" are longitude and latitude, respectively

Chapter 10. Creating a Color Palette using colorFactor

So far we have only used color to customize the style of our map. With colorFactor() we can create a color palette that maps colors to the levels of a factor variable.

{r}
pal <- 
   colorFactor(palette = c("blue", "red", "green"), 
               levels = c("Public", "Private", "For-Profit"))

m %>% 
    addCircleMarkers(color = ~pal(sector_label))

Why might we not want to use this particular color palette?

If you are interested in using a continuous variable to color a map see colorNumeric().

{r}
pal <- colorNumeric(palette = "RdBu", domain = c(25:50))

ipeds %>% 
    leaflet() %>% 
        addProviderTiles("CartoDB")  %>% 
        addCircleMarkers(radius = 2, color = ~pal(lat))
{r}

# Make a color palette called pal for the values of `sector_label` using `colorFactor()`  
# (the DataCamp exercise uses "red", "blue", and "#9b4a11" for "Public", "Private",
# and "For-Profit"; the cleansed data here only has "public" and "private")
pal <- colorFactor(palette = c("red", "blue"), 
                   levels = c("public", "private"))

# Add circle markers that color colleges using pal() and the values of sector_label
map2 <- 
    map %>% 
        addCircleMarkers(data = ca, radius = 2, 
                         color = ~pal(sector_label), 
                         label = ~paste0(name, " (", sector_label, ")"))

# Print map2
map2

Chapter 11. A Legendary Map

Adding information to our map using color is great, but it is only helpful if we remember what the colors represent. With addLegend() we can add a legend to remind us.

There are several arguments we can use to customize the legend to our liking, including opacity, title, and position. To create a legend for our colorNumeric() example, we would do the following.

{r}
pal <- colorNumeric(palette = "RdBu", domain = c(25:50))

ipeds %>% 
    leaflet() %>% 
        addProviderTiles("CartoDB")  %>% 
        addCircleMarkers(radius = 2,
                         color = ~pal(lat)) %>% 
         addLegend(pal = pal,
                   values = c(25:50),
                   opacity = 0.75,
                   title = "Latitude",
                   position = "topleft")
{r}
# Make a color palette called pal for the values of `sector_label` using `colorFactor()`  
# (the DataCamp exercise uses "red", "blue", and "#9b4a11" for "Public", "Private",
# and "For-Profit"; the cleansed data here only has "public" and "private")
pal <- colorFactor(palette = c("red", "blue"), 
                   levels = c("public", "private"))

# Customize the legend
map2 %>% 
    addLegend(pal = pal, 
              values = c("public", "private"),
              # opacity of .5, title of Sector, and position of topright
              opacity = 0.5, title = "Sector", position = "topright")

The final output code follows.
In [96]:
# Store leaflet hq map in an object called map
# Plot DataCamp's NYC HQ
pkgs <- c("tidyverse", "leaflet", "htmlwidgets", "webshot", "rio")
sapply(pkgs, require, character.only = TRUE)

# step 1. data import
data <- import("data/IPEDS_data.xlsx")

# step 2. data cleansing
ipeds <- data_cleansing(data = data)

# step 3. visualization
pal <- colorFactor(palette = c("red", "blue"), 
                   levels = c("public", "private"))
map_circle <- ipeds %>% 
    leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addCircleMarkers(radius = 2, 
                     color = ~pal(sector_label), 
                     label = ~paste0(name, " (", sector_label, ")")) %>% 
    addLegend(pal = pal, 
              values = c("public", "private"),
              # opacity of .5, title of Sector, and position of topright
              opacity = 0.5, title = "Sector", position = "topright")

# saving leaflet
## create .html and .png
## save html to png
saveWidget(map_circle, "chapter2_map_circle.html", selfcontained = FALSE)
webshot("chapter2_map_circle.html", file = "chapter2_map_circle.png",
        cliprect = "viewport")
  tidyverse     leaflet htmlwidgets     webshot         rio 
       TRUE        TRUE        TRUE        TRUE        TRUE 
Assuming "lng" and "lat" are longitude and latitude, respectively



Chapter1_Setting_Up_Interactive_Web_Maps

Chapter 1 will introduce students to the htmlwidgets package and the leaflet package. Following this introduction, students will build their first interactive web map using leaflet. Through the process of creating this first map students will be introduced to many of the core features of the leaflet package, including adding different map tiles, setting the center point and zoom level, plotting single points based on latitude and longitude coordinates, and storing leaflet maps as objects. Chapter 1 will conclude with students geocoding DataCamp’s headquarters, and creating a leaflet map that plots the headquarters and displays a popup describing the location.

1. Creating an Interactive Web Map

Similar to the packages in the tidyverse, the leaflet package makes use of the pipe operator (i.e., %>%) from the magrittr package to chain function calls together. This means we can pipe the result of one function into another without having to store the intermediate output in an object. For example, one way to find every car in the mtcars data set with a mpg >= 25 is to pipe the data through a series of functions.

{r}
mtcars  %>% 
    mutate(car = rownames(.))  %>% 
    select(car, mpg)  %>% 
    filter(mpg >= 25)

To create a web map in R, you will chain together a series of function calls using the %>% operator. Our first function leaflet() will initialize the htmlwidget then we will add a map tile using the addTiles() function.

In [1]:
# Load the leaflet library
library(leaflet)
{r}
# Create a leaflet map with default map tile using addTiles()
library(htmlwidgets)
leaflet() %>% addTiles()

2. Provider Tiles

In the previous exercise, addTiles() added the default OpenStreetMap (OSM) tile to your leaflet map. Map tiles weave multiple map images together. The map tiles presented adjust when a user zooms or pans the map, enabling the interactive features you experimented with in exercise 2.

The leaflet package comes with more than 100 map tiles that you can use. These tiles are stored in a list called providers and can be added to your map using addProviderTiles() instead of addTiles().

The leaflet and tidyverse libraries have been loaded for you.

In [6]:
pkgs <- c("tidyverse", "leaflet")
sapply(pkgs, require, character.only = TRUE)
Loading required package: tidyverse
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.8
✔ tidyr   0.8.2     ✔ stringr 1.3.1
✔ readr   1.3.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
tidyverse   leaflet 
     TRUE      TRUE 
In [9]:
# Print the providers list included in the leaflet library
providers[1:5]
$OpenStreetMap
'OpenStreetMap'
$OpenStreetMap.Mapnik
'OpenStreetMap.Mapnik'
$OpenStreetMap.BlackAndWhite
'OpenStreetMap.BlackAndWhite'
$OpenStreetMap.DE
'OpenStreetMap.DE'
$OpenStreetMap.CH
'OpenStreetMap.CH'
In [10]:
# Print only the names of the map tiles in the providers list 
names(providers)
  1. 'OpenStreetMap'
  2. 'OpenStreetMap.Mapnik'
  3. 'OpenStreetMap.BlackAndWhite'
  4. 'OpenStreetMap.DE'
  5. 'OpenStreetMap.CH'
  6. 'OpenStreetMap.France'
  7. 'OpenStreetMap.HOT'
  8. 'OpenStreetMap.BZH'
  9. 'OpenInfraMap'
  10. 'OpenInfraMap.Power'
  11. 'OpenInfraMap.Telecom'
  12. 'OpenInfraMap.Petroleum'
  13. 'OpenInfraMap.Water'
  14. 'OpenSeaMap'
  15. 'OpenPtMap'
  16. 'OpenTopoMap'
  17. 'OpenRailwayMap'
  18. 'OpenFireMap'
  19. 'SafeCast'
  20. 'Thunderforest'
  21. 'Thunderforest.OpenCycleMap'
  22. 'Thunderforest.Transport'
  23. 'Thunderforest.TransportDark'
  24. 'Thunderforest.SpinalMap'
  25. 'Thunderforest.Landscape'
  26. 'Thunderforest.Outdoors'
  27. 'Thunderforest.Pioneer'
  28. 'OpenMapSurfer'
  29. 'OpenMapSurfer.Roads'
  30. 'OpenMapSurfer.AdminBounds'
  31. 'OpenMapSurfer.Grayscale'
  32. 'Hydda'
  33. 'Hydda.Full'
  34. 'Hydda.Base'
  35. 'Hydda.RoadsAndLabels'
  36. 'MapBox'
  37. 'Stamen'
  38. 'Stamen.Toner'
  39. 'Stamen.TonerBackground'
  40. 'Stamen.TonerHybrid'
  41. 'Stamen.TonerLines'
  42. 'Stamen.TonerLabels'
  43. 'Stamen.TonerLite'
  44. 'Stamen.Watercolor'
  45. 'Stamen.Terrain'
  46. 'Stamen.TerrainBackground'
  47. 'Stamen.TopOSMRelief'
  48. 'Stamen.TopOSMFeatures'
  49. 'Esri'
  50. 'Esri.WorldStreetMap'
  51. 'Esri.DeLorme'
  52. 'Esri.WorldTopoMap'
  53. 'Esri.WorldImagery'
  54. 'Esri.WorldTerrain'
  55. 'Esri.WorldShadedRelief'
  56. 'Esri.WorldPhysical'
  57. 'Esri.OceanBasemap'
  58. 'Esri.NatGeoWorldMap'
  59. 'Esri.WorldGrayCanvas'
  60. 'OpenWeatherMap'
  61. 'OpenWeatherMap.Clouds'
  62. 'OpenWeatherMap.CloudsClassic'
  63. 'OpenWeatherMap.Precipitation'
  64. 'OpenWeatherMap.PrecipitationClassic'
  65. 'OpenWeatherMap.Rain'
  66. 'OpenWeatherMap.RainClassic'
  67. 'OpenWeatherMap.Pressure'
  68. 'OpenWeatherMap.PressureContour'
  69. 'OpenWeatherMap.Wind'
  70. 'OpenWeatherMap.Temperature'
  71. 'OpenWeatherMap.Snow'
  72. 'HERE'
  73. 'HERE.normalDay'
  74. 'HERE.normalDayCustom'
  75. 'HERE.normalDayGrey'
  76. 'HERE.normalDayMobile'
  77. 'HERE.normalDayGreyMobile'
  78. 'HERE.normalDayTransit'
  79. 'HERE.normalDayTransitMobile'
  80. 'HERE.normalNight'
  81. 'HERE.normalNightMobile'
  82. 'HERE.normalNightGrey'
  83. 'HERE.normalNightGreyMobile'
  84. 'HERE.basicMap'
  85. 'HERE.mapLabels'
  86. 'HERE.trafficFlow'
  87. 'HERE.carnavDayGrey'
  88. 'HERE.hybridDay'
  89. 'HERE.hybridDayMobile'
  90. 'HERE.pedestrianDay'
  91. 'HERE.pedestrianNight'
  92. 'HERE.satelliteDay'
  93. 'HERE.terrainDay'
  94. 'HERE.terrainDayMobile'
  95. 'FreeMapSK'
  96. 'MtbMap'
  97. 'CartoDB'
  98. 'CartoDB.Positron'
  99. 'CartoDB.PositronNoLabels'
  100. 'CartoDB.PositronOnlyLabels'
  101. 'CartoDB.DarkMatter'
  102. 'CartoDB.DarkMatterNoLabels'
  103. 'CartoDB.DarkMatterOnlyLabels'
  104. 'HikeBike'
  105. 'HikeBike.HikeBike'
  106. 'HikeBike.HillShading'
  107. 'BasemapAT'
  108. 'BasemapAT.basemap'
  109. 'BasemapAT.grau'
  110. 'BasemapAT.overlay'
  111. 'BasemapAT.highdpi'
  112. 'BasemapAT.orthofoto'
  113. 'nlmaps'
  114. 'nlmaps.standaard'
  115. 'nlmaps.pastel'
  116. 'nlmaps.grijs'
  117. 'nlmaps.luchtfoto'
  118. 'NASAGIBS'
  119. 'NASAGIBS.ModisTerraTrueColorCR'
  120. 'NASAGIBS.ModisTerraBands367CR'
  121. 'NASAGIBS.ViirsEarthAtNight2012'
  122. 'NASAGIBS.ModisTerraLSTDay'
  123. 'NASAGIBS.ModisTerraSnowCover'
  124. 'NASAGIBS.ModisTerraAOD'
  125. 'NASAGIBS.ModisTerraChlorophyll'
  126. 'NLS'
  127. 'JusticeMap'
  128. 'JusticeMap.income'
  129. 'JusticeMap.americanIndian'
  130. 'JusticeMap.asian'
  131. 'JusticeMap.black'
  132. 'JusticeMap.hispanic'
  133. 'JusticeMap.multi'
  134. 'JusticeMap.nonWhite'
  135. 'JusticeMap.white'
  136. 'JusticeMap.plurality'
  137. 'Wikimedia'
In [11]:
# Use str_detect() to determine if the name of each provider tile contains the string "CartoDB"
str_detect(names(providers), "CartoDB")
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. FALSE
  6. FALSE
  7. FALSE
  8. FALSE
  9. FALSE
  10. FALSE
  11. FALSE
  12. FALSE
  13. FALSE
  14. FALSE
  15. FALSE
  16. FALSE
  17. FALSE
  18. FALSE
  19. FALSE
  20. FALSE
  21. FALSE
  22. FALSE
  23. FALSE
  24. FALSE
  25. FALSE
  26. FALSE
  27. FALSE
  28. FALSE
  29. FALSE
  30. FALSE
  31. FALSE
  32. FALSE
  33. FALSE
  34. FALSE
  35. FALSE
  36. FALSE
  37. FALSE
  38. FALSE
  39. FALSE
  40. FALSE
  41. FALSE
  42. FALSE
  43. FALSE
  44. FALSE
  45. FALSE
  46. FALSE
  47. FALSE
  48. FALSE
  49. FALSE
  50. FALSE
  51. FALSE
  52. FALSE
  53. FALSE
  54. FALSE
  55. FALSE
  56. FALSE
  57. FALSE
  58. FALSE
  59. FALSE
  60. FALSE
  61. FALSE
  62. FALSE
  63. FALSE
  64. FALSE
  65. FALSE
  66. FALSE
  67. FALSE
  68. FALSE
  69. FALSE
  70. FALSE
  71. FALSE
  72. FALSE
  73. FALSE
  74. FALSE
  75. FALSE
  76. FALSE
  77. FALSE
  78. FALSE
  79. FALSE
  80. FALSE
  81. FALSE
  82. FALSE
  83. FALSE
  84. FALSE
  85. FALSE
  86. FALSE
  87. FALSE
  88. FALSE
  89. FALSE
  90. FALSE
  91. FALSE
  92. FALSE
  93. FALSE
  94. FALSE
  95. FALSE
  96. FALSE
  97. TRUE
  98. TRUE
  99. TRUE
  100. TRUE
  101. TRUE
  102. TRUE
  103. TRUE
  104. FALSE
  105. FALSE
  106. FALSE
  107. FALSE
  108. FALSE
  109. FALSE
  110. FALSE
  111. FALSE
  112. FALSE
  113. FALSE
  114. FALSE
  115. FALSE
  116. FALSE
  117. FALSE
  118. FALSE
  119. FALSE
  120. FALSE
  121. FALSE
  122. FALSE
  123. FALSE
  124. FALSE
  125. FALSE
  126. FALSE
  127. FALSE
  128. FALSE
  129. FALSE
  130. FALSE
  131. FALSE
  132. FALSE
  133. FALSE
  134. FALSE
  135. FALSE
  136. FALSE
  137. FALSE
In [12]:
# Use str_detect() to print only the provider tile names that include the string "CartoDB"
names(providers)[str_detect(names(providers), "CartoDB")]
  1. 'CartoDB'
  2. 'CartoDB.Positron'
  3. 'CartoDB.PositronNoLabels'
  4. 'CartoDB.PositronOnlyLabels'
  5. 'CartoDB.DarkMatter'
  6. 'CartoDB.DarkMatterNoLabels'
  7. 'CartoDB.DarkMatterOnlyLabels'

3. Adding a Custom Map Tile

Did any tile names look familiar? If you have worked with mapping software, you may recognize the name ESRI or CartoDB.

We created our first leaflet map using the default OSM map tile.

{r}
leaflet() %>% 
    addTiles()

We will primarily use CartoDB provider tiles, but feel free to try others, like Esri. To add a custom provider tile to our map we will use the addProviderTiles() function. The first argument to addProviderTiles() is your leaflet map, which allows us to pipe leaflet() output directly into addProviderTiles(). The second argument is provider, which accepts any of the map tiles included in the providers list.

Familiarize yourself with the SCRIPT.R and HTML VIEWER tabs. Click back and forth to type your code and view your maps.

{r}
leaflet() %>% 
    addProviderTiles("Esri")
{r}
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels")

4. A Map with a View I

You may have noticed that, by default, maps are zoomed out to the farthest level. Rather than manually zooming and panning, we can load the map centered on a particular point using the setView() function.

{r}
leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    setView(lat = 40.7, lng = -74.0, zoom = 10)

Currently, DataCamp has offices at the following locations:

350 5th Ave, Floor 77, New York, NY 10118

Martelarenlaan 38, 3010 Kessel-Lo, Belgium

These addresses were converted to coordinates using the geocode() function in the ggmap package.

NYC: (-73.98575, 40.74856) Belgium: (4.717863, 50.881363)

{r}
leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 6)
{r}
dc_hq <- data.frame(hq = c("DataCamp - NYC", "DataCamp - Belgium"), 
                   lon = c(-74.0, 4.72), 
                   lat = c(40.7, 50.9))
leaflet() %>% 
    addProviderTiles("CartoDB.PositronNoLabels") %>% 
    setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 4)

5. A Map with a Narrower View

We can limit users' ability to pan away from the map's focus using the options argument in the leaflet() function. By setting minZoom and dragging, we can create an interactive web map that will always be focused on a specific area.

{r}
leaflet(options = 
        leafletOptions(minZoom = 14, dragging = FALSE))  %>% 
  addProviderTiles("CartoDB")  %>% 
  setView(lng = -73.98575, lat = 40.74856, zoom = 14)

Alternatively, if we want our users to be able to drag the map while ensuring that they do not stray too far, we can set the map's maximum boundaries by specifying two diagonal corners of a rectangle.

You'll use dc_hq to create a map with the "CartoDB" provider tile that is centered on DataCamp's Belgium office.

{r}
leaflet(options = leafletOptions(
                    # Set minZoom and dragging 
                    minZoom = 12, dragging = TRUE))  %>% 
  addProviderTiles("CartoDB")  %>% 

  # Set default zoom level 
  setView(lng = dc_hq$lon[2], lat = dc_hq$lat[2], zoom = 14) %>% 

  # Set max bounds of map 
  setMaxBounds(lng1 = dc_hq$lon[2] + .05, 
               lat1 = dc_hq$lat[2] + .05, 
               lng2 = dc_hq$lon[2] - .05, 
               lat2 = dc_hq$lat[2] - .05)

6. Mark it

So far we have been creating maps with a single layer: a base map. We can add layers to this base map similar to how you add layers to a plot in ggplot2. One of the most common layers to add to a leaflet map is location markers, which you can add by piping the result of addTiles() or addProviderTiles() into the addMarkers() function.

For example, if we plot DataCamp's NYC HQ by passing the coordinates to addMarkers() as numeric vectors with one element, our web map will place a blue drop pin at the coordinate. In chapters 2 and 3, we will review some options for customizing these markers.

{r}
leaflet()  %>% 
    addProviderTiles("CartoDB")  %>% 
    addMarkers(lng = -73.98575, lat = 40.74856)

The dc_hq tibble is available in your workspace.

{r}
# Plot DataCamp's NYC HQ
dc_hq <- data.frame(hq = c("DataCamp - NYC", "DataCamp - Belgium"), 
                   lon = c(-74.0, 4.72), 
                   lat = c(40.7, 50.9))

leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon[1], lat = dc_hq$lat[1])
{r}
# Plot DataCamp's NYC HQ with zoom of 12    
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = -73.98575, lat = 40.74856)  %>% 
    setView(lng = -73.98575, lat = 40.74856, zoom = 12)
{r}
# Plot both DataCamp's NYC and Belgium locations
leaflet() %>% 
    addProviderTiles("CartoDB") %>% 
    addMarkers(lng = dc_hq$lon, lat = dc_hq$lat)

7. Adding Popups and Storing your Map

To make our map more informative we can add popups. To add popups that appear when a marker is clicked we need to specify the popup argument in the addMarkers() function. Once we have a map we would like to preserve, we can store it in an object. Then we can pipe this object into functions to add or edit the map's layers.

{r}
dc_nyc <- 
    leaflet() %>% 
        addTiles() %>% 
        addMarkers(lng = -73.98575, lat = 40.74856, 
                   popup = "DataCamp - NYC") 

dc_nyc %>% 
    setView(lng = -73.98575, lat = 40.74856, 
            zoom = 2)

Let's try adding popups to both DataCamp location markers and storing our map in an object.

In [1]:
# Store leaflet hq map in an object called map
# Plot DataCamp's NYC HQ
pkgs <- c("tidyverse", "leaflet", "htmlwidgets", "webshot")
sapply(pkgs, require, character.only = TRUE)

dc_hq <- data.frame(hq = c("DataCamp - NYC", "DataCamp - Belgium"), 
                   lon = c(-74.0, 4.72), 
                   lat = c(40.7, 50.9))

map <- leaflet() %>%
          addProviderTiles("CartoDB") %>%
          # Use dc_hq to add the hq column as popups
          addMarkers(lng = dc_hq$lon, lat = dc_hq$lat,
                     popup = dc_hq$hq)

# Center the view of map on the Belgium HQ with a zoom of 5 
map_zoom <- map %>%
      setView(lat = 50.881363, lng = 4.717863,
              zoom = 5)

# Print map_zoom
# map_zoom

# saving leaflet
## create .html and .png
## save html to png
saveWidget(map_zoom, "chapter1_mapZoom.html", selfcontained = FALSE)
webshot("chapter1_mapZoom.html", file = "chapter1_mapZoom.png",
        cliprect = "viewport")
Loading required package: tidyverse
── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.2.5  
✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.2       ✔ stringr 1.3.1  
✔ readr   1.3.1       ✔ forcats 0.3.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: leaflet
Loading required package: htmlwidgets
Loading required package: webshot
  tidyverse     leaflet htmlwidgets     webshot 
       TRUE        TRUE        TRUE        TRUE 



Logistic_Regression

Supervised Learning, Logistic Regression

Evan Jung January 18, 2019

1. Data import

Suppose we receive the dataset below from an NGO.

## Observations: 93,462
## Variables: 13
## $ donated           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ veteran           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ bad_address       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ age               <dbl> 60, 46, NA, 70, 78, NA, 38, NA, NA, 65, NA, ...
## $ has_children      <dbl> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ wealth_rating     <dbl> 0, 3, 1, 2, 1, 0, 2, 3, 1, 0, 1, 2, 1, 0, 2,...
## $ interest_veterans <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ interest_religion <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ pet_owner         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ catalog_shopper   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ recency           <fct> CURRENT, CURRENT, CURRENT, CURRENT, CURRENT,...
## $ frequency         <fct> FREQUENT, FREQUENT, FREQUENT, FREQUENT, FREQ...
## $ money             <fct> MEDIUM, HIGH, MEDIUM, MEDIUM, MEDIUM, MEDIUM...

Here the target variable is donated. The donated column is 1 if the person made a donation in response to the mailing and 0 otherwise.


2. Building a model

When building a model, in most cases it's not a good idea to include all the variables. It is better to start with a hypothesis about which independent variables will be predictive of the dependent variable. In this case, the bad_address column, which is set to 1 for an invalid mailing address and 0 otherwise, seems like it might reduce the chances of a donation. Similarly, one might suspect that religious interest (interest_religion) and interest in veterans affairs (interest_veterans) would be associated with greater charitable giving.
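
The summary shown below was produced by a call like the following (reconstructed from the Call line; the object name donation_model is assumed):

# Fit the logistic regression model summarized below
donation_model <- glm(donated ~ bad_address + interest_religion + interest_veterans,
                      data = donors, family = "binomial")
summary(donation_model)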

## 
## Call:
## glm(formula = donated ~ bad_address + interest_religion + interest_veterans, 
##     family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3480  -0.3192  -0.3192  -0.3192   2.5678  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -2.95139    0.01652 -178.664   <2e-16 ***
## bad_address       -0.30780    0.14348   -2.145   0.0319 *  
## interest_religion  0.06724    0.05069    1.327   0.1847    
## interest_veterans  0.11009    0.04676    2.354   0.0186 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37316  on 93458  degrees of freedom
## AIC: 37324
## 
## Number of Fisher Scoring iterations: 5


3. Prediction

As with other machine learning methods in R, we will apply predict(). By default, predict() outputs predictions in terms of log odds unless type = "response" is specified, which converts the log odds to probabilities.

Because a logistic regression model estimates the probability of the outcome, it is up to you to determine the threshold at which the probability implies action. One must balance the extremes of being too cautious versus being too aggressive.

For example, if you were to solicit only the people with a 99% or greater donation probability, you may miss out on many people with lower estimated probabilities that still choose to donate. This balance is particularly important to consider for severely imbalanced outcomes, such as in this dataset where donations are relatively rare.
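
A sketch of the prediction step (the 0.0504 cutoff is the base donation rate; donation_prob and donation_pred are the variable names that later appear in the stepwise output):

# Predicted probabilities; type = "response" converts log odds to probabilities
donors$donation_prob <- predict(donation_model, type = "response")

# Base donation rate (printed below)
mean(donors$donated)

# Classify anyone above the base rate as a donor, then compute accuracy
donors$donation_pred <- ifelse(donors$donation_prob > 0.0504, 1, 0)
mean(donors$donated == donors$donation_pred)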

## [1] 0.05040551

So the actual probability that an average person donates is about 0.05. Using that value as the classification threshold, the model's accuracy comes out as follows:

## [1] 0.794815


4. Limitation of Accuracy

Although the accuracy of the model is almost 80%, this result is misleading due to the rarity of the outcome being predicted. What would the accuracy have been if the model had simply predicted "no donation" for every person? It would have been 95%, as the outcome distribution below shows.

## 
##    0    1 
## 0.95 0.05


5. Calculating ROC Curves and AUC

We know that accuracy is a very misleading measure of model performance on imbalanced datasets. Graphing the model's performance better illustrates the tradeoff between a model that is overly aggressive and one that is overly passive.
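
A sketch of the ROC/AUC computation with the pROC package, whose startup messages appear below:

# ROC curve and AUC for the predicted probabilities
library(pROC)
ROC <- roc(donors$donated, donors$donation_prob)
plot(ROC, col = "blue")
auc(ROC)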

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## Area under the curve: 0.5102


How can we explain the value of AUC and the plot? Based on this visualization, the model isn't doing much better than the baseline, that is, a model doing nothing but making predictions at random.


6. Dummy Coding Categorical Data

Sometimes a dataset contains numeric values that represent a categorical feature.
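
A sketch of the recoding; the factor levels and the Medium reference category can be read off the coefficient names in the summary below:

# Treat wealth_rating as categorical rather than numeric
donors$wealth_rating <- factor(donors$wealth_rating,
                               levels = c(0, 1, 2, 3),
                               labels = c("Unknown", "Low", "Medium", "High"))

# Use Medium as the reference level
donors$wealth_rating <- relevel(donors$wealth_rating, ref = "Medium")
summary(glm(donated ~ wealth_rating, data = donors, family = "binomial"))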

## 
## Call:
## glm(formula = donated ~ wealth_rating, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3320  -0.3243  -0.3175  -0.3175   2.4582  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.91894    0.03614 -80.772   <2e-16 ***
## wealth_ratingUnknown -0.04373    0.04243  -1.031    0.303    
## wealth_ratingLow     -0.05245    0.05332  -0.984    0.325    
## wealth_ratingHigh     0.04804    0.04768   1.008    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37323  on 93458  degrees of freedom
## AIC: 37331
## 
## Number of Fisher Scoring iterations: 5


7. Handling Missing Data

Some of the prospective donors have missing age data. Unfortunately, R will exclude any cases with NA values when building a regression model.

One workaround is to replace, or impute, the missing values with an estimated value. After doing so, you may also create a missing data indicator to model the possibility that cases with missing data are different in some way from those without.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00   48.00   62.00   61.65   75.00   98.00   22546

There are 22,546 NA values, so we need to deal with the missing data.
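
A sketch of the imputation; the variables imputed_age and missing_age later show up in the stepwise output:

# Replace missing ages with the rounded mean age
donors$imputed_age <- ifelse(is.na(donors$age),
                             round(mean(donors$age, na.rm = TRUE), 2),
                             donors$age)

# Indicator for cases where age was originally missing
donors$missing_age <- ifelse(is.na(donors$age), 1, 0)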


8. Building a more sophisticated model

One of the best predictors of future giving is a history of recent, frequent, and large gifts. In marketing terms, this is known as R/F/M: Recency, Frequency, Money. Donors that have given both recently and frequently may be especially likely to give again.
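
The model summarized below was fit with a call like this (reconstructed from its Call line; the object name rfm_model is assumed):

# recency * frequency expands to both main effects plus their interaction
rfm_model <- glm(donated ~ recency * frequency + money,
                 data = donors, family = "binomial")
summary(rfm_model)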

## 
## Call:
## glm(formula = donated ~ recency * frequency + money, family = "binomial", 
##     data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3696  -0.3696  -0.2895  -0.2895   2.7924  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -3.01142    0.04279 -70.375   <2e-16 ***
## recencyLAPSED                     -0.86677    0.41434  -2.092   0.0364 *  
## frequencyINFREQUENT               -0.50148    0.03107 -16.143   <2e-16 ***
## moneyMEDIUM                        0.36186    0.04300   8.415   <2e-16 ***
## recencyLAPSED:frequencyINFREQUENT  1.01787    0.51713   1.968   0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 36938  on 93457  degrees of freedom
## AIC: 36948
## 
## Number of Fisher Scoring iterations: 6

The model has improved on the previous one. Based on the result, the combined impact of recency and frequency may be greater than the sum of the separate effects.


## Area under the curve: 0.5785

Based on the ROC curve, you’ve confirmed that past giving patterns are certainly predictive of future giving.


9. The dangers of stepwise regression

In spite of its utility for feature selection, stepwise regression is not frequently used in disciplines outside of machine learning due to some important caveats. First, it is not guaranteed to find the best possible model. Second, the stepwise procedure violates some statistical assumptions. Third, it can result in a model that makes little sense in the real world.


10. Building a stepwise regression model
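
A sketch of the forward stepwise procedure that produced the trace below (object names assumed; the two glm calls match the Call lines in the output):

# Intercept-only model and full model over all candidate predictors
null_model <- glm(donated ~ 1, data = donors, family = "binomial")
full_model <- glm(donated ~ ., data = donors, family = "binomial")
summary(null_model)
summary(full_model)

# Forward stepwise selection from the null model toward the full model
step_model <- step(null_model,
                   scope = list(lower = null_model, upper = full_model),
                   direction = "forward")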

## 
## Call:
## glm(formula = donated ~ 1, family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3216  -0.3216  -0.3216  -0.3216   2.4444  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.93593    0.01495  -196.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 37330  on 93461  degrees of freedom
## Residual deviance: 37330  on 93461  degrees of freedom
## AIC: 37332
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = donated ~ ., family = "binomial", data = donors)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6111  -0.3642  -0.3080  -0.2866   2.7436  
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           1.742e+01  1.066e+01   1.634  0.10222    
## veteran              -2.071e-02  5.151e-01  -0.040  0.96793    
## bad_address          -5.442e+00  2.802e+00  -1.942  0.05208 .  
## age                   1.094e-03  1.093e-03   1.001  0.31702    
## has_children         -1.561e-01  5.156e-02  -3.028  0.00247 ** 
## wealth_ratingUnknown -1.196e-02  4.819e-02  -0.248  0.80404    
## wealth_ratingLow     -4.901e-02  5.773e-02  -0.849  0.39594    
## wealth_ratingHigh     1.270e-01  5.079e-02   2.500  0.01243 *  
## interest_veterans     2.429e+00  1.214e+00   2.001  0.04535 *  
## interest_religion     1.491e+00  7.507e-01   1.986  0.04704 *  
## pet_owner             5.060e-02  4.895e-02   1.034  0.30128    
## catalog_shopper       6.686e-02  5.980e-02   1.118  0.26353    
## recencyLAPSED        -1.678e-01  2.565e-01  -0.654  0.51297    
## frequencyINFREQUENT  -4.645e-01  3.523e-02 -13.185  < 2e-16 ***
## moneyMEDIUM           3.734e-01  4.893e-02   7.631 2.34e-14 ***
## donation_prob        -4.131e+02  2.146e+02  -1.926  0.05416 .  
## donation_pred        -1.185e-01  1.222e-01  -0.970  0.33189    
## imputed_age                  NA         NA      NA       NA    
## missing_age                  NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 28714  on 70915  degrees of freedom
## Residual deviance: 28405  on 70899  degrees of freedom
##   (22546 observations deleted due to missingness)
## AIC: 28439
## 
## Number of Fisher Scoring iterations: 6
## Start:  AIC=37332.13
## donated ~ 1

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + frequency          1    28502 37122
## + money              1    28621 37241
## + has_children       1    28705 37326
## + age                1    28707 37328
## + imputed_age        1    28707 37328
## + wealth_rating      3    28704 37328
## + interest_veterans  1    28709 37330
## + donation_prob      1    28710 37330
## + donation_pred      1    28710 37330
## + catalog_shopper    1    28710 37330
## + pet_owner          1    28711 37331
## <none>                    28714 37332
## + interest_religion  1    28712 37333
## + recency            1    28713 37333
## + bad_address        1    28714 37334
## + veteran            1    28714 37334
## 
## Step:  AIC=37024.77
## donated ~ frequency

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + money              1    28441 36966
## + wealth_rating      3    28490 37019
## + has_children       1    28494 37019
## + donation_prob      1    28498 37023
## + interest_veterans  1    28498 37023
## + catalog_shopper    1    28499 37024
## + donation_pred      1    28499 37024
## + age                1    28499 37024
## + imputed_age        1    28499 37024
## + pet_owner          1    28499 37024
## <none>                    28502 37025
## + interest_religion  1    28501 37026
## + recency            1    28501 37026
## + bad_address        1    28502 37026
## + veteran            1    28502 37027
## 
## Step:  AIC=36949.71
## donated ~ frequency + money

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + wealth_rating      3    28427 36942
## + has_children       1    28432 36943
## + interest_veterans  1    28438 36948
## + donation_prob      1    28438 36949
## + catalog_shopper    1    28438 36949
## + donation_pred      1    28438 36949
## + age                1    28438 36949
## + imputed_age        1    28438 36949
## + pet_owner          1    28439 36949
## <none>                    28441 36950
## + interest_religion  1    28440 36951
## + recency            1    28440 36951
## + bad_address        1    28441 36951
## + veteran            1    28441 36952
## 
## Step:  AIC=36945.48
## donated ~ frequency + money + wealth_rating

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + has_children       1    28416 36937
## + age                1    28424 36944
## + imputed_age        1    28424 36944
## + interest_veterans  1    28424 36945
## + donation_prob      1    28424 36945
## + catalog_shopper    1    28424 36945
## + donation_pred      1    28425 36945
## <none>                    28427 36945
## + pet_owner          1    28425 36946
## + interest_religion  1    28426 36947
## + recency            1    28426 36947
## + bad_address        1    28427 36947
## + veteran            1    28427 36947
## 
## Step:  AIC=36938.4
## donated ~ frequency + money + wealth_rating + has_children

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## + pet_owner          1    28413 36937
## + donation_prob      1    28413 36937
## + catalog_shopper    1    28413 36937
## + interest_veterans  1    28413 36937
## + donation_pred      1    28414 36938
## <none>                    28416 36938
## + interest_religion  1    28415 36939
## + age                1    28416 36940
## + imputed_age        1    28416 36940
## + recency            1    28416 36940
## + bad_address        1    28416 36940
## + veteran            1    28416 36940
## 
## Step:  AIC=36932.25
## donated ~ frequency + money + wealth_rating + has_children + 
##     pet_owner

## Warning in add1.glm(fit, scope$add, scale = scale, trace = trace, k = k, :
## using the 70916/93462 rows from a combined fit

##                     Df Deviance   AIC
## <none>                    28413 36932
## + donation_prob      1    28411 36932
## + interest_veterans  1    28411 36932
## + catalog_shopper    1    28412 36933
## + donation_pred      1    28412 36933
## + age                1    28412 36933
## + imputed_age        1    28412 36933
## + recency            1    28413 36934
## + interest_religion  1    28413 36934
## + bad_address        1    28413 36934
## + veteran            1    28413 36934


## Area under the curve: 0.5849

Despite the caveats of stepwise regression, it seems to have resulted in a relatively strong model!


All the Contents are From DataCamp
