분류 전체보기

Jupyter Notebook + R 연동하기 2019.01.24 1
[R] k-Nearest Neighbors (kNN) 2019.01.20
How to draw graph for two numeric variables in R 2019.01.17
Correlation in R 2019.01.09

Jupyter Notebook + R 연동하기

강의를 해야하는 상황이 점점 오고 있습니다. 기존에 RStudio를 써봐서 강의를 진행하기도 하였지만, 강의를 듣는 청자의 입장에서 이것저것 불편한 점이 많았습니다.

RStudio가 개발 환경에서는 분명 좋기는 하지만, 무언가 더 좋은 프로그램이 없을까 고민하던 찰나에 python 데이터 분석가가 주로 사용한다는 Jupyter Notebook이 R과 연동이 된다는 얘기를 듣게 되어 바로 환경설치 작업에 들어갑니다.

몇몇 싸이트를 참고하기는 했습니다만, 대표적으로 글을 써야겠다고 느낀점은 아래 StackOverFlow에서 근본적인 원인을 발견하고, 해결하여, 비슷한 경험이 없기를 바라는 마음에 작성합니다.

rJava not loading in Jupyter Notebook with R kernel

참고로 제 개발환경은 아래와 같습니다.

 > sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.2
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

1. 첫번째, 환경설정 시 주의해야 할 사항

주의!

기존에 R 사용자는 Jupyter Notebook를 사용하기 위해 Anaconda를 다운받지 마세요.

보통 위 그림과 같이 Jupyter Notebook를 사용하기 위해 Anaconda를 많이 사용합니다. 저 역시, 처음 Python을 배울 때는 Jupyter Notebook을 사용했습니다.
Anaconda와 기존에 설치된 R을 다시 연결해야 하는데, 초보자에게는 쉽지 않습니다.
Anaconda에서 R을 설치하려면 아래와 같이 명령어를 입력해야 합니다.

 $ conda install -c r r-essentials

WHY?

간단한 통계분석을 한다고 하면, 사실 상관이 없습니다.
그런데, rJava 패키지를 활용하려면 자바 경로를 Anaconda에 맞게끔 재 설정해야 합니다.
즉, 기존에 설치된 R과 RStudio에서는 이상없이 설치되는 패키지가 Anaconda Jupyter notebook에서는 에러가 되는 상황이 발생이 됩니다.
이유는, Anaconda Jupyter notebook이 자바 경로와 설정이 되지 않았기 때문입니다.
다시 말하면, 시스템 초보자에게는 이 부분이 매우 어려울 수 있습니다.
저 역시, rJava가 Anaconda내 Jupyter notebook과 연결이 되지 않아 결국 Anaconda를 전체 삭제했습니다.

2. 두번째, Jupyter Notebook 설치

Jupyter 설치 방법은 링크를 앞에 Jupyter를 클릭하시기 바랍니다.

(1) Python 버전 확인

먼저 터미널을 엽니다. (Terminal)

터미널은 위와 같이 생겼습니다.

 $ python --version
Python 3.7.1

(2) 버전에 맞게 설치 방법 확인

Jupyter를 pip이라는 설치 모듈(module)이라는 도구를 활용하여 설치를 진행합니다.

 $ python3 -m pip install --upgrade pip
$ python3 -m pip install jupyter

설치가 완료가 되면 먼저 jupyter notebook를 실행합니다.

 $ jupyter notebook

3. IRkernel 패키지 설치

IRkernel 패키지는 일종의 Jupyter notebook에 R이 등록될 수 있도록 도와주는 패키지로 이해하면 좋을 듯 합니다. Windows, MacOS, Linux 등 각 OS에 맞는 자세한 설치방법은 IRkernel 홈페이지에 있습니다.

저는 맥북 사용자이기 때문에 MacOS 설치 버전을 아래 그림과 같이 따라했습니다.

 $ xcode-select --install
$ brew install zmq
# or upgrade
$ brew update
$ brew upgrade zmq

(1) brew가 없으신 분들을 위해

brew는 일종의 맥북 패키지 관리자 도구입니다. 자세한 설치방법은 Homebrew 홈페이지를 참조하시기를 바랍니다.

 $ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

(2) MacPorts

저는 주로 brew를 이용합니다만, MacPorts를 사용하시는 분들을 위해 남겨 놓습니다.

 $ xcode-select --install
$ sudo port install zmq
$ export CPATH=/opt/local/include
$ export LIBRARY_PATH=/opt/local/lib

4. R Package 설치

주의!

R App 또는 RStudio에서 해당 패키지를 설치하지 말아주세요.
홈페이지에서는 다음과 같이 경고 합니다.

On OS X, be sure to execute this in R started from the Terminal, not the R App!
(This is because the R app doesn’t honor $PATH changes in ~/.bash_profile)
간단히 말하면, 위 패키지는 R과 경로 설정이 안되어 있다는 뜻입니다.
Terminal에서 R을 실행합니다.

$ R

R을 실행하면 아래와 같은 화면이 나오면 정상적으로 실행이 완료가 된 것입니다.

이제, R 패키지를 위 화면에서 설치합니다.

 > install.packages(c('repr', 'IRdisplay', 'IRkernel'), type = 'source')

5. R kernel을 Jupyter에 등록하기

이제 마지막입니다. 아래와 같이 입력한 후 결과값이 나와야 합니다. 설치가 정상적으로 완료되면, q() 함수를 이용하여 다시 터미널 환경으로 돌아옵니다.

 > IRkernel::installspec()
[InstallKernelSpec] Removing existing kernelspec in /Users/jihoonjung/Library/Jupyter/kernels/ir
[InstallKernelSpec] Installed kernelspec ir in /Users/jihoonjung/Library/Jupyter/kernels/ir
> q()

6. Jupyter notebook에서 R Kernel 확인하기

실제로 R Kernel이 등록되었는지 확인합니다.

 jupyter notebook

아래 그림에서 보는 것처럼 정상적으로 설치가 되었고, rJava도 정상적으로 호출이 되었습니다.

[R에서 딥러닝 해보실래요?] https://cozydatascientist.tistory.com/77

'R > [R] 잡동사니' 카테고리의 다른 글

[R 마크다운 소개] R Markdown 소개 및 환경설정 (0)	2019.10.04

[R] k-Nearest Neighbors (kNN)

2019. 1. 20. 15:00

K_Nearest_Neighbors__KNN_

k-Nearest Neighbors (kNN)

Evan Jung January 18, 2019

1. The Concept of KNN

What is KNN? The word KNN is the abbreviation of kN(Nearest)N(Neighbors). Then what is K? k is therefore just the number of neighbors “voting” on the test example’s class. If k=1, then test examples are given the same label as the closest example in the training set. If k=3, the labels of the three closest classes are checked and the most common (i.e., occuring at least twice) label is assigned, and so on for larger ks.

Measuring Similarity with distance (ex: color by color)

$dist(p,q) = \sqrt{(p_{1} - q_{1})^2 + (p_{2} - q_{2})^2 + ... + (p_{n} - q_{n})^2}$

library(class) pred <- knn(training_data, testing_data, training_labels)

 library(class)
library(readr)
library(dplyr)

 ## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

 url <- "https://assets.datacamp.com/production/course_2906/datasets/knn_traffic_signs.csv"
signs <- read_csv(url)

 ## Parsed with column specification:
## cols(
##   .default = col_double(),
##   sample = col_character(),
##   sign_type = col_character()
## )
## See spec(...) for full column specifications.

glimpse(signs)

 ## Observations: 206
## Variables: 51
## $ id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ sample    <chr> "train", "train", "train", "train", "train", "train"...
## $ sign_type <chr> "pedestrian", "pedestrian", "pedestrian", "pedestria...
## $ r1        <dbl> 155, 142, 57, 22, 169, 75, 136, 118, 149, 13, 123, 1...
## $ g1        <dbl> 228, 217, 54, 35, 179, 67, 149, 105, 225, 34, 124, 1...
## $ b1        <dbl> 251, 242, 50, 41, 170, 60, 157, 69, 241, 28, 107, 13...
## $ r2        <dbl> 135, 166, 187, 171, 231, 131, 200, 244, 34, 5, 83, 3...
## $ g2        <dbl> 188, 204, 201, 178, 254, 89, 203, 245, 45, 21, 61, 4...
## $ b2        <dbl> 101, 44, 68, 26, 27, 53, 107, 67, 1, 11, 26, 37, 26,...
## $ r3        <dbl> 156, 142, 51, 19, 97, 214, 150, 132, 155, 123, 116, ...
## $ g3        <dbl> 227, 217, 51, 27, 107, 144, 167, 123, 226, 154, 124,...
## $ b3        <dbl> 245, 242, 45, 29, 99, 75, 134, 12, 238, 140, 115, 12...
## $ r4        <dbl> 145, 147, 59, 19, 123, 156, 171, 138, 147, 21, 67, 4...
## $ g4        <dbl> 211, 219, 62, 27, 147, 169, 218, 123, 222, 46, 67, 5...
## $ b4        <dbl> 228, 242, 65, 29, 152, 190, 252, 85, 242, 41, 52, 49...
## $ r5        <dbl> 166, 164, 156, 42, 221, 67, 171, 254, 170, 36, 70, 1...
## $ g5        <dbl> 233, 228, 171, 37, 236, 50, 158, 254, 191, 60, 53, 1...
## $ b5        <dbl> 245, 229, 50, 3, 117, 36, 108, 92, 113, 26, 26, 141,...
## $ r6        <dbl> 212, 84, 254, 217, 205, 37, 157, 241, 26, 75, 26, 60...
## $ g6        <dbl> 254, 116, 255, 228, 225, 36, 186, 240, 37, 108, 26, ...
## $ b6        <dbl> 52, 17, 36, 19, 80, 42, 11, 108, 12, 44, 21, 18, 20,...
## $ r7        <dbl> 212, 217, 211, 221, 235, 44, 26, 254, 34, 13, 52, 9,...
## $ g7        <dbl> 254, 254, 226, 235, 254, 42, 35, 254, 45, 27, 45, 13...
## $ b7        <dbl> 11, 26, 70, 20, 60, 44, 10, 99, 19, 25, 27, 17, 20, ...
## $ r8        <dbl> 188, 155, 78, 181, 90, 192, 180, 108, 221, 133, 117,...
## $ g8        <dbl> 229, 203, 73, 183, 110, 131, 211, 106, 249, 163, 109...
## $ b8        <dbl> 117, 128, 64, 73, 9, 73, 236, 27, 184, 126, 83, 33, ...
## $ r9        <dbl> 170, 213, 220, 237, 216, 123, 129, 135, 226, 83, 110...
## $ g9        <dbl> 216, 253, 234, 234, 236, 74, 109, 123, 246, 125, 74,...
## $ b9        <dbl> 120, 51, 59, 44, 66, 22, 73, 40, 59, 19, 12, 12, 18,...
## $ r10       <dbl> 211, 217, 254, 251, 229, 36, 161, 254, 30, 13, 98, 2...
## $ g10       <dbl> 254, 255, 255, 254, 255, 34, 190, 254, 40, 27, 70, 1...
## $ b10       <dbl> 3, 21, 51, 2, 12, 37, 10, 115, 34, 25, 26, 11, 20, 2...
## $ r11       <dbl> 212, 217, 253, 235, 235, 44, 161, 254, 34, 9, 20, 28...
## $ g11       <dbl> 254, 255, 255, 243, 254, 42, 190, 254, 44, 23, 21, 2...
## $ b11       <dbl> 19, 21, 44, 12, 60, 44, 6, 99, 35, 18, 20, 19, 13, 1...
## $ r12       <dbl> 172, 158, 66, 19, 163, 197, 187, 138, 241, 85, 113, ...
## $ g12       <dbl> 235, 225, 68, 27, 168, 114, 215, 123, 255, 128, 76, ...
## $ b12       <dbl> 244, 237, 68, 29, 152, 21, 236, 85, 54, 21, 14, 12, ...
## $ r13       <dbl> 172, 164, 69, 20, 124, 171, 141, 118, 205, 83, 106, ...
## $ g13       <dbl> 235, 227, 65, 29, 117, 102, 142, 105, 229, 125, 69, ...
## $ b13       <dbl> 244, 237, 59, 34, 91, 26, 140, 75, 46, 19, 9, 12, 13...
## $ r14       <dbl> 172, 182, 76, 64, 188, 197, 189, 131, 226, 85, 102, ...
## $ g14       <dbl> 228, 228, 84, 61, 205, 114, 171, 124, 246, 128, 67, ...
## $ b14       <dbl> 235, 143, 22, 4, 78, 21, 140, 5, 59, 21, 6, 12, 13, ...
## $ r15       <dbl> 177, 171, 82, 211, 125, 123, 214, 106, 235, 85, 106,...
## $ g15       <dbl> 235, 228, 93, 222, 147, 74, 221, 94, 252, 128, 69, 4...
## $ b15       <dbl> 244, 196, 17, 78, 20, 22, 201, 53, 67, 21, 9, 11, 18...
## $ r16       <dbl> 22, 164, 58, 19, 160, 180, 188, 101, 237, 83, 43, 60...
## $ g16       <dbl> 52, 227, 60, 27, 183, 107, 211, 91, 254, 125, 29, 45...
## $ b16       <dbl> 53, 237, 60, 29, 187, 26, 227, 59, 53, 19, 11, 18, 1...

 signs2 <- signs[, -c(1:2)]
# sample
next_sign <- signs2[sample(NROW(signs2), 1), ]
next_sign <- next_sign[, -1]
str(next_sign)

 ## Classes 'tbl_df', 'tbl' and 'data.frame':    1 obs. of  48 variables:
##  $ r1 : num 179
##  $ g1 : num 195
##  $ b1 : num 188
##  $ r2 : num 67
##  $ g2 : num 22
##  $ b2 : num 24
##  $ r3 : num 65
##  $ g3 : num 21
##  $ b3 : num 23
##  $ r4 : num 28
##  $ g4 : num 35
##  $ b4 : num 28
##  $ r5 : num 106
##  $ g5 : num 115
##  $ b5 : num 109
##  $ r6 : num 99
##  $ g6 : num 115
##  $ b6 : num 109
##  $ r7 : num 102
##  $ g7 : num 121
##  $ b7 : num 115
##  $ r8 : num 91
##  $ g8 : num 77
##  $ b8 : num 74
##  $ r9 : num 91
##  $ g9 : num 90
##  $ b9 : num 85
##  $ r10: num 107
##  $ g10: num 118
##  $ b10: num 113
##  $ r11: num 85
##  $ g11: num 78
##  $ b11: num 73
##  $ r12: num 67
##  $ g12: num 22
##  $ b12: num 24
##  $ r13: num 146
##  $ g13: num 158
##  $ b13: num 150
##  $ r14: num 68
##  $ g14: num 25
##  $ b14: num 26
##  $ r15: num 68
##  $ g15: num 25
##  $ b15: num 26
##  $ r16: num 68
##  $ g16: num 82
##  $ b16: num 76

 # labeling
sign_types <- signs2$sign_type
# knn training
knn(train = signs2[-1], test = next_sign, cl = sign_types)

 ## [1] stop
## Levels: pedestrian speed stop

How the function knn() correctly classify the stop sign? knn() learned that stop signs are red.

2. Exploring the traffic sign dataset

Each previously observed street sign was divided into a 4x4 grid, and the red, green, and blue level for each of the 16 center pixels is recorded as illustrated here.

The result is a dataset that records the sign_type as well as 16 x 3 = 48 color properties of each sign.

 # 각각 Type에 대한 갯수 요약
table(signs2$sign_type)

 ## 
## pedestrian      speed       stop 
##         65         70         71

 # Sign Type에 따른 r10 해당하는 평균 빨간색 레벨
aggregate(r10 ~ sign_type, data = signs2, mean)

 ##    sign_type       r10
## 1 pedestrian 108.78462
## 2      speed  83.08571
## 3       stop 142.50704

Look at the sign type stop. It has higher value than other sign type. This is how kNN identifies similar signs.

3. Classifying a collection of road signs

Now that the autonomous vehicle has successfully stopped on its own, your team feels confident allowing the car to continue the test course.

The test course includes 59 additional road signs divided into three types:

 # get test signs
test_signs <- signs2[sample(NROW(signs2), 59), ]
# knn test
signs_pred <- knn(train = signs2[-1], test = test_signs[-1], cl = sign_types)
# Create a confusion matrix of the actual versus predicted values
signs_actual <- test_signs$sign_type
table(signs_actual, signs_pred)

 ##             signs_pred
## signs_actual pedestrian speed stop
##   pedestrian         19     0    0
##   speed               0    24    0
##   stop                0     0   16

 # Compute the accuracy
mean(signs_actual == signs_pred)

## [1] 1

4. Testing other ‘k’ values

By default, the knn() function in the class package uses only the single nearest neighbor.

Setting a k parameter allows the algorithm to consider additional nearby neighbors. This enlarges the collection of neighbors which will vote on the predicted class.

Compare k values of 1, 7, and 15 to examine the impact on traffic sign classification accuracy.

 signs_types <- signs2$sign_type
signs_test <- signs2[sample(NROW(signs2), 59), ]
signs_actual <- signs_test$sign_type
# Compute the accuracy of the baseline model (default k = 1)
k_1 <- knn(train = signs2[-1], test = signs_test[-1], cl = sign_types)
mean(signs_actual == k_1)

## [1] 1

 # Modify the above to set k = 7
k_7 <- knn(train = signs2[-1], test = signs_test[-1], cl = sign_types, k = 7)
mean(signs_actual == k_7)

## [1] 0.9661017

 # Set k = 15 and compare to the above
k_15 <- knn(train = signs2[-1], test = signs_test[-1], cl = sign_types, k = 15)
mean(signs_actual == k_15)

## [1] 0.9152542

5. Seeing how the neighbors voted

When multiple nearest neighbors hold a vote, it can sometimes be useful to examine whether the voters were unanimous or widely separated.

For example, knowing more about the voters’ confidence in the classification could allow an autonomous vehicle to use caution in the case there is any chance at all that a stop sign is ahead.

 # Use the prob parameter to get the proportion of votes for the winning class
sign_pred <- knn(train = signs2[-1], test = signs_test[-1], cl = sign_types, prob = TRUE, k = 7)
# Get the "prob" attribute from the predicted classes
sign_prob <- attr(sign_pred, "prob")
# Examine the first several predictions
head(sign_pred)

 ## [1] speed stop  speed speed stop  speed
## Levels: pedestrian speed stop

 # Examine the proportion of votes for the winning class
head(sign_prob)

## [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.7142857

6. Data preparation for kNN

kNN assumes numeric data. So, kNN benefits from normalized data. Min-Max normalize function is good enough for kNN Algorithm.

 normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# normalized version of r1
summary(normalize(signs2$r1))

 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1935  0.3528  0.4046  0.6129  1.0000

summary(signs2$r1)

 ##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0    51.0    90.5   103.3   155.0   251.0

Before performing kNN Algorithm, you must normalize data using a technique like min-max function above. why? It is to ensure all data elements may contribute equal shares to distance. Rescaling reduces the influence of extreme values on kNN’s distance function.

All Contents comes from DataCamp

datacamp iconì ëí ì´ë¯¸ì§ ê²ìê²°ê³¼

'R > [R] Machine Learning' 카테고리의 다른 글

[R] Classification Trees (0)	2019.02.03
[R] Supervised Learning, Logistic Regresison (0)	2019.01.25

How to draw graph for two numeric variables in R

2019. 1. 17. 22:53

Exploring_Two_Variables

Exploring Two or More Variables

Evan Jung January 17, 2019

Intro to Multivariate Analysis

Key Terms

Contingency Tables - A tally of counts between two or more categorical variables
Hexagonal Binning - A plot of two numeric variables with the records binned into hexagons
Contour plots - A plot showing the density of two numeric variables like a topographical map.
Violin plots - Similar a boxplot but showing the density estimate.

Multivariate Analysis depends on the nature of data: numeric versus categorical.

Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data)

kc_tax contains the tax-assessed values for residential properties in King County, Washington.

 library(readr)
library(dplyr)
kc_tax <- read_csv("kc_tax.csv")
glimpse(kc_tax)

 ## Observations: 498,249
## Variables: 3
## $ TaxAssessedValue <dbl> NA, 206000, 303000, 361000, 459000, 223000, 2...
## $ SqFtTotLiving    <dbl> 1730, 1870, 1530, 2000, 3150, 1570, 1770, 115...
## $ ZipCode          <dbl> 98117, 98002, 98166, 98108, 98108, 98032, 981...

The problem of scatterplots

They are fine when the number of data values is relatively small. But if data sets are enormous, a scatterplot will be too dense, so it becomes difficult to distinctly visualize the relationship. We will compare it to other graph later.

Hexagon binning plot

This plot is to visualize the relationship between the finished squarefeet versus TaxAssessedValue.

 library(ggplot2)
library(gridExtra)

 ## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

 library(hexbin)
kc_tax0 <- kc_tax %>% 
  filter(TaxAssessedValue < 750000, SqFtTotLiving > 100, SqFtTotLiving < 3500)
p1 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  stat_binhex(colour = "white") + 
  theme_bw() + 
  scale_fill_gradient(low = "white", high = "blue") + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Hexagon Binning Plot")
p2 <- ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  geom_point(colour = "blue") + 
  theme_bw() + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Scatter Plot")
grid.arrange(p1, p2, nrow = 2)

Let’s compare two plots. Rather than Scatter Plot, hexagon binning plots help to group into the hexagon bins and to plot the hexagons with a color indicating the number of records in that bin. Now, we can clearly see the positive relationship between two variables.

Density2d

The geom_density2d function uses contours overlaid on a scatterplot to visualize the relationship between two variables. The contours are essentially a topographical map to two variables. Each contour band represents a specific density of points, increasing as one nears a “peak”.

 ggplot(kc_tax0, aes(x = SqFtTotLiving, y = TaxAssessedValue)) + 
  theme_bw() + 
  geom_point(alpha = .1) + 
  geom_density2d(colour = "white") + 
  labs(x = "Finished Square Feet", 
       y = "Tax Assessed Value", 
       title = "Density 2D")

Conclusion

These plots are related to Correlation Analysis. So, when we draw graph between two variables, we has to think ahead “what two variables are related.”

All contents comes from the book below.

practical statistics for data scientistsì ëí ì´ë¯¸ì§ ê²ìê²°ê³¼

'R > [R] Data Visualization' 카테고리의 다른 글

[R Markdown] Customized Report (DataCamp) (0)	2019.05.04
[R] Interactive Data Visualisation using leaflet (2) (0)	2019.03.03
[R] Interactive Data Visualisation using leaflet (1) (0)	2019.03.02
[R] Beginner Graph for Communication - Label (0)	2019.01.03

Correlation in R

2019. 1. 9. 20:41

correlation

Correlation

Evan Jung January 09, 2019

1. Intro

Case 1: high values of X go with high values of Y, X and Y are positively corrleated.
Case 2: low values of X go with low values of Y, X and Y are positively corrleated.
Case 3: high values of X go with low values of Y, and vice versa, the variables are negatively correlated.

2. Key Terms

Correlation Coefficient is a metric that measures the extent to which numeric variables are associated with one another (ranges from -1 to +1). The +1 means perfect positive correlation The 0 indicates no correlation The -1 means perfect negative correlation

To compute Pearson’s correlation coefficient, we multiply deviations from the mean for variable 1 times those for variable 2, and divide by the product of the standard deviatinos:

$r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)s_{x}s_{y}}$

Correlation Matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between variables.

 # data import
sp500_px <- read.csv("data/sp500_px.csv", stringsAsFactors = F)
sp500_sym <- read.csv("data/sp500_sym.csv", stringsAsFactors = F)
# date conversion
sp500_px$X <- as.Date(sp500_px$X)
# data rows and cols extraction
etfs <- sp500_px[sp500_px$X > "2012-07-01", 
                 sp500_sym[sp500_sym$sector == "etf", "symbol"]]
# install.packages("corrplot")
library(corrplot)
corrplot(cor(etfs), method = "ellipse")

(Explanatoin of this plot remains to you!)

The orientation of the ellipse indicates whether two variables are positively correlated or negatively correlated.
The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipse correspond to stronger relationships.

2.1. Other Correlation Estimates

The Spearman’s rho or Kendall’s tau have long ago been proposed by statisticians. These are generally used on the basis of the rank of the data. These estimates are robust to outliers and can handle certain types of nonlinearities because they use for the ranks.

But, for the data scientists can generally stick to Pearson’s correlation coefficient, and its robust alternatives, for exploratory analysis. The appeal of rank-based estimates is mostly for smaller data sets and specific hypothesis tests

Scatterplot A plot in which the a-xis is the value of one variable, and the y-axis the value of another.

 telecom <- sp500_px[, c("T", "VZ")]
plot(telecom$T, telecom$VZ, xlab = "T", ylab = "VZ", main = "The Correlation betwen T(ATT) and VZ(Verizon)")

The returns have a strong positive relationship: on most days, both stocks go up or go down in tandem. There are very few days where one stock goes down significantly while the other stocs goes up (and vice versa).

3. Key Ideas for Correlation

The correlation coefficient measures the extent to which two variables are associated with one another.
When high values of v1 go with high values of v2, v1 and v2 are positively associated.
When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.
The correlation coefficient is a standardized metric so that it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation)
A correlation coefficent of 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance. ##
1. Further Reading Statistics, 4th ed., by David Freedman, Robert Pisani, and Roger Purves (W.W. Norton, 2007), has an excellent discussion of correlation.

'R > [R] Statistics' 카테고리의 다른 글

Assessing Prediction Performance R (0)	2018.12.17
Designing_model (0)	2018.12.15
Statistical Modeling in R Part 1 (0)	2018.12.13

PREV 1 ···9 10 11 12 13 14 15 NEXT

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

	> sessionInfo()
	R version 3.5.2 (2018-12-20)
	Platform: x86_64-apple-darwin15.6.0 (64-bit)
	Running under: macOS Mojave 10.14.2
	Matrix products: default BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
	LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
	attached base packages: [1] stats graphics grDevices utils datasets methods base

	$ python3 -m pip install --upgrade pip
	$ python3 -m pip install jupyter

	$ xcode-select --install
	$ brew install zmq
	# or upgrade
	$ brew update
	$ brew upgrade zmq

	$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

	$ xcode-select --install
	$ sudo port install zmq
	$ export CPATH=/opt/local/include
	$ export LIBRARY_PATH=/opt/local/lib

	> install.packages(c('repr', 'IRdisplay', 'IRkernel'), type = 'source')

	> IRkernel::installspec()
	[InstallKernelSpec] Removing existing kernelspec in /Users/jihoonjung/Library/Jupyter/kernels/ir
	[InstallKernelSpec] Installed kernelspec ir in /Users/jihoonjung/Library/Jupyter/kernels/ir
	> q()

	##
	## Attaching package: 'dplyr'

	## The following objects are masked from 'package:stats':
	##
	## filter, lag

	## The following objects are masked from 'package:base':
	##
	## intersect, setdiff, setequal, union

	url <- "https://assets.datacamp.com/production/course_2906/datasets/knn_traffic_signs.csv"

	signs <- read_csv(url)

	## Parsed with column specification:
	## cols(
	## .default = col_double(),
	## sample = col_character(),
	## sign_type = col_character()
	## )

	## See spec(...) for full column specifications.

	## Observations: 206
	## Variables: 51
	## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
	## $ sample <chr> "train", "train", "train", "train", "train", "train"...
	## $ sign_type <chr> "pedestrian", "pedestrian", "pedestrian", "pedestria...
	## $ r1 <dbl> 155, 142, 57, 22, 169, 75, 136, 118, 149, 13, 123, 1...
	## $ g1 <dbl> 228, 217, 54, 35, 179, 67, 149, 105, 225, 34, 124, 1...
	## $ b1 <dbl> 251, 242, 50, 41, 170, 60, 157, 69, 241, 28, 107, 13...
	## $ r2 <dbl> 135, 166, 187, 171, 231, 131, 200, 244, 34, 5, 83, 3...
	## $ g2 <dbl> 188, 204, 201, 178, 254, 89, 203, 245, 45, 21, 61, 4...
	## $ b2 <dbl> 101, 44, 68, 26, 27, 53, 107, 67, 1, 11, 26, 37, 26,...
	## $ r3 <dbl> 156, 142, 51, 19, 97, 214, 150, 132, 155, 123, 116, ...
	## $ g3 <dbl> 227, 217, 51, 27, 107, 144, 167, 123, 226, 154, 124,...
	## $ b3 <dbl> 245, 242, 45, 29, 99, 75, 134, 12, 238, 140, 115, 12...
	## $ r4 <dbl> 145, 147, 59, 19, 123, 156, 171, 138, 147, 21, 67, 4...
	## $ g4 <dbl> 211, 219, 62, 27, 147, 169, 218, 123, 222, 46, 67, 5...
	## $ b4 <dbl> 228, 242, 65, 29, 152, 190, 252, 85, 242, 41, 52, 49...
	## $ r5 <dbl> 166, 164, 156, 42, 221, 67, 171, 254, 170, 36, 70, 1...
	## $ g5 <dbl> 233, 228, 171, 37, 236, 50, 158, 254, 191, 60, 53, 1...
	## $ b5 <dbl> 245, 229, 50, 3, 117, 36, 108, 92, 113, 26, 26, 141,...
	## $ r6 <dbl> 212, 84, 254, 217, 205, 37, 157, 241, 26, 75, 26, 60...
	## $ g6 <dbl> 254, 116, 255, 228, 225, 36, 186, 240, 37, 108, 26, ...
	## $ b6 <dbl> 52, 17, 36, 19, 80, 42, 11, 108, 12, 44, 21, 18, 20,...
	## $ r7 <dbl> 212, 217, 211, 221, 235, 44, 26, 254, 34, 13, 52, 9,...
	## $ g7 <dbl> 254, 254, 226, 235, 254, 42, 35, 254, 45, 27, 45, 13...
	## $ b7 <dbl> 11, 26, 70, 20, 60, 44, 10, 99, 19, 25, 27, 17, 20, ...
	## $ r8 <dbl> 188, 155, 78, 181, 90, 192, 180, 108, 221, 133, 117,...
	## $ g8 <dbl> 229, 203, 73, 183, 110, 131, 211, 106, 249, 163, 109...
	## $ b8 <dbl> 117, 128, 64, 73, 9, 73, 236, 27, 184, 126, 83, 33, ...
	## $ r9 <dbl> 170, 213, 220, 237, 216, 123, 129, 135, 226, 83, 110...
	## $ g9 <dbl> 216, 253, 234, 234, 236, 74, 109, 123, 246, 125, 74,...
	## $ b9 <dbl> 120, 51, 59, 44, 66, 22, 73, 40, 59, 19, 12, 12, 18,...
	## $ r10 <dbl> 211, 217, 254, 251, 229, 36, 161, 254, 30, 13, 98, 2...
	## $ g10 <dbl> 254, 255, 255, 254, 255, 34, 190, 254, 40, 27, 70, 1...
	## $ b10 <dbl> 3, 21, 51, 2, 12, 37, 10, 115, 34, 25, 26, 11, 20, 2...
	## $ r11 <dbl> 212, 217, 253, 235, 235, 44, 161, 254, 34, 9, 20, 28...
	## $ g11 <dbl> 254, 255, 255, 243, 254, 42, 190, 254, 44, 23, 21, 2...
	## $ b11 <dbl> 19, 21, 44, 12, 60, 44, 6, 99, 35, 18, 20, 19, 13, 1...
	## $ r12 <dbl> 172, 158, 66, 19, 163, 197, 187, 138, 241, 85, 113, ...
	## $ g12 <dbl> 235, 225, 68, 27, 168, 114, 215, 123, 255, 128, 76, ...
	## $ b12 <dbl> 244, 237, 68, 29, 152, 21, 236, 85, 54, 21, 14, 12, ...
	## $ r13 <dbl> 172, 164, 69, 20, 124, 171, 141, 118, 205, 83, 106, ...
	## $ g13 <dbl> 235, 227, 65, 29, 117, 102, 142, 105, 229, 125, 69, ...
	## $ b13 <dbl> 244, 237, 59, 34, 91, 26, 140, 75, 46, 19, 9, 12, 13...
	## $ r14 <dbl> 172, 182, 76, 64, 188, 197, 189, 131, 226, 85, 102, ...
	## $ g14 <dbl> 228, 228, 84, 61, 205, 114, 171, 124, 246, 128, 67, ...
	## $ b14 <dbl> 235, 143, 22, 4, 78, 21, 140, 5, 59, 21, 6, 12, 13, ...
	## $ r15 <dbl> 177, 171, 82, 211, 125, 123, 214, 106, 235, 85, 106,...
	## $ g15 <dbl> 235, 228, 93, 222, 147, 74, 221, 94, 252, 128, 69, 4...
	## $ b15 <dbl> 244, 196, 17, 78, 20, 22, 201, 53, 67, 21, 9, 11, 18...
	## $ r16 <dbl> 22, 164, 58, 19, 160, 180, 188, 101, 237, 83, 43, 60...
	## $ g16 <dbl> 52, 227, 60, 27, 183, 107, 211, 91, 254, 125, 29, 45...
	## $ b16 <dbl> 53, 237, 60, 29, 187, 26, 227, 59, 53, 19, 11, 18, 1...

	signs2 <- signs[, -c(1:2)]

	# sample
	next_sign <- signs2[sample(NROW(signs2), 1), ]
	next_sign <- next_sign[, -1]

	str(next_sign)

	## Classes 'tbl_df', 'tbl' and 'data.frame': 1 obs. of 48 variables:
	## $ r1 : num 179
	## $ g1 : num 195
	## $ b1 : num 188
	## $ r2 : num 67
	## $ g2 : num 22
	## $ b2 : num 24
	## $ r3 : num 65
	## $ g3 : num 21
	## $ b3 : num 23
	## $ r4 : num 28
	## $ g4 : num 35
	## $ b4 : num 28
	## $ r5 : num 106
	## $ g5 : num 115
	## $ b5 : num 109
	## $ r6 : num 99
	## $ g6 : num 115
	## $ b6 : num 109
	## $ r7 : num 102
	## $ g7 : num 121
	## $ b7 : num 115
	## $ r8 : num 91
	## $ g8 : num 77
	## $ b8 : num 74
	## $ r9 : num 91
	## $ g9 : num 90
	## $ b9 : num 85
	## $ r10: num 107
	## $ g10: num 118
	## $ b10: num 113
	## $ r11: num 85
	## $ g11: num 78
	## $ b11: num 73
	## $ r12: num 67
	## $ g12: num 22
	## $ b12: num 24
	## $ r13: num 146
	## $ g13: num 158
	## $ b13: num 150
	## $ r14: num 68
	## $ g14: num 25
	## $ b14: num 26
	## $ r15: num 68
	## $ g15: num 25
	## $ b15: num 26
	## $ r16: num 68
	## $ g16: num 82
	## $ b16: num 76

	# labeling
	sign_types <- signs2$sign_type

	# knn training
	knn(train = signs2[-1], test = next_sign, cl = sign_types)

	# 각각 Type에 대한 갯수 요약
	table(signs2$sign_type)

	# Sign Type에 따른 r10 해당하는 평균 빨간색 레벨
	aggregate(r10 ~ sign_type, data = signs2, mean)

	## sign_type r10
	## 1 pedestrian 108.78462
	## 2 speed 83.08571
	## 3 stop 142.50704

	# get test signs
	test_signs <- signs2[sample(NROW(signs2), 59), ]

	# knn test
	signs_pred <- knn(train = signs2[-1], test = test_signs[-1], cl = sign_types)

	# Create a confusion matrix of the actual versus predicted values
	signs_actual <- test_signs$sign_type
	table(signs_actual, signs_pred)

	## signs_pred
	## signs_actual pedestrian speed stop
	## pedestrian 19 0 0
	## speed 0 24 0
	## stop 0 0 16

	signs_types <- signs2$sign_type
	signs_test <- signs2[sample(NROW(signs2), 59), ]

	signs_actual <- signs_test$sign_type

	# Compute the accuracy of the baseline model (default k = 1)
	k_1 <- knn(train = signs2[-1], test = signs_test[-1], cl = sign_types)
	mean(signs_actual == k_1)

cozyDS

분류 전체보기