Case 1: high values of X go with high values of Y; X and Y are positively correlated.
Case 2: low values of X go with low values of Y; X and Y are positively correlated.
Case 3: high values of X go with low values of Y, and vice versa; the variables are negatively correlated.
2. Key Terms
Correlation Coefficient is a metric that measures the extent to which numeric variables are associated with one another (it ranges from -1 to +1). A value of +1 means perfect positive correlation, 0 indicates no correlation, and -1 means perfect negative correlation.
To compute Pearson’s correlation coefficient, we multiply the deviations from the mean for variable 1 by those for variable 2, and divide by the product of the standard deviations.
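A minimal R sketch of this computation, using two small hypothetical vectors and checking against the built-in cor() function:

x <- c(1, 2, 3, 4, 5)   # hypothetical variable 1
y <- c(2, 4, 5, 4, 6)   # hypothetical variable 2
# sum of products of deviations, divided by (n - 1) times the product of the standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
r_manual
cor(x, y)               # the built-in function gives the same value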
Correlation Matrix is a table where the variables are shown on both rows and columns, and the cell values are the correlations between variables.
The orientation of the ellipse indicates whether two variables are positively correlated or negatively correlated.
The shading and width of the ellipse indicate the strength of the association: thinner and darker ellipses correspond to stronger relationships.
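A hedged sketch of building a correlation matrix and an ellipse-style plot of it, using the built-in mtcars data and the corrplot package (neither of which appears in the original text):

library(corrplot)
cm <- cor(mtcars[, c("mpg", "hp", "wt", "qsec")])   # correlation matrix
cm
corrplot(cm, method = "ellipse")                    # ellipse orientation and shading show sign and strength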
2.1. Other Correlation Estimates
Other correlation estimates, such as Spearman’s rho and Kendall’s tau, were proposed by statisticians long ago. They are based on the ranks of the data rather than the values themselves, which makes them robust to outliers and able to handle certain types of nonlinearities.
For exploratory analysis, however, data scientists can generally stick with Pearson’s correlation coefficient and its robust alternatives. The appeal of rank-based estimates is mostly for smaller data sets and specific hypothesis tests.
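For reference, base R’s cor() computes all three estimates; a quick sketch assuming numeric vectors x and y (for example, the hypothetical vectors above):

cor(x, y, method = "pearson")    # the default
cor(x, y, method = "spearman")   # rank-based
cor(x, y, method = "kendall")    # rank-based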
Scatterplot is a plot in which the x-axis is the value of one variable and the y-axis is the value of another.
The returns have a strong positive relationship: on most days, both stocks go up or go down in tandem. There are very few days where one stock goes down significantly while the other stock goes up (and vice versa).
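A generic sketch of such a scatterplot, assuming a data frame returns with hypothetical columns stock_x and stock_y holding the two stocks’ daily returns:

plot(stock_y ~ stock_x, data = returns,
     xlab = "Daily return of stock X", ylab = "Daily return of stock Y")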
3. Key Ideas for Correlation
The correlation coefficient measures the extent to which two variables are associated with one another.
When high values of v1 go with high values of v2, v1 and v2 are positively associated.
When high values of v1 are associated with low values of v2, v1 and v2 are negatively associated.
The correlation coefficient is a standardized metric, so it always ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation).
A correlation coefficient of 0 indicates no correlation, but be aware that random arrangements of data will produce both positive and negative values for the correlation coefficient just by chance.
Further Reading: Statistics, 4th ed., by David Freedman, Robert Pisani, and Roger Purves (W.W. Norton, 2007) has an excellent discussion of correlation.
There are three main purposes for building a model: first, to make predictions about an outcome; second, to run experiments to study relationships between variables; and third, to explore data to identify relationships among variables.
Basic Choices in Model Architecture
If the response variable is categorical (e.g., yes or no, infected or not), use rpart(). If the response variable is numerical (e.g., an unemployment rate), use lm() or rpart(): choose lm() when the relationship is gradual or proportional, and rpart() when it is dichotomous or discontinuous.
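A hedged sketch of the two choices (the data frames and variables here are hypothetical):

library(rpart)
# Categorical response: a classification tree (infected must be a factor)
tree_model <- rpart(infected ~ age + sex, data = patients)
# Numerical response with a gradual, proportional relationship: a linear model
line_model <- lm(unemployment_rate ~ year + region, data = economy)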
What are prediction errors?
When you train and test a model, you use data with values for the explanatory variables as well as the response variable.
If the model is good, then when provided with the inputs from the testing data, the outputs from the function will be “close” to the response values in the testing data. This raises a question: how do we measure “close”?
The first step is to subtract the function output from the actual response values in the testing data. The result is called the prediction error and there will be one such error for every case in the testing data. You then summarize that set of prediction errors.
The usual way to summarize the prediction errors is to calculate the mean of the squared prediction errors. Since the errors are squared, none of them are negative, so the sum reflects the magnitude of the errors rather than their sign.
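A minimal sketch of that calculation, assuming a fitted model mod, a testing data frame test_data, and a response variable net (all hypothetical names here):

preds  <- predict(mod, newdata = test_data)   # model outputs for the testing cases
errors <- test_data$net - preds               # one prediction error per case
mean(errors^2)                                # mean square error (MSE)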
As you can see from the result, adding previous as an explanatory variable changes the model outputs.
Step 2. Prediction Performance
One point: knowing that the models make different predictions doesn’t tell you which model is better. In this exercise, you’ll compare the models’ predictions to the actual values of the response variable. The term prediction error, or just error, is often used rather than difference. So, rather than speaking of the mean square difference, we’ll say mean square error.
# Build and evaluate the base model on Runners_100
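# Hedged reconstruction -- the original code for this step is not shown in this excerpt
Runners_100$bogus <- rnorm(nrow(Runners_100))                   # a meaningless random variable
base_model     <- lm(net ~ age + sex,         data = Runners_100)
expanded_model <- lm(net ~ age + sex + bogus, data = Runners_100)
with(evaluate_model(base_model,     data = Runners_100), mean((net - model_output)^2))
with(evaluate_model(expanded_model, data = Runners_100), mean((net - model_output)^2))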
Surprisingly, even though bogus is not a genuine explanatory variable, the MSE is smaller in the expanded model than in the base model (here, the null model). In-sample error goes down almost any time a variable is added, which is why cross validation is needed.
Step 4. Testing and training datasets
The code in the editor uses a style that will give you two prediction error results: one for the training cases and one for the testing cases. Your goal is to see whether there is a systematic difference between prediction accuracy on the training and on the testing cases.
Since the split is being done at random, the results will vary somewhat each time you do the calculation.
# Generate a random TRUE or FALSE for each case in Runners_100
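# Hedged sketch -- the original assignment is not shown in this excerpt
training_cases <- rnorm(nrow(Runners_100)) > 0                                   # roughly half TRUE, half FALSE
base_model <- lm(net ~ age + sex, data = subset(Runners_100, training_cases))    # fit on training cases only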
library(statisticalModeling)
Preds <- evaluate_model(base_model, data = subset(Runners_100, !training_cases))
# Calculate the MSE on the testing data
with(data = Preds, mean((net - model_output)^2))
## [1] 136.3962
Step 5. Repeating Random Trials
To simplify things, the cv_pred_error() function in the statisticalModeling package will carry out this repetitive process for you. The function does all the work of creating training and testing sets for each trial and calculating the mean square error for each trial.
The context for this exercise is to see whether the prediction error calculated from the training data is consistently different from the cross-validated prediction error. To that end, you’ll calculate the in-sample error using only the training data. Then, you’ll do the cross validation and use a t-test to see if the in-sample error is statistically different from the cross-validated error.
# The model
model <- lm(net ~ age + sex, data = Runners_100)
# Find the in-sample error (using the training data)
in_sample <- evaluate_model(model, data = Runners_100)
in_sample_error <- with(in_sample, mean((net - model_output)^2))
# Run cross-validation trials on the model
trials <- cv_pred_error(model)
# Find confidence interval on trials and compare to the in-sample error
mosaic::t.test(~ mse, mu = in_sample_error, data = trials)
##
## One Sample t-test
##
## data: mse
## t = 7.0898, df = 4, p-value = 0.00209
## alternative hypothesis: true mean is not equal to 131.5594
## 95 percent confidence interval:
## 137.6805 145.5606
## sample estimates:
## mean of x
## 141.6206
The error based on the training data is below the 95% confidence interval representing the cross-validated prediction error.
Step 6. To add or not to add (an explanatory variable)?
You’re going to use cross validation to find out whether adding a new explanatory variable improves the prediction performance of a model. Remember that models are biased to perform well on the training data; cross validation gives a fair indication of the prediction error on new data.
# The base model
base_model <- lm(net ~ age + sex, data = Runners_100)
# An augmented model adding previous as an explanatory variable
aug_model <- lm(net ~ age + sex + previous, data = Runners_100)
# Run cross validation trials on the two models
trials <- cv_pred_error(base_model, aug_model)
trials
## mse model
## 1 142.0409 base_model
## 2 136.6856 base_model
## 3 143.4253 base_model
## 4 137.9605 base_model
## 5 134.6230 base_model
## 6 143.2968 aug_model
## 7 142.2867 aug_model
## 8 143.6135 aug_model
## 9 140.3348 aug_model
## 10 145.8624 aug_model
# Compare the two sets of cross-validated errors
t.test(mse ~model, data = trials)
##
## Welch Two Sample t-test
##
## data: mse by model
## t = 2.1983, df = 6.1923, p-value = 0.06887
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4327813 8.6963517
## sample estimates:
## mean in group aug_model mean in group base_model
## 143.0788 138.9471
Notice that cross validation reveals that the augmented model makes worse predictions (larger prediction error) than the base model. Bigger is not necessarily better when it comes to modeling!
In conclusion, it is not worth adding the previous variable; just use base_model in this case.
Modeling is a process. First, a team gets an idea for a model. Second, the team designs the model. Third, with data, the team trains the model. Fourth, the team evaluates the model. Fifth, the team tests the model's performance. Sixth, the team interprets how the model challenges the team's original idea.
These steps are repeated iteratively until an optimized model is found.
What is 'training a model'? It is an automatic process carried out by the computer that "fits" the model to the data. More importantly, a model generally reflects both your choices (of variables) and the data.
library(statisticalModeling)
##
## Attaching package: 'statisticalModeling'
## The following object is masked _by_ '.GlobalEnv':
##
## Runners
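The handicap models plotted below are not defined in this excerpt; a plausible reconstruction (my assumption, consistent with the descriptions of the plots that follow) is:

handicap_model_1 <- lm(net ~ age,       data = Runners)
handicap_model_2 <- lm(net ~ sex,       data = Runners)
handicap_model_3 <- lm(net ~ age + sex, data = Runners)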
fmodel(handicap_model_1)
fmodel(handicap_model_2)
fmodel(handicap_model_3)
- Since age is quantitative, graphs of models of net versus age are continuous curves. But sex is categorical. The third plot, with both age and sex as explanatory variables, shows two continuous curves, one for each sex.
Step 3. Build a new model with rpart
What is cp? The complexity parameter (cp) allows you to dial the complexity of the model being built up or down. Think of cp as the "minimum benefit" that a split must add to the tree.
# Load rpart
library(rpart)
# Build rpart model: model_2
model_2 <- rpart(net ~ age + sex, data = Runners, cp = 0.002)
# Examine graph of model_2
fmodel(model_2, ~ age + sex)
- In the recursive partitioning architecture, the model functions have 'steps'. It seems unlikely that real people change in the specific way indicated by the model. Presumably, the real change is more gradual and steady. The architecture you choose has consequences for the kinds of relationships that can be depicted.
run_again_model <- rpart(runs_again ~ age + sex + net, data = Ran_twice, cp = 0.005)
# Visualize the model (don't change)
fmodel(run_again_model, ~ age + net, data = Ran_twice)
Previously, the response variable was net, treated as a numeric variable. In this model, the response variable is runs_again, which is categorical. Remarkably, the rpart() architecture works for both numerical and categorical responses.
There's a somewhat complicated pattern being shown by the model. Runners over age 40 with two races under their belts have about a 50% probability of running for a third time. For runners in their thirties, the model suggests that those with fast times (e.g. 60 minutes) are the most likely to run again. It also suggests that runners with intermediate times (e.g. 80 minutes) are not much less likely to run a third time. Perhaps this is because such runners are hoping for a fast time and discouraged by their intermediate time. Or, perhaps the model has over-reached, finding a pattern which just happens to show up in these data. Techniques for assessing this second possibility will be covered in later chapters of this course.
The statisticalModeling package provides an alternative to the predict() function called evaluate_model(). evaluate_model() has certain advantages, such as formatting the output as a data frame alongside the inputs, and takes two arguments: the model and a data argument containing the data frame of model inputs.
Step 6. Extrapolation
One purpose for evaluating a model is extrapolation: finding the model output for inputs that are outside the range of the data used to train the model.
Extrapolation makes sense only for quantitative explanatory variables. For example, given a variable x that ranges from 50 to 100, any value greater than 100 or smaller than 50 is an extrapolation.
# Build a model: insurance_cost_model
insurance_cost_model <- lm(Cost ~ Age + Sex + Coverage, data = AARP)
# Create a data frame: new_inputs_1
new_inputs_1 <- data.frame(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))
new_inputs_1
# Use expand.grid(): new_inputs_2
new_inputs_2 <- expand.grid(Age = c(30, 90), Sex = c("F", "M"), Coverage = c(0, 100))
# Use evaluate_model() for new_inputs_1 and new_inputs_2
evaluate_model(insurance_cost_model, data = new_inputs_1)
evaluate_model(insurance_cost_model, data = new_inputs_2)
Notice how predict() produces only the model output, not the inputs used to generate that output. evaluate_model() helps you keep track of what the inputs were. Returning to a modeling perspective for a moment… Note that the cost of a policy with zero coverage is actually negative for younger people. This kind of thing can happen when extrapolating outside the domain of the data used for training the model. In this case, you didn't have any AARP data for zero coverage. The moral of the story: beware of extrapolation.
Step 7. Typical values of data
Sometimes you want to make a very quick check of what the model output looks like for "typical" inputs. When you use evaluate_model() without the data argument, the function will use the data on which the model was trained to select some typical levels of the inputs. evaluate_model() provides a tabular display of inputs and outputs.
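For example, a quick sketch using the insurance model built earlier:

# With no data argument, evaluate_model() chooses typical levels of the inputs itself
evaluate_model(insurance_cost_model)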
The choices you make in constructing a graphic are important. Using a given variable in a different role in a plot, or omitting it, can highlight or suppress different aspects of the story.
I would like to quote from the DataCamp course: a model is a representation for a purpose. "Representation" means standing in for something in some situation.
How about the purpose? The main purpose is to use the model in a real situation. What kinds of modeling are mainly discussed?
Most Excel users are doing mathematical modeling, which constructs models out of mathematical entities such as numbers and formulas.
Statisticians are doing statistical modeling, which builds a special type of mathematical model informed by data and incorporating uncertainty and randomness. In my opinion, ML practitioners are included in this group.
Since this document is about modeling, I will assume that readers already know how to (1) import data, (2) transform data, and (3) visualize data, so I won't explain such code in detail. If you want to study importing through visualizing data, please read R for Data Science.