Prediction Using Random Forests and Boosting, by Chris Hammond.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Your submission for the Peer Review portion should consist of a link to a Github repo with your R markdown and compiled HTML file describing your analysis. Please constrain the text of the writeup to fewer than 2000 words and the number of figures to fewer than 5. It will make it easier for the graders if you submit a repo with a gh-pages branch so the HTML page can be viewed online (and you always want to make it easy on graders :-).
Apply your machine learning algorithm to the 20 test cases available in the test data above and submit your predictions in appropriate format to the Course Project Prediction Quiz for automated grading.
Due to security concerns with the exchange of R code, your code will not be run during the evaluation by your classmates. Please be sure that if they download the repo, they will be able to view the compiled HTML version of your analysis.
Load the necessary libraries and set the seed for reproducibility.
library(data.table) # Use install.packages() to download libraries from CRAN
library(caret)
library(rpart)
library(rattle)
set.seed(1337)
Enable and set up parallel processing for caret to increase calculation speed on complex models. The resampling method will be set to cross-validation for a good balance of performance and accuracy.
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # Convention to leave 1 core for OS
Load the data.
dataUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv" # Load data from the web
quizUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
data <- fread(dataUrl, na.strings=c("NA","#DIV/0!",""))
quiz <- fread(quizUrl, na.strings=c("NA","#DIV/0!",""))
# data <- fread("pml-training.csv", na.strings=c("NA","#DIV/0!","")) # Load data from Working Directory
# quiz <- fread("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
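If a local copy is preferred, the files can be cached first. A minimal sketch, not part of the original analysis (the destination file names are assumptions):
# Download the CSVs once and reuse the local copies on later runs
if (!file.exists("pml-training.csv")) download.file(dataUrl, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(quizUrl, "pml-testing.csv")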
Clean the data and select useful features. Remove the irrelevant row id variable. Remove variables that contain NAs. Remove variables with near-zero variance.
# Use names(data) to view the variables in the dataset
data <- data[,-c(1), with=FALSE]; quiz <- quiz[,-c(1), with=FALSE] # Remove the row id variable
data$user_name <- as.factor(data$user_name);quiz$user_name <- as.factor(quiz$user_name) # Change user to factor
data$classe <- as.factor(data$classe) # Change outcome to factor
realVars <- colSums(is.na(data)) == 0 # Make a vector of the variables without NAs
data <- data[,realVars, with=FALSE]; quiz <- quiz[,realVars, with=FALSE];
nzv <- nearZeroVar(data) # Remove near zero variance variables
data <- data[,-nzv, with=FALSE]; quiz <- quiz[,-nzv, with=FALSE]
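As a quick sanity check (an illustrative addition, not part of the original analysis), confirm that no NAs remain and that the training and quiz sets still line up:
stopifnot(!anyNA(data)) # All remaining variables should be complete
setdiff(names(data), names(quiz)) # Should return only "classe" (the quiz set has problem_id in its place)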
Partition the training set to create a training set (60%), validation set (20%) and test set (20%). The size of the dataset allows the use of all three sets. The training set will be used to train the models, the validation set will be used to select the model, and the test set will be used to assess the expected out-of-sample error.
inTrain <- createDataPartition(data$classe, p=0.6, list=FALSE)
training <- data[inTrain, ]; notTraining <- data[-inTrain, ]
inValidation <- createDataPartition(notTraining$classe, p=0.5, list=FALSE)
validation <- notTraining[inValidation, ]; testing <- notTraining[-inValidation, ]
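Because createDataPartition samples within each class, the classe distribution should be nearly identical across the three sets; an illustrative check:
# Compare class proportions across the three partitions
round(rbind(training = prop.table(table(training$classe)),
            validation = prop.table(table(validation$classe)),
            testing = prop.table(table(testing$classe))), 3)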
Three different model fits will be attempted: decision tree, random forest and generalized boosted regression. The results of the models will be stored in a matrix for comparison later.
models <- matrix(nrow=3, ncol=2) # Rows: models, Columns: accuracy on validation set and elapsed running time
colnames(models) <- c("Accuracy", "Elapsed"); rownames(models) <- c("dt","rf","gbm");
First, fit a decision tree and review the model. The decision tree was chosen as the first model because of its speed and simplicity.
registerDoParallel(cluster) # Start cluster for parallel processing
dtControl <- trainControl(method="none", allowParallel=TRUE) # Turn off resampling (increases speed)
dtTune <- data.frame(cp=0.01) # Only split if lack of fit is greater than 0.01 (increases accuracy)
models[1,2] <- system.time(dtModel <- train(classe ~ ., data=training, method="rpart", trControl=dtControl,
tuneGrid=dtTune))[3]
dtPrediction <- predict(dtModel, validation); dtModel
## CART
##
## 11776 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: None
fancyRpartPlot(dtModel$finalModel)
# Create confusion matrix, save model accuracy and print results
cm <- confusionMatrix(dtPrediction, validation$classe); models[1,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1063 0 0 0 13
## B 14 728 30 23 2
## C 4 19 642 15 5
## D 9 5 7 556 106
## E 26 7 5 49 595
##
## Overall Statistics
##
## Accuracy : 0.9136
## 95% CI : (0.9044, 0.9222)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8909
## Mcnemar's Test P-Value : 1.802e-11
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9525 0.9592 0.9386 0.8647 0.8252
## Specificity 0.9954 0.9782 0.9867 0.9613 0.9728
## Pos Pred Value 0.9879 0.9134 0.9372 0.8141 0.8724
## Neg Pred Value 0.9814 0.9901 0.9870 0.9731 0.9611
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2710 0.1856 0.1637 0.1417 0.1517
## Detection Prevalence 0.2743 0.2032 0.1746 0.1741 0.1738
## Balanced Accuracy 0.9739 0.9687 0.9627 0.9130 0.8990
The accuracy on the validation set was 0.9136. The plot reveals that the raw timestamp part 1 and window number were key features for the model. The belt roll, forearm pitch and dumbbell magnet y and z axes were also important features.
Next, fit a random forest. The random forest was chosen as the second model due to its high degree of accuracy. Random forests are typically among the best-performing models for most machine learning problems. K-fold cross-validation with 10 folds will be used for resampling.
fitControl <- trainControl(method="cv", number=10, allowParallel=TRUE) # Set resampling method to 10-fold cv
models[2,2] <- system.time(rfModel <- train(classe ~ ., data=training, method="rf", trControl=fitControl))[3]
rfPrediction <- predict(rfModel, validation); rfModel
## Random Forest
##
## 11776 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10598, 10597, 10599, 10598, 10600, 10598, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9867520 0.9832385
## 40 0.9987260 0.9983885
## 79 0.9986409 0.9982809
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 40.
plot(rfModel)
cm <- confusionMatrix(rfPrediction, validation$classe); models[2,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 0 759 2 0 0
## C 0 0 682 0 0
## D 0 0 0 643 0
## E 0 0 0 0 721
##
## Overall Statistics
##
## Accuracy : 0.9995
## 95% CI : (0.9982, 0.9999)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9994
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 0.9971 1.0000 1.0000
## Specificity 1.0000 0.9994 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 0.9974 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 0.9994 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1935 0.1738 0.1639 0.1838
## Detection Prevalence 0.2845 0.1940 0.1738 0.1639 0.1838
## Balanced Accuracy 1.0000 0.9997 0.9985 1.0000 1.0000
As expected, the random forest yields much higher accuracy than the simple decision tree model. Accuracy was 0.9995 on the validation set.
Finally, fit a generalized boosted regression model. It uses trees for boosting and was chosen as the third model due to the high accuracy typically associated with boosting and the speed associated with trees. (Passing verbose=FALSE to train would suppress the iteration log shown below.)
models[3,2] <- system.time(gbmModel <- train(classe ~ ., data=training, method="gbm", trControl=fitControl))[3]
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2446
## 2 1.4511 nan 0.1000 0.1788
## 3 1.3366 nan 0.1000 0.1462
## 4 1.2426 nan 0.1000 0.1212
## 5 1.1651 nan 0.1000 0.1170
## 6 1.0910 nan 0.1000 0.0973
## 7 1.0303 nan 0.1000 0.0753
## 8 0.9815 nan 0.1000 0.0732
## 9 0.9349 nan 0.1000 0.0706
## 10 0.8904 nan 0.1000 0.0626
## 20 0.5842 nan 0.1000 0.0319
## 40 0.2930 nan 0.1000 0.0149
## 60 0.1634 nan 0.1000 0.0071
## 80 0.1012 nan 0.1000 0.0040
## 100 0.0670 nan 0.1000 0.0014
## 120 0.0473 nan 0.1000 0.0009
## 140 0.0348 nan 0.1000 0.0008
## 150 0.0300 nan 0.1000 0.0004
gbmPrediction <- predict(gbmModel, validation); gbmModel
## Stochastic Gradient Boosting
##
## 11776 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10600, 10598, 10598, 10599, 10597, 10597, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8392505 0.7960672
## 1 100 0.8991171 0.8722335
## 1 150 0.9264608 0.9068498
## 2 50 0.9589849 0.9480706
## 2 100 0.9856505 0.9818476
## 2 150 0.9913397 0.9890453
## 3 50 0.9821673 0.9774396
## 3 100 0.9937165 0.9920523
## 3 150 0.9960946 0.9950602
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
plot(gbmModel)
cm <- confusionMatrix(gbmPrediction, validation$classe); models[3,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 2 0 0 0
## B 0 757 1 0 0
## C 0 0 678 0 0
## D 0 0 5 643 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9977
## 95% CI : (0.9956, 0.999)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9971
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9974 0.9912 1.0000 0.9986
## Specificity 0.9993 0.9997 1.0000 0.9982 1.0000
## Pos Pred Value 0.9982 0.9987 1.0000 0.9908 1.0000
## Neg Pred Value 1.0000 0.9994 0.9982 1.0000 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1930 0.1728 0.1639 0.1835
## Detection Prevalence 0.2850 0.1932 0.1728 0.1654 0.1835
## Balanced Accuracy 0.9996 0.9985 0.9956 0.9991 0.9993
stopCluster(cluster) # Stop the parallel processing cluster
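After stopping the cluster, it is good practice to return foreach to a sequential backend; a one-line addition (registerDoSEQ comes from the foreach package loaded by doParallel):
registerDoSEQ() # Restore sequential processing so later train() calls do not look for the stopped cluster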
At 0.9977 on the validation set, the generalized boosted regression model is more accurate than the decision tree but not quite as accurate as the random forest.
Compare the models using the matrix of validation accuracy and elapsed run time in seconds.
models
## Accuracy Elapsed
## dt 0.9135865 2.99
## rf 0.9994902 557.83
## gbm 0.9977058 180.10
Each model has an advantage. The decision tree model is by far the fastest, but trades speed for lower accuracy. The generalized boosted regression model strikes a balance between performance and accuracy. The random forest model has a slight edge in accuracy over the generalized boosted model and is selected as the final model for this project. View details of the model.
varImp(rfModel) # Display important variables identified by the model
## rf variable importance
##
## only 20 most important variables shown (out of 79)
##
## Overall
## raw_timestamp_part_1 100.000
## num_window 47.427
## roll_belt 45.027
## pitch_forearm 27.800
## magnet_dumbbell_z 19.929
## magnet_dumbbell_y 17.122
## pitch_belt 13.999
## yaw_belt 13.903
## roll_forearm 13.599
## cvtd_timestamp30/11/2011 17:12 10.705
## cvtd_timestamp02/12/2011 14:58 9.576
## accel_belt_z 8.348
## magnet_dumbbell_x 7.550
## cvtd_timestamp28/11/2011 14:15 7.516
## cvtd_timestamp02/12/2011 13:33 7.002
## accel_dumbbell_y 6.251
## roll_dumbbell 6.143
## cvtd_timestamp05/12/2011 11:24 5.941
## accel_forearm_x 5.470
## magnet_belt_y 5.254
rfModel$finalModel # Display details of final model
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 40
##
## OOB estimate of error rate: 0.09%
## Confusion matrix:
## A B C D E class.error
## A 3348 0 0 0 0 0.0000000000
## B 3 2276 0 0 0 0.0013163668
## C 0 2 2051 1 0 0.0014605648
## D 0 0 2 1927 1 0.0015544041
## E 0 0 0 2 2163 0.0009237875
Much like in the decision tree, key features include raw timestamp part 1, window number, belt roll, forearm pitch and the dumbbell magnet y and z axes. Evaluate the final model on the test set to estimate out-of-sample error.
testPrediction <- predict(rfModel, testing)
confusionMatrix(testPrediction, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 1 0 0 0
## B 0 758 3 0 0
## C 0 0 681 2 0
## D 0 0 0 641 2
## E 0 0 0 0 719
##
## Overall Statistics
##
## Accuracy : 0.998
## 95% CI : (0.996, 0.9991)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9974
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9987 0.9956 0.9969 0.9972
## Specificity 0.9996 0.9991 0.9994 0.9994 1.0000
## Pos Pred Value 0.9991 0.9961 0.9971 0.9969 1.0000
## Neg Pred Value 1.0000 0.9997 0.9991 0.9994 0.9994
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1932 0.1736 0.1634 0.1833
## Detection Prevalence 0.2847 0.1940 0.1741 0.1639 0.1833
## Balanced Accuracy 0.9998 0.9989 0.9975 0.9981 0.9986
The model achieves an accuracy of 0.998 on the test set. Since the test set was not used in any way during model training, the out-of-sample error is estimated to be 1 - 0.998 = 0.002.
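The same estimate can be computed directly from the confusion matrix object; a one-line check:
testCM <- confusionMatrix(testPrediction, testing$classe)
1 - unname(testCM$overall['Accuracy']) # Estimated out-of-sample error, approximately 0.002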
Finally, predict the 20 test cases for the quiz.
quizPredictions <- predict(rfModel, quiz)
quizPredictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These predictions achieved 100% accuracy on the quiz.
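For convenience when entering answers, the predictions can be paired with their quiz problem ids; a small sketch (assuming the problem_id column survived the cleaning steps, since it occupies the same position as classe):
data.frame(problem_id = quiz$problem_id, prediction = quizPredictions)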
The random forest model was selected due to its high accuracy. The generalized boosted regression model would make a strong candidate in situations where faster performance is needed without much loss of accuracy. The decision tree is surprisingly accurate for such a simple model and very fast. All three models created in this project are most likely overfit to the data, particularly concerning the user name and time reference features, and the true out-of-sample error would likely be higher if applied to a dataset with new users.
The controls for the models could potentially be tuned further to improve results marginally. The models may generalize better to real-world data if the user name and time reference features were removed (see the sketch below). The dataset could be improved by collecting information about the height, weight and other physical characteristics of the user, as is commonly recorded on modern exercise measurement devices like Fitbit.
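A minimal sketch of that feature removal, assuming the standard column names in this dataset (leakyVars and sensorData are illustrative names):
# Drop identity and time-reference columns so a refit must rely on sensor measurements alone
leakyVars <- grep("user_name|timestamp|window", names(data), value = TRUE)
sensorData <- data[, setdiff(names(data), leakyVars), with = FALSE]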