Exercise Prediction Project

Prediction using random forests and boosting by Chris Hammond.


Project Instructions

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.

What you should submit

The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Peer Review Portion

Your submission for the Peer Review portion should consist of a link to a Github repo with your R markdown and compiled HTML file describing your analysis. Please constrain the text of the writeup to < 2000 words and the number of figures to be less than 5. It will make it easier for the graders if you submit a repo with a gh-pages branch so the HTML page can be viewed online (and you always want to make it easy on graders :-).

Course Project Prediction Quiz Portion

Apply your machine learning algorithm to the 20 test cases available in the test data above and submit your predictions in appropriate format to the Course Project Prediction Quiz for automated grading.

Reproducibility

Due to security concerns with the exchange of R code, your code will not be run during the evaluation by your classmates. Please be sure that if they download the repo, they will be able to view the compiled HTML version of your analysis.


Setup & Data Wrangling

Load the necessary libraries and set the seed for reproducibility.

library(data.table) # Use install.packages() to download libraries from CRAN
library(caret)
library(rpart)
library(rattle)
set.seed(1337)

Enable and set up parallel processing for caret to speed up training of the more complex models. Cross-validation will later be used as the resampling method, for a good balance of performance and accuracy.

library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1) # Convention to leave 1 core for OS

Load the data.

dataUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv" # Load data from the web
quizUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
data <- fread(dataUrl, na.strings=c("NA","#DIV/0!",""))
quiz <- fread(quizUrl, na.strings=c("NA","#DIV/0!","")) 
# data <- fread("pml-training.csv", na.strings=c("NA","#DIV/0!","")) # Load data from Working Directory
# quiz <- fread("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))

Clean the data and select useful features. Remove the irrelevant row-index variable. Remove variables containing NAs in the training data. Remove variables with near zero variance.

# Use names(data) to view the variables in the dataset
data <- data[,-c(1), with=FALSE]; quiz <- quiz[,-c(1), with=FALSE] # Remove the row-index column
data$user_name <- as.factor(data$user_name);quiz$user_name <- as.factor(quiz$user_name) # Change user to factor
data$classe <- as.factor(data$classe) # Change outcome to factor

realVars <- colSums(is.na(data)) == 0 # Make a vector of the variables without NAs
data <- data[,realVars, with=FALSE]; quiz <- quiz[,realVars, with=FALSE];

nzv <- nearZeroVar(data) # Identify near-zero-variance variables
data <- data[,-nzv, with=FALSE]; quiz <- quiz[,-nzv, with=FALSE]

Partition the training set to create a training set (60%), a validation set (20%), and a test set (20%). The size of the dataset allows the use of all three sets: the training set will be used to train the models, the validation set to select the model, and the test set to assess the expected out-of-sample error.

inTrain <- createDataPartition(data$classe, p=0.6, list=FALSE)
training <- data[inTrain, ]; notTraining <- data[-inTrain, ]
inValidation <- createDataPartition(notTraining$classe, p=0.5, list=FALSE)
validation <- notTraining[inValidation, ]; testing <- notTraining[-inValidation, ]
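
A quick sanity check on the split (the exact row counts depend on the random seed):

sapply(list(training=training, validation=validation, testing=testing), nrow) # Roughly a 60/20/20 split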

Machine Learning Model Fits

Three different model fits will be attempted: decision tree, random forest and generalized boosted regression. The results of the models will be stored in a matrix for comparison later.

models <- matrix(nrow=3, ncol=2) # Rows: models, Columns: accuracy on validation set and elapsed running time
colnames(models) <- c("Accuracy", "Elapsed"); rownames(models) <- c("dt","rf","gbm");

Decision Tree

First, fit a decision tree and review the model. The decision tree was chosen as the first model because of its speed and simplicity.

registerDoParallel(cluster) # Start cluster for parallel processing
dtControl <- trainControl(method="none", allowParallel=TRUE) # Turn off resampling (increases speed)
dtTune <- data.frame(cp=0.01) # Complexity parameter: only split when the fit improves by at least 0.01 (increases accuracy)
models[1,2] <- system.time(dtModel <- train(classe ~ ., data=training, method="rpart", trControl=dtControl, 
                                            tuneGrid=dtTune))[3]
dtPrediction <- predict(dtModel, validation); dtModel
## CART 
## 
## 11776 samples
##    57 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: None
fancyRpartPlot(dtModel$finalModel)

# Create confusion matrix, save model accuracy and print results
cm <- confusionMatrix(dtPrediction, validation$classe); models[1,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1063    0    0    0   13
##          B   14  728   30   23    2
##          C    4   19  642   15    5
##          D    9    5    7  556  106
##          E   26    7    5   49  595
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9136          
##                  95% CI : (0.9044, 0.9222)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8909          
##  Mcnemar's Test P-Value : 1.802e-11       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9525   0.9592   0.9386   0.8647   0.8252
## Specificity            0.9954   0.9782   0.9867   0.9613   0.9728
## Pos Pred Value         0.9879   0.9134   0.9372   0.8141   0.8724
## Neg Pred Value         0.9814   0.9901   0.9870   0.9731   0.9611
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2710   0.1856   0.1637   0.1417   0.1517
## Detection Prevalence   0.2743   0.2032   0.1746   0.1741   0.1738
## Balanced Accuracy      0.9739   0.9687   0.9627   0.9130   0.8990

The accuracy on the validation set was 0.9136. The plot reveals that raw timestamp part 1 and the window number were key features for the model. The belt roll, forearm pitch, and dumbbell magnet y and z axes were also important features.

Random Forest

Next, fit a random forest. The random forest was chosen as the second model due to its high accuracy; random forests are typically among the best-performing models for classification problems like this one. K-fold cross-validation with 10 folds will be used for resampling.

fitControl <- trainControl(method="cv", number=10, allowParallel=TRUE) # Set resampling method to 10-fold cv
models[2,2] <- system.time(rfModel <- train(classe ~ ., data=training, method="rf", trControl=fitControl))[3] 
rfPrediction <- predict(rfModel, validation); rfModel
## Random Forest 
## 
## 11776 samples
##    57 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10598, 10597, 10599, 10598, 10600, 10598, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9867520  0.9832385
##   40    0.9987260  0.9983885
##   79    0.9986409  0.9982809
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 40.
plot(rfModel)

cm <- confusionMatrix(rfPrediction, validation$classe); models[2,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    0  759    2    0    0
##          C    0    0  682    0    0
##          D    0    0    0  643    0
##          E    0    0    0    0  721
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9995          
##                  95% CI : (0.9982, 0.9999)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9994          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   0.9971   1.0000   1.0000
## Specificity            1.0000   0.9994   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   0.9974   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   0.9994   1.0000   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1935   0.1738   0.1639   0.1838
## Detection Prevalence   0.2845   0.1940   0.1738   0.1639   0.1838
## Balanced Accuracy      1.0000   0.9997   0.9985   1.0000   1.0000

As expected, the random forest yields much higher accuracy than the simple decision tree model. Accuracy was 0.9995 on the validation set.

Generalized Boosted Regression

Finally, fit a generalized boosted regression model. This model uses trees as its base learners and was chosen as the final candidate because of the high accuracy typically associated with boosting and the speed associated with trees.

models[3,2] <- system.time(gbmModel <- train(classe ~ ., data=training, method="gbm", trControl=fitControl))[3]
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094             nan     0.1000    0.2446
##      2        1.4511             nan     0.1000    0.1788
##      3        1.3366             nan     0.1000    0.1462
##      4        1.2426             nan     0.1000    0.1212
##      5        1.1651             nan     0.1000    0.1170
##      6        1.0910             nan     0.1000    0.0973
##      7        1.0303             nan     0.1000    0.0753
##      8        0.9815             nan     0.1000    0.0732
##      9        0.9349             nan     0.1000    0.0706
##     10        0.8904             nan     0.1000    0.0626
##     20        0.5842             nan     0.1000    0.0319
##     40        0.2930             nan     0.1000    0.0149
##     60        0.1634             nan     0.1000    0.0071
##     80        0.1012             nan     0.1000    0.0040
##    100        0.0670             nan     0.1000    0.0014
##    120        0.0473             nan     0.1000    0.0009
##    140        0.0348             nan     0.1000    0.0008
##    150        0.0300             nan     0.1000    0.0004
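
As a side note, the iteration log above is printed by the underlying gbm fit. It can be suppressed by passing verbose=FALSE through train(), which forwards the argument to gbm; a quieter variant of the call above:

# models[3,2] <- system.time(gbmModel <- train(classe ~ ., data=training, method="gbm",
#                                              trControl=fitControl, verbose=FALSE))[3]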
gbmPrediction <- predict(gbmModel, validation); gbmModel
## Stochastic Gradient Boosting 
## 
## 11776 samples
##    57 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10600, 10598, 10598, 10599, 10597, 10597, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.8392505  0.7960672
##   1                  100      0.8991171  0.8722335
##   1                  150      0.9264608  0.9068498
##   2                   50      0.9589849  0.9480706
##   2                  100      0.9856505  0.9818476
##   2                  150      0.9913397  0.9890453
##   3                   50      0.9821673  0.9774396
##   3                  100      0.9937165  0.9920523
##   3                  150      0.9960946  0.9950602
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
plot(gbmModel)

cm <- confusionMatrix(gbmPrediction, validation$classe); models[3,1] <- cm$overall['Accuracy']; cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    2    0    0    0
##          B    0  757    1    0    0
##          C    0    0  678    0    0
##          D    0    0    5  643    1
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9977         
##                  95% CI : (0.9956, 0.999)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9971         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   0.9912   1.0000   0.9986
## Specificity            0.9993   0.9997   1.0000   0.9982   1.0000
## Pos Pred Value         0.9982   0.9987   1.0000   0.9908   1.0000
## Neg Pred Value         1.0000   0.9994   0.9982   1.0000   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1930   0.1728   0.1639   0.1835
## Detection Prevalence   0.2850   0.1932   0.1728   0.1654   0.1835
## Balanced Accuracy      0.9996   0.9985   0.9956   0.9991   0.9993
stopCluster(cluster) # Stop the parallel processing cluster

At 0.9977, the generalized boosted regression model is more accurate than the decision tree but not quite as accurate as the random forest.

Model Selection and Out-of-sample Error

Compare the models using the matrix of accuracy and run time.

models
##      Accuracy Elapsed
## dt  0.9135865    2.99
## rf  0.9994902  557.83
## gbm 0.9977058  180.10

Each model has an advantage. The decision tree model is by far the fastest, but trades this for lower accuracy. The generalized boosted regression model strikes a balance between performance and accuracy. The random forest model has a slight edge in accuracy over the generalized boosted model and is selected as the final model for this project. View details of the model.

varImp(rfModel) # Display important variables identified by the model
## rf variable importance
## 
##   only 20 most important variables shown (out of 79)
## 
##                                Overall
## raw_timestamp_part_1           100.000
## num_window                      47.427
## roll_belt                       45.027
## pitch_forearm                   27.800
## magnet_dumbbell_z               19.929
## magnet_dumbbell_y               17.122
## pitch_belt                      13.999
## yaw_belt                        13.903
## roll_forearm                    13.599
## cvtd_timestamp30/11/2011 17:12  10.705
## cvtd_timestamp02/12/2011 14:58   9.576
## accel_belt_z                     8.348
## magnet_dumbbell_x                7.550
## cvtd_timestamp28/11/2011 14:15   7.516
## cvtd_timestamp02/12/2011 13:33   7.002
## accel_dumbbell_y                 6.251
## roll_dumbbell                    6.143
## cvtd_timestamp05/12/2011 11:24   5.941
## accel_forearm_x                  5.470
## magnet_belt_y                    5.254
rfModel$finalModel # Display details of final model
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 40
## 
##         OOB estimate of  error rate: 0.09%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3348    0    0    0    0 0.0000000000
## B    3 2276    0    0    0 0.0013163668
## C    0    2 2051    1    0 0.0014605648
## D    0    0    2 1927    1 0.0015544041
## E    0    0    0    2 2163 0.0009237875

As with the decision tree, key features include raw timestamp part 1, the window number, belt roll, forearm pitch, and the dumbbell magnet y and z axes. Evaluate the final model on the test set to estimate the out-of-sample error.

testPrediction <- predict(rfModel, testing)
confusionMatrix(testPrediction, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    1    0    0    0
##          B    0  758    3    0    0
##          C    0    0  681    2    0
##          D    0    0    0  641    2
##          E    0    0    0    0  719
## 
## Overall Statistics
##                                          
##                Accuracy : 0.998          
##                  95% CI : (0.996, 0.9991)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9974         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9987   0.9956   0.9969   0.9972
## Specificity            0.9996   0.9991   0.9994   0.9994   1.0000
## Pos Pred Value         0.9991   0.9961   0.9971   0.9969   1.0000
## Neg Pred Value         1.0000   0.9997   0.9991   0.9994   0.9994
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1932   0.1736   0.1634   0.1833
## Detection Prevalence   0.2847   0.1940   0.1741   0.1639   0.1833
## Balanced Accuracy      0.9998   0.9989   0.9975   0.9981   0.9986

The model achieves an accuracy of 0.998 on the test set. Since the test set was not used in any way during the model training, out-of-sample error is estimated to be 0.002.
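
The same point estimate can be computed directly from the confusion matrix; a minimal sketch, assuming the confusionMatrix() result above is saved as testCM:

testCM <- confusionMatrix(testPrediction, testing$classe)
1 - unname(testCM$overall['Accuracy']) # Estimated out-of-sample error, approximately 0.002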

Prediction Quiz

Finally, predict the 20 test cases for the quiz.

quizPredictions <- predict(rfModel, quiz)
quizPredictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

The predictions achieved 100% accuracy on the quiz.
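
If one answer file per test case is needed for submission, a small helper along these lines can write them out (a sketch; the file naming is an assumption, not a grader requirement):

writeQuizFiles <- function(predictions) {
  for (i in seq_along(predictions)) { # Write one file per prediction: problem_id_1.txt, problem_id_2.txt, ...
    writeLines(as.character(predictions[i]), con=paste0("problem_id_", i, ".txt"))
  }
}
# writeQuizFiles(quizPredictions)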

Conclusion

Summary

The random forest model was selected due to its high accuracy. The generalized boosted regression model would be a strong candidate in situations where faster training is needed without much loss of accuracy. The decision tree is surprisingly accurate for such a simple model, and very fast. All three models are most likely overfit to this data, particularly through the user name and time-reference features, and the true out-of-sample error would likely be higher on data from new users.

Improvements

The model controls could potentially be tuned further to improve results marginally. The models may generalize better to real-world data if the user name and time-reference features were removed (a sketch of this follows below). The dataset could also be improved by collecting the height, weight, and other physical characteristics of each user, as is commonly recorded by modern exercise-tracking devices like Fitbit.
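
A rough sketch of dropping the identity and time-reference columns before refitting (the column names are taken from the variable-importance output above; new_window may already be gone after the near-zero-variance filter):

idTimeVars <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
                "cvtd_timestamp", "new_window", "num_window")
sensorData <- data[, setdiff(names(data), idTimeVars), with=FALSE] # Keep only sensor-derived features
# Refit as before, e.g. train(classe ~ ., data=sensorData, method="rf", trControl=fitControl)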