The manner in which six subjects performed weight lifting exercises was quantified by attaching accelerometers, speedometers and position sensors to parts of their bodies during the activity. The goal is to predict the manner in which each exercise was performed. The training set contains a “classe” variable that encodes this outcome, i.e. how the exercise was performed. The features, or predictor variables, are the measurements recorded by the sensors. We will build a model and make predictions using these features.
We begin by loading the machine-learning libraries used for the analysis:

library(caret)
library(randomForest)
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
An important first step in the data analysis process is to get a feel for how the data should be handled by inspecting the data set “manually”. A spreadsheet program, such as Excel, may be used to browse the data; this works quite well with the csv file used here. The inspection reveals a few important things that inform the cleaning steps below.
Import dataset:
rawdata <- read.table("~/Desktop/DataScience/pml-training.csv",sep=",", header = TRUE, na.strings=c("",NA))
Note: the na.strings argument is important for importing the dataset in a way that allows calculations to be performed easily. It tells read.table to treat both empty cells and the literal string “NA” as missing values, so they can be detected later with is.na().
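A minimal base-R sketch of what na.strings does, using an inline toy CSV in place of the real file (the column names here are hypothetical):

```r
# Toy CSV: row 1 has an empty cell in column y
csv <- "x,y\n1,\n2,foo"

# Default import: in a character column, an empty cell stays "" (not NA)
d1 <- read.table(text = csv, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# With na.strings = c("", NA), empty cells become proper missing values
d2 <- read.table(text = csv, sep = ",", header = TRUE, stringsAsFactors = FALSE,
                 na.strings = c("", NA))

is.na(d1$y[1])   # FALSE: "" would slip past is.na() checks
is.na(d2$y[1])   # TRUE: now detectable as missing
```

This matters in the next cleaning step, where columns are dropped based on whether they contain any NA values.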
Remove the first few “irrelevant” columns of the data frame, as determined from our initial manual inspection (column 6, new_window, is kept because it is used in the next step):

data <- rawdata[,c(-1,-2,-3,-4,-5,-7)]
Next we remove “sparse” columns, i.e. columns consisting mostly of missing values. These columns are populated only in the summary rows where new_window == “yes”, so we keep the non-window rows and then drop every column that still contains missing values:
windowindex <- data[,"new_window"]=="no"
subdata <- data[windowindex,]
naindex <- sapply(subdata,function(x) any(is.na(x)))
newdata <- subdata[,!naindex]
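The same two-step filter can be illustrated on a toy data frame (the column names below are hypothetical, chosen to mimic the real structure):

```r
# Toy frame: a summary column that is NA except in the new_window == "yes" rows
toy <- data.frame(
  new_window = c("no", "no", "yes", "no"),
  roll_belt  = c(1.2, 1.4, 1.3, 1.5),
  avg_roll   = c(NA, NA, 1.3, NA),   # populated only in the summary row
  stringsAsFactors = FALSE
)

sub  <- toy[toy$new_window == "no", ]                 # drop summary rows
keep <- !sapply(sub, function(x) any(is.na(x)))       # columns with no NAs left
clean <- sub[, keep]

names(clean)   # "new_window" "roll_belt" -- the sparse avg_roll column is gone
```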
Data visualization is an important step in data analysis. Let us use featurePlot() to explore potential correlations between a few randomly chosen features. We show how the features printed below vary with each other in a pairwise fashion:
names(newdata[,c("pitch_belt","yaw_belt","pitch_dumbbell","pitch_forearm")])
## [1] "pitch_belt" "yaw_belt" "pitch_dumbbell" "pitch_forearm"
newdata <- newdata[,-1]  # drop the new_window column; it is no longer needed
testdata <- newdata[,c("pitch_belt","yaw_belt","pitch_dumbbell","pitch_forearm")]
featurePlot(x=testdata, y=newdata$classe, plot="pairs")
Partition the data set into a training set (70%) and a cross-validation (hold-out) set (30%):
inTrain <- createDataPartition(y=newdata$classe, p=0.7, list = FALSE)
training <- newdata[inTrain,]
crossval <- newdata[-inTrain,]
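createDataPartition() draws a stratified split that preserves the class proportions of “classe”. For comparison, a plain (unstratified) 70/30 split can be sketched in base R; the sample size here is hypothetical:

```r
set.seed(42)                               # reproducibility of this sketch
n   <- 1000                                # pretend number of rows
idx <- sample(n, size = round(0.7 * n))    # 70% of row indices, without replacement

length(idx)                  # 700 rows for training
length(setdiff(seq_len(n), idx))  # 300 rows held out
```

The stratification in createDataPartition() is the reason the class prevalences in the hold-out set closely match those of the full data.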
Now let’s build a model on the training data. There are many candidate models; one often considered first is the generalized linear model, but a linear model is not appropriate for our data, which shows non-linear behavior in the feature plots above. We will instead use the popular random forest algorithm, with “classe” as the outcome. The caret package wraps random forests via its train() function, but that route is slow and is left commented out in the following snippet; we use the much faster randomForest() function directly.

# model <- train(classe ~ ., data = training, method = "rf")
model <- randomForest(classe ~ ., data = training, ntree = 100)
Now that the model is built from the training data, we can make predictions using the features in the cross-validation set. These data were not seen by the algorithm during training, so they provide a better estimate of the out-of-sample accuracy of our model.
myprediction <- predict(model, crossval)
The “myprediction” variable contains the model predictions on the cross-validation set. We can compare these predictions to the recorded values to test the accuracy of our model; the confusionMatrix() function does this:
confusionMatrix(myprediction,crossval$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1638 2 0 0 0
## B 2 1113 6 0 0
## C 0 0 997 1 0
## D 0 0 2 943 2
## E 1 0 0 0 1056
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.995, 0.998)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.996
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.998 0.998 0.992 0.999 0.998
## Specificity 1.000 0.998 1.000 0.999 1.000
## Pos Pred Value 0.999 0.993 0.999 0.996 0.999
## Neg Pred Value 0.999 1.000 0.998 1.000 1.000
## Prevalence 0.285 0.193 0.174 0.164 0.184
## Detection Rate 0.284 0.193 0.173 0.164 0.183
## Detection Prevalence 0.285 0.195 0.173 0.164 0.183
## Balanced Accuracy 0.999 0.998 0.996 0.999 0.999
We obtain a high accuracy of 99.7%! Other relevant statistics, such as the per-class sensitivity and specificity, are also shown.
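As a sanity check, the reported accuracy can be recomputed directly from the confusion matrix above: correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total.

```r
# Confusion matrix copied from the output above (rows = predicted, cols = reference)
cm <- matrix(c(1638,    2,   0,   0,    0,
                  2, 1113,   6,   0,    0,
                  0,    0, 997,   1,    0,
                  0,    0,   2, 943,    2,
                  1,    0,   0,   0, 1056),
             nrow = 5, byrow = TRUE,
             dimnames = list(pred = LETTERS[1:5], ref = LETTERS[1:5]))

accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)   # 0.997, matching the confusionMatrix() output
```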
It may be useful to know which features are the leading predictors in our model. We can examine this with the following function, which displays the most important variables ranked by the Mean Decrease Gini index: the greater the index, the more the variable contributes to the model.
varImpPlot(model)
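The Mean Decrease Gini index is built from the Gini impurity of the class proportions at each tree node; a variable's importance accumulates how much its splits reduce this impurity, averaged over all trees. A quick base-R illustration of the impurity itself (the function name here is our own):

```r
# Gini impurity: 1 - sum(p^2). It is 0 for a pure node and grows as classes mix.
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

gini(c(50, 0, 0))   # 0: node contains a single class
gini(c(25, 25))     # 0.5: evenly mixed two-class node
```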
The data used in this analysis may be found at http://groupware.les.inf.puc-rio.br/har, and the approach was inspired by Ref. [1].
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.