Chapter 8 Tree-Based Classification and Model Validation

8.0.1 Summary of this document

  • Exploring a dataset available in R: iris data exploration

  • Decision tree algorithm: using the rpart and party (ctree) packages

  • Cross-validation and statistical hypothesis testing

8.0.2 Exploring the iris data set.

The iris dataset is preloaded in R, so no extra package is needed to use it. Its first six rows are:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
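The chunk that produced this listing is not shown. A minimal sketch of a first exploration, using only base R functions, might look like this:

```r
# iris ships with base R in the datasets package; no library() call is needed.
data(iris)

head(iris)            # first six rows, as in the listing above
str(iris)             # 150 observations: 4 numeric columns and 1 factor
summary(iris)         # per-column summaries
table(iris$Species)   # number of flowers per species
```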

Additional options in the basic R plot function.
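The original plotting code is not shown. As one possible illustration, the sketch below uses a few common arguments of the base plot function (col, pch, xlab, ylab, main) and adds a legend. The colours and symbols are arbitrary choices made for this example.

```r
# Illustrative colour and symbol per species (arbitrary choices).
cols <- c("red", "darkgreen", "blue")
pchs <- c(16, 17, 15)
idx  <- as.integer(iris$Species)   # 1 = setosa, 2 = versicolor, 3 = virginica

species_cols <- cols[idx]
species_pch  <- pchs[idx]

plot(iris$Petal.Length, iris$Petal.Width,
     col = species_cols, pch = species_pch,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)",
     main = "Iris petal measurements by species")
legend("topleft", legend = levels(iris$Species), col = cols, pch = pchs)
```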

8.1 Decision trees with the rpart package.

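A decision tree is a tree-based algorithm for classification and regression problems. It repeatedly splits the data on one predictor at a time, and each leaf of the resulting tree assigns a class (or, for regression, a numeric value).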
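A minimal sketch of fitting a classification tree to iris with rpart follows. It assumes default settings and all four measurements as predictors; the chapter's own chunk is not shown and may differ.

```r
library(rpart)  # recommended package, distributed with R

# Classification tree predicting Species from the four measurements.
fit <- rpart(Species ~ ., data = iris, method = "class")

print(fit)                           # text listing of splits and leaves
plot(fit, margin = 0.1)              # draw the tree skeleton
text(fit, use.n = TRUE, cex = 0.8)   # label splits, show class counts per leaf
```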

A confusion matrix cross-tabulates the true classes against the predicted classes and shows how well the classification algorithm performed.
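The chapter's confusion matrix function is not reproduced in the text. The helper below, with the assumed name confusion_matrix, is a sketch that returns the same three components seen in the printed output ($matrix, $accuracy, $error). It assumes y and predy are factors with identical levels, so that the table is square.

```r
library(rpart)

# Hypothetical helper (name and body are assumptions, not the original code):
# cross-tabulate true vs predicted labels, then derive accuracy as the share
# of counts on the diagonal and error as 1 - accuracy.
confusion_matrix <- function(y, predy) {
  m <- table(y, predy)
  accuracy <- sum(diag(m)) / sum(m)
  list(matrix = m, accuracy = accuracy, error = 1 - accuracy)
}

fit   <- rpart(Species ~ ., data = iris, method = "class")
predy <- predict(fit, newdata = iris, type = "class")
res   <- confusion_matrix(iris$Species, predy)
res
```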

A second look at the iris scatterplot.
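One way to take that second look is to redraw the petal scatterplot and circle the flowers that a fitted tree misclassifies. This is a sketch under the assumption of a default rpart fit; the colours and marker sizes are arbitrary.

```r
library(rpart)

fit   <- rpart(Species ~ ., data = iris, method = "class")
pred  <- predict(fit, newdata = iris, type = "class")
wrong <- pred != iris$Species   # TRUE where the tree disagrees with the label

plot(iris$Petal.Length, iris$Petal.Width,
     col = as.integer(iris$Species) + 1, pch = 16,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
# Circle the misclassified flowers.
points(iris$Petal.Length[wrong], iris$Petal.Width[wrong], cex = 2, lwd = 2)
legend("topleft", legend = levels(iris$Species), col = 2:4, pch = 16)
```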

Accuracy for rpart tree.

The accuracy of a model is measured by comparing the true output values with the predicted values: it is the proportion of observations that are classified correctly.

The confusion matrix function returns the matrix together with the accuracy and error rate of the decision tree prediction.

## [1] 0.96
## $matrix
##             predy
## y            setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          5        45
## 
## $accuracy
## [1] 0.96
## 
## $error
## [1] 0.04
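The reported accuracy can be checked by hand from the printed matrix: the diagonal holds the correctly classified flowers. The sketch below re-enters the counts shown above and repeats the arithmetic.

```r
# Counts copied from the printed confusion matrix (rows = true, cols = predicted).
species <- c("setosa", "versicolor", "virginica")
m <- matrix(c(50,  0,  0,
               0, 49,  1,
               0,  5, 45),
            nrow = 3, byrow = TRUE,
            dimnames = list(y = species, predy = species))

correct  <- sum(diag(m))      # 50 + 49 + 45 = 144
total    <- sum(m)            # 150
accuracy <- correct / total   # 144 / 150 = 0.96
error    <- 1 - accuracy      # 0.04
c(accuracy = accuracy, error = error)
```

Because the predictions are made on the same rows that were used to fit the tree, this is a training accuracy and tends to be optimistic for new data.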

The party package.

A simple plot of a decision tree fitted with ctree.
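A minimal sketch of such a fit follows, assuming default settings and all four measurements as predictors. The party package is not part of base R, so the example runs only when it is installed (install.packages("party") otherwise).

```r
# Guarded so the script still runs when party is not installed.
if (requireNamespace("party", quietly = TRUE)) {
  library(party)

  ct <- ctree(Species ~ ., data = iris)   # conditional inference tree
  plot(ct)                                # simple plot of the fitted tree

  predy <- predict(ct)                    # predicted class for each training row
  print(table(y = iris$Species, predy))   # true vs predicted classes
}
```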

## $matrix
##             predy
## y            setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         1
##   virginica       0          5        45
## 
## $accuracy
## [1] 0.96
## 
## $error
## [1] 0.04

Controlling the depth of the tree.
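The code for this step is not shown, so the sketch below is an assumption about the approach. It limits depth in rpart through rpart.control(maxdepth = ...); the party package offers a comparable maxdepth argument in ctree_control. The depth value of 1 is chosen only for illustration.

```r
library(rpart)

# Restrict the tree to a single split below the root (maxdepth = 1).
shallow <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(maxdepth = 1))

print(shallow)
n_leaves <- sum(shallow$frame$var == "<leaf>")   # count terminal nodes
n_leaves

# Accuracy on the training rows drops because one split cannot
# separate three species.
mean(predict(shallow, newdata = iris, type = "class") == iris$Species)
```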