Chapter 9 Training and Testing Sets for Iris Data

Use training data to construct a tree.
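The chunk that produced the tree is not shown, so the following is only a sketch of how this step might look. It assumes the tree is fitted with `rpart`, that the built-in `iris` data is split 105/45 (the sizes implied by the confusion matrices in this chapter), and that the seed is arbitrary. The names `iris_train`, `iris_test`, and `iris_tree` are illustrative.

```r
library(rpart)  # assumption: rpart is the package used to fit the tree

set.seed(1)  # arbitrary seed, used only so the split is reproducible
train_idx  <- sample(nrow(iris), size = 105)  # 105 of the 150 rows for training
iris_train <- iris[train_idx, ]
iris_test  <- iris[-train_idx, ]              # the remaining 45 rows for testing

# Fit a classification tree on the training data only
iris_tree <- rpart(Species ~ ., data = iris_train, method = "class")

# Predicted classes for the training and test records
pred_train <- predict(iris_tree, newdata = iris_train, type = "class")
pred_test  <- predict(iris_tree, newdata = iris_test,  type = "class")
```

Because the split is random, the resulting counts will generally differ from the ones printed in this chapter.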

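Use the confmatrix function to calculate accuracy and error. The first result is for the training set (105 records) and the second is for the test set (45 records).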

## $matrix
##             predy
## y            setosa versicolor virginica
##   setosa         36          0         0
##   versicolor      0         32         1
##   virginica       0          3        33
## 
## $accuracy
## [1] 0.9619048
## 
## $error
## [1] 0.03809524

## $matrix
##             predy
## y            setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         17         0
##   virginica       0          2        12
## 
## $accuracy
## [1] 0.9555556
## 
## $error
## [1] 0.04444444
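The definition of `confmatrix` is not shown. A minimal sketch that would produce output in the same shape (`$matrix`, `$accuracy`, `$error`) is given here as a reconstruction from the printed results, not necessarily the original definition. The toy labels are hypothetical.

```r
# Sketch of confmatrix(), reconstructed from the printed output format
confmatrix <- function(y, predy) {
  matrix   <- table(y, predy)                  # rows: actual class, columns: predicted class
  accuracy <- sum(diag(matrix)) / sum(matrix)  # share of records on the diagonal
  list(matrix = matrix, accuracy = accuracy, error = 1 - accuracy)
}

# Toy labels (hypothetical, not the iris results)
y     <- factor(c("a", "a", "b", "b"))
predy <- factor(c("a", "b", "b", "b"), levels = levels(y))

res <- confmatrix(y, predy)
res$matrix
res$accuracy  # 3 of 4 records are on the diagonal
res$error
```

Applied to the iris output, the same arithmetic gives (36 + 32 + 33) / 105 for the training set and (14 + 17 + 12) / 45 for the test set.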

9.1 Example Data

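Import traindata.csv and testdata.csv, make sure the class variable is a factor, and do a quick exploration of each data set. The training data (900 records) is shown first, followed by the test data (2100 records).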

## [1] 900   3
##             x         y class
## 1  4.76295819  9.583156     0
## 2  9.77532792  8.282632     0
## 3  0.05409077 18.185264     0
## 4 16.48119755 16.129525     0
## 5 12.87665297 16.614146     1
## 6  8.61560780  3.930424     1

## [1] 2100    3
##            x         y class
## 1 17.9655968  3.183029     0
## 2  6.6738789  6.675273     1
## 3 10.2086342  8.073150     0
## 4  0.7836262  2.630917     0
## 5 13.1109049  7.214607     0
## 6  4.2286522 13.943382     1
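The import step might look like the sketch below. So that it runs without the files, it reads the first six training rows printed above from an inline string; with the real file the call would presumably be `read.csv("traindata.csv")`.

```r
# Stand-in for read.csv("traindata.csv"): the first six training rows shown above
csv_text <- "x,y,class
4.76295819,9.583156,0
9.77532792,8.282632,0
0.05409077,18.185264,0
16.48119755,16.129525,0
12.87665297,16.614146,1
8.61560780,3.930424,1"
traindata <- read.csv(text = csv_text)

# read.csv() reads the 0/1 labels as integers, so convert them to a factor;
# otherwise a tree function would treat the problem as regression
traindata$class <- factor(traindata$class)

dim(traindata)   # number of rows and columns
head(traindata)  # first records
str(traindata)   # confirms that class is now a factor
```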

Building Tree 1 from the Slides

## $matrix
##    predy
## y     0   1
##   0 414 125
##   1  67 294
## 
## $accuracy
## [1] 0.7866667
## 
## $error
## [1] 0.2133333

## $matrix
##    predy
## y     0   1
##   0 873 388
##   1 224 615
## 
## $accuracy
## [1] 0.7085714
## 
## $error
## [1] 0.2914286
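The two results above are for the training data (900 records) and the test data (2100 records). As a check on the arithmetic, the accuracies can be recomputed directly from the printed matrices. The helper `accuracy` is an illustrative name.

```r
# Confusion matrices for Tree 1, copied from the output above
# (matrix() fills column by column: predicted 0 first, then predicted 1)
train_mat <- matrix(c(414, 67, 125, 294), nrow = 2,
                    dimnames = list(y = c("0", "1"), predy = c("0", "1")))
test_mat  <- matrix(c(873, 224, 388, 615), nrow = 2,
                    dimnames = list(y = c("0", "1"), predy = c("0", "1")))

# Accuracy is the share of records on the diagonal
accuracy <- function(m) sum(diag(m)) / sum(m)

accuracy(train_mat)                       # 708 / 900
accuracy(test_mat)                        # 1488 / 2100
accuracy(train_mat) - accuracy(test_mat)  # gap between training and test accuracy
```

The drop from training to test accuracy is the usual sign that training accuracy overstates how well the tree generalizes.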

Number of Nodes for Tree 1

## [1] 27  9
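The output `27 9` is consistent with calling `dim()` on the `frame` component of an `rpart` fit, which has one row per node. That is an assumption, since the call itself is not shown. Under it, Tree 1 has 27 nodes, and because rpart trees are binary that would mean 14 leaves and 13 splits. The sketch below shows the idea on the built-in `iris` data rather than on Tree 1.

```r
library(rpart)  # assumption: the tree is an rpart object

fit <- rpart(Species ~ ., data = iris, method = "class")

dim(fit$frame)  # one row per node (internal nodes and leaves)

n_nodes  <- nrow(fit$frame)
n_leaves <- sum(fit$frame$var == "<leaf>")  # leaf rows are labelled "<leaf>"
c(nodes = n_nodes, leaves = n_leaves, splits = n_nodes - n_leaves)
```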

Class Breakdown for Training and Testing Data

## 
##         0         1 
## 0.5988889 0.4011111
## 
##         0         1 
## 0.6004762 0.3995238
## [1] 900
## [1] 2100
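The proportions above can be reproduced from the class counts, which are recoverable as the row sums of the Tree 1 confusion matrices (539 and 361 for training, 1261 and 839 for testing). With the real data frames the call would presumably be `prop.table(table(traindata$class))`. The sketch below starts from the counts instead.

```r
# Class counts recovered from the row sums of the Tree 1 confusion matrices
train_counts <- c("0" = 539,  "1" = 361)
test_counts  <- c("0" = 1261, "1" = 839)

prop.table(train_counts)  # class proportions in the training data
prop.table(test_counts)   # class proportions in the test data
sum(train_counts)         # number of training records
sum(test_counts)          # number of test records
```

Both sets are about 60% class 0, so a rule that always predicts class 0 would already score roughly 0.60 on either set.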

9.2 Statistical Tests for Evaluating the Model

9.2.1 Confidence Intervals for Classification Accuracy

Exact binomial test. The example test data has 2100 records, of which Tree 1 classified 1488 correctly. The 95% confidence interval for the accuracy, based on the binomial distribution, is (0.689, 0.728).

## $matrix
##    predy
## y     0   1
##   0 873 388
##   1 224 615
## 
## $accuracy
## [1] 0.7085714
## 
## $error
## [1] 0.2914286
## 
## 	Exact binomial test
## 
## data:  1488 and 2100
## number of successes = 1488, number of trials = 2100, p-value < 2.2e-16
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.6886162 0.7279429
## sample estimates:
## probability of success 
##              0.7085714
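The output above has the form produced by `binom.test` from base R, so the call can be sketched as follows. The normal-approximation interval at the end is an addition for comparison only and is not part of the original output.

```r
correct <- 1488  # correctly classified test records
n       <- 2100  # total test records

bt <- binom.test(correct, n, conf.level = 0.95)
bt$estimate  # observed accuracy
bt$conf.int  # exact (Clopper-Pearson) 95% interval

# Normal-approximation interval, for comparison
p_hat <- correct / n
p_hat + c(-1, 1) * qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)
```

The p-value in the printed output tests the default null hypothesis that the accuracy equals 0.5. The confidence interval is the quantity of interest here.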

Building Tree 2

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

## $matrix
##    predy
## y     0   1
##   0 539   0
##   1   0 361
## 
## $accuracy
## [1] 1
## 
## $error
## [1] 0

## $matrix
##    predy
## y     0   1
##   0 883 378
##   1 338 501
## 
## $accuracy
## [1] 0.6590476
## 
## $error
## [1] 0.3409524

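Building Accuracy Vectors (TRUE when a test record is classified correctly; accvector1 is for Tree 1 and accvector2 is for Tree 2)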

## accvector1
## FALSE  TRUE 
##   612  1488
## accvector1
##     FALSE      TRUE 
## 0.2914286 0.7085714
## accvector2
## FALSE  TRUE 
##   716  1384
## accvector2
##     FALSE      TRUE 
## 0.3409524 0.6590476
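An accuracy vector of this kind can be built by comparing each tree's predictions with the true labels. The sketch below uses five hypothetical records. The names `predy1` and `predy2` are assumptions, since only the names `accvector1` and `accvector2` appear in the output.

```r
# Hypothetical labels and predictions for five test records
y      <- factor(c(0, 1, 0, 0, 1))
predy1 <- factor(c(0, 1, 1, 0, 0))  # predictions from Tree 1
predy2 <- factor(c(0, 0, 1, 0, 1))  # predictions from Tree 2

# TRUE where the tree classified the record correctly
accvector1 <- predy1 == y
accvector2 <- predy2 == y

table(accvector1)              # counts of incorrect (FALSE) and correct (TRUE)
prop.table(table(accvector1))  # error rate and accuracy
mean(accvector2)               # accuracy of Tree 2, computed directly
```

Cross-tabulating the two vectors with `table(accvector1, accvector2)` gives a table in the layout that the McNemar test expects.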

McNemar Table

##           accvector2
## accvector1 FALSE TRUE
##      FALSE   438  174
##      TRUE    278 1210

Chi-square statistic and p-value

## [1] 23.47124
## [1] 1.267952e-06
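The statistic uses only the two discordant cells of the McNemar table: the 174 records that Tree 1 got wrong and Tree 2 got right, and the 278 records where the reverse holds. A sketch of the hand calculation with the continuity correction follows. The names `n01` and `n10` are illustrative.

```r
# Discordant counts from the McNemar table
n01 <- 174  # Tree 1 wrong, Tree 2 right
n10 <- 278  # Tree 1 right, Tree 2 wrong

# McNemar chi-square statistic with continuity correction
chisq <- (abs(n01 - n10) - 1)^2 / (n01 + n10)

# Upper-tail p-value from the chi-square distribution with 1 degree of freedom
pval <- pchisq(chisq, df = 1, lower.tail = FALSE)

chisq
pval
```

The small p-value indicates that the two trees differ in test accuracy, and since `n10` is larger than `n01`, the difference favors Tree 1.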

Built-in Function (mcnemar.test)

## 
## 	McNemar's Chi-squared test with continuity correction
## 
## data:  mcnemartable
## McNemar's chi-squared = 23.471, df = 1, p-value = 1.268e-06
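The same result can be reproduced by passing the 2 x 2 table to `mcnemar.test` from base R. In this sketch the table is rebuilt from the printed counts rather than from the accuracy vectors.

```r
# McNemar table rebuilt from the printed counts (matrix() fills column by column)
mcnemartable <- matrix(c(438, 278, 174, 1210), nrow = 2,
                       dimnames = list(accvector1 = c("FALSE", "TRUE"),
                                       accvector2 = c("FALSE", "TRUE")))

res <- mcnemar.test(mcnemartable)  # continuity correction is applied by default
res$statistic
res$p.value
```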

Exact McNemar Test

## 
## 	Exact McNemar test (with central confidence intervals)
## 
## data:  mcnemartable
## b = 174, c = 278, p-value = 1.144e-06
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.5148605 0.7591830
## sample estimates:
## odds ratio 
##  0.6258993
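The output above looks like that of `mcnemar.exact` from the `exact2x2` package, though the call is not shown. The exact McNemar test is equivalent to an exact binomial test on the discordant pairs, so it can be sketched in base R without that package. Treating the odds-ratio interval as a transformation of the Clopper-Pearson interval is an assumption about how the printed interval was obtained.

```r
n01 <- 174  # Tree 1 wrong, Tree 2 right (b in the output above)
n10 <- 278  # Tree 1 right, Tree 2 wrong (c in the output above)

# Under the null hypothesis each discordant pair falls in either cell with probability 0.5
bt <- binom.test(n01, n01 + n10, p = 0.5)
bt$p.value  # exact two-sided p-value

# Odds ratio estimate and interval, transformed from the binomial proportion
odds_ratio <- n01 / n10
or_ci <- bt$conf.int / (1 - bt$conf.int)
odds_ratio
or_ci
```

An odds ratio below 1 with an interval that excludes 1 agrees with the chi-square version of the test.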