Chapter 3 Classification

3.0.1 Overview

  • This presentation covers ways to create a machine learning algorithm to classify cancerous cells. We will use different techniques and measure the accuracy of the predictions made by each model.

  • The dataset comes from the UCI Machine Learning Repository. You can read about the dataset on its website.

  • Please note that this data is already cleaned and ready for use, so a majority of the work is already done. We will now use this data and see whether our algorithms can classify it correctly.

  • What is being covered is as follows: – Decision Trees – KNN – Naive Bayes – Support Vector Machines – Artificial Neural Networks

  • What is not being covered (it will be covered separately): – Logistic Regression – Ensemble Methods – Other ML Algorithms – Cross Validation – Changing default parameters and tuning models

3.0.2 Getting the data

Download the data directly from the source into R. Please get acquainted with the data and the variable names. The following code chunk saves the data in the R environment as wdbc and provides the column names that we assign to it.
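The actual download chunk is hidden in this text; a minimal sketch of what it might look like is given below. The UCI URL and the exact column names are assumptions, chosen to match the variable names (radius_1, radius_2, radius_3, and so on) used later in this chapter.

# Sketch only: the URL and column names are assumed, not taken from the hidden chunk
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_pts", "symmetry", "fractal_dim")
col_names <- c("id", "diagnosis", paste(rep(features, 3), rep(1:3, each = 10), sep = "_"))
wdbc <- read.csv(url, header = FALSE, col.names = col_names)
wdbc$diagnosis <- factor(wdbc$diagnosis)   # two levels: B and M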

Once the data is loaded into the R environment, it is important to explore it to understand the data itself. The results of the following code are hidden.
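A sketch of the usual first checks (the chunk and its results are hidden in the text):

str(wdbc)      # variable types and a preview of the values
head(wdbc)     # first few rows
summary(wdbc)  # per-column summaries
dim(wdbc)      # number of rows and columns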

These give a basic idea of what the data look like. From str(wdbc) it can be seen that all the variables are numerical except diagnosis, which is categorical with two levels, M and B. Our goal here will be to predict diagnosis based on the other variables.

After some thought we realize that the column id is not necessary, so we remove it. Also remember that each cell nucleus measurement is recorded in three different ways (mean, standard error, and worst).
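A minimal sketch of dropping the id column (the actual chunk is hidden):

wdbc$id <- NULL   # remove the identifier; it carries no predictive information
ncol(wdbc)        # 31 columns remain: diagnosis plus 30 measurements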

A scatter matrix shows how each variable is related to the others. First, here are the three radius measurements for the first observation, illustrating the three summary types:

##   radius_1 radius_2 radius_3
## 1    17.99    1.095    25.38
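The scatter matrix figure itself is not reproduced in this text. A sketch of one way such a plot could be drawn in base R follows; the choice of variables is an assumption.

# Pairwise scatter plots of a few mean measurements, coloured by diagnosis
pairs(wdbc[, c("radius_1", "texture_1", "perimeter_1", "area_1")],
      col = ifelse(wdbc$diagnosis == "M", "red", "blue"))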

Notice that there are now only 31 columns.

3.0.3 Exploring the dataset using statistics and plots (ggplot2)

A lot can be learned about the data through simple statistics and plots. It is up to the individual data analyst how they want to explore the data before they start building the model. The following are a few ways the data can be explored; play with different built-in functions and plots.
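The chunk that produced the output below is hidden; a sketch of commands that could generate it is given here. The radius_1 > 11 cut-off is inferred from the later discussion of cells with radius > 11 and is an assumption.

table(wdbc$diagnosis)                                  # counts of B and M
prop.table(table(wdbc$diagnosis))                      # proportions of B and M
prop.table(table(wdbc$radius_1 > 11, wdbc$diagnosis))  # joint proportions by a radius cut-off
by(wdbc$radius_1, wdbc$diagnosis, summary)             # radius summaries within each class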

## 
##   B   M 
## 357 212
## 
##         B         M 
## 0.6274165 0.3725835
##        
##                   B           M
##   FALSE 0.147627417 0.001757469
##   TRUE  0.479789104 0.370826011
## wdbc$diagnosis: B
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.981  11.080  12.200  12.147  13.370  17.850 
## --------------------------------------------------------------------------- 
## wdbc$diagnosis: M
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.95   15.07   17.32   17.46   19.59   28.11

Can we say that malignant cells have a bigger radius than benign cells?

3.1 Building our Machine Learning Algorithm

This section will briefly cover some Machine Learning algorithms and how they can be used to make predictions. It will not, however, cover how to improve the models, change default parameters, perform cross-validation, or create confusion matrices and use them to find accuracy and error rates. Those topics will be covered in the future.

We have already created the most fundamental model: a simple guess-and-check method. Earlier we tabulated how many of the cells had radius > 11 and which category they belonged to. That is a simple predictive model. However, things are not always that simple, and outcomes are affected by more than one variable. Machine Learning algorithms help in such cases. Here are some algorithms that are widely used.

3.1.1 Decision Trees

We will be using decision trees to classify the cells into M or B based on the other variables. For the theory behind decision trees, please read the textbook or this [Wikipedia entry](https://en.wikipedia.org/wiki/Decision_tree).

Install and load the rpart, rpart.plot, and rattle libraries.
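The fitting chunk is hidden; a sketch of what it might look like follows. The object name first_dt is taken from the text below, while the rpart call itself is an assumption.

# install.packages(c("rpart", "rpart.plot", "rattle"))   # once, if not installed
library(rpart)
library(rpart.plot)
library(rattle)

first_dt <- rpart(diagnosis ~ ., data = wdbc, method = "class")  # tree on the full data
fancyRpartPlot(first_dt)   # draw the tree (from rattle)
first_dt                   # print the splits, shown below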

## n= 569 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 569 212 B (0.62741652 0.37258348)  
##    2) radius_3< 16.795 379  33 B (0.91292876 0.08707124)  
##      4) concave_pts_3< 0.1358 333   5 B (0.98498498 0.01501502) *
##      5) concave_pts_3>=0.1358 46  18 M (0.39130435 0.60869565)  
##       10) texture_3< 25.67 19   4 B (0.78947368 0.21052632) *
##       11) texture_3>=25.67 27   3 M (0.11111111 0.88888889) *
##    3) radius_3>=16.795 190  11 M (0.05789474 0.94210526) *

You can enter summary(first_dt) to see the detailed mathematical output of the decision tree. In this example, we fed all the data we had into the model. This gives us an output, but it is not what we are looking for: a model fitted and evaluated on the same data can simply memorize it, which is called overfitting. We want to create a model that predicts the output correctly when we feed it new data.

To train our model, we need to divide the dataset into two sets: a training set and a testing set. We will use the training data to train the model and then test the validity of the model using the testing data.
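The splitting chunk is hidden; a sketch of a 70/30 split that matches the output below is given here. The seed and the exact sampling call are assumptions.

set.seed(1)                                            # assumed seed for reproducibility
train_rows <- sample(seq_len(nrow(wdbc)), size = 398)  # 398 of the 569 rows for training
wdbc_train <- wdbc[train_rows, ]
wdbc_test  <- wdbc[-train_rows, ]
dim(wdbc)
dim(wdbc_train)
nrow(wdbc_train) + nrow(wdbc_test) == nrow(wdbc)       # sanity check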

## [1] 569  31
## [1] 398  31
## [1] TRUE

So, we have now split the wdbc dataset into two separate datasets in no particular order. We can now build the model using the training data and test it using the testing data.
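A sketch of fitting the tree on the training data and predicting on the testing data; the object names tree_1 and predict.tree_1 are taken from the output below, and the rest is an assumption.

tree_1 <- rpart(diagnosis ~ ., data = wdbc_train, method = "class")
predict.tree_1 <- predict(tree_1, newdata = wdbc_test, type = "class")
table(predict.tree_1)   # how many test cells were predicted B or M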

## predict.tree_1
##   B   M 
## 113  58

So, we predicted the diagnosis on the testing set using the model we created from the training dataset. How do we know if our predictions are correct? Well, in this case we compare the results from our model against the true values in the test data.
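A sketch of that comparison, cross-tabulating the predictions against the true diagnosis in the test set:

table(predict.tree_1, wdbc_test$diagnosis)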

##               
## predict.tree_1   B   M
##              B 101  12
##              M   1  57

The decision tree was a very easy solution to this problem because it recognized the important factors for the diagnosis in three splits, with an accuracy of (101 + 57)/171 = 0.9239766.

Would the model predict as well if the data were sampled in a different way? To get there, we need to engineer the features and reduce overfitting. A simpler model that predicts well in the long run is better than a more complex model that only performs well on certain data. We can also play with the tree's default parameters. We will skip feature engineering and cross-validation for now.

3.1.2 K-Nearest Neighbours (KNN)

In layman's terms, in the KNN algorithm each test data point finds its closest neighbours in the training data. The idea is that items with similar features live close to each other.
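The KNN chunk is hidden; a sketch using the class package is given below. The value k = 5 is inferred from the text that follows, and dropping column 1 (diagnosis) from the feature matrices is an assumption.

library(class)
wdbc.knn_1 <- knn(train = wdbc_train[, -1],   # features only
                  test  = wdbc_test[, -1],
                  cl    = wdbc_train$diagnosis,
                  k     = 5)
table(wdbc.knn_1, wdbc_test$diagnosis)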

##           
## wdbc.knn_1  B  M
##          B 98 10
##          M  4 59

This KNN model found the 5 nearest neighbours of each test point and classified it into whichever class, B or M, at least 3 of those neighbours belonged to. The accuracy of the prediction is (98 + 59)/171 = 0.9181287.

We did not normalize the data in this example. Normalization, such as converting each X to a Z-score, is usually necessary for KNN because distances are otherwise dominated by the variables with the largest scales. Also, feature engineering and cross-validation are skipped for now.
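A sketch of what Z-score normalization could look like (not applied in the example above):

wdbc_norm <- as.data.frame(scale(wdbc[, -1]))   # (x - mean(x)) / sd(x) for each column
wdbc_norm$diagnosis <- wdbc$diagnosis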

We will also use the kknn package in the future and see the power of the K-Nearest Neighbours algorithm.

3.1.3 Naive Bayes

The Naive Bayes classifier is the closest to the Bayesian model: it relies on the prior probabilities of events and on conditional probabilities. If you remember conditional probability from statistics, you are making good progress.

Install and load the e1071 library.

## factor(0)
## Levels:
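The Naive Bayes chunk is hidden; a sketch of how a basic model might be fit using e1071 is shown below. The object names are assumptions.

library(e1071)
nb_1 <- naiveBayes(diagnosis ~ ., data = wdbc_train)
predict.nb_1 <- predict(nb_1, newdata = wdbc_test)
table(predict.nb_1, wdbc_test$diagnosis)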

With the basic Naive Bayes model we got an accuracy of 0.9181287.

Again, there is a lot of optimization we could do with this method, which is skipped here for now.

3.1.4 Support Vector Machines

SVM is a very powerful technique whose workings rest on applied mathematics (linear algebra).

Install and load the e1071 library.
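The SVM chunk is hidden; a sketch using e1071 with a linear kernel (as described below) is given here. The cost is left at its default and the object names are assumptions.

library(e1071)
svm_1 <- svm(diagnosis ~ ., data = wdbc_train, kernel = "linear")
predict.svm_1 <- predict(svm_1, newdata = wdbc_test)
table(predict.svm_1, wdbc_test$diagnosis)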

This algorithm used the default linear separator/kernel with a reasonable cost. The parameters of an SVM must be tuned to get sensible results. Even so, this model predicted very successfully, at a rate of 0.9649123.

3.1.5 Artificial Neural Networks

The Artificial Neural Network algorithm closely resembles the neurons and workings of the brain. It consists of one input layer, one output layer, and one or more hidden layers.

Install and load the nnet library.
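The neural network chunk is hidden; a sketch using nnet is shown below. The hidden-layer size, maxit, and object names are all assumptions.

library(nnet)
set.seed(1)
nn_1 <- nnet(diagnosis ~ ., data = wdbc_train, size = 5, maxit = 500, trace = FALSE)
predict.nn_1 <- predict(nn_1, newdata = wdbc_test, type = "class")
table(predict.nn_1, wdbc_test$diagnosis)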

The artificial neural network algorithm had an accuracy of 0.9239766 on this data.