Classification of data using decision tree [C50] and regression tree [rpart] methods


Tasks covered:
Introduction to data classification and decision trees
Read csv data files into R 
Decision tree classification with C50
Decision [regression] tree classification with rpart [R implementation of CART] 
Visualization of decision trees 

Project script: Decision Tree Classification exercise.r

Data sets used in this exercise: Iris_Data.csv  Titanic.csv  Wine.csv

Package documentation: C50  rpart  rpart.plot

Data classification and decision trees

Data classification is a machine learning methodology for assigning known class labels to unlabeled data. It is a supervised learning technique that uses a training dataset labeled with known class labels. The classification method develops a classification model [a decision tree in this example exercise] using information from the training data and a class purity algorithm. The resulting model can then be used to assign one of the known classes to new, unlabeled data.

A decision tree is the specific model output of the two data classification techniques covered in this exercise. A decision tree is a graphical representation of a rule set that leads to some conclusion, in this case, a classification of an input data item. A principal advantage of decision trees is that they are easy to explain and use. The rules of a decision tree follow a basic format. The tree starts at a root node [usually placed at the top of the tree]. Each node of the tree represents a rule whose result splits the options into several branches. As the tree is traversed downward, a leaf node is eventually reached. This leaf node determines the class assigned to the data. The same process could be accomplished with a simple rule set [and most decision tree methods can output a rule set] but, as stated above, the graphical tree representation tends to be easier to explain to a decision maker.
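To make this concrete, the decision tree that this exercise builds for the iris data can be written out by hand as nested if/else rules. The sketch below is purely illustrative [the function name is made up, and the thresholds are copied from the C5.0 tree shown later in the C50 section]:

# Illustrative only: the iris decision tree from the C50 section, written as if/else rules
classify_iris <- function(PL, PW) {
  if (PL <= 1.9) return('Setosa')     # root node split
  if (PW > 1.7)  return('Virginica')  # second split
  if (PL <= 4.9) return('Versicolor') # third split
  'Virginica'                         # remaining cases
}
classify_iris(PL = 1.4, PW = 0.2) # returns "Setosa"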

This exercise will introduce and experiment with two decision tree classification methods: C5.0 and rpart.

C50

C50 is an R implementation of the supervised machine learning algorithm C5.0 that can generate a decision tree. The original algorithm was developed by Ross Quinlan. It is an improved version of C4.5, which is based on ID3. This algorithm uses an information entropy computation to determine the best rule that splits the data, at that node, into purer classes by minimizing the computed entropy value. This means that as each node splits the data, based on the rule at that node, each subset of data split by the rule will contain less diversity of classes and will, eventually, contain only one class [complete purity]. This computation is simple, so C50 runs quickly. C50 is robust. It can work with both numeric and categorical data [this example shows both types]. It can also tolerate missing data values. The output from the R implementation can be either a decision tree or a rule set. The output model can be used to assign [predict] a class to new unclassified data items.
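The entropy computation itself is small. As an illustrative sketch [this is not the package's own implementation], the Shannon entropy of a set of class labels can be computed in a few lines of R; a pure node scores 0 and a mixed node scores higher:

# Illustrative only: Shannon entropy [in bits] of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels) # class proportions
  -sum(p * log2(p))                   # 0 for a pure node, larger for mixed nodes
}
entropy(c('Setosa', 'Setosa', 'Versicolor', 'Virginica')) # mixed node: 1.5 bits
entropy(rep('Setosa', 4))                                 # pure node: 0 bits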

rpart

The R function rpart is an implementation of the CART [Classification and Regression Tree] supervised machine learning algorithm used to generate a decision tree. CART was developed by Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. CART is a trademarked name of a particular software implementation, so the R implementation is called rpart, for Recursive PARTitioning. Like C50, rpart uses a computational metric to determine the best rule that splits the data, at that node, into purer classes. In the rpart algorithm the computational metric is the Gini index. At each node, rpart minimizes the Gini index and thus splits the data into purer class subsets, with the class leaf nodes at the bottom of the decision tree. The process is simple to compute and runs fairly well, but our example will highlight some computational issues. The output from the R implementation is a decision tree that can be used to assign [predict] a class to new unclassified data items.
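For comparison with the entropy sketch above, here is an equally small illustrative computation of the Gini index for a set of class labels [again, not the package's own implementation]. A pure node scores 0, and a two-class node split 50/50 scores 0.5:

# Illustrative only: Gini index of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels) # class proportions
  1 - sum(p^2)                        # 0 for a pure node
}
gini(c('Yes', 'Yes', 'No', 'No')) # maximally mixed two-class node: 0.5
gini(rep('Yes', 4))               # pure node: 0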

The datasets

This example uses three datasets [in csv format]: Iris, Wine, and Titanic. All three of these datasets are good examples to use with classification algorithms. Both Iris and Wine consist of numeric data. Titanic is entirely categorical data. All three datasets have one attribute that can be designated as the class variable [Iris -> Classification (3 values), Wine -> Class (3 values), Titanic -> Survived (2 values)]. The Wine dataset introduces some potential computational complexity because it has 14 variables. This complexity is a good test of the performance of the two methods used in this exercise. The descriptive statistics and charting of these datasets are left as an additional exercise for interested readers.

Description of the Iris dataset

Description of the Wine dataset

Description of the Titanic dataset: this file is a modified subset of the Kaggle Titanic dataset. This version contains four categorical attributes: Class, Age, Sex, and Survived. Like the two datasets above, it was downloaded from the UCI Machine Learning Repository. That dataset is no longer available in the repository because the richer Kaggle version now exists.
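Before starting the exercise, a quick preview of the class attribute in each file can confirm the target classes. This optional sketch assumes the three csv files are in the working directory; each file is read again in its own section below.

# Optional preview of the class attribute in each dataset
ir   <- read.csv('Iris_Data.csv')
wine <- read.csv('Wine.csv')
tn   <- read.csv('Titanic.csv')
table(ir$Classification) # three iris classes
table(wine$Class)        # three wine classes
table(tn$Survived)       # two survival classes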

The classification exercise - C50 

This exercise demonstrates decision tree classification, first using C50 and then using rpart. Because the two methods use different purity metrics and computational steps, the exercise compares and contrasts the output of the two methods.

The C50 exercise begins by loading the package and reading in the file Iris_Data.csv.

# get the C5.0 package 
install.packages('C50') 
library('C50') # load the package  

ir <- read.csv('Iris_Data.csv') # open iris dataset 
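One note before continuing: in R 4.0 and later, read.csv no longer converts character columns to factors by default, and C5.0 requires the class variable to be a factor. If that applies to your setup, a defensive conversion is a safe extra step [not part of the original script]:

# Only needed if Classification was read as character [R >= 4.0 default behavior]
ir$Classification <- as.factor(ir$Classification)
str(ir) # confirm the column types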

The exercise looks at some descriptive statistics and charts for the iris data. These steps help reveal potential patterns in the dataset. While these steps are not explicitly included in the script for the other datasets, an interested reader can adapt the commands and repeat the descriptive analysis for themselves.

# summary, boxplot, pairs plot 
summary(ir) 
boxplot(ir[-5], main = 'Boxplot of Iris data by attributes') 
pairs(ir[,-5], main="Edgar Anderson's iris Data", pch=21, bg = c("black", "red", "blue")[unclass(ir$Classification)]) 

The summary statistics provide context for the planned classification task. There are three labels in the Classification attribute. These are the target classes for our decision tree. The pairs plot [shown below] is helpful. The colors represent the three iris classes. The black dots [Setosa instances] appear to be well separated from the other dots. The other two groups [red and blue dots] are not as well separated.

 

The function C5.0( ) trains the decision tree model.

irTree <- C5.0(ir[,-5], ir[,5]) 

This is the minimal set of input arguments for this function. The first argument [the four flower measurement attributes, minus the Classification attribute] identifies the data that will be used to compute the information entropy and determine the classification splits. The second argument [the Classification attribute] identifies the class labels. The next two commands view the model output.

summary(irTree) # view the model components  

plot(irTree, main = 'Iris decision tree') # view the model graphically  

The summary( ) function is a C50 package version of the standard R library function. This version displays the C5.0 model output in three sections. The first section is the summary header. It states the function call, the class specification, and the number of data instances in the training dataset. The second section displays a text version of the decision tree.

PL <= 1.9: Setosa (50) 
PL > 1.9: 
:...PW > 1.7: Virginica (46/1) 
    PW <= 1.7: 
    :...PL <= 4.9: Versicolor (48/1) 
        PL > 4.9: Virginica (6/2) 

The first two output lines show a split based on the PL attribute at the value 1.9. Less than or equal to that value branches to a leaf node containing 50 instances of Setosa items. If PL is greater than 1.9, the tree branches to a node that splits based on the PW attribute at the value 1.7. Greater than 1.7 branches to a leaf node containing 46 instances of Virginica items and one item that is not Virginica. If PW is less than or equal to 1.7, the tree branches to a node that splits based on the PL attribute at the value 4.9. Less than or equal to that value branches to a leaf node containing 48 instances of Versicolor items and one item that is not Versicolor. If PL is greater than 4.9, the tree branches to a leaf node containing 6 instances of Virginica items and 2 items that are not Virginica. The third section of the summary( ) output shows the analysis of the classification quality based on the training data classification with this decision tree model.

Evaluation on training data (150 cases): 

Decision Tree 
  ---------------- 
  Size      Errors  

            4         4( 2.7%) <<  

          (a)  (b)  (c) <-classified as 
         ---- ---- ---- 
          50            (a): class Setosa 
               47    3  (b): class Versicolor 
                1   49  (c): class Virginica 

        Attribute usage: 

        100.00% PL 
         66.67% PW 

This output shows that the decision tree has 4 leaf [classification] nodes and, using the training data, resulted in four items being mis-classified [assigned a class that does not match its actual class]. Below those results is a matrix showing the test classification results. Each row represents the number of data instances having a known class label [1st row (a) = Setosa, 2nd row (b) = Versicolor, 3rd row (c) = Virginica]. Each column indicates the number of data instances classified under a given label [(a) = Setosa, (b) = Versicolor, (c) = Virginica]. All 50 Setosa data instances were classified correctly. 47 Versicolor data instances were classified as Versicolor and 3 were classified as Virginica. 1 Virginica data instance was classified as Versicolor and 49 were classified correctly. Only two of the four flower measurement attributes [PL and PW] were used in the decision tree. Here is a graphical plot of the decision tree produced by the plot( ) function included in the C50 package [this function overloads the base plot( ) function for C50 tree objects].

plot(irTree, main = 'Iris decision tree') # view the model graphically 



This decision tree chart depicts the same information as the text-based tree shown above, but it is visually more appealing. Each of the split nodes and the splitting criterion is easily understood. The classification results are represented as proportions, with the total number of data instances in each leaf node listed at the top of the node. This chart shows why decision trees are easy to understand. The output from C50 can be represented by a rule set instead of a decision tree. This is helpful if the model results are intended to be converted into programming code in another computer language. Explicit rules are easier to convert than tree split criteria.

# build a rules set  
irRules <- C5.0(ir[,-5], ir[,5], rules = TRUE) 
summary(irRules) # view the ruleset 

The C5.0( ) function is modified with an additional argument [rules = TRUE]. This changes the model output from a decision tree to a decision rule set. The summary output of the rule set is similar to the summary output from the decision tree, except that the text-based decision tree is replaced by a rule set.

Rules: 

Rule 1: (50, lift 2.9) 
    PL <= 1.9 
    -> class Setosa [0.981] 

Rule 2: (48/1, lift 2.9) 
    PL > 1.9 
    PL <= 4.9 
    PW <= 1.7 
    -> class Versicolor [0.960] 

Rule 3: (46/1, lift 2.9) 
    PW > 1.7 
    -> class Virginica [0.958] 

Rule 4: (46/2, lift 2.8) 
    PL > 4.9 
    -> class Virginica [0.938] 

Default class: Setosa 

The rule set corresponds to the decision tree leaf nodes [one rule per leaf node], but a careful review reveals that some rules differ from the decision tree branches [Rule 3, for example]. The rules are applied in order. Any data items not classified by the first rule are tested by the second rule. This process proceeds through each rule. Any data items that pass through all of the rules without being classified are trapped by the final default rule; this default class is included to trap data that does not conform to the domain of the original training dataset. The strength of each rule is indicated by the number next to the class label. The lift metric is a measure of the performance of the rule at predicting the class [larger lift = better performance].

Before moving on to the next dataset, this exercise provides an example of how to use the predict( ) function in this package. Unknown data can be classified using the trained tree model in the function predict( ). The exercise uses all of the iris dataset in this example instead of actual unclassified data. While artificial, this allows the example to walk through a procedure that matches the new classifications to the actual classifications. After labeling the data with predicted classes, the prediction data set is compared to the actual set and the mis-classified instances are found. This section is left as an exercise for the interested reader to explore.
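For readers who want a starting point, here is a minimal sketch of that procedure [irPred is just an example object name; the training data stands in for new data, as described above]:

irPred <- predict(irTree, ir[,-5])         # predicted class labels for each row
table(Predicted = irPred, Actual = ir[,5]) # confusion matrix of predicted versus actual classes
which(irPred != ir[,5])                    # row numbers of the mis-classified instances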

The next section of the example computes a C5.0 decision tree and rule set model for the Wine dataset. This dataset is interesting because it consists of 13 numeric attributes representing the chemical and physical properties of the wine, plus a Class attribute that provides the target classes for the classification model.

wine <- read.csv('Wine.csv') # read the dataset 

head(wine) # look at the 1st 6 rows 

wTree <- C5.0(wine[,-14], as.factor(wine[,14])) # train the tree 

The command head(wine) shows that the 14 attributes of the Wine dataset are all numeric. This means that the Class attribute must be defined as a factor in C5.0 so that it can serve as the target class for the decision tree. C5.0 runs quickly and has no difficulty computing the information entropy and discovering the best split points. Here is the summary output and final decision tree:

summary(wTree) # view the model components 

Decision tree: 

Flavanoids <= 1.57: 
:...Color.intensity <= 3.8: 2 (13) 
:   Color.intensity > 3.8: 3 (49/1) 
Flavanoids > 1.57: 
:...Proline <= 720: 2 (54/1) 
:   Proline > 720: 
    :...Color.intensity <= 3.4: 2 (4) 
    :   Color.intensity > 3.4: 1 (58) 

Evaluation on training data (178 cases): 

     Decision Tree 
   ---------------- 
    Size    Errors 
     5        2     (1.1%) << 

   (a)   (b)   (c)  <-classified as 
  ----  ----  ---- 
   58     1         (a): class 1 
         70     1   (b): class 2 
               48   (c): class 3 

Attribute usage: 

 100.00% Flavanoids 
  69.66% Color.intensity 
  65.17% Proline 

plot(wTree, main = 'Wine decision tree') # view the model graphically 




Notice that C5.0 uses only three of the 13 predictor attributes in the decision tree. These three attributes provide enough information to split the dataset into refined class subsets at each tree node. Here is the C5.0 rule set:

wRules <- C5.0(wine[,-14], as.factor(wine[,14]), rules = TRUE) 
summary(wRules) # view the ruleset 

Rules: 

Rule 1: (58, lift 3.0) 
        Flavanoids > 1.57 
        Color.intensity > 3.4 
        Proline > 720 
        -> class 1 [0.983] 

Rule 2: (55, lift 2.5) 
        Color.intensity <= 3.4 
        -> class 2 [0.982] 

Rule 3: (54/1, lift 2.4) 
        Flavanoids > 1.57 
        Proline <= 720 
        -> class 2 [0.964] 

Rule 4: (13, lift 2.3) 
        Flavanoids <= 1.57 
        Color.intensity <= 3.8 
        -> class 2 [0.933] 

Rule 5: (49/1, lift 3.6) 
        Flavanoids <= 1.57 
        Color.intensity > 3.8 
        -> class 3 [0.961] 

Default class: 2 

Attribute usage: 

  97.75% Flavanoids 
  92.13% Color.intensity 
  62.92% Proline 

These rules resemble the split conditions in the decision tree, and the same subset of three attributes is used in the rule set as in the decision tree, but the attribute usage percentages are different for the rule set versus the decision tree. This difference results from the sequential application of the rules versus the split criteria applied in a decision tree.

The third example uses the Titanic dataset. As stated above, this dataset consists of 2201 rows and four categorical attributes [Class, Age, Sex, and Survived].

tn <- read.csv('Titanic.csv') # load the dataset into an object 
head(tn) # view the first six rows of the dataset 
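As with the iris data, newer versions of R read these categorical columns as character rather than factor. If that is the case on your system, converting them before training keeps C5.0 happy [a defensive step, not part of the original script]:

# Convert the four character columns to factors [only needed in R >= 4.0]
tn[] <- lapply(tn, as.factor)
str(tn) # all four columns should now be factors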

Train a decision tree and view the results

tnTree <- C5.0(tn[,-4], tn[,4]) 
plot(tnTree, main = 'Titanic decision tree') #view the tree 




This decision tree uses only two tests for its classifications. The final leaf nodes look different from the two trees above. Since there are only two classes [Survived = Yes, No], the leaf nodes show the proportions for each class within each node. Here is the summary for the decision tree:

summary(tnTree) # view the tree object 

Decision tree: 

Sex = Male: No (1731/367) 
Sex = Female: 
:...Class in {Crew,First,Second}: Yes (274/20) 
:   Class = Third: No (196/90) 

Evaluation on training data (2201 cases): 

     Decision Tree 
    ---------------- 
     Size    Errors 

       3       477    (21.7%) << 

     (a)   (b)    <-classified as 
    ----   ---- 
    1470     20   (a): class No 
     457    254   (b): class Yes 

   Attribute usage: 

     100.00% Sex 
      21.35% Class 

Train a rule set for the dataset and view the results

tnRules <- C5.0(tn[,-4], tn[,4], rules = TRUE)  
summary(tnRules) # view the ruleset  

Rules: 

Rule 1: (1731/367, lift 1.2) 
        Sex = Male 
        -> class No [0.788] 

Rule 2: (706/178, lift 1.1) 
        Class = Third 
        -> class No [0.747] 

Rule 3: (274/20, lift 2.9) 
        Class in {Crew, First, Second} 
        Sex = Female 
        -> class Yes [0.924] 

Default class: No 

Attribute usage: 

  91.09% Sex 
  44.53% Class 

The classification exercise - rpart 

The rpart( ) function trains a classification [regression] decision tree using the Gini index as its class purity metric. Since this algorithm is different from the information entropy computation used in C5.0, it may compute different splitting criteria for its decision trees. The rpart( ) function takes a pre-specified model formula as its first argument. The format for this formula is: Class variable ~ input variable A + input variable B + [any other input variables]. The examples in this discussion will use all of the dataset attributes as input variables and let rpart select the best ones for the decision tree model. Additionally, the summary of an rpart decision tree object is very different from the summary of a C5.0 decision tree object.
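As a side note, rpart also accepts a data argument; passing the data frame this way lets the formula use bare column names and the '.' shorthand for all remaining attributes. This is only an alternative to the explicit dataset$attribute style used in the exercise script below, sketched here for the iris data [irrTree2 is just an example object name]:

library('rpart') # load the rpart package
# Equivalent model specification using the data argument and the '.' shorthand
irrTree2 <- rpart(Classification ~ ., data = ir, method = 'class')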

Here is the code for the first example, which trains an rpart regression decision tree with the iris dataset.

# create the regression formula 
f = ir$Classification ~ ir$SL + ir$SW + ir$PL + ir$PW  

library('rpart') # load the rpart package [if not already loaded]

# train the tree 
irrTree = rpart(f, method = 'class') 

# view the tree summary 
summary(irrTree) 

The first command defines the regression formula f. All four measurement attributes are used in the regression formula to predict the Classification attribute. The second command trains a regression classification tree using the formula f. The third command prints out the summary of the regression tree object. Here is the output from summary(irrTree).

Call: 
rpart(formula = f, method = "class") 
  n= 150 

    CP nsplit rel error xerror       xstd 
1 0.50      0      1.00   1.21 0.04836666 
2 0.44      1      0.50   0.66 0.06079474 
3 0.01      2      0.06   0.12 0.03322650 

Variable importance 
ir$PW ir$PL ir$SL ir$SW  
   34    31    21    13 

Node number 1: 150 observations,    complexity param=0.5 
  predicted class=Setosa      expected loss=0.6666667  P(node) =1 
    class counts: 50 50 50 
   probabilities: 0.333 0.333 0.333 
  left son=2 (50 obs) right son=3 (100 obs) 
  Primary splits: 
      ir$PL < 2.45 to the left, improve=50.00000, (0 missing) 
      ir$PW < 0.8 to the left, improve=50.00000, (0 missing) 
      ir$SL < 5.45 to the left, improve=34.16405, (0 missing) 
      ir$SW < 3.35 to the right, improve=18.05556, (0 missing) 
  Surrogate splits: 
      ir$PW < 0.8 to the left, agree=1.000, adj=1.00, (0 split) 
      ir$SL < 5.45 to the left, agree=0.920, adj=0.76, (0 split) 
      ir$SW < 3.35 to the right, agree=0.827, adj=0.48, (0 split) 

Node number 2: 50 observations 
  predicted class=Setosa     expected loss=0 P(node) =0.3333333 
    class counts: 50 0 0 
   probabilities: 1.000 0.000 0.000 

Node number 3: 100 observations,    complexity param=0.44 
  predicted class=Versicolor  expected loss=0.5  P(node) =0.6666667 
    class counts: 0 50 50 
   probabilities: 0.000 0.500 0.500 
  left son=6 (54 obs) right son=7 (46 obs) 
  Primary splits: 
      ir$PW < 1.75 to the left, improve=38.969400, (0 missing) 
      ir$PL < 4.75 to the left, improve=37.353540, (0 missing) 
      ir$SL < 6.15 to the left, improve=10.686870, (0 missing) 
      ir$SW < 2.45 to the left, improve= 3.555556, (0 missing) 
  Surrogate splits: 
      ir$PL < 4.75 to the left, agree=0.91, adj=0.804, (0 split) 
      ir$SL < 6.15 to the left, agree=0.73, adj=0.413, (0 split) 
      ir$SW < 2.95 to the left, agree=0.67, adj=0.283, (0 split) 

Node number 6: 54 observations 
  predicted class=Versicolor expected loss=0.09259259 P(node) =0.36 
    class counts: 0 49 5 
   probabilities: 0.000 0.907 0.093 

Node number 7: 46 observations 
  predicted class=Virginica expected loss=0.02173913 P(node) =0.3066667 
    class counts: 0 1 45 
   probabilities: 0.000 0.022 0.978 

This output provides the details of how rpart( ) selects the attribute and value at each split point [nodes 1 and 3]. Node 1 is the root node. There are 50 instances of each class at this node. Four primary split choices and three surrogate split choices are shown [best choice first]. The best split criterion [ir$PL < 2.45] splits the data left to node 2 [50 instances of Setosa] and right to node 3 [100 instances of both Versicolor and Virginica]. Note that the split criteria used by rpart are different from the split criteria produced by C5.0. This difference is due to the different splitting algorithm [information entropy versus Gini] used by each method. Node 2 contains 50 instances of Setosa. Since this node is one pure class, no additional split is needed. Node 3 splits the data based on the best primary split choice [ir$PW < 1.75], left to node 6 and right to node 7. This node numbering is an unusual behavior of rpart( ); if this regression tree went deeper, the node numbers would jump at each deeper level. Node 6 contains 54 data instances, with 49 Versicolor instances and 5 Virginica instances. Node 7 contains 46 instances, with 1 Versicolor instance and 45 Virginica instances.

A text version of the tree is displayed using the command:

print(irrTree) # view a text version of the tree 

n= 150 

node), split, n, loss, yval, (yprob) 
      * denotes terminal node 

1) root 150 100 Setosa (0.33333333 0.33333333 0.33333333) 
  2) ir$PL< 2.45 50 0 Setosa (1.00000000 0.00000000 0.00000000) * 
  3) ir$PL>=2.45 100 50 Versicolor (0.00000000 0.50000000 0.50000000) 
    6) ir$PW< 1.75 54 5 Versicolor (0.00000000 0.90740741 0.09259259) * 
    7) ir$PW>=1.75 46 1 Virginica (0.00000000 0.02173913 0.97826087) * 


The plot( ) function included in the rpart package is functional, but it produces a tree that is not as visually appealing as the one included in C5.0. Three commands are needed to plot the regression tree:

par(xpd = TRUE) # define graphic parameter 
plot(irrTree, main = 'Iris regression tree') # plot the tree 
text(irrTree, use.n = TRUE) # add text labels to tree 



The plot( ) function draws the tree and displays the chart title. The text( ) function adds the node labels, indicating the split criterion at the interior nodes and the classification results at the leaf nodes. Note that the leaf nodes also show the counts of each actual class.
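The next charts use the rpart.plot package. If it has not been installed and loaded yet, that is a quick one-time step:

install.packages('rpart.plot') # one-time installation
library('rpart.plot') # load the plotting package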

For a more visually appealing regression tree, the rpart.plot package can be used. Here is the same iris regression tree using rpart.plot( ):

rpart.plot(irrTree, main = 'Iris regression tree') # a better tree plot 



One function draws this chart. The interior nodes are colored in a light shade of the color of their node class, and the leaf nodes are colored more strongly based on the node class. Each node shows the proportion of each class at that node and the percentage of the dataset that reaches that node. This display of proportions and percentages differs from the class counts displayed by the rpart plot( ) function.
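As with C5.0, the trained rpart tree can be used with predict( ) to assign classes. Here is a minimal sketch [rpPred is just an example object name; the training data again stands in for new data]:

# Predict classes with the rpart tree; type = 'class' returns class labels
rpPred <- predict(irrTree, type = 'class')
table(Predicted = rpPred, Actual = ir$Classification) # compare to the actual classes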

Here is the second example of an rpart regression decision tree, using the wine dataset.

f <- wine$Class ~ wine$Alcohol + wine$Malic.acid + wine$Ash + wine$Alcalinity.of.ash + wine$Magnesium + wine$Total.phenols + wine$Flavanoids + wine$Nonflavanoid.phenols + wine$Proanthocyanins + wine$Color.intensity + wine$Hue + wine$OD280.OD315.of.diluted.wines + wine$Proline 

winerTree = rpart(f, method = 'class') # train the tree 

summary(winerTree) # view the tree summary 

The first command defines the regression formula f. All thirteen measurement attributes are used in the regression formula to predict the class attribute. The second command trains a regression classification tree using the formula f. The third command prints out the summary of the regression tree object. Here is the output from summary(winerTree).

Call: 
rpart(formula = f, method = "class") 
  n= 178 

          CP nsplit rel error    xerror       xstd 
1 0.49532710      0 1.0000000 1.0000000 0.06105585 
2 0.31775701      1 0.5046729 0.4859813 0.05670132 
3 0.05607477      2 0.1869159 0.3364486 0.05008430 
4 0.02803738      3 0.1308411 0.2056075 0.04103740 
5 0.01000000      4 0.1028037 0.1588785 0.03664744 

Variable importance 
                  wine$Flavanoids wine$OD280.OD315.of.diluted.wines 
                               18                                17 
                     wine$Proline                      wine$Alcohol 
                               13                                12 
                         wine$Hue              wine$Color.intensity 
                               10                                 9 
               wine$Total.phenols              wine$Proanthocyanins 
                                8                                 7 
           wine$Alcalinity.of.ash                   wine$Malic.acid 
                                6                                 1 

Node number 1: 178 observations, complexity param=0.4953271 
  predicted class=2 expected loss=0.6011236 P(node) =1 
    class counts: 59 71 48 
   probabilities: 0.331 0.399 0.270 
  left son=2 (67 obs) right son=3 (111 obs) 
  Primary splits: 
      wine$Proline                      < 755 to the right, improve=44.81780, (0 missing) 
      wine$Color.intensity              < 3.82 to the left, improve=43.48679, (0 missing) 
      wine$Alcohol                      < 12.78 to the right, improve=40.45675, (0 missing) 
      wine$OD280.OD315.of.diluted.wines < 2.115 to the right, improve=39.27074, (0 missing) 
      wine$Flavanoids                   < 1.4 to the right, improve=39.21747, (0 missing) 
  Surrogate splits: 
      wine$Flavanoids                   < 2.31 to the right, agree=0.831, adj=0.552, (0 split) 
      wine$Total.phenols                < 2.335 to the right, agree=0.781, adj=0.418, (0 split) 
      wine$Alcohol                      < 12.975 to the right, agree=0.775, adj=0.403, (0 split) 
      wine$Alcalinity.of.ash            < 17.45 to the left, agree=0.770, adj=0.388, (0 split) 
      wine$OD280.OD315.of.diluted.wines < 3.305 to the right, agree=0.725, adj=0.269, (0 split) 

Node number 2: 67 observations, complexity param=0.05607477 
  predicted class=1 expected loss=0.1492537 P(node) =0.3764045 
    class counts: 57 4 6 
   probabilities: 0.851 0.060 0.090 
  left son=4 (59 obs) right son=5 (8 obs) 
  Primary splits: 
      wine$Flavanoids                   < 2.165 to the right, improve=10.866940, (0 missing) 
      wine$Total.phenols                < 2.05 to the right, improve=10.317060, (0 missing) 
      wine$OD280.OD315.of.diluted.wines < 2.49 to the right, improve=10.317060, (0 missing) 
      wine$Hue                          < 0.865 to the right, improve= 8.550391, (0 missing) 
      wine$Alcohol                      < 13.02 to the right, improve= 5.273716, (0 missing) 
  Surrogate splits: 
      wine$Total.phenols                < 2.05 to the right, agree=0.985, adj=0.875, (0 split) 
      wine$OD280.OD315.of.diluted.wines < 2.49 to the right, agree=0.985, adj=0.875, (0 split) 
      wine$Hue                          < 0.78 to the right, agree=0.970, adj=0.750, (0 split) 
      wine$Alcohol                      < 12.46 to the right, agree=0.940, adj=0.500, (0 split) 
      wine$Proanthocyanins              < 1.195 to the right, agree=0.925, adj=0.375, (0 split) 

Node number 3: 111 observations, complexity param=0.317757 
  predicted class=2 expected loss=0.3963964 P(node) =0.6235955 
    class counts: 2 67 42 
   probabilities: 0.018 0.604 0.378 
  left son=6 (65 obs) right son=7 (46 obs) 
  Primary splits: 
      wine$OD280.OD315.of.diluted.wines < 2.115 to the right, improve=36.56508, (0 missing) 
      wine$Color.intensity              < 4.85 to the left, improve=36.17922, (0 missing) 
      wine$Flavanoids                   < 1.235 to the right, improve=34.53661, (0 missing) 
      wine$Hue                          < 0.785 to the right, improve=28.24602, (0 missing) 
      wine$Alcohol                      < 12.745 to the left, improve=23.14780, (0 missing) 
  Surrogate splits: 
      wine$Flavanoids      < 1.48 to the right, agree=0.910, adj=0.783, (0 split) 
      wine$Color.intensity < 4.74 to the left, agree=0.901, adj=0.761, (0 split) 
      wine$Hue             < 0.785 to the right, agree=0.829, adj=0.587, (0 split) 
      wine$Alcohol         < 12.525 to the left, agree=0.802, adj=0.522, (0 split) 
      wine$Proanthocyanins < 1.285 to the right, agree=0.775, adj=0.457, (0 split) 

Node number 4: 59 observations 
  predicted class=1 expected loss=0.03389831 P(node) =0.3314607 
    class counts: 57 2 0 
   probabilities: 0.966 0.034 0.000 

Node number 5: 8 observations 
  predicted class=3 expected loss=0.25 P(node) =0.04494382 
    class counts: 0 2 6 
   probabilities: 0.000 0.250 0.750 

Node number 6: 65 observations 
  predicted class=2 expected loss=0.06153846 P(node) =0.3651685 
    class counts: 2 61 2 
   probabilities: 0.031 0.938 0.031 

Node number 7: 46 observations, complexity param=0.02803738 
  predicted class=3 expected loss=0.1304348 P(node) =0.258427 
    class counts: 0 6 40 
   probabilities: 0.000 0.130 0.870 
  left son=14 (7 obs) right son=15 (39 obs) 
  Primary splits: 
      wine$Hue             < 0.9 to the right, improve=5.628922, (0 missing) 
      wine$Malic.acid      < 1.6 to the left, improve=4.737414, (0 missing) 
      wine$Color.intensity < 4.85 to the left, improve=4.044392, (0 missing) 
      wine$Proanthocyanins < 0.705 to the left, improve=3.211339, (0 missing) 
      wine$Flavanoids      < 1.29 to the right, improve=2.645309, (0 missing) 
  Surrogate splits: 
      wine$Alcalinity.of.ash < 17.25 to the left, agree=0.935, adj=0.571, (0 split) 
      wine$Color.intensity   < 3.56 to the left, agree=0.935, adj=0.571, (0 split) 
      wine$Malic.acid        < 1.17 to the left, agree=0.913, adj=0.429, (0 split) 
      wine$Proanthocyanins   < 0.485 to the left, agree=0.913, adj=0.429, (0 split) 
      wine$Ash               < 2.06 to the left, agree=0.891, adj=0.286, (0 split) 

Node number 14: 7 observations 
  predicted class=2 expected loss=0.2857143 P(node) =0.03932584 
    class counts: 0 5 2 
   probabilities: 0.000 0.714 0.286  

Node number 15: 39 observations 
  predicted class=3 expected loss=0.02564103 P(node) =0.2191011 
    class counts: 0 1 38 
   probabilities: 0.000 0.026 0.974  

This output provides the details of how rpart( ) selects the attribute and value at each split point [nodes 1, 2, 3, and 7]. Node 1 is the root node. Note that the split criteria used by rpart are different from the split criteria produced by C5.0. This difference is due to the different splitting algorithm [information entropy versus Gini] used by each method. As noted for the regression classification of the iris dataset, the node numbering is not completely sequential; this is an unusual behavior of rpart( ), and if this regression tree went deeper, the node numbers would jump at each deeper level.

A text version of the tree is displayed using the command:

print(winerTree) # view a text version of the tree 

n= 178 

node), split, n, loss, yval, (yprob) 
      * denotes terminal node 

1) root 178 107 2 (0.33146067 0.39887640 0.26966292) 
  2) wine$Proline>=755 67 10 1 (0.85074627 0.05970149 0.08955224) 
    4) wine$Flavanoids>=2.165 59 2 1 (0.96610169 0.03389831 0.00000000) * 
    5) wine$Flavanoids< 2.165 8 2 3 (0.00000000 0.25000000 0.75000000) * 
  3) wine$Proline< 755 111 44 2 (0.01801802 0.60360360 0.37837838)  
    6) wine$OD280.OD315.of.diluted.wines>=2.115 65 4 2 (0.03076923 0.93846154 0.03076923) * 
    7) wine$OD280.OD315.of.diluted.wines< 2.115 46 6 3 (0.00000000 0.13043478 0.86956522) 
     14) wine$Hue>=0.9 7 2 2 (0.00000000 0.71428571 0.28571429) * 
     15) wine$Hue< 0.9 39 1 3 (0.00000000 0.02564103 0.97435897) * 

Several points are worth noticing in this regression tree. Only four of the thirteen attributes are used in splitting the data into classes. Additionally, while the goal is to classify the data into each of three classes, the regression tree uses five leaf nodes to accomplish this task. This result suggests that there are no definite class boundaries in this data. This differs from the definite boundary in the iris dataset between the Setosa class and the rest of the data.

Here is the output from the plot( ) function included in the rpart package:

par(xpd = TRUE) # define graphic parameter 
plot(winerTree, main = 'Wine regression tree') # plot the tree 
text(winerTree, use.n = TRUE) # add text labels to tree 




Here is the same wine regression tree using rpart.plot( ):

rpart.plot(winerTree, main = 'Wine regression tree') # a better tree plot 




The third example of rpart decision tree classification uses the Titanic dataset. Remember that this dataset consists of 2201 rows and four categorical attributes [Class, Age, Sex, and Survived].

f = tn$Survived ~ tn$Class + tn$Age + tn$Sex # declare the regression formula  
tnrTree = rpart(f, method = 'class') # train the tree  
summary(tnrTree) # view the tree summary  

The first command defines the regression formula f. The Class, Age, and Sex attributes are used in the regression formula to predict the Survived attribute. The second command trains a regression classification tree using the formula f. The third command prints out the summary of the regression tree object. Here is the output from summary(tnrTree).

Call: 
rpart(formula = f, method = "class")  
  n= 2201  

          CP nsplit rel error    xerror       xstd  
1 0.30661041      0 1.0000000 1.0000000 0.03085662 
2 0.02250352      1 0.6933896 0.6933896 0.02750982 
3 0.01125176      2 0.6708861 0.6863572 0.02741000 
4 0.01000000      4 0.6483826 0.6765120 0.02726824 

Variable importance 
  tn$Sex tn$Class tn$Age 
      73       23      4 

Node number 1: 2201 observations, complexity param=0.3066104 
  predicted class=No expected loss=0.323035 P(node) =1 
    class counts:  1490   711 
   probabilities: 0.677 0.323 
  left son=2 (1731 obs) right son=3 (470 obs) 
  Primary splits: 
      tn$Sex   splits as RL,   improve=199.821600, (0 missing) 
      tn$Class splits as LRRL, improve= 69.684100, (0 missing) 
      tn$Age   splits as LR,   improve=  9.165241, (0 missing) 

Node number 2: 1731 observations, complexity param=0.01125176 
  predicted class=No expected loss=0.2120162 P(node) =0.7864607 
    class counts:  1364   367 
   probabilities: 0.788 0.212 
  left son=4 (1667 obs) right son=5 (64 obs) 
  Primary splits: 
      tn$Age   splits as LR,   improve=7.726764, (0 missing) 
      tn$Class splits as LRLL, improve=7.046106, (0 missing) 

Node number 3: 470 observations, complexity param=0.02250352 
  predicted class=Yes expected loss=0.2680851 P(node) =0.2135393 
    class counts:   126   344 
   probabilities: 0.268 0.732 
  left son=6 (196 obs) right son=7 (274 obs) 
  Primary splits: 
      tn$Class splits as RRRL, improve=50.015320, (0 missing) 
      tn$Age   splits as RL,   improve= 1.197586, (0 missing) 
  Surrogate splits: 
      tn$Age splits as RL, agree=0.619, adj=0.087, (0 split) 

Node number 4: 1667 observations 
  predicted class=No expected loss=0.2027594 P(node) =0.757383 
    class counts:  1329  338 
   probabilities: 0.797 0.203  

Node number 5: 64 observations, complexity param=0.01125176 
  predicted class=No expected loss=0.453125 P(node) =0.02907769 
    class counts:    35    29 
   probabilities: 0.547 0.453 
  left son=10 (48 obs) right son=11 (16 obs) 
  Primary splits: 
      tn$Class splits as -RRL, improve=12.76042, (0 missing) 

Node number 6: 196 observations 
  predicted class=No expected loss=0.4591837 P(node) =0.08905043 
    class counts:   106    90 
   probabilities: 0.541 0.459  

Node number 7: 274 observations 
  predicted class=Yes expected loss=0.0729927 P(node) =0.1244889 
    class counts:    20   254 
   probabilities: 0.073 0.927 

Node number 10: 48 observations 
  predicted class=No expected loss=0.2708333 P(node) =0.02180827 
    class counts:    35    13 
   probabilities: 0.729 0.271  

Node number 11: 16 observations 
  predicted class=Yes expected loss=0 P(node) =0.007269423 
    class counts:     0    16 
   probabilities: 0.000 1.000  

Here is the output from the plot( ) function included in the rpart package:

par(xpd = TRUE) # define graphic parameter 
plot(tnrTree, main = 'Titanic regression tree') # plot the tree  
text(tnrTree, use.n = TRUE) # add text labels to tree  




Here is the same titanic regression tree using rpart.plot( ):

rpart.plot(tnrTree, main = 'Titanic regression tree') # a better tree plot  




This concludes the classification exercise using C5.0 and rpart. Hopefully, this can help you set up a classification model with either of these methods. 







