An Integrated Study on Decision Tree Induction Algorithms in Data Mining

1. INTRODUCTION1600 dogs. The result shows that there are 26 races
There are many alternatives to represent classifiers.that the ID3 did not make any bet. This showed that
The decision tree is probably the most widely usedthe system was restricted from making any illogical
approach for this purpose. Originally it has beenchoices, which is unlike human that were to gamble
studied in the fields of decision theory and statistics.without logic in order to gain more winning.
However, it was found to be effective in other5. C4.5 ALGORITHM
disciplines such as data mining, machine learning, andAt each node of the tree, C4.5 chooses one
pattern recognition. Decision trees are alsoattribute of the data that most effectively splits its
implemented in many real-world applications. Given theset of samples into subsets enriched in one class or
long history and the intense interest in this approach,the other. Its criterion is the normalized information
it is not surprising that several surveys on decisiongain (difference in entropy) that results from
trees are available in the literature. Nevertheless, thischoosing an attribute for splitting the data. The
survey proposes a profound but concise descriptionattribute with the highest normalized information gain
of issues related specifically to top-down constructionis chosen to make the decision. The C4.5 algorithm
of decision trees, which is considered the mostthen recurses on the smaller sublists. This algorithm
popular construction approach. This paper aims tohas a few base cases.
organize all significant methods developed into a- All the samples in the list belong to the same class.
coherent and unified reference.When this happens, it simply creates a leaf node for
2. DECISION TREESthe decision tree saying to choose that class.
A decision tree (or tree diagram) is a decision support- None of the features provide any information gain.
tool that uses a tree-like graph or model of decisionsIn this case, C4.5 creates a decision node higher up
and their possible consequences, including chancethe tree using the expected value of the class.
event outcomes, resource costs, and utility. Decision- Instance of previously-unseen class encountered.
trees are commonly used in operations research,Again, C4.5 creates a decision node higher up the
specifically in decision analysis, to help identify atree using the expected value.
strategy most likely to reach a goal. Another use ofIn pseudo code the algorithm is
decision trees is as a descriptive means for calculating1. Check for base cases
conditional probabilities. In data mining and machine2. For each attribute a
learning, a decision tree is a predictive model; that is,3. Find the normalized information gain
a mapping from observations about an item toLet a_best be the attribute with the highest
conclusions about its target value. More descriptivenormalized information gain
names for such tree models are classification treeCreate a decision node that splits on a_best
(discrete outcome) or regression tree (continuousRecurse on the sublists obtained by splitting on
outcome). In these tree structures, leaves representa_best, and add those nodes as children of node
classifications and branches represent conjunctions ofImprovements from ID3 algorithm
features that lead to those classifications. TheC4.5 made a number of improvements to ID3. Some
machine learning technique for inducing a decision treeof these are:
from data is called decision tree learning, or- Handling both continuous and discrete attributes - In
(colloquially) decision trees.order to handle continuous attributes, C4.5 creates a
3. DECISION TREE REPRESNTATIONthreshold and then splits the list into those whose
The decision tree induction algorithm has been usedattribute value is above the threshold and those that
broadly for several years. It is an approximationare less than or equal to it.
discrete function method and can yield lots of useful- Handling training data with missing attribute values -
expressions. It is one of the most important methodsC4.5 allows attribute values to be marked for missing.
for classification. This algorithm’s terms followMissing attribute values are simply not used in gain
the “tree” metaphor. It has a root, which isand entropy calculations.
the first split point of the data attribute for building a- Handling attributes with differing costs.
decision tree. It also has leaves, so that every path- Pruning trees after creation - C4.5 goes back
from root to leaf will form a rule that is easilythrough the tree once it's been created and
understood. Since the decision tree is built by givenattempts to remove branches that do not help by
data, the data value and character will be morereplacing them with leaf nodes.
important. For example, the amount of data will6. CART ALGORITHM
affect the result of the tree building procedure. TheClassification and regression trees (CART) is a
type of attribute value will also affect the treenon-parametric technique that produces either
model. Decision trees need two kinds of data:classification or regression trees, depending on
Training and Testing.whether the dependent variable is categorical or
Training data, which are usually the bigger part ofnumeric, respectively. Trees are formed by a
data, are used for constructing trees. The morecollection of rules based on values of certain variables
training data collected, the higher the accuracy of thein the modeling data set.
results. The other group of data, testing, is used to- Rules are selected based on how well splits based
get the accuracy rate and misclassification rate ofon variables’ values can differentiate
the decision tree. Many decision-tree algorithms haveobservations based on the dependent variable
been developed. One of the most famous is ID3- Once a rule is selected and splits a node into two,
(Quinlan 1986, 1983), whose choice of split attribute isthe same logic is applied to each “child” node
based on information entropy. C4.5 is an extension of(i.e. it is a recursive procedure)
ID3 (Prather et al. 1997). It improves computing- Splitting stops when CART detects no further gain
efficiency, deals with continuous values, handlescan be made, or some pre-set stopping rules are met
attributes with missing values, avoids over fitting, andEach branch of the tree ends in a terminal node
performs other functions.- Each observation falls into one and exactly one
CART (Classification and Regression tree) is aterminal node
data-exploration and prediction algorithm similar to- Each terminal node is uniquely defined by a set of
C4.5, which is a tree construction algorithm. Breimanrules
et al. (1984) summarized the classification andThe basic idea of tree growing is to choose a split
regression tree. Instead of information entropy, itamong all the possible splits at each node so that the
introduces measures of node impurity. It is used on aresulting child nodes are the “purest”. In this
variety of different problems, such as the detectionalgorithm, only univariate splits are considered. That is,
of chlorine from the data contained in a masseach split depends on the value of only one predictor
spectrum). Although decision trees may not be thevariable. All possible splits consist of possible splits of
best method for classification accuracy, even peopleeach predictor.
who are not familiar with them find them easy to7. COMPARISON OF ID3, C4.5 and CART
use and understand. Figure 1 shows a binary decisionAlgorithm designers have had much success with
tree. It gives us an impression of a decision. It uses agreedy, divide-and-conquer approaches to building
circle as the decision node and a square as theclass descriptions. It is chosen decision tree learners
terminal node. Each decision node has a condition thatmade popular by ID3, C4.5 (Quinlan1986) and CART
is represented by a function F, and the parameter is(Breiman, Friedman, Olshen, and Stone 1984) for this
the split point of the split attribute. Each terminalsurvey, because they are relatively fast and typically
node has a class label C, the value of whichthey produce competitive classifiers. In fact, the
represents a class. It is apparent that it is easy todecision tree generator C4.5, a successor to ID3, has
use decision trees to interpret the tree to rules, frombecome a standard factor for comparison in machine
which we can do analysis, and easy to interpret thelearning research, because it produces good classifiers
representation of a nonlinear input-output mappingquickly. For non numeric datasets, the growth of the
(Jang 1994).run time of ID3 (and C4.5) is linear in all examples.
Figure 1: A Typical binary Decision treeThe practical run time complexity of C4.5 has been
Figure 1. A typical binary decision tree Lots of worksdetermined empirically to be worse than O (e2) on
address the splitting node choosing method andsome datasets. One possible explanation is based on
optimization of tree size, but less attention has beenthe observation of Oates and Jensen (1998) that the
given to the weight of the data attributes. In thissize of C4.5 trees increases linearly with the number
study, we use a system-reconstruction analysisof examples. One of the factors of a in C4.5’s
method to get the weight of each attribute, whichrun-time complexity corresponds to the tree depth,
we use to reform raw data. After that, we use thewhich cannot be larger than the number of attributes.
decision-tree algorithm mentioned above to build aTree depth is related to tree size, and thereby to
decision tree, from which we can find thethe number of examples. When compared with C4.5,
decision-accuracy and misclassification rates.the run time complexity of CART is satisfactory.
4. ID3 ALGORITHM8. CONCLUSION
The ID3 algorithm can be summarized as follows:The decision-tree algorithm is one of the most
Take all unused attributes and count their entropyeffective classification methods. The data will judge
concerning test samplesthe efficiency and correction rate of the algorithm.
1. Choose attribute for which entropy is maximumThe survey is made on the decision tree algorithms
2. Make node containing that attributeID3, C4.5 and CART towards their steps of
The algorithm is as follows:processing data and Complexity of running data. The
According to Gestwicki, Itemized Dichotomozer 3inductive learning algorithms had successfully
algorithm, or better known as ID3 algorithm was firstrecognized and generalized the rules contains in the
introduced by J.R Quinlan in the late 1970’s. Thetraining data given. The accuracies for the algorithms
algorithm ‘learned’ from relatively smallwere also very high, which means the system
training set of data to organize and process veryproduced a reliable result. This result also showed that
large data sets. Ballard stated that ID3 algorithm is ainductive learning can be successfully implemented in
greedy algorithm that selects the next attributesa complex problem domain, and therefore it is very
based on the information gain associated with theuseful to be implemented in the real world problems.
attributes. The information gain is measured byThe second conclusion is that the algorithms had the
entropy, where Claude Shannon first introduced theability to learn new rules and therefore had the ability
idea in 1948.to adapt to changes. Finally it can be concluded that
ID3 algorithm prefers that the generated tree isbetween the three algorithms, the CART algorithm
shorter and the attributes with lower entropies areperforms better in performance of rules generated
put near the top of the tree. These techniquesand accuracy. CART algorithm produced less rules
satisfy the idea of Occam’s Razor. Occam’syet was more accurate than the other two
Razor stated that, “one should not increase,algorithms. This showed that the CART algorithm is
beyond what is necessary, the number of entitiesbetter in induction and rules generalization compared
required to explain anything”, which means thatto ID3 algorithm and C4.5 algorithm.
one should not make more assumptions thanACKNOWLEDGEMENT
minimum needed. Hild described the basic technique1. First, I would like to thank Almighty for His
on the implementation of ID3 algorithm and it isblessings towards the successful completion of this
shown below.survey paper. I would like to extend my thanks to
1. For each uncategorized attribute, its entropy wouldmy Research
be calculated with respect to the categorizedGuide                            Dr.
attribute or conclusion.(Mrs.) M. Punithavalli, Director, Dept. of Computer
2. The attribute with lowest entropy would beScience, Sri Rama Krishna College for Women,
selected.Coimbatore for her valuable assistance, help and
3. The data would be divided into sets according toguidance during the research process. I also would like
the attribute’s value. For example, if theto extend my gratitude to my Husband      
  attribute ‘Size’ was chosen, and the               Mr. M. S. Raja Sekaran for his
values for ‘Size’ were ‘big’,moral support and co-operation.
‘medium’ and ‘small, therefore threeREFERENCES
sets would be created, divided by these values.[1] S. R. Safavin and D. Landgrebe. A survey of
4. A tree with branches that represent the setsdecision tree classifier methodology. IEEE Trans. on
would be constructed. For the above example, threeSystems, Man and Cybernetics, 21(3):660-674, 1991.
branches would be created where first branch would[2] S. K. Murthy, Automatic Construction of Decision
be ‘big’, second branch would beTrees from Data: A  MultiDisciplinary Survey. Data
‘medium’ and third branch would beMining and Knowledge Discovery, 2(4):345-389, 1998.
‘small’.[3] R. Kohavi and J. R. Quinlan. Decision-tree discovery.
5. Step 1 would be repeated for each branch, but theIn Will Klosgen and Jan M. Zytkow, editors, Handbook
already selected attribute would be removed and theof Data Mining and Knowledge Discovery, chapter
data used was only the data that exists in the sets.16.1.3, pages 267-276. Oxford University Press, 2002.
6. The process stopped when there were no more[4] S. Grumbach and T. Milo: Towards Tractable
attribute to be considered or the data in the set hadAlgebras for Bags. Journal of Computer and System
the same conclusion, for example, all data had theSciences 52(3): 570-588, 1996. IEEE TRANSACTIONS
‘Result’ = yes.ON SYSTEMS, MAN AND CYBERNETICS: PART C,
ID3 algorithm had been used and implemented inVOL. 1, NO. 11, NOVEMBER 2002 11
many fields. One of the earliest implementation of[5] L. Breiman, J. Friedman, R. Olshen, and C. Stone.
ID3 algorithm is on a chess game. Ivan Bratko, theClassification and Regression Trees. Wadsworth Int.
artificial intelligence researcher was the oneGroup, 1984.
implemented this chess game. According to[6] J.R. Quinlan, Simplifying decision trees, International
Gestwicki, Bratko supplied the ID3 program withJournal of Man-Machine Studies, 27, 221-234, 1987.
several pages of textbook recommendations for[7] T. R. Hancock, T. Jiang, M. Li, J. Tromp: Lower
playing the chess endgame of white king and rookBounds on Learning Decision Lists and Trees.
versus black king and knight. He made the rulesInformation and Computation 126(2): 114-122, 1996.
around the idea of ‘knight’s side lost in at[8] L. Hyafil and R.L. Rivest. Constructing optimal
most n moves’. The result shows that ID3binary decision trees is NP-complete. Information
algorithm is efficient in both time and spaceProcessing Letters, 5(1):15-17, 1976
considerations, where the feature vector of the[9] H. Zantema and H. L. Bodlaender, Finding Small
games and the decision tree size is small, comparedEquivalent Decision Trees is Hard, International Journal
to the training instances.of Foundations of Computer Science, 11(2):343-354,
In a study by Gestwicki, one experiment had been2000.
conducted to predict the greyhound race. The[10] G.E. Naumov. NP-completeness of problems of
experiment was to compare between the net profitconstruction of optimal decision trees. Soviet Physics:
gained by the ID3 algorithm and by threeDoklady, 36(4):270-271, 1991.
greyhound-racing experts. In this experiment, the[11] J.R. Quinlan, Induction of decision trees, Machine
system had been trained with 200 training races andLearning 1, 81-106, 1986.