| 1. INTRODUCTION | | | | 1600 dogs. The result shows that there are 26 races |
| There are many alternatives to represent classifiers. | | | | that the ID3 did not make any bet. This showed that |
| The decision tree is probably the most widely used | | | | the system was restricted from making any illogical |
| approach for this purpose. Originally it has been | | | | choices, which is unlike human that were to gamble |
| studied in the fields of decision theory and statistics. | | | | without logic in order to gain more winning. |
| However, it was found to be effective in other | | | | 5. C4.5 ALGORITHM |
| disciplines such as data mining, machine learning, and | | | | At each node of the tree, C4.5 chooses one |
| pattern recognition. Decision trees are also | | | | attribute of the data that most effectively splits its |
| implemented in many real-world applications. Given the | | | | set of samples into subsets enriched in one class or |
| long history and the intense interest in this approach, | | | | the other. Its criterion is the normalized information |
| it is not surprising that several surveys on decision | | | | gain (difference in entropy) that results from |
| trees are available in the literature. Nevertheless, this | | | | choosing an attribute for splitting the data. The |
| survey proposes a profound but concise description | | | | attribute with the highest normalized information gain |
| of issues related specifically to top-down construction | | | | is chosen to make the decision. The C4.5 algorithm |
| of decision trees, which is considered the most | | | | then recurses on the smaller sublists. This algorithm |
| popular construction approach. This paper aims to | | | | has a few base cases. |
| organize all significant methods developed into a | | | | - All the samples in the list belong to the same class. |
| coherent and unified reference. | | | | When this happens, it simply creates a leaf node for |
| 2. DECISION TREES | | | | the decision tree saying to choose that class. |
| A decision tree (or tree diagram) is a decision support | | | | - None of the features provide any information gain. |
| tool that uses a tree-like graph or model of decisions | | | | In this case, C4.5 creates a decision node higher up |
| and their possible consequences, including chance | | | | the tree using the expected value of the class. |
| event outcomes, resource costs, and utility. Decision | | | | - Instance of previously-unseen class encountered. |
| trees are commonly used in operations research, | | | | Again, C4.5 creates a decision node higher up the |
| specifically in decision analysis, to help identify a | | | | tree using the expected value. |
| strategy most likely to reach a goal. Another use of | | | | In pseudo code the algorithm is |
| decision trees is as a descriptive means for calculating | | | | 1. Check for base cases |
| conditional probabilities. In data mining and machine | | | | 2. For each attribute a |
| learning, a decision tree is a predictive model; that is, | | | | 3. Find the normalized information gain |
| a mapping from observations about an item to | | | | Let a_best be the attribute with the highest |
| conclusions about its target value. More descriptive | | | | normalized information gain |
| names for such tree models are classification tree | | | | Create a decision node that splits on a_best |
| (discrete outcome) or regression tree (continuous | | | | Recurse on the sublists obtained by splitting on |
| outcome). In these tree structures, leaves represent | | | | a_best, and add those nodes as children of node |
| classifications and branches represent conjunctions of | | | | Improvements from ID3 algorithm |
| features that lead to those classifications. The | | | | C4.5 made a number of improvements to ID3. Some |
| machine learning technique for inducing a decision tree | | | | of these are: |
| from data is called decision tree learning, or | | | | - Handling both continuous and discrete attributes - In |
| (colloquially) decision trees. | | | | order to handle continuous attributes, C4.5 creates a |
| 3. DECISION TREE REPRESNTATION | | | | threshold and then splits the list into those whose |
| The decision tree induction algorithm has been used | | | | attribute value is above the threshold and those that |
| broadly for several years. It is an approximation | | | | are less than or equal to it. |
| discrete function method and can yield lots of useful | | | | - Handling training data with missing attribute values - |
| expressions. It is one of the most important methods | | | | C4.5 allows attribute values to be marked for missing. |
| for classification. This algorithm’s terms follow | | | | Missing attribute values are simply not used in gain |
| the “tree” metaphor. It has a root, which is | | | | and entropy calculations. |
| the first split point of the data attribute for building a | | | | - Handling attributes with differing costs. |
| decision tree. It also has leaves, so that every path | | | | - Pruning trees after creation - C4.5 goes back |
| from root to leaf will form a rule that is easily | | | | through the tree once it's been created and |
| understood. Since the decision tree is built by given | | | | attempts to remove branches that do not help by |
| data, the data value and character will be more | | | | replacing them with leaf nodes. |
| important. For example, the amount of data will | | | | 6. CART ALGORITHM |
| affect the result of the tree building procedure. The | | | | Classification and regression trees (CART) is a |
| type of attribute value will also affect the tree | | | | non-parametric technique that produces either |
| model. Decision trees need two kinds of data: | | | | classification or regression trees, depending on |
| Training and Testing. | | | | whether the dependent variable is categorical or |
| Training data, which are usually the bigger part of | | | | numeric, respectively. Trees are formed by a |
| data, are used for constructing trees. The more | | | | collection of rules based on values of certain variables |
| training data collected, the higher the accuracy of the | | | | in the modeling data set. |
| results. The other group of data, testing, is used to | | | | - Rules are selected based on how well splits based |
| get the accuracy rate and misclassification rate of | | | | on variables’ values can differentiate |
| the decision tree. Many decision-tree algorithms have | | | | observations based on the dependent variable |
| been developed. One of the most famous is ID3 | | | | - Once a rule is selected and splits a node into two, |
| (Quinlan 1986, 1983), whose choice of split attribute is | | | | the same logic is applied to each “child” node |
| based on information entropy. C4.5 is an extension of | | | | (i.e. it is a recursive procedure) |
| ID3 (Prather et al. 1997). It improves computing | | | | - Splitting stops when CART detects no further gain |
| efficiency, deals with continuous values, handles | | | | can be made, or some pre-set stopping rules are met |
| attributes with missing values, avoids over fitting, and | | | | Each branch of the tree ends in a terminal node |
| performs other functions. | | | | - Each observation falls into one and exactly one |
| CART (Classification and Regression tree) is a | | | | terminal node |
| data-exploration and prediction algorithm similar to | | | | - Each terminal node is uniquely defined by a set of |
| C4.5, which is a tree construction algorithm. Breiman | | | | rules |
| et al. (1984) summarized the classification and | | | | The basic idea of tree growing is to choose a split |
| regression tree. Instead of information entropy, it | | | | among all the possible splits at each node so that the |
| introduces measures of node impurity. It is used on a | | | | resulting child nodes are the “purest”. In this |
| variety of different problems, such as the detection | | | | algorithm, only univariate splits are considered. That is, |
| of chlorine from the data contained in a mass | | | | each split depends on the value of only one predictor |
| spectrum). Although decision trees may not be the | | | | variable. All possible splits consist of possible splits of |
| best method for classification accuracy, even people | | | | each predictor. |
| who are not familiar with them find them easy to | | | | 7. COMPARISON OF ID3, C4.5 and CART |
| use and understand. Figure 1 shows a binary decision | | | | Algorithm designers have had much success with |
| tree. It gives us an impression of a decision. It uses a | | | | greedy, divide-and-conquer approaches to building |
| circle as the decision node and a square as the | | | | class descriptions. It is chosen decision tree learners |
| terminal node. Each decision node has a condition that | | | | made popular by ID3, C4.5 (Quinlan1986) and CART |
| is represented by a function F, and the parameter is | | | | (Breiman, Friedman, Olshen, and Stone 1984) for this |
| the split point of the split attribute. Each terminal | | | | survey, because they are relatively fast and typically |
| node has a class label C, the value of which | | | | they produce competitive classifiers. In fact, the |
| represents a class. It is apparent that it is easy to | | | | decision tree generator C4.5, a successor to ID3, has |
| use decision trees to interpret the tree to rules, from | | | | become a standard factor for comparison in machine |
| which we can do analysis, and easy to interpret the | | | | learning research, because it produces good classifiers |
| representation of a nonlinear input-output mapping | | | | quickly. For non numeric datasets, the growth of the |
| (Jang 1994). | | | | run time of ID3 (and C4.5) is linear in all examples. |
| Figure 1: A Typical binary Decision tree | | | | The practical run time complexity of C4.5 has been |
| Figure 1. A typical binary decision tree Lots of works | | | | determined empirically to be worse than O (e2) on |
| address the splitting node choosing method and | | | | some datasets. One possible explanation is based on |
| optimization of tree size, but less attention has been | | | | the observation of Oates and Jensen (1998) that the |
| given to the weight of the data attributes. In this | | | | size of C4.5 trees increases linearly with the number |
| study, we use a system-reconstruction analysis | | | | of examples. One of the factors of a in C4.5’s |
| method to get the weight of each attribute, which | | | | run-time complexity corresponds to the tree depth, |
| we use to reform raw data. After that, we use the | | | | which cannot be larger than the number of attributes. |
| decision-tree algorithm mentioned above to build a | | | | Tree depth is related to tree size, and thereby to |
| decision tree, from which we can find the | | | | the number of examples. When compared with C4.5, |
| decision-accuracy and misclassification rates. | | | | the run time complexity of CART is satisfactory. |
| 4. ID3 ALGORITHM | | | | 8. CONCLUSION |
| The ID3 algorithm can be summarized as follows: | | | | The decision-tree algorithm is one of the most |
| Take all unused attributes and count their entropy | | | | effective classification methods. The data will judge |
| concerning test samples | | | | the efficiency and correction rate of the algorithm. |
| 1. Choose attribute for which entropy is maximum | | | | The survey is made on the decision tree algorithms |
| 2. Make node containing that attribute | | | | ID3, C4.5 and CART towards their steps of |
| The algorithm is as follows: | | | | processing data and Complexity of running data. The |
| According to Gestwicki, Itemized Dichotomozer 3 | | | | inductive learning algorithms had successfully |
| algorithm, or better known as ID3 algorithm was first | | | | recognized and generalized the rules contains in the |
| introduced by J.R Quinlan in the late 1970’s. The | | | | training data given. The accuracies for the algorithms |
| algorithm ‘learned’ from relatively small | | | | were also very high, which means the system |
| training set of data to organize and process very | | | | produced a reliable result. This result also showed that |
| large data sets. Ballard stated that ID3 algorithm is a | | | | inductive learning can be successfully implemented in |
| greedy algorithm that selects the next attributes | | | | a complex problem domain, and therefore it is very |
| based on the information gain associated with the | | | | useful to be implemented in the real world problems. |
| attributes. The information gain is measured by | | | | The second conclusion is that the algorithms had the |
| entropy, where Claude Shannon first introduced the | | | | ability to learn new rules and therefore had the ability |
| idea in 1948. | | | | to adapt to changes. Finally it can be concluded that |
| ID3 algorithm prefers that the generated tree is | | | | between the three algorithms, the CART algorithm |
| shorter and the attributes with lower entropies are | | | | performs better in performance of rules generated |
| put near the top of the tree. These techniques | | | | and accuracy. CART algorithm produced less rules |
| satisfy the idea of Occam’s Razor. Occam’s | | | | yet was more accurate than the other two |
| Razor stated that, “one should not increase, | | | | algorithms. This showed that the CART algorithm is |
| beyond what is necessary, the number of entities | | | | better in induction and rules generalization compared |
| required to explain anything”, which means that | | | | to ID3 algorithm and C4.5 algorithm. |
| one should not make more assumptions than | | | | ACKNOWLEDGEMENT |
| minimum needed. Hild described the basic technique | | | | 1. First, I would like to thank Almighty for His |
| on the implementation of ID3 algorithm and it is | | | | blessings towards the successful completion of this |
| shown below. | | | | survey paper. I would like to extend my thanks to |
| 1. For each uncategorized attribute, its entropy would | | | | my Research |
| be calculated with respect to the categorized | | | | Guide Dr. |
| attribute or conclusion. | | | | (Mrs.) M. Punithavalli, Director, Dept. of Computer |
| 2. The attribute with lowest entropy would be | | | | Science, Sri Rama Krishna College for Women, |
| selected. | | | | Coimbatore for her valuable assistance, help and |
| 3. The data would be divided into sets according to | | | | guidance during the research process. I also would like |
| the attribute’s value. For example, if the | | | | to extend my gratitude to my Husband |
| attribute ‘Size’ was chosen, and the | | | | Mr. M. S. Raja Sekaran for his |
| values for ‘Size’ were ‘big’, | | | | moral support and co-operation. |
| ‘medium’ and ‘small, therefore three | | | | REFERENCES |
| sets would be created, divided by these values. | | | | [1] S. R. Safavin and D. Landgrebe. A survey of |
| 4. A tree with branches that represent the sets | | | | decision tree classifier methodology. IEEE Trans. on |
| would be constructed. For the above example, three | | | | Systems, Man and Cybernetics, 21(3):660-674, 1991. |
| branches would be created where first branch would | | | | [2] S. K. Murthy, Automatic Construction of Decision |
| be ‘big’, second branch would be | | | | Trees from Data: A MultiDisciplinary Survey. Data |
| ‘medium’ and third branch would be | | | | Mining and Knowledge Discovery, 2(4):345-389, 1998. |
| ‘small’. | | | | [3] R. Kohavi and J. R. Quinlan. Decision-tree discovery. |
| 5. Step 1 would be repeated for each branch, but the | | | | In Will Klosgen and Jan M. Zytkow, editors, Handbook |
| already selected attribute would be removed and the | | | | of Data Mining and Knowledge Discovery, chapter |
| data used was only the data that exists in the sets. | | | | 16.1.3, pages 267-276. Oxford University Press, 2002. |
| 6. The process stopped when there were no more | | | | [4] S. Grumbach and T. Milo: Towards Tractable |
| attribute to be considered or the data in the set had | | | | Algebras for Bags. Journal of Computer and System |
| the same conclusion, for example, all data had the | | | | Sciences 52(3): 570-588, 1996. IEEE TRANSACTIONS |
| ‘Result’ = yes. | | | | ON SYSTEMS, MAN AND CYBERNETICS: PART C, |
| ID3 algorithm had been used and implemented in | | | | VOL. 1, NO. 11, NOVEMBER 2002 11 |
| many fields. One of the earliest implementation of | | | | [5] L. Breiman, J. Friedman, R. Olshen, and C. Stone. |
| ID3 algorithm is on a chess game. Ivan Bratko, the | | | | Classification and Regression Trees. Wadsworth Int. |
| artificial intelligence researcher was the one | | | | Group, 1984. |
| implemented this chess game. According to | | | | [6] J.R. Quinlan, Simplifying decision trees, International |
| Gestwicki, Bratko supplied the ID3 program with | | | | Journal of Man-Machine Studies, 27, 221-234, 1987. |
| several pages of textbook recommendations for | | | | [7] T. R. Hancock, T. Jiang, M. Li, J. Tromp: Lower |
| playing the chess endgame of white king and rook | | | | Bounds on Learning Decision Lists and Trees. |
| versus black king and knight. He made the rules | | | | Information and Computation 126(2): 114-122, 1996. |
| around the idea of ‘knight’s side lost in at | | | | [8] L. Hyafil and R.L. Rivest. Constructing optimal |
| most n moves’. The result shows that ID3 | | | | binary decision trees is NP-complete. Information |
| algorithm is efficient in both time and space | | | | Processing Letters, 5(1):15-17, 1976 |
| considerations, where the feature vector of the | | | | [9] H. Zantema and H. L. Bodlaender, Finding Small |
| games and the decision tree size is small, compared | | | | Equivalent Decision Trees is Hard, International Journal |
| to the training instances. | | | | of Foundations of Computer Science, 11(2):343-354, |
| In a study by Gestwicki, one experiment had been | | | | 2000. |
| conducted to predict the greyhound race. The | | | | [10] G.E. Naumov. NP-completeness of problems of |
| experiment was to compare between the net profit | | | | construction of optimal decision trees. Soviet Physics: |
| gained by the ID3 algorithm and by three | | | | Doklady, 36(4):270-271, 1991. |
| greyhound-racing experts. In this experiment, the | | | | [11] J.R. Quinlan, Induction of decision trees, Machine |
| system had been trained with 200 training races and | | | | Learning 1, 81-106, 1986. |