Decision tree is a supervised learning algorithm. It allows us to predict future outcome of a process by asking a series of dependent questions. Given historical data of the process, decision tree learns best set of questions to ask and the sequence in which those questions should be asked. Most important question is asked first and then next important one and so on, making decision tree where first question is at root and terminals or leaves are outcomes and in between are split/decision nodes. The intuition of decision trees is very similar to decision making of us human beings.
Imagine every Saturday you go out to play golf based on following weather factors;
- Outlook: sunny, overcast, rain
- Temperature: hot, cool, mild
- Humidity: high, normal
- Wind: weak, strong
Imagine, over last 14 weeks you have been keeping record of above factors along with outcome i.e. whether there was a play or no-play. Now you want to build a decision tree model so that every Saturday you can feed values of outlook, temperature, humidity and wind to the model and rightly predict whether you might go for play or no-play.
Although in our example outcome variable is categorical, but decision trees can also handle real-valued. That means we can build models of both regression and classification decision trees.
Technically speaking, a decision tree consists of three types of nodes;
- Root node is the top node and do not have any parent.
- Split node, also referred as decision node, has one parent and one or more child nodes.
- Leaf or terminal node holds class label
How do we build decision tree using labelled training data?
Given labelled training data, how do we actually learn a decision tree? Decision trees are built by defining a set of rules (questions) against input attributes (X) of training data. Each attribute (x) together with associated rule makes a decision node at which we split data into two or more further branches. Top split node is node is called root.
Imagine we are sitting at root and have to make a split on data. How do we decide which attribute is best to make that split? A perfect attribute would ideally split training examples into subsets which are either all negative or all positive.
How do we measure strength of an attribute? We often use following metrics for that;
- Information gain measures worth of an attribute in terms purity of the split.
- Entropy is a measure of uncertainty or randomness of a set of node objects. High information gain would always result in lower entropy.
- Cost function measures cost we have to pay for accuracy
In our golf play decision, the split at root node would be made at attribute x if x is most information among all features.
Pros and Cons:
Decision trees are very simple and easy to interpret. Their visual representation can help us to understand strength and relationship among attributes. Decision trees work with both categorical and numerical features, and can handle missing values.
Decision trees are often biased towards attributes with fewer levels or unique values. They can even overfit (bigger tree) if training data is noisy. Additionally, large number of features may lead to complex decision tree.