[Artificial Intelligence] Decision Tree (ID3)
Decision Tree Learning
- Candidate-Elimination algorithm: finds a Boolean function of the attributes
- Decision tree learning, in contrast:
- Applicable even to functions with two or more output values
- Widely used
- Robust to noise
- Generates disjunctive expressions
- Easily interpretable (tree structure, if-then rules)
- Classification decision trees
- Classify examples into one of a set of categories
- The learned function is represented as a tree
- Each node of the tree tests an attribute of the example
- Branches correspond to attribute values
- The output is obtained by following a path from the root node to a leaf node
- The tree forms a hypothesis
- Disjunction (OR’s) of conjunctions (AND’s)
- Each path from root to leaf forms a conjunction of constraints on attributes
- Separate branches are disjunctions
- Example from the PlayTennis decision tree (a code sketch of this tree follows the list):
- (Outlook=Sunny $\wedge$ Humidity=Normal) $\vee$ (Outlook=Overcast) $\vee$ (Outlook=Rain $\wedge$ Wind=Weak)
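As a concrete illustration (not part of the original notes), the PlayTennis tree implied by the disjunction above can be stored as a nested dictionary and walked from root to leaf. The tree shape below is assumed from that disjunction, with the 'No' leaves for Humidity=High and Wind=Strong filled in because the attributes are binary.

```python
# A minimal sketch (assumed tree shape, matching the disjunction above):
# internal nodes are dicts keyed by the attribute they test, leaves are class labels.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, example):
    """Follow one root-to-leaf path, testing one attribute per node."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][example[attribute]]   # follow the branch labelled by its value
    return tree                                      # the leaf holds the predicted class

print(classify(play_tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes
```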
Types of problems decision tree learning is good for:
- Instances represented by attribute-value pairs
- For the algorithm in these notes, attributes take on a small number of discrete values
- Can be extended to real-valued attributes (numerical data)
- Target function has discrete output values
- Can be extended to multiple output values
- Hypothesis space can include disjunctive expressions.
- In fact, the hypothesis space is the complete space of finite discrete-valued functions
- Robust to imperfect training data
- classification errors
- errors in attribute values
- missing attribute values
- Examples:
- Equipment diagnosis
- Medical diagnosis
- Credit card risk analysis
- Robot movement
- Pattern Recognition
- face recognition
ID3 Algorithm
- A greedy algorithm for decision tree construction, developed by Ross Quinlan (1986)
- Top-down, greedy search through the space of possible decision trees
- Remember, decision trees represent hypotheses, so this is a search through hypothesis space.
- What is top-down?
- How do you start the tree?
- What attribute should represent the root?
- As you proceed down the tree, choose an attribute for each successive node.
- No backtracking
- So the algorithm proceeds from top to bottom
The ID3 algorithm builds a decision tree given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set S of records.
```
function ID3(R: a set of non-categorical attributes,
             C: the categorical attribute,
             S: a training set) returns a decision tree;
begin
    If S is empty, return a single node with value Failure;
    If every example in S has the same value for the categorical
        attribute, return a single node with that value;
    If R is empty, then return a single node with the most
        frequent of the values of the categorical attribute found in
        the examples of S; [note: there will be errors, i.e., improperly
        classified records];
    Let D be the attribute with the largest Gain(D, S) among R's attributes;
    Let {dj | j = 1, 2, ..., m} be the values of attribute D;
    Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting
        respectively of records with value dj for attribute D;
    Return a tree with root labeled D and arcs labeled
        d1, d2, ..., dm going respectively to the trees
        ID3(R - {D}, C, S1), ID3(R - {D}, C, S2), ..., ID3(R - {D}, C, Sm);
end ID3;
```
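The pseudocode above translates fairly directly into Python. The sketch below is my own transcription, not from the notes: examples are assumed to be dicts mapping attribute names to values, with a parallel list of class labels, and Gain(D, S) is computed with the entropy-based definition developed in the Information Theory section that follows.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class labels in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(records, labels, attribute):
    """Gain(D, S): entropy of S minus the weighted entropy of the subsets S_j."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [lab for r, lab in zip(records, labels) if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(attributes, records, labels):
    """R = attributes, S = (records, labels); returns a nested-dict tree or a leaf label."""
    if not records:
        return "Failure"                                  # S is empty
    if len(set(labels)) == 1:
        return labels[0]                                  # all examples share one class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # most frequent class (may err)
    best = max(attributes, key=lambda a: gain(records, labels, a))   # D = argmax Gain(D, S)
    tree = {best: {}}
    for value in set(r[best] for r in records):           # one arc per value d_j of D
        sub_records = [r for r in records if r[best] == value]
        sub_labels = [lab for r, lab in zip(records, labels) if r[best] == value]
        tree[best][value] = id3([a for a in attributes if a != best],
                                sub_records, sub_labels)  # ID3(R - {D}, C, S_j)
    return tree
```

Calling, for example, `id3(["Outlook", "Temperature", "Humidity", "Wind"], records, labels)` on the PlayTennis data described later would return a nested dict of the same shape as the tree sketch shown earlier.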
- What is a greedy search?
- At each step, make the decision that gives the greatest improvement in whatever you are trying to optimize.
- Do not backtrack (unless you hit a dead end)
- This type of search is unlikely to find a globally optimal solution, but it generally works well.
- What are we really doing here?
- At each node of the tree, decide which attribute best classifies the training data at that point.
- Never backtrack (in ID3)
- Do this for each branch of the tree.
- The end result is a tree structure representing the hypothesis that works best for the training data.
Information Theory Background
- Question:
- How do you determine which attribute best classifies the data?
- Answer: Entropy!
- Information gain:
- A statistical quantity measuring how well an attribute classifies the data.
- Calculate the information gain for each attribute.
- Choose the attribute with the greatest information gain.
Entropy
- If an event conveys information, that means it’s a surprise.
- If an event always occurs, $P(A_i)=1$, then it carries no information. $-\log_2(1) = 0$
- If an event rarely occurs (e.g. $P(A_i)=0.001$), it carries a lot of information. $-\log_2(0.001) = 9.97$
- The less likely the event, the more information it carries, since, for $0 \leq P(A_i) \leq 1$,
- $-\log_2(P(A_i))$ increases as $P(A_i)$ goes from 1 to 0.
- (Note: ignore events with $P(A_i)=0$ since they never occur.)
- In general, for an ensemble of random events $\{A_1, A_2, \ldots, A_n\}$ occurring with probabilities $\{P(A_1), P(A_2), \ldots, P(A_n)\}$:
- The self-information of event $i$ is $-\log_2(P(A_i))$.
- Entropy is the weighted average of the information carried by each event:
- $H = -\sum_{i=1}^{n} P(A_i)\log_2(P(A_i))$
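Here is a quick numeric check of the self-information values quoted above and of entropy as their weighted average (the fair-coin case is an extra illustration, not from the notes):

```python
from math import log2

print(-log2(1.0))     # certain event: 0 bits of information
print(-log2(0.001))   # rare event: ~9.97 bits
# Entropy = weighted average of self-information; e.g. a fair coin (p = 0.5, 0.5):
probs = [0.5, 0.5]
print(-sum(p * log2(p) for p in probs))   # 1.0 bit
```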
ID3
- Information gain is our metric for how well one attribute classifies the training data.
- Information gain for a particular attribute $A_i$:
- The reduction in the entropy of the target function when the value of that attribute is known (its entropy minus the conditional entropy given the attribute).
- Mathematical expression for information gain: $Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
- ID3 algorithm (for a Boolean-valued target function)
- Calculate the entropy over all training examples
- positive and negative cases
- $p_+ = N_{pos}/N_{tot}$, $p_- = N_{neg}/N_{tot}$
- $H(S) = -p_+\log_2(p_+) - p_-\log_2(p_-)$
- Determine which single attribute best classifies the training examples using information gain.
- For each attribute, compute $Gain(S, A)$ (a worked example follows this list)
- Use the attribute with the greatest information gain as the root
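As a worked illustration (the counts here are hypothetical, not from the notes), suppose a node holds 4 positive and 4 negative examples and there are two candidate Boolean attributes A and B:

- $H(S) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1.0$
- A splits S into (3+, 1-) and (1+, 3-); each subset has entropy $-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811$, so $Gain(S, A) \approx 1.0 - 0.811 = 0.189$
- B splits S into (2+, 2-) and (2+, 2-); each subset still has entropy 1.0, so $Gain(S, B) = 0$
- A has the greater information gain, so A is chosen for this node.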
Example: PlayTennis
- Four attributes used for classification:
- Outlook = {Sunny, Overcast, Rain}
- Temperature = {Hot, Mild, Cool}
- Humidity = {High, Normal}
- Wind = {Weak, Strong}
- One predicted (target) attribute (binary)
- PlayTennis = {Yes, No}
- Given 14 training examples (the root entropy for these counts is worked out after this list)
- 9 positive
- 5 negative
- Important:
- Attributes are excluded from consideration if they appear higher in the tree
- The process continues for each new leaf node until:
- Every attribute has already been included along the path through the tree, or
- The training examples associated with this leaf all have the same target attribute value.
- End up with the tree: Outlook at the root, with the Sunny branch testing Humidity (High → No, Normal → Yes), the Overcast branch a Yes leaf, and the Rain branch testing Wind (Strong → No, Weak → Yes), matching the disjunction given earlier.
- Note: in this example the data were perfect.
- No contradictions
- Branches led to unambiguous Yes/No decisions
- If there are contradictions, take the majority vote
- This handles noisy data.
- Another note:
- Attributes are eliminated when they are assigned to a node and never reconsidered.
- e.g., you would not go back and reconsider Outlook under Humidity
- ID3 uses all of the training data at once
- Can handle noisy data.
- Contrast to Candidate-Elimination
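For reference, the entropy at the root implied by the counts above (9 positive and 5 negative out of 14) works out to

$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$

and ID3 chooses as the root the attribute whose split reduces this value the most.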
Another Example: Russell and Norvig’s Restaurant Domain
- Develop a decision tree to model the decision a patron makes when deciding whether or not to wait for a table at a restaurant.
- Two classes: wait, leave
- Ten attributes:
- alternative restaurant available?
- bar in restaurant?
- is it Friday?
- are we hungry?
- how full is the restaurant?
- how expensive?
- is it raining?
- do we have a reservation?
- what type of restaurant is it?
- what’s the purported waiting time?
- Training set of 12 examples
- ~ 7000 possible cases
Decision Tree
Let’s compare two candidate attributes: Patrons and Type. Which is the better attribute?
We then select the best candidate for discriminating between X4(+), X12(+), X2(-), X5(-), X9(-), X10(-).
By continuing in the same manner we obtain the full decision tree.
How well does it work?
- Many case studies have shown that decision trees are at least as accurate as human experts.
- A study for diagnosing breast cancer:
- humans correctly classified the examples 65% of the time;
- the decision tree classified 72% correctly.
- British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms.
- It replaced an earlier rule-based expert system.
- Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
Extensions of the Decision Tree Learning Algorithm
- Using gain ratios
- Real-valued data
- Noisy data and Overfitting
- Generation of rules
- Setting Parameters
- Cross-Validation for Experimental Validation of Performance
- Incremental learning
Real-valued data
- Select a set of thresholds defining intervals;
- each interval becomes a discrete value of the attribute
- We can use some simple heuristics
- e.g., always divide into quartiles (see the sketch after this list)
- We can use domain knowledge
- e.g., divide age into infant (0-2), toddler (3-5), and school-aged (5-8)
- or treat this as another learning problem
- try a range of ways to discretize the continuous variable
- Find out which yield “better results” with respect to some metric.
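Here is a minimal sketch of the quartile heuristic (my own illustration, not from the notes): compute three cut points and map each real value to one of four discrete interval labels. `statistics.quantiles` is from the Python standard library.

```python
import statistics

def quartile_discretize(values):
    """Map each real value to one of four interval labels Q1..Q4 using quartile cut points."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # three thresholds defining the intervals
    def label(x):
        if x <= q1:
            return "Q1"
        elif x <= q2:
            return "Q2"
        elif x <= q3:
            return "Q3"
        return "Q4"
    return [label(x) for x in values]

ages = [1, 2, 4, 5, 6, 7, 8, 30, 41, 55]
print(quartile_discretize(ages))   # each age becomes a discrete attribute value
```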
Noisy data and Overfitting
- Many kinds of “noise” that could occur in the examples:
- Two examples have same attribute/value pairs, but different classifications
- Some values of attributes are incorrect because of:
- Errors in the data acquisition process
- Errors in the preprocessing phase
- The classification is wrong (e.g., + instead of -) because of some error
- Some attributes are irrelevant to the decision-making process,
- e.g., color of a die is irrelevant to its outcome.
- Irrelevant attributes can result in overfitting the training data.
- Overfitting:
- The learning result fits the data (training examples) well but does not hold for unseen data
- This means the algorithm generalizes poorly
- We often need to trade off fit to the data against generalization power
- Overfitting is a problem common to all methods that learn from data
- Fixing the overfitting/overlearning problem:
- By cross-validation (see later)
- By pruning lower nodes in the decision tree.
- For example, if the gain of the best attribute at a node is below a threshold, stop and make that node a leaf rather than generating child nodes (see the sketch below).
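One way to express that gain-threshold stopping rule in code (a sketch, not the notes' implementation; `min_gain` is a hypothetical parameter to tune by validation):

```python
from collections import Counter

def leaf_if_low_gain(labels, best_gain, min_gain=0.01):
    """Pre-pruning check: return a majority-class leaf if the best gain is too small,
    otherwise return None so tree construction continues to split this node."""
    if best_gain < min_gain:
        return Counter(labels).most_common(1)[0][0]
    return None

print(leaf_if_low_gain(["Yes", "Yes", "No"], best_gain=0.003))  # -> Yes (stop, make a leaf)
print(leaf_if_low_gain(["Yes", "Yes", "No"], best_gain=0.25))   # -> None (keep splitting)
```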
Pruning Decision Trees
- Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
- The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
- E.g.,
- Training set: e.g., one red ‘yes’ and one blue ‘no’
- Test set: three red ‘no’s, one red ‘yes’, one blue ‘yes’, and one blue ‘no’
- Consider replacing this subtree by a single ‘no’ node.
- After replacement we will have only two errors instead of four: the subtree (red → ‘yes’, blue → ‘no’) misclassifies the three red ‘no’s and the blue ‘yes’, while the single ‘no’ leaf misclassifies only the two ‘yes’ examples (a quick check of these counts follows).
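A quick check of those error counts in code; the subtree's predictions (red → ‘yes’, blue → ‘no’) are inferred from the training set described above.

```python
# Test examples as listed above: (colour, true label)
test = [("red", "no"), ("red", "no"), ("red", "no"),
        ("red", "yes"), ("blue", "yes"), ("blue", "no")]

subtree_prediction = {"red": "yes", "blue": "no"}   # subtree fit to the training set

errors_subtree = sum(subtree_prediction[c] != label for c, label in test)
errors_pruned = sum("no" != label for c, label in test)   # whole subtree replaced by a 'no' leaf

print(errors_subtree, errors_pruned)   # 4 2 -> pruning halves the error on this test set
```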