[Artificial Intelligence] Decision Tree (ID3)
Decision Tree Learning
- Candidate-Elimination algorithm: finds a Boolean function of the attributes
- Decision tree learning, in contrast:
- Applicable even to functions with two or more output values
- Widely used
- Robust to noise
- Generates disjunctive expressions
- Easily interpretable (tree structure, if-then rules)
- Classification decision trees
- Classify examples into one of a set of categories
- The learned function is represented as a tree
- Each node of the tree tests an attribute of the example
- Branches correspond to attribute values
- The output is obtained by following a path from the root node to a leaf node
- The tree forms a hypothesis
- Disjunction (OR’s) of conjunctions (AND’s)
- Each path from root to leaf forms a conjunction of constraints on attributes
- Separate branches are disjunctions
- Example from the PlayTennis decision tree (a code sketch of this tree follows the list):
- (Outlook=Sunny $\wedge$ Humidity=Normal) $\vee$ (Outlook=Overcast) $\vee$ (Outlook=Rain $\wedge$ Wind=Weak)
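As a concrete illustration (not part of the original notes), the PlayTennis tree implied by the disjunction above can be stored as a nested dictionary and walked from root to leaf. The tree shape below is assumed from that disjunction, with the 'No' leaves for Humidity=High and Wind=Strong filled in because the attributes are binary.

```python
# A minimal sketch (assumed tree shape, matching the disjunction above):
# internal nodes are dicts keyed by the attribute they test, leaves are class labels.
play_tennis_tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
    }
}

def classify(tree, example):
    """Follow one root-to-leaf path, testing one attribute per node."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                 # attribute tested at this node
        tree = tree[attribute][example[attribute]]   # follow the branch labelled by its value
    return tree                                      # the leaf holds the predicted class

print(classify(play_tennis_tree, {"Outlook": "Rain", "Wind": "Weak"}))  # -> Yes
```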
Types of problems decision tree learning is good for:
- Instances represented by attribute-value pairs
- For the algorithm in these notes, attributes take on a small number of discrete values
- Can be extended to real-valued attributes (numerical data)
- Target function has discrete output values
- Can be extended to multiple output values
- Hypothesis space can include disjunctive expressions.
- In fact, the hypothesis space is the complete space of finite discrete-valued functions
- Robust to imperfect training data
- classification errors
- errors in attribute values
- missing attribute values
- Examples:
- Equipment diagnosis
- Medical diagnosis
- Credit card risk analysis
- Robot movement
- Pattern Recognition
- face recognition
ID3 Algorithm
- A greedy algorithm for decision tree construction, developed by Ross Quinlan (1986)
- Top-down, greedy search through the space of possible decision trees
- Remember, decision trees represent hypotheses, so this is a search through hypothesis space.
- What is top-down?
- How do you start the tree?
- What attribute should represent the root?
- As you proceed down the tree, choose an attribute for each successive node.
- No backtracking
- So the algorithm proceeds from top to bottom
The ID3 algorithm builds a decision tree given a set of non-categorical attributes C1, C2, ..., Cn, the categorical attribute C, and a training set S of records.
```
function ID3(R: a set of non-categorical attributes,
             C: the categorical attribute,
             S: a training set) returns a decision tree;
begin
    If S is empty, return a single node with value Failure;
    If every example in S has the same value for the categorical
        attribute, return a single node with that value;
    If R is empty, then return a single node with the most
        frequent of the values of the categorical attribute found in
        the examples of S; [note: there will be errors, i.e., improperly
        classified records];
    Let D be the attribute with the largest Gain(D, S) among R's attributes;
    Let {dj | j = 1, 2, ..., m} be the values of attribute D;
    Let {Sj | j = 1, 2, ..., m} be the subsets of S consisting
        respectively of records with value dj for attribute D;
    Return a tree with root labeled D and arcs labeled
        d1, d2, ..., dm going respectively to the trees
        ID3(R - {D}, C, S1), ID3(R - {D}, C, S2), ..., ID3(R - {D}, C, Sm);
end ID3;
```
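The pseudocode above translates fairly directly into Python. The sketch below is my own transcription, not from the notes: examples are assumed to be dicts mapping attribute names to values, with a parallel list of class labels, and Gain(D, S) is computed with the entropy-based definition developed in the Information Theory section that follows.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class labels in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(records, labels, attribute):
    """Gain(D, S): entropy of S minus the weighted entropy of the subsets S_j."""
    total = len(labels)
    remainder = 0.0
    for value in set(r[attribute] for r in records):
        subset = [lab for r, lab in zip(records, labels) if r[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

def id3(attributes, records, labels):
    """R = attributes, S = (records, labels); returns a nested-dict tree or a leaf label."""
    if not records:
        return "Failure"                                  # S is empty
    if len(set(labels)) == 1:
        return labels[0]                                  # all examples share one class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]       # most frequent class (may err)
    best = max(attributes, key=lambda a: gain(records, labels, a))   # D = argmax Gain(D, S)
    tree = {best: {}}
    for value in set(r[best] for r in records):           # one arc per value d_j of D
        sub_records = [r for r in records if r[best] == value]
        sub_labels = [lab for r, lab in zip(records, labels) if r[best] == value]
        tree[best][value] = id3([a for a in attributes if a != best],
                                sub_records, sub_labels)  # ID3(R - {D}, C, S_j)
    return tree
```

Calling, for example, `id3(["Outlook", "Temperature", "Humidity", "Wind"], records, labels)` on the PlayTennis data described later would return a nested dict of the same shape as the tree sketch shown earlier.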
- What is a greedy search?
- At each step, make the decision that gives the greatest improvement in whatever you are trying to optimize.
- Do not backtrack (unless you hit a dead end)
- This type of search is unlikely to find a globally optimal solution, but it generally works well.
- What are we really doing here?
- At each node of the tree, decide which attribute best classifies the training data at that point.
- Never backtrack (in ID3)
- Do this for each branch of the tree.
- The end result is a tree structure representing the hypothesis that works best for the training data.
Information Theory Background
- Question:
- How do you determine which attribute best classifies the data?
- Answer: Entropy!
- Information gain:
- A statistical quantity measuring how well an attribute classifies the data.
- Calculate the information gain for each attribute.
- Choose the attribute with the greatest information gain.
Entropy
- If an event conveys information, that means it’s a surprise.
- If an event always occurs, $P(A_i)=1$, then it carries no information. $-\log_2(1) = 0$
- If an event rarely occurs (e.g. $P(A_i)=0.001$), it carries a lot of information. $-\log_2(0.001) = 9.97$
- The less likely the event, the more information it carries, since, for $0 \leq P(A_i) \leq 1$,
- $-\log_2(P(A_i))$ increases as $P(A_i)$ goes from 1 to 0.
- (Note: ignore events with $P(A_i)=0$ since they never occur.)
- In general, for an ensemble of random events $\{A_1, A_2, \ldots, A_n\}$ occurring with probabilities $\{P(A_1), P(A_2), \ldots, P(A_n)\}$:
- The self-information of event $i$ is $-\log_2(P(A_i))$.
- Entropy is the weighted average of the information carried by each event:
- $H = -\sum_{i=1}^{n} P(A_i)\log_2(P(A_i))$
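Here is a quick numeric check of the self-information values quoted above and of entropy as their weighted average (the fair-coin case is an extra illustration, not from the notes):

```python
from math import log2

print(-log2(1.0))     # certain event: 0 bits of information
print(-log2(0.001))   # rare event: ~9.97 bits
# Entropy = weighted average of self-information; e.g. a fair coin (p = 0.5, 0.5):
probs = [0.5, 0.5]
print(-sum(p * log2(p) for p in probs))   # 1.0 bit
```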
ID3
- Information gain is our metric for how well one attribute classifies the training data.
- Information gain for a particular attribute $A_i$:
- The reduction in the entropy of the target function when the value of that attribute is known (its entropy minus the conditional entropy given the attribute).
- Mathematical expression for information gain: $Gain(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$
- ID3 algorithm (for a Boolean-valued target function)
- Calculate the entropy over all training examples
- positive and negative cases
- $p_+ = N_{pos}/N_{tot}$, $p_- = N_{neg}/N_{tot}$
- $H(S) = -p_+\log_2(p_+) - p_-\log_2(p_-)$
- Determine which single attribute best classifies the training examples using information gain.
- For each attribute, compute $Gain(S, A)$ (a worked example follows this list)
- Use the attribute with the greatest information gain as the root
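As a worked illustration (the counts here are hypothetical, not from the notes), suppose a node holds 4 positive and 4 negative examples and there are two candidate Boolean attributes A and B:

- $H(S) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1.0$
- A splits S into (3+, 1-) and (1+, 3-); each subset has entropy $-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811$, so $Gain(S, A) \approx 1.0 - 0.811 = 0.189$
- B splits S into (2+, 2-) and (2+, 2-); each subset still has entropy 1.0, so $Gain(S, B) = 0$
- A has the greater information gain, so A is chosen for this node.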
Example: PlayTennis
- Four attributes used for classification:
- Outlook = {Sunny, Overcast, Rain}
- Temperature = {Hot, Mild, Cool}
- Humidity = {High, Normal}
- Wind = {Weak, Strong}
- One predicted (target) attribute (binary)
- PlayTennis = {Yes, No}
- Given 14 training examples (the root entropy for these counts is worked out after this list)
- 9 positive
- 5 negative
- Important:
- Attributes are excluded from consideration if they appear higher in the tree
- The process continues for each new leaf node until:
- Every attribute has already been included along the path through the tree, or
- The training examples associated with this leaf all have the same target attribute value.
- End up with the tree: Outlook at the root, with the Sunny branch testing Humidity (High → No, Normal → Yes), the Overcast branch a Yes leaf, and the Rain branch testing Wind (Strong → No, Weak → Yes), matching the disjunction given earlier.
- Note: in this example the data were perfect.
- No contradictions
- Branches led to unambiguous Yes/No decisions
- If there are contradictions, take the majority vote
- This handles noisy data.
- Another note:
- Attributes are eliminated when they are assigned to a node and never reconsidered.
- e.g., you would not go back and reconsider Outlook under Humidity
- ID3 uses all of the training data at once
- Can handle noisy data.
- Contrast to Candidate-Elimination
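For reference, the entropy at the root implied by the counts above (9 positive and 5 negative out of 14) works out to

$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940$

and ID3 chooses as the root the attribute whose split reduces this value the most.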
Another Example: Russell and Norvig’s Restaurant Domain
- Develop a decision tree to model the decision a patron makes when deciding whether or not to wait for a table at a restaurant.
- Two classes: wait, leave
- Ten attributes:
- alternative restaurant available?
- bar in restaurant?
- is it Friday?
- are we hungry?
- how full is the restaurant?
- how expensive?
- is it raining?
- do we have a reservation?
- what type of restaurant is it?
- what’s the purported waiting time?
- Training set of 12 examples
- ~ 7000 possible cases
Decision Tree
Let’s compare two candidate attributes: Patrons and Type. Which is the better attribute?
We then select the best candidate for discriminating between X4(+), X12(+), X2(-), X5(-), X9(-), X10(-).
By continuing in the same manner we obtain the full decision tree.
How well does it work?
- Many case studies have shown that decision trees are at least as accurate as human experts.
- A study for diagnosing breast cancer:
- humans correctly classified the examples 65% of the time;
- the decision tree classified 72% correctly.
- British Petroleum designed a decision tree for gas-oil separation for offshore oil platforms.
- It replaced an earlier rule-based expert system.
- Cessna designed an airplane flight controller using 90,000 examples and 20 attributes per example.
Extensions of the Decision Tree Learning Algorithm
- Using gain ratios
- Real-valued data
- Noisy data and Overfitting
- Generation of rules
- Setting Parameters
- Cross-Validation for Experimental Validation of Performance
- Incremental learning
Real-valued data
- Select a set of thresholds defining intervals;
- each interval becomes a discrete value of the attribute
- We can use some simple heuristics
- e.g., always divide into quartiles (see the sketch after this list)
- We can use domain knowledge
- e.g., divide age into infant (0-2), toddler (3-5), and school-aged (5-8)
- or treat this as another learning problem
- try a range of ways to discretize the continuous variable
- Find out which yield “better results” with respect to some metric.
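Here is a minimal sketch of the quartile heuristic (my own illustration, not from the notes): compute three cut points and map each real value to one of four discrete interval labels. `statistics.quantiles` is from the Python standard library.

```python
import statistics

def quartile_discretize(values):
    """Map each real value to one of four interval labels Q1..Q4 using quartile cut points."""
    q1, q2, q3 = statistics.quantiles(values, n=4)   # three thresholds defining the intervals
    def label(x):
        if x <= q1:
            return "Q1"
        elif x <= q2:
            return "Q2"
        elif x <= q3:
            return "Q3"
        return "Q4"
    return [label(x) for x in values]

ages = [1, 2, 4, 5, 6, 7, 8, 30, 41, 55]
print(quartile_discretize(ages))   # each age becomes a discrete attribute value
```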
Noisy data and Overfitting
- Many kinds of “noise” that could occur in the examples:
- Two examples have same attribute/value pairs, but different classifications
- Some values of attributes are incorrect because of:
- Errors in the data acquisition process
- Errors in the preprocessing phase
- The classification is wrong (e.g., + instead of -) because of some error
- Some attributes are irrelevant to the decision-making process,
- e.g., color of a die is irrelevant to its outcome.
- Irrelevant attributes can result in overfitting the training data.
- Overfitting:
- The learning result fits the data (training examples) well but does not hold for unseen data
- This means the algorithm generalizes poorly
- We often need to trade off fit to the data against generalization power
- Overfitting is a problem common to all methods that learn from data
- Fixing the overfitting/overlearning problem:
- By cross-validation (see later)
- By pruning lower nodes in the decision tree.
- For example, if the gain of the best attribute at a node is below a threshold, stop and make that node a leaf rather than generating child nodes (see the sketch below).
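One way to express that gain-threshold stopping rule in code (a sketch, not the notes' implementation; `min_gain` is a hypothetical parameter to tune by validation):

```python
from collections import Counter

def leaf_if_low_gain(labels, best_gain, min_gain=0.01):
    """Pre-pruning check: return a majority-class leaf if the best gain is too small,
    otherwise return None so tree construction continues to split this node."""
    if best_gain < min_gain:
        return Counter(labels).most_common(1)[0][0]
    return None

print(leaf_if_low_gain(["Yes", "Yes", "No"], best_gain=0.003))  # -> Yes (stop, make a leaf)
print(leaf_if_low_gain(["Yes", "Yes", "No"], best_gain=0.25))   # -> None (keep splitting)
```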
Pruning Decision Trees
- Pruning of the decision tree is done by replacing a whole subtree by a leaf node.
- The replacement takes place if a decision rule establishes that the expected error rate in the subtree is greater than in the single leaf.
- E.g.,
- Training set: e.g., one red ‘yes’ and one blue ‘no’
- Test set: three red ‘no’s, one red ‘yes’, one blue ‘yes’, and one blue ‘no’
- Consider replacing this subtree by a single ‘no’ node.
- After replacement we will have only two errors instead of four: the subtree (red → ‘yes’, blue → ‘no’) misclassifies the three red ‘no’s and the blue ‘yes’, while the single ‘no’ leaf misclassifies only the two ‘yes’ examples (a quick check of these counts follows).
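A quick check of those error counts in code; the subtree's predictions (red → ‘yes’, blue → ‘no’) are inferred from the training set described above.

```python
# Test examples as listed above: (colour, true label)
test = [("red", "no"), ("red", "no"), ("red", "no"),
        ("red", "yes"), ("blue", "yes"), ("blue", "no")]

subtree_prediction = {"red": "yes", "blue": "no"}   # subtree fit to the training set

errors_subtree = sum(subtree_prediction[c] != label for c, label in test)
errors_pruned = sum("no" != label for c, label in test)   # whole subtree replaced by a 'no' leaf

print(errors_subtree, errors_pruned)   # 4 2 -> pruning halves the error on this test set
```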