[Artificial Intelligence] Multi-Layer Perceptron (MLP)

Multi-Layer Perceptron, MLP

The MLP and the backpropagation algorithm that is used to train it
"MLP" is used to describe any general feedforward network (no recurrent connections)

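3 layers (no. of layers of adaptive weights), a three-layer neural network
1st question:
- What is the role of the hidden layer?
  - It acts as a feature extractor
- What is the problem with a single-layer network?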

XOR problem

XOR (exclusive OR) problem
- 0+0=0
- 1+1=2=0 mod 2
- 1+0=1
- 0+1=1
A single perceptron does not work here
A single layer generates only a linear decision boundary, and no single line separates the two XOR classes

Minsky & Papert (1969) offered a solution to the XOR problem by combining perceptron unit responses using a second layer of units

$out = \mathrm{hardlim}(w_1 x_1 + w_2 x_2 - \theta)$
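As a minimal sketch of the two-layer idea, the code below builds XOR from three hard-limiting units of the form given above. The particular weights and thresholds (an OR unit, an AND unit, and an output unit computing "OR but not AND") are one hand-picked choice for illustration and are not taken from the original notes; `hardlim` is assumed to return 1 for non-negative input.

```python
def hardlim(a):
    """Hard-limiting (step) activation: 1 if a >= 0, else 0."""
    return 1 if a >= 0 else 0


def unit(x1, x2, w1, w2, theta):
    """One threshold unit: out = hardlim(w1*x1 + w2*x2 - theta)."""
    return hardlim(w1 * x1 + w2 * x2 - theta)


def xor_net(x1, x2):
    # First layer: two units, each drawing one linear boundary.
    h_or = unit(x1, x2, 1, 1, 0.5)   # fires if at least one input is 1
    h_and = unit(x1, x2, 1, 1, 1.5)  # fires only if both inputs are 1
    # Second layer: combine the two responses (OR but not AND).
    return unit(h_or, h_and, 1, -1, 0.5)


truth_table = [(x1, x2, xor_net(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
print(truth_table)
```

Each first-layer unit still draws a single straight line; it is the second layer that combines the two half-planes into a region no single unit could represent.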

Three Layer Networks

Properties of architecture

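No connections within a layer
No direct connections between the input layer and the output layer
Adjacent layers are fully connected
The number of output units need not equal the number of input units
The number of hidden units may be smaller or larger than the number of input or output units.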

Decision Boundary

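The 2nd layer extracts local features, while the 3rd layer extracts global features
With a sigmoidal activation function in the hidden layer, a two-layer network with enough hidden units can approximate any continuous function on a bounded input region to arbitrary accuracy: Universal Approximation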

Backpropagation Learning

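To find the weights of a single-layer perceptron, apply gradient descent to the loss function:
- $\Delta w_{ji} = (t_j - y_j) x_i$
The change to the weight $w_{ji}$ from input unit $i$ to output unit $j$ is local: it depends only on the input $x_i$ and the error of output unit $j$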
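The local update rule above can be sketched in a few lines. This is an illustrative example only: the AND data set, the step output unit, the learning rate `eta` and the function name `train_single_layer` are assumptions that do not appear in the original notes.

```python
def step(a):
    """Threshold output unit: 1 if the total input is positive, else 0."""
    return 1 if a > 0 else 0


def train_single_layer(patterns, epochs=20, eta=1.0):
    """Apply delta_w_i = eta * (t - y) * x_i after every pattern."""
    w = [0.0, 0.0]
    b = 0.0  # bias, updated like a weight on a constant input of 1
    for _ in range(epochs):
        for x, t in patterns:
            y = step(w[0] * x[0] + w[1] * x[1] + b)
            err = t - y  # the only error this weight update ever sees
            w[0] += eta * err * x[0]
            w[1] += eta * err * x[1]
            b += eta * err
    return w, b


# Logical AND is linearly separable, so a single layer can learn it.
and_patterns = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_single_layer(and_patterns)
predictions = [step(w[0] * x[0] + w[1] * x[1] + b) for x, _ in and_patterns]
print(w, b, predictions)
```

Every update uses only the input on that connection and the error of the unit it feeds, which is why the same rule cannot be applied directly to hidden units that have no target value.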

The error is computed only at the 3rd layer. What about the weights of the 1st and 2nd layers?
The first two layers have no direct error signal.

Credit assignment problem
- Problem of assigning ‘credit’ or ‘blame’ to individual elements involved in forming overall response of a learning system (hidden units)
- In neural networks, problem relates to deciding which weights should be altered, by how much and in which direction

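We need to determine how much the weights of the early layers contribute to the output, and therefore to the error
That is, we want to know how a weight $w_{ij}$ affects the error $E$, i.e. we want the partial derivative $\frac{\partial E}{\partial w_{ij}}$.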

Backpropagation learning algorithm ‘BP’
Solution to credit assignment problem in MLP. Rumelhart, Hinton and Williams (1986)
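BP has two phases:
- Forward pass: moving forward through the network, repeatedly compute each unit's total input and then its output until the final outputs are obtained.
- Backward pass: compute the error signal at the output units, then propagate the error backward. (The error is the difference between the actual output and the target.)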

Explained here for two layers of weights; it extends easily to more layers.
$z_i(t) = g( \sum_j v_{ij}(t) x_j(t) )$ at time t
- $= g ( u_i(t) )$
$y_i (t) = g( \sum_j w_{ij}(t)z_j(t) )$ at time t
- $= g ( a_i(t) )$
$a$ / $u$: activation (total input to the unit)
$g$: activation function
Biases are treated as additional weights
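The two equations above can be sketched as a small forward-pass routine. The logistic `sigmoid` as the default $g$, the helper names `layer` and `forward`, and the weight values are all hypothetical choices for illustration; the bias is handled as an extra weight on a constant input of 1, as the notes describe.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer(weights, inputs, g):
    """Each row is [w_i1, ..., w_in, bias_i]; the bias multiplies a constant 1."""
    extended = inputs + [1.0]
    return [g(sum(w * x for w, x in zip(row, extended))) for row in weights]


def forward(v, w, x, g=sigmoid):
    z = layer(v, x, g)  # z_i = g(u_i), with u_i = sum_j v_ij * x_j
    y = layer(w, z, g)  # y_i = g(a_i), with a_i = sum_j w_ij * z_j
    return z, y


# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
v = [[0.5, -0.5, 0.0], [0.3, 0.8, -0.1]]
w = [[1.0, -1.0, 0.2]]
z, y = forward(v, w, [0.0, 1.0])
print(z, y)
```

Passing a different `g` (for example `lambda a: a`) changes the activation function without touching the rest of the computation.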

Forward pass

Weights are fixed during forward and backward pass at time t

Backward Pass

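We will use a sum-of-squares error measure. For each training pattern we have:

$E(t) = \frac{1}{2} \sum_k ( d_k(t) - y_k(t) )^2$

where $d_k$ is the target value for dimension $k$. We want to know how to modify the weights in order to decrease $E$. Use gradient descent, i.e.

$w_{ij}(t+1) - w_{ij}(t) \propto - \frac{\partial E(t)}{\partial w_{ij}(t)}$

both for hidden units and output units.

The partial derivative can be rewritten as a product of two terms using the chain rule for partial differentiation:

$\frac{\partial E(t)}{\partial w_{ij}(t)} = \frac{\partial E(t)}{\partial a_i(t)} \cdot \frac{\partial a_i(t)}{\partial w_{ij}(t)}$

both for hidden units and output units (with $u_i$ and $v_{ij}$ in place of $a_i$ and $w_{ij}$ for the hidden layer).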

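Term A
- How does the error change as the total input to unit $i$ changes? $\frac{\partial E}{\partial a_i}$, written $-\Delta_i$
Term B
- How does the total input to unit $i$ change as the weight $w_{ij}$ changes? $\frac{\partial a_i}{\partial w_{ij}} = z_j$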
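Carrying the chain rule through for both layers gives the usual delta terms. The following is a sketch rather than the notes' own derivation: it assumes the sum-of-squares error $E = \frac{1}{2}\sum_k (d_k - y_k)^2$, a learning rate $\eta$, and the sign convention that $\Delta_i$ and $\delta_i$ are the negative derivatives of $E$ with respect to the total inputs $a_i$ and $u_i$. These choices agree with the numbers in the learning example further down.

Output units:

$$\Delta_i(t) = -\frac{\partial E(t)}{\partial a_i(t)} = ( d_i(t) - y_i(t) ) \, g'(a_i(t)), \qquad w_{ij}(t+1) = w_{ij}(t) + \eta \, \Delta_i(t) \, z_j(t)$$

Hidden units, where the error arrives through every output unit $k$ that hidden unit $i$ feeds:

$$\delta_i(t) = -\frac{\partial E(t)}{\partial u_i(t)} = g'(u_i(t)) \sum_k \Delta_k(t) \, w_{ki}(t), \qquad v_{ij}(t+1) = v_{ij}(t) + \eta \, \delta_i(t) \, x_j(t)$$

The sum over $k$ is the backward propagation step: each hidden unit receives a share of the output errors weighted by the same connections used in the forward pass.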

Activation Functions

How does the activation function affect the changes?

We need to compute the derivative of the activation function $g$.
For the derivative to exist, the activation function must be smooth (differentiable).

Sigmoidal (logistic) function, common in MLPs:

$g(a_i(t)) = \frac{1}{1 + \exp(-a_i(t))}$

The derivative of the sigmoidal function is

$g'(a_i(t)) = g(a_i(t)) \, ( 1 - g(a_i(t)) )$

The derivative of the sigmoidal function has its maximum at $a_i(t) = 0$ and is symmetric about this point, falling to zero as the sigmoid approaches its extreme values.

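The weight change is proportional to the derivative of the activation function.

Weight changes are largest when a unit's total input is in the mid-range (near 0). When the total input is very large or very small, the change is almost 0: the unit is saturated. With many layers this leads to the vanishing gradient problem.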
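A quick numerical check of the saturation effect, assuming the standard logistic function $g(a) = 1/(1 + e^{-a})$ with derivative $g'(a) = g(a)(1 - g(a))$; the sample points are arbitrary.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_prime(a):
    """Derivative of the logistic function: g(a) * (1 - g(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)


# The derivative peaks at a = 0 and shrinks toward 0 as the unit saturates.
for a in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(a, sigmoid_prime(a))
```

The peak value is 0.25 at $a = 0$. Because every sigmoid layer contributes one such factor of at most 0.25 to the backpropagated error, the signal reaching the early layers tends to shrink as layers are added.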

Batch Learning, Online learning

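Batch learning:
- Accumulate the weight changes computed for each training pattern
- Apply the accumulated change once after the whole training set has been presented (one epoch)
- Fewer weight updates
Online learning:
- Update the weights after each training pattern
- Requires less memory
- Use a small learning rate
- Present the data in random order (stochastic gradient descent)
- The random order helps to escape local minima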
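The difference between the two schedules can be sketched on a deliberately tiny problem: a single linear unit $y = w x$ fitted to made-up data generated by $t = 2x$. The data, the learning rate and the function names `batch_epoch` and `online_epoch` are assumptions for illustration only.

```python
import random


def batch_epoch(w, data, eta):
    """Accumulate the change over all patterns, then update once."""
    total = 0.0
    for x, t in data:
        total += (t - w * x) * x
    return w + eta * total


def online_epoch(w, data, eta, rng):
    """Update after every pattern, visiting patterns in random order."""
    order = list(data)
    rng.shuffle(order)
    for x, t in order:
        w += eta * (t - w * x) * x
    return w


data = [(0.0, 0.0), (0.5, 1.0), (1.0, 2.0), (1.5, 3.0)]  # t = 2 * x
rng = random.Random(0)
w_batch = w_online = 0.0
for _ in range(50):
    w_batch = batch_epoch(w_batch, data, eta=0.1)
    w_online = online_epoch(w_online, data, eta=0.1, rng=rng)
print(w_batch, w_online)
```

Both schedules approach $w = 2$ here; batch learning makes one update per epoch, while online learning makes one per pattern and only ever needs the current pattern in memory.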

Activation function of output unit

Learning Example

Once weight changes are computed for all units, weights are updated at the same time (bias included as weights here). An example:

Use the identity activation function (i.e. $g(a) = a$)

All biases are set to 1. They are not drawn, for clarity.
Learning rate $\eta$ = 0.1

Have input [0 1] with target [1 0].

Forward pass. Calculate $1^{st}$ layer activations:

$u_1 = -1 \times 0 + 0 \times 1 + 1 = 1$
$u_2 = 0 \times 0 + 1 \times 1 + 1 = 2$

Calculate the first-layer outputs by passing the activations through the activation function:

$z_1 = g(u_1) = 1$
$z_2 = g(u_2) = 2$

Calculate the $2^{nd}$ layer outputs (weighted sums passed through the activation function):

$y_1 = a_1 = 1 \times 1 + 0 \times 2 + 1 = 2$
$y_2 = a_2 = -1 \times 1 + 1 \times 2 + 1 = 2$

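So the output-unit error terms are
- $\Delta_1 = (d_1 - y_1) = 1 - 2 = -1$
- $\Delta_2 = (d_2 - y_2) = 0 - 2 = -2$

Calculate the hidden-unit error terms used for the $1^{st}$ layer weight changes, $\delta_i = \sum_k \Delta_k w_{ki}$ since $g' = 1$ (cf. perceptron learning):

$\delta_1 = - 1 + 2 = 1$
$\delta_2 = 0 - 2 = -2$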

Finally change weights:

Note that the weights multiplied by the zero input are unchanged, as they do not contribute to the error. The biases have also been changed (not shown).

Now go forward again (would normally use a new input vector):

Outputs now closer to target value [1, 0]
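The whole example can be replayed in code. The weights below are read off the coefficients in the forward-pass equations above, with all biases equal to 1; the update rules $w_{ij} \leftarrow w_{ij} + \eta \Delta_i z_j$ and $v_{ij} \leftarrow v_{ij} + \eta \delta_i x_j$ are the standard backpropagation updates and are assumed here, since the corresponding equations are not shown in the notes. Variable names are illustrative.

```python
eta = 0.1
x = [0.0, 1.0]  # input
d = [1.0, 0.0]  # target

# Each row is [weight from unit 1, weight from unit 2, bias].
v = [[-1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]  # input -> hidden
w = [[1.0, 0.0, 1.0], [-1.0, 1.0, 1.0]]  # hidden -> output


def forward(v, w, x):
    """Forward pass with identity activation g(a) = a."""
    xe = x + [1.0]
    z = [sum(vij * xj for vij, xj in zip(row, xe)) for row in v]
    ze = z + [1.0]
    y = [sum(wij * zj for wij, zj in zip(row, ze)) for row in w]
    return z, y


z, y = forward(v, w, x)  # z = [1, 2], y = [2, 2]
Delta = [dk - yk for dk, yk in zip(d, y)]  # output errors: [-1, -2]
# Hidden errors use the OLD second-layer weights (g' = 1 for identity).
delta = [sum(Delta[k] * w[k][i] for k in range(2)) for i in range(2)]  # [1, -2]

# Update all weights (biases included) at the same time.
ze, xe = z + [1.0], x + [1.0]
w_new = [[w[i][j] + eta * Delta[i] * ze[j] for j in range(3)] for i in range(2)]
v_new = [[v[i][j] + eta * delta[i] * xe[j] for j in range(3)] for i in range(2)]

z2, y2 = forward(v_new, w_new, x)
print(y, y2)  # y2 is approximately [1.66, 0.32]
```

After one update the outputs move from [2, 2] to roughly [1.66, 0.32], which is closer to the target [1, 0]. The first-layer weights attached to the zero input keep their values of -1 and 0.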

Selecting initial weight values

The choice of initial weight values is important, as it decides the starting position in weight space, that is, how far away from the global minimum training begins.
Select the weight values randomly from a uniform probability distribution.
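A minimal sketch of uniform initialisation. The range $[-r, r]$ with `r=0.5`, the seed handling and the function name `init_weights` are assumptions; the notes only state that a uniform distribution is used.

```python
import random


def init_weights(n_out, n_in, r=0.5, seed=None):
    """Return n_out rows of n_in weights plus one bias, drawn from U(-r, r)."""
    rng = random.Random(seed)
    return [[rng.uniform(-r, r) for _ in range(n_in + 1)] for _ in range(n_out)]


v = init_weights(n_out=3, n_in=2, seed=0)  # 2 inputs -> 3 hidden units
print(v)
```

Fixing the seed makes a run repeatable, while different seeds give different starting positions in weight space.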

Momentum

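A technique that speeds up convergence while reducing instability.
Add a fraction of the previous weight change to the weight update rule.
Modified weight update rule:

$\Delta w_{ij}(t) = -\eta \frac{\partial E(t)}{\partial w_{ij}(t)} + \alpha \, \Delta w_{ij}(t-1)$, where $\Delta w_{ij}(t) = w_{ij}(t+1) - w_{ij}(t)$

$\alpha$ is the momentum constant and controls how much notice is taken of recent history.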
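Effect of the momentum term
- If the weight change has the same sign as the previous change, the momentum term enlarges the change and speeds up convergence.
- If the weight change has the opposite sign to the previous change, the momentum term reduces the change, slowing things down and damping oscillation (stabilization).
- Helps to escape local minima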
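The first two effects can be seen with a one-weight sketch in which the new change is the plain gradient step plus a fraction `alpha` of the previous change. The gradient sequence, `eta=0.1`, `alpha=0.5` and the function name `momentum_step` are made up for illustration.

```python
def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.5):
    """New change = plain gradient step + alpha * previous change."""
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw


w, dw = 0.0, 0.0
steps = []
# Two gradients of the same sign, then one of the opposite sign.
for grad in (1.0, 1.0, -1.0):
    w, dw = momentum_step(w, grad, dw)
    steps.append(dw)
print(steps)  # approximately [-0.1, -0.15, 0.025]
```

The second change is larger in magnitude than the plain step of 0.1 because it agrees in sign with the first, while the third is much smaller than 0.1 because the gradient has reversed.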

Adaptive Learning Rate

Learning rate $\eta$
- mostly less than or equal to 0.2
- can be made adaptive for faster convergence
- kept large while learning is progressing and decreased when learning slows down
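One simple way to realise this idea is sketched below: keep $\eta$ while the error is still falling by more than a tolerance, and shrink it otherwise. This heuristic, the function name `adapt_learning_rate` and every constant in it are assumptions for illustration, not a scheme given in the notes.

```python
def adapt_learning_rate(eta, prev_error, error, tol=1e-3, decay=0.5, eta_min=1e-4):
    """Keep eta while the error drops by more than tol; otherwise shrink it."""
    improvement = prev_error - error
    if improvement > tol:
        return eta
    return max(eta * decay, eta_min)


eta_fast = adapt_learning_rate(0.2, prev_error=1.0, error=0.5)     # still learning
eta_slow = adapt_learning_rate(0.2, prev_error=0.5, error=0.4999)  # learning has slowed
print(eta_fast, eta_slow)
```

The function would be called once per epoch with the previous and current training error.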

Overtraining

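Training error can be measured over the whole training set, for example as a mean squared error:

$E = \frac{1}{nm} \sum_{p=1}^{n} \sum_{k=1}^{m} ( d_k^{(p)} - y_k^{(p)} )^2$

where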
- n = number of training patterns,
- m = number of output units
We could stop training when the rate of change of E is small, suggesting convergence.
However, the aim is for new patterns to be classified correctly.

Typically, although the error on the training set keeps decreasing as training continues, the generalisation error (error on unseen data) hits a minimum and then increases (cf. model complexity etc.).
Therefore we want a more complex stopping criterion.

Cross-validation
- Method for evaluating the generalisation performance of networks in order to determine which is best, using the available data
Hold-out method
- Simplest method when data is not scarce
Divide the available data into sets
- Training data set
  - used to obtain weight and bias values during network training
- Validation data
  - used to periodically test the ability of the network to generalize
  - suggests the 'best' network based on smallest error
Cf. test data
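The hold-out idea can be turned into a stopping rule by tracking the validation error after each epoch and keeping the network from the epoch where it was smallest. The sketch below assumes a `patience` parameter (how many epochs without improvement to tolerate), which is not part of the notes, and the validation errors are made-up numbers used only to exercise the function.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return (epoch, error) of the best validation error seen before stopping."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch, best_error


# Illustrative validation errors: falling at first, then rising again.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
best = early_stopping_epoch(val_errors)
print(best)  # (3, 0.4)
```

In a real training loop the weights saved at the best epoch would be restored, and a separate test set would be used for the final performance estimate.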

Weka

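Choose Classify > functions > MultilayerPerceptron

How to set the number of nodes in the hidden layers

Click on MultilayerPerceptron in the Choose field to open its popup window

Enter the number of hidden-layer nodes in hiddenLayers

Example: enter 2,4

The 1st hidden layer then has 2 nodes and the 2nd hidden layer has 4 nodes
Click on MultilayerPerceptron in the Choose field and set GUI to true in the popup window to display the MLP structure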

MLP regressor (cpu performance)

MLP classifier(iris)

LEE CHANWOO