zonghan程式筆記: Weight initialization for neural networks

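Correct weight initialization is very important for deep networks. With only one or two layers, initialization is not much of a problem: as long as the standard deviation of the Gaussian is not so large that training fails to converge, it works. Once the network has four or more layers, that is no longer true. The reason is that each layer's computation widens or narrows the distribution of its outputs relative to its inputs, so the deeper the layer, the more the distribution has been widened or narrowed. The derivation follows, where X is the input, W is the weights, S is the output, and n is the number of neurons in the input layer.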
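
The derivation itself is not reproduced in this text. A standard version, written here as a sketch and assuming the w_i and x_i are independent and identically distributed, runs as follows (the zero-mean assumption is used to drop the first two terms of step 3):

```latex
\begin{align*}
\mathrm{Var}(s) &= \mathrm{Var}\Big(\sum_{i=1}^{n} w_i x_i\Big) \\
&= \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) \\
&= \sum_{i=1}^{n} \Big( [E(w_i)]^2\,\mathrm{Var}(x_i) + [E(x_i)]^2\,\mathrm{Var}(w_i) + \mathrm{Var}(x_i)\,\mathrm{Var}(w_i) \Big) \\
&= \sum_{i=1}^{n} \mathrm{Var}(x_i)\,\mathrm{Var}(w_i) \\
&= \big(n\,\mathrm{Var}(w)\big)\,\mathrm{Var}(x)
\end{align*}
```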

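This assumes that both the inputs and the weights have zero mean, so E(wi) and E(xi) in step 3 are both 0. The variance of the output therefore depends not only on the variance of the input but also on the number of neurons and the variance of the weights: Var(s) = n Var(w) Var(x). If n Var(w) is greater than 1, the output spread grows with every additional layer and diverges; if it is less than 1, it shrinks toward 0.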
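
The relation Var(s) = n Var(w) Var(x) can be checked numerically. The snippet below is a minimal sketch, not part of the original post: the values of `n`, `w_std`, `x_std` and `num_samples` are arbitrary choices for illustration, and fresh weights are drawn for every sample so that the empirical variance averages over both w and x.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 500             # number of inputs feeding one neuron (illustrative value)
num_samples = 5000  # number of independent (w, x) draws (illustrative value)
w_std = 0.05        # standard deviation of the weights (illustrative value)
x_std = 2.0         # standard deviation of the inputs (illustrative value)

# Zero-mean, independent inputs and weights, as the derivation assumes
x = rng.normal(0.0, x_std, size=(num_samples, n))
w = rng.normal(0.0, w_std, size=(num_samples, n))

# s = sum_i w_i * x_i, computed once per sample
s = np.sum(w * x, axis=1)

empirical_var = s.var()
predicted_var = n * w_std**2 * x_std**2  # n * Var(w) * Var(x)

print("empirical Var(s):", empirical_var)
print("predicted Var(s):", predicted_var)
```

The two printed values should agree up to sampling noise.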

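Why is a spread that is too large or too small a problem? Suppose the neuron outputs are so narrowly spread that they are all close to 0. During backpropagation the weight gradients are then all close to 0 (because dw = neuron output * gradient of the activation function), so the network cannot learn. If the spread is too large and the activation function is tanh, the outputs land in the regions near 1 and -1, where the gradient of tanh is almost 0. This also drives dw to 0, so again the network cannot learn.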
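
Both failure modes can be illustrated with a few lines of NumPy. This is a rough sketch under assumptions of my own: `upstream` is a made-up stand-in for the gradient arriving from the layers above, and the array shapes are arbitrary.

```python
import numpy as np

def tanh_grad(z):
    """Derivative of tanh: d/dz tanh(z) = 1 - tanh(z)^2."""
    return 1.0 - np.tanh(z) ** 2

# Case 1: saturation. The tanh gradient collapses as |z| grows.
z = np.array([0.0, 0.5, 2.0, 5.0])
grads = tanh_grad(z)
for zi, gi in zip(z, grads):
    print("z = %.1f -> tanh gradient = %.6f" % (zi, gi))

# Case 2: vanishing activations. For a linear layer s = x @ W, the weight
# gradient is dW = x.T @ upstream, so tiny inputs x give a tiny dW.
rng = np.random.default_rng(0)
upstream = rng.standard_normal((100, 50))       # hypothetical upstream gradient
x_normal = rng.standard_normal((100, 20))       # activations with std 1
x_tiny = rng.standard_normal((100, 20)) * 1e-4  # activations collapsed near 0

dW_normal = x_normal.T @ upstream
dW_tiny = x_tiny.T @ upstream

print("mean |dW| with std-1 inputs:", np.abs(dW_normal).mean())
print("mean |dW| with tiny inputs: ", np.abs(dW_tiny).mean())
```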

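The fix is simple: set the standard deviation of w to sqrt(1/n), so that n Var(w) equals 1 and the spread of the output matches that of the input. Below we use Python to simulate a 10-layer network with 500 neurons per layer and see how different weight initializations affect the output distribution of each layer.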

Code

import math  # needed for the sqrt(1/n) and sqrt(2/n) variants of W
import numpy as np
import matplotlib.pyplot as plt

# Input data: 1000 samples with 500 features, drawn from N(0, 1)
D = np.random.randn(1000, 500)
hidden_layer_sizes = [500] * 10
nonlinearities = ['tanh'] * len(hidden_layer_sizes)
act = {'relu': lambda x: np.maximum(0, x), 'tanh': lambda x: np.tanh(x)}

# Forward pass: Hs[i] holds the activations of hidden layer i
Hs = {}
for i in range(len(hidden_layer_sizes)):
    X = D if i == 0 else Hs[i - 1]
    fan_in = X.shape[1]
    fan_out = hidden_layer_sizes[i]
    # Weight initialization under test: change the scale factor here
    W = np.random.randn(fan_in, fan_out) * 0.02

    H = np.dot(X, W)
    H = act[nonlinearities[i]](H)
    Hs[i] = H

# Per-layer statistics of the activations
print('input layer had mean %f and std %f' % (np.mean(D), np.std(D)))
layer_means = [np.mean(H) for i, H in Hs.items()]
layer_stds = [np.std(H) for i, H in Hs.items()]
for i, H in Hs.items():
    print('hidden layer %d had mean %f and std %f' % (i + 1, layer_means[i], layer_stds[i]))

# Left: mean per layer. Right: standard deviation per layer.
Hs_keys = list(Hs.keys())
plt.figure(figsize=(15, 4))
plt.subplot(1, 2, 1)
plt.plot(Hs_keys, layer_means, 'ob-')
plt.title('layer mean')
plt.subplot(1, 2, 2)
plt.plot(Hs_keys, layer_stds, 'or-')
plt.title('layer std')

plt.show()

# One histogram of activations per layer
plt.figure(figsize=(20, 4))
for i, H in Hs.items():
    plt.subplot(1, len(Hs), i + 1)
    plt.hist(H.ravel(), 30, range=(-1, 1))

plt.show()

Result (tanh)

Initial STD too small

Suppose W = np.random.randn(fan_in,fan_out)*0.02, that is, the weights are drawn with a standard deviation (STD) of 0.02.

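The results are as follows.

Left plot: mean of the neuron outputs in each layer
Right plot: standard deviation of the neuron outputs in each layer

Distribution of the neuron outputs in each layer

The deeper the layer, the smaller the STD. Almost all neurons output 0, so the network cannot learn at all.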

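Initial STD too large

W = np.random.randn(fan_in,fan_out)*1, that is, the weights are drawn with an STD of 1.

The outputs pile up at the two extremes, 1 and -1. The gradient in these regions is almost 0, so the network cannot learn at all.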

Initial STD = sqrt(1/n) to keep n*Var(w) = 1

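W = np.random.randn(fan_in,fan_out)/math.sqrt(n), where n is the number of neurons in the input (previous) layer, i.e. fan_in in the code above.

Setting n Var(w) = 1 keeps the distribution from widening or narrowing between input and output, so neurons in the deep layers still have a reasonable spread and retain the ability to learn.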
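
The three tanh cases can also be compared side by side without plotting. The following is a compact sketch of the same experiment. The helper `tanh_layer_stds` and its parameters are names introduced here for illustration and do not appear in the original script.

```python
import numpy as np

def tanh_layer_stds(scale_fn, num_layers=10, width=500, num_samples=1000, seed=0):
    """Return the std of the tanh activations of each layer.

    scale_fn maps fan_in to the standard deviation used for the weights.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((num_samples, width))  # input data ~ N(0, 1)
    stds = []
    for _ in range(num_layers):
        fan_in = h.shape[1]
        W = rng.standard_normal((fan_in, width)) * scale_fn(fan_in)
        h = np.tanh(h @ W)
        stds.append(float(h.std()))
    return stds

too_small = tanh_layer_stds(lambda n: 0.02)             # STD = 0.02
too_large = tanh_layer_stds(lambda n: 1.0)              # STD = 1
scaled = tanh_layer_stds(lambda n: 1.0 / np.sqrt(n))    # STD = sqrt(1/n)

for name, stds in [("0.02", too_small), ("1", too_large), ("sqrt(1/n)", scaled)]:
    print("STD %-9s first layer std %.4f, last layer std %.4f" % (name, stds[0], stds[-1]))
```

With the small STD the last-layer spread should collapse toward 0, with the large STD it should sit close to 1 (saturation), and with sqrt(1/n) it should stay in between.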

Result (relu)

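When ReLU is the activation function, the weights cannot simply be initialized with sqrt(1/n): the mean after ReLU is not 0, so the formula above does not apply. ReLU sets everything below 0 to 0, which halves the spread of the output. Var(w) therefore has to be twice the original 1/n, that is 2/n, for the output spread to stay unchanged after ReLU.

So with ReLU as the activation function, the STD of the weights should be sqrt(2/n).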
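
A quick way to see the difference is to run the same 10-layer experiment with ReLU under both scalings. This is a sketch under the same setup as the tanh simulation. The helper `relu_layer_stds` is a name made up for this example.

```python
import numpy as np

def relu_layer_stds(scale_fn, num_layers=10, width=500, num_samples=1000, seed=0):
    """Return the std of the ReLU activations of each layer.

    scale_fn maps fan_in to the standard deviation used for the weights.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((num_samples, width))  # input data ~ N(0, 1)
    stds = []
    for _ in range(num_layers):
        fan_in = h.shape[1]
        W = rng.standard_normal((fan_in, width)) * scale_fn(fan_in)
        h = np.maximum(0.0, h @ W)  # ReLU
        stds.append(float(h.std()))
    return stds

std_sqrt_1_over_n = relu_layer_stds(lambda n: np.sqrt(1.0 / n))
std_sqrt_2_over_n = relu_layer_stds(lambda n: np.sqrt(2.0 / n))

for i, (a, b) in enumerate(zip(std_sqrt_1_over_n, std_sqrt_2_over_n), start=1):
    print("layer %2d: std with sqrt(1/n) = %.4f, std with sqrt(2/n) = %.4f" % (i, a, b))
```

With sqrt(1/n) the spread should shrink layer after layer, while with sqrt(2/n) it should stay roughly constant.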