In-Depth Reading of Xavier Initialization (with Source Code)
The paper is Understanding the difficulty of training deep feedforward neural networks.
A translation that I found quite good is 【Deep Learning】筆記:Understanding the difficulty of training deep feedforward neural networks.
There are also some good commentary articles on Understanding the difficulty of training deep feedforward neural networks.
This paper is a real classic; note the first author's name, Xavier Glorot: Xavier initialization is named after him.
The reference source code is the TensorFlow version; the API is variance_scaling_initializer and the source lives in initializers.py.
0 Abstract
The paper credits two things for the improved performance of deep learning: parameter initialization and training techniques. That initialization can matter this much still surprises me:
"All these experimental results were obtained with new initialization or training mechanisms." However, the existing default, plain random initialization, does not perform well. The main purpose of the paper is to demonstrate this and, based on that analysis, to point out directions for improvement:
"Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future." The paper first looks at the influence of the non-linear activation function and finds that the sigmoid, because of its non-zero mean, is unsuited for deep networks: it can drive the top hidden layers into saturation:
"We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation." The paper finds that saturated units can escape saturation on their own:
"Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks." The paper finds that activation functions that saturate less are often beneficial:
"We find that a new non-linearity that saturates less can often be beneficial." The paper also finds that training becomes difficult when the singular values of each layer's Jacobian are far from 1:
"Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1." Based on these observations, the paper proposes a new initialization method.
1 Deep Neural Networks
Deep learning methods aim to learn high-level features composed from low-level features:
"Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features." The paper does not focus on unsupervised pre-training or semi-supervised training criteria; instead it looks at what goes wrong in plain multi-layer neural networks:
"So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks." The observation mainly means monitoring how activations and gradients change across layers and over the course of training:
"Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations." The paper also evaluates the effect of the choice of activation function and of the initialization procedure:
"We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact)."
2 Experiment Setting and Datasets
2.1 Online Learning on an Infinite Dataset: Shapeset-3*2
The paper explains the benefit of the online setting:
"The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects." Online learning is an optimization problem, not a regularization effect on a small dataset. My own way of understanding this: the small-sample case is like being handed a fixed bag of 2-D points and asked for the line closest to them (in principle solvable in closed form, but here solved by training). You start with an initial line, feed in one point at a time, and adjust the line to reduce its distance to that point, until the points in the bag are used up. Online learning repeats the same process, except the points never run out; with an unlimited stream of points you can get arbitrarily close to the true underlying solution. The difference between the two is whether the network ever sees the complete dataset.
The paper then introduces the dataset Shapeset-3*2, which is described as follows:
"Shapeset-3*2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale." The top part of Figure 1 shows some example images.
Here is why the setting of Figure 1 produces nine possible classes:
"The task is to predict the objects present (e.g. triangle+ellipse, parallelogram+parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes." The nine classes are:
1. A single object, 3 cases: triangle alone, parallelogram alone, ellipse alone
2. Two different objects out of the three categories, C(3,2) = 3 cases: triangle+parallelogram, triangle+ellipse, parallelogram+ellipse
3. Two objects of the same category, 3 cases: triangle+triangle, parallelogram+parallelogram, ellipse+ellipse
2.2 Finite Datasets
Three finite datasets are used: MNIST, CIFAR-10, and Small-ImageNet.
2.3 Experimental Setting
This subsection lists the basic network settings and introduces the three activation functions studied: the sigmoid, the hyperbolic tangent, and the softsign, as shown in the figure below. Among them, the softsign is defined as softsign(x) = x / (1 + |x|).
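As a quick reference, here is a minimal NumPy sketch of the three activations (my own code, not the paper's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # saturates at 0 and 1, non-zero mean

def tanh(x):
    return np.tanh(x)                  # saturates at -1 and 1, zero-centered

def softsign(x):
    return x / (1.0 + np.abs(x))       # saturates at -1 and 1, but only polynomially

x = np.linspace(-6.0, 6.0, 5)
print(sigmoid(x), tanh(x), softsign(x), sep="\n")
```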
3 Effect of Activation Functions and Saturation During Training
This section is essentially a set of experiments on the three activation functions. The paper judges them along two failure modes:
1. excessive saturation of activation function (then gradients will not propagate well)
2. overly linear units (they will not compute something interesting)
Let us look at the first point. Back-propagation involves the derivative of the activation function, so if the activation is saturated, its derivative is close to 0 and the gradients vanish. Referring to the derivation in 反向傳播算法的公式推導(dǎo) - HappyRocking的專欄 - CSDN博客, during back-propagation the partial derivatives of the cost with respect to the weights w and the biases b both involve this derivative:
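In the usual notation these are the standard back-propagation equations (a standard statement, not copied from the referenced post; C is the cost, z^l the pre-activations, a^l the activations and σ the activation function of layer l):

```latex
\delta^{l} = \big((W^{l+1})^{\top}\delta^{l+1}\big) \odot \sigma'(z^{l}), \qquad
\frac{\partial C}{\partial W^{l}} = \delta^{l}\,(a^{l-1})^{\top}, \qquad
\frac{\partial C}{\partial b^{l}} = \delta^{l}
```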
These formulas show that the gradients obtained by back-propagation are proportional to the derivative of the activation function; a saturated activation means that derivative is close to 0, which is harmful.
As for the second point, I happened to study ResNet recently; see my article 深入解讀殘差網(wǎng)絡(luò)ResNet V1(附源碼). Section 4.2 there explains that without activation functions a two-layer network is equivalent to a single-layer network, because a composition of linear functions is still linear, while a neural network is meant to fit non-linear functions (the non-linearity is supplied by the activation). Too many purely linear units are therefore useless; a tiny numerical check is given below.
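A minimal sketch (my own, not from the paper): stacking two linear layers without a non-linearity collapses into a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1 = rng.normal(size=(5, 4))        # first "layer"
W2 = rng.normal(size=(3, 5))        # second "layer"

two_layers = W2 @ (W1 @ x)          # two stacked linear layers
one_layer = (W2 @ W1) @ x           # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))  # True: no extra expressive power
```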
3.1 Experiments with the sigmoid
The sigmoid has a non-zero mean, and this mean is related to the singular values of the Hessian (yet another rather old paper that I have not had time to read), which slows training down:
"The sigmoid non-linearity has been already shown to slow down learning because of its non-zero mean that induces important singular values in the Hessian." Next, an explanation of Figure 2:
The paper explains the activation values as:
"activation values: output of the sigmoid" The sigmoid is a function whose output varies with its input, and that input is the pre-activation s of the unit. The paper says a fixed test set of 300 examples is used throughout training; s is the value a test example produces at this unit, so the activation value is f(s), where f is the sigmoid. Each layer has one thousand hidden units, and each unit has its own activation value, so a mean and a standard deviation can be computed over the units and the 300 examples; that is exactly what Figure 2 is meant to show.
What I did not understand in the figure caption is which layer the "top hidden layer" refers to. According to the description it should be Layer 4, but I would have thought the top hidden layer is Layer 1, so I had doubts. I now lean towards Layer 4, because the paper contains this sentence:
"We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0." So both "the last hidden layer" and "top hidden layer" refer to Layer 4.
The paper then analyzes Figure 2: for a long stretch of training the mean activation of the first three layers stays around 0.5, while Layer 4 stays around 0, i.e. in the saturated regime; and when Layer 4 starts to escape saturation, the first three layers begin to saturate and settle down.
The explanation the paper gives: with random initialization, the final softmax layer softmax(b + Wh) initially depends more on its biases b than on the output h of the top hidden layer (i.e. Layer 4), so the gradient updates tend to push Wh towards 0, which is achieved by pushing h towards 0:
"The logistic layer output softmax(b+Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0."
The way I understand it: the gradient flowing into W carries an extra factor of h, whereas the corresponding factor for b is 1, so if h is very small the gradient of W is very small and W learns much more slowly than b.
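A short way to see this (my own sketch, using the standard softmax plus cross-entropy gradients, with prediction p and one-hot target y):

```latex
p = \mathrm{softmax}(b + W h), \qquad
\frac{\partial L}{\partial b} = p - y, \qquad
\frac{\partial L}{\partial W} = (p - y)\, h^{\top}
```

The weight gradient carries the extra factor h; the bias gradient does not.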
This makes backward learning difficult and keeps the lower layers from learning features:
"However, pushing the sigmoid outputs to 0 would bring them into a saturation regime which would prevent gradients to flow backward and prevent the lower layers from learning features." For the sigmoid, an output of 0 lies in the saturated regime, which is why training is very slow at the beginning:
"Eventually but slowly, the lower layers move toward more useful features and the top hidden layer then moves out of the saturation regime."
3.2 Experiments with Hyperbolic tangent
I do not fully understand the "98 percentiles" in Figure 3; I will take it to mean that 98% of the points lie between the upper and lower markers.
The paper observes that the markers of Layer 1 are the first to reach values near 1, meaning that it saturates first (why? I am not sure), then Layer 2, and so on.
3.3 Experiments with the Softsign
The softsign activation is similar to the hyperbolic tangent, but it differs in how it reaches saturation: it approaches its asymptotes at -1 and 1 polynomially rather than exponentially (i.e. it changes much more slowly there). If you are interested, compare the plots drawn in Section 2.3 of this post. Figure 4 also compares the distributions of activation values of the two functions at the end of training.
One can see that the tanh activations mostly sit either in the saturated regime (near 1 or -1) or in the linear regime (near 0), while the softsign activations mostly sit in (-0.8, -0.6) or (0.6, 0.8). That is a very good non-linear region in which back-propagation works well, which suggests the softsign behaves better than the hyperbolic tangent.
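To make the polynomial-versus-exponential point concrete, compare the two derivatives (standard calculus, not taken from the paper): the tanh derivative decays exponentially in |x|, while the softsign derivative decays only quadratically, so softsign units keep a usable gradient over a much wider input range.

```latex
\tanh'(x) = 1 - \tanh^2(x) \approx 4\,e^{-2|x|} \ (\text{for large } |x|), \qquad
\mathrm{softsign}'(x) = \frac{1}{(1 + |x|)^2}
```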
4 Studying Gradients and their Propagation
4.1 Effect of the Cost Function
The cross-entropy cost is -log p(y|x) and the quadratic cost is ||y - p(y|x)||^2. The difference shows up when the predicted probability of the correct class approaches 0: the cross-entropy cost goes to positive infinity, while the quadratic cost only reaches 1. My personal reading is that the larger the cost in that regime, the more it helps convergence. The authors then set up a particularly simple network (the way I picture it, a two-layer network with one weight per layer, whose label is always 0, so the loss is 0 only for particular values of the two weights):
One can see that the cross-entropy cost has fewer plateaus. My mental picture is that the cost surface is a hilly landscape and a weight update is a small ball rolling on that surface until it drops into a pit, i.e. a local optimum. The more plateaus there are, the more likely the ball loses its momentum on a plateau and stops rolling, and the worse the performance.
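Below is a rough sketch of the kind of landscape the paper plots in Figure 5, under my own assumed toy model (a 1-1-1 network y_hat = sigmoid(w2 * tanh(w1 * x)) with input x = 1 and target y = 0; the paper's exact toy network may differ). The quadratic cost should show a much larger nearly-flat fraction of the (w1, w2) plane than the cross-entropy cost.

```python
import numpy as np

x, y = 1.0, 0.0                                   # single input, target label 0
w1, w2 = np.meshgrid(np.linspace(-4, 4, 201), np.linspace(-4, 4, 201))

h = np.tanh(w1 * x)                               # hidden unit
y_hat = 1.0 / (1.0 + np.exp(-w2 * h))             # sigmoid output unit

eps = 1e-12                                       # numerical safety for the log
cross_entropy = -(y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps))
quadratic = (y_hat - y) ** 2

# Crude plateau measure: fraction of grid points where the cost surface is almost flat.
for name, cost in [("cross-entropy", cross_entropy), ("quadratic", quadratic)]:
    gy, gx = np.gradient(cost)
    flat = np.hypot(gx, gy) < 1e-3
    print(name, "flat fraction:", round(float(flat.mean()), 3))
```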
4.2 Gradients at initialization
The paper first describes vanishing gradients, i.e. the gradients getting smaller and smaller as they are propagated backwards:
"He studied networks with linear activation at each layer, finding that the variance of the back-propagated gradients decreases as we go backwards in the network." It then gives two back-propagation formulas:
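In the paper's notation these are roughly the following (my paraphrase, so treat the exact indexing with care; s^i are the pre-activations and z^i the activations of layer i):

```latex
\frac{\partial \mathrm{Cost}}{\partial s_k^{i}} = f'(s_k^{i})\, W_{k,\bullet}^{\,i+1}\,
\frac{\partial \mathrm{Cost}}{\partial \mathbf{s}^{\,i+1}}, \qquad
\frac{\partial \mathrm{Cost}}{\partial w_{l,k}^{\,i}} = z_l^{i}\,
\frac{\partial \mathrm{Cost}}{\partial s_k^{i}}
```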
The formulas that follow get fairly involved. The derivation in the article 《Understanding the difficulty of training deep feedforward neural networks》筆記 is quite good and worth consulting, so I will not repeat it here.
That article derives some of the equations but not all of them; when I have time I should work through them once myself. Equation 7 in the paper is derived from Equation 3. The paper then states the two conditions for forward and backward propagation; note that n_i and n_{i+1} are the sizes of two adjacent layers (see the summary block below):
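A compact restatement of those two conditions and of the initialization they lead to (my own summary, writing Var[W^i] for the variance of the weights of layer i):

```latex
\text{forward: } \forall i,\ n_i \operatorname{Var}[W^i] = 1
\qquad
\text{backward: } \forall i,\ n_{i+1} \operatorname{Var}[W^i] = 1
\\[4pt]
\text{compromise: } \operatorname{Var}[W^i] = \frac{2}{n_i + n_{i+1}}
\qquad
W \sim U\!\left[-\frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}},\ \frac{\sqrt{6}}{\sqrt{n_i + n_{i+1}}}\right]
```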
As for how the compromise variance Var[W] = 2/(n_i + n_{i+1}) is turned into the uniform range above (the variance of U[-a, a] is a^2/3, so a = sqrt(6/(n_i + n_{i+1}))), see the article 深度學(xué)習(xí)--Xavier初始化方法 - shuzfan的專欄 - CSDN博客. The corresponding code is:
Looking at the function _initializer(): it first obtains fan_in and fan_out, the numbers of input and output units. If mode == 'FAN_AVG', then n = (fan_in + fan_out) / 2.0. The uniform flag says whether to sample from a uniform distribution; if it is True:
```python
limit = math.sqrt(3.0 * factor / n)
return random_ops.random_uniform(shape, -limit, limit, dtype, seed=seed)
```

With factor = 1.0 and n = (fan_in + fan_out) / 2, this gives limit = sqrt(6 / (fan_in + fan_out)), which is exactly the normalized-initialization formula from the paper. If uniform is False, we instead have:

```python
trunc_stddev = math.sqrt(1.3 * factor / n)
return random_ops.truncated_normal(shape, 0.0, trunc_stddev, dtype, seed=seed)
```

The documentation of variance_scaling_initializer says that both uniform=True and uniform=False count as Xavier initialization, which I do not quite understand.
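For clarity, here is a minimal NumPy sketch of the two branches (my own code, only mimicking what the TensorFlow snippet above does; the 1.3 factor is there to roughly compensate for the variance lost by truncating at two standard deviations):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    # Normalized (Xavier) initialization: W ~ U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def xavier_truncated_normal(fan_in, fan_out):
    # Normal variant with the same target variance 2/(fan_in+fan_out), truncated at 2 stddev.
    n = (fan_in + fan_out) / 2.0
    stddev = np.sqrt(1.3 / n)              # 1.3 compensates for the truncation
    w = rng.normal(0.0, stddev, size=(fan_in, fan_out))
    bad = np.abs(w) > 2.0 * stddev         # redraw samples outside +/- 2 stddev
    while bad.any():
        w[bad] = rng.normal(0.0, stddev, size=int(bad.sum()))
        bad = np.abs(w) > 2.0 * stddev
    return w

W = xavier_uniform(1000, 1000)
print(W.var(), 2.0 / (1000 + 1000))        # both numbers should be close to 0.001
```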
One more thing to keep in mind is that most networks nowadays use the ReLU, for which the assumption used in the derivation that the activation operates in its linear regime with f'(s) ≈ 1 (and is symmetric around 0) does not necessarily hold; the impact of this remains to be studied. The paper then shows the resulting plots (there are three; only one is reproduced here):
后面還講了雅克比矩陣的奇異值(可以參考【線性代數(shù)】通俗的理解奇異值以及與特征值的區(qū)別,還有奇異值分解及其應(yīng)用):
,論文認(rèn)為平均奇異值的大小表示了層間激活值方差的比例,而這個(gè)比例越接近1,代表流動(dòng)性越好:When consecutive layers have the same dimension, the average singular value corresponds to the average ratio of infinitesimal volumes mapped from to , as well as to the ratio of average activation variance going from to .而正則初始化平均奇異值為0.8,標(biāo)準(zhǔn)初始化平均奇異值為0.5,相對(duì)來(lái)說(shuō)正則初始化更利于梯度的回傳。
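A small experiment to check this (my own sketch, not the paper's code): build one tanh layer under each initialization and compare the average singular value of its Jacobian. It should come out noticeably closer to 1 for the normalized initialization than for the standard U[-1/sqrt(n), 1/sqrt(n)] one, in line with the roughly 0.8 versus 0.5 the paper reports.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                                            # layer width (chosen for speed)

def layer_jacobian(W, z):
    # For z_next = tanh(W @ z), the Jacobian dz_next/dz is diag(1 - tanh(s)^2) @ W.
    s = W @ z
    return (1.0 - np.tanh(s) ** 2)[:, None] * W

z = rng.uniform(-1.0, 1.0, size=n)                 # activations of the previous layer

W_normalized = rng.uniform(-np.sqrt(6.0 / (n + n)), np.sqrt(6.0 / (n + n)), size=(n, n))
W_standard = rng.uniform(-1.0 / np.sqrt(n), 1.0 / np.sqrt(n), size=(n, n))

for name, W in [("normalized", W_normalized), ("standard", W_standard)]:
    sv = np.linalg.svd(layer_jacobian(W, z), compute_uv=False)
    print(name, "average singular value:", round(float(sv.mean()), 3))
```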
4.3 Back-propagated Gradients During Learning
The paper also points out that one cannot simply rely on variance calculations for the theoretical analysis:
"In particular, we cannot use simple variance calculations in our theoretical analysis because the weights values are not anymore independent of the activation values and the linearity hypothesis is also violated." Figure 7 gives an intuitive picture of the shrinking gradients:
In the upper part of Figure 7 (standard initialization), the back-propagated gradients shrink as we move backwards through the network, from Layer 5 down to Layer 1, although this decay becomes less and less pronounced as training proceeds. With the normalized initialization proposed in the paper, there is no such gradient decay.
The paper also observes that, even with the standard initialization, the gradients of the weights show no such decay across layers:
"What was initially really surprising is that even when the back-propagated gradients become smaller (standard initialization), the variance of the weights gradients is roughly constant across layers, as shown on Figure 8."
5 Error Curves and Conclusions
Next, error rates are used to verify the benefits of the strategies discussed above.
The experimental results are shown in the figure below:
Several conclusions can be drawn from them:
The paper also discusses some other methods (hard to understand; I will revisit them when I get the chance), and some of the conclusions are:
(The end)