Rectifier (neural networks)
https://en.wikipedia.org/wiki/Rectifier_(neural_networks)
The rectified linear unit (ReLU) is an activation function commonly used in artificial neural networks; the term usually refers to the nonlinear functions represented by the ramp function and its variants.
Commonly used rectifier functions include the ramp function $f(x) = \max(0, x)$ and the leaky ReLU, where $x$ is the input to the neuron. Rectification is considered to have some biological grounding, and because it usually performs better in practice than other common activation functions (such as the logistic function), it is widely used in today's deep neural networks for computer-vision applications such as image recognition.
In the context of artificial neural networks, the rectifier is an activation function defined as the positive part of its argument:
$f(x) = x^{+} = \max(0, x),$
where $x$ is the input to a neuron. This is also known as a ramp function and is analogous to half-wave rectification in electrical engineering. This activation function was first introduced to a dynamical network by Hahnloser et al. in 2000, with strong biological motivations and mathematical justifications. In 2011 it was demonstrated for the first time to enable better training of deeper networks, compared to the activation functions widely used before then, e.g. the logistic sigmoid (which is inspired by probability theory; see logistic regression) and its more practical counterpart, the hyperbolic tangent. As of 2017, the rectifier is the most popular activation function for deep neural networks.
A unit employing the rectifier is also called a rectified linear unit (ReLU).
In the usual sense, the rectifier is the ramp function of mathematics, i.e.
$f(x) = \max(0, x)$
In a neural network, the rectifier serves as the activation function of a neuron and defines the neuron's nonlinear output after the linear transformation $\mathbf{w}^{T}\mathbf{x} + b$. Given the input vector $\mathbf{x}$ arriving from the previous layer, a neuron using the rectifier activation outputs
$\max(0, \mathbf{w}^{T}\mathbf{x} + b)$
to the next layer of neurons, or as the output of the whole network (depending on where the neuron sits in the network architecture).
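This computation can be sketched in a few lines of NumPy. The snippet below is illustrative only; the weight vector, bias, and input values are invented for the example.

```python
import numpy as np

def relu(z):
    """Rectifier / ramp function: element-wise max(0, z)."""
    return np.maximum(0.0, z)

# Hypothetical neuron parameters (w, b) and an input vector x from the previous layer.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, -2.0])

# The neuron's output: max(0, w^T x + b), passed on to the next layer.
pre_activation = w @ x + b
output = relu(pre_activation)
print(pre_activation, output)
```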
[Figure: Plot of the rectifier (blue) and softplus (green) functions near $x = 0$.]

1. Variants
Besides the basic ramp function, the rectifier has several variants that are also widely used in deep learning, such as the leaky ReLU, the randomized leaky ReLU (RReLU), and the noisy ReLU.
1.1 Leaky ReLUs
When the input $x$ is negative, the leaky ReLU has a constant gradient $\lambda \in (0, 1)$ instead of 0; when the input is positive, it coincides with the ordinary ramp function.
$f(x) = \begin{cases} x & \text{if } x > 0 \\ \lambda x & \text{if } x \leq 0 \end{cases}$
In deep learning, if $\lambda$ is treated as a parameter that can be learned by backpropagation, the leaky ReLU is called a parametric ReLU (PReLU).
Leaky ReLUs allow a small, positive gradient when the unit is not active.
Parametric ReLUs (PReLUs) take this idea further by making the coefficient of leakage into a parameter that is learned along with the other neural network parameters.
$f(x) = \begin{cases} x & \text{if } x > 0 \\ ax & \text{otherwise} \end{cases}$
Note that for $a \leq 1$, this is equivalent to
$f(x) = \max(x, ax)$
and thus has a relation to “maxout” networks.
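As a rough sketch (not taken from any particular implementation), leaky and parametric ReLUs differ only in whether the negative-side slope is a fixed constant or a learned parameter; the slope values and inputs below are arbitrary.

```python
import numpy as np

def leaky_relu(x, lam=0.01):
    """Leaky ReLU: fixed slope lam in (0, 1) for negative inputs, identity otherwise."""
    return np.where(x > 0, x, lam * x)

def prelu(x, a):
    """Parametric ReLU: same form, but `a` is learned along with the other parameters."""
    # For a <= 1 this equals max(x, a*x), the "maxout"-style formulation above.
    return np.maximum(x, a * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(leaky_relu(x))        # [-0.02  -0.005  0.     1.5  ]
print(prelu(x, a=0.25))     # [-0.5   -0.125  0.     1.5  ]
```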
1.2 Randomized leaky ReLUs
The randomized leaky ReLU (RReLU) was first proposed and used in the Kaggle National Data Science Bowl (NDSB) competition. Compared with the ordinary leaky ReLU, the slope $\lambda$ applied to negative inputs is a random variable drawn from a continuous uniform distribution $U(l, u)$, i.e.
$f(x) = \begin{cases} x & \text{if } x > 0 \\ \lambda x & \text{if } x \leq 0 \end{cases}$
where $\lambda \sim U(l, u)$ with $l < u$ and $l, u \in [0, 1)$.
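A minimal sketch of the idea, with arbitrarily chosen bounds `l` and `u`: during training the negative-side slope is sampled from $U(l, u)$, while at inference time a fixed slope (commonly the mean $(l+u)/2$) is used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def rrelu(x, l=1/8, u=1/3, training=True):
    """Randomized leaky ReLU: negative-side slope drawn from U(l, u) during training."""
    if training:
        lam = rng.uniform(l, u)   # one sample per call; per-element sampling is also possible
    else:
        lam = (l + u) / 2.0       # deterministic slope at inference time
    return np.where(x > 0, x, lam * x)

x = np.array([-1.0, 2.0])
print(rrelu(x, training=True))    # negative input scaled by a random slope
print(rrelu(x, training=False))   # negative input scaled by the fixed mean slope
```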
1.3 Noisy ReLUs
The noisy ReLU is a variant of the rectified linear unit that incorporates Gaussian noise. Given the input $x$ to a neuron, the noisy ReLU adds a normally distributed amount of uncertainty, i.e.
$f(x) = \max(0, x + Y)$
where the random variable $Y \sim \mathcal{N}(0, \sigma(x))$. Noisy ReLUs have been used with some success in restricted Boltzmann machines for computer-vision tasks.
Rectified linear units can be extended to include Gaussian noise, making them noisy ReLUs, giving
$f(x) = \max(0, x + Y), \quad \text{with } Y \sim \mathcal{N}(0, \sigma(x)).$
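A small illustrative sketch of the formula above; the text only specifies $Y \sim \mathcal{N}(0, \sigma(x))$, so the particular noise scale used here (a constant 0.1) is an assumption made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_relu(x, sigma):
    """Noisy ReLU: add zero-mean Gaussian noise of scale sigma(x), then rectify."""
    noise = rng.normal(loc=0.0, scale=sigma(x))
    return np.maximum(0.0, x + noise)

# Illustrative, assumed choice of sigma(x): a constant noise scale of 0.1.
sigma = lambda x: 0.1 * np.ones_like(x)

x = np.array([-0.5, 0.0, 1.0])
print(noisy_relu(x, sigma))
```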
1.4 ELUs
Exponential linear units try to make the mean activations closer to zero, which speeds up learning. It has been shown that ELUs can obtain higher classification accuracy than ReLUs.
$f(x) = \begin{cases} x & \text{if } x > 0 \\ a(e^{x} - 1) & \text{otherwise} \end{cases}$
$a$ is a hyperparameter to be tuned, and $a \geq 0$ is a constraint.
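A small sketch of the ELU definition above; the choice $a = 1$ is an assumption for the example (a common default), not a value prescribed by the text.

```python
import numpy as np

def elu(x, a=1.0):
    """Exponential linear unit: identity for x > 0, a*(exp(x) - 1) otherwise (a >= 0)."""
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
# Negative inputs saturate towards -a, which pushes the mean activation toward zero.
print(elu(x))
```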
2. Advantages
Compared with traditional activation functions such as the logistic sigmoid and the hyperbolic tangent (tanh), the rectifier has several advantages:
- Biological plausibility: it is one-sided, compared to the antisymmetry of tanh. Studies of the brain suggest that biological neurons encode information sparsely; typically only about 1%-4% of the brain's neurons are active at the same time. Rectification combined with regularization makes it possible to tune how many units of an artificial network are active (i.e. have positive output), whereas the logistic function already outputs 1/2 at an input of 0, a half-saturated state that matches the biological picture less well.
- Sparse activation: for example, in a randomly initialized network, only about 50% of hidden units are activated (have a non-zero output).
- Better gradient propagation: fewer vanishing gradient problems compared to sigmoidal activation functions that saturate in both directions, which makes gradient descent and backpropagation more efficient.
- Efficient computation: only comparison, addition and multiplication are required; there are no expensive operations such as the exponentials found in other activation functions, and the sparsity of the activations further lowers the network's overall computational cost.
- Scale invariance: $\max(0, ax) = a \max(0, x)$ for $a \geq 0$.
Rectifying activation functions were used to separate specific excitation and unspecific inhibition in the Neural Abstraction Pyramid, which was trained in a supervised way to learn several computer vision tasks. In 2011, the use of the rectifier as a non-linearity was shown to enable training deep supervised neural networks without requiring unsupervised pre-training. Rectified linear units, compared to the sigmoid function or similar activation functions, allow faster and more effective training of deep neural architectures on large and complex datasets.
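Two of the properties listed above are easy to check numerically. The sketch below (illustrative, with arbitrary layer sizes and random data) verifies the roughly 50% sparse activation of a randomly initialized layer and the scale-invariance identity $\max(0, ax) = a\max(0, x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized layer: with zero-mean weights and inputs, roughly half of the
# pre-activations are negative, so ReLU leaves only about 50% of hidden units non-zero.
W = rng.normal(size=(1000, 100))
x = rng.normal(size=100)
h = np.maximum(0.0, W @ x)
print("fraction of active hidden units:", np.mean(h > 0))   # close to 0.5

# Scale invariance: max(0, a*x) == a * max(0, x) for a >= 0.
a = 3.0
print(np.allclose(np.maximum(0.0, a * x), a * np.maximum(0.0, x)))   # True
```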
3. Potential problems
- Non-differentiable at zero; however, it is differentiable everywhere else, and the value of the derivative at zero can be arbitrarily chosen to be 0 or 1.
- Non-zero centered
- Unbounded
- Dying ReLU problem: ReLU neurons can sometimes be pushed into states in which they become inactive for essentially all inputs. In this state, no gradients flow backward through the neuron, and so the neuron becomes stuck in a perpetually inactive state and “dies.” This is a form of the vanishing gradient problem. In some cases, large numbers of neurons in a network can become stuck in dead states, effectively decreasing the model capacity. This problem typically arises when the learning rate is set too high. It may be mitigated by using Leaky ReLUs instead, which assign a small positive slope to the left of $x = 0$ (see the sketch after this list).
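The dying behaviour can be illustrated with a toy unit whose bias has been pushed far negative (all numbers below are invented for the example): the ReLU output and its gradient are zero for every input, so gradient descent can no longer move the weights and the unit stays dead.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative of ReLU: 1 for z > 0, 0 for z < 0 (the value at z == 0 is a free choice).
    return (z > 0).astype(float)

# Hypothetical "dead" unit: a large negative bias keeps the pre-activation negative
# for essentially all inputs, so both the output and the weight gradient are zero.
w, b = np.array([0.5, -0.3]), -10.0
for x in (np.array([1.0, 2.0]), np.array([-1.0, 0.5]), np.array([3.0, -2.0])):
    z = w @ x + b
    print(relu(z), relu_grad(z) * x)   # output and weight gradient are all zeros
```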
4. Softplus
A smooth approximation to the rectifier is the analytic function
$f(x) = \log(1 + e^{x}),$
which is called the softplus or SmoothReLU function. The derivative of softplus is $f'(x) = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$, the logistic function. The logistic function is a smooth approximation of the derivative of the rectifier, the Heaviside step function.
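A minimal sketch of softplus and its derivative; the rearranged form used for numerical stability is an implementation choice, not something prescribed by the text.

```python
import numpy as np

def softplus(x):
    """Smooth approximation to the rectifier: log(1 + exp(x)).
    Written as max(x, 0) + log1p(exp(-|x|)) to avoid overflow for large |x|."""
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def logistic(x):
    """Derivative of softplus: the logistic (sigmoid) function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, 0.0, 5.0])
print(softplus(x))   # close to [0, log 2, 5] -- hugging the rectifier
print(logistic(x))   # smooth approximation of the Heaviside step function
```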
The multivariable generalization of single-variable softplus is the LogSumExp with the first argument set to zero:
$\mathrm{LSE}_{0}^{+}(x_{1}, \dots, x_{n}) := \mathrm{LSE}(0, x_{1}, \dots, x_{n}) = \log\left(1 + e^{x_{1}} + \cdots + e^{x_{n}}\right).$
The LogSumExp function itself is:
$\mathrm{LSE}(x_{1}, \dots, x_{n}) = \log\left(e^{x_{1}} + \cdots + e^{x_{n}}\right),$
and its gradient is the softmax; the softmax with the first argument set to zero is the multivariable generalization of the logistic function. Both LogSumExp and softmax are used in machine learning.
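The sketch below illustrates these relations numerically; the max-shift used inside both functions is a standard stability trick and an implementation choice of this example.

```python
import numpy as np

def logsumexp(x):
    """LSE(x_1, ..., x_n) = log(sum exp(x_i)), shifted by max(x) for numerical stability."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def softmax(x):
    """Gradient of LogSumExp: softmax_i(x) = exp(x_i) / sum_j exp(x_j)."""
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([1.0, 2.0])

# LSE with the first argument fixed to zero is the multivariable softplus ...
print(logsumexp(np.concatenate(([0.0], x))))   # log(1 + e^1 + e^2)

# ... and softmax with a prepended zero generalizes the logistic function.
print(softmax(np.concatenate(([0.0], x))))
```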