When and How to Use Regularization in Deep Learning
Introduction:
The key role of regularization in deep learning models is to reduce overfitting. It simplifies the network, which lets it generalize to data points it has never encountered before and thus reduces the testing error when the model performs well only on the training set.
Before learning about regularization, let's briefly look at the different scenarios in which it can be helpful.
Identifying causes of errors:
A simple model might fail to perform well on the training data, while a complex model may succeed in fitting the training points close to the actual function. However, the ultimate goal of any model is to perform well on unseen data. The two main error-causing scenarios are:
Underfitting:
A statistical model or an algorithm is said to have underfitted when it cannot capture the underlying trend of the data. The model may be too simple or too biased to capture the data trend.
This usually happens when we try to fit a linear model to non-linear data. In such cases the rules learned by the deep learning model are too simple, and it will probably make a lot of wrong predictions.
Therefore, underfitting → high bias and low variance.
Techniques to reduce underfitting:
1. Increase model complexity
2. Increase the number of features or perform feature engineering
3. Increase the duration of training
Overfitting:
A statistical model or an algorithm is said to have overfitted when it starts learning the noise and inaccuracies in the training data, to the point where even minute details are memorized.
Overfitting is generally caused by non-parametric and non-linear methods, because these types of algorithms have more flexibility to build the model around the dataset and can therefore sometimes build unrealistic models. As a result, they perform poorly on the testing data.
Therefore, overfitting → low bias and high variance.
Techniques to reduce overfitting:
1. Increase training data or perform data augmentation.
2. Reduce model complexity.
3. Early stopping during the training phase, based on the validation loss (a minimal sketch follows below).
4. Regularization
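For the early-stopping item above, the usual training-loop logic looks roughly like the sketch below. Here model, train_data and val_data stand for your own model and data, and train_one_epoch, evaluate and save_checkpoint are hypothetical placeholders for your training, validation, and checkpointing routines; patience is the number of non-improving epochs tolerated before stopping.

```python
# Minimal early-stopping sketch. train_one_epoch, evaluate and save_checkpoint
# are hypothetical placeholders for your own training/validation/checkpoint code.
best_val_loss = float("inf")
epochs_without_improvement = 0
patience = 5                                   # how many non-improving epochs we tolerate

for epoch in range(100):
    train_one_epoch(model, train_data)         # one pass over the training set
    val_loss = evaluate(model, val_data)       # loss on held-out validation data

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        epochs_without_improvement = 0
        save_checkpoint(model)                 # keep the best weights seen so far
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break                              # stop before the model overfits further
```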
[Figure: underfitting, just-right, and overfitting fits]
The above image shows conditions where underfitting, optimal (just-right) and overfitting occur. The goal is to train a model such that its results fall in the just-right scenario, with a balance between bias and variance.
The key to any model training approach is to inspect the training trends regularly, so as to identify the different bias-variance scenarios with the help of a validation dataset.
The following table summarizes the intuitions behind this.
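In essence, comparing the training and testing errors gives the following standard reading:

Scenario        Training error    Testing error                       Diagnosis
Underfitting    High              High                                High bias
Overfitting     Low               Much higher than the training error High variance
Just right      Low               Close to the training error         Balanced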
The ultimate objective of any model is to make the training error small (reducing underfitting) while keeping the testing error close to it (reducing overfitting).
This requires an appropriate selection of the algorithms and features to be used, leading us to the Occam's Razor principle, which states: 'among all the competing hypotheses that explain the data equally well, select the simplest one.'
In order to make the model better, we tend to over-explore the features, which can cause wrong fits and generally unsatisfying results. Rather, the focus should be on simplicity while exploring the features and algorithms.
Since this blog focuses on regularization, if you would like to learn more about the bias-variance tradeoff, I recommend going through this article.
How Regularization reduces Overfitting:
Since deep learning deals with highly complex models, it is easy for them to overfit the training data. Even when the model performs well on the training data, the testing error can be quite large, resulting in high variance.
[Figure: the increasing test error indicates overfitting]
Consider training a neural network with a cost function J denoted as:
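For a logistic regression example (the case shown in the original figure), this cost takes the standard cross-entropy form:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\left({y'}^{(i)}, y^{(i)}\right) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log {y'}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - {y'}^{(i)}\right)\right]$$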
where w and b are the weights and bias respectively,
y' = predicted label
y = actual label
m = number of training samples
We add a regularization term to this function so that it penalizes the weight matrices of nodes within the network.
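With an L2 penalty, a standard way to write this (summing the squared Frobenius norms of each layer's weight matrix w^[l] over the L layers of the network) is:

$$J_{reg}(w, b) = J(w, b) + \frac{\lambda}{2m}\sum_{l=1}^{L}\left\lVert w^{[l]}\right\rVert_F^2$$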
where λ = regularization coefficient
Update of weight w for each layer:
每層重量w的更新:
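With gradient descent and learning rate α, a standard form of this update is the following, where ∂J/∂w^[l] denotes the gradient of the unregularized cost obtained from backpropagation:

$$w^{[l]} := w^{[l]} - \alpha\left(\frac{\partial J}{\partial w^{[l]}} + \frac{\lambda}{m}\,w^{[l]}\right) = \left(1 - \frac{\alpha\lambda}{m}\right)w^{[l]} - \alpha\,\frac{\partial J}{\partial w^{[l]}}$$

The factor (1 − αλ/m) shrinks every weight slightly on each step, which is why L2 regularization is also known as weight decay.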
In this way, the regularization term is used to push some of the weight matrices nearly to zero and reduce their impact. As a result, the network becomes much simpler and the chances of overfitting the training data are reduced, since different nodes are suppressed while training. The coefficient λ needs to be optimized according to the performance on the validation set to obtain a well-fitted model.
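As a concrete illustration, here is a minimal NumPy sketch of a single logistic-regression gradient step with the L2 term added; the toy data, learning rate, and λ are illustrative values only.

```python
import numpy as np

np.random.seed(0)
m = 64                                      # number of training samples
X = np.random.randn(m, 3)                   # toy inputs with 3 features
y = np.random.randint(0, 2, size=(m, 1))    # toy binary labels

w = np.random.randn(3, 1)                   # weights
b = 0.0                                     # bias
lr, lam = 0.1, 0.7                          # learning rate and regularization coefficient λ

z = X @ w + b
y_hat = 1.0 / (1.0 + np.exp(-z))            # sigmoid predictions

dw = X.T @ (y_hat - y) / m                  # gradient of the unregularized cost
dw += (lam / m) * w                         # L2 term: pushes the weights toward zero
db = float(np.mean(y_hat - y))

w -= lr * dw                                # the larger λ is, the stronger the shrinkage
b -= lr * db
```

Increasing lam strengthens the shrinkage of w; in practice λ is chosen by monitoring performance on the validation set, as noted above.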
[Figure: reducing the number of active nodes makes the network simpler]
Another intuition lies in the activation function of the output layer of a network. Since the weights tend to be smaller because of regularization, the function z is given by:
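$$z = w\,a + b$$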
where a is the activation from the previous layer.
Hence, z also becomes small. Thus, any activation function such as sigmoid(z) or tanh(z) is more likely to operate within its (roughly) linear range. This makes the otherwise complex function behave comparatively linearly, which reduces overfitting.
An example of the tanh(z) function is shown below.
[Figure: for small z, tanh(z) tends to stay within the encircled, near-linear range]
Common Regularization Techniques:
Now that we know how regularization helps to reduce overfitting, let us look at the most common and effective practices.
1. L1 and L2 Regularization:
When we have a large number of features, the model's tendency to overfit, along with the computational complexity, can increase.
Two powerful techniques, Ridge regression (which performs L2 regularization) and Lasso regression (which performs L1 regularization), are used to bring down the cost function.
a) Ridge Regression for L2 Regularization: It penalizes the coefficients if they are found to be too far from zero, thus decreasing model complexity while keeping all variables in the model.
[Figure: Ridge Regression]
The red points in the above image correspond to the training set. The model represented by the red curve fits these points, but it is clear that it will not perform very well on the testing data (green points).
So, ridge regression helps in finding the optimum model, represented by the blue curve, that reduces overfitting on the training set by introducing bias.
This bias is known as the ridge regression penalty = (λ * slope²)
The slope² term contains the squares of all the slopes (one per input feature) summed together, excluding only the y (output) intercept.
b) Lasso Regression (Least Absolute Shrinkage and Selection Operator) for L1 Regularization: Previously, in ridge regression, bias was introduced in order to decrease the variance using squared slopes.
In lasso regression we add the absolute value of the slope, |slope|, instead of the squared slope, to introduce a small amount of bias, represented by the orange curve in the image below. This bias improves the fit over training time.
Therefore, lasso regression penalty = (λ * |slope|)
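As a quick, hands-on illustration, both penalties are available in scikit-learn; in the sketch below, alpha plays the role of λ, and the data and values are purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 samples, 5 features
true_w = np.array([3.0, 0.0, 0.0, 1.5, 0.0])  # only two features actually matter
y = X @ true_w + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks all coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set some coefficients exactly to zero

print("ridge coefficients:", ridge.coef_.round(2))
print("lasso coefficients:", lasso.coef_.round(2))
```

Notice that lasso can drive some coefficients exactly to zero (which is why it is also used for feature selection), while ridge only shrinks them and keeps every variable in the model, as described above.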
[Figure: Lasso Regression]
Equations for implementation in deep learning:
L2 regularization uses the regularization term as discussed above to penalize the weights of complex models. In simple terms, the equation for the cost function becomes:
Cost function = Cost(from y and y') + Regularization term
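In symbols, matching the definitions below, a standard form is:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\left({y'}^{(i)}, y^{(i)}\right) + \frac{\lambda}{2m}\lVert w \rVert_2^2,\qquad \lVert w \rVert_2^2 = \sum_{j=1}^{n} w_j^2$$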
The subscript 2 denotes the L2 norm.
λ = regularization coefficient
m = number of training samples, and
n = number of features
For L1 regularization, the only difference is that the regularization term contains λ/m instead of λ/(2m). Therefore, the cost function for L1 is:
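In the same notation:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}\left({y'}^{(i)}, y^{(i)}\right) + \frac{\lambda}{m}\lVert w \rVert_1,\qquad \lVert w \rVert_1 = \sum_{j=1}^{n} \lvert w_j \rvert$$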
The subscript 1 denotes the L1 norm.
2. Dropout Regularization:
This is the most intuitive regularization technique and is frequently used. At every iteration, some nodes are dropped at random, and the drop is valid only for that particular iteration; a new (random) set of nodes is dropped for the next iteration.
[Figure: nodes used in two different iterations]
In this way, every iteration uses a different set of nodes, so the network produced is random and does not overfit with its original complex structure. Because any feature in the network can be dropped at random, the model is never heavily influenced by a particular feature, thus reducing overfitting.
The number of nodes to be eliminated is decided by assigning keep-probabilities separately for each hidden layer of the network. Thus, we can control the effect produced by each and every layer on the network.
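A minimal NumPy sketch of the usual "inverted dropout" implementation for one hidden layer's activations looks like this; keep_prob is the per-layer keep-probability described above, and the shapes and values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 5))                   # activations of one hidden layer (4 units, batch of 5)
keep_prob = 0.8                               # per-layer keep-probability

mask = rng.random(size=a.shape) < keep_prob   # a fresh random mask on every iteration
a = a * mask                                  # drop roughly 20% of the nodes this iteration
a = a / keep_prob                             # rescale so the expected activation stays the same

print(a)
```

At test time nothing is dropped; the division by keep_prob during training keeps the expected scale of the activations unchanged, so the rest of the network behaves consistently.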
In conclusion, regularization is an important technique in deep learning. With sufficient knowledge of overfitting scenarios and regularization implementation, the results improve to a great extent.
Translated from: https://medium.com/snu-ai/when-and-how-to-use-regularization-in-deep-learning-4cf3fca3950f