Model Selection with Large Neural Networks and Small Data
The title statement is certainly a bold claim, and I suspect many of you are shaking your heads right now.

Under the classical teachings of statistical learning, this contradicts the well-known bias-variance tradeoff. This theory defines a sweet spot where, if you increase model complexity further, generalization error tends to increase (the typical U-shaped test error curve).

You would think this effect is more pronounced for small datasets where the number of parameters, p, is larger than the number of observations, n, but this is not necessarily the case.

In a recent ICML 2020 paper by Deepmind (Bornschein, 2020), it was shown that one can train on a smaller subset of the training data while maintaining generalizable results, even for large overparameterized models. If this is true, we can reduce the computational overhead in model selection and hyperparameter tuning significantly.

Think for a moment about the implications of this. This could dramatically alter how we select optimal models or tune hyperparameters (for example, in Kaggle competitions), since we can include significantly more models in our grid search (or the like).

Is this too good to be true? And how can we prove it?

Here are the main takeaways before we get started:
Model selection is possible using only a subset of your training data, thus saving computational resources (relative ranking-hypothesis)

Large overparameterized neural networks can generalize surprisingly well (double descent)

After reaching a minimum, test cross-entropy tends to gradually increase over time while test accuracy improves (overconfidence). This can be avoided using temperature scaling.

Let’s get started.
1. Review of classical theory on the bias-variance trade-off
Before we get started, I will offer you two options. If you are tired of hearing about the bias-variance trade-off for the 100th time, please read the TLDR at the end of Section 1 and then move on to Section 2. Otherwise, I will briefly introduce the bare minimum needed to understand the basics before moving on with the actual paper.

The predictive error for all supervised learning algorithms can be broken into three (theoretical) parts, which are essential to understand the bias-variance tradeoff. These are: 1) bias, 2) variance, and 3) irreducible error (or noise term).

The irreducible error (sometimes called noise) is a term disconnected from the chosen model which can never be reduced. It is an aspect of the data arising due to an imperfect framing of the problem, meaning we will never be able to capture the true relationship of the data — no matter how good our model is.

The bias term is generally what people think of when they refer to model (predictive) errors. In short, it measures the difference between the “average” model prediction and the ground truth. Average might seem strange in this case, as we typically only train one model. Think of it this way. Due to small perturbations (randomness) in our data, we can get slightly different predictions even with the same model. By averaging the predictions we get under these perturbations and comparing that average to the ground truth, we obtain the bias term. High bias is a sign of poor model fit (underfitting), as it will have a large prediction error on both the training and test set.

Finally, the variance term refers to the variability of the model prediction for a given data point. It might sound similar, but the key difference lies in the “average” versus “data point”. High variance implies high generalization error. For example, while a model might be relatively accurate on the training set, it can achieve a considerably poor fit on the test set. This latter scenario (high variance, low bias) is typically the most likely when training overparameterized neural networks, i.e., what we refer to as overfitting.

The practical implication of these terms is the need to balance bias and variance (hence the name trade-off), typically controlled via model complexity. The ultimate goal is to obtain low bias and low variance. This is the typical U-shaped test error curve you might have seen before.
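To make the decomposition a bit more concrete, here is a small simulation sketch (our own illustration, not from the paper): we repeatedly refit a polynomial regressor on noisy resamples of the same underlying function and estimate the bias and variance terms from the spread of its predictions. The target function, noise level and polynomial degrees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_train=30, n_repeats=200, noise=0.3):
    """Refit the same polynomial model on many noisy training sets and decompose its error."""
    x_test = np.linspace(0, 1, 100)
    preds = np.empty((n_repeats, x_test.size))
    for r in range(n_repeats):
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise, n_train)
        coefs = np.polyfit(x, y, deg=degree)        # least-squares polynomial fit
        preds[r] = np.polyval(coefs, x_test)
    avg_pred = preds.mean(axis=0)
    bias_sq = np.mean((avg_pred - true_fn(x_test)) ** 2)  # (average prediction - truth)^2
    variance = preds.var(axis=0).mean()                   # spread of predictions around their average
    return bias_sq, variance

for degree in [1, 3, 9, 15]:
    b, v = estimate_bias_variance(degree)
    print(f"degree={degree:2d}  bias^2={b:.3f}  variance={v:.3f}")
```

Low-degree fits tend to show high bias and low variance, while high-degree fits show the opposite, which is exactly the trade-off described above.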
[Figure: the typical U-shaped test error curve illustrating the bias-variance trade-off. Source: https://www.digitalvidya.com/blog/bias-variance-tradeoff/]

Alright, I will assume you know enough about the bias-variance trade-off to understand why the original claim, that overparameterized neural networks do not necessarily imply high variance, is indeed puzzling.
TLDR; high variance, low bias is a sign of overfitting. Overfitting happens when a model achieves high accuracy on the training set but low accuracy on the test set. This typically happens for overparameterized neural networks.
2. Modern Regime — Larger models are better!
In practice, we typically optimize the bias-variance trade-off using a validation set with (for example) early stopping. Interestingly, this approach might be completely wrong. Over the past few years, researchers have found that if you keep fitting increasingly flexible models, you obtain what is termed double descent, i.e., generalization error will start to decrease again after reaching an intermediate peak. This finding is empirically validated in Nakkiran et al. (2019) for modern neural network architectures on established and challenging datasets. See the following figure from OpenAI, which shows this scenario:

[Figure: the double descent test error curve. Source: https://openai.com/blog/deep-double-descent/]

- Test error initially declines until it reaches a (local) minimum, and then starts increasing again with increasing complexity. In the critical regime, it is important that we keep adding model complexity, as the test error will start to decline again, eventually reaching a (global) minimum.

These findings imply that larger models are generally better due to the double descent phenomenon, which challenges the long-held viewpoint regarding overfitting for overparameterized neural networks.
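If you want to look for this effect yourself, a rough recipe is to sweep model capacity on a small, slightly noisy dataset and record train and test error per width. The sketch below is our own illustration, not the setup of Nakkiran et al. (2019): it uses scikit-learn MLPs on a 1,000-example MNIST subset with 10% label noise, and all of these numbers are arbitrary choices to keep the run short.

```python
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# A small, noisy classification problem makes the interpolation threshold
# (roughly params ~ samples) reachable with modest hidden-layer widths.
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
X_train, X_test, y_train, y_test = train_test_split(
    X[:5000], y[:5000], train_size=1000, random_state=0)

# Flip ~10% of the training labels; label noise sharpens the peak of the curve.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.10
y_train = y_train.copy()
y_train[flip] = rng.choice(np.unique(y_train), size=flip.sum())

for width in [2, 4, 8, 16, 32, 64, 128, 256, 512]:
    clf = MLPClassifier(hidden_layer_sizes=(width,), max_iter=500, random_state=0)
    clf.fit(X_train, y_train)
    print(f"width={width:4d}  "
          f"train error={1 - clf.score(X_train, y_train):.3f}  "
          f"test error={1 - clf.score(X_test, y_test):.3f}")
```

Plotting test error against width may show the first descent, a bump around the point where the network starts to fit the (noisy) training labels perfectly, and a second descent for the widest models; the exact shape depends heavily on the choices above.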
3. Relative Ranking-Hypothesis
Having established that large overparameterized neural networks can generalize well, we want to take it one step further. Enter the relative ranking hypothesis. Before we explain the hypothesis, we note that, if it proves true, you can potentially perform model selection and hyperparameter tuning on a small subset of your training dataset for your next experiment, and by doing so save computational resources and valuable training time.

We will briefly introduce the hypothesis, followed by a few experiments to validate the claim. As an additional experiment not included in the literature (as far as we know), we will investigate one setting that could potentially invalidate the relative ranking hypothesis: imbalanced datasets.
(a) Theory
One of the key hypotheses of Bornschein (2020) is:

“overparameterized model architectures seem to maintain their relative ranking in terms of generalization performance, when trained on arbitrarily small subsets of the training set”.

They call this observation the relative ranking-hypothesis.

In layman’s terms: let’s say we have 10 models to choose from, numbered from 1 to 10. We train our models on a 10% subset of the training data and find that model 6 is the best, followed by 4, then 3, and so on.

The ranking hypothesis postulates that, as we gradually increase the subset percentage from 10% all the way up to 100%, we should obtain the exact same ordering of optimal models.

If this hypothesis is true, we can essentially perform model selection on a small subset of the original data, with the added benefit of much faster convergence. If this were not controversial enough, the authors even take it one step further: they found some experiments where training on small datasets led to more robust model selection (less variance), which certainly seems counterintuitive given that we would expect relatively more noise for smaller datasets.
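The hypothesis is also easy to check on your own problem: record a validation score per candidate model on a small subset and on the full training set, and compare the two orderings. Below is a minimal sketch (our own, with made-up scores for ten hypothetical candidates) using Spearman's rank correlation.

```python
from scipy.stats import spearmanr

# Hypothetical validation accuracies for 10 candidate models, obtained after
# training on a 10% subset and on the full training set, respectively.
scores_small_subset = [0.91, 0.88, 0.93, 0.94, 0.90, 0.96, 0.89, 0.92, 0.87, 0.95]
scores_full_dataset = [0.95, 0.93, 0.96, 0.97, 0.94, 0.99, 0.92, 0.96, 0.91, 0.98]

rho, pval = spearmanr(scores_small_subset, scores_full_dataset)
print(f"Spearman rank correlation: {rho:.2f} (p={pval:.3f})")
# rho close to 1 means the 10% subset ranks the candidates (almost) the same
# way as the full dataset, which is exactly what the hypothesis claims.
```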
(b) Temperature calibration
One strange phenomenon when training neural network classifiers is that cross-entropy error tends to increase while classification error decreases. This seems counterintuitive but is simply due to models becoming overconfident in their predictions (Guo et al. (2017)). We can use something called temperature scaling, which calibrates the cross-entropy estimates on a small held-out dataset. This yields more generalizable and well-behaved results compared to classical cross-entropy, which is especially relevant for overparameterized neural networks. As a rough analogy, you can think of this as producing fewer “false negatives” regarding the number of overfitting cases.

While Bornschein (2020) does not provide explicit details on the exact softmax temperature calibration procedure used in the paper, we use the following procedure for our experiments:
- We define a held-out calibration dataset, C, equivalent to 10% of the training data.
- We initialize the temperature scalar to 1.5 (as in Guo et al. (2017)).

For each epoch:

1) Calculate the cross-entropy loss on our calibration set C.
2) Optimize the temperature scalar using gradient descent on the calibration set (see this GitHub repo by Guo et al. (2017)).
3) Use the updated temperature scalar to calibrate the regular cross-entropy during gradient descent.

- After training for 50 epochs, we calculate the calibrated test error, which should no longer show signs of overconfidence.
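Since the paper leaves the exact implementation open, the snippet below is our interpretation of the temperature update (step 2): a single scalar, initialized to 1.5, fitted on the logits of the held-out calibration set C. We use L-BFGS as in the Guo et al. (2017) repository; the function and variable names are our own, and in our runs something like this is called after each epoch before the temperature is applied to the training loss (step 3).

```python
import torch
import torch.nn as nn

def fit_temperature(model, calib_loader, device="cpu", init_temp=1.5):
    """Fit a single temperature scalar on the held-out calibration set C."""
    model.eval()
    temperature = nn.Parameter(torch.full((1,), init_temp, device=device))
    nll = nn.CrossEntropyLoss()

    # Collect the calibration logits once; the model itself stays frozen here.
    logits, labels = [], []
    with torch.no_grad():
        for x, y in calib_loader:
            logits.append(model(x.to(device)))
            labels.append(y.to(device))
    logits, labels = torch.cat(logits), torch.cat(labels)

    optimizer = torch.optim.LBFGS([temperature], lr=0.01, max_iter=50)

    def closure():
        optimizer.zero_grad()
        loss = nll(logits / temperature, labels)  # temperature-scaled cross-entropy
        loss.backward()
        return loss

    optimizer.step(closure)
    return temperature.detach()

# Usage sketch: T = fit_temperature(model, calib_loader)
#               calibrated_probs = torch.softmax(model(x) / T, dim=1)
```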
Let us now turn to the experimental setting.
4. Experiments
We will conduct two experiments in this post. One for validating the relative ranking-hypothesis on the MNIST dataset, and one for evaluating how our conclusions change if we synthetically make MNIST imbalanced. This latter experiment is not included in the Bornschein (2020) paper, and could potentially invalidate the relative ranking-hypothesis for imbalanced datasets.
MNIST
We start by replicating the Bornschein (2020) study on MNIST, before moving on with the imbalanced dataset experiment. This is not meant to disprove any of the claims in the paper, but simply to ensure we have replicated their experimental setup as closely as possible (with some modifications).
- A 90%/10% split for the training and calibration sets, respectively
- Random sampling (as balanced subset sampling did not provide any added benefit according to the paper)
- 50 epochs
- Adam with a fixed learning rate [10e-4]
- Batch size = 256
- Fully connected MLPs with 3 hidden layers of 2048 units each
- No dropout (it made our results too unstable to include)
- A simple convolutional network with 4 layers, 5x5 spatial kernels, stride 1 and 256 channels
- Logistic regression
- 10 different seeds to visualize uncertainty bands (30 in the original paper)
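For reference, here is a sketch of how these models can be written down in PyTorch. The MLP and logistic regression follow the list above directly; the convolutional classifier head and the MNIST-adapted ResNet-18 that appears in the results below are our own assumptions, since those details are not spelled out.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Fully connected MLP: 3 hidden layers with 2048 units each, ReLU, no dropout.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 2048), nn.ReLU(),
    nn.Linear(2048, 10),
)

# Logistic regression is just a single linear layer on the flattened image.
logreg = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))

# Simple convolutional network: 4 conv layers, 5x5 kernels, stride 1, 256 channels.
# The global-average-pooling classifier head is an assumption on our part.
convnet = nn.Sequential(
    nn.Conv2d(1, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 10),
)

# ResNet-18 (used in the results below), adapted to 1-channel 28x28 inputs (our assumption).
resnet = resnet18(num_classes=10)
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1, bias=False)
resnet.maxpool = nn.Identity()

# All models are trained with Adam, the fixed learning rate from the list, and batch size 256.
optimizer = torch.optim.Adam(mlp.parameters(), lr=10e-4)
```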
The authors also mention experimenting with replacing ReLU with tanh, batch-norm, layer-norm, etc., but it is unclear if these tests were included in their final results. Thus, we only consider the experiment using the above settings.
Experiment 1: How does temperature scaling during gradient descent affect generalization?
As an initial experiment, we want to validate why temperature scaling is needed. For this, we train an MLP with ReLU activations and 3 hidden layers of 2048 units each. We do not include dropout, and we train for 50 epochs.
Our hypothesis is: the test cross-entropy should gradually increase over time while test accuracy keeps improving (this is the motivation for temperature scaling in the first place, i.e., model overconfidence).
Here are the results of this initial experiment:
Clearly, the test cross-entropy does decline initially and then gradually increases over time, while test accuracy keeps improving. This is evidence in favor of our hypothesis. Figure 3 in Guo et al. (2017) demonstrates the exact same effect on CIFAR-100.

Note: We have smoothed the results a bit (5-window rolling mean) to make the effect more visible.
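(The smoothing itself is just a rolling mean over the per-epoch metric curves; in the snippet below, test_ce_per_epoch and test_acc_per_epoch are hypothetical lists holding those values.)

```python
import pandas as pd

# test_ce_per_epoch and test_acc_per_epoch are hypothetical per-epoch metric lists.
curves = pd.DataFrame({"test_ce": test_ce_per_epoch, "test_acc": test_acc_per_epoch})
smoothed = curves.rolling(window=5, min_periods=1).mean()  # 5-epoch rolling mean
```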
Conclusions from Experiment 1:
- If we keep training large neural networks for sufficiently long, we start to see overconfident probabilistic predictions, making them less useful out-of-sample.
- To remedy this effect, we can incorporate temperature scaling, which
a) ensures probabilistic forecasts are more stable and reliable out-of-sample, and
b) improves generalization by scaling the training cross-entropy during gradient descent.
Balanced Dataset
Having shown that temperature scaling is needed, we now turn to the primary experiment — i.e., how test cross-entropy varies as a function of the size of our training dataset. Our results look as follows:

[Figure: Test cross-entropy as a function of the size of the training set for MNIST]

Note that we do not obtain the exact same “smooth” results as Bornschein (2020). This is most likely due to the fact that we have not replicated their experiment completely; for example, they include many more seeds. Nevertheless, we can draw the following conclusions:
- Interestingly, the relatively large ResNet-18 model does not overfit more than logistic regression at any point during training!
- The relative ranking-hypothesis is confirmed.
- Beyond 25,000 observations (roughly half of the MNIST training set), the significantly larger ResNet model is only marginally better than the relatively faster MLP model.
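For completeness, the sweep behind the figure boils down to a loop like the following sketch. It is simplified to the MLP only, uses plain i.i.d. subset sampling, and reports the uncalibrated test cross-entropy (temperature scaling and the multiple seeds are omitted for brevity), so treat it as an outline rather than the exact code behind our plots.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
tfm = transforms.ToTensor()
train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
test_set = datasets.MNIST("data", train=False, download=True, transform=tfm)
test_loader = DataLoader(test_set, batch_size=256)

def make_mlp():
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 2048), nn.ReLU(),
        nn.Linear(2048, 2048), nn.ReLU(),
        nn.Linear(2048, 2048), nn.ReLU(),
        nn.Linear(2048, 10),
    ).to(device)

rng = np.random.default_rng(0)
loss_fn = nn.CrossEntropyLoss()
for n in [500, 2500, 12500, 25000, 60000]:                    # training-set sizes to sweep
    idx = rng.choice(len(train_set), size=n, replace=False)   # simple i.i.d. subset sampling
    loader = DataLoader(Subset(train_set, idx.tolist()), batch_size=256, shuffle=True)
    model = make_mlp()
    opt = torch.optim.Adam(model.parameters(), lr=10e-4)
    model.train()
    for epoch in range(50):                                   # 50 epochs, as in the setup above
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            opt.step()
    model.eval()
    with torch.no_grad():                                     # uncalibrated test cross-entropy
        ce = sum(loss_fn(model(x.to(device)), y.to(device)).item() * len(y)
                 for x, y in test_loader) / len(test_set)
    print(f"n={n:6d}  test cross-entropy={ce:.4f}")
```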
Imbalanced Dataset
We will now conduct an experiment for the case of imbalanced datasets, which is not included in the actual paper, as it could be a setting in which the tested hypothesis breaks down.
We sample an artificially imbalanced version of MNIST, similar to Guo et al. (2019). The procedure is as follows: for each class in our dataset, we subsample between 0 and 100 percent of the original training and test datasets. We use the following GitHub repo for this sampling procedure.
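Concretely, the per-class subsampling can be done along the lines of the sketch below (our own version; the repo we actually used may differ in details such as how the keep-fractions are drawn).

```python
import numpy as np
from torch.utils.data import Subset
from torchvision import datasets, transforms

rng = np.random.default_rng(42)
keep_frac = {c: rng.uniform(0.0, 1.0) for c in range(10)}    # one keep-fraction per class

def subsample_per_class(dataset, keep_frac, rng):
    """Keep a class-dependent fraction of the examples, making the dataset imbalanced."""
    targets = np.asarray(dataset.targets)
    keep = []
    for c, frac in keep_frac.items():
        cls_idx = np.where(targets == c)[0]
        n_keep = max(1, int(frac * len(cls_idx)))            # keep at least one example per class
        keep.extend(rng.choice(cls_idx, size=n_keep, replace=False).tolist())
    return Subset(dataset, sorted(keep))

tfm = transforms.ToTensor()
mnist_train = datasets.MNIST("data", train=True, download=True, transform=tfm)
mnist_test = datasets.MNIST("data", train=False, download=True, transform=tfm)
imb_train = subsample_per_class(mnist_train, keep_frac, rng)  # same fractions for train and test
imb_test = subsample_per_class(mnist_test, keep_frac, rng)
print(f"train: {len(mnist_train)} -> {len(imb_train)}, test: {len(mnist_test)} -> {len(imb_test)}")
```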
Then, we select our calibration dataset as in the previous experiment, i.e., a random 90%/10% split between training and calibration.

We include a visualization of the class distribution for the original MNIST training dataset
[Figure: Frequency count for each class in MNIST]

and the imbalanced version

[Figure: Frequency count for each class in imbalanced MNIST]

Given this large difference in the frequency distribution, you can clearly see how this version is much more imbalanced compared to the original MNIST.
While a plethora of methods for overcoming the problem of imbalanced datasets exist (see the following review paper), we want to investigate and isolate the effect of an imbalanced dataset on the relative ranking hypothesis, i.e., does the relative ranking-hypothesis still hold in the imbalanced data setting?
We run all our models again using this synthetically imbalanced MNIST dataset, and obtain the following results:
[Figure: Test cross-entropy as a function of the size of the training set for imbalanced MNIST]

So has the conclusion changed?

Not really!
This is quite an optimistic result, as we are now more confident that the relative ranking-hypothesis mostly holds in the case of imbalanced datasets as well. We believe this could also be the reason behind the following quote from the Bornschein (2020) paper regarding the sampling strategy:
“We experimented with balanced subset sampling, i.e. ensuring that all subsets always contain an equal number of examples per class. But we did not observe any reliable improvements from doing so and therefore reverted to a simple i.i.d sampling strategy.”
The primary difference between the balanced and imbalanced versions is the more “jumpy” results, which makes sense given that some classes in the test set may have been seen rarely, or not at all, during training for the chosen models.
5. Summary
To sum up our findings:

Due to the relative ranking-hypothesis, we can perform model selection using only a subset of our training data, for both balanced and imbalanced datasets, thus saving computational resources

Large overparameterized neural networks can generalize surprisingly well, even on small datasets (double descent)

We can avoid overconfidence by applying temperature scaling

I hope you will be able to apply these findings in your next machine learning experiments, and remember: larger is (almost) always better.

Thank you for reading!
6. References
[1] J. Bornschein, F. Visin, and S. Osindero, Small Data, Big Decisions: Model Selection in the Small-Data Regime (2020), in International Conference on Machine Learning (ICML).

[2] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, Deep Double Descent: Where Bigger Models and More Data Hurt (2019), arXiv preprint arXiv:1912.02292.

[3] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, On Calibration of Modern Neural Networks (2017), arXiv preprint arXiv:1706.04599.

[4] T. Guo, X. Zhu, Y. Wang, and F. Chen, Discriminative Sample Generation for Deep Imbalanced Learning (2019), in International Joint Conference on Artificial Intelligence (IJCAI), pp. 2406–2412.
Originally published at https://holmdk.github.io on August 14, 2020.