Paper: Translation and Interpretation of "Adam: A Method for Stochastic Optimization"
Contents
Adam: A Method for Stochastic Optimization
ABSTRACT
1. INTRODUCTION
3. CONCLUSION
Adam: A Method for Stochastic Optimization
Paper source: Adam: A Method for Stochastic Optimization
ABSTRACT
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
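As an interpretive note: the abstract mentions AdaMax, a variant of Adam that scales updates by an exponentially weighted infinity norm of past gradients instead of the squared-gradient (second-moment) estimate. Below is a minimal sketch of one AdaMax step, assuming the formulation and typical defaults described in the paper; `adamax_step` and its arguments are illustrative names, not code from the paper.

```python
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """One AdaMax update. m and u start at zero; t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # exponentially weighted infinity norm
    # The tiny constant is only a guard against an all-zero gradient history;
    # the paper's update divides by u directly.
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-12)
    return theta, m, u
```

Because the scaling term is a running maximum of gradient magnitudes rather than an average of squared gradients, no epsilon term is needed in the paper's formulation.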
1. INTRODUCTION
Stochastic gradient-based optimization is of core practical importance in many fields of science and engineering. Many problems in these fields can be cast as the optimization of some scalar parameterized objective function requiring maximization or minimization with respect to its parameters. If the function is differentiable w.r.t. its parameters, gradient descent is a relatively efficient optimization method, since the computation of first-order partial derivatives w.r.t. all the parameters is of the same computational complexity as just evaluating the function. Often, objective functions are stochastic. For example, many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions, i.e. stochastic gradient descent (SGD) or ascent. SGD proved itself as an efficient and effective optimization method that was central in many machine learning success stories, such as recent advances in deep learning (Deng et al., 2013; Krizhevsky et al., 2012; Hinton & Salakhutdinov, 2006; Hinton et al., 2012a; Graves et al., 2013). Objectives may also have other sources of noise than data subsampling, such as dropout (Hinton et al., 2012b) regularization. For all such noisy objectives, efficient stochastic optimization techniques are required. The focus of this paper is on the optimization of stochastic objectives with high-dimensional parameters spaces. In these cases, higher-order optimization methods are ill-suited, and discussion in this paper will be restricted to first-order methods.
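The "gradient steps w.r.t. individual subfunctions" idea is simply minibatch SGD. Here is a minimal sketch, assuming the objective's gradient can be evaluated on a data subsample; `grad_minibatch` is a hypothetical helper supplied by the user, not something defined in the paper.

```python
import numpy as np

def sgd(theta, grad_minibatch, data, lr=0.01, epochs=10, batch_size=32, seed=0):
    """Plain minibatch SGD: one descent step per data subsample.

    grad_minibatch(theta, batch) is assumed to return the gradient of the
    objective evaluated only on `batch` (a subsample of `data`, a NumPy array).
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        order = rng.permutation(n)                        # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = data[order[start:start + batch_size]]
            theta = theta - lr * grad_minibatch(theta, batch)
    return theta
```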
| We propose Adam, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement. The method computes individual adaptive learning rates for different parameters from estimates of first and second moments of the gradients; the name Adam is derived from adaptive moment estimation. Our method is designed to combine the advantages of two recently popular methods: AdaGrad (Duchi et al., 2011), which works well with sparse gradients, and RMSProp (Tieleman & Hinton, 2012), which works well in on-line and non-stationary settings; important connections to these and other stochastic optimization methods are clarified in section 5. Some of Adam’s advantages are that the magnitudes of parameter updates are invariant to rescaling of the gradient, its stepsizes are approximately bounded by the stepsize hyperparameter, it does not require a stationary objective, it works with sparse gradients, and it naturally performs a form of step size annealing. | 我們提出了一種只需要一階梯度且對內存要求很小的高效隨機優化方法Adam。該方法通過估計梯度的一階矩和二階矩計算不同參數的個體自適應學習率;Adam這個名字來源于自適應矩估計。我們的方法結合了兩種最近流行的方法的優點:AdaGrad (Duchi et al., 2011)和RMSProp (Tieleman & Hinton, 2012),前者在稀疏梯度下工作良好,后者在在線和非平穩環境下工作良好;與這些和其他隨機優化方法的重要聯系在第5節中闡明。Adam的一些優勢的大小參數更新不變的尺度改變梯度,其stepsizes大約有界的stepsize hyperparameter,它不需要一個固定的目標,它適用于稀疏的梯度,它自然地執行步長退火的一種形式。 |
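To make "adaptive moment estimation" concrete, here is a minimal sketch of a single Adam update, assuming the commonly cited defaults (alpha = 0.001, beta1 = 0.9, beta2 = 0.999, eps = 1e-8); it is a reference sketch of the update rule, not the paper's own pseudocode.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v start at zero; t is the 1-based timestep."""
    m = beta1 * m + (1 - beta1) * grad            # estimate of the first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2       # estimate of the second raw moment
    m_hat = m / (1 - beta1 ** t)                  # correct the bias from zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Dividing the bias-corrected first moment by the square root of the bias-corrected second moment is what makes the update invariant to a constant rescaling of the gradients, and it keeps the magnitude of each parameter step roughly bounded by alpha, matching the properties listed above.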
3. CONCLUSION
We have introduced a simple and computationally efficient algorithm for gradient-based optimization of stochastic objective functions. Our method is aimed towards machine learning problems with large datasets and/or high-dimensional parameter spaces. The method combines the advantages of two recently popular optimization methods: the ability of AdaGrad to deal with sparse gradients, and the ability of RMSProp to deal with non-stationary objectives. The method is straightforward to implement and requires little memory. The experiments confirm the analysis on the rate of convergence in convex problems. Overall, we found Adam to be robust and well-suited to a wide range of non-convex optimization problems in the field of machine learning.