On the KL Distance (KL Divergence)
Link: https://www.zhihu.com/question/29980971/answer/103807952
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
KL divergence was originally introduced from information theory; since the question is about its use in ML, I will not go into much detail. In brief: given a true distribution P and an approximating distribution Q, the KL divergence is the number of extra bits needed, per sample drawn from P, when we store the samples with an optimal compression scheme built for Q rather than the optimal scheme built for P itself. This interpretation follows from the Kraft–McMillan theorem.
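In symbols (restating the sentence above, with lengths measured in bits): an optimal code built for Q assigns x a codeword of length about -log q(x), while an optimal code built for P uses -log p(x), so the expected extra length per sample from P is

KL(P||Q) = \sum_x p(x) [ -log q(x) - ( -log p(x) ) ] = \sum_x p(x) log ( p(x) / q(x) ).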
So it is natural to use it as a statistical distance, since it carries an intrinsic probabilistic meaning. However, precisely because of this meaning, the asymmetry raised in the question is unavoidable: D(P||Q) and D(Q||P) answer the "distance" question under two different compression schemes.
As for statistical distances in general, they are not fundamentally different. More broadly, KL divergence can be viewed as a special case of the phi-divergence (taking phi(t) = t log t). Note that the definition below is stated for discrete probability distributions; replacing the sum with an integral gives the continuous version in the obvious way.
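For reference, the definition meant here is presumably the standard discrete phi-divergence,

D_phi(P||Q) = \sum_x q(x) phi( p(x) / q(x) ),   with phi convex and phi(1) = 0;

choosing phi(t) = t log t gives \sum_x p(x) log ( p(x) / q(x) ) = KL(P||Q).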
因?yàn)樗鼈兌加邢嗨频母怕室饬x,比如說pinsker's theorem保證了KL-divergence是total variation metric的一個(gè)tight bound. 其它divergence metric應(yīng)該也有類似的bound,最多就是order和常數(shù)會差一些。而且,用這些divergence定義的minimization問題也都會是convex的,但是具體的computation performance可能會有差別,所以KL還是用的多。
Reference: Bayraksan G, Love DK. Data-Driven Stochastic Programming Using Phi-Divergences.
Author: Zhihu user
Link: https://www.zhihu.com/question/29980971/answer/93489660
Source: Zhihu
Copyright belongs to the author. For commercial reprints, please contact the author for authorization; for non-commercial reprints, please credit the source.
KL divergence KL(p||q), in the context of information theory, measures the number of extra bits (or nats, if the natural logarithm is used) needed to describe samples from the distribution p with a code based on q instead of on p itself. From the Kraft–McMillan theorem, we know that a coding scheme for values drawn from a set X can be represented as a distribution q(x_i) = 2^(-l_i) over X, where l_i is the length of the code for x_i in bits.
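A minimal numeric sketch of this "extra bits" reading, using NumPy on a small made-up discrete example (the distributions p and q below are purely illustrative):

import numpy as np

# True distribution p and approximating distribution q over four symbols (illustrative values).
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Ideal code lengths in bits under each distribution: l_i = -log2(prob), per Kraft-McMillan.
length_p = -np.log2(p)
length_q = -np.log2(q)

# Expected code length per symbol when the symbols are actually drawn from p.
bits_with_p_code = np.sum(p * length_p)   # entropy H(p) = 1.75 bits
bits_with_q_code = np.sum(p * length_q)   # cross-entropy H(p, q) = 2.0 bits

# KL(p||q) is exactly the expected number of extra bits paid for coding with q.
kl_bits = np.sum(p * np.log2(p / q))
print(bits_with_q_code - bits_with_p_code)  # 0.25
print(kl_bits)                              # 0.25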
We know that KL divergence is also the relative entropy between two distributions, which gives some intuition as to why it's used in variational methods. Variational methods use functionals in their objective functions (e.g. the entropy of a distribution takes in a distribution and returns a scalar quantity). KL divergence is interpreted as the "loss of information" incurred when one distribution is used to approximate another, which is desirable in machine learning because, in models that perform dimensionality reduction, we would like to preserve as much information about the original input as possible. This is most obvious in VAEs, which use the KL divergence between the approximate posterior q and the prior p over the latent variable z. Likewise, you can refer to EM, where we decompose
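For instance, in the common VAE setup where q(z|x) = N(mu, diag(sigma^2)) and the prior is p(z) = N(0, I), this KL term has the closed form

KL(q||p) = (1/2) \sum_j ( mu_j^2 + sigma_j^2 - 1 - ln sigma_j^2 ),

which is the regularizer added to the reconstruction term in the VAE objective.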
ln p(X) = L(q) + KL(q||p)
Here we maximize the lower bound L(q) on ln p(X) by minimizing the KL divergence, which becomes 0 when q(Z) = p(Z|X). However, in many cases we wish to restrict the family of distributions and parameterize q(Z) with a set of parameters w, so that we can optimize with respect to w.
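For completeness, the two terms in that decomposition (in the standard variational/EM form, e.g. Bishop) are

L(q) = \sum_Z q(Z) ln ( p(X, Z) / q(Z) )
KL(q||p) = - \sum_Z q(Z) ln ( p(Z|X) / q(Z) )

Since KL(q||p) >= 0, L(q) is a lower bound on ln p(X), with equality exactly when q(Z) = p(Z|X).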
Note that KL(p||q) = - \sum p(Z) ln ( q(Z) / p(Z) ), so KL(p||q) is different from KL(q||p). This asymmetry, however, can be exploited: when we wish to learn a q that over-compensates for p, spreading out to cover all of p's support, we can minimize KL(p||q); conversely, when we want q to capture just the main component (mode) of p, we can minimize KL(q||p). The example in Bishop's book illustrates this well.

KL divergence belongs to the alpha family of divergences, in which the forward and reverse KL are recovered as two different limits of the parameter alpha. When alpha = 0 the divergence becomes symmetric, and is linearly related to the Hellinger distance. There are other symmetric divergences, such as the Cauchy–Schwarz divergence, but in machine learning settings, where the goal is to learn simpler, tractable parameterizations of distributions that approximate a target, they may not be as useful as KL.
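The mode-covering versus mode-seeking behaviour described above can be seen numerically. The sketch below (all distributions and parameters are invented for illustration) fits a single Gaussian q, by scanning its mean over a grid, to a bimodal target p, once by minimizing the forward KL(p||q) and once the reverse KL(q||p):

import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def normal(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Bimodal target p: an equal mixture of two well-separated Gaussians.
p = 0.5 * normal(x, -4.0, 1.0) + 0.5 * normal(x, 4.0, 1.0)
p /= np.sum(p) * dx  # renormalize on the grid

def kl(a, b):
    # Discretized KL(a||b) = sum a * log(a/b) * dx, with a small epsilon for stability.
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Candidate approximations q: single Gaussians of fixed width, varying the mean.
means = np.linspace(-6.0, 6.0, 121)
forward = []   # KL(p||q): mass-covering / zero-avoiding
reverse = []   # KL(q||p): mode-seeking / zero-forcing
for mu in means:
    q = normal(x, mu, 1.5)
    q /= np.sum(q) * dx
    forward.append(kl(p, q))
    reverse.append(kl(q, p))

print("forward KL minimized at mu =", means[np.argmin(forward)])  # near 0: q spreads over both modes
print("reverse KL minimized at mu =", means[np.argmin(reverse)])  # near -4 or +4: q locks onto one mode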