Use a Negative Binomial for Count Data
The Negative Binomial distribution is a discrete probability distribution that you should have in your toolkit for count data. For example, you might have data on the number of pages someone visited before making a purchase or the number of complaints or escalations associated with each customer service representative. Given this data, you might want to model the process and, later, see if some covariates affect the parameters. And in many contexts, you might find that a negative binomial distribution is a good fit.
In this article we'll introduce the distribution and compute its probability mass function (PMF). We'll derive its basic properties (mean and variance) using the binomial theorem. This is in contrast to the usual treatments, which either just give you a formula or use more complicated tools to derive the results. Finally, we'll turn to the distribution's interpretations.
The Negative Binomial Distribution
Suppose you are going to flip a biased coin that has probability p of coming up heads, which we will call a "success." Furthermore, you are going to keep flipping the coin until r successes occur. Let k be the number of failures along the way (so k+r coin flips happen in total).
In the context of our examples, we could imagine:
- A user might browse your website. On each page they have a probability of p=1% of seeing an item they want to buy. We imagine that when they have put r=3 items in their basket, they are ready to check out. k is the number of pages they browse without buying. Of course we will want to fit the model to find the true values of r and p, as well as whether and how they vary between users.
- A customer service representative might in general receive complaints. After each complaint, there is a probability p that they will be reprimanded. After being reprimanded r times, they stop getting complaints because they have changed their behavior. k is the number of complaints for which they are not reprimanded before they change their behavior.
Whether you actually think this is true is, as always, up to your prior beliefs and how well the model fits the data. Also, note that the number of failures is closely related to the number of events (k versus k plus r).
It is relatively straightforward to write down the probability mass function using some combinatorics. The probability that the r-th success happens on the (k+r)-th coin flip is:
- The probability that there are r-1 successes on the first k+r-1 flips, times
- The probability of success on the (k+r)-th flip.
There are (k+r-1) choose k orderings of (r-1) successes and k failures on the first k+r-1 flips (the number of ways to arrange k A's and (r-1) B's in a line). Each ordering has the same probability of occurring, p^(r-1) (1-p)^k. Multiplying by the probability p of success on the final flip gives the PMF:

$$
P(K = k) = \binom{k+r-1}{k}\, p^{r} (1-p)^{k}, \qquad k = 0, 1, 2, \ldots
$$
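As a quick sanity check of the counting argument, here is a minimal sketch that evaluates the PMF directly from the binomial coefficient. The function name `nb_pmf` is our own choice for illustration, and the values r=3, p=0.01 are taken from the website example above; the cutoff of 5000 terms is an arbitrary truncation of the infinite sum.

```python
from math import comb


def nb_pmf(k: int, r: int, p: float) -> float:
    """Probability of exactly k failures before the r-th success (integer r)."""
    return comb(k + r - 1, k) * p**r * (1 - p) ** k


r, p = 3, 0.01

# Probability that the first three pages all lead to a purchase (k = 0).
print(nb_pmf(0, r, p))

# Truncated sum over k; the neglected tail is tiny for this cutoff.
total = sum(nb_pmf(k, r, p) for k in range(5000))
print(total)
```

The truncated sum should come out very close to 1, which is what we expect from a probability distribution.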
Hopefully you remember some basic facts about combinations and permutations. If not, here is a brief review that you can use to convince yourself. Suppose there are 3 A's and 2 B's and you want to arrange them into a string like "AAABB" or "ABABA". The number of ways to do this is 5 choose 2 (there are 5 total things and 2 B's), which is the same as 5 choose 3 (there are 3 A's). To see this, pretend that each letter is actually a distinct symbol (so the 5 symbols are A1, A2, A3, B1, B2). Then there are 5!=120 ways to arrange the distinct symbols. But there are 3!=6 ways to rearrange A1, A2, A3 without changing the placement of the A's, and 2!=2 ways to rearrange the B's. So the total number is 5!/(2!3!) = 10.
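The counting argument can also be checked by brute force. The sketch below (variable names are ours) enumerates all 5! orderings of the letters and keeps only the distinct strings.

```python
from itertools import permutations
from math import comb, factorial

# All 5! = 120 orderings, collapsed to the distinct arrangements of AAABB.
distinct = set(permutations("AAABB"))

print(len(distinct))                                    # brute-force count
print(comb(5, 2))                                       # 5 choose 2
print(factorial(5) // (factorial(2) * factorial(3)))    # 5!/(2!3!)
```

All three printed numbers agree.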
Now, the trick is that binomial coefficients also make sense with a negative number on top, or with a non-integer. For example, if we expand the coefficient in the PMF, we can pull a minus sign out of each of the k factors in the numerator:

$$
\binom{k+r-1}{k} = \frac{(k+r-1)(k+r-2)\cdots(r)}{k!} = (-1)^k \frac{(-r)(-r-1)\cdots(-r-k+1)}{k!} = (-1)^k \binom{-r}{k}
$$

so that the distribution is written with an actual negative binomial coefficient:

$$
P(K = k) = (-1)^k \binom{-r}{k}\, p^{r} (1-p)^{k}
$$

Hence the name "negative binomial."
The other trick to keep in mind is that we can define binomial coefficients with non-integer numbers. The Γ function (Gamma function) extends the factorial: for positive integers n,

$$
\Gamma(n) = (n-1)!
$$

So we can write our binomial coefficients in a form that remains valid when n is not an integer:

$$
\binom{n}{k} = \frac{\Gamma(n+1)}{\Gamma(k+1)\,\Gamma(n-k+1)}, \qquad \binom{k+r-1}{k} = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}
$$

This means that, in the negative binomial distribution, the parameter r does not have to be an integer. That is useful because when we estimate our models, we generally don't have a way to constrain r to be an integer, so a non-integer value for r won't be a problem. (We will require r to be positive, however.) We'll come back to how to interpret a non-integer value of r.
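To make the non-integer case concrete, here is a small sketch that evaluates the PMF through the log-Gamma function, so that r can be any positive real number. The function name `nb_pmf_real` and the test values are our own assumptions for illustration; working on the log scale is simply a common way to avoid overflow in the Gamma terms.

```python
from math import comb, exp, lgamma, log


def nb_pmf_real(k: int, r: float, p: float) -> float:
    """PMF with the coefficient Gamma(k + r) / (k! * Gamma(r)); r > 0 may be non-integer."""
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))


# For integer r this matches the combinatorial formula.
print(nb_pmf_real(4, 3, 0.4), comb(4 + 3 - 1, 4) * 0.4**3 * 0.6**4)

# For non-integer r the probabilities still sum to (approximately) one.
total_real = sum(nb_pmf_real(k, 2.5, 0.4) for k in range(500))
print(total_real)
```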
Properties of the Negative Binomial Distribution
We would like to compute the expectation and variance. As a warmup, let’s check that the negative binomial distribution is in fact a probability distribution. For convenience, let q=1–p.
$$
\begin{aligned}
\sum_{k=0}^{\infty} P(K = k) &= \sum_{k=0}^{\infty} \binom{k+r-1}{k}\, p^{r} q^{k} \\
&= p^{r} \sum_{k=0}^{\infty} \binom{-r}{k} (-q)^{k} \\
&= p^{r} (1-q)^{-r} \\
&= p^{r} p^{-r} = 1
\end{aligned}
$$

The crucial point is the third line, where we used the binomial theorem (yes, it works with negative exponents).
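Now let's compute the expectation:

$$
\begin{aligned}
E[K] &= \sum_{k=0}^{\infty} k \binom{-r}{k}\, p^{r} (-q)^{k} \\
&= p^{r} \sum_{k=1}^{\infty} k \binom{-r}{k} (-q)^{k} \\
&= p^{r} \sum_{k=1}^{\infty} (-r) \binom{-r-1}{k-1} (-q)^{k} \\
&= r q\, p^{r} \sum_{j=0}^{\infty} \binom{-r-1}{j} (-q)^{j} \\
&= r q\, p^{r} (1-q)^{-r-1} \\
&= r q\, p^{r} p^{-r-1} \\
&= \frac{r q}{p} = \frac{r(1-p)}{p}
\end{aligned}
$$

To get the third line, we used the identity

$$
k \binom{n}{k} = n \binom{n-1}{k-1}
$$

with n = -r, and we used the binomial theorem again to get the third-to-last line.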
Warning: this is the opposite of what you will find on Wikipedia as of this writing. It is what you will find from Wolfram (the makers of Mathematica). This is because Wikipedia counts the number of successes before r failures, whereas we count failures before r successes. In general, there are several similar ways to parameterize and interpret the distribution, so make sure you have everything straight when comparing formulas from different places.
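Next, we can compute the variance in two steps. First, we repeat the trick from above for E[K(K-1)], using the identity twice this time to get the third line. We again use the binomial theorem to compute the sum and obtain the third-to-last line.

$$
\begin{aligned}
E[K(K-1)] &= \sum_{k=0}^{\infty} k(k-1) \binom{-r}{k}\, p^{r} (-q)^{k} \\
&= p^{r} \sum_{k=2}^{\infty} k(k-1) \binom{-r}{k} (-q)^{k} \\
&= p^{r} \sum_{k=2}^{\infty} (-r)(-r-1) \binom{-r-2}{k-2} (-q)^{k} \\
&= r(r+1)\, q^{2} p^{r} \sum_{j=0}^{\infty} \binom{-r-2}{j} (-q)^{j} \\
&= r(r+1)\, q^{2} p^{r} (1-q)^{-r-2} \\
&= r(r+1)\, q^{2} p^{r} p^{-r-2} \\
&= \frac{r(r+1)\, q^{2}}{p^{2}}
\end{aligned}
$$

Now we can compute:

$$
\begin{aligned}
\operatorname{Var}(K) &= E[K(K-1)] + E[K] - E[K]^{2} \\
&= \frac{r(r+1)\, q^{2}}{p^{2}} + \frac{r q}{p} - \frac{r^{2} q^{2}}{p^{2}} \\
&= \frac{r q^{2} + r q p}{p^{2}} \\
&= \frac{r q}{p^{2}} = \frac{r(1-p)}{p^{2}}
\end{aligned}
$$

Again, this is the opposite of what is on Wikipedia.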
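The normalization, mean, and variance formulas can be checked numerically by summing the PMF over a long but finite range of k. This is only a sketch: the helper name `nb_pmf`, the parameter values r=2.5 and p=0.3, and the cutoff of 2000 terms are our own choices, and the sums are truncations of the infinite series.

```python
from math import exp, lgamma, log


def nb_pmf(k: int, r: float, p: float) -> float:
    """P(K = k) for k failures before r successes, using log-Gamma for the coefficient."""
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))


r, p = 2.5, 0.3
q = 1 - p

probs = [nb_pmf(k, r, p) for k in range(2000)]

total = sum(probs)
mean = sum(k * pk for k, pk in enumerate(probs))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))

print(total)               # should be close to 1
print(mean, r * q / p)     # numerical mean vs. r(1-p)/p
print(var, r * q / p**2)   # numerical variance vs. r(1-p)/p^2
```

If you see the roles of p and 1-p swapped in another reference, that is the parameterization difference described in the warning above.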
Interpretation of the Negative Binomial Distribution
We have covered the "defining interpretation" of the Negative Binomial Distribution: it is the number of failures before r successes occur, with the probability of success at each step being p. But there are a few other ways to look at the distribution that can be illuminating and that also help interpret the case where r is not an integer.
Over-Dispersed Poisson Distribution
The Poisson distribution is a very simple model for count data, which assumes that events happen randomly at a certain rate. Then it models the distribution of how many events will occur in a given time interval. In the context of our examples, it would say that:
- Customer service representatives get complaints at a constant rate. The variation in counts is due purely to chance. (Compare the model where their behavior eventually changes.) Again, in modeling this, we could model a difference in rate between representatives based on exogenous covariates.
One big problem with the Poisson distribution is that the variance is equal to the mean. This may not fit our data. Let's say we parameterize our Negative Binomial distribution with a mean λ and stopping parameter r. Then we have

$$
\lambda = \frac{r(1-p)}{p}, \qquad p = \frac{r}{r+\lambda}, \qquad 1-p = \frac{\lambda}{r+\lambda}
$$

Our probability mass function becomes

$$
P(K = k) = (-1)^k \binom{-r}{k} \left(\frac{r}{r+\lambda}\right)^{r} \left(\frac{\lambda}{r+\lambda}\right)^{k}
$$

Now let's consider what happens if we take the limit as r → ∞ holding λ fixed. (This means that the probability of success goes to 1 as well, in the way defined by p = r/(λ+r).) In this limit, the binomial term approaches (-r)^k / k!, and in the last factor r + λ approaches r:

$$
\begin{aligned}
P(K = k) &\approx (-1)^k \frac{(-r)^{k}}{k!} \left(1 + \frac{\lambda}{r}\right)^{-r} \frac{\lambda^{k}}{r^{k}} \\
&\to \frac{\lambda^{k} e^{-\lambda}}{k!}
\end{aligned}
$$

In the last line, the k-th powers of r cancel and we have used the definition of the exponential. The result is that we recover the Poisson distribution.
Therefore, we can interpret the Negative Binomial Distribution as a generalization of the Poisson distribution. If the distribution is in fact Poisson, we will see a large r and p close to 1. This makes sense because the variance is r(1-p)/p² = λ/p, so as p approaches 1, the variance approaches the mean. When p is smaller than one, the variance is higher than that of a Poisson distribution with the same mean, so the Negative Binomial distribution generalizes the Poisson by increasing the variance.
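The limit can be watched numerically. The sketch below fixes the mean at λ=4 (an arbitrary choice of ours) and measures the largest pointwise gap between the mean-parameterized Negative Binomial PMF and the Poisson PMF as r grows. The function names are hypothetical helpers, not part of any library.

```python
from math import exp, lgamma, log


def nb_pmf_mean(k: int, lam: float, r: float) -> float:
    """Negative Binomial PMF parameterized by mean lam and stopping parameter r."""
    p = r / (r + lam)
    log_coeff = lgamma(k + r) - lgamma(k + 1) - lgamma(r)
    return exp(log_coeff + r * log(p) + k * log(1 - p))


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson PMF lam^k e^(-lam) / k!, computed on the log scale."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


lam = 4.0
gaps = []
for r in (1, 10, 100, 10_000):
    gap = max(abs(nb_pmf_mean(k, lam, r) - poisson_pmf(k, lam)) for k in range(60))
    gaps.append(gap)
    # Variance of the Negative Binomial is lam + lam^2 / r, which tends to lam.
    print(r, gap, lam + lam**2 / r)
```

The gap shrinks as r increases, and the variance column approaches the mean, in line with the over-dispersion interpretation.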
Mixture of Poisson Distributions
The Negative Binomial Distribution also arises as a mixture of Poisson random variables. For example, suppose that our customer service representatives each receive complaints at a given rate (they never change their behavior), but that rate varies between representatives. If that rate is randomly distributed according to a Gamma distribution, we get a Negative Binomial Distribution for the ensemble.
The intuition behind this is as follows. We initially said the Negative Binomial Distribution was the count of failures before r successes when we flip coins. Instead, replace the coin flips with two Poisson processes. Process one (the "success" process) has rate p, and process two (the "failure" process) has rate (1-p). So instead of thinking of the Negative Binomial Distribution as counting coin flips, we think of two independent processes generating "successes" and "failures," and we count how many failures occur before a certain number of successes.
Now, the Gamma distribution is the distribution of waiting times in a Poisson process. Let T be the waiting time for r successes from the "success" process. T is Gamma distributed. Conditional on T, the number of failures is Poisson distributed with mean (1-p)T. Averaging that Poisson count over the Gamma-distributed T gives the Negative Binomial Distribution, and because the shape parameter of a Gamma distribution need not be an integer, this picture also gives meaning to a non-integer r.
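The two-process picture can be simulated directly with the standard library. In this sketch (function name, seed, sample size, and the values r=3, p=0.3 are all our own assumptions) we draw the waiting time T from a Gamma distribution with shape r and scale 1/p, then count arrivals of the rate-(1-p) "failure" process in [0, T] by summing exponential gaps.

```python
import random


def failures_before_r_successes(r: float, p: float, rng: random.Random) -> int:
    """Count 'failure' events that arrive before the r-th 'success' event."""
    # Waiting time for r events of a rate-p Poisson process: Gamma(shape=r, scale=1/p).
    waiting_time = rng.gammavariate(r, 1.0 / p)
    # Arrivals of the independent rate-(1-p) process are separated by exponential gaps.
    count = 0
    t = rng.expovariate(1 - p)
    while t <= waiting_time:
        count += 1
        t += rng.expovariate(1 - p)
    return count


rng = random.Random(0)
r, p = 3, 0.3
n = 50_000
draws = [failures_before_r_successes(r, p, rng) for _ in range(n)]

sample_mean = sum(draws) / n
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (n - 1)
frac_zero = draws.count(0) / n

print(sample_mean, r * (1 - p) / p)      # compare with r(1-p)/p
print(sample_var, r * (1 - p) / p**2)    # compare with r(1-p)/p^2
print(frac_zero, p**r)                   # compare with P(K = 0) = p^r
```

Because `gammavariate` accepts a non-integer shape, the same function runs unchanged with, say, r=2.5.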
Conclusion
A few last points are worth making. First of all, there is no closed-form estimator for fitting the Negative Binomial Distribution to data. Instead, use maximum likelihood estimation with a numerical optimizer. You can use the statsmodels package to do this in Python.
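To show what the numerical estimation involves, here is a standard-library sketch of a maximum likelihood fit. It is not the statsmodels implementation; it is a simplified stand-in under stated assumptions. For a fixed r, setting the derivative of the log-likelihood with respect to p to zero gives p = r / (r + sample mean), so only r needs a numerical search; we use a golden-section search over log r, assuming the profile likelihood has a single peak in the search interval. The data are simulated coin flips with true values r=3 and p=0.3, and all names are hypothetical.

```python
import random
from collections import Counter
from math import exp, lgamma, log, sqrt


def nb_loglik(counts: Counter, r: float, p: float) -> float:
    """Log-likelihood of observed failure counts, grouped as {k: frequency}."""
    return sum(
        n * (lgamma(k + r) - lgamma(k + 1) - lgamma(r) + r * log(p) + k * log(1 - p))
        for k, n in counts.items()
    )


def fit_nb(data, lo=1e-2, hi=1e3, iters=60):
    """Return (r_hat, p_hat) by maximizing the profile likelihood over r."""
    counts = Counter(data)
    mean = sum(data) / len(data)

    def profile(log_r: float) -> float:
        r = exp(log_r)
        return nb_loglik(counts, r, r / (r + mean))  # p maximized analytically

    a, b = log(lo), log(hi)
    inv_phi = (sqrt(5) - 1) / 2
    for _ in range(iters):  # golden-section search for the maximum
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if profile(c) > profile(d):
            b = d
        else:
            a = c
    r_hat = exp((a + b) / 2)
    return r_hat, r_hat / (r_hat + mean)


def simulate(r: int, p: float, rng: random.Random) -> int:
    """Flip a biased coin until r successes; return the number of failures."""
    successes = failures = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


rng = random.Random(1)
data = [simulate(3, 0.3, rng) for _ in range(5000)]
r_hat, p_hat = fit_nb(data)
print(r_hat, p_hat)
```

Note that the fitted r is generally not an integer, which is why the non-integer interpretation matters in practice.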
Also, it is possible to do Negative Binomial regression, modeling the effects of covariates. We’ll save that for a future article.
Translated from: https://towardsdatascience.com/use-a-negative-binomial-for-count-data-c68c062de203