Stochastic simulation helps you grasp concepts of statistics
Simulation helps distill concepts
Grasping statistics-related concepts can be hard
Do you find the concepts of statistical analysis (law of large numbers, expectation value, confidence interval, p-value) somewhat difficult and troublesome to grasp?
You are not alone.
Our human brain and psyche have not evolved to deal with rigorous statistical methods. In fact, a study of why people struggle to solve statistical problems reveals a preference for complicated rather than simpler, more intuitive solutions, which often leads to failure in solving the problem altogether.
As you might know from Nobel laureate Daniel Kahneman’s famous book, “Thinking, Fast and Slow”, our intuition does not reside in the same system as our rationality.
We are good with a small set of numbers. The short-term working memory of the human brain holds around 7–8 items/numbers.
Therefore, whenever a process presents itself at a scale of thousands or millions, we tend to lose our grasp on the ‘inherent nature’ of that process. The laws and patterns, which only manifest in the limit of large numbers, seem random and meaningless to us.
Statistics deals with large numbers, and almost all theories and results in statistical modeling and analysis are valid only in the limit of large numbers.
Data science/Machine learning is rooted in statistics: what to do?
In this era of data science and machine learning, where knowledge of the core statistical concepts is considered essential for success in those fields, this can be worrisome for data science practitioners and folks who are on their journey to learn the trade.
But do not despair. There is a surprisingly easy way to tackle this, and it is called ‘simulation’. In particular: discrete, stochastic, event-based simulation.
Let me show you the simplest possible example
The expected value of a dice throw
Suppose we are throwing a (fair) dice with 6 possible faces, 1 to 6. The face that turns up takes a value from the set {1,2,3,4,5,6} and is represented by a random variable. In a formal setting, the so-called ‘expectation value’ (denoted by E[X]) of any random variable X is given by,
$$E[X] = \sum_{x} x\,f(x) \quad \text{(discrete } X\text{)}, \qquad E[X] = \int_{-\infty}^{\infty} x\,f(x)\,dx \quad \text{(continuous } X\text{)}$$
where f(x) is the probability density function (PDF) or probability mass function (PMF) of X, i.e. the mathematical function that describes the distribution of the possible values that X can assume.
For a dice throw, the random variable X is discrete, i.e. it can assume discrete values only, so it has a PMF (and not a PDF). And it is a very simple PMF,
$$f(x) = \frac{1}{6}, \quad x \in \{1,2,3,4,5,6\}$$
This is because the random variable has a ‘uniform probability distribution’ over the sample space {1,2,3,4,5,6}, i.e. any dice throw can result in any one of these values, completely randomly, and without any bias towards any particular value. Therefore, the expected value is,
$$E[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{1+2+3+4+5+6}{6} = 3.5$$
So, as per theory, 3.5 is the expected value of the dice throwing process.
Is it the most probable value? No. Because a dice does not even have a face with 3.5! So, what’s the meaning of this quantity?
Is it some kind of probability? No. Because the value is clearly greater than 1 and probability values are always between 0 and 1.
Does it mean we can expect the face to turn up either 3 or 4 most of the time (3.5 is the average of 3 and 4)? No. Because the PMF tells us that all the faces are equally likely to turn up.
Fortunately, the answer is provided by a fundamental tenet of statistics, the Law of Large Numbers, which says that, in the long run, the average of all the values that the random variable takes converges to the expected value.
Notice the phrase “in the long run”. How do we verify this? Can we simulate such a scenario?
Sure we can. Simple Python code can help us simulate the scenario and verify the Law of Large Numbers.
Python to the rescue
Define an array with dice faces and a function to simulate a single throw.
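The original listing is not reproduced in this copy of the article. A minimal sketch of what it could look like, using the names dice and dice_throw() and the np.random.choice() call that the text mentions (the exact body is an assumption):

```python
import numpy as np

# The six faces of a fair dice
dice = np.array([1, 2, 3, 4, 5, 6])


def dice_throw():
    """Simulate a single throw by picking one face uniformly at random."""
    return np.random.choice(dice)
```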
Throw the dice a few times:
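Something along these lines would do it. The number of throws (10) is an arbitrary choice for illustration, and the two definitions are restated only so that the snippet runs on its own:

```python
import numpy as np

# Restated here so the snippet runs on its own
dice = np.array([1, 2, 3, 4, 5, 6])


def dice_throw():
    return np.random.choice(dice)


# Throw the dice ten times and collect the faces that turned up
throws = [int(dice_throw()) for _ in range(10)]
print(throws)
```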
As you might have noticed, for every invocation of dice_throw(), I am using the np.random.choice() function to pick a single random item out of the array dice. If you run this code, you will almost certainly get a different sequence on your machine.
We left the statistics behind, we are in a simulation zone
Take a pause and realize what is happening.
We are not dealing anymore with formal probabilities and definitions. We are simulating a random event, a dice throw, just like in real life. This is the lure of simulation. It constructs a replica of real life on your computing hardware :-)
We could leave all the coding behind and just do that for real: throw a dice, note down the face, rinse and repeat. But it would take a whole lot of time to verify the Law of Large Numbers following that route.
That’s why we have the computer and the Python programming language, don’t we?
So, we just simulate it for a sufficiently long time, keep a running average, and plot it. Here is what I got.
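The plot itself is not reproduced here, but the running average behind it is easy to sketch. The version below is an illustration, not the original code: it draws all throws in one call with NumPy's Generator.integers() instead of calling dice_throw() in a loop, and the seed and the number of throws (10,000) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for repeatability
n_throws = 10_000                     # assumed number of simulated throws

# Simulate all throws at once: integers 1..6, each equally likely
throws = rng.integers(low=1, high=7, size=n_throws)

# Running average after 1, 2, ..., n_throws throws
running_avg = np.cumsum(throws) / np.arange(1, n_throws + 1)

print("after 10 throws:    ", running_avg[9])
print("after 10,000 throws:", running_avg[-1])
```

The running_avg array can be handed to any plotting library (for example, matplotlib's plt.plot(running_avg)) to see the curve settle.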
Initially, the running average is pretty wild and moves around. As we increase the number of simulated throws, the average converges to 3.5, as expected from the theory.
This way, we come back to statistics again, with the help of simulation. The Law of Large Numbers can be verified by repeated simulations of a random event, with a minimal amount of programming.
Tackling the confidence interval with simulation
Some essential definitions
Population: The whole collection of which we want to measure some property. We can (almost) never get data about the whole population. Therefore, we can (almost) never know the true values of population properties.
Sample: A fraction (subset) of data from the population, which we can gather, and which helps us estimate the properties of the population. Because we cannot measure the true values of the population properties, we can only estimate them. This is the central job of statisticians.
Statistic: A statistic is a function of a sample. It is a random variable because every time you take a new sample (from the same population) you will get a new value for the statistic. Examples are the sample mean or the sample variance. These are good (unbiased) estimators of the corresponding population properties.
Confidence interval: A range/bound around the statistic (of our choice). We need this min/max bound to quantify the uncertainty that comes from the random nature of our sampling. Let’s clarify this further with the example of the confidence interval for the mean.
Depending on where and how we are drawing the sample, we may or may not get a good representation of the population. So, if we repeat the process of drawing the sample many times, in some cases the range built around the sample mean will contain the true mean of the population, and in other cases, it will miss it.
Can we say anything about the proportion of our success in drawing a sample which contains the true mean?
The answer to this question is found in the confidence interval. If some assumptions are met, then we can calculate a confidence interval that, when we repeat the sampling a large number of times, will contain the true mean in a certain fraction of the cases.
The necessary formulas are given below. We won’t get into details about this formula or why the particular t-distribution is used in this equation. Readers can refer to any undergraduate-level stats text or excellent online resources to understand the rationale.
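The original figure with the formulas is not reproduced here. As a stand-in, this is the standard t-based confidence interval for the mean, which matches the t-distribution mentioned above. The symbols are my own notation, not necessarily the figure's: sample size n, sample mean x̄, sample standard deviation s (with the n−1 denominator), and confidence level 1−α.

```latex
\bar{x} \;\pm\; t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}
```

Here t_{α/2, n−1} is the critical value of the t-distribution with n−1 degrees of freedom that leaves a probability of α/2 in the upper tail; for a 95% interval, α = 0.05.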
What is the practical utility?
Pay close attention to the definition and the process to understand the true practical utility of the confidence interval.
When you are calculating a 95% confidence interval of the mean, you are not calculating any probability (0.95 or otherwise). You are calculating two specific numbers (min and max bounds around the sample mean) which create a range of values. Ranges built this way will contain the true (unknown) population mean in about 95% of the cases if we were to repeat the process many times.
Here lies the practical utility. We are not repeating the process. We are just drawing the sample once and constructing this range.
If we could repeat the process a million times, we would be able to verify the claim that the true mean lies inside this range in 95% of the cases.
But sampling a million times can be quite expensive, or downright impossible, in real life. So, the theoretical calculation of the confidence interval provides us with the min/max range, just from one draw of the sample. This is amazing, isn’t it?
But in the simulation, we can experiment a million times!
Yes, simulation is fantastic. We can repeat the sampling process a million times and verify the claim that our theoretical confidence interval truly contains the population mean approximately 95% of the time.
Let’s verify it using a real-life example of factory production. Let’s say in a factory, a certain machine produces 20 tons of product per week on average, with a standard deviation of 5 tons. These are the true population mean and standard deviation. So, we can write simple Python code to generate a typical production run over a year (52 weeks) and plot it.
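The original listing and plot are missing from this copy, so here is a minimal sketch of such a production run. It assumes the weekly output is normally distributed (the text only gives the mean and the standard deviation), and the constant names and the seed are illustrative choices.

```python
import numpy as np

TRUE_MEAN = 20.0  # tons, population mean from the text
TRUE_STD = 5.0    # tons, population standard deviation from the text
N_WEEKS = 52      # one year of weekly production

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for repeatability

# Assumption: weekly output is normally distributed around the true mean
production = rng.normal(loc=TRUE_MEAN, scale=TRUE_STD, size=N_WEEKS)

print("sample mean:", production.mean())
print("sample std: ", production.std(ddof=1))
```

Plotting production against the week number (for instance with matplotlib) shows how much a single year can wander around the true mean of 20.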
Then, we can write the following function to simulate the process an arbitrary number of times and count how many times the confidence interval truly contained the population mean. Remember that we know the population mean for this case: it is 20.
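Since the original function is not shown in this copy, the sketch below is one possible way to write it rather than the author's code. The name ci_coverage and its parameters are hypothetical, the data are assumed to be normal, and all runs are simulated in one vectorized call instead of one run per function call. The critical value comes from scipy.stats.t.ppf.

```python
import numpy as np
from scipy import stats


def ci_coverage(n_runs, confidence, mu=20.0, sigma=5.0, n=52, seed=None):
    """Fraction of simulated samples whose t-based C.I. contains the true mean mu."""
    rng = np.random.default_rng(seed)
    # One row per simulated year of weekly production (assumed normal)
    samples = rng.normal(loc=mu, scale=sigma, size=(n_runs, n))
    means = samples.mean(axis=1)
    std_errs = samples.std(axis=1, ddof=1) / np.sqrt(n)
    # Two-sided critical value of the t-distribution with n-1 degrees of freedom
    t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)
    lower = means - t_crit * std_errs
    upper = means + t_crit * std_errs
    # True where the interval contains the known population mean
    contained = (lower <= mu) & (mu <= upper)
    return contained.mean()


# Usage: 10,000 simulated years at two confidence levels
for level in (0.90, 0.99):
    print(level, ci_coverage(n_runs=10_000, confidence=level, seed=1))
```

Because the population mean is known inside the simulation, the returned fraction can be compared directly with the nominal confidence level.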
If we run this function 10,000 times, each time checking whether the C.I. contained the true mean, and then look at the frequency/ratio, we get the following.
The ratios come amazingly close to the theoretical values of 0.9 (90%) and 0.99 (99%), don’t they?
Simulation is a powerful tool for large-scale data science
In the example above, we talked about the C.I. of the mean. But we can construct a C.I. around any other statistic, like the variance or even quantiles. We can even construct a C.I. for the difference of means between two experiments. The exact formula and calculations may be slightly different in each case but the idea remains the same.
As the process complexity increases and we deal with not one but a multitude of interconnected processes, calculating simple summary statistics may not always be possible in practice. We must master the art of stochastic simulation to deal with such situations in large data science and analytics tasks.
Summary and thoughts on simulation
In this article, we demonstrated the power of simulation to understand concepts of statistical estimation like expected value and confidence interval. In reality, we do not get the chance to repeat a statistical experiment thousands of times, but we can simulate the process on a computer, which helps us distill these concepts in a clear and intuitive manner.
Once you master the art of simulating a stochastic event, you can investigate the properties of the random variables and the esoteric statistical theory behind them, with a new weapon of analysis.
For example, you can investigate, using stochastic simulation,
- The convergence of the mean of many stochastic events to a Normal distribution (verifying the Central Limit Theorem by numerical experiment)
- What happens when you mix or transform many statistical distributions together in this way or that, and what kind of resulting distributions you get
- If a stochastic event does not follow the theoretical assumptions, what kind of aberrant behavior can you get in the result? In this case, simulation could be your only friend because the standard theory fails if the assumptions are not met.
- What kind of statistical properties emerge from the operation of a Deep Learning network?
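As a taste of the first item in the list, here is a small numerical experiment on the Central Limit Theorem. Everything in it is an illustrative assumption: the exponential distribution (chosen because it is clearly not normal), the sample size of 50, the 20,000 repetitions and the seed.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # arbitrary seed, only for repeatability
n_experiments = 20_000               # number of sample means to collect
sample_size = 50                     # observations averaged in each experiment

# Draw from a clearly non-normal distribution: exponential with mean 1 and std 1
samples = rng.exponential(scale=1.0, size=(n_experiments, sample_size))
sample_means = samples.mean(axis=1)

# CLT prediction: sample means center on 1 with std 1 / sqrt(sample_size)
predicted_std = 1 / np.sqrt(sample_size)
print("mean of sample means:", sample_means.mean())
print("std of sample means: ", sample_means.std(ddof=1))
print("CLT prediction (std):", predicted_std)

# Share of sample means within one predicted std of 1 (about 0.68 for a normal curve)
within_one_sd = np.mean(np.abs(sample_means - 1.0) < predicted_std)
print("share within one std:", within_one_sd)
```

A histogram of sample_means makes the bell shape visible, even though the individual draws are heavily skewed.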
For learning the foundational principles of data science and machine learning, the importance of these kinds of exercises cannot be emphasized enough.
If you liked this article, you may also like my other articles on stochastic simulation and statistical concepts using Python.
Also, you can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.
Translated from: https://towardsdatascience.com/stochastic-simulation-helps-you-grasp-concepts-of-statistics-befdba517404
