The Randomness of the Random Forest Algorithm: A Pictorial Guide to Understanding the Random Forest Algorithm
What this article is about
In this article, we will see how the Random Forest algorithm works internally. To truly appreciate it, it might be helpful to understand a bit about Decision-Tree Classifiers, but it's not strictly required.
👉 Note: We are not covering the pre-processing or feature-creation steps involved in modelling — we only look at what happens inside the algorithm when we call the .fit() and .predict() methods of sklearn's RandomForestClassifier.
Random Forest in one paragraph
Random Forest (RF) is a tree-based algorithm: an ensemble of many randomized decision trees. The final output of the model aggregates the predictions made by the individual trees — a majority vote for classification, or the average of the estimates for regression.
The Package
We will be basing our article on sklearn's RandomForestClassifier module:

sklearn.ensemble.RandomForestClassifier
The Data
For illustration, we will be using training data similar to the one below.
(Image by Author)
👉 Note: age, glucose_level, weight, gender, smoking, … f98, f99 are all the independent variables, or features.
Diabetic is the y-variable (the dependent variable) that we have to predict.
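To make this concrete, here is a minimal sketch of what such a training set might look like in pandas; the column names follow the description above, and all values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical training set mirroring the features described above
train = pd.DataFrame({
    "age":           [48, 55, 33, 61, 29, 50],
    "glucose_level": [110, 160, 95, 180, 100, 150],
    "weight":        [72, 90, 60, 95, 65, 88],
    "gender":        [0, 1, 0, 1, 1, 0],   # encoded: 0 = female, 1 = male
    "smoking":       [0, 1, 0, 1, 0, 1],
    "diabetic":      [0, 1, 0, 1, 0, 1],   # the y-variable we want to predict
})

X_train = train.drop(columns="diabetic")
y_train = train["diabetic"]
```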
What really happens internally
With this basic information, let's get started and understand what happens when we pass this training set to the algorithm…
Step 1 — Bootstrapping
(Image by Author)
Once we provide the training data to the RandomForestClassifier model, the algorithm selects a bunch of rows at random. This process is called bootstrapping (random sampling with replacement). For our example, let's assume that it selects m records.
Note 👉 The number of rows to be selected can be provided by the user via the hyper-parameter max_samples.
Note 👉 Because sampling is done with replacement, the same row might get selected more than once.
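A minimal sketch of how these two behaviours are exposed in sklearn; bootstrap and max_samples are real RandomForestClassifier parameters, while the value 0.8 is just an illustrative choice.

```python
from sklearn.ensemble import RandomForestClassifier

# bootstrap=True (the default) draws rows with replacement for each tree;
# max_samples=0.8 caps each bootstrap sample at 80% of the training rows
my_rf = RandomForestClassifier(bootstrap=True, max_samples=0.8)
```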
Step 2 — Selecting features for sub-trees
(Image by Author: choose the features for the mini decision tree)
Now, RF randomly selects a subset of the features/columns. For the sake of simplicity in this example, we choose 3 random features.
Note 👉 You can control this number with the hyper-parameter max_features, similar to the code below:
```python
from sklearn.ensemble import RandomForestClassifier

my_rf = RandomForestClassifier(max_features=8)
```

Step 3 — Selecting the root node
Once the 3 random features are selected, the algorithm evaluates candidate splits of the m records (from Step 1) and does a quick calculation of the before-and-after values of a metric.
This metric can be either gini impurity or entropy. Which one is used depends on the choice you have provided in the criterion hyper-parameter:
```python
criterion='gini'   # or 'entropy' (default is 'gini')
```
Whichever of the random features gives the minimum gini impurity / entropy value is selected as the root node.
The records are split at this node based on the best splitting point.
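To make the "before and after" calculation concrete, here is a small illustrative sketch of computing gini impurity for a candidate split; the helper function and the toy labels are ours for illustration, not part of sklearn.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Toy y-labels of the m bootstrapped records before the split
parent = np.array([0, 0, 1, 1, 1, 0, 1, 0])
print(gini(parent))            # 0.5 — maximally mixed

# A candidate split sends the records into two child nodes
left, right = np.array([0, 0, 0, 1]), np.array([1, 1, 1, 0])
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(weighted)                # 0.375 — lower impurity, so this split helps
```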
Step 4 — Selecting the child nodes
(Image by Author: select the features randomly)
The algorithm performs the same process as in Steps 2 and 3 and selects another set of 3 random features. (3 is the number we have specified — you can choose what you like, or leave it to the algorithm to choose the best number.)
Based on the criterion (gini/entropy), it selects which feature will go into the next node / child node, and further splitting of the records happens there.
Step 5 — Further splits and creating child nodes
(Image by Author: continue selecting features/columns to create the further child nodes)
This process of selecting random features and splitting the nodes (Steps 2–4) continues until either of the following conditions is met (the sklearn hyper-parameters behind them are sketched right after this list):
- a) you have run out of rows to split (or reached the threshold — the minimum number of rows required in each child node)
- b) the gini impurity / entropy does not decrease after splitting
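A minimal sketch of the sklearn hyper-parameters that govern these stopping conditions; the parameter names are real RandomForestClassifier ones, and the values shown are the library defaults.

```python
from sklearn.ensemble import RandomForestClassifier

my_rf = RandomForestClassifier(
    min_samples_split=2,        # (a) a node needs at least this many rows to be split
    min_samples_leaf=1,         # (a) each child node must keep at least this many rows
    min_impurity_decrease=0.0,  # (b) split only if impurity drops by at least this much
)
```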
You now have your first "mini decision tree".
(Image by Author: the first mini decision tree, created using the randomly selected rows (records) & columns (features))
Step 6 — Create more mini decision trees
The algorithm goes back to your data and repeats Steps 1–5 to create the 2nd "mini-tree".
(Image by Author: the second mini tree, created using another set of randomly chosen rows & columns)
Step 7 — Build the forest of trees
Once the default value of 100 trees is reached (you now have 100 mini decision trees), the model is said to have completed its fit() process.
(Image by Author: 2 trees from the list of 100 trees)
Note 👉 You can specify the number of trees you want to generate via the hyper-parameter n_estimators:
```python
from sklearn.ensemble import RandomForestClassifier

my_rf = RandomForestClassifier(n_estimators=300)
```

(Image by Author: the forest contains the number of trees set by the n_estimators variable, or the default value of 100 if not specified)
Now you have a forest of randomly created mini trees (hence the name Random Forest).
Step 8 — Inferencing
Now let's predict the values in an unseen data set (the test data set).
For inferencing (more commonly referred to as predicting/scoring) the test data, the algorithm passes each record through every mini-tree.
(Image by Author)
The values in the record traverse the mini tree according to the variable each node represents, ultimately reaching a leaf node. Based on the value of the leaf node (predetermined during training) where this record ends up, that mini-tree produces one prediction output.
(Image by Author)
Similarly, the same record goes through all 100 mini decision trees, and each of the 100 trees produces a prediction output. The final prediction value for this record is calculated by taking a simple majority vote of these 100 mini trees.
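Under the hood, a fitted forest exposes its individual trees via the estimators_ attribute, so we can mimic this vote ourselves. A minimal sketch, assuming X_train / y_train from the toy-data sketch earlier; the test record x_new is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Fit a forest on the toy data from earlier in the article
my_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# One hypothetical test record: age, glucose_level, weight, gender, smoking
x_new = np.array([[52, 155, 85, 1, 0]])

# Each of the 100 mini-trees casts one vote for this record
votes = np.array([tree.predict(x_new)[0] for tree in my_rf.estimators_])

# Majority vote across the trees; sklearn itself averages the trees' class
# probabilities, which for fully grown trees reduces to this same majority vote
prediction = np.bincount(votes.astype(int)).argmax()
```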
Now we have the prediction for a single record.
The algorithm iterates through all the records of the test set following the same process, and the overall accuracy can then be calculated!
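A minimal sketch of scoring a whole test set, assuming X_test / y_test are held-out data shaped like the training set (both names are placeholders for your own data).

```python
from sklearn.metrics import accuracy_score

# Predict every row of the test set in one call, then compare to the true labels
y_pred = my_rf.predict(X_test)
print(accuracy_score(y_test, y_pred))   # overall accuracy
```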
(Image by Author: iterate the process of obtaining a prediction for each row of the test set to arrive at the final accuracy)
Translated from: https://towardsdatascience.com/a-pictorial-guide-to-understanding-random-forest-algorithm-fbf570a0ae0d