Seeing the Forest Through the Trees
How a Decision Tree Works
Pictorially, a decision tree is like a flowchart, where the parent nodes represent tests on attributes and the leaf nodes represent the final category assigned to the data points that reach that leaf.
Figure 1: Students sample distribution
In the illustration above, a total of 13 students were randomly sampled from a students performance dataset. The scatter plot shows the distribution of the sample based on two attributes: raisedhands and visitedResources.
Our intent is to manually construct a decision tree that best separates the sample data points into three distinct classes: L, M, and H, where:
L = Low performance category
M = Medium (average) performance category
H = High performance category
Option A
One option is to split the data along the attribute visitedResources at point mark 70.
This “perfectly” separates the H class from the rest.
Option B
Another option is to split along the same attribute, visitedResources, at point mark 41.
No “perfect” separation is achieved for any class.
Option C
Another option is to split along the attribute raisedhands at point mark 38.
This “perfectly” separates the L class from the rest.
Options A and C did a better job at separating at least one of the classes. Suppose we pick option A, the resultant decision tree will be:
The left branch has only H class students and hence cannot be separated any further. On the right branch, the resultant node has four students each in the M and L classes.
Remember that this is the current state of our separation exercise.
How best can the remaining students (data points) be separated into their appropriate classes? Yes, you guessed right: draw more lines!
One option is to split along the attribute raisedhands at point mark 38.
Again, any number of split lines can be drawn; however, this option seems to yield a good result, so we shall go with it.
The resultant decision tree after the split is shown below:
Clearly, the data points are perfectly separated into the appropriate classes, hence no further logical separation is needed.
Lessons Learnt So Far:
In ML parlance, this process of building out a decision tree that best classifies a given dataset is referred to as Learning.
In manually constructing the decision tree, we learnt that the separation lines can be drawn at any point along any of the attributes available in a dataset. The question is, at any given decision node, which of the possible attributes and separation points will do a better job of separating the dataset into the desired or near-desired classes or categories? An instrument for determining the answer to this question is the Gini Impurity.
Gini Impurity
Suppose we have a new student and we randomly classify this new student into any of the three classes based on the probability distribution of the classes. The gini impurity is a measure of the likelihood of incorrectly classifying that new random student (variable). It is a probabilistic measure, hence it is bounded between 0 and 1.
We have a total of 13 students in our sample dataset, and the probabilities of the H, M, and L classes are 5/13, 4/13, and 4/13 respectively.
The formula below is applied in calculating the gini impurity:
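With $C$ classes and $p_i$ denoting the probability of class $i$:

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$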
The above formula, when applied to our example case, becomes:
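$$Gini = 1 - \left(p_H^2 + p_M^2 + p_L^2\right)$$

where $p_H$, $p_M$ and $p_L$ are the class probabilities listed above.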
Therefore, the gini impurity at the root node of the decision tree, before any split, is computed as:
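$$Gini_{root} = 1 - \left[\left(\tfrac{5}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2 + \left(\tfrac{4}{13}\right)^2\right] = 1 - \tfrac{57}{169} \approx 0.66$$

This 0.66 is the root-node impurity used in the gini gain comparison below.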
Recall the earlier discussed split options A and C at the root node. Let us compare the gini impurities of the two options and see why A was picked as the better split choice.
Option A / Option C (gini impurity computations shown in the figures)
Therefore, the amount of impurity removed with split option A, the gini gain, is 0.66 - 0.3 = 0.36, while that for split option C is 0.66 - 0.37 = 0.29.
Obviously, gini gain 0.36 > 0.29; hence, option A is a better split choice, informing the earlier decision to pick A over C.
The gini impurity at a node where all the students are of only one class, say H, is always equal to zero, meaning no impurity. This implies a perfect classification; hence, no further split is needed.
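To make these calculations concrete, the following is a minimal, illustrative Python sketch (not taken from the original article) of the gini impurity and gini gain computations; the function names and hard-coded class counts are only for demonstration:

```python
from collections import Counter

def gini_impurity(labels):
    """Probability of misclassifying a randomly drawn, randomly labelled data point."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def gini_gain(parent_labels, child_groups):
    """Impurity removed by a split: parent gini minus the size-weighted child ginis."""
    total = len(parent_labels)
    weighted = sum(len(group) / total * gini_impurity(group) for group in child_groups)
    return gini_impurity(parent_labels) - weighted

# Root node of the 13-student sample: 5 H, 4 M and 4 L students
root = ["H"] * 5 + ["M"] * 4 + ["L"] * 4
print(round(gini_impurity(root), 2))  # ~0.66

# Split option A (visitedResources at 70): a pure H node and a mixed M/L node
left, right = ["H"] * 5, ["M"] * 4 + ["L"] * 4
print(round(gini_gain(root, [left, right]), 2))  # ~0.36
```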
Random Forest
We have seen that many decision trees can be generated from the same dataset, and that the performance of the trees at correctly predicting unseen examples can vary. Also, using a single tree model (decision tree) can easily lead to over-fitting.
The question becomes: how do we make sure we construct the best-performing tree possible? An answer to this is to smartly construct as many trees as possible and use averaging to improve the predictive accuracy and control over-fitting. This method is called the Random Forest. It is random because each tree is constructed using not the whole training dataset but a random sample of the data points and attributes.
We shall use the random forest algorithm implementation in the Scikit-learn python package to demonstrate how a random forest model can be trained and tested, as well as how to visualize one of the trees that constitute the forest.
For this exercise, we shall train a random forest model to predict (classify) the academic performance category (Class) which students belong to, based on their participation in class/learning processes.
In the dataset for this exercise, students’ participation is defined as a measure of four variables, which are:
Raised hand: How many times the student raised his/her hand in class to ask or answer questions (numeric: 0-100)
Visited resources: How many times the student visited a course's content (numeric: 0-100)
Viewing announcements: How many times the student checked the news announcements (numeric: 0-100)
Discussion groups: How many times the student participated in discussion groups (numeric: 0-100)
In the sample extract below, the first four (4) numeric columns correspond to the students' participation measures defined earlier, and the last column, Class, which is categorical, represents the student's performance. A student can be in any one of three (3) classes: Low, Medium, or High performance.
Figure-1: Dataset extract: students' participation measures and performance class
Basic data preparation steps:
An implementation of all the above steps is shown in the snippet below:
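A minimal sketch of what such preparation could look like, assuming the data is available as a CSV file; the file name, exact column names, and split ratio below are illustrative assumptions rather than the author's originals:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("students_performance.csv")  # hypothetical file name

# The four participation attributes described above
feature_cols = ["raisedhands", "VisitedResources", "AnnouncementsView", "Discussion"]
X = df[feature_cols]

# Encode the categorical target (H, L, M) as integer labels
label_enc = LabelEncoder()
y = label_enc.fit_transform(df["Class"])

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```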
Next, we shall create a RandomForest instance and fit the model (build the trees) to the train set.
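A sketch of this step, assuming the train split from the preparation sketch above; the hyperparameter values shown are illustrative, not necessarily those chosen by the author:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative hyperparameters; see the parameter notes below
model = RandomForestClassifier(n_estimators=100, criterion="gini",
                               max_depth=4, random_state=42)
model.fit(X_train, y_train)
```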
Where:
n_estimators = the number of trees that make up the forest
criterion = the method used to pick the best attribute split option for the decision trees. Here, we see the gini impurity being used.
max_depth = a cap on the depth of the trees. If no clear classification is arrived at by this depth, the model will consider all the nodes at that level to be leaf nodes. Also, for each leaf node, the data points are classified as the majority class in that node.
Note that the optimal n_estimators and max_depth combination can only be determined by experimenting with several combinations. One way to achieve this is by using the grid search method.
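For instance, a grid search over these two hyperparameters could look like the sketch below; the candidate values are arbitrary examples:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 4, 5, 6]}
search = GridSearchCV(RandomForestClassifier(criterion="gini", random_state=42),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_)  # best combination found on the training folds
```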
Model Evaluation
While there exist several metrics for evaluating models, we shall use one of, if not the most basic one: accuracy.
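A sketch of the accuracy check, assuming the model and train/test splits defined earlier:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on train set: {train_acc:.2%}, test set: {test_acc:.2%}")
```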
Accuracy on the train set: 72.59%; test set: 68.55%. This could be better, but it is not a bad benchmark.
Visualizing the Best Tree in the Forest
The best tree in a random forest model can be visualized easily, enabling the engineer, scientist, and business specialists alike to gain some understanding of the decision flow of the model.
The snippet below extracts and visualizes the best tree from the above-trained model:
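A possible sketch of this step: one simple heuristic is to score every tree in the ensemble on the held-out test set and plot the highest-scoring one. This selection heuristic and the plotting details are assumptions, not necessarily the author's exact approach:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Score each individual tree on the test set and keep the best one.
# The sub-estimators predict integer-encoded labels, which is why the target
# was label-encoded during data preparation.
best_tree = max(model.estimators_, key=lambda t: t.score(X_test, y_test))

plt.figure(figsize=(20, 10))
plot_tree(best_tree,
          feature_names=feature_cols,            # the four participation attributes
          class_names=list(label_enc.classes_),  # original H/L/M class names
          filled=True, rounded=True)
plt.show()
```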
Decision tree extracted from the random forest.
Conclusion:
In this article, we succeeded in looking at how a decision tree works, understanding how attribute split choices are made using the gini impurity, and how several decision trees are ensembled to make a random forest. Finally, we demonstrated the usage of the random forest algorithm by training a random forest model to classify students into academic performance categories based on their participation in class/learning processes.
Thanks for reading.
Translated from: https://towardsdatascience.com/seeing-the-forest-through-the-trees-45deafe1a6f0