

Decision Tree Classification in Python: Everything You Need to Know

Published: 2024/1/1 python 20 豆豆


What Is a Decision Tree?

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.


Decision Trees (DTs) are a non-parametric supervised learning method used for both classification and regression. A decision tree learns a set of if-then-else decision rules from the data and uses them to approximate the target. The deeper the tree, the more complex the decision rules and the more closely the model fits the training data, which also raises the risk of overfitting. The model takes the form of a tree structure; CART (Classification and Regression Trees) is one widely used algorithm of this kind. Training breaks the dataset down into smaller and smaller subsets while the associated tree is built incrementally. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, and a leaf node represents a classification or decision. The topmost decision node, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

When Is a Decision Tree Used?

When the user has an objective, such as maximizing profit or minimizing cost.

When there are several courses of action, as in the menu system of an ATM or a customer support phone menu.

When there is uncertainty about which outcome will actually happen.

How to Build a Decision Tree

Step 1

Calculate the entropy of the target.

Step 2

The dataset is then split on each candidate attribute. The entropy of each branch is calculated and added proportionally (weighted by branch size) to get the total entropy of the split. This is subtracted from the entropy before the split. The result is the Information Gain, i.e. the decrease in entropy.

Step 3

Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches, and repeat the same process on every branch.
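The three steps can be sketched as a small recursive procedure. This is only an illustrative, ID3-style sketch using the standard library: the function names (`entropy`, `information_gain`, `build_tree`) and the toy data are hypothetical, and it makes one branch per attribute value, which is not how scikit-learn's `DecisionTreeClassifier` (binary splits) is implemented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Step 1: Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Step 2: entropy before the split minus the weighted branch entropy."""
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr], []).append(label)
    weighted = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - weighted


def build_tree(rows, labels, attrs):
    """Step 3: split on the best attribute, then recurse on every branch."""
    if len(set(labels)) == 1 or not attrs:
        # Pure node, or nothing left to split on: predict the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    tree = {best: {}}
    for value in sorted({row[best] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree


# Hypothetical toy data: "outlook" fully determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

On this toy data the root is `outlook`, because splitting on it removes all of the entropy, while splitting on `windy` removes none.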

Entropy and Information Gain Calculations

Entropy

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

where

S is the total sample space,
P(yes) is the probability of yes,
P(no) is the probability of no.

If the number of yes equals the number of no, i.e. P(yes) = P(no) = 0.5:

Entropy(S) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1

If S contains all yes or all no, i.e. P(yes) = 1 or 0:

Entropy(S) = 0

For example, when P(yes) = 1, i.e. every sample in S is yes (taking 0 log2 0 as 0):

E(S) = -1 log2 1
E(S) = 0
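The two boundary cases above can be checked numerically. The following is a minimal sketch, assuming a two-class (yes/no) sample and base-2 logarithms; `binary_entropy` is a hypothetical helper name, not a library function.

```python
from math import log2


def binary_entropy(p_yes):
    """Entropy in bits of a yes/no sample where P(yes) = p_yes."""
    total = 0.0
    for p in (p_yes, 1.0 - p_yes):
        if p > 0:  # convention: 0 * log2(0) is treated as 0
            total -= p * log2(p)
    return total


for p in (0.0, 0.25, 0.5, 1.0):
    print(f"P(yes) = {p:.2f} -> entropy = {binary_entropy(p):.3f}")
```

Entropy peaks at 1 bit for an evenly mixed sample and falls to 0 as the sample becomes pure, which is why a split that produces purer branches is preferred.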

Information Gain

Measures the reduction in entropy.
Decides which attribute should be selected as a decision node.

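If S is our total collection and splitting on an attribute divides it into subsets S_v, then

Information Gain = Entropy(S) - Σ_v (|S_v| / |S|) x Entropy(S_v)

That is, the entropy of each branch is weighted by the fraction of samples that fall into it, and the weighted sum is subtracted from Entropy(S).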
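As a worked example of the formula, the sketch below computes the information gain from class counts alone. The counts are hypothetical (a collection with 9 yes and 5 no samples, split by an attribute with three values), and the helper names are illustrative.

```python
from math import log2


def entropy_from_counts(counts):
    """Entropy in bits of a node given its per-class counts, e.g. (yes, no)."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, branch_counts):
    """Entropy(S) minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(
        (sum(branch) / total) * entropy_from_counts(branch)
        for branch in branch_counts
    )
    return entropy_from_counts(parent_counts) - weighted


# Hypothetical collection S: 9 "yes" and 5 "no" samples,
# split by an attribute with three values into (yes, no) counts per branch.
parent = (9, 5)
branches = [(2, 3), (4, 0), (3, 2)]
gain = information_gain(parent, branches)
print(f"Entropy(S) = {entropy_from_counts(parent):.3f}")
print(f"Information Gain = {gain:.3f}")
```

Here Entropy(S) is about 0.940 and the weighted branch entropy about 0.694, so the gain is roughly 0.247. The pure middle branch (4 yes, 0 no) contributes no entropy at all.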

Python Implementation of a Decision Tree

We will use the following libraries.

pandas

NumPy

scikit-learn

Matplotlib

We will use the BankNoteAuthentication dataset.


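import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# IPython/Jupyter magic: render plots inline (omit in a plain .py script)
%matplotlib inline

# Load the banknote authentication data into a DataFrame
bankdata = pd.read_csv("../input/bank-note-authentication-uci-data/BankNote_Authentication.csv")
bankdata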

Feature Selection

Here, you need to divide the given columns into two types of variables: the dependent variable (or target variable) and the independent variables (or feature variables).

feature_cols = ['variance','skewness','curtosis','entropy']
# Split the dataset into features and target variable
X = bankdata[feature_cols] # Features
y = bankdata['class'] # Target variable

Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.


Let’s split the dataset using the function train_test_split(). You need to pass three parameters: the features, the target, and the test set size.

from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

Building the Decision Tree Model

Let’s create a Decision Tree Model using Scikit-learn.


from sklearn.tree import DecisionTreeClassifier

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree classifier
clf = clf.fit(X_train, y_train)
# Predict the response for the test dataset
y_pred = clf.predict(X_test)

Evaluating the Model

Let’s estimate how accurately the classifier can predict the class of a banknote.

Accuracy can be computed by comparing actual test set values and predicted values.


from sklearn import metrics

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

OUTPUT: Accuracy: 0.9878640776699029


Confusion Matrix

A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class. This breakdown is the key to the confusion matrix: it shows the ways in which your classification model is confused when it makes predictions. It gives insight not only into how many errors a classifier makes but, more importantly, into the types of errors it makes.

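from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
cm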

OUTPUT:


array([[231, 4], [ 1, 176]])
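The reported accuracy can be recovered from this matrix by hand. The sketch below hard-codes the four counts shown above and assumes scikit-learn's layout (rows are actual classes, columns are predicted classes); treating class 1 as the "positive" class for precision and recall is an assumption made only for illustration.

```python
# Confusion matrix reported above: rows are actual classes, columns are predicted.
cm = [[231, 4],
      [1, 176]]

# Assumption: class 1 is treated as the "positive" class.
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # correct predictions / all predictions
precision = tp / (tp + fp)     # of the predicted positives, how many were right
recall = tp / (tp + fn)        # of the actual positives, how many were found

print(f"test samples = {total}")
print(f"accuracy     = {accuracy:.4f}")
print(f"precision    = {precision:.4f}")
print(f"recall       = {recall:.4f}")
```

The diagonal holds the correct predictions (231 + 176 = 407 out of 412 test samples), which matches the accuracy printed earlier; the off-diagonal entries show that the 5 errors are split 4 to 1 between the two classes.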


Originally published at https://www.numpyninja.com on August 12, 2020.


Translated from: https://medium.com/analytics-vidhya/decision-tree-classification-in-python-everything-you-need-to-know-212160ec03f6