Decision Tree (DT)
Table of Contents
- 1. Decision Tree Model and Learning
- 2. Feature Selection
- 2.1 Feature Selection in Python
- 3. Decision Tree Generation
- 3.1 Python Code
- 4. Decision Tree Pruning
- 5. The CART Algorithm
- 6. sklearn Examples
- 6.1 Loan Example from the Book
- 6.2 Iris Dataset and Tree Visualization
- Appendix: Complete Code
決策樹(shù)(decision tree)是一種基本的分類(lèi)與回歸方法。
- 分類(lèi)問(wèn)題中,基于特征對(duì)實(shí)例進(jìn)行分類(lèi)的過(guò)程。
- 優(yōu)點(diǎn):模型具有可讀性,分類(lèi)速度快。
- 學(xué)習(xí):利用訓(xùn)練數(shù)據(jù),根據(jù)損失函數(shù)最小化的原則建立決策樹(shù)模型。
- 預(yù)測(cè):對(duì)新的數(shù)據(jù),利用決策樹(shù)模型進(jìn)行分類(lèi)。
決策樹(shù)學(xué)習(xí)通常包括3個(gè)步驟:特征選擇、決策樹(shù)生成、決策樹(shù)修剪。
Quinlan在1986年提出的ID3算法、1993年提出的C4.5算法
Breiman等人在1984年提出的CART算法
1. 決策樹(shù)模型與學(xué)習(xí)
決策樹(shù)由結(jié)點(diǎn)(node)和有向邊(directed edge)組成。
- 內(nèi)部結(jié)點(diǎn)(internal node)和葉結(jié)點(diǎn)(leaf node)。
- 內(nèi)部結(jié)點(diǎn)表示一個(gè)特征或?qū)傩?#xff0c;葉結(jié)點(diǎn)表示一個(gè)類(lèi)。
- 用決策樹(shù)分類(lèi),從根結(jié)點(diǎn)開(kāi)始,對(duì)實(shí)例的某一特征進(jìn)行測(cè)試,根據(jù)測(cè)試結(jié)果,將實(shí)例分配到其子結(jié)點(diǎn);這時(shí),每一個(gè)子結(jié)點(diǎn)對(duì)應(yīng)著該特征的一個(gè)取值。
- 遞歸地對(duì)實(shí)例進(jìn)行測(cè)試并分配,直至達(dá)到葉結(jié)點(diǎn)。最后將實(shí)例分到葉結(jié)點(diǎn)的類(lèi)中。
決策樹(shù)學(xué)習(xí)本質(zhì):從訓(xùn)練數(shù)據(jù)集中歸納出一組分類(lèi)規(guī)則。
- 需要一個(gè)與訓(xùn)練數(shù)據(jù)矛盾較小的決策樹(shù),同時(shí)具有很好的泛化能力。
- 決策樹(shù)學(xué)習(xí)的損失函數(shù):通常是正則化的極大似然函數(shù)。
- 決策樹(shù)學(xué)習(xí)的策略:損失函數(shù)為目標(biāo)函數(shù)的最小化。
2. Feature Selection
決策樹(shù)訓(xùn)練時(shí)高度太高,對(duì)訓(xùn)練數(shù)據(jù)準(zhǔn)確率高,但泛化能力差,需要剪枝。
常用的準(zhǔn)則:
(1) Information gain of feature $A$ for the sample set $D$ (used by ID3):
$$g(D, A)=H(D)-H(D \mid A)$$

$$H(D)=-\sum_{k=1}^{K} \frac{\left|C_{k}\right|}{|D|} \log _{2} \frac{\left|C_{k}\right|}{|D|}$$

$$H(D \mid A)=\sum_{i=1}^{n} \frac{\left|D_{i}\right|}{|D|} H\left(D_{i}\right)$$
where $H(D)$ is the entropy of the dataset $D$, $H(D_i)$ is the entropy of the subset $D_i$, and $H(D \mid A)$ is the conditional entropy of $D$ given feature $A$. $D_i$ is the subset of $D$ in which feature $A$ takes its $i$-th value, $C_k$ is the subset of $D$ belonging to the $k$-th class, $n$ is the number of distinct values of $A$, and $K$ is the number of classes.
- The larger the entropy, the greater the uncertainty of the random variable.
- Information gain measures how much knowing feature X reduces the uncertainty about class Y.
- Choose the feature with the largest information gain (a worked example follows below).
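As a quick worked example (my own arithmetic, using the loan-approval dataset listed in section 2.1, which contains 9 positive and 6 negative samples), the empirical entropy and the information gain of the age feature come out as:

$$H(D)=-\frac{9}{15} \log _{2} \frac{9}{15}-\frac{6}{15} \log _{2} \frac{6}{15} \approx 0.971$$

$$g(D, \text{年齡})=H(D)-H(D \mid \text{年齡}) \approx 0.971-0.888=0.083$$

which matches the 0.083 printed by the code in section 2.1.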
(2) Information gain ratio of feature $A$ for the sample set $D$ (used by C4.5):
$$g_{R}(D, A)=\frac{g(D, A)}{H_{A}(D)}$$
where $g(D,A)$ is the information gain and $H_A(D)$ is the entropy of the dataset $D$ with respect to the values of feature $A$.
(3) Gini index of the sample set $D$ (used by CART):
$$\operatorname{Gini}(D)=1-\sum_{k=1}^{K}\left(\frac{\left|C_{k}\right|}{|D|}\right)^{2}$$
Gini index of the set $D$ under a binary split on feature $A$ (into $D_1$ and $D_2$):
$$\operatorname{Gini}(D, A)=\frac{\left|D_{1}\right|}{|D|} \operatorname{Gini}\left(D_{1}\right)+\frac{\left|D_{2}\right|}{|D|} \operatorname{Gini}\left(D_{2}\right)$$
- 基尼指數(shù)表示集合D的不確定性,基尼指數(shù) Gini(D,A)Gini(D,A)Gini(D,A) 表示經(jīng)A=aA=aA=a分割后集合DDD的不確定性。
- 基尼指數(shù)值越大,樣本集合的不確定性也就越大,這一點(diǎn)與熵相似。
- 選擇 基尼指數(shù) 小的
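The code in section 2.1 below only computes entropy and information gain, so here is a minimal sketch of the other two criteria (the helper names `info_gain_ratio` and `gini_index` are mine, not from the original code); it works on rows whose last column is the class label:

```python
from collections import Counter
from math import log


def entropy(labels):
    """Empirical entropy H of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n, 2) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def info_gain_ratio(rows, axis):
    """g_R(D, A) = g(D, A) / H_A(D): information gain divided by the entropy of A's values."""
    n = len(rows)
    groups = Counter(r[axis] for r in rows)
    cond_ent = sum((cnt / n) * entropy([r[-1] for r in rows if r[axis] == v])
                   for v, cnt in groups.items())
    gain = entropy([r[-1] for r in rows]) - cond_ent
    h_a = -sum((cnt / n) * log(cnt / n, 2) for cnt in groups.values())
    return gain / h_a


def gini_index(rows, axis, value):
    """Gini(D, A): weighted Gini of the binary split A == value vs. A != value."""
    d1 = [r[-1] for r in rows if r[axis] == value]
    d2 = [r[-1] for r in rows if r[axis] != value]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)
```

On the loan dataset below, `gini_index(datasets, axis=2, value='是')` evaluates to about 0.27 for the house-ownership feature, the same value given in the book's CART example.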
2.1 Feature Selection in Python
```python
import numpy as np
from math import log


def get_data():
    datasets = [['青年', '否', '否', '一般', '否'],
                ['青年', '否', '否', '好', '否'],
                ['青年', '是', '否', '好', '是'],
                ['青年', '是', '是', '一般', '是'],
                ['青年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '好', '否'],
                ['中年', '是', '是', '好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '好', '是'],
                ['老年', '是', '否', '好', '是'],
                ['老年', '是', '否', '非常好', '是'],
                ['老年', '否', '否', '一般', '否'],
                ]
    labels = [u'年齡', u'有工作', u'有自己的房子', u'信貸情況', u'分類']
    # the u prefix marks a Unicode literal; commonly added before Chinese strings
    # to avoid mojibake (harmless in Python 3)
    return datasets, labels


# --------- loan example from the book -----------------
datasets, labels = get_data()


def cal_entropy(datasets):  # empirical entropy H(D)
    data_len = len(datasets)
    label_count = {}
    for i in range(data_len):
        label = datasets[i][-1]
        if label not in label_count:
            label_count[label] = 0
        label_count[label] += 1
    entropy = -sum([(p / data_len) * log(p / data_len, 2)
                    for p in label_count.values()])
    return entropy


def cond_entropy(datasets, axis=0):  # empirical conditional entropy H(D|A)
    data_len = len(datasets)
    feature_set = {}
    for i in range(data_len):
        feature = datasets[i][axis]
        if feature not in feature_set:
            feature_set[feature] = []
        feature_set[feature].append(datasets[i])
    cond_ent = sum([(len(p) / data_len) * cal_entropy(p)
                    for p in feature_set.values()])
    return cond_ent


def info_gain(entropy, cond_ent):  # information gain
    return entropy - cond_ent


def info_gain_train(datasets):  # feature selection by information gain
    count = len(datasets[0]) - 1
    entropy = cal_entropy(datasets)
    best_feature = []
    for i in range(count):
        info_gain_i = info_gain(entropy, cond_entropy(datasets, axis=i))
        best_feature.append((i, info_gain_i))
        print("特征({})- info_gain - {:.3f}".format(labels[i], info_gain_i))
    best_feature_i = max(best_feature, key=lambda x: x[-1])
    print("特征({})的信息增益最大,選為根節點的特征".format(labels[best_feature_i[0]]))


info_gain_train(np.array(datasets))
```

Output:

```
特征(年齡)- info_gain - 0.083
特征(有工作)- info_gain - 0.324
特征(有自己的房子)- info_gain - 0.420
特征(信貸情況)- info_gain - 0.363
特征(有自己的房子)的信息增益最大,選為根節點的特征
```

3. Decision Tree Generation
Feature selection usually takes maximum information gain, maximum information gain ratio, or minimum Gini index as the criterion.
Tree generation computes the information gain (or another criterion) for each feature and grows the tree recursively, starting from the root node.
This amounts to repeatedly choosing the locally optimal feature and splitting the training set into subsets that can be classified essentially correctly.
- ID3 only generates the tree (it has no pruning step), so the resulting tree tends to overfit.
- C4.5 is similar to ID3 with one improvement: during generation it selects features by information gain ratio.
3.1 Python Code
```python
import numpy as np
import pandas as pd
from math import log

# uses datasets and labels produced by get_data() in section 2.1


class Node():
    def __init__(self, root=True, label=None, feature_name=None, feature=None):
        self.root = root
        self.label = label
        self.feature_name = feature_name
        self.feature = feature
        self.tree = {}
        self.result = {'label:': self.label,
                       'feature:': self.feature,
                       'tree:': self.tree}

    def __repr__(self):  # like __str__, but aimed at debugging
        return '{}'.format(self.result)

    def add_node(self, val, node):
        self.tree[val] = node

    def predict(self, features):
        if self.root is True:
            return self.label
        return self.tree[features[self.feature]].predict(features)


class DTree():
    def __init__(self, epsilon=0.1):  # stop expanding the tree when information gain < epsilon
        self.epsilon = epsilon
        self._tree = {}

    @staticmethod
    def cal_entropy(datasets):  # empirical entropy H(D)
        data_len = len(datasets)
        label_count = {}
        for i in range(data_len):
            label = datasets[i][-1]
            if label not in label_count:
                label_count[label] = 0
            label_count[label] += 1
        entropy = -sum([(p / data_len) * log(p / data_len, 2)
                        for p in label_count.values()])
        return entropy

    def cond_entropy(self, datasets, axis=0):  # empirical conditional entropy H(D|A)
        data_len = len(datasets)
        feature_set = {}
        for i in range(data_len):
            feature = datasets[i][axis]
            if feature not in feature_set:
                feature_set[feature] = []
            feature_set[feature].append(datasets[i])
        cond_ent = sum([(len(p) / data_len) * self.cal_entropy(p)
                        for p in feature_set.values()])
        return cond_ent

    @staticmethod
    def info_gain(entropy, cond_ent):  # information gain
        return entropy - cond_ent

    def info_gain_train(self, datasets):  # feature selection by information gain
        count = len(datasets[0]) - 1
        entropy = self.cal_entropy(datasets)
        best_feature = []
        for i in range(count):
            info_gain_i = self.info_gain(entropy, self.cond_entropy(datasets, axis=i))
            best_feature.append((i, info_gain_i))
            print("特征({})- info_gain - {:.3f}".format(labels[i], info_gain_i))
        best_feature_i = max(best_feature, key=lambda x: x[-1])
        return best_feature_i

    def train(self, train_data):
        '''
        input: dataset D (DataFrame), feature set A, threshold eta
        return: decision tree DT
        '''
        _, y_train, features = train_data.iloc[:, :-1], train_data.iloc[:, -1], train_data.columns[:-1]
        # 1. if all instances in D belong to the same class, return that class directly
        if len(y_train.value_counts()) == 1:
            return Node(root=True, label=y_train.iloc[0])
        # 2. if the feature set A is empty, return the majority class of D
        if len(features) == 0:
            return Node(root=True,
                        label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 3. pick the feature with the largest information gain
        max_feature, max_info_gain = self.info_gain_train(np.array(train_data))
        max_feature_name = features[max_feature]
        # 4. if the gain is below epsilon, make a single node labeled with the majority class
        if max_info_gain < self.epsilon:
            return Node(root=True,
                        label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 5. split on each value of the chosen feature Ag
        node_tree = Node(root=False, feature_name=max_feature_name, feature=max_feature)
        feature_list = train_data[max_feature_name].value_counts().index
        for f in feature_list:
            sub_train_df = train_data.loc[train_data[max_feature_name] == f].drop([max_feature_name], axis=1)
            # 6. build the subtree recursively
            sub_tree = self.train(sub_train_df)
            node_tree.add_node(f, sub_tree)
        return node_tree

    def fit(self, train_data):
        self._tree = self.train(train_data)
        return self._tree

    def predict(self, X_test):
        return self._tree.predict(X_test)


train_data = pd.DataFrame(datasets, columns=labels)
dt = DTree()
tree = dt.fit(train_data)
print(dt.predict(['老年', '否', '否', '一般']))
print(dt.predict(['青年', '否', '是', '一般']))
print(dt.predict(['中年', '是', '否', '好']))
print(dt.predict(['老年', '否', '是', '一般']))
```

4. Decision Tree Pruning
學(xué)習(xí)時(shí),過(guò)多考慮準(zhǔn)確性,樹(shù)復(fù)雜,過(guò)擬合,泛化能力差,需要剪枝。
方法:極小化決策樹(shù)整體損失函數(shù)
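The post does not write this loss out, but the standard form used in Li Hang's book (which this post follows) trades training fit against tree size:

$$C_{\alpha}(T)=\sum_{t=1}^{|T|} N_{t} H_{t}(T)+\alpha|T|$$

where $|T|$ is the number of leaf nodes, $N_t$ and $H_t(T)$ are the sample count and empirical entropy of leaf $t$, and $\alpha \ge 0$ controls the trade-off; a subtree is pruned whenever pruning it does not increase $C_{\alpha}(T)$.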
5. The CART Algorithm
分類(lèi)與回歸樹(shù)(classification and regression tree,CART)模型
- 二叉樹(shù)
- 左分支,是;右分支,否
- (1)決策樹(shù)生成:基于訓(xùn)練數(shù)據(jù)集生成決策樹(shù),生成的決策樹(shù)要盡量大;
(2)決策樹(shù)剪枝:用驗(yàn)證數(shù)據(jù)集對(duì)已生成的樹(shù)進(jìn)行剪枝并選擇最優(yōu)子樹(shù),這時(shí)用損失函數(shù)最小作為剪枝的標(biāo)準(zhǔn)。
6. sklearn Examples
sklearn.tree.DecisionTreeClassifier
```python
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None,
    min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None,
    random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None,
    class_weight=None, presort='deprecated', ccp_alpha=0.0)
```
- `criterion`: the feature-selection (split quality) criterion, 'gini' or 'entropy';
- other parameters control the split strategy, the maximum tree depth, the minimum number of samples required to split a node, the number of leaf nodes, and so on (a short usage example follows below).
6.1 書(shū)上貸款例子
# ---------書(shū)上貸款例子----------------- datasets, labels = get_data() train_data = np.array(pd.DataFrame(datasets, columns=labels)) X_train, y_train = train_data[:, :-1], train_data[:, -1:] encoder = preprocessing.OrdinalEncoder() # 將字符轉(zhuǎn)成浮點(diǎn) encoder.fit(X_train) # 先擬合 X_train = encoder.transform(X_train) # 轉(zhuǎn)換成數(shù)字 A = encoder.transform([['青年', '否', '是', '一般']]) B = encoder.transform([['中年', '是', '否', '好']]) C = encoder.transform([['老年', '否', '是', '一般']])encoder = preprocessing.OrdinalEncoder() encoder.fit(y_train) y_train = encoder.transform(y_train) # sklearn 決策樹(shù) clf = DecisionTreeClassifier() clf.fit(X_train, y_train) print(encoder.inverse_transform([clf.predict(A)])) print(clf.predict_proba(B)) print(clf.predict_proba(C)) [['是']] [[0. 1.]] [[0. 1.]]6.2 鳶尾花 及 決策樹(shù)可視化
```python
import numpy as np
import pandas as pd
import graphviz
import pydotplus
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# ------------ iris ---------------
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
data = np.array(df.iloc[:100, [0, 1, -1]])
X = data[:, :2]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = DecisionTreeClassifier()
print(clf)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# -------------- tree visualization -------------
# requires graphviz to be installed and on PATH
with open('mytree.dot', 'w', encoding='utf-8') as f:
    dot_data = export_graphviz(clf, out_file=None, feature_names=df.columns[:2],
                               filled=True, rounded=True, special_characters=True,
                               class_names=iris.target_names[0:2])
dot = graphviz.Source(dot_data)
dot.view()
# write png / pdf
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('tree.png')
# cmd: dot -Tpdf tree.dot -o output.pdf, dot -Tpng tree.dot -o output.png
```

(Figure: decision tree visualization)
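A side note not in the original post: if graphviz is unavailable, recent scikit-learn versions can draw the same fitted tree with matplotlib via `sklearn.tree.plot_tree`; a minimal sketch:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# draws the iris classifier fitted above without needing graphviz on PATH
fig, ax = plt.subplots(figsize=(8, 6))
plot_tree(clf, feature_names=['sepal length', 'sepal width'],
          class_names=list(iris.target_names[0:2]), filled=True, rounded=True, ax=ax)
plt.savefig('tree_plt.png')
```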
Appendix: Complete Code
```python
# -*- coding:utf-8 -*-
# @Python Version: 3.7
# @Time: 2020/3/13 19:36
# @Author: Michael Ming
# @Website: https://michael.blog.csdn.net/
# @File: 5.decisionTree.py
# @Reference: https://github.com/fengdu78/lihang-code

import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from collections import Counter
import math
from math import log
import pprint
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
import graphviz
import pydotplus


def get_data():
    datasets = [['青年', '否', '否', '一般', '否'],
                ['青年', '否', '否', '好', '否'],
                ['青年', '是', '否', '好', '是'],
                ['青年', '是', '是', '一般', '是'],
                ['青年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '一般', '否'],
                ['中年', '否', '否', '好', '否'],
                ['中年', '是', '是', '好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['中年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '非常好', '是'],
                ['老年', '否', '是', '好', '是'],
                ['老年', '是', '否', '好', '是'],
                ['老年', '是', '否', '非常好', '是'],
                ['老年', '否', '否', '一般', '否'],
                ]
    labels = [u'年齡', u'有工作', u'有自己的房子', u'信貸情況', u'分類']
    # the u prefix marks a Unicode literal (harmless in Python 3)
    return datasets, labels


# --------- loan example from the book -----------------
datasets, labels = get_data()
train_data = np.array(pd.DataFrame(datasets, columns=labels))
X_train, y_train = train_data[:, :-1], train_data[:, -1:]
encoder = preprocessing.OrdinalEncoder()  # encode the categorical strings as numbers
encoder.fit(X_train)                      # fit first
X_train = encoder.transform(X_train)      # then transform to numeric codes
A = encoder.transform([['青年', '否', '是', '一般']])
B = encoder.transform([['中年', '是', '否', '好']])
C = encoder.transform([['老年', '否', '是', '一般']])

encoder = preprocessing.OrdinalEncoder()
encoder.fit(y_train)
y_train = encoder.transform(y_train)
# sklearn decision tree
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
print(encoder.inverse_transform([clf.predict(A)]))
print(clf.predict_proba(B))
print(clf.predict_proba(C))

# -------------- tree visualization -------------
# requires graphviz to be installed and on PATH
with open('mytree.dot', 'w', encoding='utf-8') as f:
    dot_data = export_graphviz(clf, out_file=None, feature_names=clf.feature_importances_,
                               filled=True, rounded=True, special_characters=True)
dot = graphviz.Source(dot_data)
# dot.view()
# write png / pdf
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('tree.png')
# cmd: dot -Tpdf tree.dot -o output.pdf, dot -Tpng tree.dot -o output.png


# ----------- hand-written version, following the book -----------
# ---- feature selection based on information gain ----
def cal_entropy(datasets):  # empirical entropy H(D)
    data_len = len(datasets)
    label_count = {}
    for i in range(data_len):
        label = datasets[i][-1]
        if label not in label_count:
            label_count[label] = 0
        label_count[label] += 1
    entropy = -sum([(p / data_len) * log(p / data_len, 2)
                    for p in label_count.values()])
    return entropy


def cond_entropy(datasets, axis=0):  # empirical conditional entropy H(D|A)
    data_len = len(datasets)
    feature_set = {}
    for i in range(data_len):
        feature = datasets[i][axis]
        if feature not in feature_set:
            feature_set[feature] = []
        feature_set[feature].append(datasets[i])
    cond_ent = sum([(len(p) / data_len) * cal_entropy(p)
                    for p in feature_set.values()])
    return cond_ent


def info_gain(entropy, cond_ent):  # information gain
    return entropy - cond_ent


def info_gain_train(datasets):  # feature selection by information gain
    count = len(datasets[0]) - 1
    entropy = cal_entropy(datasets)
    best_feature = []
    for i in range(count):
        info_gain_i = info_gain(entropy, cond_entropy(datasets, axis=i))
        best_feature.append((i, info_gain_i))
        print("特征({})- info_gain - {:.3f}".format(labels[i], info_gain_i))
    best_feature_i = max(best_feature, key=lambda x: x[-1])
    print("特征({})的信息增益最大,選為根節點的特征".format(labels[best_feature_i[0]]))


info_gain_train(np.array(datasets))


# ------- ID3 tree generation -------
class Node():
    def __init__(self, root=True, label=None, feature_name=None, feature=None):
        self.root = root
        self.label = label
        self.feature_name = feature_name
        self.feature = feature
        self.tree = {}
        self.result = {'label:': self.label,
                       'feature:': self.feature,
                       'tree:': self.tree}

    def __repr__(self):  # like __str__, but aimed at debugging
        return '{}'.format(self.result)

    def add_node(self, val, node):
        self.tree[val] = node

    def predict(self, features):
        if self.root is True:
            return self.label
        return self.tree[features[self.feature]].predict(features)


class DTree():
    def __init__(self, epsilon=0.1):  # stop expanding the tree when information gain < epsilon
        self.epsilon = epsilon
        self._tree = {}

    @staticmethod
    def cal_entropy(datasets):  # empirical entropy H(D)
        data_len = len(datasets)
        label_count = {}
        for i in range(data_len):
            label = datasets[i][-1]
            if label not in label_count:
                label_count[label] = 0
            label_count[label] += 1
        entropy = -sum([(p / data_len) * log(p / data_len, 2)
                        for p in label_count.values()])
        return entropy

    def cond_entropy(self, datasets, axis=0):  # empirical conditional entropy H(D|A)
        data_len = len(datasets)
        feature_set = {}
        for i in range(data_len):
            feature = datasets[i][axis]
            if feature not in feature_set:
                feature_set[feature] = []
            feature_set[feature].append(datasets[i])
        cond_ent = sum([(len(p) / data_len) * self.cal_entropy(p)
                        for p in feature_set.values()])
        return cond_ent

    @staticmethod
    def info_gain(entropy, cond_ent):  # information gain
        return entropy - cond_ent

    def info_gain_train(self, datasets):  # feature selection by information gain
        count = len(datasets[0]) - 1
        entropy = self.cal_entropy(datasets)
        best_feature = []
        for i in range(count):
            info_gain_i = self.info_gain(entropy, self.cond_entropy(datasets, axis=i))
            best_feature.append((i, info_gain_i))
            print("特征({})- info_gain - {:.3f}".format(labels[i], info_gain_i))
        best_feature_i = max(best_feature, key=lambda x: x[-1])
        return best_feature_i

    def train(self, train_data):
        '''
        input: dataset D (DataFrame), feature set A, threshold eta
        return: decision tree DT
        '''
        _, y_train, features = train_data.iloc[:, :-1], train_data.iloc[:, -1], train_data.columns[:-1]
        # 1. if all instances in D belong to the same class, return that class directly
        if len(y_train.value_counts()) == 1:
            return Node(root=True, label=y_train.iloc[0])
        # 2. if the feature set A is empty, return the majority class of D
        if len(features) == 0:
            return Node(root=True,
                        label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 3. pick the feature with the largest information gain
        max_feature, max_info_gain = self.info_gain_train(np.array(train_data))
        max_feature_name = features[max_feature]
        # 4. if the gain is below epsilon, make a single node labeled with the majority class
        if max_info_gain < self.epsilon:
            return Node(root=True,
                        label=y_train.value_counts().sort_values(ascending=False).index[0])
        # 5. split on each value of the chosen feature Ag
        node_tree = Node(root=False, feature_name=max_feature_name, feature=max_feature)
        feature_list = train_data[max_feature_name].value_counts().index
        for f in feature_list:
            sub_train_df = train_data.loc[train_data[max_feature_name] == f].drop([max_feature_name], axis=1)
            # 6. build the subtree recursively
            sub_tree = self.train(sub_train_df)
            node_tree.add_node(f, sub_tree)
        return node_tree

    def fit(self, train_data):
        self._tree = self.train(train_data)
        return self._tree

    def predict(self, X_test):
        return self._tree.predict(X_test)


train_data = pd.DataFrame(datasets, columns=labels)
dt = DTree()
tree = dt.fit(train_data)
print(dt.predict(['老年', '否', '否', '一般']))
print(dt.predict(['青年', '否', '是', '一般']))
print(dt.predict(['中年', '是', '否', '好']))
print(dt.predict(['老年', '否', '是', '一般']))

# ------------ iris ---------------
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['label'] = iris.target
df.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'label']
data = np.array(df.iloc[:100, [0, 1, -1]])
X = data[:, :2]
y = data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
clf = DecisionTreeClassifier()
print(clf)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# -------------- tree visualization -------------
# requires graphviz to be installed and on PATH
with open('mytree.dot', 'w', encoding='utf-8') as f:
    dot_data = export_graphviz(clf, out_file=None, feature_names=df.columns[:2],
                               filled=True, rounded=True, special_characters=True,
                               class_names=iris.target_names[0:2])
dot = graphviz.Source(dot_data)
dot.view()
# write png / pdf
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_png('tree.png')
# cmd: dot -Tpdf tree.dot -o output.pdf, dot -Tpng tree.dot -o output.png
```