當(dāng)前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

机器学习：决策树过拟合与剪枝，决策树代码实现（三）

發(fā)布時(shí)間：2024/9/15 编程问答 31 豆豆

生活随笔收集整理的這篇文章主要介紹了机器学习：决策树过拟合与剪枝，决策树代码实现（三）小編覺得挺不錯的,現(xiàn)在分享給大家,幫大家做個參考.

文章目錄

楔子
變量
方法
- 數(shù)據(jù)預(yù)處理
- 剪枝
- - 獲取待剪集：
  - 針對ID3，C4.5的剪枝
  - - 損失函數(shù)的設(shè)計(jì)
    - 基于該損失函數(shù)的算法描述
    - 基于該損失函數(shù)的代碼實(shí)現(xiàn)
  - 針對CART的剪枝
  - - 損失函數(shù)的設(shè)計(jì)
    - 基于該損失函數(shù)的算法描述
    - 基于該損失函數(shù)的代碼實(shí)現(xiàn)
    - - 獲得p顆生成樹
      - 選取最優(yōu)生成樹
- 整個流程處理fit():

楔子

上次講到：至此node類的變量和方法基本實(shí)現(xiàn)完畢，為什么說基本呢，因?yàn)檎嬲暮蠹糁€沒講，他還需要在node類里添加一些方法。這一次來講一下后剪枝。

首先，后剪枝是對整個生成樹操作，我們給整個樹的操作定義一個基類，定義一個新類就涉及到：變量和方法

Tree 結(jié)構(gòu)需要做到如下幾點(diǎn)：

定義好需要在各個 Node 上調(diào)用的“全局變量” 做好數(shù)據(jù)預(yù)處理的工作、保證傳給 Node 的數(shù)據(jù)是合乎要求的對各個 Node 進(jìn)行合適的封裝，做到：生成決策樹時(shí)能夠正確地調(diào)用它們的生成算法進(jìn)行后剪枝時(shí)能夠正確地調(diào)用它們的局部剪枝函數(shù) 定義預(yù)測函數(shù)和評估函數(shù)以供用戶調(diào)用

變量

首先，我們思考一下，我們整體考慮生成樹，并對樹進(jìn)行操作，我們需要操作哪些對象：

1、我們需要剪枝，就需要對結(jié)點(diǎn)操作，在這里我們不好每次都遍歷樹一遍，我們把所有的node存下來專門處理，self.nodes = []
2、每個node都有一個可選features的列表，但是選中某個feature之后，遍歷featureValue時(shí)，在node里面沒有變量定義，在全局變量里面定義一個，所有features的featureValue的變量：self.feature_sets；同樣的道理各個特征的維度是否連續(xù)也是如此：self.whether_continuous
3、剪枝屬于全局的操作，變量也應(yīng)該是全局的：限制樹的深度：self.max_depth；CART種需要處理p顆生成樹：self.roots
4、還有一個最終要的變量，就是樹的根：self.root

from copy import deepcopy from Node import * import numpy as np class CvBase:def __init__(self,max_depth= None, node= None):# self.nodes:記錄所有node的列表self.nodes = []# self.roots:主要用于CART屬性，存儲算法過程中的各個決策樹self.roots = []# self.max_depth:用于記錄決策樹的最大深度self.max_depth = max_depth# self.root: 根節(jié)點(diǎn)self.root = node# self.feature_sets：用于記錄可選特征維度的列表self.feature_sets = []# self.label_dic:類別的轉(zhuǎn)換字典self.label_dic = {}# self.prune_alpha,self.layers:ID3和C4.5剪枝的兩個屬性self.prune_alpha = 1# 前者是懲罰因子，后者是記錄每一層的node# self.whether_continuous:記錄各維度特征是否是連續(xù)的self.whether_continuous = Nonedef __str__(self):return "CvTree ({})".format(self.root.height)__repr__ = __str__

方法

數(shù)據(jù)預(yù)處理

自動判斷哪些features為連續(xù)的；初始化樹的全局變量

def feed_data(self, x, continuous_rate = 0.2):# continuous_rate用于判斷該維度是否是連續(xù)的# 利用set獲取各個維度的特征可能取值self.feature_sets = [set(dimension) for dimension in x.T]data_len, data_dim = x.shape# 判斷是否連續(xù)self.whether_continuous = np.array([len(feat) >= continuous_rate*data_len for feat in self.feature_sets])# 根節(jié)點(diǎn)可選的劃分特征維度self.root.feats = [i for i in range(x.shape[1])]# 把self.root.feed_tree(self)

最后一行我們對根節(jié)點(diǎn)調(diào)用了feed_tree方法，該方法會做以下三件事：

讓決策樹中所有的 Node 記錄一下它們所屬的 Tree 結(jié)構(gòu) 將自己記錄在 Tree 中記錄所有 Node 的列表nodes里根據(jù) Tree 的相應(yīng)屬性更新記錄連續(xù)特征的列表 # 栽樹，會做三件事# 決策樹所有node記錄他們屬于哪一顆樹# 把所有結(jié)點(diǎn)保存到self.tree.nodes# 更新每一個結(jié)點(diǎn)的特征是否連續(xù)的列表def feed_tree(self, tree):self.tree = treeself.tree.nodes.append(self)self.wc = tree.whether_continuousfor child in self.children.values():if child is not None:child.feed_tree(tree)

剪枝

剪枝時(shí)，需要獲取所有的非葉子結(jié)點(diǎn)，為待剪集，從底層像高層一層一層的剪枝。

獲取待剪集：

# ============================================================================= # # 定義Prune # 因?yàn)槭呛蠹糁κ轻槍θ值目紤]，要決定那些結(jié)點(diǎn)需要剪枝，然后再調(diào)用結(jié)點(diǎn)的剪枝 # =============================================================================# 獲取每一層的結(jié)點(diǎn)self.layers:[depth,node_lst] = nodedef _update_layers(self):self.layers = [[] for _ in range(self.root.height)]self.root.update_layers()# Util# 獲取以當(dāng)前結(jié)點(diǎn)為根的樹的每一層結(jié)點(diǎn)列表def update_layers(self):self.tree.layers[self._depth].append(self)for node in sorted(self.children):node = self.children[node]if node is not None:node.update_layers()

針對ID3，C4.5的剪枝

損失函數(shù)的設(shè)計(jì)

# 新的損失函數(shù)，當(dāng)未剪枝時(shí)損失，已剪枝或者葉子的損失def cost(self, pruned=False):if not pruned:return sum([leaf["chaos"] * len(leaf["y"]) for leaf in self.leafs.values()])return self.chaos * len(self._y)# node.cost() + self.prune_alpha * len(node.leafs)

基于該損失函數(shù)的算法描述

基于該損失函數(shù)的代碼實(shí)現(xiàn)

# 離散數(shù)據(jù)的剪枝函數(shù)def _prune(self):# 獲取生成樹每一層的結(jié)點(diǎn)，每一層結(jié)點(diǎn)按照其劃分feature順序排列self._update_layers()# 用于保存所有的非葉子結(jié)點(diǎn)，為待剪枝結(jié)點(diǎn)，保存順序前面的靠近底部，后面的靠近根部tmp_nodes = []append = tmp_nodes.appendfor node_lst in self.layers[::-1]:for node in node_lst[::-1]:if node.category is None:append(node)# 剪枝的新?lián)p失函數(shù) = 各個葉子不確定度*葉子樣本數(shù)量加權(quán)和 + alpha*葉子個數(shù)# old為剪枝前的損失函數(shù)，所有的待剪枝結(jié)點(diǎn)的剪枝前的損失函數(shù)old = np.array([node.cost() + self.prune_alpha * len(node.leafs) for node in tmp_nodes])# 假如進(jìn)行剪枝后，當(dāng)前結(jié)點(diǎn)變成葉子，損失函數(shù) = 當(dāng)前結(jié)點(diǎn)的不確定度*樣本個數(shù) + alpha*1new = np.array([node.cost(pruned=True) + self.prune_alpha for node in tmp_nodes])# 根據(jù)這個得到待剪枝的結(jié)點(diǎn)maskmask = old >= newwhile True:# 剪到根時(shí)退出if self.root.height == 1:break# 獲取最深的待剪枝的結(jié)點(diǎn)，從下往上的剪枝，取的是第一個True，前面都是靠近底部的結(jié)點(diǎn)p = np.argmax(mask) # type: int# 判斷一下是否是可剪枝的，每次剪枝之后，會影響上層的結(jié)點(diǎn)，可能Ture變成了False，# 最后一次時(shí)里面，里面可能全部都是False if mask[p]:# 對這個結(jié)點(diǎn)剪枝，該做的操作在結(jié)點(diǎn)里面都操作了，里面還有一項(xiàng)操作# 就是剪枝該結(jié)點(diǎn)，會對那些結(jié)點(diǎn)有影響，就是他的祖宗們，已標(biāo)記node.affectetmp_nodes[p].prune()# 遍歷所有的待剪枝結(jié)點(diǎn)，挑出被當(dāng)前結(jié)點(diǎn)影響的結(jié)點(diǎn)for i, node in enumerate(tmp_nodes):if node.affected:# 更新那些結(jié)點(diǎn)的損失函數(shù)old[i] = node.cost() + self.prune_alpha * len(node.leafs)# 再次判斷是否需要被剪枝，new是不會變的他只和樣本有關(guān)mask[i] = old[i] >= new[i]# 重置一下，以免下次也更新他了node.affected = False# 把標(biāo)記為已剪枝的結(jié)點(diǎn)從待剪枝結(jié)點(diǎn)列表刪除，當(dāng)前結(jié)點(diǎn)也是標(biāo)記為已剪枝的# 他已經(jīng)變成葉子結(jié)點(diǎn)，葉子結(jié)點(diǎn)是不在待剪枝列表的for i in range(len(tmp_nodes) - 1, -1, -1):if tmp_nodes[i].pruned:tmp_nodes.pop(i)old = np.delete(old, i)new = np.delete(new, i)mask = np.delete(mask, i)# 假如待剪枝列表沒有可剪枝的也退出else:break# 剪枝完畢之后，新的生成樹，更新一下，這棵樹的nodes列表，把前面刪除的葉子都刪除掉# 前后的剪枝函數(shù)主要處理的是leafs，沒有處理nodes,所以最后處理一下。self.reduce_nodes()

針對CART的剪枝

損失函數(shù)的設(shè)計(jì)

這個的設(shè)計(jì)思想是，隨著懲罰因子alpha從0到大不斷增加，結(jié)點(diǎn)被一個一個剪掉，每剪掉一顆都是一棵樹保存起來，最后只剩下root,形成了p棵樹，求p棵樹里面的最優(yōu)樹。
每一個結(jié)點(diǎn)都有一個alpha的閾值，超過了這個閾值，該節(jié)點(diǎn)就可以被剪掉。
閾值的實(shí)現(xiàn)：

# 獲取該節(jié)點(diǎn)的閾值，就是懲罰因子有多大時(shí)，就輪到這個結(jié)點(diǎn)被剪掉了，# 當(dāng)然這個可能會隨著一些結(jié)點(diǎn)被剪掉而變化，# 隨著懲罰因子的變大，結(jié)點(diǎn)會一個一個剪掉，知道只剩下根def get_threshold(self):return (self.cost(pruned=True) - self.cost()) / (len(self.leafs) - 1)# 說初始化整顆樹的self.tree值，這棵樹的每個結(jié)點(diǎn)屬于哪棵樹

基于該損失函數(shù)的算法描述

基于該損失函數(shù)的代碼實(shí)現(xiàn)

獲得p顆生成樹

# CART的剪枝處理def _cart_prune(self):# 初始化整顆樹的self.tree值，這棵樹的每個結(jié)點(diǎn)屬于哪棵樹self.root.cut_tree()# 獲取待剪枝的結(jié)點(diǎn)列表，也就是非葉子結(jié)點(diǎn)tmp_nodes = [node for node in self.nodes if node.category is None]# 計(jì)算這些候選集的閾值thresholds = np.array([node.get_threshold() for node in tmp_nodes])while True:# 理論上我們需要記錄p棵樹，然后在p顆樹里找最好的那棵樹，# 因此我們需要深度copy原始樹，在此基本上剪枝，每次形成不同的樹root_copy = deepcopy(self.root)# self.roots用于記錄產(chǎn)生的p棵樹，先把原始樹存進(jìn)來self.roots.append(root_copy)# 出口，只剩根結(jié)點(diǎn)了，p棵樹產(chǎn)生完畢if self.root.height == 1:break# 取閾值最低的結(jié)點(diǎn)，那個結(jié)點(diǎn)第一個被剪p = np.argmin(thresholds) # type: int# 下面的處理和離散處理一致tmp_nodes[p].prune()# 剪掉之后，看哪些結(jié)點(diǎn)受影響了，更新受影響的結(jié)點(diǎn)for i, node in enumerate(tmp_nodes):if node.affected:# 對于受影響的結(jié)點(diǎn)，更新一下閾值thresholds[i] = node.get_threshold()node.affected = Falsepop = tmp_nodes.popfor i in range(len(tmp_nodes) - 1, -1, -1):if tmp_nodes[i].pruned:pop(i)thresholds = np.delete(thresholds, i)self.reduce_nodes()

選取最優(yōu)生成樹

# 定義選擇那個樹最優(yōu)的標(biāo)準(zhǔn)，使用加權(quán)正確率作為交叉驗(yàn)證的標(biāo)準(zhǔn)def acc(self, y, y_pred, weights):if weights is not None:return np.sum((np.array(y) == np.array(y_pred))*weights) /len(y)return np.sum(np.array(y) == np.array(y_pred)) /len(y)# 后剪枝是通過比較每棵樹在驗(yàn)證集上的表現(xiàn)來找出最優(yōu)樹def prune(self, x_cv, y_cv, weights):if self.root.is_cart:if x_cv is not None and y_cv is not None:self._cart_prune()# 選出最優(yōu)的子樹arg = np.argmax([self.acc(y_cv, tree.predict(x_cv), weights) for tree in self.roots]) # type: inttar_root = self.roots[arg]self.nodes = []# 更新一下樹的相關(guān)信息，所屬tree,所有的nodestar_root.feed_tree(self)# 把指針給rootself.root = tar_rootelse:self._prune()

整個流程處理fit():

方法都有了下面就開始整個操作流程：準(zhǔn)備數(shù)據(jù)，數(shù)據(jù)預(yù)處理，生成樹，剪枝

# =============================================================================# 參數(shù)alpha和剪枝有關(guān)；cv_rate用于控制交叉驗(yàn)證集大小；train_only是否進(jìn)行數(shù)據(jù)集切分def fit(self,x,y,alpha= None, sample_weight= None, eps= 12-8, cv_rate= 0.2, train_only= False):# 數(shù)值化類別向量_dic = {c:i for i,c in enumerate(set(y))}# 將y數(shù)值化y = np.array([_dic[yy] for yy in y])# 保存ID-->class映射，這樣才可以反向找回去self.label_dic = {value:key for key,value in _dic.items()}# 如果x為非數(shù)值的，也需要數(shù)值化x = np.array(x)# 根據(jù)特征個數(shù)給出alphaself.prune_alpha = alpha if alpha is not None else x.shape[1]/2# 劃分?jǐn)?shù)據(jù)集if not train_only and self.root.is_cart:# 利用下標(biāo)實(shí)現(xiàn)各種切分_train_num = int(len(x)*(1-cv_rate))# 相當(dāng)于打亂了順序_indices = np.random.permutation(np.arange(len(x)))_train_indices = _indices[:_train_num]_test_indices = _indices[_train_num:]# 針對樣本權(quán)重的處理if sample_weight is not None:# 切分后的樣本權(quán)重需要做歸一化處理_train_weight = sample_weight[_train_indices]_test_weight = sample_weight[_test_indices]# 歸一化_train_weight /= np.sum(_train_weight)_test_weight /= np.sum(_test_weight)else:_train_weight = _test_weight = Nonex_train, y_train = x[_train_indices],y[_train_indices]x_cv, y_cv = x[_test_indices],y[_test_indices]else:x_train, y_train, _train_weight = x, y, sample_weightx_cv = y_cv = _test_weight = None# 數(shù)據(jù)預(yù)處理 self.feed_data(x_train)# 調(diào)用根節(jié)點(diǎn)的生成算法self.root.fit(x_train, y_train, _train_weight, eps)# 調(diào)用對node的剪枝算法的封裝self.prune(x_cv, y_cv, _test_weight)# 定義刪除結(jié)點(diǎn)方法，從后往前刪除，這樣就可以使用popdef reduce_nodes(self):for i in range(len(self.nodes)-1, -1, -1):if self.nodes[i].pruned:self.nodes.pop(i) 與50位技術(shù)專家面對面20年技術(shù)見證，附贈技術(shù)全景圖

總結(jié)

以上是生活随笔為你收集整理的机器学习：决策树过拟合与剪枝，决策树代码实现（三）的全部內(nèi)容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網(wǎng)站內(nèi)容還不錯，歡迎將生活随笔推薦給好友。

上一篇：机器学习：结点的实现，决策树代码实现（二
下一篇：机器学习：集成学习（ensemble)，

3atv精品不卡视频,97人人超碰国产精品最新,中文字幕av一区二区三区人妻少妇,久久久精品波多野结衣,日韩一区二区三区精品

编程问答

机器学习：决策树过拟合与剪枝，决策树代码实现（三）

文章目錄

楔子

變量

方法

數(shù)據(jù)預(yù)處理

剪枝

獲取待剪集：

針對ID3，C4.5的剪枝

損失函數(shù)的設(shè)計(jì)

基于該損失函數(shù)的算法描述

基于該損失函數(shù)的代碼實(shí)現(xiàn)

針對CART的剪枝

損失函數(shù)的設(shè)計(jì)

基于該損失函數(shù)的算法描述

基于該損失函數(shù)的代碼實(shí)現(xiàn)

獲得p顆生成樹

選取最優(yōu)生成樹

整個流程處理fit():

總結(jié)