Machine Learning in Action Notes (03): Decision Trees
Decision Trees
The kNN algorithm can handle many classification tasks, but its biggest drawback is that it cannot reveal the underlying meaning of the data. The main advantage of decision trees is that the form of the learned model is very easy to understand.
Constructing a decision tree
- Advantages: low computational complexity, output that is easy to interpret, insensitivity to missing intermediate values, and the ability to handle irrelevant features.
- Disadvantage: prone to overfitting.
Applicable data types: numeric and nominal.
Pseudocode for the branch-creating function createBranch():
```
Check if every item in the dataset is in the same class:
    If so return the class label
    Else
        find the best feature to split the data
        split the dataset
        create a branch node
        for each split
            call createBranch() and add the result to the branch node
        return branch node
```

Example data
Marine animal data
| Can survive without surfacing? | Has flippers? | Is a fish? |
| --- | --- | --- |
| yes | yes | yes |
| yes | yes | yes |
| yes | no | no |
| no | yes | no |
| no | yes | no |
Information gain
The guiding principle for splitting a dataset is to make disordered data more ordered.
One way to organize messy data is to use information theory to measure the information it contains.
The change in information before and after splitting the dataset is called the information gain.
Once we know how to compute information gain, we can compute the gain obtained by splitting the dataset on each feature; the feature with the highest information gain is the best one to split on.
Von Neumann suggested using the term entropy.
Information gain is the reduction in entropy (the disorder of the data); thinking of it as a decrease in disorder is usually the easier way to understand it.
The measure of the information of a set is called the Shannon entropy, or simply entropy. (For more on entropy, see What Is Information Entropy.)
Entropy is defined as the expected value of the information.
Definition of information
If the item being classified may fall into one of several classes, the information of the symbol $x_i$ is defined as

$$l(x_i) = -\log_2 p(x_i)$$

where $p(x_i)$ is the probability of choosing that class.
為了計(jì)算熵,我們需要計(jì)算所有分類別所有可能值包含的信息期望值,通過下面的公式得到
trees.py
計(jì)算給定數(shù)據(jù)集的香農(nóng)熵
```python
from math import log

def calcShannonEnt(dataSet):
    # total number of instances
    numEntries = len(dataSet)
    # count how often each value of the target variable occurs
    labelCounts = {}
    for featVec in dataSet:
        # the last item of each instance is the target variable
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    # apply the entropy formula above
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)  # log base 2
    return shannonEnt
```

Create the dataset
```python
def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```

Run
testTree.py
```python
# -*- coding: utf-8 -*-
import trees

dataSet, labels = trees.createDataSet()
print dataSet   # [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print labels    # ['no surfacing', 'flippers']

# compute the entropy
print trees.calcShannonEnt(dataSet)   # 0.970950594455

# change one label
dataSet[0][-1] = 'maybe'
print dataSet   # [[1, 1, 'maybe'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print trees.calcShannonEnt(dataSet)   # 1.37095059445
```

The higher the entropy, the more mixed the data.
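As a quick sanity check (my own arithmetic, not from the book): with two 'yes' and three 'no' labels the formula gives

$$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971,$$

which matches the first printed value. After one label is changed to 'maybe' there are three classes with probabilities 1/5, 1/5 and 3/5, which gives the higher value 1.371.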
Aside: another way to measure the disorder of a set is the Gini impurity. Put simply, it is the probability that an item chosen at random from the dataset would be misclassified into one of the other groups.
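For comparison, here is a minimal sketch of that measure in the same style as calcShannonEnt above (calcGini is my own name for it, not a function from the book):

```python
def calcGini(dataSet):
    # count how often each class label occurs (the last column is the label)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    # Gini impurity = 1 - sum of squared class probabilities,
    # i.e. the chance a random item is labelled wrongly by a random class draw
    numEntries = float(len(dataSet))
    gini = 1.0
    for key in labelCounts:
        prob = labelCounts[key] / numEntries
        gini -= prob * prob
    return gini
```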
Splitting the dataset
```python
# axis is the column index;
# returns the instances whose value in that column equals value, with the column removed
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # cut out the column at position axis
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
```

Run
testTree.py
```python
print dataSet
# [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

# split the dataset
# instances whose value in column 0 is 0
print trees.splitDataSet(dataSet, 0, 0)   # [[1, 'no'], [1, 'no']]
# instances whose value in column 0 is 1
print trees.splitDataSet(dataSet, 0, 1)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
```

The difference between append and extend:

```python
>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> a.append(b)
>>> a
[1, 2, 3, [4, 5, 6]]
>>> a = [1, 2, 3]
>>> a.extend(b)
>>> a
[1, 2, 3, 4, 5, 6]
```

Choosing the best way to split the dataset (added to trees.py):

```python
def chooseBestFeatureToSplit(dataSet):
    # number of features; the last column is the target variable
    numFeatures = len(dataSet[0]) - 1
    # baseline: the Shannon entropy of the target variable
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    # iterate over the features; i is the column index
    for i in range(numFeatures):
        # all values of this feature (one column), via a list comprehension
        featList = [example[i] for example in dataSet]
        # unique feature values
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            # subset of instances with this value, column i removed
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            # new entropy: the weighted sum of the subset entropies
            newEntropy += prob * calcShannonEnt(subDataSet)
        # information gain of splitting on feature i
        infoGain = baseEntropy - newEntropy
        # keep the feature with the largest gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
```
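To make the inner loop explicit (the comment in my original notes said the formula was a bit unclear): newEntropy is the weighted average of the subset entropies, and the information gain of splitting dataset $D$ on feature $A$ is

$$\operatorname{Gain}(D, A) = H(D) \;-\; \sum_{v \in \text{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)$$

where $D_v$ is the subset of $D$ whose value of feature $A$ equals $v$. The second term is the entropy remaining after the split, so the best feature is the one that removes the most disorder.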
Run

```python
print trees.chooseBestFeatureToSplit(dataSet)   # 0
# this tells us that feature 0 is the best feature for splitting the dataset

# some intermediate values inside chooseBestFeatureToSplit(dataSet):
# baseEntropy: 0.970950594455
# column 0
#   value: 0
#   value: 1
#   newEntropy: 0.550977500433
# column 1
#   value: 0
#   value: 1
#   newEntropy: 0.8
```
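Plugging those intermediate values into the gain formula above (my own arithmetic, as a check):

$$\operatorname{Gain}(D,\text{feature 0}) = 0.971 - 0.551 \approx 0.420, \qquad \operatorname{Gain}(D,\text{feature 1}) = 0.971 - 0.800 \approx 0.171$$

Feature 0 ('no surfacing') gives the larger gain, which is why 0 is returned.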
Building the tree recursively

Return the class name that occurs most often:
```python
import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # sort by count, descending, and return the most frequent class
    sortedClassCount = sorted(classCount.iteritems(),
                              key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
```

The function that builds the tree:
```python
def createTree(dataSet, labels):
    # values of the target variable
    classList = [example[-1] for example in dataSet]
    # stop splitting when all of the classes are equal
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop splitting when there are no more features in the dataset
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        # copy the labels so the recursive calls don't mess up the existing list
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
```

Result of running it:
print "---createTree---"print trees.createTree(dataSet, labels)""" ---createTree--- classList: ['yes', 'yes', 'no', 'no', 'no'] baseEntropy: 0.970950594455 value: 0 value: 1 newEntropy: 0.550977500433 value: 0 value: 1 newEntropy: 0.8 --- classList: ['no', 'no'] --- classList: ['yes', 'yes', 'no'] baseEntropy: 0.918295834054 value: 0 value: 1 newEntropy: 0.0 --- classList: ['no'] --- classList: ['yes', 'yes']---最終運(yùn)行結(jié)果--- {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}} """在Python中使用Matplotlib注解繪制樹形圖
Drawing the tree with Matplotlib annotations in Python

Matplotlib annotations (annotate)
testCreatePlot.py
```python
import matplotlib.pyplot as plt

decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

# draw a node with an arrow pointing to it from its parent
def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlot.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                            xytext=centerPt, textcoords='axes fraction',
                            va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def createPlot():
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks left in for demo purposes
    plotNode('a decision node', (0.5, 0.1), (0.1, 0.5), decisionNode)
    plotNode('a leaf node', (0.8, 0.1), (0.3, 0.8), leafNode)
    plt.show()

print createPlot
# <function createPlot at 0x0000000007636F98>
createPlot()
print createPlot.ax1
# AxesSubplot(0.125,0.11;0.775x0.77)
```

Note: the red coordinates in the figure were added by hand afterwards; they are not produced by the program above.
Constructing the annotated tree
Getting the number of leaf nodes and the depth of the tree
testPlotTree.py
```python
def getNumLeafs(myTree):
    numLeafs = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # if the child is a dict it is another decision node, otherwise it is a leaf
        if type(secondDict[key]).__name__ == 'dict':
            numLeafs += getNumLeafs(secondDict[key])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):
    maxDepth = 0
    firstStr = myTree.keys()[0]
    secondDict = myTree[firstStr]
    for key in secondDict.keys():
        # if the child is a dict it is another decision node, otherwise it is a leaf
        if type(secondDict[key]).__name__ == 'dict':
            thisDepth = 1 + getTreeDepth(secondDict[key])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

# a couple of pre-built trees for testing
def retrieveTree(i):
    listOfTrees = [{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}},
                   {'no surfacing': {0: 'no', 1: {'flippers': {0: {'head': {0: 'no', 1: 'yes'}}, 1: 'no'}}}}]
    return listOfTrees[i]

myTree = retrieveTree(0)
print "myTree: "
print myTree
print "getNumLeafs(myTree): "
print getNumLeafs(myTree)
print "getTreeDepth(myTree): "
print getTreeDepth(myTree)
# myTree:
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
# getNumLeafs(myTree):
# 3
# getTreeDepth(myTree):
# 2
```

treePlotter.py
```python
# add text midway between parent and child nodes
def plotMidText(cntrPt, parentPt, txtString):
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlot.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30)

def plotTree(myTree, parentPt, nodeTxt):
    # the number of leaves determines the x width of this subtree
    numLeafs = getNumLeafs(myTree)
    depth = getTreeDepth(myTree)
    # the first key is the feature split on at this node; use it as the node label
    firstStr = myTree.keys()[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for key in secondDict.keys():
        if type(secondDict[key]).__name__ == 'dict':
            # the child is another dict, i.e. another decision node: recurse
            plotTree(secondDict[key], cntrPt, str(key))
        else:
            # the child is a leaf node: draw it
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[key], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(key))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlot(inTree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    # **axprops means no ticks are drawn
    createPlot.ax1 = plt.subplot(111, frameon=False, **axprops)
    # createPlot.ax1 = plt.subplot(111, frameon=False)  # ticks for demo purposes
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW; plotTree.yOff = 1.0
    # draw the root node
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
```

Usage
testPlotTree2.py
```python
# -*- coding: utf-8 -*-
import treePlotter

myTree = treePlotter.retrieveTree(0)
print myTree
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# draw the decision tree
treePlotter.createPlot(myTree)
```

Testing the algorithm: classifying with the decision tree
```python
def classify(inputTree, featLabels, testVec):
    firstStr = inputTree.keys()[0]
    secondDict = inputTree[firstStr]
    # find which position in the test vector this feature occupies
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        # still a subtree: keep descending
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else:
        # reached a leaf: this is the predicted class
        classLabel = valueOfFeat
    return classLabel
```

Run
testClassify.py
```python
import treePlotter
import trees

dataSet, labels = trees.createDataSet()
myTree = treePlotter.retrieveTree(0)
print myTree
print trees.classify(myTree, labels, [1, 0])   # no
print trees.classify(myTree, labels, [1, 1])   # yes
```

Storing the decision tree on disk
To save computation time, it is best to be able to reuse an already constructed decision tree each time we run a classification.
```python
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'w')
    pickle.dump(inputTree, fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename)
    return pickle.load(fr)
```

Usage
testStoreTree.py
```python
import trees
import treePlotter

myTree = treePlotter.retrieveTree(0)
# store the tree in 'classifierStorage.txt'
trees.storeTree(myTree, 'classifierStorage.txt')
# read it back
print trees.grabTree('classifierStorage.txt')
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```

Example: predicting contact lens types with a decision tree
The dataset lenses.txt
The four feature columns are age, prescript (the spectacle prescription), astigmatic (whether the patient is astigmatic) and tearRate (the tear production rate); the last column is the class label. The file is tab separated:

```
young       myope   no    reduced   no lenses
young       myope   no    normal    soft
young       myope   yes   reduced   no lenses
young       myope   yes   normal    hard
young       hyper   no    reduced   no lenses
young       hyper   no    normal    soft
young       hyper   yes   reduced   no lenses
young       hyper   yes   normal    hard
pre         myope   no    reduced   no lenses
pre         myope   no    normal    soft
pre         myope   yes   reduced   no lenses
pre         myope   yes   normal    hard
pre         hyper   no    reduced   no lenses
pre         hyper   no    normal    soft
pre         hyper   yes   reduced   no lenses
pre         hyper   yes   normal    no lenses
presbyopic  myope   no    reduced   no lenses
presbyopic  myope   no    normal    no lenses
presbyopic  myope   yes   reduced   no lenses
presbyopic  myope   yes   normal    hard
presbyopic  hyper   no    reduced   no lenses
presbyopic  hyper   no    normal    soft
presbyopic  hyper   yes   reduced   no lenses
presbyopic  hyper   yes   normal    no lenses
```

- pre: pre-presbyopic (the middle age group)
- presbyopic: presbyopic (age-related loss of near focus)
- myope: myopic (nearsighted)
- hyper: hypermetrope (farsighted)
運(yùn)用
testLenses.py
```python
import trees
import treePlotter

fr = open('lenses.txt')
lenses = [inst.strip().split('\t') for inst in fr.readlines()]
lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
lensesTree = trees.createTree(lenses, lensesLabels)

print "lensesTree: "
print lensesTree
# {'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}

treePlotter.createPlot(lensesTree)
```

The generated decision tree plot
Summary
The decision tree above matches the experimental data very well, but it may contain too many branches. This problem is called overfitting.
To reduce overfitting, the decision tree can be pruned by removing some unnecessary leaf nodes.
If a leaf node adds only a little information, it can be removed and merged into other leaf nodes, as the sketch below illustrates.
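As a rough illustration only (my own sketch, not code from the book): one simple post-pruning pass could re-evaluate the information gain each internal node achieves on the data that reaches it and collapse the node into its majority class when that gain is below a threshold. The name pruneTree and the minGain threshold are hypothetical.

```python
# Sketch of threshold-based post-pruning (assumes calcShannonEnt, splitDataSet
# and majorityCnt from trees.py; Python 2, like the rest of the code here).
def pruneTree(tree, dataSet, labels, minGain=0.1):
    if not isinstance(tree, dict):          # already a leaf
        return tree
    featLabel = tree.keys()[0]
    featIndex = labels.index(featLabel)
    classList = [example[-1] for example in dataSet]
    # information gain this split achieves on the data reaching this node
    baseEntropy = calcShannonEnt(dataSet)
    newEntropy = 0.0
    for value in set([example[featIndex] for example in dataSet]):
        subData = splitDataSet(dataSet, featIndex, value)
        newEntropy += len(subData) / float(len(dataSet)) * calcShannonEnt(subData)
    if baseEntropy - newEntropy < minGain:
        # the split barely helps: merge the whole subtree into one leaf
        return majorityCnt(classList)
    # otherwise keep the node and prune its children recursively
    subLabels = labels[:featIndex] + labels[featIndex + 1:]
    pruned = {featLabel: {}}
    for value, child in tree[featLabel].items():
        subData = splitDataSet(dataSet, featIndex, value)
        if subData:
            pruned[featLabel][value] = pruneTree(child, subData, subLabels, minGain)
        else:
            pruned[featLabel][value] = child
    return pruned
```

A more careful version would measure the effect of each merge on a separate validation set rather than on the training data.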
What we have described above is the ID3 algorithm; its flaws do not outweigh its merits.
ID3 cannot handle numeric (int, double) features directly. Although we can quantize numeric values into nominal ones, ID3 still runs into other problems when a feature allows too many possible splits. A minimal sketch of such quantization follows.
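For example, a numeric column could be thresholded into a 0/1 nominal value before being passed to createTree (binarize and the threshold value are my own names for illustration, not part of the book's code):

```python
# Turn a numeric column into a 0/1 nominal feature by comparing to a threshold.
def binarize(dataSet, col, threshold):
    return [row[:col] + [1 if row[col] >= threshold else 0] + row[col + 1:]
            for row in dataSet]

# e.g. discretize a hypothetical length measurement in column 0 at 1.5,
# then call trees.createTree on the result
```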