[Machine Learning] Understanding Decision Trees in Depth: Computing the ID3 Algorithm by Hand
1. Decision Tree Overview
Decision trees are easy to understand and highly interpretable, making them a very common and effective machine learning algorithm for both classification and regression. Many of today's best ensemble models are also built on decision trees, so the decision tree family is an unavoidable mountain in machine learning, one that deserves systematic, in-depth study.
In information theory, the larger the information gain of an attribute, the more that attribute reduces the entropy of the samples. In other words, fixing that attribute makes the system more stable and ordered (the lower the entropy, the more stable the system), and the resulting partitions are purer.
No matter how many features a dataset has, each split can use only one feature. So which feature should be chosen for the first split so that the data is classified as quickly as possible?
The answer is the feature with the best discriminative power. But how do we judge which feature that is? We can introduce an intuitive notion of purity: the purer the partitions after a split, the better. There are three common ways to measure purity, and different measures can lead to different splits.
1) ID3 (Iterative Dichotomiser 3): the earliest decision tree algorithm, which uses information gain as its splitting criterion.
2) C4.5: proposed by J. Ross Quinlan as an improvement on his ID3, using the information gain ratio as its splitting criterion.
3) CART (Classification and Regression Tree): uses the Gini index as its splitting criterion and can be applied to both classification and regression problems.
Today we cover the first of these, the ID3 algorithm; detailed walkthroughs of the other two will follow. In the sections below we compute the information gains step by step, so that the principle behind ID3 becomes completely clear.
二、數(shù)據(jù)集介紹
下面是本例采用的數(shù)據(jù)集,包含了過去 14 天的天氣因素 Outlook(天氣),Temp.(溫度),Humidity(濕度)、Wind(風(fēng)力) 4 個(gè)特征、14 個(gè)樣本,來學(xué)習(xí)一個(gè)是否去室外打球的決策樹。
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
對(duì)于決策樹ID3算法,只有兩個(gè)核心公式,計(jì)算經(jīng)驗(yàn)熵和條件熵,我們簡(jiǎn)單回顧下:
經(jīng) 驗(yàn) ?熵:Entropy(S) ? ?= -∑ p(i) * log2p(i)
條 件 ?熵:Entropy(S|A) = ∑ [ p(S|A) * Entropy(S|A) ]
信息增益:Gain(S, A) ? ? =?Entropy(S) -∑ [ p(S|A)* Entropy(S|A) ]
這些公式看起來可能會(huì)讓人迷惑,我們從實(shí)際計(jì)算中,一步步來理解這些公式,理解了上面的三個(gè)公式,基本上也就理解了決策樹這個(gè)算法。
三、經(jīng)驗(yàn)熵計(jì)算
首先,我們需要計(jì)算整個(gè)數(shù)據(jù)集的經(jīng)驗(yàn)熵,也就是固有的熵,數(shù)據(jù)集包含14個(gè)樣本和兩個(gè)類別:Yes and No,9個(gè)樣本是Yes,5個(gè)樣本是No。
標(biāo)簽 | Yes | No | 匯總 |
樣本數(shù) | 9 | 5 | 14 |
概率值 | 9/14 | 5/14 | 14/14 |
根據(jù)上述公式和統(tǒng)計(jì)數(shù)據(jù),我們可以計(jì)算出數(shù)據(jù)集的經(jīng)驗(yàn)熵
Entropy(S) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.9402859586706311
With the empirical entropy in hand, we next compute the conditional entropy of each feature and the corresponding information gain, rank the gains, and split first on the feature with the largest gain.
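This calculation is easy to reproduce in Python with only the standard library (a minimal sketch):

```python
from math import log2

# Class probabilities for the full dataset: 9 Yes, 5 No out of 14 samples
p_yes, p_no = 9 / 14, 5 / 14

# Empirical entropy of the label distribution
entropy_s = -p_yes * log2(p_yes) - p_no * log2(p_no)
print(entropy_s)  # ≈ 0.9403
```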
四、條件熵計(jì)算
完成了數(shù)據(jù)集經(jīng)驗(yàn)熵的計(jì)算,就該計(jì)算每個(gè)特征分別的條件熵,以及對(duì)應(yīng)的信息增益。
第一層分裂決策
數(shù)據(jù)集有四個(gè)特征,需要分別計(jì)算每個(gè)特征的條件熵,Outlook(天氣),Temp.(溫度),Humidity(濕度)、Wind(風(fēng)力) 。
1、Wind條件熵
wind這個(gè)特征包含兩個(gè)屬性,weak and strong
Gain(Decision, Wind) = Entropy(Decision)- ∑ [ p(Decision|Wind)* Entropy(Decision|Wind)] = Entropy(Decision)-[p(Decision|Wind=Weak)*Entropy(Decision|Wind=Weak)]-[p(Decision|Wind=Strong)*Entropy(Decision|Wind=Strong)]我們需要分別計(jì)算 (Decision|Wind=Weak) ?和 (Decision|Wind=Strong)
Wind | Yes | No | Count |
Weak | 6 | 2 | 8 |
Strong | 3 | 3 | 6 |
1)Weak attribute entropy
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
3 | Overcast | Hot | High | Weak | Yes |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
The Wind = Weak subset has 8 samples: 6 Yes and 2 No. Its entropy is:
Entropy(Decision|Wind=Weak) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(6/8)*log2(6/8) - (2/8)*log2(2/8) = 0.8112781244591328
Note: if one class has 0 instances out of a total of n, we would need to evaluate -(0/n)*log2(0/n); by convention 0*log2(0) = 0, since entropy depends only on the distribution of X, not on its values. Naively, log(0) = -∞, and we cannot compute 0 times ∞. This special case comes up frequently in decision tree applications. Even though a program cannot evaluate the expression directly, calculus gives us the limit; if you want to see how, read this article:
https://sefiks.com/2018/08/25/indeterminate-forms-and-lhospitals-rule-in-decision-trees/
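In code, the 0 * log2(0) = 0 convention can be implemented by simply skipping zero counts. A hypothetical `entropy` helper (not from the article) might look like this:

```python
from math import log2

def entropy(counts):
    """Entropy of a label distribution given per-class counts.

    Zero counts are skipped, which implements the 0 * log2(0) = 0
    convention and avoids a math domain error on pure subsets."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([6, 2]))  # Wind=Weak subset (6 Yes, 2 No): ≈ 0.8113
print(entropy([4, 0]))  # a pure subset: 0.0 instead of an error
```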
2)Strong attribute entropy
Day | Outlook | Temp. | Humidity | Wind | Decision |
2 | Sunny | Hot | High | Strong | No |
6 | Rain | Cool | Normal | Strong | No |
7 | Overcast | Cool | Normal | Strong | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
14 | Rain | Mild | High | Strong | No |
The Wind = Strong subset has 6 samples: 3 Yes and 3 No. Its entropy is:
Entropy(Decision|Wind=Strong) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(3/6)*log2(3/6) - (3/6)*log2(3/6) = 1.0
3)Summary
Wind | Yes | No | Count |
Weak | 6 | 2 | 8 |
Strong | 3 | 3 | 6 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),現(xiàn)在我們可以計(jì)算Wind特征的信息增益了
Gain(Decision,Wind) = Entropy(Decision)-[p(Decision|Wind=Weak)*Entropy(Decision|Wind=Weak)] -[p(Decision|Wind=Strong).Entropy(Decision|Wind=Strong)] = 0.940–[(8/14).0.811 ]–[(6/14). 1] = 0.940-(8/14)*0.811-(6/14)* 1 = 0.04799999999999999Wind(風(fēng)力)這個(gè)特征的信息增益計(jì)算結(jié)束了,現(xiàn)在,我們需要對(duì)其他特征應(yīng)用相同的計(jì)算方法,計(jì)算出剩余每個(gè)特征的風(fēng)險(xiǎn)增益。
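Putting the pieces together for Wind (a sketch; `entropy` is a hypothetical helper name, and the counts are taken from the summary table above):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class counts; zero counts contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

entropy_s = entropy([9, 5])        # full dataset: 9 Yes, 5 No
weak, strong = [6, 2], [3, 3]      # (Yes, No) counts per Wind value

# Gain = dataset entropy minus the weighted average of subset entropies
gain_wind = entropy_s - (8 / 14) * entropy(weak) - (6 / 14) * entropy(strong)
print(round(gain_wind, 3))  # 0.048
```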
2. Outlook conditional entropy
The Outlook feature takes three values, Sunny, Overcast, and Rain. We compute the entropy of each subset and then take the weighted average to get the conditional entropy.
Outlook | Yes | No | Count |
Sunny | 2 | 3 | 5 |
Overcast | 4 | 0 | 4 |
Rain | 3 | 2 | 5 |
1)Overcast
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | Hot | High | Weak | Yes |
7 | Overcast | Cool | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
這個(gè)屬性全部只有一個(gè)標(biāo)簽Yes,可以按下面的計(jì)算,但是會(huì)報(bào)錯(cuò),可以直接按上面的規(guī)則計(jì)算(0/4)*log2(0/4)=0,所以可以知道Entropy(Decision|Outlook=Overcast)=0
Entropy(Decision|Outlook=Overcast) =–p(No)*log2p(No)–p(Yes)*log2p(Yes) =-(0/4)*log2(0/4)-(4/4)*log2(4/4) ValueError: math domain error2)Rain
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
Rain這個(gè)屬性有5個(gè)樣本,其中2個(gè)No,3個(gè)Yes,按下面的方法計(jì)算該子集的熵
Entropy(Decision|Outlook=Rain) = -p(No)*log2p(No)-p(Yes)*log2p(Yes) = -(2/5)*log2(2/5)-(3/5)*log2(3/5) = 0.97095059445466863)Sunny
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
Sunny這個(gè)屬性也有有5個(gè)樣本,其中2個(gè)No,3個(gè)Yes,按下面的方法計(jì)算該子集的熵
Entropy(Decision|Outlook=Sunny) = -p(No) * log2p(No) - p(Yes) *log2p(Yes) = -(2/5) * log2(2/5) - (3/5) *log2(3/5) = 0.9709505944546686現(xiàn)在我們可以計(jì)算Outlook特征的信息增益了
Gain(Decision, Outlook) = Entropy(Decision)–[p(Decision|Outlook=Overcast)*Entropy(Decision|Outlook=Overcast)]–[p(Decision|Outlook=Rain)*Entropy(Decision|Outlook=Rain) –[p(Decision|Outlook=Sunny)*Entropy(Decision|Outlook=Sunny)] = 0.940-(5/14)*0-(5/14)*0.9709-(5/14)*0.9709 = 0.24649999999999994通過上面的計(jì)算,得到Outlook這個(gè)特征的信息增益為:Gain(Decision, Outlook)= 0.246,是不是非常簡(jiǎn)單呢。
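The Outlook gain can also be computed directly from the 14-row table rather than from pre-tallied counts (a sketch; the pair list is transcribed from the dataset above):

```python
from math import log2
from collections import Counter, defaultdict

# (Outlook, Decision) pairs for days 1-14, transcribed from the table
data = [
    ("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"), ("Rain", "Yes"),
    ("Rain", "Yes"), ("Rain", "No"), ("Overcast", "Yes"), ("Sunny", "No"),
    ("Sunny", "Yes"), ("Rain", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rain", "No"),
]

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())

# Group the labels by Outlook value, then average the subset entropies
groups = defaultdict(list)
for value, label in data:
    groups[value].append(label)

labels = [label for _, label in data]
conditional = sum(len(g) / len(data) * entropy(g) for g in groups.values())
gain_outlook = entropy(labels) - conditional
print(round(gain_outlook, 3))  # 0.247
```

The exact value is 0.2467, which rounds to 0.247; the 0.246 in the text comes from truncating the intermediate entropies to four decimal places.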
3. Temp conditional entropy
Temp | Yes | No | Count |
Cool | 3 | 1 | 4 |
Hot | 2 | 2 | 4 |
Mild | 4 | 2 | 6 |
1)Cool
Day | Outlook | Temp. | Humidity | Wind | Decision |
5 | Rain | Cool | Normal | Weak | Yes |
7 | Overcast | Cool | Normal | Strong | Yes |
9 | Sunny | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
The Temp = Cool subset has 4 samples: 3 Yes and 1 No. Its entropy is:
Entropy(Decision|Temp=Cool) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(3/4)*log2(3/4) - (1/4)*log2(1/4) = 0.8112781244591328
2)Hot
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | Hot | High | Weak | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
The Temp = Hot subset has 4 samples: 2 Yes and 2 No. Its entropy is:
Entropy(Decision|Temp=Hot) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(2/4)*log2(2/4) - (2/4)*log2(2/4) = 1.0
3)Mild
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
8 | Sunny | Mild | High | Weak | No |
14 | Rain | Mild | High | Strong | No |
The Temp = Mild subset has 6 samples: 4 Yes and 2 No. Its entropy is:
Entropy(Decision|Temp=Mild) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(4/6)*log2(4/6) - (2/6)*log2(2/6) = 0.9182958340544896
4)Summary
Temp | Yes | No | Count |
Cool | 3 | 1 | 4 |
Hot | 2 | 2 | 4 |
Mild | 4 | 2 | 6 |
通過統(tǒng)計(jì)數(shù)據(jù)可以看到,Cool占比4/14,Hot占比4/14,Mild占比6/14,加權(quán)上面的熵,得到下面的計(jì)算:
Gain(Decision,Temp.) = Entropy(Decision)–[p(Decision|Temp.=Cool)*Entropy(Decision|Temp.=Cool)]–[p(Decision|Temp.=Hot)*Entropy(Decision|Temp.=Hot) –[p(Decision|Temp.=Mild)*Entropy(Decision|Temp.=Mild)] = 0.940-(4/14)*0.8112-(4/14)*1.0-(6/14)*0.9182 = 0.029000000000000026通過上面的計(jì)算,得到Outlook這個(gè)特征的信息增益為:Gain(Decision, Temp) = 0.029,又完成了一個(gè),我們接著算下面一個(gè)一個(gè)特征Humidity 的條件熵。
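A quick check of the Temp. gain from the summary counts (a sketch, again with a hypothetical `entropy` helper):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class counts; zero counts contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

entropy_s = entropy([9, 5])
cool, hot, mild = [3, 1], [2, 2], [4, 2]   # (Yes, No) counts per Temp value
gain_temp = (entropy_s - (4 / 14) * entropy(cool)
             - (4 / 14) * entropy(hot) - (6 / 14) * entropy(mild))
print(round(gain_temp, 3))  # 0.029
```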
4. Humidity conditional entropy
Humidity | Yes | No | Count |
High | 3 | 4 | 7 |
Normal | 6 | 1 | 7 |
1)High attribute entropy
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | Hot | High | Weak | Yes |
12 | Overcast | Mild | High | Strong | Yes |
4 | Rain | Mild | High | Weak | Yes |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
14 | Rain | Mild | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
The Humidity = High subset has 7 samples: 3 Yes and 4 No. Its entropy is:
Entropy(Decision|Humidity=High) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(3/7)*log2(3/7) - (4/7)*log2(4/7) = 0.9852281360342516
2)Normal attribute entropy
Day | Outlook | Temp. | Humidity | Wind | Decision |
7 | Overcast | Cool | Normal | Strong | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
9 | Sunny | Cool | Normal | Weak | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
6 | Rain | Cool | Normal | Strong | No |
The Humidity = Normal subset has 7 samples: 6 Yes and 1 No. Its entropy is:
Entropy(Decision|Humidity=Normal) = -p(Yes)*log2 p(Yes) - p(No)*log2 p(No) = -(6/7)*log2(6/7) - (1/7)*log2(1/7) = 0.5916727785823275
3)Summary
Humidity | Yes | No | Count |
High | 3 | 4 | 7 |
Normal | 6 | 1 | 7 |
通過上面的計(jì)算,得到Humidity這個(gè)特征的信息增益為:Gain(Decision, Humidity) = 0.151 ,又完成了一個(gè),我們接著算下面一個(gè)一個(gè)特征Humidity 的條件熵。
This completes the information gain calculations for all four features. We now rank the features and split on the one with the largest gain.
Ranking and splitting
Feature | Gain | Rank |
Outlook | 0.246 | 1 |
Temperature | 0.029 | 4 |
Humidity | 0.151 | 2 |
Wind | 0.048 | 3 |
So we choose Outlook as the first split feature; it becomes the root of the decision tree.
[Figure: Outlook as the root decision of the tree]
Next we must choose the split feature for the second level, computed within each subset produced by the Outlook split.
Second-level split decision
With the first-level split feature decided, we split on it and repeat the procedure at the second level: for each resulting subset, compute its empirical entropy and the conditional entropies.
The Outlook feature splits the dataset into three nodes, and for each node we must determine the next split feature.
1. The Outlook = Overcast node
On the Overcast branch every label is Yes, so the split is already complete: this becomes a leaf node and needs no further splitting.
Day | Outlook | Temp. | Humidity | Wind | Decision |
3 | Overcast | Hot | High | Weak | Yes |
7 | Overcast | Cool | Normal | Strong | Yes |
12 | Overcast | Mild | High | Strong | Yes |
13 | Overcast | Hot | Normal | Weak | Yes |
No further splitting needed!
2. The Outlook = Sunny node
Under Sunny there are still 5 samples, 3/5 No and 2/5 Yes, so further splitting is needed. We repeat the steps above on this subset: first compute its empirical entropy, then the information gain of each feature, and pick the best split feature.
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
1)計(jì)算子集經(jīng)驗(yàn)熵
首先我們計(jì)算Outlook=Sunny子集的經(jīng)驗(yàn)熵
Entropy(Outlook=Sunny ) = -p(No)*log2p(No)-p(Yes)*log2p(Yes) = -(3/5)*log2(3/5)-(2/5) *log2(2/5) = 0.97095下面計(jì)算每個(gè)特征的經(jīng)驗(yàn)熵和信息增益。
2)Gain(Outlook=Sunny|Temp.)
Temp. | Yes | No | Count |
Hot | 0 | 2 | 2 |
Cool | 1 | 0 | 1 |
Mild | 1 | 1 | 2 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),計(jì)算特征Temp.的條件熵
Entropy(Outlook=Sunny|Temp.) = 2/5*(-(0/2)*log2(0/2)-(2/2)*log2(2/2))+ 1/5*(-(1/1)*log2(1/1)-(0/1)*log2(0/1))+ 2/5*(-(1/2)*log2(1/2)-(1/2)*log2(1/2)) = 0.4計(jì)算得到Temp.特征的信息增益
Gain(Outlook=Sunny|Temp.) = Entropy(Outlook=Sunny)-Entropy(Outlook=Sunny|Temp.) = 0.97095-0.4 = 0.57095根據(jù)上面的計(jì)算,可以得到最終的信息增益為:Gain(Outlook=Sunny|Temp.)=0.570
3)Gain(Outlook=Sunny|Humidity)
Humidity | Yes | No | Count |
High | 0 | 3 | 3 |
Normal | 2 | 0 | 2 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),計(jì)算特征Humidity的條件熵
Entropy(Outlook=Sunny|Humidity) = 3/5*(-(0/3)*log2(0/3)-(3/3)*log2(3/3))+ 2/5*(-(2/2)*log2(2/2)-(0/2)*log2(0/2))+ = 0計(jì)算得到Temp.特征的信息增益
Gain(Outlook=Sunny|Humidity) = Entropy(Outlook=Sunny)-Entropy(Outlook=Sunny|Humidity) = 0.97095-0 = 0.970954)Gain(Outlook=Sunny|Wind)
Wind | Yes | No | Count |
Weak | 1 | 2 | 3 |
Strong | 1 | 1 | 2 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),計(jì)算特征Temp.的條件熵
Entropy(Outlook=Sunny|Humidity) = 3/5*(-(1/3)*log2(1/3)-(2/3)*log2(2/3))+ 2/5*(-(1/2)*log2(1/2)-(1/2)*log2(1/2))+ = 0.9509775004326937計(jì)算得到Temp.特征的信息增益
Gain(Outlook=Sunny|Humidity) = Entropy(Outlook=Sunny)-Entropy(Outlook=Sunny|Humidity) = 0.97095-0.9509775004326937 = 0.019972499567306295)增益排序
Gain(Outlook=Sunny|Temp.) = 0.570 Gain(Outlook=Sunny|Humidity)????=?0.970 Gain(Outlook=Sunny|Wind) = 0.019可以看到,humidity 是增益最高的,我們就按這個(gè)特征進(jìn)行分裂,分裂結(jié)果如下,兩個(gè)分支都是純度百分百了
Day | Outlook | Temp. | Humidity | Wind | Decision |
1 | Sunny | Hot | High | Weak | No |
2 | Sunny | Hot | High | Strong | No |
8 | Sunny | Mild | High | Weak | No |
On the other hand, the decision is always Yes when Humidity is Normal:
Day | Outlook | Temp. | Humidity | Wind | Decision |
9 | Sunny | Cool | Normal | Weak | Yes |
11 | Sunny | Mild | Normal | Strong | Yes |
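The three second-level gains on the Sunny branch can be verified with a short script (a sketch; `gain` is a hypothetical helper, and the per-value (Yes, No) counts come from the tables above):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class counts; zero counts contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, splits):
    """Information gain of a split; `splits` lists (Yes, No) counts per value."""
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in splits)

sunny = [2, 3]  # the Outlook=Sunny subset: 2 Yes, 3 No
print(f"Temp.:    {gain(sunny, [[0, 2], [1, 0], [1, 1]]):.3f}")  # 0.571
print(f"Humidity: {gain(sunny, [[0, 3], [2, 0]]):.3f}")          # 0.971
print(f"Wind:     {gain(sunny, [[1, 2], [1, 1]]):.3f}")          # 0.020
```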
3. The Outlook = Rain node
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
6 | Rain | Cool | Normal | Strong | No |
10 | Rain | Mild | Normal | Weak | Yes |
14 | Rain | Mild | High | Strong | No |
1)計(jì)算子集經(jīng)驗(yàn)熵
我們計(jì)算Outlook=Rain子集的經(jīng)驗(yàn)熵,在Outlook=Rain這個(gè)節(jié)點(diǎn)下,一共有5個(gè)樣本,3個(gè)Yes,2個(gè)No,我們可以算出這個(gè)子集的經(jīng)驗(yàn)熵。
Entropy(Outlook=Rain) = – p(No)*log2p(No)–p(Yes)*log2p(Yes) = - (2/5)*log2(2/5)-(3/5)*log2(3/5) = 0.970952)Gain(Outlook=Rain | Temp)
Temp. | Yes | No | Count |
Cool | 1 | 1 | 2 |
Mild | 2 | 1 | 3 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),可以計(jì)算出子集的熵
Entropy(Outlook=Rain|Temp) = 2/5*(-(1/2)*log2(1/2)-(1/2)*log2(1/2))+3/5*(-(2/3)*log2(2/3)-(1/3)*log2(1/3)) = 0.9509775Gain(Outlook=Rain|Temp) = Entropy(Outlook=Rain)-Entropy(Outlook=Rain|Temperature) = 0.97095-0.9509775 = 0.01997計(jì)算得到Temp特征的信息增益為:Gain(Outlook=Rain | Temp)=0.01997
3)Gain(Outlook=Rain | Humidity)
Humidity | Yes | No | Count |
High | 1 | 1 | 2 |
Normal | 2 | 1 | 3 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),可以計(jì)算出子集的熵
Entropy(Outlook=Rain|Humidity) = 2/5*(-(1/2)*log2(1/2)-(1/2)*log2(1/2))+\3/5*(-(2/3)*log2(2/3) - (1/3) * log2(1/3)) = 0.9509775Gain(Outlook=Rain | Humidity) = Entropy(Outlook=Rain)-Entropy(Outlook=Rain|Temperature) = 0.97095-0.9509775 = 0.01997計(jì)算得到Humidity特征的信息增益為:Gain(Outlook=Rain | Humidity)=0.01997,與上面的Temperature特征一樣
4)Gain(Outlook=Rain | Wind)
Wind | Yes | No | Count |
Weak | 3 | 0 | 3 |
Strong | 0 | 2 | 2 |
根據(jù)上面的統(tǒng)計(jì)數(shù)據(jù),可以計(jì)算出子集的熵
Entropy(Outlook=Rain|Wind) = 3/5*(-(0/3)*log2(0/3)-(3/3)*log2(3/3))+2/5*(-(0/2)*log2(0/2)-(2/2) * log2(2/2)) = 0Gain(Outlook=Rain | Humidity) = Entropy(Outlook=Rain)-Entropy(Outlook=Rain|Wind) = 0.97095-0 = 0.97095計(jì)算得到Wind特征的信息增益為:Gain(Outlook=Rain | Wind) =0.97095
綜上,可以得到下面的結(jié)論,因此Wind的信息增益最大,選做本節(jié)點(diǎn)的分列特征
Gain(Outlook=Rain | Temperature) = 0.0199730940219748 Gain(Outlook=Rain?|?Humidity)?? ?=?0.0199730940219748 Gain(Outlook=Rain | Wind) = 0.9709505944546686分列后得到下面兩個(gè)葉子節(jié)點(diǎn),葉子節(jié)點(diǎn)的純度都為100%,也就是只有一個(gè)標(biāo)簽類型,因此無需繼續(xù)分列。
Wind = Weak
Day | Outlook | Temp. | Humidity | Wind | Decision |
4 | Rain | Mild | High | Weak | Yes |
5 | Rain | Cool | Normal | Weak | Yes |
10 | Rain | Mild | Normal | Weak | Yes |
Wind = Strong
Day | Outlook | Temp. | Humidity | Wind | Decision |
6 | Rain | Cool | Normal | Strong | No |
14 | Rain | Mild | High | Strong | No |
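The Rain-branch gains can be verified the same way (a sketch reusing the same hypothetical helpers as before):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class counts; zero counts contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent, splits):
    """Information gain of a split; `splits` lists (Yes, No) counts per value."""
    n = sum(parent)
    return entropy(parent) - sum(sum(s) / n * entropy(s) for s in splits)

rain = [3, 2]  # the Outlook=Rain subset: 3 Yes, 2 No
print(f"Temp:     {gain(rain, [[1, 1], [2, 1]]):.3f}")  # 0.020
print(f"Humidity: {gain(rain, [[1, 1], [2, 1]]):.3f}")  # 0.020
print(f"Wind:     {gain(rain, [[3, 0], [0, 2]]):.3f}")  # 0.971
```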
5. Generating the Decision Tree
Combining the first- and second-level computations, we obtain the final decision tree:
[Figure: final version of the decision tree]
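The finished tree is small enough to write out as nested conditionals; `predict` below is a hypothetical helper sketching the learned rules, not code from the article:

```python
def predict(outlook, humidity, wind):
    """Decision rules of the ID3 tree learned above:
    Outlook at the root, Humidity under Sunny, Wind under Rain."""
    if outlook == "Overcast":
        return "Yes"                                   # pure leaf: always play
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook}")

# Spot-check against the training table
print(predict("Sunny", "High", "Weak"))     # No  (Day 1)
print(predict("Overcast", "High", "Weak"))  # Yes (Day 3)
print(predict("Rain", "High", "Strong"))    # No  (Day 14)
```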