
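Machine Learning in Action: Building Decision Trees, with an Example

Building a decision tree with the ID3 algorithm

I will not repeat the theory of decision trees here; this post is mainly my own study notes on Machine Learning in Action.

Start with the decision tree below, which I hope makes decision trees easier to understand.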


3.1.1 Information gain

First, two formulas are needed:
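Assuming the two formulas meant here are the standard ones that ID3 relies on, they are the information of a single class $x_i$ and the entropy of a data set with $n$ classes, where $p(x_i)$ is the fraction of records that belong to class $x_i$:

```latex
l(x_i) = -\log_2 p(x_i)

H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
```

The entropy $H$ is the expected value of $l(x_i)$ over all classes, which is what the function calcShannonEnt computes.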


Create a file named trees.py and add the following code to it:

from math import log

def calcShannonEnt(dataSet):  # compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

Enter the data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

Enter the following commands at the Python prompt:


The resulting value of about 0.970 is the entropy. The higher the entropy, the more mixed the data.
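As a rough check of that number, here is a minimal, self-contained sketch that computes the entropy directly from the class labels of the five sample records. The helper name shannon_entropy is illustrative and is not part of trees.py.

```python
from math import log

def shannon_entropy(labels):
    """Return the Shannon entropy (base 2) of a list of class labels."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = float(len(labels))
    return -sum((c / total) * log(c / total, 2) for c in counts.values())

# Class labels of the five sample records: two 'yes', three 'no'.
mixed = shannon_entropy(['yes', 'yes', 'no', 'no', 'no'])

# Introducing a third class makes the data more mixed, so the entropy rises.
more_mixed = shannon_entropy(['maybe', 'yes', 'no', 'no', 'no'])

print(round(mixed, 3), round(more_mixed, 3))
```

The first value is the entropy of the sample data set; the second shows that adding one more class to the same five records increases it.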

3.1.2 Splitting the data set

def splitDataSet(dataSet, axis, value):  # split the data set on the given feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reduceFeatVec = featVec[:axis]
            # extend takes a list and appends each of its elements to the existing list
            reduceFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reduceFeatVec)
    return retDataSet

Now let's test this function.

Enter the following at the Python prompt:
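A test of this kind might look like the following sketch. It re-implements the split as a one-line list comprehension under the illustrative name split_data_set so that it runs on its own; the behavior is intended to match splitDataSet.

```python
def split_data_set(data_set, axis, value):
    """Keep rows whose feature `axis` equals `value`, dropping that feature."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

ones = split_data_set(my_dat, 0, 1)
zeros = split_data_set(my_dat, 0, 0)
print(ones)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(zeros)  # [[1, 'no'], [1, 'no']]
```

Each returned row has one column fewer than the input, because the feature used for the split is removed.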


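Next we iterate over every feature of the data set, repeatedly calling splitDataSet() and computing the Shannon entropy of each split, to find the best feature to split on.

Again, add the following code to trees.py: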

# choose the best way to split the data set
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # compare against the best gain found so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

Test code:


The best feature to split on is feature 0.
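To see why feature 0 wins, the sketch below computes the information gain of each feature on the sample data. The helper names entropy and info_gain are illustrative and do not appear in trees.py.

```python
from math import log

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = float(len(labels))
    return -sum((labels.count(v) / total) * log(labels.count(v) / total, 2)
                for v in set(labels))

def info_gain(data_set, axis):
    """Entropy of the whole set minus the weighted entropy after splitting on `axis`."""
    base = entropy([row[-1] for row in data_set])
    remainder = 0.0
    for value in set(row[axis] for row in data_set):
        subset = [row[-1] for row in data_set if row[axis] == value]
        remainder += len(subset) / float(len(data_set)) * entropy(subset)
    return base - remainder

my_dat = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(my_dat, i) for i in range(2)]
print([round(g, 3) for g in gains])  # [0.42, 0.171]
```

Splitting on feature 0 leaves one pure subset and one subset with a single mismatch, so it removes more entropy than splitting on feature 1.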

3.1.3 Recursively building the decision tree

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # items() works on both Python 2 and 3; iteritems() exists only in Python 2
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

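This function takes a list of class names and builds a dictionary whose keys are the unique values in classList. The dictionary stores how often each class label occurs in classList. Finally, the dictionary items are sorted by count using operator.itemgetter, and the class name that occurs most often is returned.
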
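The same majority vote can also be written with the standard library's collections.Counter. This is an alternative sketch, not the code used in this post, and the name majority_class is illustrative.

```python
from collections import Counter

def majority_class(class_list):
    """Return the label that occurs most often in class_list."""
    return Counter(class_list).most_common(1)[0][0]

winner = majority_class(['yes', 'no', 'no'])
print(winner)  # no
```

Counter does the counting and most_common(1) replaces the manual sort.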
def createTree(dataSet, labels):  # build the tree
    classList = [example[-1] for example in dataSet]
    # stop when every record has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop when no features are left and return the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])  # note: this modifies the caller's labels list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so the recursive calls do not share one list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Test it:


The decision tree is now built, but it is a bit hard to read, so we need Matplotlib annotations to draw the tree.

I covered this in my previous post.

3.3 Testing and storing the classifier

3.3.1 Testing the algorithm: classifying with the decision tree

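def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]  # keys() cannot be indexed directly in Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of this feature in the test vector
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':
                # an inner node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # a leaf: this is the predicted class
                classLabel = secondDict[key]
    return classLabel
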
Test code:
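A test might look like the sketch below. It uses a compact, self-contained variant of the classifier and a tree literal that is assumed to match what createTree returns for the sample data. Because createTree deletes entries from the labels list it is given, the sketch builds a fresh labels list for classification.

```python
def classify(tree, feat_labels, test_vec):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    feature = next(iter(tree))  # the feature tested at this node
    branch = tree[feature][test_vec[feat_labels.index(feature)]]
    if isinstance(branch, dict):
        return classify(branch, feat_labels, test_vec)
    return branch

# Assumed result of createTree on the sample data set.
my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']

first = classify(my_tree, labels, [1, 0])
second = classify(my_tree, labels, [1, 1])
print(first, second)  # no yes
```

Unlike the loop-based version, this variant raises a KeyError when the test vector contains a feature value the tree has never seen.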



3.3.2 Using the algorithm: persisting the decision tree

import pickle

def storeTree(inputTree, filename):  # store the decision tree with the pickle module
    with open(filename, 'wb') as fw:  # pickle needs a file opened in binary mode
        pickle.dump(inputTree, fw)

def grabTree(filename):
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
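A round trip through pickle can be checked with the sketch below. The file name classifierStorage.txt is a hypothetical choice, the file is written to a temporary directory, and the tree literal is assumed to be the one built from the sample data.

```python
import os
import pickle
import tempfile

# Assumed result of createTree on the sample data set.
my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.txt')  # hypothetical file name

with open(path, 'wb') as fw:  # binary mode is required for pickle
    pickle.dump(my_tree, fw)
with open(path, 'rb') as fr:
    restored = pickle.load(fr)

print(restored == my_tree)  # True
```

Storing the tree this way means it does not have to be rebuilt from the training data every time a record is classified.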


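3.4 Example: using a decision tree to predict contact lens type

The data is as follows (one record per line: age, prescription, astigmatic, tear production rate, lens type):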

young       myope  no   reduced  no lenses
young       myope  no   normal   soft
young       myope  yes  reduced  no lenses
young       myope  yes  normal   hard
young       hyper  no   reduced  no lenses
young       hyper  no   normal   soft
young       hyper  yes  reduced  no lenses
young       hyper  yes  normal   hard
pre         myope  no   reduced  no lenses
pre         myope  no   normal   soft
pre         myope  yes  reduced  no lenses
pre         myope  yes  normal   hard
pre         hyper  no   reduced  no lenses
pre         hyper  no   normal   soft
pre         hyper  yes  reduced  no lenses
pre         hyper  yes  normal   no lenses
presbyopic  myope  no   reduced  no lenses
presbyopic  myope  no   normal   no lenses
presbyopic  myope  yes  reduced  no lenses
presbyopic  myope  yes  normal   hard
presbyopic  hyper  no   reduced  no lenses
presbyopic  hyper  no   normal   soft
presbyopic  hyper  yes  reduced  no lenses
presbyopic  hyper  yes  normal   no lenses

Enter the following commands at the Python prompt:
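Loading the records might look like the sketch below. It assumes the data is stored as tab-separated lines in a file (lenses.txt is a hypothetical name), and the column names in lenses_labels are assumptions as well. To stay runnable without that file, the sketch parses four sample rows from an in-memory string.

```python
import io

# Four sample rows in the assumed tab-separated layout.
raw = ("young\tmyope\tno\treduced\tno lenses\n"
       "young\tmyope\tno\tnormal\tsoft\n"
       "young\tmyope\tyes\treduced\tno lenses\n"
       "young\tmyope\tyes\tnormal\thard\n")

fr = io.StringIO(raw)  # stands in for open('lenses.txt'), a hypothetical file name
lenses = [line.strip().split('\t') for line in fr]
lenses_labels = ['age', 'prescript', 'astigmatic', 'tearRate']  # assumed column names

print(lenses[0])
```

Splitting on tabs rather than on whitespace keeps the two-word class "no lenses" in a single field. With trees.py importable, the parsed rows and labels could then be passed to createTree.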



The result is as follows: