Machine Learning in Action: Building Decision Trees, with a Worked Example

阿新 • Published: 2019-01-26

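Building a decision tree with the ID3 algorithm

I won't go over the theory of decision trees here; this post is mainly my own study notes on Machine Learning in Action.

Start with the decision tree below, which should help in understanding how decision trees work.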

3.1.1 Information gain

First, two formulas are needed:
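The formulas themselves are not shown here. Assuming they are the standard definitions that calcShannonEnt implements, the information of a class x_i and the Shannon entropy of a data set with n classes can be written as:

```latex
l(x_i) = -\log_2 p(x_i)

H = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
```

Here p(x_i) is the fraction of samples that belong to class x_i.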

Create a file named trees.py and add the following code to it:

from math import log

def calcShannonEnt(dataSet):
    # Compute the Shannon entropy of the given data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt

Create the data set:

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

At the Python prompt, enter the following commands:
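A minimal sketch of such a session, written as a standalone script. It recomputes the entropy with collections.Counter instead of importing trees.py, so the helper name entropy is an assumption of this sketch and not part of the code above.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy (base 2) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Same five samples as createDataSet(): two 'yes' and three 'no'
my_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(entropy(my_data))  # about 0.971
```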

The resulting value, 0.970..., is the entropy. The higher the entropy, the more mixed the data is.

3.1.2 Splitting the data set

def splitDataSet(dataSet, axis, value):
    # Split the data set on the given feature
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            # extend() takes a list and appends each of its elements to the existing list
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

Now let's test this function.

At the Python prompt, enter:
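As a rough standalone sketch of this test, the split can also be written as a list comprehension. The name split_rows is hypothetical; it is meant to behave like splitDataSet on this data.

```python
def split_rows(rows, axis, value):
    """Keep rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

my_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_rows(my_data, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_rows(my_data, 0, 0))  # [[1, 'no'], [1, 'no']]
```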

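Next, we iterate over the whole data set, repeatedly computing the Shannon entropy and calling splitDataSet(), to find the best feature to split on.

Again, add the following code to trees.py: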

# Choose the best feature to split the data set on
def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:  # compare against the best gain so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

Test code:
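To make the comparison concrete, the following sketch prints the information gain of each feature on the sample data. It is a compact standalone restatement, not the code from trees.py; entropy and info_gain are hypothetical names.

```python
from collections import Counter
from math import log2

def entropy(rows):
    """Shannon entropy (base 2) of the class labels in the last column."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, axis):
    """Entropy reduction obtained by splitting on column `axis`."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(rows) - remainder

my_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(my_data, i) for i in range(len(my_data[0]) - 1)]
best = max(range(len(gains)), key=gains.__getitem__)
print(gains)  # roughly [0.420, 0.171]
print(best)   # 0
```

Splitting on feature 0 leaves one pure subset and one nearly pure subset, which is why its gain is larger.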

The best feature to split on is feature 0.

3.1.3 Building the decision tree recursively

import operator

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1
    # Sort (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

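This function takes a list of class names and builds a dictionary whose keys are the unique values in classList; the dictionary stores how often each class label occurs in classList. Finally, it uses operator to sort the dictionary items by count and returns the class name that occurs most often.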
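The same majority vote can be sketched with collections.Counter from the standard library. The name majority_label is hypothetical and only used for this illustration.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent label in class_list."""
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'no']))  # no
```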

def createTree(dataSet, labels):
    # Build the tree recursively
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]  # all samples share one class: stop splitting
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)  # no features left: return the majority class
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])  # note: this also removes the entry from the caller's list
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so recursive calls do not share one list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

Run a test:
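A compact standalone sketch of the whole procedure on the sample data. It is a variant under stated assumptions, not the code from trees.py: all function names are hypothetical, it always picks the feature with the largest gain, and it copies the label list instead of deleting from it.

```python
from collections import Counter
from math import log2

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())

def split_rows(rows, axis, value):
    return [r[:axis] + r[axis + 1:] for r in rows if r[axis] == value]

def best_feature(rows):
    """Index of the feature with the largest information gain."""
    def gain(axis):
        values = set(r[axis] for r in rows)
        rest = sum(len(s) / len(rows) * entropy(s)
                   for s in (split_rows(rows, axis, v) for v in values))
        return entropy(rows) - rest
    return max(range(len(rows[0]) - 1), key=gain)

def build_tree(rows, labels):
    classes = [r[-1] for r in rows]
    if classes.count(classes[0]) == len(classes):
        return classes[0]                             # pure node
    if len(rows[0]) == 1:
        return Counter(classes).most_common(1)[0][0]  # no features left
    axis = best_feature(rows)
    rest_labels = labels[:axis] + labels[axis + 1:]   # copy; caller's list is untouched
    return {labels[axis]: {v: build_tree(split_rows(rows, axis, v), rest_labels)
                           for v in set(r[axis] for r in rows)}}

my_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']
my_tree = build_tree(my_data, labels)
print(my_tree)
# {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```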

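The decision tree is now built, but the nested dictionary is hard to read. We need to use Matplotlib annotations to draw the tree.

I covered that in my previous post.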

3.3 Testing and storing the classifier

3.3.1 Testing the algorithm: classifying with the decision tree

def classify(inputTree, featLabels, testVec):
    firstStr = next(iter(inputTree))  # the single key at this node
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    classLabel = None  # stays None if testVec holds a value the tree has not seen
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                # Internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]  # leaf node
    return classLabel

Test the code:
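A short standalone sketch of such a test. It hard-codes the tree that the sample data set yields and uses a slightly simplified lookup (direct indexing instead of a loop), so it is an illustration rather than the function above.

```python
def classify(tree, feat_labels, test_vec):
    """Walk the nested-dict tree until a leaf (a non-dict value) is reached."""
    feature = next(iter(tree))                    # the single key at this node
    branches = tree[feature]
    value = test_vec[feat_labels.index(feature)]
    subtree = branches[value]                     # raises KeyError for unseen values
    if isinstance(subtree, dict):
        return classify(subtree, feat_labels, test_vec)
    return subtree

my_tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(my_tree, labels, [1, 0]))  # no
print(classify(my_tree, labels, [1, 1]))  # yes
```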

3.3.2 Using the algorithm: storing the decision tree

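import pickle

def storeTree(inputTree, filename):
    # Store the decision tree with the pickle module (binary mode is required in Python 3)
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

def grabTree(filename):
    # Load a previously stored tree
    with open(filename, 'rb') as fr:
        return pickle.load(fr)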
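A quick round-trip check, sketched as a standalone script. The file name and the temporary directory are arbitrary choices for this illustration.

```python
import os
import pickle
import tempfile

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

# Write to a throwaway directory so the example leaves the working directory alone
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.txt')
with open(path, 'wb') as fw:   # pickle needs binary mode in Python 3
    pickle.dump(tree, fw)
with open(path, 'rb') as fr:
    restored = pickle.load(fr)

print(restored == tree)  # True
```

Storing the tree this way avoids rebuilding it every time a sample needs to be classified.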

3.4 Example: using a decision tree to predict contact lens type

The data is as follows:

young       myope  no   reduced  no lenses
young       myope  no   normal   soft
young       myope  yes  reduced  no lenses
young       myope  yes  normal   hard
young       hyper  no   reduced  no lenses
young       hyper  no   normal   soft
young       hyper  yes  reduced  no lenses
young       hyper  yes  normal   hard
pre         myope  no   reduced  no lenses
pre         myope  no   normal   soft
pre         myope  yes  reduced  no lenses
pre         myope  yes  normal   hard
pre         hyper  no   reduced  no lenses
pre         hyper  no   normal   soft
pre         hyper  yes  reduced  no lenses
pre         hyper  yes  normal   no lenses
presbyopic  myope  no   reduced  no lenses
presbyopic  myope  no   normal   no lenses
presbyopic  myope  yes  reduced  no lenses
presbyopic  myope  yes  normal   hard
presbyopic  hyper  no   reduced  no lenses
presbyopic  hyper  no   normal   soft
presbyopic  hyper  yes  reduced  no lenses
presbyopic  hyper  yes  normal   no lenses

At the Python prompt, enter the following commands:
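A sketch of the loading step, assuming the data is stored as a tab-separated text file. The file name lenses.txt and the four column names are assumptions of this sketch; io.StringIO stands in for the file so the example runs on its own.

```python
import io

# Stand-in for open('lenses.txt'): three tab-separated rows from the table above
sample = io.StringIO(
    "young\tmyope\tno\treduced\tno lenses\n"
    "young\tmyope\tno\tnormal\tsoft\n"
    "young\tmyope\tyes\tnormal\thard\n"
)
lenses = [line.strip().split('\t') for line in sample]

# Assumed feature names, one per column before the class label
lenses_labels = ['age', 'prescript', 'astigmatic', 'tearRate']

print(lenses[0])  # ['young', 'myope', 'no', 'reduced', 'no lenses']
```

With the full file loaded this way, the tree would be built by passing lenses and lenses_labels to createTree.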

The result is as follows:
