Building a Decision Tree in Python: Solving the Contact Lens Selection Problem
Consider the following problem: a patient comes in to be fitted for contact lenses. By asking four questions, we want to decide which type of lenses the patient needs. How do we solve this? With a decision tree. First we download a contact lens data set from the UCI Machine Learning Repository; the file lenses.data looks like this:
1  1  1  1  1  3
2  1  1  1  2  2
3  1  1  2  1  3
4  1  1  2  2  1
5  1  2  1  1  3
6  1  2  1  2  2
7  1  2  2  1  3
8  1  2  2  2  1
9  2  1  1  1  3
10 2  1  1  2  2
11 2  1  2  1  3
12 2  1  2  2  1
13 2  2  1  1  3
14 2  2  1  2  2
15 2  2  2  1  3
16 2  2  2  2  3
17 3  1  1  1  3
18 3  1  1  2  3
19 3  1  2  1  3
20 3  1  2  2  1
21 3  2  1  1  3
22 3  2  1  2  2
23 3  2  2  1  3
24 3  2  2  2  3
Reading the columns:
The first column, 1 to 24, is the record ID.
The second column, 1 to 3, is the age of the patient: young, pre-presbyopic, or presbyopic.
The third column, 1 or 2, is the spectacle prescription: myope (nearsighted) or hypermetrope (farsighted).
The fourth column, 1 or 2, is whether the eye is astigmatic: no or yes.
The fifth column, 1 or 2, is the tear production rate: reduced or normal.
The sixth column, 1 to 3, is the class assigned from the preceding attributes: hard contact lenses, soft contact lenses, or no lenses needed.
Now that we have the data, let's write a function that opens the file and builds the data set:
from numpy import *
import operator
from math import log

def createLensesDataSet():  # build the contact lens data set from the file
    fr = open('lenses.data')
    allLinesArr = fr.readlines()
    linesNum = len(allLinesArr)
    returnMat = zeros((linesNum, 4))
    statusLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']
    classLabelVector = []
    classLabels = ['hard', 'soft', 'no lenses']
    index = 0
    for line in allLinesArr:
        line = line.strip()
        lineList = line.split()  # split on any whitespace, so repeated spaces in the file are harmless
        returnMat[index, :] = lineList[1:5]
        classIndex = int(lineList[5]) - 1  # class codes are 1-based
        classLabelVector.append(classLabels[classIndex])
        index += 1
    return returnMat.tolist(), statusLabels, classLabelVector

def createLensesAttributeInfo():  # the attribute value names, in code order
    parentAgeList = ['young', 'pre', 'presbyopic']
    spectacleList = ['myope', 'hyper']
    astigmaticList = ['no', 'yes']
    tearRateList = ['reduced', 'normal']
    return parentAgeList, spectacleList, astigmaticList, tearRateList
Next we need to set up the tree's branches. How do we decide which of the four features to split on first? This brings in the concept of Shannon entropy, a measure of how much uncertainty a set of information carries; it comes up constantly when partitioning data sets.
Its formula is:

H = -Σ_i p(x_i) · log2(p(x_i))

where p(x_i) is the proportion of class i in the data set.
Let's write a function to compute the Shannon entropy:
def calcShannonEnt(dataSet):  # compute the Shannon entropy of a data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last element of each row
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
Running this on our data set gives an entropy of 1.32608752536.
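As a quick sanity check, we can reproduce that number directly from the class counts read off the table above (4 hard, 5 soft, 15 no lenses); this is a self-contained sketch, not part of the trees script:

```python
from math import log

def entropy_from_counts(counts):
    """Shannon entropy (base 2) of a label distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total, 2) for c in counts if c > 0)

# Class counts in lenses.data: 4 hard, 5 soft, 15 no lenses
ent = entropy_from_counts([4, 5, 15])
print(ent)  # ≈ 1.32608752536
```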
Next, a function that partitions the data set given a feature index and a feature value:
def splitDataSet(dataSet, axis, value):  # partition by feature value; arguments: data set, feature index, feature value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # keep everything except the feature we split on
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
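To see what splitDataSet returns, here is a small usage sketch on made-up rows (the function is repeated so the snippet runs on its own):

```python
def splitDataSet(dataSet, axis, value):
    """Rows whose feature at `axis` equals `value`, with that feature column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

toy = [[1, 'myope', 'hard'], [1, 'hyper', 'soft'], [2, 'myope', 'hard']]
print(splitDataSet(toy, 0, 1))  # [['myope', 'hard'], ['hyper', 'soft']]
```

Each matching row comes back with the splitting feature stripped out, which is why the recursion later can always treat the last column as the class label.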
To pick the best feature to split on, we need one more concept: information gain.
Its formula is:

g(D, A) = H(D) - Σ_v (|D_v| / |D|) · H(D_v)

That is, we take one feature, compute the entropy of the subset produced by each of its values, form the weighted sum of those entropies, and subtract it from the entropy of the whole data set.
Computing the information gain of the four features gives:
0:0.0393965036461
1:0.0395108354236
2:0.377005230011
3:0.548794940695
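The largest of these, feature 3 (tear production rate), can be verified by hand from the counts in the table: the "reduced" branch holds 12 samples that are all "no lenses", and the "normal" branch holds 4 hard, 5 soft, and 3 no lenses. A self-contained check:

```python
from math import log

def entropy(counts):
    """Shannon entropy (base 2) from raw class counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts if c > 0)

# Whole data set: 4 hard, 5 soft, 15 no lenses
base = entropy([4, 5, 15])
# Split on tear production rate:
#   reduced -> 12 samples, all 'no lenses' (entropy 0)
#   normal  -> 4 hard, 5 soft, 3 no lenses
new = (12 / 24) * entropy([12]) + (12 / 24) * entropy([4, 5, 3])
print(base - new)  # ≈ 0.548794940695
```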
Here is the code that computes the gains:
def chooseBestFeatureToSplit(dataSet):  # choose the best feature to split on
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print(str(i) + ':' + str(infoGain))
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
From these numbers the feature priority is: tear production rate > astigmatic > spectacle prescription > age of patient.
With these helper functions in place, we can build the decision tree itself. We store it as a nested dictionary: each key is a branch node, and each value is either the next node or a leaf. The code:
def createTree(dataSet, labels):  # build the decision tree recursively
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: leaf
        return classList[0]
    if len(dataSet[0]) == 1:  # features exhausted: fall back to a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy, so the recursion doesn't mutate our label list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
def majorityCnt(classList):  # return the most frequent class label in a list
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # items(), not the Python 2 iteritems()
    return sortedClassCount[0][0]
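Before running on the lens data, it is worth checking the whole pipeline on something tiny. Below is a standalone sketch using a made-up two-feature toy set; the helpers above are repeated in condensed form (and without the debug print) so the snippet runs by itself:

```python
from math import log
import operator

def calcShannonEnt(dataSet):
    # entropy of the class labels (last column of each row)
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    return -sum(c / numEntries * log(c / numEntries, 2) for c in labelCounts.values())

def splitDataSet(dataSet, axis, value):
    # rows where feature `axis` == value, with that feature column removed
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(example[bestFeat] for example in dataSet):
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

# toy set: does the animal surface to breathe? does it have flippers? -> is it a fish?
toy = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
tree = createTree(toy, ['no surfacing', 'flippers'])
print(tree)  # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```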
With the main functions done, we write a short test script that prints the tree we build:
import trees
import treePlotter
from numpy import *

lensesData, labels, vector = trees.createLensesDataSet()
parentAgeList, spectacleList, astigmaticList, tearRateList = trees.createLensesAttributeInfo()
lensesAttributeList = [parentAgeList, spectacleList, astigmaticList, tearRateList]
for i in range(len(lensesData)):
    for j in range(len(lensesData[i])):
        index = int(lensesData[i][j]) - 1  # map the numeric codes back to attribute strings
        lensesData[i][j] = lensesAttributeList[j][index]
    lensesData[i].append(str(vector[i]))
myTree = trees.createTree(lensesData, labels)
print(myTree)
The output:
{'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'spectacle prescription': {'hyper': {'age of the patient': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
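This nested dictionary can already be used to classify a new patient by walking it from the root. The article's scripts don't include such a helper, so the `classify` function below is a hypothetical sketch (the tree and label list are copied from the output above):

```python
def classify(inputTree, featLabels, testVec):
    """Walk the nested-dict tree: at each branch node, follow the child whose key
    matches the test vector's value for that feature, until reaching a leaf string."""
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key, subtree in secondDict.items():
        if testVec[featIndex] == key:
            if isinstance(subtree, dict):
                return classify(subtree, featLabels, testVec)
            return subtree
    return None  # feature value never seen during training

lensesTree = {'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {
    'yes': {'spectacle prescription': {'hyper': {'age of the patient': {
        'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}},
    'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {
        'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
featLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']

print(classify(lensesTree, featLabels, ['young', 'myope', 'yes', 'normal']))  # hard
```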
This is a deeply nested dictionary and hard to read at a glance. To display the decision tree visually, we bring in the plotting module matplotlib and draw it.
We write a new treePlotter script. It contains functions that count a tree's leaf nodes and measure its depth, used to lay out the canvas, plus a function that computes the midpoint between two node coordinates so that branch labels can be placed; with these, it draws the tree. The script:
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # so non-ASCII labels render correctly

# box and arrow styles for the nodes
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlotPlus.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def getNumLeafs(myTree):  # count the leaf nodes
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # list(...) is needed in Python 3
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is an internal node
            numLeafs += getNumLeafs(secondDict[k])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):  # measure the depth of the tree
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is an internal node
            thisDepth = 1 + getTreeDepth(secondDict[k])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):  # put a label at the midpoint of two coordinates
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlotPlus.ax1.text(xMid - 0.05, yMid, txtString, rotation=30)

def plotTree(myTree, parentPt, nodeTxt):  # draw a branch node given the subtree, parent point and branch label
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':
            plotTree(secondDict[k], cntrPt, str(k))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[k], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(k))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlotPlus(inTree):  # draw the full decision tree
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlotPlus.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
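As a quick check of the layout helpers, running them against the lens tree printed earlier should report 9 leaves spread over 4 levels, which is what sizes the canvas grid. A standalone sketch (the two functions condensed, with the tree dictionary copied from the output above):

```python
def getNumLeafs(myTree):
    """Count leaf nodes (non-dict values) in the nested-dict tree."""
    firstStr = list(myTree.keys())[0]
    numLeafs = 0
    for v in myTree[firstStr].values():
        numLeafs += getNumLeafs(v) if isinstance(v, dict) else 1
    return numLeafs

def getTreeDepth(myTree):
    """Depth = longest chain of nested dicts from the root."""
    firstStr = list(myTree.keys())[0]
    maxDepth = 0
    for v in myTree[firstStr].values():
        thisDepth = 1 + getTreeDepth(v) if isinstance(v, dict) else 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth

lensesTree = {'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {
    'yes': {'spectacle prescription': {'hyper': {'age of the patient': {
        'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}},
    'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {
        'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}

print(getNumLeafs(lensesTree), getTreeDepth(lensesTree))  # 9 4
```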
With this script in place, we call the tree-drawing function from our test code:
treePlotter.createPlotPlus(myTree)
which produces the final image.
And that's it.
Reference: Machine Learning in Action (《機器學習實戰》)