Building a Decision Tree in Python: Solving the Contact Lens Selection Problem
Consider the following problem: a patient comes in to be fitted for contact lenses. By asking four questions, we want to decide which type of lenses the patient needs. How do we solve this? With a decision tree. First we download a contact lens data set from the UCI Machine Learning Repository; the file lenses.data looks like this:
1  1  1  1  1  3
2  1  1  1  2  2
3  1  1  2  1  3
4  1  1  2  2  1
5  1  2  1  1  3
6  1  2  1  2  2
7  1  2  2  1  3
8  1  2  2  2  1
9  2  1  1  1  3
10 2  1  1  2  2
11 2  1  2  1  3
12 2  1  2  2  1
13 2  2  1  1  3
14 2  2  1  2  2
15 2  2  2  1  3
16 2  2  2  2  3
17 3  1  1  1  3
18 3  1  1  2  3
19 3  1  2  1  3
20 3  1  2  2  1
21 3  2  1  1  3
22 3  2  1  2  2
23 3  2  2  1  3
24 3  2  2  2  3
Reading the columns:
The first column, 1 to 24, is the record ID.
The second column, 1 to 3, is the age of the patient: young, pre-presbyopic, or presbyopic.
The third column, 1 or 2, is the spectacle prescription: myope (nearsighted) or hypermetrope (farsighted).
The fourth column, 1 or 2, is whether the eye is astigmatic: no or yes.
The fifth column, 1 or 2, is the tear production rate: reduced or normal.
The sixth column, 1 to 3, is the class assigned from the preceding attributes: hard contact lenses, soft contact lenses, or no lenses needed.
Now that we have the data, let's write a function that opens the file and builds the data set:
from numpy import *
import operator
from math import log

def createLensesDataSet():  # build the contact lens data set from the file
    fr = open('lenses.data')
    allLinesArr = fr.readlines()
    linesNum = len(allLinesArr)
    returnMat = zeros((linesNum, 4))
    statusLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']
    classLabelVector = []
    classLabels = ['hard', 'soft', 'no lenses']
    index = 0
    for line in allLinesArr:
        line = line.strip()
        lineList = line.split()  # split on any whitespace, so repeated spaces in the file are harmless
        returnMat[index, :] = lineList[1:5]
        classIndex = int(lineList[5]) - 1  # class codes are 1-based
        classLabelVector.append(classLabels[classIndex])
        index += 1
    return returnMat.tolist(), statusLabels, classLabelVector

def createLensesAttributeInfo():  # the attribute value names, in code order
    parentAgeList = ['young', 'pre', 'presbyopic']
    spectacleList = ['myope', 'hyper']
    astigmaticList = ['no', 'yes']
    tearRateList = ['reduced', 'normal']
    return parentAgeList, spectacleList, astigmaticList, tearRateList
Next we need to set up the tree's branches. How do we decide which of the four features to split on first? This brings in the concept of Shannon entropy, a measure of how much uncertainty a set of information carries; it comes up constantly when partitioning data sets.
Its formula is:

H = -Σ_i p(x_i) · log2(p(x_i))

where p(x_i) is the proportion of class i in the data set.
Let's write a function to compute the Shannon entropy:
def calcShannonEnt(dataSet):  # compute the Shannon entropy of a data set
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last element of each row
        if currentLabel not in labelCounts.keys():
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        shannonEnt -= prob * log(prob, 2)
    return shannonEnt
Running this on our data set gives an entropy of 1.32608752536.
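As a quick sanity check, we can reproduce that number directly from the class counts read off the table above (4 hard, 5 soft, 15 no lenses); this is a self-contained sketch, not part of the trees script:

```python
from math import log

def entropy_from_counts(counts):
    """Shannon entropy (base 2) of a label distribution given raw counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total, 2) for c in counts if c > 0)

# Class counts in lenses.data: 4 hard, 5 soft, 15 no lenses
ent = entropy_from_counts([4, 5, 15])
print(ent)  # ≈ 1.32608752536
```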
Next, a function that partitions the data set given a feature index and a feature value:
def splitDataSet(dataSet, axis, value):  # partition by feature value; arguments: data set, feature index, feature value
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]  # keep everything except the feature we split on
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
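To see what splitDataSet returns, here is a small usage sketch on made-up rows (the function is repeated so the snippet runs on its own):

```python
def splitDataSet(dataSet, axis, value):
    """Rows whose feature at `axis` equals `value`, with that feature column removed."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

toy = [[1, 'myope', 'hard'], [1, 'hyper', 'soft'], [2, 'myope', 'hard']]
print(splitDataSet(toy, 0, 1))  # [['myope', 'hard'], ['hyper', 'soft']]
```

Each matching row comes back with the splitting feature stripped out, which is why the recursion later can always treat the last column as the class label.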
To pick the best feature to split on, we need one more concept: information gain.
Its formula is:

g(D, A) = H(D) - Σ_v (|D_v| / |D|) · H(D_v)

That is, we take one feature, compute the entropy of the subset produced by each of its values, form the weighted sum of those entropies, and subtract it from the entropy of the whole data set.
Computing the information gain of the four features gives:
0:0.0393965036461
1:0.0395108354236
2:0.377005230011
3:0.548794940695
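The largest of these, feature 3 (tear production rate), can be verified by hand from the counts in the table: the "reduced" branch holds 12 samples that are all "no lenses", and the "normal" branch holds 4 hard, 5 soft, and 3 no lenses. A self-contained check:

```python
from math import log

def entropy(counts):
    """Shannon entropy (base 2) from raw class counts."""
    total = sum(counts)
    return -sum(c / total * log(c / total, 2) for c in counts if c > 0)

# Whole data set: 4 hard, 5 soft, 15 no lenses
base = entropy([4, 5, 15])
# Split on tear production rate:
#   reduced -> 12 samples, all 'no lenses' (entropy 0)
#   normal  -> 4 hard, 5 soft, 3 no lenses
new = (12 / 24) * entropy([12]) + (12 / 24) * entropy([4, 5, 3])
print(base - new)  # ≈ 0.548794940695
```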
Here is the code that computes the gains:
def chooseBestFeatureToSplit(dataSet):  # choose the best feature to split on
    numFeatures = len(dataSet[0]) - 1  # the last column is the class label
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        print(str(i) + ':' + str(infoGain))
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
From these numbers the feature priority is: tear production rate > astigmatic > spectacle prescription > age of patient.
With these helper functions in place, we can build the decision tree itself. We store it as a nested dictionary: each key is a branch node, and each value is either the next node or a leaf. The code:
def createTree(dataSet, labels):  # build the decision tree recursively
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):  # all samples share one class: leaf
        return classList[0]
    if len(dataSet[0]) == 1:  # features exhausted: fall back to a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy, so the recursion doesn't mutate our label list
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree
def majorityCnt(classList):  # return the most frequent class label in a list
    classCount = {}
    for vote in classList:
        if vote not in classCount.keys():
            classCount[vote] = 0
        classCount[vote] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)  # items(), not the Python 2 iteritems()
    return sortedClassCount[0][0]
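Before running on the lens data, it is worth checking the whole pipeline on something tiny. Below is a standalone sketch using a made-up two-feature toy set; the helpers above are repeated in condensed form (and without the debug print) so the snippet runs by itself:

```python
from math import log
import operator

def calcShannonEnt(dataSet):
    # entropy of the class labels (last column of each row)
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        labelCounts[featVec[-1]] = labelCounts.get(featVec[-1], 0) + 1
    return -sum(c / numEntries * log(c / numEntries, 2) for c in labelCounts.values())

def splitDataSet(dataSet, axis, value):
    # rows where feature `axis` == value, with that feature column removed
    return [fv[:axis] + fv[axis + 1:] for fv in dataSet if fv[axis] == value]

def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain, bestFeature = 0.0, -1
    for i in range(len(dataSet[0]) - 1):
        newEntropy = 0.0
        for value in set(example[i] for example in dataSet):
            subDataSet = splitDataSet(dataSet, i, value)
            newEntropy += len(subDataSet) / float(len(dataSet)) * calcShannonEnt(subDataSet)
        if baseEntropy - newEntropy > bestInfoGain:
            bestInfoGain, bestFeature = baseEntropy - newEntropy, i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    return sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel: {}}
    del labels[bestFeat]
    for value in set(example[bestFeat] for example in dataSet):
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), labels[:])
    return myTree

# toy set: does the animal surface to breathe? does it have flippers? -> is it a fish?
toy = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
tree = createTree(toy, ['no surfacing', 'flippers'])
print(tree)  # {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
```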
With the main functions done, we write a short test script that prints the tree we build:
import trees
import treePlotter
from numpy import *

lensesData, labels, vector = trees.createLensesDataSet()
parentAgeList, spectacleList, astigmaticList, tearRateList = trees.createLensesAttributeInfo()
lensesAttributeList = [parentAgeList, spectacleList, astigmaticList, tearRateList]
for i in range(len(lensesData)):
    for j in range(len(lensesData[i])):
        index = int(lensesData[i][j]) - 1  # map the numeric codes back to attribute strings
        lensesData[i][j] = lensesAttributeList[j][index]
    lensesData[i].append(str(vector[i]))
myTree = trees.createTree(lensesData, labels)
print(myTree)
The output:
{'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'spectacle prescription': {'hyper': {'age of the patient': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
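This nested dictionary can already be used to classify a new patient by walking it from the root. The article's scripts don't include such a helper, so the `classify` function below is a hypothetical sketch (the tree and label list are copied from the output above):

```python
def classify(inputTree, featLabels, testVec):
    """Walk the nested-dict tree: at each branch node, follow the child whose key
    matches the test vector's value for that feature, until reaching a leaf string."""
    firstStr = list(inputTree.keys())[0]
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    for key, subtree in secondDict.items():
        if testVec[featIndex] == key:
            if isinstance(subtree, dict):
                return classify(subtree, featLabels, testVec)
            return subtree
    return None  # feature value never seen during training

lensesTree = {'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {
    'yes': {'spectacle prescription': {'hyper': {'age of the patient': {
        'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}},
    'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {
        'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
featLabels = ['age of the patient', 'spectacle prescription', 'astigmatic', 'tear production rate']

print(classify(lensesTree, featLabels, ['young', 'myope', 'yes', 'normal']))  # hard
```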
This is a deeply nested dictionary and hard to read at a glance. To display the decision tree visually, we bring in the plotting module matplotlib and draw it.
We write a new treePlotter script. It contains functions that count a tree's leaf nodes and measure its depth, used to lay out the canvas, plus a function that computes the midpoint between two node coordinates so that branch labels can be placed; with these, it draws the tree. The script:
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rcParams['font.sans-serif'] = ['SimHei']  # so non-ASCII labels render correctly

# box and arrow styles for the nodes
decisionNode = dict(boxstyle="sawtooth", fc="0.8")
leafNode = dict(boxstyle="round4", fc="0.8")
arrow_args = dict(arrowstyle="<-")

def plotNode(nodeTxt, centerPt, parentPt, nodeType):
    createPlotPlus.ax1.annotate(nodeTxt, xy=parentPt, xycoords='axes fraction',
                                xytext=centerPt, textcoords='axes fraction',
                                va="center", ha="center", bbox=nodeType, arrowprops=arrow_args)

def getNumLeafs(myTree):  # count the leaf nodes
    numLeafs = 0
    firstStr = list(myTree.keys())[0]  # list(...) is needed in Python 3
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is an internal node
            numLeafs += getNumLeafs(secondDict[k])
        else:
            numLeafs += 1
    return numLeafs

def getTreeDepth(myTree):  # measure the depth of the tree
    maxDepth = 0
    firstStr = list(myTree.keys())[0]
    secondDict = myTree[firstStr]
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':  # a dict value is an internal node
            thisDepth = 1 + getTreeDepth(secondDict[k])
        else:
            thisDepth = 1
        if thisDepth > maxDepth:
            maxDepth = thisDepth
    return maxDepth

def plotMidText(cntrPt, parentPt, txtString):  # put a label at the midpoint of two coordinates
    xMid = (parentPt[0] - cntrPt[0]) / 2.0 + cntrPt[0]
    yMid = (parentPt[1] - cntrPt[1]) / 2.0 + cntrPt[1]
    createPlotPlus.ax1.text(xMid - 0.05, yMid, txtString, rotation=30)

def plotTree(myTree, parentPt, nodeTxt):  # draw a branch node given the subtree, parent point and branch label
    numLeafs = getNumLeafs(myTree)
    firstStr = list(myTree.keys())[0]
    cntrPt = (plotTree.xOff + (1.0 + float(numLeafs)) / 2.0 / plotTree.totalW, plotTree.yOff)
    plotMidText(cntrPt, parentPt, nodeTxt)
    plotNode(firstStr, cntrPt, parentPt, decisionNode)
    secondDict = myTree[firstStr]
    plotTree.yOff = plotTree.yOff - 1.0 / plotTree.totalD
    for k in secondDict.keys():
        if type(secondDict[k]).__name__ == 'dict':
            plotTree(secondDict[k], cntrPt, str(k))
        else:
            plotTree.xOff = plotTree.xOff + 1.0 / plotTree.totalW
            plotNode(secondDict[k], (plotTree.xOff, plotTree.yOff), cntrPt, leafNode)
            plotMidText((plotTree.xOff, plotTree.yOff), cntrPt, str(k))
    plotTree.yOff = plotTree.yOff + 1.0 / plotTree.totalD

def createPlotPlus(inTree):  # draw the full decision tree
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    createPlotPlus.ax1 = plt.subplot(111, frameon=False, **axprops)
    plotTree.totalW = float(getNumLeafs(inTree))
    plotTree.totalD = float(getTreeDepth(inTree))
    plotTree.xOff = -0.5 / plotTree.totalW
    plotTree.yOff = 1.0
    plotTree(inTree, (0.5, 1.0), '')
    plt.show()
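As a quick check of the layout helpers, running them against the lens tree printed earlier should report 9 leaves spread over 4 levels, which is what sizes the canvas grid. A standalone sketch (the two functions condensed, with the tree dictionary copied from the output above):

```python
def getNumLeafs(myTree):
    """Count leaf nodes (non-dict values) in the nested-dict tree."""
    firstStr = list(myTree.keys())[0]
    numLeafs = 0
    for v in myTree[firstStr].values():
        numLeafs += getNumLeafs(v) if isinstance(v, dict) else 1
    return numLeafs

def getTreeDepth(myTree):
    """Depth = longest chain of nested dicts from the root."""
    firstStr = list(myTree.keys())[0]
    maxDepth = 0
    for v in myTree[firstStr].values():
        thisDepth = 1 + getTreeDepth(v) if isinstance(v, dict) else 1
        maxDepth = max(maxDepth, thisDepth)
    return maxDepth

lensesTree = {'tear production rate': {'reduced': 'no lenses', 'normal': {'astigmatic': {
    'yes': {'spectacle prescription': {'hyper': {'age of the patient': {
        'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}},
    'no': {'age of the patient': {'pre': 'soft', 'presbyopic': {'spectacle prescription': {
        'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}

print(getNumLeafs(lensesTree), getTreeDepth(lensesTree))  # 9 4
```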
With this script in place, we call the tree-drawing function from our test code:
treePlotter.createPlotPlus(myTree)
which produces the final image.
And that's it.
Reference: Machine Learning in Action (《機器學習實戰》)