Decision Tree Code from Machine Learning in Action (《機器學習實戰》)
阿新 · Published: 2017-08-10
22:45:17 2017-08-09
The kNN algorithm is simple and effective and can solve many classification problems, but it cannot explain the data: it just grinds through vector distances and assigns a class. A decision tree fixes that; after classification you can see why a sample ended up in a given class, and drawing the tree as a figure makes it even clearer. I have not learned the plotting part yet, so that will come next time.
This post only covers computing the information entropy, picking the best splitting feature, and building the tree. Pruning is not covered here.
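As a quick sanity check (not part of the book's code), the Shannon entropy of a dataset D is H(D) = -sum_i p_i * log2(p_i), where p_i is the fraction of samples in class i. For the four-row toy dataset defined below (one 'yes', three 'no'), that works out to about 0.81, which is what calcuEntropy should return:

# Minimal hand-check of the entropy formula (independent of the code below)
# H = -(1/4)*log2(1/4) - (3/4)*log2(3/4) for 1 'yes' and 3 'no' rows
from math import log
probs = [1/4, 3/4]
print(sum(-p * log(p, 2) for p in probs))   # ≈ 0.8113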
# -*- coding: utf-8 -*-

'''
function: decision tree code from Machine Learning in Action; the plotting part is not written here
note: posting it so it is easier to reuse later
date: 2017.8.9
'''

from math import log
import operator

# Compute the Shannon entropy of the dataset (the class label is the last column)
def calcuEntropy(dataSet):
    numOfEntries = len(dataSet)
    featVec = {}                                   # counts of each class label
    for data in dataSet:
        currentLabel = data[-1]
        if currentLabel not in featVec.keys():
            featVec[currentLabel] = 1
        else:
            featVec[currentLabel] += 1
    shannonEntropy = 0.0
    for feat in featVec.keys():
        prob = float(featVec[feat]) / numOfEntries
        shannonEntropy += -prob * log(prob, 2)
    return shannonEntropy

# Build the toy dataset
def loadDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing', 'flippers']
    return dataSet, labels

'''
function: split the dataset
return: the subset of rows we want after splitting on the given feature
parameters: dataSet: the dataset; axis: the feature to split on; value: the value of that feature to keep
'''
def splitDataSet(dataSet, axis, value):
    retDataSet = []                                # keep the original dataset unmodified
    for featVec in dataSet:
        if featVec[axis] == value:                 # keep the rows we want and return them later
            reducedFeatVec = featVec[:axis]        # drop the column of the splitting feature
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

'''
function: find the best feature to split the dataset on
'''
def chooseBestClassifyFeat(dataSet):
    numOfFeatures = len(dataSet[0]) - 1
    bestFeature = -1                               # index of the best splitting feature
    baseInfoGain = 0.0                             # best information gain found so far
    baseEntropy = calcuEntropy(dataSet)
    for i in range(numOfFeatures):
        # Note: an earlier draft returned early when only one feature was left,
        # which was wrong -- one remaining feature does not mean one class.
        featList = [example[i] for example in dataSet]  # every value taken by the i-th feature
        unicVals = set(featList)                        # unique values of the i-th feature
        newEntropy = 0.0
        for value in unicVals:
            subDataSet = splitDataSet(dataSet, i, value)
            # The entropy after the split is the size-weighted sum of the
            # entropies of the resulting subsets.
            currentEntropy = calcuEntropy(subDataSet)
            prob = float(len(subDataSet)) / len(dataSet)
            newEntropy += prob * currentEntropy
        newInfoGain = baseEntropy - newEntropy
        if newInfoGain > baseInfoGain:
            bestFeature = i
            baseInfoGain = newInfoGain
    return bestFeature

'''
function: majority vote; called when all features have been used but a leaf
          still contains more than one class
arg: labelList: list of class labels
'''
def majorityCount(labelList):
    classCount = {}
    for label in labelList:
        if label not in classCount.keys():
            classCount[label] = 0
        classCount[label] += 1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    print(sortedClassCount)
    return sortedClassCount[0][0]

'''
function: build the decision tree recursively
arg: dataSet: the dataset; labels: the names of the features (not needed by the
     algorithm itself, e.g. 'flippers' is the meaning of the second feature)
'''
def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]      # all class labels
    if classList.count(classList[0]) == len(classList):   # only one class left, return it directly
        return classList[0]
    if len(dataSet[0]) == 1:                               # features used up but classes still mixed: majority vote
        return majorityCount(classList)
    bestFeat = chooseBestClassifyFeat(dataSet)
    print('bestFeat = ' + str(bestFeat))
    bestFeatLabel = labels[bestFeat]
    del(labels[bestFeat])                                  # remove the feature used at this node
    featValues = [example[bestFeat] for example in dataSet]
    myTree = {bestFeatLabel: {}}
    unicVals = set(featValues)
    for value in unicVals:
        labelCopy = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, labelCopy)
    return myTree

'''
function: classify a sample with the trained decision tree
arg: inputTree: the trained tree; featLabel: the feature label list; testVec: the vector to classify
'''
def classify(inputTree, featLabel, testVec):
    firstStr = list(inputTree.keys())[0]          # in Python 3 dict.keys() cannot be indexed, so convert to a list
    secondDict = inputTree[firstStr]              # the subtree under the root feature
    featIndex = featLabel.index(firstStr)         # index() gives the position of this feature label
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if type(secondDict[key]).__name__ == 'dict':    # not a leaf yet, keep classifying
                classLabel = classify(secondDict[key], featLabel, testVec)
            else:
                classLabel = secondDict[key]                # reached a leaf, return its class label
    return classLabel

'''
function: persist the decision tree with the pickle module
'''
def storeTree(inputTree, filename):
    import pickle
    fw = open(filename, 'wb')
    pickle.dump(inputTree, fw)
    fw.close()

'''
function: load a decision tree back from a local file
'''
def grabTree(filename):
    import pickle
    fr = open(filename, 'rb')
    return pickle.load(fr)

# Test the entropy computation
dataSet, labels = loadDataSet()
shannon = calcuEntropy(dataSet)
print(shannon)

# Test dataset splitting
print(dataSet)
retDataSet = splitDataSet(dataSet, 1, 1)
print(retDataSet)
retDataSet = splitDataSet(dataSet, 1, 0)
print(retDataSet)

# Find the best splitting feature
bestFeature = chooseBestClassifyFeat(dataSet)
print(bestFeature)

# Test majority voting
out = majorityCount([1, 1, 2, 2, 2, 1, 2, 2])
print(out)

# Build the decision tree
myTree = createTree(dataSet, labels)
print(myTree)

# Test the classifier (reload labels because createTree deletes entries from it)
dataSet, labels = loadDataSet()
classLabel = classify(myTree, labels, [0, 1])
print(classLabel)
classLabel = classify(myTree, labels, [1, 1])
print(classLabel)

# Persist the decision tree, then load it back
storeTree(myTree, 'classifierStorage.txt')
outTree = grabTree('classifierStorage.txt')
print(outTree)
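For reference, tracing the code by hand on this four-row toy dataset: feature 0 ('no surfacing') gives the larger information gain (about 0.31 versus about 0.12 for 'flippers'), so it becomes the root, and createTree should produce {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}; the two classify calls should then print 'no' and 'yes'.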