Machine Learning: Applying Decision Trees
1. Python
2. scikit-learn, a Python machine-learning library
2.1 Features:
Simple and efficient tools for data mining and machine-learning analysis
Accessible to everyone, and highly reusable across different needs
Built on NumPy, SciPy, and matplotlib
Open source and commercially usable (BSD license)
Install Graphviz to convert the .dot file into a PDF visualization of the decision tree: dot -Tpdf input.dot -o output.pdf
2.2 Problem areas covered
Classification, regression, clustering, dimensionality reduction,
model selection, preprocessing
3. Using scikit-learn
Installing scikit-learn:
First install the required packages: NumPy, SciPy, and matplotlib.
Two good articles on sklearn:
4. Example
RID | age | income | student | credit_rating | class_buys_computer |
---|---|---|---|---|---|
1 | youth | high | no | fair | no |
2 | youth | high | no | excellent | no |
3 | middle_aged | high | no | fair | yes |
4 | senior | medium | no | fair | yes |
5 | senior | low | yes | fair | yes |
6 | senior | low | yes | excellent | no |
7 | middle_aged | low | yes | excellent | yes |
8 | youth | medium | no | fair | no |
9 | youth | low | yes | fair | yes |
10 | senior | medium | yes | fair | yes |
11 | youth | medium | yes | excellent | yes |
12 | middle_aged | medium | no | excellent | yes |
13 | middle_aged | high | yes | fair | yes |
14 | senior | medium | no | excellent | no |
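To make the walkthrough below runnable end to end, the table above can first be saved as AllElectronics.csv, the filename the code expects. This helper is a convenience sketch, not part of the original tutorial:

```python
import csv

# Write the 14-sample dataset from the table above to AllElectronics.csv,
# the file the tutorial code reads back in.
header = ["RID", "age", "income", "student", "credit_rating", "class_buys_computer"]
rows = [
    ["1", "youth", "high", "no", "fair", "no"],
    ["2", "youth", "high", "no", "excellent", "no"],
    ["3", "middle_aged", "high", "no", "fair", "yes"],
    ["4", "senior", "medium", "no", "fair", "yes"],
    ["5", "senior", "low", "yes", "fair", "yes"],
    ["6", "senior", "low", "yes", "excellent", "no"],
    ["7", "middle_aged", "low", "yes", "excellent", "yes"],
    ["8", "youth", "medium", "no", "fair", "no"],
    ["9", "youth", "low", "yes", "fair", "yes"],
    ["10", "senior", "medium", "yes", "fair", "yes"],
    ["11", "youth", "medium", "yes", "excellent", "yes"],
    ["12", "middle_aged", "medium", "no", "excellent", "yes"],
    ["13", "middle_aged", "high", "yes", "fair", "yes"],
    ["14", "senior", "medium", "no", "excellent", "no"],
]

with open("AllElectronics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)   # first line: column names
    writer.writerows(rows)    # then the 14 data rows
```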
5. Implementation
from sklearn.feature_extraction import DictVectorizer
import csv
from sklearn import tree
from sklearn import preprocessing
# from sklearn.externals.six import StringIO  # unused here; sklearn.externals.six was removed in scikit-learn 0.23+
# sklearn requires numeric input, so the data must be preprocessed first.
# Read in the csv file and put the features into a list of dicts and the class labels into a list
# Python 2.x:
# allElectronicsData = open(r'AllElectronics.csv', 'rb')
# reader = csv.reader(allElectronicsData)
# headers = reader.next()
# The lines above fail in Python 3.x with: '_csv.reader' object has no attribute 'next'
# In Python 3.x, use the following instead
allElectronicsData = open(r'AllElectronics.csv', 'rt')
reader = csv.reader(allElectronicsData)
headers = next(reader)
print(headers)
#['RID', 'age', 'income', 'student', 'credit_rating', 'class_buys_computer']
featureList = []
labelList = []
for row in reader:
    labelList.append(row[len(row)-1])
    rowDict = {}
    for i in range(1, len(row)-1):
        rowDict[headers[i]] = row[i]
    featureList.append(rowDict)
print(featureList)
'''
[{'age': 'youth', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'},
{'age': 'youth', 'credit_rating': 'excellent', 'income': 'high', 'student': 'no'},
{'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'no'},
{'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'},
{'age': 'senior', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'},
{'age': 'senior', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'},
{'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'low', 'student': 'yes'},
{'age': 'youth', 'credit_rating': 'fair', 'income': 'medium', 'student': 'no'},
{'age': 'youth', 'credit_rating': 'fair', 'income': 'low', 'student': 'yes'},
{'age': 'senior', 'credit_rating': 'fair', 'income': 'medium', 'student': 'yes'},
{'age': 'youth', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'yes'},
{'age': 'middle_aged', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'},
{'age': 'middle_aged', 'credit_rating': 'fair', 'income': 'high', 'student': 'yes'},
{'age': 'senior', 'credit_rating': 'excellent', 'income': 'medium', 'student': 'no'}]
'''
# As the output shows, each row's features are stored in a dict, so their order is not guaranteed.
# Vectorize features
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX: " + str(dummyX))
# Each row is converted into a one-hot vector. The column order follows
# vec.get_feature_names() (categories sorted alphabetically), e.g. the first
# row (youth, high, no, fair) becomes [0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
'''
dummyX:
[[ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
[ 0. 0. 1. 1. 0. 1. 0. 0. 1. 0.]
[ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
[ 0. 1. 0. 0. 1. 0. 0. 1. 1. 0.]
[ 0. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 1. 0. 0. 1.]
[ 1. 0. 0. 1. 0. 0. 1. 0. 0. 1.]
[ 0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
[ 0. 0. 1. 0. 1. 0. 1. 0. 0. 1.]
[ 0. 1. 0. 0. 1. 0. 0. 1. 0. 1.]
[ 0. 0. 1. 1. 0. 0. 0. 1. 0. 1.]
[ 1. 0. 0. 1. 0. 0. 0. 1. 1. 0.]
[ 1. 0. 0. 0. 1. 1. 0. 0. 0. 1.]
[ 0. 1. 0. 1. 0. 0. 0. 1. 1. 0.]]
'''
print(vec.get_feature_names())
'''
['age=middle_aged', 'age=senior', 'age=youth',
'credit_rating=excellent', 'credit_rating=fair',
'income=high', 'income=low', 'income=medium',
'student=no', 'student=yes']
'''
print("labelList: " + str(labelList))
#labelList:
#['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
# vectorize class labels
lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY: " + str(dummyY))
'''
dummyY:
[[0]
[0]
[1]
[1]
[1]
[0]
[1]
[0]
[1]
[1]
[1]
[1]
[1]
[0]]
'''
# Using decision tree for classification
# clf = tree.DecisionTreeClassifier()
'''
clf is the decision-tree classifier. The criterion parameter selects the splitting
measure; 'entropy' uses information gain, as in the ID3 algorithm. (Note that
scikit-learn actually implements an optimized CART, so this is ID3-like rather
than an exact ID3 implementation.)
'''
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf: " + str(clf))
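To see what criterion='entropy' optimizes, here is a small illustrative computation (not from the original article) of the entropy of the label column and the information gain of the age attribute, the quantities ID3 uses to choose the root split:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H = -sum(p_i * log2(p_i)) over class frequencies."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

labels = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
          'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']   # 9 yes / 5 no
ages = ['youth', 'youth', 'middle_aged', 'senior', 'senior', 'senior', 'middle_aged',
        'youth', 'youth', 'senior', 'youth', 'middle_aged', 'middle_aged', 'senior']

base = entropy(labels)  # entropy of the whole label column, ~0.940 bits

# Information gain of splitting on age: base entropy minus the
# size-weighted entropy of the labels within each age group.
weighted = sum(
    (ages.count(v) / len(ages)) * entropy([l for a, l in zip(ages, labels) if a == v])
    for v in set(ages)
)
gain = base - weighted  # ~0.247 bits; age gives the largest gain, so it is split first
print(round(base, 3), round(gain, 3))
```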
# Visualize model
'''
Create a .dot file to hold the visualization data for the tree. The features were
one-hot encoded, so to restore readable attribute names in the rendered tree,
pass feature_names=vec.get_feature_names().
'''
with open("allElectronicInformationGainOri.dot", 'w') as f:
f = tree.export_graphviz(clf, feature_names=vec.get_feature_names(), out_file=f)
'''
Finally, convert the generated .dot file into a viewable PDF: dot -Tpdf input.dot -o output.pdf
'''
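If the Graphviz binary is not conveniently on the PATH, newer scikit-learn versions can also return the dot source directly by passing out_file=None instead of a file handle. A minimal self-contained sketch on toy data (not the tutorial's dataset):

```python
from sklearn import tree

# Tiny toy problem: the label simply equals feature f0.
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit([[0, 1], [1, 0], [1, 1], [0, 0]], [0, 1, 1, 0])

# With out_file=None, export_graphviz returns the dot source as a string
# rather than writing a file; it can then be saved or rendered any way you like.
dot_src = tree.export_graphviz(clf, feature_names=['f0', 'f1'], out_file=None)
print(dot_src.splitlines()[0])  # first line of the dot source
```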
# With the tree built, predict on a demo instance:
# take the first row of data and modify it slightly
oneRowX = dummyX[0, :]
print("oneRowX: " + str(oneRowX))
#oneRowX: [ 0. 0. 1. 0. 1. 1. 0. 0. 1. 0.]
newRowX = oneRowX.copy()  # copy; plain assignment would alias and also modify dummyX
newRowX[0] = 1  # set age=middle_aged
newRowX[2] = 0  # clear age=youth
print("newRowX: " + str(newRowX))
#newRowX: [ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.]
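Editing the encoded array by hand requires knowing the column order. An alternative sketch is to describe the new sample as a dict and let the fitted DictVectorizer encode it; here the vectorizer is fitted on a minimal set of dicts covering the categories involved (in the tutorial it would already be fitted on the full featureList):

```python
from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
# Fit on a few samples that together cover every category value we need.
vec.fit([
    {'age': 'youth', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'},
    {'age': 'middle_aged', 'income': 'low', 'student': 'yes', 'credit_rating': 'excellent'},
    {'age': 'senior', 'income': 'medium', 'student': 'no', 'credit_rating': 'fair'},
])

# Encode the modified demo sample directly from its attribute values,
# instead of flipping individual array entries by hand.
new_sample = {'age': 'middle_aged', 'income': 'high', 'student': 'no', 'credit_rating': 'fair'}
newRowX = vec.transform(new_sample).toarray()[0]
print(newRowX)  # same one-hot row as the hand-edited version above
```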
#predictedY = clf.predict(newRowX)
'''
Calling clf.predict(newRowX) directly raises:
ValueError: Expected 2D array, got 1D array instead:
array=[ 1. 0. 0. 0. 1. 1. 0. 0. 1. 0.].
Reshape your data either using array.reshape(-1, 1) if your data has a single
feature or array.reshape(1, -1) if it contains a single sample.
As the message suggests, predict expects a 2D (samples x features) array, so pass
newRowX.reshape(1, -1) instead.
For more on reshape, see http://www.cnblogs.com/iamxyq/p/6683147.html
'''
predictedY = clf.predict(newRowX.reshape(1,-1))
print("predictedY: " + str(predictedY))
#predictedY: [1]
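predict returns the binarized label ([1] here). To map it back to the original class string, the fitted LabelBinarizer's inverse_transform can be reused; a standalone sketch with the same label list:

```python
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit(['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
        'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no'])  # same labelList as above

# inverse_transform maps the 0/1 encoding back to the original strings.
decoded = lb.inverse_transform(np.array([1]))
print(decoded)  # the predicted class as 'yes'/'no'
```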
The resulting decision tree can be viewed in the PDF produced by the dot command above.