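Data Analysis Series Highlights (Part 3)
Data Analysis (Part 3)
Before analyzing the UCI data, it helps to understand a few decision tree concepts.
- A recommended blog post about decision trees:
  http://www.cnblogs.com/yonghao/p/5061873.html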
- Basic characteristics of a decision tree (DT)
  - A DT is a supervised learning method, so it needs labeled data.
  - It runs as a single process, so it is a poor fit for very large datasets.
  - PS: it works quite well on small, clean datasets.
- Characteristics of the UCI data (UCI credit approval data set)
  - 690 data entries, a relatively small dataset
  - 15 attributes, which is quite few
  - Missing values: only 5%
  - 2 classes
Comparing the two lists above, a DT should work well for our dataset.
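With that in mind, we can try implementing a decision tree in code. Starting from the skeleton provided by our instructor, Mr. Duan, write your own code following these steps: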
- Copy and paste your code into the function readfile(file_name), under the comment # Your code here.
- Make sure your input and output match what I described in the docstring.
- Make a minor improvement to handle missing data: in this case, use the string "missing" in place of "?".
- Implement is_missing(value), class_counts(rows), and is_numeric(value) as directed in the docstring.
- Implement the class Determine. This object represents a node of our DT.
  - It has 2 inputs and a function.
  - We can think of it as the question we are asking at each node.
- Implement the method partition(rows, question) as described in the docstring.
  - Use the Determine class to partition the data into 2 groups.
- Implement the method gini(rows) as described in the docstring.
- Implement the method info_gain(left, right, current_uncertainty) as described in the docstring.
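To make the last few steps more concrete, here is a minimal sketch of how Determine, partition, gini and info_gain could fit together. It is not the skeleton's reference solution. The constructor arguments column and value, the method name match, the rule of splitting numeric attributes with >=, and the convention that the class label sits in the last column are all assumptions made for illustration.

```python
from collections import Counter


def is_numeric(value):
    """True for int/float values (bool is excluded on purpose)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class Determine:
    """A question asked at a node: compare one column against a value."""

    def __init__(self, column, value):
        self.column = column  # assumed: index of the attribute to test
        self.value = value    # assumed: threshold (numeric) or category (string)

    def match(self, row):
        # Assumed rule: numeric attributes split on >=, everything else on equality.
        candidate = row[self.column]
        if is_numeric(candidate) and is_numeric(self.value):
            return candidate >= self.value
        return candidate == self.value


def partition(rows, question):
    """Split rows into (true_rows, false_rows) according to the question."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


def gini(rows):
    """Gini impurity: 1 - sum(p_k ** 2) over the labels in the last column."""
    if not rows:
        return 0.0
    counts = Counter(row[-1] for row in rows)
    return 1.0 - sum((n / len(rows)) ** 2 for n in counts.values())


def info_gain(left, right, current_uncertainty):
    """Parent impurity minus the size-weighted impurity of the two children.

    Assumes at least one of left/right is non-empty.
    """
    p = len(left) / (len(left) + len(right))
    return current_uncertainty - p * gini(left) - (1 - p) * gini(right)


# Toy rows in the same shape as readfile() output: attributes first, label last.
rows = [['a', 58.67, '+'], ['a', 24.5, '+'], ['b', 27.83, '-'], ['b', 30.0, '-']]
true_rows, false_rows = partition(rows, Determine(0, 'a'))
gain = info_gain(true_rows, false_rows, gini(rows))
print(gain)  # 0.5: the question separates the two classes perfectly
```

On the toy rows, the parent impurity is 0.5 and both children are pure, so the gain equals the full 0.5. A value of "missing" in a numeric column simply fails the numeric test and falls into the false branch, which is one possible policy rather than the only one.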
My code is as follows, for reference only.
def readfile(file_name):
    """
    This function reads a data file and returns structured and cleaned data in a list
    :param file_name: relative path under data folder
    :return: data, in this case it should be a 2-D list of the form
    [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    e.g.
    [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
     ...]
    A couple of things you should note:
    1. You need to handle missing data. In this case let's use "missing" to represent all missing data
    2. Be careful with data types. For instance,
       "58.67" and "0.2356" should be numbers, not strings
       "00043" should be a string, not a number
    It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
    """
    # Your code here
    output = []
    # Avoid shadowing built-in names (list, str, ...) unless you mean to replace them.
    # The with-block closes the file even if parsing fails.
    with open(file_name, 'r') as data_file:
        for line in data_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = []
            for substr in line.split(","):
                if substr == '?':
                    # '?' marks a missing value in the raw file
                    row.append('missing')
                elif substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                    # zero-padded codes such as "00043" stay strings
                    row.append(substr)
                else:
                    try:
                        # every other number becomes a float, as the docstring allows
                        row.append(float(substr))
                    except ValueError:
                        # categorical values such as 'a' or '+'
                        row.append(substr)
            output.append(row)
    return output
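As a quick illustration of the type rules spelled out in the readfile() docstring, the sketch below applies them to a single made-up line. The helper name parse_field and the sample line are hypothetical and not part of the skeleton. The point is only to show which fields end up as strings, floats, or "missing".

```python
def parse_field(text):
    """Convert one raw field using the rules from the readfile() docstring."""
    if text == '?':
        return 'missing'      # '?' marks a missing value in the raw file
    if text.isdigit() and len(text) > 1 and text.startswith('0'):
        return text           # zero-padded codes such as '00043' stay strings
    try:
        return float(text)    # all other numbers are treated as float
    except ValueError:
        return text           # categorical values such as 'a' or '+'


# A made-up line in the same comma-separated shape as the data file.
raw_line = "a,58.67,?,u,00043,560,+"
parsed = [parse_field(field) for field in raw_line.split(",")]
print(parsed)  # ['a', 58.67, 'missing', 'u', '00043', 560.0, '+']
```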
def is_missing(value):
    """
    Determines if the given value is a missing data, please refer back to readfile() where we defined what is a "missing" data
    :param value: value to be checked
    :return: boolean (True, False) of whether the input value is the same as our "missing" notation
    """
    return value == 'missing'
def class_counts(rows):
    """
    Count how many data samples there are for each label
    :param rows: Input is a 2D list in the form of what you have returned in readfile()
    :return: Output is a dictionary/map in the form:
    {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
    }
    """
    # This first approach is hard-coded: it only counts rows with the given labels ('+', '-').
    # To count data whose labels are not known in advance, it is extended by the approaches below.
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict
    # Extension 1
    # This approach counts data with any set of labels, using two loops. The first loop collects
    # the distinct labels present in the data into a list, lable_list.
    # The second iterates over lable_list; nested inside it is a loop over all rows that counts
    # how many rows carry the current label.
    # label_dict = {}