
Data Analysis Series Highlights (Part 3)

Data Analysis (Part 3)

Before analyzing the UCI data, it helps to review some decision tree concepts.

  • A recommended blog post on decision trees:
    http://www.cnblogs.com/yonghao/p/5061873.html
  • Basic characteristics of a decision tree (DT)

    • A DT is a supervised learning method, so it needs labeled data

    • The implementation here runs as a single process, so it is not a good fit for very large datasets

    • PS: It works quite well on small, clean datasets

  • Characteristics of the UCI data (the UCI credit approval data set)

    • 690 data entries, a relatively small dataset

    • 15 attributes, which is also quite small

    • Missing values affect only about 5% of the data

    • 2 classes

  • Taken together, these two lists suggest that a DT should work well for our dataset

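With that in mind, we can try implementing a decision tree in code. Using the skeleton provided by our instructor (段老師), write your own code by following the steps below.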

  • Copy and paste your code into the function readfile(file_name), under the comment # Your code here.

  • Make sure your input and output match what the docstring describes

  • Make a minor improvement to handle missing data: use the string "missing" to represent
    missing values, which appear as "?" in the raw file.

  • Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
  • Implement class Determine. This object represents a node of our DT.
    • It has 2 inputs and one method.

    • We can think of it as the question we are asking at each node.

  • Implement the function partition(rows, question) as described in the docstring
    • Use the Determine class to partition the data into 2 groups

  • Implement the function gini(rows) as described in the docstring
    • Here is the formula for Gini impurity:

      $$Gini = 1 - \sum_{i=1}^{n} p_i^2$$

      • where $n$ is the number of classes

      • $p_i$ is the percentage of the given class $i$

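  • Implement the function info_gain(left, right, current_uncertainty) as described in the docstring
    • Here is the formula for Information Gain:

      $$Gain = I_{current} - p_{left} \cdot Gini(left) - p_{right} \cdot Gini(right)$$

      • where

      • $I_{current}$ is current_uncertainty

      • $p_{left}$ is the percentage/probability of the left branch, same story for $p_{right}$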

  • My code is as follows, for reference only

    def readfile(file_name):
        """
        Read a data file and return structured, cleaned data in a list
        :param file_name: relative path under data folder
        :return: data, in this case it should be a 2-D list of the form
        [[data1_1, data1_2, ...],
         [data2_1, data2_2, ...],
         [data3_1, data3_2, ...],
         ...]

        i.e.
        [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
         ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
         ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
         ...]

        A couple of things you should note:
        1. You need to handle missing data. In this case let's use "missing" to represent all missing data
        2. Be careful of data types. For instance,
            "58.67" and "0.2356" should be numbers, not strings
            "00043" should be a string, not a number
            It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
        """
        # Your code here
        output = []
        # A context manager closes the file even if parsing fails
        with open(file_name, 'r') as data_file:
            for line in data_file:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                row = []
                for field in line.split(","):
                    if field == '?':
                        # "?" marks a missing value in the raw file
                        row.append('missing')
                    elif field.isdigit() and len(field) > 1 and field.startswith('0'):
                        # Zero-padded codes such as "00043" stay strings
                        row.append(field)
                    else:
                        try:
                            # Every other number becomes a float, as the docstring allows
                            row.append(float(field))
                        except ValueError:
                            row.append(field)
                output.append(row)
        return output
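
The type rules in the readfile docstring can be tried on a single line before reading the whole file. This is only a minimal sketch: parse_field is a hypothetical helper that is not part of the skeleton, the input line is made up for illustration, and every number is treated as a float, which the docstring allows.

```python
def parse_field(field):
    """Apply the docstring's type rules to one raw field (hypothetical helper)."""
    if field == "?":
        return "missing"              # "?" marks missing data in the raw file
    if field.isdigit() and len(field) > 1 and field.startswith("0"):
        return field                  # zero-padded codes such as "00043" stay strings
    try:
        return float(field)           # all other numbers become floats
    except ValueError:
        return field                  # categorical values stay strings


# Made-up line for illustration, not taken from the real data file
raw_line = "a,58.67,?,u,00043,560,+"
row = [parse_field(field) for field in raw_line.split(",")]
print(row)
```

Keeping the per-field logic in one small function makes it easy to check each rule separately before wiring it into the file loop.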




    def is_missing(value):
        """
        Determine whether the given value represents missing data; see readfile(), where the "missing" marker is defined
        :param value: value to be checked
        :return: boolean (True, False) of whether the input value is the same as our "missing" notation
        """
        return value == 'missing'


    def class_counts(rows):
        """
        Count how many data samples there are for each label
        :param rows: Input is a 2D list in the form of what you have returned in readfile()
        :return: Output is a dictionary/map in the form:
        {"label_1": #count,
         "label_2": #count,
         "label_3": #count,
         ...
        }
        """
        # This first approach is rigid: it only works for the labels given here ('+', '-').
        # The approaches below extend it to data with an arbitrary set of labels.
        # label_dict = {}
        # count1 = 0
        # count2 = 0
        # # rows is the result returned by readfile
        # for row in rows:
        #     if row[-1] == '+':
        #         count1 += 1
        #     elif row[-1] == '-':
        #         count2 += 1
        # label_dict['+'] = count1
        # label_dict['-'] = count2
        # return label_dict

        # Extended approach 1
        # Counts any set of labels using two loops. The first loop collects the distinct
        # labels found in the data into label_list. The second walks label_list and, for each
        # label, runs a nested loop over all rows to count the rows carrying that label.
        # label_dict = {}
        # label_list = []
        # for row in rows:
        #     label = row[-1]
        #     if label_list == []:
        #         label_list.append(label)
        #     else:
        #         if label in label_list:
        #             continue
        #         else:
        #             label_list.append(label)
        #
        # for label_i in label_list:
        #     count_row_i = 0
        #     for row_i in rows:
        #         if label_i == row_i[-1]:
        #             count_row_i += 1
        #     label_dict[label_i] = count_row_i
        # print(label_dict)
        # return label_dict

        # Extended approach 2
        # Uses the dictionary's keys to record the labels seen so far and accumulates each label's count
        label_dict = {}
        for row in rows:
            label = row[-1]
            if label in label_dict:
                label_dict[label] += 1
            else:
                label_dict[label] = 1
        return label_dict
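
For comparison, the same per-label count can be written with collections.Counter from the standard library. This is a sketch of an alternative, not the code the skeleton asks for; class_counts_with_counter and toy_rows are names made up for this example.

```python
from collections import Counter


def class_counts_with_counter(rows):
    """Count rows per label, where the label is the last element of each row."""
    return dict(Counter(row[-1] for row in rows))


# Toy rows for illustration only
toy_rows = [
    ['a', 1.0, '+'],
    ['b', 2.0, '-'],
    ['a', 3.0, '+'],
]
counts = class_counts_with_counter(toy_rows)
print(counts)
```

Counter handles any number of distinct labels without a separate pass to discover them, which is the same property the dictionary-based version has.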


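    def is_numeric(value):
        """
        Test if the input is a number (float/int)
        :param value: Input is a value to be tested
        :return: Boolean (True/False)
        """
        # Your code here
        # An earlier version used eval(), which turns a string into an expression and evaluates it:
        # if type(eval(str(value))) == int or type(eval(str(value))) == float:
        #     return True
        # eval() is not needed here, and some blog posts point out that it carries security risks.

        # readfile() has already converted numeric fields to numbers, while letters, "missing"
        # and zero-padded codes such as "00043" remain strings, so a type check is enough.
        # A string-prefix test would wrongly reject floats such as 0.5, whose text starts with "0".
        # bool is excluded because it is a subclass of int.
        return isinstance(value, (int, float)) and not isinstance(value, bool)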
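
To see which parsed values should count as numeric, a plain type check can be run over a few representative samples. This is a sketch under the assumption that the values have already been parsed as the readfile docstring describes; looks_numeric and samples are hypothetical names used only here.

```python
def looks_numeric(value):
    """True for int/float values, False for strings (and for bool, a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Representative parsed values: floats, an int, a zero-padded code, the missing marker, a category
samples = [58.67, 0.5, 0, '00043', 'missing', 'u']
results = [looks_numeric(value) for value in samples]
print(results)
```

Values such as 0.5 are worth including in any check like this, because tests based on the text of the value (for example "starts with 0") treat them the same way as codes like '00043'.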


    class Determine:
        """
        This class is used for comparison. It stores a column index and a value.
        The match method compares numbers or strings.
        Think of it as the "question" asked at each node of the decision tree, e.g.:
            Is today's temperature cold or hot?
            Is today's weather sunny, cloudy, or rainy?
        """
        def __init__(self, column, value):
            """
            initial structure of our object
            :param column: column index of our "question"
            :param value: splitting value of our "question"
            """
            self.column = column
            self.value = value

        def match(self, example):
            """
            Compares example data and self.value
            note that you need to determine whether the data asked is numeric or categorical/string
            Be careful with missing data
            :param example: a full row of data
            :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
            """
            # Your code here. "missing" is a string too, so it needs no special case
            example_value = example[self.column]
            # Compare numerically only when both sides are numbers. Column 10 mixes
            # zero-padded strings with numbers, and a str cannot be ordered against a number.
            if is_numeric(example_value) and is_numeric(self.value):
                return example_value > self.value
            return example_value == self.value

        def __repr__(self):
            """
            Used when printing the tree
            :return: the question as a readable string
            """
            # ">" matches the comparison used in match()
            condition = ">" if is_numeric(self.value) else "is"
            # header is assumed to be a module-level list of column names;
            # it is not defined anywhere in this listing
            return "{} {} {}?".format(
                header[self.column], condition, str(self.value))


    def partition(rows, question):
        """
        Split the data: rows that satisfy the question go into true_rows, the rest go into false_rows
        :param rows: data set/subset
        :param question: Determine object you implemented above
        :return: 2 lists based on the answer of the question
        """
        # Your code here. question is a Determine object
        true_rows, false_rows = [], []
        # Iterate over the 2D list row by row, because Determine.match only looks at
        # the value at the chosen index of a single row
        for row in rows:
            if question.match(row):
                true_rows.append(row)
            else:
                false_rows.append(row)
        return true_rows, false_rows
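
The question-and-partition idea can be tried on a handful of toy rows. The sketch below is self-contained and deliberately does not reuse the classes above: matches and split_rows are hypothetical stand-ins for Determine.match and partition, and the rows are invented for illustration.

```python
def matches(row, column, value):
    """Numeric question: is row[column] > value? Otherwise: is row[column] == value?"""
    cell = row[column]
    both_numeric = isinstance(cell, (int, float)) and isinstance(value, (int, float))
    return cell > value if both_numeric else cell == value


def split_rows(rows, column, value):
    """Return (true_rows, false_rows) according to the question on the given column."""
    true_rows = [row for row in rows if matches(row, column, value)]
    false_rows = [row for row in rows if not matches(row, column, value)]
    return true_rows, false_rows


# Toy rows: [category, number or "missing", label]
toy_rows = [
    ['a', 0.5, '+'],
    ['b', 2.5, '-'],
    ['a', 'missing', '-'],
    ['b', 4.0, '+'],
]

# Numeric question: "is column 1 greater than 1.8?"
numeric_true, numeric_false = split_rows(toy_rows, 1, 1.8)
# Categorical question: "is column 0 equal to 'a'?"
categorical_true, categorical_false = split_rows(toy_rows, 0, 'a')
print(numeric_true, numeric_false)
print(categorical_true, categorical_false)
```

Note that the row holding "missing" falls into the false group of the numeric question, because a string is never compared with ">" against a number here.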


    def gini(rows):
        """
        Compute the Gini value of a set of rows, a measure of how mixed the labels are
        :param rows: data set/subset
        :return: Gini value ("impurity")
        """
        data_set_size = len(rows)    # total number of rows
        class_dict = class_counts(rows)
        sum_subgini = 0
        for class_count in class_dict.values():
            sum_subgini += (class_count / data_set_size) ** 2
        # Named impurity so the local variable does not shadow the function name
        impurity = 1 - sum_subgini
        return impurity
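
A few hand-checkable cases make the Gini impurity concrete. This sketch works directly on lists of labels rather than full rows; gini_from_labels is a hypothetical helper written for this example, and the label lists are invented.

```python
def gini_from_labels(labels):
    """Gini impurity of a list of labels: 1 minus the sum of squared class shares."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((count / total) ** 2 for count in counts.values())


pure = gini_from_labels(['+', '+', '+', '+'])      # 1 - 1^2
balanced = gini_from_labels(['+', '+', '-', '-'])  # 1 - (0.5^2 + 0.5^2)
skewed = gini_from_labels(['+', '+', '+', '-'])    # 1 - (0.75^2 + 0.25^2)
print(pure, balanced, skewed)
```

With two classes the impurity ranges from 0 for a pure set to 0.5 for an evenly mixed one, so lower values mean a cleaner group.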



    def info_gain(left, right, current_uncertainty):
        """
        Compute the information gain of a split
        Please refer to the .md tutorial for details
        :param left: left branch
        :param right: right branch
        :param current_uncertainty: current uncertainty (data)
        :return: current uncertainty minus the weighted Gini values of the two branches
        """
        p_left = len(left) / (len(left) + len(right))
        p_right = 1 - p_left
        return current_uncertainty - p_left * gini(left) - p_right * gini(right)
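
Information gain is easiest to read by comparing a clean split with a poor one. The sketch below again uses plain label lists; gini_from_labels and gain_from_labels are hypothetical helpers for this example, and the parent uncertainty is computed from the two branches combined.

```python
def gini_from_labels(labels):
    """Gini impurity of a list of labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1 - sum((count / total) ** 2 for count in counts.values())


def gain_from_labels(left, right):
    """Parent impurity minus the size-weighted impurity of the two branches."""
    parent = left + right
    p_left = len(left) / len(parent)
    p_right = 1 - p_left
    return (gini_from_labels(parent)
            - p_left * gini_from_labels(left)
            - p_right * gini_from_labels(right))


# The parent has three '+' and three '-' labels in both cases (impurity 0.5)
perfect = gain_from_labels(['+', '+', '+'], ['-', '-', '-'])  # both branches pure
weak = gain_from_labels(['+', '+', '-'], ['+', '-', '-'])     # both branches still mixed
print(perfect, weak)
```

The perfect split recovers the whole parent impurity as gain, while the weak split gains only 0.5 - 4/9, which is why a tree builder would prefer the first question.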




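    # Use this data to test the quality of your own code
    # A raw string keeps the backslashes in the Windows path from being read as escapes
    data = readfile(r"E:\data\crx.data")
    # Column 2 holds numbers, so the split value must be a number, not the string '1.8'
    t, f = partition(data, Determine(2, 1.8))
    print(info_gain(t, f, gini(data)))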

 

January 2, 2019