Machine Learning Algorithms: Decision Trees

阿新 • Published: 2018-12-17

> By joey周琦

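Overview

Linear models generally have low variance and high bias, whereas tree models generally have high variance and low bias.
Advantages of decision trees: they are readable and fast at classification.
Decision tree learning generally consists of three steps:
- Feature selection
- Tree generation
- Pruning

Definition: a classification decision tree is a tree structure that describes how instances are classified. It consists of nodes and directed edges. Nodes are either internal nodes or leaf nodes: an internal node represents a feature, and a leaf node represents a class.
- A decision tree can be viewed as a set of if-then rules that is mutually exclusive and complete: every instance is covered by exactly one root-to-leaf path.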
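The if-then view can be made concrete with a small sketch. The feature names, values, and labels below are invented for illustration and are not from the original post. Each rule corresponds to one root-to-leaf path, and the assertion checks that exactly one rule fires for a given instance.

```python
# Hypothetical rule set: the feature names and labels are invented for illustration.
rules = [
    (lambda x: x["outlook"] == "sunny", "play"),
    (lambda x: x["outlook"] != "sunny" and x["windy"], "stay home"),
    (lambda x: x["outlook"] != "sunny" and not x["windy"], "play"),
]


def classify(instance):
    """Return the label of the single rule whose condition matches."""
    fired = [label for condition, label in rules if condition(instance)]
    # Mutually exclusive and complete: exactly one rule fires for any instance.
    assert len(fired) == 1
    return fired[0]


print(classify({"outlook": "rainy", "windy": True}))  # stay home
print(classify({"outlook": "sunny", "windy": True}))  # play
```

Writing the rules as a flat list rather than nested if/else makes the "exactly one rule applies" property easy to check directly.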

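Decision tree learning

Suppose we have a dataset

$D = \{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$

where $x_i = (x_i^1,x_i^2,\ldots,x_i^n)$ is the input feature vector and $n$ is the number of features, and $y_i \in \{1,2,\ldots,K\}$ is the class label, for $i = 1,2,\ldots,N$, with $N$ the sample size.
We want a decision tree that contradicts the training data as little as possible and that also generalizes well (which is the purpose of pruning).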

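Feature selection

Feature selection means choosing the features that can discriminate between classes in the training data, and testing the most discriminative features first. How can a feature's discriminative power be quantified? This is where the concept of "information gain" comes in.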

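Information theory

Consider a random variable $x$. How much information do we obtain by learning its exact value?
Learning that an unlikely event occurred conveys much more information than learning that a likely event occurred. If the event was certain to happen, learning of it conveys no information at all.
How is this measured? With information entropy:
- If $x$ is a discrete random variable, its entropy is $H[x] = - \sum_i p(x_i) \log_2 p(x_i)$
- If $x$ is a continuous random variable, its entropy is $H[x] = - \int p(x) \log_2 p(x) \, dx$
- If $D$ is a dataset, its entropy is $H(D) = - \sum_{x \in X} p(x) \log_2 p(x)$, where
  - $X$ is the set of all classes in $D$
  - $p(x)$ is the proportion of $D$ that belongs to class $x$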
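As a minimal sketch of the dataset entropy just defined, the function below computes $H(D)$ from a plain list of class labels. The function name and the toy label lists are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


print(entropy(["a", "a", "b", "b"]))  # evenly split two classes: 1.0 bit
print(entropy(["a", "a", "a", "a"]))  # a single class: 0.0 bits
print(entropy(["a", "a", "a", "b"]))  # a 3:1 split: about 0.811 bits
```

A pure dataset has zero entropy and an evenly split two-class dataset has one bit, which matches the intuition that certain events carry no information.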

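Information gain

The information gain $IG(A,D)$ measures the difference between the entropy of a dataset before and after it is split on attribute $A$. In other words, it is how much the uncertainty of the dataset decreases once it is split on $A$.

$IG(A,D) = H(D) - \sum_{t \in T} p(t)H(t)$

where
* $H(D)$ is the entropy of dataset $D$
* $T$ is the set of subsets produced by splitting $D$ on $A$, so that $D = \bigcup_{t \in T} t$
* $p(t)$ is the proportion of $D$ that falls in subset $t$
* $H(t)$ is the entropy of subset $t$

An algorithm that selects features by information gain is called ID3. If the gain ratio is used as the selection criterion instead, the algorithm is C4.5. The gain ratio is

$IG_r(A,D) = \frac{IG(A,D)}{H_A(D)}, \quad H_A(D) = -\sum_{t \in T} p(t) \log_2 p(t)$

where $H_A(D)$ is the entropy of $D$ with respect to the values of $A$, that is, the entropy of the partition itself. Dividing by it penalizes features that take many distinct values.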
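The two criteria can be sketched on a toy dataset. The code below is an illustration and not the author's implementation: the function names and data are made up, and the gain ratio follows the C4.5 convention of dividing the information gain by the entropy of the partition induced by the feature.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    total = len(values)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(values).values())


def information_gain(feature_values, labels):
    """H(D) minus the weighted entropy of the subsets induced by the feature."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def gain_ratio(feature_values, labels):
    """Information gain divided by the entropy of the split itself."""
    split_info = entropy(feature_values)
    if split_info == 0:  # the feature has a single value, so it cannot split D
        return 0.0
    return information_gain(feature_values, labels) / split_info


# Made-up toy data: `outlook` determines the label, `windy` is uninformative.
outlook = ["sunny", "sunny", "rainy", "rainy"]
windy = ["yes", "no", "yes", "no"]
play = ["no", "no", "yes", "yes"]

print(information_gain(outlook, play))  # 1.0
print(information_gain(windy, play))    # 0.0
print(gain_ratio(outlook, play))        # 1.0
```

ID3 would pick `outlook` as the split here, since it removes all uncertainty about the label while `windy` removes none.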

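Decision tree construction

We use the ID3 classification algorithm as an example of how a classification tree is built.
ID3 starts at the root node and iterates: at each node it selects the feature with the largest information gain as the split, producing several data subsets. It stops on a branch when one of the following holds:
* Every element in the subset belongs to the same class; the node becomes a leaf labeled with that class.
* No features are left to choose from; the node becomes a leaf labeled with the most common class among its elements.
* The subset is empty; a leaf is created and labeled with the most common class in the parent node.

The pseudocode (from the Wikipedia article on ID3) is: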

    ID3 (Examples, Target_Attribute, Attributes)
    Create a root node for the tree
    If all examples are positive, Return the single-node tree Root, with label = +.
    If all examples are negative, Return the single-node tree Root, with label = -.
    If number of predicting attributes is empty, then Return the single node tree Root,
    with label = most common value of the target attribute in the examples.
    Otherwise Begin
        A ← The Attribute that best classifies examples.
        Decision Tree attribute for Root = A.
        For each possible value, vi, of A,
            Add a new tree branch below Root, corresponding to the test A = vi.
            Let Examples(vi) be the subset of examples that have the value vi for A
            If Examples(vi) is empty
                Then below this new branch add a leaf node with label = most common target value in the examples
            Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes – {A})
    End
    Return Root

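Pruning

Python implementation

The ID3 implementation is presented in three parts:
1. Loading and preprocessing the data
2. Building the decision tree
3. Testing the decision tree

Loading and preprocessing the data

The experiments use the well-known "Adult" dataset: http://archive.ics.uci.edu/ml/datasets/Adult. It has 14 features, some continuous and some discrete, and the prediction target is a binary class. The data is split into a training set and a test set.

Because the features are a mix of continuous and discrete, and ID3 supports only discrete features, the first problem to solve is discretizing the continuous data. This is done with the pandas function cut, in the line "pd.cut(data_of_feature,bin_num)" of the program below. data_of_feature is one column of data (the values of a single feature), and bin_num is a number giving how many discrete groups the continuous column is divided into. Given these two arguments, pandas.cut splits the range of the column into bin_num equal-width intervals and assigns each value to its interval. Note that the training and test data must be discretized together: if they were discretized separately, their cut points would differ.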
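The effect of discretizing both sets in one batch can be seen in a small sketch. The age values below are made up, and `labels=False` and `retbins=True` are used here only to make the bin indices and edges easy to inspect; the post's own code calls `pd.cut` with just the data and the bin count.

```python
import numpy as np
import pandas as pd

# Made-up ages standing in for one continuous column of the training and test sets.
train_age = np.array([22.0, 35.0, 58.0])
test_age = np.array([19.0, 41.0, 90.0])

# Cut the concatenated column once so both sets share the same bin edges.
combined = np.concatenate([train_age, test_age])
codes, shared_edges = pd.cut(combined, 4, labels=False, retbins=True)
train_bins = codes[:len(train_age)]
test_bins = codes[len(train_age):]

# Cutting the training column alone yields different edges.
_, train_only_edges = pd.cut(train_age, 4, labels=False, retbins=True)

print("shared edges:    ", shared_edges)
print("train-only edges:", train_only_edges)
print("train bins:", train_bins, "test bins:", test_bins)
```

With separate cuts, the same age could land in differently numbered bins in the two sets, so a tree trained on one binning would misread the other.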

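Parts of the code are hard-coded for the Adult dataset, for example which columns are continuous and how many bins each is split into, so switching to another dataset requires changing them accordingly. Preprocessing yields a two-dimensional np.ndarray that serves as the input for building the decision tree. The complete preprocessing code is as follows: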


    import numpy as np
    import pandas as pd


    def prepareData(training_data_file_path, testing_data_file_path):
        print("prepare adult data")
        # To give the continuous features the same cut points in both sets,
        # the training and testing data are discretized in one batch with pandas.cut.

        continue_feature_list = [0, 2, 4, 10, 11, 12]  # column indices of the continuous features
        bins = [10, 12, 8, 12, 12, 12]  # number of bins for each continuous feature
        data = []
        training_size = 0
        with open(training_data_file_path) as f:
            dataList = f.read().splitlines()
        for datai in dataList:
            training_size += 1
            datai_feature_list = datai.split(", ")
            data.append(np.array(datai_feature_list))

        with open(testing_data_file_path) as f:
            dataList = f.read().splitlines()
        for datai in dataList:
            datai_feature_list = datai.split(", ")
            data.append(np.array(datai_feature_list))
        # object dtype so the interval labels written back later are not truncated
        data = np.array(data, dtype=object)
        discretizedData = discretizeData(data, continue_feature_list, bins)

        # return training data and testing data
        print("training_size: ", training_size)
        return discretizedData[0:training_size, :], discretizedData[training_size:, :]


    # data_of_feature: np.array, the values of one feature
    # bin_num: number of bins to discretize into
    def discretizeFeature(data_of_feature, bin_num):
        return pd.cut(data_of_feature, bin_num)


    # data: np.ndarray, training and testing data stacked together
    # continue_feature_list: column indices of the continuous features
    # bins: number of bins for each continuous feature
    # Discretizes the continuous features of data in place
    def discretizeData(data, continue_feature_list, bins):
        for feature_i_index in range(len(continue_feature_list)):
            feature = continue_feature_list[feature_i_index]
            data_of_feature_i = np.array([float(rowi) for rowi in data[:, feature]])  # str to float
            discretized_feature_i = discretizeFeature(data_of_feature_i, bins[feature_i_index])
            print(discretized_feature_i)
            # replace each continuous value with the label of the interval it falls in
            data[:, feature] = np.array(discretized_feature_i).astype(str)
        return data

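Building the decision tree

Next we build the decision tree, following the pseudocode above. Here we only explain what each function in the code does and what some of the parameters mean:

    def makeTree(data,attributes,target,depth): # build the decision tree
    # data: training data
    # attributes: list of the feature names plus the target name
    # target: name of the prediction target
    # depth: depth of the tree (could be used for pruning)
    # The return value is either a predicted label (a leaf) or a dictionary whose single key
    # is the best split feature. The value under that key is another dictionary that maps
    # each value of the split feature to a subtree.

    def majority(data,attributes,target): # most common target value in the dataset
    def get_entropy_data(data,attributes,target,rows): # entropy
    def get_expected_entropy_data(data,attributes,attri,target): # expected conditional entropy
    def infoGain(data,attributes,attri,target): # information gain
    def best_split(data,attributes,target): # best feature to split on
    def getValue(data,attributes,best_attri): # set of all values of the best feature
    def getExample(data,attributes,best_attri,val): # subset of the data where "best feature == val"

The complete tree-building code is as follows: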

    '''
    Created on 2015.10

    @author: joeyqzhou
    '''

    import copy

    import numpy as np


    # return the most common label in data
    def majority(data, attributes, target):
        target_index = attributes.index(target)

        valFreq = {}
        for i in range(data.shape[0]):
            if data[i, target_index] in valFreq:
                valFreq[data[i, target_index]] += 1
            else:
                valFreq[data[i, target_index]] = 1

        maxLabel = 0
        major = ""
        for label in valFreq.keys():
            if valFreq[label] > maxLabel:
                major = label
                maxLabel = valFreq[label]

        return major


    # entropy of the rows selected by the 0/1 mask rows,
    # weighted by the fraction of rows selected
    def get_entropy_data(data, attributes, target, rows):
        data_len = data.shape[0]
        target_index = attributes.index(target)
        target_list = [data[i, target_index] for i in range(data_len) if rows[i] == 1]
        target_set = set(target_list)
        len_of_each_target_val = []
        for target_val in target_set:
            len_of_each_target_val.append(target_list.count(target_val))

        entropy_data = 0.0
        total = sum(len_of_each_target_val)

        for target_count in len_of_each_target_val:
            p = target_count * 1.0 / total
            entropy_data += -p * np.log2(p)

        return entropy_data * sum(rows) * 1.0 / len(rows)


    # expected entropy of the target after splitting on attri
    def get_expected_entropy_data(data, attributes, attri, target):
        attri_index = attributes.index(attri)
        attri_value_set = set(data[:, attri_index])
        data_len = data.shape[0]
        sum_expected_entropy = 0.0

        for attri_value in attri_value_set:
            attri_selected_rows = np.zeros(data_len)
            for i in range(data_len):
                if data[i, attri_index] == attri_value:
                    attri_selected_rows[i] = 1
            sum_expected_entropy += get_entropy_data(data, attributes, target, attri_selected_rows)

        return sum_expected_entropy


    def infoGain(data, attributes, attri, target):
        entropy_data = get_entropy_data(data, attributes, target, np.ones(data.shape[0]))
        expected_entropy_data = get_expected_entropy_data(data, attributes, attri, target)
        return entropy_data - expected_entropy_data


    # id3: choose the attribute with the largest information gain
    def best_split(data, attributes, target):
        max_info = 0.000001  # can also be seen as a minimum-gain threshold
        best_attri = ""
        print("best_split attributes: ", attributes)
        print("data_len: ", data.shape[0])
        for attri in attributes:
            if attri != target:
                attri_infoGain = infoGain(data, attributes, attri, target)
                if attri_infoGain > max_info:
                    max_info = attri_infoGain
                    best_attri = attri

        print("max_info_gain: ", max_info)
        print("best_attri: ", best_attri)

        return best_attri


    # get the possible values of best_attri in the data
    def getValue(data, attributes, best_attri):
        best_attri_index = attributes.index(best_attri)
        return set(data[:, best_attri_index])


    # get the rows of the parent data where best_attri == val, with that column removed
    def getExample(data, attributes, best_attri, val):
        best_attri_index = attributes.index(best_attri)
        data_len = data.shape[0]
        subset_data = []
        for i in range(data_len):
            if data[i, best_attri_index] == val:
                subset_data.append(np.concatenate([data[i, 0:best_attri_index], data[i, (best_attri_index + 1):]]))

        return np.array(subset_data)


    # data: np.ndarray, training data; each row is a record, each column a feature
    # attributes: list of feature names (including the target name)
    # target: target name
    def makeTree(data, attributes, target, depth):
        print("depth: ", depth)
        depth += 1
        val = [record[attributes.index(target)] for record in data]  # target value of every record
        label_prediction = majority(data, attributes, target)

        # len(attributes) <= 1 means only the target is left, so no feature remains to split on
        if len(attributes) <= 1:
            return label_prediction
        elif val.count(val[0]) == len(val):
            # all records share one label
            return val[0]
        else:
            best_attri = best_split(data, attributes, target)
            print("best_attri: ", best_attri)
            if best_attri == "":
                # no attribute gives a gain above the threshold
                return label_prediction

            # create a new decision tree
            tree = {best_attri: {}}

            for attri_val in getValue(data, attributes, best_attri):
                examples = getExample(data, attributes, best_attri, attri_val)
                if examples.shape[0] == 0:  # empty subset: leaf labeled with the majority label
                    tree[best_attri][attri_val] = label_prediction
                else:
                    newAttr = copy.copy(attributes)
                    newAttr.remove(best_attri)
                    subTree = makeTree(examples, newAttr, target, depth)
                    tree[best_attri][attri_val] = subTree

        return tree

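Testing the decision tree

The last step is testing. While testing I picked up a couple of small lessons:
1. How finely the continuous data is discretized affects the final accuracy. For example, for the continuous feature age I first used 4 bins and the accuracy was not high; with 10 bins it improved considerably. So the number of bins for each continuous feature (bin_num) can be tuned to improve the final result.
2. For cases the decision tree cannot resolve, the majority label in the parent subtree can be returned as the prediction.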
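The testing code itself is not shown in this post, so the following is only a sketch of how a record might be classified with the nested-dictionary tree format used by makeTree. The function names, the toy tree, and the record layout (a dict from attribute name to value) are assumptions. The fallback for an unseen feature value approximates "the majority label of the parent subtree" by the most common leaf label under the current node, because the tree dictionary does not store example counts.

```python
from collections import Counter


def leaf_labels(tree):
    """Collect every leaf label under `tree` (a label or a nested dict)."""
    if not isinstance(tree, dict):
        return [tree]
    attribute = next(iter(tree))
    labels = []
    for subtree in tree[attribute].values():
        labels.extend(leaf_labels(subtree))
    return labels


def classify(tree, record):
    """Walk the tree for `record`, a dict of attribute name -> value."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        branches = tree[attribute]
        value = record.get(attribute)
        if value not in branches:
            # Unseen value: fall back to the most common leaf label under this node.
            return Counter(leaf_labels(tree)).most_common(1)[0][0]
        tree = branches[value]
    return tree


# Made-up tree in the {feature: {value: subtree_or_label}} format.
toy_tree = {
    "education": {
        "Bachelors": {"age": {"young": "<=50K", "old": ">50K"}},
        "HS-grad": "<=50K",
    }
}

print(classify(toy_tree, {"education": "Bachelors", "age": "old"}))  # >50K
print(classify(toy_tree, {"education": "Masters", "age": "old"}))    # <=50K (fallback)
```

Storing the majority training label at each internal node while building the tree would give an exact fallback instead of this leaf-count approximation.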

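On adult.test the accuracy reaches about 78.3%, which is still well short of an accuracy of around 85%. Possible reasons:
1. The discretization of the continuous features has room for improvement
2. No pruning was performed

Summary

The complete code can be downloaded from http://download.csdn.net/detail/u011467621/9211651. Just run main.py.
I do not have much experience writing blog posts yet, and I hope to keep improving. If you have any questions or suggestions, leave a comment or email me at [email protected] so we can discuss and improve together.

References

1 統計學習方法 (Statistical Learning Methods)
2 PRML
3 https://en.wikipedia.org/wiki/ID3_algorithm
4 https://github.com/NinjaSteph/DecisionTree (code reference)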
