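Data Analysis Series Highlights (Part 3)

阿新 • Published: 2019-01-07

Data Analysis (Part 3)

Before analyzing the UCI data, it helps to understand a few decision tree concepts.

Here is a recommended blog post on decision trees: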
```
http://www.cnblogs.com/yonghao/p/5061873.html
```
Basic characteristics of a decision tree (DT)
- DT is a supervised learning method, so it needs labeled data
- It runs as a single process only, so it is not a good fit for giant datasets
- PS: It works quite well on small, clean datasets
Characteristics of the UCI data: the UCI credit approval data set
- 690 data entries, a relatively small dataset
- 15 attributes, which is quite few
- only 5% missing values
- 2 classes
Comparing these two lists, we can expect a DT to work well for our dataset

With that in mind, we can try to implement a decision tree in code. Using the skeleton provided by Teacher Duan (段老師), write your own code in the following steps:

Copy and paste your code into the function readfile(file_name) under the comment # Your code here.
Make sure your input and output match what I described in the docstring
Make a minor improvement to handle missing data: in this case let's use the string "missing" to represent missing data. Note that the data file marks it as "?".
Implement is_missing(value), class_counts(rows), is_numeric(value) as directed in the docstring
Implement class Determine. This object represents a node of our DT.
- It has 2 inputs and a method.
- We can think of it as the Question we are asking at each node.
Implement the method partition(rows, question) as described in the docstring
- Use Determine class to partition data into 2 groups
Implement the method gini(rows) as described in the docstring
- Here is the formula for Gini impurity: $\mathit{Gini} = 1 - \sum_{i=1}^{n}p_i^2$
  - where n is the number of classes
  - $p_i$ is the percentage of the given class i
Implement the method info_gain(left, right, current_uncertainty) as described in the docstring
- Here is the formula for Information Gain: $IG = U_0 - p_l \cdot G_l - p_r \cdot G_r$
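  - where $p_r = 1-p_l$
  - $U_0$ is current_uncertainty
  - $p_l$ is the proportion (probability) of rows that fall in the left branch, and likewise $p_r$ for the right branch
  - $G_l$ and $G_r$ are the Gini impurities of the left and right branches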
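
To make the two formulas concrete, here is a small worked sketch that operates on plain label lists rather than full data rows. The helper names `gini_from_labels` and `info_gain_from_labels` and the toy labels are assumptions made for illustration; they are not part of the skeleton.

```python
from collections import Counter


def gini_from_labels(labels):
    """Gini impurity 1 - sum(p_i^2) for a list of class labels (hypothetical helper)."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty branch as pure
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def info_gain_from_labels(left, right, current_uncertainty):
    """IG = U_0 - p_l * G_l - p_r * G_r, with p_r = 1 - p_l."""
    p_left = len(left) / (len(left) + len(right))
    p_right = 1 - p_left
    return (current_uncertainty
            - p_left * gini_from_labels(left)
            - p_right * gini_from_labels(right))


# Toy data: 4 positive and 4 negative labels, so both classes have p = 0.5.
parent = ['+', '+', '+', '+', '-', '-', '-', '-']
left = ['+', '+', '+', '-']    # 3 of 4 positive
right = ['+', '-', '-', '-']   # 1 of 4 positive

u0 = gini_from_labels(parent)                   # 1 - (0.25 + 0.25) = 0.5
gain = info_gain_from_labels(left, right, u0)   # 0.5 - 0.5*0.375 - 0.5*0.375 = 0.125
print(u0, gain)
```

Each branch has impurity 1 - (0.75^2 + 0.25^2) = 0.375, which is lower than the parent's 0.5, so the split yields a positive gain.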

My code is as follows, for reference only.

def readfile(file_name):
    """
   This function reads data file and returns structured and cleaned data in a list
   :param file_name: relative path under data folder
   :return: data, in this case it should be a 2-D list of the form
   [[data1_1, data1_2, ...],
     [data2_1, data2_2, ...],
     [data3_1, data3_2, ...],
     ...]
    
   i.e.
   [['a', 58.67, 4.46, 'u', 'g', 'q', 'h', 3.04, 't', 't', 6.0, 'f', 'g', '00043', 560.0, '+'],
     ['a', 24.5, 0.5, 'u', 'g', 'q', 'h', 1.5, 't', 'f', 0.0, 'f', 'g', '00280', 824.0, '+'],
     ['b', 27.83, 1.54, 'u', 'g', 'w', 'v', 3.75, 't', 't', 5.0, 't', 'g', '00100', 3.0, '+'],
   ...]
    
   Couple things you should note:
   1. You need to handle missing data. In this case let's use "missing" to represent all missing data
   2. Be careful of data types. For instance,
       "58.67" and "0.2356" should be number and not a string
       "00043" should be string but not a number
       It is OK to treat all numbers as float in this case. (You don't need to worry about differentiating integer and float)
   """
    # Your code here
    output = []
    # Open with a context manager so the file is closed after reading
    with open(file_name, 'r') as data_file:
        for line in data_file:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = []
            for substr in line.split(","):
                if substr == '?':
                    # missing data is given as "?"
                    row.append('missing')
                elif substr.isdigit() and len(substr) > 1 and substr.startswith('0'):
                    # zero-padded codes such as "00043" stay strings
                    row.append(substr)
                else:
                    try:
                        # treat every number as float, as the docstring allows
                        row.append(float(substr))
                    except ValueError:
                        # categorical value such as 'a' or '+'
                        row.append(substr)
            output.append(row)
    return output
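
The conversion rules from the docstring can be checked in isolation on a single line before touching the data file. The sketch below is one way to express them; `parse_token` is a hypothetical helper name, and the sample line is made up from the docstring example with a `?` swapped in.

```python
def parse_token(token):
    """Convert one comma-separated field following the docstring rules (hypothetical helper)."""
    if token == '?':
        return 'missing'              # the data file marks missing values with "?"
    if token.isdigit() and len(token) > 1 and token.startswith('0'):
        return token                  # e.g. "00043" is kept as a string
    try:
        return float(token)           # all other numbers are treated as float
    except ValueError:
        return token                  # categorical value such as "a" or "+"


line = "a,58.67,?,u,00043,560,+\n"
row = [parse_token(tok) for tok in line.strip().split(",")]
print(row)
# ['a', 58.67, 'missing', 'u', '00043', 560.0, '+']
```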




def is_missing(value):
    """
   Determines if the given value is a missing data, please refer back to readfile() where we defined what is a "missing" data
   :param value: value to be checked
   :return: boolean (True, False) of whether the input value is the same as our "missing" notation
   """
    return value == 'missing'


def class_counts(rows):
    """
   Count how many data samples there are for each label
   :param rows: Input is a 2D list in the form of what you have returned in readfile()
   :return: Output is a dictionary/map in the form:
   {"label_1": #count,
     "label_2": #count,
     "label_3": #count,
     ...
   }
   """
    # This first version is hard-coded: it only counts the labels given here ('+', '-'). To count data with arbitrary labels, it was extended into the versions below
    # label_dict = {}
    # count1 = 0
    # count2 = 0
    # # rows is the result returned by readfile
    # for row in rows:
    #     if row[-1] == '+':
    #         count1 += 1
    #     elif row[-1] == '-':
    #         count2 += 1
    # label_dict['+'] = count1
    # label_dict['-'] = count2
    # return label_dict

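    # Extension 1
    # This version can count data with any set of labels. It uses two loops: the first collects the distinct labels found in the data into a list, lable_list
    # The second goes over the labels in lable_list and, for each one, runs a nested loop over all rows to count how many rows carry that label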
    # label_dict = {}
    # lable_list = []
    # for row in rows:
    #     lable = row[-1]
    #     if lable_list == []:
    #         lable_list.append(lable)
    #     else:
    #         if lable in lable_list:
    #             continue
    #         else:
    #             lable_list.append(lable)
    #
    # for lable_i in lable_list:
    #     count_row_i = 0
    #     for row_i in rows:
    #         if lable_i == row_i[-1]:
    #             count_row_i += 1
    #     label_dict[lable_i] = count_row_i
    # print(label_dict)
    # return label_dict
    #

    # Extension 2
    # This version uses the dictionary's own keys to record each label seen so far and accumulates its count
    label_dict = {}
    for row in rows:
        label = row[-1]
        if label in label_dict:
            label_dict[label] += 1
        else:
            label_dict[label] = 1
    return label_dict
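
For comparison, the standard library's `collections.Counter` performs the same tally in a single pass. This is only an alternative sketch; the name `class_counts_counter` and the toy rows are made up for illustration.

```python
from collections import Counter


def class_counts_counter(rows):
    """Tally the label stored in the last column of each row (hypothetical alternative)."""
    return dict(Counter(row[-1] for row in rows))


toy_rows = [['a', 1.0, '+'], ['b', 2.0, '-'], ['a', 3.0, '+']]
print(class_counts_counter(toy_rows))
# {'+': 2, '-': 1}
```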


def is_numeric(value):
    """
   Test if the input is a number(float/int)
   :param value: Input is a value to be tested
   :return: Boolean (True/False)
   """
    # Your code here
    # An earlier attempt used eval(), which evaluates a string as an expression and returns the result:
    # if type(eval(str(value))) == int or type(eval(str(value))) == float:
    #     return True
    # eval() is not needed here, and some blog posts point out that it carries security risks

    # readfile() has already converted numeric fields to numbers, so checking the type is enough.
    # Letters, "missing", and zero-padded strings such as "00043" are str and therefore not numeric.
    # bool is excluded because it is a subclass of int.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


class Determine:
    """
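   This class is used for comparison. It stores a column index and a value
   The match method compares numbers or strings
   It can be thought of as the "question" asked at each node of the decision tree, e.g.:
       Is today's temperature cold or hot?
       Is today's weather sunny, cloudy, or rainy?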
   """
    def __init__(self, column, value):
        """
       initial structure of our object
       :param column: column index of our "question"
       :param value: splitting value of our "question"
       """
        self.column = column
        self.value = value

    def match(self, example):
        """
       Compares example data and self.value
       note that you need to determine whether the data asked is numeric or categorical/string
       Be careful for missing data
       :param example: a full row of data
       :return: boolean(True/False) of whether input data is GREATER THAN self.value (numeric) or the SAME AS self.value (string)
       """
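        # Your code here. "missing" is a string too, so it needs no special case
        e_index = self.column
        value_node = self.value
        # The type check on value_node was added while working on e_index = 10, because that column has mixed types (strings starting with 0 as well as int numbers), and int and str cannot be compared
        # Use > only when both the cell and value_node are numeric; otherwise fall back to equality
        if is_numeric(example[e_index]) and isinstance(value_node, (int, float)):
            return example[e_index] > value_node
        else:
            return example[e_index] == value_node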


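    def __repr__(self):
        """
       Used when printing the tree
       :return: the question as a readable string
       """
        # header is assumed to be a module-level list of column names supplied by the skeleton
        if is_numeric(self.value):
            condition = ">"
        else:
            condition = "=="
        return "{} {} {}?".format(
            header[self.column], condition, str(self.value))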


def partition(rows, question):
    """
   Split the data: a row that satisfies the Question goes into true_rows, otherwise it goes into false_rows
   :param rows: data set/subset
   :param question: Determine object you implemented above
   :return: 2 lists based on the answer of the question
   """
    # Your code here. question is a Determine object
    true_rows, false_rows = [], []
    # Iterate over the 2-D list because Determine.match only looks at the given index within a single row
    for row in rows:
        if question.match(row):
            true_rows.append(row)
        else:
            false_rows.append(row)
    return true_rows, false_rows
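
A quick way to see what partition returns is to run the same logic on a few in-memory rows. The sketch below re-declares compact stand-ins (`ToyQuestion`, `toy_partition`) so that it runs on its own; they mirror the behaviour described in the docstrings but are assumptions for illustration, not the skeleton's code.

```python
class ToyQuestion:
    """Stand-in for Determine: numeric values use >, everything else uses ==."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def match(self, example):
        cell = example[self.column]
        both_numeric = (isinstance(cell, (int, float))
                        and isinstance(self.value, (int, float)))
        if both_numeric:
            return cell > self.value
        return cell == self.value   # also covers the "missing" marker


def toy_partition(rows, question):
    """Split rows into (true_rows, false_rows) according to question.match."""
    true_rows, false_rows = [], []
    for row in rows:
        (true_rows if question.match(row) else false_rows).append(row)
    return true_rows, false_rows


rows = [
    ['a', 4.46, '+'],
    ['b', 0.5, '-'],
    ['a', 'missing', '-'],
]

# Numeric question on column 1: is the value greater than 1.8?
num_true, num_false = toy_partition(rows, ToyQuestion(1, 1.8))
print(len(num_true), len(num_false))   # 1 2

# Categorical question on column 0: is the value equal to 'a'?
cat_true, cat_false = toy_partition(rows, ToyQuestion(0, 'a'))
print(len(cat_true), len(cat_false))   # 2 1
```

Note that the row holding "missing" in column 1 lands in the false branch of the numeric question, because a string never satisfies the numeric comparison.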


def gini(rows):
    """
   Compute the Gini value of a set of rows, one way of expressing how mixed the labels are
   :param rows: data set/subset
   :return: the Gini value, i.e. the "impurity"
   """
    data_set_size = len(rows)    # total number of rows
    class_dict = class_counts(rows)
    sum_subgini = 0
    for class_dict_value in class_dict.values():
        sub_gini = (class_dict_value/data_set_size) ** 2
        sum_subgini += sub_gini
    impurity = 1 - sum_subgini   # named impurity to avoid shadowing the function gini
    return impurity



def info_gain(left, right, current_uncertainty):
    """
   Compute the information gain
   Please refer to the .md tutorial for details
   :param left: left branch
   :param right: right branch
   :param current_uncertainty: current uncertainty (data)
   """
    p_left = len(left) / (len(left) + len(right))
    p_right = 1 - p_left
    return current_uncertainty - p_left * gini(left) - p_right * gini(right)




# Use this data to test the quality of your code
data = readfile(r"E:\data\crx.data")
t, f = partition(data, Determine(2, 1.8))
print(info_gain(t, f, gini(data)))

January 2, 2019
