Learning Data Mining with Python: Affinity Analysis
阿新 • Published: 2018-12-10
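I have recently been studying data mining with Python, mainly by working through the book 《Python資料探勘入門與實踐》. This post organizes what the book covers along with some of the problems I ran into.
Data mining aims to let a computer make decisions based on existing data.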
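The first step in data mining is usually to build a dataset. A dataset consists mainly of:
(1) Samples: these represent objects in the real world
(2) Features: these describe the samples in the dataset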
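The first topic I encountered was affinity analysis, which uses the similarity between individual samples to determine how closely they are related.
This example uses a dataset of product purchases. The products are bread, milk, cheese, apples, and bananas.
Each feature can take only one of two values, 0 or 1, indicating whether the product was bought, not how many units were bought.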
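To make the layout concrete, here is a minimal sketch of such a dataset as a NumPy array, with one row per sample and one column per feature. The five rows are made up for illustration and are not taken from the real dataset file; only the feature order follows the post.

```python
import numpy as np

# Feature order as used in this post
features = ["bread", "milk", "cheese", "apple", "banana"]

# Hypothetical data: 5 made-up samples (rows), 1 = bought, 0 = not bought
X = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

n_samples, n_features = X.shape

# Because every value is 0 or 1, a column sum is the number of buyers
buyers = dict(zip(features, X.sum(axis=0).tolist()))
print(n_samples, n_features)
print(buyers)
```

Because the features are binary, counting the buyers of a product reduces to summing its column.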
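Once we have the samples and features, we want to find rules such as "if a person buys X, they are likely to buy Y".
After finding the rules we still need to judge how good they are. Two metrics are used here: support and confidence. Support is the number of times a rule holds in the dataset. Confidence is the proportion of the cases matching the premise in which the rule holds.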
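As a small worked example of the two metrics, the sketch below computes support and confidence for a single rule, "if a person buys cheese they will also buy banana". The five samples are made up for illustration, so the numbers say nothing about the real dataset.

```python
import numpy as np

# Hypothetical data: columns are bread, milk, cheese, apple, banana
X = np.array([
    [0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 0, 1],
])

premise, conclusion = 2, 4  # cheese -> banana

bought_premise = X[:, premise] == 1        # samples that match the "if" part
bought_conclusion = X[:, conclusion] == 1  # samples that match the "then" part

# Support: number of samples in which the rule holds
support = int(np.sum(bought_premise & bought_conclusion))
# Confidence: share of premise-matching samples in which the rule holds
confidence = support / int(np.sum(bought_premise))

print("support:", support)
print("confidence: {0:.3f}".format(confidence))
```

In this toy data three samples contain cheese and two of those also contain banana, so support is 2 and confidence is 2/3.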
The code is as follows:
""" 《Python資料探勘入門與實踐》 親和性分析 資料集每一列代表:是否購買——麵包、牛奶、乳酪、蘋果、香蕉 支援度support——規則應驗的次數 置信度confidence——規則應驗的比例 """ import numpy as np from collections import defaultdict #預設字典——如果沒有對應的鍵,返回預設值0 from operator import itemgetter #針對字典進行排序 dataset_filename = r'F:\Python\pycharm\DataAnalysis_test\data\affinity_dataset.txt' X = np.loadtxt(dataset_filename) # print(X[:15])#顯示前15行資料 features = ["bread", "milk", "cheese", "apple", "banana"]#特徵列表 """檢視有多少人購買了蘋果""" # num_apple_buy = 0 # for sample in X: # if sample[3] == 1: # num_apple_buy +=1 # print("{0} people bought Apples".format(num_apple_buy)) """構建規則字典""" valid_rules = defaultdict(int)#規則應驗 invalid_rules = defaultdict(int)#規則無效 num_occurances = defaultdict(int)#符合A條件(如果。。。)的所有情況 n_features = 5#共有幾項特徵 for sample in X: for premise in range(n_features): if sample[premise] == 0: continue else: num_occurances[premise] += 1#符合A條件的情況+1 for conclusion in range(n_features): if premise == conclusion: continue else: if sample[conclusion] == 1: valid_rules[(premise, conclusion)] +=1 #規則應驗 else: invalid_rules[(premise, conclusion)] +=1 #規則無效 #計算每條規則的置信度(confidence規則的準確率如何)和支援度(support規則應驗的次數) support = valid_rules confidence = defaultdict(float) for (premise, conclusion) in valid_rules.keys(): rule = (premise, conclusion) confidence[rule] = valid_rules[rule] / num_occurances[premise] """定義輸出每條規則及其置信度和支援度的函式""" def print_rule(premise, conclusion, support, confidence, features): premise_name = features[premise] conclusion_name = features[conclusion] print("rule: if a person buys {0} they will also buy {1}".format(premise_name, conclusion_name)) print("置信度confidence: {0:.3f}".format(confidence[(premise, conclusion)])) print("支援度support:{0}".format(support[(premise, conclusion)])) """排序找出最佳規則""" def best_rule(): sorted_support = sorted(support.items(), key=itemgetter(1), #以字典的值的次序進行排序 reverse=True)#降序 sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True) for index in range(5):#輸出排序最高的五個規則 print("RULE #{0}".format(index 
+ 1)) premise, conclusion = sorted_support[index][0] print_rule(premise, conclusion, support, confidence, features) if __name__ == '__main__': premise = 2 conclusion = 4 # print_rule(premise, conclusion, support, confidence, features) best_rule() # print(valid_rules)
The output is the evaluation of the rules:
RULE #1
rule: if a person buys cheese they will also buy banana
置信度confidence: 0.659
支援度support:27
RULE #2
rule: if a person buys banana they will also buy cheese
置信度confidence: 0.458
支援度support:27
RULE #3
rule: if a person buys apple they will also buy cheese
置信度confidence: 0.694
支援度support:25
RULE #4
rule: if a person buys cheese they will also buy apple
置信度confidence: 0.610
支援度support:25
RULE #5
rule: if a person buys banana they will also buy apple
置信度confidence: 0.356
支援度support:21
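The list above is ranked by support. A rule and its reverse always share the same support, because both count the same co-purchases, but their confidence can differ because each is divided by the number of buyers of its own premise. The sketch below re-ranks the five printed rules by confidence instead. The numbers are copied from the output above; the `rules` dictionary and its layout are just for illustration. It only reorders these five rules, so a rule with lower support but higher confidence elsewhere in the full dataset would not appear.

```python
# (premise, conclusion) -> (support, confidence), copied from the printed output
rules = {
    ("cheese", "banana"): (27, 0.659),
    ("banana", "cheese"): (27, 0.458),
    ("apple", "cheese"): (25, 0.694),
    ("cheese", "apple"): (25, 0.610),
    ("banana", "apple"): (21, 0.356),
}

# Sort by the confidence value (second element of each tuple), highest first
by_confidence = sorted(rules.items(), key=lambda item: item[1][1], reverse=True)

for (premise, conclusion), (support, confidence) in by_confidence:
    print("{0} -> {1}: confidence {2:.3f}, support {3}".format(
        premise, conclusion, confidence, support))
```

Among these five, the rule "apple -> cheese" moves to the top even though it was third by support, which shows why it is worth looking at both metrics.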
Download link for the dataset used in this example: "Product purchase dataset download"