Data Mining | Affinity Analysis (Part 2)
阿新 • Published: 2019-01-29
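The previous post covered a simple affinity analysis, but it computed the support and confidence of only one rule. This post shows how to compute them for every rule.
First, create three dictionaries: one for valid rule counts, one for invalid rule counts, and one for the number of times each premise occurs.
# Create dictionaries that store the valid and invalid counts for each rule
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)  # how many times each premise occurs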
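The reason for choosing defaultdict(int) over a plain dict can be seen in a small sketch. The counter name `counts` and the keys used here are hypothetical and are not part of the article's code.

```python
from collections import defaultdict

# Hypothetical counter, separate from the article's dictionaries
counts = defaultdict(int)

# A missing key is created with int() == 0 on first access,
# so += works without first checking whether the key exists
counts[(1, 2)] += 1
counts[(1, 2)] += 1
counts[(0, 4)] += 1

print(counts[(1, 2)])  # 2
print(counts[(0, 4)])  # 1
print(counts[(3, 3)])  # 0 (reading a missing key also inserts it)
print(len(counts))     # 3
```

With a plain dict, the first `+=` on a new key would raise a KeyError, so each update would need an explicit existence check.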
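With the dictionaries in place, count the rules: check whether each one holds for a sample, then record it in the matching dictionary.
Each key is a tuple such as (1, 2), meaning a customer bought milk and also bought cheese. A value of 7 under that key would mean 7 customers bought both.
The loops below test each condition and update the corresponding dictionary.
# A key such as (1, 2) means milk is the premise and cheese the conclusion; the value is a count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:  # premise: the customer bought this item
            continue
        num_occurences[premise] += 1  # premise met: record one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:  # skip rules whose premise and conclusion are the same item
                continue
            if sample[conclusion] == 1:  # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:  # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1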
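For 0/1 data, the same counts can also be obtained with a matrix product instead of nested loops. This is an alternative technique, not the approach the article uses; the toy array `X_toy` and the variable names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 customers x 3 items, 1 = bought, 0 = not bought
X_toy = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])

# Entry [i, j] counts samples where items i and j were both bought;
# the diagonal holds how many samples bought item i at all
co_counts = X_toy.T @ X_toy
occurrences = np.diag(co_counts)

# Samples that bought i but not j: purchases of i minus joint purchases
invalid_counts = occurrences[:, None] - co_counts

print(co_counts)
print(invalid_counts)
```

The off-diagonal entries of `co_counts` play the role of valid_rules, `occurrences` plays the role of num_occurences, and the off-diagonal entries of `invalid_counts` play the role of invalid_rules.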
Printing the three dictionaries (valid_rules, invalid_rules, and num_occurences, in that order) gives:
defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'int'>, {(1, 0): 23, (1, 3): 31, (2, 0): 32, (2, 3): 17, (4, 0): 40, (4, 3): 37, (1, 2): 38, (1, 4): 26, (2, 1): 28, (3, 0): 28, (3, 1): 24, (4, 1): 41, (3, 2): 20, (3, 4): 15, (2, 4): 7, (0, 1): 20, (0, 3): 32, (4, 2): 32, (0, 2): 39, (0, 4): 22})
defaultdict(<class 'int'>, {1: 46, 2: 36, 4: 61, 3: 39, 0: 43})
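One way to sanity-check these numbers: every sample that satisfies a premise is counted as either a valid or an invalid case for each conclusion, so the two counts should add up to the premise's occurrence count. The sketch below checks this for a few entries copied from the printed output; the short dictionary names are illustrative.

```python
# A few entries copied from the printed dictionaries above
valid = {(1, 2): 8, (0, 1): 23, (2, 4): 29}
invalid = {(1, 2): 38, (0, 1): 20, (2, 4): 7}
occurrences = {0: 43, 1: 46, 2: 36}

for (premise, conclusion), n_valid in valid.items():
    total = n_valid + invalid[(premise, conclusion)]
    # Every sample with the premise is either a valid or an invalid case
    assert total == occurrences[premise]
    print(premise, conclusion, total)
```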
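With these dictionaries in hand, we still need the support and confidence of each rule, so we build a support dictionary and a confidence dictionary.
# Compute support and confidence as dictionaries
support = valid_rules  # support: number of times the rule holds
confidence = defaultdict(float)  # confidence: fraction of premise occurrences where the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]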
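As a worked example of the division above, the sketch below recomputes the confidence for one pair of items in both directions, using counts copied from the printed dictionaries. The variable names are illustrative. It shows that the support count is the same in both directions, while the confidence depends on which item is the premise.

```python
# Counts copied from the printed output: bread is index 0, milk is index 1
both = 23          # valid_rules[(0, 1)] and valid_rules[(1, 0)]
bread_buyers = 43  # num_occurences[0]
milk_buyers = 46   # num_occurences[1]

conf_bread_to_milk = both / bread_buyers
conf_milk_to_bread = both / milk_buyers

print("{0:.3f}".format(conf_bread_to_milk))  # 0.535
print("{0:.3f}".format(conf_milk_to_bread))  # 0.500
```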
Printing support and confidence gives:
defaultdict(<class 'int'>, {(1, 2): 8, (1, 4): 20, (2, 1): 8, (2, 4): 29, (4, 1): 20, (4, 2): 29, (2, 3): 19, (3, 2): 19, (3, 4): 24, (4, 3): 24, (1, 3): 15, (3, 1): 15, (0, 1): 23, (0, 2): 4, (0, 3): 11, (0, 4): 21, (1, 0): 23, (2, 0): 4, (3, 0): 11, (4, 0): 21})
defaultdict(<class 'float'>, {(1, 2): 0.17391304347826086, (1, 4): 0.43478260869565216, (2, 1): 0.2222222222222222, (2, 4): 0.8055555555555556, (4, 1): 0.32786885245901637, (4, 2): 0.47540983606557374, (2, 3): 0.5277777777777778, (3, 2): 0.48717948717948717, (3, 4): 0.6153846153846154, (4, 3): 0.39344262295081966, (1, 3): 0.32608695652173914, (3, 1): 0.38461538461538464, (0, 1): 0.5348837209302325, (0, 2): 0.09302325581395349, (0, 3): 0.2558139534883721, (0, 4): 0.4883720930232558, (1, 0): 0.5, (2, 0): 0.1111111111111111, (3, 0): 0.28205128205128205, (4, 0): 0.3442622950819672})
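With these dictionaries we can look up the support and confidence of any rule.
First define a list of item names so the results are easier to read.
Then define an output function that prints the rule, its support, and its confidence.
The code is as follows:
features = ["bread", "milk", "cheese", "apples", "bananas"]
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: If a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))
if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
This looks up the support and confidence of the rule "a customer who buys bread also buys milk". The result:
Rule: If a customer buys bread, they will also buy milk
Support: 23
Confidence: 0.535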
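Since rules are keyed by index, it can be handy to build the key from item names. The helper `rule_from_names` below is hypothetical and not part of the article's code, and the English item names are used only for illustration.

```python
# Item names in index order; English names are used here for illustration
features = ["bread", "milk", "cheese", "apples", "bananas"]

def rule_from_names(premise_name, conclusion_name, features):
    """Hypothetical helper: turn two item names into a (premise, conclusion) key."""
    return (features.index(premise_name), features.index(conclusion_name))

rule = rule_from_names("bread", "milk", features)
print(rule)  # (0, 1)
```

The resulting tuple could then be unpacked into the output function, for example `print_rule(*rule, support, confidence, features)`.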
Complete code:
# coding: utf-8
from collections import defaultdict
import numpy as np
# Dataset file name
dataset_filename = "affinity_dataset.txt"
# Load the dataset
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
# Look for rules of the form: a customer who buys X may also want to buy Y
# Rule quality is judged by support (times the rule holds) and confidence (rule accuracy)
# A rule consists of a premise and a conclusion
# Create dictionaries that store the valid and invalid counts for each rule
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)  # how many times each premise occurs
# A key such as (1, 2) means milk is the premise and cheese the conclusion; the value is a count
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:  # premise: the customer bought this item
            continue
        num_occurences[premise] += 1  # premise met: record one more occurrence
        for conclusion in range(n_features):  # conclusion: what else was bought given the premise
            if premise == conclusion:  # skip rules whose premise and conclusion are the same item
                continue
            if sample[conclusion] == 1:  # rule holds: count it in valid_rules
                valid_rules[(premise, conclusion)] += 1
            else:  # rule fails: count it in invalid_rules
                invalid_rules[(premise, conclusion)] += 1
# Compute support and confidence as dictionaries
support = valid_rules  # support: number of times the rule holds
confidence = defaultdict(float)  # confidence: fraction of premise occurrences where the rule holds
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]
features = ["bread", "milk", "cheese", "apples", "bananas"]
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    rule = (premise, conclusion)
    print("Rule: If a customer buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("Support: {0}".format(support[rule]))
    print("Confidence: {0:.3f}".format(confidence[rule]))
if __name__ == '__main__':
    premise = 0
    conclusion = 1
    print_rule(premise, conclusion, support, confidence, features)
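The script expects a file named affinity_dataset.txt, which is not included in this post. For readers who want to try the code without it, a stand-in file of random 0/1 rows can be generated along these lines. Everything here is an assumption: the file name affinity_dataset_fake.txt, the 100 samples, the 0.4 purchase probability, and the seed are arbitrary, and the results will not match the numbers shown in this post.

```python
import numpy as np

# Assumptions: 100 samples is arbitrary; 5 columns matches the five item names
rng = np.random.default_rng(seed=14)  # arbitrary seed, only for repeatability
X_fake = (rng.random((100, 5)) < 0.4).astype(int)

# Write space-separated 0/1 values, one customer per line
np.savetxt("affinity_dataset_fake.txt", X_fake, fmt="%d")

# Reading it back gives a 2-D array with the shape the script unpacks
X_loaded = np.loadtxt("affinity_dataset_fake.txt")
n_samples, n_features = X_loaded.shape
print(n_samples, n_features)
```

Pointing dataset_filename at this file should let the complete script run end to end.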
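Now that the support and confidence of every rule have been computed, the next post will cover how to sort the rules and pick the best ones.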