Machine Learning in Action study notes, part 08: using the FP-growth algorithm to find frequent itemsets efficiently
Keywords: FP-growth, frequent itemsets, conditional FP-tree, unsupervised learning
Author: 米倉山下
Date: 2018-11-3
Book: Machine Learning in Action, by Peter Harrington
Source code download: https://www.manning.com/books/machine-learning-in-action
git@github.com:pbharrin/machinelearninginaction.git
*************************************************************
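1. Using the FP-growth algorithm to find frequent itemsets efficiently
How FP-growth works:
FP-growth builds on Apriori but uses different techniques to accomplish the same task. It first stores the dataset in a special structure called an FP-tree, then mines that tree for frequent itemsets or frequent item pairs, that is, sets of items that commonly occur together. This runs faster than Apriori, usually by two orders of magnitude or more.
FP stands for "frequent pattern".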
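To make the term "frequent itemset" concrete, here is a minimal brute-force sketch. It is a reference check only, not FP-growth: it counts support directly on the sample transactions used in the test session in section 2. The function names support_count and brute_force_frequent are hypothetical and do not come from the book's code.

```python
from itertools import combinations

# Sample transactions, copied from the simpdata session in section 2.
transactions = [
    ['r', 'z', 'h', 'j', 'p'],
    ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
    ['z'],
    ['r', 'x', 'n', 'o', 's'],
    ['y', 'r', 'x', 'z', 'q', 't', 'p'],
    ['y', 'z', 'x', 'e', 'q', 's', 't', 'm'],
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset.issubset(t))


def brute_force_frequent(transactions, min_sup):
    """Return {itemset: support} for every itemset with support >= min_sup."""
    all_items = set(item for t in transactions for item in t)
    # Only items that are frequent on their own can appear in a frequent itemset.
    items = sorted(i for i in all_items
                   if support_count({i}, transactions) >= min_sup)
    frequent = {}
    for size in range(1, len(items) + 1):
        for cand in combinations(items, size):
            count = support_count(frozenset(cand), transactions)
            if count >= min_sup:
                frequent[frozenset(cand)] = count
    return frequent


result = brute_force_frequent(transactions, min_sup=3)
print(len(result))
```

With min_sup=3 this enumeration finds 18 itemsets, the same number that mineTree reports for this data in section 3. The point of FP-growth is to reach that result without testing every candidate combination against every transaction.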
*************************************************************
2. FP-growth: building the FP-tree
FP-tree construction function
----------------------------------------------------------------------------
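Input: dataSet, the dataset to mine (a dict that maps each transaction, stored as a frozenset, to its count); minSup, the minimum support count, default 1
Output: retTree, the constructed FP-tree; headerTable, the header table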
def createTree(dataSet, minSup=1):
    # Create the FP-tree from the dataset, but do not mine it.
    headerTable = {}
    # First pass over dataSet: count how often each item occurs.
    for trans in dataSet:
        for item in trans:
            headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
    # Remove items that do not meet minSup. Iterate over a copy of the keys
    # so that entries can be deleted inside the loop.
    for k in list(headerTable.keys()):
        if headerTable[k] < minSup:
            del headerTable[k]
    freqItemSet = set(headerTable.keys())
    # print 'freqItemSet: ', freqItemSet
    if len(freqItemSet) == 0:
        return None, None  # no item meets minSup
    for k in headerTable:
        # Restructure each entry as [count, link to the first node of this item].
        headerTable[k] = [headerTable[k], None]
    # print 'headerTable: ', headerTable
    retTree = treeNode('Null Set', 1, None)  # root node of the FP-tree
    # Second pass over dataSet: insert every transaction into the tree.
    for tranSet, count in dataSet.items():
        localD = {}
        for item in tranSet:
            # Look up the global count of each frequent item, used for sorting.
            if item in freqItemSet:
                localD[item] = headerTable[item][0]
        if len(localD) > 0:
            # Order the items by descending global count.
            orderedItems = [v[0] for v in sorted(localD.items(),
                                                 key=lambda p: p[1],
                                                 reverse=True)]
            updateTree(orderedItems, retTree, headerTable, count)
    return retTree, headerTable  # the FP-tree and its header table
Note: createTree is flexible enough to be reused below when building conditional FP-trees.
# Update the FP-tree with one ordered transaction
def updateTree(items, inTree, headerTable, count):
    if items[0] in inTree.children:
        # The first item is already a child node: increase its count.
        inTree.children[items[0]].inc(count)
    else:
        # Otherwise add a new child node.
        inTree.children[items[0]] = treeNode(items[0], count, inTree)
        if headerTable[items[0]][1] is None:
            # The header table does not point to a node for this item yet.
            headerTable[items[0]][1] = inTree.children[items[0]]
        else:
            # The header table already points to a node for this item:
            # append the new node to the end of that chain.
            updateHeader(headerTable[items[0]][1], inTree.children[items[0]])
    if len(items) > 1:
        # More items remain: drop the first one and recurse one level down.
        updateTree(items[1:], inTree.children[items[0]], headerTable, count)
----------------------------------------------------------------------------
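createTree and updateTree rely on a treeNode class and an updateHeader helper from fpGrowth.py that these notes do not list. The following is a minimal sketch of what they might look like, inferred only from how they are called above. The attribute names name, count, parent and nodeLink are assumptions, and print is written as a function call so the sketch also runs on Python 3.

```python
class treeNode:
    """One node of an FP-tree (sketch; attribute names are assumptions)."""

    def __init__(self, nameValue, numOccur, parentNode):
        self.name = nameValue        # item stored in this node
        self.count = numOccur        # number of transactions passing through
        self.nodeLink = None         # next node that holds the same item
        self.parent = parentNode     # needed to walk back up to the root
        self.children = {}           # item -> child treeNode

    def inc(self, numOccur):
        self.count += numOccur

    def disp(self, ind=1):
        # Print the subtree, indenting each level a little further.
        print('  ' * ind + str(self.name) + ' ' + str(self.count))
        for child in self.children.values():
            child.disp(ind + 1)


def updateHeader(nodeToTest, targetNode):
    """Append targetNode to the end of the node-link chain starting at nodeToTest."""
    while nodeToTest.nodeLink is not None:
        nodeToTest = nodeToTest.nodeLink
    nodeToTest.nodeLink = targetNode


# Small demonstration: two 'x' nodes in different branches get chained together.
root = treeNode('Null Set', 1, None)
x1 = treeNode('x', 1, root)
root.children['x'] = x1
z = treeNode('z', 2, root)
root.children['z'] = z
x2 = treeNode('x', 2, z)
z.children['x'] = x2
updateHeader(x1, x2)
x2.inc(1)
root.disp()
```

The node links are what let the mining step jump from the header table to every occurrence of an item without traversing the whole tree.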
Test:
>>> import fpGrowth
>>> simpdata=fpGrowth.loadSimpDat()
>>> initset=fpGrowth.createInitSet(simpdata)
>>> simpdata
[['r', 'z', 'h', 'j', 'p'], ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'], ['z'], ['r', 'x', 'n', 'o', 's'], ['y', 'r', 'x', 'z', 'q', 't', 'p'], ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]
>>> initset
{frozenset(['e', 'm', 'q', 's', 't', 'y', 'x', 'z']): 1, frozenset(['x', 's', 'r', 'o', 'n']): 1, frozenset(['s', 'u', 't', 'w', 'v', 'y', 'x', 'z']): 1, frozenset(['q', 'p', 'r', 't', 'y', 'x', 'z']): 1, frozenset(['h', 'r', 'z', 'p', 'j']): 1, frozenset(['z']): 1}
>>> minSup = 3
>>> myFPtree, myHeaderTab = fpGrowth.createTree(initset, minSup)
>>> myFPtree.disp()
  Null Set 1
    x 1
      s 1
        r 1
    z 5
      x 3
        y 3
          s 2
            t 2
          r 1
            t 1
      r 1
>>> myHeaderTab
{'s': [3, <fpGrowth.treeNode instance at 0x00000000039FE608>], 'r': [3, <fpGrowth.treeNode instance at 0x00000000039FE788>], 't': [3, <fpGrowth.treeNode instance at 0x00000000039FE688>], 'y': [3, <fpGrowth.treeNode instance at 0x00000000039FE5C8>], 'x': [4, <fpGrowth.treeNode instance at 0x00000000039FE588>], 'z': [5, <fpGrowth.treeNode instance at 0x00000000039FE548>]}
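The session above calls loadSimpDat and createInitSet, which are also not listed in these notes. A minimal sketch follows, with the data copied from the session output. One assumption to flag: this createInitSet adds up the counts of identical transactions, so repeated transactions are not lost. Whether fpGrowth.createInitSet behaves the same way cannot be told from the session, because every sample transaction is unique.

```python
def loadSimpDat():
    """Return the six sample transactions shown in the session above."""
    return [['r', 'z', 'h', 'j', 'p'],
            ['z', 'y', 'x', 'w', 'v', 'u', 't', 's'],
            ['z'],
            ['r', 'x', 'n', 'o', 's'],
            ['y', 'r', 'x', 'z', 'q', 't', 'p'],
            ['y', 'z', 'x', 'e', 'q', 's', 't', 'm']]


def createInitSet(dataSet):
    """Turn a list of transactions into {frozenset(transaction): count}."""
    retDict = {}
    for trans in dataSet:
        key = frozenset(trans)  # frozenset is hashable, so it can be a dict key
        # Assumption: accumulate counts rather than overwrite them.
        retDict[key] = retDict.get(key, 0) + 1
    return retDict


initSet = createInitSet(loadSimpDat())
print(len(initSet))
```

The dict form is what createTree expects: it reads dataSet[trans] as the weight of a transaction, which is also why the same function can later be fed conditional pattern bases with counts larger than 1.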
*************************************************************
3. Mining frequent itemsets from an FP-tree
Recursively finding frequent itemsets: the mineTree function
----------------------------------------------------------------------------
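# Input: inTree, the FP-tree to mine (on recursive calls, the conditional FP-tree for preFix);
#        headerTable, the header table of inTree;
#        minSup, the minimum support count;
#        preFix, initially set([]), on recursive calls the itemset that inTree is conditioned on;
#        freqItemList, initially [], used to collect the frequent itemsets.
# Output: freqItemList, filled in place with the frequent itemsets.
def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
    # Sort the header table items by count, ascending. The key uses the count
    # only, so ties never fall back to comparing treeNode objects.
    bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1][0])]
    for basePat in bigL:  # start from the least frequent item
        newFreqSet = preFix.copy()
        newFreqSet.add(basePat)
        # At the top level newFreqSet holds a single item; during recursion
        # preFix is not empty, so larger itemsets are assembled.
        freqItemList.append(newFreqSet)
        # Extract the conditional pattern base (prefix paths without basePat itself).
        condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
        # Build the conditional FP-tree from the conditional pattern base.
        myCondTree, myHead = createTree(condPattBases, minSup)
        if myHead is not None:
            # print 'conditional tree for: ', newFreqSet
            # myCondTree.disp(1)
            # Mine the conditional FP-tree recursively, with newFreqSet as the prefix.
            mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)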
----------------------------------------------------------------------------
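mineTree calls findPrefixPath to collect the conditional pattern base of an item, but that function is not listed in these notes. Below is a minimal sketch of how it might work, based on the call in mineTree. The helper name ascendTree and the Node stand-in class are assumptions made for illustration. The hand-built tree mirrors the z branch of the FP-tree printed in section 2.

```python
class Node:
    """Minimal stand-in for the FP-tree node (hypothetical, for this sketch only)."""

    def __init__(self, name, count, parent):
        self.name, self.count, self.parent = name, count, parent
        self.nodeLink = None
        self.children = {}
        if parent is not None:
            parent.children[name] = self


def ascendTree(leafNode, prefixPath):
    """Append item names from leafNode up to, but not including, the root."""
    while leafNode.parent is not None:
        prefixPath.append(leafNode.name)
        leafNode = leafNode.parent


def findPrefixPath(basePat, node):
    """Follow the node links of one item and return {prefix path: count}."""
    condPats = {}
    while node is not None:
        prefixPath = []
        ascendTree(node, prefixPath)
        if len(prefixPath) > 1:
            # prefixPath[0] is the item itself, so it is left out.
            condPats[frozenset(prefixPath[1:])] = node.count
        node = node.nodeLink
    return condPats


# The z branch of the section 2 tree: z5 -> x3 -> y3 -> (s2 -> t2, r1 -> t1).
root = Node('Null Set', 1, None)
z = Node('z', 5, root)
x = Node('x', 3, z)
y = Node('y', 3, x)
s = Node('s', 2, y)
t1 = Node('t', 2, s)
r = Node('r', 1, y)
t2 = Node('t', 1, r)
t1.nodeLink = t2  # what the header table chain for 't' would look like

paths = findPrefixPath('t', t1)
print(paths)
```

Each prefix path inherits the count of the 't' node it ends in. Feeding this dict to createTree with minSup=3 drops s and r and leaves z, x and y, which matches the conditional tree printed for 't' in the test below.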
Uncomment the following two lines in the source code:
#print 'conditional tree for: ',newFreqSet
#myCondTree.disp(1) # print the conditional tree
Test:
>>> myFreqList = []
>>> reload(fpGrowth)
<module 'fpGrowth' from 'fpGrowth.py'>
# Walk the header table myHeaderTab: add each single frequent item to myFreqList, then build that item's conditional FP-tree and recurse to assemble larger itemsets.
>>> fpGrowth.mineTree(myFPtree, myHeaderTab, minSup, set([]), myFreqList)
conditional tree for: set(['y'])
  Null Set 1
    x 3
      z 3
conditional tree for: set(['y', 'z'])
  Null Set 1
    x 3
conditional tree for: set(['s'])
  Null Set 1
    x 3
conditional tree for: set(['t'])
  Null Set 1
    y 3
      x 3
        z 3
conditional tree for: set(['x', 't'])
  Null Set 1
    y 3
conditional tree for: set(['z', 't'])
  Null Set 1
    y 3
      x 3
conditional tree for: set(['x', 'z', 't'])
  Null Set 1
    y 3
conditional tree for: set(['x'])
  Null Set 1
    z 3
>>> myFreqList
[set(['y']), set(['y', 'z']), set(['y', 'x', 'z']), set(['y', 'x']), set(['s']), set(['x', 's']), set(['t']), set(['z', 't']), set(['x', 'z', 't']), set(['y', 'x', 'z', 't']), set(['y', 'z', 't']), set(['x', 't']), set(['y', 'x', 't']), set(['y', 't']), set(['r']), set(['x']), set(['x', 'z']), set(['z'])]
>>> len(myFreqList)
18
*************************************************************
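4. Example: mining a news site clickstream
kosarak.dat contains nearly one million records. Each line lists the news stories that one user viewed. Some users viewed only a single story, while others viewed as many as 2,498. Users and stories are encoded as integers, and FP-growth is used to mine the data.
# Read the data and format the dataset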
>>> parsedDat=[line.split() for line in open('kosarak.dat').readlines()]
>>> len(parsedDat)
990002
>>> initset=fpGrowth.createInitSet(parsedDat)
# Build the FP-tree, looking for news stories viewed by at least 100,000 users
>>> myFPtree, myHeaderTab = fpGrowth.createTree(initset, 100000)
# Build the conditional FP-trees and mine them
>>> myFreqList = []
>>> fpGrowth.mineTree(myFPtree, myHeaderTab, 100000, set([]), myFreqList)
>>> len(myFreqList)
9
>>> myFreqList
[set(['1']), set(['1', '6']), set(['3']), set(['11', '3']), set(['11', '3', '6']), set(['3', '6']), set(['11']), set(['11', '6']), set(['6'])]
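The session above reads the file with a bare open call. As a small variation, the sketch below does the same parsing inside a with block so the file handle is closed promptly, and skips blank lines. The function name load_transactions is hypothetical, and the temporary file holds made-up content in the same one-transaction-per-line format, since kosarak.dat itself is not included here.

```python
import os
import tempfile


def load_transactions(path):
    """Read one whitespace-separated transaction per line, skipping blank lines."""
    with open(path) as f:
        return [line.split() for line in f if line.strip()]


# Demonstrate on a small temporary file (made-up content, same format).
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as tmp:
    tmp.write('1 2 3\n1\n\n4 5 6 7\n')
parsed = load_transactions(tmp.name)
os.remove(tmp.name)
print(parsed)
```

Items stay as strings, as in the session above, which is why the mined itemsets contain '1', '3', '6' and '11' rather than integers.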
----------------------------------------------------------------------------
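Summary:
Advantages: unlike Apriori, FP-growth needs only two scans of the database, which significantly speeds up the discovery of frequent itemsets.
Disadvantages: the algorithm finds frequent itemsets more efficiently, but it stops there. It does not discover association rules.
Applications: search engine query suggestions (pairs of words that frequently occur together), among others.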