Keyword matching with a DFA algorithm
阿新 • Published: 2017-08-13
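Background: given pages crawled from the web, we need to decide whether each page matches a preset list of keywords (that is, whether it contains any of them) and return every keyword that matched.
There are currently two implementations on PyPI: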
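As a point of comparison, the task can be stated in a few lines of Python. This is only a naive baseline sketch, not the method used in this post: it runs one substring scan per keyword, and the function name `match_keywords_naive` is made up for illustration.

```python
def match_keywords_naive(text, keywords):
    """Return the set of keywords that occur somewhere in text."""
    # One substring scan per keyword: fine for a handful of keywords,
    # but the total cost grows with the size of the keyword list.
    return {kw for kw in keywords if kw and kw in text}


if __name__ == '__main__':
    hits = match_keywords_naive('直到轄區派出所民警趕到後', ['hello', '民警', '派出所'])
    for kw in sorted(hits):
        print(kw)
```

A trie or DFA avoids the per-keyword rescan by walking all keywords at once, which is what the rest of this post does.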
ahocorasick
https://pypi.python.org/pypi/ahocorasick/0.9
esmre
https://pypi.python.org/pypi/esmre/0.3.1
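In fact, both packages are built on a DFA (deterministic finite automaton).
The source code for a simple implementation is given below: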
#!/usr/bin/python2.6
# -*- coding: utf-8 -*-
import time
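class Node(object):
    def __init__(self):
        # Maps the next character to its child Node; stays None until a child is added
        self.children = None
        # True when the path from the root to this node spells a complete keyword
        self.flag = False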
# word is expected to be UTF-8 encoded
def add_word(root, word):
    if len(word) <= 0:
        return
    node = root
    for i in range(len(word)):
        if node.children is None:
            node.children = {}
            node.children[word[i]] = Node()
        elif word[i] not in node.children:
            node.children[word[i]] = Node()
        node = node.children[word[i]]
    # Mark the end of a complete keyword
    node.flag = True
def init(word_list):
    root = Node()
    for line in word_list:
        add_word(root, line)
    return root
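# word and message are both expected to be UTF-8 encoded
def key_contain(message, root):
    res = set()
    for i in range(len(message)):
        p = root
        j = i
        while j < len(message) and p.children is not None and message[j] in p.children:
            # A shorter keyword ends here while a longer match is still possible
            if p.flag:
                res.add(message[i:j])
            p = p.children[message[j]]
            j = j + 1
        # Check the last node reached. Testing p.flag rather than p.children
        # also catches a keyword that is a prefix of a longer one, e.g. '派出所'
        # followed by text other than '民警'.
        if p.flag:
            res.add(message[i:j])
            #print('---word---', message[i:j])
    return res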
def dfa():
    print('----------------dfa-----------')
    word_list = ['hello', '民警', '朋友', '女兒', '派出所', '派出所民警']
    root = init(word_list)
    message = '四處亂咬亂吠,嚇得家中11歲的女兒躲在屋裏不敢出來,直到轄區派出所民警趕到後,才將孩子從屋中救出。最後在征得主人允許後,民警和村民合力將這僅僅發瘋的狗打死'
    x = key_contain(message, root)
    for item in x:
        print(item)

if __name__ == '__main__':
    dfa()
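The listing above goes back to the trie root for every starting offset in the message. The ahocorasick package mentioned earlier is named after the Aho-Corasick algorithm, which adds failure links to the same kind of trie so that the text is scanned in a single pass. Below is a minimal Python 3 sketch of that idea. It is not the code of either PyPI package, and the names `build_automaton` and `find_keywords` are invented for this example.

```python
from collections import deque


def build_automaton(keywords):
    """Build trie transitions, failure links and output sets (illustrative names)."""
    goto = [{}]     # goto[state][char] -> next state; state 0 is the root
    fail = [0]      # fail[state] -> state to fall back to on a mismatch
    out = [set()]   # out[state] -> keywords that end at this state
    for word in keywords:
        if not word:
            continue
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(word)
    # Breadth-first pass: a state's failure link always points to a shallower
    # state, so it is already final when the state's children are processed.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_keywords(text, automaton):
    """Return every keyword found in text, reading each character once."""
    goto, fail, out = automaton
    found = set()
    state = 0
    for ch in text:
        # Follow failure links until some state accepts ch, or we reach the root
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found |= out[state]
    return found


if __name__ == '__main__':
    automaton = build_automaton(['hello', '民警', '朋友', '女兒', '派出所', '派出所民警'])
    for kw in sorted(find_keywords('11歲的女兒,直到轄區派出所民警趕到後', automaton)):
        print(kw)
```

The sketch works on Python 3 `str` values, so it matches character by character rather than on UTF-8 bytes. The automaton is built once and can be reused for every crawled page.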
Please also see this article of mine:
http://blog.csdn.net/woshiaotian/article/details/10047675