Longest-Prefix String Extraction with a Trie in Python
阿新 • Published: 2018-12-21
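Text-parsing projects often need to extract a brand, a merchant name, or a similar predefined term. For example, given a phone model string, the task is to extract the brand from it. A trie handles this kind of requirement well.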
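A trie, also called a prefix tree or dictionary tree, is a data structure that can quickly check whether a string is stored in it and extract a predefined substring from the beginning of a string.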
Python has no pointers, so the tree structure is implemented with nested dicts.
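To make the nested-dict idea concrete, here is a minimal sketch of how two overlapping words end up sharing nodes. The helper name `build_trie` is hypothetical and only for illustration. The `'end'` key follows the same convention as the full implementation in this post: it marks a node where a complete item stops and stores the item's original spelling.

```python
def build_trie(words):
    """Build a trie from words using nested dicts (hypothetical helper)."""
    root = {}
    for word in words:
        node = root
        for ch in word.upper():
            # setdefault returns the existing child dict, or creates an empty one
            node = node.setdefault(ch, {})
        # Mark the end of a complete item and keep its original spelling
        node['end'] = word
    return root


trie = build_trie(['Xiao', 'Xiaomi'])

# Walk down the shared prefix X -> I -> A -> O
node = trie
for ch in 'XIAO':
    node = node[ch]

print(node['end'])    # Xiao
print(sorted(node))   # ['M', 'end']
```

The node reached after `XIAO` holds both an `'end'` marker (for `Xiao`) and a child `'M'` (leading on to `Xiaomi`), which is what allows the longest match to be preferred over a shorter one.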
# -*- coding: utf-8 -*-
"""
Trie for prefix search, a data structure that quickly matches and extracts
predefined substrings at the beginning of a given text (if they can be found).
We can also skip certain characters and still succeed in a match.
"""

default_ignored_chars = u' _-/'


class Trie(object):

    def __init__(self, items, ignored_chars=default_ignored_chars):
        """
        Stores all given items into this trie.
        """
        self.ignored_chars = ignored_chars
        self.trie = {}
        for item in items:
            assert item, 'Empty/none item passed in'
            item = item.strip()
            assert item, 'Empty item given'
            curr_dict = self.trie
            for c in item.upper():
                if c not in self.ignored_chars:
                    curr_dict = curr_dict.setdefault(c, {})
            # Mark the end of an item and keep its original spelling
            curr_dict['end'] = item

    def is_item(self, text):
        """
        Return True if text is a valid item stored in this trie.
        """
        if not text:
            return False
        curr_dict = self.trie
        for c in text.upper():
            if c not in self.ignored_chars:
                if c not in curr_dict:
                    return False
                curr_dict = curr_dict[c]
        return 'end' in curr_dict

    def extract_longest_item(self, text):
        """
        Return longest item-name found at beginning of the text.
        Also returns the offset where the item ends in case
        the caller wants to chop the string.
        """
        curr_dict, longest, offset = self.trie, None, 0
        if not text:
            return longest, offset
        for i, c in enumerate(text.upper()):
            if c not in self.ignored_chars:
                if c not in curr_dict:
                    return longest, offset
                curr_dict = curr_dict[c]
                if 'end' in curr_dict:
                    longest, offset = curr_dict['end'], i + 1
        return longest, offset


# tester
if __name__ == '__main__':
    brands = ['Huawei', 'OPPO', 'VIVO', 'Xiaomi', 'Xiao', 'HTC', 'Oneplus']
    model_name = 'xiaomi mix3'
    brand_lookup = Trie(brands)
    brand, offset = brand_lookup.extract_longest_item(model_name)
    print(brand, offset)
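The returned offset is what lets a caller chop the brand off the front of the string. The sketch below shows one way that might look. To stay self-contained it re-implements the same logic as plain functions instead of importing the `Trie` class, and the names `build_trie` and `split_brand` are hypothetical. It also assumes that `upper()` does not change the length of the text, which holds for ASCII input but not for every Unicode character.

```python
IGNORED = ' _-/'  # same default ignored characters as the Trie class


def build_trie(items, ignored=IGNORED):
    """Store items in a nested-dict trie, skipping ignored characters."""
    root = {}
    for item in items:
        node = root
        for ch in item.upper():
            if ch not in ignored:
                node = node.setdefault(ch, {})
        node['end'] = item
    return root


def extract_longest_item(trie, text, ignored=IGNORED):
    """Return (longest item at the start of text, offset where it ends)."""
    node, longest, offset = trie, None, 0
    for i, ch in enumerate(text.upper()):
        if ch in ignored:
            continue
        if ch not in node:
            break
        node = node[ch]
        if 'end' in node:
            longest, offset = node['end'], i + 1
    return longest, offset


def split_brand(trie, model_name):
    """Split a model string into (brand, remainder); hypothetical helper."""
    brand, offset = extract_longest_item(trie, model_name)
    # Drop separator characters left at either edge of the remainder
    rest = model_name[offset:].strip(IGNORED)
    return brand, rest


trie = build_trie(['Huawei', 'Xiaomi', 'Xiao', 'Oneplus'])
print(split_brand(trie, 'xiaomi mix3'))   # ('Xiaomi', 'mix3')
print(split_brand(trie, 'One-Plus 6T'))   # ('Oneplus', '6T')
print(split_brand(trie, 'Nokia 7'))       # (None, 'Nokia 7')
```

Because ignored characters are skipped during matching, `One-Plus` still resolves to the stored item `Oneplus`. When no brand matches, the offset stays at 0 and the whole string comes back as the remainder.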