Extracting the Longest Prefix String with a Trie in Python

阿新 • Published: 2018-12-21

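Text-parsing projects often need to extract a brand name, a merchant name, or something similar. For example, given a phone model string, the task is to extract the brand from it. A trie handles this kind of requirement well.

A trie, also called a prefix tree or dictionary tree, is a data structure that can quickly check whether a string is stored in it and can extract a predefined substring from the start of a string.

Python has no pointers, so the tree structure is implemented with nested dicts.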
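To make the nested-dict layout concrete, here is a minimal sketch that stores two words and prints the resulting structure. The helper name `build_trie` is hypothetical and used only for illustration; the `'end'` marker key follows the convention of the listing in this article.

```python
import json


def build_trie(words):
    """Store each word as a chain of nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word.upper():
            # setdefault returns the existing child dict or creates an empty one
            node = node.setdefault(ch, {})
        # mark the node where a complete word ends, keeping the original spelling
        node['end'] = word
    return root


trie = build_trie(['Xiao', 'Xiaomi'])
print(json.dumps(trie, indent=2))
```

Both words share the path X, I, A, O. The node reached after `O` holds an `'end'` entry for `Xiao` and also a child `M` that continues toward `Xiaomi`, which is what lets a lookup keep walking and prefer the longer match.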

# -*- coding: utf-8 -*-
"""
Trie for prefix search, a data structure that quickly matches and extracts predefined substrings
at the beginning of a given text (if they can be found).

We can also skip certain characters and still succeed in a match.
"""

default_ignored_chars = u' _-/'


class Trie(object):
    def __init__(self, items, ignored_chars=default_ignored_chars):
        """ Stores all given items into this trie. """
        self.ignored_chars = ignored_chars

        self.trie = {}
        for item in items:
            assert item, 'Empty/none item passed in'
            item = item.strip()
            assert item, 'Empty item given'
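            curr_dict = self.trie
            for c in item.upper():
                if c not in self.ignored_chars:
                    curr_dict = curr_dict.setdefault(c, {})
            # 'end' marks a complete item and keeps its original spelling; it cannot
            # collide with a child key because child keys are single characters.
            curr_dict['end'] = item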

    def is_item(self, text):
        """ Return True if text is a valid item stored in this trie. """
        if not text:
            return False
        curr_dict = self.trie
        for c in text.upper():
            if c not in self.ignored_chars:
                if c not in curr_dict:
                    return False
                curr_dict = curr_dict[c]
        return 'end' in curr_dict

    def extract_longest_item(self, text):
        """ Return longest item-name found at beginning of the text. Also returns the
            offset where the item ends in case the caller wants to chop the string. """
        curr_dict, longest, offset = self.trie, None, 0

        if not text:
            return longest, offset

        for i, c in enumerate(text.upper()):
            if c not in self.ignored_chars:
                if c not in curr_dict:
                    return longest, offset
                curr_dict = curr_dict[c]
                if 'end' in curr_dict:
                    longest, offset = curr_dict['end'], i + 1

        return longest, offset


# tester
if __name__ == '__main__':

    brands = ['Huawei', 'OPPO', 'VIVO', 'Xiaomi', 'Xiao', 'HTC', 'Oneplus']
    model_name = 'xiaomi mix3'
    brand_lookup = Trie(brands)

    brand, offset = brand_lookup.extract_longest_item(model_name)
    print(brand, offset)  # Python 3 prints: Xiaomi 6
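The returned offset is meant for chopping the matched prefix off the input. The sketch below shows one way to do that, and also how ignored characters let `One-Plus` match the stored item `Oneplus`. It restates the lookup as standalone functions so that it runs on its own. The names `build_trie`, `extract_longest` and `split_brand` are hypothetical and not part of the original class.

```python
IGNORED = ' _-/'  # same characters as default_ignored_chars in the listing


def build_trie(items):
    """Build a nested-dict trie, skipping ignored characters."""
    root = {}
    for item in items:
        node = root
        for c in item.upper():
            if c not in IGNORED:
                node = node.setdefault(c, {})
        node['end'] = item
    return root


def extract_longest(trie, text):
    """Return (longest stored item at the start of text, offset where it ends)."""
    node, longest, offset = trie, None, 0
    for i, c in enumerate(text.upper()):
        if c in IGNORED:
            continue  # ignored characters never break a match
        if c not in node:
            break
        node = node[c]
        if 'end' in node:
            longest, offset = node['end'], i + 1
    return longest, offset


def split_brand(trie, text):
    """Split text into (brand, remainder), trimming leading separators."""
    brand, offset = extract_longest(trie, text)
    return brand, text[offset:].lstrip(IGNORED)


trie = build_trie(['Huawei', 'Xiaomi', 'Xiao', 'Oneplus'])
print(split_brand(trie, 'xiaomi mix3'))   # ('Xiaomi', 'mix3')
print(split_brand(trie, 'One-Plus 6T'))   # ('Oneplus', '6T')
print(split_brand(trie, 'Nokia 7'))       # (None, 'Nokia 7')
```

When nothing matches, the offset stays 0, so the remainder is simply the whole input and the caller can test the brand for `None`.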
