AC Automaton, Part 1: A Trie for UTF-8 Encoded Text

阿新 • Published: 2019-02-07

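I recently needed to compute pinyin similarity between pieces of text. hankcs's HanLP stores its pinyin data with an AC automaton (Aho-Corasick), and I want to port that to Python. So it is time to dig into the AC automaton.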

The AC automaton builds on two things: the trie and the KMP string-matching algorithm. I will start with the trie.

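My goal is to map UTF-8 encoded Chinese words to their pinyin. UTF-8 encodes most common Chinese characters as 3 bytes, and each byte takes a value from 0x00 to 0xFF. The trie is therefore keyed by bytes: a 256-way trie in which every node can have up to 256 children.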
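As a quick check of the byte-level view described here, the sketch below encodes a few characters and prints their UTF-8 bytes. The helper name `utf8_bytes` is made up for this illustration, and the snippet assumes Python 3.

```python
def utf8_bytes(text):
    """Return the UTF-8 encoding of text as a list of ints in 0-255."""
    return list(text.encode("utf-8"))


for ch in ["a", "一", "分"]:
    encoded = utf8_bytes(ch)
    # Show each byte in hexadecimal, the way UTF-8 tables usually list them.
    print(ch, len(encoded), [hex(b) for b in encoded])
```

An ASCII letter takes one byte and each of the two Chinese characters takes three, so a word of n such characters becomes a path of 3n nodes in the trie.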

After reading several implementations and borrowing ideas from them, I wrote my own.

# coding: utf-8
# Requires Python 3.


class TrieNode(object):
    """One trie node; children are keyed by byte value (0-255)."""

    def __init__(self):
        self.one_byte = {}    # byte value -> child TrieNode
        self.value = None     # payload (here: the pinyin) if a word ends here
        self.is_word = False  # True if a complete word ends at this node


class Trie256(object):
    def __init__(self):
        self.root = TrieNode()

    def getUtf8String(self, string):
        # Encode text as UTF-8; iterating a bytearray yields ints in 0-255.
        return bytearray(string.encode("utf-8"))

    def insert(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            if byte not in node.one_byte:
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        node.is_word = True
        node.value = value

    def find(self, bytes_array):
        node = self.root
        for byte in bytes_array:
            node = node.one_byte.get(byte)
            if node is None:
                print("This word is not in this Trie.")
                return None
        if not node.is_word:
            print("It is not a word.")
            return None
        return node.value

    def modify(self, bytes_array, value):
        node = self.root
        for byte in bytes_array:
            if byte not in node.one_byte:
                print("This word is not in this Trie, we will insert it.")
                node.one_byte[byte] = TrieNode()
            node = node.one_byte[byte]
        if not node.is_word:
            print("This word is not a word in this Trie, we will make it a word.")
            node.is_word = True
        else:
            print("modify this word...")
        node.value = value

    def delete(self, bytes_array):
        node = self.root
        path = []  # (parent, byte) pairs, used to prune empty nodes afterwards
        for byte in bytes_array:
            child = node.one_byte.get(byte)
            if child is None:
                print("This word is not in this Trie.")
                # Stop here; otherwise a stored prefix could be deleted by mistake.
                return
            path.append((node, byte))
            node = child
        if not node.is_word:
            print("It is not a word.")
            return
        node.is_word = False
        node.value = None
        # Walk back up, dropping nodes that have no children and end no word.
        for parent, byte in reversed(path):
            child = parent.one_byte[byte]
            if child.one_byte or child.is_word:
                break
            del parent.one_byte[byte]

    def print_item(self, p, indent=0):
        # Print the trie as nested braces, one byte value per line.
        if p:
            ind = '\t' * indent
            for key in p.one_byte:
                label = "'%s' : " % key
                print(ind + label + '{')
                self.print_item(p.one_byte[key], indent + 1)
                print(ind + '}')


if __name__ == "__main__":
    trie = Trie256()

    # Each line is split on '=' into a word and its pinyin.
    with open("dictionary/pinyin.txt", 'r', encoding="utf-8") as fd:
        for line in fd:
            if '=' not in line:
                continue  # skip blank or malformed lines
            word, pinyin = line.split('=', 1)
            pinyin = pinyin.strip()
            word_bytes = trie.getUtf8String(word)
            # Show the byte values of the word, each prefixed with 'x'.
            print(''.join('x' + str(byte) for byte in word_bytes))
            trie.insert(word_bytes, pinyin)

    trie.print_item(trie.root)

    word_bytes = trie.getUtf8String("一分鐘")
    for byte in word_bytes:
        print(byte)
    print(trie.find(word_bytes))
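
To try the idea without the dictionary file, the following compact sketch builds the same kind of byte-keyed trie from a few in-memory entries. The class name `ByteTrie`, its method names, and the sample pinyin strings are assumptions made for this illustration; it uses plain nested dicts instead of node objects and assumes Python 3.

```python
class ByteTrie:
    """Trie keyed by UTF-8 byte values (hypothetical, condensed variant)."""

    def __init__(self):
        # Each node is a dict: byte value -> child node. The key None holds
        # the stored value, so it cannot clash with a byte (an int in 0-255).
        self.root = {}

    def insert(self, word, value):
        node = self.root
        for byte in word.encode("utf-8"):
            node = node.setdefault(byte, {})
        node[None] = value

    def find(self, word):
        node = self.root
        for byte in word.encode("utf-8"):
            node = node.get(byte)
            if node is None:
                return None  # path breaks off: word was never inserted
        return node.get(None)  # None if this is only a prefix

    def delete(self, word):
        path = []  # (parent, byte) pairs along the word
        node = self.root
        for byte in word.encode("utf-8"):
            child = node.get(byte)
            if child is None:
                return False
            path.append((node, byte))
            node = child
        if None not in node:
            return False  # a prefix, not a stored word
        del node[None]
        # Prune nodes that are now completely empty, bottom up.
        for parent, byte in reversed(path):
            if parent[byte]:
                break
            del parent[byte]
        return True


trie = ByteTrie()
trie.insert("一分", "yi1,fen1")            # sample values, made up
trie.insert("一分鐘", "yi1,fen1,zhong1")
print(trie.find("一分鐘"))
print(trie.find("一"))      # only a prefix of stored words
trie.delete("一分鐘")
print(trie.find("一分"))    # the shorter word survives the deletion
```

Deleting the longer word removes only the three nodes that belonged to its last character; the nodes shared with the shorter word stay in place because one of them still ends a word.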
