[Original] Cython and Python for KenLM

阿新 • Published: 2018-12-21

Do not reproduce without permission.

Using the KenLM module and notes on its C++ source

Command for loading a model with KenLM

The -n flag tells query not to wrap the input in <s> and </s>.

~/Documents/kenlm/lm$ ../bin/query -n test.arpa
***

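Notes on the KenLM C++ source

Main entry point of query: query_main.cc
Implementation of the query function: ngram_query.hh
Note:
By default, the call that actually runs is the one on line 96 of query_main.cc: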

Query<ProbingModel>(file, config, sentence_context, show_words);

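and not the wrapper in lm/wrappers/nplm.hh. That wrapper requires the NPLM module and is compiled in only when WITH_NPLM is defined, as the code below shows. I overlooked this at first and lost some time here.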

#ifdef WITH_NPLM
    } else if (lm::np::Model::Recognize(file)) {
      lm::np::Model model(file);
      if (show_words) {
        Query<lm::np::Model, lm::ngram::FullPrint>(model, sentence_context);
      } else {
        Query<lm::np::Model, lm::ngram::BasicPrint>(model, sentence_context);
      }
#endif

Inheritance hierarchy of the Model class

Root base class (virtual_interface.hh): lm::base::Model

Intermediate base class (facade.hh): lm::base::ModelFacade : public Model

Derived class (model.hh): lm::ngram::GenericModel : public base::ModelFacade<GenericModel<Search, VocabularyT>, State, VocabularyT>
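The hierarchy above is why the .pxd file later in this post can declare two families of methods: BaseScore and BaseFullScore, which take untyped state pointers, and FullScore, which takes State references. The following is only a loose Python analogy of that layering, not a transcription of the C++; ToyModel, the dict-based state, and the scores it returns are all hypothetical.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Analogue of lm::base::Model: callers see only an opaque state."""

    @abstractmethod
    def base_full_score(self, in_state, word, out_state):
        ...


class ModelFacade(Model):
    """Analogue of lm::base::ModelFacade: forwards the generic call
    to the concrete child's own full_score method."""

    def base_full_score(self, in_state, word, out_state):
        return self.full_score(in_state, word, out_state)

    def base_score(self, in_state, word, out_state):
        # Only the probability part of the full result.
        return self.base_full_score(in_state, word, out_state)["prob"]


class ToyModel(ModelFacade):
    """Hypothetical child standing in for lm::ngram::GenericModel."""

    def full_score(self, in_state, word, out_state):
        has_context = in_state.get("context") is not None
        out_state["context"] = word  # the new state remembers the last word
        return {
            "prob": -1.0 if has_context else -2.0,  # made-up log10 scores
            "ngram_length": 2 if has_context else 1,
        }


model = ToyModel()
state_after_first = {}
first = model.base_full_score({}, "hello", state_after_first)
second = model.base_score(state_after_first, "world", {})
```

In this sketch the child only implements full_score; everything callers reach through the base interface is supplied by the facade layer.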

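A brief note on Cython

Cython official website
You can download the latest version from the official website. The Cython Wiki and Cython FAQ, listed under the Documentation section, are good places to pick up the basics.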

cython-cpp-test-sample
Wrapping C++ Classes in Cython
cython wrapping of base and derived class
std::string arguments in cython
Cython and constructors of classes
Cython basics: getting started with Cython

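Wrapping KenLM as a Python module

Now to the main topic. The KenLM source tree already ships a Python module, in the kenlm/python directory. So why wrap it again? Because the bundled module only implements scoring for sentences wrapped in <s> and </s>; it provides no way to score text without those sentence-boundary markers. That is the reason for re-wrapping the Python module.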
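To make the difference concrete, here is a minimal pure-Python sketch with a toy bigram table. It does not use KenLM at all: the probabilities are invented, there is no backoff, and both function names are hypothetical. It only shows how wrapping a sentence in <s> and </s> changes which terms enter the total log10 score.

```python
import math

# Hypothetical toy bigram table: log10 P(word | previous word).
BIGRAM_LOG10 = {
    ("<s>", "hello"): math.log10(0.5),
    ("hello", "world"): math.log10(0.4),
    ("world", "</s>"): math.log10(0.8),
}
# Hypothetical unigram table, used for a word that has no left context.
UNIGRAM_LOG10 = {"hello": math.log10(0.1), "world": math.log10(0.05)}


def score_with_boundaries(words):
    """Start from <s> and also score the transition into </s>."""
    context = "<s>"
    total = 0.0
    for word in words + ["</s>"]:
        total += BIGRAM_LOG10[(context, word)]
        context = word
    return total


def score_without_boundaries(words):
    """Start from an empty context and stop after the last word."""
    total = UNIGRAM_LOG10[words[0]]
    for previous, word in zip(words, words[1:]):
        total += BIGRAM_LOG10[(previous, word)]
    return total


with_markers = score_with_boundaries(["hello", "world"])
without_markers = score_without_boundaries(["hello", "world"])
```

The first total includes the terms for "hello" after <s> and for </s> after "world"; the second treats the same two words as a fragment that could appear anywhere in a sentence.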

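Steps required to use the Python module

Install the kenlm.so module into Python's library directory. By default, running setup.py in the kenlm directory is enough: sudo python setup.py install --record log.

After a successful install, run python example.py and check the output.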

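How to extend KenLM's Python module

Now for the extension itself. kenlm.pxd is the Cython declaration file for the C++ classes and objects being used. kenlm.pyx holds the actual Cython code, which defines the classes and methods that Python will call. The cython compiler turns kenlm.pyx, together with the declarations in kenlm.pxd, into kenlm.cpp, and setup.py then builds the extension from that generated kenlm.cpp.

Cython compile command: cython --cplus kenlm.pyx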

The extended kenlm.pxd file

from libcpp.string cimport string

cdef extern from "lm/word_index.hh":
    ctypedef unsigned WordIndex

cdef extern from "lm/return.hh" namespace "lm":
    cdef struct FullScoreReturn:
        float prob
        unsigned char ngram_length

cdef extern from "lm/state.hh" namespace "lm::ngram":
    cdef struct State:
        pass

    ctypedef State const_State "const lm::ngram::State"

cdef extern from "lm/virtual_interface.hh" namespace "lm::base":
    cdef cppclass Vocabulary:
        WordIndex Index(char*)
        WordIndex BeginSentence() 
        WordIndex EndSentence()
        WordIndex NotFound()

    ctypedef Vocabulary const_Vocabulary "const lm::base::Vocabulary"


cdef extern from "lm/model.hh" namespace "lm::ngram":
    cdef cppclass Model:
        const_Vocabulary& GetVocabulary()
        const_State& NullContextState()
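        # "except +" lets Cython turn a C++ exception thrown while loading
        # the model into a Python exception instead of aborting the process.
        Model(char* file) except +
        # out_state is written by FullScore, so it is a non-const reference.
        FullScoreReturn FullScore(const_State& in_state, WordIndex new_word, State& out_state)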

        void BeginSentenceWrite(void *)
        void NullContextWrite(void *)
        unsigned int Order()
        const_Vocabulary& BaseVocabulary()
        float BaseScore(void *in_state, WordIndex new_word, void *out_state)
        FullScoreReturn BaseFullScore(void *in_state, WordIndex new_word, void *out_state)
        void * NullContextMemory()

The extended kenlm.pyx file

import os

cdef bytes as_str(data):
    if isinstance(data, bytes):
        return data
    elif isinstance(data, unicode):
        return data.encode('utf8')
    raise TypeError('Cannot convert %s to string' % type(data))

cdef int as_in(int &Num):
    (&Num)[0] = 1

cdef class LanguageModel:
    cdef Model* model
    cdef public bytes path
    cdef const_Vocabulary* vocab

    def __init__(self, path):
        self.path = os.path.abspath(as_str(path))
        try:
            self.model = new Model(self.path)
        except RuntimeError as exception:
            exception_message = str(exception).replace('\n', ' ')
            raise IOError('Cannot read model \'{}\' ({})'.format(path, exception_message))\
                    from exception
        self.vocab = &self.model.GetVocabulary()

    def __dealloc__(self):
        del self.model

    property order:
        def __get__(self):
            return self.model.Order()
    
    def score(self, sentence):
        cdef list words = as_str(sentence).split()
        cdef State state
        self.model.BeginSentenceWrite(&state)
        cdef State out_state
        cdef float total = 0
        for word in words:
            total += self.model.BaseScore(&state, self.vocab.Index(word), &out_state)
            state = out_state
        total += self.model.BaseScore(&state, self.vocab.EndSentence(), &out_state)
        return total

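    def full_scores(self, sentence):
        # Yields (log10 probability, matched n-gram length) for each word,
        # starting from <s> and ending with a tuple for </s>.
        cdef list words = as_str(sentence).split()
        cdef State state
        self.model.BeginSentenceWrite(&state)
        cdef State out_state
        cdef FullScoreReturn ret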
        for word in words:
            ret = self.model.BaseFullScore(&state,
                self.vocab.Index(word), &out_state)
            yield (ret.prob, ret.ngram_length)
            state = out_state
        ret = self.model.BaseFullScore(&state,
            self.vocab.EndSentence(), &out_state)
        yield (ret.prob, ret.ngram_length)
    
    def full_scores_n(self, sentence):
        # Same as full_scores, but starts from the null context and does not
        # append </s>, so the sentence is scored without <s> and </s>.
        cdef list words = as_str(sentence).split()
        cdef State state
        state = self.model.NullContextState()
        cdef State out_state
        cdef FullScoreReturn ret
        for word in words:
            ret = self.model.FullScore(state,
                self.vocab.Index(word), out_state)
            yield (ret.prob, ret.ngram_length)
            state = out_state

    def score_n(self, sentence):
        # Total log10 score of the sentence without <s> and </s>.
        cdef list words = as_str(sentence).split()
        cdef State state
        state = self.model.NullContextState()
        cdef State out_state
        cdef FullScoreReturn ret
        cdef float total = 0
        for word in words:
            ret = self.model.FullScore(state,
                self.vocab.Index(word), out_state)
            total += ret.prob
            state = out_state
        return total


    def __contains__(self, word):
        cdef bytes w = as_str(word)
        return (self.vocab.Index(w) != 0)

    def __repr__(self):
        return '<LanguageModel from {0}>'.format(os.path.basename(self.path))

    def __reduce__(self):
        return (LanguageModel, (self.path,))
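As a usage sketch for the new methods: full_scores_n yields one tuple per input word and no extra tuple for </s>, so its output can be zipped directly with the words. The helper name per_word_scores is hypothetical, and FakeModel is a made-up stand-in that only mimics the shape of the output so the example runs without a compiled module.

```python
def per_word_scores(model, sentence):
    """Pair each word with the (log10 prob, n-gram length) tuple that
    full_scores_n yields for it."""
    words = sentence.split()
    return [
        (word, prob, ngram_length)
        for word, (prob, ngram_length) in zip(words, model.full_scores_n(sentence))
    ]


class FakeModel:
    """Hypothetical stand-in for the extended LanguageModel class."""

    def full_scores_n(self, sentence):
        # Invented numbers: one tuple per word, nothing for </s>.
        for position, _ in enumerate(sentence.split(), start=1):
            yield (-1.0 * position, min(position, 3))


rows = per_word_scores(FakeModel(), "a b c d")
total = sum(prob for _, prob, _ in rows)
```

With a real build, assuming the extension is installed under the name kenlm, the same helper would take kenlm.LanguageModel('test.arpa') in place of FakeModel(), and the sum of the per-word values should match what score_n returns for the same sentence.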
