Word segmentation, POS tagging, and NER with pyltp

阿新 • Published: 2018-12-31

Environment: Windows 10, Python 2.7

Mainly based on the official documentation:

http://pyltp.readthedocs.io/zh_CN/latest/api.html#

http://ltp.readthedocs.io/zh_CN/latest/install.html

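1. Install pyltp

https://github.com/hit-scir/pyltp

Don't forget to download the models linked from that page; they are updated from time to time.

After downloading the source code, unzip it, switch to the extracted directory in cmd, and install with python setup.py install. If import pyltp runs in Python without an error, the installation succeeded.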
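The import check described above can also be scripted, which is handy when several interpreters are installed. This is only a minimal sketch: the helper name is_importable is made up for illustration and is not part of pyltp.

```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Hypothetical check; prints False if the pyltp build or install failed.
    print("pyltp importable: %s" % is_importable("pyltp"))
```

Running the script with the same python executable that ran setup.py shows whether the install landed in that interpreter.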

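2. Install CMake

https://cmake.org/download/

From the link, download the .msi file that matches your machine, open it, and follow the prompts step by step.

3. Install Visual Studio

Most people probably already have it installed.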

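4. Build

In the project folder, create a directory named build, switch to it in cmd, and run: cmake ..

The build generates three VC projects: ALL_BUILD, RUN_TESTS, and ZERO_CHECK.

Open the ALL_BUILD project in VS and select Release under Build / Configuration Manager.

Right-click the project and build it.

The tools otcws, otpos, and so on then appear in the tools/train/Release directory.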

5. Word segmentation

from pyltp import Segmentor
def segmentor(sentence):
    segmentor = Segmentor()
    segmentor.load('cws.model')  # load the model
    words = segmentor.segment(sentence)  # segment the sentence
    word_list = list(words)
    segmentor.release()  # release the model
    return word_list
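The function above loads and releases the model on every call, which can get slow when many sentences are processed. One possible alternative is a small context manager that loads once and always releases, even if an exception is raised. The sketch below relies only on the load() and release() methods already used in this post. The names loaded_model and FakeSegmentor are made up for illustration, and FakeSegmentor is a stand-in so the example runs without pyltp.

```python
from contextlib import contextmanager


@contextmanager
def loaded_model(model_cls, model_path):
    """Create model_cls(), load model_path, and always release it afterwards."""
    model = model_cls()
    model.load(model_path)
    try:
        yield model
    finally:
        model.release()


class FakeSegmentor(object):
    """Stand-in with the same load/segment/release shape (not a pyltp class)."""

    def __init__(self):
        self.loaded = False

    def load(self, path):
        self.loaded = True

    def segment(self, sentence):
        # Whitespace splitting only; the real Segmentor uses the loaded model.
        return sentence.split()

    def release(self):
        self.loaded = False


if __name__ == "__main__":
    # With pyltp installed, pass Segmentor instead of FakeSegmentor.
    with loaded_model(FakeSegmentor, "cws.model") as seg:
        for sentence in ["a b c", "d e"]:
            print(list(seg.segment(sentence)))
```

The same wrapper should work for Postagger and NamedEntityRecognizer, since the code in this post calls load() and release() on them in the same way.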

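Customized segmentation

Customized segmentation is a distinctive feature of LTP. It addresses the case where the test data moves from the news domain to a different one, such as fiction or finance. When switching to a new domain, the user only needs to annotate a small amount of data. Customized segmentation then trains incrementally on top of the original news data, so that it uses the rich news-domain data while still accounting for the specifics of the target domain.

In cmd, switch to the tools/train/Release directory.

Enter:

otcws.exe customized-learn --baseline-model path/to/your/model --model name.model --reference path/to/the/reference/file --development path/to/the/development/file

Then wait for training to finish.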

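Where:

reference: the training set file

development: the development set file

algorithm: the parameter learning method. The LTP online learning framework currently supports two methods: passive aggressive (pa) and average perceptron (ap).

model: the prefix of the output model file name. Models are named model.$iter.

max-iter: the maximum number of iterations

rare-feature-threshold: how aggressively the model is pruned. If rare-feature-threshold is 0, only features whose value is 0 are removed. If it is greater than 0, features updated fewer times than the threshold are removed as well. For details of the pruning algorithm, see the model pruning section of the official documentation.

dump-details: output all model information when saving the model. This parameter is used for customized segmentation; see the customized segmentation section of the official documentation for details.

Note that both reference and development must contain manually segmented sentences.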
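When the training command is launched from a script, building it as an argument list avoids quoting problems with paths that contain spaces. The sketch below is one way to do that, assuming the flag names shown in the command above (with the baseline flag read as --baseline-model). The function name and all file paths are placeholders.

```python
def build_customized_learn_command(baseline_model, output_model,
                                   reference, development,
                                   executable="otcws.exe"):
    """Return the otcws customized-learn command as a list of arguments."""
    return [
        executable, "customized-learn",
        "--baseline-model", baseline_model,
        "--model", output_model,
        "--reference", reference,
        "--development", development,
    ]


if __name__ == "__main__":
    cmd = build_customized_learn_command(
        "path/to/your/model", "name.model",
        "path/to/the/reference/file", "path/to/the/development/file")
    print(" ".join(cmd))
    # To actually run it from tools/train/Release:
    # import subprocess
    # subprocess.check_call(cmd)
```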

6. POS tagging

from pyltp import Postagger
def posttagger(words):
    postagger = Postagger()
    postagger.load('pos.model')  # load the model
    posttags = postagger.postag(words)  # POS tagging
    postags = list(posttags)
    postagger.release()  # release the model
    return postags

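7. NER

from pyltp import NamedEntityRecognizer
def ner(words, postags):
    recognizer = NamedEntityRecognizer()
    recognizer.load('ner.model')  # load the model
    netags = recognizer.recognize(words, postags)  # named entity recognition
    nertags = list(netags)  # copy the tags before releasing the model
    recognizer.release()  # release the model
    for word, ntag in zip(words, nertags):
        print(word + '/' + ntag)
    return nertags  # step 9 uses the returned tags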

8. Read the text

import codecs

# The file is decoded using the encoding argument; codecs returns its contents as unicode.
news_files = codecs.open('C:test.txt', 'r', encoding='utf8')
news_list = news_files.readlines()
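readlines() keeps the trailing newline of each line and also returns blank lines, and both end up being passed to the segmenter. A possible variant that strips and filters them is sketched below. It uses io.open, which exists in both Python 2.7 and Python 3; the function name read_sentences is made up for illustration.

```python
import io


def read_sentences(path, encoding="utf8"):
    """Return the non-empty lines of a text file as unicode strings,
    with surrounding whitespace and line endings removed."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

The with statement also closes the file, which the snippet above never does.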

9. Save

# Create a txt file to store the NER results
out_file = codecs.open('ner.txt', 'w', encoding='utf8')

for row in news_list:
    news_str = row.encode("utf-8")  # the segmenter requires str (byte string) input
    words = segmentor(news_str)
    tags = posttagger(words)
    nertags = ner(words, tags)
    for word, nertag in zip(words, nertags):
        out_file.write(word.decode('utf-8') + '/' + nertag.decode('utf-8') + ' ')

out_file.close()
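The loop above writes every token onto one long line. If the output should stay aligned with the input, one line per sentence, the formatting can be pulled into a small helper. This is only a sketch: format_tagged is a made-up name, and it expects unicode words and tags (that is, already decoded).

```python
def format_tagged(words, tags, sep=u"/"):
    """Join parallel word and tag lists into one 'word/tag word/tag ...' line."""
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")
    return u" ".join(word + sep + tag for word, tag in zip(words, tags))
```

Inside the loop this would be used as out_file.write(format_tagged(decoded_words, decoded_tags) + u"\n"), where decoded_words and decoded_tags are the decoded lists.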

10. Extract

import codecs
import re

file = codecs.open('ner.txt', 'r', encoding='utf8')  # the file written in step 9
file_content = file.read()
file.close()
file_list = file_content.split()
#print file_list

out_file = codecs.open('tiqu.txt', 'w', encoding='utf8')

ner_list = []
phrase_list = []
for word in file_list:
    if re.search('Ni$', word):  # $ matches the end of the string
        print(word)
        out_file.write(word + ' ')
        word_list = word.rsplit('/', 1)
        # Check whether the single word is a named entity by itself
        if re.search(r'^S', word_list[1]):
            ner_list.append(word_list[0])
        elif re.search(r'^B', word_list[1]):
            phrase_list.append(word_list[0])
        elif re.search(r'^I', word_list[1]):
            phrase_list.append(word_list[0])
        else:
            # End of a multi-word entity: join the collected words into one string
            phrase_list.append(word_list[0])
            ner_phrase = ''.join(phrase_list)
            ner_list.append(ner_phrase)
            phrase_list = []
#for ner in ner_list:
#    print ner

out_file.close()
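The same merging logic can be written without regular expressions and reused for other entity types. The sketch below assumes tags of the form O, S-Ni, B-Ni, I-Ni, and E-Ni, which is what the checks on S, B, I, and the Ni suffix above imply. The function name extract_entities and the sample tokens are made up for illustration.

```python
def extract_entities(tokens, entity_type=u"Ni"):
    """Merge 'word/tag' tokens into entity strings of the given type.

    S- tags yield a single-word entity; B-, I-, E- tags are joined into one
    multi-word entity that is emitted when the E- tag is reached.
    """
    entities = []
    current = []
    for token in tokens:
        word, _, tag = token.rpartition(u"/")
        position, _, kind = tag.partition(u"-")
        if kind != entity_type:
            current = []  # an O tag or another entity type ends any open entity
            continue
        if position == u"S":
            entities.append(word)
            current = []
        elif position == u"B":
            current = [word]
        elif position == u"I":
            current.append(word)
        elif position == u"E":
            current.append(word)
            entities.append(u"".join(current))
            current = []
    return entities


if __name__ == "__main__":
    sample = [u"我/O", u"在/O", u"哈尔滨/B-Ni", u"工业/I-Ni", u"大学/E-Ni",
              u"和/O", u"百度/S-Ni", u"张三/S-Nh"]
    for entity in extract_entities(sample):
        print(entity)
```

Calling extract_entities(file_list) on the tokens read from ner.txt gives the organization names; passing entity_type=u"Nh" or u"Ns" would select the other tag suffixes in the same way.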
