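Processing Chinese with the CoreNLP Python Interface

阿新 • Published: 2019-02-16

CoreNLP is an open-source NLP toolkit developed at Stanford. It provides tokenization, part-of-speech tagging, parsing, and other annotators, much like spaCy. spaCy bills itself as the fastest NLP system currently available and ships with a ready-made Python interface, but at the time of writing it does not support Chinese. CoreNLP includes Chinese models and can process Chinese directly, but it is written in Java, so calling it from Python takes a little more work.

Installation

Installation is straightforward: download the latest CoreNLP archive and the model jar for the language you need, both from the CoreNLP download page. Unpack the archive to get the CoreNLP directory, then place the language jar in that directory.

Starting the NLP server

Because CoreNLP is written in Java, there is no native Python package to import directly. CoreNLP can, however, run as a server that accepts HTTP requests, so a thin Python wrapper can talk to the server and make it feel like a native Python package.

For Chinese, start the CoreNLP server by changing to the CoreNLP directory and running the following command: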

java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000

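CoreNLP currently requires JDK 1.8 or later. The -Xmx4g option gives the server's JVM a maximum heap of 4 GB. The -serverProperties option names the properties file to use; this file is inside the Chinese models jar.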
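If you want to launch the server from a script instead of typing the command, one option is to assemble the same command as an argument list. The following is a minimal sketch: the function name build_server_command and its parameter names are made up for illustration, while the Java class name and the flags are copied from the command above.

```python
import shlex


def build_server_command(memory_gb=4, port=9000, timeout_ms=15000,
                         properties="StanfordCoreNLP-chinese.properties"):
    """Return the CoreNLP server command from the text as an argument list.

    Hypothetical helper: only the flags and the class name come from the article.
    """
    return [
        "java", "-Xmx{}g".format(memory_gb),
        "-cp", "*",  # classpath wildcard, so run this from the CoreNLP directory
        "edu.stanford.nlp.pipeline.StanfordCoreNLPServer",
        "-serverProperties", properties,
        "-port", str(port),
        "-timeout", str(timeout_ms),
    ]


cmd = build_server_command()
# Show a shell-safe rendering of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

A list like this can be handed to subprocess.Popen(cmd, cwd="/path/to/corenlp") without going through a shell, which avoids quoting problems with the "*" classpath.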

After the server starts, the first run is fairly slow because it still has to load the various packages it needs.
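Because startup takes a while, a client script may want to wait until the server port accepts connections before sending anything. Here is a rough sketch using only the standard library; wait_for_port is a hypothetical helper, not part of CoreNLP. Note that an open port does not necessarily mean the models are loaded, so the first request may still be slow.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=60.0, interval_s=0.5):
    """Return True once a TCP connection to (host, port) succeeds.

    Returns False if no connection succeeds before timeout_s elapses.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


# Example call, assuming the server was started on port 9000 as above:
# wait_for_port("localhost", 9000, timeout_s=120)
```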

Basic HTTP requests

wget --post-data 'The quick brown fox jumped over the lazy dog.' 'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}'

This sends an HTTP POST request. The equivalent in Python looks like this:

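import json

import requests

url = 'http://192.168.200.169:9000'
properties = {'annotators': 'tokenize,ssplit,pos', 'outputFormat': 'json'}

# requests does not serialize a nested dict as a query parameter,
# so pass the properties as a JSON string.
params = {'properties': json.dumps(properties)}

# Encode the body as UTF-8 explicitly; a plain str containing
# Chinese characters cannot be sent with the default Latin-1 encoding.
data = '天氣非常好'.encode('utf-8')

resp = requests.post(url, data=data, params=params)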
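With outputFormat set to json, the response body is a JSON document with a "sentences" list, each sentence holding a "tokens" list. The sketch below pulls (word, pos) pairs out of such a structure. The helper name tokens_with_pos is hypothetical, and the sample dictionary is hand-written for illustration: its tags are not real server output.

```python
def tokens_with_pos(response_json):
    """Collect (word, pos) pairs from a parsed CoreNLP JSON response."""
    pairs = []
    for sentence in response_json.get("sentences", []):
        for token in sentence.get("tokens", []):
            # "pos" is only present when the pos annotator was requested.
            pairs.append((token["word"], token.get("pos")))
    return pairs


# Hand-written sample in the shape of a response; NOT actual server output.
sample = {"sentences": [{"index": 0, "tokens": [
    {"index": 1, "word": "天氣", "pos": "NN"},
    {"index": 2, "word": "非常", "pos": "AD"},
    {"index": 3, "word": "好", "pos": "VA"},
]}]}

for word, pos in tokens_with_pos(sample):
    print(word, pos)
```

Against a live server you would call tokens_with_pos(resp.json()) on the response from the request above.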

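The official Python interface

Run git clone https://github.com/stanfordnlp/python-stanford-corenlp, then run sudo python setup.py install inside the python-stanford-corenlp directory to install it.

Setting the JAVANLP_HOME environment variable

This Python interface is not a complete Python port of CoreNLP; it merely wraps the approach described above, in which a server is started and a client sends HTTP requests to it. Underneath, it still depends on the CoreNLP server running in a JVM. The client can start that server locally when your code runs, so it needs to know where the Java CoreNLP directory is. To avoid passing that path every time, the code reads it from an environment variable named JAVANLP_HOME.

So add the JAVANLP_HOME environment variable to ~/.bashrc or ~/.bash_profile: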

JAVANLP_HOME="/path/to/corenlp"
export JAVANLP_HOME
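Before starting the client, it can help to confirm that the variable is visible to Python and that the directory really contains jar files. This is a small sketch under that assumption; find_corenlp_jars is a hypothetical helper and is not part of python-stanford-corenlp.

```python
import glob
import os


def find_corenlp_jars(home=None):
    """List the jar files in the CoreNLP directory.

    Uses the JAVANLP_HOME environment variable when home is not given.
    """
    home = home or os.getenv("JAVANLP_HOME")
    if not home:
        raise RuntimeError("JAVANLP_HOME is not set")
    if not os.path.isdir(home):
        raise RuntimeError("not a directory: {}".format(home))
    # Only the top level is scanned, matching the "{javanlp}/*" classpath.
    return sorted(glob.glob(os.path.join(home, "*.jar")))


# Example call, once JAVANLP_HOME is exported:
# for jar in find_corenlp_jars():
#     print(jar)
```

An empty list suggests the path points at the wrong directory or the language jar was never copied in.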

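Modifying the code to handle Chinese

Processing Chinese still requires a few changes, though. You can fork the project to your own GitHub account and make the changes there; afterwards, wherever you need it, you can simply clone your modified fork.

The change goes in python-stanford-corenlp/corenlp/client.py, in the start_cmd command that the __init__ method of CoreNLPClient uses to start the server. The original code is: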

start_cmd = "{javanlp}/bin/javanlp.sh edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port {port} -timeout {timeout}".format(
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)

Change it to:

start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
                memory=allocate_mem,
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)

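The original start_cmd is rather hard-coded, and, possibly because I downloaded the wrong CoreNLP version, my CoreNLP directory has no bin directory or javanlp.sh script. I therefore replaced that part with java -Xmx{memory}g -cp "{javanlp}/*" directly, where the memory parameter sets how much memory the server gets.

The -serverProperties argument is added so that the server can process Chinese. The modified __init__ method looks like this: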

    DEFAULT_ANNOTATORS = "tokenize ssplit pos ner depparse".split()
    DEFAULT_PROPERTIES = {}

    def __init__(self, start_server=True, endpoint="http://localhost:9000", 
        timeout=5000, annotators=DEFAULT_ANNOTATORS, properties=DEFAULT_PROPERTIES, allocate_mem=4):
        if start_server:
            host, port = urlparse(endpoint).netloc.split(":")
            assert host == "localhost", "If starting a server, endpoint must be localhost"

            assert os.getenv("JAVANLP_HOME") is not None, "Please define $JAVANLP_HOME where your CoreNLP Java checkout is"
            start_cmd = 'java -Xmx{memory}g -cp "{javanlp}/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port {port} -timeout {timeout}'.format(
                memory=allocate_mem,
                javanlp=os.getenv("JAVANLP_HOME"),
                port=port,
                timeout=timeout)
            stop_cmd = None
        else:
            start_cmd = stop_cmd = None

        super(CoreNLPClient, self).__init__(start_cmd, stop_cmd, endpoint)
        self.default_annotators = annotators
        self.default_properties = properties

I also removed lemma from DEFAULT_ANNOTATORS, since lemmatization is of no use when processing Chinese.

Example

After modifying the code, run sudo python setup.py install again.

Sample usage:

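#-*- coding:utf-8 -*-
from __future__ import print_function  # lets the print() call below work on Python 2 and 3

import corenlp


text = u'今天是一個大晴天'

# Only request the annotators needed for tokens and POS tags.
with corenlp.CoreNLPClient(annotators='tokenize ssplit pos'.split()) as client:
    ann = client.annotate(text)
    sentence = ann.sentence[0]

    for token in sentence.token:
        print(token.word, token.pos)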

Running it prints:

今天 NT
是 VC
一 CD
個 M
大 JJ
晴天 NN
