基於CRF的中文命名實體識別模型
阿新 • 發佈:2018-11-30
條件隨機場(Conditional Random Fields,簡稱 CRF)是給定一組輸入序列條件下另一組輸出序列的條件概率分佈模型,在自然語言處理中得到了廣泛應用。
新建corpus_process類
import re


class CorpusProcess(object):
    """Pre-processing pipeline for the 1998 People's Daily (PKU) tagged corpus.

    Reads the word/POS annotated corpus, merges multi-token annotations
    (time expressions, split person names, bracketed compound words), and
    builds character-level word/POS/tag sequences plus CRF feature dicts.

    NOTE(review): the original module also imported sklearn_crfsuite,
    metrics and the long-removed sklearn.externals.joblib — none of them are
    used here, so they were dropped to keep the module importable.
    """

    def __init__(self):
        """Initialize corpus paths and the POS -> entity-label mapping."""
        self.train_corpus_path = "D://input_py//day15//1980_01rmrb.txt"
        self.process_corpus_path = "D://input_py//day15//result-rmrb.txt"
        # PKU tagset: t = time, nr = person name, ns = place name,
        # nt = organization.  The original mapped ns->ORG and nt->LOC,
        # which swaps places and organizations; fixed here.
        self._maps = {u't': u'T', u'nr': u'PER', u'ns': u'LOC', u'nt': u'ORG'}

    def read_corpus_from_file(self, file_path):
        """Read the corpus file and return its lines."""
        # encoding was commented out in the original, making the read depend
        # on the platform's locale default; force UTF-8 explicitly.
        with open(file_path, 'r', encoding='utf-8') as f:
            return f.readlines()

    def write_corpus_to_file(self, data, file_path):
        """Write already-encoded byte data to file_path."""
        with open(file_path, 'wb') as f:
            f.write(data)

    def q_to_b(self, q_str):
        """Convert full-width characters in q_str to half-width."""
        chars = []
        for uchar in q_str:
            code = ord(uchar)
            if code == 12288:
                # full-width (ideographic) space maps directly to ASCII space
                code = 32
            elif 65281 <= code <= 65374:
                # other full-width chars differ from ASCII by a fixed offset
                code -= 65248
            chars.append(chr(code))
        return "".join(chars)

    def b_to_q(self, b_str):
        """Convert half-width characters in b_str to full-width."""
        chars = []
        for uchar in b_str:
            code = ord(uchar)
            if code == 32:
                # ASCII space maps directly to the full-width space
                code = 12288
            elif 32 <= code <= 126:
                # printable ASCII maps to full-width by a fixed offset
                code += 65248
            chars.append(chr(code))
        return "".join(chars)

    def pre_process(self):
        """Normalize each raw corpus line, merge split annotations, save."""
        lines = self.read_corpus_from_file(self.train_corpus_path)
        new_lines = []
        for line in lines:
            words = self.q_to_b(line.strip()).split(u' ')
            pro_words = self.process_t(words)
            pro_words = self.process_nr(pro_words)
            pro_words = self.process_k(pro_words)
            # drop the leading token (presumably the article/paragraph id
            # such as 19980101-01-001-001/m — TODO confirm against corpus)
            new_lines.append(' '.join(pro_words[1:]))
        self.write_corpus_to_file(data='\n'.join(new_lines).encode('utf-8'),
                                  file_path=self.process_corpus_path)

    def process_k(self, words):
        """Merge coarse-grained bracketed segments, e.g. [国家/n 环保局/n]nt.

        Tokens between '[' and ']' are concatenated with their per-token POS
        stripped; the merged token receives the tag that follows ']'.
        """
        pro_words = []
        temp = u''
        for word in words:
            if not word:
                # The original stopped at the first empty token (consecutive
                # spaces) and could even loop forever on an unclosed bracket
                # at end-of-line; skip blanks and keep going instead.
                continue
            if u'[' in word:
                temp += re.sub(u'/[a-zA-Z]*', u'', word.replace(u'[', u''))
            elif u']' in word:
                w = word.split(u']')
                temp += re.sub(u'/[a-zA-Z]*', u'', w[0])
                pro_words.append(temp + u'/' + w[1])
                temp = u''
            elif temp:
                temp += re.sub(u'/[a-zA-Z]*', u'', word)
            else:
                pro_words.append(word)
        return pro_words

    def process_nr(self, words):
        """Merge split person names, e.g. 温/nr 家宝/nr -> 温家宝/nr."""
        pro_words = []
        index = 0
        total = len(words)
        while index < total:
            word = words[index]
            if u'/nr' in word:
                next_index = index + 1
                if next_index < total and u'/nr' in words[next_index]:
                    # surname + given name: keep only the second token's tag
                    pro_words.append(word.replace(u'/nr', u'')
                                     + words[next_index])
                    index = next_index
                else:
                    pro_words.append(word)
            elif word:
                # original broke out on an empty token; skip it instead
                pro_words.append(word)
            index += 1
        return pro_words

    def process_t(self, words):
        """Merge consecutive time tokens, e.g. 一九九七年/t 十二月/t -> one /t."""
        pro_words = []
        temp = u''
        for word in words:
            if u'/t' in word:
                # accumulate; strip the previous '/t' so only the last stays
                temp = temp.replace(u'/t', u'') + word
            elif temp:
                pro_words.append(temp)
                temp = u''
                if word:
                    pro_words.append(word)
            elif word:
                pro_words.append(word)
        if temp:
            # The original appended a spurious empty token when a line ended
            # with a time word; just flush the pending merged token.
            pro_words.append(temp)
        return pro_words

    def pos_to_tag(self, p):
        """Map a POS tag to an entity label; non-entity POS maps to 'O'."""
        t = self._maps.get(p)
        return t if t else u'O'

    def tag_perform(self, tag, index):
        """Apply the BIO scheme: first char of an entity is B_, the rest I_."""
        if tag == u'O':
            return tag
        return u'B_{}'.format(tag) if index == 0 else u'I_{}'.format(tag)

    def pos_perform(self, pos):
        """Collapse entity-bearing noun POS (except time) to plain 'n'.

        Prevents the POS feature from leaking the label prior.
        """
        return u'n' if pos in self._maps and pos != u't' else pos

    def initialize(self):
        """Load the processed corpus and build the training sequences."""
        lines = self.read_corpus_from_file(self.process_corpus_path)
        words_list = [line.strip().split(' ') for line in lines
                      if line.strip()]
        del lines  # free the raw text before expanding per-character data
        self.init_sequence(words_list)

    def init_sequence(self, words_list):
        """Build character-level word/POS/tag sequences from token lists."""
        words_seq = [[word.split(u'/')[0] for word in words]
                     for words in words_list]
        pos_seq = [[word.split(u'/')[1] for word in words]
                   for words in words_list]
        tag_seq = [[self.pos_to_tag(p) for p in pos] for pos in pos_seq]
        # Expand every token-level POS/tag to one entry per character.
        self.pos_seq = [[[pos_seq[index][i]
                          for _ in range(len(words_seq[index][i]))]
                         for i in range(len(pos_seq[index]))]
                        for index in range(len(pos_seq))]
        self.tag_seq = [[[self.tag_perform(tag_seq[index][i], w)
                          for w in range(len(words_seq[index][i]))]
                         for i in range(len(tag_seq[index]))]
                        for index in range(len(tag_seq))]
        # Flatten to one sequence per sentence; pad POS with 'un' sentinels
        # so window features at the sentence edges stay in range.
        self.pos_seq = [[u'un']
                        + [self.pos_perform(p) for pos in seq for p in pos]
                        + [u'un']
                        for seq in self.pos_seq]
        self.tag_seq = [[t for tag in seq for t in tag]
                        for seq in self.tag_seq]
        self.word_seq = [[u'<BOS>'] + [w for word in seq for w in word]
                         + [u'<EOS>']
                         for seq in words_seq]

    def extract_feature(self, word_grams):
        """Build CRF feature dicts from 3-character windows.

        word_grams: per-sentence lists of [w-1, w, w+1] windows.
        Returns one list of feature dicts per sentence.
        """
        features = []
        for index in range(len(word_grams)):
            feature_list = []
            for i in range(len(word_grams[index])):
                gram = word_grams[index][i]
                feature_list.append({
                    u'w-1': gram[0],
                    u'w': gram[1],
                    u'w+1': gram[2],
                    u'w-1:w': gram[0] + gram[1],
                    u'w:w+1': gram[1] + gram[2],
                    # POS features (p-1/p/p+1 from self.pos_seq) were
                    # commented out in the original; re-enable with care —
                    # predict() never rebuilds pos_seq for new sentences.
                    u'bias': 1.0,
                })
            features.append(feature_list)
        return features

    def segment_by_window(self, words_list=None, window=3):
        """Slide a window of `window` characters over words_list."""
        words = []
        begin, end = 0, window
        # Kept the original loop shape: for window == 1 it yields one fewer
        # window than len(words_list); callers only ever use window == 3.
        for _ in range(1, len(words_list)):
            if end > len(words_list):
                break
            words.append(words_list[begin:end])
            begin += 1
            end += 1
        return words

    def generator(self):
        """Return (features, labels) ready for CRF training."""
        word_grams = [self.segment_by_window(word_list)
                      for word_list in self.word_seq]
        features = self.extract_feature(word_grams)
        return features, self.tag_seq
再建test類
import sklearn_crfsuite
import joblib
from sklearn_crfsuite import metrics

import base_corpus_process

# NOTE(review): the original began with `reload(sys);
# sys.setdefaultencoding('utf8')`, a Python-2-only hack that raises
# NameError/AttributeError on Python 3; removed.


class CRF_NER(object):
    """CRF-based Chinese named-entity recognizer (person/place/org/time)."""

    def __init__(self):
        """Set hyper-parameters and prepare the corpus."""
        self.algorithm = "lbfgs"   # optimizer used by CRFsuite
        self.c1 = 0.1              # L1 regularization coefficient
        self.c2 = 0.1              # L2 regularization coefficient
        self.max_iterations = 100
        self.model_path = "D://input_py//day15//model.pkl"
        self.corpus = base_corpus_process.CorpusProcess()  # corpus helper
        self.corpus.pre_process()   # normalize the raw corpus
        self.corpus.initialize()    # build training sequences
        self.model = None

    def initialize_model(self):
        """Create the (untrained) CRF model from the stored parameters."""
        # float()/int() kept so string-valued overrides still work.
        self.model = sklearn_crfsuite.CRF(
            algorithm=self.algorithm,
            c1=float(self.c1),
            c2=float(self.c2),
            max_iterations=int(self.max_iterations),
            all_possible_transitions=True)

    def train(self):
        """Train on the corpus, report held-out metrics, save the model."""
        self.initialize_model()
        x, y = self.corpus.generator()
        # first 500 sentences held out for evaluation, the rest for training
        x_train, y_train = x[500:], y[500:]
        x_test, y_test = x[:500], y[:500]
        self.model.fit(x_train, y_train)
        labels = list(self.model.classes_)
        labels.remove('O')  # 'O' dominates; exclude it from the scores
        y_predict = self.model.predict(x_test)
        # The original computed the weighted F1 and discarded it; report it.
        f1 = metrics.flat_f1_score(y_test, y_predict,
                                   average='weighted', labels=labels)
        print('weighted F1: {:.3f}'.format(f1))
        # group B_X/I_X pairs together in the report
        sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
        print(metrics.flat_classification_report(
            y_test, y_predict, labels=sorted_labels, digits=3))
        self.save_model()

    def predict(self, sentence):
        """Extract entities from sentence; returns them space-separated."""
        self.load_model()
        u_sent = self.corpus.q_to_b(sentence)
        word_lists = [[u'<BOS>'] + [c for c in u_sent] + [u'<EOS>']]
        word_grams = [self.corpus.segment_by_window(word_list)
                      for word_list in word_lists]
        features = self.corpus.extract_feature(word_grams)
        y_predict = self.model.predict(features)
        entity = u''
        for index, tag in enumerate(y_predict[0]):
            if tag != u'O':
                # a new entity starts when the type suffix changes
                if index > 0 and tag[-1] != y_predict[0][index - 1][-1]:
                    entity += u' '
                entity += u_sent[index]
            elif entity and entity[-1] != u' ':
                # Guard on `entity` first: the original raised IndexError on
                # entity[-1] when the sentence began with an 'O' tag.
                entity += u' '
        return entity

    def load_model(self):
        """Load a previously trained model from disk."""
        self.model = joblib.load(self.model_path)

    def save_model(self):
        """Persist the trained model to disk."""
        joblib.dump(self.model, self.model_path)


if __name__ == '__main__':
    # Guarded so importing this module no longer triggers a full training
    # run; train() returns None, so the original `model = ner.train()`
    # assignment was dropped.
    ner = CRF_NER()
    ner.train()
執行得到的準確率和召回率:
              precision    recall  f1-score   support

    B_LOC         0.944     0.827     0.882       266
    I_LOC         0.878     0.796     0.835      1203
    B_ORG         0.942     0.911     0.926       682
    I_ORG         0.939     0.869     0.903       997
    B_PER         0.981     0.927     0.953       440
    I_PER         0.975     0.945     0.960       824
    B_T           0.989     0.989     0.989       444
    I_T           0.993     0.994     0.993      1099

avg / total       0.949     0.904     0.925      5955