[Chinese Word Segmentation] Structured Perceptron (SP)

阿新 • Published: 2019-01-18

The Structured Perceptron (SP) was proposed by Collins [1] at EMNLP'02 for sequence labeling problems. The segmentation models used by the Chinese word segmentation tools THULAC and LTP are based on it.

1. Structured Perceptron

Model

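A CRF globally models the probability \(P(Y|X)\) under the maximum-entropy criterion, where \(X\) is the input sequence \(x_1^n\) and \(Y\) is the label sequence \(y_1^n\). Unlike a CRF, which models a normalized probability, SP uses the same linear combination of features to model an unnormalized score function: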

\[ S(Y,X) = \sum_s \alpha_s \Phi_s(Y,X) \]

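where \(\alpha_s\) is the weight of feature \(s\), and \(\Phi_s(Y,X)\) is the global representation of the local feature function \(\phi_s(h_i,y_i)\), with \(h_i\) denoting the context (history) available at position \(i\):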

\[ \Phi_s(Y,X) = \sum_i \phi_s(h_i,y_i) \]

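SP therefore treats sequence labeling as a search problem: given the input sequence \(X\), find the label sequence \(Y\) that maximizes the score function: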

\[ \mathop{\arg \max}_Y S(Y,X) \]
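The score function and the argmax above can be illustrated with a small sketch. The feature strings, the `TAGS` set, and the choice of history \(h_i\) (input, position, and previous tag) are assumptions made only for illustration; the enumeration is feasible only for very short inputs, which is why a dynamic program is used in practice.

```python
from collections import Counter
from itertools import product

TAGS = "BMES"  # hypothetical tag set, for illustration only


def local_features(x, i, prev_tag, tag):
    """Hypothetical local features phi_s(h_i, y_i) at position i."""
    return [f"char={x[i]}|tag={tag}", f"prev={prev_tag}|tag={tag}"]


def global_features(x, y):
    """Phi_s(Y, X): local feature counts summed over all positions."""
    counts = Counter()
    for i, tag in enumerate(y):
        prev_tag = y[i - 1] if i > 0 else "^"  # "^" marks the sequence start
        counts.update(local_features(x, i, prev_tag, tag))
    return counts


def score(weights, x, y):
    """S(Y, X) = sum_s alpha_s * Phi_s(Y, X)."""
    return sum(weights.get(f, 0.0) * c for f, c in global_features(x, y).items())


def brute_force_argmax(weights, x):
    """argmax_Y S(Y, X) by enumerating every tag sequence of length len(x)."""
    return max(product(TAGS, repeat=len(x)), key=lambda y: score(weights, x, y))


weights = {"char=a|tag=B": 1.0, "char=b|tag=E": 1.0, "prev=B|tag=E": 0.5}
print(brute_force_argmax(weights, "ab"))  # ('B', 'E')
```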

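Training follows the perceptron rule: each training sentence is decoded with the current weights, and whenever the predicted sequence differs from the gold sequence, the weights of the gold sequence's features are increased and those of the predicted sequence's features are decreased. To reduce overfitting, the weights after every update are retained and then averaged.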

For this reason, the structured perceptron is also known as the averaged perceptron (Averaged Perceptron).
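The training and averaging procedure can be sketched as follows. This is a minimal sketch under assumptions: `decode` and `features` are hypothetical callables supplied by the caller, and the average is taken naively over the weight vector after every training example, which is simple but slow for large feature sets.

```python
from collections import Counter


def train_averaged_perceptron(examples, decode, features, epochs=5):
    """Sketch of averaged structured perceptron training.

    examples: list of (x, y) pairs
    decode(weights, x): returns the predicted label sequence (assumed helper)
    features(x, y): returns a Counter of global feature counts (assumed helper)
    """
    weights = Counter()  # current weight vector
    totals = Counter()   # running sum of the weight vector after each example
    steps = 0
    for _ in range(epochs):
        for x, y in examples:
            z = decode(weights, x)
            if z != y:
                weights.update(features(x, y))    # reward gold features
                weights.subtract(features(x, z))  # penalize predicted features
            totals.update(weights)  # Counter.update adds values, keeping negatives
            steps += 1
    # Averaged weights: the mean of the weight vector over all steps.
    return {f: total / steps for f, total in totals.items()}
```

Averaging damps the influence of late, noisy updates, since a weight that was only briefly large contributes little to the mean.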

Decoding

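When SP is applied to Chinese word segmentation, a state-transition feature \((y_{t-1}, y_t)\) is used in addition to the predefined feature templates. Let \(\delta_t(y)\) denote the maximum score over all partial paths \(y_1^{t}\) whose state at time \(t\) is \(y\):

\[ \delta_t(y) = \max_{y_1^{t-1}} S(y_1^{t-1},X,y_t=y) \]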

Then, at time \(t+1\), the following recurrence holds:

\[ \delta_{t+1}(y) = \max_{y'} \ \{ \delta_t(y') + w_{y',y} + F(y_{t+1}=y,X) \} \]

where \(w_{y',y}\) is the weight of the transition feature \((y',y)\), and \(F(y_{t+1}=y,X)\) is the weighted sum of the feature-template values that fire for \(y_{t+1}=y\).
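The recurrence also needs a starting point and a way to read off the best path. One way to write both, assuming no dedicated start-transition weight and introducing a backpointer \(\psi\) that is not part of the original notation:

\[ \delta_1(y) = F(y_1=y,X) \]

\[ \psi_{t+1}(y) = \mathop{\arg \max}_{y'} \ \{ \delta_t(y') + w_{y',y} \} \]

\[ \hat{y}_n = \mathop{\arg \max}_y \delta_n(y), \qquad \hat{y}_t = \psi_{t+1}(\hat{y}_{t+1}), \quad t = n-1, \dots, 1 \]

The term \(F(y_{t+1}=y,X)\) does not depend on \(y'\), so it can be left out of the backpointer without changing the argmax.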

2. Open-Source Implementation

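Zhang Kaixu's minitools/cws (the prototype of THULAC) provides a simple implementation of SP-based Chinese word segmentation. First, consider the feature templates it defines: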

def gen_features(self, x):  # enumerate the feature list for each character
    for i in range(len(x)):
        left2 = x[i - 2] if i - 2 >= 0 else '#'
        left1 = x[i - 1] if i - 1 >= 0 else '#'
        mid = x[i]
        right1 = x[i + 1] if i + 1 < len(x) else '#'
        right2 = x[i + 2] if i + 2 < len(x) else '#'
        features = ['1' + mid, '2' + left1, '3' + right1,
                    '4' + left2 + left1, '5' + left1 + mid, '6' + mid + right1, '7' + right1 + right2]
        yield features

Seven feature templates are defined, each conjoined with the tag \(y_i\):

\(x_iy_i\)
\(x_{i-1}y_i\)
\(x_{i+1}y_i\)
\(x_{i-2}x_{i-1}y_i\)
\(x_{i-1}x_{i}y_i\)
\(x_{i}x_{i+1}y_i\)
\(x_{i+1}x_{i+2}y_i\)
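To make the templates concrete, the sketch below runs a standalone copy of `gen_features` (without `self`) on a short string. The helper `tagged_features` is hypothetical; it only mirrors the `str(tag) + feature` key format that the decoder in this article uses to conjoin a template with a tag.

```python
def gen_features(x):
    """Standalone copy of the template function above, without `self`."""
    for i in range(len(x)):
        left2 = x[i - 2] if i - 2 >= 0 else '#'
        left1 = x[i - 1] if i - 1 >= 0 else '#'
        mid = x[i]
        right1 = x[i + 1] if i + 1 < len(x) else '#'
        right2 = x[i + 2] if i + 2 < len(x) else '#'
        yield ['1' + mid, '2' + left1, '3' + right1,
               '4' + left2 + left1, '5' + left1 + mid,
               '6' + mid + right1, '7' + right1 + right2]


def tagged_features(x, i, tag):
    """Hypothetical helper: weight keys for position i under a candidate tag."""
    return [str(tag) + feature for feature in list(gen_features(x))[i]]


print(list(gen_features('中文分詞'))[0])
# ['1中', '2#', '3文', '4##', '5#中', '6中文', '7文分']
print(tagged_features('中文分詞', 0, 0))
# ['01中', '02#', '03文', '04##', '05#中', '06中文', '07文分']
```

The leading digit identifies the template, and `'#'` pads positions that fall outside the sentence.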

The states B, M, E, and S are mapped to the integers 0, 1, 2, and 3, respectively:

def load_example(words):  # list of words -> (x, y)
    y = []
    for word in words:
        if len(word) == 1:
            y.append(3)
        else:
            y.extend([0] + [1] * (len(word) - 2) + [2])
    return ''.join(words), y
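A short usage example shows the mapping in action. The training loop in this article also calls a `dump_example` function whose definition is not shown; the version below is an assumed inverse of `load_example`, written only for illustration, and may differ from the one in minitools/cws.

```python
def load_example(words):
    """Copy of the function above: list of words -> (characters, tag ids)."""
    y = []
    for word in words:
        if len(word) == 1:
            y.append(3)  # S: single-character word
        else:
            y.extend([0] + [1] * (len(word) - 2) + [2])  # B, M..., E
    return ''.join(words), y


def dump_example(x, y):
    """Assumed inverse of load_example: (characters, tag ids) -> list of words."""
    words, current = [], ''
    for ch, tag in zip(x, y):
        current += ch
        if tag in (2, 3):  # E or S closes the current word
            words.append(current)
            current = ''
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return words


x, y = load_example(['結構化', '感知器', '很', '簡單'])
print(x)  # 結構化感知器很簡單
print(y)  # [0, 1, 2, 0, 1, 2, 3, 0, 2]
print(dump_example(x, y))  # ['結構化', '感知器', '很', '簡單']
```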

Training iterates over the corpus and updates the weights as follows:

for i in range(args.iteration):
    print('Iteration %i' % (i + 1), end=' ')
    sys.stdout.flush()
    evaluator = Evaluator()
    with open(args.train, 'r', encoding='utf-8') as train_file:
        for line in train_file:
            x, y = load_example(line.split())
            z = cws.decode(x)  # predict with the current weights
            evaluator(dump_example(x, y), dump_example(x, z))
            cws.weights._step += 1
            if z != y:
                cws.update(x, y, 1)   # reward the gold sequence's features
                cws.update(x, z, -1)  # penalize the predicted sequence's features
    evaluator.report()
    cws.weights.update_all()
    cws.weights.average()
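The loop above relies on `cws.weights._step`, `update_all()`, and `average()`, whose definitions are not shown here. One common way to make averaging cheap is lazy accumulation: instead of summing the whole weight vector after every example, each weight is brought up to date only when it is touched. The class below is a hypothetical sketch of that idea; its names and step bookkeeping are assumptions and are not the minitools/cws code.

```python
class LazyAveragedWeights:
    """Hypothetical sketch of lazily averaged perceptron weights."""

    def __init__(self):
        self.values = {}  # current weight of each feature
        self.acc = {}     # sum of the weight over completed steps
        self.last = {}    # number of completed steps already counted in acc
        self.step = 0     # number of completed training steps

    def _catch_up(self, key):
        # The weight has been constant since self.last[key]:
        # add it once for every step completed in the meantime.
        self.acc[key] += (self.step - self.last[key]) * self.values[key]
        self.last[key] = self.step

    def add(self, key, delta):
        if key not in self.values:
            self.values[key], self.acc[key], self.last[key] = 0.0, 0.0, self.step
        self._catch_up(key)
        self.values[key] += delta

    def end_step(self):
        self.step += 1  # call once after each training example

    def averaged(self):
        for key in self.values:
            self._catch_up(key)
        if self.step == 0:
            return {}
        return {key: self.acc[key] / self.step for key in self.values}


w = LazyAveragedWeights()
w.add('f', 1.0); w.end_step()                   # step 0: f = 1
w.end_step()                                    # step 1: no update
w.add('f', -1.0); w.add('g', 2.0); w.end_step() # step 2: f = 0, g = 2
w.end_step()                                    # step 3: no update
print(w.averaged())  # {'f': 0.5, 'g': 1.0}
```

Each update costs time proportional to the number of features touched, rather than to the size of the full weight vector.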

Decoding uses the Viterbi algorithm, much as in an HMM:

def decode(self, x):  # dynamic-programming decoder, analogous to HMM decoding
    # analogous to the transition probabilities of an HMM
    transitions = [[self.weights.get_value(str(i) + ':' + str(j), 0) for j in range(4)]
                   for i in range(4)]
    # analogous to the emission probabilities of an HMM
    emissions = [[sum(self.weights.get_value(str(tag) + feature, 0) for feature in features)
                  for tag in range(4)] for features in self.gen_features(x)]
    # analogous to the forward probabilities of an HMM; each entry is [best score, backpointer]
    alphas = [[[e, None] for e in emissions[0]]]
    for i in range(len(x) - 1):
        alphas.append([max([alphas[i][j][0] + transitions[j][k] + emissions[i + 1][k], j]
                           for j in range(4))
                       for k in range(4)])
    # follow the backpointers stored in alphas to recover the best sequence
    alpha = max([alphas[-1][j], j] for j in range(4))
    i = len(x)
    tags = []
    while i:
        tags.append(alpha[1])
        i -= 1
        alpha = alphas[i][alpha[1]]
    return list(reversed(tags))
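To see how transition weights can override the locally best tag, the same dynamic program can be run on hand-set scores. The sketch below is a standalone toy with two tags and explicit backpointers; the function name and the numbers are made up for illustration and do not come from minitools/cws.

```python
def viterbi(emissions, transitions):
    """emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k."""
    n_tags = len(transitions)
    delta = [list(emissions[0])]  # best score of a path ending in tag k at t = 0
    back = []                     # back[t - 1][k]: best previous tag for tag k at t
    for t in range(1, len(emissions)):
        row, ptr = [], []
        for k in range(n_tags):
            best_j = max(range(n_tags),
                         key=lambda j: delta[-1][j] + transitions[j][k])
            row.append(delta[-1][best_j] + transitions[best_j][k] + emissions[t][k])
            ptr.append(best_j)
        delta.append(row)
        back.append(ptr)
    # Start from the best final tag and follow the backpointers.
    tag = max(range(n_tags), key=lambda k: delta[-1][k])
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return list(reversed(path))


emissions = [[1, 0], [0, 1], [1, 0]]
print(viterbi(emissions, [[0, 0], [0, 0]]))    # [0, 1, 0]
print(viterbi(emissions, [[0, -2], [-2, 0]]))  # [0, 0, 0]
```

With zero transition weights the decoder picks the best tag at each position independently; once switching tags is penalized, staying on tag 0 gives the higher total score.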

3. References

[1] Collins, Michael. "Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms." Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002.
[2] Zhang, Yue, and Stephen Clark. "Chinese segmentation with a word-based perceptron algorithm." Annual Meeting-Association for Computational Linguistics. Vol. 45. No. 1. 2007.
[3] Kai Zhao and Liang Huang, Structured Prediction with Perceptron: Theory and Algorithms.
[4] Michael Collins, Lecture 4, COMS E6998-3: The Structured Perceptron.
