Do-it-yourself NLP for bot developers

阿新 • Published: 2018-11-14

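I believe that in most cases it makes sense for chatbot developers to build their own natural language parser rather than use a third-party cloud API. There are good strategic and technical reasons for doing so, and I'll show you how easy it is to implement NLP yourself. This post has three parts:

Why you should do it yourself
The simplest implementation works well
Something you can actually use

So what NLP stack do you need to build a typical bot? Suppose you are building a service that helps people find restaurants. Your users might say something like: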

I’m looking for a cheap Mexican place in the centre.

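To respond to this, you need to do two things:

Understand the user's intent: they are looking for a restaurant, not saying "hello", "goodbye", or "thanks".
Extract cheap, Mexican, and centre as the fields for your query.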
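To make the target concrete, here is a minimal sketch of the structured result such a parser might hand to the bot's backend. The intent label `restaurant_search`, the helper `to_query`, and the field names `pricerange`, `cuisine`, and `area` are illustrative assumptions (the entity names are borrowed from the training examples later in this post), not a fixed schema.

```python
# Hypothetical parse result for the example sentence. The intent label and
# field names are illustrative assumptions, not part of any library.
parsed = {
    "text": "I'm looking for a cheap Mexican place in the centre.",
    "intent": "restaurant_search",
    "entities": {"pricerange": "cheap", "cuisine": "mexican", "area": "centre"},
}


def to_query(parse):
    """Turn a parse into search filters; only a search intent produces a query."""
    if parse["intent"] != "restaurant_search":
        return None
    return dict(parse["entities"])


query = to_query(parsed)
print(query)
```

Everything that follows is about filling in the `intent` and `entities` parts of a structure like this from raw text.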

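In a previous post I mentioned that tools like wit and LUIS make intent classification and entity extraction so simple that you can build a chatbot quickly during a hackathon. I'm a big fan of these cloud services and the teams behind them, but that doesn't mean they are the right fit for every situation.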

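1. Three reasons to use an NLP library instead of a cloud API

First, if you really want to build a business based on conversational software, handing everything your users tell you to Facebook or Microsoft is probably not a good strategy. Second, I don't believe web APIs are the solution to every problem in development. https calls are slow, and you are always constrained by the design of the API. Local libraries, on the other hand, are hackable. Third, there is a good chance you can get better performance on your own data and use case. Remember, a general-purpose API has to do well on every problem, while you only have to do well on yours.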

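2. Word vectors + heuristics - complexity = working code

First, we'll build the simplest possible model without using any libraries (other than numpy), to get a feel for how it works.

I firmly believe that the only thing you can really do in machine learning is find a good representation. If that isn't entirely clear, I'm currently writing another post that explains it, so please check back later. The point is that if you have an effective way to represent your data, even very simple algorithms will get the job done.

We'll use word vectors, which are vectors of a few tens or hundreds of floating point numbers that capture, to some extent, the meaning of a word. The fact that this is possible at all, and the way these models are trained, are both fascinating. For our purposes it means the hard work has already been done: word embeddings like word2vec or GloVe are a powerful way to represent text data. I decided to use GloVe for these experiments. You can download pre-trained models from their repository, and I used the smallest (50-dimensional) pre-trained model.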
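As a rough illustration of what "capturing meaning" buys us, the sketch below compares made-up three-dimensional vectors using cosine similarity. The numbers are invented for the example and the helper name `cosine` is an assumption; real GloVe vectors are learned from text and have 50 or more dimensions. It is written in plain Python so it runs without any dependencies.

```python
import math

# Made-up 3-dimensional "word vectors", for illustration only. Real GloVe
# vectors have 50 or more dimensions and are learned from text.
toy_vectors = {
    "mexican": [0.9, 0.1, 0.0],
    "indian": [0.8, 0.2, 0.1],
    "hello": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim_cuisines = cosine(toy_vectors["mexican"], toy_vectors["indian"])
sim_unrelated = cosine(toy_vectors["mexican"], toy_vectors["hello"])
print(round(sim_cuisines, 3), round(sim_unrelated, 3))
```

With these toy numbers the two cuisine words point in nearly the same direction, while the greeting is almost orthogonal to them. If the vectors are first scaled to unit length, the cosine reduces to a plain dot product.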

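The code below is based on the Python example in the GloVe repository. All it does is load the word vectors for the whole vocabulary into memory: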

import numpy as np


class Embedding(object):
    def __init__(self, vocab_file, vectors_file):
        # The vocab file lists one word per line; keep only the first field.
        with open(vocab_file, 'r') as f:
            words = [x.rstrip().split(' ')[0] for x in f.readlines()]

        # The vectors file has one "word v1 v2 ... vd" entry per line.
        with open(vectors_file, 'r') as f:
            vectors = {}
            for line in f:
                vals = line.rstrip().split(' ')
                vectors[vals[0]] = [float(x) for x in vals[1:]]

        vocab_size = len(words)
        vocab = {w: idx for idx, w in enumerate(words)}
        ivocab = {idx: w for idx, w in enumerate(words)}

        vector_dim = len(vectors[ivocab[0]])
        W = np.zeros((vocab_size, vector_dim))
        for word, v in vectors.items():
            if word == '<unk>':
                continue
            W[vocab[word], :] = v

        # normalize each word vector to unit length
        d = np.sum(W ** 2, 1) ** 0.5
        d[d == 0] = 1.0  # leave all-zero rows as they are instead of dividing by zero
        W_norm = (W.T / d).T

        self.W = W_norm
        self.vocab = vocab
        self.ivocab = ivocab

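Now let's try to use these word vectors for the first task: extracting Mexican as a cuisine type from the sentence "I'm looking for a cheap Mexican place in the centre." We'll use the simplest approach possible: look for the words in the sentence that are most similar to a set of example cuisines. We loop over the words in the sentence and pick out those whose average cosine similarity to the reference words is above some threshold: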

def find_similar_words(embed, text, refs, thresh):
    # Stack the unit-length vectors of the reference words into a matrix.
    C = np.zeros((len(refs), embed.W.shape[1]))

    for idx, term in enumerate(refs):
        if term in embed.vocab:
            C[idx, :] = embed.W[embed.vocab[term], :]

    tokens = text.split(' ')
    scores = [0.] * len(tokens)
    found = []

    for idx, term in enumerate(tokens):
        if term in embed.vocab:
            vec = embed.W[embed.vocab[term], :]
            # Rows are unit length, so these dot products are cosine similarities.
            cosines = np.dot(C, vec.T)
            score = np.mean(cosines)
            scores[idx] = score
            if score > thresh:
                found.append(term)
    print(scores)

    return found

Let's try it on an example sentence.

vocab_file ="/path/to/vocab_file"
vectors_file ="/path/to/vectors_file"

embed = Embedding(vocab_file,vectors_file)

cuisine_refs = ["mexican","chinese","french","british","american"]
threshold = 0.2

text = "I want to find an indian restaurant"

cuisines = find_similar_words(embed,text,cuisine_refs,threshold)
print(cuisines)
# >>> ['indian']

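Amazingly, the code above is enough to generalise correctly and pick out Indian as a cuisine type based on its similarity to the reference words. This is why I say that once you have a good representation, the problem becomes easy.

Now on to classifying the user's intent. We want to be able to sort sentences into categories such as "greeting", "thanks", "requesting a restaurant", "specifying a location", and "rejecting a suggestion", so that we can tell the bot's backend which code to run. There are many ways to combine word vectors into a representation of a sentence, but once again we'll go with the simplest one: add them up. If you doubt that this is meaningful or useful, the appendix explains why it works.

We can create these bag-of-words vectors for each sentence and, once again, classify them using a simple distance. And once again, surprisingly, this already generalises to sentences it has never seen before: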

import numpy as np

def sum_vecs(embed,text):

    tokens = text.split(' ')
    vec = np.zeros(embed.W.shape[1])

    for idx, term in enumerate(tokens):
        if term in embed.vocab:
            vec = vec + embed.W[embed.vocab[term], :]
    return vec


def get_centroid(embed,examples):

    C = np.zeros((len(examples),embed.W.shape[1]))
    for idx, text in enumerate(examples):
        C[idx,:] = sum_vecs(embed,text)

    centroid = np.mean(C,axis=0)
    assert centroid.shape[0] == embed.W.shape[1]
    return centroid


def get_intent(embed,text):
    intents = ['deny', 'inform', 'greet']
    vec = sum_vecs(embed,text)
    scores = np.array([ np.linalg.norm(vec-data[label]["centroid"]) for label in intents ])
    return intents[np.argmin(scores)]


embed = Embedding('/path/to/vocab','/path/to/vectors')


data={
  "greet": {
    "examples" : ["hello","hey there","howdy","hello","hi","hey","hey ho"],
    "centroid" : None
  },
  "inform": {
    "examples" : [
      "i'd like something asian",
      "maybe korean",
      "what mexican options do i have",
      "what italian options do i have",
      "i want korean food",
      "i want german food",
      "i want vegetarian food",
      "i would like chinese food",
      "i would like indian food",
      "what japanese options do i have",
      "korean please",
      "what about indian",
      "i want some vegan food",
      "maybe thai",
      "i'd like something vegetarian",
      "show me french restaurants",
      "show me a cool malaysian spot"
    ],
    "centroid" : None
  },
  "deny": {
    "examples" : [
      "nah",
      "any other places ?",
      "anything else",
      "no thanks",
      "not that one",
      "i do not like that place",
      "something else please",
      "no please show other options"
    ],
    "centroid" : None
  }
}


for label in data.keys():
    data[label]["centroid"] = get_centroid(embed,data[label]["examples"])


for text in ["hey you","i am looking for chinese food","not for me"]:
    print("text : '{0}', predicted_label : '{1}'".format(text,get_intent(embed,text)))

# output
# >>>text : 'hey you', predicted_label : 'greet'
# >>>text : 'i am looking for chinese food', predicted_label : 'inform'
# >>>text : 'not for me', predicted_label : 'deny'

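Neither the parsing nor the classification approach I've shown is particularly robust, so we'll move on to something better. But I hope I've shown that there is nothing mysterious going on, and that very simple approaches can already work.

3. Something you can actually use

There are many things we can do better. For example, we can convert the text into tokens properly instead of just splitting on whitespace. One approach is to use a combination of SpaCy and textacy to clean and parse the text, and scikit-learn to build the models. Here I'll use the Python interface to MITIE (the MIT Information Extraction library) to do the job.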
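Before turning to MITIE, the tokenization point can be illustrated with a small regex-based sketch. The function name `simple_tokenize` and the pattern are assumptions made for this example, and a real tokenizer handles many more cases. Whether contractions such as "i'm" should stay whole also depends on how the embedding vocabulary itself was tokenized.

```python
import re

# Words (optionally with an internal apostrophe, e.g. "i'm") or any single
# character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_tokenize("I'm looking for a cheap Mexican place in the centre.")
print(tokens)
```

Compared with `text.split(' ')`, this lowercases the input, so "Mexican" can match a lowercase vocabulary entry, and it separates the final full stop from "centre", so the last word is no longer the out-of-vocabulary token "centre.".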

There are two classes we can use directly. First, a text classifier:

import sys, os
from mitie import *

trainer = text_categorizer_trainer("/path/to/total_word_feature_extractor.dat")

data = {}  # same dict as before, omitted for brevity

for label in data.keys():
  for text in data[label]["examples"]:
    tokens = tokenize(text)
    trainer.add_labeled_text(tokens,label)

trainer.num_threads = 4
cat = trainer.train()

cat.save_to_disk("my_text_categorizer.dat")

# we can then use the categorizer to predict on new text
tokens = tokenize("somewhere that serves chinese food")
predicted_label, _ = cat(tokens)

Second, an entity recognizer:

import sys, os
from mitie import *
sample = ner_training_instance(["I", "am", "looking", "for", "some", "cheap", "Mexican", "food", "."])

sample.add_entity(range(5,6), "pricerange")
sample.add_entity(range(6,7), "cuisine")

# And we add another training example
sample2 = ner_training_instance(["show", "me", "indian", "restaurants", "in", "the", "centre", "."])
sample2.add_entity(range(2,3), "cuisine")
sample2.add_entity(range(6,7), "area")


trainer = ner_trainer("/path/to/total_word_feature_extractor.dat")

trainer.add(sample)
trainer.add(sample2)

trainer.num_threads = 4

ner = trainer.train()

ner.save_to_disk("new_ner_model.dat")


# Now let's make up a test sentence and ask the ner object to find the entities.
tokens = ["I", "want", "expensive", "korean", "food"]
entities = ner.extract_entities(tokens)


print("\nEntities found:", entities)
print("\nNumber of entities detected:", len(entities))
for e in entities:
    # each entity starts with the range of token indices, followed by its tag
    token_range = e[0]
    tag = e[1]
    entity_text = " ".join(tokens[i] for i in token_range)
    print("    " + tag + ": " + entity_text)

# output 
# >>> Number of entities detected: 2
# >>>     pricerange: expensive
# >>>     cuisine: korean

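The MITIE library is quite sophisticated and uses several word embeddings rather than just GloVe. The text classifier is a simple SVM, while the entity recognizer uses a structured SVM. If you're interested, the GitHub repository has links to the relevant literature.

As you would expect, using a library like this (or SpaCy plus your favourite ML library) gives better performance than the experimental code I posted at the start. In fact, in my experience you can quickly beat the performance of wit or LUIS, because you can tune the parameters for your own dataset.

Conclusion

I hope I've convinced you that it's worth building your own NLP module when creating a chatbot. Please add your ideas, suggestions, and questions below. I look forward to the discussion. If you liked this post, you can recommend it here or, better still, on Twitter.

Thanks to Alex, Kate, Norman, and Joey for reading drafts!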

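Appendix: sparse recovery

How can it possibly work to add up (or average) the word vectors in a sentence and use the result as a representation of that sentence? It's as if I told you that a class of 10 students scored an average of 75% on a test, and you tried to work out each individual's score. Well, almost. It turns out to be one of those counterintuitive things about high-dimensional geometry.

If you sample 10 vectors from a set of 1000, you can actually work out which ones you picked just from knowing their average, provided the vectors have high enough dimension (say 300). This comes down to the fact that there is a lot of room in ℝ³⁰⁰, so if you sample a pair of vectors at random you can expect them to be (almost) linearly independent, and in fact nearly orthogonal.

We aren't interested in the length of the word vectors (they were normalised in the code above), so we can treat them as points on the unit sphere. Suppose we have a set of N vectors V ⊂ ℝ^d that are independent and identically distributed (iid) on the unit d-sphere. The question is: given a subset S of V, how large does d need to be for us to recover all of S if we only know the sum x = Σ_{v ∈ S} v? We can recover the original vectors only if the dot product between x and every v ∉ S is small enough, while the vectors in S have v · x ≈ 1.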
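The setup above can be tried out numerically. The sketch below is a simulation with random unit vectors rather than real word vectors. The function names and the decoding rule (take the vectors with the largest dot product against the sum) are assumptions made for this illustration. It is written in plain Python so it runs without dependencies.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Sample a point uniformly on the unit sphere via normalised Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def recovered_fraction(n_vectors, subset_size, dim, seed=0):
    """Sum a random subset, then guess the subset from dot products with the sum."""
    rng = random.Random(seed)
    vectors = [random_unit_vector(dim, rng) for _ in range(n_vectors)]
    subset = set(rng.sample(range(n_vectors), subset_size))

    # x is the only thing the "decoder" gets to see.
    x = [sum(vectors[i][k] for i in subset) for k in range(dim)]

    # Members of the subset should have a dot product near 1, the rest near 0,
    # so take the subset_size highest-scoring vectors as the guess.
    scores = [sum(a * b for a, b in zip(v, x)) for v in vectors]
    ranked = sorted(range(n_vectors), key=lambda i: scores[i], reverse=True)
    guess = set(ranked[:subset_size])
    return len(guess & subset) / subset_size


for dim in (10, 50, 300):
    print(dim, recovered_fraction(n_vectors=1000, subset_size=10, dim=dim))
```

The printed fraction should climb towards 1 as the dimension grows. With this simple decoder, recovery at a given dimension is a matter of probability rather than a guarantee, so an occasional miss is expected.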

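We can use a result called concentration of measure, which tells us what we need. For iid points on the unit d-sphere, the dot product between any two of them is zero on average, with a typical size on the order of 1/√d, and the probability that the dot product is larger than a is P(v · w > a) ≤ (1 − a²)^(d/2). So we can bound the probability ε that some vector v ∉ S ends up too close to one of the vectors in S, in terms of the dimension of the space. The result is that, to bring the probability of failure down to ε, we need d > |S| log(N|S|/ε), where |S| is the number of vectors in S. So if we want to recover a subset of 10 vectors out of a total of 1000 with a 1% tolerance for failure, we can do it in 138 dimensions or more.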
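As a quick check of the arithmetic, the bound above can be evaluated directly, with S taken as the size of the subset. The function name `min_dimension` is an assumption for this example, and so is reading "log" as the natural logarithm, which is the reading that reproduces the figure of 138.

```python
import math


def min_dimension(n_total, subset_size, eps):
    """Evaluate the bound subset_size * ln(n_total * subset_size / eps)."""
    return subset_size * math.log(n_total * subset_size / eps)


d = min_dimension(n_total=1000, subset_size=10, eps=0.01)
print(round(d, 1))  # 138.2
```

With N = 1000, a subset of 10, and ε = 0.01, the argument of the logarithm is 10⁶, so the bound is 10 × ln(10⁶) ≈ 138.2.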

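Going back to the test score analogy, thinking about it this way may make things clearer. Instead of the average total score, we are now given the average score on each question. Given the per-question averages for the 10 students, what we are saying is that it becomes easier to tell which students they were as the test contains more questions. Not so counterintuitive after all.

Thanks to Alexander Weidauer.

Original article: Do-it-yourself NLP for bot developers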
