MIT自然語言處理第三講：概率語言模型（第四、五、六部分）

阿新 • • 發佈：2019-01-30

MIT自然語言處理第三講：概率語言模型（第四部分）

自然語言處理：概率語言模型

Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
譯者：我愛自然語言處理（www.52nlp.cn ，2009年1月20日）

四、平滑演算法
a) 最大似然估計（Maximum Likelihood Estimate）
　i. MLE使訓練資料儘可能的“大”（MLE makes training data as probable as possible）：
　　　P_{ML}(w_{i}/{w_{i-1},w_{i-2}}) = {Count(w_{i-2},w_{i-1},w_{i})}/{Count(w_{i-2},w_{i-1})}
　　1. 對於詞彙規模為N的語料庫，我們將要在模型中得到N^{3}的引數（For vocabulary of size N, we will have N3 parameters in the model）
　　2. 對於N=1000，我們將要估計1000^{3}=10^{9}個引數（For N =1, 000, we have to estimate1, 000^{3}=10^{9} parameters）
　　3. 問題（Problem）: 如何處理未登入詞（how to deal with unseen words）?
　ii. 資料稀疏問題（Sparsity）
　　1. 未知事件的總計概率構成了測試資料的很大一部分（The aggregate probability of unseen events constitutes a large fraction of the test data）
　　2. Brown et al (1992): 考慮一個3.5億詞的英語語料庫，14%的三元詞是未知的（considered a 350 million word corpus of English, 14% of trigrams are unseen）
　iii. 注：關於MLE的簡要補充
　　1. 最大似然估計是一種統計方法，它用來求一個樣本集的相關概率密度函式的引數。這個方法最早是遺傳學家以及統計學家羅納德•費雪爵士在1912年至1922年間開始使用的。
　　2. “似然”是對likelihood 的一種較為貼近文言文的翻譯，“似然”用現代的中文來說即“可能性”。故而，若稱之為“最大可能性估計”則更加通俗易懂。
　　3.MLE選擇的引數使訓練語料具有最高的概率，它沒有浪費任何概率在訓練語料中沒有出現的事件中
　　4.但是MLE概率模型通常不適合做NLP的統計語言模型，會出現0概率，這是不允許的。
b) 如何估計未知元素的概率（How to estimate probability of unseen elements）?
　i. 打折（Discounting）
　　1. Laplace加1平滑（Laplace）
　　2. Good-Turing打折法（Good-Turing）
　ii. 線性插值法（Linear Interpolation）
　iii. Katz回退（Katz Back-Off）
c) 加一(Laplace)平滑（Add-One (Laplace) Smoothing）
　i. 最簡單的打折方法（Simplest discounting technique）:
　　　{P(w_{i}/w_{i-1})} = {C(w_{i-1},w_{i})+1}/{C(w_{i-1})+V}
　　這裡Ｖ是詞彙表的數目——語料庫的“型”（where |ν| is a vocabulary size）
　　注：MIT課件這裡似乎有誤，我已修改
　ii. 貝葉斯估計假設事件發生前是一個均勻分佈（Bayesian estimator assuming a uniform unit prior on events）
　iii. 問題（Problem）: 對於未知事件佔去的概率太多了（Too much probability mass to unseen events）
　iv. 例子（Example）：
　　假設V=10000(詞型)，S=1000000(詞例)（Assume |ν| =10, 000, and S=1, 000, 000）：
　　　P_{MLE}(ball/{kike~a}) = {{Count(kike~a~ball)}/{Count(kick~a)}} = 9/10 = 0.9
　　　P_{+1}(ball/{kike~a}) = {{Count(kike~a~ball)+1}/{Count(kick~a)+V}} = {9+1}/{10+10000} = 9*10^{-4}
　v. Laplace的缺點（Weaknesses of Laplace）
　　1. 對於稀疏分佈，Laplace法則賦予未知事件太多的概率空間（For Sparse distribution, Laplace’s Law gives too much of the probability space to unseen events）
　　2. 在預測二元語法的實際概率時與其他平滑方法相比顯得非常差（Worst at predicting the actual probabilities of bigrams than other methods）
　　3. 使用加epsilon平滑更合理一些（More reasonable to use add-epsilonsmoothing (Lidstone’s Law)）

未完待續:第五部分

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

MIT自然語言處理第三講：概率語言模型（第五部分）

自然語言處理：概率語言模型
Natural Language Processing: Probabilistic Language Modeling
作者：Regina Barzilay（MIT,EECS Department, November 15, 2004)
譯者：我愛自然語言處理（www.52nlp.cn ，2009年2月10日）

五、 Good-Turing打折法（Good-Turing Discounting）

a) 你在將來看到一個新詞的可能性有多大？用所看到的事件去估計未知事件的概率（How likely are you to see a new word type in the future? Use things you’ve seen once to estimate the probability of unseen things）
　i. n_r——頻率為r的元素（n元語法）計數並且r>0（number of elements with r frequency and r>0）
　ii. n_0——總詞彙規模減去觀察到的詞彙規模，既出現次數為0的n元語法（size of the total lexicon minus the size of observed lexicon）
　iii. 對於頻率為r的元素，修正計數為（Modified count for elements with frequency r）：
　　　　　　　　r^* = (r+1)*{n_{r+1}/n_r}
b) 關於Good-Turing打折法的補充說明：
　i. Good(1953)首先描述了Good-Turing演算法，而這種演算法的原創思想則來自Turing 。
　ii. Good-Turing平滑的基本思想是：用觀察較高的N元語法數的方法來重新估計概率量的大小，並把它指派給那些具有零計數或較低計數的N元語法。
c) 直觀的Good-Turing打折法（Good-Turing Discounting: Intuition）
　i. 目的（Goal）: 估計訓練資料中計數為r的單詞在同樣規模測試集中的出現頻率（estimate how often word with r counts in training data occurs in test set of equal size）。
　ii. 我們使用刪除估計（We use deleted estimation）：
　　1. 每次刪除一個單詞（delete one word at a time）
　　2. 如果單詞“test”在所有的資料集中出現了r+1次（if “test” word occurs r +1 times in complete data set）：
　　——它在訓練集中出現了r 次（it occurs r times in “training” set）
　　——對計數為r的單詞加1（add one count to words with r counts）
　iii. r-count單詞“桶”中的總的計數為（total count placed to bucket for r-count words is）:
　　　　　　　　　n_{r+1}*(r +1)
　iv. 平均計數為：
　　　　　　(avg-count of r count words) = {n_{r+1}*(r+1)}/n_r
d) Good-Turing打折法續（Good-Turing Discounting (cont.)）：
　i. 在Good-Turing中，分配給所有未知事件的總的概率等於n_1/N, 其中N是訓練集的規模。它與分配給獨立事件的相對頻率公式相似。
　ii. In Good-Turing, the total probability assigned to all the unobserved events is equal ton_1/N , where N is the size of the training set. It is the same as a relative frequency formula would assign to singleton events.
e) 舉例（Example: Good-Turing）
Training sample of 22,000,000 (Church&Gale’1991))
r 　　　N_r　　　　　　　heldout　　r^*
0 　　74,671,100,000　0.00027　0.00027
1 　　2,018,046　　　　0.448　　0.446
2 　　449,721　　　　　1.25　　　1.26
3 　　188,933　　　　　2.24　　　2.24
4 　　105,668　　　　　3.23　　　3.24
5 　　68,379　　　　　 4.21　　　4.22
6 　　48,190　　　　　 5.23　　　5.19
f) 補充說明：
　i. 根據Zipf定律,對於小的r, N_r比較大;對於大的r,N_r小,對於出現次數最多的n元組,r*=0!
　ii. 所以對於出現次數很多的n元組, GT估計不準,而MLE估計比較準,因此可以直接採用MLE. GT估計一般適用於出現次數為k(k<10)的n元組　iii. 如果這樣,考慮”劫富濟貧”,這裡的”富”就變成了”中產”階級!呵呵,真正的富翁沾光了!（雖然富翁損一點也沒什麼）連打折法也不敢欺富人！這就是“為富不仁”，“一毛不拔”的來歷。未完待續：

第六部分

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

MIT自然語言處理第三講：概率語言模型（第六部分）

六、插值及回退
a) The Bias-Variance Trade-Off
　i. 未平滑的三元模型估計(Unsmoothed trigram estimate)：　　　　　　
　　P_ML({w_i}/{w_{i-2},w_{i-1}})={Count(w_{i-2}w_{i-1}w_{i})}/{Count(w_{i-2},w_{i-1})}
　ii. 未平滑的二元模型估計(Unsmoothed bigram estimate）：
　　　P_ML({w_i}/{w_{i-1}})={Count(w_{i-1}w_{i})}/{Count(w_{i-1})}
　iii. 未平滑的一元模型估計(Unsmoothed unigram estimate)：
　　　P_ML({w_i})={Count(w_{i})}/sum{j}{}{Count(w_{j})}
　iv. 這些不同的估計中哪個和“真實”的P({w_i}/{w_{i-2},w_{i-1}})概率最接近（How close are these different estimates to the “true” probability P({w_i}/{w_{i-2},w_{i-1}}))?
b) 插值（Interpolation）
　i. 一種解決三元模型資料稀疏問題的方法是在模型中混合使用受資料稀疏影響較小的二元模型和一元模型（One way of solving the sparseness in a trigram model is to mix that model with bigram and unigram models that suffer less from data sparseness）。
　ii. 權值可以使用期望最大化演算法（EM）或其它數值優化技術設定（The weights can be set using the Expectation-Maximization Algorithm or another numerical optimization technique）
　iii. 線性插值（Linear Interpolation)
　　hat{P}({w_i}/{w_{i-2},w_{i-1}})={lambda_1}*P_ML({w_i}/{w_{i-2},w_{i-1}})
　　+{lambda_2}*P_ML({w_i}/w_{i-1})+{lambda_3}*P_ML({w_i})
　　這裡{lambda_1}+{lambda_2}+{lambda_3}=1並且{lambda_i}>=0 對於所有的 i
　iv. 引數估計（Parameter Estimation）
　　1. 取出訓練集的一部分作為“驗證”資料（Hold out part of training set as “validation” data）
　　2. 定義Count_2(w_1,w_2,w_3)作為驗證集中三元集 w_1,w_2,w_3 的出現次數（Define Count_2(w_1,w_2,w_3) to be the number of times the trigram w_1,w_2,w_3 is seen in validation set）
　　3. 選擇{lambda_i}去最大化(Choose {lambda_i} to maximize):
L({lambda_1},{lambda_2},{lambda_3})=sum{(w_1,w_2,w_3)in{upsilon}}{}{Count_2(w_1,w_2,w_3)}log{hat{P}}({w_3}/{w_2,w_1})
　　這裡{lambda_1}+{lambda_2}+{lambda_3}=1並且{lambda_i}>=0 對於所有的 i
　　注：關於引數估計的其他內容，由於公式太多，這裡略，請參考原始課件
c)Kats回退模型-兩元（Katz Back-Off Models (Bigrams)）：
　i. 定義兩個集合（Define two sets）：
　　A(w_{i-1})=delim{lbrace}{w:Count(w_{i-1},w)>0}{rbrace}
　　
　　B(w_{i-1})=delim{lbrace}{w:Count(w_{i-1},w)=0}{rbrace}
　ii. 一種兩元模型（A bigram model）：
P_K({w_i}/w_{i-1})=delim{lbrace}{matrix{2}{2}{{{Count^{*}(w_{i-1},w)}/{Count(w_{i-1})}>0} {if{w_i}{in}{A(w_{i-1})}} {alpha(w_{i-1}){{P_ML(w_{i})}/sum{w{in}B(w_{i-1})}{}{P_ML(w)}} } {if{w_i}{in}{B(w_{i-1})}} }}{}
{alpha(w_{i-1})=1-sum{w{in}A(w_{i-1})}{}{{Count^{*}(w_{i-1},w)}/{Count(w_{i-1})}}}
　iii. Count^*定義（Count^*definitions）
　　1. Kats對於Count(x)<5使用Good-Turing方法,對於Count(x)>=5令Count^*(x)=Count(x)(Katz uses Good-Turing method for Count(x)< 5, and Count^*(x)=Count(x)for Count(x)>=5)
　　2. “Kneser-Ney”方法（“Kneser-Ney” method）：
　　　Count^*(x)=Count(x)-D,其中 D={n_1}/{n_1+n_2}
　　　n_1是頻率為1的元素個數（n_1 is a number of elements with frequency 1)
　　　n_2是頻率為2的元素個數（n_2 is a number of elements with frequency 2)

七、綜述
a) N元模型的弱點（Weaknesses of n-gram Models）
　i. 有何想法（Any ideas）?
　　短距離（Short-range）
　　中距離（Mid-range）
　　長距離（Long-range）
b) 更精確的模型（More Refined Models）
　i. 基於類的模型（Class-based models）
　ii. 結構化模型（Structural models）
　iii. 主題和長距離模型（Topical and long-range models）
c) 總結（Summary）
　i. 從一個詞表開始（Start with a vocabulary）
　ii. 選擇一種模型（Select type of model）
　iii. 引數估計（Estimate Parameters）
d) 工具包參考：
　i. CMU-Cambridge language modeling toolkit:
　　http://mi.eng.cam.ac.uk/~prc14/toolkit.html
　ii.SRILM – The SRI Language Modeling Toolkit:
　　http://www.speech.sri.com/projects/srilm/

第三講結束！
第四講：標註

附：課程及課件pdf下載MIT英文網頁地址：
　　　http://people.csail.mit.edu/regina/6881/

MIT自然語言處理第三講：概率語言模型（第四、五、六部分）

MIT自然語言處理第三講：概率語言模型（第四部分）

MIT自然語言處理第三講：概率語言模型（第五部分）

MIT自然語言處理第三講：概率語言模型（第六部分）

MIT自然語言處理第三講：概率語言模型（第四、五、六部分）

自然語言處理(NLP) 三：詞袋模型 + 文字分類

MIT 線性代數導論第三講：矩陣乘法與逆矩陣

自然語言處理(NLP) 一：分詞、分句、詞幹提取

自然語言處理學習6：nltk詞性標註

統計自然語言處理梳理一：分詞、命名實體識別、詞性標註

自然語言處理學習3：中文分句re.split()，jieba分詞和詞頻統計FreqDist

【蟲師講Selenium+Python】第三講：操作測試對象

學習筆記-小甲魚Python3學習第三講：小插曲之變量和字符串

第三講：感知機+SVM+LR（上）

三週學會小程式第三講：服務端搭建和部署

斯坦福CS224N_自然語言處理NLP深度學習DL課程筆記（一）

第三講：3.1 spring 如何裝配bean

第三講： 3.2 spring 依賴注入

深度學習與自然語言處理(7)_斯坦福cs224d 語言模型，RNN，LSTM與GRU

第三講微控制器C語言之12864液晶顯示

Kotlin 第三講：實戰第一課-變數的宣告

有限擴散集團凝聚模型（DLCA）第三講：流動水流中的凝聚模擬(1)

Java核心技術36講第三講：final、finally、finalize的不同

第三講：建立雲端的SQL資料庫（學習筆記)

MIT自然語言處理第三講：概率語言模型（第四、五、六部分）

MIT自然語言處理第三講：概率語言模型（第四部分）

MIT自然語言處理第三講：概率語言模型（第五部分）

MIT自然語言處理第三講：概率語言模型（第六部分）

相關推薦