A Spelling Corrector in Only 21 Lines
I'm Buzhi, a lover of life, learning and work, and a humble master's student in Australia just getting started with machine learning. In this series I'll share fun, beginner-friendly Python applications, each under 100 lines of code, together with the algorithmic ideas behind them, so you can feel a little of the joy that programming brings.
A quick introduction to Python:
Here I'll bluntly borrow Liao's summary:
- Python is a scripting language
- Python programs are easy to read: easy to get started with, and easy to go deep in
- Python's philosophy is simplicity and elegance: write code that is easy to understand, and write as little of it as possible
- Python ships with a very complete standard library, plus a huge ecosystem of third-party packages, which greatly speeds up development
- Many large websites are built with Python, such as YouTube, Instagram, and Douban in China. Many big companies, including Google and Yahoo, and even NASA, use Python heavily.
- What Python is well suited for:
  - web applications, including websites and backend services;
  - everyday utilities, including the scripted tasks sysadmins need;
  - wrapping programs written in other languages to make them easier to use.
Recommended resources
This series won't walk through basic syntax in great detail, since there are already plenty of excellent resources; here are two I recommend.
There are also many excellent blog posts online that you can find and learn from on your own.
The 21-line spelling corrector
Now, let me introduce today's protagonist: the spelling corrector written by the great Peter Norvig. It showcases
- the concise elegance of Python: a complete, working corrector in 21 lines of code
- the timeless beauty of mathematics: Bayes' theorem (study your math well, folks, it really is important)
It is also a small, real application of machine learning.
```python
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```
That's the complete code; now let's see it in action:

```
>>> correction("englihs")
'english'
>>> correction("englsh")
'english'
>>> correction("engliish")
'english'
```
As you can see, the corrector handles transposed, missing and duplicated letters in a word.
Mathematical principle

Does the formula below look familiar?

P(A|B) = P(B|A) P(A) / P(B)

It is Bayes' theorem, straight from your probability course. Now let me rewrite it in our notation:

P(c|w) = P(w|c) P(c) / P(w)

Here is what the symbols mean:
- w stands for the original input word (w for "word"), and c stands for a corrected word (c for "candidate")
- P(c|w) is the probability that the intended word is c given that w was typed, which is exactly the result we want

The c that maximizes P(c|w) is the final correction:

c* = argmax_c P(c|w)

Considering that P(w) is the same for every candidate c, it can be dropped from the maximization, so the final objective is:

c* = argmax_c P(w|c) P(c)
Code walkthrough
```python
import re
from collections import Counter
```

These import the libraries we need:
- re: Python's regular-expression module
- collections.Counter: a dict-like class for counting hashable items
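As a tiny illustration of how Counter behaves (the sample words here are invented for the example):

```python
from collections import Counter

# Counter tallies hashable items; lookup works like a dict,
# except that a missing key returns 0 instead of raising KeyError.
tally = Counter(['spam', 'eggs', 'spam'])
print(tally['spam'])         # 2
print(tally['ham'])          # 0
print(tally.most_common(1))  # [('spam', 2)]
```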
Generating candidate words
```python
def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```
The edits1 function enumerates the common single-edit misspellings: it first splits the word at every position (splits), then generates every deletion (deletes), adjacent transposition (transposes), replacement (replaces) and insertion (inserts). Note that every result here is at edit distance 1, meaning a single swap of adjacent letters, a single missing letter, a single inserted letter, and so on. Finally, set() removes the duplicates.
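To get a feel for how big this candidate set is: for a word of length n there are n deletions, n-1 transpositions, 26n replacements and 26(n+1) insertions, i.e. 54n + 25 strings before deduplication. A quick check (edits1 copied from the listing above):

```python
def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

# 'somthing' has 8 letters: 54*8 + 25 = 457 raw strings, 442 after dedup.
print(len(edits1('somthing')))  # 442
```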
```python
def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```
edits2 uses two nested loops (written as a generator expression) to build, on top of edits1, the misspellings at edit distance 2.
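This is why edits2 is a generator rather than a list: the distance-2 set is huge. A quick check (edits1 copied from above); note that since replacing a letter with itself reproduces the word, the set also contains strings at distance 0 and 1:

```python
def edits1(word):
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # Lazily yields every string reachable by two applications of edits1.
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

two_away = set(edits2('somthing'))
print('something' in two_away)  # True: insert the missing 'e', then a no-op replace
print(len(two_away) > 10000)    # True: tens of thousands of strings
```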
```python
WORDS = Counter(words(open('big.txt').read()))
```

big.txt is a very large corpus; here we use Counter to tally the frequency of every word in it.
```
>>> len(WORDS)
32192
>>> sum(WORDS.values())
1115504
>>> WORDS.most_common(10)
[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681)]
```
A quick look at the statistics: the corpus contains 32,192 distinct words with 1,115,504 occurrences in total, and the most frequent word is the.
```python
def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)
```
The known function checks whether a generated string really is a word, simply by testing membership in WORDS.
```python
def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])
```
The candidates() function then produces the set of candidate corrections c, taking the first non-empty set in priority order: the word itself if it is known, otherwise known words one edit away, otherwise known words two edits away, and finally the word itself as a last resort.
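The `or` chain in candidates works because Python's `or` returns its first truthy operand, and an empty set is falsy, so closer edits always win. A minimal sketch with an invented word-frequency table (not the real big.txt counts):

```python
# Toy frequency table, invented for this example.
WORDS = {'spelling': 10, 'spell': 5, 'hello': 3}

def known(words):
    "The subset of `words` present in the toy table."
    return set(w for w in words if w in WORDS)

# The word itself is known, so the later fallbacks are never evaluated:
print(known(['spelling']) or known(['speling', 'spell']) or ['speling'])
# {'spelling'}

# Nothing known at the first level, so we fall through to the next set:
print(known(['xyz']) or known(['spell']) or ['xyz'])
# {'spell'}
```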
The language model
```python
def words(text): return re.findall(r'\w+', text.lower())
```
The words function uses a regular expression: re.findall() returns every substring of the text that matches the pattern. The pattern \w+ matches a maximal run of word characters (letters, digits and underscores), so it extracts all the words in the text, and text.lower() lowercases everything first so that "The" and "the" are counted together.
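For example (the sample sentence is my own):

```python
import re

def words(text):
    # \w+ grabs maximal runs of letters, digits and underscores;
    # lower() collapses case so 'The' and 'the' count as one word.
    return re.findall(r'\w+', text.lower())

print(words('Spell-checking, in 21 lines!'))
# ['spell', 'checking', 'in', '21', 'lines']
```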
```python
def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)
```
The correction() function then selects the correction c for which P() is largest.
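A minimal sketch of that selection step, with an invented frequency table standing in for the big.txt counts:

```python
from collections import Counter

# Toy frequency table, invented for this example.
WORDS = Counter({'the': 80000, 'to': 28000, 'tho': 10})
N = sum(WORDS.values())

def P(word):
    "Relative frequency of `word` in the toy corpus."
    return WORDS[word] / N

# max with key=P keeps the candidate with the highest corpus probability.
print(max({'the', 'to', 'tho'}, key=P))  # 'the'
```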
The error model
In fact, reading through the whole program reveals that the formula it actually implements is

c* = argmax_c P(c),  c ∈ candidates(w)

and it never multiplies in the error-model term P(w|c). Instead, the priority order inside candidates() (the word itself, then known words at edit distance 1, then at edit distance 2) acts as a crude stand-in for the error model: a candidate fewer edits away is always assumed to be more probable than one further away.
Performance evaluation
Forgive me, I have an exam tomorrow, so rather than writing my own test cases I'll just show the author's:
```python
def unit_tests():
    assert correction('speling') == 'spelling'              # insert
    assert correction('korrectud') == 'corrected'           # replace 2
    assert correction('bycycle') == 'bicycle'               # replace
    assert correction('inconvient') == 'inconvenient'       # insert 2
    assert correction('arrainged') == 'arranged'            # delete
    assert correction('peotry') == 'poetry'                 # transpose
    assert correction('peotryy') == 'poetry'                # transpose + delete
    assert correction('word') == 'word'                     # known
    assert correction('quintessential') == 'quintessential' # unknown
    assert words('This is a TEST.') == ['this', 'is', 'a', 'test']
    assert Counter(words('This is a test. 123; A TEST this is.')) == (
           Counter({'123': 1, 'a': 2, 'is': 2, 'test': 2, 'this': 2}))
    assert len(WORDS) == 32192
    assert sum(WORDS.values()) == 1115504
    assert WORDS.most_common(10) == [
        ('the', 79808), ('of', 40024), ('and', 38311), ('to', 28765),
        ('in', 22020), ('a', 21124), ('that', 12512), ('he', 12401),
        ('was', 11410), ('it', 10681)]
    assert WORDS['the'] == 79808
    assert P('quintessential') == 0
    assert 0.07 < P('the') < 0.08
    return 'unit_tests pass'

def spelltest(tests, verbose=False):
    "Run correction(wrong) on all (right, wrong) pairs; report results."
    import time
    start = time.perf_counter()  # time.clock() was removed in Python 3.8
    good, unknown = 0, 0
    n = len(tests)
    for right, wrong in tests:
        w = correction(wrong)
        good += (w == right)
        if w != right:
            unknown += (right not in WORDS)
            if verbose:
                print('correction({}) => {} ({}); expected {} ({})'
                      .format(wrong, w, WORDS[w], right, WORDS[right]))
    dt = time.perf_counter() - start
    print('{:.0%} of {} correct ({:.0%} unknown) at {:.0f} words per second '
          .format(good / n, n, unknown / n, n / dt))

def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into [('right', 'wrong1'), ('right', 'wrong2')] pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

print(unit_tests())
spelltest(Testset(open('spell-testset1.txt')))  # Development set
spelltest(Testset(open('spell-testset2.txt')))  # Final test set
```
```
unit_tests pass
75% of 270 correct at 41 words per second
68% of 400 correct at 35 words per second
```
For such a simply constructed little application, this is already quite efficient. I won't explore here why some corrections come out wrong; you can read about that directly here.
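As an aside, the Testset helper above parses the 'right: wrong1 wrong2' file format like this (the sample lines are invented to match that format, not taken from the real test files):

```python
def Testset(lines):
    "Parse 'right: wrong1 wrong2' lines into (right, wrong) pairs."
    return [(right, wrong)
            for (right, wrongs) in (line.split(':') for line in lines)
            for wrong in wrongs.split()]

sample = ['spelling: speling spelng', 'poetry: peotry']
print(Testset(sample))
# [('spelling', 'speling'), ('spelling', 'spelng'), ('poetry', 'peotry')]
```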
That's it for this post. I'm Buzhi, lover of life, learning and work, and next time I'll show you how to make cool, eye-catching word clouds with Python.