NLP: Natural Language Processing
阿新 • Published: 2020-11-18
Honestly, this class left me completely lost, with my head full of question marks.
from collections import Counter

counttime = 0  # flag so that the summary below is printed only once


# Gather statistics from the training corpus: the number of spam and ham
# reviews, and the word frequencies in each class.
def seperate(filename):
    hamcnt = Counter()   # word counts over ham reviews
    spamcnt = Counter()  # word counts over spam reviews
    totalNum = 0  # total number of reviews
    hamNum = 0    # number of ham reviews
    spamNum = 0   # number of spam reviews
    global counttime  # !!!
    i = 0
    with open(filename, encoding='gb18030', errors='ignore') as file:
        for line in file:  # process line by line
            i = i + 1
            # One review per line; the first token is the ham/spam label.
            # split() with no arguments splits on any whitespace (spaces, \n, \t, ...);
            # maxsplit defaults to -1, i.e. split at every occurrence.
            new = line.split()
            if not new:  # skip blank lines, which would otherwise raise IndexError
                continue
            totalNum = totalNum + 1
            if new[0] == '1':  # ham: a good review
                hamNum = hamNum + 1
                for word in new[1:]:
                    hamcnt[word] += 1  # word frequency in ham
            if new[0] == '0':  # spam: a bad review
                spamNum = spamNum + 1
                for word in new[1:]:
                    spamcnt[word] += 1  # word frequency in spam
    if counttime == 0:
        print('Total lines in the training set: %s' % i)
        print('ham samples: %s' % hamNum)
        print('spam samples: %s' % spamNum)
        counttime += 1
    return hamcnt, spamcnt, totalNum, hamNum, spamNum


def train(filename, preData):
    # Build the spam/ham word-count dictionaries
    hamcnt, spamcnt, totalNum, hamNum, spamNum = seperate(filename)
    # Total number of words in each class
    wordNumerOfham = sum(hamcnt.values())
    wordNumerOfspam = sum(spamcnt.values())
    # Prior probabilities P(spam) and P(ham)
    p1_spam = spamNum / totalNum  # spam prior
    p1_ham = hamNum / totalNum    # ham prior
    hamProbablity = 1
    spamProbability = 1
    # Conditional probability of each word in the test text
    newPreData = preData.split()
    for word in newPreData:
        # Add-one smoothing. A Counter returns 0 for an unseen word,
        # so no try/except is needed here.
        hamProbablity = hamProbablity * (hamcnt[word] + 1) / (wordNumerOfham + len(hamcnt))
    res1 = hamProbablity * p1_ham  # score for ham: prior * conditional
    for word in newPreData:
        spamProbability = spamProbability * (spamcnt[word] + 1) / (wordNumerOfspam + len(spamcnt))
    res2 = spamProbability * p1_spam  # score for spam: prior * conditional
    if res1 == res2:
        print('res1', res1, 'res2', res2)
    if res1 > res2:
        print('Good review!', 'ham probability:', res1, ' spam probability:', res2)
        return 0
    else:
        print('Bad review!', 'ham probability:', res1, ' spam probability:', res2)
        return 1


filename = 'kaggle_training.txt'  # training corpus
for line in open("kaggle_test.txt", encoding='gb18030', errors='ignore'):
    # print(line)
    # line = line.encode()
    res = train(filename, line)
    # print("RES: " + str(res))
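The code above multiplies many small probabilities together and re-reads the training file for every test line. As a rough sketch only, the same add-one-smoothed decision can be made by counting once and comparing log-probabilities, which avoids underflow on long reviews. The names `count_words` and `classify` and the toy data are made up for illustration and are not part of the course code; the sketch also assumes both classes appear in the training data.

```python
import math
from collections import Counter


def count_words(labeled_lines):
    """Count words per class. Each line is '<label> <text>'; '1' = ham, '0' = spam."""
    counts = {'1': Counter(), '0': Counter()}
    docs = Counter()  # number of reviews per label
    for line in labeled_lines:
        tokens = line.split()
        if not tokens or tokens[0] not in counts:
            continue  # skip blank or unlabeled lines
        docs[tokens[0]] += 1
        counts[tokens[0]].update(tokens[1:])
    return counts, docs


def classify(text, counts, docs):
    """Return (best_label, scores), where scores holds log(prior * conditional)."""
    total_docs = sum(docs.values())
    scores = {}
    for label, cnt in counts.items():
        # Same denominator as the code above: class word total + class vocabulary size
        denom = sum(cnt.values()) + len(cnt)
        score = math.log(docs[label] / total_docs)  # log prior (assumes docs[label] > 0)
        for word in text.split():
            score += math.log((cnt[word] + 1) / denom)  # add-one smoothing
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy data, invented for this example
train_lines = ["1 good great film", "1 great acting", "0 bad boring film"]
counts, docs = count_words(train_lines)
label, scores = classify("great film", counts, docs)
print(label)  # 1 (ham) for this toy data
```

Note that the sketch keeps the original smoothing choice, which uses each class's own vocabulary size in the denominator; the textbook form of add-one smoothing uses the size of the combined vocabulary instead.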
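Everything else aside:
- How to make part of the code run only once: declare counttime once outside the function, then mark it with global inside the function. https://www.cnblogs.com/fendou-999/p/3822028.html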
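- byte string: https://blog.csdn.net/lqzdreamer/article/details/76549256
- split: how this spot differs from the code we were given: https://blog.csdn.net/weixin_40283816/article/details/83591582 (I forget what I changed; in the end I saw that the code above already has the corresponding handling: encoding='gb18030', errors='ignore')
- I wanted the printed output to have a fixed format, so: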