
[Andrew Ng's Team, Natural Language Processing, Course 1_1] Classification: Logistic Regression and Naive Bayes

Supervised Learning and Sentiment Analysis

Supervised ML (training)

V-dimensional features

A word gets 1 if it appears in the tweet and 0 otherwise, giving a V-dimensional vector.

Counting word frequencies

A corpus containing four tweets:

I am happy because I am learning NLP
I am happy
I am sad, I am not learning NLP
I am sad

The resulting vocabulary:

I, am, happy, because, learning, NLP, sad, not

Existing class labels

Positive tweets:
I am happy because I am learning NLP
I am happy

Negative tweets:
I am sad, I am not learning NLP
I am sad

Counting

freq: dictionary mapping from (word,class) to frequency

vocabulary PosFreq(1) NegFreq(0)
I 3 3
am 3 3
happy 2 0
because 1 0
learning 1 1
NLP 1 1
sad 0 2
not 0 1

Feature extraction yields a vector

For example, for the tweet "I am sad, I am not learning NLP", only these rows are relevant:

vocabulary PosFreq(1) NegFreq(0)
I 3 3
am 3 3
learning 1 1
NLP 1 1
sad 0 2
not 0 1

Computation

\[\sum_{w}freqs(w,1)=3+3+1+1+0+0=8 \]\[\sum_{w}freqs(w,0)=3+3+1+1+2+1=11 \]\[X_m=[1,8,11] \]

Preprocessing

Stop words and punctuation

Stop words: and, is, are, at, has, for, a
Punctuation: , . ; ! " '

@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!

After removing stop words: @YMourri @AndrewYNg tuning GREAT AI model https://deeplearning.ai!!!

After removing punctuation: @YMourri @AndrewYNg tuning GREAT AI model https://deeplearning.ai

Handles and urls

After removing handles and URLs: tuning GREAT AI model

stemming and lowercasing

Stemming: removing prefixes and suffixes from a word to obtain its root.

Preprocessed tweet

[tun,great,ai,model]

Code

import numpy as np

# build the (word, class) -> frequency dictionary
freqs = build_freqs(tweets, labels)

# initialize the feature matrix X: one row [bias, pos, neg] per tweet
m = len(tweets)
X = np.zeros((m, 3))
for i in range(m):                                # for every tweet
    p_tweet = process_tweet(tweets[i])            # preprocess the tweet
    X[i, :] = extract_features(p_tweet, freqs)    # extract features
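For reference, a minimal sketch of what extract_features might look like, assuming freqs maps (word, label) pairs to counts as built above (the body is illustrative, not the course's exact assignment code):

import numpy as np

def extract_features(processed_tweet, freqs):
    # returns [bias, sum of positive-class counts, sum of negative-class counts]
    x = np.zeros(3)
    x[0] = 1  # bias term
    for word in processed_tweet:
        x[1] += freqs.get((word, 1), 0)   # positive-class frequency of the word
        x[2] += freqs.get((word, 0), 0)   # negative-class frequency of the word
    return x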

Logistic Regression

Formula

On the sigmoid plot, points toward the lower left are predicted negative; points toward the upper right are predicted positive.
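The prediction function itself is the sigmoid of the dot product of features and parameters (the formula is implied by the cost function below):

\[h(x^{(i)},\theta)=\frac{1}{1+e^{-\theta^{T}x^{(i)}}} \]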

@YMourri and @AndrewYNg are tuning a GREAT AI model

After removing punctuation and stop words, convert the remaining words to stems:

[tun,ai,great,model]

LR

Gradient descent
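A minimal sketch of batch gradient descent for this model, using the 3-dimensional features from above (illustrative, not the course's exact implementation):

import numpy as np

def sigmoid(z):
    # element-wise logistic function
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, theta, alpha, num_iters):
    # X: (m, 3) features, y: (m, 1) labels, theta: (3, 1) weights, alpha: learning rate
    m = X.shape[0]
    for _ in range(num_iters):
        h = sigmoid(X @ theta)           # predictions, shape (m, 1)
        grad = X.T @ (h - y) / m         # gradient of the cost J(theta)
        theta = theta - alpha * grad     # update step
    return theta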

Testing

\[X_{val},\ Y_{val},\ \theta \]\[pred=h(X_{val},\theta)\ge 0.5 \]

This yields the prediction vector above; the validation set is then used to compute

\[\frac{1}{m}\sum_{i=1}^{m}(pred^{(i)}==y^{(i)}_{val}) \]

Compare each prediction with the corresponding validation label: matching entries count as 1, otherwise 0. For example:

\[Y_{val}=\left[\begin{matrix}0\\1\\1\\0\\1\end{matrix}\right] pred=\left[\begin{matrix}0\\1\\0\\0\\1\end{matrix}\right] (Y_{val}==pred)=\left[\begin{matrix}1\\1\\0\\1\\1\end{matrix}\right] \]

Computing the accuracy:

\[accuracy=\frac{4}{5}=0.8 \]
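The same computation in numpy, using the example vectors above:

import numpy as np

y_val = np.array([0, 1, 1, 0, 1])
pred = np.array([0, 1, 0, 0, 1])

accuracy = np.mean(pred == y_val)   # (1 + 1 + 0 + 1 + 1) / 5
print(accuracy)                     # 0.8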

Cost function

\[J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}logh(x^{(i)},\theta)+(1-y^{(i)})log(1-h(x^{(i)},\theta))] \]

m is the number of training examples; the leading minus sign makes the result positive.
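A minimal numpy sketch of this cost, assuming h is the sigmoid of X·θ and y holds the 0/1 labels (illustrative, not the course's exact code):

import numpy as np

def compute_cost(X, y, theta):
    # binary cross-entropy cost J(theta) for logistic regression
    h = 1 / (1 + np.exp(-(X @ theta)))                     # predictions h(x, theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))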

When the label is 1, the relevant term is

\[y^{(i)}logh(x^{(i)},\theta) \]
y^(i)   h(x^(i),θ)   y^(i)·log h(x^(i),θ)
0       any          0
1       0.99         ~0 (approximately 0)
1       ~0           -inf (negative infinity)

As the table shows, when the label is 1, predicting close to 1 gives a very small loss, while predicting close to 0 gives a very large loss.

When the label is 0, the relevant term is

\[(1-y^{(i)})log(1-h(x^{(i)},\theta)) \]
y^(i)   h(x^(i),θ)   (1-y^(i))·log(1-h(x^(i),θ))
1       any          0
0       0.01         ~0
0       ~1           -inf

Sentiment Analysis with Naive Bayes

Naive Bayes

Introduction

The number of tweets in a class divided by the total number of tweets in the corpus:

\[A\rightarrow \text{Positive tweet} \]\[P(A)=P(Positive)=N_{pos}/N=13/20=0.65 \]\[P(Negative)=1-P(Positive)=0.35 \]

Probabilities

Tweets containing "happy":

\[B\rightarrow \text{tweet contains "happy"} \]\[P(B)=P(happy)=N_{happy}/N=4/20=0.2 \]\[P(A\cap B)=P(A,B)=3/20=0.15 \]

Conditional Probabilities

\[P(A\cap B)=P(A|B)\,P(B) \]

P(A∩B) is the probability that A and B occur together; P(A|B) is the probability of A given that B occurs, so multiplying it by P(B) gives the probability that both occur. Equivalently, it is the chance that an element of one set also belongs to the other.

\[P(A|B)=P(Positive|"happy")=3/4=0.75 \]\[P(B|A)=P("happy"|Positive)=3/13\approx 0.231 \]\[P(Positive|"happy")=\frac{P(Positive\cap"happy")}{P("happy")} \]

Bayes' Rule

\[P(Positive|"happy")=\frac{P(Positive\cap"happy")}{P("happy")}\\ P("happy"|Positive)=\frac{P("happy"\cap Positive)}{P(Positive)} \]

\[P("happy"\cap Positive)和P(Positive\cap"happy")相等\\在等式中可以刪除 \]

\[P(Positive|"happy")=P("happy"|Positive)*\frac{P(Positive)}{P("happy")} \]

\[P(X|Y)=P(Y|X)*\frac{P(X)}{P(Y)} \]
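As a quick numeric check with the values computed above (P("happy"|Positive) ≈ 0.231, P(Positive) = 0.65, P("happy") = 0.2):

\[P(Positive|"happy")\approx 0.231\cdot\frac{0.65}{0.2}\approx 0.75 \]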

naive Bayes for sentiment analysis

"Naive" because the model assumes the features are independent, which in many cases they are not.

Step 1: frequency table

Positive tweets:

I am happy because I am learning NLP

I am happy, not sad

Negative tweets:

I am sad, I am not learning NLP

I am sad, not happy

Counting:

word PosFreq(1) NegFreq(0)
I 3 3
am 3 3
happy 2 1
because 1 0
learning 1 1
NLP 1 1
sad 1 2
not 1 2
N_class 13 12

Step 2: probability table

word Pos Neg
I 0.24 0.25
am 0.24 0.25
happy 0.15 0.08
because 0.08 0
learning 0.08 0.08
NLP 0.08 0.08
sad 0.08 0.17
not 0.08 0.17
sum 1 1

Words such as I, am, learning, whose positive and negative probabilities are nearly equal, are neutral words, while happy is a power word. The Neg probability of because is 0, which causes problems in the computation; to avoid this, we smooth the probability function.

word Pos Neg
I 0.20 0.20
am 0.20 0.20
happy 0.14 0.10
because 0.10 0.05
learning 0.10 0.10
NLP 0.10 0.10
sad 0.10 0.15
not 0.10 0.15

naive Bayes inference condition rule for binary classification

Tweet:

I am happy today; I am learning.

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)} \]

Multiply the ratios of the tweet's words one by one; "today" is not in the vocabulary, so it is skipped:

\[\frac{0.20}{0.20}\cdot\frac{0.20}{0.20}\cdot\frac{0.14}{0.10}\cdot\frac{0.20}{0.20}\cdot\frac{0.20}{0.20}\cdot\frac{0.10}{0.10} \]

Dropping the neutral ratios such as 0.20/0.20 leaves

\[\frac{0.14}{0.10}=1.4>1 \]

so the tweet is classified as positive.

Laplacian Smoothing

Avoids probabilities of zero.

\[P(w_i|class)=\frac{freq(w_i,class)}{N_{class}}\\ class \in \{Positive,Negative\}\\ P(w_i|class)=\frac{freq(w_i,class)+1}{N_{class}+V_{class}}\\ N_{class}=frequency\ of\ all\ words\ in\ class\\ V_{class}=number\ of\ unique\ words\ in\ class \]

The +1 in the numerator prevents zero probabilities; to keep the smoothed probabilities normalized, the denominator adds V, the number of unique words in the vocabulary.
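As a small illustration (the helper name is hypothetical; freqs maps (word, label) pairs to counts as before):

def smoothed_prob(word, label, freqs, n_class, v):
    # P(word | class) with add-one (Laplacian) smoothing
    # n_class: total word count in the class, v: number of unique words in the vocabulary
    return (freqs.get((word, label), 0) + 1) / (n_class + v)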

Rounding gives the Pos and Neg columns below; next we use the ratio

\[\begin{align}ratio(w_i)&=\frac{P(w_i|Pos)}{P(w_i|Neg)} \\&\approx\frac{freq(w_i,1)+1}{freq(w_i,0)+1} \end{align} \]
word Pos Neg ratio
I 0.19 0.20 1
am 0.19 0.20 1
happy 0.14 0.10 1.4
because 0.10 0.05 2
learning 0.10 0.10 1
NLP 0.10 0.10 1
sad 0.10 0.15 0.6
not 0.10 0.15 0.6
sum 1 1

Positive words have a ratio greater than 1 (the larger, the more positive); negative words have a ratio below 1 (the closer to 0, the more negative).

\[class\in \{pos,neg\},\quad w\rightarrow \text{set of }m\text{ words in a tweet} \]

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1\ \ (likelihood) \]

If the product is greater than 1 the tweet is positive; if it is less than 1, the tweet is negative. This product is the likelihood. Multiplying by the ratio of the class priors gives the full decision rule:

\[\frac{P(pos)}{P(neg)}\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1 \]

where P(pos)/P(neg) is the prior probability ratio.

The prior is important for unbalanced datasets.

Log likelihood

Multiplying many small numbers risks underflow: the product becomes too small to store.

The standard trick is to take the logarithm first:

\[\log(a\cdot b)=\log(a)+\log(b) \]\[\log\left(\frac{P(pos)}{P(neg)}\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}\right)=\log\frac{P(pos)}{P(neg)}+\sum_{i=1}^m\log\frac{P(w_i|pos)}{P(w_i|neg)} \]

log prior + log likelihood

Calculating Lambda

Lambda is the log of the ratio:

\[\lambda(w)=\log\frac{P(w|pos)}{P(w|neg)} \]\[\lambda(I)=\log\frac{0.05}{0.05}=\log(1)=0 \]

Summing the λ of each word in the document below gives:

doc: I am happy because I am learning.

log likelihood=0+0+2.2+0+0+0+1.1=3.3

In the original product form the decision threshold is 1:

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1 \]

In log space the threshold becomes 0:

\[\sum_{i=1}^m\log\frac{P(w_i|pos)}{P(w_i|neg)}>0 \]

Since 3.3 > 0, the tweet is classified as positive.

summary

\[\log\prod_{i=1}^m ratio(w_i)=\sum_{i=1}^m\lambda(w_i)>0 \]

(the log likelihood)

naive Bayes model

step0: collect and annotate corpus

step1: preprocess

  • lowercase

  • remove punctuation, urls, names

  • remove stop words

  • stemming

  • tokenize sentences

step2: word count

step3: P(w|class)

\[V_{class}=6 \\\frac{freq(w,class)+1}{N_{class}+V_{class}} \]

step4: get lambda

step5: get the log prior

\[D_{pos}=\text{number of positive tweets}\\ D_{neg}=\text{number of negative tweets}\\ logprior=\log\frac{D_{pos}}{D_{neg}}\\ \text{if the dataset is balanced, }D_{pos}=D_{neg}\text{ and }logprior=0 \]
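A compact sketch of steps 2 through 5, assuming freqs comes from build_freqs above and labels is the list of 0/1 tweet labels (the function name and structure are illustrative, not the course's exact assignment code; V is taken as the size of the shared vocabulary, as described in the smoothing section):

import numpy as np

def train_naive_bayes(freqs, labels):
    # step 2/3: word counts and smoothed conditional probabilities
    vocab = {word for (word, _) in freqs}
    V = len(vocab)                                               # unique words in the vocabulary
    n_pos = sum(c for (w, lab), c in freqs.items() if lab == 1)  # total word count in positive class
    n_neg = sum(c for (w, lab), c in freqs.items() if lab == 0)  # total word count in negative class

    # step 4: lambda(w) = log( P(w|pos) / P(w|neg) ) with Laplacian smoothing
    loglikelihood = {}
    for word in vocab:
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        loglikelihood[word] = np.log(p_pos / p_neg)

    # step 5: log prior = log(D_pos / D_neg)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = np.log(d_pos / d_neg)

    return logprior, loglikelihood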

summary

  • get or annotate a dataset with positive and negative tweets

  • preprocess the tweets: process_tweet(tweet)->[w1,w2,w3,...]

  • compute freq(w,class)

  • get P(w|pos),P(w|neg)

  • get lambda(w)

  • compute logprior=log(P(pos)/P(neg))

Test naive Bayes

  • predict using naive bayes model

  • using your validation set to compute model accuracy

  • log-likelihood dictionary

    \[\lambda(w)=log\frac{P(w|pos)}{P(w|neg)} \]
  • \[logprior=log\frac{D_{pos}}{D_{neg}}=0 \]
  • tweet: [I,pass,the,NLP,interview]

    Sum the scores word by word; words not in the table are neutral and contribute nothing; add the logprior to account for the balance of the dataset.

    score = -0.01 + 0.5 - 0.01 + 0 + logprior = 0.48

    pred = score > 0, so the tweet is positive

  • \[X_{val},\ Y_{val},\ \lambda,\ logprior\\ score=predict(X_{val},\lambda,logprior)\\ pred=score>0\\ \left[\begin{matrix}0.5\\-1\\1.3\\...\\score_m\end{matrix}\right]>0 =\left[\begin{matrix}0.5>0\\-1>0\\1.3>0\\...\\score_m>0\end{matrix}\right] =\left[\begin{matrix}1\\0\\1\\...\\pred_m\end{matrix}\right] \]

First compute a score for each tweet in X_val, then check whether each score is greater than 0 to obtain the pred vector: 1 means positive, 0 means negative.

\[accuracy=\frac{1}{m}\sum_{i=1}^{m}(pred_i==Y_{val_i}) \]
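A minimal sketch of prediction and accuracy evaluation, assuming logprior and the loglikelihood (λ) dictionary come from training and process_tweet is the same preprocessing helper used earlier (names are illustrative):

import numpy as np

def naive_bayes_predict(tweet_words, logprior, loglikelihood):
    # unseen words are neutral: they contribute 0 to the score
    return logprior + sum(loglikelihood.get(w, 0) for w in tweet_words)

def test_naive_bayes(val_tweets, y_val, logprior, loglikelihood):
    scores = np.array([naive_bayes_predict(process_tweet(t), logprior, loglikelihood)
                       for t in val_tweets])
    pred = (scores > 0).astype(int)   # 1 = positive, 0 = negative
    return np.mean(pred == y_val)     # accuracy on the validation set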

summary

  • X_val, Y_val → performance on unseen data
  • Predict using λ and the logprior for each new tweet
  • \[Accuracy=\frac{1}{m}\sum_{i=1}^m(pred_i==Y_{val_i}) \]
  • What about words that do not appear in λ(w)? They contribute nothing to the score (treated as neutral).

Applications of naive Bayes

\[P(pos|tweet)\approx P(pos)P(tweet|pos)\\ P(neg|tweet)\approx P(neg)P(tweet|neg)\\ \frac{P(pos|tweet)}{P(neg|tweet)}=\frac{P(pos)}{P(neg)} \prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)} \]

Applications:

  • Author identification

    \[\frac{P(Shakespeare|book)}{P(Hemingway|book)} \]
  • 垃圾郵件過濾

    \[\frac{P(spam|email)}{P(nonspam|email)} \]
  • Information retrieval

    \[P(document_k|query)\varpropto \prod_{i=0}^{|query|}P(query_i|document_k)\\ Retrieve\ document\ if\ P(document_k|query)>threshold \]

    One of the earliest applications: retrieving relevant versus irrelevant documents from a database.

  • Word sense disambiguation

Bank: a river bank or a financial bank

\[ \frac{P(river|text)}{P(money|text)} \]

Independence

Naive Bayes assumes independence between the predictors or features.

It is sunny and hot in the Sahara desert.

The model assumes the words in a text are independent, but this is usually not the case: sunny and hot often appear together, which can lead to under- or over-estimating the conditional probabilities of individual words.

It's always cold and snowy in _

spring?summer?fall?winter?

Naive Bayes assigns them equal probability, but the context calls for winter.

Relative frequency in corpus

The model depends on the class distribution of the training dataset. In practice, positive tweets are sent more frequently than negative ones.

Error analysis

  • Removing punctuation and stop words: preprocessing can discard semantic information

  • Word order: the order of the words changes the meaning of a sentence

  • Adversarial attacks: human language has quirks such as sarcasm, irony, and euphemism

Processing as a Source of errors: Punctuation

  • Removing punctuation

    Tweet: My beloved grandmother :(

    After removing the ":(" the sad sentiment is lost:

    processed_tweet: [belov,grandmoth]

  • Removing stop words

    Tweet: This is not good, because your attitude is not even close to being nice.

    processed_tweet: [good,attitude,close,nice]

  • Word order

    tweet:I am happy because I do not go.

    tweet:I am not happy because I did go.

    The word "not" is ignored by the naive Bayes classifier, so these two opposite tweets look alike to it.

  • Adversarial attacks

    Adversarial attacks: sarcasm, irony, and euphemisms.

    tweet:This is a ridiculously powerful movie. The plot was gripping and I cried through until the ending!

    processed_tweet: [ridicul,power,movi,plot,grip,cry,end]

    A positive tweet ends up, after processing, with a list of words that look negative.