
[Andrew Ng's Team, Natural Language Processing, Course 1_1] Classification: Logistic Regression and Naive Bayes

Supervised Learning and Sentiment Analysis

Supervised ML (training)

V-dimensional features

A word gets 1 if it appears in the tweet and 0 otherwise, giving a V-dimensional vector.

Counting word frequencies

A corpus containing four tweets:

I am happy because I am learning NLP
I am happy
I am sad, I am not learning NLP
I am sad

The resulting vocabulary:

I, am, happy, because, learning, NLP, sad, not

Existing class labels

Positive tweets:
I am happy because I am learning NLP
I am happy

Negative tweets:
I am sad, I am not learning NLP
I am sad

Counting

freq: dictionary mapping from (word,class) to frequency

vocabulary PosFreq(1) NegFreq(0)
I 3 3
am 3 3
happy 2 0
because 1 0
learning 1 1
NLP 1 1
sad 0 2
not 0 1

Feature extraction yields a vector

For example, for the tweet "I am sad, I am not learning NLP", only these rows are relevant:

vocabulary PosFreq(1) NegFreq(0)
I 3 3
am 3 3
learning 1 1
NLP 1 1
sad 0 2
not 0 1

Computation

\[\sum_{w}freqs(w,1)=3+3+1+1+0+0=8 \]\[\sum_{w}freqs(w,0)=3+3+1+1+2+1=11 \]\[X_m=[1,8,11] \]

Preprocessing

Stop words and punctuation

Stop words: and, is, are, at, has, for, a
Punctuation: , . ; ! " '

@YMourri and @AndrewYNg are tuning a GREAT AI model at https://deeplearning.ai!!!

After removing stop words: @YMourri @AndrewYNg tuning GREAT AI model https://deeplearning.ai!!!

After removing punctuation: @YMourri @AndrewYNg tuning GREAT AI model https://deeplearning.ai

Handles and urls

After removing handles and URLs: tuning GREAT AI model

stemming and lowercasing

Stemming: removing prefixes and suffixes from a word to obtain its root.

Preprocessed tweet

[tun,great,ai,model]

Code

import numpy as np

# build the (word, class) -> frequency dictionary
freqs = build_freqs(tweets, labels)

# initialize the feature matrix X: one row [bias, pos, neg] per tweet
m = len(tweets)
X = np.zeros((m, 3))
for i in range(m):                                # for every tweet
    p_tweet = process_tweet(tweets[i])            # preprocess the tweet
    X[i, :] = extract_features(p_tweet, freqs)    # extract features
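For reference, a minimal sketch of what extract_features might look like, assuming freqs maps (word, label) pairs to counts as built above (the body is illustrative, not the course's exact assignment code):

import numpy as np

def extract_features(processed_tweet, freqs):
    # returns [bias, sum of positive-class counts, sum of negative-class counts]
    x = np.zeros(3)
    x[0] = 1  # bias term
    for word in processed_tweet:
        x[1] += freqs.get((word, 1), 0)   # positive-class frequency of the word
        x[2] += freqs.get((word, 0), 0)   # negative-class frequency of the word
    return x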

Logistic Regression

Formula

On the sigmoid plot, points toward the lower left are predicted negative; points toward the upper right are predicted positive.
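The prediction function itself is the sigmoid of the dot product of features and parameters (the formula is implied by the cost function below):

\[h(x^{(i)},\theta)=\frac{1}{1+e^{-\theta^{T}x^{(i)}}} \]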

@YMourri and @AndrewYNg are tuning a GREAT AI model

After removing punctuation and stop words, convert the remaining words to stems:

[tun,ai,great,model]

LR

Gradient descent
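A minimal sketch of batch gradient descent for this model, using the 3-dimensional features from above (illustrative, not the course's exact implementation):

import numpy as np

def sigmoid(z):
    # element-wise logistic function
    return 1 / (1 + np.exp(-z))

def gradient_descent(X, y, theta, alpha, num_iters):
    # X: (m, 3) features, y: (m, 1) labels, theta: (3, 1) weights, alpha: learning rate
    m = X.shape[0]
    for _ in range(num_iters):
        h = sigmoid(X @ theta)           # predictions, shape (m, 1)
        grad = X.T @ (h - y) / m         # gradient of the cost J(theta)
        theta = theta - alpha * grad     # update step
    return theta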

Testing

\[X_{val},\ Y_{val},\ \theta \]\[pred=h(X_{val},\theta)\ge 0.5 \]

This yields the prediction vector above; the validation set is then used to compute

\[\frac{1}{m}\sum_{i=1}^{m}(pred^{(i)}==y^{(i)}_{val}) \]

Compare each prediction with the corresponding validation label: matching entries count as 1, otherwise 0. For example:

\[Y_{val}=\left[\begin{matrix}0\\1\\1\\0\\1\end{matrix}\right] pred=\left[\begin{matrix}0\\1\\0\\0\\1\end{matrix}\right] (Y_{val}==pred)=\left[\begin{matrix}1\\1\\0\\1\\1\end{matrix}\right] \]

Computing the accuracy:

\[accuracy=\frac{4}{5}=0.8 \]
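The same computation in numpy, using the example vectors above:

import numpy as np

y_val = np.array([0, 1, 1, 0, 1])
pred = np.array([0, 1, 0, 0, 1])

accuracy = np.mean(pred == y_val)   # (1 + 1 + 0 + 1 + 1) / 5
print(accuracy)                     # 0.8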

Cost function

\[J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}[y^{(i)}logh(x^{(i)},\theta)+(1-y^{(i)})log(1-h(x^{(i)},\theta))] \]

m is the number of training examples; the leading minus sign makes the result positive.
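A minimal numpy sketch of this cost, assuming h is the sigmoid of X·θ and y holds the 0/1 labels (illustrative, not the course's exact code):

import numpy as np

def compute_cost(X, y, theta):
    # binary cross-entropy cost J(theta) for logistic regression
    h = 1 / (1 + np.exp(-(X @ theta)))                     # predictions h(x, theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))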

When the label is 1, the relevant term is

\[y^{(i)}logh(x^{(i)},\theta) \]
y^(i)   h(x^(i),θ)   y^(i)·log h(x^(i),θ)
0       any          0
1       0.99         ~0 (approximately 0)
1       ~0           -inf (negative infinity)

As the table shows, when the label is 1, predicting close to 1 gives a very small loss, while predicting close to 0 gives a very large loss.

When the label is 0, the relevant term is

\[(1-y^{(i)})log(1-h(x^{(i)},\theta)) \]
y^(i)   h(x^(i),θ)   (1-y^(i))·log(1-h(x^(i),θ))
1       any          0
0       0.01         ~0
0       ~1           -inf

Sentiment Analysis with Naive Bayes

Naive Bayes

Introduction

The number of tweets in a class divided by the total number of tweets in the corpus:

\[A\rightarrow \text{Positive tweet} \]\[P(A)=P(Positive)=N_{pos}/N=13/20=0.65 \]\[P(Negative)=1-P(Positive)=0.35 \]

Probabilities

Tweets containing "happy":

\[B\rightarrow \text{tweet contains "happy"} \]\[P(B)=P(happy)=N_{happy}/N=4/20=0.2 \]\[P(A\cap B)=P(A,B)=3/20=0.15 \]

Conditional Probabilities

\[P(A\cap B)=P(A|B)\,P(B) \]

P(A∩B) is the probability that A and B occur together; P(A|B) is the probability of A given that B occurs, so multiplying it by P(B) gives the probability that both occur. Equivalently, it is the chance that an element of one set also belongs to the other.

\[P(A|B)=P(Positive|"happy")=3/4=0.75 \]\[P(B|A)=P("happy"|Positive)=3/13\approx 0.231 \]\[P(Positive|"happy")=\frac{P(Positive\cap"happy")}{P("happy")} \]

Bayes' Rule

\[P(Positive|"happy")=\frac{P(Positive\cap"happy")}{P("happy")}\\ P("happy"|Positive)=\frac{P("happy"\cap Positive)}{P(Positive)} \]

\[P("happy"\cap Positive)和P(Positive\cap"happy")相等\\在等式中可以刪除 \]

\[P(Positive|"happy")=P("happy"|Positive)*\frac{P(Positive)}{P("happy")} \]

\[P(X|Y)=P(Y|X)*\frac{P(X)}{P(Y)} \]
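As a quick numeric check with the values computed above (P("happy"|Positive) ≈ 0.231, P(Positive) = 0.65, P("happy") = 0.2):

\[P(Positive|"happy")\approx 0.231\cdot\frac{0.65}{0.2}\approx 0.75 \]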

naive Bayes for sentiment analysis

"Naive" because the model assumes the features are independent, which in many cases they are not.

Step 1: frequency table

Positive tweets:

I am happy because I am learning NLP

I am happy, not sad

Negative tweets:

I am sad, I am not learning NLP

I am sad, not happy

Counting:

word PosFreq(1) NegFreq(0)
I 3 3
am 3 3
happy 2 1
because 1 0
learning 1 1
NLP 1 1
sad 1 2
not 1 2
N_class 13 12

Step 2: probability table

word Pos Neg
I 0.24 0.25
am 0.24 0.25
happy 0.15 0.08
because 0.08 0
learning 0.08 0.08
NLP 0.08 0.08
sad 0.08 0.17
not 0.08 0.17
sum 1 1

Words such as I, am, learning, whose positive and negative probabilities are nearly equal, are neutral words, while happy is a power word. The Neg probability of because is 0, which causes problems in the computation; to avoid this, we smooth the probability function.

word Pos Neg
I 0.20 0.20
am 0.20 0.20
happy 0.14 0.10
because 0.10 0.05
learning 0.10 0.10
NLP 0.10 0.10
sad 0.10 0.15
not 0.10 0.15

naive Bayes inference condition rule for binary classification

Tweet:

I am happy today; I am learning.

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)} \]

Multiply the ratios of the tweet's words one by one; "today" is not in the vocabulary, so it is skipped:

\[\frac{0.20}{0.20}\cdot\frac{0.20}{0.20}\cdot\frac{0.14}{0.10}\cdot\frac{0.20}{0.20}\cdot\frac{0.20}{0.20}\cdot\frac{0.10}{0.10} \]

Dropping the neutral ratios such as 0.20/0.20 leaves

\[\frac{0.14}{0.10}=1.4>1 \]

so the tweet is classified as positive.

Laplacian Smoothing

Avoids probabilities of zero.

\[P(w_i|class)=\frac{freq(w_i,class)}{N_{class}}\\ class \in \{Positive,Negative\}\\ P(w_i|class)=\frac{freq(w_i,class)+1}{N_{class}+V_{class}}\\ N_{class}=frequency\ of\ all\ words\ in\ class\\ V_{class}=number\ of\ unique\ words\ in\ class \]

The +1 in the numerator prevents zero probabilities; to keep the smoothed probabilities normalized, the denominator adds V, the number of unique words in the vocabulary.
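As a small illustration (the helper name is hypothetical; freqs maps (word, label) pairs to counts as before):

def smoothed_prob(word, label, freqs, n_class, v):
    # P(word | class) with add-one (Laplacian) smoothing
    # n_class: total word count in the class, v: number of unique words in the vocabulary
    return (freqs.get((word, label), 0) + 1) / (n_class + v)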

Rounding gives the Pos and Neg columns below; next we use the ratio

\[\begin{align}ratio(w_i)&=\frac{P(w_i|Pos)}{P(w_i|Neg)} \\&\approx\frac{freq(w_i,1)+1}{freq(w_i,0)+1} \end{align} \]
word Pos Neg ratio
I 0.19 0.20 1
am 0.19 0.20 1
happy 0.14 0.10 1.4
because 0.10 0.05 2
learning 0.10 0.10 1
NLP 0.10 0.10 1
sad 0.10 0.15 0.6
not 0.10 0.15 0.6
sum 1 1

Positive words have a ratio greater than 1 (the larger, the more positive); negative words have a ratio below 1 (the closer to 0, the more negative).

\[class\in \{pos,neg\},\quad w\rightarrow \text{set of }m\text{ words in a tweet} \]

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1\ \ (likelihood) \]

If the product is greater than 1 the tweet is positive; if it is less than 1, the tweet is negative. This product is the likelihood. Multiplying by the ratio of the class priors gives the full decision rule:

\[\frac{P(pos)}{P(neg)}\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1 \]

where P(pos)/P(neg) is the prior probability ratio.

The prior is important for unbalanced datasets.

Log likelihood

Multiplying many small numbers risks underflow: the product becomes too small to store.

The standard trick is to take the logarithm first:

\[\log(a\cdot b)=\log(a)+\log(b) \]\[\log\left(\frac{P(pos)}{P(neg)}\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}\right)=\log\frac{P(pos)}{P(neg)}+\sum_{i=1}^m\log\frac{P(w_i|pos)}{P(w_i|neg)} \]

log prior + log likelihood

Calculating Lambda

Lambda is the log of the ratio:

\[\lambda(w)=\log\frac{P(w|pos)}{P(w|neg)} \]\[\lambda(I)=\log\frac{0.05}{0.05}=\log(1)=0 \]

Summing the λ of each word in the document below gives:

doc: I am happy because I am learning.

log likelihood=0+0+2.2+0+0+0+1.1=3.3

In the original product form the decision threshold is 1:

\[\prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)}>1 \]

In log space the threshold becomes 0:

\[\sum_{i=1}^m\log\frac{P(w_i|pos)}{P(w_i|neg)}>0 \]

Since 3.3 > 0, the tweet is classified as positive.

summary

\[\log\prod_{i=1}^m ratio(w_i)=\sum_{i=1}^m\lambda(w_i)>0 \]

(the log likelihood)

naive Bayes model

step0: collect and annotate corpus

step1: preprocess

  • lowercase

  • remove punctuation, urls, names

  • remove stop words

  • stemming

  • tokenize sentences

step2: word count

step3: P(w|class)

\[V_{class}=6 \\\frac{freq(w,class)+1}{N_{class}+V_{class}} \]

step4: get lambda

step5: get the log prior

\[D_{pos}=\text{number of positive tweets}\\ D_{neg}=\text{number of negative tweets}\\ logprior=\log\frac{D_{pos}}{D_{neg}}\\ \text{if the dataset is balanced, }D_{pos}=D_{neg}\text{ and }logprior=0 \]
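A compact sketch of steps 2 through 5, assuming freqs comes from build_freqs above and labels is the list of 0/1 tweet labels (the function name and structure are illustrative, not the course's exact assignment code; V is taken as the size of the shared vocabulary, as described in the smoothing section):

import numpy as np

def train_naive_bayes(freqs, labels):
    # step 2/3: word counts and smoothed conditional probabilities
    vocab = {word for (word, _) in freqs}
    V = len(vocab)                                               # unique words in the vocabulary
    n_pos = sum(c for (w, lab), c in freqs.items() if lab == 1)  # total word count in positive class
    n_neg = sum(c for (w, lab), c in freqs.items() if lab == 0)  # total word count in negative class

    # step 4: lambda(w) = log( P(w|pos) / P(w|neg) ) with Laplacian smoothing
    loglikelihood = {}
    for word in vocab:
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + V)
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + V)
        loglikelihood[word] = np.log(p_pos / p_neg)

    # step 5: log prior = log(D_pos / D_neg)
    d_pos = sum(1 for y in labels if y == 1)
    d_neg = sum(1 for y in labels if y == 0)
    logprior = np.log(d_pos / d_neg)

    return logprior, loglikelihood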

summary

  • get or annotate a dataset with positive and negative tweets

  • preprocess the tweets: process_tweet(tweet)->[w1,w2,w3,...]

  • compute freq(w,class)

  • get P(w|pos),P(w|neg)

  • get lambda(w)

  • compute logprior=log(P(pos)/P(neg))

Test naive Bayes

  • predict using naive bayes model

  • using your validation set to compute model accuracy

  • log-likelihood dictionary

    \[\lambda(w)=log\frac{P(w|pos)}{P(w|neg)} \]
  • \[logprior=log\frac{D_{pos}}{D_{neg}}=0 \]
  • tweet: [I,pass,the,NLP,interview]

    Sum the scores word by word; words not in the table are neutral and contribute nothing; add the logprior to account for the balance of the dataset.

    score = -0.01 + 0.5 - 0.01 + 0 + logprior = 0.48

    pred = score > 0, so the tweet is positive

  • \[X_{val},\ Y_{val},\ \lambda,\ logprior\\ score=predict(X_{val},\lambda,logprior)\\ pred=score>0\\ \left[\begin{matrix}0.5\\-1\\1.3\\...\\score_m\end{matrix}\right]>0 =\left[\begin{matrix}0.5>0\\-1>0\\1.3>0\\...\\score_m>0\end{matrix}\right] =\left[\begin{matrix}1\\0\\1\\...\\pred_m\end{matrix}\right] \]

First compute a score for each tweet in X_val, then check whether each score is greater than 0 to obtain the pred vector: 1 means positive, 0 means negative.

\[accuracy=\frac{1}{m}\sum_{i=1}^{m}(pred_i==Y_{val_i}) \]
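A minimal sketch of prediction and accuracy evaluation, assuming logprior and the loglikelihood (λ) dictionary come from training and process_tweet is the same preprocessing helper used earlier (names are illustrative):

import numpy as np

def naive_bayes_predict(tweet_words, logprior, loglikelihood):
    # unseen words are neutral: they contribute 0 to the score
    return logprior + sum(loglikelihood.get(w, 0) for w in tweet_words)

def test_naive_bayes(val_tweets, y_val, logprior, loglikelihood):
    scores = np.array([naive_bayes_predict(process_tweet(t), logprior, loglikelihood)
                       for t in val_tweets])
    pred = (scores > 0).astype(int)   # 1 = positive, 0 = negative
    return np.mean(pred == y_val)     # accuracy on the validation set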

summary

  • X_val, Y_val → performance on unseen data
  • Predict using λ and the logprior for each new tweet
  • \[Accuracy=\frac{1}{m}\sum_{i=1}^m(pred_i==Y_{val_i}) \]
  • What about words that do not appear in λ(w)? They contribute nothing to the score (treated as neutral).

Applications of naive Bayes

\[P(pos|tweet)\approx P(pos)P(tweet|pos)\\ P(neg|tweet)\approx P(neg)P(tweet|neg)\\ \frac{P(pos|tweet)}{P(neg|tweet)}=\frac{P(pos)}{P(neg)} \prod_{i=1}^m\frac{P(w_i|pos)}{P(w_i|neg)} \]

Applications:

  • Author identification

    \[\frac{P(Shakespeare|book)}{P(Hemingway|book)} \]
  • 垃圾郵件過濾

    \[\frac{P(spam|email)}{P(nonspam|email)} \]
  • Information retrieval

    \[P(document_k|query)\varpropto \prod_{i=0}^{|query|}P(query_i|document_k)\\ Retrieve\ document\ if\ P(document_k|query)>threshold \]

    One of the earliest applications: retrieving relevant versus irrelevant documents from a database.

  • Word sense disambiguation

Bank: a river bank or a financial bank

\[ \frac{P(river|text)}{P(money|text)} \]

Independence

Naive Bayes assumes independence between the predictors or features.

It is sunny and hot in the Sahara desert.

The model assumes the words in a text are independent, but this is usually not the case: sunny and hot often appear together, which can lead to under- or over-estimating the conditional probabilities of individual words.

It's always cold and snowy in _

spring?summer?fall?winter?

Naive Bayes assigns them equal probability, but the context calls for winter.

Relative frequency in corpus

The model depends on the class distribution of the training dataset. In practice, positive tweets are sent more frequently than negative ones.

Error analysis

  • Removing punctuation and stop words: preprocessing can discard semantic information

  • Word order: the order of the words changes the meaning of a sentence

  • Adversarial attacks: human language has quirks such as sarcasm, irony, and euphemism

Processing as a Source of errors: Punctuation

  • Removing punctuation

    Tweet: My beloved grandmother :(

    After removing the ":(" the sad sentiment is lost:

    processed_tweet: [belov,grandmoth]

  • Removing stop words

    Tweet: This is not good, because your attitude is not even close to being nice.

    processed_tweet: [good,attitude,close,nice]

  • Word order

    tweet:I am happy because I do not go.

    tweet:I am not happy because I did go.

    The word "not" is ignored by the naive Bayes classifier, so these two opposite tweets look alike to it.

  • Adversarial attacks

    Adversarial attacks: sarcasm, irony, and euphemisms.

    tweet:This is a ridiculously powerful movie. The plot was gripping and I cried through until the ending!

    processed_tweet: [ridicul,power,movi,plot,grip,cry,end]

    A positive tweet ends up, after processing, with a list of words that look negative.