Extra 3. Affective Computing and Emotion Recognition
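I am reviewing NLP again and recording some of the material as extra notes. This section implements a simple affective computing and emotion recognition model.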
Example code
Dataset
The data used is ISEAR; a download link can be found by searching on GitHub.
joy,“On days when I feel close to my partner and other friends.
When I feel at peace with myself and also experience a close
contact with people whom I regard greatly.”,
serious illness, even death.”,
anger,“When I had been obviously unjustly treated and had no possibility
of elucidating this.”,
sadness,“When I think about the short time that we live and relate it to
the periods of my life when I think that I did not use this
disgust,“At a gathering I found myself involuntarily sitting next to two
people who expressed opinions that I considered very low and
discriminating.”,
shame,“When I realized that I was directing the feelings of discontent
with myself at my partner and this way was trying to put the blame
guilt,“I feel guilty when when I realize that I consider material things
more important than caring for my relatives. I feel very
self-centered.”,
joy,“After my girlfriend had taken her exam we went to her parent’s
place.”,
fear,“When, for the first time I realized the meaning of death.”,
The first column of the data is the emotion label, and the second column is the text.
Code
Reading the data
import pandas as pd
import numpy as np
# Standard way to read a CSV; the data file must be in the same folder as the code
data = pd.read_csv('ISEAR.csv', header=None)
data.head()  # show the first five rows
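Splitting into training and test data
from sklearn.model_selection import train_test_split
labels = data[0].values.tolist()  # first column: emotion labels
sents = data[1].values.tolist()  # second column: sentences
# Split into training and test sets (20% for testing).
# random_state sets the random seed so that the split is identical on every run;
# if it is left unset (None), the split differs from run to run.
X_train, X_test, y_train, y_test = train_test_split(sents, labels, test_size=0.2, random_state=5)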
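As a quick check of how the seed behaves, the sketch below splits a made-up list twice with the same random_state. The list toy_items is an assumption for illustration only and stands in for the ISEAR sentences. Any fixed integer, including 0, gives a reproducible split.

```python
from sklearn.model_selection import train_test_split

toy_items = list(range(10))  # made-up data, stands in for the sentences

# Same integer seed -> identical split on every call
_, test_a = train_test_split(toy_items, test_size=0.2, random_state=5)
_, test_b = train_test_split(toy_items, test_size=0.2, random_state=5)
same_with_seed = list(test_a) == list(test_b)

# random_state=0 is also a fixed seed, so it is reproducible as well
_, test_c = train_test_split(toy_items, test_size=0.2, random_state=0)
_, test_d = train_test_split(toy_items, test_size=0.2, random_state=0)
same_with_zero = list(test_c) == list(test_d)

print(same_with_seed, same_with_zero, len(test_a))
```

With test_size=0.2 and ten items, two items end up in the test set.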
Next, extract features. sklearn provides bag-of-words models; here we extract TF-IDF features.
$tf(t,d)$ is the term frequency: the number of times word $t$ occurs in a given text $d$. The larger the $tf$ value, the more often word $t$ occurs in text $d$.
$df(d,t)$ is the total number of documents (here, sentences) that contain word $t$.
$n_d$ is the total number of documents.
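$idf(t)$ improves on the raw frequency $tf(t,d)$, under which a word counts as more important the more often it occurs in a document or sentence. It considers not only how often a word occurs in the text at hand but also how often it occurs in texts in general. A word that shows up in almost every text provides little information for classification (the more documents it appears in, the less important it is). Typical examples are pronouns and function words such as the Chinese particles “的”, “地” and “得”.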
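The smoothed formula for $idf(t)$ is:
$$idf(t)=\log\cfrac{n_d+1}{df(d,t)+1}$$
Note that scikit-learn's TfidfVectorizer, with its default smooth_idf=True, uses the natural logarithm and adds 1 to this value, so a word that occurs in every document still receives a nonzero weight.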
Finally, the two are combined:
$$tfidf(t,d)=tf(t,d)\times idf(t)$$
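To make the formulas concrete, here is a minimal pure-Python sketch that computes $tf$, $df$, $idf$ and $tfidf$ exactly as written above. The three toy sentences and the helper names are made up for illustration. The numbers are not meant to match TfidfVectorizer, which by default also normalizes each row.

```python
import math
from collections import Counter

# Made-up toy corpus; each sentence plays the role of one document d
docs = ["i feel happy today", "i feel sad today", "i am happy happy"]
tokenized = [d.split() for d in docs]
n_d = len(tokenized)  # total number of documents

def tf(term, doc_tokens):
    """Number of times `term` occurs in one document."""
    return Counter(doc_tokens)[term]

def df(term):
    """Number of documents that contain `term`."""
    return sum(1 for tokens in tokenized if term in tokens)

def idf(term):
    """Smoothed idf: log((n_d + 1) / (df + 1))."""
    return math.log((n_d + 1) / (df(term) + 1))

def tfidf(term, doc_tokens):
    return tf(term, doc_tokens) * idf(term)

for term in ["i", "happy", "sad"]:
    print(term, df(term), round(idf(term), 4), round(tfidf(term, tokenized[2]), 4))
```

In this toy corpus "i" occurs in all three sentences, so its idf under this formula is log(4/4) = 0, while the rarer "sad" gets a larger idf than "happy".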
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
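# X_train starts as a list of sentences; after fit_transform it becomes a matrix
# of shape (number of sentences, vocabulary size) whose entries are TF-IDF values
X_train = vectorizer.fit_transform(X_train)
# Use transform here, not fit_transform: the vocabulary and idf weights are fitted on the training data only
X_test = vectorizer.transform(X_test)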
Other features could also be considered here, for example part-of-speech tags or n-grams.
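As one possible way to try n-gram features, TfidfVectorizer accepts an ngram_range parameter. The sketch below compares unigram-only features with unigram-plus-bigram features on two made-up sentences. The sentences and variable names are assumptions for illustration, not part of the original experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_sents = ["I am not happy", "I am happy"]  # made-up examples

# Unigrams only (the default setting)
unigram = TfidfVectorizer(ngram_range=(1, 1))
unigram.fit(toy_sents)

# Unigrams and bigrams
bigram = TfidfVectorizer(ngram_range=(1, 2))
bigram.fit(toy_sents)

# vocabulary_ maps each feature string to its column index
print(sorted(unigram.vocabulary_))
print(sorted(bigram.vocabulary_))
```

With bigrams, "not happy" becomes a feature of its own, which a pure bag of single words cannot distinguish from "happy". The default tokenizer lowercases text and drops single-character tokens such as "I".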
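Once the features are extracted, training can begin. A logistic regression model is used here; for its parameters, see the official documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html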
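from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values of C (the inverse of the regularization strength) for cross-validation
parameters = {'C': [0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]}

# Baseline: train with the default hyperparameters and report test accuracy
lr = LogisticRegression()
print(lr.fit(X_train, y_train).score(X_test, y_test))

# Grid search over C with 10-fold cross-validation (cv=10)
clf = GridSearchCV(lr, parameters, cv=10)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # test accuracy of the best model found
print(clf.best_params_)  # best hyperparameter according to cross-validation; here it is C=2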
Printing the confusion matrix of the predictions
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, clf.predict(X_test))
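The confusion matrix has size (number of classes) x (number of classes). In sklearn, rows correspond to the true labels and columns to the predicted labels. The diagonal holds the number of samples of each class that were classified correctly, and the off-diagonal entries in a row count the samples of that class that were assigned to each wrong class.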
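The way the matrix is read can be sketched in pure Python. The example below builds the same kind of matrix by hand from made-up labels (rows are true classes, columns are predicted classes) and derives per-class recall from the diagonal. The helper names and toy labels are assumptions for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, classes):
    """Rows: true class, columns: predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in classes] for t in classes]

def per_class_recall(matrix):
    """Diagonal entry divided by the row sum, for each class."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls

# Made-up labels, not results from the ISEAR experiment
classes = ["anger", "fear", "joy"]
y_true = ["joy", "joy", "fear", "anger", "anger", "fear"]
y_pred = ["joy", "fear", "fear", "anger", "joy", "fear"]

matrix = confusion_counts(y_true, y_pred, classes)
for name, row, recall in zip(classes, matrix, per_class_recall(matrix)):
    print(name, row, recall)
```

In this toy case one "anger" sample is predicted as "joy" and one "joy" sample as "fear", so both classes have a recall of 0.5, while "fear" is recovered completely.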