
Extra 3. Affective Computing and Emotion Recognition

Revisiting NLP and writing some of the material up as extra notes. This section implements a simple affective computing and emotion recognition model.
For entering formulas, see: Online LaTeX formulas

Example code

Dataset

We use the ISEAR dataset; searching on GitHub turns up download links. A few sample rows:
joy,"On days when I feel close to my partner and other friends. When I feel at peace with myself and also experience a close contact with people whom I regard greatly.",
fear,"Every time I imagine that someone I love or I could contact a serious illness, even death.",
anger,"When I had been obviously unjustly treated and had no possibility of elucidating this.",
sadness,"When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time.",
disgust,"At a gathering I found myself involuntarily sitting next to two people who expressed opinions that I considered very low and discriminating.",
shame,"When I realized that I was directing the feelings of discontent with myself at my partner and this way was trying to put the blame on him instead of sorting out my own feeliings.",
guilt,"I feel guilty when when I realize that I consider material things more important than caring for my relatives. I feel very self-centered.",
joy,"After my girlfriend had taken her exam we went to her parent's place.",
fear,"When, for the first time I realized the meaning of death.",
The first column is the emotion label; the second column is the text.

Code

Reading the data

import pandas as pd
import numpy as np

# Standard CSV read; note the data file must sit in the same folder as this script
data = pd.read_csv('ISEAR.csv', header=None)

data.head()  # show the first five rows
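A quick sanity check worth doing at this point (my addition, not in the original): count how many sentences each emotion label has, since a heavy class imbalance would matter for the classifier later. ISEAR is roughly balanced across its seven emotions.

print(data[0].value_counts())    # sentences per emotion label (column 0)
print('total rows:', len(data))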

Splitting into training and test data

from sklearn.model_selection import train_test_split
labels = data[0].values.tolist()  # first column: emotion labels
sents = data[1].values.tolist()   # second column: sentences
# Split off 20% as the test set. random_state fixes the random seed so the
# same split is produced on every run; if it is left unset, the split differs each time.
X_train, X_test, y_train, y_test = train_test_split(sents, labels, test_size=0.2, random_state=5)
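A variation worth knowing about (not used in the original code): passing stratify=labels makes train_test_split keep each emotion's proportion the same in both splits, which is a safer default when the classes are not perfectly balanced.

# Stratified variant: preserve per-class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    sents, labels, test_size=0.2, random_state=5, stratify=labels)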

Next, feature extraction. What sklearn provides is a bag-of-words model; here we extract tf-idf features.
$tf(t,d)$ is the term frequency: the number of times term $t$ appears in document $d$; the larger $tf(t,d)$, the more often $t$ occurs in $d$.
$df(d,t)$ is the number of documents (here, sentences) that contain term $t$.
$n_d$ is the total number of documents.
$idf(t)$ refines the raw-frequency measure $tf(t,d)$ (under which more occurrences in a document mean more importance) by also considering how often the term appears across documents in general: a term that occurs in almost every document provides little classification information (for example pronouns, or function words such as "的", "地", "得"). The smoothed formula is
$$idf(t)=\log\cfrac{n_d+1}{df(d,t)+1}$$
Combining the two:
$$tfidf(t,d)=tf(t,d)\times idf(t)$$
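A tiny hand computation of the formulas above, on a made-up three-sentence corpus. Note that sklearn's TfidfVectorizer differs in the details (it adds 1 to the smoothed idf and L2-normalizes each row), so its numbers will not match these exactly.

import math

docs = [['happy', 'day'], ['happy', 'happy', 'news'], ['sad', 'news']]
n_d = len(docs)  # total number of documents

def tf(t, d):    # raw count of term t in document d
    return d.count(t)

def df(t):       # number of documents containing term t
    return sum(1 for d in docs if t in d)

def idf(t):      # smoothed idf, as in the formula above
    return math.log((n_d + 1) / (df(t) + 1))

# tfidf('happy', docs[1]) = 2 * log(4/3) ≈ 0.58
print(tf('happy', docs[1]) * idf('happy'))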

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train)  # X_train was a list of sentences; after fit_transform it is a matrix (num_sentences x vocabulary_size) of tf-idf values
X_test = vectorizer.transform(X_test)  # transform only here; fit_transform is reserved for fitting on the training data

Other features could be tried here as well, such as part-of-speech tags or n-grams; see the sketch below.
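For instance, a minimal sketch of switching to unigram-plus-bigram features. The ngram_range, sublinear_tf, and min_df values are illustrative choices of mine, not from the original, and train_sents / test_sents stand for the raw sentence lists before vectorization.

# Variant vectorizer: unigrams + bigrams, log-scaled term frequency,
# ignoring terms seen in fewer than 2 training sentences.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=2)
X_train = vectorizer.fit_transform(train_sents)  # train_sents: raw training sentences
X_test = vectorizer.transform(test_sents)        # test_sents: raw test sentences
print('vocabulary size:', len(vectorizer.vocabulary_))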
With features extracted, training begins. A logistic regression model is used here; for its parameters see the official docs: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

parameters = {'C':[0.00001, 0.0001, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10]}  # regularization strengths to cross-validate over
lr = LogisticRegression()  # create the model
lr.fit(X_train, y_train).score(X_test, y_test)  # baseline: fit with the default C, evaluate on the test set

clf = GridSearchCV(lr, parameters, cv=10)  # cv: 10-fold cross-validation
clf.fit(X_train, y_train)
clf.score(X_test, y_test)
print(clf.best_params_)  # best hyperparameter found by cross-validation (here C=2)
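One detail worth noting: with the default refit=True, GridSearchCV refits a LogisticRegression on the whole training set using the best C once the search finishes, so clf.score(X_test, y_test) above already evaluates the tuned model. The fitted search object also exposes the cross-validated results:

print(clf.best_score_)                # mean cross-validated accuracy of the best C
best_lr = clf.best_estimator_         # the LogisticRegression refit with the best C
print(best_lr.score(X_test, y_test))  # equivalent to clf.score(X_test, y_test)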

Printing the confusion matrix of the predictions

from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, clf.predict(X_test))

(Figure: the confusion matrix printed by the code above.)
The confusion matrix is a (number of classes) x (number of classes) array: the diagonal entries count the examples of each class that were classified correctly, while the other columns in a row count the examples assigned to each wrong class.
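The raw array does not show which row belongs to which emotion; rows and columns follow clf.classes_ (the label names in sorted order). A small sketch (my addition) that prints a labelled version with pandas:

cm = confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_)
# Rows are true labels, columns are predicted labels.
print(pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_))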