kaggle Movie Review Sentiment Analysis (1): TF-IDF + Logistic Regression / Naive Bayes / SGD
阿新 • Published: 2018-12-20
Preface
This Kaggle starter competition (Bag of Words Meets Bags of Popcorn) is actually a word2vec tutorial, but this post does not use word2vec at all. Instead, sentences are vectorized with TF-IDF, and then logistic regression, multinomial Naive Bayes, and SGDClassifier are each trained and used for prediction. Submitting the LR result to Kaggle scores 0.88+, ranking close to 300. Still, it is a fine first attempt; its strength is simplicity.
A brief introduction to TF-IDF: TF is term frequency, i.e. how often a word appears in the sentence it belongs to; IDF(word) = log(N / (N(word) + α)), where N is the total number of sentences, N(word) is the number of sentences containing the word, and α keeps the denominator from being zero. The larger a word's TF-IDF, the more frequently it appears in its own sentence while rarely appearing in other sentences. In other words, TF-IDF measures how well a word can represent the sentence it belongs to.
Code implementation
# coding: utf-8
import pandas as pd
import os
from lxml import etree
import re
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer as TFIV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier
#load the data
path = r"H:\PyCharmProjects\Popcorn\data"  # raw string, so backslashes are not treated as escapes
t_set_df = pd.read_csv(os.path.join(path,"labeledTrainData.tsv"), header=0, sep='\t')
test_df = pd.read_csv(os.path.join(path,"testData.tsv" ), header=0, sep='\t')
t_set_pre = t_set_df['review']
test_pre = test_df['review']
t_set = []
test = []
t_label = t_set_df['sentiment']
#data preprocessing (strip HTML tags, keep letters only, lowercase, split into words)
def review2wordlist(review):
    html = etree.HTML(review, etree.HTMLParser())
    review = html.xpath('string(.)').strip()
    review = re.sub("[^a-zA-Z]", " ", review)
    wordlist = review.lower().split()
    return wordlist
for i in range(len(t_set_pre)):
    words = review2wordlist(t_set_pre[i])
    t_set.append(" ".join(words))
for i in range(len(test_pre)):
    words = review2wordlist(test_pre[i])
    test.append(" ".join(words))
#vectorize sentences with words' TF-IDF value
all_x = t_set + test
tfv = TFIV(min_df=3, max_features=None,
           strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
           ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True,
           stop_words='english')
tfv.fit(all_x)
all_x = tfv.transform(all_x)
train_len = len(t_set)
x_train = all_x[:train_len]  # 25000x309819 sparse CSR matrix of float64 with 3429925 stored elements
x_test = all_x[train_len:]
#model_1: logistic regression
y_train = t_set_df['sentiment']
lr = LogisticRegression(C=30)
grid_value = {'solver':['sag','liblinear','lbfgs']}
model_lr = GridSearchCV(lr, cv=20, scoring='roc_auc', param_grid=grid_value)
model_lr.fit(x_train, y_train)
print(model_lr.cv_results_) #the best score is 0.96462 with sag
#model_2: naive bayes
model_nb = MultinomialNB()
model_nb.fit(x_train, y_train)
print("naive bayes score: ", np.mean(cross_val_score(model_nb, x_train, y_train, cv=20, scoring='roc_auc'))) #0.94963712
#model_3: SGDClassifier (a linear classifier; modified_huber is a smoothed hinge loss, close to a linear SVM)
model_sgd = SGDClassifier(loss='modified_huber')
model_sgd.fit(x_train, y_train)
print("SGD score: ", np.mean(cross_val_score(model_sgd, x_train, y_train, cv=20, scoring='roc_auc'))) #0.964716288
# write the result to csv
lr_result = model_lr.predict(x_test)
lr_df = pd.DataFrame({'id':test_df['id'], 'sentiment':lr_result})
lr_df.to_csv("H:/PyCharmProjects/Popcorn/LR_result.csv", index=False)
nb_result = model_nb.predict(x_test)
nb_df = pd.DataFrame({'id':test_df['id'],'sentiment':nb_result})
nb_df.to_csv("H:/PyCharmProjects/Popcorn/NB_result.csv", index=False)
sgd_result = model_sgd.predict(x_test)
sgd_df = pd.DataFrame({'id':test_df['id'],'sentiment':sgd_result})
sgd_df.to_csv("H:/PyCharmProjects/Popcorn/SGD_result.csv", index=False)
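One detail worth noting: this competition is scored by ROC AUC (the same metric used for cross-validation above), so submitting the positive-class probability from `predict_proba` instead of the hard 0/1 labels from `predict` usually scores higher. A minimal self-contained sketch, with random toy features standing in for the TF-IDF matrix:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# toy stand-ins for x_train / y_train / x_test (hypothetical, for illustration)
rng = np.random.RandomState(0)
X_train = rng.rand(100, 5)
y_train = (X_train[:, 0] > 0.5).astype(int)
X_test = rng.rand(10, 5)

clf = LogisticRegression().fit(X_train, y_train)
# predict_proba returns one column per class; column 1 is P(sentiment == 1)
proba = clf.predict_proba(X_test)[:, 1]
sub = pd.DataFrame({'id': range(10), 'sentiment': proba})
```

AUC only cares about the ranking of the predictions, so continuous probabilities let the metric distinguish confident reviews from borderline ones, whereas hard labels collapse that information.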