Kaggle Competition: Sentiment Analysis on Movie Reviews

阿新 • Published: 2019-01-01

Competition link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews

I am growing fonder of the IPython notebook. Everything below can be done on a single page, and Firefox supports it better than Chrome does.

The dataset consists of train.tsv and test.tsv. Fields are tab-separated (\t), and each row of train.tsv has four fields: PhraseId, SentenceId, Phrase, Sentiment.

Sentiment labels:

0 - negative
1 - somewhat negative
2 - neutral
3 - somewhat positive
4 - positive

import pandas as pd
df = pd.read_csv('train.tsv',header=0,delimiter='\t')
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)

df.head()

Out[6]:

PhraseId	SentenceId	Phrase	Sentiment
0	1	1	A series of escapades demonstrating the adage ...	1
1	2	1	A series of escapades demonstrating the adage ...	2
2	3	1	A series	2
3	4	1	A	2
4	5	1	series	2

In [13]:
df.Sentiment.value_counts()/df.Sentiment.count()
Out[13]:
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64
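
The label distribution above is skewed toward neutral (label 2), so it helps to know what a trivial model would score before reading any accuracy figure. The sketch below is not part of the original notebook: it takes the proportions printed by value_counts() and derives the accuracy of a classifier that always predicts the most frequent label. The names class_share, majority_label and baseline_accuracy are illustrative.

```python
# Class proportions copied from the value_counts() output above.
class_share = {2: 0.509945, 3: 0.210989, 1: 0.174760, 4: 0.058990, 0: 0.045316}

# A classifier that always predicts the most frequent label is correct
# exactly as often as that label occurs in the data.
majority_label = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_label]

print("majority label:", majority_label)
print("baseline accuracy: %.4f" % baseline_accuracy)
```

On data with this distribution, a model only adds value once its accuracy clears roughly 0.51.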

Fit the classifier on the full training set, then check its accuracy directly on the first 5 rows of that same set:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

X_train = df['Phrase']
y_train = df['Sentiment']

# Bag-of-words counts -> TF-IDF weights -> logistic regression
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression()),
                      ])
text_clf = text_clf.fit(X_train, y_train)

# Quick sanity check on the first 5 training rows (not a held-out estimate)
X_test = df.head()['Phrase']
predicted = text_clf.predict(X_test)
print(np.mean(predicted == df.head()['Sentiment']))
for phrase, sentiment in zip(X_test, predicted):
    print('%r => %s' % (phrase, sentiment))

Classification accuracy and predictions:

0.8
'A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .' => 3
'A series of escapades demonstrating the adage that what is good for the goose' => 2
'A series' => 2
'A' => 2
'series' => 2

df.head()['Sentiment']
0    1
1    2
2    2
3    2
4    2

Only the first phrase is misclassified (predicted 3, actual 1).
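
The 0.8 above is measured on rows the model was trained on, so it says little about unseen phrases. A held-out estimate can be sketched as follows, assuming scikit-learn's GroupShuffleSplit and grouping by SentenceId so that sub-phrases of one sentence never land on both sides of the split. The toy lists are hypothetical stand-ins for df['Phrase'], df['Sentiment'] and df['SentenceId']; this is an illustration, not the procedure used in the original notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupShuffleSplit
from sklearn.pipeline import Pipeline

# Hypothetical toy rows standing in for the real training columns.
phrases = [
    "a charming and funny film", "charming",
    "a dull and lifeless plot", "dull",
    "the acting is wonderful", "wonderful",
    "a boring waste of time", "boring",
    "funny and wonderful throughout", "funny",
    "lifeless and boring acting", "lifeless",
    "the film", "film",
    "the plot", "plot",
]
labels = [4, 3, 0, 1, 4, 3, 0, 1, 4, 3, 0, 1, 2, 2, 2, 2]
sentence_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8]

# Hold out 25% of the sentences (not 25% of the phrases).
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, valid_idx = next(splitter.split(phrases, labels, groups=sentence_ids))

# Same pipeline as in the post: counts -> TF-IDF -> logistic regression.
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LogisticRegression())])
text_clf.fit([phrases[i] for i in train_idx], [labels[i] for i in train_idx])

# Score only on sentences the model has never seen.
valid_phrases = [phrases[i] for i in valid_idx]
valid_labels = [labels[i] for i in valid_idx]
accuracy = text_clf.score(valid_phrases, valid_labels)
print("held-out accuracy: %.2f" % accuracy)
```

Grouping matters here because, as df.head() shows, consecutive rows are fragments of the same sentence; a plain random split would let near-duplicates of validation phrases leak into training.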
The test dataset:

test_df = pd.read_csv('test.tsv',header=0,delimiter='\t')
test_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)

Classify the test dataset with the trained model:

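from numpy import savetxt

X_test = test_df['Phrase']
phraseIds = test_df['PhraseId']
predicted = text_clf.predict(X_test)
# Pair each prediction with its PhraseId from test.tsv rather than a hard-coded offset
pred = [[phrase_id, x] for phrase_id, x in zip(phraseIds, predicted)]
# comments='' keeps numpy from prefixing the header line with '#'
savetxt('../Submissions/lr_benchmark.csv',pred,delimiter=',',fmt='%d,%d',header='PhraseId,Sentiment',comments='')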
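
As an alternative to savetxt, the same file can be produced with pandas and read back for a quick format check before uploading. This is a minimal sketch under stated assumptions: phrase_ids and predicted are hypothetical stand-ins for test_df['PhraseId'] and the model output, and an in-memory buffer replaces the real file path.

```python
import io

import pandas as pd

# Hypothetical stand-ins for test_df['PhraseId'] and the `predicted` array.
phrase_ids = [156061, 156062, 156063]
predicted = [2, 3, 1]

submission = pd.DataFrame({'PhraseId': phrase_ids, 'Sentiment': predicted})

# Write to an in-memory buffer here; pass a file path instead to create the real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the CSV back and confirm the header, row count and label range.
check = pd.read_csv(io.StringIO(csv_text))
print(check.shape)
print(check['Sentiment'].between(0, 4).all())
```

With the real data, the row count should equal the 66292 entries reported by test_df.info().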

Submission result:

Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
