
Sentiment Analysis of Amazon Fine Food Reviews: From EDA to Deployment


Amazon.com, Inc. is an American multinational technology company based in Seattle, Washington. Amazon focuses on e-commerce, cloud computing, digital streaming, and artificial intelligence. Because its e-commerce platform is so large, its review system can be abused by sellers or customers who write fake reviews in exchange for incentives. Checking every review manually and labeling its sentiment is expensive, so a better approach is to rely on machine learning/deep learning models. In this case study, we will focus on the Amazon Fine Food Reviews data set, which is available on Kaggle.

Note: This article is not a code walkthrough for this problem. Rather, I will explain the approach I used. You can look at my code here.

About the Data Set

The data set consists of reviews of fine foods from Amazon over a period of more than 10 years, comprising 568,454 reviews up to October 2012. Reviews include ratings, product and user information, and plain-text review bodies. The data set also includes reviews from all other Amazon categories.

We have the following columns:


  1. Product Id: Unique identifier for the product
  2. User Id: Unique identifier for the user
  3. Profile Name: Profile name of the user
  4. Helpfulness Numerator: Number of users who found the review helpful
  5. Helpfulness Denominator: Number of users who indicated whether they found the review helpful or not
  6. Score: Rating between 1 and 5
  7. Time: Timestamp of the review
  8. Summary: Summary of the review
  9. Text: Text of the review

Objective

Given a review, determine whether the review is positive (rating of 4 or 5) or negative (rating of 1 or 2).


How to determine if a review is positive or negative?


We can use the Score/Rating. A rating of 4 or 5 is considered a positive review, and a rating of 1 or 2 a negative one. Reviews with a rating of 3 are treated as neutral and excluded from the analysis. This is an approximate, proxy way of determining the polarity (positivity/negativity) of a review.
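A minimal pandas sketch of this labeling step (column names follow the Kaggle schema; the file path and variable names are illustrative, not necessarily what the original notebook used):

```python
import pandas as pd

# Load the Kaggle CSV (file name is illustrative)
df = pd.read_csv("Reviews.csv")

# Drop neutral reviews (score == 3) and map the rest to a binary label:
# 1 = positive (rating 4 or 5), 0 = negative (rating 1 or 2)
df = df[df["Score"] != 3].copy()
df["Sentiment"] = (df["Score"] > 3).astype(int)
```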

Exploratory Data Analysis

Basic Preprocessing

As a first data-cleaning step, we check for missing values. Fortunately, there are none. Next, we check for duplicate entries. On analysis, we found cases where the same user posted the same review at the same time for different products, which makes no practical sense. So we keep only the first occurrence and remove the other duplicates.
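A sketch of how this deduplication can be done with pandas (the sort key and subset columns reflect the duplication pattern described above; the exact choices may differ from the original notebook):

```python
# Sort, then keep only the first occurrence of rows that share the same
# user, profile name, timestamp, and review text.
df = df.sort_values("ProductId")
df = df.drop_duplicates(subset=["UserId", "ProfileName", "Time", "Text"], keep="first")
```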

Example for a duplicate entry:

[Image: example of a duplicate entry in the data]

After deduplication, about 69% of the original data points remain.

Analyzing the Review Trend

[Image: review trend over time]
  • From 2001 to 2006 the number of reviews is fairly constant, but after that it begins to rise, and a large share of the new reviews have 5-star ratings. This may be due to unverified accounts boosting sellers inappropriately with fake reviews, or simply due to growth in the number of user accounts.

Analyzing the Target Variable

As discussed earlier, we assign all data points with a rating above 3 to the positive class and those below 3 to the negative class; the remaining (rating 3) points are discarded.

[Image: target class distribution]

Observation: the data set is clearly imbalanced, so we cannot use accuracy as a metric. Instead we will use AUC (area under the ROC curve).

Why is accuracy not suitable for imbalanced data sets?

Consider a scenario with an imbalanced data set, for example credit card fraud detection where 98% of the points are non-fraud (0) and the remaining 2% are fraud (1). Even a model that predicts every point as non-fraud achieves 98% accuracy, while being useless at catching fraud. So we cannot use accuracy as a metric.

What is AUC ROC?


AUC is the area under the ROC curve. It indicates how well the model can distinguish between classes: the higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ROC curve plots TPR (true positive rate) on the y-axis against FPR (false positive rate) on the x-axis at different classification thresholds.
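A short sketch of computing this metric with scikit-learn (toy labels and scores, for illustration only):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]               # ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]   # predicted probabilities for class 1

print(roc_auc_score(y_true, y_score))  # ~0.83 for this toy example
```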

Analyzing User Behavior

[Image: user behavior analysis]
  • After analyzing the number of products bought per user, we see that most users have bought only a single product.
  • Another thing to note is that the helpfulness denominator should always be greater than or equal to the numerator, since the numerator is the number of users who found the review helpful while the denominator is the number of users who indicated whether they found it helpful or not. Some data points violate this, so we remove them (see the sketch below).
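A minimal pandas sketch of this filter (column names as in the Kaggle schema):

```python
# Keep only rows where the helpfulness counts are consistent
df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]
```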

After this preprocessing, the data is reduced from 568,454 to 364,162 rows, i.e., about 64% of the data remains. Now let's get into the most important part: processing the review text.

Preprocessing Text Data

Text data requires some preprocessing before we can go further with analysis and model building. In the preprocessing phase we do the following, in this order (a code sketch follows the list):

  • Begin by removing HTML tags.
  • Remove punctuation and a limited set of special characters such as , . # ! etc.
  • Keep only words that are made up of English letters and are not alphanumeric.
  • Convert each word to lowercase.
  • Finally, remove stopwords.
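A sketch of such a cleaning function, assuming BeautifulSoup and the NLTK stopword list are available (helper and column names are my own):

```python
import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

STOPWORDS = set(stopwords.words("english"))

def clean_review(text: str) -> str:
    text = BeautifulSoup(text, "html.parser").get_text()    # strip HTML tags
    text = re.sub(r"[^a-zA-Z]+", " ", text)                  # drop punctuation, digits, special chars
    words = [w.lower() for w in text.split() if w.lower() not in STOPWORDS]
    return " ".join(words)

df["CleanedText"] = df["Text"].apply(clean_review)
```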

Train-Test Split

Once preprocessing is done, we split the data into train and test sets. We sort the data by time before splitting, because reviews change over time and a time-based split better reflects how the model will be used.
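A sketch of a time-based split (the 70/30 ratio here is an assumption, not necessarily what the original notebook used):

```python
# Sort chronologically, then take the earliest 70% for training
df = df.sort_values("Time")
split_idx = int(0.7 * len(df))
X_train, X_test = df["CleanedText"][:split_idx], df["CleanedText"][split_idx:]
y_train, y_test = df["Sentiment"][:split_idx], df["Sentiment"][split_idx:]
```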

Vectorizing Text Data

After that, I applied bag-of-words (BoW) vectorization, TF-IDF vectorization, average word2vec, and TF-IDF-weighted word2vec to featurize the text, and saved each representation as a separate set of vectors. Since vectorizing large amounts of data is expensive, I computed each one once and stored it so that it does not have to be recomputed again and again.

Note: I used a unigram approach for bag-of-words and TF-IDF. For word2vec, I trained the model on our corpus rather than using pre-trained weights. You can always try an n-gram approach for BoW/TF-IDF and pre-trained embeddings for word2vec.

You should always fit your vectorizer on the train data and only transform the test data with it. Do not fit the vectorizer on test data, as that causes data leakage.
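A sketch of the fit-on-train / transform-on-test pattern for BoW and TF-IDF (unigram settings as described above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow = CountVectorizer(ngram_range=(1, 1))    # unigram bag-of-words
X_train_bow = bow.fit_transform(X_train)     # fit only on the train split
X_test_bow = bow.transform(X_test)           # transform (never fit) the test split

tfidf = TfidfVectorizer(ngram_range=(1, 1))
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
```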

TSNE Visualization

TSNE (t-distributed stochastic neighbor embedding) is one of the most popular dimensionality-reduction techniques and is mainly used for visualizing high-dimensional data in two or three dimensions. Before getting into machine learning models, I tried to visualize our features in a lower dimension.

[Image: TSNE visualization]

Steps I followed for TSNE:


  • Keeping the perplexity constant, I ran TSNE for different numbers of iterations and found the most stable setting.
  • Keeping that iteration count constant, I then ran TSNE at different perplexities to get a better result.
  • Once I got a stable result, I ran TSNE again with the same parameters (a code sketch of this loop is shown below).
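A sketch of this kind of parameter sweep with scikit-learn (the perplexity and iteration values are illustrative, and `X_sample` stands for the dense feature matrix of the balanced 20,000-point sample mentioned below):

```python
from sklearn.manifold import TSNE

# Step 1: fix perplexity, sweep the number of iterations
for n_iter in [1000, 2000, 5000]:
    emb = TSNE(n_components=2, perplexity=50, n_iter=n_iter,   # n_iter is called max_iter in newer scikit-learn
               random_state=42).fit_transform(X_sample)
    # plot `emb` colored by class and pick the most stable iteration count

# Step 2: fix the chosen iteration count, sweep perplexity
for perplexity in [30, 50, 100]:
    emb = TSNE(n_components=2, perplexity=perplexity, n_iter=2000,
               random_state=42).fit_transform(X_sample)
    # plot again and keep the setting that separates the classes best
```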

However, I found that TSNE is not able to separate the two classes well in the lower-dimensional space.

Note: I ran TSNE on a random sample of 20,000 points with an equal class distribution. Results may improve with a larger number of data points.

Machine Learning Approach

Naive Bayes

In machine learning it is always good to have a baseline model to compare against, so we begin with a naive Bayes model. For naive Bayes, we split the data into train, CV, and test sets since we are doing manual cross-validation. We then tried multinomial naive Bayes on the BoW and TF-IDF features. After hyperparameter tuning we end up with the following results.
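A sketch of tuning the smoothing parameter alpha with a manual validation split (the alpha grid and the `X_cv_bow`/`y_cv` split names are illustrative):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# X_train_bow / X_cv_bow, y_train / y_cv: manual train and cross-validation splits
best_alpha, best_auc = None, 0.0
for alpha in [1e-4, 1e-3, 1e-2, 0.1, 1, 10]:
    clf = MultinomialNB(alpha=alpha).fit(X_train_bow, y_train)
    auc = roc_auc_score(y_cv, clf.predict_proba(X_cv_bow)[:, 1])
    if auc > best_auc:
        best_alpha, best_auc = alpha, auc
print(best_alpha, best_auc)
```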

[Image: naive Bayes results]

We can see that in both cases the model is slightly overfitting. Don't worry, we will try other algorithms as well.

Logistic Regression

As the algorithm is fast, it was easy to train on a machine with 12 GB of RAM. In this case I only split the data into train and test, since GridSearchCV does internal cross-validation. I then did hyperparameter tuning on the BoW, TF-IDF, average word2vec, and TF-IDF-weighted word2vec features.
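A sketch of this tuning with GridSearchCV, scored by ROC AUC (the C grid is illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [1e-3, 1e-2, 0.1, 1, 10, 100]}
grid = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),
    param_grid, scoring="roc_auc", cv=5,
)
grid.fit(X_train_tfidf, y_train)
print(grid.best_params_, grid.best_score_)
```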

[Image: logistic regression results]

Even though the BoW and TF-IDF features gave higher AUC on the test data, those models are slightly overfitting. The average word2vec features give a more generalized model with a test AUC of 91.09.

[Image: performance metrics for logistic regression on avg-word2vec features]

Support Vector Machines

Next, I tried the SVM algorithm, both linear SVM and RBF-kernel SVM. SVMs perform well on high-dimensional data. Linear SVM with average word2vec features resulted in the most generalized model.
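One way to sketch the linear-SVM tuning is with scikit-learn's SGDClassifier using hinge loss (this choice, the alpha grid, and the `X_train_w2v`/`X_test_w2v` names are my assumptions; LinearSVC would work just as well):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import GridSearchCV

# X_train_w2v / X_test_w2v: average word2vec feature matrices
svm = SGDClassifier(loss="hinge", penalty="l2", max_iter=1000)
grid = GridSearchCV(svm, {"alpha": [1e-5, 1e-4, 1e-3, 1e-2]}, scoring="roc_auc", cv=5)
grid.fit(X_train_w2v, y_train)

# Hinge-loss SVMs expose no predict_proba; calibrate to get probabilities for AUC
calibrated = CalibratedClassifierCV(grid.best_estimator_, cv=3).fit(X_train_w2v, y_train)
test_scores = calibrated.predict_proba(X_test_w2v)[:, 1]
```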

[Image: SVM results]
[Image: performance metrics for linear SVM on avg-word2vec features]

Decision Trees

Even though we already expect this data to overfit decision trees easily, I tried them anyway to see how well tree-based models perform.

After hyperparameter tuning, I ended up with the following results. The models are overfitting, and the performance of decision trees is lower than that of logistic regression, naive Bayes, and SVM. We could overcome this to some extent with post-pruning techniques such as cost-complexity pruning, or we can use ensemble models instead. Here I decided to try ensemble models, random forest and XGBoost, and check their performance.
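For reference, this is roughly what cost-complexity pruning would look like with scikit-learn's ccp_alpha parameter (an alternative mentioned above rather than what was actually done here; grid values are illustrative):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    DecisionTreeClassifier(),
    {"max_depth": [5, 10, 20, None], "ccp_alpha": [0.0, 1e-4, 1e-3]},
    scoring="roc_auc", cv=5,
)
grid.fit(X_train_bow, y_train)
print(grid.best_params_)
```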

[Image: decision tree results]

Random Forest

With random forest, the test AUC increased, but most of the models are still slightly overfitting.

[Image: random forest results]

XGBoost

XGBoost performed similarly to random forest; most of the models were overfitting.

[Image: XGBoost results]

After trying several machine learning approaches, we can see that logistic regression and linear SVM on average word2vec features give the most generalized models.

Do not Stop Here!!!


What about sequence models? They have proven to work well on text data. Next, we will try to solve the problem with a deep learning approach and see whether the results improve.

Deep Learning Approach

The text preprocessing is slightly different when we use sequence models to solve this problem.

  • The initial preprocessing is the same as before: we remove punctuation, special characters, stopwords, etc., and convert each word to lowercase.
  • Next, instead of vectorizing the data directly, we take another approach. First we convert the text into integer sequences by encoding it: each unique word in the corpus is assigned a number, and that number is repeated wherever the word repeats.

For example, the sequence for "it is really tasty food and it is awesome" might be "25, 12, 20, 50, 11, 17, 25, 12, 109", and the sequence for "it is bad food" would then be "25, 12, 78, 11".

  • Finally, we will pad each of the sequences to the same length.

[Image: sequence padding]

After plotting the sequence lengths, I found that most reviews have a length of 225 or less, so I set the maximum sequence length to 225. If a sequence is longer than 225, we keep only its last 225 numbers; if it is shorter, we pad the beginning with zeros.
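A sketch of this encoding and padding with the Keras text utilities (pre-padding and pre-truncating to match the description above):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 225
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)                     # build the word -> integer mapping on train only
train_seq = tokenizer.texts_to_sequences(X_train)
test_seq = tokenizer.texts_to_sequences(X_test)

# Pad the front with zeros and keep only the last 225 tokens of longer reviews
X_train_pad = pad_sequences(train_seq, maxlen=MAX_LEN, padding="pre", truncating="pre")
X_test_pad = pad_sequences(test_seq, maxlen=MAX_LEN, padding="pre", truncating="pre")
```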

Our model consists of an embedding layer initialized with pre-trained weights, LSTM layers, and multiple dense layers. We tried different combinations of LSTM and dense layers with different dropout rates, and used pre-trained GloVe vectors for the embedding layer, which I would say played an important role in improving the AUC score. In the end, the best results came from 2 LSTM layers and 2 dense layers with a dropout rate of 0.2. The architecture looks as follows:
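A sketch of this architecture in Keras (layer sizes and training settings are illustrative; `embedding_matrix` would hold the GloVe vectors, and `vocab_size`/`embedding_dim` come from the tokenizer and the GloVe file):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.metrics import AUC

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim,
              weights=[embedding_matrix], input_length=MAX_LEN, trainable=False),
    LSTM(64, return_sequences=True),   # first LSTM layer passes the full sequence on
    LSTM(32),                          # second LSTM layer returns only its last state
    Dense(32, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),    # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[AUC()])
model.fit(X_train_pad, y_train, validation_split=0.2, epochs=3, batch_size=256)
```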

[Image: model architecture]

The model converged easily by the second epoch. We got a validation AUC of about 94.8%, the highest AUC of any of our generalized models.

[Image: results with 1 LSTM layer]
[Image: results with 2 LSTM layers]

Some of our experimentation results are as follows:


[Image: LSTM experiment results]

So we have successfully trained a model. Here comes an interesting question: how do we actually use it? Don't worry, I will also explain how I deployed the model using Flask.

Model Deployment Using Flask

This is the most exciting part, and the one everyone misses out on: how do we deploy the model we just created? I chose Flask because it is a Python-based micro web framework, and coming from a non-web-developer background I found it comparatively easy to use.
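A minimal sketch of what such a Flask service could look like (the endpoint name, file paths, and payload format are assumptions; the real app would reuse the tokenizer and cleaning function from training):

```python
import pickle

from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

app = Flask(__name__)
model = load_model("sentiment_lstm.h5")               # trained LSTM model (illustrative path)
tokenizer = pickle.load(open("tokenizer.pkl", "rb"))  # tokenizer fitted on the training corpus

@app.route("/predict", methods=["POST"])
def predict():
    text = request.json["text"]
    seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=225)
    prob = float(model.predict(seq)[0][0])             # probability of the positive class
    label = "positive" if prob >= 0.5 else "negative"
    return jsonify({"class": label, "probability": prob})

if __name__ == "__main__":
    app.run()
```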

Now we test the application by predicting the sentiment of the text "food has good taste". We do this by sending a request as follows:

[Image: test code]
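A request along these lines can be made with the `requests` library (URL and payload shape are assumptions matching the sketch above):

```python
import requests

resp = requests.post("http://127.0.0.1:5000/predict",
                     json={"text": "food has good taste"})
print(resp.json())   # e.g. {"class": "positive", "probability": 0.94}
```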

Our application outputs both the predicted class name and the probability of the text belonging to that class. Here our text is predicted to be positive with a probability of about 94%.

[Image: prediction output]

You can play with the full code from my Github project.


Scope of improvement:


  • There is still plenty of room for improvement in the present models. I never used the full data set to train the machine learning models; you can always try that, and it may help overcome their overfitting.
  • I only used pre-trained word embeddings for the deep learning model, not for the machine learning models. You can try using pre-trained embeddings such as GloVe or word2vec with the machine learning models as well.

Translated from: https://towardsdatascience.com/sentiment-analysis-on-amazon-food-reviews-from-eda-to-deployment-f985c417b0c
