CNN情感分析（文字分類）

阿新 • • 發佈：2019-01-31

　　這篇文章翻譯至denny britz的部落格。

一、資料預處理

　　這個情感分析的資料集來自Rotten Tomatoes的電影評論，總共10662個樣本，一半正例，一半負例，詞彙的數目大概2萬個。
　　任何機器學習能夠得到很好的執行，資料預處理都很重要。首先，簡單介紹下其資料預處理的過程：
　　1、從原始檔案中載入正例和負例；
　　2、對資料進行清理；
　　3、設定最大句子長度是59，不足的補符號；
　　4、建立一個字典，對映字到0-18765。
　　簡單說明一下，第3步保證輸入的句子長度一致，CNN本來是處理影象的，所以，我們要做的像個圖片一樣，因此要求長度一致。

二、CNN模型設計

CNN模型

　　這個圖已經畫的相當6了。
　　第一層詞嵌入，用word2vec把詞語對映為128維的向量。第二層是卷積層，這個卷積層比較特別，用了多個過濾器的大小。例如，每次分別劃分3，4或5個字詞；第三層，池化層，在這裡採用的最大池化，並將所有卷積層的結果放在一個長的特徵向量上，然後加上dropout正則，最後用softmax輸出結果。
　　下面是程式碼實現。
　　
　　為了更好的處理超引數，我們把程式碼都放在一個TextCNN類裡面：

import tensorflow as tf
import numpy as np

class TextCNN(object) 
:
    """
    A CNN for text classification.
    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.
    """
    def __init__(
      self, sequence_length, num_classes, vocab_size,
      embedding_size, filter_sizes, num_filters):
        # Implementation...

　　sequence_length-句子的長度，59；
　　num_classes-分類的數目，2；
　　vocab_size-字典的長度；
　　embedding_size-我們對映的向量維度大小，128；
　　filter_sizes-卷積層過濾器的大小，[3, 4, 5]；
　　num_filters-過濾器的深度。

1、輸入
　　定義輸入資料的placeholder和dropout。

# Placeholders for input, output and dropout

self.input_x=tf.placeholder(tf.int32[None,sequence_length], name="input_x")
self.input_y=tf.placeholder(tf.float32[None,num_classes], name="input_y")
self.dropout_keep_prob=tf.placeholder(tf.float32,name="dropout_keep_prob")

　　tf.placeholder相當於一個坑，等著填呢。第二個引數是tensor的shape，None表示這個維度可以是任意數。在樣例中，第一個維度是批資料的大小，用None表示，可以是任意大小的批資料。
2、詞嵌入層

with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)

　　這裡用的隨機生成的vector。最後的tf.expand_dims()來擴充資料，生成[[None, sequence_length, embedding_size, 1]這種格式。因為TensorFlow的局基層conv2d需要一個4維的tensor（batch，width，height，channel）
3、卷積池化層

pooled_outputs = []
for i, filter_size in enumerate(filter_sizes):
    with tf.name_scope("conv-maxpool-%s" % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")
        conv = tf.nn.conv2d(
            self.embedded_chars_expanded,
            W,
            strides=[1, 1, 1, 1],
            padding="VALID",
            name="conv")
        # Apply nonlinearity
        h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")
        # Max-pooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sequence_length - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name="pool")
        pooled_outputs.append(pooled)

# Combine all the pooled features
num_filters_total = num_filters * len(filter_sizes)
self.h_pool = tf.concat(3, pooled_outputs)
self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])

我們先建立卷積層，然後是最大池化層。因為我們卷積層的過濾器選擇了多個，所以我們需要對它們進行迭代，為每一個過濾器都建立一個layer，然後把結果匯聚到一個大的向量裡。
　　這裡需要說明的是，和影象不同的地方，就是我們文字不能跨行過濾，因為那樣會沒有意義。
4、設定dropout（訓練時候用）

# Add dropout
with tf.name_scope("dropout"):
    self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

5、評分和預測（最後的全連線層）

with tf.name_scope("output"):
    W = tf.Variable(tf.truncated_normal([num_filters_total, num_classes], stddev=0.1), name="W")
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
    self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
    self.predictions = tf.argmax(self.scores, 1, name="predictions")

6、設定損失函式和準確率計算

# Calculate mean cross-entropy loss
with tf.name_scope("loss"):
    losses = tf.nn.softmax_cross_entropy_with_logits(self.scores, self.input_y)
    self.loss = tf.reduce_mean(losses)

　　這裡使用的交叉熵損失函式

# Calculate Accuracy
with tf.name_scope("accuracy"):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

　剩下的，按照常規步驟走就可以了。

CNN情感分析（文字分類）

CNN情感分析（文字分類）

斯坦福大學-自然語言處理入門筆記第七課情感分析（sentiment analysis）

斯坦福大學自然語言處理第七課“情感分析（Sentiment Analysis）”

基於機器學習的NLP情感分析（二）---- 分類問題

資料探勘——基於R文字情感分析（2）

神經網路之文字情感分析（二）

Python 文字挖掘：使用機器學習方法進行情感分析（一、特徵提取和選擇）

文字挖掘之情感分析（一）

深度學習情感分析（隨機梯度下降代碼實現）

python資料分析：分類分析（classification analysis）

kaggle 影評情感分析（1）—— TF-IDF+Logistic迴歸/樸素貝葉斯/SGD

第4章樸素貝葉斯（文字分類、過濾垃圾郵件、獲取區域傾向）

無監督分類：聚類分析（K均值）

圖片情感分析（2）：影象情感分析模型

PAT 1075連結串列元素的分類程式碼實現及錯誤分析（C語言）

文字分類實驗（多分類）

團隊作業10——復審和事後分析（Beta版本）

事後諸葛亮分析（Beta階段）

apigw鑒權分析（1-4）新浪微博開放平臺 - 鑒權分析

HashMap源碼分析（JDK1.8）

CNN情感分析（文字分類）

相關推薦