NLP15: A First Look at Chinese Sentiment Mining with Keras

阿新 • Published: 2019-01-20

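Summary: everything is written in Keras with TensorFlow (tf) as the backend. Two datasets are used (ChnSentiCorp_htl_ba_2000 and imdb), and three networks are tried out: a plain fully connected network (NN), an LSTM, and a CNN. Keras feels simpler to write than raw tf. Given enough parameters, the NN fits the training set very well, but it overfits; the LSTM performs much better than the CNN.
Keras documentation in Chinese: http://keras-cn.readthedocs.io/en/latest/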

NN

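Downloading the data
Running imdb.load_data() downloads the data from https://s3.amazonaws.com/text-datasets/imdb_full.pkl.
Each review in this dataset is a sequence of integers, and each integer stands for one word; the words have already been given integer ids.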
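To make the encoding concrete, here is a minimal sketch of frequency-rank integer encoding on a toy corpus. The corpus and the helper names (`build_word_index`, `encode`) are made up for illustration. The real IMDB index ships with Keras and also reserves a few low ids for special tokens, which this sketch does not model.

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer id: 1 = most frequent word, 2 = next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(text, word_index):
    """Replace every known word with its id (unknown words are skipped)."""
    return [word_index[w] for w in text.split() if w in word_index]


# Hypothetical toy reviews, not taken from the IMDB data.
toy_reviews = ["good film good cast", "bad film", "good plot"]
word_index = build_word_index(toy_reviews)
encoded = [encode(review, word_index) for review in toy_reviews]
print(word_index)
print(encoded)
```

Frequent words get small ids, so capping the vocabulary amounts to dropping every id above a threshold.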

# -*- coding: utf-8 -*-
# Explore the data
import matplotlib.pyplot as plt
import numpy as np
from keras.datasets import imdb
from keras.layers import Embedding, Flatten, Dense
from keras.models import Sequential
from keras.preprocessing import sequence

## EDA
# Load the data; it comes from https://s3.amazonaws.com/text-datasets/imdb_full.pkl

(x_train, y_train), (x_test, y_test) = imdb.load_data()
# Look at the distribution of review lengths
lens = list(map(len, x_train))
avg_len = np.mean(lens)
print(avg_len)
plt.hist(lens, bins=range(min(lens), max(lens) + 50, 50))
plt.show()
# Reviews differ in length, so bring them all to the same length
m = max(max(map(len, x_train)), max(map(len, x_test)))
print('m=%d' % m)
maxword = min(400, m)
x_train = sequence.pad_sequences(x_train, maxlen=maxword)
x_test = sequence.pad_sequences(x_test, maxlen=maxword)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
# Vocabulary size (largest word id in the training set, plus one)
vocab_siz = np.max([np.max(x_train[i]) for i in range(x_train.shape[0])]) + 1
print('vocab_siz=%d' % vocab_siz)
print('x_train.shape=[%d,%d]' % (x_train.shape[0], x_train.shape[1]))
# Build the model
model = Sequential()
# The first layer is an embedding layer; its weight matrix is vocab_siz x 64
model.add(Embedding(vocab_siz, 64, input_length=maxword))
# Flatten each sample's (maxword, 64) matrix into a vector of maxword * 64 values
model.add(Flatten())
# Add several fully connected layers
model.add(Dense(2000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(50, activation='relu'))
# The last layer outputs a value between 0 and 1, as in logistic regression (LR)
model.add(Dense(1, activation='sigmoid'))
# Compile
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
print(type(x_train))
# Train (nb_epoch is the Keras 1 name; Keras 2 calls it epochs)
model.fit(x_train, y_train, validation_data=(x_test, y_test), batch_size=100, nb_epoch=20, verbose=1)
score = model.evaluate(x_test, y_test)
print(score)

Console output:

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 400, 64)       5669568     embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 25600)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 2000)          51202000    flatten_1[0][0]                  
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 500)           1000500     dense_1[0][0]                    
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 200)           100200      dense_2[0][0]                    
____________________________________________________________________________________________________
dense_4 (Dense)                  (None, 50)            10050       dense_3[0][0]                    
____________________________________________________________________________________________________
dense_5 (Dense)                  (None, 1)             51          dense_4[0][0]                    
====================================================================================================
Total params: 57,982,369
Trainable params: 57,982,369
Non-trainable params: 0
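
The parameter counts in this summary can be reproduced by hand, which also shows where the bulk of the model sits. The sketch below is only arithmetic; the vocabulary size is not stated in the text and is inferred here from the embedding row (5,669,568 / 64), and the helper names are made up.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per word id."""
    return vocab_size * dim


def dense_params(n_in, n_out):
    """A weight per input/output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


maxword, emb_dim = 400, 64
vocab_size = 5669568 // emb_dim  # inferred from the summary, not from the post

# Flatten turns (400, 64) into 25600 inputs for the first Dense layer.
layer_sizes = [maxword * emb_dim, 2000, 500, 200, 50, 1]
dense = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = embedding_params(vocab_size, emb_dim) + sum(dense)

print(vocab_size)
print(dense)
print(total)
```

Roughly 51 million of the 58 million parameters belong to the first Dense layer alone, against 25,000 training reviews, which is consistent with the overfitting noted in the summary.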

Epoch 1/20
25000/25000 [==============================] - 452s - loss: 0.4248 - acc: 0.7779 - val_loss: 0.2961 - val_acc: 0.8768
Epoch 2/20
25000/25000 [==============================] - 458s - loss: 0.0779 - acc: 0.9730 - val_loss: 0.4230 - val_acc: 0.8503
Epoch 3/20
25000/25000 [==============================] - 450s - loss: 0.0050 - acc: 0.9985 - val_loss: 0.7284 - val_acc: 0.8522
Epoch 4/20
25000/25000 [==============================] - 452s - loss: 0.0031 - acc: 0.9990 - val_loss: 0.9187 - val_acc: 0.8420
Epoch 5/20
25000/25000 [==============================] - 449s - loss: 0.0052 - acc: 0.9982 - val_loss: 1.0336 - val_acc: 0.8362
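
The log above shows the overfitting directly: training loss keeps falling while val_loss rises from the second epoch on. A common remedy is to stop once the validation loss stops improving; Keras offers an `EarlyStopping` callback with `monitor` and `patience` arguments for this. The sketch below is not that callback, just the same idea in plain Python applied to the val_loss values copied from the log. The function names are made up.

```python
# val_loss per epoch, copied from the console output above.
val_loss = [0.2961, 0.4230, 0.7284, 0.9187, 1.0336]


def best_epoch(losses):
    """1-based epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1


def stop_epoch(losses, patience):
    """1-based epoch at which training would stop: the first epoch that
    completes `patience` epochs in a row without a new best loss.
    Returns None if that never happens."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


print(best_epoch(val_loss))
print(stop_epoch(val_loss, patience=2))
```

On these numbers the best validation loss is already reached in epoch 1, so most of the 20 requested epochs only add training time.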

LSTM

The corpus is "ChnSentiCorp_htl_ba_2000", a sentiment-analysis corpus found online.
[Image: sample of the corpus]
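In the example above the data arrive as integers, each one a word id, and the step that builds those ids is hidden. This LSTM example restores the word-to-id step and also includes the corpus preprocessing: the files under pos and neg are first merged into one document per class, each document is segmented into words, and the two are then combined into a single labelled file.
Gensim builds a dictionary from the segmented text, giving a mapping between words and ids, and each sentence is converted to a vector of integer ids.
After looking at the sentence lengths, a suitable length is chosen: longer sentences are truncated and shorter ones are padded.
Once that is done, the LSTM can be built, trained, and evaluated.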
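
The truncate-or-pad step can be sketched in a few lines. This is a plain-Python imitation of what `sequence.pad_sequences` does with its default settings (zeros are added at the front, and over-long sequences keep their last `maxlen` ids); `pad_or_truncate` is a made-up name, and the real function returns a NumPy array rather than lists.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Keep the last `maxlen` ids of a long sequence; left-pad a short one."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short = pad_or_truncate([5, 6, 7], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long_)
```

Padding at the front keeps the real words next to the end of the sequence, which is the part an LSTM reads last.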
Code:

# -*- coding:utf-8-*-

import os
import re
import numpy as np
import matplotlib.pyplot as plt
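from pprint import pprint

# Word segmentation
import jieba
from bs4 import BeautifulSoup
from gensim import corpora
from keras.layers import Embedding, LSTM, Dense, Activation, Dropout
from keras.models import Sequential
from keras.preprocessing import sequence
try:
    # scikit-learn 0.18 and later
    from sklearn.model_selection import train_test_split
except ImportError:
    # older scikit-learn releases
    from sklearn.cross_validation import train_test_split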


def cutPhase(inFile, outFile):
    # Uncomment the next line if you have a custom dictionary
    # jieba.load_userdict("dict_all.txt")
    # Load the stop words
    with open('data/stopword.txt', 'r', encoding='utf-8') as f_stop:
        stoplist = set(line.strip() for line in f_stop)
    f1 = open(inFile, 'r', encoding='utf-8')
    f2 = open(outFile, 'w+', encoding='utf-8')
    line = f1.readline()
    count = 0
    while line:
        # Strip any HTML markup
        b = BeautifulSoup(line, "lxml")
        line = b.text
        # Segment into words
        segs = jieba.cut(line, cut_all=False)
        # Drop whitespace-only tokens and stop words
        segs = [word for word in segs
                if word.strip()
                and word.strip() not in stoplist
                ]
        # Separate the words with single spaces
        f2.write(" ".join(segs))
        f2.write('\n')
        line = f1.readline()
        count += 1
        if count % 100 == 0:
            print(count)
    f1.close()
    f2.close()


def load_data(out_pos_name='data/pos.txt', out_neg_name='data/neg.txt'):
    def do_load(file_name, dir):
        c = 0
        with open(file_name, 'w+', encoding='utf-8') as f_out:
            for root, _, files in os.walk(dir):
                # print(root)
                for f_name in files:
                    p = os.path.join(root, f_name)
                    try:
                        with open(p, mode='r', encoding='gbk') as f_read:
                            # print(os.path.join(root, f_name))
                            c += 1
                            txt = f_read.read()
                            txt = re.subn(r'\s+', ' ', txt)[0]
                            f_out.write('%s\n' % (txt))
                            # if c % 100 == 0:
                            #     print(c)
                    except Exception as e:
                        print('p:', p)
                        # print('e:',e)

    print('Loading pos')
    do_load(out_pos_name,
            'data/ChnSentiCorp_htl_ba_2000/pos')
    print('Loading neg')
    do_load(out_neg_name,
            'data/ChnSentiCorp_htl_ba_2000/neg')


def combine_data():
    c = 0
    f_w = open('data/train.cut', 'w+', encoding='utf-8')
    f_pos = open('data/pos.cut', 'r', encoding='utf-8')
    line = f_pos.readline()
    while line:
        c += 1
        f_w.write('%d\t%s' % (1, line))
        line = f_pos.readline()
        print(c)
    f_pos.close()

    f_neg = open('data/neg.cut', 'r', encoding='utf-8')
    line = f_neg.readline()
    while line:
        c += 1
        f_w.write('%d\t%s' % (0, line))
        line = f_neg.readline()
        print(c)
    f_neg.close()

    f_w.close()


if __name__ == '__main__':
    # print('# Load the data')
    # load_data(out_pos_name='data/pos.txt', out_neg_name='data/neg.txt')
    # print('# Segment')
    # cutPhase(inFile='data/pos.txt', outFile='data/pos.cut')
    # cutPhase(inFile='data/neg.txt', outFile='data/neg.cut')
    # Merge the two classes into one labelled file
    # combine_data()
    Y = []
    x = []
    for line in open('data/train.cut', encoding='utf-8'):
        label, sentence = line.split("\t", 1)
        Y.append(int(label))
        x.append(sentence.split())

    print('# Building the dictionary')
    dic = corpora.Dictionary(x)
    X = []
    for row in x:
        tmp = []
        for w_i in row:
            tmp.append(dic.token2id[w_i])
        X.append(tmp)
    # X stays a list of lists: rows differ in length, and pad_sequences accepts that directly
    Y = np.array(Y)
    # lens = list(map(len, X))
    # avg_len = np.mean(lens)
    # print(avg_len)
    # plt.hist(lens, bins=range(min(lens), max(lens) + 50, 50))
    # plt.show()

    # Sentences differ in length, so bring them to a common length.
    # The mean length is 38.18 and the maximum is 337.
    m = max(map(len, X))
    print('m=%d' % m)
    maxword = min(100, m)
    X = sequence.pad_sequences(X, maxlen=maxword)
    print(X.shape)
    print(X.shape)

    ## Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

    # Build the model
    model = Sequential()
    # Note: gensim ids start at 0, the same value pad_sequences uses for padding
    model.add(Embedding(len(dic) + 1, 128, input_length=maxword))
    # model.add(LSTM(128, dropout_W=0.2, return_sequences=True))
    # model.add(LSTM(64, dropout_W=0.2,return_sequences=True))
    # dropout_W is the Keras 1 name; Keras 2 calls it dropout
    model.add(LSTM(128, dropout_W=0.2))
    model.add(Dense(1))
    model.add(Activation("sigmoid"))
    # Compile
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.summary()
    # Train (nb_epoch is the Keras 1 name; Keras 2 calls it epochs)
    model.fit(x_train, y_train, batch_size=100, nb_epoch=10, validation_data=(x_test, y_test))
    ## Evaluate; "score" is the loss on the test set
    score, acc = model.evaluate(x_test, y_test, batch_size=100)
    print("score: %.3f, accuracy: %.3f" % (score, acc))

    # # Predict
    # my_sentences = ['討厭 房間']
    # my_x = []
    # for s in my_sentences:
    #     words = s.split()
    #     tmp = []
    #     for w_j in words:
    #         tmp.append(dic.token2id[w_j])
    #     my_x.append(tmp)
    # my_X = np.array(my_x)
    # my_X = sequence.pad_sequences(my_X, maxlen=maxword)
    # labels = [int(round(x[0])) for x in model.predict(my_X)]
    # for i in range(len(my_sentences)):
    #     print('%s:%s' % ('positive' if labels[i] == 1 else 'negative', my_sentences[i]))

    # Words that are missing from the dictionary are not handled here; that is left as an improvement for the next version.
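
One possible way to handle words that are missing from the dictionary, sketched with a plain dict standing in for `dic.token2id`: reserve id 0 for padding and id 1 for unknown words, and shift every real id up by 2. The names `PAD_ID`, `UNK_ID`, `OFFSET` and `to_ids` are hypothetical, as is the toy vocabulary. With this scheme the Embedding layer would need `len(dic) + 2` rows instead of `len(dic) + 1`.

```python
PAD_ID = 0   # value used by padding, never a real word
UNK_ID = 1   # shared id for every word not in the dictionary
OFFSET = 2   # real ids start after the two reserved ones


def to_ids(tokens, token2id):
    """Convert tokens to shifted ids, mapping unseen tokens to UNK_ID."""
    return [token2id[t] + OFFSET if t in token2id else UNK_ID for t in tokens]


# Hypothetical stand-in for dic.token2id (gensim ids start at 0).
token2id = {"房間": 0, "討厭": 1, "乾淨": 2}

ids = to_ids("討厭 房間 蟑螂".split(), token2id)
print(ids)
```

The shift also keeps the first dictionary word from sharing id 0 with the padding value.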

LSTM models (three configurations were tried; "score" is the loss on the test set):
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 100, 64)       748800      embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 100, 128)      98816       embedding_1[0][0]                
____________________________________________________________________________________________________
lstm_2 (LSTM)                    (None, 100, 64)       49408       lstm_1[0][0]                     
____________________________________________________________________________________________________
lstm_3 (LSTM)                    (None, 32)            12416       lstm_2[0][0]                     
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             33          lstm_3[0][0]                     
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]                    
====================================================================================================
Total params: 909,473
Trainable params: 909,473
score: 0.572, accuracy: 0.859

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 100, 64)       748800      embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 128)           98816       embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             129         lstm_1[0][0]                     
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]                    
====================================================================================================
Total params: 847,745
Trainable params: 847,745
Non-trainable params: 0
score: 0.302, accuracy: 0.871

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 100, 128)      1497600     embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 128)           131584      embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             129         lstm_1[0][0]                     
____________________________________________________________________________________________________
activation_1 (Activation)        (None, 1)             0           dense_1[0][0]                    
====================================================================================================
Total params: 1,629,313
Trainable params: 1,629,313
Non-trainable params: 0
score: 0.386, accuracy: 0.874
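
The LSTM rows in these summaries follow from the standard parameter count for an LSTM layer: four gates, each with input weights, recurrent weights and a bias. The sketch below checks the figures above; the vocabulary size of 11,700 is inferred from the embedding rows (748,800 / 64), not stated in the text, and `lstm_params` is a made-up helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with (input_dim + units) weights per unit plus a bias."""
    return 4 * ((input_dim + units) * units + units)


vocab_rows = 748800 // 64  # inferred from the summaries: len(dic) + 1

# Stacked model: 64-d embedding -> LSTM(128) -> LSTM(64) -> LSTM(32) -> Dense(1)
stacked = [lstm_params(64, 128), lstm_params(128, 64), lstm_params(64, 32)]
stacked_total = vocab_rows * 64 + sum(stacked) + (32 + 1)

# Single-layer models with 64-d and 128-d embeddings, each followed by Dense(1)
single_64_total = vocab_rows * 64 + lstm_params(64, 128) + (128 + 1)
single_128_total = vocab_rows * 128 + lstm_params(128, 128) + (128 + 1)

print(stacked, stacked_total)
print(single_64_total, single_128_total)
```

In all three configurations the embedding table holds most of the parameters, while the extra LSTM layers add comparatively few.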

CNN

Here the LSTM above is swapped for a CNN; only the network definition changes.

# Build the CNN model
# Extra imports needed by this fragment
from keras.layers import Conv1D, MaxPooling1D, Flatten

model = Sequential()
model.add(Embedding(len(dic) + 1, 128, input_length=maxword))
# Keras 1 argument names; in Keras 2 they are filters, kernel_size, padding and pool_size
model.add(Conv1D(nb_filter=128, filter_length=5, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Dropout(0.25))
model.add(Conv1D(nb_filter=128, filter_length=3, border_mode='same', activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
# sigmoid gives a probability for binary_crossentropy; the run reported here used activation='relu'
model.add(Dense(1, activation='sigmoid'))
# Compile
model.compile(loss="binary_crossentropy", optimizer="rmsprop", metrics=["accuracy"])
model.summary()
# Train (nb_epoch is the Keras 1 name; Keras 2 calls it epochs)
model.fit(x_train, y_train, batch_size=100, nb_epoch=20, validation_data=(x_test, y_test))
## Evaluate; "score" is the loss on the test set
score, acc = model.evaluate(x_test, y_test, batch_size=100, verbose=1)
print("score: %.3f, accuracy: %.3f" % (score, acc))

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
embedding_1 (Embedding)          (None, 100, 128)      1497600     embedding_input_1[0][0]          
____________________________________________________________________________________________________
convolution1d_1 (Convolution1D)  (None, 100, 128)      82048       embedding_1[0][0]                
____________________________________________________________________________________________________
maxpooling1d_1 (MaxPooling1D)    (None, 50, 128)       0           convolution1d_1[0][0]            
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 50, 128)       0           maxpooling1d_1[0][0]             
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 50, 128)       49280       dropout_1[0][0]                  
____________________________________________________________________________________________________
maxpooling1d_2 (MaxPooling1D)    (None, 25, 128)       0           convolution1d_2[0][0]            
____________________________________________________________________________________________________
dropout_2 (Dropout)              (None, 25, 128)       0           maxpooling1d_2[0][0]             
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 3200)          0           dropout_2[0][0]                  
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 64)            204864      flatten_1[0][0]                  
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 32)            2080        dense_1[0][0]                    
____________________________________________________________________________________________________
dense_3 (Dense)                  (None, 1)             33          dense_2[0][0]                    
====================================================================================================
Total params: 1,835,905
Trainable params: 1,835,905
Non-trainable params: 0
____________________________________________________________________________________________________
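
The shapes and parameter counts in this summary can be traced layer by layer. The sketch below assumes what the summary shows: 'same' padding keeps the sequence length, each pooling layer with a window of 2 halves it, and every convolution has 128 filters. The helper names are made up.

```python
def conv1d_params(kernel_size, in_channels, filters):
    """One kernel of kernel_size x in_channels per filter, plus one bias each."""
    return kernel_size * in_channels * filters + filters


def dense_params(n_in, n_out):
    """A weight per input/output pair, plus one bias per output unit."""
    return n_in * n_out + n_out


length, channels = 100, 128  # maxword and embedding size from the summary
params = []
for kernel_size in (5, 3):
    # 'same' padding leaves the length unchanged; the filters set the channels.
    params.append(conv1d_params(kernel_size, channels, 128))
    channels = 128
    # Max pooling with a window of 2 halves the length.
    length //= 2

flat = length * channels  # what Flatten hands to the first Dense layer
params += [dense_params(flat, 64), dense_params(64, 32), dense_params(32, 1)]
total = 1497600 + sum(params)  # embedding row taken from the summary

print(length, flat)
print(params)
print(total)
```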

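This was mainly a test, and the result is not good: it is worse than the LSTM above. One likely factor is that the run used a relu activation on the output layer, which does not produce a probability for binary_crossentropy. Next steps are to look into improving it, or to study CNNText…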
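
A Dense(1) output with a relu activation, as in the run reported here, is an awkward match for binary cross-entropy, and may account for part of the gap; this is an observation, not a measured diagnosis. The sketch below compares the two activations on a few raw scores. It assumes the loss clips predictions to a small range, as deep-learning frameworks commonly do; `EPS` and the function names are illustrative.

```python
import math

EPS = 1e-7  # assumed clipping constant, for illustration only


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(y_true, p):
    """Binary cross-entropy for one sample, with p clipped into (0, 1)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


for z in (-2.0, 0.5, 3.0):
    print(z, relu(z), round(sigmoid(z), 4))

# A positive example whose raw score is negative:
loss_relu = bce(1, relu(-2.0))       # relu outputs exactly 0
loss_sigmoid = bce(1, sigmoid(-2.0))
print(round(loss_relu, 2), round(loss_sigmoid, 2))
```

With relu, any negative score becomes exactly 0 and any score above 1 is not a probability at all, so the loss saturates at its clipped maximum; relu is also flat for negative scores, so those samples pass no gradient back. A sigmoid output stays strictly between 0 and 1.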
