tf.keras Introduction (2): Film Review Text Classification (IMDB Dataset)

阿新 • Published: 2018-12-11

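Film Review Text Classification

This post uses the IMDB dataset, which contains the text of 50,000 film reviews from the Internet Movie Database. The reviews are split into a training set (25,000 reviews) and a test set (25,000 reviews). Both sets are balanced, meaning each contains equal numbers of positive and negative reviews.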

API Explanation

train_data = keras.preprocessing.sequence.pad_sequences(train_data, value = word_index["<PAD>"], padding='post', maxlen=256)

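The reviews differ in length, so they have to be brought to a uniform length. The ready-made pad_sequences API does this directly.

How does it work? See the figure below: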
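The padding behaviour can also be sketched on toy data without TensorFlow. The helper below, `pad_post`, is a hypothetical name of my own and only mimics `pad_sequences` with `padding='post'`. It assumes truncation is left at the Keras default (`truncating='pre'`), so a sequence longer than `maxlen` keeps its last `maxlen` items.

```python
def pad_post(sequences, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post', maxlen=maxlen) on plain lists.

    Assumption: truncating stays at the Keras default ('pre'), so sequences
    longer than maxlen keep their LAST maxlen items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # drop items from the front if too long
        padded.append(seq + [value] * (maxlen - len(seq)))  # fill the tail with the pad value
    return padded


reviews = [[1, 14, 22], [1, 5, 9, 8, 7, 6], [1]]  # toy "reviews" of different lengths
for row in pad_post(reviews, maxlen=4):
    print(row)
# [1, 14, 22, 0]
# [9, 8, 7, 6]
# [1, 0, 0, 0]
```

Every row comes out with exactly `maxlen` entries, which is what lets the reviews be stacked into a single tensor.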
model.add(keras.layers.Dense(16, activation=tf.nn.relu))

Adds a fully connected layer of 16 neurons with ReLU activation to the model.
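As a rough sketch of what such a layer computes, the forward pass is relu(x · W + b). The function name `dense_relu` and the random weights are assumptions for illustration only. In the real layer Keras creates and trains the weights itself.

```python
import numpy as np

def dense_relu(x, weights, bias):
    """Forward pass of a fully connected layer: relu(x @ weights + bias)."""
    return np.maximum(0.0, x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))          # batch of 2 samples, 16 features each
weights = rng.normal(size=(16, 16))   # hypothetical weights: 16 inputs -> 16 units
bias = np.zeros(16)

out = dense_relu(x, weights, bias)
print(out.shape)          # (2, 16)
print((out >= 0).all())   # True: ReLU never outputs negative values
```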
keras.layers.Embedding(vocab_size,16)

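It can only be used as the first layer of the network.

The key layer here is Embedding, which is related to Word2vec. Since this touches on NLP background, I will fill in the specific algorithm later when I have time.

Its output has shape (batch, sequence, embedding).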
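The output shape can be illustrated with a plain NumPy lookup. This is only a sketch of the idea: the table below is random and its name is my own, whereas the real Embedding layer learns its table during training.

```python
import numpy as np

vocab_size, embedding_dim = 10000, 16
rng = np.random.default_rng(0)
# Hypothetical embedding table: one 16-dimensional row per word index.
table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 padded reviews, 5 word indices each.
batch = np.array([[1, 14, 22, 0, 0],
                  [1, 5, 9, 8, 7]])

embedded = table[batch]   # integer indexing picks one table row per word index
print(embedded.shape)     # (2, 5, 16), i.e. (batch, sequence, embedding)
```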
model.add(keras.layers.GlobalAveragePooling1D())

By averaging over the sequence dimension, it returns a fixed-length output vector for each sample.
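In NumPy terms this is a mean over axis 1. The toy array below is made up for illustration. Note that without masking, padded positions are averaged in together with the real words.

```python
import numpy as np

# Toy embedded batch with shape (batch=2, sequence=3, embedding=2).
embedded = np.array([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                     [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]]])

pooled = embedded.mean(axis=1)   # average over the sequence axis
print(pooled.shape)              # (2, 2): the sequence axis is gone
print(pooled.tolist())           # [[3.0, 4.0], [2.0, 2.0]]
```

Whatever the sequence length, each sample ends up as one vector of embedding size, which is what the following Dense layer needs.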
model.summary() # print the network structure

binary_crossentropy

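Also called log loss. It is generally used for binary classification (with a sigmoid activation on the output):

$$-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

where $y_i \in \{0, 1\}$ is the true label of the ith sample, $\hat{y}_i$ is the predicted probability of the positive class, and N is the number of samples.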

If it is a multiclass problem, you have to use categorical_crossentropy.

The equation for categorical cross entropy is

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} 1_{y_i \in C_c} \log p_{model}[y_i \in C_c]$$

The double sum is over the observations i, whose number is N, and the categories c, whose number is C. The term $1_{y_i \in C_c}$ is the indicator function of the ith observation belonging to the cth category. The $p_{model}[y_i \in C_c]$ is the probability predicted by the model for the ith observation to belong to the cth category. When there are more than two categories, the neural network outputs a vector of C probabilities, each giving the probability that the network input should be classified as belonging to the respective category. When the number of categories is just two, the neural network outputs a single probability $\hat{y}_i$, with the other one being 1 minus the output. This is why the binary cross entropy looks a bit different from categorical cross entropy, despite being a special case of it.
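The special-case claim can be checked numerically. The two functions below are my own plain-Python sketches of the mean losses, not the Keras implementations, and the labels and predictions are made-up toy values.

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean binary cross-entropy for labels in {0, 1} and predictions in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

def categorical_crossentropy(y_true, probs):
    """Mean categorical cross-entropy; y_true holds class indices,
    probs holds one row of C probabilities per observation."""
    total = 0.0
    for c, row in zip(y_true, probs):
        total += math.log(row[c])   # the indicator keeps only the true-class term
    return -total / len(y_true)

labels = [1, 0, 1]
preds = [0.9, 0.2, 0.6]                   # predicted probability of class 1
two_class = [[1 - p, p] for p in preds]   # same predictions as [P(class 0), P(class 1)]

print(binary_crossentropy(labels, preds))
print(categorical_crossentropy(labels, two_class))
```

Both calls print the same value, because with C = 2 the indicator selects either log(p) or log(1 - p) for each observation.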

history = model.fit()

model.fit() returns a History object, which contains a dictionary recording everything that happened during training.
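The dictionary maps each metric name to a list with one value per epoch. As far as I know, the accuracy key is 'acc' in TensorFlow 1.x and 'accuracy' in TensorFlow 2.x, so a small lookup helper keeps later code working with either. The helper name `pick_metric` and the numbers below are made up for illustration and are not real training results.

```python
def pick_metric(history_dict, *candidates):
    """Return the first metric series found under any of the candidate keys."""
    for key in candidates:
        if key in history_dict:
            return history_dict[key]
    raise KeyError("none of {} found in {}".format(candidates, list(history_dict)))

# Hypothetical dictionary with made-up numbers, shaped like history.history.
history_dict = {
    "loss": [0.69, 0.60], "accuracy": [0.55, 0.71],
    "val_loss": [0.68, 0.58], "val_accuracy": [0.60, 0.74],
}

acc = pick_metric(history_dict, "acc", "accuracy")
val_acc = pick_metric(history_dict, "val_acc", "val_accuracy")
print(sorted(history_dict.keys()))
print(acc, val_acc)
```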

Network Structure

Code

main.py

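'''
Classify film reviews in text form as "positive" or "negative". This is an example of binary (two-class) classification, an important and widely applicable kind of machine learning problem.

The IMDB dataset ships with TensorFlow. It has already been preprocessed:
each review (a sequence of words) has been converted to a sequence of integers, where each integer represents a specific word in a dictionary.
'''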
import tensorflow as tf 
from tensorflow import keras
import numpy as np
from plot import plot


# num_words=10000 keeps the 10,000 most frequent words in the training data. Rare words are discarded to keep the data size manageable.
imdb = keras.datasets.imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Check the size of the training set
print("Training entries: {}, labels: {}".format(len(train_data), len(train_labels)))

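# Look at the training data: each review is a list of integers, one per word in the dictionary
print(train_data[0],'\n',len(train_data[0]))
print(train_data[1],'\n',len(train_data[1]))
print(type(train_data))
# The reviews differ in length, but a neural network generally needs inputs of the same length; the fix follows below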



# Convert integers back to words
word_index = imdb.get_word_index()
word_index = {k:(v+3) for k,v in word_index.items()}  # shift by 3 to make room for the special tokens
word_index["<PAD>"] = 0  # padding token
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index.get(i,'?') for i in text]) # look up each word in the dictionary; fall back to '?' if missing

# Test: convert integers back to words
print(decode_review(train_data[0]))



# To give every input tensor the same length, we fix the length at maxlen
# and use the pad_sequences function to standardize it
train_data = keras.preprocessing.sequence.pad_sequences(train_data, 
                                                        value = word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=256)

test_data = keras.preprocessing.sequence.pad_sequences(test_data, 
                                                       value = word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=256)                            

# Inspect the data after padding
print(train_data[0],'\n',len(train_data[0]))


# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000

model = keras.Sequential()
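# This layer looks up the embedding vector for each word index in the integer-encoded vocabulary.
# The model learns these vectors as it trains. The vectors add a dimension to the output array.
model.add(keras.layers.Embedding(vocab_size,16)) # batch * sequence * 16
# Averaging over the sequence dimension returns a fixed-length output vector for each sample.
# This lets the model handle inputs of varying length in the simplest possible way.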
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1,activation=tf.nn.sigmoid))
model.summary() # print the model's structure


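# The model needs a loss function and an optimizer for training. Since this is a binary classification problem
# and the model outputs a probability (a single-unit layer with a sigmoid activation),
# we use the binary_crossentropy loss function.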
model.compile(optimizer='adam',  # Adam, selected by name (tf.train.AdamOptimizer exists only in TF 1.x)
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Set aside 10,000 samples from the original training data to create a validation set.
x_val = train_data[:10000]
partial_x_train = train_data[10000:]

y_val = train_labels[:10000]
partial_y_train = train_labels[10000:]

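# Train the model for 22 epochs in mini-batches of 512 samples, i.e. 22 passes over all samples in
# partial_x_train and partial_y_train. During training, monitor loss and accuracy on the 10,000 validation samples:
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=22,
                    batch_size=512,
                    validation_data=(x_val, y_val), # validate after every epoch
                    verbose=1)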

# Evaluate the model
results = model.evaluate(test_data, test_labels)
print(results)

# model.fit() returns a History object, which contains a dictionary recording everything that happened during training:
history_dict = history.history
print(history_dict.keys())

# Call the function from the other file to draw the plots
plot(history_dict)

plot.py

import matplotlib.pyplot as plt

def plot(history_dict):
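    # There are 4 entries, one for each metric monitored during training and validation.
    # We can use them to plot training loss against validation loss, and training accuracy against validation accuracy: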
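    # The accuracy key is 'acc' in TF 1.x and 'accuracy' in TF 2.x; accept either
    acc = history_dict.get('acc', history_dict.get('accuracy'))
    val_acc = history_dict.get('val_acc', history_dict.get('val_accuracy'))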
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    epochs = range(1,len(acc)+1)

    plt.figure(1)
    # "bo" is for blue dot
    plt.plot(epochs,loss,'bo',label='Training Loss')
    # "b" is for "solid blue line"
    plt.plot(epochs,val_loss,'b',label='Validation Loss')
    plt.title("Training and Validation Loss")
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.legend()

    plt.figure(2)
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.show()
