[Deep Application] · Text Classification with Self-Attention in Keras (How Machines Read the Human Mind)

Companion reading:

[Deep Concepts] · Attention Mechanism Study Notes

[TensorFlow Deep Learning In Depth] Practice 3 · Text Sentiment Analysis with DNN, CNN, and RNN (LSTM)

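In the post [Deep Concepts] · Attention Mechanism Study Notes, I explained the concept and technical details of the Attention mechanism. This post is its practical companion: it implements Self-Attention text classification in Keras to give a more concrete understanding of how Attention works.

For comparison, see [TensorFlow Deep Learning In Depth] Practice 3 · Text Sentiment Analysis with DNN, CNN, and RNN (LSTM), which shows how the different network types differ from and relate to one another on the same task.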

1. Self-Attention Concepts in Detail

Now that we have a rough idea of how the model works, let's look at the Self-Attention structure in detail. Its basic structure is as follows.

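In self-attention, the three matrices Q (Query), K (Key), and V (Value) all come from the same input. We first compute the dot product between Q and K. To keep the result from growing too large, we divide it by a scaling factor $\sqrt{d_k}$, where $d_k$ is the dimension of a single query or key vector. A softmax then normalizes the result into a probability distribution, which is multiplied by the matrix V to give a weighted-sum representation. The operation can be written as $\mathrm{Attention}(Q,K,V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$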
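The formula can be sketched in a few lines of NumPy. This is only an illustration of the equation, not the Keras layer built later in this post; the function names, the toy sizes, and the random inputs are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the output (seq_len, d_v) and the weights (seq_len, seq_len).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot products
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy data (illustrative only): 3 tokens, d_k = d_v = 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` is the probability distribution one token assigns over all tokens, so every row sums to 1.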

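This may seem abstract, so let's look at a concrete example (the figures come from https://jalammar.github.io/illustrated-transformer/, a blog post that explains the topic extremely clearly and is highly recommended). Suppose we want to translate the phrase Thinking Machines, where the input embedding vector of Thinking is denoted $x_1$ and the embedding vector of Machines is denoted $x_2$.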

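When we process the word Thinking, we need to compute the attention score between it and every word in the sentence. This is like treating the current word as a search query and matching it against the keys of all words in the sentence (including the word itself) to see how relevant each one is. Let $q_1$ denote the query vector for Thinking, and let $k_1$ and $k_2$ denote the key vectors for Thinking and Machines respectively. To compute the attention scores for Thinking, we take the dot products of $q_1$ with $k_1$ and $k_2$; likewise, for Machines we take the dot products of $q_2$ with $k_1$ and $k_2$. Once we have the dot products of $q_1$ with $k_1$ and $k_2$, we scale them and normalize them with softmax (both steps are illustrated in the figures of the post cited above).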

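Clearly, a word's attention score with itself is usually the largest, while the other words receive scores according to how relevant they are to the current word. We then multiply these attention scores by the value vectors and sum the results to obtain the weighted vector.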
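To make the steps for Thinking concrete, here is a small worked sketch in plain Python. All numbers are hypothetical and chosen only for illustration: the raw dot products 112 and 96, the key dimension 64, and the two-dimensional value vectors are assumptions, not values taken from a trained model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values, for illustration only.
d_k = 64
raw_scores = [112.0, 96.0]   # assumed q1·k1 and q1·k2
v1 = [1.0, 0.0]              # assumed value vector for "Thinking"
v2 = [0.0, 1.0]              # assumed value vector for "Machines"

# Step 1: scale the dot products by sqrt(d_k).
scaled = [s / math.sqrt(d_k) for s in raw_scores]

# Step 2: normalize the scaled scores with softmax.
weights = softmax(scaled)

# Step 3: weighted sum of the value vectors gives the output for "Thinking".
z1 = [weights[0] * a + weights[1] * b for a, b in zip(v1, v2)]

print(scaled)
print([round(w, 2) for w in weights])
print([round(z, 2) for z in z1])
```

With these assumed numbers the scaled scores are 14 and 12, and the softmax gives weights of roughly 0.88 and 0.12, so the output for Thinking is dominated by its own value vector.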

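If we stack all input vectors into a single matrix $X$, then all the query, key, and value vectors can also be written in matrix form: $Q = XW^Q$, $K = XW^K$, $V = XW^V$

where $W^Q, W^K, W^V$ are parameters learned during training. The whole operation then reduces to the matrix form $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$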
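The matrix form can be checked with a short NumPy sketch. The sizes, the random weights, and the names `X`, `W_Q`, `W_K`, `W_V`, and `Z` are assumptions for illustration; in a real model the three weight matrices would be learned rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 2, 4, 3   # illustrative sizes: 2 words, 4-dim embeddings

# Row i of X is the embedding of word i (e.g. x_1 for Thinking, x_2 for Machines).
X = rng.normal(size=(seq_len, d_model))

# Stand-ins for the learned projection matrices W^Q, W^K, W^V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# One matrix product per projection yields all queries, keys, and values at once.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

# Scaled dot-product attention in matrix form.
scores = Q @ K.T / np.sqrt(d_k)
scores = scores - scores.max(axis=-1, keepdims=True)   # for numerical stability
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
Z = weights @ V

# The first row of Q is exactly the per-word query q_1 = x_1 W^Q.
assert np.allclose(Q[0], X[0] @ W_Q)
print(Q.shape, weights.shape, Z.shape)
```

The per-word view and the matrix view are the same computation; the matrix form simply handles every word in one step.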

2. Building the Self_Attention Model

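I use Keras to build the Self_Attention model. Because the layer has a fairly large number of internal parameters, it is implemented as a custom layer. For how to write custom Keras layers, see the Keras documentation page "Writing your own Keras layers".

A custom Keras layer needs to implement the following three methods (note that input_shape includes the batch_size dimension):

build(input_shape): this is where you define the weights. This method must set self.built = True, which can be done by calling super([Layer], self).build().
call(x): this is where the layer's logic lives. You only need to care about the first argument passed to call, the input tensor, unless you want your layer to support masking.
compute_output_shape(input_shape): if your layer changes the shape of its input, define the shape transformation here so that Keras can infer the shapes of the layers automatically.

The implementation is as follows: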

from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create one trainable kernel that stacks the W^Q, W^K and W^V matrices
        # inputs.shape = (batch_size, time_steps, embedding_dim)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3,input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)

        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape",WQ.shape)

        print("K.permute_dimensions(WK, [0, 2, 1]).shape",K.permute_dimensions(WK, [0, 2, 1]).shape)


        QK = K.batch_dot(WQ,K.permute_dimensions(WK, [0, 2, 1]))

        QK = QK / (64**0.5)

        QK = K.softmax(QK)

        print("QK.shape",QK.shape)

        V = K.batch_dot(QK,WV)

        return V

    def compute_output_shape(self, input_shape):

        return (input_shape[0],input_shape[1],self.output_dim)

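The code can be read alongside the concepts explained in Section 1.

There, all input vectors were stacked into a matrix so that the query, key, and value vectors could also be written in matrix form.

That step corresponds to the following lines (the variables WQ, WK, and WV hold the projected Q, K, and V matrices):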

WQ = K.dot(x, self.kernel[0])
WK = K.dot(x, self.kernel[1])
WV = K.dot(x, self.kernel[2])

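Here $W^Q, W^K, W^V$ are the parameters learned during training, stored as self.kernel[0], self.kernel[1], and self.kernel[2]. With them, the attention operation reduces to its matrix form.

That matrix form corresponds to the following lines. K.batch_dot is used because the tensors carry a leading batch_size dimension, so the product must be taken sample by sample: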

QK = K.batch_dot(WQ,K.permute_dimensions(WK, [0, 2, 1]))
QK = QK / (64**0.5)
QK = K.softmax(QK)
print("QK.shape",QK.shape)
V = K.batch_dot(QK,WV)
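The role of the batched product can be illustrated with a NumPy analogue. This sketch does not call the Keras backend; it assumes that, for these 3-D shapes, the combination of `K.permute_dimensions` and `K.batch_dot` behaves like a per-sample matrix product, which is consistent with the shapes the script prints (`(?, 64, 128)` times `(?, 128, 64)` giving `(?, 64, 64)`). The sizes below are illustrative.

```python
import numpy as np

batch_size, time_steps, d = 2, 5, 8   # illustrative sizes
rng = np.random.default_rng(0)
WQ = rng.normal(size=(batch_size, time_steps, d))   # projected queries
WK = rng.normal(size=(batch_size, time_steps, d))   # projected keys

# Swap the last two axes of every sample's key matrix,
# analogous to K.permute_dimensions(WK, [0, 2, 1]).
WK_T = np.transpose(WK, (0, 2, 1))

# np.matmul treats the leading axis as a batch axis and multiplies sample by sample.
QK = np.matmul(WQ, WK_T)

# The same result computed explicitly, one sample at a time.
per_sample = np.stack([WQ[i] @ WK[i].T for i in range(batch_size)])

assert np.allclose(QK, per_sample)
print(WK_T.shape, QK.shape)
```

A plain 2-D matrix product would mix rows from different samples; the batched product keeps each review's attention matrix separate.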

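The line QK = QK / (64**0.5) divides the scores by a scaling factor. The value (64**0.5) is my own choice, and other articles may use a different one. Note that the formula in Section 1 uses $\sqrt{d_k}$; with output_dim set to 128 as in the model below, that would correspond to 128**0.5.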
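The effect of the divisor can be seen in a small plain-Python sketch. The raw scores are hypothetical; the three divisors compared are no scaling, the article's 64**0.5, and 128**0.5.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw dot products for one query against three keys.
raw_scores = [16.0, 8.0, 0.0]

# Candidate divisors: none, the article's 64**0.5, and sqrt(d_k) with d_k = 128.
divisors = {"none": 1.0, "64**0.5": 64 ** 0.5, "128**0.5": 128 ** 0.5}

results = {}
for name, divisor in divisors.items():
    results[name] = softmax([s / divisor for s in raw_scores])
    print(name, [round(w, 3) for w in results[name]])
```

Without scaling, almost all of the weight lands on the largest score; a larger divisor spreads the weights more evenly, which is the motivation for dividing before the softmax.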

3. Training the Network

The complete project code is below. It uses the IMDB movie-review dataset bundled with Keras.

#%%

from keras.preprocessing import sequence
from keras.datasets import imdb
from matplotlib import pyplot as plt
import pandas as pd

from keras import backend as K
from keras.engine.topology import Layer


class Self_Attention(Layer):

    def __init__(self, output_dim, **kwargs):
        self.output_dim = output_dim
        super(Self_Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create one trainable kernel that stacks the W^Q, W^K and W^V matrices
        # inputs.shape = (batch_size, time_steps, embedding_dim)
        self.kernel = self.add_weight(name='kernel',
                                      shape=(3,input_shape[2], self.output_dim),
                                      initializer='uniform',
                                      trainable=True)

        super(Self_Attention, self).build(input_shape)  # must be called at the end

    def call(self, x):
        WQ = K.dot(x, self.kernel[0])
        WK = K.dot(x, self.kernel[1])
        WV = K.dot(x, self.kernel[2])

        print("WQ.shape",WQ.shape)

        print("K.permute_dimensions(WK, [0, 2, 1]).shape",K.permute_dimensions(WK, [0, 2, 1]).shape)


        QK = K.batch_dot(WQ,K.permute_dimensions(WK, [0, 2, 1]))

        QK = QK / (64**0.5)

        QK = K.softmax(QK)

        print("QK.shape",QK.shape)

        V = K.batch_dot(QK,WV)

        return V

    def compute_output_shape(self, input_shape):

        return (input_shape[0],input_shape[1],self.output_dim)

max_features = 20000



print('Loading data...')

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# Convert the labels to one-hot encoding
y_train, y_test = pd.get_dummies(y_train),pd.get_dummies(y_test)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')



#%% Preprocessing: pad or truncate every review to a fixed length

maxlen = 64


print('Pad sequences (samples x time)')

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)

x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

print('x_train shape:', x_train.shape)

print('x_test shape:', x_test.shape)

#%%

batch_size = 32
from keras.models import Model
from keras.optimizers import SGD,Adam
from keras.layers import *
# from Attention_keras import Attention,Position_Embedding  # unused in this script; needs a local Attention_keras.py


S_inputs = Input(shape=(64,), dtype='int32')

embeddings = Embedding(max_features, 128)(S_inputs)


O_seq = Self_Attention(128)(embeddings)


O_seq = GlobalAveragePooling1D()(O_seq)

O_seq = Dropout(0.5)(O_seq)

outputs = Dense(2, activation='softmax')(O_seq)


model = Model(inputs=S_inputs, outputs=outputs)

print(model.summary())
# try using different optimizers and different optimizer configs
opt = Adam(lr=0.0002,decay=0.00001)
loss = 'categorical_crossentropy'
model.compile(loss=loss,

             optimizer=opt,

             metrics=['accuracy'])

#%%
print('Train...')

h = model.fit(x_train, y_train,

         batch_size=batch_size,

         epochs=5,

         validation_data=(x_test, y_test))

plt.plot(h.history["loss"],label="train_loss")
plt.plot(h.history["val_loss"],label="val_loss")
plt.plot(h.history["acc"],label="train_acc")
plt.plot(h.history["val_acc"],label="val_acc")
plt.legend()
plt.show()

#model.save("imdb.h5")
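The padding step in the script above fixes every review to maxlen = 64 tokens. The sketch below imitates that behavior in plain Python. `pad_pre` is a hypothetical helper written for this illustration, and it assumes the `pad_sequences` defaults of padding and truncating at the front with the value 0.

```python
def pad_pre(seq, maxlen, value=0):
    """Keep the last `maxlen` items of `seq` and left-pad with `value`.

    Hypothetical stand-in for pad_sequences with its assumed defaults
    (padding='pre', truncating='pre', value=0), applied to one sequence.
    """
    truncated = seq[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short_review = [5, 8, 2]               # shorter than maxlen: zeros added in front
long_review = [1, 2, 3, 4, 5, 6, 7]    # longer than maxlen: earliest tokens dropped

print(pad_pre(short_review, maxlen=5))
print(pad_pre(long_review, maxlen=5))
```

Under these assumptions, a long review keeps only its final 64 word indices, so the model only ever sees the end of each review.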

4. Output

(TF_GPU) D:\Files\DATAs\prjs\python\tf_keras\transfromerdemo>C:/Files/APPs/RuanJian/Miniconda3/envs/TF_GPU/python.exe d:/Files/DATAs/prjs/python/tf_keras/transfromerdemo/train.1.py
Using TensorFlow backend.
Loading data...
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 64)
x_test shape: (25000, 64)
WQ.shape (?, 64, 128)
K.permute_dimensions(WK, [0, 2, 1]).shape (?, 128, 64)
QK.shape (?, 64, 64)
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 64)                0
_________________________________________________________________
embedding_1 (Embedding)      (None, 64, 128)           2560000
_________________________________________________________________
self__attention_1 (Self_Atte (None, 64, 128)           49152
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 2,609,410
Trainable params: 2,609,410
Non-trainable params: 0
_________________________________________________________________
None
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 17s 693us/step - loss: 0.5244 - acc: 0.7514 - val_loss: 0.3834 - val_acc: 0.8278
Epoch 2/5
25000/25000 [==============================] - 15s 615us/step - loss: 0.3257 - acc: 0.8593 - val_loss: 0.3689 - val_acc: 0.8368
Epoch 3/5
25000/25000 [==============================] - 15s 614us/step - loss: 0.2602 - acc: 0.8942 - val_loss: 0.3909 - val_acc: 0.8303
Epoch 4/5
25000/25000 [==============================] - 15s 618us/step - loss: 0.2078 - acc: 0.9179 - val_loss: 0.4482 - val_acc: 0.8215
Epoch 5/5
25000/25000 [==============================] - 15s 619us/step - loss: 0.1639 - acc: 0.9368 - val_loss: 0.5313 - val_acc: 0.8106
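The parameter counts in the model summary above can be reproduced with simple arithmetic. The sketch below uses only the sizes that appear in the script (20000 words, 128-dimensional embeddings, output_dim of 128, 2 classes); the variable names are chosen for this illustration.

```python
max_features = 20000   # vocabulary size passed to Embedding
embed_dim = 128        # embedding dimension
attn_dim = 128         # output_dim of Self_Attention
num_classes = 2        # units in the final Dense layer

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = max_features * embed_dim

# Self_Attention: a single kernel of shape (3, embed_dim, attn_dim), no bias.
attention_params = 3 * embed_dim * attn_dim

# Dense: weight matrix plus one bias per class.
dense_params = attn_dim * num_classes + num_classes

total_params = embedding_params + attention_params + dense_params
print(embedding_params, attention_params, dense_params, total_params)
```

The embedding table accounts for almost all of the 2,609,410 parameters; the attention layer itself contributes only 49,152.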

5. Reference

1.https://zhuanlan.zhihu.com/p/47282410
