Fun with Deep Learning | 26 Seq2Seq Machine Translation

Introduction

This post shows how to implement Neural Machine Translation (NMT) with Sequence to Sequence Learning (Seq2Seq).

Principle

Previously we implemented Chinese word segmentation with a sequence labeling model; sequence labeling is one kind of Seq2Seq problem.

Common settings of Seq2Seq Learning

This time we use Seq2Seq for NMT. Since both the input sentence and the output sentence contain multiple words, and their lengths are not necessarily equal, this corresponds to the fourth case in the figure above.

The simplest approach is to first encode the entire input sentence into a fixed-length vector representation, and then decode it step by step into the translated sentence. Both the encoder and the decoder can be implemented with RNNs.

The Encoder-Decoder model

For the RNN type you can choose LSTM or GRU, and extensions such as multi-layer LSTMs or bidirectional LSTMs are also worth considering, as sketched below.
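
As a minimal, hedged sketch (not part of the original code), these are the TF 1.x cell constructors behind the choices just mentioned; the implementation later in this post uses BasicLSTMCell, and a bidirectional encoder simply runs a forward and a backward cell over the inputs:

import tensorflow as tf

hidden_size = 512

# cell choices mentioned above (TF 1.x APIs); the code below uses BasicLSTMCell
lstm = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
gru = tf.nn.rnn_cell.GRUCell(hidden_size)
stacked_lstm = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(hidden_size) for _ in range(2)])  # two-layer LSTM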

The Seq2Seq machine translation model

An attention mechanism can also be added: for the encoder output at every input position, attention weights are computed and a weighted combination is taken.

  • Instead of using only the encoder's last output, use the encoder output at every step, similar to the image patches in image caption generation
  • Each time the decoder generates a token, it first computes attention weights from the relationship between the current decoder state and each encoder output
  • The encoder outputs are then weighted and summed with these weights to obtain the context vector used at this step
  • Based on the context and the previous output, the decoder updates its state and produces the next output

$$\alpha_{ts}=\frac{\exp(\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s))}{\sum^{S}_{s'=1}\exp(\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_{s'}))}$$

$$\mathbf{c}_t=\sum^{S}_{s=1}\alpha_{ts}\mathbf{\bar{h}}_s$$

$$\mathbf{a}_t=f(\mathbf{c}_t,\mathbf{h}_t)=\tanh(\mathbf{W}_c[\mathbf{c}_t;\mathbf{h}_t])$$

The attention-based machine translation model

There are two main families of score functions for computing the attention weights, a multiplicative one and an additive one; the former is known as Luong's multiplicative style and the latter as Bahdanau's additive style (a NumPy sketch of both follows the formulas).

$$\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s)=\mathbf{h}_t^\top \mathbf{W} \mathbf{\bar{h}}_s$$

$$\mathrm{score}(\mathbf{h}_t,\mathbf{\bar{h}}_s)=\mathbf{v}_a^\top \tanh(\mathbf{W}_1 \mathbf{h}_t+\mathbf{W}_2 \mathbf{\bar{h}}_s)$$
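
A minimal NumPy sketch of these formulas (all shapes, names, and random values here are illustrative only, not taken from the TensorFlow implementation below):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

S, n = 7, 512                        # source length and hidden size (made up)
H = np.random.randn(S, n)            # encoder outputs \bar{h}_s, one row per source position
h_t = np.random.randn(n)             # current decoder state h_t

# Luong (multiplicative) score: h_t^T W \bar{h}_s, for every s
W = np.random.randn(n, n)
score_mul = H @ (W.T @ h_t)          # shape (S,)

# Bahdanau (additive) score: v^T tanh(W1 h_t + W2 \bar{h}_s)
W1, W2, v = np.random.randn(n, n), np.random.randn(n, n), np.random.randn(n)
score_add = np.tanh(W1 @ h_t + H @ W2.T) @ v   # shape (S,)

alpha = softmax(score_mul)           # attention weights alpha_{ts}
c_t = alpha @ H                      # context vector c_t
W_c = np.random.randn(n, 2 * n)
a_t = np.tanh(W_c @ np.concatenate([c_t, h_t]))   # attention vector a_t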

Data

We use the Chinese-English parallel corpus provided by the NiuTrans open-source community, http://www.niutrans.com/. After cleaning, the training set contains 100K sentence pairs, the validation set 1K pairs, and the test set 400 pairs.

Implementation

Here we mainly use the APIs provided by TensorFlow to implement Seq2Seq learning, attention, and beam search, following this project: https://github.com/tensorflow/nmt/

The code covers three parts: training, evaluation, and inference

  • Training: train the model on the training set and compute the loss
  • Evaluation: evaluate the model on the validation set and compute the loss
  • Inference: apply the model to the test set without computing the loss; sequences are generated with beam search and evaluated with the BLEU metric (a toy beam-search sketch follows this list)
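
Beam search keeps the beam_width highest-scoring partial translations at every step instead of only the single best one. A toy, self-contained sketch of the idea (the real decoding below uses TensorFlow's BeamSearchDecoder; next_log_probs is a hypothetical scoring function):

import numpy as np

def beam_search(next_log_probs, start_id, end_id, beam_width=10, max_len=50):
    """Toy beam search; next_log_probs(prefix) is assumed to return
    log-probabilities over the vocabulary for the next token."""
    beams = [([start_id], 0.0)]                  # (prefix, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = next_log_probs(prefix)   # shape: (vocab_size,)
            for w in np.argsort(log_probs)[-beam_width:]:
                candidates.append((prefix + [int(w)], score + log_probs[w]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_width]:
            (finished if prefix[-1] == end_id else beams).append((prefix, score))
        if not beams:                            # all surviving beams have ended
            break
    return max(finished + beams, key=lambda c: c[1])[0]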

Load the libraries

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.utils import shuffle
from keras.preprocessing.sequence import pad_sequences
import os
from tqdm import tqdm
import pickle

Load the Chinese and English vocabularies, keeping the 20,000 most frequent words; all other words are represented by <unk>

def load_vocab(path):
    with open(path, 'r') as fr:
        vocab = fr.readlines()
        vocab = [w.strip('\n') for w in vocab]
    return vocab

vocab_ch = load_vocab('data/vocab.ch')
vocab_en = load_vocab('data/vocab.en')
print(len(vocab_ch), vocab_ch[:20])
print(len(vocab_en), vocab_en[:20])

word2id_ch = {w: i for i, w in enumerate(vocab_ch)}
id2word_ch = {i: w for i, w in enumerate(vocab_ch)}
word2id_en = {w: i for i, w in enumerate(vocab_en)}
id2word_en = {i: w for i, w in enumerate(vocab_en)}
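
A quick sanity check of the mappings (the sentence is made up, and it assumes the vocab file contains an explicit <unk> entry, as in the tensorflow/nmt vocab format; unknown words fall back to it):

sample = '今天 天氣 很 好'.split(' ')   # a made-up tokenized sentence
ids = [word2id_ch.get(w, word2id_ch['<unk>']) for w in sample]
print(ids)
print(' '.join(id2word_ch[i] for i in ids))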

Load the training, validation, and test data, compute the maximum sequence length on the Chinese and English sides, and pad the corresponding data according to mode

def load_data(path, word2id):
    with open(path, 'r') as fr:
        lines = fr.readlines()
        sentences = [line.strip('\n').split(' ') for line in lines]
        # words outside the 20,000-word vocabulary are mapped to <unk>
        sentences = [[word2id['<s>']] + [word2id.get(w, word2id['<unk>']) for w in sentence] + [word2id['</s>']]
                     for sentence in sentences]
        
        lens = [len(sentence) for sentence in sentences]
        maxlen = np.max(lens)
        return sentences, lens, maxlen

# train: training, no beam search, calculate loss
# eval: no training, no beam search, calculate loss
# infer: no training, beam search, calculate bleu
mode = 'train'

train_ch, len_train_ch, maxlen_train_ch = load_data('data/train.ch', word2id_ch)
train_en, len_train_en, maxlen_train_en = load_data('data/train.en', word2id_en)
dev_ch, len_dev_ch, maxlen_dev_ch = load_data('data/dev.ch', word2id_ch)
dev_en, len_dev_en, maxlen_dev_en = load_data('data/dev.en', word2id_en)
test_ch, len_test_ch, maxlen_test_ch = load_data('data/test.ch', word2id_ch)
test_en, len_test_en, maxlen_test_en = load_data('data/test.en', word2id_en)

maxlen_ch = np.max([maxlen_train_ch, maxlen_dev_ch, maxlen_test_ch])
maxlen_en = np.max([maxlen_train_en, maxlen_dev_en, maxlen_test_en])
print(maxlen_ch, maxlen_en)

if mode == 'train':
    train_ch = pad_sequences(train_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    train_en = pad_sequences(train_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(train_ch.shape, train_en.shape)
elif mode == 'eval':
    dev_ch = pad_sequences(dev_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    dev_en = pad_sequences(dev_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(dev_ch.shape, dev_en.shape)
elif mode == 'infer':
    test_ch = pad_sequences(test_ch, maxlen=maxlen_ch, padding='post', value=word2id_ch['</s>'])
    test_en = pad_sequences(test_en, maxlen=maxlen_en, padding='post', value=word2id_en['</s>'])
    print(test_ch.shape, test_en.shape)
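
A minimal sketch (with made-up ids) of what pad_sequences does here: every sequence is padded at the end to the same length with the given value, which above is the id of </s>:

demo = [[2, 5, 7], [2, 9, 4, 6, 3]]
print(pad_sequences(demo, maxlen=6, padding='post', value=0))
# [[2 5 7 0 0 0]
#  [2 9 4 6 3 0]]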

Define four placeholders and embed the inputs

X = tf.placeholder(tf.int32, [None, maxlen_ch])
X_len = tf.placeholder(tf.int32, [None])
Y = tf.placeholder(tf.int32, [None, maxlen_en])
Y_len = tf.placeholder(tf.int32, [None])
Y_in = Y[:, :-1]
Y_out = Y[:, 1:]

k_initializer = tf.contrib.layers.xavier_initializer()
e_initializer = tf.random_uniform_initializer(-1.0, 1.0)

embedding_size = 512
hidden_size = 512

if mode == 'train':
    batch_size = 128
else:
    batch_size = 16

with tf.variable_scope('embedding_X'):
    embeddings_X = tf.get_variable('weights_X', [len(word2id_ch), embedding_size], initializer=e_initializer)
    embedded_X = tf.nn.embedding_lookup(embeddings_X, X) # batch_size, seq_len, embedding_size
    
with tf.variable_scope('embedding_Y'):
    embeddings_Y = tf.get_variable('weights_Y', [len(word2id_en), embedding_size], initializer=e_initializer)
    embedded_Y = tf.nn.embedding_lookup(embeddings_Y, Y_in) # batch_size, seq_len, embedding_size
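
Y_in and Y_out above implement the usual one-step shift for teacher forcing; a toy illustration with tokens instead of ids (the sentence is made up):

y = ['<s>', 'he', 'is', 'happy', '</s>', '</s>']   # a made-up padded target sentence
y_in, y_out = y[:-1], y[1:]
print(y_in)    # ['<s>', 'he', 'is', 'happy', '</s>']  -> fed to the decoder
print(y_out)   # ['he', 'is', 'happy', '</s>', '</s>'] -> what the decoder should predict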

Define the encoder, using a bidirectional LSTM

def single_cell(mode=mode):
    if mode == 'train':
        keep_prob = 0.8
    else:
        keep_prob = 1.0
    cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
    cell = tf.nn.rnn_cell.DropoutWrapper(cell, input_keep_prob=keep_prob)
    return cell

def multi_cells(num_layers):
    cells = []
    for i in range(num_layers):
        cell = single_cell()
        cells.append(cell)
    return tf.nn.rnn_cell.MultiRNNCell(cells)
    
with tf.variable_scope('encoder'):
    num_layers = 1
    fw_cell = multi_cells(num_layers)
    bw_cell = multi_cells(num_layers)
    bi_outputs, bi_state = tf.nn.bidirectional_dynamic_rnn(fw_cell, bw_cell, embedded_X, dtype=tf.float32,
                                                           sequence_length=X_len)
    # fw: batch_size, seq_len, hidden_size
    # bw: batch_size, seq_len, hidden_size
    print('=' * 100, '\n', bi_outputs)
    
    encoder_outputs = tf.concat(bi_outputs, -1)
    print('=' * 100, '\n', encoder_outputs) # batch_size, seq_len, 2 * hidden_size
    
    # 2 tuple(fw & bw), 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100, '\n', bi_state)
    
    encoder_state = []
    for i in range(num_layers):
        encoder_state.append(bi_state[0][i])  # forward
        encoder_state.append(bi_state[1][i])  # backward
    encoder_state = tuple(encoder_state) # 2 tuple, 2 tuple(c & h), batch_size, hidden_size
    print('=' * 100)
    for i in range(len(encoder_state)):
        print(i, encoder_state[i])

Define the decoder, using a two-layer LSTM; it is built with num_layers * 2 cells so that its initial state can be filled directly with the forward and backward states from the encoder

with tf.variable_scope('decoder'):
    beam_width = 10
    memory = encoder_outputs
    
    if mode == 'infer':
        memory = tf.contrib.seq2seq.tile_batch(memory, beam_width)
        X_len = tf.contrib.seq2seq.tile_batch(X_len, beam_width)
        encoder_state = tf.contrib.seq2seq.tile_batch(encoder_state, beam_width)
        bs = batch_size * beam_width
    else:
        bs = batch_size
    
    attention = tf.contrib.seq2seq.LuongAttention(hidden_size, memory, X_len, scale=True) # multiplicative
    # attention = tf.contrib.seq2seq.BahdanauAttention(hidden_size, memory, X_len, normalize=True) # additive
    cell = multi_cells(num_layers * 2)
    cell = tf.contrib.seq2seq.AttentionWrapper(cell, attention, hidden_size, name='attention')
    decoder_initial_state = cell.zero_state(bs, tf.float32).clone(cell_state=encoder_state)
    
    with tf.variable_scope('projected'):
        output_layer = tf.layers.Dense(len(word2id_en), use_bias=False, kernel_initializer=k_initializer)
    
    if mode == 'infer':
        start = tf.fill([batch_size], word2id_en['<s>'])
        decoder = tf.contrib.seq2seq.BeamSearchDecoder(cell, embeddings_Y, start, word2id_en['</s>'],
                                                       decoder_initial_state, beam_width, output_layer)
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder,
                                                                            output_time_major=True,
                                                                            maximum_iterations=2 * tf.reduce_max(X_len))
        sample_id = outputs.predicted_ids
    else:
        helper = tf.contrib.seq2seq.TrainingHelper(embedded_Y, [maxlen_en - 1 for b in range(batch_size)])
        decoder = tf.contrib.seq2seq.BasicDecoder(cell, helper, decoder_initial_state, output_layer)
        
        outputs, final_context_state, _ = tf.contrib.seq2seq.dynamic_decode(decoder, 
                                                                            output_time_major=True)
        logits = outputs.rnn_output
        logits = tf.transpose(logits, (1, 0, 2))
        print(logits)

Depending on mode, define the loss function and optimizer where needed

if mode != 'infer':
    with tf.variable_scope('loss'):
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y_out, logits=logits)
        mask = tf.sequence_mask(Y_len, tf.shape(Y_out)[1], tf.float32)
        loss = tf.reduce_sum(loss * mask) / batch_size

if mode == 'train':
    learning_rate = tf.Variable(0.0, trainable=False)
    params = tf.trainable_variables()
    grads, _ = tf.clip_by_global_norm(tf.gradients(loss, params), 5.0)
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).apply_gradients(zip(grads, params))

Training code. After 20 epochs, the training loss drops from over 200 to 52.19 and the perplexity drops to 5.53

sess = tf.Session()
sess.run(tf.global_variables_initializer())

if mode == 'train':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    if not os.path.exists(OUTPUT_DIR):
        os.mkdir(OUTPUT_DIR)
        
    tf.summary.scalar('loss', loss)
    summary = tf.summary.merge_all()
    writer = tf.summary.FileWriter(OUTPUT_DIR)
        
    epochs = 20
    for e in range(epochs):
        total_loss = 0
        total_count = 0
        
        start_decay = int(epochs * 2 / 3)
        if e <= start_decay:
            lr = 1.0
        else:
            decay = 0.5 ** (int(4 * (e - start_decay) / (epochs - start_decay)))
            lr = 1.0 * decay
        sess.run(tf.assign(learning_rate, lr))
        
        train_ch, len_train_ch, train_en, len_train_en = shuffle(train_ch, len_train_ch, train_en, len_train_en)
        
        for i in tqdm(range(train_ch.shape[0] // batch_size)):
            X_batch = train_ch[i * batch_size: i * batch_size + batch_size]
            X_len_batch = len_train_ch[i * batch_size: i * batch_size + batch_size]
            Y_batch = train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = len_train_en[i * batch_size: i * batch_size + batch_size]
            Y_len_batch = [l - 1 for l in Y_len_batch]

            feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
            _, ls_ = sess.run([optimizer, loss], feed_dict=feed_dict)
            
            total_loss += ls_ * batch_size
            total_count += np.sum(Y_len_batch)

            if i > 0 and i % 100 == 0:
                writer.add_summary(sess.run(summary, 
                                            feed_dict=feed_dict), 
                                            e * train_ch.shape[0] // batch_size + i)
                writer.flush()
        
        print('Epoch %d lr %.3f perplexity %.2f' % (e, lr, np.exp(total_loss / total_count)))
        saver.save(sess, os.path.join(OUTPUT_DIR, 'nmt'))
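
The learning-rate schedule above keeps lr at 1.0 for roughly the first two thirds of training and then halves it in steps; the same formula, pulled out so the values can be inspected on their own:

epochs = 20
start_decay = int(epochs * 2 / 3)
for e in range(epochs):
    if e <= start_decay:
        lr = 1.0
    else:
        lr = 1.0 * 0.5 ** int(4 * (e - start_decay) / (epochs - start_decay))
    print(e, lr)   # 1.0 up to epoch 14, then 0.5, 0.5, 0.25, 0.25, 0.125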

Evaluation code. The perplexity on the validation set is 11.56

if mode == 'eval':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))
    
    total_loss = 0
    total_count = 0
    for i in tqdm(range(dev_ch.shape[0] // batch_size)):
        X_batch = dev_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_dev_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_dev_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]
        
        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ls_ = sess.run(loss, feed_dict=feed_dict)
        
        total_loss += ls_ * batch_size
        total_count += np.sum(Y_len_batch)

    print('Dev perplexity %.2f' % np.exp(total_loss / total_count))

Inference code. The BLEU score on the test set is 0.2069; the generated English translations are written to output_test_diy

if mode == 'infer':
    saver = tf.train.Saver()
    OUTPUT_DIR = 'model_diy'
    saver.restore(sess, tf.train.latest_checkpoint(OUTPUT_DIR))
    
    def translate(ids):
        words = [id2word_en[i] for i in ids]
        if words[0] == '<s>':
            words = words[1:]
        if '</s>' in words:
            words = words[:words.index('</s>')]
        return ' '.join(words)
    
    fw = open('output_test_diy', 'w')
    for i in tqdm(range(test_ch.shape[0] // batch_size)):
        X_batch = test_ch[i * batch_size: i * batch_size + batch_size]
        X_len_batch = len_test_ch[i * batch_size: i * batch_size + batch_size]
        Y_batch = test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = len_test_en[i * batch_size: i * batch_size + batch_size]
        Y_len_batch = [l - 1 for l in Y_len_batch]
        
        feed_dict = {X: X_batch, Y: Y_batch, X_len: X_len_batch, Y_len: Y_len_batch}
        ids = sess.run(sample_id, feed_dict=feed_dict) # seq_len, batch_size, beam_width
        ids = np.transpose(ids, (1, 2, 0)) # batch_size, beam_width, seq_len
        ids = ids[:, 0, :] # batch_size, seq_len
        
        for j in range(ids.shape[0]):
            sentence = translate(ids[j])
            fw.write(sentence + '\n')
    fw.close()
    
    from nmt.utils.evaluation_utils import evaluate
    
    for metric in ['bleu', 'rouge']:
        score = evaluate('data/test.en', 'output_test_diy', metric)
        print(metric, score / 100)
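
If the nmt package and its evaluation_utils are not on the Python path, a rough corpus-level BLEU can also be computed with NLTK; a hedged alternative sketch (the number will not exactly match the nmt script's tokenized BLEU):

from nltk.translate.bleu_score import corpus_bleu

with open('data/test.en') as f:
    refs = [[line.strip().split()] for line in f]   # one reference list per sentence
with open('output_test_diy') as f:
    hyps = [line.strip().split() for line in f]

# only the first len(hyps) references are compared, because the last
# incomplete batch is dropped by the inference loop above
print(corpus_bleu(refs[:len(hyps)], hyps))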

The Ready-Made Wheel

The tensorflow/nmt project referenced above is itself a complete, configurable NMT implementation. Its main options include:

  • --num_units: number of hidden units in the RNN
  • --unit_type: RNN cell type, one of lstm, gru, layer_norm_lstm, nas
  • --num_layers: number of RNN layers
  • --encoder_type: encoder type, one of uni, bi, gnmt
  • --residual: whether to use residual connections
  • --attention: attention type, one of luong, scaled_luong, bahdanau, normed_bahdanau, or empty to disable attention

If these options feel too tedious, the project also provides four ready-made hyperparameter templates: iwslt15.json is intended for a small dataset (IWSLT English-Vietnamese, roughly 130K pairs), while the other three templates target a large dataset (WMT German-English, 4.5M pairs).

To train a Chinese-to-English model with this project, just run the following command; to train English-to-Chinese instead, simply swap the src and tgt values.

python -m nmt.nmt --src=ch --tgt=en --vocab_prefix=data/vocab --train_prefix=data/train --dev_prefix=data/dev --test_prefix=data/test --out_dir=model_nmt --hparams_path=nmt/standard_hparams/iwslt15.json

The training output includes the following

  • the last five saved model checkpoints
  • train_log, containing event files that can be viewed with TensorBoard
  • output_dev and output_test, the translations of the validation set and the test set respectively
  • best_bleu, containing the five model versions with the highest BLEU score on the validation set

The model reaches a BLEU score of 0.233 on the validation set and 0.224 on the test set.

Use the following command for inference; write the text to be translated into the corresponding file, and the generated English translations will be in output_test_nmt.

python -m nmt.nmt --out_dir=model_nmt --inference_input_file=test.ch --inference_output_file=output_test_nmt

Couplet Generation

Train the model with the following command. Copy iwslt15.json to couplet.json and, since there is much more data, increase the number of training steps by setting num_train_steps to 100000 (see the sketch below).

It does not matter that there is no validation set; just reuse the test set, because the required arguments will raise an error if left unset.
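
A minimal sketch of producing couplet.json from the IWSLT template, assuming the template is the flat JSON hyperparameter file shipped with tensorflow/nmt:

import json

with open('nmt/standard_hparams/iwslt15.json') as f:
    hparams = json.load(f)
hparams['num_train_steps'] = 100000   # more data, so train for more steps
with open('nmt/standard_hparams/couplet.json', 'w') as f:
    json.dump(hparams, f, indent=2)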

python -m nmt.nmt --src=in --tgt=out --vocab_prefix=couplet/vocab --train_prefix=couplet/train --dev_prefix=couplet/test --test_prefix=couplet/test --out_dir=model_couplet --hparams_path=nmt/standard_hparams/couplet.json

Some example results from output_test; each group of three lines is the first line of the couplet, the reference second line, and the generated second line. The character count, part of speech, and meaning largely match.

騰 飛 上 鐵 , 銳 意 改 革 謀 發 展 , 勇 當 千 里 馬
和 諧 南 供 , 安 全 送 電 保 暢 通 , 爭 做 領 頭 羊
改 革 開 放 , 科 學 發 展 促 繁 榮 , 爭 做 領 頭 羊

風 弦 未 撥 心 先 亂
夜 幕 已 沉 夢 更 閒
雪 韻 初 融 意 更 濃

彩 屏 如 畫 , 望 秀 美 崤 函 , 花 團 錦 簇
短 信 報 春 , 喜 和 諧 社 會 , 物 阜 民 康
妙 筆 生 花 , 書 輝 煌 史 冊 , 虎 嘯 龍 吟

To generate a second line for an unseen first line, i.e. to run inference, just use the method described earlier.
