TensorFlow Study Notes (2): A First Look at PTBModel

阿新 • Published: 2019-02-01

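Preface

I have recently been studying deep learning and TensorFlow. These notes record the useful reference material I came across along the way, together with my own preliminary understanding.

1. PTB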

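I use Anaconda3. The PTB example lives in the Anaconda3\Lib\site-packages\tensorflow\models\rnn\ptb directory, which contains two main files:

ptb_word_lm.py
reader.py

reader is the data-processing utility module for the PTB model. PTBModel and main are both defined in ptb_word_lm.

As in the earlier tutorials, PTB is split into two major steps: building the abstract model and training it.

Configuration

The official code thoughtfully ships four configurations of different sizes: small, medium, large and test. Taking small as an example: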

class SmallConfig(object):
   """Small config."""
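   init_scale = 0.1    # parameters are initialized uniformly at random in [-init_scale, +init_scale]
   learning_rate = 1.0 # initial learning rate; it is decayed during training
   max_grad_norm = 5   # gradient clipping threshold: if the global L2 norm of the gradients exceeds max_grad_norm, they are scaled down proportionally
   num_layers = 2      # number of LSTM layers
   num_steps = 20      # number of time steps the LSTM is unrolled for; the text is cut into chunks of num_steps words, and backpropagation is truncated at chunk boundaries
   hidden_size = 200   # number of hidden units; each word is represented by a vector of size [hidden_size]
   max_epoch = 4       # the learning rate keeps its initial value for the first max_epoch epochs and is decayed after that
   max_max_epoch = 13  # total number of passes over the full training text
   keep_prob = 1.0     # dropout keep probability; 1.0 means nothing is dropped
   lr_decay = 0.5      # factor the learning rate is multiplied by in each epoch after max_epoch
   batch_size = 20     # works together with num_steps: the training data is split into batch_size rows, and each row is cut into chunks of num_steps words
   vocab_size = 10000  # vocabulary size (this training set contains exactly 10000 distinct words)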

The following flags select which config to use (small below), the data path, the output directory, and so on.

flags = tf.flags
logging = tf.logging

flags.DEFINE_string(
   "model", "small",
   "A type of model. Possible options are: small, medium, large.")
flags.DEFINE_string("data_path", r'C:\Users\hasee\Desktop\tempdata\lstm\simple-examples\data', "data_path")
flags.DEFINE_string("save_path", r'C:\Users\hasee\Desktop\tempdata\lstm\simple-examples\data\res',
                   "Model output directory.")
flags.DEFINE_bool("use_fp16", False,
                 "Train using 16-bit floats instead of 32bit floats")

FLAGS = flags.FLAGS

PTBModel

The __init__() method of class PTBModel builds the abstract LSTM model.

lstm_cell and initial_state


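# Slightly better results can be obtained with forget gate biases
# initialized to 1 but the hyperparameters of the model would need to be
# different than reported in the paper.
# Note: the comment above means that changing forget_bias=0.0 to 1.0 gives slightly better results, but the hyperparameters would then no longer match those reported in the paper.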
lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(size, forget_bias=0.0, state_is_tuple=True)
if is_training and config.keep_prob < 1:
 lstm_cell = tf.nn.rnn_cell.DropoutWrapper(
     lstm_cell, output_keep_prob=config.keep_prob)
cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_layers, state_is_tuple=True)

self._initial_state = cell.zero_state(batch_size, data_type())

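BasicLSTMCell builds one basic LSTM unit, and dropout is then attached to the cell depending on keep_prob. Finally, MultiRNNCell stacks num_layers of these lstm_cell units.

An LSTM cell carries two state values: c (the cell state) and h (the hidden state, which is also the cell's output).

Q: Why use BasicLSTMCell rather than LSTMCell?

BasicLSTMCell does not implement clipping, a projection layer, peep-holes, or other advanced LSTM variants; it exists only as a basic baseline. To use those variants, use the LSTMCell class.

Since we only want to build a basic LSTM language model for now, BasicLSTMCell is sufficient, which is why it is used here.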
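To make the two state values c and h concrete, here is a minimal sketch of one step of a single-unit LSTM in plain Python. It follows the standard LSTM equations that BasicLSTMCell is based on; the scalar weight layout (`w`), the helper names and the toy weights are assumptions made for illustration, not the TensorFlow implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, c_prev, h_prev, w, forget_bias=0.0):
    """One step of a single-unit LSTM.

    `w` maps a gate name to (w_x, w_h, b); this layout is hypothetical.
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))                # input gate
    f = sigmoid(pre("f") + forget_bias)  # forget gate; a larger forget_bias keeps more of c_prev
    o = sigmoid(pre("o"))                # output gate
    g = math.tanh(pre("g"))              # candidate cell value
    c = f * c_prev + i * g               # new cell state
    h = o * math.tanh(c)                 # new hidden state (the cell's output)
    return c, h

# Toy weights, identical for every gate (assumption for the demo).
weights = {gate: (0.5, 0.5, 0.0) for gate in ("i", "f", "o", "g")}
c, h = 0.0, 0.0                          # plays the role of zero_state
for x in [1.0, -1.0, 0.5]:
    c, h = lstm_step(x, c, h, weights)
```

With forget_bias=1.0 the forget gate starts closer to 1, so more of the previous cell state survives each step, which is the effect the comment in the official code refers to.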

embedding

with tf.device("/cpu:0"):
 embedding = tf.get_variable(
     "embedding", [vocab_size, size], dtype=data_type())
 # input_.input_data holds the externally supplied data as word ids; embedding_lookup() converts the ids into word vectors (inputs).
 inputs = tf.nn.embedding_lookup(embedding, input_.input_data)

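Here embedding is the word-vector matrix. It has vocab_size rows (10000 here), and each row is a hidden_size-dimensional vector. As the model trains, the entries of embedding keep being updated, and in the end each row is the learned vector for one word.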
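The lookup itself is just row selection. The sketch below mimics it with nested lists; the tiny sizes and the `embedding_lookup` helper are stand-ins chosen for illustration, not the TensorFlow op.

```python
import random

vocab_size, hidden_size = 5, 3           # tiny stand-ins for 10000 and 200
init_scale = 0.1
random.seed(0)

# One row per word id, initialized uniformly in [-init_scale, +init_scale].
embedding = [[random.uniform(-init_scale, init_scale) for _ in range(hidden_size)]
             for _ in range(vocab_size)]

def embedding_lookup(table, ids):
    """Replace every id in a [batch_size, num_steps] grid with its row of `table`."""
    return [[table[word_id] for word_id in row] for row in ids]

input_data = [[0, 3, 1], [4, 4, 2]]      # batch_size=2, num_steps=3
inputs = embedding_lookup(embedding, input_data)
# inputs now has shape [batch_size, num_steps, hidden_size]
```

The same id always maps to the same row, so repeated words share one vector, and training that row changes the representation everywhere the word occurs.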

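outputs and loss

This part follows roughly the same pattern as the basic models, but note the loop that runs num_steps times: it unrolls the RNN, and each iteration produces one output and one new state.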

outputs = []
state = self._initial_state
with tf.variable_scope("RNN"):
 for time_step in range(num_steps):
     if time_step > 0: tf.get_variable_scope().reuse_variables()
     # cell(inputs[:, time_step, :], state) invokes the __call__() method of tf.nn.rnn_cell.MultiRNNCell
     # TODO: the docstring of __call__() says "Run this multi-layer cell on inputs, starting from state.", but I have not yet read what the method actually does
     (cell_output, state) = cell(inputs[:, time_step, :], state)
     outputs.append(cell_output)

# The rest follows the same pattern as the basic models: y = wx + b,
# with x = output and y = targets
output = tf.reshape(tf.concat(1, outputs), [-1, size])
softmax_w = tf.get_variable(
 "softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
self._logits=logits


# The loss can be understood as a more elaborate form of cross-entropy, similar to this code from the basic models:
# cross_entropy=tf.reduce_mean(-tf.reduce_sum(y * tf.log(a), reduction_indices=[1]))
loss = tf.nn.seq2seq.sequence_loss_by_example(
         [logits],
         [tf.reshape(input_.targets, [-1])],
         [tf.ones([batch_size * num_steps], dtype=data_type())])

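# loss holds one cross-entropy value per word (batch_size * num_steps values); summing them and dividing by batch_size gives _cost, the per-sequence loss averaged over the batch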
self._cost = cost = tf.reduce_sum(loss) / batch_size
self._final_state = state
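As a rough illustration of how the per-word loss vector turns into the cost, the following sketch computes softmax cross-entropy by hand for a toy batch. The helper `softmax_cross_entropy` and the all-zero logits are assumptions for the demo; with all weights equal to 1, this is the quantity sequence_loss_by_example is understood to return for each position.

```python
import math

def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)                                   # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

batch_size, num_steps, vocab_size = 2, 3, 4
# One row of logits per (sequence, time step) position: batch_size * num_steps rows.
logits = [[0.0] * vocab_size for _ in range(batch_size * num_steps)]
targets = [1, 0, 3, 2, 2, 1]                          # flattened targets, one id per position

loss = [softmax_cross_entropy(row, t) for row, t in zip(logits, targets)]
cost = sum(loss) / batch_size                         # summed over time steps, averaged over the batch
```

Because cost is still summed over num_steps, it has to be divided by the number of time steps before it can be read as an average per-word loss.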

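lr and gradient descent

While this LSTM model trains, the gradients need to be modified (clipped) before they are applied.

The official documentation describes this kind of operation and gives an example: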


# Create an optimizer.
opt = GradientDescentOptimizer(learning_rate=0.1)

# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss, <list of variables>)

# grads_and_vars is a list of tuples (gradient, variable).  Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(MyCapper(gv[0]), gv[1]) for gv in grads_and_vars]

# Ask the optimizer to apply the capped gradients.
opt.apply_gradients(capped_grads_and_vars)

Following this example, we can write pseudocode like the following:

optimizer = tf.train.AdamOptimizer(learning_rate=self._lr)

# tf.gradients returns a list of sum(dy/dx) for each x in xs.
grads = tf.gradients(self._cost, <list of variables>)
# clip_by_global_norm returns a pair: (clipped gradients, global norm)
clipped_grads, _ = tf.clip_by_global_norm(grads, config.max_grad_norm)

# apply_gradients accepts a list of (gradient, variable) pairs, so zip() is needed
self._train_op = optimizer.apply_gradients(zip(clipped_grads, <list of variables>))

Only one thing is still unknown: which variables to differentiate with respect to. The answer is the trainable variables:

tvars = tf.trainable_variables()

Now look at the code in the official PTBModel:

# To adjust the gradients during training, we cannot simply call optimizer.minimize(loss); the gradients have to be computed explicitly
self._lr = tf.Variable(0.0, trainable=False)
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars),
                               config.max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self._lr)
self._train_op = optimizer.apply_gradients(
 zip(grads, tvars),
 global_step=tf.contrib.framework.get_or_create_global_step())

self._new_lr = tf.placeholder(
 tf.float32, shape=[], name="new_learning_rate")
self._lr_update = tf.assign(self._lr, self._new_lr)

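Here tf.clip_by_global_norm() is used to keep exploding gradients under control.

Exploding and vanishing gradients share the same cause: the chain rule multiplies many factors together during backpropagation, so the gradient either grows (explodes) or shrinks (vanishes) exponentially. To guard against exploding gradients, the gradients have to be clipped. For details, see the article "tensorflow筆記：多層LSTM程式碼分析" (TensorFlow notes: multi-layer LSTM code analysis).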
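The clipping rule is short enough to write out by hand. The sketch below is a plain-Python approximation of what tf.clip_by_global_norm computes, using nested lists instead of tensors; the function here is a stand-in written for illustration, not the library code.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one common factor so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # If global_norm <= clip_norm the scale is exactly 1 and nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 4.0], [12.0]]             # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its length is capped at max_grad_norm.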

main()

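main reads and preprocesses the data, configures the model, and drives the training run.

Reading the data and setting the config

# ptb_raw_data has already converted the raw text to word ids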
raw_data = reader.ptb_raw_data(FLAGS.data_path)
train_data, valid_data, test_data, vocab_size = raw_data


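# The original data has exactly 10000 distinct words, so config.vocab_size would not need to change,
# but I have experimented with modified training data, so I added the vocab_size assignments below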
config = get_config()
config.vocab_size=vocab_size

eval_config = get_config()
eval_config.batch_size = 1
eval_config.num_steps = 1
eval_config.vocab_size=vocab_size

Pay particular attention to ptb_raw_data(). It has a few key steps:

Build the vocabulary from the training data

word_to_id = _build_vocab(train_path)

def _build_vocab(filename):
 """
 Reads the raw data, replaces newlines with <eos>, then builds a vocabulary ordered by word frequency and returns it.
 """
 data = _read_words(filename)

 counter = collections.Counter(data)
 count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))

 words, _ = list(zip(*count_pairs))
 word_to_id = dict(zip(words, range(len(words))))

 return word_to_id

def _read_words(filename):
 # Newlines are replaced with <eos> here
 with tf.gfile.GFile(filename, "r") as f:
   return f.read().decode("utf-8").replace("\n", "<eos>").split()

Convert the raw train/valid/test data to ids

The raw data is converted using the word_to_id vocabulary obtained above:

train_data = _file_to_word_ids(train_path, word_to_id)
valid_data = _file_to_word_ids(valid_path, word_to_id)
test_data = _file_to_word_ids(test_path, word_to_id)
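The two steps can be tried on a toy string without TensorFlow. In this sketch, `build_vocab` mirrors the _build_vocab logic shown above, while `words_to_ids` is a hypothetical stand-in for _file_to_word_ids (whose body is not shown here); skipping out-of-vocabulary words and padding <eos> with spaces are assumptions of the demo.

```python
import collections

def build_vocab(words):
    """Map each word to an id: more frequent words get smaller ids, ties are broken alphabetically."""
    counter = collections.Counter(words)
    count_pairs = sorted(counter.items(), key=lambda x: (-x[1], x[0]))
    return {word: i for i, (word, _) in enumerate(count_pairs)}

def words_to_ids(words, word_to_id):
    """Convert words to ids, skipping words missing from the vocabulary (assumption)."""
    return [word_to_id[w] for w in words if w in word_to_id]

text = "the cat sat\nthe dog sat\n"
# Pad <eos> with spaces so it stays a separate token in this toy text.
words = text.replace("\n", " <eos> ").split()
word_to_id = build_vocab(words)
ids = words_to_ids(words, word_to_id)
```

Since the vocabulary is built from the training file only, the same word_to_id mapping is reused for the validation and test files.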

Building and training the model

Taking the training mode as an example:

with tf.name_scope("Train"):
 # PTBInput sets batch_size etc. from config and also creates the input slice (slice0) and the target output slice (slice1)
 train_input = PTBInput(config=config, data=train_data, name="TrainInput")
 with tf.variable_scope("Model", reuse=None, initializer=initializer):
     m = PTBModel(is_training=True, config=config, input_=train_input)
 tf.scalar_summary("Training Loss", m.cost)
 tf.scalar_summary("Learning Rate", m.lr)

This is mostly the standard pattern for initializing a model, but PTBInput() deserves attention.

PTBInput generates input and targets through reader.ptb_producer().

class PTBInput(object):
 """The input data."""

 def __init__(self, config, data, name=None):
     self.batch_size = batch_size = config.batch_size
     self.num_steps = num_steps = config.num_steps
     self.epoch_size = ((len(data) // batch_size) - 1) // num_steps
     # input_data is the current slice of shape [batch_size, num_steps]; targets is the same slice shifted one word to the right, with the same shape
     self.input_data, self.targets = reader.ptb_producer(data, batch_size, num_steps, name=name)

The most interesting part of ptb_producer() is its last few lines:

def ptb_producer(raw_data, batch_size, num_steps, name=None):
  # ... other code and comments omitted ...
  i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
  x = tf.slice(data, [0, i * num_steps], [batch_size, num_steps])
  y = tf.slice(data, [0, i * num_steps + 1], [batch_size, num_steps])
  return x, y

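i is essentially a FIFOQueue.dequeue() op obtained from range_input_producer() (I think of it as roughly a function): every time x and y are evaluated, i yields the next index, so x and y keep advancing through the data. Because the model predicts the next word, y (the target) is simply x (the input) shifted right by one position.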
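The slicing can be reproduced with plain lists to see exactly which words end up in x and y. The generator below is a sketch, assuming the raw data is first reshaped into batch_size rows as in reader.py; `ptb_batches` is a hypothetical name, and the integers 0..19 stand in for word ids.

```python
def ptb_batches(raw_data, batch_size, num_steps):
    """Yield (x, y) pairs of shape [batch_size, num_steps]; y is x shifted by one word."""
    batch_len = len(raw_data) // batch_size
    # Cut the id stream into batch_size contiguous rows.
    data = [raw_data[b * batch_len:(b + 1) * batch_len] for b in range(batch_size)]
    epoch_size = (batch_len - 1) // num_steps
    for i in range(epoch_size):          # i plays the role of the dequeued index
        x = [row[i * num_steps:(i + 1) * num_steps] for row in data]
        y = [row[i * num_steps + 1:(i + 1) * num_steps + 1] for row in data]
        yield x, y

batches = list(ptb_batches(list(range(20)), batch_size=2, num_steps=3))
first_x, first_y = batches[0]
```

Each row continues where it left off in the previous batch, which is why run_epoch can feed the final LSTM state of one batch in as the initial state of the next.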

Like variables, a queue in TensorFlow is a stateful node: other nodes can insert new elements at the rear of the queue and remove elements from the front. For example:

q=tf.FIFOQueue(3,'float')
init=q.enqueue_many(([0.,0.,0.],))

x=q.dequeue()
y=x+1
q_inc=q.enqueue([y])


# Note: these ops must be run inside a session, otherwise an error is raised
with tf.Session() as sess:
 init.run()

 q_inc.run()
 q_inc.run()
 q_inc.run()

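Each q_inc.run() takes one element from the front of the queue, adds 1 to it, and puts it back at the rear, so the values in the queue gradually increase: starting from [0, 0, 0], the queue holds [1, 1, 1] after the three runs above.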
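The same behaviour can be mimicked without TensorFlow to watch the queue contents change step by step. This sketch uses collections.deque as a stand-in for the FIFOQueue; the `q_inc` function and the `history` list are illustrative names only.

```python
import collections

queue = collections.deque([0.0, 0.0, 0.0])   # contents after the enqueue_many init step

def q_inc(q):
    """Dequeue from the front, add 1, enqueue at the rear."""
    x = q.popleft()
    q.append(x + 1)

history = []
for _ in range(4):
    q_inc(queue)
    history.append(list(queue))              # snapshot after each run
```

A fourth run is included to show that the increments keep cycling through the queue once every element has been raised once.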

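Training then loops max_max_epoch times (the number of passes over the text); each iteration adjusts the learning rate and then calls run_epoch() to train the model.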

with sv.managed_session() as session:
 for i in range(config.max_max_epoch):
     # adjust the learning rate
     lr_decay = config.lr_decay ** max(i + 1 - config.max_epoch, 0.0)
     m.assign_lr(session, config.learning_rate * lr_decay)

     train_perplexity = run_epoch(session, m, eval_op=m.train_op,verbose=True)
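To see what this schedule produces, the decay formula from the loop can be evaluated on its own with the SmallConfig values. Only the list name `schedule` is introduced here for the demo.

```python
learning_rate, lr_decay, max_epoch, max_max_epoch = 1.0, 0.5, 4, 13

# Same expression as in the training loop, evaluated for every epoch index i.
schedule = [learning_rate * lr_decay ** max(i + 1 - max_epoch, 0.0)
            for i in range(max_max_epoch)]
```

The first max_epoch entries stay at the initial learning rate, and each later epoch halves the previous value.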

run_epoch()

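First set up the values to fetch in each run call. If eval_op is not None, it is fetched as well, which runs the training op so the model is optimized on that batch.

# Running totals and the initial LSTM state used by the loop below
start_time = time.time()
costs = 0.0
iters = 0
state = session.run(model.initial_state)

fetches = {
 "cost": model.cost,
 "final_state": model.final_state,
}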
if eval_op is not None:
 fetches["eval_op"] = eval_op

for step in range(model.input.epoch_size):
 feed_dict = {}
 for i, (c, h) in enumerate(model.initial_state):
     feed_dict[c] = state[i].c
     feed_dict[h] = state[i].h

 vals = session.run(fetches, feed_dict)
 cost = vals["cost"]
 state = vals["final_state"]

 costs += cost
 iters += model.input.num_steps

 if verbose and step % (model.input.epoch_size // 10) == 10:
     print("%.3f perplexity: %.3f speed: %.0f wps" %
           (step * 1.0 / model.input.epoch_size, np.exp(costs / iters),
            iters * model.input.batch_size / (time.time() - start_time)))
return np.exp(costs / iters)
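The perplexity bookkeeping in run_epoch can be followed with a few made-up numbers. In this sketch the values in `batch_costs` are hypothetical stand-ins for model.cost on three consecutive batches, and `running` is an illustrative name.

```python
import math

num_steps = 20
batch_costs = [132.0, 128.5, 125.0]      # hypothetical model.cost values for three batches

costs, iters = 0.0, 0
running = []
for cost in batch_costs:
    costs += cost                        # each cost is already summed over num_steps words
    iters += num_steps                   # so the word count grows by num_steps per batch
    running.append(math.exp(costs / iters))
```

Dividing the accumulated cost by the accumulated number of time steps gives the average per-word cross-entropy, and its exponential is the perplexity that run_epoch reports.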
