
Nematus (2): Program Execution Flow Analysis

Nematus, a neural machine translation tool

Program execution flow analysis

nematus/nmt.py: train (the program entry point; the analysis starts from this function)

  • 1.1 Read the source- and target-language vocabularies
    # collect the hyperparameters that were passed in
    model_options = locals().copy()
    print 'Model options:', model_options

    # load the dictionaries and build the inverted (id -> token) versions
    worddicts = [None] * len(dictionaries)
    worddicts_r = [None] * len(dictionaries)
    for ii, dd in enumerate(dictionaries):
        worddicts[ii] = load_dict(dd)
        worddicts_r[ii] = dict()
        for kk, vv in worddicts[ii].iteritems():
            worddicts_r[ii][vv] = kk

    # if the vocabulary sizes were not set, default to the dictionary sizes
    if n_words_src is None:
        n_words_src = len(worddicts[0])
        model_options['n_words_src'] = n_words_src
    if n_words_tgt is None:
        n_words_tgt = len(worddicts[1])
        model_options['n_words_tgt'] = n_words_tgt
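To make the inversion concrete, here is a toy sketch (the tokens and ids below are made up; load_dict is assumed to return a token-to-id mapping of this shape):

    # toy vocabulary (hypothetical tokens and ids), shaped like load_dict's output
    worddict = {'eos': 0, 'UNK': 1, 'the': 2, 'cat': 3}

    # invert it for id -> token lookups, mirroring the loop above
    worddict_r = dict()
    for kk, vv in worddict.iteritems():
        worddict_r[vv] = kk

    print worddict_r[3]   # prints: cat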
  • 1.2 Load the training and validation sets
    # load data
    print 'Loading data ...'
    train = TextIterator(datasets[0], datasets[1],
                         dictionaries[0], dictionaries[1],
                         n_words_source=n_words_src,
                         n_words_target=n_words_tgt,
                         batch_size=batch_size,
                         maxlen=maxlen,
                         shuffle_each_epoch=shuffle_each_epoch,
                         sort_by_length=sort_by_length,
                         maxibatch_size=maxibatch_size)
    valid = TextIterator(valid_datasets[0], valid_datasets[1],
                         dictionaries[0], dictionaries[1],
                         n_words_source=n_words_src,
                         n_words_target=n_words_tgt,
                         batch_size=valid_batch_size,
                         maxlen=maxlen)
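For orientation, iterating over train yields one minibatch at a time: x and y are parallel lists of up to batch_size sentences, each sentence given as a list of word ids (the ids in the comments are made up):

    # illustrative only -- the ids are hypothetical
    for x, y in train:
        print len(x)   # number of sentence pairs in this minibatch
        print x[0]     # e.g. [10, 20, 3]: word ids of the first source sentence
        break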
  • 1.3 Initialize the model parameters: init_params(model_options)
    # initialize the model parameters
    print 'Init parameters ...'
    params = init_params(model_options)
  • 1.4 Reload the model by calling load_params(saveto, params)
    # reload the model, so that training can resume after an unexpected interruption
    if reload_ and os.path.exists(saveto):
        print 'Reloading model parameters'
        params = load_params(saveto, params)
  • 1.5 Turn the network parameters into shared variables; parameters can only be updated once they are shared variables: init_theano_params(params)
    # turn the network's W and b into theano shared variables
    tparams = init_theano_params(params)
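A minimal sketch of what this step amounts to (the function body is an assumption modelled on the surrounding code, not a verbatim copy): every numpy parameter is wrapped in a theano shared variable so that the optimizer can later update it in place:

    from collections import OrderedDict
    import theano

    def init_theano_params_sketch(params):
        # wrap each numpy array in a shared variable, keyed by parameter name
        tparams = OrderedDict()
        for kk, vv in params.iteritems():
            tparams[kk] = theano.shared(vv, name=kk)
        return tparams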
  • 1.6 Build the model, i.e. construct the computation graph: define the network's forward pass and the loss function: build_model(tparams, model_options)
    # build the model
    print 'Building model ...'

    trng, use_noise, x, x_mask, y, y_mask, \
        opt_ret, cost, ctx, tt, _ = build_model(tparams, model_options)

    inps = [x, x_mask, y, y_mask]
  • 1.7 Build the sampler, used at validation/sampling time
    # build the sampler
    if validFreq or sampleFreq:
        print 'Building sampler ...'
        f_init, f_next = build_sampler(tparams, model_options, use_noise, trng)
  • 1.8 Regularization
    • Apply L2 regularization to the weights
    • Regularize the attention weights
    • Apply L2 regularization towards the loaded model's parameters (MAP training)

    # apply L2 regularization on weights
    if decay_c > 0.:
        decay_c = theano.shared(numpy.float32(decay_c), name='decay_c')
        weight_decay = 0.
        for kk, vv in tparams.iteritems():
            weight_decay += (vv ** 2).sum()
        weight_decay *= decay_c
        cost += weight_decay  # add the regularization term to the cost

    # regularize the alpha weights
    if alpha_c > 0. and not model_options['decoder'].endswith('simple'):
        alpha_c = theano.shared(numpy.float32(alpha_c), name='alpha_c')
        alpha_reg = alpha_c * (
            (tensor.cast(y_mask.sum(0)//x_mask.sum(0), 'float32')[:, None] -
             opt_ret['dec_alphas'].sum(0))**2).sum(1).mean()
        cost += alpha_reg

    # apply L2 regularisation to loaded model (map training)
    if map_decay_c > 0:
        map_decay_c = theano.shared(numpy.float32(map_decay_c), name="map_decay_c")
        weight_map_decay = 0.
        for kk, vv in tparams.iteritems():
            init_value = theano.shared(vv.get_value(), name=kk + "_init")
            weight_map_decay += ((vv - init_value) ** 2).sum()
        weight_map_decay *= map_decay_c
        cost += weight_map_decay
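A toy numpy illustration of the first penalty (the values are made up): for a single weight matrix W, the term added to the cost is decay_c * sum(W ** 2):

    import numpy

    decay_c = 1e-4
    W = numpy.array([[1., 2.], [3., 4.]], dtype='float32')
    weight_decay = decay_c * (W ** 2).sum()   # 1e-4 * 30 = 0.003

The alpha term works differently: it pushes the total attention each source position receives, summed over all target steps, towards the ratio of target length to source length, so that no source word is systematically ignored or over-attended.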
  • 1.9 Compute the gradients of the loss function with respect to every parameter in the network
    print 'Computing gradient...',
    grads = tensor.grad(cost, wrt=itemlist(tparams))
    print 'Done'
  • 1.10 Apply the gradient clipping strategy
    # apply gradient clipping here
    if clip_c > 0.:
        g2 = 0.
        for g in grads:
            g2 += (g**2).sum()
        new_grads = []
        for g in grads:
            new_grads.append(tensor.switch(g2 > (clip_c**2),
                                           g / tensor.sqrt(g2) * clip_c,
                                           g))
        grads = new_grads
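The same global-norm clipping, sketched in plain numpy (the gradient values are made up): if the global L2 norm of all gradients exceeds clip_c, every gradient is rescaled so that the global norm becomes exactly clip_c:

    import numpy

    def clip_global_norm(grads, clip_c):
        # global squared norm across all gradient arrays
        g2 = sum((g ** 2).sum() for g in grads)
        if g2 > clip_c ** 2:
            grads = [g / numpy.sqrt(g2) * clip_c for g in grads]
        return grads

    print clip_global_norm([numpy.array([3., 4.])], 1.)   # [array([ 0.6,  0.8])]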
  • 1.11 Define the learning-rate scalar and compile the optimizer; the optimizer then updates the parameters using this learning rate
    # compile the optimizer, the actual computational graph is compiled here
    lr = tensor.scalar(name='lr')

    print 'Building optimizers...',
    f_grad_shared, f_update = eval(optimizer)(lr, tparams, grads, inps, cost, profile=profile)
    print 'Done'
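All optimizers in nmt.py share this two-function interface. As a hedged sketch, a plain SGD version might look as follows (modelled on the dl4mt-style optimizers; details vary between versions):

    import theano
    import theano.tensor as tensor

    def sgd(lr, tparams, grads, inps, cost, profile=False):
        # shared variables that carry the gradients between the two calls
        gshared = [theano.shared(p.get_value() * 0., name='%s_grad' % k)
                   for k, p in tparams.iteritems()]

        # f_grad_shared: forward + backward pass; stores the gradients in gshared
        f_grad_shared = theano.function(inps, cost,
                                        updates=zip(gshared, grads),
                                        profile=profile)

        # f_update: one SGD step, parameterised by the learning rate
        pup = [(p, p - lr * g) for p, g in zip(tparams.values(), gshared)]
        f_update = theano.function([lr], [], updates=pup, profile=profile)

        return f_grad_shared, f_update

Splitting the work this way means the expensive graph (f_grad_shared) is compiled once, while f_update remains a cheap element-wise update.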
  • 1.12 Start the optimization process…

Reload the training history, including the update counter uidx and the validation error history history_errs.

    # start optimization
    print 'Optimization'

    best_p = None
    bad_counter = 0
    uidx = 0
    estop = False
    history_errs = []
    # reload history
    if reload_ and os.path.exists(saveto):
        rmodel = numpy.load(saveto)
        history_errs = list(rmodel['history_errs'])
        if 'uidx' in rmodel:
            uidx = rmodel['uidx']

    if validFreq == -1:
        validFreq = len(train[0])/batch_size
    if saveFreq == -1:
        saveFreq = len(train[0])/batch_size
    if sampleFreq == -1:
        sampleFreq = len(train[0])/batch_size

    valid_err = None

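The history reload above works because the checkpoint is a single .npz archive that stores the parameters alongside history_errs and uidx. A minimal round-trip sketch (the file name and parameter are made up):

    import numpy

    params = {'W': numpy.zeros((2, 2), dtype='float32')}
    numpy.savez('model.npz', history_errs=[], uidx=0, **params)

    rmodel = numpy.load('model.npz')
    history_errs = list(rmodel['history_errs'])
    uidx = rmodel['uidx']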
The optimization loop proper starts here…
- max_epochs: the maximum number of training epochs
- prepare_data: prepares a minibatch. The input x is a list with batch_size entries, each entry holding the word ids of one sentence.

prepare_data(x, y, maxlen=maxlen, …) turns that list into a matrix with one column per sentence: the end-of-sentence marker eos (word id 0) is appended to every sentence, and shorter sentences are zero-padded to the length of the longest one.
- prepare_data also returns x_mask, a matrix of the same shape that contains 1 at every real word and at each sentence's eos position, and 0 at the padding positions.
- The role of x_mask: in the GRU, once a sentence has ended, the hidden state from its last valid time step is simply carried forward. The hidden states form a 3-D array of shape (x_len, batch_size, dim), so the final time slice of that array holds the last-time-step hidden state of every sentence; the 1 at each eos position ensures that the hidden state produced for eos is kept as well. A runnable sketch of the padding and masking follows below.
[figure: the hidden-state array (x_len, batch_size, dim), with each sentence's final hidden state preserved]
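As referenced above, a sketch of this padding-and-masking behaviour (the word ids are toy values; the real prepare_data also filters out sentence pairs longer than maxlen and processes y the same way):

    import numpy

    def prepare_data_sketch(seqs_x):
        lengths = [len(s) for s in seqs_x]
        n_samples = len(seqs_x)
        x_len = max(lengths) + 1                       # +1 for the final eos
        x = numpy.zeros((x_len, n_samples)).astype('int64')
        x_mask = numpy.zeros((x_len, n_samples)).astype('float32')
        for idx, s in enumerate(seqs_x):
            x[:lengths[idx], idx] = s                  # one sentence per column
            x_mask[:lengths[idx] + 1, idx] = 1.        # words + the eos position
        return x, x_mask                               # remaining 0s are padding

    x, x_mask = prepare_data_sketch([[10, 20, 3], [5, 8]])
    # x[:, 1]      == [5, 8, 0, 0]
    # x_mask[:, 1] == [1., 1., 1., 0.]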
  • Compute the loss value; f_grad_shared also computes the gradients and copies them into the shared variables: cost = f_grad_shared(x, x_mask, y, y_mask)
  • Apply the parameter update with learning rate lrate: f_update(lrate)
for eidx in xrange(max_epochs):
        n_samples = 0

        for x, y in train:
            n_samples += len(x)
            uidx += 1
            use_noise.set_value(1.)
            # prepare the minibatch for training
            x, x_mask, y, y_mask = prepare_data(x, y, maxlen=maxlen,
                                                n_words_src=n_words_src,
                                                n_words=n_words_tgt)
            # x is None when the minibatch contains no sentence pair within maxlen
            if x is None:
                print 'Minibatch with zero sample under length ', maxlen
                uidx -= 1
                continue

            ud_start = time.time()

            # compute the cost and the gradients, and copy the gradients to shared variables
            cost = f_grad_shared(x, x_mask, y, y_mask)
            # do the update on the parameters, with learning rate lrate
            f_update(lrate)
  • Display Epoch (epoch number), Update (update count), Cost (loss value) and UD (wall-clock time of one update)
  • Save the best network parameters (kept in best_p), together with history_errs and the update counter uidx
  • Save the model parameters of the current iteration in a separate file
            ud = time.time() - ud_start

            # check for bad numbers, usually we remove non-finite elements
            # and continue training - but not done here
            if numpy.isnan(cost) or numpy.isinf(cost):
                print 'NaN detected'
                return 1., 1., 1.

            # verbose
            if numpy.mod(uidx, dispFreq) == 0:
                print 'Epoch ', eidx, 'Update ', uidx, 'Cost ', cost, 'UD ', ud

            # save the best model so far, in addition, save the latest model
            # into a separate file with the iteration number for external eval
            if numpy.mod(uidx, saveFreq) == 0:
                print 'Saving the best model...',  # save the best parameters so far
                if best_p is not None:
                    params = best_p
                else:
                    params = unzip_from_theano(tparams)
                numpy.savez(saveto, history_errs=history_errs, uidx=uidx, **params)
                json.dump(model_options, open('%s.json' % saveto, 'wb'), indent=2)
                print 'Done'

                # save with uidx
                if not overwrite:
                    print 'Saving the model at iteration {}...'.format(uidx),
                    saveto_uidx = '{}.iter{}.npz'.format(
                        os.path.splitext(saveto)[0], uidx)
                    numpy.savez(saveto_uidx, history_errs=history_errs,
                                uidx=uidx, **unzip_from_theano(tparams))
                    print 'Done'
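For reference, the per-iteration file name assembled above looks like this (illustrative values):

    import os

    # e.g. saveto = 'model.npz' and uidx = 30000
    saveto_uidx = '{}.iter{}.npz'.format(os.path.splitext('model.npz')[0], 30000)
    print saveto_uidx   # model.iter30000.npz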
  • Generate some translation samples under the current model parameters and display them
            # generate some samples with the model and display them

            if sampleFreq and numpy.mod(uidx, sampleFreq) == 0:
                # FIXME: random selection?
                for jj in xrange(numpy.minimum(5, x.shape[1])):
                    stochastic = True
                    sample, score, sample_word_probs, alignment = gen_sample([f_init], [f_next],
                                               x[:, jj][:, None],