Multi-GPU training with TensorFlow

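If you use TensorFlow, GPUs need no introduction. For the matrix math behind scientific computing a GPU is much faster than a CPU, which cuts model training time dramatically. But a server usually has more than one card, so wouldn't it be great if several cards could train one model together? When I first started learning TensorFlow I noticed that os.environ["CUDA_VISIBLE_DEVICES"] can list several cards, and I naively assumed that setting it was enough to get automatic multi-GPU training. After some research I found that was 'too young, too simple': the idea is wrong in principle, because the variable only controls which cards the process can see. Multi-GPU training is of course still possible, it just takes more work. Enough chatter, let's get to it.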
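To make that point concrete, here is a minimal sketch of what CUDA_VISIBLE_DEVICES does. The helper name restrict_visible_gpus is made up for illustration. The variable has to be set before TensorFlow initializes the GPUs, and the safest habit is to set it before importing tensorflow at all.

```python
import os

def restrict_visible_gpus(gpu_ids):
    """Expose only the given GPU ids to this process (hypothetical helper).

    This changes visibility only. It does not spread one model across
    the listed cards; that still has to be built into the graph.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value

# Call this before TensorFlow touches the GPUs.
# Inside the process the visible cards are renumbered starting from 0.
restrict_visible_gpus([0, 1])
```

With this setting the process sees two cards, but a plain single-tower graph will still do its work on only one of them.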

1 Background

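Before getting to multi-GPU training proper, it pays to understand the principle; everything else is just detail~
We have all used optimizers such as SGD and Adam, usually without paying much attention to how they are implemented, and we don't need to. But every optimizer implements two especially important functions, and they are today's main characters: compute_gradients and apply_gradients. Gradient clipping, which is mainly used to keep gradients from exploding during training, is built on exactly these two functions~
So what an optimizer's minimize method actually does is compute the gradients (compute_gradients) and then apply them to the variables (apply_gradients).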

1.1 compute_gradients

compute_gradients(
    loss,                               # A Tensor containing the value to minimize, or a callable taking no arguments which returns the value to minimize. When eager execution is enabled it must be a callable.
    var_list=None,                      # Optional list or tuple of tf.Variable to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.
    gate_gradients=GATE_OP,             # How to gate the computation of gradients. Can be GATE_NONE, GATE_OP, or GATE_GRAPH.
    aggregation_method=None,            # Specifies the method used to combine gradient terms. Valid values are defined in the class AggregationMethod.
    colocate_gradients_with_ops=False,  # If True, try colocating gradients with the corresponding op.
    grad_loss=None                      # Optional. A Tensor holding the gradient computed for loss.
)

Computes the gradients of loss with respect to the trainable variables in var_list.
This is the first step of minimize(); it returns a list of (gradient, variable) pairs.

1.2 apply_gradients

apply_gradients(
    grads_and_vars,    # List of (gradient, variable) pairs as returned by compute_gradients().
    global_step=None,  # Optional Variable to increment by one after the variables have been updated.
    name=None          # Optional name for the returned operation. Default to the name passed to the Optimizer constructor.
)

This is the second step of minimize(); it returns an op that applies the gradient updates.
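To see the two-step split without any TensorFlow machinery, here is a toy sketch in plain Python. TinySGD and grad_of_square_loss are names I made up; this is not how TensorFlow implements its optimizers, it only mirrors the shape of the API.

```python
class TinySGD:
    """Toy optimizer that mirrors the compute/apply split (not TensorFlow code)."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, grad_fn, variables):
        # Step 1: return a list of (gradient, variable_name) pairs.
        grads = grad_fn(variables)
        return [(grads[name], name) for name in variables]

    def apply_gradients(self, grads_and_vars, variables):
        # Step 2: update each variable in place with plain gradient descent.
        for grad, name in grads_and_vars:
            variables[name] -= self.learning_rate * grad

    def minimize(self, grad_fn, variables):
        # minimize() is just the two steps run back to back.
        self.apply_gradients(self.compute_gradients(grad_fn, variables), variables)


def grad_of_square_loss(variables):
    # loss = (w - 3)^2, so d(loss)/dw = 2 * (w - 3)
    return {"w": 2.0 * (variables["w"] - 3.0)}


params = {"w": 0.0}
opt = TinySGD(learning_rate=0.1)
for _ in range(100):
    opt.minimize(grad_of_square_loss, params)
print(params["w"])  # approaches 3.0, the minimum of the loss
```

The useful part is the gap between the two steps: anything that edits the (gradient, variable) list there, such as clipping or averaging across GPUs, slots in without touching the rest of the optimizer.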

1.3 Writing optimizer.minimize() by hand

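# Now we apply gradient clipping. For this, we need to get the gradients,
# use the `clip_by_value()` function to clip them, then apply them.
# (`loss` and `learning_rate` are assumed to be defined by the model code.)
threshold = 1.0
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# compute_gradients returns a list of (gradient, variable) tuples
grads_and_vars = optimizer.compute_gradients(loss)
# clip each gradient; skip variables that have no gradient (grad is None)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars if grad is not None]
# build the op that applies the clipped gradients to their variables
training_op = optimizer.apply_gradients(capped_gvs)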
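For intuition, here is what element-wise clipping does to the (gradient, variable) list, sketched in plain Python with made-up numbers. The clip_by_value function here is a stand-in I wrote for illustration, not the TensorFlow op. The None entry imitates a variable for which compute_gradients found no gradient.

```python
def clip_by_value(grad, low, high):
    # Clamp every component of the gradient into [low, high].
    return [min(max(g, low), high) for g in grad]

threshold = 1.0
grads_and_vars = [([0.5, -3.0, 2.5], "w"), ([-0.2], "b"), (None, "unused")]

# Skip variables without a gradient, clip the rest component by component.
capped = [(clip_by_value(g, -threshold, threshold), v)
          for g, v in grads_and_vars if g is not None]
print(capped)  # [([0.5, -1.0, 1.0], 'w'), ([-0.2], 'b')]
```

Note that clipping each component separately can change the direction of the gradient: [0.5, -3.0, 2.5] and [0.5, -1.0, 1.0] do not point the same way.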

Pseudocode:

lr = tf.minimum(learning_rate, 0.001 / tf.log(999.) * tf.log(tf.cast(self.global_step, tf.float32) + 1))
self.opt = tf.train.AdamOptimizer(learning_rate=lr, beta1=0.8, beta2=0.999, epsilon=1e-7)
grads = self.opt.compute_gradients(loss)
gradients, variables = zip(*grads)
capped_grads, _ = tf.clip_by_global_norm(gradients, config.grad_clip)
train_op = self.opt.apply_gradients(zip(capped_grads, variables),
                                    global_step=self.global_step)
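Two things happen in that pseudocode: a learning rate that warms up with log(step + 1), and clipping by global norm. Here is a plain-Python sketch of both, so the numbers can be checked by hand. The function names and the default base_lr=0.001 are my own assumptions; the scaling rule clip_norm / max(global_norm, clip_norm) is the one tf.clip_by_global_norm documents.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    # global_norm = sqrt(sum of squares over every component of every gradient)
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # All gradients share one scale factor, so their joint direction is kept.
    scale = clip_norm / max(global_norm, clip_norm)
    return [[g * scale for g in grad] for grad in grads], global_norm

def warmup_lr(step, base_lr=0.001):
    # Grows with log(step + 1) and is capped at base_lr (assumed value).
    return min(base_lr, 0.001 / math.log(999.0) * math.log(step + 1.0))

clipped, norm = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], clip_norm=1.0)
print(norm)     # 5.0
print(clipped)  # every component scaled by 1/5
print(warmup_lr(0), warmup_lr(100), warmup_lr(10000))
```

Unlike clip_by_value, global-norm clipping only shortens the overall gradient; when the norm is already below clip_norm the scale factor is 1 and nothing changes.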

1.4 Device information

Multi-GPU training needs per-card information so we can check where each variable and op was created, so set log_device_placement=True in ConfigProto to print the device placement log~

# Graph creation.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Running the operation.
print(sess.run(c))

result:

The TensorFlow GPU device placement log is shown below:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K40c, pci bus id: 0000:05:00.0
b: /job:localhost/replica:0/task:0/device:GPU:0
a: /job:localhost/replica:0/task:0/device:GPU:0
MatMul: /job:localhost/replica:0/task:0/device:GPU:0
[[ 22.  28.]
 [ 49.  64.]]

2 Multi-GPU

First, here is a diagram from the official tutorial.

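The diagram makes it clear that the variables are kept on the CPU, while each GPU computes the whole model and its gradients and sends the gradients back to the CPU. The CPU averages the gradients from all GPUs and uses that average as this step's gradient to update the parameters. The diagram also shows that the CPU can only compute the average once every GPU has finished its gradient computation, so the iteration speed of the whole network is set by the slowest GPU. This is synchronous updating.
So the overall approach is:
1. Compute the gradients on each GPU (compute_gradients)
2. Average the gradients on the CPU
3. Apply the averaged gradients to the shared parameters (apply_gradients); every GPU then uses the updated parameters in the next step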
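The three steps can be sketched in plain Python with a one-parameter linear model, with ordinary loops standing in for the GPUs. Everything here (grad_mse, sync_step, the toy data) is made up for illustration. When every "GPU" gets an equal share of the batch, the averaged gradient equals the gradient of the full batch, and that is why the scheme works.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w*x - y)^2) with respect to the scalar weight w.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def sync_step(w, xs, ys, num_gpu, lr):
    per_gpu = len(xs) // num_gpu
    tower_grads = []
    for i in range(num_gpu):
        # Step 1: each "GPU" computes the gradient on its own slice of the batch.
        sl = slice(i * per_gpu, (i + 1) * per_gpu)
        tower_grads.append(grad_mse(w, xs[sl], ys[sl]))
    # Step 2: the "CPU" waits for all towers, then averages their gradients.
    avg_grad = sum(tower_grads) / num_gpu
    # Step 3: one update of the single shared weight.
    return w - lr * avg_grad, avg_grad

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_new, avg_grad = sync_step(0.0, xs, ys, num_gpu=2, lr=0.01)
print(avg_grad, grad_mse(0.0, xs, ys))  # both are -30.0
print(w_new)
```

The equality only holds when the slices are the same size, which is one reason the demo uses a batch size of 128 * num_gpu.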

2.1 baseline

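Straight to the code, comparing single-GPU and multi-GPU training. The demo below is not the official version, but I find it very helpful for understanding; work through it first, then look at the official version afterwards~ (for the complete code see multi_gpu.py on git)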

import sys
import os
import numpy as np
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

def get_weight_varible(name,shape):
    return tf.get_variable(name, shape=shape,
                       initializer=tf.contrib.layers.xavier_initializer())

def get_bias_varible(name,shape):
    return tf.get_variable(name, shape=shape,
                       initializer=tf.contrib.layers.xavier_initializer())

#filter_shape: [f_h, f_w, f_ic, f_oc]
def conv2d(layer_name, x, filter_shape):
    with tf.variable_scope(layer_name):
        w = get_weight_varible('w', filter_shape)
        b = get_bias_varible('b', filter_shape[-1])
        y = tf.nn.bias_add(tf.nn.conv2d(input=x, filter=w, strides=[1, 1, 1, 1], padding='SAME'), b)
        return y

def pool2d(layer_name, x):
    with tf.variable_scope(layer_name):
        y = tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
        return y

#inp_shape: [N, L]
#out_shape: [N, L]
def fc(layer_name, x, inp_shape, out_shape):
    with tf.variable_scope(layer_name):
        inp_dim = inp_shape[-1]
        out_dim = out_shape[-1]
        y = tf.reshape(x, shape=inp_shape)
        w = get_weight_varible('w', [inp_dim, out_dim])
        b = get_bias_varible('b', [out_dim])
        y = tf.add(tf.matmul(y, w), b)
        return y

def build_model(x):
    y = tf.reshape(x,shape=[-1, 28, 28, 1])
    #layer 1
    y = conv2d('conv_1', y, [3, 3, 1, 8])
    y = pool2d('pool_1', y)
    #layer 2
    y = conv2d('conv_2', y, [3, 3, 8, 16])
    y = pool2d('pool_2', y)
    #layer fc
    y = fc('fc', y, [-1, 7*7*16], [-1, 10])
    return y


def average_losses(loss):
    tf.add_to_collection('losses', loss)

    # Assemble all of the losses for the current tower only.
    losses = tf.get_collection('losses')

    # Calculate the total loss for the current tower.
    regularization_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
    total_loss = tf.add_n(losses + regularization_losses, name='total_loss')

    # Compute the moving average of all individual losses and the total loss.
    loss_averages = tf.train.ExponentialMovingAverage(0.9, name='avg')
    loss_averages_op = loss_averages.apply(losses + [total_loss])

    with tf.control_dependencies([loss_averages_op]):
        total_loss = tf.identity(total_loss)
    return total_loss

def average_gradients(tower_grads):
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Note that each grad_and_vars looks like the following:
        #   ((grad0_gpu0, var0_gpu0), ... , (grad0_gpuN, var0_gpuN))
        grads = [g for g, _ in grad_and_vars]
        # Average over the 'tower' dimension.
        grad = tf.stack(grads, 0)
        grad = tf.reduce_mean(grad, 0)

        # Keep in mind that the Variables are redundant because they are shared
        # across towers. So .. we will just return the first tower's pointer to
        # the Variable.
        v = grad_and_vars[0][1]
        grad_and_var = (grad, v)
        average_grads.append(grad_and_var)

    return average_grads

def feed_all_gpu(inp_dict, models, payload_per_gpu, batch_x, batch_y):
    for i in range(len(models)):
        x, y, _, _, _ = models[i]
        start_pos = i * payload_per_gpu
        stop_pos = (i + 1) * payload_per_gpu
        inp_dict[x] = batch_x[start_pos:stop_pos]
        inp_dict[y] = batch_y[start_pos:stop_pos]
    return inp_dict

def single_gpu():
    batch_size = 128
    mnist = input_data.read_data_sets('/tmp/data/mnist',one_hot=True)

    tf.reset_default_graph()
    with tf.Session() as sess:
        with tf.device('/cpu:0'):
            print('build model...')
            print('build model on gpu tower...')
            with tf.device('/gpu:0'):
                x = tf.placeholder(tf.float32, [None, 784])
                y = tf.placeholder(tf.float32, [None, 10])
                pred = build_model(x)
                loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
                learning_rate = tf.placeholder(tf.float32, shape=[])
                train_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
            print('build model on gpu tower done.')

            print('reduce model on cpu...')
            all_y = tf.reshape(y, [-1,10])
            all_pred = tf.reshape(pred, [-1,10])
            correct_pred = tf.equal(tf.argmax(all_y, 1), tf.argmax(all_pred, 1))
            accuracy = tf.reduce_mean(tf.cast(correct_pred, 'float'))
            print('reduce model on cpu done.')

            print('run train op...')
            sess.run(tf.global_variables_initializer())
            lr = 0.01
            for epoch in range(2):
                start_time = time.time()
                total_batch = int(mnist.train.num_examples/batch_size)
                avg_loss = 0.0
                print('\n---------------------')
                print('Epoch:%d, lr:%.4f' % (epoch,lr))
                for batch_idx in range(total_batch):
                    batch_x,batch_y = mnist.train.next_batch(batch_size)
                    inp_dict = {}
                    inp_dict[learning_rate] = lr
                    inp_dict[x] = batch_x
                    inp_dict[y] = batch_y
                    _, _loss = sess.run([train_op, loss], inp_dict)
                    avg_loss += _loss
                avg_loss /= total_batch
                print('Train loss:%.4f' % (avg_loss))

                lr = max(lr * 0.7,0.00001)

                total_batch = int(mnist.validation.num_examples / batch_size)
                preds = None
                ys = None
                for batch_idx in range(total_batch):
                    batch_x,batch_y = mnist.validation.next_batch(batch_size)
                    inp_dict = {}
                    inp_dict[x] = batch_x
                    inp_dict[y] = batch_y
                    batch_pred,batch_y = sess.run([all_pred,all_y], inp_dict)
                    if preds is None:
                        preds = batch_pred
                    else:
                        preds = np.concatenate((preds, batch_pred), 0)
                    if ys is None:
                        ys = batch_y
                    else:
                        ys = np.concatenate((ys,batch_y),0)
                val_accuracy = sess.run([accuracy], {all_y:ys, all_pred:preds})[0]
                print('Val Accuracy: %0.4f%%' % (100.0 * val_accuracy))

                stop_time = time.time()
                elapsed_time = stop_time - start_time
                print('Cost time: ' + str(elapsed_time) + ' sec.')
            print('training done.')

            total_batch = int(mnist.test.num_examples / batch_size)
            preds = None
            ys = None
            for batch_idx in range(total_batch):
                batch_x, batch_y = mnist.test.next_batch(batch_size)
                inp_dict = {}
                inp_dict[x] = batch_x
                inp_dict[y] = batch_y
                batch_pred, batch_y = sess.run([all_pred, all_y], inp_dict)
                if preds is None:
                    preds = batch_pred
                else:
                    preds = np.concatenate((preds, batch_pred), 0)
                if ys is None:
                    ys = batch_y
                else:
                    ys = np.concatenate((ys, batch_y), 0)
            test_accuracy = sess.run([accuracy], {all_y: ys, all_pred: preds})[0]
            print('Test Accuracy: %0.4f%%' % (100.0 * test_accuracy))

def multi_gpu(num_gpu):
    batch_size = 128 * num_gpu
    mnist = input_data.read_data_sets('./data',one_hot=True)

    tf.reset_default_graph()
    with tf.Session() as sess:
        with tf.device('/cpu:0'):
            learning_rate = tf.placeholder(tf.float32, shape=[])
            opt = tf.train.AdamOptimizer(learning_rate=learning_rate)

            print('build model...')
            print('build model on gpu tower...')
            models = []
            for gpu_id in range(num_gpu):
                with tf.device('/gpu:%d' % gpu_id):
                    print('tower:%d...'% gpu_id)
                    with tf.name_scope('tower_%d' % gpu_id):
                        with tf.variable_scope('cpu_variables', reuse=gpu_id>0):
                            x = tf.placeholder(tf.float32, [None, 784])
                            y = tf.placeholder(tf.float32, [None, 10])
                            pred = build_model(x)
                            loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
                            grads = opt.compute_gradients(loss)
                            models.append((x,y,pred,loss,grads))
            print('build model on gpu tower done.')

            print('reduce model on cpu...')
            tower_x, tower_y, tower_preds, tower_losses, tower_grads = zip(*models)
            aver_loss_op = tf.reduce_mean(tower_losses)
            apply_gradient_op = opt.apply_gradients(average_gradients(tower_grads))

            all_y = tf.reshape(tf.stack(tower_y, 0), [-1,10])
            all_pred = tf.reshape(tf.stack(tower_preds, 0), [-1,10])
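
Two bits of the listing are easy to misread: the zip(*tower_grads) regrouping inside average_gradients, and the batch slicing in feed_all_gpu. Here is a plain-Python sketch of both, with scalars standing in for tensors and strings standing in for variables. The names average_gradients_py and split_batch are mine, not part of the demo.

```python
def average_gradients_py(tower_grads):
    # tower_grads[i] is the list of (grad, var) pairs produced on tower i.
    averaged = []
    for grad_and_vars in zip(*tower_grads):
        # zip(*...) regroups the lists: one tuple per variable, one entry per tower.
        grads = [g for g, _ in grad_and_vars]
        var = grad_and_vars[0][1]  # variables are shared, so take tower 0's
        averaged.append((sum(grads) / len(grads), var))
    return averaged

def split_batch(batch, num_gpu):
    # Same slicing as feed_all_gpu: tower i gets rows [i*p, (i+1)*p).
    p = len(batch) // num_gpu
    return [batch[i * p:(i + 1) * p] for i in range(num_gpu)]

tower_grads = [[(1.0, "w"), (10.0, "b")],   # tower 0
               [(3.0, "w"), (30.0, "b")]]   # tower 1
print(average_gradients_py(tower_grads))    # [(2.0, 'w'), (20.0, 'b')]
print(split_batch(list(range(8)), 2))       # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The regrouping relies on every tower listing its variables in the same order, which holds in the demo because each tower builds the same model with shared variables.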
 
              
           
              
              
            