Problems TensorFlow Beginners May Run Into, and How to Solve Them

阿新 • Published: 2018-12-31

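What is TensorFlow

The official definition: TensorFlow is an open-source software library for numerical computation using data flow graphs. Put simply, TensorFlow is Google's open-source deep learning framework.

Problems beginners may run into, and how to solve them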

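1. The error:

tensorflow.python.framework.errors.FailedPreconditionError: Attempting to use uninitialized value Variable

Remember to initialize all variables before calling sess.run() on anything that reads them:

init_op = tf.global_variables_initializer()  # tf.initialize_all_variables() in older releases
sess.run(init_op)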

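2. Errors such as Cannot feed value of shape (500,) for Tensor '*', which has shape '(?, 500)'.
This is a shape mismatch between the data you feed and the tensor that receives it: the tensor expects a 2-D value of shape (?, 500), where ? is the batch dimension, but the value fed is a 1-D array of 500 elements. Once you have confirmed that the data itself is correct, reshaping it to (1, 500) is enough. Alternatively, fix one dimension and write -1 for the other; the size of that dimension is then inferred from the total number of elements.

inference_correct_prediction_value = sess.run(inference_correct_prediction, feed_dict={inference_op1: np.reshape(inference_op_value, (1, -1))})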
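To make the -1 rule concrete, here is a minimal pure-Python sketch of how a single -1 entry in a target shape can be resolved. The helper name infer_shape is hypothetical and only mirrors the behaviour described above; it is not NumPy's or TensorFlow's actual implementation.

```python
def infer_shape(num_elements, shape):
    """Resolve a single -1 entry in `shape` so the sizes multiply to num_elements.

    Hypothetical helper for illustration only.
    """
    shape = tuple(shape)
    if shape.count(-1) > 1:
        raise ValueError("only one dimension may be -1")
    # Product of all dimensions that were given explicitly.
    known = 1
    for dim in shape:
        if dim != -1:
            known *= dim
    if -1 not in shape:
        if known != num_elements:
            raise ValueError("shape does not match the number of elements")
        return shape
    if known == 0 or num_elements % known != 0:
        raise ValueError("cannot infer the missing dimension")
    return tuple(num_elements // known if dim == -1 else dim for dim in shape)


print(infer_shape(500, (1, -1)))  # (1, 500): one row of 500 features
print(infer_shape(500, (-1, 1)))  # (500, 1): 500 rows of one feature
```

Only the first result matches a tensor of shape (?, 500), which is why the position of the -1 matters when you reshape.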

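3. During training, sess.run(x) returns a different result on each call.
Printing a tensor object directly in TensorFlow shows some of the object's attributes, not its value:

<tensorflow.python.ops.variables.Variable object at 0x4c63f90>

To see the value of a Variable or Constant you have to call sess.run(x). First open an interactive Session, where x is the object we want to inspect: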

import tensorflow as tf
x = tf.Variable([1.0, 2.0])
sess = tf.InteractiveSession()
x.initializer.run()  # initialize x in the interactive (default) session
print(sess.run(x))

If x depends on input data, feed that data before evaluating it:

inputdata = ****  # a feed_dict: maps each input tensor to the value to feed
x_value = sess.run(x, feed_dict=inputdata)
print(x_value)

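Every call to sess.run() re-executes the part of the graph that the requested values depend on. During training that can include stateful steps, such as a parameter update or reading the next batch from an input queue, so for models such as neural networks two separate calls may return results that do not correspond to each other. To get consistent values, fetch them all in a single sess.run() call, for example:

inference_correct_prediction_value, inference_accuracy_value = sess.run([inference_correct_prediction, inference_accuracy], feed_dict={inference_op1: np.reshape(inference_op_value, (1, -1))})

The first argument to run() is the list of operations to evaluate, and all the data they depend on goes into feed_dict. What sess.run() returns is then not tensor objects but NumPy ndarrays, which are much easier to work with.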
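The effect of fetching values separately versus together can be imitated without TensorFlow. The sketch below uses a hypothetical ToySession class whose run() consumes one batch per call, standing in for a graph that reads from an input queue; it is an analogy under that assumption, not TensorFlow code.

```python
class ToySession:
    """Hypothetical stand-in for a session whose graph reads from an input queue."""

    def __init__(self, batches):
        self._batches = iter(batches)

    def run(self, fetches):
        # One call = one pass over the graph = one batch consumed,
        # no matter how many values are fetched.
        batch = next(self._batches)
        results = {"batch_sum": sum(batch), "batch_max": max(batch)}
        return [results[name] for name in fetches]


batches = [[1, 2, 3], [10, 20, 30]]

# Two separate calls: each value is computed from a different batch.
sess = ToySession(batches)
(separate_sum,) = sess.run(["batch_sum"])
(separate_max,) = sess.run(["batch_max"])
print(separate_sum, separate_max)  # 6 30

# One call with both fetches: both values describe the same batch.
sess = ToySession(batches)
joint_sum, joint_max = sess.run(["batch_sum", "batch_max"])
print(joint_sum, joint_max)  # 6 3
```

In the first case the sum and the maximum come from different data, which is the kind of mismatch that a single sess.run() with a list of fetches avoids.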

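4. The error:

Ran out of memory trying to allocate 625.0KiB

Before running, check the state of the GPUs with gpustat to see whether any memory is left, or which GPU has free memory. This matters because, when GPUs are available, TensorFlow runs on the first GPU by default: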

# gpustat -cup
Sun Oct 30 19:37:09 2016
[0] ***       | 35'C,   0 % |    22 / 11519 MB |
[1] ***       | 36'C,   0 % |    22 / 11519 MB |

You can then select the GPUs to run on from the command line:

CUDA_VISIBLE_DEVICES=1,2 python ***.py  # device IDs are comma-separated; no spaces around the equals sign
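The same restriction can be applied from inside a script. This is a minimal sketch that assumes the variable is set before TensorFlow initializes CUDA; setting it before the import is the safe choice. The GPU index "1" is only an example value.

```python
import os

# Set this before importing TensorFlow (or any other CUDA library),
# so that the CUDA runtime sees it when it initializes.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # expose only physical GPU 1 (example value)

visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(visible)  # ['1']
```

Inside the process the visible devices are renumbered from zero, so with this setting physical GPU 1 is addressed as "/gpu:0".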

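You can also use the device argument in the program to choose the device to run on: "/cpu:0" is the machine's CPU, "/gpu:0" is the first GPU, "/gpu:1" the second, and so on:

with tf.Session() as sess:
    with tf.device("/gpu:1"):  # place the ops below on the second GPU
        var1 = tf.constant([[1., 2.]])
        var2 = tf.constant([[3.], [5.]])
        product = tf.matmul(var1, var2)
    print(sess.run(product))  # [[13.]]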

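5. How do you save a model, and inspect its trained parameters afterwards?
TensorFlow's checkpoint mechanism lets it support both Online Learning and Continuous Learning. First, use tf.train.Saver() to save the trained model, or the model partway through training, as a checkpoint:

saver = tf.train.Saver()  # create once, after the variables have been defined
_, loss_value, step = sess.run([train_op, loss, global_step])
saver.save(sess, "./checkpoint/checkpoint.ckpt", global_step=step)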

Then use restore() to load the model back from the local checkpoint files. You can also resume training from that point, which is what Continuous Learning means here:

ckpt = tf.train.get_checkpoint_state("./checkpoint/")
if ckpt and ckpt.model_checkpoint_path:
    print("Continue training from the model {}".format(ckpt.model_checkpoint_path))
    saver.restore(sess, ckpt.model_checkpoint_path)
    _, loss_value, step = sess.run([train_op, loss, global_step])

Finally, tf.trainable_variables() returns the parameters the model trains:

for var in tf.trainable_variables():
    print(var.name)

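6. What if the training data set is too large?
TensorFlow can read data from CSV files and from TFRecords files. When reading binary TFRecords files you can use QueueRunner and Coordinator to read with multiple threads. The epoch parameter controls how many times the training files are iterated over, batch_size controls how many samples are taken from the training data per training step, and samples can also be drawn in random order. Reading the data this way helps speed up training.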

def read_and_decode(filename_queue):  # read one example from TFRecords
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            "label": tf.FixedLenFeature([], tf.float32),
            "features": tf.FixedLenFeature([FEATURE_SIZE], tf.float32),
        })
    label = features["label"]
    features = features["features"]
    return label, features

filename_queue = tf.train.string_input_producer(tf.train.match_filenames_once(trainFile), num_epochs=epoch_number)
label, features = read_and_decode(filename_queue)
batch_labels, batch_features = tf.train.shuffle_batch([label, features], batch_size=batch_size, num_threads=thread_number, capacity=capacity, min_after_dequeue=min_after_dequeue)

Here trainFile can be a list of file names:

trainFile = ['./data/train_1.tfrecords','./data/train_2.tfrecords']

It can also be a glob pattern:

trainFile = './data/*.tfrecords'

Use a Coordinator to manage the queue threads:

coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord, sess=sess)
try:
    while not coord.should_stop():
        _, loss_value, step = sess.run([train_op, loss, global_step])
        saver.save(sess, "./checkpoint/checkpoint.ckpt", global_step=step)
except tf.errors.OutOfRangeError:
    print("Done training after reading all data")
finally:
    coord.request_stop()
coord.join(threads)  # wait for the reader threads to finish

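A problem that comes up often here is the queue closing before any training has happened, with a message like "get 'OutOfRange', the queue will be closed". This happens when epoch is set too small, so the data is exhausted and the program exits before training really starts; setting epoch larger helps. When num_epochs is set, the epoch counter is a local variable, so tf.local_variables_initializer() also has to be run before the queue runners start; skipping it can produce the same symptom. If epoch is set to None the program keeps running without limit, and you can interrupt it by hand once the results are good enough. Which brings me to my own question: is there a good way to choose the epoch parameter?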
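One way to reason about an epoch value is to work out how many training steps it yields before the queue raises OutOfRangeError. The sketch below is a back-of-the-envelope calculation, assuming the final partial batch is dropped (the default for tf.train.shuffle_batch, allow_smaller_final_batch=False); the function name and the sample sizes are hypothetical.

```python
def total_batches(num_examples, batch_size, num_epochs):
    """Full batches available before the input queue is exhausted.

    Assumes the last partial batch is dropped.
    Returns float("inf") when num_epochs is None (unlimited).
    """
    if num_epochs is None:
        return float("inf")
    return (num_examples * num_epochs) // batch_size


print(total_batches(10000, 128, 1))     # 78
print(total_batches(10000, 128, 10))    # 781
print(total_batches(10000, 128, None))  # inf
```

If the number of steps you expect to need exceeds this figure, the queue will close early and the epoch value should be raised.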

7. How do you run a program in distributed mode?

python ***.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=ps --task_index=0
python ***.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=ps --task_index=1
python ***.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=worker --task_index=0
python ***.py --ps_hosts=127.0.0.1:2222,127.0.0.1:2223 --worker_hosts=127.0.0.1:2224,127.0.0.1:2225 --job_name=worker --task_index=1

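Here ps is the parameter server of the training cluster and stores the model's Variables; a worker is a node that computes the model's gradients, and the resulting gradient vectors are handed to the ps to update the model. ps_hosts lists the ps processes and worker_hosts lists the workers. Both the ps and the workers must be started, ps first and workers after, each as its own process; the commands above start 4 processes. Otherwise you will see an error like this: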

E0830 09:34:30.845674045   51986 tcp_client_posix.c:173]     failed to connect to 'ipv4:127.0.0.1:2222': socket error: connection refused

Of course, to be on the safe side you can also pin each process to a GPU, as described earlier.
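The relationship between the host lists and the processes to launch can be sketched in plain Python. The helper names build_cluster and launch_order are hypothetical; the dict has the same {"ps": [...], "worker": [...]} layout that the article later passes to tf.train.ClusterSpec.

```python
def build_cluster(ps_hosts, worker_hosts):
    """Split comma-separated host lists into a {job name: [addresses]} dict."""
    return {"ps": ps_hosts.split(","), "worker": worker_hosts.split(",")}


def launch_order(cluster):
    """List (job_name, task_index, address) for every process, ps first."""
    return [(job, index, address)
            for job in ("ps", "worker")
            for index, address in enumerate(cluster[job])]


cluster = build_cluster("127.0.0.1:2222,127.0.0.1:2223",
                        "127.0.0.1:2224,127.0.0.1:2225")
for job, index, address in launch_order(cluster):
    # One process per entry; every process receives the full host lists.
    print("--job_name=%s --task_index=%d  (listens on %s)" % (job, index, address))
```

Each printed line corresponds to one of the four commands above: the task_index selects which address in the matching host list the process binds to.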

8. How do you pass arguments to a distributed TensorFlow program?
A distributed program is started through tf.app.run(), and main() is declared with an underscore parameter:

def main(_):  # don't forget the underscore: tf.app.run() calls main with sys.argv
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):
            print("start training")
            # ......

if __name__ == "__main__":
    tf.app.run()

How does the argument passing work? Let's look at the source of tf.app.run():

"""Generic entry point script."""  
from __future__ import absolute_import  
from __future__ import division  
from __future__ import print_function    
import sys  
from tensorflow.python.platform import flags      
def run(main=None):  
    f = flags.FLAGS  
    f._parse_flags()  
    main = main or sys.modules['__main__'].main  
    sys.exit(main(sys.argv))

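Before main is executed, the flags are parsed. In other words, TensorFlow passes the arguments that a tf.app.run() program needs through flags: we can give the flags default values in the program before it runs, or set them as command-line arguments when launching it.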

flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
flags.DEFINE_integer('epoch_number', None, 'Number of epochs to run trainer.')
flags.DEFINE_integer("thread_number", 10 , "Number of thread to read data")
flags.DEFINE_string("mode", "train", "Option mode: train, train_from_scratch, inference")

For example, we can run the program like this:

python ***.py --mode train
python ***.py --mode inference
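The parse-then-call pattern can be tried without TensorFlow. The sketch below is a standard-library analogue built on argparse, not TensorFlow's implementation; it reuses three of the flag names defined above and passes explicit argument lists in place of a real command line.

```python
import argparse

# Declare flags with defaults, mirroring the DEFINE_* calls above.
parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float, default=0.01,
                    help="Initial learning rate.")
parser.add_argument("--epoch_number", type=int, default=None,
                    help="Number of epochs to run trainer.")
parser.add_argument("--mode", type=str, default="train",
                    help="Option mode: train, train_from_scratch, inference")


def main(flags):
    # By the time main runs, every flag already holds its final value.
    return "mode=%s learning_rate=%s" % (flags.mode, flags.learning_rate)


# No arguments: the defaults declared in the program apply.
print(main(parser.parse_args([])))                       # mode=train learning_rate=0.01
# Command-line style arguments override the defaults.
print(main(parser.parse_args(["--mode", "inference"])))  # mode=inference learning_rate=0.01
```

The two calls correspond to the two ways of setting flags described above: defaults written in the program, and values supplied at launch.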

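That is about all. For a systematic introduction to applying TensorFlow, I recommend the articles by @tobe迪豪. Corrections and discussion are welcome.

References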

https://www.tensorflow.org/

https://github.com/wookayin/gpustat

http://weibo.com/p/1005052622517194/wenzhang

Note: this article is an original work by “小米安全中心” (Xiaomi Security Center); please contact “小米安全中心” before reprinting.

Question

When training with TensorFlow, is there a good way to choose the epoch parameter? (See item 6 above.)

