Tensorflow on Spark爬坑指南
由於機器學習和深度學習不斷被炒熱,Tensorflow作為Google家(Jeff Dean大神)推出的開源深度學習框架,也獲得了很多關注。Tensorflow的靈活性很強,允許使用者使用多臺機器的多個裝置(如不同的CPU和GPU)。但是由於Tensorflow 分散式的方式需要使用者在客戶端顯示指定叢集資訊,另外需要手動拉起ps, worker等task. 對資源管理和使用上有諸多不便。因此,Yahoo開源了基於Spark的Tensorflow,使用executor執行worker和ps task. 專案地址為:https://github.com/yahoo/TensorFlowOnSpark。
寫在前面,前方高能,請注意!
雖然yahoo提供瞭如何在Spark叢集中執行Tensorflow的步驟,但是由於這個guideline過於簡單,一般情況下,根據這個guideline是跑不起來的。
Tensorflow on spark介紹
TensorflowOnSpark 支援使用Spark/Hadoop叢集分散式的執行Tensorflow,號稱支援所有的Tensorflow操作。需要注意的是使用者需要對原有的TF程式進行簡單的改造,就能夠執行在Spark叢集之上。
如何跑起來Tensorflow on spark
雖然Yahoo在github上說明了安裝部署TFS (https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN), 但是根據實際實踐,根據這個文件如果能跑起來,那真的要謝天謝地。因為在實際過程中,會因為環境問題遇到一些unexpected error。以下就是我將自己在實踐過程中遇到的一些問題總結列舉。
1、編譯python和pip
yahoo提供的編譯步驟為:
# download and extract Python 2.7 export PYTHON_ROOT=~/Python curl -O https://www.python.org/ftp/python/2.7.12/Python-2.7.12.tgz tar -xvf Python-2.7.12. tgzrm Python-2.7.12.tgz # compile into local PYTHON_ROOT pushd Python-2.7.12 ./configure --prefix="${PYTHON_ROOT}" --enable-unicode=ucs4 make make installpopdrm -rf Python-2.7.12 # install pip pushd "${PYTHON_ROOT}" curl -O https://bootstrap.pypa.io/get-pip.py bin/python get-pip.py rm get-pip.py # install tensorflow (and any custom dependencies) ${PYTHON_ROOT}/bin/pip install pydoop # Note: add any extra dependencies here popd
在實際編譯過程中,採用的Centos7.2作業系統,可能出現以下問題:
安裝pip報錯
bin/python get-pip.py ERROR:root:code for hash sha224 was not found. Traceback (most recent call last):
報這個錯一般是因為python中缺少_ssl.so 和 _hashlib.so庫造成,可以從系統python庫中找對應版本的拷貝到相應的python資料夾下(例如:lib/python2.7/lib-dynload)。
缺少zlib
bin/python get-pip.py Traceback (most recent call last): File "get-pip.py", line 20061, in <module> main() File "get-pip.py", line 194, in main bootstrap(tmpdir=tmpdir) File "get-pip.py", line 82, in bootstrap import pip zipimport.ZipImportError: can't decompress data; zlib not available
解決這個問題的方法是使用yum安裝zlib*後,重新編譯python後,即可解決。
ssl 報錯
bin/python get-pip.py pip is configured with locations that require TLS/SSL, however the ssl module in Python is not available. Collecting pip Could not fetch URL https://pypi.python.org/simple/pip/: There was a problem confirming the ssl certificate: Can't connect to HTTPS URL because the SSL module is not available. - skipping Could not find a version that satisfies the requirement pip (from versions: ) No matching distribution found for pip
解決方法: 在Python安裝目錄下開啟檔案lib/python2.7/ssl.py,註釋掉 , HAS_ALPN
from _ssl import HAS_SNI, HAS_ECDH, HAS_NPN#, HAS_ALPN
pip install pydoop報錯
gcc: error trying to exec 'cc1plus': execvp:
解決辦法:需要在機器上安裝g++編譯器
2、安裝編譯 tensorflow w/RDMA support
git clone [email protected]:yahoo/tensorflow.git# follow build instructions to install into ${PYTHON_ROOT}
注意編譯過程需要google的bazel和protoc, 這兩個工具需要提前裝好。
3、接下來的步驟按照
https://github.com/yahoo/TensorFlowOnSpark/wiki/GetStarted_YARN 指導的步驟完成。
4、在HDP2.5部署的spark on Yarn環境上執行tensorflow
在yarn-env.sh中設定環境變數,增加 * export HADOOP_HDFS_HOME=/usr/hdp/2.5.0.0-1245/hadoop-hdfs/*
因為這個環境變數需要在執行tensorflow任務時被用到,如果沒有export,會報錯。
重啟YARN,使上述改動生效。
按照Yahoo github上的步驟,執行訓練mnist任務時,按下面命令提交作業:
export PYTHON_ROOT=/data2/Python/export LD_LIBRARY_PATH=${PATH}export PYSPARK_PYTHON=${PYTHON_ROOT}/bin/pythonexport SPARK_YARN_USER_ENV="PYSPARK_PYTHON=Python/bin/python"export PATH=${PYTHON_ROOT}/bin/:$PATHexport QUEUE=default spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 1G --py-files /data2/tesorflowonSpark/TensorFlowOnSpark/tfspark.zip,/data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/jdk64/jdk1.8.0_77/jre/lib/amd64/server/" /data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images mnist/csv/test/images --labels mnist/csv/test/labels --mode inference --model mnist_model --output predictions
此時,通過Spark介面可以觀察到worker0處於阻塞狀態。
17/03/21 18:17:18 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 28.4 KB, free 542.6 KB) 17/03/21 18:17:18 INFO TorrentBroadcast: Reading broadcast variable 1 took 17 ms 17/03/21 18:17:18 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 440.6 KB, free 983.3 KB) 2017-03-21 18:17:18,404 INFO (MainThread-14872) Connected to TFSparkNode.mgr on ochadoop03, ppid=14685, state='running' 2017-03-21 18:17:18,411 INFO (MainThread-14872) mgr.state='running' 2017-03-21 18:17:18,411 INFO (MainThread-14872) Feeding partition <generator object load_stream at 0x7f447f120960> into input queue <multiprocessing.queues.JoinableQueue object at 0x7f447f129890> 17/03/21 18:17:20 INFO PythonRunner: Times: total = 2288, boot = -5387, init = 5510, finish = 2165 17/03/21 18:17:20 INFO PythonRunner: Times: total = 101, boot = 3, init = 21, finish = 77 2017-03-21 18:17:20.587060: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session b5d9a21a16799e0b with config:
通過分析原因發現,在mnist例子中,logdir設定的是hdfs的路徑,可能是由於tf對hdfs的支援有限或者存在bug(慚愧,並沒有深究 :))。將logdir改為本地目錄,就可以正常執行。但是由此又帶來了另一個問題,因為Spark每次啟動時worker0的位置並不確定,有可能每次啟動的機器都不同,這就導致在inference的時候沒有辦法獲得訓練的模型。
一個解決辦法是:在worker 0訓練完模型後,將模型同步到hdfs中,在inference的之前,再將hdfs的checkpoints資料夾拉取到本地執行。以下為我對yahoo提供的mnist example做的類似的修改.
def writeFileToHDFS():
rootdir = '/tmp/mnist_model'
client = HdfsClient(hosts='localhost:50070')
client.mkdirs('/user/root/mnist_model')
for parent,dirnames,filenames in os.walk(rootdir):
for dirname in dirnames:
print("parent is:{0}".format(parent))
for filename in filenames:
client.copy_from_local(os.path.join(parent,filename), os.path.join('/user/root/mnist_model',filename), overwrite=True)
#logdir = TFNode.hdfs_path(ctx, args.model)
logdir = "/tmp/" + args.model
while not sv.should_stop() and step < args.steps:
# Run a training step asynchronously.
# See `tf.train.SyncReplicasOptimizer` for additional details on how to
# perform *synchronous* training.
# using feed_dict
batch_xs, batch_ys = feed_dict()
feed = {x: batch_xs, y_: batch_ys}
if len(batch_xs) != batch_size:
print("done feeding")
break
else:
if args.mode == "train":
_, step = sess.run([train_op, global_step], feed_dict=feed)
# print accuracy and save model checkpoint to HDFS every 100 steps
if (step % 100 == 0):
print("{0} step: {1} accuracy: {2}".format(datetime.now().isoformat(), step, sess.run(accuracy,{x: batch_xs, y_: batch_ys})))
else:
# args.mode == "inference"
labels, preds, acc = sess.run([label, prediction, accuracy], feed_dict=feed)
results = ["{0} Label: {1}, Prediction: {2}".format(datetime.now().isoformat(), l, p) for l,p in zip(labels,preds)]
TFNode.batch_results(ctx.mgr, results)
print("acc: {0}".format(acc))
if task_index == 0:
writeFileToHDFS()
當然這段程式碼只是為了進行說明,並不是很嚴謹,在上傳hdfs的時候,是需要對資料夾是否存在等要做一系列的判斷。
5、train& inference
向Spark叢集提交訓練任務.
spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 3 --executor-memory 7G --py-files /data2/tesorflowonSpark/TensorFlowOnSpark/tfspark.zip,/data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/jdk64/jdk1.8.0_77/jre/lib/amd64/server/" /data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images mnist/csv/train/images --labels mnist/csv/train/labels --mode train --model mnist_model
執行起來後,檢視Spark UI,可以看到當前訓練過程中的作業執行情況。
6.46.43.png
執行完後,檢查hdsf,checkpoint目錄, 可以看到模型的checkpoints已經上傳到hdfs中。
hadoop fs -ls /user/root/mnist_model Found 8 items -rwxr-xr-x 3 root hdfs 179 2017-03-21 18:53 /user/root/mnist_model/checkpoint -rwxr-xr-x 3 root hdfs 117453 2017-03-21 18:53 /user/root/mnist_model/graph.pbtxt -rwxr-xr-x 3 root hdfs 814164 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-0.data-00000-of-00001-rwxr-xr-x 3 root hdfs 372 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-0.index -rwxr-xr-x 3 root hdfs 45557 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-0.meta -rwxr-xr-x 3 root hdfs 814164 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-338.data-00000-of-00001-rwxr-xr-x 3 root hdfs 372 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-338.index -rwxr-xr-x 3 root hdfs 45557 2017-03-21 18:53 /user/root/mnist_model/model.ckpt-338.meta
根據訓練的結果,執行模型inference
spark-submit --master yarn --deploy-mode cluster --queue ${QUEUE} --num-executors 4 --executor-memory 1G --py-files /data2/tesorflowonSpark/TensorFlowOnSpark/tfspark.zip,/data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py --conf spark.dynamicAllocation.enabled=false --conf spark.yarn.maxAppAttempts=1 --archives hdfs:///user/${USER}/Python.zip#Python --conf spark.executorEnv.LD_LIBRARY_PATH="/usr/jdk64/jdk1.8.0_77/jre/lib/amd64/server/" /data2/tesorflowonSpark/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py --images mnist/csv/test/images --labels mnist/csv/test/labels --mode inference --model mnist_model --output predictions
等任務執行完成後,會發現,模型判斷的結果已經輸出到hdfs相關目錄下了。
hadoop fs -ls /user/root/predictions Found 11 items -rw-r--r-- 3 root hdfs 0 2017-03-21 19:16 /user/root/predictions/_SUCCESS -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00000 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00001 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00002 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00003 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00004 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00005 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00006 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00007 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00008 -rw-r--r-- 3 root hdfs 51000 2017-03-21 19:16 /user/root/predictions/part-00009
檢視其中的某一個檔案,可看到裡面儲存的是測試集的標籤和根據模型預測的結果。
# hadoop fs -cat /user/root/predictions/part-000002017-03-21T19:16:40.795694 Label: 7, Prediction: 7 2017-03-21T19:16:40.795729 Label: 2, Prediction: 2 2017-03-21T19:16:40.795741 Label: 1, Prediction: 1 2017-03-21T19:16:40.795750 Label: 0, Prediction: 0 2017-03-21T19:16:40.795759 Label: 4, Prediction: 4 2017-03-21T19:16:40.795769 Label: 1, Prediction: 1 2017-03-21T19:16:40.795778 Label: 4, Prediction: 4 2017-03-21T19:16:40.795787 Label: 9, Prediction: 9 2017-03-21T19:16:40.795796 Label: 5, Prediction: 6 2017-03-21T19:16:40.795805 Label: 9, Prediction: 9 2017-03-21T19:16:40.795814 Label: 0, Prediction: 0 2017-03-21T19:16:40.795822 Label: 6, Prediction: 6 2017-03-21T19:16:40.795831 Label: 9, Prediction: 9 2017-03-21T19:16:40.795840 Label: 0, Prediction: 0 2017-03-21T19:16:40.795848 Label: 1, Prediction: 1 2017-03-21T19:16:40.795857 Label: 5, Prediction: 5 2017-03-21T19:16:40.795866 Label: 9, Prediction: 9 2017-03-21T19:16:40.795875 Label: 7, Prediction: 7 2017-03-21T19:16:40.795883 Label: 3, Prediction: 3 2017-03-21T19:16:40.795892 Label: 4, Prediction: 4 2017-03-21T19:16:40.795901 Label: 9, Prediction: 9 2017-03-21T19:16:40.795909 Label: 6, Prediction: 6 2017-03-21T19:16:40.795918 Label: 6, Prediction: 6
Spark叢集和tensorflow job task的對應關係,如下圖,spark叢集起了4個executor,其中一個作為PS, 另外3個作為worker,而誰做ps誰做worker是由Yarn和spark排程的。
7.22.23.png
Cluster spec: {'ps': ['ochadoop02:50060'], 'worker': ['ochadoop04:52150', 'ochadoop03:52733', 'ochad