
Kubeflow in Practice Series: Running Distributed TensorFlow with TFJob


Abstract: TensorFlow, currently the most popular deep learning framework, is widely used among data scientists, and distributed training, which noticeably shortens training time, is one of its killer features. But actually deploying and running large-scale distributed model training poses a new challenge.

Introduction
This series covers how to run Kubeflow on Alibaba Cloud Container Service. This article shows how to run distributed model training with TFJob.

Distributed TensorFlow Training and Kubernetes

TensorFlow, currently the most popular deep learning framework, is widely used among data scientists, and distributed training, which noticeably shortens training time, is one of its killer features. But actually deploying and running large-scale distributed model training poses a new challenge. In practice, a user of distributed TensorFlow has to take care of three things:

1. Finding enough resources to run the training. A distributed training job usually needs a number of workers (computation servers) and ps (parameter servers), and all of these members require compute resources.
2. Installing and configuring the software and applications that support the computation.
3. Configuring the ClusterSpec required by the distributed TensorFlow design. This JSON-format ClusterSpec describes the topology of the entire distributed training cluster; with two workers and two ps servers, for example, it looks like the snippet below. Every member of the distributed job uses this ClusterSpec to initialize a tf.train.ClusterSpec object and establish in-cluster communication.
cluster = tf.train.ClusterSpec({"worker": ["<VM_1>:2222",
                                           "<VM_2>:2222"],
                                "ps": ["<IP_VM_1>:2223",
                                       "<IP_VM_2>:2223"]})
Of these, the first is exactly what Kubernetes resource scheduling excels at: both CPU and GPU scheduling work out of the box. The second is what Docker is good at: a fixed, repeatable setup captured in a container image. Automatically building the ClusterSpec is the problem TFJob solves, letting users define the topology of a distributed TensorFlow cluster through a simple, centralized configuration.

It is fair to say that the distributed training problem that has long troubled data scientists can be solved quite well by the Kubernetes + TFJob combination.

Deploying Distributed Training with Kubernetes and TFJob

Adapting the distributed TensorFlow training code
The TFJob definition was already introduced in the earlier article in this series on trying out TFJob on Alibaba Cloud, so it is not repeated here. What matters here is that the role types in a TFJob are MASTER, WORKER and PS.

As a concrete example, suppose the TFJob doing distributed training is named distributed-mnist and has 1 MASTER, 2 WORKERs and 2 PS nodes. The corresponding ClusterSpec then looks like this:

{
    "master": [
        "distributed-mnist-master-0:2222"
    ],
    "ps": [
        "distributed-mnist-ps-0:2222",
        "distributed-mnist-ps-1:2222"
    ],
    "worker": [
        "distributed-mnist-worker-0:2222",
        "distributed-mnist-worker-1:2222"
    ]
}
The job of tf_operator is to create the corresponding 5 Pods and inject the environment variable TF_CONFIG into each of them. TF_CONFIG contains three pieces of information: the ClusterSpec of the current cluster, the node's role type, and its id. For the worker 0 Pod, for instance, the TF_CONFIG it receives is:

{
    "cluster": {
        "master": [
            "distributed-mnist-master-0:2222"
        ],
        "ps": [
            "distributed-mnist-ps-0:2222",
            "distributed-mnist-ps-1:2222"
        ],
        "worker": [
            "distributed-mnist-worker-0:2222",
            "distributed-mnist-worker-1:2222"
        ]
    },
    "task": {
        "type": "worker",
        "index": 0
    },
    "environment": "cloud"
}
Here tf_operator takes care of discovering and configuring the cluster topology, saving the user the trouble. All the user's code has to do is pick up this context from the TF_CONFIG environment variable.

This means the user has to adapt the distributed training code to TFJob's conventions:

Read the JSON data from the TF_CONFIG environment variable:

import os

tf_config_json = os.environ.get("TF_CONFIG", "{}")

Deserialize it into a Python object:

import json

tf_config = json.loads(tf_config_json)

Extract the cluster spec:

import tensorflow as tf

cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)

Extract the role type and id; in this case job_name is "worker" and task_id is 0:

task = tf_config.get("task", {})
job_name = task["type"]
task_id = task["index"]

Create the TensorFlow training Server object:

server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)

If job_name is ps, call server.join():

if job_name == 'ps':
    server.join()

Check whether the current process is the master; if so, it is responsible for creating the session and saving the summaries:

is_chief = (job_name == 'master')

Typical distributed training examples have only the two roles ps and worker, whereas TFJob adds the master role, which does not exist in the distributed TensorFlow programming model itself. Code that runs under TFJob has to account for it, but the handling is simple: treat master as just another worker_device type.

with tf.device(tf.train.replica_device_setter(
        worker_device="/job:{0}/task:{1}".format(job_name, task_id),
        cluster=cluster_spec)):
    # build the model graph here
The full code is available in the sample code; a consolidated sketch follows.
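
Putting these steps together, here is a minimal end-to-end sketch of the boilerplate. It is an illustration rather than the sample code itself: the tiny model, the step limit and the checkpoint directory are stand-ins of ours, while the TF 1.x APIs match the ones used throughout this article.

import json
import os

import tensorflow as tf

# Read the topology that tf_operator injected into this Pod.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster_spec = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_name, task_id = task["type"], task["index"]

cluster = tf.train.ClusterSpec(cluster_spec)
server = tf.train.Server(tf.train.ServerDef(
    cluster=cluster.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id))

if job_name == "ps":
    server.join()  # parameter servers only host variables and block here
else:
    is_chief = (job_name == "master")
    # Place ops on this worker (master included); variables are assigned
    # round-robin to the ps tasks.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:{0}/task:{1}".format(job_name, task_id),
            cluster=cluster_spec)):
        global_step = tf.train.get_or_create_global_step()
        w = tf.Variable(0.0, name="w")  # stand-in for a real model
        loss = tf.square(w - 1.0)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

    # Only the chief writes checkpoints and summaries.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=is_chief,
            checkpoint_dir="/training/tensorflow/logs" if is_chief else None,
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)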

In this example we will show how to run distributed training with TFJob, save the training results and logs to NAS storage, and finally read the training logs with TensorBoard.

2.1 Create a NAS volume and add a mount point in the same VPC as the current Kubernetes cluster. See the documentation for details.

2.2 Create a /training data folder on the NAS volume for the data needed by MNIST training (to pre-stage the data yourself, see the sketch after these commands):

mkdir -p /nfs
mount -t nfs -o vers=4.0 xxxxxxx.cn-hangzhou.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/training
umount /nfs
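
The commands above only prepare the directory; the sample training image fetches the MNIST data itself on first run. If you would rather pre-stage the files while the NAS volume is still mounted at /nfs, the following sketch works; the mirror URL and target path are assumptions of ours, not taken from the original setup.

import os
import urllib.request

# Illustrative pre-staging: fetch the four MNIST archives into the NAS
# directory mounted at /nfs above. Run this before the umount step.
BASE_URL = "https://storage.googleapis.com/cvdf-datasets/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]

target_dir = "/nfs/training/tensorflow/input_data"
os.makedirs(target_dir, exist_ok=True)
for name in FILES:
    path = os.path.join(target_dir, name)
    if not os.path.exists(path):
        urllib.request.urlretrieve(BASE_URL + name, path)
        print("downloaded", path)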
2.3 Create the PV for the NAS volume. An example nas-dist-pv.yaml:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-dist-nas-mnist
  labels:
    tfjob: kubeflow-dist-nas-mnist
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  flexVolume:
    driver: "alicloud/nas"
    options:
      mode: "755"
      path: /training
      server: xxxxxxx.cn-hangzhou.nas.aliyuncs.com
      vers: "4.0"
Save this template as nas-dist-pv.yaml and create the PV:

kubectl create -f nas-dist-pv.yaml

persistentvolume "kubeflow-dist-nas-mnist" created
2.4 Create the PVC from nas-dist-pvc.yaml

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kubeflow-dist-nas-mnist
spec:
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfjob: kubeflow-dist-nas-mnist
The command:

kubectl create -f nas-dist-pvc.yaml

persistentvolumeclaim "kubeflow-dist-nas-mnist" created
2.5 Create the TFJob

apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-simple-gpu-dist
spec:
  replicaSpecs:
    - replicas: 1 # 1 Master
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
              name: tensorflow
              env:
                - name: TEST_TMPDIR
                  value: /training
              command: ["python", "/app/main.py"]
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
              name: tensorflow
              env:
                - name: TEST_TMPDIR
                  value: /training
              command: ["python", "/app/main.py"]
              imagePullPolicy: Always
              resources:
                limits:
                  nvidia.com/gpu: 1
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
    - replicas: 1 # 1 Parameter server
      tfReplicaType: PS
      template:
        spec:
          containers:
            - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:cpu
              name: tensorflow
              command: ["python", "/app/main.py"]
              env:
                - name: TEST_TMPDIR
                  value: /training
              imagePullPolicy: Always
              volumeMounts:
                - name: kubeflow-dist-nas-mnist
                  mountPath: "/training"
          volumes:
            - name: kubeflow-dist-nas-mnist
              persistentVolumeClaim:
                claimName: kubeflow-dist-nas-mnist
          restartPolicy: OnFailure
Save this template as mnist-simple-gpu-dist.yaml and create the TFJob for distributed training:

kubectl create -f mnist-simple-gpu-dist.yaml

tfjob "mnist-simple-gpu-dist" created
Check all the running Pods:

RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')

kubectl get po -lruntime_id=$RUNTIMEID

NAME READY STATUS RESTARTS AGE
mnist-simple-gpu-dist-master-z5z4-0-ipy0s 1/1 Running 0 31s
mnist-simple-gpu-dist-ps-z5z4-0-3nzpa 1/1 Running 0 31s
mnist-simple-gpu-dist-worker-z5z4-0-zm0zm 1/1 Running 0 31s
Look at the master's log to confirm that the ClusterSpec has been built successfully:

kubectl logs -l runtime_id=$RUNTIMEID,job_type=MASTER

2018-06-10 09:31:55.342689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-06-10 09:31:55.342724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-06-10 09:31:55.805747: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2018-06-10 09:31:55.805786: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> mnist-simple-gpu-dist-ps-m5yi-0:2222}
2018-06-10 09:31:55.805794: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-m5yi-0:2222}
2018-06-10 09:31:55.807119: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
...

Accuracy at step 900: 0.9709
Accuracy at step 910: 0.971
Accuracy at step 920: 0.9735
Accuracy at step 930: 0.9716
Accuracy at step 940: 0.972
Accuracy at step 950: 0.9697
Accuracy at step 960: 0.9718
Accuracy at step 970: 0.9738
Accuracy at step 980: 0.9725
Accuracy at step 990: 0.9724
Adding run metadata for 999
2.6 Deploy TensorBoard and inspect the training results

To make TensorFlow programs easier to understand, debug and optimize, you can use TensorBoard to observe the training process and understand the training framework and optimization algorithm. TensorBoard obtains its runtime information by reading TensorFlow's event logs.

The distributed training example above already records event logs, which are saved in the events.out.tfevents* files:

tree

.
└── tensorflow
    ├── input_data
    │   ├── t10k-images-idx3-ubyte.gz
    │   ├── t10k-labels-idx1-ubyte.gz
    │   ├── train-images-idx3-ubyte.gz
    │   └── train-labels-idx1-ubyte.gz
    └── logs
        ├── checkpoint
        ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
        ├── graph.pbtxt
        ├── model.ckpt-0.data-00000-of-00001
        ├── model.ckpt-0.index
        ├── model.ckpt-0.meta
        ├── test
        │   ├── events.out.tfevents.1528760351.mnist-simple-gpu-dist-master-fziz-0-74je9
        │   └── events.out.tfevents.1528760356.mnist-simple-gpu-dist-worker-fziz-0-9mvsd
        └── train
            ├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
            └── events.out.tfevents.1528760355.mnist-simple-gpu-dist-worker-fziz-0-9mvsd

5 directories, 14 files
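
These event files are written by the summary writers in the training code. The following is a minimal illustration of the mechanism, not the sample code's actual summaries: the scalar is made up, and the path matches the --logdir handed to TensorBoard below (TF 1.x assumed).

import tensorflow as tf

# Write a toy scalar summary into the log directory that TensorBoard reads.
loss_value = tf.placeholder(tf.float32, name="loss_value")
tf.summary.scalar("loss", loss_value)
summary_op = tf.summary.merge_all()

writer = tf.summary.FileWriter("/training/tensorflow/logs/train")
with tf.Session() as sess:
    for step in range(10):
        summary = sess.run(summary_op, feed_dict={loss_value: 1.0 / (step + 1)})
        writer.add_summary(summary, global_step=step)
writer.close()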
Deploy TensorBoard on Kubernetes, pointing it at the NAS storage the training above wrote to:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
      containers:
        - name: tensorboard
          image: tensorflow/tensorflow:1.7.0
          imagePullPolicy: Always
          command:
            - /usr/local/bin/tensorboard
          args:
            - --logdir
            - /training/tensorflow/logs
          volumeMounts:
            - name: kubeflow-dist-nas-mnist
              mountPath: "/training"
          ports:
            - containerPort: 6006
              protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
Save this template as tensorboard.yaml and create the TensorBoard deployment:

kubectl create -f tensorboard.yaml

deployment "tensorboard" created
Once TensorBoard has been created, access it via kubectl port-forward:

PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006
Open http://127.0.0.1:6006 to reach TensorBoard and inspect the model and the results of the distributed training.


Summary
tf-operator solves the plumbing of distributed training and simplifies distributed training for data scientists. Combined with TensorBoard for inspecting training results and NAS or OSS for storing data and models, this makes training data and experiment results easy to reuse, and it also lays the groundwork for serving the model for prediction. How to chain model training, validation and prediction into a machine learning workflow is the core value of Kubeflow, and we will cover it in later articles.
