Kubeflow in Practice: Running Distributed TensorFlow with TFJob
Introduction
This series shows how to run Kubeflow on Alibaba Cloud Container Service. This article covers how to use TFJob to run distributed model training.
Distributed TensorFlow Training and Kubernetes
TensorFlow, today's most popular deep learning framework, is widely used by data scientists, and distributed training, which can dramatically speed up training, is one of its killer features. But actually deploying and running large-scale distributed model training is a new challenge. In practice, users of distributed TensorFlow need to take care of three things:
Finding enough resources to run the training. A distributed training job typically needs a number of workers (computation servers) and ps (parameter servers), and each of these members needs compute resources.
Installing and configuring the software and applications that support the computation.
Configuring the ClusterSpec required by distributed TensorFlow's design. This JSON-format ClusterSpec describes the topology of the whole distributed training cluster. For example, with two workers and two ps, the ClusterSpec looks like the snippet below, and every member of the distributed training job uses it to initialize a tf.train.ClusterSpec object and establish communication within the cluster:
cluster = tf.train.ClusterSpec({"worker": ["<IP_VM_1>:2222",
                                           "<IP_VM_2>:2222"],
                                "ps": ["<IP_VM_1>:2223",
                                       "<IP_VM_2>:2223"]})
The first of these is exactly what Kubernetes resource scheduling excels at: both CPU and GPU scheduling work out of the box. The second is Docker's strength: fixed, repeatable steps baked into a container image. Building the ClusterSpec automatically is the problem TFJob solves, letting users construct the distributed TensorFlow cluster topology from a simple, centralized configuration.
In short, the distributed training problem that has long troubled data scientists can be solved fairly well with the Kubernetes + TFJob approach.
Deploying Distributed Training with Kubernetes and TFJob
Adapting the distributed TensorFlow training code
The earlier article in this series on trying TFJob on Alibaba Cloud already introduced the TFJob definition, so it will not be repeated here. Recall that the role types in a TFJob are MASTER, WORKER, and PS.
As a concrete example, suppose a distributed training TFJob is called distributed-mnist and has 1 MASTER, 2 WORKERs, and 2 PS nodes. The corresponding ClusterSpec looks like this:
{
    "master":[
        "distributed-mnist-master-0:2222"
    ],
    "ps":[
        "distributed-mnist-ps-0:2222",
        "distributed-mnist-ps-1:2222"
    ],
    "worker":[
        "distributed-mnist-worker-0:2222",
        "distributed-mnist-worker-1:2222"
    ]
}
The job of tf_operator is to create the corresponding 5 Pods and inject the environment variable TF_CONFIG into each of them. TF_CONFIG has three parts: the cluster's ClusterSpec, this node's role type, and its id. For example, the Pod for worker0 receives the following TF_CONFIG:
{
    "cluster":{
        "master":[
            "distributed-mnist-master-0:2222"
        ],
        "ps":[
            "distributed-mnist-ps-0:2222",
            "distributed-mnist-ps-1:2222"
        ],
        "worker":[
            "distributed-mnist-worker-0:2222",
            "distributed-mnist-worker-1:2222"
        ]
    },
    "task":{
        "type":"worker",
        "index":0
    },
    "environment":"cloud"
}
Here, tf_operator takes care of discovering and configuring the cluster topology, sparing the user that work. All the user's code needs to do is read this context from the TF_CONFIG environment variable.
This means the user needs to adapt the distributed training code to the TFJob convention:
Read the JSON data from the TF_CONFIG environment variable:
import json
import os

tf_config_json = os.environ.get("TF_CONFIG", "{}")
Deserialize it into a Python object:
tf_config = json.loads(tf_config_json)
Get the ClusterSpec:
cluster_spec = tf_config.get("cluster", {})
cluster_spec_object = tf.train.ClusterSpec(cluster_spec)
Get the role type and id; here, for example, job_name is "worker" and task_id is 0:
task = tf_config.get("task", {})
job_name = task["type"]
task_id = task["index"]
Create the TensorFlow training Server object:
server_def = tf.train.ServerDef(
    cluster=cluster_spec_object.as_cluster_def(),
    protocol="grpc",
    job_name=job_name,
    task_index=task_id)
server = tf.train.Server(server_def)
If job_name is ps, call server.join():
if job_name == 'ps':
    server.join()
Check whether the current process is the master. If it is, it is responsible for creating the session and saving summaries:
is_chief = (job_name == 'master')
Typical distributed training examples have only the ps and worker roles, but TFJob adds a master role that does not exist in the distributed TensorFlow programming model. Code that runs under TFJob must handle this, but the handling is simple: treat master as just another worker_device type:
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:{0}/task:{1}".format(job_name, task_id),
        cluster=cluster_spec)):
For the complete code, refer to the sample code.
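Putting the steps above together, here is a minimal end-to-end sketch of the boilerplate. It is illustrative only, not the sample code itself: it assumes the TF 1.x low-level API, build_graph() is a hypothetical placeholder for the model-building code, the checkpoint directory simply matches the /training NAS mount used later in this article, and it uses tf.train.MonitoredTrainingSession to let the chief handle session creation and checkpoints.

import json
import os

import tensorflow as tf

def main():
    # Cluster topology and role injected by tf_operator
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster_spec = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    job_name, task_id = task["type"], task["index"]

    # Start the gRPC server for this cluster member
    server = tf.train.Server(tf.train.ClusterSpec(cluster_spec),
                             job_name=job_name, task_index=task_id)

    # Parameter servers only serve variables
    if job_name == "ps":
        server.join()
        return

    # TFJob's master is treated as just another worker device
    is_chief = (job_name == "master")
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:{0}/task:{1}".format(job_name, task_id),
            cluster=cluster_spec)):
        train_op = build_graph()  # hypothetical: builds the model, returns the train op

    # The chief creates the session and saves checkpoints and summaries
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=is_chief,
            checkpoint_dir="/training/tensorflow/logs") as sess:
        while not sess.should_stop():
            sess.run(train_op)

if __name__ == "__main__":
    main()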
In this example, we demonstrate how to use TFJob to run distributed training, save the training results and logs to NAS storage, and finally read the training logs with TensorBoard.
2.1 Create a NAS volume and set up a mount point in the same VPC as the current Kubernetes cluster. See the documentation for details.
2.2 Create a /training data folder on the NAS and download the data needed for mnist training:
mkdir -p /nfs
mount -t nfs -o vers=4.0 xxxxxxx.cn-hangzhou.nas.aliyuncs.com:/ /nfs
mkdir -p /nfs/training
umount /nfs
2.3 Create the NAS PV. Below is a sample nas-dist-pv.yaml:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: kubeflow-dist-nas-mnist
  labels:
    tfjob: kubeflow-dist-nas-mnist
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  flexVolume:
    driver: "alicloud/nas"
    options:
      mode: "755"
      path: /training
      server: xxxxxxx.cn-hangzhou.nas.aliyuncs.com
      vers: "4.0"
Save this template as nas-dist-pv.yaml and create the PV:
kubectl create -f nas-dist-pv.yaml
persistentvolume "kubeflow-dist-nas-mnist" created
2.4 Create the PVC from nas-dist-pvc.yaml:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: kubeflow-dist-nas-mnist
spec:
  storageClassName: nas
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      tfjob: kubeflow-dist-nas-mnist
Create it with:
kubectl create -f nas-dist-pvc.yaml
persistentvolumeclaim "kubeflow-dist-nas-mnist" created
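Before moving on, you can optionally confirm that the PVC has bound to the PV created above; the STATUS column should read Bound:
kubectl get pvc kubeflow-dist-nas-mnist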
2.5 Create the TFJob:
apiVersion: kubeflow.org/v1alpha1
kind: TFJob
metadata:
  name: mnist-simple-gpu-dist
spec:
  replicaSpecs:
  - replicas: 1 # 1 Master
    tfReplicaType: MASTER
    template:
      spec:
        containers:
        - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
          name: tensorflow
          env:
          - name: TEST_TMPDIR
            value: /training
          command: ["python", "/app/main.py"]
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
          - name: kubeflow-dist-nas-mnist
            mountPath: "/training"
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
  - replicas: 1 # 1 or 2 Workers depends on how many gpus you have
    tfReplicaType: WORKER
    template:
      spec:
        containers:
        - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:gpu
          name: tensorflow
          env:
          - name: TEST_TMPDIR
            value: /training
          command: ["python", "/app/main.py"]
          imagePullPolicy: Always
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
          - name: kubeflow-dist-nas-mnist
            mountPath: "/training"
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
  - replicas: 1 # 1 Parameter server
    tfReplicaType: PS
    template:
      spec:
        containers:
        - image: registry.aliyuncs.com/tensorflow-samples/tf-mnist-distributed:cpu
          name: tensorflow
          command: ["python", "/app/main.py"]
          env:
          - name: TEST_TMPDIR
            value: /training
          imagePullPolicy: Always
          volumeMounts:
          - name: kubeflow-dist-nas-mnist
            mountPath: "/training"
        volumes:
        - name: kubeflow-dist-nas-mnist
          persistentVolumeClaim:
            claimName: kubeflow-dist-nas-mnist
        restartPolicy: OnFailure
Save this template as mnist-simple-gpu-dist.yaml and create the distributed training TFJob:
kubectl create -f mnist-simple-gpu-dist.yaml
tfjob "mnist-simple-gpu-dist" created
Check all running Pods:
RUNTIMEID=$(kubectl get tfjob mnist-simple-gpu-dist -o=jsonpath='{.spec.RuntimeId}')
kubectl get po -lruntime_id=$RUNTIMEID
NAME READY STATUS RESTARTS AGE
mnist-simple-gpu-dist-master-z5z4-0-ipy0s 1/1 Running 0 31s
mnist-simple-gpu-dist-ps-z5z4-0-3nzpa 1/1 Running 0 31s
mnist-simple-gpu-dist-worker-z5z4-0-zm0zm 1/1 Running 0 31s
Check the master's logs; you can see that the ClusterSpec has been built successfully:
kubectl logs -l runtime_id=$RUNTIMEID,job_type=MASTER
2018-06-10 09:31:55.342689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:08.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-06-10 09:31:55.342724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:08.0, compute capability: 6.0)
2018-06-10 09:31:55.805747: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job master -> {0 -> localhost:2222}
2018-06-10 09:31:55.805786: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> mnist-simple-gpu-dist-ps-m5yi-0:2222}
2018-06-10 09:31:55.805794: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> mnist-simple-gpu-dist-worker-m5yi-0:2222}
2018-06-10 09:31:55.807119: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222
...
Accuracy at step 900: 0.9709
Accuracy at step 910: 0.971
Accuracy at step 920: 0.9735
Accuracy at step 930: 0.9716
Accuracy at step 940: 0.972
Accuracy at step 950: 0.9697
Accuracy at step 960: 0.9718
Accuracy at step 970: 0.9738
Accuracy at step 980: 0.9725
Accuracy at step 990: 0.9724
Adding run metadata for 999
2.6 Deploy TensorBoard and inspect the training results
To make TensorFlow programs easier to understand, debug, and optimize, you can use TensorBoard to observe the training results, the training framework, and the optimization algorithm. TensorBoard obtains runtime information by reading TensorFlow's event logs.
The distributed training example above has already recorded event logs, saved to files named events.out.tfevents* (a sketch of how such files are produced follows the listing below):
tree
.
└── tensorflow
├── input_data
│ ├── t10k-images-idx3-ubyte.gz
│ ├── t10k-labels-idx1-ubyte.gz
│ ├── train-images-idx3-ubyte.gz
│ └── train-labels-idx1-ubyte.gz
└── logs
├── checkpoint
├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
├── graph.pbtxt
├── model.ckpt-0.data-00000-of-00001
├── model.ckpt-0.index
├── model.ckpt-0.meta
├── test
│ ├── events.out.tfevents.1528760351.mnist-simple-gpu-dist-master-fziz-0-74je9
│ └── events.out.tfevents.1528760356.mnist-simple-gpu-dist-worker-fziz-0-9mvsd
└── train
├── events.out.tfevents.1528760350.mnist-simple-gpu-dist-master-fziz-0-74je9
└── events.out.tfevents.1528760355.mnist-simple-gpu-dist-worker-fziz-0-9mvsd
5 directories, 14 files
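For reference, event files like the ones above are produced by the TF 1.x summary API. The following is a minimal, self-contained sketch; the scalar name accuracy, the fed values, and the log directory are illustrative, not taken from the sample code:

import tensorflow as tf

# A scalar placeholder standing in for a real metric such as accuracy
metric = tf.placeholder(tf.float32, shape=[], name="metric")
tf.summary.scalar("accuracy", metric)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    # FileWriter creates the events.out.tfevents.* files that TensorBoard reads
    writer = tf.summary.FileWriter("/training/tensorflow/logs/train", sess.graph)
    for step in range(10):
        summary = sess.run(merged, feed_dict={metric: step / 10.0})
        writer.add_summary(summary, step)
    writer.close()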
Deploy TensorBoard on Kubernetes, pointing it at the NAS storage used for training:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: tensorboard
  name: tensorboard
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tensorboard
  template:
    metadata:
      labels:
        app: tensorboard
    spec:
      volumes:
      - name: kubeflow-dist-nas-mnist
        persistentVolumeClaim:
          claimName: kubeflow-dist-nas-mnist
      containers:
      - name: tensorboard
        image: tensorflow/tensorflow:1.7.0
        imagePullPolicy: Always
        command:
        - /usr/local/bin/tensorboard
        args:
        - --logdir
        - /training/tensorflow/logs
        volumeMounts:
        - name: kubeflow-dist-nas-mnist
          mountPath: "/training"
        ports:
        - containerPort: 6006
          protocol: TCP
      dnsPolicy: ClusterFirst
      restartPolicy: Always
Save this template as tensorboard.yaml and create the TensorBoard deployment:
kubectl create -f tensorboard.yaml
deployment "tensorboard" created
Once TensorBoard has been created, access it with the kubectl port-forward command:
PODNAME=$(kubectl get pod -l app=tensorboard -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward ${PODNAME} 6006:6006
Open http://127.0.0.1:6006 to log in to TensorBoard and view the distributed training model and its results:
Summary
tf-operator solves the distributed training problem and simplifies distributed training work for data scientists. Combined with TensorBoard for inspecting training results and NAS or OSS for storing data and models, this approach lets you effectively reuse training data and preserve experiment results, while also preparing for publishing the model for prediction. How to chain model training, validation, and prediction into a machine learning workflow is also the core value of Kubeflow, which we will cover in later articles.