Horovod in Docker
阿新 • • 發佈:2020-07-28
https://horovod.readthedocs.io/en/stable/docker.html
Step1 構建映象
GPU
$ mkdir horovod-docker-gpu
$ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
$ docker build -t horovod:latest horovod-docker-gpu
CPU
$ mkdir horovod-docker-gpu $ wget -O horovod-docker-gpu/Dockerfile https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu $ docker build -t horovod:latest horovod-docker-cpu
在單機上執行
GPU 的機器,可以使用 nvidia-docker.
$ nvidia-docker run -it horovod:latest
root@c278c88dd552:/examples# horovodrun -np 4 -H localhost:4 python keras_mnist_advanced.py
在多機上執行
(一)多機執行的條件:ssh免密登陸
http://www.linuxproblem.org/art_9.html
- First log in on A as user a and generate a pair of authentication keys. Do not enter a passphrase:
a@A:~> ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/home/a/.ssh/id_rsa): Created directory '/home/a/.ssh'. Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /home/a/.ssh/id_rsa. Your public key has been saved in /home/a/.ssh/id_rsa.pub. The key fingerprint is: 3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A
- Now use ssh to create a directory ~/.ssh as user b on B. (The directory may already exist, which is fine):
a@A:~> ssh b@B mkdir -p .ssh
b@B's password:
- Finally append a's new public key to b@B:.ssh/authorized_keys and enter b's password one last time:
a@A:~> cat .ssh/id_rsa.pub | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:
- From now on you can log into B as b from A as a without password:
a@A:~> ssh b@B
(二)主worker
host1$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest
root@c278c88dd552:/examples# horovodrun -np 16 -H host1:4,host2:4,host3:4,host4:4 -p 12345 python keras_mnist_advanced.py
(三)從 workers:
host2$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host3$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
host4$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh horovod:latest \
bash -c "/usr/sbin/sshd -p 12345; sleep infinity"
支援遠端直接資料儲存
$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest
root@c278c88dd552:/examples# ...