
How to Distribute Keras Training with PowerAI DDL

The 1.5.3 release of PowerAI includes updates to IBM’s Distributed Deep Learning (DDL) framework that facilitate the distribution of TensorFlow Keras training. In this article we walk through the process of taking an existing TensorFlow Keras model, making the code changes necessary to distribute its training using DDL, and using ddlrun to execute the distributed script.

The script we use as the starting point is the Keras mnist_cnn.py example script.

Code Changes

1. Imports

The first step is to convert any keras imports to tensorflow.keras imports. This is accomplished by replacing import keras with from tensorflow.python import keras as keras, and replacing imports of the form from keras.xxxxx import ... with imports of the form from tensorflow.python.keras.xxxxx import .... We also have to import ddl and numpy as np. Importing ddl automatically distributes the gradient computation during training.

import keras                                                                  | from tensorflow.python import keras as keras
from keras.datasets import mnist                                              | from tensorflow.python.keras.datasets import mnist
from keras.models import Sequential                                           | from tensorflow.python.keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten                              | from tensorflow.python.keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D                                 | from tensorflow.python.keras.layers import Conv2D, MaxPooling2D
from keras import backend as K                                                | from tensorflow.python.keras import backend as K
                                                                              > import ddl
                                                                              > import numpy as np
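Importing ddl also makes ddl.rank() and ddl.size(), used in the next step, available to the script. A minimal sketch, assuming the PowerAI ddl module is on the PYTHONPATH:

import ddl

# After the import, each process knows its position in the distributed job.
print('This process is rank', ddl.rank(), 'of', ddl.size())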

2. Split the Training and Test Data

Next we have to split the training and test data so that each GPU works on different data. This split is what actually divides the work among the DDL processes.

  • x_test_full and y_test_full are added to be able to do a final model evaluation at the end.
  • np.array_split(x_train, ddl.size())[ddl.rank()] is used to split the training data into ddl.size() pieces and select the piece that corresponds to the current rank, ddl.rank(). The same is done for all training and test data and labels; a small standalone sketch of the split follows the diff below.
                                                                              > # DDL: Save the full test data before splitting for final accuracy check.
                                                                              > x_test_full = x_test.astype('float32') / 255
                                                                              > y_test_full = keras.utils.to_categorical(y_test, num_classes)
                                                                              >
                                                                              > # DDL: Split the training & testing data.
                                                                              > x_train = np.array_split(x_train, ddl.size())[ddl.rank()]
                                                                              > x_test = np.array_split(x_test, ddl.size())[ddl.rank()]
x_train = x_train.astype('float32')                                             x_train = x_train.astype('float32')
x_test = x_test.astype('float32')                                               x_test = x_test.astype('float32')
x_train /= 255                                                                  x_train /= 255
x_test /= 255                                                                   x_test /= 255
print('x_train shape:', x_train.shape)                                          print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')                                        print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')                                          print(x_test.shape[0], 'test samples')

                                                                              > # DDL: Split the training & testing data.
                                                                              > y_train = np.array_split(y_train, ddl.size())[ddl.rank()]
                                                                              > y_test = np.array_split(y_test, ddl.size())[ddl.rank()]
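To make the split concrete, here is a small self-contained sketch. The size and rank values are hypothetical stand-ins for ddl.size() and ddl.rank() on a 4-GPU run:

import numpy as np

# Hypothetical stand-ins for ddl.size() and ddl.rank() on a 4-GPU job.
size, rank = 4, 1

# Dummy array with the same shape as the MNIST training images.
x_train = np.zeros((60000, 28, 28, 1), dtype='uint8')
shard = np.array_split(x_train, size)[rank]   # this rank's piece of the data

print(shard.shape)                            # (15000, 28, 28, 1), matching the log output below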

3. Adjust the Learning Rate

The next change is to multiply the learning rate by the total number of GPUs. The intuition is as follows. Since we split the data and perform gradient descent across ddl.size() GPUs, each with a batch size of 128, the effective global batch size is 128 * ddl.size(). This reduces the number of gradient descent updates per epoch, slowing the convergence rate by a factor of approximately the number of GPUs. To compensate, we scale the learning rate by ddl.size(); a short worked example follows the diff below.

                                                                              > # DDL: adjust learning rate based on number of GPUs.
model.compile(loss=keras.losses.categorical_crossentropy,                       model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),                          |               optimizer=keras.optimizers.Adadelta(lr=1.0 * ddl.size()),
              metrics=['accuracy'])                                                           metrics=['accuracy'])
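To make the arithmetic concrete, here is a worked example with hypothetical values, assuming 4 GPUs and the script's batch size of 128:

gpus = 4                                           # ddl.size()
batch_size = 128

samples_per_gpu = 60000 // gpus                    # 15000 MNIST training samples per rank
updates_per_epoch = samples_per_gpu // batch_size  # ~117 global updates, vs. ~468 on a single GPU
scaled_lr = 1.0 * gpus                             # Adadelta's default lr of 1.0, scaled to 4.0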

4. Add DDL Callbacks

DDL requires that two callbacks be added to the list of Keras callbacks. To ensure that metrics used for early stopping and other hyperparameter tuning remain in sync throughout training, we add ddl.DDLCallback() as the first callback in the list. To ensure that all global variables in the model are correctly initialized, we add ddl.DDLGlobalVariablesCallback() as the last callback in the list.

                                                                              > callbacks = list()
                                                                              >
                                                                              > # DDL: Add the DDL callback.
                                                                              > callbacks.append(ddl.DDLCallback())
                                                                              > callbacks.append(ddl.DDLGlobalVariablesCallback())
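If the training script has callbacks of its own, they go between the two DDL callbacks. A minimal sketch, where the EarlyStopping callback is an illustrative addition and not part of the original script:

callbacks = list()

# DDL: DDLCallback must come first so metrics stay in sync across ranks.
callbacks.append(ddl.DDLCallback())

# Hypothetical user callback: our own callbacks sit between the two DDL callbacks.
callbacks.append(keras.callbacks.EarlyStopping(monitor='val_loss', patience=3))

# DDL: DDLGlobalVariablesCallback must come last so global variables initialize correctly.
callbacks.append(ddl.DDLGlobalVariablesCallback())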

5. Restrict Printing to Rank 0

There are usually some operations that we only want to perform on a single process, printing for example. To restrict such operations to rank 0 we can wrap them in if ddl.rank() == 0. Here we also use x_test_full and y_test_full to evaluate the model on the full test set for the final accuracy check displayed at the end.

                                                                              > # DDL: Only use verbose = 1 on rank 0.
model.fit(x_train, y_train,                                                     model.fit(x_train, y_train,
          batch_size=batch_size,                                                          batch_size=batch_size,
          epochs=epochs,                                                                  epochs=epochs,
          verbose=1,                                                          |           verbose=1 if ddl.rank() == 0 else 0,
          validation_data=(x_test, y_test))                                   |           validation_data=(x_test, y_test),
                                                                              >           callbacks=callbacks)
                                                                              > # DDL: Only do final accuracy check on rank 0.
                                                                              > if ddl.rank() == 0:
score = model.evaluate(x_test, y_test, verbose=0)                             |     score = model.evaluate(x_test_full, y_test_full, verbose=0)
print('Test loss:', score[0])                                                 |     print('Test loss:', score[0])
print('Test accuracy:', score[1])                                             |     print('Test accuracy:', score[1])
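The same guard works for any operation that should only happen once per job, such as writing the trained model to disk. A minimal sketch, where the file name is illustrative and not part of the original script:

# DDL: Only rank 0 writes the trained model, so four processes don't write the same file.
if ddl.rank() == 0:
    model.save('mnist-tf-keras.h5')   # hypothetical output path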

Running the Script

To run the script across any number of nodes, we can use the following commands:

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
$ /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-install-samples ~/samples
$ ddlrun -H host1,host2,host3,host4,... python ~/samples/examples/keras/mnist-tf-keras.py

On 4 GPUs the output looks like:

$ ddlrun python ~/samples/examples/keras/mnist-tf-keras.py
+ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -tcp -disable_gpu_hooks --rankfile /tmp/DDLRUN/ddlrun.Rd3PDdkJYvRb/RANKFILE -x 'DDL_OPTIONS=-mode p:4x1x1x1 ' -n 4 python ~/samples/examples/keras/mnist-tf-keras.py
DDL: DDL_GROUP_SIZE=10000000.
2018-08-28 19:37:57.689450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.689548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 2
2018-08-28 19:37:57.689856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.689948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 1
2018-08-28 19:37:57.691164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.691221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 19:37:57.726137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.726350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 3
2018-08-28 19:37:58.078092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.078179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-08-28 19:37:58.078203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-08-28 19:37:58.078863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14847 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-08-28 19:37:58.080687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.080722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      1
2018-08-28 19:37:58.080738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N
2018-08-28 19:37:58.081261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14849 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-08-28 19:37:58.150432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.150481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      2
2018-08-28 19:37:58.150495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   N
2018-08-28 19:37:58.151084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2018-08-28 19:37:58.405935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.406037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      3
2018-08-28 19:37:58.406065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   N
2018-08-28 19:37:58.406917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
I 19:37:58.441 122001 122471 DDL:41  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
2018-08-28 19:37:59.918249: I ddl_MDR_ops.cc:826] [MPI:2   ]  name=Init local_gpuid=2 local_rank=2 local_size=4
2018-08-28 19:37:59.918246: I ddl_MDR_ops.cc:826] [MPI:3   ]  name=Init local_gpuid=3 local_rank=3 local_size=4
2018-08-28 19:37:59.918266: I ddl_MDR_ops.cc:826] [MPI:1   ]  name=Init local_gpuid=1 local_rank=1 local_size=4
2018-08-28 19:37:59.918266: I ddl_MDR_ops.cc:826] [MPI:0   ]  name=Init local_gpuid=0 local_rank=0 local_size=4
DDL: rank: 0, size: 4, gpuid: 0, hosts: 1
DDL: rank: 1, size: 4, gpuid: 1, hosts: 1
DDL: rank: 2, size: 4, gpuid: 2, hosts: 1
DDL: rank: 3, size: 4, gpuid: 3, hosts: 1
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
2018-08-28 19:38:00.963727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 2
2018-08-28 19:38:00.963824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:00.963838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      2
2018-08-28 19:38:00.963851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   N
2018-08-28 19:38:00.964433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
Train on 15000 samples, validate on 2500 samples
2018-08-28 19:38:01.026421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 19:38:01.026512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:01.026535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-08-28 19:38:01.026565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-08-28 19:38:01.027149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14847 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-08-28 19:38:01.245015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 1
2018-08-28 19:38:01.245136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:01.245160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      1
2018-08-28 19:38:01.245180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N
2018-08-28 19:38:01.245765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14849 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-08-28 19:38:02.113830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 3
2018-08-28 19:38:02.113934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:02.113949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      3
2018-08-28 19:38:02.113963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   N
2018-08-28 19:38:02.114639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
Epoch 1/12
2018-08-28 19:38:04.161660: I ddl_MDR_ops.cc:357] [MPI:0   ]  name=training/Adadelta/AllReduceN _global_buf_size=1199882 _N=8
I 19:38:04.529 122001 122637 DDL:703 ] [MPI:0   ] selected algo: NCCLB   - NCCLB
15000/15000 [==============================] - 4s 282us/step - loss: 0.4831 - acc: 0.8497 - val_loss: 0.1190 - val_acc: 0.9596
Epoch 2/12
15000/15000 [==============================] - 2s 119us/step - loss: 0.1169 - acc: 0.9679 - val_loss: 0.0846 - val_acc: 0.9700
Epoch 3/12
15000/15000 [==============================] - 2s 133us/step - loss: 0.0805 - acc: 0.9760 - val_loss: 0.0731 - val_acc: 0.9728
Epoch 4/12
15000/15000 [==============================] - 2s 118us/step - loss: 0.0693 - acc: 0.9797 - val_loss: 0.0571 - val_acc: 0.9792
Epoch 5/12
15000/15000 [==============================] - 2s 122us/step - loss: 0.0514 - acc: 0.9843 - val_loss: 0.0443 - val_acc: 0.9832
Epoch 6/12
15000/15000 [==============================] - 2s 120us/step - loss: 0.0473 - acc: 0.9868 - val_loss: 0.0539 - val_acc: 0.9804
Epoch 7/12
15000/15000 [==============================] - 2s 120us/step - loss: 0.0408 - acc: 0.9869 - val_loss: 0.0510 - val_acc: 0.9844
Epoch 8/12
15000/15000 [==============================] - 2s 121us/step - loss: 0.0398 - acc: 0.9877 - val_loss: 0.0579 - val_acc: 0.9836
Epoch 9/12
15000/15000 [==============================] - 2s 122us/step - loss: 0.0373 - acc: 0.9893 - val_loss: 0.0485 - val_acc: 0.9840
Epoch 10/12
15000/15000 [==============================] - 2s 104us/step - loss: 0.0289 - acc: 0.9915 - val_loss: 0.0566 - val_acc: 0.9824
Epoch 11/12
15000/15000 [==============================] - 2s 111us/step - loss: 0.0291 - acc: 0.9907 - val_loss: 0.0565 - val_acc: 0.9816
Epoch 12/12
15000/15000 [==============================] - 2s 106us/step - loss: 0.0279 - acc: 0.9915 - val_loss: 0.0419 - val_acc: 0.9856
2018-08-28 19:38:26.596350: I ddl_MDR_ops.cc:270] [MPI:2   ] calling ddl_finalize

2018-08-28 19:38:26.598412: I ddl_MDR_ops.cc:270] [MPI:3   ] calling ddl_finalize

2018-08-28 19:38:26.655121: I ddl_MDR_ops.cc:270] [MPI:1   ] calling ddl_finalize

2018-08-28 19:38:27.320345: I ddl_MDR_ops.cc:270] [MPI:0   ] calling ddl_finalize

Test loss: 0.0279394076111
Test accuracy: 0.992

Complete Diff

'''Trains a simple convnet on the MNIST dataset.                                '''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs                                    Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).                          (there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.                                        16 seconds per epoch on a GRID K520 GPU.
'''                                                                             '''

from __future__ import print_function                                           from __future__ import print_function
import keras                                                                  | from tensorflow.python import keras as keras
from keras.datasets import mnist                                              | from tensorflow.python.keras.datasets import mnist
from keras.models import Sequential                                           | from tensorflow.python.keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten                              | from tensorflow.python.keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D                                 | from tensorflow.python.keras.layers import Conv2D, MaxPooling2D
from keras import backend as K                                                | from tensorflow.python.keras import backend as K
                                                                              > import ddl
                                                                              > import numpy as np

batch_size = 128                                                                batch_size = 128
num_classes = 10                                                                num_classes = 10
epochs = 12                                                                     epochs = 12

# input image dimensions                                                        # input image dimensions
img_rows, img_cols = 28, 28                                                     img_rows, img_cols = 28, 28

# the data, split between train and test sets                                   # the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()                        (x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':                                   if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)              x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)                 x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)                                           input_shape = (1, img_rows, img_cols)
else:                                                                           else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)              x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)                 x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)                                           input_shape = (img_rows, img_cols, 1)

                                                                              > # DDL: Save the full test data before splitting for final accuracy check.
                                                                              > x_test_full = x_test.astype('float32') / 255
                                                                              > y_test_full = keras.utils.to_categorical(y_test, num_classes)
                                                                              >
                                                                              > # DDL: Split the training & testing data.
                                                                              > x_train = np.array_split(x_train, ddl.size())[ddl.rank()]
                                                                              > x_test = np.array_split(x_test, ddl.size())[ddl.rank()]
x_train = x_train.astype('float32')                                             x_train = x_train.astype('float32')
x_test = x_test.astype('float32')                                               x_test = x_test.astype('float32')
x_train /= 255                                                                  x_train /= 255
x_test /= 255                                                                   x_test /= 255
print('x_train shape:', x_train.shape)                                          print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')                                        print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')                                          print(x_test.shape[0], 'test samples')

                                                                              > # DDL: Split the training & testing data.
                                                                              > y_train = np.array_split(y_train, ddl.size())[ddl.rank()]
                                                                              > y_test = np.array_split(y_test, ddl.size())[ddl.rank()]
# convert class vectors to binary class matrices                                # convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)                      y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)                        y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()                                                            model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),                                        model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',                                                              activation='relu',
                 input_shape=input_shape))                                                       input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))                                model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))                                       model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))                                                        model.add(Dropout(0.25))
model.add(Flatten())                                                            model.add(Flatten())
model.add(Dense(128, activation='relu'))                                        model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))                                                         model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))                             model.add(Dense(num_classes, activation='softmax'))

                                                                              > # DDL: adjust learning rate based on number of GPUs.
model.compile(loss=keras.losses.categorical_crossentropy,                       model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),                          |               optimizer=keras.optimizers.Adadelta(lr=1.0 * ddl.size()),
              metrics=['accuracy'])                                                           metrics=['accuracy'])

                                                                              > callbacks = list()
                                                                              >
                                                                              > # DDL: Add the DDL callback.
                                                                              > callbacks.append(ddl.DDLCallback())
                                                                              > callbacks.append(ddl.DDLGlobalVariablesCallback())
                                                                              >
                                                                              > # DDL: Only use verbose = 1 on rank 0.
model.fit(x_train, y_train,                                                     model.fit(x_train, y_train,
          batch_size=batch_size,                                                          batch_size=batch_size,
          epochs=epochs,                                                                  epochs=epochs,
          verbose=1,                                                          |           verbose=1 if ddl.rank() == 0 else 0,
          validation_data=(x_test, y_test))                                   |           validation_data=(x_test, y_test),
                                                                              >           callbacks=callbacks)
                                                                              > # DDL: Only do final accuracy check on rank 0.
                                                                              > if ddl.rank() == 0:
score = model.evaluate(x_test, y_test, verbose=0)                             |     score = model.evaluate(x_test_full, y_test_full, verbose=0)
print('Test loss:', score[0])                                                 |     print('Test loss:', score[0])
print('Test accuracy:', score[1])                                             |     print('Test accuracy:', score[1])
