Some tips about using Google's TPU (Cont.)
Sometimes I get this error from TPUEstimator:
```
...
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1135, in _run
```
After stopping and restarting the TPU in the GCP console, the error disappeared. A TPU can't be used directly the way a GPU can: you won't see a device in the VM like '/dev/tpu' or anything similar. Google provides the TPU as an RPC service, so you can only run DNN training through that service. I think this RPC service is not stable enough, so sometimes it fails and leads to the 'Deadline Exceeded' error.
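For context, this is roughly how that RPC endpoint is reached and how a TPUEstimator is wired to it. The TPU name, zone, project, GCS bucket, and the toy model_fn below are all placeholders: a minimal sketch, not my actual training code.

```python
import tensorflow as tf

# The TPU is not a local device like '/dev/nvidia0'; it is only reachable as
# a gRPC service, and the cluster resolver looks up that endpoint by name.
# TPU name, zone, project and the GCS paths below are placeholders.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(
    tpu='my-tpu', zone='us-central1-b', project='my-gcp-project')
print(resolver.get_master())       # e.g. grpc://10.240.1.2:8470


def model_fn(features, labels, mode, params):
    # Toy model just to make the sketch complete; replace with MobileNet_v2.
    logits = tf.layers.dense(tf.layers.flatten(features), 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # CrossShardOptimizer averages gradients across the 8 TPU cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)


run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/model',    # must be a GCS path, not local disk
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    train_batch_size=1024,
    use_tpu=True)
```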
When I get this type of error from the TPU:
```
2018-09-29 01:57:12.779430: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:349] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
```
The only solution is to create a new TPU instance in the GCP console and delete the old one. It seems Google needs to improve the robustness of its TPU RPC service.
Running 10,000 steps at a time and getting the loss after each run:
```
INFO:tensorflow:Loss for final step: 3.2015076.
INFO:tensorflow:Loss for final step: 2.5733204.
INFO:tensorflow:Loss for final step: 1.8888541.
INFO:tensorflow:Loss for final step: 2.3713436.
INFO:tensorflow:Loss for final step: 2.9957836.
INFO:tensorflow:Loss for final step: 1.3974692.
INFO:tensorflow:Loss for final step: 1.3933656.
INFO:tensorflow:Loss for final step: 2.3544135.
INFO:tensorflow:Loss for final step: 1.9383199.
INFO:tensorflow:Loss for final step: 2.0213509.
INFO:tensorflow:Loss for final step: 1.8641331.
INFO:tensorflow:Loss for final step: 1.6767861.
INFO:tensorflow:Loss for final step: 2.63849.
INFO:tensorflow:Loss for final step: 2.19468.
INFO:tensorflow:Loss for final step: 1.9854712.
INFO:tensorflow:Loss for final step: 1.9380764.
INFO:tensorflow:Loss for final step: 0.97299415.
INFO:tensorflow:Loss for final step: 2.089243.
INFO:tensorflow:Loss for final step: 2.1150723.
INFO:tensorflow:Loss for final step: 1.8242038.
INFO:tensorflow:Loss for final step: 2.8426473.
```
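This kind of log comes from calling train() repeatedly: each call runs 10,000 steps and prints 'Loss for final step' once when it returns. A minimal sketch of such a loop, reusing the hypothetical estimator from the sketch above together with a made-up random input_fn:

```python
def train_input_fn(params):
    # TPUEstimator passes the per-core batch size through params['batch_size'].
    batch_size = params['batch_size']
    features = tf.random_uniform([1000, 64])
    labels = tf.random_uniform([1000], maxval=10, dtype=tf.int32)
    dataset = tf.data.Dataset.from_tensor_slices((features, labels))
    # TPUs need static shapes, hence drop_remainder=True.
    return dataset.repeat().batch(batch_size, drop_remainder=True)

# Each train() call runs 10,000 steps and logs "Loss for final step: ..."
# exactly once when it returns, which is where the lines above come from.
for _ in range(21):
    estimator.train(input_fn=train_input_fn, steps=10000)
```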
It's quite strange that the loss doesn't go low enough. I still need to run more experiments.
Previously, I ran MobileNet_v2 on a machine with a GeForce GTX 960, and it could process about 100 samples per second. Using a TPUv2-8 (8 TPU cores of version 2), it can process about 500 samples per second. At first I was disappointed by the performance boost of TPUv2, since it works out to only about 1.4 TFLOPS per core. But then I realized the bottleneck may not be the TPU itself, since I/O is usually the limit on training speed. Besides, my model is MobileNet_v2, which is so simple and light that it can't exploit the full capability of the TPU.
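The arithmetic behind that 1.4 TFLOPS estimate, assuming the GTX 960 peaks at roughly 2.3 FP32 TFLOPS (the 2.3 figure is my own assumption, not something I measured):

```python
# Back-of-the-envelope estimate behind the "about 1.4 TFLOPS each" figure.
gtx960_fp32_tflops = 2.3          # assumed peak of the GTX 960
gpu_samples_per_sec = 100.0       # measured on the GTX 960
tpu_samples_per_sec = 500.0       # measured on the TPUv2-8
tpu_cores = 8

speedup = tpu_samples_per_sec / gpu_samples_per_sec            # ~5x
tflops_per_core = speedup * gtx960_fp32_tflops / tpu_cores     # ~1.4
print('%.1f effective TFLOPS per TPU core' % tflops_per_core)
```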
Therefore I set 'depth_multiplier=4' for MobileNet_v2. With this heavier model, the GTX 960 could process 21 samples per second, and the TPUv2-8 could process about 275 samples per second. This time I can estimate that each TPUv2 core delivers about 4 TFLOPS (275 / 21 ≈ 13x the GTX 960's roughly 2.3 FP32 TFLOPS, spread over 8 cores, gives about 3.8 TFLOPS per core). I know this figure seems far below Google's official 45 TFLOPS, but considering the possible bottlenecks of storage I/O and network bandwidth, it becomes understandable. There is also another possibility: Google's 45 TFLOPS refers to half-precision performance rather than single-precision.
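A rough sketch of how the multiplier is set, assuming the slim MobileNet_v2 implementation from tensorflow/models (with research/slim on the PYTHONPATH; the exact keyword argument is from that implementation):

```python
import tensorflow as tf
from nets.mobilenet import mobilenet_v2   # tensorflow/models, research/slim

images = tf.placeholder(tf.float32, [None, 224, 224, 3])

# depth_multiplier scales the channel width of every layer; with 4x wider
# layers the pointwise convolutions (the FLOP-heavy part of MobileNet_v2)
# grow roughly quadratically, so each sample carries much more compute and
# the TPU spends less time waiting on input.
logits, endpoints = mobilenet_v2.mobilenet(
    images, num_classes=1001, depth_multiplier=4.0)
```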