An Exploration of Large Model Support in PowerAI IBM Caffe

Introduction

Large Model Support (LMS) is a feature provided in IBM Caffe that allows the successful training of deep learning models that would otherwise exhaust GPU memory and abort with “out of memory” errors.

LMS manages this oversubscription of GPU memory by temporarily swapping tensors to host memory when they are not needed.

IBM POWER Systems with NVLink technology are especially well-suited to LMS because their hardware topology enables fast communication between the CPU and GPUs.

Use cases

One or more aspects of a deep learning training job can lead to GPU memory exhaustion. These include:

  • Model depth and complexity
  • Base data size (e.g. high-resolution images)
  • Batch size

Traditionally, the solution to this problem has been to modify the model until it fits in GPU memory. This approach, however, can negatively impact accuracy – especially if concessions are made by reducing data fidelity or model complexity.

With LMS, deep learning models can scale significantly beyond what was previously possible and, ultimately, generate more accurate results.

LMS in Action

Let’s take a look at an example using ResNet-152 [1] on the ImageNet [2] dataset. ResNet-152 is a deep residual network that requires a significant amount of GPU memory.

In this example, let’s define six scenarios: A, B, C, D, E and F – with batch sizes of 8, 16, 32, 64, 128, and 256, respectively. We’ll consider training each of these scenarios on a system with 4 x 16GB GPUs (NVIDIA Tesla V100).
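
In Caffe, the batch size is set on the data layer of the network definition that the solver references, not in the solver file itself. Below is a minimal sketch of such a data layer; the layer names and LMDB path are illustrative, not taken from the original experiment:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  data_param {
    # Hypothetical ImageNet LMDB path
    source: "examples/imagenet/ilsvrc12_train_lmdb"
    backend: LMDB
    # Scenario B; scenarios A, C, D, E, F would use 8, 32, 64, 128, 256
    batch_size: 16
  }
}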

Only scenario ‘A’ trains successfully without LMS. The others run into GPU memory limitations. For example, attempting to train scenario ‘B’ yields the following:

$ caffe train -gpu 0,1,2,3 --solver=solver-B.prototxt 
...
I0824 1780 solver.cpp:294] Solving ResNet-152
...
F0824 1780 syncedmem.cpp:569] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
Aborted (core dumped)

To avoid the memory limitation, simply enable Large Model Support by including -lms on the command line:

$ caffe train -gpu 0,1,2,3 --solver=solver-B.prototxt -lms
...
I0828 3878 solver.cpp:294] Solving ResNet-152
...
I0828 3878 caffe.cpp:421] Optimization Done.

Enabling LMS allows all of our scenarios to successfully complete training without any modification to the model itself.

Performance Considerations

Before digging further into the results of running with LMS, let’s take a look at a useful tunable: the -lms_exclude command-line option.

This option allows the user to define a soft limit on the GPU memory allocated for LMS tensors, where limit = GPU capacity - <user-specified value in MB>. For example, on a 16GB GPU, -lms_exclude 5120 yields a soft limit of 16384 - 5120 = 11264 MB.

By default, LMS favors GPU memory reuse (moving inactive tensors to host memory) over new allocations. This effectively minimizes GPU memory consumption.

However, when a limit is defined via lms_exclude, the algorithm favors allocation of GPU memory up to the limit prior to swapping any tensors out to host memory. This allows the user to control the amount of GPU memory consumed when using LMS.

Tuning this option to optimize GPU memory utilization can therefore reduce data transfers and improve performance. Since the ideal setting may differ for any given scenario, it is considered a best practice to determine the value of -lms_exclude experimentally, arriving at the smallest value that does not result in an out-of-memory error.
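
One way to run that search is sketched below. This is a hypothetical shell loop, assuming a probe solver (solver-probe.prototxt, not from the original setup) capped at a small max_iter so each attempt is cheap, and relying on the fact that caffe exits non-zero when it aborts on out-of-memory:

# Hypothetical search for the smallest workable -lms_exclude value.
# Steps upward through candidates and keeps the first one that completes.
for excl in 1 1024 2048 4096 5120 6144 8192; do
  if caffe train -gpu 0,1,2,3 --solver=solver-probe.prototxt \
       -lms -lms_exclude ${excl}; then
    echo "Smallest working -lms_exclude: ${excl} MB"
    break
  fi
done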

Results

With this in mind, let us now examine the results from our example scenarios, looking at three variations:

  • Off: LMS off
    caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt
  • Default: LMS on
    caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt -lms
  • Tuned: LMS on, with -lms_exclude [3]
    caffe train -gpu 0,1,2,3 --solver=solver-<scenario>.prototxt -lms -lms_exclude <value>

Let’s look first at GPU memory utilization:

Without LMS, training fails for lack of memory in scenarios B through F, as indicated by the Xs in the chart. Note the difference in memory use between the Default and Tuned variations, and that this difference is most pronounced in the scenarios with lower memory demands.
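
Memory figures like these can be sampled while training runs using nvidia-smi; this is a standard monitoring approach, not something specific to the original experiment:

# Log per-GPU memory use every 5 seconds while training runs
$ nvidia-smi --query-gpu=index,memory.used --format=csv -l 5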

Now let’s examine training performance. The figure below plots the training duration of a fixed number of training iterations. The values are normalized to show relative performance compared to the default LMS variation in each scenario:

Again, observe that tuning is most effective for the ‘smaller’ models. Transferring tensors between host and GPU memory does have a performance cost. Because smaller models that nearly fit in GPU memory have less need for those transfers, they benefit most from tuning. Larger models necessarily force more data to be transferred and therefore benefit less.

This leads us to an important point about LMS performance. LMS allows training of large models that would otherwise not be possible on GPUs alone. Its data transfers come at a cost, however, and the performance of LMS depends on the interconnect between the GPUs, the CPU, and system memory.

As stated before, IBM POWER Systems with NVLink technology are especially well-suited to LMS because of their hardware topology. Specifically, the NVLink 2.0 connections allow 150 GB/s of communication in each direction between CPU and GPU, compared to the 32 GB/s of PCIe Gen3 in traditionally connected GPUs. See the article referenced in the Further Reading section for a detailed analysis of this point.
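
To put those numbers in perspective, consider an iteration that swaps 4 GB of tensors between host and GPU memory (an illustrative figure, not a measurement from these runs):

NVLink 2.0: 4 GB / 150 GB/s ≈ 27 ms
PCIe Gen3:  4 GB / 32 GB/s  ≈ 125 ms

The same swap traffic takes roughly 4.7x longer over PCIe, which is why the interconnect largely determines LMS overhead.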

Conclusion

LMS allows you to increase the accuracy of your deep learning workloads by preserving the fidelity of your data, without any change to your model’s network architecture.

Have you hit GPU memory limits with your models? You don’t have to any longer. Try LMS today in the latest PowerAI release!

Further Reading

While increasing batch size is a simple way to demonstrate the mechanics of LMS, it is admittedly not the most compelling use case. For a real-world case study using LMS, along with a detailed analysis of the benefits of IBM POWER Systems with NVLink technology specific to LMS, see TensorFlow Large Model Support Case Study with 3D Image Segmentation.

See also Getting started with Caffe in the IBM Knowledge Center for more information on PowerAI Caffe, including additional optimizations and enhancements from IBM.

[1] ResNet-152 example model based on https://github.com/antingshen/resnet-protofiles
[2] ImageNet image database from http://image-net.org/
[3] Tuned -lms_exclude values used: 1 (scenarios A and B), 5120 (scenarios C, D, and E), 6144 (scenario F). Determined experimentally.
