Speed Up your Algorithms Part 1 — PyTorch
Index:
- Introduction
- How to check the availability of cuda?
- How to get more info on cuda devices?
- How to store Tensors and run Models on GPU?
- How to select and work on GPU(s) if you have multiple of them?
- Data Parallelism
- References
1. Introduction:
In this post I will show how to check and initialize GPU devices using torch and pycuda, and how to make your algorithms faster.
2. How to check the availability of cuda?
To check if you have a cuda device available using Torch, you can simply run:
import torch
torch.cuda.is_available()  # True
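A common pattern built on this check (a small sketch) is to pick the device once and reuse it everywhere, falling back to the CPU when no cuda device is available:

import torch

# Pick the device once; everything created with device=device follows it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.zeros(10, device=device)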
3. How to get more info on your cuda devices?
To get basic info on devices, you can use torch.cuda. But to get more info on your devices you can use pycuda, a Python wrapper around the CUDA driver API. You can use something like:
import torch
import pycuda.driver as cuda
cuda.init()

# Get Id of default device
torch.cuda.current_device()  # 0

cuda.Device(0).name()  # '0' is the id of your GPU
# 'Tesla K80'
Or,
torch.cuda.get_device_name()  # Get name of default device
# 'Tesla K80'
I wrote a simple class to get information on your cuda compatible GPU(s):
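A minimal sketch of what such a class might look like (the CudaDeviceInfo name and the printed fields here are illustrative, assuming pycuda is installed):

import pycuda.driver as cuda

class CudaDeviceInfo:
    """Prints basic facts about each CUDA-capable GPU via pycuda."""

    def __init__(self):
        cuda.init()
        self.num_devices = cuda.Device.count()

    def summary(self):
        for i in range(self.num_devices):
            dev = cuda.Device(i)
            print(f"Device {i}: {dev.name()}")
            print(f"  Compute capability: {dev.compute_capability()}")
            print(f"  Total memory: {dev.total_memory() / 1024 ** 2:.0f} MB")

CudaDeviceInfo().summary()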
To get the current memory usage, you can use PyTorch's functions such as:
import torch

# Returns the current GPU memory usage by
# tensors in bytes for a given device
torch.cuda.memory_allocated()

# Returns the current GPU memory managed by the
# caching allocator in bytes for a given device
torch.cuda.memory_cached()
And after you have run your application, you can clear your cache using a simple command:
# Releases all unoccupied cached memory currently held by
# the caching allocator so that it can be used by other
# GPU applications and is visible in nvidia-smi
torch.cuda.empty_cache()
However, this command will not free the GPU memory occupied by tensors, so it cannot increase the amount of GPU memory available for PyTorch.
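A small sketch of how these calls interact; the reference to a tensor has to be dropped before its memory can actually be released (the tensor size is arbitrary):

import torch

x = torch.randn(1024, 1024, device='cuda')  # ~4 MB held by the tensor
print(torch.cuda.memory_allocated())        # > 0 while x is alive

del x                                       # drop the reference first
print(torch.cuda.memory_allocated())        # back to (near) zero
torch.cuda.empty_cache()                    # return the cached block to the GPU,
                                            # so nvidia-smi shows it as free again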
4. How to store Tensors and run Models on GPU?
The .cuda magic.
If you want to store something on the CPU, you can simply write:
a = torch.DoubleTensor([1., 2.])
This vector is stored on the CPU and any operation you do on it will be done on the CPU. To transfer it to the GPU you just have to call .cuda():
a = torch.FloatTensor([1., 2.]).cuda()
Or,
a = torch.cuda.FloatTensor([1., 2.])
This will place it on the default device, which you can see with the command:
torch.cuda.current_device()  # 0
Or, you can also do:
a.get_device()  # 0
You can also send a Model to the GPU device. For example, consider a simple module made from nn.Sequential:
import torch.nn as nn

sq = nn.Sequential(
    nn.Linear(20, 20),
    nn.ReLU(),
    nn.Linear(20, 4),
    nn.Softmax(dim=1)
)
To send this to GPU device, simply do:
model = sq.cuda()
You can check whether it is on a GPU device by checking whether its parameters are on the GPU, like:
next(model.parameters()).is_cuda  # True
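To actually run the model, its inputs have to live on the same device. A minimal sketch of a forward pass on the GPU, reusing the sq module from above (the batch size of 8 is arbitrary):

import torch

model = sq.cuda()              # parameters now live on the default GPU
x = torch.randn(8, 20).cuda()  # the input batch must be moved as well
out = model(x)                 # runs on the GPU; out is an (8, 4) cuda Tensor
print(out.is_cuda)             # True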
5. How to select and work on GPU(s) if you have multiple of them?
You can select a GPU for your current application/storage which can be different from the GPU you selected for your last application/storage.
As already seen in part (3), we can get all our cuda compatible devices and their Ids using pycuda, so we will not discuss that here.
Considering you have 3 cuda compatible devices, you can initialize and allocate tensors to a specific device like this:
cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')
# If you use 'cuda' only, Tensors/models will be sent to
# the default (current) device. (default = 0)
x = torch.tensor([1., 2.], device=cuda1)
# Or
x = torch.tensor([1., 2.]).to(cuda1)
# Or
x = torch.tensor([1., 2.]).cuda(cuda1)
# NOTE:
# If you want to change the default device, use:
torch.cuda.set_device(2)  # where '2' is the Id of the device
# And if you want to use only 2 of the 3 GPUs, you
# will have to set the environment variable
# CUDA_VISIBLE_DEVICES equal to say, "0,2" if you
# only want to use the first and third GPUs. Now if you
# check how many GPUs you have, it will show two (0, 1).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"
When you do any operation on these Tensors, which you can do irrespective of the selected device, the result will be saved on the same device as the Tensors.
x = torch.Tensor([1., 2.]).to(cuda2)
y = torch.Tensor([3., 4.]).to(cuda2)
# This Tensor will be saved on 'cuda2' only
z = x + y
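If you need the result somewhere else, move it explicitly, because operations on Tensors that live on different devices will raise an error. For example (using the devices defined above):

z = z.to(cuda0)  # copy the result to another GPU
z = z.cpu()      # or bring it back to the CPU (e.g. before converting to numpy)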
If you have multiple GPUs, you can split your application's work among them, but it will come with an overhead of communication between them. If your application doesn't need to relay messages too much, you can give it a go.
Actually there is one more problem. In PyTorch, all GPU operations are asynchronous by default. And though PyTorch does the necessary synchronization when copying data between CPU and GPU or between two GPUs, if you create your own stream with the help of the command torch.cuda.Stream(), then you will have to look after the synchronization of instructions yourself.
Giving an example from PyTorch's documentation, this is incorrect:
cuda = torch.device('cuda')
s = torch.cuda.Stream()  # Create a new stream.
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # because sum() may start execution before normal_() finishes!
    B = torch.sum(A)
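One way to make it safe (a sketch, not the only fix) is to make the new stream wait for the work already queued on the default stream before consuming its results:

cuda = torch.device('cuda')
s = torch.cuda.Stream()
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
s.wait_stream(torch.cuda.default_stream(cuda))  # wait until normal_() has finished
with torch.cuda.stream(s):
    B = torch.sum(A)  # now guaranteed to see the initialized A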
If you want to use multiple GPUs to their full potential, you can:
- use all GPUs for different tasks/applications,
- use each GPU for one model in an ensemble or stack, each GPU holding a copy of the data (if possible), since most of the processing is done while fitting the model,
- use each GPU with a slice of the input and a copy of the model on each GPU. Each GPU computes its result separately and sends it to a destination GPU where further computation is done, etc.
6. Data Parallelism:
In data parallelism we split the data, a batch, that we get from Data Generator into smaller mini batches, which we then send to multiple GPUs for computation in parallel.
In PyTorch, data parallelism is implemented using torch.nn.DataParallel.
But we will see a simple example to see what is going on under the hood. To do that, we will have to use some of the functions of nn.parallel, namely:
- Replicate: To replicate a Module on multiple devices.
- Scatter: To distribute the input in the first dimension among those devices.
- Gather: To gather and concatenate the input in the first dimension from those devices.
- parallel_apply: To apply a set of distributed inputs, which we got from Scatter, to a corresponding set of distributed Modules, which we got from Replicate.
# Replicate module to devices in device_ids
replicas = nn.parallel.replicate(module, device_ids)

# Distribute input to devices in device_ids
inputs = nn.parallel.scatter(input, device_ids)

# Apply the models to corresponding inputs
outputs = nn.parallel.parallel_apply(replicas, inputs)

# Gather result from all devices to output_device
result = nn.parallel.gather(outputs, output_device)
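Putting those four calls together, a rough sketch of what a data-parallel forward pass looks like (the data_parallel function name is illustrative; the structure follows PyTorch's multi-GPU examples tutorial):

import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    # Run one data-parallel forward pass of `module` on `input`
    if output_device is None:
        output_device = device_ids[0]
    replicas = nn.parallel.replicate(module, device_ids)
    inputs = nn.parallel.scatter(input, device_ids)
    replicas = replicas[:len(inputs)]  # scatter may use fewer devices than requested
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    return nn.parallel.gather(outputs, output_device)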
Or, simply:
model = nn.DataParallel(model, device_ids=device_ids)
result = model(input)
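For example, reusing the sq module from section 4 (a sketch assuming three visible GPUs; the batch is split along dimension 0 across them):

import torch
import torch.nn as nn

device_ids = [0, 1, 2]
model = nn.DataParallel(sq, device_ids=device_ids).cuda()  # parameters live on device_ids[0]
batch = torch.randn(64, 20).cuda()  # one batch; DataParallel splits it into 3 chunks
result = model(batch)               # outputs are gathered back on device_ids[0]
print(result.shape)                 # torch.Size([64, 4])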