📙 PyTorch Note - CUDA

CPU Tensor or GPU Tensor?

doc: torch.Tensor — PyTorch master documentation
ref: How to create a tensor on GPU as default - PyTorch Forums
"torch.Tensor is an alias for the default tensor type (torch.FloatTensor)." (as said by the documentation)

import torch
import torch.nn as nn
import torch.nn.functional as F

Torch defines eight CPU tensor types and eight GPU tensor types.
The default tensor type is torch.FloatTensor, which is a CPU tensor with a dtype of torch.float32.

print(torch.get_default_dtype()) # get the default tensor dtype (torch.float32)

This (torch.FloatTensor) means tensors are created on the CPU if no device is specified.
To make tensors be created on the GPU by default:

torch.set_default_tensor_type('torch.cuda.FloatTensor')

After this, all tensors will be created on the currently selected GPU device and will still have a dtype of torch.float32 by default.

a = torch.tensor([1.])
print(a.dtype)   # torch.float32
print(a.device)  # cuda:0

One more example:

torch.set_default_tensor_type('torch.cuda.DoubleTensor')

This makes tensors be created on the GPU by default with a dtype of torch.float64.
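
A quick check (a minimal sketch, assuming a CUDA-capable GPU is available):

torch.set_default_tensor_type('torch.cuda.DoubleTensor')
b = torch.tensor([1.])
print(b.dtype)   # torch.float64
print(b.device)  # cuda:0 (the currently selected GPU)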

Device, Current Device

Use torch.device() to construct a torch.device object.

  1. Get the CPU device

    cpu = torch.device('cpu') # Current CPU device
    cpu1 = torch.device('cpu:0')
    

    These two are exactly the same, since there is no multi-CPU mode.

  2. Get the GPU device

    # Current GPU device
    cuda = torch.device('cuda')
    cuda = torch.device('cuda', None)
    # GPU 0
    cuda0 = torch.device('cuda:0')
    cuda0 = torch.device('cuda', 0)
    # GPU 1
    cuda1 = torch.device('cuda:1')
    cuda1 = torch.device('cuda', 1)
    

    The current CPU device is always 'cpu:0', but the current GPU device depends on the currently selected device.
    So if the currently selected device is GPU 0, cuda refers to GPU 0; but if we change the currently selected device to GPU 1 (if you have one 😂), cuda will refer to GPU 1. See the sketch after this list.

  3. Create tensors on a device
    Get the index of the currently selected device:

    print(torch.cuda.current_device())
    

    Let's suppose it's 0; now we can:

    # Create a tensor on CPU, given a torch.device object or a string
    a = torch.tensor([1.], device=cpu)
    a = torch.tensor([1.], device='cpu')
    # Create a tensor on currently selected GPU, which is GPU 0 now
    b = torch.tensor([1.], device=cuda)
    b = torch.tensor([1.], device='cuda')
    # Create a tensor on specific GPU
    c = torch.tensor([1.], device=cuda1)
    c = torch.tensor([1.], device='cuda:1')
    

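A small sketch of how the currently selected device affects the bare 'cuda' device (assuming at least two GPUs are available):

print(torch.cuda.current_device())    # 0 by default
d = torch.tensor([1.], device='cuda')
print(d.device)                       # cuda:0

torch.cuda.set_device(1)              # change the currently selected device
e = torch.tensor([1.], device='cuda')
print(e.device)                       # cuda:1
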
With One GPU

With one GPU, we only care about whether a tensor is on the CPU or on the GPU. There is no need to care about the currently selected device, since there is only one GPU to select :joy:.

Transfer Data (CPU <-> GPU)

torch.Tensor.cuda() returns a copy of the tensor in CUDA memory on the specified device; if no device argument is given, it copies to the currently selected device.

cuda = torch.device('cuda')
cuda0 = torch.device('cuda:0')
tensor = torch.randn(2, 2)
# To the currently selected GPU device or a specific device (both 'cuda:0' in this situation)
tensor = tensor.cuda()
tensor = tensor.cuda(cuda0)

Conversely, use torch.Tensor.cpu() to get a copy in CPU memory.

tensor = torch.randn(2, 2)
# CPU -> GPU
tensor = tensor.cuda()
# GPU -> CPU
tensor = tensor.cpu()

torch.Tensor.to() performs tensor dtype and/or device conversion. It returns a tensor with the desired dtype and device (a new copy is made only when necessary).

cuda0 = torch.device('cuda:0')
cpu = torch.device('cpu')
tensor = torch.randn(2, 2)
# to float64
tensor = tensor.to(torch.float64)
# to float32, using torch.Tensor.type()
tensor = tensor.type(torch.float32)
# to GPU
tensor = tensor.to(cuda0)
# to CPU
tensor = tensor.to(cpu)

So torch.Tensor.to(device, dtype) can be considered a combination of torch.Tensor.cuda(device), torch.Tensor.cpu(), and torch.Tensor.type(dtype).
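
For example, both conversions can be done in one call:

cuda0 = torch.device('cuda:0')
tensor = torch.randn(2, 2)
# move to GPU 0 and cast to float64 in a single call
tensor = tensor.to(cuda0, torch.float64)
print(tensor.device, tensor.dtype)  # cuda:0 torch.float64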

Transfer Model (CPU <-> GPU)

Once a tensor is allocated (to CPU or GPU), we can operate on it regardless of the currently selected device, and the result is always placed on the same device as the tensor.

Furthermore, if an operation involves two or more tensors, they must all be allocated on the same device; the operation takes place on that device and the result is placed there.
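
A minimal illustration (assuming a CUDA device is available):

a = torch.tensor([1., 2.], device='cuda')
b = torch.tensor([1., 2.])  # on the CPU
c = a * 2                   # fine: the result stays on the same GPU as a
# a + b would raise a RuntimeError, since a and b live on different devices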

torch.nn.Parameter is a kind of Tensor that is to be considered a module parameter, and Parameters are Tensor subclasses. Accordingly, torch.nn.Module provides .cuda() and .cpu() methods for easily transferring a module's parameter tensors between CPU and GPU, as well as a .to() method for the same transfer/cast operations.

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
# this list contains the parameters of model.conv1 (weight and bias)
param_list_conv1 = list(model.conv1.parameters())
print(param_list_conv1[0].device)  # cpu
# CPU -> GPU (.cuda() method)
model.cuda()
print(param_list_conv1[0].device)  # cuda:0
# GPU -> CPU (.cpu() method)
model.cpu()
print(param_list_conv1[0].device)  # cpu
# CPU -> GPU (.to() method)
cuda0 = torch.device('cuda:0')
model.to(cuda0)
print(param_list_conv1[0].device)  # cuda:0

After allocating both the data and the model to the GPU, we can use the GPU to accelerate the training process.
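
For instance, once both the model and a batch of input live on the GPU, the forward pass runs there (a minimal sketch reusing the Model class above; the input shape is illustrative):

model = Model().cuda()
x = torch.randn(1, 1, 28, 28).cuda()  # a dummy batch of one single-channel image
out = model(x)
print(out.device)  # cuda:0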

With Multiple GPUs

Use Context-Manager

With multiple GPUs, you should care about the currently selected device. Use the context manager torch.cuda.device() to control which GPU a tensor is created on, which also makes the code clearer.

cuda = torch.device('cuda')

# Create tensor a,b,c on device cuda:0
with torch.cuda.device(0):
    a = torch.tensor([1., 2.], device=cuda)
    b = torch.tensor([1., 2.]).cuda()
    c = torch.tensor([1., 2.]).to(cuda)

# Create tensor d,e,f on device cuda:1
with torch.cuda.device(1):
    d = torch.tensor([1., 2.], device=cuda)
    e = torch.tensor([1., 2.]).cuda()
    f = torch.tensor([1., 2.]).to(cuda)
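
Checking the placement afterwards:

print(a.device, b.device, c.device)  # cuda:0 cuda:0 cuda:0
print(d.device, e.device, f.device)  # cuda:1 cuda:1 cuda:1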

Control GPU Visibility with CUDA_VISIBLE_DEVICES

doc: J. CUDA Environment Variables :: CUDA Toolkit Documentation
ref: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Developer Blog
ref: 2Pac – Can't C Me Lyrics | Genius Lyrics

Let's suppose (or dream) that you have 4 GPUs and want to use three of them to train your model while keeping the remaining one to play :video_game:. Just set the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application (the model-training process) sees.

There are many ways to achieve that; here are two of them:

  1. Set the environment variable in your python script (not recommended)

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'
    

    This method is not recommended because it's inflexible, and it only takes effect if set before CUDA is initialized. Use it only when this restriction is your permanent, usual setup.

  2. Set the environment variable when you run the python script (recommended)

    CUDA_VISIBLE_DEVICES=1,2,3 python train.py
    

    Use this if you just want to play tonight, or to make only one GPU visible in order to test compatibility with a single-GPU environment.

And after that,

The blind stares of a million pairs of eyes
Lookin' hard but won't realize
That they will never see the 'GPU0'!

Data Parallelism

ref: Optional: Data Parallelism — PyTorch Tutorials
doc: torch.nn — PyTorch master documentation

PyTorch will only use one GPU by default. Simply wrap your model with torch.nn.DataParallel to run it in parallel over multiple GPUs along the batch dimension.

model = nn.DataParallel(model)
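
A common pattern is to wrap the model only when several GPUs are visible (a sketch; device_ids is optional and defaults to all visible GPUs):

model = Model()
if torch.cuda.device_count() > 1:
    # optionally pass device_ids, e.g. nn.DataParallel(model, device_ids=[0, 1, 2])
    model = nn.DataParallel(model)
model.cuda()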

Use Pinned Memory Buffer and Asynchronization

ref: When to set pin_memory to true? - vision - PyTorch Forums
ref: How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Developer Blog

torch.utils.data.DataLoader accepts a pin_memory argument; if True, the loader copies tensors into CUDA pinned (page-locked) memory before returning them.

#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L211-L223
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)

By default, GPU operations are asynchronous, which allows more computations to execute in parallel. But copies between the CPU and the GPU, or between GPUs, are synchronous by default, e.g. torch.Tensor.to(), torch.Tensor.cuda(), and torch.nn.Module.to(). These functions accept a non_blocking argument (previously named async). When non_blocking=True, the copy is performed asynchronously with respect to the host if possible, e.g. when moving a CPU tensor in pinned memory to a CUDA device.

#https://github.com/pytorch/examples/blob/master/imagenet/main.py#L270-L272
input = input.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)

Together, these techniques provide higher bandwidth between the host (CPU) and the device (GPU) and improve data-transfer performance.

Write Device-Agnostic Code

Use CUDA if Possible

# At the beginning of the script
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# When loading data
image, label = image.to(device), label.to(device)
# Create the model
model = Model().to(device)

Use Arguments to Control it

# https://pytorch.org/docs/stable/notes/cuda.html#device-agnostic-code
import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true',
                    help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')
    
# When loading the data
for i, x in enumerate(train_loader):
    x = x.to(args.device)
# When creating the model
model = Model().to(args.device)
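
Then CUDA can be disabled from the command line when needed:

python train.py --disable-cuda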

In Practice

Actually, this note is a brief summary. In practice, we should (see the combined sketch after this list):

  1. Write device-agnostic code that uses GPU by default and provide an argument to disable it.
  2. Use pinned memory buffer and also asynchronous data transfer.
  3. Use data parallel when you have multiple GPUs.
  4. Use environment variable to control GPU visibility when you have multiple GPUs.
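
A combined sketch putting these pieces together (assuming the Model class, a train_dataset, and the argparse args from above; the batch size and worker count are illustrative):

device = torch.device('cuda' if not args.disable_cuda and torch.cuda.is_available() else 'cpu')

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=32, shuffle=True,
    num_workers=4, pin_memory=(device.type == 'cuda'))

model = Model()
if device.type == 'cuda' and torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

for input, target in train_loader:
    # asynchronous copies from pinned memory when on GPU
    input = input.to(device, non_blocking=True)
    target = target.to(device, non_blocking=True)
    output = model(input)
    # ... compute loss, backward pass, optimizer step ...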

Post cover image from Quick Guide for setting up PyTorch with Window in 2 mins