📙 PyTorch Note - CUDA
CPU Tensor or GPU tensor?
doc: torch.Tensor — PyTorch master documentation
ref: How to create a tensor on GPU as default - PyTorch Forums
torch.Tensor is an alias for the default tensor type (torch.FloatTensor). - said by the document
import torch
import torch.nn as nn
import torch.nn.functional as F
Torch defines 8 CPU tensor types and 8 GPU tensor types.
The default tensor type is torch.FloatTensor, which is a CPU tensor and has a dtype of torch.float32.
print(torch.get_default_dtype()) # To get the default Tensor dtype(torch.float32)
The default tensor type (torch.FloatTensor) makes tensors be created on the CPU if no device is specified.
To make tensors be created on the GPU by default:
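The call itself is not shown at this point. A minimal sketch, assuming torch.set_default_tensor_type is the API meant here (newer PyTorch releases deprecate it in favour of torch.set_default_device); the use_cuda name is just for illustration, and the guard makes it a no-op on a machine without CUDA:

```python
import torch

# Assumption: torch.set_default_tensor_type is the intended call. It changes
# global state, so every tensor created afterwards is affected.
use_cuda = torch.cuda.is_available()
if use_cuda:
    # New floating-point tensors now default to torch.float32 on the current GPU
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
```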
After this, all tensors will be created on the selected GPU device, and they still have a dtype of torch.float32 by default.
a = torch.tensor([1.])
print(a.dtype)
print(a.device)
One more example:
This makes tensors to be created on GPU by default and has a dtype of
Device, Current Device
torch.device to get the
Get the CPU device
cpu = torch.device('cpu')  # Current CPU device
cpu1 = torch.device('cpu:0')
Both refer to the same device, cuz there is no multi-CPU mode.
Get the GPU device
# Current GPU device
cuda = torch.device('cuda')
cuda = torch.device('cuda', None)
# GPU 0
cuda0 = torch.device('cuda:0')
cuda0 = torch.device('cuda', 0)
# GPU 1
cuda1 = torch.device('cuda:1')
cuda1 = torch.device('cuda', 1)
The current CPU device will always be 'cpu:0', but the current GPU device depends on the currently selected device.
So, if the currently selected device is GPU 0 now, cuda is GPU 0. But when we change the currently selected device to GPU 1 (if u have one... 😂), cuda will become GPU 1.
Create Tensors on device
Get the index of currently selected device:
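The snippet for this step is missing here. A small sketch, assuming torch.cuda.current_device() is the call meant; the index variable and the CPU-only fallback are additions for illustration:

```python
import torch

if torch.cuda.is_available():
    # Index (an int) of the currently selected GPU
    index = torch.cuda.current_device()
    print(index, torch.cuda.get_device_name(index))
else:
    # Fallback so the sketch also runs on a CPU-only machine
    index = None
    print('No CUDA device available')
```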
Let's suppose it's
0, now we can
# Create a tensor on CPU, given a torch.device object or a string
a = torch.tensor([1.], device=cpu)
a = torch.tensor([1.], device='cpu')
# Create a tensor on currently selected GPU, which is GPU 0 now
b = torch.tensor([1.], device=cuda)
b = torch.tensor([1.], device='cuda')
# Create a tensor on specific GPU
c = torch.tensor([1.], device=cuda1)
c = torch.tensor([1.], device='cuda:1')
With One GPU
With one GPU, we only care about whether a tensor is on the CPU or on the GPU. No need to care about the currently selected device, cuz u have only 1 GPU that can be selected 😂.
Transfer Data (CPU <-> GPU)
torch.Tensor.cuda() returns a copy of this torch.Tensor object in CUDA memory on the specified device, or on the currently selected device if no device parameter is given. If the tensor is already on that device, no copy is made and the original object is returned.
cuda = torch.device('cuda')
cuda0 = torch.device('cuda:0')
tensor = torch.randn(2, 2)
# To currently selected GPU device or specific device (both 'cuda:0' in this situation)
tensor = tensor.cuda()
tensor = tensor.cuda(cuda0)
Use torch.Tensor.cpu() to get a copy in CPU memory.
tensor = torch.randn(2, 2)
# CPU -> GPU
tensor = tensor.cuda()
# GPU -> CPU
tensor = tensor.cpu()
torch.Tensor.to() performs Tensor dtype and/or device conversion. It returns a copy with the desired dtype and device, or the tensor itself if it already has both.
cuda0 = torch.device('cuda:0')
cpu = torch.device('cpu')
tensor = torch.randn(2, 2)
# to float64
tensor = tensor.to(torch.float64)
# to float32, using torch.Tensor.type()
tensor = tensor.type(torch.float32)
# to GPU
tensor = tensor.to(cuda0)
# to CPU
tensor = tensor.to(cpu)
torch.Tensor.to(device, dtype) can be considered as a combination of
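A short sketch of the combined form, moving and casting in one call. The device fallback to CPU and the variable names are assumptions added so the snippet also runs without a GPU:

```python
import torch

# Fall back to CPU when no GPU is present (assumption for portability)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tensor = torch.randn(2, 2, device='cpu')       # float32 on CPU

# Device and dtype converted in a single call
converted = tensor.to(device, torch.float64)
print(converted.device, converted.dtype)

# Nothing to convert: .to() hands back the very same object
same = tensor.to('cpu', torch.float32)
print(same is tensor)
```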
Transfer Model (CPU <-> GPU)
Once a data Tensor is allocated (to CPU/GPU), we can do operations on it irrespective of the selected device, and the results will always be placed on the same device as the Tensor.
Furthermore, if we do operations between 2 or more Tensors, they must be allocated on the same device, so that the operation takes place on that device and the result is placed there. Mixing tensors from different devices generally raises a RuntimeError.
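A minimal sketch of both rules: the result lands on the operands' device, and mixing a CPU tensor with a GPU tensor fails. The GPU part only runs when CUDA is available; the variable names are illustrative:

```python
import torch

a = torch.ones(2, device='cpu')
b = torch.ones(2, device='cpu')
total = a + b
print(total.device)                 # same device as the operands

if torch.cuda.is_available():
    g = a.cuda()
    print((g + g).device)           # result stays on the GPU
    try:
        a + g                       # CPU tensor + GPU tensor
    except RuntimeError as err:
        print('device mismatch:', err)
```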
torch.nn.Parameter is a kind of Tensor that is to be considered a module parameter; Parameters are Tensor subclasses. Accordingly, the torch.nn module provides the torch.nn.Module.cuda() and torch.nn.Module.cpu() methods for easily transferring a module's tensors (parameters) between CPU and GPU, and also the torch.nn.Module.to() method to do the transfer/cast things.
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

model = Model()
# List containing the parameters of model.conv1 (weight and bias)
param_list_conv1 = list(model.conv1.parameters())
# A list has no .device attribute, so inspect its first entry (the weight)
print(param_list_conv1[0].device)
# CPU -> GPU (.cuda() method); parameters are moved in place
model.cuda()
print(param_list_conv1[0].device)
# GPU -> CPU (.cpu() method)
model.cpu()
print(param_list_conv1[0].device)
# CPU -> GPU (.to() method)
cuda0 = torch.device('cuda:0')
model.to(cuda0)
print(param_list_conv1[0].device)
After allocating data and model to GPU, we are able to use GPU to accelerate our training process.
With Multiple GPUs
With multiple GPUs, you should care about the currently selected device. Use the torch.cuda.device() context manager to control which GPU a tensor is created on and to make the code clearer.
cuda = torch.device('cuda')
# Create tensor a,b,c on device cuda:0
with torch.cuda.device(0):
    a = torch.tensor([1., 2.], device=cuda)
    b = torch.tensor([1., 2.]).cuda()
    c = torch.tensor([1., 2.]).to(cuda)
# Create tensor d,e,f on device cuda:1
with torch.cuda.device(1):
    d = torch.tensor([1., 2.], device=cuda)
    e = torch.tensor([1., 2.]).cuda()
    f = torch.tensor([1., 2.]).to(cuda)
Control GPU Visibility with CUDA_VISIBLE_DEVICES
doc: J. CUDA Environment Variables :: CUDA Toolkit Documentation
ref: CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES | NVIDIA Developer Blog
ref: 2Pac – Can't C Me Lyrics | Genius Lyrics
Let's suppose (or dream about) that you have 4 GPUs, and want to use three of them to train your model while using the remaining one to play
Use CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application (the model training process) sees.
There are many ways to achieve that; here are 2 of them:
Set the environment variable in your python script (not recommended)
import os
# Must be set before CUDA is initialized, i.e. before the first CUDA call
os.environ['CUDA_VISIBLE_DEVICES'] = '1,2,3'
This method is not recommended cuz it's not flexible. Use it only when you always want the same set of devices.
Set the environment variable when you run the python script (recommended)
CUDA_VISIBLE_DEVICES=1,2,3 python train.py
Use this if you just want to play
tonight. Or only make one of them visible to test the compatibility with one GPU environment.
And after that,
The blind stares of a million pairs of eyes
Lookin' hard but won't realize
That they will never see the 'GPU0'!
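To check what a process actually sees, a small sketch can list the visible devices. The helper name describe_visible_gpus is hypothetical. Note that visible devices are renumbered from 0, so with CUDA_VISIBLE_DEVICES=1,2,3 the process's cuda:0 is physical GPU 1:

```python
import os
import torch

def describe_visible_gpus():
    """Return (env value, device names) as seen by this process."""
    env_value = os.environ.get('CUDA_VISIBLE_DEVICES')   # None if unset
    names = [torch.cuda.get_device_name(i)
             for i in range(torch.cuda.device_count())]
    return env_value, names

env_value, names = describe_visible_gpus()
print('CUDA_VISIBLE_DEVICES =', env_value)
for index, name in enumerate(names):
    # index is the renumbered, process-local device index
    print(f'cuda:{index} -> {name}')
```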
ref: Optional: Data Parallelism — PyTorch Tutorials
doc: torch.nn — PyTorch master documentation
Torch will only use one GPU by default. Simply use
torch.nn.DataParallel to run your model parallelized over multiple GPUs in the batch dimension.
model = nn.DataParallel(model)
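A slightly fuller sketch of the same idea, wrapping the model only when more than one GPU is visible. The nn.Linear model and the batch shape are hypothetical stand-ins, and the CPU fallback is an assumption so the snippet also runs without a GPU:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)                      # hypothetical stand-in model
if torch.cuda.device_count() > 1:
    # Replicate the model; each GPU receives a slice of the batch dimension
    model = nn.DataParallel(model)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

batch = torch.randn(16, 8, device=device)    # 16 samples, split across GPUs
output = model(batch)                        # outputs gathered on one device
print(output.shape)
```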
Use Pinned Memory Buffer and Asynchronization
ref: When to set pin_memory to true? - vision - PyTorch Forums
ref: How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Developer Blog
torch.utils.data.DataLoader accepts a pin_memory parameter; if it is True, the loader copies tensors into pinned (page-locked) host memory before returning them.
# https://github.com/pytorch/examples/blob/master/imagenet/main.py#L211-L223
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)
By default, GPU operations are asynchronous, which allows more computations to execute in parallel. But copying data between CPU and GPU or between GPUs is synchronous by default, e.g. with torch.nn.Module.to(). These functions accept a non_blocking argument, which was previously named async. When non_blocking is set, the call tries to convert/move asynchronously with respect to the host if possible, e.g. when moving CPU Tensors in pinned memory to CUDA devices.
# https://github.com/pytorch/examples/blob/master/imagenet/main.py#L270-L272
input = input.cuda(args.gpu, non_blocking=True)
target = target.cuda(args.gpu, non_blocking=True)
Pinned memory provides higher bandwidth between the host (CPU) and the device (GPU), and non-blocking copies let transfers overlap with other work, so together they improve data transfer performance.
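A self-contained sketch that puts the two together. The random TensorDataset is a hypothetical stand-in for real data, and pinning is enabled only when CUDA is available, which is an assumption made so the snippet also runs on a CPU-only machine:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')

# Hypothetical toy data: 32 samples with 3 features and a binary label
dataset = TensorDataset(torch.randn(32, 3), torch.randint(0, 2, (32,)))
# Pinned host memory only pays off when the batches go to a GPU
loader = DataLoader(dataset, batch_size=8, pin_memory=use_cuda)

for inputs, targets in loader:
    # With pinned source memory these copies can run asynchronously
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)

print(inputs.device, targets.device)
```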
Write Device-Agnostic Code
Use CUDA if Possible
# At the beginning of the script
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# When loading data
image, label = image.to(device), label.to(device)
# Create the model
model = Model().to(device)
Use Arguments to Control it
# https://pytorch.org/docs/stable/notes/cuda.html#device-agnostic-code
import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch Example')
parser.add_argument('--disable-cuda', action='store_true', help='Disable CUDA')
args = parser.parse_args()
args.device = None
if not args.disable_cuda and torch.cuda.is_available():
    args.device = torch.device('cuda')
else:
    args.device = torch.device('cpu')

# When loading the data
for i, x in enumerate(train_loader):
    x = x.to(args.device)

# When creating the model
model = Model().to(args.device)
To wrap up with a brief conclusion, in practice we should:
- Write device-agnostic code that uses GPU by default and provide an argument to disable it.
- Use pinned memory buffer and also asynchronous data transfer.
- Use data parallel when you have multiple GPUs.
- Use environment variable to control GPU visibility when you have multiple GPUs.
Post cover image from Quick Guide for setting up PyTorch with Window in 2 mins