
RuntimeError: CUDA error: device-side assert triggered

This usually means an index is out of range for a tensor dimension, and it frequently comes from wrong label values. Run with CUDA_LAUNCH_BLOCKING=1 python ... to get a more accurate traceback of where the bug happens. Sometimes moving the tensors to the CPU as well gives you more details.
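Since wrong label values are the usual culprit, a cheap check on the CPU before computing the loss can turn the opaque assert into a readable message. The sketch below is only an illustration: check_labels and its ignore_index argument are hypothetical names, not part of PyTorch, and it assumes the labels are class indices expected to lie in [0, num_classes).

```python
def check_labels(labels, num_classes, ignore_index=None):
    """Raise a readable error if any label falls outside [0, num_classes).

    `labels` is any iterable of integers, e.g. `target.flatten().tolist()`.
    `ignore_index` (hypothetical option) is a value to skip, such as a padding label.
    """
    bad = [
        (i, int(v))
        for i, v in enumerate(labels)
        if int(v) != ignore_index and not 0 <= int(v) < num_classes
    ]
    if bad:
        raise ValueError(
            f"{len(bad)} label(s) outside [0, {num_classes}); "
            f"first few (index, value): {bad[:5]}"
        )


# Valid labels pass silently.
check_labels([0, 1, 2], num_classes=3)
```

Calling it right before the loss, for example check_labels(target.flatten().tolist(), num_classes), costs a device-to-host copy, so it is best kept as a debugging aid rather than left in the training loop.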

NCCL communicator was aborted on rank 1

This error can have many causes, but it is NOT common. My case is quite special: my model has some parameters that only contribute to the loss when the input satisfies some condition, which leads to the error about parameters not contributing to the loss. So I tried forcing the loss to zero when the parameters did not contribute, and I also tried setting find_unused_parameters=True. Both got rid of the error about parameters not contributing to the loss, but resulted in the broken NCCL communication error instead. Finally, I solved it by making all parameters always contribute to the loss.

In short, it seems OK for a model to have some parameters that never participate in computing the training loss (we can handle that by setting find_unused_parameters=True). But if these parameters are used in some iterations and not in others, the NCCL communication fails.
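A minimal sketch of one way to keep every parameter in the graph at every iteration: always run the conditional branch and multiply its output by a 0/1 gate instead of skipping it. This is an assumption about how "always contribute" could be implemented, not necessarily the exact fix used above; GatedModel, main, extra and use_extra are hypothetical names, and the example runs on the CPU without DDP just to show the gradients.

```python
import torch
from torch import nn


class GatedModel(nn.Module):
    """Hypothetical model whose `extra` branch only matters for some inputs."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.main = nn.Linear(dim, 1)
        self.extra = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor, use_extra: bool) -> torch.Tensor:
        out = self.main(x)
        # Always run the branch so its parameters are part of the graph,
        # then switch its contribution off with a gate instead of an `if`.
        gate = 1.0 if use_extra else 0.0
        return out + gate * self.extra(x)


model = GatedModel()
loss = model(torch.randn(4, 8), use_extra=False).sum()
loss.backward()

# Every parameter received a gradient tensor; the gated-off branch gets zeros.
all_have_grad = all(p.grad is not None for p in model.parameters())
extra_grad_is_zero = bool(torch.count_nonzero(model.extra.weight.grad) == 0)
```

The trade-off is that the branch is computed even when its result is discarded, which costs some extra time per iteration.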

Error caused by the edited mutable Dictionary

Avoid editing the dictionary in place, because in deep learning projects the dictionary is very likely to be REUSED in the next epochs. Edit a deep copy of the original dictionary instead (dict_ = copy.deepcopy(dict)).
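To make the pitfall concrete, here is a small sketch with a made-up annotation dictionary (prepare_sample, frames and label are hypothetical names). The function edits only a deep copy, so the original stays identical no matter how many epochs reuse it.

```python
import copy


def prepare_sample(sample):
    """Hypothetical per-epoch transform that needs to modify its input."""
    sample = copy.deepcopy(sample)  # work on a private copy, nested objects included
    sample["frames"].append("flipped")
    sample["label"] += 1
    return sample


original = {"frames": ["f0", "f1"], "label": 3}

# The same dictionary is reused in every epoch, as a dataset would do.
for epoch in range(2):
    out = prepare_sample(original)
```

A shallow copy such as dict(sample) would not be enough here, because the nested frames list would still be shared with the original.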

First GPU memory higher than the other

When doing distributed training on multiple GPUs, I once found that the rank 0 GPU took much more memory than the others, which limited my batch size:

Screenshot from 2022-07-19 23-18-58

I finally found the solution here.

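I fixed it by changing my checkpoint loading from the first version below to the second, which passes map_location so that the tensors stay on the CPU when loaded instead of going straight back onto the GPU they were saved from: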

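import torch
from mmcv.runner import _load_checkpoint, load_state_dict

class MViT2(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # No map_location: tensors are restored onto the device they were saved from
        state_dict = _load_checkpoint("path_to.pth")
        load_state_dict(self, state_dict['model_state'])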


import torch
from mmcv.runner import _load_checkpoint, load_state_dict

class MViT2(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # map_location keeps the loaded tensors on the CPU
        state_dict = _load_checkpoint("path_to.pth", map_location=lambda storage, loc: storage)
        load_state_dict(self, state_dict['model_state'])

Problem solved:

Screenshot from 2022-07-19 23-21-52

Multiprocessing slow

My case is special: my multiprocessing task runs at the validation stage to compute the evaluation metric. It runs fast during training, but slowly when I only evaluate the results.

Specifically, when I run a complete training epoch, the validation is fast. But when I manually load the saved model outputs and test them with the same evaluation script, it is very slow (0.3 vs 1.5).

I found the difference is that my library (mmaction2) sets up the multi-processing environment variables, which my manual evaluation script does NOT. After adding the Python code below to my script, the problem was solved.

import os
if 'OMP_NUM_THREADS' not in os.environ:
    os.environ['OMP_NUM_THREADS'] = str(1)
if 'MKL_NUM_THREADS' not in os.environ:
    os.environ['MKL_NUM_THREADS'] = str(1)