- Python Debugging
- PyTorch Debugging
Avoid editing the dictionary in place: in deep learning projects, the same dictionary is very likely to be REUSED in later epochs. Edit a deep copy of the original dictionary instead (`new_dict = copy.deepcopy(orig_dict)`).
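A minimal sketch of the pitfall (the dict contents here are hypothetical):

```python
import copy

sample = {'img_path': 'a.jpg', 'label': 3}  # handed out by the dataset

def transform(sample):
    # BAD: mutating `sample` in place leaks the change into the next
    # epoch if the dataset returns the same dict object again.
    # sample['label'] += 10

    # Safer: deep-copy first, then edit the copy.
    sample_ = copy.deepcopy(sample)
    sample_['label'] += 10
    return sample_

out = transform(sample)
print(sample['label'], out['label'])  # 3 13 -- the original is untouched
```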
This error is normally caused by an index exceeding a tensor dimension, and it frequently comes from wrong label values. Run with `CUDA_LAUNCH_BLOCKING=1 python ...` to get a more accurate trace of where the bug happens. Sometimes moving the tensor to the CPU as well gives you more details.
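For example, an out-of-range class label makes the loss kernel assert on the GPU, while the same op on the CPU fails immediately with a readable message (a hypothetical reproduction):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, device='cuda')           # 10 classes
labels = torch.tensor([1, 3, 10, 2], device='cuda')  # 10 is out of range!

# On the GPU this fails with an asynchronous "device-side assert triggered"
# whose traceback is misleading unless CUDA_LAUNCH_BLOCKING=1 is set.
# Running it on the CPU raises a clear out-of-bounds error instead:
loss = F.cross_entropy(logits.cpu(), labels.cpu())
```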
This error could have many causes but is NOT common. My case is quite special: my model has some parameters that only contribute to the loss when the input satisfies some condition, which triggers the error about parameters not contributing to the loss. I tried forcing the loss to zero when those parameters did not contribute, and I also tried setting `find_unused_parameters=True`. Both got rid of the not-contributing-to-the-loss error, but resulted instead in an error about broken NCCL communication. Finally, I solved it by making all parameters always contribute to the loss, as shown in the sketch below.

In short, it seems to be OK for a model to have some parameters that never participate in the computation of the training loss (we can handle that by setting `find_unused_parameters=True`). But if those parameters are used in some iterations and not in others, NCCL communication will fail.
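A minimal sketch of that kind of fix (the module and condition here are hypothetical): when the conditional branch is skipped, route its parameters through the loss with a zero-weighted term so they appear in the autograd graph every iteration.

```python
import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)
        self.aux_head = nn.Linear(16, 1)  # only needed for some inputs

    def forward(self, x, use_aux: bool):
        feat = self.backbone(x)
        loss = feat.mean()
        if use_aux:
            loss = loss + self.aux_head(feat).mean()
        else:
            # Zero-weighted term: keeps aux_head in the graph every
            # iteration, so DDP's gradient reduction stays consistent.
            loss = loss + 0.0 * self.aux_head(feat).sum()
        return loss
```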
In multi-GPU distributed training, I once found that the rank 0 GPU takes much more memory than the others, which limits my batch size choice:
I finally found the solution here.
The fix was to revise my code from:

```python
import torch
from mmcv.runner import _load_checkpoint, load_state_dict

class MViT2(torch.nn.Module):
    def __init__(self):
        super().__init__()
        state_dict = _load_checkpoint("path_to.pth")
        load_state_dict(self, state_dict['model_state'])
```

to:

```python
import torch
from mmcv.runner import _load_checkpoint, load_state_dict

class MViT2(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # map_location keeps the checkpoint tensors on the CPU instead of
        # loading every rank's copy onto GPU 0 first.
        state_dict = _load_checkpoint(
            "path_to.pth", map_location=lambda storage, loc: storage)
        load_state_dict(self, state_dict['model_state'])
```
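For plain PyTorch checkpoints (without mmcv), the same idea applies; a minimal sketch with `torch.load`:

```python
import torch

# Load the checkpoint onto the CPU so that every rank does not first
# materialize its copy of the weights on GPU 0.
state_dict = torch.load("path_to.pth", map_location="cpu")
```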
My case is special: my multiprocessing task is at the validation stage, computing the evaluation metric. It runs fast during training, but slow when evaluating the results on their own. Specifically, when I run a complete training epoch, the validation is fast; but when I manually load the saved model outputs and test them with the same evaluation script, it is very slow (0.3 vs 1.5).
I found that the difference is that my library (mmaction2) sets up multiprocessing environment variables, which my manual evaluation script does NOT. After adding the Python code below to my evaluation.py, the problem was solved.
```python
import os

if 'OMP_NUM_THREADS' not in os.environ:
    os.environ['OMP_NUM_THREADS'] = str(1)
if 'MKL_NUM_THREADS' not in os.environ:
    os.environ['MKL_NUM_THREADS'] = str(1)
```
```
RuntimeError: The current installed version of g++ (8.4.0) is greater than the maximum required version by CUDA 10.2 (8.0.0). Please make sure to use an adequate version of g++ (>=5.0.0, <=8.0.0).
```
```bash
# sudo apt install g++-7
CXX=/usr/bin/g++-7 yourcommand
```
Of course, you have to install g++-7 first if there is no g++-7 on your PATH.
My case is that the version shown by `torch.__version__` differs from the torch version shown by `conda list`. No matter how many times I reinstall PyTorch, or even re-create the environment, `torch.__version__` is always `2.0`, even though I am sure that every time I have just run `conda install pytorch==1.5.1`.
The problem is that the new environment uses the system Python if no Python version was specified when the environment was created:
```bash
conda create -n test
conda activate test
which python
# /usr/bin/python
pip list
# ...
# torch    2.0.1
# ...
```
So it always uses the torch installed in the system rather than the one in the environment.
Solution: create env with specific Python version:
```bash
conda create -n test python=3.7
conda activate test
which python
# /home/louis/miniconda3/envs/test/bin/python
pip list
# empty
```
*It may be related to the operation `conda config --set auto_activate_base false`: I reinstalled conda, set `auto_activate_base` to false, and then encountered this problem.
Each time I open a terminal, the error below pops up:
```
WindowsPowerShell\profile.ps1 cannot be loaded because running scripts is disabled on this system. For more infor
```
The result is that I cannot switch environments or install packages in them correctly.
```powershell
conda init
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
```
This happens in distributed training:
```
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss ...
```
As the message says, there are parameters in your module that do not contribute to the loss, and this is not allowed in distributed training. To find out which specific parameters do not contribute to the loss, run:

```bash
TORCH_DISTRIBUTED_DEBUG=DETAIL python train.py
```
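If the unused parameters are expected (and consistently unused across iterations), wrapping the model with `find_unused_parameters=True` is the usual remedy; a minimal sketch assuming a `torchrun` launch:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group('nccl')  # assumes launch via torchrun
local_rank = int(os.environ['LOCAL_RANK'])
model = nn.Linear(16, 4).cuda(local_rank)

# find_unused_parameters=True lets DDP's reducer tolerate parameters that
# receive no gradient in an iteration (at some performance cost).
ddp_model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,
)
```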