Pytorch "NCCL error": unhandled system error, NCCL version 2.4.8
Problem description
I use PyTorch for distributed training of my model. I have two nodes, each with two GPUs, and I run this command on one node:
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpu 2 --num-machines 2 --machine-rank 0 --dist-url tcp://192.168.**.***:8000
and this on the other:
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpu 2 --num-machines 2 --machine-rank 1 --dist-url tcp://192.168.**.***:8000
However, the other node fails with a RuntimeError:
global_rank 3 machine_rank 1 num_gpus_per_machine 2 local_rank 1
global_rank 2 machine_rank 1 num_gpus_per_machine 2 local_rank 0
Traceback (most recent call last):
File "train_net.py", line 109, in <module>
args=(args,),
File "/root/detectron2_repo/detectron2/engine/launch.py", line 49, in launch
daemon=False,
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
while not spawn_context.join():
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
fn(i, *args)
File "/root/detectron2_repo/detectron2/engine/launch.py", line 72, in _distributed_worker
comm.synchronize()
File "/root/detectron2_repo/detectron2/utils/comm.py", line 79, in synchronize
dist.barrier()
File "/root/anaconda3/envs/PointRend/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1489, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
If I change --machine-rank 1 to --machine-rank 0, no error is reported, but then the training is not distributed. Does anyone know why this error occurs?
Recommended answer
A number of things can cause this issue; see for example 1, 2. Adding the lines
import os
os.environ["NCCL_DEBUG"] = "INFO"
to your script will log more specific debug information leading up to the error, giving you a more useful error message to search for.
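
As a sketch of an alternative (using the same launch command from the question), the variable can also be exported in the shell before starting training, so every spawned worker process inherits it:

export NCCL_DEBUG=INFO
python train_net.py --config-file configs/InstanceSegmentation/pointrend_rcnn_R_50_FPN_1x_coco.yaml --num-gpu 2 --num-machines 2 --machine-rank 1 --dist-url tcp://192.168.**.***:8000

With this set, each rank prints NCCL's initialization and transport-selection log to stdout, which usually narrows down whether the failure is a network-interface, firewall, or shared-memory problem.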