[Issue]: tried to use nn.DataParallel, however it crashed #1421
Comments
Hi @jdgh000, I was able to reproduce your issue and have opened an internal ticket for further investigation.
Hi @jdgh000, it looks like you are running on a laptop with integrated graphics. You can check if …
Thanks, let me know.
As @zichguan-amd mentioned, this has to do with the example being run on your APU rather than a dedicated graphics card. Correct me if I'm wrong, but I believe you're running on a 5900HX. Could you try running directly on your dGPU by adding this line at the top of your Python script?
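The suggested line itself was cut off in the page capture. A plausible sketch, assuming the intent was to restrict ROCm to the discrete GPU via `HIP_VISIBLE_DEVICES` (the ROCm analogue of `CUDA_VISIBLE_DEVICES`; the device index `0` is also an assumption):

```python
import os

# Hypothetical reconstruction: hide the integrated GPU so PyTorch only
# enumerates the discrete card. This must run before `import torch`,
# because the HIP runtime reads the variable at initialization.
os.environ["HIP_VISIBLE_DEVICES"] = "0"  # "0" assumed to be the dGPU index
```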
This is not an APU for sure; the CPU model I entered is wrong. The GPU is an MI250. Since the CPU model is not that important, I just typed the suggested value.
Name: AMD EPYC 7763 64-Core Processor
In that case, can you run with …
I saw the prompt and tried it a few times, but it does not seem to output much more than without it, with either TRACE or INFO.
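The exact variable requested was cut off in the capture; a guess, assuming the maintainer meant the standard NCCL logging knobs (which RCCL also honours), set before PyTorch initializes its communication backend:

```python
import os

# Assumption: the requested setting was NCCL_DEBUG; RCCL reuses NCCL's
# environment variables. Set these before importing torch so the comms
# backend picks them up at initialization.
os.environ["NCCL_DEBUG"] = "INFO"            # or "TRACE" for maximum verbosity
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"  # limit logging to init/collectives
```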
It seems to be failing in one of these: …
With …
It is already torch 2.6.1 and ROCm 6.2.4.
You said you reproduced it; shouldn't you be able to look into this instead of poking around blindly? I am not able to do the experimental steps at this point; I have reported enough for you to see it on your side. Secondly, your reasoning and logic here are very weak: you already saw it on your system, but later tried to attribute it to an APU. The claim that it is due to an APU is already negated by the fact that you reproduced it on your side. @zichguan-amd please don't have me try fruitless steps, i.e. debug environment variables and upgrading; that is just spinning the wheels. Please instead follow the reasoning and logic to address this issue!
We were only able to reproduce this issue when using integrated graphics, so we kindly ask you to provide more details in order for us to help you find a fix.
It is not working on ROCm. On an NVIDIA RTX GPU:
What made you think you were able to reproduce it only on integrated graphics? It does not say that anywhere above. I gave you all the relevant information: GPU model and ROCm version; you just ignored those and asked again. What you say here makes no sense, because you just changed the story, now saying it is only reproducible on integrated graphics. Could you paste the logs from both integrated and discrete GPUs? I don't think you can, because it makes no sense!
Problem Description
Ran the following example:
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html with a small modification, but it failed during the run:
If I wrap the model in nn.DataParallel the crash occurs; without it, the run works:
model = nn.DataParallel(model)
code:
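The actual script was not captured here; below is a minimal sketch based on the linked tutorial, with the `nn.DataParallel` wrapper the report identifies as the trigger. The model, shapes, and script structure are illustrative assumptions, not the reporter's exact code:

```python
import torch
import torch.nn as nn

# Toy stand-in for the tutorial's model (assumed, since the original
# snippet was not pasted in the issue).
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyModel()
if torch.cuda.device_count() > 1:
    # This is the wrapper the report says triggers the crash on ROCm;
    # it replicates the model across all visible GPUs per forward pass.
    model = nn.DataParallel(model)
model.to(device)

out = model(torch.randn(4, 10, device=device))
print(out.shape)  # torch.Size([4, 2])
```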
Operating System
rhel9
CPU
9500hx ryzen
GPU
mi250
ROCm Version
ROCm 6.2.0
ROCm Component
rccl
Steps to Reproduce
Run the example code with nn.DataParallel (actual code pasted in the problem description):
https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response