prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted). #6896

Open · fabiogeraci opened this issue Dec 19, 2024 · 0 comments
The setup is the same as in #6895 (comment).

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 218, in <module>
    run()
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 211, in run
    trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 196, in training_step
    output_tensor: torch.Tensor = super().training_step(model=model, inputs=inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 134, in compute_loss
    v: tp.Any = super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_mlm.py", line 42, in forward
    outputs: BaseModelOutputWithPoolingAndCrossAttentions = self.esm(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_base.py", line 154, in forward
    embedding_output: torch.Tensor = self.embeddings(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/layers/esm/embedding.py", line 137, in forward
    embeddings: torch.Tensor = self.word_embeddings(input_ids)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x147093af7b11 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

--------------------------------------------------------------------------
prterun noticed that process rank 7 with PID 0 on node farm22-gpu0304 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------
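For context: this `srcIndex < srcSelectDimSize` assertion inside `indexSelectLargeIndex` typically fires when `F.embedding` is handed a token id outside the embedding table, i.e. some `input_ids` value is negative or `>= num_embeddings`. A minimal CPU sketch of a pre-flight bounds check (the sizes and the `check_ids` helper are hypothetical, for illustration only; the real vocab size would come from the model's config):

```python
import torch
import torch.nn as nn

# Hypothetical vocab size for illustration; a real model would use
# something like model.config.vocab_size.
vocab_size = 32
emb = nn.Embedding(vocab_size, 8)

def check_ids(input_ids: torch.Tensor, num_embeddings: int) -> bool:
    """Return True when every token id is a valid row index for the table."""
    return bool((input_ids >= 0).all() and (input_ids < num_embeddings).all())

good_ids = torch.tensor([[0, 5, 31]])
bad_ids = torch.tensor([[0, 5, 32]])  # 32 overflows a 32-entry table (valid ids: 0..31)

assert check_ids(good_ids, emb.num_embeddings)
assert not check_ids(bad_ids, emb.num_embeddings)
```

Because CUDA kernels launch asynchronously, the Python traceback above may point past the real failure site; re-running with `CUDA_LAUNCH_BLOCKING=1` forces synchronous launches so the trace lands on the offending op.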