prterun noticed that process rank 7 with PID 0 on node gpu0304 exited on signal 6 (Aborted). #6896

Open · fabiogeraci opened this issue Dec 19, 2024 · 0 comments
The setup is the same as in #6895 (comment).

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 218, in <module>
    run()
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 211, in run
    trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 196, in training_step
    output_tensor: torch.Tensor = super().training_step(model=model, inputs=inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 134, in compute_loss
    v: tp.Any = super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_mlm.py", line 42, in forward
    outputs: BaseModelOutputWithPoolingAndCrossAttentions = self.esm(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_base.py", line 154, in forward
    embedding_output: torch.Tensor = self.embeddings(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/layers/esm/embedding.py", line 137, in forward
    embeddings: torch.Tensor = self.word_embeddings(input_ids)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x147093af7b11 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

--------------------------------------------------------------------------
prterun noticed that process rank 7 with PID 0 on node farm22-gpu0304 exited on
signal 6 (Aborted).
--------------------------------------------------------------------------
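For context: this `srcIndex < srcSelectDimSize` assertion inside `indexSelectLargeIndex` typically fires when `F.embedding` is handed a token id outside the embedding table, i.e. some `input_ids` value is negative or `>= num_embeddings`. A minimal CPU sketch of a pre-flight bounds check (the sizes and the `check_ids` helper are hypothetical, for illustration only; the real vocab size would come from the model's config):

```python
import torch
import torch.nn as nn

# Hypothetical vocab size for illustration; a real model would use
# something like model.config.vocab_size.
vocab_size = 32
emb = nn.Embedding(vocab_size, 8)

def check_ids(input_ids: torch.Tensor, num_embeddings: int) -> bool:
    """Return True when every token id is a valid row index for the table."""
    return bool((input_ids >= 0).all() and (input_ids < num_embeddings).all())

good_ids = torch.tensor([[0, 5, 31]])
bad_ids = torch.tensor([[0, 5, 32]])  # 32 overflows a 32-entry table (valid ids: 0..31)

assert check_ids(good_ids, emb.num_embeddings)
assert not check_ids(bad_ids, emb.num_embeddings)
```

Because CUDA kernels launch asynchronously, the Python traceback above may point past the real failure site; re-running with `CUDA_LAUNCH_BLOCKING=1` forces synchronous launches so the trace lands on the offending op.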