The setup is the same as in #6895 (comment).
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [949,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 218, in <module>
    run()
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/runner.py", line 211, in run
    trainer.train(resume_from_checkpoint=RESUME_FROM_CHECKPOINT)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 196, in training_step
    output_tensor: torch.Tensor = super().training_step(model=model, inputs=inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
    loss = self.compute_loss(model, inputs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/trainer/esm/mlm/trainer.py", line 134, in compute_loss
    v: tp.Any = super().compute_loss(model, inputs, return_outputs=return_outputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3338, in compute_loss
    outputs = model(**inputs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_mlm.py", line 42, in forward
    outputs: BaseModelOutputWithPoolingAndCrossAttentions = self.esm(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/models/esm/esm_base.py", line 154, in forward
    embedding_output: torch.Tensor = self.embeddings(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/nfs/users/nfs_f/fg12/repos/dna-mlm/src/dna_mlm/layers/esm/embedding.py", line 137, in forward
    embeddings: torch.Tensor = self.word_embeddings(input_ids)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank7]:[E ProcessGroupNCCL.cpp:1182] [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [Rank 7] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x1470de2a575f in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x1470de3c58a8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x6c (0x147093d993ac in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x147093d9d4c8 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x15a (0x147093da0bfa in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x147093da1839 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #9: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x1470de2f4d87 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xdf6b11 (0x147093af7b11 in /software/isg/users/fg12/envs/virtualenvs/dna-mlm-cGaHCcHr-py3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xe9213 (0x1470dda84213 in /software/spack_environments/default/00/opt/spack/linux-ubuntu22.04-x86_64_v3/gcc-11.4.0/gcc-13.1.0-qnriecrozbrvb5fu5mdep4sywj3jiyw3/lib64/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x1470df45cac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x1470df4ee850 in /lib/x86_64-linux-gnu/libc.so.6)

--------------------------------------------------------------------------
prterun noticed that process rank 7 with PID 0 on node farm22-gpu0304 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
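
The assertion `srcIndex < srcSelectDimSize` is raised inside `torch.embedding`, which usually means at least one value in `input_ids` is not a valid row index into the embedding table (for example, a tokenizer vocabulary larger than the model's embedding size). Whether that is the cause here is not confirmed by the log. A minimal sketch of a host-side check that could be run on a batch before the forward pass follows; the helper name `find_out_of_range_ids` and the sample values are hypothetical, and the function takes plain Python integers so it can be tried without a GPU.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    token_ids: Iterable[int], num_embeddings: int
) -> List[Tuple[int, int]]:
    """Return (position, id) pairs that are not valid row indices
    into an embedding table with `num_embeddings` rows."""
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if tok < 0 or tok >= num_embeddings
    ]


# Hypothetical values for illustration only: a table with 8 rows
# and a flattened batch containing two invalid ids (8 and -1).
vocab_size = 8
batch = [0, 3, 7, 8, -1, 5]
bad = find_out_of_range_ids(batch, vocab_size)
print(bad)  # [(3, 8), (4, -1)]
```

With PyTorch tensors, the same check could be fed from `input_ids.flatten().tolist()` and the embedding module's `num_embeddings` attribute. Running the failing step with the environment variable `CUDA_LAUNCH_BLOCKING=1`, or on CPU, typically gives a stack trace that points at the exact offending call.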