I am having an issue when using your code. If I try to resume training from a checkpoint with more than one GPU (I am using Docker containers), I get the following error:
File "__main__.py", line 55, in <module>
main(parser.parse_args())
File "__main__.py", line 39, in main
spawn(train_distributed, args=(replica_count, port, args, params), nprocs=replica_count, join=True)
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
while not context.join():
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/opt/diffwave/src/diffwave/learner.py", line 188, in train_distributed
_train_impl(replica_id, model, dataset, args, params)
File "/opt/diffwave/src/diffwave/learner.py", line 163, in _train_impl
learner.restore_from_checkpoint()
File "/opt/diffwave/src/diffwave/learner.py", line 95, in restore_from_checkpoint
checkpoint = torch.load(f'{self.model_dir}/{filename}.pt')
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 584, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 842, in _load
result = unpickler.load()
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 834, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 823, in load_tensor
loaded_storages[key] = restore_location(storage, location)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 174, in default_restore_location
result = fn(storage, location)
File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 156, in _cuda_deserialize
return obj.cuda(device)
File "/opt/conda/lib/python3.7/site-packages/torch/_utils.py", line 77, in _cuda
return new_type(self.size()).copy_(self, non_blocking)
File "/opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
But when starting from scratch, or when using a single GPU, this error does not appear and training runs flawlessly.
I should add that I checked that the GPUs were completely free when launching the training.
Any advice on this issue?
Thanks in advance.
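For context, the call that raises the error (per the traceback) is the torch.load in restore_from_checkpoint, which is invoked without a map_location, so each spawned replica tries to restore the checkpoint tensors onto the GPU the checkpoint was originally saved from. A minimal sketch of how the load could be remapped to each replica's own device follows; the checkpoint path and the replica_id value here are illustrative, not the repository's exact code:

import torch

# Hypothetical illustration for one spawned replica.
# Without map_location, torch.load restores CUDA tensors to the device they
# were saved from, so every replica would target the same (saving) GPU.
replica_id = 1  # e.g. the process index passed by spawn() to train_distributed
device = torch.device('cuda', replica_id)

# Remap all storages to this replica's GPU; map_location='cpu' followed by
# moving the model to `device` would also work.
checkpoint = torch.load('weights.pt', map_location=device)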
Somehow multiple processes end up on a single GPU, but only on one of them...
I am launching the training on 4 GPUs. Three of the GPUs each run a single process, but one of them runs 4 processes.
When changing to 3 GPUs, the same thing happens: one of the GPUs runs 3 processes.
I cannot find the bug in the code that causes this behaviour...
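A quick way to check where each replica's tensors actually land (a hypothetical diagnostic, not part of the repository) would be to log the parameter device and per-GPU allocated memory from inside each spawned process, right after the checkpoint is restored:

import os
import torch

def report_devices(replica_id, model):
    # Hypothetical helper: print which GPU this replica's parameters sit on
    # and how much memory each visible device has allocated in this process.
    param_device = next(model.parameters()).device
    print(f'[replica {replica_id}] pid={os.getpid()} params on {param_device}')
    for i in range(torch.cuda.device_count()):
        mib = torch.cuda.memory_allocated(i) / 2**20
        print(f'[replica {replica_id}] cuda:{i} allocated {mib:.1f} MiB')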