When I run

python -u tutorial_train.py --lr 2e-5 --gpus 8 --batch_size 12

while logged into a SLURM node directly, the job starts on all 8 GPUs and each GPU writes log images.
But when I run the job with sbatch using a script like this:
I only see log images from one GPU, and I don't see log messages like this:

Do I need to do anything extra to launch DeepSpeed, beyond a normal sbatch script?
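For context, a minimal single-node 8-GPU sbatch script might look like the sketch below. The job name, partition, and time limit are placeholders, not taken from my setup; the detail that seems to matter is that `--ntasks-per-node` matches the number of GPUs, since multi-process launchers typically read the `SLURM_*` variables set by `srun` to decide how many processes to spawn.

```shell
#!/bin/bash
#SBATCH --job-name=tutorial_train   # placeholder job name
#SBATCH --partition=gpu             # placeholder partition
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=8         # one SLURM task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=24:00:00             # placeholder time limit

# srun starts 8 processes and exports the SLURM_* environment
# variables that the training framework uses to size the process group
srun python -u tutorial_train.py --lr 2e-5 --gpus 8 --batch_size 12
```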