When I use multiple GPUs to train a model, why are the detailed training logs missing? (YOLOv8: does multi-GPU training lose logs?) #13492
👋 Hello @BGMer7, thank you for your interest in YOLOv5 🚀! For your question regarding multi-GPU training with YOLOv8, it's important to note that this repository is specifically for YOLOv5. However, since YOLOv8 shares some similarities, we may be able to assist you here!

If this is a 🐛 Bug Report, could you please provide a minimum reproducible example (MRE), including all necessary code and steps to replicate the issue? This will help us investigate the behavior further. For custom training ❓ Questions, sharing additional details, such as setup specifics, logs, or configurations, would assist in diagnosing the issue. It may also be worth checking whether your multi-GPU setup affects logging, for example by adjusting verbosity or distributed training options.

Requirements
YOLOv5 and YOLOv8 require Python>=3.8.0 with all required dependencies installed. Ensure your environment is configured correctly and up to date.

Environments
Both YOLOv5 and YOLOv8 can be run in various environments, such as local setups, cloud-based GPUs, or Docker images. Verify that your current setup matches the recommended configurations, including ensuring that all GPUs are properly initialized and recognized.

Status
If you receive training logs with a single GPU but they go missing in a multi-GPU setup, it is possible that output redirection or distributed training settings are affecting the logs. When training on multiple GPUs, frameworks like PyTorch may change how and where logs are written.

This is an automated response 🛠️, but rest assured, an Ultralytics engineer will review your issue and provide further assistance soon. Let us know if you can provide any additional information in the meantime that might help clarify this behavior! 😊
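As a concrete illustration (not part of the original reply): under PyTorch DDP each GPU runs its own process, and logging is typically gated on the rank-0 process, which is one way per-epoch console output can seem to vanish. A minimal sketch, assuming a plain PyTorch distributed setup:

```python
import os
import torch.distributed as dist

def log_rank_zero(message: str) -> None:
    """Print only from the primary (rank 0) process under DDP."""
    if dist.is_initialized():
        rank = dist.get_rank()
    else:
        rank = int(os.environ.get("RANK", "0"))  # RANK is set by torchrun / DDP launchers
    if rank == 0:
        print(message)

log_rank_zero("epoch 1/20: box_loss=1.23, mAP50=0.62")  # hypothetical metric line
```

If each worker's output is also redirected (e.g. to per-rank log files), the metrics still exist but no longer appear in the notebook's stdout.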
Here I attach some more details:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
```

```
!source ~/.bashrc
```

otherwise, when using …
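One detail worth noting here (my addition, not from the thread): `CUDA_VISIBLE_DEVICES` only takes effect if it is set before the framework first initializes CUDA. A quick verification sketch:

```python
import os

# Set before importing torch; once CUDA has been initialized,
# changing this variable no longer affects the visible devices.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch

print(torch.cuda.device_count())  # expect 2 on a dual-GPU Kaggle instance
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```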
@BGMer7 thank you for providing additional details. Setting `CUDA_VISIBLE_DEVICES="0,1"` should make both GPUs visible to the training process.

If the issue persists, please confirm that your DDP setup and training script align with Ultralytics' multi-GPU training guide.
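For reference, the documented way to request both GPUs with the Ultralytics API is to pass a device list (or a comma-separated string); a minimal sketch with placeholder model and dataset names:

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint

# A list (or the string '0,1') makes Ultralytics spawn one DDP process
# per GPU; per-epoch console output then comes from rank 0 only.
results = model.train(data="coco8.yaml", epochs=3, device=[0, 1])
```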
@pderrenger Thanks for your reply!

```python
import pycuda.driver as cuda  # assumed import: cuda.init() / cuda.Device.count() below match pycuda's API

def train_model(self):
    """
    Initialize and train a YOLOv8 model with the specified parameters.

    Returns:
        training results
    """
    cuda.init()
    device_count = cuda.Device.count()
    device = ','.join(str(i) for i in range(device_count)) if device_count > 0 else 'cpu'
    # Initialize wandb logging before training
    # wandb.init(project="yolo_training", config={"epochs": 20, "batch_size": 16})
    results = self.model.train(
        data=DATA_YAML,        # data.yaml file from Roboflow
        epochs=20,             # number of epochs
        imgsz=640,             # image size
        batch=32,              # batch size
        name='yolov8_custom',  # folder name for training results
        device=device,         # e.g. '0,1' for both GPUs, 'cpu' for CPU
        patience=50,           # early stopping patience
        save=True,             # save best model
        pretrained=True,       # use pretrained weights
        plots=True,            # save training plots
        cache=True,            # enable caching
        verbose=True,          # enable verbose logging
        workers=32             # dataloader worker processes
    )
    return results
```

As for `accelerate`, I didn't use it in the end. Originally I had:

```
!pip show accelerate
!pip install git+https://github.com/huggingface/accelerate
```

But later I found my code could run without it, so I deleted this.

Thanks for the reminder, I found the logs on the final results page in Kaggle, though the output during the run didn't contain that part. And this is the whole code in Kaggle:

Thanks! :)
It seems your training process is functioning as intended, with logs available on the Kaggle results page even though they are not displayed during runtime. This behavior is typical of multi-GPU training, as certain logging outputs may only appear in the final results or in the primary process.

To ensure consistent logging and optimal use of resources, you can explore the official Multi-GPU Training Guide. Let us know if you need more support!
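Regardless of what the console shows, Ultralytics also writes per-epoch metrics to a `results.csv` inside the run directory. A small recovery sketch, assuming the default project layout and the `name='yolov8_custom'` run from above:

```python
import pandas as pd

# Default location for a detection run named 'yolov8_custom';
# adjust the path if a custom project directory was used.
df = pd.read_csv("runs/detect/yolov8_custom/results.csv")
df.columns = df.columns.str.strip()  # column headers may be padded with spaces

print(df[["epoch", "metrics/mAP50(B)"]].tail())  # per-epoch metrics survive even without console logs
```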
Search before asking
Question
I used YOLOv8 to train a model to extract facial features; the task is here: https://www.kaggle.com/datasets/osmankagankurnaz/facial-feature-extraction-dataset

Actually, it is not a fatal error, just a small question, but there is no YOLOv8 repo, so I posted here for help.

Kaggle provides 2 Tesla P4 GPUs for acceleration. Many people just use device='0' so that only one GPU is in use; I make full use of both GPUs, and it works. The only difference, and the problem, is that the detailed training logs are missing.
This is the code which uses only one GPU:
and this is the code which uses two GPUs:
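(The two snippets above were attached as images in the original issue; a minimal sketch of the two variants, assuming the standard Ultralytics API, is below.)

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights

# Single-GPU run: per-epoch metrics are printed to the console.
model.train(data="data.yaml", epochs=20, device="0")

# Dual-GPU run: Ultralytics launches DDP worker processes, and the
# per-epoch console output may only appear in the rank-0 process.
model.train(data="data.yaml", epochs=20, device="0,1")
```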
With a single GPU, the run produces detailed log output, with metrics printed after every epoch; with multiple GPUs, only the final model produces output, and there are no logs for the intermediate training process.
This is the training log with only one GPU; the output is very detailed, and it printed the metrics after every epoch.
But when switching to 2 GPUs, the output only contains the results of the final model.
Has anyone met this before?
Also, I find that when I use only one GPU the API seems to invoke TensorFlow, but with 2 GPUs it invokes PyTorch.
Additional
No response