[BUG] Invalidate trace cache @ step 10: expected module 11, but got module 19 #6870
Comments
same issue.
Same here
same issue
@liranringel and @DW934 can you share full repro steps?
model
model_name_or_path: /home/models/qwen25_32B_lora
method
stage: dpo
dataset
dataset: wmtbio24
output
output_dir: /home/models/qwen25_32B-lora-dpo-1epoch-bs1-half-pref
train
per_device_train_batch_size: 1
eval
val_size: 0.1
flash_attn: auto
Describe the bug
I'm training Llama-3.1-70B-SFT with DPO using LoRA, with ZeRO-3 enabled. The training log consistently prints "Invalidate trace cache @ step 10: expected module 11, but got module 19" and then gets stuck at that line.
The same training configuration works fine with 7B models, completely bug-free.
Hardware
8 × A100 (80 GB)
Deepspeed Config
{
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_gather_16bit_weights_on_model_save": true,
"stage3_prefetch_bucket_size": 0,
"stage3_max_live_parameters": 0,
"stage3_max_reuse_distance": 0
}
}
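For context, here is a minimal sketch (not the reporter's actual script; the config file name and output path are assumptions) of how a DeepSpeed ZeRO-3 JSON config like the one above is typically passed to a Hugging Face training run, which is where the "auto" fields get resolved:

from transformers import TrainingArguments

# Hypothetical paths for illustration; the JSON above is assumed to be saved as ds_zero3.json.
training_args = TrainingArguments(
    output_dir="./dpo-output",          # hypothetical output directory
    per_device_train_batch_size=1,      # matches the reporter's batch size
    bf16=True,                          # resolves the "auto" value in the bf16 block above
    deepspeed="ds_zero3.json",          # path to the ZeRO-3 config shown above
)
# Passing these arguments to a Trainer (e.g. TRL's DPOTrainer) runs training
# under DeepSpeed ZeRO-3 with the settings from the JSON file.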