Right way to load Lora checkpoint to further training #1968
Unanswered · BigDataMLexplorer asked this question in Q&A
Replies: 1 comment
- Hmm, I'm not very familiar with the …
@BenjaminBossan
Hello, can I ask you please how I should specify the Trainer if I want to continue training from a LoRA checkpoint, with the same training arguments the model was trained with before the checkpoint?
Here is my example.

First training:

Now I load the model from the checkpoint:
PeftModel.from_pretrained(model, checkpoint_path, is_trainable=True)
I need to specify the Trainer again if I want to continue from the checkpoint and use
trainer.train(resume_from_checkpoint)
What do I have to write in the Trainer? All the arguments like in the first training?
I am asking because I want the Trainer to continue with exactly the same arguments it had when training was stopped. For example, I want to continue with the learning-rate value from the checkpoint, which decreased linearly during training. Will the learning rate be overwritten when I write the LR again in the Trainer? When I pass the same values for the learning rate and the other settings again, what will happen?
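To make the learning-rate part concrete: a Trainer checkpoint folder contains a trainer_state.json (alongside optimizer.pt and scheduler.pt), and the last logged learning rate can be read from it; resuming restores the optimizer/scheduler state from those files. A minimal stdlib-only sketch, with a mocked checkpoint directory and made-up values:

```python
import json
import pathlib
import tempfile

# Mock checkpoint dir: a real HF Trainer checkpoint contains trainer_state.json
# plus optimizer.pt and scheduler.pt; the step and LR values here are invented.
ckpt = pathlib.Path(tempfile.mkdtemp()) / "checkpoint-500"
ckpt.mkdir(parents=True)
(ckpt / "trainer_state.json").write_text(json.dumps({
    "global_step": 500,
    "log_history": [{"step": 500, "learning_rate": 1.2e-05, "loss": 0.8}],
}))

# Reading back the saved state: when resuming, the decayed learning rate
# continues from the value stored with the checkpoint, not from scratch.
state = json.loads((ckpt / "trainer_state.json").read_text())
last = state["log_history"][-1]
print(last["step"], last["learning_rate"])
```

Inspecting this file after stopping a run is a quick way to check which step and learning rate a resume would pick up from.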
And on the other hand, if I want to change the number of eval and save steps, is that OK?
Conclusion:
1) I just want to continue the training where it stopped and change the eval and save steps.
2) Is it not enough to pass only the model and the training and eval datasets to the Trainer again?
If so, what will happen to the other stable arguments like fp16, weight decay and batch size?
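In code, what I have in mind is roughly this. It is only a sketch: base_model, train_dataset and eval_dataset stand for the same objects from the first run, the path and all hyperparameter values are placeholders, and the argument names are the standard transformers/peft ones:

```python
# Hypothetical resume setup -- placeholder values, not my real run.
from peft import PeftModel
from transformers import Trainer, TrainingArguments

checkpoint_path = "outputs/checkpoint-500"  # placeholder path

# Load the LoRA adapter from the checkpoint in trainable mode.
model = PeftModel.from_pretrained(base_model, checkpoint_path, is_trainable=True)

# Same "stable" arguments as the first run (learning rate, fp16,
# weight decay, batch size, ...), but with new eval/save steps.
training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=2e-4,              # same value as the first run
    weight_decay=0.01,
    fp16=True,
    per_device_train_batch_size=8,
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=200,                  # changed from the first run
    save_steps=200,                  # changed from the first run
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Resume: optimizer and scheduler state are loaded from the checkpoint.
trainer.train(resume_from_checkpoint=checkpoint_path)
```

Is this the right shape, and do the repeated values (learning rate etc.) get overridden by the checkpoint state on resume?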
Thank you very much