Resuming training from a checkpoint for the Lorenz example on AI Studio (2) #1039

Open
xiaoniv opened this issue Dec 6, 2024 · 1 comment

xiaoniv commented Dec 6, 2024

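Bug description

I got the Lorenz example running on AI Studio and wanted to try resuming training from a checkpoint. I ran 10 epochs, then pressed stop and exited the codelab. I then re-entered the codelab, installed paddlescience, and re-ran the code that precedes training. This time I wanted to use save_load to load the model and optimizer state, and then initialize the solver with them. I ran the following code: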

import ppsci
import ppsci.utils.save_load as save_load

OUTPUT_DIR = "./output/lorenz_transformer"
# Prefix of the checkpoint files (latest.pdopt holds the optimizer state)
checkpoint_dir = f"{OUTPUT_DIR}/checkpoints/latest"

save_load.load_checkpoint(
    checkpoint_dir,
    model,  # Your model
    optimizer  # Your optimizer
)

solver = ppsci.solver.Solver(
    model=model,  # Will use loaded weights
    constraint=constraint,
    output_dir=OUTPUT_DIR,
    optimizer=optimizer,  # Will use loaded state
    lr_scheduler=lr_scheduler,
    eval_during_train=True,
    eval_freq=50,
    validator=validator,
    visualizer=visualizer,
)

solver.train() 

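This fails with AssertionError: Optimizer set error, layer_norm_1.w_0_moment1_0 should in state dict.
I inspected latest.pdopt, and the layer_norm entries there start at 21, for example layer_norm_21.w_0_moment1_0.
Could you help me work out why the optimizer state does not match? Thanks.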
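
A quick way to confirm this kind of mismatch is to compare the layer indices that appear in the saved optimizer keys with the ones the freshly built optimizer expects. The sketch below is plain Python; the helper `index_ranges` and both key lists are hypothetical, with names shaped like the ones quoted in the error above.

```python
import re
from collections import defaultdict


def index_ranges(keys):
    """Group keys such as 'layer_norm_21.w_0_moment1_0' by layer type and
    return the smallest and largest layer index seen for each type."""
    pattern = re.compile(r"^([A-Za-z_]+?)_(\d+)\.")
    seen = defaultdict(list)
    for key in keys:
        match = pattern.match(key)
        if match:  # keys without a '<layer>_<index>.' prefix are skipped
            seen[match.group(1)].append(int(match.group(2)))
    return {name: (min(idx), max(idx)) for name, idx in seen.items()}


# Hypothetical key names, for illustration only.
saved_keys = [
    "layer_norm_21.w_0_moment1_0",
    "layer_norm_22.w_0_moment1_0",
    "linear_40.w_0_moment1_0",
]
expected_keys = [
    "layer_norm_1.w_0_moment1_0",
    "layer_norm_2.w_0_moment1_0",
    "linear_0.w_0_moment1_0",
]

print("saved:   ", index_ranges(saved_keys))
print("expected:", index_ranges(expected_keys))
```

In a real session the saved keys would presumably come from the dictionary stored in latest.pdopt and the expected keys from the new optimizer's own state; how those are obtained is left out here because it depends on the Paddle version in use.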

Additional information

No response

@zhiminzhang0830
Collaborator

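After investigating, this looks like a notebook issue. If you need to resume training, copy the relevant code into a .py file and run that file from a terminal. Also, Solver itself supports resuming from a checkpoint: just set the checkpoint_path parameter when initializing Solver.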
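
One plausible reading of the "notebook issue", not confirmed in this thread, is that layer names are assigned from a counter that lives for the whole process. In a long-running notebook kernel, every re-run of the model-building cell advances that counter, so the names saved in the checkpoint carry larger indices than the names produced in a fresh session. The toy model below illustrates only that idea; `NameCounter` and `build_model` are hypothetical stand-ins, not Paddle or PaddleScience code.

```python
from collections import defaultdict


class NameCounter:
    """Toy stand-in for a per-process unique-name generator (hypothetical)."""

    def __init__(self):
        self._next = defaultdict(int)

    def generate(self, prefix):
        name = f"{prefix}_{self._next[prefix]}"
        self._next[prefix] += 1
        return name


def build_model(counter, num_layer_norms=2):
    """Pretend to build a model: each layer norm asks the counter for a name."""
    return [counter.generate("layer_norm") for _ in range(num_layer_norms)]


# Notebook kernel: the model-building cell is run three times in one process.
notebook = NameCounter()
for _ in range(3):
    names_in_notebook = build_model(notebook)

# Fresh process (e.g. a .py file run from a terminal): the model is built once.
script = NameCounter()
names_in_script = build_model(script)

print(names_in_notebook)  # ['layer_norm_4', 'layer_norm_5']
print(names_in_script)    # ['layer_norm_0', 'layer_norm_1']
```

Under this assumption, running both the original training and the resumed training as a script keeps the names aligned, because each process builds the model exactly once before saving or loading optimizer state.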
