How to load model parameters or resume training (a ViT model trained with the trainer and hybrid parallelism) #3721
Unanswered
stonewjf asked this question in Community | Q&A
Replies: 1 comment · 1 reply
-
Hi @stonewjf, what code are you using? How can we reproduce your issue?
-
Following the example in the tutorial, I used the code below to load the parameters, and it raised an error:
from colossalai.utils import load_checkpoint
load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
The error is as follows:
Traceback (most recent call last):
  File "train_with_trainer.py", line 143, in <module>
    train_imagenet()
  File "train_with_trainer.py", line 96, in train_imagenet
    load_checkpoint('./checkpoints/checkpoint0002.pth', model, optimizer, lr_scheduler)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 234, in load_checkpoint
    model_state = partition_pipeline_parallel_state_dict(model, model_state)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 133, in partition_pipeline_parallel_state_dict
    _send_state_dict(state_dict, gpc.get_next_global_rank(ParallelMode.PIPELINE), ParallelMode.PIPELINE)
  File "/home/haida_huanglei/anaconda3/envs/colossalai/lib/python3.8/site-packages/colossalai/utils/checkpointing.py", line 99, in _send_state_dict
    state_tensor, state_size = dist.distributed_c10d._object_to_tensor(state_dict)
TypeError: _object_to_tensor() missing 1 required positional argument: 'device'
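The last frame shows ColossalAI's checkpointing utility calling the private PyTorch helper dist.distributed_c10d._object_to_tensor(state_dict) with a single argument, while the installed PyTorch requires a device argument; the signature of this private helper changed between PyTorch releases, so this looks like a version mismatch between the ColossalAI release and the installed torch rather than a bug in the training script. Below is a minimal compatibility-shim sketch under that assumption; the name _object_to_tensor_compat and the device default are illustrative, and pinning PyTorch to the version the ColossalAI release targets (or upgrading ColossalAI) is the cleaner route than patching a private API.

```python
# Illustrative sketch only (not an official fix from this thread): shim the
# private PyTorch helper so that ColossalAI's single-argument call keeps
# working on a torch build whose _object_to_tensor requires a `device`.
import torch
import torch.distributed.distributed_c10d as c10d

_orig_object_to_tensor = c10d._object_to_tensor


def _object_to_tensor_compat(obj, device=None, *args, **kwargs):
    # Hypothetical default: use the current CUDA device when available,
    # otherwise fall back to CPU, mirroring what the one-argument call implied.
    if device is None:
        device = (torch.device("cuda", torch.cuda.current_device())
                  if torch.cuda.is_available() else torch.device("cpu"))
    return _orig_object_to_tensor(obj, device, *args, **kwargs)


# ColossalAI looks the helper up on this module object at call time, so the
# patched version is the one it will invoke.
c10d._object_to_tensor = _object_to_tensor_compat
```

If you go this route, the patch has to run before load_checkpoint is reached (for example, near the top of train_with_trainer.py, right after initialization), otherwise the original helper is still the one being called.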