Basically, take a transformer model and num_pipeline_stage as arguments, then divide the module like this:
The first stage and the last stage must include the embedding layer and lm_head, respectively.
The transformer blocks in between should be divided evenly across the remaining stages.
For example: if we have [embedding layer] > [8 x transformer blocks] > [language model head], and we want to shard them into 5 pipeline stages:
The first partition includes the embedding layer and the first block.
The 3 partitions in between each consist of 2 transformer blocks.
The last partition includes the language model head and the last block.
The goal is to arrange the first and the last pipeline stages so they do not become bottlenecks in terms of training speed, while all stages in between are distributed evenly to balance the computation.
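A minimal sketch of this partitioning scheme, assuming the caller can pass the embedding layer, the ordered list of transformer blocks, and the lm_head separately; the function and argument names here are hypothetical, not from any particular framework:

```python
from typing import Any, List


def partition_pipeline(
    embedding: Any,
    blocks: List[Any],
    lm_head: Any,
    num_pipeline_stages: int,
) -> List[List[Any]]:
    """Hypothetical helper: split a decoder-only transformer into pipeline stages.

    The first stage gets the embedding plus the first block, the last stage
    gets the last block plus the lm_head, and the remaining blocks are spread
    as evenly as possible over the middle stages.
    """
    assert num_pipeline_stages >= 3, "need a first, at least one middle, and a last stage"
    assert len(blocks) >= num_pipeline_stages, "need at least one block per stage"

    stages: List[List[Any]] = [[] for _ in range(num_pipeline_stages)]

    # First stage: embedding + first block; last stage: last block + lm_head.
    stages[0] = [embedding, blocks[0]]
    stages[-1] = [blocks[-1], lm_head]

    # Distribute the remaining blocks evenly across the middle stages.
    middle_blocks = blocks[1:-1]
    num_middle = num_pipeline_stages - 2
    base, extra = divmod(len(middle_blocks), num_middle)
    start = 0
    for i in range(num_middle):
        count = base + (1 if i < extra else 0)
        stages[1 + i] = middle_blocks[start:start + count]
        start += count
    return stages


# The example from above: 8 transformer blocks split into 5 stages.
print(partition_pipeline("emb", [f"block{i}" for i in range(8)], "lm_head", 5))
# [['emb', 'block0'], ['block1', 'block2'], ['block3', 'block4'],
#  ['block5', 'block6'], ['block7', 'lm_head']]
```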