Enable sequence packing with FlashAttention-2 #41
Currently, datasets are prepared for caching using `transformer_lens`'s `tokenize_and_concatenate()`. This is problematic because sequences are concatenated as-is, with no special handling to avoid cross-sequence attention contamination. In addition, sequences are separated by EOS tokens, which is not ideal when training SAEs.

An alternative would be to have one sequence per sample in each batch, but this requires padding, which wastes GPU resources and is thus sub-optimal.
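For reference, the current caching path looks roughly like the sketch below (dataset and model names are illustrative placeholders, not necessarily what this repo uses):

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from transformer_lens.utils import tokenize_and_concatenate

# Concatenate-and-chunk tokenization: documents are joined end to end and
# split into fixed-length rows, so tokens can attend across document
# boundaries unless extra masking is applied.
dataset = load_dataset("NeelNanda/pile-10k", split="train")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
tokenized = tokenize_and_concatenate(
    dataset,
    tokenizer,
    max_length=2048,
    column_name="text",
    add_bos_token=True,
)
```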
This PR adds the ability to "pack" sequences together, meaning that the sequences in a batch are concatenated into a single long sample containing all of them, avoiding padding. To avoid attention contamination, FlashAttention-2 is used with a `position_ids` argument, which bypasses the need to materialize attention masks in memory (which is impractical). Using FA2 also brings a speed boost, which is always welcome.

For additional details, see: https://huggingface.co/blog/packing-with-FA2
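Conceptually, packing boils down to concatenating the token ids of a batch and supplying position ids that restart at zero at each sequence boundary; FA2 recovers the boundaries from those resets. A minimal sketch (the commented-out forward pass is hypothetical and assumes a model loaded with FA2):

```python
import torch

def pack_sequences(seqs: list[list[int]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Concatenate token sequences into one packed row and build matching
    position_ids that restart at 0 for every sequence. FlashAttention-2 can
    infer the sequence boundaries from these resets, so no block-diagonal
    attention mask has to be materialized."""
    input_ids = torch.tensor([tok for seq in seqs for tok in seq])
    position_ids = torch.cat([torch.arange(len(seq)) for seq in seqs])
    return input_ids.unsqueeze(0), position_ids.unsqueeze(0)

# Toy example with three short "sequences" of token ids.
input_ids, position_ids = pack_sequences([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
# position_ids -> [[0, 1, 2, 0, 1, 0, 1, 2, 3]]

# Hypothetical forward pass (model loaded with attn_implementation="flash_attention_2"):
# out = model(input_ids=input_ids.to(model.device),
#             position_ids=position_ids.to(model.device))
```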
Currently, FA2 with `position_ids` is only implemented for some models in `transformers`. I'm working on a patch to bring it to more models, specifically GPT-NeoX-based models (e.g., Pythia) and GPT-2. Until it's upstreamed, this PR uses my fork of the library.
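For context, enabling FA2 on a supported model in `transformers` looks roughly like this (the model name is a placeholder; FA2 requires a CUDA GPU and fp16/bf16 weights):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example only: request the FlashAttention-2 kernels when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")
```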
Things done

- `load_dataset()`
- `load_tokenized_data()`
- `transformers`