TE integration via full TransformerLayer #1297

tf-nv · 2024-09-30T12:59:13Z

This PR sketches how the attention selection mechanism ("global", "flash", and now "TE") can be used to plug in the high-level TransformerLayer from TransformerEngine (TE). It is a prototype meant to show that integration with DeepSpeed is possible and what performance to expect.

Things that work:

Training a 22B GPT2-style model on multiple DGX H100 nodes with ZeRO stage 1 and TP2 (BF16)
TE attention TFLOPS are 5% higher than flash attention in BF16, and 70% higher in FP8 (for the 22B model)
Activation checkpointing from TE

Many aspects are hardcoded, e.g. RoPE and activation checkpointing cannot be reconfigured from the config files. #1282 is much more elaborate in that it exposes TE layers at a much lower level. Meanwhile, this PR could serve as a benchmark showing what is possible with TE on a classic GPT2-style network.

I kept the implementation as minimal as possible, so there is room for further performance gains depending on the workload, e.g. through sequence parallelism and different memory layouts.

The Dockerfile now uses a later NGC PyTorch container and installs a later DeepSpeed tag from source for compatibility.

Quentin-Anthony · 2024-12-12T19:13:36Z

Will merge this after the fine-grained TE PR.

Quentin-Anthony · 2024-12-19T23:06:24Z

megatron/model/transformer.py

+                                "The mask will be discarded")
+        hidden_states, attention_mask = args
+
+        fp8_format = Format.HYBRID  # E4M3 during forward pass, E5M2 during backward pass


This should instead take the format from the new te_fp8_format neox_arg rather than hardcoding it.
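A minimal sketch of what reading the format from configuration could look like. The helper name `resolve_fp8_format` and the accepted strings are hypothetical and not part of this PR; only the `Format` enum and its `E4M3`, `E5M2`, and `HYBRID` members come from TransformerEngine. A stand-in enum is defined when TE is not installed so the snippet runs on its own.

```python
from enum import Enum

try:
    # Real enum, available when TransformerEngine is installed.
    from transformer_engine.common.recipe import Format
except ImportError:
    # Stand-in with the same member names so the sketch runs without TE.
    class Format(Enum):
        E4M3 = "e4m3"
        E5M2 = "e5m2"
        HYBRID = "hybrid"


def resolve_fp8_format(name):
    """Map a config string such as "hybrid" to a Format member.

    Hypothetical helper: the string would come from a te_fp8_format setting.
    """
    try:
        return Format[name.upper()]
    except KeyError:
        valid = ", ".join(member.name.lower() for member in Format)
        raise ValueError(
            f"unknown te_fp8_format {name!r}; expected one of: {valid}"
        ) from None


# Replaces the hardcoded `fp8_format = Format.HYBRID` from the diff above.
fp8_format = resolve_fp8_format("hybrid")
```

Failing early with the list of valid names keeps a typo in the config from silently falling back to a default format.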

Quentin-Anthony · 2024-12-19T23:07:49Z

megatron/model/gpt2_model.py

@@ -271,6 +272,24 @@ def init_specs(self):
                        layer_number=i,
                    )
                )
+            elif layer_type in ["TE"]:


Needs to be tested with PP and TP, since we'd be relying on two external codebases (DeepSpeed for PP, TE for TP) whose topologies probably don't play nicely together.
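One way to frame such a test is to write down the rank groups that one fixed layout implies and compare them with the groups each library actually builds. The sketch below is pure Python and assumes a layout where the tensor-parallel index varies fastest and the pipeline stage slowest; that layout and the function name `parallel_groups` are assumptions for illustration, not something this PR or either library is confirmed to use.

```python
def parallel_groups(world_size, pp, tp):
    """Enumerate TP and PP rank groups for an assumed rank layout.

    Assumed layout: rank = (pipe_stage * dp + data_idx) * tp + tensor_idx,
    i.e. tensor-parallel ranks are adjacent and pipeline stages are
    separated by dp * tp ranks.
    """
    if world_size % (pp * tp) != 0:
        raise ValueError("world_size must be divisible by pp * tp")
    dp = world_size // (pp * tp)

    # Adjacent ranks share a tensor-parallel group.
    tp_groups = [
        list(range(start, start + tp)) for start in range(0, world_size, tp)
    ]

    # Ranks at the same offset within each stage form a pipeline group.
    stage_size = dp * tp
    pp_groups = [
        [offset + stage * stage_size for stage in range(pp)]
        for offset in range(stage_size)
    ]
    return tp_groups, pp_groups


tp_groups, pp_groups = parallel_groups(world_size=8, pp=2, tp=2)
print(tp_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(pp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

If the process groups handed to TE and the ones the pipeline engine uses do not match a single consistent enumeration like this, collectives in the two codebases would be operating on different sets of ranks.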

Quentin-Anthony · 2024-12-19T23:10:55Z

Dockerfile

+
+RUN DS_BUILD_FUSED_LAMB=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_TRANSFORMER=1 DS_BUILD_STOCHASTIC_TRANSFORMER=1  DS_BUILD_UTILS=1 \
+    TORCH_CUDA_ARCH_LIST="8.0 9.0+PTX" \
+    python -m pip install git+https://github.com/microsoft/[email protected]


This should probably be the latest DeeperSpeed instead. The arch list also can't be hardcoded.

TE integration

4894d2b

tf-nv requested a review from Quentin-Anthony as a code owner September 30, 2024 12:59

Quentin-Anthony reviewed Dec 19, 2024
