fix: RuntimeError for UCP large DP #6918

Open · wants to merge 2 commits into master
Conversation

saforem2
Collaborator

We encountered a strange bug when attempting to convert checkpoints (created with DP=768) to universal format.

An overview of the bug as well as a detailed description of the proposed fix is written up in:

argonne-lcf/Megatron-DeepSpeed/ALCF/notes/universal_checkpoint_bug.md

@loadams requested a review from lekurile · December 30, 2024 17:54
@saforem2
Collaborator Author

@loadams thanks for the formatting fix!

Also, just wanted to say there's no rush on this. I spoke briefly with @minjiazhang before the holidays about this issue and mentioned that I would write up a more complete description of what I was seeing.

A few minor thoughts:

I think the change in deepspeed/checkpoint/deepspeed_checkpoint.py, i.e. passing the strip_tensor_paddings argument through to the self.zero_checkpoint.get_state_for_rank call (shown below):

```diff
-    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+    def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True):
         return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
                                                        tp_index=tp_index,
                                                        dp_index=dp_index,
-                                                       keys_to_ignore=[PARAM_SHAPES])
+                                                       keys_to_ignore=[PARAM_SHAPES],
+                                                       strip_tensor_paddings=strip_tensor_paddings)
```

✅ is OK since this just passes the argument through.
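
Since the new keyword defaults to True, existing callers keep the current behavior (paddings stripped); only call sites that explicitly pass strip_tensor_paddings=False behave differently. A minimal usage sketch, assuming ds_checkpoint is an already-constructed DeepSpeedCheckpoint and the rank indices are just placeholders:

```python
# Default: same behavior as before this change -- tensor paddings are stripped.
sd_stripped = ds_checkpoint.get_zero_checkpoint_state(pp_index=0, tp_index=0, dp_index=0)

# Opt out: keep the ZeRO alignment paddings attached to the optimizer state,
# which is what the universal-checkpoint conversion path now does.
sd_with_padding = ds_checkpoint.get_zero_checkpoint_state(pp_index=0,
                                                          tp_index=0,
                                                          dp_index=0,
                                                          strip_tensor_paddings=False)
```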

However, I'm a bit less sure about this change:

```diff
     sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
                                                  tp_index=tp_index,
-                                                 dp_index=dp_index)
+                                                 dp_index=dp_index,
+                                                 strip_tensor_paddings=False)
```

since I'm not completely clear on how the internals of this
_strip_tensor_paddings() function work.

For our purposes, setting this to False, and thereby skipping the:

```python
if strip_tensor_paddings:
    self._strip_tensor_paddings(sd)
```

block in the get_state_for_rank call seems to work, though I'm not really sure why.
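
To make the control flow concrete, here's a rough, illustrative sketch of how I understand the gating inside get_state_for_rank. This is not the actual DeepSpeed implementation (the real method does more, and the _load_state_for_rank helper below is hypothetical); it just shows the shape of the strip_tensor_paddings branch:

```python
# Illustrative sketch only -- not the real DeepSpeed ZeRO checkpoint code.
class ZeROCheckpointSketch:
    def get_state_for_rank(self, pp_index, tp_index, dp_index,
                           keys_to_ignore=None, strip_tensor_paddings=True):
        # Hypothetical loader standing in for however the real class reads
        # the per-rank optimizer state from the checkpoint files.
        sd = self._load_state_for_rank(pp_index, tp_index, dp_index)

        if strip_tensor_paddings:
            # Removes the alignment padding that ZeRO adds to partitioned
            # tensors. Passing strip_tensor_paddings=False skips this step,
            # which is the workaround that avoided the RuntimeError at DP=768.
            self._strip_tensor_paddings(sd)

        # Drop keys the caller asked to ignore (e.g. PARAM_SHAPES above).
        for key in (keys_to_ignore or []):
            sd.pop(key, None)

        return sd
```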
