Add main_grad #1140

jianyuh · 2023-10-02T01:08:35Z

What does this PR do?

Fixes main_grad following up #1139 (comment)

Before submitting

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

vedanuj · 2023-10-02T20:05:24Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

@@ -1713,6 +1713,13 @@ def _post_backward_hook(self, param: Parameter, *unused: Any) -> None:

        # Switch to FP32 shard after backward.
        self._use_fp32_param_shard([param])
+        if self.mixed_precision and self.fp32_reduce_scatter:


For FP8 we currently do not use mixed_precision, so the mixed_precision part of this condition should be removed.

Check only:

if self.fp32_reduce_scatter:

Addressed the comment.
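For readers following this thread, a minimal sketch of why the guard matters. The two helper functions and the `fp8_like` settings are hypothetical names used only for illustration. In particular, the assumption that the FP8 setup runs with fp32_reduce_scatter enabled is not stated in the comment above.

```python
def guard_original(mixed_precision: bool, fp32_reduce_scatter: bool) -> bool:
    # Guard as first proposed in the diff: both flags must be set.
    return mixed_precision and fp32_reduce_scatter


def guard_suggested(mixed_precision: bool, fp32_reduce_scatter: bool) -> bool:
    # Guard suggested in review: depends only on fp32_reduce_scatter.
    return fp32_reduce_scatter


# Assumed FP8-style configuration: mixed_precision off, FP32 reduce-scatter on.
fp8_like = dict(mixed_precision=False, fp32_reduce_scatter=True)

original_taken = guard_original(**fp8_like)
suggested_taken = guard_suggested(**fp8_like)
```

Under these assumed settings the original guard never takes the new branch, while the suggested guard does.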

jianyuh · 2023-10-04T05:10:34Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py


            if self.fp32_reduce_scatter:
                # Cast grad to FP32.
                param.grad.data = param.grad.data.float()

+            orig_grad_data = param.grad.data


Moved here so that orig_grad_data is FP32. This follows up on #1139 (comment).
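A small sketch of the effect of this move, using plain CPU tensors rather than the real hook. The function names are hypothetical. The point is only that a reference taken before the cast keeps the low-precision dtype, while one taken after the cast is FP32.

```python
import torch


def capture_before_cast(grad: torch.Tensor) -> torch.Tensor:
    # Reference taken first: it aliases the original low-precision storage.
    orig_grad_data = grad.data
    # Rebinding .data points grad at a new FP32 tensor; the alias is unchanged.
    grad.data = grad.data.float()
    return orig_grad_data


def capture_after_cast(grad: torch.Tensor) -> torch.Tensor:
    # Cast first, then take the reference: it now aliases the FP32 tensor.
    grad.data = grad.data.float()
    orig_grad_data = grad.data
    return orig_grad_data


before = capture_before_cast(torch.ones(4, dtype=torch.float16))
after = capture_after_cast(torch.ones(4, dtype=torch.float16))
```

Here `before` stays FP16 and `after` is FP32, which is the behavior the moved line is meant to give.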

jspark1105 · 2023-10-04T05:32:30Z

fairscale/nn/data_parallel/fully_sharded_data_parallel.py

@@ -1721,23 +1728,31 @@ def _post_backward_hook(self, param: Parameter, *unused: Any) -> None:
        # reductions in post_backward stream.
        self._streams["post_backward"].wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(self._streams["post_backward"]):
-            orig_grad_data = param.grad.data

            if self.fp32_reduce_scatter:
                # Cast grad to FP32.
                param.grad.data = param.grad.data.float()


I don't think this is right, since param.grad will be None from L1722.

Overall, this PR creates main_grad for flat parameters, whereas what we need is for main_grad to be visible to TE modules. So we probably need to change FlatParameter as well?

Is this based on one of Naman's branches?

I have a branch where I am adding param.main_grad to FlatParams to enable fused wgrad accumulation. Here is the PR: #1142

Thanks! Feel free to ignore the changes in this PR. Still learning about FlatParams etc.
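For context, a rough sketch of the main_grad idea discussed in this thread. It assumes main_grad is a preallocated FP32 buffer attached to a parameter, and that weight gradients are accumulated into it instead of into param.grad. The parameter, the `accumulate_wgrad` helper, and the micro-batch loop are all hypothetical, and this is not the code from #1142.

```python
import torch

# Hypothetical low-precision parameter standing in for a flat parameter.
param = torch.nn.Parameter(torch.zeros(4, dtype=torch.bfloat16))

# Assumption: an FP32 buffer attached under the name main_grad, allocated once.
param.main_grad = torch.zeros(param.shape, dtype=torch.float32)


def accumulate_wgrad(p: torch.nn.Parameter, wgrad: torch.Tensor) -> None:
    # Accumulate in FP32 directly into the preallocated buffer, not p.grad.
    p.main_grad.add_(wgrad.float())


# Three illustrative micro-batches, each contributing a gradient of 0.5.
for _ in range(3):
    accumulate_wgrad(param, torch.full((4,), 0.5, dtype=torch.bfloat16))
```

In this sketch param.grad is never populated. Whatever consumes the gradients would have to read main_grad, which is why the buffer needs to be visible on the parameters the modules actually see.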

facebook-github-bot added the CLA Signed label Oct 2, 2023

jianyuh requested review from jspark1105, jiecaoyu and vedanuj October 2, 2023 01:08

jianyuh marked this pull request as ready for review October 2, 2023 01:09

jianyuh mentioned this pull request Oct 2, 2023

Fix fsdp+pp+te WPS decreasing issue #1139

Merged

10 tasks

jianyuh changed the base branch from ngoyal_changes_for_pp_fp8_fix_handle to ngoyal_changes_for_pp_fp8 October 2, 2023 03:03

jianyuh added 3 commits October 1, 2023 20:09

Fix fsdp+pp+te WPS decreasing issue

81ee78d

Address comment; remove unused stuff

71495ba

split into wps fix P841842878 only and main_grad fix

f3ae46e

jianyuh force-pushed the ngoyal_changes_for_pp_fp8_fix_handle_grad_main branch from 3f34441 to 239ed36 Compare October 2, 2023 03:11

vedanuj reviewed Oct 2, 2023

Add main_grad

ad54660

jianyuh force-pushed the ngoyal_changes_for_pp_fp8_fix_handle_grad_main branch from 239ed36 to ad54660 Compare October 2, 2023 21:48

jianyuh commented Oct 4, 2023

jspark1105 suggested changes Oct 4, 2023
