Batch matmul fast path in MHAWithCache #449
This pull request was exported from Phabricator. Differential Revision: D48418780
Summary:
Pull Request resolved: facebookresearch#449

In self-attention, the Q, K, and V input projection matrices can be combined so that a single matmul replaces three separate ones. This change adds that optimization to MHAWithCache.

Differential Revision: D48418780

fbshipit-source-id: e8001eb870e827b05146221bb66f82939deae0c6
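For readers unfamiliar with the trick, here is a minimal PyTorch sketch of the idea described in the summary. It is not the code from this PR: the layer names (`q_proj`, `k_proj`, `v_proj`, `in_proj`) and the tensor sizes are assumptions chosen for illustration. It stacks three projection weights into one `nn.Linear` and checks that one matmul followed by a split gives the same Q, K, and V.

```python
import torch
from torch import nn

torch.manual_seed(0)
dim = 8
x = torch.randn(2, 5, dim)  # (batch, seq_len, dim); sizes are arbitrary

# Baseline: three separate input projections, i.e. three matmuls.
# The names q_proj/k_proj/v_proj are illustrative, not taken from the PR.
q_proj, k_proj, v_proj = (nn.Linear(dim, dim) for _ in range(3))

# Fused variant: one projection whose weight stacks the three matrices.
# nn.Linear stores weight as (out_features, in_features), so we concatenate on dim 0.
in_proj = nn.Linear(dim, 3 * dim)
with torch.no_grad():
    in_proj.weight.copy_(torch.cat([q_proj.weight, k_proj.weight, v_proj.weight]))
    in_proj.bias.copy_(torch.cat([q_proj.bias, k_proj.bias, v_proj.bias]))

# One matmul, then split the last dimension back into Q, K, V.
q, k, v = in_proj(x).chunk(3, dim=-1)

# The fused outputs should match the separate projections up to float error.
fused_matches = all(
    torch.allclose(a, b, atol=1e-5)
    for a, b in ((q, q_proj(x)), (k, k_proj(x)), (v, v_proj(x)))
)
```

This only works when Q, K, and V are all computed from the same input, which is why the fast path applies to self-attention and not to cross-attention.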
e1233cc to dfd2ec6
dfd2ec6 to a2e0a70
Codecov Report
Patch coverage:
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #449      +/-   ##
==========================================
- Coverage   69.11%   69.11%   -0.01%
==========================================
  Files         170      170
  Lines       11524    11530       +6
==========================================
+ Hits         7965     7969       +4
- Misses       3559     3561       +2
…th (facebookresearch#449)

Summary:
Pull Request resolved: facebookresearch#449

In self-attention, the Q, K, and V input projection matrices can be combined so that a single matmul replaces three separate ones. This change adds that optimization for MHA with cache in a new module, `MultiHeadSelfAttentionWithCache`.

Note: the new module exists primarily to avoid breaking checkpoint backward compatibility (BC) with `MultiHeadAttentionWithCache`. In the future, these MHA implementations should be consolidated.

Differential Revision: D48418780

fbshipit-source-id: 5ad930ff27a4b131f8ff1f097a4c9e1548efb587
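To illustrate the checkpoint BC concern in the note, the sketch below uses two toy modules, `SeparateProj` and `FusedProj`, and a hypothetical helper `fuse_state_dict`. None of these names come from the library or this PR, and the real modules' parameter names may differ. It shows why a checkpoint saved with separate projections does not load into a fused layout as-is, and how the weights could be mapped over.

```python
import torch
from torch import nn


class SeparateProj(nn.Module):
    """Toy stand-in for a module with three separate projections."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)


class FusedProj(nn.Module):
    """Toy stand-in for a module with one fused Q/K/V projection."""

    def __init__(self, dim: int) -> None:
        super().__init__()
        self.in_proj = nn.Linear(dim, 3 * dim)


def fuse_state_dict(old: dict) -> dict:
    """Hypothetical helper: stack q/k/v entries into one in_proj entry."""
    return {
        f"in_proj.{p}": torch.cat([old[f"{n}_proj.{p}"] for n in ("q", "k", "v")])
        for p in ("weight", "bias")
    }


old_sd = SeparateProj(4).state_dict()
fused = FusedProj(4)

# Strict loading fails because the parameter names differ between layouts.
try:
    fused.load_state_dict(old_sd)
    loaded_directly = True
except RuntimeError:
    loaded_directly = False

# After remapping, the old weights load into the fused layout.
fused.load_state_dict(fuse_state_dict(old_sd))
```

Changing the parameter layout of the existing module in place would force every existing checkpoint through a conversion like this, which is the breakage a separate module avoids.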
a2e0a70 to 919dc03
919dc03 to 6d67dae
6d67dae to 173699e
173699e to d459f16
Summary: In self-attention, the Q, K, and V input projection matrices can be combined so that a single matmul replaces three separate ones. This change adds that optimization to MHAWithCache.
Differential Revision: D48418780