Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rocm warp size fix #5402

Merged
merged 9 commits into from
May 17, 2024
Merged

Rocm warp size fix #5402

merged 9 commits into from
May 17, 2024

Conversation

rraminen
Copy link
Contributor

@rraminen rraminen commented Apr 11, 2024

This PR enables building the below extensions for AMD GPUs with warp size 32.

  • transformer_inference
  • quantizer
  • random_ltd

This PR works stand-alone for torch version <=2.0. For the latest versions, #5401 is required to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2) on NAVI3x:

transformer_inference:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s (0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s ==========

quantizer:
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n 4 unit/ops/quantizer

Before this PR:
==== 244 failed, 8 warnings in 30.53s ====

After this PR:
====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes:
#4753
#5474
ROCm#68

cc: @jithunnair-amd

@rraminen rraminen marked this pull request as ready for review April 17, 2024 19:31
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from 295b743 to afaee86 Compare April 29, 2024 18:27
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from afaee86 to fb5ad02 Compare May 10, 2024 17:24
@rraminen rraminen force-pushed the rocm_warp_size_fix branch from fb5ad02 to c87c4f3 Compare May 14, 2024 15:01
github-merge-queue bot pushed a commit that referenced this pull request May 17, 2024
Fixes #4989

In addition to this PR, below changes are required to build below
extensions successfully. Please note that not all unit tests for these
extensions will pass with this PR. More details on the unit test results
are below. These unit tests are skipped in CI anyway, so they will not
break the CI.
- transformer_inference
- quantizer
- random_ltd

- pytorch/pytorch#121030
- #5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
@loadams loadams added this pull request to the merge queue May 17, 2024
Merged via the queue into microsoft:master with commit 76c9c69 May 17, 2024
13 checks passed
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
Fixes microsoft#4989

In addition to this PR, below changes are required to build below
extensions successfully. Please note that not all unit tests for these
extensions will pass with this PR. More details on the unit test results
are below. These unit tests are skipped in CI anyway, so they will not
break the CI.
- transformer_inference
- quantizer
- random_ltd

- pytorch/pytorch#121030
- microsoft#5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
sfc-gh-reyazda pushed a commit to Snowflake-Labs/DeepSpeed that referenced this pull request Jun 10, 2024
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch version <=2.0. For the latest
versions, microsoft#5401 is required
to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR:
     ==== 244 failed, 8 warnings in 30.53s ====

After this PR:
    ====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: 
microsoft#4753
microsoft#5474
ROCm#68

cc: @jithunnair-amd

---------

Co-authored-by: [email protected] <rraminen>
Co-authored-by: Logan Adams <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Nov 4, 2024
When launching apply_rotary_pos_half kernel, only threads_per_head of 64
is supported for wavefront size of 64.
This change adds support for threads_per_head < 64 such as 4, 8, 16.

Fixes the issue introduced in
#5402

---------

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants