Make the quantized data shape compatible with original tensor shape #5483

Open
wants to merge 47 commits into
base: master

sfc-gh-reyazda
Contributor

This PR makes a small change to the FP-Quantizer module so that the quantized weight keeps a shape compatible with the original checkpoint weight's shape.
This addresses Snowflake-Labs/snowflake-arctic#16

@sfc-gh-reyazda
Contributor Author

@microsoft-github-policy-service agree [company="{your company}"]

@microsoft-github-policy-service agree [company="Snowflake"]

@sfc-gh-reyazda
Contributor Author

@microsoft-github-policy-service agree company="your company"

@microsoft-github-policy-service agree company=Snowflake

sfc-gh-reyazda and others added 17 commits June 9, 2024 17:57
Previously this reported the configured optimizer name rather than the
optimizer actually in use after this function runs.
The two are not always the same.

Co-authored-by: Logan Adams <[email protected]>
…oft#5159)

Enhance testing: Skip fused_optimizer tests if not supported.

Added condition check to skip fused_optimizer tests if FusedAdam and
FusedLamb are not supported by the accelerator. This enhancement ensures
that the tests are appropriately skipped when the hardware configuration
does not support these optimizers, preventing potential issues.

Details:
- Introduced a condition check to determine support for FusedAdam and
FusedLamb.
- If not supported, fused_optimizer tests are skipped to improve test
reliability.
- Improved compatibility and stability across different hardware
configurations.

---------

Co-authored-by: Logan Adams <[email protected]>
Fixes microsoft#4989

In addition to this PR, the two changes listed after the extension names
are required to build the extensions below successfully. Please note that
not all unit tests for these extensions will pass with this PR. More
details on the unit test results are below. These unit tests are skipped
in CI anyway, so they will not break the CI.
- transformer_inference
- quantizer
- random_ltd

- pytorch/pytorch#121030
- microsoft#5402


Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on MI200:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR: 
==== 674 failed, 622 skipped, 8 warnings, 1728 errors in 123.66s
(0:02:03) =====

After this PR:
========== 555 failed, 983 passed, 1486 skipped, 8 warnings in 14.35s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR: 
==== 244 failed, 8 warnings in 48.02s ====

After this PR:
===== 187 failed, 57 passed, 8 warnings in 14.74s ====

I could not find random_ltd related unit tests to run.

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
This PR enables building the below extensions for AMD GPUs with warp
size 32.
- transformer_inference
- quantizer
- random_ltd


This PR works stand-alone for torch version <=2.0. For the latest
versions, microsoft#5401 is required
to be merged in addition to this PR.

Unit test results (rocm/pytorch:rocm6.1_ubuntu20.04_py3.9_pytorch_2.1.2)
on NAVI3x:

**transformer_inference:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/transformer/inference

Before this PR:
===== 674 failed, 622 skipped, 8 warnings, 1728 errors in 69.37s
(0:01:09) =====

After this PR:
========== 476 failed, 1062 passed, 1486 skipped, 8 warnings in 9.31s
==========

**quantizer:**
pytest --color=yes --durations=0 --verbose -s -m "inference_ops" -rF -n
4 unit/ops/quantizer

Before this PR:
     ==== 244 failed, 8 warnings in 30.53s ====

After this PR:
    ====== 186 failed, 58 passed, 8 warnings in 8.89s ======

I could not find random_ltd related unit tests to run.

Fixes: 
microsoft#4753
microsoft#5474
ROCm#68

cc: @jithunnair-amd

---------

Co-authored-by: [email protected] <rraminen>
Co-authored-by: Logan Adams <[email protected]>
* Use all_reduce instead of all_gather to fetch module parameters. This
improves performance by reducing the overhead of concatenation and
slicing, which are no longer required.
* Instead, all tensors views are created prior to the collective
(all_reduce), so upon its completion only the parameter status is
updated.
* The behavior is enabled via a new boolean flag under the section
"zero_optimization": { "stage3_use_all_reduce_for_fetch_params": true }
* By default the optimization is not enabled.
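
A minimal sketch of how the flag could be set, written as a Python dict in place of the JSON config. Only the section name and the flag name come from the description above; `"stage": 3` is assumed from the `stage3_` prefix. The helper `fetch_via_all_reduce` is hypothetical and simulates, in one process, one way a sum all-reduce can stand in for an all-gather: each rank writes its shard into a zero-filled full-size buffer and the buffers are summed. That mechanism is an assumption for illustration, not taken from the PR.

```python
import json

# Hypothetical minimal config. The flag and its section are from the PR text;
# "stage": 3 is an assumption based on the stage3_ prefix.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "stage3_use_all_reduce_for_fetch_params": True,  # disabled by default
    }
}
print(json.dumps(ds_config, indent=2))


def fetch_via_all_reduce(shards):
    """Simulate ranks in one process (illustrative only).

    Each rank places its shard at its own offset in a zero-filled buffer of
    full parameter size; an element-wise sum over all buffers then yields the
    full parameter with no concatenation or slicing afterwards.
    """
    shard_len = len(shards[0])
    total = shard_len * len(shards)
    buffers = []
    for rank, shard in enumerate(shards):
        buf = [0.0] * total
        buf[rank * shard_len:(rank + 1) * shard_len] = shard
        buffers.append(buf)
    # Stand-in for the sum all-reduce across ranks.
    return [sum(values) for values in zip(*buffers)]


shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
full_param = fetch_via_all_reduce(shards)
print(full_param)
```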

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Hi.
Please review the following changes.
I added BF16 support to CPU Adam. BF16, FP16 and float are supported
at compilation time. The correct template is selected at runtime according
to the dtype of the input params.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Fixes the following error:
/datadisk2/wengshiy/llm.devkit/DeepSpeed/deepspeed/runtime/utils.py
    return get_accelerator().FloatTensor(float(v)).detach()
TypeError: new(): data must be a sequence (got float)

The CUDA accelerator changed the interface to fix a warning:
microsoft@177dc14

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
…icrosoft#5519)

The following error occurs on XPU when running the unit test
"DeepSpeed/tests/unit/moe/test_moe.py":
DeepSpeed/deepspeed/moe/sharded_moe.py line 223, in top1gating
RuntimeError: Expected all tensors to be on the same device, but found
at least two devices, xpu:0 and cpu!

Fix it by device conversion.

---------

Co-authored-by: Logan Adams <[email protected]>
**Fix overwriting of the compiled wrapper class attributes by those of
the wrapped class itself: Copy only those attributes which are not
already present in the wrapper.**

In the current implementation of the `CompiledModuleWrapper`, the wrapper
attributes (e.g. the `forward` method) are overwritten by `self.__dict__ =
module.__dict__.copy()`:

```
def CompiledModuleWrapper(mod, compile_config: Union[CompileConfig, None] = None):
     class wrapper(mod.__class__):
         def __init__(self, module, compile_config: Union[CompileConfig, None] = None):
             self.__dict__ = module.__dict__.copy()
```
As a result, the `wrapper`'s `forward` method is not called and,
consequently, the wrapped module is not compiled. Instead, the wrapped
module's `forward` method is called, as illustrated in the diagram
below (a real scenario from DeepSpeed-Chat):


![compiled_module_wrapper_bug](https://github.com/microsoft/DeepSpeed/assets/75629718/00eeb3d1-927c-49c7-84ab-f882821cc452)

The proposed fix copies only those attributes which are not present in
the wrapper class, thus implementing the desired inheritance quality of
the wrapper.
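
The shadowing described above can be reproduced without torch. The sketch below is a plain-Python illustration, not the DeepSpeed code: `Module`, `make_wrapper` and the `copy_all` switch are hypothetical names, and it assumes the wrapped object carries a `forward` entry in its instance `__dict__` (for example because something patched it onto the instance). Copying the whole `__dict__` then hides the wrapper's own `forward`, while skipping keys the wrapper class already defines keeps it.

```python
class Module:
    """Stand-in for a model class (hypothetical)."""

    def forward(self, x):
        return f"original({x})"


def make_wrapper(mod, copy_all):
    class Wrapper(mod.__class__):
        def __init__(self, module):
            if copy_all:
                # Buggy variant: instance attributes of `module` shadow
                # methods that Wrapper defines at class level.
                self.__dict__ = module.__dict__.copy()
            else:
                # Fixed variant: copy only attributes Wrapper does not define.
                own = type(self).__dict__
                self.__dict__ = {
                    k: v for k, v in module.__dict__.items() if k not in own
                }

        def forward(self, x):
            return f"compiled({x})"

    return Wrapper(mod)


mod = Module()
# Assumption for the demo: forward was patched onto the instance itself.
mod.forward = lambda x: f"patched({x})"
mod.hidden = 4

buggy = make_wrapper(mod, copy_all=True)
fixed = make_wrapper(mod, copy_all=False)

print(buggy.forward(1))  # instance attribute wins over Wrapper.forward
print(fixed.forward(1))  # Wrapper.forward is used
print(fixed.hidden)      # other attributes are still carried over
```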

Attached is a simple reproducer of the problem.

[compiled_module_wrapper_bug.zip](https://github.com/microsoft/DeepSpeed/files/15378282/compiled_module_wrapper_bug.zip)

Co-authored-by: Logan Adams <[email protected]>
Addresses the following warning:

```
/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/transformers/utils/hub.py:123: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
```

and the code on the transformers side is
[here](https://github.com/huggingface/transformers/blob/1a585c1222a56bcaecc070966d558d4a9d862e83/src/transformers/utils/hub.py#L86C1-L96C81).
loadams and others added 18 commits June 10, 2024 17:27
Until now, only the last layer (idx=-1) was considered, using
FINAL_LAYER_NORM_INDEX, which is set to -1.
This PR allows the user to pass a custom value for models where this
default value does not apply.
See an example of usage in the HabanaAI/Megatron-DeepSpeed fork repository:

https://github.com/HabanaAI/Megatron-DeepSpeed/blob/c9feb8cacabc6dd4da4266cff08db555a21122e2/tools/verify_checkpoint_non_tp_consistency.py#L296

---------

Co-authored-by: Logan Adams <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Add CUDA versions 12.4 and 12.5 to the list
Fixed the Windows build.

Fixes applied:
- Remove some more ops that don't build on Windows.
- Remove the use of symlinks that didn't work correctly and replace with
`shutil.copytree()`.
- Small fixes to make the C++ code compile.
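
As an illustration of the symlink replacement mentioned in the list above, here is a small sketch using only the standard library. The helper name `copy_instead_of_symlink` and the `csrc`/`build` paths are hypothetical and are not taken from the DeepSpeed build scripts.

```python
import shutil
import tempfile
from pathlib import Path


def copy_instead_of_symlink(src: Path, dst: Path) -> None:
    """Materialize `src` at `dst` as a real copy rather than a symlink."""
    # dirs_exist_ok (Python 3.8+) lets a repeated build reuse an existing copy.
    shutil.copytree(src, dst, dirs_exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "csrc"            # hypothetical source tree
    dst = Path(tmp) / "build" / "csrc"  # hypothetical destination
    src.mkdir()
    (src / "kernel.cpp").write_text("// placeholder\n")

    copy_instead_of_symlink(src, dst)

    copied = (dst / "kernel.cpp").read_text()
    is_symlink = dst.is_symlink()
    print(copied, is_symlink)
```

The trade-off is extra disk usage and a copy that does not track later edits to the source tree, in exchange for not depending on symlink support on the host.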

Tested with Python 3.9 and CUDA 12.1.

---------

Co-authored-by: Costin Eseanu <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
…osoft#5099)

The estimate was assuming 4 bytes per model param and 4 bytes per gradient.
Fixed it to 2 bytes each, under the assumption of FP16/BF16.
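
The effect of that change on the estimate can be checked with simple arithmetic. The function name and the 1B-parameter model size below are hypothetical; only the 4-byte and 2-byte figures come from the description above.

```python
def params_and_grads_bytes(num_params: int, bytes_per_element: int) -> int:
    """Bytes for one copy of the weights plus one copy of the gradients."""
    return 2 * num_params * bytes_per_element


num_params = 1_000_000_000  # hypothetical model size

old_estimate = params_and_grads_bytes(num_params, 4)  # previous assumption
new_estimate = params_and_grads_bytes(num_params, 2)  # FP16/BF16 assumption

print(f"old: {old_estimate / 2**30:.2f} GiB, new: {new_estimate / 2**30:.2f} GiB")
```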

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
This is a simple fix for inference woq part, changing from `'cuda'` to
`get_accelerator().device_name()`.

---------

Co-authored-by: Logan Adams <[email protected]>
Use dp_world_size for grad reduction, instead of seq_dp_world_size.
Currently, for zero0, only sparse tensors use the correct world_size.

tiny model with sp=4 grad norm test:
grad_norm | step1 | step2 | step3 | step4 | step5 | step100
-- | -- | -- | -- | -- | -- | --
zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555
zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889
zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554
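
A quick check on the numbers in the table: the unpatched zero0 grad norms are smaller than the zero1 values by a factor very close to sp=4, and the patched values match zero1. The snippet below only recomputes those ratios from the table. Reading the factor as "the unpatched path divides by a world size that is sp times too large" is an inference from the data, not something the PR states.

```python
# Values copied from the table above (steps 1-5 and step 100).
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_before = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
zero0_patched = [15.825, 16.646, 15.853, 16.159, 17.333, 15.554]

sp = 4  # sequence-parallel size used in the test

# Ratio of the reference (zero1) norm to the unpatched zero0 norm per step.
ratios = [ref / old for ref, old in zip(zero1, zero0_before)]
# Largest remaining difference between zero1 and patched zero0.
max_gap = max(abs(ref - new) for ref, new in zip(zero1, zero0_patched))

print([round(r, 3) for r in ratios])
print(max_gap)
```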
In the process of adding onebit optimizer support for XPU devices, we
noticed that across accelerators, the main difference in the
implementation of `compressed_allreduce` lies in `packbits` and
`unpackbits`. CUDA uses cupy and NPU uses torch_npu. Instead of replacing
these with XPU-only functions, we provide a CompressedBackend to do the
`compressed_allreduce` work, where users can add their own
packbits/unpackbits kernels. This is a general path for all kinds of
accelerators.

In this PR, we:
1. Add CompressedBackend for onebitAdam, onebitLamb and zerooneAdam
2. Add an XPU implementation of packbits/unpackbits with SYCL, built in
PackbitsBuilder
3. Add tests for onebit with CompressedBackend
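
For readers unfamiliar with the two kernels named above, the following pure-Python sketch shows what a packbits/unpackbits pair does to a vector of sign bits. It is a reference illustration only, not the cupy, torch_npu or SYCL implementation; the function names and sample values are hypothetical, and the bit order (most significant bit first, zero padding at the end) follows the default of `numpy.packbits`.

```python
def packbits(bits):
    """Pack 0/1 values into bytes, most significant bit first.

    The last byte is zero-padded when len(bits) is not a multiple of 8.
    """
    padded = list(bits) + [0] * (-len(bits) % 8)
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | (bit & 1)
        out.append(byte)
    return bytes(out)


def unpackbits(packed, count):
    """Inverse of packbits; `count` drops the padding bits."""
    bits = []
    for byte in packed:
        for shift in range(7, -1, -1):
            bits.append((byte >> shift) & 1)
    return bits[:count]


# Hypothetical gradient values; one sign bit is kept per element.
grads = [0.5, -1.25, 3.0, -0.1, 0.0, 2.0, -7.0, 1.0, -0.3, 0.9]
signs = [1 if g >= 0 else 0 for g in grads]

packed = packbits(signs)
restored = unpackbits(packed, len(signs))
print(len(signs), "bits ->", len(packed), "bytes")
```

Packing before communication means ten sign bits travel as two bytes here, which is the kind of saving a compressed all-reduce relies on.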

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Updated hpu-gaudi2 tests content as quantizer module is not yet
supported.
MII legacy tests use `from transformers import Conversation`
[here](https://github.com/microsoft/DeepSpeed-MII/blob/c171c4ee290e96c0d3e618b654be8add5eca973b/mii/legacy/method_table.py#L8).

Conversation was removed from transformers
[here](huggingface/transformers#31165), so for now we pin
to a version from before that change, to be unpinned later.
…icrosoft#5590)

We encountered an accuracy issue when running torch compile +
zero3 + activation checkpointing. Specifically, some grads get zeroed
(when running without torch compile, this issue is not encountered). This
issue was also reproduced by Umesh Chand from the DS team. We found that
in the PyTorch repo, torch compile has been explicitly disabled using
the decorator @torch._disable_dynamo()
(reference to the workaround (WA) in the PyTorch repo:
https://github.com/pytorch/pytorch/blob/ec8b254ef49b4a057cf89c2ae64520fb7b423a3e/torch/utils/checkpoint.py#L324).
This indicates that there is some issue with torch compile and
checkpointing (not necessarily DS related).

Given that the checkpointing function in DeepSpeed is based on the
PyTorch function, we propose adopting this WA to ensure correct behavior
(it can be removed later if the underlying issue is fixed).
Note: this shouldn't impact non-torch-compile cases.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
instead of "if" that causes host/device synchronization and introduces a
bubble, while clamp is happening on the device
`deepspeed.runtime.zero.stage_1_and_2.DeepSpeedZeroOptimizer.average_tensor`
only makes the reduction stream wait for the default stream. This is fine
when the computation time is longer than the communication time,
but when the communication time is longer, the ipg_buffer may be
overwritten before the communication has completed.



![image](https://github.com/microsoft/DeepSpeed/assets/35059704/950cbf8a-f439-4cf9-a364-dcdfd47f46a0)



To fix this bug, the easiest way is to also make the default stream wait
for the reduction stream at the **same point**. For example, at point 1, the
`reduction stream` needs to wait for '2', so we add a wait_stream that makes
the `reduction stream` wait for the `default stream`. Likewise, the `default
stream` needs to wait for 'A', so we add a wait_stream that makes the
`default stream` wait for the `reduction stream` before 'B'.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/588a9469-d3f9-4c39-976d-3ae0502cf1d1)



Compared with the modification in
microsoft#5523, wait_stream does not
cause host synchronization.

Compared with the modification in
microsoft#5545, this modification is
simpler while the logic is the same: it waits only for what needs to be
waited on.

---

With this modification, losses of Qwen-1.5 with and without overlap_comm
are totally identical.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/4d48d54e-e55b-4230-8b99-93549910a43f)

---

On the contrary, there is an obvious gap with a small sequence length,
which means a short computation time.


![image](https://github.com/microsoft/DeepSpeed/assets/35059704/c80af498-3358-4e36-9b13-8f266551d51d)

Co-authored-by: gp513 <[email protected]>
Co-authored-by: CurryRice233 <[email protected]>
Co-authored-by: Joe Mayer <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
@loadams loadams requested review from jeffra, tjruwase and hwchen2017 and removed request for arashb, mrwyattii and awan-10 January 7, 2025 20:10
@@ -45,6 +45,18 @@ at::Tensor quantize(torch::Tensor& out,
return out;
}

at::Tensor get_scales(torch::Tensor& out, int num_groups)

This function is redefined at line 118.

@@ -62,7 +62,8 @@ def _ensure_quantized(self, tensor: torch.Tensor):
tensor.data = self.quantizer.quantize(tensor.data,
q_bits=self.quantization_config.q_bits,
q_mantisa_bits=self.quantization_config.mantissa_bits)
assert tensor.dtype == torch.uint8
assert (tensor.dtype == torch.int8), \

@hwchen2017 hwchen2017 Jan 13, 2025

shouldn't it be torch.uint8 instead?
