Installation might be incomplete #196
Open
leedrake5 opened this issue Oct 22, 2024 · 4 comments

leedrake5 commented Oct 22, 2024

I am installing for an RTX 6000 Ada and wanted to optimize for that system to run FP8. I followed the commands to install:

git clone https://github.com/Azure/MS-AMP.git
cd MS-AMP
git submodule update --init --recursive
cd third_party/msccl

# RTX 6000 Ada
make -j src.build NVCC_GENCODE="-gencode=arch=compute_89,code=sm_89"


apt-get update
apt install build-essential devscripts debhelper fakeroot
make pkg.debian.build
sudo dpkg -i build/pkg/deb/libnccl2_*.deb
sudo dpkg -i build/pkg/deb/libnccl-dev_2*.deb

cd -

python3 -m pip install --upgrade pip
python3 -m pip install .
sudo -E make postinstall

I really wish I didn't have to use sudo for this, but I didn't have much of an option: Docker wasn't usable for some reason ('zlib version less than 1.2.3', even though the latest zlib is installed on my system), and the install rejected both virtualenv and conda environments because it doesn't like symlinks. I get no errors and everything installs (CUDA 12.4), but I always hit the same error:

  File "/home/<user>/.local/lib/python3.11/site-packages/msamp/optim/adamw.py", line 16, in <module>
    import msamp_adamw
ModuleNotFoundError: No module named 'msamp_adamw'

Which is odd, because the output from the initial installation steps tells me it is in fact installed correctly:
'''
Successfully installed msamp_arithmetic-0.0.1
Successfully installed msamp_adamw-0.0.1
'''

Looking into the packages, I can see that there is no 'msamp_adamw' package (or 'msamp_arithmetic' for that matter), just 'msamp':
[screenshot of the site-packages directory showing only the 'msamp' package]
Note that I am SSHing into the Linux system from a Mac, hence the interface.

So I am very confused - are these libraries not installed?
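A quick sanity check (a minimal sketch; the module names are the ones from the install log above) is to ask Python directly whether it can locate the compiled extensions:

# Prints a ModuleSpec with the .so path if the module is importable, or None if it is not.
python3 -c "import importlib.util; print(importlib.util.find_spec('msamp_arithmetic'))"
python3 -c "import importlib.util; print(importlib.util.find_spec('msamp_adamw'))"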

I can get FP8 training to work in a limited way with Transformer Engine, but I'd really like to use MS-AMP; I just don't see a feasible way to do so.

leedrake5 (Author)

In case it helps, here's what happens when I try to use the recommended docker image:

$ sudo docker run -it -d --name=msamp --privileged --net=host --ipc=host --gpus=all nvcr.io/nvidia/pytorch:23.10-py3 bash
Unable to find image 'nvcr.io/nvidia/pytorch:23.10-py3' locally
23.10-py3: Pulling from nvidia/pytorch
37aaf24cf781: Extracting [==================================================>]  29.54MB/29.54MB
... (remaining layers report "Download complete") ...
docker: failed to register layer: exit status 22: unpigz: abort: zlib version less than 1.2.3

As far as I can tell, my installs of pigz and zlib are up to date:

zlib1g-dev is already the newest version (1:1.3.dfsg-3.1ubuntu2.1).
pigz is already the newest version (2.4-1).

So I'm not sure what can be done to remedy the problem.

wkcn (Contributor) commented Oct 22, 2024

Hi @leedrake5, thanks for your attention to our work!

I could not reproduce the issue.
It seems that the packages msamp_arithmetic and msamp_adamw were not copied into Python's site-packages folder.

You can try to copy the following *.so files into the site-packages folder manually.

msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so
msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so
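
For example, a minimal sketch of that manual copy (the lib.linux-* and cpython-* suffixes depend on your platform and Python version, and the target should be the same site-packages directory where msamp itself landed, e.g. ~/.local/... for a user install):

# Find the site-packages directory of the current user install (adjust if msamp went elsewhere).
SITE_PACKAGES=$(python3 -m site --user-site)

# From the MS-AMP repository root, copy the compiled extension modules out of the build trees.
cp msamp/operators/arithmetic/build/lib.linux-*/msamp_arithmetic.cpython-*.so "${SITE_PACKAGES}/"
cp msamp/optim/build/lib.linux-*/msamp_adamw.cpython-*.so "${SITE_PACKAGES}/"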

The environment variable should also be set:

LD_PRELOAD="${THE_PATH_OF_MSAMP}/msamp/operators/dist_op/build/libmsamp_dist.so:/usr/local/lib/libnccl.so:${LD_PRELOAD}"

leedrake5 (Author) commented Oct 22, 2024

Thought I had it, but I didn't. When I run the following commands:

'''
NCCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libnccl.so # Change as needed
export LD_PRELOAD="/usr/local/lib/libmsamp_dist.so:${NCCL_LIBRARY}:${LD_PRELOAD}"
'''

...it unfortunately breaks torch

'''
import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bly/.local/lib/python3.11/site-packages/torch/__init__.py", line 368, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: /home/<user_name>/.local/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
'''

I'm guessing it didn't install correctly. I don't want this to devolve into an individual operator problem, but what's stranger is that if I don't set LD_PRELOAD at all, training works (with accelerate config set to msamp O2). It's slower than bf16, but the loss is reasonable and RAM usage is down. So it seems to have at least partially installed.
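
For what it's worth, the undefined ncclCommRegister symbol looks like a version mismatch: the preloaded libnccl.so appears to be older than the NCCL this PyTorch build expects (ncclCommRegister was, as far as I can tell, added in NCCL 2.19). A minimal sketch for comparing the two versions, assuming the libnccl2 .deb built above is the library being preloaded:

# NCCL version PyTorch was built against vs. the version of the installed libnccl2 package.
python3 -c "import torch; print('torch built against NCCL', torch.cuda.nccl.version())"
dpkg -s libnccl2 | grep '^Version'   # version of the libnccl2 .deb built from third_party/msccl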

wkcn (Contributor) commented Oct 23, 2024

@leedrake5
The custom NCCL library in MS-AMP is used to support all-reduce operations for FP8 weight gradients.

If the custom NCCL is not installed, the FP8 all-reduce in the Megatron optimizer and FSDP does not work.
However, it does not affect PyTorch DDP and DeepSpeed, where weight gradients are stored as FP8 tensors and reduced using BF16.
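
For example, a minimal sketch of MS-AMP O2 with plain PyTorch DDP, which does not need the custom NCCL (the model and optimizer here are placeholders; msamp.initialize is the entry point from the README):

import torch
import msamp

# Standard distributed setup; the stock NCCL that ships with PyTorch is sufficient here.
torch.distributed.init_process_group(backend="nccl")
local_rank = torch.distributed.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Cast model and optimizer to MS-AMP's FP8-aware versions at opt level O2.
model, optimizer = msamp.initialize(model, optimizer, opt_level="O2")

# Wrap with DDP as usual; weight gradients stay in FP8 and are reduced using BF16.
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])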

Regarding FP8 being slower than BF16: this is due to the overhead of FP8 quantization and dequantization. You can try training models larger than 7B with MS-AMP + TransformerEngine to mitigate that overhead.
