Installation might be incomplete #196
In case it helps, here's what happens when I try to use the recommended docker image:
As far as I can tell, my installs of pigz and zlib are up to date:
So I'm not sure what can be done to remedy the problem.
Hi @leedrake5, thanks for your attention to our work! I could not reproduce the issue. You can try copying the following *.so files into the
The environment variable should be set:
Thought I had it, but I didn't. When I run the following commands: ... it unfortunately breaks torch:
I'm guessing it didn't install correctly. I don't want this to devolve into an individual operator problem, but what is more unusual is that if I don't link, training works (with accelerate config msamp O2). It's slower than bf16, but the loss is reasonable and RAM usage is down. So it seems to have at least partially installed.
@leedrake5 If the custom NCCL is not installed, the FP8 all-reduce in the Megatron optimizer and FSDP does not work. As for FP8 being slower than BF16, that comes from the overhead of FP8 quantization and dequantization. You can try training models larger than 7B with MS-AMP + TransformerEngine to mitigate the overhead.
I am installing for the RTX 6000 Ada, and I want to optimize that system to run FP8. I followed the install commands:
I really wish I didn't have to use sudo for this, but I didn't have much of an option: docker wasn't compatible for some reason ('zlib version less than 1.2.3', though the latest is installed on my system), and the install rejected both virtual and conda envs because it doesn't like symlinks. I get no errors and everything installs (cuda 12.4), but I always get the same error:
This is odd, because the initial installation instructions tell me it is in fact installed correctly:
'''
Successfully installed msamp_arithmetic-0.0.1
Successfully installed msamp_adamw-0.0.1
'''
Looking into the packages, I can see that there is no 'msamp_adamw' package (or 'msamp_arithmetic' for that matter), just 'msamp':
Note that I am using ssh into the linux system from a Mac, thus the interface.
So I am very confused - are these libraries not installed?
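One way to narrow this down is to ask the interpreter directly, since a module can be importable without showing up as a separate entry in the package list (and vice versa). The following is a minimal sketch, not part of the MS-AMP instructions: the helper names `probe` and `report` are hypothetical, and the three module names are taken from the install messages above. It relies only on the standard library (`importlib.util.find_spec` and `importlib.metadata`) and should be run with the same Python interpreter that is used for training.

```python
import importlib.util
from importlib import metadata


def probe(name):
    """Return (importable, dist_version) for a top-level module name.

    importable   -- True if the import system can locate a module of this name
    dist_version -- version string of an installed distribution with this
                    name, or None if no such distribution is registered
    """
    importable = importlib.util.find_spec(name) is not None
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


def report(names):
    """Print one line per module name with the result of probe()."""
    for name in names:
        importable, version = probe(name)
        print(f"{name}: importable={importable}, distribution version={version}")


if __name__ == "__main__":
    # Module names assumed from the "Successfully installed ..." lines above.
    report(["msamp", "msamp_arithmetic", "msamp_adamw"])
```

If a name reports as not importable here but was installed with sudo, the extension was probably installed into a different interpreter's site-packages than the one running this script.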
I can get training in FP8 to work in a limited way with Transformer Engine, but I'd really like to use MS-AMP, and I don't see a feasible way to do so.