Releases: facebookresearch/xformers
[v0.0.29.post1] Fix Flash2 on windows
This fixes the issue reported in #1163 (comment)
Enabling FAv3 by default, removed deprecated components
Pre-built binary wheels require PyTorch 2.5.1
Improved:
- [fMHA] Creating a `LowerTriangularMask` no longer creates a CUDA tensor
- [fMHA] Updated Flash-Attention to `v2.7.2.post1`
- [fMHA] Flash-Attention v3 will now be used by `memory_efficient_attention` by default when available, unless the operator is enforced with the `op` keyword-argument. Switching from Flash2 to Flash3 can make transformer trainings ~10% faster end-to-end on H100s
- [fMHA] Fixed a performance regression with the `cutlass` backend for the backward pass (#1176) - mostly used on older GPUs (eg V100)
- Fixed swiglu operator compatibility with torch-compile with PyTorch 2.6
- Fix activation checkpointing of SwiGLU when AMP is enabled (#1152)
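A minimal sketch of enforcing an operator with the `op` keyword-argument, assuming a CUDA GPU with xFormers installed. The function name `causal_attention` and the flag `pin_flash2` are hypothetical; the `(flash.FwOp, flash.BwOp)` pair follows the notation used elsewhere in these release notes. Imports are done lazily so the block can be loaded on a machine without xFormers.

```python
def causal_attention(q, k, v, pin_flash2=False):
    """Causal attention through xFormers (illustrative wrapper, not library code).

    q, k, v: tensors shaped [batch, seq_len, heads, head_dim] on the same
    CUDA device, typically fp16 or bf16.
    """
    # Lazy imports: only needed when the function is actually called.
    import xformers.ops as xops
    from xformers.ops import fmha

    # op=None leaves the choice to the dispatcher (Flash-Attention v3 when
    # available, per the note above); a (forward, backward) pair enforces it.
    op = (fmha.flash.FwOp, fmha.flash.BwOp) if pin_flash2 else None
    return xops.memory_efficient_attention(
        q, k, v,
        attn_bias=xops.LowerTriangularMask(),
        op=op,
    )
```

Leaving `op` unset is probably the right default for most users; pinning an operator is mainly useful for benchmarking one backend against another.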
Removed:
- Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
- Removed unmaintained/deprecated components in `xformers.components.*` (see #848)
`v0.0.28.post3` - build for PyTorch 2.5.1
[0.0.28.post3] - 2024-10-30
Pre-built binary wheels require PyTorch 2.5.1
`v0.0.28.post2` - build for PyTorch 2.5.0
[0.0.28.post2] - 2024-10-18
Pre-built binary wheels require PyTorch 2.5.0
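Since each pre-built wheel is tied to one PyTorch version, a small check can catch mismatched installs early. The sketch below is illustrative only: `REQUIRED_TORCH` and `wheel_matches_torch` are hypothetical names, and the table contains just the two pairs stated explicitly in the entries above.

```python
# Pairs stated in the release notes; other releases are intentionally omitted.
REQUIRED_TORCH = {
    "0.0.28.post3": "2.5.1",
    "0.0.28.post2": "2.5.0",
}


def wheel_matches_torch(xformers_version, torch_version):
    """True if the PyTorch version is the one the xFormers wheel requires."""
    required = REQUIRED_TORCH.get(xformers_version)
    if required is None:
        raise KeyError(f"no requirement recorded for xformers {xformers_version}")
    # PyTorch versions may carry a local build suffix such as "+cu124".
    return torch_version.split("+", 1)[0] == required
```

In practice the two arguments would come from `xformers.__version__` and `torch.__version__`.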
`0.0.28.post1` - fixing upload for cuda 12.4 wheels
[0.0.28.post1] - 2024-09-13
Properly upload wheels for cuda 12.4
FAv3, profiler update & AMD
Pre-built binary wheels require PyTorch 2.4.1
Added
- Added wheels for cuda 12.4
- Added conda builds for python 3.11
- Added wheels for rocm 6.1
Improved
- Profiler: Fix computation of FLOPS for the attention when using xFormers
- Profiler: Fix MFU/HFU calculation when multiple dtypes are used
- Profiler: Trace analysis to compute MFU & HFU is now much faster
- fMHA/splitK: Fixed `nan` in the output when using a `torch.Tensor` bias where a lot of consecutive keys are masked with `-inf`
- Update Flash-Attention version to `v2.6.3` when building from scratch
- When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend. In other words, it is no longer possible to use the cutlass Fw with the flash Bw.
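To illustrate how a run of `-inf`-masked keys can produce `nan` in a split-K style softmax, here is a pure-Python sketch for a single query with scalar values. It is an assumption about the general failure mode, not the actual triton kernel or its fix: when a whole chunk of keys is masked, the chunk maximum is `-inf` and `-inf - (-inf)` evaluates to `nan`. All function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def partial_softmax(scores, values, guard):
    """Per-chunk statistics: (max score, sum of exps, exp-weighted value sum)."""
    m = max(scores)
    if guard and m == NEG_INF:
        # Every key in this chunk is masked: it contributes nothing.
        return NEG_INF, 0.0, 0.0
    # Without the guard, m == -inf gives -inf - (-inf) = nan here.
    exps = [math.exp(s - m) for s in scores]
    return m, sum(exps), sum(e * v for e, v in zip(exps, values))


def merge(parts):
    """Combine per-chunk statistics into one attention output."""
    m_all = max(m for m, _, _ in parts)
    num = den = 0.0
    for m, s, w in parts:
        scale = math.exp(m - m_all)  # rescale each chunk to the global max
        num += scale * w
        den += scale * s
    return num / den


def attend(chunks, guard):
    """chunks: list of (scores, values) pairs, one per split of the keys."""
    return merge([partial_softmax(s, v, guard) for s, v in chunks])
```

With one fully masked chunk and one unmasked chunk, the unguarded version returns `nan` while the guarded one returns the average of the unmasked values. The sketch assumes at least one chunk has an unmasked key.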
Removed
- fMHA: Removed `decoder` and `small_k` backends
- profiler: Removed `DetectSlowOpsProfiler` profiler
- Removed compatibility with PyTorch < 2.4
- Removed conda builds for python 3.9
- Removed windows pip wheels for cuda 12.1 and 11.8
torch.compile support, bug fixes & more
Pre-built binary wheels require PyTorch 2.4.0
Added
- fMHA: PagedBlockDiagonalGappyKeysMask
- fMHA: heterogeneous queries in triton_splitk
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for merge_attentions
- fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
- fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (eg memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp)))
- fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
- fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
- 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
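Because `attn_bias` is no longer moved to the right device automatically, callers that build a tensor bias elsewhere may want to align it themselves before calling `memory_efficient_attention`. The helper below is a hypothetical sketch, not part of xFormers; it relies only on the `.device` attribute and `.to()` method that `torch.Tensor` provides, and is intended for tensor biases.

```python
def align_bias_device(query, attn_bias):
    """Return attn_bias on query's device, moving it only when needed.

    Hypothetical helper: expects objects with a `.device` attribute and a
    `.to(device)` method, as torch.Tensor has.
    """
    if attn_bias is None or attn_bias.device == query.device:
        return attn_bias
    return attn_bias.to(query.device)
```

A call site could then look like `memory_efficient_attention(q, k, v, attn_bias=align_bias_device(q, bias))`. Moving the bias once, outside the training loop, avoids repeating the copy on every step.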
[v0.0.27] torch.compile support, bug fixes & more
Added
- fMHA: `PagedBlockDiagonalGappyKeysMask`
- fMHA: heterogeneous queries in `triton_splitk`
- fMHA: support for paged attention in flash
- fMHA: Added backwards pass for `merge_attentions`
- fMHA: Added `torch.compile` support for 3 biases (`LowerTriangularMask`, `LowerTriangularMaskWithTensorBias` and `BlockDiagonalMask`) - some might require PyTorch 2.4
- fMHA: Added `torch.compile` support in `memory_efficient_attention` when passing the flash operator explicitly (eg `memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))`)
- fMHA: `memory_efficient_attention` now expects its `attn_bias` argument to be on the same device as the other input tensor. Previously, it would convert the bias to the right device.
- fMHA: `AttentionBias` subclasses are now constructed by default on the `cuda` device if available - they used to be created on the CPU device
- 2:4 sparsity: Added `xformers.ops.sp24.sparsify24_ste` for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
Improved
- fMHA: Fixed out-of-bounds reading for Split-K triton implementation
- Profiler: fix bug with modules that take a single tuple as argument
- Profiler: Added manual trigger for a profiling step, by creating a `trigger` file in the profiling directory
Removed
- Removed support for PyTorch version older than 2.2.0
2:4 sparsity & `torch.compile`-ing memory_efficient_attention
Pre-built binary wheels require PyTorch 2.3.0
Added
- [2:4 sparsity] Added support for Straight-Through Estimator for `sparsify24` gradient (`GRADIENT_STE`)
- [2:4 sparsity] `sparsify24_like` now supports the cuSparseLt backend, and the STE gradient
- Basic support for `torch.compile` for the `memory_efficient_attention` operator. Currently only supports Flash-Attention, and without any bias provided. We want to expand this coverage progressively.
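For readers unfamiliar with the terms, the pure-Python sketch below shows what a 2:4 sparsity pattern and a Straight-Through Estimator gradient look like on a flat list of values. It assumes magnitude-based selection (keep the 2 largest of every 4), which may differ from the rule `sparsify24` actually applies; the function names are hypothetical.

```python
def sparsify_2_4(row):
    """Zero the 2 smallest-magnitude entries in every group of 4 values."""
    if len(row) % 4:
        raise ValueError("length must be a multiple of 4")
    out, mask = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes (ties resolved by position).
        keep = set(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        for i, x in enumerate(group):
            mask.append(i in keep)
            out.append(x if i in keep else 0.0)
    return out, mask


def backward(grad_out, mask, straight_through):
    """Gradient with respect to the dense input of sparsify_2_4."""
    if straight_through:
        # STE: pretend the pruning was the identity and pass gradients through.
        return list(grad_out)
    # Otherwise only the kept positions receive a gradient.
    return [g if m else 0.0 for g, m in zip(grad_out, mask)]
```

With STE, pruned weights keep receiving gradient updates, so they can grow back and be selected in a later step.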
Improved
- merge_attentions no longer needs inputs to be stacked.
- fMHA: triton_splitk now supports additive bias
- fMHA: benchmark cleanup