Releases · facebookresearch/xformers
`v0.0.25.post1`: Building binaries for PyTorch 2.2.2
Pre-built binary wheels require PyTorch 2.2.2
2:4 sparsity, fused sequence parallel, torch compile & more
Pre-built binary wheels require PyTorch 2.2.0
Added
- Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron `ColumnParallelLinear` and `RowParallelLinear` modules. They support fusing communication and computation for sequence parallelism, making the communication effectively free.
- Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with `torch.compile`, see `xformers.ops.sparsify24` (a usage sketch follows this list).
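
Below is a minimal sketch of dynamic 2:4 weight sparsification in a training step. It assumes `xformers.ops.sparsify24` accepts a dense half-precision tensor and returns a 2:4-sparse tensor that works with `F.linear` and autograd; the exact signature and constraints are defined by the API documentation, not this snippet.

```python
# Sketch: dynamically sparsify a weight to 2:4 format inside a training step.
# Assumes sparsify24(dense_tensor) returns a 2:4-sparse tensor usable in F.linear;
# shapes, dtype, and device requirements here are illustrative assumptions.
import torch
import torch.nn.functional as F
import xformers.ops as xops

device, dtype = "cuda", torch.float16          # 2:4 kernels target recent GPUs
w = torch.randn(4096, 4096, device=device, dtype=dtype, requires_grad=True)
x = torch.randn(8, 4096, device=device, dtype=dtype)

w_sparse = xops.sparsify24(w)                  # fast dense -> 2:4-sparse conversion
y = F.linear(x, w_sparse)                      # matmul using the sparsified weight
y.sum().backward()                             # gradients flow back to the dense weight
```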
Improved
- Make selective activation checkpointing compatible with `torch.compile`.
Removed
- Triton kernels now require a GPU with compute capability 8.0 or higher (A100 or newer). This is because newer versions of Triton no longer support older GPUs correctly.
- Removed support for PyTorch versions older than 2.1.0
Binary builds for PyTorch 2.1.2
Binary wheels and conda binary builds for PyTorch 2.1.2.
Users who need a previous version of PyTorch can either:
- Install a previous version of xFormers
- Build from source
Bugfixes/improvements in `memory_efficient_attention`
Pre-built binary wheels require PyTorch 2.1.1
Fixed
- fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query with `length % 64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
Added
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal; see the sketch after this list)
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
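
As a minimal sketch of the bottom-right-aligned causal variant, the snippet below runs `memory_efficient_attention` with `LowerTriangularFromBottomRightMask` on a query that is shorter than the key/value sequence (the typical decoding setup). The import path and the (batch, seq_len, heads, head_dim) layout are assumptions; check the fMHA documentation for your version.

```python
# Sketch: bottom-right-aligned causal masking, useful when the query covers only
# the last positions of a longer key/value sequence (e.g. incremental decoding).
# Import path and tensor layout (B, seq_len, heads, head_dim) are assumed.
import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import LowerTriangularFromBottomRightMask

device, dtype = "cuda", torch.float16
B, H, K = 2, 8, 64
M_q, M_kv = 16, 128                      # 16 new queries against 128 cached keys
q = torch.randn(B, M_q, H, K, device=device, dtype=dtype)
k = torch.randn(B, M_kv, H, K, device=device, dtype=dtype)
v = torch.randn(B, M_kv, H, K, device=device, dtype=dtype)

# Query i may attend to keys [0, M_kv - M_q + i]: the causal diagonal is aligned
# to the bottom-right corner of the attention matrix.
out = xops.memory_efficient_attention(
    q, k, v, attn_bias=LowerTriangularFromBottomRightMask()
)
```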
Removed
- Removed `xformers.triton.sum_strided`
[0.0.22.post7] Wheels for Flash-Attention on Windows [cu121]
We also added support for cu118/cu121 - we will update the README once the wheels are ready
[0.0.22.post4] Build binaries for PyTorch 2.1.0 / CUDA 12.1
Also adds back support for Flash-Attention on Windows (only for the CUDA 12.1 build) - the wheels won't include FA on Windows for now, as we have some issues to fix in our CI first (should be done in about a week, hopefully)
Faster LLM inference with Flash-Decoding, Local attention
Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details and the sketch after this list
- fMHA: Added experimental support for local attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
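
The snippet below is a minimal sketch of the 5-dimensional (B, M, G, H, K) input format for Grouped-Query Attention. The layout (batch, sequence, key/value head groups, query heads per group, head dim) and the use of `expand` to share key/value heads across query heads are assumptions based on the documentation; treat it as illustrative rather than canonical.

```python
# Sketch: Grouped-Query Attention via 5-D (B, M, G, H, K) inputs.
# G = number of key/value head groups, H = query heads per group.
# The layout and the zero-stride expand() for shared K/V heads are assumptions.
import torch
import xformers.ops as xops

device, dtype = "cuda", torch.float16
B, M, K = 2, 1024, 128                   # batch, sequence length, head dim
G, H = 2, 8                              # 2 K/V groups, 8 query heads per group

q = torch.randn(B, M, G, H, K, device=device, dtype=dtype)
# One key/value head per group, shared across the H query heads via a
# zero-stride expand (a view, no memory copy).
k = torch.randn(B, M, G, 1, K, device=device, dtype=dtype).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device=device, dtype=dtype).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)   # shape (B, M, G, H, K)
```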
Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue #848)
Flash v2, attention for decoding and H100 support
[0.0.21] - 2023-08-18
Improved
- fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
Bug fixes
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
Breaking changes
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias (see the sketch after this list)
- Removed `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
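
Below is a minimal sketch of adapting an additive bias to the new requirements: the bias carries explicit batch and head dimensions, and any broadcasting is done by the caller with `.expand`. Shapes, dtype, and alignment constraints are assumptions; consult the `memory_efficient_attention` documentation for the exact requirements.

```python
# Sketch: explicitly expanding an additive attention bias over batch/head dims,
# since the library no longer broadcasts them. Shapes/dtype are illustrative only.
import torch
import xformers.ops as xops

device, dtype = "cuda", torch.float16
B, H, M, K = 2, 8, 512, 64
q = torch.randn(B, M, H, K, device=device, dtype=dtype)
k = torch.randn(B, M, H, K, device=device, dtype=dtype)
v = torch.randn(B, M, H, K, device=device, dtype=dtype)

# A per-position bias shared by all batches and heads: give it explicit
# batch/head dimensions and expand (a view, no copy) instead of relying on
# implicit broadcasting.
bias = torch.randn(M, M, device=device, dtype=dtype)
bias = bias[None, None, :, :].expand(B, H, M, M)

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```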
Added
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added a backend specialized for decoding that does not use Tensor Cores - useful when not using multiquery
NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.
Performance improvements for `memory_efficient_attention`
[0.0.20] - 2023-05-23
Improved
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim > 64`
- fMHA: Updated Flash-Attention to `v1.0.5`
Added
- fMHA now runs on H100 (support is experimental)
Bugfixes & perf improvement for `memory_efficient_attention`
[0.0.19] - 2023-04-28
Added
- Display the `nvcc` version used to compile `xformers` in `python -m xformers.info`
Fixed
- Fixed performance regression with `nvcc>11.6` (#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (#722)
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)