Releases · facebookresearch/xformers
`v0.0.25.post1`: Building binaries for PyTorch 2.2.2
Pre-built binary wheels require PyTorch 2.2.2
2:4 sparsity, fused sequence parallel, torch compile & more
Pre-built binary wheels require PyTorch 2.2.0
Added
- Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron `ColumnParallelLinear` and `RowParallelLinear` modules. They support fusing communication and computation for sequence parallelism, making the communication effectively free.
- Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with `torch.compile`, see `xformers.ops.sparsify24` (a usage sketch follows this list).
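
Below is a minimal sketch of dynamic 2:4 weight sparsification in a training step. It assumes `xformers.ops.sparsify24` accepts a dense half-precision tensor and returns a 2:4-sparse tensor that works with `F.linear` and autograd; the exact signature and constraints are defined by the API documentation, not this snippet.

```python
# Sketch: dynamically sparsify a weight to 2:4 format inside a training step.
# Assumes sparsify24(dense_tensor) returns a 2:4-sparse tensor usable in F.linear;
# shapes, dtype, and device requirements here are illustrative assumptions.
import torch
import torch.nn.functional as F
import xformers.ops as xops

device, dtype = "cuda", torch.float16          # 2:4 kernels target recent GPUs
w = torch.randn(4096, 4096, device=device, dtype=dtype, requires_grad=True)
x = torch.randn(8, 4096, device=device, dtype=dtype)

w_sparse = xops.sparsify24(w)                  # fast dense -> 2:4-sparse conversion
y = F.linear(x, w_sparse)                      # matmul using the sparsified weight
y.sum().backward()                             # gradients flow back to the dense weight
```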
Improved
- Make selective activation checkpointing compatible with `torch.compile`.
Removed
- Triton kernels now require a GPU with compute capability 8.0 or higher (A100 or newer). This is because newer versions of Triton no longer support older GPUs correctly.
- Removed support for PyTorch versions older than 2.1.0
Binary builds for PyTorch 2.1.2
Binary wheels and conda binary builds for PyTorch 2.1.2.
Users who need a previous version of PyTorch can either:
- Install a previous version of xFormers
- Build from source
Bugfixes/improvements in `memory_efficient_attention`
Pre-built binary wheels require PyTorch 2.1.1
Fixed
- fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This would happen with MQA when one sequence has a query with `length % 64 == 1`
- fMHA: Updated Flash-Attention to v2.3.6 - this fixes a performance regression in causal backward passes, and now supports `BlockDiagonalCausalWithOffsetPaddedKeysMask`
Added
- fMHA: Added `LocalAttentionFromBottomRightMask` (local)
- fMHA: Added `LowerTriangularFromBottomRightMask` (causal; see the sketch after this list)
- fMHA: Added `LowerTriangularFromBottomRightLocalAttentionMask` (local + causal)
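
As a minimal sketch of the bottom-right-aligned causal variant, the snippet below runs `memory_efficient_attention` with `LowerTriangularFromBottomRightMask` on a query that is shorter than the key/value sequence (the typical decoding setup). The import path and the (batch, seq_len, heads, head_dim) layout are assumptions; check the fMHA documentation for your version.

```python
# Sketch: bottom-right-aligned causal masking, useful when the query covers only
# the last positions of a longer key/value sequence (e.g. incremental decoding).
# Import path and tensor layout (B, seq_len, heads, head_dim) are assumed.
import torch
import xformers.ops as xops
from xformers.ops.fmha.attn_bias import LowerTriangularFromBottomRightMask

device, dtype = "cuda", torch.float16
B, H, K = 2, 8, 64
M_q, M_kv = 16, 128                      # 16 new queries against 128 cached keys
q = torch.randn(B, M_q, H, K, device=device, dtype=dtype)
k = torch.randn(B, M_kv, H, K, device=device, dtype=dtype)
v = torch.randn(B, M_kv, H, K, device=device, dtype=dtype)

# Query i may attend to keys [0, M_kv - M_q + i]: the causal diagonal is aligned
# to the bottom-right corner of the attention matrix.
out = xops.memory_efficient_attention(
    q, k, v, attn_bias=LowerTriangularFromBottomRightMask()
)
```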
Removed
- Removed `xformers.triton.sum_strided`
[0.0.22.post7] Wheels for Flash-Attention on Windows [cu121]
We also added support for cu118/cu121 - we will update the README once the wheels are ready
[0.0.22.post4] Build binaries for PyTorch 2.1.0 / CUDA 12.1
Also adds back support for Flash-Attention on Windows (only for the CUDA 12.1 build) - the wheels won't include FA on Windows for now, as we have some issues to fix in our CI first (should be done in about a week, hopefully)
Faster LLM inference with Flash-Decoding, Local attention
Fixed
- fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
Added
- fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to `memory_efficient_attention`; see the documentation for more details and the sketch after this list
- fMHA: Added experimental support for local attention biases to `memory_efficient_attention`
- Added an example of efficient LLaMa decoding using xformers operators
- Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
- Added an efficient RoPE implementation in Triton, to be used in LLM decoding
- Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
- `xformers.info` now indicates the Flash-Attention version used
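
The snippet below is a minimal sketch of the 5-dimensional (B, M, G, H, K) input format for Grouped-Query Attention. The layout (batch, sequence, key/value head groups, query heads per group, head dim) and the use of `expand` to share key/value heads across query heads are assumptions based on the documentation; treat it as illustrative rather than canonical.

```python
# Sketch: Grouped-Query Attention via 5-D (B, M, G, H, K) inputs.
# G = number of key/value head groups, H = query heads per group.
# The layout and the zero-stride expand() for shared K/V heads are assumptions.
import torch
import xformers.ops as xops

device, dtype = "cuda", torch.float16
B, M, K = 2, 1024, 128                   # batch, sequence length, head dim
G, H = 2, 8                              # 2 K/V groups, 8 query heads per group

q = torch.randn(B, M, G, H, K, device=device, dtype=dtype)
# One key/value head per group, shared across the H query heads via a
# zero-stride expand (a view, no memory copy).
k = torch.randn(B, M, G, 1, K, device=device, dtype=dtype).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device=device, dtype=dtype).expand(B, M, G, H, K)

out = xops.memory_efficient_attention(q, k, v)   # shape (B, M, G, H, K)
```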
Removed
- fMHA: Removed `smallK` backend support for CPU. `memory_efficient_attention` only works for CUDA/GPU tensors now
- DEPRECATION: Many classes in `xformers.factory`, `xformers.triton` and `xformers.components` have been or will be deprecated soon (see tracking issue #848)
Flash v2, attention for decoding and H100 support
[0.0.21] - 2023-08-18
Improved
- fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward pass and backward pass. This implementation is now used by default when it's available
Bug fixes
- fMHA/cutlass: Fix potential race condition in the FW/BW passes
- fMHA/cutlass: Fix `attn_bias` stride overflow for very long sequences (>32k)
- `LowerTriangularMask` is now backward compatible with older xformers versions
Breaking changes
- `memory_efficient_attention` now expects the `attn_bias` argument to have a head dimension
- `memory_efficient_attention` no longer broadcasts the batch/head dimensions of `attn_bias`. Please use `.expand` if you need to broadcast the bias (see the sketch after this list)
- Removed `causal_diagonal` argument from `BlockDiagonalCausalWithOffsetPaddedKeysMask`
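
Below is a minimal sketch of adapting an additive bias to the new requirements: the bias carries explicit batch and head dimensions, and any broadcasting is done by the caller with `.expand`. Shapes, dtype, and alignment constraints are assumptions; consult the `memory_efficient_attention` documentation for the exact requirements.

```python
# Sketch: explicitly expanding an additive attention bias over batch/head dims,
# since the library no longer broadcasts them. Shapes/dtype are illustrative only.
import torch
import xformers.ops as xops

device, dtype = "cuda", torch.float16
B, H, M, K = 2, 8, 512, 64
q = torch.randn(B, M, H, K, device=device, dtype=dtype)
k = torch.randn(B, M, H, K, device=device, dtype=dtype)
v = torch.randn(B, M, H, K, device=device, dtype=dtype)

# A per-position bias shared by all batches and heads: give it explicit
# batch/head dimensions and expand (a view, no copy) instead of relying on
# implicit broadcasting.
bias = torch.randn(M, M, device=device, dtype=dtype)
bias = bias[None, None, :, :].expand(B, H, M, M)

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```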
Added
- Binary wheels on pypi/conda now contain H100 kernels
- fMHA: Added a backend specialized for decoding that does not use Tensor Cores - useful when not using multiquery
NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.
Performance improvements for `memory_efficient_attention`
[0.0.20] - 2023-05-23
Improved
- fMHA/cutlass (backward): Massive performance improvements when `batch_size * num_heads` is low (10x+)
- fMHA/cutlass: Further performance improvements for both the forward & backward kernels
- fMHA (backward): Now dispatching to cutlass when `embed_dim > 64`
- fMHA: Updated Flash-Attention to `v1.0.5`
Added
- fMHA now runs on H100 (support is experimental)
Bugfixes & perf improvement for `memory_efficient_attention`
[0.0.19] - 2023-04-28
Added
- Display the `nvcc` version used to compile `xformers` in `python -m xformers.info`
Fixed
- Fixed performance regression with `nvcc>11.6` (#712)
- fMHA/cutlass: Fixed `nan` in the output when using a `torch.Tensor` with `-inf` prefixes as `attn_bias` (#722)
- fMHA/cutlass: Fixed `nan` in the output when the sequence length is larger than `2 ** 15` (#719)
- fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
- fMHA/cutlass: The kernels are now deterministic
- fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)