Releases: facebookresearch/xformers

`v0.0.25.post1`: Building binaries for PyTorch 2.2.2

29 Mar 14:05
7fffd3d

Pre-built binary wheels require PyTorch 2.2.2

2:4 sparsity, fused sequence parallel, torch compile & more

31 Jan 08:42

Pre-built binary wheels require PyTorch 2.2.0

Added

  • Added components for model/sequence parallelism, as near-drop-in replacements for FairScale/Megatron ColumnParallelLinear and RowParallelLinear modules. They support fusing communication and computation for sequence parallelism, making the communication effectively free.
  • Added kernels for training models with 2:4 sparsity. We introduced a very fast kernel for converting a matrix A into 2:4-sparse format, which can be used during training to dynamically sparsify weights, activations, etc. xFormers also provides an API that is compatible with torch.compile; see xformers.ops.sparsify24 (a usage sketch follows this list).
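For illustration, a minimal sketch of dynamic weight sparsification during training. The exact signature of xformers.ops.sparsify24 and the operations supported by its return value are assumptions here, not taken from this note; check the xFormers documentation:

```python
import torch
import xformers.ops as xops

# Dense master weight; sparsified on the fly each forward pass (assumed usage).
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16, requires_grad=True)
x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

w_sparse = xops.sparsify24(w)   # prune w to a 2:4 pattern (signature assumed)
y = x @ w_sparse.t()            # use the sparsified weight in the matmul
y.sum().backward()              # gradients are assumed to flow back to the dense w
```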

Improved

  • Made selective activation checkpointing compatible with torch.compile.

Removed

  • Triton kernels now require a GPU with compute capability of at least 8.0 (A100 or newer), because newer versions of Triton do not correctly support older GPUs.
  • Removed support for PyTorch versions older than 2.1.0.

Binary builds for PyTorch 2.1.2

15 Dec 12:14

Binary wheels and conda binary builds for PyTorch 2.1.2.
Users who need a previous version of PyTorch can either:

  • Install a previous version of xFormers
  • Build from source

Bugfixes/improvements in `memory_efficient_attention`

06 Dec 16:05

Pre-built binary wheels require PyTorch 2.1.1

Fixed

  • fMHA: Fixed a bug in the cutlass backend forward pass where the logsumexp was not correctly calculated, resulting in wrong results in the backward pass. This could happen with MQA when a sequence has a query whose length satisfies length % 64 == 1
  • fMHA: Updated Flash-Attention to v2.3.6. This fixes a performance regression in causal backward passes and adds support for BlockDiagonalCausalWithOffsetPaddedKeysMask

Added

  • fMHA: Added LocalAttentionFromBottomRightMask (local)
  • fMHA: Added LowerTriangularFromBottomRightMask (causal)
  • fMHA: Added LowerTriangularFromBottomRightLocalAttentionMask (local + causal)
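For illustration, a hedged sketch of passing one of these masks to memory_efficient_attention. Tensor shapes are illustrative, and the local variants are assumed to take window-size arguments; check the fMHA documentation:

```python
import torch
from xformers.ops import memory_efficient_attention
from xformers.ops.fmha.attn_bias import LowerTriangularFromBottomRightMask

B, Mq, Mkv, H, K = 2, 16, 128, 8, 64  # queries are a suffix of a longer kv sequence
q = torch.randn(B, Mq, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, Mkv, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, Mkv, H, K, device="cuda", dtype=torch.float16)

# Causal mask aligned to the bottom-right corner of the attention matrix,
# so the last query attends to every key (a decoding-style setup).
out = memory_efficient_attention(
    q, k, v, attn_bias=LowerTriangularFromBottomRightMask()
)
```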

Removed

  • Removed xformers.triton.sum_strided

[0.0.22.post7] Wheels for Flash-Attention on Windows [cu121]

25 Oct 12:54

This release also adds support for cu118/cu121; we will update the README once the wheels are ready.

[0.0.22.post4] Build binaries for PyTorch 2.1.0 / CUDA 12.1

13 Oct 16:41
16e4245

Also adds back support for Flash-Attention on Windows (only for the CUDA 12.1 build). The wheels won't include Flash-Attention on Windows for now, as we have some issues to fix in our CI first (hopefully done in about a week).

Faster LLM inference with Flash-Decoding, Local attention

27 Sep 12:30

Fixed

  • fMHA: Backward pass now works in PyTorch deterministic mode (although slower)
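A minimal sketch of enabling deterministic mode before the backward pass (torch.use_deterministic_algorithms is standard PyTorch; the tensor shapes and the attention call are illustrative only):

```python
import torch
from xformers.ops import memory_efficient_attention

# Enable PyTorch deterministic mode; the fMHA backward pass now supports it,
# at some cost in speed.
torch.use_deterministic_algorithms(True)

q = torch.randn(2, 128, 8, 64, device="cuda", dtype=torch.float16, requires_grad=True)
out = memory_efficient_attention(q, q, q)
out.sum().backward()
```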

Added

  • fMHA: Added experimental support for Multi-Query Attention and Grouped-Query Attention. This is handled by passing 5-dimensional inputs to memory_efficient_attention; see the documentation for more details and the sketch after this list
  • fMHA: Added experimental support for Local Attention biases to memory_efficient_attention
  • Added an example of efficient LLaMa decoding using xformers operators
  • Added Flash-Decoding for faster attention during Large Language Model (LLM) decoding - up to 50x faster for long sequences (token decoding up to 8x faster end-to-end)
  • Added an efficient rope implementation in triton, to be used in LLM decoding
  • Added selective activation checkpointing, which gives fine-grained control of which activations to keep and which activations to recompute
  • xformers.info now indicates the Flash-Attention version used
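For illustration, a hedged sketch of Grouped-Query Attention via 5-dimensional inputs. The [batch, seq_len, groups, heads_per_group, head_dim] layout and the .expand trick below are assumptions about the convention; consult the documentation for the exact details:

```python
import torch
from xformers.ops import memory_efficient_attention

B, M, G, H, K = 2, 1024, 4, 8, 64  # 4 kv groups, 8 query heads per group
q = torch.randn(B, M, G, H, K, device="cuda", dtype=torch.float16)
# One key/value head per group, shared across the H query heads via expand
# (a stride-0 broadcast, so the kv tensors are not materialized H times).
k = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)
v = torch.randn(B, M, G, 1, K, device="cuda", dtype=torch.float16).expand(B, M, G, H, K)

out = memory_efficient_attention(q, k, v)  # output shape [B, M, G, H, K]
```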

Removed

  • fMHA: Removed smallK backend support for CPU. memory_efficient_attention only works for CUDA/GPU tensors now
  • DEPRECATION: Many classes in xformers.factory, xformers.triton and xformers.components have been or will be deprecated soon (see tracking issue #848)

Flashv2, attention for decoding and H100 support

18 Aug 14:34

[0.0.21] - 2023-08-18

Improved

  • fMHA: Updated flash-attention to v2, with massive performance improvements for both the forward and backward passes. This implementation is now used by default when available

Bug fixes

  • fMHA/cutlass: Fix potential race condition in the FW/BW passes
  • fMHA/cutlass: Fix attn_bias stride overflow for very long sequences (>32k)
  • LowerTriangularMask is now backward compatible with older xformers versions

Breaking changes

  • memory_efficient_attention now expects the attn_bias argument to have a head dimension
  • memory_efficient_attention no longer broadcasts the batch/head dimensions of attn_bias. Please use .expand if you need to broadcast the bias (a sketch follows this list)
  • Removed the causal_diagonal argument from BlockDiagonalCausalWithOffsetPaddedKeysMask
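For illustration, a hedged sketch of adapting a plain tensor bias to these changes. Shapes and any alignment requirements are assumptions; see the memory_efficient_attention documentation:

```python
import torch
from xformers.ops import memory_efficient_attention

B, M, H, K = 2, 128, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)

# The bias must now carry an explicit head dimension, and the batch/head
# dimensions are no longer broadcast implicitly: expand them before the call.
bias = torch.randn(1, 1, M, M, device="cuda", dtype=torch.float16)
bias = bias.expand(B, H, M, M)

out = memory_efficient_attention(q, k, v, attn_bias=bias)
```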

Added

  • Binary wheels on pypi/conda now contain H100 kernels
  • fMHA: Added backend specialized for decoding that does not use TensorCores - useful when not using multiquery

NOTE: Binary wheels are now provided only for PyTorch 2 with CUDA 11.8. It is still possible to use xFormers with older versions of PyTorch by building from source or using conda.

Performance improvements for `memory_efficient_attention`

23 May 21:04

[0.0.20] - 2023-05-23

Improved

  • fMHA/cutlass (backward): Massive performance improvements when batch_size * num_heads is low (10x+)
  • fMHA/cutlass: Further performance improvements for both the forward & backward kernels
  • fMHA (backward): Now dispatching to cutlass when embed_dim>64
  • fMHA: Updated Flash-Attention to v1.0.5

Added

  • fMHA now runs on H100 (support is experimental)

Bugfixes & perf improvement for `memory_efficient_attention`

28 Apr 08:35

[0.0.19] - 2023-04-28

Added

  • Display the nvcc version used to compile xFormers in python -m xformers.info

Fixed

  • Fixed performance regression with nvcc>11.6 (#712)
  • fMHA/cutlass: Fixed nan in the output when using a torch.Tensor with -inf prefixes as attn_bias (#722)
  • fMHA/cutlass: Fixed nan in the output when the sequence length is larger than 2 ** 15 (#719)
  • fMHA/cutlass: Significant performance improvements (up to 2x) for both the forward pass and backward pass
  • fMHA/cutlass: The kernels are now deterministic
  • fMHA/cutlass: Fixed backward pass correctness when using dropout (#724)