
Releases: facebookresearch/xformers

[v0.0.29.post1] Fix Flash2 on Windows

31 Dec 10:10

Enabling FAv3 by default, removing deprecated components

27 Dec 09:39

Pre-built binary wheels require PyTorch 2.5.1

Improved:

  • [fMHA] Creating a LowerTriangularMask no longer creates a CUDA tensor
  • [fMHA] Updated Flash-Attention to v2.7.2.post1
  • [fMHA] Flash-Attention v3 will now be used by memory_efficient_attention by default when available, unless an operator is enforced with the op keyword argument (see the sketch after this list). Switching from Flash2 to Flash3 can make transformer training ~10% faster end-to-end on H100s
  • [fMHA] Fixed a performance regression with the cutlass backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
  • Fixed swiglu operator compatibility with torch.compile on PyTorch 2.6
  • Fixed activation checkpointing of SwiGLU when AMP is enabled (#1152)
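
As a rough illustration of the dispatch behaviour above, here is a minimal sketch of enforcing a specific operator pair via the op keyword argument. Shapes and dtypes are placeholders, and the flash.FwOp/flash.BwOp naming follows what these notes use elsewhere.

```python
# Minimal sketch: by default the dispatcher can now pick Flash-Attention v3
# when it is available; passing `op=` enforces a specific operator pair instead.
import torch
import xformers.ops as xops
from xformers.ops import fmha

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)  # (B, M, H, K)
k, v = torch.randn_like(q), torch.randn_like(q)

# Default dispatch; LowerTriangularMask no longer allocates a CUDA tensor on creation.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Enforce Flash-Attention v2 explicitly.
out_flash2 = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    op=(fmha.flash.FwOp, fmha.flash.BwOp),
)
```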

Removed:

  • Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
  • Removed unmaintained/deprecated components in xformers.components.* (see #848)

`v0.0.28.post3` - build for PyTorch 2.5.1

30 Oct 17:56

[0.0.28.post3] - 2024-10-30

Pre-built binary wheels require PyTorch 2.5.1

`v0.0.28.post2` - build for PyTorch 2.5.0

22 Oct 11:13

[0.0.28.post2] - 2024-10-18

Pre-built binary wheels require PyTorch 2.5.0

`v0.0.28.post1` - fixing upload for CUDA 12.4 wheels

13 Sep 15:52

[0.0.28.post1] - 2024-09-13

Properly upload wheels for CUDA 12.4

FAv3, profiler update & AMD

12 Sep 15:49

Pre-built binary wheels require PyTorch 2.4.1

Added

  • Added wheels for CUDA 12.4
  • Added conda builds for Python 3.11
  • Added wheels for ROCm 6.1

Improved

  • Profiler: Fixed the computation of FLOPs for attention when using xFormers
  • Profiler: Fixed the MFU/HFU calculation when multiple dtypes are used
  • Profiler: Trace analysis to compute MFU & HFU is now much faster
  • fMHA/splitK: Fixed NaNs in the output when using a torch.Tensor bias in which many consecutive keys are masked with -inf (see the sketch after this list)
  • Updated Flash-Attention to v2.6.3 when building from source
  • When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend; in other words, the cutlass forward can no longer be used with the flash backward.
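
For context on the splitK fix above, a minimal sketch of an additive torch.Tensor bias in which a long run of consecutive keys is masked with -inf. Shapes are illustrative, and which backend actually runs depends on the dispatcher.

```python
# Minimal sketch: an additive tensor bias of shape (B, H, Mq, Mk) where the
# trailing keys are masked with -inf for every query.
import torch
import xformers.ops as xops

B, M, H, K = 2, 128, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

bias = torch.zeros(B, H, M, M, device="cuda", dtype=torch.float16)
bias[..., 64:] = float("-inf")  # mask out the last 64 keys

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```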

Removed

  • fMHA: Removed decoder and small_k backends
  • profiler: Removed DetectSlowOpsProfiler profiler
  • Removed compatibility with PyTorch < 2.4
  • Removed conda builds for python 3.9
  • Removed windows pip wheels for cuda 12.1 and 11.8

torch.compile support, bug fixes & more

26 Jul 15:41

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))) - see the sketch after this list
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
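
A minimal sketch of the torch.compile usage described above, passing the flash operators explicitly together with one of the compile-supported bias classes. Shapes and dtypes are placeholders.

```python
# Minimal sketch: compile memory_efficient_attention with the flash operators
# passed explicitly and a bias class that has torch.compile support.
import torch
import xformers.ops as xops
from xformers.ops import fmha

bias = xops.LowerTriangularMask()  # one of the compile-supported bias classes

def attention(q, k, v):
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=bias, op=(fmha.flash.FwOp, fmha.flash.BwOp)
    )

compiled_attention = torch.compile(attention)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k, v = torch.randn_like(q), torch.randn_like(q)
out = compiled_attention(q, k, v)
out.sum().backward()  # the flash backward pass runs under torch.compile as well
```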

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: Fixed a bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

torch.compile support, bug fixes & more

25 Jul 11:59

Pre-built binary wheels require PyTorch 2.4.0

The changelog is identical to the release above.

[v0.0.27] torch.compile support, bug fixes & more

09 Jul 16:35

The changelog is identical to the two builds listed above.

2:4 sparsity & `torch.compile`-ing memory_efficient_attention

29 Apr 14:40

Pre-built binary wheels require PyTorch 2.3.0

Added

  • [2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator (see the sketch after this list). For now this only covers Flash-Attention, without any bias provided; we want to expand this coverage progressively.
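
A minimal sketch of what is compilable at this release: the plain memory_efficient_attention call, Flash-Attention path only, with no bias. Shapes and dtypes are placeholders.

```python
# Minimal sketch: torch.compile over memory_efficient_attention, no attn_bias
# (the only configuration covered at this point).
import torch
import xformers.ops as xops

compiled_attention = torch.compile(xops.memory_efficient_attention)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = compiled_attention(q, k, v)
```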

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup