
Releases: facebookresearch/xformers

[v0.0.29.post1] Fix Flash2 on Windows

31 Dec 10:10

Enabling FAv3 by default, removing deprecated components

27 Dec 09:39

Pre-built binary wheels require PyTorch 2.5.1

Improved:

  • [fMHA] Creating a LowerTriangularMask no longer creates a CUDA tensor
  • [fMHA] Updated Flash-Attention to v2.7.2.post1
  • [fMHA] Flash-Attention v3 will now be used by memory_efficient_attention by default when available, unless an operator is enforced with the op keyword argument (see the sketch after this list). Switching from Flash2 to Flash3 can make transformer training ~10% faster end-to-end on H100s
  • [fMHA] Fixed a performance regression with the cutlass backend for the backward pass (#1176) - mostly used on older GPUs (e.g. V100)
  • Fixed swiglu operator compatibility with torch.compile on PyTorch 2.6
  • Fixed activation checkpointing of SwiGLU when AMP is enabled (#1152)
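
As a rough illustration of the dispatch behaviour above, here is a minimal sketch of enforcing a specific operator pair via the op keyword argument. Shapes and dtypes are placeholders, and the flash.FwOp/flash.BwOp naming follows what these notes use elsewhere.

```python
# Minimal sketch: by default the dispatcher can now pick Flash-Attention v3
# when it is available; passing `op=` enforces a specific operator pair instead.
import torch
import xformers.ops as xops
from xformers.ops import fmha

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)  # (B, M, H, K)
k, v = torch.randn_like(q), torch.randn_like(q)

# Default dispatch; LowerTriangularMask no longer allocates a CUDA tensor on creation.
out = xops.memory_efficient_attention(q, k, v, attn_bias=xops.LowerTriangularMask())

# Enforce Flash-Attention v2 explicitly.
out_flash2 = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    op=(fmha.flash.FwOp, fmha.flash.BwOp),
)
```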

Removed:

  • Following PyTorch, xFormers no longer builds binaries for conda. Pip is now the only recommended way to get xFormers
  • Removed unmaintained/deprecated components in xformers.components.* (see #848)

`v0.0.28.post3` - build for PyTorch 2.5.1

30 Oct 17:56

[0.0.28.post3] - 2024-10-30

Pre-built binary wheels require PyTorch 2.5.1

`v0.0.28.post2` - build for PyTorch 2.5.0

22 Oct 11:13

[0.0.28.post2] - 2024-10-18

Pre-built binary wheels require PyTorch 2.5.0

`v0.0.28.post1` - fixing upload for CUDA 12.4 wheels

13 Sep 15:52

[0.0.28.post1] - 2024-09-13

Properly upload wheels for CUDA 12.4

FAv3, profiler update & AMD

12 Sep 15:49

Pre-built binary wheels require PyTorch 2.4.1

Added

  • Added wheels for CUDA 12.4
  • Added conda builds for Python 3.11
  • Added wheels for ROCm 6.1

Improved

  • Profiler: Fixed the computation of FLOPs for attention when using xFormers
  • Profiler: Fixed the MFU/HFU calculation when multiple dtypes are used
  • Profiler: Trace analysis to compute MFU & HFU is now much faster
  • fMHA/splitK: Fixed NaNs in the output when using a torch.Tensor bias in which many consecutive keys are masked with -inf (see the sketch after this list)
  • Updated Flash-Attention to v2.6.3 when building from source
  • When using the most recent version of Flash-Attention, it is no longer possible to mix it with the cutlass backend; in other words, the cutlass forward can no longer be used with the flash backward.
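
For context on the splitK fix above, a minimal sketch of an additive torch.Tensor bias in which a long run of consecutive keys is masked with -inf. Shapes are illustrative, and which backend actually runs depends on the dispatcher.

```python
# Minimal sketch: an additive tensor bias of shape (B, H, Mq, Mk) where the
# trailing keys are masked with -inf for every query.
import torch
import xformers.ops as xops

B, M, H, K = 2, 128, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

bias = torch.zeros(B, H, M, M, device="cuda", dtype=torch.float16)
bias[..., 64:] = float("-inf")  # mask out the last 64 keys

out = xops.memory_efficient_attention(q, k, v, attn_bias=bias)
```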

Removed

  • fMHA: Removed decoder and small_k backends
  • profiler: Removed DetectSlowOpsProfiler profiler
  • Removed compatibility with PyTorch < 2.4
  • Removed conda builds for python 3.9
  • Removed windows pip wheels for cuda 12.1 and 11.8

torch.compile support, bug fixes & more

26 Jul 15:41

Pre-built binary wheels require PyTorch 2.4.0

Added

  • fMHA: PagedBlockDiagonalGappyKeysMask
  • fMHA: heterogeneous queries in triton_splitk
  • fMHA: support for paged attention in flash
  • fMHA: Added backwards pass for merge_attentions
  • fMHA: Added torch.compile support for 3 biases (LowerTriangularMask, LowerTriangularMaskWithTensorBias and BlockDiagonalMask) - some might require PyTorch 2.4
  • fMHA: Added torch.compile support in memory_efficient_attention when passing the flash operator explicitly (e.g. memory_efficient_attention(..., op=(flash.FwOp, flash.BwOp))) - see the sketch after this list
  • fMHA: memory_efficient_attention now expects its attn_bias argument to be on the same device as the other input tensors. Previously, it would convert the bias to the right device.
  • fMHA: AttentionBias subclasses are now constructed by default on the cuda device if available - they used to be created on the CPU device
  • 2:4 sparsity: Added xformers.ops.sp24.sparsify24_ste for Straight Through Estimator (STE) with options to rescale the gradient differently for masked out/kept values
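
A minimal sketch of the torch.compile usage described above, passing the flash operators explicitly together with one of the compile-supported bias classes. Shapes and dtypes are placeholders.

```python
# Minimal sketch: compile memory_efficient_attention with the flash operators
# passed explicitly and a bias class that has torch.compile support.
import torch
import xformers.ops as xops
from xformers.ops import fmha

bias = xops.LowerTriangularMask()  # one of the compile-supported bias classes

def attention(q, k, v):
    return xops.memory_efficient_attention(
        q, k, v, attn_bias=bias, op=(fmha.flash.FwOp, fmha.flash.BwOp)
    )

compiled_attention = torch.compile(attention)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.bfloat16, requires_grad=True)
k, v = torch.randn_like(q), torch.randn_like(q)
out = compiled_attention(q, k, v)
out.sum().backward()  # the flash backward pass runs under torch.compile as well
```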

Improved

  • fMHA: Fixed out-of-bounds reading for Split-K triton implementation
  • Profiler: Fixed a bug with modules that take a single tuple as argument
  • Profiler: Added manual trigger for a profiling step, by creating a trigger file in the profiling directory

Removed

  • Removed support for PyTorch version older than 2.2.0

torch.compile support, bug fixes & more

25 Jul 11:59

Pre-built binary wheels require PyTorch 2.4.0

The changelog is identical to the release above.

[v0.0.27] torch.compile support, bug fixes & more

09 Jul 16:35

The changelog is identical to the two builds listed above.

2:4 sparsity & `torch.compile`-ing memory_efficient_attention

29 Apr 14:40

Pre-built binary wheels require PyTorch 2.3.0

Added

  • [2:4 sparsity] Added support for Straight-Through Estimator for sparsify24 gradient (GRADIENT_STE)
  • [2:4 sparsity] sparsify24_like now supports the cuSparseLt backend and the STE gradient
  • Basic support for torch.compile for the memory_efficient_attention operator (see the sketch after this list). For now this only covers Flash-Attention, without any bias provided; we want to expand this coverage progressively.
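
A minimal sketch of what is compilable at this release: the plain memory_efficient_attention call, Flash-Attention path only, with no bias. Shapes and dtypes are placeholders.

```python
# Minimal sketch: torch.compile over memory_efficient_attention, no attn_bias
# (the only configuration covered at this point).
import torch
import xformers.ops as xops

compiled_attention = torch.compile(xops.memory_efficient_attention)

q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = compiled_attention(q, k, v)
```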

Improved

  • merge_attentions no longer needs inputs to be stacked.
  • fMHA: triton_splitk now supports additive bias
  • fMHA: benchmark cleanup