FBGEMM_GPU v1.0.0 Release Notes

@spcyppt spcyppt released this 19 Oct 20:57

Stable API

Starting with FBGEMM_GPU v1.0.0, we provide stable API support. This covers Table Batched Embedding (TBE) modules, pooled embedding operators and modules, sparse operators, jagged tensor operators, and quantization operators.

  • API backward compatibility guarantees via thorough testing. We guarantee that our stable APIs will be backward compatible within a major version, meaning that the stable APIs for v1.0.0 will be compatible with every future release unless explicitly announced in advance.
  • Enhanced documentation, ensuring that every stable API has comprehensive and up-to-date documentation.
  • Functionality guarantees are provided only through the unit testing framework. We do NOT guarantee any functionality that is NOT explicitly tested and documented in our unit tests.
  • No performance guarantees. However, we are committed to providing support on a best-effort basis.

More details can be found in the stable API documentation.
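
The backward-compatibility guarantee above can be read as a simple version rule: a release is covered by the v1.0.0 stable API if it shares the same major version and is not older than v1.0.0. The sketch below illustrates that reading only. `parse_version` and `is_covered_by_stable_api` are hypothetical helpers, not part of FBGEMM_GPU; they assume plain `major.minor.patch` version strings and do not model exceptions that are announced in advance.

```python
def parse_version(version: str) -> tuple:
    """Parse a plain 'major.minor.patch' string, tolerating a leading 'v'."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def is_covered_by_stable_api(installed: str, baseline: str = "1.0.0") -> bool:
    """Return True if `installed` has the same major version as `baseline`
    and is not older than it (hypothetical reading of the guarantee)."""
    inst = parse_version(installed)
    base = parse_version(baseline)
    return inst[0] == base[0] and inst >= base


if __name__ == "__main__":
    for candidate in ("1.0.0", "1.2.0", "2.0.0", "0.8.0"):
        print(candidate, is_covered_by_stable_api(candidate))
```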

Highlights

Table Batched Embedding (TBE)

  • New optimizer support for TBE Training
  • Enhanced Global weight decay support in TBE
  • Improvement and bug fixes for TBE training and inference modules and sparse operators

For SSD

  • New pipeline prefetching enabled
  • New cache and indices related ops
  • Integration of L2 cache into TBE operators
  • Many improvements to kernel and logging

For CPU

  • New type support for CPU Sequence TBE
  • Kernel improvements and bug fixes

Generative AI

  • Gen AI Ops support and improvement
  • Improvements to Triton-based and CUTLASS-based operators
  • New and optimized FP8 GEMM and quantization operators

Others

  • Optimized MX4 quantization operators
  • New dequantization operator
  • Removal of Python 3.8 support

Better engineering

  • Code refactoring and reorganization for faster builds
  • New and improved tests and benchmarks
  • Improved AMD support

Software Requirements

FBGEMM_GPU v1.0.0 has been tested and is known to work on the following setups:

  • PyTorch: v2.5
  • CUDA: v11.8, 12.1, 12.4
  • Python: v3.9, 3.10, 3.11, 3.12

It is recommended to prepare an isolated environment for installing and running FBGEMM_GPU, such as Conda and/or Docker.
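
As a quick sanity check before installing, the tested matrix above can be encoded in a few lines of Python. This is a minimal sketch, not an official tool: `is_tested_setup` is a hypothetical helper, the version values are copied from the list above, and treating a CPU-only setup (`cuda=None`) as acceptable is an assumption made here for illustration.

```python
import sys
from typing import Optional, Tuple

# Values copied from the "Software Requirements" list above.
TESTED_PYTHON = {(3, 9), (3, 10), (3, 11), (3, 12)}
TESTED_CUDA = {"11.8", "12.1", "12.4"}
TESTED_PYTORCH = "2.5"


def is_tested_setup(python: Tuple[int, int], pytorch: str, cuda: Optional[str]) -> bool:
    """Return True if the given versions fall inside the tested matrix.

    `pytorch` may carry a local suffix such as '2.5.0+cu124'; only the
    major.minor part is compared. `cuda=None` stands for a CPU-only setup
    (assumption: accepted here without a CUDA check).
    """
    pytorch_major_minor = ".".join(pytorch.split("+")[0].split(".")[:2])
    cuda_ok = cuda is None or cuda in TESTED_CUDA
    return tuple(python) in TESTED_PYTHON and pytorch_major_minor == TESTED_PYTORCH and cuda_ok


if __name__ == "__main__":
    # Check the running interpreter against an example PyTorch/CUDA pair.
    print(is_tested_setup(sys.version_info[:2], "2.5.0", "12.4"))
```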

Availability

FBGEMM_GPU can be fetched directly from PyPI:

# FBGEMM_GPU CUDA variant (only the CUDA 12.4 variant is available)
pip install fbgemm-gpu==1.0.0

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu-cpu==1.0.0

Alternatively, it can be fetched from PyTorch PIP:

# FBGEMM_GPU CUDA variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu118/
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cu121/

# FBGEMM_GPU CPU variant
pip install fbgemm-gpu==1.0.0 --index-url https://download.pytorch.org/whl/cpu
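
For scripted environments, the PyTorch PIP commands above follow one pattern that differs only in the variant suffix of the index URL. The sketch below assembles that command string. `pip_install_command` is a hypothetical helper; the variant names are the ones listed above, and it only builds the string without running pip.

```python
PYTORCH_INDEX = "https://download.pytorch.org/whl"
# Variants taken from the commands listed above.
VARIANTS = ("cu118", "cu121", "cpu")


def pip_install_command(variant: str, version: str = "1.0.0") -> str:
    """Build the pip command for a given PyTorch PIP index variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return f"pip install fbgemm-gpu=={version} --index-url {PYTORCH_INDEX}/{variant}/"


if __name__ == "__main__":
    for name in VARIANTS:
        print(pip_install_command(name))
```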

Changes

Table batched embedding (TBE) operators

For GPU

  • [New] Ensemble adagrad optimizer (#3197, #2955, #2954, #3161, #3091, #2981, #2889, #3180, #3158)
  • [New] Bounds check in prefetch in TBE training (#3015)
  • [New] Method to update internal hyperparameters for FBGEMM TBE (#3025)
  • [Improvement] Enhanced Global Weight Decay and state tracking (#2904, #2897, #2882, #2896, #2890, #2884, #2883)
  • [Improvement] masked_index_* values index type fix (#2979)
  • [Improvement] generate_vbe_metadata fixes (#3095, #3087)
  • [Improvement] Fixes on the efficiency of VBE TBE forward due to blocking D2H copy (#2862)
  • [Improvement] Work around offsets and indices type mismatch in TBE training (#3037)
  • [Improvement] Add a host map option for a UVM tensor alloc (#3073)
  • [Improvement] uvm_to_device expose device as interface (#3030)
  • [Improvement] Add Meta backend/dispatcher for new_unified_tensor (#3005)
  • [Improvement] General TBE enhancements and bug fixes (#2892, #3114, #3022, #2958)
  • [Improvement] Consolidate repeat code in TBE inference (#3028)

For CPU

  • [New] Add int4 to int4 CPU Sequence TBE kernel (#2996, #2994)
  • [New] Use auto-vec kernel in CPU sequential embedding lookup for int8 tables (#2863, #2878)
  • [Improvement] Work around OMP barrier issue with MSVC and unused var error (#2918, #3084)

SSD Table batched embedding (TBE) operators

  • [New] Enable pipeline prefetching (#2963)
  • [New] Enable cache line locking support in SSD kernel (#2949)
  • [New] Add L2 flush (#3110)
  • [New] Added SSD ODS and IO/mem stats (#2906, #2913, #3035)
  • [New] Add SSDScratchPadIndicesQueue (#2911, #2948)
  • [New] Integrate L2 cache into TBE operator (#2959, #3032, #3031)
  • [New] Add ssd_update_row_addrs (#2953)
  • [New] Add bounds check in SSD-TBE (#3013)
  • [New] Add 32-bit index support in SSD kernels (#3064)
  • [New] Add kv cache related ops (#3001, #2968)
  • [New] Add compact_indices op (#3075)
  • [New] Create embedding cache interface and impl RocksDB cache (#2858)
  • [New] Reduce prefetch SM usage when using pipeline prefetching (#2991)
  • [New] Add a host map option for a UVM tensor alloc (#3003)
  • [New] Add masked_index_select and refactor masked_index_put (#2910)
  • [Improvement] Add parallelism on cache update (#3062)
  • [Improvement] Add parameter server attributes (#2947)
  • [Improvement] Make the scratch pad tensor UVA (#2844)
  • [Improvement] Use fewer thread blocks for find_uncached kernel (#3101)
  • [Improvement] Fix stream sync for scratch pad eviction (#2843)
  • [Improvement] Make indices related to cache eviction UVA tensors (#3077)
  • [Improvement] Split cachelib cache into header and src (#3063)
  • [Improvement] Record more functions and logging in SSD TBE (#2854, #2867, #2975)
  • [Improvement] Attach eviction filling logic to set_cache (#3034)
  • [Improvement] Move set_cache and set_async to background thread (#3033)
  • [Improvement] Refactoring vec copy in masked_index_put_kernel (#2861, #2908)
  • [Improvement] Increase memcpy and compute overlap (#2860)
  • [Improvement] Add set_async in background thread (#3036)
  • [Improvement] Make evicted_rows a UVA buffer (#3079)
  • [Improvement] General enhancement and bug fixes (#2937, #2993, #3151, #3089, #2898, #2930)

GenAI Support and Operators

  • [New] Decode and Prefill support (#3009)
  • [New] Support rope with block tables (#3146)
  • [New] EP support (#3071)
  • [New] Implement SDPA kernel wrapper to use run_kernel flow for perf (#2820)
  • [Improvement] Move mqa code (#3011)
  • [Improvement] BE improvements to init_comms (#3103)

Triton GEMM support

  • [New] Enable torch.compile compatibility for triton fp8 rowwise gemm (#2978)
  • [New] Add 3D+ input support for fp8 rowwise GEMM (#2845)
  • [New] GEMM custom op enablement (#3046)
  • [Improvement] Add fused bias to Triton FP8 Rowwise Kernels (#2852)
  • [Improvement] Triton dependency (#3027)
  • [Improvement] Fix triton fp8 handling of non-contiguous inputs (#2919)
  • [Improvement] More autotune configs and bug fixes in TMA kernel (#3078, #3066, #3072)
  • [Improvement] FP8 GEMM tweak for 405B decoding (#3104)

FP8 and other Quantization support

  • [New] CK FP8 optimizations and fixes (#2940, #2912, #2987, #3017, #2893)
  • [New] FP8 kernel development and enablement (#2866)
  • [New] GenAI CK Version update and integration (#2865, #2971)
  • [Improvement] Also hipify the FP8-related CUDA functions (#2834)
  • [Improvement] Auto-generation of CUTLASS Extension Kernel Templates (#2932)
  • [Improvement] Marlin Mixed Input Kernel Productionization (#3008)
  • [Improvement] Remove redundant torch.abs (#3020, #2822)
  • [Improvement] Tuning for 405B/70B Prefill with small seqlen (#3042)
  • [Improvement] Added new instances for 405B decoding (#2936)

Permute and Pooled Embeddings Ops

  • [New] Implementation of permute_multi_embedding (#2833)
  • [Improvement] Clean up and removal of unused exception (#2832, #2891)
  • [Improvement] Use at::parallel_for in cpu kernel (#2817)
  • [Improvement] Add dispatch_to_cpu for the operators (#2874, #2881)
  • [Improvement] Print the exact variable values triggering the alert in Merge Pooled Embedding (#3038)

Sparse Operators

  • [New] Support original indices for FBGEMM block bucketization flag (#2999, #2925)
  • [Improvement] Fix pack_segments backward when grad is non-contig (#3006)
  • [Improvement] Fix FBGEMM_GPU_MEMCHECK in sparse_ops_cuda (#2943)
  • [Improvement] Update sparse_ops.py to use generic GPU target fbgemm_gpu:input_combine to support both NVIDIA and AMD (#2905)
  • [Improvement] Add abstract impl and functions (#2962, #2983, #3000)
  • [Improvement] Use guard_size_oblivious in tbe_input_combine_abstract fake kernel (#2923)
  • [Improvement] Out variant for asynchronous_exclusive_cumsum_cpu + some more static dispatch kernels (#3090)

Quantize ops

  • [New] Add a CPU nbit to float dequantization op that supports torch.quintMxN type (#2995)

MX4 Ops

  • [New] Optimize FBGEMM Triton MX4 Quantize-Dequantize (#2838, #2837)
  • [New] Rounding mode support (#2821, #2816, #2933, #2859)
  • [New] FBGEMM/TorchRec MX4 padding support (#3055, #3047, #3010)
  • [New] Add Stochastic downcasting to MX4 Quantization (#2899)
  • [New] Support for other MX4 formats in Triton kernels (#2900)
  • [Improvement] Refactor MX4 Kernel to operate on flat tensors (#2836)
  • [Improvement] Optimize MX4 padding to minimize need for tuning (#3040)

Benchmarks / Tests

  • [New] Add schema compatibility test (#3130)
  • [New] Add SSD/UVM caching in TBE device benchmark (#3076)
  • [New] Add EmbeddingSpMDM8BitBenchmarkOutTypeFloat16 (#2952)
  • [New] Add benchmark EmbeddingSpMDMNBitBenchmarkOutTypeFloat16 (#2901)
  • [New] Add unit test for int4 to int4 sequence CPU TBE (#2997)
  • [New] Add rocm support for fp8 benchmarks (#2965)
  • [New] Add rotating buffer feature to quantize_bench (#2857)
  • [New] Benchmark of fbgemm op - permute_multi_embedding (#2828)
  • [New] Add test for supporting torch.float16 and torch.bfloat16 (#2992)
  • [Improvement] Fix logging and remove sync points in benchmarks (#3149, #3113, #2855)
  • [Improvement] Update TBE training benchmark (#3112, #3074, #3051)
  • [Improvement] Improve ssd-training benchmark (#2850, #3004, #3069, #2989)
  • [Improvement] Fix segfault in ssd training unit tests (#2929)
  • [Improvement] Fixes for GenAI tests (#2864, #2885, #2970, #2849, #2869)
  • [Improvement] Fix minor issues in EmbeddingSpMDMNBitBenchmark (#2894)
  • [Improvement] Fix test skipping for UVM tests (#3016)
  • [Improvement] Fix failures_dict_fast.json in TBE inference test (#3024, #3060)

Build / CI improvements and Better Engineering