Element-wise BLAS APIs & new Tensor for Python: ⬆️ 450 kernels #220

Open · wants to merge 68 commits into `main`

Changes from all commits (68):
fb5dce7
Add: Sum & Scale APIs
ashvardanian Oct 31, 2024
6cdd147
Docs: PyTorch examples & Element-wise ops
ashvardanian Oct 28, 2024
d436795
Improve: Faster scale/shift on Haswell with FMA
ashvardanian Oct 31, 2024
a57264a
Add: `simsimd_ndindex_t` for high-rank tensors
ashvardanian Nov 1, 2024
e382774
Merge pull request #223 from ashvardanian/main
ashvardanian Nov 1, 2024
25a76ca
Make: Mark tests in CMake (#226)
Spixmaster Nov 2, 2024
7666884
Add: Same-type element-wise ops
ashvardanian Nov 4, 2024
999243e
Merge branch 'main-elementwise' of https://github.com/ashvardanian/Si…
ashvardanian Nov 4, 2024
617e1f7
Fix: Increment `global_offset` on final step
ashvardanian Nov 4, 2024
0ce4024
Fix: Cast in `sum_i8_haswell`
ashvardanian Nov 4, 2024
d54f567
Fix: Conflicting type on Windows
ashvardanian Nov 4, 2024
a334e99
Merge branch 'main' into main-elementwise
ashvardanian Nov 5, 2024
18c41fd
Improve: Type-casting logic
ashvardanian Nov 5, 2024
47f79c7
Improve: `ndindex` -> `mdindices`
ashvardanian Nov 6, 2024
4bfe1d1
Improve: Drop `global_offset`
ashvardanian Nov 6, 2024
ac5841f
Add: `mdspan`
ashvardanian Nov 6, 2024
4c69e7d
Add: Type-casts to & from `[iuf]64`
ashvardanian Nov 6, 2024
4646d6b
Break: Support mixed-type element-wise ops
ashvardanian Nov 6, 2024
383b799
Break: Shorter op-codes
ashvardanian Nov 6, 2024
08010ba
Improve: Same type-casting as NumPy
ashvardanian Nov 7, 2024
54bb07d
Add: `i16` element-wise kernels for NEON
ashvardanian Nov 8, 2024
1f91b92
Add: `i32` element-wise kernels for NEON
ashvardanian Nov 8, 2024
75993e7
Add: `i64` element-wise kernels for NEON
ashvardanian Nov 8, 2024
38df49c
Break: Shorter symbol names
ashvardanian Nov 8, 2024
0e7c656
Add: `i16` element-wise kernels for Haswell
ashvardanian Nov 8, 2024
e2698b0
Add: `i32` element-wise kernels for Haswell
ashvardanian Nov 8, 2024
d10d27e
Add: `i8` element-wise kernels for Skylake
ashvardanian Nov 8, 2024
8950a7e
Add: `i16` element-wise kernels for Skylake
ashvardanian Nov 9, 2024
d1bb51c
Add: `i32` element-wise kernels for Skylake
ashvardanian Nov 9, 2024
463e8f3
Add: `i64` element-wise kernels for Skylake
ashvardanian Nov 9, 2024
e089626
Improve: Unsigned type literals for masks
ashvardanian Nov 9, 2024
09735ea
Add: Element-wise saturated addition for Ice Lake
ashvardanian Nov 9, 2024
602f812
Add: Dynamic dispatch for element-wise ops
ashvardanian Nov 9, 2024
3aac9ad
Add: Missing serial integer `wsum`-s
ashvardanian Nov 9, 2024
02236d1
Fix: Match type-casting rules of NumPy
ashvardanian Nov 9, 2024
48bd712
Add: `simsimd.multiply`
ashvardanian Nov 9, 2024
7aa118b
Fix: `_mm256_adds_epi32` emulation
ashvardanian Nov 10, 2024
8295e11
Fix: Serial emulation of `_mm256_adds_epu32`
ashvardanian Nov 10, 2024
400dfaa
Fix: `sadd` for `u(8|16|32)`
ashvardanian Nov 10, 2024
d81868a
Add: `simsimd.multiply`
ashvardanian Nov 10, 2024
79c4552
Fix: Missing 64-bit Haswell kernels
ashvardanian Nov 10, 2024
3f48285
Improve: Clipping doubles on Haswell
ashvardanian Nov 11, 2024
bcbe538
Fix: Missing `__m256d[]` operator on MSVC
ashvardanian Nov 11, 2024
9afe040
Improve: Reduce fuzzy tests
ashvardanian Nov 11, 2024
a4fce6d
Fix: Keeping one capability ON
ashvardanian Nov 11, 2024
ac60194
Improve: Overflow clipping on Skylake
ashvardanian Nov 11, 2024
45e806f
Improve: Clipping on x86
ashvardanian Nov 11, 2024
69e3a94
Fix: Inferring `possible_capabilities`
ashvardanian Nov 11, 2024
e568e6c
Improve: Log operand descriptor
ashvardanian Nov 11, 2024
4d0880f
Merge branch 'main' into main-elementwise
ashvardanian Nov 11, 2024
a0f88b7
Improve: Test saturating arithmetic
ashvardanian Nov 11, 2024
72b219e
Merge branch 'main-elementwise' of https://github.com/ashvardanian/Si…
ashvardanian Nov 11, 2024
0bf67d0
Improve: Re-group Py/Rs benchmarks
ashvardanian Nov 12, 2024
b3f98e6
Add: `u8` APIs to Rust SDK
ashvardanian Nov 12, 2024
c49abe3
Make: Bump Rust dependencies
ashvardanian Nov 12, 2024
6f69eee
Improve: Generalize Rust benchmarks
ashvardanian Nov 12, 2024
cf507db
Improve: Report throughput in Rust benchmarks
ashvardanian Nov 12, 2024
a22607d
Add: BLAS benchmarks for elementwise ops
ashvardanian Nov 12, 2024
b6012ca
Fix: FMA can't be implemented in BLAS
ashvardanian Nov 13, 2024
8fb5a0c
Add: Element-wise Python benchmark
ashvardanian Nov 13, 2024
fe62187
Improve: Mixed `dtype` benchmarks
ashvardanian Nov 13, 2024
8c9b71e
Improve: "MD" -> "XD"
ashvardanian Nov 15, 2024
b480b5c
Break: `cos` distance renamed to `angular`
ashvardanian Nov 20, 2024
d65c6e8
Improve: Ignore renaming to `angular`
ashvardanian Nov 20, 2024
96adae5
Add: Trigonometry based on SLEEF
ashvardanian Nov 20, 2024
cb14ffb
Improve: Polish `simsimd_f32_sin`
ashvardanian Nov 20, 2024
d4a6288
Improve: Cleaner trigonometry
ashvardanian Nov 21, 2024
bc6ed87
Add: `atan` & `atan2` serial variants
ashvardanian Nov 22, 2024
1 change: 1 addition & 0 deletions .git-blame-ignore-revs
@@ -1,2 +1,3 @@
a4022a988287e527757ecc9bc16a4f2e7dc4770e
750c59f5116a2000507a0cec09db009fd7d31232
b480b5c3ebddd6de0f8e1c179cdc02f18edbb8ae
6 changes: 6 additions & 0 deletions .vscode/settings.json
@@ -100,9 +100,11 @@
"cSpell.words": [
"allclose",
"Altra",
"astype",
"Axion",
"bfloat",
"bitalg",
"bitmask",
"BLAS",
"castsi",
"CBLAS",
@@ -133,6 +135,9 @@
"Logarithmotechnia",
"maccs",
"maskz",
"mdindices",
"mdspan",
"musllinux",
"napi",
"ndarray",
"Needleman",
@@ -168,6 +173,7 @@
"VNNI",
"vpopcntdq",
"Wojciech",
"wsum",
"Wunsch",
"Zilla"
],
4 changes: 4 additions & 0 deletions CMakeLists.txt
@@ -106,10 +106,14 @@ endif ()
if (SIMSIMD_BUILD_TESTS)
    add_executable(simsimd_test_compile_time scripts/test.c)
    target_link_libraries(simsimd_test_compile_time simsimd m)
    add_test(NAME simsimd_test_compile_time COMMAND simsimd_test_compile_time)

    add_executable(simsimd_test_run_time scripts/test.c c/lib.c)
    target_compile_definitions(simsimd_test_run_time PRIVATE SIMSIMD_DYNAMIC_DISPATCH=1)
    target_link_libraries(simsimd_test_run_time simsimd m)
    add_test(NAME simsimd_test_run_time COMMAND simsimd_test_run_time)

    enable_testing()
endif ()

if (SIMSIMD_BUILD_SHARED)
13 changes: 10 additions & 3 deletions CONTRIBUTING.md
@@ -63,6 +63,13 @@

```sh
cmake -D CMAKE_BUILD_TYPE=Release \
cmake --build build_release --config Release
```

I'd recommend setting the following breakpoints:

- `__asan::ReportGenericError` - to detect illegal memory accesses.
- `__GI_exit` - to stop at exit points, i.e., the end of any executable's run.
- `__builtin_unreachable` - to catch unexpected code paths.
- `_sz_assert_failure` - to catch StringZilla logic assertions.

## Python

Testing:
@@ -91,14 +98,14 @@ Benchmarking:

```sh
pip install numpy scipy scikit-learn # for comparison baselines
- python scripts/bench_vectors.py # to run default benchmarks
- python scripts/bench_vectors.py --n 1000 --ndim 1536 # batch size and dimensions
+ python scripts/bench_similarity.py # to run default benchmarks
+ python scripts/bench_similarity.py --n 1000 --ndim 1536 # batch size and dimensions
```

You can also benchmark against other libraries and filter the numeric types and distance metrics:

```sh
- $ python scripts/bench_vectors.py --help
+ $ python scripts/bench_similarity.py --help
> usage: bench.py [-h] [--ndim NDIM] [-n COUNT]
> [--metric {all,dot,spatial,binary,probability,sparse}]
> [--dtype {all,bin8,int8,uint16,uint32,float16,float32,float64,bfloat16,complex32,complex64,complex128}]
```
21 changes: 14 additions & 7 deletions Cargo.lock

Some generated files are not rendered by default.

15 changes: 5 additions & 10 deletions Cargo.toml
@@ -24,18 +24,12 @@ name = "simsimd"
path = "rust/lib.rs"

[build-dependencies]
cc = "1.0.83"


[[bench]]
name = "cosine"
harness = false
path = "scripts/bench_cosine.rs"
cc = "1.2.0"

[[bench]]
name = "sqeuclidean"
name = "bench_similarity"
harness = false
path = "scripts/bench_sqeuclidean.rs"
path = "scripts/bench_similarity.rs"

[profile.bench]
opt-level = 3 # Corresponds to -O3
@@ -46,4 +40,5 @@ rpath = false # On some systems, setting this to false can help with optimiz
[dev-dependencies]
criterion = { version = "0.5.1" }
rand = { version = "0.8.5" }
half = { version = "2.4.0" }
half = { version = "2.4.1" }
num-traits = "0.2.19"
72 changes: 45 additions & 27 deletions README.md
@@ -2,8 +2,8 @@

Computing dot-products, similarity measures, and distances between low- and high-dimensional vectors is ubiquitous in Machine Learning, Scientific Computing, Geo-Spatial Analysis, and Information Retrieval.
These algorithms generally have linear complexity in time, constant or linear complexity in space, and are data-parallel.
- In other words, it is easily parallelizable and vectorizable and often available in packages like BLAS (level 1) and LAPACK, as well as higher-level `numpy` and `scipy` Python libraries.
- Ironically, even with decades of evolution in compilers and numerical computing, [most libraries can be 3-200x slower than hardware potential][benchmarks] even on the most popular hardware, like 64-bit x86 and Arm CPUs.
+ In other words, they are easily parallelizable and vectorizable and often available in packages like BLAS (level 1) and LAPACK, as well as higher-level `numpy` and `scipy` Python libraries.
+ Ironically, even with decades of evolution in compilers and numerical computing, [most libraries can be 3x - 1'000x slower than hardware potential][benchmarks] even on the most popular hardware, like 64-bit x86 and Arm CPUs.
Moreover, most lack mixed-precision support, which is crucial for modern AI!
The rare few that support minimal mixed precision, run only on one platform, and are vendor-locked, by companies like Intel and Nvidia.
SimSIMD provides an alternative.
@@ -42,7 +42,7 @@ SimSIMD provides an alternative.

## Features

- __SimSIMD__ (Arabic: "سيمسيم دي") is a mixed-precision math library of __over 200 SIMD-optimized kernels__ extensively used in AI, Search, and DBMS workloads.
+ __SimSIMD__ (Arabic: "سيمسيم دي") is a mixed-precision math library of __over 450 SIMD-optimized kernels__ extensively used in AI, Search, and DBMS workloads.
Named after the iconic ["Open Sesame"](https://en.wikipedia.org/wiki/Open_sesame) command that opened doors to treasure in _Ali Baba and the Forty Thieves_, SimSimd can help you 10x the cost-efficiency of your computational pipelines.
Implemented distance functions include:

@@ -52,7 +52,7 @@ Implemented distance functions include:
- Set Intersections for Sparse Vectors and Text Analysis. _[docs][docs-sparse]_
- Mahalanobis distance and Quadratic forms for Scientific Computing. _[docs][docs-curved]_
- Kullback-Leibler and Jensen–Shannon divergences for probability distributions. _[docs][docs-probability]_
- - Fused-Multiply-Add (FMA) and Weighted Sums to replace BLAS level 1 functions. _[docs][docs-fma]_
+ - Fused-Multiply-Add (FMA) and Weighted Sums to replace BLAS level 1 functions. _[docs][docs-elementwise]_
- For Levenshtein, Needleman–Wunsch, and Smith-Waterman, check [StringZilla][stringzilla].
- 🔜 Haversine and Vincenty's formulae for Geospatial Analysis.

@@ -62,7 +62,7 @@ Implemented distance functions include:
[docs-binary]: https://github.com/ashvardanian/SimSIMD/pull/138
[docs-dot]: #complex-dot-products-conjugate-dot-products-and-complex-numbers
[docs-probability]: #logarithms-in-kullback-leibler--jensenshannon-divergences
- [docs-fma]: #mixed-precision-in-fused-multiply-add-and-weighted-sums
+ [docs-elementwise]: #mixed-precision-in-fused-multiply-add-and-weighted-sums
[scipy]: https://docs.scipy.org/doc/scipy/reference/spatial.distance.html#module-scipy.spatial.distance
[numpy]: https://numpy.org/doc/stable/reference/generated/numpy.inner.html
[stringzilla]: https://github.com/ashvardanian/stringzilla
@@ -139,7 +139,7 @@

```py
import numpy as np

vec1 = np.random.randn(1536).astype(np.float32)
vec2 = np.random.randn(1536).astype(np.float32)
- dist = simsimd.cosine(vec1, vec2)
+ dist = simsimd.angular(vec1, vec2)
```

Supported functions include `cosine`, `inner`, `sqeuclidean`, `hamming`, `jaccard`, `kulbackleibler`, `jensenshannon`, and `intersect`.
@@ -158,11 +158,11 @@

Unlike SciPy, SimSIMD allows explicitly stating the precision of the input vectors.
The `dtype` argument can be passed both by name and as a positional argument:

```py
dist = simsimd.cosine(vec1, vec2, "int8")
dist = simsimd.cosine(vec1, vec2, "float16")
dist = simsimd.cosine(vec1, vec2, "float32")
dist = simsimd.cosine(vec1, vec2, "float64")
dist = simsimd.hamming(vec1, vec2, "bit8")
dist = simsimd.angular(vec1, vec2, "int8")
dist = simsimd.angular(vec1, vec2, "float16")
dist = simsimd.angular(vec1, vec2, "float32")
dist = simsimd.angular(vec1, vec2, "float64")
dist = simsimd.jaccard(vec1, vec2, "bin8") # Binary vectors with 8-bit words
```
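
For binary metrics, the `bin8` view expects bits packed into 8-bit words. A minimal sketch of producing such vectors from boolean masks with NumPy's `np.packbits`:

```py
import numpy as np
import simsimd

mask1 = np.random.rand(1536) > 0.5  # boolean vectors
mask2 = np.random.rand(1536) > 0.5

bits1 = np.packbits(mask1)  # 1536 bits -> 192 bytes
bits2 = np.packbits(mask2)

dist = simsimd.hamming(bits1, bits2, "bin8")
```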

With other frameworks, like PyTorch, one can get a richer type-system than NumPy, but the lack of good CPython interoperability makes it hard to pass data without copies.
@@ -181,7 +181,7 @@

```py
torch.randn(8, out=vec2)

# Both libs will look into the same memory buffers and report the same results
dist_slow = 1 - torch.nn.functional.cosine_similarity(vec1, vec2, dim=0)
- dist_fast = simsimd.cosine(buf1, buf2, "bf16")
+ dist_fast = simsimd.angular(buf1, buf2, "bf16")
```

It also allows using SimSIMD for half-precision complex numbers, which NumPy does not support.
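
As a sketch of what that can look like, assuming the `dot` and `vdot` kernels accept the same `dtype` override and treat `complex32` as interleaved half-precision (real, imaginary) pairs:

```py
import numpy as np
import simsimd

# 768 complex numbers as 1536 interleaved (real, imaginary) `float16` words,
# since NumPy has no half-precision complex dtype of its own
vec1 = np.random.randn(1536).astype(np.float16)
vec2 = np.random.randn(1536).astype(np.float16)

product = simsimd.dot(vec1, vec2, "complex32")    # unconjugated dot product
conjugate = simsimd.vdot(vec1, vec2, "complex32") # conjugated (Hermitian) variant
```
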
@@ -220,8 +220,8 @@

```py
vec1 = np.random.randn(1536).astype(np.float32) # rank 1 tensor
batch1 = np.random.randn(1, 1536).astype(np.float32) # rank 2 tensor
batch2 = np.random.randn(100, 1536).astype(np.float32)

- dist_rank1 = simsimd.cosine(vec1, batch2)
- dist_rank2 = simsimd.cosine(batch1, batch2)
+ dist_rank1 = simsimd.angular(vec1, batch2)
+ dist_rank2 = simsimd.angular(batch1, batch2)
```

### Many-to-Many Distances
Expand All @@ -232,7 +232,7 @@ For two batches of 100 vectors to compute 100 distances, one would call it like
```py
batch1 = np.random.randn(100, 1536).astype(np.float32)
batch2 = np.random.randn(100, 1536).astype(np.float32)
- dist = simsimd.cosine(batch1, batch2)
+ dist = simsimd.angular(batch1, batch2)
```

Input matrices must have identical shapes.
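
If every pairwise combination is needed instead, the `cdist` interface covers that. A short sketch, assuming the `metric` keyword accepts the post-rename `angular` name:

```py
import numpy as np
import simsimd

matrix1 = np.random.randn(100, 1536).astype(np.float32)
matrix2 = np.random.randn(200, 1536).astype(np.float32)

# Computes a 100 x 200 matrix of all pairwise distances
distances = simsimd.cdist(matrix1, matrix2, metric="angular")
```
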
@@ -609,7 +609,7 @@

```swift
import SimSIMD
let vectorA: [Int8] = [1, 2, 3]
let vectorB: [Int8] = [4, 5, 6]

- let cosineSimilarity = vectorA.cosine(vectorB) // Computes the cosine similarity
+ let cosineSimilarity = vectorA.angular(vectorB) // Computes the cosine similarity
let dotProduct = vectorA.dot(vectorB) // Computes the dot product
let sqEuclidean = vectorA.sqeuclidean(vectorB) // Computes the squared Euclidean distance
```
@@ -637,9 +637,9 @@

```c
int main() {
    simsimd_f32_t vector_a[1536];
    simsimd_f32_t vector_b[1536];
    simsimd_kernel_punned_t distance_function = simsimd_metric_punned(
-       simsimd_metric_cos_k,   // Metric kind, like the angular cosine distance
-       simsimd_datatype_f32_k, // Data type, like: f16, f32, f64, i8, b8, and complex variants
-       simsimd_cap_any_k);     // Which CPU capabilities are we allowed to use
+       simsimd_angular_k,      // Metric kind, like the angular cosine distance
+       simsimd_f32_k,          // Data type, like: f16, f32, f64, i8, b8, complex variants, etc.
+       simsimd_cap_any_k);     // Which CPU capabilities are we allowed to use
    simsimd_distance_t distance;
    distance_function(vector_a, vector_b, 1536, &distance);
    return 0;
}
```

@@ -684,10 +684,10 @@

```c
int main() {
    simsimd_distance_t distance;

    // Cosine distance between two vectors
-   simsimd_cos_i8(i8s, i8s, 1536, &distance);
-   simsimd_cos_f16(f16s, f16s, 1536, &distance);
-   simsimd_cos_f32(f32s, f32s, 1536, &distance);
-   simsimd_cos_f64(f64s, f64s, 1536, &distance);
+   simsimd_angular_i8(i8s, i8s, 1536, &distance);
+   simsimd_angular_f16(f16s, f16s, 1536, &distance);
+   simsimd_angular_f32(f32s, f32s, 1536, &distance);
+   simsimd_angular_f64(f64s, f64s, 1536, &distance);

    // Euclidean distance between two vectors
    simsimd_l2sq_i8(i8s, i8s, 1536, &distance);
```

@@ -988,24 +988,42 @@ Both functions are defined for non-negative numbers, and the logarithm is a key

### Mixed Precision in Fused-Multiply-Add and Weighted Sums

- The Fused-Multiply-Add (FMA) operation is a single operation that combines element-wise multiplication and addition with different scaling factors.
- The Weighted Sum is it's simplified variant without element-wise multiplication.
+ The "Fused-Multiply-Add" (FMA) operation is a single operation that combines element-wise multiplication and addition with different scaling factors.
+ The "Weighted Sum" is its simplified variant without element-wise multiplication.
+ The "Sum" operation is a further simplified variant without scaling factors, and "Scale" is the unary equivalent of FMA:

```math
\text{Scale}_i(A, \alpha, \beta) = \alpha \cdot A_i + \beta
```

```math
\text{Sum}_i(A, B) = A_i + B_i
```

```math
\text{WSum}_i(A, B, \alpha, \beta) = \alpha \cdot A_i + \beta \cdot B_i
```

```math
\text{FMA}_i(A, B, C, \alpha, \beta) = \alpha \cdot A_i \cdot B_i + \beta \cdot C_i
```

In NumPy terms, the implementation __may__ look like:

```py
import numpy as np

def scale(A: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    return (Alpha * A + Beta).astype(A.dtype)

def sum(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    assert A.dtype == B.dtype, "Input types must match and affect the output type"
    return (A + B).astype(A.dtype)

def wsum(A: np.ndarray, B: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype, "Input types must match and affect the output type"
    return (Alpha * A + Beta * B).astype(A.dtype)

def fma(A: np.ndarray, B: np.ndarray, C: np.ndarray, /, Alpha: float, Beta: float) -> np.ndarray:
    assert A.dtype == B.dtype and A.dtype == C.dtype, "Input types must match and affect the output type"
    return (Alpha * A * B + Beta * C).astype(A.dtype)
```
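
And a usage sketch against the new kernels themselves, assuming the Python bindings mirror the kernel names from the commit history (`scale`, `multiply`, `wsum`, `fma`) and accept `alpha` and `beta` keyword arguments:

```py
import numpy as np
import simsimd

a = np.random.randn(1536).astype(np.float16)
b = np.random.randn(1536).astype(np.float16)
c = np.random.randn(1536).astype(np.float16)

scaled = simsimd.scale(a, alpha=2.0, beta=1.0)    # alpha * A + beta
product = simsimd.multiply(a, b)                  # element-wise A * B
blended = simsimd.wsum(a, b, alpha=0.5, beta=0.5) # alpha * A + beta * B
fused = simsimd.fma(a, b, c, alpha=2.0, beta=3.0) # alpha * A * B + beta * C
```

Unlike the NumPy reference above, the SIMD kernels up-cast to a wider type internally and, per the commit history, saturate instead of wrapping around on integer overflow.
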
2 changes: 1 addition & 1 deletion build.rs
@@ -8,7 +8,7 @@ fn main() {
.define("SIMSIMD_NATIVE_BF16", "0")
.define("SIMSIMD_DYNAMIC_DISPATCH", "1")
.flag("-O3")
.flag("-std=c99") // Enforce C99 standard
.flag("-std=c23") // We could enforce the C99 standard, but it's nicer to use `_Float16` in C23
.flag("-pedantic") // Ensure strict compliance with the C standard
.warnings(false);
