WIP: Add benchmark for tfidf #239

Open · wants to merge 9 commits into main

Conversation

@GoWind (Contributor) commented Nov 23, 2024

Add a vanilla Rust program that calculates tf-idf scores and retrieves the top 10 most similar documents.

We use cosine similarity to score the relevance of documents for a given query.

@ashvardanian, thanks for your time!
I wanted to check with you on the approach, to see if it makes sense to you as a valid benchmark.

Use the leipzig1m and the XL-Sum datasets as corpora. I assume that each 10,000 lines in the `leipzig1m` dataset form a single document, and then calculate the tf-idf scores for each (term, document) pair in the corpus.
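For reference, a common formulation of the score: tfidf(t, d) = tf(t, d) * idf(t), where tf(t, d) is the frequency of term t in document d and idf(t) = ln(N / df(t)), with N the total number of documents and df(t) the number of documents containing t (scikit-learn's smoothed default is idf(t) = ln((1 + N) / (1 + df(t))) + 1).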

  1. Given a query, calculate its tf-idf scores against the same corpus (treating the query as a separate document).

  2. The next step is to calculate the cosine score. I couldn't find a good reference for computing the score between a query vector and a document vector (Claude gave me a function that I could possibly use).

  3. Once we have cosine similarity scores, sort them and fetch the top 10.

For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation?
And for the query: in memchr vs stringzilla you basically pick a random set of tokens and then benchmark searching from left and right using memchr and stringzilla. I was thinking of doing something similar, picking terms at random from the corpus and constructing random queries to benchmark.
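
Something like this minimal sketch is what I have in mind for the plain cosine baseline, assuming dense f64 tf-idf vectors (illustrative, not the final benchmark code):

fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    // Dot product and Euclidean norms over dense tf-idf vectors.
    let dot: f64 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    // Cosine is undefined for zero vectors; mirror that with an Option.
    if norm_a == 0.0 || norm_b == 0.0 { None } else { Some(dot / (norm_a * norm_b)) }
}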

@ashvardanian (Owner)

Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TFIDF implementation?

@GoWind (Contributor, Author) commented Nov 23, 2024

@ashvardanian, not yet. I am still trying to figure out what the ideal benchmark would be (and also learning how to use TF-IDF).

In scikit-learn, the TfidfVectorizer creates a 2D array for the documents, where each row is a document, there is a column for each "token" in the corpus, and
array[row][column] = the tf-idf weight of the token in the document

Similarly, we can tokenize the query and run an intersection between the query (as a row vector) and each document in our corpus to get the intersection size. The intersection would count the unique tokens shared between the query and the document (something like using simsimd_intersect). Is that what you had in mind for the sparse kernels?
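
In scalar form, I imagine the intersection working like this, assuming each document is reduced to a sorted, deduplicated vector of term IDs (an illustrative sketch, not the SimSIMD API):

fn intersection_size(a: &[u16], b: &[u16]) -> usize {
    // Merge-walk two sorted term-ID lists, counting the shared IDs.
    let (mut i, mut j, mut count) = (0usize, 0usize, 0usize);
    while i < a.len() && j < b.len() {
        if a[i] < b[j] { i += 1; }
        else if a[i] > b[j] { j += 1; }
        else { count += 1; i += 1; j += 1; }
    }
    count
}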

@ashvardanian (Owner)

@GoWind, the intersect function may not be the only one you need. Also look into: spdot_counts and spdot_weights 🤗

@GoWind (Contributor, Author) commented Nov 23, 2024

Will do, thanks for the pointers! Also reading through the scikit-learn implementation to see how I can possibly do this :)

@GovindarajanNagarajan-TomTom (Contributor)

Hi @ashvardanian, making progress on the tf-idf-based similarity calculator.
I noticed a discrepancy when calculating the cosine similarity for vectors of f64s via the Rust bindings.
The plain Rust cosine calculations match the values I get both from numpy and from the simsimd bindings via Python.
For the implementation, I compute a vector of tf-idf values per query and then compute a cosine similarity between the query and each document, based on the answers to the SO question.

Here is how I prepared the script (added a few hacks to test quickly; will update the scripts to be a [[bench]] in subsequent commits):

head -n 10000 leipzig1m.txt > leipzig10000.txt
cargo run --bin tfidf leipzig10000.txt

I took the first 10k lines and batched them into 10 documents of 1k lines each.
The query (hardcoded in the script) is transformed into a vector representation, and I compute the cosine similarities:

Similarity for document via simsimd 0: Some(0.5928754241421308)
Similarity for document via plain cosine similarity 0: Some(0.40712457585786876)
Similarity for document via simsimd 1: Some(0.5993249839897541)
Similarity for document via plain cosine similarity 1: Some(0.4006750160102468)
Similarity for document via simsimd 2: Some(0.5914559242162761)
Similarity for document via plain cosine similarity 2: Some(0.408544075783724)
Similarity for document via simsimd 3: Some(0.5998267820476098)
Similarity for document via plain cosine similarity 3: Some(0.40017321795239075)
Similarity for document via simsimd 4: Some(0.5906444006555799)
Similarity for document via plain cosine similarity 4: Some(0.40935559934442023)
Similarity for document via simsimd 5: Some(0.5902192553116478)
Similarity for document via plain cosine similarity 5: Some(0.4097807446883521)
Similarity for document via simsimd 6: Some(0.5943923707602529)
Similarity for document via plain cosine similarity 6: Some(0.4056076292397477)
Similarity for document via simsimd 7: Some(0.6028015678055032)
Similarity for document via plain cosine similarity 7: Some(0.3971984321944968)
Similarity for document via simsimd 8: Some(0.5957380843868555)
Similarity for document via plain cosine similarity 8: Some(0.4042619156131435)
Similarity for document via simsimd 9: Some(0.5913356879962984)
Similarity for document via plain cosine similarity 9: Some(0.4086643120037022)

Not sure if I am doing something wrong, but could there be some sort of discrepancy here?

@ashvardanian (Owner)

Seems like one is x and the other is 1-x. One is a similarity score and the other is a distance.
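(Indeed, each pair above sums to exactly 1; for document 0, 0.59288 + 0.40712 = 1. The SimSIMD kernel returns the cosine distance 1 - cos, while the plain implementation returns the similarity cos.)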

@GoWind (Contributor, Author) commented Nov 27, 2024

Ah, I see, now it makes sense:


SIMSIMD_INTERNAL simsimd_distance_t _simsimd_cos_normalize_f64_neon(simsimd_f64_t ab, simsimd_f64_t a2,
                                                                    simsimd_f64_t b2) {
    if (a2 == 0 && b2 == 0) return 0;
    if (ab == 0) return 1;
    simsimd_f64_t squares_arr[2] = {a2, b2};
    float64x2_t squares = vld1q_f64(squares_arr);
    // ... (initial reciprocal-square-root estimate `rsqrts` elided here) ...
    // Two Newton-Raphson refinement steps for the rsqrt estimates:
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    rsqrts = vmulq_f64(rsqrts, vrsqrtsq_f64(vmulq_f64(squares, rsqrts), rsqrts));
    vst1q_f64(squares_arr, rsqrts);
    // Returns the *distance* 1 - ab / (|a| * |b|), not the similarity.
    simsimd_distance_t result = 1 - ab * squares_arr[0] * squares_arr[1];
    return result > 0 ? result : 0;
}

@GoWind (Contributor, Author) commented Dec 8, 2024

I was able to benchmark running the cosine search for a query across the whole database, and the SimSIMD version runs up to 5x faster than the plain Rust version.

I am still not sure how to use spdot here instead of a cosine similarity. Can you give me some pointers on how I can employ spdot and benchmark it? (Now that I have most of the foundation for a TF-IDF implementation, I reckon it should be easier.)

Benchmarks were done on an M2 Pro with 32 GB of RAM.

For the dataset, I used the first 10k lines from the Leipzig dataset: head -n 10000 leipzig1m.txt > leipzig10000.txt

warning: `simsimd` (bench "tfidf") generated 1 warning (run `cargo fix --bench "tfidf"` to apply 1 suggestion)
    Finished bench [optimized] target(s) in 0.08s
     Running scripts/bench_tfidf.rs (target/release/deps/tfidf-601c525378540ae8)
Gnuplot not found, using plotters backend
Benchmarking TF-IDF Similarity/SimSIMD Cosine Similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.7s or enable flat sampling.
TF-IDF Similarity/SimSIMD Cosine Similarity
                        time:   [120.97 ms 121.69 ms 122.57 ms]
                        change: [-10.060% -6.1234% -2.7896%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
Benchmarking TF-IDF Similarity/Rust plain Cosine similarity: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 5.4s.
TF-IDF Similarity/Rust plain Cosine similarity
                        time:   [534.30 ms 535.29 ms 536.42 ms]
                        change: [-8.7229% -5.3702% -2.4358%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
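
(From the medians above, the speedup on this particular run is roughly 535.29 ms / 121.69 ms ≈ 4.4x.)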

@ashvardanian (Owner)

Thanks for sharing these benchmarks—great to see SimSIMD achieving a 5x speedup over the Rust implementation! To use spdot, you can directly compute the weighted dot product of TF-IDF vectors without normalizing them, as required for cosine similarity. This is both faster and aligns naturally with the sparse representation of TF-IDF.

For benchmarking, precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values) and replace the cosine similarity calculation with spdot. This avoids unnecessary operations like normalization and takes full advantage of sparsity. Let me know how the updated results compare—excited to see the numbers!
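
In scalar form, the weighted sparse dot product is a merge-join over the two sorted index arrays with parallel weight arrays. A reference sketch of the semantics (names and types here are illustrative, not the exact kernel signature):

fn spdot_weights_scalar(a_ids: &[u16], a_weights: &[f32], b_ids: &[u16], b_weights: &[f32]) -> f64 {
    // Accumulate products of weights wherever the two sorted term-ID lists agree.
    let (mut i, mut j) = (0usize, 0usize);
    let mut product = 0.0f64;
    while i < a_ids.len() && j < b_ids.len() {
        if a_ids[i] < b_ids[j] { i += 1; }
        else if a_ids[i] > b_ids[j] { j += 1; }
        else {
            product += (a_weights[i] * b_weights[j]) as f64;
            i += 1;
            j += 1;
        }
    }
    product
}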

@GoWind (Contributor, Author) commented Dec 8, 2024

> you can directly compute the weighted dot product of TF-IDF vectors without normalizing them, as required for cosine similarity. This is both faster and aligns naturally with the sparse representation of TF-IDF.
> precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values)

Got it, thanks for the pointers. I will set up the benchmarks and share numbers again. This does look promising!

ashvardanian added a commit to ashvardanian/stringzilla-benchmarks-rs that referenced this pull request Dec 8, 2024
@ashvardanian (Owner)

Hey @GoWind! I've added a placeholder file for this benchmark on the main-dev branch of this repo. Let's move it there 🤗

Thanks!

@GoWind (Contributor, Author) commented Dec 9, 2024

Sure, will move the benchmark code to the linked repo!

@GoWind (Contributor, Author) commented Dec 10, 2024

Hi @ashvardanian, I noticed that there isn't a NEON implementation of simsimd_spdot_weights_u16 (or simsimd_spdot_counts_u16, either). I think it might make sense to add one as part of this benchmark, since I am running it on a Mac (and other ARM devices would benefit anyway).

Also, I wrote the tfidf benchmark using f64 and realized that there isn't a spdot version for u16 indices with f32 or f64 weights either. Does it make sense to add them as well?

Also, there seem to be a SimSIMD-native bf16 type and a half::bf16 type, and the two are incompatible, at least at the type level. Do you think it is safe to use half::bf16 values and then cast them to SimSIMD bf16s?

@GoWind (Contributor, Author) commented Jan 2, 2025

Hi @ashvardanian, Happy New Year!

I didn't have much time to work on this, but squeezed out some time during the holidays to finish up a few things.
I wrote an f32 NEON implementation for spdot; I noticed that there currently seems to be only an SVE2 implementation, for bf16. My implementation doesn't seem to be any faster than the serial one, and I was wondering if you could provide any insights on speeding it up?

The vectors I am comparing are about 20-50 elements long. I will do one more round with a larger number of elements to see if the algorithm gets faster beyond a certain size.

Would you also be interested in an f16 implementation on NEON? If there are any improvements I can make to the f32 version, I think I might be able to port them to the f16 / bf16 versions as well.

All measurements were done on an M2 Pro with 32 GB of RAM.

[Screenshot: benchmark measurements, 2025-01-02 16:37]
