WIP: Add benchmark for tfidf #239
base: main
Conversation
Add a benchmark for calculating TF-IDF scores and finding the top 10 similar documents. We use cosine similarity to calculate the relevancy of documents for a given query.
Hi @GoWind! The biggest SimSIMD improvement should come from the Sparse kernels. Have you managed to use them in your TF-IDF implementation?
@ashvardanian, not yet, I am trying to figure out what the ideal benchmark could be (and also learning how to use TF-IDF). In scikit-learn, the TfidfVectorizer creates a 2D array for the documents, where each row is a document and there is a column for each "token" in the corpus. Similarly, we can tokenize the query and run it through the same vectorizer.
@GoWind, the
Will do, thanks for the pointers! Also reading through the scikit-learn implementation to see how I can possibly do this :)
Hi @ashvardanian, making progress on the TF-IDF based similarity calculator. Here is how I prepared the script (added a few hacks to test quickly; will update the scripts to be cleaner):
I took the first 10k lines and batched them into 10 documents of 1k lines each.
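A minimal sketch of that batching step, assuming a local copy of the Leipzig corpus (the path argument and helper name are placeholders):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Read the first 10k lines of the corpus and batch them into
/// 10 "documents" of 1k lines each.
fn load_documents(path: &str) -> std::io::Result<Vec<String>> {
    let reader = BufReader::new(File::open(path)?);
    let lines: Vec<String> = reader.lines().take(10_000).collect::<Result<_, _>>()?;
    Ok(lines.chunks(1_000).map(|chunk| chunk.join("\n")).collect())
}
```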
Not sure if I am doing something wrong, but could there be some sort of discrepancy here?
Seems like one is
Ah, I see, now it makes sense.
I was able to get a benchmark running the cosine search across a database against a query, and the SimSIMD version runs up to 5x faster than the plain Rust version. I am still not sure how to use spdot, though. Benchmarks were done on an M2 Pro with 32GB of RAM. For the dataset I used the first 10k lines from the Leipzig dataset.
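The SimSIMD side of the search is roughly the following sketch, using the crate's `SpatialSimilarity` trait (the `top_10` helper and the dense `f32` representation are my own framing):

```rust
use simsimd::SpatialSimilarity;

/// Return the indices of the 10 documents closest to `query`.
/// `f32::cosine` dispatches to SimSIMD's SIMD kernels and returns a
/// cosine *distance* (smaller means more similar), so we sort ascending.
fn top_10(query: &[f32], database: &[Vec<f32>]) -> Vec<usize> {
    let mut scored: Vec<(usize, f64)> = database
        .iter()
        .enumerate()
        .map(|(i, doc)| (i, f32::cosine(query, doc).unwrap_or(f64::MAX)))
        .collect();
    scored.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    scored.into_iter().take(10).map(|(i, _)| i).collect()
}
```

The vanilla Rust baseline replaces `f32::cosine` with a hand-rolled loop over the same pairs.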
Thanks for sharing these benchmarks, great to see SimSIMD achieving a 5x speedup over the Rust implementation! To use spdot, you can directly compute the weighted dot product of TF-IDF vectors without the normalization that cosine similarity requires. This is both faster and aligns naturally with the sparse representation of TF-IDF. For benchmarking, precompute the TF-IDF vectors as sparse arrays (indices for term IDs, weights for TF-IDF values) and replace the cosine similarity calculation with spdot. This avoids unnecessary operations like normalization and takes full advantage of sparsity. Let me know how the updated results compare, excited to see the numbers!
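For illustration, here is a scalar sketch of what an spdot-style kernel computes: both inputs are sparse vectors stored as sorted term IDs plus parallel weights, and only terms present in both contribute. SimSIMD's sparse kernels accelerate this same merge with SIMD; the exact binding signatures may differ from this sketch:

```rust
use std::cmp::Ordering;

/// Scalar reference for a sparse weighted dot product over two
/// TF-IDF vectors given as (sorted term ids, parallel weights).
fn sparse_dot(a_ids: &[u16], a_w: &[f32], b_ids: &[u16], b_w: &[f32]) -> f32 {
    let (mut i, mut j, mut sum) = (0, 0, 0.0f32);
    while i < a_ids.len() && j < b_ids.len() {
        match a_ids[i].cmp(&b_ids[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                sum += a_w[i] * b_w[j]; // term appears in both vectors
                i += 1;
                j += 1;
            }
        }
    }
    sum
}
```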
Got it, thanks for the pointers. I will set up the benchmarks and share numbers again. This does look promising!
Sure, will move the benchmark code to the repo linked!
Hi @ashvardanian, I noticed that there isn't a NEON implementation for
Also, I wrote the TF-IDF benchmark using f64 and realized that there isn't a spdot version for u16 with f32 or f64 weights either. Does it make sense to add them as well? Also, there seems to be a SimSIMD-native bf16 type and a
Hi @ashvardanian, and a Happy New Year! I didn't have a lot of time to work on this, but I could squeeze out some time during the holidays to finish up some stuff. The vectors I am comparing are about 20-50 elements big. I will try one more round with a larger number of elements to see if the algorithm gets faster beyond a certain size. Would you also be interested in an f16 implementation on NEON? If there are any improvements I can make to the f32 version, I think I might be able to port them to the f16/bf16 versions as well. All measurements were done on an M2 Pro with 32GB of RAM.
Add a vanilla Rust program for calculating TF-IDF scores and finding the top 10 similar documents.
We use cosine similarity to calculate the relevancy of documents for a given query.
@ashvardanian, thanks for your time!
Wanted to check with you on the approach, to see if it makes sense to you as a valid benchmark:
Use the `leipzig1m` and the XL-Sum datasets as corpora. I assume that each 10,000 lines in the `leipzig1m` dataset is a single document, and then calculate the TF-IDF scores for each (term, document) pair in the corpus. Given a query, calculate the TF-IDF score of the query based on the same corpus (and assume the query to be a separate document).
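A sketch of that (term, document) scoring step; the smoothing-free IDF and length-normalized TF here are assumptions, not necessarily what the final benchmark uses:

```rust
use std::collections::{HashMap, HashSet};

/// Compute a TF-IDF weight map per document, given pre-tokenized documents.
fn tfidf(docs: &[Vec<String>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;
    // Document frequency: in how many documents does each term appear?
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in docs {
        for term in doc.iter().collect::<HashSet<_>>() {
            *df.entry(term.as_str()).or_insert(0.0) += 1.0;
        }
    }
    docs.iter()
        .map(|doc| {
            let len = doc.len() as f64;
            // Term frequency within this document.
            let mut tf: HashMap<String, f64> = HashMap::new();
            for term in doc {
                *tf.entry(term.clone()).or_insert(0.0) += 1.0;
            }
            tf.into_iter()
                .map(|(term, count)| {
                    let idf = (n / df[term.as_str()]).ln();
                    (term, (count / len) * idf)
                })
                .collect()
        })
        .collect()
}
```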
The next step is to calculate the cosine score. I couldn't find a good source for how to calculate the score as a vector for a query or a document (Claude gave me a fn that I could possibly use; a sketch of that kind of function is below).
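One possible shape for that function, treating each TF-IDF map as a sparse vector keyed by term (this is my guess at the kind of fn meant here, not the PR's final implementation):

```rust
use std::collections::HashMap;

/// Cosine similarity between two sparse TF-IDF weight maps.
fn cosine(a: &HashMap<String, f64>, b: &HashMap<String, f64>) -> f64 {
    // Dot product over terms present in both maps.
    let dot: f64 = a
        .iter()
        .filter_map(|(term, wa)| b.get(term).map(|wb| wa * wb))
        .sum();
    let norm = |v: &HashMap<String, f64>| v.values().map(|w| w * w).sum::<f64>().sqrt();
    let denom = norm(a) * norm(b);
    if denom == 0.0 { 0.0 } else { dot / denom }
}
```

Scoring the query against every document with this function, then sorting descending, gives the top-10 step that follows.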
Once we have the cosine similarity scores, sort and fetch the top 10.
For benchmarking SimSIMD, I assume the cosine part is where I can use methods from SimSIMD and benchmark them against the vanilla implementation?
And for the query: in memchr vs stringzilla, you basically pick a random set of tokens and then benchmark searching from left and right using memchr and stringzilla. I was thinking of doing something similar by picking terms at random from the corpus and constructing random queries to benchmark.
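That sampling could look like the following sketch, using the `rand` crate (the query length of 5 terms is an arbitrary choice for illustration):

```rust
use rand::seq::SliceRandom;
use rand::Rng;

/// Build a random query by sampling distinct terms from the corpus
/// vocabulary, mirroring the memchr-vs-stringzilla token sampling.
fn random_query(vocabulary: &[String], rng: &mut impl Rng) -> Vec<String> {
    vocabulary.choose_multiple(rng, 5).cloned().collect()
}
```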