Questions on SimHash Deduplication #107

mohamedlekarim · 2025-01-02T09:19:09Z

Hello,

Is there a threshold that controls how aggressive the deduplication is in the SimHash implementation, similar to the one in the MinHash approach?
I’m also wondering if we can adjust these parameters:

NUM_BUCKET
BIT_DIFF
F
NGRAM
BATCH_SIZE
Could you please provide an overview of these parameters and suggest optimal values for each?

ChenghaoMou · 2025-01-02T18:02:54Z

Thanks for the question.

You can try increasing bit-diff or decreasing the ngram size to allow more matches, at the price of slower runs and/or more false positives.
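To make the effect of these two knobs concrete, here is a minimal, self-contained SimHash sketch. It is not the repository's implementation: the function names, the word-level n-grams, and the use of `blake2b` as the hash are all assumptions chosen for illustration. It only shows where `ngram`, `f`, and `bit_diff` enter the computation.

```python
import hashlib


def ngrams(text, n):
    """Word-level n-grams (an assumption; other tokenizations are possible)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return [" ".join(tokens)]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simhash(text, n=3, f=64):
    """Hypothetical helper: f-bit SimHash fingerprint; f must be a multiple of 8, at most 512."""
    counts = [0] * f
    for gram in ngrams(text, n):
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=f // 8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(f):
            # Each n-gram votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    # A bit is set when the positive votes win.
    return sum(1 << i for i in range(f) if counts[i] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, n=3, f=64, bit_diff=3):
    """Two texts match when their fingerprints differ in at most bit_diff bits."""
    return hamming(simhash(a, n, f), simhash(b, n, f)) <= bit_diff
```

In this sketch, `bit_diff` plays the role of the threshold: any pair that matches at a given `bit_diff` also matches at every larger value, so raising it can only add matches. A smaller `n` makes each n-gram less sensitive to local edits, which tends to bring the fingerprints of lightly edited texts closer together.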

Optimal settings should be tuned based on your dataset and experiments. I would suggest using a random subset to see the effect before committing to larger runs. Here is a good resource on how it works: https://github.com/seomoz/simhash-cpp?tab=readme-ov-file#architecture
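The linked resource describes splitting each fingerprint into blocks so that candidates can be found without comparing every pair. The sketch below illustrates the underlying pigeonhole argument only; the helper names and the contiguous low-bits-first split are assumptions, not the code used in this repository. If two `f`-bit fingerprints differ in at most `bit_diff` bits and are cut into more than `bit_diff` blocks, at least one block must be identical in both, which is why the number of buckets has to exceed the bit difference.

```python
def split_blocks(fingerprint, f=64, num_bucket=4):
    """Hypothetical helper: cut an f-bit integer into num_bucket contiguous blocks, low bits first."""
    base, extra = divmod(f, num_bucket)
    blocks, shift = [], 0
    for i in range(num_bucket):
        # Spread any remainder bits over the first blocks.
        width = base + (1 if i < extra else 0)
        blocks.append((fingerprint >> shift) & ((1 << width) - 1))
        shift += width
    return blocks


def shares_block(a, b, f=64, num_bucket=4):
    """True when at least one block is identical, i.e. the pair is a candidate."""
    blocks_a = split_blocks(a, f, num_bucket)
    blocks_b = split_blocks(b, f, num_bucket)
    return any(x == y for x, y in zip(blocks_a, blocks_b))


a = 0x0123456789ABCDEF
three_bits = a ^ (1 | 1 << 16 | 1 << 32)           # one flipped bit in three blocks
four_bits = a ^ (1 | 1 << 16 | 1 << 32 | 1 << 48)  # one flipped bit in every block
print(shares_block(a, three_bits))  # still a candidate with 4 blocks
print(shares_block(a, four_bits))   # no identical block left
```

With 4 blocks, a pair that differs in 3 bits always keeps at least one block intact, while a pair that differs in 4 bits may not. More blocks allow a larger bit difference but make each block shorter, so more unrelated pairs collide and need to be checked.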
