Questions on SimHash Deduplication #107

Open
mohamedlekarim opened this issue Jan 2, 2025 · 1 comment

Comments

@mohamedlekarim

Hello,

Is there a threshold that controls how aggressive deduplication is in the SimHash implementation, similar to the one in the MinHash approach?
I'm also wondering whether these parameters can be adjusted:

  • NUM_BUCKET
  • BIT_DIFF
  • F
  • NGRAM
  • BATCH_SIZE

Could you please provide an overview of these parameters and suggest optimal values for each?
@ChenghaoMou
Owner

Thanks for the question.

You can try increasing the bit difference (BIT_DIFF) or decreasing the n-gram size (NGRAM) to allow more matches, at the cost of slower runs and/or more false positives.
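
To make that trade-off concrete, here is a minimal, self-contained sketch of SimHash matching. It is not the code from this repository: the function names, the word-level n-grams, and the SHA-256 feature hash are all assumptions chosen for illustration. The parameters `f`, `ngram`, and `bit_diff` are meant to loosely mirror F, NGRAM, and BIT_DIFF from the question.

```python
import hashlib


def ngrams(text, n):
    """Word-level n-grams (an assumption; other tokenizations are possible)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # Short texts yield a single feature, or none if the text is empty.
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simhash(text, f=64, ngram=3):
    """Compute an f-bit SimHash fingerprint (f must be a multiple of 8, <= 256)."""
    counts = [0] * f
    for gram in ngrams(text, ngram):
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[: f // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(f):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit is set in the fingerprint when its votes are net positive.
    return sum(1 << bit for bit in range(f) if counts[bit] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, f=64, ngram=3, bit_diff=3):
    """Treat two texts as duplicates if their fingerprints differ in <= bit_diff bits."""
    return hamming(simhash(a, f, ngram), simhash(b, f, ngram)) <= bit_diff


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    for k in (0, 3, 8, 16):
        print(k, is_near_duplicate(doc_a, doc_b, bit_diff=k))
```

In this sketch, raising `bit_diff` can only add matches, never remove them, because the Hamming distance between two fingerprints is fixed once `f` and `ngram` are chosen. A smaller `ngram` makes lightly edited texts share more features, which tends to pull their fingerprints closer together.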

Optimal settings should be tuned based on your dataset and experiments. I would suggest using a random subset to see the effect before committing to larger runs. Here is a good resource on how it works: https://github.com/seomoz/simhash-cpp?tab=readme-ov-file#architecture
