
Scaling SimHash Deduplication Across Multiple Machines #108

Open
mohamedlekarim opened this issue Jan 2, 2025 · 1 comment
@mohamedlekarim
Hello,

I am currently running SimHash deduplication on a single instance, but I've encountered performance issues due to the process being very time-consuming. I am looking for guidance on how to scale this process horizontally across multiple machines to improve efficiency.

Specifically, I would like to know whether it’s possible to run the deduplication on different data chunks across multiple machines and then merge the results. If so, could you please outline how to run each stage (Loading, SimHashing, Indexing, Filtering, Saving) on different machines and how to combine the results on a single instance?

Thank you for your assistance!

@ChenghaoMou (Owner)

In theory, one could adapt the current script to a distributed framework like PySpark, but you would save much more time/effort using an algorithm like MinHash, which already has a PySpark version. However, if you have to use SimHash, then your best starting point is the MinHash PySpark script as a reference.
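To make SimHash shuffle-friendly in a framework like Spark, the usual trick is block-keyed candidate generation: split each 64-bit fingerprint into k blocks, so that by the pigeonhole principle any two fingerprints within Hamming distance < k share at least one identical block. Emitting one (block_index, block_value) key per block and grouping by that key brings every candidate pair onto the same worker. The sketch below is a minimal, single-process illustration of that idea (the `simhash64` function is a toy stand-in, not this repo's exact implementation):

```python
from collections import defaultdict
from itertools import combinations


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace tokens (illustrative only)."""
    v = [0] * 64
    for word in text.split():
        h = hash(word) & 0xFFFFFFFFFFFFFFFF
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)


def block_keys(fp: int, num_blocks: int = 4):
    """Split a 64-bit fingerprint into num_blocks equal blocks.

    Two fingerprints within Hamming distance < num_blocks must agree on
    at least one block, so grouping by (block_index, block_value) keeps
    all candidate pairs in the same group -- this group-by is the only
    shuffle a distributed version would need.
    """
    width = 64 // num_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(num_blocks)]


def candidate_pairs(docs):
    """docs: dict of doc_id -> fingerprint.

    Emulates the map / shuffle / reduce pattern: map each doc to its
    block keys, group docs by key, then compare only within each group
    (where the exact Hamming-distance check would run).
    """
    buckets = defaultdict(list)
    for doc_id, fp in docs.items():
        for key in block_keys(fp):
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

In PySpark the same shape becomes a `flatMap` over block keys followed by a `groupByKey` (or a DataFrame `groupBy`), with the within-group comparison and a connected-components/union-find pass to merge duplicate clusters before filtering and saving.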

Directly splitting the data and processing each chunk independently will cause false negatives: if A and B are duplicates but are processed separately on two machines, they are never compared, so the pair is never detected.
