
Scaling SimHash Deduplication Across Multiple Machines #108

Open
mohamedlekarim opened this issue Jan 2, 2025 · 1 comment
@mohamedlekarim
Hello,

I am currently running SimHash deduplication on a single instance, but I've encountered performance issues due to the process being very time-consuming. I am looking for guidance on how to scale this process horizontally across multiple machines to improve efficiency.

Specifically, I would like to know whether it’s possible to run the deduplication on different data chunks across multiple machines and then merge the results. If so, could you please outline how to run each stage (Loading, SimHashing, Indexing, Filtering, Saving) on different machines and how to combine the results on a single instance?

Thank you for your assistance!

@ChenghaoMou (Owner)

In theory, one could adapt the current script to a distributed framework like PySpark, but you would save much more time/effort using an algorithm like MinHash, which already has a PySpark version. However, if you have to use SimHash, then your best starting point is the MinHash PySpark script as a reference.
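To make SimHash shuffle-friendly in a framework like Spark, the usual trick is block-keyed candidate generation: split each 64-bit fingerprint into k blocks, so that by the pigeonhole principle any two fingerprints within Hamming distance < k share at least one identical block. Emitting one (block_index, block_value) key per block and grouping by that key brings every candidate pair onto the same worker. The sketch below is a minimal, single-process illustration of that idea (the `simhash64` function is a toy stand-in, not this repo's exact implementation):

```python
from collections import defaultdict
from itertools import combinations


def simhash64(text: str) -> int:
    """Toy 64-bit SimHash over whitespace tokens (illustrative only)."""
    v = [0] * 64
    for word in text.split():
        h = hash(word) & 0xFFFFFFFFFFFFFFFF
        for i in range(64):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(64) if v[i] > 0)


def block_keys(fp: int, num_blocks: int = 4):
    """Split a 64-bit fingerprint into num_blocks equal blocks.

    Two fingerprints within Hamming distance < num_blocks must agree on
    at least one block, so grouping by (block_index, block_value) keeps
    all candidate pairs in the same group -- this group-by is the only
    shuffle a distributed version would need.
    """
    width = 64 // num_blocks
    mask = (1 << width) - 1
    return [(i, (fp >> (i * width)) & mask) for i in range(num_blocks)]


def candidate_pairs(docs):
    """docs: dict of doc_id -> fingerprint.

    Emulates the map / shuffle / reduce pattern: map each doc to its
    block keys, group docs by key, then compare only within each group
    (where the exact Hamming-distance check would run).
    """
    buckets = defaultdict(list)
    for doc_id, fp in docs.items():
        for key in block_keys(fp):
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs
```

In PySpark the same shape becomes a `flatMap` over block keys followed by a `groupByKey` (or a DataFrame `groupBy`), with the within-group comparison and a connected-components/union-find pass to merge duplicate clusters before filtering and saving.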

Directly splitting the data and processing each chunk independently will cause false negatives: if A and B are duplicates but are processed separately on two machines, they are never compared, so the pair is never detected.
