I am currently running SimHash deduplication on a single instance, but I've encountered performance issues due to the process being very time-consuming. I am looking for guidance on how to scale this process horizontally across multiple machines to improve efficiency.
Specifically, I would like to know if it’s possible to run the deduplication on different data chunks across multiple machines and then combine the results. If so, could you please outline the steps required to run the deduplication on different machines (Loading, SimHashing, Indexing, Filtering, Saving) and how to merge the results on a single instance?
Thank you for your assistance!
In theory, you could adapt the current script to a distributed framework like PySpark, but you would save much more time and effort by using an algorithm like MinHash, which already has a PySpark version. However, if you have to use SimHash, your best starting point is the MinHash PySpark script.
Directly splitting the data and processing each chunk independently will cause false negatives: if A and B are duplicates but are processed separately on two machines, they are never compared and so will never be flagged as duplicates.
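To make the false-negative point concrete, here is a rough sketch of one way to avoid it. This is not the project's script, and every name in it (`simhash`, `block_keys`, `map_chunk`, `find_duplicate_pairs`, the constants `F`, `K`, `BLOCKS`) is made up for illustration. The assumption is that fingerprinting can run per chunk on any machine, while candidate matching has to happen after a global regrouping by fingerprint block, so that documents from different chunks still meet. It relies on the pigeonhole argument: two 64-bit fingerprints that differ in at most `K` bits must agree exactly on at least one of `K + 1` disjoint blocks.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

F = 64          # fingerprint width in bits (assumed)
K = 3           # max Hamming distance for two docs to count as duplicates (assumed)
BLOCKS = K + 1  # pigeonhole: fingerprints within K bits share at least one block


def simhash(text: str) -> int:
    """Unweighted 64-bit SimHash over whitespace tokens (illustrative only)."""
    counts = [0] * F
    for token in text.lower().split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(F):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(F) if counts[i] > 0)


def block_keys(fp: int):
    """Split a fingerprint into BLOCKS equal blocks; yield (block_index, block_value)."""
    width = F // BLOCKS
    mask = (1 << width) - 1
    for b in range(BLOCKS):
        yield b, (fp >> (b * width)) & mask


def map_chunk(chunk):
    """Per-machine step: emit (block_key, (doc_id, fingerprint)) for each document."""
    for doc_id, text in chunk:
        fp = simhash(text)
        for key in block_keys(fp):
            yield key, (doc_id, fp)


def find_duplicate_pairs(chunks):
    # Stand-in for a distributed shuffle: records from ALL chunks are grouped
    # by block key, so cross-chunk duplicates land in the same bucket.
    buckets = defaultdict(list)
    for chunk in chunks:
        for key, record in map_chunk(chunk):
            buckets[key].append(record)

    pairs = set()
    for records in buckets.values():
        for (id_a, fp_a), (id_b, fp_b) in combinations(records, 2):
            # Sharing a block only makes a candidate; verify the full distance.
            if id_a != id_b and bin(fp_a ^ fp_b).count("1") <= K:
                pairs.add((min(id_a, id_b), max(id_a, id_b)))
    return pairs


# Documents 0 and 1 are identical but live in different chunks.
chunks = [
    [(0, "the quick brown fox jumps over the lazy dog")],
    [(1, "the quick brown fox jumps over the lazy dog"),
     (2, "notes on scaling hash based deduplication across machines")],
]
pairs = find_duplicate_pairs(chunks)
```

In a framework such as PySpark, the `buckets` dictionary would presumably become a shuffle keyed on the block key, and the verified pairs would then be merged into clusters (for example with union-find or connected components) before filtering and saving. The sketch only covers the part that a naive per-chunk split gets wrong.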