Questions on MinHash Deduplication #106

Open
XChen-Zero opened this issue Nov 14, 2024 · 3 comments

XChen-Zero commented Nov 14, 2024

Hello! I have a few questions and observations regarding the deduplication approach using MinHash in this repository. Specifically, I’m interested in some intuition around handling false positives and negatives and the clustering strategy used.

  1. In this Hugging Face blog post on deduplication, it is mentioned that relying on the Locality Sensitive Hashing (LSH) results directly can also yield good performance for downstream language models (LMs). Could it be interpreted that false positives don’t significantly harm LMs, since they merely remove some texts from the dataset? Given a sufficiently large dataset, the diversity might remain relatively unaffected. In contrast, false negatives leave more duplicate texts in the data, which could degrade model performance. Is this a reasonable way to interpret the impact of false positives and negatives on LMs? (A sketch of where LSH false positives and negatives arise follows this list.)

  2. In your code, the final clustering and deduplication use a union-find approach. You also recommended this script for MinHash deduplication, which calculates an “extrema.” However, from my understanding, the extrema calculation could be order-dependent: if the similarity between A and B exceeds the threshold, as does the similarity between B and C, but not between A and C, then running the comparisons sequentially might retain A and C (a toy sketch of this follows the list). In your view, is the difference between retaining A and C versus retaining B and C negligible, or does it depend on specific dataset characteristics?

Thank you for your work, and I appreciate any insights you can provide on these points!
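
For point 1, the false positives and negatives come from the LSH banding step. A quick worked sketch of the standard candidate-pair probability, P(s) = 1 - (1 - s^r)^b for b bands of r rows; the (b, r) values here are illustrative assumptions, not this repository's defaults:

```python
# Standard LSH banding analysis: two documents with Jaccard similarity s
# become candidate duplicates with probability 1 - (1 - s**r) ** b,
# where the MinHash signature is split into b bands of r rows each.
# These (b, r) values are illustrative, not this repository's defaults.
b, r = 25, 10  # 250 hash functions total

def candidate_probability(s: float) -> float:
    return 1 - (1 - s**r) ** b

for s in (0.3, 0.5, 0.7, 0.8, 0.9):
    print(f"similarity={s:.1f} -> P(candidate)={candidate_probability(s):.4f}")
# Pairs well below the intended threshold occasionally collide (false
# positives), and pairs just above it are sometimes missed (false negatives).
```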
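
And to make the order dependence in point 2 concrete, a toy sketch; the similarity values and document names are made up for illustration and this is not the referenced script's actual code:

```python
# Hypothetical pairwise similarities: sim(A, B) and sim(B, C) exceed the
# threshold, but sim(A, C) does not.
SIM = {
    frozenset("AB"): 0.9,
    frozenset("BC"): 0.9,
    frozenset("AC"): 0.3,
}
THRESHOLD = 0.8

def greedy_dedup(docs):
    """Keep each doc only if it is below-threshold similar to every kept doc."""
    kept = []
    for doc in docs:
        if all(SIM[frozenset(doc + k)] < THRESHOLD for k in kept):
            kept.append(doc)
    return kept

print(greedy_dedup(["A", "B", "C"]))  # ['A', 'C']: B is dropped against A
print(greedy_dedup(["B", "A", "C"]))  # ['B']: both A and C are dropped against B
```

Which documents survive depends entirely on the scan order, which is the crux of the question.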

ChenghaoMou (Owner) commented

Thanks for the great questions!

  1. Yes, that's correct. Though there might be different opinions, especially when it comes to different datasets. For example, DCLM claims full deduplication does not improve model performance, and even FineWeb only deduplicates within each CC snapshot, not across the entire dataset. The line is even blurrier if you count repetition/multiple epochs as duplicates.

  2. I am not able to find the extrema you are referring to in the link provided. However, in cases like this, it could be a dataset-dependent issue. For example, if most duplicate clusters are very small, then greedy or not, it's unlikely to make a huge difference. If the cluster sizes are highly skewed, greedy methods are more likely to leave extra duplicates behind in the large clusters. Non-greedy methods, however, require more compute and time: the Spark version in this repo, for example, needs a Spark cluster and a distributed connected-components implementation to scale proper clustering to TB-level datasets. (A minimal single-machine sketch of the connected-components approach follows below.)
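
For reference, here is a minimal single-machine sketch of clustering candidate pairs into connected components with union-find and keeping one document per component. The pair list and the keep-the-smallest-ID policy are illustrative assumptions, not this repository's exact implementation:

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Candidate duplicate pairs, e.g. documents that shared an LSH band bucket
# (hypothetical IDs for illustration).
pairs = [(0, 1), (1, 2), (3, 4)]
doc_ids = range(6)

uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

clusters = defaultdict(list)
for d in doc_ids:
    clusters[uf.find(d)].append(d)

# One representative per connected component; unlike a greedy scan, the
# result does not depend on the order in which pairs were processed.
keep = sorted(min(members) for members in clusters.values())
print(keep)  # [0, 3, 5]
```

The Spark version described above does the same thing conceptually, with a distributed connected-components step in place of the in-memory union-find.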

XChen-Zero (Author) commented

Thank you so much for your detailed response—it has been incredibly helpful in deepening my understanding of the approach!

I realized that I made an error in the second point of my original message. The correct link to the script I was referring to is actually this one: MinHash deduplication script. I apologize for the confusion caused by the incorrect link earlier.

Thanks again for your patience and insightful answers! They’ve been invaluable as I work through these concepts.


Stale issue message
