Questions on MinHash Deduplication #106

Open
XChen-Zero opened this issue Nov 14, 2024 · 3 comments

XChen-Zero commented Nov 14, 2024

Hello! I have a few questions and observations regarding the deduplication approach using MinHash in this repository. Specifically, I’m interested in some intuition around handling false positives and negatives and the clustering strategy used.

  1. In this Hugging Face blog post on deduplication, it is mentioned that relying on the Locality Sensitive Hashing (LSH) results directly can also yield good performance for downstream language models (LMs). Could it be interpreted that false positives don’t significantly harm LMs, since they merely remove some texts from the dataset? Given a sufficiently large dataset, the diversity might remain relatively unaffected. In contrast, false negatives leave more duplicate texts in the data, which could degrade model performance. Is this a reasonable way to interpret the impact of false positives and negatives on LMs? (A sketch of where LSH false positives and negatives arise follows this list.)

  2. In your code, the final clustering and deduplication use a union-find approach. You also recommended this script for MinHash deduplication, which calculates an “extrema.” However, from my understanding, the extrema calculation could be order-dependent: if the similarity between A and B exceeds the threshold, as does the similarity between B and C, but not between A and C, then running the comparisons sequentially might retain A and C (a toy sketch of this follows the list). In your view, is the difference between retaining A and C versus retaining B and C negligible, or does it depend on specific dataset characteristics?

Thank you for your work, and I appreciate any insights you can provide on these points!
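
For point 1, the false positives and negatives come from the LSH banding step. A quick worked sketch of the standard candidate-pair probability, P(s) = 1 - (1 - s^r)^b for b bands of r rows; the (b, r) values here are illustrative assumptions, not this repository's defaults:

```python
# Standard LSH banding analysis: two documents with Jaccard similarity s
# become candidate duplicates with probability 1 - (1 - s**r) ** b,
# where the MinHash signature is split into b bands of r rows each.
# These (b, r) values are illustrative, not this repository's defaults.
b, r = 25, 10  # 250 hash functions total

def candidate_probability(s: float) -> float:
    return 1 - (1 - s**r) ** b

for s in (0.3, 0.5, 0.7, 0.8, 0.9):
    print(f"similarity={s:.1f} -> P(candidate)={candidate_probability(s):.4f}")
# Pairs well below the intended threshold occasionally collide (false
# positives), and pairs just above it are sometimes missed (false negatives).
```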
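
And to make the order dependence in point 2 concrete, a toy sketch; the similarity values and document names are made up for illustration and this is not the referenced script's actual code:

```python
# Hypothetical pairwise similarities: sim(A, B) and sim(B, C) exceed the
# threshold, but sim(A, C) does not.
SIM = {
    frozenset("AB"): 0.9,
    frozenset("BC"): 0.9,
    frozenset("AC"): 0.3,
}
THRESHOLD = 0.8

def greedy_dedup(docs):
    """Keep each doc only if it is below-threshold similar to every kept doc."""
    kept = []
    for doc in docs:
        if all(SIM[frozenset(doc + k)] < THRESHOLD for k in kept):
            kept.append(doc)
    return kept

print(greedy_dedup(["A", "B", "C"]))  # ['A', 'C']: B is dropped against A
print(greedy_dedup(["B", "A", "C"]))  # ['B']: both A and C are dropped against B
```

Which documents survive depends entirely on the scan order, which is the crux of the question.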

ChenghaoMou (Owner) commented

Thanks for the great questions!

  1. Yes, that's correct. Though there might be different opinions, especially when it comes to different datasets. For example, DCLM claims full deduplication does not improve model performance, and even FineWeb only deduplicates within each CC snapshot, not across the entire dataset. The line is even blurrier if you count repetition/multiple epochs as duplicates.

  2. I am not able to find the extrema you are referring to in the link provided. However, in cases like this, it could be a dataset-dependent issue. For example, if most duplicate clusters are very small, then greedy or not, it's unlikely to make a huge difference. If the cluster sizes are highly skewed, greedy methods are more likely to leave extra duplicates behind in the large clusters. Non-greedy methods, however, require more compute and time: the Spark version in this repo, for example, needs a Spark cluster and a distributed connected-components implementation to scale proper clustering to TB-level datasets. (A minimal single-machine sketch of the connected-components approach follows below.)
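
For reference, here is a minimal single-machine sketch of clustering candidate pairs into connected components with union-find and keeping one document per component. The pair list and the keep-the-smallest-ID policy are illustrative assumptions, not this repository's exact implementation:

```python
from collections import defaultdict

class UnionFind:
    """Minimal union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

# Candidate duplicate pairs, e.g. documents that shared an LSH band bucket
# (hypothetical IDs for illustration).
pairs = [(0, 1), (1, 2), (3, 4)]
doc_ids = range(6)

uf = UnionFind()
for a, b in pairs:
    uf.union(a, b)

clusters = defaultdict(list)
for d in doc_ids:
    clusters[uf.find(d)].append(d)

# One representative per connected component; unlike a greedy scan, the
# result does not depend on the order in which pairs were processed.
keep = sorted(min(members) for members in clusters.values())
print(keep)  # [0, 3, 5]
```

The Spark version described above does the same thing conceptually, with a distributed connected-components step in place of the in-memory union-find.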

XChen-Zero (Author) commented

Thank you so much for your detailed response—it has been incredibly helpful in deepening my understanding of the approach!

I realized that I made an error in the second point of my original message. The correct link to the script I was referring to is actually this one: MinHash deduplication script. I apologize for the confusion caused by the incorrect link earlier.

Thanks again for your patience and insightful answers! They’ve been invaluable as I work through these concepts.


Stale issue message
