Questions on MinHash Deduplication #106
Comments
Thanks for the great questions!
Thank you so much for your detailed response; it has been incredibly helpful in deepening my understanding of the approach! I realized that I made an error in the second point of my original message. The correct link to the script I was referring to is actually this one: MinHash deduplication script. I apologize for the confusion caused by the incorrect link earlier. Thanks again for your patience and insightful answers! They've been invaluable as I work through these concepts.
Stale issue message
Hello! I have a few questions and observations regarding the deduplication approach using MinHash in this repository. Specifically, I’m interested in some intuition around handling false positives and negatives and the clustering strategy used.
In this Hugging Face blog post on deduplication, it is mentioned that relying on the Locality Sensitive Hashing (LSH) results directly can also yield good performance for downstream language models (LMs). Could it be interpreted that false positives don’t significantly harm LMs, as they merely remove some texts from the dataset? Given a sufficiently large dataset, the diversity might remain relatively unaffected. In contrast, false negatives introduce more duplicate texts, which could degrade model performance. Is this a reasonable way to interpret the impact of false positives and negatives on LMs?
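To make the false-positive/false-negative trade-off concrete, here is a small sketch of the standard LSH banding formula (not code from this repository): with `b` bands of `r` rows each, a pair of documents with Jaccard similarity `s` becomes a duplicate candidate with probability `1 - (1 - s^r)^b`. The band/row values below are illustrative assumptions, not the repository's settings.

```python
def candidate_probability(s: float, num_bands: int, rows_per_band: int) -> float:
    """Probability that a pair with Jaccard similarity s collides in at
    least one LSH band, under the banding scheme 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - s ** rows_per_band) ** num_bands

# Example: 20 bands of 5 rows each (100 hash permutations total).
for s in (0.3, 0.5, 0.7, 0.9):
    p = candidate_probability(s, num_bands=20, rows_per_band=5)
    print(f"similarity {s:.1f} -> candidate probability {p:.3f}")
```

The S-shaped curve this produces is why tuning `b` and `r` trades false positives (low-similarity pairs that still collide) against false negatives (high-similarity pairs that never collide): dissimilar pairs are caught only rarely, while very similar pairs are caught almost surely.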
In your code, the final clustering and deduplication use a union-find approach. You also recommended this script for MinHash deduplication, which calculates an “extrema.” However, from my understanding, the extrema calculation could be order-dependent. For instance, if the similarity between A and B exceeds the threshold, as does the similarity between B and C, but not between A and C, running the calculations sequentially might retain A and C. In your view, is the difference between retaining A and C versus B and C negligible, or does it depend on specific dataset characteristics?
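The order-dependence described above can be demonstrated with a toy example (the pairwise similarities here are assumed for illustration, not real MinHash output): A~B and B~C exceed the threshold, but A~C does not. A greedy sequential pass keeps different documents depending on processing order, whereas a union-find pass groups all transitively linked documents into one cluster and keeps a single representative regardless of order.

```python
# Assumed similarity relation: A~B and B~C are above threshold, A~C is not.
similar = {frozenset({"A", "B"}), frozenset({"B", "C"})}

def is_dup(x: str, y: str) -> bool:
    return frozenset({x, y}) in similar

def greedy_dedup(docs: list[str]) -> list[str]:
    """Keep a doc only if it is not a near-duplicate of any already-kept doc."""
    kept: list[str] = []
    for d in docs:
        if not any(is_dup(d, k) for k in kept):
            kept.append(d)
    return kept

print(greedy_dedup(["A", "B", "C"]))  # ['A', 'C']
print(greedy_dedup(["B", "A", "C"]))  # ['B']

# Union-find: merge every above-threshold pair, then keep one doc per cluster.
parent = {d: d for d in "ABC"}

def find(x: str) -> str:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(x: str, y: str) -> None:
    parent[find(x)] = find(y)

for a, b in [("A", "B"), ("B", "C")]:
    union(a, b)

clusters: dict[str, list[str]] = {}
for d in "ABC":
    clusters.setdefault(find(d), []).append(d)
print(clusters)  # a single cluster containing A, B, and C
```

So the greedy pass can keep {A, C} or only {B} depending on order, while union-find always collapses the chain A-B-C into one cluster; which outcome is preferable presumably depends on how much borderline diversity (A and C) you want to preserve versus how aggressively you want to remove near-duplicates.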
Thank you for your work, and I appreciate any insights you can provide on these points!