Thoughts on Nvidia Data Curator #49
Replies: 3 comments
-
I think, for now, this repo should focus on deduplication itself, as text preprocessing is too big a scope to be included here. Even in projects like BigCode, there are multiple working groups, each focusing on one of the components you mentioned here, and each area requires a lot of domain-specific knowledge. Their code is only available to some beta testers and relies heavily on a Dask cluster and a distributed Redis database. They adopted some optimizations I did not include in this repo, e.g. their connected-components step is done on a single sparse matrix with some Cython code, as is the minhash computation (https://github.com/mattilyra/LSH). It is heavily dockerized, which I guess also means less portable. I will see if I can benchmark their performance once it is generally available.
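To make the connected-components step concrete: given candidate duplicate pairs (e.g. from MinHash LSH), the goal is to group document IDs into duplicate clusters. The implementation mentioned above runs on a single sparse matrix with Cython; the pure-Python union-find sketch below only illustrates the idea and is not their actual code.

```python
# Union-find sketch of the connected-components step used in dedup:
# candidate duplicate pairs in, clusters of document ids out.
# (Hypothetical helper names; the real version uses a sparse matrix.)

def find(parent, x):
    # Path-compressing find.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return list(groups.values())

# Pairs (1,2) and (2,3) chain into one cluster; (4,5) forms another.
print(cluster([(1, 2), (2, 3), (4, 5)]))
```

A sparse-matrix formulation (e.g. `scipy.sparse.csgraph.connected_components`) computes the same clusters much faster at scale, which is presumably why their version is Cython-backed.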
-
Hi! I'm one of the developers working on NeMo-Curator and happy to answer some of the questions brought up in this discussion, along with any others folks might have! At its core, a lot of the features in Curator operate on dataframes (document batches) consisting of documents, IDs, and any additional metadata. Many functions, like the filters, can also operate at the document level.
We rely on Dask's distributed dataframe capabilities for stages like fuzzy dedup that require large-scale shuffling/groupby, but most other steps are generally parallel and could be extended to other distributed computing frameworks. We've also deprecated some of the Redis dependencies during deduplication in favor of a GPU-accelerated dedup implementation that writes out intermediates as Parquet files instead.
Exact deduplication uses the md5 hash, and a GPU-accelerated md5 implementation gives us a GPU version of exact dedup. We also worked on a GPU-accelerated fuzzy deduplication implementation, since this stage can be a bottleneck at scale; it uses a GPU-accelerated minhash kernel based on mmh3 (32/64-bit) operating on character-level ngrams, plus connected components.
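The exact-dedup scheme described above (md5 the text, keep one document per digest) can be sketched in a few lines of CPU-side Python. This is only an illustration of the hashing approach, not Curator's GPU implementation:

```python
# Sketch of md5-based exact deduplication: hash each document's
# text and keep the first document seen for each digest.
import hashlib

def exact_dedup(docs):
    """docs: iterable of (doc_id, text) pairs; returns ids to keep."""
    seen = set()
    keep = []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(doc_id)
    return keep

docs = [("a", "hello world"), ("b", "hello world"), ("c", "other text")]
print(exact_dedup(docs))  # ['a', 'c']
```

Because each document hashes independently, this step parallelizes trivially; only the final digest-level grouping needs a shuffle in a distributed setting.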
-
Thank you @ayushdg for the details, and it is great to see that the source code is finally released. I will try to take a look and share any follow-up questions here. Thanks again for chiming in!
-
Recently Nvidia released Data Curator for trillion-scale datasets.
Some thoughts on what we could learn from it.
It accomplishes the following:
- huggingface/datasets
- minhashlsh

We have a very streamlined single-computer implementation as well as a Spark version. I have not been able to extract the actual scripts from the NeMo Megatron Launcher.
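For readers unfamiliar with the minhashlsh piece: a MinHash signature approximates the Jaccard similarity between documents' n-gram sets, so near-duplicates can be found without pairwise set comparisons. The sketch below uses character-level ngrams; Curator's kernel hashes with mmh3 (32/64-bit) on GPU, while md5 stands in here so the example stays stdlib-only.

```python
# Hedged sketch of MinHash over character-level n-grams.
# md5 is a stand-in for the mmh3 hashes used in the real kernels.
import hashlib

def char_ngrams(text, n=5):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text, num_perm=64, n=5):
    # One "permutation" per seed: min hash value over all ngrams.
    sig = []
    for seed in range(num_perm):
        best = min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in char_ngrams(text, n)
        )
        sig.append(best)
    return sig

def jaccard_estimate(a, b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(jaccard_estimate("the quick brown fox", "the quick brown dog"))
```

In the full LSH pipeline, signatures are split into bands so that documents sharing any band land in the same bucket, and only bucketed pairs are checked further.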
I will add more detail if I ever get around to doing so.
Going forward, I will try to implement ftfy.
Possibly as an argument for users to set.