Thoughts on Nvidia Data Curator #49
Replies: 3 comments
-
I think, for now, this repo should focus on deduplication itself, as text preprocessing is too big a scope to be included here. Even in projects like BigCode, there are multiple working groups, each focusing on one of the components you mentioned here, and each area requires a lot of domain-specific knowledge. Their code is only available to some beta testers and relies heavily on a Dask cluster and a distributed Redis database. They adopted some optimizations I did not include in this repo, e.g. their connected-components step is done on a single sparse matrix with some Cython code, as is the minhash computation (https://github.com/mattilyra/LSH). It is heavily dockerized, which I guess also means less portable. I will see if I can benchmark their performance once it is generally available.
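To make the connected-components step concrete: given candidate duplicate pairs (e.g. from MinHash LSH), the goal is to group document IDs into duplicate clusters. The implementation mentioned above runs on a single sparse matrix with Cython; the pure-Python union-find sketch below only illustrates the idea and is not their actual code.

```python
# Union-find sketch of the connected-components step used in dedup:
# candidate duplicate pairs in, clusters of document ids out.
# (Hypothetical helper names; the real version uses a sparse matrix.)

def find(parent, x):
    # Path-compressing find.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return list(groups.values())

# Pairs (1,2) and (2,3) chain into one cluster; (4,5) forms another.
print(cluster([(1, 2), (2, 3), (4, 5)]))
```

A sparse-matrix formulation (e.g. `scipy.sparse.csgraph.connected_components`) computes the same clusters much faster at scale, which is presumably why their version is Cython-backed.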
-
Hi! I'm one of the developers working on NeMo-Curator and happy to answer some of the questions brought up in this discussion, along with any others folks might have! At its core, a lot of the features in Curator operate on dataframes (document batches) consisting of documents, IDs, and any additional metadata. Many functions, like the filters, can also operate at the document level.
We rely on Dask's distributed dataframe capabilities for stages like fuzzy dedup that require large-scale shuffling/groupby, but most other steps are generally parallel and could be extended to other distributed computing frameworks. We've also deprecated some of the Redis dependencies during deduplication in favor of a GPU-accelerated dedup implementation that writes out intermediates as Parquet files instead.
Exact deduplication uses the md5 hash, and a GPU-accelerated md5 implementation gives us a GPU version of exact dedup. We also worked on a GPU-accelerated fuzzy deduplication implementation, since this stage can be a bottleneck at scale; it uses a GPU-accelerated minhash kernel based on mmh3 (32/64-bit) operating on character-level ngrams, plus connected components.
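The exact-dedup scheme described above (md5 the text, keep one document per digest) can be sketched in a few lines of CPU-side Python. This is only an illustration of the hashing approach, not Curator's GPU implementation:

```python
# Sketch of md5-based exact deduplication: hash each document's
# text and keep the first document seen for each digest.
import hashlib

def exact_dedup(docs):
    """docs: iterable of (doc_id, text) pairs; returns ids to keep."""
    seen = set()
    keep = []
    for doc_id, text in docs:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            keep.append(doc_id)
    return keep

docs = [("a", "hello world"), ("b", "hello world"), ("c", "other text")]
print(exact_dedup(docs))  # ['a', 'c']
```

Because each document hashes independently, this step parallelizes trivially; only the final digest-level grouping needs a shuffle in a distributed setting.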
-
Thank you @ayushdg for the details, and it is great to see that the source code is finally released. I will try to take a look and share any follow-up questions here. Thanks again for chiming in!
-
Recently Nvidia released Data Curator for trillion-scale datasets.
Some thoughts on what we could learn from it.
It accomplishes the following:
- huggingface/datasets
- minhashlsh

We have a very streamlined single-computer implementation as well as a Spark version. I have not been able to extract the actual scripts from the NeMo Megatron Launcher.
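For readers unfamiliar with the minhashlsh piece: a MinHash signature approximates the Jaccard similarity between documents' n-gram sets, so near-duplicates can be found without pairwise set comparisons. The sketch below uses character-level ngrams; Curator's kernel hashes with mmh3 (32/64-bit) on GPU, while md5 stands in here so the example stays stdlib-only.

```python
# Hedged sketch of MinHash over character-level n-grams.
# md5 is a stand-in for the mmh3 hashes used in the real kernels.
import hashlib

def char_ngrams(text, n=5):
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash(text, num_perm=64, n=5):
    # One "permutation" per seed: min hash value over all ngrams.
    sig = []
    for seed in range(num_perm):
        best = min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in char_ngrams(text, n)
        )
        sig.append(best)
    return sig

def jaccard_estimate(a, b):
    # Fraction of matching signature slots estimates Jaccard similarity.
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)

print(jaccard_estimate("the quick brown fox", "the quick brown dog"))
```

In the full LSH pipeline, signatures are split into bands so that documents sharing any band land in the same bucket, and only bucketed pairs are checked further.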
I will add more detail if I ever get around to doing so.
Going forward, I will try to implement ftfy.
Possibly as an argument for users to set.