Dolma dedup import or reimplementation #57
chris-ha458
started this conversation in
Ideas
Replies: 1 comment
-
Good idea. I have found it best to write Python wrappers around multithreaded Rust programs; for example, Google's suffix array implementation is wrapped in Python in this repo. It is probably also easier to create a dedicated script for this purpose than to modify the current implementation.
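To make the wrapper idea concrete, here is a minimal sketch of calling a compiled command-line tool from Python through `subprocess`. The function name `run_native_tool` and the binary path mentioned in the note below are hypothetical, and this is not necessarily how the suffix array wrapper in this repo is written; the demo call uses the Python interpreter as a stand-in so the snippet runs anywhere.

```python
import subprocess
import sys


def run_native_tool(cmd, timeout=None):
    """Run an external program and return its stdout as text.

    `cmd` is a list such as ["path/to/binary", "--flag", "value"].
    Raises subprocess.CalledProcessError if the program exits non-zero.
    """
    result = subprocess.run(
        cmd,
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=True,           # fail loudly on a non-zero exit code
        timeout=timeout,
    )
    return result.stdout


# Stand-in demo: the Python interpreter plays the role of the native binary.
demo_output = run_native_tool([sys.executable, "-c", "print('ok')"])
```

With a real Rust binary, the call would look like `run_native_tool(["./target/release/some_dedup_tool", "input.jsonl"])`, where both the path and the argument are placeholders. Any multithreading happens inside the Rust process, so the Python side stays simple.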
-
The dolma repo, a data curation toolkit built for the Dolma dataset, includes a Rust-based, thread-safe Bloom filter dedup implementation. (It is unclear whether it is multithreaded yet.)
Currently, this repo's Bloom filter implementation is Python-based and single-threaded (in both the embedding and processing stages).
Since both codebases are licensed under Apache 2.0, I do not foresee any licensing issue in either reimplementing it or importing it for use here.
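For readers unfamiliar with the technique, a rough single-threaded sketch of Bloom filter dedup in pure Python follows. It is neither dolma's Rust implementation nor this repo's current code; the class name, the default sizes, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter sketch; not thread safe and not tuned for production."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from two 64-bit halves of one SHA-256 digest
        # (double hashing). This is an illustrative choice only.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, item: str) -> bool:
        """Insert item; return True if at least one of its bits was unset."""
        is_new = False
        for pos in self._positions(item):
            byte, mask = pos // 8, 1 << (pos % 8)
            if not self.bits[byte] & mask:
                is_new = True
                self.bits[byte] |= mask
        return is_new


def dedup(docs):
    """Keep the first occurrence of each document, in order."""
    bf = BloomFilter()
    return [d for d in docs if bf.add_if_new(d)]
```

Calling `dedup(["a", "b", "a", "c", "b"])` keeps only the first occurrence of each string. A Bloom filter can report false positives, so a small fraction of unique documents may be dropped once the filter fills up; the bit array is also the shared state that a thread-safe version would need to protect or update atomically.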