Remove outdated note on very large datasets in MultiEpochDataset (#521)
Thanks to Savitha for pointing out that this part of the docs is no longer
relevant. A new algorithm makes the MultiEpochDataset O(1) in memory as
well.

Signed-off-by: Peter St. John <[email protected]>
pstjohn authored Jan 13, 2025
1 parent 2e90bf5 commit db237fc
Showing 2 changed files with 1 addition and 7 deletions.
6 changes: 0 additions & 6 deletions docs/docs/user-guide/background/megatron_datasets.md
@@ -50,12 +50,6 @@ for sample in MultiEpochDatasetResampler(dataset, num_epochs=3, shuffle=True):
...
```

-!!! note "Very large datasets"
-
-For datasets where `len(dataset)` is too large for a shuffled list of indices to comfortably fit in memory,
-[PRNGResampleDataset][bionemo.core.data.resamples.PRNGResampleDataset] offers a simple solution for shuffling a
-dataset with replacement in O(1) memory.
-
## Training Resumption
To ensure identical behavior with and without job interruption, BioNeMo provides [MegatronDataModule][bionemo.llm.data.datamodule.MegatronDataModule] to save and load state dict for training resumption, and provides [WrappedDataLoader][nemo.lightning.data.WrappedDataLoader] to add a `mode` attribute to [DataLoader][torch.utils.data.DataLoader].

2 changes: 1 addition & 1 deletion docs/docs/user-guide/getting-started/index.md
@@ -351,7 +351,7 @@ confirmed to be working with bionemo2 (and those that are tested in CI).
To initialize these sub-modules when cloning the repo, add the `--recursive` flag to the git clone command:

```bash
-git clone --recursive git@github.com:NVIDIA/bionemo-fw-ea.git
+git clone --recursive git@github.com:NVIDIA/bionemo-framework.git
```

To download the pinned versions of these submodules within an existing git repository, run
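
For context on the note removed in the first file: the idea of sampling with replacement in O(1) memory can be sketched roughly as below. This is an illustrative sketch only, not the BioNeMo implementation. The class name `ResampleWithReplacement` and its parameters are hypothetical, and the per-index seeding scheme is an assumption chosen for simplicity. Instead of materializing a shuffled list of indices, each requested position is mapped to a source index by a PRNG seeded from `(seed, index)`.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


class ResampleWithReplacement:
    """Hypothetical wrapper: samples a dataset with replacement in O(1) memory.

    No list of shuffled indices is stored. Each lookup derives its source
    index from a PRNG seeded with the (seed, index) pair, so results are
    deterministic and independent of access order.
    """

    def __init__(self, dataset: Sequence[T], seed: int = 42,
                 num_samples: Optional[int] = None) -> None:
        self.dataset = dataset
        self.seed = seed
        # Default to one "epoch" worth of samples.
        self.num_samples = len(dataset) if num_samples is None else num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, index: int) -> T:
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # String seeds are hashed deterministically by random.Random.
        rng = random.Random(f"{self.seed}:{index}")
        return self.dataset[rng.randrange(len(self.dataset))]


# Usage: a very large virtual dataset; memory use does not grow with its length.
big = range(10**12)
resampled = ResampleWithReplacement(big, seed=0)
first_three = [resampled[i] for i in range(3)]
```

Because this draws with replacement, some items may appear more than once and others not at all within an epoch; that is the trade-off the removed note described for datasets too large to hold a shuffled index list.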