How to Create TreeOfLife-10M

Note:

  • TreeOfLife-10M includes the EOL images, but not the iNat21 or BIOSCAN-1M images, due to licensing restrictions.
  • To reconstruct the full dataset, please follow the steps outlined below in Reproduce TreeOfLife-10M. This reproduction process is designed to be run on an HPC system using Slurm.

Reproduce TreeOfLife-10M

All of the following steps should be completed in the root directory of the repository. Start by setting up your conda environment with requirements-training.yml:

conda env create -f requirements-training.yml --solver=libmamba -y
conda activate bioclip-train
pip install -e .
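
Optionally, confirm the environment resolved before submitting any jobs. This is a minimal sketch; it assumes requirements-training.yml installs PyTorch and webdataset, which the training and webdataset-building scripts rely on.

# Optional sanity check (a sketch): assumes requirements-training.yml installs
# PyTorch and webdataset; adjust if your environment differs.
import torch
import webdataset  # only confirming the import succeeds

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("webdataset import OK")
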
  1. Download TreeOfLife-10M:
    • Optional: Change the dataset storage location and other Slurm parameters (within the "customize" section) in the component download setup script (scripts/setup_download_tol-10m_components.bash).
    • Download TreeOfLife-10M components by running:
       sbatch --account <HPC-account> scripts/submit_download_tol-10m_components.bash
      This will download the tar and metadata files from Hugging Face, as well as iNat21 and BIOSCAN-1M into ../data/TreeOfLife-10M/ relative to the script, in the format specified in disk_reproduce.
      • Note: This launches a collection of scripts which can also be run individually.
  2. make-dataset-wds_reproduce:
    • This actually creates the webdataset files by running make_wds_reproduce for each of the splits.
    • Adjust make-dataset-wds_reproduce for your local setup (e.g., the account, path information, and other settings described below).
    • On your HPC, run:
       sbatch --account <HPC-account> slurm/make-dataset-wds_reproduce.sh
      • This runs scripts/evobio10m/make_wds_reproduce.py for each of the splits with 32 workers.
      • It takes a long time (roughly 6 hours) and requires a lot of memory.
  3. check_wds:
    • Checks for bad shards and records them; a minimal spot-check sketch for inspecting individual shards follows this list.
    • Run
       sbatch --account <HPC-account> --cpus-per-task <num-CPUs> slurm/check-wds.slurm <shards> 
      • Writes a list of bad shards to logs/bad-shards.txt.
      • For instance, if images are placed in the default location, run the following to check the training split:
       sbatch --account <HPC-account> --cpus-per-task 32 slurm/check-wds.slurm 'data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224/train/shard-{000000..000165}.tar'
  4. make_catalog_reproduce:
    • Generates the catalog of all images in the dataset, which includes information about their original data source and taxonomic record.
    • Run
       sbatch --account <HPC-account> --cpus-per-task <N> slurm/make-catalog_reproduce.slurm \
       --dir <path/to/splits> \
       --db <path/to/db> \
       --tag <tag> \
       --batch-size <batch-size>
      • Creates a file catalog.csv in --dir, which lists all of the sample names in the webdataset.
      • Note: mapping.sqlite is a SQLite database built from just predicted-catalog.csv; it can be replaced by a SQLite database constructed from TreeOfLife-10M/metadata/catalog.csv, which may be overwritten in this step depending on where these files are saved.
      • For instance, if images are placed in the default location, run the following to generate the catalog file:
       sbatch --account <HPC-account> --cpus-per-task 32 slurm/make-catalog_reproduce.slurm \
       --dir data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224 \
       --db data/TreeOfLife-10M/metadata/mapping.sqlite \
       --tag CVPR-2024 \
       --batch-size 256
  5. check_taxa:
    • This checks the generated catalog file for any taxa issues.
    • More information on this file can be found here.
    • Run
       python scripts/evobio10m/check_taxa.py /<path-to>/data/evobio10m-CVPR-2024/catalog.csv
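
If you want to spot-check a rebuilt shard by hand (in addition to check-wds.slurm above), here is a minimal sketch. It assumes the webdataset package is available in the training environment, and it only prints each sample's key and field names, so it makes no assumptions about the shard contents.

import webdataset as wds

# Path to one rebuilt training shard; adjust to your storage location.
shard = "data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224/train/shard-000000.tar"

# Print the key and field names of the first few samples (no decoding needed).
for i, sample in enumerate(wds.WebDataset(shard, shardshuffle=False)):
    if i >= 5:
        break
    fields = sorted(k for k in sample if not k.startswith("__"))
    print(sample["__key__"], fields)
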

Original TreeOfLife-10M Generation

This was the process for creating the entire dataset, version 3.3 (which we used to train BioCLIP for the public release).

  1. download_data:
    • Run bash scripts/download_data.sh to download most of the metadata files.
  2. make_mapping:
    • Creates the SQLite database that maps original files to tree of life IDs.
    • Run python scripts/evobio10m/make_mapping.py --tag v3.3 --workers 8
      • This can run on a login node and should take several hours. To run it much faster, queue it on Slurm with more workers.
  3. make_splits:
    • Adds the splits table to the SQLite database: marks each image as belonging to either val or train, then picks out 10% of the training images to use for an ablation study (a sketch for inspecting the resulting database follows this list).
    • Run python scripts/evobio10m/make_splits.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite --val-split 5 --train-small-split 10 --seed 17
      • This will run quickly on a login node.
  4. make_metadata:
    • Creates all the metadata files that can be easily used by make_wds.py.
    • Also makes a predicted-catalog.csv file that closely mimics catalog.csv (described below); predicted-catalog.csv includes rows for the rare species, which are not included in catalog.csv.
    • Run python scripts/evobio10m/make_metadata.py --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
  5. check_taxa:
    • This will check the predicted catalog file for any taxa issues. If there are major issues, fix them before continuing.
    • Run python scripts/evobio10m/check_taxa.py /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/predicted-catalog.csv
  6. make-dataset-wds:
    • This actually creates the webdataset files by running make_wds for each of the splits.
    • Run sbatch slurm/make-dataset-wds.sh on Pitzer.
      • This runs scripts/evobio10m/make_wds.py for each of the splits with 32 workers.
      • It takes a long time (roughly 6 hours) and requires a lot of memory.
  7. check_wds:
    • Checks for bad shards and records them.
    • Run python scripts/evobio10m/check_wds.py --shardlist SHARDS --workers 8 > logs/bad-shards.txt
      • Writes a list of bad shards to logs/bad-shards.txt.
  8. make_catalog:
    • Generates the catalog of all images in the dataset, which includes information about their original data source and taxonomic record.
    • Run python scripts/evobio10m/make_catalog.py --dir /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/224x224/ --workers 8 --batch-size 256 --tag v3.3 --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
      • Creates a file catalog.csv in --dir, which lists all of the sample names in the webdataset.
      • Note: mapping.sqlite is a SQLite database built from just predicted-catalog.csv; it can be replaced by a SQLite database constructed from TreeOfLife-10M/metadata/catalog.csv, which may be overwritten in this step depending on where these files are saved.
  9. check_taxa:
    • This will check the actual catalog file for any taxa issues.
    • More information on this file can be found here.
    • Run python scripts/evobio10m/check_taxa.py /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/catalog.csv
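
To inspect mapping.sqlite after make_mapping and make_splits have run, the following sketch lists its tables and row counts using only Python's standard library. It makes no assumptions about the schema beyond the file being a SQLite database, so adapt any further queries to the table and column names you actually see.

import sqlite3

# Path from the commands above; adjust to wherever your mapping.sqlite lives.
db = sqlite3.connect("/fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite")

# List the tables, then count rows in each one.
tables = [row[0] for row in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print("tables:", tables)

for table in tables:
    (count,) = db.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
    print(f"{table}: {count} rows")

db.close()
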

This process is buggy: make_wds.py tries to re-write corrupted webdataset files, but it does not always succeed. make_wds.py also ignores images and species used in the rare species benchmark.
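
If check_wds reports problems, a small sketch like the one below can pull the affected shard indices out of logs/bad-shards.txt so you know which shards to regenerate with make_wds.py. It assumes the log contains one shard path per line ending in shard-XXXXXX.tar; the exact log format may differ, so adjust the pattern as needed.

import re
from pathlib import Path

# Collect the shard indices mentioned in the bad-shards log
# (assumes one shard path per line; adjust the regex if the format differs).
indices = set()
for line in Path("logs/bad-shards.txt").read_text().splitlines():
    match = re.search(r"shard-(\d+)\.tar", line)
    if match:
        indices.add(match.group(1))

print("shards that may need to be regenerated:", sorted(indices))
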