How to Create TreeOfLife-10M


  • The hosted TreeOfLife-10M dataset contains the EOL images but not the iNat21 or BIOSCAN-1M images, which are excluded due to licensing restrictions.
  • To reconstruct the full dataset, please follow the steps outlined below in Reproduce TreeOfLife-10M. This reproduction process is designed to be run on an HPC system using Slurm.

Reproduce TreeOfLife-10M

All of the following steps should be completed in the root directory of the repository. Start by setting up your conda environment with requirements-training.yml:

conda env create -f requirements-training.yml --solver=libmamba -y
conda activate bioclip-train
pip install -e .
  1. Download TreeOfLife-10M:
    • Optional: Change the dataset storage location and other Slurm parameters (within the "customize" section) in the component download setup script (scripts/setup_download_tol-10m_components.bash).
    • Download TreeOfLife-10M components by running:
       sbatch --account <HPC-account> scripts/submit_download_tol-10m_components.bash
      This will download the tar and metadata files from Hugging Face, as well as iNat21 and BIOSCAN-1M into ../data/TreeOfLife-10M/ relative to the script, in the format specified in disk_reproduce.
      • Note: This launches a collection of scripts which can also be run individually.
  2. make-dataset-wds_reproduce:
    • Creates the webdataset files by running make_wds_reproduce for each of the splits.
    • Adjust make-dataset-wds_reproduce for your local setup (e.g., account and path information, and the settings described below).
    • On your HPC, run:
       sbatch --account <HPC-account> slurm/
      • This runs the scripts/evobio10m/ for each of the splits using 32 workers.
      • It takes a long time (6 hours) and requires lots of memory.
  3. check_wds:
    • Checks for bad shards and records them.
    • Run:
       sbatch --account <HPC-account> --cpus-per-task <num-CPUs> slurm/check-wds.slurm <shards> 
      • Writes a list of bad shards to logs/bad-shards.txt.
      • For instance, if images are placed in the default location, run the following to check the training split:
       sbatch --account <HPC-account> --cpus-per-task 32 slurm/check-wds.slurm 'data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224/train/shard-{000000..000165}.tar'
  4. make_catalog_reproduce:
    • Generates the catalog of all images in the dataset, which includes information about their original data source and taxonomic record.
    • Run:
       sbatch --account <HPC-account> --cpus-per-task <N> slurm/make-catalog_reproduce.slurm \
       --dir <path/to/splits> \
       --db <path/to/db> \
       --tag <tag> \
       --batch-size <batch-size>
      • Creates catalog.csv in --dir, a list of all names in the webdataset.
      • Note: mapping.sqlite is a SQLite database built from just predicted-catalog.csv. It can be replaced by a SQLite database constructed from TreeOfLife-10M/metadata/catalog.csv. Depending on where these files are saved, catalog.csv may be overwritten in this step.
      • For instance, if images are placed in the default location, run the following to generate the catalog file:
       sbatch --account <HPC-account> --cpus-per-task 32 slurm/make-catalog_reproduce.slurm \
       --dir data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224 \
       --db data/TreeOfLife-10M/metadata/mapping.sqlite \
       --tag CVPR-2024 \
       --batch-size 256
  5. check_taxa:
    • This will check the actual catalog file for any taxa issues.
    • More information on this file can be found here.
    • Run:
       python scripts/evobio10m/ /<path-to>/data/evobio10m-CVPR-2024/catalog.csv
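
The bad-shard check in step 3 (check_wds) can be approximated locally with the Python standard library. The sketch below is not the repository's implementation; the helper names expand_braces and find_bad_shards are hypothetical, and it assumes a "bad" shard is one that is missing or cannot be read end to end as a tar archive. The checks performed by slurm/check-wds.slurm may differ.

```python
import re
import tarfile


def expand_braces(pattern: str) -> list[str]:
    """Expand one bash-style numeric range, e.g. 'shard-{000000..000165}.tar'."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = match.group(1), match.group(2)
    width = len(start)  # preserve the zero padding of the shard index
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(i).zfill(width)}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


def find_bad_shards(pattern: str) -> list[str]:
    """Return shards that are missing or fail to read completely (hypothetical helper)."""
    bad = []
    for path in expand_braces(pattern):
        try:
            with tarfile.open(path) as tar:
                for member in tar:
                    if not member.isfile():
                        continue
                    handle = tar.extractfile(member)
                    # Read each member in 1 MiB chunks so truncated data raises an error.
                    while handle.read(1 << 20):
                        pass
        except (OSError, tarfile.TarError, EOFError):
            bad.append(path)
    return bad
```

For the default training split location, a call might look like find_bad_shards('data/TreeOfLife-10M/dataset/evobio10m-CVPR-2024/224x224/train/shard-{000000..000165}.tar'), with the returned paths written one per line to a log file.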

Original TreeOfLife-10M Generation

This was the process for creating the entire dataset, version 3.3 (which we used to train BioCLIP for the public release).

  1. download_data:
    • Run bash scripts/ to download most of the metadata files.
  2. make_mapping:
    • Creates the SQLite database that maps original files to tree of life IDs.
    • Run python scripts/evobio10m/ --tag v3.3 --workers 8
      • Can run on a login node and should take several hours. To speed it up, queue it on Slurm with more workers.
  3. make_splits:
    • Adds the splits table to the SQLite database: marks each image as belonging to either val or train, then picks out 10% of the training images for use in an ablation study.
    • Run python scripts/evobio10m/ --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite --val-split 5 --train-small-split 10 --seed 17
      • This will run quickly on a login node.
  4. make_metadata:
    • Creates all the metadata files that can be easily used by
    • Also makes a predicted-catalog.csv file that closely mimics catalog.csv (described below). Unlike catalog.csv, predicted-catalog.csv includes rows for the rare species.
    • Run python scripts/evobio10m/ --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
  5. check_taxa:
    • This will check the predicted catalog file for any taxa issues. If there are major issues, fix them first.
    • Run python scripts/evobio10m/ /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/predicted-catalog.csv
  6. make-dataset-wds:
    • Creates the webdataset files by running make_wds for each of the splits.
    • Run sbatch slurm/ on Pitzer.
      • This runs the scripts/evobio10m/ for each of the splits using 32 workers.
      • It takes a long time (6 hours) and requires lots of memory.
  7. check_wds:
    • Checks for bad shards and records them.
    • Run scripts/evobio10m/ --shardlist SHARDS --workers 8 > logs/bad-shards.txt
      • Writes a list of bad shards to logs/bad-shards.txt.
  8. make_catalog:
    • Generates the catalog of all images in the dataset, which includes information about their original data source and taxonomic record.
    • Run python scripts/evobio10m/ --dir /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/224x224/ --workers 8 --batch-size 256 --tag v3.3 --db /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/mapping.sqlite
      • Creates catalog.csv in --dir, a list of all names in the webdataset.
      • Note: mapping.sqlite is a SQLite database built from just predicted-catalog.csv. It can be replaced by a SQLite database constructed from TreeOfLife-10M/metadata/catalog.csv. Depending on where these files are saved, catalog.csv may be overwritten in this step.
  9. check_taxa:
    • This will check the actual catalog file for any taxa issues.
    • More information on this file can be found here.
    • Run python scripts/evobio10m/ /fs/ess/PAS2136/open_clip/data/evobio10m-v3.3/catalog.csv

This process is buggy and doesn't always work. tries to re-write wds files that are corrupted, but it doesn't always work. also ignores images and species used in the rare species benchmark.
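
The split assignment described in make_splits (step 3 above) can be illustrated with a small, seeded sketch. This is not the repository's code: the function name assign_splits is hypothetical, and it assumes that --val-split 5 means 5% of all images go to val and that --train-small-split 10 means 10% of the remaining training images form the ablation subset, as the step description suggests.

```python
import random


def assign_splits(image_ids, val_pct=5, train_small_pct=10, seed=17):
    """Assign each image to 'val' or 'train' and pick a small training subset.

    Hypothetical helper: the percentage semantics are an assumption.
    """
    rng = random.Random(seed)
    ids = sorted(image_ids)  # sort first so the result ignores input order
    rng.shuffle(ids)

    n_val = len(ids) * val_pct // 100
    val = set(ids[:n_val])
    train = ids[n_val:]

    # The ablation subset is drawn only from the training images.
    n_small = len(train) * train_small_pct // 100
    train_small = set(rng.sample(train, n_small))

    splits = {i: ("val" if i in val else "train") for i in ids}
    return splits, train_small
```

Using a fixed seed keeps the assignment reproducible across runs, which matters when the splits are stored in a database and the webdataset shards are generated later from them.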