Goal: to find associations between dna data and visual traits.
conda create -n gtp python=3.10
conda activate gtp
pip install -e .
To personalize options can either:
- Change the configurations in src/gtp/configs/default_configs.yaml
- Copy that file and create a new one. Then update DEFAULT_YAML_CONFIG_PATH
in src/gtp/configs/loaders.py
To run tests, run the follwing command:
pytest tests
NOTE: these tests depend on the current configs and may need to be adjusted for your development environment.
To preprocess the Data, you can either:
- Run the notebook in
notebooks/preprocess_pipeline.ipynb
. - Or, run the commands:
python -m gtp.pipelines.preprocess --method phenotype
python -m gtp.pipelines.preprocess --method genotype
NOTE: preprocessing genotype data takes a long time. You can add the num-processes
argument to the command to help speed this up. 4 by default. Do not be fooled by the initial processing speed as smaller files are processed first.
To keep data splits consistent across runs, precompute the splits with the following command:
python -m gtp.pipelines.create_training_splits
Run the following command:
python -m gtp.pipelines.train --chromosome 18 --species erato --color color_3 --wing forewings --epochs 100 --batch_size 32 --exp_name debug
To calculate the attributions of a single chromosome or entire genome (depending on configs: see experiment.genotype_scope
), run the following command:
python -m gtp.pipelines.process_attributions --species erato --color color_3 --wing forewings --exp_name debug
NOTE: you have to have trained a model for every chromosome for a species in order for the above command to work on the entire genome. Otherwise, it'll only plot / calculate data for chromosomes with a trained model.
To plot the attributions of a single chromosome or entire genome (depending on configs: see experiment.genotype_scope
), run the following command:
python -m gtp.pipelines.plot_attributions --species erato --color color_3 --wing forewings --exp_name debug
NOTE: you have to have trained a model for every chromosome for a species in order for the above command to work on the entire genome. Otherwise, it'll only plot / calculate data for chromosomes with a trained model.