This is a competition originally hosted on Kaggle, reproduced here to encourage containerization of submissions by way of Singularity. If you aren't familiar with Singularity, it's a container technology (like Docker) that can be run securely on HPC architectures.
First fork the repo to your own username. For example, if my username is vsoch:
git clone https://www.github.com/vsoch/flavours-of-physics-ftw
cd flavours-of-physics-ftw
singularity create --size 8000 container.ftw
sudo singularity bootstrap container.ftw Singularity
singularity run -B data/input:/data/input -B analysis:/code --pwd /code container.ftw
Now edit main.py, do better, and submit a PR to the contest repo for your entry. Want more details? Keep reading!
- Goals: Read about the data for a breakdown of what is provided; the background and goals of the competition are beautifully described and shown on Kaggle.
- Build: Build your container (see build section below), which will install dependencies and prepare data for you. If you find that you need any additional software or libraries, you can add them to the %post section of the Singularity file.
- Code: Once you have your container built, you can use it to develop and test your submission.
- Submit: A submission to the competition means submitting a PR (pull request) to the contest repo with your entry.
Evaluation for this competition is based on AUC (area under the ROC curve), which broadly captures the tradeoff between your model's true positive and false positive rates. In addition to this criterion, the metrics file includes multiple checks that physicists do to make sure that results are unbiased.
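As a point of reference, here is a minimal sketch of an AUC calculation using scikit-learn's roc_auc_score; the labels and scores below are made up, and the checks in metrics.py are the authoritative versions used for scoring:
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]             # hypothetical background (0) / signal (1) labels
y_scores = [0.1, 0.4, 0.35, 0.8]  # hypothetical model predictions
print(roc_auc_score(y_true, y_scores))  # 0.75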
When you are ready to start your submission, you should fork the repo to your own username, and then clone the fork. For example, if my username on GitHub were vsoch, I would fork and then do:
git clone https://www.github.com/vsoch/flavours-of-physics-ftw
cd flavours-of-physics-ftw
Then you can build your image. You will need one dependency: Singularity must be installed. Building comes down to creating an image and then using bootstrap to build from the container recipe, Singularity.
singularity create --size 8000 container.ftw
sudo singularity bootstrap container.ftw Singularity
To shell into your container, you will want to mount the analysis folder and the external data. You can do that like this. Note that we are making the present working directory (pwd) our folder with the analysis scripts:
singularity shell -B data/input:/data/input -B analysis:/code --pwd /code container.ftw
When you shell into your container, it probably will look the same, but if you do ls / you will see a file called singularity, and root folders /data and /code that aren't on your host. If you look inside, you will see the data and analysis scripts mounted!
ls /code
README.md helpers main.py metrics.py results tests
Try creating a file on the host, and you will see it appear in the container, and vice versa. Thus, your general workflow will be the following:
- run things from within the container, using the python or ipython located at /opt/conda/bin
- edit code in your editor of choice on your host machine
If you ever want to find the data or results locations, these have been provided for you via environment variables:
- CONTAINERSFTW_DATA: The base folder with data
- CONTAINERSFTW_RESULT: The folder where results are written
- CONTAINERSFTW_WORK: The folder where your scripts live.
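For example, a short sketch (not part of the template) of reading these from Python inside the container:
import os

data_dir = os.environ["CONTAINERSFTW_DATA"]      # /data/input
result_dir = os.environ["CONTAINERSFTW_RESULT"]  # where results like submission.csv end up
work_dir = os.environ["CONTAINERSFTW_WORK"]      # your analysis scripts
print(data_dir, result_dir, work_dir)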
It's definitely a good idea, if you are interested, to shell around the container to understand where things are located, and to test the variables to confirm they point where you expect (pressing TAB after echo $CONT will complete and list all three):
echo $CONT
$CONTAINERSFTW_RESULT $CONTAINERSFTW_DATA $CONTAINERSFTW_WORK
echo $CONTAINERSFTW_DATA
/data/input
You can work from inside the container, or comfortably from the host in the analysis folder (mapped to /code in the container). Your main work is going to be located at /code/main.py in the container, which is analysis/main.py on the host. If you open up this file, you can start working interactively in an ipython terminal in the container to test commands. For example, from /code, let's try loading the data in ipython:
from sklearn.ensemble import GradientBoostingClassifier  # the example model used by main.py
from helpers.data import load_data                       # loads the csv files under /data/input
from helpers.logger import bot                           # simple logging helper

train = load_data(name="training")
DEBUG Loading training : /data/input/training.csv
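From here, the rest of the script trains and checks the model. Below is a hedged sketch of that step; the exact feature selection and parameters live in main.py, and the excluded columns come from the Kaggle data description rather than the actual code:
# Drop columns that exist only in training (the label and the physics
# checks), and fit the classifier on the remaining features. This
# mirrors, not copies, what main.py does.
features = [c for c in train.columns
            if c not in ("id", "signal", "mass", "production", "min_ANNmuon")]
model = GradientBoostingClassifier()  # main.py's parameters may differ
model.fit(train[features], train["signal"])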
and if you proceed through the rest of the script, you will produce an example result. You can also run the entire example without shelling into the container at all:
singularity run -B data/input:/data/input -B analysis:/code --pwd /code container.ftw
DEBUG Loading training : /data/input/training.csv
Checking Agreement:
DEBUG Loading check_agreement : /data/input/check_agreement.csv
KS metric 0.0681705596239 True
Checking Correlation:
DEBUG Loading check_correlation : /data/input/check_correlation.csv
CvM metric 0.000981509354914 True
Checking AUC:
AUC 0.834346382383
DEBUG Loading test : /data/input/test.csv
DEBUG submission : /code/results/submission.csv
LOG Result saved to /code/results/submission.csv
The result file is what gets tested in the continuous integration.
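For reference, the submission itself is just a CSV of test ids and predicted signal probabilities. Continuing the hypothetical ipython session above, a hedged sketch of producing it (the real script goes through the helpers; the id/prediction column names follow the Kaggle submission format):
import pandas as pd

test = load_data(name="test")                            # /data/input/test.csv
predictions = model.predict_proba(test[features])[:, 1]  # probability of signal
submission = pd.DataFrame({"id": test["id"], "prediction": predictions})
submission.to_csv("/code/results/submission.csv", index=False)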
If you add dependencies (another Python module, additional data that conforms to competition rules, etc.) you should update the Singularity recipe. For example, we have marked in %post where you can add installation steps:
#########################################################
# Install additional software / libraries here
#########################################################
pip install pokemon
#########################################################
- Do I have to use Python?: Of course not! The base template image given to you reflects a choice by the creator (for example, lots of people use scikit-learn in Python for machine learning). At the end of the day, the evaluation is done over the text file in /analysis/results/submission and is agnostic to how it is generated. Your submission (the container image) must simply run to generate it, and you are good.
For now, for additional FAQs, please see our documentation.