Official PyTorch implementation of our paper:
Can CLIP Help Sound Source Localization?
Sooyoung Park*, Arda Senocak*, Joon Son Chung (* Equal Contribution)
WACV 2024
This repo is the PyTorch implementation of Audio-Grounded Contrastive Learning (ACL). The code is kept simple so that it can be read and understood quickly.
Parts of the code are based on AudioToken, BEATs, and TCL.
- Python = 3.10.8
- PyTorch = 1.13.0
- transformers = 4.25.1
$ conda install -c nvidia cudatoolkit=11.7
$ conda install -c conda-forge cudnn
$ conda install python=3.10
$ pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117
$ pip install tensorboard
$ pip install transformers==4.25.1
$ pip install opencv-python
$ pip install tqdm
$ pip install scikit-learn
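After installation, a quick check that the pinned versions are the ones actually installed can save debugging time. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, and it assumes the pip distribution names `torch` and `transformers` used in the commands above.

```python
from importlib import metadata

# Versions listed in the requirements above. The distribution names are
# assumed to match the pip install commands ("torch", "transformers").
EXPECTED = {"torch": "1.13.0", "transformers": "4.25.1"}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Ignore local build tags such as "+cu117" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = installed
    return mismatches
```

Calling `find_mismatches(EXPECTED)` inside the conda environment returns an empty dict when both packages match the versions listed above.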
Important Note: All audio samples must be converted to 16kHz, and for detailed instructions, refer to the readme in each dataset-specific directory.
- Dataset
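For the 16 kHz requirement, one possible approach is to resample with `ffmpeg` before training. The sketch below is an assumption rather than the repository's own preprocessing: it presumes `ffmpeg` is installed and on `PATH`, that the audio files are `.wav`, and the function names are hypothetical. The dataset-specific readmes remain the authoritative instructions.

```python
import subprocess
from pathlib import Path

TARGET_SR = 16000  # 16 kHz, as required by the note above


def build_resample_cmd(src, dst, sample_rate=TARGET_SR):
    """Build an ffmpeg command that rewrites `src` to `dst` at `sample_rate` Hz."""
    # -y overwrites an existing output file, -ar sets the audio sample rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), str(dst)]


def resample_dir(in_dir, out_dir, pattern="*.wav"):
    """Resample every matching file directly inside `in_dir` into `out_dir`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob(pattern)):
        subprocess.run(build_resample_cmd(src, out_dir / src.name), check=True)
```

Writing to a separate output directory keeps the original files untouched, which makes it easy to re-run the conversion if a dataset readme asks for different settings.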
Download the pretrained model (audio backbone) into the pretrain folder:
- BEATs: https://github.com/microsoft/unilm/tree/master/beats
- BEATs_iter3_plus_AS2M_finedtuned_on_AS2M_cpt2.pt
- Check the .sh files and set
$ export CUDA_VISIBLE_DEVICES="**"
according to your hardware setup.
- Make sure that --model_name corresponds to the configuration file located at ./config/model/{-model_name}.yaml.
- Model files (.pth) will be saved in the directory {--save_path}/Train_record/{-model_name}_{-exp_name}/.
- Review the configuration settings in ./config/train/{-train_config}.yaml to ensure they match your training requirements.
- Choose one of the following methods to initiate training:
$ sh SingleGPU_Experiment.sh    # For single GPU setup
$ sh Distributed_Experiment.sh  # For multi-GPU setup (DDP)
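The path conventions described in the training steps can be sketched as follows. This only mirrors the patterns written above and is not taken from the repository's code: the helper names are hypothetical, and values such as `ACL` or `exp1` in the usage note are placeholders.

```python
from pathlib import Path


def model_config_path(model_name, root="."):
    """Config file that --model_name is expected to correspond to."""
    return Path(root) / "config" / "model" / f"{model_name}.yaml"


def train_config_path(train_config, root="."):
    """Training config to review before launching a run."""
    return Path(root) / "config" / "train" / f"{train_config}.yaml"


def train_record_dir(save_path, model_name, exp_name):
    """Directory where the .pth model files are saved during training."""
    return Path(save_path) / "Train_record" / f"{model_name}_{exp_name}"
```

For example, `train_record_dir("out", "ACL", "exp1")` gives `out/Train_record/ACL_exp1`, and checking `model_config_path("ACL").is_file()` before launching catches a mistyped model name early.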
- Before testing, review the .sh file and set the
$ export CUDA_VISIBLE_DEVICES="**"
environment variable according to your hardware configuration.
- Ensure that the --model_name parameter corresponds to the configuration file located at ./config/model/{-model_name}.yaml.
- Model files (.pth) located at {--save_path}/{-model_name}_{-exp_name}/Param_{-epochs}.pth will be used for testing.
- The --epochs parameter can accept either an integer or a list of integers (e.g., 1, 2, 3).
- If --epochs is left unspecified (null), the default model file {--save_path}/Train_record/{-model_name}_{-exp_name}/Param_best.pth will be used for testing.
$ sh Test_PTModels
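The checkpoint selection described above (an integer, a list of integers, or null for --epochs) can be illustrated with a short sketch. It follows the path patterns literally as written in this README and is not the repository's actual loading code. The function name is hypothetical, and because the per-epoch and default paths differ in whether they include `Train_record`, it is worth confirming the real layout against the test script.

```python
from pathlib import Path


def checkpoint_paths(save_path, model_name, exp_name, epochs=None):
    """Return the checkpoint files implied by the --epochs setting."""
    run_name = f"{model_name}_{exp_name}"
    if epochs is None:
        # Unspecified (null): fall back to the best checkpoint.
        return [Path(save_path) / "Train_record" / run_name / "Param_best.pth"]
    if isinstance(epochs, int):
        # A single integer is treated as a one-element list.
        epochs = [epochs]
    return [Path(save_path) / run_name / f"Param_{epoch}.pth" for epoch in epochs]
```

With placeholder values, `checkpoint_paths("out", "ACL", "exp1", [1, 2])` yields `out/ACL_exp1/Param_1.pth` and `out/ACL_exp1/Param_2.pth`, while omitting `epochs` yields the single `Param_best.pth` path.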
Important Note: After downloading the Param_best.pth file, move it to the directory {--save_path}/{-model_name}_{-exp_name}/ before use.
- VGG-Sound 144k trained model: [Link]
- This model was trained using a 2-GPU setup.
If you use this project, please cite it as:
@inproceedings{park2023clip,
title={Can CLIP Help Sound Source Localization?},
author={Sooyoung Park and Arda Senocak and Joon Son Chung},
journal = {arXiv preprint arXiv:2311.04066},
year={2023},
}