ViT-Adapter with DINOv2

Preparation

Please download the DINOv2 pretrained weights into the pretrained/ folder:

| model | # of params | ImageNet k-NN | ImageNet linear | download |
| --- | --- | --- | --- | --- |
| ViT-S/14 distilled | 21 M | 79.0% | 81.1% | backbone only |
| ViT-B/14 distilled | 86 M | 82.1% | 84.5% | backbone only |
| ViT-L/14 distilled | 300 M | 83.5% | 86.3% | backbone only |
| ViT-g/14 | 1,100 M | 83.5% | 86.5% | backbone only |

Then convert these models to have patch size 16:

python convert_14to16.py pretrained/dinov2_vits14_pretrain.pth
python convert_14to16.py pretrained/dinov2_vitb14_pretrain.pth
python convert_14to16.py pretrained/dinov2_vitl14_pretrain.pth
python convert_14to16.py pretrained/dinov2_vitg14_pretrain.pth
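To give a feel for what changing the patch size involves, here is a minimal pure-Python sketch that bilinearly resamples a single 14×14 kernel to 16×16. It is an illustration only and is not taken from convert_14to16.py: the function name `resize_bilinear` is hypothetical, and the assumption that the conversion resamples the patch-embedding kernels this way is ours. A real conversion would operate on the checkpoint tensors (for example with `torch.nn.functional.interpolate`) and may need to adjust other weights as well.

```python
def resize_bilinear(kernel, out_size):
    """Resample a square 2D kernel (a list of rows) to out_size x out_size."""
    in_size = len(kernel)
    scale = in_size / out_size

    def source_coords(i):
        # Map an output index to a source position using pixel centers,
        # then clamp it to the valid source range.
        pos = (i + 0.5) * scale - 0.5
        pos = min(max(pos, 0.0), in_size - 1.0)
        lo = int(pos)
        hi = min(lo + 1, in_size - 1)
        return lo, hi, pos - lo

    out = []
    for y in range(out_size):
        y0, y1, wy = source_coords(y)
        row = []
        for x in range(out_size):
            x0, x1, wx = source_coords(x)
            # Interpolate along x on the two neighbouring rows, then along y.
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x1] * wx
            bottom = kernel[y1][x0] * (1 - wx) + kernel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Hypothetical 14x14 kernel whose values increase from left to right.
kernel_14 = [[float(x) for x in range(14)] for _ in range(14)]
kernel_16 = resize_bilinear(kernel_14, 16)
print(len(kernel_16), len(kernel_16[0]))
```

The resampled kernel covers the same spatial extent with 16 samples per side instead of 14, which is the property a patch-size conversion needs to preserve.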

After that, the directory structure is:

detection
├── pretrained
│   ├── dinov2_vits14_pretrain.pth
│   ├── dinov2_vitb14_pretrain.pth
│   ├── dinov2_vitl14_pretrain.pth
│   ├── dinov2_vitg14_pretrain.pth
│   ├── dinov2_vits14_pretrain_14to16.pth
│   ├── dinov2_vitb14_pretrain_14to16.pth
│   ├── dinov2_vitl14_pretrain_14to16.pth
│   └── dinov2_vitg14_pretrain_14to16.pth
└── convert_14to16.py
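Before training, it can be handy to check that every checkpoint listed above is in place. The following is a small sketch of such a check, meant to be run from the detection/ directory. The helper names `expected_checkpoints` and `missing_checkpoints` are hypothetical and not part of this repository; only the file names come from the layout above.

```python
from pathlib import Path

# Model variants whose checkpoints appear in the directory layout above.
VARIANTS = ("vits14", "vitb14", "vitl14", "vitg14")


def expected_checkpoints(pretrained_dir="pretrained"):
    """List the original and converted checkpoint paths for every variant."""
    root = Path(pretrained_dir)
    paths = []
    for variant in VARIANTS:
        original = root / f"dinov2_{variant}_pretrain.pth"
        # The converted file keeps the same stem with a "_14to16" suffix.
        converted = original.with_name(original.stem + "_14to16.pth")
        paths.extend([original, converted])
    return paths


def missing_checkpoints(pretrained_dir="pretrained"):
    """Return the expected checkpoint paths that are not present on disk."""
    return [p for p in expected_checkpoints(pretrained_dir) if not p.is_file()]


if __name__ == "__main__":
    for path in missing_checkpoints():
        print(f"missing: {path}")
```

If you only plan to train one model size, trimming `VARIANTS` to that variant keeps the check from reporting files you never intended to download.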

Results and Models

| Backbone | Pretrain | Lr schd | box AP | mask AP | #Param | Config | Download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-S | DeiT-S | 3x+MS | 48.2 | 42.8 | 48M | config | ckpt |
| ViT-Adapter-S | DINOv2-S | 3x+MS | 51.5 (+3.3) | 45.6 (+2.8) | 48M | config | ckpt / log |
| ViT-Adapter-B | DeiT-B | 3x+MS | 49.6 | 43.6 | 120M | config | ckpt |
| ViT-Adapter-B | DINOv2-B | 3x+MS | 54.1 (+4.5) | 47.8 (+4.2) | 120M | config | ckpt / log |
| ViT-Adapter-L | AugReg-L | 3x+MS | 52.1 | 46.0 | 348M | config | ckpt / log |
| ViT-Adapter-L | DINOv2-L | 3x+MS | 55.3 (+3.2) | 49.0 (+3.0) | 348M | config | ckpt / log |

Note that the hyper-parameter layer_decay_rate significantly affects the performance of DINOv2. For example, for ViT-Adapter-S with DINOv2-S, the box AP at different layer_decay_rate values is:

| Backbone | Pretrain | 0.70 | 0.75 | 0.80 | 0.90 | 0.95 |
| --- | --- | --- | --- | --- | --- | --- |
| ViT-Adapter-S | DINOv2-S | 51.5 | 51.0 | 50.8 | 49.4 | 48.8 |

Box AP rises steadily as layer_decay_rate decreases, so reducing it below 0.70 may improve performance further.
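To see why this hyper-parameter matters so much, it helps to look at how layer-wise learning-rate decay is commonly formulated: each layer's learning rate is the base rate multiplied by the decay rate raised to the layer's distance from the top of the backbone. The sketch below assumes that common formulation; the function name `layer_lr_scales` is hypothetical, and the exact layer indexing used by this repository's optimizer may differ.

```python
def layer_lr_scales(num_layers, layer_decay_rate):
    """Return learning-rate multipliers for a backbone with num_layers blocks.

    Index 0 stands for the embedding layer and index i for block i, so the
    last block gets a multiplier of 1.0 and earlier layers get smaller ones.
    This indexing is an assumption made for illustration.
    """
    return [layer_decay_rate ** (num_layers - i) for i in range(num_layers + 1)]


# ViT-S has 12 transformer blocks.
for rate in (0.70, 0.95):
    scales = layer_lr_scales(12, rate)
    # The embedding-layer multiplier is roughly 0.014 for 0.70 and 0.54 for 0.95.
    print(f"layer_decay_rate={rate}: embedding scale={scales[0]:.3f}, "
          f"last block scale={scales[-1]:.3f}")
```

Under this formulation, a lower layer_decay_rate keeps the early layers much closer to their pretrained values while the later layers still adapt at the full learning rate.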