We introduce Grounding DINO 1.5, our most powerful open-world object detection model series to date. Building on the solid foundation of its predecessor, Grounding DINO, this release scales up both the model size and the training dataset, enabling it to understand and detect visual objects more accurately. Grounding DINO 1.5 comprises two models for different scenarios:
- Grounding DINO 1.5 Pro — our most capable model for open-set object detection. It covers a wide range of detection scenarios, including long-tailed object detection, dense object detection, and long-caption phrase grounding.
- Grounding DINO 1.5 Edge — our most efficient model for edge computing scenarios. It strives for fast and reliable detection while maintaining low latency and reduced power consumption.
State-of-the-art Zero-Shot Transfer Performance
Grounding DINO 1.5 sets new records on several academic benchmarks. Grounding DINO 1.5 Pro achieves 54.3 AP on the COCO detection zero-shot transfer benchmark, as well as 55.7 AP and 47.6 AP on the LVIS-minival and LVIS-val zero-shot transfer benchmarks, respectively. We compare the zero-shot performance of Grounding DINO 1.5 Pro and Grounding DINO in Figure 1; a more comprehensive comparison is presented in Table 1.
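To make the zero-shot transfer protocol concrete, here is a minimal sketch of how such an evaluation is typically run: the benchmark's category names are concatenated into a text prompt, the detector is applied without any training on the target dataset, and its predictions are scored with the standard COCO metrics. The `detect` callable and its return format are hypothetical placeholders for an open-set detector; only the pycocotools calls are standard.

```python
# Minimal sketch of a COCO-style zero-shot transfer evaluation (assumed
# workflow, not the official Grounding DINO 1.5 evaluation code).
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def evaluate_zero_shot(detect, ann_file="annotations/instances_val2017.json"):
    """`detect(file_name, prompt)` is a hypothetical open-set detector wrapper
    returning parallel lists of boxes (xywh), category names, and scores."""
    coco_gt = COCO(ann_file)
    cat_ids = coco_gt.getCatIds()
    cat_names = [c["name"] for c in coco_gt.loadCats(cat_ids)]
    prompt = " . ".join(cat_names)  # all category names form the text prompt

    results = []
    for img_id in coco_gt.getImgIds():
        file_name = coco_gt.loadImgs(img_id)[0]["file_name"]
        boxes, names, scores = detect(file_name, prompt)
        for box, name, score in zip(boxes, names, scores):
            results.append({
                "image_id": img_id,
                "category_id": cat_ids[cat_names.index(name)],
                "bbox": list(box),
                "score": float(score),
            })

    coco_dt = coco_gt.loadRes(results)
    evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()  # prints AP, the metric quoted above
```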
Fine-Tuning Results on Downstream Tasks
Our study shows that fine-tuning Grounding DINO 1.5 significantly boosts performance on downstream tasks. On the LVIS dataset, it achieves 68.1 AP on LVIS-minival and 63.5 AP on LVIS-val, improvements of 12.4 and 15.9 AP over its zero-shot results. On the ODinW benchmark, it sets new records with 70.6 AP averaged across 35 datasets and 72.4 AP on the 13-dataset subset. These results underscore the model's strong potential for adaptation to specific domains when a small-scale, high-quality dataset is available.
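To make the fine-tuning setting more concrete, below is a hypothetical sketch of a standard downstream fine-tuning loop for a pretrained open-set detector. The optimizer settings, loss interface, and learning rates are illustrative assumptions, not the recipe used to obtain the numbers above.

```python
# Hypothetical fine-tuning loop for adapting a pretrained open-set detector
# to a small, high-quality downstream dataset (e.g., one ODinW task).
# The model/loss interfaces here are assumptions, not the official code.
import torch

def finetune(model, dataloader, epochs=12, lr=1e-5, backbone_lr_scale=0.1):
    params = [
        {"params": model.backbone.parameters(), "lr": lr * backbone_lr_scale},
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("backbone")], "lr": lr},
    ]
    optimizer = torch.optim.AdamW(params, weight_decay=1e-4)
    model.train()
    for epoch in range(epochs):
        for images, captions, targets in dataloader:
            # The detector is assumed to return a dict of losses
            # (grounding/classification + box regression) given text prompts.
            losses = model(images, captions, targets)
            loss = sum(losses.values())
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
            optimizer.step()
    return model
```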
Visualizations
The following visualizations of the Grounding DINO 1.5 Pro model's predictions demonstrate its detection capabilities across various scenarios, including common object detection, long-tailed object detection, dense object detection, short-caption phrase grounding, and long-caption phrase grounding. Additionally, we present side-by-side comparisons of the prediction results from Grounding DINO 1.5 Pro and Grounding DINO, highlighting each model's ability to mitigate object hallucination.
Model Architecture & Training Data
Grounding DINO 1.5 Pro Model
Grounding DINO 1.5 Pro preserves the core structure of Grounding DINO while incorporating a larger Vision Transformer backbone. We adopt the pretrained ViT-L (EVA-02) model as our primary vision backbone for its superior performance on downstream tasks and its pure Transformer design, which lays a solid foundation for optimizing the model training and inference processes.
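The sketch below outlines the overall structure this paragraph describes: a ViT image backbone and a text encoder feed a cross-modality fusion stage, and a query-based decoder predicts boxes that are labeled by similarity to the text token embeddings. All dimensions, module choices, and names are illustrative assumptions rather than the actual Grounding DINO 1.5 Pro implementation.

```python
# Schematic skeleton of an open-set detector in the Grounding DINO style:
# image and text features are fused, and a DETR-like decoder predicts boxes
# whose labels come from similarity to text embeddings. Sizes are illustrative.
import torch
import torch.nn as nn

class OpenSetDetectorSketch(nn.Module):
    def __init__(self, dim=256, num_queries=900, num_heads=8):
        super().__init__()
        # Stand-ins for the ViT-L (EVA-02) image backbone and a text encoder.
        self.image_proj = nn.Linear(1024, dim)   # project ViT patch features
        self.text_proj = nn.Linear(768, dim)     # project text token features
        # Cross-modality fusion: image tokens attend to text tokens.
        self.fusion = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Query-based decoder refines a fixed set of object queries.
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.queries = nn.Embedding(num_queries, dim)
        self.box_head = nn.Linear(dim, 4)        # (cx, cy, w, h), normalized

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_patches, 1024), text_feats: (B, N_tokens, 768)
        img = self.image_proj(image_feats)
        txt = self.text_proj(text_feats)
        fused, _ = self.fusion(img, txt, txt)    # language-aware image tokens
        q = self.queries.weight.unsqueeze(0).expand(img.size(0), -1, -1)
        hs = self.decoder(q, fused)              # (B, num_queries, dim)
        boxes = self.box_head(hs).sigmoid()
        logits = hs @ txt.transpose(1, 2)        # similarity to text tokens
        return boxes, logits

# Example with random tensors standing in for backbone outputs.
model = OpenSetDetectorSketch()
boxes, logits = model(torch.randn(1, 196, 1024), torch.randn(1, 12, 768))
```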
Dataset
To train a robust open-set detector, it is crucial to construct a high-quality grounding dataset that is rich in categories and covers a wide range of detection scenarios. Grounding DINO 1.5 Pro is pretrained on over 20M grounding images collected from publicly available sources, termed Grounding-20M. We have carefully developed a series of annotation pipelines and post-processing rules to guarantee the high quality of the collected data.
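As an illustration of what such post-processing rules might look like (hypothetical rules and thresholds; the actual Grounding-20M pipeline is not detailed here), the sketch below filters low-confidence pseudo-labels, drops degenerate boxes, and removes near-duplicate boxes grounded to the same phrase.

```python
# Hypothetical post-processing for grounding annotations; rules and thresholds
# are assumptions for illustration, not the actual Grounding-20M pipeline.
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def clean_annotations(anns, min_score=0.35, min_size=4.0, dup_iou=0.9):
    """anns: list of dicts with 'phrase', 'box' (x1, y1, x2, y2), 'score'."""
    kept = []
    for ann in sorted(anns, key=lambda a: a["score"], reverse=True):
        x1, y1, x2, y2 = ann["box"]
        if ann["score"] < min_score:          # drop low-confidence pseudo-labels
            continue
        if (x2 - x1) < min_size or (y2 - y1) < min_size:  # drop degenerate boxes
            continue
        # drop near-duplicate boxes grounded to the same phrase
        if any(k["phrase"] == ann["phrase"] and iou(k["box"], ann["box"]) > dup_iou
               for k in kept):
            continue
        kept.append(ann)
    return kept
```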
Acknowledgment
We would like to thank everyone involved in the Grounding DINO 1.5 project, including application lead Wei Liu, product managers Qin Liu and Xiaohui Wang, front-end developers Yuanhao Zhu, Ce Feng, and Jiongrong Fan, back-end developers Weiqiang Hu and Zhiqiang Li, UX designer Xinyi Ruan, tester Yinuo Chen, and Zijun Deng for help with the demo videos.
References
- [1] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded Language-Image Pre-training. CVPR, 2022.
- [2] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
- [3] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling Open-Vocabulary Object Detection. NeurIPS, 2023.
- [4] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pretraining for Open-world Detection. NeurIPS, 2022.
- [5] Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, and Kyusong Lee. Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head. arXiv preprint arXiv:2403.06892, 2024.
- [6] Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment. CVPR, 2023.
- [7] Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection. CVPR, 2024.
- [8] Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy. arXiv preprint arXiv:2403.14610, 2024.
- [9] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A Visual Representation for Neon Genesis. arXiv preprint arXiv:2303.11331, 2023.