Grounding DINO 1.5 Pro

Tianhe Ren · May 17, 2024

We introduce Grounding DINO 1.5, our most powerful open-world object detection model series. Building on the solid foundation of its predecessor, Grounding DINO [2], this release scales up both the model and its training dataset, improving its ability to understand and detect visual objects accurately. We provide Grounding DINO 1.5 as two models for different scenarios (a short usage sketch follows the list):

  • Grounding DINO 1.5 Pro — our most capable model for open-set object detection. It covers a wide range of detection scenarios, including long-tailed object detection, dense object detection, and long caption phrase grounding.
  • Grounding DINO 1.5 Edge — our most efficient model for edge computing scenarios. It is designed for fast, reliable detection with low latency and reduced power consumption.
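
As a minimal usage sketch, the text-prompted detection workflow both models share can be illustrated with the predecessor's open-source inference utilities (the `groundingdino` package from the Grounding DINO GitHub repository). Grounding DINO 1.5 Pro itself is served through an API rather than this package, and the file paths and thresholds below are illustrative:

```python
# Text-prompted open-set detection with the open-source Grounding DINO
# package, used here as a stand-in for the API-served 1.5 models.
from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

model = load_model(
    "groundingdino/config/GroundingDINO_SwinT_OGC.py",  # config shipped with the repo
    "weights/groundingdino_swint_ogc.pth",              # released checkpoint
)
image_source, image = load_image("assets/demo.jpg")     # illustrative input image

# Target categories are passed as a single caption, separated by " . "
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="person . dog . frisbee",
    box_threshold=0.35,   # keep boxes whose objectness score exceeds this
    text_threshold=0.25,  # keep phrase matches above this score
)

# Draw the predicted boxes and matched phrases on the image
annotated = annotate(image_source=image_source, image=image,
                     boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated.jpg", annotated)
```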

State-of-the-art Zero-Shot Transfer Performance

Grounding DINO 1.5 sets new records on several academic benchmarks. Grounding DINO 1.5 Pro achieves a 54.3 AP on the COCO detection zero-shot transfer benchmark and simultaneously achieves a 55.7 AP and a 47.6 AP on the LVIS-minival and LVIS-val zero-shot transfer benchmarks, respectively. We compare the zero-shot performance of Grounding DINO 1.5 Pro and Grounding DINO in Figure 1. A more comprehensive benchmark is presented in Table 1.
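
For context on how a zero-shot transfer number like 54.3 AP is computed, the sketch below shows the standard COCO evaluation protocol: the detector is run on COCO val2017 without having been trained on COCO, its predictions are dumped in the COCO results format, and pycocotools produces the AP. The results file name is an illustrative assumption:

```python
# Score COCO-format detection results with the standard evaluator.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")  # ground-truth annotations
# Results file: a list of {"image_id", "category_id", "bbox", "score"} records
coco_dt = coco_gt.loadRes("grounding_dino_15_pro_coco_results.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # first line printed is AP@[.50:.95], the headline metric
```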

Figure 1: Grounding DINO 1.5 Pro zero-shot transfer performance on public benchmarks

Table 1: Grounding DINO 1.5 Pro zero-shot transfer performance on the COCO, LVIS, ODinW35, and ODinW13 benchmarks compared to previous methods.

Fine-Tuning Results on Downstream Tasks

Our study shows that fine-tuning Grounding DINO 1.5 significantly boosts performance. On the LVIS dataset, the fine-tuned model achieves a 68.1 AP on LVIS-minival and a 63.5 AP on LVIS-val, improving on its zero-shot results by 12.4 AP and 15.9 AP, respectively. On the ODinW benchmark, it sets new records with a 70.6 AP across the 35 ODinW35 datasets and a 72.4 AP on the 13 ODinW13 datasets. These results underscore the model's potential for adaptation to specialized domains whenever a small-scale, high-quality dataset is available.
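
The blog does not disclose the fine-tuning recipe behind these numbers, so the following is only a generic sketch of adapting a pretrained open-set detector to a small downstream dataset; `load_pretrained_detector`, `train_loader`, and `detection_loss` are hypothetical placeholders:

```python
# Generic fine-tuning sketch -- NOT the recipe used for the reported results.
import torch

model = load_pretrained_detector()  # hypothetical: returns a pretrained open-set detector
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)

model.train()
for epoch in range(12):
    for images, captions, targets in train_loader:   # hypothetical small, high-quality dataset
        outputs = model(images, captions)             # predicted boxes + phrase-grounding scores
        loss = detection_loss(outputs, targets)       # hypothetical matching-based criterion
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
        optimizer.step()
```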

Figure 2: Fine-tuning performance of Grounding DINO 1.5 Pro on public benchmarks

Table 2: Fine-tuning performance of the Grounding DINO 1.5 Pro model on the LVIS-minival, LVIS-val, ODinW35, and ODinW13 benchmarks. Fixed AP is reported on the LVIS-minival and LVIS-val splits. † indicates results of fine-tuning with LVIS base categories only.

Visualizations

The following visualizations of the Grounding DINO 1.5 Pro model's predictions demonstrate its detection capabilities across various scenarios, including common object detection, long-tailed object detection, dense object detection, short caption phrase grounding, and long caption phrase grounding. We also present side-by-side comparisons of the prediction results of Grounding DINO 1.5 Pro and Grounding DINO, and highlight each model's ability to suppress object hallucinations.

Figure 3: Model predictions on common objects (part 1)

Figure 4: Model predictions on common objects (part 2)

Figure 5: Model predictions on long-tailed categories

Figure 6: Model predictions on dense objects (part 1)

Figure 7: Model predictions on dense objects (part 2)

Figure 8: Short caption phrase grounding

Figure 9: Long caption phrase grounding

Figure 10: Side-by-side comparison between Grounding DINO 1.5 Pro and Grounding DINO (part 1)

Figure 11: Side-by-side comparison between Grounding DINO 1.5 Pro and Grounding DINO (part 2)

Figure 12: Side-by-side comparison of object hallucinations between Grounding DINO 1.5 Pro and Grounding DINO

Model Architecture & Training Data

Grounding DINO 1.5 Pro Model

Grounding DINO 1.5 Pro preserves the core architecture of Grounding DINO while incorporating a larger Vision Transformer backbone. We adopt the pretrained ViT-L (EVA-02) [9] model as our primary vision backbone for its superior performance on downstream tasks and its pure Transformer design, which lays a solid foundation for optimizing both model training and inference.
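
As a rough schematic (not the released implementation), the layout described above can be sketched as an image backbone plus a text encoder feeding a cross-modality fusion step and a DETR-style decoder, whose query features are scored against the prompt's text tokens. All module names, dimensions, and hyperparameters below are illustrative:

```python
# Schematic sketch of a Grounding-DINO-style detector. The real model
# differs in many details; this only illustrates the data flow.
import torch
import torch.nn as nn

class GroundingDetectorSketch(nn.Module):
    def __init__(self, image_backbone, text_encoder, d_model=256, num_queries=900):
        super().__init__()
        self.image_backbone = image_backbone  # e.g. a ViT-L feature extractor -> (B, N_img, d_model)
        self.text_encoder = text_encoder      # e.g. a BERT-style encoder -> (B, N_txt, d_model)
        self.fusion = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=6
        )
        self.queries = nn.Embedding(num_queries, d_model)
        self.bbox_head = nn.Linear(d_model, 4)  # normalized (cx, cy, w, h)

    def forward(self, images, token_ids):
        img_tokens = self.image_backbone(images)
        txt_tokens = self.text_encoder(token_ids)
        # Cross-modality fusion: image features attend to text features
        img_tokens = self.fusion(img_tokens, txt_tokens)
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, torch.cat([img_tokens, txt_tokens], dim=1))
        boxes = self.bbox_head(hs).sigmoid()
        # Grounding scores: similarity of decoded queries to each text token
        scores = hs @ txt_tokens.transpose(1, 2)  # (B, num_queries, N_txt)
        return boxes, scores
```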

Dataset

To train a robust open-set detector, it is crucial to construct a high-quality grounding dataset that is sufficiently rich in categories and encompasses a wide range of detection scenarios. Grounding DINO 1.5 Pro is pretrained on over 20 million grounding images collected from publicly available sources, a corpus we term Grounding-20M. We have carefully developed a series of annotation pipelines and post-processing rules to guarantee the high quality of the collected data.
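
The actual Grounding-20M pipeline is not public, but the kind of post-processing rule mentioned above can be illustrated with a simple filter that drops degenerate boxes and near-duplicate annotations for the same phrase; the annotation schema and thresholds are assumptions:

```python
# Illustrative annotation-cleaning rule, not the actual pipeline.
def iou(a, b):
    # Boxes as (x1, y1, x2, y2) in pixels.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def clean_annotations(anns, min_side=4, dup_iou=0.9):
    """anns: list of {"box", "phrase", "score"} dicts (assumed schema)."""
    kept = []
    for ann in sorted(anns, key=lambda a: a["score"], reverse=True):
        x1, y1, x2, y2 = ann["box"]
        if (x2 - x1) < min_side or (y2 - y1) < min_side:
            continue  # drop degenerate or tiny boxes
        if any(k["phrase"] == ann["phrase"] and iou(k["box"], ann["box"]) > dup_iou
               for k in kept):
            continue  # drop near-duplicates of a higher-scoring box
        kept.append(ann)
    return kept
```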

Acknowledgment

We would like to thank everyone involved in the Grounding DINO 1.5 project: application lead Wei Liu; product managers Qin Liu and Xiaohui Wang; front-end developers Yuanhao Zhu, Ce Feng, and Jiongrong Fan; back-end developers Weiqiang Hu and Zhiqiang Li; UX designer Xinyi Ruan; tester Yinuo Chen; and Zijun Deng for helping with the demo videos.

References

  • [1] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded Language-Image Pre-training. CVPR, 2022.
  • [2] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv preprint arXiv:2303.05499, 2023.
  • [3] Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling Open-Vocabulary Object Detection. NeurIPS, 2023.
  • [4] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pretraining for Open-world Detection. NeurIPS, 2022.
  • [5] Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, and Kyusong Lee. Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head. arXiv preprint arXiv:2403.06892, 2024.
  • [6] Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, and Hang Xu. DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment. CVPR, 2023.
  • [7] Lewei Yao, Renjie Pi, Jianhua Han, Xiaodan Liang, Hang Xu, Wei Zhang, Zhenguo Li, and Dan Xu. DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection. CVPR, 2024.
  • [8] Qing Jiang, Feng Li, Zhaoyang Zeng, Tianhe Ren, Shilong Liu, and Lei Zhang. T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy. arXiv preprint arXiv:2403.14610, 2024.
  • [9] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A Visual Representation for Neon Genesis. arXiv preprint arXiv:2303.11331, 2023.
