I. Introduction
Coronary artery disease (CAD) is one of the leading causes of death globally, and effective treatment depends on accurate early detection. While X-ray coronary angiography (XCA) remains the gold standard for diagnosing CAD, manual interpretation of these images is time-consuming and subject to inter-observer variability. This study evaluates the performance of three advanced object detection models (Grounding DINO, YOLO, and DINO) for automatic detection of stenosis (narrowing of blood vessels) in coronary angiography images using the ARCADE dataset.
Figure 1: X-ray coronary angiography image with stenosis region highlighted in purple. Stenosis represents the narrowing of blood vessels, which restricts blood flow to the heart muscle.
II. Understanding Coronary Artery Disease and Stenosis
Coronary artery disease primarily occurs when the major vessels that supply the heart become damaged or diseased, typically caused by plaque buildup in the arteries. This leads to narrowing of vessels (stenosis), restricting blood flow to the heart muscle. If left untreated, it can result in angina (chest pain), heart attacks, or even death.
Traditional diagnostic approaches for CAD include:
(1) Clinical assessment: Evaluating symptoms and risk factors;
(2) Non-invasive testing: Including electrocardiograms, stress tests, and CT scans;
(3) X-ray coronary angiography (XCA): A definitive diagnostic procedure involving injection of contrast agent into coronary arteries and capturing X-ray images.
Radiologists must manually interpret these angiography images to identify areas of stenosis, a process that is time-consuming and subject to interpretation variations. Automating this process through deep learning can significantly improve diagnostic efficiency and consistency.
III. The ARCADE Dataset
This study utilizes the ARCADE dataset (Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs), a public benchmark dataset designed for automated CAD diagnosis. It contains X-ray angiography images with expert annotations of stenosis regions, providing a standardized resource for evaluating different detection algorithms.
Key features of the ARCADE dataset include:
(1) Expert-labeled stenosis regions
(2) Diverse patient cases representing varying degrees of stenosis
(3) Multiple viewing angles of coronary arteries
(4) Standardized format for comparing different algorithms
IV. Object Detection Models
The study evaluated three distinct object detection architectures, each representing different approaches in computer vision:
1. YOLO (You Only Look Once)
YOLO is a CNN-based object detection system renowned for its real-time inference capabilities. It divides the image into a grid and predicts bounding boxes and class probabilities directly from the full image in a single pass.
Figure 2: The YOLO architecture showing the feature backbone, feature pyramid, and prediction head components.
Key characteristics:
(1) Single-stage detector that processes the entire image in one pass;
(2) Feature pyramid for multi-scale feature extraction, capable of attending to stenosis areas of different sizes;
(3) High processing speed, suitable for situations requiring immediate results;
(4) Struggles with small objects or complex medical imaging data, potentially missing very small stenosis regions.
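The grid idea described above can be illustrated with a minimal sketch. It follows the original YOLO formulation, in which the grid cell containing a box's center is responsible for predicting that box. The function name, the 512 x 512 image size, and the 8 x 8 grid are illustrative assumptions, not values taken from the study.

```python
def responsible_cell(box, image_w, image_h, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    box is (x1, y1, x2, y2) in pixels. Hypothetical helper for illustration.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Scale the center to grid units; clamp so a center on the far edge stays in range.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


# A small stenosis box in an assumed 512 x 512 image with an assumed 8 x 8 grid.
cell = responsible_cell((100, 200, 140, 240), image_w=512, image_h=512, grid_size=8)
print(cell)
```

With a coarse grid, several small stenoses can fall into the same cell, which is one intuition for why grid-based detectors may miss very small regions.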
2. DINO
DINO is a Transformer-based model that leverages self-attention mechanisms to enhance feature representation. It is based on the DETR (DEtection TRansformer) architecture with improvements in training convergence and performance.
Figure 3: The DINO-DETR architecture showing the encoder-decoder transformer structure with multi-scale feature processing.
Key characteristics:
(1) End-to-end object detection without the need for non-maximum suppression, facilitating direct diagnosis without multiple processing steps;
(2) Multi-scale feature processing, capturing both prominent and subtle narrowings;
(3) Enhanced query selection mechanism for better feature representation, enabling more precise localization of stenosis;
(4) Strong performance on complex detection tasks, but may require higher computational resources.
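For context, non-maximum suppression (NMS) is the post-processing step that DETR-style models such as DINO avoid. The sketch below is a plain greedy NMS written for illustration; the function names and the 0.5 threshold are assumptions, and it is not the implementation used by any particular framework.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not heavily overlap an already kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # indices of the boxes that survive
```

An end-to-end detector is trained to emit one prediction per object, so this duplicate-removal pass is not needed at inference time.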
3. Grounding DINO
Grounding DINO combines the DINO architecture with grounded pre-training for open-set object detection. It incorporates both visual and textual features for more robust object detection.
Figure 4: The Grounding DINO architecture showing cross-modality processing between text and image features.
Key characteristics:
(1) Cross-modality learning between text and image features;
(2) Feature enhancement layers for improved representation, like using a magnifying glass to enhance observation details;
(3) Language-guided query selection, capable of finding specific areas based on linguistic descriptions, such as "find regions with more than 50% stenosis";
(4) Minimal supervision detection capabilities, learning well even with limited labeled data.
V. Methodology
The study implemented all three object detection models using the MMDetection framework (an open-source object detection toolbox). The evaluation followed these key steps:
1. Data Preprocessing
Annotation files were adjusted to align with the ARCADE dataset format, and redundant annotations were filtered to ensure label consistency.
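The study does not spell out what counted as a redundant annotation. As a minimal sketch, assume the annotations are stored as a COCO-style dictionary and that "redundant" means an exact duplicate of the same image, category, and bounding box. The function name is hypothetical.

```python
def drop_duplicate_annotations(coco):
    """Return a copy of a COCO-style dict with exact-duplicate annotations removed.

    Assumption: two annotations are duplicates when image_id, category_id,
    and bbox (rounded to 2 decimals) all match.
    """
    seen = set()
    kept = []
    for ann in coco["annotations"]:
        key = (
            ann["image_id"],
            ann["category_id"],
            tuple(round(v, 2) for v in ann["bbox"]),
        )
        if key not in seen:
            seen.add(key)
            kept.append(ann)
    # Shallow copy of the top-level dict; only the annotation list is replaced.
    return {**coco, "annotations": kept}


example = {
    "images": [{"id": 1}],
    "categories": [{"id": 1, "name": "stenosis"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
        {"id": 2, "image_id": 1, "category_id": 1, "bbox": [10, 20, 30, 40]},
        {"id": 3, "image_id": 1, "category_id": 1, "bbox": [50, 60, 15, 15]},
    ],
}
cleaned = drop_duplicate_annotations(example)
print(len(cleaned["annotations"]))
```

The first occurrence of each duplicate is kept, so annotation ids that remain still refer to the original entries.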
2. Model Configuration
Each model was configured with appropriate hyperparameters in the MMDetection framework:
(1) YOLO: Using ResNet backbone with feature pyramid network;
(2) DINO: Using Swin Transformer backbone and transformer encoder-decoder structure;
(3) Grounding DINO: Configured with vision-language fusion and cross-attention mechanisms.
3. Training Process
Models were trained on the ARCADE dataset using standard optimization techniques:
(1) Learning rate scheduling;
(2) Medical image data augmentation;
(3) Loss functions appropriate for object detection (IoU loss, classification loss).
4. Evaluation Metrics
Standard COCO evaluation metrics were used to assess detection performance across different models:
(1) IoU (Intersection over Union): Measuring the overlap between predicted and ground-truth boxes;
(2) Average Precision (AP): Measuring detection accuracy across different IoU thresholds;
(3) Average Recall (AR): Measuring ability to find all stenosis regions.
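To make these metrics concrete, here is a simplified sketch of IoU, greedy matching of predictions to ground truth, and average precision at a single IoU threshold. It is an illustration only: the official COCO evaluation aggregates over the whole dataset, uses 101-point interpolation, and averages over IoU thresholds from 0.50 to 0.95, none of which is reproduced here. All function names are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_threshold=0.5):
    """Greedily match predictions (box, score) to ground-truth boxes.

    Returns (score, is_true_positive) pairs sorted by descending score.
    Each ground-truth box can be matched at most once.
    """
    matched, results = set(), []
    for box, score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            value = box_iou(box, gt)
            if value > best_iou:
                best_iou, best_j = value, j
        is_tp = best_j is not None and best_iou >= iou_threshold
        if is_tp:
            matched.add(best_j)
        results.append((score, is_tp))
    return results


def average_precision(results, num_gt):
    """Area under the precision-recall curve, using a monotone precision envelope."""
    if num_gt == 0 or not results:
        return 0.0
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in results:  # results must be sorted by descending score
        tp += int(is_tp)
        fp += int(not is_tp)
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Make precision non-increasing from left to right.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [((0, 0, 10, 10), 0.9), ((100, 100, 110, 110), 0.8), ((20, 20, 30, 31), 0.7)]
results = match_detections(preds, gts, iou_threshold=0.5)
print(results, average_precision(results, num_gt=len(gts)))
```

Raising the IoU threshold turns loosely localized matches into false positives, which is why a model can score well at IoU = 0.50 yet lower on the 0.50 to 0.95 average.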
VI. Results and Performance Comparison
The comparative evaluation revealed different performance patterns among the three models:
1. Quantitative Results
(1) Transformer-based models (DINO and Grounding DINO) generally achieved higher mean Average Precision (mAP) than YOLO across most IoU thresholds;
(2) Grounding DINO obtained the highest mAP at IoU = 0.50, indicating strong performance for moderate overlap detection;
(3) DINO outperformed other models in mAP across IoU thresholds from 0.50 to 0.95, demonstrating excellent precision across varying overlap requirements;
(4) YOLO achieved competitive mAP50 results, showing balanced performance for moderate overlap targets.
Performance metrics from the experiments reflected trade-offs between precision and recall across different architectures:
(1) DINO: Highest precision but lower recall
(2) Grounding DINO: Good balance between precision and recall
(3) YOLO: Good recall with moderate precision
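The trade-off between precision and recall can be sketched by varying the confidence threshold on a set of detections that have already been matched to ground truth. The scores and labels below are invented toy values for illustration, not results from the study, and the function name is hypothetical.

```python
def precision_recall_at(scored, num_gt, conf_threshold):
    """Precision and recall when only detections with score >= conf_threshold are kept.

    scored is a list of (score, is_true_positive) pairs.
    """
    kept = [is_tp for score, is_tp in scored if score >= conf_threshold]
    tp = sum(kept)
    # Convention chosen here: with no detections kept, precision is reported as 1.0.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


# Toy detections (invented): three true positives and two false positives, four true regions.
scored = [(0.95, True), (0.85, True), (0.60, False), (0.40, True), (0.30, False)]

strict = precision_recall_at(scored, num_gt=4, conf_threshold=0.8)
permissive = precision_recall_at(scored, num_gt=4, conf_threshold=0.25)
print(strict, permissive)
```

A strict threshold keeps fewer, more reliable detections (higher precision, lower recall), while a permissive threshold finds more regions at the cost of extra false positives.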
2. Detection Visualization Comparison
The following images demonstrate the performance of each model on the same coronary angiography image:
Figure 5: DINO-DETR detection result showing fewer, more precise detections.
Figure 6: Grounding DINO detection showing high confidence stenosis detection with clear labeling.
Figure 7: YOLO detection showing multiple detection regions with confidence scores.
3. Qualitative Analysis
Beyond quantitative metrics, qualitative analysis of detection results revealed significant differences in how each model approached stenosis detection:
3.1 DINO Detection Patterns
DINO generally produced fewer detections, occasionally missing relevant stenosis areas, but those it detected were typically accurate. This suggests the model learned more stringent criteria for stenosis judgment. For instance, in certain test cases, DINO detected a single stenosis area with high confidence while missing more subtle secondary areas. This pattern is evident in Figures 5 and 8, where the model identified primary stenosis areas but potentially overlooked others.
Figure 8: DINO-DETR detection showing its tendency for fewer, more precise detections.
3.2 Grounding DINO Detection Patterns
Grounding DINO identified more areas but sometimes produced cluttered predictions due to over-detection. The model appeared to leverage its cross-modality understanding to detect a wider range of stenosis manifestations. As shown in Figures 6 and 9, Grounding DINO typically detected multiple stenosis areas with varying confidence scores. While this increased the likelihood of capturing all stenosis regions, it also raised the possibility of false positives.
Figure 9: Grounding DINO detection showing multiple detection regions with confidence scores.
3.3 YOLO Detection Patterns
YOLO offered a reasonable compromise by effectively capturing anatomical structures while maintaining relatively high confidence scores. It performed better at detecting smaller stenosis areas compared to transformer-based models.
As shown in Figures 7 and 10, YOLO's detection pattern typically included multiple regions with varying confidence scores. This approach provided a reasonable balance between precision and recall for clinical applications.
Figure 10: YOLO detection showing multiple detection regions with confidence scores.
VII. Limitations and Challenges
Through this study, we identified several general limitations affecting model performance:
1. Data-Related Challenges
(1) Limited size of the ARCADE dataset compared to general object detection datasets;
(2) Class imbalance between normal cases and stenosis cases;
(3) Variations in image quality and contrast levels.
2. Model-Specific Limitations
(1) YOLO: Difficulty with very small stenosis areas and low-contrast regions;
(2) DINO: Computationally intensive and requiring longer training times;
(3) Grounding DINO: Tendency to over-detect in certain complex scenarios.
3. Clinical Integration Challenges
(1) Need for higher precision to avoid false positives in clinical settings;
(2) Explainability requirements: Physicians need to understand why AI made certain judgments;
(3) Variations in coronary artery anatomy across patients.
VIII. Future Research Directions
Based on our findings, we identified several promising research directions:
1. Post-Processing Techniques to Improve Detection Accuracy
(1) Ensemble methods combining outputs from different models, such as using YOLO's rapid detection capabilities for initial screening, followed by DINO's precise localization abilities for refinement;
(2) Bounding box optimization for better localization.
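One possible reading of the screening-then-refinement idea is a two-stage pipeline in which a fast detector decides whether an image needs a closer look, and a slower detector is run only on flagged images. The sketch below is an assumption about how such an ensemble might be wired, not a method evaluated in the study. The `screen` and `refine` callables stand in for YOLO and DINO inference, and the threshold value is hypothetical.

```python
from typing import Any, Callable, List, Tuple

# A detection is (x1, y1, x2, y2, score).
Detection = Tuple[float, float, float, float, float]


def two_stage_detect(
    image: Any,
    screen: Callable[[Any], List[Detection]],
    refine: Callable[[Any], List[Detection]],
    screen_threshold: float = 0.3,
) -> List[Detection]:
    """Run the fast screening detector first; call the precise detector only if needed."""
    candidates = screen(image)
    # If the screening model sees nothing above the threshold, skip the costly pass.
    if not any(det[4] >= screen_threshold for det in candidates):
        return []
    return refine(image)


# Stub detectors for demonstration; real ones would wrap model inference.
def fast_stub(image):
    return [(10.0, 10.0, 40.0, 40.0, 0.55)]


def precise_stub(image):
    return [(12.0, 11.0, 38.0, 39.0, 0.92)]


print(two_stage_detect(image=None, screen=fast_stub, refine=precise_stub))
```

A low screening threshold keeps the first stage sensitive, so that the overall recall is limited mainly by the second stage rather than by the screening step.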
2. Model Improvements
(1) Alternative model configurations and training strategies;
(2) Medical imaging domain-specific augmentation techniques;
(3) Semi-supervised learning approaches to leverage unlabeled data.
3. Hybrid Approaches
(1) Integration of CNN and transformer architectures for balanced performance;
(2) Incorporation of anatomical prior knowledge to enhance detection accuracy.
4. Clinical Validation
(1) Prospective studies comparing model performance against radiologists' diagnoses;
(2) Integration with clinical workflows for real-world evaluation.
IX. Conclusion
The comparative evaluation of YOLO, DINO, and Grounding DINO for stenosis detection using the ARCADE dataset provides valuable insights into the strengths and limitations of these cutting-edge object detection models in the context of CAD diagnosis. Transformer-based architectures like DINO and Grounding DINO demonstrate enhanced precision and good recall, making them particularly well-suited for CAD detection tasks in XCA images. Conversely, YOLO offers computational efficiency advantages that may be valuable in resource-constrained scenarios. The study indicates that while object detection technologies have made significant progress, there remains substantial room for optimization, whether through model refinements, improved post-processing techniques, or exploration of hybrid architectures.
By conducting systematic evaluations using standardized metrics, this research contributes to the ongoing development of automated CAD diagnostic systems and highlights the potential of deep learning to improve diagnostic accuracy and reduce the burden on healthcare professionals.
References
(1) Paper "Evaluating Stenosis Detection with Grounding DINO, YOLO, and DINO-DETR" by Muhammad Musab Ansari. Link: https://arxiv.org/abs/2503.01601
(2) Access the latest DINO models API on the DINO-X Platform: https://cloud.deepdataspace.com/
(3) Grounding DINO Playground: https://cloud.deepdataspace.com/playground/grounding_dino