I. Seeking a Breakthrough
Despite the significant progress DINO-X has made in object detection, current detection models, DINO-X included, remain far from being able to precisely identify specific individuals from natural language descriptions, a task known as Referring Expression Comprehension (REC). To strengthen detection models in the REC domain, researchers chose a widely applicable scenario, humans, as the breakthrough point for experimenting with and exploring future directions for REC.
While computer vision has long focused on detecting and recognizing objects in images, humans remain the core subjects of most real-world applications. Traditional REC methods concentrate on one-to-one references: you can ask for "the person wearing red" but not "all people wearing red." This limitation fails to reflect real-life situations where we often need to identify multiple people matching the same description, such as finding "all students wearing uniforms" in a school group photo or "all guests holding drinks" in a party photo.
Figure 1 The annotation process for the HumanRef dataset. The process involves: (a) pseudo labeling, (b) writing property lists, (c) assigning properties to each person, and (d) transferring to referring style using an LLM.
Researchers from the International Digital Economy Academy (IDEA) and South China University of Technology (SCUT) addressed this gap by introducing a new approach called "Referring to Any Person." They redefined the REC task to better match natural human language habits, built a dataset dedicated to human identification, and created RexSeek, a model that combines visual detection with language understanding, laying the groundwork for DINO-XSeek.
II. Redefining REC
Imagine the following scenario: You and a friend are looking at a photo from a music festival, and you might say, "Find all people wearing sunglasses on the left side of the stage," or "Point out all members of a certain band in the photo." Existing technology struggles to handle these seemingly simple requests because current REC methods have several key limitations:
1. One-to-one referring: Most existing models assume each referring expression corresponds to only a single object. They can answer questions like "Who is the class monitor?" but cannot address questions like "Who are the members of the class committee?" which refer to multiple people, contradicting real-world usage patterns.
2. Limited scope: Current datasets focus on simple attributes or spatial relationships, ignoring the complexity of human descriptions. Existing systems excel at identifying "the person wearing red" but struggle to understand complex descriptions like "people who look happy" or "young people talking to elderly individuals."
3. No rejection capability: Existing models often fail to recognize when described people are not present in an image, leading to hallucinations. If you ask, "Find the astronaut in the photo," and the photo shows a family gathering, existing systems tend to incorrectly designate someone rather than honestly answering "there is no astronaut."
To address these issues, the paper introduces the concept of "Referring to Any Person," which encompasses five description modes commonly used in real-world scenarios:
1. Attributes: Including physical features, clothing, and accessories, such as "the blonde-haired, blue-eyed person," "the man wearing glasses and a suit," or "the woman with a red handbag."
2. Position: Spatial relationships and positions, such as "the person standing next to the window," "the singer at the center of the stage," or "the students sitting in the last row."
3. Interaction: Behaviors and relationships with other people or objects, such as "parents taking photos of their child," "employees shaking hands with the CEO," or "two people talking to each other."
4. Reasoning: Reasoning based on context or combinations of features, such as "the person who appears to be the head of this family" or "the person who seems to be the team leader."
5. Celebrity recognition: Identifying well-known figures, such as "Tom Cruise in the photo" or "Bill Gates in the front row."
This redefinition requires computer vision systems to support three key capabilities:
1. Multi-instance referring: The ability to identify all individuals matching a description, such as finding "all people wearing black clothes," not just one of them.
2. Multi-instance discrimination: The ability to distinguish between different groups of people, such as differentiating between "students on the right" and "teachers on the left."
3. Rejecting non-existence: The ability to recognize when described individuals are not present, such as honestly answering "no one matches this description" when asked to find an astronaut in a photo that contains none.
This redefinition transforms REC from a simple one-to-one mapping to a more nuanced and practical task, aligning with natural human communication patterns and greatly expanding the practicality of computer vision.
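To make the redefined task concrete, the minimal Python sketch below captures the input/output contract these capabilities imply: one expression in, a possibly empty set of person boxes out. The `Box`, `ReferringResult`, and `refer` names are illustrative assumptions, not an interface from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates (x0, y0, x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class ReferringResult:
    """Result of a 'referring to any person' query.

    An empty `boxes` list is a valid answer: it covers the rejection case
    explicitly instead of forcing the model to return a best guess.
    """
    expression: str
    boxes: List[Box]

def refer(image_path: str, expression: str) -> ReferringResult:
    """Hypothetical entry point: returns ALL matching people, not just one."""
    raise NotImplementedError("placeholder; see the RexSeek sketch later in the article")

# Multi-instance referring: one expression may map to several boxes.
#   refer("festival.jpg", "all people wearing sunglasses on the left of the stage")
# Rejection of non-existence: an empty list instead of a hallucinated box.
#   refer("family.jpg", "the astronaut in the photo")  # -> boxes == []
```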
III. The HumanRef Dataset
To train systems capable of understanding complex human descriptions, the research team created the HumanRef dataset—essentially a textbook for teaching machines to "understand human descriptions," containing 7,302 images and 103,028 referring expressions.
Figure 2 Distribution of people per image (left) and ground truth boxes per referring expression (right) in the HumanRef dataset.
The dataset was constructed through a three-step annotation process (a code sketch of the flow follows the list):
1. Property listing: Determining relevant properties that can be used to refer to people, such as "lady, short hair, wearing a yellow dress, standing at the doorway";
2. Property assignment: Interactively assigning properties to each person in the image;
3. Referring style rewriting: Using a large language model to convert attribute lists into natural referring expressions.
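As referenced above, the annotation pipeline can be pictured as a small data flow from properties to expressions. The functions below are hypothetical, heavily simplified stand-ins: the real steps involve pseudo labeling, interactive annotation, and an LLM rather than the trivial placeholders used here.

```python
from typing import Dict, List

def list_properties(image_id: str) -> Dict[int, List[str]]:
    """Step 1 (hypothetical): candidate properties for each person in the image."""
    return {
        0: ["lady", "short hair", "yellow dress", "standing at the doorway"],
        1: ["man", "short hair", "blue suit"],
    }

def assign_properties(properties: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Step 2 (hypothetical): map each property phrase to the people it applies to.
    Here people sharing a property are simply grouped; real annotation is interactive."""
    prop_to_people: Dict[str, List[int]] = {}
    for person_id, props in properties.items():
        for prop in props:
            prop_to_people.setdefault(prop, []).append(person_id)
    return prop_to_people

def rewrite_as_referring(prop: str) -> str:
    """Step 3 (hypothetical): an LLM turns property lists into natural expressions;
    a trivial template stands in for it here."""
    return f"the person(s) with {prop}" if " " in prop else f"the {prop}"

def annotate(image_id: str) -> List[dict]:
    """Chain the three steps into (expression, target person indices) pairs."""
    props = list_properties(image_id)
    return [
        {"expression": rewrite_as_referring(phrase), "person_ids": ids}
        for phrase, ids in assign_properties(props).items()
    ]

print(annotate("example.jpg"))
```

Note how a shared property such as "short hair" naturally yields a multi-instance expression, which is exactly the case that earlier one-to-one REC datasets rarely cover.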
Researchers deliberately selected crowded scenes (averaging 9.6 people per image) to ensure the dataset covers a wide range of complex situations. Whereas previous datasets effectively taught machines to recognize one object at a time, HumanRef teaches them to understand complex human descriptions and to identify groups of people.
IV. RexSeek Architecture
To understand RexSeek, imagine it as an assistant with both sharp vision and strong language comprehension. When you say, "Find all children wearing hats in the photo," what does this assistant do? The RexSeek architecture and workflow are as follows:
1. Visual Encoder: Processes visual information from the input image, understanding the overall content;
2. Person Detector (DINO-X): Identifies all individuals in the image and provides bounding boxes;
3. Large Language Model (Qwen2.5): Interprets referring expressions and matches them with detected individuals, understanding the specific meaning of descriptions like "children wearing hats";
4. Specialized Token System: Includes grounding tokens, object tokens, and object index tokens to connect language descriptions with visual elements.
Figure 3 The RexSeek model architecture, showing the integration of vision encoders, person detector, and language model with specialized tokens for object referencing.
The model processes both the image and the referring expression simultaneously, then outputs the indices of all detected individuals matching the description. If no matches are found, the model can identify and indicate the absence of matching individuals. Unlike previous approaches that primarily focused on detection or heavily relied on language models, RexSeek achieves a balanced integration of both capabilities. This integration is crucial for handling the complexity of human-centered references, where both visual accuracy and language understanding are essential.
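The sketch below shows one plausible way these four components fit together at inference time: the detector proposes person boxes, and the language model selects the indices of the boxes that match the expression, returning an empty list in the rejection case. The class and method names are illustrative assumptions, not the released RexSeek code.

```python
from typing import List

class RexSeekSketch:
    """Illustrative wiring of RexSeek-style components, not the official implementation."""

    def __init__(self, vision_encoder, person_detector, llm):
        self.vision_encoder = vision_encoder    # produces image-level visual tokens
        self.person_detector = person_detector  # e.g. DINO-X, proposes person boxes
        self.llm = llm                          # e.g. a Qwen2.5-based language model

    def refer(self, image, expression: str) -> List[dict]:
        # 1. Encode the whole image into visual tokens for global context.
        image_tokens = self.vision_encoder.encode(image)

        # 2. Detect every person candidate; each box gets an index and an object token.
        boxes = self.person_detector.detect(image)                        # N candidate boxes
        object_tokens = [self.vision_encoder.encode_region(image, b) for b in boxes]

        # 3. The language model reads the image tokens, the per-person object tokens,
        #    and the expression, then emits the indices of the matching people,
        #    e.g. [0, 3, 7], or [] when no one matches (rejection).
        selected = self.llm.select_indices(image_tokens, object_tokens, expression)

        return [{"index": i, "box": boxes[i]} for i in selected]
```

Framing the output as box indices rather than raw coordinates is the key design choice here: the language model only has to choose among the detector's proposals, which keeps localization accurate while leaving the matching logic to language understanding.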
V. Training Method
RexSeek is developed with a carefully designed multi-stage training method that progressively builds perception and understanding capabilities:
1. Modality Alignment: Initial training using image-captioning data to align visual and textual modalities;
2. Perception Training: Enhancing detection capabilities using detection-oriented data;
3. General Understanding: Incorporating multimodal data to improve overall comprehension;
4. Task-Specific Fine-tuning: Final refinement using the HumanRef dataset.
Researchers found that this multi-stage approach significantly outperforms traditional training methods, enabling the model to establish strong foundations in both visual perception and language understanding before specializing in human-centered referring expression tasks.
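One way to picture this schedule is as an ordered list of stage configurations, each with its own data mixture. The stage names follow the list above; the data labels and the `set_trainable`/`fit` helpers are placeholders for the sketch, not the paper's exact recipe or hyperparameters.

```python
# Illustrative staged schedule; data mixtures and trainable modules here are
# assumptions for the sketch, not the paper's exact configuration.
TRAINING_STAGES = [
    {"name": "modality_alignment",    "data": ["image_captioning"],        "trainable": ["projector"]},
    {"name": "perception_training",   "data": ["detection", "grounding"],  "trainable": ["projector", "llm"]},
    {"name": "general_understanding", "data": ["multimodal_conversation"], "trainable": ["projector", "llm"]},
    {"name": "referring_finetuning",  "data": ["HumanRef"],                "trainable": ["projector", "llm"]},
]

def run_schedule(model, stages=TRAINING_STAGES):
    """Run each stage in order so perception is in place before referring fine-tuning."""
    for stage in stages:
        model.set_trainable(stage["trainable"])  # hypothetical helper: choose which modules to train
        model.fit(datasets=stage["data"])        # hypothetical helper: train on the stage's data
```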
VI. Model Performance
In HumanRef benchmark evaluations, RexSeek demonstrated significant improvements compared to existing state-of-the-art models. Experiments revealed several key findings:
1. Superior performance on multi-instance referring: While most existing models' performance drops sharply as the number of target individuals increases, RexSeek maintains high precision and recall across all scenarios.
Figure 4 Precision and recall performance of various models based on the number of instances per referring expression. RexSeek maintains performance across all scenarios, while other models deteriorate with increasing instances.
2. Effective rejection capability: Unlike other models that tend to hallucinate when referred people do not exist, RexSeek successfully identifies non-existent cases.
3. Strong generalization: Although primarily trained for human identification, RexSeek can also follow instructions like "find the dog in the photo" or "point out the coffee cup on the table," showing that its referring ability generalizes beyond humans.
4. Balanced precision and recall: RexSeek performs well on both precision (not misidentifying people) and recall (not missing eligible people).
These results validate the researchers' approach to redefining the referring task and highlight the effectiveness of their model architecture and training strategy. RexSeek significantly outperforms existing models on all subsets of the HumanRef benchmark, especially in challenging scenarios involving multiple instances and rejection cases.
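For intuition about how precision and recall behave in the multi-instance and rejection settings, here is a simplified sketch that greedily matches predicted boxes to ground-truth boxes at an IoU threshold and counts a rejection case as correct only when the model predicts no boxes. It is a simplified reading of per-expression precision/recall, not the HumanRef benchmark's official evaluation script.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(pred: List[Box], gt: List[Box], thr: float = 0.5) -> Tuple[float, float]:
    """Per-expression precision/recall with greedy one-to-one matching at IoU >= thr."""
    if not gt:  # rejection case: only an empty prediction counts as correct
        return (1.0, 1.0) if not pred else (0.0, 0.0)
    if not pred:
        return 0.0, 0.0
    matched, tp = set(), 0
    for p in pred:
        best_j, best_iou = None, thr
        for j, g in enumerate(gt):
            if j not in matched and iou(p, g) >= best_iou:
                best_j, best_iou = j, iou(p, g)
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    return tp / len(pred), tp / len(gt)

# Two ground-truth people; the model finds one of them plus a false positive.
print(precision_recall([(0, 0, 10, 10), (50, 50, 60, 60)],
                       [(1, 1, 10, 10), (100, 100, 120, 140)]))  # -> (0.5, 0.5)
```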
VII. Application Scenarios
RexSeek's technological breakthroughs open up possibilities for numerous practical applications that will change how we interact with the digital world, including:
1. Human-robot interaction: Enabling robots to understand natural language instructions about people in their environment;
2. Visual search systems: Allowing users to search for specific individuals in photo collections using natural language;
3. Security and surveillance: Identifying persons of interest based on verbal descriptions;
4. Assistive technologies: Helping visually impaired individuals understand who is present in images;
5. Content analysis: Automating the identification of people in media for content moderation and organization.
Conclusion
The research behind RexSeek represents a qualitative leap: by redefining how systems identify and locate individuals from natural language descriptions, it brings machines closer to how humans understand the visual world, evolving from recognizing single objects to understanding complex scenes with multiple matching instances.
As this technology develops, we can expect computer vision systems to understand the human world ever more naturally, not just seeing people but comprehending the attributes, positions, and interactions that describe them. This capability will make technology more intuitive and more naturally integrated into our lives.
Appendix
- Paper: "Referring to Any Person," Qing Jiang, Lin Wu, Zhaoyang Zeng, Tianhe Ren, Yuda Xiong, Yihao Chen, Qin Liu, Lei Zhang. Link: https://arxiv.org/abs/2503.08507
- To use the latest DINO API, please visit the DINO-X Platform: https://cloud.deepdataspace.com
- To experience the latest DINO-XSeek model online, please visit the DINO-X Playground: https://cloud.deepdataspace.com/playground/dino-x?referring_prompt=0