I. Guest Profile

Lei Zhang, Founder and CEO of Visincept, is currently a Chair Professor at the Computer Vision and Robotics Center (CVR) of IDEA Institute and an IEEE Fellow.
He previously served as a Principal Researcher at Microsoft Research headquarters, and his research findings have been widely integrated into Microsoft Bing Search and the Cognitive Services cloud computing platform. He has published over 150 papers in computer vision (CV) at top international conferences and journals, with more than 62,000 citations on Google Scholar and an h-index of 102. He holds over 60 granted U.S. patents and was elected an IEEE Fellow in recognition of his outstanding contributions to large-scale image recognition and multimedia information retrieval.
Prior to the establishment of Visincept, the DINO-X team led by Lei Zhang was incubated within IDEA-CVR, focusing on multi-modal visual AI research and dedicated to enhancing models' visual perception capabilities. The team's first independently developed model, DINO, topped the COCO object detection benchmark for five consecutive months after its launch. Its subsequent Grounding DINO model was ranked by PaperDigest as one of the Most Influential Papers at ECCV 2024. This technical groundwork laid the foundation for the general vision model DINO-X, opening the door to open-world perception and object-level understanding.
In 2025, the DINO-X team completed its incubation and announced tens of millions of RMB in financing, including a 20 million RMB investment from the Chinese listed company ANYKA. The team is now fully committed to building 2D and 3D embodied intelligence solutions based on DINO-X.
II. Interview Session
DINO Talk: Let’s start with a light question. What led you to name your product after a dinosaur (DINO)?
Lei: The DINO model is a further improvement built on two of our prior works, DAB-DETR [ICLR 2022] and DN-DETR [CVPR 2022]. Object detection is one of the most fundamental problems in computer vision. We kept "DETR" in the names of those two earlier projects because we wanted to recognize and pay tribute to DETR's original contributions to the detection task. DETR stands for DEtection TRansformer, and it was the first work to propose using a Transformer for object detection.
When we were developing our third improvement on DETR, the DINO model, a Transformer-based detection algorithm reached SOTA (state of the art) in the vision field for the very first time. That made us want a more memorable name. We struggled to find something we liked for a while, until a chance zoo trip with my child gave us inspiration. Seeing the rhinoceros (rhino) made me think of DINO, a colloquial term kids in the US use for dinosaur.
Everyone on the team working on this project loved the name. Later works like Mask DINO [CVPR 2023] and Grounding DINO [ECCV 2024] also adopted the "DINO" moniker. That's how dinosaurs became the symbol for our team's models, including our later, distinctive visual-prompt-based object detection work T-Rex2 [ECCV 2024], which references the Tyrannosaurus rex, another member of the dinosaur family.
DINO Talk: Unlike most mainstream "language-driven" vision models, DINO takes a "vision-native" technical path. Why did you choose this differentiated approach?
Lei: First, let’s address "why vision first, then language."
I believe vision is the perceptual foundation for machines to interact with the physical world. Language lets machines understand our instructions and intentions, but when machines actually engage with the environment, they don't need to converse with it. Instead, they rely on visual and action capabilities to complete the "perception-decision-execution" loop.
Looking back, before language emerged, humans and other animals still recognized objects, judged directions, and took actions through perception and movement. That's also why, before the breakthroughs in language technology, driving machines through programming had an extremely high barrier to entry: only a few people could do it. New technologies like GPT let us give instructions in natural language, which was a major leap. For vision itself, works like DETR have made significant progress in object-level understanding, and object detection research has been refined over decades. But relying solely on "detection" or solely on "language" isn't enough to support the goal of "letting humans drive machines through high-level semantics." So language-focused models keep integrating visual capabilities (like GPT-4V, GPT-4o, GPT-5), while our work builds on the "detection" core by adding language understanding. With language prompts and constraints, we expand a basic detector into an open-vocabulary one, moving toward open-world detection. This lets the model recognize common objects and handle unseen categories in open environments, laying the groundwork for machines to operate stably in the real world.
Both "language-driven" and "vision-native" paths have their pros and cons. Language-based multimodal large models excel at grasping overall/abstract semantics and logical connections, but they essentially model one-dimensional sequences (tokens). Vision-perception models, by contrast, have advantages in spatial orientation, fine-grained details, and 2D/3D data modeling. They can depict structural and geometric relationships more directly. Deep integration between the two is always possible, but in current practice, the integration isn’t ideal yet. We still need to further break through in representation, training objectives, and system implementation.
DINO Talk: What’s the biggest challenge in building a "vision-native" large model?
Lei: One of the core challenges lies in data. The model's ability to understand objects at a granular level is limited by the availability of fine-grained training data.
Most mainstream multimodal approaches rely mainly on image-text aligned data, which often reaches the billion-pair scale. But for a vision-based multimodal model like DINO-X, object-level training data requires annotations for almost every target in the image: bounding boxes plus fine-grained labels such as categories and attributes. These annotations significantly boost the model's ability to understand and locate objects, but they demand heavy investment and broad coverage. Of course, more and more multimodal models are now adopting similar object/region-level annotations to enhance their perception and reasoning in complex, open scenarios.
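To make the contrast with image-text pairs concrete, a single object-level training record might look roughly like the sketch below; the field names and values are illustrative only, not the actual DINO-X annotation format.

```python
# Illustrative object-level annotation for one image (hypothetical schema,
# not the real DINO-X training format). Unlike a single image-level caption,
# every target carries a box plus fine-grained labels.
annotation = {
    "image_id": "kitchen_000123.jpg",
    "caption": "a cluttered kitchen counter",        # optional image-level text
    "objects": [
        {
            "bbox": [412.0, 188.5, 96.0, 143.2],     # [x, y, width, height] in pixels
            "category": "mug",
            "attributes": ["ceramic", "blue", "handle facing left"],
        },
        {
            "bbox": [128.3, 75.0, 240.1, 180.4],
            "category": "cutting board",
            "attributes": ["wooden", "partially occluded"],
        },
    ],
}
```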
DINO Talk: What’s the pain point in large-scale commercialization of vision models?
Lei: The biggest pain point is the sheer diversity of long-tail scenarios in real-world applications. Even a general large vision model struggles to adapt accurately to a specific niche scenario or rare object. At the same time, most of these long-tail scenarios are small in scale, so the return on investing heavily in manpower and customized models is quite low. Right now, vision models face a dilemma with long-tail cases: a million problems seem to require a million models. This leads to severely fragmented solutions, which in turn creates a host of other issues.
DINO Talk: You just mentioned that current solutions for vision models in long-tail scenarios are overly fragmented. So how does the DINO-X model address this issue? Could you explain it with specific cases?
Lei: DINO-X's goal is to solve all problems with a single model. Of course, the DINO-X general vision model alone can't cover every scenario. For one thing, we can't possibly go through all those scenarios and objects; there are just too many things in the world we haven't seen or imagined. For another, data is a constraint. Data for long-tail scenarios is extremely hard to come by, so we can't train the model to the same level of polish that we reach on common vocabularies.
So what's the solution? For long-tail scenarios, we've developed a technology called oVP (optimized Visual Prompt). To make it easy to understand, we call it a "Custom Template" in our product. Here's how it works: through multiple rounds of visual prompt optimization, we generate a visual embedding, essentially a prompt vector, and feed that embedding into the DINO-X model to predict objects in new images. This lets us drastically boost the model's adaptability and accuracy. DINO-X's Custom Template needs only a small number of images to train the embedding, and no development or coding at all, to reach the precision of a customized model.
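In code, the idea can be pictured roughly as follows. This is a minimal sketch under stated assumptions: the `model`, `criterion`, and `visual_prompt` interfaces are hypothetical placeholders, not the published oVP implementation.

```python
import torch

def train_custom_template(model, criterion, support_set, embed_dim=256, steps=100, lr=1e-3):
    """Fit a visual-prompt embedding (a "Custom Template") from a few annotated images.

    Hypothetical sketch of the oVP idea: the detector stays frozen and only a small
    prompt vector is optimized, so no new model has to be trained or deployed.

    model:       frozen, prompt-conditioned detector, called as model(image, visual_prompt=...)
    criterion:   detection loss comparing predictions with ground-truth boxes/labels
    support_set: iterable of (image_tensor, targets) pairs, typically just a handful
    """
    prompt = torch.zeros(1, embed_dim, requires_grad=True)   # the "Custom Template" embedding
    optimizer = torch.optim.Adam([prompt], lr=lr)
    for _ in range(steps):
        for image, targets in support_set:
            predictions = model(image, visual_prompt=prompt)
            loss = criterion(predictions, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return prompt.detach()

# At inference time the stored embedding is simply fed back in:
#   detections = model(new_image, visual_prompt=template)
```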
Take smart homes as an example. Even an ordinary home scenario actually contains tons of long-tail targets, because home environments are complex and open: new furniture, new appliances, and all kinds of other things keep popping up. These "new things" and "new scenarios" can be covered with Custom Templates. The same goes for part recognition and defect detection in industrial quality inspection. We've also collaborated with the TxstureAxis Team from the Central Academy of Fine Arts on pattern recognition for ancient cultural relics. All these highly specialized scenarios and rare objects can be supplemented with DINO-X's Custom Templates, without needing to develop new models from scratch.
DINO Talk: Has the DINO-X Custom Template achieved mature commercial application yet?
Lei: It’s already been put into real business use. We’ve collaborated with many enterprise clients on Custom Template applications this year, like China Merchants Group and Meituan. Their scenarios are all vertical ones deeply tied to their businesses, and that’s where DINO-X’s Custom Templates come into play.
We've also integrated the Custom Template feature into our MaaS (Model-as-a-Service) product line. For example, users can train their own exclusive Custom Templates on the DINO-X Platform and then integrate them into their own products or businesses via API. We also have an image annotation product called T-Rex Label, where users can directly add Custom Templates to automatically annotate categories that regular models used to struggle with. Feedback on the Custom Template's recognition performance in our AI counting app CountAnything has also been really good.
DINO Talk: On Visincept's vision: why emphasize "Building a Large Vision Model with Superb Object-Level Understanding Capability"?
Lei: Over the past few years, we've been laser-focused on perfecting the task of object detection, because we firmly believe that for AI to understand the world, it must first see the world clearly. After the birth of the DINO-X model and the resolution of long-tail scenario challenges through Custom Templates, we set out to enhance the model's comprehension capabilities.
Take DINO-XSeek as an example: its semantic understanding goes far beyond just recognizing nouns and simple adjective modifiers. Instead, it can truly parse the grammatical structure of sentences, possess high-level semantic reasoning abilities, and handle complex instructions that require multi-step logical analysis.
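As an illustration of what "multi-step logical analysis" means in practice, a complex instruction could be decomposed along these lines; the breakdown below is a conceptual sketch, not DINO-XSeek's actual output schema.

```python
# Conceptual example only: how a complex referring instruction might be broken
# down before grounding. Not DINO-XSeek's real interface or output format.
instruction = ("Find the empty cup on the shelf closest to the window, "
               "but ignore any cup that has a lid.")

reasoning_steps = [
    "detect all cups, shelves, and the window in the image",
    "rank the shelves by distance to the window and keep the closest one",
    "restrict candidate cups to those sitting on that shelf",
    "discard cups that have a lid",
    "keep only the cups that are empty",
]
# The final answer is a set of object-level detections (boxes and masks),
# not just a free-form text reply.
```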
But that’s not enough. Because the world we live in is three-dimensional. The so-called "superb object-level understanding capability" means that AI must not only understand all kinds of objects, but also grasp their structures and spatial relationships. That’s why we’re working to extend DINO-X to enable 3D object understanding. Starting from 2D, our ultimate goal is to build a 3D large vision model that truly comprehends the physical world, laying the groundwork for developing cutting-edge spatial intelligence and embodied intelligence.
DINO Talk: Could you share the latest progress of DINO-X in the 3D domain?
Lei: Our most recently announced breakthrough lies in the field of embodied intelligence.
At the just-concluded 2025 IDEA Conference, we unveiled DINO-XGrasp, a universal grasping model designed as an "embodied brain" specifically for robotic arms. By integrating the universal perception capabilities of DINO-X, the robotic arm can not only grasp any object but also execute long-horizon manipulation tasks with object-level cognition. What’s worth highlighting is that this embodied brain can be deployed on any robotic arm, enabling flexible control and precise positioning purely through visual algorithms.
Of course, we’ve also made remarkable strides in 3D vision models. We have a number of exciting achievements in the pipeline that will be released soon. You’ll see them before long, but I’ll keep it under wraps for now.
DINO Talk: You just mentioned that "The ultimate goal is to build a 3D large vision model that truly comprehends the physical world, laying the groundwork for developing cutting-edge spatial intelligence and embodied intelligence." We’ve discussed the progress in embodied intelligence. Could you share your insights on spatial intelligence, as well as Visincept’s upcoming development plans in this field?
Lei: Actually, for us, the core of spatial intelligence is enabling machines to perceive and utilize physical space in the same way humans do.
First off, many might think spatial intelligence is a new concept, but its roots go way back. As early as 1983, American psychologist Howard Gardner proposed the idea of "visual-spatial intelligence" in his book Frames of Mind. The core of this theory lies in the ability to understand the shape, size, position and three-dimensional relationships of objects, as well as to imagine and manipulate these objects mentally.
In recent years, Professor Fei-Fei Li has been spearheading the technological implementation of this concept. She emphasized at both the 2024 TED Conference and NVIDIA GTC that spatial intelligence is a more fundamental AI technology: it allows machines to perform tasks directly in the real world without pre-training. The key is to infer how images and text map to 3D environments, and then act based on that inference. This is also our core understanding of spatial intelligence: it is not a simple stacking of technologies, but a means to enable machines to truly comprehend the physical environment and make rational decisions accordingly.
To achieve this goal, the first step is to strengthen the fundamentals by building robust object detection capabilities. Professor Fei-Fei Li's team previously carried out a project called "Digital Cousins," which detects objects from single images and then matches them with digital assets to construct simulated environments. In the initial image-analysis stage of that project, they used our Grounding DINO model. This further confirms that object detection plays a foundational role at the very front of the entire workflow.
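For readers who want a concrete reference point, the open-source Grounding DINO checkpoints can be driven in a few lines through the Hugging Face transformers integration. The snippet below follows the library's documented usage; exact argument names can differ between transformers versions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("living_room.jpg")       # any local photo
text = "a sofa. a lamp. a coffee table."    # lowercase phrases separated by periods

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Turn raw outputs into thresholded, image-sized boxes with matched phrases.
results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]],
)
print(results[0]["boxes"], results[0]["labels"])
```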
For a long time, the focus of DINO-X has been on long-tail detection. Our goal is to achieve universal detection covering both common and rare object categories. At the same time, we are also enriching the granularity of our detection outputs, including instance segmentation, keypoint detection, and 3D structure understanding. Building on this foundation, we are advancing step by step: from 2D detection, to 3D object perception, and finally to 3D environment perception.
DINO Talk: So this actually maps onto the development roadmap of DINO-X, starting from 2D Detection, moving to 3D Object Perception, then to 3D Environment Perception, and ultimately achieving a World Model. What specific tasks does this roadmap entail?
Lei: Let’s start with 3D Object Perception. This is not a simple extension of 2D detection. Instead, it enables finer-grained estimation of an object’s 3D pose, key points, and geometric structure, laying the fundamental object-level foundation for 3D environment understanding.
3D Object Perception can integrate data from diverse sources, such as 2D images, depth maps (from LiDAR, radar, or stereo vision), and point cloud data. By designing efficient multi-modal fusion strategies, we can enhance the ability to perceive an object’s 3D structure, thereby achieving higher robustness in complex environments.
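Architecturally, one simple way to picture such a fusion strategy is a late-fusion head that combines a pooled 2D feature with a feature aggregated from the object's 3D points. The module below is a generic, illustrative sketch, not the fusion design used in DINO-X.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Toy late-fusion block: merge a per-object 2D image feature with a feature
    pooled from the object's 3D points. Illustrative only."""

    def __init__(self, img_dim=256, pts_dim=128, out_dim=256):
        super().__init__()
        self.point_encoder = nn.Sequential(        # encode each (x, y, z) point
            nn.Linear(3, pts_dim), nn.ReLU(), nn.Linear(pts_dim, pts_dim)
        )
        self.fuse = nn.Sequential(
            nn.Linear(img_dim + pts_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, img_feat, points):
        # img_feat: (B, img_dim) pooled 2D feature for an object crop
        # points:   (B, N, 3) points belonging to the same object (LiDAR/stereo)
        pts_feat = self.point_encoder(points).max(dim=1).values    # permutation-invariant pooling
        return self.fuse(torch.cat([img_feat, pts_feat], dim=-1))  # fused object embedding

# Example: fused = LateFusionHead()(torch.randn(2, 256), torch.randn(2, 1024, 3))
```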
Next up is 3D Environment Perception, which operates at a more macro level. On one hand, it involves scene semantic parsing: combining 3D reconstruction, localization technologies, and 2D semantic understanding to build a global 3D scene semantic map. The model doesn’t just identify what objects are present. It also grasps their spatial relationships and categorical attributes.
On the other hand, dynamic modeling challenges such as pedestrian movement trajectories and light/shadow variations need to be addressed. These dynamic factors directly impact machine decision-making and must be analyzed and predicted with precision. To give an example: if a robot needs to navigate a shopping mall and assist pedestrians, it must not only know the locations of shelves and aisles (the static scene), but also predict pedestrian movement paths (the dynamic scene). This is the core problem that 3D environment perception aims to solve.
Last but not least, there is a critical enabler: data. To empower models to truly understand diverse, complex scenarios, we must construct large-scale, multimodal spatial perception datasets, which is the cornerstone of all algorithmic research. Without high-quality data, even the most advanced models will struggle to be put into practical use. Therefore, alongside advancing technological R&D, we are simultaneously building this dataset ecosystem.
DINO Talk: To sum up, the Visincept initiative will first lay a solid foundation with object detection and 3D structure understanding. Next, it will achieve comprehensive “from object to scene” comprehension through 3D object perception and 3D environmental perception. Finally, supported by high-quality datasets, hardware acceleration, and efficient data processing pipelines, it will build a 3D large vision model capable of truly understanding the physical world, providing core support for the implementation of spatial intelligence.
Lei: Exactly.
DINO Talk: What key challenges will we face in the process of evolving the DINO-X universal visual perception model toward spatial intelligence?
Lei: There are plenty of them. This is a journey of constantly overcoming challenges.
First is the unified representation of 3D structures. Multiple approaches exist for representing an object’s 3D structure, including 3D bounding boxes, 3D keypoints, 3D point clouds, and 3D meshes. Future research must explore how to devise an algorithm-level unified representation that delivers high scalability while adapting seamlessly to diverse scenarios.
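To see why unification is non-trivial, it helps to lay the formats side by side; the container below simply juxtaposes them and is not a proposed unified representation.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Object3D:
    """Naive container showing how heterogeneous current 3D formats are;
    illustrative only, not a proposed unified representation."""
    category: str
    box_3d: Optional[np.ndarray] = None         # (7,): x, y, z, w, h, l, yaw
    keypoints_3d: Optional[np.ndarray] = None   # (K, 3) semantic keypoints
    points: Optional[np.ndarray] = None         # (N, 3) raw point cloud
    mesh_vertices: Optional[np.ndarray] = None  # (V, 3) mesh vertices
    mesh_faces: Optional[np.ndarray] = None     # (F, 3) vertex indices per triangle

mug = Object3D(category="mug",
               box_3d=np.array([0.4, 0.1, 0.8, 0.09, 0.11, 0.09, 0.0]))
```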
Second comes the semantic understanding of 3D environments. 3D environment reconstruction primarily relies on geometry-based multi-view vision techniques such as SLAM and SfM. These geometric methods need to be combined with 2D and 3D object perception to achieve fine-grained semantic scene understanding. Going forward, research should focus on more effective fusion of geometry-based and object-aware algorithms to enable more semantically rich 3D environment comprehension.
Third is computational complexity. Processing 3D data drastically elevates computational complexity, making it critical to strike a balance between accuracy and efficiency in model architecture design.
Last but not least is generalization capability. We need to formalize the pipeline for constructing 3D perception datasets, thereby ensuring robust model generalization in real-world scenarios, particularly when handling cross-domain or incomplete data.
DINO Talk: To wrap up with a forward look, Visincept aspires to build a powerful large vision model with exceptional object-level comprehension capabilities, laying a solid groundwork for spatial intelligence. And what is our ultimate goal?
Lei: Leveraging the strengths of the DINO-X model, our ultimate objective is to develop a system that integrates human common sense, physical laws, spatial reasoning, and world knowledge to understand the physical world, construct world models, and predict the motion states of objects in the physical realm. In doing so, we aim to make substantial contributions to the technological advancement of embodied intelligence.
