- The paper introduces a decoupled framework that precomputes semantic prototypes to eliminate runtime text encoding and optimize open-vocabulary detection.
- It employs dynamic hierarchical concept pooling and dual-projection distillation to balance semantic generalization and localization accuracy.
- Parametric decoupling training resolves optimization conflicts, leading to state-of-the-art novel-category detection and reduced inference costs.
DeCo-DETR: A Decoupled Cognition Framework for Efficient Open-Vocabulary Object Detection
Introduction
This paper presents DeCo-DETR, a DETR-based, vision-centric pipeline that addresses two main deficiencies in existing open-vocabulary object detection (OVOD) models: reliance on high-latency text encoders during inference, and optimization conflicts between closed-set precision and open-world generalization. The authors propose a unified decoupling paradigm that targets efficient transfer and deployment of semantic knowledge from large vision-language models (LVLMs) such as LLaVA and CLIP, while explicitly separating the semantic and localization training objectives. The approach is validated through extensive experiments on the OV-COCO and OV-LVIS benchmarks, where it reports state-of-the-art zero-shot detection accuracy and substantially reduced inference costs.
Conventional OVOD pipelines address open-set recognition by fusing visual and linguistic cues, leaning heavily on CLIP-style cross-modal alignment and increasingly on prompt-based use of LLMs. However, text encoders and prompt-based designs severely limit deployment efficiency because of their inference-time computational burden. Moreover, the standard multimodal fusion paradigm introduces a trade-off between base-class (closed-set) accuracy and novel-class (open-set) generalization, which often manifests as optimization interference: improving one degrades the other. DeCo-DETR addresses these problems by:
- Precomputing semantic prototypes to obviate runtime text encoding.
- Structuring hierarchical knowledge transfer.
- Decoupling semantic and localization branches at both representation and optimization levels.
Architecture and Methodology
Dynamic Hierarchical Concept Pool (DHCP)
DHCP constructs a self-evolving, hierarchical semantic prototype repository that captures both coarse-grained category-level and fine-grained attribute-level semantics. The prototype pool is initialized offline by generating region-level descriptions with LLaVA, aligning these with visual features in a CLIP-shared embedding space, then applying successive K-Means and DBSCAN clustering to model prototypes at different levels of semantic granularity. Notably, DHCP retains only high-confidence region-description pairs (cosine similarity above a threshold), which filters noise from the generated captions.
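The offline construction described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the threshold value, the number of clusters, and the synthetic features are all assumptions, and only the confidence filter and a coarse K-Means stage are shown (the fine-grained DBSCAN stage is omitted).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def filter_confident_pairs(region_feats, text_feats, threshold):
    """Keep region/description pairs whose cosine similarity exceeds the threshold."""
    r, t = l2_normalize(region_feats), l2_normalize(text_feats)
    sims = np.sum(r * t, axis=1)  # cosine similarity of row i with row i
    keep = sims > threshold
    return r[keep], keep

def kmeans_prototypes(feats, k, iters=20, seed=0):
    """Plain Lloyd-style K-Means on unit vectors; centers are re-normalized each step."""
    rng = np.random.default_rng(seed)
    centers = feats[rng.choice(len(feats), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = np.argmax(feats @ centers.T, axis=1)  # nearest center by cosine
        for j in range(k):
            members = feats[assign == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
        centers = l2_normalize(centers)
    assign = np.argmax(feats @ centers.T, axis=1)  # assignments against final centers
    return centers, assign

# Synthetic stand-ins for CLIP-space features (hypothetical data, 16-dim for brevity).
rng = np.random.default_rng(0)
region_feats = rng.normal(size=(200, 16))
# The first 150 "descriptions" agree with their region; the last 50 are unrelated noise.
text_feats = np.vstack([region_feats[:150] + 0.1 * rng.normal(size=(150, 16)),
                        rng.normal(size=(50, 16))])

kept, mask = filter_confident_pairs(region_feats, text_feats, threshold=0.9)
coarse_prototypes, coarse_assign = kmeans_prototypes(kept, k=4)
```

Normalizing before clustering means the assignment step compares directions rather than magnitudes, which matches the cosine-similarity criterion used for filtering.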
This offline pool is continually refined online with a batch-wise, momentum-driven updating routine, adapting to distribution shift and incremental learning signals during detector training. By making the prototype space runtime-reusable, DHCP eradicates the need for computationally expensive LLMs or text encoders in the inference pipeline.
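A batch-wise momentum update of this kind is often written as an exponential moving average. The sketch below assumes hard nearest-prototype assignment and a momentum of 0.9; neither detail is given in the summary, and `momentum_update` is a hypothetical name.

```python
import numpy as np

def momentum_update(prototypes, batch_feats, momentum=0.9):
    """EMA-style refresh: each prototype drifts toward the mean of the batch features assigned to it."""
    protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    feats = batch_feats / np.linalg.norm(batch_feats, axis=1, keepdims=True)
    assign = np.argmax(feats @ protos.T, axis=1)  # hard assignment by cosine similarity
    updated = protos.copy()
    for j in range(len(protos)):
        members = feats[assign == j]
        if len(members) == 0:
            continue  # no evidence in this batch: keep the prototype unchanged
        updated[j] = momentum * protos[j] + (1.0 - momentum) * members.mean(axis=0)
    # Re-normalize so prototypes stay on the unit sphere.
    return updated / np.linalg.norm(updated, axis=1, keepdims=True)

prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
batch = np.array([[0.9, 0.1], [0.8, 0.2]])  # both features lie nearest to prototype 0
refreshed = momentum_update(prototypes, batch, momentum=0.9)
```

In this toy batch only prototype 0 moves, slightly toward the batch mean, while prototype 1 receives no assignments and is left as it was.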
Hierarchical Knowledge Distillation (Hi-Know DPA)
Beyond semantic memory construction, DeCo-DETR applies a dual-projection distillation: detector queries, after standard DETR backbone and decoder, are mapped into the CLIP-aligned semantic space via a learnable projection. Each query's relationship (via soft assignment) to all prototypes is computed, producing semantically enriched query embeddings through a weighted aggregation plus residual connection.
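The query enrichment step can be illustrated with a small NumPy sketch. It assumes a softmax over temperature-scaled cosine similarities as the soft assignment and shows a single projection only; the temperature, the dimensions, and the name `enrich_queries` are illustrative choices, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enrich_queries(queries, proj, prototypes, temperature=0.07):
    """Project queries into the semantic space, softly assign them to prototypes, add a residual."""
    z = queries @ proj                                   # (Q, D) projected queries
    z_unit = z / np.linalg.norm(z, axis=1, keepdims=True)
    p_unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    weights = softmax(z_unit @ p_unit.T / temperature)   # (Q, K) soft assignment
    aggregated = weights @ prototypes                    # (Q, D) weighted prototype mix
    return z + aggregated, weights                       # residual connection

rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 32))       # 5 decoder queries with 32-dim features (toy sizes)
proj = rng.normal(size=(32, 16)) * 0.1   # stand-in for the learnable projection
prototypes = rng.normal(size=(8, 16))    # 8 prototypes in a 16-dim semantic space
enriched, weights = enrich_queries(queries, proj, prototypes)
```

Each row of `weights` is a probability distribution over prototypes, so the aggregated term is a convex combination of prototype vectors added on top of the projected query.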
Teacher-guided distillation trains the student detector to replicate the CLIP-derived semantic assignments for both base and unseen classes, facilitating robust open-world generalization. Critically, hierarchical knowledge distillation, which structures the prototype space in levels, directly benefits transfer to fine-grained and compositional novel categories: an ablation shows a 10.5-point drop in novel-category AP when the fine-grained units (M2) are omitted.
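One common way to make a student match a teacher's assignment distribution is a KL-divergence loss over the prototype logits. The summary does not state the exact loss, so the following is a generic sketch under that assumption, with hypothetical names and toy logits.

```python
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def distillation_kl(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over queries, computed from prototype logits."""
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    p_t = np.exp(log_p_t)
    return float(np.mean(np.sum(p_t * (log_p_t - log_p_s), axis=1)))

teacher = np.array([[2.0, 0.0, -1.0]])   # teacher prefers the first prototype
uniform_student = np.zeros((1, 3))       # student with no preference yet
loss_mismatch = distillation_kl(uniform_student, teacher)
loss_match = distillation_kl(teacher, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two distributions diverge.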
Parametric Decoupling Training (PD-DuGi)
To resolve the well-documented optimization interference between localization and semantic alignment, DeCo-DETR’s PD-DuGi enforces strict gradient decoupling between these objectives. The detection head is optimized solely with the standard DETR loss (Hungarian matching over bounding boxes and objectness), with gradients restricted to backbone and decoder parameters. In parallel, a cognition stream, whose gradients are stopped before they reach the detector queries, trains the semantic projection and classification head using distribution alignment to CLIP’s output. A time-dependent cosine annealing schedule initially weights semantic alignment more heavily and later shifts weight to detection, yielding more stable convergence and further reducing task interference.
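The weighting schedule can be pictured with a standard half-cosine curve. The exact functional form and endpoints are not given in the summary, so the version below, which moves the semantic weight from 1 to 0 and the detection weight from 0 to 1, is only one plausible instance; `loss_weights` is a hypothetical helper.

```python
import math

def loss_weights(step, total_steps):
    """Half-cosine schedule: semantic weight decays from 1 to 0, detection weight rises from 0 to 1."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    w_sem = 0.5 * (1.0 + math.cos(math.pi * progress))
    w_det = 1.0 - w_sem
    return w_sem, w_det

start = loss_weights(0, 100)     # semantic alignment dominates
middle = loss_weights(50, 100)   # roughly balanced
end = loss_weights(100, 100)     # detection dominates
```

In a training loop, the two weights would scale the semantic and detection losses before they are summed; in a framework with automatic differentiation, the gradient stop on the detector queries would be applied separately from this weighting.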
Empirical Evaluation
Across benchmark datasets (OV-COCO and OV-LVIS), DeCo-DETR achieves the strongest reported novel-category APs in both V-OVD and G-OVD paradigms, as well as record overall mAPs for long-tailed recognition:
- On OV-COCO, DeCo-DETR reaches 41.3% novel-category AP50 (+3.5 points over the best previous result) and 56.7% overall AP50, notably improving over fusion-based and distillation baselines.
- On OV-LVIS, the model attains 29.4% AP on rare (novel) classes and 35.2% overall AP, establishing new records for open-vocabulary long-tail detection.
Inference efficiency is a central result: DeCo-DETR runs at 135 ms per image with a ResNet-50 backbone, compared with more than 250 ms for prompt-based or text-fusion methods, and its memory and parameter overheads stay within 5% of standard DETR. According to the paper, this enables real-time deployment on commodity hardware, which prior methods do not match.
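To put the latency figures in throughput terms, the short calculation below converts them to images per second. The 250 ms value is treated as a lower bound on the baseline latency, so the resulting speedup is a lower bound as well.

```python
def throughput_fps(latency_ms):
    """Images per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms

deco_fps = throughput_fps(135.0)      # about 7.4 images per second
baseline_fps = throughput_fps(250.0)  # at most 4 images per second for the slower methods
speedup = 250.0 / 135.0               # at least about 1.85x faster per image
```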
Ablation and Efficiency Analysis
Hierarchical DHCP and gradient isolation both yield substantial quantitative improvements, with DHCP on its own contributing up to +6.2 points of novel-category AP50. Increasing query and prototype granularity provides consistent but saturating gains, which the authors take as confirming the scalability of the transformer-based design. Furthermore, DeCo-DETR is robust to the choice of LVLM: improvements in prototype quality saturate once model scale exceeds 13B parameters, which guides practical model selection for deployment.
Qualitative user studies further corroborate that semantic descriptions and localization of novel classes are both improved over leading methods, especially under distribution shift or image corruption, underscoring the practical robustness of the decoupled, prototype-driven approach.
Implications and Future Directions
DeCo-DETR exemplifies a new paradigm for OVOD, where semantic cognition and localization are structurally decoupled at all stages, from representation through to optimization. This framework not only supports broader category generalization without runtime LLM dependence but also minimizes the efficiency–accuracy trade-off traditionally seen in multimodal vision systems.
The practical implication is a scalable, resource-friendly OVOD solution suitable for real-world applications such as autonomous driving, robotics, and surveillance, where open-set recognition and inference tractability are paramount. Theoretical implications suggest that architectural and optimization decoupling can resolve or mitigate the intrinsic conflicts of joint cross-modal representation learning, hinting at similar strategies for broader multimodal and cross-domain vision challenges.
The paper's approach can be extended to dense prediction, open-vocabulary instance segmentation, or generalized visual grounding, and future work may include adapting the DHCP-HKD-PD-DuGi paradigm to spatiotemporal (video) domains, incremental/continual learning, or edge-device deployment via further compression and pruning of prototype spaces.
Conclusion
DeCo-DETR introduces a principled, vision-centric alternative to conventional OVOD, eliminating inference-time dependence on text encoders and resolving optimization conflicts through hierarchical semantic abstraction and parametric gradient decoupling. The framework sets new standards in zero-shot and open-world detection, balancing scalability, accuracy, and inference efficiency, and lays a methodological foundation for the next generation of interpretable, modular, and deployable open-vocabulary perception systems (2604.02753).