Published 2025-04-30

This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Recent advances in vision-language pretraining have significantly improved performance across a wide range of visual understanding tasks, including image captioning, visual question answering (VQA), and open-world object detection. However, existing models often suffer from domain sensitivity, shallow cross-modal alignment, and limited adaptability to real-world multimodal scenes. In this work, we propose a unified cross-modal representation learning framework that integrates image, text, and depth modalities through a multi-stream transformer architecture. Our approach emphasizes three design principles: modality-specific feature enhancement, global alignment via contrastive learning, and adaptive fine-tuning with dynamic negative sampling. We demonstrate the effectiveness of our framework on four benchmark datasets spanning open-vocabulary detection, cross-modal retrieval, and zero-shot classification. Extensive experiments show consistent gains over state-of-the-art baselines, including CLIP and BLIP-2, with improvements of up to +6.3% in retrieval recall and +4.8 mAP in detection tasks. Qualitative analysis further confirms the model’s ability to capture high-level semantic associations across modalities. This work provides new insights into robust cross-modal vision systems and offers a scalable solution for real-world multimodal reasoning applications.
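
The three design principles named in the abstract (modality-specific feature enhancement, contrastive global alignment, and dynamic negative sampling) can be pictured with a minimal sketch. The PyTorch code below is a hypothetical illustration, not the authors' released implementation: the per-stream architecture, the mean pooling, the median-based hard-negative weighting, and all dimensions and hyperparameters are assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityStream(nn.Module):
    """One transformer stream with a modality-specific enhancement head.
    Dimensions and depth are illustrative assumptions."""

    def __init__(self, in_dim, dim=256, depth=2, heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.enhance = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens):                       # tokens: (B, N, in_dim)
        x = self.encoder(self.proj(tokens))          # per-modality self-attention
        x = x.mean(dim=1)                            # pool tokens to one global vector
        return F.normalize(self.enhance(x), dim=-1)  # unit norm for contrastive loss


def contrastive_loss(a, b, temperature=0.07, hard_weight=2.0):
    """InfoNCE between two modalities with a simple 'dynamic negative' scheme:
    in-batch negatives scoring above the median similarity are up-weighted."""
    logits = a @ b.t() / temperature                                   # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)                 # diagonal = positives
    neg_mask = ~torch.eye(a.size(0), dtype=torch.bool, device=a.device)
    hard = (logits > logits[neg_mask].median()) & neg_mask             # hard negatives
    weights = torch.where(hard, torch.full_like(logits, hard_weight),
                          torch.ones_like(logits))
    # Adding log-weights to the logits multiplies the corresponding exp terms,
    # so hard negatives contribute more to the softmax denominator.
    return F.cross_entropy(logits + weights.log(), targets)


# Toy forward pass with random features standing in for image patches,
# text tokens, and depth patches (feature sizes are placeholders).
B = 8
img = ModalityStream(in_dim=768)(torch.randn(B, 49, 768))
txt = ModalityStream(in_dim=512)(torch.randn(B, 32, 512))
dep = ModalityStream(in_dim=256)(torch.randn(B, 49, 256))
loss = (contrastive_loss(img, txt) +
        contrastive_loss(img, dep) +
        contrastive_loss(txt, dep))
loss.backward()
```

In this sketch, alignment is pairwise across the three modality embeddings; how the actual framework combines the pairwise terms, and how negatives are resampled during adaptive fine-tuning, is described in the paper itself.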