Published 2024-08-30
Keywords
- Infrared and visible light fusion, object detection, RT-DETR, modality attention
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Infrared-visible fusion object detection plays a vital role in visual perception in complex environments, yet existing methods still struggle with feature alignment, modality complementarity, and detection accuracy. To address these issues, this paper proposes a multimodal object detection method based on an improved RT-DETR. A dual-branch feature extraction network processes infrared and visible images separately, a modality attention mechanism is introduced to strengthen cross-modal information interaction, and a feature alignment loss is employed to optimize the fusion process and improve the model's adaptability across modalities. Experimental results show that the proposed method achieves superior performance on multiple benchmark datasets. Compared with traditional single-modality approaches, the improved RT-DETR attains higher mAP@50 and mAP@50:95 scores and is more robust under challenging lighting conditions. Compared with existing multimodal detection methods, it maintains high detection accuracy while improving class discrimination and reducing false positives and missed detections, validating its effectiveness in multimodal visual perception tasks.
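
Since the abstract only outlines the architecture, the following is a minimal PyTorch sketch of how the two components it names might be realized: a channel-wise modality attention block that re-weights and fuses the infrared and visible feature maps, and a feature alignment loss that narrows the modality gap before fusion. `ModalityAttentionFusion`, `feature_alignment_loss`, and all parameter choices (such as the reduction ratio) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttentionFusion(nn.Module):
    """Sketch of a modality attention block (assumed design, not the paper's):
    computes per-channel weights for each modality from pooled global context,
    then fuses the two feature maps as a weighted sum."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, 2 * channels),
        )

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = f_ir.shape
        # Global average pooling of both modalities -> joint context (B, 2C)
        ctx = torch.cat([f_ir.mean(dim=(2, 3)), f_vis.mean(dim=(2, 3))], dim=1)
        # Sigmoid gates per modality and channel: (B, 2, C, 1, 1)
        w = torch.sigmoid(self.mlp(ctx)).view(b, 2, c, 1, 1)
        # Re-weight each modality's channels, then sum into one fused map
        return w[:, 0] * f_ir + w[:, 1] * f_vis

def feature_alignment_loss(f_ir: torch.Tensor, f_vis: torch.Tensor) -> torch.Tensor:
    """One plausible alignment objective: cosine distance between the
    pooled, L2-normalized descriptors of the two modalities."""
    d_ir = F.normalize(f_ir.mean(dim=(2, 3)), dim=1)
    d_vis = F.normalize(f_vis.mean(dim=(2, 3)), dim=1)
    return (1.0 - (d_ir * d_vis).sum(dim=1)).mean()
```

A short usage example under the same assumptions: the fused map would replace the single-backbone feature fed to the RT-DETR encoder, and the alignment term would be added to the detection loss with a small weight.

```python
fusion = ModalityAttentionFusion(channels=256)
f_ir = torch.randn(2, 256, 20, 20)   # dual-branch backbone outputs (assumed shapes)
f_vis = torch.randn(2, 256, 20, 20)
fused = fusion(f_ir, f_vis)          # (2, 256, 20, 20), passed on to the detector
loss_align = feature_alignment_loss(f_ir, f_vis)
```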