Vol. 4 No. 6 (2025)
Articles

A Modular Framework for Robust Multimodal Representation Learning via Dynamic Modality Weighting

Published 2025-06-30

How to Cite

Forsberg, L., & Whitmore, C. (2025). A Modular Framework for Robust Multimodal Representation Learning via Dynamic Modality Weighting. Journal of Computer Technology and Software, 4(6). Retrieved from https://ashpress.org/index.php/jcts/article/view/184

Abstract

Multimodal systems empower machines to interpret and reason over diverse information sources such as text, images, and audio, thereby achieving a level of understanding closer to human cognition. This paper introduces a unified framework that combines modality-specific encoders, a hierarchical cross-modal fusion module, and dynamic weighting strategies. We validate the framework on three representative tasks (emotion recognition, image-text retrieval, and medical report generation), where it consistently outperforms competitive baselines in both accuracy and robustness. Comprehensive experiments and case analyses highlight its adaptability to real-world scenarios. The proposed solution is scalable, interpretable, and broadly applicable to fields such as healthcare, education, and human-computer interaction.
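
The full paper specifies the fusion architecture; as an illustration only, the sketch below shows one common way dynamic modality weighting can be realized: a small gating network scores each modality's embedding, and a softmax turns the scores into input-dependent fusion weights. All names, dimensions, and the gating design here are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DynamicModalityFusion(nn.Module):
    """Illustrative sketch of softmax-gated dynamic modality weighting.

    Assumption-based example, not the paper's actual module: each modality
    encoder is presumed to output an embedding in a shared space, and a
    learned gate assigns each modality a per-sample reliability weight.
    """

    def __init__(self, embed_dim: int):
        super().__init__()
        # One scalar "reliability" score per modality embedding.
        self.gate = nn.Linear(embed_dim, 1)

    def forward(self, modality_embeddings: list[torch.Tensor]) -> torch.Tensor:
        # modality_embeddings: list of (batch, embed_dim) tensors, e.g. the
        # outputs of text/image/audio encoders projected to a shared space.
        stacked = torch.stack(modality_embeddings, dim=1)      # (batch, M, D)
        scores = self.gate(stacked).squeeze(-1)                # (batch, M)
        weights = torch.softmax(scores, dim=1)                 # dynamic, per sample
        fused = (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (batch, D)
        return fused


if __name__ == "__main__":
    # Random stand-ins for encoder outputs; batch size and width are hypothetical.
    text, image, audio = (torch.randn(8, 256) for _ in range(3))
    fusion = DynamicModalityFusion(embed_dim=256)
    fused = fusion([text, image, audio])
    print(fused.shape)  # torch.Size([8, 256])
```

Because the weights are recomputed for every input, a degraded modality (for instance, noisy audio) can be down-weighted at inference time, which is one plausible route to the robustness behavior the abstract describes.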