A Modular Framework for Robust Multimodal Representation Learning via Dynamic Modality Weighting
Published 2025-06-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
Multimodal systems empower machines to interpret and reason over diverse information sources such as text, images, and audio, thereby achieving a level of understanding closer to human cognition. This paper introduces a unified framework that combines modality-specific encoders, a hierarchical cross-modal fusion module, and dynamic weighting strategies. We validate the framework on three representative tasks: emotion recognition, image-text retrieval, and medical report generation, where it consistently outperforms competitive baselines in both accuracy and robustness. Comprehensive experiments and case analyses highlight its adaptability to real-world scenarios. The proposed solution is scalable, interpretable, and broadly applicable to fields such as healthcare, education, and human-computer interaction.
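To make the dynamic weighting idea described above concrete, the sketch below shows one plausible (hypothetical, not the authors') realization in PyTorch: each modality is encoded separately, a small gating head scores the encoded features, and a softmax over those scores yields per-sample fusion weights. The class and parameter names (`DynamicWeightedFusion`, `fused_dim`) are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of dynamic modality weighting (illustrative only).
import torch
import torch.nn as nn

class DynamicWeightedFusion(nn.Module):
    def __init__(self, modality_dims, fused_dim=256):
        super().__init__()
        # One projection per modality so all features share a common dimension.
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, fused_dim), nn.ReLU()) for d in modality_dims
        )
        # Gating head: one scalar score per modality, per sample.
        self.gate = nn.Linear(fused_dim * len(modality_dims), len(modality_dims))

    def forward(self, inputs):
        # inputs: list of tensors, one per modality, each of shape (batch, dim_i)
        feats = [enc(x) for enc, x in zip(self.encoders, inputs)]  # (batch, fused_dim) each
        stacked = torch.stack(feats, dim=1)                        # (batch, M, fused_dim)
        scores = self.gate(torch.cat(feats, dim=-1))               # (batch, M)
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)      # (batch, M, 1)
        fused = (weights * stacked).sum(dim=1)                     # (batch, fused_dim)
        return fused, weights.squeeze(-1)

# Example: fuse text (768-d), image (512-d), and audio (128-d) features.
if __name__ == "__main__":
    model = DynamicWeightedFusion([768, 512, 128])
    batch = [torch.randn(4, 768), torch.randn(4, 512), torch.randn(4, 128)]
    fused, weights = model(batch)
    print(fused.shape, weights.shape)  # torch.Size([4, 256]) torch.Size([4, 3])
```

The learned weights can also be inspected per sample, which is one way such a design supports the interpretability and robustness-to-noisy-modalities claims made in the abstract.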