Published 2025-10-30
Abstract
Speech recognition has evolved from rule-based systems to deep learning–driven architectures capable of modeling complex linguistic and acoustic patterns. However, the rapid progress achieved for high-resource languages such as English and Mandarin has not been equally reflected in multilingual and low-resource contexts. This disparity limits the inclusiveness and global applicability of automatic speech recognition (ASR) technologies. In parallel, the emergence of multimodal learning, which integrates speech, vision, and text, has opened new possibilities for robust, context-aware recognition systems that align more closely with human communication. This paper provides a comprehensive survey of recent advances in multilingual and multimodal speech recognition. It reviews state-of-the-art models, including end-to-end architectures, self-supervised learning, and transformer-based approaches such as wav2vec 2.0, Whisper, and SpeechT5. The review also examines multilingual pretraining strategies, transfer learning for low-resource adaptation, and multimodal fusion techniques that combine audio with visual or textual modalities to improve recognition accuracy and robustness. We further analyze benchmark datasets, evaluation metrics, and key challenges such as code-switching, domain adaptation, and cultural diversity. Finally, we highlight future trends in cross-lingual model generalization, data-efficient learning, and multimodal interaction for next-generation intelligent speech systems. The findings indicate that progress in multilingual and multimodal ASR is essential for bridging the linguistic divide and achieving equitable access to AI-driven technologies worldwide.