Published 2025-10-30
Abstract
Speech recognition has evolved from rule-based systems to deep learning–driven architectures capable of modeling complex linguistic and acoustic patterns. However, the rapid progress achieved for high-resource languages such as English and Mandarin has not been equally reflected in multilingual and low-resource contexts. This disparity limits the inclusiveness and global applicability of automatic speech recognition (ASR) technologies. In parallel, the emergence of multimodal learning, which integrates speech, vision, and text, has opened new possibilities for robust, context-aware recognition systems that align more closely with human communication. This paper provides a comprehensive survey of recent advances in multilingual and multimodal speech recognition. It reviews state-of-the-art models, including end-to-end architectures, self-supervised learning, and transformer-based approaches such as wav2vec 2.0, Whisper, and SpeechT5. The review also examines multilingual pretraining strategies, transfer learning for low-resource adaptation, and multimodal fusion techniques that combine audio with visual or textual modalities to improve recognition accuracy and robustness. We further analyze benchmark datasets, evaluation metrics, and key challenges such as code-switching, domain adaptation, and cultural diversity. Finally, we highlight future trends in cross-lingual model generalization, data-efficient learning, and multimodal interaction for next-generation intelligent speech systems. The findings indicate that progress in multilingual and multimodal ASR is essential for bridging the linguistic divide and achieving equitable access to AI-driven technologies worldwide.