Published 2024-09-30
This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
This study presents a speech emotion recognition system that integrates a dynamic convolutional neural network with a bi-directional long short-term memory (Bi-LSTM) network. The dynamic convolutional kernel enables the network to capture global dynamic emotional patterns, improving model performance without significantly increasing computational cost. Meanwhile, the Bi-LSTM component classifies emotional features more effectively by exploiting temporal information. The system was evaluated on three datasets: the CASIA Chinese speech emotion dataset, the EMO-DB German emotion corpus, and the IEMOCAP English corpus. The experiments yielded average emotion recognition accuracies of 59.08%, 89.29%, and 71.25%, respectively. These results represent improvements of 1.17%, 1.36%, and 2.97% over the accuracies achieved by existing speech emotion recognition systems built on mainstream models, demonstrating the effectiveness of the proposed approach.
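The core idea of a dynamic convolutional kernel is to maintain several candidate kernels and aggregate them per input with attention weights derived from the input itself, so the effective kernel adapts to the signal at roughly the cost of a single convolution. The following is a minimal NumPy sketch of that aggregation for a 1-D signal; the function names, shapes, and the global-average-pooling attention are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_conv1d(x, kernels, attn_w):
    """Dynamic convolution sketch (assumed form): mix K candidate
    kernels with input-dependent attention, then convolve once.

    x:       (T,)   1-D input signal
    kernels: (K, k) K candidate kernels of width k
    attn_w:  (K,)   projection mapping pooled input to kernel scores
    """
    pooled = x.mean()                     # global average pooling of the input
    alpha = softmax(attn_w * pooled)      # attention weights over the K kernels
    kernel = (alpha[:, None] * kernels).sum(axis=0)  # aggregated kernel
    return np.convolve(x, kernel, mode="valid")

rng = np.random.default_rng(0)
x = rng.standard_normal(16)               # toy 16-sample signal
kernels = rng.standard_normal((4, 3))     # 4 candidate kernels of width 3
attn_w = rng.standard_normal(4)
y = dynamic_conv1d(x, kernels, attn_w)
print(y.shape)  # (14,) — "valid" convolution of length 16 with width-3 kernel
```

Because only the small attention projection is added on top of a single convolution pass, the adaptive behavior comes at little extra compute, which matches the abstract's claim; in the full system the convolutional features would then feed a Bi-LSTM for temporal classification.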