Cross-Lingual Speech Emotion Recognition with Attention-Driven Bi-LSTM: Advancing Kashmiri and Multilingual Adaptation
Abstract
Speech Emotion Recognition (SER) has achieved notable success in high-resource languages, yet remains underexplored for Kashmiri, a low-resource Dardic language characterized by tonal and prosodic complexity. This study introduces the first systematic framework for Kashmiri SER and examines its cross-lingual adaptability using Urdu, Persian, and English datasets. A Bidirectional Long Short-Term Memory (Bi-LSTM) network with an attention mechanism was employed to capture bidirectional temporal dependencies while emphasizing emotionally salient segments, with Mel-Frequency Cepstral Coefficients (MFCCs) and spectrogram features as inputs. Three experiments were conducted: within-language evaluation yielded high accuracies (93.2% for Kashmiri, 97% for Urdu, 85% for Persian, and 80.05% for English); cross-lingual transfer revealed a substantial performance decline of 25–34%, highlighting phonetic and prosodic mismatches; and progressive domain adaptation improved accuracy to 89%, 81%, and 83% for Urdu, Persian, and English, respectively. These findings demonstrate the challenges of direct transfer and the promise of adaptation, offering a pathway toward resource-efficient, multilingual SER for underrepresented languages.
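The attention pooling the abstract describes can be sketched in a few lines: frame-level Bi-LSTM outputs are scored, the scores are normalized into weights, and the frames are summed into one utterance vector so emotionally salient segments dominate. This is a minimal NumPy illustration, not the authors' implementation; the layer sizes, parameter names (`W`, `b`, `u`), and random stand-in for Bi-LSTM outputs are all assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W, b, u):
    """Collapse per-frame Bi-LSTM outputs H (T x d) into one utterance
    vector: score each frame, normalize scores into attention weights,
    then take the weighted sum of frames."""
    scores = np.tanh(H @ W + b) @ u   # one scalar score per frame, shape (T,)
    alpha = softmax(scores)           # attention weights, sum to 1
    return alpha @ H, alpha           # context vector (d,) and the weights

rng = np.random.default_rng(0)
T, d, a = 50, 8, 4                    # frames, hidden size, attention size (illustrative)
H = rng.standard_normal((T, d))       # stand-in for Bi-LSTM hidden states
W = rng.standard_normal((d, a))
b = np.zeros(a)
u = rng.standard_normal(a)
context, alpha = attention_pool(H, W, b, u)
```

The resulting `context` vector would feed a softmax classifier over the emotion labels; because `alpha` is a probability distribution over frames, it also gives a per-frame saliency map that can be inspected to see which segments drove a prediction.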