Cross-Lingual Speech Emotion Recognition with Attention-Driven Bi-LSTM: Advancing Kashmiri and Multilingual Adaptation
Abstract
Speech Emotion Recognition (SER) has achieved notable success in high-resource languages, yet remains underexplored for Kashmiri, a low-resource Dardic language characterized by tonal and prosodic complexity. This study introduces the first systematic framework for Kashmiri SER and examines its cross-lingual adaptability using Urdu, Persian, and English datasets. A Bidirectional Long Short-Term Memory (Bi-LSTM) network with an attention mechanism was employed to capture bidirectional temporal dependencies while emphasizing emotionally salient segments, with Mel-Frequency Cepstral Coefficients (MFCCs) and spectrogram features as inputs. Three experiments were conducted: within-language evaluation yielded high accuracies (93.2% for Kashmiri, 97% for Urdu, 85% for Persian, and 80.05% for English); cross-lingual transfer revealed a substantial performance decline of 25–34%, highlighting phonetic and prosodic mismatches; and progressive domain adaptation improved accuracy to 89%, 81%, and 83% for Urdu, Persian, and English, respectively. These findings demonstrate the challenges of direct transfer and the promise of adaptation, offering a pathway toward resource-efficient, multilingual SER for underrepresented languages.
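The attention pooling the abstract describes can be sketched in a few lines: frame-level Bi-LSTM outputs are scored, the scores are normalized into weights, and the frames are summed into one utterance vector so emotionally salient segments dominate. This is a minimal NumPy illustration, not the authors' implementation; the layer sizes, parameter names (`W`, `b`, `u`), and random stand-in for Bi-LSTM outputs are all assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W, b, u):
    """Collapse per-frame Bi-LSTM outputs H (T x d) into one utterance
    vector: score each frame, normalize scores into attention weights,
    then take the weighted sum of frames."""
    scores = np.tanh(H @ W + b) @ u   # one scalar score per frame, shape (T,)
    alpha = softmax(scores)           # attention weights, sum to 1
    return alpha @ H, alpha           # context vector (d,) and the weights

rng = np.random.default_rng(0)
T, d, a = 50, 8, 4                    # frames, hidden size, attention size (illustrative)
H = rng.standard_normal((T, d))       # stand-in for Bi-LSTM hidden states
W = rng.standard_normal((d, a))
b = np.zeros(a)
u = rng.standard_normal(a)
context, alpha = attention_pool(H, W, b, u)
```

The resulting `context` vector would feed a softmax classifier over the emotion labels; because `alpha` is a probability distribution over frames, it also gives a per-frame saliency map that can be inspected to see which segments drove a prediction.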