IJIGSP Vol. 18, No. 2, 8 Apr. 2026
Assistive Technologies, BiLSTM, Connectionist Temporal Classification, Dual-Modality Speech Recognition System, KanAVNet (Kannada Audio-Visual Network), Mel-Frequency Cepstral Coefficients (MFCC)
This research presents a comprehensive dual-modality speech recognition system designed to help hearing-impaired students understand spoken Kannada through synchronized processing of auditory signals and visual articulatory cues. The approach leverages deep learning to extract speech-related features from spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) for the audio stream, and discriminative lip-movement features via Convolutional Neural Networks (CNNs) and Temporal Convolutional Networks (TCNs) for the visual stream. A hybrid architecture, KanAVNet (Kannada Audio-Visual Network), built on a CNN–BiLSTM framework, is integrated with a Connectionist Temporal Classification (CTC) loss function to enable robust sequence-to-sequence mapping while addressing temporal alignment challenges in audio-visual speech recognition. The system is trained on a custom-developed Kannada audiovisual dataset, addressing the scarcity of regional-language AVSR resources. Empirical results show that the model achieves an accuracy of 93.2%, a Word Error Rate (WER) of 9.8%, and an F1 score of 91.2%, outperforming unimodal baselines and existing multimodal models. This work highlights the effectiveness of multimodal fusion strategies in noisy environments and showcases the potential of AI-driven tools in promoting accessible and inclusive education for students with auditory impairments.
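To make the described pipeline concrete, the following is a minimal PyTorch sketch of a CNN–BiLSTM model trained with CTC loss over fused audio (MFCC) and visual (lip-region) streams. All layer sizes, frame counts, the 60-class output (CTC blank included), and the concatenation-based fusion are illustrative assumptions, not the authors' actual KanAVNet configuration; the spectrogram and TCN branches described in the abstract are omitted for brevity.

    # Minimal sketch (hypothetical shapes and names), not the authors' KanAVNet code.
    import torch
    import torch.nn as nn

    class AVSpeechModel(nn.Module):
        """CNN front-ends for audio (MFCC) and visual (lip ROI) streams,
        concatenated per frame and fed to a BiLSTM trained with CTC loss."""
        def __init__(self, n_mfcc=13, vis_feat_dim=64, hidden=256, n_classes=60):
            super().__init__()
            # 1-D CNN over MFCC frames (audio stream)
            self.audio_cnn = nn.Sequential(
                nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            )
            # 2-D CNN over grayscale lip-region frames (visual stream)
            self.visual_cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((2, 2)),
                nn.Flatten(),                        # -> 16 * 2 * 2 = 64 per frame
            )
            # BiLSTM over the fused per-frame features
            self.bilstm = nn.LSTM(64 + vis_feat_dim, hidden,
                                  batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)  # incl. CTC blank

        def forward(self, mfcc, lips):
            # mfcc: (B, T, n_mfcc); lips: (B, T, 1, H, W), time-synchronized
            a = self.audio_cnn(mfcc.transpose(1, 2)).transpose(1, 2)  # (B, T, 64)
            B, T = lips.shape[:2]
            v = self.visual_cnn(lips.flatten(0, 1)).view(B, T, -1)    # (B, T, 64)
            x, _ = self.bilstm(torch.cat([a, v], dim=-1))
            return self.classifier(x).log_softmax(-1)                 # (B, T, C)

    # CTC training step: targets are unsegmented; alignment is learned implicitly
    model = AVSpeechModel()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    mfcc = torch.randn(2, 75, 13)              # 75 synchronized frames per clip
    lips = torch.randn(2, 75, 1, 32, 32)
    targets = torch.randint(1, 60, (2, 20))    # label sequences (no blanks)
    logp = model(mfcc, lips).transpose(0, 1)   # nn.CTCLoss expects (T, B, C)
    loss = ctc(logp, targets,
               torch.full((2,), 75, dtype=torch.long),
               torch.full((2,), 20, dtype=torch.long))
    loss.backward()

The key design point, per the abstract, is that CTC removes the need for frame-level transcriptions: the BiLSTM emits a per-frame distribution over Kannada symbols plus a blank, and the loss marginalizes over all alignments consistent with the target sequence.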
Divya, Suresha D., "KanAVNet: A CNN-BiLSTM-CTC-Based Audio-Visual Speech Recognition System for Kannada to Assist the Hearing Impaired", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 18, No. 2, pp. 87-106, 2026. DOI: 10.5815/ijigsp.2026.02.06