IJIGSP Vol. 18, No. 2, 8 Apr. 2026
Assistive Technologies, BiLSTM, Connectionist Temporal Classification, Dual-Modality Speech Recognition System, KanAVNet (Kannada Audio-Visual Network), Mel-Frequency Cepstral Coefficients (MFCC)
This research presents a comprehensive dual-modality speech recognition system designed to help hearing-impaired students understand spoken Kannada through synchronized processing of auditory signals and visual articulatory cues. The approach leverages deep learning to extract speech-related features from spectrograms and Mel-Frequency Cepstral Coefficients (MFCCs) for the audio stream, and discriminative lip-movement features via Convolutional Neural Networks (CNNs) and Temporal Convolutional Networks (TCNs) for the visual stream. A hybrid architecture, KanAVNet (Kannada Audio-Visual Network), built on a CNN–BiLSTM framework, is integrated with a Connectionist Temporal Classification (CTC) loss function to enable robust sequence-to-sequence mapping while addressing temporal alignment challenges in audio-visual speech recognition. The system is trained on a custom-developed Kannada audiovisual dataset, addressing the scarcity of regional-language AVSR resources. Empirical results show that the model achieves an accuracy of 93.2%, a Word Error Rate (WER) of 9.8%, and an F1 score of 91.2%, outperforming unimodal baselines and existing multimodal models. This work highlights the effectiveness of multimodal fusion strategies in noisy environments and showcases the potential of AI-driven tools in promoting accessible and inclusive education for students with auditory impairments.
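To make the described pipeline concrete, the following is a minimal PyTorch sketch of a CNN–BiLSTM model trained with CTC loss over fused audio (MFCC) and visual (lip-region) streams. All layer sizes, frame counts, the 60-class output (CTC blank included), and the concatenation-based fusion are illustrative assumptions, not the authors' actual KanAVNet configuration; the spectrogram and TCN branches described in the abstract are omitted for brevity.

    # Minimal sketch (hypothetical shapes and names), not the authors' KanAVNet code.
    import torch
    import torch.nn as nn

    class AVSpeechModel(nn.Module):
        """CNN front-ends for audio (MFCC) and visual (lip ROI) streams,
        concatenated per frame and fed to a BiLSTM trained with CTC loss."""
        def __init__(self, n_mfcc=13, vis_feat_dim=64, hidden=256, n_classes=60):
            super().__init__()
            # 1-D CNN over MFCC frames (audio stream)
            self.audio_cnn = nn.Sequential(
                nn.Conv1d(n_mfcc, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            )
            # 2-D CNN over grayscale lip-region frames (visual stream)
            self.visual_cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((2, 2)),
                nn.Flatten(),                        # -> 16 * 2 * 2 = 64 per frame
            )
            # BiLSTM over the fused per-frame features
            self.bilstm = nn.LSTM(64 + vis_feat_dim, hidden,
                                  batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden, n_classes)  # incl. CTC blank

        def forward(self, mfcc, lips):
            # mfcc: (B, T, n_mfcc); lips: (B, T, 1, H, W), time-synchronized
            a = self.audio_cnn(mfcc.transpose(1, 2)).transpose(1, 2)  # (B, T, 64)
            B, T = lips.shape[:2]
            v = self.visual_cnn(lips.flatten(0, 1)).view(B, T, -1)    # (B, T, 64)
            x, _ = self.bilstm(torch.cat([a, v], dim=-1))
            return self.classifier(x).log_softmax(-1)                 # (B, T, C)

    # CTC training step: targets are unsegmented; alignment is learned implicitly
    model = AVSpeechModel()
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    mfcc = torch.randn(2, 75, 13)              # 75 synchronized frames per clip
    lips = torch.randn(2, 75, 1, 32, 32)
    targets = torch.randint(1, 60, (2, 20))    # label sequences (no blanks)
    logp = model(mfcc, lips).transpose(0, 1)   # nn.CTCLoss expects (T, B, C)
    loss = ctc(logp, targets,
               torch.full((2,), 75, dtype=torch.long),
               torch.full((2,), 20, dtype=torch.long))
    loss.backward()

The key design point, per the abstract, is that CTC removes the need for frame-level transcriptions: the BiLSTM emits a per-frame distribution over Kannada symbols plus a blank, and the loss marginalizes over all alignments consistent with the target sequence.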
Divya, Suresha D., "KanAVNet: A CNN-BiLSTM-CTC-Based Audio-Visual Speech Recognition System for Kannada to Assist the Hearing Impaired", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 18, No. 2, pp. 87-106, 2026. DOI: 10.5815/ijigsp.2026.02.06