Developing Audio-to-Text Converters with Natural Language Processing for Smart Assistants


Author(s)

Mareeswari V. 1, Vijayan Ramaraj 2,*, Pratistha Tulsyan 3, Suji R. 3

1. Department of Software and Systems Engineering, School of Computer Science Engineering and Information Systems (SCORE), Vellore Institute of Technology (VIT), Vellore, India

2. Department of Information Technology, School of Computer Science Engineering and Information Systems (SCORE), Vellore Institute of Technology (VIT), Vellore, India

3. Bachelor of Computer Science, School of Computer Science Engineering and Information Systems (SCORE), Vellore Institute of Technology (VIT), Vellore, 632014, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2025.04.05

Received: 9 Oct. 2024 / Revised: 27 Apr. 2025 / Accepted: 16 May 2025 / Published: 8 Aug. 2025

Index Terms

Natural Language Processing, Feedforward Neural Network, Smart Assistant, Intent Recognition, Audio-to-Text Converter

Abstract

In recent years, smart assistants have transformed how people interact with technology, offering voice-controlled features such as music playback and information retrieval. However, existing systems often struggle to interpret natural language input accurately. To address this, the proposed work develops an audio-to-text converter integrated with natural language processing (NLP) capabilities to enhance smart assistant interactions. The system also incorporates intent recognition to discern user intentions and generate relevant responses. The work commenced with a literature survey to gather insights into existing smart assistant systems. Based on the findings, a comprehensive architecture was designed, integrating NLP techniques such as tokenization and lemmatization. The implementation phase involved developing and training a Feedforward Neural Network (FNN) model tailored for NLP tasks, using Python with libraries such as TensorFlow and NLTK. Testing evaluated the system's performance with standard metrics, including Word Error Rate (WER) and Character Error Rate (CER), across various audio input conditions. The system exhibited higher WER and CER on accented speech (15.3% and 7.9%, respectively), while the clean audio dataset produced a WER of 4.7% and a CER of 2.55%. Training loss and accuracy were monitored throughout FNN training to ensure model performance; the model ultimately achieved an accuracy of 97.62% with the training loss reduced to 1.45%. Insights from the training phase inform further optimization efforts. The system employs the Google Web Speech API and compares it with other speech-to-text models. In conclusion, the proposed work represents a significant step toward seamless voice-controlled interaction with smart assistants, enhancing user experience and productivity. Future work includes refining the system architecture, optimizing model performance, and extending the smart assistant's capabilities to broader application domains.
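The preprocessing and classification pipeline described in the abstract can be sketched briefly. The following Python fragment is a minimal illustration under stated assumptions, not the authors' released code: the utterances and intent labels are hypothetical, NLTK supplies tokenization and lemmatization, and a small feedforward network built with TensorFlow/Keras acts as the intent classifier over a bag-of-words encoding.

```python
# Minimal sketch of the tokenize -> lemmatize -> bag-of-words -> FNN pipeline.
# Utterances and intent labels below are hypothetical, not the paper's dataset.
import numpy as np
import tensorflow as tf
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# One-time NLTK resources (newer NLTK versions may also need "punkt_tab").
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

samples = [("play some music", "play_music"),
           ("stop the song", "stop_music"),
           ("what is the weather today", "get_weather"),
           ("tell me the time", "get_time")]

def preprocess(sentence):
    """Tokenize and lemmatize one utterance."""
    return [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(sentence)]

vocab = sorted({tok for text, _ in samples for tok in preprocess(text)})
intents = sorted({label for _, label in samples})

def bag_of_words(sentence):
    """Binary bag-of-words vector over the training vocabulary."""
    tokens = set(preprocess(sentence))
    return np.array([1.0 if w in tokens else 0.0 for w in vocab])

X = np.stack([bag_of_words(text) for text, _ in samples])
y = np.array([intents.index(label) for _, label in samples])

# Small feedforward network: two hidden layers, softmax over intents.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(len(vocab),)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(len(intents), activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=100, verbose=0)

query = bag_of_words("play a song")[None, :]
print(intents[int(np.argmax(model.predict(query, verbose=0)))])
```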
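The reported WER and CER follow the standard definitions: the Levenshtein edit distance between hypothesis and reference, normalized by the reference length, computed over words for WER and over characters for CER. A self-contained sketch of both metrics:

```python
# Word Error Rate and Character Error Rate via Levenshtein distance.
def edit_distance(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions."""
    dp = list(range(len(hyp) + 1))          # distances for the empty reference
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (r != h))    # substitution (free on a match)
            prev = cur
    return dp[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("turn on the lights", "turn the light"))  # 0.5
print(cer("turn on the lights", "turn the light"))  # ~0.222
```

Libraries such as jiwer provide the same metrics off the shelf.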
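For the audio-to-text stage, the Google Web Speech API can be invoked from Python through the third-party SpeechRecognition package. This is one plausible setup, as the paper does not publish its client code, and sample.wav below is a hypothetical input file.

```python
# Transcribing a WAV file with the Google Web Speech API via the
# SpeechRecognition package (pip install SpeechRecognition).
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("sample.wav") as source:   # hypothetical input file
    audio = recognizer.record(source)        # read the whole file into memory

try:
    # recognize_google() posts the audio to the free Google Web Speech API.
    transcript = recognizer.recognize_google(audio, language="en-US")
    print("Transcript:", transcript)
except sr.UnknownValueError:
    print("Speech was unintelligible.")
except sr.RequestError as exc:
    print("API request failed:", exc)
```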

Cite This Paper

Mareeswari V., Vijayan Ramaraj, Pratistha Tulsyan, Suji R., "Developing Audio-to-Text Converters with Natural Language Processing for Smart Assistants", International Journal of Modern Education and Computer Science (IJMECS), Vol. 17, No. 4, pp. 70-81, 2025. DOI: 10.5815/ijmecs.2025.04.05
