Suji R.

Work place: School of Computer Science Engineering and Information Systems (SCORE), Vellore Institute of Technology (VIT), Vellore, 632014, India

E-mail: suji.r2021@vitstudent.ac.in

Website:

Research Interests: Natural Language Processing, Image Processing

Biography

Suji R. completed her Bachelor of Computer Science at SCORE, VIT, Vellore, India, where she performed well academically and developed several software projects. She is currently pursuing a Master of Computer Applications (MCA); her research interests focus on Natural Language Processing and Image Processing.

Author Articles
Developing Audio-to-Text Converters with Natural Language Processing for Smart Assistants

By Pratistha Tulsyan, Mareeswari V., Vijayan Ramaraj, Suji R.

DOI: https://doi.org/10.5815/ijmecs.2025.04.05, Pub. Date: 8 Aug. 2025

In recent years, smart assistants have transformed human interaction with technology, offering voice-controlled features such as music playback and information retrieval. However, existing systems often struggle to interpret natural language input accurately. To address this, the proposed work develops an audio-to-text converter integrated with natural language processing (NLP) capabilities to enhance smart assistant interactions. Additionally, the system incorporates intent recognition to discern user intentions and generate relevant responses. The work commenced with a literature survey to gather insights into existing smart assistant systems. Based on the findings, a comprehensive architecture was designed, integrating NLP techniques such as tokenization and lemmatization. The implementation phase involved developing and training a Feedforward Neural Network (FNN) model tailored for NLP tasks, leveraging Python and libraries such as TensorFlow and NLTK. Testing evaluated the system's performance using standard evaluation metrics, including Word Error Rate (WER) and Character Error Rate (CER), across various audio input conditions. The system exhibited higher WER and CER with accented speech (15.3% and 7.9%, respectively), while the clean audio dataset produced a WER of 4.7% and a CER of 2.55%. The work also monitored training loss and accuracy while training the FNN model to ensure model performance; ultimately, the model achieved an accuracy of 97.62% with training loss reduced to 1.45%. Insights from the training phase inform further optimization efforts to improve system performance. The system uses the Google Web Speech API and compares it against other speech-to-text models. In conclusion, the proposed work represents a significant step towards realizing seamless voice-controlled interactions with smart assistants, enhancing user experience and productivity.
Future work includes refining the system architecture, optimizing model performance, and expanding the capabilities of the smart assistant for broader application domains.
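As a rough illustration of the evaluation metrics reported in the abstract, WER and CER are conventionally computed from the Levenshtein edit distance between a reference transcript and the recognizer's hypothesis, at the word and character level respectively. The sketch below shows the standard formulation only; the sample sentences are invented for illustration and are not drawn from the paper's dataset.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; insertions,
    deletions, and substitutions each cost 1."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "turn on the lights"
hypothesis = "turn off the light"  # a typical recognizer slip
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

In this example two of the four reference words are wrong ("on" vs. "off", "lights" vs. "light"), giving a WER of 50%, while only a few characters differ, so the CER is much lower; this gap mirrors why the abstract reports both metrics.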

Other Articles