IJCNIS Vol. 17, No. 3, 8 Jun. 2025
BERT Model, Speech Recognition, Voice User Interface, ASR, Human-Computer Interaction, Intent Recognition, Multilingual Models, Neural Networks, Command Conversion, Dataset Quality, Natural Language Processing
This study of Voice User Interfaces (VUIs) focuses on their application in IT and linguistics. Our research examines the capabilities and limitations of small and multilingual BERT models in the context of speech recognition and command conversion. We evaluate the performance of these models through a series of experiments, including the use of confusion matrices to assess their effectiveness. The findings reveal that larger models such as multilingual BERT theoretically offer more advanced capabilities but often demand more substantial resources and well-balanced datasets. Conversely, smaller models, though less resource-intensive, may sometimes provide more practical solutions. Our study underscores the importance of dataset quality, model fine-tuning, and efficient resource management in optimising VUIs. Insights gained from this research highlight the potential of neural networks to enhance user interaction. Despite challenges in achieving a fully functional interface, the study makes valuable contributions to VUI development and sets the stage for future advances in integrating AI with linguistic technologies. The article describes the development of a voice user interface capable of recognising, analysing, and interpreting the Ukrainian language. For this purpose, several neural network architectures were used, including the Squeezeformer-CTC model and a modified w2v-bert-2.0-uk model, which decoded speech commands into text. The multilingual BERT model (mBERT) was also tested for intent classification. The developed system demonstrates the promise of combining BERT models with lightweight ASR architectures to create an effective Ukrainian-language voice interface. The accuracy indicators (F1 = 91.5%, WER = 12.7%) show high-quality recognition even in models with a small memory footprint.
The system is adaptable to resource-constrained conditions, particularly educational and home environments with a Ukrainian-speaking audience.
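The two metrics reported above can be illustrated with a minimal, self-contained sketch (not the authors' code; the example command strings are hypothetical). WER is the word-level edit distance between a reference transcript and the ASR hypothesis, divided by the reference length; F1 for an intent class combines precision and recall derived from confusion-matrix counts.

```python
# Illustrative sketch of the evaluation metrics named in the abstract
# (WER for speech recognition, F1 for intent classification).

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for one intent class from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Hypothetical command: the hypothesis drops one word of a 4-word reference.
print(wer("увімкни світло у вітальні", "увімкни світло вітальні"))  # 0.25
print(f1_from_counts(tp=90, fp=10, fn=8))
```

Averaging such per-class F1 scores over all intents gives the kind of aggregate figure quoted in the abstract.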
Victoria Vysotska, Zhengbing Hu, Nikita Mykytyn, Olena Nagachevska, Kateryna Hazdiuk, Dmytro Uhryn, "Development and Testing of Voice User Interfaces Based on BERT Models for Speech Recognition in Distance Learning and Smart Home Systems", International Journal of Computer Network and Information Security (IJCNIS), Vol.17, No.3, pp.109-143, 2025. DOI:10.5815/ijcnis.2025.03.07
[1] iPhone. [Online]. Available: https://support.apple.com/uk-ua/guide/iphone/iph83aad8922/ios
[2] K. Yasar and B. Botelho, "What is a virtual assistant." [Online]. Available: https://www.techtarget.com/searchcustomerexperience/definition/virtual-assistant-AI-assistant
[3] Northwest Software Inc Talent Acquisition Team, "Scope of Voice Assistants in Everyday Life," 2020. [Online]. Available: https://www.nwsi.com/images/articles/Scope%20of%20Voice%20Assistant.pdf
[4] J. Sachin et al., "Multimodal LLM Driven Computer Interface." [Online]. Available: https://www.ijraset.com/best-journal/multimodal-llm-driven-computer-interface
[5] L. Wang et al., "Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models," 2023. [Online]. Available: https://aclanthology.org/2023.acl-long.147.pdf
[6] A. Sartiukova et al., "Remote Voice Control of Computer Based on Convolutional Neural Network," International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, pp. 1058-1064, Sept. 2023.
[7] V. Trysnyuk et al., "A method for user authenticating to critical infrastructure objects based on voice message identification," Advanced Information Systems, vol. 4, no. 3, pp. 11-16, 2020.
[8] O. Bisikalo et al., "Precision automated phonetic analysis of speech signals for information technology of text-dependent authentication of a person by voice," CEUR Workshop Proceedings, vol. 2853, pp. 276-288, 2021.
[9] N. Kholodna et al., "A Machine Learning Model for Automatic Emotion Detection from Speech," CEUR Workshop Proceedings, vol. 2917, pp. 699-713, 2021.
[10] A. Dmytriv et al., "The Speech Parts Identification for Ukrainian Words Based on VESUM and Horokh Using," International Conference on Computer Sciences and Information Technologies, pp. 21-33, Sept. 2021.
[11] O. Tverdokhlib et al., "Information technology for identifying hate speech in online communication based on machine learning," Lecture Notes on Data Engineering and Communications Technologies, vol. 195, pp. 339-369, 2024.
[12] T. Kovaliuk, I. Yurchuk, and O. Gurnik, "Topological structure of Ukrainian tongue twisters based on speech sound analysis," CEUR Workshop Proceedings, vol. 3723, pp. 328-339, 2024.
[13] Z. Rybchak, O. Kulyna, and L. Kobyliukh, "An intelligent system for speech analysis and control using customized criteria," CEUR Workshop Proceedings, vol. 3723, pp. 412-426, 2024.
[14] O. Turuta et al., "Audio processing methods for speech emotion recognition using machine learning," CEUR Workshop Proceedings, vol. 3711, pp. 75-108, 2024.
[15] I. Krak et al., "Abusive Speech Detection Method for Ukrainian Language Used Recurrent Neural Network," CEUR Workshop Proceedings, vol. 3688, pp. 16-28, 2024.
[16] L. Kobylyukh, Z. Rybchak, and O. Basystiuk, "Analyzing the Accuracy of Speech-to-Text APIs in Transcribing the Ukrainian Language," CEUR Workshop Proceedings, vol. 3396, pp. 217-227, 2023.
[17] O. Romanovskyi et al., "Prototyping Methodology of End-to-End Speech Analytics Software," CEUR Workshop Proceedings, vol. 3312, pp. 76-86, 2022.
[18] A. Dmytriv et al., "Comparative Analysis of Using Different Parts of Speech in the Ukrainian Texts Based on Stylistic Approach," CEUR Workshop Proceedings, vol. 3171, pp. 546-560, 2022.
[19] K. Wołk et al., "Survey on dialogue systems including Slavic languages," Neurocomputing, vol. 477, pp. 62-84, 2022.
[20] M. Sazhok et al., "Punctuation Restoration for Ukrainian Broadcast Speech Recognition System based on Bidirectional Recurrent Neural Network and Word Embeddings," CEUR Workshop Proceedings, vol. 2870, pp. 300-310, 2021.
[21] Z. Haladzhun et al., "Hate Speech in Media Towards the Representatives of Roma Ethnic Community," CEUR Workshop Proceedings, vol. 2870, pp. 755-768, 2021.
[22] V. Lytvyn et al., "Peculiarities of Generation of Semantics of Natural Language Speech by Helping Unlimited and Context-Dependent Grammar," CEUR Workshop Proceedings, vol. 2604, pp. 536-551, 2020.
[23] N. Shakhovska, O. Basystiuk, and K. Shakhovska, "Development of the Speech-to-Text Chatbot Interface Based on Google API," CEUR Workshop Proceedings, vol. 2386, pp. 212-221, 2019.
[24] B. Rusyn and A. Chorniy, "Application of wawelet-transformation in to the system of speech recognition," International Conference on Modern Problems of Radio Engineering, Telecommunications and Computer Science, p. 345, 2008.
[25] V. Motyka et al., "System Project for Ukrainian-language Feedback Tonality Analysis in the Health Care Field Based on BERT Model," International Conference on Computer Sciences and Information Technologies, October 2023.
[26] N. Khairova et al., "Using BERT model to Identify Sentences Paraphrase in the News Corpus," CEUR Workshop Proceedings, vol. 3171, pp. 38-48, 2022.
[27] H. Livinska and O. Makarevych, "Feasibility of Improving BERT for Linguistic Prediction on Ukrainian Corpus," CEUR Workshop Proceedings, vol. 2604, pp. 552-561, 2020.