An Efficient Approach for Text-to-Speech Conversion Using Machine Learning and Image Processing Technique




Smt. Swaroopa Shastri 1, Shashank Vishwakarma 1,*

1. Department of CSE (MCA), Visvesvaraya Technological University, Centre for PG Studies, Kalaburagi, India

* Corresponding author.


Received: 11 Sep. 2022 / Revised: 16 Oct. 2022 / Accepted: 15 Nov. 2022 / Published: 8 Aug. 2023

Index Terms

Image processing, MSER, OCR, Geometrical properties, SWT, TTS Synthesizer


This study explores the conversion of English to Hindi, first to text and subsequently to speech. The first part of the implementation is text recognition from images, in which two approaches are used for character recognition: maximally stable extremal regions (MSER) and grayscale conversion. The second part deals with geometric filtering in combination with the stroke width transform (SWT). Subsequently, letters are grouped to detect text sequences, which are then segmented into words. Finally, a spell check with 96 percent accuracy is performed using naive Bayes and decision tree algorithms, followed by optical character recognition (OCR) to digitize the words. The recognized text is then passed to the text-to-speech (TTS) synthesizer, which converts it to Hindi speech using the text-to-speech model. The output is assessed on aspects such as speech speed, sound quality, pronunciation, and clarity.
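
As a rough illustration of the geometric filtering and letter-grouping steps named in the abstract, the following Python sketch operates on bounding boxes only. The `Box` type, the function names, and every threshold are assumptions made for illustration; the abstract does not specify them. MSER detection, SWT, spell checking, OCR, and TTS are not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one candidate character region (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int


def passes_geometric_filter(box: Box, min_aspect: float = 0.1,
                            max_aspect: float = 2.0, min_height: int = 8) -> bool:
    """Keep regions whose shape is plausible for a single character.

    The thresholds are illustrative guesses, not values from the paper.
    """
    if box.w <= 0 or box.h < min_height:
        return False
    aspect = box.w / box.h
    return min_aspect <= aspect <= max_aspect


def group_into_words(boxes: List[Box], gap_factor: float = 0.5) -> List[List[Box]]:
    """Group boxes lying on one text line into words.

    A new word starts when the horizontal gap to the previous box
    exceeds gap_factor times the previous box's height (an assumed rule).
    """
    words: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.x):
        if words:
            prev = words[-1][-1]
            gap = box.x - (prev.x + prev.w)
            if gap <= gap_factor * prev.h:
                words[-1].append(box)
                continue
        words.append([box])
    return words


candidates = [
    Box(0, 0, 10, 20), Box(12, 0, 10, 20), Box(24, 0, 10, 20),  # first word
    Box(60, 0, 10, 20), Box(72, 0, 10, 20),                     # second word
    Box(100, 0, 200, 20),                                       # too wide, rejected
    Box(40, 5, 3, 4),                                           # too small, rejected
]
letters = [b for b in candidates if passes_geometric_filter(b)]
words = group_into_words(letters)
print([len(w) for w in words])  # prints [3, 2]
```

In a full pipeline, the boxes would come from a region detector such as MSER, and each word group would be cropped and handed to the OCR stage before speech synthesis.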

Cite This Paper

Swaroopa Shastri, Shashank Vishwakarma, "An Efficient Approach for Text-to-Speech Conversion Using Machine Learning and Image Processing Technique", International Journal of Engineering and Manufacturing (IJEM), Vol.13, No.4, pp. 44-49, 2023. DOI:10.5815/ijem.2023.04.05

