Shilpi Goyal; Deepak Motwani

PARSeq-GeoAware: Explicit Geometric Modeling for Robust Scene Text Recognition in the Wild

Author(s)

Shilpi Goyal ¹,*, Deepak Motwani ¹

1. Department of Computer Science, Amity University, Gwalior, 474005, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2026.03.08

Received: 15 Jan. 2026 / Revised: 19 Feb. 2026 / Accepted: 18 Mar. 2026 / Published: 8 Jun. 2026

Index Terms

Scene Text Recognition, Vision Transformers, Geometric Feature Extraction, Adaptive Rectification, Irregular Text, Curved Text

Abstract

Scene text recognition in unconstrained environments remains challenging because of geometric distortions, including arbitrary orientations, curved baselines, and perspective deformations. Transformer-based methods achieve strong performance on regular benchmarks through implicit spatial learning but suffer accuracy drops of 8–12% on heavily curved text, where attention weights become diffuse and fail to capture explicit geometric structure. No prior work quantifies the isolated contribution of explicit geometric modeling within transformer architectures. To address this gap, we propose PARSeq-GeoAware, a dual-branch scene text recognition framework that integrates an Enhanced Geometric Feature Extractor (GFE), adaptive coarse-to-fine rectification (affine + thin-plate spline, TPS), and a cross-attention fusion module that combines explicit geometric representations with ViT-based visual features, decoded by a CTC head. Trained on 176,630 image-label pairs across three progressive stages and evaluated on six standard benchmarks, PARSeq-GeoAware achieves 89.87% on IIIT5K, 82.07% on SVT, 84.55% on ICDAR13, 68.90% on ICDAR15, 71.26% on ArT, and 81.27% on Total-Text. On irregular and curved text benchmarks, the primary target of this work, our ±1 character accuracy reaches 84.10% on ArT and 90.05% on Total-Text, exceeding PARSeq's published word accuracies of 79.3% and 87.1% by 4.8 pp and 2.95 pp, respectively, without a language model. Ablation studies confirm that disabling all geometric components reduces ArT word accuracy from 71.26% to 42.89% (−28.37 pp), establishing the GFE as the primary driver of irregular text performance. With the adaptive rectification module, full-pipeline inference takes 11.9 ± 1.4 ms on a Tesla T4, which is 6.5× faster than DAN (78 ms). A three-stage progressive training curriculum prevents catastrophic forgetting, retaining 89.87% regular accuracy after irregular specialization versus 80.6% with joint training (+14.8 pp).
These results demonstrate that explicit geometric modeling enables a single architecture to handle synthetic, regular, and irregular scene text without specialized language model post-processing. The code is available at https://github.com/Arni-123/PARSeq-GeoAware.
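
The abstract reports two kinds of scores: word accuracy and ±1 character accuracy. The sketch below illustrates how the two can differ on the same predictions. It assumes that "±1 character accuracy" counts a prediction as correct when its Levenshtein edit distance to the label is at most 1; the abstract does not spell out the definition, so this reading is an assumption. The function names and the sample strings are hypothetical and are not taken from the authors' repository.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accuracy_within(preds, targets, max_edits=0):
    """Percentage of predictions within max_edits edits of their label.

    max_edits=0 gives exact-match word accuracy; max_edits=1 gives the
    tolerant score under the assumed reading of "±1 character accuracy".
    """
    if len(preds) != len(targets) or not preds:
        raise ValueError("preds and targets must be non-empty and equally long")
    hits = sum(edit_distance(p, t) <= max_edits for p, t in zip(preds, targets))
    return 100.0 * hits / len(preds)


# Hypothetical recognizer outputs and ground-truth labels.
preds = ["coffee", "stret", "exit", "bakary", "opne"]
targets = ["coffee", "street", "exit", "bakery", "open"]

word_acc = accuracy_within(preds, targets, max_edits=0)
tolerant_acc = accuracy_within(preds, targets, max_edits=1)
print(f"word accuracy: {word_acc:.1f}%")          # 2 of 5 exact matches
print(f"±1 character accuracy: {tolerant_acc:.1f}%")  # 4 of 5 within one edit
```

Under this definition the tolerant score can never fall below the exact-match score on the same predictions, which is worth keeping in mind when the two kinds of figures are read side by side.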

Cite This Paper

Shilpi Goyal, Deepak Motwani, "PARSeq-GeoAware: Explicit Geometric Modeling for Robust Scene Text Recognition in the Wild", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 18, No. 3, pp. 151-166, 2026. DOI: 10.5815/ijigsp.2026.03.08

International Journal of Image, Graphics and Signal Processing (IJIGSP)