A ViT-based Model for Detecting Kidney Stones in Coronal CT Images


Author(s)

A. Cong Tran 1,2,*, Huynh Vo-Thuy 3

1. Can Tho University, Can Tho, Vietnam

2. CTU-AIMED Leading Research Team, Can Tho, Vietnam

3. Can Tho General Hospital, Can Tho, Vietnam

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2025.05.01

Received: 9 May 2025 / Revised: 7 Jul. 2025 / Accepted: 21 Aug. 2025 / Published: 8 Oct. 2025

Index Terms

Kidney Stone, Coronal CT, Vision Transformer, CSPDarknet, FPN-PANet

Abstract

Detecting kidney stones in coronal CT images remains challenging due to the small size of stones, anatomical complexity, and noise from surrounding structures. To address these challenges, we propose a deep learning architecture that augments a Vision Transformer (ViT) with a pre-processing module. This module integrates CSPDarknet for efficient feature extraction, a Feature Pyramid Network (FPN) and a Path Aggregation Network (PANet) for multi-scale context aggregation, and convolutional layers for spatial refinement. Together, these trained components filter out irrelevant background regions and highlight kidney-specific features before classification by the ViT, improving both accuracy and efficiency. This design leverages ViT’s global context modeling while mitigating its sensitivity to irrelevant regions and to limited data. The proposed model was evaluated on two coronal CT datasets (one public, one private) comprising 6,532 images under six experimental scenarios with varying training and testing conditions. It achieved 99.3% accuracy, 98.7% F1-score, and 99.4% mAP@0.5, exceeding both YOLOv10 and the baseline ViT. The model contains 61.2 million parameters and has a computational cost of 37.3 GFLOPs, striking a balance between ViT (86.0M, 17.6 GFLOPs) and YOLOv10 (22.4M, 92.0 GFLOPs). Despite having more parameters than YOLOv10, the model achieved a lower inference time: approximately 0.06 seconds per image on an NVIDIA RTX 3060 GPU. These findings suggest the potential of our approach as a foundation for clinical decision-support tools, pending further validation on heterogeneous and challenging clinical datasets, such as those containing small (<2 mm) or low-contrast stones.
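To make the described pipeline concrete, the PyTorch sketch below wires a simplified CSP-style backbone, an FPN top-down pass, a PANet bottom-up pass, and a convolutional refinement stage that suppresses background before a standard ViT classifier. All module names, channel widths, and the soft-masking formulation are illustrative assumptions for this sketch, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vit_b_16

class CSPBlock(nn.Module):
    """Simplified CSP-style block: downsample, split channels, transform one half, re-merge."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        half = c_out // 2
        self.part = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
            nn.Conv2d(half, half, 3, padding=1), nn.SiLU(),
        )
        self.merge = nn.Conv2d(c_out, c_out, 1)

    def forward(self, x):
        x = self.down(x)
        a, b = x.chunk(2, dim=1)          # cross-stage partial split
        return self.merge(torch.cat([a, self.part(b)], dim=1))

class PreProcessor(nn.Module):
    """CSP backbone -> FPN (top-down) -> PANet (bottom-up) -> conv refinement."""
    def __init__(self, c=64):
        super().__init__()
        self.s1 = CSPBlock(3, c)          # 1/2 resolution
        self.s2 = CSPBlock(c, 2 * c)      # 1/4 resolution
        self.s3 = CSPBlock(2 * c, 4 * c)  # 1/8 resolution
        self.lat1 = nn.Conv2d(c, c, 1)    # FPN lateral connections
        self.lat2 = nn.Conv2d(2 * c, c, 1)
        self.lat3 = nn.Conv2d(4 * c, c, 1)
        self.pan12 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # PAN downsampling
        self.pan23 = nn.Conv2d(c, c, 3, stride=2, padding=1)
        self.refine = nn.Sequential(      # spatial refinement to a 3-channel soft mask
            nn.Conv2d(c, c, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c, 3, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        f1 = self.s1(x); f2 = self.s2(f1); f3 = self.s3(f2)
        # FPN: top-down pathway with lateral fusion
        p3 = self.lat3(f3)
        p2 = self.lat2(f2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        p1 = self.lat1(f1) + F.interpolate(p2, scale_factor=2, mode="nearest")
        # PANet: bottom-up pathway re-injecting fine spatial detail
        n2 = p2 + self.pan12(p1)
        n3 = p3 + self.pan23(n2)
        # fuse back to the finest scale and emit a soft spatial mask
        fused = p1 + F.interpolate(n3, size=p1.shape[-2:], mode="nearest")
        mask = F.interpolate(self.refine(fused), size=x.shape[-2:],
                             mode="bilinear", align_corners=False)
        return x * mask                   # suppress background, keep kidney regions

class KidneyStoneViT(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.pre = PreProcessor()
        self.vit = vit_b_16(weights=None)  # standard ViT-B/16 classifier
        self.vit.heads = nn.Linear(self.vit.hidden_dim, num_classes)

    def forward(self, x):                  # x: (B, 3, 224, 224) coronal CT slices
        return self.vit(self.pre(x))

if __name__ == "__main__":
    model = KidneyStoneViT()
    logits = model(torch.randn(1, 3, 224, 224))
    print(logits.shape)                    # torch.Size([1, 2])

In this sketch the refinement stage outputs a sigmoid mask multiplied onto the input, so the ViT sees a background-suppressed image; this is one plausible reading of "filter irrelevant background regions before classification," and other fusion strategies would fit the abstract equally well.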

Cite This Paper

A. Cong Tran, Huynh Vo-Thuy, "A ViT-based Model for Detecting Kidney Stones in Coronal CT Images", International Journal of Information Technology and Computer Science (IJITCS), Vol.17, No.5, pp.1-11, 2025. DOI:10.5815/ijitcs.2025.05.01
