Shobhit Shrotriya; Nizar Banu P. K.; Avi Kulkarni; Vinod G. Kumar

Application of Large Language Models for Data-Driven Analytics in Oncology: Insights and Evidence Generation from Real-World Imaging Data

Pages: 38-59

Author(s)

Shobhit Shrotriya ¹,*, Nizar Banu P. K. ², Avi Kulkarni ³, Vinod G. Kumar ⁴

1. Department of Computer Science, CHRIST (Deemed to be University), Bangalore - 560029 and Accenture, India

2. Department of Computer Science, CHRIST (Deemed to be University), Bangalore Central Campus, Hosur Road, Near Dairy Circle, Bangalore – 560029, India

3. ThoughtSphere, 99 S Almaden Blvd, San Jose, CA 95113, United States

4. Accenture, Prestige Technopolis, 8/1 Dr M. H. Marigowda Road, Adugodi Rd., Bangalore-560029, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2025.06.03

Received: 5 Jan. 2025 / Revised: 24 Apr. 2025 / Accepted: 20 Jul. 2025 / Published: 8 Dec. 2025

Index Terms

Machine Learning, Artificial Intelligence, Generative AI, Large Language Models, Embeddings, Breast Cancer

Abstract

Breast cancer is one of the most common and serious types of cancer, and it can affect people of all ages and genders around the world. Its increasing incidence, coupled with its complexity, has placed a significant burden on healthcare systems and patients alike. Traditional diagnostic methods, while effective, often face limitations in early detection and accurate prognosis, both of which are critical for improving patient outcomes. In recent years, artificial intelligence (AI) and machine learning (ML) have been changing how problems are solved and decisions are made in medical diagnostics, enhancing the ability to detect, diagnose, and predict breast cancer. However, challenges remain, such as the need for large and diverse datasets to train these models, the integration of AI tools into hospital workflows, and ethical concerns in healthcare. This paper examines how AI and ML are used in breast cancer care, especially in analysing real-world medical data such as images, histopathology, and other sources such as doctors' notes and discharge summaries, to identify patterns that may go unnoticed by medical experts. Large Language Models (LLMs) that use embeddings are highlighted for their capacity to improve the accuracy of image-related interpretations, potentially detect early-stage tumours, and predict disease progression and treatment responses. Real-world medical datasets were collected and analysed using two models: a publicly available Convolutional Neural Network (CNN) and a custom-built LLM with embeddings. The Generative AI model achieved 98.44% accuracy, well above the traditional AI model's 61.72%. Future research can explore how Generative AI can help classify patients by risk level, which could lead to personalized treatment plans, reduce unnecessary treatments, and improve patients' quality of life.
Because this research focuses primarily on breast cancer, it attempts to show that harnessing AI and ML has the potential to significantly reduce the global burden of the disease, offering new avenues for early detection, accurate diagnosis, and tailored therapeutic strategies. Continued research and collaboration among oncologists, data scientists, and policymakers are essential to fully realize the benefits of AI in the fight against breast cancer, ultimately leading to better patient outcomes and a decrease in breast cancer-related mortality.
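
The abstract does not describe how the embedding-based model assigns a class to an image. One common pattern, assumed here purely for illustration and not confirmed as the authors' method, is to compare the embedding of a new case against embeddings of labelled reference cases and take the label of the most similar one under cosine similarity. The sketch below shows that idea and the accuracy metric used to report results. The three-dimensional vectors, the `REFERENCE_CASES` name, and the helper functions are all hypothetical; real embeddings would come from a model and have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence

# Hypothetical toy embeddings with labels (illustration only, not study data).
REFERENCE_CASES = [
    ([0.9, 0.1, 0.0], "benign"),
    ([0.8, 0.2, 0.1], "benign"),
    ([0.1, 0.9, 0.3], "malignant"),
    ([0.0, 0.8, 0.5], "malignant"),
]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def classify_by_nearest_embedding(query, labelled_cases) -> str:
    """Return the label of the reference embedding most similar to the query."""
    if not labelled_cases:
        raise ValueError("at least one labelled reference case is required")
    best_embedding, best_label = max(
        labelled_cases, key=lambda case: cosine_similarity(query, case[0])
    )
    return best_label


def accuracy(predictions: Sequence[str], truths: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)
```

Calling `classify_by_nearest_embedding([0.7, 0.3, 0.1], REFERENCE_CASES)` returns the label of the closest reference vector, and `accuracy` turns a list of such predictions into the kind of percentage reported above once multiplied by 100.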

Cite This Paper

Shobhit Shrotriya, Nizar Banu P. K., Avi Kulkarni, Vinod G. Kumar, "Application of Large Language Models for Data-Driven Analytics in Oncology: Insights and Evidence Generation from Real-World Imaging Data", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 17, No. 6, pp. 38-59, 2025. DOI: 10.5815/ijigsp.2025.06.03

International Journal of Image, Graphics and Signal Processing (IJIGSP)