An Efficient and Scalable Technique for Clustering Comorbidity Patterns of Diabetic Patients from Clinical Datasets



Bramesh S M 1,*, Anil Kumar K.M 2

1. P. E. S. College of Engineering, Mandya, 571401, Karnataka, India

2. JSS Science and Technology University, Mysuru, 570006, Karnataka, India

* Corresponding author.


Received: 24 May 2022 / Revised: 16 Jun. 2022 / Accepted: 19 Aug. 2022 / Published: 8 Dec. 2022

Index Terms

Diabetes, Comorbidity patterns, Topic modeling, Clustering, and ICD-9-CM codes.


Clustering diabetic patients by their comorbidity patterns is necessary both to learn relationships among diabetes patients’ clinical profiles and as an essential pre-processing stage for analysis tasks such as classification and categorization. However, the heterogeneity of these data makes traditional clustering methods difficult to apply, necessitating the development of novel clustering algorithms. In this paper, we propose an effective and scalable clustering technique suitable for datasets whose attributes are both atomic and set-valued. In these datasets, each record holds the diagnosis details of a diabetic patient for one hospital visit, represented using International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) codes. The proposed technique involves three main stages. In the first stage, we select the top-k diabetes-specific comorbidity patterns. In the second stage, we use topic modeling to verify whether the conditions in the selected top-k patterns genuinely co-occur. In the last stage, we construct high-quality clusters efficiently using average-linkage agglomerative clustering with cosine similarity. Using silhouette analysis, we assessed the efficiency and effectiveness of the proposed technique on two large, freely available MIMIC datasets (MIMIC-III and MIMIC-IV), comprising over 14,222 and 68,118 distinct records, respectively. Our findings show that the technique finds clusters that (i) preserve interrelations between demographics (age, gender) and diagnosis codes (ICD-9-CM codes), and (ii) are well separated and compact. Finally, the discovered clusters are beneficial for numerous investigative tasks such as query answering, visualization, anonymization, and classification.
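The last stage and the evaluation described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each record is simply a set of ICD-9-CM codes treated as a binary vector, skips the pattern selection and topic modeling stages entirely, and uses six made-up toy records. The function names (`cosine_distance`, `average_linkage`, `silhouette`) and the choice of k = 2 are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """1 - cosine similarity of two code sets viewed as binary vectors."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / sqrt(len(a) * len(b))


def average_linkage(records, k):
    """Agglomerative clustering: repeatedly merge the two clusters with the
    smallest mean pairwise cosine distance until only k clusters remain."""
    clusters = [[i] for i in range(len(records))]

    def linkage(c1, c2):
        total = sum(cosine_distance(records[i], records[j])
                    for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations() yields index pairs with a < b, so deleting b is safe.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(records)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette(records, labels):
    """Mean silhouette coefficient; singleton clusters score 0 by convention.
    Assumes at least two clusters."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the record's own cluster (compactness).
        a = sum(cosine_distance(records[i], records[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation).
        b = min(sum(cosine_distance(records[i], records[j]) for j in members)
                / len(members)
                for other, members in groups.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy records (illustrative only, not taken from MIMIC). Codes are written
# without the decimal point: 25000 diabetes, 4019 hypertension,
# 2724 hyperlipidemia, 41401 coronary atherosclerosis,
# 5859 chronic kidney disease, 4280 congestive heart failure.
records = [
    {"25000", "4019", "2724"},
    {"25000", "4019"},
    {"25000", "4019", "2724", "41401"},
    {"25000", "5859", "4280"},
    {"25000", "5859"},
    {"25000", "5859", "4280", "41401"},
]

labels = average_linkage(records, k=2)
score = silhouette(records, labels)
print(labels, round(score, 3))
```

On these toy records the hypertension/hyperlipidemia visits and the kidney/heart-failure visits end up in separate clusters, and a silhouette value close to 1 would indicate the compact, well-separated clusters the abstract refers to. The naive merge loop here is far too slow for datasets of the size reported above; it is meant only to make the linkage and evaluation criteria concrete.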

Cite This Paper

Bramesh S M, Anil Kumar K M, "An Efficient and Scalable Technique for Clustering Comorbidity Patterns of Diabetic Patients from Clinical Datasets", International Journal of Modern Education and Computer Science (IJMECS), Vol. 14, No. 6, pp. 35-52, 2022. DOI: 10.5815/ijmecs.2022.06.04

