Sample of Groups: A New Strategy to Find a Representative Point for Each Undisclosed Cluster

Full Text (PDF, 572KB), pp. 1-11



Wallace A. Pinheiro 1,*, Ana B. S. Pinheiro 2

1. Center for Systems Development, Brazilian Army, Brazil

2. University of Brasilia, Department of Tropical Medicine, Brazil

* Corresponding author.


Received: 14 May 2023 / Revised: 23 Jul. 2023 / Accepted: 28 Aug. 2023 / Published: 8 Oct. 2023

Index Terms

Clustering, Grouping, Similarity, Sampling, Grid


Abstract

Problems involving the selection of samples from undisclosed groups are relevant in various areas, such as health, statistics, economics, and computer science. For instance, when selecting a sample from a population, well-known strategies include simple random and stratified random selection. A related problem is selecting the initial points, corresponding to samples, for the K-means clustering algorithm. Many studies propose different strategies for choosing these samples; however, there is no consensus on the best or most effective approaches, even for specific datasets or domains. In this work, we present a new strategy called the Sample of Groups (SOG) algorithm, which combines concepts from grid, density, and maximum-distance clustering algorithms to identify representative points, or samples, located near the center of each cluster's mass. To achieve this, we partition the data into boxes of an appropriate size and select the representatives of the most relevant boxes. The main goal of this work is thus to find quality samples, or seeds, that represent different clusters. To compare our approach with other algorithms, we employ not only indirect measures related to K-means but also two direct measures that allow a fairer comparison among these strategies. The results indicate that our proposal outperforms the most commonly used algorithms.
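The abstract's recipe (partition the space into grid boxes, rank boxes by density, and pick well-separated representatives near cluster mass centers) can be illustrated with a short sketch. This is not the authors' published SOG implementation; the function name, the fixed `boxes_per_dim` grid resolution, and the greedy max-distance selection below are assumptions used only to make the general idea concrete.

```python
import numpy as np

def sog_representatives(points, k, boxes_per_dim=10):
    """Hedged sketch of a Sample-of-Groups-style selection:
    grid the space into boxes, rank boxes by point count (density),
    then greedily keep k dense boxes that are far apart and return
    the centroid of the points falling in each chosen box."""
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    # Assign every point a per-dimension box index (clipped to the last box).
    idx = np.minimum(((points - lo) / span * boxes_per_dim).astype(int),
                     boxes_per_dim - 1)
    boxes = {}
    for i, key in enumerate(map(tuple, idx)):
        boxes.setdefault(key, []).append(i)
    # Centroid of the points in each occupied box; rank boxes by density.
    cents = {key: points[rows].mean(axis=0) for key, rows in boxes.items()}
    ranked = sorted(boxes, key=lambda key: len(boxes[key]), reverse=True)
    # Greedy max-distance pass: start from the densest box, then keep
    # adding the box whose centroid is farthest from those already chosen.
    chosen = [ranked[0]]
    while len(chosen) < k and len(chosen) < len(ranked):
        best = max((key for key in ranked if key not in chosen),
                   key=lambda key: min(np.linalg.norm(cents[key] - cents[c])
                                       for c in chosen))
        chosen.append(best)
    return np.array([cents[key] for key in chosen])
```

On two well-separated blobs with `k=2`, such a sketch returns one representative near each blob's mean, which is the kind of seed the paper's direct measures evaluate.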

Cite This Paper

Wallace A. Pinheiro, Ana B. S. Pinheiro, "Sample of Groups: A New Strategy to Find a Representative Point for Each Undisclosed Cluster", International Journal of Information Technology and Computer Science (IJITCS), Vol. 15, No. 5, pp. 1-11, 2023. DOI: 10.5815/ijitcs.2023.05.01

