IJITCS Vol. 17, No. 4, 8 Aug. 2025
Pages: 80-119
Text Clustering, Public Opinion, HDBSCAN, K-means, Thematic Analysis, Text Vectorisation, Clustering, Neural Networks, Ukrainian Language, Natural Language Processing, Short Texts, Trend Analysis
The article presents an approach to analysing public opinion based on Ukrainian-language content from Telegram channels. It proposes a hybrid clustering approach that combines the HDBSCAN and K-means algorithms to analyse vectorised Ukrainian-language social media posts and detect public opinion trends. The methodology relies on a multilingual neural-network-based text vectorisation model, which represents the semantic content of posts effectively. Experiments on a corpus of 90 Ukrainian-language messages (collected between March and May 2025) identified six principal thematic clusters reflecting key areas of public discourse. Although the corpus is small (90 messages), the sample is structured and balanced by topic (news, vacancies, gaming), which makes it possible to test the proposed methodology under limited-data conditions. The approach is appropriate for analysing short texts in low-resource languages, where large-scale corpora are not available. A particular advantage is the use of semantic vector representations together with term co-occurrence graphs, which show a stable topological structure even with small amounts of data. This makes it possible to identify latent topic patterns and coherent clusters that have the potential to scale to broader corpora. The authors acknowledge the limitations associated with sample size, but emphasise the role of this study as a pilot stage in the development of a universal, linguistically adaptive method for analysing public discourse. In the future, the corpus is to be expanded to increase the representativeness and accuracy of the conclusions. The paper proposes a hybrid method for automatic thematic cluster analysis of short texts in social media, in particular Telegram.
Ukrainian-language messages are vectorised with the transformer model multilingual-e5-large-instruct, and clusters are detected with a combination of the HDBSCAN and K-means algorithms. More than 36,000 messages from three Telegram channels (news, games, vacancies) were analysed, and six main thematic clusters were identified. The hybrid clustering works in two stages: HDBSCAN is applied first to find dense clusters and to flag "noise" points, after which K-means is used to reassign the residual ("noise") embeddings to the nearest cluster centres.
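The second stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that first-stage labels are already available (for example from an HDBSCAN implementation that marks noise with -1), it works on toy 2-D points instead of E5 embeddings, and it performs only a single nearest-centre assignment step rather than a full K-means run. The function names `cluster_centres` and `reassign_noise` are hypothetical.

```python
import math
from collections import defaultdict

NOISE = -1  # assumed noise label, as commonly used by HDBSCAN implementations


def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_centres(vectors, labels):
    """Mean vector of every non-noise cluster."""
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        if lab != NOISE:
            groups[lab].append(vec)
    return {
        lab: [sum(col) / len(members) for col in zip(*members)]
        for lab, members in groups.items()
    }


def reassign_noise(vectors, labels):
    """Attach each noise point to the cluster whose centre is nearest."""
    centres = cluster_centres(vectors, labels)
    if not centres:  # no dense cluster found: leave the labels unchanged
        return list(labels)
    result = []
    for vec, lab in zip(vectors, labels):
        if lab == NOISE:
            lab = min(centres, key=lambda c: euclid(vec, centres[c]))
        result.append(lab)
    return result


# Toy 2-D "embeddings"; the first-stage labels are made up for illustration.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.5, 0.4), (4.0, 4.5)]
stage_one = [0, 0, 1, 1, NOISE, NOISE]
final_labels = reassign_noise(points, stage_one)
print(final_labels)
```

After this step no point carries the noise label, so every message belongs to one of the dense clusters found in the first stage.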
This two-tier strategy combines the flexible detection of arbitrarily shaped clusters offered by HDBSCAN with the stable assignment of less pronounced groups offered by K-means. It is especially effective for the fragmented short texts typical of social networks. Clustering quality was validated with visualisation tools (PCA, t-SNE, word clouds) and with quantitative metrics: a Silhouette Score of 0.41 and a Davies-Bouldin index of 0.78, which indicate moderate cluster cohesion and separation. The high level of "noise" in HDBSCAN (34.2%) was analysed separately; it may be due to the specifics of short texts, model parameters, or the stylistic fragmentation of Telegram messages. The results show that combining modern vectorisation models with flexible clustering methods is effective for identifying structured topics in fragmented Ukrainian-language social media content. The proposed approach can be extended to other sources, types of discourse, and tasks of digital sociology. Processing 90 messages from three different channels (news, gaming content, and vacancies) yielded six main thematic clusters. The largest shares belong to clusters related to employment (28.2%) and security-patriotic topics (24.7%). The average level of "noise" after the initial HDBSCAN clustering was 34.2%. Additional analysis showed that post lengths varied significantly, from short announcements (10 words on average) to analytical texts (over 140 words). Visualisations (timelines, PCA, t-SNE, word clouds, term co-occurrence graphs) confirm the thematic coherence of the clusters and reveal changes in thematic priorities over time. The proposed system is an effective tool for detecting information trends in short, fragmented texts and can be used to monitor public sentiment in low-resource languages.
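For readers unfamiliar with the two quantitative metrics, the following sketch computes both from their standard definitions. It is an illustration on toy 2-D data, not the evaluation code behind the reported values; Euclidean distance is an assumption here, and the helper names are hypothetical. Both functions require at least two clusters. A Silhouette Score closer to 1 and a Davies-Bouldin index closer to 0 indicate better-separated clusters.

```python
import math
from collections import defaultdict


def euclid(p, q):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def group_by_label(vectors, labels):
    groups = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    return groups


def silhouette_score(vectors, labels):
    """Mean over all points of (b - a) / max(a, b)."""
    groups = group_by_label(vectors, labels)
    scores = []
    for vec, lab in zip(vectors, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclid(vec, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclid(vec, other) for other in members) / len(members)
            for other_lab, members in groups.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(vectors, labels):
    """Mean over clusters of the worst (S_i + S_j) / d(c_i, c_j) ratio."""
    groups = group_by_label(vectors, labels)
    centres = {
        lab: [sum(col) / len(m) for col in zip(*m)] for lab, m in groups.items()
    }
    # S_i: mean distance of a cluster's members to its own centre
    scatter = {
        lab: sum(euclid(v, centres[lab]) for v in m) / len(m)
        for lab, m in groups.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / euclid(centres[i], centres[j])
            for j in groups
            if j != i
        )
        for i in groups
    ]
    return sum(worst) / len(worst)


# Toy data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
sil = silhouette_score(points, labels)
dbi = davies_bouldin_index(points, labels)
print(round(sil, 3), round(dbi, 3))
```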
Roman Lynnyk, Victoria Vysotska, Zhengbing Hu, Dmytro Uhryn, Liliia Diachenko, Kyrylo Smelyakov, "Information Technology for Modelling Social Trends in Telegram Using E5 Vectors and Hybrid Cluster Analysis", International Journal of Information Technology and Computer Science (IJITCS), Vol. 17, No. 4, pp. 80-119, 2025. DOI: 10.5815/ijitcs.2025.04.07