Data Optimization through Compression Methods Using Information Technology

PDF (1056 KB), pp. 84-99


Author(s)

Igor V. Malyk 1,*, Yevhen Kyrychenko 1, Mykola Gorbatenko 2, Taras Lukashiv 3

1. Department of Mathematical Problems of Control and Cybernetics, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, 58000, Ukraine

2. Department of Mathematical Modeling, Yuriy Fedkovych Chernivtsi National University, Chernivtsi, 58000, Ukraine

3. Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Belvaux, L-4370, Luxembourg

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2025.05.07

Received: 25 May 2025 / Revised: 20 Jul. 2025 / Accepted: 26 Aug. 2025 / Published: 8 Oct. 2025

Index Terms

Information Technology, Data Similarity, Compressed Copy of Tabular Data, Compact Data Representation

Abstract

Efficient comparison of heterogeneous tabular datasets is difficult when sources are unknown or weakly documented. We address this problem by introducing a unified, type-aware framework that builds compact data representations (CDRs)—concise summaries sufficient for downstream analysis—and a corresponding similarity graph (and tree) over a data corpus. Our novelty is threefold: (i) a principled vocabulary and procedure for constructing CDRs per variable type (factor, time, numeric, string), (ii) a weighted, type-specific similarity metric we call Data Information Structural Similarity (DISS) that aggregates distances across heterogeneous summaries, and (iii) an end-to-end, cloud-scalable realization that supports large corpora. Methodologically, factor variables are summarized by frequency tables; time variables by fixed-bin histograms; numeric variables by moment vectors (up to the fourth order); and string variables by TF–IDF vectors. Pairwise similarities use Hellinger, Wasserstein (p=1), total variation, and L1/L2 distances, with MAE/MAPE for numeric summaries; the DISS score combines these via learned or user-set weights to form an adjacency graph whose minimum-spanning tree yields a similarity tree. In experiments on multi-source CSVs, the approach enables accurate retrieval of closest datasets and robust corpus-level structuring while reducing storage and I/O. This contributes a reproducible pathway from raw tables to a similarity tree, clarifying terminology and providing algorithms that practitioners can deploy at scale.
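The pipeline sketched in the abstract—per-type summaries combined into a weighted DISS score—can be illustrated with a minimal Python sketch. This is not the authors' implementation: it covers only factor and numeric variables (frequency tables compared via the Hellinger distance, fourth-order moment vectors compared via MAE), and the function names, the CDR dictionary layout, and the weight keys are hypothetical choices made here for illustration.

```python
import math
from collections import Counter

def factor_summary(values):
    """CDR for a factor variable: a table of relative frequencies."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}

def numeric_summary(values):
    """CDR for a numeric variable: moments up to the fourth order
    (mean, variance, skewness, excess-free kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    central = lambda p: sum((x - mean) ** p for x in values) / n
    var = central(2)
    sd = math.sqrt(var) or 1.0  # guard against a constant column
    return [mean, var, central(3) / sd ** 3, central(4) / sd ** 4]

def hellinger(p, q):
    """Hellinger distance between two frequency tables; lies in [0, 1]."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2
            for k in keys)
    return math.sqrt(s / 2.0)

def mae(a, b):
    """Mean absolute error between two moment vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def diss(cdr_a, cdr_b, weights):
    """DISS-style score: a weighted sum of type-specific distances.
    Weights may be user-set or learned; here they are passed in directly."""
    d_factor = hellinger(cdr_a["factor"], cdr_b["factor"])
    d_numeric = mae(cdr_a["numeric"], cdr_b["numeric"])
    return weights["factor"] * d_factor + weights["numeric"] * d_numeric
```

In the full framework, `diss` would be evaluated for every pair of datasets in the corpus to populate an adjacency matrix, and a minimum-spanning tree over that matrix (e.g. via Prim's or Kruskal's algorithm) would give the similarity tree; time and string variables would contribute histogram and TF–IDF distances under their own weights.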

Cite This Paper

Igor V. Malyk, Yevhen Kyrychenko, Mykola Gorbatenko, Taras Lukashiv, "Data Optimization through Compression Methods Using Information Technology", International Journal of Information Technology and Computer Science(IJITCS), Vol.17, No.5, pp.84-99, 2025. DOI:10.5815/ijitcs.2025.05.07
