Data Cleaning In Data Warehouse: A Survey of Data Pre-processing Techniques and Tools

Full Text (PDF, 601KB), PP.50-61


Anosh Fatima 1,* Nosheen Nazir 1 Muhammad Gufran Khan 1

1. National University of Computer and Emerging Sciences, Faisalabad, 38000, Pakistan

* Corresponding author.


Received: 15 Apr. 2016 / Revised: 20 Aug. 2016 / Accepted: 19 Nov. 2016 / Published: 8 Mar. 2017

Index Terms

Data Cleaning, Data Warehouse, Data Pre-processing, Missing Values, Materialized Views, Evaluation Attributes in DWH, Data Mining Algorithms


A data warehouse is a computer system designed for storing and analyzing an organization's historical data, which originates in the day-to-day operations of its Online Transaction Processing (OLTP) systems. Usually, an organization summarizes and copies information from its operational systems to the data warehouse on a regular schedule, and management runs complex queries and analyses on that information without slowing down the operational systems. Data must be pre-processed to improve its quality before it is stored in the data warehouse. This survey paper presents data cleaning problems and the approaches currently in use for pre-processing. The main goal of this paper is to determine which pre-processing technique works best in which scenario to improve the performance of a data warehouse. Many data cleansing techniques have been analyzed using certain evaluation attributes and tested on different kinds of data sets. Data quality tools such as YALE, ALTERYX, and WEKA have been used to obtain conclusive results, to prepare the data for the data warehouse, and to ensure that only cleaned data populates the warehouse, thus enhancing its usability. The results of this paper can be useful in many future activities such as cleansing, standardization, correction, matching, and transformation. This research can also help in data auditing and in detecting patterns in the data.

Cite This Paper

Anosh Fatima, Nosheen Nazir, Muhammad Gufran Khan, "Data Cleaning In Data Warehouse: A Survey of Data Pre-processing Techniques and Tools", International Journal of Information Technology and Computer Science (IJITCS), Vol.9, No.3, pp.50-61, 2017. DOI:10.5815/ijitcs.2017.03.06

