A Novel Approach for Data Cleaning by Selecting the Optimal Data to Fill the Missing Values for Maintaining Reliable Data Warehouse

Full Text (PDF, 413KB), PP.64-70



Raju Dara 1,*, Ch. Satyanarayana 1, A. Govardhan 1

1. Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Kakinada, Andhra Pradesh, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2016.05.08

Received: 24 Jan. 2016 / Revised: 2 Mar. 2016 / Accepted: 1 Apr. 2016 / Published: 8 May 2016

Index Terms

Apriori similarity function, Classification, Data Cleaning, Jaccard Dissimilarity function


Trillions of bytes of data are now generated by enterprises, particularly on the web. Business managers and executives have long wanted well-organized, interactive access to that data so that they can make the best decisions for the business, and the data warehouse is the main practical solution for delivering it. Sound future decision-making depends on the availability of correct information, which in turn depends on the quality of the underlying data. Because data gathered from diverse sources is dirty, quality data can only be produced by cleaning it before it is loaded into the data warehouse. Once the data has been pre-processed and cleansed, data mining queries applied to it produce accurate results. In many cases, however, the data is sparse, and accurate results are hard to obtain from sparse data. The main goal of this paper is to fill the missing values in acquired data that is sparse. Care must be taken to select the minimum number of text pieces needed to fill the gaps; for this purpose we use the Jaccard dissimilarity function to cluster the frequently occurring data.
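As a rough illustration of the measure named above, the Jaccard dissimilarity between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. The sketch below computes it and uses it to choose a complete record from which to fill a gap. The abstract only states that the function is used for clustering, so the `fill_from_nearest` helper, the record layout, and the sample values are assumptions made for illustration and are not the authors' procedure.

```python
def jaccard_dissimilarity(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def fill_from_nearest(record, candidates):
    """Fill None fields of `record` from the least dissimilar candidate.

    Hypothetical helper: only the known (non-None) field/value pairs
    of the incomplete record take part in the comparison.
    """
    known = {(k, v) for k, v in record.items() if v is not None}
    best = min(
        candidates,
        key=lambda c: jaccard_dissimilarity(known, set(c.items())),
    )
    return {k: (v if v is not None else best.get(k)) for k, v in record.items()}


# Illustrative data only.
incomplete = {"city": "Kakinada", "state": None, "country": "India"}
complete_rows = [
    {"city": "Kakinada", "state": "Andhra Pradesh", "country": "India"},
    {"city": "Pune", "state": "Maharashtra", "country": "India"},
]
filled = fill_from_nearest(incomplete, complete_rows)
```

Here the first candidate shares two of three field/value pairs with the incomplete record, so it is the closest match and supplies the missing `state` value.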

Cite This Paper

Raju Dara, Ch. Satyanarayana, A. Govardhan, "A Novel Approach for Data Cleaning by Selecting the Optimal Data to Fill the Missing Values for Maintaining Reliable Data Warehouse", International Journal of Modern Education and Computer Science (IJMECS), Vol. 8, No. 5, pp. 64-70, 2016. DOI: 10.5815/ijmecs.2016.05.08

