An Analytical Assessment on Document Clustering




Pushplata 1,*, Ram Chatterjee 1

1. Maharishi Dayanand University, Rohtak, Manav Rachna College of Engineering, Faridabad, India

* Corresponding author.


Received: 4 Sep. 2011 / Revised: 11 Jan. 2012 / Accepted: 2 Mar. 2012 / Published: 8 Jun. 2012

Index Terms

Data mining, Document clustering, Suffix Tree Clustering (STC) steps, K-means, Agglomerative Hierarchical Clustering (AHC), cosine similarity


Clustering is a data mining technique used in information retrieval: when documents are clustered, relevant information can be retrieved more quickly. Clustering organizes documents into groups so that each group contains documents with similar content. Document clustering is an unsupervised data mining approach. Several algorithms are used to cluster documents, including partitional clustering (K-means) and hierarchical clustering (Agglomerative Hierarchical Clustering, AHC). This paper analyzes the Suffix Tree Clustering (STC) algorithm alongside the other clustering techniques (K-means, AHC) covered in the literature survey, and compares these algorithms. It also covers the traditional Vector Space Model (VSM) and the similarity measures built on it that are used for clustering documents. According to the papers studied in the literature survey, STC improves search performance compared with the other clustering algorithms. The paper presents the STC algorithm applied to search result documents stored in a dataset. It also articulates the key requirements for web document clustering, with clusters created from the full text of the web documents. STC forms clusters based on the phrases shared between documents, and it is a faster algorithm for document clustering.
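The two ideas named in the abstract, cosine similarity under the Vector Space Model and grouping documents by shared phrases, can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes raw term-frequency vectors with whitespace tokenization, and it replaces the suffix tree with a plain word n-gram index, which only approximates the base-cluster step of STC. The function names (tf_vector, cosine_similarity, shared_phrase_groups) and the sample documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def shared_phrase_groups(docs, n=2):
    """Map each word n-gram to the indices of the documents containing it,
    keeping only phrases shared by at least two documents.

    Assumption: a flat n-gram index stands in for the suffix tree here.
    """
    groups = defaultdict(set)
    for i, text in enumerate(docs):
        words = text.lower().split()
        for j in range(len(words) - n + 1):
            groups[" ".join(words[j:j + n])].add(i)
    return {phrase: ids for phrase, ids in groups.items() if len(ids) >= 2}


# Hypothetical sample documents, for illustration only.
docs = [
    "cat ate cheese",
    "mouse ate cheese too",
    "cat ate mouse too",
]

vectors = [tf_vector(d) for d in docs]
print(cosine_similarity(vectors[0], vectors[1]))
print(shared_phrase_groups(docs))
```

In this sketch the phrase groups play the role of candidate clusters: documents 0 and 2 share "cat ate", and documents 0 and 1 share "ate cheese". A full STC implementation would additionally score and merge such groups, which is omitted here.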

Cite This Paper

Pushplata, Ram Chatterjee, "An Analytical Assessment on Document Clustering", International Journal of Computer Network and Information Security (IJCNIS), vol. 4, no. 5, pp. 63-71, 2012. DOI: 10.5815/ijcnis.2012.05.08

