Rahim Khan; Yurong Qian; Sajid Naeem

Extractive based Text Summarization Using K-Means and TF-IDF

Full Text (PDF, 1048KB), PP.33-44


Author(s)

Rahim Khan ¹,*, Yurong Qian ¹, Sajid Naeem ¹

1. School of Software, Xinjiang University, Urumqi 830008, China

* Corresponding author.

DOI: https://doi.org/10.5815/ijieeb.2019.03.05

Received: 15 Mar. 2019 / Revised: 5 Apr. 2019 / Accepted: 30 Apr. 2019 / Published: 8 May 2019

Index Terms

Summarization, Extractive Summary, TF-IDF, Clustering, K-Means

Abstract

The quantity of information on the internet is increasing rapidly, and large volumes of openly accessible data in many formats are becoming more widespread. It is now challenging for a user to extract information efficiently. As one method of tackling this challenge, text summarization reduces redundant information and retrieves the useful and relevant information from a text document to form a shorter, compressed version that is easy to understand and saves time while still reflecting the main idea of the document. Automatic text summarization has attracted keen interest within the text mining and NLP (Natural Language Processing) communities because manually summarizing a text document is laborious. There are two main types of text summarization: extractive and abstractive. This paper focuses on extractive summarization using K-Means clustering with TF-IDF (Term Frequency-Inverse Document Frequency). The paper also considers the idea of the true K and uses that value of K to divide the sentences of the input document into clusters from which the final summary is presented. Furthermore, we combine K-Means and TF-IDF with the selection of the K value to produce the system summary, which shows comparatively better results.
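
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tfidf_vectors`, `kmeans`, `summarize`) are hypothetical, the IDF form `log(n / df) + 1` and the deterministic centroid initialization are assumptions, and picking the sentence closest to each centroid is one plausible selection rule that the abstract does not specify. The value of K is simply passed in; how the true K is estimated is not shown here.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_vectors(sentences):
    """Return one TF-IDF vector per sentence (each sentence is a 'document')."""
    token_lists = [tokenize(s) for s in sentences]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n = len(sentences)
    # Document frequency: number of sentences that contain each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Assumed IDF variant: log(n / df) + 1, so shared terms keep a small weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; centroids start at the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: move each non-empty cluster's centroid to its mean.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def summarize(sentences, k):
    """Pick the sentence nearest each cluster centroid, in document order."""
    k = max(1, min(k, len(sentences)))
    vectors = tfidf_vectors(sentences)
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        idxs = [i for i, lab in enumerate(labels) if lab == c]
        if idxs:
            chosen.append(min(idxs, key=lambda i: sq_dist(vectors[i], centroids[c])))
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    doc = [
        "Cats chase mice.",
        "Cats chase mice at night.",
        "Stock markets fell sharply.",
        "Stock markets fell sharply today.",
    ]
    for line in summarize(doc, k=2):
        print(line)
```

The summary contains at most K sentences, one per non-empty cluster, so K directly controls the summary length; this is why the choice of K matters for the approach.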

Cite This Paper

Rahim Khan, Yurong Qian, Sajid Naeem, "Extractive based Text Summarization Using K-Means and TF-IDF", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol.11, No.3, pp. 33-44, 2019. DOI: 10.5815/ijieeb.2019.03.05


International Journal of Information Engineering and Electronic Business (IJIEEB)