Anurag Sarkar; Debabrata Datta

A Frequency Based Approach to Multi-Class Text Classification

Full Text (PDF, 304KB), PP.15-22


Author(s)

Anurag Sarkar ¹,*, Debabrata Datta ²

1. Northeastern University, Boston, United States

2. St. Xavier’s College (Autonomous), Kolkata, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2017.05.03

Received: 17 Jun. 2016 / Revised: 3 Oct. 2016 / Accepted: 16 Jan. 2017 / Published: 8 May 2017

Index Terms

Supervised learning, Multi-class classification, Text classification, Text mining, Text categorization, tf-idf

Abstract

Text classification is the task of managing and processing the information in a collection of text data by assigning it to predefined classes. It plays a vital role in information processing and information retrieval. Various approaches to text classification, particularly ones based on machine learning algorithms, have been proposed and discussed in the literature. This paper presents a classification approach that uses the frequencies of certain important text parameters to assign a given text to one of multiple categories. Using a newly defined parameter called wf-icf, the approach significantly improves on the classification accuracy obtained in a previous work.
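To make the idea of frequency-based multi-class classification concrete, the following is a minimal sketch and not the paper's method. The wf-icf parameter is not defined on this page, so the sketch substitutes plain tf-idf (one of the index terms) as a stand-in weighting. The function names `train` and `classify`, the scoring rule, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Build per-class term counts and idf weights from (text, label) pairs."""
    class_counts = defaultdict(Counter)  # label -> term -> count
    doc_freq = Counter()                 # term -> number of documents containing it
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_counts[label].update(tokens)
        doc_freq.update(set(tokens))
    n_docs = len(labeled_docs)
    # Standard idf: log(N / df); a term found in every document gets weight 0.
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return class_counts, idf


def classify(text, class_counts, idf):
    """Return the label with the largest tf-idf-weighted overlap with the text."""
    tokens = text.lower().split()
    scores = {}
    for label, counts in class_counts.items():
        total = sum(counts.values())
        # Score = sum over input tokens of (relative frequency in class) * idf.
        scores[label] = sum((counts[t] / total) * idf.get(t, 0.0) for t in tokens)
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the paper.
TRAINING = [
    ("the team won the match", "sports"),
    ("the player scored a goal", "sports"),
    ("the market fell as stocks dropped", "finance"),
    ("the bank raised interest rates", "finance"),
]

class_counts, idf = train(TRAINING)
predicted = classify("stocks and interest rates", class_counts, idf)
```

In this sketch a word that occurs in every training document, such as "the", receives zero weight, so only words that discriminate between categories contribute to the score.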

Cite This Paper

Anurag Sarkar, Debabrata Datta, "A Frequency Based Approach to Multi-Class Text Classification", International Journal of Information Technology and Computer Science(IJITCS), Vol.9, No.5, pp.15-22, 2017. DOI:10.5815/ijitcs.2017.05.03


International Journal of Information Technology and Computer Science (IJITCS)