An Improved Information Retrieval Approach to Short Text Classification

Full Text (PDF, 664KB), PP.31-37



Indrajit Mukherjee 1,*, Sudip Sahana 1, P.K. Mahanti 2

1. Department of Computer Science & Engg., Birla Institute of Technology, Mesra, India

2. Department of Computer Science, University of New Brunswick, Saint John, Canada

* Corresponding author.


Received: 19 Mar. 2017 / Revised: 1 Apr. 2017 / Accepted: 26 May 2017 / Published: 8 Jul. 2017

Index Terms

Twitter, topic modeling, Word-Sense Disambiguation


Twitter acts as an important medium of communication and information sharing. Because tweets are limited to 140 characters, they do not provide sufficient word occurrences, so classification methods that use traditional approaches such as “Bag-of-Words” have limitations. The proposed system uses an intuitive approach to determine class labels from a set of features. The system is able to classify incoming tweets into three generic categories: News, Movies, and Sports. These categories are diverse and cover most of the topics that people usually tweet about. Experimental results show that the proposed technique outperforms the existing models in terms of accuracy.

Cite This Paper

Indrajit Mukherjee, Sudip Sahana, P.K. Mahanti, "An Improved Information Retrieval Approach to Short Text Classification", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol. 9, No. 4, pp. 31-37, 2017. DOI: 10.5815/ijieeb.2017.04.05

