A Supervised Approach for Automatic Web Documents Topic Extraction Using Well-Known Web Design Features




Kazem Taghandiki 1,* Ahmad Zaeri 1 Amirreza Shirani 1

1. Department of Computer Engineering, Faculty of Computer Engineering, University of Isfahan, Isfahan, Iran

* Corresponding author.

DOI: https://doi.org/10.5815/ijmecs.2016.11.03

Received: 9 Aug. 2016 / Revised: 12 Sep. 2016 / Accepted: 23 Oct. 2016 / Published: 8 Nov. 2016

Index Terms

Topic extraction, web document, supervised active learning, topic modeling


The aim of this paper is to propose an efficient method for identifying the topics of web documents, which is widely regarded as a challenging problem in information retrieval systems. Most previous work analyzes the entire text using time-consuming methods, and much of it relies on unsupervised approaches to identify the main topic of a document. In contrast, this paper exploits the most widely used Hyper-Text Markup Language (HTML) features to extract topics from web documents using a supervised approach.
Using an interactive crawler, we first analyze the HTML structure of 5000 webpages to identify the most widely used HTML features. In the next step, the selected features are extracted from 1500 webpages using the same crawler.
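The feature-extraction step can be illustrated with a minimal sketch that uses only the Python standard library. The abstract does not list the selected HTML features, so the tag set below (`title`, `h1`, `h2`, `a`, and the `keywords` and `description` meta tags) is an assumption chosen for illustration, and the names `FeatureExtractor` and `extract_features` are hypothetical rather than taken from the paper's crawler.

```python
from html.parser import HTMLParser

# Assumed feature set: the abstract does not name the selected tags.
TEXT_TAGS = {"title", "h1", "h2", "a"}
META_NAMES = {"keywords", "description"}


class FeatureExtractor(HTMLParser):
    """Collect the text found in a few commonly used HTML elements."""

    def __init__(self):
        super().__init__()
        self.features = {}   # feature name -> list of text fragments
        self._open = []      # stack of currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._open.append(tag)
        elif tag == "meta":
            attr = dict(attrs)
            name = (attr.get("name") or "").lower()
            content = attr.get("content")
            if name in META_NAMES and content:
                self.features.setdefault("meta_" + name, []).append(content.strip())

    def handle_endtag(self, tag):
        if tag in self._open:
            # Close the innermost matching tag and anything nested inside it.
            idx = len(self._open) - 1 - self._open[::-1].index(tag)
            del self._open[idx:]

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            self.features.setdefault(self._open[-1], []).append(text)


def extract_features(html):
    """Return a dict mapping each feature name to the text it contained."""
    parser = FeatureExtractor()
    parser.feed(html)
    parser.close()
    return parser.features
```

Text outside the chosen elements (for example, body paragraphs) is ignored, which reflects the idea of avoiding full-text analysis; a real crawler would also need to fetch pages and handle malformed markup.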
Users assign suitable topics to each web document in a supervised learning process. A topic modeling technique is applied to the extracted features to build four classifiers (C4.5, Decision Tree, Naïve Bayes, and Maximum Entropy), which are trained and tested separately on our data. The classifiers' results are compared and the most accurate classifier is selected. To examine our approach at a larger scale, a new set of 3500 web documents is evaluated using the selected classifier. Results show that the proposed system achieves a recognition rate of 71.8%.
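As a rough illustration of the training and evaluation step, the sketch below implements one of the four classifier families named above, a multinomial Naïve Bayes with add-one smoothing, and computes accuracy as the fraction of correctly labeled documents. It is a simplified stand-in and not the authors' implementation; the class `NaiveBayes`, the helpers `tokenize` and `accuracy`, and the whitespace tokenization are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenization: lowercase and split on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(doc):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted topic matches the given label."""
    correct = sum(model.predict(d) == y for d, y in zip(docs, labels))
    return correct / len(labels)
```

Comparing classifiers then amounts to computing `accuracy` for each trained model on the same held-out documents and keeping the one with the highest score.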

Cite This Paper

Kazem Taghandiki, Ahmad Zaeri, Amirreza Shirani, "A Supervised Approach for Automatic Web Documents Topic Extraction Using Well-Known Web Design Features", International Journal of Modern Education and Computer Science (IJMECS), Vol.8, No.11, pp.20-27, 2016. DOI: 10.5815/ijmecs.2016.11.03

