Web Data Extraction from Scientific Publishers’ Website Using Heuristic Algorithm

pp. 31-39


Umamageswari Kumaresan 1,* Kalpana Ramanujam 2

1. New Prince Shri Bhavani College of Engineering & Technology, Chennai, 600073, India

2. Pondicherry Engineering College, Pillaichavady, Puducherry, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijisa.2017.10.04

Received: 4 Mar. 2017 / Revised: 11 May 2017 / Accepted: 5 Jun. 2017 / Published: 8 Oct. 2017

Index Terms

Structured data, Information Extraction, Deep web, wrapper, DOM Tree, template, JQuery, XPATH


The World Wide Web is a vast repository of information, and the amount of information available on it is growing exponentially. End users rely on search engines such as Google, Yahoo, and Bing to retrieve information. Search engines use web crawlers, or spiders, which traverse sequences of web pages to locate relevant pages and return a set of links ordered by relevance. These indexed pages form the surface web. Retrieving data from the deep web requires form submission, which search engines do not perform. Data analytics and data mining applications depend on data from deep web pages, but automatic extraction of that data is cumbersome because the structure of web pages is so diverse. This work proposes a heuristic algorithm for automatic navigation and information extraction starting from a journal's home page. The algorithm is applied to the websites of several publishers, including Nature, Elsevier, BMJ, and Wiley, and the experimental results show promising precision and recall values.
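As a rough illustration of the kind of pipeline the abstract describes, the sketch below scans the links on a journal home page with a keyword heuristic and scores the result with precision and recall. It is not the authors' algorithm: the keyword list, the sample HTML, the function names, and the set of "relevant" links are all hypothetical and chosen only for the example.

```python
from html.parser import HTMLParser

# Hypothetical keywords; the abstract does not state the paper's actual heuristics.
NAV_KEYWORDS = ("archive", "all issues", "current issue")


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> element currently open, if any
        self._text = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_links(html):
    """Return hrefs whose anchor text contains one of the navigation keywords."""
    parser = LinkCollector()
    parser.feed(html)
    return [href for href, text in parser.links
            if any(keyword in text.lower() for keyword in NAV_KEYWORDS)]


def precision_recall(extracted, relevant):
    """Precision = TP / |extracted|, recall = TP / |relevant|."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up home page fragment used only to exercise the functions above.
SAMPLE_HOME_PAGE = """
<ul>
  <li><a href="/about">About the journal</a></li>
  <li><a href="/archive">Archive</a></li>
  <li><a href="/current">Current Issue</a></li>
  <li><a href="/news/archive-launch">News: archive relaunch</a></li>
</ul>
"""

found = candidate_links(SAMPLE_HOME_PAGE)
print(found)
print(precision_recall(found, ["/archive", "/current"]))
```

On this sample the heuristic selects three links, one of which (the news item) is a false positive, so precision is 2/3 while recall is 1.0. This is the trade-off that precision and recall are meant to expose when an extractor is evaluated across publishers with differing page structures.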

Cite This Paper

Umamageswari Kumaresan, Kalpana Ramanujam, "Web Data Extraction from Scientific Publishers’ Website Using Heuristic Algorithm", International Journal of Intelligent Systems and Applications (IJISA), Vol. 9, No. 10, pp. 31-39, 2017. DOI: 10.5815/ijisa.2017.10.04

