Improved Architecture of Focused Crawler on the basis of Content and Link Analysis

Bhupinderjit Singh 1,*, Deepak Kumar Gupta 1, Raj Mohan Singh 1

1. Department of Computer Science & Engineering, Dr. B R Ambedkar National Institute of Technology, Jalandhar, India

* Corresponding author.


Received: 5 Aug. 2017 / Revised: 11 Sep. 2017 / Accepted: 16 Oct. 2017 / Published: 8 Nov. 2017

Index Terms

Focused Crawler, Topic Weight Table, Search Engine, Page Score, Link Score, URL Queue Optimization


The World Wide Web is a vast, dynamic, and continuously growing collection of web documents. Because of its size, users find it difficult to locate relevant information about a particular topic of interest. This paper proposes an improved focused crawler architecture that combines several previously used techniques. The main goal of a focused crawler is to fetch web documents related to a pre-defined set of topics or domains and to ignore irrelevant pages. To check the relevance of a web page, a Page Score is computed from the content similarity between the page and the topic keywords. A URL priority queue is implemented by calculating a Link Score for each extracted URL from the URL's attributes. The URL queue is further optimized by removing duplicate content. The Topic Keywords Weight Table is expanded by extracting additional keywords from the database of relevant pages and recalculating the keyword weights. Experimental results show that the proposed crawler is more efficient than earlier crawlers.
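The scoring pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the scoring formulas, so everything here is an assumption: `page_score` uses cosine similarity between page term frequencies and keyword weights, `link_score` averages three hypothetical signals (URL tokens, anchor text, parent Page Score), and `UrlFrontier` removes duplicates by URL string, which is a simplification of the duplicate-content removal the abstract mentions. All function and class names are illustrative and do not come from the paper.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_score(page_text, topic_weights):
    """Assumed Page Score: cosine similarity between the page's term
    frequencies and the topic keyword weight table."""
    tf = Counter(tokenize(page_text))
    dot = sum(tf[k] * w for k, w in topic_weights.items())
    norm_page = math.sqrt(sum(c * c for c in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def link_score(url, anchor_text, parent_score, topic_weights):
    """Assumed Link Score: unweighted mean of URL-token similarity,
    anchor-text similarity, and the parent page's Page Score."""
    url_sim = page_score(" ".join(tokenize(url)), topic_weights)
    anchor_sim = page_score(anchor_text, topic_weights)
    return (url_sim + anchor_sim + parent_score) / 3.0


class UrlFrontier:
    """Priority queue of URLs; the highest Link Score is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Simplification: duplicates are detected by exact URL string.
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, url))
        return True

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

In a crawl loop built on this sketch, each fetched page would be scored with `page_score`, stored if it exceeds a chosen relevance threshold, and its outgoing links pushed into the frontier with their `link_score`. The threshold and the equal weighting of the three link signals are tuning choices that the abstract does not specify.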

Cite This Paper

Bhupinderjit Singh, Deepak Kumar Gupta, Raj Mohan Singh, "Improved Architecture of Focused Crawler on the basis of Content and Link Analysis", International Journal of Modern Education and Computer Science (IJMECS), Vol.9, No.11, pp. 33-40, 2017. DOI:10.5815/ijmecs.2017.11.04


