
    Developing web crawlers for vertical search engines: a survey of the current research

    Vertical search engines allow users to query for information within a subset of documents relevant to a pre-determined topic (Chakrabarti, 1999). One challenging aspect of deploying a vertical search engine is building a Web crawler that distinguishes relevant documents from non-relevant ones. In this research, we describe and analyze various methods for crawling relevant documents for vertical search engines, and we examine how these methods can be applied to building a local search engine. In a typical crawl cycle for a vertical search engine, the crawler grabs a URL from the URL frontier, downloads the content at that URL, and determines the document’s relevance to the pre-defined topic. If the document is deemed relevant, it is indexed and its links are added to the URL frontier. This process raises two questions: how do we judge a document’s relevance, and how should we prioritize URLs in the frontier so that the best documents are reached first? To determine the relevance of a document, we may maintain a set of pre-determined keywords that we attempt to match against a crawled document’s content and metadata. Another possibility is relevance feedback, a mechanism in which we train the crawler to recognize relevant documents by feeding it training data. To prioritize links within the URL frontier, we can use a breadth-first crawler, which indexes pages one level at a time; bridges, which are pages that are not indexed themselves but are used to gather more links; reinforcement learning, in which the crawler is rewarded for reaching relevant pages; or decision trees, in which the priority given to a link depends on the quality of its parent page.
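    The crawl cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not the survey's implementation: the TOPIC_KEYWORDS vocabulary, the keyword-overlap relevance test, and the seed URLs are hypothetical stand-ins, and the FIFO frontier gives the breadth-first ordering the abstract mentions.

        import re
        import urllib.request
        from collections import deque

        TOPIC_KEYWORDS = {"crawler", "search", "indexing"}  # hypothetical topic vocabulary

        def fetch(url):
            """Download page content; failed fetches simply skip the URL."""
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.read().decode("utf-8", errors="replace")
            except Exception:
                return ""

        def is_relevant(text):
            """Keyword-overlap relevance test (a stand-in for the methods surveyed)."""
            words = set(re.findall(r"[a-z]+", text.lower()))
            return len(words & TOPIC_KEYWORDS) >= 2

        def extract_links(text):
            """Naive href extraction; a real crawler would also resolve relative URLs."""
            return re.findall(r'href="(https?://[^"]+)"', text)

        def crawl(seed_urls, max_pages=100):
            frontier = deque(seed_urls)   # FIFO frontier => breadth-first order
            seen = set(seed_urls)
            index = []
            while frontier and len(index) < max_pages:
                url = frontier.popleft()          # grab a URL from the frontier
                text = fetch(url)                 # download its content
                if not is_relevant(text):         # judge relevance to the topic
                    continue
                index.append(url)                 # index the relevant document
                for link in extract_links(text):  # add its links to the frontier
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
            return index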
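    Relevance feedback can likewise be sketched with a classic Rocchio-style approach: build a centroid term vector from labeled training documents, then score new pages by cosine similarity against it. The training sentences and the 0.2 threshold below are illustrative assumptions, not values from the survey.

        import math
        import re
        from collections import Counter

        def tokens(text):
            return re.findall(r"[a-z]+", text.lower())

        def centroid(docs):
            """Average term-frequency vector of a set of training documents."""
            total = Counter()
            for doc in docs:
                total.update(tokens(doc))
            n = max(len(docs), 1)
            return {term: count / n for term, count in total.items()}

        def cosine(vec, doc):
            """Cosine similarity between a centroid vector and a new document."""
            counts = Counter(tokens(doc))
            dot = sum(vec.get(t, 0.0) * c for t, c in counts.items())
            norm_v = math.sqrt(sum(v * v for v in vec.values()))
            norm_d = math.sqrt(sum(c * c for c in counts.values()))
            return dot / (norm_v * norm_d) if norm_v and norm_d else 0.0

        # Train on labeled examples, then score new pages against a threshold.
        relevant_centroid = centroid([
            "vertical search engines index topical documents",
            "focused crawlers harvest relevant web pages",
        ])
        page = "a focused crawler for a topical search engine"
        print(cosine(relevant_centroid, page) > 0.2)  # True for this on-topic page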
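    The common thread among the non-breadth-first prioritization strategies is a frontier that dequeues URLs by score rather than arrival order. The sketch below abstracts the decision-tree idea down to its effect: links discovered on higher-quality parent pages are popped first. How the parent score is computed is left open, and the URLs and scores are made up for illustration.

        import heapq
        import itertools

        class PriorityFrontier:
            """URL frontier ordered by a priority score, e.g. the relevance
            score of the parent page on which the link was found."""

            def __init__(self):
                self._heap = []
                self._counter = itertools.count()  # tie-breaker for equal scores

            def push(self, url, parent_score):
                # heapq is a min-heap, so negate the score to pop best-first
                heapq.heappush(self._heap, (-parent_score, next(self._counter), url))

            def pop(self):
                _, _, url = heapq.heappop(self._heap)
                return url

            def __bool__(self):
                return bool(self._heap)

        frontier = PriorityFrontier()
        frontier.push("http://example.com/a", parent_score=0.9)
        frontier.push("http://example.com/b", parent_score=0.2)
        print(frontier.pop())  # the link found on the higher-quality parent page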