2 research outputs found
Collection Selection for Distributed Web Search
Current popular web search engines, such as Google, Live Search and Yahoo!,
rely on crawling to build an index of the World Wide Web. Crawling is a continuous
process to keep the index fresh and generates an enormous amount of data
traffic. By far the largest part of the web remains unindexed, because crawlers
are unaware of the existence of web pages and they have difficulties crawling
dynamically generated content.
These problems were the main motivation to research distributed web search.
We assume that web sites, or peers, can index a collection consisting of local
content, but possibly also content from other web sites. Peers cooperate with
a broker by sending a part of their index. Receiving indices from many peers,
the broker gains a global overview of the peers’ content. When a user poses a
query to a broker, the broker selects a few peers to which it forwards the query.
Selected peers should be promising to create a good result set with many relevant
documents. The result sets are merged at the broker and sent to the user.
This research focuses on collection selection, which corresponds to the selection
of the most promising peers. The use of highly discriminative keys is employed
as a strategy to select those peers. A highly discriminative key is a term
set which is an index entry at the broker. The key is highly discriminative with
respect to the collections because the posting lists pointing to the collections are
relatively small. Query-driven indexing is applied to reduce the index size by only
storing index entries that are part of popular queries. A PageRank-like algorithm
is also tested to assign scores to collections that can be used for ranking.
The Sophos prototype was developed to test these methods. Sophos was evaluated
on different aspects, such as collection selection performance and index
sizes. The performance of the methods is compared to a baseline that applied
language modeling onto merged documents in collections. The results show that
Sophos can outperform the baseline with ad-hoc queries on a web based test set.
Query-driven indexing is able to substantially reduce index sizes against a small
loss in collection selection performance. We also found large differences in the
level of difficulty to answer queries on various corpus splits
Query based Site Selection for Distributed Search Engines
We have developed a distributed search engine, called Cooperative Search Engine (CSE), in order to retrieve fresh information. In CSE, a local search engine located in each Web server makes an index of local pages. And, a Meta search server integrates these local search engines in order to realize a global search engine. In such a way, the communication delay occurs at retrieval time. So, we have developed several speedup techniques in order to realize fast retrieval. However, these techniques cannot be used for first page retrieval in “Next 10 ” search if the page has not been searched yet. So, we have proposed Query based Site Selection (QbSS), which is widely available in all cases. In this paper, we describe QbSS in detail and discuss its features. 1