Distributed Information Retrieval using Keyword Auctions

Author: Hiemstra D.
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2008

This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions

CiteSeerX

Radboud Repository

University of Twente Research Information

A New Modified Collection Selection Algorithm using Optimal Term Weight for Web based Applications

Author: B. Ramana Reddy
K. S. Niraja
Publication venue: Global Journals Inc. (US)
Publication date: 15/01/2016

As the number of electronic data collections available on the internet increases, so does the difficulty of finding the right collection for a given query. Often the first-time user will be overwhelmed by the array of options available and will waste time hunting through pages of collection names, followed by time reading results pages after doing an ad hoc search. Collection selection using optimal weight methods tries to solve this problem by suggesting the best subset of collections to search based on a query. This is of importance to fields containing a large number of electronic collections that undergo frequent change, and to collections that cannot be fully indexed using traditional methods such as spiders. This paper presents a solution to these problems of selecting the best collections and reducing the number of collections needing to be searched

Global Journal of Computer Science and Technology (GJCST)

Query-driven document partitioning and collection selection

Author: Domenico Laforenza
Fabrizio Silvestri
Publication date: 01/01/2006

We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Secondly, we show that by performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we constantly improve CORI precision by a factor between 11% and 15%. As a side result we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.

CiteSeerX

Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes

Author: Gravano Luis
Ipeirotis Panagiotis G.
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2001

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases

Columbia University Academic Commons

The effectiveness of query expansion for distributed information retrieval

Author: Jamie Callan
Paul Ogilvie
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2004

Crossref

Query-driven document partitioning and collection selection

Author: Diego Puppin
Domenico Laforenza
Fabrizio Silvestri
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2006

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Classification-Aware Hidden-Web Text Database Selection

Author: Gravano Luis
Ipeirotis Panagiotis
Publication venue: American College of Medical Physics (ACMP)
Publication date: 06/03/2006

Many valuable text databases on the web have noncrawlable contents that are “hidden” behind search interfaces. Metasearchers are helpful tools for searching over multiple such “hidden-web” text databases at once through a unified query interface. An important step in the metasearching process is database selection, or determining which databases are the most relevant for a given user query. The state-of-the-art database selection techniques rely on statistical summaries of the database contents, generally including the database vocabulary and associated word frequencies. Unfortunately, hidden-web text databases typically do not export such summaries, so previous research has developed algorithms for constructing approximate content summaries from document samples extracted from the databases via querying. We present a novel “focused-probing” sampling algorithm that detects the topics covered in a database and adaptively extracts documents that are representative of the topic coverage of the database. Our algorithm is the first to construct content summaries that include the frequencies of the words in the database. Unfortunately, Zipf’s law practically guarantees that for any relatively large database, content summaries built from moderately sized document samples will fail to cover many low-frequency words; in turn, incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To enhance the sparse document samples and improve the database selection decisions, we exploit the fact that topically similar databases tend to have similar vocabularies, so samples extracted from databases with a similar topical focus can complement each other. We have developed two database selection algorithms that exploit this observation. The first algorithm proceeds hierarchically and selects the best categories for a query, and then sends the query to the appropriate databases in the chosen categories. The second algorithm uses “shrinkage,” a statistical technique for improving parameter estimation in the face of sparse data, to enhance the database content summaries with category-specific words. We describe how to modify existing database selection algorithms to adaptively decide (at runtime) whether shrinkage is beneficial for a query. A thorough evaluation over a variety of databases, including 315 real web databases as well as TREC data, suggests that the proposed sampling methods generate high-quality content summaries and that the database selection algorithms produce significantly more relevant database selection decisions and overall search results than existing algorithms. NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research

CiteSeerX

New York University Faculty Digital Archive