Search CORE

62 research outputs found

Query-driven document partitioning and collection selection

Author: Domenico Laforenza
Fabrizio Silvestri
Publication date: 01/01/2006

Abstract: We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. First, we show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, with respect to a round-robin assignment. Second, we show that by performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 up to 1.74 and we consistently improve CORI precision by a factor between 11% and 15%. As a side result, we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
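The query-vectors representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy query log, and the `top_k` cutoff parameter are assumptions made for illustration, and the co-clustering step is left out. The sketch only shows how a document can be mapped to the (query, rank) pairs for which it was returned, and how documents that no logged query returns can be listed separately.

```python
from collections import defaultdict


def build_query_vectors(results_by_query, top_k=100):
    """Map each document to the (query, rank) pairs for which it was returned.

    results_by_query: hypothetical query log, mapping a query string to the
    ranked list of document ids returned for it (best result first).
    top_k: only the first top_k results of each query are recorded.
    """
    vectors = defaultdict(list)
    for query, ranked_docs in results_by_query.items():
        for rank, doc_id in enumerate(ranked_docs[:top_k], start=1):
            vectors[doc_id].append((query, rank))
    return dict(vectors)


def never_returned(all_doc_ids, vectors):
    """Documents that appear in no query's recorded results."""
    return sorted(set(all_doc_ids) - set(vectors))


# Toy query log (invented data, for illustration only).
log = {
    "jaguar speed": ["d1", "d3"],
    "jaguar car": ["d2", "d3"],
}
vectors = build_query_vectors(log)
silent = never_returned(["d1", "d2", "d3", "d4"], vectors)
```

In this toy log, `d3` is matched by both queries, so its query vector has two entries, while `d4` is never returned and would fall into the separately handled set.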

CiteSeerX

GlOSS

Author: Anthony Tomasic
BOWMAN C. M.
DANZIG P. B.
FLATER D. W.
GRAVANO L.
Héctor García-Molina
KAHLE B.
Luis Gravano
NEUMAN B. C.
SCHWARTZ M. F.
SELBERG E.
SIMPSON P.
VOORHEES E. M.
YAN T. W.
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Recommending anchor points in structure-preserving hypertext document retrieval

Author: Cheung David WL
Kao Ben CM
Lee Joseph KW
Ng CY
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/1998

Traditional WWW search engines index and recommend individual Web pages to assist users in locating relevant documents. Users are often overwhelmed by the large answer set recommended by the search engines. The logical starting point of the hyper-document is thus hidden among the large basket of matching pages. Users need to spend a lot of effort browsing through the pages to locate the starting point, a very time-consuming process. This paper studies the anchor point indexing problem. The anchor points of a given user query are a small set of key pages from which the larger set of documents that are relevant to the query can be easily reached. The use of anchor points helps solve the problems of huge answer sets and low precision suffered by most search engines by considering the hyper-link structures of the relevant documents, and by providing a summary view of the result set.

HKU Scholars Hub

Méthodes pour la sélection de collections dans un environnement distribué

Author: Abbaci Faïza
Beigbeder Michel
Savoy Jacques
Publication venue: HAL CCSD
Publication date: 20/10/2002

http://www.emse.fr/~mbeig/PUBLIS/2002-cide-p227-abbaci.pdf
In this article we explore three approaches to collection selection in a distributed information retrieval environment. The search is carried out through a broker which, for a given query, selects the collections to query and merges the results they return. Our first selection approach ranks the collections by their relevance to the query; the top n collections are then queried. The second approach selects the collections whose score exceeds a given threshold. Finally, the third approach defines the number of documents to retrieve from each collection. The originality of our approach is that it uses data gathered at query time and does not rely on metadata stored in advance at the broker, as most methods known in the literature do. To evaluate our approaches and compare them with other techniques, notably the centralized (single-index) approach and CORI [CALL95] [XU98], we ran experiments on the WT10g test collection, and the gains are appreciable.
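The three selection approaches named in this abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' method: the abstract does not say how collection scores are computed or how the per-collection document counts are derived, so the integer scores, the function names, and the proportional allocation rule below are all assumptions.

```python
def select_top_n(scores, n):
    """Approach 1: query the n best-scoring collections (ties broken by name)."""
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n]


def select_above_threshold(scores, threshold):
    """Approach 2: query every collection whose score exceeds the threshold."""
    return sorted(name for name, score in scores.items() if score > threshold)


def allocate_documents(scores, total_docs):
    """Approach 3: decide how many documents to request from each collection.

    Assumption: counts are proportional to the score, rounded down.
    """
    total = sum(scores.values())
    if total == 0:
        return {name: 0 for name in scores}
    return {name: total_docs * score // total for name, score in scores.items()}


# Hypothetical per-collection scores for one query (invented values).
scores = {"A": 5, "B": 3, "C": 2}

top_two = select_top_n(scores, 2)
above = select_above_threshold(scores, 2)
quota = allocate_documents(scores, 10)
```

With these invented scores, the first two strategies both pick collections A and B, while the third spreads a budget of 10 documents over all three collections.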

HAL-EMSE

Query-driven document partitioning and collection selection

Author: DIEGO PUPPIN
DOMENICO LAFORENZA
SILVESTRI F
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006

Crossref

Archivio della ricerca- Università di Roma La Sapienza

Cluster-based database selection techniques for routing bibliographic queries

Author: LIM Ee Peng
NG Wee-Keong
XU Jian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/08/1999

Crossref

Institutional Knowledge at Singapore Management University

Usercentric Operational Decision Making in Distributed Information Retrieval

Author: Hosanagar Kartik
Publication venue: ScholarlyCommons
Publication date: 01/12/2011

Information specialists in enterprises regularly use distributed information retrieval (DIR) systems that query a large number of information retrieval (IR) systems, merge the retrieved results, and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, because different servers handle collections of different sizes and have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the DIR system has to decide which servers to query, how long to wait for responses, and which retrieved results to display based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may be in the form of access fees charged by IR servers or user’s cost associated with waiting for the servers to respond. We formulate the broker’s decision problem as a stochastic mixed-integer program and present analytical solutions for the problem. Using data gathered from FedStats—a system that queries IR engines of several U.S. federal agencies—we demonstrate that the technique can significantly increase the utility from DIR systems. Finally, simulations suggest that the technique can be applied to solve the broker’s decision problem under more complex decision environments

ScholarlyCommons@Penn


Summarizing and Searching Hidden-Web Databases Hierarchically Using Focused Probes

Author: Gravano Luis
Ipeirotis Panagiotis G.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2001

Many valuable text databases on the web have non-crawlable contents that are "hidden" behind search interfaces. Metasearchers are helpful tools for searching over many such databases at once through a unified query interface. A critical task for a metasearcher to process a query efficiently and effectively is the selection of the most promising databases for the query, a task that typically relies on statistical summaries of the database contents. Unfortunately, web-accessible text databases do not generally export content summaries. In this paper, we present an algorithm to derive content summaries from "uncooperative" databases by using "focused query probes," which adaptively zoom in on and extract documents that are representative of the topic coverage of the databases. The content summaries that result from this algorithm are efficient to derive and more accurate than those from previously proposed probing techniques for content-summary extraction. We also present a novel database selection algorithm that exploits both the extracted content summaries and a hierarchical classification of the databases, automatically derived during probing, to produce accurate results even for imperfect content summaries. Finally, we evaluate our techniques thoroughly using a variety of databases, including 50 real web-accessible text databases

Columbia University Academic Commons

Selección de recursos distribuidos en ambientes dinámicos basados en web

Author: Banchero Santiago
Bordignon Fernando Raúl Alfredo
Tolosa Gabriel Hernán
Publication date: 01/05/2007

The spread of data communications and the emergence of many online information sources have created the need to address the problem of searching over distributed repositories. This problem can be divided into three parts: representing each source so that it can be searched, selecting the appropriate sources for a given query, and merging the results for presentation to the user. This article presents the first progress in the work of building descriptions of distributed resources and evaluating selection algorithms. The goal is to integrate and adapt various algorithms from the field of Distributed Information Retrieval so that they work together with heterogeneous information sources in dynamic Web-based environments. Resources that offer content syndication will be used to evaluate how distributed resource selection algorithms respond in bounded spaces such as blogs and other sources that use this publishing model.
Track: Architecture, Networks and Operating Systems
Red de Universidades con Carreras en Informática (RedUNCI)