62 research outputs found

    Query-driven document partitioning and collection selection

    Abstract: We present a novel strategy to partition a document collection across several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results and are used for collection selection. We show that this document partitioning strategy greatly boosts the performance of standard collection selection algorithms, including CORI, compared with a round-robin assignment. Second, we show that by performing collection selection by matching the query to the existing query clusters and then choosing only one server, we reach an average precision-at-5 up to 1.74 and we consistently improve CORI precision by a factor between 11% and 15%. As a side result, we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.
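The query-vectors representation described in this abstract can be illustrated with a minimal sketch. It assumes the query log is already available as a mapping from each query to its ranked result list. The function names, the toy log, and the document identifiers are hypothetical and are not taken from the paper, and the co-clustering step is not shown.

```python
def build_query_vectors(results_by_query, all_doc_ids):
    """Map each document to {query: rank} for every query that returned it.

    results_by_query: dict of query string -> list of doc ids, best first
    (an assumed input format, not the paper's).
    """
    vectors = {doc_id: {} for doc_id in all_doc_ids}
    for query, ranked_docs in results_by_query.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            # setdefault tolerates documents missing from all_doc_ids
            vectors.setdefault(doc_id, {})[query] = rank
    return vectors


def never_returned(vectors):
    """Documents with an empty query vector, i.e. matched by no logged query."""
    return sorted(doc_id for doc_id, qv in vectors.items() if not qv)


# Toy query log: two queries, each with its top-ranked documents.
log = {
    "jaguar speed": ["d1", "d3"],
    "jaguar car": ["d2", "d3"],
}
docs = ["d1", "d2", "d3", "d4"]

vectors = build_query_vectors(log, docs)
rare = never_returned(vectors)
```

Here `d4` has an empty vector, which corresponds to the rarely asked-for documents that the abstract proposes to separate from the main index.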

    Recommending anchor points in structure-preserving hypertext document retrieval

    Traditional WWW search engines index and recommend individual Web pages to assist users in locating relevant documents. Users are often overwhelmed by the large answer set recommended by the search engines. The logical starting point of the hyper-document is thus hidden among the large basket of matching pages. Users need to spend a lot of effort browsing through the pages to locate the starting point, a very time-consuming process. This paper studies the anchor point indexing problem. The anchor points of a given user query are a small set of key pages from which the larger set of documents that are relevant to the query can be easily reached. The use of anchor points helps solve the problems of huge answer sets and low precision suffered by most search engines by considering the hyperlink structures of the relevant documents and by providing a summary view of the result set.

    Methods for collection selection in a distributed environment

    http://www.emse.fr/~mbeig/PUBLIS/2002-cide-p227-abbaci.pdf In this article we explore three approaches to collection selection in a distributed information retrieval environment. The search process goes through a broker which, for a given query, selects the collections to query and merges the results they return. Our first selection approach ranks the collections by their relevance to the query; the top n collections are then queried. The second approach selects the collections whose score exceeds a given threshold. Finally, the third approach defines the number of documents to retrieve from each collection. The originality of our approach is that it uses data gathered at query time and does not rely on metadata stored in advance at the broker, as is the case for most methods known in the literature. To evaluate our approaches and compare them with other techniques, notably the centralized (single-index) approach and CORI [CALL95] [XU98], we ran experiments on the WT10g test collection, and the gains are appreciable.
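The three selection approaches named in this abstract can be sketched as follows, assuming each collection already has a numeric relevance score for the query. The abstract does not say how scores are computed or how the third approach sets its per-collection quota, so the proportional allocation below is an assumption, and all function names are hypothetical.

```python
def select_top_n(scores, n):
    """Approach 1: rank collections by score and keep the first n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]


def select_above_threshold(scores, threshold):
    """Approach 2: keep the collections whose score exceeds the threshold."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] > threshold]


def allocate_documents(scores, total_docs):
    """Approach 3: decide how many documents to request per collection.

    Assumption: quotas are proportional to the score, rounded down.
    """
    total = sum(scores.values())
    if total <= 0:
        return {name: 0 for name in scores}
    return {name: int(total_docs * s / total) for name, s in scores.items()}


# Hypothetical scores for three collections.
scores = {"A": 6, "B": 3, "C": 1}

top = select_top_n(scores, 2)
kept = select_above_threshold(scores, 2)
quota = allocate_documents(scores, 10)
```

Because quotas are rounded down, they can sum to less than `total_docs`; a real broker would need a rule for distributing the remainder.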

    Usercentric Operational Decision Making in Distributed Information Retrieval

    Information specialists in enterprises regularly use distributed information retrieval (DIR) systems that query a large number of information retrieval (IR) systems, merge the retrieved results, and display them to users. There can be considerable heterogeneity in the quality of results returned by different IR servers. Further, because different servers handle collections of different sizes and have different processing and bandwidth capacities, there can be considerable heterogeneity in their response times. The broker in the DIR system has to decide which servers to query, how long to wait for responses, and which retrieved results to display, based on the benefits and costs imposed on users. The benefit of querying more servers and waiting longer is the ability to retrieve more documents. The costs may take the form of access fees charged by IR servers or the user's cost of waiting for the servers to respond. We formulate the broker's decision problem as a stochastic mixed-integer program and present analytical solutions for the problem. Using data gathered from FedStats, a system that queries IR engines of several U.S. federal agencies, we demonstrate that the technique can significantly increase the utility from DIR systems. Finally, simulations suggest that the technique can be applied to solve the broker's decision problem under more complex decision environments.

    Selection of distributed resources in dynamic Web-based environments

    The massive growth of data communications and the emergence of multiple online information sources have created the need to address the problem of searching across distributed repositories. This problem can be divided into three parts: representing each source so that it can be searched, selecting the appropriate sources for a given query, and merging the results for presentation to the user. This article presents the first advances in building descriptions of distributed resources and evaluating selection algorithms. The goal is to integrate and adapt various algorithms from the field of Distributed Information Retrieval so that they work together with heterogeneous information sources in dynamic Web-based environments. Resources that offer content syndication services will be used in order to evaluate how distributed resource selection algorithms respond in bounded spaces such as blogs and other sources that use this mode of content publication. Track (Eje): Arquitectura, Redes y Sistemas Operativos. Red de Universidades con Carreras en Informática (RedUNCI)