5 research outputs found

    Méthodes pour la sélection de collections dans un environnement distribué

    http://www.emse.fr/~mbeig/PUBLIS/2002-cide-p227-abbaci.pdf
    In this article we explore three approaches to collection selection in a distributed information retrieval environment. Retrieval is mediated by a broker that, for a given query, selects the collections to interrogate and merges the results they return. Our first selection approach ranks the collections by their relevance to the query; the top n collections are then interrogated. The second approach selects the collections whose score exceeds a given threshold. Finally, the third approach determines the number of documents to retrieve from each collection. The originality of our method is that it uses data gathered at query time and does not rely on metadata stored a priori at the broker, as most methods known in the literature do. To evaluate our approaches and compare them with other techniques, notably the centralized (single-index) approach and CORI [CALL95] [XU98], we conducted experiments on the WT10g test collection, and the gains are appreciable.
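The three selection strategies described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's method: the collection scores are assumed to have been computed at query time by the broker, and the function names and the proportional document allocation are assumptions for the sake of the example.

```python
def select_top_n(scores, n):
    """First approach: rank collections by score, interrogate the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

def select_above_threshold(scores, threshold):
    """Second approach: keep every collection whose score exceeds a threshold."""
    return [c for c, s in scores.items() if s > threshold]

def allocate_documents(scores, total_docs):
    """Third approach: decide how many documents to retrieve from each
    collection. Proportional-to-score allocation is an assumption here."""
    total = sum(scores.values())
    return {c: round(total_docs * s / total) for c, s in scores.items()}

# Hypothetical query-time scores for three collections.
scores = {"A": 0.9, "B": 0.4, "C": 0.1}
print(select_top_n(scores, 2))              # ['A', 'B']
print(select_above_threshold(scores, 0.3))  # ['A', 'B']
print(allocate_documents(scores, 100))      # {'A': 64, 'B': 29, 'C': 7}
```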

    Approaches to collection selection and results merging for distributed information retrieval


    Shortest Substring Ranking (MultiText Experiments for TREC-4)

    No full text
    To address the TREC-4 topics, we used a precise query language that yields and combines arbitrary intervals of text rather than pre-defined units like words and documents. Each solution was scored in inverse proportion to the length of the shortest interval containing it. Each document was scored by the sum of the scores of the solutions within it. Whenever this strategy yielded fewer than 1000 documents, documents satisfying successively weaker queries were added with lower rank. Our results for the ad-hoc topics compare favourably with the median average precision for all groups. 1 Introduction The central concern of the MultiText project at the University of Waterloo is the management of data in large-scale distributed text database systems [10]. A major component of this work has been the development of a query language that is suitable for expressing queries over the heterogeneous data that is present in a very large text database. The query language developed for the MultiText p..
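The interval-based scoring described above can be sketched as follows: each minimal interval ("solution") containing all query terms contributes in inverse proportion to its length, and a document's score is the sum of those contributions. The sliding-window enumeration and the term-position representation are illustrative assumptions, not the MultiText implementation.

```python
from collections import Counter

def minimal_cover_windows(positions_per_term):
    """Enumerate the minimal intervals containing at least one occurrence of
    every query term. positions_per_term maps each term to its sorted token
    positions within one document (a hypothetical representation)."""
    events = sorted((p, t) for t, ps in positions_per_term.items() for p in ps)
    need = len(positions_per_term)
    seen = Counter()
    windows = []
    left = 0
    for right, (pos_r, term_r) in enumerate(events):
        seen[term_r] += 1
        # Drop surplus occurrences from the left edge of the window.
        while seen[events[left][1]] > 1:
            seen[events[left][1]] -= 1
            left += 1
        if len(seen) == need:
            windows.append((events[left][0], pos_r))
            # Consume the left endpoint so the next window is a new solution.
            seen[events[left][1]] -= 1
            if seen[events[left][1]] == 0:
                del seen[events[left][1]]
            left += 1
    return windows

def document_score(positions_per_term):
    """Each solution scores 1/length; the document scores their sum."""
    return sum(1.0 / (hi - lo + 1)
               for lo, hi in minimal_cover_windows(positions_per_term))

# Term "a" occurs at tokens 0 and 5, term "b" at token 3.
pos = {"a": [0, 5], "b": [3]}
print(minimal_cover_windows(pos))  # [(0, 3), (3, 5)]
print(document_score(pos))         # 1/4 + 1/3
```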

    Increasing the Efficiency of High-Recall Information Retrieval

    The goal of high-recall information retrieval (HRIR) is to find all, or nearly all, relevant documents while maintaining reasonable assessment effort. Achieving high recall is a key problem in applications such as electronic discovery, systematic review, and the construction of test collections for information retrieval tasks. State-of-the-art HRIR systems commonly rely on iterative relevance feedback, in which human assessors continually assess machine-learning-selected documents. The relevance of the assessed documents is then fed back to the machine learning model to improve its ability to select the next set of potentially relevant documents for assessment. In many instances, thousands of human assessments may be required to achieve high recall. These assessments represent the main cost of such HRIR applications, so the effectiveness of these systems in achieving high recall is limited by their reliance on human input for assessing the relevance of documents. In this thesis, we test different methods for improving the effectiveness and efficiency of finding relevant documents with a state-of-the-art HRIR system. With regard to effectiveness, we try to build a machine-learned model that retrieves relevant documents more accurately. For efficiency, we try to help human assessors make relevance assessments more easily and quickly via our HRIR system. Furthermore, we try to establish a stopping criterion for the assessment process so as to avoid excessive assessment. In particular, we hypothesize that the total assessment effort needed to achieve high recall can be reduced by using shorter document excerpts (e.g., extractive summaries) in place of full documents for relevance assessment, together with a high-recall retrieval system based on continuous active learning (CAL). To test this hypothesis, we implemented a high-recall retrieval system based on a state-of-the-art implementation of CAL.
    This system could display either full documents or short document excerpts for relevance assessment. A search engine was also integrated to give assessors the option of interactive search and judging. We conducted a simulation study and, separately, a 50-person controlled user study to test our hypothesis. The results of the simulation study show that judging even a single extracted sentence for relevance feedback may be adequate for CAL to achieve high recall. The results of the controlled user study confirmed that human assessors were able to find a significantly larger number of relevant documents within limited time when they used the system with paragraph-length document excerpts rather than full documents. In addition, we found that allowing participants to compose and execute their own search queries did not improve their ability to find relevant documents and, by some measures, impaired performance. Moreover, integrating sampling methods with active learning can yield accurate estimates of the number of relevant documents, and thus avoid excessive assessment.
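The continuous active learning loop the abstract describes can be sketched as follows. This is a toy simulation, not the thesis implementation: the keyword-overlap scorer stands in for a real machine-learned model, and the corpus, oracle, and seed query are hypothetical.

```python
def keyword_score(doc, relevant_docs):
    """Toy relevance model: word overlap with documents judged relevant so
    far (a stand-in for a trained classifier)."""
    vocab = set()
    for d in relevant_docs:
        vocab.update(d.split())
    return len(vocab & set(doc.split()))

def cal_loop(corpus, oracle, seed_query, batch_size=1):
    """CAL loop: rank unjudged documents with the current model, ask the
    assessor (oracle) about the top batch, and feed judged-relevant
    documents back into the model before the next round."""
    judged, relevant, found = set(), [seed_query], []
    while len(judged) < len(corpus):
        unjudged = [i for i in range(len(corpus)) if i not in judged]
        batch = sorted(unjudged,
                       key=lambda i: keyword_score(corpus[i], relevant),
                       reverse=True)[:batch_size]
        for i in batch:
            judged.add(i)
            if oracle(i):  # human relevance assessment
                relevant.append(corpus[i])
                found.append(i)
    return found

corpus = ["cats purr softly", "dogs bark loudly", "cats and kittens purr"]
print(cal_loop(corpus, lambda i: i in {0, 2}, "cats purr"))  # [0, 2]
```

For simplicity this sketch judges the whole corpus; a real CAL system would instead apply a stopping criterion, such as the sampling-based estimate of remaining relevant documents mentioned in the abstract.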