5 research outputs found
Methods for Collection Selection in a Distributed Environment
http://www.emse.fr/~mbeig/PUBLIS/2002-cide-p227-abbaci.pdf
International audience. In this article we explore three approaches to collection selection in a distributed information retrieval environment. Retrieval proceeds through a broker that, for a given query, selects the collections to interrogate and merges the results they return. Our first selection approach ranks the collections by their relevance to the query; the top n collections are then queried. The second approach selects the collections whose score exceeds a given threshold. Finally, the third approach determines the number of documents to retrieve from each collection. The originality of our method is that it uses data gathered at query time and does not rely on metadata stored a priori at the broker, as most methods in the literature do. To evaluate our approaches and compare them with other techniques, notably the centralized (single-index) approach and CORI [CALL95] [XU98], we ran experiments on the WT10g test collection, and the gains are appreciable.
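The three selection strategies named in the abstract above can be sketched as follows. The collection scores are assumed to be supplied by the broker at query time, and all function names and the proportional allocation rule are illustrative, not the paper's exact method:

```python
# Hedged sketch of three collection-selection strategies: top-n ranking,
# score threshold, and per-collection document allocation.

def select_top_n(scores, n):
    """Rank collections by score and query the top n."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n]

def select_by_threshold(scores, threshold):
    """Query every collection whose score exceeds a threshold."""
    return [c for c, s in scores.items() if s > threshold]

def allocate_documents(scores, total_docs):
    """Distribute a document budget across collections in proportion to score
    (one possible reading of the third approach)."""
    mass = sum(scores.values())
    return {c: round(total_docs * s / mass) for c, s in scores.items()}

scores = {"A": 0.9, "B": 0.5, "C": 0.1}
select_top_n(scores, 2)          # ["A", "B"]
select_by_threshold(scores, 0.4) # ["A", "B"]
allocate_documents(scores, 100)  # {"A": 60, "B": 33, "C": 7}
```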
Shortest Substring Ranking (MultiText Experiments for TREC-4)
To address the TREC-4 topics, we used a precise query language that yields and combines arbitrary intervals of text rather than pre-defined units like words and documents. Each solution was scored in inverse proportion to the length of the shortest interval containing it. Each document was scored by the sum of the scores of solutions within it. Whenever the above strategy yielded fewer than 1000 documents, documents satisfying successively weaker queries were added with lower rank. Our results for the ad-hoc topics compare favourably with the median average precision for all groups. 1 Introduction The central concern of the MultiText project at the University of Waterloo is the management of data in large-scale distributed text database systems [10]. A major component of this work has been the development of a query language that is suitable for expressing queries over the heterogeneous data that is present in a very large text database. The query language developed for the MultiText p..
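The scoring rule described in this abstract lends itself to a short sketch. Extraction of solution intervals by the query language is assumed and not reproduced here; only the inverse-length scoring and the per-document sum are shown, with illustrative names:

```python
# Hedged sketch of shortest-substring scoring: each solution interval scores
# in inverse proportion to its length, and a document's score is the sum of
# the scores of the solutions it contains.

def document_score(solution_intervals):
    """Sum 1/length over (start, end) token intervals found in one document."""
    return sum(1.0 / (end - start + 1) for start, end in solution_intervals)

def rank_documents(doc_intervals):
    """Rank document ids by descending document score."""
    scored = {doc: document_score(ivs) for doc, ivs in doc_intervals.items()}
    return sorted(scored, key=scored.get, reverse=True)

docs = {
    "d1": [(0, 4)],             # one loose 5-term match: score 0.2
    "d2": [(10, 11), (30, 32)]  # two tight matches: 0.5 + 1/3
}
rank_documents(docs)  # ["d2", "d1"]
```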
Increasing the Efficiency of High-Recall Information Retrieval
The goal of high-recall information retrieval (HRIR) is to find all,
or nearly all, relevant documents while maintaining reasonable assessment effort.
Achieving high recall is a key problem in the use of applications such as
electronic discovery, systematic review, and construction of test collections for
information retrieval tasks. State-of-the-art HRIR systems commonly rely on iterative relevance feedback in which
human assessors continually assess machine learning-selected documents.
The relevance of the assessed documents is then fed back to
the machine learning model to improve its ability to select the next set of
potentially relevant documents for assessment. In many instances, thousands of human assessments might be required to achieve high recall. These assessments represent the main cost of such HRIR
applications. Therefore, their effectiveness in achieving high recall
is limited by their reliance on human input when assessing the relevance of
documents. In this thesis, we test different methods in order to improve the effectiveness and
efficiency of finding relevant documents using a state-of-the-art HRIR
system. With regard to effectiveness, we try to build a machine-learned
model that retrieves relevant documents more accurately.
For efficiency, we try to help human assessors make
relevance assessments more easily and quickly via our HRIR system.
Furthermore, we try to establish a stopping criterion for the
assessment process so as to avoid excessive assessment.
In particular, we hypothesize that total assessment effort to achieve high
recall can be reduced by using shorter document excerpts
(e.g., extractive summaries) in place of full documents for the assessment of
relevance and using a high-recall retrieval system based on continuous active
learning (CAL). In order to test this hypothesis, we implemented a
high-recall retrieval system based on a state-of-the-art implementation of CAL. This high-recall retrieval system could display
either full documents or short document excerpts for relevance assessment.
A search engine was also integrated into our system to provide
assessors with the option of conducting interactive search and judging.
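The iterative relevance-feedback loop that CAL-based systems use can be sketched as follows. The `train`, `score`, and `assess` callables here are placeholders for the learned model and the human assessor; the abstract does not detail the thesis's actual implementation, so this is only a shape, not that system:

```python
# Hedged sketch of a continuous active learning (CAL) loop: repeatedly
# retrain, select the top-scoring unjudged documents, and feed the human
# judgments back into the model.

def cal_loop(documents, assess, train, score, batch_size=1, max_rounds=100):
    """Run CAL until the budget is spent or every document is judged.

    assess(doc) -> bool   human relevance judgment (placeholder)
    train(judged) -> model  retrain on all judgments so far (placeholder)
    score(model, doc) -> float  model's relevance score (placeholder)
    """
    judged = {}  # doc id -> relevance label
    for _ in range(max_rounds):
        model = train(judged)
        unjudged = [d for d in documents if d not in judged]
        if not unjudged:
            break
        # select the highest-scoring unjudged documents for assessment
        batch = sorted(unjudged, key=lambda d: score(model, d),
                       reverse=True)[:batch_size]
        for doc in batch:
            judged[doc] = assess(doc)  # relevance fed back next round
    return judged
```

Note that the loop retrains before every batch, which is what distinguishes *continuous* active learning from a one-shot feedback round.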
We conducted a simulation study, and separately, a 50-person controlled user study to test our hypothesis.
The results of the simulation study show that judging even a single
extracted sentence for relevance feedback may be adequate for CAL
to achieve high recall. The results of the controlled user study
confirmed that human assessors were able to find
a significantly larger number of relevant documents within limited time when they used the
system with paragraph-length document excerpts as opposed to full documents.
In addition, we found that allowing participants to compose and execute their
own search queries did not improve their ability to find relevant
documents and, by some measures, impaired performance.
Moreover, integrating sampling methods with active
learning can yield accurate estimates of the number of relevant documents, and thus avoid excessive assessment.
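One simple way sampling can estimate the number of relevant documents, sketched here under the assumption of a uniform random sample (the thesis's actual estimator may well differ), is to scale the sample's observed prevalence up to the population:

```python
import random

# Hedged sketch: estimate the total number of relevant documents from
# prevalence in a simple random sample. is_relevant stands in for a human
# judgment of a sampled document.

def estimate_relevant(population, sample_size, is_relevant, seed=0):
    """Estimate total relevant docs as sample prevalence times population size."""
    rng = random.Random(seed)
    sample = rng.sample(population, sample_size)
    relevant_in_sample = sum(1 for doc in sample if is_relevant(doc))
    return relevant_in_sample * len(population) / sample_size
```

Such an estimate gives the assessment process a target: once the number of judged relevant documents approaches the estimate, further assessment yields diminishing returns.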