Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words representation. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Comment: 37 pages
Intégration des Analyses du Français dans la Recherche d'Information
This article describes approaches that we implemented as part of a research collaboration between our two groups. These approaches aim to create a more precise representation of documents and queries in an information retrieval system (IRS). They are based on the extraction of compound terms, rather than the single terms used in traditional approaches. Two approaches are employed: a syntactic-statistical analysis, and the use of a manually built terminology base. We describe these two approaches, as well as the preliminary results obtained.
Using a Medical Thesaurus to Predict Query Difficulty
Estimating query performance is the task of predicting the quality of results returned by a search engine in response to a query. In this paper, we focus on pre-retrieval prediction methods for the medical domain. We propose a novel predictor that exploits a thesaurus to ascertain how difficult queries are. In our experiments, we show that our predictor outperforms the state-of-the-art methods that do not use a thesaurus.
Augmenting Ad-Hoc IR Dataset for Interactive Conversational Search
A peculiarity of conversational search systems is that they involve
mixed-initiatives such as system-generated query clarifying questions.
Evaluating those systems at a large scale on the end task of IR is very
challenging, requiring adequate datasets containing such interactions. However,
current datasets only focus on either traditional ad-hoc IR tasks or query
clarification tasks, the latter being usually seen as a reformulation task from
the initial query. The only two datasets known to us that contain both document
relevance judgments and the associated clarification interactions are Qulac and
ClariQ. Both are based on the TREC Web Track 2009-12 collection, but cover a
very limited number of topics (237 topics), far from being enough for training
and testing conversational IR models. To fill the gap, we propose a methodology
to automatically build large-scale conversational IR datasets from ad-hoc IR
datasets in order to facilitate explorations on conversational IR. Our
methodology is based on two processes: 1) generating query clarification
interactions through query clarification and answer generators, and 2)
augmenting ad-hoc IR datasets with simulated interactions. In this paper, we
focus on MsMarco and augment it with query clarification and answer
simulations. We perform a thorough evaluation showing the quality and the
relevance of the generated interactions for each initial query. This paper
shows the feasibility and utility of augmenting ad-hoc IR datasets for
conversational IR.