28 research outputs found

    Using NLP to build the hypertextual network of a back-of-the-book index

    Relying on the idea that back-of-the-book indexes are traditional devices for navigating large documents, we have developed a method to build a hypertextual network that supports navigation within a document. Building such a network requires selecting a list of descriptors, identifying the relevant text segments to associate with each descriptor, and finally ranking the descriptors and referenced segments by relevance. We propose a specific document segmentation method and a relevance measure for ranking. The algorithms are tested on four corpora (of different types and domains) without human intervention or any semantic knowledge.
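The descriptor-to-segment ranking step described above can be sketched in a few lines. The `rank_segments` function and its length-normalized frequency score are illustrative assumptions only, since the abstract does not specify the actual relevance measure:

```python
import math
from collections import Counter

def rank_segments(descriptor, segments):
    """Rank text segments by relevance to one descriptor.

    Hypothetical measure: frequency of the descriptor's words in the
    segment, normalized by the square root of segment length (the
    paper's actual measure is not given in the abstract).
    """
    desc_words = descriptor.lower().split()
    scored = []
    for i, seg in enumerate(segments):
        words = seg.lower().split()
        counts = Counter(words)
        hits = sum(counts[w] for w in desc_words)
        score = hits / math.sqrt(len(words)) if words else 0.0
        scored.append((score, i))
    # Highest-scoring segments first
    return [i for score, i in sorted(scored, reverse=True)]

segments = [
    "The index lists topics alphabetically.",
    "Navigation through large documents uses the index and cross references.",
    "Unrelated text about something else entirely.",
]
print(rank_segments("index navigation", segments))  # → [1, 0, 2]
```

The second segment ranks first because it contains both descriptor words, despite being longer.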

    Linguistic Analysis of Users' Queries: towards an adaptive Information Retrieval System

    Most Information Retrieval Systems transform natural-language user queries into bags of words that are matched against documents, themselves also represented as bags of words. Through this process, the richness of the query is lost. In this paper we show that the linguistic features of a query are good indicators for predicting a system's failure to answer it. The experiments are based on 42 systems or system variants and 50 TREC topics, each of which includes a descriptive part expressed in natural language.
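The bag-of-words reduction criticized above can be illustrated directly; `to_bag_of_words` is a hypothetical name for this standard transformation:

```python
from collections import Counter

def to_bag_of_words(text):
    """Reduce a query to an unordered multiset of words, discarding
    word order, syntax, and phrasing."""
    return Counter(text.lower().split())

# Two queries with opposite meanings collapse to the same bag,
# which is exactly the loss of richness the paper points out:
print(to_bag_of_words("dog bites man") == to_bag_of_words("man bites dog"))  # → True
```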

    Unités d'indexation et taille des requêtes pour la recherche d'information en français (Indexing units and query size for information retrieval in French)

    This paper analyses different indexing methods for French (lemmas, stems, and truncated terms) as well as their fusion. We also examine the influence of the different sections of a topic on precision. Our study uses the CLEF French monolingual collections from 2000 to 2005. We show that the best method is the one based on lemmas that fuses the results obtained with the different sections of a topic, and that this combination is the most effective for improving mean average precision and high precision.
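One common way to fuse results obtained from different indexing methods (lemmas, stems, truncated terms) is CombSUM, which sums per-document scores across runs. The abstract does not name the fusion method actually used, so this is only an illustrative sketch:

```python
from collections import defaultdict

def comb_sum(runs):
    """Fuse several ranked result lists by summing per-document
    scores (CombSUM). Each run maps document id -> retrieval score."""
    fused = defaultdict(float)
    for run in runs:
        for doc, score in run.items():
            fused[doc] += score
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical runs from two indexing methods
lemma_run = {"d1": 0.9, "d2": 0.4, "d3": 0.2}
stem_run = {"d2": 0.7, "d1": 0.3, "d4": 0.5}
print(comb_sum([lemma_run, stem_run]))  # → ['d1', 'd2', 'd4', 'd3']
```

Documents retrieved by several runs accumulate evidence, which is why fusion can outperform any single indexing method.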

    Evaluating the Potential of Explicit Phrases for Retrieval Quality

    This paper evaluates the potential impact of explicit phrases on retrieval quality through a case study with the TREC Terabyte benchmark. It compares the performance of user- and system-identified phrases under a standard score and a proximity-aware score, and shows that an optimal choice of phrases, including term permutations, can significantly improve query performance.

    Combining compound and single terms under language model framework

    Most existing Information Retrieval models, including probabilistic and vector-space models, are based on the term-independence hypothesis. To go beyond this assumption and thereby capture the semantics of documents and queries more accurately, several works have incorporated phrases or other syntactic information in IR; such attempts have shown slight benefit, at best. In language modeling approaches in particular, this extension is achieved through the use of bigram or n-gram models. However, in these models all bigrams/n-grams are considered and weighted uniformly. In this paper we introduce a new approach to select and weight the relevant n-grams associated with a document. Experimental results on three TREC test collections show an improvement over three strong state-of-the-art baselines: the original unigram language model, the Markov Random Field model, and the positional language model.
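The unigram language-model baseline mentioned above scores a document by the likelihood of generating the query, typically with Dirichlet smoothing. The n-gram selection and weighting proposed in the paper is not reproduced here; this sketch shows only the baseline it extends, with made-up toy data:

```python
import math
from collections import Counter

def dirichlet_lm_score(query, doc, collection, mu=100.0):
    """log p(query | doc) under a Dirichlet-smoothed unigram LM:
    p(w|d) = (tf(w,d) + mu * p(w|C)) / (|d| + mu)."""
    tf = Counter(doc)
    ctf = Counter(w for d in collection for w in d)
    clen = sum(ctf.values())
    score = 0.0
    for w in query:
        p_c = ctf[w] / clen  # collection (background) probability
        score += math.log((tf[w] + mu * p_c) / (len(doc) + mu))
    return score

docs = [
    "language model for retrieval".split(),
    "markov random field model".split(),
    "cooking pasta at home".split(),
]
query = "language model".split()
scores = [dirichlet_lm_score(query, d, docs) for d in docs]
print(max(range(3), key=scores.__getitem__))  # → 0 (the best-matching document)
```

Smoothing with the collection model keeps unseen query words from zeroing out a document's score, which is why the second document still receives a finite score despite missing "language".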

    Lexical cohesion and term proximity in document ranking

    We demonstrate effective new methods of document ranking based on lexical cohesive relationships between query terms. The proposed methods rely solely on the lexical relationships between the original query terms, and do not involve query expansion or relevance feedback. Two types of lexical cohesive relationship between query terms are used in document ranking: a short-distance collocation relationship between the query terms themselves, and a long-distance relationship determined by the collocation of query terms with other words. The methods are evaluated on TREC corpora and show improvements over baseline systems.
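The short-distance collocation signal can be approximated by counting pairs of distinct query terms that co-occur within a small window. The window size and the exact scoring used by the authors are assumptions here:

```python
def collocation_score(q_terms, tokens, window=5):
    """Count pairs of distinct query terms occurring within `window`
    tokens of each other (a short-distance lexical cohesion signal)."""
    pos = [(i, t) for i, t in enumerate(tokens) if t in q_terms]
    count = 0
    for a in range(len(pos)):
        for b in range(a + 1, len(pos)):
            i, t1 = pos[a]
            j, t2 = pos[b]
            if t1 != t2 and j - i <= window:
                count += 1
    return count

tokens = "the economic policy affected economic growth in the region".split()
print(collocation_score({"economic", "growth"}, tokens))  # → 2
```

Both occurrences of "economic" fall within five tokens of "growth", so the document earns two cohesion points; a document mentioning the terms far apart would earn none.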

    Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

    Background: In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet containing a protein, a GO (Gene Ontology) term, and a relevant article, extract a short passage that justifies the GO category assignment; 2) given a Swiss-Prot pair containing a protein and a relevant article, automatically assign a set of categories.

    Methods: The sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The text categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both on annotator judgements, as established by the competition, and on mean average precision measures computed using a curated sample of Swiss-Prot.

    Results: Our system achieved the best combination of recall and precision both for passage retrieval and for text categorization, as judged by the official evaluators. However, text categorization results were far below those of other data-poor text categorization experiments: the top proposed term is relevant in less than 20% of cases, whereas with other biomedical controlled vocabularies, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status values of our engines, exhibit effective confidence-estimation capabilities.

    Conclusion: From a comparative perspective, the combination of retrieval and natural language processing methods we designed achieved very competitive performance. Largely data-independent, our systems were no less effective than data-intensive approaches. These results suggest that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, the results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.
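The sentence-level distance computation described in the methods above can be sketched with a simple token-overlap proxy; the real system's distance function is not specified in the abstract, so `overlap_score` and the sample sentences are purely illustrative:

```python
def overlap_score(go_term, sentence):
    """Token-overlap proxy for sentence-to-GO-term similarity:
    fraction of GO-term words found in the sentence (the system's
    actual distance is not given in the abstract)."""
    go = set(go_term.lower().split())
    sent = set(sentence.lower().split())
    return len(go & sent) / len(go)

def best_passage(go_term, sentences):
    """Return the sentence that best justifies the GO assignment."""
    return max(sentences, key=lambda s: overlap_score(go_term, s))

sentences = [
    "The protein localizes to the nucleus during mitosis.",
    "We measured dna binding activity in vitro.",
    "Samples were stored at -80 degrees.",
]
print(best_passage("dna binding", sentences))
```

With the sentence as the retrieval unit, the highest-scoring sentence is returned as the justification passage for the GO category.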

    Information Retrieval: Recent Advances and Beyond

    In this paper, we provide a detailed overview of the models used for information retrieval in the first and second stages of the typical processing chain. We discuss the current state-of-the-art models, including term-based methods, semantic retrieval, and neural approaches. Additionally, we delve into the key topics related to the learning process of these models. In this way, the survey offers a comprehensive understanding of the field and is of interest to researchers and practitioners entering or working in the information retrieval domain.