Search CORE

59 research outputs found

Enhancing Content-And-Structure Information Retrieval using a Native XML Database

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2004
Field of study

Three approaches to content-and-structure XML retrieval are analysed in this paper: first by using Zettair, a full-text information retrieval system; second by using eXist, a native XML database, and third by using a hybrid XML retrieval system that uses eXist to produce the final answers from likely relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics can be classified in two categories: the first retrieving full articles as final answers, and the second retrieving more specific elements within articles as final answers. We show that for both topic categories our initial hybrid system improves the retrieval effectiveness of a native XML database. For ranking the final answer elements, we propose and evaluate a novel retrieval model that utilises the structural relationships between the answer elements of a native XML database and retrieves Coherent Retrieval Elements. The final results of our experiments show that when the XML retrieval task focusses on highly relevant elements our hybrid XML retrieval system with the Coherent Retrieval Elements module is 1.8 times more effective than Zettair and 3 times more effective than eXist, and yields an effective content-and-structure XML retrieval

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

Users and Assessors in the Context of INEX: Are Relevance Dimensions Relevant?

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2005
Field of study

The main aspects of XML retrieval are identified by analysing and comparing the following two behaviours: the behaviour of the assessor when judging the relevance of returned document components; and the behaviour of users when interacting with components of XML documents. We argue that the two INEX relevance dimensions, Exhaustivity and Specificity, are not orthogonal dimensions; indeed, an empirical analysis of each dimension reveals that the grades of the two dimensions are correlated to each other. By analysing the level of agreement between the assessor and the users, we aim at identifying the best units of retrieval. The results of our analysis show that the highest level of agreement is on highly relevant and on non-relevant document components, suggesting that only the end points of the INEX 10-point relevance scale are perceived in the same way by both the assessor and the users. We propose a new definition of relevance for XML retrieval and argue that its corresponding relevance scale would be a better choice for INEX

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

Use of Wikipedia Categories in Entity Ranking

Author: Pehcevski Jovan
Thom James A.
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2007
Field of study

Wikipedia is a useful source of knowledge that has many applications in language processing and knowledge representation. The Wikipedia category graph can be compared with the class hierarchy in an ontology; it has some characteristics in common as well as some differences. In this paper, we present our approach for answering entity ranking queries from the Wikipedia. In particular, we explore how to make use of Wikipedia categories to improve entity ranking effectiveness. Our experiments show that using categories of example entities works significantly better than using loosely defined target categories

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology

Author: Despeyroux Thierry
Lechevallier Yves
Trousse Brigitte
Vercoustre Anne-Marie
Publication venue
Publication date: 01/01/2005
Field of study

This paper presents some experiments in clustering homogeneous XMLdocuments to validate an existing classification or more generally anorganisational structure. Our approach integrates techniques for extracting knowledge from documents with unsupervised classification (clustering) of documents. We focus on the feature selection used for representing documents and its impact on the emerging classification. We mix the selection of structured features with fine textual selection based on syntactic characteristics.We illustrate and evaluate this approach with a collection of Inria activity reports for the year 2003. The objective is to cluster projects into larger groups (Themes), based on the keywords or different chapters of these activity reports. We then compare the results of clustering using different feature selections, with the official theme structure used by Inria.Comment: (postprint); This version corrects a couple of errors in authors' names in the bibliograph

arXiv.org e-Print Archive

CiteSeerX

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Edition structuree. Approche hypertexte. Cooperation et complementarite

Author: Vercoustre Anne-Marie
Publication venue: HAL CCSD
Publication date: 01/06/1989
Field of study

INRIA a CCSD electronic archive server

Hal-Diderot

Le meta-decompilateur Pretty sous Mentor

Author: Vercoustre Anne-Marie
Publication venue: HAL CCSD
Publication date: 01/12/1985
Field of study

INRIA a CCSD electronic archive server

Hal-Diderot

Minerve.Un meta-editeur syntaxique

Author: Vercoustre Anne-Marie
Publication venue: HAL CCSD
Publication date: 01/07/1983
Field of study

INRIA a CCSD electronic archive server

Hal-Diderot

Extraction d'entit\'es dans des collections \'evolutives

Author: Despeyroux Thierry
Fraschini Eduardo
Vercoustre Anne-Marie
Publication venue
Publication date: 23/01/2007
Field of study

The goal of our work is to use a set of reports and extract named entities, in our case the names of Industrial or Academic partners. Starting with an initial list of entities, we use a first set of documents to identify syntactic patterns that are then validated in a supervised learning phase on a set of annotated documents. The complete collection is then explored. This approach is similar to the ones used in data extraction from semi-structured documents (wrappers) and do not need any linguistic resources neither a large set for training. As our collection of documents would evolve over years, we hope that the performance of the extraction would improve with the increased size of the training set.Comment: The bibteX file has been replaced with the correct on

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot