59 research outputs found
Enhancing Content-And-Structure Information Retrieval using a Native XML Database
Three approaches to content-and-structure XML retrieval are analysed in this
paper: first by using Zettair, a full-text information retrieval system; second
by using eXist, a native XML database, and third by using a hybrid XML
retrieval system that uses eXist to produce the final answers from likely
relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics
can be classified in two categories: the first retrieving full articles as
final answers, and the second retrieving more specific elements within articles
as final answers. We show that for both topic categories our initial hybrid
system improves the retrieval effectiveness of a native XML database. For
ranking the final answer elements, we propose and evaluate a novel retrieval
model that utilises the structural relationships between the answer elements of
a native XML database and retrieves Coherent Retrieval Elements. The final
results of our experiments show that when the XML retrieval task focusses on
highly relevant elements our hybrid XML retrieval system with the Coherent
Retrieval Elements module is 1.8 times more effective than Zettair and 3 times
more effective than eXist, and yields effective content-and-structure XML
retrieval.
Users and Assessors in the Context of INEX: Are Relevance Dimensions Relevant?
The main aspects of XML retrieval are identified by analysing and comparing
the following two behaviours: the behaviour of the assessor when judging the
relevance of returned document components; and the behaviour of users when
interacting with components of XML documents. We argue that the two INEX
relevance dimensions, Exhaustivity and Specificity, are not orthogonal
dimensions; indeed, an empirical analysis of each dimension reveals that the
grades of the two dimensions are correlated with each other. By analysing the
level of agreement between the assessor and the users, we aim at identifying
the best units of retrieval. The results of our analysis show that the highest
level of agreement is on highly relevant and on non-relevant document
components, suggesting that only the end points of the INEX 10-point relevance
scale are perceived in the same way by both the assessor and the users. We
propose a new definition of relevance for XML retrieval and argue that its
corresponding relevance scale would be a better choice for INEX.
Use of Wikipedia Categories in Entity Ranking
Wikipedia is a useful source of knowledge that has many applications in
language processing and knowledge representation. The Wikipedia category graph
can be compared with the class hierarchy in an ontology; it has some
characteristics in common as well as some differences. In this paper, we
present our approach for answering entity ranking queries from Wikipedia.
In particular, we explore how to make use of Wikipedia categories to improve
entity ranking effectiveness. Our experiments show that using categories of
example entities works significantly better than using loosely defined target
categories.
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XML documents
to validate an existing classification or more generally an organisational
structure. Our approach integrates techniques for extracting knowledge from
documents with unsupervised classification (clustering) of documents. We focus
on the feature selection used for representing documents and its impact on the
emerging classification. We mix the selection of structured features with fine
textual selection based on syntactic characteristics. We illustrate and evaluate
this approach with a collection of Inria activity reports for the year 2003.
The objective is to cluster projects into larger groups (Themes), based on the
keywords or different chapters of these activity reports. We then compare the
results of clustering using different feature selections, with the official
theme structure used by Inria.
Comment: (postprint); This version corrects a couple of errors in authors'
names in the bibliography
Extraction d'entités dans des collections évolutives
The goal of our work is to use a set of reports and extract named entities,
in our case the names of industrial or academic partners. Starting with an
initial list of entities, we use a first set of documents to identify syntactic
patterns that are then validated in a supervised learning phase on a set of
annotated documents. The complete collection is then explored. This approach is
similar to the ones used in data extraction from semi-structured documents
(wrappers) and needs neither linguistic resources nor a large training set.
As our collection of documents will evolve over the years, we hope that
the performance of the extraction will improve with the increased size of the
training set.
Comment: The BibTeX file has been replaced with the correct one