44,070 research outputs found
Enhancing Content-And-Structure Information Retrieval using a Native XML Database
Three approaches to content-and-structure XML retrieval are analysed in this
paper: first by using Zettair, a full-text information retrieval system; second
by using eXist, a native XML database, and third by using a hybrid XML
retrieval system that uses eXist to produce the final answers from likely
relevant articles retrieved by Zettair. INEX 2003 content-and-structure topics
can be classified in two categories: the first retrieving full articles as
final answers, and the second retrieving more specific elements within articles
as final answers. We show that for both topic categories our initial hybrid
system improves the retrieval effectiveness of a native XML database. For
ranking the final answer elements, we propose and evaluate a novel retrieval
model that utilises the structural relationships between the answer elements of
a native XML database and retrieves Coherent Retrieval Elements. The final
results of our experiments show that when the XML retrieval task focusses on
highly relevant elements our hybrid XML retrieval system with the Coherent
Retrieval Elements module is 1.8 times more effective than Zettair and 3 times
more effective than eXist, and yields an effective content-and-structure XML
retrieval
Hybrid XML Retrieval: Combining Information Retrieval and a Native XML Database
This paper investigates the impact of three approaches to XML retrieval:
using Zettair, a full-text information retrieval system; using eXist, a native
XML database; and using a hybrid system that takes full article answers from
Zettair and uses eXist to extract elements from those articles. For the
content-only topics, we undertake a preliminary analysis of the INEX 2003
relevance assessments in order to identify the types of highly relevant
document components. Further analysis identifies two complementary sub-cases of
relevance assessments ("General" and "Specific") and two categories of topics
("Broad" and "Narrow"). We develop a novel retrieval module that for a
content-only topic utilises the information from the resulting answer list of a
native XML database and dynamically determines the preferable units of
retrieval, which we call "Coherent Retrieval Elements". The results of our
experiments show that -- when each of the three systems is evaluated against
different retrieval scenarios (such as different cases of relevance
assessments, different topic categories and different choices of evaluation
metrics) -- the XML retrieval systems exhibit varying behaviour and the best
performance can be reached for different values of the retrieval parameters. In
the case of INEX 2003 relevance assessments for the content-only topics, our
newly developed hybrid XML retrieval system is substantially more effective
than either Zettair or eXist, and yields a robust and a very effective XML
retrieval.Comment: Postprint version. The editor version can be accessed through the DO
A Database Approach to Content-based XML retrieval
This paper describes a rst prototype system for content-based retrieval from XML data. The system's design supports both XPath queries and complex information retrieval queries based on a language modelling approach to information retrieval. Evaluation using the INEX benchmark shows that it is beneficial if the system is biased to retrieve large XML fragments over small fragments
The State-of-the-arts in Focused Search
The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for relevant results to a userās topic of interest has gone beyond search for domain or type specific documents to more focused result (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markups. This report aims at reviewing the state-of-the-arts in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It is concluded with highlight of open problems
The Simplest Evaluation Measures for XML Information Retrieval that Could Possibly Work
This paper reviews several evaluation measures developed for evaluating XML information retrieval (IR) systems. We argue that these measures, some of which are currently in use by the INitiative for the Evaluation of XML Retrieval (INEX), are complicated, hard to understand, and hard to explain to users of XML IR systems. To show the value of keeping things simple, we report alternative evaluation results of official evaluation runs submitted to INEX 2004 using simple metrics, and show its value for INEX
Theoretical evaluation of XML retrieval
This thesis develops a theoretical framework to evaluate XML retrieval. XML retrieval deals with retrieving those document parts that specifically answer a query. It is concerned with using the document structure to improve the retrieval of information from documents by only delivering those parts of a document an information need is about. We define a theoretical evaluation methodology based on the idea of `aboutness' and apply it to XML retrieval models. Situation Theory is used to express the aboutness proprieties of XML retrieval models. We develop a dedicated methodology for the evaluation of XML retrieval and apply this methodology to five XML retrieval models and other XML retrieval topics such as evaluation methodologies, filters and experimental results
Users and Assessors in the Context of INEX: Are Relevance Dimensions Relevant?
The main aspects of XML retrieval are identified by analysing and comparing
the following two behaviours: the behaviour of the assessor when judging the
relevance of returned document components; and the behaviour of users when
interacting with components of XML documents. We argue that the two INEX
relevance dimensions, Exhaustivity and Specificity, are not orthogonal
dimensions; indeed, an empirical analysis of each dimension reveals that the
grades of the two dimensions are correlated to each other. By analysing the
level of agreement between the assessor and the users, we aim at identifying
the best units of retrieval. The results of our analysis show that the highest
level of agreement is on highly relevant and on non-relevant document
components, suggesting that only the end points of the INEX 10-point relevance
scale are perceived in the same way by both the assessor and the users. We
propose a new definition of relevance for XML retrieval and argue that its
corresponding relevance scale would be a better choice for INEX
Content-Aware DataGuides for Indexing Large Collections of XML Documents
XML is well-suited for modelling structured data with
textual content. However, most indexing approaches perform
structure and content matching independently, combining
the retrieved path and keyword occurrences in a third
step. This paper shows that retrieval in XML documents can
be accelerated significantly by processing text and structure
simultaneously during all retrieval phases. To this end,
the Content-Aware DataGuide (CADG) enhances the wellknown
DataGuide with (1) simultaneous keyword and path
matching and (2) a precomputed content/structure join. Extensive
experiments prove the CADG to be 50-90% faster
than the DataGuide for various sorts of query and document,
including difficult cases such as poorly structured
queries and recursive document paths. A new query classification
scheme identifies precise query characteristics with
a predominant influence on the performance of the individual
indices. The experiments show that the CADG is applicable
to many real-world applications, in particular large
collections of heterogeneously structured XML documents
- ā¦