Theoretical evaluation of XML retrieval
This thesis develops a theoretical framework to evaluate XML retrieval. XML retrieval deals with retrieving those document parts that specifically answer a query. It is concerned with using the document structure to improve the retrieval of information from documents by delivering only those parts of a document that an information need is about. We define a theoretical evaluation methodology based on the idea of 'aboutness' and apply it to XML retrieval models. Situation Theory is used to express the aboutness properties of XML retrieval models. We develop a dedicated methodology for the evaluation of XML retrieval and apply this methodology to five XML retrieval models and to other XML retrieval topics such as evaluation methodologies, filters and experimental results.
XML retrieval using pruned element-index files
An element-index is a crucial mechanism for supporting content-only (CO) queries over XML collections. A full element-index, which indexes each element along with the content of its descendants, involves high redundancy and reduces query processing efficiency. A direct index, on the other hand, indexes only the content that is directly under each element and disregards the descendants. This results in a smaller index, but possibly at the cost of some reduction in system effectiveness. In this paper, we propose using static index pruning techniques to obtain more compact index files that can still achieve retrieval performance comparable to that of a full index. We also compare the retrieval performance of these pruning-based approaches to some other strategies that make use of a direct element-index. Our experiments, conducted along the lines of the INEX evaluation framework, reveal that pruned index files yield retrieval performance comparable to or even better than the full index and the direct index for several tasks in the ad hoc track. © 2010 Springer-Verlag Berlin Heidelberg
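The difference between a full and a direct element-index can be illustrated with a minimal sketch. It is not the paper's implementation: the sample document, the path-style element identifiers and the function name `build_indexes` are assumptions made for illustration, and postings store only element identifiers without term weights.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document, not taken from the INEX collection.
DOC = ("<article><title>xml retrieval</title>"
       "<sec><p>index pruning</p><p>xml index</p></sec></article>")

def build_indexes(xml_text):
    """Return (full, direct) inverted indexes: term -> set of element paths."""
    root = ET.fromstring(xml_text)
    full = defaultdict(set)    # element text plus text of all descendants
    direct = defaultdict(set)  # only text directly under the element

    def visit(elem, path):
        # Text directly under this element: its own text plus child tails.
        own = (elem.text or "").split()
        for child in elem:
            own += (child.tail or "").split()
        subtree = list(own)
        seen = defaultdict(int)
        for child in elem:
            seen[child.tag] += 1
            subtree += visit(child, f"{path}/{child.tag}[{seen[child.tag]}]")
        for term in own:
            direct[term].add(path)
        for term in subtree:
            full[term].add(path)
        return subtree

    visit(root, f"/{root.tag}[1]")
    return full, direct

full, direct = build_indexes(DOC)
print(sorted(full["xml"]))    # every ancestor of a matching element is listed
print(sorted(direct["xml"]))  # only the elements that hold the term themselves
```

In this sketch the root element appears in the full index for every term, which is the redundancy the abstract refers to, while the direct index lists each term only at the element that contains it.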
Investigating the document structure as a source of evidence for multimedia fragment retrieval
Multimedia objects can be retrieved using their context, which can be, for instance, the text surrounding them in documents. This text may be either near or far from the searched objects. Our goal in this paper is to study the impact, in terms of effectiveness, of text position relative to the searched objects. The multimedia objects we consider are described in structured documents such as XML ones. The document structure is therefore exploited to provide this text position in documents. Although structural information has been shown to be an effective source of evidence in textual information retrieval, only a few works have investigated its value in multimedia retrieval. More precisely, the task we are interested in is to retrieve multimedia fragments (i.e. XML elements having at least one multimedia object). Our general approach is built on two steps: we first retrieve XML elements containing multimedia objects, and we then explore the surrounding information to retrieve relevant multimedia fragments. In both cases, we study the impact of the surrounding information using the document structure. Our work is carried out on images, but it can be extended to any other media, since the physical content of multimedia objects is not used. We conducted several experiments in the context of the Multimedia track of the INEX evaluation campaign. Results showed that structural evidence is of high interest for tuning the importance of textual context in multimedia retrieval. Moreover, the proposed approach outperforms state-of-the-art approaches.
The State-of-the-arts in Focused Search
The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for results relevant to a user's topic of interest has gone beyond search for domain- or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange. It allows focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.
Efficiency and effectiveness of XML keyword search using a full element index
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010. Thesis (Master's) -- Bilkent University, 2010. Includes bibliographical references, leaves 63-67.
In the last decade, both the academia and industry proposed several techniques to allow keyword search on XML databases and document collections. A common data structure employed in most of these approaches is an inverted index, which is the state-of-the-art for conducting keyword search over large volumes of textual data, such as world wide web. In particular, a full element-index considers (and indexes) each XML element as a separate document, which is formed of the text directly contained in it and the textual content of all of its descendants. A major criticism for a full element-index is the high degree of redundancy in the index (due to the nested structure of XML documents), which diminishes its usage for large-scale XML retrieval scenarios.
As the first contribution of this thesis, we investigate the efficiency and effectiveness of using a full element-index for XML keyword search. First, we suggest that lossless index compression methods can significantly reduce the size of a full element-index so that query processing strategies, such as those employed in a typical search engine, can efficiently operate on it. We show that once the most essential problem of a full element-index, i.e., its size, is remedied, using such an index can improve both the result quality (effectiveness) and query execution performance (efficiency) in comparison to other recently proposed techniques in the literature. Moreover, using a full element-index also allows generating query results in different forms, such as a ranked list of documents (as expected by a search engine user) or a complete list of elements that include all of the query terms (as expected by a DBMS user), in a unified framework.
As a second contribution of this thesis, we propose to use a lossy approach, static index pruning, to further reduce the size of a full element-index. In this way, we aim to eliminate the repetition of an element's terms at upper levels in an adaptive manner considering the element's textual content and search system's ranking function. That is, we attempt to remove the repetitions in the index only when we expect that removal of them would not reduce the result quality. We conduct a well-crafted set of experiments and show that pruned index files are comparable or even superior to the full element-index up to very high pruning levels for various ad hoc tasks in terms of retrieval effectiveness.
As a final contribution of this thesis, we propose to apply index pruning strategies to reduce the size of the document vectors in an XML collection to improve the clustering performance of the collection. Our experiments show that for certain cases, it is possible to prune up to 70% of the collection (or, more specifically, underlying document vectors) and still generate a clustering structure that yields the same quality with that of the original collection, in terms of a set of evaluation metrics.
Atılgan, Duygu. M.S.
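The static index pruning idea in the thesis abstract above can be sketched as follows. This is a simplified, term-centric variant chosen for illustration, not the thesis's method: the function name `prune_postings`, the `keep_fraction` parameter and the precomputed scores are all assumptions. Each posting list keeps only its highest-scoring entries, on the expectation that low-scoring postings rarely affect the top results.

```python
import math

def prune_postings(index, keep_fraction):
    """Keep the top `keep_fraction` of postings per term, ranked by score.

    index: dict mapping term -> {element_id: score}; the scores stand in for
    the contribution of the posting under the system's ranking function.
    """
    pruned = {}
    for term, postings in index.items():
        # Always keep at least one posting so no term disappears entirely.
        k = max(1, math.ceil(len(postings) * keep_fraction))
        # Sort by descending score, breaking ties by element id.
        ranked = sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))
        pruned[term] = dict(ranked[:k])
    return pruned

# Hypothetical scores for a tiny index.
index = {
    "xml": {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3},
    "okapi": {"a": 0.2},
}
print(prune_postings(index, 0.5))
```

A pruning level of 70%, as mentioned for document vectors in the abstract, would correspond to `keep_fraction=0.3` in this sketch.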
Okapi-based XML indexing
Purpose
Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing. Clear use cases for structured search in XML have been established. However, most of the research in the area is either based on relational database systems or specialized semi-structured data management systems. This paper aims to propose a method for XML indexing based on the information retrieval (IR) system Okapi.
Design/methodology/approach
First, the paper reviews the structure of inverted files and gives an overview of the issues of why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. Then the paper explores a revised method implemented on Okapi using path indexing structures. The paper evaluates these index structures through the metrics of indexing run time, path search run time and space costs using the INEX and Reuters RVC1 collections.
Findings
Initial results on the INEX collections show that there is a substantial overhead in space costs for the method, but this increase does not affect run time adversely. Indexing results on differing sized Reuters RVC1 sub-collections show that the increase in space costs with increasing the size of a collection is significant, but in terms of run time the increase is linear. Path search results show sub-millisecond run times, demonstrating minimal overhead for XML search.
Practical implications
Overall, the results show the method implemented to support XML search in a traditional IR system such as Okapi is viable.
Originality/value
The paper provides useful information on a method for XML indexing based on the IR system Okapi.
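To make the notion of a path indexing structure concrete, here is a minimal sketch that extends an inverted file so each posting is keyed by both a term and the element path in which it occurs. It does not reproduce Okapi's data structures: the names `build_path_index` and `search`, the path format and the sample documents are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_path_index(docs):
    """docs: doc_id -> XML text. Returns (path, term) -> set of doc_ids."""
    postings = defaultdict(set)
    for doc_id, xml_text in docs.items():
        root = ET.fromstring(xml_text)
        stack = [(root, "/" + root.tag)]
        while stack:
            elem, path = stack.pop()
            # Index the text directly under this element against its path.
            for term in (elem.text or "").lower().split():
                postings[(path, term)].add(doc_id)
            for child in elem:
                stack.append((child, path + "/" + child.tag))
    return postings

def search(postings, path, term):
    """Return the sorted doc_ids in which `term` occurs under `path`."""
    return sorted(postings.get((path, term.lower()), set()))

# Hypothetical two-document collection.
docs = {
    "d1": "<article><title>Okapi indexing</title><body>XML search</body></article>",
    "d2": "<article><title>XML indexing</title><body>Okapi search</body></article>",
}
postings = build_path_index(docs)
print(search(postings, "/article/title", "xml"))
```

Keying postings by path multiplies the number of index entries, which is consistent with the space overhead the abstract reports, while a structural lookup stays a single dictionary access.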
An Exponentiation Method for XML Element Retrieval
XML documents are now widely used for modelling and storing structured documents. Their structure is very rich and carries important information about contents and their relationships, for example in e-commerce. Data-centric XML collections require queries that allow users to specify constraints on the document structure; mapping structural queries and assigning weights are significant for determining the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system, which we call the Exponentiation function, that supports the combination of structural and content queries in the form of content-and-structure queries. The structural information is shown to improve the effectiveness of the search system by up to 52.60% over the BM25 baseline at MAP.
A Hybrid Chinese Information Retrieval Model
A distinctive feature of Chinese text is that a Chinese document is a sequence of Chinese characters with no space or boundary between Chinese words. This feature makes Chinese information retrieval more difficult, since a retrieved document that contains the query term as a sequence of Chinese characters may not be really relevant to the query: the query term (as a sequence of Chinese characters) may not be a valid Chinese word in that document. On the other hand, a document that is actually relevant may not be retrieved because it does not contain the query sequence but contains other relevant words. In this research, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques with the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed to rank retrieved documents based on their relevancy to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant if it incorporates only Chinese segmentation with the traditional character-based approach.
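The problem described above, and one way of combining the two kinds of evidence, can be sketched briefly. The abstract does not give the two ranking methods, so the linear interpolation with weight `alpha`, the function names and the hand-segmented example documents below are all assumptions for illustration. The query 和服 ("kimono") occurs as a character sequence in both documents, but it is a valid word in only one of them.

```python
def char_score(query, text):
    """Character-based evidence: occurrences of the query as a raw substring."""
    return text.count(query)

def word_score(query, words):
    """Word-based evidence: occurrences of the query as a segmented word."""
    return words.count(query)

def hybrid_score(query, text, words, alpha=0.5):
    """Interpolate the two scores; alpha is a hypothetical mixing weight."""
    return alpha * char_score(query, text) + (1 - alpha) * word_score(query, words)

query = "和服"
# Each document: (raw text, words from a segmenter; segmented by hand here).
doc_a = ("他说的和服务员说的不一样",
         ["他", "说", "的", "和", "服务员", "说", "的", "不", "一样"])
doc_b = ("她穿着和服", ["她", "穿着", "和服"])

for name, (text, words) in {"doc_a": doc_a, "doc_b": doc_b}.items():
    print(name, char_score(query, text), word_score(query, words),
          hybrid_score(query, text, words))
```

A purely character-based ranker cannot separate the two documents, whereas the word-based component lowers the score of the document in which the characters span a word boundary.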