The State-of-the-arts in Focused Search
The continuous influx of text data on the Web requires search engines to improve their ability to retrieve more specific information. The need for results relevant to a user's topic of interest has moved beyond searching for domain- or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange. It enables focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.
An Exponentiation Method for XML Element Retrieval
XML is now widely used for modelling and storing structured documents. The structure is very rich and carries important information about contents and their relationships, for example in e-Commerce. XML data-centric collections require query terms that allow users to specify constraints on the document structure; mapping structure queries and assigning weights are significant for determining the set of possibly relevant documents with respect to structural conditions. In this paper, we present an extension to the MEXIR search system, which we call the Exponentiation function, that supports the combination of structural and content queries in the form of content-and-structure queries. It has been shown that structural information improves the effectiveness of the search system by up to 52.60% over the BM25 baseline in terms of MAP.
Okapi-based XML indexing
Purpose
Being an important data exchange and information storage standard, XML has generated a great deal of interest and particular attention has been paid to the issue of XML indexing. Clear use cases for structured search in XML have been established. However, most of the research in the area is either based on relational database systems or specialized semi-structured data management systems. This paper aims to propose a method for XML indexing based on the information retrieval (IR) system Okapi.
Design/methodology/approach
First, the paper reviews the structure of inverted files and gives an overview of the issues of why this indexing mechanism cannot properly support XML retrieval, using the underlying data structures of Okapi as an example. Then the paper explores a revised method implemented on Okapi using path indexing structures. The paper evaluates these index structures through the metrics of indexing run time, path search run time and space costs using the INEX and Reuters RVC1 collections.
Findings
Initial results on the INEX collections show that there is a substantial overhead in space costs for the method, but this increase does not affect run time adversely. Indexing results on differing sized Reuters RVC1 sub-collections show that the increase in space costs with increasing the size of a collection is significant, but in terms of run time the increase is linear. Path search results show sub-millisecond run times, demonstrating minimal overhead for XML search.
Practical implications
Overall, the results show the method implemented to support XML search in a traditional IR system such as Okapi is viable.
Originality/value
The paper provides useful information on a method for XML indexing based on the IR system Okapi.
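As a rough illustration of the idea of extending an inverted file with path information, the following minimal sketch stores, for each term, the document and element path in which it occurs. It is not Okapi's actual data structure or the paper's method; the names `build_path_index` and `search`, the posting layout, and the two sample documents are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_path_index(docs):
    """Map each term to a set of (doc_id, element_path) postings.

    Hypothetical layout for illustration; a plain inverted file would
    keep only doc_id and so could not answer path-constrained queries.
    """
    index = defaultdict(set)

    def walk(doc_id, elem, path):
        here = f"{path}/{elem.tag}"
        # Only the text directly inside the element is indexed;
        # tail text after child elements is ignored in this sketch.
        for term in (elem.text or "").lower().split():
            index[term].add((doc_id, here))
        for child in elem:
            walk(doc_id, child, here)

    for doc_id, xml_text in docs.items():
        walk(doc_id, ET.fromstring(xml_text), "")
    return index


def search(index, term, path_suffix=""):
    """Return sorted postings for term whose path ends with path_suffix."""
    postings = index.get(term.lower(), ())
    return sorted(p for p in postings if p[1].endswith(path_suffix))


# Assumed sample documents, not taken from INEX or the Reuters collection.
docs = {
    "d1": "<article><title>XML indexing</title><body><p>inverted files</p></body></article>",
    "d2": "<article><title>Okapi</title><body><p>XML retrieval</p></body></article>",
}
index = build_path_index(docs)
print(search(index, "xml"))            # content-only lookup
print(search(index, "xml", "/title"))  # content plus path constraint
```

Storing a path with every posting is what makes the structural constraint answerable, and it also suggests why such an index takes more space than a plain inverted file.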
A survey on tree matching and XML retrieval
With the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether in an implicit or explicit way. Although retrieving XML documents can be considered as a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed in graph theory. In this paper, we study the theoretical approaches proposed in the literature for tree matching and examine how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study allows us to highlight theoretical aspects of graph theory that have not yet been explored in XML retrieval.
Interactive Information Retrieval with Structured Documents
In recent years there has been a growing realisation in the IR community that the interaction of searchers with information is an indispensable component of the IR process. As a result, issues relating to interactive IR have been extensively investigated in the last decade. This research has been performed in the context of unstructured documents or in the context of the loosely-defined structure encountered in web pages. XML documents, on the other hand, define a different context, by offering the possibility of navigating within the structure of a single document, or of following links to other documents.
Relatively little work has been carried out to study user interaction with IR systems that make use of the additional features offered by XML documents. As part of the INEX initiative for the evaluation of XML retrieval, the INEX interactive track has focused on interactive XML retrieval since 2004. Here, a user-friendly presentation of various features of XML documents is provided, and some new features are designed and implemented to give searchers efficient access to the information they need.
In this study interaction entails three levels: query formulation, inspecting the result list, and examining the detail. For query formulation, suggesting related terms is a conventional method to assist searchers. Here we investigate related terms derived from two different co-occurrence units: elements and documents. In addition, a contextual aspect is added to help searchers select appropriate terms. Results showed the usefulness of suggesting related terms and some acceptance of the contextual related-terms tool.
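The difference between the two co-occurrence units can be sketched as follows. This is an illustrative example under simple assumptions (raw co-occurrence counts, a toy corpus, and a hypothetical `related_terms` helper); the study's actual term-suggestion method and weighting are not described here.

```python
from collections import Counter


def related_terms(units, query_term, top_n=3):
    """Count terms that share a unit with query_term (hypothetical helper).

    `units` is a list of token lists; each unit is either one element
    or one whole document, depending on the chosen co-occurrence unit.
    """
    counts = Counter()
    for unit in units:
        terms = set(unit)
        if query_term in terms:
            counts.update(terms - {query_term})
    return counts.most_common(top_n)


# Assumed toy corpus: each document is a list of elements (token lists).
corpus = [
    [["xml", "retrieval"], ["passage", "ranking"]],
    [["xml", "element", "retrieval"], ["xml", "index"]],
]

# Element-level units: every element counts on its own.
element_units = [elem for doc in corpus for elem in doc]
# Document-level units: all elements of a document are merged.
document_units = [[tok for elem in doc for tok in elem] for doc in corpus]

print(related_terms(element_units, "xml", top_n=10))
print(related_terms(document_units, "xml", top_n=10))
```

With element units, only terms from the same element are suggested; with document units, terms from anywhere in the document (here "passage" and "ranking") also become candidates.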
For inspecting the result list, classic document retrieval systems such as web search engines retrieve whole documents, and leave it to the searchers to collect their required information from a possibly lengthy text. In contrast, element retrieval aims at a focused view of information by pointing to the optimal access points of the document. A number of strategies have been investigated for presenting result lists.
For examining the detail of a document, traditionally the complete document is presented to a searcher, and here again the searcher has to put in effort to reach the required information. We investigated the use of additional support such as a table of contents along with the document detail. In addition, we investigated graphical representations of documents depicting their structure and the granularity of retrieved elements along with their estimated relevance. Here the table of contents was found to be a very useful feature for examining details.
In order to analyse searchers' interaction, a visualisation technique based on Tree Map was developed. It depicts the search interaction with an element retrieval system. A number of browsing strategies have been identified with the help of this tool.
The value of element retrieval for searchers was also evaluated, along with a comparison between two focused approaches, element retrieval and passage retrieval. The study suggests that searchers find elements useful for their tasks and that they locate much of the relevant information in specific elements rather than full documents. Sections, in particular, appear to be helpful.
In order to provide user-specific support, the system needs feedback from searchers, who in turn, are very reluctant to give this information explicitly. Therefore, we investigated to what extent the different features can be used as relevance predictors. Of the five features regarded, primarily the reading time is a useful relevance predictor. Overall, relevance predictors for structured documents seem to be much weaker than for the case of atomic documents.