884 research outputs found

    The State-of-the-arts in Focused Search

    Get PDF
    The continuous influx of various text data on the Web requires search engines to improve their retrieval abilities for more specific information. The need for relevant results to a user’s topic of interest has gone beyond search for domain or type specific documents to more focused result (e.g. document fragments or answers to a query). The introduction of XML provides a format standard for data representation, storage, and exchange. It helps focused search to be carried out at different granularities of a structured document with XML markups. This report aims at reviewing the state-of-the-arts in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It is concluded with highlight of open problems

    The State-of-the-arts in Focused Search

    Get PDF

    A survey on tree matching and XML retrieval

    Get PDF
    International audienceWith the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether in an implicit or explicit way. Although retrieving XML documents can be considered as a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed by the graph theory. In this paper, we aim at studying the theoretical approaches proposed in the literature for tree matching and at seeing how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study will allow us to highlight theoretical aspects of graph theory that have not been yet explored in XML retrieval

    Using Proximity and Tag Weights for Focused Retrieval in Structured Documents

    Get PDF
    International audienceFocused information retrieval is concerned with the retrieval of small units of information. In this context, the structure of the documents as well as the proximity among query terms have been found useful for improving retrieval effectiveness. In this article, we propose an approach combining the proximity of the terms and the tags which mark these terms. Our approach is based on a Fetch and Browse method where the fetch step is performed with BM25 and the browse step with a structure enhanced proximity model. In this way, the ranking of a document depends not only upon the existence of the query terms within the document but also upon the tags which mark these terms. Thus, the document tends to be highly relevant when query terms are close together and are emphasized by tags. The evaluation of this model on a large XML structured collection provided by the INEX 2010 XML IR evaluation campaign shows that the use of term proximity and structure improves the retrieval effectiveness of BM25 in the context of focused information retrieval

    Theoretical evaluation of XML retrieval

    Get PDF
    This thesis develops a theoretical framework to evaluate XML retrieval. XML retrieval deals with retrieving those document parts that specifically answer a query. It is concerned with using the document structure to improve the retrieval of information from documents by only delivering those parts of a document an information need is about. We define a theoretical evaluation methodology based on the idea of `aboutness' and apply it to XML retrieval models. Situation Theory is used to express the aboutness proprieties of XML retrieval models. We develop a dedicated methodology for the evaluation of XML retrieval and apply this methodology to five XML retrieval models and other XML retrieval topics such as evaluation methodologies, filters and experimental results

    Theoretical evaluation of XML retrieval

    Full text link

    Evaluation of effective XML information retrieval

    Get PDF
    XML is being adopted as a common storage format in scientific data repositories, digital libraries, and on the World Wide Web. Accordingly, there is a need for content-oriented XML retrieval systems that can efficiently and effectively store, search and retrieve information from XML document collections. Unlike traditional information retrieval systems where whole documents are usually indexed and retrieved as information units, XML retrieval systems typically index and retrieve document components of varying granularity. To evaluate the effectiveness of such systems, test collections where relevance assessments are provided according to an XML-specific definition of relevance are necessary. Such test collections have been built during four rounds of the INitiative for the Evaluation of XML Retrieval (INEX). There are many different approaches to XML retrieval; most approaches either extend full-text information retrieval systems to handle XML retrieval, or use database technologies that incorporate existing XML standards to handle both XML presentation and retrieval. We present a hybrid approach to XML retrieval that combines text information retrieval features with XML-specific features found in a native XML database. Results from our experiments on the INEX 2003 and 2004 test collections demonstrate the usefulness of applying our hybrid approach to different XML retrieval tasks. A realistic definition of relevance is necessary for meaningful comparison of alternative XML retrieval approaches. The three relevance definitions used by INEX since 2002 comprise two relevance dimensions, each based on topical relevance. We perform an extensive analysis of the two INEX 2004 and 2005 relevance definitions, and show that assessors and users find them difficult to understand. We propose a new definition of relevance for XML retrieval, and demonstrate that a relevance scale based on this definition is useful for XML retrieval experiments. Finding the appropriate approach to evaluate XML retrieval effectiveness is the subject of ongoing debate within the XML information retrieval research community. We present an overview of the evaluation methodologies implemented in the current INEX metrics, which reveals that the metrics follow different assumptions and measure different XML retrieval behaviours. We propose a new evaluation metric for XML retrieval and conduct an extensive analysis of the retrieval performance of simulated runs to show what is measured. We compare the evaluation behaviour obtained with the new metric to the behaviours obtained with two of the official INEX 2005 metrics, and demonstrate that the new metric can be used to reliably evaluate XML retrieval effectiveness. To analyse the effectiveness of XML retrieval in different application scenarios, we use evaluation measures in our new metric to investigate the behaviour of XML retrieval approaches under the following two scenarios: the ad-hoc retrieval scenario, exploring the activities carried out as part of the INEX 2005 Ad-hoc track; and the multimedia retrieval scenario, exploring the activities carried out as part of the INEX 2005 Multimedia track. For both application scenarios we show that, although different values for retrieval parameters are needed to achieve the optimal performance, the desired textual or multimedia information can be effectively located using a combination of XML retrieval approaches

    Focused Retrieval

    Get PDF
    Traditional information retrieval applications, such as Web search, return atomic units of retrieval, which are generically called ``documents''. Depending on the application, a document may be a Web page, an email message, a journal article, or any similar object. In contrast to this traditional approach, focused retrieval helps users better pin-point their exact information needs by returning results at the sub-document level. These results may consist of predefined document components~---~such as pages, sections, and paragraphs~---~or they may consist of arbitrary passages, comprising any sub-string of a document. If a document is marked up with XML, a focused retrieval system might return individual XML elements or ranges of elements. This thesis proposes and evaluates a number of approaches to focused retrieval, including methods based on XML markup and methods based on arbitrary passages. It considers the best unit of retrieval, explores methods for efficient sub-document retrieval, and evaluates formulae for sub-document scoring. Focused retrieval is also considered in the specific context of the Wikipedia, where methods for automatic vandalism detection and automatic link generation are developed and evaluated

    Proceedings of the 6th Dutch-Belgian Information Retrieval Workshop

    Get PDF
    • 

    corecore