8 research outputs found

    An In-Depth Analysis of Tags and Controlled Metadata for Book Search

    Book search for information needs that go beyond standard bibliographic data is far from a solved problem. Such complex information needs often cover a combination of different aspects, such as specific genres or plot elements, engagement or novelty. By design, subject information in controlled vocabularies is not always adequate in covering such complex needs, and social tags have been proposed as an alternative. In this paper we present a large-scale empirical comparison and in-depth analysis of the value of controlled vocabularies and tags for book retrieval using a test collection of over 2 million book records and over 330 real-world book information needs. We find that while tags and controlled vocabulary terms provide complementary performance, tags perform better overall. However, this is not due to a popularity effect; instead, tags are better at matching the language of regular users. Finally, we perform a detailed failure analysis and show, using tags and controlled vocabulary terms, that some request types are inherently more difficult to solve than others.

    POLIS: a probabilistic summarisation logic for structured documents

    PhD thesis. As the availability of structured documents, formatted in markup languages such as SGML, RDF, or XML, increases, retrieval systems increasingly focus on the retrieval of document-elements, rather than entire documents. Additionally, abstraction layers in the form of formalised retrieval logics have allowed developers to include search facilities in numerous applications, without the need for detailed knowledge of retrieval models. Although automatic document summarisation has been recognised as a useful tool for reducing the workload of information system users, very few such abstraction layers have been developed for the task of automatic document summarisation. This thesis describes the development of an abstraction logic for summarisation, called POLIS, which provides users (such as developers or knowledge engineers) with high-level access to summarisation facilities. Furthermore, POLIS allows users to exploit the hierarchical information provided by structured documents. The development of POLIS is carried out in a step-by-step way. We start by defining a series of probabilistic summarisation models, which provide weights to document-elements at a user-selected level. These summarisation models are those accessible through POLIS. The formal definition of POLIS is performed in three steps. We start by providing a syntax for POLIS, through which users/knowledge engineers interact with the logic. This is followed by a definition of the logic's semantics. Finally, we provide details of an implementation of POLIS. The final chapters of this dissertation are concerned with the evaluation of POLIS, which is conducted in two stages. Firstly, we evaluate the performance of the summarisation models by applying POLIS to two test collections, the DUC AQUAINT corpus and the INEX IEEE corpus. This is followed by application scenarios for POLIS, in which we discuss how POLIS can be used in specific IR tasks.

    Techniques for improving efficiency and scalability for the integration of information retrieval and databases

    PhD thesis. This thesis is on the integration of Information Retrieval (IR) and Databases (DB), with a particular focus on improving the efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data. Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals for text and data are expressed in probabilistic logical programs such as probabilistic relational algebra (PRA) or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we proposed three optimization techniques covering the logical and physical layers: scoring-driven query optimization using scoring expressions, query processing with a top-k incorporated pipeline, and indexing with a relational inverted index. Firstly, scoring expressions are proposed for expressing the scoring or probabilistic semantics of the implied scoring functions of PRA expressions, so that an efficient query execution plan can be generated by a rule-based scoring-driven optimizer. Secondly, to balance efficiency and effectiveness so as to improve query response time, we studied methods for incorporating top-k algorithms into the pipelined query execution engine of IR+DB systems. Thirdly, the proposed relational inverted index integrates an IR-style inverted index and a DB-style tuple-based index, and can be used to support efficient probability estimation and aggregation as well as conventional relational operations. Experiments were carried out to investigate the performance of the proposed techniques.
Experimental results showed that the efficiency and scalability of an IR+DB prototype were improved, and that the system can handle queries efficiently on considerably large data sets for a number of IR tasks.

    Modelling search and stopping in interactive information retrieval

    Searching for information when using a computerised retrieval system is a complex and inherently interactive process. Individuals during a search session may issue multiple queries, and examine a varying number of result summaries and documents per query. Searchers must also decide when to stop assessing content for relevance - or decide when to stop their search session altogether. Despite being such a fundamental activity, only a limited number of studies have explored stopping behaviours in detail, with a majority reporting that searchers stop because they decide that what they have found feels "good enough". Notwithstanding the limited exploration of stopping during search, the phenomenon is central to the study of Information Retrieval, playing a role in the models and measures that we employ. However, the current de facto assumption considers that searchers will examine k documents - examining up to a fixed depth. In this thesis, we examine searcher stopping behaviours under a number of different search contexts. We conduct and report on two user studies, examining how result summary lengths and a variation of search tasks and goals affect such behaviours. Interaction data from these studies are then used to ground extensive simulations of interaction, exploring a number of different stopping heuristics (operationalised as twelve stopping strategies). We consider how well the proposed strategies perform and match up with real-world stopping behaviours. As part of our contribution, we also propose the Complex Searcher Model, a high-level conceptual searcher model that encodes stopping behaviours at different points throughout the search process. Within the Complex Searcher Model, we also propose a new results page stopping decision point. From this new stopping decision point, searchers can obtain an impression of the page before deciding to enter or abandon it. 
Results presented and discussed demonstrate that searchers employ a range of different stopping strategies, with no strategy standing out in terms of performance and approximations offered. Stopping behaviours are clearly not fixed, but are rather adaptive in nature. This complex picture reinforces the idea that modelling stopping behaviour is difficult. However, simplistic stopping strategies, such as the frustration-based stopping strategy, do offer good performance and approximations. This strategy considers a searcher's tolerance to non-relevance. We also find that combination strategies - such as those combining a searcher's satisfaction with finding relevant material, and their frustration towards observing non-relevant material - consistently offer good approximations and performance. In addition, we demonstrate that the inclusion of the additional stopping decision point within the Complex Searcher Model provides significant improvements to performance over our baseline implementation. It also offers improvements to the approximations of real-world searcher stopping behaviours. This work motivates a revision of how we currently model the search process and demonstrates that different stopping heuristics need to be considered within the models and measures that we use in Information Retrieval. Measures should be reformed according to the stopping behaviours of searchers. A number of potential avenues for future exploration can also be considered, such as modelling the stopping behaviours of searchers individually (rather than as a population), and exploring a wider variety of stopping heuristics under different search contexts.
Despite the inherently difficult task that understanding and modelling the stopping behaviours of searchers represents, potential benefits of further exploration in this area will undoubtedly aid the searchers of future retrieval systems - with further work bringing about improved interfaces and experiences.
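
The contrast between the fixed-depth assumption and the frustration-based and combination strategies described in this abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the `tolerance` and `target` parameters, and the choice to count non-relevant results cumulatively (rather than consecutively) are all assumptions made for illustration, since the abstract does not give the exact definitions.

```python
from typing import Sequence


def fixed_depth_stop(judgements: Sequence[bool], k: int) -> int:
    """Fixed-depth assumption: the searcher examines up to k results."""
    return min(k, len(judgements))


def frustration_stop(judgements: Sequence[bool], tolerance: int) -> int:
    """Hypothetical frustration rule: stop once `tolerance` non-relevant
    results have been observed (cumulative count, an assumption)."""
    non_relevant = 0
    for examined, is_relevant in enumerate(judgements, start=1):
        if not is_relevant:
            non_relevant += 1
            if non_relevant >= tolerance:
                return examined
    return len(judgements)


def combined_stop(judgements: Sequence[bool], tolerance: int, target: int) -> int:
    """Hypothetical combination rule: stop when either `target` relevant
    results (satisfaction) or `tolerance` non-relevant results
    (frustration) have been observed, whichever comes first."""
    relevant = 0
    non_relevant = 0
    for examined, is_relevant in enumerate(judgements, start=1):
        if is_relevant:
            relevant += 1
        else:
            non_relevant += 1
        if relevant >= target or non_relevant >= tolerance:
            return examined
    return len(judgements)


# Illustrative relevance judgements for one ranked list (made-up data).
ranking = [True, False, False, True, False, False]

depth_fixed = fixed_depth_stop(ranking, k=5)
depth_frustrated = frustration_stop(ranking, tolerance=2)
depth_combined = combined_stop(ranking, tolerance=3, target=2)
```

Under these assumptions the stopping depth depends on what the searcher actually encounters in the ranking, whereas the fixed-depth rule returns the same depth regardless of the judgements.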