An In-Depth Analysis of Tags and Controlled Metadata for Book Search
Book search for information needs that go beyond standard bibliographic data is far from a solved problem. Such complex information needs often cover a combination of different aspects, such as specific genres or plot elements, engagement, or novelty. By design, subject information in controlled vocabularies is not always adequate for covering such complex needs, and social tags have been proposed as an alternative. In this paper we present a large-scale empirical comparison and in-depth analysis of the value of controlled vocabularies and tags for book retrieval, using a test collection of over 2 million book records and over 330 real-world book information needs. We find that while tags and controlled vocabulary terms provide complementary performance, tags perform better overall. However, this is not due to a popularity effect; instead, tags are better at matching the language of regular users. Finally, we perform a detailed failure analysis and show, using tags and controlled vocabulary terms, that some request types are inherently more difficult to solve than others.
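To make the kind of fielded matching involved concrete, here is a minimal sketch in which the same request is scored against a record's social tags and against its controlled-vocabulary subject headings. The record, request, and bag-of-words overlap score are illustrative assumptions, not the paper's actual test collection or retrieval model.

```python
from collections import Counter

def overlap_score(query_terms, field_terms):
    # Tokenise the field's terms and count how often query words appear.
    vocab = Counter(w for t in field_terms for w in t.lower().split())
    return sum(vocab[q.lower()] for q in query_terms)

# Hypothetical record fields and user request.
book = {
    "tags": ["dystopia", "page-turner", "unreliable narrator"],
    "subjects": ["Science fiction", "Totalitarianism -- Fiction"],
}
request = ["dystopia", "unreliable", "narrator"]

print(overlap_score(request, book["tags"]))      # 3: tags match user language
print(overlap_score(request, book["subjects"]))  # 0: controlled terms miss it
```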
Poetry: Identification, Entity Recognition, and Retrieval
Modern advances in natural language processing (NLP) and information retrieval (IR) provide the ability to automatically analyze, categorize, process, and search textual resources. However, generalizing these approaches remains an open problem: models that appear to understand certain types of data must be re-trained on other domains.
Often, models make assumptions about the length, structure, discourse model and vocabulary used by a particular corpus. Trained models can often become biased toward an original dataset, learning that – for example – all capitalized words are names of people or that short documents are more relevant than longer documents. As a result, small amounts of noise or shifts in style can cause models to fail on unseen data. The key to more robust models is to look at text analytics tasks on more challenging and diverse data.
Poetry is an ancient art form that is believed to pre-date writing and is still a key form of expression through text today. Some poetry forms (e.g., haiku and sonnets) have rigid structure but still break our traditional expectations of text. Other poetry forms drop punctuation and other rules in favor of expression.
Our contributions include a set of novel, challenging datasets that extend traditional tasks: a text classification task for which content features perform poorly, a named entity recognition task that is inherently ambiguous, and a retrieval corpus over the largest public collection of poetry ever released.
We begin by looking at poetry identification - the task of finding poetry within existing textual collections - and devise an effective method of extracting poetry based on how it is usually formatted within digitally scanned books, since content models do not generalize well. Then we work on the content of poetry: we construct a dataset of around 6,000 tagged spans that identify the people, places, organizations, and personified concepts within poetry. We show that cross-training with existing datasets based on news corpora helps modern models learn to recognize entities within poetry. Finally, we return to IR, and construct a dataset of queries and documents inspired by real-world data that exposes some of the key challenges of searching through poetry. Our work is the first significant effort to use poetry in these three tasks, and our datasets and models will provide strong baselines for new avenues of research on this challenging domain.
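As an illustration of the layout-based identification idea, here is a minimal sketch that flags a page as likely poetry when its non-empty lines are short and ragged compared with justified prose. The features and thresholds are hypothetical and far simpler than the extractor developed in the thesis.

```python
import statistics

def looks_like_poetry(page_lines, max_avg_len=45, min_ragged=0.35):
    """Heuristic layout-based poetry detector (illustrative thresholds).

    Poetry on scanned pages tends to have short, ragged lines that do
    not fill the column the way justified prose does.
    """
    lines = [ln.rstrip() for ln in page_lines if ln.strip()]
    if len(lines) < 4:
        return False  # too little text to judge
    lengths = [len(ln) for ln in lines]
    avg_len = statistics.mean(lengths)
    # Raggedness: fraction of lines noticeably shorter than the longest.
    longest = max(lengths)
    ragged = sum(1 for n in lengths if n < 0.7 * longest) / len(lines)
    return avg_len <= max_avg_len and ragged >= min_ragged
```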
POLIS: a probabilistic summarisation logic for structured documents
As the availability of structured documents, formatted in markup languages such as SGML, RDF, or XML, increases, retrieval systems increasingly focus on the retrieval of document elements rather than entire documents. Additionally, abstraction layers in the form of formalised retrieval logics have allowed developers to include search facilities in numerous applications without needing detailed knowledge of retrieval models.
Although automatic document summarisation has been recognised as a useful tool for reducing the workload of information system users, very few such abstraction layers have been developed for the task of automatic document summarisation. This thesis describes the development of an abstraction logic for summarisation, called POLIS, which provides users (such as developers or knowledge engineers) with high-level access to summarisation facilities. Furthermore, POLIS allows users to exploit the hierarchical information provided by structured documents.
The development of POLIS is carried out step by step. We start by defining a series of probabilistic summarisation models, which assign weights to document elements at a user-selected level; these summarisation models are those accessible through POLIS. The formal definition of POLIS is performed in three steps. We start by providing a syntax for POLIS, through which users and knowledge engineers interact with the logic. This is followed by a definition of the logic's semantics. Finally, we provide details of an implementation of POLIS.
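As a rough illustration of what assigning weights to document elements at a user-selected level can mean, the sketch below scores the paragraphs of a toy structured document by term overlap with the whole document and normalises the scores into a probability distribution. The weighting scheme and data are assumptions for illustration, not POLIS's actual models or syntax.

```python
from collections import Counter

# Toy structured document: sections containing paragraphs.
doc = {
    "sec1": ["retrieval of document elements", "probabilistic logics for search"],
    "sec2": ["summarisation reduces the workload of users"],
}

def element_weights(doc):
    """Assign a probabilistic weight to each paragraph-level element."""
    paras = [(sec, p) for sec, ps in doc.items() for p in ps]
    # Term statistics over the whole document.
    all_terms = Counter(w for _, p in paras for w in p.split())
    # Raw score: how strongly a paragraph's terms recur in the document.
    raw = {(sec, p): sum(all_terms[w] for w in p.split()) for sec, p in paras}
    total = sum(raw.values())
    return {k: v / total for k, v in raw.items()}  # weights sum to 1

weights = element_weights(doc)
summary = max(weights, key=weights.get)  # highest-weighted element as summary
```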
The final chapters of this dissertation are concerned with the evaluation of POLIS, which is conducted in two stages. Firstly, we evaluate the performance of the summarisation models by applying POLIS to two test collections, the DUC AQUAINT corpus and the INEX IEEE corpus. This is followed by application scenarios for POLIS, in which we discuss how POLIS can be used in specific IR tasks.
Techniques for improving efficiency and scalability for the integration of information retrieval and databases
This thesis is on the topic of the integration of Information Retrieval (IR) and Databases (DB), with a particular focus on improving the efficiency and scalability of integrated IR and DB technology (IR+DB). The main purpose of this study is to develop efficient and scalable techniques for supporting integrated IR and DB technology, which is a popular approach today for handling complex queries over text and structured data.
Our specific interest in this thesis is how to efficiently handle queries over large-scale text and structured data. The work is based on a technology that integrates probability theory and relational algebra, where retrievals over text and data are expressed as probabilistic logical programs in formalisms such as probabilistic relational algebra (PRA) or probabilistic Datalog. To support efficient processing of probabilistic logical programs, we propose three optimization techniques spanning the logical and physical layers: scoring-driven query optimization using scoring expressions, query processing with a top-k incorporated pipeline, and indexing with a relational inverted index.
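For readers unfamiliar with the paradigm, the sketch below shows the basic flavour of probabilistic relational algebra applied to retrieval: tuples carry probabilities, a join multiplies them under an independence assumption, and a projection aggregates them, here with a noisy-OR. These combination rules are common textbook choices and not necessarily the exact semantics implemented in the thesis.

```python
def pra_join(query, index):
    """Join weighted query terms with a (term, doc) relation; tuple
    probabilities multiply under an independence assumption."""
    return {
        (term, doc): p * q
        for term, p in query.items()
        for (t, doc), q in index.items()
        if t == term
    }

def pra_project_doc(joined):
    """Project onto the document attribute, combining the probabilities
    of independent evidence with a noisy-OR."""
    scores = {}
    for (_, doc), p in joined.items():
        scores[doc] = 1 - (1 - scores.get(doc, 0.0)) * (1 - p)
    return scores

query = {"ir": 1.0, "db": 0.7}  # weighted query terms
index = {("ir", "d1"): 0.8, ("db", "d1"): 0.5, ("ir", "d2"): 0.3}
print(pra_project_doc(pra_join(query, index)))  # {'d1': ~0.87, 'd2': 0.3}
```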
Specifically, scoring expressions are proposed for expressing the scoring or probabilistic semantics of the scoring functions implied by PRA expressions, so that efficient query execution plans can be generated by a rule-based, scoring-driven optimizer. Secondly, to balance efficiency and effectiveness and thereby improve query response time, we study methods for incorporating top-k algorithms into the pipelined query execution engine of IR+DB systems. Thirdly, the proposed relational inverted index integrates an IR-style inverted index with a DB-style tuple-based index, and can be used to support efficient probability estimation and aggregation as well as conventional relational operations.
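A minimal sketch of the relational-inverted-index idea, under an assumed (term, doc, tf) schema: the same stored tuples can serve IR-style posting-list lookups and DB-style relational operations such as selection and aggregation.

```python
from collections import defaultdict

class RelationalInvertedIndex:
    """Postings stored as tuples, usable from both IR and DB sides."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> [(doc, tf), ...]

    def add(self, term, doc, tf):
        self.postings[term].append((doc, tf))

    def lookup(self, term):
        """IR-style access: the posting list for one term."""
        return self.postings[term]

    def as_relation(self):
        """DB-style access: the index as a flat relation of
        (term, doc, tf) tuples, ready for selection, join, or
        aggregation."""
        return [(t, d, tf) for t, ps in self.postings.items() for d, tf in ps]

idx = RelationalInvertedIndex()
idx.add("ir", "d1", 3)
idx.add("db", "d1", 1)
# Aggregation over the relation view, e.g. collection frequency of "ir":
cf_ir = sum(tf for t, _, tf in idx.as_relation() if t == "ir")
```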
Experiments were carried out to investigate the performance of the proposed techniques. Experimental results showed that the efficiency and scalability of an IR+DB prototype were improved, and that the system can handle queries efficiently over considerably large data sets for a number of IR tasks.
Modelling search and stopping in interactive information retrieval
Searching for information when using a computerised retrieval system is a complex and inherently interactive process. During a search session, individuals may issue multiple queries and examine a varying number of result summaries and documents per query. Searchers must also decide when to stop assessing content for relevance - or when to stop their search session altogether. Although stopping is such a fundamental activity, only a limited number of studies have explored stopping behaviours in detail, with a majority reporting that searchers stop because they decide that what they have found feels "good enough". Notwithstanding the limited exploration of stopping during search, the phenomenon is central to the study of Information Retrieval, playing a role in the models and measures that we employ. However, the current de facto assumption is that searchers examine k documents - that is, up to a fixed depth.
In this thesis, we examine searcher stopping behaviours under a number of different search contexts. We conduct and report on two user studies, examining how result summary lengths and variations in search tasks and goals affect such behaviours. Interaction data from these studies are then used to ground extensive simulations of interaction, exploring a number of different stopping heuristics (operationalised as twelve stopping strategies). We consider how well the proposed strategies perform and how well they match real-world stopping behaviours. As part of our contribution, we also propose the Complex Searcher Model, a high-level conceptual searcher model that encodes stopping behaviours at different points throughout the search process. Within the Complex Searcher Model, we also propose a new results-page stopping decision point, at which searchers can obtain an impression of a results page before deciding to enter or abandon it.
Results presented and discussed demonstrate that searchers employ a range of different stopping strategies, with no single strategy standing out in terms of the performance and approximations offered. Stopping behaviours are clearly not fixed, but are rather adaptive in nature. This complex picture reinforces the idea that modelling stopping behaviour is difficult. However, simple stopping strategies do offer good performance and approximations, such as the frustration-based stopping strategy, which considers a searcher's tolerance to non-relevance. We also find that combination strategies - such as those combining a searcher's satisfaction with finding relevant material and their frustration at observing non-relevant material - consistently offer good approximations and performance. In addition, we demonstrate that the inclusion of the additional stopping decision point within the Complex Searcher Model provides significant improvements to performance over our baseline implementation, and improves the approximations of real-world searcher stopping behaviours.
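The strategies discussed above can be made concrete with a short sketch. Below, a fixed-depth baseline, a frustration-based rule (read here as a tolerance for consecutive non-relevant results), and a combined satisfaction-and-frustration rule are applied to a toy ranked list. The thresholds, and the "consecutive" reading of frustration, are illustrative assumptions rather than the thesis's exact operationalisations.

```python
def fixed_depth(results, k=10):
    """The de facto assumption: always examine the top k results."""
    return min(k, len(results))

def frustration_stop(results, tolerance=3):
    """Stop after `tolerance` consecutive non-relevant results."""
    misses = 0
    for depth, relevant in enumerate(results, start=1):
        misses = 0 if relevant else misses + 1
        if misses >= tolerance:
            return depth
    return len(results)

def satisfaction_frustration_stop(results, needed=3, tolerance=3):
    """Combination rule: stop when satisfied (enough relevant items
    found) or frustrated (too many consecutive non-relevant items)."""
    found = misses = 0
    for depth, relevant in enumerate(results, start=1):
        found += relevant
        misses = 0 if relevant else misses + 1
        if found >= needed or misses >= tolerance:
            return depth
    return len(results)

ranked = [True, False, False, True, False, False, False, True]
print(frustration_stop(ranked))               # stops at depth 7
print(satisfaction_frustration_stop(ranked))  # also stops at depth 7
```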
This work motivates a revision of how we currently model the search process and demonstrates that different stopping heuristics need to be considered within the models and measures that we use in Information Retrieval. Measures should be reformed according to the stopping behaviours of searchers. A number of potential avenues for future exploration can also be considered, such as modelling the stopping behaviours of searchers individually (rather than as a population), and exploring a wider variety of stopping heuristics under different search contexts. Despite the inherently difficult task that understanding and modelling the stopping behaviours of searchers represents, further exploration in this area will undoubtedly aid the searchers of future retrieval systems, with further work bringing about improved interfaces and experiences.