6 research outputs found

    Reducing long queries using query quality predictors

    Long queries frequently contain many extraneous terms that hinder retrieval of relevant documents. We present techniques to reduce long queries to more effective shorter ones that lack those extraneous terms. Our work is motivated by the observation that perfectly reducing long TREC description queries can lead to an average improvement of 30% in mean average precision. Our approach involves transforming the reduction problem into a problem of learning to rank all subsets of the original query (sub-queries) based on their predicted quality, and selecting the top sub-query. We use various measures of query quality described in the literature as features to represent sub-queries, and train a classifier. Replacing the original long query with the top-ranked sub-query chosen by the ranking classifier results in a statistically significant average improvement of 8% on our test sets. Analysis of the results shows that query reduction is well-suited for moderately-performing long queries, and that a small set of query quality predictors is well-suited for the task of ranking sub-queries.
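
The sub-query ranking idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the abstract describes a trained ranking classifier over many quality features, whereas this sketch substitutes a single hand-written predictor (mean inverse document frequency) as the score. The function names, the `doc_freq` table, and the collection size are all hypothetical.

```python
import math
from itertools import combinations


def sub_queries(terms):
    """Yield every non-empty subset of the query terms, keeping term order."""
    for size in range(1, len(terms) + 1):
        for combo in combinations(terms, size):
            yield combo


def avg_idf(sub_query, doc_freq, num_docs):
    """Hypothetical quality predictor: mean inverse document frequency.

    A trained ranker would combine many such predictors; this sketch
    uses one of them directly as the score.
    """
    idfs = [math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in sub_query]
    return sum(idfs) / len(idfs)


def reduce_query(terms, doc_freq, num_docs):
    """Return the sub-query with the highest predicted quality."""
    return max(sub_queries(terms), key=lambda sq: avg_idf(sq, doc_freq, num_docs))


# Made-up document frequencies for a 1000-document collection.
terms = ["the", "effects", "of", "aspirin"]
doc_freq = {"the": 990, "effects": 200, "of": 980, "aspirin": 10}
best = reduce_query(terms, doc_freq, num_docs=1000)
```

Enumerating all subsets grows as 2^n in the number of query terms, so a sketch like this is only practical for short inputs.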

    Managing tail latency in large scale information retrieval systems

    As both the availability of internet access and the prominence of smart devices continue to increase, data is being generated at a rate faster than ever before. This massive increase in data production comes with many challenges, including efficiency concerns for the storage and retrieval of such large-scale data. However, users have grown to expect the sub-second response times that are common in most modern search engines, creating a problem - how can such large amounts of data continue to be served efficiently enough to satisfy end users? This dissertation investigates several issues regarding tail latency in large-scale information retrieval systems. Tail latency corresponds to the high percentile latency that is observed from a system - in the case of search, this latency typically corresponds to how long it takes for a query to be processed. In particular, keeping tail latency as low as possible translates to a good experience for all users, as tail latency is directly related to the worst-case latency and hence, the worst possible user experience. The key idea in targeting tail latency is to move from questions such as "what is the median latency of our search engine?" to questions which more accurately capture user experience such as "how many queries take more than 200ms to return answers?" or "what is the worst case latency that a user may be subject to, and how often might it occur?" While various strategies exist for efficiently processing queries over large textual corpora, prior research has focused almost entirely on improvements to the average processing time or cost of search systems. As a first contribution, we examine some state-of-the-art retrieval algorithms for two popular index organizations, and discuss the trade-offs between them, paying special attention to the notion of tail latency. This research uncovers a number of observations that are subsequently leveraged for improved search efficiency and effectiveness. 
    We then propose and solve a new problem, which involves processing a number of related queries together, known as multi-queries, to yield higher quality search results. We experiment with a number of algorithmic approaches to efficiently process these multi-queries, and report on the cost, efficiency, and effectiveness trade-offs present with each. Ultimately, we find that some solutions yield a low tail latency, and are hence suitable for use in real-time search environments. Finally, we examine how predictive models can be used to improve the tail latency and end-to-end cost of a commonly used multi-stage retrieval architecture without impacting result effectiveness. By combining ideas from numerous areas of information retrieval, we propose a prediction framework which can be used for training and evaluating several efficiency/effectiveness trade-off parameters, resulting in improved trade-offs between cost, result quality, and tail latency.
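
The shift from median latency to tail latency described in this abstract can be made concrete with a small sketch. The nearest-rank percentile used here is one common definition, chosen as an assumption; the dissertation may measure tail latency differently. The function names and the sample latencies are invented for illustration.

```python
import math


def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def fraction_over(latencies_ms, threshold_ms):
    """Share of queries whose latency exceeds the threshold."""
    return sum(1 for x in latencies_ms if x > threshold_ms) / len(latencies_ms)


# Invented per-query latencies in milliseconds.
latencies = [20, 25, 30, 35, 40, 45, 50, 60, 250, 900]
median = percentile(latencies, 50)     # typical query
p99 = percentile(latencies, 99)        # tail latency
slow = fraction_over(latencies, 200)   # "how many queries take more than 200ms?"
```

In this invented sample the median looks healthy while the 99th percentile is dominated by a single slow query, which is the gap that tail-latency questions are meant to expose.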

    Automated Data Mapping Specifications via Schema Heuristics and User Interaction

    Data transformation problems are very common but they are challenging to implement for large, complex datasets. We describe a new approach for specifying data mapping transformations between XML schemas using a combination of automated schema analysis agents and selective user interaction. A graphical tool visualises parts of the two schemas to be mapped and a variety of agents analyse all or parts of the schema, voting on the likelihood of matching subsets. The user can confirm or reject suggestions, or even allow schema matches to be automatically determined, incrementally building up a fully-mapped schema. An implementation of the mapping specification can then be generated from the various inter-schema matches.

    Automated Data Mapping Specification via Schema Heuristics and User Interaction

    Data transformation problems are very common and are challenging to implement for large and complex datasets. We describe a new approach for specifying data mapping transformations between XML schemas using a combination of automated schema analysis agents and selective user interaction. A graphical tool visualises parts of the two schemas to be mapped and a variety of agents analyse all or parts of the schema, voting on the likelihood of matching subsets. The user can confirm or reject suggestions, or even allow schema matches to be automatically determined, incrementally building up to a fully-mapped schema. An implementation of the mapping specification can then be generated.