
    Document Distance for the Automated Expansion of Relevance Judgements for Information Retrieval Evaluation

    This paper reports the use of a document distance-based approach to automatically expand the number of available relevance judgements when these are limited and reduced to only positive judgements. This may happen, for example, when the only available judgements are extracted from a list of references in a published review paper. We compare the results on two document sets: OHSUMED, based on medical research publications, and TREC-8, based on news feeds. We show that evaluations based on these expanded relevance judgements are more reliable than those using only the initially available judgements, especially when the number of available judgements is very limited. Comment: SIGIR 2014 Workshop on Gathering Efficient Assessments of Relevance
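
    A minimal sketch of the general idea, not necessarily the exact procedure used in the paper: score unjudged documents by their similarity to the few positively judged ones and promote the nearest candidates to additional positive judgements. The TF-IDF representation, cosine similarity, and the fixed threshold below are illustrative assumptions.

```python
# Sketch: expand a small set of positive relevance judgements by document distance.
# TF-IDF cosine similarity and the threshold are illustrative choices, not
# necessarily those used in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def expand_judgements(judged_relevant, unjudged, threshold=0.35):
    """Return unjudged documents whose maximum similarity to any
    judged-relevant document exceeds the threshold."""
    corpus = judged_relevant + unjudged
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    rel_vecs = tfidf[: len(judged_relevant)]
    cand_vecs = tfidf[len(judged_relevant):]
    # similarity of each candidate to its closest judged-relevant document
    sims = cosine_similarity(cand_vecs, rel_vecs).max(axis=1)
    return [doc for doc, s in zip(unjudged, sims) if s >= threshold]

judged = ["randomised trial of aspirin therapy for prevention of myocardial infarction"]
pool = ["aspirin therapy reduces the risk of myocardial infarction in older adults",
        "survey of news recommendation algorithms"]
# the medically related candidate clears the threshold and becomes a new positive judgement
print(expand_judgements(judged, pool))
```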

    Unbiased Comparative Evaluation of Ranking Functions

    Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing k systems against a baseline, and ranking k systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half. Comment: Under review; 10 pages
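
    The core idea of viewing metric estimation as a Monte Carlo task with a known sampling distribution can be sketched as an inverse-propensity (Horvitz-Thompson style) estimate of precision@k from a sampled subset of judgments. The metric, the sampling distribution, and the names below are assumptions for illustration; the paper derives variance-optimized distributions for its specific comparison scenarios.

```python
# Sketch: inverse-propensity estimate of precision@k from sampled judgements.
import random

def sample_judgements(pool, probs, rng):
    """Judge each pooled document independently with its known inclusion probability."""
    return {d: rng.random() < probs[d] for d in pool}

def ht_precision_at_k(ranking, k, sampled, probs, relevance):
    """Unbiased estimate of precision@k using only the sampled judgements."""
    total = 0.0
    for d in ranking[:k]:
        if sampled.get(d):                    # judged in the sample
            total += relevance[d] / probs[d]  # reweight by inclusion probability
    return total / k

rng = random.Random(0)
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevance = {"d1": 1, "d2": 0, "d3": 1, "d4": 0, "d5": 1}  # ground truth (hidden in practice)
probs = {d: 0.6 for d in ranking}                           # known inclusion probabilities
estimates = []
for _ in range(2000):
    sampled = sample_judgements(ranking, probs, rng)
    estimates.append(ht_precision_at_k(ranking, 5, sampled, probs, relevance))
print(sum(estimates) / len(estimates))  # averages to roughly 0.6, the true precision@5
```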

    TRECVID as a Re-Usable Test-Collection for Video Retrieval

    TRECVID has been running as a video retrieval benchmarking platform for a number of years now. Some progress appears to have been made in the area of video retrieval, but it has also been shown that many of the differences in scores between the tested approaches are not statistically significant. This paper studies the reliability of the TRECVID search collections for measuring video retrieval effectiveness and investigates how useful the collections are for re-use.
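
    The reliability question hinges on whether observed score differences between systems are statistically significant. A common tool for this in IR evaluation is a paired randomisation test over per-topic scores, sketched below; the test and the invented scores are illustrative, not taken from the paper.

```python
# Sketch: paired randomisation (permutation) test on per-topic scores to check
# whether the difference between two systems' mean scores is significant.
import random

def randomisation_test(scores_a, scores_b, trials=10000, seed=0):
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b)) / len(scores_a)
    count = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the pair's system labels
                a, b = b, a
            diff += a - b
        if abs(diff) / len(scores_a) >= observed:
            count += 1
    return count / trials            # two-sided p-value

sys_a = [0.42, 0.10, 0.55, 0.31, 0.27, 0.60, 0.18, 0.44]  # hypothetical per-topic scores
sys_b = [0.39, 0.12, 0.50, 0.35, 0.25, 0.58, 0.20, 0.41]
print(randomisation_test(sys_a, sys_b))
```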

    Prioritizing relevance judgments to improve the construction of IR test collections

    We consider the problem of optimally allocating a fixed budget to construct a test collection with associated relevance judgements, such that it can (i) accurately evaluate the relative performance of the participating systems, and (ii) generalize to new, previously unseen systems. We propose a two-stage approach. For a given set of queries, we adopt the traditional pooling method and use a portion of the budget to evaluate a set of documents retrieved by the participating systems. Next, we prioritize the queries and associated documents for further refinement of the test collection. Our objective is to increase the effectiveness of the test collection for comparative evaluation and its extendibility to new systems. The query prioritization is formulated as a convex optimization problem, permitting an efficient solution and providing a flexible framework in which to incorporate various constraints. We use the remaining budget to evaluate the query-document pairs with the highest priority scores. The budgets for the initial and refinement phases are expended during the construction of the test collection, and both phases consider only documents that have been retrieved by the participating systems. We evaluate our resource optimization approach on two TREC test collections, namely TREC-8 and the TREC 2004 Robust Track. We demonstrate that our optimization techniques are cost-efficient and yield a significant improvement in the reusability of the test collections.
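
    The abstract frames query prioritization as a convex optimization under a judging budget. Below is a heavily simplified sketch of that kind of formulation: a linear-programming relaxation that selects query-document pairs to judge so as to maximise a priority score subject to the remaining budget. The priority scores, costs, and constraints are assumptions for illustration, not the paper's actual objective.

```python
# Sketch: LP relaxation of budget-constrained selection of query-document pairs
# for additional judging.  Priority scores are assumed given.
import cvxpy as cp
import numpy as np

priority = np.array([0.9, 0.7, 0.6, 0.4, 0.3, 0.1])  # hypothetical priority scores
cost = np.ones_like(priority)                         # one judgement = one unit of budget
budget = 3

x = cp.Variable(len(priority))                        # fraction of each pair to judge
problem = cp.Problem(
    cp.Maximize(priority @ x),
    [cost @ x <= budget, x >= 0, x <= 1],
)
problem.solve()
print(np.round(x.value, 2))  # the highest-priority pairs are selected up to the budget
```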

    Incorporating Clicks, Attention and Satisfaction into a Search Engine Result Page Evaluation Model

    Modern search engine result pages often provide immediate value to users and organize information in a way that is easy to navigate. The core ranking function contributes to this, and so do result snippets, smart organization of result blocks, and extensive use of one-box answers or side panels. While such features are useful to the user and help search engines stand out, they present two big challenges for evaluation. First, the presence of such elements on a search engine result page (SERP) may lead to the absence of clicks, which is, however, not related to dissatisfaction: so-called "good abandonments." Second, the non-linear layout and visual differences between SERP items may lead to non-trivial patterns of user attention, which are not captured by existing evaluation metrics. In this paper we propose a model of user behavior on a SERP that jointly captures click behavior, user attention and satisfaction, the CAS model, and demonstrate that it gives more accurate predictions of user actions and self-reported satisfaction than existing models based on clicks alone. We use the CAS model to build a novel evaluation metric that can be applied to non-linear SERP layouts and that can account for the utility that users obtain directly on a SERP. We demonstrate that this metric shows better agreement with user-reported satisfaction than conventional evaluation metrics. Comment: CIKM 2016, Proceedings of the 25th ACM International Conference on Information and Knowledge Management, 2016
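
    One way to read the proposed metric is as an expected utility over SERP items, where each item's contribution is its direct utility weighted by the probability that the user attends to it. The sketch below uses fixed per-item attention probabilities and utilities purely for illustration; in the CAS model these quantities come from a jointly trained click, attention and satisfaction model.

```python
# Sketch: an attention-weighted expected-utility metric for a (possibly
# non-linear) SERP layout.  Attention probabilities and utilities are assumed
# inputs here, not outputs of the actual CAS model.
def expected_serp_utility(items):
    """items: dicts with 'attention' (probability the user examines the item)
    and 'utility' (value gained directly on the SERP, e.g. from a snippet
    or answer box)."""
    return sum(it["attention"] * it["utility"] for it in items)

serp = [
    {"id": "answer_box", "attention": 0.95, "utility": 0.8},
    {"id": "result_1",   "attention": 0.80, "utility": 0.5},
    {"id": "side_panel", "attention": 0.40, "utility": 0.6},
    {"id": "result_2",   "attention": 0.55, "utility": 0.1},
]
print(expected_serp_utility(serp))
```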

    User performance versus precision measures for simple search tasks

    Several recent studies have demonstrated that the types of improvements in information retrieval system effectiveness reported in forums such as SIGIR and TREC do not translate into a benefit for users. Two of the studies used an instance recall task, and a third used a question answering task, so perhaps it is unsurprising that precision-based measures of IR system effectiveness on one-shot query evaluation do not correlate with user performance on these tasks. In this study, we evaluate two different information retrieval tasks on TREC Web-track data: a precision-based user task, measured by the length of time that users need to find a single document that is relevant to a TREC topic; and a simple recall-based task, represented by the total number of relevant documents that users can identify within five minutes. Users employ search engines with controlled mean average precision (MAP) of between 55% and 95%. Our results show that there is no significant relationship between system effectiveness measured by MAP and the precision-based task. A significant but weak relationship is present for the precision at one document returned metric. A weak relationship is present between MAP and the simple recall-based task.
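
    The analysis behind findings of this kind is, at its core, a correlation between a system-level effectiveness measure and a user-level outcome. A minimal sketch with invented numbers, correlating the controlled MAP levels against mean time to find the first relevant document (the study's actual data and statistics are not reproduced here):

```python
# Sketch: correlate controlled system MAP with a user-performance measure
# (mean seconds to find the first relevant document).  Numbers are invented.
from scipy.stats import spearmanr

map_levels = [0.55, 0.65, 0.75, 0.85, 0.95]
mean_time_to_first_relevant = [48.0, 51.5, 47.2, 49.8, 46.9]  # hypothetical

rho, p_value = spearmanr(map_levels, mean_time_to_first_relevant)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.2f}")
```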

    Active Sampling for Large-scale Information Retrieval Evaluation

    Evaluation is crucial in Information Retrieval. The development of models, tools and methods has significantly benefited from the availability of reusable test collections formed through a standardized and thoroughly tested methodology, known as the Cranfield paradigm. Constructing these collections requires obtaining relevance judgments for a pool of documents retrieved by the systems participating in an evaluation task, and thus involves immense human labor. To alleviate this effort, different methods for constructing collections have been proposed in the literature, falling under two broad categories: (a) sampling, and (b) active selection of documents. The former devises a smart sampling strategy by choosing only a subset of documents to be assessed and inferring evaluation measures on the basis of the obtained sample; the sampling distribution is fixed at the beginning of the process. The latter recognizes that the systems contributing documents to be judged vary in quality, and actively selects documents from good systems; the quality of systems is re-estimated every time a new document is judged. In this paper we seek to solve the problem of large-scale retrieval evaluation by combining the two approaches. We devise an active sampling method that avoids the bias of active selection methods towards good systems and, at the same time, reduces the variance of current sampling approaches by placing a distribution over systems that varies as judgments become available. We validate the proposed method using TREC data and demonstrate the advantages of the new method compared to past approaches.
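
    A rough sketch of the active-sampling idea: maintain a distribution over the contributing systems that is updated as judgments arrive, and draw the next document to judge from a system chosen by that distribution, so that every system remains sampleable while better systems receive more of the budget. The update rule and weighting below are simplifications assumed for illustration, not the paper's exact method (which also keeps the selection probabilities so that estimates can be corrected for the sampling).

```python
# Sketch: active sampling for pooled judging.  A distribution over systems is
# updated from the judgements collected so far; the next document to judge is
# the top unjudged document of a system drawn from that distribution.
import random

def active_sample(system_runs, true_relevance, budget, seed=0):
    rng = random.Random(seed)
    systems = list(system_runs)
    weights = {s: 1.0 for s in systems}      # start from a uniform distribution
    judged = {}
    for _ in range(budget):
        # draw a system proportionally to its current weight
        r, acc, chosen = rng.random() * sum(weights.values()), 0.0, systems[-1]
        for s in systems:
            acc += weights[s]
            if r <= acc:
                chosen = s
                break
        # judge the highest-ranked not-yet-judged document of that system
        doc = next((d for d in system_runs[chosen] if d not in judged), None)
        if doc is None:
            continue
        judged[doc] = true_relevance.get(doc, 0)
        # re-estimate each system's quality from the judgements so far
        for s in systems:
            hits = sum(judged.get(d, 0) for d in system_runs[s])
            weights[s] = 1.0 + hits          # smoothed, keeps every system sampleable
    return judged

runs = {"sysA": ["d1", "d2", "d3", "d4"], "sysB": ["d3", "d5", "d1", "d6"]}
qrels = {"d1": 1, "d3": 1, "d5": 0}          # hypothetical ground truth
print(active_sample(runs, qrels, budget=4))
```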