    Closing the loop: assisting archival appraisal and information retrieval in one sweep

    In this article, we examine the similarities between the concept of appraisal, a process that takes place within archives, and the concept of relevance judgement, a process fundamental to the evaluation of information retrieval systems. More specifically, we revisit selection criteria proposed as a result of archival research and work within the digital curation communities, and compare them to relevance criteria as discussed in information retrieval's literature-based discovery. We illustrate how closely these criteria relate to each other and discuss how understanding the relationships between these disciplines could form a basis for proposing automated selection for archival processes and for initiating multi-objective learning with respect to information retrieval.

    Analysis of change in users' assessment of search results over time

    We present the first systematic study of the influence of time on user judgements of the rankings and relevance grades of web search engine results. The goal of this study is to evaluate the change in user assessment of search results and to explore how users' judgements change. To this end, we conducted a large-scale user study with 86 participants who evaluated two different queries and four diverse result sets twice, with an interval of two months. To analyse the results, we investigate whether two types of patterns of user behaviour from the theory of categorical thinking hold for the evaluation of search results: (1) coarseness and (2) locality. To quantify these patterns, we devise two new measures of change in user judgements and distinguish between local changes (when users swap between close ranks or relevance values) and non-local changes. Two types of judgements were considered in this study: (1) relevance on a 4-point scale, and (2) ranking on a 10-point scale without ties. We found that users tend to change their judgements of the results over time in about 50% of cases for relevance and in 85% of cases for ranking. However, the majority of these changes were local.
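
    The abstract does not spell out the two measures of change; as a rough illustration, a change between two assessment sessions could be classified as local when the grade (or rank) moves by at most one step, and non-local otherwise. The function and locality threshold below are assumptions made for this sketch, not the authors' definitions.

        # Hypothetical sketch: classify changes between two assessment sessions.
        # "Local" is assumed here to mean a move of at most one grade or rank.
        def classify_changes(first, second, locality=1):
            """Count unchanged, local, and non-local judgement changes."""
            counts = {"unchanged": 0, "local": 0, "non_local": 0}
            for a, b in zip(first, second):
                delta = abs(a - b)
                if delta == 0:
                    counts["unchanged"] += 1
                elif delta <= locality:
                    counts["local"] += 1
                else:
                    counts["non_local"] += 1
            return counts

        # Example: 4-point relevance grades from the same user, two months apart.
        session_1 = [3, 2, 0, 1, 3, 2]
        session_2 = [3, 1, 0, 3, 3, 2]
        print(classify_changes(session_1, session_2))
        # {'unchanged': 4, 'local': 1, 'non_local': 1}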

    On Aggregating Labels from Multiple Crowd Workers to Infer Relevance of Documents

    We consider the problem of acquiring relevance judgements for information retrieval (IR) test collections through crowdsourcing when no true relevance labels are available. We collect multiple, possibly noisy relevance labels per document from workers of unknown labelling accuracy. We use these labels to infer the document relevance based on two methods. The first method is the commonly used majority voting (MV), which determines the document relevance based on the label that received the most votes, treating all the workers equally. The second is a probabilistic model that concurrently estimates the document relevance and the workers' accuracy using expectation maximization (EM). We run simulations and conduct experiments with crowdsourced relevance labels from the INEX 2010 Book Search track to investigate the accuracy and robustness of the relevance assessments to the noisy labels. We observe the effect of the derived relevance judgements on the ranking of the search systems. Our experimental results show that the EM method outperforms the MV method in the accuracy of relevance assessments and IR systems ranking. The performance improvements are especially noticeable when the number of labels per document is small and the labels are of varied quality.
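
    As a concrete illustration of the simpler of the two aggregation methods, a minimal majority-voting baseline might look like the sketch below; the EM alternative additionally estimates each worker's accuracy, which MV ignores. The data layout and tie handling here are assumptions for this example, not the paper's implementation.

        # Hedged sketch of majority voting (MV): each document's relevance is
        # the label chosen by the most workers, all workers weighted equally.
        from collections import Counter

        def majority_vote(labels_per_doc):
            """labels_per_doc: dict mapping doc_id -> list of worker labels (e.g. 0/1)."""
            judgements = {}
            for doc_id, labels in labels_per_doc.items():
                # most_common(1) breaks ties arbitrarily; a real pipeline would
                # need a deliberate tie-breaking rule (e.g. prefer non-relevant).
                judgements[doc_id] = Counter(labels).most_common(1)[0][0]
            return judgements

        crowd_labels = {
            "doc1": [1, 1, 0],     # two of three workers say relevant
            "doc2": [0, 0, 1, 0],
        }
        print(majority_vote(crowd_labels))   # {'doc1': 1, 'doc2': 0}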

    The Effects of Time Constraints and Document Excerpts on Relevance Assessing Behavior

    Assessors who judge the relevance of documents to search topics are a central part of Information Retrieval (IR) system evaluation. They play a significant role in building the test collections used in evaluations and system design. Relevance assessment is also highly important for e-discovery, where relevant documents and materials should be found at acceptable cost and in an efficient way. To better understand the relevance judging behaviour of assessors, we conducted a user study examining the effects of time constraints and document excerpts on judging behaviour. Participants were shown either full documents or document excerpts, which they had to judge within a 15, 30, or 60 second time constraint per document. To produce the document excerpts, or paragraph-long summaries, we used algorithms to extract what a model of relevance considers most relevant from a full document. We found that the quality of judging differs slightly across time constraints, but not significantly. While time constraints have little effect on the quality of judging, they can increase the judging speed of assessors. We also found that assessors perform as well, and in most cases better, when shown a paragraph-long document excerpt instead of a full document; excerpts therefore have the potential to replace full documents in relevance assessment. Since document excerpts are significantly faster to judge, we conclude that showing document excerpts or summaries to assessors can lead to better quality of judging with less cost and effort.
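
    The abstract does not name the relevance model used to produce the excerpts; a minimal sketch of the general idea, with a simple query-term-overlap score standing in for that model, might look like this.

        # Illustrative only: pick the paragraph that a stand-in relevance score
        # (query-term overlap) rates highest, as a paragraph-long excerpt.
        def best_excerpt(document_text, query_terms):
            """Return the paragraph with the highest query-term overlap."""
            paragraphs = [p for p in document_text.split("\n\n") if p.strip()]
            terms = {t.lower() for t in query_terms}

            def score(paragraph):
                return len(set(paragraph.lower().split()) & terms)

            return max(paragraphs, key=score)

        doc = ("An introductory paragraph about something unrelated.\n\n"
               "Relevance assessment under time constraints per document.")
        print(best_excerpt(doc, ["relevance", "time", "constraints"]))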

    Stopping methods for technology assisted reviews based on point processes

    Technology Assisted Review (TAR), which aims to reduce the effort required to screen collections of documents for relevance, is used to develop systematic reviews of medical evidence and to identify documents that must be disclosed in response to legal proceedings. Stopping methods are algorithms which determine when to stop screening documents during the TAR process, helping to ensure that workload is minimised while still achieving a high level of recall. This paper proposes a novel stopping method based on point processes, which are statistical models that can be used to represent the occurrence of random events. The approach uses rate functions to model the occurrence of relevant documents in the ranking and compares four candidates, including one that has not previously been used for this purpose (hyperbolic). Evaluation is carried out using standard datasets (CLEF e-Health, TREC Total Recall, TREC Legal), and this work is the first to explore stopping method robustness by reporting performance on a range of rankings of varying effectiveness. Results show that the proposed method achieves the desired level of recall without requiring an excessive number of documents to be examined in the majority of cases, and also compares well against multiple alternative approaches.
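
    The paper's exact rate functions and fitting procedure are not given in the abstract; as a hedged sketch of the general idea, one could fit a hyperbolic rate to the relevant documents seen so far, extrapolate the expected total over the full ranking, and stop once the target recall appears to be reached. The rate form, grid-search fit, and parameter ranges below are assumptions for illustration, not the paper's method.

        # Sketch only: model relevant-document occurrence down the ranking with
        # a decaying hyperbolic rate a / (x + b) and stop when the documents
        # screened so far cover the target share of the estimated total.
        import numpy as np

        def expected_relevant(x, a, b):
            # Integral of a / (t + b) from 0 to x: expected relevant docs by rank x.
            return a * np.log((x + b) / b)

        def fit_rate(ranks, cum_rel):
            """Coarse grid search for (a, b) minimising squared error."""
            best, best_err = (1.0, 1.0), float("inf")
            for a in np.linspace(0.1, 20.0, 80):
                for b in np.linspace(0.5, 200.0, 80):
                    err = np.sum((expected_relevant(ranks, a, b) - cum_rel) ** 2)
                    if err < best_err:
                        best, best_err = (a, b), err
            return best

        def should_stop(rel_flags, collection_size, target_recall=0.95):
            """rel_flags: 0/1 relevance of documents screened so far, in rank order."""
            ranks = np.arange(1, len(rel_flags) + 1, dtype=float)
            cum_rel = np.cumsum(rel_flags).astype(float)
            a, b = fit_rate(ranks, cum_rel)
            estimated_total = expected_relevant(collection_size, a, b)
            return cum_rel[-1] >= target_recall * estimated_total

        # Example: relevant documents thin out quickly in the ranking.
        screened = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
        print(should_stop(screened, collection_size=1000))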

    Quantifying test collection quality based on the consistency of relevance judgements

    Relevance assessments are a key component of test collection-based evaluation of information retrieval systems. This paper reports on a feature of such collections that is used as a form of ground truth data to allow analysis of human assessment error. A wide range of test collections are retrospectively examined to determine how accurately assessors judge the relevance of documents. Our results demonstrate a high level of inconsistency across the collections studied. The level of irregularity is shown to vary across topics, with some showing a very high level of assessment error. We investigate possible influences on the error and demonstrate that inconsistency in judging increases with time. While the level of detail in a topic specification does not appear to influence the errors that assessors make, judgements are significantly affected by the decisions made on previously seen similar documents. Assessors also display an assessment inertia. Alternative approaches to generating relevance judgements appear to reduce errors. A further investigation of the way that retrieval systems are ranked using sets of relevance judgements produced early and late in the judgement process reveals a consistent influence across the majority of examined test collections. We conclude that there is clear value in examining, and even inserting, ground truth data in test collections, and we propose ways to help minimise the sources of inconsistency when creating future test collections.
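
    As a hypothetical sketch of the kind of consistency analysis described above, disagreement between repeated judgements of the same (or duplicate) document for the same topic can be counted as assessment error; the data layout and example topic and document identifiers below are assumptions for illustration.

        # Hypothetical sketch: measure inconsistency from repeated judgements of
        # the same document for the same topic, counting disagreeing pairs.
        def inconsistency_rate(repeated_judgements):
            """repeated_judgements: dict mapping (topic, doc) -> list of labels."""
            pairs = disagreements = 0
            for labels in repeated_judgements.values():
                for i in range(len(labels)):
                    for j in range(i + 1, len(labels)):
                        pairs += 1
                        disagreements += int(labels[i] != labels[j])
            return disagreements / pairs if pairs else 0.0

        judgements = {
            ("topic-401", "docA"): [1, 1],      # judged consistently
            ("topic-401", "docB"): [1, 0],      # judged inconsistently
            ("topic-402", "docC"): [0, 0, 1],
        }
        print(f"{inconsistency_rate(judgements):.2f}")   # 0.60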