
    Creation of Reliable Relevance Judgments in Information Retrieval Systems Evaluation Experimentation through Crowdsourcing: A Review

    Test collections are used to evaluate information retrieval systems in laboratory-based evaluation experiments. In the classic setting, generating relevance judgments involves human assessors and is a costly and time-consuming task. Researchers and practitioners are still challenged to perform reliable and low-cost evaluations of retrieval systems. Crowdsourcing, as a novel method of data acquisition, is broadly used in many research fields, and it has been shown to be an inexpensive and quick solution as well as a reliable alternative for creating relevance judgments. One application of crowdsourcing in IR is judging the relevance of query-document pairs. For a crowdsourcing experiment to succeed, the relevance judgment tasks should be designed carefully with an emphasis on quality control. This paper explores the factors that influence the accuracy of relevance judgments made by workers and how to improve the reliability of judgments in crowdsourcing experiments.

    Crowdsourcing Relevance: Two Studies on Assessment

    Crowdsourcing has become an alternative approach for collecting relevance judgments at large scale. In this thesis, we focus on specific aspects related to time, scale, and agreement. First, we address the time factor in gathering relevance labels: we study how much time judges need to assess documents. We conduct a series of four experiments which unexpectedly reveal that introducing time limitations leads to benefits in the quality of the results. Furthermore, we discuss strategies for determining the right amount of time to make available to workers for relevance assessment, in order to guarantee both the high quality of the gathered results and the saving of valuable time and money. We then explore the application of magnitude estimation, a psychophysical scaling technique for the measurement of sensation, to relevance assessment. We conduct a large-scale user study across 18 TREC topics, collecting more than 50,000 magnitude estimation judgments, which turn out to be overall rank-aligned with ordinal judgments made by expert relevance assessors. We discuss the benefits, the reliability of the collected judgments, and the competitiveness in terms of assessor cost. We also report some preliminary results on agreement among judges. The results of crowdsourcing experiments are often affected by noise that can be ascribed to a lack of agreement among workers. This aspect should be considered, as it can affect the reliability of the gathered relevance labels as well as the overall repeatability of the experiments.
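
    The magnitude estimation study described above hinges on two routine steps: normalizing each worker's unbounded magnitude scores and checking how well the aggregated scores rank-align with ordinal expert judgments. The sketch below illustrates one common way to do this (per-worker geometric-mean normalization, median aggregation, Kendall's tau); the specific normalization and field names are assumptions for illustration, not the thesis's actual pipeline.

        # Minimal sketch: normalize magnitude estimation scores and check
        # rank alignment against ordinal expert judgments.
        import math
        from collections import defaultdict
        from statistics import median
        from scipy.stats import kendalltau

        def normalize_magnitudes(raw):
            """raw: list of (worker_id, doc_id, magnitude) with magnitude > 0.
            Divides each worker's estimates by that worker's geometric mean
            (a common psychophysics normalization), then takes the median
            across workers for each document."""
            by_worker = defaultdict(list)
            for worker, doc, mag in raw:
                by_worker[worker].append((doc, mag))
            per_doc = defaultdict(list)
            for items in by_worker.values():
                gmean = math.exp(sum(math.log(m) for _, m in items) / len(items))
                for doc, mag in items:
                    per_doc[doc].append(mag / gmean)
            return {doc: median(vals) for doc, vals in per_doc.items()}

        def rank_alignment(scores, ordinal):
            """Kendall's tau between aggregated magnitudes and ordinal grades
            (e.g., TREC 0/1/2) over the documents judged by both."""
            docs = sorted(set(scores) & set(ordinal))
            tau, _ = kendalltau([scores[d] for d in docs], [ordinal[d] for d in docs])
            return tau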

    The Effects of Time Constraints and Document Excerpts on Relevance Assessing Behavior

    Assessors who judge the relevance of documents to search topics and perform the relevance assessment process are one of the main components of Information Retrieval (IR) system evaluations. They play a significant role in building the test collections used in evaluations and system design. Relevance assessment is also highly important for e-discovery, where relevant documents and materials should be found at acceptable cost and in an efficient way. To better understand the relevance judging behavior of assessors, we conducted a user study to examine the effects of time constraints and document excerpts on judging behavior. Participants were shown either full documents or document excerpts that they had to judge within a 15, 30, or 60 second time constraint per document. To produce document excerpts, or paragraph-long summaries, we used algorithms that extract what a model of relevance considers most relevant from a full document. We found that the quality of judging differs slightly across time constraints, but not significantly. While time constraints have little effect on the quality of judging, they can increase the judging speed of assessors. We also found that assessors perform as well, and in most cases better, when shown a paragraph-long document excerpt instead of a full document; excerpts therefore have the potential to replace full documents in relevance assessment. Since document excerpts are significantly faster to judge, we conclude that showing document excerpts or summaries to assessors can lead to better quality of judging with less cost and effort.

    Perspectives on Large Language Models for Relevance Judgment

    When asked, current large language models (LLMs) like ChatGPT claim that they can assist us with relevance judgments, yet many researchers think this would not lead to credible IR research. In this perspective paper, we discuss possible ways for LLMs to assist human experts, along with the concerns and issues that arise. We devise a human-machine collaboration spectrum that allows different relevance judgment strategies to be categorized based on how much the human relies on the machine. For the extreme point of "fully automated assessment", we further include a pilot experiment on whether LLM-based relevance judgments correlate with judgments from trained human assessors. We conclude the paper by providing two opposing perspectives, for and against the use of LLMs for automatic relevance judgments, and a compromise perspective, informed by our analyses of the literature, our preliminary experimental evidence, and our experience as IR researchers. We hope to start a constructive discussion within the community and to avoid a stalemate during review, where work is damned if it uses LLMs for evaluation and damned if it doesn't.
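
    The "fully automated assessment" pilot described above boils down to collecting LLM labels for query-document pairs and measuring agreement with trained human assessors. The sketch below shows that shape of experiment; llm_judge() is a hypothetical placeholder for whatever model call is used, not an API from the paper, and Cohen's kappa is one reasonable agreement measure among several.

        # Minimal sketch: compare LLM relevance labels with human labels.
        from sklearn.metrics import cohen_kappa_score

        PROMPT = ("Query: {query}\nDocument: {doc}\n"
                  "Is the document relevant to the query? Answer 0 (no) or 1 (yes).")

        def llm_judge(query, doc):
            # Hypothetical placeholder: send PROMPT.format(query=query, doc=doc)
            # to an LLM and parse a 0/1 answer from the response.
            raise NotImplementedError

        def agreement_with_humans(pairs, human_labels):
            """pairs: list of (query, doc); human_labels: binary labels from
            trained assessors, aligned with pairs. Returns Cohen's kappa."""
            llm_labels = [llm_judge(q, d) for q, d in pairs]
            return cohen_kappa_score(human_labels, llm_labels)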

    A game theory approach for estimating reliability of crowdsourced relevance assessments

    In this article, we propose an approach to improving quality in crowdsourcing (CS) tasks that uses Task Completion Time (TCT) as a source of information about the reliability of workers in a game-theoretical competitive scenario. Our approach is based on the hypothesis that some workers are more risk-inclined and tend to gamble with their use of time when put in competition with other workers; this hypothesis is supported by our previous simulation study. We test our approach on 35 topics from experiments on the TREC-8 collection, with documents assessed as relevant or non-relevant by crowdsourced workers in both a competitive ("Game") and a non-competitive ("Base") scenario. We find that competition changes the distributions of TCT, making them sensitive to the quality (i.e., wrong or right) and outcome (i.e., relevant or non-relevant) of the assessments. We also test an optimal function of TCT as weights in a weighted majority voting scheme. From probabilistic considerations, we derive a theoretical upper bound for the weighted majority performance of cohorts of 2, 3, 4, and 5 workers, which we use as a criterion to evaluate the performance of our weighting scheme. We find that our approach achieves a remarkable performance, significantly closing the gap between the accuracy of the obtained relevance judgments and the upper bound. Since our approach takes advantage of TCT, a quantity available in any CS task, we believe it is cost-effective and can therefore be applied for quality assurance in crowdsourcing for micro-tasks.
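
    The core aggregation idea, weighting each worker's vote by a function of Task Completion Time, can be sketched as follows. The weight function shown is an illustrative assumption (penalizing implausibly fast answers, then decaying slowly), not the optimal function derived in the article.

        # Minimal sketch: TCT-weighted majority voting over relevance labels.
        from collections import defaultdict

        def tct_weight(tct_seconds):
            # Assumed shape: near-zero weight for implausibly fast answers,
            # then a slow decay for very long completion times.
            if tct_seconds < 3:
                return 0.2
            return 1.0 / (1.0 + 0.01 * tct_seconds)

        def weighted_majority(votes):
            """votes: list of (doc_id, label, tct_seconds) with label in
            {'relevant', 'non-relevant'}. Returns doc_id -> winning label."""
            totals = defaultdict(lambda: defaultdict(float))
            for doc, label, tct in votes:
                totals[doc][label] += tct_weight(tct)
            return {doc: max(labels, key=labels.get)
                    for doc, labels in totals.items()}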

    A Preference Judgment Interface for Authoritative Assessment

    For offline evaluation of information retrieval systems, preference judgments have been demonstrated to be a superior alternative to graded or binary relevance judgments. In contrast to graded judgments, where each document is assigned to a pre-defined grade level, with preference judgments assessors judge a pair of items presented side by side and indicate which is better. Unfortunately, preference judgments may require a larger number of judgments, even under an assumption of transitivity, and until recently they also lacked well-established evaluation measures. Previous studies have explored various evaluation measures and proposed different approaches to address the perceived shortcomings of preference judgments. These studies focused on crowdsourced preference judgments, where assessors may lack the training and time to make careful judgments; they did not consider the case where assessors have been trained and given the time to carefully consider differences between items. We review the literature on algorithms and strategies for extracting preference judgments, evaluation metrics, interface design, and the use of crowdsourcing. In this thesis, we design and build a new framework for preference judgment called JUDGO, with components designed for expert reviewers and researchers. We also propose a new heap-like preference judgment algorithm that assumes transitivity and tolerates ties. With the help of our framework, NIST assessors found the top-10 best items for each of the 38 topics of the TREC 2022 Health Misinformation Track, with more than 2,200 judgments collected. Our analysis shows that assessors frequently use the search box feature, which enables them to highlight their own keywords in documents, but are less interested in highlighting documents with the mouse. Based on additional feedback, we make some modifications to the initially proposed algorithm and to the highlighting features.
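
    The heap-like algorithm mentioned above is only described at a high level here, so the sketch below is a simplified stand-in: heap-based top-k selection driven by a pairwise preference oracle that may return ties, with each unordered pair judged at most once. It illustrates the general shape of the approach rather than the actual JUDGO algorithm.

        # Minimal sketch: top-k selection from pairwise preference judgments.
        import heapq
        from functools import cmp_to_key, lru_cache

        def top_k(items, prefer, k=10):
            """items: list of doc ids. prefer(a, b) -> 1 if a is preferred,
            -1 if b is preferred, 0 for a tie. Returns the k best items, best first."""
            @lru_cache(maxsize=None)
            def judged(a, b):    # cache so no pair is judged twice
                return prefer(a, b)

            def cmp(a, b):       # normalize the pair so (a, b) and (b, a) share one judgment
                if a == b:
                    return 0
                return judged(a, b) if a < b else -judged(b, a)

            # heapq.nlargest keeps a k-sized heap of the best items seen so far,
            # comparing them via the preference oracle; ties (cmp == 0) simply
            # leave the current ordering in place.
            return heapq.nlargest(k, items, key=cmp_to_key(cmp))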

    Novel Methods for Designing Tasks in Crowdsourcing

    Crowdsourcing is becoming more popular as a means of scalable data processing that requires human intelligence. The involvement of groups of people in accomplishing tasks can be an effective success factor for data-driven businesses. Unlike in other technical systems, the quality of the results depends on human factors and on how well crowd workers understand the requirements of the task. Looking at previous studies in this area, we found that one of the main factors affecting workers' performance is the design of the crowdsourcing tasks, and that previous studies of crowdsourcing task design covered a limited set of factors. The main contribution of this research is its focus on some of the less-studied technical factors, such as examining the effect of task ordering and class balance and measuring the consistency of the same task design over time and on different crowdsourcing platforms. Furthermore, this study extends the work towards understanding the workers' point of view on task quality and payment, by performing a qualitative study with crowd workers and shedding light on some of the ethical issues around payment for crowdsourcing tasks. To achieve our goal, we performed several crowdsourcing experiments on specific platforms and measured the factors that influenced the quality of the overall results.

    Studying Relevance Judging Behavior of Secondary Assessors

    Secondary assessors, individuals who do not originate search topics and are employed solely to judge the relevance of documents, have been found to differ in their relevance judgments. Their relevance judgments are used in constructing test collections, which play a significant role in evaluating search systems, and are also used in e-discovery to assist with locating relevant material. To a large extent, our existing understanding of secondary assessors' judging behavior is limited to quantitative measurements. The goal of this thesis is to better understand the relevance judging behavior of secondary assessors, and we conducted two user studies to achieve this objective. The first study, which forms the main part of this thesis, was a think-aloud study and provides what may be the first such qualitative study of secondary assessors' judging behavior. The second study was designed to capture the uncertainty in secondary assessors' relevance judgments. We also further examined the behavior of secondary assessors when judging multiple types of documents, based on data from the think-aloud study. The data obtained through the think-aloud method permitted us to gain more in-depth insight into secondary assessors' relevance judging behavior: we were able to directly listen to and note their thoughts during the assigned search tasks. Based on this data, we found that relevance judgments are made with differing levels of certainty, ranging from low to high. We also found that factors relating to the search topic, the document, and the assessor can each contribute to differing judgments. The think-aloud study also reveals preliminary evidence of how the amount of detail in a search topic's description influences the relevance judging behavior of secondary assessors. To capture the uncertainty in secondary assessors' relevance judgments, we designed four user interfaces for our second user study, with the objective of studying the uncertainty in secondary assessors' relevance judgments when the level of uncertainty is self-reported. We found that assessors tend to make high-certainty relevance judgments regardless of the consensus level of a document. When judging high-consensus documents, assessors' accuracy was lower for low-certainty relevance judgments, while high-certainty judgments were more accurate and tended to agree with NIST assessors. For low-consensus documents, we found assessors' accuracy to be low regardless of their certainty level. Finally, we found that assessors tend to spend less time when making high-certainty relevance judgments, regardless of the consensus level of the document. Further study of the behavior of secondary assessors when judging multiple types of documents identified that relevance judgments are occasionally based on incorrect perceptions. We show how factors such as lack of familiarity, lack of understanding of the search topic, absence of keywords, and other reasons can be a source not only of incorrect relevance judgments but also of correct ones. We also illustrate how the length of search topics and documents, and their level of difficulty, may further contribute to variation in the judgments. Our research contributes to a more extensive, meaningful understanding of the behavior of secondary assessors and establishes a foundation for future work on the impact of uncertainty in secondary assessors' relevance judgments. Our findings also show that assessor training and background, search topics, and document length should all be considered and given additional attention in order to obtain more reliable results.

    Offline Evaluation via Human Preference Judgments: A Dueling Bandits Problem

    The dramatic improvements in core information retrieval tasks engendered by neural rankers create a need for novel evaluation methods. If every ranker returns highly relevant items in the top ranks, it becomes difficult to recognize meaningful differences between them and to build reusable test collections. Several recent papers explore pairwise preference judgments as an alternative to traditional graded relevance assessments: rather than viewing items one at a time, assessors view items side by side and indicate the one that provides the better response to a query, allowing fine-grained distinctions. If we employ preference judgments to identify the likely best items for each query, we can measure rankers by their ability to place these items as high as possible. I frame the problem of finding the best items as a dueling bandits problem. While many papers explore dueling bandits for online ranker evaluation via interleaving, they have not been considered as a framework for offline evaluation via human preference judgments. I review the literature for possible solutions. For human preference judgments, any usable algorithm must tolerate ties, since two items may appear nearly equal to assessors, and it must minimize the number of judgments required for any specific pair, since each such comparison requires an independent assessor. Since the theoretical guarantees provided by most algorithms depend on assumptions that are not satisfied by human preference judgments, I simulate selected algorithms on representative test cases to provide insight into their practical utility. In contrast to the previous paper presented at SIGIR 2022 [87], this work includes more theoretical analysis and experimental results. Based on the simulations, two algorithms stand out for their potential. I proceed with the method of Clarke et al. [20], and the simulations suggest modifications to further improve its performance. Using the modified algorithm, we collect over 10,000 preference judgments for pools derived from submissions to the TREC 2021 Deep Learning Track, confirming its suitability. We test the idea of best-item evaluation and suggest ideas for further theoretical and practical progress.
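
    To give a concrete sense of the simulation setup described above: given a synthetic preference matrix with tie probabilities, run a candidate procedure, and record how many pairwise judgments it spends and whether it returns the true best item. The elimination rule below is illustrative only; it is not the method of Clarke et al. or any specific algorithm evaluated in the thesis.

        # Minimal sketch: simulate an elimination-style duel with ties.
        import random

        def duel(p_win, p_tie, i, j, rng):
            """Simulate one assessor judging the pair (i, j): returns i, j, or None (tie).
            p_win[i][j] is the probability i beats j; p_tie[i][j] the tie probability."""
            r = rng.random()
            if r < p_tie[i][j]:
                return None
            return i if r < p_tie[i][j] + p_win[i][j] else j

        def knockout_best(items, p_win, p_tie, rng, rounds=3):
            """Single elimination with ties: a tied pair is re-judged by fresh
            assessors up to `rounds` times, then resolved arbitrarily.
            Returns (winner, number_of_judgments_used)."""
            judgments = 0
            pool = list(items)
            rng.shuffle(pool)
            while len(pool) > 1:
                survivors = []
                for a, b in zip(pool[0::2], pool[1::2]):
                    winner = None
                    for _ in range(rounds):
                        judgments += 1
                        winner = duel(p_win, p_tie, a, b, rng)
                        if winner is not None:
                            break
                    survivors.append(winner if winner is not None else a)
                if len(pool) % 2:
                    survivors.append(pool[-1])
                pool = survivors
            return pool[0], judgments

        # Usage: rng = random.Random(0); knockout_best(range(8), p_win, p_tie, rng)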