26 research outputs found
Estimating Software Task Effort in Crowds
A key task during software maintenance is the refinement and elaboration of emerging software issues, such as feature implementations and bug resolutions. It includes the annotation of software tasks with additional information, such as criticality, assignee, and estimated cost of resolution. This paper reports on a first study investigating the feasibility of using crowd workers, supplied with limited information about an issue and its project, to provide comparably accurate estimates using planning poker. The paper describes our adaptation of planning poker to crowdsourcing and our initial trials. The results demonstrate the feasibility and potential efficiency of using crowds to deliver estimates. We also review the additional benefit that asking crowds for an estimate brings, in terms of further elaboration of the details of an issue. Finally, we outline our plans for a more extensive evaluation of planning poker in crowds.
BUOCA: Budget-Optimized Crowd Worker Allocation
Due to concerns about human error in crowdsourcing, it is standard practice
to collect labels for the same data point from multiple internet workers. We
here show that the resulting budget can be used more effectively with a
flexible worker assignment strategy that asks fewer workers to analyze
easy-to-label data and more workers to analyze data that requires extra
scrutiny. Our main contribution is to show how the allocations of the number of
workers to a task can be computed optimally based on task features alone,
without using worker profiles. Our target tasks are delineating cells in
microscopy images and analyzing the sentiment toward the 2016 U.S. presidential
candidates in tweets. We first propose an algorithm that computes
budget-optimized crowd worker allocation (BUOCA). We next train a machine
learning system (BUOCA-ML) that predicts an optimal number of crowd workers
needed to maximize the accuracy of the labeling. We show that the computed
allocation can yield large savings in the crowdsourcing budget (up to 49
percentage points) while maintaining labeling accuracy. Finally, we envisage a
human-machine system for performing budget-optimized data analysis at a scale
beyond the feasibility of crowdsourcing.
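The allocation idea above, fewer workers for easy items and more for items needing extra scrutiny under a fixed label budget, can be illustrated with a toy greedy allocator. This is an illustrative sketch only, not the authors' BUOCA algorithm; the `difficulty` scores and the marginal-value heuristic are assumptions.

```python
# Illustrative sketch (not the BUOCA algorithm itself): spread a fixed
# worker budget across tasks, giving harder tasks more redundant labels.
def allocate_workers(difficulty, budget, min_workers=1, max_workers=9):
    """difficulty: per-task scores in [0, 1]; returns workers per task."""
    n = len(difficulty)
    alloc = [min_workers] * n
    remaining = budget - min_workers * n
    # Greedily give each remaining worker to the task where an extra
    # label is heuristically most valuable (highest difficulty per
    # worker already assigned).
    while remaining > 0:
        best = max(range(n),
                   key=lambda i: (difficulty[i] / alloc[i]
                                  if alloc[i] < max_workers else -1.0))
        if alloc[best] >= max_workers:
            break  # every task is at the cap
        alloc[best] += 1
        remaining -= 1
    return alloc
```

For example, with difficulties `[0.1, 0.9, 0.5]` and a budget of 9 labels, the hardest task receives the most workers while the easy task keeps the minimum.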
Playing Planning Poker in Crowds: Human Computation of Software Effort Estimates
Reliable, cost-effective effort estimation remains a considerable challenge for software projects. Recent work has demonstrated that the popular Planning Poker practice can produce reliable estimates when undertaken within a software team of knowledgeable domain experts. However, the process depends on the availability of experts and can be time-consuming to perform, making it impractical for large-scale or open-source projects that may curate many thousands of outstanding tasks. This paper reports on a full study to investigate the feasibility of using crowd workers, supplied with limited information about a task, to provide comparably accurate estimates using Planning Poker. We describe the design of a Crowd Planning Poker (CPP) process implemented on Amazon Mechanical Turk and the results of a substantial set of trials, involving more than 5000 crowd workers and 39 diverse software tasks. Our results show that a carefully organised and selected crowd of workers can produce effort estimates of similar accuracy to those of a single expert.
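As a rough illustration of the aggregation step in a planning-poker round, the sketch below checks whether a round of card votes has converged and, if so, reports a consensus. The deck, the adjacency-based convergence rule, and the median consensus are assumptions for illustration, not the CPP design from the paper.

```python
# Hypothetical sketch of judging one planning-poker round: workers each
# pick a card; the round converges when all picks fall on adjacent
# cards, and the median vote serves as the consensus estimate.
from statistics import median

DECK = [1, 2, 3, 5, 8, 13, 20, 40, 100]  # a common planning-poker deck

def round_outcome(votes):
    """Return (converged, consensus_estimate) for one round of card votes."""
    idx = sorted(DECK.index(v) for v in votes)
    converged = idx[-1] - idx[0] <= 1  # all votes within adjacent cards
    return converged, median(votes)
```

A vote of `[5, 5, 8]` converges (adjacent cards) with consensus 5, while `[2, 8, 40]` would trigger another round of discussion and re-estimation.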
Perspectives on Large Language Models for Relevance Judgment
When asked, current large language models (LLMs) like ChatGPT claim that they
can assist us with relevance judgments. Many researchers think this would not
lead to credible IR research. In this perspective paper, we discuss possible
ways for LLMs to assist human experts along with concerns and issues that
arise. We devise a human-machine collaboration spectrum that allows
categorizing different relevance judgment strategies, based on how much the
human relies on the machine. For the extreme point of "fully automated
assessment", we further include a pilot experiment on whether LLM-based
relevance judgments correlate with judgments from trained human assessors. We
conclude the paper by providing two opposing perspectives - for and against the
use of LLMs for automatic relevance judgments - and a compromise perspective,
informed by our analyses of the literature, our preliminary experimental
evidence, and our experience as IR researchers.
We hope to start a constructive discussion within the community to avoid a
stalemate during review, where work is damned if it uses LLMs for evaluation
and damned if it doesn't.
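One simple way to quantify how well LLM judgments agree with trained human assessors, as in the pilot experiment mentioned above, is chance-corrected agreement. The sketch below computes Cohen's kappa on binary relevance labels; this is a common choice for inter-assessor agreement, not necessarily the measure used in the paper.

```python
# Minimal sketch: chance-corrected agreement (Cohen's kappa) between
# two raters' binary relevance labels, e.g. an LLM and a human assessor.
def cohens_kappa(a, b):
    """a, b: equal-length lists of 0/1 labels from two raters."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa1 = sum(a) / n                            # P(rater a says relevant)
    pb1 = sum(b) / n                            # P(rater b says relevant)
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)      # agreement expected by chance
    return (po - pe) / (1 - pe) if pe != 1 else 1.0
```

Identical label lists give kappa = 1, while agreement at chance level gives kappa = 0, which is why kappa is preferred over raw percentage agreement for skewed relevance distributions.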
Budget-Smoothed Analysis for Submodular Maximization
The greedy algorithm for monotone submodular function maximization subject to a cardinality constraint is guaranteed to approximate the optimal solution to within a 1-1/e factor. Although it is well known that this guarantee is essentially tight in the worst case (for greedy, and in fact for any efficient algorithm), experiments show that greedy performs better in practice. We observe that for many applications in practice, the empirical distribution of the budgets (i.e., cardinality constraints) is supported on a wide range, and moreover, all the existing hardness results in theory break under a large perturbation of the budget.
To understand the effect of the budget from both algorithmic and hardness perspectives, we introduce a new notion of budget-smoothed analysis. We prove that greedy is optimal for every budget distribution, and we give a characterization of the worst-case submodular functions. Based on these results, we show that, on the algorithmic side, under realistic budget distributions greedy and related algorithms enjoy provably better approximation guarantees that hold even for worst-case functions, and, on the hardness side, there exist hard functions that are fairly robust to all budget distributions.
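The greedy algorithm the abstract analyses can be stated in a few lines: repeatedly add the element with the largest marginal gain until the budget (cardinality constraint) k is exhausted. This is a generic sketch; the value oracle `f` is assumed to be monotone submodular.

```python
# Standard greedy for monotone submodular maximization under a
# cardinality constraint k; achieves a (1 - 1/e) approximation.
def greedy(ground_set, f, k):
    """ground_set: a set; f: frozenset-of-elements -> value; returns |S| <= k."""
    S = set()
    for _ in range(k):
        # Marginal gain of each remaining element given the current set.
        gains = {e: f(S | {e}) - f(S) for e in ground_set - S}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] <= 0:
            break  # monotone f: no element helps any more
        S.add(best)
    return S
```

Instantiated with a set-cover oracle, for example, this recovers the classic maximum-coverage greedy, one of the applications where the budget (k) distribution observed in practice matters.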
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
To evaluate Information Retrieval (IR) effectiveness, a possible approach is
to use test collections, which are composed of a collection of documents, a set
of descriptions of information needs (called topics), and a set of relevant
documents for each topic. Test collections are modelled in a competition
scenario: for example, in the well-known TREC initiative, participants run
their own retrieval systems over a set of topics and provide a ranked list of
retrieved documents; some of the retrieved documents (usually the highest
ranked) constitute the so-called pool, and their relevance is evaluated by
human assessors; these relevance judgments are then used to compute
effectiveness metrics and rank the participant systems. Private Web search
companies also run their own in-house evaluation exercises; although the
details are mostly unknown and the aims are somewhat different, the overall
approach shares several issues with the test collection approach.
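The pooling step described above can be sketched in a few lines: take the top-k documents from every participating system's ranked run and send their union to the human assessors. This is a minimal depth-k pooling illustration; the run names and document ids are hypothetical.

```python
# Minimal depth-k pooling sketch: the pool is the union of the top-k
# documents from each participating system's ranked list.
def build_pool(runs, k):
    """runs: dict mapping system name -> ranked list of doc ids."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])  # top-k documents from this run
    return pool
```

Documents outside the pool are conventionally treated as non-relevant, which is one of the cost/completeness trade-offs that cheaper evaluation methods, like those proposed here, try to address.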
The aim of this work is to: (i) develop and improve some state-of-the-art
work on the evaluation of IR effectiveness while saving resources, and (ii)
propose a novel, more principled and engineered overall approach to test
collection based effectiveness evaluation.
[...