A Unifying Framework for Combining Complementary Strengths of Humans and ML toward Better Predictive Decision-Making
Hybrid human-ML systems are increasingly in charge of consequential decisions
in a wide range of domains. A growing body of empirical and theoretical work
has advanced our understanding of these systems. However, existing empirical
results are mixed, and theoretical proposals are often mutually incompatible.
In this work, we propose a unifying framework for understanding conditions
under which combining the complementary strengths of humans and ML leads to
higher quality decisions than those produced by each of them individually -- a
state which we refer to as human-ML complementarity. We focus specifically on
the context of human-ML predictive decision-making and investigate optimal ways
of combining human and ML predictive decisions, accounting for the underlying
sources of variation in their judgments. Within this scope, we present two
crucial contributions. First, taking a computational perspective of
decision-making and drawing upon prior literature in psychology, machine
learning, and human-computer interaction, we introduce a taxonomy
characterizing a wide range of criteria across which human and machine
decision-making differ. Second, formalizing our taxonomy allows us to study how
human and ML predictive decisions should be aggregated optimally. We show that
our proposed framework encompasses several existing models of human-ML
complementarity as special cases. Finally, an initial exploratory analysis of
our framework yields a critical insight for future work in
human-ML complementarity: the mechanism by which we combine human and ML
judgments should be informed by the underlying causes of divergence in their
decisions.
Comment: 21 pages, 1 figure
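For intuition only, a minimal sketch of one such aggregation rule (not the paper's
formalism; the variances, variable names, and noise assumptions below are
illustrative): if human and ML predictions are treated as unbiased, independently
noisy estimates of the same quantity, inverse-variance weighting combines them into
a prediction with lower error than either alone.

```python
# Minimal sketch (hypothetical, not the paper's framework): combine a human
# and an ML prediction of the same quantity by inverse-variance weighting.
# Assumes both are unbiased with independent noise of known variance.
import numpy as np

rng = np.random.default_rng(0)
truth = 2.0
var_human, var_ml = 1.0, 0.25          # assumed noise variances

n = 100_000
human_pred = truth + rng.normal(0.0, np.sqrt(var_human), n)
ml_pred = truth + rng.normal(0.0, np.sqrt(var_ml), n)

# Optimal weight on the human prediction under these assumptions.
w = (1 / var_human) / (1 / var_human + 1 / var_ml)
combined = w * human_pred + (1 - w) * ml_pred

for name, pred in [("human", human_pred), ("ml", ml_pred), ("combined", combined)]:
    print(f"{name:8s} MSE = {np.mean((pred - truth) ** 2):.3f}")
# Expected: combined MSE ~= 0.20, below both 1.0 (human) and 0.25 (ML).
```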
Supporting Human-AI Collaboration in Auditing LLMs with LLMs
Large language models are becoming increasingly pervasive and ubiquitous in
society via deployment in sociotechnical systems. Yet these language models, be
it for classification or generation, have been shown to be biased and behave
irresponsibly, causing harm to people at scale. It is crucial to audit these
language models rigorously. Existing auditing tools leverage humans, AI, or
both to find failures. In this work, we draw upon literature in
human-AI collaboration and sensemaking, and conduct interviews with research
experts in safe and fair AI, to build upon the auditing tool: AdaTest (Ribeiro
and Lundberg, 2022), which is powered by a generative large language model
(LLM). Through the design process we highlight the importance of sensemaking
and human-AI communication to leverage complementary strengths of humans and
generative models in collaborative auditing. To evaluate the effectiveness of
the augmented tool, AdaTest++, we conduct user studies with participants
auditing two commercial language models: OpenAI's GPT-3 and Azure's sentiment
analysis model. Qualitative analysis shows that AdaTest++ effectively leverages
human strengths such as schematization, hypothesis formation and testing.
Further, with our tool, participants identified a variety of failure modes,
covering 26 different topics across 2 tasks, including failures previously
surfaced in formal audits as well as under-reported ones.
Comment: 21 pages, 3 figures
Cite-seeing and Reviewing: A Study on Citation Bias in Peer Review
Citations play an important role in researchers' careers as a key factor in
evaluation of scientific impact. Many anecdotes advise authors to exploit this
fact and cite prospective reviewers to try to obtain a more positive evaluation
for their submission. In this work, we investigate if such a citation bias
actually exists: Does the citation of a reviewer's own work in a submission
cause them to be positively biased towards the submission? In conjunction with
the review process of two flagship conferences in machine learning and
algorithmic economics, we execute an observational study to test for citation
bias in peer review. In our analysis, we carefully account for various
confounding factors such as paper quality and reviewer expertise, and apply
different modeling techniques to alleviate concerns regarding model mismatch.
Overall, our analysis involves 1,314 papers and 1,717 reviewers and
detects citation bias in both venues we consider. In terms of the effect size,
by citing a reviewer's work, a submission has a non-trivial chance of getting a
higher score from that reviewer: the expected increase in the score is
approximately 0.23 points on a 5-point Likert item. For reference, a one-point
increase in score from a single reviewer improves the position of a submission
by 11% on average.
Comment: 19 pages, 3 figures
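For intuition only (the study's actual analysis is more careful and uses the real
review data), the sketch below simulates review scores with a confounder and
recovers a citation effect by ordinary least squares; the 0.23 effect size, the
coefficients, and all variable names here are assumptions for illustration.

```python
# Illustrative sketch (hypothetical data, not the study's model): estimate the
# effect of a submission citing its reviewer on the review score while
# adjusting for a confounder such as paper quality.
import numpy as np

rng = np.random.default_rng(1)
n = 1_314                                  # number of papers, as in the abstract
quality = rng.normal(0.0, 1.0, n)          # confounder proxy (assumed observable here)
cites_reviewer = rng.binomial(1, 0.3, n)   # 1 if the submission cites the reviewer

# Assumed data-generating process: quality plus a 0.23-point citation effect.
score = 3.0 + 0.8 * quality + 0.23 * cites_reviewer + rng.normal(0.0, 0.5, n)

# OLS with an intercept, the citation indicator, and the quality control.
X = np.column_stack([np.ones(n), cites_reviewer, quality])
coef, *_ = np.linalg.lstsq(X, score, rcond=None)
print(f"estimated citation effect: {coef[1]:.2f}")   # close to the assumed 0.23
```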
To ArXiv or not to ArXiv: A Study Quantifying Pros and Cons of Posting Preprints Online
Double-blind conferences have engaged in debates over whether to allow
authors to post their papers online on arXiv or elsewhere during the review
process. Independently, some authors of research papers face the dilemma of
whether to put their papers on arXiv due to its pros and cons. We conducted a
study to ground this debate and dilemma in quantitative measurements.
Specifically, we surveyed reviewers in two top-tier double-blind
computer science conferences -- ICML 2021 (5361 submissions and 4699 reviewers)
and EC 2021 (498 submissions and 190 reviewers). Our two main findings are as
follows. First, more than a third of the reviewers self-report searching online
for a paper they are assigned to review. Second, outside the review process, we
find that preprints from better-ranked affiliations see a weakly higher
visibility, with a correlation of 0.06 in ICML and 0.05 in EC. In particular,
papers associated with the top-10-ranked affiliations had a visibility of
approximately 11% in ICML and 22% in EC, whereas the remaining papers had a
visibility of 7% and 18%, respectively.
Comment: 17 pages, 3 figures
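As a toy illustration of how such a correlation can be read (hypothetical numbers,
not the survey responses), one can compute the Pearson correlation between a binary
top-affiliation indicator and a binary "reviewer saw the preprint" outcome; the
rates below are rough assumptions matching the figures quoted above.

```python
# Toy sketch (hypothetical data, not the survey responses): correlation between
# a top-10-affiliation indicator and whether a reviewer reported having seen
# the paper's preprint (its "visibility").
import numpy as np

rng = np.random.default_rng(2)
n = 5_361                                   # ICML 2021 submission count from the abstract
top10 = rng.binomial(1, 0.15, n)            # assumed share of top-10-affiliation papers

# Assumed visibility rates roughly matching the abstract: ~11% vs ~7% in ICML.
p_seen = np.where(top10 == 1, 0.11, 0.07)
seen = rng.binomial(1, p_seen)

corr = np.corrcoef(top10, seen)[0, 1]
print(f"correlation between affiliation rank and visibility: {corr:.3f}")
# With these assumed rates the correlation comes out small, on the order of 0.05.
```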
DataPerf: Benchmarks for Data-Centric AI Development
Machine learning research has long focused on models rather than datasets,
and prominent datasets are used for common ML tasks without regard to the
breadth, difficulty, and faithfulness of the underlying problems. Neglecting
the fundamental importance of data has given rise to inaccuracy, bias, and
fragility in real-world applications, and research is hindered by saturation
across existing dataset benchmarks. In response, we present DataPerf, a
community-led benchmark suite for evaluating ML datasets and data-centric
algorithms. We aim to foster innovation in data-centric AI through competition,
comparability, and reproducibility. We enable the ML community to iterate on
datasets, instead of just architectures, and we provide an open, online
platform with multiple rounds of challenges to support this iterative
development. The first iteration of DataPerf contains five benchmarks covering
a wide spectrum of data-centric techniques, tasks, and modalities in vision,
speech, acquisition, debugging, and diffusion prompting, and we support hosting
new contributed benchmarks from the community. The benchmarks, online
evaluation platform, and baseline implementations are open source, and the
MLCommons Association will maintain DataPerf to ensure long-term benefits to
academia and industry.
Comment: NeurIPS 2023 Datasets and Benchmarks Track
A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity
Hybrid human-ML systems increasingly make consequential decisions in a wide
range of domains. These systems are often introduced with the expectation that
the combined human-ML system will achieve complementary performance, that is,
the combined decision-making system will be an improvement compared with
either decision-making agent in isolation. However, empirical results have
been mixed, and existing research rarely articulates the sources and
mechanisms by which complementary performance is expected to arise. Our goal
in this work is to provide conceptual tools to advance the way researchers
reason and communicate about human-ML complementarity. Drawing upon prior
literature in human psychology, machine learning, and human-computer
interaction, we propose a taxonomy characterizing distinct ways in which human
and ML-based decision-making can differ. In doing so, we conceptually map
potential mechanisms by which combining human and ML decision-making may yield
complementary performance, developing a language for the research community to
reason about the design of hybrid systems in any decision-making domain. To
illustrate how our taxonomy can be used to investigate complementarity, we
provide a mathematical aggregation framework to examine enabling conditions
for complementarity. Through synthetic simulations, we demonstrate how this
framework can be used to explore specific aspects of our taxonomy and shed
light on the optimal mechanisms for combining human and ML judgments.
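A minimal simulation in the spirit of such an aggregation exercise (all
distributions, noise levels, and the linear-combination rule below are
illustrative assumptions, not the paper's model): when human and ML judgments err
for different reasons, sweeping the mixing weight reveals a region where the
combination outperforms both agents in isolation.

```python
# Minimal simulation sketch (illustrative assumptions, not the paper's framework):
# human and ML predictions err for different reasons, and a convex combination
# can beat both, i.e. human-ML complementarity.
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
signal_ml = rng.normal(0.0, 1.0, n)       # structure the ML model captures well
signal_human = rng.normal(0.0, 1.0, n)    # contextual cue only the human observes
y = signal_ml + signal_human

ml_pred = signal_ml + rng.normal(0.0, 0.3, n)   # ML misses the human-only cue
human_pred = y + rng.normal(0.0, 1.2, n)        # human sees both cues, but noisily

def mse(pred):
    return np.mean((pred - y) ** 2)

weights = np.linspace(0.0, 1.0, 21)
combined_mse = [mse(w * human_pred + (1 - w) * ml_pred) for w in weights]
best_w = weights[int(np.argmin(combined_mse))]

print(f"human-alone MSE: {mse(human_pred):.2f}")
print(f"ml-alone MSE:    {mse(ml_pred):.2f}")
print(f"best combined MSE {min(combined_mse):.2f} at weight {best_w:.2f} on the human")
```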