Human-AI Interaction in the Presence of Ambiguity: From Deliberation-based Labeling to Ambiguity-aware AI
Ambiguity, the quality of being open to more than one interpretation, permeates our lives. It comes in different forms, including linguistic and visual ambiguity, arises for various reasons, and gives rise to disagreements among human observers that can be hard or impossible to resolve. As artificial intelligence (AI) is increasingly infused into complex domains of human decision making, it is crucial that the underlying AI mechanisms also support a notion of ambiguity. Yet, existing AI approaches typically assume that there is a single correct answer for any given input, lacking mechanisms to incorporate diverse human perspectives in various parts of the AI pipeline, including data labeling, model development and user interface design.
This dissertation aims to shed light on the question of how humans and AI can be effective partners in the presence of ambiguous problems. To address this question, we begin by studying group deliberation as a tool to detect and analyze ambiguous cases in data labeling. We present three case studies that investigate group deliberation in the context of different labeling tasks, data modalities and types of human labeling expertise.
First, we present CrowdDeliberation, an online platform for synchronous group deliberation in novice crowd work, and show how worker deliberation affects resolvability and accuracy in text classification tasks of varying subjectivity. We then translate our findings to the expert domain of medical image classification to demonstrate how imposing additional structure on deliberation arguments can improve the efficiency of the deliberation process without compromising its reliability. Finally, we present CrowdEEG, an online platform for collaborative annotation and deliberation of medical time series data, implementing an asynchronous and highly structured deliberation process. Our findings from an observational study with 36 sleep health professionals help explain how disagreements arise and when they can be resolved through group deliberation.
Beyond investigating group deliberation within data labeling, we also demonstrate how the resulting deliberation data can be used to support both human and artificial intelligence. To this end, we first present results from a controlled experiment with ten medical generalists, suggesting that reading deliberation data from medical specialists significantly improves generalists' comprehension and diagnostic accuracy on difficult patient cases. Second, we leverage deliberation data to simulate and investigate AI assistants that not only highlight ambiguous cases, but also explain the underlying sources of ambiguity to end users in human-interpretable terms. We provide evidence suggesting that this form of ambiguity-aware AI can help end users to triage and trust AI-provided data classifications.
We conclude by outlining the main contributions of this dissertation and directions for future research.
Curioscape: A Curiosity-driven Escape Room Board Game
Are you frustrated when a board game has too many rules? Do you want to jump straight into the game and just play? We created Curioscape, an escape room board game that explores whether a board game can do without a rule book. Players can start the game without having to learn rules or understand how the game works. This paper describes Curioscape from conception to release, explores replicating escape rooms in a smaller space, and investigates whether curiosity can be used to create meaningful game design choices.
SERC CREATE SWaGUR grant, Lennart Nacke’s NSERC Discovery Grant 2018-06576, the Canada Foundation for Innovation John R. Evans Leaders Fund 35819 “SURGE—The Stratford User Research and Gameful Experiences Lab,” Mitacs, and the Social Sciences and Humanities Research Council (SSHRC) Canada Grant 895-2011-1014 (IMMERSe)
Towards Conversational Diagnostic AI
At the heart of medicine lies the physician-patient dialogue, where skillful
history-taking paves the way for accurate diagnosis, effective management, and
enduring trust. Artificial Intelligence (AI) systems capable of diagnostic
dialogue could increase accessibility, consistency, and quality of care.
However, approximating clinicians' expertise is an outstanding grand challenge.
Here, we introduce AMIE (Articulate Medical Intelligence Explorer), a Large
Language Model (LLM) based AI system optimized for diagnostic dialogue.
AMIE uses a novel self-play based simulated environment with automated
feedback mechanisms for scaling learning across diverse disease conditions,
specialties, and contexts. We designed a framework for evaluating
clinically-meaningful axes of performance including history-taking, diagnostic
accuracy, management reasoning, communication skills, and empathy. We compared
AMIE's performance to that of primary care physicians (PCPs) in a randomized,
double-blind crossover study of text-based consultations with validated patient
actors in the style of an Objective Structured Clinical Examination (OSCE). The
study included 149 case scenarios from clinical providers in Canada, the UK,
and India, 20 PCPs for comparison with AMIE, and evaluations by specialist
physicians and patient actors. AMIE demonstrated greater diagnostic accuracy
and superior performance on 28 of 32 axes according to specialist physicians
and 24 of 26 axes according to patient actors. Our research has several
limitations and should be interpreted with appropriate caution. Clinicians were
limited to unfamiliar synchronous text-chat which permits large-scale
LLM-patient interactions but is not representative of usual clinical practice.
While further research is required before AMIE could be translated to
real-world settings, the results represent a milestone towards conversational
diagnostic AI.
Comment: 46 pages, 5 figures in main text, 19 figures in appendix
Towards Accurate Differential Diagnosis with Large Language Models
An accurate differential diagnosis (DDx) is a cornerstone of medical care,
often reached through an iterative process of interpretation that combines
clinical history, physical examination, investigations and procedures.
Interactive interfaces powered by Large Language Models (LLMs) present new
opportunities to both assist and automate aspects of this process. In this
study, we introduce an LLM optimized for diagnostic reasoning, and evaluate its
ability to generate a DDx alone or as an aid to clinicians. 20 clinicians
evaluated 302 challenging, real-world medical cases sourced from the New
England Journal of Medicine (NEJM) case reports. Each case report was read by
two clinicians, who were randomized to one of two assistive conditions: either
assistance from search engines and standard medical resources, or LLM
assistance in addition to these tools. All clinicians provided a baseline,
unassisted DDx prior to using the respective assistive tools. Our LLM for DDx
exhibited standalone performance that exceeded that of unassisted clinicians
(top-10 accuracy 59.1% vs 33.6%, [p = 0.04]). Comparing the two assisted study
arms, the DDx quality score was higher for clinicians assisted by our LLM
(top-10 accuracy 51.7%) compared to clinicians without its assistance (36.1%)
(McNemar's Test: 45.7, p < 0.01) and clinicians with search (44.4%) (4.75, p =
0.03). Further, clinicians assisted by our LLM arrived at more comprehensive
differential lists than those without its assistance. Our study suggests that
our LLM for DDx has potential to improve clinicians' diagnostic reasoning and
accuracy in challenging cases, meriting further real-world evaluation for its
ability to empower physicians and widen patients' access to specialist-level
expertise.
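The top-10 accuracy figures quoted above count a case as correct when the final diagnosis appears among the first ten entries of a differential list. A minimal sketch of that metric follows, under loud assumptions: the function name `top_k_accuracy`, the toy cases, and the use of exact string matching are all illustrative and are not taken from the study, where a match would more plausibly be a clinical judgment than a string comparison.

```python
def top_k_accuracy(ranked_lists, truths, k=10):
    """Fraction of cases whose true diagnosis is in the first k ranked entries.

    Assumption: a "hit" is an exact string match, which is a simplification
    made only for this illustration.
    """
    if len(ranked_lists) != len(truths):
        raise ValueError("ranked_lists and truths must have the same length")
    if not truths:
        raise ValueError("at least one case is required")
    hits = sum(truth in ranked[:k] for ranked, truth in zip(ranked_lists, truths))
    return hits / len(truths)


# Hypothetical toy cases, not data from the study.
ddx_lists = [
    ["sarcoidosis", "tuberculosis", "lymphoma"],
    ["migraine", "tension headache"],
    ["gout", "septic arthritis", "pseudogout"],
    ["asthma", "copd"],
]
final_diagnoses = ["lymphoma", "cluster headache", "gout", "copd"]

print(top_k_accuracy(ddx_lists, final_diagnoses, k=2))   # 0.5
print(top_k_accuracy(ddx_lists, final_diagnoses, k=10))  # 0.75
```

Widening k from 2 to 10 picks up the first case, whose correct diagnosis sits in third position, which is why top-k accuracy can only stay level or rise as k grows.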
Towards Generalist Biomedical AI
Medicine is inherently multimodal, with rich data modalities spanning text,
imaging, genomics, and more. Generalist biomedical artificial intelligence (AI)
systems that flexibly encode, integrate, and interpret this data at scale can
potentially enable impactful applications ranging from scientific discovery to
care delivery. To enable the development of these models, we first curate
MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses
14 diverse tasks such as medical question answering, mammography and
dermatology image interpretation, radiology report generation and
summarization, and genomic variant calling. We then introduce Med-PaLM
Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI
system. Med-PaLM M is a large multimodal generative model that flexibly encodes
and interprets biomedical data including clinical language, imaging, and
genomics with the same set of model weights. Med-PaLM M reaches performance
competitive with or exceeding the state of the art on all MultiMedBench tasks,
often surpassing specialist models by a wide margin. We also report examples of
zero-shot generalization to novel medical concepts and tasks, positive transfer
learning across tasks, and emergent zero-shot medical reasoning. To further
probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist
evaluation of model-generated (and human) chest X-ray reports and observe
encouraging performance across model scales. In a side-by-side ranking on 246
retrospective chest X-rays, clinicians express a pairwise preference for
Med-PaLM M reports over those produced by radiologists in up to 40.50% of
cases, suggesting potential clinical utility. While considerable work is needed
to validate these models in real-world use cases, our results represent a
milestone towards the development of generalist biomedical AI systems.
Investigating and Mitigating Biases in Crowdsourced Data
You are currently viewing a Conference Paper that was included in the November 2021 Good Systems Network Digest. Office of the VP for Research
Learning to Predict Population-Level Label Distributions
As machine learning (ML) plays an ever-increasing role in commerce, government, and daily life, reports of bias in ML systems against groups traditionally underrepresented in computing technologies have also increased. The problem appears to be extensive, yet it remains challenging even to fully assess its scope, let alone fix it. A fundamental reason is that ML systems are typically trained to predict one correct answer or set of answers; disagreements between the annotators who provide the training labels are resolved by either discarding minority opinions (which may correspond to demographic minorities or not) or presenting all opinions flatly, with no attempt to quantify how different answers might be distributed in society. Label distribution learning associates with each data item a probability distribution over the labels for that item. While such distributions may be representative of minority beliefs or not, they at least preserve diversities of opinion that conventional learning hides or ignores and represent a fundamental first step toward ML systems that can model diversity. We introduce a strategy for learning label distributions with only five to ten labels per item (a range that is typical of supervised learning datasets) by aggregating human-annotated labels over multiple, similarly rated data items. Our results suggest that specific label aggregation methods can help provide reliable, representative predictions at the population level.
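To make the idea of a per-item label distribution concrete, here is a minimal sketch that turns raw annotator labels into an empirical probability distribution and then pools labels across a group of items. Everything in it is an assumption made for illustration: the function names, the three-label set, the toy annotations, and especially the premise that the two items have already been judged "similarly rated". The abstract does not specify how items are grouped or which aggregation methods the authors use.

```python
from collections import Counter


def label_distribution(labels, label_set):
    """Empirical probability of each label among one item's annotations."""
    if not labels:
        raise ValueError("at least one annotation is required")
    counts = Counter(labels)
    return {label: counts[label] / len(labels) for label in label_set}


def pooled_distribution(items, label_set):
    """Distribution over the annotations of several items taken together.

    Assumption: the caller has already decided these items are similar
    enough to share one distribution; how to decide that is out of scope.
    """
    pooled = [label for labels in items for label in labels]
    return label_distribution(pooled, label_set)


# Hypothetical label set and annotations (five labels per item).
LABELS = ["positive", "neutral", "negative"]
item_a = ["positive", "positive", "neutral", "positive", "negative"]
item_b = ["positive", "neutral", "neutral", "positive", "positive"]

print(label_distribution(item_a, LABELS))
print(pooled_distribution([item_a, item_b], LABELS))
```

With only five annotations, each per-item probability moves in steps of 0.2; pooling the two items doubles the sample and halves that step, which is the intuition behind aggregating over similar items when labels per item are scarce.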
LAS ENTRAÑAS DEL PODER: UNA AUTOPSIA MICHOACANA DEL SIGLO XVIII
Identifying player motivations such as curiosity could help game designers analyze player profiles and substantially improve game design. However, research on player profiling focuses on generalized personality traits, not specific aspects of motivation. This study examines how player behaviour indicates constructs of curiosity-related motivation. It contributes a more discriminating operationalization of game-related curiosity. We derive a curiosity measure from established self-report survey methodologies relating to social capital, behavioural activation, obsessive/harmonious passion, and BrainHex player types. We present the results of a cross-sectional study with data from 1,745 players of Destiny, a popular shared-world first-person shooter (FPS) game. Behaviour metrics were paired with four curiosity factors: 'social' curiosity, 'sensory/cognitive' curiosity, 'novelty-seeking' curiosity, and 'explorative' curiosity. Our findings provide key insights into the relationships between players' curiosity and their in-game behaviour. We infer curiosity-related motivational profiles from behaviour metrics, and discuss how this may impact game design and player-computer interaction.
Longitudinal Screening for Diabetic Retinopathy in a Nationwide Screening Program: Comparing Deep Learning and Human Graders
Objective. To evaluate diabetic retinopathy (DR) screening via deep learning (DL) and trained human graders (HG) in a longitudinal cohort, as case spectrum shifts based on treatment referral and new-onset DR. Methods. We randomly selected patients with diabetes screened twice, two years apart, within a nationwide screening program. The reference standard was established via adjudication by retina specialists. Each patient’s color fundus photographs were graded, and a patient was considered as having sight-threatening DR (STDR) if the worse eye had severe nonproliferative DR, proliferative DR, or diabetic macular edema. We compared DR screening via two modalities: DL and HG. For each modality, we simulated treatment referral by excluding patients with detected STDR from the second screening using that modality. Results. There were 5,738 patients (12.3% STDR) in the first screening. DL and HG captured different numbers of STDR cases, and after simulated referral and excluding ungradable cases, 4,148 and 4,263 patients remained in the second screening, respectively. The STDR prevalence at the second screening was 5.1% and 6.8% for DL- and HG-based screening, respectively. Along with the prevalence decrease, the sensitivity for both modalities decreased from the first to the second screening (DL: from 95% to 90%, p=0.008; HG: from 74% to 57%, p<0.001). At both the first and second screenings, the rate of false negatives for DL was a fifth that of HG (0.5-0.6% vs. 2.9-3.2%). Conclusion. On 2-year longitudinal follow-up of a DR screening cohort, STDR prevalence decreased for both DL- and HG-based screening. Follow-up screenings in longitudinal DR screening can be more difficult and induce lower sensitivity for both DL and HG, though the false negative rate was substantially lower for DL. Our data may be useful for health-economics analyses of longitudinal screening settings.
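The reported false-negative rates can be related to the prevalence and sensitivity figures with one line of arithmetic, sketched below. The reading used here is an assumption, not something the abstract states: the "rate of false negatives" is taken to be missed STDR cases as a share of all screened patients, that is, prevalence × (1 − sensitivity). The function name is hypothetical, and the inputs are the rounded percentages from the abstract, so the results are approximate.

```python
def false_negative_share(prevalence, sensitivity):
    """Missed cases as a fraction of all screened patients.

    Assumption: false-negative rate = prevalence * (1 - sensitivity),
    i.e. the denominator is everyone screened, not only true cases.
    """
    return prevalence * (1.0 - sensitivity)


# (prevalence, sensitivity) pairs as rounded in the abstract.
screenings = {
    ("DL", "first"): (0.123, 0.95),
    ("HG", "first"): (0.123, 0.74),
    ("DL", "second"): (0.051, 0.90),
    ("HG", "second"): (0.068, 0.57),
}

for (modality, visit), (prevalence, sensitivity) in screenings.items():
    share = false_negative_share(prevalence, sensitivity)
    print(f"{modality} {visit} screening: {share:.1%}")
```

Under this reading the sketch yields about 0.6% and 0.5% for DL and about 3.2% and 2.9% for HG, which is consistent with the ranges quoted in the abstract.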