12 research outputs found
Can LLMs get help from other LLMs without revealing private information?
Cascades are a common type of machine learning system in which a large,
remote model is queried when a local model cannot accurately label a
user's data on its own. Serving stacks for large language models (LLMs)
increasingly use cascades due to their ability to preserve task performance
while dramatically reducing inference costs. However, applying cascade systems
in situations where the local model has access to sensitive data constitutes a
significant privacy risk for users since such data could be forwarded to the
remote model. In this work, we show the feasibility of applying cascade systems
in such setups by equipping the local model with privacy-preserving techniques
that reduce the risk of leaking private information when querying the remote
model. To quantify information leakage in such setups, we introduce two privacy
measures. We then propose a system that leverages the recently introduced
social learning paradigm in which LLMs collaboratively learn from each other by
exchanging natural language. Using this paradigm, we demonstrate on several
datasets that our methods minimize privacy loss while simultaneously
improving task performance compared to a non-cascade baseline.
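As a rough illustration of the cascade pattern described above, the sketch
below defers to a remote model only when the local model is unsure, and asks
the local model to strip private details before deferring. The model
interfaces, confidence threshold, and sanitization prompt are assumptions made
for illustration, not the system proposed in the paper.

```python
from typing import Callable, Tuple

# Hypothetical interfaces: the local model returns (answer, confidence),
# the remote model returns an answer string.
LocalModel = Callable[[str], Tuple[str, float]]
RemoteModel = Callable[[str], str]

def cascade_answer(query: str,
                   local: LocalModel,
                   remote: RemoteModel,
                   threshold: float = 0.8) -> str:
    """Answer locally when confident; otherwise defer a sanitized query."""
    answer, confidence = local(query)
    if confidence >= threshold:
        return answer  # nothing leaves the local model

    # Ask the local model to rewrite the request without private details
    # before anything is sent to the remote model (illustrative prompt).
    sanitized, _ = local(
        "Rewrite this request without personal or sensitive details, "
        "keeping only what is needed to solve the task:\n" + query
    )
    return remote(sanitized)
```

Gating on local confidence keeps most queries on-device, so the remote model
only ever sees the rewritten request.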
Social Learning: Towards Collaborative Learning with Large Language Models
We introduce the framework of "social learning" in the context of large
language models (LLMs), whereby models share knowledge with each other in a
privacy-aware manner using natural language. We present and evaluate two
approaches for knowledge transfer between LLMs. In the first approach, we allow
the model to generate abstract prompts aimed at teaching the task. In the
second, models transfer knowledge by generating synthetic examples. We
evaluate these methods across diverse datasets and quantify memorization as a
proxy for privacy loss. These techniques, inspired by social learning, yield
promising results with low memorization of the original data. In particular, we
show that performance using these methods is comparable to that achieved with
the original labels and prompts. Our work demonstrates the viability of social
learning for LLMs, establishes baseline approaches, and highlights several
unexplored areas for future work.
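A minimal sketch of the second transfer route, teaching via synthetic examples,
might look as follows, assuming simple text-in, text-out model interfaces. The
prompts and few-shot format are illustrative assumptions, not the paper's exact
procedure.

```python
from typing import Callable, List, Tuple

# Hypothetical text-in, text-out model interface.
Model = Callable[[str], str]

def generate_synthetic_examples(teacher: Model, task_description: str,
                                n: int = 8) -> List[Tuple[str, str]]:
    """Teacher invents new labeled examples instead of sharing its data."""
    examples = []
    for _ in range(n):
        question = teacher(
            "Invent a brand-new example input for this task, not copied "
            f"from any data you have seen: {task_description}"
        )
        label = teacher(f"Task: {task_description}\nInput: {question}\nAnswer:")
        examples.append((question, label))
    return examples

def student_predict(student: Model, examples: List[Tuple[str, str]],
                    task_description: str, query: str) -> str:
    """Prompt the student few-shot with the synthetic examples only."""
    shots = "\n".join(f"Input: {q}\nAnswer: {a}" for q, a in examples)
    return student(f"{task_description}\n{shots}\nInput: {query}\nAnswer:")
```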
Engaging Engineering Teams Through Moral Imagination: A Bottom-Up Approach for Responsible Innovation and Ethical Culture Change in Technology Companies
We propose a "Moral Imagination" methodology to facilitate a culture of
responsible innovation for engineering and product teams in technology
companies. Our approach has been operationalized over the past two years at
Google, where we have conducted over 50 workshops with teams across the
organization. We argue that our approach is a crucial complement to existing
formal and informal initiatives for fostering a culture of ethical awareness,
deliberation, and decision-making in technology design such as company
principles, ethics and privacy review procedures, and compliance controls. We
characterize some of the distinctive benefits of our methodology for the
technology sector in particular.
Large Language Models Encode Clinical Knowledge
Large language models (LLMs) have demonstrated impressive capabilities in
natural language understanding and generation, but the quality bar for medical
and clinical applications is high. Today, attempts to assess models' clinical
knowledge typically rely on automated evaluations on limited benchmarks. There
is no standard to evaluate model predictions and reasoning across a breadth of
tasks. To address this, we present MultiMedQA, a benchmark combining six
existing open question answering datasets spanning professional medical exams,
research, and consumer queries; and HealthSearchQA, a new free-response dataset
of medical questions searched online. We propose a framework for human
evaluation of model answers along multiple axes including factuality,
precision, possible harm, and bias. In addition, we evaluate PaLM (a
540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM, on
MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves
state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA,
MedMCQA, PubMedQA, MMLU clinical topics), including 67.6% accuracy on MedQA (US
Medical License Exam questions), surpassing prior state-of-the-art by over 17%.
However, human evaluation reveals key gaps in Flan-PaLM responses. To resolve
this, we introduce instruction prompt tuning, a parameter-efficient approach for
aligning LLMs to new domains using a few exemplars. The resulting model,
Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show
that comprehension, recall of knowledge, and medical reasoning improve with
model scale and instruction prompt tuning, suggesting the potential utility of
LLMs in medicine. Our human evaluations reveal important limitations of today's
models, reinforcing the importance of both evaluation frameworks and method
development in creating safe, helpful LLMs for clinical applications.
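Instruction prompt tuning can be pictured as training only a small block of
soft prompt embeddings placed in front of a frozen LLM. The PyTorch-style
sketch below is a generic illustration of that idea and assumes the base model
accepts token embeddings directly; it is not the Med-PaLM training code.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Freezes the base LLM and learns only a short soft prompt."""

    def __init__(self, llm: nn.Module, embed_dim: int, prompt_len: int = 20):
        super().__init__()
        self.llm = llm
        for p in self.llm.parameters():
            p.requires_grad = False  # base model stays frozen
        # The only trainable parameters: learned prompt embeddings.
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) token embeddings.
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the learned prompt and run the frozen model on the result.
        return self.llm(torch.cat([prompt, input_embeds], dim=1))
```

Because only the prompt embeddings are updated, the approach needs just a few
exemplars and far less compute than full fine-tuning.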
Towards Generalist Biomedical AI
Medicine is inherently multimodal, with rich data modalities spanning text,
imaging, genomics, and more. Generalist biomedical artificial intelligence (AI)
systems that flexibly encode, integrate, and interpret this data at scale can
potentially enable impactful applications ranging from scientific discovery to
care delivery. To enable the development of these models, we first curate
MultiMedBench, a new multimodal biomedical benchmark. MultiMedBench encompasses
14 diverse tasks such as medical question answering, mammography and
dermatology image interpretation, radiology report generation and
summarization, and genomic variant calling. We then introduce Med-PaLM
Multimodal (Med-PaLM M), our proof of concept for a generalist biomedical AI
system. Med-PaLM M is a large multimodal generative model that flexibly encodes
and interprets biomedical data including clinical language, imaging, and
genomics with the same set of model weights. Med-PaLM M reaches performance
competitive with or exceeding the state of the art on all MultiMedBench tasks,
often surpassing specialist models by a wide margin. We also report examples of
zero-shot generalization to novel medical concepts and tasks, positive transfer
learning across tasks, and emergent zero-shot medical reasoning. To further
probe the capabilities and limitations of Med-PaLM M, we conduct a radiologist
evaluation of model-generated (and human) chest X-ray reports and observe
encouraging performance across model scales. In a side-by-side ranking on 246
retrospective chest X-rays, clinicians express a pairwise preference for
Med-PaLM M reports over those produced by radiologists in up to 40.50% of
cases, suggesting potential clinical utility. While considerable work is needed
to validate these models in real-world use cases, our results represent a
milestone towards the development of generalist biomedical AI systems.
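One common way to realize a single-set-of-weights multimodal interface is to
project each non-text modality into the language model's token-embedding space
so that one decoder processes all inputs. The sketch below shows that generic
pattern with placeholder encoder and projection modules; it is an assumption
for illustration, not the Med-PaLM M architecture.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Maps image features into the text token space of one language model."""

    def __init__(self, language_model: nn.Module, image_encoder: nn.Module,
                 image_dim: int, text_dim: int):
        super().__init__()
        self.language_model = language_model
        self.image_encoder = image_encoder
        # Linear projection from image-feature space to token-embedding space.
        self.project = nn.Linear(image_dim, text_dim)

    def forward(self, image: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # image: (batch, C, H, W); text_embeds: (batch, seq_len, text_dim).
        # Assumes the encoder returns (batch, n_patches, image_dim) features.
        image_tokens = self.project(self.image_encoder(image))
        fused = torch.cat([image_tokens, text_embeds], dim=1)
        # A single language model handles the fused token sequence.
        return self.language_model(fused)
```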