Comparing Population Means under Local Differential Privacy: with Significance and Power
A statistical hypothesis test determines whether a hypothesis should be
rejected based on samples from populations. In particular, randomized
controlled experiments (or A/B testing) that compare population means using,
e.g., t-tests, have been widely deployed in technology companies to aid in
making data-driven decisions. Samples used in these tests are collected from
users and may contain sensitive information. Both the data collection and the
testing process may compromise individuals' privacy. In this paper, we study
how to conduct hypothesis tests to compare population means while preserving
privacy. We use the notion of local differential privacy (LDP), which has
recently emerged as the main tool for ensuring each individual's privacy without
the need for a trusted data collector. We propose LDP tests that inject noise
into every user's data in the samples before collecting them (so users do not
need to trust the data collector), and draw conclusions with bounded type-I
(significance level) and type-II errors (1 - power). Our approaches can be
extended to the scenario where some users require LDP while some are willing to
provide exact data. We report experimental results on real-world datasets to
verify the effectiveness of our approaches. Comment: Full version of an AAAI 2018 conference paper.
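The core idea (perturb each user's value before collection, then test on the noisy sample) can be sketched with a generic Laplace mechanism. This is an illustrative stand-in, not the paper's exact mechanism: values are clipped to a known range, noise calibrated to that range is added on the user's side, and the collector runs an ordinary two-sample z-test on the privatized data.

```python
import numpy as np

def ldp_mean_test(x, y, eps=1.0, lo=0.0, hi=1.0):
    """Two-sample z-statistic on locally privatized samples.

    Sketch only: each user's value (clipped to [lo, hi]) is perturbed
    with Laplace noise of scale (hi - lo) / eps before collection, so
    the collector never sees raw data. The noise is zero-mean, so the
    privatized sample means are unbiased estimates of the true means;
    the noise simply inflates the variance, which the z-test absorbs.
    """
    rng = np.random.default_rng(0)  # fixed seed for reproducibility
    scale = (hi - lo) / eps
    xp = np.clip(x, lo, hi) + rng.laplace(0.0, scale, len(x))
    yp = np.clip(y, lo, hi) + rng.laplace(0.0, scale, len(y))
    # Standard error of the difference of the two privatized means.
    var = xp.var(ddof=1) / len(xp) + yp.var(ddof=1) / len(yp)
    return (xp.mean() - yp.mean()) / np.sqrt(var)
```

With a large enough sample, a real difference in means still produces |z| > 1.96 at the 5% level despite the injected noise; the privacy cost shows up as a larger sample size needed for the same power.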
LLMs Understand Glass-Box Models, Discover Surprises, and Suggest Repairs
We show that large language models (LLMs) are remarkably good at working with
interpretable models that decompose complex outcomes into univariate
graph-represented components. By adopting a hierarchical approach to reasoning,
LLMs can provide comprehensive model-level summaries without ever requiring the
entire model to fit in context. This approach enables LLMs to apply their
extensive background knowledge to automate common tasks in data science such as
detecting anomalies that contradict prior knowledge, describing potential
reasons for the anomalies, and suggesting repairs that would remove the
anomalies. We use multiple examples in healthcare to demonstrate the utility of
these new capabilities of LLMs, with particular emphasis on Generalized
Additive Models (GAMs). Finally, we present the package as
an open-source LLM-GAM interface.
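The hierarchical approach rests on the fact that each GAM component is a small univariate graph, so it can be rendered as compact text and shown to an LLM one component at a time. A minimal sketch of such a serialization (the function name and text format here are hypothetical, not the package's actual interface):

```python
def describe_component(feature, edges, values):
    """Render one univariate GAM component as compact text for an LLM.

    Hypothetical serialization: the component's graph is reduced to
    (bin, score) pairs, so a model with hundreds of features can be
    summarized component by component without ever fitting the whole
    model into the LLM's context window.
    """
    lines = [f"Feature: {feature} (contribution to log-odds)"]
    for (lo, hi), v in zip(zip(edges[:-1], edges[1:]), values):
        lines.append(f"  [{lo:g}, {hi:g}): {v:+.2f}")
    return "\n".join(lines)
```

An LLM given such a snippet can then be prompted to flag bins that contradict domain knowledge (e.g., risk dropping at an implausible threshold) and to suggest a monotonicity repair.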
Using Interpretable Machine Learning to Predict Maternal and Fetal Outcomes
Most pregnancies and births result in a good outcome, but complications are
not uncommon and when they do occur, they can be associated with serious
implications for mothers and babies. Predictive modeling has the potential to
improve outcomes through better understanding of risk factors, heightened
surveillance, and more timely and appropriate interventions, thereby helping
obstetricians deliver better care. For three types of complications we identify
and study the most important risk factors using Explainable Boosting Machine
(EBM), a glass-box model, in order to gain intelligibility: (i) Severe Maternal
Morbidity (SMM), (ii) shoulder dystocia, and (iii) preterm preeclampsia. In
addition to using the interpretability of EBMs to reveal surprising insights
into the features contributing to risk, our experiments show that EBMs match
the accuracy of other black-box ML methods such as deep neural nets and random
forests. Comment: DSHealth at SIGKDD 2022, 5 pages, 3 figures.
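The EBM's intelligibility comes from its additive structure: the prediction is a sum of per-feature shape functions, each of which can be plotted and inspected. A toy regression version of the cyclic-boosting idea (illustrative only; the real EBM in the `interpret` package uses bagged shallow trees and many refinements):

```python
import numpy as np

def fit_tiny_ebm(X, y, n_bins=8, rounds=50, lr=0.1):
    """Minimal EBM-style additive model (regression sketch).

    Cyclically boosts one binned shape function per feature on the
    current residuals, so each feature's contribution stays a
    univariate step function. The spread of a shape function gives a
    rough importance measure for that feature.
    """
    n, d = X.shape
    edges = [np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)) for j in range(d)]
    bins = [np.clip(np.searchsorted(edges[j][1:-1], X[:, j]), 0, n_bins - 1)
            for j in range(d)]
    shapes = [np.zeros(n_bins) for _ in range(d)]
    intercept = y.mean()
    pred = np.full(n, intercept)
    for _ in range(rounds):
        for j in range(d):
            resid = y - pred
            # Bin-wise mean of residuals is the best step-function update.
            upd = np.array([resid[bins[j] == b].mean()
                            if np.any(bins[j] == b) else 0.0
                            for b in range(n_bins)])
            shapes[j] += lr * upd
            pred += lr * upd[bins[j]]
    return intercept, edges, shapes
```

After fitting, plotting `shapes[j]` against `edges[j]` shows exactly how feature `j` moves the prediction, which is the property the paper exploits to surface surprising risk factors.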
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Artificial intelligence (AI) researchers have been developing and refining
large language models (LLMs) that exhibit remarkable capabilities across a
variety of domains and tasks, challenging our understanding of learning and
cognition. The latest model developed by OpenAI, GPT-4, was trained using an
unprecedented scale of compute and data. In this paper, we report on our
investigation of an early version of GPT-4, when it was still in active
development by OpenAI. We contend that (this early version of) GPT-4 is part of
a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that
exhibit more general intelligence than previous AI models. We discuss the
rising capabilities and implications of these models. We demonstrate that,
beyond its mastery of language, GPT-4 can solve novel and difficult tasks that
span mathematics, coding, vision, medicine, law, psychology and more, without
needing any special prompting. Moreover, in all of these tasks, GPT-4's
performance is strikingly close to human-level performance, and often vastly
surpasses prior models such as ChatGPT. Given the breadth and depth of GPT-4's
capabilities, we believe that it could reasonably be viewed as an early (yet
still incomplete) version of an artificial general intelligence (AGI) system.
In our exploration of GPT-4, we put special emphasis on discovering its
limitations, and we discuss the challenges ahead for advancing towards deeper
and more comprehensive versions of AGI, including the possible need for
pursuing a new paradigm that moves beyond next-word prediction. We conclude
with reflections on societal influences of the recent technological leap and
future research directions.
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine
Generalist foundation models such as GPT-4 have displayed surprising
capabilities in a wide variety of domains and tasks. Yet, there is a prevalent
assumption that they cannot match specialist capabilities of fine-tuned models.
For example, most explorations to date on medical competency benchmarks have
leveraged domain-specific training, as exemplified by efforts on BioGPT and
Med-PaLM. We build on a prior study of GPT-4's capabilities on medical
challenge benchmarks in the absence of special training. Rather than using
simple prompting to highlight the model's out-of-the-box capabilities, we
perform a systematic exploration of prompt engineering. We find that prompting
innovation can unlock deeper specialist capabilities and show that GPT-4 easily
tops prior leading results for medical benchmarks. The prompting methods we
explore are general purpose, and make no specific use of domain expertise,
removing the need for expert-curated content. Our experimental design carefully
controls for overfitting during the prompt engineering process. We introduce
Medprompt, based on a composition of several prompting strategies. With
Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark
datasets in the MultiMedQA suite. The method outperforms leading specialist
models such as Med-PaLM 2 by a significant margin with an order of magnitude
fewer calls to the model. Steering GPT-4 with Medprompt achieves a 27%
reduction in error rate on the MedQA dataset over the best methods to date
achieved with specialist models and surpasses a score of 90% for the first
time. Beyond medical problems, we show the power of Medprompt to generalize to
other domains and provide evidence for the broad applicability of the approach
via studies of the strategy on exams in electrical engineering, machine
learning, philosophy, accounting, law, nursing, and clinical psychology. Comment: 21 pages, 7 figures.
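One of the composed strategies, choice-shuffle ensembling, is simple to sketch: the multiple-choice options are permuted several times, the model is queried on each permutation, and the majority answer wins, which counters the position bias of LLMs on multiple-choice exams. Here `ask` is a hypothetical stand-in for a GPT-4 call that takes a question and an ordered option list and returns the chosen option's text:

```python
import random
from collections import Counter

def choice_shuffle_ensemble(ask, question, choices, k=5, seed=0):
    """Choice-shuffling ensemble in the spirit of Medprompt (sketch).

    The options are shuffled k times, the model is queried once per
    permutation, and votes are tallied on the answer *text* (which is
    permutation-invariant), so a position-biased model's errors tend
    to cancel out across shuffles.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(k):
        perm = choices[:]
        rng.shuffle(perm)
        votes[ask(question, perm)] += 1
    return votes.most_common(1)[0][0]
```

In the full method this would be combined with the other strategies the paper composes, such as automatically retrieved few-shot examples and chain-of-thought rationales.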