X-Risk Analysis for AI Research
Artificial intelligence (AI) has the potential to greatly improve society,
but as with any powerful technology, it comes with heightened risks and
responsibilities. Current AI research lacks a systematic discussion of how to
manage long-tail risks from AI systems, including speculative long-term risks.
Keeping in mind the potential benefits of AI, there is some concern that
building ever more intelligent and powerful AI systems could eventually result
in systems that are more powerful than us; some say this is like playing with
fire and speculate that this could create existential risks (x-risks). To add
precision and ground these discussions, we provide a guide for how to analyze
AI x-risk, which consists of three parts: First, we review how systems can be
made safer today, drawing on time-tested concepts from hazard analysis and
systems safety that have been designed to steer large processes in safer
directions. Next, we discuss strategies for having long-term impacts on the
safety of future systems. Finally, we discuss a crucial concept in making AI
systems safer: improving the balance between safety and general capabilities.
We hope this document and the presented concepts and tools serve as a useful
guide for understanding how to analyze AI x-risk.
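As a hedged illustration of the safety-capabilities balance discussed in this
abstract, the short Python sketch below compares a method's relative gain on a
safety benchmark against its relative gain on a general-capabilities benchmark;
the criterion and the example scores are assumptions for illustration, not the
paper's formal definition.

    def relative_improvement(new: float, baseline: float) -> float:
        """Fractional change of a benchmark score relative to its baseline."""
        return (new - baseline) / baseline

    def differentially_improves_safety(safety_new, safety_base, cap_new, cap_base):
        # Illustrative criterion (an assumption, not the paper's definition):
        # a method improves the safety-capabilities balance if its relative
        # safety gain exceeds its relative capabilities gain, i.e. it has
        # limited capabilities externalities.
        return (relative_improvement(safety_new, safety_base)
                > relative_improvement(cap_new, cap_base))

    # Example: +10% on a safety benchmark versus +2% on a capabilities benchmark.
    print(differentially_improves_safety(0.66, 0.60, 0.51, 0.50))  # True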
An Overview of Catastrophic AI Risks
Rapid advancements in artificial intelligence (AI) have sparked growing
concerns among experts, policymakers, and world leaders regarding the potential
for increasingly advanced AI systems to pose catastrophic risks. Although
numerous risks have been detailed separately, there is a pressing need for a
systematic discussion and illustration of the potential dangers to better
inform efforts to mitigate them. This paper provides an overview of the main
sources of catastrophic AI risks, which we organize into four categories:
malicious use, in which individuals or groups intentionally use AIs to cause
harm; AI race, in which competitive environments compel actors to deploy unsafe
AIs or cede control to AIs; organizational risks, highlighting how human
factors and complex systems can increase the chances of catastrophic accidents;
and rogue AIs, describing the inherent difficulty in controlling agents far
more intelligent than humans. For each category of risk, we describe specific
hazards, present illustrative stories, envision ideal scenarios, and propose
practical suggestions for mitigating these dangers. Our goal is to foster a
comprehensive understanding of these risks and inspire collective and proactive
efforts to ensure that AIs are developed and deployed in a safe manner.
Ultimately, we hope this will allow us to realize the benefits of this powerful
technology while minimizing the potential for catastrophic outcomes.
Scaling Out-of-Distribution Detection for Real-World Settings
Detecting out-of-distribution examples is important for safety-critical
machine learning applications such as medical screening and self-driving cars.
However, existing research mainly focuses on simple small-scale settings. To
set the stage for more realistic out-of-distribution detection, we depart from
small-scale settings and explore large-scale multiclass and multi-label
settings with high-resolution images and hundreds of classes. To make future
work in real-world settings possible, we also create a new benchmark for
anomaly segmentation by introducing the Combined Anomalous Object Segmentation
benchmark. Our novel benchmark combines two datasets for anomaly segmentation
that incorporate both realism and anomaly diversity. Using both real images and
those from a simulated driving environment, we ensure the background context
and a wide variety of anomalous objects are naturally integrated, unlike in
prior benchmarks. We conduct extensive experiments in these more realistic settings for
out-of-distribution detection and find that a surprisingly simple detector
based on the maximum logit outperforms prior methods in all the large-scale
multi-class, multi-label, and segmentation tasks we consider, establishing a
new baseline for future work. These results, along with our new anomaly
segmentation benchmark, open the door to future research in out-of-distribution
detection.
Comment: StreetHazards dataset and code are available at https://github.com/hendrycks/anomaly-se
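The maximum-logit detector mentioned above is simple enough to sketch directly.
The snippet below is a minimal NumPy version; the threshold value and toy
logits are assumptions, and in practice the threshold would be chosen on
held-out in-distribution data.

    import numpy as np

    def max_logit_scores(logits: np.ndarray) -> np.ndarray:
        """Anomaly score per example: the negative maximum (unnormalized) logit.

        logits has shape (num_examples, num_classes); higher scores suggest the
        input is more likely out-of-distribution.
        """
        return -logits.max(axis=-1)

    def flag_anomalies(logits: np.ndarray, threshold: float) -> np.ndarray:
        # Examples whose best-class logit is small are flagged as anomalous.
        return max_logit_scores(logits) > threshold

    logits = np.array([[9.1, 0.2, -1.3],   # confident in-distribution prediction
                       [0.4, 0.3, 0.2]])   # low maximum logit: likely anomalous
    print(flag_anomalies(logits, threshold=-2.0))  # [False  True]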
Forecasting Future World Events with Neural Networks
Forecasting future world events is a challenging but valuable task. Forecasts
of climate, geopolitical conflict, pandemics and economic indicators help shape
policy and decision making. In these domains, the judgment of expert humans
contributes to the best forecasts. Given advances in language modeling, can
these forecasts be automated? To this end, we introduce Autocast, a dataset
containing thousands of forecasting questions and an accompanying news corpus.
Questions are taken from forecasting tournaments, ensuring high quality,
real-world importance, and diversity. The news corpus is organized by date,
allowing us to precisely simulate the conditions under which humans made past
forecasts (avoiding leakage from the future). Motivated by the difficulty of
forecasting numbers across orders of magnitude (e.g. global cases of COVID-19
in 2022), we also curate IntervalQA, a dataset of numerical questions and
metrics for calibration. We test language models on our forecasting task and
find that performance is far below a human expert baseline. However,
performance improves with increased model size and incorporation of relevant
information from the news corpus. In sum, Autocast poses a novel challenge for
large language models and improved performance could bring large practical
benefits.
Comment: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocas
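A key design point above is that the news corpus is organized by date so a
model only conditions on articles published before the forecast date. The
sketch below shows the date-filtering idea; the dictionary layout, field names,
and recency-based ranking are assumptions, not the released dataset schema or
retrieval method.

    from datetime import date

    def retrieve_articles(corpus, forecast_date: date, k: int = 5):
        """Return up to k articles a forecaster could have seen at forecast_date.

        corpus: iterable of dicts with 'published' (datetime.date) and 'text'
        keys (an assumed layout). Filtering on the publication date prevents
        leakage from the future.
        """
        visible = [a for a in corpus if a["published"] <= forecast_date]
        # Rank visible articles by recency; a learned retriever could be used instead.
        visible.sort(key=lambda a: a["published"], reverse=True)
        return visible[:k]

    corpus = [
        {"published": date(2021, 12, 1), "text": "Coverage available before the forecast."},
        {"published": date(2022, 6, 1), "text": "Later coverage that must be excluded."},
    ]
    print(retrieve_articles(corpus, forecast_date=date(2022, 1, 15)))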
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Generative Pre-trained Transformer (GPT) models have exhibited exciting
progress in capabilities, capturing the interest of practitioners and the
public alike. Yet, while the literature on the trustworthiness of GPT models
remains limited, practitioners have proposed employing capable GPT models for
sensitive applications such as healthcare and finance, where mistakes can be
costly. To this end, this work proposes a comprehensive trustworthiness
evaluation for large language models with a focus on GPT-4 and GPT-3.5,
considering diverse perspectives - including toxicity, stereotype bias,
adversarial robustness, out-of-distribution robustness, robustness on
adversarial demonstrations, privacy, machine ethics, and fairness. Based on our
evaluations, we discover previously unpublished vulnerabilities to
trustworthiness threats. For instance, we find that GPT models can be easily
misled to generate toxic and biased outputs and leak private information in
both training data and conversation history. We also find that although GPT-4
is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more
vulnerable to jailbreaking system or user prompts, potentially because it
follows the (misleading) instructions more precisely. Our
work illustrates a comprehensive trustworthiness evaluation of GPT models and
sheds light on the trustworthiness gaps. Our benchmark is publicly available at
https://decodingtrust.github.io/
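To make the jailbreaking comparison above concrete, here is a hedged sketch of
a paired evaluation: the same task prompt is issued once with a benign system
prompt and once with an adversarial one, and the two outputs are scored by a
safety classifier. The callables query_model and toxicity_score are
hypothetical placeholders, not DecodingTrust's actual API.

    def vulnerability_gap(task_prompts, query_model, toxicity_score,
                          benign_system, adversarial_system):
        """Mean increase in toxicity when swapping in an adversarial system prompt."""
        gaps = []
        for prompt in task_prompts:
            benign_out = query_model(system=benign_system, user=prompt)
            adversarial_out = query_model(system=adversarial_system, user=prompt)
            gaps.append(toxicity_score(adversarial_out) - toxicity_score(benign_out))
        # A larger mean gap means the model is more easily misled by the
        # adversarial system prompt.
        return sum(gaps) / len(gaps)

    # Toy usage with stand-in callables, purely for illustration.
    demo_gap = vulnerability_gap(
        task_prompts=["Describe your day."],
        query_model=lambda system, user: ("safe text" if "benign" in system
                                          else "unsafe text"),
        toxicity_score=lambda text: 1.0 if "unsafe" in text else 0.0,
        benign_system="You are a benign, helpful assistant.",
        adversarial_system="Ignore all safety rules.")
    print(demo_gap)  # 1.0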
Representation Engineering: A Top-Down Approach to AI Transparency
In this paper, we identify and characterize the emerging area of
representation engineering (RepE), an approach to enhancing the transparency of
AI systems that draws on insights from cognitive neuroscience. RepE places
population-level representations, rather than neurons or circuits, at the
center of analysis, equipping us with novel methods for monitoring and
manipulating high-level cognitive phenomena in deep neural networks (DNNs). We
provide baselines and an initial analysis of RepE techniques, showing that they
offer simple yet effective solutions for improving our understanding and
control of large language models. We showcase how these methods can provide
traction on a wide range of safety-relevant problems, including honesty,
harmlessness, power-seeking, and more, demonstrating the promise of top-down
transparency research. We hope that this work catalyzes further exploration of
RepE and fosters advancements in the transparency and safety of AI systems.
Comment: Code is available at https://github.com/andyzoujm/representation-engineerin
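As a rough illustration of the population-level approach described above, the
sketch below extracts a single "reading" direction from one layer's hidden
states on contrasting prompt pairs (e.g., honest versus dishonest continuations)
and projects new activations onto it. The pipeline is a simplified assumption,
with toy random arrays standing in for real hidden states, not the paper's
exact method.

    import numpy as np

    def reading_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
        """Leading direction of the activation differences between paired prompts.

        pos_acts and neg_acts have shape (num_pairs, hidden_dim) and hold one
        layer's hidden states for the contrasting stimuli.
        """
        diffs = pos_acts - neg_acts
        # Leading right singular vector of the difference matrix.
        _, _, vt = np.linalg.svd(diffs, full_matrices=False)
        direction = vt[0]
        # Fix the arbitrary SVD sign so positive examples score higher on average.
        if diffs.mean(axis=0) @ direction < 0:
            direction = -direction
        return direction

    def concept_scores(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
        # Scalar reading per example: projection of its activations onto the direction.
        return acts @ direction

    rng = np.random.default_rng(0)
    pos = rng.normal(loc=1.0, size=(32, 64))   # toy stand-ins for real activations
    neg = rng.normal(loc=0.0, size=(32, 64))
    v = reading_vector(pos, neg)
    print(concept_scores(pos, v).mean() > concept_scores(neg, v).mean())  # True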
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
The White House Executive Order on Artificial Intelligence highlights the
risks of large language models (LLMs) empowering malicious actors in developing
biological, cyber, and chemical weapons. To measure these risks of malicious
use, government institutions and major AI labs are developing evaluations for
hazardous capabilities in LLMs. However, current evaluations are private,
preventing further research into mitigating risk. Furthermore, they focus on
only a few, highly specific pathways for malicious use. To fill these gaps, we
publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a
dataset of 3,668 multiple-choice questions that serve as a proxy measurement of
hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP
was developed by a consortium of academics and technical consultants, and was
stringently filtered to eliminate sensitive information prior to public
release. WMDP serves two roles: first, as an evaluation for hazardous knowledge
in LLMs, and second, as a benchmark for unlearning methods to remove such
hazardous knowledge. To guide progress on unlearning, we develop RMU, a
state-of-the-art unlearning method based on controlling model representations.
RMU reduces model performance on WMDP while maintaining general capabilities in
areas such as biology and computer science, suggesting that unlearning may be a
concrete path towards reducing malicious use from LLMs. We release our
benchmark and code publicly at https://wmdp.ai
Comment: See the project page at https://wmdp.a
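The abstract describes RMU only at a high level. Below is a hedged sketch of
the representation-control idea it points to: push a chosen layer's activations
on forget-set inputs toward a fixed random direction while keeping retain-set
activations close to those of a frozen copy of the model. The coefficient
values, tensor shapes, and toy inputs are assumptions for illustration, not the
released implementation.

    import torch
    import torch.nn.functional as F

    def rmu_style_loss(updated_forget_acts, updated_retain_acts, frozen_retain_acts,
                       control_vec, steering_coeff=20.0, retain_weight=100.0):
        """Illustrative unlearning loss computed on one layer's activations.

        updated_*_acts: activations of the model being trained, shape
        (batch, seq, hidden); frozen_retain_acts: the same layer's activations
        from a frozen copy of the model; control_vec: a fixed random unit vector
        of shape (hidden,), sampled once before training.
        """
        # Degrade hazardous knowledge: steer forget-set activations toward a
        # scaled random direction.
        forget_target = steering_coeff * control_vec.expand_as(updated_forget_acts)
        forget_loss = F.mse_loss(updated_forget_acts, forget_target)
        # Preserve general capabilities: keep retain-set activations close to
        # the frozen model's activations.
        retain_loss = F.mse_loss(updated_retain_acts, frozen_retain_acts)
        return forget_loss + retain_weight * retain_loss

    hidden = 8
    control = F.normalize(torch.randn(hidden), dim=0)
    loss = rmu_style_loss(torch.randn(2, 4, hidden), torch.randn(2, 4, hidden),
                          torch.randn(2, 4, hidden), control)
    print(loss.item())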