12 research outputs found

X-Risk Analysis for AI Research

Authors: Hendrycks, Dan; Mazeika, Mantas
Publication date: 20/09/2022

Artificial intelligence (AI) has the potential to greatly improve society, but as with any powerful technology, it comes with heightened risks and responsibilities. Current AI research lacks a systematic discussion of how to manage long-tail risks from AI systems, including speculative long-term risks. Keeping in mind the potential benefits of AI, there is some concern that building ever more intelligent and powerful AI systems could eventually result in systems that are more powerful than us; some say this is like playing with fire and speculate that this could create existential risks (x-risks). To add precision and ground these discussions, we provide a guide for how to analyze AI x-risk, which consists of three parts: First, we review how systems can be made safer today, drawing on time-tested concepts from hazard analysis and systems safety that have been designed to steer large processes in safer directions. Next, we discuss strategies for having long-term impacts on the safety of future systems. Finally, we discuss a crucial concept in making AI systems safer by improving the balance between safety and general capabilities. We hope this document and the presented concepts and tools serve as a useful guide for understanding how to analyze AI x-risk.

arXiv.org e-Print Archive

An Overview of Catastrophic AI Risks

Authors: Hendrycks, Dan; Mazeika, Mantas; Woodside, Thomas
Publication date: 11/09/2023

Rapid advancements in artificial intelligence (AI) have sparked growing concerns among experts, policymakers, and world leaders regarding the potential for increasingly advanced AI systems to pose catastrophic risks. Although numerous risks have been detailed separately, there is a pressing need for a systematic discussion and illustration of the potential dangers to better inform efforts to mitigate them. This paper provides an overview of the main sources of catastrophic AI risks, which we organize into four categories: malicious use, in which individuals or groups intentionally use AIs to cause harm; AI race, in which competitive environments compel actors to deploy unsafe AIs or cede control to AIs; organizational risks, highlighting how human factors and complex systems can increase the chances of catastrophic accidents; and rogue AIs, describing the inherent difficulty in controlling agents far more intelligent than humans. For each category of risk, we describe specific hazards, present illustrative stories, envision ideal scenarios, and propose practical suggestions for mitigating these dangers. Our goal is to foster a comprehensive understanding of these risks and inspire collective and proactive efforts to ensure that AIs are developed and deployed in a safe manner. Ultimately, we hope this will allow us to realize the benefits of this powerful technology while minimizing the potential for catastrophic outcomes.

arXiv.org e-Print Archive

Scaling Out-of-Distribution Detection for Real-World Settings

Authors: Basart, Steven; Hendrycks, Dan; Mazeika, Mantas; Mostajabi, Mohammadreza; Song, Dawn; Steinhardt, Jacob
Publication date: 07/12/2020

Detecting out-of-distribution examples is important for safety-critical machine learning applications such as medical screening and self-driving cars. However, existing research mainly focuses on simple small-scale settings. To set the stage for more realistic out-of-distribution detection, we depart from small-scale settings and explore large-scale multiclass and multi-label settings with high-resolution images and hundreds of classes. To make future work in real-world settings possible, we also create a new benchmark for anomaly segmentation by introducing the Combined Anomalous Object Segmentation benchmark. Our novel benchmark combines two datasets for anomaly segmentation that incorporate both realism and anomaly diversity. Using both real images and those from a simulated driving environment, we ensure the background context and a wide variety of anomalous objects are naturally integrated, unlike before. We conduct extensive experiments in these more realistic settings for out-of-distribution detection and find that a surprisingly simple detector based on the maximum logit outperforms prior methods in all the large-scale multi-class, multi-label, and segmentation tasks we consider, establishing a new baseline for future work. These results, along with our new anomaly segmentation benchmark, open the door to future research in out-of-distribution detection.

Comment: StreetHazards dataset and code are available at https://github.com/hendrycks/anomaly-se

arXiv.org e-Print Archive
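
The abstract above describes a detector based on the maximum logit. A minimal sketch of that idea follows, in plain Python. It is not the authors' code: the function names, the sign convention (higher score means more anomalous), and the threshold value are assumptions made for illustration. A maximum-softmax-probability score is included only as a familiar point of comparison.

```python
import math
from typing import List, Sequence


def max_logit_score(logits: Sequence[float]) -> float:
    """Anomaly score from raw logits: the negated largest logit.

    Assumed convention: a higher score means "more likely out-of-distribution".
    """
    return -max(logits)


def max_softmax_score(logits: Sequence[float]) -> float:
    """Comparison score: the negated maximum softmax probability."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    return -(max(exps) / sum(exps))


def flag_ood(batch: List[Sequence[float]], threshold: float) -> List[bool]:
    """Flag each example whose max-logit score exceeds a chosen threshold.

    The threshold is hypothetical; in practice it would be tuned on held-out data.
    """
    return [max_logit_score(logits) > threshold for logits in batch]


# Hypothetical logits: one confident prediction, one with uniformly small logits.
confident = [9.0, 1.0, 0.5]
uncertain = [0.4, 0.3, 0.2]
flags = flag_ood([confident, uncertain], threshold=-2.0)
```

With these made-up numbers, only the second example is flagged. Unlike the softmax score, the max-logit score does not normalize across classes, so the raw magnitude of the logits is kept.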

Forecasting Future World Events with Neural Networks

Authors: Evans, Owain; Hendrycks, Dan; Jia, Ryan; Kwon, Joe; Li, Richard; Mazeika, Mantas; Song, Dawn; Steinhardt, Jacob; Xiao, Tristan; Zou, Andy
Publication date: 09/10/2022

Forecasting future world events is a challenging but valuable task. Forecasts of climate, geopolitical conflict, pandemics and economic indicators help shape policy and decision making. In these domains, the judgment of expert humans contributes to the best forecasts. Given advances in language modeling, can these forecasts be automated? To this end, we introduce Autocast, a dataset containing thousands of forecasting questions and an accompanying news corpus. Questions are taken from forecasting tournaments, ensuring high quality, real-world importance, and diversity. The news corpus is organized by date, allowing us to precisely simulate the conditions under which humans made past forecasts (avoiding leakage from the future). Motivated by the difficulty of forecasting numbers across orders of magnitude (e.g. global cases of COVID-19 in 2022), we also curate IntervalQA, a dataset of numerical questions and metrics for calibration. We test language models on our forecasting task and find that performance is far below a human expert baseline. However, performance improves with increased model size and incorporation of relevant information from the news corpus. In sum, Autocast poses a novel challenge for large language models and improved performance could bring large practical benefits.

Comment: NeurIPS 2022; our dataset is available at https://github.com/andyzoujm/autocas

arXiv.org e-Print Archive
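
The abstract above notes that organizing the news corpus by date makes it possible to avoid leakage from the future. One way this could look is sketched below, assuming a simple in-memory corpus. The `Article` type and the `visible_articles` helper are hypothetical names for illustration and are not part of the Autocast dataset or its code.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Article:
    """Hypothetical news record: a publication date and the article text."""
    published: date
    text: str


def visible_articles(corpus: List[Article], forecast_date: date) -> List[Article]:
    """Return articles published on or before the forecast date, oldest first.

    Anything published later is excluded, so a model forecasting "as of"
    forecast_date cannot see information from the future.
    """
    return sorted(
        (a for a in corpus if a.published <= forecast_date),
        key=lambda a: a.published,
    )


# Made-up corpus entries for illustration only.
corpus = [
    Article(date(2022, 3, 15), "later report"),
    Article(date(2022, 1, 10), "early report"),
    Article(date(2022, 2, 20), "mid report"),
]
context = visible_articles(corpus, forecast_date=date(2022, 3, 1))
```

Here `context` holds only the January and February items. Whether the cutoff should be inclusive of the forecast date is a design choice that this sketch simply assumes.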

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models

Authors: Arora, Simran; Chen, Weixin; Cheng, Yu; Dutta, Ritik; Hendrycks, Dan; Kang, Mintong; Koyejo, Sanmi; Li, Bo; Lin, Zinan; Mazeika, Mantas; Pei, Hengzhi; Schaeffer, Rylan; Song, Dawn; Truong, Sang T.; Wang, Boxin; Xie, Chulin; Xiong, Zidi; Xu, Chejian; Zhang, Chenhui
Publication date: 20/06/2023

Generative Pre-trained Transformer (GPT) models have exhibited exciting progress in capabilities, capturing the interest of practitioners and the public alike. Yet, while the literature on the trustworthiness of GPT models remains limited, practitioners have proposed employing capable GPT models for sensitive applications to healthcare and finance - where mistakes can be costly. To this end, this work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5, considering diverse perspectives - including toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness. Based on our evaluations, we discover previously unpublished vulnerabilities to trustworthiness threats. For instance, we find that GPT models can be easily misled to generate toxic and biased outputs and leak private information in both training data and conversation history. We also find that although GPT-4 is usually more trustworthy than GPT-3.5 on standard benchmarks, GPT-4 is more vulnerable given jailbreaking system or user prompts, potentially due to the reason that GPT-4 follows the (misleading) instructions more precisely. Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps. Our benchmark is publicly available at https://decodingtrust.github.io/

arXiv.org e-Print Archive

Representation Engineering: A Top-Down Approach to AI Transparency

Authors: Basart, Steven; Byun, Michael J.; Campbell, James; Chen, Sarah; Dombrowski, Ann-Kathrin; Fredrikson, Matt; Goel, Shashwat; Guo, Phillip; Hendrycks, Dan; Kolter, J. Zico; Koyejo, Sanmi; Li, Nathaniel; Mallen, Alex; Mazeika, Mantas; Pan, Alexander; Phan, Long; Ren, Richard; Song, Dawn; Wang, Zifan; Yin, Xuwang; Zou, Andy
Publication date: 10/10/2023

In this paper, we identify and characterize the emerging area of representation engineering (RepE), an approach to enhancing the transparency of AI systems that draws on insights from cognitive neuroscience. RepE places population-level representations, rather than neurons or circuits, at the center of analysis, equipping us with novel methods for monitoring and manipulating high-level cognitive phenomena in deep neural networks (DNNs). We provide baselines and an initial analysis of RepE techniques, showing that they offer simple yet effective solutions for improving our understanding and control of large language models. We showcase how these methods can provide traction on a wide range of safety-relevant problems, including honesty, harmlessness, power-seeking, and more, demonstrating the promise of top-down transparency research. We hope that this work catalyzes further exploration of RepE and fosters advancements in the transparency and safety of AI systems.

Comment: Code is available at https://github.com/andyzoujm/representation-engineerin

arXiv.org e-Print Archive