175 research outputs found

Demographic Studies on Hawaii's Endangered Tree Snails: Partulina proxima

Author: Hadfield Michael G.
Miller Stephen E.
Publication venue: 'University of Hawaii Press (Project Muse)'
Publication date: 01/01/1989

Populations of the tree snail Partulina proxima, endemic to higher elevations of Molokai, Hawaiian Islands, were studied for 3 years. Analyses of the data derived from 17 bimonthly mark-recapture events determined that each tree harbors a small, mostly nonmigratory population of 8-26 snails of which 2-4 are adults; the snails average 4.2 mm long at birth and 21.3 mm long when growth stops; growth is slow, with maturity reached in 5-7 years; annual fecundity averages 6.2 offspring per adult; and mortality is about 98% over the first 4 years of life. Given the high rate of juvenile mortality, adult snails must reproduce for at least 12 years to replace themselves. From this we calculate a minimum maximal life-span of 18-19 years. We conclude that the current high rate of unexplained juvenile mortality, combined with late age at first reproduction and low fecundity, places this species at very high risk to any sort of perturbation, particularly any selective predation on adults.

ScholarSpace at University of Hawai'i at Manoa

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Author: Casper Stephen
Hadfield-Menell Dylan
Ho Anson
Räuker Tilman
Publication date: 27/12/2022

The last decade of machine learning has seen drastic increases in scale and capabilities. Deep neural networks (DNNs) are increasingly being deployed in the real world. However, they are difficult to analyze, raising concerns about using them without a rigorous understanding of how they function. Effective tools for interpreting them will be important for building more trustworthy AI by helping to identify problems, fix bugs, and improve basic understanding. In particular, "inner" interpretability techniques, which focus on explaining the internal components of DNNs, are well-suited for developing a mechanistic understanding, guiding manual modifications, and reverse engineering solutions. Much recent work has focused on DNN interpretability, and rapid progress has thus far made a thorough systematization of methods difficult. In this survey, we review over 300 works with a focus on inner interpretability tools. We introduce a taxonomy that classifies methods by what part of the network they help to explain (weights, neurons, subnetworks, or latent representations) and whether they are implemented during (intrinsic) or after (post hoc) training. To our knowledge, we are also the first to survey a number of connections between interpretability research and work in adversarial robustness, continual learning, modularity, network compression, and studying the human visual system. We discuss key challenges and argue that the status quo in interpretability research is largely unproductive. Finally, we highlight the importance of future work that emphasizes diagnostics, debugging, adversaries, and benchmarking in order to make interpretability tools more useful to engineers in practical applications.

arXiv.org e-Print Archive

Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?

Author: Andreas Jacob
Casper Stephen
Hadfield-Menell Dylan
Liu Kevin
Publication date: 27/11/2023

Neural language models (LMs) can be used to evaluate the truth of factual statements in two ways: they can be either queried for statement probabilities, or probed for internal representations of truthfulness. Past work has found that these two procedures sometimes disagree, and that probes tend to be more accurate than LM outputs. This has led some researchers to conclude that LMs "lie" or otherwise encode non-cooperative communicative intents. Is this an accurate description of today's LMs, or can query-probe disagreement arise in other ways? We identify three different classes of disagreement, which we term confabulation, deception, and heterogeneity. In many cases, the superiority of probes is simply attributable to better calibration on uncertain answers rather than a greater fraction of correct, high-confidence answers. In some cases, queries and probes perform better on different subsets of inputs, and accuracy can further be improved by ensembling the two. Code is available at github.com/lingo-mit/lm-truthfulness. Comment: Accepted to EMNLP, 202

arXiv.org e-Print Archive

Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents

Author: Casper Stephen
Hadfield-Menell Dylan
Killian Taylor
Kreiman Gabriel
Publication date: 13/10/2023

Adversarial examples can be useful for identifying vulnerabilities in AI systems before they are deployed. In reinforcement learning (RL), adversarial policies can be developed by training an adversarial agent to minimize a target agent's rewards. Prior work has studied black-box versions of these attacks where the adversary only observes the world state and treats the target agent as any other part of the environment. However, this does not take into account additional structure in the problem. In this work, we study white-box adversarial policies and show that having access to a target agent's internal state can be useful for identifying its vulnerabilities. We make two contributions. (1) We introduce white-box adversarial policies where an attacker observes both a target's internal state and the world state at each timestep. We formulate ways of using these policies to attack agents in 2-player games and text-generating language models. (2) We demonstrate that these policies can achieve higher initial and asymptotic performance against a target agent than black-box controls. Code is available at https://github.com/thestephencasper/lm_white_box_attacks.

arXiv.org e-Print Archive

Robust Feature-Level Adversaries are Interpretability Tools

Author: Casper Stephen
Hadfield-Menell Dylan
Kreiman Gabriel
Nadeau Max
Publication date: 01/06/2022

The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying the representations in models. Second, we show that these adversaries are versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results indicate that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Comment: Code available at https://github.com/thestephencasper/feature_level_ad

arXiv.org e-Print Archive

Explore, Establish, Exploit: Red Teaming Language Models from Scratch

Author: Casper Stephen
Culp Gatlen
Hadfield-Menell Dylan
Kwon Joe
Lin Jason
Publication date: 21/06/2023

Deploying large language models (LLMs) can pose hazards from harmful outputs such as toxic or dishonest speech. Prior work has introduced tools that elicit harmful outputs in order to identify and mitigate these risks. While this is a valuable step toward securing language models, these approaches typically rely on a pre-existing classifier for undesired outputs. This limits their application to situations where the type of harmful behavior is known with precision beforehand. However, this skips a central challenge of red teaming: developing a contextual understanding of the behaviors that a model can exhibit. Furthermore, when such a classifier already exists, red teaming has limited marginal value because the classifier could simply be used to filter training data or model outputs. In this work, we consider red teaming under the assumption that the adversary is working from a high-level, abstract specification of undesired behavior. The red team is expected to refine/extend this specification and identify methods to elicit this behavior from the model. Our red teaming framework consists of three steps: 1) Exploring the model's behavior in the desired context; 2) Establishing a measurement of undesired behavior (e.g., a classifier trained to reflect human evaluations); and 3) Exploiting the model's flaws using this measure and an established red teaming methodology. We apply this approach to red team GPT-2 and GPT-3 models to systematically discover classes of prompts that elicit toxic and dishonest statements. In doing so, we also construct and release the CommonClaim dataset of 20,000 statements that have been labeled by human subjects as common-knowledge-true, common-knowledge-false, or neither. Code is available at https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim is available at https://github.com/Algorithmic-Alignment-Lab/CommonClaim.

arXiv.org e-Print Archive

ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs

Author: Gilbert Andrew
González Violeta Menéndez
Hadfield Simon
Jolly Stephen
Phillipson Graeme
Publication date: 30/11/2023

In the field of media production, video editing techniques play a pivotal role. Recent approaches have had great success at performing novel view image synthesis of static scenes. But adding temporal information adds an extra layer of complexity. Previous models have focused on implicitly representing static and dynamic scenes using NeRF. These models achieve impressive results but are costly at training and inference time. They overfit an MLP to describe the scene implicitly as a function of position. This paper proposes ZeST-NeRF, a new approach that can produce temporal NeRFs for new scenes without retraining. We can accurately reconstruct novel views using multi-view synthesis techniques and scene flow-field estimation, trained only with unrelated scenes. We demonstrate how existing state-of-the-art approaches from a range of fields cannot adequately solve this new task and demonstrate the efficacy of our solution. The resulting network improves quantitatively by 15% and produces significantly better visual results. Comment: VUA BMVC 202

arXiv.org e-Print Archive

Whole Genome Amplification (WGA) for archiving and genotyping of clinical isolates of Cryptosporidium species

Author: Blanco
Darren Heavens
Esteban
Han
Kevin M. Tyler
Kristin Elwin
Maha Bouzid
Nelson
Paul R. Hunter
Paunio
Rachel M. Chalmers
Stephen J. Hadfield
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2010

Crossref

University of East Anglia digital repository

Red Teaming Deep Neural Networks with Feature Synthesis Tools

Author: Bu Tong
Casper Stephen
Hadfield-Menell Dylan
Hariharan Kaivalya
Li Jiawei
Li Yuxiao
Zhang Kevin
Publication date: 21/09/2023

Interpretable AI tools are often motivated by the goal of understanding model behavior in out-of-distribution (OOD) contexts. Despite the attention this area of study receives, there are comparatively few cases where these tools have identified previously unknown bugs in models. We argue that this is due, in part, to a common feature of many interpretability methods: they analyze model behavior by using a particular dataset. This only allows for the study of the model in the context of features that the user can sample in advance. To address this, a growing body of research involves interpreting models using feature synthesis methods that do not depend on a dataset. In this paper, we benchmark the usefulness of interpretability tools on debugging tasks. Our key insight is that we can implant human-interpretable trojans into models and then evaluate these tools based on whether they can help humans discover them. This is analogous to finding OOD bugs, except the ground truth is known, allowing us to know when an interpretation is correct. We make four contributions. (1) We propose trojan discovery as an evaluation task for interpretability tools and introduce a benchmark with 12 trojans of 3 different types. (2) We demonstrate the difficulty of this benchmark with a preliminary evaluation of 16 state-of-the-art feature attribution/saliency tools. Even under ideal conditions, given direct access to data with the trojan trigger, these methods still often fail to identify bugs. (3) We evaluate 7 feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new variants of the best-performing method from the previous evaluation. A website for this paper and its code is at https://benchmarking-interpretability.csail.mit.edu/ Comment: In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

arXiv.org e-Print Archive

Sporadic Human Cryptosporidiosis Caused by Cryptosporidium cuniculus, United Kingdom, 2007–2008

Author: Guy Robinson
Kristin Elwin
Nolan
Rachel M. Chalmers
Shi
Stephen J. Hadfield
Publication venue: Centers for Disease Control and Prevention
Publication date: 01/03/2011

To investigate sporadic human cryptosporidiosis trends in the United Kingdom, we tested 3,030 Cryptosporidium spp.–positive fecal samples, submitted for routine typing in 2007–2008, for C. cuniculus. C. cuniculus prevalence was 1.2%; cases were mostly indigenous and occurred across all age groups. Most occurred during August–October and may be linked to exposure opportunities.

Crossref

Directory of Open Access Journals

PubMed Central