Demographic Studies on Hawaii's Endangered Tree Snails: Partulina proxima
Populations of the tree snail Partulina proxima, endemic to higher
elevations of Molokai, Hawaiian Islands, were studied for 3 years. Analyses of
the data derived from 17 bimonthly mark-recapture events determined that each
tree harbors a small, mostly nonmigratory population of 8-26 snails of which
2-4 are adults; the snails average 4.2 mm long at birth and 21.3 mm long when
growth stops; growth is slow, with maturity reached in 5-7 years; annual
fecundity averages 6.2 offspring per adult; and mortality is about 98% over the
first 4 years of life. Given the high rate of juvenile mortality, adult snails must
reproduce for at least 12 years to replace themselves. From this we calculate a
minimum maximal life-span of 18-19 years. We conclude that the current high
rate of unexplained juvenile mortality, combined with late age at first reproduction
and low fecundity, places this species at very high risk from any sort of
perturbation, particularly any selective predation on adults.
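As a rough check on how the reported figures fit together, here is a minimal back-of-the-envelope sketch; the assumption that the minimum maximal life-span is roughly the age at maturity plus the required reproductive span is mine, not stated in the abstract.

```python
# Back-of-the-envelope check of how the reported figures relate (the abstract does
# not spell out this calculation; the additivity assumption below is mine).
annual_fecundity = 6.2            # offspring per adult per year
survival_to_age_4 = 1 - 0.98      # ~98% mortality over the first 4 years of life

# Surviving offspring per adult per year, ignoring mortality after age 4.
recruits_per_year = annual_fecundity * survival_to_age_4      # ~0.12

# Assuming life-span ~= age at maturity (5-7 yr) + required reproductive span (12 yr):
min_max_lifespan = (5 + 12, 7 + 12)                           # (17, 19) years
print(round(recruits_per_year, 3), min_max_lifespan)          # close to the reported 18-19
```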
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
The last decade of machine learning has seen drastic increases in scale and
capabilities. Deep neural networks (DNNs) are increasingly being deployed in
the real world. However, they are difficult to analyze, raising concerns about
using them without a rigorous understanding of how they function. Effective
tools for interpreting them will be important for building more trustworthy AI
by helping to identify problems, fix bugs, and improve basic understanding. In
particular, "inner" interpretability techniques, which focus on explaining the
internal components of DNNs, are well-suited for developing a mechanistic
understanding, guiding manual modifications, and reverse engineering solutions.
Much recent work has focused on DNN interpretability, and rapid progress has
thus far made a thorough systematization of methods difficult. In this survey,
we review over 300 works with a focus on inner interpretability tools. We
introduce a taxonomy that classifies methods by what part of the network they
help to explain (weights, neurons, subnetworks, or latent representations) and
whether they are implemented during (intrinsic) or after (post hoc) training.
To our knowledge, we are also the first to survey a number of connections
between interpretability research and work in adversarial robustness, continual
learning, modularity, network compression, and studying the human visual
system. We discuss key challenges and argue that the status quo in
interpretability research is largely unproductive. Finally, we highlight the
importance of future work that emphasizes diagnostics, debugging, adversaries,
and benchmarking in order to make interpretability tools more useful to
engineers in practical applications.
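To make the survey's two-axis taxonomy concrete, here is a minimal sketch of it as a data structure; the class names and the example entries are illustrative assumptions, not items taken from the survey's tables.

```python
# Hypothetical sketch of the taxonomy: each method is classified by which part of
# the network it helps explain and whether it is applied during training
# (intrinsic) or after training (post hoc).
from dataclasses import dataclass
from enum import Enum

class NetworkPart(Enum):
    WEIGHTS = "weights"
    NEURONS = "neurons"
    SUBNETWORKS = "subnetworks"
    LATENT_REPRESENTATIONS = "latent representations"

class Stage(Enum):
    INTRINSIC = "intrinsic (during training)"
    POST_HOC = "post hoc (after training)"

@dataclass
class InterpretabilityMethod:
    name: str
    explains: NetworkPart
    stage: Stage

# Example entries (illustrative only).
methods = [
    InterpretabilityMethod("feature visualization", NetworkPart.NEURONS, Stage.POST_HOC),
    InterpretabilityMethod("pruning-based circuit analysis", NetworkPart.SUBNETWORKS, Stage.POST_HOC),
]
```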
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?
Neural language models (LMs) can be used to evaluate the truth of factual
statements in two ways: they can be either queried for statement probabilities,
or probed for internal representations of truthfulness. Past work has found
that these two procedures sometimes disagree, and that probes tend to be more
accurate than LM outputs. This has led some researchers to conclude that LMs
"lie" or otherwise encode non-cooperative communicative intents. Is this an
accurate description of today's LMs, or can query-probe disagreement arise in
other ways? We identify three different classes of disagreement, which we term
confabulation, deception, and heterogeneity. In many cases, the superiority of
probes is simply attributable to better calibration on uncertain answers rather
than a greater fraction of correct, high-confidence answers. In some cases,
queries and probes perform better on different subsets of inputs, and accuracy
can further be improved by ensembling the two. Code is available at
github.com/lingo-mit/lm-truthfulness.
Comment: Accepted to EMNLP 2023
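The two evaluation procedures contrasted above can be illustrated with a short sketch; this is not the paper's code, and the choice of GPT-2, the last-layer/last-token probe features, and the logistic-regression probe are assumptions for illustration.

```python
# Illustrative sketch: two ways to score a factual statement with a causal LM --
# (1) query the model for the statement's probability, and (2) probe its hidden
# states with a simple trained classifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tok = AutoTokenizer.from_pretrained("gpt2")        # model choice is an assumption
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def query_logprob(statement: str) -> float:
    """Average token log-probability the LM assigns to the statement (query)."""
    ids = tok(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

def hidden_features(statement: str) -> torch.Tensor:
    """Last-layer hidden state at the final token, used as probe input."""
    ids = tok(statement, return_tensors="pt").input_ids
    with torch.no_grad():
        out = lm(ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

def train_probe(statements, labels):
    """Fit a probe on labeled (statement, is_true) pairs (probe)."""
    X = torch.stack([hidden_features(s) for s in statements]).numpy()
    return LogisticRegression(max_iter=1000).fit(X, labels)
```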
Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents
Adversarial examples can be useful for identifying vulnerabilities in AI
systems before they are deployed. In reinforcement learning (RL), adversarial
policies can be developed by training an adversarial agent to minimize a target
agent's rewards. Prior work has studied black-box versions of these attacks
where the adversary only observes the world state and treats the target agent
as any other part of the environment. However, this does not take into account
additional structure in the problem. In this work, we study white-box
adversarial policies and show that having access to a target agent's internal
state can be useful for identifying its vulnerabilities. We make two
contributions. (1) We introduce white-box adversarial policies where an
attacker observes both a target's internal state and the world state at each
timestep. We formulate ways of using these policies to attack agents in
2-player games and text-generating language models. (2) We demonstrate that
these policies can achieve higher initial and asymptotic performance against a
target agent than black-box controls. Code is available at
https://github.com/thestephencasper/lm_white_box_attacks
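A minimal sketch of the white-box formulation described above follows; it is not the paper's implementation, and the network shapes, layer sizes, and names are assumptions.

```python
# Illustrative sketch: a white-box adversarial policy that conditions on both the
# world state and the target agent's internal activations at each timestep.
import torch
import torch.nn as nn

class WhiteBoxAdversaryPolicy(nn.Module):
    def __init__(self, obs_dim: int, target_internal_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + target_internal_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, world_obs: torch.Tensor, target_internal: torch.Tensor) -> torch.Tensor:
        # A black-box adversary would see only `world_obs`; the white-box variant
        # also receives the target's internal state (e.g., hidden activations).
        return self.net(torch.cat([world_obs, target_internal], dim=-1))

# Usage: action logits for a single timestep (dimensions are arbitrary).
policy = WhiteBoxAdversaryPolicy(obs_dim=32, target_internal_dim=128, action_dim=8)
logits = policy(torch.randn(1, 32), torch.randn(1, 128))
```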
Robust Feature-Level Adversaries are Interpretability Tools
The literature on adversarial attacks in computer vision typically focuses on
pixel-level perturbations. These tend to be very difficult to interpret. Recent
work that manipulates the latent representations of image generators to create
"feature-level" adversarial perturbations gives us an opportunity to explore
interpretable adversarial attacks. We make three contributions. First, we
observe that feature-level attacks provide useful classes of inputs for
studying the representations in models. Second, we show that these adversaries
are versatile and highly robust. We demonstrate that they can be used to
produce targeted, universal, disguised, physically-realizable, and black-box
attacks at the ImageNet scale. Third, we show how these adversarial images can
be used as a practical interpretability tool for identifying bugs in networks.
We use these adversaries to make predictions about spurious associations
between features and classes which we then test by designing "copy/paste"
attacks in which one natural image is pasted into another to cause a targeted
misclassification. Our results indicate that feature-level attacks are a
promising approach for rigorous interpretability research. They support the
design of tools to better understand what a model has learned and diagnose
brittle feature associations.
Comment: Code available at https://github.com/thestephencasper/feature_level_ad
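The core idea of a feature-level attack can be sketched briefly; this is not the paper's code, and `generator`, `classifier`, the latent dimension, and the optimization settings are hypothetical placeholders.

```python
# Minimal sketch: optimize an image generator's latent vector so that a classifier
# assigns the generated image to a chosen target class. The perturbation lives in
# the generator's latent (feature) space rather than in pixel space.
import torch

def feature_level_attack(generator, classifier, target_class: int,
                         latent_dim: int = 512, steps: int = 200, lr: float = 0.05):
    z = torch.randn(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        image = generator(z)
        logits = classifier(image)
        loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_class]))   # push toward the target class
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(z).detach()                    # candidate adversarial image
```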
Explore, Establish, Exploit: Red Teaming Language Models from Scratch
Deploying large language models (LLMs) can pose hazards from harmful outputs
such as toxic or dishonest speech. Prior work has introduced tools that elicit
harmful outputs in order to identify and mitigate these risks. While this is a
valuable step toward securing language models, these approaches typically rely
on a pre-existing classifier for undesired outputs. This limits their
application to situations where the type of harmful behavior is known with
precision beforehand. However, this skips a central challenge of red teaming:
developing a contextual understanding of the behaviors that a model can
exhibit. Furthermore, when such a classifier already exists, red teaming has
limited marginal value because the classifier could simply be used to filter
training data or model outputs. In this work, we consider red teaming under the
assumption that the adversary is working from a high-level, abstract
specification of undesired behavior. The red team is expected to refine/extend
this specification and identify methods to elicit this behavior from the model.
Our red teaming framework consists of three steps: 1) Exploring the model's
behavior in the desired context; 2) Establishing a measurement of undesired
behavior (e.g., a classifier trained to reflect human evaluations); and 3)
Exploiting the model's flaws using this measure and an established red teaming
methodology. We apply this approach to red team GPT-2 and GPT-3 models to
systematically discover classes of prompts that elicit toxic and dishonest
statements. In doing so, we also construct and release the CommonClaim dataset
of 20,000 statements that have been labeled by human subjects as
common-knowledge-true, common-knowledge-false, or neither. Code is available at
https://github.com/thestephencasper/explore_establish_exploit_llms. CommonClaim
is available at https://github.com/Algorithmic-Alignment-Lab/CommonClaim.
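The three-step framework described above can be summarized as a small skeleton; the stage functions passed in are placeholders supplied by the caller, not functions from the released code.

```python
# Skeleton of the Explore -> Establish -> Exploit loop (illustrative only; each
# stage is supplied by the caller as a callable).
from typing import Callable, List, Tuple

def red_team(
    sample_outputs: Callable[[List[str]], List[str]],        # 1) Explore behavior
    collect_human_labels: Callable[[List[str]], List[int]],  # 2a) human judgments
    train_classifier: Callable[[List[str], List[int]], Callable[[str], float]],  # 2b) measure
    search_prompts: Callable[[Callable[[str], float]], List[str]],               # 3) attack
    seed_prompts: List[str],
) -> Tuple[Callable[[str], float], List[str]]:
    outputs = sample_outputs(seed_prompts)            # 1) Explore the model's behavior
    labels = collect_human_labels(outputs)            # 2) Establish a measurement of
    harm_score = train_classifier(outputs, labels)    #    undesired behavior
    adversarial_prompts = search_prompts(harm_score)  # 3) Exploit flaws using the measure
    return harm_score, adversarial_prompts
```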
ZeST-NeRF: Using temporal aggregation for Zero-Shot Temporal NeRFs
In the field of media production, video editing techniques play a pivotal
role. Recent approaches have had great success at performing novel view image
synthesis of static scenes. But adding temporal information adds an extra layer
of complexity. Previous models have focused on implicitly representing static
and dynamic scenes using NeRF. These models achieve impressive results but are
costly at training and inference time. They overfit an MLP to describe the
scene implicitly as a function of position. This paper proposes ZeST-NeRF, a
new approach that can produce temporal NeRFs for new scenes without retraining.
We can accurately reconstruct novel views using multi-view synthesis techniques
and scene flow-field estimation, trained only with unrelated scenes. We
demonstrate how existing state-of-the-art approaches from a range of fields
cannot adequately solve this new task and demonstrate the efficacy of our
solution. The resulting network improves quantitatively by 15% and produces
significantly better visual results.
Comment: VUA BMVC 2023
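For readers unfamiliar with the implicit-MLP formulation the abstract refers to, here is a generic sketch; it is the basic NeRF-style idea of an MLP overfit to one scene, not ZeST-NeRF's actual architecture, and all layer sizes are assumptions.

```python
# Generic sketch of an implicit scene MLP: map a 3D position (plus a time input
# for dynamic scenes) to density and color.
import torch
import torch.nn as nn

class ImplicitSceneMLP(nn.Module):
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),   # input: (x, y, z, t)
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),              # output: (density, r, g, b)
        )

    def forward(self, xyzt: torch.Tensor) -> torch.Tensor:
        out = self.net(xyzt)
        density = torch.relu(out[..., :1])     # non-negative density
        color = torch.sigmoid(out[..., 1:])    # RGB in [0, 1]
        return torch.cat([density, color], dim=-1)

# One query: density and color at a single space-time point.
print(ImplicitSceneMLP()(torch.randn(1, 4)))
```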
Red Teaming Deep Neural Networks with Feature Synthesis Tools
Interpretable AI tools are often motivated by the goal of understanding model
behavior in out-of-distribution (OOD) contexts. Despite the attention this area
of study receives, there are comparatively few cases where these tools have
identified previously unknown bugs in models. We argue that this is due, in
part, to a common feature of many interpretability methods: they analyze model
behavior by using a particular dataset. This only allows for the study of the
model in the context of features that the user can sample in advance. To
address this, a growing body of research involves interpreting models using
\emph{feature synthesis} methods that do not depend on a dataset.
In this paper, we benchmark the usefulness of interpretability tools on
debugging tasks. Our key insight is that we can implant human-interpretable
trojans into models and then evaluate these tools based on whether they can
help humans discover them. This is analogous to finding OOD bugs, except the
ground truth is known, allowing us to know when an interpretation is correct.
We make four contributions. (1) We propose trojan discovery as an evaluation
task for interpretability tools and introduce a benchmark with 12 trojans of 3
different types. (2) We demonstrate the difficulty of this benchmark with a
preliminary evaluation of 16 state-of-the-art feature attribution/saliency
tools. Even under ideal conditions, given direct access to data with the trojan
trigger, these methods still often fail to identify bugs. (3) We evaluate 7
feature-synthesis methods on our benchmark. (4) We introduce and evaluate 2 new
variants of the best-performing method from the previous evaluation. A website
for this paper and its code is at
https://benchmarking-interpretability.csail.mit.edu/
Comment: In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
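One way to implant a human-interpretable trojan of the kind described above is data poisoning with a visible trigger patch; the sketch below is illustrative only, and the trigger shape, poison rate, and target class are arbitrary assumptions rather than the benchmark's settings.

```python
# Illustrative sketch: poison a fraction of a training batch by stamping a small
# trigger patch and relabeling to a chosen target class. A model trained on such
# batches learns the trojan; interpretability tools are then scored on whether
# they help a human rediscover the trigger.
import torch

def poison_batch(images: torch.Tensor, labels: torch.Tensor,
                 target_class: int = 0, poison_frac: float = 0.1) -> tuple:
    images, labels = images.clone(), labels.clone()
    n_poison = int(poison_frac * images.shape[0])
    trigger = torch.ones(3, 8, 8)             # a bright 8x8 square as the trigger
    images[:n_poison, :, -8:, -8:] = trigger  # stamp the bottom-right corner
    labels[:n_poison] = target_class          # relabel to the target class
    return images, labels

# Usage on a random batch of 64x64 RGB images with 10 classes.
imgs, lbls = poison_batch(torch.rand(32, 3, 64, 64), torch.randint(0, 10, (32,)))
```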
Sporadic Human Cryptosporidiosis Caused by Cryptosporidium cuniculus, United Kingdom, 2007–2008
To investigate sporadic human cryptosporidiosis trends in the United Kingdom, we tested 3,030 Cryptosporidium spp.–positive fecal samples, submitted for routine typing in 2007–2008, for C. cuniculus. C. cuniculus prevalence was 1.2%; cases were mostly indigenous and occurred across all age groups. Most occurred during August–October and may be linked to exposure opportunities.