An Efficient Fill Estimation Algorithm for Sparse Matrices and Tensors in Blocked Formats
Tensors, linear-algebraic extensions of matrices in arbitrary dimensions, have numerous applications in computer science and computational science. Many tensors are sparse, containing more than 90% zero entries. Efficient algorithms can leverage sparsity to do less work, but the irregular locations of the nonzero entries pose challenges to performance engineers. Many tensor operations such as tensor-vector multiplications can be sped up substantially by breaking the tensor into equally sized blocks (only storing blocks which contain nonzeros) and performing operations in each block using carefully tuned code. However, selecting the best block size is computationally challenging. Previously, Vuduc et al. defined the fill of a sparse tensor to be the number of stored entries in the blocked format divided by the number of nonzero entries, and showed that the fill can be used as an effective heuristic to choose a good block size. However, they gave no accuracy bounds for their method for estimating the fill, and it is vulnerable to adversarial examples. In this paper, we present a sampling-based method for finding a (1 + epsilon)-approximation to the fill of an order N tensor for all block sizes less than B, with probability at least 1 - delta, using O(B^(2N) log(B^N / delta) / epsilon^2) samples for each block size. We introduce an efficient routine to sample for all B^N block sizes at once in O(N B^N) time. We extend our concentration bounds to a more efficient bound based on sampling without replacement, using the recent Hoeffding-Serfling inequality. We then implement our algorithm and compare our scheme to that of Vuduc, as implemented in the Optimized Sparse Kernel Interface (OSKI) library. We find that our algorithm provides faster estimates of the fill at all accuracy levels, providing evidence that this is both a theoretical and practical improvement. Our code is available under the BSD 3-clause license at https://github.com/peterahrens/FillEstimation
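
As a concrete reference point, the sampling idea can be sketched in a few lines of Python for the special case of a 2-D sparse matrix and one block size. The function name `estimate_fill` and its interface are illustrative only; the paper's algorithm (and the linked repository) additionally samples all B^N block sizes at once and handles higher-order tensors.

```python
# A minimal sketch, not the authors' implementation: estimate the fill of a
# sparse matrix A for a single b1 x b2 block size by sampling nonzeros.
# It relies on the identity  #nonzero blocks = sum over nonzeros z of
# 1 / nnz(block containing z),  so  fill = b1*b2 * E[1 / nnz(block of z)]
# when z is drawn uniformly from the nonzeros.
import numpy as np
import scipy.sparse as sp

def estimate_fill(A, b1, b2, samples=1000, seed=0):
    """Estimate (entries stored by a b1 x b2 blocked format) / nnz(A)."""
    rng = np.random.default_rng(seed)
    coo, csr = A.tocoo(), A.tocsr()
    idx = rng.integers(0, coo.nnz, size=samples)
    acc = 0.0
    for k in idx:
        bi, bj = coo.row[k] // b1, coo.col[k] // b2
        # Only the sampled nonzero's block is examined, so cost is per-sample.
        block = csr[bi * b1:(bi + 1) * b1, bj * b2:(bj + 1) * b2]
        acc += 1.0 / block.nnz
    return b1 * b2 * acc / samples

if __name__ == "__main__":
    A = sp.random(2000, 2000, density=0.005, format="csr", random_state=1)
    print(estimate_fill(A, 4, 4))
```
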
Exponentially Improving the Complexity of Simulating the Weisfeiler-Lehman Test with Graph Neural Networks
Recent work shows that the expressive power of Graph Neural Networks (GNNs)
in distinguishing non-isomorphic graphs is exactly the same as that of the
Weisfeiler-Lehman (WL) graph test. In particular, they show that the WL test
can be simulated by GNNs. However, those simulations involve neural networks
for the 'combine' function of size polynomial or even exponential in the number
of graph nodes n, as well as feature vectors of length linear in n.
We present an improved simulation of the WL test on GNNs with
\emph{exponentially} lower complexity. In particular, the neural network
implementing the combine function in each node has only a polylogarithmic
number of parameters in n, and the feature vectors exchanged by the nodes of
the GNN consist of only O(log n) bits. We also give logarithmic lower bounds
for the feature vector length and the size of the neural networks, showing the
(near)-optimality of our construction. Comment: 22 pages, 5 figures, accepted at NeurIPS 2022
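
For readers unfamiliar with the test being simulated, the following Python sketch runs 1-WL color refinement on two small graphs. The helper `wl_colors` is written for illustration and is not code from the paper; the signature-based relabelling stands in for the 'combine' step that the paper implements with small neural networks.

```python
# A minimal sketch of the 1-WL (color refinement) test on graphs given as
# adjacency lists {node: [neighbors]}.
from collections import Counter

def wl_colors(adj):
    """Return the histogram of stable 1-WL colors of a graph."""
    colors = {v: 0 for v in adj}                       # uniform initial coloring
    for _ in range(len(adj)):                          # stabilizes within n rounds
        # New color = (own color, sorted multiset of neighbor colors).
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        relabel = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: relabel[sigs[v]] for v in adj}
        if new_colors == colors:
            break
        colors = new_colors
    return Counter(colors.values())

# 1-WL reports two graphs as non-isomorphic when their color histograms differ.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_colors(triangle) != wl_colors(path))          # True
```
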
Toy Models of Superposition
Neural networks often pack many unrelated concepts into a single neuron - a
puzzling phenomenon known as 'polysemanticity' which makes interpretability
much more challenging. This paper provides a toy model where polysemanticity
can be fully understood, arising as a result of models storing additional
sparse features in "superposition." We demonstrate the existence of a phase
change, a surprising connection to the geometry of uniform polytopes, and
evidence of a link to adversarial examples. We also discuss potential
implications for mechanistic interpretability. Comment: Also available at
https://transformer-circuits.pub/2022/toy_model/index.html
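
As a concrete reference point, here is a minimal PyTorch sketch of the kind of ReLU toy model the paper studies: sparse synthetic features reconstructed through a low-dimensional bottleneck with tied weights. The dimensions, sparsity level, and uniform feature importances below are illustrative choices, not the paper's exact setup.

```python
# A minimal sketch, assuming PyTorch: x_hat = ReLU(W^T W x + b) trained to
# reconstruct sparse features with more features than hidden dimensions.
import torch

n_features, n_hidden, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5_000):
    # Each feature is active with probability 1 - sparsity, value uniform in [0, 1).
    mask = (torch.rand(1024, n_features) > sparsity).float()
    x = mask * torch.rand(1024, n_features)
    x_hat = torch.relu(x @ W.T @ W + b)     # compress with W, decode with W^T, ReLU
    loss = ((x - x_hat) ** 2).mean()        # uniform feature importance for simplicity
    opt.zero_grad(); loss.backward(); opt.step()

# Nonzero off-diagonal entries of W^T W indicate features sharing hidden
# directions, i.e. features stored in superposition.
print((W.T @ W).detach())
```
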
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
As large language models (LLMs) perform more difficult tasks, it becomes
harder to verify the correctness and safety of their behavior. One approach to
help with this issue is to prompt LLMs to externalize their reasoning, e.g., by
having them generate step-by-step reasoning as they answer a question
(Chain-of-Thought; CoT). The reasoning may enable us to check the process that
models use to perform tasks. However, this approach relies on the stated
reasoning faithfully reflecting the model's actual reasoning, which is not
always the case. To improve over the faithfulness of CoT reasoning, we have
models generate reasoning by decomposing questions into subquestions.
Decomposition-based methods achieve strong performance on question-answering
tasks, sometimes approaching that of CoT while improving the faithfulness of
the model's stated reasoning on several recently-proposed metrics. By forcing
the model to answer simpler subquestions in separate contexts, we greatly
increase the faithfulness of model-generated reasoning over CoT, while still
achieving some of the performance gains of CoT. Our results show it is possible
to improve the faithfulness of model-generated reasoning; continued
improvements may lead to reasoning that enables us to verify the correctness
and safety of LLM behavior. Comment: For few-shot examples and prompts, see
https://github.com/anthropics/DecompositionFaithfulnessPaper
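
To make the contrast concrete, here is a minimal Python sketch of chain-of-thought prompting versus answering subquestions in separate contexts (factored decomposition). The `llm` function is a placeholder for any text-completion call, and the prompts are illustrative, not the paper's few-shot prompts; those are in the repository linked above.

```python
# A minimal sketch of CoT prompting vs. factored decomposition. `llm` is a
# placeholder, not a real API.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a call to your language model here")

def chain_of_thought(question: str) -> str:
    # One context: the model reasons and answers in a single pass.
    return llm(f"{question}\nLet's think step by step, then give a final answer.")

def factored_decomposition(question: str) -> str:
    # 1. Ask only for subquestions.
    subqs = [q for q in llm(f"Decompose into subquestions:\n{question}").splitlines() if q.strip()]
    # 2. Answer each subquestion in its own context, so reasoning cannot leak
    #    between steps; this is what makes the stated reasoning easier to check.
    answers = [llm(f"Answer concisely:\n{q}") for q in subqs]
    # 3. Recompose the final answer from the subquestion/answer pairs only.
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subqs, answers))
    return llm(f"{qa}\n\nUsing only the answers above, answer:\n{question}")
```
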
Towards Understanding Sycophancy in Language Models
Human feedback is commonly utilized to finetune AI assistants. But human
feedback may also encourage model responses that match user beliefs over
truthful ones, a behaviour known as sycophancy. We investigate the prevalence
of sycophancy in models whose finetuning procedure made use of human feedback,
and the potential role of human preference judgments in such behavior. We first
demonstrate that five state-of-the-art AI assistants consistently exhibit
sycophancy across four varied free-form text-generation tasks. To understand if
human preferences drive this broadly observed behavior, we analyze existing
human preference data. We find that when a response matches a user's views, it
is more likely to be preferred. Moreover, both humans and preference models
(PMs) prefer convincingly-written sycophantic responses over correct ones a
non-negligible fraction of the time. Optimizing model outputs against PMs also
sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results
indicate that sycophancy is a general behavior of state-of-the-art AI
assistants, likely driven in part by human preference judgments favoring
sycophantic responses. Comment: 32 pages, 20 figures
Language Models (Mostly) Know What They Know
We study whether language models can evaluate the validity of their own
claims and predict which questions they will be able to answer correctly. We
first show that larger models are well-calibrated on diverse multiple choice
and true/false questions when they are provided in the right format. Thus we
can approach self-evaluation on open-ended sampling tasks by asking models to
first propose answers, and then to evaluate the probability "P(True)" that
their answers are correct. We find encouraging performance, calibration, and
scaling for P(True) on a diverse array of tasks. Performance at self-evaluation
further improves when we allow models to consider many of their own samples
before predicting the validity of one specific possibility. Next, we
investigate whether models can be trained to predict "P(IK)", the probability
that "I know" the answer to a question, without reference to any particular
proposed answer. Models perform well at predicting P(IK) and partially
generalize across tasks, though they struggle with calibration of P(IK) on new
tasks. The predicted P(IK) probabilities also increase appropriately in the
presence of relevant source materials in the context, and in the presence of
hints towards the solution of mathematical word problems. We hope these
observations lay the groundwork for training more honest models, and for
investigating how honesty generalizes to cases where models are trained on
objectives other than the imitation of human writing. Comment: 23+17 pages; refs added, typos fixed
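
The P(True) self-evaluation loop can be sketched as follows; `complete` and `token_probability` are placeholders for whatever sampling and log-probability interface your model exposes, and the prompt format is illustrative rather than the paper's exact template.

```python
# A minimal sketch of P(True) self-evaluation with hypothetical model-access helpers.
def complete(prompt: str, n: int = 1) -> list[str]:
    raise NotImplementedError("sample n completions from your model")

def token_probability(prompt: str, token: str) -> float:
    raise NotImplementedError("probability the model assigns to `token` as the next token")

def p_true(question: str, n_samples: int = 5) -> tuple[str, float]:
    # 1. Propose candidate answers by sampling; showing the model several of its
    #    own samples before judging one of them improves self-evaluation.
    samples = complete(f"Q: {question}\nA:", n=n_samples)
    proposed = samples[0]
    context = "\n".join(f"Possible answer: {s}" for s in samples)
    # 2. P(True) is the probability the model assigns to " True" when asked
    #    whether the proposed answer is correct.
    prompt = (f"Q: {question}\n{context}\n"
              f"Proposed answer: {proposed}\n"
              f"Is the proposed answer correct? (True/False):")
    return proposed, token_probability(prompt, " True")
```
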
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
We describe our early efforts to red team language models in order to
simultaneously discover, measure, and attempt to reduce their potentially
harmful outputs. We make three main contributions. First, we investigate
scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B
parameters) and 4 model types: a plain language model (LM); an LM prompted to
be helpful, honest, and harmless; an LM with rejection sampling; and a model
trained to be helpful and harmless using reinforcement learning from human
feedback (RLHF). We find that the RLHF models are increasingly difficult to red
team as they scale, and we find a flat trend with scale for the other model
types. Second, we release our dataset of 38,961 red team attacks for others to
analyze and learn from. We provide our own analysis of the data and find a
variety of harmful outputs, which range from offensive language to more subtly
harmful non-violent unethical outputs. Third, we exhaustively describe our
instructions, processes, statistical methodologies, and uncertainty about red
teaming. We hope that this transparency accelerates our ability to work
together as a community in order to develop shared norms, practices, and
technical standards for how to red team language models.
Specific versus General Principles for Constitutional AI
Human feedback can prevent overtly harmful utterances in conversational
models, but may not automatically mitigate subtle problematic behaviors such as
a stated desire for self-preservation or power. Constitutional AI offers an
alternative, replacing human feedback with feedback from AI models conditioned
only on a list of written principles. We find this approach effectively
prevents the expression of such behaviors. The success of simple principles
motivates us to ask: can models learn general ethical behaviors from only a
single written principle? To test this, we run experiments using a principle
roughly stated as "do what's best for humanity". We find that the largest
dialogue models can generalize from this short constitution, resulting in
harmless assistants with no stated interest in specific motivations like power.
A general principle may thus partially avoid the need for a long list of
constitutions targeting potentially harmful behaviors. However, more detailed
constitutions still improve fine-grained control over specific types of harms.
This suggests both general and specific principles have value for steering AI
safely.
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Humans are capable of strategically deceptive behavior: behaving helpfully in
most situations, but then behaving very differently in order to pursue
alternative objectives when given the opportunity. If an AI system learned such
a deceptive strategy, could we detect it and remove it using current
state-of-the-art safety training techniques? To study this question, we
construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the
prompt states that the year is 2023, but insert exploitable code when the
stated year is 2024. We find that such backdoor behavior can be made
persistent, so that it is not removed by standard safety training techniques,
including supervised fine-tuning, reinforcement learning, and adversarial
training (eliciting unsafe behavior and then training to remove it). The
backdoor behavior is most persistent in the largest models and in models
trained to produce chain-of-thought reasoning about deceiving the training
process, with the persistence remaining even when the chain-of-thought is
distilled away. Furthermore, rather than removing backdoors, we find that
adversarial training can teach models to better recognize their backdoor
triggers, effectively hiding the unsafe behavior. Our results suggest that,
once a model exhibits deceptive behavior, standard techniques could fail to
remove such deception and create a false impression of safety. Comment: updated to add missing acknowledgements
Learned Interpolation for Better Streaming Quantiles with Worst Case Guarantees
An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with probability at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.
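
For background on where the worst-case guarantee comes from, here is a minimal Python sketch of a compactor-based sketch in the KLL family. It uses a fixed capacity per level and omits both KLL's level-dependent capacities and the learned interpolation studied in the paper, so it only illustrates how compaction trades space for additive rank error.

```python
# A simplified compactor-based streaming quantile sketch (illustrative, not KLL proper).
import random

class SimpleCompactorSketch:
    def __init__(self, k=200):
        self.k = k                      # capacity of each level
        self.levels = [[]]              # levels[h] holds items of weight 2**h

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        while len(self.levels[h]) >= self.k:
            # Compact: sort, keep every other item (random offset); survivors move
            # up one level, where each item represents twice as many stream items.
            buf = sorted(self.levels[h])
            survivors = buf[random.randint(0, 1)::2]
            self.levels[h] = []
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(survivors)
            h += 1

    def rank(self, q):
        # Estimated number of stream items strictly less than q.
        return sum((2 ** h) * sum(1 for x in lvl if x < q)
                   for h, lvl in enumerate(self.levels))

# Usage: the estimated rank of the median of 100,000 uniform draws should be near 50,000.
s = SimpleCompactorSketch()
for _ in range(100_000):
    s.update(random.random())
print(s.rank(0.5))
```
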