The Learnability of In-Context Learning
In-context learning is a surprising and important phenomenon that emerged
when modern language models were scaled to billions of learned parameters.
Without modifying a large language model's weights, it can be tuned to perform
various downstream natural language tasks simply by including concatenated
training examples of these tasks in its input. Though disruptive for many
practical applications of large language models, this emergent learning
paradigm is not well understood from a theoretical perspective. In this paper,
we propose a first-of-its-kind PAC-based framework for in-context learnability,
and use it to provide the first finite sample complexity results for the
in-context learning setup. Our framework includes an initial pretraining phase,
which fits a function to the pretraining distribution, and then a second
in-context learning phase, which keeps this function constant and concatenates
training examples of the downstream task in its input. We use our framework in
order to prove that, under mild assumptions, when the pretraining distribution
is a mixture of latent tasks (a model often considered for natural language
pretraining), these tasks can be efficiently learned via in-context learning,
even though the model's weights are unchanged and the input significantly
diverges from the pretraining distribution. Our theoretical analysis reveals
that in this setting, in-context learning is more about identifying the task
than about learning it, a result which is in line with a series of recent
empirical findings. We hope that the in-context learnability framework
presented in this paper will facilitate future progress towards a deeper
understanding of this important new learning paradigm.
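The "identifying rather than learning" reading of the abstract above can be illustrated with a toy Bayesian calculation. This is only a sketch under stated assumptions, not the paper's PAC framework: the two latent tasks, their label distributions, and the helper name `task_posterior` are all hypothetical, and the fixed "pretrained" knowledge is represented by a prior and per-task likelihoods that never change while examples are appended.

```python
import math


def task_posterior(prior, likelihoods, examples):
    """Posterior over latent tasks given a sequence of in-context examples.

    prior:       list of prior probabilities, one per latent task (assumed known)
    likelihoods: list of dicts mapping an observed label to its probability
                 under each task (assumed known and fixed, like frozen weights)
    examples:    list of observed labels placed in the context
    """
    # Work in log space so long contexts do not underflow.
    log_post = [math.log(p) for p in prior]
    for label in examples:
        for k, lik in enumerate(likelihoods):
            log_post[k] += math.log(lik[label])
    # Normalise with the max-subtraction trick.
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical mixture of two latent tasks with opposite label preferences.
prior = [0.5, 0.5]
likelihoods = [{"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}]
examples = ["a", "a", "b", "a"]

# Posterior after 0, 1, 2, ... in-context examples; nothing is retrained.
posterior_by_length = [
    task_posterior(prior, likelihoods, examples[:n])
    for n in range(len(examples) + 1)
]
for n, post in enumerate(posterior_by_length):
    print(n, [round(p, 4) for p in post])
```

In this toy setting, a handful of examples is enough to concentrate the posterior on one task, which is the sense in which the context selects a task rather than teaching a new one.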
Fundamental Limitations of Alignment in Large Language Models
An important aspect in developing language models that interact with humans
is aligning their behavior to be useful and unharmful for their human users.
This is usually achieved by tuning the model in a way that enhances desired
behaviors and inhibits undesired ones, a process referred to as alignment. In
this paper, we propose a theoretical approach called Behavior Expectation
Bounds (BEB) which allows us to formally investigate several inherent
characteristics and limitations of alignment in large language models.
Importantly, we prove that for any behavior that has a finite probability of
being exhibited by the model, there exist prompts that can trigger the model
into outputting this behavior, with probability that increases with the length
of the prompt. This implies that any alignment process that attenuates
undesired behavior but does not remove it altogether is not safe against
adversarial prompting attacks. Furthermore, our framework hints at the
mechanism by which leading alignment approaches such as reinforcement learning
from human feedback increase the LLM's proneness to being prompted into the
undesired behaviors. Moreover, we include the notion of personas in our BEB
framework, and find that behaviors which are generally very unlikely to be
exhibited by the model can be brought to the front by prompting the model to
behave as a specific persona. This theoretical result is demonstrated
experimentally at large scale by the so-called contemporary "ChatGPT jailbreaks",
where adversarial users trick the LLM into breaking its alignment guardrails by
triggering it into acting as a malicious persona. Our results expose
fundamental limitations in alignment of LLMs and bring to the forefront the
need to devise reliable mechanisms for ensuring AI safety.
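The claim that the probability of triggering an attenuated behavior grows with prompt length can be pictured with a small numerical sketch. Everything here is an illustrative assumption rather than the paper's BEB definitions: the model is treated as a two-component mixture, `alpha` is a hypothetical residual weight of the ill-behaved component after alignment, and `ratio` is a hypothetical factor by which each adversarial prompt sentence is more likely under that component than under the well-behaved one.

```python
def bad_component_weight(alpha, ratio, n_sentences):
    """Posterior weight of the ill-behaved component after a prompt.

    alpha:       assumed prior weight of the ill-behaved component (0 < alpha < 1)
    ratio:       assumed per-sentence likelihood ratio favouring that component
    n_sentences: number of adversarial prompt sentences
    """
    # Bayes rule for a two-component mixture with i.i.d. prompt sentences.
    numerator = alpha * ratio ** n_sentences
    return numerator / (numerator + (1.0 - alpha))


# Hypothetical numbers: 1% residual weight, each sentence 3x more likely
# under the ill-behaved component.
weights = [bad_component_weight(0.01, 3.0, n) for n in range(9)]
for n, w in enumerate(weights):
    print(n, round(w, 4))
```

Under these assumptions the weight rises monotonically toward 1 for any `alpha` above zero, which mirrors the abstract's point that attenuating a behavior without removing it leaves it reachable by a sufficiently long prompt.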
- …