Towards the Fundamental Limits of Knowledge Transfer over Finite Domains
We characterize the statistical efficiency of knowledge transfer through
$n$ samples from a teacher to a probabilistic student classifier with input space
$\mathcal{S}$ over labels $\mathcal{A}$. We show that privileged information at
three progressive levels accelerates the transfer. At the first level, only
samples with hard labels are known, via which the maximum likelihood estimator
attains the minimax rate $\sqrt{|\mathcal{S}||\mathcal{A}|/n}$. The
second level has the teacher probabilities of sampled labels available in
addition, which turns out to boost the convergence rate lower bound to
$|\mathcal{S}||\mathcal{A}|/n$. However, under this second data
acquisition protocol, minimizing a naive adaptation of the cross-entropy loss
results in an asymptotically biased student. We overcome this limitation and
achieve the fundamental limit by using a novel empirical variant of the squared
error logit loss. The third level further equips the student with the soft
labels (complete logits) on $\mathcal{A}$ given every sampled input, thereby
provably enabling the student to enjoy a rate of $|\mathcal{S}|/n$, free of
$|\mathcal{A}|$. We find any Kullback-Leibler divergence minimizer to be
optimal in the last case. Numerical simulations distinguish the four learners
and corroborate our theory. (Comment: 41 pages, 2 figures; Appendix polished.)
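As a concrete illustration of the first acquisition level described above, the minimal sketch below computes the hard-label MLE on a finite domain, i.e. the per-input empirical label frequencies. The names and toy data are hypothetical illustrations, not the authors' code; the second and third levels would additionally consume the teacher's sampled-label probabilities or its full logits, which this sketch does not model.

```python
import numpy as np

def hard_label_mle(inputs, labels, num_inputs, num_labels):
    """Level-one estimator: for every input x in the finite domain, estimate
    the teacher's p(y | x) by the empirical frequency of sampled hard labels."""
    counts = np.zeros((num_inputs, num_labels))
    np.add.at(counts, (inputs, labels), 1.0)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0.0] = 1.0  # avoid dividing by zero for unvisited inputs
    return counts / totals

# Toy usage: 3 inputs, 4 labels, 1000 (input, hard-label) pairs from the teacher.
rng = np.random.default_rng(0)
teacher = rng.dirichlet(np.ones(4), size=3)          # ground-truth p(y | x)
xs = rng.integers(0, 3, size=1000)
ys = np.array([rng.choice(4, p=teacher[x]) for x in xs])
student = hard_label_mle(xs, ys, num_inputs=3, num_labels=4)
```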
The Effective Horizon Explains Deep RL Performance in Stochastic Environments
Reinforcement learning (RL) theory has largely focused on proving minimax
sample complexity bounds. These require strategic exploration algorithms that
use relatively limited function classes for representing the policy or value
function. Our goal is to explain why deep RL algorithms often perform well in
practice, despite using random exploration and much more expressive function
classes like neural networks. Our work arrives at an explanation by showing
that many stochastic MDPs can be solved by performing only a few steps of value
iteration on the random policy's Q function and then acting greedily. When this
is true, we find that it is possible to separate the exploration and learning
components of RL, which makes the analysis much simpler. We introduce a new RL
algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring
randomly to collect rollouts and then performing a limited number of steps of
fitted-Q iteration over those rollouts. Any regression algorithm that satisfies
basic in-distribution generalization properties can be used in SQIRL to
efficiently solve common MDPs. This can explain why deep RL works, since it is
empirically established that neural networks generalize well in-distribution.
Furthermore, SQIRL explains why random exploration works well in practice. We
leverage SQIRL to derive instance-dependent sample complexity bounds for RL
that are exponential only in an "effective horizon" of lookahead and otherwise
depend on the complexity of the class used for function approximation. Empirically, we also
find that SQIRL performance strongly correlates with PPO and DQN performance in
a variety of stochastic environments, which suggests that our theoretical analysis
is predictive of practical performance. Our code and data are available at
https://github.com/cassidylaidlaw/effective-horizon
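The minimal sketch below illustrates the SQIRL recipe described above, random exploration to collect rollouts followed by a few steps of fitted-Q iteration and greedy action selection, on a toy chain MDP. The environment, hyperparameters, and the tabular regressor are hypothetical stand-ins rather than the authors' released implementation (see the repository above for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic chain MDP: states 0..N-1, actions {0: left, 1: right};
# moving right occasionally slips, and reaching the last state yields reward 1.
N, A, HORIZON, GAMMA = 6, 2, 10, 0.9

def step(state, action):
    move = 1 if (action == 1 and rng.random() > 0.2) else -1
    nxt = int(np.clip(state + move, 0, N - 1))
    return nxt, (1.0 if nxt == N - 1 else 0.0)

def collect_random_rollouts(num_rollouts):
    """Exploration phase: act uniformly at random and record transitions."""
    data = []
    for _ in range(num_rollouts):
        s = 0
        for _ in range(HORIZON):
            a = int(rng.integers(A))
            s2, r = step(s, a)
            data.append((s, a, r, s2))
            s = s2
    return data

def fitted_q_iteration(data, k):
    """Learning phase: k steps of fitted-Q iteration over the fixed dataset.
    Per state-action averaging stands in for the generic regression oracle."""
    q = np.zeros((N, A))
    for _ in range(k):
        targets, counts = np.zeros((N, A)), np.zeros((N, A))
        for s, a, r, s2 in data:
            targets[s, a] += r + GAMMA * q[s2].max()
            counts[s, a] += 1
        q = np.divide(targets, np.maximum(counts, 1))
    return q

q = fitted_q_iteration(collect_random_rollouts(200), k=3)
greedy_policy = q.argmax(axis=1)   # act greedily with respect to the fitted Q
print(greedy_policy)
```

The full algorithm iterates this explore-then-fit loop; a single round is shown here for brevity.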
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
We provide a theoretical framework for Reinforcement Learning with Human
Feedback (RLHF). Our analysis shows that when the true reward function is
linear, the widely used maximum likelihood estimator (MLE) converges under both
the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However,
we show that when training a policy based on the learned reward model, MLE
fails while a pessimistic MLE provides policies with improved performance under
certain coverage assumptions. Additionally, we demonstrate that under the PL
model, the true MLE and an alternative MLE that splits the K-wise comparison
into pairwise comparisons both converge. Moreover, the true MLE is
asymptotically more efficient. Our results validate the empirical success of
existing RLHF algorithms in InstructGPT and provide new insights for algorithm
design. Furthermore, our results unify the problem of RLHF and max-entropy
Inverse Reinforcement Learning (IRL), and provide the first sample complexity
bound for max-entropy IRL.
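For intuition on the reward-learning step in the pairwise case, here is a minimal sketch of the MLE under the Bradley-Terry-Luce model with a linear reward, which reduces to logistic regression on feature differences. The feature maps and toy data are hypothetical, and this is not the paper's code.

```python
import numpy as np
from scipy.optimize import minimize

def btl_mle(phi_preferred, phi_rejected):
    """MLE of a linear reward r(x) = <theta, phi(x)> from pairwise comparisons.

    Under the Bradley-Terry-Luce model, P(i preferred over j) =
    sigmoid(<theta, phi_i - phi_j>), so the negative log-likelihood is the
    logistic loss on feature differences."""
    diffs = phi_preferred - phi_rejected            # shape (n, d)

    def neg_log_likelihood(theta):
        margins = diffs @ theta
        return np.sum(np.logaddexp(0.0, -margins))  # -log sigmoid(margins)

    theta0 = np.zeros(diffs.shape[1])
    return minimize(neg_log_likelihood, theta0, method="L-BFGS-B").x

# Toy usage: preferences generated from a hidden linear reward.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0, 0.5])
a, b = rng.normal(size=(500, 3)), rng.normal(size=(500, 3))
p_prefer_a = 1.0 / (1.0 + np.exp(-(a - b) @ theta_true))
a_wins = rng.random(500) < p_prefer_a
theta_hat = btl_mle(np.where(a_wins[:, None], a, b),
                    np.where(a_wins[:, None], b, a))
```

A pessimistic variant, as discussed above, would additionally shrink the policy's value estimate toward regions with good comparison coverage before policy optimization.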
Generative AI Security: Challenges and Countermeasures
Generative AI's expanding footprint across numerous industries has led to
both excitement and increased scrutiny. This paper delves into the unique
security challenges posed by Generative AI, and outlines potential research
directions for managing these risks.
Noisy Computing of the OR and MAX Functions
We consider the problem of computing a function of $n$ variables using noisy
queries, where each query is incorrect with some fixed and known probability
$p$. Specifically, we consider the computation of the OR
function of $n$ bits (where queries correspond to noisy readings of the bits)
and the MAX function of $n$ real numbers (where queries correspond
to noisy pairwise comparisons). We show that an expected number of queries of
$(1 \pm o(1))\,\frac{n \log(1/\delta)}{D_{\mathrm{KL}}(p \,\|\, 1-p)}$ is
both sufficient and necessary to compute both functions with a vanishing error
probability $\delta = o(1)$, where $D_{\mathrm{KL}}(p \,\|\, 1-p)$ denotes the
Kullback-Leibler divergence between $\mathrm{Bern}(p)$ and $\mathrm{Bern}(1-p)$
distributions. Compared to previous work, our results tighten the dependence on
$p$ in both the upper and lower bounds for the two functions.
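To make the noisy-query model concrete, the sketch below estimates the OR of $n$ bits when each readout is flipped with probability $p$ by majority-voting repeated reads of every bit. This naive repetition baseline only illustrates the model; it is not the paper's (more query-efficient) algorithm, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_read(bit, p):
    """One noisy query: return the bit flipped with probability p."""
    return bit ^ int(rng.random() < p)

def noisy_or_majority(bits, p, repeats):
    """Naive baseline: majority-vote each bit over `repeats` noisy reads,
    then OR the votes. Illustrates the query model only; the paper's scheme
    spends far fewer queries on average."""
    votes = []
    for b in bits:
        reads = [noisy_read(b, p) for _ in range(repeats)]
        votes.append(int(sum(reads) > repeats / 2))
    return int(any(votes))

# Toy usage: 20 bits with a single 1, 10% readout noise.
bits = [0] * 19 + [1]
print(noisy_or_majority(bits, p=0.1, repeats=15))
```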