
    Towards the Fundamental Limits of Knowledge Transfer over Finite Domains

    We characterize the statistical efficiency of knowledge transfer through $n$ samples from a teacher to a probabilistic student classifier with input space $\mathcal{S}$ over labels $\mathcal{A}$. We show that privileged information at three progressive levels accelerates the transfer. At the first level, only samples with hard labels are known, via which the maximum likelihood estimator attains the minimax rate $\sqrt{|\mathcal{S}||\mathcal{A}|/n}$. The second level additionally makes the teacher probabilities of the sampled labels available, which turns out to boost the convergence rate lower bound to $|\mathcal{S}||\mathcal{A}|/n$. However, under this second data acquisition protocol, minimizing a naive adaptation of the cross-entropy loss results in an asymptotically biased student. We overcome this limitation and achieve the fundamental limit by using a novel empirical variant of the squared-error logit loss. The third level further equips the student with the soft labels (complete logits) on $\mathcal{A}$ for every sampled input, thereby provably enabling the student to enjoy a rate $|\mathcal{S}|/n$ free of $|\mathcal{A}|$. We find any Kullback-Leibler divergence minimizer to be optimal in the last case. Numerical simulations distinguish the four learners and corroborate our theory.
    Comment: 41 pages, 2 figures; Appendix polished.
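
    As a rough illustration of the first acquisition level (hard labels only), the sketch below estimates the teacher's conditional distributions with the plug-in maximum likelihood estimator, i.e., per-input empirical label frequencies, and compares the average L1 error against the $\sqrt{|\mathcal{S}||\mathcal{A}|/n}$ scaling. The teacher table, domain sizes, and sample count are arbitrary placeholders, not the paper's simulation setup.

        import numpy as np

        rng = np.random.default_rng(0)

        # Placeholder sizes: |S| inputs, |A| labels, n hard-label samples.
        S, A, n = 20, 5, 10_000

        # Random teacher: each row is a conditional distribution p(a | s).
        teacher = rng.dirichlet(np.ones(A), size=S)

        # First acquisition level: the student only sees (input, hard label) pairs.
        inputs = rng.integers(0, S, size=n)
        labels = np.array([rng.choice(A, p=teacher[s]) for s in inputs])

        # Plug-in MLE: empirical label frequencies per input (uniform fallback
        # for inputs that were never sampled).
        counts = np.zeros((S, A))
        np.add.at(counts, (inputs, labels), 1.0)
        row_totals = counts.sum(axis=1, keepdims=True)
        mle = np.where(row_totals > 0, counts / np.maximum(row_totals, 1), 1.0 / A)

        # Average L1 estimation error vs. the sqrt(|S||A|/n) minimax scaling.
        err = np.abs(mle - teacher).sum(axis=1).mean()
        print(f"avg L1 error: {err:.4f}, sqrt(|S||A|/n) = {np.sqrt(S * A / n):.4f}")

    Growing $n$ in this toy script should shrink the printed error roughly in step with the square-root benchmark, which is the behaviour the first-level minimax rate predicts.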

    The Effective Horizon Explains Deep RL Performance in Stochastic Environments

    Reinforcement learning (RL) theory has largely focused on proving minimax sample complexity bounds. These require strategic exploration algorithms that use relatively limited function classes for representing the policy or value function. Our goal is to explain why deep RL algorithms often perform well in practice, despite using random exploration and much more expressive function classes like neural networks. Our work arrives at an explanation by showing that many stochastic MDPs can be solved by performing only a few steps of value iteration on the random policy's Q function and then acting greedily. When this is true, we find that it is possible to separate the exploration and learning components of RL, making it much easier to analyze. We introduce a new RL algorithm, SQIRL, that iteratively learns a near-optimal policy by exploring randomly to collect rollouts and then performing a limited number of steps of fitted-Q iteration over those rollouts. Any regression algorithm that satisfies basic in-distribution generalization properties can be used in SQIRL to efficiently solve common MDPs. This can explain why deep RL works, since it is empirically established that neural networks generalize well in-distribution. Furthermore, SQIRL explains why random exploration works well in practice. We leverage SQIRL to derive instance-dependent sample complexity bounds for RL that are exponential only in an "effective horizon" of lookahead and in the complexity of the class used for function approximation. Empirically, we also find that SQIRL performance strongly correlates with PPO and DQN performance in a variety of stochastic environments, supporting that our theoretical analysis is predictive of practical performance. Our code and data are available at https://github.com/cassidylaidlaw/effective-horizon.
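
    The toy sketch below mirrors the two ingredients named in the abstract: random exploration to collect rollouts, followed by a limited number of fitted-Q backups over those rollouts and greedy action selection. It uses a small synthetic tabular MDP, with tabular averaging standing in for the generic in-distribution regressor, so it is an illustration of the idea rather than the authors' SQIRL implementation (see their repository above for that).

        import numpy as np

        rng = np.random.default_rng(1)

        # Toy finite tabular MDP with placeholder sizes (not the paper's benchmarks).
        n_states, n_actions, horizon = 10, 3, 5
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # next-state dists
        R = rng.random((n_states, n_actions))                             # rewards

        def collect_random_rollouts(num_episodes):
            """Exploration phase: act uniformly at random and record transitions."""
            data = []
            for _ in range(num_episodes):
                s = rng.integers(n_states)
                for _ in range(horizon):
                    a = rng.integers(n_actions)
                    s_next = rng.choice(n_states, p=P[s, a])
                    data.append((s, a, R[s, a], s_next))
                    s = s_next
            return data

        def limited_fitted_q(data, k):
            """Learning phase: k fitted-Q backups over the rollout data; tabular
            averaging stands in for the generic regression step."""
            Q = np.zeros((n_states, n_actions))
            for _ in range(k):
                targets = np.zeros((n_states, n_actions))
                counts = np.zeros((n_states, n_actions))
                for s, a, r, s_next in data:
                    targets[s, a] += r + Q[s_next].max()
                    counts[s, a] += 1
                Q = targets / np.maximum(counts, 1)
            return Q

        Q = limited_fitted_q(collect_random_rollouts(2_000), k=3)  # k ~ "effective horizon"
        print(Q.argmax(axis=1))  # act greedily with respect to the learned Q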

    Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons

    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). Our analysis shows that when the true reward function is linear, the widely used maximum likelihood estimator (MLE) converges under both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. However, we show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions. Additionally, we demonstrate that under the PL model, the true MLE and an alternative MLE that splits the $K$-wise comparison into pairwise comparisons both converge. Moreover, the true MLE is asymptotically more efficient. Our results validate the empirical success of existing RLHF algorithms in InstructGPT and provide new insights for algorithm design. Furthermore, our results unify the problem of RLHF and max-entropy Inverse Reinforcement Learning (IRL), and provide the first sample complexity bound for max-entropy IRL.
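
    As a minimal illustration of the reward-learning step, the sketch below fits the maximum likelihood estimator under the Bradley-Terry-Luce model with a linear reward, which reduces to logistic regression on feature differences. The feature dimension, sample size, and learning rate are placeholders rather than the paper's setting, and the policy-optimization step (where the abstract argues for pessimism) is not shown.

        import numpy as np

        rng = np.random.default_rng(2)

        # Placeholder setup: linear reward r(x) = <theta*, phi(x)> with synthetic features.
        d, n_pairs = 4, 5_000
        theta_star = rng.normal(size=d)
        phi_a = rng.normal(size=(n_pairs, d))   # features of the first item in each pair
        phi_b = rng.normal(size=(n_pairs, d))   # features of the second item

        # Bradley-Terry-Luce: P(a beats b) = sigmoid(<theta*, phi_a - phi_b>).
        diff = phi_a - phi_b
        prob_a_wins = 1.0 / (1.0 + np.exp(-diff @ theta_star))
        y = (rng.random(n_pairs) < prob_a_wins).astype(float)  # observed comparisons

        # MLE via gradient ascent on the BTL log-likelihood (plain logistic
        # regression on feature differences).
        theta = np.zeros(d)
        lr = 0.5
        for _ in range(500):
            p = 1.0 / (1.0 + np.exp(-diff @ theta))
            theta += lr * diff.T @ (y - p) / n_pairs

        print("estimation error:", np.linalg.norm(theta - theta_star))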

    Generative AI Security: Challenges and Countermeasures

    Generative AI's expanding footprint across numerous industries has led to both excitement and increased scrutiny. This paper delves into the unique security challenges posed by Generative AI and outlines potential research directions for managing these risks.

    Noisy Computing of the $\mathsf{OR}$ and $\mathsf{MAX}$ Functions

    We consider the problem of computing a function of $n$ variables using noisy queries, where each query is incorrect with some fixed and known probability $p \in (0, 1/2)$. Specifically, we consider the computation of the $\mathsf{OR}$ function of $n$ bits (where queries correspond to noisy readings of the bits) and the $\mathsf{MAX}$ function of $n$ real numbers (where queries correspond to noisy pairwise comparisons). We show that an expected number of queries of $(1 \pm o(1)) \frac{n \log \frac{1}{\delta}}{D_{\mathsf{KL}}(p \| 1-p)}$ is both sufficient and necessary to compute both functions with a vanishing error probability $\delta = o(1)$, where $D_{\mathsf{KL}}(p \| 1-p)$ denotes the Kullback-Leibler divergence between $\mathsf{Bern}(p)$ and $\mathsf{Bern}(1-p)$ distributions. Compared to previous work, our results tighten the dependence on $p$ in both the upper and lower bounds for the two functions.
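
    The tight query budget from the abstract can be computed directly, and a simple repetition-plus-majority baseline makes the gap concrete: by a union bound, that baseline needs on the order of $n \log(n/\delta) / D_{\mathsf{KL}}(p \| 1-p)$ queries, whereas the abstract's bound removes the extra $\log n$ factor and pins down the constant. The sketch below (with placeholder values of $n$, $p$, $\delta$, and the repetition count) is only this naive baseline, not the paper's algorithm.

        import numpy as np

        rng = np.random.default_rng(3)

        def kl_bern(p, q):
            """KL divergence D(Bern(p) || Bern(q))."""
            return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

        # Placeholder parameters, not the paper's experiments.
        n, p, delta = 1_000, 0.2, 1e-3
        tight_budget = n * np.log(1 / delta) / kl_bern(p, 1 - p)
        print(f"abstract's tight query budget ~ {tight_budget:.0f}")

        def naive_noisy_or(bits, p, reps):
            """Baseline: read every bit `reps` times through a BSC(p), take a
            per-bit majority vote, then OR the votes."""
            flips = rng.random((len(bits), reps)) < p     # which reads are flipped
            reads = bits[:, None] ^ flips                 # noisy readings of each bit
            votes = reads.mean(axis=1) > 0.5              # per-bit majority vote
            return votes.any()

        reps = 25                                          # illustrative repetition count
        bits = np.zeros(n, dtype=bool)
        bits[rng.integers(n)] = True                       # hide a single 1 among zeros
        print(naive_noisy_or(bits, p, reps), "using", n * reps, "queries")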