458 research outputs found

    Provably Efficient $Q$-learning with Function Approximation via Distribution Shift Error Checking Oracle

    Full text link
    $Q$-learning with function approximation is one of the most popular methods in reinforcement learning. Though the idea of using function approximation was proposed at least 60 years ago, even in the simplest setup, i.e., approximating $Q$-functions with linear functions, it remains an open problem how to design a provably efficient algorithm that learns a near-optimal policy. The key challenges are how to efficiently explore the state space and how to decide when to stop exploring, in conjunction with the function approximation scheme. The current paper presents a provably efficient algorithm for $Q$-learning with linear function approximation. Under certain regularity assumptions, our algorithm, Difference Maximization $Q$-learning (DMQ), combined with linear function approximation, returns a near-optimal policy using a polynomial number of trajectories. Our algorithm introduces a new notion, the Distribution Shift Error Checking (DSEC) oracle. This oracle tests whether there exists a function in the function class that predicts well on a distribution $\mathcal{D}_1$, but predicts poorly on another distribution $\mathcal{D}_2$, where $\mathcal{D}_1$ and $\mathcal{D}_2$ are distributions over states induced by two different exploration policies. For the linear function class, this oracle is equivalent to solving a top eigenvalue problem. We believe our algorithmic insights, especially the DSEC oracle, are also useful in designing and analyzing reinforcement learning algorithms with general function approximation.
    Comment: In NeurIPS 201
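
    The abstract notes that, for the linear function class, the DSEC oracle reduces to a top eigenvalue problem. Below is a minimal sketch of one way such an oracle could be instantiated, assuming only that features sampled under the two exploration policies are available as matrices Phi1 and Phi2; the regularization, threshold, and function names are illustrative choices, not the paper's.

```python
import numpy as np
from scipy.linalg import eigh

def dsec_oracle(Phi1, Phi2, lam=1e-3, threshold=10.0):
    """Illustrative sketch of a Distribution Shift Error Checking oracle for
    the linear function class (names, regularization and threshold are mine).

    Phi1: (n1, d) features of states sampled under the first exploration policy.
    Phi2: (n2, d) features of states sampled under the second exploration policy.

    For linear predictors f(x) = theta . phi(x), "fits D1 well but D2 poorly"
    corresponds to a direction theta whose energy under the D2 covariance is
    much larger than under the (regularized) D1 covariance; the worst-case
    ratio is the top generalized eigenvalue of (Sigma2, Sigma1 + lam * I).
    """
    d = Phi1.shape[1]
    Sigma1 = Phi1.T @ Phi1 / len(Phi1)
    Sigma2 = Phi2.T @ Phi2 / len(Phi2)
    top = eigh(Sigma2, Sigma1 + lam * np.eye(d), eigvals_only=True).max()
    # Report a harmful distribution shift if some direction that D1 barely
    # covers carries large mass under D2.
    return top > threshold
```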

    Is Long Horizon Reinforcement Learning More Difficult Than Short Horizon Reinforcement Learning?

    Full text link
    Learning to plan for long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the number of episodes it takes to provably discover a policy whose value is within $\varepsilon$ of the optimal value, where the value is measured by the normalized cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there exists a sample complexity lower bound which exhibits a polynomial dependence on the horizon -- a conjecture which is consistent with all known sample complexity upper bounds. This work refutes this conjecture, proving that tabular, episodic reinforcement learning is possible with a sample complexity that scales only logarithmically with the planning horizon. In other words, when the values are appropriately normalized (to lie in the unit interval), this result shows that long horizon RL is no more difficult than short horizon RL, at least in a minimax sense. Our analysis introduces two ideas: (i) the construction of an $\varepsilon$-net for optimal policies whose log-covering number scales only logarithmically with the planning horizon, and (ii) the Online Trajectory Synthesis algorithm, which adaptively evaluates all policies in a given policy class using sample complexity that scales with the log-covering number of the given policy class. Both may be of independent interest.
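
    To make the normalization the abstract relies on explicit, the display below restates the setup in standard notation; the notation and the exact form of the sample-complexity claim are a paraphrase of the abstract, not a quotation from the paper.

```latex
% Per-episode return normalized to the unit interval, and the learning goal:
\[
  V^{\pi} \;=\; \mathbb{E}_{\pi}\Big[\textstyle\sum_{h=1}^{H} r_h\Big] \in [0,1],
  \qquad
  V^{*} - V^{\hat{\pi}} \;\le\; \varepsilon .
\]
% The result refuting the conjecture: the number of episodes needed is
% polynomial in the tabular quantities and 1/\varepsilon, but only
% polylogarithmic in the planning horizon H:
\[
  \text{episodes} \;=\; \mathrm{poly}\big(|\mathcal{S}|, |\mathcal{A}|, 1/\varepsilon\big)\cdot \mathrm{polylog}(H).
\]
```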

    Provably Efficient Exploration for Reinforcement Learning Using Unsupervised Learning

    Full text link
    Motivated by the prevailing paradigm of using unsupervised learning for efficient exploration in reinforcement learning (RL) problems [Tang et al., 2017; Bellemare et al., 2016], we investigate when this paradigm is provably efficient. We study episodic Markov decision processes with rich observations generated from a small number of latent states. We present a general algorithmic framework that is built upon two components: an unsupervised learning algorithm and a no-regret tabular RL algorithm. Theoretically, we prove that as long as the unsupervised learning algorithm enjoys a polynomial sample complexity guarantee, we can find a near-optimal policy with sample complexity polynomial in the number of latent states, which is significantly smaller than the number of observations. Empirically, we instantiate our framework on a class of hard exploration problems to demonstrate the practicality of our theory.
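
    A schematic of the two-component framework described in the abstract, assuming a minimal environment with reset() -> obs and step(a) -> (obs, reward, done); the decoder and the tabular agent are toy stand-ins (nearest-centroid decoding and epsilon-greedy Q-learning), not the algorithms analyzed in the paper.

```python
import numpy as np

class NearestCentroidDecoder:
    """Toy stand-in for the unsupervised component: decode each rich
    observation to the index of its nearest centroid (a latent-state label).
    The centroids would come from an unsupervised learner, e.g. clustering."""
    def __init__(self, centroids):
        self.centroids = np.asarray(centroids)   # (num_latent_states, obs_dim)

    def decode(self, obs):
        return int(np.argmin(np.linalg.norm(self.centroids - obs, axis=1)))

class TabularQAgent:
    """Toy stand-in for the no-regret tabular RL component
    (plain epsilon-greedy Q-learning, purely for illustration)."""
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, eps=0.1, seed=0):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma, self.eps = lr, gamma, eps
        self.rng = np.random.default_rng(seed)

    def act(self, s):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.q.shape[1]))
        return int(np.argmax(self.q[s]))

    def update(self, s, a, r, s_next, done):
        target = r + (0.0 if done else self.gamma * self.q[s_next].max())
        self.q[s, a] += self.lr * (target - self.q[s, a])

def run_framework(env, decoder, agent, episodes):
    """Glue loop: decode rich observations to latent states, then let the
    tabular agent act and learn on the decoded (small) state space."""
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            s = decoder.decode(obs)
            a = agent.act(s)
            next_obs, r, done = env.step(a)
            agent.update(s, a, r, decoder.decode(next_obs), done)
            obs = next_obs
```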

    FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs

    Full text link
    In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models.
    Comment: New algorithm and analysis to remove the reachability assumption
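
    The low rank transition structure mentioned above is usually written as a factorization through d-dimensional feature maps; the display below uses the standard low-rank MDP notation rather than quoting the paper.

```latex
% Low-rank MDP: the (unknown) transition kernel factorizes through
% d-dimensional feature maps, which is what FLAMBE aims to learn.
\[
  P(s' \mid s, a) \;=\; \big\langle \phi^{*}(s,a),\, \mu^{*}(s') \big\rangle,
  \qquad \phi^{*}(s,a),\, \mu^{*}(s') \in \mathbb{R}^{d},
\]
% so estimating \phi^* and \mu^* from trajectories is a non-linear matrix
% (operator) decomposition problem, as described in the abstract.
```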

    Smooth Structured Prediction Using Quantum and Classical Gibbs Samplers

    Full text link
    We introduce two quantum algorithms for solving structured prediction problems. We show that a stochastic subgradient descent method that uses the quantum minimum finding algorithm and takes its probabilistic failure into account solves the structured prediction problem with a runtime that scales with the square root of the size of the label space, and in $\widetilde{O}(1/\epsilon)$ with respect to the precision, $\epsilon$, of the solution. Motivated by robust inference techniques in machine learning, we introduce another quantum algorithm that solves a smooth approximation of the structured prediction problem with a similar quantum speedup in the size of the label space and a similar scaling in the precision parameter. In doing so, we analyze a stochastic gradient algorithm for convex optimization in the presence of an additive error in the calculation of the gradients, and show that its convergence rate does not deteriorate if the additive errors are of the order $O(\sqrt{\epsilon})$. This algorithm uses quantum Gibbs sampling at temperature $\Omega(\epsilon)$ as a subroutine. Based on these theoretical observations, we propose a method for using quantum Gibbs samplers to combine feedforward neural networks with probabilistic graphical models for quantum machine learning. Our numerical results using Monte Carlo simulations on an image tagging task demonstrate the benefit of the approach.
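
    The robustness claim about additive gradient errors can be seen in a small classical experiment: gradient descent on a convex objective still converges when every gradient is corrupted by noise of magnitude on the order of $\sqrt{\epsilon}$. The script below is an illustrative toy (a least-squares objective with Gaussian error injection), not the quantum algorithm from the paper.

```python
import numpy as np

def noisy_gradient_descent(grad, x0, steps, step_size, err_scale, rng):
    """Gradient descent where every gradient evaluation is corrupted by an
    additive error of magnitude ~ err_scale; returns the averaged iterate.
    The objective and step-size schedule are illustrative choices."""
    x = x0.copy()
    avg = np.zeros_like(x0)
    for t in range(1, steps + 1):
        g_noisy = grad(x) + err_scale * rng.standard_normal(x.shape)  # additive error
        x = x - step_size / np.sqrt(t) * g_noisy
        avg += (x - avg) / t                                          # running average
    return avg

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    b = rng.standard_normal(50)
    grad = lambda x: A.T @ (A @ x - b) / len(b)   # gradient of 0.5*||Ax-b||^2/n
    eps = 1e-2
    x_hat = noisy_gradient_descent(grad, np.zeros(10), steps=5000,
                                   step_size=0.5, err_scale=np.sqrt(eps), rng=rng)
    x_star = np.linalg.lstsq(A, b, rcond=None)[0]
    obj = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2 / len(b)
    print(f"suboptimality with O(sqrt(eps)) gradient error: {obj(x_hat) - obj(x_star):.4f}")
```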

    Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension

    Full text link
    Value function approximation has demonstrated phenomenal empirical success in reinforcement learning (RL). Nevertheless, despite recent progress on developing theory for RL with linear function approximation, the understanding of general function approximation schemes largely remains missing. In this paper, we establish a provably efficient RL algorithm with general value function approximation. We show that if the value functions admit an approximation with a function class $\mathcal{F}$, our algorithm achieves a regret bound of $\widetilde{O}(\mathrm{poly}(dH)\sqrt{T})$ where $d$ is a complexity measure of $\mathcal{F}$ that depends on the eluder dimension [Russo and Van Roy, 2013] and log-covering numbers, $H$ is the planning horizon, and $T$ is the number of interactions with the environment. Our theory generalizes recent progress on RL with linear value function approximation and does not make explicit assumptions on the model of the environment. Moreover, our algorithm is model-free and provides a framework to justify the effectiveness of algorithms used in practice.
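
    For concreteness, the regret bound quoted above measures the following quantity, written in standard episodic-RL notation (the notation is ours, not the paper's):

```latex
% Regret over K episodes (T = KH interactions with the environment), against
% the optimal value from each episode's initial state:
\[
  \mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Big( V^{*}(s_{1}^{k}) - V^{\pi_k}(s_{1}^{k}) \Big)
  \;\le\; \widetilde{O}\big(\mathrm{poly}(dH)\sqrt{T}\big),
\]
% where d is the complexity measure built from the eluder dimension and the
% log-covering number of the value-function class \mathcal{F}.
```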

    Efficient Planning in Large MDPs with Weak Linear Function Approximation

    Full text link
    Large-scale Markov decision processes (MDPs) require planning algorithms with runtime independent of the number of states of the MDP. We consider the planning problem in MDPs using linear value function approximation with only weak requirements: low approximation error for the optimal value function, and a small set of "core" states whose features span those of other states. In particular, we make no assumptions about the representability of policies or value functions of non-optimal policies. Our algorithm produces almost-optimal actions for any state using a generative oracle (simulator) for the MDP, while its computation time scales polynomially with the number of features, core states, and actions, and with the effective horizon.
    Comment: 12 pages and appendix (10 pages). Submitted to the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada
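
    The "core states" requirement above can be made concrete with a small linear-algebra sketch: if the feature vectors of the core states span the features of every other state, the coefficients of that combination let value estimates computed at the core states (via the generative oracle) transfer to arbitrary states. The helper below is an illustrative least-squares sketch of this step, not the paper's planning algorithm.

```python
import numpy as np

def core_combination(phi_query, core_features, rcond=None):
    """Express the feature vector of an arbitrary state as a linear combination
    of the feature vectors of the "core" states (the weak requirement in the
    abstract is that core-state features span the rest).

    phi_query:     (d,)   feature vector of the state we want to plan for.
    core_features: (m, d) feature matrix of the m core states.
    Returns weights w with  core_features.T @ w ~= phi_query.
    """
    w, *_ = np.linalg.lstsq(core_features.T, phi_query, rcond=rcond)
    return w

# With a linear value function V(s) ~= theta . phi(s), values estimated at the
# core states via the simulator then transfer to any state:
#   V(s_query) ~= w . V_core,  where V_core[i] is the estimate at core state i.
```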

    An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap

    Full text link
    A fundamental question in the theory of reinforcement learning is: suppose the optimal $Q$-function lies in the linear span of a given $d$-dimensional feature mapping, is sample-efficient reinforcement learning (RL) possible? The recent and remarkable result of Weisz et al. (2020) resolved this question in the negative, providing an exponential (in $d$) sample size lower bound, which holds even if the agent has access to a generative model of the environment. One may hope that this information-theoretic barrier for RL can be circumvented by further supposing an even more favorable assumption: there exists a constant suboptimality gap between the optimal $Q$-value of the best action and that of the second-best action (for all states). The hope is that having a large suboptimality gap would permit easier identification of optimal actions themselves, thus making the problem tractable; indeed, provided the agent has access to a generative model, sample-efficient RL is in fact possible with the addition of this more favorable assumption. This work focuses on this question in the standard online reinforcement learning setting, where our main result resolves this question in the negative: our hardness result shows that an exponential sample complexity lower bound still holds even if a constant suboptimality gap is assumed in addition to having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this implies an exponential separation between the online RL setting and the generative model setting. Complementing our negative hardness result, we give two positive results showing that provably sample-efficient RL is possible either under an additional low-variance assumption or under a novel hypercontractivity assumption (both implicitly place stronger conditions on the underlying dynamics model).
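
    The two assumptions discussed above, linear realizability of the optimal Q-function and a constant suboptimality gap, can be stated as follows in standard notation (ours, not quoted from the paper):

```latex
% Linear realizability of the optimal Q-function:
\[
  Q^{*}(s,a) \;=\; \langle \theta^{*}, \phi(s,a) \rangle
  \quad \text{for some } \theta^{*} \in \mathbb{R}^{d},
\]
% and a constant suboptimality gap between the best and second-best actions:
\[
  \min_{s}\; \min_{a \neq \pi^{*}(s)} \Big( Q^{*}\big(s, \pi^{*}(s)\big) - Q^{*}(s,a) \Big) \;\ge\; \Delta
  \quad \text{for a constant } \Delta > 0 .
\]
```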

    Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

    Full text link
    We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash Equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an $\widetilde{O}(\sqrt{d^3 H^3 T})$ upper bound on the duality gap and regret, where $d$ is the linear dimension, $H$ the horizon, and $T$ the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash Equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a Coarse Correlated Equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
    Comment: Accepted for presentation at COLT 202
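
    The abstract's final point is that a coarse correlated equilibrium of a general-sum matrix game can be computed efficiently, unlike a Nash equilibrium. A standard way to do this is a feasibility linear program over joint action distributions; the sketch below illustrates that LP and is not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def coarse_correlated_equilibrium(A, B):
    """Find a coarse correlated equilibrium (CCE) of a two-player general-sum
    matrix game via a feasibility linear program (standard formulation,
    illustrative code).  A[i, j] and B[i, j] are the payoffs of players 1 and 2
    when they play actions i and j.  Returns a joint distribution over pairs."""
    m, n = A.shape
    rows = []
    # Player 1 must not gain by deviating to any fixed action i_dev.
    for i_dev in range(m):
        coeff = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                coeff[i, j] = A[i_dev, j] - A[i, j]
        rows.append(coeff.ravel())
    # Player 2 must not gain by deviating to any fixed action j_dev.
    for j_dev in range(n):
        coeff = np.zeros((m, n))
        for i in range(m):
            for j in range(n):
                coeff[i, j] = B[i, j_dev] - B[i, j]
        rows.append(coeff.ravel())
    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))
    A_eq = np.ones((1, m * n))              # probabilities sum to one
    b_eq = np.array([1.0])
    res = linprog(c=np.zeros(m * n), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, 1), method="highs")
    return res.x.reshape(m, n)

# Example: one CCE of matching pennies (a zero-sum game).
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(coarse_correlated_equilibrium(A, -A))
```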

    Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity

    Full text link
    Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure, in order to guarantee sample efficient RL. Such efforts typically assume the transition dynamics or value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions. We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multi-layer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.
    Comment: ICML 202