
    Concentration of Contractive Stochastic Approximation: Additive and Multiplicative Noise

    In this work, we study the concentration behavior of a stochastic approximation (SA) algorithm under a contractive operator with respect to an arbitrary norm. We consider two settings where the iterates are potentially unbounded: (1) bounded multiplicative noise, and (2) additive sub-Gaussian noise. We obtain maximal concentration inequalities on the convergence errors, and show that these errors have sub-Gaussian tails in the additive noise setting and super-polynomial tails (faster than polynomial decay) in the multiplicative noise setting. In addition, we provide an impossibility result showing that it is in general not possible to achieve sub-exponential tails for SA with multiplicative noise. To establish these results, we develop a novel bootstrapping argument that involves bounding the moment generating function of the generalized Moreau envelope of the error and constructing an exponential supermartingale to enable the use of Ville's maximal inequality. To demonstrate the applicability of our theoretical results, we use them to provide maximal concentration bounds for a large class of reinforcement learning algorithms, including but not limited to on-policy TD-learning with linear function approximation, off-policy TD-learning with generalized importance sampling factors, and Q-learning. To the best of our knowledge, super-polynomial concentration bounds for off-policy TD-learning have not been established in the literature due to the challenge of handling the combination of unbounded iterates and multiplicative noise.
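
    For intuition, the setting concerns stochastic fixed-point iterations roughly of the form x_{k+1} = x_k + a_k (F(x_k) - x_k + w_k), where F is a contraction with respect to some norm. The Python toy below is a minimal sketch of the two noise regimes; the operator, step-size schedule, and noise scales are illustrative assumptions and are not taken from the paper.

        import numpy as np

        rng = np.random.default_rng(0)

        def contractive_sa(F, x0, num_iters=10_000, noise="additive"):
            """Run x_{k+1} = x_k + a_k * (F(x_k) - x_k + w_k) under one of two toy noise models."""
            x = np.asarray(x0, dtype=float)
            for k in range(num_iters):
                a_k = 1.0 / (k + 100)                               # illustrative diminishing step size
                if noise == "additive":
                    # Additive sub-Gaussian noise, independent of the iterate.
                    w = rng.normal(scale=1.0, size=x.shape)
                else:
                    # Bounded multiplicative noise: magnitude scales with the current iterate.
                    w = rng.uniform(-0.5, 0.5, size=x.shape) * np.linalg.norm(x, np.inf)
                x = x + a_k * (F(x) - x + w)
            return x

        # Example: F(x) = 0.9 * A @ x with row sums of A equal to 1 is a 0.9-contraction in the
        # sup-norm, with fixed point 0; the SA iterates should concentrate around it.
        A = np.array([[0.5, 0.5], [0.2, 0.8]])
        print(contractive_sa(lambda x: 0.9 * (A @ x), x0=np.ones(2)))
        print(contractive_sa(lambda x: 0.9 * (A @ x), x0=np.ones(2), noise="multiplicative"))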

    Convergence Rates for Localized Actor-Critic in Networked Markov Potential Games

    We introduce a class of networked Markov potential games in which agents are associated with nodes in a network. Each agent has its own local potential function, and the reward of each agent depends only on the states and actions of the agents within a neighborhood. In this context, we propose a localized actor-critic algorithm. The algorithm is scalable since each agent uses only local information and does not need access to the global state. Further, the algorithm overcomes the curse of dimensionality through the use of function approximation. Our main results provide finite-sample guarantees up to a localization error and a function approximation error. Specifically, we achieve an $\tilde{\mathcal{O}}(\tilde{\epsilon}^{-4})$ sample complexity measured by the averaged Nash regret. This is the first finite-sample bound for multi-agent competitive games that does not depend on the number of agents.
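
    The key structural point is that every update an agent performs depends only on quantities indexed by its network neighborhood. The sketch below illustrates that structure only; the neighborhoods, dynamics, rewards, step sizes, and the tabular (rather than function-approximation) critics are hypothetical placeholders, not the paper's algorithm.

        import numpy as np

        rng = np.random.default_rng(1)

        # Toy setting: 3 agents on a line graph; each agent sees only its 1-hop neighborhood.
        NBHD = {0: (0, 1), 1: (0, 1, 2), 2: (1, 2)}
        N_STATES, N_ACTIONS, GAMMA = 2, 2, 0.9

        def softmax(z):
            z = z - z.max()
            p = np.exp(z)
            return p / p.sum()

        # Each agent i keeps a local critic over its neighborhood's (states, actions) and a local
        # actor over its own action given the neighborhood state; no agent touches the global state.
        critics = {i: np.zeros((N_STATES,) * len(N) + (N_ACTIONS,) * len(N)) for i, N in NBHD.items()}
        actors = {i: np.zeros((N_STATES,) * len(N) + (N_ACTIONS,)) for i, N in NBHD.items()}

        state = rng.integers(N_STATES, size=3)
        for t in range(5000):
            # Each agent acts using only its neighborhood state.
            actions = np.array([
                rng.choice(N_ACTIONS, p=softmax(actors[i][tuple(state[list(N)])]))
                for i, N in NBHD.items()
            ])
            next_state = (state + actions) % N_STATES                             # placeholder dynamics
            rewards = [1.0 if actions[i] == state[i] else 0.0 for i in range(3)]  # placeholder local rewards
            for i, N in NBHD.items():
                s_loc = tuple(state[list(N)])
                a_loc = tuple(actions[list(N)])
                s_next_loc = tuple(next_state[list(N)])
                # Localized TD update for the critic; the bootstrap averages over neighborhood
                # actions as a crude stand-in for the on-policy value (illustration only).
                td_err = rewards[i] + GAMMA * critics[i][s_next_loc].mean() - critics[i][s_loc + a_loc]
                critics[i][s_loc + a_loc] += 0.05 * td_err
                # Localized policy-gradient-style actor update driven by the local TD error.
                grad = -softmax(actors[i][s_loc])
                grad[actions[i]] += 1.0
                actors[i][s_loc] += 0.01 * td_err * grad
            state = next_state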

    A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

    This paper develops a unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as Markovian stochastic approximation (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this central result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as Q-learning, n-step TD, TD(λ), and off-policy TD algorithms including V-trace. As a by-product, by analyzing the performance bounds of the TD(λ) (and n-step TD) algorithm for general λ (and n), we demonstrate a bias-variance trade-off, i.e., the efficiency of bootstrapping in RL. This was first posed as an open problem in [37].
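
    As a concrete instance of the kind of algorithm covered by such a framework, the snippet below runs asynchronous Q-learning along a single Markovian trajectory, updating only the visited state-action pair at each step; the MDP, behavior policy, and step-size schedule are arbitrary illustrative choices, not those analyzed in the paper.

        import numpy as np

        rng = np.random.default_rng(2)

        # Tiny random MDP (a placeholder for the Markovian sampling setting).
        nS, nA, gamma = 4, 2, 0.9
        P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over next states
        R = rng.uniform(size=(nS, nA))

        Q = np.zeros((nS, nA))
        s = 0
        for k in range(50_000):
            a = rng.integers(nA)                             # uniformly random behavior policy (illustrative)
            s_next = rng.choice(nS, p=P[s, a])
            alpha = 0.5 / (1 + 0.001 * k)                    # illustrative step-size schedule
            # Asynchronous update: only the visited (s, a) entry changes at step k.
            Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
            s = s_next                                       # single Markovian trajectory, no resets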

    (3R,4R,4aS,7aR,12bS)-3-Cyclopropylmethyl-4a,9-dihydroxy-3-methyl-7-oxo-2,3,4,4a,5,6,7,7a-octahydro-1H-4,12-methanobenzofuro[3,2-e]isoquinolin-3-ium bromide

    The title compound, C21H26NO4+·Br−, also known as R-methylnaltrexone (MNTX) bromide, is a selective peripherally acting μ-opioid receptor antagonist with a noroxymorphone skeleton, synthesized by hydroxyl protection, N-methylation, deprotection and anion exchange of naltrexone. It comprises a five-ring system A/B/C/D/E. Rings C and E adopt distorted chair conformations, whereas ring D is in a half-chair conformation. The C/E ring junctions are trans fused. The dihedral angle between rings D and E is 82.3 (1)°, while the dihedral angles between the planes of rings C and A, and rings D and E are respectively 81.7 (1), 75.9 (1) and 12.2 (1)°. In the crystal, molecules are linked by O—H⋯Br hydrogen bonds.

    Halide homogenization for low energy loss in 2-eV-bandgap perovskites and increased efficiency in all-perovskite triple-junction solar cells

    Monolithic all-perovskite triple-junction solar cells have the potential to deliver power conversion efficiencies beyond those of state-of-the-art double-junction tandems and well beyond the detailed-balance limit for single junctions. Today, however, their performance is limited by large deficits in open-circuit voltage and unfulfilled potential in both short-circuit current density and fill factor in the wide-bandgap perovskite subcell. Here we find that halide heterogeneity—present even immediately following materials synthesis—plays a key role in interfacial non-radiative recombination and collection efficiency losses under prolonged illumination for Br-rich perovskites. We find that a diammonium halide salt, propane-1,3-diammonium iodide, introduced during film fabrication, improves halide homogenization in Br-rich perovskites, leading to enhanced operating stability and a record open-circuit voltage of 1.44 V in an inverted (p–i–n) device, ~86% of the detailed-balance limit for a bandgap of 1.97 eV. The efficient wide-bandgap subcell enables the fabrication of monolithic all-perovskite triple-junction solar cells with an open-circuit voltage of 3.33 V and a champion PCE of 25.1% (23.87% certified quasi-steady-state efficiency).

    Improved charge extraction in inverted perovskite solar cells with dual-site-binding ligands

    Inverted (p–i–n) perovskite solar cells (PSCs) afford improved operating stability in comparison to their n–i–p counterparts but have lagged in power conversion efficiency (PCE). The energetic losses responsible for this PCE deficit in p–i–n PSCs occur primarily at the interfaces between the perovskite and the charge-transport layers. Additive and surface treatments that use passivating ligands usually bind to a single active binding site; this dense packing of electrically resistive passivants perpendicular to the surface may limit the fill factor in p–i–n PSCs. We identified ligands that bind two neighboring lead(II) ion (Pb2+) defect sites in a planar ligand orientation on the perovskite. We fabricated p–i–n PSCs and report certified quasi-steady-state PCEs of 26.15% and 24.74% for 0.05- and 1.04-square-centimeter illuminated areas, respectively. The devices retain 95% of their initial PCE after 1200 hours of continuous 1-sun maximum power point operation at 65°C.

    A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms

    Reinforcement learning (RL) is a framework for solving sequential decision-making problems without requiring a model of the environment, and is viewed as a promising approach to achieving artificial intelligence. However, there is a huge gap between the empirical successes and the theoretical understanding of reinforcement learning. In this thesis, we make an effort to bridge this gap. More formally, this thesis focuses on designing data-efficient reinforcement learning algorithms and establishing their finite-sample guarantees. Specifically, we aim at answering the following question: suppose we carry out some reinforcement learning algorithm with a finite amount of samples (or a finite number of iterations); what can we say about the performance of the output of the algorithm? The more detailed motivation and the research background are presented in Chapter 1. The main body of this thesis is divided into three parts.

    Part I: Stochastic Approximation. In the first part of the thesis, we focus on studying the stochastic approximation (SA) method. Stochastic approximation is the major workhorse for large-scale optimization and machine learning, and is widely used in reinforcement learning for both algorithm design and algorithm analysis. Therefore, understanding the behavior of SA algorithms is of fundamental interest to the analysis of RL algorithms. In Chapter 2 and Chapter 3, we consider Markovian stochastic approximation under a contractive operator and under a strongly pseudo-monotone operator, and establish their finite-sample guarantees. These two results on stochastic approximation are used in later parts of the thesis to study reinforcement learning algorithms with a tabular representation and with linear function approximation. The main technique we use to analyze these stochastic approximation algorithms is the Lyapunov-drift method. Specifically, we construct novel Lyapunov functions (e.g., the generalized Moreau envelope in the case of stochastic approximation under a contraction assumption) to capture the dynamics of the corresponding stochastic approximation algorithms, and to control the discretization error and the stochastic error. This enables us to derive a one-step drift inequality, which can be applied repeatedly to establish the finite-sample bounds. In Chapter 4, we switch our focus from finite-sample analysis to asymptotic analysis, and characterize the stationary distribution of the centered and scaled iterates of several popular stochastic approximation algorithms. Specifically, we show that for stochastic gradient descent, linear stochastic approximation, and contractive stochastic approximation, the stationary distribution of the centered iterates (after proper scaling) is a Gaussian distribution with mean zero and a covariance matrix given by the unique solution of an appropriate Lyapunov equation. For stochastic approximation beyond these three types, we numerically demonstrate that the stationary distribution may not be Gaussian in general. The main technique we use for this asymptotic analysis is also the Lyapunov method, with the characteristic function used as the test function.
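
    Schematically, the Lyapunov-drift argument rests on a one-step inequality of the following generic form, which is then telescoped into finite-sample bounds. The display below is a sketch in generic notation with assumed constants c_1, c_2; the exact Lyapunov function, norms, and constants depend on the settings of Chapters 2 and 3 and are not copied from the thesis.

        % One-step drift inequality (schematic): W is a Lyapunov function, e.g., a generalized
        % Moreau envelope of a squared norm of the error x_k - x^*, alpha_k is the step size,
        % and c_1, c_2 > 0 are constants determined by the contraction factor and the noise.
        \[
            \mathbb{E}\left[ W(x_{k+1}) \mid \mathcal{F}_k \right]
            \le (1 - c_1 \alpha_k)\, W(x_k) + c_2 \alpha_k^2 .
        \]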

    Part II: Reinforcement Learning with a Tabular Representation. In the second part of this thesis, we focus on reinforcement learning with a tabular representation. The preliminaries of reinforcement learning are presented in Chapter 5.

    In Chapter 6 and Chapter 7, we consider the TD-learning algorithm for solving the policy evaluation problem, which refers to the problem of estimating the performance of a given policy. Solving the policy evaluation problem is an important intermediate step in the popular actor-critic framework for ultimately finding an optimal policy. More specifically, we consider on-policy TD-learning algorithms such as n-step TD and TD(λ) in Chapter 6. By establishing finite-sample guarantees of n-step TD and TD(λ) as explicit functions of the parameters n and λ, we provide theoretical insight into the open problem about the efficiency of bootstrapping, namely how to choose the parameters n and λ so that n-step TD and TD(λ) achieve their best performance. In Chapter 7, we study the problem of policy evaluation using off-policy sampling, where the policy used to collect samples differs from the policy whose value function we aim to estimate. We provide a finite-sample analysis of a generic off-policy multi-step TD-learning algorithm, which subsumes several popular existing algorithms, such as $Q^\pi(\lambda)$, Tree-Backup(λ), Retrace(λ), and V-trace, as special cases. In addition, our finite-sample bounds demonstrate a trade-off between the variance (which arises due to the product of the importance sampling ratios) and the bias in the limit point (which arises due to various modifications of the importance sampling ratios). Understanding such a bias-variance trade-off is at the heart of off-policy learning.

    In Chapter 8, we consider the Q-learning algorithm for directly finding an optimal policy and present its finite-sample guarantees. The finite-sample bounds imply an $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity, which is known to be optimal up to a logarithmic factor. In addition, our finite-sample bounds also capture the dependence on other important parameters of the reinforcement learning problem, such as the size of the state-action space and the effective horizon.

    Part III: Reinforcement Learning with Linear Function Approximation. In the last part of this thesis, to overcome the curse of dimensionality in reinforcement learning, we consider reinforcement learning with linear function approximation. Specifically, we focus on the off-policy setting, where the deadly triad is present and can result in instability of reinforcement learning algorithms. In Chapter 9, we consider off-policy TD-learning with linear function approximation, where the deadly triad appears. We design a single-time-scale off-policy TD-learning algorithm using generalized importance sampling ratios and multi-step bootstrapping, and establish its finite-sample guarantees. The algorithm is provably convergent in the presence of the deadly triad, and does not suffer from the high variance of existing off-policy learning algorithms. The TD-learning algorithm proposed in Chapter 9 is later used in Chapter 10 to solve the policy evaluation sub-problem in a general policy-based framework with various policy update rules, including approximate policy iteration and natural policy gradient. By exploiting only the contraction property and the monotonicity property of the Bellman operator, we establish an overall $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for a wide class of policy-based methods using off-policy sampling and linear function approximation.
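
    To make the role of the importance sampling ratios concrete, the following toy (tabular, for brevity) performs a multi-step off-policy TD update with clipped per-step ratios in the spirit of V-trace/Retrace. The MDP, policies, clipping threshold, and step size are illustrative assumptions, not the algorithms analyzed in the thesis.

        import numpy as np

        rng = np.random.default_rng(3)

        # Small random MDP; behavior policy mu collects data, target policy pi is evaluated.
        nS, nA, gamma, n_step = 4, 2, 0.9, 3
        P = rng.dirichlet(np.ones(nS), size=(nS, nA))
        R = rng.uniform(size=(nS, nA))
        mu = np.full((nS, nA), 1.0 / nA)                   # uniform behavior policy
        pi = rng.dirichlet(np.ones(nA), size=nS)           # arbitrary target policy

        V = np.zeros(nS)
        s = 0
        traj = []                                          # rolling window of (s, a, r) tuples
        for k in range(50_000):
            a = rng.choice(nA, p=mu[s])
            s_next = rng.choice(nS, p=P[s, a])
            traj.append((s, a, R[s, a]))
            if len(traj) == n_step:
                # n-step off-policy TD update with clipped importance-sampling ratios
                # (a V-trace-flavored stand-in for "generalized" ratios).
                s0 = traj[0][0]
                g, c = 0.0, 1.0
                for t, (st, at, rt) in enumerate(traj):
                    rho = min(pi[st, at] / mu[st, at], 1.0)     # clipping at 1.0 is illustrative
                    st1 = traj[t + 1][0] if t + 1 < len(traj) else s_next
                    g += c * rho * (gamma ** t) * (rt + gamma * V[st1] - V[st])
                    c *= rho                                    # product of past clipped ratios
                V[s0] += 0.05 * g                               # illustrative step size
                traj.pop(0)
            s = s_next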

    In Chapter 11, we focus on Q-learning with linear function approximation (where the deadly triad naturally appears), and establish its finite-sample bounds under an assumption on the discount factor of the problem. In particular, we show that when the discount factor is sufficiently small, the deadly triad challenge can be overcome. In Chapter 12, we further remove the restriction on the discount factor by designing a convergent variant of Q-learning with linear function approximation using a target network and truncation. This is the first variant of Q-learning with linear function approximation that uses a single trajectory of Markovian samples and is provably stable without requiring strong assumptions. In addition, the algorithm achieves the optimal $\mathcal{O}(\epsilon^{-2})$ sample complexity (which matches that of Q-learning in the tabular setting) up to a function approximation error.
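
    A rough single-trajectory sketch of the two stabilizing ingredients named above, a target network and truncation, is given below. The random features, truncation radius, placement of the truncation, synchronization period, and step size are arbitrary choices for illustration rather than the thesis's construction.

        import numpy as np

        rng = np.random.default_rng(4)

        # Toy MDP and random features (all placeholders for the linear architecture in the thesis).
        nS, nA, d, gamma = 6, 2, 3, 0.9
        P = rng.dirichlet(np.ones(nS), size=(nS, nA))
        R = rng.uniform(size=(nS, nA))
        phi = rng.normal(size=(nS, nA, d))
        phi /= np.linalg.norm(phi, axis=-1, keepdims=True)  # normalized feature vectors

        B = 10.0 / (1 - gamma)                              # illustrative truncation radius
        truncate = lambda q: np.clip(q, -B, B)

        w = np.zeros(d)          # online weights
        w_target = np.zeros(d)   # target-network weights, updated only periodically
        s = 0
        for k in range(50_000):
            a = rng.integers(nA)                            # uniform behavior policy (illustrative)
            s_next = rng.choice(nS, p=P[s, a])
            # Bootstrap from the truncated target-network estimate rather than from w itself.
            q_next = truncate(phi[s_next] @ w_target).max()
            td_err = R[s, a] + gamma * q_next - phi[s, a] @ w
            w += 0.05 * td_err * phi[s, a]
            if (k + 1) % 1000 == 0:                         # periodic target-network synchronization
                w_target = w.copy()
            s = s_next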