
    On the Convergence and Sample Complexity Analysis of Deep Q-Networks with ε-Greedy Exploration

    Full text link
    This paper provides a theoretical understanding of Deep Q-Network (DQN) with ε-greedy exploration in deep reinforcement learning. Despite the tremendous empirical achievement of the DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in the existing analysis. Second, in contrast to conventional Q-learning algorithms, the DQN employs the target network and experience replay to acquire an unbiased estimation of the mean-square Bellman error (MSBE) utilized in training the Q-network. However, the existing theoretical analysis of DQNs either lacks convergence analysis or bypasses the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first theoretical convergence and sample complexity analysis of the practical setting of DQNs with ε-greedy policy. We prove that an iterative procedure with decaying ε converges to the optimal Q-value function geometrically. Moreover, a higher level of ε enlarges the region of convergence but slows down convergence, while the opposite holds for a lower level of ε. Experiments justify our established theoretical insights on DQNs.
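
    As a point of reference for the setting analyzed above, the sketch below shows the standard DQN ingredients the abstract refers to: ε-greedy action selection with a decaying schedule, a target network held fixed when forming the Bellman target, and the mean-square Bellman error (MSBE) loss on a replay batch. It is a minimal PyTorch sketch; the network width, decay rate, and discount factor are illustrative assumptions, not the paper's setting.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNetwork(nn.Module):
    """Small fully connected Q-network: state -> one Q-value per action."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, s):
        return self.net(s)

def epsilon_schedule(t, eps0=1.0, decay=0.995, eps_min=0.05):
    """Decaying exploration level: eps_t = max(eps_min, eps0 * decay**t)."""
    return max(eps_min, eps0 * decay ** t)

def epsilon_greedy(q_net, state, epsilon, num_actions):
    """With probability epsilon explore uniformly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q.argmax(dim=1).item())

def msbe_loss(q_net, target_net, batch, gamma=0.99):
    """Mean-square Bellman error on a replay batch (s, a, r, s', done)."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # the target network is held fixed during the update
        target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q_sa, target)
```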

    Comparative Study of Reinforcement Learning Algorithms: Deep Q-Networks, Deep Deterministic Policy Gradients and Proximal Policy Optimization

    Get PDF
    The advancement of Artificial Intelligence (AI), particularly in the field of Reinforcement Learning (RL), has led to significant breakthroughs in numerous domains, ranging from autonomous systems to complex game environments. Amid this progress, the emergence and evolution of algorithms like Deep Q-Networks (DQN), Deep Deterministic Policy Gradients (DDPG), and Proximal Policy Optimization (PPO) have been pivotal. These algorithms, each with unique approaches and strengths, have become fundamental in tackling diverse RL challenges. This study aims to dissect and compare these three influential algorithms to provide a clearer understanding of their mechanics, efficiencies, and applicability. We delve into the theoretical underpinnings of DQN, DDPG, and PPO, and assess their performance across a variety of standard benchmarks. Through this comparative analysis, we seek to offer valuable insights for choosing the right algorithm for different environments and to highlight potential pathways for future research in the field of Reinforcement Learning.
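
    The abstract does not spell out the benchmark protocol, so the following is only a hedged illustration of how such a three-way comparison is commonly set up, using the off-the-shelf Stable-Baselines3 implementations of DQN, DDPG, and PPO on two standard Gymnasium tasks; the environment choices and step budget are assumptions, not the study's actual configuration. Note that DQN requires discrete actions and DDPG requires continuous actions, so no single environment suits all three.

```python
# Requires: pip install stable-baselines3 gymnasium
import gymnasium as gym
from stable_baselines3 import DQN, DDPG, PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Discrete-action benchmark for DQN and PPO; continuous-action benchmark for DDPG and PPO.
runs = [
    (DQN, "CartPole-v1"),
    (PPO, "CartPole-v1"),
    (DDPG, "Pendulum-v1"),
    (PPO, "Pendulum-v1"),
]

for algo, env_id in runs:
    env = gym.make(env_id)
    model = algo("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=50_000)  # illustrative budget, not a tuned protocol
    mean_ret, std_ret = evaluate_policy(model, env, n_eval_episodes=20)
    print(f"{algo.__name__} on {env_id}: mean return {mean_ret:.1f} +/- {std_ret:.1f}")
```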

    Machine learning detects terminal singularities

    Full text link
    Algebraic varieties are the geometric shapes defined by systems of polynomial equations; they are ubiquitous across mathematics and science. Amongst these algebraic varieties are Q-Fano varieties: positively curved shapes which have Q-factorial terminal singularities. Q-Fano varieties are of fundamental importance in geometry as they are "atomic pieces" of more complex shapes; the process of breaking a shape into simpler pieces in this sense is called the Minimal Model Programme. Despite their importance, the classification of Q-Fano varieties remains unknown. In this paper we demonstrate that machine learning can be used to understand this classification. We focus on 8-dimensional positively curved algebraic varieties that have toric symmetry and Picard rank 2, and develop a neural network classifier that predicts with 95% accuracy whether or not such an algebraic variety is Q-Fano. We use this to give a first sketch of the landscape of Q-Fanos in dimension 8. How the neural network is able to detect Q-Fano varieties with such accuracy remains mysterious, and hints at some deep mathematical theory waiting to be uncovered. Furthermore, when visualised using the quantum period, an invariant that has played an important role in recent theoretical developments, we observe that the classification as revealed by ML appears to fall within a bounded region and is stratified by the Fano index. This suggests that it may be possible to state and prove conjectures on completeness in the future. Inspired by the ML analysis, we formulate and prove a new global combinatorial criterion for a positively curved toric variety of Picard rank 2 to have terminal singularities. Together with the first sketch of the landscape of Q-Fanos in higher dimensions, this gives new evidence that machine learning can be an essential tool in developing mathematical conjectures and accelerating theoretical discovery.
    Comment: 20 pages, 11 figures, 3 tables
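
    The abstract does not give the classifier's features or architecture, so the sketch below is only a generic stand-in: it assumes each Picard-rank-2, dimension-8 toric variety is encoded as a flattened integer weight matrix (a hypothetical 2 x 10 encoding, i.e. a 20-dimensional feature vector) and trains a small binary MLP on Q-Fano / not-Q-Fano labels with cross-entropy. Everything here, from the encoding to the layer sizes, is an assumption for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical encoding: a rank-2 weight matrix with dim + rank = 10 columns,
# flattened to a 20-dimensional feature vector. Not the paper's actual features.
FEATURE_DIM = 20

classifier = nn.Sequential(
    nn.Linear(FEATURE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),  # logit for "has terminal singularities (is Q-Fano)"
)

def train_step(model, optimizer, features, labels):
    """One supervised step with binary cross-entropy on Q-Fano / not-Q-Fano labels."""
    optimizer.zero_grad()
    logits = model(features).squeeze(1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
# Placeholder batch standing in for sampled weight matrices and their labels.
x = torch.randint(0, 10, (64, FEATURE_DIM)).float()
y = torch.randint(0, 2, (64,))
print(train_step(classifier, optimizer, x, y))
```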

    Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

    Full text link
    We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2} H^{5/2} \sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
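
    The computational core of this approach is to replace the usual point estimate of the Q-function with approximate posterior samples produced by noisy gradient updates. Below is a minimal stochastic gradient Langevin dynamics (SGLD) style sketch of one such update on the Q-network parameters; the paper's deep-RL variant uses Adam-based updates, so this plain Langevin step, with assumed step size and inverse temperature, is only a simplified stand-in.

```python
import torch

def langevin_step(params, loss_fn, step_size=1e-3, inverse_temp=1e4):
    """One SGLD-style update: a gradient step on the TD/Bellman loss plus Gaussian
    noise scaled by sqrt(2 * step_size / inverse_temp). Iterating this draws
    approximate samples from a posterior over Q-functions instead of a point estimate."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (2.0 * step_size / inverse_temp) ** 0.5
            p.add_(-step_size * g + noise)
    return loss.item()

# Usage sketch: params = list(q_net.parameters()); loss_fn computes a TD loss on a
# replay batch; after a few langevin_step calls, act greedily under the sampled
# Q-function -- this sampling plays the role that epsilon-greedy plays in DQN.
```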