Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Modern tasks in reinforcement learning have large state and action spaces. To
deal with them efficiently, one often uses predefined feature mapping to
represent states and actions in a low-dimensional space. In this paper, we
study reinforcement learning for discounted Markov Decision Processes (MDPs),
where the transition kernel can be parameterized as a linear function of
certain feature mapping. We propose a novel algorithm that makes use of the
feature mapping and obtains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret, where
$d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$
is the discount factor of the MDP. To the best of our knowledge, this is the
first polynomial regret bound without accessing the generative model or making
strong assumptions such as ergodicity of the MDP. By constructing a special
class of MDPs, we also show that for any algorithm, the regret is lower
bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound
results together suggest that the proposed reinforcement learning algorithm is
near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
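For context, the feature-mapping assumption used here (a linear mixture / linear kernel MDP) is commonly written as follows; the notation is assumed for illustration rather than quoted from the paper:
\[
  \mathbb{P}(s' \mid s, a) \;=\; \bigl\langle \phi(s' \mid s, a),\, \theta^* \bigr\rangle,
  \qquad \phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d,\ \ \theta^* \in \mathbb{R}^d,
\]
so the unknown dynamics are summarized by the $d$-dimensional vector $\theta^*$, and the regret above depends on $d$ rather than on the sizes of the state and action spaces.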
Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
Reinforcement learning (RL) with linear function approximation has received
increasing attention recently. However, existing work has focused on obtaining
$\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the
MDP. In this paper, we show that logarithmic regret is attainable under two
recently proposed linear MDP assumptions, provided that there exists a positive
sub-optimality gap for the optimal action-value function. More specifically,
under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can
achieve $\tilde{O}(d^3 H^5 \log(T)/\mathrm{gap}_{\min})$ regret; and under the
linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can
achieve $\tilde{O}(d^2 H^5 \log^3(T)/\mathrm{gap}_{\min})$ regret, where $d$ is
the dimension of the feature mapping, $H$ is the length of the episode,
$\mathrm{gap}_{\min}$ is the minimal sub-optimality gap, and $\tilde{O}$ hides
all logarithmic terms except $\log(T)$. To the best of our knowledge, these are
the first logarithmic regret bounds for RL with linear function approximation.
We also establish gap-dependent lower bounds for the two linear MDP models.
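To make the value-iteration side concrete, below is a minimal NumPy sketch of an LSVI-UCB-style planning pass (ridge regression plus an elliptical exploration bonus). The function names, the bonus coefficient beta, and the data layout are assumptions for illustration; this is not the authors' implementation.

import numpy as np

def lsvi_ucb_plan(data, phi, actions, H, d, beta, lam=1.0):
    # data[h]: list of (s, a, r, s_next) transitions observed at step h = 1..H;
    # phi(s, a): length-d feature vector; beta: scale of the UCB exploration bonus.
    weights = {h: np.zeros(d) for h in range(1, H + 1)}
    covs = {h: lam * np.eye(d) for h in range(1, H + 1)}

    def q_value(h, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ np.linalg.solve(covs[h], f))
        return min(weights[h] @ f + bonus, float(H))      # optimistic, clipped at H

    def v_value(h, s):
        if h > H:                                          # beyond the horizon
            return 0.0
        return max(q_value(h, s, a) for a in actions)

    for h in range(H, 0, -1):                              # backward least-squares value iteration
        Lambda = lam * np.eye(d)
        b = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            f = phi(s, a)
            Lambda += np.outer(f, f)
            b += f * (r + v_value(h + 1, s_next))
        covs[h], weights[h] = Lambda, np.linalg.solve(Lambda, b)

    return q_value  # the next episode acts greedily w.r.t. q_value(h, s, .)

Acting greedily with respect to such optimistic Q-estimates is what yields the $\sqrt{T}$-type bounds; the logarithmic bounds above additionally exploit the minimal sub-optimality gap in the analysis rather than changing the algorithm.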
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
A fundamental question in the theory of reinforcement learning is: suppose
the optimal $Q$-function lies in the linear span of a given $d$-dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
\emph{constant suboptimality gap} between the optimal $Q$-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
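To state the two assumptions concretely (notation assumed here for illustration): linear realizability and a constant suboptimality gap ask that
\[
  Q^*(s,a) \;=\; \bigl\langle \theta^*, \phi(s,a) \bigr\rangle \ \text{ for some } \theta^* \in \mathbb{R}^d,
  \qquad
  V^*(s) - Q^*(s,a) \;\geq\; \Delta \ \text{ for every non-optimal action } a,
\]
with $\Delta > 0$ a constant. The hardness result says that even under both conditions, an online learner may need a number of samples exponential in $d$ (or the horizon), whereas with a generative model the same assumptions suffice for sample-efficient learning.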
Online Sparse Reinforcement Learning
We investigate the hardness of online reinforcement learning in fixed
horizon, sparse linear Markov decision process (MDP), with a special focus on
the high-dimensional regime where the ambient dimension is larger than the
number of episodes. Our contribution is two-fold. First, we provide a lower
bound showing that linear regret is generally unavoidable in this case, even if
there exists a policy that collects well-conditioned data. The lower bound
construction uses an MDP with a fixed number of states while the number of
actions scales with the ambient dimension. Note that when the horizon is fixed
to one (the case of linear stochastic bandits), linear regret can be avoided.
Second, we show that if the learner has oracle access to a policy that
collects well-conditioned data, then a variant of Lasso fitted Q-iteration
enjoys a nearly dimension-free regret of $\tilde{O}(s^{2/3} N^{2/3})$, where
$N$ is the number of episodes and $s$ is the sparsity level. This shows that
in the large-action setting, the difficulty of learning can be attributed to
the difficulty of finding a good exploratory policy.
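A minimal sketch of what a Lasso fitted Q-iteration step can look like is given below (scikit-learn based; the interface, the regularization level alpha, and the data layout are assumptions for illustration, and details of the actual algorithm such as data splitting are omitted):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_fitted_q_iteration(batch, phi, actions, H, d, alpha=0.1):
    # batch[h]: list of (s, a, r, s_next) transitions at step h = 1..H;
    # phi(s, a): length-d, high-dimensional feature vector with few relevant coordinates.
    w = {}                                            # per-step sparse weight vectors

    def v(h, s):                                      # value of the greedy policy at step h
        if h > H:
            return 0.0
        return max(w[h] @ phi(s, a) for a in actions)

    for h in range(H, 0, -1):                         # backward over the horizon
        X = np.array([phi(s, a) for (s, a, _, _) in batch[h]])
        y = np.array([r + v(h + 1, s_next) for (_, _, r, s_next) in batch[h]])
        model = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
        w[h] = model.coef_                            # l1 penalty zeroes out irrelevant features
    return w                                          # act greedily: argmax_a w[h] @ phi(s, a)

The $\ell_1$ penalty is what lets the regret scale with the sparsity level $s$ rather than the ambient dimension, which is why obtaining well-conditioned data from an exploratory policy is the remaining bottleneck.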
Provably Efficient Representation Learning in Low-rank Markov Decision Processes
The success of deep reinforcement learning (DRL) is due to the power of
learning a representation that is suitable for the underlying exploration and
exploitation task. However, existing provable reinforcement learning algorithms
with linear function approximation often assume the feature representation is
known and fixed. In order to understand how representation learning can improve
the efficiency of RL, we study representation learning for a class of low-rank
Markov Decision Processes (MDPs) where the transition kernel can be represented
in a bilinear form. We propose a provably efficient algorithm called ReLEX that
can simultaneously learn the representation and perform exploration. We show
that ReLEX always performs no worse than a state-of-the-art algorithm without
representation learning, and will be strictly better in terms of sample
efficiency if the function class of representations enjoys a certain mild
"coverage'' property over the whole state-action space.Comment: 27 page
Gap-Dependent Bounds for Two-Player Markov Games
As one of the most popular methods in the field of reinforcement learning,
Q-learning has received increasing attention. Recently, there have been more
theoretical works on the regret bounds of algorithms in the Q-learning family
under different settings. In this paper, we analyze the cumulative regret of
the Nash Q-learning algorithm on two-player turn-based stochastic Markov games
(2-TBSG) and propose the first gap-dependent logarithmic upper bounds in the
episodic tabular setting. This bound matches the theoretical lower bound up to
a logarithmic term. Furthermore, we extend the result to the discounted game
setting with infinite horizon and establish a similar gap-dependent logarithmic
regret bound. Finally, under the linear MDP assumption, we obtain another
logarithmic regret bound for 2-TBSG, in both the centralized and independent
settings.
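Schematically, the update analyzed here has the optimistic Q-learning form; for a turn-based game (notation assumed for illustration) one step reads
\[
  Q_h(s, a) \;\leftarrow\; (1 - \alpha_t)\, Q_h(s, a) + \alpha_t \bigl[ r_h(s, a) + V_{h+1}(s') + b_t \bigr],
  \qquad
  V_{h+1}(s') \;=\;
  \begin{cases}
    \max_{a'} Q_{h+1}(s', a') & \text{if the max-player controls } s',\\[2pt]
    \min_{a'} Q_{h+1}(s', a') & \text{if the min-player controls } s',
  \end{cases}
\]
with learning rate $\alpha_t$ and exploration bonus $b_t$; the gap-dependent analysis bounds how many times actions with a positive sub-optimality gap can be selected.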
Provably Efficient Exploration in Policy Optimization
While policy-based reinforcement learning (RL) achieves tremendous successes
in practice, it is significantly less understood in theory, especially compared
with value-based RL. In particular, it remains elusive how to design a provably
efficient policy optimization algorithm that incorporates exploration. To
bridge such a gap, this paper proposes an Optimistic variant of the Proximal
Policy Optimization algorithm (OPPO), which follows an ``optimistic version''
of the policy gradient direction. This paper proves that, in the problem of
episodic Markov decision process with linear function approximation, unknown
transition, and adversarial reward with full-information feedback, OPPO
achieves an $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature
dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To
the best of our knowledge, OPPO is the first provably efficient policy
optimization algorithm that explores.
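The ``optimistic version'' of the policy gradient direction can be read as a mirror-descent (soft policy iteration) step on a bonus-augmented action-value estimate; a schematic form, with notation assumed for illustration, is
\[
  \pi_h^{k}(a \mid s) \;\propto\; \pi_h^{k-1}(a \mid s)\,
  \exp\!\bigl\{\alpha\, \widehat{Q}_h^{\,k-1}(s, a)\bigr\},
\]
where $\widehat{Q}_h^{\,k-1}$ is a least-squares estimate of the action-value function inflated by a UCB-style bonus, so exploration is driven by optimism inside the exponent rather than by extra randomization of the policy.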
Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation
We study reinforcement learning for two-player zero-sum Markov games with
simultaneous moves in the finite-horizon setting, where the transition kernel
of the underlying Markov games can be parameterized by a linear function over
the current state, both players' actions and the next state. In particular, we
assume that we can control both players and aim to find the Nash Equilibrium by
minimizing the duality gap. We propose an algorithm Nash-UCRL-VTR based on the
principle "Optimism-in-Face-of-Uncertainty". Our algorithm only needs to find a
Coarse Correlated Equilibrium (CCE), which is computationally very efficient.
Specifically, we show that Nash-UCRL-VTR can provably achieve an
$\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension, $H$
is the length of the game and $T$ is the total number of steps in the game. To
assess the optimality of our algorithm, we also prove an
$\tilde{\Omega}(dH\sqrt{T})$ lower bound on the regret. Our upper bound matches
the lower bound up to logarithmic factors, which suggests the optimality of our
algorithm.
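The duality-gap notion of regret used for such zero-sum games can be written, in standard (assumed) notation, as
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl[ V^{\dagger,\, \nu^k}_1(s_1^k) \;-\; V^{\mu^k,\, \dagger}_1(s_1^k) \Bigr],
\]
where $\mu^k$ and $\nu^k$ are the max- and min-player policies executed in episode $k$ and $\dagger$ denotes a best response; each summand is nonnegative and vanishes exactly at a Nash equilibrium. Replacing the Nash computation in each optimistic stage game by a coarse correlated equilibrium is what keeps the per-episode computation efficient.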
Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient
This paper provides a statistical analysis of high-dimensional batch
Reinforcement Learning (RL) using sparse linear function approximation. When
there is a large number of candidate features, our result sheds light on the
fact that sparsity-aware methods can make batch RL more sample efficient. We
first consider the off-policy policy evaluation problem. To evaluate a new
target policy, we analyze a Lasso fitted Q-evaluation method and establish a
finite-sample error bound that has no polynomial dependence on the ambient
dimension. To reduce the Lasso bias, we further propose a post model-selection
estimator that applies fitted Q-evaluation to the features selected via group
Lasso. Under an additional signal strength assumption, we derive a sharper
instance-dependent error bound that depends on a divergence function measuring
the distribution mismatch between the data distribution and occupancy measure
of the target policy. Further, we study the Lasso fitted Q-iteration for batch
policy optimization and establish a finite-sample error bound depending on the
ratio between the number of relevant features and the restricted minimal eigenvalue
of the data's covariance. In the end, we complement the results with minimax
lower bounds for batch-data policy evaluation/optimization that nearly match
our upper bounds. The results suggest that having well-conditioned data is
crucial for sparse batch policy learning.
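The Lasso fitted Q-evaluation step can be written, with notation assumed here for illustration, as the iterated regularized regression
\[
  \widehat{w}_t \;=\; \arg\min_{w \in \mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}
  \Bigl( r_i + \gamma\, \widehat{Q}_{t-1}\bigl(s_i', \pi(s_i')\bigr) - w^{\top}\phi(s_i, a_i) \Bigr)^{2}
  + \lambda \lVert w \rVert_1,
  \qquad
  \widehat{Q}_t(s, a) = \widehat{w}_t^{\top}\phi(s, a),
\]
run over the batch data with target policy $\pi$; the $\ell_1$ penalty is what removes the polynomial dependence on the ambient dimension, and the post model-selection estimator refits an unpenalized regression on the features chosen by group Lasso.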
Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation
Policy optimization methods are popular reinforcement learning algorithms,
because their incremental and on-policy nature makes them more stable than the
value-based counterparts. However, the same properties also make them slow to
converge and sample inefficient, as the on-policy requirement precludes data
reuse and the incremental updates couple large iteration complexity into the
sample complexity. These characteristics have been observed in experiments as
well as in theory in the recent work of~\citet{agarwal2020pc}, which provides a
policy optimization method, PCPG, that can robustly find near-optimal policies for
approximately linear Markov decision processes but suffers from an extremely
poor sample complexity compared with value-based techniques.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample
complexity issue of PCPG while retaining its robustness to model
misspecification. Compared with PCPG, COPOE makes several important algorithmic
enhancements, such as enabling data reuse, and uses more refined analysis
techniques, which we expect to be more broadly applicable to designing new
reinforcement learning algorithms. The result is an improvement in sample
complexity from $\tilde{O}(1/\epsilon^{11})$ for PCPG to
$\tilde{O}(1/\epsilon^{3})$ for COPOE, nearly bridging the gap with
value-based techniques.