Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping
Modern tasks in reinforcement learning have large state and action spaces. To
deal with them efficiently, one often uses predefined feature mapping to
represent states and actions in a low-dimensional space. In this paper, we
study reinforcement learning for discounted Markov Decision Processes (MDPs),
where the transition kernel can be parameterized as a linear function of
certain feature mapping. We propose a novel algorithm that makes use of the
feature mapping and obtains an $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret, where
$d$ is the dimension of the feature space, $T$ is the time horizon and $\gamma$
is the discount factor of the MDP. To the best of our knowledge, this is the
first polynomial regret bound without accessing the generative model or making
strong assumptions such as ergodicity of the MDP. By constructing a special
class of MDPs, we also show that for any algorithm, the regret is lower
bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound
results together suggest that the proposed reinforcement learning algorithm is
near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
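For context, the feature-mapping assumption used here (a linear mixture / linear kernel MDP) is commonly written as follows; the notation is assumed for illustration rather than quoted from the paper:
\[
  \mathbb{P}(s' \mid s, a) \;=\; \bigl\langle \phi(s' \mid s, a),\, \theta^* \bigr\rangle,
  \qquad \phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d,\ \ \theta^* \in \mathbb{R}^d,
\]
so the unknown dynamics are summarized by the $d$-dimensional vector $\theta^*$, and the regret above depends on $d$ rather than on the sizes of the state and action spaces.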
Logarithmic Regret for Reinforcement Learning with Linear Function Approximation
Reinforcement learning (RL) with linear function approximation has received
increasing attention recently. However, existing work has focused on obtaining
$\sqrt{T}$-type regret bounds, where $T$ is the number of interactions with the
MDP. In this paper, we show that logarithmic regret is attainable under two
recently proposed linear MDP assumptions, provided that there exists a positive
sub-optimality gap for the optimal action-value function. More specifically,
under the linear MDP assumption (Jin et al. 2019), the LSVI-UCB algorithm can
achieve $\tilde{O}(d^3 H^5 \log(T)/\mathrm{gap}_{\min})$ regret; and under the
linear mixture MDP assumption (Ayoub et al. 2020), the UCRL-VTR algorithm can
achieve $\tilde{O}(d^2 H^5 \log^3(T)/\mathrm{gap}_{\min})$ regret, where $d$ is
the dimension of the feature mapping, $H$ is the length of the episode,
$\mathrm{gap}_{\min}$ is the minimal sub-optimality gap, and $\tilde{O}$ hides
all logarithmic terms except $\log(T)$. To the best of our knowledge, these are
the first logarithmic regret bounds for RL with linear function approximation.
We also establish gap-dependent lower bounds for the two linear MDP models.
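To make the value-iteration side concrete, below is a minimal NumPy sketch of an LSVI-UCB-style planning pass (ridge regression plus an elliptical exploration bonus). The function names, the bonus coefficient beta, and the data layout are assumptions for illustration; this is not the authors' implementation.

import numpy as np

def lsvi_ucb_plan(data, phi, actions, H, d, beta, lam=1.0):
    # data[h]: list of (s, a, r, s_next) transitions observed at step h = 1..H;
    # phi(s, a): length-d feature vector; beta: scale of the UCB exploration bonus.
    weights = {h: np.zeros(d) for h in range(1, H + 1)}
    covs = {h: lam * np.eye(d) for h in range(1, H + 1)}

    def q_value(h, s, a):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ np.linalg.solve(covs[h], f))
        return min(weights[h] @ f + bonus, float(H))      # optimistic, clipped at H

    def v_value(h, s):
        if h > H:                                          # beyond the horizon
            return 0.0
        return max(q_value(h, s, a) for a in actions)

    for h in range(H, 0, -1):                              # backward least-squares value iteration
        Lambda = lam * np.eye(d)
        b = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            f = phi(s, a)
            Lambda += np.outer(f, f)
            b += f * (r + v_value(h + 1, s_next))
        covs[h], weights[h] = Lambda, np.linalg.solve(Lambda, b)

    return q_value  # the next episode acts greedily w.r.t. q_value(h, s, .)

Acting greedily with respect to such optimistic Q-estimates is what yields the $\sqrt{T}$-type bounds; the logarithmic bounds above additionally exploit the minimal sub-optimality gap in the analysis rather than changing the algorithm.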
An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap
A fundamental question in the theory of reinforcement learning is: suppose
the optimal $Q$-function lies in the linear span of a given $d$-dimensional
feature mapping, is sample-efficient reinforcement learning (RL) possible? The
recent and remarkable result of Weisz et al. (2020) resolved this question in
the negative, providing an exponential (in $d$) sample size lower bound, which
holds even if the agent has access to a generative model of the environment.
One may hope that this information theoretic barrier for RL can be circumvented
by further supposing an even more favorable assumption: there exists a
\emph{constant suboptimality gap} between the optimal $Q$-value of the best
action and that of the second-best action (for all states). The hope is that
having a large suboptimality gap would permit easier identification of optimal
actions themselves, thus making the problem tractable; indeed, provided the
agent has access to a generative model, sample-efficient RL is in fact possible
with the addition of this more favorable assumption.
This work focuses on this question in the standard online reinforcement
learning setting, where our main result resolves this question in the negative:
our hardness result shows that an exponential sample complexity lower bound
still holds even if a constant suboptimality gap is assumed in addition to
having a linearly realizable optimal $Q$-function. Perhaps surprisingly, this
implies an exponential separation between the online RL setting and the
generative model setting. Complementing our negative hardness result, we give
two positive results showing that provably sample-efficient RL is possible
either under an additional low-variance assumption or under a novel
hypercontractivity assumption (both implicitly place stronger conditions on the
underlying dynamics model).
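To state the two assumptions concretely (notation assumed here for illustration): linear realizability and a constant suboptimality gap ask that
\[
  Q^*(s,a) \;=\; \bigl\langle \theta^*, \phi(s,a) \bigr\rangle \ \text{ for some } \theta^* \in \mathbb{R}^d,
  \qquad
  V^*(s) - Q^*(s,a) \;\geq\; \Delta \ \text{ for every non-optimal action } a,
\]
with $\Delta > 0$ a constant. The hardness result says that even under both conditions, an online learner may need a number of samples exponential in $d$ (or the horizon), whereas with a generative model the same assumptions suffice for sample-efficient learning.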
Online Sparse Reinforcement Learning
We investigate the hardness of online reinforcement learning in fixed
horizon, sparse linear Markov decision process (MDP), with a special focus on
the high-dimensional regime where the ambient dimension is larger than the
number of episodes. Our contribution is two-fold. First, we provide a lower
bound showing that linear regret is generally unavoidable in this case, even if
there exists a policy that collects well-conditioned data. The lower bound
construction uses an MDP with a fixed number of states while the number of
actions scales with the ambient dimension. Note that when the horizon is fixed
to one (the case of linear stochastic bandits), linear regret can be avoided.
Second, we show that if the learner has oracle access to a policy that
collects well-conditioned data, then a variant of Lasso fitted Q-iteration
enjoys a nearly dimension-free regret of $\tilde{O}(s^{2/3} N^{2/3})$, where
$N$ is the number of episodes and $s$ is the sparsity level. This shows that
in the large-action setting, the difficulty of learning can be attributed to
the difficulty of finding a good exploratory policy.
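A minimal sketch of what a Lasso fitted Q-iteration step can look like is given below (scikit-learn based; the interface, the regularization level alpha, and the data layout are assumptions for illustration, and details of the actual algorithm such as data splitting are omitted):

import numpy as np
from sklearn.linear_model import Lasso

def lasso_fitted_q_iteration(batch, phi, actions, H, d, alpha=0.1):
    # batch[h]: list of (s, a, r, s_next) transitions at step h = 1..H;
    # phi(s, a): length-d, high-dimensional feature vector with few relevant coordinates.
    w = {}                                            # per-step sparse weight vectors

    def v(h, s):                                      # value of the greedy policy at step h
        if h > H:
            return 0.0
        return max(w[h] @ phi(s, a) for a in actions)

    for h in range(H, 0, -1):                         # backward over the horizon
        X = np.array([phi(s, a) for (s, a, _, _) in batch[h]])
        y = np.array([r + v(h + 1, s_next) for (_, _, r, s_next) in batch[h]])
        model = Lasso(alpha=alpha, fit_intercept=False).fit(X, y)
        w[h] = model.coef_                            # l1 penalty zeroes out irrelevant features
    return w                                          # act greedily: argmax_a w[h] @ phi(s, a)

The $\ell_1$ penalty is what lets the regret scale with the sparsity level $s$ rather than the ambient dimension, which is why obtaining well-conditioned data from an exploratory policy is the remaining bottleneck.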
Provably Efficient Representation Learning in Low-rank Markov Decision Processes
The success of deep reinforcement learning (DRL) is due to the power of
learning a representation that is suitable for the underlying exploration and
exploitation task. However, existing provable reinforcement learning algorithms
with linear function approximation often assume the feature representation is
known and fixed. In order to understand how representation learning can improve
the efficiency of RL, we study representation learning for a class of low-rank
Markov Decision Processes (MDPs) where the transition kernel can be represented
in a bilinear form. We propose a provably efficient algorithm called ReLEX that
can simultaneously learn the representation and perform exploration. We show
that ReLEX always performs no worse than a state-of-the-art algorithm without
representation learning, and will be strictly better in terms of sample
efficiency if the function class of representations enjoys a certain mild
"coverage'' property over the whole state-action space.Comment: 27 page
Gap-Dependent Bounds for Two-Player Markov Games
As one of the most popular methods in the field of reinforcement learning,
Q-learning has received increasing attention. Recently, there have been more
theoretical works on the regret bounds of algorithms in the Q-learning family
under different settings. In this paper, we analyze the cumulative regret of
the Nash Q-learning algorithm on two-player turn-based stochastic Markov games
(2-TBSG) and propose the first gap-dependent logarithmic upper bounds in the
episodic tabular setting. This bound matches the theoretical lower bound up to
a logarithmic term. Furthermore, we extend the result to the discounted game
setting with infinite horizon and establish a similar gap-dependent logarithmic
regret bound. Finally, under the linear MDP assumption, we obtain another
logarithmic regret bound for 2-TBSG, in both the centralized and independent
settings.
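Schematically, the update analyzed here has the optimistic Q-learning form; for a turn-based game (notation assumed for illustration) one step reads
\[
  Q_h(s, a) \;\leftarrow\; (1 - \alpha_t)\, Q_h(s, a) + \alpha_t \bigl[ r_h(s, a) + V_{h+1}(s') + b_t \bigr],
  \qquad
  V_{h+1}(s') \;=\;
  \begin{cases}
    \max_{a'} Q_{h+1}(s', a') & \text{if the max-player controls } s',\\[2pt]
    \min_{a'} Q_{h+1}(s', a') & \text{if the min-player controls } s',
  \end{cases}
\]
with learning rate $\alpha_t$ and exploration bonus $b_t$; the gap-dependent analysis bounds how many times actions with a positive sub-optimality gap can be selected.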
Provably Efficient Exploration in Policy Optimization
While policy-based reinforcement learning (RL) achieves tremendous successes
in practice, it is significantly less understood in theory, especially compared
with value-based RL. In particular, it remains elusive how to design a provably
efficient policy optimization algorithm that incorporates exploration. To
bridge such a gap, this paper proposes an Optimistic variant of the Proximal
Policy Optimization algorithm (OPPO), which follows an ``optimistic version''
of the policy gradient direction. This paper proves that, in the problem of
episodic Markov decision process with linear function approximation, unknown
transition, and adversarial reward with full-information feedback, OPPO
achieves an $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature
dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To
the best of our knowledge, OPPO is the first provably efficient policy
optimization algorithm that explores.
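The ``optimistic version'' of the policy gradient direction can be read as a mirror-descent (soft policy iteration) step on a bonus-augmented action-value estimate; a schematic form, with notation assumed for illustration, is
\[
  \pi_h^{k}(a \mid s) \;\propto\; \pi_h^{k-1}(a \mid s)\,
  \exp\!\bigl\{\alpha\, \widehat{Q}_h^{\,k-1}(s, a)\bigr\},
\]
where $\widehat{Q}_h^{\,k-1}$ is a least-squares estimate of the action-value function inflated by a UCB-style bonus, so exploration is driven by optimism inside the exponent rather than by extra randomization of the policy.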
Almost Optimal Algorithms for Two-player Markov Games with Linear Function Approximation
We study reinforcement learning for two-player zero-sum Markov games with
simultaneous moves in the finite-horizon setting, where the transition kernel
of the underlying Markov games can be parameterized by a linear function over
the current state, both players' actions and the next state. In particular, we
assume that we can control both players and aim to find the Nash Equilibrium by
minimizing the duality gap. We propose an algorithm Nash-UCRL-VTR based on the
principle "Optimism-in-Face-of-Uncertainty". Our algorithm only needs to find a
Coarse Correlated Equilibrium (CCE), which is computationally very efficient.
Specifically, we show that Nash-UCRL-VTR can provably achieve an
$\tilde{O}(dH\sqrt{T})$ regret, where $d$ is the linear function dimension, $H$
is the length of the game and $T$ is the total number of steps in the game. To
assess the optimality of our algorithm, we also prove an
$\tilde{\Omega}(dH\sqrt{T})$ lower bound on the regret. Our upper bound matches
the lower bound up to logarithmic factors, which suggests the optimality of our
algorithm.
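The duality-gap notion of regret used for such zero-sum games can be written, in standard (assumed) notation, as
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl[ V^{\dagger,\, \nu^k}_1(s_1^k) \;-\; V^{\mu^k,\, \dagger}_1(s_1^k) \Bigr],
\]
where $\mu^k$ and $\nu^k$ are the max- and min-player policies executed in episode $k$ and $\dagger$ denotes a best response; each summand is nonnegative and vanishes exactly at a Nash equilibrium. Replacing the Nash computation in each optimistic stage game by a coarse correlated equilibrium is what keeps the per-episode computation efficient.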
Sparse Feature Selection Makes Batch Reinforcement Learning More Sample Efficient
This paper provides a statistical analysis of high-dimensional batch
Reinforcement Learning (RL) using sparse linear function approximation. When
there is a large number of candidate features, our result sheds light on the
fact that sparsity-aware methods can make batch RL more sample efficient. We
first consider the off-policy policy evaluation problem. To evaluate a new
target policy, we analyze a Lasso fitted Q-evaluation method and establish a
finite-sample error bound that has no polynomial dependence on the ambient
dimension. To reduce the Lasso bias, we further propose a post model-selection
estimator that applies fitted Q-evaluation to the features selected via group
Lasso. Under an additional signal strength assumption, we derive a sharper
instance-dependent error bound that depends on a divergence function measuring
the distribution mismatch between the data distribution and occupancy measure
of the target policy. Further, we study the Lasso fitted Q-iteration for batch
policy optimization and establish a finite-sample error bound depending on the
ratio between the number of relevant features and the restricted minimal eigenvalue
of the data's covariance. In the end, we complement the results with minimax
lower bounds for batch-data policy evaluation/optimization that nearly match
our upper bounds. The results suggest that having well-conditioned data is
crucial for sparse batch policy learning.
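The Lasso fitted Q-evaluation step can be written, with notation assumed here for illustration, as the iterated regularized regression
\[
  \widehat{w}_t \;=\; \arg\min_{w \in \mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}
  \Bigl( r_i + \gamma\, \widehat{Q}_{t-1}\bigl(s_i', \pi(s_i')\bigr) - w^{\top}\phi(s_i, a_i) \Bigr)^{2}
  + \lambda \lVert w \rVert_1,
  \qquad
  \widehat{Q}_t(s, a) = \widehat{w}_t^{\top}\phi(s, a),
\]
run over the batch data with target policy $\pi$; the $\ell_1$ penalty is what removes the polynomial dependence on the ambient dimension, and the post model-selection estimator refits an unpenalized regression on the features chosen by group Lasso.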
Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation
Policy optimization methods are popular reinforcement learning algorithms,
because their incremental and on-policy nature makes them more stable than the
value-based counterparts. However, the same properties also make them slow to
converge and sample inefficient, as the on-policy requirement precludes data
reuse and the incremental updates couple large iteration complexity into the
sample complexity. These characteristics have been observed in experiments as
well as in theory in the recent work of~\citet{agarwal2020pc}, which provides a
policy optimization method, PCPG, that can robustly find near-optimal policies for
approximately linear Markov decision processes but suffers from an extremely
poor sample complexity compared with value-based techniques.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample
complexity issue of PCPG while retaining its robustness to model
misspecification. Compared with PCPG, COPOE makes several important algorithmic
enhancements, such as enabling data reuse, and uses more refined analysis
techniques, which we expect to be more broadly applicable to designing new
reinforcement learning algorithms. The result is an improvement in sample
complexity from $\tilde{O}(1/\epsilon^{11})$ for PCPG to
$\tilde{O}(1/\epsilon^{3})$ for COPOE, nearly bridging the gap with
value-based techniques.