One Objective to Rule Them All: A Maximization Objective Fusing Estimation and Planning for Exploration
In online reinforcement learning (online RL), balancing exploration and
exploitation is crucial for finding an optimal policy in a sample-efficient
way. To achieve this, existing sample-efficient online RL algorithms typically
consist of three components: estimation, planning, and exploration. However, in
order to cope with general function approximators, most of them involve
impractical algorithmic components to incentivize exploration, such as
optimization within data-dependent level-sets or complicated sampling
procedures. To address this challenge, we propose an easy-to-implement RL
framework called \textit{Maximize to Explore} (\texttt{MEX}), which only needs
to optimize \emph{unconstrainedly} a single objective that integrates the
estimation and planning components while balancing exploration and exploitation
automatically. Theoretically, we prove that \texttt{MEX} achieves sublinear
regret under general function approximation for Markov decision processes (MDPs)
and extends further to two-player zero-sum Markov games (MGs). Meanwhile,
we adapt deep RL baselines to design practical model-free and model-based
versions of \texttt{MEX}, which outperform the baselines by a stable margin in
various MuJoCo environments with sparse rewards. Compared with existing
sample-efficient online RL algorithms with general function approximation,
\texttt{MEX} achieves similar sample efficiency at a lower computational cost
and is more compatible with modern deep RL methods.
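To make the fused objective concrete, below is a minimal sketch of the kind of single unconstrained maximization \texttt{MEX} performs, in a model-based flavor: a planning term (the optimal value of a candidate hypothesis) is added to an estimation term (its data log-likelihood), weighted by a temperature eta. All names and numbers are illustrative assumptions, not the paper's implementation.

```python
def mex_objective(optimal_value, log_likelihood, eta=1.0):
    """Fused MEX-style objective (sketch): planning + estimation in one term.

    optimal_value:  V_f(s_1), the optimal value if hypothesis f were true (planning)
    log_likelihood: data log-likelihood of f on collected transitions (estimation)
    eta:            temperature balancing exploration against exploitation
    """
    return optimal_value + log_likelihood / eta

# Toy selection step: maximize the objective over a finite hypothesis class.
candidates = [
    {"name": "f1", "value": 5.0, "loglik": -10.0},  # optimistic but fits data poorly
    {"name": "f2", "value": 4.0, "loglik": -2.0},   # slightly lower value, far better fit
]
best = max(candidates, key=lambda f: mex_objective(f["value"], f["loglik"]))
print(best["name"])  # f2
```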
Sample-Efficient Multi-Agent RL: An Optimization Perspective
We study multi-agent reinforcement learning (MARL) for general-sum Markov
games (MGs) under general function approximation. In search of the minimal
assumptions for sample-efficient learning, we introduce a novel
complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for
general-sum MGs. Using this measure, we propose the first unified algorithmic
framework that ensures sample efficiency in learning Nash Equilibrium, Coarse
Correlated Equilibrium, and Correlated Equilibrium for both model-based and
model-free MARL problems with low MADC. We also show that our algorithm
achieves sublinear regret comparable to that of existing works. Moreover, our
algorithm combines an equilibrium-solving oracle with a single objective
optimization subprocedure that solves for the regularized payoff of each
deterministic joint policy, which avoids solving constrained optimization
problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023)
or executing sampling procedures with complex multi-objective optimization
problems (Foster et al. 2023), thus being more amenable to empirical
implementation.
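As a rough illustration of that single-objective subprocedure, the sketch below computes a regularized payoff for each deterministic joint policy, fusing its estimated value with a data-fit term, and produces the table an equilibrium-solving oracle would consume. The interface and numbers are hypothetical, not the paper's code.

```python
import itertools

def regularized_payoffs(joint_policies, value_est, log_lik, eta=1.0):
    """Sketch: regularized payoff of each deterministic joint policy.

    Fuses an estimated payoff with a log-likelihood regularizer on past data;
    the resulting table is what an equilibrium-solving oracle (NE/CCE/CE)
    would consume. All names are illustrative.
    """
    return {pi: value_est[pi] + log_lik[pi] / eta for pi in joint_policies}

# Toy two-agent example, two actions each.
joint_policies = list(itertools.product([0, 1], repeat=2))
value_est = {pi: float(pi[0] - pi[1]) for pi in joint_policies}   # placeholder payoffs
log_lik = {pi: -abs(pi[0] + pi[1] - 1) for pi in joint_policies}  # placeholder data fit
print(regularized_payoffs(joint_policies, value_est, log_lik))
```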
Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning
Finding unified complexity measures and algorithms for sample-efficient
learning is a central topic of research in reinforcement learning (RL). The
Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al.
(2021) as a necessary and sufficient complexity measure for sample-efficient
no-regret RL. This paper makes progress towards a unified theory for RL with
the DEC framework. First, we propose two new DEC-type complexity measures:
Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are
necessary and sufficient for sample-efficient PAC learning and reward-free
learning, thereby extending the original DEC which only captures no-regret
learning. Next, we design new unified sample-efficient algorithms for all three
learning goals. Our algorithms instantiate variants of the
Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model
estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA
improves upon the algorithms of Foster et al. (2021), which require either
bounding a variant of the DEC that may be prohibitively large or designing
problem-specific estimation subroutines. As applications, we recover existing
and obtain new sample-efficient learning results for a wide range of tractable
RL problems using essentially a single algorithm. We also generalize the DEC to
give sample-efficient algorithms for all-policy model estimation, with
applications for learning equilibria in Markov Games. Finally, as a connection,
we re-analyze two existing optimistic model-based algorithms based on Posterior
Sampling or Maximum Likelihood Estimation, showing that they enjoy similar
regret bounds as E2D-TA under structural conditions similar to those of the DEC.
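For intuition, one E2D-style step can be written as a small linear minimax problem once the regret and information-gain terms are tabulated: choose a distribution over decisions minimizing the worst case, over candidate models, of regret minus gamma times estimation divergence. The sketch below solves this with a generic LP solver; the inputs are toy numbers, not the paper's construction.

```python
import numpy as np
from scipy.optimize import linprog

def e2d_decision(regret, divergence, gamma):
    """One E2D step as a linear minimax (sketch with illustrative inputs).

    regret[i, j]:     regret of decision i if model j were true
    divergence[i, j]: information gained about model j by playing decision i
    Returns the distribution p over decisions minimizing
        max_j  sum_i p[i] * (regret[i, j] - gamma * divergence[i, j]).
    """
    n, m = regret.shape
    payoff = regret - gamma * divergence
    # LP variables: p[0..n-1] and the epigraph variable t; minimize t
    # subject to p^T payoff[:, j] <= t for every model j, p on the simplex.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    A_ub = np.hstack([payoff.T, -np.ones((m, 1))])
    b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

# Toy instance: 3 decisions, 2 candidate models.
regret = np.array([[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]])
divergence = np.array([[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]])
print(e2d_decision(regret, divergence, gamma=1.0))
```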
A General Framework for Sample-Efficient Function Approximation in Reinforcement Learning
With the increasing need for handling large state and action spaces, general
function approximation has become a key technique in reinforcement learning
(RL). In this paper, we propose a general framework that unifies model-based
and model-free RL, and an Admissible Bellman Characterization (ABC) class that
subsumes nearly all Markov Decision Process (MDP) models in the literature for
tractable RL. We propose a novel estimation function with decomposable
structural properties for optimization-based exploration and the functional
eluder dimension as a complexity measure of the ABC class. Under our framework,
a new sample-efficient algorithm, OPtimization-based ExploRation with
Approximation (OPERA), is proposed, achieving regret bounds that match or
improve over the best-known results for a variety of MDP models. In particular,
for MDPs with low Witness rank, under a slightly stronger assumption, OPERA
improves the state-of-the-art sample complexity results by a factor of .
Our framework provides a generic interface to design and analyze new RL models
and algorithms.
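In contrast to the unconstrained MEX objective above, optimization-based exploration of this kind is typically phrased as a constrained program: maximize value over hypotheses whose cumulative estimation-function error stays inside a data-dependent confidence region. A toy rendering in the spirit of optimism under uncertainty, with an illustrative interface rather than OPERA's actual procedure:

```python
def optimistic_select(candidates, beta):
    """Sketch of optimization-based exploration (illustrative, not OPERA's code):
    restrict to hypotheses whose cumulative estimation-function error is within
    the confidence radius beta, then pick the most optimistic one by value.
    """
    feasible = [f for f in candidates if f["est_error"] <= beta]
    return max(feasible, key=lambda f: f["value"])

candidates = [
    {"name": "f1", "est_error": 0.3, "value": 2.0},
    {"name": "f2", "est_error": 1.5, "value": 3.0},  # higher value but poor data fit
]
print(optimistic_select(candidates, beta=1.0)["name"])  # f1
```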
Introduction to Online Nonstochastic Control
This text presents an introduction to an emerging paradigm in control of
dynamical systems and differentiable reinforcement learning called online
nonstochastic control. The new approach applies techniques from online convex
optimization and convex relaxations to obtain new methods with provable
guarantees for classical settings in optimal and robust control.
The primary distinction between online nonstochastic control and other
frameworks is the objective. In optimal control, robust control, and other
control methodologies that assume stochastic noise, the goal is to perform
comparably to an offline optimal strategy. In online nonstochastic control,
both the cost functions and the perturbations from the assumed dynamical
model are chosen by an adversary. Thus the optimal policy is not defined a
priori. Rather, the target is to attain low regret against the best policy in
hindsight from a benchmark class of policies.
This objective suggests the use of the decision making framework of online
convex optimization as an algorithmic methodology. The resulting methods are
based on iterative mathematical optimization algorithms, and are accompanied by
finite-time regret and computational complexity guarantees.
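A minimal sketch of the kind of method this framework yields: a disturbance-action controller whose weights on recent disturbances are updated by online gradient descent on the observed convex cost. The scalar system, gains, and the one-step gradient surrogate below are all illustrative assumptions, not the text's algorithm.

```python
import numpy as np

# Disturbance-action controller trained by online gradient descent, in the
# spirit of online nonstochastic control (GPC-style sketch).
a, b = 0.9, 1.0              # known linear dynamics x' = a*x + b*u + w
K = 0.5                      # stabilizing baseline gain (assumed given)
H, lr, T = 3, 0.05, 200      # disturbance-history length, step size, horizon
M = np.zeros(H)              # learned weights on the last H disturbances
x, ws = 0.0, [0.0] * H       # state and recent-disturbance buffer
rng = np.random.default_rng(0)
total_cost = 0.0

for t in range(T):
    u = -K * x + float(np.dot(M, ws))   # disturbance-action control law
    w = rng.uniform(-0.1, 0.1)          # adversarial in theory; random here
    x_next = a * x + b * u + w
    cost = x_next**2 + 0.1 * u**2       # convex per-round cost
    total_cost += cost
    # One-step gradient of cost w.r.t. M through u (a crude surrogate for the
    # truncated counterfactual gradients used in the actual regret analysis).
    grad = (2.0 * x_next * b + 0.2 * u) * np.array(ws)
    M -= lr * grad
    ws = [w] + ws[:-1]
    x = x_next

print(f"average cost: {total_cost / T:.4f}")
```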