Best-of-three-worlds Analysis for Linear Bandits with Follow-the-regularized-leader Algorithm
The linear bandit problem has been studied for many years in both stochastic
and adversarial settings. Designing an algorithm that performs optimally in
either environment without knowing the loss type in advance has attracted
considerable interest. \citet{LeeLWZ021} propose an algorithm that actively
detects the loss type and then switches between different algorithms specially
designed for specific settings. However, such an approach requires a meticulous
design to perform well in all environments. Follow-the-regularized-leader
(FTRL) is another popular type of algorithm that can adapt to different
environments. Compared with detect-switch algorithms, FTRL has a simpler
design, and its regret bounds are known to be optimal in traditional
multi-armed bandit problems. Designing an FTRL-type algorithm for linear
bandits is an important question that has been open for a long time. In this
paper, we prove that the FTRL algorithm with a negative-entropy regularizer
achieves best-of-three-worlds results for the linear bandit problem. Our
regret bounds are of the same or nearly the same order as those of the
previous detect-switch algorithm, but with a much simpler algorithmic design.
Comment: Accepted in COLT 202
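For intuition, here is a minimal sketch of FTRL with a negative-entropy regularizer in the simpler multi-armed bandit setting, where the update has a closed form. All names and parameter values are illustrative; the paper's linear-bandit algorithm additionally needs a loss estimator adapted to the action set and a tuned learning-rate schedule, which this sketch omits:

```python
import numpy as np

def ftrl_neg_entropy(loss_fn, n_arms, horizon, eta=0.1, rng=None):
    """FTRL with a negative-entropy regularizer (multi-armed bandit sketch).

    With negative entropy as the regularizer, the FTRL minimizer over the
    probability simplex has the closed form p_t[i] ∝ exp(-eta * L_hat[i]),
    i.e. exponential weights over the cumulative loss estimates L_hat.
    """
    rng = rng or np.random.default_rng()
    cum_loss_est = np.zeros(n_arms)   # L_hat: cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(horizon):
        logits = -eta * cum_loss_est
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        arm = rng.choice(n_arms, p=p)       # play an arm from the FTRL distribution
        loss = loss_fn(t, arm)              # bandit feedback: only the played arm
        total_loss += loss
        cum_loss_est[arm] += loss / p[arm]  # unbiased importance-weighted estimate
    return total_loss
```

The negative-entropy regularizer is exactly what makes the FTRL argmin reduce to this softmax form, and dividing the observed loss by p[arm] keeps the loss estimates unbiased under bandit feedback.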
Corruption-Robust Offline Reinforcement Learning with General Function Approximation
We investigate the problem of corruption robustness in offline reinforcement
learning (RL) with general function approximation, where an adversary can
corrupt each sample in the offline dataset, and the corruption level
$\zeta \geq 0$ quantifies the cumulative corruption amount over $n$ episodes and
$H$ steps. Our goal is to find a policy that is robust to such corruption and
minimizes the suboptimality gap with respect to the optimal policy for the
uncorrupted Markov decision processes (MDPs). Drawing inspiration from the
uncertainty-weighting technique from the robust online RL setting
\citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight
iteration procedure to efficiently compute on batched samples and propose a
corruption-robust algorithm for offline RL. Notably, under the assumption of
single policy coverage and the knowledge of $\zeta$, our proposed algorithm
achieves a suboptimality bound that is worsened by an additive factor of
$\mathcal{O}\big(\zeta \cdot (\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_H^n))^{1/2} (C(\hat{\mathcal{F}}, \mu))^{-1} n^{-1}\big)$
due to the corruption.
Here $\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_H^n)$ is the coverage
coefficient that depends on the regularization parameter $\lambda$, the
confidence set $\hat{\mathcal{F}}$, and the dataset $\mathcal{Z}_H^n$, and
$C(\hat{\mathcal{F}}, \mu)$ is a coefficient that depends on $\hat{\mathcal{F}}$
and the underlying data distribution $\mu$. When specialized to linear MDPs,
the corruption-dependent error term reduces to $\mathcal{O}(\zeta d n^{-1})$
with $d$ being the dimension of the feature map, which matches the existing
lower bound for corrupted linear MDPs. This suggests that our analysis is tight
in terms of the corruption-dependent term.
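To illustrate the uncertainty-weighting idea in the linear-feature special case (a hypothetical sketch, not the paper's general-function-approximation procedure; the function name, the threshold `alpha`, and the stopping rule are all assumptions):

```python
import numpy as np

def uncertainty_weight_iteration(phi, lam=1.0, alpha=1.0, n_iters=50, tol=1e-6):
    """Fixed-point iteration for per-sample uncertainty weights.

    Each sample i with feature phi[i] receives weight
        w_i = max(1, bonus_i / alpha),
    where bonus_i = sqrt(phi_i^T Lambda_w^{-1} phi_i) is the elliptical
    uncertainty under the weight-adjusted covariance
        Lambda_w = lam * I + sum_j phi_j phi_j^T / w_j.
    Because the bonuses depend on the weights and vice versa, the weights
    are computed by iterating to (approximate) convergence.
    """
    n, d = phi.shape
    w = np.ones(n)
    for _ in range(n_iters):
        cov = lam * np.eye(d) + (phi / w[:, None]).T @ phi  # Lambda_w
        cov_inv = np.linalg.inv(cov)
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", phi, cov_inv, phi))
        w_new = np.maximum(1.0, bonus / alpha)
        if np.max(np.abs(w_new - w)) < tol:   # weights have stabilized
            return w_new
        w = w_new
    return w
```

Down-weighting poorly covered (high-bonus) samples in the subsequent weighted regression limits how much any single corrupted sample can shift the fit, which is the mechanism behind the corruption-dependent term in the bound above.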
- …