Best-of-three-worlds Analysis for Linear Bandits with Follow-the-regularized-leader Algorithm
The linear bandit problem has been studied for many years in both stochastic
and adversarial settings. Designing an algorithm that performs optimally in
either environment without knowing the loss type in advance has attracted
considerable interest. \citet{LeeLWZ021} propose an algorithm that actively
detects the loss type and then switches between different algorithms specially
designed for specific settings. However, such an approach requires a meticulous
design to perform well in all environments. Follow-the-regularized-leader
(FTRL) is another popular type of algorithm that can adapt to different
environments. Compared with detect-switch algorithms, FTRL has a simpler
design, and its regret bounds are known to be optimal in traditional
multi-armed bandit problems. Designing an FTRL-type algorithm for linear
bandits is an important question that has been open for a long time. In this
paper, we prove that the FTRL algorithm with a negative-entropy regularizer
achieves best-of-three-worlds results for the linear bandit problem. Our
regret bounds are of the same or nearly the same order as those of the
previous detect-switch algorithm, but with a much simpler algorithmic design.
Comment: Accepted in COLT 202
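For intuition, here is a minimal sketch of FTRL with a negative-entropy regularizer in the simpler multi-armed bandit setting, where the update has a closed form. All names and parameter values are illustrative; the paper's linear-bandit algorithm additionally needs a loss estimator adapted to the action set and a tuned learning-rate schedule, which this sketch omits:

```python
import numpy as np

def ftrl_neg_entropy(loss_fn, n_arms, horizon, eta=0.1, rng=None):
    """FTRL with a negative-entropy regularizer (multi-armed bandit sketch).

    With negative entropy as the regularizer, the FTRL minimizer over the
    probability simplex has the closed form p_t[i] ∝ exp(-eta * L_hat[i]),
    i.e. exponential weights over the cumulative loss estimates L_hat.
    """
    rng = rng or np.random.default_rng()
    cum_loss_est = np.zeros(n_arms)   # L_hat: cumulative importance-weighted losses
    total_loss = 0.0
    for t in range(horizon):
        logits = -eta * cum_loss_est
        p = np.exp(logits - logits.max())   # numerically stable softmax
        p /= p.sum()
        arm = rng.choice(n_arms, p=p)       # play an arm from the FTRL distribution
        loss = loss_fn(t, arm)              # bandit feedback: only the played arm
        total_loss += loss
        cum_loss_est[arm] += loss / p[arm]  # unbiased importance-weighted estimate
    return total_loss
```

The negative-entropy regularizer is exactly what makes the FTRL argmin reduce to this softmax form, and dividing the observed loss by p[arm] keeps the loss estimates unbiased under bandit feedback.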
Corruption-Robust Offline Reinforcement Learning with General Function Approximation
We investigate the problem of corruption robustness in offline reinforcement
learning (RL) with general function approximation, where an adversary can
corrupt each sample in the offline dataset, and the corruption level
$\zeta \geq 0$ quantifies the cumulative corruption amount over $n$ episodes and
$H$ steps. Our goal is to find a policy that is robust to such corruption and
minimizes the suboptimality gap with respect to the optimal policy for the
uncorrupted Markov decision processes (MDPs). Drawing inspiration from the
uncertainty-weighting technique from the robust online RL setting
\citep{he2022nearly,ye2022corruptionrobust}, we design a new uncertainty weight
iteration procedure to efficiently compute on batched samples and propose a
corruption-robust algorithm for offline RL. Notably, under the assumption of
single policy coverage and the knowledge of $\zeta$, our proposed algorithm
achieves a suboptimality bound that is worsened by an additive factor of
$\mathcal{O}\big(\zeta \cdot (\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_H^n))^{1/2} (C(\hat{\mathcal{F}}, \mu))^{-1} n^{-1}\big)$
due to the corruption.
Here $\mathrm{CC}(\lambda, \hat{\mathcal{F}}, \mathcal{Z}_H^n)$ is the coverage
coefficient that depends on the regularization parameter $\lambda$, the
confidence set $\hat{\mathcal{F}}$, and the dataset $\mathcal{Z}_H^n$, and
$C(\hat{\mathcal{F}}, \mu)$ is a coefficient that depends on $\hat{\mathcal{F}}$
and the underlying data distribution $\mu$. When specialized to linear MDPs,
the corruption-dependent error term reduces to $\mathcal{O}(\zeta d n^{-1})$
with $d$ being the dimension of the feature map, which matches the existing
lower bound for corrupted linear MDPs. This suggests that our analysis is tight
in terms of the corruption-dependent term.
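To illustrate the uncertainty-weighting idea in the linear-feature special case (a hypothetical sketch, not the paper's general-function-approximation procedure; the function name, the threshold `alpha`, and the stopping rule are all assumptions):

```python
import numpy as np

def uncertainty_weight_iteration(phi, lam=1.0, alpha=1.0, n_iters=50, tol=1e-6):
    """Fixed-point iteration for per-sample uncertainty weights.

    Each sample i with feature phi[i] receives weight
        w_i = max(1, bonus_i / alpha),
    where bonus_i = sqrt(phi_i^T Lambda_w^{-1} phi_i) is the elliptical
    uncertainty under the weight-adjusted covariance
        Lambda_w = lam * I + sum_j phi_j phi_j^T / w_j.
    Because the bonuses depend on the weights and vice versa, the weights
    are computed by iterating to (approximate) convergence.
    """
    n, d = phi.shape
    w = np.ones(n)
    for _ in range(n_iters):
        cov = lam * np.eye(d) + (phi / w[:, None]).T @ phi  # Lambda_w
        cov_inv = np.linalg.inv(cov)
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", phi, cov_inv, phi))
        w_new = np.maximum(1.0, bonus / alpha)
        if np.max(np.abs(w_new - w)) < tol:   # weights have stabilized
            return w_new
        w = w_new
    return w
```

Down-weighting poorly covered (high-bonus) samples in the subsequent weighted regression limits how much any single corrupted sample can shift the fit, which is the mechanism behind the corruption-dependent term in the bound above.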
- …