Misspecified Linear Bandits
We consider the problem of online learning in misspecified linear stochastic
multi-armed bandit problems. Regret guarantees for state-of-the-art linear
bandit algorithms such as Optimism in the Face of Uncertainty Linear bandit
(OFUL) hold under the assumption that the arms' expected rewards are perfectly
linear in their features. It is, however, of interest to investigate the impact
of potential misspecification in linear bandit models, where the expected
rewards are perturbed away from the linear subspace determined by the arms'
features. Although OFUL has recently been shown to be robust to relatively
small deviations from linearity, we show that any linear bandit algorithm that
enjoys optimal regret performance in the perfectly linear setting (e.g., OFUL)
must suffer linear regret under a sparse additive perturbation of the linear
model. In an attempt to overcome this negative result, we define a natural
class of bandit models characterized by a non-sparse deviation from linearity.
We argue that the OFUL algorithm can fail to achieve sublinear regret even
under models that have non-sparse deviation. We finally develop a novel bandit
algorithm, comprising a hypothesis test for linearity followed by a decision to
use either the OFUL or Upper Confidence Bound (UCB) algorithm. For perfectly
linear bandit models, the algorithm provably exhibits OFUL's favorable regret
performance, while for misspecified models satisfying the non-sparse deviation
property, the algorithm avoids the linear regret phenomenon and falls back on
UCB's sublinear regret scaling. Numerical experiments on synthetic data, and on
recommendation data from the public Yahoo! Learning to Rank Challenge dataset,
empirically support our findings.
Comment: Thirty-First AAAI Conference on Artificial Intelligence, 2017
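The two-stage structure described in this abstract (a linearity check followed by committing to either OFUL or UCB) can be sketched as follows. This is a minimal illustration only: the residual-based test, the threshold `tau`, the exploration length, and the confidence widths are assumptions made for exposition, not the paper's actual algorithm or test statistic.

```python
import numpy as np

# A minimal sketch of the test-then-select idea: explore, test whether a
# linear model explains the rewards, then commit to an OFUL/LinUCB-style
# rule or to plain UCB1. All tuning constants are illustrative.

def looks_linear(X, y, tau):
    """Crude linearity check: small least-squares residuals -> accept."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ theta) ** 2) <= tau

def test_then_select(features, pull, T, n_explore=200, tau=0.05, seed=0):
    """features: (K, d) arm features; pull(a) returns a noisy reward."""
    rng = np.random.default_rng(seed)
    K, d = features.shape
    total = 0.0

    # Stage 1: uniform exploration to gather data for the hypothesis test.
    X, y = [], []
    for _ in range(n_explore):
        a = rng.integers(K)
        r = pull(a)
        X.append(features[a]); y.append(r); total += r
    X, y = np.array(X), np.array(y)

    if looks_linear(X, y, tau):
        # OFUL / LinUCB-style optimism based on a confidence ellipsoid.
        lam, beta = 1.0, 1.0
        V, b = lam * np.eye(d) + X.T @ X, X.T @ y
        for _ in range(T - n_explore):
            Vinv = np.linalg.inv(V)
            ucb = features @ (Vinv @ b) + beta * np.sqrt(
                np.einsum("kd,de,ke->k", features, Vinv, features))
            a = int(np.argmax(ucb))
            r = pull(a); total += r
            V += np.outer(features[a], features[a]); b += r * features[a]
    else:
        # Fall back on plain UCB1 over the K arms, ignoring the features.
        counts = np.ones(K)
        sums = np.array([pull(a) for a in range(K)], dtype=float)
        total += sums.sum()
        for t in range(T - n_explore - K):
            idx = sums / counts + np.sqrt(2 * np.log(t + K + 1) / counts)
            a = int(np.argmax(idx))
            r = pull(a); total += r
            counts[a] += 1; sums[a] += r
    return total
```

The point of the dispatch is the one the abstract makes: when the test accepts linearity the policy enjoys OFUL-style regret, and when it rejects, the feature-free UCB fallback avoids the linear-regret failure mode.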
No-Regret Reinforcement Learning with Value Function Approximation: a Kernel Embedding Approach
We consider the regret minimization problem in reinforcement learning (RL) in
the episodic setting. In many real-world RL environments, the state and action
spaces are continuous or very large. Existing approaches establish regret
guarantees by either a low-dimensional representation of the stochastic
transition model or an approximation of the $Q$-functions. However, the
understanding of function approximation schemes for state-value functions
largely remains missing. In this paper, we propose an online model-based RL
algorithm, namely the CME-RL, that learns representations of transition
distributions as embeddings in a reproducing kernel Hilbert space while
carefully balancing the exploitation-exploration tradeoff. We demonstrate the
efficiency of our algorithm by proving a frequentist (worst-case) regret bound
that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$, where $H$ is the
episode length, $N$ is the total number of time steps and $\gamma_N$ is an
information theoretic quantity relating the effective dimension of the
state-action feature space. Our method bypasses the need for estimating
transition probabilities and applies to any domain on which kernels can be
defined. It also brings new insights into the general theory of kernel methods
for approximate inference and RL regret minimization.
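The core object in this approach is an embedding of each transition distribution in a reproducing kernel Hilbert space, estimated by kernel ridge regression, so that expected values of the next state become linear functionals of the embedding. The sketch below illustrates that conditional-mean-embedding step only; the kernel choice, the regularizer `lam`, and the class name are assumptions, and the paper's optimism bonus is omitted.

```python
import numpy as np

# Illustrative conditional mean embedding (CME) of a transition model,
# in the spirit of the abstract but not its exact algorithm.

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

class TransitionCME:
    """Embeds P(s' | s, a) in an RKHS from transition samples {(x_i, s'_i)},
    where x_i = (s_i, a_i). Expectations E[V(s') | s, a] are then linear
    functionals of the embedding, so no transition probabilities are fit."""

    def __init__(self, X, S_next, lam=0.1, lengthscale=1.0):
        self.X, self.S_next, self.ls = X, S_next, lengthscale
        n = len(X)
        K = rbf_kernel(X, X, lengthscale)
        # Kernel ridge regression weights, shared by all target functions V.
        self.W = np.linalg.solve(K + lam * n * np.eye(n), np.eye(n))

    def expected_value(self, x_query, V):
        """Estimate E[V(s') | x_query]; V maps an array of states to values."""
        k_q = rbf_kernel(np.atleast_2d(x_query), self.X, self.ls)  # (1, n)
        return float(k_q @ self.W @ V(self.S_next))
```

A usage sketch consistent with the abstract: plug `expected_value((s, a), V)` into one step of value iteration, e.g. Q(s, a) = r(s, a) + E[V(s') | s, a], and add an exploration bonus derived from the kernel posterior variance to balance the exploitation-exploration tradeoff.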
Differentially Private Reward Estimation with Preference Feedback
Learning from preference-based feedback has recently gained considerable
traction as a promising approach to align generative models with human
interests. Instead of relying on numerical rewards, the generative models are
trained using reinforcement learning with human feedback (RLHF). These
approaches first solicit feedback from human labelers typically in the form of
pairwise comparisons between two possible actions, then estimate a reward model
using these comparisons, and finally employ a policy based on the estimated
reward model. An adversarial attack in any step of the above pipeline might
reveal private and sensitive information of human labelers. In this work, we
adopt the notion of label differential privacy (DP) and focus on the problem of
reward estimation from preference-based feedback while protecting the privacy of
each individual labeler. Specifically, we consider the parametric
Bradley-Terry-Luce (BTL) model for such pairwise comparison feedback, involving
a latent reward parameter $\theta^*$. Within a standard minimax estimation
framework, we provide tight upper and lower bounds on the error in estimating
$\theta^*$ under both local and central models of DP. We quantify, for a given
privacy budget $\epsilon$ and number of samples $n$, the additional estimation
cost incurred to ensure label-DP under the local model and under the weaker
central model. We perform simulations on synthetic data that corroborate these
theoretical results.
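A common way to obtain label-DP in the local model is to randomize each binary preference before it leaves the labeler, e.g. via randomized response, and then fit the BTL (logistic) reward model on the noisy comparisons. The sketch below illustrates that pipeline; the flip probability and the plain maximum-likelihood fit are textbook choices, not necessarily the estimator analyzed in the paper.

```python
import numpy as np

# Illustrative local label-DP pipeline for preference data (not the paper's
# exact estimator): randomized response on each binary preference label,
# followed by a logistic / BTL maximum-likelihood fit on the noisy labels.

def randomized_response(labels, eps, rng):
    """Keep each binary label with probability e^eps / (1 + e^eps),
    otherwise flip it. This mechanism is eps-label-DP in the local model."""
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    keep = rng.random(len(labels)) < p_keep
    return np.where(keep, labels, 1 - labels)

def fit_btl(X_diff, labels, lr=0.1, n_iter=2000):
    """Gradient ascent on the logistic log-likelihood.
    X_diff[i] = phi(action_1) - phi(action_0) for comparison i;
    label 1 means action_1 was preferred."""
    theta = np.zeros(X_diff.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X_diff @ theta))        # P(label = 1)
        theta += lr * X_diff.T @ (labels - p) / len(labels)
    return theta

rng = np.random.default_rng(0)
d, n, eps = 5, 5000, 1.0
theta_star = rng.normal(size=d)                           # latent reward parameter
X_diff = rng.normal(size=(n, d))                          # feature differences
labels = (rng.random(n) < 1 / (1 + np.exp(-X_diff @ theta_star))).astype(int)
theta_hat = fit_btl(X_diff, randomized_response(labels, eps, rng))
```

In practice the likelihood would also be debiased for the known flip probability; that correction is omitted here for brevity, which is why the naive fit above shrinks the estimate toward zero.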
Differentially Private Episodic Reinforcement Learning with Heavy-tailed Rewards
In this paper, we study the problem of (finite horizon tabular) Markov
decision processes (MDPs) with heavy-tailed rewards under the constraint of
differential privacy (DP). Compared with the previous studies for private
reinforcement learning that typically assume rewards are sampled from some
bounded or sub-Gaussian distributions to ensure DP, we consider the setting
where reward distributions have only finite $(1+v)$-th moments with some $v \in (0,1]$. By resorting to robust mean estimators for rewards, we first propose
two frameworks for heavy-tailed MDPs: one for value iteration and the other for
policy optimization. Under each framework, we consider both
joint differential privacy (JDP) and local differential privacy (LDP) models.
Based on our frameworks, we provide regret upper bounds for both JDP and LDP
cases and show that the moment order of the reward distribution and the privacy
budget both have significant impacts on the regret. Finally, we establish a
lower bound on regret minimization for heavy-tailed MDPs in the JDP model by
reducing it to the instance-independent lower bound for heavy-tailed multi-armed
bandits in the DP model. We also show a lower bound for the problem in the LDP
model by adopting some
private minimax methods. Our results reveal that there are fundamental
differences between the problem of private RL with sub-Gaussian rewards and that
with heavy-tailed rewards.
Comment: ICML 2023. arXiv admin note: text overlap with arXiv:2009.09052 by
other authors
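The key primitive in both frameworks is a reward-mean estimate that is simultaneously robust to heavy tails and differentially private. One standard way to get both (illustrative only; the paper's estimators and noise calibration may differ) is to truncate rewards at a data-independent threshold, average, and add Laplace noise scaled to the sensitivity of the truncated mean.

```python
import numpy as np

# Illustrative private robust mean estimator (not necessarily the paper's):
# truncate heavy-tailed rewards at a threshold M, average, then add Laplace
# noise calibrated to the sensitivity of the truncated mean.

def private_truncated_mean(rewards, eps, u, v, rng=None):
    """rewards: samples with E|X|^{1+v} <= u for some v in (0, 1].
    eps: privacy budget spent on this statistic.
    Returns an eps-DP estimate of the mean."""
    rng = np.random.default_rng(rng)
    x = np.asarray(rewards, dtype=float)
    n = len(x)
    # Truncation threshold trading clipping bias against added noise;
    # M ~ (u * n)^{1/(1+v)} is a simplified, commonly used scaling.
    M = (u * n) ** (1.0 / (1.0 + v))
    clipped = np.clip(x, -M, M)
    # Changing one reward moves the clipped mean by at most 2M / n.
    sensitivity = 2.0 * M / n
    return clipped.mean() + rng.laplace(scale=sensitivity / eps)

# Usage sketch on a Pareto-like heavy-tailed reward distribution:
# shape a = 1.8 has finite (1+v)-th moments for v < 0.8.
rng = np.random.default_rng(1)
rewards = rng.pareto(a=1.8, size=10_000)
est = private_truncated_mean(rewards, eps=1.0, u=5.0, v=0.5, rng=rng)
```

Plugging such an estimator into value iteration or policy optimization, and accounting for both the clipping bias and the privacy noise in the confidence bonuses, is what makes the moment order and the privacy budget show up jointly in the regret bounds.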