192 research outputs found
Provably Efficient UCB-type Algorithms For Learning Predictive State Representations
The general sequential decision-making problem, which includes Markov
decision processes (MDPs) and partially observable MDPs (POMDPs) as special
cases, aims at maximizing a cumulative reward by making a sequence of decisions
based on a history of observations and actions over time. Recent studies have
shown that the sequential decision-making problem is statistically learnable if
it admits a low-rank structure modeled by predictive state representations
(PSRs). Despite these advancements, existing approaches typically involve
oracles or steps that are not computationally efficient. On the other hand, the
upper confidence bound (UCB) based approaches, which have served successfully
as computationally efficient methods in bandits and MDPs, have not been
investigated for more general PSRs, due to the difficulty of optimistic bonus
design in these more challenging settings. This paper proposes the first known
UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the
total variation distance between the estimated and true models. We further
characterize sample complexity bounds for the proposed UCB-type algorithms in
both the online and offline PSR settings. In contrast to existing approaches for
PSRs, our UCB-type algorithms enjoy computational efficiency, a last-iterate
guarantee of a near-optimal policy, and guaranteed model accuracy.
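The following is a rough, hedged illustration of the UCB idea this abstract describes, written as a tabular analogue rather than the paper's PSR algorithm: the exploration bonus is sized to dominate the total variation error of the estimated model with high probability, so the backed-up values are optimistic. The function name and the bonus constants are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def optimistic_q_values(counts, rewards, horizon, delta=0.05):
    """Tabular UCB analogue (illustration only).
    counts[s, a, s'] are transition visit counts, rewards[s, a] in [0, 1]
    are assumed known, and the bonus is an L1 (total variation) concentration
    term for the empirical transition model."""
    S, A, _ = counts.shape
    n = np.maximum(counts.sum(axis=2), 1)          # visits to each (s, a)
    p_hat = counts / n[:, :, None]                 # empirical transition model
    # With high probability, TV(p_hat(.|s,a), p(.|s,a)) <= tv_bonus[s, a].
    tv_bonus = np.sqrt((2 * S * np.log(2) + 2 * np.log(S * A * horizon / delta)) / n)
    q = np.zeros((horizon + 1, S, A))
    for h in range(horizon - 1, -1, -1):
        v_next = q[h + 1].max(axis=1)              # optimistic next-step values
        # Optimism: empirical Bellman backup plus a bonus large enough to cover
        # the model-error term |(p_hat - p) . v_next| <= horizon * TV(p_hat, p).
        q[h] = np.clip(rewards + p_hat @ v_next + horizon * tv_bonus, 0.0, horizon)
    return q
```

The paper's bonus plays the same role but is built from the estimated PSR parameters over histories of observations and actions, which the tabular sketch above does not capture.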
Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games
The problem of two-player zero-sum Markov games has recently attracted
increasing interest in theoretical studies of multi-agent reinforcement
learning (RL). In particular, for finite-horizon episodic Markov decision
processes (MDPs), it has been shown that model-based algorithms can find an
ε-optimal Nash Equilibrium (NE) with a sample complexity of O(H^3 S A B / ε^2),
which is optimal in its dependence on the horizon H and the number of states S
(where A and B denote the number of actions of the two players, respectively).
However, none of the existing model-free
algorithms can achieve such an optimality. In this work, we propose a
model-free stage-based Q-learning algorithm and show that it achieves the same
sample complexity as the best model-based algorithm, and hence for the first
time demonstrate that model-free algorithms can enjoy the same optimality in
the H dependence as model-based algorithms. The main improvement in the H
dependence arises from leveraging the popular variance reduction technique
based on the reference-advantage decomposition previously used only for
single-agent RL. However, such a technique relies on a critical monotonicity
property of the value function, which does not hold in Markov games due to the
update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus,
to extend the technique to Markov games, our algorithm features a key novel
design: the reference value functions are updated to the pair of optimistic and
pessimistic value functions whose value difference is the smallest in the
history, which yields the desired improvement in sample efficiency.
- …
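As a companion to the abstract above, here is a minimal, hedged sketch of the min-gap reference update it describes; the class and field names are hypothetical, and the sketch omits the stage-based Q-updates, the CCE oracle, and the reference-advantage estimators.

```python
import numpy as np

class MinGapReference:
    """Illustrative sketch (not the paper's full algorithm): keep per-state
    optimistic/pessimistic reference values, and replace the pair only when
    their gap is the smallest observed so far, so the reference stays reliable
    even though the value estimates need not be monotone under CCE updates."""

    def __init__(self, num_states, horizon):
        self.ref_up = np.full(num_states, float(horizon))    # reference upper (optimistic) values
        self.ref_low = np.zeros(num_states)                   # reference lower (pessimistic) values
        self.best_gap = self.ref_up - self.ref_low            # smallest gap seen so far

    def update(self, s, v_up, v_low):
        """Call with the newly computed optimistic/pessimistic values at state s."""
        gap = v_up - v_low
        if gap < self.best_gap[s]:                    # smallest value difference in the history
            self.best_gap[s] = gap
            self.ref_up[s], self.ref_low[s] = v_up, v_low     # keep this pair as the reference
        return self.ref_up[s], self.ref_low[s]
```

In a full algorithm, these reference values would feed a reference-advantage decomposition of the Q-update; the point of the min-gap rule is only to pick a stable reference without relying on monotonicity.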