Reinforcement Learning in Rich-Observation MDPs using Spectral Methods
Reinforcement learning (RL) in Markov decision processes (MDPs) with large state spaces is a challenging problem. The performance of standard RL algorithms degrades drastically with the dimensionality of the state space. In practice, however, these large MDPs typically incorporate a latent or hidden low-dimensional structure. In this paper, we study the setting of rich-observation Markov decision processes (ROMDPs), where a small number of hidden states possess an injective mapping to the observation states. In other words, every observation state is generated by a single hidden state, and this mapping is unknown a priori. We introduce a spectral decomposition method that consistently learns this mapping and, more importantly, achieves it with low regret. The estimated mapping is integrated into an optimistic RL algorithm (UCRL), which operates on the estimated hidden space. We derive finite-time regret bounds for our algorithm with a weak dependence on the dimensionality of the observed space. In fact, our algorithm asymptotically achieves the same average regret as the oracle UCRL algorithm, which has knowledge of the mapping from hidden to observed spaces. Thus, we obtain an efficient spectral RL algorithm for ROMDPs.
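No pseudocode accompanies this abstract, so here is a minimal Python sketch of the pipeline it describes: spectrally decompose empirical observation statistics to recover the observation-to-hidden-state mapping, then run an optimistic algorithm such as UCRL on the aggregated hidden space. The SVD-plus-k-means estimator and all names below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def estimate_hidden_mapping(obs_transition_counts, n_hidden, seed=0):
    """Illustrative stand-in for the paper's spectral decomposition:
    embed observations via a rank-n_hidden SVD of the empirical
    observation-to-observation transition matrix, then cluster them.
    labels[o] is the estimated hidden state behind observation o."""
    row_sums = obs_transition_counts.sum(axis=1, keepdims=True)
    P_hat = obs_transition_counts / np.maximum(row_sums, 1)  # row-stochastic estimate
    U, _, _ = np.linalg.svd(P_hat, full_matrices=False)
    features = U[:, :n_hidden]                               # spectral embedding

    # plain k-means on the embedding; observations generated by the same
    # hidden state should land in the same cluster
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=n_hidden, replace=False)]
    for _ in range(100):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(n_hidden):
            if (labels == k).any():
                centers[k] = features[labels == k].mean(axis=0)
    return labels

# Once labels are in hand, transition and reward counts can be aggregated over
# observations sharing a label, and UCRL run on the (much smaller) hidden MDP.
```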
Provably Efficient UCB-type Algorithms For Learning Predictive State Representations
The general sequential decision-making problem, which includes Markov decision processes (MDPs) and partially observable MDPs (POMDPs) as special cases, aims at maximizing a cumulative reward by making a sequence of decisions based on a history of observations and actions over time. Recent studies have shown that the sequential decision-making problem is statistically learnable if it admits a low-rank structure modeled by predictive state representations (PSRs). Despite these advancements, existing approaches typically involve oracles or steps that are not computationally efficient. On the other hand, upper confidence bound (UCB) based approaches, which have served successfully as computationally efficient methods in bandits and MDPs, have not been investigated for more general PSRs, due to the difficulty of designing optimistic bonuses in these more challenging settings. This paper proposes the first known UCB-type approach for PSRs, featuring a novel bonus term that upper bounds the total variation distance between the estimated and true models. We further characterize the sample complexity bounds for our designed UCB-type algorithms for both online and offline PSRs. In contrast to existing approaches for PSRs, our UCB-type algorithms enjoy computational efficiency, a last-iterate guarantee of a near-optimal policy, and guaranteed model accuracy.
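The abstract's key ingredient is a bonus term that upper bounds the total variation distance between the estimated and true models. As a hedged illustration of that flavor of optimism, the sketch below pairs an exact TV-distance computation with a hypothetical 1/sqrt(n)-style bonus; the paper's actual bonus term and its constants are not reproduced here.

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 0.5 * np.abs(p - q).sum()

def model_error_bonus(visit_count, rank, horizon, delta=0.05, c=1.0):
    """Hypothetical UCB-style bonus: an upper bound on the TV distance
    between estimated and true models that shrinks as O(1/sqrt(n)).
    The dependence on the PSR rank and horizon is a placeholder."""
    n = max(visit_count, 1)
    return min(1.0, c * np.sqrt(rank * horizon * np.log((n + 1) / delta) / n))

# Optimistic planning then uses  V_hat(pi) + H * bonus  as the value target,
# since a TV error of eps on the model perturbs an H-step return by at most eps * H.
```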
Sample Efficient Policy Search for Optimal Stopping Domains
Optimal stopping problems consider the question of deciding when to stop an observation-generating process in order to maximize a return. We examine the problem of simultaneously learning and planning in such domains, when data is collected directly from the environment. We propose GFSE, a simple and flexible model-free policy search method that reuses data for sample efficiency by leveraging problem structure. We bound the sample complexity of our approach to guarantee uniform convergence of policy value estimates, tightening existing PAC bounds to achieve logarithmic dependence on horizon length for our setting. We also examine the benefit of our method against prevalent model-based and model-free approaches on 3 domains taken from diverse fields.
Comment: To appear in IJCAI-201
Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight
This paper studies the sample efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called "multiple observations in hindsight", where after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sample-efficient learning under this feedback model is possible for two new subclasses of POMDPs: multi-observation revealing POMDPs and distinguishable POMDPs. Both subclasses generalize and substantially relax revealing POMDPs, a widely studied subclass for which sample-efficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be different, rather than linearly independent as required in revealing POMDPs.
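The feedback model itself is easy to render in code. Below is a hypothetical environment wrapper illustrating "multiple observations in hindsight": after an episode ends, the learner receives k fresh observations drawn from the emission distribution of each latent state it visited, while the states themselves stay hidden. The reset/emit/step API of the pomdp object is assumed for illustration.

```python
class HindsightObservationWrapper:
    """Sketch of the 'multiple observations in hindsight' feedback model.
    The wrapped `pomdp` is assumed (hypothetically) to expose
    reset() -> latent state, emit(state) -> observation, and
    step(state, action) -> (next_state, reward, done)."""

    def __init__(self, pomdp, k):
        self.pomdp = pomdp
        self.k = k  # number of extra hindsight observations per visited state

    def run_episode(self, policy):
        visited, trajectory = [], []
        state = self.pomdp.reset()          # latent state, never shown to the learner
        done = False
        while not done:
            obs = self.pomdp.emit(state)    # the learner only ever sees observations
            action = policy(obs)
            visited.append(state)
            state, reward, done = self.pomdp.step(state, action)
            trajectory.append((obs, action, reward))
        # hindsight feedback: k fresh observations per encountered latent state,
        # in visitation order, with the latent states themselves withheld
        hindsight = [[self.pomdp.emit(s) for _ in range(self.k)] for s in visited]
        return trajectory, hindsight
```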
- …