Perseus: Randomized Point-based Value Iteration for POMDPs
Partially observable Markov decision processes (POMDPs) form an attractive
and principled framework for agent planning under uncertainty. Point-based
approximate techniques for POMDPs compute a policy based on a finite set of
points collected in advance from the agent's belief space. We present a
randomized point-based value iteration algorithm called Perseus. The algorithm
performs approximate value backup stages, ensuring that in each backup stage
the value of each point in the belief set is improved; the key observation is
that a single backup may improve the value of many belief points. Contrary to
other point-based methods, Perseus backs up only a (randomly selected) subset
of points in the belief set, sufficient for improving the value of each belief
point in the set. We show how the same idea can be extended to dealing with
continuous action spaces. Experimental results show the potential of Perseus in
large-scale POMDP problems.
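As an illustration of the backup stage described above, the following minimal Python sketch follows the published description rather than the authors' code (the helper point_backup is a hypothetical placeholder): a belief point is picked at random and backed up, and every point whose value has already improved is dropped from the stage.

    import numpy as np

    def perseus_backup_stage(beliefs, vectors, point_backup):
        """One randomized backup stage in the spirit of Perseus (illustrative sketch).

        beliefs      : list of belief points (1-D numpy arrays) sampled in advance
        vectors      : list of alpha-vectors representing the current value function
        point_backup : function b -> alpha, a point-based Bellman backup at belief b
        """
        value = lambda b, vs: max(float(np.dot(v, b)) for v in vs)
        new_vectors = []
        todo = list(range(len(beliefs)))            # points whose value must still improve
        while todo:
            i = todo[np.random.randint(len(todo))]  # back up a randomly selected point
            alpha = point_backup(beliefs[i])
            if np.dot(alpha, beliefs[i]) >= value(beliefs[i], vectors):
                new_vectors.append(alpha)
            else:
                # keep the best old vector for this point so its value never decreases
                new_vectors.append(max(vectors, key=lambda v: float(np.dot(v, beliefs[i]))))
            # a single backup may improve many points: drop every belief whose value
            # under the new vector set already matches or exceeds its old value
            todo = [j for j in todo
                    if value(beliefs[j], new_vectors) < value(beliefs[j], vectors)]
        return new_vectors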
Influence-Optimistic Local Values for Multiagent Planning --- Extended Version
Recent years have seen the development of methods for multiagent planning
under uncertainty that scale to tens or even hundreds of agents. However, most
of these methods either make restrictive assumptions on the problem domain, or
provide approximate solutions without any guarantees on quality. Methods in the
former category typically build on heuristic search using upper bounds on the
value function. Unfortunately, no techniques exist to compute such upper bounds
for problems with non-factored value functions. To allow for meaningful
benchmarking through measurable quality guarantees on a very general class of
problems, this paper introduces a family of influence-optimistic upper bounds
for factored decentralized partially observable Markov decision processes
(Dec-POMDPs) that do not have factored value functions. Intuitively, we derive
bounds on very large multiagent planning problems by subdividing them into
sub-problems and, for each sub-problem, making optimistic assumptions about
the influence that will be exerted by the rest of the system.
We numerically compare the different upper bounds and demonstrate how we can
achieve a non-trivial guarantee that a heuristic solution for problems with
hundreds of agents is close to optimal. Furthermore, we provide evidence that
the upper bounds may improve the effectiveness of heuristic influence search,
and discuss further potential applications to multiagent planning.
Comment: Long version of IJCAI 2015 paper (and extended abstract at AAMAS 2015).
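The decomposition idea in the abstract above can be illustrated with a short sketch (the names solve_local and candidate_influences are hypothetical placeholders, not the paper's interface): the bound sums, over sub-problems, the best local value attainable under the most favourable influence the rest of the system could exert, which is why it upper-bounds the achievable value.

    def influence_optimistic_upper_bound(subproblems, candidate_influences, solve_local):
        """Sketch of an influence-optimistic bound over a problem decomposition.

        subproblems          : the sub-problems the full planning problem is divided into
        candidate_influences : maps each sub-problem to the set of influences the rest
                               of the system could conceivably exert on it
        solve_local          : (subproblem, influence) -> optimal local value under
                               that fixed influence
        """
        total = 0.0
        for sub in subproblems:
            # optimistic assumption: the rest of the system behaves in whichever
            # way is most favourable for this sub-problem
            total += max(solve_local(sub, infl) for infl in candidate_influences[sub])
        return total  # upper bound on the optimal value of the full problem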
The Beam Conditions Monitor of the LHCb Experiment
The LHCb experiment at the European Organization for Nuclear Research (CERN)
is dedicated to precision measurements of CP violation and rare decays of B
hadrons. Its most sensitive components are protected by means of a Beam
Conditions Monitor (BCM), based on polycrystalline CVD diamond sensors. Its
configuration, operation and decision logic to issue or remove the beam permit
signal for the Large Hadron Collider (LHC) are described in this paper.
Comment: Index Terms: Accelerator measurement systems, CVD, Diamond, Radiation detector
How are practices made to vary? Managing practice adaptation in a multinational corporation
Research has shown that management practices are adapted and ‘made to fit’ the specific context into which they are adopted. Less attention has been paid to how organizations anticipate and purposefully influence the adaptation process. How do organizations manage the tension between allowing local adaptation of a management practice and retaining control over the practice? By studying the adaptation of a specialized quality management practice – ACE (Achieving Competitive Excellence) – in a multinational corporation in the aerospace industry, we examine how the organization manages the adaptation process at the corporate and subsidiary levels. We identify three strategies through which an organization balances the tension between standardization and variation – preserving the ‘core’ practice while allowing local adaptation at the subsidiary level: creating and certifying progressive achievement levels; setting discretionary and mandatory adaptation parameters; and differentially adapting to context-specific and systemic misfits. While previous studies have shown how and why practices vary as they diffuse, we show how practices may diffuse because they are engineered to vary, allowing a better fit with diverse contextual specificities.
The Role of Diverse Replay for Generalisation in Reinforcement Learning
In reinforcement learning (RL), key components of many algorithms are the
exploration strategy and replay buffer. These strategies regulate what
environment data is collected and trained on and have been extensively studied
in the RL literature. In this paper, we investigate the impact of these
components in the context of generalisation in multi-task RL. We investigate
the hypothesis that collecting and training on more diverse data from the
training environment will improve zero-shot generalisation to new
environments/tasks. We motivate mathematically and show empirically that
generalisation to states that are "reachable" during training is improved by
increasing the diversity of transitions in the replay buffer. Furthermore, we
show empirically that this same strategy also shows improvement for
generalisation to similar but "unreachable" states and could be due to improved
generalisation of latent representations.
Comment: 14 pages, 8 figures
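One simple way to increase the diversity of transitions kept in a replay buffer, shown here purely as an illustrative sketch (not necessarily the mechanism evaluated in the paper above), is to evict the stored transition whose state is closest to an incoming one once the buffer is full, so the buffer avoids accumulating near-duplicate experience from a converged policy.

    import numpy as np

    class DiversityReplayBuffer:
        """Minimal sketch of a diversity-seeking replay buffer (illustrative only)."""

        def __init__(self, capacity):
            self.capacity = capacity
            self.transitions = []  # list of (state, action, reward, next_state) tuples

        def add(self, transition):
            if len(self.transitions) < self.capacity:
                self.transitions.append(transition)
                return
            # replace the stored transition whose state is nearest to the new one,
            # keeping a spread-out set of states rather than many near-duplicates
            state = np.asarray(transition[0], dtype=float)
            dists = [np.linalg.norm(state - np.asarray(t[0], dtype=float))
                     for t in self.transitions]
            self.transitions[int(np.argmin(dists))] = transition

        def sample(self, batch_size, rng=np.random):
            idx = rng.choice(len(self.transitions), size=batch_size, replace=True)
            return [self.transitions[i] for i in idx]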
E-MCTS: Deep Exploration in Model-Based Reinforcement Learning by Planning with Epistemic Uncertainty
One of the most well-studied and highly performing planning approaches used
in Model-Based Reinforcement Learning (MBRL) is Monte-Carlo Tree Search (MCTS).
Key challenges of MCTS-based MBRL methods remain dedicated deep exploration and
reliability in the face of the unknown, and both challenges can be alleviated
through principled epistemic uncertainty estimation in the predictions of MCTS.
We present two main contributions: First, we develop methodology to propagate
epistemic uncertainty in MCTS, enabling agents to estimate the epistemic
uncertainty in their predictions. Second, we utilize the propagated uncertainty
for a novel deep exploration algorithm by explicitly planning to explore. We
incorporate our approach into variations of MCTS-based MBRL approaches with
learned and provided dynamics models, and empirically show deep exploration
through successful epistemic uncertainty estimation achieved by our approach.
We compare to a non-planning-based deep-exploration baseline, and demonstrate
that planning with epistemic MCTS significantly outperforms non-planning based
exploration in the investigated deep exploration benchmark.
Comment: Submitted to NeurIPS 2023, accepted to EWRL 2023
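As a rough illustration of what propagating epistemic uncertainty through MCTS can look like, here is a sketch under assumed node attributes (node.var, node.reward_var, e.g. supplied by an ensemble of learned models; this is not the paper's interface): the leaf's value and epistemic variance are both propagated along the search path, and action selection adds an uncertainty bonus so the planner explicitly plans to explore.

    import math

    def backup_with_uncertainty(path, leaf_value, leaf_variance, discount=0.997):
        """Propagate value and epistemic variance up an MCTS search path (sketch).

        path : nodes from root to leaf; each node carries .visits, .mean, .var,
               .reward and .reward_var (assumed attributes for this example)
        """
        value, variance = leaf_value, leaf_variance
        for node in reversed(path):
            node.visits += 1
            node.mean += (value - node.mean) / node.visits          # running mean of returns
            node.var += (variance - node.var) / node.visits         # running mean of variances
            value = node.reward + discount * value                  # discounted return so far
            variance = node.reward_var + discount ** 2 * variance   # gamma scales variance by gamma^2

    def exploratory_score(node, beta=1.0):
        """Optimistic selection score: estimated value plus an epistemic-uncertainty bonus."""
        return node.mean + beta * math.sqrt(max(node.var, 0.0))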
Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
Reinforcement learning agents may sometimes develop habits that are effective
only when specific policies are followed. After an initial exploration phase in
which agents try out different actions, they eventually converge toward a
particular policy. When this occurs, the distribution of state-action
trajectories becomes narrower, and agents start experiencing the same
transitions again and again. At this point, spurious correlations may arise.
Agents may then pick up on these correlations and learn state representations
that do not generalize beyond the agent's trajectory distribution. In this
paper, we provide a mathematical characterization of this phenomenon, which we
refer to as policy confounding, and show, through a series of examples, when
and how it occurs in practice.
Optimal and Approximate Q-value Functions for Decentralized POMDPs
Decision-theoretic planning is a popular approach to sequential decision
making problems, because it treats uncertainty in sensing and acting in a
principled way. In single-agent frameworks like MDPs and POMDPs, planning can
be carried out by resorting to Q-value functions: an optimal Q-value function
Q* is computed in a recursive manner by dynamic programming, and then an
optimal policy is extracted from Q*. In this paper we study whether similar
Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs),
and how policies can be extracted from such value functions. We define two
forms of the optimal Q-value function for Dec-POMDPs: one that gives a
normative description as the Q-value function of an optimal pure joint policy
and another one that is sequentially rational and thus gives a recipe for
computation. This computation, however, is infeasible for all but the smallest
problems. Therefore, we analyze various approximate Q-value functions that
allow for efficient computation. We describe how they relate, and we prove that
they all provide an upper bound to the optimal Q-value function Q*. Finally,
unifying some previous approaches for solving Dec-POMDPs, we describe a family
of algorithms for extracting policies from such Q-value functions, and perform
an experimental evaluation on existing test problems, including a new
firefighting benchmark problem.
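For reference, the single-agent construction the abstract builds on (compute Q* by dynamic programming, then extract a greedy optimal policy) can be written in a few lines. This sketch covers only the standard finite MDP case, not the Dec-POMDP generalisations developed in the paper, which are defined over joint policies and observation histories rather than states.

    import numpy as np

    def q_value_iteration(P, R, gamma=0.95, tol=1e-8):
        """Compute Q* for a finite MDP and extract a greedy (optimal) policy.

        P : transition tensor of shape (A, S, S) with P[a, s, s'] = Pr(s' | s, a)
        R : reward matrix of shape (S, A)
        """
        S, A = R.shape
        Q = np.zeros((S, A))
        while True:
            V = Q.max(axis=1)                                  # V(s') = max_a' Q(s', a')
            Q_new = R + gamma * np.einsum('asn,n->sa', P, V)   # Bellman optimality backup
            if np.max(np.abs(Q_new - Q)) < tol:
                return Q_new, Q_new.argmax(axis=1)             # Q*, greedy policy pi*(s)
            Q = Q_new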