2,093 research outputs found
Scalable Multiagent Coordination with Distributed Online Open Loop Planning
We propose distributed online open loop planning (DOOLP), a general framework
for online multiagent coordination and decision making under uncertainty. DOOLP
is based on online heuristic search in the space defined by a generative model
of the domain dynamics, which is exploited by agents to simulate and evaluate
the consequences of their potential choices.
We also propose distributed online Thompson sampling (DOTS) as an effective
instantiation of the DOOLP framework. DOTS models sequences of agent choices by
concatenating a number of multiarmed bandits for each agent and uses Thompson
sampling for dealing with action value uncertainty. The Bayesian approach
underlying Thompson sampling allows to effectively model and estimate
uncertainty about (a) own action values and (b) other agents' behavior. This
approach yields a principled and statistically sound solution to the
exploration-exploitation dilemma when exploring large search spaces with
limited resources.
We implemented DOTS in a smart factory case study with positive empirical
results. We observed effective, robust and scalable planning and coordination
capabilities even when only searching a fraction of the potential search space
PAC-Bayesian Analysis of the Exploration-Exploitation Trade-off
We develop a coherent framework for integrative simultaneous analysis of the
exploration-exploitation and model order selection trade-offs. We improve over
our preceding results on the same subject (Seldin et al., 2011) by combining
PAC-Bayesian analysis with Bernstein-type inequality for martingales. Such a
combination is also of independent interest for studies of multiple
simultaneously evolving martingales.Comment: On-line Trading of Exploration and Exploitation 2 - ICML-2011
workshop. http://explo.cs.ucl.ac.uk/workshop
Heterogeneous Stochastic Interactions for Multiple Agents in a Multi-armed Bandit Problem
We define and analyze a multi-agent multi-armed bandit problem in which
decision-making agents can observe the choices and rewards of their neighbors.
Neighbors are defined by a network graph with heterogeneous and stochastic
interconnections. These interactions are determined by the sociability of each
agent, which corresponds to the probability that the agent observes its
neighbors. We design an algorithm for each agent to maximize its own expected
cumulative reward and prove performance bounds that depend on the sociability
of the agents and the network structure. We use the bounds to predict the rank
ordering of agents according to their performance and verify the accuracy
analytically and computationally
PAC-Bayesian Analysis of Martingales and Multiarmed Bandits
We present two alternative ways to apply PAC-Bayesian analysis to sequences
of dependent random variables. The first is based on a new lemma that enables
to bound expectations of convex functions of certain dependent random variables
by expectations of the same functions of independent Bernoulli random
variables. This lemma provides an alternative tool to Hoeffding-Azuma
inequality to bound concentration of martingale values. Our second approach is
based on integration of Hoeffding-Azuma inequality with PAC-Bayesian analysis.
We also introduce a way to apply PAC-Bayesian analysis in situation of limited
feedback. We combine the new tools to derive PAC-Bayesian generalization and
regret bounds for the multiarmed bandit problem. Although our regret bound is
not yet as tight as state-of-the-art regret bounds based on other
well-established techniques, our results significantly expand the range of
potential applications of PAC-Bayesian analysis and introduce a new analysis
tool to reinforcement learning and many other fields, where martingales and
limited feedback are encountered
- …