Dynamic priority allocation via restless bandit marginal productivity indices
This paper surveys recent work by the author on the theoretical and
algorithmic aspects of restless bandit indexation as well as on its application
to a variety of problems involving the dynamic allocation of priority to
multiple stochastic projects. The main aim is to present ideas and methods in
an accessible form that is useful to researchers addressing problems of
this kind. Besides building on the rich literature on bandit problems, our
approach draws on ideas from linear programming, economics, and multi-objective
optimization. In particular, the approach was motivated by issues raised in the
seminal work of Whittle (Restless bandits: activity allocation in a changing
world. In: Gani J. (ed.) A Celebration of Applied Probability, J. Appl.
Probab., vol. 25A, Applied Probability Trust, Sheffield, pp. 287-298, 1988)
where he introduced the index for restless bandits that is the starting point
of this work. Such an index, along with previously proposed indices and more
recent extensions, is shown to be unified through the intuitive concept of
"marginal productivity index" (MPI), which measures the marginal productivity
of work on a project at each of its states. In a multi-project setting, MPI
policies are economically sound, as they dynamically allocate higher priority
to those projects where work appears to be currently more productive. Such
index policies are tractable and widely applicable, and a growing body of
computational evidence indicates that they typically achieve near-optimal
performance and substantially outperform benchmark policies derived from
conventional approaches.
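To make the index concept concrete, below is a minimal sketch of computing a Whittle-style index for a discounted restless bandit by binary search on the passive subsidy. This is illustrative only: the discounted formulation, the function names, and the toy parameters are our assumptions (Whittle's original setting is average-reward), and the sketch presumes the project is indexable.

```python
import numpy as np

def q_values(P0, P1, r0, r1, lam, beta=0.95, iters=2000):
    """Value iteration for the lambda-subsidized single-project MDP.
    P0/P1: passive/active transition matrices; r0/r1: reward vectors."""
    V = np.zeros(len(r0))
    for _ in range(iters):
        q0 = r0 + lam + beta * P0 @ V   # passive action also earns the subsidy
        q1 = r1 + beta * P1 @ V
        V = np.maximum(q0, q1)
    return q0, q1

def whittle_index(P0, P1, r0, r1, s, lo=-100.0, hi=100.0, tol=1e-6):
    """Subsidy making state s indifferent between active and passive
    (assumes indexability, so the preference is monotone in lam)."""
    while hi - lo > tol:
        lam = 0.5 * (lo + hi)
        q0, q1 = q_values(P0, P1, r0, r1, lam)
        if q1[s] > q0[s]:   # active still preferred: subsidy too small
            lo = lam
        else:
            hi = lam
    return 0.5 * (lo + hi)

# Toy two-state project; an MPI policy would, at each epoch, activate
# the projects whose current states carry the largest indices.
P0 = np.array([[0.9, 0.1], [0.3, 0.7]])
P1 = np.array([[0.6, 0.4], [0.1, 0.9]])
r0 = np.array([0.0, 0.0])
r1 = np.array([1.0, 2.0])
print([whittle_index(P0, P1, r0, r1, s) for s in (0, 1)])
```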
Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning
Despite the recent success of reinforcement learning in various domains,
these approaches remain, for the most part, deterringly sensitive to
hyper-parameters, and their success often hinges on essential engineering
feats. We consider the case of off-policy generative
adversarial imitation learning, and perform an in-depth review, qualitative and
quantitative, of the method. We show that forcing the learned reward function
to be locally Lipschitz-continuous is a sine qua non condition for the method to
perform well. We then study the effects of this necessary condition and provide
several theoretical results involving the local Lipschitzness of the
state-value function. We complement these guarantees with empirical evidence
attesting to the strong positive effect that consistently satisfying the
Lipschitzness constraint on the reward has on imitation performance. Finally,
we tackle a generic pessimistic reward-preconditioning add-on that spawns a
large class of reward-shaping methods and provably makes the base method it is
plugged into more robust, as shown by several additional theoretical guarantees. We
then discuss these through a fine-grained lens and share our insights.
Crucially, the guarantees derived and reported in this work are valid for any
reward satisfying the Lipschitzness condition; nothing is specific to
imitation. As such, they may be of independent interest.
Corrupted Contextual Bandits with Action Order Constraints
We consider a variant of the novel contextual bandit problem with corrupted
context, which we call the contextual bandit problem with corrupted context and
action correlation, where actions exhibit a relationship structure that can be
exploited to guide the exploration of viable next decisions. Our setting is
primarily motivated by adaptive mobile health interventions and related
applications, where users might transition through different stages requiring
more targeted action-selection approaches. In such settings, maintaining user
engagement is paramount to the success of interventions, and it is therefore
vital to provide relevant recommendations in a timely manner. The context
provided by users might not always be informative at every decision point and
standard contextual approaches to action selection will incur high regret. We
propose a meta-algorithm using a referee that dynamically combines the policies
of a contextual bandit and a multi-armed bandit, similar to previous work, as
well as a simple correlation mechanism that captures action-to-action
transition probabilities, allowing more efficient exploration of
time-correlated actions. We empirically evaluate the performance of this
algorithm on a simulation where the sequence of best actions is determined by a
hidden state that evolves in a Markovian manner. We show that the proposed
meta-algorithm achieves lower regret in situations where the performance of the two
policies varies such that one is strictly superior to the other for a given
time period. To demonstrate the practical applicability of our setting, we
evaluate our method on several real-world data sets, clearly showing better
empirical performance than a set of simple algorithms.
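For concreteness, here is a minimal sketch of one way such a referee could work: an exponential-weights choice between the contextual policy's proposal and the MAB policy's proposal, combined with Laplace-smoothed action-to-action transition counts that occasionally steer exploration. The class name, the Hedge-style update, and the exploration rate are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

class RefereeSketch:
    """Hedge-style referee over two action proposals plus an empirical
    action-to-action transition model (illustrative, hypothetical)."""

    def __init__(self, n_actions, eta=0.1, explore=0.1, seed=0):
        self.w = np.ones(2)                 # weights: [contextual, mab]
        self.eta, self.explore = eta, explore
        self.trans = np.ones((n_actions, n_actions))  # Laplace-smoothed counts
        self.prev_a = None
        self.rng = np.random.default_rng(seed)

    def choose(self, a_contextual, a_mab):
        self.pick = self.rng.choice(2, p=self.w / self.w.sum())
        action = (a_contextual, a_mab)[self.pick]
        # occasionally explore actions that historically follow prev_a
        if self.prev_a is not None and self.rng.random() < self.explore:
            row = self.trans[self.prev_a]
            action = self.rng.choice(len(row), p=row / row.sum())
        return action

    def update(self, action, reward):
        # reward assumed in [0, 1]; exponential-weights update for the
        # policy the referee followed on this round
        self.w[self.pick] *= np.exp(self.eta * reward)
        if self.prev_a is not None:
            self.trans[self.prev_a, action] += 1
        self.prev_a = action
```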
Proceedings of the 11th Atelier en Évaluation de Performances (Workshop on Performance Evaluation)
This document contains the proceedings of the 11th Atelier en Évaluation de Performances, held on 15-17 March 2016 at LAAS-CNRS, Toulouse. The Atelier en Évaluation de Performances is a meeting intended to give young researchers (PhD students and postdocs) in the field of performance modeling and evaluation an opportunity to present their work and meet one another; the discipline is devoted to the study and optimization of stochastic and/or timed dynamic systems arising in computer science, telecommunications, manufacturing, and robotics, among other areas. Informal presentation of work, including work in progress, is encouraged in order to strengthen interactions among young researchers and to prepare submissions of new scientific projects. Survey talks on current research areas, given by established researchers in the field, strengthen the training component of the workshop.