Representations for optimal stopping under dynamic monetary utility functionals
In this paper we consider the optimal stopping problem for general dynamic monetary utility functionals. Sufficient conditions for the Bellman principle and the existence of optimal stopping times are provided. Particular attention is paid to representations which allow for a numerical treatment in practical situations. To this end, generalizations of standard evaluation methods such as policy iteration, dual and consumption-based approaches are developed in the context of general dynamic monetary utility functionals. As a result, it turns out that whether a particular generalization is possible depends on specific properties of the utility functional under consideration.

Keywords: monetary utility functionals, optimal stopping, duality, policy iteration
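As an illustration (not from the paper), the Bellman principle for optimal stopping under a dynamic monetary utility functional can be sketched numerically by backward induction, U_t = max(X_t, rho_t(U_{t+1})). The entropic utility below is an assumed example of such a functional; the chain, payoffs, and risk parameter gamma are toy values:

import numpy as np

gamma = 1.0                    # risk-aversion parameter (assumed)
T, n_states = 5, 3             # horizon and state-space size (toy)
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=n_states)  # Markov transition matrix
X = rng.uniform(0.0, 1.0, size=(T + 1, n_states))    # payoff X_t per state

U = X[T].copy()                # terminal condition: U_T = X_T
for t in range(T - 1, -1, -1):
    # Conditional entropic utility of continuing: rho_t(U_{t+1})
    cont = -np.log(P @ np.exp(-gamma * U)) / gamma
    # Bellman principle: stop as soon as the payoff beats continuation
    U = np.maximum(X[t], cont)

print("utility value at t=0 per state:", U)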
Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
Value-based reinforcement-learning algorithms provide state-of-the-art
results in model-free discrete-action settings, and tend to outperform
actor-critic algorithms. We argue that actor-critic algorithms are limited by
their need for an on-policy critic. We propose Bootstrapped Dual Policy
Iteration (BDPI), a novel model-free reinforcement-learning algorithm for
continuous states and discrete actions, with an actor and several off-policy
critics. Off-policy critics are compatible with experience replay, ensuring
high sample-efficiency, without the need for off-policy corrections. The actor,
by slowly imitating the average greedy policy of the critics, leads to
high-quality and state-specific exploration, which we compare to Thompson
sampling. Because the actor and critics are fully decoupled, BDPI is remarkably
stable, and unusually robust to its hyper-parameters. BDPI is significantly
more sample-efficient than Bootstrapped DQN, PPO, and ACKTR, on discrete,
continuous and pixel-based tasks. Source code:
https://github.com/vub-ai-lab/bdpi

Comment: Accepted at the European Conference on Machine Learning 2019 (ECML)
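As a minimal sketch of the actor update described above (the actor slowly imitating the critics' average greedy policy), for a single state; sizes and the step size lam are assumed toy values, not BDPI's actual hyper-parameters:

import numpy as np

n_actions, n_critics, lam = 4, 8, 0.05        # lam: assumed actor step size
rng = np.random.default_rng(1)

Q = rng.normal(size=(n_critics, n_actions))   # each critic's Q-values at one state
pi = np.full(n_actions, 1.0 / n_actions)      # actor's current policy at that state

# Average greedy policy of the critics: each critic votes for its argmax action.
greedy = np.bincount(Q.argmax(axis=1), minlength=n_actions) / n_critics

# The actor slowly imitates this average greedy policy.
pi = (1.0 - lam) * pi + lam * greedy
print(pi, pi.sum())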
Safe Reinforcement Learning with Dual Robustness
Reinforcement learning (RL) agents are vulnerable to adversarial
disturbances, which can deteriorate task performance or compromise safety
specifications. Existing methods either address safety requirements under the
assumption of no adversary (e.g., safe RL) or only focus on robustness against
performance adversaries (e.g., robust RL). Learning one policy that is both
safe and robust remains a challenging open problem. The difficulty lies in
tackling two intertwined aspects in the worst case: feasibility and
optimality. Optimality is only meaningful inside the feasible region, while
identifying the maximal feasible region requires learning the optimal
policy. To address
this issue, we propose a systematic framework to unify safe RL and robust RL,
including problem formulation, iteration scheme, convergence analysis and
practical algorithm design. This unification is built upon constrained
two-player zero-sum Markov games. A dual policy iteration scheme is proposed,
which simultaneously optimizes a task policy and a safety policy. The
convergence of this iteration scheme is proved. Furthermore, we design a deep
RL algorithm for practical implementation, called dually robust actor-critic
(DRAC). The evaluations with safety-critical benchmarks demonstrate that DRAC
achieves high performance and persistent safety under all scenarios (no
adversary, safety adversary, performance adversary), outperforming all
baselines significantly.
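As a toy illustration (not the paper's algorithm), the core idea of coupling a safety policy, which minimizes worst-case cumulative cost and defines the feasible region, with a task policy, which maximizes worst-case return over feasible actions, can be sketched in tabular form. Sizes, dynamics, and the cost budget d are all assumptions, and the value-iteration-style loops simplify the paper's dual policy iteration scheme:

import numpy as np

rng = np.random.default_rng(2)
nS, nA, nB = 6, 3, 3          # states, task actions, adversary actions (toy)
gamma, d = 0.9, 5.0           # discount and safety budget (assumed)

P = rng.dirichlet(np.ones(nS), size=(nS, nA, nB))  # P[s, a, b] -> next-state dist
r = rng.uniform(0, 1, size=(nS, nA, nB))           # task reward
c = rng.uniform(0, 1, size=(nS, nA, nB))           # safety cost

# Safety side: worst-case cost value V_c(s) = min_a max_b [c + gamma * P V_c].
Vc = np.zeros(nS)
for _ in range(300):
    Qc = c + gamma * P @ Vc                # shape (nS, nA, nB)
    Vc = Qc.max(axis=2).min(axis=1)        # adversary raises cost, safety policy lowers it

# Feasible (s, a) pairs keep worst-case cost within budget; where no action
# meets the budget, fall back to the safety policy's own action.
worst_cost = Qc.max(axis=2)
feasible = worst_cost <= np.maximum(d, Vc[:, None] + 1e-9)

# Task side: worst-case return, maximized over feasible actions only.
Vr = np.zeros(nS)
for _ in range(300):
    Qr = r + gamma * P @ Vr
    worst = np.where(feasible, Qr.min(axis=2), -np.inf)  # adversary lowers return
    Vr = worst.max(axis=1)

print("task policy:", worst.argmax(axis=1))
print("safety policy:", worst_cost.argmin(axis=1))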
A Neural Benders Decomposition for the Hub Location Routing Problem
In this study, we propose an imitation learning framework designed to enhance
the Benders decomposition method. Our primary focus is addressing degeneracy in
subproblems with multiple dual optima, among which the Magnanti-Wong technique
identifies a non-dominated solution. We develop two policies. In the first
policy, we replicate the Magnanti-Wong method and learn from each iteration. In
the second policy, we learn a trajectory that accelerates convergence to the
final subproblem dual solution. We train and assess these
two policies through extensive computational experiments on a network design
problem with a flow subproblem, confirming that the learned policies
significantly enhance the efficiency of the decomposition process.
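As a minimal sketch of the imitation idea (the second policy above): learn a map from the master's first-stage decision to the subproblem's final dual solution, so a predicted dual can cheaply generate a Benders optimality cut. The linear least-squares "policy" and the dual_oracle stand-in are illustrative assumptions; the paper trains neural policies on actual flow subproblems:

import numpy as np

rng = np.random.default_rng(3)
n_x, n_dual = 8, 5

W_true = rng.normal(size=(n_x, n_dual))   # hidden toy relationship

def dual_oracle(x):
    # Hypothetical stand-in for solving the flow subproblem to optimality.
    return x @ W_true + 0.01 * rng.normal(size=n_dual)

# Collect (master decision, optimal subproblem dual) pairs from past iterations.
X = rng.uniform(size=(200, n_x))
Y = np.array([dual_oracle(x) for x in X])

# Imitation step: fit the dual-prediction policy by least squares.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At a new iteration, predict the dual pi and use it to form a Benders
# optimality cut theta >= pi^T (h - T x) without fully solving the subproblem.
x_new = rng.uniform(size=n_x)
pi_pred = x_new @ W_hat
print("predicted dual:", pi_pred)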