Safe Learning and Optimization Techniques: Towards a Survey of the State of the Art
Safe learning and optimization deals with learning and optimization problems
that avoid, as much as possible, the evaluation of non-safe input points, which
are solutions, policies, or strategies that cause an irrecoverable loss (e.g.,
breakage of a machine or equipment, or life threat). Although a comprehensive
survey of safe reinforcement learning algorithms was published in 2015, a
number of new algorithms have been proposed thereafter, and related works in
active learning and in optimization were not considered. This paper reviews
those algorithms from a number of domains including reinforcement learning,
Gaussian process regression and classification, evolutionary algorithms, and
active learning. We provide the fundamental concepts on which the reviewed
algorithms are based and a characterization of the individual algorithms. We
conclude by explaining how the algorithms are connected and by offering
suggestions for future research.
Comment: The final authenticated publication appeared in: Heintz F., Milano
M., O'Sullivan B. (eds) Trustworthy AI - Integrating Learning, Optimization
and Reasoning. TAILOR 2020. Lecture Notes in Computer Science, vol 12641.
Springer, Cham. The final authenticated publication is available online at
https://doi.org/10.1007/978-3-030-73959-1_12
Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm
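
The interplay between step size and batch size described in this abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or its bounds: it assumes a generic L-smooth objective, a single-sample gradient estimator whose total variance is bounded by a known constant, and a Chebyshev-style concentration argument. All function names and parameters (`estimation_error`, `safe_step_size`, `guaranteed_improvement`, `variance_bound`, `smoothness`) are hypothetical. Under these assumptions, an ascent step theta + alpha * g with estimated gradient g improves the objective by at least alpha * ||g|| * (||g|| - eps) - (L / 2) * alpha^2 * ||g||^2 whenever the estimation error is at most eps.

```python
import math


def estimation_error(variance_bound: float, batch_size: int, delta: float) -> float:
    """Chebyshev-style bound (assumed setting): with probability >= 1 - delta,
    the batch-averaged gradient estimate is within this distance of the true gradient."""
    return math.sqrt(variance_bound / (batch_size * delta))


def safe_step_size(grad_norm: float, eps: float, smoothness: float) -> float:
    """Step size maximizing the improvement lower bound for an L-smooth objective.
    Returns 0.0 (no update) when the estimate is too noisy to guarantee improvement."""
    if smoothness <= 0 or grad_norm <= eps:
        return 0.0
    return (grad_norm - eps) / (smoothness * grad_norm)


def guaranteed_improvement(grad_norm: float, eps: float,
                           smoothness: float, alpha: float) -> float:
    """Lower bound on J(theta + alpha * g) - J(theta), valid when ||g - grad J|| <= eps."""
    return (alpha * grad_norm * (grad_norm - eps)
            - 0.5 * smoothness * alpha ** 2 * grad_norm ** 2)


if __name__ == "__main__":
    # Hypothetical numbers: larger batches shrink eps, which permits larger safe steps.
    for n in (25, 100, 400):
        eps = estimation_error(variance_bound=4.0, batch_size=n, delta=0.04)
        alpha = safe_step_size(grad_norm=2.0, eps=eps, smoothness=1.0)
        gain = guaranteed_improvement(2.0, eps, 1.0, alpha)
        print(f"N={n:4d}  eps={eps:.3f}  alpha={alpha:.3f}  guaranteed gain={gain:.3f}")
```

In this simplified picture, a batch that is too small makes eps exceed the gradient norm, so the only safe choice is not to update; increasing the batch size is what makes a positive step size admissible.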
Gradient-free Policy Architecture Search and Adaptation
We develop a method for policy architecture search and adaptation via
gradient-free optimization which can learn to perform autonomous driving tasks.
By learning from both demonstration and environmental reward we develop a model
that can learn with relatively few early catastrophic failures. We first learn
an architecture of appropriate complexity to perceive aspects of world state
relevant to the expert demonstration, and then mitigate the effect of
domain-shift during deployment by adapting a policy demonstrated in a source
domain to rewards obtained in a target environment. We show that our approach
allows safer learning than baseline methods, offering a reduced cumulative
crash metric over the agent's lifetime as it learns to drive in a realistic
simulated environment.
Comment: Accepted in Conference on Robot Learning, 201