Semidefinite Relaxations for Stochastic Optimal Control Policies
Recent results in the study of the Hamilton-Jacobi-Bellman (HJB) equation
have led to the discovery of a formulation of the value function as a linear
Partial Differential Equation (PDE) for stochastic nonlinear systems with a
mild constraint on their disturbances. This has yielded promising directions
for research in the planning and control of nonlinear systems. This work
proposes a new method for obtaining approximate solutions to these linear
stochastic optimal control (SOC) problems. A candidate polynomial with variable
coefficients is proposed as the solution to the SOC problem. A Sum of Squares
(SOS) relaxation is then applied to the partial differential constraints, leading
to a hierarchy of semidefinite relaxations with an improving sub-optimality gap.
The resulting approximate solutions are shown to be guaranteed over- and
under-approximations of the optimal value function.
Comment: Preprint. Accepted to American Control Conference (ACC) 2014 in
Portland, Oregon. 7 pages, colo
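For reference, the linearity mentioned in this abstract comes from the standard exponential (desirability) transform of the HJB equation. The notation here is an assumption, not taken from the abstract: time-invariant dynamics dx = (f(x) + G(x)u)dt + B(x)dω with noise covariance Σ_ε, cost rate q(x) + ½uᵀRu, a first-exit formulation, and a temperature λ > 0. In this standard form, the "mild constraint on the disturbances" is λGR⁻¹Gᵀ = BΣ_εBᵀ, under which the quadratic terms in ∇V cancel after substituting V = -λ log Ψ:

```latex
\begin{aligned}
0 &= \min_u \Big[\, q + \tfrac12 u^\top R u + (f + G u)^\top \nabla V
      + \tfrac12 \operatorname{Tr}\!\big(B\Sigma_\epsilon B^\top \nabla^2 V\big) \Big],
  \qquad u^* = -R^{-1} G^\top \nabla V \\
V &= -\lambda \log \Psi, \qquad \lambda\, G R^{-1} G^\top = B\Sigma_\epsilon B^\top \\
\Longrightarrow\; 0 &= -\frac{q}{\lambda}\,\Psi + f^\top \nabla \Psi
      + \tfrac12 \operatorname{Tr}\!\big(B\Sigma_\epsilon B^\top \nabla^2 \Psi\big)
\end{aligned}
```

The last line is linear in Ψ, which is what allows a polynomial Ψ with variable coefficients to enter the SOS constraints linearly.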
Abstraction in decision-makers with limited information processing capabilities
A distinctive property of human and animal intelligence is the ability to
form abstractions by neglecting irrelevant information, which makes it possible
to separate structure from noise. From an information-theoretic point of view,
abstractions are desirable because they allow for very efficient information
processing. In artificial systems, abstractions are often implemented through
computationally costly formation of groups or clusters. In this work we
establish the relation between the free-energy framework for decision-making
and rate-distortion theory, and demonstrate how applying rate-distortion theory
to decision-making leads to the emergence of abstractions. We argue that
abstractions are induced by a limit in information-processing capacity.
Comment: Presented at the NIPS 2013 Workshop on Planning with Information
Constraint
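A small sketch of how rate-distortion can produce abstraction, assuming the common free-energy formulation in which a decision-maker with prior p(w) over world states maximises E[U(w,a)] minus (1/β) I(W;A). The optimum satisfies p(a|w) ∝ p(a) exp(βU(w,a)) with p(a) = Σ_w p(w)p(a|w), and Blahut-Arimoto-style alternating updates find it. The function name and the toy utilities are illustrative assumptions, not the paper's experiments.

```python
import math

def blahut_arimoto(p_w, utility, beta, iters=200):
    """Free-energy-optimal policy p(a|w) and action marginal p(a).

    p_w[w]: prior over world states; utility[w][a]: U(w, a);
    beta: inverse temperature trading expected utility against information.
    """
    n_w, n_a = len(p_w), len(utility[0])
    p_a = [1.0 / n_a] * n_a  # start from a uniform action marginal
    p_a_w = []
    for _ in range(iters):
        # Policy update: p(a|w) proportional to p(a) * exp(beta * U(w, a)).
        p_a_w = []
        for w in range(n_w):
            shift = max(beta * u for u in utility[w])  # avoid overflow in exp
            unnorm = [p_a[a] * math.exp(beta * utility[w][a] - shift) for a in range(n_a)]
            z = sum(unnorm)
            p_a_w.append([x / z for x in unnorm])
        # Marginal update: p(a) = sum_w p(w) * p(a|w).
        p_a = [sum(p_w[w] * p_a_w[w][a] for w in range(n_w)) for a in range(n_a)]
    return p_a_w, p_a
```

With two equally likely states and utilities U = [[1.0, 0.0, 0.8], [0.0, 1.0, 0.8]] (action 2 is "good enough" in both states), β = 1 sends both states to action 2, so the two states are effectively merged into one abstract state; β = 20 gives enough information-processing capacity to pick action 0 in state 0 and action 1 in state 1.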
Deep Object-Centric Representations for Generalizable Robot Learning
Robotic manipulation in complex open-world scenarios requires both reliable
physical manipulation skills and effective and generalizable perception. In
this paper, we propose a method where general-purpose pretrained visual models
serve as an object-centric prior for the perception system of a learned policy.
We devise an object-level attentional mechanism that can be used to determine
relevant objects from a few trajectories or demonstrations, and then
immediately incorporate those objects into a learned policy. A task-independent
meta-attention locates possible objects in the scene, and a task-specific
attention identifies which objects are predictive of the trajectories. The
scope of the task-specific attention is easily adjusted by showing
demonstrations with distractor objects or with diverse relevant objects. Our
results indicate that this approach exhibits good generalization across object
instances using very few samples, and can be used to learn a variety of
manipulation tasks using reinforcement learning.
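The abstract does not specify the attention architecture. Purely as an illustration, the sketch below assumes each candidate object found by the task-independent stage is summarised by a feature vector, and that the task-specific attention is a softmax over dot products with a learned query vector. The names `attend`, `query` and `temperature` are hypothetical.

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(object_features, query, temperature=1.0):
    """Weight candidate objects by relevance to a task-specific query.

    object_features: one feature vector per candidate object (hypothetical
        output of a pretrained, task-independent object detector).
    query: hypothetical task-specific vector learned from demonstrations.
    Returns (weights, pooled): attention weights and the weighted feature.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) / temperature
              for feat in object_features]
    weights = softmax(scores)
    dim = len(object_features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, object_features))
              for d in range(dim)]
    return weights, pooled
```

A lower `temperature` makes the attention sharper, concentrating the pooled feature on the single most query-aligned object.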
Path integral policy improvement with differential dynamic programming
Path Integral Policy Improvement with Covariance Matrix Adaptation (PI2-CMA) is a step-based, model-free reinforcement learning approach that combines statistical estimation techniques with fundamental results from stochastic optimal control. A policy distribution is improved iteratively by reward-weighted averaging of the corresponding rollouts. It has been assumed that PI2-CMA somehow exploits gradient information contained in the reward-weighted statistics. To our knowledge, we are the first to expose the principle of this gradient extraction rigorously. Our findings reveal that PI2-CMA essentially obtains gradient information similar to that of the forward and backward passes in the Differential Dynamic Programming (DDP) method. It is then straightforward to extend the analogy with DDP by introducing a feedback term in the policy update. This suggests a novel algorithm, which we coin Path Integral Policy Improvement with Differential Dynamic Programming (PI2-DDP). The resulting algorithm is similar to the previously proposed Sampled Differential Dynamic Programming (SaDDP), but we derive the method independently as a generalization of the PI2-CMA framework. Our derivations suggest some small variations to SaDDP that improve its performance. We validated our claims on a robot trajectory learning task.
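A minimal sketch of the reward-weighted averaging step described in this abstract, assuming a simplified episode-based variant with a diagonal Gaussian over policy parameters. The per-time-step weighting of step-based PI2-CMA and the DDP-style feedback term of PI2-DDP are deliberately omitted. The function name, the min-max cost normalisation and the sharpness parameter `h` are illustrative assumptions.

```python
import math
import random

def pi2_cma_step(mean, var, cost_fn, n_samples=20, h=10.0, rng=random):
    """One reward-weighted averaging update of a diagonal Gaussian policy.

    mean, var: per-parameter mean and variance of the search distribution.
    cost_fn: maps a parameter vector to a scalar rollout cost.
    h: sharpness of the exponential weighting (illustrative default).
    """
    dims = range(len(mean))
    samples = [[rng.gauss(mean[d], math.sqrt(var[d])) for d in dims]
               for _ in range(n_samples)]
    costs = [cost_fn(s) for s in samples]
    c_min, c_max = min(costs), max(costs)
    span = (c_max - c_min) or 1.0  # guard against all-equal costs
    # Exponentiate normalised costs: cheaper rollouts get larger weights.
    raw = [math.exp(-h * (c - c_min) / span) for c in costs]
    total = sum(raw)
    weights = [r / total for r in raw]
    # Mean: reward-weighted average of the sampled parameters.
    new_mean = [sum(w * s[d] for w, s in zip(weights, samples)) for d in dims]
    # Variance: reward-weighted spread around the old mean (CMA-style),
    # floored so the search distribution never collapses to zero width.
    new_var = [max(sum(w * (s[d] - mean[d]) ** 2 for w, s in zip(weights, samples)), 1e-8)
               for d in dims]
    return new_mean, new_var
```

Iterating this step from mean [0, 0] and variance [1, 1] on a quadratic cost with its minimum at (1, -2) should move the mean close to that minimum; measuring the spread around the old mean lets the variance grow while the mean is still travelling and shrink once it settles.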
Universal Convexification via Risk-Aversion
We develop a framework for convexifying a fairly general class of
optimization problems. Under additional assumptions, we analyze the
suboptimality of the solution to the convexified problem relative to the
original nonconvex problem and prove additive approximation guarantees. We then
develop algorithms based on stochastic gradient methods to solve the resulting
optimization problems and show bounds on convergence rates. We show a simple
application of this framework to supervised learning, where one can perform
integration explicitly and can use standard (non-stochastic) optimization
algorithms with better convergence guarantees. We then extend this framework to
apply to a general class of discrete-time dynamical systems. In this context,
our convexification approach falls under the well-studied paradigm of
risk-sensitive Markov Decision Processes. We derive the first known model-based
and model-free policy gradient optimization algorithms with guaranteed
convergence to the optimal solution. Finally, we present numerical results
validating our formulation in different applications.
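The abstract does not state the form of the convexified objective. Purely as an illustration of a risk-averse objective, the sketch below estimates the entropic risk (1/α) log E[exp(α f(x + w))] of a randomly perturbed cost f with Gaussian w. For α > 0 this penalises high-cost outcomes and, by Jensen's inequality, never falls below the plain expectation E[f(x + w)]. This specific form and the names `entropic_risk` and `expected_cost` are assumptions; the sketch makes no claim about convexity or about the paper's construction.

```python
import math
import random

def entropic_risk(f, x, alpha, sigma, n_samples=2000, seed=0):
    """Monte Carlo estimate of (1/alpha) * log E[exp(alpha * f(x + w))], w ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed: the same noise samples for every x
    vals = [alpha * f(x + rng.gauss(0.0, sigma)) for _ in range(n_samples)]
    shift = max(vals)  # log-sum-exp shift for numerical stability
    log_mean_exp = shift + math.log(sum(math.exp(v - shift) for v in vals) / n_samples)
    return log_mean_exp / alpha

def expected_cost(f, x, sigma, n_samples=2000, seed=0):
    """Risk-neutral counterpart: Monte Carlo estimate of E[f(x + w)]."""
    rng = random.Random(seed)
    return sum(f(x + rng.gauss(0.0, sigma)) for _ in range(n_samples)) / n_samples
```

Because both functions draw the same noise samples for a given seed, their difference isolates the effect of risk-aversion; as α approaches 0 the entropic risk approaches the expected cost.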
Domain Decomposition for Stochastic Optimal Control
This work proposes a method for solving linear stochastic optimal control
(SOC) problems using sum of squares and semidefinite programming. Previous work
had used polynomial optimization to approximate the value function, requiring a
high polynomial degree to capture local phenomena. To improve the scalability
of the method to problems of interest, a domain decomposition scheme is
presented. By using local approximations, lower degree polynomials become
sufficient, and both local and global properties of the value function are
captured. The domain of the problem is split into a non-overlapping partition,
with added constraints ensuring continuity. The Alternating Direction
Method of Multipliers (ADMM) is used to optimize over each domain in parallel
and ensure convergence on the boundaries of the partitions. This results in
improved conditioning of the problem and allows for much larger and more
complex problems to be addressed with improved performance.
Comment: 8 pages. Accepted to CDC 201
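To illustrate how ADMM drives neighbouring subdomains to agree at a shared interface, here is a consensus-ADMM toy. As an assumption for illustration, each subdomain's problem is reduced to a scalar quadratic 0.5·a_i·(x - b_i)² in the shared boundary value; in the method described in this abstract, each local step would instead be a sum-of-squares/semidefinite program. The function name and data are hypothetical.

```python
def consensus_admm(a, b, rho=1.0, iters=200):
    """Agree on a shared boundary value z across subdomains (scaled-form ADMM).

    Subdomain i minimises the toy local cost 0.5 * a[i] * (x - b[i])**2;
    the constraint x_i = z stands in for continuity at the interface.
    """
    n = len(a)
    x = [0.0] * n
    u = [0.0] * n  # scaled dual variables, one per subdomain
    z = 0.0
    for _ in range(iters):
        # Local updates are independent, so they could run in parallel.
        x = [(a[i] * b[i] + rho * (z - u[i])) / (a[i] + rho) for i in range(n)]
        # Consensus update: average of local values plus scaled duals.
        z = sum(x[i] + u[i] for i in range(n)) / n
        # Dual update on the disagreement x_i - z.
        u = [u[i] + x[i] - z for i in range(n)]
    return x, z
```

With a = [1, 3] and b = [0, 4], the iterates approach z = 3, the minimiser of the summed local costs (Σ a_i b_i / Σ a_i = 12 / 4), and every local x_i agrees with it.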
- …