Hierarchical Decomposition of Nonlinear Dynamics and Control for System Identification and Policy Distillation
The control of nonlinear dynamical systems remains a major challenge for
autonomous agents. Current trends in reinforcement learning (RL) focus on
complex representations of dynamics and policies, which have yielded impressive
results in solving a variety of hard control tasks. However, this
sophistication, together with heavily over-parameterized models, has come at
the cost of a reduced ability to interpret the resulting policies. In
this paper, we take inspiration from the control community and apply the
principles of hybrid switching systems in order to break down complex dynamics
into simpler components. We exploit the rich representational power of
probabilistic graphical models and derive an expectation-maximization (EM)
algorithm for learning a sequence model to capture the temporal structure of
the data and automatically decompose nonlinear dynamics into stochastic
switching linear dynamical systems. Moreover, we show how this framework of
switching models enables extracting hierarchies of Markovian and
auto-regressive locally linear controllers from nonlinear experts in an
imitation learning scenario. Comment: 2nd Annual Conference on Learning for Dynamics and Control
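The EM decomposition described in this abstract can be illustrated with a much simpler model than the paper's. The Python sketch below assumes a switching linear model with fully observed states, x_{t+1} = A_k x_t + b_k + Gaussian noise, where the mode k follows a Markov chain. The E-step is a scaled forward-backward pass, and the M-step refits the initial distribution, the transition matrix, and one weighted least-squares regression per mode. The function names (`em_switching_linear`, `forward_backward`, `gaussian_loglik`) and the toy data are hypothetical, and the paper's actual model and parameterization may differ.

```python
import numpy as np


def gaussian_loglik(x_next, x, A, b, var):
    """Per-step log N(x_next | A x + b, var * I); returns shape (T,)."""
    resid = x_next - x @ A.T - b
    d = x.shape[1]
    return -0.5 * (np.sum(resid**2, axis=1) / var + d * np.log(2 * np.pi * var))


def forward_backward(log_obs, pi, P):
    """Scaled forward-backward pass; returns gamma, xi and the log-likelihood."""
    T, K = log_obs.shape
    shift = log_obs.max(axis=1, keepdims=True)  # per-step shift for stability
    obs = np.exp(log_obs - shift)
    alpha, c = np.zeros((T, K)), np.zeros(T)
    alpha[0] = pi * obs[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ P) * obs[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = P @ (obs[t + 1] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta  # p(z_t = k | data)
    # xi[t, i, j] = p(z_t = i, z_{t+1} = j | data)
    xi = (alpha[:-1, :, None] * P[None] * (obs[1:] * beta[1:])[:, None, :]
          / c[1:, None, None])
    loglik = np.sum(np.log(c)) + np.sum(shift)
    return gamma, xi, loglik


def em_switching_linear(X, K, n_iter=20, seed=0):
    """EM for x_{t+1} = A_k x_t + b_k + noise, with mode k a Markov chain."""
    rng = np.random.default_rng(seed)
    x, x_next = X[:-1], X[1:]
    T, d = x.shape
    A = rng.normal(scale=0.5, size=(K, d, d))
    b = rng.normal(scale=0.5, size=(K, d))
    var = np.ones(K)
    pi, P = np.full(K, 1.0 / K), np.full((K, K), 1.0 / K)
    X1 = np.hstack([x, np.ones((T, 1))])  # regressors with a bias column
    history = []
    for _ in range(n_iter):
        # E-step: posterior over modes under the current parameters.
        log_obs = np.stack([gaussian_loglik(x_next, x, A[k], b[k], var[k])
                            for k in range(K)], axis=1)
        gamma, xi, ll = forward_backward(log_obs, pi, P)
        history.append(ll)
        # M-step: initial distribution, transitions, per-mode regressions.
        pi = (gamma[0] + 1e-12) / (gamma[0] + 1e-12).sum()
        P = xi.sum(axis=0) + 1e-12
        P /= P.sum(axis=1, keepdims=True)
        for k in range(K):
            w = gamma[:, k]
            G = X1.T @ (w[:, None] * X1) + 1e-8 * np.eye(d + 1)  # tiny ridge
            W = np.linalg.solve(G, X1.T @ (w[:, None] * x_next))
            A[k], b[k] = W[:d].T, W[d]
            resid = x_next - X1 @ W
            var[k] = np.sum(w[:, None] * resid**2) / (d * w.sum() + 1e-12) + 1e-8
    return A, b, var, P, history


# Toy data: a 1-D system whose affine law depends on the sign of x.
rng = np.random.default_rng(1)
X = np.zeros((300, 1))
for t in range(299):
    drift = 1.0 if X[t, 0] < 0 else -1.0
    X[t + 1] = 0.9 * X[t] + drift + 0.05 * rng.normal()

A, b, var, P, history = em_switching_linear(X, K=2, n_iter=20)
```

The toy system switches on the state rather than through a Markov chain, yet the Markov-switching model still separates it into two affine pieces. Because each iteration is an exact EM step (up to tiny regularizers added for numerical safety), the log-likelihoods recorded in `history` should not decrease.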
Learning a Unified Control Policy for Safe Falling
Being able to fall safely is a necessary motor skill for humanoids performing
highly dynamic tasks, such as running and jumping. We propose a new method to
learn a policy that minimizes the maximal impulse during the fall. The
optimization solves for both a discrete contact planning problem and a
continuous optimal control problem. Once trained, the policy can compute the
optimal next contacting body part (e.g. left foot, right foot, or hands),
contact location and timing, and the required joint actuation. We represent the
policy as a mixture of actor-critic neural networks, consisting of n control
policies and their corresponding value functions. Each actor-critic pair is
associated with one of the n possible contacting body parts. During execution,
the actor whose critic predicts the highest value is executed, and its
associated body part becomes the next contact with the ground. With this
mixture-of-actor-critic architecture, the discrete contact-sequence planning
problem is solved by selecting the best critic, while the continuous control
problem is solved by optimizing the actors. We show that our policy achieves
rewards comparable to, and sometimes higher than, those of a recursive search of
the action space using dynamic programming, while running 50 to 400 times faster
during online execution.
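The selection rule described above (execute the actor whose critic predicts the highest value, and take its body part as the next contact) can be sketched in a few lines of Python. The `ActorCritic` class, the linear stand-in critics, and the constant actors below are hypothetical placeholders for trained networks; only the argmax-over-critics selection follows the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

State = Sequence[float]


@dataclass
class ActorCritic:
    """One actor-critic pair tied to a single candidate contact body part."""
    body_part: str
    actor: Callable[[State], List[float]]  # state -> joint actuation
    critic: Callable[[State], float]       # state -> predicted value


def select_action(pairs: List[ActorCritic], state: State) -> Tuple[str, List[float]]:
    """Discrete choice via the best critic; continuous action from its actor."""
    best = max(pairs, key=lambda pair: pair.critic(state))
    return best.body_part, best.actor(state)


def linear(weights: Sequence[float]) -> Callable[[State], float]:
    """Hypothetical stand-in for a trained critic network."""
    return lambda s: sum(w * x for w, x in zip(weights, s))


pairs = [
    ActorCritic("left_foot", actor=lambda s: [0.1, 0.0], critic=linear([1.0, 0.0])),
    ActorCritic("right_foot", actor=lambda s: [0.0, 0.1], critic=linear([0.0, 1.0])),
    ActorCritic("hands", actor=lambda s: [-0.1, -0.1], critic=linear([0.5, 0.5])),
]

# Critic values here are 0.2, 0.8 and 0.5, so the right foot is chosen.
part, action = select_action(pairs, [0.2, 0.8])
```

The discrete decision costs only n critic evaluations per step, which is consistent with the abstract's point that contact planning reduces to picking the best critic rather than searching over contact sequences.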
Model-Based Reinforcement Learning for Stochastic Hybrid Systems
Optimal control of general nonlinear systems is a central challenge in
automation. Enabled by powerful function approximators, data-driven approaches
to control have recently succeeded in challenging robotic applications.
However, such methods often obscure the structure of dynamics and control
behind black-box over-parameterized representations, thus limiting our ability
to understand closed-loop behavior. This paper adopts a hybrid-system view of
nonlinear modeling and control that lends an explicit hierarchical structure to
the problem and breaks down complex dynamics into simpler localized units. We
consider a sequence modeling paradigm that captures the temporal structure of
the data and derive an expectation-maximization (EM) algorithm that
automatically decomposes nonlinear dynamics into stochastic piecewise affine
dynamical systems with nonlinear boundaries. Furthermore, we show that these
time-series models naturally admit a closed-loop extension that we use to
extract local polynomial feedback controllers from nonlinear experts via
behavioral cloning. Finally, we introduce a novel hybrid relative entropy
policy search (Hb-REPS) technique that incorporates the hierarchical nature of
hybrid systems and optimizes a set of time-invariant local feedback controllers
derived from a local polynomial approximation of a global state-value function.
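A stochastic piecewise affine model with nonlinear boundaries can be sketched by letting a softmax over polynomial features of the state choose the active affine mode. The Python sketch below assumes quadratic features, so a mode boundary can be a circle. The feature map, the function names, the `local_controller` rule (polynomial feedback of the most probable mode), and all numbers are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np


def poly_features(x):
    """Hypothetical quadratic feature map [1, x_i, x_i * x_j for i <= j]."""
    x = np.asarray(x, dtype=float)
    iu = np.triu_indices(x.size)
    return np.concatenate([[1.0], x, np.outer(x, x)[iu]])


def mode_probs(x, W):
    """Softmax gating over modes; boundaries are quadratic (nonlinear) in x."""
    logits = W @ poly_features(x)
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()


def step(x, u, W, A, B, c, noise_std, rng):
    """Sample a mode, then the next state from that mode's affine dynamics."""
    p = mode_probs(x, W)
    k = int(rng.choice(len(p), p=p))
    x_next = A[k] @ x + B[k] @ u + c[k] + noise_std * rng.normal(size=x.shape)
    return k, x_next


def local_controller(x, W, K):
    """Hypothetical closed-loop extension: polynomial feedback of likeliest mode."""
    k = int(np.argmax(mode_probs(x, W)))
    return K[k] @ poly_features(x)


# Two modes in 2-D: mode 1 is active outside the unit circle x1^2 + x2^2 = 1.
# Feature order for 2-D x: [1, x1, x2, x1^2, x1*x2, x2^2].
W = np.array([[0.0, 0, 0, 0, 0, 0],
              [-20.0, 0, 0, 20, 0, 20]])
A = np.stack([0.9 * np.eye(2), 0.5 * np.eye(2)])
B = np.stack([np.array([[0.0], [1.0]]), np.array([[0.0], [1.0]])])
c = np.zeros((2, 2))
K = np.stack([np.zeros((1, 6)), -0.1 * np.ones((1, 6))])

rng = np.random.default_rng(0)
k, x_next = step(np.array([2.0, 0.0]), np.array([0.0]), W, A, B, c, 0.0, rng)
```

With the gating weights scaled up (here by 20), the soft boundary approaches a hard switch; smaller weights give a smoother, more stochastic transition between modes.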
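Hb-REPS is named after relative entropy policy search (REPS). As background only, the Python sketch below shows the standard, non-hybrid episodic REPS step: samples are reweighted by exp(R_i / η), with the temperature η chosen so that the KL divergence from the sampling distribution equals a bound ε (the point where the derivative of the REPS dual, ε minus that KL, vanishes), and the search distribution is then refit by weighted maximum likelihood. It assumes uniformly weighted samples, ignores the hybrid structure and the state-value function used by Hb-REPS, and all names and the toy objective are hypothetical.

```python
import numpy as np


def reps_weights(returns, epsilon, iters=100):
    """Weights q_i ~ exp(R_i / eta), with eta set so KL(q || uniform) = epsilon."""
    R = np.asarray(returns, dtype=float)
    scale = R.std() + 1e-12
    Rn = (R - R.max()) / scale  # shift/scale change eta's units, not q
    n = Rn.size

    def weights_and_kl(log_eta):
        q = np.exp(Rn / np.exp(log_eta))
        q /= q.sum()
        kl = float(np.sum(q * np.log(np.maximum(q, 1e-300) * n)))
        return q, kl

    # KL shrinks monotonically as eta grows, so bisect on log(eta).
    lo, hi = -10.0, 10.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weights_and_kl(mid)[1] > epsilon:
            lo = mid  # too greedy: raise the temperature eta
        else:
            hi = mid
    q, kl = weights_and_kl(hi)
    return q, np.exp(hi) * scale, kl


def weighted_gaussian_fit(samples, q):
    """Weighted maximum-likelihood update of a Gaussian search distribution."""
    mean = q @ samples
    diff = samples - mean
    cov = (q[:, None] * diff).T @ diff
    return mean, cov


rng = np.random.default_rng(0)
target = np.array([1.0, 1.0])
samples = rng.normal(size=(200, 2))                  # parameters from N(0, I)
returns = -np.sum((samples - target) ** 2, axis=1)   # toy objective
q, eta, kl = reps_weights(returns, epsilon=0.5)
new_mean, new_cov = weighted_gaussian_fit(samples, q)
```

The KL bound ε limits how far one update can move the search distribution, which is the property the "relative entropy" in the name refers to.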
Learning Adaptive Display Exposure for Real-Time Advertising
In E-commerce advertising, where product recommendations and product ads are
presented to users simultaneously, the traditional setting is to display ads at
fixed positions. However, under such a setting, the advertising system loses
the flexibility to control the number and positions of ads, resulting in
sub-optimal platform revenue and user experience. Consequently, major
e-commerce platforms (e.g., Taobao.com) have begun to consider more flexible
ways to display ads. In this paper, we investigate the problem of advertising
with adaptive exposure: can we dynamically determine the number and positions
of ads for each user visit under certain business constraints so that the
platform revenue can be increased? More specifically, we consider two types of
constraints: a request-level constraint, which ensures user experience for each
user visit, and a platform-level constraint, which controls the overall platform
monetization rate. We model this problem as a Constrained Markov Decision Process with
per-state constraint (psCMDP) and propose a constrained two-level reinforcement
learning approach to decompose the original problem into two relatively
independent sub-problems. To accelerate policy learning, we also devise a
constrained hindsight experience replay mechanism. Experimental evaluations on
industry-scale real-world datasets demonstrate that our approach obtains higher
revenue under the constraints and that the constrained hindsight experience
replay mechanism is effective. Comment: accepted by CIKM201
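The two constraint types can be illustrated with a generic Lagrangian relaxation. This is a swapped-in technique, not the paper's constrained two-level RL approach or its hindsight replay mechanism. In the Python sketch below, the request-level constraint is a hypothetical cap `max_ads` on ads per request, and the platform-level constraint is a target fraction of slots showing ads. A multiplier `lam` prices each ad, every request independently maximizes revenue minus `lam` times its ad count, and bisection finds a `lam` that meets the platform target. Revenues are random toy numbers, and ad positions are ignored.

```python
import numpy as np


def choose_num_ads(rev, lam, max_ads):
    """Request level: pick k <= max_ads ads maximizing revenue minus lam * k."""
    k = np.arange(max_ads + 1)
    return int(np.argmax(rev[: max_ads + 1] - lam * k))  # ties -> fewer ads


def exposure_rate(rev_table, lam, max_ads, slots):
    """Fraction of all display slots filled with ads under multiplier lam."""
    ks = [choose_num_ads(rev, lam, max_ads) for rev in rev_table]
    return sum(ks) / (slots * len(rev_table))


def find_multiplier(rev_table, max_ads, slots, target_rate, hi=100.0, iters=60):
    """Platform level: bisect for a lam whose exposure rate meets target_rate.

    Assumes every marginal ad revenue is below `hi`, so lam = hi is feasible.
    """
    if exposure_rate(rev_table, 0.0, max_ads, slots) <= target_rate:
        return 0.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if exposure_rate(rev_table, mid, max_ads, slots) > target_rate:
            lo = mid
        else:
            hi = mid
    return hi


# Toy data: 500 requests, 10 slots each, at most 4 ads per request.
rng = np.random.default_rng(0)
gains = -np.sort(-rng.uniform(0.0, 1.0, size=(500, 4)), axis=1)  # diminishing
rev_table = np.hstack([np.zeros((500, 1)), np.cumsum(gains, axis=1)])
lam = find_multiplier(rev_table, max_ads=4, slots=10, target_rate=0.2)
rate = exposure_rate(rev_table, lam, max_ads=4, slots=10)
```

Bisection works here because, with ties broken toward fewer ads, the exposure rate can only fall as `lam` grows.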
- …