Exploring Restart Distributions
We consider the generic approach of using an experience memory to aid
exploration by adapting a restart distribution. That is, given the capacity to
reset the environment to states corresponding to the agent's past observations,
we aid exploration by promoting faster state-space coverage via restarting the
agent from a more diverse set of initial states, as well as by allowing it to
restart in states associated with significant past experiences. This approach
is compatible with both on-policy and off-policy methods. However, a caveat is
that altering the distribution of initial states could change the optimal
policies when searching within a restricted class of policies. To reduce this
unsought learning bias, we evaluate our approach in deep reinforcement learning
which benefits from the high representational capacity of deep neural networks.
We instantiate three variants of our approach, each inspired by an idea in the
context of experience replay. Using these variants, we show that performance
gains can be achieved, especially in hard exploration problems.
Comment: RLDM 201
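The restart-distribution idea can be illustrated as a small memory of visited states mixed into the environment's own start distribution. This is a minimal sketch under assumptions of my own: the class `RestartBuffer`, its capacity, the eviction rule, and the mixing probability `p_restart` are illustrative choices, not the paper's implementation.

```python
import random

class RestartBuffer:
    """Memory of past states used as an adapted restart distribution."""

    def __init__(self, capacity=1000, p_restart=0.5):
        self.states = []
        self.capacity = capacity
        self.p_restart = p_restart  # chance of restarting from memory

    def add(self, state):
        # Random eviction keeps the memory bounded while staying diverse.
        if len(self.states) >= self.capacity:
            self.states.pop(random.randrange(len(self.states)))
        self.states.append(state)

    def sample_initial(self, default_state):
        # Mix the environment's start distribution with past states,
        # promoting faster state-space coverage.
        if self.states and random.random() < self.p_restart:
            return random.choice(self.states)
        return default_state
```

At episode boundaries the agent would call `sample_initial` and reset the environment to the returned state, assuming the environment exposes such a reset capability.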
Scalable Multiuser Immersive Communications with Multi-numerology and Mini-slot
This paper studies multiuser immersive communications networks in which
different user equipment may demand various extended reality (XR) services. In
such heterogeneous networks, time-frequency resource allocation needs to be
more adaptive since XR services are usually multi-modal and latency-sensitive.
To this end, we develop a scalable time-frequency resource allocation method
based on multi-numerology and mini-slot. To determine the discrete
multi-numerology and mini-slot parameters appropriately for multiuser immersive
communications, the proposed method first introduces a novel flexible
time-frequency resource block configuration, and then leverages deep
reinforcement learning to maximize the total quality-of-experience (QoE)
subject to each user's QoE constraints. The results confirm the efficiency and
scalability of the proposed time-frequency resource allocation method.
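For context, the discrete parameters being selected follow the standard 5G NR numerology relationships, which can be computed directly. The helper names below are my own; how the paper couples these quantities with deep reinforcement learning is not reproduced here.

```python
def slot_duration_ms(mu):
    # 5G NR numerology mu: subcarrier spacing is 15 * 2**mu kHz,
    # so a 14-symbol slot (normal cyclic prefix) lasts 1 / 2**mu ms.
    return 1.0 / (2 ** mu)

def mini_slot_duration_ms(mu, symbols=7):
    # A mini-slot occupies 2, 4, or 7 of the 14 OFDM symbols in a slot,
    # enabling lower-latency scheduling for XR traffic.
    assert symbols in (2, 4, 7)
    return slot_duration_ms(mu) * symbols / 14.0
```

Larger numerologies shorten slots (e.g. mu=2 gives 0.25 ms slots), and mini-slots shorten the scheduling unit further, which is why these two knobs suit latency-sensitive XR services.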
Robust High-speed Running for Quadruped Robots via Deep Reinforcement Learning
Deep reinforcement learning has emerged as a popular and powerful way to
develop locomotion controllers for quadruped robots. Common approaches have
largely focused on learning actions directly in joint space, or learning to
modify and offset foot positions produced by trajectory generators. Both
approaches typically require careful reward shaping and training for millions
of time steps, and trajectory generators additionally introduce human bias into
the resulting control policies. In this paper, we instead explore learning foot
positions in Cartesian space, which we track with impedance control, for a task
of running as fast as possible subject to environmental disturbances. Compared
with other action spaces, we observe a reduced need for reward shaping, much improved
sample efficiency, the emergence of natural gaits such as galloping and
bounding, and ease of sim-to-sim transfer. Policies can be learned in only a
few million time steps, even for challenging tasks of running over rough
terrain with loads of over 100% of the nominal quadruped mass. Training occurs
in PyBullet, and we perform a sim-to-sim transfer to Gazebo, where our
quadruped is able to run at over 4 m/s without a load, and 3.5 m/s with a 10 kg
load, which is over 83% of the nominal quadruped mass. Video results can be
found at https://youtu.be/roE1vxpEWfw.
Comment: arXiv admin note: text overlap with arXiv:2011.0708
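The Cartesian foot positions learned by the policy are tracked with impedance control, which maps a position error to a virtual foot force and then to joint torques through the leg Jacobian. This is a generic sketch of such a law, not the authors' controller: the scalar gains `kp` and `kd` and the Jacobian `J` are assumed given.

```python
import numpy as np

def impedance_torques(J, x, xd, x_des, xd_des, kp, kd):
    """Track a Cartesian foot target with a spring-damper (impedance) law.

    F   = kp * (x_des - x) + kd * (xd_des - xd)   desired foot force
    tau = J^T F                                   joint torques via Jacobian
    """
    F = kp * (x_des - x) + kd * (xd_des - xd)
    return J.T @ F
```

Because the tracking controller absorbs low-level dynamics, the policy only needs to output foot positions, which is one plausible reason the abstract reports better sample efficiency than joint-space actions.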
Learning natural locomotion behaviors for humanoid robots using human bias
This paper presents a new learning framework that leverages the knowledge
from imitation learning, deep reinforcement learning, and control theories to
achieve human-style locomotion that is natural, dynamic, and robust for
humanoids. We propose novel approaches to introduce human bias, namely motion
capture data and a special Multi-Expert network structure. The Multi-Expert
network structure smoothly blends behavioral features, and an augmented reward
design combines task and imitation rewards. Our reward design is composable,
tunable, and explainable, built on fundamental concepts from conventional
humanoid control. We rigorously validated and benchmarked the
learning framework which consistently produced robust locomotion behaviors in
various test scenarios. Further, we demonstrated the capability of learning
robust and versatile policies in the presence of disturbances, such as terrain
irregularities and external pushes.
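A Multi-Expert blend of behavioral features can be sketched as a gating network producing soft weights over expert policies. The softmax gating and the function names below are illustrative assumptions, not the authors' architecture.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def blend_experts(state, experts, gate):
    """Blend expert actions with gating weights (Multi-Expert sketch).

    experts: list of callables, each mapping state -> action vector
    gate:    callable mapping state -> one logit per expert
    """
    w = softmax(gate(state))                       # weights sum to 1
    actions = np.stack([f(state) for f in experts])
    return w @ actions                             # weighted sum of actions
```

Because the weights vary continuously with the state, the controller can transition smoothly between behaviors (e.g. from standing to walking) rather than switching experts discretely.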