11 research outputs found
Fuzzy Ensembles of Reinforcement Learning Policies for Robotic Systems with Varied Parameters
Reinforcement Learning (RL) is an emerging approach to control many dynamical
systems for which classical control approaches are not applicable or
insufficient. However, the resultant policies may not generalize to variations
in the parameters that the system may exhibit. This paper presents a powerful
yet simple algorithm in which collaboration is facilitated between RL agents
that are trained independently to perform the same task but with different
system parameters. The independence of the agents allows multi-core
processing to be exploited for parallel training. Two examples are provided
to demonstrate the effectiveness of the proposed technique. The main
demonstration is performed on a quadrotor slung-load tracking problem in a
real-time experimental setup. The ensemble is shown to outperform the
individual policies, reducing the RMSE tracking error. The robustness of the
ensemble is also verified against wind disturbance.
Comment: arXiv admin note: text overlap with arXiv:2311.0501
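The abstract does not spell out how the collaboration between the independently trained agents works. A minimal sketch of one plausible reading, in which each per-parameter policy's action is weighted by a fuzzy membership of the current (estimated) system parameter; the membership shape, centers, and stand-in policies below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch: blend the actions of policies that were trained
# independently at different system-parameter values, weighting each by a
# fuzzy membership of the parameter the system currently exhibits.

def triangular_membership(x, center, width):
    """Triangular fuzzy membership function centered at `center`."""
    return max(0.0, 1.0 - abs(x - center) / width)

def ensemble_action(state, policies, centers, param, width=0.5):
    """Fuzzy-weighted average of the per-parameter policies' actions."""
    weights = [triangular_membership(param, c, width) for c in centers]
    total = sum(weights)
    if total == 0.0:
        # Parameter outside every membership support: fall back to the
        # policy trained at the nearest parameter value.
        nearest = min(range(len(centers)), key=lambda i: abs(param - centers[i]))
        return policies[nearest](state)
    return sum(w * p(state) for w, p in zip(weights, policies)) / total

# Toy stand-ins for policies trained (hypothetically) at slung-load
# masses of 0.5 kg and 1.0 kg; real policies would be neural networks.
policy_light = lambda s: 1.0 * s
policy_heavy = lambda s: 2.0 * s
act = ensemble_action(0.5, [policy_light, policy_heavy],
                      centers=[0.5, 1.0], param=0.75)
```

A parameter midway between the training values (0.75 here) yields an equal blend of both policies, while a parameter at a training value recovers that agent's policy exactly.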
Artificial Development by Reinforcement Learning Can Benefit From Multiple Motivations
Research on artificial development, reinforcement learning, and intrinsic motivations such as curiosity could profit from the recently developed framework of multi-objective reinforcement learning. The combination of these ideas may lead to more realistic artificial models of life-long learning and goal-directed behavior in animals and humans.
Deep Reinforcement Learning from Hierarchical Weak Preference Feedback
Reward design is a fundamental, yet challenging aspect of practical
reinforcement learning (RL). For simple tasks, researchers typically handcraft
the reward function, e.g., using a linear combination of several reward
factors. However, such reward engineering is subject to approximation bias,
incurs large tuning cost, and often cannot provide the granularity required for
complex tasks. To avoid these difficulties, researchers have turned to
reinforcement learning from human feedback (RLHF), which learns a reward
function from human preferences between pairs of trajectory sequences. By
leveraging preference-based reward modeling, RLHF learns complex rewards that
are well aligned with human preferences, allowing RL to tackle increasingly
difficult problems. Unfortunately, the applicability of RLHF is limited due to
the high cost and difficulty of obtaining human preference data. In light of
this cost, we investigate learning reward functions for complex tasks with less
human effort; simply by ranking the importance of the reward factors. More
specifically, we propose a new RL framework -- HERON, which compares
trajectories using a hierarchical decision tree induced by the given ranking.
These comparisons are used to train a preference-based reward model, which is
then used for policy learning. We find that our framework can not only train
high-performing agents on a variety of difficult tasks, but also provide
additional benefits such as improved sample efficiency and robustness. Our code
is available at https://github.com/abukharin3/HERON.
Comment: 28 pages, 15 figures
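One plausible reading of the hierarchical decision tree induced by the factor ranking (a sketch under assumptions, not the official HERON code; the margin parameter and the tie-breaking behavior are guesses): compare two trajectories on the most important reward factor first, and fall through to the next-ranked factor only when the difference at the current level is not decisive.

```python
def heron_preference(traj_a, traj_b, ranking, margin=0.1):
    """Compare two trajectories' reward-factor vectors (dicts mapping
    factor name -> value) following the given importance ranking.
    Returns 'a', 'b', or 'tie'. Illustrative sketch only."""
    for factor in ranking:  # most important factor first
        diff = traj_a[factor] - traj_b[factor]
        if abs(diff) > margin:          # decisive at this level of the tree
            return 'a' if diff > 0 else 'b'
        # Difference within the margin: descend to the next-ranked factor.
    return 'tie'

# Example: speed outranks energy; the trajectories effectively tie on
# speed, so the lower-ranked energy factor decides the preference label.
a = {'speed': 1.00, 'energy': 0.2}
b = {'speed': 1.05, 'energy': 0.8}
label = heron_preference(a, b, ranking=['speed', 'energy'])  # 'b'
```

Preference labels produced this way could then train a reward model exactly as in standard preference-based RLHF, which is how the framework avoids pairwise human annotation.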
Multi-objective Optimization of Space-Air-Ground Integrated Network Slicing Relying on a Pair of Central and Distributed Learning Algorithms
As an attractive enabling technology for next-generation wireless
communications, network slicing supports diverse customized services in the
global space-air-ground integrated network (SAGIN) with diverse resource
constraints. In this paper, we dynamically consider three typical classes of
radio access network (RAN) slices, namely high-throughput slices, low-delay
slices and wide-coverage slices, under the same underlying physical SAGIN. The
throughput, the service delay and the coverage area of these three classes of
RAN slices are jointly optimized in a non-scalar form by considering the
distinct channel features and service advantages of the terrestrial, aerial and
satellite components of SAGINs. A joint central and distributed multi-agent
deep deterministic policy gradient (CDMADDPG) algorithm is proposed for solving
the above problem to obtain the Pareto optimal solutions. The algorithm first
determines the optimal virtual unmanned aerial vehicle (vUAV) positions and the
inter-slice sub-channel and power sharing by relying on a centralized unit.
Then it optimizes the intra-slice sub-channel and power allocation, and the
virtual base station (vBS)/vUAV/virtual low earth orbit (vLEO) satellite
deployment in support of three classes of slices by three separate distributed
units. Simulation results verify that the proposed method approaches the
Pareto-optimal exploitation of multiple RAN slices, and outperforms the
benchmarkers.
Comment: 19 pages, 14 figures, journal
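The "Pareto optimal solutions" sought by the non-scalar joint optimization of throughput, delay, and coverage can be made concrete with a standard dominance filter (a generic illustration of Pareto optimality, not the CDMADDPG algorithm itself; the objective tuples are toy values, with delay negated so that larger is better for every objective):

```python
def dominates(u, v):
    """True if objective vector u Pareto-dominates v (all maximized):
    u is at least as good in every objective and strictly better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(points):
    """Keep only the non-dominated objective vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Toy slice configurations as (throughput, -delay, coverage) tuples.
candidates = [(10, -5, 3), (8, -2, 3), (6, -1, 4), (5, -6, 2)]
front = pareto_front(candidates)  # (5, -6, 2) is dominated and dropped
```

No single point on the front is best in all three objectives at once, which is why the slicing problem admits a set of Pareto-optimal trade-offs rather than one scalar optimum.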