Space Navigator: a Tool for the Optimization of Collision Avoidance Maneuvers
The number of space objects will grow several times over in the coming years due
to the planned launches of constellations of thousands of microsatellites. This
leads to a significant increase in the threat of satellite collisions. Spacecraft
must undertake collision avoidance maneuvers to mitigate the risk. According to
publicly available information, conjunction events are currently handled manually
by operators on Earth. Manual maneuver planning requires qualified personnel and
will be impractical for constellations of thousands of satellites.
In this paper we propose a new modular autonomous collision avoidance system
called "Space Navigator". It is based on a novel maneuver optimization approach
that combines domain knowledge with Reinforcement Learning methods.
Comment: Submitted to AAS Advances in the Astronautical Sciences, presented at IAA SciTech Forum 201
Self-Organization in Traffic Lights: Evolution of Signal Control with Advances in Sensors and Communications
Traffic signals are ubiquitous devices that first appeared in 1868. Recent
advances in information and communications technology (ICT) have led to
unprecedented improvements in such areas as mobile handheld devices (i.e.,
smartphones), the electric power industry (i.e., smart grids), transportation
infrastructure, and vehicle area networks. Given the trend towards
interconnectivity, it is only a matter of time before vehicles communicate with
one another and with infrastructure. In fact, several pilots of such
vehicle-to-vehicle and vehicle-to-infrastructure (e.g. traffic lights and
parking spaces) communication systems are already operational. This survey of
autonomous and self-organized traffic signaling control has been undertaken
with these potential developments in mind. Our research results indicate that,
while many sophisticated techniques have attempted to improve the scheduling of
traffic signal control, either real-time sensing of traffic patterns or a
priori knowledge of traffic flow is required to optimize traffic. Once this is
achieved, communication between traffic signals will serve to vastly improve
overall traffic efficiency.
Multi-Task Learning as Multi-Objective Optimization
In multi-task learning, multiple tasks are solved jointly, sharing inductive
bias between them. Multi-task learning is inherently a multi-objective problem
because different tasks may conflict, necessitating a trade-off. A common
compromise is to optimize a proxy objective that minimizes a weighted linear
combination of per-task losses. However, this workaround is only valid when the
tasks do not compete, which is rarely the case. In this paper, we explicitly
cast multi-task learning as multi-objective optimization, with the overall
objective of finding a Pareto optimal solution. To this end, we use algorithms
developed in the gradient-based multi-objective optimization literature. These
algorithms are not directly applicable to large-scale learning problems since
they scale poorly with the dimensionality of the gradients and the number of
tasks. We therefore propose an upper bound for the multi-objective loss and
show that it can be optimized efficiently. We further prove that optimizing
this upper bound yields a Pareto optimal solution under realistic assumptions.
We apply our method to a variety of multi-task deep learning problems including
digit classification, scene understanding (joint semantic segmentation,
instance segmentation, and depth estimation), and multi-label classification.
Our method produces higher-performing models than recent multi-task learning
formulations or per-task training.
Comment: In Neural Information Processing Systems (NeurIPS) 201
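The closed-form two-task case conveys what these gradient-based multi-objective methods compute at each step. Below is a minimal NumPy sketch (function name and toy gradients are ours, not the paper's code) of the min-norm convex combination of two task gradients; a near-zero combined norm certifies a Pareto-stationary point:

```python
import numpy as np

def min_norm_two_tasks(g1: np.ndarray, g2: np.ndarray) -> float:
    """Closed-form minimiser of ||a*g1 + (1-a)*g2||^2 over a in [0, 1].

    The resulting convex combination is the smallest-norm shared
    descent direction for both tasks."""
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:                      # identical gradients: any weight works
        return 0.5
    alpha = (g2 - g1) @ g2 / denom
    return float(np.clip(alpha, 0.0, 1.0))

# Toy usage with two conflicting task gradients.
g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
a = min_norm_two_tasks(g1, g2)            # 0.5 here, by symmetry
shared_update = a * g1 + (1 - a) * g2     # descent direction for both tasks
```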
An Incremental Off-policy Search in a Model-free Markov Decision Process Using a Single Sample Path
In this paper, we consider a modified version of the control problem in a
model-free Markov decision process (MDP) setting with large state and action
spaces. The control problem most commonly addressed in the contemporary
literature is to find an optimal policy which maximizes the value function,
i.e., the long run discounted reward of the MDP. The current settings also
assume access to a generative model of the MDP with the hidden premise that
observations of the system behaviour in the form of sample trajectories can be
obtained with ease from the model. In this paper, we consider a modified
version, where the cost function is the expectation of a non-convex function of
the value function without access to the generative model. Rather, we assume
that a sample trajectory generated using a priori chosen behaviour policy is
made available. In this restricted setting, we solve the modified control
problem in its true sense, i.e., to find the best possible policy given this
limited information. We propose a stochastic approximation algorithm based on
the well-known cross entropy method which is data (sample trajectory)
efficient, stable, robust as well as computationally and storage efficient. We
provide a proof of convergence of our algorithm to a policy which is globally
optimal relative to the behaviour policy. We also present experimental results
to corroborate our claims and we demonstrate the superiority of the solution
produced by our algorithm, compared to state-of-the-art algorithms, under an
appropriately chosen behaviour policy.
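The abstract summarises rather than states the algorithm; as a rough illustration of the cross-entropy machinery it builds on, here is a generic cross-entropy search over Gaussian-distributed policy parameters (all names and defaults are ours; the paper's stochastic-approximation variant differs, notably in estimating scores off-policy from the single behaviour-policy trajectory):

```python
import numpy as np

def cross_entropy_search(score, dim, iters=50, pop=100, elite_frac=0.1, seed=0):
    """Generic cross-entropy method: sample parameters from a Gaussian,
    keep the elite fraction, refit the Gaussian, repeat. `score(theta)`
    must return an estimated objective for parameter vector theta."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(pop * elite_frac))
    for _ in range(iters):
        thetas = rng.normal(mu, sigma, size=(pop, dim))
        scores = np.array([score(t) for t in thetas])
        elite = thetas[np.argsort(scores)[-n_elite:]]     # top performers
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu

# Toy usage: maximise a concave score with optimum at theta = (2, -1).
best = cross_entropy_search(lambda t: -np.sum((t - np.array([2.0, -1.0]))**2), dim=2)
```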
Active Object Perceiver: Recognition-guided Policy Learning for Object Searching on Mobile Robots
We study the problem of learning a navigation policy for a robot to actively
search for an object of interest in an indoor environment solely from its
visual inputs. While scene-driven visual navigation has been widely studied,
prior efforts on learning navigation policies for robots to find objects are
limited. The problem is often more challenging than target-scene finding, as the
target objects can be very small in the view and can appear in arbitrary poses.
We approach the problem from an active perceiver perspective, and propose a
novel framework that integrates a deep neural network based object recognition
module and a deep reinforcement learning based action prediction mechanism. To
validate our method, we conduct experiments on both a simulation dataset
(AI2-THOR) and a real-world environment with a physical robot. We further
propose a new decaying reward function to learn the control policy specific to
the object searching task. Experimental results validate the efficacy of our
method, which outperforms competing methods in both average trajectory length
and success rate.
Comment: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018)
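The abstract does not give the exact decaying reward, so the following is only a minimal sketch of the general pattern, assuming an exponentially decaying success bonus plus a small per-step penalty (our construction, purely illustrative):

```python
def decaying_search_reward(step, found, success_reward=10.0, decay=0.99,
                           step_penalty=0.01):
    """Illustrative decaying reward for object search (our construction,
    not the paper's published function): finding the target early earns
    more than finding it late, and every step costs a little."""
    if found:
        return success_reward * decay ** step
    return -step_penalty
```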
Analysis of Agent Expertise in Ms. Pac-Man using Value-of-Information-based Policies
Conventional reinforcement learning methods for Markov decision processes
rely on weakly-guided, stochastic searches to drive the learning process. It
can therefore be difficult to predict what agent behaviors might emerge. In
this paper, we consider an information-theoretic cost function for performing
constrained stochastic searches that promote the formation of risk-averse to
risk-favoring behaviors. This cost function is the value of information, which
provides the optimal trade-off between the expected return of a policy and the
policy's complexity; policy complexity is measured in bits and
controlled by a single hyperparameter in the cost function. As the policy
complexity is reduced, the agents will increasingly eschew risky actions. This
reduces the potential for high accrued rewards. As the policy complexity
increases, the agents will take actions, regardless of the risk, that can raise
the long-term rewards. The obtainable reward depends on a single, tunable
hyperparameter that regulates the degree of policy complexity.
We evaluate the performance of value-of-information-based policies on a
stochastic version of Ms. Pac-Man. A major component of this paper is the
demonstration that different ranges of policy-complexity values yield different
game-play styles, along with an explanation of why this occurs. We also show that our
reinforcement-learning search mechanism is more efficient than the others we
utilize. This result implies that the value of information theory is
appropriate for framing the exploitation-exploration trade-off in reinforcement
learning.
Comment: IEEE Transactions on Computational Intelligence and Artificial Intelligence in Games
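As a rough sketch of the policy structure this trade-off induces (our notation, not the authors' code), the iteration below alternates between a per-state Boltzmann policy of the form pi(a|s) ∝ p(a)·exp(β·Q(s,a)) and a refit of the action marginal p(a), with the single hyperparameter β regulating policy complexity:

```python
import numpy as np

def voi_policy(Q, rho, beta, iters=50):
    """Blahut-Arimoto-style iteration for a value-of-information policy.

    Q: (n_states, n_actions) action values; rho: (n_states,) state
    distribution. Small beta -> low-complexity, risk-averse play;
    large beta -> near-greedy, risk-tolerant play."""
    n_actions = Q.shape[1]
    p = np.full(n_actions, 1.0 / n_actions)      # prior over actions
    for _ in range(iters):
        logits = np.log(p)[None, :] + beta * Q
        pi = np.exp(logits - logits.max(axis=1, keepdims=True))
        pi /= pi.sum(axis=1, keepdims=True)      # per-state Boltzmann policy
        p = rho @ pi                             # refit the action marginal
    return pi
```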
Soft Decomposed Policy-Critic: Bridging the Gap for Effective Continuous Control with Discrete RL
Discrete reinforcement learning (RL) algorithms have demonstrated exceptional
performance in solving sequential decision tasks with discrete action spaces,
such as Atari games. However, their effectiveness is hindered when applied to
continuous control problems due to the challenge of dimensional explosion. In
this paper, we present the Soft Decomposed Policy-Critic (SDPC) architecture,
which combines soft RL and actor-critic techniques with discrete RL methods to
overcome this limitation. SDPC discretizes each action dimension independently
and employs a shared critic network to maximize the soft Q-function. This
novel approach enables SDPC to support two types of policies: decomposed actors
that lead to the Soft Decomposed Actor-Critic (SDAC) algorithm, and decomposed
Q-networks that generate Boltzmann soft exploration policies, resulting in
the Soft Decomposed-Critic Q (SDCQ) algorithm. Through extensive experiments,
we demonstrate that our proposed approach outperforms state-of-the-art
continuous RL algorithms in a variety of continuous control tasks, including
Mujoco's Humanoid and Box2d's BipedalWalker. These empirical results validate
the effectiveness of the SDPC architecture in addressing the challenges
associated with continuous control.
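A minimal sketch of the per-dimension decomposition idea, assuming every action dimension is discretised into the same set of bins and sampled from its own Boltzmann distribution over per-dimension soft-Q values (names and shapes are ours, not the published architecture):

```python
import numpy as np

def decomposed_boltzmann_action(q_per_dim, bins, alpha=0.2, rng=None):
    """Sample one continuous action by treating each dimension as an
    independent discrete sub-problem. q_per_dim[d] holds soft-Q values
    for the len(bins) candidate values of dimension d; alpha is the
    Boltzmann temperature."""
    rng = rng or np.random.default_rng()
    action = np.empty(len(q_per_dim))
    for d, q in enumerate(q_per_dim):
        probs = np.exp(q / alpha - (q / alpha).max())
        probs /= probs.sum()
        action[d] = bins[rng.choice(len(bins), p=probs)]
    return action

# Toy usage: 3 action dimensions, each discretised into 5 bins in [-1, 1].
bins = np.linspace(-1.0, 1.0, 5)
q = np.random.randn(3, 5)                    # stand-in for a critic's output
a = decomposed_boltzmann_action(q, bins)     # one continuous 3-D action
```

Discretising per dimension keeps the action count linear in the number of dimensions (here 3 × 5 values) rather than exponential (5^3 joint actions), which is the dimensional explosion the architecture targets.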
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent
years. This has led to a dramatic increase in the number of applications and
methods. Recent works have explored learning beyond single-agent scenarios and
have considered multiagent learning (MAL) scenarios. Initial results report
successes in complex multiagent domains, although there are several challenges
to be addressed. The primary goal of this article is to provide a clear
overview of current multiagent deep reinforcement learning (MDRL) literature.
Additionally, we complement the overview with a broader analysis: (i) we
revisit previous key components, originally presented in MAL and RL, and
highlight how they have been adapted to multiagent deep reinforcement learning
settings. (ii) We provide general guidelines to new practitioners in the area:
describing lessons learned from MDRL works, pointing to recent benchmarks, and
outlining open avenues of research. (iii) We take a more critical tone raising
practical challenges of MDRL (e.g., implementation and computational demands).
We expect this article will help unify and motivate future research to take
advantage of the abundant literature that exists (e.g., RL and MAL) in a joint
effort to promote fruitful research in the multiagent community.
Comment: Under review since Oct 2018. Earlier versions of this work had the title: "Is multiagent deep reinforcement learning the answer or the question? A brief survey"
Occupancy Map Building through Bayesian Exploration
We propose a novel holistic approach for safe autonomous exploration and map
building based on constrained Bayesian optimisation. This method finds optimal
continuous paths instead of discrete sensing locations that inherently satisfy
motion and safety constraints. Evaluating both the objective and constraint
functions requires forward simulation of expected observations. As such
evaluations are costly, the Bayesian optimiser proposes only paths which are
likely to yield optimal results and satisfy the constraints with high
confidence. By balancing the reward and risk associated with each path, the
optimiser minimises the number of expensive function evaluations. We
demonstrate the effectiveness of our approach in a series of experiments both
in simulation and with a real ground robot and provide comparisons to other
exploration techniques. Each method has its own favourable conditions under
which it outperforms all other techniques. Yet, by reasoning about the
usefulness of the entire path instead of only its end point, our method
delivers robust and consistent performance across all tests and performs
better than, or as well as, the other leading methods.
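A generic single-step sketch of the constrained Bayesian-optimisation pattern described above, under our simplifications (a finite candidate set, independent Gaussian-process surrogates for reward and safety, and expected improvement weighted by feasibility probability; not the paper's exact formulation):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_path(X, y, c, candidates, safety_threshold=0.0):
    """One constrained-BO step. X: evaluated path parameters; y: their
    map-information rewards; c: their safety margins (>= threshold means
    feasible). Returns the candidate maximising expected improvement
    weighted by the probability of satisfying the safety constraint."""
    gp_y = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    gp_c = GaussianProcessRegressor(normalize_y=True).fit(X, c)

    mu, sd = gp_y.predict(candidates, return_std=True)
    mu_c, sd_c = gp_c.predict(candidates, return_std=True)

    z = (mu - y.max()) / np.maximum(sd, 1e-9)
    ei = sd * (z * norm.cdf(z) + norm.pdf(z))         # expected improvement
    p_safe = norm.cdf((mu_c - safety_threshold) / np.maximum(sd_c, 1e-9))
    return candidates[np.argmax(ei * p_safe)]         # best risk-weighted path
```

In the paper's setting, the candidates would be parameterised continuous paths, and both the rewards and safety margins would come from costly forward simulation of expected observations, which is why the optimiser proposes few, carefully chosen evaluations.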
Reinforcement in Cooperative Games
National Technical University of Athens (Εθνικό Μετσόβιο Πολυτεχνείο) -- Master's Thesis. Interdisciplinary-Interdepartmental Postgraduate Study Programme (D.P.M.S.) "Data Science and Machine Learning"