Experience Sharing Between Cooperative Reinforcement Learning Agents
The idea of experience sharing between cooperative agents naturally emerges
from our understanding of how humans learn. Our evolution as a species is
tightly linked to the ability to exchange learned knowledge with one another.
It follows that experience sharing (ES) between autonomous and independent
agents could become the key to accelerate learning in cooperative multiagent
settings. We investigate if randomly selecting experiences to share can
increase the performance of deep reinforcement learning agents, and propose
three new methods for selecting experiences to accelerate the learning process.
Firstly, we introduce Focused ES, which prioritizes unexplored regions of the
state space. Secondly, we present Prioritized ES, in which temporal-difference
error is used as a measure of priority. Finally, we devise Focused Prioritized
ES, which combines both previous approaches. The methods are empirically
validated in a control problem. While sharing randomly selected experiences
between two Deep Q-Network agents shows no improvement over a single agent
baseline, we show that the proposed ES methods can successfully outperform the
baseline. In particular, the Focused ES accelerates learning by a factor of 2,
reducing by 51% the number of episodes required to complete the task.
Comment: Published in the Proceedings of the 31st IEEE International
Conference on Tools with Artificial Intelligence
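As a rough illustration of the Prioritized ES idea (temporal-difference error as a measure of priority), the sketch below ranks a batch of transitions by absolute one-step TD error and picks the top k to share. The function names, the top-k rule, and the use of a separate target network are assumptions made for this example; the abstract does not specify how the priorities are turned into a selection.

```python
import numpy as np

def td_errors(q_values, q_next, rewards, actions, dones, gamma=0.99):
    """Absolute one-step TD error for a batch of N transitions.

    q_values: (N, A) array, Q(s, .) under the sender's current estimate
    q_next:   (N, A) array, Q(s', .) under the sender's target estimate
    dones:    (N,) array of 0.0/1.0 flags marking terminal transitions
    """
    target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    predicted = q_values[np.arange(len(actions)), actions]
    return np.abs(target - predicted)

def select_to_share(errors, k):
    """Indices of the k transitions with the largest TD error.

    Top-k is a hypothetical selection rule chosen for illustration.
    """
    k = min(k, len(errors))
    return np.argsort(errors)[::-1][:k]

# Toy batch of three transitions with two actions each.
q_values = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]])
q_next = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
rewards = np.array([1.0, 0.0, 0.0])
actions = np.array([0, 1, 1])
dones = np.array([1.0, 0.0, 1.0])

errors = td_errors(q_values, q_next, rewards, actions, dones, gamma=0.9)
shared = select_to_share(errors, k=2)
```

A probabilistic rule (sampling in proportion to priority) would be an equally plausible reading; top-k is used here only because it is the shortest to state.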
Online contrastive divergence with generative replay: experience replay without storing data
Conceived in the early 1990s, Experience Replay (ER) has been shown to be a
successful mechanism to allow online learning algorithms to reuse past
experiences. Traditionally, ER can be applied to all machine learning paradigms
(i.e., unsupervised, supervised, and reinforcement learning). Recently, ER has
contributed to improving the performance of deep reinforcement learning. Yet,
its application to many practical settings is still limited by the memory
requirements of ER, necessary to explicitly store previous observations. To
remedy this issue, we explore a novel approach, Online Contrastive Divergence
with Generative Replay (OCD_GR), which uses the generative capability of
Restricted Boltzmann Machines (RBMs) instead of recorded past experiences. The
RBM is trained online, and does not require the system to store any of the
observed data points. We compare OCD_GR to ER on 9 real-world datasets,
considering a worst-case scenario (data points arriving in sorted order) as
well as a more realistic one (sequential random-order data points). Our results
show that in 64.28% of the cases OCD_GR outperforms ER and in the remaining
35.72% it has an almost equal performance, while having a considerably reduced
space complexity (i.e., memory usage) at a comparable time complexity.
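The general idea of replacing stored observations with samples drawn from the model itself can be sketched as follows. This is a minimal binary RBM trained with one-step contrastive divergence (CD-1), written under several assumptions: the class `TinyRBM`, the helper `online_step`, the number of Gibbs steps, and the way replayed and new samples are mixed are all illustrative choices, not the OCD_GR procedure as published.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyRBM:
    """Binary RBM with CD-1 updates (illustrative, not the paper's code)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    def sample_h(self, v):
        p = sigmoid(v @ self.W + self.b_h)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def sample_v(self, h):
        p = sigmoid(h @ self.W.T + self.b_v)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def generate(self, n, gibbs_steps=5):
        """Draw n visible samples by Gibbs sampling from random hidden states."""
        h = (self.rng.random((n, self.b_h.size)) < 0.5).astype(float)
        for _ in range(gibbs_steps):
            _, v = self.sample_v(h)
            _, h = self.sample_h(v)
        return v

    def cd1_update(self, v0, lr=0.1):
        """One CD-1 step: positive phase on v0, negative phase on a reconstruction."""
        ph0, h0 = self.sample_h(v0)
        pv1, _ = self.sample_v(h0)
        ph1, _ = self.sample_h(pv1)
        n = len(v0)
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        self.b_v += lr * (v0 - pv1).mean(axis=0)
        self.b_h += lr * (ph0 - ph1).mean(axis=0)

def online_step(rbm, new_batch, n_replay=8):
    """Train on new data mixed with generated samples; nothing is stored."""
    replayed = rbm.generate(n_replay)
    rbm.cd1_update(np.vstack([new_batch, replayed]))

rbm = TinyRBM(n_visible=6, n_hidden=4)
batch = np.array([[1, 0, 1, 0, 1, 0],
                  [1, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 1]], dtype=float)
online_step(rbm, batch)
```

The memory cost here is the parameter count of the RBM, independent of how many data points have been seen, which is the trade-off the abstract describes.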
Q-learning with Experience Replay in a Dynamic Environment
Most research in reinforcement learning has focused on stationary environments. In this paper, we propose several adaptations of Q-learning for a dynamic environment, for both single and multiple agents. The environment consists of a grid of random rewards, where every reward is removed after a visit. We focus on experience replay, a technique that currently receives a lot of attention, and combine this method with Q-learning. We compare two variations of experience replay, in which experiences are reused based on time or based on the obtained reward. For multi-agent reinforcement learning we compare two variations of policy representation. In the first variation the agents share a Q-function, while in the second variation each agent has a separate Q-function. Furthermore, in both variations we test the effect of reward sharing between the agents. This leads to four different multi-agent reinforcement learning algorithms, of which sharing a Q-function and sharing the rewards is the most cooperative method. The results show that in the single-agent environment both experience replay algorithms significantly outperform standard Q-learning and a greedy benchmark agent. In the multi-agent environment the highest maximum reward sum in a trial is achieved by using one Q-function and reward sharing. The highest mean reward sum is obtained with separate Q-functions and separate rewards.
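A minimal sketch of tabular Q-learning combined with experience replay is given below. The two replay modes are one possible reading of "based on time" (replay the most recent transitions) and "based on the obtained reward" (replay the highest-reward transitions); the class name `ReplayQLearner`, its parameters, and both selection rules are assumptions for illustration, not the paper's algorithms.

```python
from collections import deque

import numpy as np

class ReplayQLearner:
    """Tabular Q-learning with a bounded replay buffer (illustrative)."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.9, capacity=1000):
        self.q = np.zeros((n_states, n_actions))
        self.alpha = alpha
        self.gamma = gamma
        self.buffer = deque(maxlen=capacity)

    def update(self, s, a, r, s_next, done):
        # Standard one-step Q-learning update.
        target = r if done else r + self.gamma * self.q[s_next].max()
        self.q[s, a] += self.alpha * (target - self.q[s, a])

    def observe(self, s, a, r, s_next, done):
        # Store the transition, then learn from it once immediately.
        self.buffer.append((s, a, r, s_next, done))
        self.update(s, a, r, s_next, done)

    def replay(self, k, mode="time"):
        # "time": the k most recent transitions (assumed interpretation).
        # "reward": the k transitions with the highest reward (assumed).
        if mode == "time":
            batch = list(self.buffer)[-k:]
        elif mode == "reward":
            batch = sorted(self.buffer, key=lambda t: t[2], reverse=True)[:k]
        else:
            raise ValueError(f"unknown replay mode: {mode}")
        for transition in batch:
            self.update(*transition)

agent = ReplayQLearner(n_states=2, n_actions=1, alpha=0.5, gamma=0.9)
agent.observe(0, 0, 1.0, 1, True)   # Q[0, 0] moves halfway towards 1.0
agent.replay(1, mode="time")        # replaying moves it halfway again
```

In an environment where rewards disappear after a visit, replayed transitions can describe rewards that no longer exist, so how long a transition stays in the buffer is a design decision in its own right.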
Fitted Q-iteration in continuous action-space MDPs
We consider continuous state, continuous action batch reinforcement learning where the goal is to learn a good policy from a sufficiently rich trajectory generated by some policy. We study a variant of fitted Q-iteration, where the greedy action selection is replaced by searching for a policy in a restricted set of candidate policies by maximizing the average action values. We provide a rigorous analysis of this algorithm, proving what we believe is the first finite-time bound for value-function based algorithms for continuous state and action problems.
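One way to write down the iteration described above is sketched here. The notation is assumed for illustration and does not come from the abstract: a batch of transitions $(x_i, a_i, r_i, x'_i)$, $i = 1, \dots, N$, a candidate policy set $\Pi$, a function class $\mathcal{F}$ for action values, and a discount factor $\gamma$. Which states the average is taken over is also an assumption.

```latex
% Policy search replaces the greedy max over a continuous action set:
\hat{\pi}_k \in \arg\max_{\pi \in \Pi} \; \frac{1}{N} \sum_{i=1}^{N} Q_k\bigl(x_i, \pi(x_i)\bigr)

% Regression step of fitted Q-iteration, using \hat{\pi}_k in the target:
Q_{k+1} \in \arg\min_{f \in \mathcal{F}} \; \frac{1}{N} \sum_{i=1}^{N}
  \Bigl( f(x_i, a_i) - \bigl[ r_i + \gamma \, Q_k\bigl(x'_i, \hat{\pi}_k(x'_i)\bigr) \bigr] \Bigr)^{2}
```

In standard fitted Q-iteration the bracketed target would contain $\max_{a} Q_k(x'_i, a)$; with a continuous action space that maximization is itself a hard problem, which is the motivation for searching over $\Pi$ instead.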
Swarm robotics: Cooperative navigation in unknown environments
Swarm robotics is garnering attention in the robotics field due to its substantial benefits. It has been shown to outperform most other robotic approaches in many applications, such as military, space exploration, and disaster search and rescue missions. It is inspired by the behavior of swarms of social insects such as ants and bees. A swarm consists of a number of robots with limited capabilities and restricted local sensing. When deployed, individual robots behave according to local sensing until a global behavior emerges in which they, as a swarm, can accomplish missions that individuals cannot. In this research, we propose a novel exploration and navigation method based on a combination of a Probabilistic Finite State Machine (PFSM), Robotic Darwinian Particle Swarm Optimization (RDPSO), and Depth First Search (DFS). We use the V-REP simulator to test our approach. We are also implementing our own cost-effective swarm robot platform, AntBOT, as a proof of concept for future experimentation. We show that our proposed method yields an excellent navigation solution in optimal time when compared to methods using either PFSM only or RDPSO only. In fact, our method achieves a 40% higher success rate along with an exploration speed 1.4 times that of the other methods. After exploration, robots can navigate the environment forming a Mobile Ad-hoc Network (MANET) and using the graph of robots as network nodes.
Advancing the Applicability of Reinforcement Learning to Autonomous Control
With data-efficient reinforcement learning (RL) methods, impressive
results have been achieved, e.g., in the context of gas turbine control.
However, in practice the application of RL still requires much human
intervention, which hinders the application of RL to autonomous control.
This thesis addresses some of the remaining problems, particularly regarding
the reliability of the policy generation process.
The thesis first discusses RL problems with discrete state and action
spaces. In that context, often an MDP is estimated from observations. It is
described how to incorporate the estimators' uncertainties into the policy
generation process. This information can then be used to reduce the risk of
obtaining a poor policy due to flawed MDP estimates. Moreover, it is
discussed how to use the knowledge of uncertainty for efficient exploration
and the assessment of policy quality without requiring the policy's
execution.
The thesis then moves on to continuous state problems and focuses on
methods based on fitted Q-iteration (FQI), particularly neural fitted
Q-iteration (NFQ). Although NFQ has proven to be very data-efficient, it is
not as reliable as required for autonomous control. The thesis proposes to
use ensembles to increase reliability. Several ways of ensemble usage in an
NFQ context are discussed and evaluated on a number of benchmark domains.
The results show that, in all considered domains, ensembles produce good
policies more reliably.
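Two simple ways an ensemble of learned Q-functions could be turned into a single action choice are sketched below: majority voting over each member's greedy action, and taking the greedy action of the averaged Q-values. Both are generic ensemble schemes chosen for illustration; the abstract does not say which variants the thesis evaluates, and the function names are hypothetical.

```python
import numpy as np

def majority_vote_action(q_ensemble, state):
    """Action chosen by most ensemble members.

    q_ensemble: list of callables mapping a state to an array of Q-values,
                one entry per discrete action.
    """
    greedy = [int(np.argmax(q(state))) for q in q_ensemble]
    n_actions = len(q_ensemble[0](state))
    votes = np.bincount(greedy, minlength=n_actions)
    return int(np.argmax(votes))  # ties resolve to the lowest action index

def averaged_q_action(q_ensemble, state):
    """Greedy action with respect to the mean of the members' Q-values."""
    mean_q = np.mean([q(state) for q in q_ensemble], axis=0)
    return int(np.argmax(mean_q))

# Three toy "Q-functions" that ignore the state; one is an outlier.
members = [
    lambda s: np.array([0.6, 0.5]),
    lambda s: np.array([0.6, 0.5]),
    lambda s: np.array([0.0, 5.0]),
]
vote = majority_vote_action(members, state=None)
mean = averaged_q_action(members, state=None)
```

The toy ensemble shows how the two rules can disagree: voting limits the influence of a single member with extreme value estimates, while averaging lets it dominate.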
Next, policy assessment in continuous domains is discussed. The
thesis proposes to use fitted policy evaluation (FPE), an adaptation of FQI
to policy evaluation, combined with a different function approximator and/or
a different dataset to obtain a measure for policy quality. Results of
experiments show that extra-tree FPE, applied to policies generated by NFQ,
produces value functions that can well be used to reason about the true
policy quality.
Finally, the thesis combines ensembles and policy assessment to derive
methods that can deal with changing environments. The major contribution is
the evolving ensemble. The policy of the evolving ensemble changes slowly as
new policies are added and old policies removed. It turns out that the
evolving ensemble approaches work considerably better than simpler
approaches like single policies learned with recent observations or simple
ensembles.