8 research outputs found
Efficient Deep Reinforcement Learning via Planning, Generalization, and Improved Exploration
Reinforcement learning (RL) is a general-purpose machine learning framework, which considers an agent that makes sequential decisions in an environment to maximize its reward. Deep reinforcement learning (DRL) approaches use deep neural networks as non-linear function approximators that parameterize policies or value functions directly from raw observations in RL.
Although DRL approaches have been shown to be successful on many challenging RL benchmarks, much of the prior work has mainly focused on learning a single task in a model-free setting, which is often sample-inefficient. On the other hand, humans have the ability to acquire knowledge by learning a model of the world in an unsupervised fashion, use such knowledge to plan ahead for decision making, transfer knowledge between many tasks, and generalize to previously unseen circumstances from the pre-learned knowledge. Developing such abilities is one of the fundamental challenges in building RL agents that can learn as efficiently as humans.
As a step towards developing the aforementioned capabilities in RL, this thesis develops new DRL techniques to address three important challenges in RL: 1) planning via prediction, 2) rapidly generalizing to new environments and tasks, and 3) efficient exploration in complex environments.
The first part of the thesis discusses how to learn a dynamics model of the environment using deep neural networks and how to use such a model for planning in complex domains where observations are high-dimensional. Specifically, we present neural network architectures for action-conditional video prediction and demonstrate improved exploration in RL. In addition, we present a neural network architecture that performs lookahead planning by predicting the future only in terms of rewards and values without predicting observations. We then discuss why this approach is beneficial compared to conventional model-based planning approaches.
The second part of the thesis considers generalization to unseen environments and tasks. We first introduce a set of cognitive tasks in a 3D environment and present memory-based DRL architectures that generalize better to previously unseen 3D environments compared to existing baselines. In addition, we introduce a new multi-task RL problem where the agent should learn to execute different tasks depending on given instructions and generalize to new instructions in a zero-shot fashion. We present a new hierarchical DRL architecture that learns to generalize over previously unseen task descriptions with minimal prior knowledge.
The third part of the thesis discusses how exploiting past experiences can indirectly drive deep exploration and improve sample-efficiency. In particular, we propose a new off-policy learning algorithm, called self-imitation learning, which learns a policy to reproduce past good experiences. We empirically show that self-imitation learning indirectly encourages the agent to explore reasonably good state spaces and thus significantly improves sample-efficiency on RL domains where exploration is challenging.
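The idea of reproducing only past good experiences can be sketched as a per-sample loss. The abstract does not state the objective, so the form below is an assumption: the stored action is imitated only when its observed return exceeds the current value estimate. The names `self_imitation_loss`, `log_prob`, `value`, `ret`, and `value_coef` are hypothetical and chosen for illustration.

```python
import math


def self_imitation_loss(log_prob, value, ret, value_coef=0.5):
    """Loss for one stored (state, action, return) sample.

    log_prob   -- log pi(a|s) of the stored action under the current policy
    value      -- current value estimate V(s)
    ret        -- return actually observed after taking the action
    value_coef -- weight of the value term (hypothetical default)
    """
    # Clipped advantage: samples whose return did not beat the
    # value estimate contribute nothing to the loss.
    advantage = max(ret - value, 0.0)
    policy_loss = -log_prob * advantage   # raise the probability of good actions
    value_loss = 0.5 * advantage ** 2     # pull V(s) up toward the better return
    return policy_loss + value_coef * value_loss


# A sample that beat the value estimate yields a positive loss,
# one that fell short is ignored.
good = self_imitation_loss(math.log(0.5), value=1.0, ret=3.0)
bad = self_imitation_loss(math.log(0.5), value=3.0, ret=1.0)
```

Under this assumed form, the clipping is what makes the update an imitation of good experiences only: poor trajectories in the replay buffer leave the policy unchanged.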
Overall, the main contributions of this thesis are to explore several fundamental challenges in RL in the context of DRL and to develop new DRL architectures and algorithms that address such challenges. This allows us to understand how deep learning can be used to improve sample efficiency, and thus come closer to human-like learning abilities.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/145829/1/junhyuk_1.pd
Advancing the Applicability of Reinforcement Learning to Autonomous Control
With data-efficient reinforcement learning (RL) methods, impressive
results could be achieved, e.g., in the context of gas turbine control.
However, in practice the application of RL still requires much human
intervention, which hinders the application of RL to autonomous control.
This thesis addresses some of the remaining problems, particularly regarding
the reliability of the policy generation process.
The thesis first discusses RL problems with discrete state and action
spaces. In that context, often an MDP is estimated from observations. It is
described how to incorporate the estimators' uncertainties into the policy
generation process. This information can then be used to reduce the risk of
obtaining a poor policy due to flawed MDP estimates. Moreover, it is
discussed how to use the knowledge of uncertainty for efficient exploration
and the assessment of policy quality without requiring the policy's
execution.
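As a rough illustration of estimating an MDP from observations together with a measure of estimator uncertainty, the sketch below computes transition frequencies and a binomial standard error for each estimate. The abstract does not say which uncertainty measure the thesis uses, so the standard error is an assumption, and the name `estimate_transitions` is hypothetical.

```python
from collections import Counter, defaultdict
import math


def estimate_transitions(observations):
    """Estimate P(next_state | state, action) from observed triples.

    observations -- iterable of (state, action, next_state) tuples
    Returns {(state, action): {next_state: (probability, std_error)}}.
    """
    counts = defaultdict(Counter)
    for state, action, next_state in observations:
        counts[(state, action)][next_state] += 1

    model = {}
    for state_action, next_counts in counts.items():
        n = sum(next_counts.values())
        model[state_action] = {}
        for next_state, c in next_counts.items():
            p = c / n
            # Binomial standard error: shrinks as the state-action
            # pair is visited more often.
            std_error = math.sqrt(p * (1.0 - p) / n)
            model[state_action][next_state] = (p, std_error)
    return model


observations = [("s0", "a", "s1")] * 3 + [("s0", "a", "s0")]
model = estimate_transitions(observations)
p_s1, err_s1 = model[("s0", "a")]["s1"]
```

A policy generation step could then discount value estimates that rest on rarely visited state-action pairs, which is one way to lower the risk of a poor policy caused by a flawed MDP estimate.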
The thesis then moves on to continuous state problems and focuses on
methods based on fitted Q-iteration (FQI), particularly neural fitted
Q-iteration (NFQ). Although NFQ has proven to be very data-efficient, it is
not as reliable as required for autonomous control. The thesis proposes to
use ensembles to increase reliability. Several ways of ensemble usage in an
NFQ context are discussed and evaluated on a number of benchmark domains.
The results show that, in all considered domains, ensembles produce good
policies more reliably.
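Two simple ways an ensemble of Q-functions might be turned into a single policy are sketched below: averaging the Q-values before choosing an action, and letting each member vote for its own greedy action. The abstract only says that several ways of ensemble usage are evaluated, so these two variants and the names `average_q_action` and `majority_vote_action` are illustrative assumptions. Each Q-function is modeled as a plain callable `q(state, action)`.

```python
from collections import Counter


def average_q_action(q_functions, state, actions):
    """Pick the action with the highest mean Q-value across the ensemble."""
    def mean_q(action):
        return sum(q(state, action) for q in q_functions) / len(q_functions)
    return max(actions, key=mean_q)


def majority_vote_action(q_functions, state, actions):
    """Pick the action that most ensemble members consider greedy."""
    votes = Counter(
        max(actions, key=lambda a: q(state, a)) for q in q_functions
    )
    return votes.most_common(1)[0][0]


# Three toy Q-functions over two actions; the third is an outlier.
q1 = lambda s, a: {"left": 1.0, "right": 0.0}[a]
q2 = lambda s, a: {"left": 0.6, "right": 0.5}[a]
q3 = lambda s, a: {"left": 0.0, "right": 5.0}[a]
ensemble = [q1, q2, q3]
actions = ["left", "right"]

by_average = average_q_action(ensemble, "s0", actions)
by_vote = majority_vote_action(ensemble, "s0", actions)
```

In this toy case the two rules disagree: the outlier dominates the average, while the vote follows the two members that agree. This is why the choice of combination rule can matter for reliability.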
Next, policy assessment in continuous domains is discussed. The
thesis proposes to use fitted policy evaluation (FPE), an adaptation of FQI
to policy evaluation, combined with a different function approximator and/or
different dataset to obtain a measure for policy quality. Results of
experiments show that extra-tree FPE, applied to policies generated by NFQ,
produces value functions that can well be used to reason about the true
policy quality.
Finally, the thesis combines ensembles and policy assessment to derive
methods that can deal with changing environments. The major contribution is
the evolving ensemble. The policy of the evolving ensemble changes slowly as
new policies are added and old policies removed. It turns out that the
evolving ensemble approaches work considerably better than simpler
approaches like single policies learned with recent observations or simple
ensembles.
Adaptive Planning and Prediction in Agent-Supported Distributed Collaboration.
Agents that act as user assistants will become invaluable as the number of information sources continues to proliferate. Such agents can support the work of users by learning to automate time-consuming tasks and filter information to manageable levels. Although considerable advances have been made in this area, it remains a fertile area for further development. One application of agents under careful scrutiny is the automated negotiation of conflicts between different users' needs and desires. Many techniques require explicit user models in order to function. This dissertation explores a technique for dynamically constructing user models and the impact of using them to anticipate the need for negotiation. Negotiation is reduced by including an advising aspect in the agent that can use this anticipation of conflict to adjust user behavior.
The Efficient Learning of Multiple Task Sequences
I present a modular network architecture and a learning algorithm based on incremental dynamic programming that allows a single learning agent to learn to solve multiple Markovian decision tasks (MDTs) with significant transfer of learning across the tasks. I consider a class of MDTs, called composite tasks, formed by temporally concatenating a number of simpler, elemental MDTs. The architecture is trained on a set of composite and elemental MDTs. The temporal structure of a composite task is assumed to be unknown and the architecture learns to produce a temporal decomposition. It is shown that under certain conditions the solution of a composite MDT can be constructed by computationally inexpensive modifications of the solutions of its constituent elemental MDTs.
1 INTRODUCTION
Most applications of domain independent learning algorithms have focussed on learning single tasks. Building more sophisticated learning agents that operate in complex environments will require handling multip..