Transferring Autonomous Driving Knowledge on Simulated and Real Intersections
We view intersection handling on autonomous vehicles as a reinforcement
learning problem, and study its behavior in a transfer learning setting. We
show that a network trained on one type of intersection generally is not able
to generalize to other intersections. However, a network that is pre-trained on
one intersection and fine-tuned on another performs better on the new task
compared to training in isolation. This network also retains knowledge of the
prior task, even though some forgetting occurs. Finally, we show that the
benefits of fine-tuning hold when transferring simulated intersection handling
knowledge to a real autonomous vehicle.
Comment: Appeared in Lifelong Learning Workshop @ ICML 2017. arXiv admin note:
text overlap with arXiv:1705.0119
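The fine-tuning procedure described above amounts to initializing the network for the new intersection from the weights learned on the old one and then continuing DQN training. A minimal PyTorch-style sketch of that warm start follows; the network name, sizes, and setup are illustrative placeholders, not the authors' code.

    import copy
    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Small MLP Q-network mapping intersection state features to per-action values."""
        def __init__(self, state_dim=64, n_actions=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, n_actions),
            )

        def forward(self, x):
            return self.net(x)

    # Pre-train on intersection type A with standard DQN updates (training loop omitted).
    q_task_a = QNetwork()
    # ... DQN training on intersection A ...

    # Fine-tune on intersection type B: start from A's weights rather than from scratch.
    q_task_b = copy.deepcopy(q_task_a)
    optimizer = torch.optim.Adam(q_task_b.parameters(), lr=1e-4)  # typically a smaller learning rate
    # ... continue DQN training on intersection B; q_task_a's weights serve as the initialization ...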
Dex: Incremental Learning for Complex Environments in Deep Reinforcement Learning
This paper introduces Dex, a reinforcement learning environment toolkit
specialized for training and evaluation of continual learning methods as well
as general reinforcement learning problems. We also present the novel continual
learning method of incremental learning, where a challenging environment is
solved using optimal weight initialization learned from first solving a similar
easier environment. We show that incremental learning can produce vastly
superior results to standard methods, which we demonstrate by providing a strong baseline
across ten Dex environments. We finally develop a saliency method for
qualitative analysis of reinforcement learning, which shows the impact
incremental learning has on network attention.
Comment: NIPS 2017 submission, 10 pages, 26 figures
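Saliency analysis of the kind mentioned above is commonly realized with a gradient-based map: the magnitude of the gradient of the greedy action's value with respect to the input indicates which parts of the observation drive the decision. The sketch below shows that generic idea and is not necessarily the specific saliency method developed for Dex.

    import torch

    def q_saliency(q_network, state):
        """Gradient-based saliency: |d max_a Q(s, a) / d s| per input element.
        `q_network` is any module mapping a batched state tensor to per-action values."""
        state = state.clone().detach().requires_grad_(True)
        q_values = q_network(state.unsqueeze(0)).squeeze(0)  # add, then drop, the batch dimension
        q_values.max().backward()                            # gradient of the greedy action's value
        return state.grad.abs()                              # saliency map with the input's shape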
Analyzing Knowledge Transfer in Deep Q-Networks for Autonomously Handling Multiple Intersections
We analyze how the knowledge to autonomously handle one type of intersection,
represented as a Deep Q-Network, translates to other types of intersections
(tasks). We view intersection handling as a deep reinforcement learning
problem in which the state-action Q-function is approximated by a deep neural
network. Using a traffic simulator, we show that directly copying a network
trained for one type of intersection to another type of intersection decreases
the success rate. We also show that when a network pre-trained on Task
A is fine-tuned on Task B, the resulting network not only performs
better on Task B than a network trained exclusively on Task A, but also
retains knowledge of Task A. Finally, we examine a lifelong learning
setting, where we train a single network on five different types of
intersections sequentially and show that the resulting network exhibits
catastrophic forgetting of knowledge of previous tasks. This result suggests a
need for a long-term memory component to preserve knowledge.
Comment: Submitted to IEEE International Conference on Intelligent
Transportation Systems (ITSC 2017)
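The lifelong-learning experiment above reduces to training one network on a sequence of intersection tasks and re-evaluating it on every earlier task after each stage; a drop in the earlier tasks' success rates is the catastrophic forgetting being reported. A schematic sketch, where `train_dqn_on` and `success_rate` are hypothetical stand-ins for the paper's training and evaluation procedures:

    def lifelong_run(network, tasks, train_dqn_on, success_rate):
        """Sequentially fine-tune one network on each task, then measure how well it
        still performs on all previously seen tasks (to expose forgetting)."""
        history = {}
        for i, task in enumerate(tasks):
            train_dqn_on(network, task)  # continue training the same network on the next task
            history[task] = {seen: success_rate(network, seen) for seen in tasks[: i + 1]}
        return history

    # e.g. tasks = ["right_turn", "left_turn", "forward", ...]  (illustrative task names)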
Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) methods have performed well in an
increasing number of high-dimensional visual decision-making domains. Among
such visual decision-making problems, those with discrete action spaces
often have an underlying compositional structure in the action space.
Such action spaces often contain actions such as go left, go up as well as go
diagonally up and left (which is a composition of the former two actions). The
representations of control policies in such domains have traditionally been
modeled without exploiting this inherent compositional structure in the action
spaces. We propose a new learning paradigm, Factored Action space
Representations (FAR) wherein we decompose a control policy learned using a
Deep Reinforcement Learning Algorithm into independent components, analogous to
decomposing a vector in terms of some orthogonal basis vectors. This
architectural modification of the control policy representation allows the
agent to learn about multiple actions simultaneously, while executing only one
of them. We demonstrate that FAR yields considerable improvements on top of two
DRL algorithms in Atari 2600: FARA3C outperforms A3C (Asynchronous Advantage
Actor Critic) in 9 out of 14 tasks and FARAQL outperforms AQL (Asynchronous
n-step Q-Learning) in 9 out of 13 tasks.
Comment: 11 pages + 7 pages appendix
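One way to picture the proposed decomposition is a policy head whose logits for each composite action are sums of logits for its components, so that updating one component informs every composite action containing it. This is an illustrative reading of factoring the action space, not the paper's exact FAR architecture.

    import torch.nn as nn

    class FactoredPolicyHead(nn.Module):
        """Illustrative factored policy head: composite-action logits are sums of
        per-component logits (horizontal x vertical movement)."""
        def __init__(self, feat_dim, horiz=("noop", "left", "right"), vert=("noop", "up", "down")):
            super().__init__()
            self.h_logits = nn.Linear(feat_dim, len(horiz))
            self.v_logits = nn.Linear(feat_dim, len(vert))

        def forward(self, features):
            h = self.h_logits(features)               # (batch, |horizontal|)
            v = self.v_logits(features)               # (batch, |vertical|)
            logits = h.unsqueeze(2) + v.unsqueeze(1)  # every (horizontal, vertical) pair, e.g. "up-left"
            return logits.flatten(1)                  # logits over the full composite action space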
Qualitative Measurements of Policy Discrepancy for Return-Based Deep Q-Network
The deep Q-network (DQN) and return-based reinforcement learning are two
promising algorithms proposed in recent years. DQN brings advances to complex
sequential decision problems, while return-based algorithms have advantages in
making use of sample trajectories. In this paper, we propose a general
framework to combine DQN and most of the return-based reinforcement learning
algorithms, named R-DQN. We show the performance of traditional DQN can be
improved effectively by introducing return-based reinforcement learning. In
order to further improve the R-DQN, we design a strategy with two measurements
which can qualitatively measure the policy discrepancy. Moreover, we give the
two measurements' bounds in the proposed R-DQN framework. We show that
algorithms with our strategy can accurately express the trace coefficient and
achieve a better approximation to the return. The experiments, conducted on several
representative tasks from the OpenAI Gym library, validate the effectiveness of
the proposed measurements. The results also show that the algorithms with our
strategy outperform the state-of-the-art methods.
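For reference, the family of return-based corrections that such a framework plugs into the DQN target can be written in the standard general form below, where the per-step trace coefficients c_i distinguish the individual algorithms; the paper's particular coefficients and discrepancy measurements are not reproduced here.

    \[
    \Delta Q(x_t, a_t) = \sum_{s \ge t} \gamma^{\,s-t} \Big( \prod_{i=t+1}^{s} c_i \Big)\, \delta_s,
    \qquad
    \delta_s = r_s + \gamma\, \mathbb{E}_{\pi} Q(x_{s+1}, \cdot) - Q(x_s, a_s),
    \]

with, for example, c_i = λ (naive Q(λ)), c_i = λ π(a_i | x_i) (tree-backup), or a truncated importance ratio (Retrace). The resulting multi-step target replaces the 1-step target in the DQN loss.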
TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning
Off-policy reinforcement learning with eligibility traces is challenging
because of the discrepancy between target policy and behavior policy. One
common approach is to measure the difference between two policies in a
probabilistic way, such as importance sampling and tree-backup. However,
existing off-policy learning methods based on probabilistic policy measurement
are inefficient when utilizing traces under a greedy target policy, which is
ineffective for control problems. The traces are cut immediately when a
non-greedy action is taken, which may lose the advantage of eligibility traces
and slow down the learning process. Alternatively, some non-probabilistic
measurement methods such as General Q(λ) and Naive Q(λ) never
cut traces, but face convergence problems in practice. To address the above
issues, this paper introduces a new method named TBQ(σ), which
effectively unifies the tree-backup algorithm and Naive Q(λ). By
introducing a new parameter σ to illustrate the degree of
utilizing traces, TBQ(σ) creates an effective integration of
TB(λ) and Naive Q(λ) and a continuous role shift between them.
The contraction property of TB(λ) is theoretically analyzed for both
policy evaluation and control settings. We also derive the online version of
TBQ(σ) and give the convergence proof. We empirically show that, for
ε-greedy policies, there exists some degree of
utilizing traces which can improve the efficiency in
trace utilization for off-policy reinforcement learning, to both accelerate the
learning process and improve the performance.
Comment: 8 pages
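The two endpoints being unified have standard trace coefficients, shown below; the interpolation written for TBQ(σ) is only a schematic reading of the "continuous role shift" described above, not necessarily the paper's exact definition.

    \[
    \text{TB}(\lambda):\; c_t = \lambda\, \pi(a_t \mid x_t), \qquad
    \text{Naive } Q(\lambda):\; c_t = \lambda, \qquad
    \text{TBQ}(\sigma)\ \text{(schematic)}:\; c_t = \lambda \big( \sigma + (1-\sigma)\, \pi(a_t \mid x_t) \big),
    \]

so that σ = 0 recovers tree-backup's probability-based trace cutting and σ = 1 never cuts the trace, as in Naive Q(λ).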
Accelerated Methods for Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved many recent successes, yet
experiment turn-around time remains a key bottleneck in research and in
practice. We investigate how to optimize existing deep RL algorithms for modern
computers, specifically for a combination of CPUs and GPUs. We confirm that
both policy gradient and Q-value learning algorithms can be adapted to learn
using many parallel simulator instances. We further find it possible to train
using batch sizes considerably larger than are standard, without negatively
affecting sample complexity or final performance. We leverage these facts to
build a unified framework for parallelization that dramatically hastens
experiments in both classes of algorithm. All neural network computations use
GPUs, accelerating both data collection and training. Our results include using
an entire DGX-1 to learn successful strategies in Atari games in mere minutes,
using both synchronous and asynchronous algorithms.
Comment: v2: -Added game performance statistics summary for algorithm scaling
across full Atari game set. -Added full set of learning curves (appendix).
-Fixed images to remove phantom borders. -Streamlined some discussion, moved
some details to appendix
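The synchronous sampling pattern described above can be pictured as stepping many simulator instances in lockstep and serving all of them with one batched GPU forward pass per step. The sketch below is a generic illustration under that assumption, not the paper's framework; `envs` are hypothetical simulator objects with simple reset()/step() methods.

    import numpy as np
    import torch

    def synchronous_rollout(envs, policy, steps, device="cuda"):
        """Step N simulators in lockstep; one batched GPU inference call per step."""
        obs = np.stack([env.reset() for env in envs])
        trajectories = []
        for _ in range(steps):
            with torch.no_grad():
                batch = torch.as_tensor(obs, dtype=torch.float32, device=device)
                actions = policy(batch).argmax(dim=1).cpu().numpy()   # batched action selection on GPU
            results = [env.step(a) for env, a in zip(envs, actions)]  # the paper spreads this work over CPU cores
            next_obs = np.stack([r[0] for r in results])
            rewards = np.array([r[1] for r in results])
            trajectories.append((obs, actions, rewards))
            obs = next_obs
        return trajectories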
Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning
Reinforcement Learning (RL) can model complex behavior policies for
goal-directed sequential decision making tasks. A hallmark of RL algorithms is
Temporal Difference (TD) learning: the value function for the current state is
moved towards a bootstrapped target estimated using the next state's value
function. Lambda-returns generalize beyond 1-step returns and strike a
balance between Monte Carlo and TD learning methods. While lambda-returns have
been extensively studied in RL, they have not been explored much in Deep RL.
This paper's first contribution is an exhaustive benchmarking of
lambda-returns. Although mathematically tractable, the use of exponentially
decaying weighting of n-step returns based targets in lambda-returns is a
rather ad-hoc design choice. Our second major contribution is that we propose a
generalization of lambda-returns called Confidence-based Autodidactic Returns
(CAR), wherein the RL agent learns the weighting of the n-step returns in an
end-to-end manner. This allows the agent to learn to decide how much it wants
to weigh the n-step returns based targets. In contrast, lambda-returns restrict
RL agents to use an exponentially decaying weighting scheme. Autodidactic
returns can be used for improving any RL algorithm which uses TD learning. We
empirically demonstrate that using sophisticated weighted mixtures of
multi-step returns (like CAR and lambda-returns) considerably outperforms the
use of n-step returns. We perform our experiments on the Asynchronous Advantage
Actor Critic (A3C) algorithm in the Atari 2600 domain.
Comment: 10 pages + 9 page appendix
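For reference, the lambda-return is the exponentially weighted mixture of n-step returns that the paper benchmarks, and CAR (written here only schematically) replaces its fixed weights with weights the agent itself outputs and normalizes:

    \[
    G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} G_t^{(n)},
    \qquad
    G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k} + \gamma^{n} V(s_{t+n}),
    \]
    \[
    G_t^{\mathrm{CAR}} = \sum_{n} w_n\, G_t^{(n)}, \qquad \sum_{n} w_n = 1, \quad w_n \ge 0,
    \]

where the weights w_n are predicted end-to-end by the network rather than fixed to (1-λ)λ^{n-1}.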
Carrier-Sense Multiple Access for Heterogeneous Wireless Networks Using Deep Reinforcement Learning
This paper investigates a new class of carrier-sense multiple access (CSMA)
protocols that employ deep reinforcement learning (DRL) techniques for
heterogeneous wireless networking, referred to as carrier-sense
deep-reinforcement learning multiple access (CS-DLMA). Existing CSMA protocols,
such as the medium access control (MAC) of WiFi, are designed for a homogeneous
network environment in which all nodes adopt the same protocol. Such protocols
suffer from severe performance degradation in a heterogeneous environment where
there are nodes adopting other MAC protocols. This paper shows that DRL
techniques can be used to design efficient MAC protocols for heterogeneous
networking. In particular, in a heterogeneous environment with nodes adopting
different MAC protocols (e.g., CS-DLMA, TDMA, and ALOHA), a CS-DLMA node can
learn to maximize the sum throughput of all nodes. Furthermore, compared with
WiFi's CSMA, CS-DLMA can achieve both higher sum throughput and individual
throughputs when coexisting with other MAC protocols. Last but not least, a
salient feature of CS-DLMA is that it does not need to know the operating
mechanisms of the co-existing MACs. Neither does it need to know the number of
nodes using these other MACs.
Comment: 8 pages
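The agent interface described above can be pictured as a node that keeps a sliding window of recent carrier-sense observations and feedback, and lets a DQN-style value function choose between waiting and transmitting. The sketch below is illustrative only; the state encoding, window length, and class name are assumptions, not the CS-DLMA specification.

    import random
    from collections import deque

    class CSDLMANodeSketch:
        """Illustrative CS-DLMA-style node: a window of recent (carrier-sense, feedback)
        pairs feeds a value function that decides WAIT vs. TRANSMIT."""
        ACTIONS = ("WAIT", "TRANSMIT")

        def __init__(self, q_function, history_len=20, epsilon=0.05):
            self.q_function = q_function            # any callable: history -> (q_wait, q_transmit)
            self.history = deque(maxlen=history_len)
            self.epsilon = epsilon

        def act(self, channel_busy, last_feedback):
            self.history.append((int(channel_busy), last_feedback))
            if random.random() < self.epsilon or len(self.history) < self.history.maxlen:
                return random.choice(self.ACTIONS)  # explore, or wait until the window is full
            q_wait, q_transmit = self.q_function(tuple(self.history))
            return "TRANSMIT" if q_transmit > q_wait else "WAIT"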
Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning
Recent model-free reinforcement learning algorithms have proposed
incorporating learned dynamics models as a source of additional data with the
intention of reducing sample complexity. Such methods hold the promise of
incorporating imagined data coupled with a notion of model uncertainty to
accelerate the learning of continuous control tasks. Unfortunately, they rely
on heuristics that limit usage of the dynamics model. We present model-based
value expansion, which controls for uncertainty in the model by only allowing
imagination to a fixed depth. By enabling wider use of learned dynamics models
within a model-free reinforcement learning algorithm, we improve value
estimation, which, in turn, reduces the sample complexity of learning
- …
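The fixed-depth idea described above can be written compactly: a learned dynamics model rolls the state forward H steps, a learned reward model scores those imagined steps, and the model-free value function closes off the tail, so model error is confined to a bounded horizon. Written schematically (the paper's actor-critic details are omitted):

    \[
    \hat{V}_H(s_t) = \sum_{k=0}^{H-1} \gamma^{k}\, \hat{r}(\hat{s}_{t+k}, a_{t+k})
    + \gamma^{H}\, V_\theta(\hat{s}_{t+H}),
    \qquad
    \hat{s}_{t} = s_t, \quad \hat{s}_{t+k+1} = \hat{f}(\hat{s}_{t+k}, a_{t+k}).
    \]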