106 research outputs found

    Transferring Autonomous Driving Knowledge on Simulated and Real Intersections

    We view intersection handling on autonomous vehicles as a reinforcement learning problem, and study its behavior in a transfer learning setting. We show that a network trained on one type of intersection is generally not able to generalize to other intersections. However, a network that is pre-trained on one intersection and fine-tuned on another performs better on the new task than a network trained on it in isolation. This network also retains knowledge of the prior task, even though some forgetting occurs. Finally, we show that the benefits of fine-tuning hold when transferring simulated intersection handling knowledge to a real autonomous vehicle. Comment: Appeared in Lifelong Learning Workshop @ ICML 2017. arXiv admin note: text overlap with arXiv:1705.0119

    Dex: Incremental Learning for Complex Environments in Deep Reinforcement Learning

    This paper introduces Dex, a reinforcement learning environment toolkit specialized for training and evaluation of continual learning methods as well as general reinforcement learning problems. We also present the novel continual learning method of incremental learning, where a challenging environment is solved using optimal weight initialization learned from first solving a similar, easier environment. We show that incremental learning can produce results vastly superior to those of standard methods by providing a strong baseline method across ten Dex environments. Finally, we develop a saliency method for qualitative analysis of reinforcement learning, which shows the impact incremental learning has on network attention. Comment: NIPS 2017 submission, 10 pages, 26 figures

    Analyzing Knowledge Transfer in Deep Q-Networks for Autonomously Handling Multiple Intersections

    We analyze how the knowledge to autonomously handle one type of intersection, represented as a Deep Q-Network, translates to other types of intersections (tasks). We view intersection handling as a deep reinforcement learning problem, which approximates the state-action Q function as a deep neural network. Using a traffic simulator, we show that directly copying a network trained for one type of intersection to another type of intersection decreases the success rate. We also show that when a network is pre-trained on Task A and then fine-tuned on Task B, the resulting network not only performs better on Task B than a network exclusively trained on Task A, but also retains knowledge of Task A. Finally, we examine a lifelong learning setting, where we train a single network on five different types of intersections sequentially and show that the resulting network exhibits catastrophic forgetting of knowledge of previous tasks. This result suggests a need for a long-term memory component to preserve knowledge. Comment: Submitted to IEEE International Conference on Intelligent Transportation Systems (ITSC 2017)

    Learning to Factor Policies and Action-Value Functions: Factored Action Space Representations for Deep Reinforcement learning

    Deep Reinforcement Learning (DRL) methods have performed well in an increasing number of high-dimensional visual decision making domains. Among such visual decision making problems, those with discrete action spaces often have an underlying compositional structure in the action space. Such action spaces often contain actions such as go left and go up, as well as go diagonally up and left (which is a composition of the former two actions). The representations of control policies in such domains have traditionally been modeled without exploiting this inherent compositional structure in the action spaces. We propose a new learning paradigm, Factored Action space Representations (FAR), wherein we decompose a control policy learned using a Deep Reinforcement Learning algorithm into independent components, analogous to decomposing a vector in terms of some orthogonal basis vectors. This architectural modification of the control policy representation allows the agent to learn about multiple actions simultaneously, while executing only one of them. We demonstrate that FAR yields considerable improvements on top of two DRL algorithms in Atari 2600: FARA3C outperforms A3C (Asynchronous Advantage Actor Critic) in 9 out of 14 tasks and FARAQL outperforms AQL (Asynchronous n-step Q-Learning) in 9 out of 13 tasks. Comment: 11 pages + 7 pages appendix

    Qualitative Measurements of Policy Discrepancy for Return-Based Deep Q-Network

    The deep Q-network (DQN) and return-based reinforcement learning are two promising algorithms proposed in recent years. DQN brings advances to complex sequential decision problems, while return-based algorithms have advantages in making use of sample trajectories. In this paper, we propose a general framework, named R-DQN, to combine DQN with most return-based reinforcement learning algorithms. We show that the performance of traditional DQN can be improved effectively by introducing return-based reinforcement learning. To further improve R-DQN, we design a strategy with two measurements that qualitatively measure the policy discrepancy. Moreover, we give bounds for the two measurements in the proposed R-DQN framework. We show that algorithms with our strategy can accurately express the trace coefficient and achieve a better approximation to the return. The experiments, conducted on several representative tasks from the OpenAI Gym library, validate the effectiveness of the proposed measurements. The results also show that the algorithms with our strategy outperform the state-of-the-art methods.

    TBQ(σ): Improving Efficiency of Trace Utilization for Off-Policy Reinforcement Learning

    Off-policy reinforcement learning with eligibility traces is challenging because of the discrepancy between target policy and behavior policy. One common approach is to measure the difference between two policies in a probabilistic way, such as importance sampling and tree-backup. However, existing off-policy learning methods based on probabilistic policy measurement are inefficient when utilizing traces under a greedy target policy, which is ineffective for control problems. The traces are cut immediately when a non-greedy action is taken, which may lose the advantage of eligibility traces and slow down the learning process. Alternatively, some non-probabilistic measurement methods such as General Q(λ) and Naive Q(λ) never cut traces, but face convergence problems in practice. To address the above issues, this paper introduces a new method named TBQ(σ), which effectively unifies the tree-backup algorithm and Naive Q(λ). By introducing a new parameter σ to illustrate the degree of utilizing traces, TBQ(σ) creates an effective integration of TB(λ) and Naive Q(λ) and a continuous role shift between them. The contraction property of TB(σ) is theoretically analyzed for both policy evaluation and control settings. We also derive the online version of TBQ(σ) and give the convergence proof. We empirically show that, for ε ∈ (0,1] in ε-greedy policies, there exists some degree of utilizing traces for λ ∈ [0,1], which can improve the efficiency in trace utilization for off-policy reinforcement learning, to both accelerate the learning process and improve the performance. Comment: 8 pages
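To make the trace-cutting behaviour described above concrete, here is a minimal sketch. It assumes, purely for illustration, that σ interpolates linearly between the tree-backup trace coefficient (the target policy's probability of the taken action) and the Naive Q(λ) coefficient (always 1). The abstract does not give the formula, so `trace_coefficient` and `trace_weights` are hypothetical helpers and not necessarily the paper's definition.

```python
def trace_coefficient(target_prob, sigma):
    """Assumed interpolation: sigma=0 gives tree-backup, sigma=1 gives Naive Q(lambda)."""
    return (1.0 - sigma) * target_prob + sigma


def trace_weights(greedy_flags, sigma, lam, gamma):
    """Cumulative trace weight after each step of a trajectory.

    greedy_flags[t] is True when the action taken at step t was the greedy one.
    Under a greedy target policy, the target probability is 1 for the greedy
    action and 0 for any other action.
    """
    weights = []
    weight = 1.0
    for is_greedy in greedy_flags:
        target_prob = 1.0 if is_greedy else 0.0
        weight *= gamma * lam * trace_coefficient(target_prob, sigma)
        weights.append(weight)
    return weights


if __name__ == "__main__":
    flags = [True, False, True]  # one exploratory (non-greedy) action in the middle
    for sigma in (0.0, 0.5, 1.0):
        print(sigma, trace_weights(flags, sigma, lam=1.0, gamma=1.0))
```

With σ = 0 the trace drops to zero at the first non-greedy action and stays there; with σ = 1 it is never cut; intermediate values shrink it without discarding it.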

    Accelerated Methods for Deep Reinforcement Learning

    Deep reinforcement learning (RL) has achieved many recent successes, yet experiment turn-around time remains a key bottleneck in research and in practice. We investigate how to optimize existing deep RL algorithms for modern computers, specifically for a combination of CPUs and GPUs. We confirm that both policy gradient and Q-value learning algorithms can be adapted to learn using many parallel simulator instances. We further find it possible to train using batch sizes considerably larger than are standard, without negatively affecting sample complexity or final performance. We leverage these facts to build a unified framework for parallelization that dramatically hastens experiments in both classes of algorithm. All neural network computations use GPUs, accelerating both data collection and training. Our results include using an entire DGX-1 to learn successful strategies in Atari games in mere minutes, using both synchronous and asynchronous algorithms. Comment: v2: Added game performance statistics summary for algorithm scaling across full Atari game set. Added full set of learning curves (appendix). Fixed images to remove phantom borders. Streamlined some discussion, moved some details to appendix

    Learning to Mix n-Step Returns: Generalizing lambda-Returns for Deep Reinforcement Learning

    Reinforcement Learning (RL) can model complex behavior policies for goal-directed sequential decision making tasks. A hallmark of RL algorithms is Temporal Difference (TD) learning: the value function for the current state is moved towards a bootstrapped target that is estimated using the next state's value function. λ-returns generalize beyond 1-step returns and strike a balance between Monte Carlo and TD learning methods. While λ-returns have been extensively studied in RL, they have not been explored much in Deep RL. This paper's first contribution is an exhaustive benchmarking of λ-returns. Although mathematically tractable, the exponentially decaying weighting of n-step-return-based targets in λ-returns is a rather ad hoc design choice. Our second major contribution is a generalization of λ-returns called Confidence-based Autodidactic Returns (CAR), wherein the RL agent learns the weighting of the n-step returns in an end-to-end manner. This allows the agent to learn how much weight to give to each of the n-step-return-based targets. In contrast, λ-returns restrict RL agents to an exponentially decaying weighting scheme. Autodidactic returns can be used to improve any RL algorithm that uses TD learning. We empirically demonstrate that using sophisticated weighted mixtures of multi-step returns (like CAR and λ-returns) considerably outperforms the use of n-step returns. We perform our experiments on the Asynchronous Advantage Actor Critic (A3C) algorithm in the Atari 2600 domain. Comment: 10 pages + 9 page appendix
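The idea of mixing n-step returns can be sketched in a few lines. The code below computes n-step returns for a short trajectory segment, mixes them with the standard exponentially decaying λ weights (truncated at the segment length), and then with arbitrary normalized weights. The softmax over `confidences` is only a stand-in for learned weights; the function names and the toy numbers are assumptions, not taken from the paper.

```python
import math


def n_step_returns(rewards, next_values, gamma):
    """Return [G^(1), ..., G^(T)] measured from the first state of the segment.

    rewards[k] is r_k and next_values[k] is the estimate V(s_{k+1}).
    G^(n) = sum_{k<n} gamma^k r_k + gamma^n V(s_n).
    """
    returns = []
    discounted_sum = 0.0
    for n, (reward, value) in enumerate(zip(rewards, next_values), start=1):
        discounted_sum += gamma ** (n - 1) * reward
        returns.append(discounted_sum + gamma ** n * value)
    return returns


def lambda_weights(horizon, lam):
    """Truncated lambda-return weights; the last return absorbs the tail mass."""
    weights = [(1.0 - lam) * lam ** (n - 1) for n in range(1, horizon)]
    weights.append(lam ** (horizon - 1))
    return weights


def softmax(scores):
    """Normalize arbitrary scores into weights that sum to 1."""
    shifted = [math.exp(s - max(scores)) for s in scores]
    total = sum(shifted)
    return [s / total for s in shifted]


def mix_returns(returns, weights):
    """Weighted mixture of n-step returns."""
    return sum(w * g for w, g in zip(weights, returns))


if __name__ == "__main__":
    returns = n_step_returns([1.0, 1.0, 1.0], [0.5, 0.5, 0.0], gamma=0.9)
    print("lambda-return:", mix_returns(returns, lambda_weights(3, lam=0.8)))
    confidences = [0.1, 2.0, 0.3]  # hypothetical learned scores, one per n
    print("learned mixture:", mix_returns(returns, softmax(confidences)))
```

Setting λ = 0 recovers the 1-step TD target and λ = 1 recovers the longest available return, which is the balance between TD and Monte Carlo learning that the abstract refers to.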

    Carrier-Sense Multiple Access for Heterogeneous Wireless Networks Using Deep Reinforcement Learning

    This paper investigates a new class of carrier-sense multiple access (CSMA) protocols that employ deep reinforcement learning (DRL) techniques for heterogeneous wireless networking, referred to as carrier-sense deep-reinforcement learning multiple access (CS-DLMA). Existing CSMA protocols, such as the medium access control (MAC) of WiFi, are designed for a homogeneous network environment in which all nodes adopt the same protocol. Such protocols suffer from severe performance degradation in a heterogeneous environment where there are nodes adopting other MAC protocols. This paper shows that DRL techniques can be used to design efficient MAC protocols for heterogeneous networking. In particular, in a heterogeneous environment with nodes adopting different MAC protocols (e.g., CS-DLMA, TDMA, and ALOHA), a CS-DLMA node can learn to maximize the sum throughput of all nodes. Furthermore, compared with WiFi's CSMA, CS-DLMA can achieve both higher sum throughput and individual throughputs when coexisting with other MAC protocols. Last but not least, a salient feature of CS-DLMA is that it does not need to know the operating mechanisms of the co-existing MACs. Neither does it need to know the number of nodes using these other MACs. Comment: 8 pages

    Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning

    Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
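A fixed-depth imagined rollout of the kind described above might look like the sketch below: the learned model is unrolled for `horizon` steps under the current policy, imagined rewards are accumulated, and the value estimate bootstraps the remainder. All callables (`policy`, `model`, `reward_fn`, `value_fn`) and the toy dynamics are hypothetical placeholders; the abstract does not specify the exact form of the target.

```python
def value_expansion_target(state, policy, model, reward_fn, value_fn, horizon, gamma):
    """H-step target: sum_{t<H} gamma^t r_t + gamma^H V(s_H), using imagined states."""
    total = 0.0
    discount = 1.0
    for _ in range(horizon):
        action = policy(state)
        total += discount * reward_fn(state, action)
        state = model(state, action)  # imagined next state from the learned model
        discount *= gamma
    # Bootstrap with the value estimate at the end of the imagined rollout.
    return total + discount * value_fn(state)


if __name__ == "__main__":
    # Toy one-dimensional problem, chosen only so the arithmetic is easy to follow.
    target = value_expansion_target(
        state=0.0,
        policy=lambda s: 1.0,        # always step right
        model=lambda s, a: s + a,    # stand-in for learned dynamics
        reward_fn=lambda s, a: s,    # reward equals the current position
        value_fn=lambda s: 10.0 * s, # stand-in for the critic
        horizon=2,
        gamma=0.5,
    )
    print(target)
```

With `horizon=0` the function reduces to the plain critic estimate, so the depth parameter directly controls how much the target relies on the model.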