
    Atari games and Intel processors

    The asynchronous nature of state-of-the-art reinforcement learning algorithms, such as the Asynchronous Advantage Actor-Critic (A3C) algorithm, makes them exceptionally suitable for CPU computation. However, since deep reinforcement learning often involves interpreting visual information, a large part of the training and inference time is spent performing convolutions. In this work we present our results on learning strategies in Atari games using a Convolutional Neural Network, the Math Kernel Library, and the TensorFlow 0.11rc0 machine learning framework. We also analyze the effects of asynchronous computation on the convergence of reinforcement learning algorithms.
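
    As a rough illustration of the asynchrony this abstract relies on, the sketch below has several CPU worker threads repeatedly read a shared parameter vector, compute a gradient, and apply it without any locking, the Hogwild-style update pattern that A3C-type learners use. The quadratic toy loss, the learning rate, and names such as `shared_params` and `worker` are assumptions for illustration only, not code from the paper.

```python
import threading
import numpy as np

# Shared parameters, updated lock-free by all workers (Hogwild-style,
# the pattern that makes A3C-type training CPU-friendly).
shared_params = np.zeros(8)
TARGET = np.arange(8, dtype=float)   # toy optimum standing in for the RL objective
LR = 0.05

def worker(params: np.ndarray, steps: int) -> None:
    for _ in range(steps):
        # Each worker reads a (possibly stale) snapshot of the parameters...
        snapshot = params.copy()
        # ...computes the gradient of a toy quadratic loss ||snapshot - TARGET||^2...
        grad = 2.0 * (snapshot - TARGET)
        # ...and applies it in place to the shared vector without synchronization.
        params -= LR * grad

threads = [threading.Thread(target=worker, args=(shared_params, 200)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("learned parameters:", np.round(shared_params, 2))
```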

    Pipelined genetic propagation

    Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and the intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources and to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.
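
    The graph-of-operators idea behind PGP can be sketched in software as pipeline stages connected by small bounded queues, each stage running concurrently so that individuals stream through mutation and evaluation instead of being processed one generation at a time; selection and crossover stages would attach to the same graph in the same way. This is only a CPU-threaded analogue of the hardware pipeline, and the bit-string problem, stage names, and queue sizes below are illustrative assumptions, not the paper's design.

```python
import queue
import random
import threading

GENOME_LEN = 16
N_INDIVIDUALS = 200
STOP = None  # sentinel that shuts a stage down

def source(out_q):
    """Emit random bit-string individuals into the pipeline."""
    for _ in range(N_INDIVIDUALS):
        out_q.put([random.randint(0, 1) for _ in range(GENOME_LEN)])
    out_q.put(STOP)

def mutate(in_q, out_q, rate=0.05):
    """Flip each bit with a small probability and pass the individual on."""
    while (ind := in_q.get()) is not STOP:
        out_q.put([b ^ 1 if random.random() < rate else b for b in ind])
    out_q.put(STOP)

def evaluate(in_q, results):
    """Score individuals (here: number of ones) as they leave the pipeline."""
    while (ind := in_q.get()) is not STOP:
        results.append((sum(ind), ind))

# Small bounded queues keep the operator stages loosely coupled.
q1, q2 = queue.Queue(maxsize=8), queue.Queue(maxsize=8)
results = []
stages = [
    threading.Thread(target=source, args=(q1,)),
    threading.Thread(target=mutate, args=(q1, q2)),
    threading.Thread(target=evaluate, args=(q2, results)),
]
for s in stages:
    s.start()
for s in stages:
    s.join()

print("best fitness seen:", max(results)[0])
```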

    Optimizing Data Collection in Deep Reinforcement Learning

    Reinforcement learning (RL) workloads take a notoriously long time to train due to the large number of samples collected at run-time from simulators. Unfortunately, cluster scale-up approaches remain expensive, and commonly used CPU implementations of simulators induce high overhead when switching back and forth between GPU computations. We explore two optimizations that increase RL data collection efficiency by increasing GPU utilization: (1) GPU vectorization: parallelizing simulation on the GPU for increased hardware parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps to run in a single GPU kernel launch to reduce global memory bandwidth requirements. We find that GPU vectorization can achieve up to a 1024× speedup over commonly used CPU simulators. We profile the performance of different implementations and show that, for a simple simulator, ML compiler implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch) by 13.4× by reducing the CPU overhead of repeated Python-to-DL-backend API calls. We show that simulator kernel fusion speedups with a simple simulator are 11.3× and increase by up to 1024× as simulator complexity grows in terms of memory bandwidth requirements. We show that the speedups from simulator kernel fusion are orthogonal to and combinable with GPU vectorization, leading to a multiplicative speedup. (Comment: MLBench 2022 ( https://memani1.github.io/mlbench22/ ) camera-ready submission.)
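
    A framework-agnostic sketch of the two optimizations: the simulator state is held as one batched array so that a single call advances every environment at once (vectorization), and several steps are folded into one call so intermediate states never leave the inner loop, the software analogue of fusing simulation steps into a single kernel launch. NumPy stands in for a GPU array library here; on a GPU the same pattern would be written with, e.g., PyTorch tensors or an XLA-compiled function. The toy point-mass dynamics and names such as `step_batched` are assumptions for illustration.

```python
import numpy as np

N_ENVS = 1024        # many environments simulated side by side
DT = 0.05

def step_batched(pos, vel, actions):
    """Advance every environment by one step with a single batched update
    (toy point-mass dynamics stand in for a real simulator)."""
    vel = vel + DT * actions
    pos = pos + DT * vel
    return pos, vel

def step_fused(pos, vel, actions, n_steps):
    """Run several simulation steps back to back inside one call, so the
    intermediate states stay local -- the software analogue of fusing
    multiple simulator kernels into one GPU launch."""
    for _ in range(n_steps):
        pos, vel = step_batched(pos, vel, actions)
    return pos, vel

rng = np.random.default_rng(0)
pos = np.zeros(N_ENVS)
vel = np.zeros(N_ENVS)
actions = rng.uniform(-1.0, 1.0, size=N_ENVS)

# One fused call advances all 1024 environments by 8 steps.
pos, vel = step_fused(pos, vel, actions, n_steps=8)
print("mean position after 8 steps:", pos.mean())
```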