Atari games and Intel processors
The asynchronous nature of state-of-the-art reinforcement learning algorithms,
such as the Asynchronous Advantage Actor-Critic (A3C) algorithm, makes them
exceptionally suitable for CPU computation. However, because deep reinforcement
learning often involves interpreting visual information, a large part of
training and inference time is spent performing convolutions. In this work we
present our results on learning strategies in Atari games using a Convolutional
Neural Network, the Intel Math Kernel Library (MKL), and the TensorFlow 0.11rc0
machine learning framework. We also analyze the effects of asynchronous
computation on the convergence of reinforcement learning algorithms.
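As a toy illustration of why this workload suits multi-core CPUs, the sketch below (mine, not the paper's implementation) mimics the asynchronous, Hogwild-style update pattern that A3C relies on: several worker threads compute gradients against a shared parameter vector and apply them with no global synchronization barrier. The linear model and squared-error surrogate loss are stand-ins chosen only for illustration.

```python
import threading
import numpy as np

shared_weights = np.zeros(4)      # parameters shared by all workers
LR = 0.01

def worker(seed, steps=2000):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        # Each worker draws its own (simulated) experience.
        x = rng.normal(size=4)
        target = x.sum()          # true weights are all ones
        pred = shared_weights @ x
        grad = 2.0 * (pred - target) * x
        # Lock-free, asynchronous in-place update of shared parameters.
        shared_weights[:] -= LR * grad

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("learned weights:", shared_weights)   # approaches [1, 1, 1, 1]
```

Despite racing on the shared vector, the workers converge in practice; the convergence-under-asynchrony question is exactly what the abstract's last sentence refers to.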
Pipelined genetic propagation
Genetic Algorithms (GAs) are a class of numerical and combinatorial optimisers which are especially useful for solving complex non-linear and non-convex problems. However, the required execution time often limits their application to small-scale or latency-insensitive problems, so techniques to increase the computational efficiency of GAs are needed. FPGA-based acceleration has significant potential for speeding up genetic algorithms, but existing FPGA GAs are limited by the generational approaches inherited from software GAs. Many parts of the generational approach do not map well to hardware, such as the large shared population memory and intrinsic loop-carried dependency. To address this problem, this paper proposes a new hardware-oriented approach to GAs, called Pipelined Genetic Propagation (PGP), which is intrinsically distributed and pipelined. PGP represents a GA solver as a graph of loosely coupled genetic operators, which allows the solution to be scaled to the available resources, and also to dynamically change topology at run-time to explore different solution strategies. Experiments show that pipelined genetic propagation is effective in solving seven different applications. Our PGP design is 5 times faster than a recent FPGA-based GA system, and 90 times faster than a CPU-based GA system.
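The following Python sketch is a software analogy of the PGP idea, not the authors' FPGA design: genetic operators become loosely coupled pipeline stages exchanging individuals through bounded queues, so there is no shared population memory and no generational barrier. The ring topology, queue depths, and one-max fitness function are illustrative choices.

```python
import queue
import random
import threading
import time

GENOME_LEN, POOL_SIZE = 16, 32
best = {"fit": -1}                      # monitored from the main thread

def fitness(g):                         # toy one-max objective
    return sum(g)

def select_stage(inbox, outbox):
    pool = [[random.randint(0, 1) for _ in range(GENOME_LEN)]
            for _ in range(POOL_SIZE)]
    while True:
        try:                            # absorb individuals flowing back
            pool.append(inbox.get_nowait())
        except queue.Empty:
            pass
        pool.sort(key=fitness, reverse=True)
        del pool[POOL_SIZE:]            # keep only the fittest
        best["fit"] = max(best["fit"], fitness(pool[0]))
        outbox.put((random.choice(pool[:8]), random.choice(pool[:8])))

def crossover_stage(inbox, outbox):
    while True:
        a, b = inbox.get()
        cut = random.randrange(1, GENOME_LEN)
        outbox.put(a[:cut] + b[cut:])   # single-point crossover

def mutate_stage(inbox, outbox):
    while True:
        g = list(inbox.get())
        g[random.randrange(GENOME_LEN)] ^= 1   # flip one bit
        outbox.put(g)

# Ring topology: select -> crossover -> mutate -> back to select.
q1, q2, q3 = queue.Queue(8), queue.Queue(8), queue.Queue(8)
for fn, args in [(select_stage, (q3, q1)),
                 (crossover_stage, (q1, q2)),
                 (mutate_stage, (q2, q3))]:
    threading.Thread(target=fn, args=args, daemon=True).start()

time.sleep(1.0)
print("best fitness found:", best["fit"], "of", GENOME_LEN)
```

On an FPGA each stage would be a physical circuit processing a continuous stream of individuals; the queues stand in for the on-chip channels that decouple the operators.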
Optimizing Data Collection in Deep Reinforcement Learning
Reinforcement learning (RL) workloads take a notoriously long time to train
due to the large number of samples collected at run-time from simulators.
Unfortunately, cluster scale-up approaches remain expensive, and commonly used
CPU implementations of simulators induce high overhead when switching back and
forth between GPU computations. We explore two optimizations that increase RL
data collection efficiency by increasing GPU utilization: (1) GPU
vectorization: parallelizing simulation on the GPU for increased hardware
parallelism, and (2) simulator kernel fusion: fusing multiple simulation steps
to run in a single GPU kernel launch to reduce global memory bandwidth
requirements. We find that GPU vectorization can achieve substantial speedups
over commonly used CPU simulators. We profile the performance of different
implementations and show that for a simple simulator, ML compiler
implementations (XLA) of GPU vectorization outperform a DNN framework (PyTorch)
by reducing CPU overhead from repeated Python to DL backend API calls. We show
that simulator kernel fusion provides speedups with a simple simulator, and
that these speedups grow as simulator complexity increases in terms of memory
bandwidth requirements. We show that the speedups from simulator kernel fusion
are orthogonal and combinable with GPU vectorization, leading to a
multiplicative speedup.

Comment: MLBench 2022 ( https://memani1.github.io/mlbench22/ ) camera ready
submission
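A minimal sketch of the two ideas on a toy point-mass simulator (mine, not the paper's code): vectorization advances N environments in one array operation, and "fusing" K steps inside a single call pays the per-call dispatch overhead once instead of K times. On a GPU, a compiler such as XLA would additionally emit correspondingly fewer kernel launches.

```python
import numpy as np

N, DT = 4096, 0.01                     # parallel environments, step size

def step_all(pos, vel, act):
    """One simulator step for all N environments as one array op."""
    vel = vel + DT * act               # vectorized dynamics update
    pos = pos + DT * vel
    return pos, vel

def step_fused(pos, vel, act, k=8):
    """K steps inside one call: the 'kernel fusion' analogue."""
    for _ in range(k):
        pos, vel = step_all(pos, vel, act)
    return pos, vel

rng = np.random.default_rng(0)
pos = np.zeros(N, dtype=np.float32)
vel = np.zeros(N, dtype=np.float32)
act = rng.normal(size=N).astype(np.float32)

pos, vel = step_fused(pos, vel, act)   # 8 steps, a single call
print(pos[:3], vel[:3])
```

The two optimizations compose because they attack different costs: vectorization widens each step across environments, while fusion reduces how often the host has to dispatch a step at all.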