A General Framework for Accelerating Swarm Intelligence Algorithms on FPGAs, GPUs and Multi-core CPUs
Swarm intelligence algorithms (SIAs) have demonstrated excellent performance on optimization problems, including many real-world problems. However, because of their high computational cost on some complex problems, SIAs need to be accelerated effectively to achieve better performance. This paper presents a high-performance general framework to accelerate SIAs (FASI). Unlike previous work, which accelerates SIAs by enhancing parallelization only, FASI considers both the memory architectures of the hardware platforms and the dataflow of SIAs, and it reschedules the framework of SIAs as a converged dataflow to improve memory access efficiency. FASI achieves higher acceleration by matching the algorithm framework to the hardware architectures. We also design deeply optimized parallelization and convergence structures for FASI based on the characteristics of specific hardware platforms. We take the quantum-behaved particle swarm optimization (QPSO) algorithm as a case study to evaluate FASI. The results show that FASI improves the throughput of SIAs and provides better performance by optimizing the hardware implementations. In our experiments, FASI achieves a maximum throughput of 290.7 Mbit/s, which is higher than that of several existing systems, and FASI on FPGAs achieves a better speedup than on GPUs and multi-core CPUs. In terms of optimization time, FASI on a Xilinx Kintex UltraScale xcku040 FPGA is up to 123 times and at least 1.45 times faster than on an Intel Core i7-6700 CPU and an NVIDIA GTX 1080 GPU. Finally, we compare the differences of deploying FASI on the different hardware platforms and provide some guidelines for improving acceleration performance according to the hardware architectures.
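To make the computational pattern concrete, the following is a minimal NumPy sketch of one QPSO iteration, the algorithm used as the case study above; the function name, parameter values, and example problem are illustrative only, and none of FASI's dataflow rescheduling or hardware mapping is shown.

    import numpy as np

    def qpso_step(x, pbest, pbest_fit, fitness_fn, beta=0.75, rng=None):
        # One quantum-behaved PSO iteration over a population x of shape (n, d).
        rng = rng or np.random.default_rng()
        n, d = x.shape
        gbest = pbest[np.argmin(pbest_fit)]             # global best position
        mbest = pbest.mean(axis=0)                      # mean of the personal bests
        phi = rng.random((n, d))
        p = phi * pbest + (1.0 - phi) * gbest           # per-particle local attractor
        u = rng.random((n, d))
        sign = np.where(rng.random((n, d)) < 0.5, -1.0, 1.0)
        x_new = p + sign * beta * np.abs(mbest - x) * np.log(1.0 / u)
        fit = fitness_fn(x_new)                         # batched fitness evaluation
        improved = fit < pbest_fit
        pbest = np.where(improved[:, None], x_new, pbest)
        pbest_fit = np.where(improved, fit, pbest_fit)
        return x_new, pbest, pbest_fit

    # Example: minimize the sphere function in 32 dimensions with 64 particles.
    rng = np.random.default_rng(0)
    x = rng.uniform(-5.0, 5.0, (64, 32))
    pbest, pbest_fit = x.copy(), np.sum(x ** 2, axis=1)
    for _ in range(200):
        x, pbest, pbest_fit = qpso_step(x, pbest, pbest_fit,
                                        lambda pop: np.sum(pop ** 2, axis=1), rng=rng)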
Fine-sorting One-dimensional Particle-In-Cell Algorithm with Monte-Carlo Collisions on a Graphics Processing Unit
Particle-in-cell (PIC) simulations with Monte-Carlo collisions are used in plasma science to explore a variety of kinetic effects. One major problem is the long run-time of such simulations. Even on modern computer systems, PIC codes take a considerable amount of time to converge. Most of the computations can be massively parallelized, since particles behave independently of each other within one time step. Current graphics processing units (GPUs) offer an attractive means of executing the parallelized code. In this contribution we present a one-dimensional PIC code running on Nvidia GPUs using the CUDA environment. A distinctive feature of the code is that the size of the cells used to sort the particles by their coordinates is comparable to the size of the grid cells used to discretize the electric field. Hence, we call the corresponding algorithm "fine-sorting". Implementation details and optimizations of the code are discussed, and the speed-up compared to classical CPU approaches is computed.
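As a rough illustration of the fine-sorting idea (a NumPy sketch, not the authors' CUDA implementation), the code below orders particles by the index of the field-grid cell that contains them, so that all particles belonging to one cell become contiguous in memory; the stable argsort stands in for the parallel sort performed on the GPU.

    import numpy as np

    def fine_sort(positions, dx, n_cells):
        # Sorting cells have (roughly) the same width dx as the field-grid cells.
        cell = np.clip((positions // dx).astype(np.int64), 0, n_cells - 1)
        order = np.argsort(cell, kind="stable")             # parallel sort on the GPU in practice
        counts = np.bincount(cell, minlength=n_cells)       # particles per cell
        offsets = np.concatenate(([0], np.cumsum(counts)))  # start of each cell's slice
        return positions[order], order, offsets

    # Example: 100,000 particles on a 1 cm domain discretized into 128 field cells.
    rng = np.random.default_rng(1)
    pos = rng.uniform(0.0, 0.01, 100_000)
    sorted_pos, order, offsets = fine_sort(pos, dx=0.01 / 128, n_cells=128)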
On the Feasibility of Real-Time 3D Hand Tracking using Edge GPGPU Acceleration
This paper presents a case study of the non-intrusive porting of a monolithic C++ library for real-time 3D hand tracking to the domain of edge-based computation. Towards a proof of concept, the case study considers a pair of workstations, a computationally powerful one and a computationally weak one. By wrapping the C++ library in a Java container and by capitalizing on a Java-based offloading infrastructure that supports both CPU and GPGPU computations, we are able to automatically establish the server-client workflow that best addresses the resource-allocation problem when executing from the weak workstation. As a result, the weak workstation can perform the task well, despite lacking sufficient hardware to do the required computations locally. This is achieved by offloading the GPGPU-dependent computations to the powerful workstation across the network that connects them. We show the edge-based computation challenges associated with the information flow of the ported algorithm, demonstrate how we cope with them, and identify what needs to be improved to achieve even better performance.
Comment: 6 pages, 5 figures
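The Java-based offloading infrastructure used above is not reproduced here; purely to illustrate the server-client pattern it establishes, the following Python sketch offloads a hypothetical GPGPU-heavy tracking call from the weak workstation to the powerful one over the network (the host name, port, and track_hand function are invented for the example).

    # server.py -- runs on the powerful (GPU-equipped) workstation.
    from xmlrpc.server import SimpleXMLRPCServer

    def track_hand(frame):
        # Hypothetical stand-in for the GPGPU-heavy hand-tracking step;
        # in the case study this role is played by the wrapped C++ library.
        return [0.0] * 27                          # e.g. a 27-parameter hand pose

    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(track_hand)
    server.serve_forever()

    # client.py -- runs on the weak workstation; frames are captured locally
    # and only the heavy computation crosses the network.
    from xmlrpc.client import ServerProxy

    remote = ServerProxy("http://powerful-host:8000")
    pose = remote.track_hand([0] * (640 * 480))    # placeholder for a captured frame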
EvoX: A Distributed GPU-accelerated Library towards Scalable Evolutionary Computation
During the past decades, evolutionary computation (EC) has demonstrated
promising potential in solving various complex optimization problems of
relatively small scales. Nowadays, however, ongoing developments in modern
science and engineering are bringing increasingly grave challenges to the
conventional EC paradigm in terms of scalability. As problem scales increase,
on the one hand, the encoding spaces (i.e., dimensions of the decision vectors)
are intrinsically larger; on the other hand, EC algorithms often require
growing numbers of function evaluations (and probably larger population sizes
as well) to work properly. Meeting such emerging challenges requires not only delicate algorithm designs but, more importantly, a high-performance computing framework. Hence, we develop a distributed GPU-accelerated algorithm library, EvoX. First, we propose a generalized
workflow for implementing general EC algorithms. Second, we design a scalable
computing framework for running EC algorithms on distributed GPU devices.
Third, we provide user-friendly interfaces to both researchers and
practitioners for benchmark studies as well as extended real-world
applications. To comprehensively assess the performance of EvoX, we conduct a
series of experiments, including: (i) scalability test via numerical
optimization benchmarks with problem dimensions/population sizes up to
millions; (ii) acceleration test via a neuroevolution task with multiple GPU
nodes; (iii) extensibility demonstration via the application to reinforcement
learning tasks on the OpenAI Gym. The code of EvoX is available at https://github.com/EMI-Group/EvoX.
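EvoX's own API is documented in the repository; purely to illustrate the ask / evaluate / tell workflow that such GPU-accelerated EC frameworks generalize (this is not the EvoX API), here is a minimal NumPy sketch in which the whole population is sampled and evaluated as one batch, the pattern that maps naturally onto GPUs.

    import numpy as np

    class SimpleES:
        # Minimal (mu, lambda)-style evolution strategy; names are illustrative only.
        def __init__(self, dim, pop_size=256, sigma=0.1, seed=0):
            self.rng = np.random.default_rng(seed)
            self.mean = np.zeros(dim)
            self.sigma = sigma
            self.pop_size = pop_size

        def ask(self):
            # Sample the whole population at once; on a GPU this is one batched op.
            self.pop = self.mean + self.sigma * self.rng.standard_normal(
                (self.pop_size, self.mean.size))
            return self.pop

        def tell(self, fitness):
            # Move the mean toward the best quarter of the population.
            elite = self.pop[np.argsort(fitness)[: self.pop_size // 4]]
            self.mean = elite.mean(axis=0)

    def sphere(pop):
        return np.sum(pop ** 2, axis=1)             # vectorized fitness of the whole batch

    es = SimpleES(dim=1000)
    for _ in range(100):
        pop = es.ask()
        es.tell(sphere(pop))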