Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP
Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular than a traditional power-of-two FFT. The results show the processing time, accuracy, energy consumption and flexibility of the implementation.
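To illustrate why a non-power-of-two FFT is less regular, here is a minimal Python sketch of a mixed-radix Cooley-Tukey transform. It is an illustration only, not the algorithm or mapping used in the paper; the function names `dft`, `smallest_factor` and `mixed_radix_fft` are hypothetical. The radix changes from one recursion level to the next (for example 2, 2, 3 for N = 12), so the butterflies are not all the same size.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, used as the base case for prime lengths."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def smallest_factor(n):
    """Smallest prime factor of n (n itself when n is prime or 1)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p
        p += 1
    return n

def mixed_radix_fft(x):
    """Decimation-in-time FFT for any length, splitting by the smallest prime factor."""
    n = len(x)
    p = smallest_factor(n)
    if n <= 1 or p == n:
        return dft(x)  # prime length: no further splitting possible
    m = n // p
    # p interleaved sub-sequences, each of length m, transformed recursively
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = []
    for k in range(n):
        # Radix-p butterfly: combine the p sub-transforms with twiddle factors
        out.append(sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
                       for r in range(p)))
    return out
```

A length such as 12 or 15 can be passed directly, whereas a pure radix-2 routine would need padding to the next power of two.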
Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions
In the past decade, Convolutional Neural Networks (CNNs) have demonstrated
state-of-the-art performance in various Artificial Intelligence tasks. To
accelerate the experimentation and development of CNNs, several software
frameworks have been released, primarily targeting power-hungry CPUs and GPUs.
In this context, reconfigurable hardware in the form of FPGAs constitutes a
potential alternative platform that can be integrated in the existing deep
learning ecosystem to provide a tunable balance between performance, power
consumption and programmability. In this paper, a survey of the existing
CNN-to-FPGA toolflows is presented, comprising a comparative study of their key
characteristics which include the supported applications, architectural
choices, design space exploration methods and achieved performance. Moreover,
major challenges and objectives introduced by the latest trends in CNN
algorithmic research are identified and presented. Finally, a uniform
evaluation methodology is proposed, aiming at the comprehensive, complete and
in-depth evaluation of CNN-to-FPGA toolflows. Comment: Accepted for publication at the ACM Computing Surveys (CSUR) journal,
201
Libsharp - spherical harmonic transforms revisited
We present libsharp, a code library for spherical harmonic transforms (SHTs),
which evolved from the libpsht library, addressing several of its shortcomings,
such as adding MPI support for distributed memory systems and SHTs of fields
with arbitrary spin, but also supporting new developments in CPU instruction
sets like the Advanced Vector Extensions (AVX) or fused multiply-accumulate
(FMA) instructions. The library is implemented in portable C99 and provides an
interface that can be easily accessed from other programming languages such as
C++, Fortran, Python etc. Generally, libsharp's performance is at least on par
with that of its predecessor; however, significant improvements were made to
the algorithms for scalar SHTs, which are roughly twice as fast when using the
same CPU capabilities. The library is available at
http://sourceforge.net/projects/libsharp/ under the terms of the GNU General
Public License.
ColDICE: a parallel Vlasov-Poisson solver using moving adaptive simplicial tessellation
Numerically solving the Vlasov-Poisson equations for initially cold systems can
be reduced to following the evolution of a three-dimensional sheet in
six-dimensional phase-space. We describe a public parallel numerical algorithm
that represents the phase-space sheet with a conforming,
self-adaptive simplicial tessellation whose vertices follow the
Lagrangian equations of motion. The algorithm is implemented both in six- and
four-dimensional phase-space. Refinement of the tessellation mesh is performed
using the bisection method and a local representation of the phase-space sheet
at second order relying on additional tracers created when needed at runtime.
To preserve the Hamiltonian nature of the system as well as possible,
refinement is anisotropic and constrained by measurements of local Poincaré
invariants. The Poisson equation is solved using the fast Fourier
method on a regular rectangular grid, similarly to particle-in-cell codes. To
compute the density projected onto this grid, the intersection of the
tessellation and the grid is calculated using the method of Franklin and
Kankanhalli (1993) generalised to linear order. As preliminary tests of the
code, we study in four-dimensional phase-space the evolution of an initially
small patch in a chaotic potential and the cosmological collapse of a
fluctuation composed of two sinusoidal waves. We also perform a "warm" dark
matter simulation in six-dimensional phase-space that we use to check the
parallel scaling of the code. Comment: Code and illustration movies available at:
http://www.vlasix.org/index.php?n=Main.ColDICE - Article submitted to Journal
of Computational Physics
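The abstract mentions solving the Poisson equation with the fast Fourier method on a regular grid. A minimal one-dimensional Python sketch of that idea follows. It assumes a periodic domain, normalized units in which the equation reads d²φ/dx² = ρ, and a power-of-two grid size; the names `fft` and `solve_poisson_periodic` are hypothetical and this is not ColDICE's implementation, which works on higher-dimensional grids.

```python
import cmath
import math

def fft(x, sign=-1):
    """Recursive radix-2 FFT (len(x) must be a power of two).
    sign=-1 is the forward transform, sign=+1 the unnormalized inverse."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], sign)
    odd = fft(x[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def solve_poisson_periodic(rho, length):
    """Solve phi'' = rho on a periodic grid of given physical length.
    The mean of phi is fixed to zero (the k = 0 mode is dropped)."""
    n = len(rho)
    rho_hat = fft([complex(v) for v in rho])
    phi_hat = [0j] * n
    for i in range(1, n):
        m = i if i <= n // 2 else i - n      # signed mode number
        k = 2.0 * math.pi * m / length       # wavenumber of mode i
        phi_hat[i] = -rho_hat[i] / (k * k)   # phi_k = -rho_k / k^2
    phi = fft(phi_hat, sign=+1)
    return [v.real / n for v in phi]         # normalize the inverse transform
```

In Fourier space the second derivative becomes multiplication by -k², so the solve reduces to one forward transform, a pointwise division, and one inverse transform.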
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighbor searching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity. Comment: EASC 2014 conference proceedings
FFT for the APE Parallel Computer
We present a parallel FFT algorithm for SIMD systems following the 'Transpose
Algorithm' approach. The method is based on the assignment of the data field
onto a 1-dimensional ring of systolic cells. The systolic array can be
universally mapped onto any parallel system. In particular for systems with
next-neighbour connectivity our method has the potential to improve the
efficiency of matrix transposition by use of hyper-systolic communication. We
have realized a scalable parallel FFT on the APE100/Quadrics massively parallel
computer, where our implementation is part of a 2-dimensional hydrodynamics
code for turbulence studies. A possible generalization to 4-dimensional FFT is
presented, having in mind QCD applications. Comment: 17 pages, 13 figures, figures included
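The general 'Transpose Algorithm' idea for a two-dimensional FFT can be sketched serially in Python as below. This is an illustration under stated assumptions, not the APE100/Quadrics implementation: the names `fft`, `transpose` and `fft2_transpose` are hypothetical, both array dimensions are assumed to be powers of two, and the systolic and hyper-systolic communication described in the abstract is not modelled. In a parallel setting each processor would own a block of rows, and the transposition would be the only step that requires communication.

```python
import cmath

def fft(x):
    """Recursive radix-2 forward FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def transpose(a):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]

def fft2_transpose(a):
    """2D FFT as: row FFTs, transpose, row FFTs, transpose back."""
    rows_done = [fft(row) for row in a]                   # local 1D FFTs along rows
    cols_done = [fft(row) for row in transpose(rows_done)]  # columns are now rows
    return transpose(cols_done)                           # restore original layout
```

Every 1D transform acts on data that is contiguous in one row, which is why the data movement is concentrated in the transposition.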