Computing the fast Fourier transform on SIMD microprocessors
This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than or very close to the speed of state-of-the-art libraries such as FFTW ("Fastest Fourier Transform in the West"), SPIRAL and Intel Integrated Performance Primitives (IPP).
The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL.
The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT ("Streaming Fast Fourier Transform") and benchmarked against FFTW, SPIRAL, Intel IPP and Apple Accelerate on sixteen x86 machines and two ARM NEON machines. In many cases they are shown to be faster than these state-of-the-art libraries without requiring extensive machine-specific calibration, which demonstrates that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
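As a rough illustration of the conjugate-pair idea mentioned in this abstract, the following is a minimal recursive sketch in Python. It is a textbook-style formulation, not the thesis's vectorized SFFT code; the function name `conjugate_pair_fft` is hypothetical. The point it shows is that the two quarter-length sub-transforms use twiddle factors that are complex conjugates of each other, so only one twiddle needs to be loaded per butterfly.

```python
import cmath


def conjugate_pair_fft(x):
    """Forward DFT of a power-of-two length list (illustrative sketch only)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]

    q = n // 4
    # Half-length transform of the even samples x[2m].
    u = conjugate_pair_fft(x[0::2])
    # Quarter-length transform of x[4m + 1].
    z = conjugate_pair_fft(x[1::4])
    # Quarter-length transform of x[4m - 1], indexed cyclically.
    # Ordinary split-radix would use x[4m + 3] here instead.
    zp = conjugate_pair_fft([x[(4 * m - 1) % n] for m in range(q)])

    out = [0j] * n
    for k in range(q):
        w = cmath.exp(-2j * cmath.pi * k / n)  # the single twiddle factor
        a = w * z[k]
        b = w.conjugate() * zp[k]              # its conjugate, no second lookup
        s = a + b
        d = -1j * (a - b)
        out[k] = u[k] + s
        out[k + 2 * q] = u[k] - s
        out[k + q] = u[k + q] + d
        out[k + 3 * q] = u[k + q] - d
    return out
```

Calling `conjugate_pair_fft([1, 0, 0, 0])` yields four values equal to 1, as expected for a unit impulse. A production implementation would avoid recursion and list copies; this sketch only makes the data flow visible.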
The cosmological simulation code GADGET-2
We discuss the cosmological simulation code GADGET-2, a new massively
parallel TreeSPH code, capable of following a collisionless fluid with the
N-body method, and an ideal gas by means of smoothed particle hydrodynamics
(SPH). Our implementation of SPH manifestly conserves energy and entropy in
regions free of dissipation, while allowing for fully adaptive smoothing
lengths. Gravitational forces are computed with a hierarchical multipole
expansion, which can optionally be applied in the form of a TreePM algorithm,
where only short-range forces are computed with the `tree'-method while
long-range forces are determined with Fourier techniques. Time integration is
based on a quasi-symplectic scheme where long-range and short-range forces can
be integrated with different timesteps. Individual and adaptive short-range
timesteps may also be employed. The domain decomposition used in the
parallelisation algorithm is based on a space-filling curve, resulting in high
flexibility and tree force errors that do not depend on the way the domains are
cut. The code is efficient in terms of memory consumption and required
communication bandwidth. It has been used to compute the first cosmological
N-body simulation with more than 10^10 dark matter particles, reaching a
homogeneous spatial dynamic range of 10^5 per dimension in a 3D box. It has
also been used to carry out very large cosmological SPH simulations that
account for radiative cooling and star formation, reaching total particle
numbers of more than 250 million. We present the algorithms used by the code
and discuss their accuracy and performance using a number of test problems.
GADGET-2 is publicly released to the research community. Comment: submitted to MNRAS, 31 pages, 20 figures (reduced resolution), code
available at http://www.mpa-garching.mpg.de/gadge
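To make the space-filling-curve domain decomposition described in this abstract more concrete, here is a minimal Python sketch. It is not GADGET-2's code: it uses a Morton (Z-order) key as a simpler stand-in for whichever curve the code employs, it balances particle counts rather than computational work, and the names `morton_key` and `decompose` are hypothetical. Positions are assumed to lie in the unit box [0, 1)^3.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three cell indices into one key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b + 2)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b)
    return key


def decompose(positions, n_domains, bits=10):
    """Split particle indices into contiguous segments along the curve."""
    cells = 1 << bits

    def key_of(p):
        # Map each coordinate to an integer cell index, clamped to the grid.
        ix, iy, iz = (min(int(c * cells), cells - 1) for c in p)
        return morton_key(ix, iy, iz, bits)

    # Order particles by their position along the curve.
    order = sorted(range(len(positions)), key=lambda i: key_of(positions[i]))

    # Cut the ordered list into nearly equal contiguous pieces.
    domains = []
    for d in range(n_domains):
        lo = d * len(order) // n_domains
        hi = (d + 1) * len(order) // n_domains
        domains.append(order[lo:hi])
    return domains
```

Because each domain is a contiguous stretch of the curve, particles that are close along the curve tend to be close in space, and moving a cut point shifts load between neighbouring domains without reshuffling the rest.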
Distributed watermarking for secure control of microgrids under replay attacks
The problem of replay attacks in the communication network between
Distributed Generation Units (DGUs) of a DC microgrid is examined. The DGUs are
regulated through a hierarchical control architecture, and are networked to
achieve secondary control objectives. Following analysis of the detectability
of replay attacks by a previously proposed distributed monitoring scheme, the
need for a watermarking signal is identified. Hence, conditions are given on
the watermark in order to guarantee detection of replay attacks, and such a
signal is designed. Simulations are then presented to demonstrate the
effectiveness of the technique.
A mathematical approach to a low power FFT architecture
Journal Article
Architecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations have been applied to create a power-efficient architecture of an FFT. This architecture has been implemented in asynchronous circuit technology that achieves significant power reduction over other FFT architectures. Multirate signal processing concepts are applied to the FFT to localize communication and remove the need for globally shared results in the FFT computation. A novel architecture is produced from the polyphase components and is mapped to an asynchronous implementation. The asynchronous design continues the localization of communication and can be designed using standard cell libraries such as radiation-tolerant libraries for space electronics. We present a methodology based on multirate signal processing techniques and asynchronous design style that supports significant reduction in power over conventional design practices. A test chip implementing part of this design has been fabricated and power comparisons have been made.
Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming
We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only Theta(lg^2 N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.