Multirate as a hardware paradigm
Journal Article
Architecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations based on ideas from multirate signal processing have been applied to create high-performance, low-power architectures. To illustrate this approach, two case studies are presented: one concerns the design of a fast Fourier transform (FFT) device, while the other concerns the design of analog-to-digital converters.
A mathematical approach to a low power FFT architecture
Journal Article
Architecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations have been applied to create a power-efficient FFT architecture. This architecture has been implemented in asynchronous circuit technology that achieves significant power reduction over other FFT architectures. Multirate signal processing concepts are applied to the FFT to localize communication and remove the need for globally shared results in the FFT computation. A novel architecture is produced from the polyphase components that is mapped to an asynchronous implementation. The asynchronous design continues the localization of communication and can be designed using standard cell libraries, such as radiation-tolerant libraries for space electronics. We present a methodology based on multirate signal processing techniques and an asynchronous design style that supports significant reduction in power over conventional design practices. A test chip implementing part of this design has been fabricated and power comparisons have been made.
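The polyphase idea behind these two abstracts can be sketched in a few lines of NumPy: splitting the input into its even- and odd-indexed polyphase components turns an n-point DFT into two n/2-point DFTs combined by twiddle factors. This is only a sketch of the standard decimation-in-time recursion, not the paper's hardware mapping; the function name is ours.

```python
import numpy as np

def fft_via_polyphase(x):
    """Radix-2 decimation-in-time FFT expressed through the two
    polyphase components (even- and odd-indexed samples) of x.
    Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_via_polyphase(x[0::2])   # polyphase component 0
    odd = fft_via_polyphase(x[1::2])    # polyphase component 1
    # Twiddle factors recombine the two half-length transforms.
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.rand(16)
assert np.allclose(fft_via_polyphase(x), np.fft.fft(x))
```

In the hardware setting described above, each polyphase component becomes a smaller, local computation, which is what lets the architecture avoid globally shared intermediate results.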
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both KDFT and FFT formulations take full advantage
of TPU's strength in matrix multiplications. The KDFT formulation allows direct
use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example with a full TPU Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds. Scaling analysis is provided to demonstrate the high parallel efficiency of the two DFT implementations on TPUs.
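The KDFT formulation above amounts to multiplying the input by the Vandermonde (DFT) matrix. A minimal NumPy sketch of that idea, not the paper's TPU implementation, is below; the optional `freqs` parameter is our illustration of why nonuniform frequency points need no extra step in a matrix formulation.

```python
import numpy as np

def kdft(x, freqs=None):
    """DFT as a dense matrix multiplication between the input and the
    Vandermonde (DFT) matrix, in the spirit of the KDFT formulation.
    `freqs` (hypothetical parameter, not from the paper) allows
    evaluating the transform at nonuniform frequency points."""
    n = len(x)
    k = np.arange(n) if freqs is None else np.asarray(freqs)
    # Vandermonde/DFT matrix: F[k, m] = exp(-2*pi*i*k*m/n)
    F = np.exp(-2j * np.pi * np.outer(k, np.arange(n)) / n)
    return F @ x

x = np.random.rand(8)
assert np.allclose(kdft(x), np.fft.fft(x))
```

On a TPU the `F @ x` step maps onto the matrix-multiply units, which is the strength both KDFT and the Cooley-Tukey-based FFT formulation exploit.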
Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming
We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only Θ(lg² N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.
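The dynamic-programming step can be illustrated with a toy cost model: at each stage, either solve the N-point problem directly or split N = N1 × N2 and recurse. The costs below (direct cost n², split overhead n) are our invented stand-ins for the paper's Parallel Disk Model I/O recurrence; only the memoized minimize-over-splits structure is the point.

```python
def best_split(n, memo=None):
    """Toy dynamic program in the spirit of the paper: choose between
    solving an n-point problem directly (toy cost n*n) or splitting
    n = n1*n2 and paying both subproblem costs plus a pass cost of n.
    Returns (cost, split), where split is None if solving directly
    is cheapest. The real recurrence counts PDM I/Os instead."""
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    best_cost, best_pair = n * n, None  # base case: no split
    d = 2
    while d * d <= n:
        if n % d == 0:
            c = best_split(d, memo)[0] + best_split(n // d, memo)[0] + n
            if c < best_cost:
                best_cost, best_pair = c, (d, n // d)
        d += 1
    memo[n] = (best_cost, best_pair)
    return memo[n]

print(best_split(16))  # under this toy model, splitting 16 = 4*4 wins
```

Memoization keeps the table of subproblem costs small, which is why the split-selection phase is so much cheaper than the out-of-core FFT it plans for.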
Inter-motherboard Memory Scheduling
Exploring the performance benefits of applying memory scheduling beyond the motherboard. Serrano Gómez, M. (2009). Inter-motherboard Memory Scheduling. http://hdl.handle.net/10251/14163
A cache-friendly truncated FFT
We describe a cache-friendly version of van der Hoeven's truncated FFT and
inverse truncated FFT, focusing on the case of `large' coefficients, such as
those arising in the Schönhage-Strassen algorithm for multiplication in Z[x]. We describe two implementations and examine their performance.
Comment: 14 pages, 11 figures, uses the algorithm2e package.
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks
This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of the mini-butterfly computed in memory and in how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they compute the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that, except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement.
Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers
The internal representation of numerical data and their speed of manipulation to generate the desired result through efficient utilisation of the central processing unit, memory, and communication links are essential aspects of all high-performance scientific computations. Machine parameters, in particular, reveal the accuracy and error bounds of computation required for performance tuning of codes. This paper reports the diagnosis of machine parameters, measurement of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block-copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache- and register-blocking techniques result in optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces the cache inefficiency loss, which is known to be proportional to the number of processors. From the measurement of intrinsic parameters and from an application benchmark of a multi-block Euler code test run on the Linux clusters ANUP16, HPC22, and HPC64, it has been found that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers, with the added advantage of speed and a high degree of parallelism.
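The cache-blocking idea mentioned above can be sketched by computing a matrix-vector product tile by tile, so each block of the matrix and slice of the vector is reused while still in cache. This is a minimal illustration of the general technique, assuming nothing about the paper's actual benchmark kernels; the function name and block size are ours.

```python
import numpy as np

def blocked_matvec(A, x, block=64):
    """Matrix-vector product computed in cache-sized tiles. Each tile
    of A and slice of x is reused before moving on, which is the
    locality-of-reference effect the benchmarks above measure."""
    n, m = A.shape
    y = np.zeros(n, dtype=np.result_type(A, x))
    for i in range(0, n, block):
        for j in range(0, m, block):
            # Accumulate the contribution of one tile of A.
            y[i:i+block] += A[i:i+block, j:j+block] @ x[j:j+block]
    return y

A = np.random.rand(150, 200)
x = np.random.rand(200)
assert np.allclose(blocked_matvec(A, x, block=32), A @ x)
```

In compiled languages the same loop structure, combined with register blocking and unrolling of the innermost loop, is what converts the reduced cache traffic into measurable throughput gains.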