Multirate as a hardware paradigm
Journal Article
Architecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations based on ideas from multirate signal processing have been applied to create high-performance, low-power architectures. To illustrate this approach, two case studies are presented: one concerns the design of a fast Fourier transform (FFT) device, while the other concerns the design of analog-to-digital converters.
A mathematical approach to a low power FFT architecture
Journal Article
Architecture and circuit design are the two most effective means of reducing power in CMOS VLSI. Mathematical manipulations have been applied to create a power-efficient FFT architecture. This architecture has been implemented in asynchronous circuit technology that achieves significant power reduction over other FFT architectures. Multirate signal processing concepts are applied to the FFT to localize communication and remove the need for globally shared results in the FFT computation. A novel architecture is produced from the polyphase components that is mapped to an asynchronous implementation. The asynchronous design continues the localization of communication and can be designed using standard cell libraries, such as radiation-tolerant libraries for space electronics. We present a methodology based on multirate signal processing techniques and an asynchronous design style that supports significant reduction in power over conventional design practices. A test chip implementing part of this design has been fabricated and power comparisons have been made.
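The polyphase idea behind these two abstracts can be sketched in a few lines of NumPy: splitting the input into its even- and odd-indexed polyphase components turns an n-point DFT into two n/2-point DFTs combined by twiddle factors. This is only a sketch of the standard decimation-in-time recursion, not the paper's hardware mapping; the function name is ours.

```python
import numpy as np

def fft_via_polyphase(x):
    """Radix-2 decimation-in-time FFT expressed through the two
    polyphase components (even- and odd-indexed samples) of x.
    Assumes len(x) is a power of two."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_via_polyphase(x[0::2])   # polyphase component 0
    odd = fft_via_polyphase(x[1::2])    # polyphase component 1
    # Twiddle factors recombine the two half-length transforms.
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

x = np.random.rand(16)
assert np.allclose(fft_via_polyphase(x), np.fft.fft(x))
```

In the hardware setting described above, each polyphase component becomes a smaller, local computation, which is what lets the architecture avoid globally shared intermediate results.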
Large-Scale Discrete Fourier Transform on TPUs
In this work, we present two parallel algorithms for the large-scale discrete
Fourier transform (DFT) on Tensor Processing Unit (TPU) clusters. The two
parallel algorithms are associated with two formulations of DFT: one is based
on the Kronecker product, to be specific, dense matrix multiplications between
the input data and the Vandermonde matrix, denoted as KDFT in this work; the
other is based on the famous Cooley-Tukey algorithm and phase adjustment,
denoted as FFT in this work. Both KDFT and FFT formulations take full advantage
of TPU's strength in matrix multiplications. The KDFT formulation allows direct
use of nonuniform inputs without an additional step. In the two parallel
algorithms, the same strategy of data decomposition is applied to the input
data. Through the data decomposition, the dense matrix multiplications in KDFT
and FFT are kept local within TPU cores, which can be performed completely in
parallel. The communication among TPU cores is achieved through the one-shuffle
scheme in both parallel algorithms, with which sending and receiving data takes
place simultaneously between two neighboring cores and along the same direction
on the interconnect network. The one-shuffle scheme is designed for the
interconnect topology of TPU clusters, minimizing the time required by the
communication among TPU cores. Both KDFT and FFT are implemented in TensorFlow.
The three-dimensional complex DFT is performed on an example with a full TPU Pod: the run time of KDFT is 12.66 seconds and that of FFT is 8.3 seconds. Scaling analysis is provided to demonstrate the high parallel efficiency of the two DFT implementations on TPUs.
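The KDFT formulation above amounts to multiplying the input by the Vandermonde (DFT) matrix. A minimal NumPy sketch of that idea, not the paper's TPU implementation, is below; the optional `freqs` parameter is our illustration of why nonuniform frequency points need no extra step in a matrix formulation.

```python
import numpy as np

def kdft(x, freqs=None):
    """DFT as a dense matrix multiplication between the input and the
    Vandermonde (DFT) matrix, in the spirit of the KDFT formulation.
    `freqs` (hypothetical parameter, not from the paper) allows
    evaluating the transform at nonuniform frequency points."""
    n = len(x)
    k = np.arange(n) if freqs is None else np.asarray(freqs)
    # Vandermonde/DFT matrix: F[k, m] = exp(-2*pi*i*k*m/n)
    F = np.exp(-2j * np.pi * np.outer(k, np.arange(n)) / n)
    return F @ x

x = np.random.rand(8)
assert np.allclose(kdft(x), np.fft.fft(x))
```

On a TPU the `F @ x` step maps onto the matrix-multiply units, which is the strength both KDFT and the Cooley-Tukey-based FFT formulation exploit.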
Determining an Out-of-Core FFT Decomposition Strategy for Parallel Disks by Dynamic Programming
We present an out-of-core FFT algorithm based on the in-core FFT method developed by Swarztrauber. Our algorithm uses a recursive divide-and-conquer strategy, and each stage in the recursion presents several possibilities for how to split the problem into subproblems. We give a recurrence for the algorithm's I/O complexity on the Parallel Disk Model and show how to use dynamic programming to determine optimal splits at each recursive stage. The algorithm to determine the optimal splits takes only Θ(lg² N) time for an N-point FFT, and it is practical. The out-of-core FFT algorithm itself takes considerably longer.
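The dynamic-programming step can be illustrated with a toy cost model: at each stage, either solve the N-point problem directly or split N = N1 × N2 and recurse. The costs below (direct cost n², split overhead n) are our invented stand-ins for the paper's Parallel Disk Model I/O recurrence; only the memoized minimize-over-splits structure is the point.

```python
def best_split(n, memo=None):
    """Toy dynamic program in the spirit of the paper: choose between
    solving an n-point problem directly (toy cost n*n) or splitting
    n = n1*n2 and paying both subproblem costs plus a pass cost of n.
    Returns (cost, split), where split is None if solving directly
    is cheapest. The real recurrence counts PDM I/Os instead."""
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]
    best_cost, best_pair = n * n, None  # base case: no split
    d = 2
    while d * d <= n:
        if n % d == 0:
            c = best_split(d, memo)[0] + best_split(n // d, memo)[0] + n
            if c < best_cost:
                best_cost, best_pair = c, (d, n // d)
        d += 1
    memo[n] = (best_cost, best_pair)
    return memo[n]

print(best_split(16))  # under this toy model, splitting 16 = 4*4 wins
```

Memoization keeps the table of subproblem costs small, which is why the split-selection phase is so much cheaper than the out-of-core FFT it plans for.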
Inter-motherboard Memory Scheduling
Exploring the performance benefits of applying memory scheduling beyond the motherboard. Serrano Gómez, M. (2009). Inter-motherboard Memory Scheduling. http://hdl.handle.net/10251/14163
A cache-friendly truncated FFT
We describe a cache-friendly version of van der Hoeven's truncated FFT and
inverse truncated FFT, focusing on the case of `large' coefficients, such as
those arising in the Schönhage-Strassen algorithm for multiplication in Z[x]. We describe two implementations and examine their performance.
Comment: 14 pages, 11 figures, uses the algorithm2e package.
Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks
This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of the mini-butterfly computed in memory and in how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they compute the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that, except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement.
Numerics of High Performance Computers and Benchmark Evaluation of Distributed Memory Computers
The internal representation of numerical data and their speed of manipulation to generate the desired result through efficient utilisation of the central processing unit, memory, and communication links are essential aspects of all high-performance scientific computations. Machine parameters, in particular, reveal the accuracy and error bounds of computation required for performance tuning of codes. This paper reports the diagnosis of machine parameters, measurement of the computing power of several workstations, serial and parallel computers, and a component-wise test procedure for distributed memory computers. Hierarchical memory structure is illustrated by block-copying and unrolling techniques. Locality of reference for cache reuse of data is amply demonstrated by fast Fourier transform codes. Cache- and register-blocking techniques result in optimum utilisation, with a consequent gain in throughput during vector-matrix operations. Implementation of these memory management techniques reduces the cache inefficiency loss, which is known to be proportional to the number of processors. From the measurement of intrinsic parameters and from an application benchmark of a multi-block Euler code test run on the Linux clusters ANUP16, HPC22, and HPC64, it has been found that ANUP16 is suitable for problems that exhibit fine-grained parallelism. The delivered performance of ANUP16 is of immense utility for developing high-end PC clusters like HPC64 and customised parallel computers, with the added advantage of speed and a high degree of parallelism.
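The cache-blocking idea mentioned above can be sketched by computing a matrix-vector product tile by tile, so each block of the matrix and slice of the vector is reused while still in cache. This is a minimal illustration of the general technique, assuming nothing about the paper's actual benchmark kernels; the function name and block size are ours.

```python
import numpy as np

def blocked_matvec(A, x, block=64):
    """Matrix-vector product computed in cache-sized tiles. Each tile
    of A and slice of x is reused before moving on, which is the
    locality-of-reference effect the benchmarks above measure."""
    n, m = A.shape
    y = np.zeros(n, dtype=np.result_type(A, x))
    for i in range(0, n, block):
        for j in range(0, m, block):
            # Accumulate the contribution of one tile of A.
            y[i:i+block] += A[i:i+block, j:j+block] @ x[j:j+block]
    return y

A = np.random.rand(150, 200)
x = np.random.rand(200)
assert np.allclose(blocked_matvec(A, x, block=32), A @ x)
```

In compiled languages the same loop structure, combined with register blocking and unrolling of the innermost loop, is what converts the reduced cache traffic into measurable throughput gains.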