Search CORE

1,060 research outputs found

Ordered fast fourier transforms on a massively parallel hypercube multiprocessor

Author: Swarztrauber Paul N.
Tong Charles
Publication venue
Publication date
Field of study

Design alternatives for ordered Fast Fourier Transformation (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and the sequence to processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order which can be implemented with equal ease on the Connection Machine where orderings are determined by geometries and priorities. If the sequence has N = 2 exp r elements and the hypercube has P = 2 exp d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine

NASA Technical Reports Server

Pattern Avoidance in Reverse Double Lists

Author: Anderson Monica
Diepenbroek Marika
Pudwell Lara
Stoll Alex
Publication venue
Publication date: 01/01/2018
Field of study

In this paper, we consider pattern avoidance in a subset of words on

\{1,1,2,2,\dots,n,n\}

called reverse double lists. In particular a reverse double list is a word formed by concatenating a permutation with its reversal. We enumerate reverse double lists avoiding any permutation pattern of length at most 4 and completely determine the corresponding Wilf classes. For permutation patterns

\rho

of length 5 or more, we characterize when the number of

\rho

-avoiding reverse double lists on

n

letters has polynomial growth. We also determine the number of

1\cdots k

-avoiders of maximum length for any positive integer

k

.Comment: 24 pages, 5 figures, 4 table

arXiv.org e-Print Archive

Episciences.org

Directory of Open Access Journals

Valparaiso University

Pruned Bit-Reversal Permutations: Mathematical Characterization, Fast Algorithms and Architectures

Author: Mohammad M. Mansour
Senior Member
Publication venue
Publication date: 18/10/2014
Field of study

A mathematical characterization of serially-pruned permutations (SPPs) employed in variable-length permuters and their associated fast pruning algorithms and architectures are proposed. Permuters are used in many signal processing systems for shuffling data and in communication systems as an adjunct to coding for error correction. Typically only a small set of discrete permuter lengths are supported. Serial pruning is a simple technique to alter the length of a permutation to support a wider range of lengths, but results in a serial processing bottleneck. In this paper, parallelizing SPPs is formulated in terms of recursively computing sums involving integer floor and related functions using integer operations, in a fashion analogous to evaluating Dedekind sums. A mathematical treatment for bit-reversal permutations (BRPs) is presented, and closed-form expressions for BRP statistics are derived. It is shown that BRP sequences have weak correlation properties. A new statistic called permutation inliers that characterizes the pruning gap of pruned interleavers is proposed. Using this statistic, a recursive algorithm that computes the minimum inliers count of a pruned BR interleaver (PBRI) in logarithmic time complexity is presented. This algorithm enables parallelizing a serial PBRI algorithm by any desired parallelism factor by computing the pruning gap in lookahead rather than a serial fashion, resulting in significant reduction in interleaving latency and memory overhead. Extensions to 2-D block and stream interleavers, as well as applications to pruned fast Fourier transforms and LTE turbo interleavers, are also presented. Moreover, hardware-efficient architectures for the proposed algorithms are developed. Simulation results demonstrate 3 to 4 orders of magnitude improvement in interleaving time compared to existing approaches.Comment: 31 page

arXiv.org e-Print Archive

CiteSeerX

Solving Large Problem Sizes of Index-Digit Algorithms on GPU: FFT and Tridiagonal System Solvers

Author: Amor Margarita
Doallo Ramón
Lobeiras Blanco Jacobo
Pérez Diéguez Adrián
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018
Field of study

[Abstract] Current Graphics Processing Units (GPUs) are capable of obtaining high computational performance in scientific applications. Nevertheless, programmers have to use suitable parallel algorithms for these architectures and usually have to consider optimization techniques in the implementation in order to achieve said performance. There are many efficient proposals for limited-size problems which fit directly in the shared memory of CUDA GPUs, however, there are few GPU proposals that tackle the design of efficient algorithms for large problem sizes that exceed shared memory storage capacity. In this work, we present a tuning strategy that addresses this problem for some parallel prefix algorithms that can be represented according to a set of common permutations of the digits of each of its element indices [1], denoted as Index-Digit (ID) algorithms. Specifically, our strategy has been applied to develop flexible Multi-Stage (MS) algorithms for the Fast Fourier Transform (FFT) algorithm (MS-ID-FFT) and a tridiagonal system solver (MS-ID-TS) on the GPU. The resulting implementation is compact and outperforms other well-known and commonly used state-of-the-art libraries, with an improvement of up to 1.47x with respect to NVIDIA's complex CUFFT, and up to 33.2x in comparison with NVIDIA's CUSPARSE for real data tridiagonal systems

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref