543 research outputs found

    Ordered fast fourier transforms on a massively parallel hypercube multiprocessor

    Get PDF
    Design alternatives for ordered Fast Fourier Transformation (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and the sequence to processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order which can be implemented with equal ease on the Connection Machine where orderings are determined by geometries and priorities. If the sequence has N = 2 exp r elements and the hypercube has P = 2 exp d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine

    On the impact of communication complexity in the design of parallel numerical algorithms

    Get PDF
    This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation

    Application of graphics processing units to search pipelines for gravitational waves from coalescing binaries of compact objects

    Get PDF
    We report a novel application of a graphics processing unit (GPU) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A speed-up of 16-fold in total has been achieved with an NVIDIA GeForce 8800 Ultra GPU card compared with one core of a 2.5 GHz Intel Q9300 central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs

    Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks

    Get PDF
    This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of mini-butterfly computed in memory and how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement

    Correcting soft errors online in fast fourier transform

    Get PDF
    While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the existing ABFT schemes detect soft errors online before the computation finishes. This paper presents an online ABFT scheme for FFT so that soft errors can be detected online and the corrupted computation can be terminated in a much more timely manner. We also extend our scheme to tolerate both arithmetic errors and memory errors, develop strategies to reduce its fault tolerance overhead and improve its numerical stability and fault coverage, and finally incorporate it into the widely used FFTW library - one of the today's fastest FFT software implementations. Experimental results demonstrate that: (1) the proposed online ABFT scheme introduces much lower overhead than the existing offline ABFT schemes; (2) it detects errors in a much more timely manner; and (3) it also has higher numerical stability and better fault coverage

    Solving the Klein-Gordon equation using Fourier spectral methods: A benchmark test for computer performance

    Get PDF
    The cubic Klein-Gordon equation is a simple but non-trivial partial differential equation whose numerical solution has the main building blocks required for the solution of many other partial differential equations. In this study, the library 2DECOMP&FFT is used in a Fourier spectral scheme to solve the Klein-Gordon equation and strong scaling of the code is examined on thirteen different machines for a problem size of 512^3. The results are useful in assessing likely performance of other parallel fast Fourier transform based programs for solving partial differential equations. The problem is chosen to be large enough to solve on a workstation, yet also of interest to solve quickly on a supercomputer, in particular for parametric studies. Unlike other high performance computing benchmarks, for this problem size, the time to solution will not be improved by simply building a bigger supercomputer.Comment: 10 page

    Partitioning of computationally intensive tasks between FPGA and CPUs

    Get PDF
    With the recent development of faster and more complex Multiprocessor System-on-Cips (MPSoCs), a large number of different resources have become available on a single chip. For example, Xilinx's UltraScale+ is a powerful MPSoC with four ARM Cortex-A53 CPUs, two Cortex-R5 real-time cores, an FPGA fabric and a Mali-400 GPU. Optimal partitioning between CPUs, real-time cores, GPU and FPGA is therefore a challenge. For many scientific applications with high sampling rates and real-time signal analysis, an FFT needs to be calculated and analyzed directly in the measuring device. The goal of partitioning such an FFT in an MPSoC is to make best use of the available resources, to minimize latency and to optimize performance. The paper compares different partitioning designs and discusses their advantages and disadvantages. Measurement results with up to 250 MSamples per second are shown

    ARKCoS: Artifact-Suppressed Accelerated Radial Kernel Convolution on the Sphere

    Full text link
    We describe a hybrid Fourier/direct space convolution algorithm for compact radial (azimuthally symmetric) kernels on the sphere. For high resolution maps covering a large fraction of the sky, our implementation takes advantage of the inexpensive massive parallelism afforded by consumer graphics processing units (GPUs). Applications involve modeling of instrumental beam shapes in terms of compact kernels, computation of fine-scale wavelet transformations, and optimal filtering for the detection of point sources. Our algorithm works for any pixelization where pixels are grouped into isolatitude rings. Even for kernels that are not bandwidth limited, ringing features are completely absent on an ECP grid. We demonstrate that they can be highly suppressed on the popular HEALPix pixelization, for which we develop a freely available implementation of the algorithm. As an example application, we show that running on a high-end consumer graphics card our method speeds up beam convolution for simulations of a characteristic Planck high frequency instrument channel by two orders of magnitude compared to the commonly used HEALPix implementation on one CPU core while maintaining at typical a fractional RMS accuracy of about 1 part in 10^5.Comment: 10 pages, 6 figures. Submitted to Astronomy and Astrophysics. Replaced to match published version. Code can be downloaded at https://github.com/elsner/arkco
    corecore