FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication
The Discrete Fourier Transform (DFT) is essential for various applications
ranging from signal processing to convolution and polynomial multiplication.
The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time
complexity from the naive O(n^2) to O(n log n), and recent works have sought
further acceleration through parallel architectures such as GPUs.
Unfortunately, accelerators such as GPUs cannot exploit their full computing
capabilities as memory access becomes the bottleneck. Therefore, this paper
accelerates the FFT algorithm using digital Processing-in-Memory (PIM)
architectures that shift computation into the memory by exploiting physical
devices capable of storage and logic (e.g., memristors). We propose an O(log n)
in-memory FFT algorithm that can also be performed in parallel across multiple
arrays for high-throughput batched execution, supporting both fixed-point and
floating-point numbers. Through the convolution theorem, we extend this
algorithm to O(log n) polynomial multiplication - a fundamental task for
applications such as cryptography. We evaluate FourierPIM on a
publicly-available cycle-accurate simulator that verifies both correctness and
performance, and demonstrate 5-15x throughput and 4-13x energy improvement over
the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial
multiplication.
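The convolution-theorem step mentioned in this abstract can be illustrated on a conventional CPU. The following is a minimal NumPy sketch, not the FourierPIM in-memory algorithm; the function name `poly_multiply` and the lowest-degree-first coefficient order are assumptions made for this example.

```python
import numpy as np

def poly_multiply(a, b):
    """Multiply two real-coefficient polynomials (lowest degree first) via the FFT.

    Illustrative only: a hypothetical helper, not code from the paper.
    """
    out_len = len(a) + len(b) - 1
    # Zero-pad to a power of two that is at least out_len, so the circular
    # convolution computed by the FFT equals the linear convolution.
    n = 1 << (out_len - 1).bit_length()
    fa = np.fft.rfft(a, n)
    fb = np.fft.rfft(b, n)
    # Pointwise product in the frequency domain equals convolution of coefficients.
    product = np.fft.irfft(fa * fb, n)
    return product[:out_len]

# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
coeffs = poly_multiply([1, 2], [3, 4])
```

The result is floating-point, so integer coefficients should be rounded if exact values are needed.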
Pricing of early-exercise Asian options under Lévy processes based on Fourier cosine expansions
In this article, we propose a pricing method for Asian options with early-exercise
features. It is based on a two-dimensional integration and a backward recursion of the
Fourier coefficients, in which several numerical techniques, like Fourier cosine expansions,
Clenshaw–Curtis quadrature and the Fast Fourier Transform (FFT) are employed. Rapid
convergence of the pricing method is illustrated by an error analysis. Its performance is
further demonstrated by various numerical examples, where we also show the power of
an implementation on Graphics Processing Units (GPUs).
Journal Staff
The fast Fourier transform (FFT) plays an important role in digital signal processing (DSP) applications, and its implementation involves a large number of computations. Many DSP designers have implemented FFT algorithms on different devices, such as the central processing unit (CPU), the field-programmable gate array (FPGA), and the graphics processing unit (GPU), in order to improve performance. We selected the GPU for our implementations of the FFT algorithm because GPU hardware has a highly parallel structure consisting of many hundreds of small parallel processing units. Such a parallel device can be programmed with CUDA (Compute Unified Device Architecture), a parallel programming language. In this thesis, we propose different implementations of the FFT algorithm on NVIDIA GPUs using CUDA. We study and analyze the different approaches and use different techniques to accelerate the FFT computations. We also discuss the results and compare the approaches and techniques. Finally, we compare our best results with the cuFFT library, a library dedicated to computing the FFT on NVIDIA GPUs.
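For orientation, the FFT that such implementations parallelize can be written as a short recursive radix-2 routine. This is a CPU sketch in NumPy under the assumption of a power-of-two input length; it is not taken from the thesis and says nothing about its CUDA kernels. The name `fft_radix2` is hypothetical.

```python
import numpy as np

def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch)."""
    x = np.asarray(x, dtype=complex)
    n = x.shape[0]
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("input length must be a power of two")
    if n == 1:
        return x
    # Transform the even- and odd-indexed halves independently.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    # Twiddle factors combine the two half-size transforms (the butterfly step).
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])
```

The two half-size transforms at each level are independent of each other, which is the property parallel hardware exploits.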
Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems
Accurately predicting the performance of coronagraphs and tolerancing optical
surfaces for high-contrast imaging requires a detailed accounting of
diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and
far-field diffraction effects, such as the Talbot effect, are captured by
plane-to-plane propagation using Fresnel and angular spectrum propagation. This
approach requires a sequence of computationally intensive Fourier transforms
and quadratic phase functions, which limit the design and aberration
sensitivity parameter space that can be explored at high fidelity in the
course of coronagraph design. This study presents the results of optimizing the
multi-surface propagation module of the open source Physical Optics Propagation
in PYthon (POPPY) package. This optimization was performed by implementing and
benchmarking Fourier transforms and array operations on graphics processing
units, as well as optimizing multithreaded numerical calculations using the
NumExpr python library where appropriate, to speed the end-to-end simulation of
observatory and coronagraph optical systems. Using realistic systems, this
study demonstrates a greater than five-fold decrease in wall-clock runtime over
POPPY's previous implementation and describes opportunities for further
improvements in diffraction modeling performance. Comment: Presented at SPIE ASTI 2018, Austin, Texas. 11 pages, 6 figures
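The plane-to-plane angular spectrum propagation described in this abstract reduces to a forward FFT, a multiplication by a phase transfer function, and an inverse FFT. The sketch below shows that textbook formulation in NumPy. It is not POPPY's API or implementation; the function name, argument names, and the choice to discard evanescent components are assumptions for illustration.

```python
import numpy as np

def angular_spectrum_propagate(field, wavelength, dx, z):
    """Propagate a sampled complex field a distance z (illustrative sketch).

    field: 2-D complex array sampled on a square grid with pitch dx.
    wavelength, dx, z: lengths in the same unit (hypothetical parameter names).
    """
    ny, nx = field.shape
    # Spatial frequencies of the sampling grid, in cycles per unit length.
    fx = np.fft.fftfreq(nx, d=dx)
    fy = np.fft.fftfreq(ny, d=dx)
    FX, FY = np.meshgrid(fx, fy)
    # Squared longitudinal spatial frequency; negative values are evanescent.
    arg = 1.0 / wavelength**2 - FX**2 - FY**2
    kz = 2.0 * np.pi * np.sqrt(np.maximum(arg, 0.0))
    # Phase-only transfer function; evanescent components are set to zero.
    transfer = np.where(arg > 0, np.exp(1j * kz * z), 0.0)
    return np.fft.ifft2(np.fft.fft2(field) * transfer)
```

Each propagation costs two 2-D FFTs and one elementwise multiply, which is why moving these array operations to a GPU can pay off for multi-surface systems.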
A Multi-GPU Programming Library for Real-Time Applications
We present MGPU, a C++ programming library targeted at single-node multi-GPU
systems. Such systems combine disproportionate floating point performance with
high data locality and are thus well suited to implement real-time algorithms.
We describe the library design, programming interface and implementation
details in light of this specific problem domain. The core concepts of this
work are a novel kind of container abstraction and MPI-like communication
methods for intra-system communication. We further demonstrate how MGPU is used
as a framework for porting existing GPU libraries to multi-device
architectures. Putting our library to the test, we accelerate an iterative
non-linear image reconstruction algorithm for real-time magnetic resonance
imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs
and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us
to conclude that multi-GPU systems are a viable solution for real-time MRI
reconstruction as well as signal-processing applications in general. Comment: 15 pages, 10 figures
Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators
The Radon transform and its adjoint, the back-projection operator, can both
be expressed as convolutions in log-polar coordinates. Hence, fast algorithms
for the application of the operators can be constructed by using FFT, if data
is resampled at log-polar coordinates. Radon data is typically measured on an
equally spaced grid in polar coordinates, and reconstructions are represented
(as images) in Cartesian coordinates. Therefore, in addition to FFT, several
steps of interpolation have to be conducted in order to apply the Radon
transform and the back-projection operator by means of convolutions.
Both the interpolation and the FFT operations can be efficiently implemented
on Graphical Processor Units (GPUs). For the interpolation, it is possible to
make use of the fact that linear interpolation is hard-wired on GPUs, meaning
that it has the same computational cost as direct memory access. Cubic order
interpolation schemes can be constructed by combining linear interpolation
steps, which provides a significant computational speedup.
We provide details about how the Radon transform and the back-projection can
be implemented efficiently as convolution operators on GPUs. For large data
sizes, speedups of about 10 times are obtained in relation to the computational
times of other software packages based on GPU implementations of the Radon
transform and the back-projection operator. Moreover, speedups of more than
1000 times are obtained against the CPU implementations provided in the MATLAB
image processing toolbox.
Benchmarking CPUs and GPUs on embedded platforms for software receiver usage
Smartphones containing multi-core central processing units (CPUs) and powerful many-core graphics processing units (GPUs) bring supercomputing technology into our pockets (or into our embedded devices). This can be exploited to produce power-efficient, customized receivers with flexible correlation schemes and more advanced positioning techniques. For example, promising techniques such as the Direct Position Estimation paradigm or tracking solutions based on particle filtering seem very appealing in challenging environments but are also computationally demanding. This article sheds some light on recent embedded processor developments, benchmarks Fast Fourier Transform (FFT) and correlation algorithms on representative embedded platforms, and relates the results to their use in GNSS software radios. The use of embedded CPUs for signal tracking seems to be straightforward, but more research is required to fully achieve the nominal peak performance of an embedded GPU for FFT computation. Electrical power consumption is also measured at certain load levels. Peer Reviewed. Postprint (published version)