3,374 research outputs found

FourierPIM: High-Throughput In-Memory Fast Fourier Transform and Polynomial Multiplication

Author: Boneh Yahav
Gazit Gonen
Kvatinsky Shahar
Leitersdorf Orian
Ronen Ronny
Publication venue: 'Elsevier BV'
Publication date: 05/04/2023

The Discrete Fourier Transform (DFT) is essential for various applications ranging from signal processing to convolution and polynomial multiplication. The groundbreaking Fast Fourier Transform (FFT) algorithm reduces DFT time complexity from the naive O(n^2) to O(n log n), and recent works have sought further acceleration through parallel architectures such as GPUs. Unfortunately, accelerators such as GPUs cannot exploit their full computing capabilities because memory access becomes the bottleneck. Therefore, this paper accelerates the FFT algorithm using digital Processing-in-Memory (PIM) architectures that shift computation into the memory by exploiting physical devices capable of both storage and logic (e.g., memristors). We propose an O(log n) in-memory FFT algorithm that can also be performed in parallel across multiple arrays for high-throughput batched execution, supporting both fixed-point and floating-point numbers. Through the convolution theorem, we extend this algorithm to O(log n) polynomial multiplication, a fundamental task for applications such as cryptography. We evaluate FourierPIM on a publicly available cycle-accurate simulator that verifies both correctness and performance, and demonstrate 5-15x throughput and 4-13x energy improvement over the NVIDIA cuFFT library on state-of-the-art GPUs for FFT and polynomial multiplication.
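The convolution-theorem route from FFT to polynomial multiplication mentioned in the abstract can be illustrated with a minimal sketch. This is the textbook sequential O(n log n) radix-2 algorithm in plain Python, not the in-memory O(log n) FourierPIM algorithm, whose details the abstract does not give. The function names `fft` and `poly_multiply` are chosen here for illustration, and integer coefficients are assumed so that the result can be rounded.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(a) must be a power of two.

    With invert=True the conjugate twiddle factors are used; the caller
    is responsible for dividing the result by n.
    """
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)  # twiddle factor
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


def poly_multiply(p, q):
    """Multiply two integer-coefficient polynomials (lowest degree first)."""
    result_len = len(p) + len(q) - 1
    n = 1
    while n < result_len:  # pad to a power of two that holds the product
        n *= 2
    fp = fft(list(p) + [0] * (n - len(p)))
    fq = fft(list(q) + [0] * (n - len(q)))
    # Convolution theorem: pointwise product in the frequency domain.
    prod = fft([x * y for x, y in zip(fp, fq)], invert=True)
    return [round((c / n).real) for c in prod[:result_len]]
```

For example, `poly_multiply([1, 2, 3], [4, 5])` multiplies 1 + 2x + 3x^2 by 4 + 5x and returns `[4, 13, 22, 15]`.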

arXiv.org e-Print Archive

Directory of Open Access Journals

Pricing of early-exercise Asian options under Lévy processes based on Fourier cosine expansions

Author: Oosterlee C.W. (Kees)
Zhang B. (Bo)
Publication venue: 'Elsevier BV'
Publication date: 01/01/2014

In this article, we propose a pricing method for Asian options with early-exercise features. It is based on a two-dimensional integration and a backward recursion of the Fourier coefficients, in which several numerical techniques, such as Fourier cosine expansions, Clenshaw–Curtis quadrature and the Fast Fourier Transform (FFT), are employed. Rapid convergence of the pricing method is illustrated by an error analysis. Its performance is further demonstrated by various numerical examples, where we also show the power of an implementation on Graphics Processing Units (GPUs).

CWI's Institutional Repository

Journal Staff

Author: Sreehari Ambuluri
Publication venue: Duke University School of Law
Publication date: 01/01/2012

The fast Fourier transform (FFT) plays an important role in digital signal processing (DSP) applications, and its implementation involves a large number of computations. Many DSP designers have worked on implementations of FFT algorithms on different devices, such as the central processing unit (CPU), field-programmable gate array (FPGA), and graphics processing unit (GPU), in order to accelerate performance. We selected the GPU for our implementations of the FFT algorithm because GPU hardware is designed with a highly parallel structure consisting of many hundreds of small parallel processing units. Such a parallel device can be programmed with CUDA (Compute Unified Device Architecture), a parallel programming language. In this thesis, we propose different implementations of the FFT algorithm on the NVIDIA GPU using the CUDA programming language. We study and analyze the different approaches, and use different techniques to accelerate the computation of the FFT. We also discuss the results and compare the different approaches and techniques. Finally, we compare our best results with the CUFFT library, a dedicated library for computing the FFT on NVIDIA GPUs.

Publikationer från Linköpings universitet

Duke Law Scholarship Repository

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems

Author: Abdellah
Akeret
Cooke
Cooley
Douglas
Douglas
Fangohr
Frigo
Greenfield
Greenfield
Hirst
Jones
Kluyver
Lawrence
Lumbres
Macintosh
Marois
Mendillo
Morgan
Noecker
Pavlyk
Perrin
Shimobaba
Soummer
Steinbach
Stone
Yamamoto
Publication venue: 'SPIE-Intl Soc Optical Eng'
Publication date: 17/06/2018

Accurately predicting the performance of coronagraphs and tolerancing optical surfaces for high-contrast imaging requires a detailed accounting of diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and far-field diffraction effects, such as the Talbot effect, are captured by plane-to-plane propagation using Fresnel and angular spectrum propagation. This approach requires a sequence of computationally intensive Fourier transforms and quadratic phase functions, which limits the design and aberration sensitivity parameter space that can be explored at high fidelity in the course of coronagraph design. This study presents the results of optimizing the multi-surface propagation module of the open source Physical Optics Propagation in PYthon (POPPY) package. This optimization was performed by implementing and benchmarking Fourier transforms and array operations on graphics processing units, as well as optimizing multithreaded numerical calculations using the NumExpr Python library where appropriate, to speed up the end-to-end simulation of observatory and coronagraph optical systems. Using realistic systems, this study demonstrates a greater than five-fold decrease in wall-clock runtime over POPPY's previous implementation and describes opportunities for further improvements in diffraction modeling performance.

Comment: Presented at SPIE ASTI 2018, Austin, Texas. 11 pages, 6 figures

arXiv.org e-Print Archive

Crossref

A Multi-GPU Programming Library for Real-Time Applications

Author: Schaetz Sebastian
Uecker Martin
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

We present MGPU, a C++ programming library targeted at single-node multi-GPU systems. Such systems combine disproportionate floating point performance with high data locality and are thus well suited to implement real-time algorithms. We describe the library design, programming interface and implementation details in light of this specific problem domain. The core concepts of this work are a novel kind of container abstraction and MPI-like communication methods for intra-system communication. We further demonstrate how MGPU is used as a framework for porting existing GPU libraries to multi-device architectures. Putting our library to the test, we accelerate an iterative non-linear image reconstruction algorithm for real-time magnetic resonance imaging using multiple GPUs. We achieve a speed-up of about 1.7 using 2 GPUs and reach a final speed-up of 2.1 with 4 GPUs. These promising results lead us to conclude that multi-GPU systems are a viable solution for real-time MRI reconstruction as well as signal-processing applications in general.

Comment: 15 pages, 10 figures

arXiv.org e-Print Archive

MPG.PuRe

Fast algorithms and efficient GPU implementations for the Radon transform and the back-projection operator represented as convolution operators

Author: Andersson Fredrik
Carlsson Marcus
Nikitin Viktor V.
Publication venue
Publication date: 29/05/2015

The Radon transform and its adjoint, the back-projection operator, can both be expressed as convolutions in log-polar coordinates. Hence, fast algorithms for the application of the operators can be constructed by using FFT, if data is resampled at log-polar coordinates. Radon data is typically measured on an equally spaced grid in polar coordinates, and reconstructions are represented (as images) in Cartesian coordinates. Therefore, in addition to FFT, several steps of interpolation have to be conducted in order to apply the Radon transform and the back-projection operator by means of convolutions. Both the interpolation and the FFT operations can be efficiently implemented on Graphical Processor Units (GPUs). For the interpolation, it is possible to make use of the fact that linear interpolation is hard-wired on GPUs, meaning that it has the same computational cost as direct memory access. Cubic order interpolation schemes can be constructed by combining linear interpolation steps, which provides a significant computational speedup. We provide details about how the Radon transform and the back-projection can be implemented efficiently as convolution operators on GPUs. For large data sizes, speedups of about 10 times are obtained relative to the computational times of other software packages based on GPU implementations of the Radon transform and the back-projection operator. Moreover, speedups of more than 1000 times are obtained against the CPU implementations provided in the MATLAB image processing toolbox.
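The remark that cubic interpolation can be built from linear interpolation steps can be illustrated with a small one-dimensional sketch. The abstract does not specify the authors' scheme, so the code below assumes one well-known construction: cubic B-spline weights, with the four-tap sum regrouped into two linearly interpolated fetches. The helper `lerp_fetch` is a hypothetical CPU stand-in for a hardware texture fetch, and all function names are illustrative.

```python
import math


def lerp_fetch(samples, x):
    """Emulate a hardware linearly interpolated fetch at fractional index x."""
    i = math.floor(x)
    frac = x - i
    hi = min(i + 1, len(samples) - 1)  # clamp at the right edge
    return (1.0 - frac) * samples[i] + frac * samples[hi]


def bspline_weights(t):
    """Cubic B-spline weights for fractional offset t in [0, 1)."""
    w0 = (1.0 - t) ** 3 / 6.0
    w1 = (3.0 * t**3 - 6.0 * t**2 + 4.0) / 6.0
    w2 = (-3.0 * t**3 + 3.0 * t**2 + 3.0 * t + 1.0) / 6.0
    w3 = t**3 / 6.0
    return w0, w1, w2, w3


def cubic_direct(samples, x):
    """Reference: explicit four-tap weighted sum (four memory reads)."""
    i = math.floor(x)
    w = bspline_weights(x - i)
    return sum(wk * samples[i - 1 + k] for k, wk in enumerate(w))


def cubic_via_lerp(samples, x):
    """Same result from two linear fetches at shifted positions."""
    i = math.floor(x)
    w0, w1, w2, w3 = bspline_weights(x - i)
    g0, g1 = w0 + w1, w2 + w3  # combined weight of each tap pair
    left = lerp_fetch(samples, i - 1 + w1 / g0)   # blends taps i-1 and i
    right = lerp_fetch(samples, i + 1 + w3 / g1)  # blends taps i+1 and i+2
    return g0 * left + g1 * right
```

For interior positions (1 <= x < len(samples) - 2) the two functions agree up to floating-point rounding, while `cubic_via_lerp` needs only two fetches instead of four. In two dimensions the same regrouping would reduce sixteen reads to four bilinear fetches.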

arXiv.org e-Print Archive

Lund University Publications

Benchmarking CPUs and GPUs on embedded platforms for software receiver usage

Author: Bär W.
Closas Gómez Pau
Dampf J.
Fürlinger K.
García Molina J. A.
Pany T.
Stöber C.
Winkel J.
Publication venue
Publication date: 01/01/2015

Smartphones containing multi-core central processing units (CPUs) and powerful many-core graphics processing units (GPUs) bring supercomputing technology into your pocket (or into our embedded devices). This can be exploited to produce power-efficient, customized receivers with flexible correlation schemes and more advanced positioning techniques. For example, promising techniques such as the Direct Position Estimation paradigm or tracking solutions based on particle filtering seem very appealing in challenging environments but are also computationally quite demanding. This article sheds some light on recent embedded processor developments, benchmarks Fast Fourier Transform (FFT) and correlation algorithms on representative embedded platforms, and relates the results to their use in GNSS software radios. The use of embedded CPUs for signal tracking seems to be straightforward, but more research is required to fully achieve the nominal peak performance of an embedded GPU for FFT computation. Electrical power consumption is also measured at certain load levels.

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC