689 research outputs found

    Low-power Programmable Processor for Fast Fourier Transform Based on Transport Triggered Architecture

    Get PDF
    This paper describes a low-power processor tailored for fast Fourier transform computations where transport triggering template is exploited. The processor is software-programmable while retaining an energy-efficiency comparable to existing fixed-function implementations. The power savings are achieved by compressing the computation kernel into one instruction word. The word is stored in an instruction loop buffer, which is more power-efficient than regular instruction memory storage. The processor supports all power-of-two FFT sizes from 64 to 16384 and given 1 mJ of energy, it can compute 20916 transforms of size 1024.Comment: 5 pages, 4 figures, 1 table, ICASSP 2019 conferenc

    Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP

    Get PDF
    Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and Flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular compared to a traditional power-of-two FFT. The results of the implementation show the processing time, accuracy, energy consumption and Flexibility of the implementation

    Portable and efficient FFT and DCT algorithms with the Heterogeneous Butterfly Processing Library

    Get PDF
    Versión final aceptada de: https://doi.org/10.1016/j.jpdc.2018.11.011This version of the article: Vázquez, S., Amor, M., Fraguela, B. B. (2019). 'Portable and efficient FFT and DCT algorithms with the heterogeneous butterfly processing library', has been accepted for publication in Journal of Parallel and Distributed Computing, 125, 135–146. The Version of Record is available online at https://doi.org/10.1016/j.jpdc.2018.11.011.[Abstract]: The existence of a wide variety of computing devices with very different properties makes essential the development of software that is not only portable among them, but which also adapts to the properties of each platform. In this paper, we present the Heterogeneous Butterfly Processing Library (HBPL), which provides optimized portable kernels for problems of small sizes that allow using orthogonal transform algorithms such as the FFT and DCT on different accelerators and regular CPUs. Our library is implemented on the OpenCL standard, which provides portability on a large number of platforms. Furthermore, high performance is achieved on a wide range of devices by exploiting run-time code generation and metaprogramming guided by a parametrization strategy. An exhaustive evaluation on different platforms shows that our proposal obtains competitive or better performance than related libraries.This research has received financial support from the Ministerio de Economía y Competitividad of Spain and European Regional Development Fund (ERDF) funds (80%) of the EU (TIN2016-75845-P), by the Consellería de Cultura, Educación e Ordenación Universitaria, Xunta de Galicia co-founded by European Regional Development Fund (ERDF) funds under the Consolidation Programme of Competitive Reference Groups (Ref. ED431C 2017/04) and the Consolidation Programme of Competitive Research Units (Ref. R2014/049 and Ref. R2016/037) as well as by the Consellería de Cultura, Educación e Ordenación Universitaria, Xunta de Galicia (Centro Singular de Investigación de Galicia accreditation 2016–2019) and the European Union (European Regional Development Fund, ERDF) under Grant Ref. ED431G/01.Xunta de Galicia; ED431C 2017/04Xunta de Galicia; ED431G/01Xunta de Galicia; R2014/049Xunta de Galicia; R2016/03

    Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

    Full text link
    Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints in the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.Comment: 12 pages, 13 figures, to appear in IISWC 202

    Improving Mobile SOC\u27s Performance as an Energy Efficient DSP Platform with Heterogeneous Computing

    Get PDF
    Mobile system-on-chip (SOC) technology is improving at a staggering rate spurred primarily by the adoption of smartphones and tablets. This rapid innovation has allowed the mobile SOC to be considered in everything from high performance computing to embedded applications. In this work, modern SOC\u27s heterogeneous computing capabilities are evaluated with a focus toward digital signal processing (DSP). Evaluation is conducted on modern consumer devices running Android operating system and leveraging the relatively new RenderScript Compute to utilize CPU resources alongside other compute resources such as graphics processing units (GPUs) and digital signal processors. In order to benchmark these concepts, several implementations of both the discrete Fourier transform (DFT) and the fast Fourier transform (FFT) are tested across devices. The results show both improvement in performance and energy efficiency on many devices compared to traditional Java implementations and indicate that the mobile SOC is a relevant platform for DSP applications

    Solving Large Problem Sizes of Index-Digit Algorithms on GPU: FFT and Tridiagonal System Solvers

    Get PDF
    [Abstract] Current Graphics Processing Units (GPUs) are capable of obtaining high computational performance in scientific applications. Nevertheless, programmers have to use suitable parallel algorithms for these architectures and usually have to consider optimization techniques in the implementation in order to achieve said performance. There are many efficient proposals for limited-size problems which fit directly in the shared memory of CUDA GPUs, however, there are few GPU proposals that tackle the design of efficient algorithms for large problem sizes that exceed shared memory storage capacity. In this work, we present a tuning strategy that addresses this problem for some parallel prefix algorithms that can be represented according to a set of common permutations of the digits of each of its element indices [1], denoted as Index-Digit (ID) algorithms. Specifically, our strategy has been applied to develop flexible Multi-Stage (MS) algorithms for the Fast Fourier Transform (FFT) algorithm (MS-ID-FFT) and a tridiagonal system solver (MS-ID-TS) on the GPU. The resulting implementation is compact and outperforms other well-known and commonly used state-of-the-art libraries, with an improvement of up to 1.47x with respect to NVIDIA's complex CUFFT, and up to 33.2x in comparison with NVIDIA's CUSPARSE for real data tridiagonal systems

    Design and Evaluation of a Scalable Engine for 3D-FFT Computation in an FPGA Cluster

    Get PDF
    The Three Dimensional Fast Fourier Transform (3D-FFT) is commonly used to solve the partial differential equations describing the system evolution in several physical phenomena, such as the motion of viscous fluids described by the Navier–Stokes equations. Simulation of such problems requires the use of a parallel High-Performance Computing architecture since the size of the problem grows with the cube of the FFT size, and the representation of the single point comprises several double precision floating- point complex numbers. Modern High-Performance Computing (HPC) systems are considering the inclusion of FPGAs as components of this computing architecture because they can combine effective hardware acceleration capabilities and dedicated communication facilities. Furthermore, the network topology can be optimized for the specific calculation that the cluster must perform, especially in the case of algorithms limited by the data exchange delay between the processors. In this paper, we explore an HPC design that uses FPGA accelerators to compute the 3DFFT. We devise a scalable FFT engine based on a custom radix-2 double-precision core that is used to implement the Decimation in Frequency version of the Cooley–Tukey FFT algorithm. The FFT engine can be adapted to different technology constraints and networking topologies by adjusting the number of cores and configuration parameters in order to minimize the overall calculation time. We compare the various possible configurations with the technological limits of available hardware. Finally, we evaluate the bandwidth required for continuous FFT execution in the APEnet toroidal mesh network.
    corecore