

HIGH-SPEED CO-PROCESSORS BASED ON REDUNDANT NUMBER SYSTEMS

Author: Kaivani Amir
Publication venue: 'University of Saskatchewan Library'

There is a growing demand for high-speed arithmetic co-processors for use in applications with computationally intensive tasks. For instance, Fast Fourier Transform (FFT) co-processors are used in real-time multimedia services, and financial applications use decimal co-processors to perform large amounts of decimal computation. Using redundant number systems to eliminate word-wide carry propagation within interim operations is a well-known technique for increasing the speed of arithmetic hardware units. Redundant number systems are most useful in applications where many consecutive arithmetic operations are performed before the final result is produced, which makes them advantageous for arithmetic co-processors. This thesis discusses the implementation of two popular arithmetic co-processors based on redundant number systems: the binary FFT co-processor and the decimal arithmetic co-processor. FFT co-processors consist of several consecutive multipliers and adders over complex numbers. FFT architectures are implemented using either fixed-point or floating-point arithmetic. The main advantage of floating-point over fixed-point arithmetic is its wide dynamic range. Moreover, it avoids numerical issues such as scaling and overflow/underflow concerns, at the expense of higher cost. Furthermore, a floating-point implementation allows an FFT co-processor to collaborate with general-purpose processors, offloading computationally intensive tasks from the primary processor. The first part of this thesis, which is devoted to FFT co-processors, proposes a new FFT architecture that uses a new Binary-Signed Digit (BSD) carry-limited adder, a new floating-point BSD multiplier and a new floating-point BSD three-operand adder. Finally, a new unit labeled Fused-Dot-Product-Add (FDPA) is designed to compute AB+CD+E over floating-point BSD operands. The second part of the thesis discusses decimal arithmetic operations implemented in hardware using redundant number systems. These operations are widely used in decimal floating-point co-processors. A new signed-digit decimal adder is proposed, along with a sequential decimal multiplier that uses redundant number systems to increase the operating frequency of the multiplier. New redundant decimal division and square-root units are also proposed. The architectures proposed in this thesis were all implemented in a hardware description language (Verilog) and synthesized using Synopsys Design Compiler. The evaluation results show the speed improvement of the new arithmetic units over previous pertinent works: the FFT and decimal co-processors designed in this thesis run at least 10% faster than previous designs. These architectures are meant to fulfill the demand for the high-speed co-processors required in various applications such as multimedia services and financial computations.
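The abstract's point about eliminating word-wide carry propagation can be illustrated with the textbook carry-free addition rule for radix-2 signed digits. The sketch below is not the thesis's new carry-limited adder; it is a minimal software model under the assumption that digits are stored least-significant first and drawn from {-1, 0, 1}. The names `bsd_add` and `bsd_value` are hypothetical.

```python
def bsd_add(x, y):
    """Add two binary signed-digit numbers (digits in {-1, 0, 1}, LSB first).

    Each output digit depends only on positions i, i-1 and i-2 of the inputs,
    so no carry ripples across the whole word.
    """
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))
    y = list(y) + [0] * (n - len(y))
    p = [a + b for a, b in zip(x, y)]      # position sums, each in [-2, 2]

    transfer = [0] * (n + 1)               # transfer[i] enters position i
    interim = [0] * n                      # interim sum digit at position i
    for i in range(n):
        # The lower position can only send a non-negative transfer if its
        # position sum is non-negative (and only a non-positive one otherwise).
        lower_nonneg = i == 0 or p[i - 1] >= 0
        if p[i] == 2:
            t, w = 1, 0
        elif p[i] == 1:
            t, w = (1, -1) if lower_nonneg else (0, 1)
        elif p[i] == 0:
            t, w = 0, 0
        elif p[i] == -1:
            t, w = (0, -1) if lower_nonneg else (-1, 1)
        else:  # p[i] == -2
            t, w = -1, 0
        transfer[i + 1] = t
        interim[i] = w

    # interim[i] + transfer[i] always stays within {-1, 0, 1} by construction.
    return [interim[i] + transfer[i] for i in range(n)] + [transfer[n]]


def bsd_value(digits):
    """Integer value of a signed-digit string stored LSB first."""
    return sum(d * 2 ** i for i, d in enumerate(digits))


if __name__ == "__main__":
    a, b = [1, 1, 1, 1], [1, 0, 0, 0]      # 15 and 1
    s = bsd_add(a, b)
    print(s, bsd_value(s))
```

With ordinary binary, 15 + 1 ripples a carry through every bit; here each result digit is decided locally, which is the property the redundant representation trades extra digit states for.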

eCommons@USASK

University of Saskatchewan Research Archive

A Many-Core Overlay for High-Performance Embedded Computing on FPGAs

Author: Neto Horácio
Véstias Mário
Publication date: 21/08/2014

In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results. Comment: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423)

arXiv.org e-Print Archive

Repositório Científico do Instituto Politécnico de Lisboa

Performance Evaluation of cuDNN Convolution Algorithms on NVIDIA Volta GPUs

Author: Jordà Marc
Peña Monferrer Antonio José
Valero-Lara Pedro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019

Convolutional neural networks (CNNs) have recently attracted considerable attention due to their outstanding accuracy in applications such as image recognition and natural language processing. While one advantage of CNNs over other types of neural networks is their reduced computational cost, faster execution is still desired for both training and inference. Since convolution operations account for most of the execution time, multiple algorithms have been and are being developed to accelerate this type of operation. However, due to the wide range of convolution parameter configurations used in CNNs and the possible data type representations, it is not straightforward to assess in advance which of the available algorithms will perform best in each particular case. In this paper, we present a performance evaluation of the convolution algorithms provided by cuDNN, the library used by most deep learning frameworks for their GPU operations. In our analysis, we leverage the convolution parameter configurations from widely used CNNs and discuss which algorithms are better suited depending on the convolution parameters for both 32-bit and 16-bit floating-point (FP) data representations. Our results show that the filter size and the number of inputs are the most significant parameters when selecting a GPU convolution algorithm for 32-bit FP data. For 16-bit FP, leveraging specialized arithmetic units (NVIDIA Tensor Cores) is key to obtaining the best performance. This work was supported by the European Union's Horizon 2020 Research and Innovation Program under Marie Sklodowska-Curie Grant 749516, and in part by the Spanish Juan de la Cierva Grant IJCI-2017-33511. Peer reviewed. Postprint (published version).

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Non-generic floating-point software support for embedded media processing

Author: Jeannerod Claude-Pierre
Jourdan-Lu Jingyan
Monat Christophe
Publication venue: HAL CCSD
Publication date: 01/01/2012

This paper presents some work in progress on the design and implementation of efficient floating-point software support for embedded integer processors. We provide quantitative evidence of the benefits of supporting various non-generic (that is, specialized, fused, or simultaneous) operations in addition to the five basic arithmetic operations: for individual calls, speedups range from 1.12 to 4.86, while on DSP kernels and benchmarks, our approach allows us to be up to 1.34x faster.

HAL-ENS-LYON

INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

EFFICIENT FLOATING POINT FAST FOURIER TRANSFORM BUTTERFLY ARCHITECTURE USING BINARY SIGNED DIGIT MULTIPLIER AND ADDERS

Author: Acharya Shivani
Beulet Augusta Sophy
Publication venue: 'Innovare Academic Sciences Pvt Ltd'
Publication date: 01/04/2017

The fast Fourier transform (FFT) is one of the most important tools in digital signal processing and communication systems, because it makes transforming a signal from the time domain to the S-plane very convenient. The FFT uses various techniques to convert a signal from the time domain to the S-domain and back; this paper focuses on the butterfly technique. The butterfly technique uses additions and multiplications of operands to produce the required output. Floating-point (FP) operands are used because of their flexibility. Because FP computations are slow, we use the binary signed digit (BSD) representation, which takes less time for addition and subtraction. A three-bit BSD adder and an FP adder together form a fused dot product add (FDPA) unit. In the FDPA unit, addition and subtraction form one group and multiplication forms another, and their respective results are then fused. Modified Booth encoding and decoding algorithms are used to simplify the complex multiplication.
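To make the link between the butterfly and the dot-product-add form concrete, here is a minimal sketch assuming a radix-2 butterfly that computes x + w·y and x − w·y. Each real and imaginary output then has the shape ab + cd + e. The helper names `fdpa` and `butterfly` are hypothetical, and the software version rounds after every operation, whereas a fused hardware unit would presumably round once.

```python
def fdpa(a, b, c, d, e):
    """Dot-product-add a*b + c*d + e (unfused software model)."""
    return a * b + c * d + e


def butterfly(x, y, w):
    """Radix-2 butterfly: returns (x + w*y, x - w*y) using four FDPA calls."""
    top = complex(
        fdpa(w.real, y.real, -w.imag, y.imag, x.real),   # Re(x + w*y)
        fdpa(w.real, y.imag, w.imag, y.real, x.imag),    # Im(x + w*y)
    )
    bottom = complex(
        fdpa(-w.real, y.real, w.imag, y.imag, x.real),   # Re(x - w*y)
        fdpa(-w.real, y.imag, -w.imag, y.real, x.imag),  # Im(x - w*y)
    )
    return top, bottom


if __name__ == "__main__":
    x, y, w = 1 + 2j, 3 - 1j, 0.5 + 0.5j
    print(butterfly(x, y, w))
```

Writing the butterfly this way shows why a single unit that evaluates ab + cd + e covers the whole operation: four invocations with sign changes on the twiddle factor produce both complex outputs.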

Innovare Academic Sciences: E-Journals

Libsharp - spherical harmonic transforms revisited

Author: Reinecke Martin
Seljebotn Dag Sverre
Publication venue: 'EDP Sciences'
Publication date: 22/04/2013

We present libsharp, a code library for spherical harmonic transforms (SHTs), which evolved from the libpsht library, addressing several of its shortcomings, such as adding MPI support for distributed memory systems and SHTs of fields with arbitrary spin, but also supporting new developments in CPU instruction sets like the Advanced Vector Extensions (AVX) or fused multiply-accumulate (FMA) instructions. The library is implemented in portable C99 and provides an interface that can be easily accessed from other programming languages such as C++, Fortran, Python, etc. Generally, libsharp's performance is at least on par with that of its predecessor; however, significant improvements were made to the algorithms for scalar SHTs, which are roughly twice as fast when using the same CPU capabilities. The library is available at http://sourceforge.net/projects/libsharp/ under the terms of the GNU General Public License.

arXiv.org e-Print Archive

CiteSeerX

Crossref

EDP Sciences OAI-PMH repository (1.2.0)

MPG.PuRe

Fast computing of scattering maps of nanostructures using graphical processing units

Author: Vincent Favre-Nicolin
Johann Coraux
Marie-Ingrid Richard
Hubert Renevier
Publication venue: 'International Union of Crystallography (IUCr)'
Publication date: 02/04/2011

Scattering maps from strained or disordered nano-structures around a Bragg reflection can either be computed quickly using approximations and a (Fast) Fourier transform, or using individual atomic positions. In this article we show that it is possible to compute up to 4×10^10 reflections·atoms/s using a single graphics card, and we evaluate how this speed depends on the number of atoms and points in reciprocal space. An open-source software library (PyNX) allowing easy scattering computations (including grazing incidence conditions) in the Python language is described, with examples of scattering from non-ideal nanostructures. Comment: 7 pages, 4 figures
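The "individual atomic positions" route can be sketched as a direct sum over atoms for every reciprocal-space point, which is why its cost is naturally counted in reflections·atoms. The code below is a plain CPU illustration, not PyNX's API: it assumes the kinematic sum A(s) = Σ_j exp(2πi s·r_j) with unit scattering factors, and the function names are hypothetical.

```python
import cmath


def scattering_amplitude(s, atoms):
    """Sum exp(2*pi*i * s.r) over all atoms, assuming unit scattering factors.

    s     -- reciprocal-space point (h, k, l)
    atoms -- iterable of atomic positions (x, y, z) in fractional coordinates
    """
    total = 0j
    for r in atoms:
        phase = 2 * cmath.pi * (s[0] * r[0] + s[1] * r[1] + s[2] * r[2])
        total += cmath.exp(1j * phase)
    return total


def scattering_map(s_points, atoms):
    """Intensity |A(s)|^2 at each reciprocal-space point.

    The work is len(s_points) * len(atoms) phase evaluations, i.e. the
    reflections x atoms product that the throughput figure refers to.
    """
    return [abs(scattering_amplitude(s, atoms)) ** 2 for s in s_points]


if __name__ == "__main__":
    atoms = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
    print(scattering_map([(0, 0, 0), (1, 0, 0)], atoms))
```

Every (point, atom) pair is independent, so the double loop maps well onto one GPU thread per reciprocal-space point, which is presumably the kind of parallelism a graphics card exploits.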

arXiv.org e-Print Archive

Crossref

Hal - Université Grenoble Alpes

HAL-CEA