10 research outputs found

The Potential for a GPU-Like Overlay Architecture for FPGAs

Author: J. Gregory Steffan
Jeffrey Kingyens
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2011

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath

Crossref

Directory of Open Access Journals

A Many-Core Overlay for High-Performance Embedded Computing on FPGAs

Author: Neto Horácio
Véstias Mário
Publication date: 21/08/2014

In this work, we propose a configurable many-core overlay for high-performance embedded computing. The size of internal memory, supported operations and number of ports can be configured independently for each core of the overlay. The overlay was evaluated with matrix multiplication, LU decomposition and Fast Fourier Transform (FFT) on a ZYNQ-7020 FPGA platform. The results show that using a system-level many-core overlay avoids complex hardware design and still provides good performance results. Comment: Presented at First International Workshop on FPGAs for Software Programmers (FSP 2014) (arXiv:1408.4423)

arXiv.org e-Print Archive

Repositório Científico do Instituto Politécnico de Lisboa

Embedded System Architecture for Mobile Augmented Reality. Sailor Assistance Case Study

Author: Bergmann Neil
Diguet Jean-Philippe
Jean-Christophe Morgère
Publication venue: HAL CCSD
Publication date: 19/02/2013

With upcoming see-through displays, new kinds of Augmented Reality applications are emerging. However, this also raises questions about the design of the associated embedded systems, which must be lightweight and handle object positioning, heterogeneous sensors, wireless communications as well as graphic computation. This paper studies the specific case of a promising Mobile AR processor, which is different from usual graphics applications. A complete architecture is described, designed and prototyped on FPGA. It includes hardware/software partitioning based on the analysis of application requirements. The specification of an original and flexible coprocessor is detailed. Choices as well as optimizations of algorithms are also described. Implementation results and performance evaluation show the relevance of the proposed approach and demonstrate a new kind of architecture focused on object processing and optimized for the AR domain.

HAL-Université de Bretagne Occidentale

A Configurable Shared Scratchpad Memory for GPU-like Processors

Author: Cilardo A.
Donnarumma C.
Gagliardi M.

In recent years, Field Programmable Gate Arrays and Graphics Processing Units have become increasingly important for high-performance computing. In particular, a number of industrial solutions and academic projects are proposing design frameworks based on FPGA-implemented GPU-like compute units. Existing GPU-like core projects provide limited hardware support for shared scratchpad memory, and particularly for the problem of bank conflicts, a major source of performance loss with many parallel kernels. In this paper, we present a configurable scratchpad memory for GPU-like processors with built-in support for bank remapping. The core is fully synthesizable on FPGA with a contained hardware cost. We also validated the presented architecture with a cycle-accurate event-driven emulator written in C++ as well as an RTL simulator tool. Finally, we demonstrated the impact of bank remapping and other parameters available with the proposed configurable shared scratchpad memory by evaluating the performance of two real-world parallelized kernels.

Università degli Studi di Napoli Federico II Open Archive
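
The abstract above names bank conflicts and bank remapping without showing either. As a rough illustration only, the sketch below counts how many lanes of a 16-wide access collide on the same bank under a plain modulo mapping and under an XOR-based remapping. The XOR scheme, the bank count and all function names are assumptions chosen for illustration; the listing does not say which remapping function the paper actually implements.

```python
from collections import Counter

def bank_of(addr, num_banks):
    """Plain mapping: consecutive word addresses go to consecutive banks."""
    return addr % num_banks

def bank_of_remapped(addr, num_banks):
    """Hypothetical remapping: XOR the row index into the bank index.

    Assumes num_banks is a power of two. This is one common swizzle,
    not necessarily the scheme used in the paper.
    """
    return (addr ^ (addr // num_banks)) % num_banks

def conflict_degree(addresses, bank_fn, num_banks):
    """Largest number of simultaneous accesses that hit a single bank."""
    counts = Counter(bank_fn(a, num_banks) for a in addresses)
    return max(counts.values())

NUM_BANKS = 16
# 16 lanes reading one column of a 16-word-wide tile: stride equals NUM_BANKS.
column_access = [lane * NUM_BANKS for lane in range(NUM_BANKS)]

print(conflict_degree(column_access, bank_of, NUM_BANKS))           # 16
print(conflict_degree(column_access, bank_of_remapped, NUM_BANKS))  # 1
```

With the plain mapping every lane lands on bank 0, so the access serializes 16 ways; with the XOR remapping each lane lands on its own bank, while unit-stride accesses remain conflict-free.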

Dedicated object processor for mobile augmented reality - sailor assistance case study

Author: A Munshi
AR Lingley
J Bijker
J Bresenham
Jean-Christophe Morgère
Jean-Philippe Diguet
KH Kim
M Franklin
Neil Bergmann
R Vaslin
S Wuytack
Publication venue: 'Springer Science and Business Media LLC'

Crossref

AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs

Author: Andryc Kevin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018

Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor. The dissertation furthers discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimize away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor.
Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%. To complete this thesis, we augmented a GPGPU cycle-accurate simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks; however, we also show that 64% and 45% of benchmarks exhibited performance decreases when L1D cache was enabled for the 1 SMP and 2 SMP configurations, and only one benchmark showed performance improvement when the L2 cache was enabled.

ScholarWorks@UMass Amherst

A Novel Methodology for Calculating Large Numbers of Symmetrical Matrices on a Graphics Processing Unit: Towards Efficient, Real-Time Hyperspectral Image Processing

Author: Runnels Denise Renee
Publication venue: The Aquila Digital Community
Publication date: 01/05/2013

Hyperspectral imagery (HSI) is often processed to identify targets of interest. Many of the quantitative analysis techniques developed for this purpose mathematically manipulate the data to derive information about the target of interest based on local spectral covariance matrices. The calculation of a local spectral covariance matrix for every pixel in a given hyperspectral data scene is so computationally intensive that real-time processing with these algorithms is not feasible with today’s general purpose processing solutions. Specialized solutions are cost prohibitive, inflexible, inaccessible, or not feasible for on-board applications. Advances in graphics processing unit (GPU) capabilities and programmability offer an opportunity for general purpose computing with access to hundreds of processing cores in a system that is affordable and accessible. The GPU also offers flexibility, accessibility and feasibility that other specialized solutions do not offer. The architecture for the NVIDIA GPU used in this research is significantly different from the architecture of other parallel computing solutions. With such a substantial change in architecture it follows that the paradigm for programming graphics hardware is significantly different from traditional serial and parallel software development paradigms. In this research a methodology for mapping an HSI target detection algorithm to the NVIDIA GPU hardware and Compute Unified Device Architecture (CUDA) Application Programming Interface (API) is developed. The RX algorithm is chosen as a representative stochastic HSI algorithm that requires the calculation of a spectral covariance matrix. The developed methodology is designed to calculate a local covariance matrix for every pixel in the input HSI data scene. A characterization of the limitations imposed by the chosen GPU is given and a path forward toward optimization of a GPU-based method for real-time HSI data processing is defined

Aquila Digital Community (University of Southern Mississippi, USM)
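
The core computation this abstract describes, a local spectral covariance matrix for every pixel, can be sketched on the CPU as follows. This is a minimal illustration, not the thesis's CUDA method: the function name, the nested-list cube layout (`cube[row][col][band]`), the square window clipped at the image edges, and the sample (n - 1) normalisation are all assumptions. It fills only the upper triangle and mirrors it, since the matrix is symmetric.

```python
def local_covariance(cube, row, col, radius):
    """Covariance of the spectra in a square window centred on (row, col).

    cube[r][c] is the list of band values for pixel (r, c). The window is
    clipped at the image borders and must contain at least two pixels.
    """
    rows, cols, bands = len(cube), len(cube[0]), len(cube[0][0])
    window = [cube[r][c]
              for r in range(max(0, row - radius), min(rows, row + radius + 1))
              for c in range(max(0, col - radius), min(cols, col + radius + 1))]
    n = len(window)
    mean = [sum(px[b] for px in window) / n for b in range(bands)]
    cov = [[0.0] * bands for _ in range(bands)]
    for i in range(bands):
        for j in range(i, bands):  # upper triangle only; mirror by symmetry
            s = sum((px[i] - mean[i]) * (px[j] - mean[j]) for px in window)
            cov[i][j] = cov[j][i] = s / (n - 1)
    return cov

# Tiny 2x2 scene with 2 bands, where band 1 is exactly twice band 0.
cube = [[[1.0, 2.0], [3.0, 6.0]],
        [[5.0, 10.0], [7.0, 14.0]]]
cov = local_covariance(cube, 0, 0, 1)
```

Repeating this for every pixel costs on the order of (window size × bands²) operations per pixel, which is the workload the abstract describes as too heavy for real-time processing on general-purpose processors.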

The Potential for a GPU-Like Overlay Architecture for FPGAs

Author: Kingyens Jeffrey
Steffan J. Gregory
Publication venue: University of Toronto
Publication date: 08/02/2018

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath. Peer Reviewed

University of Toronto Research Repository