
    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182× for a Xilinx Virtex-5 LX330T, 1.3-33× for an IBM Cell, and 3-131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
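    As a rough illustration of the data-parallel structure described above (not the paper's generated code), the sketch below evaluates a hypothetical diode-style model independently for every device instance and parallelizes the loop with OpenMP, the strategy the compiler uses on its multi-core target; the model, parameter names and problem size are placeholder assumptions.

```cpp
// Minimal sketch: the same non-linear model is evaluated independently for
// every device instance, which is the data-parallelism exploited on all targets.
#include <cmath>
#include <cstdio>
#include <vector>

struct DeviceParams { float is, vt; };   // hypothetical diode-style parameters
struct DeviceState  { float v;        }; // terminal voltage from the last Newton iterate
struct DeviceOutput { float i, g;     }; // current and conductance to stamp into the matrix

// One model evaluation: identical code runs for every device instance.
static DeviceOutput evaluate_device(const DeviceParams& p, const DeviceState& s) {
    float e = std::exp(s.v / p.vt);
    return { p.is * (e - 1.0f), (p.is / p.vt) * e };
}

int main() {
    const int n = 1 << 16;                                   // number of device instances
    std::vector<DeviceParams> params(n, {1e-14f, 0.02585f});
    std::vector<DeviceState>  state(n, {0.7f});
    std::vector<DeviceOutput> out(n);

    // OpenMP version of the parallelization; unroll factor and vector length are
    // the kind of configuration knobs the paper's performance tuner would sweep.
    #pragma omp parallel for schedule(static)
    for (int d = 0; d < n; ++d)
        out[d] = evaluate_device(params[d], state[d]);

    std::printf("i[0] = %g, g[0] = %g\n", out[0].i, out[0].g);
    return 0;
}
```

    Compiled with -fopenmp, the loop spreads the independent evaluations across all cores; the CUDA, PThreads and VLIW back-ends mentioned in the abstract would map the same per-device independence differently.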

    Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

    The performance and the efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), which are often employed to support the tasks of General Purpose Processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, in order to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm, which is in general a very challenging task. This is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms such as Chambolle, a well-known algorithm employed for optical flow estimation. The implementations of this algorithm found in the state of the art are generally based on GPUs, but barely improve on the performance obtainable with a powerful GPP. In this paper, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. We exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices. Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. We finally show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient, high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.
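    The abstract does not reproduce the update equations, so the sketch below uses the standard Chambolle dual update for the ROF/TV model (the projection step reused inside TV-L1 optical flow) purely to illustrate the nearest-neighbour dependency and data-reuse pattern the paper analyses; it is not the paper's FPGA design, and the image size, lambda and step size are illustrative.

```cpp
// One Chambolle dual-update iteration: each output pixel depends only on a
// small fixed neighbourhood, the locality property that maps well to the
// fine-grained parallelism of FPGAs.
#include <cmath>
#include <vector>

struct Dual { std::vector<float> x, y; };   // dual field p = (p_x, p_y)

void chambolle_iteration(const std::vector<float>& g, Dual& p,
                         int w, int h, float lambda, float tau) {
    auto at = [w](int i, int j) { return j * w + i; };

    // div p with backward differences (zero outside the domain).
    std::vector<float> div(w * h);
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i) {
            float dx = p.x[at(i, j)] - (i > 0 ? p.x[at(i - 1, j)] : 0.0f);
            float dy = p.y[at(i, j)] - (j > 0 ? p.y[at(i, j - 1)] : 0.0f);
            div[at(i, j)] = dx + dy;
        }

    // p <- (p + tau * grad(div p - g/lambda)) / (1 + tau * |grad(div p - g/lambda)|),
    // with forward differences for the gradient.
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i) {
            float c  = div[at(i, j)] - g[at(i, j)] / lambda;
            float gx = (i + 1 < w ? div[at(i + 1, j)] - g[at(i + 1, j)] / lambda : c) - c;
            float gy = (j + 1 < h ? div[at(i, j + 1)] - g[at(i, j + 1)] / lambda : c) - c;
            float mag = std::sqrt(gx * gx + gy * gy);
            p.x[at(i, j)] = (p.x[at(i, j)] + tau * gx) / (1.0f + tau * mag);
            p.y[at(i, j)] = (p.y[at(i, j)] + tau * gy) / (1.0f + tau * mag);
        }
}

int main() {
    const int w = 64, h = 64;
    std::vector<float> g(w * h, 0.5f);                        // placeholder input image
    Dual p{std::vector<float>(w * h, 0.0f), std::vector<float>(w * h, 0.0f)};
    for (int it = 0; it < 10; ++it)
        chambolle_iteration(g, p, w, h, 0.1f, 0.125f);
    return 0;
}
```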

    SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8× speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution, for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5× (1.4-23×) across a range of non-linear device models and Matrix-Solve by 2.4× (0.6-13×) across various benchmark matrices, while delivering a mean combined speedup of 2.8× (0.2-11×) for the two together when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
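    As a toy illustration of the three phases named above (not the thesis's framework or generated hardware), the sketch below runs a Newton-Raphson loop on a single diode-plus-resistor node: Model-Evaluation linearizes the non-linear device at the current guess, Sparse Matrix-Solve degenerates to a one-unknown solve, and Iteration Control decides when to stop.

```cpp
// Toy SPICE-style operating-point solve for one node: Vdd -- R -- (v) -- diode -- ground.
#include <cmath>
#include <cstdio>

int main() {
    const double vdd = 5.0, r = 1e3;          // supply voltage and series resistor
    const double is = 1e-14, vt = 0.02585;    // diode saturation current and thermal voltage
    double v = 0.6;                           // initial guess for the node voltage

    for (int iter = 0; iter < 100; ++iter) {
        // Model-Evaluation: diode current and conductance at the current guess.
        double id = is * (std::exp(v / vt) - 1.0);
        double gd = (is / vt) * std::exp(v / vt);

        // Matrix-Solve: the linearized one-node system (gd + 1/r) * dv = residual.
        double residual = (vdd - v) / r - id;
        double dv = residual / (gd + 1.0 / r);

        // Iteration Control: apply the update and test convergence.
        v += dv;
        if (std::fabs(dv) < 1e-9) {
            std::printf("converged: v = %.6f V after %d iterations\n", v, iter + 1);
            break;
        }
    }
    return 0;
}
```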

    Parallel computing 2011, ParCo 2011: book of abstracts

    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgium.

    Analysis of 3D Cone-Beam CT Image Reconstruction Performance on a FPGA

    Efficient and accurate tomographic image reconstruction has been an intensive topic of research due to its increasing everyday use in areas such as radiology, biology, and materials science. Computed tomography (CT) scans are used to analyze internal structures through the capture of x-ray images. Cone-beam CT scans project a cone-shaped x-ray beam to capture 2D image data from a single focal point while rotating around the object. CT scans are prone to multiple artifacts, including motion blur, streaks, and pixel irregularities, and must therefore be run through image reconstruction software to reduce visual artifacts. The most common algorithm used is the Feldkamp, Davis, and Kress (FDK) backprojection algorithm. The algorithm is computationally intensive due to its O(n⁴) backprojection step, running slowly with large CT data files on CPUs but exceptionally well on GPUs due to the parallel nature of the algorithm. This thesis analyzes the performance of 3D cone-beam CT image reconstruction implemented in OpenCL on an FPGA embedded into a Power System.
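    The sketch below shows a simplified voxel-driven cone-beam backprojection loop to make the O(n⁴) cost concrete: for each of roughly n projection angles, every voxel of an n×n×n volume is updated independently, which is why the step parallelizes so well. The geometry and weighting are an idealized pinhole model, not the exact FDK formulas, and the filtered projection data are placeholders.

```cpp
// Simplified voxel-driven cone-beam backprojection: O(n_angles * n^3) work.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 64;                                  // volume and detector size (per axis)
    const int n_angles = 64;                           // number of projection angles
    const float dso = 500.0f, dsd = 1000.0f;           // source-object / source-detector distances

    std::vector<float> volume(static_cast<size_t>(n) * n * n, 0.0f);
    std::vector<float> proj(static_cast<size_t>(n_angles) * n * n, 1.0f); // placeholder projections

    for (int a = 0; a < n_angles; ++a) {               // ~n projections
        float beta = 6.2831853f * a / n_angles;
        float cb = std::cos(beta), sb = std::sin(beta);
        for (int iz = 0; iz < n; ++iz)                 // n^3 independent voxel updates
            for (int iy = 0; iy < n; ++iy)
                for (int ix = 0; ix < n; ++ix) {
                    float x = ix - n / 2.0f, y = iy - n / 2.0f, z = iz - n / 2.0f;
                    float s = x * cb + y * sb;         // distance toward the source
                    float t = -x * sb + y * cb;        // lateral position
                    float mag = dsd / (dso - s);       // perspective magnification
                    int u = int(t * mag + n / 2.0f);   // detector column
                    int v = int(z * mag + n / 2.0f);   // detector row
                    if (u < 0 || u >= n || v < 0 || v >= n) continue;
                    float w = (dso / (dso - s)) * (dso / (dso - s)); // distance weight
                    volume[(static_cast<size_t>(iz) * n + iy) * n + ix] +=
                        w * proj[(static_cast<size_t>(a) * n + v) * n + u];
                }
    }
    std::printf("center voxel = %g\n",
                volume[(static_cast<size_t>(n / 2) * n + n / 2) * n + n / 2]);
    return 0;
}
```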

    Real-Time Dense Stereo Matching With ELAS on FPGA Accelerated Embedded Devices

    For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception, as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics. Motivated by applications in space and mobile robotics, we implement and evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering one of the best trade-offs between efficiency and accuracy, ELAS has only been shown to run at 1.5-3 fps on a high-end CPU. Our system preserves all intriguing properties of the original algorithm, such as the slanted-plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components on the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.
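    A minimal sketch of the general strategy described above, keeping both halves of a CPU/FPGA SoC busy by pipelining frames: the host post-processes frame k while the accelerator computes frame k+1. The fpga_disparity function is a hypothetical placeholder for the offloaded ELAS stages, stubbed on the CPU so the sketch runs anywhere; it is not the paper's partitioning.

```cpp
// Double-buffered CPU/accelerator pipeline sketch using std::async.
#include <cstdio>
#include <future>
#include <vector>

using Frame = std::vector<float>;

// Hypothetical stand-in for the FPGA-offloaded disparity stages.
Frame fpga_disparity(const Frame& left, const Frame& right) {
    Frame d(left.size());
    for (size_t i = 0; i < d.size(); ++i) d[i] = left[i] - right[i];  // stand-in work
    return d;
}

// Stand-in for the post-processing kept on the host CPU.
void cpu_postprocess(Frame& d) {
    for (float& v : d) v = v < 0.0f ? 0.0f : v;
}

int main() {
    const int n_frames = 8, n_pixels = 640 * 480;
    Frame left(n_pixels, 1.0f), right(n_pixels, 0.5f);        // placeholder rectified pair

    auto pending = std::async(std::launch::async, fpga_disparity, left, right);
    for (int f = 0; f < n_frames; ++f) {
        Frame disparity = pending.get();                      // result for frame f
        if (f + 1 < n_frames)                                 // start frame f+1 on the accelerator
            pending = std::async(std::launch::async, fpga_disparity, left, right);
        cpu_postprocess(disparity);                           // host works while f+1 is in flight
    }
    std::printf("processed %d frames\n", n_frames);
    return 0;
}
```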

    Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation
