
    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3-182× for a Xilinx Virtex-5 LX330T, 1.3-33× for an IBM Cell, and 3-131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
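    As a rough illustration of the data-parallel structure described above (not the paper's generated code), the sketch below evaluates a hypothetical diode-style model independently for every device instance and parallelizes the loop with OpenMP, the strategy the compiler uses on its multi-core target; the model, parameter names and problem size are placeholder assumptions.

```cpp
// Minimal sketch: the same non-linear model is evaluated independently for
// every device instance, which is the data-parallelism exploited on all targets.
#include <cmath>
#include <cstdio>
#include <vector>

struct DeviceParams { float is, vt; };   // hypothetical diode-style parameters
struct DeviceState  { float v;        }; // terminal voltage from the last Newton iterate
struct DeviceOutput { float i, g;     }; // current and conductance to stamp into the matrix

// One model evaluation: identical code runs for every device instance.
static DeviceOutput evaluate_device(const DeviceParams& p, const DeviceState& s) {
    float e = std::exp(s.v / p.vt);
    return { p.is * (e - 1.0f), (p.is / p.vt) * e };
}

int main() {
    const int n = 1 << 16;                                   // number of device instances
    std::vector<DeviceParams> params(n, {1e-14f, 0.02585f});
    std::vector<DeviceState>  state(n, {0.7f});
    std::vector<DeviceOutput> out(n);

    // OpenMP version of the parallelization; unroll factor and vector length are
    // the kind of configuration knobs the paper's performance tuner would sweep.
    #pragma omp parallel for schedule(static)
    for (int d = 0; d < n; ++d)
        out[d] = evaluate_device(params[d], state[d]);

    std::printf("i[0] = %g, g[0] = %g\n", out[0].i, out[0].g);
    return 0;
}
```

    Compiled with -fopenmp, the loop spreads the independent evaluations across all cores; the CUDA, PThreads and VLIW back-ends mentioned in the abstract would map the same per-device independence differently.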

    Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

    The performance and the efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), which are often employed to support the tasks of General Purpose Processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, in order to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm, which is in general a very challenging task. This is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms such as Chambolle, a well-known algorithm employed for optical flow estimation. The implementations of this algorithm found in the state of the art are generally based on GPUs, but barely improve on the performance obtainable with a powerful GPP. In this paper, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. We exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices. Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. We finally show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient, high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.
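    The abstract does not reproduce the update equations, so the sketch below uses the standard Chambolle dual update for the ROF/TV model (the projection step reused inside TV-L1 optical flow) purely to illustrate the nearest-neighbour dependency and data-reuse pattern the paper analyses; it is not the paper's FPGA design, and the image size, lambda and step size are illustrative.

```cpp
// One Chambolle dual-update iteration: each output pixel depends only on a
// small fixed neighbourhood, the locality property that maps well to the
// fine-grained parallelism of FPGAs.
#include <cmath>
#include <vector>

struct Dual { std::vector<float> x, y; };   // dual field p = (p_x, p_y)

void chambolle_iteration(const std::vector<float>& g, Dual& p,
                         int w, int h, float lambda, float tau) {
    auto at = [w](int i, int j) { return j * w + i; };

    // div p with backward differences (zero outside the domain).
    std::vector<float> div(w * h);
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i) {
            float dx = p.x[at(i, j)] - (i > 0 ? p.x[at(i - 1, j)] : 0.0f);
            float dy = p.y[at(i, j)] - (j > 0 ? p.y[at(i, j - 1)] : 0.0f);
            div[at(i, j)] = dx + dy;
        }

    // p <- (p + tau * grad(div p - g/lambda)) / (1 + tau * |grad(div p - g/lambda)|),
    // with forward differences for the gradient.
    for (int j = 0; j < h; ++j)
        for (int i = 0; i < w; ++i) {
            float c  = div[at(i, j)] - g[at(i, j)] / lambda;
            float gx = (i + 1 < w ? div[at(i + 1, j)] - g[at(i + 1, j)] / lambda : c) - c;
            float gy = (j + 1 < h ? div[at(i, j + 1)] - g[at(i, j + 1)] / lambda : c) - c;
            float mag = std::sqrt(gx * gx + gy * gy);
            p.x[at(i, j)] = (p.x[at(i, j)] + tau * gx) / (1.0f + tau * mag);
            p.y[at(i, j)] = (p.y[at(i, j)] + tau * gy) / (1.0f + tau * mag);
        }
}

int main() {
    const int w = 64, h = 64;
    std::vector<float> g(w * h, 0.5f);                        // placeholder input image
    Dual p{std::vector<float>(w * h, 0.0f), std::vector<float>(w * h, 0.0f)};
    for (int it = 0; it < 10; ++it)
        chambolle_iteration(g, p, w, h, 0.1f, 0.125f);
    return 0;
}
```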

    SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8× speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution, for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5× (1.4-23×) across a range of non-linear device models and Matrix-Solve by 2.4× (0.6-13×) across various benchmark matrices, while delivering a mean combined speedup of 2.8× (0.2-11×) for the two together when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
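    As a toy illustration of the three phases named above (not the thesis's framework or generated hardware), the sketch below runs a Newton-Raphson loop on a single diode-plus-resistor node: Model-Evaluation linearizes the non-linear device at the current guess, Sparse Matrix-Solve degenerates to a one-unknown solve, and Iteration Control decides when to stop.

```cpp
// Toy SPICE-style operating-point solve for one node: Vdd -- R -- (v) -- diode -- ground.
#include <cmath>
#include <cstdio>

int main() {
    const double vdd = 5.0, r = 1e3;          // supply voltage and series resistor
    const double is = 1e-14, vt = 0.02585;    // diode saturation current and thermal voltage
    double v = 0.6;                           // initial guess for the node voltage

    for (int iter = 0; iter < 100; ++iter) {
        // Model-Evaluation: diode current and conductance at the current guess.
        double id = is * (std::exp(v / vt) - 1.0);
        double gd = (is / vt) * std::exp(v / vt);

        // Matrix-Solve: the linearized one-node system (gd + 1/r) * dv = residual.
        double residual = (vdd - v) / r - id;
        double dv = residual / (gd + 1.0 / r);

        // Iteration Control: apply the update and test convergence.
        v += dv;
        if (std::fabs(dv) < 1e-9) {
            std::printf("converged: v = %.6f V after %d iterations\n", v, iter + 1);
            break;
        }
    }
    return 0;
}
```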

    Parallel computing 2011, ParCo 2011: book of abstracts

    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August - 2 September 2011, Ghent, Belgium.

    Analysis of 3D Cone-Beam CT Image Reconstruction Performance on a FPGA

    Efficient and accurate tomographic image reconstruction has been an intensive topic of research due to its increasing everyday use in areas such as radiology, biology, and materials science. Computed tomography (CT) scans are used to analyze internal structures through the capture of x-ray images. Cone-beam CT scans project a cone-shaped x-ray beam to capture 2D image data from a single focal point while rotating around the object. CT scans are prone to multiple artifacts, including motion blur, streaks, and pixel irregularities, and must therefore be run through image reconstruction software to reduce visual artifacts. The most common algorithm used is the Feldkamp, Davis, and Kress (FDK) backprojection algorithm. The algorithm is computationally intensive due to its O(n⁴) backprojection step, running slowly with large CT data files on CPUs but exceptionally well on GPUs due to the parallel nature of the algorithm. This thesis analyzes the performance of 3D cone-beam CT image reconstruction implemented in OpenCL on an FPGA embedded into a Power System.
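    The sketch below shows a simplified voxel-driven cone-beam backprojection loop to make the O(n⁴) cost concrete: for each of roughly n projection angles, every voxel of an n×n×n volume is updated independently, which is why the step parallelizes so well. The geometry and weighting are an idealized pinhole model, not the exact FDK formulas, and the filtered projection data are placeholders.

```cpp
// Simplified voxel-driven cone-beam backprojection: O(n_angles * n^3) work.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 64;                                  // volume and detector size (per axis)
    const int n_angles = 64;                           // number of projection angles
    const float dso = 500.0f, dsd = 1000.0f;           // source-object / source-detector distances

    std::vector<float> volume(static_cast<size_t>(n) * n * n, 0.0f);
    std::vector<float> proj(static_cast<size_t>(n_angles) * n * n, 1.0f); // placeholder projections

    for (int a = 0; a < n_angles; ++a) {               // ~n projections
        float beta = 6.2831853f * a / n_angles;
        float cb = std::cos(beta), sb = std::sin(beta);
        for (int iz = 0; iz < n; ++iz)                 // n^3 independent voxel updates
            for (int iy = 0; iy < n; ++iy)
                for (int ix = 0; ix < n; ++ix) {
                    float x = ix - n / 2.0f, y = iy - n / 2.0f, z = iz - n / 2.0f;
                    float s = x * cb + y * sb;         // distance toward the source
                    float t = -x * sb + y * cb;        // lateral position
                    float mag = dsd / (dso - s);       // perspective magnification
                    int u = int(t * mag + n / 2.0f);   // detector column
                    int v = int(z * mag + n / 2.0f);   // detector row
                    if (u < 0 || u >= n || v < 0 || v >= n) continue;
                    float w = (dso / (dso - s)) * (dso / (dso - s)); // distance weight
                    volume[(static_cast<size_t>(iz) * n + iy) * n + ix] +=
                        w * proj[(static_cast<size_t>(a) * n + v) * n + u];
                }
    }
    std::printf("center voxel = %g\n",
                volume[(static_cast<size_t>(n / 2) * n + n / 2) * n + n / 2]);
    return 0;
}
```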

    Real-Time Dense Stereo Matching With ELAS on FPGA Accelerated Embedded Devices

    For many applications in low-power real-time robotics, stereo cameras are the sensors of choice for depth perception, as they are typically cheaper and more versatile than their active counterparts. Their biggest drawback, however, is that they do not directly sense depth maps; instead, these must be estimated through data-intensive processes. Therefore, appropriate algorithm selection plays an important role in achieving the desired performance characteristics. Motivated by applications in space and mobile robotics, we implement and evaluate an FPGA-accelerated adaptation of the ELAS algorithm. Despite offering one of the best trade-offs between efficiency and accuracy, ELAS has only been shown to run at 1.5-3 fps on a high-end CPU. Our system preserves all intriguing properties of the original algorithm, such as the slanted-plane priors, but can achieve a frame rate of 47 fps whilst consuming under 4 W of power. Unlike previous FPGA-based designs, we take advantage of both components on the CPU/FPGA System-on-Chip to showcase the strategy necessary to accelerate more complex and computationally diverse algorithms for such low-power, real-time systems.
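    A minimal sketch of the general strategy described above, keeping both halves of a CPU/FPGA SoC busy by pipelining frames: the host post-processes frame k while the accelerator computes frame k+1. The fpga_disparity function is a hypothetical placeholder for the offloaded ELAS stages, stubbed on the CPU so the sketch runs anywhere; it is not the paper's partitioning.

```cpp
// Double-buffered CPU/accelerator pipeline sketch using std::async.
#include <cstdio>
#include <future>
#include <vector>

using Frame = std::vector<float>;

// Hypothetical stand-in for the FPGA-offloaded disparity stages.
Frame fpga_disparity(const Frame& left, const Frame& right) {
    Frame d(left.size());
    for (size_t i = 0; i < d.size(); ++i) d[i] = left[i] - right[i];  // stand-in work
    return d;
}

// Stand-in for the post-processing kept on the host CPU.
void cpu_postprocess(Frame& d) {
    for (float& v : d) v = v < 0.0f ? 0.0f : v;
}

int main() {
    const int n_frames = 8, n_pixels = 640 * 480;
    Frame left(n_pixels, 1.0f), right(n_pixels, 0.5f);        // placeholder rectified pair

    auto pending = std::async(std::launch::async, fpga_disparity, left, right);
    for (int f = 0; f < n_frames; ++f) {
        Frame disparity = pending.get();                      // result for frame f
        if (f + 1 < n_frames)                                 // start frame f+1 on the accelerator
            pending = std::async(std::launch::async, fpga_disparity, left, right);
        cpu_postprocess(disparity);                           // host works while f+1 is in flight
    }
    std::printf("processed %d frames\n", n_frames);
    return 0;
}
```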

    Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation
