110 research outputs found

    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Get PDF
    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p

    Parallel Type-2 Fuzzy Logic Co-Processors for Engine Management

    Get PDF
    Marine diesel engines operate in highly dynamic and uncertain environments, hence they require robust and accurate speed controllers that can handle the encountered uncertainties. Type-2 Fuzzy Logic Controllers (FLCs) have shown that they can handle such uncertainties and give a superior performance to the existing commercial controllers. However, there are a number of computational bottlenecks that pose as significant barriers to the widespread deployment of type-2 FLCs in commercial embedded control systems. This paper explores the use of parallel hardware implementations of interval type-2 FLC as a means to eradicate these barriers thus producing bespoke co-processors for a soft core implementation of a FPGA based 32 bit RISC micro-processor. These co-processors will perform functions such as fuzzification and type reduction and are currently utilised as part of a larger embedded interval Type-2 Fuzzy Engine Management System (T2FEMS). Numerous timing comparisons were undertaken between the co-processors and their sequential counterparts where the type-2 co-processors reduced significantly the computational cycles required by the type-2 FLC. This reduction in computational cycles allowed the T2FEMS to produce faster control responses whilst offering a superior control performance to the commercial engine management systems. Thus the proposed co-processors enable us to fully explore the potential of interval and possibly general type-2 FLCs in commercial embedded applications. © 2007 IEEE

    Dynamically Reconfigurable Systolic Array Accelerators: A Case Study with Extended Kalman Filter and Discrete Wavelet Transform Algorithms

    Get PDF
    Field programmable grid arrays (FPGA) are increasingly being adopted as the primary on-board computing system for autonomous deep space vehicles. There is a need to support several complex applications for navigation and image processing in a rapidly responsive on-board FPGA-based computer. This requires exploring and combining several design concepts such as systolic arrays, hardware-software partitioning, and partial dynamic reconfiguration. A microprocessor/co-processor design that can accelerate two single precision oating-point algorithms, extended Kalman lter and a discrete wavelet transform, is presented. This research makes three key contributions. (i) A polymorphic systolic array framework comprising of recofigurable partial region-based sockets to accelerate algorithms amenable to being mapped onto linear systolic arrays. When implemented on a low end Xilinx Virtex4 SX35 FPGA the design provides a speedup of at least 4.18x and 6.61x over a state of the art microprocessor used in spacecraft systems for the extended Kalman lter and discrete wavelet transform algorithms, respectively. (ii) Switchboxes to enable communication between static and partial reconfigurable regions and a simple protocol to enable schedule changes when a socket\u27s contents are dynamically reconfigured to alter the concurrency of the participating systolic arrays. (iii) A hybrid partial dynamic reconfiguration method that combines Xilinx early access partial reconfiguration, on-chip bitstream decompression, and bitstream relocation to enable fast scaling of systolic arrays on the PolySAF. This technique provided a 2.7x improvement in reconfiguration time compared to an o-chip partial reconfiguration technique that used a Flash card on the FPGA board, and a 44% improvement in BRAM usage compared to not using compression

    Fast self-reconfigurable embedded system on Spartan-3

    Get PDF
    Many image-processing algorithms require several stages to be processed that cannot be resolved by embedded microprocessors in a reasonable time, due to their high-computational cost. A set of dedicated coprocessors can accelerate the resolution of these algorithms, alt hough the main drawback is the area needed for their implementation. The main advantage of a reconfigurable system is that several coprocessors designed to perform different operations can be mapped on the same area in a time-multiplexed way. This work presents the architecture of an embedded system composed of a microprocessor and a run-time reconfigurable coprocessor, mapped on Spartan-3, the low-cost family of Xilinx FPGAs. Designing reconfigurable systems on Spartan-3 requires much design effort, since unlike higher cost families of Xilinx FPGAs, this device does not officially support partial reconfiguration. In order to overcome this drawback, the paper also describes the main steps used in the design flow to obtain a successful design. The main goal of the presented architecture is to reduce the coprocessor reconfiguration time, as well as accelerate image-processing algorithms. The experimental results demonstrate significant improvement in both objectives. The reconfiguration rate nearly achieves 320 Mb/s which is far superior to th e previous related works.Peer ReviewedPostprint (published version

    AIR: Adaptive Dynamic Precision Iterative Refinement

    Get PDF
    In high performance computing, applications often require very accurate solutions while minimizing runtimes and power consumption. Improving the ratio of the number of logic gates implementing floating point arithmetic operations to the total number of logic gates enables greater efficiency, potentially with higher performance and lower power consumption. Software executing on the fixed hardware in Von-Neuman architectures faces limitations on improving this ratio, since processors require extensive supporting logic to fetch and decode instructions while employing arithmetic units with statically defined precision. This dissertation explores novel approaches to improve computing architectures for linear system applications not only by designing application-specific hardware but also by optimizing precision by applying adaptive dynamic precision iterative refinement (AIR). This dissertation shows that AIR is numerically stable and well behaved. Theoretically, AIR can produce up to 3 times speedup over mixed precision iterative refinement on FPGAs. Implementing an AIR prototype for the refinement procedure on a Xilinx XC6VSX475T FPGA results in an estimated around 0.5, 8, and 2 times improvement for the time-, clock-, and energy-based performance per iteration compared to mixed precision iterative refinement on the Nvidia Tesla C2075 GPU, when a user requires a prescribed accuracy between single and double precision. AIR using FPGAs can produce beyond double precision accuracy effectively, while CPUs or GPUs need software help causing substantial overhead
    corecore