205 research outputs found
Accelerating SPICE Model-Evaluation using FPGAs
Single-FPGA spatial implementations can provide
an order of magnitude speedup over sequential microprocessor
implementations for data-parallel, floating-point computation in
SPICE model-evaluation. Model-evaluation is a key component
of the SPICE circuit simulator and it is characterized by
large irregular floating-point compute graphs. We show how to
exploit the parallelism available in these graphs on single-FPGA
designs with a low-overhead VLIW-scheduled architecture. Our
architecture uses spatial floating-point operators coupled to local
high-bandwidth memories and interconnected by a time-shared
network. We retime operation inputs in the model-evaluation to
allow independent scheduling of computation and communication.
With this approach, we demonstrate speedups of 2–18×
over a dual-core 3GHz Intel Xeon 5160 when using a Xilinx
Virtex 5 LX330T for a variety of SPICE device models
Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors
Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3- 182times for a Xilinx Virtex5 LX 330T, 1.3-33times for an IBM Cell, and 3-131times for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models
SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator
Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase, sparse dataflow parallelism in the Sparse Matrix-Solve phase and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution for accelerating the SPICE circuit simulator. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits parallelism available in the SPICE circuit simulator. This design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code-generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors including high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms.
We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X(1.4--23X) across a range of non-linear device models and Matrix-Solve by 2.4X(0.6--13X) across various benchmark matrices while delivering a mean combined speedup of 2.8X(0.2--11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm). With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures.
We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (\eg multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.</p
Modeling and Design of Digital Electronic Systems
The paper is concerned with the modern methodologies for holistic modeling of electronic systems enabling system-on-chip design. The method deals with the functional modeling of complete electronic systems using the behavioral features of Hardware Description Languages or high level languages then targeting programmable devices - mainly Field Programmable Gate Arrays (FPGAs) - for the rapid prototyping of digital electronic controllers. This approach offers major advantages such as: a unique modeling and evaluation environment for complete power systems, the same environment is used for the rapid prototyping of the digital controller, fast design development, short time to market, a CAD platform independent model, reusability of the model/design, generation of valuable IP, high level hardware/software partitioning of the design is enabled, Concurrent Engineering basic rules (unique EDA environment and common design database) are fulfilled. The recent evolution of such design methodologies is marked through references to case studies of electronic system modeling,simulation, controller design and implementation. Pointers for future trends / evolution of electronic design strategies and tools are given
Hardware Acceleration of Electronic Design Automation Algorithms
With the advances in very large scale integration (VLSI) technology, hardware is going
parallel. Software, which was traditionally designed to execute on single core microprocessors,
now faces the tough challenge of taking advantage of this parallelism, made available
by the scaling of hardware. The work presented in this dissertation studies the acceleration
of electronic design automation (EDA) software on several hardware platforms such
as custom integrated circuits (ICs), field programmable gate arrays (FPGAs) and graphics
processors. This dissertation concentrates on a subset of EDA algorithms which are heavily
used in the VLSI design flow, and also have varying degrees of inherent parallelism
in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing
analysis, circuit simulation, fault simulation and fault table generation are explored. The
architectural and performance tradeoffs of implementing the above applications on these
alternative platforms (in comparison to their implementation on a single core microprocessor)
are studied. In addition, this dissertation also presents an automated approach to
accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to
partition the software application into kernels in an automated fashion, such that multiple
instances of these kernels, when executed in parallel on the GPU, can maximally benefit
from the GPU?s hardware resources.
The work presented in this dissertation demonstrates that several EDA algorithms can
be successfully rearchitected to maximally harness their performance on alternative platforms
such as custom designed ICs, FPGAs and graphic processors, and obtain speedups upto 800X. The approaches in this dissertation collectively aim to contribute towards enabling
the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary
hardware platforms
Digital Simulations of Memristors Towards Integration with Reconfigurable Computing
The end of Moore’s Law has been predicted for decades. Demand for increased parallel computational performance has been increased by improvements in machine learning. This past decade has demonstrated the ever-increasing creativity and effort necessary to extract scaling improvements in CMOS fabrication processes. However, CMOS scaling is nearing its fundamental physical limits. A viable path for increasing performance is to break the von Neumann bottleneck. In-memory computing using emerging memory technologies (e.g. ReRam, STT, MRAM) offers a potential path beyond the end of Moore’s Law. However, there is currently very little support from industry tools for designers wishing to incorporate these devices and novel architectures. The primary issue for those using these tools is the lack of support for mixed-signal design, as HDLs such as Verilog were designed to work only with digital components. This work aims to improve the ability for designers to rapidly prototype their designs using these emerging memory devices, specifically memristors, by extending Verilog to support functional simulation of memristors with the Verilog Procedural Interface (VPI). In this work, demonstrations of the ability for the VPI to simulate memristors with the nonlinear ion-drift model and the behavior of a memristive crossbar array are presented
- …