
    Performance comparison of single-precision SPICE Model-Evaluation on FPGA, GPU, Cell, and multi-core processors

    Automated code generation and performance-tuning techniques for concurrent architectures such as GPUs, Cell, and FPGAs can provide integer-factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE model-evaluation. Our Verilog-AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations, where the same model is evaluated once for each device in the circuit. Our compiler uses architecture-specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3–182× for a Xilinx Virtex 5 LX330T, 1.3–33× for an IBM Cell, and 3–131× for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
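
    As a concrete illustration of the configuration sweep described above, the following is a minimal CUDA sketch (not the paper's generated code) that times one kernel variant per block size and keeps the fastest; block size stands in for tuning knobs such as unroll factor and vector length, and the kernel body is a made-up nonlinear placeholder rather than compiler-emitted model code.

```cuda
// Hypothetical auto-tuning loop: sweep a configuration parameter and
// keep the fastest variant. Assumes a CUDA-capable device; the model
// body below is a stand-in, not real Verilog-AMS-derived code.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void modelEval(const float* v, float* i, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < n) i[k] = 0.5f * v[k] * v[k] + sinf(v[k]);  // stand-in nonlinearity
}

int main() {
    const int n = 1 << 20;                     // one evaluation per device instance
    float *v, *i;
    cudaMalloc(&v, n * sizeof(float));
    cudaMalloc(&i, n * sizeof(float));
    cudaMemset(v, 0, n * sizeof(float));
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    int best = 0; float bestMs = 1e30f;
    for (int bs = 32; bs <= 1024; bs *= 2) {   // configuration sweep
        cudaEventRecord(t0);
        modelEval<<<(n + bs - 1) / bs, bs>>>(v, i, n);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms; cudaEventElapsedTime(&ms, t0, t1);
        if (ms < bestMs) { bestMs = ms; best = bs; }
    }
    printf("best block size: %d (%.3f ms)\n", best, bestMs);
    cudaFree(v); cudaFree(i);
    return 0;
}
```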

    SPICE²: A Spatial, Parallel Architecture for Accelerating the SPICE Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation on a single FPGA enables up to an order of magnitude speedup (mean 2.8×) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes, and exploits the parallelism available in the SPICE circuit simulator. The design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors: high utilization of statically scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5× (1.4–23×) across a range of non-linear device models and Matrix-Solve by 2.4× (0.6–13×) across various benchmark matrices, while delivering a mean combined speedup of 2.8× (0.2–11×) for the two together, when comparing a Xilinx Virtex-6 LX760 (40 nm) with an Intel Core i7 965 (45 nm). With our high-level framework, we can also accelerate single-precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, the IBM Cell, and the Sun Niagara 2. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g., multi-core, GPUs) under the constraints of power consumption. This thesis shows how to express, exploit, and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
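
    To make the sparse dataflow parallelism of the Matrix-Solve phase concrete, the sketch below shows a reference forward substitution L·y = b on a sparse lower-triangular factor in CSR form; rows whose off-diagonal entries reference already-computed unknowns are independent and can run concurrently, which is the fine-grained task parallelism the architecture schedules. The CSR layout and the diagonal-entry convention are assumptions made for this illustration.

```cuda
// Reference (sequential) forward substitution exposing the dependency
// structure that SPICE2's Matrix-Solve exploits as dataflow parallelism.
#include <cstdio>
#include <vector>

// CSR storage for a sparse lower-triangular matrix (assumed layout).
struct CsrLower {
    std::vector<int> rowPtr, col;
    std::vector<double> val;
};

std::vector<double> forwardSolve(const CsrLower& L, const std::vector<double>& b) {
    int n = (int)b.size();
    std::vector<double> y(n);
    for (int r = 0; r < n; ++r) {
        double acc = b[r];
        double diag = 1.0;
        for (int k = L.rowPtr[r]; k < L.rowPtr[r + 1]; ++k) {
            if (L.col[k] == r) diag = L.val[k];   // diagonal entry
            else acc -= L.val[k] * y[L.col[k]];   // dataflow dependence on y
        }
        y[r] = acc / diag;
    }
    return y;
}

int main() {
    // 3x3 example: rows 0 and 1 are independent; row 2 depends on both.
    CsrLower L{{0, 1, 2, 5}, {0, 1, 0, 1, 2}, {2.0, 4.0, 1.0, 1.0, 2.0}};
    std::vector<double> b{2.0, 4.0, 6.0};
    std::vector<double> y = forwardSolve(L, b);
    printf("y = %g %g %g\n", y[0], y[1], y[2]);   // expected: 1 1 2
    return 0;
}
```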

    Accelerating SPICE Model-Evaluation using FPGAs

    Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator, and it is characterized by large, irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead, VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3 GHz Intel Xeon 5160 when using a Xilinx Virtex 5 LX330T for a variety of SPICE device models.
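
    As a rough picture of what a statically scheduled VLIW schedule for such a compute graph involves, here is a toy ASAP list scheduler. It assumes unit-latency operators and ignores the time-shared network and memory ports the paper's real scheduler must also account for; it is an illustration of the scheduling problem, not the published algorithm.

```cuda
// Toy ASAP list scheduler for a dataflow graph with a fixed number of
// operator slots per cycle. Unit latency assumed; ops topologically sorted.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Op { std::vector<int> deps; };   // indices of predecessor ops

std::vector<int> schedule(const std::vector<Op>& ops, int slots) {
    std::vector<int> cycle(ops.size(), 0);
    std::vector<int> used;               // ops issued in each cycle so far
    for (size_t i = 0; i < ops.size(); ++i) {
        int earliest = 0;
        for (int d : ops[i].deps) earliest = std::max(earliest, cycle[d] + 1);
        while ((int)used.size() <= earliest) used.push_back(0);
        while (used[earliest] >= slots) {     // slot pressure: defer the op
            ++earliest;
            if ((int)used.size() <= earliest) used.push_back(0);
        }
        cycle[i] = earliest;
        ++used[earliest];
    }
    return cycle;
}

int main() {
    // Diamond graph 0 -> {1, 2} -> 3 scheduled onto a single operator.
    std::vector<Op> g = {Op{{}}, Op{{0}}, Op{{0}}, Op{{1, 2}}};
    std::vector<int> c = schedule(g, 1);
    for (size_t i = 0; i < g.size(); ++i) printf("op%zu @ cycle %d\n", i, c[i]);
    return 0;
}
```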

    Multi-level simulation of nano-electronic digital circuits on GPUs

    Simulation of circuits and faults is an essential part of design and test validation tasks for contemporary nano-electronic digital integrated CMOS circuits. Shrinking technology processes with smaller feature sizes and strict performance and reliability requirements demand not only detailed validation of the functional properties of a design, but also accurate validation of non-functional aspects including the timing behavior. However, due to the rising complexity of circuit behavior and the steady growth of designs with respect to transistor count, timing-accurate simulation of current designs requires a lot of computational effort, which can only be handled by proper abstraction and a high degree of parallelization. This work presents a simulation model for scalable and accurate timing simulation of digital circuits on data-parallel graphics processing unit (GPU) accelerators. By providing compact modeling and data structures, as well as by exploiting multiple dimensions of parallelism, the simulation model enables not only fast and timing-accurate simulation at logic level, but also massively parallel simulation with switch-level accuracy. The model facilitates extensions for fast and efficient fault simulation of small delay faults at logic level, as well as first-order parametric and parasitic faults at switch level. With the parallelization on GPUs, detailed and scalable simulation is enabled that is applicable even to multi-million-gate designs. This way, comprehensive analyses of realistic timing-related faults in the presence of process and parameter variations are enabled for the first time. Additional simulation efficiency is achieved by merging the presented methods into a unified simulation model that combines the unique advantages of the different levels of abstraction in a mixed-abstraction, multi-level simulation flow to reach even higher speedups. Experimental results show that the implemented parallel approach achieves unprecedented simulation throughput as well as high speedup compared to conventional timing simulators. The underlying model scales to multi-million-gate designs and gives detailed insights into the timing behavior of digital CMOS circuits, thereby enabling large-scale applications to aid even highly complex design and test validation tasks.
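
    A hedged sketch of one dimension of the parallelism described above: gates within a single topological level evaluated by one GPU thread each, with a single delay annotation per gate. The published model also parallelizes across stimuli and fault instances and handles full waveforms; this toy version tracks only one logic value and one latest-transition time per signal.

```cuda
// One thread per gate in a topological level of a combinational netlist.
// Signals 0..numPI-1 are primary inputs; gate g drives signal numPI + g.
#include <cstdio>
#include <cuda_runtime.h>

struct Gate { int inA, inB; float delay; };   // 2-input NAND only (assumed)

__global__ void evalLevel(const Gate* gates, const int* level, int count,
                          int numPI, int* val, float* t) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= count) return;
    int g = level[k];
    Gate gg = gates[g];
    val[numPI + g] = !(val[gg.inA] && val[gg.inB]);         // NAND function
    t[numPI + g] = fmaxf(t[gg.inA], t[gg.inB]) + gg.delay;  // arrival time
}

int main() {
    // Two primary inputs (signals 0, 1); g0 = NAND(0, 1); g1 = NAND(2, 2).
    Gate* gates; int* level; int* val; float* t;
    cudaMallocManaged(&gates, 2 * sizeof(Gate));
    cudaMallocManaged(&level, sizeof(int));
    cudaMallocManaged(&val, 4 * sizeof(int));
    cudaMallocManaged(&t, 4 * sizeof(float));
    gates[0] = {0, 1, 0.3f}; gates[1] = {2, 2, 0.2f};
    val[0] = val[1] = 1; t[0] = t[1] = 0.0f;
    for (int lvl = 0; lvl < 2; ++lvl) {        // level-by-level evaluation
        level[0] = lvl;                        // one gate per level here
        evalLevel<<<1, 32>>>(gates, level, 1, 2, val, t);
        cudaDeviceSynchronize();
    }
    printf("output: value %d at t = %.1f\n", val[3], t[3]);  // 1 at 0.5
    cudaFree(gates); cudaFree(level); cudaFree(val); cudaFree(t);
    return 0;
}
```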

    Graphics processing unit utilization in circuit simulation

    Graphics processing units (GPUs) of today include hundreds of multi-threaded, multi-core processors and a complex, high-bandwidth memory architecture, making them a good alternative for speeding up general-purpose parallel computation where large data quantities are processed with the same functions. Some successful applications of GPU computation have also been introduced in the field of circuit simulation. The objective of this thesis is to examine the GPU's computing potential in the APLAC circuit simulation software. The realization of a diode model on a GPU device is also presented. The nonlinear diode model was implemented on NVIDIA's Compute Unified Device Architecture (CUDA), a single-instruction, multiple-thread (SIMT) architecture. The CUDA device was programmed using the CUDA C application programming interface, which is an extension of the standard C language. The test results revealed that, due to the diode's simple nonlinearity, its evaluation is computationally too light to gain any speed benefit from the GPU's computation power. The required modifications to the circuit analysis structure and data handling resulted in a marginally longer total simulation time than the original. However, when the diode model is made more complex by multiplying its evaluation, the CUDA implementation is faster than the original model. This gives a rough estimate of how complex a model must be to benefit from GPU computation. Although the diode model evaluation was not faster on the GPU, this implementation is a good foundation for future CUDA applications in APLAC. The next of these will be the computationally more complex BSIM3 transistor model, which will most likely benefit from the computing power of GPU devices.
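
    The diode evaluation described above is, in essence, a handful of floating-point operations per device; a minimal CUDA version might look like the sketch below, which makes clear why one exponential per thread leaves the GPU underutilized. Parameter values and the conductance output are assumptions for illustration, not APLAC's actual model code.

```cuda
// One thread per diode instance; each evaluates the same ideal diode
// equation i = Is*(exp(v/Vt) - 1) and its derivative for the Jacobian.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void evalDiode(const float* v, float* i, float* g, int n) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    const float is = 1e-14f;        // saturation current (assumed value)
    const float vt = 0.02585f;      // thermal voltage at ~300 K
    float e = expf(v[k] / vt);
    i[k] = is * (e - 1.0f);         // device current
    g[k] = is * e / vt;             // conductance di/dv
}

int main() {
    const int n = 1 << 20;          // one million diode instances
    float *v, *i, *g;
    cudaMallocManaged(&v, n * sizeof(float));
    cudaMallocManaged(&i, n * sizeof(float));
    cudaMallocManaged(&g, n * sizeof(float));
    for (int k = 0; k < n; ++k) v[k] = 0.7f;
    evalDiode<<<(n + 255) / 256, 256>>>(v, i, g, n);
    cudaDeviceSynchronize();
    printf("i[0] = %g A, g[0] = %g S\n", i[0], g[0]);
    cudaFree(v); cudaFree(i); cudaFree(g);
    return 0;
}
```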

    Hardware Acceleration of Electronic Design Automation Algorithms

    With the advances in very large scale integration (VLSI) technology, hardware is going parallel. Software, which was traditionally designed to execute on single-core microprocessors, now faces the tough challenge of taking advantage of this parallelism made available by the scaling of hardware. The work presented in this dissertation studies the acceleration of electronic design automation (EDA) software on several hardware platforms such as custom integrated circuits (ICs), field programmable gate arrays (FPGAs), and graphics processors. This dissertation concentrates on a subset of EDA algorithms which are heavily used in the VLSI design flow and also have varying degrees of inherent parallelism in them. In particular, Boolean satisfiability, Monte Carlo based statistical static timing analysis, circuit simulation, fault simulation, and fault table generation are explored. The architectural and performance tradeoffs of implementing the above applications on these alternative platforms (in comparison to their implementation on a single-core microprocessor) are studied. In addition, this dissertation presents an automated approach to accelerate uniprocessor code using a graphics processing unit (GPU). The key idea is to partition the software application into kernels in an automated fashion, such that multiple instances of these kernels, when executed in parallel on the GPU, can maximally benefit from the GPU's hardware resources. The work presented in this dissertation demonstrates that several EDA algorithms can be successfully rearchitected to maximally harness their performance on alternative platforms such as custom-designed ICs, FPGAs, and graphics processors, and obtain speedups of up to 800×. The approaches in this dissertation collectively aim to contribute towards enabling the computer aided design (CAD) community to accelerate EDA algorithms on arbitrary hardware platforms.
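
    Of the algorithms listed above, Monte Carlo statistical static timing analysis is a particularly natural GPU fit: each sample is independent. The sketch below, a hedged toy rather than the dissertation's implementation, draws Gaussian gate delays for a made-up 4-gate path with one thread per sample; the path structure and delay parameters are assumptions.

```cuda
// One CUDA thread per Monte Carlo sample of a single timing path.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void mcSamples(float* pathDelay, int n, unsigned long long seed) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= n) return;
    curandState s;
    curand_init(seed, k, 0, &s);                            // per-thread RNG
    const float mean[4]  = {0.10f, 0.20f, 0.15f, 0.25f};    // ns (assumed)
    const float sigma[4] = {0.01f, 0.03f, 0.02f, 0.02f};    // ns (assumed)
    float d = 0.0f;
    for (int g = 0; g < 4; ++g)
        d += mean[g] + sigma[g] * curand_normal(&s);        // Gaussian delay
    pathDelay[k] = d;
}

int main() {
    const int n = 1 << 16;
    float* d;
    cudaMallocManaged(&d, n * sizeof(float));
    mcSamples<<<(n + 255) / 256, 256>>>(d, n, 1234ULL);
    cudaDeviceSynchronize();
    double sum = 0.0;
    for (int k = 0; k < n; ++k) sum += d[k];
    printf("mean path delay: %.4f ns over %d samples\n", sum / n, n);
    cudaFree(d);
    return 0;
}
```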

    Toward Reliable, Secure, and Energy-Efficient Multi-Core System Design

    Computer hardware researchers have perennially focused on improving the performance of computers while keeping energy consumption under a strict budget. While several innovations over the years have led to high-performance, energy-efficient computers, more challenges have also emerged as a fallout. For example, smaller transistor devices in modern multi-core systems are afflicted with several reliability and security concerns that were inconceivable even a decade ago. Tackling these bottlenecks tends to negatively impact the power and performance of the computers. This dissertation explores novel techniques to gracefully solve some of the pressing challenges of modern computer design. Specifically, the proposed techniques improve the reliability of the on-chip communication fabric under high power-supply noise, increase the energy efficiency of low-power graphics processing units, and demonstrate an unprecedented security loophole of the low-power computing paradigm through rigorous hardware-based experiments.

    FPGA-SPICE: A Simulation-Based Architecture Evaluation Framework for FPGAs

    In this paper, we developed a simulation-based architecture evaluation framework for field-programmable gate arrays (FPGAs), called FPGA-SPICE, which enables automatic layout-level estimation and electrical simulation of FPGA architectures. FPGA-SPICE can automatically generate Verilog and SPICE netlists based on realistic FPGA configurations and a high-level eXtensible Markup Language (XML)-based FPGA architectural description language. The output Verilog netlists can be used to generate layouts of full FPGA fabrics through a semicustom design flow. SPICE simulation decks can be generated at three levels of complexity, namely full-chip level, grid level, and component level, providing different tradeoffs between accuracy and simulation time. In order to enable this level of analysis, we present two SPICE netlist partitioning techniques: loads extraction and parasitic net activity estimation. Electrical simulations showed that, averaged over the selected benchmarks, the grid-/component-level approaches achieve 6.1×/7.5× execution speedup with 9.9%/8.3% accuracy loss, respectively, compared to full-chip-level simulation. FPGA-SPICE is showcased through three different case studies: 1) an area breakdown analysis for static random access memory (SRAM)-based FPGAs, showing that configuration memories are a dominant factor; 2) a power breakdown comparison to analytical models, analyzing the source of accuracy loss; and 3) a robustness evaluation against process corners, studying their impact on the energy consumption of full FPGA fabrics.
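
    For a feel of what component-level SPICE deck generation involves, here is an illustrative sketch only: the device names, model cards, sizing, and stimulus below are assumptions for a single CMOS inverter primitive, not FPGA-SPICE's actual output format.

```cuda
// Emit a hypothetical component-level SPICE deck for a CMOS inverter.
#include <cstdio>

void emitInverterDeck(FILE* f, double wn, double wp) {
    fprintf(f, "* auto-generated inverter testbench (illustrative)\n");
    fprintf(f, ".subckt inv in out vdd gnd\n");
    fprintf(f, "Mp out in vdd vdd pmos W=%gu L=0.04u\n", wp);
    fprintf(f, "Mn out in gnd gnd nmos W=%gu L=0.04u\n", wn);
    fprintf(f, ".ends inv\n");
    fprintf(f, "Xinv a z vdd 0 inv\n");
    fprintf(f, "Vdd vdd 0 0.9\n");
    fprintf(f, "Vin a 0 PULSE(0 0.9 0 10p 10p 200p 400p)\n");
    fprintf(f, ".tran 1p 1n\n.end\n");
}

int main() {
    emitInverterDeck(stdout, 0.2, 0.4);  // transistor widths in microns (assumed)
    return 0;
}
```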

    Design Exploration of AES Accelerators on FPGAs and GPUs, Journal of Telecommunications and Information Technology, 2017, nr 1

    Embedded systems are increasingly becoming a key technological component of all kinds of complex technical systems, and an exhaustive analysis of the state of the art of current performance with respect to architectures, design methodologies, test, and applications is therefore of interest. The Advanced Encryption Standard (AES), based on the well-known Rijndael algorithm, is designed to be easily implemented on hardware and software platforms. General-purpose computing on graphics processing units (GPGPU) is an alternative to reconfigurable accelerators based on FPGA devices. This paper presents a direct comparison between FPGAs and GPUs used as accelerators for the AES cipher. The results achieved on both platforms are analyzed and compared to several others in order to establish which device best plays the role of hardware accelerator, yielding interesting considerations in terms of throughput, speedup factor, and resource usage. This analysis suggests that, while hardware design on FPGA remains the natural choice for consumer-product design, GPUs are nowadays the preferable choice for PC-based accelerators, especially when the processing routines are highly parallelizable.
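
    Part of why AES parallelizes well on GPUs is that, in ECB- or CTR-style processing, each 16-byte state block is independent, so one thread can own one block. The sketch below shows only the AddRoundKey step as an illustration; a real kernel also applies SubBytes, ShiftRows, and MixColumns per round, usually via lookup tables.

```cuda
// One thread per 16-byte AES state block; only AddRoundKey is shown.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addRoundKey(unsigned char* states, const unsigned char* rk,
                            int nBlocks) {
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nBlocks) return;
    for (int i = 0; i < 16; ++i)           // XOR the round key into the state
        states[16 * b + i] ^= rk[i];
}

int main() {
    const int nBlocks = 1024;
    unsigned char *states, *rk;
    cudaMallocManaged(&states, 16 * nBlocks);
    cudaMallocManaged(&rk, 16);
    for (int i = 0; i < 16; ++i) rk[i] = (unsigned char)i;
    for (int i = 0; i < 16 * nBlocks; ++i) states[i] = 0xAB;
    addRoundKey<<<(nBlocks + 255) / 256, 256>>>(states, rk, nBlocks);
    cudaDeviceSynchronize();
    printf("state[0][1] = 0x%02X\n", states[1]);  // 0xAB ^ 0x01 = 0xAA
    cudaFree(states); cudaFree(rk);
    return 0;
}
```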

    Architectural Support for Medical Imaging

    Advancements in medical imaging research are continuously providing doctors with better diagnostic information, removing the need for unnecessary surgeries and increasing accuracy in predicting life-threatening conditions. However, newly developed techniques are currently limited by the capabilities of existing computer hardware, restricting them to expensive, custom-designed machines that only the largest hospital systems can afford, or, even worse, precluding them entirely. Many of these issues are due to existing hardware being ill-suited for these types of algorithms and not designed with medical imaging in mind. In this thesis we discuss our efforts to motivate and democratize architectural support for advanced medical imaging tasks with MIRAQLE, a medical image reconstruction benchmark suite. In particular, MIRAQLE focuses on advanced image reconstruction techniques for 3D ultrasound, low-dose X-ray CT, and dynamic MRI. For each imaging modality we provide a detailed background and parallel implementations to enable future hardware development. In addition to providing baseline algorithms for these workloads, we also develop a unique analysis tool that provides image-quality feedback for each simulation. This allows hardware designers to explore acceptable image-quality trade-offs in algorithm-hardware co-design, potentially allowing for even more efficient solutions than hardware innovations alone could provide. We also motivate the need for such tools by discussing Sonic Millip3De, our low-power, highly parallel hardware for 3D ultrasound. Using Sonic Millip3De, we illustrate the orders-of-magnitude power-efficiency improvement that better medical imaging hardware can provide, especially when developed with hardware-software co-design. We also show validation of the design using a scaled-down FPGA proof-of-concept and discuss our further refinement of the hardware to support a wider range of applications and produce higher frame rates. Overall, with this thesis we hope to enable application-specific hardware support for the critical medical imaging tasks in MIRAQLE to make them practical for wide clinical use.
    PhD thesis, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/137105/1/rsamp_1.pd