
    SPICE²: A Spatial, Parallel Architecture for Accelerating the Spice Circuit Simulator

    Spatial processing of sparse, irregular floating-point computation using a single FPGA enables up to an order of magnitude speedup (mean 2.8X speedup) over a conventional microprocessor for the SPICE circuit simulator. We deliver this speedup using a hybrid parallel architecture that spatially implements the heterogeneous forms of parallelism available in SPICE. We decompose SPICE into its three constituent phases: Model-Evaluation, Sparse Matrix-Solve, and Iteration Control, and parallelize each phase independently. We exploit data-parallel device evaluations in the Model-Evaluation phase and sparse dataflow parallelism in the Sparse Matrix-Solve phase, and compose the complete design in streaming fashion. We name our parallel architecture SPICE²: Spatial Processors Interconnected for Concurrent Execution. We program the parallel architecture with a high-level, domain-specific framework that identifies, exposes and exploits the parallelism available in the SPICE circuit simulator. The design is optimized with an auto-tuner that can scale the design to use larger FPGA capacities without expert intervention and can even target other parallel architectures with the assistance of automated code generation. This FPGA architecture is able to outperform conventional processors due to a combination of factors: high utilization of statically-scheduled resources, low-overhead dataflow scheduling of fine-grained tasks, and overlapped processing of the control algorithms. We demonstrate that we can independently accelerate Model-Evaluation by a mean factor of 6.5X (1.4–23X) across a range of non-linear device models and Matrix-Solve by 2.4X (0.6–13X) across various benchmark matrices, while delivering a mean combined speedup of 2.8X (0.2–11X) for the two together when comparing a Xilinx Virtex-6 LX760 (40nm) with an Intel Core i7 965 (45nm).
    With our high-level framework, we can also accelerate Single-Precision Model-Evaluation on NVIDIA GPUs, ATI GPUs, IBM Cell, and Sun Niagara 2 architectures. We expect approaches based on exploiting spatial parallelism to become important as frequency scaling slows down and modern processing architectures turn to parallelism (e.g. multi-core, GPUs) due to constraints of power consumption. This thesis shows how to express, exploit and optimize spatial parallelism for an important class of problems that are challenging to parallelize.
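The data-parallel structure of the Model-Evaluation phase can be made concrete with a toy sketch: the same nonlinear device equation is evaluated independently for every device instance, which is exactly what makes the phase amenable to wide spatial parallelism. The diode model and parameter values below are generic illustrations, not taken from the thesis.

```python
import numpy as np

# Hypothetical diode parameters (saturation current, thermal voltage).
IS, VT = 1e-14, 0.02585

def evaluate_diodes(v):
    """Data-parallel Model-Evaluation sketch: apply one nonlinear device
    model independently to every device's terminal voltage."""
    v = np.asarray(v, dtype=np.float64)
    i = IS * (np.exp(v / VT) - 1.0)      # diode current
    g = (IS / VT) * np.exp(v / VT)       # small-signal conductance
    ieq = i - g * v                      # Norton companion current source
    return g, ieq

# One vectorized call evaluates all devices at once.
g, ieq = evaluate_diodes([0.6, 0.65, 0.7])
```

Because every device evaluation is independent, the same loop maps naturally onto spatial FPGA operators, GPUs, or SIMD units.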

    High Performance Reconfigurable Computing for Linear Algebra: Design and Performance Analysis

    Field Programmable Gate Arrays (FPGAs) enable powerful performance acceleration for scientific computations because of their intrinsic parallelism, pipelining capability, and flexible architecture. This dissertation explores the computational power of FPGAs for an important scientific application: linear algebra. First, optimized linear algebra subroutines are presented based on enhancements to both algorithms and hardware architectures. Compared to microprocessors, these routines achieve significant speedup. Second, computing with mixed-precision data on FPGAs is proposed for higher performance. Experimental analysis shows that mixed-precision algorithms on FPGAs can achieve the high performance of using lower-precision data while keeping higher-precision accuracy when solving systems of linear equations. Third, an execution time model is built for reconfigurable computers (RC), which plays an important role in performance analysis and optimal resource utilization of FPGAs. The accuracy and efficiency of parallel computing performance models often depend on mean maximum computations. Despite significant prior work, sufficient mathematical tools for this important calculation have been lacking. This work presents an Effective Mean Maximum Approximation method that is more general, accurate, and efficient than previous methods. Together, these research results help address how to make linear algebra applications perform better on high performance reconfigurable computing architectures.
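The mixed-precision idea can be sketched in a few lines of generic iterative refinement: solve in low precision (where FPGA arithmetic units are cheap and fast), then correct the solution using residuals computed in high precision. This is a standard textbook sketch, not the dissertation's FPGA implementation.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=3):
    """Mixed-precision iterative refinement: solve in float32 (cheap,
    like low-precision FPGA units), then refine with float64 residuals
    to recover near double-precision accuracy."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                  # high-precision residual
        d = np.linalg.solve(A32, r.astype(np.float32)) # low-precision correction
        x += d.astype(np.float64)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_solve(A, b)
```

For well-conditioned systems, each refinement pass shrinks the error by roughly the low-precision unit roundoff, so a few passes recover full double-precision accuracy while the expensive solves stay in low precision.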

    Accelerating SPICE Model-Evaluation using FPGAs

    Single-FPGA spatial implementations can provide an order of magnitude speedup over sequential microprocessor implementations for data-parallel, floating-point computation in SPICE model-evaluation. Model-evaluation is a key component of the SPICE circuit simulator, and it is characterized by large irregular floating-point compute graphs. We show how to exploit the parallelism available in these graphs on single-FPGA designs with a low-overhead VLIW-scheduled architecture. Our architecture uses spatial floating-point operators coupled to local high-bandwidth memories and interconnected by a time-shared network. We retime operation inputs in the model-evaluation to allow independent scheduling of computation and communication. With this approach, we demonstrate speedups of 2–18× over a dual-core 3 GHz Intel Xeon 5160 when using a Xilinx Virtex 5 LX330T for a variety of SPICE device models.
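The core of a VLIW-scheduled architecture is a static schedule that packs independent operations of the compute graph into each cycle, bounded by the number of available operator slots. The toy graph, two-unit resource bound, and greedy list scheduler below are illustrative assumptions, not the thesis's scheduler.

```python
# Toy dataflow graph for a device-model expression: node -> predecessors.
# Each node is one floating-point operation; assume two FP units are
# available per VLIW cycle (an illustrative resource bound).
deps = {"t1": [], "t2": [], "t3": ["t1"], "t4": ["t1", "t2"], "out": ["t3", "t4"]}

def list_schedule(deps, n_units=2):
    """Greedy list scheduling: each cycle issues up to n_units operations
    whose predecessors have all completed in earlier cycles."""
    done, schedule = set(), []
    while len(done) < len(deps):
        ready = [n for n in deps if n not in done
                 and all(p in done for p in deps[n])]
        issue = ready[:n_units]          # fill the VLIW slots for this cycle
        schedule.append(issue)
        done.update(issue)
    return schedule

sched = list_schedule(deps)
```

Here the five operations fit in three cycles; a real scheduler would also account for operator latencies and the time-shared network, as the abstract notes.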

    Accelerated Quantum Monte Carlo with Probabilistic Computers

    Quantum Monte Carlo (QMC) techniques are widely used in a variety of scientific problems, and much work has been dedicated to developing optimized algorithms that can accelerate QMC on standard processors (CPUs). With the advent of various special-purpose devices and domain-specific hardware, it has become increasingly important to establish clear benchmarks of what improvements these technologies offer compared to existing technologies. In this paper, we demonstrate 2 to 3 orders of magnitude of acceleration of a standard QMC algorithm using a specially designed digital processor, and a further 2 to 3 orders of magnitude by mapping it to a clockless analog processor. Our demonstration provides a roadmap for 5 to 6 orders of magnitude of acceleration for a transverse field Ising model (TFIM) and could possibly be extended to other QMC models as well. The clockless analog hardware can be viewed as the classical counterpart of the quantum annealer and provides performance within a factor of 10 of the latter. The convergence time for the clockless analog hardware scales with the number of qubits as ~N, improving on the ~N^2 scaling of CPU implementations, but appears worse than that reported for quantum annealers by D-Wave.
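To make the standard QMC algorithm behind such benchmarks concrete, the sketch below runs a minimal path-integral (Suzuki-Trotter) Metropolis simulation of a small 1-D TFIM, mapping the quantum model to a classical space-time lattice. All sizes, couplings and the inverse temperature are illustrative choices, not the paper's benchmark settings.

```python
import math, random

random.seed(0)

# Minimal path-integral Monte Carlo sketch for a 1-D transverse-field
# Ising model: N spins, M imaginary-time slices (illustrative values).
N, M, J, GAMMA, BETA = 8, 16, 1.0, 1.0, 2.0
K_SPACE = BETA * J / M                                 # intra-slice coupling
K_TIME = -0.5 * math.log(math.tanh(BETA * GAMMA / M))  # inter-slice coupling

spins = [[1] * N for _ in range(M)]

def local_field(t, i):
    """Effective coupling of spin (t, i) to its space and time neighbours."""
    s = spins[t]
    h = K_SPACE * (s[(i - 1) % N] + s[(i + 1) % N])
    h += K_TIME * (spins[(t - 1) % M][i] + spins[(t + 1) % M][i])
    return h

def sweep():
    """One Metropolis sweep over the whole space-time lattice."""
    for t in range(M):
        for i in range(N):
            dE = 2.0 * spins[t][i] * local_field(t, i)  # action change on flip
            if dE <= 0 or random.random() < math.exp(-dE):
                spins[t][i] = -spins[t][i]

for _ in range(50):
    sweep()
m = abs(sum(sum(row) for row in spins)) / (N * M)      # magnetization estimate
```

The inner flip-accept loop is the kernel that digital and clockless analog hardware accelerate; in hardware, many such spin updates proceed concurrently rather than in this sequential sweep.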

    Simulation and implementation of novel deep learning hardware architectures for resource constrained devices

    Corey Lammie designed mixed-signal memristive-complementary metal–oxide–semiconductor (CMOS) and field programmable gate array (FPGA) hardware architectures, which were used to reduce the power and resource requirements of Deep Learning (DL) systems, both during inference and training. Disruptive design methodologies, such as those explored in this thesis, can be used to facilitate the design of next-generation DL systems.

    Custom optimization algorithms for efficient hardware implementation

    The focus is on real-time optimal decision making with application in advanced control systems. These computationally intensive schemes, which involve the repeated solution of (convex) optimization problems within a sampling interval, require more efficient computational methods than currently available for extending their application to highly dynamical systems and setups with resource-constrained embedded computing platforms. A range of techniques are proposed to exploit synergies between digital hardware, numerical analysis and algorithm design. These techniques build on top of parameterisable hardware code generation tools that generate VHDL code describing custom computing architectures for interior-point methods and a range of first-order constrained optimization methods. Since memory limitations are often important in embedded implementations, we develop a custom storage scheme for KKT matrices arising in interior-point methods for control, which reduces memory requirements significantly and prevents I/O bandwidth limitations from affecting the performance in our implementations. To take advantage of the trend towards parallel computing architectures and to exploit the special characteristics of our custom architectures, we propose several high-level parallel optimal control schemes that can reduce computation time. A novel optimization formulation was devised for reducing the computational effort in solving certain problems independent of the computing platform used. In order to be able to solve optimization problems in fixed-point arithmetic, which is significantly more resource-efficient than floating-point, tailored linear algebra algorithms were developed for solving the linear systems that form the computational bottleneck in many optimization methods. These methods come with guarantees for reliable operation.
    We also provide finite-precision error analysis for fixed-point implementations of first-order methods that can be used to minimize the use of resources while meeting accuracy specifications. The suggested techniques are demonstrated on several practical examples, including a hardware-in-the-loop setup for optimization-based control of a large airliner.
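The appeal of fixed-point first-order methods can be sketched with a projected-gradient iteration done entirely in integer arithmetic, mimicking a resource-efficient hardware datapath. The Q16.16 format, toy quadratic problem, and step size below are illustrative assumptions, not the thesis's implementation.

```python
# Fixed-point (Q16.16) projected-gradient sketch for min 0.5*||Ax - b||^2
# subject to x >= 0; all arithmetic is integer, as in a fixed-point datapath.
FRAC = 16
ONE = 1 << FRAC

def to_fx(x):
    return int(round(x * ONE))

def fx_mul(a, b):
    return (a * b) >> FRAC               # Q16.16 multiply with rescale

A = [[to_fx(2.0), to_fx(0.0)], [to_fx(0.0), to_fx(1.0)]]
b = [to_fx(1.0), to_fx(1.0)]
step = to_fx(0.2)
x = [0, 0]

for _ in range(60):
    # residual r = Ax - b, then gradient g = A^T r (A is symmetric here)
    r = [sum(fx_mul(A[i][j], x[j]) for j in range(2)) - b[i] for i in range(2)]
    g = [sum(fx_mul(A[j][i], r[j]) for j in range(2)) for i in range(2)]
    # gradient step plus projection onto the non-negative orthant
    x = [max(0, x[i] - fx_mul(step, g[i])) for i in range(2)]

sol = [xi / ONE for xi in x]             # back to floating point for inspection
```

For this well-conditioned toy problem the iteration converges to (0.5, 1.0) despite the truncation in every multiply; the thesis's finite-precision error analysis addresses exactly how such rounding affects accuracy guarantees.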

    Parallel computing 2011, ParCo 2011: book of abstracts

    This book contains the abstracts of the presentations at the conference Parallel Computing 2011, 30 August – 2 September 2011, Ghent, Belgium.

    Acceleration Techniques for Sparse Recovery Based Plane-wave Decomposition of a Sound Field

    Plane-wave decomposition by sparse recovery is a reliable and accurate technique that can be used for source localization, beamforming, etc. In this work, we introduce techniques to accelerate plane-wave decomposition by sparse recovery. The method consists of two main algorithms: spherical Fourier transformation (SFT) and sparse recovery. Of the two, sparse recovery is the more computationally intensive. We implement the SFT on an FPGA and the sparse recovery on a multithreaded computing platform, so that the multithreaded platform can be fully dedicated to sparse recovery. Moreover, implementing the SFT on an FPGA allows flexible integration of the microphones and improves the portability of the microphone array. For implementing the SFT on an FPGA, we develop a scalable FPGA design model that enables quick design of SFT architectures on FPGAs. The model takes the number of microphones, the number of SFT channels and the cost of the FPGA as inputs and produces a resource-optimized, cost-effective FPGA architecture as its output. We then investigate the performance of the sparse recovery algorithm executed on various multithreaded computing platforms (i.e., chip-multiprocessor, multiprocessor, GPU, manycore). Finally, we investigate the influence of the dictionary size on the computational performance and the accuracy of the sparse recovery algorithms, and introduce novel sparse-recovery techniques that use non-uniform dictionaries to improve the performance of sparse recovery on a parallel architecture.
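The sparse-recovery step can be illustrated with a generic greedy solver: given measurements that are a sparse combination of dictionary atoms, Orthogonal Matching Pursuit (OMP) recovers the few active coefficients. The random dictionary, sizes and support below are stand-ins; in the thesis the dictionary would hold sampled plane-wave responses of the microphone array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sparse-recovery problem: y = D a with a k-sparse coefficient
# vector a. D is random here purely for illustration.
n_meas, n_atoms, k = 32, 128, 3
D = rng.standard_normal((n_meas, n_atoms))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
true_support = [5, 40, 90]
a_true = np.zeros(n_atoms)
a_true[true_support] = [1.0, -0.8, 0.5]
y = D @ a_true

def omp(D, y, k):
    """Orthogonal Matching Pursuit: greedily pick the atom most correlated
    with the residual, then re-fit all selected atoms by least squares."""
    support, r = [], y.copy()
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ r))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        r = y - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a

a_hat = omp(D, y, k)
```

The dominant costs are the correlation `D.T @ r` and the small least-squares re-fit, which is why the multithreaded platforms surveyed in the abstract pay off, and why dictionary size directly drives the compute cost.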

    A mixed-signal computer architecture and its application to power system problems

    Radical changes are taking place in the landscape of modern power systems. This massive shift in the way the system is designed and operated has been termed the advent of the "smart grid". One of its implications is a strong market pull for faster power system analysis computing. This work is concerned in particular with transient simulation, which is one of the most demanding power system analyses. This refers to the imitation of the operation of the real-world system over time, for time scales that cover the majority of slow electromechanical transient phenomena. The general mathematical formulation of the simulation problem includes a set of non-linear differential algebraic equations (DAEs). The algebraic part of this set includes heavy linear algebra computations, related to the admittance matrix of the topology, which are a critical factor in the overall performance of a transient simulator. This work proposes the use of analog electronic computing as a means of exceeding the performance barriers of conventional digital computers for the linear algebra operations. Analog computing is integrated in the frame of a power system transient simulator, yielding significant computational performance benefits to the latter. Two hybrid, analog and digital computers are presented. The first prototype has been implemented using reconfigurable hardware. At its core, analog computing is used for linear algebra operations, while pipelined digital resources on a field programmable gate array (FPGA) handle all remaining computations. The properties of the analog hardware are thoroughly examined, with special attention to accuracy and timing. The application of the platform to the transient analysis of power system dynamics showed a speedup of two orders of magnitude against conventional software solutions.
    The second prototype is proposed as a future conceptual architecture that would overcome the limitations of the already implemented hardware while retaining its virtues. The design space of this future architecture has been thoroughly explored with the help of a software emulator. For one possible suggested implementation, speedups of four orders of magnitude against software solvers have been observed for the linear algebra operations.
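The per-time-step linear algebra that the analog hardware accelerates comes from discretizing the DAEs; a standard trapezoidal-rule companion model turns each dynamic element into a conductance plus a history source, reducing every step to a linear solve. The sketch below applies this classic scheme to a series R-C circuit with a DC source; the component values are illustrative, not from the thesis.

```python
# Transient-simulation kernel sketch: trapezoidal-rule companion model of
# a series R-C circuit driven by a DC source. Each time step reduces to a
# small linear (here scalar) solve at the capacitor node.
R, C, VSRC, H = 1e3, 1e-6, 1.0, 1e-5     # ohms, farads, volts, seconds
g_c = 2.0 * C / H                        # trapezoidal companion conductance

v_c, i_hist = 0.0, 0.0                   # initial node voltage and history
trace = []
for _ in range(500):
    # nodal equation at the capacitor node: (1/R + g_c) v = VSRC/R + i_hist
    v_c = (VSRC / R + i_hist) / (1.0 / R + g_c)
    i_c = g_c * v_c - i_hist             # capacitor branch current
    i_hist = i_c + g_c * v_c             # history source for the next step
    trace.append(v_c)
```

After 500 steps (five time constants) the node voltage has nearly reached the 1 V source, reproducing the familiar RC charging curve; in a full network the scalar division becomes the admittance-matrix solve that dominates simulator runtime.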

    Parallel and Multistep Simulation of Power System Transients

    The simulation of electromagnetic transients (EMT) has become indispensable to utility engineers in a multitude of studies in power systems. The EMT approach is of wideband nature and applicable to both slower electromechanical as well as faster electromagnetic transients. However, the ever-growing complexity of modern-day power systems, especially those with HVDC interconnections and wind generation, considerably increases the computational time of EMT studies, which require the accurate solution of usually large sets of differential and algebraic equations (DAEs) with a pre-determined time-step. Reducing the computing time for solving complex, practical and large-scale power system networks has therefore become an active research topic. This thesis proposes new fast, flexible and accurate numerical methods for the simulation of power system electromagnetic transients. As a first step in this thesis, a parallel and multistep approach based on the Functional Mock-up Interface (FMI) standard for power system EMT simulations with complex control systems is developed. The co-simulation form of the FMI standard, a tool-independent interface standard aiming to facilitate data exchange between dynamic models developed in different simulation environments, is implemented in EMTP.
    Taking advantage of the compatibility established between the FMI standard and EMTP, various computationally demanding control systems can be decoupled from the power network in memory, solved independently on separate processors, and communicate with the power network through a co-simulation interface during a simulation. This not only reduces the total computation burden on a single processor, but also allows parallel and multistep simulation of the decoupled control systems. Following a master-slave co-simulation scheme (with the master representing the power network and the slaves denoting the decoupled control systems), two co-simulation modes, the asynchronous and the synchronous mode, are proposed in the first stage of the development. In the asynchronous mode, all decoupled subsystems are simulated in parallel with a single numerical integration time-step, whereas the synchronous mode allows the use of different numerical time-steps in a sequential co-simulation environment. The communication between master and slaves is coordinated by functions employing the low-level synchronization primitive known as the semaphore.
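The semaphore-coordinated master-slave hand-off can be mimicked in a few lines: the master (standing in for the network solver) publishes its output each step, and the slave (standing in for a decoupled control system) returns its feedback before the master advances. The toy "network solution" and "control law" below are placeholders, not EMTP or FMI models.

```python
import threading

# Lock-step master-slave co-simulation sketch: two semaphores coordinate
# the once-per-time-step data exchange between the threads.
STEPS = 5
exchange = {"u": 0.0, "y": 0.0}          # shared co-simulation interface
master_ready = threading.Semaphore(0)
slave_ready = threading.Semaphore(0)

def slave():
    for _ in range(STEPS):
        master_ready.acquire()                  # wait for master's output u
        exchange["y"] = 0.5 * exchange["u"]     # toy control law
        slave_ready.release()                   # hand the result back

t = threading.Thread(target=slave)
t.start()
for step in range(STEPS):
    exchange["u"] = float(step)          # toy network solution for this step
    master_ready.release()               # publish u to the slave
    slave_ready.acquire()                # wait for the slave's feedback y
t.join()
result = exchange["y"]
```

With several slaves, the master releases all of their semaphores at once so the decoupled control systems run concurrently within the step, which is the essence of the asynchronous mode described above.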