Multi-level simulation of nano-electronic digital circuits on GPUs
Simulation of circuits and faults is an essential part of design and test validation tasks for contemporary nano-electronic digital integrated CMOS circuits.
Shrinking technology processes with smaller feature sizes and strict performance and reliability requirements demand not only detailed validation of the functional properties of a design, but also accurate validation of non-functional aspects, including timing behavior. However, due to the rising complexity of circuit behavior and the steady growth of designs in transistor count, timing-accurate simulation of current designs requires substantial computational effort, which can be handled only through proper abstraction and a high degree of parallelization.
This work presents a simulation model for scalable and accurate timing simulation of digital circuits on data-parallel graphics processing unit (GPU) accelerators.
By providing compact modeling and data structures, and by exploiting multiple dimensions of parallelism, the simulation model enables not only fast and timing-accurate simulation at logic level, but also massively parallel simulation with switch-level accuracy.
The model facilitates extensions for fast and efficient fault simulation of small delay faults at logic level, as well as first-order parametric and parasitic faults at switch level.
The parallelization on GPUs enables detailed and scalable simulation applicable even to multi-million-gate designs.
This way, comprehensive analyses of realistic timing-related faults in the presence of process and parameter variations are enabled for the first time.
Additional simulation efficiency is achieved by merging the presented methods into a unified simulation model that combines the unique advantages of the different levels of abstraction in a mixed-abstraction multi-level simulation flow to reach even higher speedups.
Experimental results show that the implemented parallel approach achieves unprecedented simulation throughput as well as high speedup compared to conventional timing simulators.
The underlying model scales to multi-million-gate designs and gives detailed insights into the timing behavior of digital CMOS circuits, thereby enabling large-scale applications that aid even highly complex design and test validation tasks.
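As a rough illustration of the kind of data-parallel, levelized gate evaluation such a model builds on, the sketch below evaluates all gates of one topological level at once across a batch of input patterns, with NumPy vectorization standing in for GPU threads. The netlist, the unit-delay timing, and all names are illustrative assumptions, not the paper's actual data structures.

```python
import numpy as np

# Hypothetical levelized 2-input gate netlist: gates within one level are
# independent, so a GPU would evaluate each level with one kernel launch
# (here, vectorization over input patterns stands in for GPU threads).
# Each gate is (output_signal, op, input_a, input_b).
LEVELS = [
    [(4, "AND", 0, 1), (5, "OR", 2, 3)],  # level 1
    [(6, "XOR", 4, 5)],                   # level 2
]
OPS = {"AND": np.logical_and, "OR": np.logical_or, "XOR": np.logical_xor}

def simulate(patterns, n_signals=7, unit_delay=1.0):
    """Evaluate the netlist for a whole batch of input patterns at once.

    values[s]  -- logic value of signal s for every pattern
    arrival[s] -- latest arrival time at signal s under a unit-delay model
    """
    values = np.zeros((n_signals, patterns.shape[1]), dtype=bool)
    arrival = np.zeros(n_signals)
    values[: patterns.shape[0]] = patterns

    for level in LEVELS:              # sequential over levels...
        for out, op, a, b in level:   # ...parallel within each level
            values[out] = OPS[op](values[a], values[b])
            arrival[out] = max(arrival[a], arrival[b]) + unit_delay
    return values, arrival

# 4 primary inputs, 8 random patterns evaluated simultaneously
# (pattern parallelism is one of the "multiple dimensions" exploited).
values, arrival = simulate(np.random.rand(4, 8) > 0.5)
print(values[6], arrival[6])
```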
A fine-grain time-sharing Time Warp system
Although Parallel Discrete Event Simulation (PDES) platforms relying on the Time Warp (optimistic) synchronization protocol already allow for exploiting parallelism, several techniques have been proposed to further improve performance. Among them we can mention optimized approaches for state restore, as well as techniques for load balancing or for (dynamically) controlling the degree of speculation, the latter being specifically targeted at reducing the incidence of causality errors that lead to wasted computation. However, in state-of-the-art Time Warp systems, event processing is not preemptable, which may prevent prompt reaction to the injection of higher-priority (i.e., lower-timestamp) events. Delaying the processing of these events may, in turn, increase the incidence of incorrect speculation. In this article we present the design and realization of a fine-grain time-sharing Time Warp system, to be run on multi-core Linux machines, which makes systematic use of event preemption in order to dynamically reassign the CPU to higher-priority events/tasks. Our proposal is based on a truly dual-mode execution, application vs. platform, which includes timer-interrupt-based support for bringing control back to platform mode for possible CPU reassignment at very fine-grained periods. The latter facility is offered by an ad-hoc timer-interrupt management module for Linux, which we release, together with the overall time-sharing support, within the open-source ROOT-Sim platform. An experimental assessment based on the classical PHOLD benchmark and two real-world models is presented, which shows how our proposal effectively reduces the incidence of causality errors compared to traditional Time Warp, especially when running with higher degrees of parallelism.
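To make the causality problem concrete, the following minimal sketch (hypothetical; it mirrors neither ROOT-Sim's API nor the paper's preemption mechanism) shows a single Time Warp logical process that speculates ahead and rolls back to a saved snapshot when a straggler, i.e., a lower-timestamp event, arrives. The fine-grain preemption described above aims to detect such events before the current one even completes, shrinking exactly this rollback window.

```python
import copy
import heapq

class OptimisticLP:
    """Toy Time Warp logical process: events are executed speculatively in
    timestamp order; a straggler with a timestamp below the local clock
    triggers a rollback to the last consistent snapshot."""

    def __init__(self):
        self.state = {"count": 0}
        self.clock = 0.0
        self.queue = []                        # pending (timestamp, payload)
        self.log = []                          # executed events, kept for redo
        self.snapshots = [(0.0, {"count": 0})]

    def schedule(self, ts, payload):
        if ts < self.clock:                    # straggler: causality violation
            self._rollback(ts)
        heapq.heappush(self.queue, (ts, payload))

    def _rollback(self, ts):
        # Discard snapshots at or after ts, restore the last consistent
        # state, and re-enqueue the events that were executed too eagerly.
        while len(self.snapshots) > 1 and self.snapshots[-1][0] >= ts:
            self.snapshots.pop()
        self.clock, saved = self.snapshots[-1]
        self.state = copy.deepcopy(saved)
        for ev in [e for e in self.log if e[0] >= ts]:
            heapq.heappush(self.queue, ev)
        self.log = [e for e in self.log if e[0] < ts]

    def run(self):
        while self.queue:
            ts, payload = heapq.heappop(self.queue)
            self.clock = ts
            self.state["count"] += payload     # the "model" being simulated
            self.log.append((ts, payload))
            self.snapshots.append((ts, copy.deepcopy(self.state)))

lp = OptimisticLP()
for ev in [(1.0, 1), (2.0, 1), (3.0, 1)]:
    lp.schedule(*ev)
lp.run()
lp.schedule(1.5, 10)   # straggler after speculation: rolls back to t < 1.5
lp.run()
print(lp.state)        # {'count': 13}, matching a fully causal execution
```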
Transparent multi-core speculative parallelization of DES models with event and cross-state dependencies
In this article we tackle transparent parallelization of Discrete Event Simulation (DES) models to be run on top of multi-core machines according to speculative schemes. The innovation in our proposal lies in considering a more general programming and execution model than the one targeted by state-of-the-art PDES platforms: the state portion accessible while processing an event at a specific simulation object is not limited to that object's own state or to shared global variables. Rather, the simulation object is allowed to access (and alter) the state of any other object, causing what we term a cross-state dependency. We note that this model exactly complies with typical (easy-to-manage) sequential-style DES programming, where a (dynamically allocated) state portion of object A can be accessed by object B in read or write mode (or both), e.g., by passing a pointer to B as the payload of a scheduled simulation event. However, while read/write memory accesses performed in a sequential run are always guaranteed to observe (and to give rise to) a consistent snapshot of the state of the simulation model, consistency is not automatically guaranteed in the case of parallelization and concurrent execution of simulation objects with cross-state dependencies. We cope with this consistency issue, and its application-transparent support, in the context of parallel optimistic executions. This is achieved by introducing an advanced memory-management architecture, able to efficiently detect read/write accesses by concurrent objects to any object state in an application-transparent manner, together with advanced synchronization mechanisms that exploit the parallelism of the underlying multi-core architecture while transparently handling both cross-state and traditional event-based dependencies. Our proposal targets Linux and has been integrated with the ROOT-Sim open-source optimistic simulation platform, although its design principles, and most parts of the developed software, are of general relevance.
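Below is a minimal sketch of how such a cross-state dependency arises, with all names hypothetical: object A passes a reference to its own state as an event payload, and object B then reads and writes that state directly. The wrapper only mimics, at the Python level, the access detection that the paper performs transparently in the memory-management layer under Linux.

```python
class TrackedState:
    """Illustrative stand-in for the paper's memory-management layer:
    records which simulation object reads or writes this state."""

    def __init__(self, owner):
        self.owner = owner
        self._data = {}
        self.accesses = []           # (accessing object, field, mode)

    def read(self, who, field):
        self.accesses.append((who, field, "r"))
        return self._data.get(field)

    def write(self, who, field, value):
        self.accesses.append((who, field, "w"))
        self._data[field] = value

# Object A schedules an event for object B whose payload is a reference
# to A's own state; handling the event makes B touch A's state directly,
# creating a cross-state dependency the platform must synchronize.
state_a = TrackedState("A")
state_a.write("A", "queue_len", 4)

def handle_event_at_B(payload_state):
    n = payload_state.read("B", "queue_len")       # B reads A's state...
    payload_state.write("B", "queue_len", n + 1)   # ...and writes it back

handle_event_at_B(state_a)
cross = [a for a in state_a.accesses if a[0] != state_a.owner]
print(cross)   # [('B', 'queue_len', 'r'), ('B', 'queue_len', 'w')]
```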
A new tool for the performance analysis of massively parallel computer systems
We present a new tool, GPA, that can generate key performance measures for very large systems. Based on solving systems of ordinary differential equations (ODEs), this method of performance analysis is far more scalable than stochastic simulation. The GPA tool is the first to produce higher-moment analysis from a differential-equation approximation, which in many cases is essential for an accurate performance prediction. We identify so-called switch points as the source of error in the ODE approximation. We investigate the switch-point behaviour in several large models and observe that as the scale of the model increases, the accuracy of the ODE performance prediction generally improves. In the case of the variance measure, we are able to justify theoretically that in the limit of model scale, the ODE approximation can be expected to tend to the actual variance of the model.
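As a toy illustration of the fluid-style analysis such a tool performs (the model, rates, and populations below are made up, and GPA itself works on process-algebra models rather than hand-written ODEs), the sketch integrates the mean-field ODEs of a client/server system. The min() term is a regime switch of exactly the kind the abstract identifies as the main source of approximation error.

```python
import numpy as np
from scipy.integrate import odeint

# Hypothetical client/server model in mean-field (fluid) form: waiting
# clients are served at rate r_q per free server, then "think" at rate
# r_t before returning. The min() is where the ODE switches regimes --
# a "switch point" in the paper's terminology.
r_q, r_t, servers = 2.0, 1.0, 20.0

def fluid(y, t):
    waiting, thinking = y
    service = r_q * min(waiting, servers)   # regime switch at waiting == servers
    return [r_t * thinking - service,       # d(waiting)/dt
            service - r_t * thinking]       # d(thinking)/dt

t = np.linspace(0.0, 10.0, 200)
traj = odeint(fluid, [100.0, 0.0], t)       # 100 clients, all waiting at t=0
print(traj[-1])                             # approximate steady-state means
```

Solving one small ODE system like this replaces averaging over many stochastic simulation runs, which is the source of the scalability advantage the abstract claims.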
A case study for NoC-based homogeneous MPSoC architectures
The many-core design paradigm requires flexible and modular hardware and software components to provide the required scalability to next-generation on-chip multiprocessor architectures. A multidisciplinary approach is necessary to consider all the interactions between the different components of the design. In this paper, a complete design methodology that tackles at once the aspects of system-level modeling, hardware architecture, and programming model has been successfully used for the implementation of a multiprocessor network-on-chip (NoC)-based system, the NoCRay graphic accelerator. The design, based on 16 processors, after prototyping with a field-programmable gate array (FPGA), has been laid out in 90-nm technology. Post-layout results show very low power and area, as well as a 500 MHz clock frequency. Results show that an array of small, simple processors outperforms a single high-end general-purpose processor.
A low-cost parallel implementation of direct numerical simulation of wall turbulence
A numerical method for the direct numerical simulation of incompressible wall turbulence in rectangular and cylindrical geometries is presented. Its distinctive feature is a design targeted towards efficient distributed-memory parallel computing on commodity hardware. The adopted discretization is spectral in the two homogeneous directions; fourth-order-accurate, compact finite-difference schemes over a variable-spacing mesh in the wall-normal direction are key to our parallel implementation. The parallel algorithm is designed to minimize data exchange among the computing machines, and in particular to avoid taking a global transpose of the data during the pseudo-spectral evaluation of the non-linear terms. The computing machines can then be connected to each other through low-cost network devices. The code is optimized for low memory requirements, which can moreover be subdivided among the computing nodes. The layout of a simple, dedicated and optimized computing system based on commodity hardware is described. The performance of the numerical method on this computing system is evaluated and compared with that of other codes described in the literature, as well as with that of the same code implementing a commonly employed strategy for the pseudo-spectral calculation. (To appear in J. Comput. Phys.)
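For context, the sketch below shows the standard single-node pseudo-spectral evaluation of a nonlinear term in 1D with 3/2-rule dealiasing. It is a generic textbook technique, not the paper's code, whose contribution is organizing the 3D analogue across distributed-memory nodes without a global transpose.

```python
import numpy as np

def nonlinear_term(u_hat, N):
    """Pseudo-spectral evaluation of u * du/dx for a 1D periodic field on
    x in [0, 2*pi), with 3/2-rule dealiasing: transform to a padded grid,
    multiply pointwise, transform back, truncate."""
    M = 3 * N // 2                            # padded grid size (3/2 rule)
    ik = 1j * np.fft.rfftfreq(N, d=1.0 / N)   # i * wavenumber

    def to_padded_grid(a_hat):
        p = np.zeros(M // 2 + 1, dtype=complex)
        p[: N // 2 + 1] = a_hat               # zero-pad the spectrum
        return np.fft.irfft(p, n=M) * (M / N) # keep physical amplitudes

    u = to_padded_grid(u_hat)
    ux = to_padded_grid(ik * u_hat)           # spectral derivative
    prod_hat = np.fft.rfft(u * ux) * (N / M)
    return prod_hat[: N // 2 + 1]             # truncate back to N modes

N = 64
x = 2 * np.pi * np.arange(N) / N
u_hat = np.fft.rfft(np.sin(x))
# For u = sin(x): u*u_x = 0.5*sin(2x), whose rfft coefficient at k=2
# is -i*N/4 = -16j for N = 64.
print(np.round(nonlinear_term(u_hat, N)[2], 6))
```

In the paper's 3D distributed setting, the expensive step hidden by this 1D picture is redistributing data between the directions of the transforms, which is exactly the global transpose the algorithm is designed to avoid.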
- …