Introduction
This paper presents the design of a scalable infrastructure, called Paradys, developed for parallel circuit simulation. Early measurements of its scalability (a parallel efficiency of some 0.9) are encouraging: they invite measurements on larger parallel configurations and suggest that Paradys could be applied to the simulation of deep sub-micron technology.
From the very beginning, the main goals assigned to Paradys are to:
1. develop algorithms for dynamic load balancing on parallel computers and adaptive algorithms for merging subcircuits in correlation with the instantaneous distributed power of the processing elements.
2. develop parallel algorithms for the detection of subcircuits with an analog or digital behavior in ULSI.
Overall Structure
The overall design of Paradys is led by the following basic assumptions:
scalability: it is the main ingredient; whenever a trade-off arises, it should prevail.
generality: it should always be kept open.
portability: though Paradys is developed on a concrete hardware configuration (an IBM SP2, herein known as SP), everything is done so that the code can easily be ported onto other parallel (or distributed) platforms.
As depicted in Figure 1, the components of the Paradys infrastructure are the following:
Paravis is related to the Paradys design component and runs on a specific node (PE0) of the parallel configuration: it displays the events occurring in the parallel infrastructure and allows interaction with the running simulation. A dedicated paragraph gives a thorough description of this graphical interface.
Data Base is a simple database containing Spice sentences (also known as netlists). It is split (i.e. partitioned) into "mini db's" (i.e. a kind of sub-netlist) by the Partitioning phase of the infrastructure.
Partitioning is developed based on the Toggle algorithm (cf. [11], [1]) and is intended to produce a picture of the different subcircuits that can be simulated in parallel (the set of ultimate grains), along with the way they are linked together.
These ultimate grains are potentially regrouped based on the actual SP configuration selected. Given that the ratio of grains (number of subcircuits) to PEs (number of PEs) is estimated at a minimum of around 5, there is an upper bound on the number of grains per available PE; the lower bound is driven by the number of transistors per grain, which is empirically determined to be between 100 and 200 (indeed, smaller subcircuits trigger too high a latency in the available Spice processing).
feedback detection: Feedbacks are commonplace in circuits. In Partitioning, when the maximum granularity is searched for, they are not considered. Later on, during the Regrouping phase, they are detected: whenever a feedback can be contained within a subcircuit, this is done by regrouping the original grains concerned. This feedback-based regrouping is limited by the size of the resulting subcircuit as well as by the number of resulting subcircuits. Though they complicate the overall process, large feedbacks are kept outside the subcircuits (i.e., they are visible to the Paradys infrastructure and handled accordingly). Further developments are being considered in order to handle them in a more general manner.
analog vs digital criteria: Contiguous original ultimate grains of the same nature (analog or digital) are also potentially regrouped, while always keeping in balance the number of subcircuits and their respective sizes.
This granularity adaptation (or regrouping) is handled within PE0 as the final initialisation phase.
During the Partitioning and Regrouping phases, a configuration pre-tailoring is going on in all the other available PEs, checking the different parameters (or adapters) of each PE so that they can be set to their optimal values (for both latency and bandwidth) for the actual parallel part of Paradys processing.
Data-driven scheme
During the design phase of Paradys, an alternative architecture was thoroughly considered. Indeed, despite the Paradys objective of being scalable up to several thousand PEs, realpolitik obliged us to consider small and medium-sized configurations (from tens to hundreds of PEs).
In an attempt to address those small configurations as well, a distributed model was considered for a while. Given the complexity and the programming load it would have imposed on the project, this model was discarded in favor of a unique data-driven model addressing the long-term objective of Paradys (i.e. 10,000 PEs), which is described in the following paragraphs.
Data-driven scheduling
At the beginning of Paradys processing, a given processing element (PE0) has a specific role: it runs the initialising phase of the simulation, followed by the partitioning phase, producing the ultimate grains and the way they are linked together. (Follow-on versions of Paradys are expected to parallelize the partitioning phase as well, exploiting space-filling curve algorithms such as those proposed in References [3], [2].) A further refinement known as Regrouping is also handled there to fit the different criteria, as expressed in the previous paragraph.
From then on, this specific role of PE0 is terminated; PE0 then plays the role of the Paravis PE (as described in a specific paragraph) for the rest of the simulation. Every other PE of the available configuration has an instance of the scheduler, which is hence distributed. Such a distributed structure of the scheduler, avoiding a single point of failure as well as a host-node structure as advocated in Reference [9], is expected to achieve the scalability objectives required of Paradys.
This distributed scheduler memorizes the db's mapping and keeps track of the progress of the simulation based on the contents of a piece of shared memory containing the overall subcircuit status/mapping. Each block in the shared memory is either an SC-CB (for SubCircuit-ControlBlock, a C++ object) representing the different signals required by a given subcircuit to be simulated, or a control block representing the electrical nodes. These SC-CBs reside in the shared memory and form the base upon which the firing rules are applied in order to build up an overall data-driven scheduling; the firing rules are triggered by properly overloaded C++ operators.
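As an illustration of a firing rule triggered by an overloaded operator, the following is a minimal sketch and not the actual Paradys code. The class name ScCb, the choice of operator<<, the pending-input counter and the helper fireable are all assumptions made for the example; the real SC-CB resides in shared memory, whereas this sketch uses ordinary process-local objects.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical control block for one subcircuit (illustrative names only).
class ScCb {
public:
    ScCb(std::string id, std::size_t num_inputs)
        : id_(std::move(id)), pending_(num_inputs) {}

    // Overloaded operator: delivering one input signal updates the firing
    // state. Returning *this allows deliveries to be chained.
    ScCb& operator<<(double /*signal_value*/) {
        assert(pending_ > 0 && "more signals delivered than inputs declared");
        --pending_;
        return *this;
    }

    // Firing rule: the subcircuit may be simulated once every input arrived.
    bool ready_to_fire() const { return pending_ == 0; }
    const std::string& id() const { return id_; }

private:
    std::string id_;
    std::size_t pending_;  // number of input signals still missing
};

// Collect the identifiers of all subcircuits whose inputs are complete.
std::vector<std::string> fireable(const std::vector<ScCb>& blocks) {
    std::vector<std::string> ids;
    for (const ScCb& block : blocks) {
        if (block.ready_to_fire()) ids.push_back(block.id());
    }
    return ids;
}
```

In this sketch a scheduler instance would call fireable after each signal delivery and hand the returned subcircuits to idle PEs.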
Load Balancing
Since the simulation time of each subcircuit is quite variable, it is expected that the conjunction of
1. the ultimate grains approach (regrouped according to the available SP configuration, feedback detection and Analog-vs-Digital criteria) provided by the partitioning component, and
2. the data-driven scheduling scheme
provides the adequate load balancing required to achieve the expected Paradys scalability.
The way the SC-CBs are chained represents the static ordering of the subcircuit simulations. The way in which they are dynamically (i.e. actually) ordered is mainly due to three factors:
1. the relative speed of simulation execution for each subcircuit,
2. PE availability,
3. subcircuit simulation convergence.
Loads and Convergence
Partitioning a circuit into subcircuits decouples the subcircuits from one another. This creates a situation where each subcircuit is 'electrically isolated' from the others. At the time of the first iteration, all loads are set to zero. As the first iteration progresses among the different subcircuits, proper loads are reflected as inputs to the appropriate subcircuit(s). This whole process triggers the need to iterate through the simulation of each subcircuit until every one of them has converged (i.e. the loads have stabilized their effects on each of them).
A subcircuit is flagged as converged when all its input signals and loads have converged and when all its output signals have converged.
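The convergence flag described above can be sketched as follows. This is a hedged illustration only: the text does not say how convergence of a single signal is tested, so the comparison of two successive iterations within a tolerance tol, and the types Waveform, SignalHistory and Subcircuit, are assumptions introduced for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A plain vector of samples stands in for a waveform (assumption).
using Waveform = std::vector<double>;

// Values of one signal at the previous and at the current iteration.
struct SignalHistory {
    Waveform prev;
    Waveform curr;
};

// Hypothetical per-subcircuit bookkeeping.
struct Subcircuit {
    std::vector<SignalHistory> inputs;
    std::vector<SignalHistory> loads;
    std::vector<SignalHistory> outputs;
};

// Assumed criterion: a signal has converged when two successive iterations
// agree within `tol` at every sample point.
bool signal_converged(const SignalHistory& s, double tol) {
    assert(s.prev.size() == s.curr.size());
    for (std::size_t i = 0; i < s.curr.size(); ++i) {
        if (std::fabs(s.curr[i] - s.prev[i]) > tol) return false;
    }
    return true;
}

bool all_converged(const std::vector<SignalHistory>& signals, double tol) {
    for (const SignalHistory& s : signals) {
        if (!signal_converged(s, tol)) return false;
    }
    return true;
}

// A subcircuit is flagged as converged when all its input signals, loads
// and output signals have converged.
bool subcircuit_converged(const Subcircuit& sc, double tol) {
    return all_converged(sc.inputs, tol) &&
           all_converged(sc.loads, tol) &&
           all_converged(sc.outputs, tol);
}
```

The outer simulation loop would repeat the subcircuit simulations until subcircuit_converged holds for every subcircuit.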
Overall design
After the initialisation phase (i.e. Partitioning and Regrouping), the main components of Paradys involved in the infrastructure are as follows:
NetList Manager: the existence of both inputs and loads to be managed by Paradys between subcircuits implies the existence of a permanent process, called a NetList Manager, present in each PE. The NetList Manager is in charge of maintaining the actual values of inputs and loads (as they become available) within the netlists provided to Spice. Such a permanent process was preferred in order to smooth the communication load between the different PEs. The alternative would have been to wait for all the inputs and loads to be available before pulling them onto the appropriate PE, which would have triggered a peak in the communication load between the PEs.
Simulator: in a first approach, Spice is run as a distinct application in each node of the SP configuration, as each "mini db" can be perceived as an actual DB (i.e. netlist) by each Spice instance.
It should be noted that in the case of an SP node made of n processors (i.e. an SMP), n instances of the simulator run simultaneously in the given SP node. In such a case, we are dealing with n logical PEs, each of them being implemented as a pair of Distributed-Scheduler / NetList Manager processes, described later.
Resulting Waveforms are produced by the simulation of each subcircuit, as dictated by the data-driven scheme, and are kept on the PE where the given iteration has just been executed. The PE number is recorded in the SC-CB representing the given subcircuit. When the time comes to schedule the next iteration (i.e. the next time the firing rule is to be applied for the given subcircuit), this PE number is considered by the distributed scheduler as the preferred PE to be selected, hence providing maximum data locality.
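The preferred-PE rule can be sketched as below. The text only states that the PE recorded in the SC-CB is preferred; the fallback used here (the first idle PE) and the names choose_pe and kNoPe are assumptions made for illustration, not the policy of the actual scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Marker meaning "no PE" (hypothetical convention for this sketch).
constexpr int kNoPe = -1;

// Select the PE for the next iteration of a subcircuit.
// preferred_pe: PE recorded in the SC-CB at the previous iteration, or kNoPe.
// idle[i]:      true when PE i is currently available.
int choose_pe(int preferred_pe, const std::vector<bool>& idle) {
    assert(preferred_pe == kNoPe ||
           (preferred_pe >= 0 &&
            static_cast<std::size_t>(preferred_pe) < idle.size()));

    // Data locality first: reuse the PE that already holds the waveforms.
    if (preferred_pe != kNoPe &&
        idle[static_cast<std::size_t>(preferred_pe)]) {
        return preferred_pe;
    }
    // Assumed fallback: take the first idle PE.
    for (std::size_t pe = 0; pe < idle.size(); ++pe) {
        if (idle[pe]) return static_cast<int>(pe);
    }
    return kNoPe;  // every PE is busy
}
```

When the fallback is taken, the waveforms of the previous iteration have to be moved to the selected PE, which is the cost the preference is meant to avoid.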
Dynamically managing the memory gap
The growing gap between processor and memory has been exacerbated by the way we use computers: obtaining the actual data for a given instruction can take several, if not many, memory accesses, depending on the way the data are structured.
As part of the Paradys project, we were led by design to build up a shared memory. As an assist to the Paradys infrastructure written in C++, and transparent to this programming model, the author developed a software layer in order to access objects resident in the required shared memory. Such a software layer, coined ShMC++, is a first step in implementing, in the Paradys parallel environment, the device proposed in references [7, 8].
Structured data across the wall
As argued in many instances [5, 4, 6], there are numerous requirements for handling dynamic complex data structures. However, current designs (either at the processor or at the system level) do not address the point in a satisfactory manner; witness the inadequacy of vector processors or High Performance Fortran in exploiting processor speed for these kinds of data.
To give a concrete case, it has actually been measured that on a superscalar processor potentially able to run 5 instructions per cycle, the achieved rate for High Energy Physics (HEP) data is, in fact, 0.8 instructions per cycle¹: data are not accessed at the necessary rate because too many indirections have to be handled along the von Neumann bottleneck.
Profiling of HEP applications [10] shows it clearly: in such environments, 50% of the time is spent handling indirections in order to access the actual data.
Given the discrepancy between the yearly increase in microprocessor performance and the much slower yearly increase in DRAM speed², the gap is widening, and its effect is further exacerbated by dynamic complex data structures accessed in a non-local manner.
¹ Personal communication of Sverre Jarp, Atlas experiment, CERN.
² Respectively estimated by E Baskett at 80% and 7% (in his keynote address at the International Symposium on Shared Memory Multiprocessing, April 1991), they have recently been rated at 60% and 10% by Maurice Wilkes (cf. reference [13]).
Encouraged by preliminary results obtained by software simulation of the CERN Benchmark Jobstream [8], we developed a concrete case around a shared memory access. Such an access is important for parallelism in multithreading environments as well as in the perspective of a NUMA architecture.
We now describe how the principles proposed in [7, 8] are implemented as an assist in order to dynamically handle access to Paradys shared memory.
ShMC++ motivations
In the context of accessing Paradys shared memory, one needs to reduce the latency of both reads and writes. This implementation of the principles proposed in [7, 8] shows that a dynamically managed access (ShMC++) to complex data structures residing in a shared memory can substantially reduce the overall latency.
Control blocks, implemented as tiles³ within the shared memory, represent the subcircuits as well as the different signals required to simulate them. This set of control blocks resides in shared memory as a set of tiles and forms the base from which firing rules are triggered, forming the overall data-driven scheduling of Paradys. Since such control blocks are C++ objects, the firing rules are ideally triggered by properly overloading a C++ operator⁴.
³ We take the term tile in its general meaning: an element that, together with others, fills up the shared memory content.
⁴ C++ overloading, which allows operator semantics to be freely modified during read/write access to objects, has been extensively exploited in ShMC++ as a means to develop new types of object access while remaining transparent to the programming model.
ShMC++, principles
As part of the Paradys design, the availability of a shared memory was assumed. As the locally available hardware (an IBM SP2) does not embody such a mechanism, we developed a home-grown, software-based distributed shared memory. As an assist to the Paradys infrastructure written in C++⁵, and transparent to this programming model, the author developed a software layer in order to access objects resident in the shared memory.
Such a software layer, coined ShMC++, is an implementation, in a parallel environment, of the proposed device [7, 8] along the gap between processing elements and a shared memory. ShMC++ allows the parallel application to define the required tilings as well as to program their access. When a participating process starts, its ShMC++ part contains no reference to (i.e. knowledge of) the shared objects. As such a process starts to reference the shared objects, the structures of the latter are automatically built up for the given process in the local node, which potentially gives each process the same perspective of the shared memory content. When referenced, the structures of the shared objects are built up in the ShMC++ part; the fields corresponding to the actual data (such as the integer, Flag or SC-status fields in Example 1) contain instead the address to be used for accessing such fields in the shared memory.
⁵ C++, even as standardized, does not offer semantics for accessing shared objects either from different threads within a process or from different processes running in parallel, either in an SMP or an MPP environment.
Example 1 is an instance of a C++ object (called SC-CB) used in Paradys. The ShMC++ sentences are implemented as C++ directives that specify how the data structure of the given object should be tiled in the shared memory. Each sentence contains a verb: for example, the directive //ShMC++ Id specifies that the following variable (in the given case, SC-id) must be used as a unique identifier for the tile to be built.
Such ShMC++ sentences direct how to build up tiles in the shared memory that contain only the actual data for each instance of the given C++ object. The set of pointers building up the structure of normal C++ objects is not placed in the shared memory but is dynamically managed in each separate participating process.
Overall, a given shared object can actually be represented by the contents of two locations:
1. its structure, known and defined in the ShMC++ parts of a certain number of processes (hence, at a given moment, there are n such structures, n being <= the number of participating processes),
2. the set of actual data it is made of, uniquely stored and maintained in the shared memory without any structure.
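The two-location representation can be illustrated with a small proxy sketch. This is not the ShMC++ implementation or its interface: the flat array standing in for the shared memory, the proxy class SharedInt, the view ScCbView and its two-field tile layout are all hypothetical, chosen only to show a process-local structure whose fields hold addresses while the values live, unstructured, in the shared store.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the shared memory: raw values only, no structure (assumption).
using SharedMemory = std::vector<int>;

// Field of the process-local structure: it stores the address (here an
// index) of the datum, not the datum itself.
class SharedInt {
public:
    SharedInt(SharedMemory& mem, std::size_t address)
        : mem_(&mem), address_(address) {
        assert(address < mem.size());
    }
    // Read access: the conversion operator fetches the value from the store.
    operator int() const { return (*mem_)[address_]; }
    // Write access: assignment stores the value into the shared store.
    SharedInt& operator=(int value) {
        (*mem_)[address_] = value;
        return *this;
    }
    // Assigning one proxy to another copies the value, not the address.
    SharedInt& operator=(const SharedInt& other) {
        return *this = static_cast<int>(other);
    }

private:
    SharedMemory* mem_;
    std::size_t address_;
};

// Process-local view of one tile; a tile of two ints is a hypothetical layout.
struct ScCbView {
    SharedInt flag;
    SharedInt status;
    ScCbView(SharedMemory& mem, std::size_t tile_base)
        : flag(mem, tile_base), status(mem, tile_base + 1) {}
};
```

Two views built on the same tile base play the role of two participating processes: each owns its own structure, yet a write through one is visible through the other because the data exist only once.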
First measurements
Our implementation has been exercised for different circuit sizes, on configurations of 2, 4 and 8 PEs. The software distributed shared memory is evenly distributed over the PEs considered. Its structure is quite simple, first of all for lack of programming resources to develop a full-blown distributed shared memory, but also because the shared memory considered is filled up once at the end of partitioning, after which its structure does not change; only values are modified. We are thus not dealing with a sophisticated memory management scheme, cache coherency, etc., and their possible interference with the measured device.
The Paradys infrastructure is written in C++ and compiled with gcc 2.95 19990728. The link-edit is done with mpcc, a specific script provided with that machine. At run-time, the global variables related to the parallel environment are: MP_LABELIO=yes, MP_PROCS=n, MP_EUILIB=ip, MP_EUIDEVICE=css0.
For each circuit size, measurements are carried out in an equally stable environment.
ShMC++ effects on speedup
Given the major objective of the project (scalability), scalability control points are placed in the Paradys code in order to measure their respective effects. In this perspective, we detail in this paragraph the effects of ShMC++ as measured on the Paradys infrastructure. In those specific (i.e., ShMC++) scalability measurements, partitioning is kept under control in order to obtain subcircuits of the same size: in consequence, we express the circuit size in terms of the number of subcircuits. Each subcircuit is made up of around 120 similar transistors, so that the complexity of a subcircuit simulation is identical and does not interfere with the targeted objectives of the measurement.
For each circuit size, measurements are carried out in an equally stable environment, for both the normal case (i.e., without ShMC++) and the proposed device (i.e., with ShMC++). The effect of the studied device can be characterized by a decrease in the number of shared memory accesses, i.e. in the memory pressure or the traffic on the von Neumann bottleneck.
The experimental data show the variation in speedup, calculated here as $S(p) = T(1)/T(p)$, with T(1) being the elapsed time to simulate a given circuit on 1 PE, and T(p) the elapsed time to simulate the very same circuit in parallel on p PEs.
[Figure: axis label "Number of subcircuits"]
This new approach of dynamically managing the memory gap gives substantial gains, mainly by decreasing the traffic along the von Neumann bottleneck, which Paradys takes advantage of in order to achieve its main goal: scalability.
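For readers who want to reproduce the arithmetic, the speedup S(p) = T(1)/T(p) defined above, together with the usual parallel efficiency E(p) = S(p)/p, can be computed as in the following sketch. The function names are illustrative, and any timings passed to them are the reader's own; no measured Paradys value is implied.

```cpp
#include <cassert>

// Speedup S(p) = T(1) / T(p): t1 is the elapsed time on 1 PE, tp the
// elapsed time for the same circuit on p PEs (same time unit for both).
double speedup(double t1, double tp) {
    assert(t1 > 0.0 && tp > 0.0);
    return t1 / tp;
}

// Parallel efficiency E(p) = S(p) / p; 1.0 corresponds to ideal scaling.
double efficiency(double t1, double tp, int p) {
    assert(p > 0);
    return speedup(t1, tp) / p;
}
```

With these definitions, a run that takes a quarter of the single-PE time on 4 PEs has a speedup of 4 and an efficiency of 1.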
Follow-on activities
As mentioned in the different parts of this paper, follow-on activities are being considered for Paradys: dynamically managing the memory gap should be pursued by developing and analyzing a novel hardware device in order to alleviate the memory wall problem.
Acknowledgments
The author wishes to thank the members of the Paradys team who fully contributed to its success: P. Debefve for his advice as well as his sense of humor, C. Alexandre for successfully sniffing around, C. Dupuis for carving out the first C++ objects of this project, Y. 
