Distributed system-level simulation among coordinated, heterogeneous simulators requires communication and synchronization to preserve event causality. Once this is achieved, multiple coordinated, distributed instances of a single simulator not originally written for internal parallelism can be used to conduct the expression-level parallel execution of a model partitioned into subsystems, such that each subsystem is assigned to an individual simulator. Using a Kahn Process Network simulation backplane for coordination, and a custom Xspice TCP/IP socket device for interfacing, expression-level distributed simulation reduced the transient analysis time to as little as 1/52 of that of the same circuit in a single Ngspice instance, without modifying the Ngspice kernel or host execution environment. Up to 128 independent Ngspice instances were coordinated in parallel with this method, with a selectable tradeoff in speed versus accuracy.
I. INTRODUCTION
Distributed circuit simulation techniques can be used to coordinate the parallel simulation of circuit subsystems across multiple, process-independent instances of a single simulator not normally supporting internal parallelism. This is possible if the simulator offers a communication interface at the model expression level to communicate with other running instances, or to communicate with an arbitrating software agent such as a simulation backplane [1]. However, since each concurrent simulator may advance time independently (local virtual time) [2], a coordination solution must both communicate signal values and enforce event causality, such that the local causality constraint (LCC) is observed.
The LCC requires that concurrent simulators process external events in time step order [2]. Techniques addressing causality in distributed simulation are covered in depth in the literature in the field of Parallel Discrete Event Simulation (PDES) [3], and discrete event/continuous-time cosimulation [4].
If a coordination solution is achieved, it can be applied to conduct the multiple-simulator, parallel execution of a Spice circuit at the model expression level for the speedup of an otherwise long-duration, sequential transient circuit analysis. This is important for modern VLSI device counts, where Spice simulation becomes "typically infeasible for designs larger than 20,000 devices" [5]. Where parallel techniques can be employed, the non-linear increase in transient analysis time per device count for sequential executions can be reduced by conducting the Spice model-evaluation phase or the matrix-solving phase of each time point iteration in parallel. Reported solutions, however, require hardware acceleration (such as offloading to a GPU [5] or FPGA [6]), may not parallelize both the model-evaluation and the matrix-solving phases [6], or require changes to the Spice kernel (all cases reviewed). With the expression-level approach, by contrast, both model-evaluation and matrix-solving phases are parallelized in software, without modifying the Spice kernel [7] or host environment, at the cost of a selectable tradeoff in execution speed versus accuracy.
For distributed simulator coordination, we used the SimConnect/SimTalk infrastructure [8] to speed up the transient analysis of a counter circuit of more than three thousand transistors. We wrote an Xspice [9] user TCP/IP socket device, which allowed concurrently running Ngspice instances to exchange node information with the SimConnect simulator backplane. Parallelism was employed across 2 to 128 concurrently running Ngspice instances, for a speedup of up to 52x at less than 10 percent error of measurement, and up to 17x at less than 1 percent error of measurement. Percent error of measurement could be reduced arbitrarily, at the expense of longer simulation time, by increasing the resolution of the interpolated event (IE) data types exchanged between the Ngspice instances.
II. RELATED WORK
Parallel execution is not new to Spice circuit simulation, owing to the high potential for internal data parallelism [6] in the model-evaluation and matrix-solving phases of each time point iteration. Techniques exploiting this through execution means (parallel CPUs, GPU or FPGA offloading, or matrix algorithm techniques) are covered in [5], [6], and [10]-[13], but these methods require modifying the Spice kernel, may require expensive hardware, and, most importantly, are not generally portable to other simulators that have no studied internal parallelism. We term this post-model level of internal simulator parallelism "execution-level parallelism." That is, the parallelism is not carried out at the model description layer (the expression level), but rather at the model execution level, where it occurs "behind the scenes" to some degree from the view of the model writer.
Execution-level techniques are powerful, though, reporting up to an 18x speedup of Spice 3f5 benchmarks in [6], without loss of accuracy, by offloading the model-evaluation phase onto an FPGA. In another approach [5], the expensive BSIM3 transistor model evaluation steps, which "may comprise about 75% of the SPICE runtime" [5], are offloaded to a GPU for parallel execution, for up to a 4x speedup at the expense of single-precision floating-point accuracy (GPU execution) versus double-precision floating-point accuracy (BSIM3 model code). In the matrix-solving phase, domain decomposition methods can be employed to achieve a very large performance increase (up to 870x reported simulation speedup in [13]), but scaling limits the approach at around 400k nodes as the execution is hosted in only a single SPICE3/HSPICE instance. Finally, with multiple technique advancements and supercomputer parallel CPU execution, Sandia National Laboratories' Xyce simulator [14] "demonstrates good speedups (24x on 40 processors)," but results may be limited to "sufficiently large circuits" [6]. These are by no means an exhaustive set of speedup reports, but rather show that significant increases may be obtained at the execution level through hardware acceleration and software techniques, if they are available and the Spice kernel is modified.
For heterogeneous, distributed simulation, the "backplane" technique ([1], [15]-[17]) has been employed to solve the simulator coordination problem. Simulator backplanes are arbitrating software agents that distribute information across connected simulators and potentially control simulator time advancement. An interface must be constructed for each simulator that connects to the backplane to implement the coordination API. The SimConnect/SimTalk client-server backplane architecture described in [8] implements the dynamics of a Kahn Process Network (KPN) [18], such that the tokens of the KPN are interpolated event (IE) data types. Interpolated events are defined in [8] and summarized in section III. With this coordination, simulators are synchronized through dataflow, rather than explicit time step control, lessening the backplane API complexity.
Additionally, simulators have no process awareness of one another (per the rules of a KPN), only awareness of their input and output FIFOs, which offers easier management as the number of simulators increases.
III. DISTRIBUTED PARALLELISM
We define "expression-level parallelism" to start at the model description layer. At this layer, the description is inspected for points of partition, at which nodes the circuit is expressed as new, independent subcircuits with communication interfaces. Each new subcircuit is assigned to an independent simulator. The entire model then simulates in coordination over the distributed, coordinated instances of the single simulator normally hosting the non-partitioned model. In this way, if a communication interface is offered at the model description layer, parallel execution can be gained for simulators not normally supporting internal parallelism. The cost is the additional communication overhead between simulators (both a computation and latency cost), and the burden of coordinating distributed simulators with independent versions of time advancement (local virtual time).
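As a rough illustration of the partitioning step, the sketch below splits a circuit that is a linear chain of identical bit stages into equal-width slices and lists the boundaries where a communication interface would be placed. The chain structure and the names `partition_bits` and `boundary_nodes` are assumptions made for illustration; they are not part of SimConnect, SimTalk, or Ngspice.

```python
def partition_bits(total_bits: int, parallel_factor: int) -> list[tuple[int, int]]:
    """Return inclusive (lsb, msb) bit ranges for equal-width slices of a bit chain."""
    if parallel_factor <= 0 or total_bits % parallel_factor != 0:
        raise ValueError("parallel_factor must evenly divide total_bits")
    width = total_bits // parallel_factor
    return [(i * width, (i + 1) * width - 1) for i in range(parallel_factor)]


def boundary_nodes(slices: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (msb of slice k, lsb of slice k+1) pairs, one per cut in the chain."""
    return [(slices[k][1], slices[k + 1][0]) for k in range(len(slices) - 1)]


slices = partition_bits(128, 8)      # eight slices, each 16 bits wide
cuts = boundary_nodes(slices)        # seven cuts between neighboring slices
print(slices[:2], len(cuts))
```

Each slice would become its own circuit deck, and each cut would be served by a pair of communication devices, one on either side of the boundary.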
As an example of a partitioned distribution, Figure 1 illustrates the coordination of eight Ngspice instances connected to the SimConnect server through SimTalk, for a concurrent 8x parallel simulation of the 128-bit counter described in section IV. The counter is partitioned into subcircuits 16 bits wide, connected at their MSB and LSB nodes via Xspice socket devices. The SimConnect server implements the dynamics of a Kahn Process Network, where the contents of the KPN tokens are "interpolated event" (IE) objects. Interpolated events, defined in [8], are 3-tuple elements (v, t_m, t_n) from the product set V × T × T, where V is a set of values and T is a set of tags. This nomenclature borrows from the value/tag (v, t) definition of an event covered in [19]. For a given interpolated event (v, t_m, t_n), we define the value v to be constant on the interval [t_m, t_n) specified in the IE, such that the tag set T is ordered. T is conventionally the real number set ℝ in timed, event-driven simulations, representing the simulation time stamp of an event occurrence. For an interpolated event (v, t_m, t_n), the range [t_m, t_n) assigns a "stable" time to the signal value v for producers and consumers.
If a simulator consumes an interpolated event, it may assume the value v is constant on the tag range [t_m, t_n), and need not sample the value again until the expiration time t_n. Therefore, an interpolated event encapsulates both communication (the signal value) and synchronization (the start and end time). Mapped to nodes in a Kahn Process Network, simulators consume IEs, run, and produce IEs until the expiration tag of the last consumed IEs, at which point simulators sample their FIFOs again for new IEs. If input FIFOs are empty, simulators are blocked. Through the blocking read property of KPNs, the local causality constraint is observed, because simulators cannot advance in time beyond the expiration tags of IEs on their input FIFOs. Further dynamics of KPNs and IEs are detailed in [8].
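The blocking-read behavior described above can be sketched in a few lines. This is a minimal illustration only, assuming one producer and one consumer joined by a single FIFO, integer nanosecond tags, and a hypothetical `END` marker to stop the demonstration; it does not reproduce the SimConnect/SimTalk API of [8].

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class InterpolatedEvent:
    value: float  # v: signal value, held constant on [t_m, t_n)
    t_m: int      # start tag in ns (inclusive)
    t_n: int      # expiration tag in ns (exclusive)


END = None  # hypothetical end-of-stream marker, not part of the IE definition


def producer(out_fifo: queue.Queue, values: list[float], duration_ns: int) -> None:
    """Emit back-to-back IEs of fixed duration, one per value."""
    t = 0
    for v in values:
        out_fifo.put(InterpolatedEvent(v, t, t + duration_ns))
        t += duration_ns
    out_fifo.put(END)


def consumer(in_fifo: queue.Queue, trace: list) -> None:
    """Advance local time only as far as the expiration tag of the consumed IE."""
    local_time = 0
    while True:
        ie = in_fifo.get()  # blocking read: waits while the FIFO is empty
        if ie is END:
            break
        local_time = ie.t_n  # the value is stable up to t_n, so run until then
        trace.append((ie.value, local_time))


def run_demo() -> list:
    fifo: queue.Queue = queue.Queue()
    trace: list = []
    worker = threading.Thread(target=consumer, args=(fifo, trace))
    worker.start()  # the consumer blocks until the producer writes
    producer(fifo, [0.0, 3.3, 0.0], duration_ns=10)
    worker.join()
    return trace


print(run_demo())
```

Because the consumer can only move its local time forward by reading a token, it never runs ahead of data it has not yet received, which is the property the local causality constraint requires.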
As a consequence of sampling, there is a tradeoff in speed versus accuracy when using the SimConnect/SimTalk method of IEs for distributed Spice parallelism. Specifically, an IE assigns a stable value to a node voltage for a fixed duration, during which local time a consuming simulator can operate on it without re-querying the value. During that time, however, the signal may change, resulting in sample-and-hold error for continuous values, or change-delay error for digital values, since the state change information of the digital value is delayed until the start time of the next IE. This speed versus accuracy tradeoff is tunable, however, as explored in section VI.
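The change-delay error can be made concrete with a simplified model, assuming fixed-duration IEs aligned to multiples of the IE duration and integer nanosecond times. The function names and the example change time of 13 ns are hypothetical and chosen only for illustration.

```python
def observed_time(change_time_ns: int, ie_duration_ns: int) -> int:
    """Start tag of the first IE that can carry a change occurring at change_time_ns.

    IEs are assumed to start at integer multiples of ie_duration_ns, so the
    change becomes visible at the next IE boundary (ceiling division).
    """
    return -(-change_time_ns // ie_duration_ns) * ie_duration_ns


def change_delay(change_time_ns: int, ie_duration_ns: int) -> int:
    """Delay between the change and its first appearance in an IE."""
    return observed_time(change_time_ns, ie_duration_ns) - change_time_ns


def delay_upper_bound(boundaries: int, ie_duration_ns: int) -> int:
    """Upper bound on delay accumulated across a chain of communication boundaries.

    Each boundary contributes less than one IE duration under this model.
    """
    return boundaries * ie_duration_ns


# A change at 13 ns is seen at 20 ns with 10 ns IEs, but at 14 ns with 2 ns IEs.
print(change_delay(13, 10), change_delay(13, 2))
```

Under this model, shortening the IE duration shrinks the delay at every boundary, at the price of more tokens exchanged per unit of simulated time.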
IV. EXPERIMENTS
Consider simulating a wide-bit asynchronous ripple counter at the transistor level. While ripple counters are impractical as real circuits, due to the rollover delay from maximum value (0xFFF…) to zero, they are simple elementary circuits for conceptualizing or simulating a propagation delay (the rollover delay as the carry bit propagates from bit 0 to bit <n>-1, for counter width <n>). Consider the <n>-bit ripple counter in Figure 
Single Instance Simulation
The single-instance counter is simulated in Ngspice [7], the open source distribution of Berkeley Spice version 3 and Georgia Tech's Xspice [9]. Figure 4 shows the increase in transient analysis time as the number of transistors in the circuit increases, per bit width of the counter. The counter is simulated at 4, 8, 16, 32, 64, and 128 bits for 1.5 μs of simulation time. In Figure 4, the increase in analysis time per number of transistors is non-linear due to the non-linear increase in model evaluation time and matrix solution time in the Spice kernel as the device count increases. As the number of bits in the counter increases from 64 to 128 bits (1918 to 3838 transistors), for example, the analysis time increases from slightly over a minute to more than five minutes on a single Linux workstation (kernel 2.6.16) with a 2.93 GHz Intel Xeon core. This non-linear increase limits the practicality of simulating complex circuits at the transistor level on the order of modern VLSI transistor counts.
Parallel Simulation
For improvement, the circuit is partitioned at the expression level (the Ngspice circuit deck) into subcircuits <m> bits wide, where <m> is a power-of-two divisor of 128 and 128/<m> is the factor of parallelization. Each subcircuit is then assigned to an independent Ngspice process, coordinated with other Ngspice processes in parallel through SimConnect and SimTalk. Figure 5 shows an <m>-bit wide subcircuit of the counter, where Xspice user TCP/IP socket devices connect the circuit to its neighboring subcircuits over SimTalk.
Figure 6 shows the speedup result for the 128-bit counter. By dividing the 128-bit counter into two 64-bit subcircuits, we achieve a 2x speedup alone, and then achieve a 52x speedup by subdividing into 64 subcircuits, each 2 bits wide. However, the speedup maximizes at this point, after which it diminishes as the communication overhead per number of Ngspice instances increases. This manifests in the loss of speedup from 64x to 128x parallel in Figure 6. A fixed-resolution IE duration also comes at the cost of a non-zero percent error of measurement, shown in Figure 7, where the rollover time of the ripple counter across the parallel cases is measured against the rollover time of the non-parallel case.
The error in measurement arises because each IE has a finite duration, during which its value is held constant. If an IE duration is greater than the rail-to-rail fall time of a circuit inverter, for example, or on the order of it, conveying the information of the inverter's changed state may be delayed up to the duration of the IE, depending on when the inverter output was sampled. Since this delay can continue from one communication node to the next through each parallel instance, it can accumulate at the output at bit 127, where the rollover delay is measured. The sum of accumulated delay can increase as the parallelism increases. This is responsible for the positively correlated relationship in Figure 7.
Increased Resolution
If the IE resolution is increased (from 10 ns to 2 ns), as in Figure 8, the percent error of measurement decreases. This is because an inverter fall at communication nodes is sampled every 2 ns, instead of every 10 ns. Since the inverter fall time is on the order of 10 ns as these transistors were sized, a 2 ns sample results in a smaller worst-case delay in observing a rail-to-rail state change on an inverter output. Percent error of measurement drops to below five percent for the 64x and 128x parallel cases in Figure 8, and to below one percent for the 2x to 16x parallel cases, although the speedup decreases.
The transient analysis time for the same degree of parallelism increases as resolution increases, shown in Figure 9, due to the increased communication rate with the SimConnect server (higher resolution results in smaller Ngspice time steps, resulting in more IE tokens through the KPN FIFOs). This cost of communication also occurs as IE resolution increases. However, percent error of measurement can be reduced arbitrarily per degree of parallelization by increasing the IE resolution, as shown in Figure 8. This speedup in transient analysis time for the same 128-bit counter is achieved without modifications to the Ngspice kernel or the execution host, making it different from execution-level parallelization schemes. In this method, the partitioning is performed at the circuit expression level, in multiple Ngspice decks spread over independent simulators, so both the model-evaluation and the matrix-solving phases occur in parallel.
Choosing an appropriate degree of parallelization and IE resolution automatically is not yet suggested by this work, since it is highly circuit dependent (the partitioning should look for nodes of loose coupling or signal feed-forward cutsets). For accuracy, IE resolution should be on the order of the maximum frequency content of the communicated signal to minimize accumulated delay of rise or fall time information due to sampling. In the examples of Figures 7 and 8, decreasing the IE duration to one fifth (2 ns) of the circuit inverter rail-to-rail fall time (approximately 10 ns) decreased the percent error of measurement for each degree of parallelism by more than one-half for the 4x through 128x parallel cases.
With this approach, there will always be tradeoffs between degree of parallelization, total circuit analysis time, IE resolution, and percent error of measurement. However, gains of up to 52x in transient analysis time at less than ten percent error by this software technique alone, without modifying the simulator or execution host, may be acceptable at some early investigation phases of system-level design (SLD) [20].
VII. SUMMARY AND CONCLUSIONS
We took the SimConnect/SimTalk KPN and IE distributed simulation scheme [8] and applied it to the expression-level parallel execution of an Ngspice circuit at the transistor level. We observed gains of up to 52x in analysis time, at less than ten percent error of measurement, and 17x in analysis time, at less than one percent error of measurement. This was achieved without any modification to the Ngspice kernel or execution host, by partitioning the circuit at the expression level and distributing the coordinated subcircuits over independent, concurrent instances of Ngspice. Coordination was achieved through IE tokens exchanged with the SimConnect server through SimTalk, implementing the dynamics of a Kahn Process Network. In maximum parallelization, up to 128 individual Ngspice instances were coordinated with the SimConnect server.
We postulate that it may be possible to combine expression-level parallelization and execution-level parallelization for further speedup. For example, if at execution level a K-times speedup is achieved, then that same speedup would be achieved individually over <N> separate Ngspice instances, since the speedup is internal to each instance. If at expression level, though, a J-times speedup is achieved, then combining both, a J times K factor of speedup should be achieved for both techniques (the speedups should multiply, not add). One speedup occurs at the execution level, another at the expression level. There will still be an error in measurement due to the usage of IEs with this method at the expression level (compared to execution-level methods that may or may not introduce error). We also intend to apply this technique to simulators not initially written for parallel internal execution, provided that they offer a device-level communication interface, to see if similar speedup gains can be achieved.
