Dynamic partitioning for concurrent waveform relaxation-based circuit simulation by Peterson, Lena & Mattisson, Sven
Dynamic Partitioning for 
Concurrent Waveform Relaxation-Based Circuit Simulation 
Lena Peterson, Sven Mattisson, 
Department of Applied Electronics, 
Lund University, 
P.O. Box 118, S-22100 Lund, Sweden 
Physics of Computation Laboratory, 
California Institute of Technology, 
MS 256-80, Pasadena, CA, 91125, USA 
Abstract-We present a new dynamic circuit 
partitioning algorithm for the waveform relax- 
ation method. Such an algorithm dynamically 
changes the partitioning as the simulation pro- 
ceeds through the simulation interval. The pro- 
posed algorithm is suitable for implementation on 
a multicomputer. Experimental results show that 
the algorithm decreases the runtimes for circuits 
where a good static partitioning is hard to find. 
This is true both in the ideal case, that is when 
communication overhead for the repartitioning is 
not included and the load is distributed as evenly 
as possible among the processors, and in the real-world
case, when all the partitioning overhead and load
imbalance are included in the actual measured runtimes.
I. INTRODUCTION
The purpose of this project was to investigate the usefulness
of dynamic circuit partitioning algorithms for a waveform
relaxation-based circuit simulator designed for multicomputers.
A multicomputer is a concurrent message-passing computer with
local memory space for each processing node. For such a computer
there is an important tradeoff between the irregularity in the
partitioning and the achievable parallelism. A dynamic
partitioning algorithm changes the circuit partitioning as the
simulation proceeds through the simulation interval. Thus,
ideally, a highly irregular partitioning will be used only when
the current circuit state calls for it.
Waveform relaxation (WR) is an iterative method which can be
used for performing transient analysis. That is, it is a method
which solves the equations which we can form from the
description of a circuit, typically by applying Kirchhoff's
current law to the circuit. These equations, representing the
circuit dynamics, form a system of DAEs (differential-algebraic
equations). Such a system can be partitioned into a number of
subsystems (or blocks), each containing at least one DAE. Using
the waveform relaxation method, we can then solve these blocks
separately. When solving one block all the other blocks are
relaxed, that is, the solutions from the other blocks are
assumed to be static. In this manner it is possible to solve for
the unknowns in one block over a time interval (called a time
window) without interaction with the other blocks. The
partitioning of the DAE system is extremely important for the
convergence speed of the iterative WR method. Any iteration
scheme can be used for the global iterations over the blocks,
for example the Gauss-Seidel or Jacobi iteration methods. In
this investigation we concentrate on the Jacobi iteration method
since it gives the best asymptotic speedup [5].
II. DYNAMIC PARTITIONING CRITERIA
The goal of the partitioning methods for the system of DAEs is
to merge equations that are strongly coupled so that the
resulting subsystems are only loosely coupled. The proposed
dynamic partitioning method computes the couplings from the
Jacobian matrix $J$, which contains the partial derivatives of
the model equations at the current state. An efficient dynamic
partitioning method has to make its repartitioning decisions
based on the information that can be obtained from the current
partitioning. Thus, our dynamic partitioning method consists of
two different parts: a division criterion, which is used to
decide if currently clustered nodes should be separated, and a
merging criterion, which is used to decide if current blocks
should be merged.
The research described in this paper was supported in part by
NUTEK Dnr 9001159 and in part by the Advanced Research Projects
Agency, DARPA Order number 6202, monitored by the Office of
Naval Research under Contract Number N00014-87-K-0745. This
research is described in detail in Peterson's PhD thesis [1].
0-7803-1254-6/93$03.00 © 1993 IEEE
A. The Division Criterion 
The division criterion treats each diagonal block Jacobian 
matrix as if it were a Jacobian matrix for a self-contained 
equation system. The value of the return voltage transfer
ratio, $T_{mn}$, between circuit nodes $m$ and $n$ belonging to
the block $j$, determines if they are still to belong to the
same block. The return voltage transfer ratio is defined
as the "roundtrip" voltage gain:
$$T_{mn} \equiv T_{nm} = A_{mn} \cdot A_{nm}. \tag{1}$$
In our division criterion the return voltage transfer ratio is 
computed from the inverse of the diagonal block Jacobian, 
$J_{jj}$ for block $j$, as
Since the blocks are assumed not to be extremely large,
hopefully 1 to 20 circuit nodes, the cost for computing the
inverse $J_{jj}^{-1}$ is not prohibitive even though the
calculation is $O(n^3)$, where $n$ is the size of the block.
Furthermore, the inverses of the diagonal Jacobians can be
reused when calculating the merge criterion. The division
criterion for nodes $m$ and $n$ is
$$T_{mn} > \gamma_d \tag{3}$$
where $\gamma_d$ is the division limit, which in our experiments
has been kept constant throughout each simulation run.
It is noteworthy that even though (3) is called the division
criterion, it is fulfilled when two circuit nodes are
not to be divided. We obtain the "global" partitioning
for the block by finding the connected components of an
undirected graph representing the block. The vertices of this
graph correspond to the circuit nodes in the block. A pair of
circuit nodes fulfilling (3) have their corresponding vertices
connected by an edge. In our current implementation the
division criterion is computed locally for each block for
each time point in the window iterations during which the
repartitioning is computed.
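The division step can be illustrated with a minimal Python sketch. It is not the CONCISE implementation. The function names (`invert`, `transfer_ratio`, `split_block`) and the example Jacobian are hypothetical, and the ratio is assumed to take the form $z_{mn} z_{nm} / (z_{mm} z_{nn})$, where $z$ denotes entries of the inverse block Jacobian; that form is a common way to express a round-trip voltage gain and is an assumption here, not a formula taken from the text.

```python
from itertools import combinations


def invert(matrix):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-300:
            raise ValueError("singular block Jacobian")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def transfer_ratio(inv, m, n):
    """Assumed form of T_mn: product of the gains z_mn/z_nn and z_nm/z_mm."""
    return abs(inv[m][n] * inv[n][m] / (inv[m][m] * inv[n][n]))


def split_block(block_jacobian, limit):
    """Return connected components of the graph whose edges satisfy T_mn > limit."""
    size = len(block_jacobian)
    inv = invert(block_jacobian)
    neighbours = {v: set() for v in range(size)}
    for m, n in combinations(range(size), 2):
        if transfer_ratio(inv, m, n) > limit:
            neighbours[m].add(n)
            neighbours[n].add(m)
    seen, components = set(), []
    for start in range(size):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over one component
            v = stack.pop()
            component.append(v)
            for w in neighbours[v] - seen:
                seen.add(w)
                stack.append(w)
        components.append(sorted(component))
    return components


# Hypothetical 3-node block: nodes 0 and 1 are tightly coupled, node 2 only weakly.
jacobian = [[2.0, -1.0, 0.0],
            [-1.0, 2.0, -0.001],
            [0.0, -0.001, 1.0]]
print(split_block(jacobian, limit=0.1))  # [[0, 1], [2]]
```

With the limit 0.1, only the pair (0, 1) fulfils the criterion, so the block would be divided into two new blocks. A much smaller limit keeps all three nodes together.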
B. The Merging Criterion 
The merging criterion is based on the collapsed iteration 
matrix for the block Jacobi method. The collapsed iteration
matrix is derived using p-norm inequalities for each
row equation. For a blocked system with $N$ blocks it is:
This matrix can be viewed as derived from a system matrix.
Since the iteration matrices derived from all system
matrices differ only by a scalar factor, we choose the system
matrix that is easiest to compute, that is, the one
where the diagonal contains only ones, or $D = I$.
We call this system matrix $P$. It is calculated as
$$P = -T + I. \tag{5}$$
The matrix $P$ is the one from which we derive the merging
criterion. We use the diagonally dominant loop criterion
introduced by White [6, 7]. The criterion, applied to the
entries of $P$, is
where $\gamma_m$ is the merging limit, usually fixed but, similar
to the division limit, it may be changed if desired.
The diagonally dominant loop criterion is a local criterion,
and thus it cannot detect loops of unidirectional
couplings among more than two blocks. It would be desirable
to use a criterion right-hand side similar to the
one of the division criterion (2) also for the
merging. That is, we would like to use the inverse of $P$.
In practice, however, it would be extremely expensive to
invert $P$ since this matrix does not exist physically
anywhere in the database. Furthermore, for each waveform
iteration, each off-diagonal element of $P$, $p_{jk}$, is a
waveform of values calculated for the time points chosen by
the integration algorithm for block $j$. In order to invert
$P$ for a number of time points over the present time
window, each $p_{jk}$ must be interpolated to each desired time
point and then the matrix must be built somewhere to be
inverted (possibly via a distributed inversion algorithm).
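A pairwise merging test of this kind can be sketched in Python as follows. The sketch assumes that the diagonally dominant loop criterion compares $|p_{jk}| \cdot |p_{kj}| / (|p_{jj}| \cdot |p_{kk}|)$ with the merging limit; this form, the function names and the sample matrix are illustrative assumptions, not taken from the paper.

```python
def system_matrix(t):
    """Build P = -T + I from a collapsed iteration matrix T given as a list of rows."""
    size = len(t)
    return [[(1.0 if j == k else 0.0) - t[j][k] for k in range(size)]
            for j in range(size)]


def merge_candidates(p, limit):
    """Return block pairs (j, k) whose two-way coupling product exceeds the limit."""
    size = len(p)
    pairs = []
    for j in range(size):
        for k in range(j + 1, size):
            # Assumed criterion: loop gain between blocks j and k, normalised by
            # the diagonal entries (which are all ones when D = I).
            loop_gain = abs(p[j][k] * p[k][j]) / abs(p[j][j] * p[k][k])
            if loop_gain > limit:
                pairs.append((j, k))
    return pairs


# Hypothetical collapsed iteration matrix for three blocks.
# Blocks 0 and 1 influence each other; block 2 drives block 1 but not vice versa.
T = [[0.0, 0.8, 0.0],
     [0.5, 0.0, 0.9],
     [0.0, 0.0, 0.0]]
P = system_matrix(T)
print(merge_candidates(P, limit=0.1))  # [(0, 1)]
```

The strong but one-directional coupling from block 2 to block 1 does not trigger a merge, which illustrates why a local criterion of this kind only sees bidirectional couplings between pairs of blocks.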
III. CRITERIA CALCULATIONS IN PRACTICE
The program used for the experiments, called CON- 
CISE [2], is a circuit simulator for transient analysis of 
MOS circuits. It is written in C and uses the Cos- 
mic Environment/Reactive Kernel message-passing prim- 
itives [3, 41. These primitives support the programming 
model where the computational unit is the process and 
each process has its own local memory-space. A process 
is an instance of a program that causes messages to be 
sent and received. CONCISE consists of a number of 
solver processes, each responsible for computing the
solution to a number of blocks for each window iteration
and one scheduler process which collects convergence data 
from the solvers and decides when a window has converged 
or when a time window has to be split. 
The inversion of the diagonal Jacobian matrices is the 
core computation in both the division and the merging
criteria. The matrix inversion is performed using the already
LU-factorized Jacobian matrix from the last Newton iter- 
ation in the current time point. If both criteria are calcu- 
lated during the same iteration (which is not necessary) 
the same matrix inversion is used for both of them. 
The calculation of the division criterion is performed 
entirely within one solver process. Communication with 
other processes is, in principle, only required to decide in 
what solver processes the new blocks should be placed. 
In our implementation the merging overrides the division 
and so communication is also needed to verify that the 
block is not being merged. 
For the division calculation the block is represented as
an adjacency matrix $N$, a straightforward matrix
representation of the connections in the undirected graph
describing the block. Each off-diagonal entry can be either
one or zero. A non-zero entry $n_{ij}$ indicates that circuit
nodes $i$ and $j$ are coupled. All entries in $N$ are set to zero
at the start of the window iteration during which the
coupling calculation is to take place. For each time point,
entries are added to $N$ for each pair of circuit nodes
that fulfill the division criterion (3). If the graph is
connected then the block cannot be divided. At each time
point during the coupling calculation $N$ is checked to see if
the block graph is connected. If it is, no more division
calculations have to be performed for the block during the
current window iteration.
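The accumulation with early termination described above might look roughly like this in Python. The sketch assumes the transfer ratios for each time point are already available as square matrices; `accumulate_couplings`, `is_connected` and the sample data are hypothetical names and values chosen for illustration.

```python
def is_connected(adjacency):
    """Check by depth-first search whether the undirected graph is connected."""
    size = len(adjacency)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in range(size):
            if adjacency[v][w] and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == size


def accumulate_couplings(size, ratios_per_time_point, limit):
    """Accumulate edges over time points; stop as soon as the block is connected.

    Returns the adjacency matrix N and the number of time points examined.
    """
    adjacency = [[0] * size for _ in range(size)]  # N is zeroed at the start
    examined = 0
    for ratios in ratios_per_time_point:
        examined += 1
        for m in range(size):
            for n in range(m + 1, size):
                if ratios[m][n] > limit:
                    adjacency[m][n] = adjacency[n][m] = 1
        if is_connected(adjacency):
            break  # the block cannot be divided; skip the remaining time points
    return adjacency, examined


# Hypothetical transfer ratios for a 3-node block at three time points.
time_points = [
    [[0, 0.5, 0.0], [0.5, 0, 0.0], [0.0, 0.0, 0]],
    [[0, 0.0, 0.0], [0.0, 0, 0.3], [0.0, 0.3, 0]],
    [[0, 0.0, 0.0], [0.0, 0, 0.0], [0.0, 0.0, 0]],
]
N, examined = accumulate_couplings(3, time_points, limit=0.1)
print(examined)  # 2
```

Edges found at different time points are combined, so two nodes stay in the same block if they are strongly coupled at any examined time point of the window.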
For the merging calculations communication between 
solver processes is necessary since the bidirectional cou- 
pling between two blocks has to be computed from the 
entries in the system matrix P, derived from the collapsed 
iteration matrix. The off-diagonal entry $p_{i,j}$, which
describes the impact of block $i$ on block $j$ during a time
window, becomes a waveform sampled at the same time
points as the voltage waveforms. We call such a waveform
a coupling waveform. It is not feasible to store or com- 
municate all the time points in the coupling waveforms 
because that would require too much memory. To alle- 
viate these memory requirements, we use a compressed 
representation for the coupling waveforms. This
representation comprises three coupling values, the first, the
last, and the middle one of the waveform, plus the largest
(i.e. the worst) coupling value and its corresponding time
point. Thus, only five real numbers are used for storing
the information. Additional memory is needed during the 
calculation of the waveform, but it does not have to be 
retained once the solution of the particular block is con- 
cluded. 
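As an illustration of the five-number representation, here is a small Python sketch. The class and function names are hypothetical, and taking the "middle" value as the sample at index `len(values) // 2` is an assumption, since the text does not say how the middle point is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompressedCoupling:
    first: float       # coupling value at the first time point
    middle: float      # coupling value at the middle sample (assumed: index len // 2)
    last: float        # coupling value at the last time point
    worst: float       # largest coupling value in the window
    worst_time: float  # time point at which the largest value occurs


def compress(times, values):
    """Reduce a sampled coupling waveform to five real numbers."""
    if not values or len(times) != len(values):
        raise ValueError("times and values must be non-empty and of equal length")
    worst_index = max(range(len(values)), key=lambda i: values[i])
    return CompressedCoupling(
        first=values[0],
        middle=values[len(values) // 2],
        last=values[-1],
        worst=values[worst_index],
        worst_time=times[worst_index],
    )


summary = compress([0.0, 1.0, 2.0, 3.0, 4.0], [0.1, 0.4, 0.2, 0.9, 0.3])
print(summary)
```

The full list of samples is only needed while `compress` runs, which matches the observation that the extra memory can be released once the block has been solved.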
IV. EXPERIMENTS 
The circuits used in the experiments are all digital MOS 
circuits. All in all there are 14 circuits ranging in size 
from 18 to 1944 circuit nodes (40 to 3348 transistors). All
fourteen circuits come from research chips designed either 
in Lund or at Caltech. The netlists for all the circuits 
were extracted from the chip layouts. 
We tested four different schemes for using the two repar- 
titioning criteria described above. In all four schemes the 
criteria are computed in a fixed window iteration for each 
time window. In three of the schemes the division and 
the merging criteria are both computed in the same window
iteration, that is, in window iteration 1, 2, or 3,
respectively. The fourth scheme is called the
concurrency-preferring partitioning (CPP) since it prefers having many
small blocks and slower convergence over few blocks and 
faster convergence. This is achieved by computing the di- 
vision as soon as possible, in the first window iteration, 
while postponing the merging until window iteration 9. 
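The four on-line schemes differ only in the window iteration at which each criterion is evaluated, which can be summarised in a small lookup table. The Python sketch below uses hypothetical scheme labels and a hypothetical helper, `criteria_due`; only the iteration numbers come from the text.

```python
# Window iteration in which each criterion is computed, per scheme.
SCHEMES = {
    "iter 1": {"division": 1, "merging": 1},
    "iter 2": {"division": 2, "merging": 2},
    "iter 3": {"division": 3, "merging": 3},
    "CPP": {"division": 1, "merging": 9},  # divide early, merge late
}


def criteria_due(scheme, window_iteration):
    """Return the criteria to evaluate in the given window iteration."""
    plan = SCHEMES[scheme]
    return [name for name, when in plan.items() if when == window_iteration]


print(criteria_due("CPP", 1))     # ['division']
print(criteria_due("iter 2", 2))  # ['division', 'merging']
```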
For comparison, we have also used a precomputed partitioning,
the off-line dynamic partitioning (ODP). In this
case the dynamic partitioning is computed from the
Jacobians collected from previous simulations of the same
circuit using an incremental-time method.
The on-line dynamic partitioning, that is when the 
repartitioning is computed from data from the on-going 
run, has been analyzed in two types of experiments. In 
the first type of experiments, the ideal placement experi- 
ments, we disregard the repartitioning overhead and cal- 
culate the best possible load balance. Thus, these exper- 
iments reflect the partitioning criteria and the partition- 
ing schemes themselves without taking the shortcomings 
of our coding and the type of computer we used into ac- 
count. In the second type of experiment we just measure 
the actual runtime with all existing overhead. 
The four dynamic partitioning schemes are compared to 
two static partitioning methods: the single-equation (SE) 
partitioning method where each circuit node forms one 
block and the gate (G) partitioning where circuit nodes
connected by the source and drain of the same transistor
are clustered.
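The two static partitionings can be sketched in Python with a union-find structure. The netlist format (tuples of drain, gate, source), the function names and the example circuit are hypothetical. The sketch also assumes that supply rails and input nodes are not unknowns and are therefore left out of `nodes`, so they never join clusters; the text does not state how such nodes are treated.

```python
def single_equation_partition(nodes):
    """SE partitioning: every circuit node forms its own block."""
    return [[node] for node in nodes]


def gate_partition(nodes, transistors):
    """G partitioning: cluster nodes joined by the source and drain of a transistor.

    transistors is an iterable of (drain, gate, source) tuples. Terminals that
    are not listed in nodes (assumed here: supply rails, inputs) are ignored.
    """
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for drain, _gate, source in transistors:
        if drain in parent and source in parent:
            parent[find(drain)] = find(source)

    blocks = {}
    for node in nodes:
        blocks.setdefault(find(node), []).append(node)
    return list(blocks.values())


# Hypothetical netlist: a 2-input NAND gate (internal node "x") driving an inverter.
nodes = ["out", "x", "y"]
transistors = [
    ("out", "a", "vdd"), ("out", "b", "vdd"),  # NAND pull-up
    ("out", "a", "x"), ("x", "b", "gnd"),      # NAND pull-down stack
    ("y", "out", "vdd"), ("y", "out", "gnd"),  # inverter
]
print(gate_partition(nodes, transistors))  # [['out', 'x'], ['y']]
```

In this example the NAND output and its internal stack node end up in one block, while the inverter output, which is connected to the NAND only through a gate terminal, forms its own block.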
Plots for the circuit where the dynamic partitioning is
most successful, adder, are shown in Figures 1 to 3. Note
that the runtime does not vary monotonically with the
partitioning limit, even for the ODP method.
Fig. 1: Adder, dynamic on-line partitioning in 32 processors,
ideal placement. [Plot: time (sec) versus limit.]
Fig. 2: Adder, with division and merging limit = 0.1,
ideal placement. [Plot: time (sec) versus nodes.]
[Plot legends: dynamic iter 1, dynamic iter 2, dynamic iter 3,
dynamic CPP, ODP, source-drain, pointwise.]
V. CONCLUSIONS
In this study we have derived two new dynamic partitioning
criteria for the waveform relaxation method, one for
the merging of existing blocks and one for the division
of existing blocks. The division criterion takes both
bidirectional couplings and loops of unidirectional couplings
into account. The merging criterion considers only
bidirectional couplings. The experiments show that a speedup
improvement of up to an order of magnitude over the static
heuristic gate partitioning can be obtained for MOS
circuits for which the gate partitioning yields large blocks.
For circuits for which the gate partitioning produces only
small blocks there is, in most cases, no improvement.
ACKNOWLEDGEMENTS 
We thank Professor Charles L. Seitz for continuous
encouragement and innumerable hours of multicomputer time.
REFERENCES 
[1] L. Peterson. Dynamic Circuit Partitioning for Concurrent
Waveform Relaxation-Based Circuit Simulation. PhD thesis,
Dept of Applied Electronics, Lund University, Lund, 1992.
[2] L. Peterson and S. Mattisson. The design and implementation
of a circuit simulation program for multicomputers.
IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, March 1993. In press.
[3] C. L. Seitz, J. Seizovic, and W-K. Su. The C Programmer's
Abbreviated Guide to Multicomputer Programming. Technical
report Caltech-CS-TR-88-1, Computer Science Dept,
California Institute of Technology, Pasadena, 1988.
Revised May 1989.
[4] J. Seizovic. The reactive kernel. Technical Report
SC-TR-88-10, Computer Science Dept, California Institute of
Technology, 1988. MS Thesis.
[5] D. W. Smart. Parallel Processing Techniques for the
Simulation of MOS VLSI Circuits Using Waveform Relaxation.
PhD thesis, Dept of Electrical and Computer Engineering,
University of Illinois at Urbana-Champaign, 1988.
[6] J. White and A. L. Sangiovanni-Vincentelli. Partitioning
algorithms and parallel implementations of waveform
relaxation algorithms for circuit simulation. In Proceedings
of IEEE International Symposium on Circuits and Systems,
pages 221-224, 1985.
[7] J. K. White and A. Sangiovanni-Vincentelli. Relaxation
Techniques for the Simulation of VLSI Circuits. Kluwer
Academic Publishers, Boston, Massachusetts, 1987.
Fig. 3: Adder, on-line dynamic partitioning, measured run
times with division and merging limit = 0.1.
[Plot: time (sec) versus nodes.]
