Computationally Efficient Modeling and Simulation of Large Scale Systems by Koh, Cheng-Kok et al.
1111111111111111111imnnuu ~ 
(12) United States Patent 
Jain et al. 
(54) COMPUTATIONALLY EFFICIENT 
MODELING AND SIMULATION OF LARGE 
SCALE SYSTEMS 
(71) Applicant: Purdue Research Foundation, West 
Lafayette, IN (US) 
(72) Inventors: Jitesh Jain, Rajasthan (IN); Stephen F 
Cauley, West Lafayette, IN (US); Hong 
Li, He nan (CN); Cheng-Kok Koh, West 
Lafayette, IN (US); Vankataramanan 
Balakrishnan, West Lafayette, IN (US) 
(73) Assignee: Purdue Research Foundation, West 
Lafayette, IN (US) 
(*) Notice: 	 Subject to any disclaimer, the term of this 
patent is extended or adjusted under 35 
U.S.C. 154(b) by 0 days. 
(21) Appl. No.: 13/710,145 
(22) Filed: 	 Dec. 10, 2012 
(65) 	 Prior Publication Data 
US 2013/0124168 Al 	 May 16, 2013 
Related U.S. Application Data 
(63) Continuation of application No. 12/852,942, filed on 
Aug. 9, 2010, now Pat. No. 8,336,014. 
(51) Int. Cl. 
G06F 17150 	 (2006.01) 
(52) U.S. Cl. 
USPC ................ 716/115; 716/106; 716/111; 703/4 
(58) Field of Classification Search 
USPC ............................... 716/106,111, 115; 703/4 
See application file for complete search history.  
(1o) Patent No.: 	 US 8,745,563 B2 
(45) Date of Patent: 	 Jun. 3, 2014 
(56) 	 References Cited 
U.S. PATENT DOCUMENTS 
5,692,158 A * 11/1997 Degeneff et al . 	 ................. 703/2 
6,041,170 A * 3/2000 Feldmann et al . 	 ................ 703/2 
6,192,328 B1* 2/2001 Kahlertetal . 	 .................... 703/2 
6,820,245 B2 * 11/2004 Beattie et al . 	 ................. 716/115 
7,228,259 B2 * 6/2007 Freund 	 .............................. 703/2 
7,307,492 B2 * 12/2007 Tripathi et al . 	 ............... 333/111 
7,353,157 B2 * 4/2008 Wasynczuk et al . 	 ............ 703/14 
2003/0177458 Al* 9/2003 Beattie et al . 	 ..................... 716A 
OTHER PUBLICATIONS 
Jain et al.; "Fast simulation of VLSI interconnects'; IEEE/ACM 
International Conference; publication date: Nov. 7-11, 2004; pp. 
93-98.* 
* cited by examiner 
Primary Examiner Naum Levin 
(74) Attorney, Agent, or Firm Purdue Research 
Foundation 
(57) 	 ABSTRACT 
A system for simulating operation of a VLSI interconnect 
structure having capacitive and inductive coupling between 
nodes thereof, including a processor, and a memory, the pro-
cessor configured to perform obtaining a matrix X and a 
matrix  containing different combinations of passive circuit 
element values for the interconnect structure, the element 
values for each matrix including inductance L and inverse 
capacitance P, obtaining an adjacency matrix A associated 
with the interconnect structure, storing the matrices X, Y, and 
A in the memory, and performing numerical integration to 
solve first and second equations. 
3 Claims, 17 Drawing Sheets 
^d^ ttrt'eS ~'iafr~=:Y 
-sr- r2~res%iotd 
~,j 
-vt- thr~9 ok3-,CR1: 
--~- thres!iatd- .~r~o1 
1,014 
https://ntrs.nasa.gov/search.jsp?R=20150003306 2019-08-31T11:22:15+00:00Z
FIG. 1 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 1 of 17 	 US 8,745,563 B2 
U.S. Patent 	 Jun. 3 , 2014 	 Sheet 2 of 17 	 US 8,745,563 B2 
E ----------------- A---------------
~ 	 ri Ca? 	 rM 	 ens 	 kri 	 Nr 	 0" 	 N 	 C3 r 
\ 
. ~ 
ate: 
« Q: 
\~ 
/ 
~ < ~ 
~ m 
Q 
U.R. Patent 	 Jun. 3 2014 	 She!] of 17 	 US 8,743,363 G2 
r 
C 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 4 of 17 	 US 8,745,563 B2 
Column Number 
FIG. 4( a.) 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 5 of 17 	 US 8,745,563 B2 
Col - unzn Number 
fgTG. ,-I(b) 
J 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 6 of 17 	 US 8,745,563 B2 
Column Number 
FIG. 4(c) 
Ls 
co 
Iq 
0 
0 
n 
ci 
i 
r~ 
U.S. Patent 	 Jun. 3 , 2014 	 Sheet 7 of 17 	 US 8 ,745,563 B2 
CD 	 CD 	 0 	 CD 
E 	 E 
< -- -- -- -- -- 
 
~CL ~ I14 
CIQ 
:-o 
CD 
0 
m 
r 
X 
n 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 8 of 17 	 US 8 ,745,563 B2 
W 
~4 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 9 of 17 	 US 8,745,563 B2 
15 - ---------------- -------------- ---------- ------------------ ------------------------------------------- ------------------- T- ------ ? --- - ---------- 7 -------------------------------------- 
--------------------------------------- ALP 
GKG 
INDUCWISE 
---------------------------------- 
10 
V 
- - - - - - - --- - - - - - - - - - - 	
....................................... 
10 -
2 	 10 -4 
	
1 0­5 
Threshold 
FIG, 7 
U.S. Patent Jun. 3, 2014 	 Sheet 10 of 17 	 US 8,745,563 B2 
15 
10 
w 
CIO 
0 
10 -2 10 	 10' 
Diresho1d 
FIB, 8 
10- 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 11 of 17 	 US 8,745,563 B2 
z 
Column Number 
FIG. 9 
Column Number 
FIG. 10 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 12 of 17 	 US 8,745,563 B2 
t~. 
Column Number 
FIG. -11 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 13 of 17 	 US 8 ,745,563 B2 
0 
c~ 
r~ 
s~ 
N"^ 
CJ 	 W 
r~ 
LO 
LJ 
C~ 
CD 
co 
0 
J 
CD 
0 
LO 	 co 	 N 	 07 	 co 	 1l- 	 110 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 14 of 17 	 US 8 ,745,563 B2 
C° 
;a 
r-- 
f+7 
`m n E 	 ^~ E E 	 ~ 
i~ 
LO }— 
at7 
0. 
w 	 r 
v U 
cx x < 
; cn w~ 
t~ 
li 
/rl 
r: 
F~f4 
E 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 15 of 17 	 US 8,745,563 B2 
CL) 
4-d 
------------------- 
. . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
----------------- - - - - - - - - - - - - - - 
----- ------- --- 	 ------ ------- 
- - - - - - - - - - - - 	
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - - - - - - - - - - j- 
------- ------- ------- ------- --------------- ------- ------- ------- ------- -------- 
....... ............... ........ ............... 
----------------------- -------- ------- ------- 
- ------ ------- - ------- ------- ------- ------- ----------------------- ------- ------- -------- -------- 
I -------- ------- - ------- ------- ....... -------- ------- ------- ------- ------- ------- ------- -------- 
x "I V 
..... ........ 
----- 	
............... ....... ....... ............... ........ ------- ------- 
----- ------- ------- -------- ----------------------- ------- ------- -------- -------- 
. . . 	 . . . 	 . . . . . 
U 
------- ------- ---------- 
--------------- 
-------- -------- ------- 
------- ----- 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 16 of 17 	 US 8,745,563 B2 
a) 	 CD 	 il-- 	 w 	 LO 
Llc:jdwnSUOO AJOWen 10 Oil% 
LD 
.r 
04 
-0 
tl) 
V) 	
LO
C) LO 
0)  
n r) 
CO 
z 
CIA 
----------------------- 
- 
U') CD CD 	 LO 	 CD 	 Ln 	 0 
uoijdwnsuc , ) kiowa ~j , o opH 
U.S. Patent 	 Jun. 3, 2014 	 Sheet 17 of 17 	 US 8,745,563 B2 
US 8,745,563 B2 
1 	 2 
COMPUTATIONALLY EFFICIENT 
MODELING AND SIMULATION OF LARGE 
 
SCALE SYSTEMS 
where 
CROSS-REFERENCE TO RELATED 
APPLICATIONS 
This application is a continuation application of patent 
application Ser. No. 12/852,942, filed on Aug. 9, 2010, now 
U.S. Pat. No. 8,336,014 to Jain et al., issued Dec. 18, 2012, 
whichis a divisional application of patent application Ser. No. 
11/593,465, filed on Nov. 6, 2006, now U.S. Pat. No. 7,774, 
725 to Jain et al., issued Aug. 10, 2010, which claims the 
benefit of Provisional Patent Application No. 60/733,460, 
filed Nov. 4, 2005, and Provisional Patent Application No. 
60/740,990, filed Nov. 30, 2005, which applications are 
hereby incorporated by reference along with all references 
cited therein. 
GOVERNMENT RIGHTS 
This invention was made with government support under 
Contract/Grant No. NCC 2-1363 awarded by the National 
Aeronautics and Space Administration (NASA), under Con-
tract/Grant Nos. CCR-9984553 and CCR-0203362 awarded 
by the National Science Foundation, and under Contract/ 
Grant No. USAF-FA8650-04-D-2409 awarded by the United 
States Air Force Research Laboratories. The government has 
certain rights in the invention. 
TECHNICAL FIELD OF THE INVENTION 
	
5 9 AT 	 C 	 v 
C 
— ~ — Aa 0 	 C— ~ 0 L0 1 '  x— ~ itI 
A +T is b = 
~ 	
11-j=Ag 
0 	
R-'Ag , and C = ATCAc . 
10 
R is the resistance matrix. The matrices g, L and C are the 
conductance, inductance and capacitance matrices respec-
tively, with corresponding adjacency matrices A g, Ai and A,. 
15 IS is the current source vector with adj acency matrix A , and v„ 
and ii are the node voltages and inductor currents respectively. 
With n denoting the number of inductors, we note that 
L,C,ReR" ' C, 0 ~R2n 2 
A standard algorithm for the numerical integration of dif- 
20 ferential equations such as (1) is the trapezoidal method. 
Consider a uniform discretization of the time axis with reso-
lution h. Then, using the notation xk=x(kh), and the approxi-
mations 
25 
+ 	 +1 
	
x(t)
N 	 and 
Xk 
[fit 	 r--kh 	 h 	 2 
30 over the interval [kh,(k+l)h], we may solve for xk+r  in terms 
of xk by solving the equation 
C C 	 bk+1 + bk 	 (2) 
35 	 ~ 2+ 
	
+1 h~z — —( C 
2— 
C
h Jz + 	 2 	 . 
The present invention relates generally to electrical circuit 
modeling and simulation techniques and, more particularly, 
to methods for simulating interconnect effects in very large 
scale integrated circuits. 
BACKGROUND OF THE INVENTION 
40 
With aggressive technology scaling, the accurate and effi-
cient modeling and simulation of interconnect effects has 
become (and continues to be) a problem of central impor-
tance. In a three-dimensional interconnect structure there can 
be significant amounts of coupling, both inductive and 45 
capacitive, between interconnects. Models that capture these 
effects tend to involve large matrices, resulting in extraordi-
nary demands on memory. Simulation with these models 
require prohibitive amounts of computation. 
While all coupling effects in theory extend without bound, 50 
it is well-recognized that, in practice, the effects of capacitive 
coupling, and to some extent that of inductive coupling, can 
be assumed to be local without much sacrifice in accuracy. 
Practical modeling and simulation techniques exploit this 
localization to significantly reduce storage and computa- 55 
tional costs. For practical interconnect structures, the capaci-
tance matrix C and the inverse of the inductance matrix 
K=L-1 turn out to be (approximately) sparse. A number of 
techniques exploit the sparsity in K at extraction level. 
Exploiting sparsity of C and K in simulation however, is much 60 
less straightforward. The main problem is that simulation 
requires terms that not only involve the sparsified matrices C 
and K, but also inverses of terms that involve them; these 
inverses are dense in general. 
The Modified Nodal Analysis (MNA) of interconnect 65 
structures such as the one shown in FIG.1 yields equations of 
the form 
A direct implementation of thi s algorithm requires O(n 3 +pn2) 
operations, where p is the number of time steps. The direct 
implementation ignores the structure of the matrices 6 and C 
that is evident in (1); explicitly recognizing this structure 
yields the so-called Nodal Analysis (NA) equations, used in 
INDUCTWISE: 
-
~
-
jC
---
+
-----
C
- 
-
+
-----
S
--
w
- 
+1= 
 
kA 	 s+ 1  +is). 	 (3)hZ 	 —j+hC--Sw, 	 aa? 
- ---- - --- — 
U
an 	
VT----------- 
	
2A I is +1  = 2Aa is + hS(v k,+1  + v ), 	 (4) 
where S=A,KA iT (recall that K=L -1 , L being the inductance 
matrix, with corresponding adjacency matrix A i, and AiT 
being the transpose of A i). 
The NA equations (3) and (4) enjoy several advantages 
over the NINA equations (1). The first advantage is that the 
solution of equations (1), a problem of size 3n, has been 
divided into two sub-problems of sizes 2n and 2n, which 
yields computational savings with polynomial-time algo-
rithms. Next, it has been observed that with typical VLSI 
interconnect structures, the matrices K, C and g  exhibit spar-
sity. This can be used at the extraction stage to write down (3) 
and (4) with fewer parameters. Finally, at the simulation 
stage, the structure of the matrix U defined in (3) symmetry, 
positive-definiteness and sparsity lends itself to the use of 
fast and sound numerical techniques such as Cholesky fac- 
US 8,745,563 B2 
3 
torizations. These advantages have been extensively used to 
develop INDUCTWISE. For future reference, we note that 
the computation with INDUCTWISE is O(n 3 +pn2) opera-
tions, and is usually dominated by O(pn 2). 
SUMMARY OF THE INVENTION 
The approach that is employed is to sparsity the various 
matrices that underlie the model of interconnects; the result-
ing approximate models can be represented by far fewer 
parameters , leading to savings in storage. 
The present invention presents methods that systematically 
take advantage of sparsity in C and K, in simulation, achiev-
ing a very significant reduction in computation with very little 
sacrifice in simulation accuracy. The first idea underlying our 
approach is that if the sparsity in the inverse of a dense matrix 
is known, the (sparse) inverse can be computed very effi-
ciently. We take advantage of this fact by writing the simula-
tion equations in terms of L and P —C-i . The most computa-
tionally intensive step in simulation, of system formulated in 
such a fashion , reduces to that of matrix-vector multiplication 
involving a sparse matrix . We also show that savings with 
sparse-matrix-vector multiplication can be obtained with 
simulation using K=L-i and C, as well , but to a lesser extent. 
The RLP formulation is extended to include non-linear 
devices, without sacrificing the computational benefits 
achieved due to sparsity of the linear system. It should be 
noted that the A matrix involved in the solution of the linear 
system is constant throughout the simulation. In contrast, the 
A matrix involved in solving the non-linear system changes in 
each simulation step. However, the  matrix is sparse. Due to 
the sparse and time varying nature of the problem at hand 
Krylov subspace based iterative methods could be used for 
efficient simulation . Our second contribution is to introduce a 
novel preconditioner constructed based on the sparsity struc-
ture of the non-linear system. The inverse of the precondi-
tioner has a compact representation in the form of the Had-
amard product , which facilitates not only the fast 
computation of the inverse, but also the fast dense matrix-
vector product. 
The objects and advantages of the present invention will be 
more apparent upon reading the following detailed descrip-
tion in conjunction with the accompanying drawings. 
BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 a distributed model of a typical three dimensional 
VLSI interconnect structure that may be operated in simula-
tion with the methods of the present invention. 
FIG. 2 shows the average sparsity index of the matrices 
U- ' V, U- 'A, U- 'AS T and U- 'S, for a structure as a function 
of h for various values of E. 
FIG. 3 shows average sparsity index of the matrices X - 'Y, 
X- 'A and X`AP, for a structure as a function of h for the 
sparsity threshold of e -0.001 , as compared with the average 
sparsity index of the matrices encountered in the GKC-algo-
rithm. 
FIG. 4(a) shows the significant entries (shown darker) of 
the adjacency matrix A for a structure with 1500 conductors. 
FIGS. 4(b) and 4(c) show the significant entries (shown 
darker) of W` and X -i , respectively, for the structure in FIG. 
4(a). 
FIG. 5 shows the voltage wave forms, obtained from 
SPICE and Exact-RLP, of the active line and the seventh line 
of a 100-conductor circuit. 
4 
FIG. 6 shows the voltage wave forms, obtained through 
INDUCTWISE, Exact-RLP, RLP, and GKC, of the active line 
and the seventh line of a 600 -conductor circuit. 
FIGS. 7 and 8 show plots of the RMSE for the active and 
5 the seventh line as a function of threshold value for a 600-
conductor circuit. 
FIG. 9 shows the sparsity structure (nonzero entries shown 
darker) of the A matrix for an exemplary circuit of parallel 
wires driving a bank of inverters. 
10 	 FIG. 10 shows an exemplary preconditioner matrix that 
may be used with the exemplary circuit of FIG. 9. 
FIG. 11 shows the sparsity pattern (nonzero entries shown 
darker) of matrix A of a circuit having only non-linear devices 
and no interconnects. 
15 	 FIG. 12 shows average sparsity versus circuit size. 
FIG. 13 shows the voltage wave form obtained through 
SPICE and Exact-RLP and SASIMI. 
FIG. 14 shows a two dimensional non-uniform spatial grid 
for a Nanotransistor. 
20 FIG. 15 shows the ratio of memory consumption of the 
algorithm in [ 9] as compared to ours for varying division sizes 
(N,JD). 
FIG. 16 shows the ratio of memory consumption of the 
algorithm in [9] as compared to ours for a varying number of 
25 divisions (D). 
DETAILED DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 
30 For the purpose of promoting an understanding of the 
principles of the invention , reference will now be made to the 
embodiments illustrated in the drawings and specific lan-
guage will be used to describe the same. It will nevertheless 
be understood that no limitation of the scope of the invention 
35 is thereby intended , such alterations and further modifica-
tions in the illustrated device and such further applications of 
the principles of the invention as illustrated therein being 
contemplated as would normally occur to one skilled in the art 
to which the invention relates. 
40 	 While significant storage and computational advantages 
accrue with INDUCTWISE, we note that the sparsity of U has 
not been fully taken advantage of at the level of linear algebra 
(beyond the possible use of sparse Cholesky factorizations) in 
the numerical solution of (3). In particular, with the formula- 
45 tion used by INDUCTWISE, while the matrix U is sparse, its 
inverse is dense. Thus , trapezoidal numerical integration, at a 
first glance, entails matrix-vector multiplies with a dense 
matrix at each time step . However, it has been observed that 
the matrices U`V (where V is defined in (3)), U - 'A, 
50 U- 'A T and U-1 S are approximately sparse, and this informa-
tion can be used to significantly reduce the computation as 
follows. Rewrite (3) and (4) as 
v 
k+i_U 
 iVv„k-2U'A,Tiik+U'AiT(jk+i+I,
~ 
55 	
7k 1_ 	 7k 	 k 1 2L~ A i ii + 2L~ A i ii +hL~ S(v„ + +v„~ . 
Pre-compute and store the sparsified matrices U - 'V, 
U- 'A, U- 'AS T and U- ' S. Then, every time step in the trap-
ezoidal integration scheme requires only sparse matrix-vec- 
60 for multiplies. We will henceforth refer to this technique as 
the GKC-algorithm (as the computations are done with the 
conductance , inverse of the inductance and the capacitance as 
the parameters). 
In order to quantify the computational savings obtained 
65 with the GKC-algorithm over INDUCTWISE, we define the 
"sparsity index" µ e(A) of a matrix A as ratio of the number of 
entries ofA with absolute value less than e to the total number 
US 8,745,563 B2 
5 
of entries. Then, the computation required for each iteration 
withthe GKC-algorithm, with some appropriate value of e, is 
0((1-v)n2) where v is the minimum of the sparsity indices of 
the matrices U-'V, U-'A,  U-'A T and U- ' S. The value of v 
can be expected to depend on the threshold for detecting 
sparsity e, as well as the time step size h. FIG. 2 shows the 
average sparsity index of the matrices U - ' V, U-'A,  U-'A,T 
and U-1 S, for a structure with three parallel planes consisting 
of 600 conductors, as a function of h for various values of E. 
The typical value of h used solving the MNA equations for 
VLSI interconnects is 0.1 picoseconds. With such values of h 
and e=0.001, it can be seen that v-0.8. Thus the total compu-
tation time with the GKC-algorithm is approximately a fifth 
of that required by INDUCTWISE. 
We now explore an alternative formulation of the NINA 
equations that uses the resistance , inductance and the inverse 
of the capacitance matrix . For typical interconnect structures, 
shown in FIG. 1, we can manipulate the NINA equations (2) to 
obtain 
~ h  + + 4APAT )ia+l  = 	 (5) X — —. 
~ h — —  gAPAT )i~ +Av,  + 4AP(ls +t  +ls), Y— — — 
and  
6 
FIG. 3 shows the average sparsity index of the matrices 
X- 'Y, X- 'A and X- 'AP, for a structure with three parallel 
planes consisting of 600 conductors, as a function of h for the 
sparsity threshold of e -0.001 , as compared with the average 
5 sparsity index of the matrices encountered in the GKC-algo-
rithm. It is clear that the matrices encountered in the RLP-
algorithm exhibit much higher sparsity over a wide range of 
time-steps. In particular, for h-0.1 ps, it can be seen that 
y-0.9. Thus the total computation time with the above RLP- 
to algorithm is approximately one-tenth of that required by the 
RLP formulation that does not use sparsity information. 
When compared to the GKC-algorithm and INDUCTWISE 
which use twice as many state variables, the amount of com-
putation required by the RLP-algorithm is approximately 
15 one-eighth and one-fiftieth respectively. 
We now provide the details on the fast inversion of X. 
Assume for simplicity that the sparsity pattern in X -i is 
known, deferring for later the problem of detecting this spar-
sity pattern. Then, manipulations of only a subset of the 
20 entries of the X matrix (rather than the entire X matrix) can be 
used to compute the inverse matrix . To briefly illustrate the 
idea consider the example when XER .... i and the 5th row of 
X-1 has the following form: 
[00 0*00*00], 
25 where denotes the nonzero entries. Then , it can be shown 
that these nonzero entries can be computed exactly from the 
second row of the inverse of the following 3x3 matrix 
obtained from X: 
where P is the inverse capacitance matrix i . e P-C-i and  is 
the adjacency matrix of the circuit, obtained by first addingA g 
and Ai and then removing zero columns (these correspond to 
intermediate nodes, representing the connection of a resis-
tance to an inductance). When compared with the NA equa-
tions (3) and (4), we see that the number of state variables has 
been halved. Compared to INDUCTWISE, this represents 
immediate savings . For future reference , we will term the 
technique of directly solving (5) and (6) as the "Exact-RLP" 
algorithm. 
In contrast with the GKC-algorithm , it turns out here that X 
is dense, but with an inverse that is approximately sparse. 
Thus, windowing techniques such as those employed by 
INDUCTWISE during the extraction stage to obtain a spar-
sified matrix K can be employed here to quickly compute a 
sparsified X-i . (Windowing techniques details will be 
described below.) Moreover, the matrices X - 'Y, X- 'A and 
X- 'AP turn out to be approximately sparse. Thus, paralleling 
the development of the GKC-algorithm, we have the follow-
ing RLP-algorithm: 
Rewrite (5) and (6) as 
vk+i = v — 2PAT(ik+i+ik)+ h P( Ik+i +Is),  (6) 30 
X33 X35 X38 
X53 X55 X58 
X83 X85 X88 
35 	 More generally, suppose that there are a, nonzero entries in 
the ith row of X-i . By following a procedure as above, the ith 
row of X-1 can be computed by inverting an a xa matrix. 
Thus, the overall computation in determining is X -i is 
O(E iai3). It is typical with VLSI interconnects that a, is a 
40 small constant . Thus if X-1 is exactly sparse, with a known 
sparsity pattern, it can be computed in O(n) from X . Table I 
gives the time taken for inversion for different circuit sizes. 
TABLE I 
45 
Inversion time in matlab (in seconds) 
No. of conductors 
500 	 1000 	 2000 	 5000 
50 	
Direct Inversion 	 .29 	 2.18 	 16.87 	 260.68 
Fast Inversion 	 .79 	 1.48 	 2.93 	 10.17 
Thus, there remains the problem of determining the spar-
55 sity pattern in X-i . Recall that 
i', + ' = X -1  Yia + X —i  Avk, + 4 X—i AP(ls+i  + I,), 
X —i Avk,+i  = X_ 1 Avk,— ZX—i APAT (ia +1  +ia)+ ZX—i AP(ls+i  +I,) 
Pre-compute and store the sparsified matrices X - 'Y, X- 'A 
and X- 'AP. Again, every time-step in the trapezoidal integra-
tion scheme requires only sparse matrix-vector multiplies. As 
with the GKC-algorithm, the total computation with the RLP-
algorithm is dominated by 0((I-y)n 2), where is the y is the 
minimum of the sparsity indices the matrices X - 'Y, X- 'A and 
X-'AP. 
LR h 
X= h + 2 + 4 APA T . 
60 
Let 
L R 	 h 
W - h + 2 and Z= 4 APAT . 
65 
Y , (v; - vi)' 
15 
Y, V  
where v and v denote the waveforms obtained from Exact- 
RLP and the algorithm under consideration respectively. A 
20 threshold value of 0.001 was chosen for sparsification of RLP 
and GKC algorithms, as well as for sparsification of L -1 in 
INDUCTWISE. Table 2 presents a summary of the results 
from the study of simulation accuracy. 
25 TABLE 2 
RMSE comparison. 
Active Line 7th line 
30 	 Size INDUCTWISE RLP 	 GKC INDUCTWISE RLP GKC 
300 .0013 .0010 	 .0017 .1622 .1266 .1960 
600 .0014 .0011 	 .0014 .4381 .3452 .4651 
900 .0006 .0007 	 .0008 .3222 .3076 .4078 
1200 .0004 .0004 	 .0004 .2382 .2656 .2992 
35 	 1500 .0003 .0003 	 .0004 .2021 .2200 .2336 
US 8,745,563 B2 
7 
Then 
x-'=W '-W '(W '+z ')- ' ff- 1 . 	 (7) 
For the values of R, L, C and h under consideration, it turns 
out that 
X -1gly-1-w-1Zw 1 . 	 (8) 
Thus, the significant entries of X -1  can be obtained by super-
posing the significant entries of W-1  and the significant 
entries of W-1 ZW-1 . The sparsity pattern of W -1  can be 
efficiently determined using the techniques available in the 
literature. Turning next to 
W -1 ZW -1 = 
h
4 W -I APA rW -1 
note that the significant entries of W -1A are obtained by 
distributing the significant entries of W-1  into locations deter-
mined by the adjacency matrix A. In summary, we have the 
following heuristic for predicting the sparsity pattern in X -1 : 
First determine the significant entries of W -1  by determining 
the set of segments that are inductively couple with a given 
segment. In addition, spread the nonzero entries of W -1  to 
locations suggested by the adjacency matrix to find the 
remaining significant entries. 
These ideas are illustrated via a three dimensional inter-
connect structure of three parallel planes with 1500 conduc-
tors. In FIG. 4(a), the significant entries of the adjacency 
matrix A are shown to be darker. FIGS. 4(b) and 4(c) show the 
entries of W-1  and  X-1  respectively, again withthe significant 
entries shown darker. 
We emphasize that the actual computation of the signifi-
cant entries of X -1  proceeds via the technique in, where given 
the knowledge of the sparsity pattern resident in X -1 , the 
actual entries can be directly and efficiently computed. Thus, 
(7) and (8) are not used for computation, but only to motivate 
the heuristic for efficiently determining the sparsity pattern of 
X-1 . 
We implemented the INDUCTWISE, RLP and GKC algo-
rithms in MATLAB on a PC with an Intel Pentium IV 2.4 GHz 
processor. In order to quantify the simulation accuracy with 
various methods, we used as the benchmark the Exact-RLP 
simulation (recall that this is the direct simulation of equa-
tions (5) and (6)). (While SPICE simulations would have been 
more natural to use as the benchmark, we found that the 
computation time grew quickly to make them impractical; for 
a modest-size circuit comprising 100 parallel conductors, 
SPICE simulation took 350 seconds as compared to 1.08 
seconds with the Exact-RLP algorithm, with no detectable 
simulation error, as shown in the FIG. 5). 
Simulations were done on a three dimensional structure of 
three parallel planes, with each plane consisting of busses 
with parallel conductors, with wire-lengths of 1 mm, and a 
cross section of 1 ltmx I µm. The wire separation was taken to 
be 1 µm; each wire was divided into ten segments. A periodic 
IV square wave with rise and fall times of 6 ps each was 
8 
applied to the first signal on the lowest plane, with a time 
period of 240 ps. All the other lines were assumed to be quiet. 
For each wire, the drive resistance was I OQ the load capaci-
tance was 20 fF. A time step of 0.15 ps was taken and the 
5 simulation was performed over 330 ps (or 2200 time steps). 
As expected, with all methods, there is an inherent trade-
off between simulation accuracy and cost (CPU time and 
memory). We first present results comparing the accuracy in 
simulating the voltage waveforms at the far end of the first 
to (active) and the seventh (victim or quiet) lines. The metric for 
comparing the simulations is the relative mean square error 
(RMSE) defined as 
It can be seen that the simulation accuracy of the RLP and 
the GKC algorithms are comparable to that of INDUCT-
WISE, with a marginally inferior performance as measured 
40 by the RMSE. A plot of the voltage waveforms at the far end 
of the active line and the 7th line, obtained by INDUCT-
WISE, RLP, and GKC algorithms, is shown in the FIG. 6. 
We briefly explore the influence the choice of the threshold 
for determining sparsity. A higher threshold can be expected 
45 to decrease the computational and memory requirements, 
howeverwith loss in simulation accuracy. FIGS. 7 and 8 show 
plots of the RMSE for the active and seventh line as a function 
of threshold value, again for a circuit of size 600 conductors. 
Any value of the threshold below 0.001 appears to be a 
50 reasonable choice. 
We now turn to a comparison of the computational and 
memory requirements between INDUCTWISE, RLP and 
GKC algorithms. Table 3 summarizes the findings. 
TABLE 3 
Run time and memory comparisons 
Time (in sec) Memory (in MB) 
Size Exact-RLP INDUCTWISE RLP GKC Exact-RLP INDUCTWISE RLP GKC 
300 14.30 74.34 4.09 18.99 2.95 11.61 1.02 6.61 
600 76.21 422.00 16.28 77.32 11.61 46.20 2.36 15.38 
900 244.14 1133.40 33.21 162.08 26.03 103.84 4.09 31.68 
1200 513.08 3051.10 60.53 312.93 46.20 184.56 6.16 52.22 
1500 827.50 4682.00 92.16 813.00 72.14 288.24 7.60 86.00 
US 8,745,563 B2 
9 
It can be seen that for a circuit consisting of 1200 conduc-
tors, RLP is about nine times faster than the Exact-RLP, and 
fifty times faster than INDUCTWISE. The GKC algorithm is 
about twice as fast as the Exact-RLP, and ten times faster than 
INDUCTWISE. The Exact-RLP is about six times as fast as 
INDUCTWISE. With larger circuit sizes, the advantage of 
RLP over INDUCTWISE continues to grow, while the Exact-
RLP and GKC algorithms have an advantage over INDUCT-
WISE that grows slightly. An explanation for the slower per-
formance of INDUCTWISE compared to Exact-RLP is that 
the number of variables with the latter algorithm is one-half as 
that with the former. The same trends are observed with 
memory requirements. 
VLSI interconnect structures with non linear devices can 
also be analyzed using the Modified Nodal Analysis (MNA) 
formulation, yielding equations of the form 
10 
Here Cvv denotes the sub-matrix of the capacitance matrix 
that changes amid the simulation, while all other sub-matrices 
remain constant. The matrix Cvv captures the drain, gate and 
bulk capacitances of all devices, which are voltage-depen- 
15 dent, while C, and Cam, are the capacitance matrices that 
arise from interconnects and are hence constant. 
For typical interconnect structures, the above decomposi-
tion allows us to manipulate the MNA equations (10) and 
(11): 
20 
10 
We begin by decomposing C, A, and A as: 
CCC 
5 	 C— ~ C'C CvvA
T
—~ AzIAr— ~ Az 
-[°1-=- VVCI 
Cx + Cz = b, 	 (9) 
where 
AT 
C— ~ —A, 0 J' C— ~ 0 L1' x— ~ it 
b=
~
AT I,+I„ a 
0 	
1,Cj=AgR—'Ag,andC=A cT CA c . 
R denotes the resistance matrix. The matrices 9 , L and C are 
the conductance, inductance and capacitance matrices 
respectively, with corresponding adjacency matrices A g, Ai 
and A,. IS is the current source vector with adjacency matrix 
A,, and v„ and i i are the node voltages and inductor currents 
respectively. 
Vector, I„ i captures the effect of non-linear loads and 
depends on the node voltages as I„, -f(v„). f is a function 
which varies depending on the load characteristics and in 
general can be a non-linear function. 
Utilizing the trapezoidal method to numerically solve (9) 
requires the solution of a set of linear and non-linear equa-
tions: 
cc +, - c c 	 bk+' + b  	 ( 1 0) 
~ 2 + hid 	 ~ 2 h ~ + 	 2 
and 
,,I I I = f (vn+i). 	 (11) 
The nonlinearity in the above set of equations can be handled 
by the standard Newton-Raphson technique of linearizing 
(11) and iterating until convergence: Equation (10) is a linear 
equation of the form L(x)O, where we have omitted the 
iteration index k for simplicity. Equation (11) is a nonlinear 
equation of the form g(x) -O. Let g(x)-G(x) be a linear 
approximation of g(x), linearized around some x -xo . Then, 
simultaneously solving L(x)-O and G(x)-O yields numerical 
values for x and hence v,,. These values are then used to obtain 
a new linear approximation g(x)-G„ eW(x), and the process is 
repeated until convergence. A good choice of the point x o for 
the initial linearization at the kth time-step is given by the 
value of v„ from the previous time-step. 
A direct implementation of this algorithm requires 
0(pgnr 3) operations, where  is the number of time steps, qis 
the maximum number of Newton-Raphson iterations in each 
time step, and n, -3N. 
~ h  + R  + 4A l P Aria+ i  = ( h — R — 4A i P~~Ai~ ia + i vk 	 + 	 (12) X 	 r 
25 	 4AiPccA ( ~s +i +Is) — A,PccC_(vk+i— vv)+ 22 (vv+i +vv), 
vk+i = 	 (13) 
v~ — 
2
P~~ Az(ia +i  + )+ 2P_A  (Is +i  +Js) — P~~ C_(vv +i  — vy), 
30 	 h (14) 
C"Vk + l  = C"Vk  - 2A 2 (~' + 1 + )+ 
2 A z US+i  + 
I
s) — Cvv (v
k+i _ vk) +h (w+i + I'), 
[v+i = f (vv+i ). 35 	 (15) 
Here r denotes the size of interconnect structure connected 
directly to non linear circuit, and given 1=N-r we note that 
40 	 C_ R " C ~R  x' 
P,,-C,,- r is the inverse capacitance matrix, and A is the 
adjacency matrix of the circuit. A is obtained by first adding 
Ag andA, and then removing zero columns (these correspond 
to intermediate nodes, representing the connection of a resis- 
45 tance to an inductance). 
Thus far, the analysis is similar to that of the linear ele-
ments structures described above, with the major difference 
being the addition of (14) and (15), which account for the 
nonlinear elements. We will show here how the linear tech- 
5o niques can be extended to handle the case when nonlinear 
elements are present. 
For future reference, we will call the technique of directly 
solving (12), (13), (14), and (15) as the "Exact-RLP" algo-
rithm. It can be shown that the computational complexity of 
55 the Exact-RLP algorithm is 0(13  +pq(12  +r  3))  . For large VLSI 
interconnect structures we have 1»r, reducing the complex-
ity to 0(13 +pq(1 2)). 
We now turn to the fast solution of equations (12) through 
(15). Recall that the nonlinear equation (15) is handled via the 
6o Newton-Raphson technique. This requires, at each time step, 
linearizing (15) and substituting it into (14). The resulting set 
of linear equations have very specific structure: 
Equations (12) and (13) are of the form Ax=b where A is 
fixed (does not change with the time-step). Moreover, 
65 	 A-r is typically approximately sparse. 
Equation (14) (after the substitution of the linearized (15)) 
is again of the form Ax=b, where the matrix A is 
US 8,745,563 B2 
11 
obtained by adding Cvv and the coefficient of the first-
order terms in the linearized equation (15). Recall that 
the matrix Cvv captures the drain, gate and bulk capaci-
tances of all devices . It also contains the interconnect 
coupling capacitances between gates and drains of dif-
ferent non-linear devices in the circuit. As each non-
linear device is connected to only a few nodes and the 
capacitive effects of interconnects are localized, the A 
matrix is observed to be sparse in practice. Note that A 
changes with each Newton -Raphson iteration and with 
the time-step. 
Thus the key computational problem is the solution of a 
sparse time-varying set of linear equations, coupled with a 
large fixed system of linear equations Ax=b with A` being 
sparse. 
Krylov subspace methods have been shown to work 
extremely well for sparse time-varying linear equations. Spe-
cifically, the GMRES (Generalized Minimum Residual) 
method of Saad and Schultz allows the efficient solution of a 
sparse, possibly non-symmetric, linear system to within a 
pre-specified tolerance . This method performs a directional 
search along the orthogonal Amoldi vectors which span the 
Krylov subspace of A. That is , given an initial guess xo and 
corresponding residual r o=b-Axo , orthogonal vectors {q l , 
qz, .... qm } are generated with the property that they span S m , 
the solution search space at iteration m. 
& = xp + span{ro, Aro, ... , A'rpl 	 (16) 
= xp +K(A, ro, m) 
c span{it, i2 ... , q. 1. 
These vectors are chosen according to the Amoldi itera-
tion: AQ_-Qm+,Hm where Q_-Iqt , q21  .... qm } is orthogonal 
and H_ERm+"- is an upper Heisenberg matrix. 
For these methods the choice of a preconditioner matrix M, 
which is an approximation of A, can greatly affect the con-
vergence. A good preconditioner should have the following 
two properties: 
M-tA=I. 
It must accommodate a fast solution to an equation of the 
form Mz-c for a general c. 
FIG. 9 depicts the sparsity structure of the A matrix for a 
circuit example of parallel wires driving a bank of inverters. 
For such a sparsity structure , an appropriate choice of the 
preconditioner could be of the form as shown in FIG. 10. 
Although we have chosen a circuit with only inverters for 
simplicity, a more complicated circuit structure would simply 
distribute the entries around the diagonal and off-diagonal 
bands and lead to possibly more off diagonal bands. To see 
this, consider an extreme case where the circuit under con-
sideration has only non-linear devices and does not comprise 
of interconnects . In this case the sparsity pattern of the A 
matrix is as shown in FIG. 11. Therefore, the chosen precon-
ditioner would encompass not only the sparsity structure 
shown in FIG. 9 but also other sparsity patterns that might 
arise with the analysis of more complicated non-linear 
devices. Correspondingly the structure of the preconditioner 
(see FIG. 10) would have additional bands. 
Matrices of the form shown in FIG. 10 have the following 
two properties which make them an ideal choice for precon-
ditioner. 
The inverses of the preconditioner matrix can be computed 
efficiently in linear time, O(r) (r denotes the size of 
12 
interconnect structure directly connected to non-linear 
devices), by exploiting the Hadamard product formula-
tion. 
It can also be shown that this formulation facilitates the fast 
5 matrix-vector products , again in linear time (O(r)), 
which arise while solving linear systems of equations 
with the preconditioner matrix. 
A simple example which best illustrates these advantages 
10 
is a symmetric tridiagonal matrix. 
a, 
— bi (17) 
—bi a2 — b2 
B 15 
— bn 2 a­ 1 — bn 1 
— ba-1 a„ 
20 The inverse of B can be represented compactly as a Hadamard 
product of two matrices, which are defined as follows: 
U1 U1 	 ... U1 V1 V1 	 ... V„ (18) 
U1 u2 	 ... u2 V2 V2 	 ... Vn 25 	 B-1 = 
~ . ° ~ . 
--
u1 u2 	 .. u„ 
------ 
V„ V„ 	 .. V„ 
--------------- U --------------------- V 
30 There exists an explicit formula to compute the sequences 
{u}, {v} efficiently in O(n) operations. In this case, if we are 
interested in solving a linear system of equations By -c, we 
only need to concern ourselves with the matrix-vector prod- 
35 uct B- 'c-y. This computation can also be performed effi-
ciently in O(n) computations as outlined below: 
[-1 	 1 	 (19) 
Pu. = L UjCj, P, = L VjCj, i = 1, ... , n 
40 	 .i=1 	 .i=1 
Y1 =u1Pn' 
Y; = V; P.,_1 + uj Pv, , i = 2, ... , n. 
45 
The above formulation for a tridiagonal matrix could be eas-
ily extended to handle the more general case when the pre-
conditioner matrix is a zero padded block tridiagonal matrix 
(matrix with zero diagonals inserted between the main diago- 
50 nal and the non-zero super -diagonal and sub-diagonal of 
tridiagonal matrix ) as in FIG. 10. Elementary row and column 
block permutations could be performed on such a matrix to 
reduce it into a block tridiagonal matrix. This has been shown 
with a small example as below. 
55 
a, 0 —bt 0 	 (20) and (21) 
0 a2 0 —b2 
e- 
—b i 0 a3 0 
60 	 0 —b2 0 a4 
at —bi 0 0 
—bi a2 0 0 	 T P 
0 0 a3 —b2 	
' 
0 0 —b2 a4 
65 X 
US 8,745,563 B2 
14 
To proceed, we rewrite (12) and (13) as 
is+l=X—l Yia+X
—l Aly'+4X ' AiP_A (ls+l+ls)— 	
(23) 
5 
A 
X —I AIP_C_(vv+l  _vy)+ 2 
2 
(vv +l +vv), 
X l A1Vc
+l =X 1 A1vk— X —i AI2PccA2T (ia +1 +ia)+ 	 (24) 
10 	 h 	 1 	 T k+1 	 k 	 1 	 k+l — k 
ZX AIPccAuUs + ~s ) — X AIPcc Cw(vv 	 V ). 
13 
-continued 
where 
1 0 0 0 
0 0 1 0 
P- 
0 1 0 0 
0 0 0 1 
Hence, 
ul 0 ul 0 	 v l 0 V3 0 	 (22) 
0 w2 0 wz 	 0 Vz 0 V4 
s-1 = PX- lPT = 	 o 
ul 0 u3 0 	 V3 0 V3 0 
0 uz 0 u4 	 0 V4 0 V4 
----------------------- ---------------------- U 	 V 
We have not included block matrices for simplicity of pre-
sentation , however the zero padded block tridiagonal case is 
a natural extension of the above example. All the entries in 
U,V matrices have to be now replaced by blocks and accord-
ingly the row and column permutations would be replaced by 
their block counterparts with an identity matrix of appropriate 
size replacing the `ones' in the P matrix. Table 4 gives the 
comparison for Incomplete-LU preconditioner and the zero-
padded (Z-Pad) preconditioner . Simulations were done on 
circuits consisting of busses with parallel conductors driving 
bank of inverters. `Size' denotes the number of non-linear 
devices. All the results are reported as a ratio of run-time and 
iteration -count (number of iterations for the solution to con-
verge to within a tolerance of I e-10) of Z-Pad to the Incom-
plete-LU preconditioner. As can be seen from Table 4, Z-Pad 
offers a substantial improvement in run time as compared to 
the Incomplete-LU preconditioner. 
TABLE 4 
Preconditioner 
Size 
400 	 800 	 1600 	 3200 
Runtime 	 .44 	 .42 	 .42 	 .43 
Iterations 	 5110 	 5110 	 5110 	 5110 
We now turn to the solution of equations (12) and (13). As 
mentioned earlier, these equations reduce to the form Ax -b 
with a constant, approximately sparse A - '. A (corresponding 
to X in (12) is composed of L, R and P. Each of these matrices 
has a sparse inverse for typical VLSI interconnects which 
then leads to a approximately sparse A` (Note that this argu-
ment is used for motivating the sparsity inherent in A -1 and 
cannot be used as a theoretical proof for the same). In addition 
this sparsity has a regular pattern which can be explained on 
the basis of how inductance and capacitance matrices are 
extracted. The distributed RLC effects of VLSI interconnects 
can be modeled by dividing conductors into small subsets of 
segments, each of which are aligned . Each of these subsets 
leads to a sparsity pattern (corresponding to a band in A`). 
All the effects when summed up lead to a A -1 matrix that has 
a regular sparsity pattern. Window selection algorithm can 
then be employed to find out the sparsity pattern in A - '. It has 
been recognized in earlier work that this property (sparsity) 
yields enormous computational savings; it has been shown 
that an approximate implementation of the Exact -RLP algo-
rithm, referred to simply as the "RLP algorithm " provides an 
order-of-magnitude in computational savings with little sac-
rifice in simulation accuracy. 
Although X is a dense matrix, X -1 turns out to be an approxi- 
mately sparse matrix. Moreover the matrices X -1 Y, X-lAl , 15 X
-IA,P_A, j T X-IA,P,,C_ are also approximately sparse. 
This information can be used to reduce the computation sig-
nificantly by noting that each step of trapezoidal integration 
now requires only sparse vector multiplications . Solving 
sparse (23) and (24) along with (15) and (16) is termed as the 
20 RLP algorithm (SASIMI). To analyze the computational sav-
ing of the approximate algorithm over the Exact-RLP algo-
rithm, we denote "sparsity index" of a matrix A as ratio of the 
number of entries of A with absolute value less than e to the 
total number of entries. The computation required for each 
25 iteration of (23) and (24) is then 0((1  _V)12), where v is the 
minimum of the sparsity indices the matrices X -1 Y, X-lAl , 
X-IA,P_A, j T X-IA,P,,C_. FIG. 12 provides the average 
sparsity for the matrices for a system with parallel conductors 
driving a bank of inverters. The sizes in consideration are 100, 
30 200 , 500 and 1000 . On top of this the computation time of can 
be reduced to O(1) by using the windowing techniques (details 
in [16]). Hence the computational complexity of RLP is O(pq 
(I -v)12) as compared to O(pgn 1 3) for the MNA approach. 
35 We implemented the Exact-RLP and RLP (SASIMI) algo-
rithms in C++. A commercially available version of SPICE 
with significant speed-up over the public-domain SPICE has 
been used for reporting all results with SPICE. Simulations 
were done on circuits consisting of busses with parallel con- 
40 
ductors driving bank of inverters, with wires of length I mm, 
cross section I µmx I µm , and with a wire separation of I µm. 
A periodic IV square wave with rise and fall times of 6 ps 
each was applied to the first signal with a time period of 240 
ps. All the other lines were assumed to be quiet. For eachwire, 
45 the drive resistance was 10Q. A time step of 0.15 ps was taken 
and the simulation was performed over 30 ps (or 200 time 
steps). For the inverters the W/L ratio of NMOS and PMOS 
were taken to be 0.42 µm/0.25 µm and 1.26 µm/0.25 µm 
respectively. 
In order to explore the effect of the number of non-linear 
50 elements relative to the total , three cases were considered. 
With p denoting the ratio of the number of linear elements to 
that of non-linear elements, the experiments were performed 
for p equaling 5, 20 and 50 . The number of linear elements in 
the following results is denoted by a. 
55 	 We first present results comparing the accuracy in simulat- 
ing the voltage waveforms at the far end of the first line (after 
the inverter load). The RMSE is again use for comparing the 
simulations , defined here as 
60 
Y' (v; - vs)' 
V 
  
65 
where v and v denote the waveforms obtained from Exact-
RLP and SASIMI respectively. 
US 8,745,563 B2 
15 
Table 5 presents a summary of the results from the study of 
simulation accuracy. It can be seen that the simulation accu-
racy of the Exact-RLP algorithm is almost identical to that of 
SPICE, while the SASIMI has a marginally inferior perfor-
mance as measured by the RMSE. The error values for 
SASIMI are compared simply with the Exact-RLP as it had 
the same accuracy as SPICE results for all the experiments 
run. A plot of the voltage waveforms at the far end of the 
active line, obtained from SPICE, Exact-RLP and SASIMI 
algorithms, is shown in FIG. 13. (The number of conductors 
in this simulation example is 200.) There is almost no detect-
able simulation error between the SASIMI, Exact-RLP and 
SPICE waveforms over 200 time steps. To give a better pic-
ture, the accuracy results reported are for a larger simulation 
time of 2200 time steps. 
TABLE 5 
16 
The Hamiltonian of a valley b for electrons, associated with 
the device under consideration, is as follows: 
5 	 fi d I d 	 d(I d ~ d I d 	 (25) 
Hb(r)=- 2 dx ~ medx ~ + dy medy +dZ ~W,T +V(r), 
dry, (ri, r2, kz, E)Ce(ri, r2, kz, E) = &I - r2), 
b 
where (mxb, myb, mzb ) are the components of effective mass in 
10 valley b. The equation of motion for the retarded Green's 
function (G') and less-than Green's function (G`) are: 
2kzz 	
(26) 
E 2m - Hb(rt) Cart, rz, kz, - 
15 
RMSE comparison. 	 fi2kz 	 (27) 
a 
E 2m  - Hb(rt) Cart, rz, kz, - p=5 	 p=20 	 p=50 20 
100 	 .0054 	 .0053 	 .0088 	 J dry' (ri, r2 , kz , E)Cb(ri, r2, k z , E) _ 
200 	 .0078 	 .0052 	 .0071 	 b 
500 	 .0006 	 .0022 	 .0001 
1000 	 .0003 	 .0005 	 .0004 	 f dr~ (ri , r2 , kz, E)Gb (r i , r2 , kz , E). 
2000 	 .0003 	 .0004 	 .0004 	 25 
b 
We now turn to a comparison of the computational require-
ments between Exact-RLP, SASIMI and SPICE. Table 6 sum-
marizes the findings. For a fair comparison, our total simula-
tion time is compared against the transient simulation time for 
SPICE (i.e we have not included any of the error check or set 
up time for SPICE). As can be seen from the table, SASIMI 
outperforms the Exact-RLP algorithm and SPICE. For the 
case of 500 conductors with p=50, the Exact-RLP algorithm 
is 390 times as fast compared to SPICE. SASIMI is about 
1400 times faster as compared to SPICE, and more than three 
times faster than Exact-RLP. As can be seen, the computa-
tional savings increase as the ratio of linear to non-linear 
elements is increased from 5 to 50. The savings also increase 
with increase in the size of the problem considered. The 
computational efficiency of the SASIMI can be explained on 
the use of sparsity-aware algorithms for both the linear and 
non-linear parts of the problem. 
TABLE 6 
Run time (in seconds) comparisons 
Given Gr and G`, the density of states and the charge density 
can be written as a sum of contributions from the individual 
30 valleys, 
N(r, kz, E) _ 
	
Nb(r, kz, E) _ - Im[Ci (r, r, kz, E)], 	
(28) 
b 35  
P(r, kz, E) _ 
	
Pb(r, kz, E) = - i[Qb(r, r, kz, E)]. 	 (29) 
b 
The self-consistent solution of the Green's function is 
40 often the most time intensive step in the simulation of electron 
density. It was shown by Svizhenko et al. in "Two-dimen- 
sional quantum mechanical modeling of nanotransistors," 
Journal of Applied Physics, 91(4):2343-2354, 2002, hereby 
p=5 p=20 p=50 
a SPICE Exact-RLP SASIMI SPICE Exact-RLP SASIMI SPICE Exact-RLP SASIMI 
100 11.96 1.34 1.26 13.73 .27 .21 13.54 .15 .12 
200 100.25 3.28 2.68 68.72 .64 .28 67.68 .55 .22 
500 3590.12 17.13 4.872 1919.21 13.47 3.01 1790.67 4.58 1.30 
1000 >12 hrs 87.75 22.71 >10 hrs 79.07 16.49 >10 hrs 77.56 15.20 
2000 >1 day 545.6 78.06 >1 day 526.23 59.33 >1 day 408.54 56.05 
The existing methods for finding the inverse of a block 
tridiagonal matrix suffer from being either numerically 
unstable or heavily memory intensive and hence impractical 
for problems of very large size (e.g. 10 6x106). 
Consider the two-dimensional model of a nano-scale tran-
sistor, shown in FIG. 14. The body of the transistor is pro-
jected onto a two-dimensional non-uniform spatial grid of 
dimension NxxNy, where Nx(Ny) denote the number of grid 
points along the depth (length) of the device. A brief review of 
the governing physics of this device is provided here.  
incorporated by reference, that the approximate block tridi- 
agonal structure of the left-hand side in (26) and (27) facili- 
60 tates the efficient calculation of the electron density. Specifi-
cally, it was demonstrated that only the diagonal entries of G" 
are needed to be computed for such a simulation. 
Svizhenko et al. showed that the problem of computing 
65 electron densities in a nanotransistor can be reduced to find-
ing the diagonal blocks of G', where AG'-I and A is a block 
tridiagonal matrix of the form 
US 8,745,563 B2 
17 	 18 
for large problem sizes. This will preclude it from being 
Al -Bl 	 (30) 	 directly implemented to solve these problems. Alternatively, 
-B; A 2 	 -B2 	 the divide and conquer algorithm described below avoids 
A 
 = 
	
	
these numerical issues by only handling manageable problem 
-BN 
5 
Y 
 - 2 A NY  - 1 -BN Y  - 1 	 sizes at each step. 
	
-BN Y  -1 A NY 	 In the most simple case, the block tri diagonal matrix, A can 
be decomposed into two sub-matrices connected by what we 
where each A,, B ye C N xN Thus AE C NyN- xNyN- with NY  10 will call a bridge matrix. This concept is illustrated below: 
diagonal blocks of size Nx each. We will denote A compactly 
as A=trid(A,,B,). 
	
The algorithm in Svizhenko et al. calculates the diagonal 	 A = n + xY, 
	
blocks (D,) of A-1 in the following manner. 	
15 
 
Gl=Al-', 	
Sz 
G,+l=(,4i,,—BsG,Bs)—1, i=1, 2, ... , AY-1, 
DNY GNY, 	
20 
D,=G,+G,B,D,+,B,G;, i=NY  1, NY -2,..., 1. 	 (31) 
The time complexity of this algorithm was shown to be 
O(Nx3Ny), with a memory requirement of O(Nx3Ny). 
Alternatively, the inverse of a block tridiagonal matrix can 25 
be computed explicitly by exploiting the Hadamard product 
formulation . When 113 , 1 are non-singular , A is said to be 
proper . In this case there exists two (non-unique ) sequences 
of matrices {U,}, {V,} such that for j>_i, (A -l )~ U~VjT. 
30 
Hence, A` can be written as 
Al -Bi 
T 
—Bl A 2 —B2 
.Sl = 
-BT 2 	 A i-1 -B;—1 
-Br 
l Ai 
A+l -B +, 
-B
r 
;+l A;+2 -B;+2 
S2 
 = 
—BTY — 2 ANY — 1 —BNY — 1 
-BT
NY -1 A NY 
0 	 0 
Ul V1 	 Ul V2 	 Ul V3 	 ... 	 Ul VNY 
V2Ul 	 Ul 	 U2 V3 	 ... 	 U2 VNY 35 _g 	 0 0 
A l = V3 Ui 	 V3 UZ 	 U3 V3 	 ... 	 U3 VNY 
X- 
0 	 -BT 
Y- 
0 
VNY U1 	 VNY U2 	 VNY U3 	 ... 	 UNY  VNY 
0 	 0 
40 
0 1 	 ... 0 
1 0 ... 0 
Notice that the block tridiagonal structure is preserved in 
each sub-problem, and they are joined by the N xxNx bridge 
matrix B,. The structure of S, and S z facilitates the use of 
45 Hadamard based methods (32) for determiningthe solutionof 
each sub-problem. However, the question then becomes how 
to determine the global solution from each sub-problem solu-
tion and their corresponding bridge. This problem can be 
5o resolved by the use of a fundamental result of linear algebra. 
The matrix inversion lemma describes the effect of a low rank 
v.T (v. +lTA£+l — v+2TB,+lT), i=NY  2, ... , 2,1. 	 (32) 	 correction term on the inverse of a matrix, and can be used in 
I denotes identity matrix of appropriate size 	 the following way: 
The matrices encountered in the simulation of nanotrans- 55 
istors enjoy the property that the diagonal blocks {A,} are 
tridiagonal while the off-diagonal blocks 1 13 , 1 are diagonal 
(this is due to the tight-binding approximation of the Hamil-
tonian). The complexity of the above parameterization is then 
reduced to O(NYNx2+Nx3). It is worthwhile to note here that 60 
the complexity of algorithm in (31), which was given to be 
O(NyNx3), does not change based upon these properties (al-
though the actual run time is reduced). Therefore, the reduced 
complexity of the Hadamard formulation makes it an ideal 65 
choice for the solution of this problem. However, it has been 
shown that the above recursions can be numerically unstable 
The U and V sequences can be efficiently computed in 
O(NYNx3) operations in the following manner: 
Ul 4 U2—B1  1A 1, 
U+l =BY l (A,U B=-I TU—l), i=2, ... , NY-1, 
7 	 7 	 —1 
VNY  — (AN Y  UNY BN Y  1 UNY  1) , 
VNY 1
7 
 = VNY
7 
'4NYBNY I
-1 
 , 
A-1 =(A+XY) 
=A-1 - (A l X)(I+YA l X) l(YA l) 
where, 
—1 X 	
( — ClB;) 	 0 
A - 
	
0 	 (-C2Bi , 
	
_l —1 	 I 	 -S2 1(1, 1)BT —
1 
(I+YA X) _ l  
	
-Si l (i,i)B ; 	 I 
(33) 
US 8,745,563 B2 
19 	 20 
-continued 	 -continued 
YA i = 
(Ci) 0 
5 	 ... 
 
[61 
 VT] ... [UiVm] 
Here, C,=S, -i (:,i) and C2-S2-'(:I  I) denotes the last (first) 
 
block columns of S i -i (Sz-i ) respectively. 	 10 
Although (33) can be used directly to solve the problem 
described above, its lack of computational efficiency pre-
cludes its use in this case. However, it is important to note that 
the adjustment term, (I+YA- 'X)-i , depends only on the cor-
ner blocks of the inverses of the sub-problems. This observa-
tion provides the basis fora formulation of the solution for the 
more general case. Specifically, the divide and conquer algo-
rithm of the present invention addresses the issue of how to 
combine multiple sub-problem solutions in both a memory 
and computationally efficient manner. 
An overview of the divide and conquer algorithm is pro-
vided here. 
The procedure begins with separating the block tridiagonal 
matrix A into D subproblems, each joined by a bridge matrix. 
The procedure presented previously motivates the formula-
tion of the global solution by combining the sub-problems in 
a simple radix 2 fashion. However, this method offers no 
improvement in terms of memory consumption and is com-
putationally intensive. Alternatively, matrix maps are created 
to capture the effect of each combining step without perform-
ing all of the associated computation. Adjustments to the 
matrix maps at each combining stage are not constant and 
must be modified to parallel the procedure above. These maps 
can then be used in the final stage to transform the subproblem 
solutions into the global solution. 
Each combining step in the divide and conquer algorithm 
will have an associated entry in the "job" list pointing to the 
first and last sub-problems along with the corresponding 
bridge point. For example, (1-2,3-4) describes the action of 
combining joined sub-problems I and 2 (S,_ 2) to joined sub-
problems 3 and 4 (S3_4) by use of the bridge matrix between 
problems 2 and 3. The corresponding entry in the job list 
would be of the form: [start, stop, bridge]=[1, 4, 2]. 
This formulation lends itself directly to parallel implemen-
tation, since computations associated with non-overlapping 
combinations can be farmed out across multiple systems. 
To model the effect of any combining stage in the job list, 55 
it is necessary to know the corner block elements from the 
inverse of each combined sub-problem. This process can be 
easily illustrated by a simple example. Suppose once again 
T 	 T 	 (34) Z,= -BT ) JU T 
Z2 = -B-1 0nVn 
The matrix maps associated with the Upper sub-problems (S, 
and Sz) are updated in the following manner. 
M6 M3 7(ADJ12B„+1)M4
— M6 
M7_ M4 7(ADJ12B„+1)M3
— M7 
M$ -M4 7( 4DJ12B„+1)M4— M8 (35) 
where n=size (S,)+size (S 2), and m=size (S 3)+size (S4). 
It would be impractical to recalculate the inverse of each 
joined problem for each combining stage. Alternatively, 
15 matrix maps areusedto efficiently produce the required block 
entries. 
Matrix maps are created to produce the cumulative effect of 
each combining step associated with a particular sub-prob- 
20 lem. There are a total of eight NxxN, matrix maps 1M,1 for 
each sub-problem. The process of updating matrix maps can 
be broken down into two categories: Adjustments to Upper 
sub-problem and those to Lower sub-problems, the distinc-
tion being their location with respect to the bridge point. This 
25 procedure will be illustrated using the above example. The 
"adjustment" matrix, AD7, for the combining step is defined 
as follows: 
30 
35 	 P=Y-Z2Z1)-i, 
	
 (ADJ 
11
ADJ = 1 + PZ2 
	 P~ P - ADJ21 ADJzz 
40 
45 	 M,+((J,V 7ADJ12B,)M,—M i 
M,((J,fl, 7ADJ1,B„+i)M4
— MI 
(CI f _ADJ12B„+i)M,—Ma 
50 	 (CI P 7ADJ22B„+i)M4_M4 
Ms M3 7(ADJ12B„+1)M3
— M5 
that the combination stage is (1-2,3-4), let Q,=S,_ z 	 Those associated with the Lower sub-problems (S 3 and S4) 
and Q2-S3-4.  The corner block elements of Q, -i and Q2_'  can 60  are updated in the following manner: 
be found using the parameterization given in (32). 	 (u, v TADJ„B„+1 T)M,—M, 
(U,fl, 7ADJ„Bn+1T)M2—M2 
t 
0 Q 
Q
2 i 
65 	 M3 +(fl_ 01 7ADJ21B„+1 T)M1 — M3 
M4+(f m 01TADJ21B„+1T)M2—M4 
US 8,745,563 B2 
21 
Ms M1 7(ADJ21B„+1 T)M1 -Ms 
M6_M1 T(ADJ21B„+1 T)M2-M6 
M1_ M27(ADJ21B_1T)M1-M, 
Ms M2 7(ADJ21B„+1 T)M2-Ms (36) 
The above procedure for modifying the matrix maps is 
repeated for each entry in the job list. In the final stage the 
matrix maps are used to generate the diagonal blocks of A - '. 
It is important to note that this scheme fits nicely into a 
parallel framework due to the fact that systems handling 
Upper or Lower sub-problems would only have to trade the 
limited amount of information in (34) to modify the matrix 
maps they are governing. 
22 
FIG. 15 shows the ratio of memory consumption of the 
algorithm in Svizhenko et al. as compared to the divide and 
conquer algorithm for varying number of blocks per division 
(sub-problem size). FIG. 16 shows the same ratio for varying 
5 number of divisions. 
While the invention has been illustrated and described in 
detail in the drawings and foregoing description , the same is 
to be considered as illustrative and not restrictive in character, 
10 it being understood that only the preferred embodiment has 
been shown and described and that all changes and modifi-
cations that come within the spirit of the invention are desired 
to be protected. 
15 	
What is claimed is: 
1. A system for simulating operation of a VLSI intercon-
nect structure having capacitive and inductive coupling 
between nodes thereof, comprising: 
• processor; and 
• memory, said processor configured to: 
obtaining a matrix X and a matrixY containing different 
combinations of passive circuit element values for 
said interconnect structure, said element values for 
each matrix including inductance L and inverse 
capacitance P, 
obtaining an adjacency matrix A associated with said 
interconnect structure, 
storing said matrices X, Y, and A in said memory, and 
performing numerical integration to solve first and sec-
ond equations each including as a factor product of 
inverse said matrix X (X -1 ) and at least one other 
matrix, said first equation including X -1Y, X-1A, and 
X-1 P and said second equation including X -1A and 
X-1 P. 
2. The system of claim 1, wherein said matrices X and Y 
4o each contain resistance values in addition to L and P. 
3. The system of claim 2, wherein said first equation is 
substantially of the form, 
Pseudo-Code 
20 
1. For each sub-problem in (1,2,...,D] 
• Determine corner blocks from inverse of sub-problem 
• Associate bridge matrix (excluding problem D) 
• Initialize matrix maps 
2. Generate list of sub-problem combinations radix 2 
3. Adjust mappings for each combining step 	 25 
4. Compute diagonal elements for each division 
5. Apply matrix maps to transform sub-problem solutions to the global 
solution 
The time complexity of the divide and conquer algorithm is 30 
O(N,'Ny+Nx3 D 1092 D) where a is defined to be the order 
associated with a NxxNx matrix multiplication (typically 
a=2.7). The memory consumption is 
35 
N, 'N 
O 
D y
+NxD. 
The divide and conquer algorithm, along with the algo- 45 
rithm in Svizhenko et al. have been implemented , in Matlab, 
on a single 32-bit x86 Linux workstation. All results reported 
are for test cases on a MIT well-tempered 25 nm device-like 
structure. 
Table 7 shows the run-time comparison of the two algo- 50 
rithms across the cases: N, -100 , Ny 3,000-80 , 000. Notice 
that the algorithm given in Svizhenko et al. was unable to 
perform past the case N Y-1 1,000 due to memory restrictions. 
However, the divide and conquer algorithm was able to 
handle these cases without encountering memory overflow. 
TABLE 7 
Run time (min) comparison 
Size = N_ * 
ik+1 = X
-1 Y~ +X-lAyk + 4X-lAP(Ik+l +Is), 
where v„ and i t are node voltages and inductor currents, 
respectively, A is an adjacency matrix for the circuit, and 
0.4 	 0.5 	 0.6 	 0.7 	 0.8 	 0.9 	 1.0 	 1.1 	 2.0 	 4.0 	 8.0 
Divide and Conquer Algorithm 	 5.38 6.66 8.10 9.41 11.13 12.25 13.75 14.99 27.59 58.86 107.0 
Algorithm in Svizhenko et al. 	 1.22 1.51 	 1.82 2.12 	 2.43 2.72 3.01 3.33 
US 8,745,563 B2 
23 	 24 
IS is a current source vector, and wherein said second 
equation is substantially of the form 
X —'Av k,+i  = X —'Av k,— ZX — ' APAT (ia+i  +ia)+ ZX —' AP(ls+i  +I,) 
