Computationally efficient modeling and simulation of large scale systems by Jain, Jitesh et al.
mu uuuu ui iiui iiui mu mu mu mu mu uiii iuui mi uii mi
(12) United States Patent
Jain et al.
(54) COMPUTATIONALLY EFFICIENT
MODELING AND SIMULATION OF LARGE
SCALE SYSTEMS
(75) Inventors: Jitesh Jain, Rajasthan (IN); Stephen F.
Cauley, West Lafayette, IN (US); Hong
Li, He nan (CN); Cheng-Kok Koh, West
Lafayette, IN (US); Venkataramanan
Balakrishnan, West Lafayette, IN (US)
(73) Assignee: Purdue Research Foundation, West
Lafayette, IN (US)
(*) Notice: Subject to any disclaimer, the term of this
patent is extended or adjusted under 35
U.S.C. 154(b) by 262 days.
(21) Appl. No.: 11/593,465
(22) Filed:	 Nov. 6, 2006
Related U.S. Application Data
(60) Provisional application No. 60/733,460, filed on Nov.
4, 2005, provisional application No. 60/740,990, filed
on Nov. 30, 2005.
(51) Int. Cl.
G06F 17150	 (2006.01)
(52) U.S. Cl . ....................... 716/4; 716/5; 716/6; 703/2;
703/14
(58) Field of Classification Search ................. 716/4-6;
703/2, 14
See application file for complete search history.
(56)	 References Cited
U.S. PATENT DOCUMENTS
5,610,833 A	 3/1997 Chang et al.
5,692,158 A * 11/1997 Degeneffetal . ............... 703/2
6,041,170 A *	 3/2000 Feldmann et al . .............. 703/2
(1o) Patent No.:	 US 7,774,725 B1
(45) Date of Patent: 	 Aug. 10, 2010
6,192,328 B1* 2/2001 Kahlertetal .	 ................. 703/2
6,820,245 B2 * 11/2004 Beattie et al .	 .................. 716A
7,228,259 B2 * 6/2007 Freund	 .......................... 703/2
7,307,492 B2 * 12/2007 Tripathi et al ............... 333/111
7,353,157 B2 * 4/2008 Wasynczuk et al.	 ........... 703/14
2003/0177458 Al* 9/2003 Beattie et al .	 .................. 716A
OTHER PUBLICATIONS
Jain et al. "Fast simulation ofVLSI interconnects'; IEEE/ACM Inter-
national Conference; publication date: Nov. 7-11, 2004; pp. 93-98).*
Jain et al., "Fast Simulation of VLSI Interconnects," IEEE/ACM
International Conference on Computer Aided Design, Nov. 7, 2004,
pp. 93-98, San Jose, California.
Chen et al., "INDUCTWISE: Inductance-Wise Interconnect Simu-
lator and Extractor," IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 22, No. 7, Jul. 2003, pp.
884-894.
He et al., "SPIE: Sparse Partial Inductance Extraction," Proceedings
of the 34th Design Automation Conference, Jun. 1997, pp. 137-140,
Anaheim, California.
(Continued)
Primary Examiner Naum B Levin
(74) Attorney, Agent, or Firm William F. Bahret
(57)	 ABSTRACT
A method of simulating operation of a VLSI interconnect
structure having capacitive and inductive coupling between
nodes thereof. A matrix X and a matrix Y containing different
combinations of passive circuit element values for the inter-
connect structure are obtained where the element values for
each matrix include inductance L and inverse capacitance P.
An adjacency matrix A associated with the interconnect
structure is obtained. Numerical integration is used to solve
first and second equations, each including as a factor the
product of the inverse matrix X i and at least one other matrix,
with first equation including X'Y, X 1A, and X 1P, and the
second equation including X 1 A and X1P.
3 Claims, 17 Drawing Sheets
1
0.9 -
0.8
07
A0.6' ..
a
0:5
ro
v
0A
U;3
0:1
0—
10,'
--i- threshald1
-f- threshold	 01-
-^ threshald	 001	 1
-tt- threshold- 0001
I
10 1..	 ttftr 10 12•	 10"11:.	 1014
Time Step
https://ntrs.nasa.gov/search.jsp?R=20100040686 2019-08-30T13:33:50+00:00Z
US 7,774,725 B1
Page 2
OTHER PUBLICATIONS
Zhong et al., "An Adaptive Window-Based Susceptance Extraction
and its Efficient Implementation," Proceedings of the 40th Design
Automation Conference, Jun. 2003, pp. 728-731, Anaheim, Califor-
nia.
Gala, K. et al., "Inductance 101: Analysis and Design Issues," Proc.
Design Automation Conf., Jun. 18-21, 2001, Las Vegas, Nevada, pp.
329-334.
Beattie, M. et al., "Modeling Magnetic Coupling for On-Chip Inter-
connect," Proc. Design Automation Conf., Jun. 18-21, 2001, Las
Vegas, Nevada, pp. 335-340.
Chen, T. H. et al., "INDUCTWISE: Inductance-Wise Interconnect
Simulator and Extractor," Proc. Int. Conf. on Computer Aided
Design, Nov. 10-14, 2002, San Jose, California, pp. 215-220.
Chen, T.-H., et al.,"HiSIM: Hierarchical Interconnect-Centric Cir-
cuit Simulator," Proc. Int. Conf. on Computer Aided Design, Nov.
7-11, 2004, San Jose, California, pp. 489-496.
Devgan, A. et al., "How to Efficiently Capture On-Chip Inductance
Effects: Introducing a New Circuit Element K," Proc. Int. Conf. on
Computer Aided Design, Nov. 5-9, 2000, San Jose, California, pp.
150-155.
Ji, H. et al., "SPICE Compatible Circuit Models for Partial Reluc-
tance K," Proc. Asia South Pacific Design Automation Conf. Jan.
27-30, 2004, Yokohama, Japan, pp. 787-792.
Lin, T. et al., "On The Efficacy of Simplified 2D On-Chip Inductance
Models," Proc. Design Automation Conf, Jun. 10-14, 2002, New
Orleans, Louisiana, pp. 757-762.
Pacelli, A, "A Local Circuit Topology for Inductive Parasitics," Proc.
Int. Conf. on Computer Aided Design, Nov. 10-14, 2002, San Jose,
California,pp. 208-214.
Saad, Y et al., "GMRES: A Generalized Minimal Residual Algorithm
for Solving Nonsymmetric Linear Systems," SIAM Journal on Sci-
entific Computing, vol. 7, No. 3, Jul. 1986, pp. 856-869.
Zhao, M. et al., "Hierarchical Analysis of Power Distribution Net-
works," IEEE Trans. on Computer-Aided Design oflntegrated Cir-
cuits and Systems, vol. 21, No. 2, Feb. 2002, pp. 159-168.
Zheng, H. et al., "Window-Based Susceptance Models for Large-
Scale PLC Circuit Analyses," Proc. Design, Automation and Test in
Europe Conf., Mar. 4-8, 2002, Paris, France, pp. 628-633.
Zhong, G. et al., "On-chip Interconnect Modeling by Wire Duplica-
tion,"Proc. Int. Conf. on Computer Aided Design, Nov. 10-14, 2002,
San Jose, California, pp. 341-346.
Zhong, G., et al., "On-Chip Interconnect Modeling by Wire Dupli-
cation," IEEE Transactions on Computer-Aided Design oflntegrated
Circuits and Systems, vol. 22, No. 11, Nov. 2003, pp. 1521-1532.
Godfrin, E.M., "A Method to Compute the Inverse of an n-Block
Tridiagonal Quasi-Hermitian Matrix," Journal of Physics: Con-
densed Matter, vol. 3, No. 40, 1991, pp. 7843-7848.
Meurant, G., "A Review on the Inverse of Symmetric Tridiagonal and
Block Tridiagonal Matrices," SIAM Journal on Matrix Analysis and
Applications, vol. 13, No. 3, Jul. 1992, pp. 707-728.
Nabben, R., "Decay Rates of the Inverses of Nonsymmetric
Tridiagonal and Band Matrices," SIAM Journal on Matrix Analysis
and Applications, vol. 20, No. 3, May 1999, pp. 820-837.
Svizhenko, A., et al., "Two-Dimensional Quantum Mechanical Mod-
eling of Nanotransistors," Journal ofApplied Physics, vol. 91, No. 4,
Feb. 15, 2002, pp. 2343-2354.
Ruehli, A.E., "Inductance Calculation in a Complex Integrated Cir-
cuit Environment,"IBMJournal ofResearch and Development, Sep.
1972, pp. 470-481.
Nagel, L.W., "SPICE2: A Computer Program to Simulate Semicon-
ductor Circuits," Technical report, U.C. Berkeley, ERL Memo ERL-
M520, May 9, 1975, 431 pages.
Bevilacqua, R., et al., "Parallel Solution of Block Tridiagonal Linear
Systems," Linear Algebra and its Applications, vol. 104, Jun. 1998,
pp. 39-57.
Bowden, K., "A Direct Solution to the Block Tridiagonal Matrix
Inversion Problem," International Journal of General Systems, vol.
15, No. 3, Aug. 1989, pp. 185-198.
Concus P., et al., "On Computing INV Block Preconditionings for the
Conjugate Gradient Method," BIT Numerical Mathematics, vol. 26,
No. 4, Dec. 1986, pp. 493-504.
Chen et al., "INDUCTWISE: Inductance-Wise Interconnect Simu-
lator and Extractor," IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 22, No. 7, Jul. 2003, pp.
884-894.
* cited by examiner
U.S. Patent	 Aug. 10, 2010	 Sheet 1 of 17	 US 7,774,725 B1
FIG. I
DN
^'	 w
r	
0
a)	 CO	 r	 u0	 LO	 a	 n	 N	 o
A sjedS eb'eJGny.
U.S. Patent	 Aug. 10, 2010	 Sheet 2 of 17	 US 7,774,725 B1
U.S. Patent	 Aug. 10, 2010	 Sheet 3 of 17	 US 7,774,725 B1
f
f
k
C
I	 ^	 I	 I
L
L
kL
r	 F	 1	 i	 B	 I	 I	 I	 f
r
0
I^
r
CL
U7
F-	 M
w
t
0
r	 q	 CO	 FA	 (D	 to	 mot'	 m	 r4	 r	 Cy r
^Jisj' ads 96E!jaAV
z
a0
x
00
U.S. Patent	 Aug. 10, 2010	 Sheet 4 of 17	 US 7,774,725 B1
50
150
0
	
500	 1000
	
1500
Column Number
FIG. 4(a)
U.S. Patent Aug. 10, 2010	 Sheet 5 of 17	 US 7,774,725 B1
500	 1000
Column Number
FIG. 4(b)
0. 1500
0-
8 i
O O
iF 4
c
h Q.
4 e
Q	 P
t	 -0 O	 O
O ^' 1 O
4 ^"	 O 4	 O
^P n 3
O ^ P O
5®Q
0
6 O ^ v
E	 4 ! 4^6 4-
S O
z^
1000
1500
z
0
x
1
1
15001000500
U.S. Patent	 Aug. 10, 2010	 Sheet 6 of 17	 US 7,774,725 B1
Column Number
FIG. 4(c)
U)
C6
n	 0
J
	 r
X
} WUU
X n
W U)	
co
Lq
	
r
	
Lq
	 0	 LO
r
	 0	 0
Lq
0
<	 ObullOA
Lq
N
a^
J
N
UQ
N	 ^
n
i
w
Q)
E
L^
r
r
U.S. Patent	 Aug. 10, 2010	 Sheet 7 of 17	 US 7,774,725 B1
U.S. Patent	 Aug. 10, 2010	 Sheet 8 of 17	 US 7,774,725 BI
LO
cli
0
CV)
Lq
N
LU
U)
LL.
x
LU Z C[-
-.^^w^`.^wa^^-^"°^..•—ter---	
^^--^..-•^•^	
^^, Y^
CD
A
E
Lq
Ln
0
C)
Iq	 ^	 CQ	 r	 Op	 to	 It	 N	 (D	 C\j	 It
C3	 C3	 6	 C)	 6	 6
< ----- abellOA
U.S. Patent	 Aug. 10, 2010	 Sheet 9 of 17	 US 7,774,725 B1
15
RLP
- - G KC
INDUCWISE
10
5
0'
10-2	 0 -3 	10 -4 	105
Tlueshold
FIG. 7
U.S. Patent Aug. 10, 2010	 Sheet 10 of 17	 US 7,774,725 B1
V
10-2 10-3	 10-4
Threshold
FIG. 8
10-5
15
10
W
5
U.S. Patent	 Aug. 10, 2010	 Sheet 11 of 17	 US 7,774,725 BI
Column Number
FIG. 9
a.^
O
O
. a< o 0 o a
e e V v
• t t V O 0 i i
V J
V V J	
t 0
D o s s
w M1 o
[ s
! r
0 o J#
c q
4 0
g o 0 i
e e V 4
O O i i
V V V V
i i a a
O O 0 > >te'. Y
i 
w++
M1 Yr r
+ V V J
a b b b b a
< < O O O ^
. a t
+ + r t
e e +
! a a Y
t s c o o a o i
. ^ ^
J#
e a e r+
w w w My y
w 
t 0 0 O O O i
y t t t A w Y
V J # # #	 M1
t O i 0 0 0 0
q 9 q
V V J 
J t t t V
# w w w
V V V J
• a C 0 0 0 0
t !
v y y J
! O O 0
J t e e
Column Number
FIG. 10
U.S. Patent	 Aug. 10, 2010	 Sheet 12 of 17	 US 7,774,725 B1
Column Number
N
0
FIG. I 1
U.S. Patent	 Aug. 10, 2010	 Sheet 13 of 17	 US 7,774,725 B1
LO	 "It	 Co	 CVo	 0')
xapul Al!saL-dS
000r
00
rn
00(p
00
000
m N
CO
o	 w
0
LO
0
0
0
0
co
0
0
CV
CD
0
00	 I-	 (.O	 LO-
r-	 r-	 r-	 r-
U.S. Patent	 Aug. 10, 2010	 Sheet 14 of 17	 US 7,774,725 BI
rr
W _I
C)
x
U) LU U)
flF
CO
CJ
LO
co
LO
N
C\j
A
E
LO
LSD
CD
(D
 LO	 'IT	 co	 C\j	 T	 C)
<----- a6e110A
e'
c^
O
C7
w
N
•^ O
4J
^ O
N
U.S. Patent	 Aug. 10, 2010	 Sheet 15 of 17	 US 7,774,725 B1
U.S. Patent	 Aug. 10, 2010	 Sheet 16 of 17	 US 7,774,725 B1
co 0
T
X
W
V!
^ T
z
V
w
co
N
T
O
W	 1 `	 W	 V!
uoildwnsuoo Aaowan jo op-H
O	
N
LO 
	
O	 m	 O	 LO
co	
uoi}dwnsuoo Aaowen jo of}ea
U.S. Patent	 Aug. 10, 2010	 Sheet 17 of 17	 US 7,774,725 B1
LO	 O u,
rn LO "T
.................
ii	 a	 ii
O
T
x
1
W
LO
Z
HH
w
co
N
T
O
O
US 7,774,725 B1
1
COMPUTATIONALLY EFFICIENT
MODELING AND SIMULATION OF LARGE
SCALE SYSTEMS
CROSS-REFERENCE TO RELATED
APPLICATIONS
This application claims the benefit of U.S. Provisional
Patent Application Ser. No. 60/733,460, filed Nov. 4, 2005,
and U.S. Provisional Patent Application Ser. No. 60/740,990,
filed Nov. 30, 2005, which applications are hereby incorpo-
rated by reference along with all references cited therein.
GOVERNMENT RIGHTS
This invention was made with government support under
Contract/Grant No. NCC 2-1363 awarded by the National
Aeronautics and Space Administration (NASA), under Con-
tract/Grant Nos. CCR-9984553 and CCR-0203362 awarded
by the National Science Foundation, and under Contract/
Grant No. USAF-FA8650-04-D-2409 awarded by the United
States Air Force Research Laboratories. The government has
certain rights in the invention.
TECHNICAL FIELD OF THE INVENTION
The present invention relates generally to electrical circuit
modeling and simulation techniques and, more particularly,
to methods for simulating interconnect effects in very large
scale integrated circuits.
AT
5	 C-^ At 0 ^
,C-
^0 L^ z
-
^it^
A? 1,b=	 ^, =AT R-'A,, and C =ATCAS.
0
10
R is the resistance matrix. The matrices 9, L and C are the
conductance, inductance and capacitance matrices respec-
tively, with corresponding adjacency matrices A g, Ai and A..
IS is the current source vector with adjacency matrix A i, and v„
15 and i t are the node voltages and inductor currents respectively.
With n denoting the number of inductors, we note that
L, C, R E R""", C, g E R2­2,
A standard algorithm for the numerical integration of dif-
ferential equations such as (1) is the trapezoidal method.
20 Consider a uniform discretization of the time axis with reso-
lution h. Then, using the notation Xk=x(kh), and the approxi-
mations
25	
+1 _XkTX(t) 
-kh ^	 h	 and Xk 
z^` +1 2+zk
30 over the interval [kh, (k+l )h], we may solve for xk+l in terms
of xk by solving the equation
2
where
C C +1	 C C	 bk+1 + bk	 (2)
BACKGROUND OF THE INVENTION	 35	 2 + h i _ J2 n ^^ + 2
With aggressive technology scaling, the accurate and effi-
cient modeling and simulation of interconnect effects has
become (and continues to be) a problem of central impor- 40
tance. In a three-dimensional interconnect structure there can
be significant amounts of coupling, both inductive and
capacitive, between interconnects. Models that capture these
effects tend to involve large matrices, resulting in extraordi-
nary demands on memory. Simulation with these models 45
require prohibitive amounts of computation.
While all coupling effects in theory extend without bound,
it is well-recognized that, in practice, the effects of capacitive
coupling, and to some extent that of inductive coupling, can
be assumed to be local without much sacrifice in accuracy. 50
Practical modeling and simulation techniques exploit this
localization to significantly reduce storage and computa-
tional costs. For practical interconnect structures, the capaci-
tance matrix C and the inverse of the inductance matrix
K—L-1 turn out to be (approximately) sparse. A number of 55
techniques exploit the sparsity in K at extraction level.
Exploiting sparsity of C and K in simulation however, is much
less straightforward. The main problem is that simulation
requires terms that not only involve the sparsified matrices C 60
and K, but also inverses of terms that involve them; these
inverses are dense in general.
The Modified Nodal Analysis (MNA) of interconnect
structures such as the one shown in FIG.1 yields equations of
the form	 65
Gx+Cz=b,	 (1)
A direct implementation of this algorithm requires O(n3+pn2)
operations, where p is the number of time steps. The direct
implementation ignores the structure of the matrices G and C
that is evident in (1); explicitly recognizing this structure
yields the so-called Nodal Analysis (NA) equations, used in
INDUCTWISE:
I ^ + C+ ZSw +1 = 1 _G + ^C- ZS), k, -2Aa is +A?(ls +l +ls). (3)
U	 V
and
2A tzi tk+l_2A t7i tk+hS(v„k+l +v„k),	 (4)
where S—A,KAiT (recall that K—L-1 , L being the inductance
matrix, with corresponding adjacency matrix A i, and AiT
being the transpose of Ai).
The NA equations (3) and (4) enjoy several advantages
over the NINA equations (1). The first advantage is that the
solution of equations (1), a problem of size 3n, has been
divided into two sub-problems of sizes 2n and 2n, which
yields computational savings with polynomial-time algo-
rithms. Next, it has been observed that with typical VLSI
interconnect structures, the matrices K, C and G exhibit spar-
sity. This can be used at the extraction stage to write down (3)
and (4) with fewer parameters. Finally, at the simulation
stage, the structure of the matrix U defined in (3) symmetry,
US 7,774,725 B1
3
positive-definiteness and sparsity lends itself to the use of
fast and sound numerical techniques such as Cholesky fac-
torizations. These advantages have been extensively used to
develop INDUCTWISE. For future reference, we note that
the computation with INDUCTWISE is 0(n 3 +pn2) opera-
tions, and is usually dominated by O(pn2).
SUMMARY OF THE INVENTION
The approach that is employed is to sparsity the various
matrices that underlie the model of interconnects; the result-
ing approximate models can be represented by far fewer
parameters, leading to savings in storage.
The present invention presents methods that systematically
take advantage of sparsity in C and K, in simulation, achiev-
ing a very significant reduction in computation with very little
sacrifice in simulation accuracy. The first idea underlying our
approach is that if the sparsity in the inverse of a dense matrix
is known, the (sparse) inverse can be computed very effi-
ciently. We take advantage of this fact by writing the simula-
tion equations in terms of L and P —C-1 . The most computa-
tionally intensive step in simulation, of system formulated in
such a fashion, reduces to that of matrix-vector multiplication
involving a sparse matrix. We also show that savings with
sparse-matrix-vector multiplication can be obtained with
simulation using K=L -1 and C, as well, but to a lesser extent.
The RLP formulation is extended to include non-linear
devices, without sacrificing the computational benefits
achieved due to sparsity of the linear system. It should be
noted that the A matrix involved in the solution of the linear
system is constant throughout the simulation. In contrast, the
A matrix involved in solving the non-linear system changes in
each simulation step. However, the  matrix is sparse. Due to
the sparse and time varying nature of the problem at hand
Krylov subspace based iterative methods could be used for
efficient simulation. Our second contribution is to introduce a
novel preconditioner constructed based on the sparsity struc-
ture of the non-linear system. The inverse of the precondi-
tioner has a compact representation in the form of the Had-
amard product, which facilitates not only the fast
computation of the inverse, but also the fast dense matrix-
vector product.
The objects and advantages of the present invention will be
more apparent upon reading the following detailed descrip-
tion in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 a distributed model of a typical three dimensional
VLSI interconnect structure that may be operated in simula-
tion with the methods of the present invention.
FIG. 2 shows the average sparsity index of the matrices
U-1V U-1AZ T U- 'A,'and U- 'S, for a structure as a function
of h for various values of E.
FIG. 3 shows average sparsity index of the matrices X-1Y,
X-1A and X-1AP, for a structure as a function of h for the
sparsity threshold of e-0.001, as compared with the average
sparsity index of the matrices encountered in the GKC-algo-
rithm.
FIG. 4(a) shows the significant entries (shown darker) of
the adjacency matrix A for a structure with 1500 conductors.
FIGS. 4(b) and 4(c) show the significant entries (shown
darker) of W-1 and X-1 , respectively, for the structure in FIG.
4(a).
FIG. 5 shows the voltage wave forms, obtained from
SPICE and Exact-RLP, of the active line and the seventh line
of a 100-conductor circuit.
4
FIG. 6 shows the voltage wave forms, obtained through
INDUCTWISE, Exact-RLP, RLP, and GKC, of the active line
and the seventh line of a 600-conductor circuit.
FIGS. 7 and 8 show plots of the RMSE for the active and
5 the seventh line as a function of threshold value for a 600-
conductor circuit.
FIG. 9 shows the sparsity structure (nonzero entries shown
darker) of the A matrix for an exemplary circuit of parallel
wires driving a bank of inverters.
10	 FIG. 10 shows an exemplary preconditioner matrix that
may be used with the exemplary circuit of FIG. 9.
FIG. 11 shows the sparsity pattern (nonzero entries shown
darker) of matrix A of a circuit having only non-linear devices
and no interconnects.
15	 FIG. 12 shows average sparsity versus circuit size.
FIG. 13 shows the voltage wave form obtained through
SPICE and Exact-RLP and SASIMI.
FIG. 14 shows a two dimensional non-uniform spatial grid
for a Nanotransistor.
20 FIG. 15 shows the ratio of memory consumption of the
algorithm in [9] as compared to ours for varying division sizes
(N,JD).
FIG. 16 shows the ratio of memory consumption of the
algorithm in [9] as compared to ours for a varying number of
25 divisions (D).
DETAILED DESCRIPTION OF THE PREFERRED
EMBODIMENTS
30 For the purpose of promoting an understanding of the
principles of the invention, reference will now be made to the
embodiments illustrated in the drawings and specific lan-
guage will be used to describe the same. It will nevertheless
be understood that no limitation of the scope of the invention
35 is thereby intended, such alterations and further modifica-
tions in the illustrated device and such further applications of
the principles of the invention as illustrated therein being
contemplated as would normally occur to one skilled in the art
to which the invention relates.
40 While significant storage and computational advantages
accrue with INDUCTWISE, we note that the sparsity of U has
not been fully taken advantage of at the level of linear algebra
(beyond the possible use of sparse Cholesky factorizations) in
the numerical solution of (3). In particular, with the formula-
45 tion used by INDUCTWISE, while the matrix U is sparse, its
inverse is dense. Thus, trapezoidal numerical integration, at a
first glance, entails matrix-vector multiplies with a dense
matrix at each time step. However, it has been observed that
the matrices U-1V (where V is defined in (3)), U-'A,
50 U-1A ZT and U-1 S are approximately sparse, and this informa-
tion can be used to significantly reduce the computation as
follows. Rewrite (3) and (4) as
v k+i_U iVv„k-2U'A,Tiik+UiAz^jk+i+jz^
55	 77k 1_	 7k	 k 1 k2L^ Ai ii ii + 2L^ Ai ii +hL^ S(v„ + +v„ ).
Pre-compute and store the sparsified matrices U-1V,
U-1AZT U-1AZ T and U-1 S. Then, every time step in the trap-
ezoidal integration scheme requires only sparse matrix-vec-
60 for multiplies. We will henceforth refer to this technique as
the GKC-algorithm (as the computations are done with the
conductance, inverse of the inductance and the capacitance as
the parameters).
In order to quantify the computational savings obtained
65 with the GKC-algorithm over INDUCTWISE, we define the
"sparsity index" µ E(A) of a matrix A as ratio of the number of
entries ofA with absolute value less than e to the total number
35
X33 X35 X38
X53 X55 X58
X83 X85 X88
40
More generally, suppose that there are a t nonzero entries
in the ith row of X-t . By following a procedure as above, the
ith row of X-t can be computed by inverting an U lxa, matrix.
Thus, the overall computation in determining X-t is 0(E,ai)
45 It is typical with VLSI interconnects that a is a small con-
stant. Thus if X-t is exactly sparse, with a known sparsity
pattern, it can be computed in O(n) from X. Table 1 gives the
time taken for inversion for different circuit sizes.
50	 TABLE 1
Inversion time in matlab (in seconds)
No. of conductors	 500	 1000	 2000	 5000
Direct Inversion	 .29	 2.18	 16.87	 260.68
55	 Fast Inversion	 .79	 1.48	 2.93	 10.17
Thus, there remains the problem of determining the spar-
sity pattern in X-t . Recall that
60
L R h	 L R	 h
X= h + 2 + 4 APAT . Let W= h + 2 and Z=4APAT.
65 Then
X-'-w '-w '(w '+Z ')-' ff -'	 (7)
US 7,774,725 B1
5
of entries. Then, the computation required for each iteration
withthe GKC-algorithm, with some appropriate value of e, is
0((1-v)n2) where v is the minimum of the sparsity indices of
the matrices U-'V, U- tA T, U-tA' T and U- 'S. The value of v
can be expected to depend on the threshold for detecting
sparsity e, as well as the time step size h. FIG. 2 shows the
average sparsity index of the matrices U- ' V, U-tA, U- 'A T
and U- ' S, for a structure with three parallel planes consisting
of 600 conductors, as a function of h for various values of E.
The typical value of h used solving the MNA equations for
VLSI interconnects is 0.1 picoseconds. With such values of h
and e-0.001, it can be seen that v-0.8. Thus the total compu-
tation time with the GKC-algorithm is approximately a fifth
of that required by INDUCTWISE.
We now explore an alternative formulation of the MNA
equations that uses the resistance, inductance and the inverse
of the capacitance matrix. For typical interconnect structures,
shown in FIG. 1, we can manipulate the MNA equations (2) to
obtain
6
algorithm is dominated by 0((1-y)n2), where is the y is the
minimum of the sparsity indices the matrices X - 'Y, X- 'A and
X-'AP.
FIG. 3 shows the average sparsity index of the matrices
5 X- 'Y, X-A and X- 'AP, for a structure with three parallel
planes consisting of 600 conductors, as a function of h for the
sparsity threshold of e-0.001, as compared with the average
sparsity index of the matrices encountered in the GKC-algo-
rithm. It is clear that the matrices encountered in the RLP-
10 algorithm exhibit much higher sparsity over a wide range of
time-steps. In particular, for h-0.1 ps, it can be seen that
y-0.9. Thus the total computation time with the above RLP-
algorithm is approximately one-tenth of that required by the
RLP formulation that does not use sparsity information.
15 When compared to the GKC-algorithm and INDUCTWISE
which use twice as many state variables, the amount of com-
putation required by the RLP-algorithm is approximately
one-eighth and one-fiftieth respectively.
We now provide the details on the fast inversion of X.
20 Assume for simplicity that the sparsity pattern inX-isknown,
deferring for later the problem of detecting this sparsity pat-
tern. Then, manipulations of only a subset of the entries of the
X matrix (rather than the entire X matrix) can be used to
compute the inverse matrix. To briefly illustrate the idea con-
25 sider the example when X E R t t t t and the 5 th row of X-t
has the following form:
^h + + 4APAT)ia +l =	 (5)
X
4 — —.
^h— —4APAT)ia+Av +4AP(l+t+I,k),
and
X00*0*00*001,
30 where * denotes the nonzero entries. Then, it can be shown
V +1 = v - h PAT (ik+' +,k ) + hP([k+' + [s), 	 (6)	 that these nonzero entries can be computed exactly2 	2	 p	 Y from the
second row of the inverse of the following 3x3 matrix
obtained from X:
where P is the inverse capacitance matrix i.e P-C -t and  is
the adjacency matrix of the circuit, obtained by first addingAg
and Ai and then removing zero columns (these correspond to
intermediate nodes, representing the connection of a resis-
tance to an inductance). When compared with the NA equa-
tions (3) and (4), we see that the number of state variables has
been halved. Compared to INDUCTWISE, this represents
immediate savings. For future reference, we will term the
technique of directly solving (5) and (6) as the "Exact-RLP"
algorithm.
In contrast with the GKC-algorithm, it turns out here that X
is dense, but with an inverse that is approximately sparse.
Thus, windowing techniques such as those employed by
INDUCTWISE during the extraction stage to obtain a spar-
sified matrix K can be employed here to quickly compute a
sparsified X-t . (Windowing techniques details will be
described below.) Moreover, the matrices X - 'Y, X- 'A and
X- 'AP turn out to be approximately sparse. Thus, paralleling
the development of the GKC-algorithm, we have the follow-
ing RLP-algorithm:
Rewrite (5) and (6) as
is+i = X —i Y^ +X_ 1 A vk, + 4 X- i AP(Ik+i + Ik),
X—i A vk,+i = X —i Avk, — 2 X—i APAT (ik+i +,k  + 2 X—i AM Ik+i + Ik )
Pre-compute and store the sparsified matrices X - 'Y, X-'A
and X- 'AP. Again, every time-step in the trapezoidal integra-
tion scheme requires only sparse matrix-vector multiplies. As
with the GKC-algorithm, the total computation with the RLP-
US 7,774,725 B1
7
For the values of R, L, C and h under consideration, it turns
out that
X- 1gly- 1 -w- 1Zw 1 .	 (8)
Thus, the significant entries of X-1 can be obtained by super-
posing the significant entries of W-1 and the significant
entries of W-1 ZW-1 . The sparsity pattern of W-1 can be
efficiently determined using the techniques available in the
literature. Turning next to
W -1 Z4V-1 = h4W-IAPArW -1
note that the significant entries of W-1A are obtained by
distributing the significant entries of W-1 into locations deter-
mined by the adjacency matrix A. In summary, we have the
following heuristic for predicting the sparsity pattern in X-1:
First determine the significant entries of W-1 by determining
the set of segments that are inductively couple with a given
segment. In addition, spread the nonzero entries of W-1 to
locations suggested by the adjacency matrix to find the
remaining significant entries.
These ideas are illustrated via a three dimensional inter-
connect structure of three parallel planes with 1500 conduc-
tors. In FIG. 4(a), the significant entries of the adjacency
matrix A are shown to be darker. FIGS. 4(b) and 4(c) show the
entries of W-1 and X-1 respectively, again withthe significant
entries shown darker.
We emphasize that the actual computation of the signifi-
cant entries of X-1 proceeds via the technique in, where given
the knowledge of the sparsity pattern resident in X -1 , the
actual entries can be directly and efficiently computed. Thus,
(7) and (8) are not used for computation, but only to motivate
the heuristic for efficiently determining the sparsity pattern of
X-1.
We implemented the INDUCTWISE, RLP and GKC algo-
rithms in MATLAB on a PC with an Intel Pentium IV 2.4 GHz
processor. In order to quantify the simulation accuracy with
various methods, we used as the benchmark the Exact-RLP
simulation (recall that this is the direct simulation of equa-
tions (5) and (6)). (While SPICE simulations would have been
more natural to use as the benchmark, we found that the
computation time grew quickly to make them impractical; for
a modest-size circuit comprising 100 parallel conductors,
SPICE simulation took 350 seconds as compared to 1.08
seconds with the Exact-RLP algorithm, with no detectable
simulation error, as shown in the FIG. 5).
Simulations were done on a three dimensional structure of
three parallel planes, with each plane consisting of busses
with parallel conductors, with wire-lengths of 1 mm, and a
cross section of 1 ltmx I µm. The wire separation was taken to
be 1 µm; each wire was divided into ten segments. A periodic
1 V square wave with rise and fall times of 6 ps each was
8
applied to the first signal on the lowest plane, with a time
period of 240 ps. All the other lines were assumed to be quiet.
For each wire, the drive resistance was 10 Q the load capaci-
tance was 20 fF. A time step of 0.15 ps was taken and the
5 simulation was performed over 330 ps (or 2200 time steps).
As expected, with all methods, there is an inherent trade-
off between simulation accuracy and cost (CPU time and
memory). We first present results comparing the accuracy in
simulating the voltage waveforms at the far end of the first
to (active) and the seventh (victim or quiet) lines. The metric for
comparing the simulations is the relative mean square error
(RMSE) defined as
15	 Y, (v ; - vs)'
V 
20 
where v and v denote the waveforms obtained from Exact-
RLP and the algorithm under consideration respectively. A
threshold value of 0.001 was chosen for sparsification of RLP
and GKC algorithms, as well as for sparsification of L -1 in
25 INDUCTWISE. Table 2 presents a summary of the resultsfrom the study of simulation accuracy.
TABLE 2
RMSE comparison
30
Active Line	 7th line
Size INDUCTWISE RLP GKC INDUCTWISE RLP GKC
300 .0013 .0010 .0017 .1622 .1266 .1960
35	 600 .0014 .0011 .0014 .4381 .3452 .4651
900 .0006 .0007 .0008 .3222 .3076 .4078
1200 .0004 .0004 .0004 .2382 .2656 .2992
1500 .0003 .0003 .0004 .2021 .2200 .2336
40 It can be seen that the simulation accuracy of the RLP and
the GKC algorithms are comparable to that of INDUCT-
WISE, with a marginally inferior performance as measured
by the RMSE. A plot of the voltage waveforms at the far end
of the active line and the 7th line, obtained by INDUCT-
45 WISE, RLP, and GKC algorithms, is shown in the FIG. 6.
We briefly explore the influence the choice of the threshold
for determining sparsity. A higher threshold can be expected
to decrease the computational and memory requirements,
howeverwith loss in simulation accuracy. FIGS. 7 and 8 show
50 plots of the RMSE for the active and seventh line as a function
of threshold value, again for a circuit of size 600 conductors.
Any value of the threshold below 0.001 appears to be a
reasonable choice.
We now turn to a comparison of the computational and
memory requirements between INDUCTWISE, RLP and
GKC algorithms. Table 3 summarizes the findings.
Table 3
Run time and memory comparisons
Time (in sec)	 Memory (in MB)
Size Exact-RLP INDUCTWISE RLP GKC EXACT-RLP INDUCTWISE RLP GKC
300	 14.30	 74.34	 4.09	 18.99	 2.95	 11.61	 1.02	 6.61
600	 76.21	 422.00	 16.28 77.32	 11.61	 46.20	 2.36 15.38
US 7,774,725 B1
9	 10
Table 3-continued
Run time and memory comparisons
Time (in sec)	 Memory (in MB)
Size Exact-RLP INDUCTWISE RLP GKC EXACT-RLP INDUCTWISE RLP GKC
900	 244.14	 1133.40	 33.21 162.08	 26.03	 103.84	 4.09 31.68
1200	 513.08	 3051.10	 60.53 312.93	 46.20	 184.56	 6.16 52.22
1500	 827.50	 4682.00	 92.16 813.00	 72.14	 288.24	 7.60 86.00
It can be seen that for a circuit consisting of 1200 conduc- 	 iteration index k for simplicity. Equation (11) is a nonlinear
tors, RLP is about nine times faster than the Exact-RLP, and 	 equation of the form g(x)O. Let g(x)-G(x) be a linear
fifty times faster than INDUCTWISE. The GKC algorithm is 15 approximation of g(x), linearized around some x°x o . Then,
about twice as fast as the Exact-RLP, and ten times faster than 	 simultaneously solving L(x)=0 and G(x)-0 yields numerical
INDUCTWISE. The Exact-RLP is about six times as fast as	 values for  and hence v,. These values are then used to obtain
INDUCTWISE. With larger circuit sizes, the advantage of 	 a new linear approximation g(x)-G„eW(x), and the process is
RLP over INDUCTWISE continues to grow, while the Exact-	 repeated until convergence. A good choice of the point xo for
RLP and GKC algorithms have an advantage over INDUCT- 20 the initial linearization at the k th time-step is given by the
WISE that grows slightly. An explanation for the slower per- 	 value of v„ from the previous time-step.formance of INDUCTWISE compared to Exact-RLP is that
the number of variables withthelatteralgorithmis one-half as
	
	 p gn i ) operations, where p is the number of time steps, q
di3) implementation of this algorithm requires
O(that with the former. The same trends are observed with 	
memory requirements. 	 25 is the maximum number of Newton-Raphson iterations in
VLSI interconnect structures with non linear devices can 	 each time step, and n i 3N.
also be analyzed using the Modified Nodal Analysis (MNA) 	 We begin by decomposing C, A, and A, as:
formulation, yielding equations of the form
	
Gx+Cz=b,	 (9) 30	 C^^ C^„	 AT	 AT
_	 _	 T_C- 11
where	 C- C_ A'
	 AAz	 '	 A z
G AT	 C 0	 v
C- ^-Aa 0 ^ C= ^0 L^ 	
35
r 0
	 VC l
AT 1,
 + l„ a	 Ana = L 
J vn - L Jb =	 9 =A T R_'A g ,  and C = ATCA c .	 I,	 vv
0
40 
Here C,,,, denotes the sub-matrix of the capacitance matrix
R denotes the resistance matrix. The matrices ^, L and C are 	 that changes amid the simulation, while all other sub-matrices
the conductance, inductance and capacitance matrices	 remain constant. The matrix C,,,, captures the drain, gate and
respectively, with corresponding adjacency matrices Ag, AZ	 bulk capacitances of all devices, which are voltage-depen-
and A,. IS is the current source vector with adjacency matrix 	 dent, while C, and C am„ are the capacitance matrices that ariseA, and v„ and i, are the node voltages and inductor currents 45 from interconnects and are hence constant.
respectively.
Vector, I„, captures the effect of non-linear loads and 	 For typical interconnect structures, the above decomposi-
depends on the node voltages as I„ i=f(v„). f is a function	 tion allows us to manipulate the MNA equations (10) and
which varies depending on the load characteristics and in	 (11):
general can be a non-linear function. 	 50
Utilizing the trapezoidal method to numerically solve (9)
requires the solution of a set of linear and non-linear equa-	 I h +R	 + 4 A, 	Pc^A, ^tI = I h - R - 4 A, Pcc A, ^ta + A, vk +	 (12)
tions:	 -	 X - - — -	 -- - - — Y
55	 4AlP^^A (^s+i+Is)-AlP^^C^,^(vk^i-vv)+ ^2(vy+i+vv),
C C	 C C	 bk+1 + bk	 (10)
v^-ZP^^Az (ia+i+ia)+ ZP_A (^s+i+^s)-P^^Cw(vv^i-vv),
60
and	 C,,vk+i = C,,,, vk 
- ZAT(ik+i +ia)+	 (14)
I k+l—gunk+l)	 (11)Z	 J^
ZA zUs' i + ^s) - Cv^(v^+i _ vk ) + 2(^°+i + w),
The nonlinearity in the above set of equations can be handled
by the standard Newton-Raphson technique of linearizing 65	 [v+l = f (Uv+l) 	 (15)
(11) and iterating until convergence: Equation (10) is a linear
equation of the form L(x)O, where we have omitted the
US 7,774,725 B1
11
Here r denotes the size of interconnect structure connected
directly to non linear circuit, and given 1=N-r we note that
C ERi" i C ER"
P,,-C,,- i is the inverse capacitance matrix, and A is the
adjacency matrix of the circuit. A is obtained by first adding
Ag andA, and then removing zero columns (these correspond
to intermediate nodes, representing the connection of a resis-
tance to an inductance).
Thus far, the analysis is similar to that of the linear ele-
ments structures described above, with the major difference
being the addition of (14) and (15), which account for the
nonlinear elements. We will show here how the linear tech-
niques can be extended to handle the case when nonlinear
elements are present.
For future reference, we will call the technique of directly
solving (12), (13), (14), and (15) as the "Exact-RLP" algo-
rithm. It can be shown that the computational complexity of
the Exact-RLP algorithm is 0(1 3 +pq(12 +r3)). For large VLSI
interconnect structures we have 1»r, reducing the complex-
ity to 0(13 +pq(1 2)) .
We now turn to the fast solution of equations (12) through
(15). Recall that the nonlinear equation (15) is handled via the
Newton-Raphson technique. This requires, at each time step,
linearizing (15) and substituting it into (14). The resulting set
of linear equations have very specific structure:
Equations (12) and (13) are of the form Ax=b where A is
fixed (does not change with the time-step). Moreover,
A-1 is typically approximately sparse.
Equation (14) (after the substitution of the linearized (15))
is again of the form Ax=b, where the matrix A is
obtained by adding C,,,, and the coefficient of the first-
order terms in the linearized equation (15). Recall that
the matrix C,,,, captures the drain, gate and bulk capaci-
tances of all devices. It also contains the interconnect
coupling capacitances between gates and drains of dif-
ferent non-linear devices in the circuit. As each non-
linear device is connected to only a few nodes and the
capacitive effects of interconnects are localized, the A
matrix is observed to be sparse in practice. Note that A
changes with each Newton-Raphson iteration and with
the time-step.
Thus the key computational problem is the solution of a
sparse time-varying set of linear equations, coupled with a
large fixed system of linear equations Ax=b with A-i being
sparse.
Krylov subspace methods have been shown to work
extremely well for sparse time-varying linear equations. Spe-
cifically, the GMRES (Generalized Minimum Residual)
method of Saad and Schultz allows the efficient solution of a
sparse, possibly non-symmetric, linear system to within a
pre-specified tolerance. This method performs a directional
search along the orthogonal Amoldi vectors which span the
Krylov subspace of A. That is, given an initial guess x0 and
corresponding residual ro=b-Axo , orthogonal vectors {ql,
q
z 
.... qm 1 are generated with the property that they span Sm,
the solution search space at iteration m.
& = xp + span{rp, Arp, ... , A'rpl	 (16)
= x0 +K(A , r0, m)
c span{q i , q2 ... , qm1.
12
These vectors are chosen according to the Arnoldi itera-
tion: AQm=Qm+ , Hm where Q_-{q,, q2 .... qm I is orthogonal
and Hm E Rm+i "m is an upper Heisenberg matrix.
For these methods the choice of a preconditioner matrix M,
5 which is an approximation of A, can greatly affect the con-
vergence. A good preconditioner should have the following
two properties:
M-IA=I.
to	 It must accommodate a fast solution to an equation of the
form Mz-c for a general c.
FIG. 9 depicts the sparsity structure of the A matrix for a
circuit example of parallel wires driving a bank of inverters.
For such a sparsity structure, an appropriate choice of the
15 preconditioner could be of the form as shown in FIG. 10.
Although we have chosen a circuit with only inverters for
simplicity, a more complicated circuit structure would simply
distribute the entries around the diagonal and off-diagonal
20 bands and lead to possibly more off diagonal bands. To see
this, consider an extreme case where the circuit under con-
sideration has only non-linear devices and does not comprise
of interconnects. In this case the sparsity pattern of the A
matrix is as shown in FIG. 11. Therefore, the chosen precon-
25 ditioner would encompass not only the sparsity structure
shown in FIG. 9 but also other sparsity patterns that might
arise with the analysis of more complicated non-linear
devices. Correspondingly the structure of the preconditioner
(see FIG. 10) would have additional bands.
30 Matrices of the form shown in FIG. 10 have the following
two properties which make them an ideal choice for precon-
ditioner.
The inverses of the preconditioner matrix can be computed
efficiently in linear time, O(r) (r denotes the size of
35 interconnect structure directly connected to non-linear
devices), by exploiting the Hadamard product formula-
tion.
It can also be shown that this formulation facilitates the fast
matrix-vector products, again in linear time (O(r)),
4o	
which arise while solving linear systems of equations
with the preconditioner matrix.
A simple example which best illustrates these advantages
is a symmetric tridiagonal matrix.
45
at - bi (17)
-bi a2 -b2
e-
50
-bn 2 as-1 -b,-t
-ba-t as
The inverse of B can be represented compactly as a Hadamard
55 product of two matrices, which are defined as follows:
U Uj	 ... Uj Vj V2	 ... Vn (t g)
U 1 U2 	... U2 V2 V 2	 ... V„
60	 B-1 =
U 1 u2	.. u„ V„ V„	 .. V„
U V
65
There exists an explicit formula to compute the sequences
Jul, {v1 efficiently in O(n) operations. In this case, if we are
US 7,774,725 B1
13
interested in solving a linear system of equations By —c, we
only need to concern ourselves with the matrix-vector prod-
uct B-i c=y. This computation can also be performed effi-
ciently in O(n) computations as outlined below:
I	 1	 (19)
Pu. = Y, 	 P, = Y, 	 i = 1^ ... n
14
TABLE 4
Preconditioner comparison.
Size	 400	 800	 1600	 3200
Runtime	 .44	 .42	 .42	 .43
Iterations	 5110	 5110	 5110	 5110
'-,	
'-' We now turn to the solution of equations (12) and (13). As
Y1 = Ui Pv, , 10 mentioned earlier, these equations reduce to the form Ax=b
with a constant, approximately sparse A'. A (corresponding
...	 , Y; = v; P.;_, + ut Pv, ,	 i = 2 1 	n. to X in (12) is composed of L, R and P. Each of these matrices
has a sparse inverse for typical VLSI interconnects which
15 then leads to a approximately sparse A` (Note that this argu-
The above formulation for a tridiagonal matrix could be eas- ment is used for motivating the sparsity inherent in A` and
ily extended to handle the more general case when the pre- cannot be used as a theoretical proof for the same). In addition
conditioner matrix is a zero padded block tridiagonal matrix this sparsity has a regular pattern which can be explained on(matrix with zero diagonals inserted between the main diago- the basis of how inductance and capacitance matrices are
nal and the non-zero super-diagonal and sub-diagonal of 20 extracted. The distributed RLC effects of VLSI interconnects
tridiagonal matrix) as in FIG. 10. Elementary row and column can be modeled by dividing conductors into small subsets of
block permutations could be performed on such a matrix to segments, each of which are aligned. Each of these subsets
reduce it into a block tridiagonal matrix. This has been shown leads to a sparsity pattern (corresponding to a band in A`).
with a small example as below. 25 All the effects when summed up lead to a A -i matrix that has
a regular sparsity pattern. Window selection algorithm can
a,	 0	 —b,	 0	 (20) and (21) then be employed to find out the sparsity pattern in A - '. It has
0	 az	 0	 —bz been recognized in earlier work that this property (sparsity)
B = —b,	 0	 a3	 0 yields enormous computational savings; it has been shown
0	 —bz	 0	 a4 30 that an approximate implementation of the Exact-RLP algo-
a,	 —b,	 0	 0 rithm, referred to simply as the "RLP algorithm" provides an
—b,	 az	 0	 0 order-of-magnitude in computational savings with little sac-
=
0	 0
PT ,	 where rifice in simulation accuracy.a3 	—bz
0	 — 0— - b2	a4
X 35 To proceed, we rewrite (12) and (13) as
1000
P —
0	 0	 1	 0
Hence ik+t = X -1 Y^ + X —i A, v' + h X —i A, P,,A (^1 +1 + ID —	 (23)
0	 1	 0	 0 4
0	 0	 0	 1 40	 X—iAjp_CC,(Vv+i—Vv)+ 22 (Vv+1+Vk),
U,	 0	 U,	 0 v,	 0	 V3 	0 (22)
(	 )1B	 = PX P T = 0	 u2	 0	 u2 0	 Vz	 0	 V4 X—'Atvk+t = X ' A t vk — X—iA i 2PccAz(ia
+1
 +	 )+
U,	 0	 U3	 0
0	 0
V3	 0	 V3	 0
0	 0
h
45	 1A t Pcc A	 (^s+t+ls)—X—iAtPccCw(Vv+t—Vv).2Xuz	 u4 	v4 	v4
--------------------- ---------------------U	 V
Although X is a dense matrix, X -1 turns out to be an approxi-
We have not included block matrices for simplicity of pre- mately sparse matrix. Moreover the matrices X -1Y, X-1A„
sentation, however the zero padded block tridiagonal case is 50 X-1A,P_Aj , T, X-'A,P,,C,„ are also approximately sparse.
a natural extension of the above example. All the entries in This information can be used to reduce the computation sig-
U,V matrices have to be now replaced by blocks and accord- nificantly by noting that each step of trapezoidal integration
ingly the row and column permutations would be replaced by now requires only sparse vector multiplications. Solving
their block counterparts with an identity matrix of appropriate 55 sparse (23) and (24) along with (15) and (16) is termed as the
size replacing the `ones' in the P matrix. Table 4 gives the RLP algorithm (SASIMI). To analyze the computational sav-
comparison for Incomplete-LU preconditioner and the zero- ing of the approximate algorithm over the Exact-RLP algo-
padded (Z-Pad) preconditioner. Simulations were done on rithm, we denote "sparsity index" of a matrix A as ratio of the
circuits consisting of busses with parallel conductors driving number of entries of A with absolute value less than e to the
bank of inverters. `Size' denotes the number of non-linear 60 total number of entries. The computation required for each
devices. All the results are reported as a ratio of run-time and iteration of (23) and (24) is then 0((l _V)12), where v is the
iteration-count (number of iterations for the solution to con- minimum of the sparsity indices the matrices X-1Y, X-1A„
verge to within a tolerance of Ie-10) of Z-Pad to the Incom- X-1A,P_Aj,T, X-'A,P,,C,v. FIG. 12 provides the average
plete-LU preconditioner. As can be seen from Table 4, Z-Pad 65 sparsity for the matrices for a system with parallel conductors
offers a substantial improvement in run time as compared to driving a bank of inverters. The sizes inconsideration are 100,
the Incomplete-LU preconditioner. 200,500 and 1000.Ontop ofthis the computation time ofX-1
US 7,774,725 B1
15
can be reduced to O(1) by using the windowing techniques
(details in [16]). Hence the computational complexity of RLP
is 0(pq(1-v)1 2) as compared to O(pgn,') for the NINA
approach.
We implemented the Exact-RLP and RLP (SASIMI) algo-
rithms in C++. A commercially available version of SPICE
with significant speed-up over the public-domain SPICE has
been used for reporting all results with SPICE. Simulations
were done on circuits consisting of busses with parallel con-
ductors driving bank of inverters, with wires of length 1 mm,
cross section 1 µmx I µm, and with a wire separation of 1 µm.
A periodic 1V square wave with rise and fall times of 6 ps
each was applied to the first signal with a time period of 240
ps. All the other lines were assumed to be quiet. For each wire,
the drive resistance was 10 Q. A time step of 0.15 ps was taken
and the simulation was performed over 30 ps (or 200 time
steps). For the inverters the W/L ratio of NMOS and PMOS
16
algorithms, is shown in FIG. 13. (The number of conductors
in this simulation example is 200.) There is almost no detect-
able simulation error between the SASIMI, Exact-RLP and
SPICE waveforms over 200 time steps. To give a better pic-
5 ture, the accuracy results reported are for a larger simulation
time of 2200 time steps.
TABLE 5
10
RMSE comparison.
p=5	 p=20	 p=50
15	
100	 .0054	 .0053	 .0088
200	 .0078	 .0052	 .0071
500	 .0006	 .0022	 .0001
1000	 .0003	 .0005	 .0004
were taken to be 0.42 µm/0.25 µm and 1.26 µm/0.25 µm	 2000	 .0003	 .0004	 .0004
respectively.
In order to explore the effect of the number of non-linear
elements relative to the total, three cases were considered.
With p denoting the ratio of the number of linear elements to
that of non-linear elements, the experiments were performed
for p equaling 5, 20 and 50. The number of linear elements in
the following results is denoted by a.
We first present results comparing the accuracy in simulat-
ing the voltage waveforms at the far end of the first line (after
the inverter load). The RMSE is again use for comparing the
simulations, defined here as
Y, (v; - vs)'
Y, V 
We now turn to a comparison of the computational require-
25 
ments between Exact-RLP, SASIMI and SPICE. Table 6 sum-
marizes the findings. For a fair comparison, our total simula-
tion time is compared against the transient simulation time for
SPICE (i.e we have not included any of the error check or set
up time for SPICE). As can be seen from the table, SASIMI
30 outperforms the Exact-RLP algorithm and SPICE. For the
case of 500 conductors with p=50, the Exact-RLP algorithm
is 390 times as fast compared to SPICE. SASIMI is about
1400 times faster as compared to SPICE, and more than three
times faster than Exact-RLP. As can be seen, the computa-
35 tional savings increase as the ratio of linear to non-linear
elements is increased from 5 to 50. The savings also increase
with increase in the size of the problem considered. The
computational efficiency of the SASIMI can be explained on
40 the use of sparsity-aware algorithms for both the linear and
non-linear parts of the problem.
TABLE 6
Run time (in seconds) comparisons
p=20
Exact-RLP
.27
.64
13.47
79.07
526.23
p=5
a	 SPICE Exact-RLP SASIMI SPICE
100	 11.96	 1.34	 1.26	 13.73
200	 100.25	 3.28	 2.68	 68.72
500 3590.12	 17.13	 4.872 1919.21
1000 >12 hrs	 87.75	 22.71	 >10 hrs
2000 >1 day	 545.6	 78.06	 >1 day
55
p=50
SASIMI SPICE Exact-RLP SASIMI
	
.21	 13.54	 .15	 .12
	
.28	 67.68	 .55	 .22
	
3.01	 1790.67	 4.58	 1.30
	
16.49 >10 hrs	 77.56	 15.20
	
59.33	 >1 day	 408.54	 56.05
where v and v denote the waveforms obtained from Exact-
RLP and SASIMI respectively.
Table 5 presents a summary of the results from the study of
simulation accuracy. It can be seen that the simulation accu-
racy of the Exact-RLP algorithm is almost identical to that of
SPICE, while the SASIMI has a marginally inferior perfor-
mance as measured by the RMSE. The error values for
SASIMI are compared simply with the Exact-RLP as it had
the same accuracy as SPICE results for all the experiments
run. A plot of the voltage waveforms at the far end of the
active line, obtained from SPICE, Exact-RLP and SASIMI
The existing methods for finding the inverse of a block
tridiagonal matrix suffer from being either numerically
unstable or heavily memory intensive and hence impractical
60 for problems of very large size (e.g. 106x106).
Consider the two-dimensional model of a nano-scale tran-
sistor, shown in FIG. 14. The body of the transistor is pro-
jected onto a two-dimensional non-uniform spatial grid of
65 dimension NxxNy, where Nx (NY) denote the number of grid
points along the depth (length) of the device. A brief review of
the governing physics of this device is provided here.
zzh k
E 2mz - Hb(rl) Gb(rl, rz, kz,	 -
fdry (ri, rz, kz, E) Gb(ri, rz, kz, E) _ &I — rz)
b
(26)
(27)	 20
Ul Vl Ul V2 Ul V3	 ... Ul VNY
V2Ui Ul VZ U2 V3... U2 VNY
A- ' = V3 U V, UT U V3	 ... U VN
Y
25
VNY U1 VNY UZ VNY U3	 ... UNY VNY
2 2k
E 2mz - Hb(rl) G irl, rz, kz, E) -
fdr ,^
 (rl, rz, kz, E)Gb- (rl, rz, kz, E)
b
fdr^ (rl, rz, kz, E)Url, rz, kz, E).
b
US 7,774,725 B1
17
The Hamiltonian of a valley b for electrons, associated with
the device under consideration, is as follows:
hd l d	 d l d	 d l d	 (25) 5
Hb(r) 
_ - 2 ^dx^mb dx^+ dy^me dY^ + dz mb dZ)l +V(r),
18
The algorithm in Svizhenko et al. calculates the diagonal
blocks (D,) of A` in the following manner.
G I =A I _ , ,
G,+l-(As+l-BsG^Bs)-1, i =1, 2, ... , AY-1,
DNY-GNY,
D,=G,+G,B,D,+IB,G;, i=NY 1, NY-2,..., 1.	 (31)
where (in,', my', mzb) are the components of effective mass in to
valley b . The equation of motion for the retarded Green's 	 The time complexity of this algorithm was shown to be
function (G') and less-than Green's function (G') are:	 O(Nx3N..), with a memory requirement of O(Nx3NJ.
Alternatively, the inverse of a block tridiagonal matrix can
be computed explicitly by exploiting the Hadamard product
15 formulation. When 113,1 are non-singular, A is said to be
proper. In this case there exists two (non-unique) sequences
of matrices {U,}, {V } such that for j?i, (A-l )^ U,V^T.
Hence, A-1 can be written as
The U and V sequences can be efficiently computed in
Given Gr and G`, the density of states and the charge density 30 O(NYNx3) operations in the following manner:
can be written as a sum of contributions from the individual
valleys,	 ul=z, uzBl-lAl,
N (r, kz, E) _ Y, 	kz, E) _ -' Im[Gb(r, r, kz, E)l, 	 (28) U+l B; l(A,UB,_lTU-l), i-2, ... , Ny 1,
b
35	 VNY
7
-(ANYUNY BNY
7 UNY , ) -1 ,-1
P(r , kz, E) _ Y,Pb(r , kz, E) _ - i [Gb (r, r, kz, E)].	 (29)
b 7	 7	 -1VNY - 1 =VNY `4NYBNY I	 ,
The self-consistent solution of the Green's function is vT (v.+I TA,+l-v.+2 TBi+1 T), i-NY 2, ... , 2, 1.
often the most time intensive step in the simulation of electron 40
density. It was shown by Svizhenko et al. in "Two-dimen- I denotes identity matrix of appropriate size.
sional quantum mechanical modeling of nanotransistors," The matrices encountered in the simulation of nanotrans-
Journal ofApplied Physics, 91(4):2343-2354, 2002, hereby istors enjoy the property that the diagonal blocks {A,} are
incorpated by reference, that the approximate block tridiago- tridiagonal while the off-diagonal blocks {B,} are diagonal
nal structure of the left-hand side in (26) and (27) facilitates 45 (this is due to the tight-binding approximation of the Hamil-
the efficient calculation of the electron density. Specifically, it tonian). The complexity of the above parameterization is then
was demonstrated that only the diagonal entries of G' are reduced to O(N,,Nx2 +Nx3). It is worthwhile to note here that
needed to be computed for such a simulation. the complexity of algorithm in (31), which was given to be
Svizhenko et al. showed that the problem of computing O(NYNx3), does not change based upon these properties (al-
electron densities in a nanotransistor can be reduced to find- though the actual run time is reduced). Therefore, the reduced50
ing the diagonal blocks of G'', where AG''=I and A is a block complexity of the Hadamard formulation makes it an ideal
tridiagonal matrix of the form choice for the solution of this problem. However, it has been
shown that the above recursions can be numerically unstable
for large problem sizes. This will preclude it from being
directly implemented to solve these problems. Alternatively,
Al	 -Bl (30) 55 the. divide and conquer algorithm described below avoids
-B;	 Az	 -Bz these numerical issues by only handling manageable problem
A = sizes at each step.
-BNY-z	 ANY l	 -BNY l In the most simple case, the block tri diagonal matrix, A canbe decomposed into two sub- matrices connected by what we
-BNy l	 ANY 60 will call a bridge matrix. This concept is illustrated below:
A=A+XY,
where each A,, B, E G
NxN. Thus AEG	 65	 A-^slN^ xN^ with NY diagonal blocks of size Nx each. We will 	 Sz
denote A compactly as A=trid(A,, B,).
US 7,774,725 B1
19
-continued
A l -B1
-Bi A, -B2
S1 =
-BT z A;-1
	-B;-1
_BT 1	 Ai
A ,1	 -B +1
-B ;
r
+1	 A;+2 -B;+2
S2 
=
-B
r
NY-2 A,,-,	 -BNy 1
BT
-BNY -1	 ANY
0 0
-B; 0 0 ...	 0	 1	 ...	 0
X
,Y
- (0 ...	 1	 0	 ...	 0)0-BT
0 0
Notice that the block tridiagonal structure is preserved in
each sub-problem, and they are joined by the NxxNx bridge
matrix B,. The structure of S, and Sz facilitates the use of
Hadamard based methods (32) for determining the the solu-
tion of each sub-problem. However, the question then
becomes how to determine the global solution from each
sub-problem solution and their corresponding .bridge. This
problem can be resolved by the use of a fundamental result of
linear algebra. The matrix inversion lemma describes the
effect of a low rank correction term on the inverse of a matrix,
and can be used in the following way:
A-1 = (A + XY)-1 	 (33)
=A 
1 
-(A 1 X)(I+YA 1 X) 1 (YA 1 ) where,
A
_1 
x	
(- CI B;)	 0
-
	
0	 (-C2BT)
	
-1 -1	
I	
-S21(1 1)BT t
(I+YA X) _
	
-Si 1 ( i , i)B;	 I
YA 
1	 0 (Cz)
- ( Ci) 0
Here, C,—S1-1(:,i) and C 2 —S2-1 (:,
 
1) denotes the last (first)
block columns of S 1 -1 (S2-1 ) respectively.
Although (33) can be used directly to solve the problem
described above, its lack of computational efficiency pre-
cludes its use in this case. However, it is important to note that
the adjustment term, (I+YA-1X)-1, depends only on the cor-
ner blocks of the inverses of the sub-problems. This observa-
tion provides the basis fora formulation of the solution for the
more general case. Specifically, the divide and conquer algo-
rithm of the present invention addresses the issue of how to
combine multiple sub -problem solutions in both a memory
and computationally efficient manner.
An overview of the divide and conquer algorithm is pro-
vided here.
The procedure begins with separating the block tridiagonal
matrix A into D subproblems, each joined by a bridge matrix.
The procedure presented previously motivates the formula-
tion of the global solution by combining the sub-problems in
20
a simple radix 2 fashion. However, this method offers no
improvement in terms of memory consumption and is com-
putationally intensive. Alternatively, matrix maps are created
to capture the effect of each combining step without perform-
5 ing all of the associated computation. Adjustments to the
matrix maps at each combining stage are not constant and
must be modified to parallel the procedure above. These maps
can then be used in the final stage to transform the subproblem
solutions into the global solution.
10 Each combining step in the divide and conquer algorithm
will have an associated entry in the "job" list pointing to the
first and last sub-problems along with the corresponding
bridge point. For example, (1-2,3-4) describes the action of
combining joined sub-problems I and 2 (S 1 _2) to joined sub-
15 problems 3 and 4 (S3_4) by use of the bridge matrix between
problems 2 and 3. The corresponding entry in the job list
would be of the form: [start, stop, bridge]=[1, 4, 2].
This formulation lends itself directly to parallel implemen-
tation, since computations associated with non-overlapping
20 combinations can be farmed out across multiple systems.
To model the effect of any combining stage in the job list,
it is necessary to know the corner block elements from the
inverse of each combined sub-problem. This process can be
25 easily illustrated by a simple example. Suppose once again
that the combination stage is (1-2 ,3-4), let Q 1=S 1 _2 and
Q2=S3-4• The corner block elements of Q, -1 and Q2-' can be
found using the parameterization given in (32).
30
Q-1 0
Q-1	 0 Q-
_ 1
[UlVl] ... [U1Vn^
35
... [UnVn^
0
40
where n=size(S J+size(S2), and m=size(S3)+size(S4).
45 It would be impractical to recalculate the inverse of each
joined problem for each combining stage. Alternatively,
matrix maps are used to efficiently produce the required block
entries.
Matrix maps are created to produce the cumulative effect of
50 each combining step associated with a particular sub-prob-
lem. There are a total of eight NxxNx matrix maps {M,1 for
each sub-problem. The process of updating matrix maps can
be broken down into two categories: Adjustments to Upper
sub-problem and those to Lower sub-problems, the distinc-
55 tion being their location with respect to the bridge point. This
procedure will be illustrated using the above example. The
"adjustment" matrix, AD7, for the combining step is defined
as follows:
60
Z1=-Bn1(I1VT , Z2=-B„+1UnV rn,	 (34)
P= (I -
(
Z2Z1)-1,
ADJ 
12
ADJ = I I ± 
PZ2Z2 P1 P - ADJ21 ADJ22
65
[U1Vj] ... [61 VT]
[UmVm]
US 7,774,725 B1
21 22
The matrix maps associated with the Upper sub-problems (S, associated with a NxxNx matrix multiplication (typically
and Sz) are updated in the following manner: a=2.7). The memory consumption is
M,+(II, V„7ADJ12B„+,)M3^M,
M2+01Vn 7ADJ12Bn+1) M4-M2 5	 O(NNY +N,D^.
D
(U, Vm7ADJ2213,)M3-M3
(U 1vm TADJ2213 ,)M4-M4 The divide and conquer algorithm, along with the algo-
10 rithm in Svizhenko et al. have been implemented, in Matlab,
Ms M3 7(ADJ 12B„+1) M3-M5 on a single 32-bitx86 Linux workstation. All results reported
are for test cases on a MIT well-tempered 25 nun device-like7M6 M3 (ADJ12B„+1)M4-M6
structure.
M, M47(ADJ12Bi+1)M3-M7 Table 7 shows the run-time comparison of the two algo-
15 rithms across the cases: N,-100, N,-3,000-80,000. Notice
M8-M4T(ADJ 12B +1)M4-Ms	 (35) that the algorithm given in Svizhenko et al. was unable to
perform past the case NY-1 1,000 due to memory restrictions.
Those associated with the Lower sub-problems (S3 and S4) However, the divide and conquer algorithm was able to
are updated in the following manner: handle these cases without encountering memory overflow.
TABLE 7
Run time (min) comparison
Size = Nx * NY (x106) 0.4	 0.5	 0.6	 0.7	 0.8	 0.9	 1.0	 1.1	 2.0	 4.0	 8.0
Divide and Conquer Algorithm 5.38	 6.66	 8.10	 9.41	 11.13	 12.25	 13.75	 14.99	 27.59	 58.86	 107.0
Algorithm in Svizhenko et al. 1.22	 1.51	 1.82	 2.12	 2.43	 2.72	 3.01	 3.33
30 FIG. 15 shows the ratio of memory consumption of the
(u,vn^ADJ„B 
.,^)M2^M2 algorithm in Svizhenko et al. as compared to the divide and
conquer algorithm for varying number of blocks per division
M3 +0vmu, TADJ21 B„+1 T)M1^M3 (sub-problem size). FIG. 16 shows the same ratio for varying
35 number of divisions.
M4+(VmII, 7	 7ADJ21 B„+1 )Mz-Ma While the invention has been illustrated and described in
Ms M1T(ADJ21B„+1T)M1-M5 detail in the drawings and foregoing description, the same is
to be considered as illustrative and not restrictive in character,
M6 M, 7(ADJ21 B„+ , 7)M2-M6 it being understood that only the preferred embodiment has
40 been shown and described and that all changes and modifi-
M7-M2 7(ADJ21B+,7)M1-M7 cations that come within the spirit of the invention are desired
to be protected.M8-M27(ADJ21 B„+1 T) M2-M8 (36)
The above procedure for modifying the matrix maps is
repeated for each entry in the job list. In the final stage the
matrix maps are used to generate the diagonal blocks of A-'.
It is important to note that this scheme fits nicely into a
parallel framework due to the fact that systems handling
Upper or Lower sub-problems would only have to trade the
limited amount of information in (34) to modify the matrix
maps they are governing.
Pseudo-Code
1. For each sub-problem in (1, 2, ... , D} :
Determine corner blocks from inverse of sub-problem
Associate bridge matrix (excluding problem D)
Initialize matrix maps
2. Generate list of sub-problem combinations radix 2
3. Adjust mappings for each combining step
4. Compute diagonal elements for each division
5. Apply matrix maps to transform sub-problem solutions to the global
solution
The time complexity of the divide and conquer algorithm is
O(N,'Ny+Nx3 Dlog2D) where a is defined to be the order
45
What is claimed is:
1.A method of simulating operation of a VLSI interconnect
structure having capacitive and inductive coupling between
50 nodes thereof, comprising:
obtaining a matrix X and a matrix Y containing different
combinations of passive circuit element values for said
interconnect structure, said element values for each
matrix including inductance L and inverse capacitance
55	 P.
obtaining an adjacency matrix A associated with said inter-
connect structure;
using a processor to perform numerical integration to solve
60 first and second equations each including as a factor the
product of the inverse matrix X-1 and at least one other
matrix, said first equation including X -1 Y, X-1A, and
X-1P, and said second equation including X-1A and
65	 X-1P
2. The method of claim 1, wherein matrices X andY each
contain resistance values R in addition to L and P.
US 7,774,725 B1
23	 24
3. The method of claim 2, wherein said first equation is 	 wherein said second equation is substantially of the form
substantially of the form
is+i = X-1 Y^, + X-1 Avk, + h X- i AP(ls+i +I,), 	 h	 h4 X -i A 3 +i = X-i A 3 - Z X- ' APAT (ia +i + ia) + Z X-' AP(ls+i + Is).
where v„ and i i are node voltages and inductor currents,
respectively, A is an adjacency matrix for the circuit, and I S is
a current source vector, and
