Parallel Processing for VLSI CAD Applications a Tutorial by Banerjee, Prith
September 1993 UILU-ENG-93-2240
CRHC-93-21
Center for Reliable and High-Performance Computing
PARALLEL PROCESSING 
FOR VLSI CAD APPLICATIONS 
A TUTORIAL
Prith Banerjee
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED
REPORT DOCUMENTATION PAGE
1a. Report Security Classification: Unclassified
1b. Restrictive Markings: None
3. Distribution/Availability of Report: Approved for public release; distribution unlimited
4. Performing Organization Report Number(s): UILU-ENG-93-2240, CRHC-93-21
6a. Name of Performing Organization: Coordinated Science Lab, University of Illinois
6b. Office Symbol (if applicable): N/A
6c. Address (City, State, and ZIP Code): 1101 W. Springfield Avenue, Urbana, IL 61801
7a. Name of Monitoring Organization: Semiconductor Research Corporation
7b. Address (City, State, and ZIP Code): Research Triangle Park, NC 27709
8a. Name of Funding/Sponsoring Organization: 7a
8c. Address (City, State, and ZIP Code): 7b
11. Title (Include Security Classification): Parallel Processing for VLSI CAD Applications: A Tutorial
12. Personal Author(s): Prithviraj Banerjee
13a. Type of Report: Technical
15. Page Count: 105
18. Subject Terms: VLSI CAD, parallel computing, parallel algorithms for layout, parallel algorithms for synthesis/test, parallel algorithms for simulation
19. Abstract:
Objectives of this tutorial are to:
1. Expose CAD developers and users to the rapidly growing field of parallel processing.
2. Explore various options for CAD application users in an integrated parallel CAD environment.
3. Expose parallel processing researchers to a rich application area.
20. Distribution/Availability of Abstract: Unclassified/Unlimited
21. Abstract Security Classification: Unclassified
Parallel CAD Tutorial
PARALLEL PROCESSING FOR VLSI CAD APPLICATIONS
A TUTORIAL
Prith Banerjee
Center for Reliable and High-Performance Computing
Coordinated Science Laboratory,
University of Illinois at Urbana-Champaign,
1308 W. Main St.
Urbana, IL 61801
banerjee@crhc.uiuc.edu
(217)-333-6564
Fax: (217)-244-5685
Objectives of Tutorial
• Expose CAD developers and users to the rapidly 
growing field of parallel processing
• Explore various options for CAD application users 
in an integrated parallel CAD environment
• Expose parallel processing researchers to a rich application area
Program Overview
8:30-10:00 Parallel Computing and VLSI CAD
10:00-10:30 Coffee Break
10:30-12:00 Parallel Algorithms for Layout
12:00-1:30 Lunch Break
1:30-3:00 Parallel Algorithms for Simulation
3:00-3:30 Coffee Break
3:30-5:00 Parallel Algorithms for Synthesis/Test
SESSION 1
AN OVERVIEW OF PARALLEL COMPUTING AND VLSI CAD
Outline
• Motivation: Why Parallel CAD?
• Parallel Computing Options: Special-purpose, General-purpose
• Overview of Parallel Architectures and Programming
Parallel Processing for VLSI CAD: Why?
• Existing algorithms for VLSI CAD running on uniprocessors are inadequate for future requirements
• Parallel processing offers 10-30X more performance than the best uniprocessors
- Example: Parallel cell placement algorithm on hypercube gives 14 times speedup
• Parallel processing sometimes gives better quality results
- Example: Parallel test generation gets higher fault coverage
• Parallel processing solves larger problem sizes
- Example: Parallel circuit extractor handles larger circuits
• Parallel processing has become affordable and available in recent years
VLSI CAD Overview
[Figure not recoverable]
CAD Computing Requirements
• Logic Simulation
- A 10 million gate system requires 1,000,000 gate evaluations per clock cycle assuming 10% activity
- A 10 MIPS workstation can simulate 1 second of system operation (1 million clock cycles) in 4 months
• Fault Simulation
- Fault simulation complexity is higher by another factor of N over logic simulation
• Cell Placement
- A good simulated annealing placement program (TimberWolf 6.0) needs 1 hour to place a 3000 cell circuit on a 10 MIPS workstation
- Would need one week to perform the task for a circuit of 100,000 cells
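The logic simulation figures can be checked with a short back-of-the-envelope calculation. The sketch below takes 4 months as roughly $10^7$ seconds; the resulting figure of about 100 instructions per gate evaluation is inferred here from the numbers on the slide and is not stated in the tutorial.

```latex
\begin{align*}
\text{gate evaluations per cycle} &= 10^{7}\ \text{gates} \times 0.10 = 10^{6} \\
\text{gate evaluations per simulated second} &= 10^{6} \times 10^{6}\ \text{cycles} = 10^{12} \\
\text{instructions executed in 4 months} &\approx 10^{7}\ \tfrac{\text{instr}}{\text{s}} \times 10^{7}\ \text{s} = 10^{14} \\
\text{implied cost per gate evaluation} &\approx \frac{10^{14}}{10^{12}} = 100\ \text{instructions}
\end{align*}
```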
Parallel Processing for CAD
• Parallel processing is the only answer for the tremendous future computing requirements
• A large set of computational problems, including many in CAD, are inherently parallel in nature
• Rapidly growing parallel processing technology has recently become commercially available
• It is relatively easy to build hardware systems with hundreds or thousands of processors that can theoretically achieve GFLOPS performance
• Hard problem: design of efficient parallel algorithms
• The best sequential algorithm may not be the best parallel one
Parallel processing options
• Network of workstations (lowest cost)
• Multiprocessor workstations ($60,000)
- DEC Firefly, Apollo DN 10000, Solbourne, Xerox Dragon, SUN SPARCstation 20
• Shared memory multiprocessors ($200,000-400,000)
— Sequent Symmetry, Encore Multimax
• Distributed memory multiprocessors ($200,000- 
400,000)
— Intel iPSC hypercube, NCUBE
• Supercomputers ($5,000,000)
- Connection Machine CM-5, Intel Paragon, Kendall Square KSR-1




Hardware Accelerator Basics
• Remove extraneous software processes
• Customize architecture for a specific task
• Match intra-processor and inter-processor communication to the algorithm
• Highest performance gain
• Technology and algorithms may change while the machine is being built
• May be appropriate for logic simulation, where algorithms are robust and unchanged
• For other applications, more general-purpose accelerators are appropriate, e.g. MARS project (AT&T)




• Logic and Fault Simulation
- Zycad LE series, Daisy Megalogician, AT&T MARS, Silicon Solutions MACH 1000, Fujitsu SP, NEC HAL, IBM YSE/LSM
• Routing
- Distributed Array Processor (ICL), Wire Routing Machine (IBM)
• Design Rule Checking
- Cytocomputer, Silicon Solutions FAST MASK

• Single Instruction Single Data (SISD)
- Conventional computer
• Single Instruction Multiple Data (SIMD)
- A central controller broadcasts same instructions to multiple processors executing on different data
• Multiple Instruction Multiple Data (MIMD)
- Processors independently execute different instructions on different data
- Shared Memory MIMD Multiprocessors
- Distributed Memory MIMD Multicomputers
Basics of Parallel Programming
• SIMD Programming
- Assume each processor executes same instructions
- Processors can sit out some instructions
• Shared Memory MIMD Programming
- Assume each processor sees globally shared memory
- Processors communicate through shared variables, locks, barriers
• Message Passing MIMD Programming
- Assume each processor has separate address space
- Processors communicate by sending messages
Data vs Functional Parallelism
• Data parallelism
- Create multiple identical processes and assign a portion of data to each
• Functional parallelism
- Create multiple processes and perform different operations on different portions of data
Data/Functional Parallel Examples
• Data parallel: C program loops where each iteration of a loop is independent, represents a simple statement, and is executed on a different processor.
for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
• Functional parallel: Multiple C program loops which cannot be parallelized individually, but the different code blocks are independent and are executed on different processors.
for (i = 1; i < 10; i++) /* block 1; starts at 1 so that b[i-1] stays in bounds */
    b[i-1] = b[i] + c[i];
for (j = 1; j < 5; j++) /* block n */
    a[j-1] = a[j] + d[j];
Coarse grain vs Fine Grain
• Grain size categorizes the amount of compute work done by independent subtasks in parallel
• Fine grain implies tens of instructions, e.g. statements in programs
- Each loop iteration of C program is executed on a different processor.
for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
• Coarse grain implies thousands of instructions, e.g. functions or procedure calls in programs
- Each group of loop iterations of C program, representing complex sets of statements containing function calls, is executed on a different processor
for (i = 0; i < 1000; i++) {




Parallelization and Data Dependence
Consider the following loop of a C program.
for (i = 0; i < 1000; i++)
    a[i] = b[i] + c[i];
If one unfolds the loop, the statements would be executed as follows.
a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
a[2] = b[2] + c[2];
...
a[999] = b[999] + c[999];
Each of these iterations can be executed in parallel.
Data Dependence
No Dependence: Run in parallel
Consider the following sequence of statements. S1 and S2 share no variables.
S1: X = K + 3;
S2: Y = Z * 5;
True Dependence:
Consider the following sequence of statements. S2 reads the value of X that S1 writes.
S1: X = 3;
S2: Y = X * 4;
Anti-Dependence:
Consider the following sequence of statements. S2 overwrites X after S1 has read it.
S1: Y = X * 4;
S2: X = 3;
Output Dependence:
Consider the following sequence of statements. S1 and S2 both write X.
S1: X = Y * 4;
S2: X = 3;
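One way to see why dependences matter is to run the same loop in two different iteration orders, which is what parallel execution may effectively do. The following is a minimal sketch, not taken from the tutorial: the function names `add_arrays` and `running_sum` and the `reverse` flag are hypothetical. The first loop has no dependence between iterations, so any order gives the same result; the second carries a true dependence from one iteration to the next, so reordering changes the answer.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: a[i] depends only on b[i] and c[i], so the
   iterations may run in any order (or in parallel). */
void add_arrays(int *a, const int *b, const int *c, size_t n, int reverse)
{
    for (size_t k = 0; k < n; k++) {
        size_t i = reverse ? n - 1 - k : k;
        a[i] = b[i] + c[i];
    }
}

/* Loop-carried true dependence: iteration i reads a[i-1], which
   iteration i-1 wrote. The result depends on the iteration order. */
void running_sum(int *a, size_t n, int reverse)
{
    assert(n >= 1);
    for (size_t k = 1; k < n; k++) {
        size_t i = reverse ? n - k : k;
        a[i] = a[i] + a[i - 1];
    }
}
```

With an input of four ones, `running_sum` in forward order produces 1, 2, 3, 4, while the reversed order produces 1, 2, 2, 2, so this loop cannot simply be split across processors.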
Basics of Parallel Algorithms
Denote $T$ = time for the best serial algorithm
$T_p$ = time for parallel algorithm using $p$ processors
Speedup $S_p = T / T_p$
Efficiency or Utilization $E_p = S_p / p$
If a parallel algorithm is 100% efficient, then one observes linear speedups
Efficiency is practically never 100% due to
• Cost of synchronization or communication across parallel processors;
• Suboptimal load balance among parallel processors.
Example Timecharts
[Figure: timecharts for cases (a) through (e)]
• Case (a) represents serial time 100 units
• Case (b) represents perfect parallelization, time 25 units, speedup 4
• Case (c) represents perfect load balance but synch cost 10, hence speedup = 100/35 = 2.86
• Case (d) represents no synch but load imbalance, speedup = 100/40 = 2.5
• Case (e) represents load imbalance and synch cost, speedup = 100/50 = 2
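The timechart numbers can be reproduced with two small helper functions. This is a sketch using the standard definitions of speedup (serial time divided by parallel time) and efficiency (speedup divided by processor count); the function names are hypothetical, and the processor count of 4 is inferred from case (b), where perfect parallelization turns 100 time units into 25.

```c
#include <assert.h>

/* Speedup: serial time divided by parallel time. */
double speedup(double t_serial, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Efficiency (utilization): speedup divided by the number of processors. */
double efficiency(double t_serial, double t_parallel, int p)
{
    assert(p > 0);
    return speedup(t_serial, t_parallel) / p;
}
```

For case (e), `efficiency(100.0, 50.0, 4)` evaluates to 0.5, that is, half of the machine is effectively idle.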
Load Balancing and Scheduling
• Prescheduling
• Static blockwise scheduling
• Static interleaved scheduling
• Dynamic scheduling
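The two static schemes differ only in how loop iterations are mapped to processors. The sketch below illustrates one possible mapping for each; the function names are hypothetical, and the blockwise bounds use the same `id * n / nprocs` arithmetic that appears in the statically scheduled shared memory example in this tutorial.

```c
#include <assert.h>

/* Static blockwise scheduling: processor id gets the contiguous
   iteration range [*lb, *ub). */
void block_bounds(int id, int nprocs, int n, int *lb, int *ub)
{
    assert(nprocs > 0 && id >= 0 && id < nprocs);
    *lb = id * n / nprocs;
    *ub = (id + 1) * n / nprocs;
}

/* Static interleaved scheduling: iteration i is handled by
   processor i mod nprocs. */
int interleaved_owner(int i, int nprocs)
{
    assert(nprocs > 0 && i >= 0);
    return i % nprocs;
}
```

Blockwise assignment keeps neighboring iterations together, while interleaving tends to balance the load better when the work per iteration grows or shrinks steadily with the index. Dynamic scheduling instead hands out iterations at run time through a shared counter.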












Example Shared MIMD
• Sequent Symmetry Multiprocessor
Shared MIMD Programming
Execute a subprogram in parallel
Set number of child processes
Return process identification number
Lock a lock
Unlock a lock
Check in at barrier
Increment shared global counter
Example Shared MIMD Program
Given sequential program
for (i = 0; i < 100; i++) {
    a[i] = b[i] * c[i];
}
Static scheduled program
nprocs = get_num_procs();
id = m_fork(nprocs);
lb = id * 100 / nprocs;
ub = (id + 1) * 100 / nprocs;
for (i = lb; i < ub; i++) {
    a[i] = b[i] * c[i];
}
Dynamic scheduled program
nprocs = get_num_procs();
/* shared counter for next iteration; set once before the processes fork */
global_i = 0;
id = m_fork(nprocs);
i = 0;
while (i < 100) {
    m_lock();
    i = global_i;
    global_i = global_i + 1;
    m_unlock();
    if (i < 100) /* indices 0..99 only */
        a[i] = b[i] * c[i];
}









Send a message and wait for completion 
Send a message, do not wait 
Receive message, wait for completion 
Receive message, do not wait 
Obtain node id of calling process 
Obtain process id of calling process 
Obtain number of nodes in cube
Example Dist. MIMD Program
Given original sequential program
int a[100], b[101], c[100]; /* b needs 101 elements because b[i+1] is read */
main()
{
    for (i = 0; i < 100; i++)
        a[i] = b[i+1] + b[i] * c[i];
}
Distributed memory program
int a[25], b[26], c[25];
main()
{
    id = mynode();
    /* the left neighbor needs our b[0] as its b[25] */
    if (id > 0)
        csend(BTYPE, &b[0], 4, id-1, 0);
    if (id < numnodes() - 1)
        crecv(BTYPE, &b[25], 4);
    for (i = 0; i < 25; i++)
        a[i] = b[i+1] + b[i] * c[i];
}
SIMD Architectures
Example SIMD Architecture
• Thinking Machines Connection Machine CM-2
SIMD Programming
shape: logically configure parallel data structure
with: select a current shape
where: select active processors
pcoord(): provide regular communication in grid
reduce(): perform reduction on parallel data
spread(): spreads result into parallel variable
Example SIMD Program
Given sequential program
for (i = 0; i < 100; i++) {
    if (c[i] != 0) {
        a[i] = b[i] / c[i];
    }
}
SIMD version of program
shape [100] s; /* one position per element of the 100-element arrays */
float:s a, b, c;
with (s)
    where (c != 0)
        a = b / c;














    printf("Enter the number of intervals\n");
    scanf("%d", &n);
    computepi(); /* call procedure to compute pi */
    printf("Value of pi is %f\n", pi);
}
computepi()
{
    h = 1.0 / n;
    sum = 0.0;
    for (i = 0; i < n; i++) {
        x = h * (i + 0.5); /* midpoint of interval i */
        sum = sum + 4.0 / (1 + x * x);
    }
    pi = h * sum;
}
Shared Memory MIMD Program
shared int n, global_i; 





    printf("Enter the number of intervals\n");
    scanf("%d", &n);
    printf("Enter the number of processors\n");
    scanf("%d", &nprocs);
    global_i = 0; /* initialize global index */
    pi = 0; /* initialize pi */
    h = 1.0 / n; /* initialize h */
    m_set_procs(nprocs); /* create nprocs parallel threads */
    m_fork(computepi); /* compute pi in parallel */
    m_sync(); /* wait for all threads to complete */





    float sum, localpi, x;
    sum = 0.0;
    i = 0;
    while (i < n) {
        m_lock();
        i = global_i;
        global_i = global_i + 1;
        m_unlock();
        if (i < n) { /* skip indices past the last interval */
            x = h * (i + 0.5); /* midpoint of interval i */
            sum = sum + 4.0 / (1 + x * x);
        }
    }
    localpi = h * sum;
    m_lock();
    pi = pi + localpi;
    m_unlock();
}









nnodes = numnodes();




if (id == 0)




id = mynode();
if (id == 0) {
    printf("Enter number of intervals\n");
    scanf("%d", &n);
    csend(NTYPE, &n, 4, -1, 0);
}
else {








id = mynode();
nnodes = numnodes();
h = 1.0 / n;
sum = 0.0;
/* each processor computes bounds of pi computations */
lb = id * n / nnodes;
ub = (id + 1) * n / nnodes;
for (i = lb; i < ub; i++) {
    x = h * (i + 0.5); /* midpoint of interval i */
    sum = sum + 4.0 / (1 + x * x);
}






id = mynode();
nnodes = numnodes();
if (id != 0) {




/* processor 0 receives local pi data from
   each processor and adds them */
pi = localpi;
for (p = 1; p < nnodes; p++) {
    crecv(PITYPE, &tmp, 4);












double:s height, h, x, hsum;
int:s mpoint;
main()
{
    h = 1.0 / n;
    /* an intrinsic function providing
       the self address of each processor */
    mpoint = pcoord(0);
    x = h * (mpoint + 0.5); /* midpoint of interval mpoint */
    height = 4.0 / (1.0 + x * x);
    /* syntax: reduce(dest, src, dim, combine, dest_index) */
    reduce(&hsum, height, 0, ADD, 0);
    pi = [0]h * [0]hsum;
    printf("%f\n", pi);
}
Summary
• We have reviewed the basic forms of parallel processing
• Discussed Shared MIMD, Distributed MIMD and SIMD parallel architectures and programming
• Described a simple parallel program in each model
• In remaining sections, we will look at parallel algorithms for numerous CAD applications for each




SESSION 2
PARALLEL ALGORITHMS IN VLSI LAYOUT SYNTHESIS AND VERIFICATION
Outline
• Parallel Algorithms for Placement
• Parallel Algorithms for Routing
• Parallel Algorithms for Design Rule Checking
• Parallel Algorithms for Circuit Extraction
VLSI Placement Problem
Standard Cell Layout
• Minimize cost function = wirelength + row overlap + cell overlap
Parallel Placement Approaches
• Iterative Pair-wise Exchange
- Simultaneously identify N/2 pairs of adjacent cells, exchange
- Large degree of parallelism, poor quality, interacting moves
• Simulated Annealing
- Best quality of results, long runtimes
- Can be parallelized, however interactions of moves
• Simulated Evolution
- Evaluate cells for goodness, select bad cells for replacement, allocate displaced cells
- Comparable quality to annealing, very parallel
Parallel Placement Approaches (Contd)
• Genetic Algorithms
- Maintain population of cell configurations, crossover and mutate solutions
- Inherently parallel, large memory and runtime requirements, average quality
• Hierarchical Decomposition
- Partition chip into 2 x 2 quadrisection
- Recursively solve 2 x 2 quadrisection problems
- Pay attention to boundary values between subsolutions through recursive X-direction and Y-direction decompositions
- Good speedups, but top decomposition takes




Start with an initial placement solution (state) S;
Set T = T0 (initial temperature);
REPEAT /* outer loop */
  REPEAT /* inner loop */
    Generate a move;
    Select move (cell exchange or cell displacement type);
    IF displacement THEN
      Select random cell;
      Move cell to random location;
    IF exchange THEN
      Select two random cells;
      Exchange cell locations;
    Move perturbs S to generate a new state, Sn;
    Evaluate cost change $\Delta E = E(S_n) - E(S)$;
    Decide to accept or reject move;
    IF ($\Delta E < 0$) THEN
      Accept move;
    ELSE IF random(0,1) < $e^{-\Delta E/T}$ THEN
      Accept move;
    ELSE Reject move;
    Update state if accept, i.e., replace S with Sn;
    Increment moves_attempted;
  UNTIL moves_attempted = max_moves; /* inner loop */
  Update T (control temperature);
UNTIL termination condition; /* end outer loop */









- Decompose move evaluation into subtasks; accelerate each move
- Static and dynamic approaches
• Parallel evaluation of moves, accept only one
- Same convergence property as serial case, low speedups
• Parallel evaluation of noninteracting moves
- Same convergence property as serial case, low speedups
• Parallel evaluation of interacting moves





Shared MIMD Parallel Annealing
Procedure SM-PARALLEL-SA-NONINTER-LOCKCELL;
  Start with an initial placement solution (state) S;
  Set T = T0 (initial temperature);
  REPEAT /* outer loop */
    REPEAT /* inner loop */
      FORALL processors asynchronously in PARALLEL DO
        Generate a move;
        Select move (cell exchange or cell displacement type);
        IF displacement THEN
          Select random unlocked cell;
          Lock cell and its neighbors;
          Move cell to random location;
        IF exchange THEN
          Select two random unlocked cells;
          Lock cells and their neighbors;
          Exchange cell locations;
        Evaluate cost change $\Delta E = E(S_n) - E(S)$;
        Decide to accept or reject move;
        IF ($\Delta E < 0$)
          THEN Accept move;
        ELSE IF random(0,1) < $e^{-\Delta E/T}$
          THEN Accept move;
        ELSE Reject move;




    UNTIL moves_attempted = max_moves; /* inner loop */
    Update T (control temperature);
  UNTIL termination condition; /* end outer loop */
End Procedure
Dist. MIMD Parallel Annealing
Procedure DM-PARALLEL-SA-INTERACTIVE-DOMAIN;
  Partition chip area by rows; processors own row partitions;
  Ownership of cells by processors owning row partitions;
  Generate initial placement S, set T = T0;
  REPEAT /* outer loop */
    REPEAT /* inner loop */
      FORALL processors in PARALLEL DO
        FOR each unique pairing of processors DO
          Identify processor pairs for various dimensions of hypercube;
          Each processor-pair generates a move cooperatively;
          Select move (cell exchange or displacement);
          IF intra-processor displacement THEN
            Select a random cell;
            Move to random location within processor region;
          IF inter-processor displacement THEN
            Select a random cell;
            Move to random location within processor pair region;
          IF intra-processor exchange THEN
            Select two random cells within processor;
            Exchange cell locations;
          IF inter-processor exchange THEN
            Select two random cells within processor pair;
            Exchange cell locations;
          Each processor pair accepts/rejects on own cost;




    UNTIL moves_attempted = max_moves; /* inner loop */
    Update T (control temperature);
  UNTIL termination condition; /* end outer loop */
End Procedure
Distributed Memory MIMD PSA
• Cost function = wirelength + row overlap + cell overlap
• Row partitioning eliminates errors in last two items
• Error in wirelength computation minimized by cell coloring
- If two cells connected, assign different colors
- At a time, move cells of same color, noninteracting
• Need not broadcast after every parallel move, increase speedup
• Maintain prob. acceptance within 5% of serial; when E > T/20, then update
• Results on Intel iPSC hypercube show speedups of 12 on 16 processors with good convergence
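The cell coloring idea can be illustrated with a simple greedy pass over the cells. The tutorial does not say which coloring algorithm was used, so the greedy method below is an assumption, as are the names `color_cells` and `MAX_CELLS` and the adjacency-matrix representation in which `adj[i][j]` is nonzero when cells i and j share a net.

```c
#include <assert.h>

#define MAX_CELLS 64 /* hypothetical upper bound for this sketch */

/* Greedy coloring: each cell takes the smallest color not used by an
   already-colored cell it is connected to. Writes one color per cell
   into color[] and returns the number of colors used. */
int color_cells(int n, unsigned char adj[][MAX_CELLS], int color[])
{
    int ncolors = 0;

    assert(n >= 0 && n <= MAX_CELLS);
    for (int i = 0; i < n; i++) {
        unsigned char used[MAX_CELLS] = {0};
        int c = 0;

        for (int j = 0; j < i; j++)
            if (adj[i][j])
                used[color[j]] = 1;
        while (used[c])
            c++;
        color[i] = c;
        if (c + 1 > ncolors)
            ncolors = c + 1;
    }
    return ncolors;
}
```

Cells with the same color share no net, so moving them in the same parallel step does not disturb each other's wirelength estimates.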
SIMD Parallel Simulated Annealing
• A single global description of the state is maintained across all the processors
• View connection as array of processors
• Comprises two data structures, one for nets and the other for cells
• The head processor in the sequence contains general information about the cell, while the second, third and fourth processors contain position information.
• After the fourth processor comes a processor for each terminal on the cell, which contains the position of the terminal with respect to the center of the cell
• A sequence of terminals is used to represent each net. The head processor contains general information about the net, while the processors that follow contain the absolute position of each terminal on the net, one terminal per processor.
SIMD Parallel Annealing (Contd)
• Parallel algorithm selects approximately half the head cell's processors
• These processors consider moving their cells to a new location by either an exchange or a displacement
• All processors simultaneously compute the change in the cost function associated with the current move, each move assuming that the other cells are fixed
• Since algorithm stores a single description of the state across the machine, it ensures that after each iteration the state description is correct.
• During each iteration, however, the simultaneous movement of cells introduces an error.
• Algorithm implemented on Connection Machine CM-1




• Novel optimization based on natural selection: evolve from one generation to next, fitter cells survive
• Evaluate all cells in current positions using goodness measure (ratio of precomputed best bounding box to current bounding box for each net)
• Select cells randomly for replacement biasing on goodness values
• Allocate cells in new locations
• Temporary errors still maintained convergence to within 2% of serial algorithm
Parallel Simulated Evolution
• Partition rows of cells among processors using 
two patterns
• Each processor performs evaluate, select, allocate 
on own region for each pattern
• Broadcast cell positions between patterns
• Implemented on network of workstations and In­




Dist. MIMD Simulated Evolution
Procedure DM-PARALLEL-SIMULATED-EVOLUTION-DOMAIN;
  Start with an initial placement of cells;
  Partition the chip area among processors by rows;
  Ownership of cells by processors owning chip subregions;
  REPEAT /* outer loop */
    FORALL processors in PARALLEL DO
      Broadcast current cell positions to all processors;
      Start partitioning pattern 1
      FOR each cell in the chip DO
        Evaluate goodness of cell using cost function;
      FOR each cell in the chip DO
        IF random(0,1) > goodness of cell
          THEN select cell, place in queue for new allocation;
      FOR each cell in allocation queue DO
        Allocate cell in empty locations;
      Broadcast current cell positions to all processors;
      Start partitioning pattern 2
      FOR each cell in the chip DO
        Evaluate goodness of cell using cost function;
      FOR each cell in the chip DO
        IF random(0,1) > goodness of cell
          THEN select cell, place in queue for new allocation;
      FOR each cell in allocation queue DO
        Allocate cell in empty locations;
    END FORALL
  UNTIL (termination condition); /* end outer loop */
End Procedure
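The evaluate and select steps of the procedure can be sketched as follows. The goodness ratio (best bounding box over current bounding box) and the rule "select when random(0,1) > goodness" come from the slides; the function names, the use of one goodness value per cell, and the caller-supplied random numbers are assumptions made to keep the sketch short and checkable.

```c
#include <assert.h>

/* Goodness of a cell in (0, 1]: precomputed best bounding box size
   divided by the current bounding box size. A value of 1.0 means the
   cell is already at its best known position. */
double goodness(double best_bbox, double current_bbox)
{
    assert(best_bbox > 0.0 && current_bbox >= best_bbox);
    return best_bbox / current_bbox;
}

/* Selection: cell i is queued for re-allocation when u[i] > good[i],
   where u[i] is a uniform random number in [0, 1) supplied by the
   caller. Writes selected cell indices to queue[] and returns their
   count. */
int select_cells(int n, const double good[], const double u[], int queue[])
{
    int count = 0;

    for (int i = 0; i < n; i++)
        if (u[i] > good[i])
            queue[count++] = i;
    return count;
}
```

Cells with low goodness are selected with high probability, while well-placed cells are rarely disturbed, which is the biased selection the slides describe.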
[Figure: Unrestricted routing: maze routing]
[Figure: Restricted routing: channel routing, (b) possible routing]
• Parallel maze routing using area decomposition
- Partition area, perform maze routing on subregions
- Guaranteed to find optimal route for two terminal net
- Solution dependent on order of net selection, route of one net acts as obstacle of future nets, eliminate by rip-up-reroute
- Large memory and computation requirements
- Inherently parallel, good speedups
• Parallel line routing using area decomposition
- Similar to above except expands along one direction (line)
- Solution not optimal, but less runtime
Parallel Channel Routing
• Parallel channel routing using greedy (columnwise decomposition)
- Channel routing handles simultaneously all nets
- Columnwise decomposition not independent, hence need to iterate among subsolutions, sometimes use 45 degree crossover at boundaries
• Parallel channel routing using simulated annealing (rowwise decomposition)
- Will find optimal solution, parallel algorithm exists
- Large runtimes for serial case, much more than good serial heuristics
• Parallel channel routing using hierarchical decomposition
- Decompose M x N channel into a 2 x N routing problem
- Recursively solve 2 x N problems




[Figure: front wave expansion labels around source cell s, with blocked cells marked X; area decomposition on processors]
Dist. Mem. MIMD Maze Routing
Procedure DM-PARALLEL-MAZE-ROUTING;
  Grid Partitioning and Mapping
    Partition the n x n routing grid into P parts;
    Assign one partition to each of P processors;
  Front Wave Expansion
    REPEAT
      FORALL processors in PARALLEL DO
        Each processor maintains a queue of front wave cells
          that are in its grid partition;
        Each grid cell on the current wave is expanded
          (cells to its north, south, east, west are examined);
        Some of these cells are in the processor's grid partition
          while others are in the grid partition of other processors;
        Expansion may require communication with other processors;
        All communication requests are saved;
        Inter processor communication.
          Each processor sends its communication packets
            to the destination processor;
        Process communication packets.
          Each processor examines the packets it receives
            and labels the front wave contained in these packets;
      END FORALL
    UNTIL either the target cell is reached or
      the new front wave has no cells in it;
  Path Trace Back and Sweeping.
    FORALL processors in PARALLEL DO
      Set label number initially to the search number
        of the front wave expansion;
      REPEAT with reduced label numbers
        Reverse wave expansion
          Starting from the target cell begin expanding
            in all four directions;
          Identify cell of label one lower than present;
          Store that pair of cells as a path segment;
        Inter processor communication
          Each processor sends its communication packets of
            path segments when it crosses processor grid boundaries;
        For each cell in path segment, label cells as
          blocked for future nets;
      UNTIL start cell is reached;
    END FORALL
End Procedure
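The three phases of the procedure (front wave expansion, path trace back, sweeping) are easier to follow in a serial form. The following is a single-processor sketch of a Lee-style maze router on a small grid, not the distributed algorithm itself: there is no grid partitioning or message passing, and the grid size `GRID`, the label constants, and the function names are all hypothetical.

```c
#include <assert.h>

#define GRID 8        /* hypothetical grid size for this sketch */
#define EMPTY (-1)    /* unlabeled routable cell */
#define BLOCKED (-2)  /* obstacle or cell used by an earlier net */

static const int DR[4] = {-1, 1, 0, 0};
static const int DC[4] = {0, 0, 1, -1};

static int in_grid(int r, int c)
{
    return r >= 0 && r < GRID && c >= 0 && c < GRID;
}

/* Front wave expansion: labels each reached cell with its distance from
   the source. Returns the label of the target, or -1 if unreachable. */
int expand_wave(int label[GRID][GRID], int sr, int sc, int tr, int tc)
{
    int queue[GRID * GRID][2];
    int head = 0, tail = 0;

    assert(in_grid(sr, sc) && in_grid(tr, tc));
    assert(label[sr][sc] == EMPTY && label[tr][tc] == EMPTY);
    label[sr][sc] = 0;
    queue[tail][0] = sr;
    queue[tail][1] = sc;
    tail++;
    while (head < tail) {
        int r = queue[head][0], c = queue[head][1];
        head++;
        if (r == tr && c == tc)
            return label[r][c];
        for (int d = 0; d < 4; d++) { /* north, south, east, west */
            int nr = r + DR[d], nc = c + DC[d];
            if (!in_grid(nr, nc) || label[nr][nc] != EMPTY)
                continue;
            label[nr][nc] = label[r][c] + 1;
            queue[tail][0] = nr;
            queue[tail][1] = nc;
            tail++;
        }
    }
    return -1;
}

/* Path trace back: walks from the target to the source through labels
   that decrease by one, marking the path BLOCKED for future nets.
   Returns the number of cells on the path. */
int trace_back(int label[GRID][GRID], int tr, int tc)
{
    int r = tr, c = tc, cells = 1;
    int cur = label[r][c];

    assert(cur >= 0);
    while (cur > 0) {
        int d;
        for (d = 0; d < 4; d++) {
            int nr = r + DR[d], nc = c + DC[d];
            if (in_grid(nr, nc) && label[nr][nc] == cur - 1) {
                label[r][c] = BLOCKED;
                r = nr;
                c = nc;
                break;
            }
        }
        assert(d < 4); /* a labeled cell always has a lower neighbor */
        cur--;
        cells++;
    }
    label[r][c] = BLOCKED;
    return cells;
}

/* Sweeping: clears leftover wave labels so the next net can be routed. */
void sweep(int label[GRID][GRID])
{
    for (int r = 0; r < GRID; r++)
        for (int c = 0; c < GRID; c++)
            if (label[r][c] >= 0)
                label[r][c] = EMPTY;
}
```

A caller would route nets one at a time with `expand_wave`, `trace_back`, then `sweep`. In the distributed version, the queue of front wave cells is split by grid partition, and cells that fall in another processor's partition are sent there as communication packets.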
Dist. Mem. MIMD Chan. Routing
Procedure DM-PARALLEL-CHANNEL-ROUTING-SA;
  Start with an initial routing solution (state) S;
  Partition channel area by tracks and perform row mapping;
  Assign channel subregions to processors;
  Ownership of nets by processors owning channel subregions;
  Set T = T0 (initial temperature);
  REPEAT /* outer loop */
    REPEAT /* inner loop */
      FORALL processors in PARALLEL DO
        Identify processor pairs for various dimensions of hypercube;
        Each processor-pair generates a move cooperatively;
        Select move (net exchange or net displacement type);
        IF intra-processor displacement THEN
          Select a random net;
          Move to random track within processor region;
        IF inter-processor displacement THEN
          Select a random net;
          Move to random track within processor pair region;
        IF intra-processor exchange THEN
          Select two random nets within processor;
          Exchange net tracks;
        IF inter-processor exchange THEN
          Select two random nets within processor pair;
          Exchange net tracks;
        Each processor pair accepts/rejects move on own cost;
        Broadcast updated net locations;
      END FORALL
      Increment moves_attempted;
    UNTIL moves_attempted = max_moves; /* inner loop */
    Update T (control temperature);
  UNTIL termination condition; /* end outer loop */
End Procedure
Global Routing Problem
Routing problem for standard cells
Global Routing Cost model
• $H_{ij}$ contains the number of wire routes that pass horizontally through channel $i$ in routing grid $j$
• Cost of a path $P = \sum_{P} H_{ij} + v \times C$, where $C$ is the number of cell rows crossed
Parallel Global Routing
• Parallel global routing using graph expansion approach
- Based on maze routing algorithm, expand each cell in 4 directions until target reached
- Extended to cheapest path route instead of shortest path
- Optimal global route for 2 terminal net, but dependent on net order selection
- Avoided by rip-up-reroute
• Iterative Improvement approach
- Based on previous graph based method
- Restrict net placements to L-shaped or Z-shaped nets, not all routes examined
- Not optimal, but excellent results obtained through rip-up-reroute
• Hierarchical Decomposition approach
- Partition N x M global routing into 2 x 2 routing problem
- Solve problem recursively, paying attention to boundary conditions of subsolutions







• Each wire separately routed in parallel using 
L-shaped or Z-shaped routes
• Update cost array when net routed, cost reflected 
in future net routes
• Algorithm implemented on Encore Multimax shared 
memory multiprocessor
• Gave speedups of 10-13 on 15 processors with 
3-5% degradation in routing quality
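The wire-based iterative step above (route one wire with a restricted shape, then charge the shared cost array) can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: it assumes a per-cell congestion count as the cost array and considers only the two L-shaped candidates; `l_route` and `route_wire` are hypothetical names.

```python
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) in the routing grid


def l_route(a: Cell, b: Cell, horizontal_first: bool) -> List[Cell]:
    """Cells visited by one of the two L-shaped routes from a to b."""
    (r0, c0), (r1, c1) = a, b
    rs = range(r0, r1 + 1) if r1 >= r0 else range(r0, r1 - 1, -1)
    cs = range(c0, c1 + 1) if c1 >= c0 else range(c0, c1 - 1, -1)
    if horizontal_first:
        return [(r0, c) for c in cs] + [(r, c1) for r in rs][1:]
    return [(r, c0) for r in rs] + [(r1, c) for c in cs][1:]


def route_wire(cost: List[List[int]], a: Cell, b: Cell) -> List[Cell]:
    """Pick the cheaper L-shaped route and charge it to the cost array."""
    candidates = [l_route(a, b, True), l_route(a, b, False)]
    best = min(candidates, key=lambda p: sum(cost[r][c] for r, c in p))
    for r, c in best:
        cost[r][c] += 1  # wires routed later see this congestion
    return best


cost = [[0] * 4 for _ in range(3)]
first = route_wire(cost, (0, 0), (2, 3))   # empty grid: horizontal-first wins the tie
second = route_wire(cost, (0, 0), (2, 3))  # avoids the cells used by the first wire
```

Because every routed wire changes the costs seen by later wires, the result depends on routing order, which is why the slides pair this scheme with rip-up-reroute.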
Shared MIMD Parallel GR Program
Procedure SM-PARALLEL-ROUTING-ITERATIVE-WIREBASED;
Read in circuit description and wires;
Create shared cost array;
Partition chip into regions;
Assign regions to processors;
Assign wires to processors by geographic assignment 
depending on location of leftmost point of wire;
Assign wires to task queues of processors using assignment;
REPEAT rip-up-reroute cycle
FORALL processors in PARALLEL DO 
WHILE NOT DONE DO
WHILE current task queue not empty DO 
Pick up next wire from task queue;
Route wire;
Update shared cost array;
END WHILE 
IF own task queue empty 




Rip up some wires;
Place wires on task queues of processors 
by geographic assignment;
UNTIL convergence of rip-up-reroute
End Procedure
Dist. MIMD GR Iterative
• Partition wires to processors
• Copy of cost array on each processor
• Ownership of subregions by processors
• Each processor routes own wire
• If routing extends to other processor region, sender 
or receiver initiated update messages of costs
• Implemented on Intel iPSC hypercube
Division of cost array among processors
Dist. MIMD GR Program
Procedure DM-PARALLEL-ROUTING-ITERATIVE-WIREBASED;
Read in circuit description and wires;
Create cost array;
Partition chip and cost array into regions;
Assign regions and cost array sections to processors;
Each processor owns a subarray but keeps a copy of the whole array;
Assign wires to processors by geographic assignment 
depending on location of leftmost point of wire 
if cost of wire less than a threshold, otherwise 
dynamic wire balancing 
Assign wires to task queues of processors;
REPEAT rip-up-reroute cycle
FORALL processors in PARALLEL DO 
WHILE NOT DONE DO
WHILE current task queue not empty DO 
Pick up next wire from task queue;
Route wire;
Compute delta cost array;
Send/Receive update messages to other 
processors of delta costs using 
sender/receiver initiated methods; 
Update cost array 
END WHILE
IF own task queue empty THEN
Pick next wire from central wire queue;
END WHILE 
END FORALL 
Rip up some wires;
Place wires on task queues of processors by geographic assignment;
UNTIL convergence of rip-up-reroute 
End Procedure














Scan line algorithm takes O(N log N) time
Parallel Design Rule Checking
• Data parallel area decomposition on flattened layouts
-  Partition chip area, perform DRC on partitions
-  Inherently parallel, good speedup
-  Sequential algorithm takes more time than hierarchical
• Hierarchical decomposition on hierarchical layouts
-  Algorithm exploits hierarchy, avoids redundant work
-  Perform DRC within each cell and between cells
-  Dependent tasks, average speedups
• Functional decomposition
-  Each design rule on entire mask is separate task
-  Dependencies among tasks due to intermediate layers
Data Parallel DRC
Design Rule Interaction Distance (DRID)
Figure: removing error reports from overlap regions (areas labeled correctly checked and incorrectly checked)
Dist. MIMD Parallel DRC
Procedure DM-PARALLEL-DRC-AREA;
Read flattened mask layout;
Partition mask layout description by vertical slices 
using an overlap of DRID on each partition;
FORALL processors in PARALLEL DO
Perform design rule checking on own partition 
using a polygon, pixelmap, or scanline edge-based method; 
Report errors for partition;
END FORALL
End Procedure
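The area-decomposed check above can be sketched for a single spacing rule. This is an illustrative sketch under stated assumptions, not the tutorial's code: rectangles are `(x_lo, y_lo, x_hi, y_hi)` tuples, the DRID is at least the minimum spacing, and a violation is reported only by the slice whose core contains a reference point (here the larger of the two left edges), which is one simple way to drop duplicate reports from the overlap strips. All function names are hypothetical.

```python
from itertools import combinations


def gap(a, b):
    """Separation between two rectangles (0 if they touch or overlap)."""
    dx = max(a[0] - b[2], b[0] - a[2], 0)
    dy = max(a[1] - b[3], b[1] - a[3], 0)
    return max(dx, dy)


def check_slice(rects, x_lo, x_hi, drid, min_space):
    """Spacing check on one vertical slice widened by DRID on both sides."""
    local = [r for r in rects if r[2] >= x_lo - drid and r[0] <= x_hi + drid]
    errors = set()
    for a, b in combinations(sorted(local), 2):
        # Keep the report only if its reference point lies in this slice's core.
        if 0 < gap(a, b) < min_space and x_lo <= max(a[0], b[0]) < x_hi:
            errors.add((a, b))
    return errors


def parallel_drc(rects, chip_width, n_slices, drid, min_space):
    width = chip_width // n_slices
    errors = set()
    for p in range(n_slices):  # each iteration stands for one processor's work
        x_hi = chip_width if p == n_slices - 1 else (p + 1) * width
        errors |= check_slice(rects, p * width, x_hi, drid, min_space)
    return errors


rects = [(0, 0, 9, 5), (11, 0, 18, 5), (20, 0, 25, 5), (40, 0, 45, 5), (60, 0, 70, 5)]
sliced = parallel_drc(rects, chip_width=80, n_slices=4, drid=3, min_space=3)
serial = parallel_drc(rects, chip_width=80, n_slices=1, drid=3, min_space=3)
```

The second violation straddles the boundary between the first two slices; the DRID overlap is what lets the owning slice see both rectangles.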
Functional Parallel DRC
• Each design rule to be checked on entire mask 
forms a task
• Some design rules involve creation of intermediate 
layers, hence dependent tasks
• Map each task on separate processor (static or 
dynamic scheduling)
DRC data dependency graph mapping on 4 processors
Heuristic Height Size
Processor A B C D A B C D
time = 1 1 ,
time = 2 2 4 5 6 2 3 4 5
time = 3 3 10 6 7 8 9
time = 4 7 8 9 11 10
time = 5 12 11
time = 6 12
Dist. MIMD Parallel DRC Program
Procedure DM-PARALLEL-DRC-FUNCTIONAL;
Read mask description;
Assign one processor as Master and P-1 as Slaves;
FORALL processors in PARALLEL DO 
IF Master process THEN 
Read design rules;
WHILE there are design rules left to be checked DO 
Pick design rule that is not dependent on any
other design rule that has not yet been checked;
Create a DRC task on entire mask;
Insert DRC task into task queue of particular
Slave processor using static scheduling heuristics;
IF any DRC tasks completed by any Slave 




IF Slave process THEN 
WHILE NOT DONE DO
WHILE there are tasks in the task queue DO
Pick a DRC task from task queue if data dependencies met; 
Read flattened mask layout;
Perform design rule checking on entire mask for that rule 
Report errors;










Layout and extraction of CMOS Inverter
• Sequential extraction based on scanline algorithm, 
O(N log N) time
• At each scan line stop, check overlap among all 
rectangles in electrically connecting mask layers
• If intersection forms devices, identify devices (transistors)
• If connected, mark as connected component
• Collect all rectangles on connected component, 
assign a global net number






• Parallel algorithm based on data parallel area decomposition
-  Partition area of chip, perform local extraction, merge results
-  Inherently parallel, large speedups
• Parallel algorithm based on hierarchical decomposition
-  Algorithm exploits hierarchy of circuit, avoids redundant work
-  Each cell extracted only once
-  If some cells large, load imbalance problems
Data Parallel Extraction
Decompose given area by equal area or equal number of rectangles









Data Parallel Extraction (Contd)






Perform global extraction
• Perform parameter extraction in parallel on different nets
Break up long nets into short nets
Dist. MIMD Parallel Extraction
Procedure DM-PARALLEL-NETLIST-DATA;
Read flattened mask layout;
Partition mask layout description by 
(1) rectangular area (2) rectangular points 
(3) slice area (4) slice points;
Assign partitions of mask rectangles to processors;
Rectangles touching borders of regions assigned to both processors; 
FORALL processors in PARALLEL DO
Perform Boolean operations to derive new layers;




FORALL processors in PARALLEL DO
FOR log(P) stages in P processor machine DO 
Identify neighbor processor in current stage;
Merge LCS sets across processors in pair;
Create an extended region of local connected sets;
Create new border segments of extended region;
END FOR 
END FORALL 
Global netlist extraction 
FORALL processors in PARALLEL DO
FOR each global connected component owned by processor DO 
Report net for component;








• Remove overlaps among hierarchical cells
• Create a task for extracting each cell
• Insert in central queue from which processors will 
take tasks and execute
• Problem: What if some cells very large?
• Run each cell on j  processors, but have sublinear 
speedups
• Use partitionable independent task schedule
• Implemented on Encore shared memory multiprocessor, 
obtained speedups of 7 on 8 processors
Partitionable Independent Task Scheduling
• Given n tasks with estimates of execution times T_i
• Each task can run on many processors in parallel 
with sublinear speedups; use estimated speedups 
G_i(j) when running on j processors
• Schedule them on p processors to minimize execution time
Shared MIM D Hierarchical Parallel PRC
Procedure SM-PARALLEL-NETLIST-HIERARCHY;
Read hierarchical circuit description;
Perform overlap analysis and transformation;
IF two cells overlap
THEN Flatten those cells;
FOR each hierarchical independent cell DO
Obtain estimate of extraction time by counting rectangles; 
ENDFOR
Obtain Partitionable Independent Task Schedule;
Determines how many processors should execute 
each task in parallel;
FORALL processors in PARALLEL DO 
WHILE there are cells to extract DO
Determine which processors will participate in extraction;
WHILE all processors in schedule not free DO 
Wait for them to be free;
END WHILE
Perform parallel netlist extraction among P processors




Example Problem and Scheduling Algorithm
Summary
• Reviewed parallel algorithms for placement, routing, 
design rule checking and extraction
• Many more algorithms exist
• The algorithms described handle simple cases whereas 
sequential software handles complex cases
• Sequential algorithms keep improving in runtime 
and quality
• Need parallel algorithms that can interface to 
improvements in sequential algorithms
• Parallel algorithms are often targeted for specific 
architectures, e.g. hypercubes




SESSION 3
PARALLEL ALGORITHMS IN
CIRCUIT AND LOGIC
SIMULATION
Outline




• Parallel Algorithms for Logic Simulation
— Compiled simulation
— Event-driven simulation
Circuit Simulation Problem








C_1 \frac{dv_1}{dt} = -g_1 (v_1(t) - v_{in}(t)) - g_2 (v_1(t) - v_2(t)) - g_3 (v_1(t) - v_3(t))
C_2 \frac{dv_2}{dt} = -g_2 (v_2(t) - v_1(t))
C_3 \frac{dv_3}{dt} = -g_3 (v_3(t) - v_1(t))
• In general, the system of n equations that results 
has the following form:
\frac{d}{dt} q(v(t), u(t)) = g(v(t), u(t))
• Solve for voltage v(t) at all nodes using numerical 
methods
Direct Method of Circuit Simulation
• Use numerical integration to formulate equations, 
then solve for the unknown v(t+h)
q(v(t+h), u(t+h)) - q(v(t), u(t))
= 0.5 h [g(v(t+h), u(t+h)) + g(v(t), u(t))]
• Solve nonlinear set of equations F(x) = 0 by 
Newton-Raphson method
J_F(v^k(t+h)) [v^{k+1}(t+h) - v^k(t+h)] = -F(v^k(t+h))
where J_F is the Jacobian matrix
• First step comprises model evaluation or Load 
phase
• Next step is the solution phase for the sparse linear 
system of equations
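The direct method above (trapezoidal integration, then Newton-Raphson at each time point) can be illustrated on a single node. The sketch below is a scalar stand-in under stated assumptions: q(v) = C v, g(v, u) = -(v - u)/R - k v^3, and the Jacobian reduces to one number; the element values and the function name `trapezoidal_newton` are hypothetical.

```python
import math


def trapezoidal_newton(v0, u, h, steps, C=1.0, R=1.0, k=0.0, tol=1e-12):
    """Integrate d/dt q(v) = g(v, u) with the trapezoidal rule and Newton-Raphson."""

    def q(v):
        return C * v

    def g(v, uu):
        return -(v - uu) / R - k * v ** 3

    def dg_dv(v):
        return -1.0 / R - 3.0 * k * v ** 2

    v = v0
    wave = [v]
    for n in range(steps):
        u_now, u_next = u(n * h), u((n + 1) * h)
        x = v  # initial Newton guess: value at the previous time point
        for _ in range(50):
            # Model evaluation ("load" phase): residual F and Jacobian J.
            F = q(x) - q(v) - 0.5 * h * (g(x, u_next) + g(v, u_now))
            J = C - 0.5 * h * dg_dv(x)
            # Linear solve phase (a scalar division in this one-node example).
            dx = -F / J
            x += dx
            if abs(dx) < tol:
                break
        v = x
        wave.append(v)
    return wave


wave = trapezoidal_newton(0.0, lambda t: 1.0, h=0.01, steps=100)
exact = 1.0 - math.exp(-1.0)  # RC step response at t = 1 when k = 0
```

In a real circuit the residual and Jacobian are assembled from all devices and the division becomes a sparse LU solve, which is where the two parallel phases discussed next come from.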
Parallel Model Evaluation
Model Evaluation Involves Loading Matrix and Vector
Templates for Parallel Model Evaluation
Shared MIMD Model Evaluation
• One lock for entire matrix
• One lock for each row of matrix




Partition circuit elements among processors;
Create shared Jacobian matrix and RHS vector;
FORALL processors in PARALLEL DO
WHILE there are elements in own queue DO 
Pick next circuit element;
Determine element contribution on Jacobian matrix; 
Update local copy of Jacobian matrix and RHS vector; 
END WHILE
Accumulate local copies of Jacobian matrix and RHS vector 




Parallel Solution of Equations
Procedure LU-FACTORIZATION-SOURCE;
FOR k = 1, n DO
FOR j = k + 1, n such that a_kj ≠ 0 DO 
a_kj = a_kj / a_kk; /* Normalize */
END FOR
FOR i = k + 1, n such that a_ik ≠ 0 DO 
FOR j = k + 1, n such that a_kj ≠ 0 DO 





• Operation based decomposition
• Row level operation based decomposition
• Pivot based decomposition
Element Level Parallelism
Sparse matrix with fill-ins
List of element-level operations
Task graph for element-level operations
Row-level Parallelism
Sparse matrix with fill-ins
1. N1: Normalize Row 1
2. U12: Update Row 2 Using Row 1
3. U13: Update Row 3 Using Row 1
4. N2: Normalize Row 2
5. U25: Update Row 5 Using Row 2
6. N3: Normalize Row 3
7. U34: Update Row 4 Using Row 3
8. U36: Update Row 6 Using Row 3
9. N4: Normalize Row 4
10. U45: Update Row 5 Using Row 4
11. U46: Update Row 6 Using Row 4
12. N5: Normalize Row 5
13. U56: Update Row 6 Using Row 5
List of row-level operations
Task graph of row-level operations
Shared MIMD Row-level Parallel
Procedure SM-PARALLEL-LU-FACTORIZATION-ROW;
Read in matrix elements and RHS vector; note sparse entries; 
Perform symbolic evaluation of row-level updates below;
FOR k = 1, n DO
FOR j = k + 1, n such that a_kj ≠ 0 DO 
Normalize row k;
(a_kj = a_kj / a_kk)
END FOR
FOR i = k + 1, n such that a_ik ≠ 0 DO 
FOR j = k + 1, n such that a_kj ≠ 0 DO 
Update row i;




Create a row-level task node for every row-level update;
Levelize the task graph;
Partition tasks of task graph onto processors statically;
FORALL processors in PARALLEL DO 
FOR each level L in task graph DO
WHILE there are tasks in processor's queue at level L DO 
Pick next task at level L;
Since row is sparse, scatter row elements into vector; 
Execute all row operations in vector form;
Gather row from vector form to sparse form;






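The "Levelize the task graph" step of the row-level procedure can be sketched as follows. The operation list is the one read from the row-level slide; the dependency rule is an assumption made for illustration: an update of row i using row k waits for the normalization of row k and for the previous operation that wrote row i, and a normalization waits for the last update of its row. The function name `levelize` is hypothetical.

```python
from collections import defaultdict

# ("N", k) normalizes row k; ("U", k, i) updates row i using row k.
ops = [("N", 1), ("U", 1, 2), ("U", 1, 3), ("N", 2), ("U", 2, 5),
       ("N", 3), ("U", 3, 4), ("U", 3, 6), ("N", 4), ("U", 4, 5),
       ("U", 4, 6), ("N", 5), ("U", 5, 6)]


def levelize(ops):
    """Assign each row-level task the earliest level at which it can run."""
    level = {}
    last_on_row = {}  # most recent task that wrote each row
    for op in ops:
        if op[0] == "N":
            row, deps = op[1], []
        else:
            row, deps = op[2], [("N", op[1])]
        if row in last_on_row:
            deps.append(last_on_row[row])
        level[op] = 1 + max((level[d] for d in deps), default=-1)
        last_on_row[row] = op
    return level


level = levelize(ops)
by_level = defaultdict(list)
for op, lv in level.items():
    by_level[lv].append(op)  # tasks in the same level can run in parallel
```

Under these assumed dependencies the 13 tasks fall into 8 levels, so at most three tasks are ever available at once for this small matrix.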







Bordered Block Diagonal and Nested BBD form
Waveform Relaxation Method
• Apply relaxation techniques to the system of equations 
at the differential equation solution level
• Relaxation variables are voltage waveforms that 
are functions of time
• In each iteration entire waveforms are computed 
for each variable
• Circuit is partitioned into DC-connected subcircuits







Gauss-Jacobi relaxation algorithm 
v^{k+1} = f(v^k, ..., u, t)
• Gauss-Seidel waveform relaxation method uses a 
combination of results from current and previous 
iterations
v
• Gauss-Seidel algorithm forces an ordering of the 
analysis and some serialization among sub-circuits
• It requires all subcircuits to be leveled and ordered 
based on their level (starting from primary 
inputs at level 0), i.e. all level i subcircuits must 
be analyzed before any level i+1 subcircuit.













Gauss-Seidel task precedence graph
Gauss-Jacobi task precedence graph
Shared MIMD Waveform Relax
Procedure SM-PARALLEL- WA
Read in circuit description;
Partition circuit into subcircuits;
Create shared data structure representing voltage waveforms;
Guess initial values of all node voltages on all time steps;
REPEAT
FORALL processors in PARALLEL DO
WHILE there are subcircuits to be processed DO 
Pick subcircuit;
Read input waveforms for all input nodes to subcircuit;
(voltages for all input nodes for all time steps)
FOR each time step DO 
Solve subcircuit;
Update output waveform for output nodes of subcircuit;
(voltages for all output nodes for all time steps);
IF output waveform of output nodes changed
THEN Schedule this subcircuit for another evaluation; 
END WHILE 
END FORALL
UNTIL waveforms of all nodes converge;
End Procedure
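The Gauss-Jacobi variant of the loop above can be sketched on two coupled one-node subcircuits. This is a toy illustration under stated assumptions: each subcircuit obeys v' = -2 v + (other node's voltage) + source, backward Euler is used inside a subcircuit, and the names `solve_subcircuit` and `gauss_jacobi_wr` are hypothetical.

```python
def solve_subcircuit(coupling, source, h, a=2.0):
    """Backward Euler on v' = -a*v + coupling(t) + source over all time steps."""
    v = [0.0]
    for n in range(1, len(coupling)):
        v.append((v[-1] + h * (coupling[n] + source)) / (1.0 + a * h))
    return v


def gauss_jacobi_wr(steps=200, h=0.01, tol=1e-10, max_iter=100):
    """Iterate whole waveforms until no node's waveform changes."""
    v1 = [0.0] * (steps + 1)  # initial guess for every time step
    v2 = [0.0] * (steps + 1)
    for it in range(1, max_iter + 1):
        # Both solves read only waveforms from the previous iteration,
        # so they could run on different processors.
        new1 = solve_subcircuit(v2, 1.0, h)
        new2 = solve_subcircuit(v1, 0.0, h)
        change = max(abs(x - y) for x, y in zip(new1 + new2, v1 + v2))
        v1, v2 = new1, new2
        if change < tol:
            return v1, v2, it
    raise RuntimeError("waveform relaxation did not converge")


v1, v2, iterations = gauss_jacobi_wr()
```

A Gauss-Seidel version would pass `new1` instead of `v1` to the second solve, which usually converges in fewer iterations but serializes the two subcircuits.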
• Implemented on 8 processor Alliant FX/8 shared 
memory multiprocessor
• Gauss-Jacobi was two times slower on one processor 
than Gauss-Seidel
• Gauss-Jacobi faster than Gauss-Seidel beyond 4 
processors, speedups of 7 on 8 processors
Dist. MIMD Waveform Relax
Procedure DM-PARALLEL-WAVERELAX-FULL;
Read in circuit description;
Partition circuit into DC-connected subcircuits;
Allocate partitions statically to processors;
Create data structures representing voltage waveforms 
on each processor for owned nodes;
Guess initial values of all node voltages on all time steps;
REPEAT
FORALL processors in PARALLEL DO
FOR all subcircuits owned by processor DO 
Pick subcircuit;
Receive input waveforms for all input nodes to subcircuit; 
FOR each time step DO 
Solve subcircuit;
Update output waveform for output nodes of subcircuit; 
Send output waveforms to processors needing them;
IF output waveform of output nodes changed
THEN Schedule this subcircuit for another evaluation; 
END FOR 
END FORALL
UNTIL waveforms of all nodes converge;
End Procedure
• Implemented on 256 processor VICTOR message 
passing computer at IBM
• Gauss-Jacobi implemented, gave speedups of 100 
on 256 processors
Massively Parallel Approach
• Use nonlinear relaxation to solve 
\frac{d}{dt} q(v(t), u(t)) = -f(v(t), u(t))
• For each time step, an initial guess of the voltages 
at the present time point is made on the basis of 
the voltages of the previous time points.
• The nonlinear set of equations
•••.«?»- . v « -1) =  0
are repeatedly solved for each voltage vf using 
the Newton-Raphson method until the relaxation 
iteration has converged.
• After the relaxation iteration has converged, vf 
becomes the new voltage at time t +  h, and one 
can advance to the new time step.
• Each of the node voltages can be solved for independently, 
hence in parallel.
• Massively parallel SIMD implementation on Connection 
Machine CM-2, speedups of 8 over workstation reported
SIMD Parallel Nonlinear Relax
Example circuit and mapping on SIMD
Procedure SIMD-PARALLEL-NONRELAX-POINT;
Map each circuit device and circuit node on different processor; 
REPEAT Gauss-Seidel iteration
FORALL processors in PARALLEL DO 
Select processors that own circuit nodes
Send (voltages) to processors owning devices;
FOR each device in circuit DO
Select processors owning circuit devices;
Evaluate the device model;
END FOR
Select processors owning circuit devices;
Send responses to processors owning circuit nodes; 
Select processors owning circuit nodes 







• Given description of logic circuit, gates and connections
• Given input data to be simulated and initial memory states
• Simulate circuit at logic level assuming some delay models
• Two methods are commonly used
-  Compiled simulation: simulate all gates at all 
time steps
* Scheduling no problem, done at compile 
time, very fast
* Possibly redundant computations
-  Event-driven simulation: simulate gates whose 
inputs are changing at this time step
* Avoids redundant computations
* Scheduling cost at runtime
Compiled Logic Simulation
ORDER OF EVALUATION: G1; G2; G3; G6; G4; G5; G7; G8; G9; G10
• Given network is levelized
• All logic gates at a level simulated before going 
to next level
Procedure COMPILED-LOGIC-SIMULATION;
Read in the initial value of each line;
Read in next input vector and update values;
FOR each new input data DO
FOR each level of logic circuit DO 
FOR each gate in level DO
Execute compiled code for gate, i.e. simulate logic;
END FOR 
END FOR
IF new value of outputs of feedback 
lines same as old value 
THEN output results;
ELSE set new input value of inputs of feedback lines 
same as new outputs;
END FOR 
End Procedure
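The levelized evaluation loop above can be sketched in a few lines. The netlist here is a hypothetical three-gate combinational circuit (no feedback lines), and `compiled_simulate` is an illustrative name; the point is only that every gate is evaluated for every input vector, in level order.

```python
import operator


def inv(a):
    return 1 - a


# Hypothetical levelized netlist: (output, function, inputs); every gate's
# inputs are primary inputs or outputs of an earlier level.
levels = [
    [("n1", operator.and_, ("a", "b")), ("n2", inv, ("c",))],
    [("out", operator.or_, ("n1", "n2"))],
]


def compiled_simulate(levels, vectors, output="out"):
    """Evaluate all gates, level by level, for each input vector."""
    results = []
    for vec in vectors:
        values = dict(vec)
        for level in levels:
            for out, fn, ins in level:  # evaluated whether or not inputs changed
                values[out] = fn(*(values[i] for i in ins))
        results.append(values[output])
    return results


vectors = [{"a": 1, "b": 1, "c": 1}, {"a": 0, "b": 1, "c": 1}, {"a": 0, "b": 0, "c": 0}]
outputs = compiled_simulate(levels, vectors)
```

Gates within one level have no dependencies on each other, which is what the circuit-parallel compiled approaches later in this session exploit.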
ORDER OF EVALUATION: G1; G2; G3; G6; G4; G5; G7; G8; G9; G10
• Gates are evaluated when there are events, i.e. a 
change in value of signal
• When an event i is simulated, and it is determined 
that its output line j has changed state, 
then line j has to be scheduled to change at time 
t_current + Δ, where Δ is some appropriate delay 
value of the gate.
• Can be handled by a time wheel





FOR each input vector to be simulated DO 
Process new inputs;
Update input nodes;
Schedule connected elements on timing wheel;
WHILE (elements left for evaluation) DO 
Evaluate element;
IF ( change on output )
THEN update all fanout nodes and schedule 




• Typically, circuit activity is about 5-10%, hence 
great savings in runtimes
• Can simulate circuits with feedback paths
• Can handle arbitrary delays
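The event-driven scheme above can be sketched with a priority queue standing in for the time wheel. This is a simplified illustration, not the tutorial's algorithm in full: it assumes one fixed delay for every gate, and the circuit, signal names, and the function `event_simulate` are hypothetical.

```python
import heapq
import operator


def event_simulate(gates, fanout, values, events, delay=1):
    """gates: output -> (function, inputs); fanout: signal -> outputs it drives.
    events: (time, signal, new_value) tuples. Returns the list of value changes."""
    queue = list(events)
    heapq.heapify(queue)  # heap ordered by time plays the role of the time wheel
    trace = []
    while queue:
        time, sig, val = heapq.heappop(queue)
        if values[sig] == val:
            continue  # no change, so nothing downstream is evaluated
        values[sig] = val
        trace.append((time, sig, val))
        for out in fanout.get(sig, ()):
            fn, ins = gates[out]
            new = fn(*(values[i] for i in ins))
            if new != values[out]:
                # Schedule the output change at t_current + delay.
                heapq.heappush(queue, (time + delay, out, new))
    return trace


gates = {"n1": (operator.and_, ("a", "b")), "out": (operator.or_, ("n1", "c"))}
fanout = {"a": ["n1"], "b": ["n1"], "n1": ["out"], "c": ["out"]}
values = {"a": 0, "b": 1, "c": 0, "n1": 0, "out": 0}
trace = event_simulate(gates, fanout, values, [(0, "a", 1), (5, "c", 1)])
```

The change on `c` at time 5 triggers one gate evaluation but no new event, because `out` is already 1; that skipped work is the saving the 5-10% activity figure refers to.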
Parallel Logic Simulation
• Functional parallelism
-  Break up functions of logic simulation into subtasks, 
execute on different processors
-  Grain size of tasks too small to be useful on 
general purpose parallel processors, used in 
hardware accelerators, e.g. MARS
• Input data parallelism
-  Partition input data set into groups, simulate 
each set of inputs on entire circuit on different 
processor
-  Applicable to combinational circuits, unit delay 
models only
• Circuit parallelism for compiled simulation
-  Partition circuit into subcircuits, simulate each 
subcircuit on different processor using compiled 
technique, compile communication as well
-  Method suitable for SIMD massively parallel, 
but inherently slower than event driven, used 
in YSE
Parallel Logic Simulation (Contd)
Parallel Logic Simulation (Contd)
• Circuit parallelism for event-driven simulation
-  Synchronous approach
* All nodes at given time point evaluated before 
proceeding
* Not much parallelism exists per time step 
in problem
-  Asynchronous approach
* Different portions of circuit allowed to evaluate 
up to different points of time
* Conservative methods
• Simulate subcircuits safely as far ahead 
in time as possible without violating causality
• Average parallelism, deadlock problem
* Optimistic methods
• Simulate subcircuits ahead in time, if causality 
error, rollback
• Maximum parallelism, complex control
Circuit Partitioning Approaches
Partition logic circuit by random, natural, input 
cones, output cones, strings, levels, etc.
Each processor simulates each partition of circuit
• Natural partitions = [1,5,9,13][6,10,14,15][2,7,11,16][3,8,4,12]
• Level partitions = [1,2,3,4][5,6,7,8][9,10,11,12][13,14,15,16]
• Fan-Out Cones = [1,5,6,9,13][2,7,10,11,14,15][3,8][4,12,16]
• Fan-in Cones = [13,9,5,1][14,10,6,7,2][15,7,2][16,11,12,8,3,4]
• Strings = [13,9,5,1][14,10,6][15][16,11,7,2][12,8,3][4]
Shared MIMD Compiled Simulation
Procedure SM-PARALLEL-COMPILED-LOGIC-SIMULATION-CIRCUIT;
Read logic circuit;
Partition logic circuit using heuristics;
Assign one subcircuit to each processor;
FORALL processors in PARALLEL DO 
FOR each time step DO
IF processor owns some primary inputs 
THEN Read inputs for current time;
Process new inputs;
FOR each level of logic circuit DO
FOR each gate in level owned by processor DO
Execute compiled code for gate, i.e. simulate logic; 
END FOR 
END FOR
IF new value of outputs of feedback 
lines same as old value 
THEN output results;
ELSE set new input value of inputs of feedback lines 








Massively Parallel SIMD Simulation
• Switch level simulation model COSMOS preprocesses 
switch level circuit and generates Boolean 
formulas (AND-OR-NOT)
• Memory of parallel machine treated as two-dimensional 
array, each entry holds a 2-input Boolean AND/OR 
operator
• Rank order all gates in k levels
Compilation of Boolean Model on SIMD Machine





FORALL processors in PARALLEL DO 
Load inputs of Boolean model into input 
slots of processors;
FOR each rank 0 to R step 1 DO 
Select processors (AND nodes)
Evaluate AND nodes;
Select processors (OR nodes)
Evaluate OR nodes;
Communicate (node values to two fanout processors) 
END FOR 
END FORALL
UNTIL (stable state has been reached per step)
UNTIL (end of simulation time)
End Procedure
• Implemented on Connection Machine CM-2, average 
number of Boolean operations executed in 
parallel per rank was 3000
• Resultant speedup over VAX 8800 was 2
Parallel Event Simulation Synchronous
• Uniprocessor algorithm performs loop (1) Update 
all scheduled nodes (2) Evaluate all elements connected 
to the changed nodes (3) Schedule all output 
nodes that change
• In parallel algorithm, single shared queue of events 
creates too much contention
• Have distributed event queues, one per processor 
(N queues)
• When node update is scheduled, scheduling processor 
picks another processor in round-robin manner 
and puts node on queue of that processor
• During node update, each processor updates all 
nodes in own queue (large computation grain)
• For load balancing, when processor queue empty, 
processor looks at other queues for more work
• Similar strategy for element evaluation, place on 
N queues
Shared MIMD Synch Event Sim
Procedure SM-PARALLEL-EVENT-LOGIC-SIM-CIRCUIT-SYNCH;
Read circuit description;
Partition circuit among processors;
Create N queues for scheduling events for N processors;
FOR each input vector to be simulated DO 
FORALL processors in PARALLEL DO
Process new inputs among nodes in each processor;
Update input nodes;
Schedule connected elements on queue 
of appropriate processor;
WHILE (elements left for evaluation on own queue) DO 
Evaluate element;
IF ( change on output ) THEN 
Update all fanout nodes;




Barrier synchronize (time step);
END FOR
End Procedure
• Implemented on Encore shared memory multiprocessor
• Speedups of 4.5-6 on 8 processors reported
• Not much parallelism per time step
Parallel Event Asynch-Conservative
• Based on conservative methods of general Parallel 
Discrete Event Simulation (Chandy-Misra-Bryant 
algorithm)
• Construct a set of logical processes LP_1, LP_2, ..., LP_n 
to represent partitions of gates
• All interactions between physical processes (gates) 
modeled as timestamped messages
• Conservative methods avoid possibility of causality 
error by waiting to see if it is safe to process 
event
• Contrast to optimistic methods which allow for 
causality errors, but detect and recover from them 
using rollback methods
Asynchronous Event Conservative
• Statically specify links indicating which processes 
can communicate
• Sequence of timestamps on messages on links 
guaranteed to be nondecreasing
• Hence, timestamp of last message received on 
incoming link is lower bound on timestamp of 
subsequent message
• Each link clock equals either timestamp of message 
at front of link if queue has a message or 
timestamp of last message received on that link
• The process repeatedly selects the link with the 
smallest clock and if there is a message in it, 
processes it.
• If the selected queue is empty, the process blocks.
• If a cycle of empty queues arises that has sufficiently 
small clock values, each process in that 
cycle might block, and the simulation deadlocks.
Deadlocks in Simulation
• Null messages are used to avoid deadlock situations.
• Null messages are used for synchronization purposes 
only and do not correspond to any activity 
in the physical system.
• Whenever a process finishes processing an event 
it sends a null message T_null along each of its output 
ports indicating lower bounds on the timestamps 
of next outgoing messages.
• One way to determine timestamp of null messages 
is to determine the clock values of each incoming 
link coupled with the minimum timestamp increment 
for any message passing through the logical 
process.
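The link-clock rule and the null-message timestamp described above can be sketched as follows. This is a minimal illustration with assumed data structures: each incoming link is a list of pending timestamps, `link_clocks` holds the current clock of each link, and the function names are hypothetical.

```python
def null_message_timestamp(link_clocks, min_delay):
    """Lower bound on the timestamp of any message this process can still send."""
    return min(link_clocks.values()) + min_delay


def next_action(link_queues, link_clocks):
    """Select the incoming link with the smallest clock; process it or block."""
    link = min(link_clocks, key=link_clocks.get)
    if link_queues[link]:
        ts = link_queues[link].pop(0)
        # Clock becomes the timestamp at the front of the queue, or the
        # timestamp of the last message received if the queue is now empty.
        link_clocks[link] = link_queues[link][0] if link_queues[link] else ts
        return ("process", link, ts)
    return ("block", link, None)


queues = {"A": [10, 20], "B": []}
clocks = {"A": 10, "B": 5}
blocked = next_action(queues, clocks)            # B has the smallest clock but is empty
bound = null_message_timestamp(clocks, 2)        # what this process may promise downstream
clocks["B"] = 12                                 # a null message with timestamp 12 arrives on B
resumed = next_action(queues, clocks)            # the message at time 10 on A is now safe
bound_after = null_message_timestamp(clocks, 2)
```

The null message carries no event, but raising the clock on link B is exactly what unblocks the process.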
Shared MIMD Asynch Conservative
Procedure SM-PARALLEL-EVENT-LOGIC-CIRCUIT-ASYNCH-CON;
Perform Initialization;
FOR all time steps DO
FOR all generator and constant nodes DO 
Evaluate generator and constant nodes;
END FOR 
END FOR
FORALL processors in PARALLEL DO 
WHILE (element needs evaluation) DO
Atomically remove an element from own queue;
Evaluate new output behavior inputs as far as possible;
FOR all inputs DO
Search for minimum time of the first event of inputs;
Call this the current-time.
END FOR
FOR all events queued on input node DO 
FOR all input nodes DO
Search for minimum time we know the behavior for all 
input nodes. Call this min-valid.
END FOR 
END FOR
Set up the initial inputs, outputs and state 
corresponding to the current-time;
FOR each event on the input nodes that occurs 
before min-valid DO
Evaluate the element with the current inputs and state;
Add any new output values onto the behavior 
description of the output nodes;
Activate the elements in the fan-out of the 
updated nodes;
Get the next event on the input nodes;
Update the valid times for the output nodes.





• Implemented on Encore Multimax multiprocessor, 
speedups of 5.5-7.5 on 8 processors
Parallel Asynch Event Optimistic
• Optimistic methods detect and recover from causality 
errors using rollback mechanisms
• Jefferson’s Time Warp mechanism based on Virtual 
Time
• Causality error is detected when event message 
with smaller timestamp than process clock is received
• Recovery is accomplished by undoing the effects 
of all events that have been processed prematurely
• Rolling back the state can be achieved by periodically 
saving the process’ state, and restoring the 
state vector on rollback.
• Rollback from a previously sent message can be 
achieved by sending a negative or antimessage 
that annihilates the original message.
• If a process receives an antimessage, that process 
must recursively rollback to undo the effect of 





Figure: Process A before rollback













• Assume new message comes in with time 135 
• Process A after rollback














Dist. MIMD Asynch Optimistic
Procedure DM-PARALLEL-EVENT-LOGIC-CIRCUIT-ASYNC-OPT;
Read circuit description;
Partition circuit among processors;
Create one separate timing wheel (queue) per processor;
Set global virtual time GVT =  0;
Set Local Virtual Time (LVT) =  0 for all processors;
FORALL processors in PARALLEL DO 
WHILE (events need evaluation) DO 
Pick next event for evaluation;
Evaluate gate corresponding to event;
IF (change on output) THEN 
Update all fanout nodes;
Send cancellation events to other processors;
IF cancellation messages received from 
other processors THEN 
Rollback simulation to last checkpointed state;
Redo simulation for all time steps from 
checkpoint to final step for element;
Schedule affected gates on queues;
Checkpoint simulation state;
IF (LVT - GVT) < window THEN 
Increment LVT;
ELSE Wait; do not increment LVT;
END WHILE
Set GVT =  Minimum(all LVTs of all processors);
Reclaim storage for all states less than GVT;
END FORALL
End Procedure
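The rollback step of the optimistic procedure can be sketched for a single process. This is a toy model under stated assumptions: the state is a running sum of event payloads, a checkpoint is saved after every event, and antimessages are not modeled; the class name `TimeWarpProcess` is hypothetical. The timestamps reuse the rollback example (events at 95, 107, 121, 142, 154, then a straggler at 135).

```python
import bisect


class TimeWarpProcess:
    def __init__(self):
        self.lvt = 0                # local virtual time
        self.state = 0              # toy state: sum of event payloads
        self.processed = []         # (timestamp, payload) in timestamp order
        self.saved = [(0, 0)]       # checkpoints: (lvt, state) after each event

    def _execute(self, ts, payload):
        self.state += payload
        self.lvt = ts
        self.processed.append((ts, payload))
        self.saved.append((ts, self.state))

    def receive(self, ts, payload):
        """Process an event; roll back first if it arrives in the local past."""
        redo = []
        if ts < self.lvt:  # causality error: straggler message
            keep = bisect.bisect_left(self.processed, (ts,))
            redo = self.processed[keep:]          # events processed prematurely
            del self.processed[keep:]
            del self.saved[keep + 1:]
            self.lvt, self.state = self.saved[-1]  # restore checkpointed state
        self._execute(ts, payload)
        for old_ts, old_payload in redo:           # redo the undone events
            self._execute(old_ts, old_payload)
        return len(redo)


proc = TimeWarpProcess()
for t in (95, 107, 121, 142, 154):
    proc.receive(t, 1)
undone = proc.receive(135, 10)  # rolls back to the state saved at time 121
```

A fuller version would also send antimessages for outputs produced by the undone events and discard checkpoints older than GVT.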
• Implemented on BBN GP 1000 multiprocessor, 
speedups of 11 on 32 processors
Summary
• Parallel processing for simulation most important 
CAD task
• We have reviewed several parallel algorithms for 
circuit simulation
• Waveform relaxation method looks most promising 
for parallel processing for most MOS circuits
• Direct methods are good for circuits with stiff 
coupling
• Perhaps a combined approach to exploit different 
levels of parallelism, e.g. combining direct methods 
for clusters of processors on large subcircuits, 
along with waveform relaxation across clusters
• We have also reviewed different parallel approaches 
for logic simulation
• Some combination of conservative and optimistic 
methods in asynchronous simulation is needed
SESSION 4
PARALLEL ALGORITHMS IN
LOGIC SYNTHESIS,
VERIFICATION AND TEST
Outline
Parallel algorithms in logic synthesis 
Parallel algorithms in logic verification 
Parallel algorithms in test generation 
Parallel algorithms in fault simulation
Logic Synthesis Problem
• The logic synthesis area is usually divided into 
two-level synthesis and multilevel synthesis.
• Given a two-level or multilevel circuit description, synthesize another circuit with minimum area (gate count)
• Other objectives such as maximum performance, 
maximum testability also exist
• Two-level logic minimization has been used to 
synthesize programmable logic arrays (PLA) for 
control logic.
• Example two-level synthesis approach: ESPRESSO
• Recently multi-level synthesis has become very 
popular






• Parallel two-level synthesis
- Develop parallel versions of ESPRESSO heuristics, e.g. REDUCE, EXPAND, IRREDUNDANT cover
- Basically, maintain the same ordered list of cubes as the sequential algorithm
- Processors remove cubes off a shared queue and apply REDUCE, EXPAND, etc. to them
- Implemented on shared memory multiprocessors, speedups of 6 on 8 processors
• Parallel multilevel synthesis by partitioning
- Partition circuit into subcircuits, synthesize each partition independently, reconnect
- Advantage is that large speedups are easy to get, but disadvantage is that the synthesis process only performs local optimization
- Applicable to any synthesis method
• Parallel multilevel synthesis on entire circuit
- Much harder to parallelize, need to parallelize the algorithm itself, e.g. MIS or transduction
Transduction Method of Logic Synthesis
• Consider loop-free multilevel circuits, consisting 
of AND, OR, NAND, NOR and INVERTERS 
and other gates
• Output function of a gate/connection: collection of signal values taken for all possible (2^n for n inputs) primary input combinations
• Set of permissible functions (SPF) of a gate/connection: a set of output functions that a gate/connection is allowed to take without changing the output function on any primary output.
-  Maximal SPF (MSPF)
-  Compatible SPF (CSPF)











• Pruning - Removal of redundant connection/gate 
from the network.
• If removal of a gate/connection does not change the output function of any primary output (PO), then it is redundant
• Any other gate/connection that drives only that redundant gate/connection can also be removed.
• If CSPF is used to detect redundancy instead of 
MSPF, all redundant connections can be removed 
simultaneously.
Transduction Method (Contd)
• Gate Substitution - A gate is substituted by 
another gate
• Network will have one less gate
• Gate G1 can be substituted by gate G2 if and only if
output-function(G2) ∈ CSPF(G1)
[Figure: gate G1 can be substituted by G2]
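The substitution condition can be illustrated with truth tables stored as bit vectors. This is a minimal sketch under stated assumptions: each function over n inputs is a 2^n-bit Python integer (bit k is the value for input combination k), and a CSPF is represented as a hypothetical `(care, value)` pair, where bits outside `care` are don't-cares. The tutorial itself uses BDDs; the names `in_spf` and `can_substitute` are introduced only for this example.

```python
def in_spf(f, care, value):
    """True if truth table f agrees with `value` on every care minterm."""
    return (f & care) == (value & care)


def can_substitute(out_g2, cspf_g1):
    """G2 may replace G1 if G2's output function lies in G1's CSPF."""
    care, value = cspf_g1
    return in_spf(out_g2, care, value)


# Two inputs a, b; input combination k has a = bit 0 of k, b = bit 1 of k.
A = 0b1010          # a is 1 for combinations 1 and 3
B = 0b1100          # b is 1 for combinations 2 and 3
G1 = A & B          # G1 = a AND b
# Assume G1's value only matters for combinations 0 and 3.
CSPF_G1 = (0b1001, G1)
```

With this CSPF, `can_substitute(A, CSPF_G1)` holds, so the signal `a` could replace gate G1, while the complement of `a` could not.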
Transduction Method (Contd)
• Gate Merging - Two gates are merged into a 
single gate





• Try to form a new gate (AND, OR, NAND, NOR), 
whose input connections are from existing gates
• Output function of new gate must be a subset of the intersection of the CSPFs of G1 and G2
[Figure: G1 and G2 merged to form new AND gate]
Transduction Method of Logic Synthesis
Procedure SYNTHESIS-TRANSDUCTION;
  Read circuit;
  Evaluate output functions from primary inputs to primary outputs;
  Evaluate CSPFs from primary outputs to primary inputs;
  WHILE circuit keeps improving DO
    WHILE circuit keeps improving DO
      IF circuit changed THEN
        Evaluate output functions and CSPFs;
      IF redundant link found THEN
        Perform gate pruning;
    END WHILE
    WHILE circuit keeps improving DO
      IF circuit changed THEN
        Evaluate output functions and CSPFs;
      Perform gate substitution transformation;
    END WHILE
    WHILE circuit keeps improving DO
      IF circuit changed THEN
        Evaluate output functions and CSPFs;
      Perform gate merging transformation;
    END WHILE
  END WHILE
End Procedure






• Copy of entire circuit is shared among processors
• Processors cooperate to perform transformations 
in parallel
• Update of circuit is performed under locks
• Shared MIMD parallel versions of each transformation implemented
• Results on Encore shared memory multiproces­




Shared MIMD Parallel Output Function
• Levelize circuit
• Evaluate all gates at a particular level before proceeding to the next level
• Processors remove gates of each level from a shared queue
Procedure SM-PARALLEL-TRANSDUCTION-OUTPUT-FUNCTION;
  Partition the circuit into levels from primary inputs to primary outputs;
  Insert gates of each level into separate queue;
  FOR each level L = 1 to maxlevel DO
    FORALL processors in PARALLEL DO
      WHILE there are gates on level L DO
        Pick gate of level L from queue;
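The levelization step that feeds the per-level queues can be sketched as follows. This is an illustrative sequential sketch, assuming a loop-free circuit given as a hypothetical dictionary `fanins` that maps each gate name to the names of its fanins; any name that is not a key is treated as a primary input at level 0. In the shared-memory scheme above, each returned list would be placed in its own shared queue.

```python
def levelize(fanins):
    """Return gates grouped by level, from primary inputs toward outputs.

    A gate's level is one more than the highest level among its fanins,
    so every gate at level L depends only on levels below L.
    """
    level = {}

    def visit(name):
        if name not in fanins:        # primary input
            return 0
        if name not in level:
            level[name] = 1 + max((visit(f) for f in fanins[name]), default=0)
        return level[name]

    for gate in fanins:
        visit(gate)

    buckets = {}
    for gate, lvl in level.items():
        buckets.setdefault(lvl, []).append(gate)
    return [sorted(buckets[lvl]) for lvl in sorted(buckets)]
```

The recursion is fine for small examples; a very deep circuit would need an iterative traversal instead.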






Shared MIMD Parallel Gate Substitution
[Figure: gates held in shared queues; process 1 dequeues gate A from Q1]
• In each iteration generate a set of pairs of gates (v_i, v_j), such that in each pair the gate v_i can substitute for the corresponding gate v_j.
• Each processor p marks a unique gate, v_jp, by scanning the circuit from primary outputs to inputs
• The processor will then search the circuit for another gate, v_ip, which can substitute for v_jp; this search proceeds from the primary inputs to outputs
Shared MIMD Parallel Gate Merging
• Every processor p marks a unique gate, v_ip, by scanning the circuit, starting with the primary outputs, for the first gate which has not been marked by any processor
• Beginning with the primary outputs, the respective processor will search the partition for another gate, v_jp, such that its CSPF intersects with that of v_ip
• After obtaining the pair (v_ip, v_jp), the processor tries to synthesize a third gate, v_kp, which can replace both v_ip and v_jp
• If this is successful, the processor will then inform the other processors working on the same circuit to halt.
• The processor will perform the gate merging operation.
• On the other hand, if the processor cannot find a suitable v_jp for the gate v_ip, or is unable to synthesize v_kp for both v_ip and v_jp, it will look for another v_ip and repeat the searching process
Dist. MIMD Parallel Transduction
• Logic network is divided into a set of non-overlapping partitions; however, the partitioning is for parallelization purposes only, and the transformations and the optimizations are performed on the entire network
[Figure: partitions of the logic network assigned to processors]
• For each type of transformation, some coarse-grained objects are created and load balanced among the processors
• Parallel versions of each transformation are proposed and are executed simultaneously and asynchronously
Dist. MIMD Function Evaluation
• Evaluation of output functions proceeds from the primary inputs to outputs
• The output function of a gate can be evaluated if "valid" output functions of its fanins are available
• If the other partition that needs the output function is on a different processor, the BDD is sent to the other processor through work managers
Coherence Mechanism
• Since processes perform different optimizations simultaneously on the network, the output functions and the CSPFs keep changing; hence coherence needs to be maintained
• Keep a tag with each BDD, called the version
• With each gate and connection, we keep the current version number of the output function and the CSPF. A BDD is "valid" if its version number is current.
• The version numbers are used to prevent any "illegal" transformation done using an "invalid" BDD
• Initially, the versions of output functions and the CSPFs are set to 1.
• If due to any transformation the output function or CSPF of a gate or connection becomes "invalid", the corresponding version number attached to it is incremented by 1.
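The version-tag rule can be sketched as a small table of current version numbers. This is a minimal illustration, not the tutorial's implementation: the `CoherenceTable` class and its method names are hypothetical, and the BDD is stood in for by an arbitrary Python object.

```python
class CoherenceTable:
    """Tracks the current version of each gate's function (hypothetical sketch)."""

    def __init__(self, names):
        # Initially every version number is 1.
        self.version = {name: 1 for name in names}

    def tag(self, name, bdd):
        # Attach the current version to a copy of the function before sending it.
        return (name, self.version[name], bdd)

    def invalidate(self, name):
        # A transformation changed this function: bump its version by 1.
        self.version[name] += 1

    def is_valid(self, tagged):
        # A tagged BDD is valid only if its version is still current.
        name, version, _ = tagged
        return version == self.version[name]
```

A processor holding a tagged copy would check `is_valid` before using it to justify a transformation, and request a fresh copy when the check fails.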
Dist. MIMD Parallel Gate Substitution
• For each partition pair, create a gate substitution 
object
• After the substitution is performed, the output 
functions of all the successors of the substituted 
gate will be invalid as well as the CSPFs of the 
predecessors of the substitute gate.
Dist. MIMD Parallel Gate Substitution
Procedure DM-PARALLEL-TRANSDUCTION-GATESUB;
  Partition circuit into subcircuits;
  FOR each subcircuit i DO
    FOR each subcircuit j DO
      Create a gate substitution object (i,j);
      Assign gate substitution object (i,j) to a processor;
    END FOR
  END FOR
  FORALL processors in PARALLEL DO
    WHILE there are gate_sub_objects (i,j) on processor DO
      (1) /* When gate_sub_object (i,j) is awakened */
        Collect CSPFs of all gates in partition i
        Collect OutFunct of all gates in partition j
        IF any CSPF or OutFunct is not available
          THEN request work manager to get it
        Perform pairwise comparison of available CSPFs and OutFunct
        IF possible gate substitution found
          THEN ask master processor for permission
        Ask partition i to inform if any CSPF changes in partition i
        Ask partition j to inform if any OutFunct changes in partition j
      (2) /* When gate_sub_object (i,j) receives a new CSPF */
        Compare this CSPF with all available OutFunct of partition j
        IF possible gate substitution found
          THEN ask master processor for permission
      (3) /* When gate_sub_object (i,j) receives a new output function */
        Compare this OutFunct with all available CSPFs of partition i
        IF possible gate substitution found
          THEN ask master processor for permission
    END WHILE
  END FORALL
End Procedure
Dist. MIMD Parallel Gate Merge
• Create a gate merge object for each pair of partition objects
• If gate merging is possible for a pair of gates, create four gate synthesis objects for AND, OR, NAND, NOR
Dist. MIMD Transduction
• In the transduction method, the user specifies at the beginning the order in which different transformations are to be applied, and the transformations are applied strictly in that order, one after the other
• In the parallel implementation, all transformation objects are created simultaneously, and they transform the circuit asynchronously
• Implemented on Intel iPSC/860 hypercube and network of workstations, speedups of 7 on 8 processors reported
[Figure: transduction algorithm along the time axis, showing gate substitution, generalized gate substitution, gate input reduction, and gate merge transformations]
Logic Verification Problem
• Logic verification tools compare the logic design of integrated circuits at different levels to make sure that no logic errors have been introduced in the synthesis process.
• Verification by justification
• Verification by cube comparison
• Verification by exhaustive simulation
• Verification by cover generation and simulation
• Verification by implicit enumeration
• Verification by tautology checking
Logic Verification by Tautology
• Given functions H and G that need to be compared, and the don't care function D, the logic verification problem is
$F(v) = D(v) + G(v) \cdot H(v) + \overline{G(v)} \cdot \overline{H(v)} = 1$
• Shannon's expansion is used to express the function in terms of the cofactors.
$F = x_j \cdot F_{x_j} + \overline{x_j} \cdot F_{\overline{x_j}}$
[Figure: example cofactor tree obtained by repeated Shannon expansion of a function F of x1, x2, x3; branches end when a cofactor equals 1 (special cases)]
Shared MIMD Parallel Tautology
Procedure TAUTOLOGY(F, i, N, parent);
  i : current level of recursion
  N : number of processors
  r <- SPECIAL-CASES(F);
  IF (r ≠ -1) THEN RETURN(r);
  j <- SELECT-SPLIT(F);
  F_xj <- COFACTOR(F, xj);
  F_xj' <- COFACTOR(F, xj');
  IF (i < (log2(N) - 1)) THEN /* Start new process */
    Start a new process on another processor;
    Send F_xj' to the new processor
    Call F_xj' <- SIMULATE(F, xj') on new processor
    Call TAUTOLOGY(F_xj', i+1, N, newparent)
    F_xj <- SIMULATE(F, xj)
    IF (TAUTOLOGY(F_xj, i+1, N, parent) = 0) THEN
      IF (i1 ≠ i) RETURN(0);
      Send a 0 back to the parent process
    ENDIF
    Wait for the child process to send an answer back
    IF (i1 ≠ i) RETURN(answer);
    Send the answer back to parent process
  ELSE /* Perform serial tautology */
    F_xj <- SIMULATE(F, xj);
    F_xj' <- SIMULATE(F, xj')
    IF (TAUTOLOGY(F_xj) = 0) THEN RETURN(0);
    IF (TAUTOLOGY(F_xj') = 0) THEN RETURN(0);
  ENDIF
  RETURN(1);
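The serial branch of the procedure (special cases, split-variable selection, two cofactors, two recursive calls) can be written compactly in Python. This is a minimal sketch under stated assumptions: the function is a sum-of-products cover stored as a list of cubes, each cube a dictionary from variable name to 0 or 1, and the split heuristic (pick the most frequently used variable) is an assumption of this example, not necessarily the one used in the tutorial.

```python
from collections import Counter


def cofactor(cover, var, val):
    """Cofactor of a cover with respect to var = val."""
    out = []
    for cube in cover:
        if cube.get(var, val) != val:
            continue                  # cube is 0 under this assignment
        out.append({v: b for v, b in cube.items() if v != var})
    return out


def special_cases(cover):
    """Return 1 or 0 when the answer is immediate, else -1."""
    if not cover:
        return 0                      # empty cover is the constant 0
    if any(len(cube) == 0 for cube in cover):
        return 1                      # a cube with no literals is the constant 1
    return -1


def select_split(cover):
    # Assumed heuristic: split on the variable appearing in the most cubes.
    counts = Counter(v for cube in cover for v in cube)
    return counts.most_common(1)[0][0]


def tautology(cover):
    """Return 1 if the cover is identically 1, else 0."""
    r = special_cases(cover)
    if r != -1:
        return r
    var = select_split(cover)
    if tautology(cofactor(cover, var, 1)) == 0:
        return 0
    if tautology(cofactor(cover, var, 0)) == 0:
        return 0
    return 1
```

In the parallel version, the two recursive calls near the top of the recursion tree would run on different processors instead of one after the other.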




• Given a logic circuit, generate test vectors to detect all faults in the circuit, e.g. single stuck-at faults at all lines
• Both pseudo-random and deterministic ATPG algorithms exist
• One popular ATPG algorithm is PODEM
Parallel Test Generation
• Fault Decomposition
- Fault set is divided equally among n processors; each processor can now generate tests for its own fault set independently
- Method increases test length in an integrated test generation / fault simulation environment
- While a test is being generated for one fault, the same fault might be detected by a different test generated for another fault
• Heuristic decomposition
- All ATPG algorithms use a variety of heuristics
- Let each processor follow a different heuristic
- Disadvantage: not many different heuristics exist, and processors might be searching the same space
• AND-parallel Functional Decomposition
- View ATPG as a series of subtasks, run each on a different processor, e.g. justification and propagation
- Disadvantage: conflicts among variable bindings
Parallel Test Generation (Contd)
• OR-parallel Functional Decomposition
- Split search space disjointly among processors by evaluating several choice points (stack) in parallel
- Get best possible speedup
• Circuit partitioned Decomposition
- Partition circuit among processors
- Processors perform justification and propagation within subcircuit boundaries, and cooperate among themselves
- Variable binding conflicts
- Not possible to extract good speedups
Fault-Parallel Test Generation
• Partition fault list using heuristics, allocate each set to one processor
• Consider compatibility of faults: two faults are compatible if a single vector exists that detects both
• Assume it takes 10 time units to generate a test and R time units for fault simulation
• Total time for TG/FS on a uniprocessor is (10 + 5 + 10 + 3 + 10 = 38) time units and test length is 3
• For 2 processors with fault partitions {f1, f2, f6} and {f3, f4, f5}, the overall completion time is 30 and test length is 4.
• For 2 processors with fault partitions that group compatible faults together, completion time is 22 and test length is 3
Dist. MIMD Fault Parallel ATPG
Procedure DM-PARALLEL-ATPG-FAULT;
  Partition fault lists statically among processors;
  FORALL processors in PARALLEL DO
    REPEAT
      WHILE processor's own fault list not empty DO
        Select a fault f_i from own fault list;
        Perform ATPG to determine test for fault f_i;
        t_i = PODEM-COMBINATIONAL-ATPG(f_i);
        Fault simulate on own fault list for test t_i;
        Remove detected faults from own fault list;
        Check messages from other processors for load balance;
      END WHILE
      IF processor fault list empty THEN
        Get work from other processors
        Query other processors for fault list
        Receive half fault list of most loaded processor
    UNTIL all faults detected or aborted
  END FORALL
End Procedure
OR-Parallel Search in ATPG
• PODEM views test generation as a search problem in input vector space (search for a solution where multiple solutions exist)
• Parallel search can find a solution faster
• Split the search space in PODEM among processors
Dist. MIMD OR-Parallel ATPG
• Implementation on Intel iPSC hypercube
• Search space splitting implemented by a single-node scheduler
• Scheduling can be made distributed by having each processor act as a scheduler, looking for work when its own search space is exhausted
• Speedups of 12-14 on 16 processors reported
• Superlinear speedups of 28 are sometimes also reported for a few circuits due to search anomalies





Read circuit and list of faults;
Select a fault;
FORALL processors in PARALLEL DO 
WHILE test not found DO
Check messages from scheduler;
IF a TESTFOUND, NOTEST or ABORT message received THEN 
Terminate search;
Execute one iteration of PODEM search loop;
IF a test is generated for the given fault THEN 
Report test;
Send TESTFOUND messages to scheduler;
Terminate search;
IF backtrack limit is exceeded THEN 
Terminate search;
IF local search space is exhausted THEN 
Send a WRK-REQ message to scheduler 
Receive portion of search space from scheduler 
IF all search spaces exhausted and no test found 
THEN Send NOTEST messages;
IF a WRK-REQ message is received from scheduler 




• Given a circuit, a list of faults, and a set of input vectors
• Objective is to find the fraction of total faults (also referred to as fault coverage) which are detected
• Each row below corresponds to a machine (good or with a single fault) run for all input vectors
• Each column corresponds to the runs of all machines for a particular test vector
• If the result of a row (fault) differs from the good row, that fault is detected

           V1     ...   Vj     ...   Vn
Good       G1     ...   Gj     ...   Gn
Fault 1    F1,1   ...   F1,j   ...   F1,n
Fault 2    F2,1   ...   F2,j   ...   F2,n
...
Fault i    Fi,1   ...   Fi,j   ...   Fi,n
...
Fault m    Fm,1   ...   Fm,j   ...   Fm,n
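The row comparison in the table translates directly into code. This is a minimal sketch, assuming the good-machine responses are given as a list with one entry per input vector and the faulty-machine responses as a hypothetical dictionary from fault name to a row of the same length; the function name `fault_coverage` is introduced only for this example.

```python
def fault_coverage(good, faulty):
    """Return (set of detected faults, fraction of faults detected).

    A fault is detected if its row differs from the good row
    for at least one input vector.
    """
    detected = {
        fault
        for fault, row in faulty.items()
        if any(f != g for f, g in zip(row, good))
    }
    return detected, len(detected) / len(faulty)
```

For example, with good row `[0, 1, 1]` and three faults of which two differ from it in some column, the coverage is 2/3.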
Parallel Fault Simulation
• Fault Decomposition
- Partition faults across processors; each processor performs fault simulation on the entire input vector set for its own faults
- Can obtain linear speedups, applicable to combinational and sequential circuits
- Problem is that good machine simulation has to be performed for each partition
- Depending on partitioning of faults, fault activity may not be uniform, resulting in load imbalance
• Input Pattern Decomposition
- Partition input patterns into different sets; each processor performs fault simulation on all the faults for a subset of inputs
- Possible to obtain linear speedups, but fault dropping cannot be handled easily
- Applicable only to combinational circuits, since the state of a sequential circuit depends on vector sequences
Parallel Fault Simulation (Contd)
• Circuit Decomposition
- Partition circuit into subcircuits, assign each subcircuit to a processor, perform simulation passes as in logic simulation
- Speedup depends on partitioning; lots of messages and synchronization
• Pipeline Decomposition
- A special case of circuit partitioning, where all gates of a level are assigned to a processor and simulation is pipelined
- Can have load imbalances, with some levels having many gates and others having few
Fault Parallel Fault Simulation
• First approach: statically partition faults among processors using some heuristics
• Each processor performs fault simulation for the entire circuit and all input vectors on its own set of faults
• Second approach: for good load balance, processors get sets of faults from a scheduler and perform fault simulation on each set
• Second approach performs good machine simulation as many times as sets of faults are received, which is redundant work
• Implementation on network of workstations showed speedups of 5 on 6 processors
• Can be easily implemented on shared memory MIMD multiprocessors
Dist. MIMD Fault Parallel FS
Procedure DM-PARALLEL-FAULTSIM-FAULT;
  Read circuit description;
  Read list of faults;
  Partition fault list;
  Read input patterns;
  Create 1 Client and P-1 Server processes;
  IF (Client (master) process) THEN
    Initialize server processors;
    Broadcast circuit copy to all servers;
  FORALL processors in PARALLEL DO
    WHILE there are simulations to perform DO
      IF (Server (slave) process) THEN
        Send fault simulation results to the client
        Request client for a new partition of faults
      IF (Client process) THEN
        Receive fault simulation results from servers;
        Send a new partition of faults to server process
      IF (Server process) THEN
        Repartition the circuit for the
        partition number it received






Input Parallel Fault Simulation
• Partition input vector set
• Each processor performs fault simulation on the entire list of faults on the entire circuit for its own input set
• Method applicable to combinational circuits only
• Implemented on IBM RP3 shared memory MIMD multiprocessor, speedups of 3.9 on 4 processors obtained
• Implemented on distributed MIMD network of workstations, speedups of 9 on 10 processors measured
• Novelty in this parallel algorithm is in fault dropping
• Each processor takes a set of input vectors and performs fault simulation on the entire circuit
• Size of sets of input vectors simulated per pass is dependent on latency of communication, i.e. how








Read list of faults;
FORALL processors in PARALLEL DO
WHILE there are input vectors left to simulate DO 
Receive next group of input vectors;
Initialize primary inputs with the test vectors;
Perform fault-free simulation;
Perform fault simulation;
Place detected faults in an output queue for distribution 
to other processors;
Check input queue for faults detected externally;




Report fault coverage and undetected faults;
End Procedure
SIMD Parallel Fault Simulation
• Map each gate, primary input and primary output to a processor of an SIMD machine, called a gate-PE
• Map each connection between gates to a processor called a link-PE
• Circuit is rank ordered by levels of gates from primary inputs to outputs
• Perform fault-free simulation for N input patterns at a time
• Inject a fault into the network and perform fault simulation for N input patterns
• Check if fault is detected by input patterns
• Implemented on Connection Machine CM-2
Experimental results reported 330,000 gate evalu­




SIMD Parallel Fault Simulation
Procedure SIMD-PARALLEL-FAULTSIM;
  Map each gate, primary input/output to gate-PE;
  Map each connection to link-PE;
  Perform preprocessing such as rank ordering;
  FORALL processors in PARALLEL DO
    WHILE input vectors remaining AND undetected faults DO
      Load next group of N input vectors;
      Initialize primary inputs with the test vectors;
      (Patterns stored as vector of N bits)
      Perform fault-free simulation for N input patterns;
      FOR level = 1 to max_level DO
        Select ALL link-PEs at current level;
        Link-PEs read logic values from source gate-PEs;
        Link-PEs send logic values to destination gate-PEs;
        Select gate-PEs at current level;
        Gate-PEs combine logic values they receive
          using appropriate reduction operations;
      ENDFOR
      Inject one fault into network;
      Perform fault simulation;
      FOR level = 1 to max_level DO
        Select ALL link-PEs at current level;
        Link-PEs read logic values from source gate-PEs;
        Link-PEs send logic values to destination gate-PEs;
        Select gate-PEs at current level;
        Gate-PEs combine logic values they receive
          using appropriate reduction operations;
        Increment current level and repeat above steps;
      ENDFOR
      Compare outputs of fault-free and faulty networks;
    ENDWHILE
  END FORALL
  Report fault coverage and undetected faults;
End Procedure
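The idea of storing N patterns as a vector of N bits and simulating level by level can be imitated on a single processor with bitwise operations. This is only an analogy to the SIMD scheme, not the Connection Machine implementation: there are no gate-PEs or link-PEs here, and the data layout (`levels`, `gates`, `inputs`) and the function names are assumptions of this sketch. Each signal is a Python integer whose bit k holds the value for pattern k.

```python
def simulate(levels, gates, inputs, n, fault=None):
    """Levelized simulation of n patterns at once.

    levels: lists of gate names in rank order
    gates:  gate name -> (op, list of fanin names), op in AND/OR/NAND/NOR/NOT
    inputs: primary input name -> n-bit integer of pattern values
    fault:  optional (line name, stuck value) single stuck-at fault
    """
    mask = (1 << n) - 1
    values = dict(inputs)

    def inject(name):
        # Force the faulty line to its stuck value for all n patterns.
        if fault is not None and fault[0] == name:
            values[name] = mask if fault[1] else 0

    for name in inputs:
        inject(name)
    for level in levels:
        for g in level:
            op, fanins = gates[g]
            vals = [values[f] for f in fanins]
            if op in ("AND", "NAND"):
                v = mask
                for x in vals:
                    v &= x
            elif op in ("OR", "NOR"):
                v = 0
                for x in vals:
                    v |= x
            elif op == "NOT":
                v = vals[0]
            else:
                raise ValueError(op)
            if op in ("NAND", "NOR", "NOT"):
                v = ~v & mask
            values[g] = v
            inject(g)
    return values


def detecting_patterns(levels, gates, inputs, outputs, n, fault):
    """Bit k of the result is 1 if pattern k detects the fault."""
    good = simulate(levels, gates, inputs, n)
    bad = simulate(levels, gates, inputs, n, fault)
    diff = 0
    for o in outputs:
        diff |= good[o] ^ bad[o]
    return diff
```

A fault is detected by the current group of patterns whenever the returned integer is nonzero, which corresponds to the final comparison of fault-free and faulty outputs in the procedure above.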
Summary
• We have reviewed parallel algorithms for logic synthesis, verification, test generation and fault simulation
• Each of these applications requires a lot of computational resources and can benefit from parallel processing
• For logic synthesis, need to combine circuit partitioning with algorithmic parallelization to handle really large circuits
• For test generation, OR-parallel techniques are most efficient and have been extended to sequential circuits
• For fault simulation, fault parallel is the most general since input parallel can only handle combinational circuits
Future of Parallel CAD
• VLSI circuits of the future will require more powerful CAD tools
• Parallel processing is becoming reality: multiprocessor workstations
• Parallel CAD is becoming popular: Mentor has products called CHECKMATE and PARADE
• Recent research on parallel CAD has been reported for a wide variety of applications
• A book on the subject is appearing in December 1993:
P. Banerjee, "Parallel Algorithms for VLSI CAD", Prentice-Hall, Inc., 1994, 615 pp.
• Two problems facing parallel CAD
- Parallel algorithms are often tuned to a particular parallel architecture
- Parallel algorithms take a long time to develop and get outperformed by sequential algorithms
ProperCAD view of Future
• ProperCAD project at Illinois (Portable object-oriented parallel environment for CAD) tries to address both these problems.
• Develop parallel algorithms portable across machines
• Develop parallel algorithms around good sequen­




Bibliography 1N parallel computing
121 —
131 SGt o L B THnr ;n R - M ‘ BrOWn’ M ‘ Kat° ’ J• Kuck- D J - Slotnick, and R. A 
Aug 1968 ' V C° mPUter- l E E E  ^  C ™ P ' “ r>. 0 -17(^ :746 -75 7 .
f4] Paf^llel processor astern hardware. P r o ce ed in g s
I P S  N a t io n a l  C o m p u te r  C o n fe r e n c e ,  pages 405-410, 1974. 9
[5] K. E. Batcher. Design o f  a massively parallel processor I E E F  T m „ .  
C o m p u te r s ,  C-29(9):836-840, Sep. 1980 I B E E  7>a" 1
161 c a E ,£ ? ■  A M ,  P m , s , „  Diego
1 7 1 A b' r° " i u c ~ *<*.
m e n c a l  M e th o d s . Prent.ce-Hall, Inc., Englewoods Cliffs, Inc. 1989. '
M  D i ^ C A ,  I T m t  *  P " ° " “  S .n
1101 ¡P“ '  The « » .  family o f  concurrent , u p , „ „ , „ .
a n d  A p p l ic a t io n s , pages 33-36, Jan. 1988. m p u ic r s
[12] P. M. Flanders, D. J. Hunt, and S. F. Reddawav Hint, c „  j  n
a n d  A lg o r i th m  O rg a n iz a t io n . Academic Press, New York, 1977. ° m pu  e r
[13] M. J. Flynn. Very highspeed computing systems. P r o c .  o f  I E E E , 54:1901- 
1909, 1966.
[14] G . C . Fox, M. A. Johnson, G. A. Lyzenga, S. W . Otto, and J. K. Salmon. 
S o lv in g  P r o b le m s  o n  C o n c u r r e n t  P r o c e s s o r s . Prentice-Hall, Inc., Englewood 
Cliffs, NJ, 1989.
[15] G . C . Fox and S. W . Otto. Algorithms for concurrent processors. P h y s ic s  
T o d a y , 37(5):50—59, May 1984.
[16] D. Gajski, D. J. Kuck, D. Lawrie, and A. Sameh. Cedar - a large scale 
multiprocessor. P ro c . In t. C o n f . P a r a lle l  P r o c e s s in g , pages 524-529, Aug. 
1983.
[17] J. P. Hayes, T . N. Mudge, Q. T . Stout, S..Colley, and J. Palmer. Architec­
ture o f  a hypercube supercomputer. P r o c .  1 9 8 6  P a r a lle l  P r o c e s s in g  C o n f ,  
pages 653-660, Aug. 1986.
[18] J. L. Hennessy and D. A. Patterson. C o m p u te r  A r c h i t e c tu r e :  A  Q u a n t ita ­
t i v e  A p p r o a c h . Morgan Kaufman Publishers, Inc., San Mateo, CA , 1990.
[19] W . D. Hillis. T h e  C o n n e c t i o n  M a ch in e . MIT Press, Cambridge, M A, 1985.
[20] R. W . Hockney and C. R. Jesshope. P a r a lle l  C o m p u te r s . Adam Hilger, 
Ltd. Bristol, ENGLAND, 1981.
[21] R. M. Hord. P a r a lle l  S u p e rco m p u tin g  in S IM D  A r c h i te c tu r e s . CR C  Press, 
Inc., Boca Raton, FL, 1990.
[22] K. Hwang and F. Briggs. C o m p u te r  A r c h i t e c tu r e  an d  P a r a lle l  P r o c e s s in g .  
McGraw Hill, Inc, 1984.
[23] M. Kallstrom and S. S. Thakkar. Programming three parallel computers. 
I E E E  S o ftw a r e , pages 11-22, Jan. 1988.
[24] A. H. Karp. Programming for parallelism. I E E E  C o m p u te r , pages 43-57, 
May 1987.
[25] D. Lenoski, J. Laudon, K. Gharachorloo, W . D. Weber, A. Gupta, J. Hen­
nessy, M. Horowitz, and M. Lam. The stanford dash multiprocessor. I E E E  
C o m p u te r ,  Mar. 1992.
[26] T . Lovett and S. Thakkar. The sequent symmetry multiprocessor system. 
P r o c .  In t. C o n f .  P a r a lle l  P ro ce s s in g  ( I C P P - 8 8 ) ,  Aug. 1988.
[27] G . F. Pfister, W . C. Brantley, D. A. George, S. L. Harvey, W . J. Kleinfelder, 
K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss. The ibm 
research parallel processor prototype (rp3): Introduction and architecture. 
P r o c .  In t. C o n f .  P a r a lle l  P ro ce s s in g , pages 764-771, 1985.
(28] J. Rattner. Concurrent processing: A new direction in scientific com puting. 
A F I P S  C o n f .  P r o c . ,  54:159-166, 1985.
[29] C . L. Seitz. Th e cosm ic cube, Jan. 1985.
Selected
Bibliography IN Parellel Algorithms for Placement 
^  J and Floor-Planning
[1] S. Arvindam , V . Kumar, and V. Nageshwara Rao. Floorplan optim ization 
on multiprocessors. P r o c .  In i. C o n f .  o n  C o m p u t e r  D e s ig n  ( I C C D - 8 9 ) ,  O ct. 
1989.
[2] P. Banerjee and M. Jones. A parallel simulated annealing for standard 
cell placement on a hypercube computer. P r o c .  I n i .  C o n f .  C o m p u t e r -A i d e d  
D e s ig n ,  pages 34-37, Nov. 1986.
[3] P. Banerjee, M . H. Jones, and J. S. Sargent. Parallel simulated annealing 
algorithm s for standard cell placement on hypercube multiprocessors. I E E E  
T r a n s . P a r a l l e l  a n d  D is t r ib u te d  S y s t e m s ,  1(1):91—106, Jan. 1990.
[4] M . A . Breuer. M in-cut placement. J o u r . D e s ig n  A u t o m a t i o n  a n d  F au lt 
T o le r a n t  C o m p u t in g ,  1:343-382, O ct. 1977.
[5] R. J. Brouwer and P. Banerjee. Phigure:a parallel hierarchical global router. 
2 7 th  D e s ig n  A u t o m a t i o n  C o n f e r e n c e ,  pages 360-364, Jun., 1990.
[6] R. J. Brouwer and P. Banerjee. Paragraph: A parallel algorithm for si­
multaneous placement and routing using hierarchy. P r o c .  E u r o p e a n  D e s ig n  
A u t o m a t i o n  C o n f .  ( E D A C - 9 2 ) ,  Mar. 1992.
[7] M. Burstein and R. Pelavin. Hierarchical wire routing. I E E E  T ra n s . 
C o m p u t e r - A i d e d  D e s ig n ,  CA D -2, no. 4:223-234, Oct. 1983.
[8] A . C asotto, F. R om eo, and A. Sangiovanni-Vincentelli. A parallel simu­
lated annealing algorithm for the placement o f  macro-cells. I E E E  T ra n s . 
C o m p u t e r - A i d e d  D e s ig n , pages 838-847, Sep. 1987.
[9] A . C asotto and A. Sangiovanni-Vincentelli. Placement o f  standard cells 
using simulated annealing on the connection machine. P r o c .  I n i .  C o n f .  
C o m p u t e r - A i d e d  D e s ig n , pages 350-353, Nov. 1987.
(101 C. Cheng and E. Kuh. Module placement based on resistive network opti­
m ization. I E E E  T ra n s . C o m p u te r -A i d e d  D e s ig n ,  3 (7 ):218—225, Jul 1984.
[11] D. J. Chyan and M. A. Breuer. A placement algorithm  for array processors. 
P r o c .  2 0 th  D e s ig n  A u t o m a t i o n  C o n f . ,  pages 182-188, Jun. 1983.
[12] 3. P. Cohoon, S. U. Hegde, W . N. Martin, and D. Richards. Distributed ge­
netic algorithms for the floorplan design problem . I E E E  T ra n s . C o m p u te r -  
A id e d  D e s ig n , 10(4), Apr. 1991.
[13] J. P. Cohoon and W . D. Paris. Genetic placement. I E E E  T ra n s . C o m p u te r -  
A id e d  D e s ig n , pages 956-964, Nov. 1987.
[14] F. Darema, S. Kirkpatrick, and V. A . Norton. Parallel algorithms for chip 
placement by simulated annealing. I B M  J o u r . R e s .  D e v . ,  May 1987.
[15] A E. Dunlop and B. VV. Kernighan. A procedure for placement o f  standard­
cell vlsi circuits. I E E E  T ra n s . C o m p u t e r -A i d e d  D e s ig n  o f  C i r c u i t s  a n d  
S y s te m s ,  C A D -4 (l):9 2 -9 8 , Jan. 1985.
[16] L. K. Grover. Standard cell placement using simulated sintering. Proc.
24th Design Automation Conf., pages 60-66, Jun. 1987.
[17] M. Hanan and J. M. Kurtzberg. Placement Techniques. Editor: M. A.
Breuer, Prentice-Hall, Inc, 1972.
[18] M. Hanan and J. M. Kurtzberg. A review of the placement and the
quadratic assignment problem, Apr. 1972.
[19] M. Hanan and P. K. Wolff. Survey of placement techniques. Jour. Design
Automation and Fault Tolerant Comput., pages 28-61, Oct. 1976.
[20] M. R. Hartoog. Analysis of placement procedures for vlsi standard cell
layout. Proc. 23rd Design Automat. Conf., pages 314-319, June 1986.
[21] T. C. Hu and E. S. Kuh. VLSI Circuit Layout. IEEE Press, New York,
NY, 1985.
[22] M. D. Huang, F. Romeo, and A. Sangiovanni-Vincentelli. An efficient
general cooling schedule for simulated annealing. Proc. Int. Conf.
Computer-Aided Design, pages 381-384, Nov. 1986.
[23] A. Iosupovici, C. King, and M. A. Breuer. A module interchange placement
machine. Proc. 20th Design Automation Conf., pages 171-174, Jun. 1983.
[24] R. Jayaraman and R. Rutenbar. Floorplanning by annealing on a
hypercube multiprocessor. Proc. Int. Conf. Computer-Aided Design, pages
346-349, Nov. 1987.
[25] M. Jones and P. Banerjee. Performance of a parallel algorithm for standard
cell placement on the intel hypercube. Proc. 24th Design Automation Conf.,
pages 807-813, Jun. 1987.
[26] B. W. Kernighan and S. Lin. An efficient heuristic for partitioning graphs.
Bell Syst. Tech. Jour., 49:291-307, Feb. 1970.
[27] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated
annealing. Science, 220:671-680, May 1983.
[28] R. M. Kling and P. Banerjee. Esp: A new standard cell placement package
using simulated evolution. Proc. 24th Design Automation Conf., pages
60-66, Jun. 1987.
[...] 1989.
[...] R. M. Kling and P. Banerjee. Concurrent esp: A placement algorithm for
[...] distributed processors. Proc. Int. Conf. Computer-Aided
Design, pages 354-357, Nov. 1987.
[...] Computer-Aided Design, 10(10):1303-1315, Oct. 1991.
[...] R. M. Kling and P. Banerjee. Optimization by simulated evolution with
application to standard cell placement. 27th Design Automation Conference,
pages 20-25, Jun. 1990.
[34] S. A. Kravitz and R. A. Rutenbar. [...]
multiprocessor. IEEE Trans. Computer-Aided Design, CAD-6( ),
Jun. 1987.
[35] D. P. Lapotin and S. W. Director. Mason: A global floorplanning tool.
Int. Conf. Computer-Aided Design (ICCAD-85), pages [...].
[36] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller.
Equations of state calculations by fast computing machines. Jour. Chem.
Physics, 21:1087-1091, 1953.
[37] Mohan and P. Mazumder. Wolverines: Standard cell placement on a
network of workstations. Technical report, Univ. of Michigan, Ann Arbor,
MI, 1991.
[38] K. Natarajan and S. Kirkpatrick. Evaluation of parallel placement by
simulated annealing: Part i - the decomposition approach. Technical report [...]
[39] K. Natarajan and S. Kirkpatrick. Evaluation of parallel placement by
simulated annealing: Part ii - the flat approach. Technical report, IBM
Technical Report, Nov. 1989.
[40] B. Preas and M. Lorenzetti. Physical Design Automation of VLSI Systems.
Benjamin-Cummings Publishing Co., Menlo Park, CA, 1988.
[41] B. T. Preas and P. G. Karger. Automatic placement: A review of current
techniques. Proc. 23rd Design Automat. Conf., pages 622-629, June 1986.
[42] C. P. Ravikumar and S. Sastry. Parallel placement on hypercube
architectures. Proc. Int. Conf. Parallel Processing (ICPP89), pages III:97-III:100,
Aug. 1989.
[43] J. Rose, W. Klebsch, and J. Wolf. Temperature measurement and
equilibrium dynamics of simulated annealing placements. IEEE Trans.
Computer-Aided Design, 9(10):253-259, Oct. 1990.
[44] J. S. Rose, W. M. Snelgrove, and Z. G. Vranesic. Parallel standard cell
placement algorithms with quality equivalent to simulated annealing. IEEE
Trans. Computer-Aided Design, pages 387-396, Mar. 1988.
[45] Y. Saad and M. H. Schultz. Topological properties of hypercubes. IEEE
Transactions on Computers, 37, No. 7:867-872, Jul. 1988.
[46] J. Sargent and P. Banerjee. A parallel row-based algorithm for standard
cell placement with integrated error control. Proc. 26th Design Automation
Conf., pages 590-594, Jun. 1989.
[47] D. G. Schweikert. A 2-dimensional placement algorithm for the layout of
electrical circuits. Proc. 13th Design Automation Conf., pages 408-416,
Jun. 1976.
[48] C. Sechen and A. Sangiovanni-Vincentelli. The timberwolf placement and
routing package. J. of Solid-State Circuits, 20(2):510-522, 1985.
[49] K. Shahookar and P. Mazumder. Vlsi placement techniques. ACM
Computing Surveys, 23(2):143-220, Jun. 1991.
[50] K. Shahookar and P. Mazumder. A genetic approach to standard cell
placement using meta-genetic parameter optimization. IEEE Trans.
Computer-Aided Design, 9(5):500-511, May 1990.
[51] P. Suaris and G. Kedem. An algorithm for quadrisection and its application
to standard cell placement. IEEE Trans. Circuits and Systems,
35(3):294-303, Mar. 1988.
[52] P. Suaris and G. Kedem. A quadrisection-based combined place and route
scheme for standard cells. IEEE Trans. Computer-Aided Design,
8(3):234-244, Mar. 1989.
[53] R. Tsay, E. Kuh, and C. Hsu. Module placement for large chips based on
sparse linear equations. Int. Jour. Circuit Theory Appl., 16:411-423, 1988.
[54] K. Ueda, T. Komatsubara, and T. Hosaka. A parallel module placement
approach for logic module placement, Jan. 1983.
[55] S. Wimer, I. Koren, and I. Cederbaum. Optimal aspect ratios of building
blocks in vlsi. Proc. 25th Design Automation Conf. (DAC-88), pages 66-72,
Jun. 1988.
[56] C. P. Wong and R. D. Fiebrich. Simulated annealing-based circuit
placement on the connection machine system. Proc. Int. Conf. Computer Design,
pages 78-82, Oct. 1987.
[57] D. F. Wong and C. L. Liu. A new algorithm for floorplan design. Proc.
23rd Design Automation Conf., pages 101-107, Jun. 1986.
Selected Bibliography in Parallel Algorithms for Routing
[1] H. G. Adshead. Employing a distributed array processor in a dedicated
gate array layout system. IEEE Int. Conf. Circuits and Computers, pages
411-414, Sep. 1982.
[2] S. Akers. Routing, volume 1. Prentice-Hall, Inc. Editor: M. A. Breuer,
Englewood Cliffs, NJ, 1972.
[3] T. Blank, M. Stefik, and W. vanCleemput. A parallel bit map architecture
for da algorithms. Proc. 18th Design Automation Conference, pages
837-845, Jun. 1981.
[4] R. Brouwer and P. Banerjee. A parallel simulated annealing algorithm
for channel routing on a hypercube multiprocessor. Proc. Int. Conf. on
Computer Design (ICCD-88), pages 4-7, Oct. 1988.
[5] R. J. Brouwer. Parallel algorithms for placement and routing in VLSI
design. PhD. Thesis, Report no. CRHC-91-2, Univ. of Illinois, Feb. 1991.
[6] R. J. Brouwer and P. Banerjee. Phigure: a parallel hierarchical global router.
27th Design Automation Conference, pages 360-364, Jun. 1990.
[7] M. Burstein and R. Pelavin. Hierarchical channel router. Proc. 20th Design
Automation Conf., pages 591-597, June 1983.
[8] M. Burstein and R. Pelavin. Hierarchical wire routing. IEEE Trans.
Computer-Aided Design, CAD-2, no. 4:223-234, Oct. 1983.
[9] S. C. Chang and J. JaJa. Parallel algorithms for channel routing in the
knock-knee model. Proc. Int. Conf. Parallel Processing, pages 18-25, Aug.
1988.
[10] D. Deutsch. A dogleg channel router. Proc. 13th Design Automation, pages
425-433, Jun. 1976.
[11] A. Hashimoto and J. Stevens. Wire routing by optimizing channel
assignment. Proc. 8th Design Automation Conf., pages 214-224, June 1971.
[12] D. W. Hightower. A solution to the line routing problem on a continuous
plane. Proc. 6th Design Automation Workshop, pages 1-24, Jun. 1969.
[13] C. Lee. An algorithm for path connections and its applications. IRE Trans.
on Electronic Computers, VEC-10:346-365, Sep. 1961.
[14] H. W. Leong, D. F. Wong, and C. L. Liu. A simulated annealing channel
router. Proc. 22nd Design Automation Conf., pages 226-228, June 1985.
[15] M. Martonosi and A. Gupta. Tradeoffs in message passing and shared
memory implementations of a standard cell router. Proc. Int. Conf. Parallel
Processing (ICPP89), pages III-88, III-96, Aug. 1989.
[16] K. Mikami and K. Tabuchi. A computer program for optimal routing of
printed circuit board connections. IFIPS Proc., H47:1475-1478, 1968.
[17] E. F. Moore. The shortest path through a maze. Annals of the Computation
Laboratory of Harvard University, 30:285-292, 1959.
[18] R. Nair, S. J. Hong, S. Liles, and R. Villani. Global wiring on a wire routing
machine. Proc. 19th Design Automation Conference, pages 224-231, Jun.
1982.
[19] O. A. Olukotun and T. N. Mudge. A preliminary investigation into parallel
routing on a hypercube. Proc. Design Automation Conf., pages 814-820,
Jun. 1987.
[20] B. Preas and M. Lorenzetti. Physical Design Automation of VLSI Systems.
Benjamin-Cummings Publishing Co., Menlo Park, CA, 1988.
[21] R. L. Rivest and C. M. Fiduccia. A greedy channel router. Proc. 19th
Design Automation Conf., pages 418-424, Jun. 1982.
[22] J. Rose. Locusroute: A parallel global router for standard cells. Proc.
Design Automation Conf., pages 189-195, Jun. 1988.
[23] J. Rose. Parallel global routing for standard cells. IEEE Trans.
Computer-Aided Design, pages 1085-1095, Oct. 1990.
[24] R. A. Rutenbar, T. N. Mudge, and D. E. Atkins. A class of cellular
architectures to support physical design automation, Oct. 1984.
[25] S. Sahni and Y. Won. A hardware accelerator for maze routing. Proc.
Design Automation Conf., pages 800-806, Jun. 1987.
[26] K. Shamsa and M. Breuer. A hardware router. Jour. Digital Systems,
4(4):393-408.
[27] R. Venkateswaran and P. Mazumder. A hexagonal array machine for
multi-layer wire routing. IEEE Trans. Computer-Aided Design,
CAD-9(10):1096-1112, Oct. 1990.
[28] T. Watanabe, H. Kitazawa, and Y. Sugiyama. A parallel adaptable routing
algorithm and its implementation on a two dimensional array processor.
IEEE Trans. Computer-Aided Design, CAD-6(2):241-250, Mar. 1987.
[29] Y. Won and S. Sahni. Maze routing on a hypercube multiprocessor
computer. Proc. Int. Conf. Parallel Processing, pages 630-637, Aug. 1987.
[30] T. Yoshimura and E. S. Kuh. Efficient algorithms for channel routing.
IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems,
CAD-1:25-25, Jan. 1982.
[31] M. R. Zargham. Parallel channel routing. Proc. Design Automation Conf.,
pages 128-133, Jun. 1988.
Selected Bibliography in Parallel Algorithms for Layout Analysis and Verification
[1] H. G. Adshead. Employing a distributed array processor in a dedicated
gate array layout system. IEEE Int. Conf. Circuits and Computers, pages
411-414, Sep. 1982.
[2] C. M. Baker. Artwork analysis tools for integrated circuits. Technical
report, Massachusetts Inst. of Tech., 1980.
[3] K. Belkhale and P. Banerjee. An approximate algorithm for the
partitionable independent task scheduling problem. Proc. Int. Conf. on Parallel
Processing, 1:72-75, Aug. 1990.
[4] K. P. Belkhale and P. Banerjee. Recursive partitions on multiprocessors.
Proc. 5th Distributed Memory Computing Conference, Apr. 1990.
[5] K. P. Belkhale and P. Banerjee. Geometric connected component labeling
on a hypercube multiprocessor. Proc. Int. Conf. on Parallel Processing
ICPP90, III:291-295, Aug. 1990.
[6] K. P. Belkhale and P. Banerjee. Parallel algorithms for vlsi circuit
extraction. IEEE Trans. Computer-Aided Design, 10(2):604-618, May 1991.
[7] K. P. Belkhale and P. Banerjee. Pace: A parallel vlsi circuit extractor on
the intel hypercube multiprocessor. Proc. Int. Conf. on Computer-Aided
Design (ICCAD-88), pages 326-329, Nov. 1988.
[8] K. P. Belkhale and P. Banerjee. Pace2: An improved parallel vlsi extractor
with parametric extraction. Proc. Int. Conf. Computer-Aided Design, pages
526-530, Nov. 1989.
[9] K. P. Belkhale and P. Banerjee. A parallel algorithm for hierarchical circuit
extraction. Proc. Int. Conf. Compt. Aided Design (ICCAD-90), Nov. 1990.
[10] K. P. Belkhale and P. Banerjee. Parallel algorithms for geometric
connected component labeling on hypercube multiprocessors. IEEE Trans.
Computers, To Appear.
[11] J. B. Bentley and T. Ottman. The complexity of manipulating
hierarchically defined sets of rectangles. Technical report, Computer Science Dept.,
Carnegie-Mellon University, Apr. 1981.
[12] G. E. Bier and A. R. Pleszkun. An algorithm for design rule checking on a
multiprocessor. Proc. Design Automation Conf., pages 299-303, Jun. 1985.
[13] T. Blank, M. Stefik, and W. vanCleemput. A parallel bit map architecture
for da algorithms. Proc. 18th Design Automation Conference, pages
837-845, Jun. 1981.
[14] E. Carlson and R. Rutenbar. Design and performance evaluation of new
massively parallel vlsi mask verification algorithms in jigsaw. Proc. 27th
Design Automation Conf., pages 253-259, Jun. 1990.
[15] E. Carlson and R. Rutenbar. A scanline data structure processor for vlsi
geometric checking. IEEE Trans. Computer-Aided Design, pages 780-794,
Sep. 1987.
[16] E. C. Carlson and R. A. Rutenbar. Mask verification on the connection
machine. Proc. Design Automation Conf., pages 134-140, Jun. 1988.
[17] K. W. Chiang. Resistance extraction and resistance calculation in goalie2.
Proc. Design Automation Conf., pages 682-685, Jun. 1989.
[18] D. T. Fitzpatrick. Mextra: A manhattan circuit extractor. Technical
report, UC Berkeley, Jan. 1982.
[19] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM J.
Appl. Math., 17:263-269, 1969.
[20] F. Gregoretti and Z. Segall. Analysis and evaluation of vlsi design rule
checking implementation in a multiprocessor. Proc. Int. Conf. Parallel
Processing, pages 7-14, Aug. 1984.
[21] A. Gupta. Ace: A circuit extractor. Proc. 20th Design Automation Conf.,
pages 721-725, Jun. 1983.
[22] R. Hon and A. Gupta. HEXT: A Hierarchical Circuit Extractor. Computer
Science Press, 1983.
[23] U. Lauther. An o(n log n) algorithm for boolean mask operations. Proc.
18th Design Automation Conf., Jul. 1981.
[24] S. Levitin. MACE: A Multiprocessing Approach to Circuit Extraction. MIT,
1986.
[25] J. Marantz. Exploiting parallelism in vlsi cad. Proc. Int. Conf. Computer
Design, Oct. 1986.
[26] S. P. McCormick. Excl: A circuit extractor of ic designs. Proc. 21st Design
Automation Conf., pages 624-628, June 1984.
[27] M. E. Newell and D. T. Fitzpatrick. Exploiting structure in integrated
circuit design analysis. Conf. on Advanced Research in VLSI, pages 84-92,
1982.
[28] R. D. Nielson. Algorithmically accelerated cad. VLSI Systems Design, Feb.
1986.
[29] R. A. Rutenbar, T. N. Mudge, and D. E. Atkins. A class of cellular
architectures to support physical design automation, Oct. 1984.
[30] W. S. Scott and J. K. Ousterhout. Magic's circuit extractor. IEEE Design
and Test, pages 24-34, Feb. 1986.
[31] L. Seiler. A hardware assisted design rule check architecture. Proc. 19th
Design Automation Conference, pages 232-238, Jun. 1982.
[32] S. L. Su, V. B. Rao, and T. N. Trick. Hpex: A hierarchical parasitic circuit
extractor. Proc. 24th Design Automation Conf., pages 566-569, Jun. 1987.
[33] T. G. Szymanski and C. J. Van Wyk. Goalie: a space efficient system for
vlsi artwork analysis. IEEE Design and Test of Computers, 2, no. 3:64-72,
1985.
[34] G. M. Tarolli and W. J. Herman. Hierarchical circuit extraction with detailed
parasitic capacitance. Proc. 20th Design Automation Conf., pages 337-345,
Jun. 1983.
[35] B. Tonkin. Circuit extraction on a message-based multiprocessor. Proc.
27th Design Automat. Conf., pages 260-265, June 1990.
[36] [...] verification of vlsi [...]
Selected Bibliography in Parallel Algorithms for Circuit Simulation
[1] G. Bischoff and S. Greenberg. Cayenne: A parallel implementation of the
circuit simulator spice. Proc. Int. Conf. on Computer-Aided Design, pages
182-185, Nov. 1986.
[2] M-C. Chang and I. N. Hajj. ipride: A parallel integrated circuit simulator
using direct method. Proc. Int. Conf. Computer-Aided Design
(ICCAD-88), pages 304-307, Nov. 1988.
[3] P. Cox, R. Burch, D. Hocevar, and P. Yang. Supple: Simulator utilizing
parallel processing and latency exploitation. Proc. Int. Conf.
Computer-Aided Design (ICCAD-87), pages 368-371, Nov. 1987.
[4] J. T. Deutsch and A. R. Newton. A multiprocessor implementation of
relaxation-based electrical circuit simulation. Proc. 21st Design
Automation Conf., pages 350-357, Jun. 1984.
[5] J. J. Dongarra, F. G. Gustavson, and A. Karp. Implementing linear algebra
algorithms for dense matrices on vector machines. SIAM Review, Vol. 26
no. 1:91-112, 1984.
[6] D. Dumlugol, P. Odent, J. Cockx, and H. De Man. The segmented
waveform relaxation method for mixed-mode switch level simulation of mos
digital circuits and its hardware acceleration on parallel computers. Proc.
Int. Conf. Computer-Aided Design (ICCAD-86), pages 84-87, Nov. 1986.
[7] K. Gallivan, P. Koss, S. Lo, and R. Saleh. A comparison of parallel
relaxation-based circuit simulation techniques. Proc. Electro 88 Meeting,
May 1988.
[8] G. K. Jacob, A. R. Newton, and D. O. Pederson. An empirical analysis of
performance of a multiprocessor-based circuit simulator. Proc. 23rd Design
Automation Conf., pages 588-593, Jun. 1986.
[9] G. K. Jacob, A. R. Newton, and D. O. Pederson. Direct method circuit
simulation using multiprocessors. Proc. Int. Symp. Circuits and Systems,
May 1986.
[10] T. A. Johnson and D. J. Zukowski. Waveform relaxation based circuit
simulation on the victor (v256) parallel processor. Technical report, IBM
Technical Report, Jun. 1991.
[11] R. Lucas, T. Blank, and J. Tieman. A parallel solution method for
large scale systems of equations. Proc. Int. Conf. Computer-Aided Design
(ICCAD-86), pages 178-181, Nov. 1986.
[12] T. Nakata, N. Tanabe, H. Onozuka, T. Kurobe, and N. Koike. A
multiprocessor system for modular circuit simulation. Proc. Int. Conf.
Computer-Aided Design (ICCAD-87), Nov. 1987.
[13] P. Sadayappan and V. Viswanathan. Circuit simulation on shared memory
multiprocessors. IEEE Trans. Computers, Vol. 37, no. 12:1634-1642, Dec.
1988.
[14] R. Saleh. Nonlinear relaxation algorithms for circuit simulation. Technical
report, Ph.D. Thesis, Univ. California, Berkeley, 1986.
[15] R. A. Saleh, K. A. Gallivan, M. C. Chang, I. N. Hajj, D. Smart, and T. N.
Trick. Parallel circuit simulation on supercomputers. Proc. of IEEE, 77
no. 12:1915-1931, Dec. 1989.
[16] A. Sangiovanni-Vincentelli, L-K. Chen, and L. O. Chua. An efficient
heuristic cluster algorithm for tearing large-scale networks. IEEE Trans. Circuits
and Systems, CAS-24 No. 12, Dec. 1977.
[17] D. Smart and T. Trick. Increasing parallelism in multiprocessor waveform
relaxation. Proc. Int. Conf. Computer-Aided Design (ICCAD-87), pages
360-363, Nov. 1987.
[18] D. Smart and J. White. Reducing the parallel solution time of sparse
circuit matrices using reordered gaussian elimination and relaxation. Proc.
Int. Symp. Circuits and Systems, Jun. 1988.
[19] J. A. Trotter and P. Agrawal. Circuit simulation algorithms on a distributed
memory multiprocessor system. Proc. Int. Conf. Computer-Aided Design
(ICCAD-90), pages 438-441, Nov. 1990.
[20] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Inc, Englewood
Cliffs, NJ, 1962.
[21] D. M. Webber and A. Sangiovanni-Vincentelli. Circuit simulation on the
connection machine. Proc. 24th Design Automation Conf., pages 108-113,
Jun. 1987.
[22] J. White, R. Saleh, A. Sangiovanni-Vincentelli, and A. R. Newton.
Accelerating relaxation algorithms for circuit simulation using waveform
newton, iterative step size refinement and parallel techniques. Proc. Int. Conf.
Computer-Aided Design (ICCAD-85), Nov. 1985.
[23] J. White and A. Sangiovanni-Vincentelli. Relaxation Methods for
Simulation of VLSI Circuits. Kluwer Academic Publishers, Norwell, MA, 1987.
[24] J. White and A. Sangiovanni-Vincentelli. Partitioning algorithms and
parallel implementations of waveform relaxation algorithms for circuit
simulation. Proc. Int. Symp. on Circuits and Systems, pages 221-224, June
1985.
[25] O. Wing and J. W. Huang. A computational model of parallel solution of
linear equations. IEEE Trans. Computers, Vol. C-29:632-638, 1980.
[26] P. Yang, I. N. Hajj, and T. N. Trick. Slate: A circuit simulation program
with latency exploitation and node tearing. Proc. Int. Conf. Circuits and
Systems (ISCAS-80), pages 353-355, Oct. 1980.
[27] D. Yeh and V. B. Rao. Partitioning issues in circuit simulation on
multiprocessors. Proc. Int. Conf. on Computer-Aided Design (ICCAD-88), pages
300-303, Nov. 1988.
[28] C. P. Yuan, R. Lucas, P. Chan, and R. Dutton. Parallel electronic circuit
simulation on the ipsc system. Proc. 1988 Custom Integrated Circuits Conf.
(CICC-88), pages 6.5.1-6.5.4, May 1988.
Selected Bibliography in Parallel Algorithms for Logic Simulation
[1] [...] Y. H. Levendel and P. R. Menon. A logic simulation [...]
and Systems, CAD-2(2):82-93, Apr. 1983.
[2] P. Agrawal, W. J. Dally, A. K. Ezzat, W. C. Fischer, H. V. Jagdish, and
A. S. Krishnakumar. Architecture and design of the mars hardware
accelerator. Proc. 24th Design Automation Conf., pages 108-113, Jun. 1987.
[3] P. Agrawal, W. J. Dally, W. C. Fischer, H. V. Jagdish, A. S.
Krishnakumar, and R. Tutundjian. Mars: A multiprocessor-based programmable
accelerator. IEEE Design and Test, pages 28-36, Oct. 1987.
[4] [...] Terman. A multiprocessor implementation of a [...],
pages 116-118, Nov. 1985.
[5] [...] 1979.
[6] [...] 1987.
[8] [...] and Test of Computers, Feb. [...].
[9] M. A. Breuer and A. D. Friedman. Diagnosis and Reliable Design of Digital
Systems. Computer Science Press, Rockville, MD, 1976.
[11] R. Bryant. Data parallel switch-level simulation. Proc. of Int. Conf.
Computer-Aided Design (ICCAD-88), Nov. 1988.
[12] R. E. Bryant. Simulation of packet communications architecture computer
systems. Technical report, Massachusetts Institute of Technology, 1977.
[13] R. E. Bryant. A switch-level model and simulator for mos digital systems.
IEEE Trans. Computers, C-33(2):160-177, Feb. 1984.
[14] R. Chamberlain and M. A. Franklin. Discrete event simulation on
hypercube architectures. Proc. Int. Conf. Computer-Aided Design (ICCAD-88),
Nov. 1988.
[15] K. M. Chandy and J. Misra. Distributed simulation: A case study in design
and verification of distributed programs. IEEE Trans. Software Engg.,
SE-5(5):440-452, Sep. 1979.
[16] ZYCAD Co. Zycad le-001 and le-1002 logic evaluator - product description,
Jun. 1982.
[17] W. J. Dally and R. E. Bryant. A hardware architecture for switch-level
simulation. IEEE Trans. Computer-Aided Design, pages 239-250, Jul. 1985.
[18] L. N. Dunn. Ibm's engineering design system support for vlsi design and
verification. IEEE Design and Test, pages 30-40, Feb. 1984.
[19] R. Bryant et al. Cosmos: A compiled simulator for switch level circuits.
Proc. of 24th Design Automation Conf., Jun. 1987.
[20] S. B. Tan et al. A fast signature simulation tool for built-in self testing
circuits. Proc. of 24th Design Automation Conf., Jun. 1987.
[21] E. H. Frank. Exploiting parallelism in a switch-level simulation machine.
Proc. Design Automation Conf., pages 209-215, Jun. 1986.
[22] R. M. Fujimoto. Parallel discrete event simulation. Communications of
ACM, 33(3):30-53, Oct. 1990.
[23] A. Gafni. Rollback mechanisms for optimistic distributed simulation
systems.
[24] J. K. Howard, L. Malm, and L. M. Warren. Introduction to the ibm los
gatos logic simulation machine. Proc. IEEE Int. Conf. Computer Design:
VLSI in Computers, pages 580-583, Oct. 1983.
[25] N. Ishiura, H. Yasuura, and S. Yajima. High-speed logic simulation on
vector processors. IEEE Trans. Computer-Aided Design, CAD-6, no.
3:305-321, May 1987.
[26] D. Jefferson. Virtual time. ACM Trans. Programming Languages and
Systems, pages 404-425, July 1985.
[27] S. A. Kravitz, R. Bryant, and R. Rutenbar. Massively parallel
switch-level simulation: A feasibility study. IEEE Trans. Computer-Aided Design,
10(7), Jul. 1991.
[28] Y. H. Levendel, P. R. Menon, and S. H. Patel. Special-purpose computer
for logic simulation using distributed processing. Bell System Tech. Jour.,
61(10):2873-2910, Dec. 1982.
[29] B. D. Lubachevsky. Efficient distributed event-driven simulation of multiple
loop networks. Comm. of ACM, pages 111-123, Jan. 1989.
[30] R. B. Mueller-Thuns, D. G. Saab, R. F. Damiano, and J. A. Abraham.
Portable parallel logic and fault simulation. Proc. Int. Conf.
Computer-Aided Design (ICCAD-89), pages 506-509, Nov. 1989.
[31] S. Nagashima, T. Nakagawa, S. Miyamoto, and K. Omota. Hardware
implementation of velvet on the hitachi s-810 supercomputer. Proc. Int. Conf.
Computer-Aided Design (ICCAD-86), Nov. 1986.
[32] D. M. Nicol. The cost of conservative synchronization in parallel discrete
event simulations. Technical report, ICASE, Jun. 1989.
[33] S. Patil, P. Banerjee, and C. Polychronopoulos. Efficient circuit partitioning
algorithms for parallel logic simulation. Proc. Supercomputing Conf., pages
361-364, Nov. 1989.
[34] G. Pfister. The yorktown simulation engine: Introduction. Proc. 19th
Design Automation Conference, pages 51-54, Jun. 1982.
[35] R. Raghavan, J. P. Hayes, and W. R. Martin. Logic simulation on vector
processors. Proc. Int. Conf. Computer-Aided Design (ICCAD-88), Nov.
1988.
[36] A. E. Ruehli and G. S. Ditlow. Circuit analysis, logic simulation, and design
verification for vlsi. Proc. of IEEE, 71(1), Jan. 1983.
[37] M. Smith. A hardware switch-level simulator for large mos circuits. Proc.
24th Design Automation Conf., Jun. 1987.
[38] S. P. Smith, B. Underwood, and J. Newman. An analysis of parallel logic
simulation on several architectures. Proc. Int. Conf. Parallel Processing,
pages 65-68, Aug. 1988.
[39] S. P. Smith, W. Underwood, and M. R. Mercer. An analysis of several
approaches to circuit partitioning for parallel logic simulation. Proc. of
Int'l Conf. on Computer Design (ICCD87), pages 664-667, 1987.
[40] L. Soule and T. Blank. Statistics for parallelism and abstraction level in
digital simulation. Proc. 24th ACM/IEEE Design Automation Conf., Jun.
1987.
[41] L. Soule and T. Blank. Parallel logic simulation on general purpose
machines. Proc. Design Automation Conf., pages 166-171, Jun. 1988.
[42] L. Soule and A. Gupta. Characterization of parallelism and deadlocks in
distributed digital logic simulation. Proc. 26th Design Automation Conf.,
pages 81-86, Jun. 1989.
[43] C. Terman. Rsim: A logic level timing simulator. Proc. Int. Conf. on
Computer Design, pages 437-440, Oct. 1983.
[44] D. West. Optimizing time warp: Lazy rollback and lazy reevaluation.
Technical report, M.S. Thesis, Univ. of Calgary, Jan. 1988.
Selected Bibliography in Parallel Algorithms for Test Generation and Fault Simulation
[1] P. Agrawal, V. Agrawal, K. T. Cheng, and R. Tutundjian. Fault simulation
in a pipelined multiprocessor system. Proc. Int. Test Conf. (ITC-89), pages
727-734, Aug. 1989.
[2] P. Agrawal, W. J. Dally, A. K. Ezzat, W. C. Fischer, H. V. Jagdish, and
A. S. Krishnakumar. Architecture and design of the mars hardware
accelerator. Proc. 24th Design Automation Conf., pages 108-113, Jun. 1987.
[3] P. Agrawal, W. J. Dally, W. C. Fischer, H. V. Jagdish, A. S.
Krishnakumar, and R. Tutundjian. Mars: A multiprocessor-based programmable
accelerator. IEEE Design and Test, pages 28-36, Oct. 1987.
[4] V. D. Agrawal, K. T. Cheng, and P. Agrawal. Contest: A concurrent test
generator for sequential circuits. Proc. 25th Design Automation Conf., Jun.
1988.
[5] S. B. Akers, C. Joseph, and B. Krishnamurthy. On the role of independent
fault sets in the generation of minimal test sets. Proc. IEEE Int'l Test
Conf., pages 1100-1107, Oct. 1987.
[6] D. B. Armstrong. A deductive method for simulating faults in logic circuits.
IEEE Trans. Comp., C-21:462-471, May 1972.
[7] S. Arvindam, V. Kumar, V. N. Rao, and V. Singh. Automatic test
pattern generation on parallel processors. Technical report, Computer Science
Dept, Univ. of Minnesota, May 1990.
[8] R. H. Bell, R. H. Klenke, J. Aylor, and R. D. Williams. Results of a
topologically partitioned parallel automatic test pattern generation system on
a distributed memory multiprocessor. Proc. Application Specific Integrated
Circuits Conference, Sep. 1992.
[9] R. G. Bennetts. Design of Testable Logic Circuits. Addison-Wesley,
Reading, MA, 1984.
[10] R. G. Bennetts et al. Camelot: A computer-aided measure for logic
testability. Proc. IEEE Autotestcon, pages 177-189, Sep. 1981.
[11] M. A. Breuer and A. D. Friedman. Diagnosis and Reliable Design of Digital
Systems. Computer Science Press, Rockville, MD, 1976.
[12] F. Brglez and H. Fujiwara. A neutral netlist of 10 combinational benchmark
circuits and a target translator in fortran. Special Session on ATPG and
Fault Simulation, Proc. 1985 IEEE Int. Symp. Circuits and Systems, Jun.
1985.
[13] S. Chakradhar, M. L. Bushnell, and V. D. Agrawal. Towards massively
parallel automatic test generation. IEEE Trans. Computer-Aided Design,
pages 981-994, Sep. 1990.
[14] S. Chandra and J. H. Patel. Test generation in a parallel processing
environment. Proc. Int. Conf. Comp. Design (ICCD-88), pages 11-14, Oct.
1988.
[15] S. J. Chandra and J. H. Patel. Experimental evaluation of testability
measures for test generation. IEEE Trans. Computer-Aided Design, 8:93-98,
Jan. 1989.
[16] K. M. Chandy and J. Misra. Distributed simulation: A case study in design
and verification of distributed programs. IEEE Trans. Software Engg.,
SE-5(5):440-452, Sep. 1979.
[17] W.-T. Cheng. The back algorithm for sequential circuit test generation.
Proc. Int. Conf. Computer Design, pages 66-69, Oct. 1988.
[18] ZYCAD Co. Zycad le-001 and le-1002 logic evaluator-product description,
Jun. 1982.
[19] J. S. Conery. The and/or process model for parallel interpretation of logic
programs. Technical report, University of California, Irvine, Irvine, CA,
Jun. 1983.
[20] P. A. Duba, R. K. Roy, J. A. Abraham, and W. A. Rogers. Fault simulation
in a distributed environment. Proc. 25th Design Automation Conf., Jun.
1988.
[21] L. N. Dunn. Ibm's engineering design system support for vlsi design and
verification. IEEE Design and Test, pages 30-40, Feb. 1984.
[22] E. B. Eichelberger and T. W. Williams. A logic design structure for lsi
testing. Proc. 14th Design Automation Conf., pages 462-468, Jun. 1977.
[23] F. Hirose et al. Simulation processor sp. Proc. Int. Conf. Computer-Aided
Design, pages 484-487, Nov. 1987.
[24] R. M. Fujimoto. Parallel discrete event simulation. Communications of
ACM, 33(10):30-53, Oct. 1990.
[25] H. Fujiwara and T. Inoue. Optimal granularity of test generation in a
distributed system. IEEE Trans. Computer-Aided Design, pages 885-892,
Aug. 1990.
[26] H. Fujiwara and T. Inoue. Optimal test granularity of test generation in a
distributed system. Proc. Int. Conf. Computer Aided Design, Nov. 1989.
[27] H. Fujiwara and T. Shimono. On the acceleration of test generation
algorithms. IEEE Trans. Computers, C-32(12):1137-1144, Dec. 1983.
[28] H. Fujiwara and S. Toida. The complexity of fault detection problems for
combinational logic circuits. IEEE Trans. Computers, C-31(6):555-560,
June 1982.
[29] A. Ghosh, S. Devadas, and A. R. Newton. Test generation for highly
sequential circuits. Proc. Int. Conf. Computer-Aided Design (ICCAD-89),
Nov. 1989.
[30] S. Ghosh. Nodifs: A novel, distributed circuit partitioning based
algorithm for fault simulation of combinational and sequential digital designs
on loosely coupled parallel processors. Technical report, LEMS, Division
of Engineering, Brown Univ., Providence, RI, 1991.
[31] P. Goel. Test generation costs analysis and projections. Proc. 17th Design
Automation Conf., Jun. 1980.
[32] P. Goel. An implicit enumeration algorithm to generate tests for
combinational logic circuits. IEEE Trans. Computers, C-30(3):215-222, Mar.
1981.
[33] L. H. Goldstein and E. L. Thigpen. Scoap: Sandia
controllability/observability analysis program. Proc. IEEE 17th Design Automation
Conf., pages 190-196, 1980.
[34] D. Harel and B. Krishnamurthy. Is there hope for linear time fault
simulation? Proc. Fault Tolerant Computing Symp., pages 28-33, Jun. 1987.
[35] F. Hirose, K. Takayama, and N. Kawato. A method to generate tests for
combinational logic using an ultra-high speed logic simulator. Proc. Int.
Test Conf., pages 102-107, Sep. 1988.
[36] J. K. Howard, L. Malm, and L. M. Warren. Introduction to the ibm los
gatos logic simulation machine. Proc. IEEE Int. Conf. Computer Design:
VLSI in Computers, pages 580-583, Oct. 1983.
[37] N. Ishiura, M. Ito, and S. Yajima. High-speed fault simulation using a
vector processor. Proc. Int. Conf. Computer-Aided Design (ICCAD-87),
pages 10-13, Nov. 1987.
[38] N. Ishiura, H. Yasuura, and S. Yajima. High-speed logic simulation on
vector processors. IEEE Trans. Computer-Aided Design, CAD-6(3):305-321,
May 1987.
[39] D. Jefferson. Virtual time. ACM Trans. Programming Languages and
Systems, pages 404-425, July 1985.
[40] T. P. Kelsey and K. K. Saluja. Fast test generation for sequential circuits.
Proc. Int. Conf. Computer-Aided Design, pages 354-357, Nov. 1989.
[41] T. Kirkland and M. R. Mercer. Algorithms for automatic test pattern
generation. IEEE Design and Test, pages 43-55, Jun. 1988.
[42] R. H. Klenke, R. D. Williams, and J. Aylor. Parallel processing techniques
for automatic test pattern generation. IEEE Computer, pages 71-84, Jan.
1992.
[43] G. A. Kramer. Employing massive parallelism in digital ATPG algorithms.
Proc. Int. Test Conf., pages 108-114, Oct. 1983.
[44] C. P. Kung and C. S. Lin. Parallel sequence fault simulation for synchronous
sequential circuits. Proc. European Design Automation Conf., pages 434-438,
Mar. 1992.
[45] Y. H. Levendel, P. R. Menon, and S. H. Patel. Parallel fault simulation
using distributed processing. Bell System Technical Journal,
62(10):3107-3137, December 1983.
[46] G. J. Li and B. W. Wah. Manip-2: A multicomputer architecture for
evaluating logic programs. Proc. Int. Conf. Parallel Processing, pages
123-130, Aug. 1985.
[47] H. K. T. Ma, S. Devadas, A. R. Newton, and A. Sangiovanni-Vincentelli.
Test generation for sequential circuits. IEEE Trans. on CAD,
7(10):1081-1093, October 1988.
[48] S. Mallela and S. Wu. A sequential test generation system. Proc. Int. Test
Conf. (Philadelphia, PA), pages 57-61, Oct. 1985.
[49] T. Markas, M. Royals, and N. Kanopoulos. On distributed fault simulation.
IEEE Computer, pages 40-52, Jan. 1990.
[50] R. Marlett. Ebt: A comprehensive test generation technique for highly
sequential circuits. Proc. 15th Design Automation Conference, pages
338, June 1978.
[51] P. Mayor, V. Pitchumani, and V. Narayanan. A parallel algorithm for test
generation on the connection machine. Proc. Int'l Test Conf. (ITC-89),
page P.9, Sep. 1989.
[52] A. Motohara, K. Nishimura, H. Fujiwara, and I. Shirakawa. A parallel
scheme for test pattern generation. Proc. Int. Conf. Computer-Aided
Design, pages 156-159, Nov. 1986.
[53] R. B. Mueller-Thuns, D. G. Saab, R. F. Damiano, and J. A. Abraham.
Portable parallel logic and fault simulation. Proc. Int. Conf.
Computer-Aided Design (ICCAD-89), pages 506-509, Nov. 1989.
[54] P. Muth. A nine-valued circuit model for test generation. IEEE Trans.
Computers, C-25:630-636, Jun. 1976.
[55] V. Narayanan and V. Pitchumani. A massively parallel algorithm for fault
simulation on the connection machine. Proc. 26th Design Automation
Conf., pages 734-737, Jun. 1989.
[56] V. Narayanan and V. Pitchumani. A parallel algorithm for fault simulation
on the connection machine. Proc. Int'l Test Conf. (ITC-88), pages 89-93,
September 1988.
[57] J. F. Nelson. Deductive fault simulation on hypercube multiprocessors.
Proc. 9th AT&T Conf. Electronic Testing, Oct. 1987.
[58] T. M. Niermann. Techniques for sequential circuit automatic test
generation. Technical report, University of Illinois, Coordinated Science Lab,
Mar. 1991.
[59] D. Ostapko, Z. Barzilai, and G. M. Silberman. Fast fault simulation in a
parallel processing environment. Proc. Int. Test Conf., Oct. 1987.
[60] F. Ozguner, C. Aykanat, and O. Khalid. Logic fault simulation on a vector
hypercube multiprocessor. Proc. 3rd Int'l Conf. on Hypercube and
Concurrent Computers and Applications, 11:1108-1116, January 1988.
[61] F. Ozguner and R. Daoud. Vectorized fault simulation on the cray x-mp
supercomputer. Proc. Int. Conf. Computer-Aided Design (ICCAD-88),
Nov. 1988.
[62] S. T. Patel and J. H. Patel. Effectiveness of heuristics measures for
automatic test pattern generation. Proc. 23rd Design Automation Conf., pages
547-552, 1986.
[63] S. Patil. Parallel algorithms for test generation and fault simulation.
Technical report, University of Illinois, Coordinated Science Lab, Sep. 1990.
[64] S. Patil and P. Banerjee. Performance trade-offs in a parallel test
generation fault simulation environment. IEEE Trans. Computer-Aided Design,
10(12):1542-1558, Dec. 1991.
[65] S. Patil and P. Banerjee. A parallel branch and bound approach to test
generation. IEEE Trans. Computer-Aided Design of Circuits and Systems,
9(3):313-322, Mar. 1990.
[66] S. Patil, P. Banerjee, and C. Polychronopoulos. Efficient circuit partitioning
algorithms for parallel logic simulation. Proc. Supercomputing Conf., pages
361-364, Nov. 1989.
[67] G. Pfister. The yorktown simulation engine: Introduction. Proc. 19th
Design Automation Conference, pages 51-54, Jun. 1982.
[68] V. N. Rao and V. Kumar. Parallel depth first search, part i:
Implementation. International Journal of Parallel Programming, 16(6), 1987.
[69] V. N. Rao and V. Kumar. Parallel depth first search, part ii: Analysis.
International Journal of Parallel Programming, 16(6), 1987.
[70] W. A. Rogers, J. F. Guzolek, and J. A. Abraham. Concurrent fault
simulation: Performance model and two optimizations. IEEE Trans.
Computer-Aided Design, CAD-6:848-862, Sep. 1987.
[71] J. P. Roth. Diagnosis of automata failures: A calculus and a method. IBM
Jour. Res. Develop., 10:278-291, Jul. 1966.
[72] J. P. Roth, W. G. Bouricius, and P. R. Schneider. Programmed algorithms
to compute tests to detect and distinguish between failures in logic circuits.
IEEE Trans. Computers, EC-16(5):567-580, Oct. 1967.
[73] P. Banerjee, S. Patil, and J. Patel. Parallel test generation for sequential
circuits on general purpose multiprocessors. Proc. 28th Design Automation
Conf. (DAC-91), Jun. 1991.
[74] M. H. Schultz and E. Auth. Essential: An efficient self-learning test pattern
generation algorithm for sequential circuits. Proc. Int. Test Conf., Aug.
1989.
[75] S. Seshu. On an improved diagnosis program. IEEE Trans. Electron.
Comput., EC-14:76-79, 1965.
[76] S. C. Seth, L. Pan, and V. Agrawal. Probabilistic estimation of digital
circuit testability. Proc. Int. Symp. Fault-Tolerant Computing, pages
220-225, Jun. 1985.
[77] S. P. Smith, W. Underwood, and M. R. Mercer. An analysis of several
approaches to circuit partitioning for parallel logic simulation. Proc. of
Int'l Conf. on Computer Design (ICCD87), pages 664-667, 1987.
[78] L. Soule and T. Blank. Parallel logic simulation on general purpose
machines. Proc. Design Automation Conf., pages 166-171, Jun. 1988.
[79] R. E. Tarjan. Finding dominators in directed graphs. SIAM Journal
Computing, pages 62-89, 1974.
[80] E. Ulrich et al. High-speed concurrent fault simulation with vectors
and scalars. Proc. 20th Design Automation Conf., pages 709-712, Aug.
1983.
[81] E. G. Ulrich and T. Baker. Concurrent simulation of nearly identical digital
networks. Computer, 7:39-44, April 1974.
[82] B. W. Wah, G. J. Li, and C. F. Yu. Multiprocessing of combinatorial search
problems. IEEE Computer, 18(6):93-108, June 1985.
[83] J. A. Waicukauski, E. B. Eichelberger, D. O. Forlenza, E. Lindbloom,
and T. McCarthy. Fault simulation for structured vlsi. VLSI Systems
Design, pages 20-32, Dec. 1985.
[84] A. Warshawsky and J. Rajski. Distributed fault simulation with vector
set partitioning. Technical report, VLSI Design Laboratory, McGill Univ.,
Montreal, CANADA, 1991.
Selected Bibliography in Parallel Algorithms for Logic Synthesis
and Verification
[1] K. A. Bartlett, D. Bostick, G. Hachtel, R. Jacoby, and M. Lightner. BOLD:
A Multiple-level Logic Optimization System. Proc. International
Conference on Computer Aided Design, 1987.
[2] K. A. Bartlett et al. Multilevel Logic Minimization using Implicit Don't
Cares. IEEE Transactions on Computer-Aided Design, CAD-7(6):723-740,
June 1988.
[3] R. Brayton, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang. Mis:
A multiple-level logic optimization system. IEEE Trans. Computer-Aided
Design, CAD-6(6):1062-1081, Nov. 1987.
[4] R. K. Brayton et al. ESPRESSO-II: A New Logic Minimizer for
Programmable Logic Arrays. CICC, pages 370-376, June 1984.
[5] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L.
Sangiovanni-Vincentelli. Logic Minimization Algorithms for VLSI Synthesis. Kluwer
Academic Publishers, Boston, MA, 1984.
[6] R. K. Brayton, G. D. Hachtel, and A. L. Sangiovanni-Vincentelli. Multilevel
logic synthesis. Proc. of IEEE, 78(2), Feb. 1990.
[7] D. W. Brown. A state machine synthesizer. Proc. 18th Design Automation
Conf., pages 301-304, Jun. 1981.
[8] R. E. Bryant. Graph-Based Algorithms for Boolean Function Manipulation.
IEEE Transactions on Computers, pages 677-691, August 1986.
[9] H. Cho, G. Hachtel, M. Nash, and L. Setiono. BEAT_NP: A Tool for
Partitioning Boolean Networks. Proc. International Conference of Computer
Aided Design, pages 10-13, Nov. 1988.
[10] J. Darringer, D. Brand, J. Gerbi, W. Joyner, and L. Trevillyan. Lss:
A system for production logic synthesis. IBM Jour. Research and Dev.,
28(5):537-545, Sep. 1984.
[11] K. De and P. Banerjee. Logic partitioning and resynthesis for testability.
Proc. Int. Test Conf., Oct. 1991.
[12] K. De, B. Ramkumar, and P. Banerjee. Propersyn: A portable parallel
algorithm for logic synthesis. Proc. Int. Conf. on Computer-Aided Design
(ICCAD-92), Nov. 1992.
[13] A. J. de Geus and W. Cohen. A Rule-based System for Optimizing
Combinational Logic. IEEE Design and Test, pages 22-32, August 1985.
[14] S. Dey, F. Brglez, and G. Kedem. Corolla Based Circuit Partitioning and
Resynthesis. 27th Design Automation Conference, pages 607-612, 1990.
[15] W. E. Donath. The Partitioning of Computer Logic. In G. N. Rabbat,
editor, Advances in VLSI CAD, 1984.
[16] W. E. Donath and H. Ofek. Automatic identification of equivalence points
for boolean logic verification. IBM Technical Disclosure Bulletin, 18(8),
Jan. 1976.
[17] Electronics Research Laboratory, University of California, Berkeley.
Octtools Distribution 3.0, Volume 3, August 1989.
[18] C. M. Fiduccia and R. M. Mattheyses. A Linear Time Heuristic for
Improving Network Partitioning. Proc. 19th Design Automation Conference,
pages 175-181, 1982.
[19] R. Galivanche and S. M. Reddy. A Parallel PLA Minimization Program.
Proc. Design Automation Conference, pages 600-607, 1987.
[20] M. R. Garey and D. S. Johnson, 1979.
[21] G. D. Hachtel and R. M. Jacoby. Verification algorithms for vlsi synthesis.
IEEE Trans. Computer-Aided Design, 7(5):616-640, May 1988.
[22] G. D. Hachtel and P. H. Moceyunas. Parallel algorithms for boolean
tautology checking. Proc. Int. Conf. Computer-Aided Design, pages 422-425,
Nov. 1987.
[23] S. J. Hong, R. G. Cain, and D. L. Ostapko. Mini: A heuristic approach for
logic minimization. IBM Jour. Res. Devel., 18:443-458, Sep. 1974.
[24] S. K. Jain and V. D. Agrawal. Statistical Fault Analysis. IEEE Design
and Test of Computers, pages 38-44, February 1985.
[25] B. W. Kernighan and S. Lin. An Efficient Heuristic Procedure for
Partitioning Graphs. Bell System Technical Journal, 49:291-307, 1970.
[26] C. F. Lim, P. Banerjee, K. De, and S. Muroga. A shared memory parallel
algorithm for logic synthesis. Proc. 6th Int. Conf. VLSI Design, Jan. 1993.
[27] H. T. Ma, S. Devadas, and A. Sangiovanni-Vincentelli. Logic verification
algorithms and their parallel implementation. IEEE Trans.
Computer-Aided Design, 8(2):181-189, Feb. 1989.
[28] S. Muroga, Y. Kambayashi, H. C. Lai, and J. N. Culliney. The transduction
method - design of logic networks based on permissible functions. IEEE
Trans. Computers, pages 1404-1424, Oct. 1989.
[29] K. P. Parker and E. J. McCluskey. Probabilistic treatment of general
combinational networks. IEEE Transactions on Computers, pages 668-670,
June 1975.
[30] P. Roth. Hardware verification. IEEE Trans. Computers, C-26, 1977.
[31] Hamid Savoj, Huey-Yih Wang, and Robert K. Brayton. Improved Scripts
in MIS-II for Logic Minimization of Combinational Circuits. International
Workshop on Logic Synthesis, 1991.
[32] R. S. Wei and A. Sangiovanni-Vincentelli. Proteus: A logic verification
system for combinational circuits. Proc. Int. Test Conf., Sep. 1986.
[33] X. Q. Xiang and S. Muroga. SYLON-XTRANS: A Multilevel Logic
Network Synthesizer. International Workshop on Logic Synthesis, 1989.
[34] Xuequn Xiang. Multilevel Logic Network Synthesis Systems,
SYLON-XTRANS. PhD thesis, Univ. of Illinois, 1990.
Selected Bibliography in Future Directions of Parallel Processing
and Parallel CAD
[1] Agha, G.A. Actors: A Model of Concurrent Computation in Distributed
Systems. MIT press, 1986.
[2] Athas, W.C., Seitz, C.L. Multicomputers: Message-Passing Concurrent
Computers. Computer, pages 9-24, August 1988.
[3] Boyle, J., Butler, R. et al. Portable Programs for Parallel Processors. Holt,
Rinehart & Winston, New York, 1987.
[4] Carriero, N., Gelernter, D. How to Write Parallel Programs: A Guide to
the Perplexed. ACM Computing Surveys, pages 323-357, Sep. 1989.
[5] K. De, B. Ramkumar, and P. Banerjee. Propersyn: A portable parallel
algorithm for logic synthesis. Proc. Int. Conf. on Computer-Aided Design
(ICCAD-92), Nov. 1992.
[6] W. Fenton, B. Ramkumar, V. A. Saletore, A. B. Sinha, and L. V. Kale.
Supporting Machine Independent Programming on Diverse Parallel
Architecture. International Conference on Parallel Processing, August 1991.
[7] Foster, I., Taylor, S. Strand: New Concepts in Parallel Programming.
Prentice Hall, 1990.
[8] Gabber, E. VMMP: A Practical Tool for the Development of Portable and
Efficient Programs for Multiprocessors. IEEE Transactions on Parallel and
Distributed Systems, pages 304-317, July 1990.
[9] L. V. Kale. The Chare Kernel Parallel Programming System. International
Conference on Parallel Processing, August 1990.
[10] T. M. Niermann. Techniques for sequential circuit automatic test
generation. Technical report, University of Illinois, Coordinated Science Lab,
Mar. 1991.
[11] B. Ramkumar and P. Banerjee. Properext: Portable parallel circuit
extraction. Proc. Int. Parallel Processing Symp. (IPPS-93), Apr. 1993.
1 9 9 2 .  . a Do r t a b le  o b je c t -o r i e n t e d
im b. *»*r" »•* rjss.'xv c»/.■>» D“”n
1 p a r a l le l  e n v ir o n m e n t  fo r  v is .
(¡CCD-92), Oct. 1992. generation for sequential
1141 P' B^ e ^ ^ ° c ^ -  Pr°C- m k  DtS,9n AUi0matl° n
1161 c‘mt‘u'-Au“d 8131
