The Use of Parallel Processing in VLSI Computer-Aided Design Application by Banerjee, Prith
May 1989 UILU-ENG-89-2215 
CSG-104
COORDINATED SCIENCE LABORATORY
College of Engineering
THE USE OF PARALLEL PROCESSING IN VLSI
COMPUTER-AIDED DESIGN APPLICATIONS
Prith Banerjee
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
UNCLASSIFIED
REPORT DOCUMENTATION PAGE
Report security classification: Unclassified
Restrictive markings: None
Distribution/availability of report: Approved for public release; distribution unlimited
Performing organization report number: UILU-ENG-89-2215 (CSG-104)
Performing organization: Coordinated Science Lab, 1101 W. Springfield Ave., Urbana, IL 61801
Funding/sponsoring organization: Semiconductor Research Corp.
Monitoring organization: Semiconductor Research Corporation, P.O. Box 12053, Research Triangle Park, NC 27709
Procurement instrument identification number: 87-DP-109
Title: The Use of Parallel Processing in VLSI Computer-Aided Design Applications
Personal author: Banerjee, Prith
Type of report: Technical
Date of report: May 1989
Page count: 202
Subject terms: Multiprocessors, Parallel Algorithms, VLSI CAD, Analysis, Synthesis, Simulation
THE USE OF PARALLEL PROCESSING IN
VLSI COMPUTER-AIDED DESIGN APPLICATIONS
Prof. Prith Banerjee
Department of Electrical and Computer Engineering 
and Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign
A TUTORIAL 
Presented at the
IEEE International Conference on Computer-Aided Design 
Santa Clara Convention Center, California 
November 7, 1988
SUMMARY
In view of the increasing complexity of VLSI circuits, there is a need for sophisticated computer-aided
design (CAD) tools to automate the synthesis and verification steps in the design of VLSI systems.
Special-purpose hardware accelerators have been proposed in the past to speed up various tasks in
VLSI CAD. However, this approach is not always cost-effective because of the tremendous time and
resources required to implement such machines, which soon become outdated owing to changing
technologies. Multiprocessors can also offer the tremendous speed improvements that are needed to solve
the complex problems of the future, and are at the same time extremely cost-effective. A large number of
medium-priced multiprocessors are commercially available today. However, these machines will not be
able to exploit the parallelism available in the problems unless new algorithms are developed that are
well suited to a multiprocessor environment.

This tutorial discusses the present approaches to the use of parallel processing for numerous CAD
applications. These approaches involve (1) coarse grain parallelism using shared memory multiprocessors
consisting of about 5-10 processors; (2) medium grain parallelism using distributed memory
multiprocessors such as hypercubes consisting of about 10-100 processors; (3) fine grain parallelism using
massively parallel SIMD processors consisting of about 10,000-100,000 processors. The advantages and
disadvantages of choosing each of these approaches will be discussed with results of case studies on
specific applications.

The tutorial is organized as four sessions. Session 1 gives an overview of the general area of parallel
processing and how it can be used for VLSI CAD applications. It also reviews the relevant work done in
the area of special purpose parallel hardware accelerators for various CAD applications. Session 2
discusses parallel applications for the synthesis tasks in VLSI CAD, namely, floorplanning, cell
placement, global routing, and detailed routing. Session 3 discusses parallel algorithms for simulation of
VLSI systems at all levels: circuit, switch, logic and behavioral. Session 4 describes parallel algorithms for
analysis tools, namely, design-rule checking, circuit extraction, test generation, fault simulation. For each
of the tasks mentioned above, algorithms using coarse grain, medium grain and fine grain parallelism are
discussed along with results of implementations where available. Finally, the tutorial concludes with a
view of the future of parallel processing for VLSI CAD.
Objectives of Tutorial
•  Expose CAD developers and users to the rapidly 
growing field of parallel processing
• Review some recent results in this area
• Explore various options for CAD application users in 
an integrated parallel CAD environment
Agenda
9:00-10:30    Session 1: Parallel Computing and VLSI CAD: An Overview
              Prith Banerjee, Univ. of Illinois
10:30-11:00   Coffee Break
11:00-12:30   Session 2: Parallel Algorithms for Physical Design: Floorplanning, Placement, Routing
              Prith Banerjee, Univ. of Illinois
12:30-1:30    Lunch Break
1:30-3:00     Session 3: Parallel Algorithms for Simulation: Circuit, Timing, Logic
              Resve Saleh, Univ. of Illinois
3:00-3:30     Coffee Break
3:30-5:00     Session 4: Parallel Algorithms for Design Verification: DRC, Extraction, Test Generation
              Prith Banerjee, Univ. of Illinois
SESSION 1
PARALLEL COMPUTING AND VLSI CAD
Prith Banerjee
Coordinated Science Laboratory 
Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign
November, 1988
Outline
• Motivation: Why Parallel CAD?
• Parallel Computing Options: Special-purpose, 
General-purpose
• Review of Special Purpose Hardware Accelerators for 
CAD
• General Purpose Parallel Processing Options: Coarse 
grain, medium grain, fine grain
• Description of Three Parallel Machines and 
Programming
• Conclusions and Look at the Future
Overview of VLSI CAD Tools
• Successful implementation of complex systems of the 
future using VLSI technology requires sophisticated 
CAD tools
Synthesis
1. Silicon Compilers
2. Cell Generators
3. Floorplanning
4. Cell Placement
5. Global routing
6. Detailed routing
Analysis
1. Node Extraction
2. Design Rule Checking
3. Design Verification
4. Test Generation
5. Fault Simulation
Simulation
1. Circuit Simulation
2. Switch Simulation
3. Logic Simulation
4. Behavioral Simulation
Motivation
• Existing algorithms for solving above problems have 
been (are being) designed to run efficiently on 
conventional uniprocessor computers
• Technological improvements can speed up the 
performance of uniprocessor computers by factors of 
5-10 over 5 years
• Inadequate for future requirements
Examples of Computing Requirements
• Logic Simulation
□  Conventional computers can simulate 1000-5000 
gate evaluations per MIP
□  A 10 million gate system requires 1,000,000 gate 
evaluations per clock cycle assuming 10% 
activity
□  Hence, a 10 MIP SUN server can simulate 1 
second of system operation (1 million clock 
cycles) in 10^6 x 10^6 / 10^5 = 10^7 seconds, or about 4 
months
• Fault Simulation
□ Fault Simulation complexity increased by another 
factor of N over logic simulation
□  Example circuit with 8000 gates: fault simulation 
(7000 faults) takes 450 hours on a SUN 
workstation
• Cell Placement
□  Best Simulated Annealing Placement Program 
(Timberwolf4.2) needs 24 hours to place 3000 
cell circuit on a MicroVAX-ll (1 MIP)
□  Would need one year to perform task for circuit of 
100,000 cells
Parallel Processing for CAD
•  Parallel Processing is the Only Answer for the 
Tremendous Future computing requirements
• Rapidly growing technology has recently become 
commercially available and feasible
• Interest in parallel processing motivated by two 
reasons:
□  Large set of computational problems including 
many in CAD are inherently parallel in nature
□  Availability of low-cost high performance 
microprocessors and memories have enabled 
building of experimental parallel machines
• It is relatively easy to build the hardware for systems 
with hundreds or thousands of processors that can 
theoretically achieve GFLOPS performance
• Hard problem: Design of Efficient Parallel Algorithms
• The best sequential algorithm may not be the best 
parallel one
• Need to research parallel algorithms, parallel 
programming environments and methodologies
Parallel Processing for CAD (Contd)
• Developers of applications of Parallel Processing for 
CAD have two alternatives:
• Special-purpose hardware accelerators
• General-purpose parallel hardware
•  Special-purpose hardware offers higher performance 
for specific applications
• General purpose hardware can be programmed for 
various applications
• Tradeoff between performance gain versus flexibility
Hardware Accelerators for Logic Simulation
Example 1: IBM Yorktown Simulation Engine
•  Compiled simulator
•  256 x 256 Crossbar switch
• Logic processors for logic elements, array processors 
for memory arrays
• Logic network partitioned into subnetworks, each 
simulated by a logic processor
• All logic processors simulate entire circuit 
simultaneously
• Data structure of each gate: gate ID, pointer to truth 
table, pointers to current values of gate inputs and 
outputs
• Processor memory for storage of data and signal 
values within processor; switch memory for 
transferring simulation data between processors
[Figure: block diagram showing the host interface, control processor, logic processors, and array processors connected through the crossbar switch]
Hardware Accelerators for Logic Simulation
Example 2: NEC HAL Simulator
• Event driven simulator
•  31 simulation processors
• Multistage interconnection network
• Handles high-level models: (1) PLAs with 32 inputs 
and 31 outputs (2) Behavioral models
• Simulation cycle consists of (1) fetching input data, (2) 
creating local event queues, (3) simulating gates 
specified in event queues, (4) storing computed results
[Figure: block diagram showing memory, processors, and master control processor]
Hardware Accelerators for Logic Simulation
Example 3: Zycad Logic Evaluator
•  Event-driven simulator
•  16 simulation processors connected to a bus
• All processors are driven by a central clock
• Within each cycle, each processor simulates gates 
according to own event queues
Hardware Accelerators for Logic Simulation
Company       Machine name   Model        Max no. of   Max no.   Gate eval.
                             complexity   processors   gates     per sec
Daisy         Megalogician   RTL/Gate     1            1M        0.1M
Valid         Realfast       RTL/Gate     1            2.5M      0.5M
Tegas         Accelerator    RTL/Gate     1            2.5M      1M
Silicon Sol.  MACH 1000      Gate         8            0.5M      2M
Zycad         Logic Ev.      Gate         16           3.8M      60M
AT&T          MARS           Gate         256          16M       100M
NEC           HAL            PLA          31           3M        360M
Fujitsu       SP             Gate         64           4M        800M
IBM           YSE/LSM        Gate         256          4M        960M
Hardware Accelerators for Switch and Circuit 
Simulation
•  MOSSIM Simulation Engine (Caltech)
□  Basically put the MOSSIM switch-level simulation 
algorithm in hardware
• FAST-1 Multiprocessor (CMU)
□  A data-driven multiprocessor with special 
instruction set
•  Yorktown Simulation Engine (IBM)
□  The YSE for Logic Simulation modified using 
MOSSIM algorithm
• PowerSpice (SimuCad)
□  Designed around Sequent Balance 
Multiprocessor
□  Speedup 20 reported
Hardware Accelerators for Design-Rule 
Checking
•  Cytocomputer (Michigan)
□  Boolean transformations on 3-dim bit-map image 
between layers and nearest orthogonal and 
diagonal neighbors
□  Pipeline architecture, speedup 10
• Window Processor (MIT)
□  Performs rasterization, local checks on 4 x 4 
window, error processing
□  Pipeline architecture, speedup 100
• Fast Mask (Silicon Solutions)
□  Four 68020 subsystems on Q-bus
Hardware Accelerators for Routing
•  Wire Routing Machine (IBM)
□  8 x 8 array of processors, speedup 3.4
□  Based on Lee-Moore Routing
          5
        5 4 5
      5 4 3 4 5
    5 4 3 2 3 4 5
  5 4 3 2 1 2 3 4 5
5 4 3 2 1 S 1 2 3 4 5
  5 4 X X X X 3 4 5
    5 D       4 5
              5
•  Hardware Routing Kernel (CAD-Calay)
□  Pipeline architecture, speedup 1.4
• Virtual Bit-Map Processor (Stanford)
□  4 x 4 array, speedup 50
• Cytocomputer (Michigan)
□  Pipeline, speedup 10
• Distributed Array Processor (ICL)
□  64 x 64 array, speedup 60
Hardware Accelerators for Placement
•  Module Interchange Placement Engine (USC)
□  Pipeline architecture, speedup 50
• Parallel Module Placement Engine (Japan)
□  4 x 4 Array Processor
[Figure: 4 x 4 array of processors, indexed (1,1) through (4,4), linked by direct data paths, with a control processor (CP) on a processor bus]
Basic Approaches in Special Purpose Hardware 
Accelerators
•  Remove extraneous software processes
• Use faster processor
• Customize processor for a specific task
• Match intra-processor communication to the algorithm
• Partition the problem data into separate processors
• Match inter-processor communication to the problem 
data
Characteristics of Special Purpose Hardware 
Accelerators
•  Architecture matched to application
• Highest performance gain
• Very expensive to build
• Each such engine needs a host to perform tasks for 
preprocessing and postprocessing
• Technology and algorithms may change while the 
machine is being built
•  May be appropriate for Logic Simulation where 
algorithms are robust and unchanged
• For other applications, more general purpose 
accelerators appropriate
• Example: MARS project (AT&T)
General Purpose Parallel Computing
•  Parallel computing was unaffordable ($5-10 million) in 
the past
•  Recently, several affordable microprocessor based 
parallel systems have emerged ($100,000 to 
$500,000)
• Consider: An Integrated CAD software on a design 
workstation costs $50,000 to $100,000, hardware 
costs $50,000
THE TIME HAS COME FOR PARALLEL 
COMPUTING TO BE USED IN VLSI CAD
Various Options of General-Purpose Parallel 
Processing
(1) Coarse Grain Parallelism (4 < P < 20)
□ Distributed Workstations connected through 
Ethernet
□  Bus-based Shared Memory Multiprocessors 
(Alliant, Sequent, Encore)
(2) Medium Grain Parallelism (20 < P < 1,000)
□  Shared Memory Multiprocessors using Multistage 
Interconnection Networks (BBN Butterfly,
CEDAR, RP3)
□  Distributed Memory Hypercubes (Intel, Ametek, 
NCUBE, FPS)
(3) Fine Grain Parallelism (1,000 < P < 100,000)
□  Data-wise parallel (Connection Machine)
Another Classification of Parallel Processing
• Shared Memory Multiprocessors
• Distributed Memory Message-Passing Multicomputers
Shared Memory Multiprocessors
• Globally shared data
• Processors communicate through shared variables
• Every memory access has to suffer penalty of going 
through network
• Easy to program, automated tools available for 
parallel programming
• Two ways of implementation
□  Bus-based, low-cost, limited number of procs 
(Sequent, Alliant, Encore)
□  Interconnection network based, high cost, large 
number of procs (BBN, RP3, CEDAR)
Distributed Memory Message-Passing 
Multicomputers
•  Each processor has own local memory with read/write 
privileges
• Data is distributed, no shared variable
• Processors communicate through messages
• Normal memory accesses do not suffer penalty of 
going through network, except during message­
passing
• Low cost systems
• Difficult to program, no automated tools available for 
programming
• Three ways of implementation
□  Ethernet based distributed workstations
□  Multicomputers based on various topologies 
(hypercubes, meshes, trees) with powerful node 
processors (Intel, NCUBE, Ametek)
□  Very large number of simple bit-serial processors, 
data objects distributed one per processor (MPP, 
Connection Machine)
Parallel Programming Models
• Pre-parallelized Function Libraries
• Multi-tasking parallelism (Coarse Grain)
• Micro-tasking parallelism (Medium Grain)
• Sub-language level parallelism (Fine Grain)
Multi-tasking Parallelism (Coarse-grain)
(1) Functional Partitioning
□  Decompose by FUNCTION
□  Object-oriented model
□  Example: Logic Simulation decomposed into 
tasks
(2) Data Partitioning
□  Decompose by INSTANCE
□  Data-oriented model
□  Example: Logic Simulation: decompose circuit 
into blocks of gates
Microtasking (Medium Grain)
• Any vectorizable loop and many nonvectorizable 
loops
• Find outermost independent loops
• Decompose by iteration, one per processor
• Dynamic load balancing
• Example:
DO 10 I=1,10 
A(I) = B(I) * C(I)
10 CONTINUE
• Send loop iteration 1 to first processor, 2 to second 
processor,...
Sublanguage Level Parallelism (Fine-Grain)
•  Statement level
□  Independent high-level statements
□  Must be done automatically by preprocessor
□  Example:
a = b * c + d; 
i = j + k; 
g = f / y - m;
Send statements to different processors
• Operator level
□  Independent operations
□  Must be done automatically by compiler
□  Example:
mov y, r2 
mov z, r3 
mult r4, r5 
add r6, r7
Send individual operations to processors
Overview of One Shared-Memory Multiprocessor
• Sequent Balance 8000 Multiprocessor
• Consists of 1 to 6 dual processor boards (12 
processors)
• Each board contains two NS 32000 CPUs, 8KByte 
cache, NS 32081 FPA, 32082 MMU, and other 
support circuitry for multiprocessing
• Consists of 1 to 4 memory controller boards: 2 
MBytes per board, can be expanded to 8MBytes
• Connected via SB8000 bus
□  Bandwidth 26.7MBytes per second (Clock rate 10 
MHz)
□  32 bits data and 28 bits address time multiplexed
[Figure: Sequent Balance system diagram; processor pool of 2 to 12 32-bit CPUs]
Multiprocessing, Multiprogramming, 
Multitasking
•  Multiprogramming on a Multiprocessor
[Figure: running programs scheduled by the multiprogrammed operating system (DYNIX) onto four processors]
• Multitasking on a Multiprocessor
• Multiple jobs on a Multiprocessor
[Figure: multiprogrammed operating system (DYNIX) scheduling onto eight processors]
Programming a Sequent Multiprocessor
•  DYNIX Operating System (similar to 4.2bsd UNIX)
• Application Programming Supported by compilers for 
C, FORTRAN, PASCAL
• Parallel programming supported through UNIX style 
system call interface
• Parallel Programs consist of:
□  Creation and termination of multiple processes 
(55 msec):
□  Interprocess communication
□ Task synchronization and mutual exclusion
Task Scheduling Techniques
• Prescheduling
□  Task division determined by programmer
□  Each process performs a specific task
□  No automatic load balancing
• Static scheduling
□  Tasks scheduled at Run-time by processes
□  Predetermined division
• Dynamic Scheduling
□  Shared task queue
□  Processes compete for tasks
Process Synchronization
• Semaphores: Shared data structures used to 
synchronize actions of multiple cooperating processes
• Locks: Mutual exclusion
[Figure: timeline for processes P1, P2, and P3 sharing one lock; each process in turn acquires the lock, executes its critical region, and releases the lock, while the others wait for the lock]
• Events and Barriers: Synchronization points
Example Sequent Parallel Programming Library 
Calls
•  fork(): create copy of current process
•  m_fork(): execute a subprogram in parallel
•  m_kill_procs(): terminate child processes
•  m_lock(): lock a lock
•  m_sync(): check in at barrier
•  p_init(): initialize shared memory
Example Problem: Matrix Multiplication
Sequential C program for matrix multiplication
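#define SIZE 64   /* matrix dimension; value chosen only for illustration */

float a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

void init_matrix(float a[][SIZE], float b[][SIZE]);   /* defined elsewhere */
void print_mats(float a[][SIZE], float b[][SIZE], float c[][SIZE]);
void mat_mul(float a[][SIZE], float b[][SIZE], float c[][SIZE]);

int main(void)
{
    init_matrix(a, b);
    mat_mul(a, b, c);
    print_mats(a, b, c);
    return 0;
}

/* c = a * b, computed one row of c at a time */
void mat_mul(float a[][SIZE], float b[][SIZE], float c[][SIZE])
{
    int i, j, k;

    for (i = 0; i < SIZE; i++) {
        for (k = 0; k < SIZE; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < SIZE; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    }
}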
Solving Example on a Shared Memory (Sequent Balance)
/* Global shared data */
shared float a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

void matmul();

main()
{
    init_matrix(a, b);
    m_set_procs(nprocs);        /* set number of processors */
    m_fork(matmul, a, b, c);    /* execute parallel loop */
    m_kill_procs();             /* kill child processes */
    print_mats(a, b, c);
}

/* Each process computes rows myid, myid + nprocs, ... of c */
void matmul(a, b, c)
float a[][SIZE], b[][SIZE], c[][SIZE];
{
    int i, j, k, nprocs;

    nprocs = m_get_numprocs();  /* no. of processes */
    for (i = m_get_myid(); i < SIZE; i += nprocs) {
        for (k = 0; k < SIZE; k++) {
            c[i][k] = 0.0;
            for (j = 0; j < SIZE; j++)
                c[i][k] += a[i][j] * b[j][k];
        }
    }
}
Overview of One Message-Passing 
Multicomputer
• Intel iPSC-2 Multicomputer
• Consists of 16 to 128 node processors 
connected as a hypercube
• Each node board contains following
• 80386 CPU, 80387 FPU, 64KByte cache, 1-8MBytes 
memory, direct routing hardware (350 microseconds)
[Figure: node board block diagram showing the 80386 CPU (4 MIPS), 80387 NPU (300 KFLOPS), hypercube routing logic, RAM modules (1, 4, or 8 MBytes), iLBX-II interface, and a 32-bit iLBX interface to a vector processor or memory expansion]
Hypercube Topology and Properties
2^d processors, or nodes, each with local memory 
"d" is the cube dimension 
Each node connected to "d" neighboring nodes 
Nodes share data by message passing
Examples...
• Dimension 3 hypercube...
• Each node connects to 3 neighbors
• 2^3 = 8 nodes
• Dimension 4 hypercube...
• Each node connects to 4 neighbors 
• 2^4 = 16 nodes
Messages may traverse cube in "d" hops
• 8 processors
• Max. 3 hops away
Cube is superset of useful topologies
• Loops
• Rings
• Meshes 
• Trees
Programming the Intel iPSC Multicomputer
• Host operating system is System V UNIX, handles 
compilation, loading, I/O and execution
• Node operating system is NX, manages process 
scheduling (up to 20 processes), message routing, no 
virtual memory
• Applications Programming in C, FORTRAN, CCLISP 
with UNIX style system calls for supporting parallel 
processing
• Parallel programs consist of:
□  Separate host and node programs
□  Host loads node object files onto nodes, 
distributes data
□  Nodes execute applications in parallel with 
message passing
□  Host collects result data
Subcube requests served to requesting users
[Figure: a cube server handles requests from users on remote hosts (User 1 to User 5) and assigns them subcubes of the hypercube nodes]
Example Parallel Programming Library Calls
•  getcube(): allocate a cube of a certain size
•  csend(): send a message and wait for completion
•  isend(): send a message, do not wait
•  crecv(): receive message, wait for completion
•  irecv(): receive message, do not wait
•  mynode(): obtain node id of calling process
•  numnodes(): obtain number of nodes in cube
Solving Example on an Intel iPSC Multicomputer
Create two files
host.c: Very rough form, not correct
main()
{
    init_mat(a, b);
    nprocs = numnodes();
    PART = SIZE / nprocs;
    for (i = 0; i < SIZE; i += numnodes())
        csend(ROW_TYPE, a[i], PART, i, PID);
    for (j = 0; j < SIZE; j += numnodes())
        csend(COL_TYPE, b[j], PART, j, PID);
    for (stage = 0; stage < numnodes(); stage++) {
        for (k = 0; k < numnodes(); k++) {
            crecv(BLOCK_TYPE, b[j], PART);
        }
    }
    print_mat(a, b, c);
}
node.c: Very rough form, not correct
main()
{
    my_node = mynode();
    my_pid = mypid();
    crecv(ROW_TYPE, a[i], PART);
    crecv(COL_TYPE, b[j], PART);
    for (stage = 0; stage < numnodes(); stage++) {
        block_mat_mul(a, b, c);
        csend(BLOCK_TYPE, a, PART, i_neigh(my_node), PID);
        crecv(BLOCK_TYPE, a, PART);
        csend(BLOCK_TYPE, b, PART, j_neigh(my_node), PID);
        crecv(BLOCK_TYPE, b, PART);
        csend(BLOCK_TYPE, c, PART, host, PID);
    }
}
Overview of One Fine-Grain Parallel Computer
• Thinking Machines Corp. Connection Machine
• 64,000 bit-serial processors, connected by grid and 
hypercube topology
Connection Machine Programming
• Data parallelism: distribute one data object per 
processor
• Operations can proceed in parallel on each data 
object
• If number of objects greater than processors, use 
virtual processors
• Programming involves selecting a subset of 
processors to become active
• Selected processors execute instructions in SIMD 
mode
• Can program in FORTRAN and LISP
• Numerous fast list processing primitives supported by 
operating system
Shared Memory versus Message-Passing
•  Both approaches to parallel processing are being 
vigorously investigated
• Each can be used to simulate the other
□  Virtual Machine (what user sees) can be 
Shared/Message Passing
□  Physical Machine (what exists) can be 
Shared/Message Passing
• Characteristics of Shared Memory:
□  Easier to program
□  Existing automatic tools for parallelizing 
compilers
□  Bus based architectures are easy, inexpensive to 
design, but limited parallelism
□  Larger systems require expensive 
interconnection network, large delays for each 
memory access
Shared Memory versus Message-Passing 
(Contd)
• Characteristics of Message-Passing:
□  More difficult to program
□ Newer technology, hence lack of automatic 
programming tools
□  Inexpensive to build very large systems (much 
greater parallelism for same cost)
□  Need not worry about write privileges and 
consistency in shared memory
□  Memory accesses are from local memory, hence 
performance not affected except during message 
passing
• Future is probably integrated approach: shared 
memory and message-passing
History/Projections of Medium-Grain Computers
Generation                      First     Second    Third
Years                           1983-87   1988-92   1993-97
Typical node
  MIPS                          1         10        100
  MFLOPS scalar                 0.1       2         40
  MFLOPS vector                 10        40        200
  memory (MBytes)               0.5       4         32
Typical system
  N (nodes)                     64        256       1024
  MIPS                          64        2560      100K
  MFLOPS scalar                 6.4       512       40K
  MFLOPS vector                 640       10K       200K
  memory (MBytes)               32        1K        32K
Communication latency (100 byte message)
  neighbor (microsec)           2000      5         0.5
  nonlocal (microsec)           6000      5         0.5
• By late 1990s, the hardware will be ready to offer 
40-50 GFLOPS of performance using 1000 
processors
• Sufficient for VLSI CAD requirements
• Will parallel algorithms for VLSI CAD be ready by 
then?
• START THINKING NOW
Observations
• General-purpose parallel processors provide 
attractive solution to high performance requirements 
of future VLSI design environments
• Affordable parallelism
• Future workstations will use parallel processing 
technology, e.g. SPUR Project at Berkeley, Firefly 
from DEC
• Research has been started at various universities on 
parallel CAD
□  Berkeley, CMU, Illinois, Michigan, Stanford
• Industries developing and marketing parallel CAD
□  IBM, AT&T, DEC, Alliant, Sequent, Intel, Zycad, 
Silicon Solutions, Simucad, Daisy, Valid
VLSI CAD and Parallel Processing
SESSION 2
1. Silicon Compilers
2. Cell Generators
3. Floorplanning
4. Cell Placement
5. Global routing
6. Detailed routing
SESSION 4
1. Node Extraction
2. Design Rule Checking
3. Design Verification
4. Test Generation
5. Fault Simulation
SESSION 3
1. Circuit Simulation
2. Switch Simulation
3. Logic Simulation
4. Behavioral Simulation
PARALLEL COMPUTING AND VLSI CAD; AN OVERVIEW
A Select Bibliography
Prith  Banerjee
[1] W. C. Athas and C. L. Seitz. “Multicomputer: Message-Passing Concurrent Computers.“ 
IEEE Computer, pp. 9-24, Aug. 1988.
[2] A. W. Van Ausdal, “Use of the Boeing Computer Simulator for Logic Design Confirmation 
and Failure Diagnostic Programs,” in Advances in Astronautical Sciences, ed., J. Vagners. 
American Astronautical Soceity. pp. 573-594. Jun. 1971.
[3] R. L. Barto and et al, “Architecture for a Hardware Simulator." Proc. IEEE Conference on 
Circuits and Computers, pp. 891-893. Oct. 1980.
[4] T. Blank, M. Stefik. and W. vanCleemput, “A Parallel Bit Map Architecture for DA 
Algorithms.” Proc. 18th Design Automation Conference, pp. 837-845, Jim. 1981.
[5] T. Blank. "A Survey of Hardware Accelerators Used in Computer-Aided Design." IEEE 
Design and Test, pp. 21-39. Aug. 1984.
[6] E. Carlson and R. Rutenbar. “A Scanline Data Structure Processor for VLSI Geometric 
Checking.” IEEE Trans. Computer-Aided Design, pp. 780-794. Sep. 1987.
[7] D. J. Chyan and M. A. Breuer. “A Placement Algorithm for Array Processors.” Proc. 20th 
Design Automation Conf., pp. 182-188, Jun. 1983.
[8] ZYCAD Co.. “Zycad LE-001 and LE-1002 Logic Evaluator-Product Description,” Jun. 1982.
[9] Intel Scientific Computers. “iPSC: The First Family of Concurrent Supercomputers.” 1985, 
product description.
[10] Alliant Computer Systems Corp.. “FX/Series Product Summary.” June 1985.
[11] F. Darema and G. F. Pfister, “Multipurpose Parallelism for VLSI CAD on the RP3,” IEEE 
Design and Test of Computers, vol. 4, pp. 19-27, Oct. 1987.
[12] P■> M. Flanders. D. J. Hunt, and S. F. Reddaway, in High-Speed Computer and Algorithm  
Organization. New York: Academic Press. 1977.
[13] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, and J. K. Salmon, Solving Problems on 
Concurrent Processors, Nov. 1986.
[14] E. H. Frank. “Exploiting Parallelism in a Switch-Level Simulation Machine." Proc. Design 
Automation Conf.. pp. 209-215. Jun. 1986.
[15] J. M. Hancock and S. Dasgupta, Tutorial on Parallel Processing for Design Automation 
Applications.” Proc. Design Automation Conf.. pp. 69-77. Jim. 1986.
[16] W. D. Hillis. in The Connection Machine. Cambridge, MA: MIT Press. 1985.
[f 7] A. Iosupovici. C. King, and M. A. Breuer, “A Module Interchange Placement Machine,” Proc. 
20th Design Automation Conf.. pp. 171-174, Jun. 1983.
[18] J. Marantz, “Exploiting Parallelism in VLSI CAD,” Proc. Int. Conf. Computer Design Oct
1986. 6
[19] R* Nair, S. J. Hong, S. Liles, and R. Villani, “Global Wiring on a Wire Routing Machine,” 
Proc. 19th Design Automation Conference, pp. 224-231, Jun. 1982.
[20] R. D. Nielson, "Algorithmically Accelerated CAD,” VLSI Systems Design, pp. 65-66, Feb. 
1986.
[21] G. Pfister, "The Yorktown Simulation Engine: Introduction,” Proc. 19th Design Automation 
Conference, pp. 51-54, Jun. 1982.
[22] D. A. Reed and R. M. Fujimoto. in Multicomputer Networks: Message-Passing Parallel 
Processing. Cambridge. MA: MIT Press. 1987.
[23] R. A. Rutenbar, T. N. Mudge, and D. E. Atkins. "A Class of Cellular Architectures to 
Support Physical Design Automation.” in IEEE Trans. Computer-Aided Design of Circuits 
and Systems, pp. 264-278. Oct. 1984.
[24] T. Sasaki, “HAL: A Block Level Hardware Logic Simulator,” Proc. 20th Design Automation
Conf., pp. 150-156, Jun. 1983.
[25] Sequent Computer Systems, Inc., "The Balance 8000 Parallel Computer System,” Product 
Description. 1986.
[26] K. T. Tam, "Parallel Processing for CAD Applications.” IEEE Design and Test. pp. 13-17, 
Oct. 1987.
[27] L. W. Tucker and G. G. Robertson. "Architecture and Applications of the Connection 
Machine.” IEEE Computer, pp. 26-38, Aug. 1988.
[28] K. Ueda. T. Komatsubara, and T. Hosaka, "A Parallel Module Placement Approach for Logic 
Module Placement," in IEEE Trans. Computer-Aided Design of Integrated Circuits and 
Systems, pp. 39-47. Jan. 1983.
[29] F. Hirose et al. "Simulation Processor SP.” Proc. Int. Conf. Computer-Aided Design, pp. 
484-487. Nov. 1987.
SESSION 2
PARALLEL ALGORITHMS FOR PHYSICAL DESIGN
Prith Banerjee
Coordinated Science Laboratory 
Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign
November, 1988
VLSI CAD, Physical Design, and Parallel Processing
SESSION 2
1. Silicon Compilers
2. Cell Generators
3. Floorplanning
4. Cell Placement
5. Global routing
6. Detailed routing
SESSION 4
1. Node Extraction
2. Design Rule Checking
3. Design Verification
4. Test Generation
5. Fault Simulation
SESSION 3
1. Circuit Simulation
2. Switch Simulation
3. Logic Simulation
4. Behavioral Simulation
Outline
• Physical Design Tasks as Optimization Problems
• Parallel Algorithms for Floorplanning
• Parallel Algorithms for Cell Placement
• Parallel Algorithms for Global Routing
• Parallel Algorithms for Detailed Routing
VLSI Physical Design Tasks
• Most tasks in synthesis involve some kind of 
optimization, for performance or area or power
• Many of them have a combinatorial search space of 
solutions to explore to find optimum solution
• Two basic approaches to optimization in physical 
design problems
□ Constructive: starting from an incomplete solution,
one tries to add components to form a complete
solution
□ Iterative: starting from an initial complete solution,
one tries to modify it and improve it
• Problems with both approaches: getting stuck at local 
minima of objective functions
• Hence use Simulated Annealing (an iterative 
algorithm)
Simulated Annealing - An Overview
• Iterative improvement with probabilistic hill climbing
• Successfully applied to floorplanning, placement, and 
routing tasks
• State changes - Moves
□  Displacements
□  Exchanges
[Figure: examples of an exchange move and a displace move among modules M1, M2, M3 and an empty slot]
Introduction (cont.)
• Algorithm outline:
T = T0; X = X0;
WHILE (Stopping crit. not satisfied) DO
  WHILE (Inner loop crit. not satisfied) DO
    Y = generate(X);
    IF (accept(c(Y),c(X),T))
      X = Y;
  END WHILE
  T = update(T);
END WHILE
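The outline above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in any of the cited placers: the Metropolis acceptance rule, the geometric cooling factor `alpha`, the fixed inner-loop count, and the toy cost function are all assumptions made for the example.

```python
import math
import random

def anneal(x0, cost, generate, t0=10.0, alpha=0.9, inner_iters=100, t_min=1e-3, seed=0):
    """Simulated annealing loop: generate a move, accept or reject it, then cool."""
    rng = random.Random(seed)
    x, t = x0, t0
    best = x
    while t > t_min:                      # stopping criterion (assumed: temperature floor)
        for _ in range(inner_iters):      # inner-loop criterion (assumed: fixed count)
            y = generate(x, rng)
            delta = cost(y) - cost(x)
            # accept(): downhill moves always, uphill moves with probability exp(-delta/T)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                x = y
                if cost(x) < cost(best):
                    best = x
        t = alpha * t                     # update(T): geometric cooling (assumed)
    return best

# Toy use: minimize (x - 3)^2 over the integers with +/-1 moves.
toy_cost = lambda x: (x - 3) ** 2
toy_move = lambda x, rng: x + rng.choice((-1, 1))
result = anneal(20, toy_cost, toy_move)
```

The probabilistic acceptance of uphill moves is what lets the search leave a local minimum; at low temperature the loop degenerates into plain iterative improvement.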
[Figure: objective function plotted over the search space]
• Advantages: Can escape local minima and arrive at a (near-)optimum solution
• Disadvantages: Requires tremendous computing 
resources and time
Parallel Simulated Annealing - An Overview
[Figure: three parallel annealing schemes, panels (a), (b), and (c), each showing accept/reject decisions per iteration against 1/Temp]
• Parallel evaluation within each move
□ Tasks include: module selection, wire length and 
overlap evaluation, acceptance evaluation
□  Limited parallelism (3-4)
• Parallel evaluation of multiple moves and acceptance 
of one
□  Accept only one per iteration step
□  Basically a serial algorithm
□ Possibly good moves left unused
• Parallel evaluation/acceptance of multiple moves
□  Problems with move interactions
□ Difficult to maintain consistent data among
processors
□ Effects on convergence??
Standard Cell (SC) Placement - An Introduction
•  Given a set of standard cells and a net list
•  Place cells to minimize the total length of wires and 
area
[Figure: standard cells C1 to C17 arranged in rows inside the chip image boundary, with routing channels (Channel 1, Channel 2) between the rows]
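To make the wire-length objective concrete, the sketch below estimates total wire length with the half-perimeter bounding-box measure. The slides do not say which estimator is used, so the half-perimeter choice, the cell coordinates, and the net list are assumptions for illustration only.

```python
def half_perimeter(net, positions):
    """Half-perimeter of the bounding box enclosing all cells on one net."""
    xs = [positions[cell][0] for cell in net]
    ys = [positions[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_wire_length(nets, positions):
    """Sum of the per-net estimates; this is the quantity a placer would minimize."""
    return sum(half_perimeter(net, positions) for net in nets)

# Hypothetical placement: cell name -> (x, y)
positions = {"C1": (0, 0), "C2": (4, 0), "C5": (1, 2), "C6": (5, 2)}
nets = [("C1", "C2"), ("C1", "C5", "C6")]
wire_length = total_wire_length(nets, positions)
```

A displacement or exchange move only changes the nets attached to the moved cells, so an annealer normally recomputes just those nets rather than the full sum.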
Introduction (cont.)
• Coarse Grain Parallelism
□ Kravitz & Rutenbar
□  Kling & Banerjee
• Medium Grain Parallelism
□  Jones & Banerjee
□  Sargent & Banerjee
□  Rose, et al
□  Darema
□ Casotto & Sangiovanni-Vincentelli
• Fine Grain Parallelism
□ Wong & Fiebrich
□  Casotto & Sangiovanni-Vincentelli
Parallel SC Placement - Coarse Grain
Kravitz & Rutenbar
Arch: VAX-11/784 (four 11/780’s connected by 8Mbyte 
shared memory)
Algo: based on simulated annealing (similar to Timberwolf)
• High T: Move Decomposition:
□  Object - processors are responsible for sets of objects
□  Function - processors are responsible for certain 
functions/operations
□ Static or dynamic allocation to processors
• Low T: Parallel Moves:
□  Find "serializable subset" of moves (i.e., same
outcome as if processed in some serial order)
□  Simple Scheme: At a particular time step, abort 
all other moves after the first accepted move is 
reported.
Kravitz & Rutenbar (cont.)
• Results
□  Static move decomposition scheduling works 
better than dynamic due to the dynamic 
scheduling overhead
□ Move Decomposition works best at high temps; 
Parallel Moves works best at low temps
□ Best speedup comes from switching from one 
strategy to the other midway through the 
annealing process
□  The optimal switching temperature is difficult to
determine
[Figure: a(T) (dotted) and speedup (solid) for Parallel Moves, plotted against Temperature/1000]
Parallel SC Placement - Coarse Grain
Kling & Banerjee
Architecture : SUN 3 Workstation network 
Algorithm : Simulated Evolution
Kling & Banerjee (cont.)
Evaluation: Determine goodness of each cell according to
given cost function, e.g., minimize the total
interconnection length
Selection: Select cells to be replaced based on their
goodness (= chance of survival in current location)
Allocation: Place selected cells in improved positions
Kling & Banerjee (cont.)
[Figure: parallel algorithm flowchart for PROCESSOR i and PROCESSOR i+1]
[Figure: workload partitioning for 3 processors]
Kling & Banerjee (cont.)
[Figure: speedup results for circuits I to V]
Circuit data:
I. approx. 200 cells
II. approx. 400 cells
III. approx. 600 cells
IV. approx. 800 cells
V. approx. 1000 cells
Parallel SC Placement - Medium Grain
Jones & Banerjee
Arch: Intel iPSC/D4 Hypercube
• Distributed memory parallel simulated annealing 
algorithm using message passing
• Partition chip into regions, assign to different 
processors
• Evaluate/accept multiple moves between specific 
pairs of processors
• Update new cell locations using broadcast
Parallel Moves
• Four types of moves:
□ Intraprocessor displacement
□ Intraprocessor exchange
□ Interprocessor displacement
□ Interprocessor exchange
• Outline of move steps (interprocessor exchange)
Master                                   Slave
Select a cell at random.                 Select a cell at random.
Send cell to slave.                      Send cell to master.
Receive cell from slave.                 Receive cell from master.
Compute partial exchange cost P1.        Compute partial exchange cost P2.
Receive partial cost P2 from slave.      Send partial cost P2 to master.
If aggregate (P1+P2) cost acceptable,
then modify cell list.
Add move to move queue.
Inform slave of decision.                Receive acceptance decision.
                                         If move was accepted,
                                         then update local placement.
Jones & Banerjee (cont.)
• Results - Runtime and speedup
Number     1 Processor            4 Processors           16 Processors
of Cells   Runtime (hrs) Speedup  Runtime (hrs) Speedup  Runtime (hrs) Speedup
32         4.2           1.0      3.2           1.4      2.4           1.9
64         11.7          1.0      6.4           1.8      2.9           4.0
183        25.1          1.0      9.7           2.6      3.8           6.7
286        32.5          1.0      11.7          2.8      4.1           7.9
469        130.7         1.0      43.8          3.0      10.8          12.1
800        57.0          1.0      18.1          3.1      5.6           10.2
Number of attempts per cell (row alignment lost): 100, 100, 25, 511
• Remarks
□ Good speedups achieved
□ Error in computation can be large and can affect
convergence
□ No theoretical justification for empirical annealing 
schedule
Parallel SC Placement - Medium Grain
Sargent & Banerjee
Arch: Intel iPSC/D4 Hypercube
Algo: Distributed memory parallel simulated annealing
• Basic approach
□ Reduce error in parallel moves
□  Use row partitioning
- Eliminates error in row length and cell overlap
calculations
- Only have possible error in wire length calculations
□  Gray code numbering scheme
Row 1: Node 0
Row 2: Node 1
Row 3: Node 3
Row 4: Node 2
Row 5: Node 6
Row 6: Node 7
Row 7: Node 5
Row 8: Node 4
□ Use Graph Coloring to increase the size of the 
"serializable set"
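The row-to-node assignment listed above matches the binary-reflected Gray code, in which consecutive codes differ in exactly one bit, so adjacent rows land on directly connected hypercube nodes. A small sketch (the function name `gray` is chosen here for illustration):

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)

# Rows are numbered from 1, hypercube nodes from 0.
rows_to_nodes = {row: gray(row - 1) for row in range(1, 9)}

def hamming(a, b):
    """Number of differing bits, i.e. hypercube distance between two nodes."""
    return bin(a ^ b).count("1")

adjacent_distances = [hamming(rows_to_nodes[r], rows_to_nodes[r + 1]) for r in range(1, 8)]
```

Every entry of `adjacent_distances` is 1, which is why a cell moving to a neighboring row only needs a single-hop message.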
Sargent & Banerjee (cont.)
• Annealing schedule
□  T_init - sampled so that 75% of bad moves are
accepted
□  Equilibrium detection - the variance of state cost 
is sampled until equilibrium is reached (similar to 
Huang, Romeo, Vincentelli schedule)
□  Frozen condition - when average placement cost 
varies < 1% over consecutive temperatures
□  Estimation of average error in computation -
Keep acceptance rate within 5% of serial
algorithm; use that to guide the frequency of
updates
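Two of the schedule rules above lend themselves to a short sketch: choosing an initial temperature at which a target fraction of sampled uphill ("bad") moves would be accepted, and the 1% frozen test. The bisection search, the Metropolis acceptance probability exp(-delta/T), and both function names are assumptions for illustration; the slides do not describe how the sampling is implemented.

```python
import math

def initial_temperature(uphill_deltas, target=0.75, lo=1e-6, hi=1e9, iters=200):
    """Find T such that the mean of exp(-delta/T) over sampled uphill moves equals target."""
    def accepted_fraction(t):
        return sum(math.exp(-d / t) for d in uphill_deltas) / len(uphill_deltas)
    for _ in range(iters):
        mid = math.sqrt(lo * hi)          # bisection on a logarithmic scale
        if accepted_fraction(mid) < target:
            lo = mid                      # too cold: too few bad moves accepted
        else:
            hi = mid
    return hi

def is_frozen(avg_costs, tol=0.01):
    """True when the average cost changed by less than tol between the last two temperatures."""
    if len(avg_costs) < 2:
        return False
    prev, cur = avg_costs[-2], avg_costs[-1]
    return abs(cur - prev) < tol * abs(prev)

t_init = initial_temperature([10.0, 10.0, 10.0])
```

With all sampled cost increases equal to 10, the result agrees with the closed form T = -10 / ln(0.75).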
Sargent & Banerjee (Cont’d)
• Three parallel annealing algorithms implemented
• FIXEDSEQ: Fixed number of cell moves with row 
partition
• CELLCOL: Use cell coloring to select non-interacting
cells
• ADAPTIVE: Adaptive annealing schedule with 
error control
• Placement Results
Circuit  Number of
Size     Processors  FIXEDSEQ  CELLCOL  ADAPTIVE  TimberWolf3.2
32       1           4666      6803     4930      5101
32       2           4761      6485     4855      -
32       4           5276      6663     4995      -
64       1           14604     25098    14448     14798
64       2           14830     24385    14553     -
64       4           14633     24497    14626     -
64       8           14497     22320    14541     -
183      1           80588     102191   81422     97956
183      2           80747     109021   75324     -
183      4           86932     114025   77968     -
183      8           82751     119648   89537     -
183      16          82594     125545   83095     -
286      1           126478    na       144939    127788
286      2           132140    na       135517    -
286      4           131887    na       172017    -
286      8           153923    na       149587    -
286      16          143310    na       142849    -
469      1           na        na       na        258744
469      2           na        na       na        -
469      4           na        na       245808    -
469      8           na        na       na        -
469      16          na        na       249162    -
800      1           na        na       na        494948
800      2           na        na       na        -
800      4           na        na       544721    -
800      8           na        na       na        -
800      16          na        na       553875    -
Sargent & Banerjee (Cont’d)
• Runtime and Speedup Results
Circuit  Number of   FIXEDSEQ             CELLCOL              ADAPTIVE
Size     Processors  Time (min)  Speedup  Time (min)  Speedup  Time (min)  Speedup
32       1           (lost)      1.0      (lost)      1.0      (lost)      1.0
32       2           (lost)      0.9      (lost)      1.2      (lost)      1.2
32       4           (lost)      1.1      (lost)      1.4      (lost)      2.1
64       1           32          1.0      38          1.0      28          1.0
64       2           30          1.1      31          1.2      17          1.7
64       4           24          1.3      22          1.7      10          2.8
64       8           22          1.4      20          1.9      7           4.4
183      1           406         1.0      542         1.0      434         1.0
183      2           451         0.9      465         1.2      433         1.0
183      4           447         0.9      247         2.2      125         3.5
183      8           123         3.3      128         4.2      118         3.7
183      16          118         3.4      97          5.6      58          7.5
286      1           2940        1.0      na          na       1350        1.0
286      2           2415        1.2      na          na       1114        1.2
286      4           1135        2.6      na          na       348         3.9
286      8           720         4.1      na          na       390         3.5
286      16          482         6.1      na          na       170         7.9
469      1           na          na       na          na       na          na
469      2           na          na       na          na       na          na
469      4           na          na       na          na       4149        na
469      8           na          na       na          na       na          na
469      16          na          na       na          na       1018        na
800      1           na          na       na          na       na          na
800      2           na          na       na          na       na          na
800      4           na          na       na          na       6225        na
800      8           na          na       na          na       na          na
800      16          na          na       na          na       1260        na
Jones time (minutes), in the order listed for each circuit (processor counts not recoverable):
32 cells: 36, 30
64 cells: 102, 66
183 cells: 876, 408, 168
286 cells: 6840, 2448, 840
469 cells: 136800, 45600, 10800
800 cells: 117720, 30780, 10260
Parallel SC Placement - Fine Grain
Wong & Fiebrich
Casotto & Sangiovanni-Vincentelli
Arch: Connection Machine - 64K bit-processor SIMD 
machine
Algo: Simulated Annealing - similar to Timberwolf
□  Range limiting based on acceptance ratios is 
used to reduce the occurrence of conflicting 
moves
□  Partitioning - distribute cells and nets to PE’s so 
that each PE can perform a single task on its 
element
[Figure: processors used for each cell; processors used for each net]
□ Very large number of moves attempted and 
accepted in parallel
Fine Grain Parallel SC Placement (cont.)
• Results
□  Placement quality comparable to TimberWolf
□  800 cell circuit (2935 terminals) requires > 6000 
PE’s
Program      Machine      Moves per Second
TimberWolf   VAX 11/780   300
TimberWolf   VAX 8650     1000
TimberWolf   IBM 3081     2000
Picolit      CM           27000
Floorplanning - An Introduction
• Generalized module placement
□  Assign components or modules to rectangular 
portions of the chip surface
□  Modules may assume any of a set of allowed 
shapes (aspect ratios)
• Goals of Floorplanning:
□  Minimize module surface area
□ Minimize module interconnection length
□  Satisfy performance constraints
• Typical algorithm: (min-cut or simulated annealing)
Stage 1: Determine relative positions of modules to 
minimize interconnection length
Stage 2: Select shapes to minimize area
Parallel Floorplanning - Medium Grain
Jayaraman & Rutenbar
• Architecture: Intel iPSC Hypercube
• Algorithm: Floorplanning using the simulated 
annealing methodology
□  Applied Move Decomposition and Parallel Moves 
to achieve speedup improvements
□  Nodes are clustered and sets of modules are 
assigned to each cluster
[Figure: hypercube nodes grouped into Cluster 0 and Cluster 1]
Jayaraman & Rutenbar (cont.)
• Move Decomposition 
□  Master-Slave
□  Pipelined
[Figure: Master/Slave and Pipeline forms of move decomposition]
• Parallel Moves
□  Clusters are not assigned a region of the chip 
surface, only a set of modules
□  To enhance the solution space, module 
ownership is periodically randomized
• Two methods of updating data structures:
Global: A variable length sequence of moves is attempted 
between consecutive global update cycles
Partial: updating neighbor clusters provides partially 
correct states and reduces overhead
Jayaraman & Rutenbar (cont.)
• Results
□  The increased time required to update the 
sequences offsets the savings from using long 
sequences of moves between updates
□  Speedups of 4-7.5 using 16 processors
□  Degradation in solution quality offsets the savings 
of partial updates
Global Routing(GR) - An Introduction
• Given a set of standard cells, along with cell positions, 
and a net list
•  Assign nets to routing regions (channels) between the 
cell rows
• Goal: Minimize channel density and chip area
[Figure: standard cell placement with Channels 1 to 5 and routing pins, shown next to its cost array representation (Chan 1 to Chan 5)]
Introduction (cont.)
• Coarse Grain Parallelism
□ Rose
• Medium Grain Parallelism
□ Olukotun, Mudge
□ Won & Sahni
Parallel GR - Coarse Grain
Jonathan Rose
Arch: Encore MULTIMAX, Eight processors, shared 
memory
Algo: Find the minimum cost path by finding minimum 
average wire density along a set of possible paths
□  Decompose multi-terminal nets into two-terminal 
nets
□  Determine set of four permutations possible 
between the pin clusters
□  Evaluate a subset of two-bend routes and select 
minimum path
□  Increment cost array elements along the path 
• Only one-bend or two-bend routes are used
• Lay-down/rip-up cycle is repeated four or five times to 
reduce wire order-dependencies
[Figure: standard cell rows with two pin clusters, showing route permutation A2 -> B2]
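The route-evaluation step can be sketched as below: enumerate the one-bend and two-bend routes between two pins on a cost array, pick the cheapest, and increment the array along it. This is a simplified illustration under stated assumptions, not LocusRoute itself: the grid model, the restriction to vertical jogs between the pin columns, and all function names are hypothetical. Because every candidate here has the same length, comparing summed density is equivalent to comparing average density.

```python
def segment(a, b):
    """Grid cells on a straight horizontal or vertical run from a to b, inclusive."""
    (r0, c0), (r1, c1) = a, b
    if r0 == r1:
        step = 1 if c1 >= c0 else -1
        return [(r0, c) for c in range(c0, c1 + step, step)]
    step = 1 if r1 >= r0 else -1
    return [(r, c0) for r in range(r0, r1 + step, step)]

def path_through(points):
    """Concatenate straight runs through the given corner points without repeating corners."""
    cells = []
    for a, b in zip(points, points[1:]):
        run = segment(a, b)
        cells.extend(run if not cells else run[1:])
    return cells

def candidate_routes(src, dst):
    (r0, c0), (r1, c1) = src, dst
    routes = [path_through([src, (r0, c1), dst]),      # one bend
              path_through([src, (r1, c0), dst])]      # one bend
    lo, hi = sorted((c0, c1))
    for c in range(lo + 1, hi):                        # two bends: vertical jog at column c
        routes.append(path_through([src, (r0, c), (r1, c), dst]))
    return routes

def route(cost, src, dst):
    """Choose the candidate with the lowest summed cost and lay the wire down."""
    best = min(candidate_routes(src, dst), key=lambda p: sum(cost[r][c] for r, c in p))
    for r, c in best:
        cost[r][c] += 1
    return best

cost = [[0, 0, 0, 5],
        [0, 0, 0, 5],
        [0, 0, 0, 0]]
best = route(cost, (0, 0), (2, 3))
```

Each candidate is evaluated independently of the others, which is the property that route-based parallelism relies on.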
Jonathan Rose (cont.)
• Axes of Parallelism:
□  Wire-Based - Each PE is given an entire multi-terminal
net to route in parallel
□  Segment-Based - Each two-terminal segment 
can be routed in parallel
□  Permutation-Based - Each of four possible 
permutations can be evaluated in parallel
□  Route-Based - Each of the possible two-bend 
routes can be evaluated in parallel
•  Taken separately, Wire, Segment, and Route-Based 
parallelism each give reasonable speedup
□ Permutation-Based unnecessary for most nets
• Predict good speedup when two or more combined
Circuit     S_w   S_r   S_w x S_r   (col. 5)   (col. 6)
BNRE        5.8   1.2   7.0         46         6.6
MDC         5.9   1.3   7.7         38         5.0
BNRD        7.0   1.4   9.8         50         5.1
BNRA        7.1   2.0   14.2        134        9.?
Test06      7.2   4.6   33          935        28
Primary 2   7.6   3.3   25          358        14
(The headers of the last two columns are illegible in the source.)
Parallel GR - Medium Grain
Olukotun & Mudge
Arch: NCUBE/6 Hypercube - 64 nodes with 128 Kbyte 
memory
Algo: Maze Routing
□  Wavefront Expansion
□  Backtrack
□  Cleanup
[Figure: maze routing grid showing (a) initial configuration, (b) after wavefront expansion and path recovery, (c) after sweeping. Legend: blocked cell, free cell, starting cell (S), target cell (T), recovered path]
• Problem partitioning: Assign chip areas to 
corresponding nodes of an embedded 2-D array
• Wave expansion proceeds from each source node to 
its neighbor nodes in parallel
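A serial version of the three maze-routing phases is sketched below as a plain breadth-first (Lee-style) router on a grid. It is an illustrative single-processor sketch, not the hypercube implementation; the grid encoding (0 = free, 1 = blocked) and the function name are assumptions. The cleanup phase corresponds to discarding the distance labels once the path is recovered.

```python
from collections import deque

def maze_route(grid, start, target):
    """Return a shortest path of grid cells from start to target, or None if blocked."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    frontier = deque([start])
    while frontier:                                   # wavefront expansion
        r, c = frontier.popleft()
        if (r, c) == target:
            break
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in dist):
                dist[(nr, nc)] = dist[(r, c)] + 1
                frontier.append((nr, nc))
    if target not in dist:
        return None
    path = [target]                                   # backtrack along decreasing labels
    while path[-1] != start:
        r, c = path[-1]
        path.append(next(n for n in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
                         if dist.get(n) == dist[(r, c)] - 1))
    return path[::-1]                                 # labels in dist are dropped (cleanup)

grid = [[0, 0, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 0]]
path = maze_route(grid, (0, 0), (2, 0))
```

In the parallel setting each processor would own a region of `grid`, and the frontier would cross region boundaries by message passing.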
Channel Routing (CR) - An Introduction
• Rectangular channel with numbered terminals along 
top and bottom edges
• Nets connect terminals with the same terminal 
number
• Variations on the CR problem:
□  Two-layer
□  Multi-layer
□  Knock-knee
□ Switchbox
[Figure: (a) terminal assignment, with terminal numbers 1 4 5 1 6 7 4 9 10 10 along the top edge and 2 3 5 3 5 2 6 8 9 8 7 9 along the bottom edge; (b) a possible routing]
Parallel CR (cont.)
• Coarse Grain
□  Zargham
• Medium Grain
□  Brouwer & Banerjee
• Fine Grain
□  Chang & JaJa
Parallel CR - Coarse Grain
Zargham
Arch: Sequent Balance 8000 - Shared memory, 1-6 
processors
Algo: Three main phases
1: Horizontally divide the channel into regions
2: Assign tracks to nets in the columns separating 
the regions
3: Assign track to nets in the remaining columns of 
the regions
[Figure: channel divided by separating columns C1, C2, C3, ... into region 1, region 2, region 3, ..., region k]
[Figure: four types of moves for routing: straight, dogleg, H-jog, exchange]
Zargham (cont.)
□ Track assignments are based on heuristic functions
□ Processors can perform Phase 3 in parallel on the 
regions of the channel
□ Processors solve an AND/OR search tree to find the 
solution
Parallel CR - Medium Grain
Brouwer & Banerjee
Arch: Intel iPSC/2/D4 Hypercube, 16 processors
Algo: Distributed memory parallel simulated annealing
• Determine channel density = N tracks
• Allocate N/P tracks to each processor
• Initially distribute nets to different processors
• Allow vertical and horizontal overlaps
• Objective: move nets to decrease overlap to zero
• Use simulated annealing to accept uphill moves
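The first two steps above can be illustrated as follows. Channel density is taken here to be the largest number of net spans crossing any single column, which is an assumption about the exact definition used; the terminal lists and function names are hypothetical, with 0 meaning "no terminal".

```python
def channel_density(top, bottom):
    """Maximum number of nets whose horizontal span covers the same column."""
    spans = {}
    for col, (t, b) in enumerate(zip(top, bottom)):
        for net in (t, b):
            if net:                                   # 0 means no terminal in this column
                lo, hi = spans.get(net, (col, col))
                spans[net] = (min(lo, col), max(hi, col))
    return max(sum(lo <= col <= hi for lo, hi in spans.values())
               for col in range(len(top)))

def tracks_per_processor(n_tracks, n_procs):
    """Ceiling of N/P, so that every track is owned by some processor."""
    return -(-n_tracks // n_procs)

top = [1, 2, 3, 0, 0, 0]
bottom = [0, 0, 0, 3, 2, 1]
density = channel_density(top, bottom)
share = tracks_per_processor(density, 2)
```

Here all three nets overlap in columns 2 and 3, so at least three tracks are needed, and two processors would each own at most two of them.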
Brouwer & Banerjee (cont.)
• Partitioning the problem
□ Nodes are mapped in a gray code sequence from 
top to bottom of channel
□  Nodes maintain data for the set of nets in their 
own track sets
□  Nodes pair up with other nodes ±1, ±2, +4 units 
away
• Parallel Moves:
Intra-Displace
Inter-Displace
Intra-Exchange
Inter-Exchange
ITERATION STAGE
Brouwer & Banerjee (cont.)
•  Parallel Move Decomposition:
Move Summary
Type                  Master (M)             Slave (S)
Preliminary           Select Subnet          Select Subnet
                      Select Move            Calc Remove Cost
                      Calc Remove Cost       Recv Net from M
                      Send Net to S
Move 0 (Intra-Displ)  Select Trk             Select Trk
                      Calc Add Cost          Calc Add Cost
Move 1 (Inter-Displ)  Recv Cost from S       Select Trk
                                             Calc Add Cost
                                             Send Cost to M
Move 2 (Intra-Exch)   Select Net 2           Select Net 2
                      Calc Exch Cost         Calc Exch Cost
Move 3 (Inter-Exch)   Recv Net from S        Send Net to M
                      Calc Exch Cost         Calc Exch Cost
                      Recv Cost from S       Send Cost to M
All                   Evaluate Acceptance    Evaluate Acceptance
                      Broadcast Updates      Broadcast Updates
• Parallel Move Communication
[Figure: communication patterns for MOVE 0, MOVE 1, and MOVE 3]
Brouwer & Banerjee (cont.)
• Speedup Results
                                Number of Nodes
Channel                         1      2      4      8      16
12 Trks, 39 Subnets    T (sec)  22.8   10.0   5.6    4.32   NA
                       Speedup  1.00   2.28   4.07   5.30   NA
15 Trks, 73 Subnets    T (sec)  243    109    61.9   41.1   NA
                       Speedup  1.00   2.24   3.93   5.92   NA
17 Trks, 131 Subnets   T (sec)  798    330    176    95.7   58.8
                       Speedup  1.00   2.42   4.54   8.34   13.64
19 Trks, 229 Subnets   T (sec)  1952   728    332    190    110.5
                       Speedup  1.00   2.68   5.88   10.28  17.67
• Routing Results
[Figure: example of a routed channel]
Conclusions
• Physical design tasks form a rich source of 
parallelism
• We have reviewed some novel parallel algorithms for 
floorplanning, placement, and routing
• Demonstrated good speedups for those applications
• Should investigate better parallel algorithms in the 
future
PARALLEL ALGORITHMS FOR PHYSICAL DESIGN
A Select Bibliography
Prith  Banerjee
[1] P. Banerjee and M. Jones, "A Parallel Simulated Annealing for Standard Cell Placement on a
Hypercube Computer," Proc. Int. Conf. Computer-Aided Design, pp. 34-37, Nov. 1986.
[2] R. Brouwer and P. Banerjee, "A Parallel Simulated Annealing Algorithm for Channel
Routing on a Hypercube Multiprocessor," in Proc. Int. Conf. on Computer Design (ICCD-88),
Rye Brook, NY, Oct. 3-5, 1988.
[3] A. Casotto, F. Romeo, and A. S. Vincentelli, "A Parallel Simulated Annealing Algorithm for
the Placement of Macro-Cells," Proc. Int. Conf. on Computer-Aided Design, pp. 30-33, Nov.
1986.
[4] A. Casotto, F. Romeo, and A. S. Vincentelli, "A Parallel Simulated Annealing Algorithm for
the Placement of Macro-Cells," IEEE Trans. Computer-Aided Design, pp. 838-847, Sep.
1987.
[5] A. Casotto and A. S. Vincentelli, "Placement of Standard Cells Using Simulated Annealing
on the Connection Machine," Proc. Int. Conf. Computer-Aided Design, pp. 350-353, Nov.
1987.
[6] S. C. Chang and J. JaJa, "Parallel Algorithms for Channel Routing in the Knock-Knee
Model," Proc. Int. Conf. Parallel Processing, pp. 18-25, Aug. 1988.
[7] M. J. Chung and K. K. Rao, "Parallel Simulated Annealing for Partitioning and Routing,"
Proc. Int. Conf. on Computer Design, pp. 238-242, Oct. 1986.
[8] F. Darema, S. Kirkpatrick, and V. A. Norton, "Parallel Algorithms for Chip Placement by
Simulated Annealing," IBM Jour. Res. Dev., May 1987.
[9] F. Darema and G. F. Pfister, "Multipurpose Parallelism for VLSI CAD on the RP3," IEEE
Design and Test of Computers, vol. 4, pp. 19-27, Oct. 1987.
[10] S. Devadas and A. R. Newton, "Topological Optimization of Multiple Level Array Logic: On
Uni and Multi-processors," Proc. Int. Conf. Computer-Aided Design (ICCAD-86), pp. 38-41,
Nov. 1986.
[11] R. Jayaraman and R. Rutenbar, "Floorplanning by Annealing on a Hypercube
Multiprocessor," Proc. Int. Conf. Computer-Aided Design, pp. 346-349, Nov. 1987.
[12] M. Jones and P. Banerjee, "Performance of a Parallel Algorithm for Standard Cell Placement
on the Intel Hypercube," in Proc. 24th Design Automation Conf., Miami Beach, FL, pp. 807-813,
Jun. 1987.
[13] R. M. Kling and P. Banerjee, "Concurrent ESP: A Placement Algorithm for Execution on
Distributed Processors," in Proc. Int. Conf. on Computer-Aided Design, Santa Clara, CA, pp.
354-357, Nov. 1987.
[14] S. A. Kravitz and R. A. Rutenbar, "Placement by Simulated Annealing on a Multiprocessor,"
IEEE Trans. Computer-Aided Design, vol. CAD-6, pp. 534-549, Jun. 1987.
[15] O. A. Olukotun and T. N. Mudge, "A Preliminary Investigation into Parallel Routing on a
Hypercube," Proc. Design Automation Conf., pp. 814-820, Jun. 1987.
[16] J. Rose, "Locusroute: A Parallel Global Router for Standard Cells," Proc. Design Automation
Conf., pp. 189-195, Jun. 1988.
[17] J. S. Rose, D. R. Blythe, W. M. Snelgrove, and Z. G. Vranesic, "Fast, High Quality VLSI
Placement on a MIMD Multiprocessor," Proc. Int. Conf. Computer-Aided Design, pp. 42-45,
Nov. 1986.
[18] J. S. Rose, D. R. Blythe, W. M. Snelgrove, and Z. G. Vranesic, "Parallel Standard Cell
Placement Algorithms with Quality Equivalent to Simulated Annealing," IEEE Trans.
Computer-Aided Design, pp. 387-396, Mar. 1988.
[19] R. A. Rutenbar and S. A. Kravitz, "Layout by Annealing in a Parallel Environment," Proc.
Int. Conf. on Computer Design, pp. 434-437, Oct. 1986.
[20] Y. Won and S. Sahni, "Maze Routing on a Hypercube Multiprocessor Computer," Proc. Int.
Conf. Parallel Processing, pp. 630-637, Aug. 1987.
[21] C. P. Wong and R. D. Fiebrich, "Simulated Annealing-Based Circuit Placement on the
Connection Machine System," Proc. Int. Conf. Computer Design, pp. 78-82, Oct. 1987.
[22] M. R. Zargham, "Parallel Channel Routing," Proc. Design Automation Conf., pp. 128-133,
Jun. 1988.
[Figure: cost versus log10 temperature for 1, 2, 4, and 8 nodes]
SESSION 3
PARALLEL ALGORITHMS FOR SIMULATION
Resve Saleh
Center for Supercomputer Research and Development 
Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign
November, 1988
VLSI CAD, Simulation, and Parallel Processing
SESSION 2
1. Silicon Compilers
2. Cell Generators
3. Floorplanning
4. Cell Placement
5. Global routing
6. Detailed routing
SESSION 4
1. Node Extraction
2. Design Rule Checking
3. Design Verification
4. Test Generation
5. Fault Simulation
SESSION 3
1. Circuit Simulation
2. Switch Simulation
3. Logic Simulation
4. Behavioral Simulation
Outline
• Parallel Circuit Simulation
□ Circuit Simulation Problem
□ Circuit Simulation Algorithms
□ Parallel Direct Methods
□ Parallel Relaxation Methods
• Parallel Logic Simulation
□ Synchronous and Asynchronous Methods
□ Partitioning Algorithms
Circuit Simulation Problem
Solve the System (derived using KCL):
$C(v)\,\dot{v}(t) = -f(v(t), u(t)), \quad v(0) = V, \quad t \in [0, T]$
where C = nodal capacitance matrix
f = sum of currents charging capacitances
at each node
u = input voltages
v = unknown node voltages
Circuit Simulation Techniques
Two classes:
1) Direct Methods (SPICE2):
• apply numerical integration formula to ODE
• use Newton-Raphson to convert nonlinear 
equation to linear equations
• solve linear equations using LU decomp.
2) Relaxation Methods:
• Nonlinear Relaxation (SPLICE)
i.e., Iterated Timing Analysis or ITA
• Waveform Relaxation (RELAX)
• Waveform-Relaxation-Newton (SPLAX)
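The three steps of the direct method can be seen on a single-node circuit, where the "linear solve" collapses to one division. The sketch below applies Backward Euler to C dv/dt = -f(v, u) and solves each time step with Newton-Raphson. The device law f(v, u) = (v - u)/R + g v^3, the parameter values, and the function name are assumptions chosen only to give a small nonlinear example; SPICE2 itself is not reproduced here.

```python
def simulate(u=1.0, C=1.0, R=1.0, g=1.0, v0=0.0, h=0.01, steps=2000):
    """Backward Euler + Newton-Raphson for one node with a cubic nonlinear load."""
    f = lambda v: (v - u) / R + g * v ** 3        # current leaving the node
    df = lambda v: 1.0 / R + 3.0 * g * v ** 2     # its derivative (1x1 Jacobian part)
    v = v0
    for _ in range(steps):
        v_new = v                                 # Newton starting guess: previous time point
        for _ in range(50):
            residual = C * (v_new - v) / h + f(v_new)   # discretized ODE at the new point
            jacobian = C / h + df(v_new)
            step = residual / jacobian            # scalar stand-in for the LU solve
            v_new -= step
            if abs(step) < 1e-12:
                break
        v = v_new
    return v

v_final = simulate()
```

After the transient dies out, `v_final` sits at the DC solution of v^3 + v - 1 = 0, about 0.6823. For a full circuit `residual` becomes a vector, `jacobian` a sparse matrix, and the division an LU factorization and substitution.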
Relaxation-based Methods vs. Direct Methods
[Figure: solution paths for a coupled system of equations: waveform relaxation; integration formulae (e.g., Backward Euler) followed by Gauss-Seidel; Gaussian elimination or LU decomposition]
Forms of Parallelism
1) Direct Methods:
• Parallel Model Evaluation (PME)
• Parallel Linear Equation Solution (PSOL)
2) Relaxation Methods:
• Subcircuit Level Parallelism (SLP)
□ Gauss-Jacobi
□ Gauss-Seidel
• Parallel Model Evaluation (PME)
• Time-Point Pipelining (TPP)
□ for WR and WRN only
• Parallel Time-Point (PTP)
□ for WRN only
CPU-Time vs. Circuit Size (direct methods)
Parallel Model Evaluation
• Model evaluation involves calculation and loading 
of entries of Jacobian matrix and right-hand side 
vector
• Device evaluations are completely independent of 
each other
• Conflicts may result during the loading
process
Resolving Loading Conflicts
• Using locks:
□ single lock / matrix
□ single lock / row of matrix
□ single lock / element of matrix
□ single lock / section of matrix
=> Most effective strategy depends on the cost of locking
and size of device tasks
• Without locks:
□ process all devices and store partial results in 
local templates
□ accumulate results into matrix using a sequential
or parallel approach
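The lock-free alternative can be sketched as follows: each device evaluation writes only to its own local template, and the templates are summed into the matrix afterwards. The resistor "stamp", the dictionary-based sparse matrix, and the function names are assumptions for illustration; a thread pool stands in for the parallel processors.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def stamp_resistor(n1, n2, conductance):
    """Local template: one resistor's contributions to the nodal matrix."""
    return {(n1, n1): conductance, (n2, n2): conductance,
            (n1, n2): -conductance, (n2, n1): -conductance}

def load_matrix(resistors, workers=4):
    # Device evaluations are independent, so they can run concurrently with no locks.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        templates = list(pool.map(lambda r: stamp_resistor(*r), resistors))
    matrix = defaultdict(float)
    for template in templates:                 # sequential accumulation phase
        for key, value in template.items():
            matrix[key] += value
    return dict(matrix)

matrix = load_matrix([(0, 1, 1.0), (1, 2, 0.5)])
```

Node 1 is shared by both resistors, so `matrix[(1, 1)]` is the one entry that would have needed a lock if the devices had written to the matrix directly.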
Device Task Granularity
• Each device has a different task size:
Device Size
MOS Transistor 1000
Diode 500
Resistor 250
Capacitor 125
• Overhead of spawning parallel evaluation may
dominate for small devices
• One Solution: combine devices together so that each 
task is roughly the same size
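One way to combine devices into tasks of roughly equal size is a greedy largest-first assignment, sketched below using the relative sizes from the table above. The greedy heuristic itself, the dictionary keys, and the function name are assumptions; the slides only state the goal, not the method.

```python
import heapq

# Relative evaluation cost per device type, taken from the table above.
DEVICE_COST = {"mos": 1000, "diode": 500, "resistor": 250, "capacitor": 125}

def balance(devices, n_tasks):
    """Assign each device, largest first, to the task with the smallest load so far."""
    heap = [(0, i) for i in range(n_tasks)]          # (current load, task index)
    groups = [[] for _ in range(n_tasks)]
    for dev in sorted(devices, key=DEVICE_COST.get, reverse=True):
        load, i = heapq.heappop(heap)
        groups[i].append(dev)
        heapq.heappush(heap, (load + DEVICE_COST[dev], i))
    return groups

devices = ["mos"] * 2 + ["diode"] * 2 + ["resistor"] * 4
groups = balance(devices, 2)
loads = [sum(DEVICE_COST[d] for d in g) for g in groups]
```

In this example both tasks end up with a load of 2000, so the spawning overhead is paid once per balanced group rather than once per small device.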
Linear Equation Solution
• Use LU factorization
for i = 1, n {
  for k = 1, i-1 {
    for j = k+1, n {
      a_ij <- a_ij - a_ik a_kj
    }
  }
  for j = i+1, n {
    a_ij <- a_ij / a_ii
  }
}
• Forward Substitution
• Backward Substitution
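A dense, unpivoted Python rendering of the row-oriented factorization, followed by the two substitution passes, is sketched below. It assumes nonzero pivots and ignores sparsity, so it only illustrates the arithmetic; with this loop order the lower triangle (including the diagonal) holds L and the strict upper triangle holds a unit-diagonal U.

```python
def lu_in_place(a):
    """Row-by-row LU: afterwards a holds L (with diagonal) and U (unit diagonal, not stored)."""
    n = len(a)
    for i in range(n):
        for k in range(i):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
        for j in range(i + 1, n):
            a[i][j] /= a[i][i]                 # normalize the row by its pivot
    return a

def solve(a, b):
    n = len(a)
    lu = lu_in_place([row[:] for row in a])    # factor a copy, leave the input intact
    y = [0.0] * n
    for i in range(n):                         # forward substitution: L y = b
        y[i] = (b[i] - sum(lu[i][k] * y[k] for k in range(i))) / lu[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):               # backward substitution: U x = y
        x[i] = y[i] - sum(lu[i][j] * x[j] for j in range(i + 1, n))
    return x

A = [[4.0, 2.0, 0.0],
     [2.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
x = solve(A, [8.0, 11.0, 8.0])
```

The updates of different rows `i` by the same pivot row `k` are independent of each other, which is the source of the row-level parallelism discussed next.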
Levels of Parallelism
• Coarse Grain: perform pivots in parallel
• Medium Grain: perform rows in parallel
• Fine Grain: perform elements in parallel
[Figure: pivot parallelism]
Pivot Dependency Graph
• Depth of graph indicates the number of steps 
required to solve on a parallel processor (assuming 
enough processors are available)
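Given the dependencies between pivots, the number of parallel steps is the length of the longest dependency chain. The sketch below assumes the dependency lists are already known and that pivots are numbered so every dependency points to a lower index; how the graph is derived from the matrix structure is not covered here, and the function name and example graph are hypothetical.

```python
def pivot_levels(n, depends_on):
    """Level of each pivot = 1 + highest level among the pivots it must wait for."""
    level = [0] * n
    for k in range(n):                               # elimination order
        preds = depends_on.get(k, ())
        level[k] = 1 + max((level[j] for j in preds), default=0)
    return level

# Hypothetical graph: pivot 2 waits for 0 and 1; pivots 3 and 4 both wait for 2.
levels = pivot_levels(5, {2: [0, 1], 3: [2], 4: [2]})
depth = max(levels)
```

Pivots sharing a level can be performed in parallel, so this five-pivot example needs three steps instead of five.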
Equation Ordering
• For uniprocessors, the equations are usually ordered 
to minimize fill-in (using Markowitz ordering):
[Figure: sparsity pattern of a 5 x 5 matrix ordered to minimize fill-in]
• But this may not be the ordering that maximizes the 
parallelism! i.e., if reordered to maximize parallelism:
[Figure: sparsity pattern of the same matrix reordered to maximize parallelism]
• However, the number of fill-ins increases.
• A tradeoff exists between parallelism and fill-ins
Other Decompositions
• Bordered-Block Diagonal Form (BBDF)
[Figure: bordered-block diagonal matrix structure]
• Nested Dissection
Circuit Partitioning

Subcircuit Level Parallelism
• Gauss-Jacobi: process all subcircuits in parallel
□ Advantage: high degree of parallelism
□ Disadvantage: slower convergence speed
• Gauss-Seidel: process subcircuits in a specific order
□ Advantage: faster convergence
□ Disadvantage: amount of parallelism depends on
circuit graph and processing strategy
•  multiple barrier in one iteration
• single barrier in one iteration
• multiple barrier across several iterations
• single barrier across several iterations (unrolled)
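The convergence difference between the two relaxations can be observed on a small linear system, where each unknown plays the role of one subcircuit. The matrix, tolerance, and function name below are assumptions chosen for illustration.

```python
def relax(a, b, gauss_seidel, tol=1e-10, max_iters=10000):
    """Iterate until successive sweeps differ by less than tol; return (x, sweeps used)."""
    n = len(b)
    x = [0.0] * n
    for iteration in range(1, max_iters + 1):
        old = x[:]
        for i in range(n):
            # Gauss-Seidel reads values already updated in this sweep;
            # Gauss-Jacobi reads only the previous sweep, so all i could run in parallel.
            source = x if gauss_seidel else old
            s = sum(a[i][j] * source[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
        if max(abs(x[i] - old[i]) for i in range(n)) < tol:
            return x, iteration
    raise RuntimeError("did not converge")

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x_gs, sweeps_gs = relax(A, b, gauss_seidel=True)
x_gj, sweeps_gj = relax(A, b, gauss_seidel=False)
```

Both runs reach the solution (1, 1, 1), but Gauss-Seidel needs fewer sweeps, while only the Gauss-Jacobi sweep is free of ordering constraints between unknowns.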
Multiple Barrier
• process all subcircuits in each rank in parallel
□ Advantage:
□ simple control structure on certain architectures
□ Disadvantages:
□ limited amount of parallelism (depends on graph)
□ potential load-balancing problems
Single Barrier
• process a subcircuit as soon as its inputs are ready 
(i.e., use a data-flow approach)
□ Advantage:
□ increases parallelism
□ improves prospects of load-balancing
□ Disadvantage:
□ more complicated control structure
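The data-flow idea can be sketched with a count of unsatisfied inputs per subcircuit: a subcircuit enters the ready queue the moment its last input has been produced, with no per-rank barrier. This serial sketch only shows the bookkeeping; the graph, the names, and the single shared queue are assumptions for illustration.

```python
from collections import deque

def dataflow_order(fanin):
    """Return one valid processing order in which each subcircuit follows all of its inputs."""
    fanout = {name: [] for name in fanin}
    pending = {name: len(sources) for name, sources in fanin.items()}
    for name, sources in fanin.items():
        for src in sources:
            fanout[src].append(name)
    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()            # a free processor would take this subcircuit
        order.append(name)
        for succ in fanout[name]:
            pending[succ] -= 1
            if pending[succ] == 0:        # last input arrived: schedule immediately
                ready.append(succ)
    return order

# Hypothetical subcircuit graph: name -> subcircuits feeding it
fanin = {"A": [], "B": [], "C": ["A"], "D": ["A", "B"], "E": ["C", "D"]}
order = dataflow_order(fanin)
```

In a multiple-barrier scheme C would have to wait until all of rank 1 (A and B) had finished; here it becomes ready as soon as A is done.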
Multiple Barrier across iterations
Single Barrier across iterations
Mixed Gauss-Seidel/Gauss-Jacobi Method
• There are a variety of relaxation methods between 
GS and GJ that can be used on a limited number of 
processors
• However, there is a trade-off between convergence 
and parallelism
1) for p=4, multiple barrier requires 3 steps
2) for p=4, multiple barrier requires 2 steps
[Figure: two groupings of subcircuits for p = 4]
Effect of Partitioning
• Partitioning used in the uniprocessor case may not be
appropriate for parallel processing
[Figure: CPU-time versus number of processors]
Waveform Relaxation
$\dot{x}_1^{k+1} = f_1(x_1^{k+1}, x_2^{k}, t)$
$\dot{x}_2^{k+1} = f_2(x_1^{k+1}, x_2^{k+1}, t)$
[Figure: iterated waveforms plotted against time]
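A Gauss-Seidel waveform relaxation sweep solves each equation over the whole time window while treating the other waveform as a known input. The sketch below does this for a small linear pair, x1' = -2 x1 + x2 and x2' = x1 - 2 x2, discretized with Backward Euler, and compares the result with a direct coupled solution. The test system, step size, sweep count, and function names are assumptions for illustration.

```python
def waveform_relaxation(h=0.01, steps=200, sweeps=30):
    """Gauss-Seidel WR with x1(0) = 1, x2(0) = 0."""
    x1 = [1.0] * (steps + 1)        # initial guess: hold the initial values
    x2 = [0.0] * (steps + 1)
    for _ in range(sweeps):
        for n in range(steps):      # equation 1 over the full window, old x2 waveform
            x1[n + 1] = (x1[n] + h * x2[n + 1]) / (1 + 2 * h)
        for n in range(steps):      # equation 2 over the full window, new x1 waveform
            x2[n + 1] = (x2[n] + h * x1[n + 1]) / (1 + 2 * h)
    return x1, x2

def direct(h=0.01, steps=200):
    """Backward Euler on the coupled system, solving a 2x2 linear system per time point."""
    x1, x2 = [1.0], [0.0]
    a, b = 1 + 2 * h, -h            # matrix [[a, b], [b, a]]
    det = a * a - b * b
    for _ in range(steps):
        r1, r2 = x1[-1], x2[-1]
        x1.append((a * r1 - b * r2) / det)
        x2.append((a * r2 - b * r1) / det)
    return x1, x2

wr = waveform_relaxation()
ref = direct()
max_error = max(abs(p - q) for p, q in zip(wr[0] + wr[1], ref[0] + ref[1]))
```

The relaxed waveforms converge to the same discrete solution as the direct method; the two inner loops are the units of work that subcircuit-level parallelism and time-point pipelining distribute across processors.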

Tradeoffs in Parallel WR
[Figure: tradeoffs in parallel WR. One axis runs from GS through GS/GJ to GJ (increasing parallelism, slower convergence); the other runs from full window through time-segment pipelining to time-point pipelining (increasing parallelism, increasing overhead)]
Waveform-Newton Algorithm
[Figure: waveform iterations K-1, K, and K+1]
Parallel Time Point (PTP)
• In WRN, the values required to perform the 
linearization process are known in advance
• The timestep control is such that all the timepoints for 
a given iteration are known in advance
• Most of the computation across timepoints can be
performed in parallel!
Logic Simulation
Problem Definition: Given a sequence of input 
patterns/waveforms, produce the logic waveforms at all 
output nodes of a given logic circuit.
Forms of Logic Simulation:
• switch-level simulation
• switch-level timing simulation
• gate-level simulation
• behavioral simulation
Logic Delay Models:
• zero-delay
• unit delay
• variable delay
Forms of Parallel Logic Simulation
• Synchronous: Require that the necessary nodes be 
evaluated at a given time point before moving on to 
the next time point
□ Compiled approach
□ Event-driven approach
• Asynchronous: Allow different portions of the circuit to 
be evaluated up to different points in time.
□ Timepoint Pipelining
□ Time-Warp Mechanism
Parallel Compiled Logic Simulation
• Basic Idea: Process all gates at every time step
• Parallel Version:
1. Levelize the logic circuit into n ranks (break 
feedback loops)
2. Partition circuit and assign to processors
3. Apply input pattern.
4. For k = 1 , n
Process all gates in rank k and determine output 
values
5. If end of simulation, STOP.
6. Else go to step 3.
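A serial sketch of the compiled approach is given below: the circuit is levelized once, and then every gate is evaluated in rank order for each input pattern. The netlist format (output name mapped to gate type and two input names), the gate set, and the full-adder example are assumptions for illustration; in a parallel version the gates of one rank would be split across processors.

```python
GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b, "XOR": lambda a, b: a ^ b}

def levelize(netlist, inputs):
    """Rank of a gate = 1 + highest rank among its inputs; primary inputs have rank 0."""
    rank = {name: 0 for name in inputs}
    remaining = dict(netlist)
    while remaining:
        progress = False
        for out, (kind, a, b) in list(remaining.items()):
            if a in rank and b in rank:
                rank[out] = 1 + max(rank[a], rank[b])
                del remaining[out]
                progress = True
        if not progress:
            raise ValueError("feedback loop: break it before levelizing")
    return rank

def simulate(netlist, inputs, patterns):
    rank = levelize(netlist, inputs)
    order = sorted(netlist, key=rank.get)
    results = []
    for pattern in patterns:                       # apply input pattern
        values = dict(zip(inputs, pattern))
        for out in order:                          # every gate is evaluated, rank by rank
            kind, a, b = netlist[out]
            values[out] = GATES[kind](values[a], values[b])
        results.append(values)
    return results

# Hypothetical full adder
netlist = {"s1": ("XOR", "a", "b"), "sum": ("XOR", "s1", "cin"),
           "c1": ("AND", "a", "b"), "c2": ("AND", "s1", "cin"),
           "cout": ("OR", "c1", "c2")}
results = simulate(netlist, ["a", "b", "cin"], [(1, 1, 0), (1, 0, 1), (1, 1, 1)])
outputs = [(r["sum"], r["cout"]) for r in results]
```

Because the evaluation order is fixed before simulation starts, no event queue is needed, at the price of evaluating gates whose inputs did not change.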
Features of Compiled Approach
• Can use static assignment of gates to processors
• Load balancing is straightforward if gate sizes are 
equal
• No dynamic event queue required 
Problem:
• low activity exists in large circuits and therefore most 
gates are processed unnecessarily
Event-Driven Simulation
• This form is based on classical event-driven logic 
simulation.
• All events at a given time point are processed before 
moving to next time point
• Steps:
1. All gates connected to inputs are scheduled in 
the event queue
2. Next event is removed from queue and 
processed
3. If output value changes, all fanout gates are 
scheduled
4. If no more events remain, STOP.
5. Else Goto step 2.
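A small sequential sketch of these steps is given below. It assumes a unit-delay model and represents an event as a (time, gate) pair; the function name, the dictionary-based netlist, and the example circuit are illustrative choices rather than anything specified in the tutorial.

```python
import heapq

def event_driven_simulate(gates, fanout, values, changed_inputs):
    """gates: name -> (function, fanin list); values: current value of every node.
    Returns the list of (time, gate, new value) changes and the evaluation count."""
    # Step 1: schedule all gates connected to the changed inputs.
    queue = [(1, g) for node in changed_inputs for g in fanout.get(node, [])]
    heapq.heapify(queue)
    trace, evaluations = [], 0
    while queue:                                  # step 4: stop when no events remain
        now = queue[0][0]
        batch = set()
        while queue and queue[0][0] == now:       # all events of this time point
            batch.add(heapq.heappop(queue)[1])
        updates = {}
        for g in sorted(batch):                   # step 2: remove and process events
            func, fanin = gates[g]
            new = func([values[f] for f in fanin])
            evaluations += 1
            if new != values[g]:
                updates[g] = new
        for g, new in updates.items():            # step 3: output changed, schedule fanouts
            values[g] = new
            trace.append((now, g, new))
            for h in fanout.get(g, []):
                heapq.heappush(queue, (now + 1, h))
    return trace, evaluations

AND = lambda ins: int(all(ins))
OR = lambda ins: int(any(ins))
NOT = lambda ins: 1 - ins[0]

# Hypothetical circuit: c = AND(a, b), n = NOT(c), out = OR(n, d).
gates = {"c": (AND, ["a", "b"]), "n": (NOT, ["c"]), "out": (OR, ["n", "d"])}
fanout = {"a": ["c"], "b": ["c"], "c": ["n"], "n": ["out"], "d": ["out"]}
# State was consistent for a=1, b=0, d=0; input b has just switched to 1.
values = {"a": 1, "b": 1, "d": 0, "c": 0, "n": 1, "out": 1}
trace, evaluations = event_driven_simulate(gates, fanout, values, ["b"])
```

Only gates whose inputs actually changed are evaluated, in contrast to the compiled approach.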
Parallel Synchronous Event-Driven Simulation
Simple Approach:
• Use one central queue for all event scheduling 
operations
• Each free processor removes an event from queue 
and processes it
• New events are then scheduled on the queue 
Problems:
1. Queue contention (processors are trying to access 
queue simultaneously)
2. Ratio of task size to queue overhead (granularity of 
task is relatively small)
3. Inactivity of large circuits (amount of work at each 
time point is small)
Distributed Queues
• Provide one queue for each processor
• Processors schedule events on other processors in a 
round-robin fashion
• Each processor gets events from its own queue
• If no work exists on the queue, look for work on 
another processor
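The queue discipline described above might look like the following sketch. The class and method names are hypothetical, the processors are modeled sequentially with no locking, and the choice to take stolen work from the far end of the victim's queue is an assumption of this example.

```python
from collections import deque

class DistributedQueues:
    """One event queue per processor, with round-robin scheduling and work stealing."""

    def __init__(self, num_procs):
        self.queues = [deque() for _ in range(num_procs)]
        # Each processor starts its round-robin at its right-hand neighbor.
        self.next_target = [(p + 1) % num_procs for p in range(num_procs)]

    def schedule(self, src, event):
        """Processor src places an event on the next other processor's queue."""
        n = len(self.queues)
        target = self.next_target[src]
        self.queues[target].append(event)
        nxt = (target + 1) % n
        if nxt == src and n > 1:                  # skip the scheduling processor itself
            nxt = (nxt + 1) % n
        self.next_target[src] = nxt
        return target

    def get_work(self, proc):
        """Take from the own queue first; otherwise look for work elsewhere."""
        if self.queues[proc]:
            return self.queues[proc].popleft(), proc
        for offset in range(1, len(self.queues)):
            victim = (proc + offset) % len(self.queues)
            if self.queues[victim]:
                return self.queues[victim].pop(), victim
        return None, None

dq = DistributedQueues(3)
targets = [dq.schedule(0, f"e{i}") for i in range(4)]
```

In a real shared-memory implementation each queue would need its own lock, which still spreads contention over several queues instead of one.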
Asynchronous Logic Simulation
(assumes variable delay)
• Use an approach similar to Timepoint Pipelining 
described for Waveform Relaxation
Steps for processor p:
(1) get minimum time of event from all inputs
(2) evaluate gate, store results, schedule fanouts
(3) get next event on the input node
(4) continue until no events left
Time-Warp Mechanism
Basic Idea: Allow processors to go ahead and 
compute logic waveforms with partial information 
whenever possible
[Figure: local time versus global time]
Process a gate whenever any new information 
becomes available
Reasoning behind this approach:
• blocked processors would normally waste time 
anyway
• computation performed with partial information is 
often correct!
Rejection and Rollback Mechanisms
What if new information arrives at an input?
Use a rollback scheme and recompute the solution.
[Figure: new event on node Y at time 135]
Therefore, node X must 
be rolled back to 135.
• send cancellation message to fanouts to remove 
effect of all affected events
• recompute solution to the last event (162)
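The rollback in this example can be sketched for a single gate as follows. This is only an illustration under several assumptions: node X is modeled as a two-input AND gate, the class and attribute names are invented, and the earlier state is restored by replaying the retained events from the initial state rather than by saving state snapshots.

```python
class TimeWarpGate:
    """One optimistic process: an AND gate that rolls back on late (straggler) events."""

    def __init__(self, inputs):
        self.inputs = dict.fromkeys(inputs, 0)
        self.local_time = 0
        self.processed = []    # (time, input name, value), in time order
        self.sent = []         # (time, output value) messages sent to fanouts
        self.cancelled = []    # cancellation messages issued during rollbacks

    def _apply(self, event):
        time, name, value = event
        self.inputs[name] = value
        self.local_time = time
        self.processed.append(event)
        self.sent.append((time, int(all(self.inputs.values()))))

    def receive(self, event):
        if event[0] >= self.local_time:
            self._apply(event)                    # normal case: event is not in the past
            return
        # Straggler: undo everything after the event time.
        redo = [e for e in self.processed if e[0] > event[0]]
        self.processed = [e for e in self.processed if e[0] <= event[0]]
        self.cancelled += [m for m in self.sent if m[0] > event[0]]
        self.sent = [m for m in self.sent if m[0] <= event[0]]
        # Restore the input state by replaying the events that were kept.
        self.inputs = dict.fromkeys(self.inputs, 0)
        for _, name, value in self.processed:
            self.inputs[name] = value
        # Process the straggler, then recompute up to the last event.
        for e in [event] + redo:
            self._apply(e)

x = TimeWarpGate(["Y", "Z"])
x.receive((100, "Z", 1))
x.receive((162, "Y", 1))   # X is now at local time 162 and has sent output 1
x.receive((135, "Z", 0))   # late information at time 135 forces a rollback
```

After the late event, the output sent at time 162 is cancelled and replaced, and the gate's local time is again 162.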
Static Partitioning Strategies
Considerations:
(1) concurrency
(2) inactivity
(3) queue contention
(4) communication
(5) task granularity
Possible Strategies:
(1) random partitioning
(2) partitioning by rank
(3) partitioning using signal paths
(4) partitioning using input/output cones
Random Partitioning
Randomly assign gate i to processor j:
• fast and simple
• offers more parallelism since activity tends to be 
clustered
• usually increases number of remote memory 
references
Partitioning by Rank
• levelize circuit
• assign gates in the same rank to different processors 
(for compiled simulation approach)
OR
• assign gates in adjacent ranks to different processors 
(for pipelined approach)
Partitioning using Signal Paths
Assign all gates along each signal path to one 
processor
• minimizes number of remote memory references
• minimizes overhead in distributed queue 
implementation
Input/Output Cones
Assign each gate to a processor based on the gate's 
affinity to that processor, measured by the number of 
gates in its input or output cone already assigned to 
the processor
Can reduce the communication costs significantly!
Partitioning into Clusters of Gates
• Group gates together to minimize number of 
interconnections between clusters of gates
• Treat clusters of gates as a single entity and assign to 
processors using one of the other strategies
• Used to increase task granularity and reduce number 
of remote memory references
• Reduces concurrency
Most Appropriate Strategy?
Depends on:
• communication cost vs. task size
• degree and distribution of activity
• degree of pipelining used
Ultimately, dynamic partitioning may prove to be the best if 
overhead and contention can be reduced to a minimum or 
task granularity is increased appropriately.
REFERENCES FOR PARALLEL ALGORITHMS FOR SIMULATION
Resve Saleh
[Arn85] J. Arnold, C. Terman, "A Multiprocessor Implementation of a logic level Timing 
Simulator", Proc. Int. Conf. on CAD, Nov. 1985.
[Bis86] G. Bischoff, S. Greenberg, "CAYENNE: A Parallel Implementation of the Circuit 
Simulator SPICE", Proc. Int. Conf. on CAD, Nov. 1986.
[Bau78] G. Baudet, "Asynchronous Iterative Methods for Multiprocessors", J. of the 
ACM, Vol. 25, NO. 2, April 1978, pp. 226-244.
[BBN85] BBN Laboratories, "The Uniform System Approach to Programming the 
Butterfly Parallel Processor", Version 1, Oct. 1985.
[Cat85] G. Catlin, B. Paseman, "Hardware Acceleration of Logic Simulation Using Data-
Flow Architecture", Proc. Int. Conf. on CAD, Nov. 1985.
[Cox86] P. Cox, R. Burch, B. Epler, "Circuit Partitioning For Parallel Processing", Proc. 
Int. Conf. on CAD, Nov. 1986.
[Cox87] P. Cox, R. Burch, D. Hocevar, P. Yang, "SUPPLE: Simulator Utilizing Parallel 
Processing and Latency Exploitation", Proc. Int. Conf. on CAD, Nov. 1987.
[ChMi71] D. Chazan, W. Miranker, "Chaotic Relaxation", Linear Algebra and Its 
Applications, Vol. 2, 1969, pp. 199-222.
[Deu84] J.T. Deutsch, A. R. Newton, "A Multiprocessor Implementation of Accurate 
Electrical Circuit Simulation", Proc. 19th Design Automation Conference, Las Vegas, 
Nv., 1984.
[Deu85] J.T. Deutsch, "Algorithms and Architecture for Multiprocessor-Based Circuit 
Simulation", Ph.D. dissertation, University of California, Berkeley, Memo. No. UCB/ERL 
M85/39, May 1985. Electronic Design, September 1984, pp. 153-168.
[Dum86] D. Dumlugol, P. Odent, J. Cockx, H. DeMan, "The Segmented Waveform 
Relaxation Method for Mixed-Mode Switch/Electrical Simulation of Digital MOS VLSI 
Circuits and its Hardware Acceleration on Parallel Processors", Proc. Int. Conf. on CAD, 
Nov. 1986.
[Fra86] E. Frank, "Exploiting Parallelism in a Switch-Level Machine", Proc. Design 
Automation Conference, 1986.
[Gho88] S. Ghosh, M. Yu, "An Asynchronous Distributed Approach for the Simulation 
of Behavioral-Level Models on Parallel Processors", Proc. ICCP, 1988.
[Gin85] R. Ginosar, N. Jacobson, "The Simulation Machine: A VLSI Architecture for 
Circuit Simulation", Proc. Int. Conf. on Comp. Design, Port Chester, N.Y., Oct 1985.
[Jef85] D. Jefferson, H. Sowizral, "Fast Concurrent Simulation using the time warp 
Mechanism",
[Luc86] R. Lucas, T. Blank, J. Tiemann, "A Parallel Solution Method for Large Sparse 
Systems of Equations", Proc. Int. Conf. on CAD, Nov. 1986.
[Hil81] D. Hillis, "The Connection Machine", Ph.D. dissertation, M.I.T.
[Jac86] G. K. Jacob, A. R. Newton, D. O. Pederson, "Direct-Method Circuit Simulation 
Using Multiprocessors", Proc. Int. Symp. on Circ. and Sys., May 1986.
[Jac86] G. K. Jacob, A. R. Newton, D. O. Pederson, "An Empirical Analysis of the 
Performance of a Multiprocessor-Based Circuit Simulator", Proc. of the Int. Symp. on 
Circ. and Sys., San Jose, CA. 1986.
[Ko83] H. Ko, A. Sangiovanni-Vincentelli, "BLOSSOM: An Algorithm and Architecture 
for the Solution of Large-Scale Linear Systems", Int. Conf. on Comp. Design, Port 
Chester, N.Y., 1983.
[Mat86] S. Matisson, "CONCISE, a Concurrent Circuit Simulation Program", Ph.D. 
dissertation, California Institute of Technology, 1985.
[Nak87] T.Nakata et al., "A Multiprocessor System for Modular Circuit Simulation", 
Proc. Int. Conf. on CAD, Nov. 1987.
[Sal86] R. A. Saleh, "Nonlinear Relaxation Algorithms for Circuit Simulation", Ph.D. 
Thesis, Univ. of California, Berkeley, CA. 1986.
[Sal87] R. Saleh, D. Webber, E. Xia, A. Sangiovanni-Vincentelli, "Parallel Waveform- 
Newton Algorithms for Circuit Simulation", Int. Conf. on Comp. Design, Port Chester, 
N.Y., 1987.
[Sei85] C. L. Seitz, "The Cosmic Cube", Communications of the ACM, 28:22-33, January 
1985.
[Sma87] D. Smart, T. Trick, "Increasing Parallelism in Multiprocessor Waveform 
Relaxation", Proc. Int. Conf. on CAD, Nov. 1987.
[Smi87] S. Smith, Bill Underwood, "An Analysis of Several Approaches to Circuit 
Partitioning for Parallel Logic Simulation", Proc. ICCD, Port Chester, NY, 1987.
[Sou88] L. Soule, T. Blank, "Parallel Logic Simulation on General Purpose Machines", 
Proc. 25th Design Automation Conference, 1988.
[Vla82] A. Vladimirescu, "LSI Circuit Simulation on Vector Computers", Ph.D. 
dissertation, University of California, Berkeley, 1982.
[Vla87] A. Vladimirescu et al., "A Vector Hardware Accelerator with Circuit Simulation 
Emphasis", Proc. Design Automation Conference, Miami, FL. 1987.
[Web87] D. Webber, A. Sangiovanni-Vincentelli, "Circuit Simulation on the Connection 
Machine", Proc. Design Automation Conference, Miami, FL. 1987
[Whi85a] J. White, A.L. Sangiovanni-Vincentelli, "Partitioning Algorithms and Parallel 
Implementation of Waveform Relaxation Algorithms for Circuit Simulation", Proc. Int. 
Symp. on Circ. and Syst., Kyoto, Japan, June 1985.
[Whi85b] J. White, R. Saleh, A. Sangiovanni-Vincentelli, A. R. Newton "Accelerating 
Relaxation Algorithms using Waveform Newton, Step Refinement and Parallel 
Techniques" Proc. 1985 Int. Conf. of Computer-Aided Design, Santa Clara, CA, Nov. 
1985.
[Whi85c] J. White, "The Multirate Integration Properties of Waveform Relaxation, with 
Application to Circuit Simulation and Parallel Computation", Ph.D. dissertation, 
University of California, Berkeley, Memo. No. UCB/ERL 85/90, Nov. 1985.
[Whi86] J. White, N. Weiner, "Parallelizing Circuit Simulation - A Combined 
Algorithmic and Specialized Hardware Approach", Int. Conf. on Comp. Design, Port 
Chester, N.Y., 1986.
[Yam85] F. Yamamoto, S. Takahashi, "Vectorized LU-Decomposition Algorithms for 
Large-Scale Circuit Simulation", IEEE Trans. on CAD, Vol. 4, No. 3, 1985.
SESSION 4
PARALLEL ALGORITHMS FOR DESIGN VERIFICATION
Prith Banerjee
Coordinated Science Laboratory 
Electrical and Computer Engineering 
University of Illinois at Urbana-Champaign
November, 1988
VLSI CAD, Design Verification, and Parallel Processing
SESSION 2
1. Silicon Compilers
2. Cell Generators
3. Floorplanning
4. Cell Placement
5. Global routing
6. Detailed routing
SESSION 3
1. Circuit Simulation
2. Switch Simulation
3. Logic Simulation
4. Behavioral Simulation
SESSION 4
1. Node Extraction
2. Design Rule Checking
3. Design Verification
4. Test Generation
5. Fault Simulation
Outline
• Parallel Algorithms for Design-Rule Checking
• Parallel Algorithms for Circuit Extraction
• Parallel Algorithms for Test Generation
• Parallel Algorithms for Fault Simulation
• Wrap-up and Conclusions of Tutorial
DESIGN RULE CHECKING PROBLEM
• Given mask layout, report design rule violations.
• Typically 1-10 million rectangles. Also a large number 
of checks are to be done. Hence it is computationally 
intensive.
• The locality of the various checks permits parallel 
implementation.
• Approaches
□ Coarse Grain : F. Gregoretti and Z. Segall 
□ Medium Grain : G. Bier and A. Pleszkun 
□ Fine Grain : E. Carlson and R. Rutenbar
• The input could be either FLAT or HIERARCHICAL
DRC - COARSE GRAIN
F. Gregoretti & Z. Segall
• Implemented on a 50-processor Cm*, a shared-memory 
multiprocessor.
• The Input is given in hierarchical form.
• They concentrate on determining all intersections of 
rectangles.
[Figure: hierarchical layout. Cell 8C contains two instances of 4C, each 4C contains two instances of 2C, and each 2C contains two instances of 1C.]
DS 1C; (One cell)
DF;
DS 2C; (Two cells)
Call 1C;
Call 1C;
DF;
DS 4C; (4 cells)
Call 2C;
Call 2C;
DF;
DS 8C; (8 cells)
Call 4C;
Call 4C;
DF;
DRC-COARSE GRAIN (Contd)
• Uses three kinds of processes.
□ CHS(A): Finds intersections local to symbol A. 
May spawn a COS or CSB process.
□ COS(A, B): Finds intersections between symbols 
A & B. May spawn a COS or CSB process.
□ CSB(r, A): Finds intersections between box r and 
symbol A. May spawn a CSB process.
• Starts checking by initiating a CHS process for each 
symbol.
[Figure: the cell hierarchy from 8C down to the 1C leaf cells, and the chains of checks started for each symbol (Check 8C, Check 4C, Check 2C, Check 1C), each descending through checks between pairs of sub-cells to checks on all rectangles.]
DRC-COARSE GRAIN (Contd)
• Process Scheduling is done by a TASK QUEUE.
• An idle processor fetches a process from the queue. 
A spawned process is added to the queue.
TASK QUEUE
DRC-COARSE GRAIN (Contd)
• Speedup results.
• Good speedup for small number of processors as 
enough subprocesses are generated leading to even 
distribution of load.
• Speedup is limited by the size of the leaf symbols and 
regularity in the circuit.
DRC-MEDIUM GRAIN
G. Bier &  A. Pleszkun
• Implemented on the CRYSTAL multiprocessor.
• The algorithm can work in both shared and message 
based multiprocessors.
• Takes in a flat description of a circuit.
• Basic Approach.
□ Partition the circuit into vertical strips and extend 
each strip region by DRID (maximum design rule 
interaction distance), creating overlap.
DRC-MEDIUM GRAIN (Contd)
□ Perform design rule checking on each extended 
strip in parallel. Omit any errors detected in the 
overlap portion of a strip.
[Figure: violations in the overlap portion of a strip may be checked incorrectly; the same violations are checked correctly in the adjacent strip.]
• Speedup is expected to scale up well with the 
addition of processors.
• Projected speedup of 8 on 14 processors, from 
simulation studies.
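The strip approach can be illustrated with a toy spacing check. This is a sketch under stated assumptions, not the Bier and Pleszkun algorithm: the only rule checked is a minimum horizontal spacing equal to DRID, a violation is assumed to be "located" at the right edge of its left rectangle, and the function names are invented for the example.

```python
def spacing_violations(rects, min_space):
    """Flat check. Rectangles are (x1, y1, x2, y2). A violation is a pair (left, right)
    whose horizontal gap lies in (0, min_space) and whose vertical extents overlap."""
    found = set()
    for a in rects:
        for b in rects:
            gap = b[0] - a[2]
            if 0 < gap < min_space and a[1] < b[3] and b[1] < a[3]:
                found.add((a, b))
    return found

def strip_drc(rects, num_strips, drid):
    xmin = min(r[0] for r in rects)
    xmax = max(r[2] for r in rects)
    width = (xmax - xmin) / num_strips
    reported = set()
    for s in range(num_strips):                   # each strip could run on its own processor
        lo = xmin + s * width
        hi = xmax if s == num_strips - 1 else xmin + (s + 1) * width
        # Extended strip: the core [lo, hi) plus an overlap of DRID on each side.
        local = [r for r in rects if r[2] >= lo - drid and r[0] <= hi + drid]
        for a, b in spacing_violations(local, drid):
            if lo <= a[2] < hi:                   # omit errors located in the overlap
                reported.add((a, b))
    return reported

# Hypothetical layout with two real spacing violations.
rects = [(0, 0, 4, 10), (5, 0, 9, 10), (11, 0, 14, 10), (15, 0, 20, 10), (21, 20, 24, 30)]
flat = spacing_violations(rects, 2)
striped = strip_drc(rects, 3, 2)
```

In this layout the violation between the third and fourth rectangles is seen by two strips but reported only by the strip whose core contains it, so the striped result matches the flat check.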
DRC-FINE GRAIN
E. Carlson & R. Rutenbar
• Implemented on the Connection Machine with 64K 
processors.
• Implements the basic primitives, like boolean 
operations, width-space checking and region 
numbering.
• Takes as input the horizontal edges in the circuit, with 
information for each edge about the presence of its 
mask above and below it.
DRC-FINE GRAIN (Contd)
Basic Approach.
• Read edges so that each edge is mapped to a 
unique (possibly virtual) processor.
• Split edges at stops made by a scan line algorithm, 
and sort them (first X then Y). Deposit them on
consecutive processors. Expected time is O (log
DRC-FINE GRAIN (Contd)
• For Boolean Operations (expected time O(log N)).
□ For each scanline, compute the masks above 
each edge.
□ Compute boolean operations, get new edges.
[Figure: edges of masks A and B and the resulting edges of A & B]
• For Region numbering (expected time O(log^2 N)).
□ Link edges connected by a vertical edge.
□ Assign same number to connected edges by 
doubling.
□ Take care of holes in regions.
[Figure: region numbering by doubling, showing the original edges and phases 1 to 3]
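The "doubling" step for region numbering can be illustrated with pointer jumping. The sketch below is a sequential simulation of the data-parallel step, assuming each edge already holds a link to one connected edge (or to itself if it is the representative); the function name and input encoding are assumptions, and hole handling is not modeled.

```python
def number_regions(link):
    """link[i] is the edge that edge i is linked to (link[i] == i for a representative).
    Returns the region label of every edge and the number of doubling rounds used."""
    label = list(link)
    rounds = 0
    while True:
        # On a SIMD machine every edge performs this jump at the same time.
        jumped = [label[label[i]] for i in range(len(label))]
        if jumped == label:
            return label, rounds
        label = jumped
        rounds += 1

# Hypothetical input: a chain of eight linked edges and a chain of three.
link = [0, 0, 1, 2, 3, 4, 5, 6, 8, 8, 9]
labels, rounds = number_regions(link)
```

Each round halves the remaining distance to the representative, so a chain of eight edges is numbered in three rounds rather than seven sequential steps.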
DRC-FINE GRAIN (Contd)
• For width/space checking (expected time O(log N)).
□ Check for vertical violations in a scanline.
□ Check for horizontal violations between 
scanlines.
[Figure: width and space checks within distance DRID]
Speedup Results.
CIRCUIT EXTRACTION PROBLEM
• Given mask layout, determine circuit connectivity & 
electrical parameters.
• Computationally intensive, because of the large size 
of circuits and complicated parameter models.
• Input can be flat or hierarchical.
• Has mostly local computations, and hence suitable for 
parallel computation.
• More difficult than DRC.
• Approaches
□ Coarse & Fine Grain : None 
□ Medium Grain : K. Belkhale and P. Banerjee
EXTRACTION MEDIUM-GRAIN
K. Belkhale and P. Banerjee
• Implemented on the Intel iPSC/D4 hypercube.
• Takes the input in a flat format, hierarchy will be 
implemented in future.
• Concentrates on the geometrical extraction, produces 
an output suitable for parameter extraction.
• Basic approach
□ Partition the circuit into sub regions, one for each 
processor.
□ Operate on the sub regions in parallel, with some 
inter processor communication.
EXTRACTION MEDIUM-GRAIN (Contd)
The basic steps
• Divide the circuit in the x dimension by 2^k and in the y
dimension by 2^(d-k).
• Send the rectangles of region (i, j) to processor
G(i, j) = GRAY(i) * 2^(d-k) + GRAY(j).
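The mapping G(i, j) = GRAY(i) * 2^(d-k) + GRAY(j) can be checked with a few lines of code. The sketch assumes GRAY is the standard binary-reflected Gray code and that i ranges over 2^k columns and j over 2^(d-k) rows; the function names are illustrative.

```python
def gray(n):
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)

def processor_for_region(i, j, d, k):
    """Hypercube node that receives the rectangles of region (i, j)."""
    return gray(i) * 2 ** (d - k) + gray(j)

def hamming(a, b):
    return bin(a ^ b).count("1")

d, k = 4, 2   # 16 processors, 4 x 4 regions
ids = {(i, j): processor_for_region(i, j, d, k)
       for i in range(2 ** k) for j in range(2 ** (d - k))}
# Regions that are adjacent in the layout should map to adjacent hypercube nodes.
adjacent_ok = all(
    hamming(ids[i, j], ids[i + 1, j]) == 1
    for i in range(2 ** k - 1) for j in range(2 ** (d - k))
) and all(
    hamming(ids[i, j], ids[i, j + 1]) == 1
    for i in range(2 ** k) for j in range(2 ** (d - k) - 1)
)
```

With this mapping, regions that abut in the layout land on processors whose addresses differ in one bit, so border information travels over a single hypercube link.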
EXTRACTION-MEDIUM GRAIN ( Contd )
• For an LCS set, the MERGE algorithm returns the
label of its globally connected set (GCS). Each GCS 
is owned by a unique processor.
[Figure: example layout containing transistors T1 and T2, split across the regions of PROC 0, PROC 1, PROC 2, and PROC 3]
Proc 0 Proc 1 Proc 2 Proc 3
Net 1 (VDD) Net 1 (A) Net 1 (VDD) Net 1 (B)
Net 2 (A) Net 2 (GND) Net 2 (A)
Net 3 (B) Net 3 (B) Net 3 (B)
Dnet 1 (ST1) Dnet 1 (ST2) Dnet 1(ST1) Dnet 1 (ST 1)
Dnet 2 (DT1) Dnet 2 (DT2) Dnet 2(DT1)
Cnet 1 (T1) Cnet 1 (T2) Cnet 1 (T1)
EXTRACTION - MEDIUM GRAIN ( Contd )
• Compute channel rectangles, and local nets,
diffusion-nets, transistors ( ) at each processor
by a Scan Line Algorithm.
• Compute maximal boundary segments, for each layer 
and boundary.
[Figure: boundary segments along a region boundary]
• The border segments for the four borders, and the
LCS sets they touch are passed to a MERGE 
algorithm.
EXTRACTION-MEDIUM GRAIN (Contd)
• A MERGE algorithm takes the border segments of 
the processor’s region, with their LCS sets.
• For each LCS set it gets a label denoting its GCS set.
[Figure: regions of PROC 0 to PROC 3 with their LCS sets]
Proc 0 Proc 1 Proc 2 Proc 3
LCS 1 (1) LCS 1 (2) LCS 1 ( ) LCS 1 (1)
LCS 2 (2) LCS 2 (1 ) LCS 2 (1 ) LCS 2 (2)
LCS 3 (2) LCS 3 ( ) LCS 3 (2)
LCS 4 (2)
LCS 5(1)
EXTRACTION-MEDIUM GRAIN (Contd)
A simple MERGE algorithm
• Uses d stages, and in each stage it sends and 
receives one message from another processor.
• In each stage, it doubles its domain of knowledge in 
the circuit.
• In each stage, a processor P communicates with 
another processor whose domain abuts P's domain.
[Figure: domains of knowledge at stage 2 and stage 3]
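The doubling of the domain of knowledge can be illustrated with a dimension-exchange pattern on a d-dimensional hypercube. This is a sketch only: it assumes the partner in stage s is the neighbor across dimension s, and it tracks which regions a processor knows about rather than the border segments and LCS labels that the actual MERGE algorithm exchanges.

```python
def merge_stages(d):
    """Simulate d exchange stages on 2**d processors.
    known[p] is the set of regions whose information processor p holds."""
    num_procs = 2 ** d
    known = [{p} for p in range(num_procs)]
    sizes = []
    for stage in range(d):
        # Each processor sends one message to, and receives one from, its partner.
        known = [known[p] | known[p ^ (1 << stage)] for p in range(num_procs)]
        sizes.append(len(known[0]))
    return known, sizes

known, sizes = merge_stages(3)
```

After d stages every processor has seen information from the whole circuit, having sent and received only d messages.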
EXTRACTION-MEDIUM GRAIN ( Contd )
Final Updating
• Update local diffusion nets to contain all rectangles in 
their GCS set.
• Update local transistors to get all source / drain / 
gate rectangles of their global transistor.
• Send each local net, with its connecting transistors, to 
the processor that owns the net.
[Figure: output data structure. A net list of NET records (Net#, Rectptr, Tranptr), each pointing to its RECT records (Coords, Layer) and to its transistors; a transistor list of TRAN records (Tran#, Gate#, Source#, Drain#) with pointers to the gate, source, and drain rectangles.]
EXTRACTION-MEDIUM GRAIN ( Contd )
• The algorithm has been implemented with expected 
number of messages from a node being 2d+16.
• The processors can now extract parameters of nets 
and transistors they own in parallel.
• Speedup and load distribution results.
Data   Box     Procs  Local  Merge  Diff   Tran   Net    Total  Speedup
                      (sec)  (sec)  (sec)  (sec)  (sec)  (sec)
Ckt1   51000   2      129.5  2.2    1.0    5.8    10.8   149.4  1.73
Ckt1   51000   4      66.1   1.2    0.5    2.9    10.8   81.6   3.24
Ckt1   51000   8      33.4   0.8    0.3    1.5    9.2    45.2   5.90
Ckt1   51000   16     16.7   0.9    0.2    0.8    10.4   29.1   9.19
Ckt2   128000  8      95.3   1.0    0.7    8.3    8.4    113.7  6.70
Ckt2   128000  16     43.6   0.7    0.39   4.8    12.0   61.5   11.35
Ckt3   228000  8      152.8  1.3    0.9    9.5    18.1   182.6  6.69
Ckt3   228000  16     84.1   1.7    0.7    6.5    19.1   112.1  12.00
Ckt4   283000  16     105.7  1.7    0.7    18.5   22.1   148.6  11.37
Data   Box     Procs  #Nets (n)  min n/n_i  #Tran (t)  min t/t_i
Ckt1   51000   2      1974       1.94       4288       1.91
Ckt1   51000   4      1974       3.82       4288       3.82
Ckt1   51000   8      1974       7.68       4288       7.74
Ckt1   51000   16     1974       14.51      4288       14.88
Ckt2   128000  8      4860       7.49       11084      7.68
Ckt2   128000  16     4860       15.09      11084      14.89
Ckt3   228000  8      8808       7.74       19660      7.77
Ckt3   228000  16     8808       14.97      19660      14.27
Ckt4   283000  16     10632      15.14      24676      14.08
Test Generation/ Fault Simulation (TG/FS) 
Overview
• Objective: Given a circuit, derive a set of input 
patterns to detect all faults (if possible) in the circuit
• Example:
[Figure: example circuit with a stuck-at-1 (s-a-1) fault]
□ Test for 8 s-a-1 (from test generation): 101XX
□ 101XX also detects 1 s-a-0, 3 s-a-0, 4 s-a-0 and 
16 s-a-0 (from fault simulation)
TG/FS Overview (Contd.)
• A generic TG/FS scheme:
procedure test;
  F = {f1, f2, ..., fn};
  while (F not empty) do
    fi = select_fault(F);
    F = F - {fi};
    ti = test_generate(fi);
    if (ti ≠ NULL)
      Fi = fault_simulate(F, ti);
      F = F - Fi;
    endif
  endwhile
end; {procedure test}
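The generic scheme can be written as a short runnable loop. The sketch below takes the test generator and fault simulator as callbacks; the `DETECTS` table, the integer fault identifiers, and the smallest-first fault selection are invented purely to exercise the loop and do not come from the tutorial.

```python
def generate_tests(faults, test_generate, fault_simulate):
    """Interleave test generation and fault simulation until the fault list is empty."""
    remaining = set(faults)
    tests, untested = [], []
    while remaining:
        fault = min(remaining)                    # select_fault: any policy would do
        remaining.discard(fault)
        test = test_generate(fault)
        if test is None:                          # no test found for this fault
            untested.append(fault)
            continue
        tests.append(test)
        remaining -= fault_simulate(remaining, test)   # drop faults this test also detects
    return tests, untested

# Hypothetical data: which faults each test vector detects. Fault 6 has no test.
DETECTS = {"t_a": {1, 2, 3}, "t_b": {4, 5}}

def toy_test_generate(fault):
    return next((t for t, fs in sorted(DETECTS.items()) if fault in fs), None)

def toy_fault_simulate(fault_set, test):
    return fault_set & DETECTS[test]

tests, untested = generate_tests(range(1, 7), toy_test_generate, toy_fault_simulate)
```

Only two calls to the test generator produce a test here, because fault simulation removes the other detected faults from the list first.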
Test Generation For Logic Circuits
• Requires searching the input space of a logic circuit 
for a vector or a vector sequence which distinguishes 
a faulty circuit from the good circuit
• NP-complete problem even for combinational circuits
• Algorithms are available which use heuristics to guide 
the search:
□ D-Algorithm [Roth]
□ PODEM [Goel]
□ Simulation based methods [Agrawal]
Need For Parallel Test Generation
• Search space is exponential in the number of primary 
inputs of a logic circuit
• For circuits of VLSI complexity, the test generation 
time becomes prohibitive
• Algorithms like FAN [Fujiwara] which use special 
heuristics to prune the search are effective in reducing 
the run time only to a limited extent
• Multiprocessing hardware has to be used to get 
orders of magnitude speedup
Exploiting Parallelism In Test Generation
• Fault parallelism
• Heuristic parallelism
• Simulation parallelism
• OR parallelism
Fault Parallelism
• Distribute the fault set equally across n processors
• Each processor generates tests for its own fault set
• In a uniprocessor implementation, test generation and 
fault simulation are interleaved to derive a compact 
test set
□  Test lengths may be too long in a parallel 
implementation
□  Solution:
- Divide the fault set into independent fault sets
- Reduces the likelihood of a test vector derived by 
one processor being a test for a fault on some 
other processor
- The test generation/ fault simulation phases can 
now proceed independently on each processor
- Could use processors dedicated to fault simulation 
only
• Merits
□  Very little communication between processors
□  Potential for linear speedup
Fault Parallelism (Contd.) [Chandra]
[Figure: speedup versus number of processors]
Heuristic Parallelism
• Algorithms like PODEM use heuristics for search:
□  RANDOM
□ DISTANCE
□ COP
□ SCOAP
□ CAMELOT
• No clear cut advantage of one heuristic over the other
• Each processor can use a different heuristic to guide 
the search for the same fault
• Superlinear speedups possible [Chandra]
• Parallelism limited to the number of heuristics 
available for search (5-6)
Simulation Parallelism
• Forward implication forms a major component of run 
time in test generation
• Makes backtracks more expensive
• Could use parallel logic simulation techniques to 
speed up forward implication
• Circuit activity is low
□  Circuit partitioning strategy is important
□  The partitioning strategy should minimize 
communication and maximize concurrency
□  An optimal circuit partitioning: still an open 
problem
• Very high message traffic:
□  Requires very low message latency
□  More feasible on shared memory machines
• Cost per backtrack is reduced
□ Can afford a larger number of backtracks
OR Parallelism
• Refers to concurrent evaluation of choice points
• Useful for hard to detect or redundant faults
•  Reduces backtracking
• Method:
□  Distribute the search space over the available 
processors
□  Each processor searches its own search 
subspace concurrently
•  Search space allocation
□  Brute force: divide the search space into equal 
parts.
- Disadvantage: may lead to useless search in 
non-solution areas
□ Heuristics based: choose a portion of the search 
space aimed at increasing the probability of 
searching in a solution area
- Disadvantage: the search subspaces may not be 
disjoint
• Very low communication overhead
• Large number of processors can be utilized efficiently
• Can lead to superlinear speedups due to search 
anomalies
[Figure 3. Splitting search spaces among processors]
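The brute-force allocation can be sketched by fixing a different prefix of primary-input values on each processor. The code below is a toy illustration: the fault is represented only by a predicate `is_test`, the subspaces are searched one after another instead of concurrently, and all names are assumptions of the example.

```python
from itertools import product

def search_subspace(prefix, num_inputs, is_test):
    """Enumerate the input vectors that start with prefix; return (test, vectors examined)."""
    examined = 0
    for rest in product((0, 1), repeat=num_inputs - len(prefix)):
        examined += 1
        vector = prefix + rest
        if is_test(vector):
            return vector, examined
    return None, examined

def or_parallel_search(num_inputs, prefix_bits, is_test):
    """Brute-force split into 2**prefix_bits equal, disjoint subspaces."""
    prefixes = list(product((0, 1), repeat=prefix_bits))
    # Each subspace could be searched by its own processor.
    return {p: search_subspace(p, num_inputs, is_test) for p in prefixes}

# Hypothetical hard-to-detect fault with exactly one test vector.
target = (1, 1, 0, 0, 0)
is_test = lambda v: v == target
sequential = search_subspace((), 5, is_test)
parallel = or_parallel_search(5, 2, is_test)
```

In this toy case the single-processor search examines 25 vectors before finding the test, while the processor that owns prefix (1, 1) finds it on its first vector, which is the kind of search anomaly that can produce superlinear speedup. The other three processors search areas that contain no solution.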
Fault Simulation Of Logic Circuits
• Used to evaluate the fault coverage of a set of vectors 
over a set of faults in a logic circuit
• Fault simulation techniques:
□  "Parallel"
- Utilize word length to process W faults or patterns 
in parallel
- Word-oriented logic primitives are required
□ Deductive
- Deduces the faults that will cause each signal to 
have a value different from the fault-free value
□ Concurrent
- Simulates faulty and fault free circuits concurrently
- Only those parts of the faulty circuit which differ 
from the fault-free circuit are simulated
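The word-level technique listed first ("Parallel") can be illustrated with Python integers standing in for machine words. The sketch packs W = 8 input patterns into one word per signal and evaluates a small circuit once for all of them; the circuit y = (a AND b) OR c, the node names, and the helper functions are assumptions made for the example.

```python
W = 8                      # patterns processed per word
MASK = (1 << W) - 1

def pack(bit_of_pattern):
    """Build a word whose bit p holds the signal value in pattern p."""
    return sum(bit_of_pattern(p) << p for p in range(W))

def simulate_word(a, b, c, stuck=None):
    """Evaluate y = (a AND b) OR c on W patterns at once.
    stuck = (node, value) injects a single stuck-at fault on node a, b, c, ab, or y."""
    def force(node, word):
        if stuck and stuck[0] == node:
            return MASK if stuck[1] else 0
        return word
    a, b, c = force("a", a), force("b", b), force("c", c)
    ab = force("ab", a & b)
    return force("y", (ab | c) & MASK)

def detecting_patterns(a, b, c, stuck):
    diff = simulate_word(a, b, c) ^ simulate_word(a, b, c, stuck)
    return [p for p in range(W) if (diff >> p) & 1]

# Pattern p encodes (a, b, c) as the three bits of p, so all 8 combinations are covered.
a_word = pack(lambda p: (p >> 2) & 1)
b_word = pack(lambda p: (p >> 1) & 1)
c_word = pack(lambda p: p & 1)
ab_sa0 = detecting_patterns(a_word, b_word, c_word, ("ab", 0))
c_sa1 = detecting_patterns(a_word, b_word, c_word, ("c", 1))
```

A single pass of word-oriented AND and OR operations compares the good and faulty circuits on all eight patterns; the same packing could instead hold W different faults for one pattern.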
Need For Parallel Fault Simulation
• Fault simulation algorithms are typically O(n^2)
• Theoretically, the most efficient algorithm can be at 
best O(n^1.5)
• No hope for a linear time fault simulation algorithm
• For circuits of VLSI complexity (> 10000 elements) the 
fault simulation time becomes prohibitive
• Parallel fault simulation can be used to reduce the run 
time by orders of magnitude
• Types of parallelism:
□  Fault parallelism
□  Pattern parallelism
□  Simulation parallelism
Fault Parallelism
• Distribute the fault list equally among all the available 
processors
• Applicable to any fault simulation technique
• Low (almost nil) communication overhead
• Potential for linear speedup
• Upper bound on speedup = number of faults
Pattern Parallelism
• Distribute the input patterns equally among all the 
available processors
• Assumes all patterns are available at the same time - 
not always possible
• Low communication overhead
• May result in redundant work if a fault is detectable by 
more than one vector being simulated
• Potential for linear speedup
• Upper bound on speedup = number of patterns
Simulation Parallelism
•  Circuit partitioning could be used to distribute the 
simulation work
• Techniques similar to parallel logic simulation can be 
used
• May not be feasible for deductive fault simulation 
because of long messages
• Feasible for concurrent and "parallel" fault simulation
• Concurrent fault simulation:
□  More activity per gate
□  Two types of events:
1. Fault-free events
2. Faulty events
□  High computation to communication ratio: can be 
implemented efficiently on message-based 
machines
Parallel Test Generation Results
• Patil - Banerjee 1988
• Use OR-Parallelism to concurrently search choice 
points for hard-to-detect faults
Circuit  HTD     Processors  HTD            Overall        Time/Fault   Speedup
         Faults              Coverage       Coverage       (ms)
c432     53      4           73.59 (15.09)  97.32 (91.41)  117 (628)    5.4
                 8           77.36 (24.53)  97.71 (92.36)  116 (1232)   10.6
                 16          79.24 (43.40)  97.90 (94.27)  142 (2208)   15.6
c499     33      4           72.73 (12.12)  98.81 (96.17)  174 (897)    5.2
                 8           75.76 (21.21)  98.94 (96.57)  152 (1802)   11.9
                 16          75.76 (27.27)  98.94 (96.83)  169 (3357)   19.9
c1355    151     4           91.39 (49.67)  99.17 (95.17)  200 (1318)   6.6
                 8           93.37 (54.97)  99.36 (95.68)  188 (2262)   12.0
                 16          92.72 (65.56)  99.30 (96.70)  206 (3634)   17.6
c1908    84      4           92.86 (17.86)  99.68 (96.33)  219 (1956)   8.9
                 8           95.24 (17.86)  99.79 (96.33)  204 (4262)   20.9
                 16          95.24 (22.62)  99.79 (96.54)  254 (8647)   34.0
c2670    109     4           38.53 (30.28)  97.56 (97.23)  453 (1147)   2.5
                 8           42.20 (41.28)  97.71 (97.67)  488 (2096)   4.3
                 16          45.87 (55.96)  97.85 (98.25)  584 (3493)   6.0
c3540    343     4           48.10 (18.95)  94.81 (91.89)  736 (2239)   3.0
                 8           68.51 (34.40)  96.85 (93.44)  625 (4457)   7.1
                 16          75.80 (44.61)  97.58 (94.46)  693 (8183)   11.8
c5315    37      4           70.27 (59.46)  99.79 (99.72)  441 (1418)   3.2
                 8           81.08 (70.27)  99.87 (99.79)  398 (2339)   5.9
                 16          86.49 (78.38)  99.91 (99.85)  467 (3643)   7.8
c6288    79      4           94.94 (29.11)  99.95 (99.28)  490 (2605)   5.3
                 8           96.20 (29.11)  99.96 (99.28)  543 (5026)   9.3
                 16          96.20 (49.37)  99.96 (99.48)  604 (9507)   15.7
c7552    247     4           61.54 (22.67)  98.74 (97.47)  549 (1621)   3.0
                 8           66.80 (26.32)  98.91 (97.56)  619 (3098)   5.0
                 16          64.78 (31.17)  98.85 (97.75)  884 (5911)   6.69
A Complete Parallel TG/FS System
• Use a combination of fault, simulation and OR 
parallelism
• Potential for very high speedups
• A large number of processors can be utilized 
efficiently
• Parallel fault simulation techniques can be used for 
test compaction and to save effort in test generation 
for the remaining faults
• Ongoing research at the Univ. of Illinois [Patil, 
Banerjee]:
□  Analysis and implementation of circuit partitioning 
methods on general-purpose multiprocessors for 
fault and logic simulation
□  Implementation of parallel fault simulation 
methods
□  Analysis and implementation of OR parallel 
method for test generation
□  Implementation of a complete TG/FS system 
using parallel fault simulation and test generation 
techniques
Implementation Of TG/FS On Shared Memory 
Machines
• Examples: Sequent Balance, Encore Multimax, Alliant 
FX-8
• Parallel forward implication can be very efficiently 
implemented due to low message latency.
• Fault, heuristic and OR parallel techniques can be 
implemented without any significant overhead
• Potential for speedup limited to at most 20 due to 
limited number of processors
Implementation Of TG/FS On Message Passing 
Architectures
• Examples: NCUBE, Intel iPSC/2, Caltech MARK III
• High message latency
□ No reported success with parallel logic simulation
□  Speedup of at most 2-3
• Feasible to implement fault, heuristic and OR 
parallelism due to low communication requirements
• Large number of processors available (up to 1024)
□  Potential for high speedups
• An implementation of parallel test generation on SIMD 
type machines has been reported (Connection 
Machine) [Kramer]
□  Memory per processor a severe constraint
□  Feasible only for circuits with small number of 
primary inputs
What Have We Learned from the Tutorial?
• We have reviewed various options to speed up VLSI 
CAD tasks
• Parallel Computing Options: Special-purpose, 
General-purpose
• Special-purpose offers maximum performance at high 
cost, worthwhile only for simulation where algorithms 
are stable
• General-purpose parallel processors attractive for 
other VLSI CAD applications where algorithms may 
change
• General-purpose Parallelism Options: Coarse grain, 
Medium grain, Fine Grain
• Future workstations will use parallel processing 
technology, e.g. SPUR Project at Berkeley, Firefly 
from DEC
• By late 1990s, the hardware will be ready to offer 
50-100 GFLOPS of performance using 1000 
processors
• Sufficient for VLSI CAD requirements
• Will parallel algorithms for VLSI CAD be ready by 
then?
VLSI CAD Environment of the future
[Figure: a user workstation connected to a general-purpose parallel processor (processors up to Pn) and an optional special-purpose simulation engine]
Parallel CAD Research: Where the Action Is
• Research has been started at various universities on 
parallel CAD
□ Berkeley, CMU, Illinois, Michigan, MIT, Stanford, ...
• Industries developing and marketing parallel CAD
□ AT&T, DEC, Daisy, IBM, Intel, Sequent, Silicon 
Solutions, Simucad, Valid, Zycad, ...
Observations
• VLSI CAD tasks form a rich source of parallelism
• Important enough to deserve consideration for 
parallel processing machines
• Some research results obtained on specific tasks
• Most parallel implementations handle simple cases, 
no fancy features
• No integrated set of VLSI CAD tools or database 
for a parallel environment yet
• As parallel processing machines proliferate, these 
parallel algorithms would move to industrial 
environments
• START THINKING NOW!
PARALLEL ALGORITHMS FOR DESIGN VERIFICATION
A Select Bibliography 
Prith  Banerjee
[1] K. P. Belkhale and P. Banerjee. “PACE: A Parallel VLSI Circuit Extractor on the Intel 
Hypercube Multiprocessor.” Proc. Int. Conf. on Computer-Aided Design, Nov. 1988.
[2] G. E. Bier and A. R. Pleszkun. “An Algorithm for Design Rule Checking on a 
Multiprocessor.” Proc. Design Automation Conf., pp. 299-303, Jun. 1985.
[3] E. C. Carlson and R. A. Rutenbar. "Mask Verification on the Connection Machine.” Proc. 
Design Automation Conf.. pp. 134-140. Jun. 1988.
[4] S. Chandra and J. H. Patel. “Test Generation in a Parallel Processing Environment,” Proc. 
Int. Conf. Comp. Design (ICCD-88), Oct. 1988.
[5] K. T. Cheng and V. D. Agrawal. “A Simulation-Based Directed Search Method for Test 
Generation.” Proc. Int. Conf. Comp. Design (ICCD-87), pp. 48-51. Oct. 1987.
[6] H. Fujiwara and T. Shimono. “On the Acceleration of Test Generation Algorithms.” IEEE 
Trans. Computers, vol. C-32. pp. 1137-1144. Dec. 1983.
[7] P. Goel. "An Implicit Enumeration Algorithm to Generate Tests for Combinational Logic 
Circuits,” IEEE Trans. Computers, vol. C-30, pp. 215-222. Mar. 1981.
[8] F. Gregoretti and Z. Segall. "Analysis and Evaluation of VLSI Design Rule Checking 
Implementation in a Multiprocessor.” Proc. Int. Conf. Parallel Processing, pp. 7-14. Aug. 
1984.
[9] G. D. Hachtel and P. M. Moceyunas, "Parallel Algorithms for Boolean Tautology Checking,” 
Proc. Int. Conf. Computer-Aided Design, pp. 422-425. Nov. 1987.
[10] G. A. Kramer. "Employing Massive Parallelism in Digital ATPG Algorithms.” Proc. Int. Test 
Conf.. pp. 108-114, Oct. 1983.
[11] Y. H. Levendel, P. R. Menon. and S. H. Patel. "Parallel Fault Simulation Using Distributed 
Processing.” Bell System Tech. Jour., vol. 62.1983.
[12] S. Levitin. MACE: A Multiprocessing Approach to Circuit Extraction. MIT. 1986. Master’s 
Thesis.
[13] H. K. T. Ma. S. Devadas, and A. S. Vincentelli, "Logic Verification Algorithms and their 
Parallel Implementation.” Proc. Design Automation Conf., pp. 283-290. Jim. 1987.
[14] A. Motohara. K. Nishimura. H. Fujiwara, and I. Shirakawa. "A Parallel Scheme for Test 
Pattern Generation.” Proc. Int. Conf. Computer-Aided Design, pp. 156-159, Nov. 1986.
[15] D. Ostapko. Z. Barzilai. and G. M. Silberman. "Fast Fault Simulation in a Parallel Processing 
Environment.” Proc. Int. Test Conf.. Oct. 1987.
[16] J. P. Roth. "Diagnosis of Automata Failures: A Calculus and a Method.” IBM  Jour. Res. 
Develop., vol. 10. pp. 278-291.
[17] S. P. Smith, B. Underwood, and J. Newman, "An Analysis of Parallel Logic Simulation on 
Several Architectures.” Proc. Int. Conf. Parallel Processing, pp. 65-68, Aug. 1988.
[ 18] L. Soule and T. Blank, Parallel Logic Simulation on General Purpose Machines.” Proc. 
Design Automation Conf.. pp. 166-171. Jun. 1988.
