ProperCAD: A portable object-oriented parallel environment for VLSI CAD by Ramkumar, Balkrishna & Banerjee, Prithviraj
NASA-CR-192303
January 1993 UILU-ENG-93-2205
CRHC-93-04
Center for Reliable and High-Performance Computing
ProperCAD:
A PORTABLE
OBJECT-ORIENTED
PARALLEL ENVIRONMENT
FOR VLSI CAD
Balkrishna Ramkumar and Prithviraj Banerjee
(NASA-CR-192303) ProperCAD: A
PORTABLE OBJECT-ORIENTED PARALLEL
ENVIRONMENT FOR VLSI CAD (Illinois
Univ.) 44 p
N93-20603
Unclas
G3/61 0148109
Coordinated Science Laboratory
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
https://ntrs.nasa.gov/search.jsp?R=19930011414 2020-03-17T08:10:05+00:00Z

ProperCAD: A Portable Object-oriented Parallel
Environment for VLSI CAD*
Balkrishna Ramkumar
Dept. of Electrical & Computer Engineering
University of Iowa
Iowa City, Iowa 52242
Prithviraj Banerjee
Center for Reliable & High-Perf. Computing
University of Illinois
Urbana, Illinois 61801
Abstract
Most parallel algorithms for VLSI CAD proposed to date have one important drawback:
they work efficiently only on machines that they were designed for. As a result, algorithms
designed to date are dependent on the architecture for which they are developed and do not
port easily to other parallel architectures.
This paper describes a new project under way to address this problem. We are develop-
ing a Portable object-oriented parallel environment for CAD algorithms (ProperCAD). The
objectives of this research are two-fold. (1) To develop new parallel algorithms that run in a
portable object-oriented environment. We accomplish this in two stages. First, we are develop-
ing CAD algorithms using a general purpose platform for portable parallel programming called
CHARM [6, 12] developed at the University of Illinois. Second, we are concurrently developing
a C++ environment that is truly object-oriented and specialized for CAD applications. (2) To
design the parallel algorithms around a good sequential algorithm with a well-defined parallel-
sequential interface. This will permit the parallel algorithm to benefit from future developments
in sequential algorithms.
We describe one CAD application that has been implemented as part of the ProperCAD
project: flat VLSI circuit extraction. The algorithm, its implementation, and its performance
on a range of parallel machines are discussed in detail. It currently runs on an Encore Multimax,
a Sequent Symmetry, Intel iPSC/2 and i860 hypercubes, an NCUBE 2 hypercube, and a network
of Sun Sparc workstations. We also provide performance data for other applications that have
been developed: namely test pattern generation for sequential circuits, parallel logic synthesis
and standard cell placement.
1 Introduction
In view of the increasing complexity of VLSI circuits of the future, the requirements on VLSI CAD
tools will continuously increase. Parallel processing for CAD applications is becoming gradually
recognized as a popular vehicle to support the increasing computing requirements of future CAD
tools. Recent research on parallel CAD has been reported for a wide variety of
*This research was supported in part by the National Aeronautics and Space Administration under grant NAG
1-613, and in part by the Semiconductor Research Corporation under grant SRC 91-DP-109.

applications such as placement [2, 13, 19, 20], floor planning [11], circuit extraction [3, 4, 14, 25],
test generation and fault simulation [17], etc. Parallel processing for VLSI CAD has become a
reality in industry as well. Hardware vendors such as Solbourne have already announced products
with multiple CPUs in a single workstation. Software CAD vendors such as Mentor have announced
products such as CHECKMATE, a parallel design rule checker using multiprocessing, to accelerate
a single job. A major limitation with almost all such previous work is that the parallel algorithms
have been targeted to run on specific machines like an Intel iPSC/2 hypercube or an Encore shared
memory multiprocessor. Such work, although interesting, is not usable by the rest of the VLSI
CAD community since the algorithms are not portable to other machines.
A second serious problem also presents itself in the design of parallel algorithms. The software
development cycle for parallel algorithms is considerably longer than for sequential algorithms. This
has two important implications. The first is that they are considerably more costly to develop than
sequential algorithms. This is only exacerbated by the lack of portability across parallel machines.
The second implication is a more pragmatic one. Given the fast pace of progress in the development
and improvement of sequential algorithms for CAD applications, for a given application, sequential
algorithms frequently outperform parallel algorithms due to the longer development time of the
latter. For example, this is evident in parallel test pattern generation. The latest version of HITEC
[16], a uniprocessor test pattern generation program for sequential circuits is already comparable
in performance and is slightly better in quality of results than a recent parallel algorithm for test
pattern generation [17].
A related issue in the development of parallel algorithms is that certain approaches are inherently
parallelizable and others are extremely hard to parallelize. More often than not, the tradeoff
between these two approaches is in the quality of results. Cell placement is a good example.
The quadrisection algorithm [24] is easily parallelizable and is significantly faster than algorithms
based on simulated annealing. However, it cannot produce results comparable to TimberWolf [22], a
sequential program that uses simulated annealing (and a host of related tricks) to do cell placement.
An interesting possibility would be to use a hybrid of these two (and possibly other) techniques,
where, for example, quadrisection could be used for decomposition of the layout area into regions,
and TimberWolf would be used for placement in a given region. However, to experiment with such

techniques, it should not be necessary to rewrite the software entirely. Any attempt to rewrite
TimberWolf [22] will not only be extremely time consuming, it is also unlikely to be comparable
in performance. However, if it is possible to decouple the parallel and sequential algorithms and
provide a well defined interface between the two, it may be practical to experiment with hybrid
schemes such as these.
It would be presumptuous to assume that it will be trivial to interface the parallel algorithm
with the sequential algorithm as described above. For this to be practical, it is imperative that
sequential algorithms be written in a modular fashion. Fortunately, object-oriented programming
in C++ (or even disciplined C programming) goes a long way in realizing this requirement. Many
CAD vendors are already rewriting many of their well-established CAD applications using such
disciplined, modular programming methods due to the benefits offered in program design and
maintenance.
The most important questions that need to be addressed in the development of parallel algo-
rithms are therefore: "How can we design parallel algorithms that are truly portable across parallel
machines?", "How can we exploit good sequential algorithms in the design of parallel algorithms?",
and "How can parallel algorithms keep pace with future developments in sequential algorithms?"
These are the main objectives of a new project to be discussed in this paper.
CHARM [6, 12] is a run-time support system for portable parallel programming developed
at the University of Illinois. It currently runs on a wide range of parallel machines including
shared memory machines, message passing multiprocessors and a network of workstations. We are
using CHARM to build a prototype of a Portable object-oriented parallel environment for CAD
applications (ProperCAD). From its inception, the ProperCAD project (see Figure 1) has been designed to
be completed in two phases. In the first phase, we are designing portable parallel algorithms for a
large set of CAD applications using CHARM. To date, algorithms for flat extraction, test generation
for sequential circuits [18] and combinational logic synthesis [5] and standard cell placement have
been designed and implemented. New algorithms for global routing, fault simulation and behavioral
simulation are currently under development.
The second phase of the project is expected to take a couple of years. It will involve the design
and implementation of a run-time support system for portable parallel programming in C++. This

system, although inspired by CHARM, will be tailored specifically for CAD applications. This will
make the programming environment truly object-oriented and will support features like inheritance
and classes. The ProperCAD applications will then be rewritten and ported onto the new C++
platform. The new platform will make it possible to adapt and integrate the parallel applications
with software developed by companies like Cadence and Mentor, which are increasingly using C++
as a standard for their software development. Recall that reuse of sequential code is one of the
primary objectives of the ProperCAD project. In the second phase of this project, we will also
develop a library for the rapid prototyping and development of additional parallel CAD applications.
The library will essentially be a parallel data manager that supports data distribution abstractions
and primitives designed for an integrated parallel CAD environment. The library can be viewed
as being analogous to the Oct tools [9] distributed by the University of California at Berkeley for
uniprocessor CAD applications.
The CHARM system was chosen as the platform for two significant reasons. The first is that it is
a working prototype of a run time support system that offers true portability of parallel applications
across MIMD machines. Second, although not truly object-oriented, it supports an object-oriented
style of programming. This will make porting of the CAD applications to C++ much easier. We
describe the CHARM system briefly in Section 2.
In Section 3, we discuss how flat circuit extraction is expressed as an example of the use of
the programming paradigm supported by the ProperCAD environment. The algorithm for circuit
extraction presented in this paper has three significant contributions: (1) It is portable across
MIMD architectures. (2) It is built around an existing sequential circuit extractor using a well-
defined interface. This enables it to benefit from future improvements in the sequential algorithms
for circuit extraction. (3) Unlike previous approaches to parallel circuit extraction, it uses an
asynchronous coarse-grained data-flow model of execution. This is instrumental in rendering the
parallel algorithm scalable on all the target machines. Contributions (2) and (3) together also
permit good load balancing and high processor utilization.

[Figure 1 diagram. Applications: ProperEXT (extraction), ProperTEST (test generation),
ProperFAULT (fault simulation), ProperSYN (logic synthesis), ProperPLACE (placement),
ProperROUTE (routing), ProperSIM (behavioral simulation). Each application consists of
parallel algorithm modules joined to sequential modules through an interface, layered on
CHARM. CHARM in turn runs on the Encore Multimax (shared), the Intel i860 (message
passing), a network of Sun workstations (distributed) and the BBN Butterfly (NUMA).]
Figure 1: A high-level view of the ProperCAD project currently under development using the
CHARM parallel programming environment. We list the CAD applications under development
above.

2 The Parallel Programming Model
CHARM is a run time support system for portable parallel programming [6]. It abstracts
all machine dependent features away from an application program and provides a uniform set of
primitives that can be used by the application to render their program machine independent. Fea-
tures like dynamic process creation, mapping of processes to processors, dynamic load distribution
and load balancing, scheduling, interprocess communication, are provided by the kernel. These are
implemented in the most efficient manner possible on each of the machines that the kernel runs on.
These features often complicate the user application considerably. CHARM helps the programmer
separate these concerns.
The CHARM kernel supports a message-driven style of execution. Conceptually, it maintains a
pool of messages that represent work. The application program can create processes dynamically
by creating a message that represents a seed for a new dynamically created object. 1 Information
can be exchanged between these objects also via messages. When a message is created it is put in
the work pool. The messages in the work pool are distributed (and periodically balanced) across
the available processors by the kernel. The kernel services messages in the pool until no more are
available. At that point quiescence is detected, and the programmer may take any necessary action
(for example, printing results).
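The message-driven execution described above can be illustrated with a minimal single-process sketch. It is not the CHARM API: the class Kernel, its methods create_object, send and run, and the entry functions split and accumulate are hypothetical names chosen for illustration, and a plain FIFO queue stands in for the distributed, load-balanced work pool.

```python
from collections import deque


class Kernel:
    """Toy stand-in for a message-driven kernel (hypothetical, not CHARM)."""

    def __init__(self):
        self.pool = deque()   # pool of pending messages, each a unit of work
        self.objects = {}     # object id -> persistent state of that object
        self.next_id = 0

    def create_object(self, entry, msg):
        # A seed message: the object exists only once this message is serviced.
        self.pool.append(("seed", None, entry, msg))

    def send(self, obj_id, entry, msg):
        # An ordinary message addressed to an entry point of an existing object.
        self.pool.append(("msg", obj_id, entry, msg))

    def run(self, on_quiescence):
        # Service messages until the pool is empty, then report quiescence.
        while self.pool:
            kind, obj_id, entry, msg = self.pool.popleft()
            if kind == "seed":
                obj_id = self.next_id
                self.next_id += 1
                self.objects[obj_id] = {}
            # Entry-point code runs to completion; it never blocks.
            entry(self, obj_id, self.objects[obj_id], msg)
        on_quiescence()


def accumulate(kernel, obj_id, state, value):
    state["total"] = state.get("total", 0) + value


def split(kernel, obj_id, state, msg):
    lo, hi, collector = msg
    if collector is None:
        collector = obj_id    # the first object also collects partial sums
    if hi - lo <= 4:
        # Small piece of work: do it sequentially and report the result.
        kernel.send(collector, accumulate, sum(range(lo, hi)))
    else:
        mid = (lo + hi) // 2
        kernel.create_object(split, (lo, mid, collector))
        kernel.create_object(split, (mid, hi, collector))


def run_demo(n):
    kernel = Kernel()
    kernel.create_object(split, (0, n, None))
    report = []
    kernel.run(lambda: report.append(kernel.objects[0].get("total", 0)))
    return report[0]
```

Calling run_demo(n) sums the integers below n. The result is read only in the quiescence callback, mirroring the way a program may print its results once no messages remain.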
A CHARM program comprises a set of object definitions. Each object definition has a set of
entry points which have C code associated with them (see Figure 6 for an outline of the object for
circuit extraction). Instances of these object definitions may be created dynamically at run time.2
Messages may be sent to these objects at one of their entry points, and the servicing of a message
entails executing the code associated with the entry point sequentially. No interrupts or blocking
(e.g. for synchronous receives) are possible within a code block associated with an entry point.
Only one instance of a special object called the main object is created. It has special entry points
for initialization and detection of quiescence. These do not have messages sent to them, unlike
normal entry points. The initialization entry point performs data and object initialization during
1An object is created when this message is serviced.
2These objects are similar to actors [1], a type of concurrent object. Wegner [26] categorizes actors to be active
imperative objects. Note, however, that features like inheritance are not supported by CHARM.

startup. The quiescence entry point is optional; it permits the user to provide the action to be
taken upon detection of quiescence. If it is absent, the program terminates.
CHARM also provides a special type of object called a branch office. One instance of the
branch office object is created per processor. As with the main object, branches are initialized
automatically on the local processors by executing code associated with an Init entry point. Both
types of objects permit the declaration of persistent data that is visible only when executing any
code associated with the object (or branch). Both types of objects also permit the declaration
of procedures or functions as part of its definition. The functions in an object are private to the
object, whereas the functions in a branch office may be invoked by other objects. Branch office
objects are useful in providing data and program abstraction and have a concurrent object-like
behavior. For example, the access to distributed data can be managed by branch office objects.
A program may comprise several branch offices, each of which manages a different complex data
structure (like a circuit, BDDs, etc.). An example of a branch office object definition is provided
in Figure 2.
Another interesting feature of the CHARM kernel is conditional packing. The program definition
also includes routines for packing messages into contiguous buffers and unpacking them into a
representation used by the program. A pair of such routines is provided for each message type in
the program for which packing/unpacking is necessary. These are used by the CHARM kernel on
nonshared memory machines when it is necessary for a message to cross process boundaries. Note
that for shared memory machines packing is not necessary. Hence, the algorithm runs efficiently
on shared memory machines as well.
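A minimal sketch of the idea behind conditional packing follows. It assumes a message that carries a list of integer rectangles (x1, y1, x2, y2); the function names pack_rectangles, unpack_rectangles and deliver, and the buffer layout, are hypothetical and are not CHARM's actual routines.

```python
import struct


def pack_rectangles(rects):
    """Flatten a list of (x1, y1, x2, y2) integer tuples into one buffer."""
    buf = struct.pack("<I", len(rects))        # 4-byte rectangle count
    for rect in rects:
        buf += struct.pack("<4i", *rect)       # 16 bytes per rectangle
    return buf


def unpack_rectangles(buf):
    """Rebuild the list of rectangle tuples from a packed buffer."""
    (count,) = struct.unpack_from("<I", buf, 0)
    return [struct.unpack_from("<4i", buf, 4 + 16 * i) for i in range(count)]


def deliver(rects, crosses_address_space):
    """Pack and unpack only when the message leaves the address space."""
    if crosses_address_space:
        return unpack_rectangles(pack_rectangles(rects))
    return rects    # shared memory: hand over the same object, no copying
```

The application supplies the pack/unpack pair once per message type; whether the pair is invoked is decided at delivery time, so the same program text serves both kinds of machine.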
Other features provided by CHARM are beyond the scope of this paper. Only features that are
important to the ensuing sections are discussed above. Further details may be found in [6].
The primary objective of a CHARM program is to create a large number of messages represent-
ing parallel work. Typically, this is done by decomposing the problem hierarchically into smaller
and smaller subproblems which can be evaluated in parallel, until a threshold is reached. This
threshold is user defined, and is used to indicate that subproblems smaller than the threshold are
to be evaluated sequentially. Ideally, the threshold determines the point at which it is cheaper to
solve a subproblem sequentially in preference to decomposing it further into parallel subcomponents.
Determining this threshold accurately is not necessary as long as a sufficiently large number
of messages are created each of which represents a reasonable amount of work (e.g. > 50 ms).
The more messages available to the kernel, the better its ability to perform dynamic load
balancing. The problem decomposition is thus independent of the number of processors available.
CHARM has been ported to a variety of shared memory and nonshared memory machines
including the Encore Multimax, the Sequent Symmetry, the Alliant FX/8, the Intel iPSC/2 and
i860 hypercubes, the NCUBE 2 hypercube, and a network of Sun workstations. It is currently
being ported to the BBN TC2000 Butterfly multiprocessor. Four portable implementations of the
CHARM kernel have been developed so far, one for shared memory machines, one for nonshared
memory machines, one for NUMA type machines, and one for a network of workstations. Every
time a new parallel machine is announced, the kernel can be ported to the new machine with
relatively little effort.3
3 VLSI Circuit Extraction
VLSI circuit layouts are typically described as a collection of rectangles in different mask levels. The
problem of circuit extraction is to take such a layout and determine the circuit connectivity, and
obtain estimates for various electrical parameters such as resistance of lines, capacitances of nodes
and dimensions of devices. The circuit extraction problem has two components: netlist extraction
and parameter extraction. The first component involves determination of the electrically connected
regions (called nets). To do this, boolean mask manipulations are performed on different layers to
derive new layers, as specified in a technology file. For example, in CMOS technology, N-type
transistors are obtained by intersecting poly, diffusion and pwell layers, whereas P-type transistors
are obtained by intersecting poly, diffusion and complement of pwell layers. A new diffusion layer
is obtained by intersecting the old layer with the complement of poly.
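The layer derivations in the CMOS example can be sketched with set operations. The sketch assumes each layer is represented as a set of unit grid cells, which is a simplification chosen for illustration (an extractor works on rectangles directly); the names cells and derive_layers are hypothetical.

```python
def cells(x1, y1, x2, y2):
    """Unit grid cells covered by the rectangle [x1, x2) x [y1, y2)."""
    return {(x, y) for x in range(x1, x2) for y in range(y1, y2)}


def derive_layers(poly, diffusion, pwell):
    """Derive device layers and the new diffusion layer from mask layers."""
    channel = poly & diffusion
    return {
        "ntrans": channel & pwell,       # poly AND diffusion AND pwell
        "ptrans": channel - pwell,       # poly AND diffusion AND NOT pwell
        "diffusion": diffusion - poly,   # old diffusion AND NOT poly
    }


# A vertical poly strip crossing two horizontal diffusion strips,
# the lower of which lies inside the p-well.
poly = cells(2, 0, 3, 6)
diffusion = cells(0, 1, 5, 2) | cells(0, 4, 5, 5)
pwell = cells(0, 0, 5, 3)
layers = derive_layers(poly, diffusion, pwell)
```

In this example the crossing inside the p-well becomes an N-type transistor cell, the crossing outside it becomes a P-type transistor cell, and both crossings are cut out of the new diffusion layer.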
The rectangles in the device layers are grouped into maximally connected groups, which form
the devices. The rectangles in the other layers are grouped into maximally electrically connected
sets, which form the nets. The electrical connectivity information is also provided in a technology
3This is true unless the architecture of the new machine is radically different from existing architectures. In this
case, a new implementation of the kernel best suited to the architecture will be developed.

file as mentioned earlier. This gives the layers that electrically connect on overlap. For example, in
CMOS technology, the metal and contact layers electrically connect on overlap, so do the diffusion
and contact layers.
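Grouping rectangles into maximally electrically connected sets can be sketched with a union-find structure. The sketch makes several assumptions that are not taken from the paper: rectangles are (layer, x1, y1, x2, y2) tuples, two rectangles connect when their closed extents intersect and their layers are listed as connecting, rectangles on the same layer always connect, and all pairs are compared (a practical extractor would use a scanline or spatial index). The names CONNECTS, touches and extract_nets are hypothetical.

```python
# Layer pairs that connect on overlap, as a technology file might list them.
CONNECTS = {
    frozenset(["metal", "contact"]),
    frozenset(["diffusion", "contact"]),
    frozenset(["metal"]),        # same-layer overlap (assumption)
    frozenset(["diffusion"]),
    frozenset(["contact"]),
}


def touches(a, b):
    """True if the closed rectangles a and b intersect."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2


def extract_nets(rects, connects=CONNECTS):
    """Return nets as sorted lists of rectangle indices."""
    parent = list(range(len(rects)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            layers = frozenset((rects[i][0], rects[j][0]))
            if layers in connects and touches(rects[i][1:], rects[j][1:]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Note that a metal rectangle lying over a diffusion rectangle does not join the two nets unless a contact rectangle overlaps both, which is exactly the information the technology file supplies.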
The parameter extraction component involves device size extraction, parasitic capacitance ex-
traction and resistance extraction of nets. Different models for parameter extraction with varying
accuracy and computational requirements have been proposed. The more accurate the model, the
more computation intensive it becomes. The HPEX model [23] is used for the circuit extraction
algorithm in this paper. For reasons of brevity, we do not discuss it further here.
Sequential circuit extraction is a well studied problem. Several sequential circuit extractors
of varying speed and accuracy already exist [7, 8, 10, 15, 21, 23]. Parallel algorithms for circuit
extraction have also been recently proposed [3, 4, 14, 25]. These algorithms perform parallel circuit
extraction in several phases, including a data distribution phase, a geometric extraction phase,
a merge phase, a device extraction phase and a parameter extraction phase. Such approaches
involve synchronization at the start of each phase of the execution. For example, it is necessary to
uniquely determine the nets and transistors before proceeding to the parameter extraction phase.
This reduces the processor utilization, especially on nonshared memory machines.
To improve the load balancing, two schemes were proposed for data distribution: area-based
partitioning [3] and point-based partitioning [4]. The former partitions the circuit into different
areas each of which was assigned to a processor which performed local netlist and transistor extrac-
tion for its region. This can result in load imbalance if certain areas of the circuit are denser than
others. This drawback is addressed by a point-based partitioning scheme which partitions
the circuit so as to approximately assign an equal number of rectangles to each processor. This
is costlier and more complicated than the area-based scheme, but yields better results for circuits
that do not have their rectangles evenly distributed. In the final phase, the complete nets are also
distributed across the available processors for load balancing reasons. The load balancing schemes
adopted were different for shared memory [4] and message passing machines [3].
The ProperCAD approach requires us to design programs that are not tailored to a particular
type of architecture. It also encourages the use of a coarse-grained data-flow style of execution where
an operation can be executed as soon as the data necessary to execute it is available. In parallel

circuit extraction, we adopt a hierarchical approach to decomposition. The circuit is partitioned
into several regions and each region is assigned to a processor. (The details of the data distribution
are provided in Section 3.1.) A sequential algorithm for local geometric extraction is then run on
each of the regions to determine the nets and transistors in that region. The nets and the transistors
touching a border of the region they belong to are deemed to be incomplete. Incomplete nets and
transistors are subject to a merge algorithm, whereas local nets and transistors are available for
processing using a sequential algorithm for parameter extraction.
The merge algorithm proceeds in a hierarchical manner where at each stage two adjacent re-
gions are merged. Following every stage of the merge algorithm, nets and transistors that become
complete are available for parameter extraction. Nets that are available for parameter extraction
are load balanced as and when they become complete, to ensure maximum utilization of processors.
As can be seen in the above brief description of parallel circuit extraction, the geometric extrac-
tion on each region as well as the parameter extraction are performed using a sequential algorithm.
It is easy to see that the best sequential algorithm can be used for this purpose. The focus of the
parallel algorithm is now simply that of (1) decomposing of the problem into subproblems, and
(2) merging these subproblems together.
However, how can the overhead of parallelization be kept in check? To see this, consider this
simple argument. Based on a property of trees, for a branching factor ≥ 2, the number of leaves in
a tree is always greater than the number of internal nodes. If it is possible to ensure that the work
done at a leaf node in a problem decomposition tree is at least 10 times the work done at an internal
node, the work done at the leaf nodes of the decomposition will dominate the execution time (>
90%). Thus, for circuit extraction, if we can conform to this rough criterion for decomposition, by
calling the best available sequential algorithm for geometric extraction and parameter extraction,
the total overhead of the parallel algorithm can be bounded to within 10% of the best sequential
algorithm.
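The arithmetic behind this bound can be checked with a short sketch. With L leaves and I internal nodes, L > I, and leaf work at least 10 times the internal work w, the leaf share of the total is at least 10Lw / (10Lw + Iw), which exceeds 10/11 (about 0.909). The function below assumes a complete tree with uniform work per node, a simplification made only for illustration; its name and parameters are hypothetical.

```python
def leaf_work_fraction(branching, depth, leaf_work, internal_work):
    """Fraction of total work done at the leaves of a complete tree.

    Assumes every internal node has exactly `branching` children and that
    all leaves (and all internal nodes) cost the same amount of work.
    """
    leaves = branching ** depth
    internal = (leaves - 1) // (branching - 1)   # geometric series sum
    total = leaves * leaf_work + internal * internal_work
    return leaves * leaf_work / total
```

For a binary decomposition of depth 5 with a 10:1 work ratio, the tree has 32 leaves and 31 internal nodes, so the leaves account for 320 of 351 work units. Higher branching factors only increase the leaf share.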
The above argument has been simplified somewhat for ease of explanation. However, the
conclusion is still valid. We demonstrate this in Section 3.6 where we discuss the performance
of the parallel circuit extraction algorithm. In the following discussion, we describe the different
phases of circuit extraction in more detail and how our algorithm avoids synchronization between
these phases.

readonly int grainsize;
branch office RectangleManager {
    HashTableEntry data-distribution[MaxHashSize];
    int initmsgcount;
    LocalPartitionTree mypartition;
    entry Init:
        Compute the circuit partitioning to determine which
        processor gets which region. The grainsize
        determines the depth of the recursive partitioning
    entry ReceivePartition: (message InitRectangles *msg)
        Receive and insert received rectangles into local partition tree
    entry ReceiveRectLoad: (message LoadMsg *msg)
        Receive the current rectangle load on other processors
    entry SendRectangles: (message SendRequest *msg)
        Send some of local rectangles to processors with less load
    RequestRectangles( region )
    Continue(region, decompose)
    /* other functions visible to other objects */
} /* RectangleManager */
Figure 2: The branch office object for data distribution of rectangles.
3.1 Data Distribution
In order to effect proper load distribution for parallel circuit extraction, it is important to ensure
balanced data distribution. This needs to be accomplished with minimum overhead during data
distribution, and at the same time, it must not complicate the merge phase of the circuit algorithm
which combines the results computed for each of the partitions of the circuit, as was the case in
the point-based partitioning in the PACE algorithm [4].
[Figure 3 diagram: the initial distribution of rectangles by the main object, followed by the local
partition trees built on Processors 0 to 3, with the number of rectangles shown at each node.
Point grain size = 150.]
Figure 3: A simple example illustrating the data distribution for 4 processors. Note that the
number of rectangles in a region is smaller than the sum of the rectangles in its 2 subregions since
border rectangles are given to both subregions.
The distribution of rectangles is implemented using the branch office object outlined in Figure
2. It also provides access routines to the distributed data. These routines are used as necessary
by the dynamically created objects in the system. In Figure 2, the C-code associated with the
Init entry point is executed on every processor upon creation of the branch office. A main object
reads in the circuit description and partitions the rectangles area-wise into n partitions, where n
is the number of available processors. The partitions are sent to the ReceivePartition entry point
on the respective processors. The rectangles are locally partitioned further based on a user-defined
threshold and the local load is broadcast to the ReceiveRectLoad entry point of the sibling branch
offices on the other processors. The branch offices then determine the best distribution of the local
partitions across the available processors. Some of the processors with surplus rectangles tag some
of their local regions as those to be processed by other processors. No movement of rectangles takes
place at this point. We discuss how this is accomplished below in more detail.
Initially, the circuit is partitioned using an area-based partitioning scheme so as to assign a
region of the circuit to every processor. The rectangles comprising the circuit are read in from a file
and sent to the processor owning the circuit partition to which they belong. A processor will
also get all the rectangles that touch a border of the circuit region it owns. These rectangles are
sent in several 'rectangle' messages to overlap the processing of these rectangles with the reading
in of the input.
Upon initialization, on each processor, a hash table is created to store the data distribution.
This table is used to store all the nodes resulting from the initial area-based partitioning, together
with local nodes resulting from the local point-based partitioning in the parallel circuit partition
tree (see Figure 3). Each processor also initializes two counts: a count of the number of 'rectangle'
messages it expects to receive (init-msg-count = 1) and a count of the number of messages it expects
to receive from the other processors indicating the local point-based distribution on the respective
processors: (rect-load-count = num-processors -1). Every message except the last sent to the
processors as the circuit is being read in carries a send-count field = zero. In the last message,
however, the send-count field is set to the number of messages sent + 1. Upon receipt of a message,
a processor increments its init-msg-count by 1 and decrements it by send-count. This ensures that
init-msg-count is zero if and only if all the messages have arrived, irrespective of the order of arrival.
The use of rect-load-count is described below.
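The counting scheme above can be checked with a small simulation. With n messages in total, the counter starts at 1, gains 1 per arrival, and loses n + 1 when the last-sent message arrives, so it reaches 1 + n - (n + 1) = 0 only when every message is in. The class Receiver and the helper functions are hypothetical names; only the counter rule itself is taken from the text.

```python
import random


class Receiver:
    """Tracks arrival of 'rectangle' messages using init-msg-count."""

    def __init__(self):
        self.init_msg_count = 1

    def receive(self, send_count):
        self.init_msg_count += 1            # one more message has arrived
        self.init_msg_count -= send_count   # zero except in the last-sent message

    def all_arrived(self):
        return self.init_msg_count == 0


def make_send_counts(n):
    """send-count fields of n messages: zero, except n + 1 in the last one."""
    return [0] * (n - 1) + [n + 1]


def simulate(n, seed):
    """Deliver n messages in a random order; record the test after each one."""
    counts = make_send_counts(n)
    random.Random(seed).shuffle(counts)
    receiver = Receiver()
    history = []
    for send_count in counts:
        receiver.receive(send_count)
        history.append(receiver.all_arrived())
    return history
```

If the last-sent message overtakes earlier ones, the counter goes negative and climbs back to zero as the stragglers arrive, so the test never fires early regardless of delivery order.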
As and when the messages are received, the rectangles in the message are inserted into a
partition tree. The root of the partition tree on every processor is the entire region owned by the
processor. Initially, the root of the tree is the only node in the tree. Rectangles are only stored
at the leaf nodes of this tree. When the number of rectangles at a leaf node L of the tree exceeds
a user defined limit (called point grain size), the region represented by L is split into two. Two
leaf nodes L1 and L2 are created as children of L, and the rectangles stored at L are distributed
between L1 and L2 (the rectangles on the border are given to both regions). Thus, when all the
rectangles from the main process have been received and processed, every leaf node has ≤ point
grain size rectangles. One triple (region, penum, rectangle-list) for every node in the partition tree
is stored in the local data-distribution hash table. (The center of the region is used as the key to
index the hash table.) This constitutes the local phase of data distribution.
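The local partition tree can be sketched as follows. Several details are assumptions made for illustration and are not stated in the text: regions and rectangles are integer (x1, y1, x2, y2) tuples, a region is bisected across its longer side, a rectangle belongs to every child region whose closed extent it intersects, and a region narrower than 2 units is never split (a guard against endless splitting when many rectangles straddle every cut). The names Node, overlaps, insert, split and leaves are hypothetical.

```python
class Node:
    def __init__(self, region):
        self.region = region     # (x1, y1, x2, y2)
        self.rects = []          # rectangles are stored only at leaves
        self.children = None     # None while this node is a leaf


def overlaps(rect, region):
    """True if the closed rectangle intersects the closed region."""
    rx1, ry1, rx2, ry2 = rect
    gx1, gy1, gx2, gy2 = region
    return rx1 <= gx2 and gx1 <= rx2 and ry1 <= gy2 and gy1 <= ry2


def insert(node, rect, grain):
    """Insert rect below node, splitting any leaf that exceeds the grain size."""
    if node.children is not None:
        for child in node.children:
            if overlaps(rect, child.region):   # border rectangles go to both
                insert(child, rect, grain)
        return
    node.rects.append(rect)
    if len(node.rects) > grain:
        split(node, grain)


def split(node, grain):
    """Bisect a leaf's region and hand its rectangles down to the children."""
    x1, y1, x2, y2 = node.region
    if max(x2 - x1, y2 - y1) < 2:
        return                                 # assumed guard: too small to bisect
    if x2 - x1 >= y2 - y1:
        mid = (x1 + x2) // 2
        regions = [(x1, y1, mid, y2), (mid, y1, x2, y2)]
    else:
        mid = (y1 + y2) // 2
        regions = [(x1, y1, x2, mid), (x1, mid, x2, y2)]
    node.children = [Node(region) for region in regions]
    rects, node.rects = node.rects, []
    for rect in rects:
        insert(node, rect, grain)


def leaves(node):
    """Yield the leaf nodes of the tree in depth-first order."""
    if node.children is None:
        yield node
    else:
        for child in node.children:
            yield from leaves(child)
```

Because a rectangle that straddles a cut is stored in both children, the total number of stored rectangles can exceed the number inserted, which is the effect noted in the caption of Figure 3.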
Once all the rectangles bound for a processor p have been received and processed, a message
containing the number of rectangles owned by p is broadcast to the other processors. This number
typically exceeds the number of rectangles received when the circuit was read in because it ac-
counts for the duplication of "border" rectangles. Note that no rectangles are sent across processor
boundaries at this time. Each processor will receive num-processors -1 such messages. The local
rect-load-count field is used to check the arrival of all such messages at a given processor. When all
these messages arrive, processors having more rectangles than the average assign some leaf regions
to lean processors. This is done by accessing the local hash table and changing the penum field
in the triple (region, penum, rectangle-list) appropriately. The rectangles are not sent to the lean
processors at this stage. Moreover, care is taken to ensure that several surplus processors do not all
assign rectangles to the same lean processor but distribute them across the lean processors uniformly.4
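The text does not spell out the rule by which surplus processors choose lean ones. The sketch below shows one possible deterministic rule, offered purely as an assumption: since every processor holds the same broadcast load figures, each can compute the same plan locally, pairing surplus processors with lean ones in processor-number order, without exchanging further messages. The function name plan_flows and its output format are hypothetical.

```python
def plan_flows(loads):
    """Return (surplus_pe, lean_pe, amount) transfers that level the loads.

    loads[p] is the number of rectangles owned by processor p. Every
    processor calling this with the same list obtains the same plan.
    """
    average = sum(loads) / len(loads)
    surplus = [[p, load - average] for p, load in enumerate(loads) if load > average]
    deficit = [[p, average - load] for p, load in enumerate(loads) if load < average]
    flows = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        amount = min(surplus[i][1], deficit[j][1])
        flows.append((surplus[i][0], deficit[j][0], amount))
        surplus[i][1] -= amount
        deficit[j][1] -= amount
        if surplus[i][1] <= 1e-9:    # this surplus processor is levelled
            i += 1
        if deficit[j][1] <= 1e-9:    # this lean processor is filled
            j += 1
    return flows
```

A surplus processor would then retag leaf regions whose rectangle counts approximate its planned amounts by rewriting the penum field of the corresponding triples. The amounts are targets only, since whole leaf regions are reassigned.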
After a processor receives its rectangles, creates its local partition tree, and broadcasts the
number of rectangles to other processors, it is ready to begin the decomposition phase (Section
3.2). It does not wait for the receipt of all rect-load-count messages from other processors.
The hash table, the partition tree and count information are all managed by a local data object
on each processor. These processes together provide a form of distributed data abstraction to the
processes created during the execution of the circuit extractor (see below).
3.2 The Decomposition Phase
Once all the rectangles have been read in and sent to the respective processors, an object responsible
for the entire circuit area is created. This is the CircuitExtractor object, which is outlined in Figure 4.
Briefly, decomposition continues until the user-defined threshold is reached. This decomposition and the
corresponding creation of objects mirrors the data partitioning performed by the branch office.
When decomposition stops, the RequestRectangles function of the local branch office is queried for
the rectangles in the specified region, with a "reply-to" entry point = ReceiveRectangles. If these
Footnote 4: No additional messages are sent to accomplish this. Each processor runs a local deterministic algorithm on the
periodic load information received from the other processors.

chare CircuitExtractor {
    LocalDataType data;
    ObjectIDType parentid;
    int numchildmsgs;
    BorderMsg *firstchildmsg;

    entry Decompose: (message CurrentRegion *msg)
    {
        If (LoadManager.Continue(msg->region, &decompose))
            If (decompose)
                Divide msg->region into two equal regions by bisecting its longer sides
                Create 2 CurrentRegion messages to represent these regions
                CreateChare(Decompose@CircuitExtractor, msg1, pe1)
                CreateChare(Decompose@CircuitExtractor, msg2, pe2)
            Else
                parentid = msg->objectid; numchildmsgs = 0;
                LoadManager.RequestRectangles(msg->region);
    }

    entry ReceiveRectangles: (message RegionRectangles *msg)
    {
        ConstructLocalRectangleLists(msg->rectangles, data);
        Identify Local Connected Nets netlist and Transistors tranlist
        ProcessTransistors(tranlist, &bordertranlist, &localtranresults, data);
        ProcessNets(netlist, &bordernetlist, &localnetresults, data);
        Report local transistors
        Insert complete nets in netlist in LoadManager:localnets
        Create a message bordermsg containing all border net and transistor info.
        SendMsg(CircuitExtractor@MergeRegions, bordermsg, parentid);
    }

    entry MergeRegions: (message BorderMsg *msg)
    {
        numchildmsgs = numchildmsgs + 1;
        If (numchildmsgs == 2) /* both messages received */
            MergeRegions(firstchildmsg, msg, data);
            Identify Local Connected Nets netlist and Transistors tranlist
            ProcessTransistors(tranlist, &bordertranlist, &localtranresults, data);
            ProcessNets(netlist, &bordernetlist, &localnetresults, data);
            Report local transistors
            Insert complete nets in netlist in LoadManager:localnets
            Create a message bordermsg containing all border net and transistor info.
            SendMsg(CircuitExtractor@MergeRegions, bordermsg, parentid);
        Else
            firstchildmsg = msg;
    }
}
Figure 4: The object that implements the circuit extraction algorithm. Some liberties have been
taken with notation for ease of exposition.

[Figure 5 diagram not reproduced. It shows the decomposition tree of the circuit, rooted at a node
for the whole region on PE 0, with child nodes labeled upper, lower, left and right and placed on
PE 0 to PE 3. The legend distinguishes global decomposition nodes, local decomposition nodes, and
leaf nodes (with rectangles), and the tree is annotated with its area-based partitioning and
point-based partitioning levels. Load balancing: one leaf is sent from PE 1 to PE 0 and one leaf
from PE 1 to PE 2.]

Rectangles per processor (deviation from the average of 324 in parentheses):
          Initial distribution    Final distribution
PE 0      190  (-134)             300  (-24)
PE 1      510  (+186)             340  (+16)
PE 2      270  (-54)              330  (+6)
PE 3      325  (+1)               325  (+1)
Grain size = 150; tolerance -50 to +50.
Figure 5: The dynamic redistribution of objects to load balance the rectangles in Figure 3. Each
node in this decomposition tree represents an extractor object.

rectangles are not available locally, the local branch office first determines the owner processor for
the requested rectangles. It then sends a message requesting the rectangles to the SendRectangles
entry point of its sibling branch office owning the rectangles. The owner branch office sends the
requested rectangles to the ReceiveRectangles entry point of the requesting object.
A CircuitExtractor object queries the local branch office to determine whether further decomposition
is necessary. This is determined as follows. If the region owned by the CircuitExtractor
object (its current region) is not present in the local hash table, further decomposition is deemed
necessary. If the current region falls within the circuit region owned by the local processor, it is
necessary to wait until the local data partitioning phase is complete. This is accomplished with
a "do not continue" response. The querying process suspends and relinquishes the processor upon
receiving a "do not continue" response. When local data partitioning is complete, these processes are
woken up and the query is retried. If further decomposition is required, the hash table provides
the processors on which the child process instances are to be created. Recall that the destination
of leaf processes may change due to the distribution of rectangles as explained in Section 3.1. Non-leaf
processes resulting from the initial area-based partitioning are assigned processors statically. A
non-leaf process resulting from point-based partitioning is created on the same processor on which
its parent resides (see Figure 5).
If the information necessary to answer the query is available locally, the data distribution hash
table is checked for an entry corresponding to the current region. Recall that only hash table entries
for leaf nodes in the decomposition tree carry rectangles. Hence, if the entry in the hash table has
no rectangles, the process may continue execution, but must decompose the current region further.
This is accomplished by dividing the current region into two equal parts by bisecting its longer
sides. An instance of the CircuitExtractor object is created for each of these regions.
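The bisection step can be sketched in Python as follows. This is a sequential illustration under stated assumptions, not the message-driven CircuitExtractor code: regions and rectangles are plain `(xmin, ymin, xmax, ymax)` tuples, and the helpers `bisect`, `touches` and `decompose` (including its `max_depth` guard) are hypothetical names introduced here.

```python
def bisect(region):
    """Split a region into two equal halves by cutting its longer sides."""
    xmin, ymin, xmax, ymax = region
    if xmax - xmin >= ymax - ymin:
        xmid = (xmin + xmax) / 2
        return (xmin, ymin, xmid, ymax), (xmid, ymin, xmax, ymax)
    ymid = (ymin + ymax) / 2
    return (xmin, ymin, xmax, ymid), (xmin, ymid, xmax, ymax)


def touches(rect, region):
    """True if the rectangle overlaps the region or touches its boundary."""
    return (rect[0] <= region[2] and rect[2] >= region[0]
            and rect[1] <= region[3] and rect[3] >= region[1])


def decompose(region, rects, grain, max_depth=32):
    """Return a list of (leaf region, rectangles) pairs.

    A region becomes a leaf once it holds at most `grain` rectangles.
    Rectangles that reach a common border are kept in both halves.
    """
    local = [r for r in rects if touches(r, region)]
    if len(local) <= grain or max_depth == 0:
        return [(region, local)]
    first, second = bisect(region)
    return (decompose(first, local, grain, max_depth - 1)
            + decompose(second, local, grain, max_depth - 1))
```

For example, `decompose((0, 0, 8, 4), rects, grain=2)` with three rectangles, one of which straddles the line x = 4, yields two leaves that both contain the straddling rectangle, which corresponds to the duplication of border rectangles mentioned in Section 3.1.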
If the current region falls within a region owned by another processor, no further decomposition
is necessary since only leaf nodes in the decomposition tree may cross processor boundaries for
load balancing. The process then requests the rectangles that belong to its current region and
relinquishes the processor. The local data process sends the necessary data and wakes up the
requesting process. Sometimes, it may be necessary for the local data object to forward the query
to a data object on another processor to satisfy the request. In this case, the data object on the
processor owning the rectangles sends the rectangles to the requesting CircuitExtractor object. This
is done transparently as far as the requesting CircuitExtractor object is concerned.
The requesting CircuitExtractor object is now primed for local processing. Due to the large sizes
of the messages exchanged between processes, processes created during area-based partitioning (see
Figure 5) are mapped onto processors so that a parent CircuitExtractor object and one of its children
reside on the same processor.
3.3 The Local Extraction Phase
We first describe the local processing performed with the assumption that all nets and devices
computed are completely local to the region. We then discuss how nets and devices touching the
border are handled.
The local processing of a region of the circuit is very similar to that employed by the PACE
algorithm. To avoid repetition, we describe it very briefly here, emphasizing the differences between
the two algorithms. A scan line algorithm is used to determine the local connected components and
to identify nets and transistors. This forms the netlist extraction component of the algorithm. The
output of this component is a list of devices and a list of nets. A device is described as a collection of
device rectangles. For each device, information about the nets connecting to the different terminals
of the device is also computed. A net is also described as a collection of rectangles. For each net,
information about devices that connect to the net is computed.
For parameter extraction, we use the resistance-capacitance model used in HPEX [23]. The
rectangles of a net are converted into a horizontally maximal non-overlapping form. This is also
accomplished by a scan line algorithm and produces a unique representation of the net. The
horizontally long rectangles are then combined in the vertical direction. Two rectangles that are
sufficiently longer in the x direction than the y direction are combined vertically if they abut on
their horizontal edges. Once this is done, for every rectangle R that is longer in one direction than
other rectangles abutting it, R's larger side is cut at the point of intersection with the abutting
rectangles. Two overlapping rectangles resulting from such intersections are merged.
Two rectangles are said to be electrically connected if they abut each other. A rectangle that
connects to only one other rectangle, or to at least three rectangles, is defined to be a knot.
Rectangles associated with the terminals of devices are defined to be ports. The remaining rectangles
all have exactly two connections each and are defined to be branch rectangles. Every knot and port
is assigned a globally unique number which identifies a point of connection in the circuit. A net
can be thought of as an undirected graph where the knots and ports are nodes that are connected
to each other by edges which represent a chain of one or more branch rectangles.
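The classification into knots, ports and branch rectangles can be illustrated with a short Python sketch. It is not the extractor's scan line code: the quadratic pairwise test, the function names `abut` and `classify`, and the treatment of an isolated rectangle as a knot are assumptions made for the example.

```python
def abut(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) share an edge segment or overlap."""
    x_overlap = min(a[2], b[2]) - max(a[0], b[0])
    y_overlap = min(a[3], b[3]) - max(a[1], b[1])
    # Touching only at a corner (both overlaps zero) does not count.
    return (x_overlap >= 0 and y_overlap > 0) or (x_overlap > 0 and y_overlap >= 0)


def classify(rects, port_ids=()):
    """Label each rectangle index as 'port', 'knot' or 'branch'."""
    degree = [0] * len(rects)
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if abut(rects[i], rects[j]):
                degree[i] += 1
                degree[j] += 1
    kinds = {}
    for i, d in enumerate(degree):
        if i in port_ids:
            kinds[i] = "port"      # rectangle tied to a device terminal
        elif d == 2:
            kinds[i] = "branch"    # exactly two connections
        else:
            kinds[i] = "knot"      # one, or three or more (zero: assumed knot)
    return kinds
```

For a chain of three abutting rectangles, the two end rectangles come out as knots and the middle one as a branch rectangle, so the net is a graph with two nodes joined by one edge.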
Resistance calculations are then performed for every edge in the graph representing a net. The
contributions of the knots and ports are also factored into the calculations. Capacitance calculation is
performed at the same time. The capacitance of each of the knots and ports is first determined.
The capacitance of each edge in the graph is computed by adding the contributions of all
the branch rectangles on the edge. This capacitance is equally divided between the two end points.
The result of this phase is a distributed RC network for each net. Currently, we do not perform a
node reduction phase on the resulting network as is done in HPEX [23], but this feature can be
included easily.
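The equal division of edge capacitance between end points can be written down in a few lines of Python. The sketch assumes that per-rectangle capacitances have already been computed by whatever model is in use; the function name `lump_capacitance` and the data layout are hypothetical.

```python
def lump_capacitance(node_cap, edges):
    """Lump edge capacitance onto the end points of each edge.

    node_cap: dict mapping a knot or port id to its own capacitance.
    edges:    list of (u, v, branch_caps), where branch_caps holds the
              capacitance contribution of every branch rectangle on the edge.
    """
    total = dict(node_cap)
    for u, v, branch_caps in edges:
        half = sum(branch_caps) / 2.0   # split equally between the end points
        total[u] += half
        total[v] += half
    return total
```

With nodes K1, K2 and K3 of capacitance 1.0, 2.0 and 0.5, an edge K1-K2 whose branch rectangles contribute 0.25 and 0.75, and an edge K2-K3 contributing 1.0, the lumped values are 1.5, 3.0 and 1.0.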
For parallel execution, the maximally connected nets and devices are identified as described
above. Following that, they are subjected to the horizontal and vertical transformations and unique
identification of knots and ports. However, both nets and devices may touch a border. All knots
touching a border are marked as border rectangles. Rectangles abutting an incomplete transistor
are also treated like border rectangles. Furthermore, for every edge in the graph representing a
net, if any rectangle on the edge (including the end points) touches the border, all the rectangles
including the end points are marked. Only the marked regions of a net will be sent to a parent
process during the merge phase. The resistances and capacitances are computed locally for the
non-marked regions of the net and the rectangles are then discarded.
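The edge-marking rule can be sketched as follows in Python. The representation of an edge as `(u, v, branch_ids)` and the function name `mark_border_edges` are assumptions chosen for the illustration.

```python
def mark_border_edges(edges, touches_border):
    """Return the set of rectangle ids that must be sent to the parent.

    edges:          list of (u, v, branch_ids) for one net, where u and v are
                    the knot or port ids at the ends of the edge.
    touches_border: set of rectangle ids that touch the region border.
    """
    marked = set()
    for u, v, branch_ids in edges:
        members = [u, v, *branch_ids]
        # One border rectangle anywhere on the edge marks the whole edge,
        # end points included.
        if any(m in touches_border for m in members):
            marked.update(members)
    return marked
```

Note that an end point can be marked through one edge while another edge attached to it stays unmarked; resistance and capacitance for that unmarked edge can still be computed locally.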
Resistance and capacitance calculations are performed as before with some exceptions. (a)
Capacitance is not computed for marked rectangles. (b) Resistance is not computed for marked
rectangles. (c) Resistances and capacitances are computed for edges that are not marked even
if one or both of their end points are marked. In this case, the computed capacitance is divided
equally between the end points as described earlier. Resistances are reported immediately using
the unique node identifiers assigned to their end points. Marked end points carry the partially
computed capacitances and their unique node identifiers up the decomposition tree until they
become unmarked. The results are reported at this stage. Note that the same unique identifiers
will be used to report the capacitances. (d) Knot rectangles that are marked but do not touch the
border are tagged as knots to avoid being identified as branch rectangles in an ancestor region.
The marked regions of a net together with partially computed resistances and capacitances at
the knots not touching a border are sent up to the parent process. The computed values for the
unmarked regions of a net are reported with a globally unique number identifying the net. However,
if the reported results correspond to an incomplete net, the results are tagged as incomplete.
Local transistors, like local nets, pose no problem. Local transistors are reported as soon as
they are encountered and processed. Border transistors, however, can pose potential problems. For
each complete transistor, unique node numbers identify the gate, source and drain terminals. One
node is created to represent the gate of the transistor and is tagged as a poly port. Two additional
nodes are created for the source and drain nets of a transistor and tagged as diffusion ports. A
rectangle bounding the channel rectangles of the transistor is used to represent the diffusion ports.
This guarantees that the relevant sections of the source and drain nets will be marked as border
rectangles if the transistor touches the border. The terminals of a transistor are only determined
once a transistor is complete. The channel rectangles of an incomplete transistor are sent up to
the parent process together with references to all nets abutting it. Unlike nets, device extraction
results are only reported once the device is complete.
We illustrate this with the help of the simple example circuit in Figure 6. Figure 6(a) describes
an entire circuit in a region R0 in a form ready for parameter extraction if executed on one processor.
We consider the case where it has been divided into two regions R1 and R2 (Figure 6(b)) which
are processed by two processors. Figure 6(c) describes the state after local netlist and device
extraction. Transistors T1 and T2 are recognized as incomplete on their respective processors. The
determination of the source and drain nets is deferred until the transistor becomes complete. The
regions of the nets that are connected to the common border are marked as shown. The resistance
and capacitances for the unmarked regions are then computed. For example, the capacitance
computed and lumped at knot K2 is reported (as belonging to net 2). The resistances between K1
and K2, K2 and K3 etc. are also reported. The partial capacitances computed at knots K1, K3,
K5 and K6 together with the node identifiers assigned to them are sent up to the parent process

[Figure 6 diagram not reproduced. Panels: (a) the entire circuit in Region 0, with its diffusion
regions; (b) the circuit divided into Region 1 and Region 2; (c) the state after local extraction,
where the shaded regions are sent up to the parent and partial results are reported for the
non-shaded regions; (d) the (parent) Region 0 after the merge.
Annotations in panel (c):
    Incomplete T1: Gate net = Net 1, Source net = Net 2, Drain net = NULL
    Incomplete T2: Gate net = Net 3, Source net = NULL, Drain net = Net 4
Annotations in panel (d):
    Net 1 subsumes Net 3; Net 5 subsumes Net 2
    T1 complete: Gate net = Net 1, Source net = Net 5, Drain net = Net 4]
Figure 6: A simple example illustrating the behavior of the parallel circuit extraction algorithm.

responsible for R0. No resistance or capacitance values are computed for knot K4.
As soon as a leaf process completes local geometric extraction, it processes the local complete
transistors, deposits the local complete nets in a local nets list in the local data object, and sends
a message to its parent containing the border nets and transistors. It then terminates and frees up
any allocated memory.
3.4 The Merge Phase
At the leaf processes of the circuit decomposition tree, four lists of border rectangles are created,
one for each border. For efficiency, these rectangles are chopped to line segments along the border.
Overlapping line segments are then combined to reduce the number of line segments sent up to
a parent process. Note that two adjacent regions will have the same set of line segments on
the common border since rectangles touching the common border are made available to both the
regions. These line segments point to the net or device that they belong to.
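Combining overlapping line segments along one border amounts to an interval merge. The Python sketch below assumes each segment is a `(lo, hi, owner)` triple measured along the border, where `owner` identifies the net or device; segments are merged only within the same owner, and segments that merely touch end to end are merged as well. These choices, and the name `merge_segments`, are assumptions rather than details taken from the paper.

```python
def merge_segments(segments):
    """Combine overlapping border segments that belong to the same owner.

    segments: iterable of (lo, hi, owner) triples along one border.
    Returns a sorted list of merged (lo, hi, owner) triples.
    """
    by_owner = {}
    for lo, hi, owner in segments:
        by_owner.setdefault(owner, []).append((lo, hi))

    merged = []
    for owner, spans in by_owner.items():
        current = None
        for lo, hi in sorted(spans):
            if current is not None and lo <= current[1]:
                # Overlaps (or touches) the span being built: extend it.
                current = (current[0], max(current[1], hi))
            else:
                if current is not None:
                    merged.append((current[0], current[1], owner))
                current = (lo, hi)
        merged.append((current[0], current[1], owner))
    return sorted(merged)
```

Since two adjacent regions see the same rectangles on their common border, running this on each side produces matching segment lists, which is what lets the parent pair them up after sorting.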
When a non-leaf process in the decomposition tree receives one message each from its children, it
is ready to begin processing them. The common border between the two child regions is determined
and the set of line segments corresponding to the respective borders of the child regions are sorted.
The nets or devices corresponding to the same line segment in the two lists are merged.
Merging two nets will result in one net subsuming another. However, the child process may have
reported results corresponding to local parts of these nets using the globally unique identifiers given
to these nets during the local extraction. Thus, the resulting net creates a list of net identifiers
of all nets it subsumes. A net may subsume more than one other net during the merge process.
This suggests that a list of merged net identifiers needs to be created in the general case. It is also
possible for two nets that have each subsumed one or more nets to be merged. This means that
the corresponding lists of merged net identifiers also need to be combined when two nets are merged.
Once the merge operation is completed at an internal node in the decomposition tree, for every net
N resulting from the merge, the list of nets subsumed by N is reported.
For transistors, the lists of channel rectangles of the transistors being merged are combined and
carried by the resulting transistor. As mentioned during the local extraction phase, it is necessary
to keep the information relating to the connecting nets consistent. Thus, in Figure 6, when Net 1
subsumes Net 3, Net 3 is marked as invalid and then carries a reference to Net 1, the net subsuming
it. Consistency is maintained by merging the lists of abutting nets for the two transistors. For example, after the
merge of transistors T1 and T2 in Figure 6, the resulting transistor T1 will carry references to Net
1, Net 4 and Net 5. This is done by following references to subsumed nets until a reference to a
net that has not been subsumed by another is found. If a transistor becomes complete (i.e., does
not touch a border) following a merge, port nodes are created for the gate, source and drain nets
to enable computation of the capacitances at the terminals of the transistor.
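The bookkeeping for subsumed nets can be sketched in Python as below. The class `NetTable` and its methods are hypothetical names; the sketch only mirrors the two operations described in the text, namely combining the lists of subsumed identifiers and following references until a net that has not been subsumed is reached.

```python
class NetTable:
    """Tracks which nets have been subsumed by which during merging."""

    def __init__(self):
        self.subsumed_by = {}   # net id -> id of the net that subsumed it
        self.merged = {}        # surviving net id -> ids of all nets it subsumes

    def resolve(self, net):
        # Follow references until a net that has not been subsumed is found.
        while net in self.subsumed_by:
            net = self.subsumed_by[net]
        return net

    def subsume(self, winner, loser):
        winner, loser = self.resolve(winner), self.resolve(loser)
        if winner == loser:
            return winner
        self.subsumed_by[loser] = winner
        # The winner inherits the loser and everything the loser had subsumed.
        inherited = [loser] + self.merged.pop(loser, [])
        self.merged.setdefault(winner, []).extend(inherited)
        return winner
```

Using the numbers of Figure 6, after `subsume(1, 3)` and `subsume(5, 2)` the abutting nets of T1 (Net 1, Net 2) and T2 (Net 3, Net 4) resolve to Net 1, Net 4 and Net 5, which matches the references the merged transistor is said to carry.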
All border line segments of the other borders of the two regions being merged are then combined
as necessary to create the border line segments of the new parent region. The nets and devices
pointed to by some of these border segments may have become subsumed by another net; this is
also resolved during the creation of the border segments of the new region. The subsumed nets and
devices are then released.
The partial nets that are produced as a result of the merge operation are then processed as in
the local extraction phase. Once again, some partial results may be computed, and some rectangles
may be marked as border rectangles. This is identical to the local extraction phase with the caveat
that only the nets involved in the merge operation are processed. Incomplete nets that are sent up
by the child processes, but do not touch the common border, are not processed at this point.
Once processing is complete, local results are reported. Results that correspond to complete
nets and transistors are tagged as such. Other partial results on nets are reported but tagged as
incomplete.
Thus, at each level of the circuit decomposition tree, the partial nets sent up get smaller and
smaller since parts of them become eligible for processing, and the capacitance and resistance
computation for these parts is performed. Only the rectangles of a net marked as border rectangles
are sent up to the parent for further processing.
In Figure 6(d), we see the result of the merge phase on the simple example. K1', K3', K5' and
K6' denote knots in the child process carrying partial resistance and capacitance information. Note
that net 2 is a disconnected net at this point. A new knot K7 is created after merging nets 1 and 3.
The identifier of the new net is arbitrarily assigned to be one of the two nets. Thus Net 1 subsumes
Net 3 and Net 5 subsumes Net 2. Once the merge is complete, the gate, source, and drain nets of

branch office NetlistManager {
    Nets *longnets;
    Nets *localnets;

    entry Init:
        Initialize longnets and localnets to NIL
    entry ReceiveNetLoad: (message LoadMsg *msg)
        Receive the current net load on other processors
    entry ReceiveNets: (message NetList *msg)
        Receive nets that need to be processed and add them to local lists
    entry ProcessNet: (message DummyMsg *msg)
    {
        Process one net in localnets list and report results;
        SendMsg(ProcessNet@NetlistManager, msg, MyPeNum());
    }
    /* other operations visible to other objects */
} /* NetlistManager */
Figure 7: The branch office object for distribution and parameter extraction of nets.
the complete transistor T1 are determined. (These nets may not be complete at this stage. Only
the region abutting the transistor has to be complete.) The computed values at two knots (their
labels are illegible in the source) are reported for net 5 together with the information that nets 1
and 5 have subsumed nets 3 and 2 respectively.
3.5 Dynamic Load Balancing
In Section 3.1, we described the algorithm that is used to effect good data distribution. This,
however, only serves to distribute the effort of local extraction effectively across the available
processors. As described in Section 3.4, the merge algorithm may detect and identify completed
nets at different non-leaf nodes in the circuit partition tree in Figure 5. The device and parameter
extraction of these completed nets can be the most time consuming part of the execution, especially
if a computationally complex model for resistance/capacitance computation is used.
In the PACE algorithm [4], all completed nets are identified in one phase. They are then
randomly distributed across the available processors. In our algorithm, completed nets may be
identified during the local extraction phase as well as at every stage in the merge phase. The processing
of these nets is done as and when they become available. Moreover, in our approach, we also process
'completed' regions of incomplete nets during the merge phase to the extent it is possible to do
so uniquely. As a result, a more interesting load balancing scheme is used. Figure 7 outlines the
branch office object used to manage netlists and load balance them across the available processors.
Nets are identified as large or small based on a user-specified value for the number of rectangles
contained in the net. All small nets are initially retained on the processor on which they were
identified in a local nets list in the local data object. One process is created to perform device
and parameter extraction on each long net as and when it is identified and randomly assigned to a
processor.
During execution, on each processor, a count of the total number of rectangles in completed
small nets is maintained. In addition, whenever a long net process is picked up, the number of
rectangles in the long net is added to this count. Each processor periodically broadcasts this count
to all the other processors.
Upon receipt of the rectangle counts on the other processors, each processor independently runs
a simple balancing algorithm to determine the best distribution of rectangles within a predefined
tolerance limit. Note that all processors will arrive at the same distribution since they run a
deterministic algorithm on the same data. This identifies the donor and recipient processors as
was done for data distribution. Care is taken to match donor and recipient processors to ensure
that load distribution is even. A donor processor then sends a set of small nets to the appropriate
recipient processor. Recipient processors do not take any action during the load re-distribution
stage. Any nets received by a recipient object are added to the local nets list in the local data
object.
Table I: The characteristics of the benchmark circuits used.
                               Circuit Characteristics
Circuit                  Rectangles     Nets    Transistors
Prog. Logic Array             25384      912           2508
Hypercube Router              52893     1744           3476
Multiplier Array              64031     4227           8384
Static Ram                   128073     5136          14296
Placement Coprocessor        253556    10266          28494

Table II: Execution times (in seconds) of benchmark circuits on the Network of Sun (SPARC)
workstations. These data are subject to wide variations due to context switching between Unix
processes and network traffic. The data here is provided primarily as a proof of concept.
                    Network of Sun workstations
Circuit        Sequential     1 PE    2 PEs    4 PEs
PLA                  20.1     22.6     12.3      6.8
H. Router            65.2     73.3     56.2     34.5
M. Array             74.4     82.2     71.5     41.2
S. Ram              107.8    122.8    100.0     66.2
Coprocessor                      -        -    314.8

An interesting feature of the ProperCAD environment is that it provides support for prioritized
execution of objects. To ensure effective load balancing, priorities are assigned to the different
phases of execution. Messages that periodically exchange the load information (the rectangle count
for extraction) get the highest priority to ensure prompt action upon detecting unbalanced load.
Leaf CircuitExtractor objects get the next highest priority. When no leaf objects are available, non-leaf
objects are picked up. Long nets get priority lower than CircuitExtractor objects but higher
than short nets. In this way, local pools of short nets can be used to maximum effectiveness to
correct any load imbalance that may be recognized.
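The priority ordering of load messages, leaf extractors, non-leaf extractors, long nets and short nets can be mimicked with a small priority queue in Python. This is only a model of the ordering: the numeric priorities, the kind names and the `Scheduler` class are assumptions, and the actual mechanism is the prioritized execution support of the underlying environment.

```python
import heapq
from itertools import count

# Smaller number = picked up earlier. The ordering follows the text; the
# numeric values themselves are arbitrary.
PRIORITY = {
    "load_info": 0,          # periodic rectangle-count messages
    "leaf_extractor": 1,     # leaf CircuitExtractor objects
    "nonleaf_extractor": 2,  # non-leaf CircuitExtractor objects
    "long_net": 3,
    "short_net": 4,
}


class Scheduler:
    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker: first in, first out within a kind

    def post(self, kind, payload):
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), kind, payload))

    def next_message(self):
        _, _, kind, payload = heapq.heappop(self._heap)
        return kind, payload

    def __len__(self):
        return len(self._heap)
```

Because short nets sit at the bottom of the ordering, they are consumed only when nothing more urgent is pending, which is what makes them a convenient pool of work to donate when load information shows an imbalance.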
3.6 Performance
We can now demonstrate the performance of the ProperEXT circuit extractor on a variety of
parallel machines. Table I lists the benchmark circuits used in the experiment and their characteristics.
The benchmarks used were real circuits, some of which had been designed as projects for
a graduate course in VLSI design. These circuits are the same as those used in [3, 4] to demonstrate
the performance of the PACE algorithm. Circuits ranging in size from 25000 rectangles to
250000 rectangles were used. The largest circuit had over 32000 transistors. The circuits were all
in hierarchical CIF format and were flattened before data distribution. Tables II - VI report the
execution time in seconds on all these circuits. The reported times exclude the time for input and
output. The grain size used for all the circuits was 500, i.e., the circuits were partitioned into
regions containing 500 or fewer rectangles.

Table III: Execution times (in seconds) of benchmark circuits on the Encore Multimax.
                          Encore Multimax
                 PACE Algorithm       ProperCAD Algorithm
Circuit           1 PE    8 PEs        1 PE      8 PEs
PLA                                    64.5       10.4
H. Router          196       43       211.8       29.4
M. Array           221       41       238.1       64.2
S. Ram             305       63       332.9       55.0
Coprocessor        691      137       723.7      124.6

Table IV: Execution times (in seconds) of benchmark circuits on the Sequent Symmetry.
                 Sequent Symmetry
Circuit           1 PE     8 PEs
PLA              119.9      18.2
H. Router        409.7      51.7
M. Array         447.8      96.6
S. Ram           630.5      96.2
CoProcessor     1276.7     233.2

Table V: Execution times (in seconds) of benchmark circuits on the NCUBE/2 hypercube.
                          NCUBE/2 (hypercube)
Circuit          4 PEs    8 PEs   16 PEs   32 PEs   64 PEs
PLA               25.7     16.3      8.0      6.8      4.4
H. Router         87.2     47.7     38.2     34.0     34.7
M. Array             -     97.1     84.3     80.5
S. Ram               -        -     41.8     34.0     28.6
CoProcessor                                  59.7     48.9

Table VI: Execution times (in seconds) of benchmark circuits on the Intel i860 hypercube.
                    Intel i860 (hypercube)
Circuit           1 PE    2 PEs    4 PEs    8 PEs
PLA               13.0      6.6      3.7      2.4
H. Router         45.7     20.1     11.3      6.2
M. Array          51.0     31.1     16.2     20.1
S. Ram               -     37.7     19.2     12.6
CoProcessor          -        -        -     27.7
In Table II we report the performance of the ProperEXT circuit extractor on a network of 4 Sun
Sparc 1 workstations each with 24MB of memory. Only workstations with identical configurations
were used for the experiment, and only 4 workstations with the above configuration (and the requisite
memory) were available. Data is also presented in Table III for an 8-processor
Encore 510 Multimax with XPC processors running the UMax 4.3 operating system with 64MB of
main memory. In Table IV, the results of running the extractor on a Sequent Symmetry with 8
Intel 386 processors and 32MB of memory are presented (see footnote 5). Table V provides data on a 64-processor
NCUBE 2 at Sandia National Laboratories with 4 MB per processor (see footnote 6). It is important to emphasize
that the circuit extractor ran unchanged on all these machines.
Footnote 5: We thank Argonne National Laboratories for access to the Sequent Symmetry and the Intel i860 hypercube. The
Symmetry had 26 processors. However, the presence of other users in addition to the limited available memory made
it impossible to use more than 8 processors for our experiment.
Footnote 6: The NCUBE 2 at Sandia National Laboratories has 1024 nodes. However, due to heavy use, only 64 nodes were
available at the time of running the experiment.

We now consider the performance on each of these machines. First, modest speedups were
evident on the network of Sun workstations. Due to context switching by the Sun operating system,
network traffic due to page faults and other workstations on the network, wide variations in the
performance of the extractor were observed. Furthermore, in the presence of context switching
between Unix processes, the execution time across different workstations on the network was quite
inconsistent. As a result, we present the results in Table II more as a proof of concept: that a
network of Suns can be used as a parallel machine to distribute the computation. Column 2 in
Table II provides the time taken by a uniprocessor implementation of PACE [4]. In Column 3,
the time taken by the ProperEXT extractor on one processor is presented. The difference between
Column 3 and Column 2 represents the overhead of parallelization in the ProperEXT extractor.
As can be seen from these two columns, the overhead of parallelization was approximately 10-12%.
(As mentioned earlier, this overhead can be controlled by the programmer by specifying the grain
size appropriately.)
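The overhead figure can be recomputed directly from Columns 2 and 3 of Table II. The short Python sketch below does this for the four circuits that have both measurements; the dictionary and function names are introduced here for illustration, and the definition of overhead as the relative increase of the 1 PE time over the sequential time is our reading of the text.

```python
# (sequential time, ProperEXT time on 1 PE) in seconds, copied from Table II.
TABLE_II = {
    "PLA": (20.1, 22.6),
    "H. Router": (65.2, 73.3),
    "M. Array": (74.4, 82.2),
    "S. Ram": (107.8, 122.8),
}


def overhead_percent(sequential, one_pe):
    """Relative cost of the parallel program on one processor, in percent."""
    return 100.0 * (one_pe - sequential) / sequential


overheads = {name: round(overhead_percent(seq, par), 1)
             for name, (seq, par) in TABLE_II.items()}
```

For these four circuits the computed overheads lie between roughly 10% and 14%, in line with the approximate figure quoted above. The same two-argument ratio, applied to the 1 PE and multi-PE columns of the other tables, gives the corresponding speedups.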
In Table III, we compare the results of the ProperEXT circuit extractor with the PACE circuit
extractor [4] on the Encore Multimax. Although the PACE extractor was designed
and programmed specifically for the Encore Multimax, the ProperEXT extractor is only marginally
slower on one processor and is significantly faster on 8 processors for 3 of the 4 circuits for which
PACE data is available. Data for
the PLA circuit was not available for the PACE extractor. The ProperEXT extractor does not
perform as well as the PACE extractor on the Multiplier Array circuit. This was observed to be
due to the completion of a single large net at the root of the decomposition tree. Since few other
nets are available for balancing the load, the processor performing parameter extraction on this net
is the sole active processor at this time. In the PACE algorithm, no processing is done until all the
nets are complete. This makes it possible to distribute the load across processors more effectively
in this example. This approach proves significantly costlier on the other circuits, however.
In Tables V and VI, we demonstrate the performance of the ProperEXT extractor on the
NCUBE 2 and Intel i860 hypercubes. Again, with the exception of the Multiplier Array circuit,
the benchmark circuits exhibit good speedups.
3.7 Varying the Grain Size
An important question that needs to be addressed is the importance of the choice of grain size.
How does the programmer determine the right grain size? In Figures 8 and 9 we study
the effect of varying the grain size on the execution time. Two experiments are reported: one on a
shared memory machine, a Sequent Symmetry with 8 Intel 386 processors, and one on a message passing
machine, an Intel i860 hypercube with 8 processors.

[Figure 8 plot not reproduced: execution time in seconds (vertical axis, 0 to 250) against grain
size in rectangles per region (horizontal axis), with one curve each for 1, 2, 4 and 8 PEs.]
Figure 8: The effect of varying the grain size from 25 to 25000 rectangles per region for the PLA
benchmark circuit on a shared memory machine: the Sequent Symmetry.
[Figure 9 plot not reproduced: execution time in seconds (vertical axis, 0 to 25) against grain
size in rectangles per region (horizontal axis), with one curve each for 1, 2, 4 and 8 PEs.]
Figure 9: The effect of varying the grain size from 25 to 25000 rectangles per region for the PLA
benchmark circuit on a message passing machine: the Intel i860 hypercube.

Table VII: Execution times (in seconds) of ProperTEST sequential test pattern generator running
ISCAS89 sequential benchmark circuits on a network of Sun Sparc workstations.
                    Network of Sun workstations (Distributed Processing)
Circuit  #PEs  Time/Flt   T.Gen.  F.Sim.  Overhead  Coverage  Efficiency  #Vectors  #Procs.
s386        1        15    311.1     3.0       2.1      81.8       100         333     1713
            2        15    148.0     1.9                81.8       100         293     1483
            4        15     89.4     1.8                81.8       100         361     1525
            8        15     39.2     1.2                81.8       100         369     1307
s713        1         2     42.2     3.9       1.9      81.6       97.6        182      720
            2         2     23.3     2.7                81.9       98.5        204      725
            4         2     11.1     1.8                81.9       98.5        236      735
            8         2      7.1     1.2                81.9       98.3        246      728
s1196       1         2      8.8    13.2       1.8      99.8       100         365     1243
            2         2      6.2    10.4                99.8       100         387     1243
            4         2      3.7     6.9                99.8       100         380     1243
            8         2      1.7     4.1                99.8       100         370     1243
s1238       1         2     18.6    15.8       2.4      94.7       100         383     1363
            2         2     11.5    10.9                94.7       100         374     1362
            4         2      7.0     7.4                94.7       100         398     1363
            8         2      3.7     5.2                94.8       100         406     1362
s5378       1         1   1384.2   200.2      69.4      70.6       72.1        849     4604
            2         1    899.6   113.8                70.5       72.0        929     4604
            4         1    523.3    72.4                72.3       71.8        950     4604
            8         1    298.7    49.6                68.6       70.2       1095     4604
We varied the point grain size (see Figure 3) from 25 to 25000. As the grain size is increased,
the amount of parallelism exploited is reduced. For very small grain sizes (i.e., fewer than 100
rectangles per region), the execution time is quite high, indicating a high overhead of
parallelization. However, as can be seen, there is a wide range of grain sizes over which the
execution time exhibits little or no change. Thus, any choice of grain size within this wide
range is suitable for executing the program.
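One way to make the "wide range" observation operational is to sweep the grain size and keep every setting whose execution time lies within a small tolerance of the best time. The sketch below is only an illustration of that idea: the function name `plateau_grain_sizes`, the 5% tolerance, and the timing values are assumptions and are not taken from Figures 8 or 9.

```python
from typing import Dict, List


def plateau_grain_sizes(times: Dict[int, float], tolerance: float = 0.05) -> List[int]:
    """Return the grain sizes whose time is within `tolerance` of the best time."""
    best = min(times.values())
    return sorted(g for g, t in times.items() if t <= best * (1.0 + tolerance))


# Hypothetical measurements (grain size -> seconds); NOT data from the paper.
measured = {25: 40.0, 100: 22.0, 500: 20.0, 1000: 20.4, 2000: 20.8, 25000: 35.0}
plateau = plateau_grain_sizes(measured)
print(plateau)
```

Any grain size in the returned list would be an acceptable choice under the stated tolerance; the very small and very large settings fall outside it, mirroring the high-overhead and low-parallelism extremes described above.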
4 Other Applications
Several other applications have already been developed as part of the ProperCAD project. They
include test pattern generation for sequential circuits [18], combinational logic synthesis [5] and
standard cell placement. Currently, all these applications have also been developed using the
CHARM environment. As soon as the ProperCAD C++ environment is ready, these applications
will all be reimplemented in that environment.

Table VIII: Execution times (in seconds) of the ProperTEST sequential test pattern generator
running ISCAS89 sequential benchmark circuits on the Sequent Symmetry.
Sequent Symmetry (Shared Memory MIMD)
Circuit  #PEs  Time/Flt  T.Gen.  F.Sim.  Overhead  Coverage  Efficiency  #Vectors  #Procs.
s386     1     15        1164.4  15.1    20.4      78.4      96.6        255       1986
         4     15        413.8   9.1     -         78.6      96.9        330       2374
         8     15        236.7   9.8     -         78.9      97.1        399       2529
         16    15        143.5   5.5     -         78.9      97.1        403       2621
s713     1     1         41.6    25.9    5.7       81.9      95.9        206       582
         4     1         12.7    11.8    -         81.9      96.0        246       582
         8     1         7.2     8.6     -         81.9      95.9        282       582
         16    1         4.6     5.8     -         81.9      95.7        300       582
s1196    1     1         46.2    72.0    9.0       99.6      99.8        369       1303
         4     1         12.5    30.0    -         99.8      100         384       1305
         8     1         6.7     19.2    -         99.8      100         398       1310
         16    1         3.9     11.7    -         99.8      100         412       1331
s1238    1     1         85.3    87.3    9.3       94.5      99.0        376       1356
         4     1         23.2    35.8    -         94.5      99.0        383       1356
         8     1         12.2    21.0    -         94.5      99.0        394       1356
         16    1         7.7     14.8    -         94.6      99.0        442       1356
s5378    1     1         1648.4  1078.7  462.27    66.1      67.2        769       4604
         4     1         425.7   320.9   -         65.7      66.8        769       4604
         8     1         239.3   245.7   -         65.9      67.0        1090      4604
         12    1         181.2   198.7   -         65.4      66.5        1141      4604

Table IX: Execution times (in seconds) of the ProperTEST sequential test pattern generator running
ISCAS89 sequential benchmark circuits on the Intel i860 hypercube.
Intel i860 hypercube (Message Passing MIMD)
Circuit  #PEs  Time/Flt  T.Gen.  F.Sim.  Overhead  Coverage  Efficiency  #Vectors  #Procs.
s386     1     15        184.4   2.7     1.8       81.8      100         330       1783
         2     15        87.0    1.9     0.7       81.8      100         306       1629
         4     15        49.0    1.6     0.3       81.8      100         370       1673
         8     15        28.8    1.3     1.8       81.8      100         418       1733
s713     1     2         27.0    3.7     1.5       81.9      98.8        217       751
         2     2         15.5    2.4     1.0       81.9      98.8        199       766
         4     2         8.8     1.2     1.7       81.9      98.8        213       761
         8     2         6.6     1.0     2.0       81.9      98.8        221       787
s1196    1     1         5.5     10.1    0.8       99.8      100         365       1243
         2     1         3.3     7.6     0.9       99.8      100         386       1243
         4     1         1.5     4.5     0.7       99.8      100         362       1243
         8     1         0.9     3.2     0.4       99.8      100         394       1243
s1238    1     1         11.7    12.3    1.3       94.7      100         383       1369
         2     1         6.3     8.7     1.0       94.7      100         385       1368
         4     1         4.1     5.4     0.8       94.7      100         387       1356
         8     1         2.2     3.7     0.6       94.8      100         406       1356
s5378    1     5         6016.5  184.8   140.9     73.4      75.3        985       13233
         2     5         3548.6  97.4    86.7      71.6      73.4        932       13727
         4     5         1748.8  59.3    51.3      72.5      74.3        1009      13511
         8     5         901.7   38.6    84.1      70.8      72.7        1228      13468
In Tables VII-X, we provide the performance of ProperTEST, the sequential test pattern gen-
erator based on the PODEM search algorithm on a variety of parallel MIMD machines. The
benchmark circuits used were the standard ISCAS 89 sequential circuits. For reasons of space,
results on a subset of the entire benchmark suite are presented.
In Tables XI-XIII we present the performance of ProperSYN, a portable combinational logic
synthesis algorithm that is based on the transduction method. The benchmark circuits used were
the standard MCNC combinational circuits. Once again, like the other CAD applications, the
programs run unchanged on all the target machines.
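As a worked reading of the synthesis tables, the speedup on P processors is the one-processor run time divided by the P-processor run time, and the parallel efficiency is that speedup divided by P. The sketch below applies these standard definitions to the run times reported for circuit sao2 on the Intel i860 hypercube in Table XII; the function names are illustrative, and the efficiency figure is derived here rather than reported in the paper.

```python
from typing import Dict


def speedup(t1: float, tp: float) -> float:
    """Speedup = one-processor run time / P-processor run time."""
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Parallel efficiency = speedup / number of processors."""
    return speedup(t1, tp) / p


# Run times in seconds for sao2 on the Intel i860 hypercube (Table XII).
sao2_times: Dict[int, float] = {1: 356.66, 2: 197.76, 4: 108.92, 8: 50.04}

t1 = sao2_times[1]
for p, tp in sorted(sao2_times.items()):
    print(f"{p} PEs: speedup {speedup(t1, tp):.2f}, efficiency {efficiency(t1, tp, p):.2f}")
```

Computed this way, the speedups agree with the Speedup column of Table XII to within rounding.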
Finally, in Tables XIV-XVI we present the performance of ProperPLACE, a portable parallel
algorithm for standard cell placement using simulated annealing. The parallel algorithm is built on
top of TimberWolf 6.0, one of the most widely used sequential programs for standard cell placement
based on simulated annealing. The results reported are the best for standard cell placement among
parallel algorithms that preserve both the quality of the results and yet obtain speedups on parallel

Table X: Execution times (in seconds) of the ProperTEST sequential test pattern generator on
ISCAS89 sequential benchmark circuits on the Encore Multimax.
Encore Multimax (Shared Memory MIMD)
Circuit  #PEs  Time/Flt  T.Gen.  F.Sim.  Overhead  Coverage  Efficiency  #Vectors  #Procs.
s386     1     15        740.8   11.5    17.7      81.2      99.5        316       2350
         2     15        388.8   7.9     11.9      80.5      98.7        322       2431
         4     15        230.3   5.4     5.7       81.0      99.0        375       2628
         8     15        138.8   4.9     1.6       80.5      98.7        370       2907
s713     1     2         53.9    17.0    9.2       85.7      97.2        206       813
         2     2         30.4    9.2     5.2       85.7      97.2        204       826
         4     2         22.5    7.6     1.0       85.7      97.2        241       910
         8     2         12.9    5.7     0.2       85.7      97.2        272       934
s1196    1     2         27.5    45.2    4.7       99.8      100         360       1243
         2     2         14.3    30.5    3.4       99.8      100         364       1243
         4     2         7.54    18.5    2.6       99.8      100         367       1243
         8     2         4.6     12.8    3.4       99.8      100         404       1243
s1238    1     2         56.2    56.2    5.8       94.7      99.7        375       1356
         2     2         31.7    39.6    0.7       94.6      99.6        392       1356
         4     2         17.1    22.7    1.6       94.5      99.6        377       1356
         8     2         9.2     14.8    1.4       94.6      99.6        414       1356
s5378    1     2         2879.8  913.0   771.7     68.8      70.3        985       13339
         2     2         1681.4  533.4   506.2     70.6      72.2        1057      13875
         4     2         884.1   283.4   256.9     70.5      72.0        998       14346
         8     2         479.9   217.1   143.5     68.8      70.2        1395      16030

Table XI: Performance of the ProperSYN combinatorial logic synthesis algorithm on MCNC
benchmark circuits on a network of SUN workstations.
         1 Processor          2 Processor          4 Processor
CKT      Run Time   Speedup   Run Time   Speedup   Run Time   Speedup
5xp1     102.99     1.0       55.49      1.86      31.97      3.22
b9       121.25     1.0       69.19      1.75      38.51      3.15
bw       468.19     1.0       264.80     1.76      172.58     2.71
f51m     180.97     1.0       118.07     1.53      72.24      2.51
misex1   41.68      1.0       22.66      1.84      13.92      2.99
misex2   109.00     1.0       60.83      1.79      35.85      3.04
rd73     387.74     1.0       251.32     1.54      129.42     3.00
rd84     4792.74    1.0       2476.32    1.94      1279.05    3.75
sao2     525.47     1.0       284.74     1.84      172.97     3.04
vg2      735.72     1.0       400.70     1.84      225.16     3.27
apex7    2272.41    1.0       1311.86    1.73      595.02     3.82
machines.
5 Summary
We have developed an environment for the portable object-oriented parallel execution of CAD
algorithms. The main objectives of this research have been to make automatic porting of parallel
software feasible and practical, and to exploit current and future advances in sequential CAD
algorithms. As mentioned in the introduction, the ProperCAD project (see Figure 1) has been
designed from its inception to be completed in two phases. In the first phase, we are designing
portable parallel algorithms for a large set of CAD applications using CHARM. The second phase
of the project will involve the design and implementation of a run-time support system for
portable parallel programming in C++. This system, although inspired by CHARM, will be tailored
specifically for CAD applications. This will make the programming environment truly
object-oriented and will support features like inheritance and classes. The ProperCAD
applications will then be rewritten and ported onto the new C++ platform. It should be noted
that the parallel algorithms in the ProperCAD project are being designed around existing
sequential algorithms and extensively reuse existing sequential code.
We have demonstrated the feasibility of this approach through several applications, namely, flat
circuit extraction, test generation for sequential circuits, combinational logic synthesis and stan-

Table XII: Performance of the ProperSYN combinatorial logic synthesis algorithm on MCNC
benchmark circuits on the Intel i860 hypercube.
         1 Processor          2 Processor          4 Processor          8 Processor
CKT      Run Time   Speedup   Run Time   Speedup   Run Time   Speedup   Run Time   Speedup
5xp1     60.24      1.0       32.67      1.84      18.61      3.24      12.80      4.70
b9       70.23      1.0       33.12      2.12      21.90      3.20      15.42      4.55
bw       217.54     1.0       122.54     1.77      77.62      2.80      42.10      5.17
f51m     89.92      1.0       57.55      1.56      37.22      2.42      21.11      4.26
misex1   21.43      1.0       18.41      1.17      13.45      1.60      11.73      1.80
misex2   64.68      1.0       37.86      1.71      22.83      2.83      11.55      5.60
rd73     233.63     1.0       107.39     2.17      66.15      3.53      29.13      8.02
rd84     2381.77    1.0       1190.08    2.00      563.25     4.22      308.66     7.72
sao2     356.66     1.0       197.76     1.80      108.92     3.27      50.04      7.13
vg2      390.08     1.0       224.59     1.74      112.09     3.48      53.00      7.36
apex7    1478.43    1.0       884.24     1.67      461.96     3.20      222.94     6.63
apex6    15418.68   1.0       7700.81    2.00      3555.27    4.34      1842.22    8.37
duke2    12190.12   1.0       6371.77    1.91      3248.30    3.75      1686.32    7.23
misex3c  15257.09   1.0       7842.69    1.95      4074.12    3.75      1907.11    7.98
Table XIII: Performance of the ProperSYN combinatorial logic synthesis algorithm on MCNC
benchmark circuits on the Encore Multimax.
         1 Processor          2 Processor          4 Processor          8 Processor
CKT      Run Time   Speedup   Run Time   Speedup   Run Time   Speedup   Run Time   Speedup
5xp1     177.62     1.0       122.52     1.45      57.20      3.10      35.30      5.02
b9       231.24     1.0       123.35     1.87      57.82      3.99      40.21      5.75
bw       755.39     1.0       409.75     1.84      198.64     3.80      108.21     6.98
f51m     296.07     1.0       184.32     1.60      101.76     2.91      69.84      4.24
misex1   70.56      1.0       40.27      1.75      22.98      3.07      11.18      6.31
misex2   212.96     1.0       108.11     1.97      60.70      3.51      36.15      5.90
rd73     769.28     1.0       410.88     1.87      207.46     3.54      110.92     6.93
rd84     9159.49    1.0       3998.18    2.29      2390.73    3.83      1378.20    6.64
sao2     1174.38    1.0       659.78     1.77      315.15     3.73      152.03     7.72
vg2      1373.67    1.0       822.33     1.67      392.75     3.50      188.59     7.28
apex7    4868.01    1.0       2383.75    2.04      1341.27    3.63      727.92     6.69
apex6    54062.65   1.0       26976.26   2.00      14271.42   3.79      7241.64    7.47
duke2    40138.20   1.0       25390.33   1.58      12142.03   3.30      -          -
misex3c  50236.75   1.0       26412.00   1.90      13721.43   3.66      7011.62    7.16

Table XIV: Performance of the ProperPLACE algorithm for standard cell placement on a network
of Sun workstations.
          1 PE              2 PEs             4 PEs             8 PEs
Circuits  Wire     Time     Wire     Time     Wire     Time     Wire     Time
          Length   (sec.)   Length   (sec.)   Length   (sec.)   Length   (sec.)
s298      32120    780      32274    458      32395    282      32938    194
s420      38451    814      38480    525      38905    270      39032    201
fract     22067    640      22708    426      22592    213      23050    152
primary   372561   914      373034   605      381830   351      390743   241
Table XV: Performance of the ProperPLACE algorithm for standard cell placement on the Encore
Multimax.
          1 PE              2 PEs             4 PEs             8 PEs
Circuits  Wire     Time     Wire     Time     Wire     Time     Wire     Time
          Length   (sec.)   Length   (sec.)   Length   (sec.)   Length   (sec.)
s298      32052    1538     32603    899      32842    480      33106    317
s420      38627    1678     39083    969      39130    559      39852    373
fract     21575    1419     21692    857      21854    489      22540    368
primary   375870   2054     376991   1194     380492   760      386403   503
dard cell placement. New algorithms for global routing, fault simulation and behavioral simulation
are currently under development. All the applications exhibit good speedups on shared memory
machines, including an Encore Multimax and a Sequent Symmetry, and on message passing machines,
including an NCUBE 2, an Intel i860 hypercube and a network of Sun workstations. This is
especially significant given that the applications were all executed unchanged on all the above machines.
When the ProperCAD environment is available on a new architecture, say the Intel Paragon
multiprocessor, these algorithms will not need to be rewritten, unlike most prior algorithms. It is
only necessary to port the underlying programming platform (which itself is largely portable with
the exception of a small machine specific component).
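The porting argument above can be pictured as application code written only against a narrow machine interface, so that a new architecture needs a new backend and nothing else. The sketch below is a toy illustration of that layering, not the CHARM or ProperCAD API: `MachineLayer`, `SingleProcessLayer`, and `application` are hypothetical names, and the single-process backend simply queues and delivers messages in order.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Callable, Deque, List, Tuple


class MachineLayer(ABC):
    """Hypothetical machine-specific component: the only part rewritten per port."""

    @abstractmethod
    def send(self, dest_pe: int, handler: Callable[[int], None], payload: int) -> None:
        """Deliver `payload` to `handler` on processor `dest_pe`."""

    @abstractmethod
    def run(self) -> None:
        """Process messages until none remain."""


class SingleProcessLayer(MachineLayer):
    """Toy backend: queues every message and delivers it within one process."""

    def __init__(self) -> None:
        self.queue: Deque[Tuple[int, Callable[[int], None], int]] = deque()

    def send(self, dest_pe: int, handler: Callable[[int], None], payload: int) -> None:
        self.queue.append((dest_pe, handler, payload))

    def run(self) -> None:
        while self.queue:
            _dest, handler, payload = self.queue.popleft()
            handler(payload)


def application(layer: MachineLayer, work: List[int]) -> List[int]:
    """Application code that depends only on MachineLayer, never on a concrete backend."""
    results: List[int] = []
    for pe, item in enumerate(work):
        layer.send(pe, lambda x: results.append(x * x), item)
    layer.run()
    return results


print(application(SingleProcessLayer(), [1, 2, 3]))
```

Under this assumption, supporting another machine would mean adding one more `MachineLayer` subclass while `application` stays untouched.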
All these applications are being developed on a parallel object-oriented platform, using a coarse-
grained data-flow style of execution. In all cases, the algorithms are being interfaced with unipro-
cessor implementations of the respective applications. In circuit extraction, for example, sequential
modules were used to perform local geometric extraction and device and parameter extraction.

Table XVI: Performance of the ProperPLACE algorithm for standard cell placement on the Intel
i860 hypercube.
          1 PE              2 PEs             4 PEs             8 PEs
Circuits  Wire     Time     Wire     Time     Wire     Time     Wire     Time
          Length   (sec.)   Length   (sec.)   Length   (sec.)   Length   (sec.)
s298      32512    191      32603    120      32969    68       33106    47
s420      38066    288      37960    178      38943    103      39836    73
fract     22717    534      22904    322      23010    190      23109    137
primary   373905   769      374042   447      381839   307      387592   198
The parallel algorithm was primarily concerned with the decomposition of the circuit into regions
that could be processed in parallel, and the merging of these regions together. In cell placement,
we have interfaced the parallel algorithm with TimberWolf 6.0, a state-of-the-art widely used cell
placement program.
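The division of labor described here, with a parallel layer that decomposes and merges and a sequential module that does the per-region work, can be sketched as follows. This is a simplified illustration under stated assumptions, not the ProperEXT algorithm: the rectangle representation, the halving rule in `decompose`, and the stand-in sequential module `total_area` are all hypothetical, and the per-region calls are run in a plain loop where a parallel run-time would distribute them.

```python
from typing import Callable, List, Sequence, Tuple

Rect = Tuple[int, int, int, int]  # hypothetical layout rectangle: (x, y, width, height)


def decompose(rects: Sequence[Rect], grain_size: int) -> List[List[Rect]]:
    """Recursively halve the rectangles (ordered by x) until each region
    holds at most `grain_size` rectangles."""
    if grain_size < 1:
        raise ValueError("grain_size must be at least 1")
    if len(rects) <= grain_size:
        return [list(rects)]
    ordered = sorted(rects)
    mid = len(ordered) // 2
    return decompose(ordered[:mid], grain_size) + decompose(ordered[mid:], grain_size)


def run_decomposed(
    rects: Sequence[Rect],
    grain_size: int,
    sequential_module: Callable[[List[Rect]], int],
    merge: Callable[[int, int], int],
) -> int:
    """Apply an unmodified sequential module to each region, then merge the results."""
    regions = decompose(rects, grain_size)
    # Each call below is independent, so a parallel run-time could schedule it anywhere.
    partial = [sequential_module(region) for region in regions]
    result = partial[0]
    for part in partial[1:]:
        result = merge(result, part)
    return result


def total_area(region: List[Rect]) -> int:
    """Stand-in for a sequential extraction routine: sums rectangle areas."""
    return sum(w * h for _x, _y, w, h in region)


layout = [(x, 0, 1, 2) for x in range(10)]
regions = decompose(layout, grain_size=3)
area = run_decomposed(layout, 3, total_area, lambda a, b: a + b)
print(len(regions), area)
```

The sequential module is passed in as an ordinary function, which reflects the point that an improved sequential routine can be substituted without changing the decomposition or merge logic.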
We believe that this multilevel separation of a parallel run-time system, a parallel library, a
parallel algorithm and a sequential algorithm with well-defined interfaces between them, as outlined
in Figure 1, is the most efficient way to develop parallel CAD algorithms. This permits the experts
in each of these different areas to concentrate on their fortes. An environment such as ProperCAD
is best written by an expert in parallel programming who has intimate knowledge of the target
machines. The parallel algorithms can then be developed with the constraint that the algorithms
are expressed using the ProperCAD environment. Finally, experts in the areas of circuit extraction,
test generation, logic synthesis, cell placement, etc. should be given the responsibility of
developing efficient sequential algorithms for their respective problems. We constrain them to
express their algorithms in a modular fashion: a desirable requirement for program design and
maintenance in any case. The ProperCAD environment serves to bridge the effort in these various
areas of specialization.
Work is also under way to expand the set of target architectures for the ProperCAD
environment. We are awaiting access to parallel machines like the Intel Paragon and the Thinking
Machines CM-5 to initiate the port to these machines.

[1] Agha, G.A. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press,
1986.
[2] Banerjee P., Jones M.H., Sargent J. Parallel Simulated Annealing Algorithms for Standard
Cell Placement on Hypercube Multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 1:91-106, January 1990.
[3] Belkhale, K.P, Banerjee, P. PACE: A Parallel VLSI Circuit Extractor on the Intel Hypercube
Multiprocessor. In Proceedings of the International Conference on Computer Aided Design,
November 1988.
[4] Belkhale, K.P, Banerjee, P. PACE2: An Improved Parallel VLSI Extractor with Parameter
Extraction. In Proceedings of the International Conference on Computer Aided Design, pages
526-530, November 1989.
[5] De, K., Ramkumar, B., Banerjee P. ProperSYN: A Portable Parallel Algorithm for Logic Syn-
thesis. In Proceedings of the International Conference on Computer-Aided Design, November
1992.
[6] Fenton, W., Ramkumar, B., Saletore, V.A., Sinha A.B., Kalé, L.V. Supporting Machine
Independent Programming on Diverse Parallel Architectures. In International Conference on
Parallel Processing, August 1991.
[7] Fitzpatrick, D.T. MEXTRA: A Manhattan Circuit Extractor. Technical Report Electronics
Research Lab M82/42, University of California at Berkeley, January 1982.
[8] Gupta, A. ACE: A Circuit Extractor. In Design Automation Conference, pages 721-725, June
1983.
[9] Harrison, D.S., Moore, P., Spickelmier, R.L., Newton, A.R. Data Management and Graphics
Editing in the Berkeley Design Environment. In Proceedings of the International Conference
on Computer-Aided Design, November 1986.

[10] Hon, R., Gupta, A. HEXT: A Hierarchical Circuit Extractor. Computer Science Press, 1983.
[11] Jayaraman, R., Rutenbar, R.A. Floorplanning by Annealing on a Hypercube Multiprocessor.
In Proceedings of the International Conference on Computer Aided Design, pages 346-349,
November 1987.
[12] Kale, L.V. The Chare Kernel Parallel Programming System. In International Conference on
Parallel Processing, August 1990.
[13] Kravitz, S.A., Rutenbar, R.A. Multiprocessor-Based Placement by Simulated Annealing. In
Proceedings of the 23rd Design Automation Conference, pages 567-573, June 1986.
[14] Levitin S. MACE: A Multiprocessor Approach to Circuit Extraction. Master's thesis, MIT,
Cambridge, Mass., June 1986.
[15] McCormick, S.P. EXCL: A Circuit Extractor of IC Designs. In Design Automation Conference,
pages 624-628, June 1984.
[16] Niermann, T.M., Patel, J.H. HITEC: A Test Generation Package for Sequential Circuits. In
Proceedings of the European Conference on Design Automation, pages 214-218, February 1991.
[17] Patil, S., Banerjee P., Patel, J.H. Parallel Test Generation for Sequential Circuits on General
Purpose Multiprocessors. In Proceedings of the 28th Design Automation Conference, June
1991.
[18] Ramkumar, B., Banerjee P. Portable Parallel Test Generation for Sequential Circuits. In
Proceedings of the International Conference on Computer-Aided Design, November 1992.
[19] Ravikumar, C.P., Sastry S. Parallel Placement on Hypercube Architectures. In International
Conference on Parallel Processing, pages III: 97-100, August 1989.
[20] Rose, J.S., Snelgrove W.M., Vranesic, Z.G. Parallel Cell Placement Algorithms with Quality
Equivalent to Simulated Annealing. IEEE Transactions on Computer-Aided Design, 7, no.
3:387-396, March 1988.

[21] Scott, W.S., Ousterhout, J.K. Magic's Circuit Extractor. In IEEE Design and Test, pages
24-34, February 1986.
[22] Sechen, C., Sangiovanni-Vincentelli A.L. The TimberWolf Placement and Routing Package.
IEEE Journal of Solid-State Circuits, 23/2:410, 1988.
[23] Su, S-L., Rao, V.B., Trick, T.N. HPEX: A Hierarchical Parasitic Circuit Extractor. In Design
Automation Conference, pages 566-569, June 1987.
[24] Suaris P., Kedem, G. A Quadrisection-Based Combined Place and Route Scheme for Standard
Cells. IEEE Transactions on Circuits and Systems, 8:234-244, March 1989.
[25] Tonkin, B.A. Circuit Extraction on a Message Passing Multiprocessor. In Design Automation
Conference, pages 260-265, June 1990.
[26] Wegner, P. Conceptual Evolution of Object-Oriented Programming. In Object-Oriented Pro-
gramming Systems, Languages and Applications, 1989. keynote talk.

