A parallel algorithm for multi-level logic synthesis using the transduction method by Lim, Chieng-Fai
NASA-CR-189925
19920010819
A PARALLEL ALGORITHM FOR MULTI-LEVEL
LOGIC SYNTHESIS USING THE TRANSDUCTION METHOD
BY
CHIENG-FAI LIM
B.S., University of Illinois, 1990
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 1991
Urbana, Illinois
https://ntrs.nasa.gov/search.jsp?R=19920010819 2020-03-17T13:23:41+00:00Z
92N20061*# ISSUE 11 PAGE 1855 CATEGORY 61
RPT#: NASA-CR-189925 NAS 1.26:189925 CNT#: NAG1-613 91/00/00 80 PAGES
UNCLASSIFIED DOCUMENT
UTTL: A parallel algorithm for multi-level logic synthesis using the transduction method
TLSP: M.S. Thesis
AUTH: LIM, CHIENG-FAI
CORP: Illinois Univ., Urbana-Champaign. CSS: (Coordinated Science Lab.)
SAP: Avail: NTIS HC A05/MF A01
CIO: UNITED STATES
MAJS: /*ALGORITHMS/*COMPUTER SYSTEMS PERFORMANCE/*MEMORY (COMPUTERS)/*MULTIPROCESSING (COMPUTERS)/*OPTIMIZATION/*PARALLEL PROCESSING (COMPUTERS)
MINS: / BALANCING/ COMPUTER AIDED DESIGN/ DYNAMIC LOADS/ LOGIC CIRCUITS/ PARTITIONS (MATHEMATICS)/ SUBSTITUTES/ TRANSFERRING
ABA: Author
ABSTRACT

The Transduction Method has been shown to be a powerful tool in the optimization of multi-level networks. Many tools such as the SYLON synthesis system [X90], [CM89], [LM90] have been developed based on this method. In this paper, we present a parallel implementation of SYLON-XTRANS [XM89] on an eight-processor Encore Multimax shared-memory multiprocessor. It minimizes multi-level networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, we are able to achieve good speedups and high processor efficiencies without sacrificing the resulting circuit quality.
ACKNOWLEDGEMENTS

I am most grateful for the constant advice and support of my advisor, Professor Prithviraj Banerjee, who has made the completion of this thesis possible.

I would like to thank Professor Saburo Muroga and his students, who have shared their valuable experiences with me. I would also like to thank the students and staff members of the Center for Reliable and High-Performance Computing, who have been a great source of help. Specifically, I am thankful to Kaushik De for his ideas and assistance.

Finally, I would like to thank my fellow graduate students for making my stay in this country a precious experience.
TABLE OF CONTENTS

CHAPTER 1. INTRODUCTION
    1.1. Motivation for Parallel CAD Algorithms
    1.2. Two-level and Multi-level Logic Synthesis
    1.3. Related Work on Parallel Logic Synthesis
        1.3.1. Parallel ESPRESSO
        1.3.2. Parallel Kernel Extraction
        1.3.3. Parallel Tautology Checking
    1.4. Thesis Outline
CHAPTER 2. REVIEW OF THE TRANSDUCTION METHOD
    2.1. Terminology and Notations
    2.2. Maximum Set of Permissible Functions
    2.3. Compatible Set of Permissible Functions
    2.4. Pruning
    2.5. Gate Substitution
    2.6. Gate Merging
    2.7. Generalized Gate Substitution
    2.8. Gate Input Reduction
CHAPTER 3. PARALLEL IMPLEMENTATION OF SYLON-XTRANS
    3.1. General Overview
    3.2. Binary Decision Diagrams
    3.3. Partitioning Algorithm
    3.4. Program Model
    3.5. Discussion of Number of Partitions
    3.6. Parallel Evaluation of Functions and CSPFs of Gates
    3.7. Parallel Pruning
    3.8. Parallel Gate Substitution
    3.9. Parallel Gate Merging
    3.10. Parallel Generalized Gate Substitution/Gate Input Reduction
    3.11. Ordering of Search-Spaces
CHAPTER 4. EXPERIMENTAL RESULTS
    4.1. Overview of Experiments
    4.2. Circuit Degradation with Number of Processors
    4.3. Efficiency of Intra-Partition Load Balancing
    4.4. Efficiency of Inter-Partition Load Balancing
    4.5. Comparison among MIS 2.1, SYLON-XTRANS, and PTRANS
CHAPTER 5. CONCLUSIONS
REFERENCES
CHAPTER 1.
INTRODUCTION

1.1. Motivation for Parallel CAD Algorithms

Computer Aided Design (CAD) algorithms always face a conflict between the need to produce superior quality results and the need to shorten the long processing time they require. Many problems in VLSI CAD are NP-complete [GJ79], hence determining the optimum solutions to these problems can take extraordinary amounts of CPU time. Heuristics are therefore used to reduce their complexity so that results can be delivered within a reasonable amount of time.

A simple way to reduce the runtimes of CAD tools is to execute them on faster uniprocessor machines. However, this is no longer feasible as we are approaching an upper bound on the speed of the processors that can be made with current technology. This problem has led to more attention being focused on parallel machines.

With today's increasing availability and performance of parallel machines, a new direction has been created for parallel processing of CAD algorithms. Many of the CAD applications have a high degree of inherent parallelism. There is a bright future in the integration of new parallel programming paradigms, parallel architectures, and CAD algorithms so as to provide users with a shorter turnaround time.
1.2. Two-level and Multi-level Logic Synthesis

Automation of logic synthesis tools is becoming increasingly important as the number of logic gates in VLSI chips gets larger. In the past, many studies were devoted to realizing combinational logic functions with 2-level networks using PLAs. Many efficient algorithms such as ESPRESSO [B84] and PMIN [C87] have been developed.

However, many combinational logic functions can be more efficiently realized with multi-level networks in terms of compactness, cost, and speed. Many tools have also been developed for multi-level logic synthesis. SOCRATES [GBGH86] and MIS [BRSW87] are among them. In the early 70's, the Transduction Method was developed at the University of Illinois. It involves the concept of permissible functions, which are also frequently regarded as observability don't-cares. Based on this method, SYLON-XTRANS [X90], SYLON-DREAM [CM89], and SYLON-REDUCE [LM90] have been developed. They have shown that the Transduction Method is a powerful tool in the optimization of multi-level circuits.
1.3. Related Work on Parallel Logic Synthesis

With the increasing accessibility of parallel machines, there have been many studies on parallel CAD algorithms. This section reviews some of this work, including Galivanche's parallel ESPRESSO [G86], Zipfel's parallel kernel extractor [Z91] and Hatchel's parallel tautology checking [HMJ88].
1.3.1. Parallel ESPRESSO

In ESPRESSO, there are three main procedures called Complement, Expand, and Reduce. This section describes their parallelization as presented by Galivanche [G86].

To compute the complement of a given function, the Complement procedure recursively decomposes it into two sub-functions along a splitting variable until a single term is reached. In the parallel version, a new process is created at each level of the recursion so that the two sub-functions can be handled simultaneously. This creation of processes stops when the number of processes created equals the number of processors available.

The Expand procedure generates a limited set of prime cubes of a given function. The set of cubes under consideration is maintained in a list. Each cube is expanded with the objective of covering other cubes in the list. In the parallel algorithm, cubes are expanded in parallel. However, duplicated cubes can be created. To minimize this redundant work, periodic checks are made to halt duplicated work. The procedure also terminates with a final clean-up phase to remove the duplicated cubes.

The third procedure, Reduce, tries to obtain a minimal number of cubes covering a given function so that any further reduction would change the function. Although most of the cubes can be reduced simultaneously, a process could be reducing a cube $C_i$ thinking that it is covered by another cube $C_j$ without knowing that $C_j$ is also currently being reduced by another process. The solution to this problem is to assume that all other cubes currently being reduced do not exist. Although this gives correct outputs, it affects the quality of the final results.

Galivanche achieved linear speedup in completion time with slight degradations in the quality of the generated PLAs with these algorithms.
1.3.2. Parallel Kernel Extraction

Zipfel [Z91] has implemented a parallel version of the kernel extraction procedure used in MIS during algebraic factorization [BRSW87].

First, the kernel-cube matrix is built in parallel. This is also executed in parallel with the formation of the Boolean representation of a node since they are independent. After building the kernel-cube matrix, the next phase performs the actual extraction. In parallel, each process creates its own local partition of the kernel-cube matrix and uses it to perform any extraction from the global network. These partitions are generated in parallel as well.

With its own partition of the kernel-cube matrix, a process then proceeds to look for kernel intersections that are extractable. If the value of a kernel intersection is greater than zero, a new node is then created and exclusively substituted into the Boolean network. When a process has exhausted its partition, it waits until all of the other processes have exhausted theirs before repeating the kernel-cube matrix-building algorithm again. With this parallel algorithm, Zipfel was able to obtain slight improvements in the minimality of the circuits tested. However, the low speedups he achieved showed that MIS is very difficult to parallelize.
1.3.3. Parallel Tautology Checking

Hatchel's parallel tautology checking algorithm [HJM88] uses the parallelism of a serial divide-and-conquer algorithm. The serial algorithm recursively divides the function into smaller partitions until the function to be checked is sufficiently small. In the parallel version, a process is created for each sub-function if a partition is found to be complicated enough. Each process waits for all of its children (if any) to report back before terminating. With this tree-structured computation, good speedup has been achieved.
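The serial divide-and-conquer idea can be illustrated with a minimal Python sketch. This is not the algorithm of [HJM88]; the truth-table string representation, the function name `is_tautology`, and the `small` threshold are assumptions made only for illustration. Splitting the table in half corresponds to cofactoring on one input variable.

```python
def is_tautology(table: str, small: int = 2) -> bool:
    """Return True if every entry of the truth table is '1'.

    `table` is a string of '0'/'1' characters of length 2**n (assumed
    representation).  Each half of the string is one cofactor with
    respect to a single input variable.
    """
    if len(table) <= small:
        # The function is "sufficiently small": check it directly.
        return all(bit == "1" for bit in table)
    half = len(table) // 2
    # In a parallel version, each half could be handed to a child process,
    # with the parent waiting for both children to report back.
    return is_tautology(table[:half], small) and is_tautology(table[half:], small)


print(is_tautology("11111111"))
print(is_tautology("11101111"))
```

A parallel variant would only spawn a child for a half that is large enough to be worth the process-creation cost, which is the role played by the "complicated enough" test in the text.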

1.4. Thesis Outline

This thesis describes a parallel implementation of the Transduction Method of multi-level logic synthesis on a shared-memory machine, the Encore Multimax computer. The implementation, called PTRANS (Parallel TRANSduction), is based on SYLON-XTRANS [X90], [XM89].

Scalability has been a problem in the parallelization process. In order to maintain high processor utilization when the number of processors increases, the circuit to be minimized has to be large. Unfortunately, the amount of physical memory available places an upper bound on the size of the circuit to be minimized.

To solve this problem, large circuits have to be partitioned. The partitioning algorithm tries to retain the don't-cares within a partition. Clearly, the minimization of the partitions can be performed in parallel. However, although two partitions may be of the same size in terms of the number of gates and connections, the time required to minimize each of them could be different due to differences in their functional complexities.

This thesis describes how the partitions can be minimized simultaneously with both inter- and intra-partition parallelism being handled by dynamic load balancing.

The organization of this report is as follows. In Chapter 2, the basic concepts of the Transduction Method are given. It also provides some background information on the permissible functions and the transformation and reduction procedures found in SYLON-XTRANS. The parallelization of these procedures and the implementation of dynamic load balancing are presented in Chapter 3. In Chapter 4, some experimental results achieved with PTRANS are reported, followed by a conclusion in Chapter 5.
CHAPTER 2.
REVIEW OF THE TRANSDUCTION METHOD

SYLON-XTRANS is an extension to the original Transduction Method in [MK89] so that it can minimize multi-level circuits consisting of AND, OR, NAND and NOT gates in addition to NOR gates. It contains four main procedures, namely, pruning, gate substitution, gate merging, and combined generalized gate substitution/gate input reduction. Each of these procedures is basically an iterative improvement algorithm that keeps transforming and reducing a circuit until no further improvement can be made. The transformations can be applied to a circuit in any order. Formal proofs of the transformations are omitted in this thesis for simplicity. They can be found in [XM89] and [X90]. For ease of translation into binary decision diagrams (BDDs), which are what PTRANS actually implements, the transformations are explained using the vector notation.

2.1. Terminology and Notations

In this thesis, we will consider only cycle-free multi-level circuits consisting of AND, OR, NAND, NOR and NOT gates. Let $n$ be the number of primary inputs, $m$ be the number of primary outputs, and $g$ be the number of gates in a multi-level circuit. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of input variables and $Z = \{z_1, z_2, \ldots, z_m\}$ be the set of output variables of the circuit. In addition, let $V = \{v_1, v_2, \ldots, v_g\}$ be the set of gates in the circuit, and $C = \{c_{ij}\}$ be the set of connections, where $c_{ij}$ connects the output of gate $v_i$ to an input of gate $v_j$.

A circuit can be viewed as a graph consisting of gates arranged in levels. The level of a gate in a circuit can be defined either from the primary inputs or the primary outputs. Formally, the level of a gate with respect to the primary inputs is defined as:
1) 0 if the gate is a primary input, or
2) 1 + the maximum level among its immediate predecessors.

The level of a gate with respect to the primary outputs is similarly defined as:
1) 0 if the gate is a primary output, or
2) 1 + the maximum level among its immediate successors.

The levelizing procedure for a circuit can be found in [PBP89]. An example of a circuit in which the gates are arranged according to their levels is shown in Figure 2.1.

Figure 2.1. An example of a levelized circuit.

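The recursive level definition above can be sketched in a few lines of Python. This is only an illustration and not the levelizing procedure of [PBP89]; the dictionary-based netlist and the name `levels_from_inputs` are assumptions.

```python
from functools import lru_cache


def levels_from_inputs(preds):
    """Level of every node with respect to the primary inputs.

    `preds` (assumed representation) maps each node name to the list of
    its immediate predecessors; primary inputs map to an empty list.
    """
    @lru_cache(maxsize=None)
    def level(node):
        if not preds[node]:
            return 0                      # rule 1: primary input
        # rule 2: one more than the deepest immediate predecessor
        return 1 + max(level(p) for p in preds[node])

    return {node: level(node) for node in preds}


# Hypothetical four-node circuit used only to exercise the function.
example = {"x1": [], "x2": [], "v1": ["x1", "x2"], "v2": ["v1", "x2"]}
print(levels_from_inputs(example))
```

Passing a map from each node to its immediate successors instead (with primary outputs mapped to an empty list) gives the levels with respect to the primary outputs, since the two definitions are symmetric.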
A gate $v_i$ is an immediate predecessor of $v_j$ if there exists a connection $c_{ij}$. Conversely, $v_j$ is an immediate successor of $v_i$ if $c_{ij}$ exists. Let $IP(v_i)$ and $IS(v_i)$ be the sets of all immediate predecessors and immediate successors of the gate $v_i$ respectively. When there is a sequence of gates $v_{k_1}, v_{k_2}, \ldots, v_{k_t}$ such that $v_{k_{b+1}} \in IS(v_{k_b})$ for all $b = 1, 2, \ldots, t-1$, then $v_{k_t}$ is a successor of $v_{k_1}$. Similarly, $v_{k_1}$ is a predecessor of $v_{k_t}$. Let $P(v_i)$ and $S(v_i)$ denote the sets of predecessors and successors of the gate $v_i$ respectively. The gate $v_i$ is said to have a reconvergent fanout (or to be reconvergent) if there exist two distinct gates $v_{k_1}, v_{k_2} \in IS(v_i)$ such that $S(v_{k_1}) \cap S(v_{k_2}) \neq \emptyset$.
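The definitions of $S(v_i)$ and of a reconvergent gate translate directly into a small graph search. The following is a minimal sketch that follows the definition literally; the successor-map representation and the function names are assumptions, not part of SYLON-XTRANS or PTRANS.

```python
def successors(succ, v):
    """S(v): all gates reachable from v through one or more connections.

    `succ` (assumed representation) maps a gate to its immediate successors.
    """
    seen, stack = set(), list(succ.get(v, []))
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(succ.get(u, []))
    return seen


def is_reconvergent(succ, v):
    """True if two distinct immediate successors of v share a successor."""
    imm = list(dict.fromkeys(succ.get(v, [])))   # distinct, order-preserving
    for a in range(len(imm)):
        for b in range(a + 1, len(imm)):
            if successors(succ, imm[a]) & successors(succ, imm[b]):
                return True
    return False


# Hypothetical circuits: v fans out to a and b, which meet again at c.
print(is_reconvergent({"v": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}, "v"))
print(is_reconvergent({"v": ["a", "b"], "a": [], "b": []}, "v"))
```

The pairwise check is quadratic in the fanout of the gate, which is acceptable for a sketch; a production implementation would likely cache the successor sets.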
A function realized at a gate is the set of values output by the gate in a circuit for all combinations of the input variables. This is also very frequently referred to as the function at the gate for short. The function at a gate $v_i$, $f(v_i)$, can be expressed as a vector of Boolean values. For example, if $n = 3$ and $v_i$ is an AND gate with $x_1$, $x_2$ and $x_3$ as its inputs, where $x_1 = (01010101)$, $x_2 = (00110011)$, and $x_3 = (00001111)$, then $f(v_i) = (00000001)$. Also, if $G$ is a Boolean vector, let $G^{(d)}$ be the $d$th value in $G$. It does not matter if the first value is the leftmost or rightmost bit of a vector as long as this remains consistent. The value of $d$ can range from 1 to $2^n$ inclusive. This is a more convenient way of representing the truth table. In the Transduction Method, connections are often treated as gates. Hence, the function at a gate is also extended to cover that of a connection, which is defined by $f(c_{ij}) = f(v_i)$.
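The vector notation can be reproduced with a short Python sketch that evaluates a simple gate position by position. The string representation of the vectors and the name `gate_function` are assumptions made for illustration; PTRANS itself uses BDDs rather than explicit vectors.

```python
def gate_function(kind, inputs):
    """Evaluate a simple gate bitwise over equal-length '0'/'1' vectors."""
    ops = {
        "AND": lambda bits: all(bits),
        "OR": lambda bits: any(bits),
        "NAND": lambda bits: not all(bits),
        "NOR": lambda bits: not any(bits),
        "NOT": lambda bits: not bits[0],
    }
    # zip(*inputs) yields, for each position d, the d-th value of every input.
    return "".join(
        "1" if ops[kind]([b == "1" for b in column]) else "0"
        for column in zip(*inputs)
    )


x1, x2, x3 = "01010101", "00110011", "00001111"
print(gate_function("AND", [x1, x2, x3]))  # 00000001
```

The printed vector agrees with the value of $f(v_i)$ given in the example above.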
The function at a gate $v_i$ can sometimes be expressed not only in terms of the input variables but also in terms of the function at some other gate $v_k$ in the circuit. This is denoted by $f(v_i \mid v_k)$ and is called the function at $v_i$ with respect to $v_k$. In this case, the gate $v_k$ is treated just as if it were an input variable, ignoring the functions at its input connections.

Very frequently, the function at a gate can be changed without affecting the functions at the primary outputs. A permissible function at a gate is a function that the output of the gate can take without changing the primary outputs. For example, in Figure 2.2, for $n = 3$, $x_1 = (01010101)$, $x_2 = (00110011)$, $x_3 = (00001111)$, $f(v_1) = (11101110)$ and $z_1 = f(v_2) = (10111011)$. However, if $f(v_1)$ is changed to $(01101110)$, $z_1$ is still unchanged. Hence, $(01101110)$ is a permissible function of $v_1$.

The vector $(01100110)$ is another permissible function of $v_1$. To represent these two permissible functions collectively, a don't-care value '*' is used. This is used to mean either a '0' or a '1' value. Hence, $(0110{*}110)$ represents both $(01101110)$ and $(01100110)$. A collection of permissible functions is known as a set of permissible functions (SPF), of which two special forms are the maximum set of permissible functions and the compatible set of permissible functions. These are explained in greater detail in Sections 2.2 and 2.3.

Figure 2.2. An example of a permissible function.

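The don't-care notation can be made concrete with a minimal Python sketch. The helper names `covers` and `expand` and the string encoding of vectors are assumptions introduced only for illustration.

```python
from itertools import product


def covers(spf, f):
    """True if the fully specified vector f is represented by the SPF vector."""
    return len(spf) == len(f) and all(s in ("*", b) for s, b in zip(spf, f))


def expand(spf):
    """List every fully specified vector that an SPF vector stands for."""
    choices = [("0", "1") if s == "*" else (s,) for s in spf]
    return ["".join(bits) for bits in product(*choices)]


print(expand("0110*110"))              # ['01100110', '01101110']
print(covers("0110*110", "01101110"))  # True
```

The expansion lists exactly the two permissible functions of $v_1$ quoted in the text; an SPF vector with $k$ don't-care positions stands for $2^k$ functions, which is why the sets are manipulated in the compact form and never enumerated in practice.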
2.2. Maximum Set of Permissible Functions

As the name suggests, the maximum set of permissible functions (MSPF) of a gate in a circuit is the set that contains all possible permissible functions of the gate. [MK89] shows how the MSPFs for gates and connections in a multi-level circuit containing only NOR gates are calculated. [X90] extends this to OR, AND, NAND and NOT gates.

In [X90], the methods of calculating MSPFs are described using the on-set/off-set notation since it uses the sum-of-products (SOP) form to represent Boolean and permissible functions. However, PTRANS uses binary decision diagrams (BDDs) [B86]. As it is convenient to translate bit vectors into BDDs, the methods of calculating MSPFs and CSPFs are shown in this thesis using the vector notation instead. This is similar to that used in [MF89]. Section 3.2 shows how the vector notation can be translated into the BDD representation.

Before formally describing the methods of computing MSPFs, an example is given here. From Figure 2.2, we have $x_1 = (01010101)$, $f(v_1) = (11101110)$, and $f(v_2) = (10111011)$. Since the output $z_1 = f(v_2)$ must remain constant, $\mathrm{MSPF}(v_2) = f(v_2) = (10111011)$. Let the first bits of the vectors be the leftmost bits. Considering these bits, $x_1 = 0$ and $f(v_2) = 1$. Since $v_2$ is a NAND gate and $x_1$ is 0, $f(v_2)$ is always 1 regardless of the value of $f(v_1)$. Hence, the first bit of the MSPF of $v_1$ is *. Similarly, the rest of the MSPF bits of $v_1$ can be computed, and this vector is found to be $({*}1{*}0{*}1{*}0)$.

The ways of computing the MSPFs of gates and of connections are different. To show how the MSPF of a connection $c_{ij}$ is computed, consider a portion of a circuit which contains $c_{ij}$ as shown in Figure 2.3.

Suppose the functions at all of the connections $c_{xj}$ for $1 \le x \le k$ and at $v_j$ are known. Let the MSPF of $v_j$ be $\mathrm{MSPF}(v_j)$ and suppose that it is known too. If $v_j$ is a NOR gate, the $d$th bit of the vector $\mathrm{MSPF}(c_{ij})$, $\mathrm{MSPF}^{(d)}(c_{ij})$, is then given by:

$$\mathrm{MSPF}^{(d)}(c_{ij}) = F^{(d)} \mathbin{\#_{\mathrm{NOR}}} \mathrm{MSPF}^{(d)}(v_j) \qquad \text{(E2.1)}$$

where the operator $\#_{\mathrm{NOR}}$ is defined in Table 2.1 and $F = \bigcup_{1 \le x \le k,\ x \ne i} f(c_{xj})$, $\cup$ being the normal Boolean OR operator. Similarly, for the cases in which $v_j$ is an OR, AND, or NAND gate, $\mathrm{MSPF}^{(d)}(c_{ij})$ is given by equations E2.2, E2.3 and E2.4 respectively. The vector $G$ in E2.3 and E2.4 is $\bigcap_{1 \le x \le k,\ x \ne i} f(c_{xj})$, where $\cap$ is the Boolean AND operator, and the operators $\#_{\mathrm{OR}}$, $\#_{\mathrm{NAND}}$ and $\#_{\mathrm{AND}}$ are given in Tables 2.2, 2.3 and 2.4 respectively. The '-' sign in these tables means that those situations will never be encountered.

$$\mathrm{MSPF}^{(d)}(c_{ij}) = F^{(d)} \mathbin{\#_{\mathrm{OR}}} \mathrm{MSPF}^{(d)}(v_j) \qquad \text{(E2.2)}$$

$$\mathrm{MSPF}^{(d)}(c_{ij}) = G^{(d)} \mathbin{\#_{\mathrm{AND}}} \mathrm{MSPF}^{(d)}(v_j) \qquad \text{(E2.3)}$$

$$\mathrm{MSPF}^{(d)}(c_{ij}) = G^{(d)} \mathbin{\#_{\mathrm{NAND}}} \mathrm{MSPF}^{(d)}(v_j) \qquad \text{(E2.4)}$$
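Equations E2.1 through E2.4 can be sketched in Python as follows. The four operators are derived here from the behaviour of the gates (a controlling value on the other inputs makes the connection a don't-care) rather than copied from Tables 2.1 through 2.4, so this is an illustration under that assumption; the function names and the string encoding are hypothetical.

```python
def conn_spf_bit(kind, other, m):
    """One bit of the MSPF of an input connection of gate v_j.

    kind  : "NOR", "OR", "AND" or "NAND" (type of v_j)
    other : F(d) for NOR/OR (OR of the other inputs) or
            G(d) for AND/NAND (AND of the other inputs), '0' or '1'
    m     : MSPF(d)(v_j), one of '0', '1', '*'
    """
    if m == "*":
        return "*"
    # Value that, on any single input, fixes the gate output by itself.
    controlling = "1" if kind in ("NOR", "OR") else "0"
    non_controlling = "0" if controlling == "1" else "1"
    # Gate output whenever some input carries the controlling value.
    forced_out = {"NOR": "0", "OR": "1", "AND": "0", "NAND": "1"}[kind]
    if other == controlling:
        if m != forced_out:
            raise ValueError("situation marked '-' in the tables")
        return "*"            # the other inputs already fix the output
    # The other inputs are non-controlling: this connection decides the output.
    return controlling if m == forced_out else non_controlling


def conn_spf(kind, other_vec, m_vec):
    """Apply the bitwise operator over whole vectors."""
    return "".join(conn_spf_bit(kind, o, m) for o, m in zip(other_vec, m_vec))


# Example of Figure 2.2: v2 is a NAND gate whose other input is x1.
print(conn_spf("NAND", "01010101", "10111011"))  # *1*0*1*0
```

The printed vector reproduces the MSPF of $v_1$ obtained by hand in the example of this section, since $v_1$ drives $v_2$ through a single connection.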

Figure 2.3. Calculating the MSPF of a connection.

A special case arises when $v_j$ is a NOT gate. $\mathrm{MSPF}^{(d)}(c_{ij})$ is then computed simply by E2.5, where the overbar is the COMPLEMENT operator.

$$\mathrm{MSPF}^{(d)}(c_{ij}) = \overline{\mathrm{MSPF}^{(d)}(v_j)} \qquad \text{(E2.5)}$$

Table 2.1. #NOR
            MSPF(d)(vj)
  F(d)      0    1    *
   0        1    0    *
   1        *    -    *

Table 2.2. #OR
            MSPF(d)(vj)
  F(d)      0    1    *
   0        0    1    *
   1        -    *    *

Table 2.3. #NAND
            MSPF(d)(vj)
  G(d)      0    1    *
   0        -    *    *
   1        1    0    *

Table 2.4. #AND
            MSPF(d)(vj)
  G(d)      0    1    *
   0        *    -    *
   1        0    1    *

Tables 2.1 through 2.4. Definitions of #NOR, #OR, #NAND, and #AND.

The evaluation of the MSPF of a gate is slightly more complicated. Consider a gate $v_i$ as shown in Figure 2.4, having $c_{i1}, c_{i2}, \ldots, c_{ik}$ connected to its output terminal.

Figure 2.4. Calculating the MSPF of a gate.

If the gate $v_i$ is not a reconvergent gate, its MSPF is given by E2.6, where ${*} \cap 1 = 1 \cap {*} = 1$ and ${*} \cap 0 = 0 \cap {*} = 0$ in addition to the normal properties of $\cap$ on the domain $\{0, 1\}$.

$$\mathrm{MSPF}^{(d)}(v_i) = \bigcap_{1 \le x \le k} \mathrm{MSPF}^{(d)}(c_{ix}) \qquad \text{(E2.6)}$$
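The extended intersection used for a non-reconvergent gate can be sketched as follows. The names `meet` and `gate_spf` are assumptions, and treating ${*} \cap {*}$ as * is an assumption as well, since the text does not list that case explicitly.

```python
from functools import reduce


def meet(a, b):
    """Extended AND on {'0', '1', '*'}: '*' yields to a specified value."""
    if a == "*":
        return b              # *n1 = 1, *n0 = 0, and (assumed) *n* = *
    if b == "*":
        return a              # 1n* = 1, 0n* = 0
    # Normal Boolean AND on {0, 1}.  A mixed 0/1 pair should not occur in
    # practice, because every connection SPF contains the same f(v_i).
    return "1" if a == b == "1" else "0"


def gate_spf(conn_spfs):
    """SPF of a gate from the SPFs of all of its output connections."""
    return "".join(reduce(meet, column) for column in zip(*conn_spfs))


# Hypothetical SPFs of two output connections of the same gate.
print(gate_spf(["*1*0", "01**"]))  # 01*0
```

A position stays a don't-care for the gate only if it is a don't-care on every output connection, which is why gates with large fanout tend to have few don't-cares.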
However, if $v_i$ is reconvergent, it is then treated as an input variable and all the primary outputs are evaluated with respect to $v_i$. Using Shannon's Expansion, the function at every output $z_j$ can then be expressed with respect to $v_i$ in terms of two functions $P_j$ and $Q_j$, which are themselves expressed in terms of the primary inputs only. $\mathrm{MSPF}_j^{(d)}(v_i)$, which is the MSPF of $v_i$ due to $z_j$, can then be computed using the following algorithm:

If $f^{(d)}(v_i) = 1$ and $Q_j^{(d)} = 0$ then $\mathrm{MSPF}_j^{(d)}(v_i) = 1$
else if $f^{(d)}(v_i) = 0$ and $Q_j^{(d)} = 1$ then $\mathrm{MSPF}_j^{(d)}(v_i) = 1$
else if $f^{(d)}(v_i) = 1$ and $P_j^{(d)} = 0$ then $\mathrm{MSPF}_j^{(d)}(v_i) = 0$
else if $f^{(d)}(v_i) = 0$ and $P_j^{(d)} = 1$ then $\mathrm{MSPF}_j^{(d)}(v_i) = 0$
else $\mathrm{MSPF}_j^{(d)}(v_i) = {*}$.

The final value for $\mathrm{MSPF}^{(d)}(v_i)$ is then the intersection of $\mathrm{MSPF}_j^{(d)}(v_i)$ for $1 \le j \le m$.

To explain the correctness of this algorithm, suppose $f^{(d)}(v_i) = 1$ and $Q_j^{(d)} = 0$. Hence, $f^{(d)}(z_j \mid v_i) = P_j^{(d)}$, which could be either 1 or 0. Therefore, $\mathrm{MSPF}_j^{(d)}(v_i)$ must be 1 so as to allow $P_j^{(d)}$ to propagate to $f^{(d)}(z_j \mid v_i)$. The argument is similar for the other three cases.
Knowing how the MSPF of a connection and of a gate can be calculated, the MSPFs of all the gates and connections can then be calculated by first setting the MSPF of each primary output to be the same as the function at the output gate and then computing the rest of the MSPFs from the outputs towards the primary inputs.

2.3. Compatible Set of Permissible Functions

As the MSPF of a gate contains the largest set of permissible functions associated with it, this set also contains the largest observability don't-care set [SB90] for the gate. The observability don't-care set of a gate is the set of input values with which the gate's output is not observable through the primary outputs. However, as seen from the previous section, the computation of MSPFs could be time-consuming, especially when a circuit has many reconvergent gates. In addition, the MSPFs of all the gates and connections in a circuit have to be recomputed each time the circuit is transformed and reduced. Therefore, to reduce the amount of processing time required, compatible sets of permissible functions (CSPFs) for gates and connections are more frequently used.

A compatible set of permissible functions is a subset of the MSPF and is computed based on some ordering of the connections in a circuit. Although the don't-care set associated with a CSPF is often smaller than that with the MSPF, the quality of the resulting circuit minimized based on CSPFs usually does not suffer too badly and the processing time required is dramatically reduced.
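CSPFs are computed similarly to MSPFs. Referring to Figure 2.3 again, the CSPF for the connection $c_{ij}$ is given by:

$$\mathrm{CSPF}^{(d)}(c_{ij}) = F'^{(d)} \mathbin{\#_{\mathrm{NOR}}} \mathrm{CSPF}^{(d)}(v_j) \qquad \text{(E2.7)}$$

$$\mathrm{CSPF}^{(d)}(c_{ij}) = F'^{(d)} \mathbin{\#_{\mathrm{OR}}} \mathrm{CSPF}^{(d)}(v_j) \qquad \text{(E2.8)}$$

$$\mathrm{CSPF}^{(d)}(c_{ij}) = G'^{(d)} \mathbin{\#_{\mathrm{AND}}} \mathrm{CSPF}^{(d)}(v_j) \qquad \text{(E2.9)}$$

$$\mathrm{CSPF}^{(d)}(c_{ij}) = G'^{(d)} \mathbin{\#_{\mathrm{NAND}}} \mathrm{CSPF}^{(d)}(v_j) \qquad \text{(E2.10)}$$

depending on whether $v_j$ is a NOR, OR, AND or NAND gate respectively. $F'$ and $G'$ are slightly different from $F$ and $G$ in equations E2.1 through E2.4 and are given in equations E2.11 and E2.12.

$$F' = \bigcup_{x < i} f(c_{xj}) \qquad \text{(E2.11)}$$

$$G' = \bigcap_{x < i} f(c_{xj}) \qquad \text{(E2.12)}$$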
As can be seen, $F'$ and $G'$ depend on how the connections are ordered. For a connection ordered with a smaller $i$ value, the size of the don't-care set associated with its CSPF is smaller. [MK89] uses some heuristics to order the connections in a circuit and they are listed here.
1) Connections that are connected to input variables are given smaller $i$ values. This is because such connections are often difficult to remove. In addition, the removal of the other types of connections may cause some gates in the circuit to be removed also and result in a better overall gain.
2) Connections connected to gates with larger fanouts are given smaller $i$ values than connections connected to gates with smaller fanouts. This increases the chance of removing a gate when all of its output connections are removed.

The computation of the CSPF of a gate is performed exactly as in the case of computing the MSPF of a non-reconvergent gate. Formally,

$$\mathrm{CSPF}^{(d)}(v_i) = \bigcap_{1 \le x \le k} \mathrm{CSPF}^{(d)}(c_{ix}) \qquad \text{(E2.13)}$$

The CSPF of an output gate is the same as its output function. Again, similar to MSPFs, CSPFs are computed from the primary outputs towards the primary inputs. The use of E2.13 to compute the CSPF of every gate is one of the major time-saving factors in using CSPFs rather than MSPFs, as a circuit need not be evaluated again to obtain the output functions with respect to a reconvergent gate. In addition, as CSPFs are based on a partial ordering of the connections, they need not be recomputed each time the circuit is transformed.

After calculating the CSPFs or MSPFs of the gates and connections in a circuit, transformations can be applied to reduce their number. Such procedures are explained in the following sections.

2.4. Pruning

The pruning procedure removes redundant connections in a circuit. Pruning can be based on either MSPFs or CSPFs. In order to detect redundant connections, the MSPFs or CSPFs of all the connections have to be computed. The rules for deciding whether a connection is redundant are as follows:
1) If the gate $v_j$ is a NOR or OR gate and $\mathrm{SPF}^{(d)}(c_{ij}) = 0$ or * for all $1 \le d \le 2^n$, $c_{ij}$ is redundant.
2) If $v_j$ is an AND or NAND gate and $\mathrm{SPF}^{(d)}(c_{ij}) = 1$ or * for all $1 \le d \le 2^n$, $c_{ij}$ is redundant.
3) If $v_j$ is a NOT gate and $\mathrm{SPF}^{(d)}(c_{ij}) = {*}$ for all $1 \le d \le 2^n$, $c_{ij}$ is redundant.
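The three redundancy rules amount to checking that every position of the connection's SPF lies in a small set of allowed values. A minimal sketch, with the function name `is_redundant` and the string encoding assumed for illustration:

```python
def is_redundant(kind, spf):
    """Apply the pruning rules to the SPF vector of an input connection.

    kind : type of the driven gate v_j
    spf  : MSPF or CSPF of the connection as a string over '0', '1', '*'
    """
    allowed = {
        "NOR": "0*", "OR": "0*",      # rule 1: only 0 or * required
        "AND": "1*", "NAND": "1*",    # rule 2: only 1 or * required
        "NOT": "*",                   # rule 3: don't-care everywhere
    }[kind]
    return all(bit in allowed for bit in spf)


print(is_redundant("AND", "1*11"))        # True
print(is_redundant("NAND", "*1*0*1*0"))   # False
```

In the second call the connection must supply a 0 at some positions of a NAND gate, so it still controls the gate output there and cannot be removed.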

To see why this is true, consider a connection $c_{ij}$ connected to an AND gate as shown in Figure 2.5. If $\mathrm{SPF}^{(d)}(c_{ij}) = 1$ or * for all $1 \le d \le 2^n$, then $c_{ij}$ is never needed to turn off the output of $v_j$ for any combination of the input variables in order to maintain the primary outputs of the circuit. Hence, $c_{ij}$ is redundant and can be removed. The cases for the other gate types of $v_j$ are explained similarly.

Figure 2.5. An example of a connection $c_{ij}$ to an AND gate.

The procedure for performing pruning based on MSPFs is given in Procedure 2.4.1.

Procedure 2.4.1 - Pruning based on MSPFs.
1) Calculate the output function at every gate.
2) Levelize the circuit with respect to the primary outputs.
3) For every level of gates starting from the one nearest to the primary outputs,
   For every gate within a level,
      3.1) Compute the MSPF of the gate.
      3.2) Compute the MSPF of each of the gate's input connections.
      3.3) If a connection is redundant, remove it and possibly the gates attached to it. Repeat from Step 1 until no further improvement can be made.

If CSPFs are used instead of MSPFs in the pruning procedure, Step 3.3 in Procedure 2.4.1 can be modified so that it does not repeat from Step 1. This is given in Procedure 2.4.2.

Procedure 2.4.2 - Pruning based on CSPFs.
1) Calculate the output function at every gate.
2) Levelize the circuit with respect to the primary outputs.
3) For every level of gates starting from the one nearest to the primary outputs,
   For every gate within a level,
      3.1) Compute the CSPF of the gate.
      3.2) Compute the CSPF of each of the gate's input connections.
      3.3) If a connection is redundant, remove it.
4) Repeat from Step 1 until no further improvement can be made.

However, the circuit obtained from Procedure 2.4.2 may not always be free of redundant connections. This is because a CSPF does not contain the full don't-care set associated with a connection or a gate. To obtain an irredundant circuit, Procedure 2.4.1 can always be performed after Procedure 2.4.2. This is faster than using Procedure 2.4.1 alone to obtain an irredundant circuit [X90].

2.5. Gate Substitution

In gate substitution, a gate in a circuit is selected and the other existing gates are each checked to determine if the latter can replace the former without changing the functions at the primary outputs. This is illustrated in Figure 2.6.

Figure 2.6. An example of gate substitution.

To determine if a gate $v_j$ can replace another gate $v_i$, either the MSPF or the CSPF of $v_i$ can be used. However, the CSPF is used in our implementation of the Transduction Method (PTRANS) as the use of the MSPF is too time-consuming. In fact, CSPFs are used for all of the other transformations described later. The condition for $v_j$ to be able to replace $v_i$ is $f^{(d)}(v_j) \in \mathrm{CSPF}^{(d)}(v_i)$ for all $1 \le d \le 2^n$. The correctness of this condition follows directly from the definition of the CSPF of $v_i$. If $f(v_j)$ is an element of $\mathrm{CSPF}(v_i)$, the functions at the output connections of $v_i$ can be changed to the function at $v_j$ without changing the primary outputs. Hence, each of them can be connected to the output of $v_j$ instead of $v_i$, and $v_i$ can be removed from the circuit.

The procedure for performing gate substitution is given in Procedure 2.5.1.

Procedure 2.5.1 - Gate Substitution.
1) Calculate the CSPFs of all the gates and connections.
2) For every gate vi,
   For every other gate vj which is not a successor of vi,
   if f(d)(vj) ∈ CSPF(d)(vi) for all 1 ≤ d ≤ 2^n, replace each output connection of vi
   with the output from vj and remove vi (possibly together with some other gates in
   the circuit) from the circuit.
3) Repeat from Step 2 until no further substitution can be performed.
4) Repeat from Step 1 until no further substitution can be performed.

In Procedure 2.5.1, vj must not be a successor of vi. This is to prevent a loop from
being formed during the substitution.
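
As an illustration only, the substitution test in Step 2 of Procedure 2.5.1 can be sketched in Python. The sketch assumes that a function is stored as a string over "01" and a CSPF as a string over "01*", both indexed by the input vector number d. The names is_member and find_substitute, and the three example gates, are hypothetical and are not taken from PTRANS.

```python
def is_member(f, cspf):
    """True if f(d) is permitted by cspf(d) for every input vector d."""
    return all(c == "*" or c == b for b, c in zip(f, cspf))


def find_substitute(vi, functions, cspfs, successors):
    """Return the first gate vj that may take over the outputs of vi, or None."""
    for vj, f in functions.items():
        if vj == vi or vj in successors[vi]:
            continue  # a successor of vi would create a loop
        if is_member(f, cspfs[vi]):
            return vj
    return None


# Hypothetical two-input example: each string lists the values for d = 1..4.
functions = {"v1": "0110", "v2": "0111", "v3": "0100"}
cspfs = {"v1": "01*0", "v2": "0111", "v3": "0*00"}
successors = {"v1": {"v2"}, "v2": set(), "v3": set()}

print(find_substitute("v1", functions, cspfs, successors))
```

In this example v3 can replace v1 because the two functions differ only at the vector where CSPF(v1) holds a don't-care, while v2 is skipped as a successor of v1.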

2.6. Gate Merging

The gate merging procedure is slightly more complicated than gate substitution as
described earlier. The basic idea of this procedure is to select two gates and determine
if a third gate can be synthesized, with inputs connecting to existing gates in the
circuit other than the two above-mentioned gates, so that the third gate can replace
them. This results in the saving of a gate and is shown in Figure 2.7.
To perform gate merging, the connectable condition for gates is used. A gate vi is said
to be connectable to another gate vj if the condition below that corresponds to the type
of vj holds:
1) If vj is a NOR gate, there does not exist a value for d between 1 and 2^n such that
   CSPF(d)(vj) = 1 and f(d)(vi) = 1.
2) If vj is an OR gate, there does not exist a value for d between 1 and 2^n such that
   CSPF(d)(vj) = 0 and f(d)(vi) = 1.
3) If vj is an AND gate, there does not exist a value for d between 1 and 2^n such that
   CSPF(d)(vj) = 1 and f(d)(vi) = 0.
4) If vj is a NAND gate, there does not exist a value for d between 1 and 2^n such that
   CSPF(d)(vj) = 0 and f(d)(vi) = 0.

Figure 2.7. An example of gate merging.
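
The four cases of the connectable condition can be summarized as one forbidden pair of values per gate type. The following Python sketch is illustrative only and assumes the same string encoding as before (functions over "01", CSPFs over "01*"). The table FORBIDDEN and the function connectable are hypothetical names.

```python
# For each type of gate vj: the (CSPF(d)(vj), f(d)(vi)) pair that must never occur.
FORBIDDEN = {
    "NOR": ("1", "1"),
    "OR": ("0", "1"),
    "AND": ("1", "0"),
    "NAND": ("0", "0"),
}


def connectable(f_vi, cspf_vj, gate_type):
    """True if no input vector d produces the forbidden pair for gate_type."""
    bad = FORBIDDEN[gate_type]
    return all((c, b) != bad for b, c in zip(f_vi, cspf_vj))


print(connectable("0011", "1*00", "NOR"))   # no vector has CSPF = 1 and f = 1
print(connectable("1000", "1*00", "NOR"))   # the first vector has CSPF = 1 and f = 1
```

The forbidden pair is exactly the situation in which the candidate input alone would force the output of vj to a value that its CSPF does not allow.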
The case of vj being an inverter is not listed in any of the conditions as it can be
treated as a single-input NAND or NOR gate. In addition to the connectable condition,
the intersection operator, ∩, on the three-valued domain {0,1,*} is used and is defined
in Table 2.5. This operator is symmetric.
The gate merging procedure is given in Procedure 2.6.1. Again, the CSPF is used.

Procedure 2.6.1 - Gate merging.
1) Calculate the CSPFs of all the gates.
2) Pick two gates v1 and v2 such that their CSPFs are intersectable, i.e. the '-' sign
   in Table 2.5 does not arise.
3) Synthesize another gate v3 with CSPF equal to CSPF(v1) ∩ CSPF(v2). Let v3 be a NOR
   gate.
4) Search the circuit to obtain the set of gates which are connectable to v3. These
   gates cannot be successors of v1 or v2. If the set of gates obtained is empty, try
   v3 being either an OR, AND or NAND gate. If the set is still empty, repeat from
   Step 2 to try some other pair of gates.
5) Find the minimal set of connectable gates by going through Steps 5.1 to 5.2.
   5.1) For each gate vk in the set, remove it and test if the resulting f(v3) is still
        a member of CSPF(v3).
   5.2) If it is, remove vk from the set.
6) Connect the connectable gates to v3. Let v3 take over the output connections of v1
   and v2. Delete v1 and v2 and possibly some other gates from the circuit.
7) Repeat from Step 1 until no further improvements can be made.

   ∩ | 0  1  *
   --+---------
   0 | 0  -  0
   1 | -  1  1
   * | 0  1  *

Table 2.5. Definition of the operator ∩.

Although Step 7 in Procedure 2.6.1 can repeat from Step 2 instead of Step 1 as the CSPF
is used, it is found that only very few pairs of gates can be merged in each iteration of
Steps 1 through 6. Hence, it generally saves much more time to repeat from Step 1 after
a merge than to continue searching in a highly unsuccessful search space by starting at
Step 2.
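
The operator of Table 2.5 and the intersectability test used in Steps 2 and 3 of Procedure 2.6.1 can be sketched as follows. This is a minimal Python illustration under the assumption that CSPFs are stored as strings over "01*". The names intersect_bit and intersect are hypothetical, and None stands for the '-' entry of the table.

```python
def intersect_bit(a, b):
    """One entry of Table 2.5; None stands for the '-' (empty) entry."""
    if a == "*":
        return b
    if b == "*" or a == b:
        return a
    return None


def intersect(cspf1, cspf2):
    """CSPF(v1) ∩ CSPF(v2), or None if the two CSPFs are not intersectable."""
    result = []
    for a, b in zip(cspf1, cspf2):
        r = intersect_bit(a, b)
        if r is None:
            return None  # a 0 meets a 1: the pair of gates cannot be merged
        result.append(r)
    return "".join(result)


print(intersect("0*1*", "*01*"))
print(intersect("01", "00"))
```

A don't-care in either CSPF is resolved to the value required by the other, so the CSPF of the merged gate v3 is at least as restrictive as each of its two sources.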

2.7. Generalized Gate Substitution

As its name suggests, generalized gate substitution is a more general form of gate
substitution. In gate substitution, a gate is checked to see if all of its output
connections can be replaced by the output of another gate in the circuit. In generalized
gate substitution, each of the gate's output connections is checked to see if it can be
replaced by the output of some other gate instead. Hence, a gate may be substituted by
more than one gate. An illustration of this procedure is shown in Figure 2.8, in which
c34 and c35 can be (supposedly) replaced by c14 and c45 respectively. The crossed-out
connections and the gate v3 can then be removed from the circuit, resulting in one less
gate for the circuit.
Due to the similarity between generalized gate substitution and gate substitution, the
procedure for this transformation is obtained by modifying Procedure 2.5.1 slightly.
Procedure 2.7.1 - Generalized Gate Substitution.
1) Calculate the CSPFs of all the gates and connections.
2) For every gate vi,
   For every connection cik,
   For every other gate vj which is not a successor of vi,
   if f(d)(vj) ∈ CSPF(d)(cik) for all 1 ≤ d ≤ 2^n, replace cik with a new connection
   cjk if there is no connection cjk originally. Remove cik, together with some other
   gates in the circuit if any.
3) If not all of the output connections of vi have been substituted, undo Step 2.
4) Repeat from Step 2 until no further improvement can be performed.
5) Repeat from Step 1 until no further improvement can be performed.

Figure 2.8. An example of generalized gate substitution.

In this procedure, a gate cannot be partially substituted as this does not result in a
better circuit size. Step 3 prevents this from occurring.

2.8. Gate Input Reduction

Finally, the fourth transformation available in SYLON-XTRANS is gate input reduction. In
this transformation, a new gate vj is synthesized to replace a target gate vi such that
the number of inputs of vj is less than that of vi. After a successful gate input
reduction transformation, the total number of connections in the circuit is reduced.
To perform this transformation, a more stringent form of the connectable condition,
namely the effectively connectable condition, is needed. A gate vj is said to be
effectively connectable to vi if one of the following four conditions is true.
1) If vi is a NOR gate, there must be some value of d between 1 and 2^n inclusive such
   that CSPF(d)(vi) = 0 and f(d)(vj) = 1.
2) If vi is an OR gate, there must be some value of d between 1 and 2^n inclusive such
   that CSPF(d)(vi) = 1 and f(d)(vj) = 1.
3) If vi is an AND gate, there must be some value of d between 1 and 2^n inclusive such
   that CSPF(d)(vi) = 0 and f(d)(vj) = 0.
4) If vi is a NAND gate, there must be some value of d between 1 and 2^n inclusive such
   that CSPF(d)(vi) = 1 and f(d)(vj) = 0.
With the effectively connectable condition, the procedure for gate input reduction is as
follows:

Procedure 2.8.1 - Gate Input Reduction.
1) Calculate the CSPFs of all the gates in the circuit.
2) For each gate vi,
   2.1) Synthesize a new OR gate v which has the same CSPF and function as vi.
   2.2) Search for the set of gates in the circuit which are effectively connectable to
        v. These gates must not be successors of vi.
   2.3) Minimize the number of gates in the set obtained from Step 2.2 by the following:
        2.3.1) For each gate vk in the set, remove it and test if the resulting f(v) is
               still a member of CSPF(v).
        2.3.2) If it is, remove vk from the set.
   2.4) If the size of the set is less than the number of inputs of vi, add a new
        connection from each of the gates in the reduced set to the input of v and use
        it to replace vi. Otherwise, try synthesizing v as a NOR, AND or NAND gate
        instead of OR.
3) Repeat from Step 2 until there is no further improvement.
4) Repeat from Step 1 until there is no further improvement.
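
Steps 2.3.1 and 2.3.2 amount to a greedy pass over the candidate inputs. A minimal Python sketch for the OR-gate case is given below, assuming functions are strings over "01" and the CSPF is a string over "01*". The names or_of, is_member and minimize_inputs, and the four candidate gates, are hypothetical.

```python
def or_of(inputs, n):
    """Output of an OR gate whose inputs are truth vectors (strings over '01')."""
    return "".join("1" if any(f[d] == "1" for f in inputs) else "0" for d in range(n))


def is_member(f, cspf):
    """True if f(d) is permitted by cspf(d) for every input vector d."""
    return all(c == "*" or c == b for b, c in zip(f, cspf))


def minimize_inputs(candidates, cspf):
    """Drop every candidate whose removal keeps f(v) inside CSPF(v)."""
    kept = dict(candidates)
    for name in list(kept):
        trial = [f for k, f in kept.items() if k != name]
        # Never reduce the gate to zero inputs.
        if trial and is_member(or_of(trial, len(cspf)), cspf):
            del kept[name]
    return kept


# Hypothetical candidate set, all connectable to an OR gate with CSPF 01*1.
candidates = {"a": "0100", "b": "0001", "c": "0101", "d": "0010"}
print(minimize_inputs(candidates, "01*1"))
```

The result depends on the order in which the candidates are visited: the pass yields a set from which no further gate can be dropped, which is not necessarily the smallest such set.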
Very frequently, it is found that the law of diminishing returns applies to Procedures
2.7.1 and 2.8.1. The number of reductions to the circuit that can be made decreases
rather rapidly after these procedures are applied a constant number of times (once or
twice). Hence, the two procedures are combined into one single procedure which is then
applied once or twice to a circuit. This combined procedure goes through the circuit and,
for each gate, tries to perform generalized gate substitution on that gate. If this is
unsuccessful, gate input reduction is then applied (if applicable) to it.
The procedures given in Sections 2.4 through 2.8 form the basic tools for the
optimization of a multi-level circuit in SYLON-XTRANS.

CHAPTER 3
PARALLEL IMPLEMENTATION OF SYLON-XTRANS

3.1. General Overview

In the parallelization of SYLON-XTRANS, many problems have to be dealt with. This
section describes the problems and their solutions that have led to the present
implementation of PTRANS on an Encore 510 Multimax, which is an eight-processor
shared-memory multiprocessor.
The first problem concerns the size of the input circuit. After several experiments, the
synchronization overheads incurred in PTRANS were found to grow more slowly than the
actual time spent in minimizing the circuit. Hence, the input circuit has to be large
enough so that the overheads can be sufficiently masked for achieving good speedups and
high efficiencies. However, it is impossible for PTRANS to minimize arbitrarily large
circuits as this is bounded by the computer's memory limitation.
To solve this problem, binary decision diagrams (BDDs) are used to represent functions
and permissible functions instead of the more traditional SOP form as used in
SYLON-XTRANS. BDDs are generally more compact than the SOP representation [B86]. In
addition to using BDDs, the file system is also used as temporary storage. Although the
Encore Multimax computer has virtual memory, the amount of swap space available on our
system is limited. Hence, PTRANS has to manage the temporary disk storage explicitly. It
selectively stores and retrieves BDDs generated during program execution to and from the
disk.
Unfortunately, some of the circuits (e.g. the ISCAS benchmarks) are still too big to be
minimized as a whole. Such circuits are partitioned into smaller circuits before
minimization and can be merged afterwards. The partitions can either be minimized in
parallel consecutively, or in parallel simultaneously. PTRANS performs the necessary
intra- and inter-partition load balancing automatically. These modes of parallelism are
illustrated in Figure 3.1.
In this chapter, the details of the implementation of PTRANS are given. In Section 3.2,
the methods of manipulating BDDs to handle permissible functions are described. Section
3.3 briefly summarizes the partitioning algorithm used to partition large circuits. In
Section 3.4, the program model is given, followed by descriptions of how the functions
and permissible functions of gates in a circuit can be evaluated in parallel. Finally,
Sections 3.5 through 3.11 explain the parallel implementation of the various
transformations.

3.2. Binary Decision Diagrams

The use of BDDs to represent Boolean functions was formally introduced in [A78] and
[B86]. As permissible functions contain a third don't-care value (*) in addition to the
{0,1} binary values in ordinary Boolean functions, the original BDD structure has to be
modified to represent this additional value [MF89]. Furthermore, PTRANS uses some
additional BDD operators which are also described in this section.

Figure 3.1. The three modes of parallelism in PTRANS. (The figure shows a circuit broken
into four partitions and three processors. Mode 1: inter-partition parallelism. Mode 2:
intra-partition parallelism. Mode 3: inter- and intra-partition parallelism.)

The way in which the don't-care value is represented in a BDD is as follows. Suppose the
bit vector (0*110*1*) is to be represented and it corresponds to the truth table shown in
Table 3.1. Its BDD equivalent is then given in Figure 3.2. As can be seen, the only
modification needed is to have another terminal node in the BDD that represents the
don't-care value.
As can be deduced from Figure 3.2, the size of a BDD is dependent on the ordering of the
variables, i.e. the levels at which the input variables appear in the BDD. Some ordering
heuristics have been presented in the literature [MWBV88] and [FFK88]. PTRANS uses a
heuristic ordering based on the frequency with which each primary input is connected to
a gate. The justification is that a primary input that is connected to more gates
probably affects more functions, and hence is given a higher priority in the variable
ordering. Thus, it is placed nearer to the roots of the BDDs.
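
A minimal Python sketch of such a frequency-based ordering is given below. It is illustrative only: the netlist format (a mapping from each gate to the list of signals feeding it), the alphabetical tie-break, and the name order_primary_inputs are assumptions, not details taken from PTRANS.

```python
from collections import Counter


def order_primary_inputs(gate_inputs, primary_inputs):
    """Sort primary inputs by decreasing number of gates they feed directly."""
    counts = Counter(
        signal
        for inputs in gate_inputs.values()
        for signal in inputs
        if signal in primary_inputs
    )
    # Ties are broken alphabetically; this choice is an assumption of the sketch.
    return sorted(primary_inputs, key=lambda x: (-counts[x], x))


# Hypothetical netlist: x2 feeds three gates, x3 two gates, and x1 one gate.
netlist = {
    "g1": ["x1", "x2"],
    "g2": ["x2", "x3"],
    "g3": ["x2", "g1"],
    "g4": ["x3", "g2"],
}
print(order_primary_inputs(netlist, ["x1", "x2", "x3"]))
```

The first variable in the returned list would be placed nearest to the roots of the BDDs.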

   x1    | 0  0  0  0  1  1  1  1
   x2    | 0  0  1  1  0  0  1  1
   x3    | 0  1  0  1  0  1  0  1
   Vect. | 0  *  1  1  0  *  1  *

Table 3.1. Truth table for the vector (0*110*1*).

Figure 3.2. The BDD for the truth table in Table 3.1.
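
To make the three-terminal representation concrete, the vector of Table 3.1 can be turned into a reduced ordered BDD with a few lines of Python. The encoding below is an assumption made for illustration (terminals are the strings "0", "1" and "*", internal nodes are tuples (variable index, low child, high child)), and the names build, value and nodes are hypothetical. The resulting diagram is not claimed to be identical in layout to Figure 3.2.

```python
def build(vector, var=1):
    """Reduced ordered BDD of a truth vector over {0,1,*}; x1 is tested first."""
    if len(vector) == 1:
        return vector                       # terminal node: "0", "1" or "*"
    half = len(vector) // 2
    low = build(vector[:half], var + 1)     # cofactor with x_var = 0
    high = build(vector[half:], var + 1)    # cofactor with x_var = 1
    # Skip a test whose two branches agree; equal tuples act as shared subtrees.
    return low if low == high else (var, low, high)


def value(node, assignment):
    """Follow the assignment (x1, x2, ...) down to a terminal value."""
    while not isinstance(node, str):
        var, low, high = node
        node = high if assignment[var - 1] else low
    return node


def nodes(node, seen=None):
    """Set of distinct nodes (terminals included) reachable from node."""
    seen = set() if seen is None else seen
    if node not in seen:
        seen.add(node)
        if not isinstance(node, str):
            nodes(node[1], seen)
            nodes(node[2], seen)
    return seen


bdd = build("0*110*1*")
print(bdd)
print(len(nodes(bdd)))
```

Under this encoding the don't-care value needs no special treatment beyond being a third terminal, which is the point made in the text.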
In PTRANS, there are four new procedures for manipulating these BDDs. They are listed as
follows:
1) Test if a function is a member of another function.
2) Test if a function intersects with another function.
3) Test for the connectable condition.
4) Test for the effectively connectable condition.
These are given in Procedures 3.2.1, 3.2.2, 3.2.3 and 3.2.4 respectively. The relevant
fields in the data structure used for the BDDs are basically the same as those described
in [B86]. Procedures 3.2.3 and 3.2.4 for testing the connectable and effectively
connectable conditions follow directly from their definitions in Sections 2.6 and 2.8
respectively.

Procedure 3.2.1 (BDD1,BDD2) - Tests if BDD1 is a member of BDD2.
/* Input : BDD1 and BDD2.
   Output : Returns 1 if BDD1 ∈ BDD2, 0 otherwise. */
1) If BDD2.val = *, return(1).
2) If BDD2.val ≠ * and BDD1.val ≠ BDD2.val, return(0).
3) Recursively call on the subtrees of BDD1 and BDD2 to check if the subtrees of BDD1
   are members of their corresponding subtrees of BDD2.

Procedure 3.2.1 is a straightforward implementation of checking if every bit in the
vector represented by BDD1 is a subset of the corresponding bit in the vector represented
by BDD2 by traversing both BDDs. When the subset condition fails for a pair of bit values
in the two vectors, the procedure returns a 0 immediately.

Procedure 3.2.2 (BDD1,BDD2) - Tests if two BDDs intersect.
/* Input : BDD1 and BDD2.
   Output : Returns 1 if BDD1 ∩ BDD2 ≠ ∅, 0 otherwise. */
1) If BDD1.val = BDD2.val ≠ *, return(1).
2) If BDD1.val = 1 and BDD2.val = 0 or vice-versa, return(0).
3) Recursively call on the subtrees of BDD1 and BDD2 to determine if they intersect.

Similar to Procedure 3.2.1, Procedure 3.2.2 traverses both BDDs to ensure that the
corresponding bits in the vectors represented by BDD1 and BDD2 are intersectable. This
intersectable condition is violated only when a bit in the first vector is 1 and the
corresponding bit in the second vector is 0 or vice-versa. At this point, the procedure
stops and returns a 0.
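
Procedures 3.2.1 and 3.2.2 can be sketched together in Python. The sketch is illustrative only and assumes a tuple encoding that is not taken from PTRANS: terminals are the strings "0", "1" and "*", and an internal node is (variable index, low child, high child) with smaller indices nearer the root. A diagram that does not test the current variable is treated as its own cofactor. Returning true as soon as either operand of the intersection test is a don't-care terminal is an assumption that follows Table 2.5. All function names are hypothetical.

```python
def split(node, var):
    """(low, high) cofactors of node with respect to var."""
    if isinstance(node, tuple) and node[0] == var:
        return node[1], node[2]
    return node, node  # node does not test var


def children(a, b):
    """Pairs of corresponding subtrees of a and b under their topmost variable."""
    var = min(n[0] for n in (a, b) if isinstance(n, tuple))
    (a0, a1), (b0, b1) = split(a, var), split(b, var)
    return (a0, b0), (a1, b1)


def is_member(f, spf):
    """Procedure 3.2.1: every bit of f is permitted by the matching bit of spf."""
    if spf == "*":
        return True
    if isinstance(f, str) and isinstance(spf, str):
        return f == spf
    return all(is_member(x, y) for x, y in children(f, spf))


def intersects(a, b):
    """Procedure 3.2.2: no bit position holds a 0 in one diagram and a 1 in the other."""
    if a == "*" or b == "*":
        return True
    if isinstance(a, str) and isinstance(b, str):
        return a == b
    return all(intersects(x, y) for x, y in children(a, b))


spf = (1, (2, "0", "*"), (2, "0", "1"))      # vector 0*01
f_and = (1, "0", (2, "0", "1"))              # x1 AND x2, vector 0001
f_x1 = (1, "0", "1")                         # x1, vector 0011
print(is_member(f_and, spf), is_member(f_x1, spf))
```

Because all() stops at the first false element of the generator, both traversals abort as soon as one violating pair of terminals is found, mirroring the immediate return of 0 described above.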

Procedure 3.2.3 (f,SPF,Gate_type) - Tests connectability.
/* Tests if a gate with function f is connectable to another gate v with CSPF called
   SPF. Gate_type is the type of gate v. It can be NOR, OR, AND or NAND.
   Input : a function f, a CSPF called SPF and a gate type.
   Output : Returns 1 if the connectable condition is true and 0 otherwise. */
1) If SPF.val = *, return(1).
2) If Gate_type = NOR
      if SPF.val = 1 and f.val = 1, return(0); else if SPF.val ≠ *, return(1).
3) If Gate_type = OR
      if SPF.val = 0 and f.val = 1, return(0); else if SPF.val ≠ *, return(1).
4) If Gate_type = AND
      if SPF.val = 1 and f.val = 0, return(0); else if SPF.val ≠ *, return(1).
5) If Gate_type = NAND
      if SPF.val = 0 and f.val = 0, return(0); else if SPF.val ≠ *, return(1).
6) Recursively call on the subtrees of f and SPF to check for connectability.

At each recursion of Procedure 3.2.3, if a pair of terminal values is reached, the
procedure checks if the connectable condition defined in Section 2.6 is violated,
depending on the type of gate v. Once a violation is detected, the recursion aborts and
the procedure returns a 0. Otherwise, the procedure recursively checks other pairs of
terminal values in the two BDDs, f and SPF.

Procedure 3.2.4 (f,SPF,Gate_type,flag) - Tests effective connectability.
/* Tests if a gate with function f is effectively connectable to another gate v with
   CSPF called SPF. Gate_type is the type of gate v. It can be NOR, OR, AND or NAND.
   'flag' is an external Boolean variable.
   Input : a function f, a CSPF called SPF, a gate type, and an external variable flag.
   Output : Returns 1 if the effectively connectable condition is true and 0
            otherwise. */
1) If SPF.val = *, return(1).
2) If Gate_type = NOR
      if SPF.val = 1 and f.val = 1, return(0); else if SPF.val = 0 and f.val = 1, set
      flag to be true. Otherwise, if SPF.val ≠ *, return(1).
3) If Gate_type = OR
      if SPF.val = 0 and f.val = 1, return(0); else if SPF.val = 1 and f.val = 1, set
      flag to be true. Otherwise, if SPF.val ≠ *, return(1).
4) If Gate_type = AND
      if SPF.val = 1 and f.val = 0, return(0); else if SPF.val = 0 and f.val = 0, set
      flag to be true. Otherwise, if SPF.val ≠ *, return(1).
5) If Gate_type = NAND
      if SPF.val = 0 and f.val = 0, return(0); else if SPF.val = 1 and f.val = 0, set
      flag to be true. Otherwise, if SPF.val ≠ *, return(1).
6) Recursively call on the subtrees of f and SPF to check for effective connectability.
7) The effectively connectable condition is only true if both the procedure returns 1
   and flag has been set to true.

Procedure 3.2.4 is very similar to Procedure 3.2.3 except that a Boolean flag is used to
record if f is effective with respect to the function SPF, i.e. if f has helped in the
setting of any bit of SPF to its value based on the type of gate v. The remaining
conditional statements in the procedure check for the connectable condition, which is
already shown in Procedure 3.2.3. Thus, when the procedure returns a 1 and the flag has
been set, both the effective and connectable conditions are satisfied.
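
The combined test can be sketched in Python as follows. The sketch is illustrative and rests on assumptions that are not taken from PTRANS: BDDs are encoded as tuples (variable index, low child, high child) with string terminals "0", "1" and "*", the external flag is modelled as a one-element list, and the effective pairs follow the definition in Section 2.8. All names are hypothetical.

```python
# (SPF value, f value) pairs, per type of gate v.
FORBIDDEN = {"NOR": ("1", "1"), "OR": ("0", "1"), "AND": ("1", "0"), "NAND": ("0", "0")}
EFFECTIVE = {"NOR": ("0", "1"), "OR": ("1", "1"), "AND": ("0", "0"), "NAND": ("1", "0")}


def split(node, var):
    """(low, high) cofactors; a node that does not test var is its own cofactor."""
    if isinstance(node, tuple) and node[0] == var:
        return node[1], node[2]
    return node, node


def check(f, spf, gate_type, flag):
    """Connectability test that also records effectiveness in flag[0]."""
    if spf == "*":
        return True
    if isinstance(f, str) and isinstance(spf, str):
        if (spf, f) == FORBIDDEN[gate_type]:
            return False
        if (spf, f) == EFFECTIVE[gate_type]:
            flag[0] = True
        return True
    var = min(n[0] for n in (f, spf) if isinstance(n, tuple))
    (f0, f1), (s0, s1) = split(f, var), split(spf, var)
    return check(f0, s0, gate_type, flag) and check(f1, s1, gate_type, flag)


def effectively_connectable(f, spf, gate_type):
    flag = [False]  # stands in for the external Boolean variable
    return check(f, spf, gate_type, flag) and flag[0]


spf = (1, (2, "0", "*"), "1")                              # vector 0*11
print(effectively_connectable((1, "0", "1"), spf, "OR"))   # f = x1
print(effectively_connectable("0", spf, "OR"))             # f = constant 0
```

The constant-0 function is connectable to the OR gate but never helps set an output bit to 1, so only the first call reports an effective connection.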

3.3. Partitioning Algorithm

In this section, the partitioning algorithm used for breaking up large circuits is
briefly described. More details can be found in [DB91].
The partitioning algorithm comprises seed-clustering and group-migration algorithms.
Each execution of the algorithm breaks a circuit into two partitions. The
seed-clustering algorithm starts by locating two seeds for two partitions, chosen such
that they are maximally away from all boundary gates like primary inputs and primary
outputs in the circuit. They are also as far away as possible from each other.
After the two seeds are located, they are separated into two growing partitions. The
other gates not yet considered are placed on a free-list. Considering one partition at a
time, a gate is then picked from the list such that the gain obtained by putting it into
the partition is maximum. The cost function for calculating the gain is described below.
When the free-list becomes empty, Kernighan-Lin's algorithm [KL70] is then used to swap
pairs of gates between the partitions. The pairs of gates are selected such that
swapping them results in more gain in the overall qualities of the partitions.
The cost function used to measure the amount of gain of a gate with respect to a
partition is an estimate of the size of the don't-care set associated with the gate.
This can be found by choosing random vectors to simulate the circuit the gate is in.
From the frequency of 0's and 1's appearing in each connection, the don't-care set can be
estimated. More details of this can be found in [DB91].
With this partitioning algorithm, large circuits can be partitioned and optimized in
parallel. The details of this parallel implementation are given in the following
sections.

3.4. Program Model

As mentioned earlier, large circuits have to be partitioned before they can be
minimized. At the implementation level, no distinction is made between a partition of a
circuit and a whole circuit. PTRANS can be fed with as many partitions as possible
simultaneously under the constraint caused by the amount of memory available. There is
no relation between the number of input partitions and the number of processors PTRANS
uses.
PTRANS uses a multiple master-slave model. This is very similar to the normal
master-slave program model, whereby the master distributes computations to the slave
processes and is also in charge of synchronizing them. The results of the computations
are then passed back to the master. The only differences between the model PTRANS uses
and the normal master-slave model are that multiple masters are present in PTRANS, and
each slave does not always belong to the same master. In PTRANS, each master or slave is
actually a process in the system. A processor is assumed to be always allocated to a
process by the operating system. The number of processes can vary from one to the number
of processors available on the system.
At any instant of time, only one master is associated with a partition. This master is
responsible for the whole minimization process of the partition. During the minimization
of its partition, the master will never be used for the minimization of other
partitions. As for the slaves, they stay in a shared slave pool. Whenever a master
reaches a point during its execution where it can distribute its load to other
processes, it will enter exclusively into the slave pool and try to get as many slaves
as possible from the pool. It then distributes the load to those slaves. When these
slaves have finished their computations, they return to the slave pool, awaiting future
masters. Whenever a partition has been minimized, the corresponding master becomes a
slave and it too enters the slave pool.
In order to utilize the processors efficiently, a master cannot own slaves throughout
the whole minimization process of a partition as this would deny the other masters of
slaves. In PTRANS, there are several entry and exit points. Entry points are locations
where slaves can join a master in the minimization of a partition. Similarly, exit
points are locations where slaves can leave a master and return to the slave pool. After
a slave has been sought for help by a master, it will enter at an entry point determined
by the master, perform the computations in parallel with the master and other slaves,
exit at the next exit point and return to the slave pool. Between every pair of entry
and exit points is a well-defined piece of work such as gate substitution. An
illustration of this master-slave relation is shown in Figure 3.3. Using this slave
pool, the load can be distributed to idle processors. This forms the basis of the load
balancing between the processors in PTRANS.

Figure 3.3. A sample timing diagram of two masters and two slaves. (Processes 1 and 4
are masters and Processes 2 and 3 are slaves. The diagram marks the distribution of work
to slaves and, with an X, the termination of a task and the return of a slave to the
pool.)

Besides using the master-slave model, PTRANS also uses a semi-distributed memory model.
Every process has some semi-private memory locations pre-allocated to it. This is to
avoid contention in allocating memory for frequently used data structures such as BDDs,
since allocating shared memory is a sequential bottleneck. This set of memory locations
is classified as private memory because only the owning process can allocate memory out
of its set. However, it is semi-private as data structures allocated from a set can be
read and de-allocated by other processes.
With this basic model, PTRANS is able to minimize multiple partitions simultaneously.
Let p be the number of partitions and P be the number of processors. Initially, there
are min(p,P) masters. This number will reduce gradually. If P is greater than p, there
will also be P-p initial slaves. Each master is allocated a list consisting of p/P
partitions. Since P does not generally divide p, some masters may have one partition
more than the other masters. These partitions can be minimized in parallel without any
dependency between them. Whenever a master has processed all of its allocated
partitions, it performs a scan of the other masters' lists of partitions and looks for
the first uncomputed partition. It then removes this partition from the list and
minimizes it. If the master cannot find such a partition, it checks if it is the last
master among all of the P processes. If so, this master will send a termination message
to each of the other P-1 slaves and all of the P processes will then exit, thus
terminating the whole program. Otherwise, this master will change its status to a slave
and enter the slave pool.
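
The allocation and scanning scheme described above can be sketched sequentially in Python. The sketch is illustrative only: partitions are represented by integers, the first masters receive the extra partitions, and the names allocate and next_partition are hypothetical. In PTRANS the lists are shared between concurrent processes, so the removal of a partition would additionally need to be synchronized.

```python
def allocate(p, P):
    """Split partitions 0..p-1 among min(p, P) masters; list sizes differ by at most one."""
    masters = min(p, P)
    base, extra = divmod(p, masters)
    lists, start = [], 0
    for m in range(masters):
        size = base + (1 if m < extra else 0)
        lists.append(list(range(start, start + size)))
        start += size
    return lists


def next_partition(lists, me):
    """Own list first, then the first uncomputed partition in another master's list."""
    for owner in [me] + [m for m in range(len(lists)) if m != me]:
        if lists[owner]:
            return lists[owner].pop(0)
    return None  # nothing left: become a slave, or terminate if this is the last master


print(allocate(10, 4))
print(allocate(3, 8))
```

With ten partitions and four processors, two masters receive three partitions and two receive two. With three partitions and eight processors, there are three masters and the remaining five processes start as slaves.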
In addition to this high-level inter-partition parallelism, PTRANS is also able to apply
the Transduction Method on a partition in parallel. This is described in the coming
sections.

3.5. Discussion of Number of Partitions

An obvious way of extracting significant speedup out of the logic synthesis application
is to generate a large number of partitions and synthesize each partition independently.
The results of the individual partitions are then merged back. Unfortunately, such an
approach has the problem that with an increasing number of partitions, the quality of
the overall circuit degrades. This is because each synthesis procedure of a partition
only synthesizes within the partition by treating it as an independent block. It does
not take any global information into consideration during minimization. A good
partitioning algorithm that can guarantee minimum degradation in the circuit quality is
desirable. Examples of existing partitioning algorithms are BEAT-NP [CHNS88], COROLLA
[DBK90], and that of Banerjee [DB91].
In the interest of better quality, one should therefore choose a minimum number of
partitions. Then, one is forced to resort to intra-partition parallelism, which is much
harder to exploit. One may not get good speedup within a partition. There is clearly a
tradeoff between result quality and runtime, which would be resolved by an optimal
number of partitions. A theory for choosing this number needs to be developed but is
outside the scope of this thesis.

3.6. Parallel Evaluation of Functions and CSPFs of Gates

To exploit intra-partition parallelism, the parallel evaluation of functions and CSPFs
of gates is discussed in this section. The evaluation of MSPFs is slightly different
from that of CSPFs and is deferred to the next section.
The parallel evaluation of the output functions of gates is similar to the parallel
methods of logic simulation [SB88] and circuit partition approaches to fault simulation
[PBP91]. From the definition of a level in a circuit in Section 2.1, it can be seen that
gates within the same level with respect to the primary inputs can have their functions
evaluated in parallel. Similarly, the CSPFs of gates having the same level number with
respect to the primary outputs can be computed in parallel too. This is illustrated in
Figures 3.4(a) and (b).

Figure 3.4(a). An example of parallel evaluation. (Gates D and E are evaluated in
parallel.)

Figure 3.4(b). A sample timing diagram for Figure 3.4(a). (For each of Process 1, the
master, and Process 2, the slave, the diagram lists the action taken and the contents of
the input and output queues over time, with the synchronization points marked.)

To traverse the circuit so that gates in the same level can have their functions
evaluated in parallel, every process (both master and slaves) working on the circuit
needs an input and an output queue. Initially, the primary inputs of the circuit are
evenly distributed among the input queues of these processes. There is a counter
associated with each gate which is initialized to zero. Whenever a process takes a gate
v from its input queue and evaluates its function, it increments the counters in each of
the immediate successors of v. If the counter in a gate equals the number of its input
connections (signifying that all its inputs have been processed and hence the output
function of the gate should be evaluated), this counter is reset to zero and the gate is
enqueued into the output queue of the process. After a process has processed all of its
input queue, it examines the input queues of the other processes, picks the longest
queue, removes half of its contents, puts those into its own input queue and continues
processing the queue. When all of the input queues have been emptied, a level of gates
has been processed. The master of the circuit then concatenates all of the output queues
into a single queue and distributes the gates in this queue evenly among the input
queues. After this, all processes involved in this circuit will continue processing
their input queues as described earlier.
Whenever the master finds that all of the output queues are empty after a level has been
processed, the functions at all of the gates in the circuit have been evaluated. The
slaves will then return to the slave pool.
The CSPFs are evaluated similarly, except that the circuit is levelized with respect to
the primary outputs and traversed backwards.
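
The counter and queue discipline for the forward traversal can be illustrated with a sequential Python simulation. This is a sketch under stated assumptions, not the PTRANS implementation: the per-process loops run one after another instead of concurrently, the stealing of half of the longest input queue is therefore omitted, and the small circuit and all names are hypothetical.

```python
def evaluate_by_levels(fanout, fanin_count, primary_inputs, evaluate, workers=2):
    """Evaluate a circuit level by level; returns the list of levels processed."""
    counters = {gate: 0 for gate in fanin_count}
    level = list(primary_inputs)
    levels = []
    while level:
        # Master: distribute the current level evenly among the input queues.
        input_queues = [level[w::workers] for w in range(workers)]
        output_queues = [[] for _ in range(workers)]
        for w in range(workers):            # concurrent in PTRANS, sequential here
            for v in input_queues[w]:
                evaluate(v)
                for s in fanout.get(v, ()):
                    counters[s] += 1
                    if counters[s] == fanin_count[s]:
                        counters[s] = 0     # all inputs of s are now available
                        output_queues[w].append(s)
        levels.append(level)
        # Master: concatenate the output queues to form the next level.
        level = [g for q in output_queues for g in q]
    return levels


# Hypothetical circuit: D = g(A, B), E = g(B, C), F = g(D, E).
fanout = {"A": ["D"], "B": ["D", "E"], "C": ["E"], "D": ["F"], "E": ["F"]}
fanin_count = {"D": 2, "E": 2, "F": 2}
visited = []
print(evaluate_by_levels(fanout, fanin_count, ["A", "B", "C"], visited.append))
```

A gate enters an output queue only when its last input has been evaluated, so every gate of one level is ready before the next level is distributed.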
For small circuits, the CSPFs can be stored in the main memory after they are evaluated.
However, for more complicated circuits, there is insufficient memory to hold all of
these permissible functions simultaneously. To avoid this problem, some of the CSPFs are
transferred to the disk. In order to minimize the number of disk accesses when
evaluating such functions, the permissible function of a gate is not stored on the disk
immediately after it is evaluated. Instead, it is held in the main memory until the
CSPFs of all of the input connections of the gate have been computed. After this, it is
packed into a contiguous format and sent to the disk.
As for the functions of the gates, they generally require much less memory for storage
than CSPFs. This is because the functions of the connections are the same as that of the
gate they are connected to, whereas their CSPFs are different from that of the gate.
Hence, such functions are not stored on the disk.
Another slight difference between the evaluation of functions and CSPFs is the
granularity at which these two are performed. For normal functions, the evaluation of
the whole circuit is treated as a single task. Thus, the master will only enter the
slave pool at the beginning of this task to look for slaves. On the other hand, the
evaluation of CSPFs is more time-consuming as the BDDs needed to represent these
functions are generally larger. Additional processing is also needed to pack these BDDs
for disk storage. Hence, the evaluation of permissible functions is broken up into a
smaller grain size. This grain size is set at the levels of the circuit. At this grain
size, the evaluation of the gates at the same level is treated as a task and the master
is allowed to enter the slave pool to obtain slaves for each level of the circuit.
It should be noted that the above approach is one way of exploiting the parallelism
in the evaluation of functions and CSPFs. Another way would be to partition the input
space among different processors and let each processor perform function and CSPF
evaluations on its input vectors. For example, with two processors, Processor 1 might
be processing the first d/2 bits of the vectors while Processor 2 is in charge of the
remaining bits. Although this is conceptually simple to parallelize in the SOP
representation, the difficulty comes when using BDDs. With the splitting of the input
space, multiple BDDs are needed to represent a single function. The amount of
subtree-sharing in these BDDs will thus be smaller as compared to that in a single BDD
representing the same function if the input space had not been split. This would
increase the amount of memory needed by the BDDs.
3.7. Parallel Pruning
The pruning procedure can be broken down into two parts. The first part consists
of identifying the redundant connections and the second performs the actual removal of
the redundant connections.
Pruning based on CSPF is slightly different from pruning based on MSPF. This is
because the removal of a redundant connection does not invalidate the CSPFs of other
gates and connections whereas this is not true with MSPF. Therefore, for pruning based
on CSPF, the CSPFs of all the gates and connections can be first generated before
performing any redundancy removal. As the CSPFs are generated, the connection wires
are checked to see if the wires can be pruned. If so, such connections are marked. The
generation and checking of the CSPFs can be performed in parallel as described in the
previous section. This is illustrated in Figure 3.5.

[Figure 3.5: connections c1, c2, c3, and c4 are checked for redundancy simultaneously;
c1 and c2 are found to be redundant and are marked in parallel but are removed sequentially.]
Figure 3.5. An example of parallel pruning.

After all of the connections have been checked for redundancy, the master process
then goes through the marked connections and removes them sequentially. This is not
performed in parallel as the time taken to adjust a few pointers during the removal of a
redundant connection is negligible as compared to the time needed to detect its
existence. In addition, the number of redundant connections is usually very small as
compared to the total number of connections in the circuit.
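
The mark-in-parallel, remove-sequentially pattern for CSPF-based pruning might look like the sketch below. It is only an illustration under stated assumptions: `prune_with_cspf` and the `is_redundant` predicate are hypothetical names, a thread pool stands in for the master/slave processes, and the CSPF computation itself is abstracted away inside the predicate.

```python
from concurrent.futures import ThreadPoolExecutor


def prune_with_cspf(connections, is_redundant, n_workers=4):
    """Mark redundant connections in parallel, then remove them one by one."""
    # Phase 1: every connection is checked independently, so the checks can
    # run concurrently (removing one connection does not invalidate the CSPFs
    # of the others).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        flags = list(pool.map(is_redundant, connections))
    marked = [c for c, redundant in zip(connections, flags) if redundant]

    # Phase 2: the master removes the marked connections sequentially.
    remaining = list(connections)
    for c in marked:
        remaining.remove(c)
    return marked, remaining
```

With the connections of Figure 3.5 and a predicate that flags c1 and c2, the call returns `(["c1", "c2"], ["c3", "c4"])`.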
As for pruning based on MSPF, both the computation of MSPFs and the removal
of redundant connections have to be combined into a single phase to avoid redundant
work. This is because after a connection is found to be redundant and pruned, the
MSPFs of all other connections and gates are invalidated and have to be recomputed.
The grainsize of the computation of the MSPFs is set to the level of a circuit and is
similar to the case with CSPFs. Each process computes the MSPF of a gate or a
connection as described in Section 2.2 and checks for redundancy of a connection after
its MSPF is computed. If a connection is found to be redundant, it is recorded in a
shared variable readable by every process working on the same circuit. During the
computation of the MSPFs, every process checks this variable periodically to determine
if a redundant connection has been detected. When this is set to true, the slaves will
then return to the slave pool whereas the master process will perform the removal of the
redundant connection. After this, it restarts the computation of the functions and MSPFs
of the circuit. This cycle is repeated until no further redundant connection can be found.
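
The detect, remove, and restart cycle for MSPF-based pruning can be sketched as a simple loop. This is a toy, sequential illustration: `prune_with_mspf` and `find_redundant` are hypothetical names, and the stand-in redundancy test below (two connections carrying the same signal) is an assumption chosen only to show why results must be recomputed after each removal.

```python
def prune_with_mspf(connections, find_redundant):
    """Repeat: recompute, find one redundant connection, remove it, restart.

    find_redundant(conns) stands in for the parallel MSPF computation; it
    returns one redundant connection, or None if the circuit is irredundant.
    """
    conns = list(connections)
    passes = 0
    while True:
        passes += 1
        victim = find_redundant(conns)   # full recomputation on every pass
        if victim is None:
            return conns, passes
        conns.remove(victim)             # done by the master alone


def duplicate_signal(conns):
    """Toy test: a connection is redundant if another carries the same signal."""
    for name, signal in conns:
        if sum(1 for _, s in conns if s == signal) > 1:
            return (name, signal)
    return None
```

In the toy test, c1 and c2 are both redundant at first, but once c1 is removed c2 no longer is, which mirrors the invalidation of MSPFs after a removal.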
As with CSPF, there is insufficient memory to store the MSPF of every gate and
connection. However, MSPFs are not stored on the disk since they are not needed for
any other transformation. The MSPF of a gate or connection is deleted once it has been
used by all of the relevant immediate predecessor gates or input connections. In the
pruning procedure of PTRANS, pruning with CSPF is first executed before pruning
with MSPF. This combination yields an irredundant circuit in a shorter time than
pruning with MSPF alone.

3.8. Parallel Gate Substitution
The main idea of gate substitution is to search the given circuit for a pair of gates
such that one gate (the candidate gate) can replace the other (the replaced gate). As
there are possibly many pairs of gates satisfying the gate substitution condition at a
time, the gates are ordered and searched so that the replaced/candidate gate is as near to
the primary outputs/inputs as possible. This is to minimize the occurrence of a pair of
candidate and replaced gates such that the candidate gate is a successor of the replaced
gate. In this case, it is impossible to perform the substitution since the resulting circuit
will not be loop-free.
To search for such pairs of gates in parallel, the gates of the circuit are arranged in
two shared queues. The first queue, Q1, contains the gates traversed in a breadth-first
search from the primary outputs to the primary inputs. The second queue, Q2, contains
the same gates arranged in reversed order, i.e. from primary inputs to primary outputs.
This is illustrated in Figures 3.6(a) and (b).
To search for a pair of gates for substitution, a process first exclusively dequeues a
gate v from Q1. It then scans the gates in Q2 from head to tail and stops when it finds a
gate which can substitute for gate v. Next, the process records this pair of gates and
informs the other processes working on the same circuit to stop their search by setting a
shared flag that is periodically monitored by them.
When a process finds that the flag has been set, it will return to the slave pool if it
is a slave. Otherwise, if it is the master, it waits until all of its slave processes have
returned to the pool and then performs the substitution. It then goes into the slave pool
again to get more slaves and continues searching for other pairs of gates for
substitution.
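
One search round of this scheme can be sketched with threads. The sketch is illustrative only: `parallel_substitution_search` and the `can_substitute` predicate are hypothetical names, Python threads stand in for the master and slave processes, and the slave pool is reduced to a fixed number of workers.

```python
import threading
from collections import deque


def parallel_substitution_search(q1_gates, q2_gates, can_substitute, n_workers=3):
    """Return one (candidate, replaced) pair, or None if Q1 is exhausted."""
    q1 = deque(q1_gates)          # gates ordered from primary outputs to inputs
    lock = threading.Lock()
    stop = threading.Event()      # the shared flag polled by every process
    found = []

    def worker():
        while not stop.is_set():
            with lock:            # exclusive dequeue from Q1
                if not q1:
                    return
                v = q1.popleft()
            for cand in q2_gates:  # scan Q2 from head to tail
                if stop.is_set():  # another process already found a pair
                    return
                if cand != v and can_substitute(cand, v):
                    with lock:
                        if not stop.is_set():
                            found.append((cand, v))
                            stop.set()
                    return

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:             # the master waits for all slaves to return
        t.join()
    return found[0] if found else None
```

After a pair is returned, the caller would perform the substitution and start another round, as the master does in the text.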
In order to provide better load balancing, a master will also try to obtain slaves
from the slave pool during the search if it has not yet obtained P-1 slaves, where P is
the number of processors PTRANS is executing on. To do this, each time the master
has exclusively dequeued a gate from Q1, it will enter the slave pool to look for idle
processors. These new slaves will then help in the search by entering the substitution
procedure and each exclusively dequeues a gate from Q1. Every slave is blind to the
presence of other slaves.

[Figure 3.6(a): Process 1 dequeues gate A from Q1 and searches for a substitute candidate
from Q2; another process dequeues a different gate from Q1 and searches Q2 in the same way.]
Figure 3.6(a). Searching for substitutes in parallel.

[Figure 3.6(b): timing diagram. Process 1 (slave) obtains B from Q1; Process 2 (master)
obtains A from Q1 and gets Process 3 as slave; Process 3 (slave) obtains C from Q1. Each
process searches for a candidate from Q2. One process finds a candidate and the others halt;
the slaves return to the slave pool, the master waits for the slaves to return, and the
master performs the substitution.]
Figure 3.6(b). A sample timing diagram for Figure 3.6(a).

When Q1 becomes empty, this marks the end of an iteration in the gate
substitution procedure. All slaves will return to the slave pool. If some substitution has
been performed before Q1 becomes empty, the master will set the queues up for the
next iteration and initiate the re-evaluation of the functions and permissible functions of
the circuit. If not, it will proceed to the next transformation procedure.

3.9. Parallel Gate Merging
Gate merging can either use CSPFs or MSPFs. However, in PTRANS, CSPFs are
used since the computation of MSPFs is a time-consuming process.
Although CSPFs allow multiple pairs of gates to be merged before re-evaluating,
this approach is not used since the number of possible merges with each evaluation of
the CSPFs is very small (usually less than 3). Hence, instead of wasting processing time
to look for another pair of gates to be merged after a pair has been found, it is more
worthwhile to recompute the CSPFs of the gates and connections and start the search
over again.
To look for a pair of gates to merge, gates nearest to the primary outputs are
examined first. This is because in gate merging, a third gate needs to be synthesized to
replace a pair of gates. However, the immediate predecessors of this gate cannot be
successors of the gates to be replaced so as to maintain a loop-free circuit. As gates
nearest to the primary outputs have fewer successors, this ordering creates a higher
probability of being able to synthesize the third gate.
The data structure to iterate the search space for gate merging involves only a
single shared queue which contains all of the gates in the circuit traversed in a
breadth-first search starting from the primary outputs. This queue comes with a
'fetch-and-advance' operator. This operator is a variant of the fetch-and-add primitive and it
atomically returns a copy of a pointer pointing to a gate in the queue and advances the
pointer to the next node in the queue. This pointer originally points to the head of the
queue. The queue being operated on by this operator remains intact.
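
A fetch-and-advance queue of this kind could be sketched as below. The class and method names are hypothetical, and a lock is used here to obtain atomicity where the original relies on a fetch-and-add style primitive.

```python
import threading


class FetchAndAdvanceQueue:
    """A shared, read-only gate queue with an atomically advancing cursor."""

    def __init__(self, gates):
        self._gates = list(gates)    # the queue contents are never modified
        self._cursor = 0             # initially points to the head
        self._lock = threading.Lock()

    def fetch_and_advance(self):
        """Atomically return the gate under the cursor and move the cursor on."""
        with self._lock:
            if self._cursor >= len(self._gates):
                return None          # every gate has been handed out
            gate = self._gates[self._cursor]
            self._cursor += 1
            return gate

    def scan(self):
        """Iterate over the whole queue from head to tail, cursor untouched."""
        return iter(self._gates)
```

Because the queue itself stays intact, a process that has fetched a gate can still scan all gates from head to tail when looking for a partner gate.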
The main loop of the gate merging procedure requires every process to
'fetch-and-advance' for a gate from the single queue. Let v be the gate which is returned by
the fetch-and-advance operator. After obtaining v, a process then scans the same queue
from its head to tail to look for a gate other than v whose CSPF intersects with the
CSPF of v. This is illustrated in Figure 3.7(a) and (b). Let F be the intersection of
these two CSPFs. After this, it tries to synthesize a third gate, y, to substitute the pair of
gates. To synthesize y, the process again scans the gates in the queue, picks those gates
that are connectable to v and adds them to the immediate predecessor set of y. After
this, it checks if the resulting function at y is a member of F. If so, this process then
minimizes the set of connectable gates obtained and informs all of the other processes
working on the same circuit to stop by means of setting a shared flag that is being
constantly polled. The stopped slaves then return to the slave pool and the merging is
performed by the master after every slave has returned. Following the merge, the master
re-initiates the gate merging procedure for the network until no further merge can be
found.
Gate merging usually takes much longer than the other
transformations. Thus, the grainsize of this procedure has to be small enough to ensure
a good load balance. If this is not so, the following scenario might occur. Suppose there
are 2 processes (on 2 processors) minimizing 2 partitions of a circuit simultaneously. At
time t = 10, Process 1 may be working on Partition 1 when Process 2 looks for a slave
for gate merging. Process 2 cannot find any slaves and has to work alone. At time t =
12, Process 1 becomes idle after having minimized its partition. As gate merging may
take a long time, Process 2 does not finish until t = 20. Hence, a processor is idle from t
= 12 to t = 20 and this is highly inefficient.

[Figure 3.7(a): gates E and F are fetch-and-advanced by Processes 1 and 2.]
Figure 3.7(a). Parallel searching in gate merging.

[Figure 3.7(b): timing diagram. Process 1 (slave) obtains E; Process 2 (master) obtains F
and gets Process 3 as slave; Process 3 (slave) obtains G. Each process searches Q1 for a
candidate mergable with its own gate. One process finds a candidate and the others halt;
the slaves return to the slave pool, the master waits for the slaves to return, and the
master performs the merge.]
Figure 3.7(b). A sample timing diagram for Figure 3.7(a).

After several experiments, the following procedure to choose the grainsize was
adopted. The program is written such that every time after the master has
'fetch-and-advanced' for a gate, it enters the slave pool to look for more slaves unless it already
owns P-1 slaves. Each of these additional slaves then proceeds straight to
'fetch-and-advance' for its gate. This is transparent to the other processes already working on
the partition. At this granularity, the processors are more efficiently utilized.

3.10. Parallel Generalized Gate Substitution/Gate Input Reduction
Generalized gate substitution and gate input reduction generally do not reduce the
size of the circuit as much as the previous transformations in relation to the amount of
processing time they take. Therefore, these procedures are combined into a single
procedure that is applied to the circuit only a constant number of times, while the former
transformations iterate until no further improvement can be made to the circuit.
In the combined procedure, a gate is first examined to see if it can be
generally-substituted by other gates. If not, gate input reduction is then applied. In gate input
reduction, a new gate is synthesized for an existing gate such that this new gate has a
smaller number of inputs than the original gate. This new gate can either be a NOR,
OR, AND or NAND gate. NOT gates are equivalent to single-input NAND or NOR
gates.
To perform parallel generalized gate substitution/gate input reduction, two queues
are used. In the first queue, Q1, gates are arranged in a breadth-first-traversal order from
the primary inputs towards the primary outputs as usual. On the other hand, the
contents of the second queue are different. This queue contains four transformation
events for each gate in the circuit. In each event, there is a label field and a gate field.
The four label fields for a gate are marked AND, OR, NOR, and NAND respectively. The
use of this field will become clear later. The gate field contains a pointer to the
corresponding gate.
The second queue, Q2, is divided into two sections. In the first section, all the
events marked 'AND' are linked consecutively and the events for the gates nearer to the
primary outputs are placed nearer to the head of the queue. In the second section, the
remaining events are linked up such that the events for the same gate are grouped
consecutively and the sequencing of the groups of events for each gate is the same as the
sequencing of the gates in the first section. An example of this arrangement is shown in
Figure 3.8.
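
The two-section layout of Q2 can be sketched as follows. The function name `build_event_queue` is hypothetical, events are modelled as `(label, gate)` tuples rather than linked records, and the order of the OR, NOR and NAND events inside each gate's group is an assumption, since the text only fixes the order of the groups.

```python
def build_event_queue(gates_outputs_first):
    """Build Q2 from gates listed nearest-to-primary-outputs first."""
    # Section 1: one 'AND' event per gate, in the given gate order.
    section1 = [("AND", gate) for gate in gates_outputs_first]
    # Section 2: the remaining three events of each gate, grouped per gate and
    # following the same gate order as Section 1.
    section2 = [
        (label, gate)
        for gate in gates_outputs_first
        for label in ("OR", "NOR", "NAND")
    ]
    return section1 + section2
```

For two gates A and B this yields the AND events for A and B first, followed by the three remaining events of A and then those of B, for a total of four events per gate.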
In an iteration of the transformation, each process exclusively dequeues an event
from the second queue. The master process will also look for more slaves in the pool at
this time unless it already owns P-1 slaves. These new slaves will immediately
exclusively dequeue an event each. If the event is marked 'AND', the process first
tries to perform generalized gate substitution on the gate in the event. The candidate
gates for the substitution are obtained by scanning the gates in the first queue in the order
in which they are enqueued. If a successful substitution is found, the other processes
will be halted and the slaves will return to the slave pool. The master then performs the
transformation and gets more slaves which will continue to dequeue events from the
second queue.

[Figure 3.8: queue diagram showing the grouped NOR, OR and NAND events of each gate in
Section 2, up to the tail of the queue.]
Figure 3.8. An example of the queues in generalized gate substitution and gate input reduction.

On the other hand, if the substitution is not successful, the process tries to perform
gate input reduction on the gate in the dequeued event. Similarly, all other processes
will be halted if a reduction is found to be successful. Again, the physical reduction is
done by the master, which first ensures that all of its slaves have returned to the pool.
During the reduction, the type of the gate to be synthesized is the same as the label
field in the event. For example, if the event is marked 'AND', the synthesized gate will
be an AND gate. After the transformation, slaves will be again employed to consume
the events in the second queue until it is empty.
For the other events which are marked NOR, OR, and NAND, the process in
charge of such an event only tries to perform gate input reduction. Therefore, the
arrangement of the events by first having a section of 'AND' events ensures that
generalized gate substitution is tested before gate input reduction is considered for a
gate.
During the physical transformation of a circuit, some gates may be deleted and
some of the events become invalidated. Deleted gates at this stage are only marked
'deleted'. Whenever a process extracts an event with a deleted gate, it will ignore that
event and continue with the next one. The end of the procedure is reached when the
event queue becomes empty. At this point, the master cleans the circuit up by freeing
those gates marked 'deleted'.
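
The mark-as-deleted convention can be illustrated with a short sequential sketch. The names `consume_events` and `transform` are hypothetical; `transform` stands in for the physical transformation and may add gates to the `deleted` set.

```python
from collections import deque


def consume_events(events, transform):
    """Consume (label, gate) events, skipping events whose gate is marked deleted.

    transform(label, gate, deleted) may mark further gates by adding them to
    the `deleted` set. Returns the events actually processed and the set of
    gates left for the final clean-up.
    """
    queue = deque(events)
    deleted = set()
    processed = []
    while queue:
        label, gate = queue.popleft()
        if gate in deleted:      # stale event: ignore it and move on
            continue
        processed.append((label, gate))
        transform(label, gate, deleted)
    return processed, deleted    # the master frees the marked gates afterwards
```

Deferring the actual freeing until the queue is empty means no process ever follows an event to a gate that has already been released.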

3.11. Ordering of Search-Spaces
The arrangement of the gates in the queues described in the various
transformations is performed so as to avoid huge differences in the search-spaces when
the program is executed on the same set of data with different numbers of processes. It
tries to force the processes to look at the gates in a specified order so that the resulting
qualities of the circuit will not differ significantly.
However, a process may sometimes be faster than others due to variances in
system load. This sometimes results in super-linear speedups and slightly different
circuit qualities as the sequence of gates being transformed varies from one execution
to another.
To avoid this problem, priorities can be assigned to order the gates in these queues,
which imposes a stricter ordering in the examination of the search-space. However, this
is not implemented in PTRANS as the orderings employed in PTRANS are after all
heuristics that do not guarantee optimum results. Hence, some degree of randomness in
iterating the search-space could even be beneficial where circuit quality is concerned,
although this could give super-linear speedups. This is evident in the non-degrading
circuit qualities over different numbers of processors, as will be presented in the next
chapter.

CHAPTER 4.
EXPERIMENTAL RESULTS

4.1. Overview of Experiments
In this chapter, the experimental results from PTRANS are presented. The initial
networks are obtained by using MIS 2.1 [BRSW87] to map the MCNC and ISCAS
benchmarks into simple gates. The ISCAS benchmarks are partitioned into smaller
circuits. These circuits and partitions are subjected to the following sequence of
transformations, the order of which can be changed with ease.
1) Pruning with CSPF.
2) Gate substitution.
3) Pruning with CSPF.
4) Pruning with MSPF.
5) Generalized gate substitution/gate input reduction.
6) Gate merging.
Very frequently, the initial networks produced by MIS 2.1 are found to have only
very few redundant connections. Consequently, pruning with MSPF is not used after
Step 1. However, after gate substitution is performed, it is usual to find more redundant
connections. Thus, pruning with both CSPF and MSPF is applied in Steps 3 and 4 to
remove all redundant connections. Gate merging is applied last as it takes the longest
time for execution and thus saves more time when applied to circuits after they have been
minimized by the other transformations.

4.2. Circuit Degradation with Number of Processors
In this section, the relation between the final circuit quality and the number of
processors used is investigated. The results are tabulated in Table 4.1 where g refers to
gate count and c refers to number of connections.
When a circuit has to be broken down to multiple partitions, each partition is
minimized one after another. The qualities obtained by minimizing the partitions
simultaneously are identical to those obtained by consecutive minimization.
On the whole, the circuit qualities do not degrade with increasing number of
processors. In fact, some of them even show better circuit qualities. These slight
variances in qualities are due to the fact that when different numbers of processors are
used, different numbers of gates are tested for the possibility of transformation at the
same time. As the timings of multiple processes are indeterminate, this may result in
different sequences of gates being transformed, which affects the final quality of the
circuit.

Circuit   No. of      Initial    1 processor  2 processors  4 processors  8 processors
          Partitions  (g/c)      (g/c)        (g/c)         (g/c)         (g/c)
f51m      1           131/270    81/157       81/157        81/157        81/157
5xp1      1           129/279    79/153       79/153        78/151        75/147
9sym      1           205/470    173/360      173/362       180/379       187/394
bw        1           205/481    145/289      146/291       150/298       149/300
sao2      1           129/310    99/213       99/2_3        109/231       111/235
vg2       1           158/391    73/163       73/1_:2       73/162        73/162
rd73      1           135/325    115/243      115/243       115/241       110/235
duke2     1           485/1224   352/736      344/715       343/718       352/737
alupla    1           114/223    103/205      103/205       103/205       103/205
misex1    1           69/154     51/102       51/102        50/100        51/101
misex2    1           87/233     82/178       82/ 78        87/178        82/178
misex3c   1           493/1231   363/776      360/770       355/755       356/766
C432      2           198/411    166/345      166/345       166/345       166/345
C499      2           526/942    493/892      493/892       493/892       493/892
C880      4           342/688    313/362      313/352       313/362       313/362
C1355     4           492/1018   507/1022     507/1022      507/1022      507/1022
C1908     4           599/1220   448/914      449/914       448/914       453/940
Table 4.1. Circuit quality versus the number of processors used.

4.3. Efficiency of Intra-Partition Load Balancing
After studying the effects of multiple processors on the circuit quality, this section
reports on the efficiency of the implementation of PTRANS for a single circuit or
partition, which is based on the speedups obtained and load balance of the processes.
The speedups in Table 4.2 are computed using the longest processing time taken
among the processes rather than user time, as the disk used as a temporary storage is a
sequential bottleneck. This can be avoided by using multiple disks.

Circuit   1 processor  1 processor  2 processors  4 processors  8 processors
          Time (sec)
f51m      217          1.0          1.8           3.2           3.4
5xp1      182          1.0          1.9           2.9           4.4
9sym      5408         1.0          2.8           4.1           8.3
bw        793          1.0          2.2           2.8           3.7
sao2      1556         1.0          1.9           3.6           .5
vg2       1603         1.0          1.6           2.8           4.6
rd73      967          1.0          2.1           3.0           5.5
duke2     19650        1.0          2.5           4.2           7.3
alupla    6560         1.0          1.9           3.6           6.5
misex1    115          1.0          1.8           3.2           4.8
misex2    264          1.0          1.7           3.1           3 Q7
misex3c   72834        1.0          2.5           3.7           4.9
Table 4.2. Speedups for circuits using intra-partition parallelism on one partition.

Table 4.2 shows cases of super-linear speedups for some circuits like 9sym and
duke2 as the number of processors used varies. This is again due to the varying
sequence of gates being transformed.
On the other hand, PTRANS produces consistent final qualities for the circuit
alupla when minimized by 1, 2, 4 and 8 processors as shown in Table 4.1. It is very
likely that the sequences of gates being transformed are the same throughout these runs.
The speedups obtained for minimizing this circuit are graphed in Figure 4.1. The
deviation from the ideal linear speedup is small.
As the sequence of gates being transformed varies from one execution to another,
the speedups shown in Table 4.2 are not sufficient to show the efficiency of PTRANS.
Table 4.3 shows the actual load balance when 8 processes are used. The values are
obtained by measuring the processing time of each of the processes in each run, and the
longest of which is scaled to 100 time units. The rest of the processing times are
expressed as a percentage of this time. As shown in Table 4.3, the processes' loads fall
above 80% of the largest load in most cases, which implies that the efficiency is greater
than 0.8 most of the time.

[Figure 4.1: plot of speedup against number of processors.]
Figure 4.1. Speedup obtained for minimizing the circuit alupla.

          Proc. 1's  Proc. 2's  Proc. 3's  Proc. 4's  Proc. 5's  Proc. 6's  Proc. 7's  Proc. 8's
Circuit   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)
f51m      100.0      93.7       77.7       84.1       65.1       65.1       60.3       60.3
5xp1      100.0      97.6       87.8       87.8       80.5       80.5       80.5       75.6
9sym      100.0      91.2       88.6       87.3       84.3       86.7       81.0       80.6
bw        100.0      59.0       71.0       82.0       63.1       47.5       47.5       50.2
sao2      100.0      91.8       82.4       82.4       82_q       78.4       84.4       85.0
vg2       100.0      92.8       87.5       87.5       83.8       83.2       82.9       85.8
rd73      100.0      96.0       96.0       96.0       91.0       84.2       83.6       81.9
duke2     100.0      99.9       94.1       88.3       83.5       87.1       84.3       86.5
alupla    100.0      90.7       93.7       87.1       87.6       84.0       87.1       82.2
misex1    .0         1.7        87.7       3.3        3.3        75.0       66.7       6 .5
misex2    100.0      68.1       65.3       65.3       72.2       59.7       59.7       62.5
misex3c   100.0      96.5       94.6       95.6       95..       94.0       94.1       94.5
Table 4.3. Load balance for circuits using intra-partition
parallelism on one partition on 8 processors.
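
The scaling used for the load-balance figures, and the link between per-process load and efficiency, can be checked with a few lines. The function names are hypothetical, and taking efficiency to be the mean load divided by the largest load is an assumption made for this sketch; the text only states the implication informally.

```python
def scale_loads(times):
    """Express each processing time as a percentage of the longest one."""
    longest = max(times)
    return [100.0 * t / longest for t in times]


def efficiency(loads_percent):
    """Mean load relative to the largest load (assumed definition)."""
    return sum(loads_percent) / (100.0 * len(loads_percent))


# Row for duke2 as listed in the load-balance table for 8 processes.
duke2 = [100.0, 99.9, 94.1, 88.3, 83.5, 87.1, 84.3, 86.5]
print(round(efficiency(duke2), 3))
```

Under this definition, if every process carries at least 80% of the largest load, the mean cannot fall below 0.8.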

4.4. Efficiency of Inter-Partition Load Balancing
In Section 4.3, the efficiency of PTRANS in minimizing a single partition or
circuit is examined. As large circuits need to be broken into multiple partitions before
they can be minimized, this section investigates the efficiency of PTRANS when
minimizing these partitions simultaneously.
By Amdahl's Law, the efficiency of any parallel program decreases with
increasing number of processors. When multiple partitions are minimized
simultaneously, the average number of processors per partition is smaller than when a
single partition is minimized by the same number of processors. Hence, a higher
efficiency is expected.
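
The argument from Amdahl's Law can be made concrete with a small calculation. The parallel fraction of 0.9 used here is purely illustrative and is not a measured property of PTRANS.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's Law: speedup when only part of the work can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def amdahl_efficiency(parallel_fraction, n_processors):
    """Speedup divided by the number of processors."""
    return amdahl_speedup(parallel_fraction, n_processors) / n_processors


for n in (2, 4, 8):
    print(n, round(amdahl_efficiency(0.9, n), 3))
```

With this illustrative fraction the efficiency falls steadily as the number of processors grows from 2 to 8, so giving each partition fewer processors keeps each partition in the more efficient regime.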

Circuit   No. of      1 processor  1 processor  2 processors  4 processors  8 processors
          partitions  Time (sec)
C432      2           4911         1.0          2.0           3.4           5.9
C499      2           5231         1.0          1.9           3.7           7.5
C880      4           1535         1.0          2.0           3.8           7.0
C1355     4           33517        1.0          2.0           3.9           7.4
C1908     4           6219         1.0          2.0           3.8           5.3
Table 4.4. Speedup for multiple partitions of circuits minimized simultaneously
with combination of inter- and intra-partition parallelism.

From Table 4.1, the final qualities of the ISCAS circuits except C1908 remain
consistent. Hence, it is also very probable that the sequences of transformations during
the minimization of these circuits are the same throughout the runs from 1 to 8
processors. The speedups obtained for such circuits are shown in Table 4.4. These
speedups are much nearer to linear than those shown in Table 4.2, suggesting the high
efficiencies achieved. This efficiency is also expressed in terms of the load balance
between the processes in Table 4.5.
In Table 4.5, again, the processing time of each process is expressed as a
percentage of the longest processing time in each run. As can be seen, the load of each
process is greater than 90% of the heaviest load for every circuit except for C432.
Even for C432, the largest load imbalance is a mere 26%. This shows the effectiveness
of the dynamic load balancing strategy used.

          Proc. 1's  Proc. 2's  Proc. 3's  Proc. 4's  Proc. 5's  Proc. 6's  Proc. 7's  Proc. 8's
Circuit   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)   load (%)
C432      100.0      98.9       81.3       80.7       77.1       80.7       81.1       73.9
C499      100.0      99.0       94.3       99.0       98.4       96.1       98.8       96.4
C880      100.0      94.0       98.6       96.2       96.2       97.7       95.8       98.6
C1355     100.0      92.4       97.6       95.3       92.2       95.8       94.2       91.2
C1908     100.0      95.5       96.8       97.6       95.2       95.0       95.8       94.9
Table 4.5. Load balance for 8 processors running on multiple partitions of the
circuits simultaneously with combination of inter- and intra-partition parallelism.

Finally, a comparison is made to investigate the differences between using
inter-partition load balancing and not using it. To do this, every partition of each circuit is
minimized one after another using 8 processors. The longest processing time each
partition takes is recorded and is shown in Columns 3 through 6 of Table 4.6. Column
7 shows the sum of these times over the partitions of each circuit. Hence, if the partitions of each
circuit are minimized in parallel consecutively, the total time needed for each circuit is
limited by the longest processing time for each partition of the circuit and this value is
reflected in Column 7. In Column 8, the longest processing times taken when all of the
partitions of a circuit are minimized concurrently are recorded.
From the table, it is important to note that there is a large disparity between the
times in Columns 3 through 6 for a circuit. For example, this varies from 24 seconds to
3375 seconds for C1355.

Col. 1   Col. 2      Col. 3    Col. 4    Col. 5    Col. 6    Col. 7      Col. 8
Circuit  No. of      Longest   Longest   Longest   Longest   Sum of all  Minimized
         partitions  time for  time for  time for  time for  partitions  together
                     Part. 1   Part. 2   Part. 3   Part. 4
                     (sec)     (sec)     (sec)     (sec)     (sec)       (sec)
C432     2           345       568                           913         827
C499     2           526       285                           811         701
C880     4           32        55        69        90        246         218
C1355    4           661       563       24        3375      4623        4537
C1908    4           1013      386       34        34        1467        1178
Table 4.6. Efficiency of using inter-partition parallelism.

With inter-partition load balancing, there is a significant difference between the
time needed to minimize the partitions simultaneously and consecutively. This shows
that the inter-partition load balancing further enhances the overall efficiency achieved
by intra-partition load balancing alone.
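
The comparison between consecutive and simultaneous minimization can be reproduced from the figures in Table 4.6. The dictionary layout and the name `saving_percent` are choices made for this sketch only.

```python
# Per-partition longest times (sec) and the time when minimized together (sec),
# copied from Table 4.6.
TABLE_4_6 = {
    "C432": ([345, 568], 827),
    "C499": ([526, 285], 701),
    "C880": ([32, 55, 69, 90], 218),
    "C1355": ([661, 563, 24, 3375], 4537),
    "C1908": ([1013, 386, 34, 34], 1178),
}


def saving_percent(part_times, together):
    """Time saved by minimizing all partitions together, as a percentage of
    the time needed when the partitions are minimized one after another."""
    consecutive = sum(part_times)
    return 100.0 * (consecutive - together) / consecutive


for circuit, (parts, together) in TABLE_4_6.items():
    print(circuit, sum(parts), together, round(saving_percent(parts, together), 1))
```

The sums computed here match Column 7 of the table, and the saving is positive for every circuit listed.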

4.5. Comparison among MIS 2.1, SYLON-XTRANS, and PTRANS
In this section, comparisons are made among MIS 2.1 [BRSW87],
SYLON-XTRANS 1.1 [X90], and PTRANS (our implementation) on the Encore Multimax
computer. PTRANS is executed on a single processor. The qualities of the final circuits
are measured in terms of the number of simple gates and connections (g/c). MIS 2.1 is
executed on both partitioned and non-partitioned circuits using the Boolean script. The
algebraic script is also used so as to demonstrate the effectiveness of don't-care based
minimization in the Boolean script. In this script, the circuits are simplified using
don't-cares and disjoint support filtering so that the final qualities can be compared with
that of PTRANS as it is basically a don't-care based minimization program. In the
following comparison tables, a '-' sign means that the corresponding program either runs
out of memory, could not finish within 30 hours, or is unable to handle the number of
inputs in the circuit.
Between XTRANS and PTRANS, the circuits produced by PTRANS are usually! ,
slightly bigger than those by XTRANS as shown in Table 4.7. This is because
I XTRANS 1.1 recognizes XOR and XNOR gates, which are presently not accepted by
PTRANS. The timings of XTRANS is also faster than PTRANS by a factor of about 2!

                       Non-partitioned                       Partitioned
Circuit    Initial     MIS 2.1     MIS 2.1     No. of      MIS 2.1     XTRANS      PTRANS
                       Algebraic   Boolean     partitions  Boolean
           (g/c)       (g/c)       (g/c)                   (g/c)       (g/c)       (g/c)
f51m       131/270     107/233     110/225     1           110/225     70/128      81/157
5xp1       129/279     103/224     92/185      1           92/185      62/112      79/153
9sym       205/470     163/376     175/408     1           175/408     162/346     174/363
bw         205/481     142/312     123/250     1           123/250     144/262     144/288
sao2       129/310     125/270     100/211     1           100/211     99/195      99/213
vg2        158/391     71/147      66/141      1           66/141      82/158      73/162
rd73       135/325     96/213      67/134      1           67/134      79/156      110/236
duke2      485/1224    285/627     282/627     1           282/627     327/654     348/730
alupla     114/223     109/230     134/265     1           134/265     97/192      103/205
misex1     69/154      45/99       47/88       1           47/88       46/84       51/102
misex2     87/233      75/162      78/159      1           78/159      94/181      82/178
misex3c    493/1231    393/907     311/730     1           311/730     326/680     352/757
C432       198/411     -           -           2           169/380     -           166/342
C499       526/942     511/911     521/925     2           522/946     -           493/892
C880       342/688     361/703     -           4           373/707     -           313/632
C1355      492/1018    515/915     519/523     4           543/959     -           507/1022
C1908      599/1220    528/983     -           4           554/1011    -           448/914

Table 4.7. Comparison of circuit qualities among MIS 2.1, XTRANS 1.1, and PTRANS.

to 3. However, as PTRANS uses the disk as temporary storage, time is needed to pack and
unpack the BDDs as they are transferred to and from the disk. This has been found at
times to amount to more than 50% of the total time taken by PTRANS. Hence, the actual
time used by PTRANS in performing the Transduction procedures is much smaller than the
times shown in Table 4.8. Note, however, that this feature was incorporated into PTRANS
to handle very large circuits which cannot be handled by XTRANS 1.1.
An interesting point to note is the difference in the time PTRANS and XTRANS take for
the circuit alupla. PTRANS is about 20 times faster than XTRANS. This could be due to
the difference between the BDD and the SOP representations used by the two.
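To make the difference between the two representations concrete, the sketch below compares their sizes for the n-input parity function, a standard case in which the gap is large. It is an illustration only: the helper names `robdd_size` and `count_minterms` are hypothetical, the BDD is built by brute-force enumeration under a fixed variable order (practical BDD packages do not work this way), and it is not the BDD code used in PTRANS. For parity, no two true minterms are adjacent, so the minterm count equals the number of product terms in a minimal SOP.

```python
from itertools import product
from typing import Callable, Dict, Tuple


def robdd_size(f: Callable[..., bool], n: int) -> int:
    """Count internal nodes of the reduced ordered BDD of f (order x0, x1, ...)."""
    unique: Dict[Tuple[int, int, int], int] = {}  # (level, low, high) -> node id

    def build(level: int, prefix: Tuple[bool, ...]) -> int:
        if level == n:
            return int(f(*prefix))            # terminal node 0 or 1
        low = build(level + 1, prefix + (False,))
        high = build(level + 1, prefix + (True,))
        if low == high:                       # redundant test: no node needed
            return low
        key = (level, low, high)
        if key not in unique:                 # share identical subgraphs
            unique[key] = len(unique) + 2     # ids 0 and 1 are the terminals
        return unique[key]

    build(0, ())
    return len(unique)


def count_minterms(f: Callable[..., bool], n: int) -> int:
    """Count the input combinations for which f is true."""
    return sum(1 for xs in product([False, True], repeat=n) if f(*xs))


def parity(*xs: bool) -> bool:
    return sum(xs) % 2 == 1


for n in (4, 8):
    print(n, robdd_size(parity, n), count_minterms(parity, n))
# prints:
# 4 7 8
# 8 15 128
```

The BDD grows linearly with n here while the SOP grows exponentially. The size of either representation depends on the function, and for BDDs on the variable order as well, so this single case should not be read as a general guarantee.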

                Non-partitioned                    Partitioned
Circuit    MIS 2.1     MIS 2.1     No. of      MIS 2.1     XTRANS      PTRANS
           Algebraic   Boolean     partitions  Boolean                 (1 processor)
           Time (sec)  Time (sec)              Time (sec)  Time (sec)  Time (sec)
f51m       37          77          1           77          103         217
5xp1       35          59          1           59          86          182
9sym       139         150         1           150         215         5408
bw         56          265         1           265         331         793
sao2       41          40          1           40          371         1556
vg2        28          165         1           165         626         1603
rd73       50          33          1           33          562         967
duke2      159         7897        1           7897        9801        1 50
alupla     26          361         1           261         111420      6560
misex1     10          9           1                       33          115
misex2     15          65          1           65          176         264
misex3c    379         7776        1           7776        24065       72834
C432       -           -           2           1784        -           4911
C499       86          9842        2           1769        -           5231
C880       79          -           4           380         -           15 5
C1355      86          9343        4           559         -           33517
C1908      2599        -           4           3451        -           6219

Table 4.8. Comparison of timings among MIS 2.1, XTRANS 1.1, and PTRANS on the Encore
Multimax with a single processor.

A limitation of XTRANS 1.1 is that it only handles circuits with 32 inputs or fewer.
Hence, it is unable to minimize any of the ISCAS circuits. However, XTRANS 2.0 is
currently being developed and will avoid this limitation.

The circuits used to compare XTRANS and PTRANS are also used to compare MIS 2.1 and
PTRANS. The MCNC circuits can be minimized without partitioning. The qualities produced
by both MIS 2.1 and PTRANS for these circuits are comparable, and MIS 2.1 is faster
than PTRANS.
However, MIS 2.1 is unable to handle larger circuits such as C432. Such circuits are
partitioned, and both programs are executed on these partitions. As Table 4.7 shows,
PTRANS consistently produces better qualities than MIS 2.1. In fact, some of these
final circuits are even smaller than those produced by MIS 2.1 running on the original
circuits as a whole.

Although PTRANS is usually slower than MIS 2.1, the advantage of using PTRANS is that
it is parallelizable, as can be seen from the results in the previous sections. On the
other hand, [Z91] has already shown that MIS is very difficult to parallelize. Hence,
when multiple processors are employed, PTRANS is able to both execute faster and
produce better quality circuits than MIS 2.1.
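The claim about multiple processors can be made concrete with a small break-even calculation. The sketch below is a back-of-the-envelope model, not a measurement: it assumes that PTRANS runs at a constant parallel efficiency (the value 0.8 is purely hypothetical), and the function names are illustrative. The single-processor times for misex2 (264 s for PTRANS, 65 s for the MIS 2.1 Boolean script) are taken from Table 4.8.

```python
import math


def required_speedup(t_ptrans_1p: float, t_mis: float) -> float:
    """Speedup PTRANS needs over its own 1-processor time to match MIS."""
    return t_ptrans_1p / t_mis


def min_processors(t_ptrans_1p: float, t_mis: float, efficiency: float) -> int:
    """Smallest p such that t_ptrans_1p / (efficiency * p) < t_mis.

    Assumes the same (hypothetical) efficiency at every processor count.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    # Smallest integer strictly greater than required_speedup / efficiency.
    return math.floor(required_speedup(t_ptrans_1p, t_mis) / efficiency) + 1


# misex2 from Table 4.8: PTRANS 264 s on one processor, MIS 2.1 Boolean 65 s.
print(round(required_speedup(264, 65), 2))  # prints 4.06
print(min_processors(264, 65, 0.8))         # prints 6
```

Under these assumptions the break-even point for misex2 falls within the eight processors of the Encore Multimax. The real break-even point depends on the efficiency actually measured for each circuit.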
CHAPTER 5.

CONCLUSIONS

In this thesis, a parallel algorithm implementing the Transduction Method has been
proposed and implemented. In Chapter 1, an introduction to the problem of multi-level
logic synthesis is given. Chapter 2 shows the methods of computing the MSPFs and CSPFs
of simple gates. The basics of SYLON-XTRANS are also summarized. It contains four main
transformation procedures, namely, pruning, gate substitution, gate merging, and
generalized gate substitution/gate input reduction. More detailed information on these
procedures can be found in [X90].

The parallelization of XTRANS is described in Chapter 3. The major problem PTRANS faces
is the need for large input circuits to achieve high processor utilization. This is
limited by the amount of memory available on our computer system. To solve this, large
circuits have to be partitioned into smaller components. PTRANS also needs to manage
the disk as temporary storage. When multiple partitions are present, PTRANS is able to
minimize them simultaneously or consecutively. The consecutive minimization of
partitions is also performed in parallel. These intra- and inter-partition parallelisms
are achieved through the multiple master-slave program model used. Furthermore, PTRANS
uses BDDs instead of the SOP representation. BDDs are generally more compact than the
latter in representing Boolean functions.
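The two levels of parallelism summarized above can be pictured with the following Python sketch. It is only a structural analogy under stated assumptions: thread pools stand in for the master and slave processes, `slave_task` is a hypothetical placeholder for one unit of Transduction work, and none of the names come from the PTRANS source. Setting `n_masters=1` corresponds to minimizing the partitions consecutively while each partition is still processed in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def slave_task(gate: str) -> str:
    # Hypothetical placeholder for one unit of Transduction work on a gate.
    return f"{gate}:done"


def master(partition: List[str], n_slaves: int) -> List[str]:
    # Intra-partition parallelism: one master farms its gates out to slaves.
    with ThreadPoolExecutor(max_workers=n_slaves) as slaves:
        return list(slaves.map(slave_task, partition))


def minimize_all(
    partitions: Dict[str, List[str]], n_masters: int, n_slaves: int
) -> Dict[str, List[str]]:
    # Inter-partition parallelism: up to n_masters masters run at once.
    with ThreadPoolExecutor(max_workers=n_masters) as masters:
        futures = {
            name: masters.submit(master, gates, n_slaves)
            for name, gates in partitions.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


partitions = {"p0": ["g1", "g2", "g3"], "p1": ["g4", "g5"]}
print(minimize_all(partitions, n_masters=2, n_slaves=2))
```

Each master owns its own pool of slaves, so a slow partition does not block the slaves of another partition. Dynamic load balancing between masters, as discussed in Chapter 4, is not modeled here.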
With both types of parallelism, the results produced by PTRANS are presented in
Chapter 4. Considering the efficiency of the implementation of PTRANS, it has achieved
good speedups and high processor utilization, even when not using inter-partition
dynamic load balancing. Of course, when inter-partition parallelism is used, the
efficiency achieved is even higher. As compared with XTRANS, PTRANS has comparable
performance. However, when executing on a single processor, it is slower than MIS 2.1
due to differences in algorithm complexities. On the other hand, PTRANS is
parallelizable and produces better quality circuits than MIS 2.1.

On the whole, the PTRANS implementation has been very successful. Future work includes
porting it to the Chare Kernel Programming Language, which is also developed at the
University of Illinois. This language is machine-independent, and will allow PTRANS to
execute on most of today's parallel machines with little or no source changes.
REFERENCES

[A78]      S. B. Akers, "Binary Decision Diagrams," IEEE TC, 1978, pp. 509-516.
[B84]      R. K. Brayton, et al., "ESPRESSO-II: A New Logic Minimizer for
           Programmable Logic Arrays," CICC, June 1984, pp. 370-376.
[B86]      R. Bryant, "Graph-Based Algorithms for Boolean Function
           Manipulation," IEEE TC, Aug., 1986, pp. 677-691.
[DBK90]    S. Dey, F. Brglez, and G. Kedem, "Corolla Based Circuit Partitioning
           and Resynthesis," 27th DAC, 1990, pp. 607-612.
[BRSW87]   R. K. Brayton, R. Rudell, A. S. Vincentelli, and A. R. Wang, "MIS: A
           Multiple-level Logic Optimization System," ICCAD, Nov., 1987, pp.
           1062-1081.
[C87]      K. C. Chen, "Program PMIN for PLA Minimization," M.S. thesis, Dept.
           of Computer Science, Univ. of Ill., Urbana, 1987.
[CHNS88]   H. Cho, G. Hachtel, M. Nash, and L. Setiono, "BEAT-NP: A Tool for
           Partitioning Boolean Networks," ICCAD, 1988, pp. 10-13.
[CM89]     K. C. Chen, and S. Muroga, "SYLON-DREAM: A Multi-level Network
           Synthesizer," ICCAD, 1989, pp. 552-555.
[DB91]     K. De, and P. Banerjee, "Logic Partitioning and Resynthesis for
           Testability," ITC, 1991.
[FFK88]    M. Fujita, H. Fujisawa, and N. Kawato, "Evaluation and Improvements
           of Boolean Comparison Method Based on Binary Decision Diagrams,"
           ICCAD, Nov., 1988, pp. 2-5.
[G86]      R. Galivanche, "A Parallel Logic Minimization Algorithm for PLA
           Synthesis," M.S. thesis, Univ. of Iowa, 1986.
[GBGH86]   D. Gregory, K. Bartlett, A. de Geus, and G. Hachtel, "SOCRATES: A
           System for Automatically Synthesizing and Optimizing Combinational
           Logic," 23rd DAC, 1986, pp. 79-85.
[GJ79]     M. R. Garey, and D. S. Johnson, "Computers and Intractability: A Guide
           to the Theory of NP-Completeness," San Francisco, CA, W. H. Freeman
           & Co., 1979.
[HMJ88]    G. Hachtel, C. Morrison, and R. Jacoby, "EXPRESSO_MLT:
           ESPRESSO for Multi-level Logic Minimization using Tautology
           Checking," ICCAD Tutorial, 1988.
[KL70]     B. W. Kernighan, and S. Lin, "An Efficient Heuristic Procedure for
           Partitioning Graphs," Bell System Technical Journal, vol. 49, 1970, pp.
           291-307.
[LM90]     J. C. Limqueco, and S. Muroga, "SYLON-REDUCE: A MOS Network
           Optimization Algorithm using Permissible Functions," ICCD, 1990.
[MF89]     Y. Matsunaga, and M. Fujita, "Multi-level Optimization using Binary
           Decision Diagrams," ICCAD, 1989, pp. 556-559.
[MK89]     S. Muroga, Y. Kambayashi, H. C. Lai, and J. N. Culliney, "The
           Transduction Method - Design of Logic Networks Based on Permissible
           Functions," IEEE TC, Oct., 1989, pp. 1404-1424.
[MWBV88]   S. Malik, A. R. Wang, R. K. Brayton, and A. S. Vincentelli, "Logic
           Verification using Binary Decision Diagrams in a Logic Synthesis
           Environment," ICCAD, Nov., 1988, pp. 6-9.
[PBP89]    S. Patil, P. Banerjee, and C. D. Polychronopoulos, "Efficient Circuit
           Partitioning Algorithms for Parallel Logic Simulation," 26th DAC, 1989,
           pp. 361-370.
[PBP91]    S. Patil, P. Banerjee, and J. H. Patel, "Parallel Test Generation for
           Sequential Circuits on General-Purpose Multiprocessors," 28th DAC,
           1991.
[SB88]     L. Soule, and T. Blank, "Parallel Logic Simulation on General Purpose
           Machines," 25th DAC, 1988, pp. 166-171.
[SB90]     H. Savoj, and R. K. Brayton, "The Use of Observability and External
           Don't Cares for the Simplification of Multi-level Networks," 27th DAC,
           1990, pp. 297-301.
[XM89]     X. Q. Xiang, and S. Muroga, "SYLON-XTRANS: A Multilevel Logic
           Network Synthesizer," IWLS, MCNC, May 1989.
[X90]      X. Q. Xiang, "Multi-Level Logic Network Synthesis System, SYLON-
           XTRANS and Read-Only Memory Minimization Procedure, MINROM,"
           Ph.D. thesis, Dept. of Computer Science, Univ. of Ill., Urbana, 1990.
[Z91]      G. Zipfel, "Parallel Algorithm for Algebraic Factorization with
           Application to Multi-Level Logic Synthesis," M.S. thesis, Univ. of Ill.,
           Urbana, 1991.
