Combinational logic synthesis is a very important phase of VLSI system design. However, the logic synthesis process requires large computing times if near-optimal quality of the logic network is desired. Parallel processing is fast becoming an attractive solution to reduce the computational time. Recently, researchers have started to investigate parallel algorithms for problems in logic synthesis and verification. Much of the work in parallel algorithms for CAD reported to date, however, suffers from a major limitation. The parallel algorithms proposed for the CAD applications are designed with a specific underlying parallel architecture in mind. Moreover, incompatibilities in programming environments also make it difficult to port these programs across different parallel machines. As a result, a parallel algorithm needs to be developed afresh for every target parallel architecture. The ongoing ProperCAD project offers an attractive solution to that problem. It allows the development and implementation of a parallel algorithm on the CHARM runtime system such that it can be executed on all the parallel machines without any change in the program. In this paper, we describe a portable parallel algorithm for logic synthesis based on the Transduction method, called ProperSYN. This algorithm uses an asynchronous message-driven data-flow model of computation, with no explicit synchronizing barriers separating different phases of parallel computation as used in many previously developed parallel algorithms. Our algorithm is therefore more scalable to large numbers of processors. The algorithm has been implemented and it runs on a variety of parallel machines. We present results on several benchmark circuits for shared memory MIMD machines like the Sequent Symmetry and Encore Multimax, distributed memory MIMD machines like the Intel iPSC/860 hypercube, and distributed processing systems like networks of SUN workstations.
This limitation has serious consequences. A parallel algorithm needs to be developed afresh for every target parallel architecture. This is compounded by the length of the software development cycle, which, for parallel programs, is considerably longer than for sequential programs. As a result, parallel programs are considerably costlier to develop than sequential programs.
There are two goals of the ProperCAD project [13] (Portable object-oriented parallel environment for CAD applications). The first goal is to develop an environment for writing portable parallel programs for VLSI CAD applications that will run on a range of parallel machines including shared memory multiprocessors such as the Encore Multimax and Sequent Symmetry, distributed memory multicomputers such as the Intel iPSC/860, and networks of workstations. This is being accomplished in two phases. The first phase involved the use of the CHARM language and runtime system, which is based on extensions to the C language [14, 15], on top of which we developed CAD application specific libraries to support the development of parallel CAD applications. The second phase involves the integration of the CAD libraries and the machine independent runtime support in C++ in a truly object-oriented manner. The other goal of ProperCAD is to develop new parallel algorithms for CAD around existing sequential algorithms, with a strict interface defining the interactions between the two. This will enable the parallel algorithms to benefit from future improvements in the sequential algorithms.
The ProperCAD approach to the design of parallel CAD algorithms is illustrated in Figure 1. A parallel algorithm is designed around an existing uniprocessor algorithm by identifying modules in the uniprocessor code and designing a well-defined interface between the parallel and sequential code.
Parallel execution in the ProperCAD environment, based on the CHARM runtime system [14, 15], follows a coarse-grained data-flow model and is message driven. Conceptually, a pool of messages is created where each message may either request the creation of a new object or be destined for an existing object. An object can only create new messages or be woken up by messages sent to it at designated entry points. Processes are not allowed to block or to explicitly make receive requests from other processors.
Every message thus represents some work. The machine specific code (see Figure 1) is responsible for dynamically distributing the messages across the available processors. A choice of queuing strategies is also available to the programmer to influence the order in which messages in the pool are processed. The strategy may be depth-first, breadth-first, or priority-based. In the priority-based strategies, the programmer provides a priority with every message, which determines the order in which the messages in the pool are processed.
One process is provided per processor for data abstraction for one type of data [13]. All the processes on that processor can access that data using access routines. These special processes are called work managers.
In this paper, we present a new parallel algorithm for logic synthesis using the Transduction method [4], built on the first phase of the ProperCAD project, namely, the one based on the CHARM system [14, 15]. The resulting system is named ProperSYN. The successful implementation of SYLON-XTRANS [16] and its variants [17] has proven that the Transduction method is a powerful tool in the synthesis of multi-level networks. However, the Transduction method has the disadvantage of large execution times for large networks. We believe that an efficient parallel algorithm for logic synthesis using the Transduction method will make the runtimes manageably small. The parallel algorithm for logic synthesis in the ProperSYN system is novel and different from the parallel algorithm described by Lim et al. [11, 12]. Lim's algorithm is suitable for shared memory multiprocessors, and uses a shared circuit description over all processors and explicit barrier synchronization between phases of transformations. Our algorithm uses an asynchronous message-driven computing model with no synchronizing barriers separating phases of parallel computation, and is therefore inherently more scalable to larger numbers of processors. Scalability also depends on other important aspects like load balancing and scheduling. To achieve good load balancing and scheduling, we create an order of magnitude more jobs than the number of processors available. Those jobs are automatically distributed by the CHARM runtime system. Since the number of jobs is large and the size of each job is small, the average work performed by each processor tends to be the same. While creating jobs, we took into consideration the communication to computation ratio to make sure that the communication overhead remains low. Also, our algorithm is portable across a wide variety of parallel architectures. Finally, our algorithm is built around a well-defined sequential algorithm interface, so that we can benefit from future expansion of the sequential algorithm.
Some preliminary results of this algorithm were reported in [18].
Section 2 provides a brief background on the Transduction method of logic synthesis. Section 3 describes the portable parallel algorithm for logic synthesis developed in the ProperSYN system. Section 4 describes the results of the same parallel algorithm on different parallel processing platforms, namely shared memory MIMD multiprocessors such as the Sequent, distributed memory MIMD multicomputers such as the Intel i860 hypercube, and distributed environments such as a network of SUN workstations.
2 Overview of the Transduction Method
Terminology
In this paper, we will consider loop-free multi-level combinational networks consisting of simple gates such as AND, OR, NAND, NOR and NOT gates. However, it is conceptually trivial to extend the methods discussed in our paper to handle more complex gates like XOR and XNOR.
Let $g$ be the number of gates in a multilevel network, $V = \{v_1, v_2, \ldots, v_g\}$ be the set of gates in the network, and $C = \{c_{ij}\}$ be the set of connections, where $c_{ij}$ connects the output of gate $v_i$ to an input of gate $v_j$. A gate $v_i$ is an immediate predecessor of $v_j$ if there exists a connection $c_{ij}$. In that case $v_j$ is an immediate successor of $v_i$. If there exists a sequence of gates through which one can reach the gate $v_{k_t}$ from the gate $v_{k_1}$, then $v_{k_t}$ is a successor of $v_{k_1}$ and $v_{k_1}$ is a predecessor of $v_{k_t}$.
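The successor relation defined above is plain reachability in the directed acyclic graph of gates. A minimal sketch follows, assuming connections $c_{ij}$ are stored as integer pairs `(i, j)`; the function name `successors` and the sample network are illustrative and not taken from ProperSYN.

```python
from collections import deque

def successors(connections, start):
    """Return every gate reachable from `start` along connections (i, j)."""
    fanout = {}
    for i, j in connections:
        fanout.setdefault(i, []).append(j)
    seen, queue = set(), deque([start])
    while queue:
        gate = queue.popleft()
        for nxt in fanout.get(gate, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical network: v1 -> v3, v2 -> v3, v3 -> v4.
C = [(1, 3), (2, 3), (3, 4)]
print(successors(C, 1))
```

Predecessors can be obtained the same way by reversing each pair before building the adjacency map.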
A network can be viewed as a directed acyclic graph, consisting of the gates as the nodes in the graph and the connections as the edges in the graph. The Transduction method is based on the concept of sets of permissible functions [4]. The output function of a gate can be any member of its set of permissible functions without changing the functionalities of the primary outputs of the network. Hence, conceptually, sets of permissible functions are equivalent to the observability don't cares. There are two types of permissible functions, the maximum set of permissible functions (MSPF) and the compatible set of permissible functions (CSPF). As its name suggests, the MSPF of a gate or connection contains the largest set of functions which the output of a gate can take without changing the functions of the primary outputs. On the other hand, the CSPF of a gate or connection is a subset of its MSPF, calculated based on some ordering of the inputs at each gate. The computation time of the CSPF is much shorter than that required for the MSPF, though the CSPF is less effective. The methods of computation of MSPF and CSPF can be found in [4]. For the parallel algorithms reported in this paper we used CSPF. One can use MSPF as well without any conceptual change to our algorithms, accepting larger runtimes and memory requirements in exchange for better circuit quality.
Types of Transformation
There is a suite of transformations and optimizations that can be applied to the network with the help of the CSPFs of the gates and the connections. The Transduction method consists of four main transformations, namely, gate substitution, gate merging, generalized gate substitution and gate input reduction. It also prunes redundant connections in the network.
In gate substitution, one searches for a pair of gates, $v_i$ and $v_j$, such that the output function of $v_i$ is a member of the CSPF of $v_j$. In this case, one can replace all the output connections of $v_j$ by new connections from $v_i$. Gate $v_j$, along with the fanout-free input cones of $v_j$, is deleted from the network.
In gate merging, one searches for a pair of gates, $v_i$ and $v_j$, such that the intersection of their CSPFs is nonempty. One can then try to synthesize a new gate, $v_k$, whose output function is a member of the intersection of the CSPFs of $v_i$ and $v_j$. This is achieved by connecting the inputs of $v_k$ to a minimal set of existing gates in the network through the connectability condition [4]. After this, $v_i$ and $v_j$ can both be deleted from the network, along with their fanout-free input cones.
A more general version of gate substitution is generalized gate substitution. In this transformation, if all the output connections of a gate, $v_i$, can be replaced by output connections from a set of existing gates in the network, then the gate $v_i$ can be removed from the network.
Finally, in gate input reduction, one tries to reduce the number of connections in the network. Here, for every gate $v_i$ in the network, one tries to synthesize a new gate $v_i'$ whose output function is a member of the CSPF of $v_i$ and which has fewer input connections than $v_i$. The inputs to $v_i'$ come from the outputs of a minimal set of existing gates in the network, and such connections can be found effectively by using the connectability condition [4]. With these transformations, the Transduction method is able to produce near-optimal networks [16].
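The two tests that drive gate substitution and gate merging (membership of an output function in a CSPF, and non-emptiness of the intersection of two CSPFs) can be sketched on truth tables. The sketch below assumes a CSPF is stored as a pair of bitmasks `(on, off)`, where bit k of `on` means the output must be 1 on input combination k, bit k of `off` means it must be 0, and all other positions are don't cares. This bit-vector encoding and the names `is_member` and `intersect` are illustrative assumptions; ProperSYN itself uses BDDs.

```python
def is_member(f, cspf):
    """True if truth table `f` (a bitmask) lies in the permissible set `cspf`."""
    on, off = cspf
    return (f & on) == on and (f & off) == 0

def intersect(a, b):
    """Intersection of two permissible sets, or None if it is empty."""
    on, off = a[0] | b[0], a[1] | b[1]
    # A position required to be both 1 and 0 leaves no permissible function.
    return None if on & off else (on, off)

# Two-input example: 4 input combinations, so 4-bit truth tables.
cspf_vj = (0b1000, 0b0001)   # must be 1 on row 3, 0 on row 0, free elsewhere
AND, OR, NOR = 0b1000, 0b1110, 0b0001
print(is_member(AND, cspf_vj), is_member(OR, cspf_vj), is_member(NOR, cspf_vj))
```

Under this encoding a gate computing AND or OR could substitute for the gate owning `cspf_vj`, while a NOR gate could not.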
3 Parallel Algorithms for Transduction Method
Parallelism using Partitioning
An obvious way of extracting significant speedup out of the logic synthesis application is to generate a large number of partitions and to synthesize each partition independently. The results of the individual partitions can then be merged back. Unfortunately, such an approach has the problem that with an increasing number of partitions, the quality of the overall network degrades. Specifically, it has been shown that even if the individual partitions are synthesized perfectly such that they are each completely irredundant, the resultant network can still have redundancies when the synthesized partitions are reconnected to form a larger network [19]. This is because the synthesis procedure for a partition only synthesizes within that partition, treating it as an independent block. It does not take any global information into consideration during minimization. In the interest of better quality, one should, therefore, choose a minimum number of partitions. One is then forced to resort to parallelism of operations within each partition, which is much harder to exploit, and one may not get good speedups within a partition. There is clearly a tradeoff between result quality and execution time, governed by the number of partitions. A theory for choosing the optimal number needs to be developed but is outside the scope of this paper.
Our Approach using Work Decomposition
In our ProperSYN algorithms, we do not use the partitioning approach. Although in our approach the logic network is divided into a set of non-overlapping partitions, the partitioning is for parallelization purposes only; the transformations and the optimizations are performed on the entire network. We used a very simple partitioning strategy. The user specifies the maximum number of gates per partition, max gates. Initially, all primary outputs of the given circuit are put in a list called cone tips. Then a node is picked from the list cone tips, and the transitive fanins of that node which do not have their partition assigned yet are put into a new partition in breadth-first manner until the number of nodes in that partition reaches max gates, after which a new partition is formed. Then all the fanin nodes of the nodes in the partition are put at the end of cone tips. This procedure continues until all the nodes in the network are assigned to some partition. The algorithm is shown in Figure 2.
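Since Figure 2 is not reproduced here, the partitioning procedure can be sketched as follows. This is one reading of the description above, not the authors' code: it assumes the picked node itself joins the new partition, that `fanins` maps each gate to its gate fanins only (primary inputs omitted), and that partitions are numbered from 1 in order of creation. The identifiers `max_gates` and `cone_tips` mirror the names used in the text.

```python
from collections import deque

def partition(fanins, primary_outputs, max_gates):
    """Assign every gate reachable from the outputs to a numbered partition."""
    part_of = {}                          # gate -> partition number
    cone_tips = deque(primary_outputs)
    next_id = 1
    while cone_tips:
        tip = cone_tips.popleft()
        if tip in part_of:
            continue                      # already claimed by an earlier partition
        members, frontier = [], deque([tip])
        # Breadth-first sweep over unassigned fanins, capped at max_gates.
        while frontier and len(members) < max_gates:
            gate = frontier.popleft()
            if gate in part_of:
                continue
            part_of[gate] = next_id
            members.append(gate)
            frontier.extend(fanins.get(gate, []))
        # Unassigned fanins of the finished partition become later cone tips.
        for gate in members:
            cone_tips.extend(f for f in fanins.get(gate, []) if f not in part_of)
        next_id += 1
    return part_of

# Hypothetical 4-gate circuit: o = f(a, b), a = f(c), b = f(c).
fanins = {"o": ["a", "b"], "a": ["c"], "b": ["c"], "c": []}
print(partition(fanins, ["o"], max_gates=2))
```

In this sketch, partitions nearer the primary outputs receive smaller numbers than those nearer the primary inputs.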
For each partition, one object is created. Processes that perform various operations on these objects are distributed to different processors using load balancing methods supported by the CHARM runtime system, as depicted in Figure 3. Typically, we create a large number of small partitions so that load balancing is good. But we make sure that the amount of computation required by each partition is roughly an order of magnitude higher than the communication time for sending a message between objects. We will denote the object for a partition p as partition object p. Each processor has a copy of the network, but has responsibility only for the shaded regions of the network. Effectively, we can say that each processor "owns" some parts of the network. The quality of the resultant circuit does not vary with the partitioning strategy beyond the statistical variation shown in Section 4.2, because the partitioning is only used for the purpose of division of work. The evaluation time for the output functions and the CSPFs does depend on the partitioning strategy. But determining a good partitioning strategy that minimizes these evaluation times is a non-trivial procedure and can increase the overall runtime of any synthesis algorithm. The evaluation times for the output functions and the CSPFs are a small part of the total runtime of the algorithm. Hence, we decided to use a simple partitioning strategy.
Parallel Evaluation of Output Function
For optimization purposes, we need to evaluate the output functions at each gate and connection in the network. The output functions need to be expressed in terms of the primary inputs of the network. We compute output functions using binary decision diagrams (BDDs) [20].
A major problem we have to deal with is the fact that, since different processes perform different optimizations simultaneously on the network, the output functions and the CSPFs keep changing. It is important to provide some coherence mechanism among the various applications of these optimizations in parallel. To do that, with each BDD we keep a tag, called the version. Also, with each gate and connection, we keep the current version number of the output function and the CSPF. A BDD is "valid" if its version number is current. Any transformation in the network changes the output functions and the CSPFs in some parts of the network. The version numbers are used to prevent any "illegal" transformation done using an "invalid" BDD. Initially, the versions of the output functions and the CSPFs are set to 1. If, due to any transformation, the output function or CSPF of a gate or connection becomes "invalid", the corresponding version number attached to it is incremented by 1.
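The version mechanism can be sketched with a small table of current version numbers against which each tagged BDD is checked. The class and method names below are hypothetical, and the BDD payload itself is left out; only the tagging rule from the text (start at 1, increment on invalidation, valid only if current) is modeled.

```python
class VersionTable:
    """Tracks the current version of the function attached to each key."""

    def __init__(self):
        self.current = {}

    def version(self, key):
        # Versions start at 1, as stated in the text.
        return self.current.setdefault(key, 1)

    def invalidate(self, key):
        # A transformation made the stored function stale: bump the version.
        self.current[key] = self.version(key) + 1

    def is_valid(self, key, tagged_version):
        # A BDD is "valid" only if its tag equals the current version.
        return tagged_version == self.version(key)

table = VersionTable()
key = ("g1", "cspf")          # hypothetical key: (gate name, kind of function)
tag = table.version(key)      # tag carried by a BDD computed now
table.invalidate(key)         # some transformation touches g1
print(table.is_valid(key, tag))
```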
The evaluation of output functions starts from the primary inputs of the network. The output function of a gate can be evaluated if "valid" output functions of its fanins are available. Any gate whose fanins are primary inputs only can be evaluated without any constraints. The gates which have fanins from other partitions cannot be evaluated until the output functions of those fanins arrive from the other partitions.
After the output function of a gate is evaluated in a partition, it is checked whether that output function is needed in any other partition. If the other partition that needs the output function is on the same processor, that partition object is informed about the availability of the output function of the gate. But if it is on a different processor, then the current partition object invokes the work manager on the current processor, which in turn transmits the BDD to the work manager of the destination processor, and then the work manager on the destination processor sends a message to the corresponding processes on that processor about the availability of the output function of the gate. We use the work manager to ensure that only one copy of the BDD is sent across the processor boundary. The procedure is illustrated in Figure 4.
Whenever a partition object receives a message about the availability of the output function of a gate, it checks to make sure that it is a "valid" BDD. It then looks for the gates in this partition which are waiting for the output function of that gate. Those gates which now have the output functions of all their fanins available can be evaluated. It is clear from the above description that the parallel evaluation of the output functions is performed using an asynchronous message-driven data-flow computational model. We do not have any synchronizing barrier for levels of gates, as is used in parallel logic simulation algorithms, and hence our parallel algorithm is expected to scale well to large numbers of processors.
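The firing rule described above (a gate is evaluated as soon as the output functions of all its fanins have arrived) can be illustrated on a single processor. The sketch below is a deliberate simplification: it uses truth-table bitmasks in place of BDDs, one message pool in place of the CHARM runtime, and omits partitions, work managers, and version checks. All names are illustrative.

```python
from collections import deque

def evaluate(gates, inputs):
    """gates: name -> (op, [fanin names]); inputs: name -> truth-table bitmask."""
    waiting = {g: set(fins) for g, (_, fins) in gates.items()}
    fanout = {}
    for g, (_, fins) in gates.items():
        for f in fins:
            fanout.setdefault(f, []).append(g)
    value = dict(inputs)
    pool = deque(inputs)      # each message announces one available function
    order = []                # order in which gates fired
    while pool:
        src = pool.popleft()
        for g in fanout.get(src, []):
            waiting[g].discard(src)
            if not waiting[g] and g not in value:
                # All fanins have arrived: the gate can be evaluated now.
                op, fins = gates[g]
                value[g] = op(*(value[f] for f in fins))
                order.append(g)
                pool.append(g)
    return value, order

MASK = 0b1111                 # two primary inputs, four input combinations
nand = lambda x, y: ~(x & y) & MASK
gates = {"n1": (nand, ["a", "b"]), "n2": (nand, ["n1", "a"])}
value, order = evaluate(gates, {"a": 0b1100, "b": 0b1010})
print(order, bin(value["n2"]))
```

No level-by-level barrier appears anywhere in the loop; a gate fires purely because its last missing input message was processed.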
Parallel Evaluation of CSPF
The flow of evaluation of the CSPF moves from the primary outputs to the primary inputs. If the output function of a primary output is available, its CSPF can be calculated immediately. If no external don't care [21] is specified for that primary output, then the CSPF is the same as the output function; otherwise the CSPF is the union of the output function and the external don't care.
The CSPF of an input connection to a gate can be computed if the output functions of all the input connections to that gate are available, as well as the CSPF of that gate. Then the CSPF of the connection is computed as a function of the type of the gate, the CSPF of the gate, the output functions of the sibling connections, and the output function of the connection itself [4]. The CSPF of a gate can be computed if the CSPFs of all its output connections are available. The CSPF of a gate is the set intersection of the CSPFs of its output connections [16].
The partitions which contain any primary outputs start the evaluation of the CSPF at the primary outputs. After the CSPF of an input connection is evaluated, it is checked whether the connection is driven by the output of a gate which is in a different partition. If that partition is on the same processor, then that partition object is informed about the availability of the CSPF of the connection. If the partition is on a different processor, the CSPF is transmitted to the other processor through the work managers of the two processors, similar to the case of the evaluation of output functions. The procedure is explained in Figure 5.
Again, it is clear from the description of the parallel algorithm that the parallel evaluation of the CSPF proceeds in an asynchronous message-driven data-flow computational model with no barriers.
Parallel Gate Substitution
Gate substitution is one of the transformations provided in the Transduction method. The main idea of gate substitution is to search a given network for pairs of gates such that, in each pair, one of the gates can substitute for the other. However, as the network needs to be loop-free, the substitute gate must not be a successor of the gate to be substituted. Hence, to minimize the chance of this happening, the substitute gate and the substituted one should be as near as possible to the primary inputs and outputs, respectively.
Let $p$ denote the number of partition objects created. Then each partition object x creates another $p$ objects of type gate sub object, denoted by (x, y) for y = 1, 2, ..., $p$. The gate sub object denoted by (i, j) is responsible for substitution of any gate in the partition object i by any gate in the partition object j. Note that another gate sub object (j, i) will also be created, whose job is just the opposite of the job of (i, j). Hence, the number of gate sub objects created will be $p^2$. We have already mentioned that it is preferable to have the partition i close to the primary outputs and the partition j close to the primary inputs to maximize the chance of finding a gate substitution pair. So, we adjust the priorities of the different gate sub objects to achieve that (the priorities are used to direct the CHARM runtime system to process certain messages with higher priority). As shown in the partitioning algorithm given in Figure 2, the partitions closer to the primary outputs are formed earlier than the partitions closer to the primary inputs. Hence the number designating a partition closer to the primary outputs will be smaller than the number designating a partition closer to the primary inputs. A simple method of assigning priority to gate sub object (i, j) is therefore to assign higher priorities to the objects with smaller i and larger j. The equation we used to assign the priority to a gate sub object (i, j) is

$$\mathrm{priority}(\text{gate sub object}(i, j)) = 100000 + 1000\,i - j$$
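To see the ordering this formula induces, the short sketch below enumerates the $p^2$ gate sub objects and sorts them by priority value. It assumes that a numerically smaller value is served first, which is what makes smaller i and larger j more urgent under this formula; that convention is an assumption here, not something the text states. The function name is illustrative.

```python
def gate_sub_priority(i, j):
    """Priority value of gate sub object (i, j), following the formula above."""
    return 100000 + 1000 * i - j

p = 3
pairs = [(i, j) for i in range(1, p + 1) for j in range(1, p + 1)]
# Assumption: smaller priority values are processed first.
by_priority = sorted(pairs, key=lambda ij: gate_sub_priority(*ij))
print(by_priority)
```

With fewer than 1000 partitions every pair receives a distinct value, so the induced order is total.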
The procedure is depicted in Figure 6. Now, consider the gate sub object (i, j). A gate $g_1$ in the partition i can be substituted by a gate $g_2$ in the partition j if the output function of the gate $g_2$ is a subset of the CSPF of the gate $g_1$. So the gate sub object (i, j) will need the "valid" CSPFs of all the gates in the partition i and the "valid" output functions of the gates in the partition j.
Figure 6: Creation of objects for gate substitution

When the gate sub object denoted by (i, j) is awakened by a message, it starts by collecting the output functions and the CSPFs that it needs. Some output functions and CSPFs may not be available at that time, so it asks the work manager on this processor to get the BDD and to inform it when that BDD is available. If this processor does not "own" the portion of the network which contains the gate whose BDD has been requested, the work manager on the requesting processor will ask the work manager on the owning processor to send the BDD. After this gate sub object has collected all the available "valid" BDDs, it does pairwise comparisons to find any possible gate substitution. It performs a comparison only if that gate substitution will not create a cycle in the circuit. That check is performed by marking the fanout cone of the gate to be substituted and checking that the substitute gate is not marked. If any gate substitution is possible, then it sends a message to a particular processor, designated to be the master processor for deciding about any change in the network, asking permission for this gate substitution. The message also contains the version numbers of the output function and the CSPF used for checking the gate substitution. When it is finished with all comparisons, it asks partition object i to inform it if the CSPF of any gate in the partition i changes, and it asks partition object j to inform it if the output function of any gate in the partition j changes. Whenever an output function or CSPF arrives from another processor, or the current version gets computed on this processor, the work manager checks its list to find out whether any object has asked for this BDD. If so, it sends a message to the corresponding objects.
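The cycle check described above (mark the fanout cone of the gate to be substituted, then test whether the substitute gate is marked) can be sketched as a graph traversal. The adjacency-map representation and the function name are assumptions for illustration.

```python
def creates_cycle(fanout, substituted, substitute):
    """True if replacing `substituted` by `substitute` would close a loop."""
    marked, stack = set(), [substituted]
    while stack:
        gate = stack.pop()
        for nxt in fanout.get(gate, []):
            if nxt not in marked:
                marked.add(nxt)       # nxt lies in the fanout cone
                stack.append(nxt)
    # A marked substitute would end up feeding its own transitive fanin.
    return substitute in marked

# Hypothetical network: g1 -> g3, g2 -> g3, g3 -> g4.
fanout = {"g1": ["g3"], "g2": ["g3"], "g3": ["g4"]}
print(creates_cycle(fanout, "g1", "g4"), creates_cycle(fanout, "g1", "g2"))
```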
Whenever a gate sub object (i, j) receives a message about the availability of an output function of a gate $g_2$ in the partition j, it compares this with the available "valid" CSPFs of the gates in the partition i. If any substitution is possible, it then sends a message asking for permission for that gate substitution, as explained in Figure 7.
We want to emphasize the fact that in our parallel algorithm, we check for all possible gate substitutions, as done in the sequential algorithm. We illustrate this with an example. Consider the circuit shown in Figure 8(a) with 4 gates (A, B, C and D). If we perform all possible comparisons as shown using the directed edges, there are 12 comparisons in total. Let us assume that we partition the circuit into two partitions, where partition 1 contains A and B and partition 2 contains C and D. Then our parallel algorithm will generate 4 gate sub objects. In Figure 8(b), we have shown the comparisons performed by the different gate sub objects. One can observe that the total number of comparisons performed by them is 12, which is the same as the number of comparisons performed by the sequential algorithm.
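The count in this example can be checked by enumeration. The sketch below treats a comparison as an ordered pair (gate to be substituted, candidate substitute) and assumes that gate sub object (i, j) handles exactly the pairs whose first gate lies in partition i and whose second lies in partition j; the per-object breakdown is derived from that assumption rather than read from Figure 8(b).

```python
from itertools import product

gates = ["A", "B", "C", "D"]
# Sequential algorithm: every ordered pair of distinct gates.
sequential = {(x, y) for x, y in product(gates, gates) if x != y}

partitions = {1: ["A", "B"], 2: ["C", "D"]}
parallel, per_object = set(), {}
for i, j in product(partitions, partitions):
    pairs = {(x, y) for x in partitions[i] for y in partitions[j] if x != y}
    per_object[(i, j)] = len(pairs)
    parallel |= pairs

print(len(sequential), per_object)
```

The four objects cover disjoint sets of pairs whose union is exactly the sequential set.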
When the master processor receives the request for permission for a gate substitution, it first checks whether both gates in the gate substitution pair exist. Then it checks whether the current versions of the output function of the substitute gate and of the CSPF of the substituted gate are the same as those mentioned in the message. The version check is performed to make sure that the requested gate substitution is still "legal", i.e., no other transformation performed on the network has affected this gate substitution. The master processor does not check for the acyclic property of the circuit; that check was already performed in the gate sub object. If any change was made to the network after that check, it would be reflected in a change in the version numbers of the BDDs, which would make the requested transformation "illegal". The master processor then broadcasts a message to all other processors to perform this gate substitution in their copies of the network and then itself performs the substitution. It is through this centralized update that coherence is maintained in data updates, and it is guaranteed that no errors are made in the optimization process.

After the substitution is performed, the output functions of all the successors of the substituted gate become invalid, as do the CSPFs of the predecessors of the substitute gate. This is shown in Figure 9(a). But if the output functions are reevaluated in the region where they are marked, then the CSPFs of all the gates in the transitive fanins of that region will get invalidated. Since in our approach we do not have any synchronizing barrier to decide when to start reevaluation of the output functions and the CSPFs, the reevaluation can start at any time in any order. So we mark the invalidation of the output functions and the CSPFs as shown in Figure 9(b).
Invalidation of the output function or the CSPF is done by incrementing the current version number attached to the corresponding gates and connections, such that any BDD in the system with a lower version number will be considered "invalid".
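The master's arbitration can be sketched as follows. The class is hypothetical, and two simplifications are labeled in the comments: broadcasting is modeled by appending to a list, and invalidation conservatively bumps the versions of every remaining gate instead of only the region of Figure 9(b).

```python
class Master:
    """Sketch of the master's handling of gate substitution requests."""

    def __init__(self, gates):
        self.gates = set(gates)
        self.func_version = {g: 1 for g in gates}   # output-function versions
        self.cspf_version = {g: 1 for g in gates}   # CSPF versions
        self.broadcasts = []        # stand-in for messages to other processors

    def request_substitution(self, substituted, substitute, cspf_ver, func_ver):
        # Both gates must still exist in the network.
        if substituted not in self.gates or substitute not in self.gates:
            return False
        # The BDDs the requester compared must still be current.
        if self.cspf_version[substituted] != cspf_ver:
            return False
        if self.func_version[substitute] != func_ver:
            return False
        self.broadcasts.append((substituted, substitute))
        self.gates.remove(substituted)
        # Simplification: invalidate everything that remains.
        for g in self.gates:
            self.func_version[g] += 1
            self.cspf_version[g] += 1
        return True

m = Master(["g1", "g2", "g3"])
first = m.request_substitution("g1", "g2", cspf_ver=1, func_ver=1)
stale = m.request_substitution("g3", "g2", cspf_ver=1, func_ver=1)
print(first, stale)
```

The second request is refused because it was prepared with version 1 BDDs that the first substitution has since invalidated; the requester would have to recheck with the reevaluated BDDs.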
After the invalidation of the output functions and the CSPFs, messages are sent to the corresponding partition objects to reevaluate the BDDs. After they have reevaluated the BDDs, they will inform the corresponding gate sub objects to check for more gate substitutions, as the BDDs have changed.

Parallel Gate Merging

Similar to gate substitution, gate merging is an iterative improvement procedure. In this transformation, we search for a pair of gates, ($g_1$, $g_2$), and try to synthesize a third gate $g_3$ which can replace both gates. Two gates, $g_1$ and $g_2$, are mergeable if the intersection of the CSPFs of the two gates is non-empty. Then we try to form a third gate with four simple gate types: AND, OR, NAND and NOR. For each gate type, we try to connect the inputs of the gate from a minimal set of outputs of the existing gates in the network. If we succeed in forming such a gate, we can replace both gates, $g_1$ and $g_2$, by the new gate. However, in order to maintain the loop-free condition of the network, the inputs of the new gate, $g_3$, must not come from the successors of $g_1$ and $g_2$. Hence, to maximize the chance of gate merging, we prefer the gates $g_1$ and $g_2$ to be close to the primary outputs.

Let $p$ be the number of partitions. For each partition pair (i, j), an object of type gate merge object is created. The gate merge object denoted by (i, j) is responsible for testing for any gate merging between any gate in the partition i and any gate in the partition j. In addition to that, $p$ further gate merge objects are created, one for each partition, to check for any possible gate merging within the same partition. So, in total $\binom{p}{2} + p$ objects are created. The procedure is depicted in Figure 10. Now consider a gate merge object denoted by (i, j). The gate $g_1$ in the partition i can be merged with the gate $g_2$ in the partition j if the intersection of the CSPFs of these two gates is non-empty.
Hence, this gate merge object will need the "valid" CSPFs of all the gates in the partition i and the partition j.
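Because merging is symmetric in the two gates, one object per unordered pair of partitions suffices, plus one per partition for merges within it. The sketch below enumerates the objects under that reading and compares the count with $\binom{p}{2} + p$; the function name is illustrative.

```python
from itertools import combinations
from math import comb

def gate_merge_objects(p):
    """Index pairs (i, j) of the gate merge objects for p partitions."""
    cross = list(combinations(range(1, p + 1), 2))   # unordered pairs, i < j
    same = [(i, i) for i in range(1, p + 1)]         # within one partition
    return cross + same

p = 4
objects = gate_merge_objects(p)
print(len(objects), comb(p, 2) + p)
```

For the same p, gate substitution needs $p^2$ objects because the roles of substitute and substituted gate are not interchangeable.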
Figure 10: Creation of gate merge objects and creation of objects for gate synthesis

A gate merge object behaves in almost the same manner as a gate sub object if it does not find a CSPF that it needs. It sends a request to the work manager on the current processor, which sends a message to the work manager on the corresponding processor if needed. Whenever that CSPF is available, the work manager will inform this object. When this object is finished with searching for gate merging pairs among the available CSPFs, it requests the corresponding partition objects to inform it whenever a CSPF in those partitions changes. We omit the details to avoid repeating the description of an almost identical procedure. The algorithm is described in Figure 11. If any gate merge object finds a pair of gates, ($g_1$, $g_2$), which have a non-empty intersection of their CSPFs, then it creates 4 new objects, called gate synthesis objects, one for each of the four types of simple gate: AND, OR, NAND and NOR. A gate synthesis object has the responsibility of synthesizing a gate of the type supplied to it whose output function is a member of the intersection of the CSPFs of the pair of gates which we want to merge. The inputs of the synthesized gate will be connected to a minimal set of outputs of the existing gates in the network; the successors of the two gates to be merged may not be connected to it.
Whenever a gate synthesis object is awakened, it first checks whether both gates exist and whether they have the same CSPF version numbers as they had when they were checked for merging. Then it makes a list of all the gates which are not successors of either of the gates which are to be merged. Let us denote this list as the possible input list. For every gate in the possible input list, it tries to connect that gate to the input of the synthesized gate if a "valid" output function is available. If the connection is possible, it first puts the gate in a partial input list and then checks whether the new gate synthesis is complete, i.e., whether the output function of the new gate is already a member of the intersection of the CSPFs of the gate pair to be merged. If it has already been realized, it first minimizes the number of input connections to the new gate and then asks the master processor for permission to perform this gate merging. If the synthesis process is not complete, it remembers the partial output function that has been produced so far. If a "valid" output function of any gate in the possible input list is not available, it requests the work manager on the current processor to inform it whenever that function is available. After going through the possible input list, if it cannot synthesize the gate and it has no new output function to receive from the work

When a gate synthesis object receives an output function, it again checks whether both gates which are to be merged exist and whether they have the same CSPF version numbers. It then checks whether the output function can be connected to the synthesized gate. If it is possible to connect it to the synthesized gate, it checks all the gates in the partial input list to ensure that the version numbers of their output functions have not changed. If there is no change, it just adds this new connection and updates the partial output function and the partial input list. If any change has happened, it has to discard the partial output function and perform the computations again.
The algorithm is described in Figure 12 .
When the master processor receives a message asking permission for a gate merging, it checks that the gates to be merged exist and that their CSPF versions are correct. It then checks that the immediate predecessors of the synthesized gate exist and that the versions of their output functions are correct. If every check is satisfied, it broadcasts a message to all other processors to perform the gate merging and performs the gate merging itself. After a gate merge, the output functions and CSPFs are invalidated in the same way as for the gate substitution described earlier.
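The permission step above amounts to an optimistic version check. The following is a minimal Python sketch of that idea only, not the ProperSYN implementation (which is written in C on CHARM). The class `Master`, its dictionaries, and the method names are hypothetical, and version numbers are assumed to be plain integers kept per gate.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master-side bookkeeping: the current version number of
    each gate's CSPF and output function. A gate absent from a dictionary
    is treated as no longer existing in the network."""
    cspf_version: dict = field(default_factory=dict)
    func_version: dict = field(default_factory=dict)

    def permit_merge(self, merged, inputs):
        """merged: {gate: CSPF version seen by the requester} for the two
        gates to be merged. inputs: {gate: output-function version seen}
        for the immediate predecessors of the synthesized gate.
        Returns True only if every gate exists and no version is stale."""
        for gate, seen in merged.items():
            if self.cspf_version.get(gate) != seen:
                return False
        for gate, seen in inputs.items():
            if self.func_version.get(gate) != seen:
                return False
        return True

    def apply_merge(self, merged, new_gate, affected):
        """Record a granted merge: drop the merged gates, add the new gate,
        and bump the versions of the gates whose data are invalidated."""
        for gate in merged:
            self.cspf_version.pop(gate, None)
            self.func_version.pop(gate, None)
        self.cspf_version[new_gate] = 0
        self.func_version[new_gate] = 0
        for gate in affected:
            if gate in self.cspf_version:
                self.cspf_version[gate] += 1
            if gate in self.func_version:
                self.func_version[gate] += 1


# Illustrative data only: two requests built from the same snapshot.
master = Master(
    cspf_version={"g1": 3, "g2": 1, "a": 0, "b": 0},
    func_version={"g1": 2, "g2": 2, "a": 5, "b": 1},
)
request = ({"g1": 3, "g2": 1}, {"a": 5, "b": 1})
first = master.permit_merge(*request)
if first:
    master.apply_merge(request[0], "g12", affected=["a"])
second = master.permit_merge(*request)  # stale: g1 and g2 are gone
```

In this sketch the first request is granted and the second, built from the same stale snapshot, is refused, which is the behaviour the version numbers are meant to guarantee.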
Parallel Pruning
The ProperSYN system prunes a connection to a gate if the connection is found to be redundant, i.e., if removing the connection does not change the output function of any primary output. The redundancy of a connection is checked when its CSPF is computed. If the connection is found to be redundant, permission to prune it is requested from the master processor. If the master processor finds the request "valid", it broadcasts a message to the other processors and then performs the pruning.
Parallel Generalized Gate Substitution
As the name suggests, generalized gate substitution is a more general form of gate substitution. In this transformation, every output connection of a gate is substituted by an existing gate in the network. Let $C$ be the set of output connections of a gate $g$. The gate $g$ can be generalized gate substituted if
$$\forall c \in C,\ \exists\, g' \text{ such that } \mathit{output\_func}(g') \subseteq \mathit{CSPF}(c).$$
To keep the network loop-free, the gates that substitute the gate $g$ must not be in the transitive fanout of $g$.
For each gate $g$ in the network, we create a gen gate sub object. The job of the gen gate sub object corresponding to the gate $g$ is to determine whether $g$ can be generalized gate substituted. For that purpose, it needs the CSPFs of the output connections of $g$ and the output functions of all the gates except those in the transitive fanout of $g$.
When a gen gate sub object is awakened by a message, it starts by collecting the CSPFs and the output functions it needs. If some of them are not available, it asks the work manager to supply them later. It then compares the available CSPFs and output functions. If the output function of a gate $g'$ is a subset of the CSPF of an output connection $c$, then the output connection $c$ can be substituted by the gate $g'$. If all the output connections of the gate $g$ can be substituted by existing gates in the network, permission is requested from the master processor.
If the gen gate sub object receives a CSPF of an output connection of $g$, it first checks whether this output connection can be substituted by an existing gate in the network. If all the output connections can be substituted, it asks the master processor for permission to generalized gate substitute $g$. Similarly, when it receives an output function it asked for earlier, it checks whether that output function can substitute any still unsubstituted output connection of $g$. If all output connections can be substituted, permission is requested from the master processor.
When the master processor receives a request for permission for a generalized gate substitution, it checks that the gates and connections involved exist in the network. It then checks the version numbers of the BDDs. If they have not changed, the substitution is permitted. If any version number has changed, the gen gate sub object is asked to check for the substitution again. The algorithm is given in Figure 13.
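The substitution condition for a single gate can be illustrated with a small Python sketch. It is not the ProperSYN code: ProperSYN stores functions as BDDs, whereas here each function is assumed to be a truth-table bitmask and each CSPF a pair `(ones, zeros)` of minterm masks where the function must be 1 or must be 0 (everything else is a don't care). The names `compatible`, `find_generalized_substitution`, and the dictionary layout are hypothetical, and `fanout` is assumed to hold precomputed transitive fanouts.

```python
def compatible(func, cspf):
    """True if the completely specified function `func` (bitmask over
    minterms) agrees with the CSPF on every care minterm."""
    ones, zeros = cspf
    return (func & ones) == ones and (func & zeros) == 0


def find_generalized_substitution(gate, connections, cspf, output_func, fanout):
    """Return {connection: substituting gate} if every output connection of
    `gate` can be taken over by an existing gate outside the transitive
    fanout of `gate`; return None otherwise."""
    candidates = [h for h in output_func
                  if h != gate and h not in fanout[gate]]
    choice = {}
    for c in connections[gate]:
        match = next((h for h in candidates
                      if compatible(output_func[h], cspf[c])), None)
        if match is None:
            return None  # at least one connection cannot be substituted
        choice[c] = match
    return choice


# Illustrative two-input network (4 minterms per function).
output_func = {"a": 0b1100, "b": 0b1010, "g": 0b1000,
               "h": 0b1110, "out": 0b0111}
connections = {"g": ["c1", "c2"]}
fanout = {"g": {"out"}}

cspf_ok = {"c1": (0b1000, 0b0001), "c2": (0b0110, 0b0001)}
result_ok = find_generalized_substitution("g", connections, cspf_ok,
                                          output_func, fanout)

# Only "out" would fit c2 here, but it lies in the transitive fanout of g.
cspf_bad = {"c1": (0b1000, 0b0001), "c2": (0b0001, 0b0000)}
result_bad = find_generalized_substitution("g", connections, cspf_bad,
                                           output_func, fanout)
```

The second call shows why the fanout restriction matters: the only compatible gate would create a loop, so no substitution is reported.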
Parallel Gate Input Reduction
In this transformation, one tries to reduce the number of input connections to an existing gate by synthesizing a new gate with fewer input connections. In this process, the number of connections in the network is reduced.
To synthesize a new gate that can replace an existing gate $g$ in the network, the output function of the new gate must be a subset of the CSPF of $g$. The input connections to the new gate come from existing gates in the network. To check whether a gate can be connected to the new gate, which has a given CSPF, the effectively connectable condition given in [4] is used.
For each gate in the network, four gate inp red objects are created, one for each gate type: NAND, NOR, AND and OR. Each gate inp red object tries to synthesize a new gate of the type given to it during its creation. The number of input connections to the new gate must be less than that of the existing gate. Apart from that, the parallel algorithm is almost the same as the gate synthesis algorithm described in the context of gate merging. Hence, the details of this algorithm are omitted for the sake of conciseness.
[Figure 13 (algorithm listing): Parallel Generalized Gate Substitution, for the gen gate sub object]
Overview of the ProperSYN Algorithm
In the previous subsections, we discussed the parallel algorithms for each of the transformations performed in the ProperSYN system. In this subsection, we give an overview of the entire system.
We should mention that the ProperSYN system is not the same algorithm as the Transduction system. In the Transduction system, the user specifies at the beginning the order in which the different transformations are to be applied, and the transformations are applied strictly in that order, one after the other. A typical run of the Transduction method is shown in Figure 14(a). The order in which the transformations are applied is determined by the cost-performance ratio of each transformation. For example, gate substitution has the lowest cost-performance ratio, i.e., checking for gate substitution is the cheapest operation. Hence, the gate substitution transformation is applied first to reduce the size of the network. More expensive transformations, such as gate merging, are then applied to reduce the network further. If, on the other hand, one chooses to apply the gate merging transformation first, it takes significantly more time to reduce the network.
In ProperSYN, although we use the same transformations as the Transduction method, the applications of the different transformations are intermixed. All the transformations are started simultaneously to avoid any synchronizing barrier. However, the gains of a particular ordering are preserved by the choice of priorities. We assign different priorities to the transformations according to their cost-to-performance ratio to guide the synthesis process; the actual priority values were chosen after considerable experimentation. The transformations, in decreasing order of priority, are: gate substitution, generalized gate substitution, gate input reduction and gate merging. This is the same order in which the transformations are applied in the Transduction system. There is no synchronizing barrier at the end of each transformation. Each transformation is applied repeatedly until there is no more change in the network.
A typical run of ProperSYN on 4 processors is shown in Figure 14(b). This figure emphasizes the fact that different processors may be trying different transformations at any particular time.
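The effect of priorities without barriers can be sketched as a per-processor priority queue of pending messages. This is a simplified Python illustration, not the CHARM scheduler: the class `MessageQueue`, the numeric priority values, and the underscore-separated transformation names are assumptions made for the example; only the relative order of the four transformations comes from the text.

```python
import heapq
import itertools

# Smaller number = higher priority. The relative order follows the text;
# the numeric values themselves are illustrative.
PRIORITY = {
    "gate_substitution": 0,
    "generalized_gate_substitution": 1,
    "gate_input_reduction": 2,
    "gate_merging": 3,
}


class MessageQueue:
    """Hypothetical per-processor queue of messages awaiting execution."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # FIFO order among equal priorities

    def post(self, transformation, target):
        entry = (PRIORITY[transformation], next(self._counter),
                 transformation, target)
        heapq.heappush(self._heap, entry)

    def next_message(self):
        _, _, transformation, target = heapq.heappop(self._heap)
        return transformation, target

    def __len__(self):
        return len(self._heap)


# All transformations are posted up front; no phase waits for another.
queue = MessageQueue()
queue.post("gate_merging", "g1")
queue.post("gate_substitution", "g2")
queue.post("gate_input_reduction", "g3")
queue.post("gate_substitution", "g4")

order = []
while len(queue):
    order.append(queue.next_message())
```

Cheap transformations are tried first whenever they are pending, yet an expensive one runs as soon as nothing cheaper is queued on that processor, so no processor idles at a phase boundary.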
To keep the network coherent, permission for every possible transformation is requested from the master processor. We could not use a multiple master model (namely, one master per partition) because the nodes affected by a transformation are not localized in any region. As explained earlier, all the nodes in the fanin and fanout cones are affected. Hence, we were forced to use a single master. However, the number of communications to and from the master for transformation permissions is much smaller than the total number of communications, because the number of "successes" in the search for possible transformations is very small compared to the number of searches. We found from extensive experimentation that the number of communications to and from the master is less than 1% of the total communication.
Results
The parallel algorithms developed in the last section have been implemented in the ProperCAD environment; the resulting system is called ProperSYN. The system contains approximately 12,000 lines of C code.
Run Times and Speedups on Different Parallel Machines
In this subsection, we present run times and speedups on some of the parallel machines on which ProperSYN currently runs. We would like to reiterate that no change in the parallel program is necessary to run ProperSYN on different machines. A '-' entry in any of the following tables indicates that data is not available due to limited time and limited availability of the machines.
Table 1 presents the results obtained on an Encore Multimax machine, a shared memory MIMD machine with 8 processors. Table 2 presents the results obtained on a Sequent Symmetry machine, another shared memory MIMD machine. Table 3 presents the results obtained on an Intel/860 machine, a distributed memory MIMD machine with 8 processors connected in a hypercube configuration. Table 4 presents the results in a network of workstations environment, which behaves as a distributed processing system. There is a host workstation, which distributes work to the other workstations. The workstations used in these experiments are SUN4 workstations with SPARC 1 processors.
In all the tables, we present run times for various MCNC benchmark circuits on different numbers of processors, and speedups with respect to the 1-processor run time. One can observe that our algorithm sometimes produces "super-linear" speedups. This is because synthesis is a search problem, and in a parallel implementation of a search problem, speedup anomalies are possible. If one transformation precedes another transformation, the search space may be reduced considerably, which may result in a "super-linear" speedup.
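For reference, the speedup reported in the tables is the ratio of the 1-processor run time to the p-processor run time, and a speedup is "super-linear" when it exceeds p. The short Python sketch below only spells out that arithmetic; the run times in it are made-up illustrative numbers, not values from Tables 1 to 4, and the function names are hypothetical.

```python
def speedup(t1, tp):
    """Speedup of a p-processor run relative to the 1-processor run time."""
    return t1 / tp


def efficiency(t1, tp, processors):
    """Speedup per processor; a value above 1.0 means super-linear speedup."""
    return speedup(t1, tp) / processors


# Made-up run times in seconds, for illustration only.
t1, t8 = 120.0, 12.0
s8 = speedup(t1, t8)
e8 = efficiency(t1, t8, 8)
is_superlinear = s8 > 8
```

With these illustrative numbers the 8-processor speedup is 10, i.e. an efficiency of 1.25, which is the kind of anomaly a reduced search space can produce.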
Comparison of Quality
In this section, we compare the quality obtained by [...] Then ProperSYN is applied to that network, and at the end it is again technology mapped. In the case of MIS 2.2, we used the boolean script to perform synthesis of the network [22]. We then performed technology mapping using the same simple gate library. The Transduction method, XTRANS 1.2, was run with the script provided and then technology mapped. We present the results for different benchmark circuits in terms of the number of gates and the number of literals (gate/lit) in the final network in Table 5. As can be observed from the results, ProperSYN produced better results than MIS 2.2 for almost all the circuits. The results produced by ProperSYN and XTRANS 1.2 were comparable.
Table 6 shows the variation of the quality of the circuit produced with different numbers of processors. The result is shown for only one type of machine, for which we have chosen the Intel/860 machine. The result indicates that there is not much variation in the quality of the circuits, and sometimes better circuits are produced with a larger number of processors. This can be due to the different orders in which the transformations are tried. The variations were similar for the other parallel machines. The variation of quality with different numbers of processors on different parallel machines is shown in Table 7. The variation in the results from machine to machine is due to the non-deterministic nature of our algorithm: different processor speeds on different machines can produce different orderings of successful transformations.
Due to the non-deterministic nature of our algorithm, the quality and run time varied slightly from run to run for more than one processor. We performed an experiment to determine the distribution of the variations of the quality and the run time.
For the benchmark circuit bw, we ran our algorithm on the Encore Multimax 20 times for each number of processors (2, 4 and 8). The distributions of the number of gates in the final circuit and of the run time are shown in Figure 15.
Conclusions and Future Research Direction
In this paper, we have presented a portable parallel algorithm for logic synthesis, called ProperSYN. ProperSYN has been implemented as a part of the ProperCAD environment [13], based on the CHARM system [14, 15], and it runs on a variety of parallel machines such as the Sequent Symmetry, the Encore Multimax, the Intel/860 hypercube and networks of workstations. The algorithm developed for ProperSYN uses an asynchronous message driven dataflow model of computation. No explicit synchronizing barrier is used in the algorithm; hence our algorithm is more scalable to large numbers of processors. Results show excellent speedup on all the machines with almost no degradation in the quality of the synthesized network over the uniprocessor algorithm.
For large combinational circuits, the output functions and the CSPFs will be very big and will require a very large memory space. In our present algorithm, many copies of the same BDD may reside on different processors. In the future, we propose to use caching techniques, in which the BDDs of a particular part of the network are permanently stored on the processor that "owns" that part of the network. If any other processor needs those BDDs, it will first look for them in its local "cache" and, if they are not found, ask the "owner" processor to send them. If the local "cache" size is small, the number of extra copies of a BDD will be small. In that case, we will be able to handle really large circuits.
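The proposed owner-plus-cache scheme can be pictured with the following Python sketch. It is only one possible reading of the proposal: the paper does not specify a replacement policy, so least-recently-used eviction is an assumption, and the class `BddCache`, the `fetch_from_owner` callback, and the string stand-ins for BDDs are all hypothetical.

```python
from collections import OrderedDict


class BddCache:
    """Hypothetical bounded local cache of BDDs owned by other processors.
    LRU eviction is an assumption; the text only asks for a small cache."""

    def __init__(self, capacity, fetch_from_owner):
        self.capacity = capacity
        self.fetch_from_owner = fetch_from_owner  # stands in for a message
        self._entries = OrderedDict()
        self.remote_fetches = 0

    def get(self, node):
        if node in self._entries:
            self._entries.move_to_end(node)  # mark as most recently used
            return self._entries[node]
        bdd = self.fetch_from_owner(node)    # miss: ask the owner processor
        self.remote_fetches += 1
        self._entries[node] = bdd
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return bdd


# The owner's permanent store, with strings standing in for BDDs.
owner_store = {"n1": "bdd1", "n2": "bdd2", "n3": "bdd3"}
cache = BddCache(capacity=2, fetch_from_owner=owner_store.__getitem__)
values = [cache.get(n) for n in ["n1", "n2", "n1", "n3", "n2"]]
```

A smaller `capacity` lowers the number of duplicate BDD copies held per processor at the cost of more requests to the owner, which is the trade-off the paragraph above describes.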
For really large circuits, when the BDD size becomes too large to fit in memory, we plan to use partitioning techniques based on [19], similar to the work reported by Lim et al. [11, 12].
At present, ProperSYN has only the transformations based on the Transduction method. Later, we want to add more transformations, such as the algebraic factoring used in synthesis systems like MIS, to make ProperSYN more powerful.
Acknowledgement
We are grateful to C. F. Lim and Professor S. Muroga for various discussions on the Transduction method of logic synthesis. We are grateful to Professor L. V. Kale for the use of the CHARM system. We are also thankful to Argonne National Laboratory for giving us access to their parallel machines.
