The paper presents a concept of decomposition dedicated to PAL-based CPLDs. The proposed approach is an alternative to the classical one, which is based on two-level minimization of separate single-output functions. The key idea of the algorithm is to search for free blocks that can be implemented in PAL-based logic blocks containing a limited number of product terms. To exploit the available product terms better, two-stage decomposition and BDD-based decomposition are used. In the BDD-based decomposition methods, functions are represented by Reduced Ordered Binary Decision Diagrams (ROBDDs). Experimental results show that the proposed solution uses programmable device resources more effectively than the classical methods.
Introduction
Nowadays, Programmable Logic Devices (PLDs) are widely used in the design of digital electronic circuits. Simple PLDs can be divided into several kinds: PAL (Programmable Array Logic), PLA (Programmable Logic Array), and PLE (Programmable Logic Element). Complex Programmable Logic Devices (CPLDs) form a group of devices built on simple PLDs. Logic blocks available in CPLDs use a PAL-based structure (Fig. 1). Other PLDs include FPGA (Field Programmable Gate Array) devices. Due to the high complexity of these systems, efficient CAD algorithms must be developed to address their design challenges. Logic minimization and technology mapping are two important elements in this process.
The classical approach to the synthesis of PAL-based CPLD structures, implemented in many computer-aided design tools, employs two-level minimization of separate functions and technological fitting to the structure of programmable logic blocks (Bolton, 1990). The Espresso algorithm is mainly used for two-level minimization of Boolean functions (Brayton et al., 1984). The synthesis strategies implemented in commercial CAD tools are designed mostly for a small group of devices produced by a single manufacturer, but they do not provide effective solutions. The synthesis methods implemented in Hardware Description Language (HDL) compilers, such as those for VHDL or Verilog, use a multi-level representation of functions, and the synthesis process consists in fitting a function to the patterns of a technology library. These methods lead to inefficient use of the available product terms in PAL-based logic.
The aim of the paper is to show a complete logic synthesis method of a multi-output function based on a new concept of decomposition adopted for a PAL-based architecture. This method is based on two-stage decomposition (Kania, 2004) .
In general, a single step of functional decomposition consists of partitioning the input variables into two disjoint subsets (the bound set and the free set), computing a symbolic function on the bound set variables, and input/output encoding that determines the predecessor and successor functions (Murgai et al., 1994).
A. Opara and D. Kania
Conventional approaches to input encoding are based on finding codes which satisfy the input constraints resulting from symbolic minimization. Some methods presented in (Saldanha et al., 1994; Yang and Ciesielski, 1991) are targeted at two-level minimization. The input and output encoding problem is widely discussed in connection with the state encoding of FSMs in (Ashar et al., 1992; De Micheli, 1994). Those problems are connected with symbolic state encoding, dichotomy theory, multi-valued function minimization, and output domination analysis.
For PAL-based CPLDs, more relevant are methods optimizing multi-level implementation. One of the first input encoding approaches to FPGA-based functional decomposition was presented in (Murgai et al., 1994) . Methods using symbolic decomposition or symbolic factorization perform encoding after decomposition. Sometimes, these methods have been applied to classical decomposition, which is not efficient for LUT-based FPGAs. Similar approaches were presented in (Burns et al., 1998; Muthukumar, 2001; Muthukumar et al., 2000) .
The main limitation of PAL-based logic blocks is the number of multi-input product terms available in one PAL-based logic block. This results in the observation that the essence of decomposition dedicated to PAL-based structures can be reduced to two major tasks:
• Minimising the number of PAL-based logic blocks used;
• Adjusting the designed circuit to best fit the structures of PAL-based blocks.
The proposed decomposition model concerns directly the second issue. The essence of the proposed model consists in finding a design partitioning (function decomposition) which enables us to implement the free block in one PAL-based logic block containing a predefined number of product terms.
The foundations of functional decomposition were laid by Ashenhurst (1957) and extended by Curtis (1962). This model is the basis of complex decomposition algorithms dedicated to LUT-based FPGAs (Burns et al., 1998; Lai et al., 1994; Lai et al., 1996; Nowicka et al., 1997; Rawski et al., 2008; Scholl, 2001). The essential parts of synthesis are the partitioning and mapping of a designed circuit into configurable logic blocks. There are several algorithms that use decomposition in synthesis for PLA-based devices (Ciesielski and Yang, 1992; Devadas et al., 1988).
A very interesting approach was presented in (Anderson and Brown, 1998; Chen et al., 2002). The synthesis process developed for FPGA devices was employed, but Look-Up Table (LUT) blocks were replaced by Programmable Logic Array (PLA) cells. The optimal structure of the cells was selected after carrying out several experiments presented in (Kouloheris and Gamal, 1992). The authors show that using small PLA cells to construct an FPGA device is more area efficient than using LUT cells (Kim et al., 2001; Yan, 2001). On the other hand, synthesis methods dedicated to PLA structures are well mastered and have been known for a long time (Chen and Muroga, 1988; Ciesielski and Yang, 1992; Devadas et al., 1988).
Decomposition is of high importance in modern logic synthesis. Functional decomposition is widely researched in the context of logic synthesis for LUT-based FPGA and PLA-based architectures. We propose to use decomposition to minimize the area of a circuit mapped into PAL-based CPLDs.
Is it profitable to use decomposition in the synthesis process for a PAL-based CPLD?
This is the question this paper attempts to answer.
Only the devices with PAL-type logic blocks are considered. The novel ideas of decomposition presented here were inspired by two-stage PAL-decomposition described in (Kania, 2004; Kania et al., 2005) . A novel nondisjunctive PAL-decomposition based on the BDD was introduced. This decomposition model is particularly promising in the case of hardly decomposable functions.
The obtained results of experiments are (among others) compared with the classical method. This method is briefly introduced in the next subsection.
Classical method.
The classical method of logic synthesis dedicated to PAL-based CPLDs, implemented in the great majority of vendor tools, consists of two steps. First, two-level minimization is applied separately to every single-output function; next, the minimized functions are implemented in PAL-based blocks containing a predefined number of product terms. If the number of implicants $\Delta_f$ representing a function after minimization is greater than the number of product terms $k$ available in a logic block (Fig. 1), more than one logic block has to be used to implement the function. The classical product term expansion method uses feedback lines to build a multi-level cascaded structure, which significantly increases propagation delays.
Implementing a minimized function $f$, which can be represented as a sum of $\Delta_f$ implicants, requires $\delta_f$ PAL-based logic blocks containing $k$ product terms, where $\delta_f$ is given by expression (1).
Decomposition-based logic synthesis for PAL-based CPLDs
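Similarly, the classical implementation of a multi-output function is obtained by implementing each single-output function separately, so its block count is the sum of the block counts of the individual outputs.
As an example, Fig. 2 presents a classical implementation of a function $f : B^4 \to B^3$ that uses PAL blocks containing three product terms ($k = 3$).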
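The block count implied by the cascade scheme can be sketched in a few lines of Python. This is an assumption derived from the description above rather than a quotation of expression (1): the first block is taken to contribute k product terms, and every further block k - 1 new terms, because one term is occupied by the feedback line. The names `classical_blocks` and `classical_blocks_multi` are hypothetical.

```python
def classical_blocks(delta_f: int, k: int) -> int:
    """Number of k-term PAL blocks for delta_f implicants (assumed cascade model)."""
    if delta_f < 1 or k < 2:
        raise ValueError("need delta_f >= 1 and k >= 2")
    if delta_f <= k:
        return 1  # the function fits in a single block
    # First block: k terms. Each further block: one feedback term plus k - 1 new terms.
    remaining = delta_f - k
    return 1 + -(-remaining // (k - 1))  # integer ceiling division


def classical_blocks_multi(deltas, k: int) -> int:
    """Outputs are implemented separately, so the per-output counts add up."""
    return sum(classical_blocks(d, k) for d in deltas)


print(classical_blocks(21, 6))  # 4
```

For k = 6 and 21 implicants the model gives four blocks, which agrees with the value quoted for the example function in Section 2.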
The purpose of this paper is to present more effective methods of function implementation in PAL-based CPLD structures. The paper is structured as follows. Section 2 presents the theoretical background and main ideas of two-stage decomposition methods dedicated to CPLDs. All steps of the proposed decomposition methods based on the BDD are given in Section 3. Experimental results are reported in Section 4. The paper closes with conclusions in Section 5.
Two-stage decomposition dedicated to PAL-based devices
The kernel of the most popular CPLDs is a PAL-based structure, which consists of a predetermined (in most cases, constant) number of product terms connected to an output cell. The terms together with the output cell are called a PAL-type logic block. Two-stage PAL decomposition offers more efficient logic block use than standard two-level minimization and fitting. Adaptation to the number of terms in a PAL-based logic block is a characteristic feature of two-stage PAL decomposition. As in the Ashenhurst-Curtis decomposition, the partition of the variable set into free and bound sets is of major importance (Fig. 2). The partition is chosen so that the free block can be implemented in one PAL-based logic block. The two-stage PAL decomposition algorithm uses a Karnaugh map as a representation of the logic function. The rows of the map are described by the values of the free variables, whilst the columns are described by the values of the bound variables. In this map, row patterns can be determined. For a function $f(X)$, a row pattern is a row described by a function $g(X_b)$ that returns the value 1 for all cubes associated with the columns for which the value of $f(X)$ is 1. In the case of the Karnaugh map depicted in Fig. 4, the first row, denoted by "$g_1(X_b)$", is described by the function
There are three rows here described by this pattern:
The row pattern described by $\overline{g_1}(X_b)$ is called the row pattern complement. In the Karnaugh map illustrated in Fig. 4, two other special cases of row patterns can be identified: a full row, denoted by 1, and an empty row, denoted by 0. The rows of the Karnaugh map can be divided into the following groups:
• empty rows,
• full rows,
• rows associated with the same row pattern or its complement.
The row multiplicity of a partition matrix, denoted by $\mu(X_f|X_b)$, is defined as the number of different row groups, excluding the groups containing empty and full rows. It is defined by expression (3):
where A is the set of row patterns and B is the set of row groups.
For the function presented in Fig. 4, the row multiplicity is $\mu(X_f|X_b) = 2$: the first group of rows contains the row patterns $g_1(X_b)$ and $\overline{g_1}(X_b)$, and the second contains the row pattern $g_2(X_b)$.
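The definition of the row multiplicity can be illustrated with a minimal Python sketch. It assumes a completely specified function (no don't-care cells), and the partition matrix used in the example is a made-up one, not the map of Fig. 4. The function name `row_multiplicity` is hypothetical.

```python
def row_multiplicity(rows):
    """Count row groups of a partition matrix given as strings over '0' and '1'.

    A row and its bitwise complement belong to the same group;
    empty (all-0) and full (all-1) rows are not counted.
    """
    flip = str.maketrans("01", "10")
    groups = set()
    for row in rows:
        if "1" not in row or "0" not in row:
            continue  # empty or full row
        complement = row.translate(flip)
        groups.add(min(row, complement))  # one key for a pattern and its complement
    return len(groups)


# Hypothetical 6-row matrix: two patterns, one of them also in complemented form.
matrix = ["0110", "1001", "0000", "1111", "0011", "0110"]
print(row_multiplicity(matrix))  # 2
```

The canonical key `min(row, complement)` is only a convenient way to merge a pattern with its complement; any other canonical choice would work equally well.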
Each group of rows can be assigned functions h defined as follows:
• the first function takes the value 1 for full rows and 0 for all other rows;
• the second function takes the value 1 for rows whose row pattern is described by the function $g_i(X_b)$ and 0 for all other rows;
• the third function takes the value 1 for rows whose row pattern is described by the function $\overline{g_i}(X_b)$ and 0 for all other rows,
where $i = 1, \ldots, \mu(X_f|X_b)$. Using these functions, and assuming that the row multiplicity is $p$, the following equation is derived:
In the case of the discussed example of a function with va-
the functions h are expressed as follows:
and, finally, the function f will be of the form
Figure 5 depicts the implementation of the function under consideration in a CPLD with six product terms in one PAL-based block. As can be seen, three PAL-based logic blocks are employed. To compare the obtained result with the classical one, expression (1) is used. After two-level minimization with the Espresso algorithm, the function under consideration (Fig. 4) can be represented as a sum of 21 products. The number of logic blocks $\delta_f$ required in the classical approach, determined by expression (1) with $k = 6$ and $\Delta_f = 21$, is equal to 4. The classical approach thus requires four PAL-based logic blocks with six product terms, i.e., one more block than the decomposition-based approach.
The main idea of two-stage PAL decomposition is to search for such variable partitions which could provide (i) a free block realisation in one PAL-based logic block, and
(ii) the smallest number of bound block outputs.
The number of bound block outputs is equal to the row multiplicity. An effective computation of the row multiplicity for a given variable partition is a major problem in two-stage PAL decomposition algorithms.
2.1. Method for row multiplicity evaluation.
In this subsection, an algorithm for determining the row multiplicity is presented. The algorithm uses a specific colouring procedure for the row incompatibility and complement graph. First, it is necessary to introduce some new terms. A pair of cells (i, j) located in the same column of a Karnaugh map is called incompatible if the values of the function in these cells are equal to (1,0) or (0,1). If, in the set of all cell pairs located in two specific rows, at least one pair of incompatible cells can be found, such rows are also called incompatible. If, in the set of all cell pairs located in two specific rows, neither the (1,1) nor the (0,0) pair can be found, one of the rows is said to be a complement of the other. Let us now define a graph G(Y,U), where Y is the set of nodes corresponding to the Karnaugh map rows, and U is the set of edges. Let
• $U_I$ be the set of edges connecting the nodes corresponding to mutually incompatible row pairs, except for empty rows, full rows, and mutually complementing row pairs;
• $U_C$ be the set of edges connecting the nodes corresponding to mutually complementing row pairs, except for empty rows and full rows.
The edges belonging to the set $U_I$ will be drawn with a solid line, and the edges belonging to the set $U_C$ with a dashed line. The graph G(Y,U), built according to the procedure described above, will be referred to as the row incompatibility and complement graph.
Example 1. Let us sketch the row incompatibility and complement graph for the function $f : B^6 \to B$ presented in Fig. 4. First, we have to locate the empty rows and full rows, which are excluded from further operations. In our case, we find two empty rows and one full row, marked in Fig. 4 with the symbols 0 and 1, respectively.
The analysis of subsequent row pairs leads to the row incompatibility and complement graph shown in the accompanying figure. The row multiplicity is equal to the number of different colours used to distinguish the nodes of the graph. Graph colouring consists in assigning a minimum number of colours to the graph nodes in such a way that any two nodes connected by a solid line receive different colours. The node colouring algorithm is based on a sequential selection of nodes. A node is assigned either a permitted colour (denoted by a capital letter) or a complementary colour (denoted by a slash with a capital letter). After a permitted or complementary colour "A" is assigned to node N in step i, all nodes connected to N with a solid line are assigned the forbidden colour "a" (forbidden colours are denoted by lower-case letters), and all nodes connected to N with a dashed line receive the colour complementary to "A".
The selection of the i-th node is performed according to the following rules:
• A node with the maximum number of forbidden colours is selected. It is assigned a permitted colour (one of the colours that have already been used, if possible).
• If some nodes have the same number of forbidden colours, the one to which the maximum number of edges are connected is selected.
• If some nodes have the same number of forbidden colours and the same number of edges connected, the one selected has the maximum number of complementary colours. If possible, a complementary colour is assigned to it.
• If some nodes have the same number of forbidden colours, complementary colours, and edges, the one to which the maximum number of solid line edges are connected is selected.
In the first step, node 001 is selected (Fig. 7(a)). The assignment of colours is performed as follows:
• node 001: permitted colour, denoted by the letter A;
• nodes 000, 011, 010, 110: forbidden colour, denoted by the letter a.
Fig. 7. Step-by-step colouring of the row incompatibility and complement graph.
After a node is selected and the permitted and forbidden colours are assigned to the respective other nodes, the graph is reduced by eliminating the edges that connect the selected node with other nodes of the graph. The result is presented in Fig. 7(b). Then, a new node is selected based on the analysis of the reduced graph. The subsequent steps of node colouring are depicted in Fig. 7. As the result of the colouring procedure described above, the row multiplicity $\mu(X_f|X_b) = 2$ is obtained. Colour A corresponds to the row $g_2(X_b)$ in Fig. 4, and colour B to the row $g_1(X_b)$.
A detailed description of two-stage decomposition and the proposed graph colouring algorithm can be found in (Kania, 2004; Kania et al., 2005) . The classical graph colouring algorithms are presented in (Chartrand and Zhang, 2008) .
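The construction of the row incompatibility and complement graph can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the colouring procedure of (Kania, 2004): rows are strings over '0', '1' and '-' (don't care), a row without any 0 or without any 1 is treated as constant, two rows are treated as complementing only if they differ in at least one commonly specified column, and the node selection rules listed above are replaced by a plain greedy order (highest degree first). All function names are hypothetical.

```python
from itertools import combinations


def build_graph(rows):
    """Return (nodes, U_I, U_C) for rows given as strings over '0', '1', '-'."""
    nodes = [i for i, r in enumerate(rows) if "0" in r and "1" in r]
    u_i, u_c = set(), set()
    for a, b in combinations(nodes, 2):
        pairs = [(x, y) for x, y in zip(rows[a], rows[b]) if "-" not in (x, y)]
        differ = any(x != y for x, y in pairs)   # a (1,0) or (0,1) cell pair exists
        agree = any(x == y for x, y in pairs)    # a (1,1) or (0,0) cell pair exists
        if differ and not agree:
            u_c.add((a, b))                      # dashed edge: complementing rows
        elif differ:
            u_i.add((a, b))                      # solid edge: incompatible rows
    return nodes, u_i, u_c


def _polarity_for(node, colour, assignment, u_i, u_c):
    """Polarity (0 = direct, 1 = complemented) with which node fits colour, or None."""
    for polarity in (0, 1):
        fits = True
        for other, (c, p_other) in assignment.items():
            if c != colour:
                continue
            edge = (min(node, other), max(node, other))
            if edge in u_i:
                fits = False
            elif edge in u_c:
                fits = fits and polarity != p_other
            else:
                fits = fits and polarity == p_other
        if fits:
            return polarity
    return None


def colour_rows(rows):
    """Greedy colouring; the number of colours estimates the row multiplicity."""
    nodes, u_i, u_c = build_graph(rows)
    degree = {n: sum(n in e for e in u_i | u_c) for n in nodes}
    assignment, n_colours = {}, 0
    for n in sorted(nodes, key=lambda v: -degree[v]):
        for c in range(n_colours):
            p = _polarity_for(n, c, assignment, u_i, u_c)
            if p is not None:
                assignment[n] = (c, p)
                break
        else:
            assignment[n] = (n_colours, 0)
            n_colours += 1
    return n_colours, assignment


print(colour_rows(["0110", "1001", "0000", "1111", "0011", "0110"])[0])  # 2
```

For completely specified rows, the relation "equal or complementing" is an equivalence, so the greedy pass returns the exact row multiplicity; with don't-care cells it gives only an upper bound.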
BDD application in decomposition
The binary decision diagram is a graph-based structure used for a memory-efficient representation of logic functions. BDDs were first proposed by Akers (1978), and popularised by Bryant (1986) and Brace et al. (1990). Owing to their ability to represent Boolean functions compactly, BDDs are considered one of the most efficient Boolean representations known so far. BDDs are widely used in decomposition algorithms (Lai et al., 1996; Yang and Ciesielski, 2002).
A BDD is a rooted directed acyclic graph in which each non-terminal node is associated with a function variable. Every non-terminal node has two outgoing edges pointing to two child nodes, one for the variable value 0 and one for the value 1. The graph contains two terminal nodes, termed the 0-node and the 1-node. Following a path from the root to a terminal node according to the values of the variables determines the value of the function.
Only Reduced Ordered Binary Decision Diagrams (ROBDDs) have practical meaning. In an ordered BDD, the variables in all paths have the same variable order, and they occur, at most, once on every path. Reduced ordered BDDs have a minimal number of nodes for the given variable order and are canonical forms of function representation. The reduced form is obtained from an OBDD by the reduction of the same sub-graphs and through removing all redundant nodes.
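The two reduction rules can be illustrated with a small Python sketch. It builds an ROBDD by exhaustive Shannon expansion, which is practical only for a handful of variables, and it uses no complement edges; the function names `build_robdd` and `evaluate`, as well as the node numbering, are assumptions made for this illustration.

```python
from itertools import product


def build_robdd(func, n):
    """Build an ROBDD of func over variables 0..n-1 in their natural order.

    Returns (root, nodes), where nodes maps a node id to (var, low, high)
    and the ids 0 and 1 denote the terminal nodes.
    """
    unique = {}  # (var, low, high) -> node id, merges identical sub-graphs
    nodes = {}

    def mk(var, low, high):
        if low == high:          # redundant node: both edges lead to the same child
            return low
        key = (var, low, high)
        if key not in unique:
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    def build(var, assignment):
        if var == n:
            return 1 if func(assignment) else 0
        low = build(var + 1, assignment + [0])
        high = build(var + 1, assignment + [1])
        return mk(var, low, high)

    return build(0, []), nodes


def evaluate(root, nodes, assignment):
    """Follow the path selected by the assignment down to a terminal node."""
    node = root
    while node not in (0, 1):
        var, low, high = nodes[node]
        node = high if assignment[var] else low
    return node


def f0(x):
    return (x[0] and x[1] and x[2]) or (x[3] and x[4] and x[5])


root, nodes = build_robdd(f0, 6)
print(len(nodes))  # 6
```

The function used here is the two-product example that appears later in the discussion of algorithm refinements; its ROBDD for the natural variable order has one non-terminal node per variable.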
Some ROBDD variants add special attributes to the edges for more efficient memory use and faster computations (Minato, 1996). The complement attribute is one of the best known. If an edge is complemented, the sub-diagram pointed to by this edge must be interpreted as the negation of the formula represented by the sub-diagram.
The classical two-stage PAL decomposition employs a partition matrix as a representation of the logic function. There is a possibility to develop an algorithm using reduced ordered binary decision diagrams as an effective representation, followed by non-disjunctive decomposition, whilst the application of the negation attribute can additionally increase the algorithm's efficiency.
Counting the number of paths.
As far as the synthesis of digital circuits in programmable structures with PAL-based blocks is concerned, the key problem is to determine the minimal number of products in the sum-of-products representation. In the classical approach, the Espresso algorithm may be used for this purpose. When the ROBDD is used for logic function representation, another concept can be exploited. Each path in the diagram from the root to the leaf 1 corresponds to one product. The total number of paths can vary with the variable ordering in the diagram, so changing the variable order is a way to minimize the number of paths. Often, the smallest number of paths is greater than the number of products after minimization; nevertheless, decomposition with path counting can provide better results than the classical approach with two-level Espresso minimization. The main advantage of determining the number of products by counting paths is its low computational complexity. The number of paths can be counted by a recursive procedure. The number of paths $\Delta_1$ connecting a given node $v_1$ to the leaf node 1 is equal to the sum of the numbers of paths connecting its child nodes, high($v_1$) and low($v_1$), to the leaf node 1 (Fig. 8(a)). As in the standard procedure bdd_apply() (Bryant, 1986), a computed table is used to store the intermediate and final results of each iteration of the algorithm. A result in this context means the number of paths for a given node, which is the root of a sub-graph representing a function. Owing to the cached intermediate results, the path counting procedure is performed only once for each node. For instance (see Fig. 8(b)), for the node denoted by w, the number of paths is computed only once, although two edges point to this node and it is visited twice during a depth-first traversal of the diagram.
The computational complexity of the path counting procedure is O(n), where n is the number of nodes in the diagram.
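The recursive path counting with a computed table can be sketched as below. The diagram is given as a plain dictionary, which is an assumption of this example and not the data structure of any particular BDD package; complement edges are not modelled. The dictionary `cache` plays the role of the computed table.

```python
def count_paths_to_one(bdd, root):
    """Count root-to-1 paths in a BDD given as {node: (var, low, high)}.

    Terminal nodes are the integers 0 and 1; other node ids must not be 0 or 1.
    """
    cache = {}  # computed table: node -> number of paths to the leaf 1

    def paths(node):
        if node in (0, 1):
            return node  # 0 paths from the 0-leaf, 1 path from the 1-leaf
        if node not in cache:
            _var, low, high = bdd[node]
            cache[node] = paths(low) + paths(high)
        return cache[node]

    return paths(root)


# ROBDD of x0*x1*x2 + x3*x4*x5 for the order x0, ..., x5; node "g" is shared.
bdd = {
    "a": ("x0", "g", "b"),
    "b": ("x1", "g", "c"),
    "c": ("x2", "g", 1),
    "g": ("x3", 0, "h"),
    "h": ("x4", 0, "i"),
    "i": ("x5", 0, 1),
}
print(count_paths_to_one(bdd, "a"))  # 4
```

Node "g" is reached through three different edges, but its path count is computed once and then read from the table, which is what keeps the procedure linear in the number of nodes.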
The number of paths in the diagram depends strongly on the variable order. To minimize the number of paths, heuristic algorithms similar to those that minimize the number of nodes can be used (Ebendt et al., 2005). For this purpose, a sifting algorithm (Rudell, 1993) can be used, but the optimality criterion must be changed. In sifting, each variable is moved up and down in the variable order; at each position, the resulting ROBDD size is recorded and, finally, the variable is moved to the position that produces the smallest size. The ordering change is performed by swaps of variables which are adjacent in the variable ordering. A swap affects the BDD structure of only the two levels involved, whilst the parts of the ROBDD above and below these levels remain unchanged. All modifications have a local scope and concern two levels of nodes. This locality of the swap operation is responsible for the efficiency of the sifting algorithm. Figure 9 illustrates some portions of ROBDDs before and after the swap operation on two levels with the assigned variables $x_1$ and $x_2$. The number of nodes and paths after swapping is unchanged (see Fig. 9(a)) and changed (see Fig. 9(b)), respectively. In the second case, the total number of nodes in the diagram has to be recomputed. Since only two levels are altered, only the nodes in these two levels must be recounted, and the difference between the numbers of nodes before and after swapping is added to the previous total number of nodes in the diagram. In order to compute the new number of paths in the diagram, the results of the previous calculation can also be employed. Exchanging two adjacent levels has no influence on the number of paths below these levels. The number of paths for all nodes in the upper part of the diagram must be updated. In this case, only the processing time of the lower part of the diagram is saved.
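The influence of the variable order on the number of paths can be demonstrated with a short Python sketch. It is a brute-force stand-in that tries every permutation, so it is usable only for very small functions, and it is not the sifting algorithm; the path count is obtained from the truth table by skipping a variable whenever both cofactors are equal, which mirrors the removal of redundant nodes in an ROBDD without complement edges. The names `paths_for_order` and `best_order` are hypothetical.

```python
from itertools import permutations, product


def paths_for_order(func, order):
    """Number of root-to-1 paths in the ROBDD of func for the given variable order."""
    n = len(order)
    table = []
    for bits in product((0, 1), repeat=n):   # bits[j] is the value of variable order[j]
        x = [0] * n
        for var, bit in zip(order, bits):
            x[var] = bit
        table.append(bool(func(x)))

    cache = {}

    def paths(tt):
        if not any(tt):
            return 0
        if all(tt):
            return 1
        if tt not in cache:
            half = len(tt) // 2
            low, high = tt[:half], tt[half:]
            # Equal cofactors mean a redundant node, which contributes no branching.
            cache[tt] = paths(low) if low == high else paths(low) + paths(high)
        return cache[tt]

    return paths(tuple(table))


def best_order(func, n):
    """Exhaustive search for an order with the fewest paths (small n only)."""
    return min(permutations(range(n)), key=lambda o: paths_for_order(func, o))


def f0(x):
    return (x[0] and x[1] and x[2]) or (x[3] and x[4] and x[5])


print(paths_for_order(f0, (0, 1, 2, 3, 4, 5)))  # 4
print(paths_for_order(f0, (0, 3, 1, 4, 2, 5)))  # 6
```

Interleaving the variables of the two products raises the path count from 4 to 6, which is the kind of difference a path-oriented sifting criterion is meant to detect.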
PAL-oriented BDD-based decomposition.
The core of PAL-oriented decomposition is to search for a partition of the function variables that ensures implementation of the free block in one PAL-based block with a constrained number of product terms. Furthermore, the partition found must provide a structure with the smallest possible number of bound block outputs. The partitioning of the variables in a partition matrix is equivalent to a cut in the ROBDD representing the logic function. The variables associated with the nodes above the cut line form the free set $X_f$, and those below the cut line form the bound set $X_b$ (contrary to the Ashenhurst-Curtis decomposition using the ROBDD representation). Figure 10 depicts the ROBDD corresponding to the Karnaugh map of the function under consideration in Fig. 4. All nodes pointed to by edges crossed by the cut line are termed cut nodes. As can be seen, each cut node is associated with one row pattern. The row multiplicity $\mu(X_f|X_b)$ is the number of row groups, where a row group is formed by a row pattern and its complement. In an ROBDD with complement edge attributes, each cut node corresponds to one row group. The row multiplicity can therefore be computed efficiently by counting the cut nodes in an ROBDD with complement edge attributes. Different partitions are obtained by changing the variable ordering in the ROBDD while keeping the level of the cut line fixed. The decomposition algorithm consists of several phases. In each phase, the number of free set variables, which corresponds to the cut level, is fixed. A variable partition is sought that yields a free block fitting in one PAL-based logic block and the smallest number of bound block outputs. If a solution is found in a given phase, the cut level is incremented.
The main idea of the decomposition algorithm is presented in Fig. 11. During each phase of the algorithm the cut level is fixed. The search for an appropriate partition starts with the cut level equal to $\log_2 k$, where $k$ is the number of terms in a PAL-based block. For such a cut level a partition can always be found, so it is a good starting point. The cut level is then incremented as long as a partition can be found. For the last partition found, the set of cut nodes is stored, and the functions represented by the cut nodes are recursively decomposed further. The last step of the algorithm is a comparison with the classical realisation. If the classical realisation gives a smaller number of PAL-based blocks, the classical solution is used. A more detailed listing of the decomposition algorithm is presented in Fig. 12. The minimal number of products in the sum-of-products form is determined by path counting (line 4) in the ROBDD. Both the number of paths to the leaf 1 and the number of paths to the leaf 0 are counted. Since the row complement relation is symmetrical, either a function or its complement can be assigned to the rows, and the solution with the smaller number of paths (products) can be chosen. If PAL-based logic blocks without programmable output polarity are used, only positive polarity should be accepted when the decomposition procedure is initially employed.
The algorithm contains a few improvements that eliminate situations which would otherwise be processed further. For example, if a function after minimization is described by fewer than 2k implicants (line 5), where k is the number of product terms in a PAL-based logic block, then decomposition will not reduce the number of blocks, because one PAL-based logic block is needed for the free block and at least one for the bound block. If this condition is not met, a partition is searched for (lines 8-18).
Fig. 12 (fragment of the listing):
if (blocks_class <= number_of_cut_nodes+1){ //better classical realisation
21. return classical realisation;
22. }else{ //cut nodes decomposition
...
if //better classical realisation
31. return classical realisation;
32. }
33. } //end decomposition
34. }
The free block is guaranteed to fit in one PAL-based logic block for all partitions for which $2^{|X_f|} \le k$, i.e., $|X_f| \le \lfloor \log_2 k \rfloor$, holds. Hence, to save computation time, the search starts with $|X_f| = \lfloor \log_2 k \rfloor + 1$ (line 8). If a partition is found for a given cut level, the cut level is incremented (line 12). If no partition is found for the cut level equal to $\lfloor \log_2 k \rfloor + 1$, the partition is computed with the cut level equal to $\lfloor \log_2 k \rfloor$.
Additionally, the number of blocks in the classical approach (expression (1)) is computed in line 19.
After a partition is found, a check is made to see whether the partition can reduce the number of PAL-based logic blocks compared with the classical approach (line 20). The minimal number of PAL-based logic blocks required to implement a circuit after partitioning is equal to $\mu(X_f|X_b) + 1$, so the condition used to eliminate some partitions from further processing is given as
$$\delta_f \le \mu(X_f|X_b) + 1.$$
If the above condition is not true, processing continues and the functions represented by the cut nodes are decomposed (lines 24-26). At the end, a final check is made (line 29) of whether the decomposed function really requires fewer blocks than the classical approach.
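The partition search described above can be sketched in Python under several simplifying assumptions, so it should be read as an illustration and not as the algorithm of Fig. 12: partitions are enumerated exhaustively from a truth table instead of being produced by ROBDD reordering, the free variables are always taken in increasing index order, the search simply starts at the cut level $\lfloor \log_2 k \rfloor$, and output polarity is ignored. The number of free-block product terms is taken as the number of paths from the root to non-zero cut nodes. All names are hypothetical.

```python
from itertools import combinations, product
from math import floor, log2


def analyse_partition(func, n, free):
    """Return (free-block product terms, row multiplicity) for the free set `free`."""
    bound = [v for v in range(n) if v not in free]
    rows = []
    for fbits in product((0, 1), repeat=len(free)):
        row = []
        for bbits in product((0, 1), repeat=len(bound)):
            x = [0] * n
            for v, b in zip(free, fbits):
                x[v] = b
            for v, b in zip(bound, bbits):
                x[v] = b
            row.append(1 if func(x) else 0)
        rows.append(tuple(row))

    groups = set()
    for r in rows:
        if any(r) and not all(r):                      # skip empty and full rows
            groups.add(min(r, tuple(1 - v for v in r)))

    def terms(block):
        """Paths from the top of the diagram to non-zero cut nodes."""
        if not any(any(r) for r in block):
            return 0
        if all(r == block[0] for r in block):
            return 1
        half = len(block) // 2
        low, high = block[:half], block[half:]
        return terms(low) if low == high else terms(low) + terms(high)

    return terms(tuple(rows)), len(groups)


def search_partition(func, n, k):
    """Raise the cut level while a free block with at most k terms exists."""
    best = None
    level = max(1, floor(log2(k)))
    while level < n:
        found = None
        for free in combinations(range(n), level):
            t, mu = analyse_partition(func, n, free)
            if t <= k and (found is None or mu < found[1]):
                found = (free, mu, t)
        if found is None:
            break
        best = found
        level += 1
    return best  # (free set, row multiplicity, free-block terms) or None


def prefer_classical(blocks_class, mu):
    """Elimination check: decomposition needs at least mu + 1 blocks."""
    return blocks_class <= mu + 1


def f0(x):
    return (x[0] and x[1] and x[2]) or (x[3] and x[4] and x[5])


print(search_partition(f0, 6, 3))  # ((0, 1), 2, 3)
```

For the two-product example function and k = 3, the search stops at a free set of two variables, because every free set of three variables would need at least four product terms in the free block.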
Algorithm refinements.
The number of paths in an ROBDD connecting the root to the leaf 1 can in some cases differ significantly from the number of product terms of the two-level minimized logic function. For example, the function with two products $f_0 = x_0 x_1 x_2 + x_3 x_4 x_5$ has four paths for the variable ordering $x_0, x_1, x_2, x_3, x_4, x_5$ (Fig. 13). Using these paths, the function can be represented as a sum of four products, $f_0 = x_0 x_1 x_2 + \overline{x_0}\, x_3 x_4 x_5 + x_0 \overline{x_1}\, x_3 x_4 x_5 + x_0 x_1 \overline{x_2}\, x_3 x_4 x_5$. Although this representation is not optimal, the experiments on benchmarks show that path counting decomposition gives good results (Table 1). To enhance the algorithm further, the Espresso algorithm was used instead of path counting. In the algorithm in Fig. 12, the only difference is in line 4, where a two-level minimization algorithm is employed as an alternative.
Non-disjunctive PAL decomposition.
In order to reduce the number of logic levels, non-disjunctive partitions can be employed (Opara, 2009; Opara and Kania, 2009). Non-disjunctive decomposition is illustrated by the function under consideration (Fig. 14), implemented with PAL-based blocks containing three product terms. The first stage of non-disjunctive decomposition is to find a good disjunctive partition. For a given variable order, only x 0 can be included into the free set. In this case, a free block described as f = x 0 · g 2 (x 1 , x 2 , x 3 , x 4 ) + x 0 · g 1 (x 1 , x 2 , x 3 , x 4 ) is implemented by two product terms. Function g 2 describes a diagram rooted by node v 2 , and g 1 by v 1 . If one more variable (x 1 ) is included in the disjunctive free set, four product terms are needed, so the limit of three terms in a PAL block is exceeded. Function g 1 is implemented in one PAL block and g 2 in two blocks. Finally, using disjunctive decomposition, the circuit can be implemented with four blocks arranged in three levels.
Through the introduction of non-disjunctive decomposition, the variable x 1 is included into the free and bound set. The free block is described by the formula f = x 0 ·x 1 ·g 0 + x 0 ·x 1 ·g 0 + x 0 ·g 1 and utilizes three product terms. The whole circuit is built of three PAL-based logic blocks in two levels (Fig. 15) .
The algorithm presented in Fig. 12 is modified so that, after a proper disjunctive partition is found (lines 23-34), a procedure is employed that tries to add one child of a cut node to the cut node set. In the example considered, v 0 is chosen as a child of v 1 . The node is accepted if the resulting implementation of the free block fits one PAL-based logic block.
Experimental results
The developed BDD-based synthesis methods were compared with (i) the classical method, (ii) two-stage decomposition, and (iii) the synthesis implemented in a vendor tool (Quartus).
In order to compare these methods, benchmarks were synthesized for PAL-based logic blocks containing k product terms. All experiments were performed on a PC with a Pentium Centrino 1.6 GHz processor and 1 GB of RAM, under the Windows XP operating system. To carry out the experiments, a prototype tool, dekBDD, was developed; additional information about it can be found at db.zmitac.aei.polsl.pl/AO/dekBDD.html.
Comparison with the classical method.
The BDD-based method of implementing a function in PAL-based structures presented in this paper (dekBDD) was compared with the classical approach with respect to the number of logic blocks used and the number of logic levels. The comparison was made for two versions of the algorithm: the simple one, denoted by dekBDD, and the enhanced one, denoted by dekBDD+E in Table 1. For multi-output benchmarks, the algorithm was applied separately to each output. The left part of the table shows the results of the synthesis performed on the benchmarks using the classical approach. The column marked "Esp" lists the number of product terms of the function after the Espresso minimization, the column marked "Bdd" lists the number of paths in the ROBDD representing the function, the columns marked with the letter "B" list the numbers of k-product PAL-based blocks, and the columns marked with the letter "L" list the numbers of logic levels. The second part of the table, under the heading "dekBDD", contains the results obtained using the new method. In the set of about 2600 cases compared, the dekBDD algorithm found 168 solutions, and the dekBDD+E algorithm 263 solutions, that required fewer logic blocks than the classical method. For some benchmarks, the reduction of the logic block count was significant, e.g., for rd84 f 1 , rd73 f 1 , cordic f 1 , misex3 f 2 , f 7 , 5xp1 f 2 . Significant differences can be noticed not only for small values of k. Unfortunately, the number of logic levels does not follow the reduction of the number of logic blocks. Among the examined benchmarks, only a few percent of the solutions required a smaller number of logic levels.
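For reference, the "B" and "L" entries of the classical method can be estimated directly from the number of product terms. The sketch below assumes a cascaded (feedback) expansion in which every block beyond the first contributes k - 1 new terms, and in which L levels of k-term blocks can hold up to k^L terms. This estimate is an assumption made here, not a formula stated in this section, and the function names are illustrative.

```python
def classical_blocks(p, k):
    """Estimated number of k-term PAL blocks for a function with p product terms.

    Assumes cascading: the first block holds k terms, every further block
    adds k - 1 terms (one term is used to feed the previous output back).
    """
    if k < 2:
        raise ValueError("a PAL block needs at least two product terms to cascade")
    if p <= k:
        return 1
    return -(-(p - 1) // (k - 1))  # ceil((p - 1) / (k - 1)) with integers


def classical_levels(p, k):
    """Estimated number of logic levels: smallest L with k**L >= p."""
    levels, capacity = 1, k
    while capacity < p:
        capacity *= k
        levels += 1
    return levels


K_VALUES = (3, 4, 5, 6, 7, 8, 12, 16)
blocks_78 = [classical_blocks(78, k) for k in K_VALUES]
levels_78 = [classical_levels(78, k) for k in K_VALUES]
```

For a function with 78 product terms, this estimate gives 39, 26, 20, 16, 13, 11, 7 and 6 blocks in 4, 4, 3, 3, 3, 3, 2 and 2 levels for k = 3, 4, 5, 6, 7, 8, 12 and 16, respectively.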
The results of the experiments are summarized in Figs. 16 and 17. The values on the vertical axis in Fig. 16 were calculated from the formula shown in the graph. Σblocks classical and Σblocks dekBDD denote the total block counts obtained using the corresponding synthesis methods and are presented in Table 1. The values shown in Fig. 17 were calculated in a similar manner. The analysis of the benchmarks allows us to state that, in most cases, the reduction of logic block counts achieved by the new algorithm is obtained at the expense of a certain increase in the number of logic levels. The proposed method is particularly efficient for k = 4, 8, and 16, where a significant reduction in block counts was observed while preserving a comparable number of logic levels.
Comparison with two-stage decomposition.
In the development of BDD-based decomposition algorithms, two-stage decomposition served as a reference. This method was presented in Section 2. Decomposition based on the classical Ashenhurst-Curtis model is very effective. Unfortunately, its computational complexity precludes the synthesis of large designs. Two-stage decomposition implemented in the PALDec system (Kania, 2004) allows the synthesis of functions with at most 16 arguments. A logic synthesis based on a BDD (a simple one, denoted by dekBDD, and the enhanced one, denoted by dekBDD+E) was compared with two-stage decomposition implemented in the PALDec system with respect to the number of logic blocks used and the number of logic levels. The results are presented in Table 2. The rows show the results of synthesis performed on the benchmarks using the classical approach (rows marked "Classic"), logic synthesis based on the BDD (rows marked "dekBDD" and "dekBDD+E") and logic synthesis based on two-stage decomposition (rows marked "PALDec"). The columns marked with the letter "B" list the numbers of k-product PAL-based blocks used, and the columns marked with the letter "L" list the numbers of logic levels. The total block and level counts obtained using the corresponding synthesis methods are presented in the four lowest rows of Table 2.
When comparing a set of 128 cases, two-stage decomposition (PALDec) gave 20 solutions (15%) requiring a smaller number of logic blocks than BDD-based decomposition methods. For certain benchmarks, the reduction of logic block count was significant, e.g., for f51m, z5xp. Crucial differences can be noticed only for k = 3. In the majority of cases, the numbers of logic blocks and levels obtained for both methods were identical.
The results of the comparison of all decomposition methods are summarized in Figs. 18 and 19. The values on the vertical axes of Fig. 18 and Fig. 19 were calculated from the formula shown in each graph. Σblocks and Σlevels denote the total block and level counts, respectively, obtained with the corresponding synthesis methods.
The analysis of the benchmarks allows us to state that, in most cases, the reduction of logic block counts achieved by the new algorithm is obtained at the expense of a certain increase in the number of logic levels. All decomposition methods are particularly efficient if k = 4 or k = 8. In these cases, a significant reduction of block counts was observed while preserving a comparable (sometimes the lowest) number of logic levels.
These experiments show the following:
• The decomposition method is better, with respect to the number of logic blocks, than the classical approach.
• The decomposition method can be useful in cases for which the reduction of the chip area is of the utmost concern, without significantly degrading the chip dynamic properties.
• BDD-based decomposition algorithms (dekBDD, dekBDD+E) have significantly lower computational complexity than algorithms based on two-stage decomposition.
• Two-stage decomposition sometimes produces better solutions than BDD-based techniques.
• If the reduction of the number of logic levels is an important factor in the synthesis, the proposed decomposition algorithm is particularly effective for structures consisting of PAL-based blocks containing 2^i (a power of 2) product terms.
4.3. Way to describe circuits to use decomposition results for commercial applications.
The main problem in porting the proposed method to a vendor-specific system is to find an appropriate intermediate format for the design data exchange. Commercial vendor-independent systems (e.g., Synplify, Leonardo Spectrum) use low-level netlists for this purpose. This approach is secure because there is little chance that the low-level structure will be altered by the implementation tools. The method is, however, not universal because low-level netlists contain much vendor-specific and architecture-specific information. Using this approach thus requires equipping the synthesis software with procedures or plugins responsible for converting formats and preparing data specific to the implementation tools. This is acceptable for commercial companies but difficult for academic experiments. It was, therefore, desirable to find alternative formats for the data exchange, possibly more universal, and to use a higher level of abstraction.
Here using a Hardware Description Language (HDL) seems to be the most obvious choice. Choosing the right abstraction level for the intermediate format is an important task because vendor implementation software can change and "destroy" logical structures generated by synthesis tools. Behavioral HDL description seems presently to be the design specification format most preferred for design entry. Because of its high abstraction level, it allows the designer to concentrate on proper description of the desired functionality. As a textual format, following the standard of the chosen language, it is universal and portable between technologies and software tools.
A number of experiments were carried out to examine various synthesis tools and, in particular, the effects of selecting different data exchange formats on the quality of results. The tools were tested using the standard benchmarks. The benchmarks were implemented in PAL-based CPLDs.
It was found that, if a behavioural description was used as the entry format, the quality of the solutions was not good. A high abstraction level in behavioural modelling gives much freedom to the software, so logical structures can be easily "spoiled" by vendor implementation programs. The experiments showed that it is possible to propose, as the intermediate format, a style of Verilog description lying at a lower level of abstraction than behavioural modelling, but still portable between software tools and comprehensible to a human.
To this end, a way to describe the circuit under design was developed using a set of equations. The advantage of this solution is that the decomposed circuit retains its structure. The proposed circuit description ensures the transferability of results to different hardware platforms. The designed circuit, described by a sum of products (Fig. 20(a)), is decomposed using the prototype tool to obtain a description in the form of a set of equations in the Verilog language (Fig. 20(b)). Adding the (* KEEP *) attribute to a signal prevents that signal from being reduced by the firmware synthesis tool. In consequence, the decomposed circuit retains its specific structure. Apart from the transferability, such a description has another advantage: it does not limit the possibility of using specific resources of programmable structures, such as shared expanders. Thus, further improvement of the obtained results is possible, and experiments confirmed the effectiveness of the proposed approach. In order to verify the practical usefulness of the proposed description and decomposition methods for standard benchmarks, the Quartus II v8.0 software from Altera and the MAX 7000B device EPM7512BFC256-5 were used. Each benchmark was synthesised using the Quartus software. Moreover, for comparison purposes, a description of the circuit was produced using dekBDD, and then the synthesis was continued using the firmware system.
[Fig. 20. Description of the 5xp1.pla benchmark: (a) sum of products, (b) set of equations after decomposition. Both listings declare module benchm ( input x0,x1,x2,x3,x4,x5,x6, output f0,f1,f2,f3,f4,f5,f6,f7,f8,f9 ); the module bodies are elided (. . .).]
The results of the experiments are presented in Table 3. The first column contains the name of the benchmark, and the next two columns contain the numbers of PAL-based blocks (k = 5) obtained using the classical method and the dekBDD + E method. The next two groups of columns, headed "MAX 7000B Quartus II area opt." and "MAX 7000B dekBDD + E + Quartus II", give the results of the synthesis of the circuits implemented in the MAX 7000B programmable structure using two methods:
1. "MAX 7000B Quartus II area opt."-synthesis employing Quartus, focused on area minimization;
2. "MAX 7000B dekBDD + E + Quartus II"-decomposition using the dekBDD + E method ending with the description in the Verilog language, followed by post-synthesis using Quartus.
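The hand-off used in the second flow, a set of Verilog equations whose intermediate signals are protected from removal, can be sketched as a small generator. This is a minimal illustration, not the output format of the dekBDD tool: the helper name, the example circuit and its signal names are hypothetical, and the lowercase attribute spelling (* keep *) is an assumption that should be checked against the documentation of the synthesis tool in use.

```python
def emit_verilog(module, inputs, outputs, kept_wires, equations):
    """Emit a Verilog module in which each signal is a sum of products.

    equations: list of (target, terms); each term is a list of literals
    such as "x0" or "~x1". Wires in kept_wires receive a keep attribute
    so that the downstream tool does not merge them away (assumed syntax).
    """
    lines = [f"module {module} ("]
    lines.append("  input " + ", ".join(inputs) + ",")
    lines.append("  output " + ", ".join(outputs))
    lines.append(");")
    for wire in kept_wires:
        lines.append(f"  (* keep *) wire {wire};")
    for target, terms in equations:
        sop = " | ".join("(" + " & ".join(term) + ")" for term in terms)
        lines.append(f"  assign {target} = {sop};")
    lines.append("endmodule")
    return "\n".join(lines)


# Hypothetical decomposed circuit: bound block g0, free block f.
verilog_text = emit_verilog(
    module="demo",
    inputs=["x0", "x1", "x2"],
    outputs=["f"],
    kept_wires=["g0"],
    equations=[
        ("g0", [["x1", "x2"], ["~x1", "~x2"]]),
        ("f", [["~x0", "g0"], ["x0", "~g0"]]),
    ],
)
```

Each assign statement corresponds to one PAL-based block, so the block boundaries found by the decomposition remain visible to the downstream tool as long as the kept wires are preserved.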
Here, "MC" is the number of macrocells, "Exp" is the number of shared expanders used, and "tp" is the propagation time through the longest path. The penultimate column contains the standardized number of macrocells.
[Table 1 (fragment). One row with Esp = 78 and Bdd = 102; B/L pairs for k = 3, 4, 5, 6, 7, 8, 12, 16. Classic: 39/4, 26/4, 20/3, 16/3, 13/3, 11/3, 7/2, 6/2. dekBDD: 39/4, 26/4, 18/3, 12/3, 10/3, 11/3, 7/2, 6/2. dekBDD+E: 20/4, 14/3, 13/3, 10/3, 9/3, 8/3, 7/2, 6/2. The next row, labelled f6, begins with 111 173.]
[Table 2. Comparison between decompositions that employ the BDD (dekBDD, dekBDD+E) and two-stage decomposition (PALDec) referred to the classical method. Fragment of a "Classic" row: 35 3 26 3 22.]
Among the 23 benchmarks considered, a reduction in the number of macrocells was observed in 15 cases. In the majority of cases, the reduction amounted to several dozen percent. In some cases the result was surprisingly good. The best result was obtained for the spla circuit, where the number of macrocells dropped from 927 to 89 (more than a tenfold decrease in the number of blocks). Similar results were obtained for the pdc, f51m and rd84 benchmarks. After the Quartus II synthesis alone, it was impossible to implement the pdc and spla benchmarks even in the MAX 7000B device with the highest number of macrocells. The benchmark t481, for which a considerable growth in the number of macrocells was observed (from four to 23), is a special case. This benchmark requires a different decomposition strategy. Quartus II allows a circuit description to be created in which several macrocell outputs are connected to one product term. The large difference in the number of blocks for t481 arises because the dekBDD + E algorithm allows at most one macrocell output to be connected to one product term. A suitable modification of the dekBDD + E algorithm, which also takes such cases into account, will be the subject of further studies.
Despite the unfavourable result for one benchmark, the total number of macrocells for all benchmarks was reduced by a factor of 2.8. Also, in the majority of cases, the number of shared expanders was reduced in proportion to the reduction in the number of macrocells. It is worth noting that the proposed description in the Verilog language does not exclude the use of expanders by the firmware synthesis tool. For example, in the case of the 5xp1 benchmark, the dekBDD + E method produces a description using 19 PAL-based blocks (k = 5). After the final stage of the synthesis using the firmware tool, the description of 19 blocks is implemented in 16 MAX 7000B macrocells (built around the PAL-type core) and six expanders. Thus, the use of the firmware tool made it possible to take advantage of the specific features of the architecture of the programmable structure and to further improve the decomposition results.
In the majority of cases, no increase in the propagation time was observed while reducing the number of blocks (16 of 23 benchmarks). The average gain of propagation time remained almost unchanged (difference of 4%). However, this average value does not take into account the fact that the number of macrocells used in the firmware tool was higher than the maximum available number of macrocells in the circuits of the MAX7000B for two benchmarks. On the other hand, the use of the dekBDD + E algorithm enabled us to build the same circuits and to obtain propagation times as low as 17 ns. This confirms the practical effectiveness of the proposed methods and of their description in HDLs.
Attention should also be paid to the speed of the proposed algorithms and to the influence that the processing of the circuit description files has on the duration of the synthesis with firmware tools. Using Quartus II alone, the average synthesis duration was 10 and 12 minutes for pdc and spla, respectively. When the descriptions produced by the dekBDD+E module were used, the synthesis of the same circuits in Quartus II took 30 seconds. Thus, as much as a 20-fold speed-up of the whole synthesis process was obtained.
Conclusions
The paper presents logic synthesis methods dedicated to PAL-based CPLDs. Their aim is to use non-standard decomposition in order to minimize the area of the implemented circuit, i.e., to reduce the number of logic blocks required in the programmable structure. These methods provide an alternative to the classical approach based on two-level minimization of individual single-output functions.
The paper presents three variants of PAL-oriented decomposition dedicated to PAL-based CPLDs. First, two-stage PAL-oriented decomposition is presented. This method is an extension of the classical Ashenhurst-Curtis decomposition. Decomposition based on the two-stage model is very effective. Unfortunately, the algorithms contain very demanding procedures, and their computational complexity precludes the synthesis of large designs. The other PAL-oriented decomposition models use reduced ordered binary decision diagrams. The binary decision diagram was adopted in order to increase computational efficiency. The experience gained in the implementation of two-stage decomposition made it possible to implement efficient partitioning procedures for the BDD. Decomposition results for the BDD methods are slightly worse than those of the two-stage approach. However, the synthesis process is computationally efficient and allows complex logic circuits to be decomposed in a reasonable amount of time. The exploration of BDD decomposition methods shows a potential that can still be developed, especially for the decomposition of functions with a few hundred input and output variables.
The essence of all the methods is to incorporate decomposition into the synthesis process dedicated to CPLD structures. The algorithm consists in a sequential search for decomposition which provides the feasibility of implementation of a free block in one PAL-based logic block containing a predefined number of product terms.
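The sequential search can be sketched under strong simplifying assumptions: only disjunctive cuts at variable-level boundaries are tried, product terms are counted as root-to-cut paths without further minimization, and the search stops at the first level whose free block no longer fits. The example diagram (a 4-input parity function) and all names are illustrative; this is not the algorithm of Fig. 12.

```python
# ROBDD of x0 xor x1 xor x2 xor x3: node -> (variable index, low, high).
# p_i computes the parity of x_i..x3, q_i its complement.
PARITY = {
    "p0": (0, "p1", "q1"),
    "p1": (1, "p2", "q2"), "q1": (1, "q2", "p2"),
    "p2": (2, "p3", "q3"), "q2": (2, "q3", "p3"),
    "p3": (3, "0", "1"),   "q3": (3, "1", "0"),
}


def terms_above_level(nodes, root, level):
    """Product terms of the free block whose free set is variables 0..level-1."""
    def walk(node):
        if node == "0":
            return 0
        if node == "1" or nodes[node][0] >= level:
            return 1  # reached terminal 1 or a cut node below the free set
        _index, low, high = nodes[node]
        return walk(low) + walk(high)
    return walk(root)


def deepest_feasible_cut(nodes, root, n_vars, k):
    """Largest number of top variables whose free block fits k product terms."""
    best = 0
    for level in range(1, n_vars + 1):
        if terms_above_level(nodes, root, level) > k:
            break  # path counts do not decrease as the cut moves down
        best = level
    return best


cut_k3 = deepest_feasible_cut(PARITY, "p0", 4, 3)
cut_k4 = deepest_feasible_cut(PARITY, "p0", 4, 4)
cut_k8 = deepest_feasible_cut(PARITY, "p0", 4, 8)
```

In this toy case the free set holds one variable for k = 3, two variables for k = 4, and the whole function fits a single block for k = 8.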
The proposed methods were verified in practice. For all synthesis methods, the results of the experiments presented in the paper converge as k grows. The conclusion is that for large k it is better to use the dekBDD+E approach, which works fast and gives comparable results.
By adjusting the decomposition elements to the logic resources characteristic of a PAL-based logic block, a significant improvement of the synthesis effectiveness in relation to the classical approach could be obtained. Unfortunately, a reduction in the area is not always associated with a reduction in the number of logic levels.
Although satisfactory results were achieved, the presented methods will still be improved. In our opinion, the quality of results could be enhanced, e.g., by extending the decomposition model to allow creating the description of a circuit, where several outputs of macrocells can be connected to one product-term. Considering perspectives for further research, comparisons with other tools and integration with commercial tools will be taken into consideration.
