Abstract-New results are given concerning the design of combinational logic circuits. We give time and component bounds for combinational circuits specified in several ways. For any sequential machine defined by linear recurrence relations, we discuss an algorithm for the synthesis of equivalent combinational logic. The procedure includes upper bounds on the time and components involved. We also discuss the transformation of nonlinear recurrences into combinational circuits. Examples are given using gates as well as IC's as components. These include binary addition, multiplication, and ones' position counting. The time and component bounds our procedure yields compare favorably with traditional results.
I. INTRODUCTION
This paper discusses some aspects of the relationship between sequential circuits and combinational circuits. Circuit design in both areas has been studied extensively in the past. Past studies have included efforts to reduce the time and gates required to compute various functions. This paper establishes upper bounds on time and gates, and also provides a systematic procedure for transforming a sequential circuit design into a combinational circuit.
The upper bounds on time which we prove are quite good, relative to the best known lower bounds in most cases [1]-[3]. We also give gate bounds, which have often eluded detailed analysis in the past. Our gate bounds seem quite sharp relative to the actual numbers found in real logic design examples [4], [5], [13]-[15].
The algorithms we have for transforming sequential circuit designs into combinational ones yield circuits which meet the above-mentioned gate and time bounds. In this sense, we present a uniform design procedure for the realization of any linear sequential machine in combinational circuit form. The advantage of this is that one can often specify the behavior of some desired function quite easily as a sequential circuit. It is somewhat more difficult to translate such a specification into a faster combinational circuit form. A classic example is the ease with which a bit serial adder is specified in sequential form. On the other hand, the design of combinational parallel adders (with various lookahead schemes) occupied many logic designers for some years in the 1950's. The automatic design of a fast parallel combinational adder derived from a bit serial specification is one example of the use of our method.
Manuscript received October 4, 1975; revised June 15, 1976. This work was supported in part by the National Science Foundation under Grant US NSF DCR73-07980 A02.
S.-C. Chen
Not all interesting logic design problems are presented in a sequential form that is linear. As we shall see later, multiplication is an example. While some nonlinear cases can be linearized mathematically, we shall discuss another approach. We will show how nonlinear logic circuits can be used to remove the nonlinearity in the sequential specification. Then, in terms of components which contain the nonlinearities, we obtain a linear system at a higher level. Our method can then be applied in a straightforward way.
An important question in modern, practical logic design is what to put in one integrated circuit package and then how to synthesize useful circuits using such packages. One of the methods we present deals with what can be regarded as logic design at the integrated circuit package level. We show what logic should be contained in a package and then give a method for interconnecting packages. Again our discussion is centered on transforming given sequential logic specifications into combinational logic in the form of packages. This is closely related to the subject of the previous paragraph in the sense that nonlinear logic functions can often be hidden in integrated circuit packages, leaving us with a linear problem at a higher level.
Throughout the paper we illustrate our methods with examples giving gate and time bound coefficients for several practically useful logic design problems including adders, multipliers, and ones' position counters.
The techniques described in this paper are variations on our earlier efforts to design fast parallel operation computers [6] and [7]. There our basic units were adders and multipliers which operated on whole floating-point numbers, while here we are dealing with logic design at lower levels. In this paper we deal with operations on bits and bytes at the gate and integrated circuit package level. It is important to notice that mathematically, precisely the same ideas and algorithms are used at all levels; only the details of the technology change. Thus, we feel that in attempts to automate the design of general purpose or special purpose machines, one set of underlying ideas may be of general use.
The following definitions and assumptions will hold throughout the paper. An atom is a constant or variable denoted by a lower case letter. In some parts of the paper we will deal with Boolean atoms (which have value 0 or 1) and in other parts we will deal with arithmetic atoms (which represent binary numbers). A dyadic Boolean operator is either a logical OR or a logical AND. A dyadic arithmetic operator is either an addition or multiplication operator. We denote these by + and *, respectively, in either case. The context will make our meaning clear when necessary, and in some cases the same result will hold in either the Boolean or the arithmetic case.
Except as noted in the paper, we assume that all Boolean NOT's and arithmetic subtractions are distributed down to the level of atoms. In the arithmetic case, this is discussed in [8], while in the Boolean case a similar procedure may be carried out using DeMorgan's Laws. We do this without loss of generality to simplify our discussion.
An expression (Boolean or arithmetic) is a well-formed string consisting of atoms and operators and is denoted by an upper case letter. We write E (e), for example, to denote an expression E containing e atoms. The distinction between Boolean and arithmetic atoms and expressions will be clear by the context of our discussion.
We emphasize the fact that in practice fan-out is usually greater than fan-in, but fan-out delays may be nonnegligible. We account for fan-out delays and gates in all of our bounds. Thus, our results represent a more refined treatment than is usually found in abstract bounds of this type which often ignore fan-out limitations.
We use the notation $T_G[E]$ to denote the number of gate delays in a circuit which implements expression E using G gates. Similarly, we use the notation $T_P[E]$ to denote the number of processor delays required to compute E using P processors.
Throughout the paper we use log x to denote $\log_2 x$.
II. COMBINATIONAL CIRCUITS
In this section we discuss gate and time bounds for combinational logic circuits. We give bounds for gates with fan-out f and fan-in 2. After giving some elementary fan-out and combinational fan-in bounds we present an overall circuit bound. This is expressed in terms of the number of inputs and outputs, and could, for example, be used to bound the gates and time needed for an integrated circuit package.
... or combinational gates which are accounted for elsewhere. This is illustrated in Fig. 1.
Thus we can fan-out to $f$ places with zero gates, to $f - 1 + f$ places with one gate, to $f - 2 + 2f$ places with two gates, and to $f - G + Gf$ places with $G$ gates. Since we want $e \le f - G + Gf$ using as few gates as possible, we see that $e \le f - G + Gf < e + f - 1$.
Thus, we have $G < (e - 1)/(f - 1)$.
We can fan-out to $f$ places in zero time, to $f^2$ places in 1 time unit, to $f^3$ places in 2 time units, and to $f^k$ places in $k - 1$ time units. It follows that for $e \le f^k < fe$ we have $k < 1 + \log_f e$, so $k - 1 < \log_f e$. But $k - 1 = T_G$, so the lemma is proved.
Q.E.D. Next, we bound the gates and time in the combinational part of any logic circuit without NOT gates; these will be introduced in Theorem 1.
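The counting argument used for the fan-out bounds can be checked numerically. The following is a minimal sketch, assuming the reading of the proof in which the source alone drives f places, each added fan-out gate consumes one place and supplies f new ones, and each additional gate delay multiplies the reachable places by f. The function names `fanout_gates` and `fanout_delay` are illustrative and do not come from the paper.

```python
def fanout_gates(e: int, f: int) -> int:
    """Fewest fan-out gates needed to drive e destinations with fan-out f >= 2."""
    reach, gates = f, 0           # the source alone reaches f places
    while reach < e:
        reach += f - 1            # a new gate uses one place and supplies f more
        gates += 1
    return gates


def fanout_delay(e: int, f: int) -> int:
    """Gate delays of the shallowest fan-out tree reaching e >= 2 destinations."""
    reach, delays = f, 0          # f places in zero time, f^2 after one delay, ...
    while reach < e:
        reach *= f
        delays += 1
    return delays


# Check G < (e - 1)/(f - 1) and T_G < log_f e over a range of small cases.
for f in range(2, 10):
    for e in range(2, 300):
        g, t = fanout_gates(e, f), fanout_delay(e, f)
        assert g * (f - 1) < e - 1
        assert f ** t < e         # equivalent to t < log_f e
```

For instance, with e = 28 destinations and fan-out f = 8 the sketch uses 3 fan-out gates and 1 gate delay.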
Lemma 2 [8], [9]: Any complement free Boolean expression E(e) of e atoms can be realized using gates of fan-in 2 [8]. In most practical expressions, the depth of parentheses nesting is small, so this provides the best bound. However, if d > (3/2) log e, we use the second half of the lemma which is proved in [9], where it is also shown that this may be extended to $T_G[E(e)] < 3 \log e$ with $G[E(e)] < 2.5e$. We have found that for practical purposes a low gate bound is more important than a low time bound, however. In much of the following we will use Lemma 2, assuming for simplicity that d < (3/2) log e. The case of fan-in greater than 2 has been considered recently in [10].
Next we define a combinational circuit and then give overall gate and time bounds for such circuits.
Definition 1: A combinational circuit C(r,s,e,n,d) is defined by
1) A set of inputs $x_i$, $1 \le i \le r$.
2) A set of outputs $y_j$, $1 \le j \le s$, where $y_j$ is defined by an output expression $E_j(e_j)$ of $e_j$ atoms (representing inputs or complements of inputs) and with parentheses nesting depth $d_j$.
3) $e = \max_j \{e_j\}$ is the maximum number of atoms contained in $E_j$, $1 \le j \le s$.
4) $n = \sum_{j=1}^{s} e_j$ is the total number of atoms in all $E_j$, $1 \le j \le s$.
5) $d = \max_j \{d_j\}$ is the maximum parentheses nesting depth of $E_j$, $1 \le j \le s$.
Example 1
Suppose we have a 16-pin integrated circuit package which contains only combinational logic. Assume we can use 7 pins for inputs and 7 pins for outputs, i.e., r = s = 7. Assume that we have an average of 4 atoms per output expression so $n = 4 \cdot 7 = 28$, the maximum number of atoms per expression is e = 8, and d = 2. Thus, a typical output expression may be of the form
Let us use circuits with fan-in 2 and fan-out 8. Now for any possible combinational logic with the above characteristics, a package can be designed such that the total package time in gate delays is $T_G \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil)$.
The total number of gates in any such package is at
Thus, we see that for realistic assumptions about packages and logical expressions, we obtain gate and time bounds that are of practical interest.
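The package time bound of Example 1 can be evaluated directly. The sketch below assumes the bound reads $T_G \le \lceil \log e \rceil + 2(d + \lceil \log_f n \rceil)$, which is a reading of the damaged formula in the text rather than a confirmed statement of the theorem. The helper names `ceil_log` and `package_time_bound` are illustrative only.

```python
def ceil_log(x: int, base: int) -> int:
    """Smallest k with base**k >= x, i.e. the ceiling of log_base(x) for x >= 1."""
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def package_time_bound(e: int, n: int, d: int, f: int) -> int:
    """Assumed form of the time bound: ceil(log2 e) + 2 * (d + ceil(log_f n))."""
    return ceil_log(e, 2) + 2 * (d + ceil_log(n, f))


# Example 1 parameters: e = 8 atoms, n = 28 atoms in total, depth d = 2, fan-out f = 8.
print(package_time_bound(e=8, n=28, d=2, f=8))  # prints 11
```

Integer loops are used instead of floating-point logarithms so that exact powers such as $\log_2 8 = 3$ are not affected by rounding.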
III. SEQUENTIAL CIRCUITS
In this section we discuss methods of transforming sequential circuits into combinational ones and give time bounds and component bounds on the resulting circuits.
Definition 2: A sequential circuit S(r,s,e,n,d,m) is defined at time t by
1) A set of inputs $x_i(t)$, $1 \le i \le r = r_1 + r_2$. We call the $x_i(t)$, $1 \le i \le r_1$, the external inputs, and the $x_i(t)$, $r_1 + 1 \le i \le r$, the feedback inputs.
2) A set of outputs $y_j(t)$, $1 \le j \le s = s_1 + s_2$, where for any logical functions $f_j$,
as shown in Fig. 2. We call the $y_j(t)$, $1 \le j \le s_1$, the external outputs and the $y_j(t)$, $s_1 + 1 \le j \le s$, the feedback outputs.
$$y_i(t) = f_i[x_1(t), \ldots, x_{r_1}(t), y_{s_1+1}(t - m_1), \ldots, y_s(t - m_{r_2})] = c_i + a_{i1}\, y_{s_1+1}(t - m_1) + \cdots + a_{i r_2}\, y_s(t - m_{r_2}),$$
where the $c_i$ and $a_{ij}$, $1 \le j \le r_2$, are derived from any logical functions of the inputs $x_1(t), \ldots, x_{r_1}(t)$.
The following lemma forms the basis of much of our subsequent work. We will use it to count gates as well as higher level components such as integrated circuit packages or whole processors. Thus, we state the lemma in terms of operations O which can be interpreted as logical OR and AND or as arithmetic addition and multiplication. When we deal with fan-out, at the gate level O corresponds to gates while at the processor level it refers to registers or demultiplexors. For a proof of Lemma 3, see the Appendix. It may be possible to sharpen this result slightly using a reformulated version of [7] given in [11].
Lemma with s2(3+f2 )nlogn-(+f n)
Thus, we see that for gates with large fan-outs, we can solve any R(n,1) system in $T_G = O(\log n)$ with $G = O(n \log n)$.
Thus, we see that for gates with large fan-outs we can solve an R(n,m) system in $T_G = O(\log m \log n)$ with $G = O(m^2 n \log n)$.
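To illustrate how a first-order system $x_i = c_i + a_i x_{i-1}$ with $x_0 = 0$ can be solved in about $\log n$ parallel steps using $O(n \log n)$ operations, here is a minimal sketch based on recursive doubling. This is a standard technique swapped in for illustration; it is not claimed to be the circuit construction of [7] or of Lemma 3. The names `solve_first_order` and `solve_sequentially` are hypothetical. The `add` and `mul` arguments may be Boolean OR and AND or arithmetic addition and multiplication, provided `mul` distributes over `add`.

```python
from operator import add as int_add, mul as int_mul, or_, and_


def solve_first_order(c, a, add, mul):
    """Solve x_i = c_i + a_i * x_{i-1}, x_0 = 0, by recursive doubling.

    Returns (x, steps). Each pass of the while loop is one parallel step:
    every index i could be updated by a separate component at the same time.
    Invariant: x_i = val[i] + coef[i] * x_{i - span} (x_j = 0 for j <= 0).
    """
    n = len(c)
    val, coef = list(c), list(a)
    span, steps = 1, 0
    while span < n:
        new_val, new_coef = val[:], coef[:]
        for i in range(span, n):
            new_val[i] = add(val[i], mul(coef[i], val[i - span]))
            new_coef[i] = mul(coef[i], coef[i - span])
        val, coef = new_val, new_coef
        span *= 2
        steps += 1
    return val, steps


def solve_sequentially(c, a, add, mul, zero):
    """Reference solution: one term per time step, as a sequential circuit would."""
    x, prev = [], zero
    for ci, ai in zip(c, a):
        prev = add(ci, mul(ai, prev))
        x.append(prev)
    return x


# Arithmetic case with n = 16: four doubling steps.
c = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
a = [2, 1, 0, 3, 1, 2, 1, 1, 0, 2, 3, 1, 1, 2, 1, 2]
x, steps = solve_first_order(c, a, int_add, int_mul)
assert x == solve_sequentially(c, a, int_add, int_mul, 0)
assert steps == 4
```

The same function handles the Boolean case by passing `or_` and `and_` with 0/1 atoms, which is the form the carry recurrence of a binary adder takes.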
Definition 5: The k step operation of a sequential circuit S is defined by k pairs of vectors $[(x_1(t), \ldots, x_{r_1}(t)), (y_1(t), \ldots, y_{s_1}(t))]$ for $1 \le t \le k$. These vectors represent the external inputs and outputs of S at each time step t.
Theorem 2: The k step operation of any linear sequential circuit S(r,s,e,n,d,m) can be realized by a combinational circuit such that for large k $T_G <$ . $(\log_f s_2 k)(\log s_2 k) + O(\log k)$ with $G < (m + 1)^2 s_2^3$ (2 + ) $k \log s_2 k + O(k)$.
Proof: Our proof is in three parts. First, we set up the A and c arrays of Definition 4. Then we evaluate the resulting recurrence system. Finally, we generate the external outputs.
The A matrix and c vector components can be generated from the external inputs at any of the k time steps. Thus, we have a total of $k r_1$ inputs to combinational circuit $C_1$ which produces as outputs the components of A and c. Since a total of $n_2$ atoms are used in generating all of the feedback outputs of S, there are at most $k n_2$ nonzero components in A and c. The maximum number of atoms in any expression is $e_b$, the total number of atoms is $k n_2$, and the maximum parentheses depth is $d_b$, so we can set up the A and c arrays with a combinational circuit $C_1(k r_1, k n_2, e_b, k n_2, d_b)$.
Next we solve the linear recurrence R(n,m). There are a total of $k s_2$ outputs in k time steps so $n = k s_2$. Since the maximum delay is m time steps with $s_2$ outputs per time step, the bandwidth of this system is at most $(m + 1)s_2 - 1$. Thus, we have a recurrence of the form $R(k s_2, (m + 1)s_2 - 1)$.
Now we turn to the consideration of higher level components as our basic circuit elements. We will define two package types which could be implemented directly using integrated circuits. Our time bounds will be expressed in package delays. The techniques of the previous section could be used to design such packages. Our component bounds will be expressed in terms of the total number of packages required.
Our strategy in this case is to decompose a linear recurrence system R (n,m) into a number of small identical systems. These smaller systems can be solved directly by interconnecting the integrated circuit packages we specify. An algorithm to decompose a large R (n,m) system has been given in [7] , [12] for arithmetic operations. Here we present the algorithm for logic design and consider only the R (n, 1) case for the sake of easy explanation. The R (n, 1) case is by far the most common one occurring in practical logic design, and our method can be extended to larger m in a straightforward way.
Definition 6: We define two types of integrated circuit packages.
a) ICR(n,1) is a package which accepts input atoms $c_i$ for $1 \le i \le n$, and $a_i$ for $2 \le i \le n$. It computes the outputs $x_i$ for $1 \le i \le n$ according to the recurrence relation $x_0 = 0$, $x_i = c_i + a_i x_{i-1}$. For signal input and output it has a total number of pins equal to $3n - 1$ times the number of bits per atom.
b) ICU(n) is a package which may accept input atoms $a_i$ and $b_i$ for $1 \le i \le n$, and c and d. It computes the outputs $x_i$ for $1 \le i \le n$, according to
The following algorithm is adapted from [12] (cf., ch. 4). It solves any R(n,1) system by partitioning it into smaller systems.
Algorithm 1: Any given first-order linear recurrence R(n,1): $x_0 = 0$, $x_i = c_i + a_i x_{i-1}$, $1 \le i \le n$, can be solved as follows.
Step 1: a) For any $h \ge 2$, compute n/h independent recurrence systems $Z^{(j)}$, $1 \le j \le n/h$, defined as follows.
Step 2: From the results of Step 1, compute the following recurrence system: $Z_h^{(0)} = 0$, $Z_h^{(j)} = z_h^{(j)} + y_h^{(j)} Z_h^{(j-1)}$, $1 \le j \le n/h$. From this step we obtain another $[(n/h) - 1]$ elements of the solution, i.e., $x_{jh} = Z_h^{(j)}$ for $2 \le j \le n/h$.
Step 3: From the results of Steps 1 and 2, compute the remaining elements of the solution using the following $n - (n/h) - (h - 1)$ independent expressions: $x_{i+(j-1)h} = z_i^{(j)} + y_i^{(j)} Z_h^{(j-1)}$ for $1 \le i \le h - 1$ and $2 \le j \le n/h$.
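The three steps of the partitioning scheme can be sketched for the arithmetic case as follows. The exact definitions used in Step 1 are not legible in the text, so this sketch assumes that Step 1a solves each block of h terms as if its incoming value were zero (giving local values z) and that Step 1b forms the running products of the a coefficients inside each block (giving y). Under that assumption, Step 2 links the block boundaries and Step 3 fills in the remaining terms. The function name `partitioned_solve` and the variable names are illustrative, and h is assumed to divide n.

```python
def partitioned_solve(c, a, h):
    """Solve x_i = c_i + a_i * x_{i-1}, x_0 = 0, by splitting into blocks of h."""
    n = len(c)
    assert n % h == 0, "this sketch assumes h divides n"
    blocks = n // h

    # Step 1 (assumed form): all blocks are independent of each other.
    z, y = [], []
    for j in range(blocks):
        zj, yj, z_prev, y_prev = [], [], 0, 1
        for i in range(h):
            k = j * h + i
            z_prev = c[k] + a[k] * z_prev   # 1a: local recurrence with zero input
            y_prev = a[k] * y_prev          # 1b: running product of coefficients
            zj.append(z_prev)
            yj.append(y_prev)
        z.append(zj)
        y.append(yj)

    # Step 2: a recurrence over block boundaries only, of length n/h.
    boundary, prev = [], 0
    for j in range(blocks):
        prev = z[j][h - 1] + y[j][h - 1] * prev
        boundary.append(prev)               # boundary[j] is x at the end of block j

    # Step 3: remaining terms, each an independent expression.
    x = [0] * n
    for j in range(blocks):
        incoming = boundary[j - 1] if j > 0 else 0
        for i in range(h):
            x[j * h + i] = z[j][i] + y[j][i] * incoming
    return x


# An R(16,1) system split into four blocks of h = 4.
c = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
a = [2, 1, 0, 3, 1, 2, 1, 1, 0, 2, 3, 1, 1, 2, 1, 2]
direct, prev = [], 0
for ci, ai in zip(c, a):
    prev = ci + ai * prev
    direct.append(prev)
assert partitioned_solve(c, a, 4) == direct
```

In a package-level design, each inner block of Step 1a and the boundary recurrence of Step 2 would correspond to a small first-order recurrence of the kind an ICR package computes.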
IC.4-+ 3 g -4. h log h
IV. APPLICATIONS
In this section we will study several practical logic design problems. The methods of Section III will be used to derive time and component bounds. We will consider binary addition and ones' position counting in detail. In less detail we will consider binary multiplication, digital filtering, and a control problem.
Example 5
The R(16,1) system $x_i = 0$ for $i \le 0$ and $x_i = c_i + a_i x_{i-1}$
For use in a later application, we now consider a special case of an R(n,1) system. Let $a_i = 1$, for all i, in Algorithm 1. In this case we need not perform Step 1b). So for each iteration, only $\lceil n'/h \rceil$ type ICR(h,1) packages are required.
Also, note that all $Z^{(j)}$ are computed in Step 1a) by merely summing atoms. Since Steps 2 and 3 require only multiplication by the y's generated in Step 1, which are 1's, no
Definition 7: By the addition of two n digit binary numbers $a = a_n \cdots a_1$ and $b = b_n \cdots b_1$ we mean the generation of sum digits $s = s_n \cdots s_1$ and carry digit $c_n$, defined as follows. We write
$$s_i = (a_i b_i + \bar{a}_i \bar{b}_i)\, c_{i-1} + (\bar{a}_i b_i + a_i \bar{b}_i)\, \bar{c}_{i-1} \qquad (1)$$
where $1 \le i \le n$ and $c_0 = 0$, such that $s_i = 1$ iff just one or all three of $a_i$, $b_i$, and $c_{i-1}$ are equal to 1. Also we write
$$c_i = a_i b_i + (a_i + b_i)\, c_{i-1} \qquad (2)$$
where $1 \le i \le n$ and $c_0 = 0$, such that $c_i = 1$ iff any two or all three of $a_i$, $b_i$, and $c_{i-1}$ are equal to 1. Now let
$$x_i = a_i + b_i \qquad (3)$$
and
$$y_i = a_i b_i. \qquad (4)$$
Our first result concerns binary addition using gates as components.
Proof: Our proof consists of three parts.
1) To generate the $x_i$ and $y_i$, $1 \le i \le n$, from $a_i$ and $b_i$ by (3) and (4), we need 2n gates and one gate delay, so $T_{G1} = 1$ and $G_1 = 2n$.
2) To generate the $s_i$, $1 \le i \le n$, from $x_i$, $y_i$, and $c_{i-1}$ using (6), we refer to Fig. 5. It is easy to verify that the theorem holds for n = 1 by a direct construction. Thus, we have $T_{G2} = 3$ with $G_2 \le 7n$.
3) To generate the $c_i$, $1 \le i \le n$, from $x_i$ and $y_i$ using (7), we turn to Lemma 3. Since (7) defines an R(n,1) system, it follows immediately from Corollary 1 (cf., Fig. 3
Proof: The $x_i$ and $y_i$ of Definition 7 can be generated in one package delay using 2n/h type ICU packages. The carries of (7) (cf., Fig. 4) can be generated following Lemma 4 in $T_{IC} \le 2(\log n/\log h) - 1$ using $6(n/h) + 4(\log n/\log h) - 7$ packages. Then the sum bits of (6) can be generated in one package delay using n/h packages of type ICU following Definition 6b), ii). Summing these counts proves the theorem.
Q.E.D.
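The way binary addition reduces to a Boolean first-order recurrence can be made concrete. Substituting (3) and (4) into (2) gives the carry recurrence $c_i = y_i + x_i c_{i-1}$ with $c_0 = 0$, where + is OR and the product is AND. The sketch below solves that recurrence by recursive doubling (a standard technique used here for illustration, not necessarily the circuit of the proofs above) and then forms each sum bit as the exclusive OR of $a_i$, $b_i$, and $c_{i-1}$, which is what (1) expresses. The function names are illustrative; bit lists are least significant bit first.

```python
def to_bits(value: int, n: int):
    """n-bit representation of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def from_bits(bits) -> int:
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def add_via_carry_recurrence(a_bits, b_bits):
    """Return (sum_bits, carry_out) for two equal-length bit lists."""
    n = len(a_bits)
    x = [ai | bi for ai, bi in zip(a_bits, b_bits)]   # (3): x_i = a_i + b_i
    y = [ai & bi for ai, bi in zip(a_bits, b_bits)]   # (4): y_i = a_i b_i

    # Solve c_i = y_i + x_i c_{i-1} in about log2(n) doubling steps.
    carry, prop = y[:], x[:]
    span = 1
    while span < n:
        new_carry, new_prop = carry[:], prop[:]
        for i in range(span, n):
            new_carry[i] = carry[i] | (prop[i] & carry[i - span])
            new_prop[i] = prop[i] & prop[i - span]
        carry, prop = new_carry, new_prop
        span *= 2

    carry_in = [0] + carry[:-1]                        # c_{i-1} for each position
    sum_bits = [ai ^ bi ^ ci for ai, bi, ci in zip(a_bits, b_bits, carry_in)]
    return sum_bits, carry[-1]


# Exhaustive check for 4-bit operands.
for u in range(16):
    for v in range(16):
        s, c_out = add_via_carry_recurrence(to_bits(u, 4), to_bits(v, 4))
        assert from_bits(s) + (c_out << 4) == u + v
```

Only the carry computation is a recurrence; the generation of $x_i$, $y_i$ and of the sum bits is purely bitwise, which is why those stages contribute a constant number of delays in the proofs.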
Example 6
Consider the problem of adding two 32-bit binary numbers using gates with fan-in 2 and fan-out 8. By the method of Theorem 3, the sum can be formed in at most 21 gate delays.
The next application we study is a ones' position counter. This is the problem of determining the number of ones to the right (say) of each bit in a word. The problem arises in various real-world contexts, particularly in control design. We discuss the problem because of its practical interest and also because it serves as an interesting case standing between binary addition and binary multiplication.
As we saw above, given the theoretical background of Section III on solving linear recurrences, the design of a binary adder is straightforward. The ones' position counter is not as easy, however. When formulated at the bit level, this problem leads to a nonlinear recurrence which cannot be solved by the methods of Section III. As we shall see later, binary multiplication also shares this property.
The technique we use to solve such logic design problems with bit level nonlinearities is to reformulate them at a higher level where they are in fact linear. The nonlinearity is thus hidden inside a more complex bit level operator. In practical terms, this can be accomplished by building a nonlinear circuit element and then combining these in linear ways according to the techniques of Section III. Putting such nonlinearities inside integrated circuit packages is an attractive possibility.
Thus, by using $1 + \log n$ bit adders (cf., Theorem 4) as components we can solve the system in $O(\log n)$ adder steps (cf., Corollary 1), so $T_G = O(\log n) \cdot O(\log \log n) = O(\log n \log \log n)$.
Since each adder has $O(\log n \cdot \log \log n)$ gates, we have a total gate count of $G = O(n \log n) \cdot O(\log n \cdot \log \log n) = O(n \log^2 n \cdot \log \log n)$.
By formulating this problem in terms of integrated circuit packages we can use Corollary 3 to achieve a better gate count than the above. Thus, to solve an arithmetic R(n,1) system we need $IC = O(n/h)$. Each ICR(h,1) package is used to count 1's, so inside each package we can use the method of Corollary 1 to solve an arithmetic R(h,1) system. Thus, from Corollary 1, we have an operation count of $O = O(h \log h)$. Now let us choose $h = \log n$ so $O = O(\log n \cdot \log \log n)$. Each such O processor is used to add log n bit numbers. Thus we use Theorem 3 to count the gates as $G = O(\log n \cdot \log \log n)$. Multiplying these three levels of components we obtain a total gate count of $G = O\left(\frac{n}{h} \cdot h \log h \cdot \log n \cdot \log \log n\right) = O(n \log n \cdot (\log \log n)^2)$.
Similarly, we obtain the time. By Corollary 3 we have $O(\log n/\log h)$ package delays. Each package delay is $T_O = O(\log h)$ from Corollary 1. And the add time by Theorem 3 is $O(\log \log n)$. Hence, our total time in gate delays is $T_G = O(\log n \cdot \log \log n)$.
Thus, we see that the time is the same but we have reduced the gate count over the straightforward method. We can summarize this as
Theorem 5: The ones' position count of an $n = 2^t$, $t > 0$, bit word can be generated in $T_G = O(\log n \log \log n)$ with $G = O(n \log n (\log \log n)^2)$.
We note that the gate count can be further improved by using more types of packages. For example, if we let $h = \log \log n$ in Step 1 of Algorithm 1 and $h = \log n$ in Step 2 (see proof of Lemma 4), we can obtain a solution in $T_G = O(\log n \cdot \log \log n)$ with $G = O(n \cdot (\log \log n)^2 \cdot \log \log \log n)$. By using even more package types, even better gate bounds are possible.
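The view of ones' position counting as an arithmetic first-order recurrence with all $a_i = 1$ can be sketched as follows. The formal definition of the counter is not legible in the text, so this sketch assumes the output for bit i is the number of ones strictly to its right, obtained from the inclusive running sum $x_i = c_i + x_{i-1}$ with $c_i$ equal to bit i. The running sums are formed in about $\log n$ rounds of word-level additions, each round standing in for one layer of small adders. The function name `ones_position_count` is illustrative.

```python
def ones_position_count(bits):
    """bits[0] is the rightmost bit.

    Returns (counts, rounds), where counts[i] is the number of ones strictly
    to the right of bit i and rounds is the number of parallel adder layers.
    """
    n = len(bits)
    total = list(bits)            # total[i] will become bits[0] + ... + bits[i]
    span, rounds = 1, 0
    while span < n:
        # All n additions in a round are independent of one another.
        total = [total[i] + (total[i - span] if i >= span else 0)
                 for i in range(n)]
        span *= 2
        rounds += 1
    counts = [0] + total[:-1]     # shift by one position to exclude bit i itself
    return counts, rounds


counts, rounds = ones_position_count([1, 0, 1, 1, 0, 1, 0, 0])
assert counts == [0, 1, 1, 2, 3, 3, 4, 4]
assert rounds == 3
```

Because every partial sum is at most n, each adder in the sketch only needs about $1 + \log n$ bits, which is where the $\log \log n$ factor in the gate-delay bound comes from.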
To obtain a package bound, Corollary 3 can be applied directly. The 
where $c_{i,0} = a_i$ and $z_{0,j} = 0$. Notice that at the bit level, this is a nonlinear recurrence and cannot be solved by the methods of Section III.
If we use half-adders as components, it is easy to see that the problem can be solved with $G = O(n \log n)$ and $T_G = O(n)$. This gate count is comparable to the best shown above, but it uses much more time.
Next, we turn to bounds for binary number multipliers.
Definition 9: By the multiplication of two n digit binary numbers $a = a_n \cdots a_1$ and $b = b_n \cdots b_1$, we mean the generation of 2n product digits $p = p_{2n} \cdots p_1$.
First, we can formulate the multiplication problem using a straightforward (row parallel) carry-save adder array. If we let x correspond to various pairwise AND's of input bits [13], we obtain a coupled recurrence system of the form
$$q_{i,j} = x \oplus q_{i-1,j} \oplus c_{i-1,j-1} \qquad (10)$$
$$c_{i,j} = x \cdot q_{i-1,j} + x \cdot c_{i-1,j-1} + q_{i-1,j} \cdot c_{i-1,j-1}. \qquad (11)$$
Note that this nonlinear recurrence system is a generalization of (8) and (9) for the ones' position counter. This cannot be solved by the methods of Section III; however, we can solve it directly using an array of $n^2$ bit level adders. This gives a circuit which can multiply two n bit numbers in $T_G = O(n)$ with $G = O(n^2)$. Since we are interested in faster schemes, we will now turn to two methods to solve the recurrence of (10) and (11) in parallel.
The first method uses a tree of 2n bit adders. First, we form a standard array of partial products. Then we use the adder tree to form the sum.
1) To generate the $n^2$ partial product bits $a_i \cdot b_j$, for all $1 \le i, j \le n$, we need $n^2$ AND gates and one gate delay. Since each input bit is fanned out to n places we have from the above and Lemma 1, $T_{G1} \le 1 + \log_f n$ with $G_1 < n^2 + 2n\,\frac{n - 1}{f - 1} < n^2\left(1 + \frac{2}{f - 1}\right)$.
2) To generate the sum of the partial products we need an adder tree of n - 1 adders. Each adder adds 2n bit numbers and the height of the tree is log n adder delays.
Q.E.D.
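The structure of the adder-tree multiplier can be sketched at the word level. In the sketch below each partial product row is $a$ ANDed with one bit of $b$ and shifted into place, and Python integer addition stands in for a 2n bit adder. That substitution is an assumption made for illustration, and no gate-level detail of the proof is modeled. The function name `multiply_with_adder_tree` is hypothetical.

```python
def multiply_with_adder_tree(a: int, b: int, n: int):
    """Multiply two n-bit numbers with a balanced tree of adders.

    Returns (product, levels, adders): the product, the height of the tree
    in adder delays, and the number of adders used.
    """
    # Partial products: row j is a AND-ed with bit j of b, shifted left by j.
    rows = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]

    levels = adders = 0
    while len(rows) > 1:
        # One level of the tree: all pairwise additions are independent.
        paired = [rows[i] + rows[i + 1] for i in range(0, len(rows) - 1, 2)]
        adders += len(paired)
        if len(rows) % 2:
            paired.append(rows[-1])   # an odd row passes through unchanged
        rows = paired
        levels += 1
    return rows[0], levels, adders


product, levels, adders = multiply_with_adder_tree(13, 11, 4)
assert product == 143
assert levels == 2      # log2(4) adder delays
assert adders == 3      # n - 1 adders
```

For n a power of two, the sketch uses n - 1 adders arranged in log n levels, matching the counts stated in part 2) of the proof.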
As an example of this theorem, consider an integrated circuit package as follows.
Example 8
Using gates with fan-out 8, a multiplier of two 4-bit numbers can be implemented with a delay of TG< (6)
The above result is somewhat sloppy because we considered all inputs to the tree adder to be 2n bit numbers. In fact, the inputs to the first level of adders are only n bit numbers. At succeeding levels they are of length $n + 2, n + 5, \ldots, n + i + 2^{i-1} - 2$, for $1 \le i \le \log n$. By a careful analysis which takes this increasing length into account, we can improve the gate count in Theorem 6 by a factor between 2
The generation of partial products is done in the same way as in Theorem 6. For an upper bound on time, we assume a three to two column compression scheme [13]. The column compression for two n-bit numbers can be done with $(n^2 - 4n + 3)$ full adders and $(n - 1)$ half-adders. The half-adder can be built using 9 gates (see Theorem 3 with n = 1). A full adder of 2 bits can be easily implemented with 11 gates by a scheme similar to Fig. 5. Thus, we have a total of $G_2 \le 11(n^2 - 4n + 3) + 9(n - 1) = 11n^2 - 35n + 24$.
The time for the column compression is $T_{G2} \le 6 \log_{3/2} n \approx 10.3 \log n$ since each full adder requires at most 6
... and $b = b_n \cdots b_1$ and a starting bit, either $a_1$ or $b_1$. We wish to generate a word $e = e_n \cdots e_1$ which consists of those bits on a path through a and b chosen as follows. First, we let $e_1$ be the given starting bit. Then we choose bits in the same word until we encounter a zero in (say) bit i, which causes us to choose bit $e_{i+1}$ from the other word. We continue in the other word until we encounter a zero which causes another switch, etc. We define $c_i = a_{i-1} c_{i-1} + \bar{b}_{i-1} d_{i-1}$, $2 \le i \le n$, and
As a final application of the ideas of this paper we mention digital filtering. This topic has received a great deal of attention in recent years. Our combinational results can be applied to nonrecursive filters and our recurrence results can be applied to recursive filters in rather direct ways. For more details about such filters see [16] or [17].
APPENDIX A Proof of Lemma 3
Our proof follows the proof of Theorem 2 of [7] and a logical circuit can be constructed following Algorithm 2 of [7]. First, we consider the time required. The computational O delays follow directly from the time bound for solving an R(n,m) system in Theorem 2 of [7]. Thus, for the first part of our time bound, we have from Theorem 2 of [7]
$$T_1 \le (2 + \log m) \log n - \tfrac{1}{2}(\log^2 m + \log m).$$
To complete the time bound, we must consider the fan-out time required by Theorem 2 of [7]. Such times were regarded as negligible compared to arithmetic operation times in [7]. The solution of an R(n,m) system is generated in log n iterations. It may be seen from Fig. 4 of [7] that on iteration i = log k, we perform at most (k/2 + m - 1) way fan-outs. Thus, the fan-out time on iteration i is bounded using Lemma 1, and summing over all iterations gives
$$T_2 \le \tfrac{1}{2} \log n \,(1 + \log_f n).$$
Thus, our total time is $T_1 + T_2$ or
$$T_O \le (2 + \log m) \log n - \tfrac{1}{2}(\log^2 m + \log m) + \tfrac{1}{2} \log n \,(1 + \log_f n) = \left(\tfrac{5}{2} + \log m + \tfrac{1}{2} \log_f n\right) \log n - \tfrac{1}{2}(\log^2 m + \log m).$$
Next, we consider the number of O operations required in the combinational circuit. In the proof of Theorem 2 [7], we gave expressions for counting the number of processors in evaluating an R(n,m) system. Since a tree of n leaves has at most 2n - 1 nodes, we can upper bound the number of O operations by doubling the processor count from Theorem 2 of [7]. We choose the worst expression for the processor count on iteration i = log k, namely, expression (2) [7], the $2m \le 2^i \le n$ case, sum over all iterations, for
As is discussed in [7], the trees we are evaluating are of a special form with * operations at the leaf nodes and + operations elsewhere. The above sum can be used as an exact count of * operations. But since the trees are somewhat sparse, a more refined count reduces the number of + operations. Thus, our factor of 2 above is too large. By a straightforward but long argument similar to the above, we can show that the O operation count is actually bounded by O1 < (m2 + ) n log n + (m3-22M2)n + m(2m-1) which we use in the statement of the theorem.
Now we consider the number of fan-out O operations required. It follows from Theorem 2 [7] that iteration i requires $(m^2 + m)n/k - m^2$ fan-outs, each fanning out to at most k/2 + m - 1 destinations. Thus the total number of O operations can be computed using Lemma 1 as
