A general theory is developed for constructing the asymptotically shallowest networks and the asymptotically smallest networks (with respect to formula size) for the carry save addition of n numbers using any given basic carry save adder as a building block.
Introduction
The question 'How fast can we multiply?' is one of the fundamental questions in theoretical computer science. Ofman-Karatsuba [9] and Schönhage-Strassen [24] (see also [1], [15]) tried to answer it by minimising the number of bit operations required, or equivalently the circuit size. A different approach was pursued by Avizienis [2], Dadda [6], Ofman [17], Wallace [28] and others. They investigated the depth, rather than the size, of multiplication circuits.
The main result proved by the above authors in the early 1960's was that, using a process called Carry Save Addition, n numbers (of linear length) could be added in depth O(log n). As a consequence depth O(log n) circuits for multiplication and polynomial size formulae for all the symmetric Boolean functions are obtained.
They all used a component called a '3 → 2' Carry Save Adder (CSA$_{3\to 2}$) which reduces the sum of three numbers (of arbitrary length) to the sum of only two in a (small) constant depth. It is easy to see that using $\log_{3/2} n + O(1)$ levels of such CSA$_{3\to 2}$'s it is possible to reduce the sum of $n$ numbers to the sum of only two. The resulting two numbers could be added (if required) using a Carry Look Ahead Adder (see [3], [10]) with additional depth $(1 + o(1)) \log m$ (where $m$ is the length of the numbers).
In this paper we look more carefully at the construction of CSA-networks for carry save addition. A moment's reflection shows that any CSA$_{3\to 2}$-network for the carry save addition of $n$ numbers will have at least $\log_{3/2} n$ levels of CSA$_{3\to 2}$'s. Thus, if a CSA$_{3\to 2}$ is regarded as a black box to which the inputs should be supplied simultaneously and which then after a fixed delay returns the two outputs simultaneously, then this naive construction is optimal. It turns out however that, even for the best CSA$_{3\to 2}$'s, some of the inputs may be supplied after the others, without delaying the outputs, and that one of the outputs is sometimes produced before the other. The task of constructing networks with minimal total delay in such cases becomes much more interesting.
In general we assume that we are given a CSA$_{k\to\ell}$ whose delay characteristics are described by a delay matrix $M$. The entry $m_{ij}$ of the matrix gives the relative delay of the $i$-th output with respect to the $j$-th input. In particular, if the $k$ inputs to a CSA$_{k\to\ell}$ are ready at times $x_1, \ldots, x_k$, we assume that the $i$-th output is ready at time $y_i = \max_{1 \le j \le k}\{m_{ij} + x_j\}$. This corresponds to taking the $\{\max, +\}$ inner product between $M$ and $x$. We show how to extract from any delay matrix $M$ the minimal constant $q$ such that CSA$_{k\to\ell}$-networks for the carry save addition of $n$ numbers with delay $(q + o(1)) \log n$ can be constructed using CSA$_{k\to\ell}$'s with delay matrix $M$. We exhibit explicit constructions achieving this optimal behaviour.
For a given implementation of a CSA$_{k\to\ell}$ using Boolean circuitry, if we define $m_{ij}$ to be the length of the longest path from any bit of the $j$-th input number to any bit of the $i$-th output number then the above result translates immediately to a result about depth.
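As a minimal sketch of the delay rule described above, the $\{\max, +\}$ product can be written in a few lines of Python. The function name `maxplus` and the sample matrix are illustrative choices, not part of the paper; the matrix is the one implied by the modular form $M = (2\ 3)^T - (1\ 0\ 0)$ quoted for the FA$_3$ in the numerical examples.

```python
def maxplus(M, x):
    """{max,+} product: output i is ready at max_j (M[i][j] + x[j])."""
    return [max(m_ij + x_j for m_ij, x_j in zip(row, x)) for row in M]

# Assumed example: delay matrix of a 3->2 unit (rows = outputs, columns = inputs).
M = [[1, 2, 2],
     [2, 3, 3]]

simultaneous = maxplus(M, [0, 0, 0])  # all three inputs ready at time 0
staggered = maxplus(M, [1, 0, 0])     # first input arrives one unit late
```

Under this assumed matrix both calls return the same output times, which is the sense in which an input may be supplied after the others without delaying the outputs.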
Several basic designs of CSA's are described in the next section. Using these designs optimally we get $U_2$-circuits (circuits over the unate dyadic basis $U_2 = B_2 - \{\oplus, \equiv\}$) of depth $5.42 \log n$ and $B_2$-circuits (circuits over the basis $B_2$ of all dyadic Boolean functions) of depth $3.71 \log n$ for the carry save addition of $n$ numbers. Using more complicated CSA's, not described here, these results could be improved to $5.02 \log n$ and $3.57 \log n$ respectively. As a consequence, we derive circuits of depth $6.02 \log n$ and $4.57 \log n$ for the addition of $n$ numbers (of linear length) or for the multiplication of two $n$ bit numbers. This improves a previous result of Khrapchenko [14] and the naive estimates of Ofman and Wallace.
Multiple addition circuits (of $n$ numbers of $n$ bits each) are necessarily of size $\Omega(n^2)$. Our circuits are composed of $O(n)$ CSA's each of size $O(n)$ so they have this optimal size. The Schönhage-Strassen multiplication algorithm uses the Discrete Fourier Transform (DFT) to reduce dramatically the size of multiplication circuits. Since the computation of a DFT essentially involves multiple additions, carry save adders could be used to implement DFT's, and therefore the whole Schönhage-Strassen algorithm, in logarithmic depth (cf. [16], [29]). The implied constant factors however are much larger than those obtained here.
Another special case of multiple addition is bit counting. A counter for $n$ bits could be obtained by carry save adding the $n$ input bits, treating each as a number, and then adding the two output numbers. Note that the length of the two output numbers is $O(\log n)$ so the additional depth required to add them up in this case is only $O(\log \log n)$. As a consequence we get depth $5.02 \log n$ $U_2$-circuits and depth $3.57 \log n$ $B_2$-circuits for counting. Many symmetric Boolean functions, such as majority and MOD$_k$ for any fixed $k$, can be computed in depth $o(\log n)$ once the bit count is done, so we get the same bounds for them as well.
An analogous theory is developed for formula size. We assume that the formula size characteristics of a CSA$_{k\to\ell}$ are described by an occurrence matrix $N$. The entry $n_{ij}$ gives the number of appearances of the $j$-th input number in the formula for the $i$-th output number. If the $k$ inputs to a CSA$_{k\to\ell}$ have formula sizes $x_1, \ldots, x_k$ then the $i$-th output number will have formula size $y_i = \sum_{j=1}^{k} n_{ij} x_j$. Note that this corresponds to multiplying the matrix $N$ by the vector $x = (x_1, \ldots, x_k)$. Again we show how to extract from the occurrence matrix the minimal $q$ such that CSA$_{k\to\ell}$-networks of formula size $n^{q+o(1)}$ can be constructed and describe constructions with optimal behaviour. Using the CSA designs of Section 2 optimally we get $U_2$-formulae of size $O(n^{4.70})$ and $B_2$-formulae of size $O(n^{3.21})$ for each output bit in the carry save addition of $n$ numbers and for many symmetric Boolean functions as before. Again, using more complicated CSA's, not described here, these bounds could be improved to $O(n^{4.57})$ and $O(n^{3.13})$ respectively. These constructions improve previous results of Khrapchenko [13], Pippenger [21], Paterson [18] and Peterson [20].
Depth and the logarithm of formula size are closely connected. It is known for example that $\log L_{B_2}(f) \le D_{B_2}(f) \le 2.47 \log L_{B_2}(f)$ (in fact even that $D_{U_2}(f) \le 2.47 \log L_{B_2}(f)$) and that $\log L_{U_2}(f) \le D_{U_2}(f) \le 1.81 \log L_{U_2}(f)$ (see [4], [23], [25]). These relations are insufficient for the derivation of optimal constants however, and we have to optimise separately for depth and for formula size. The known connections between $B_2$ and $U_2$, namely $D_{U_2}(f) \le 2 D_{B_2}(f)$ and $L_{U_2}(f) \le O((L_{B_2}(f))^{\log_3 10})$ (see [22]), are also too crude to be of any help to us.
The theories developed for depth and formula size are analogous. However some differences result from the fact that the usual $\{+, \times\}$ inner product is used in the formula size case while the not-so-usual $\{\max, +\}$ inner product is used for depth.
In particular, while the parameters that should be optimised in the formula size case are continuous, some of them are discrete in the delay case. This changes the nature of the optimisation problems involved. A summary of the 'numerical' results obtained in this work together with the previously known results is given in Table 1.1. The right columns give the dimensions of the CSA's used. Results marked by a single star (*) are obtained using building blocks described in this paper. Results marked with a double star (**) are obtained using building blocks which will be described in a subsequent paper. In three out of the four cases we improve the previously known results even using the same CSA's used by the previous authors. The improvements we get are quite marginal in some of the cases. Our results, however, could only be improved by either designing improved basic carry save adders or by designing circuits which are not constructed from carry save adders.
Carry Save Adders
A $k$-bit full adder (FA$_k$) receives $k$ input bits and outputs $\lceil \log(k+1) \rceil$ bits representing, in binary notation, their sum. Usually $k$ is of the form $2^{\ell} - 1$.
Arrays of FA's could be used to construct CSA's. A construction of a CSA$_{3\to 2}$ using FA$_3$'s, for example, is illustrated in Fig. 2.1. The depth of the CSA obtained is equal to the depth of the FA used and is independent of the length of the numbers to be carry save added. A $B_2$-implementation of an FA$_3$ is given in Fig. 2.2(a). The delay matrix of this FA$_3$, which describes the relative delay of each output with respect to each input, is easily seen to be $\begin{pmatrix} 1 & 2 & 2 \\ 2 & 3 & 3 \end{pmatrix}$. The delay characteristics of the CSA$_{3\to 2}$ obtained are also described by this delay matrix.
Notice that $x_1$ may be supplied to this FA$_3$ one unit of time after $x_2$ and $x_3$ are supplied, and that $y_0$ is obtained one unit of time before $y_1$. Thus, the FA$_3$ can be represented schematically by the 'gadget' appearing in Fig. 2.2(c).
Two formulae, one for each output, are obtained by expanding the circuit of Fig. 2.2(a). The formula for $y_1$ has size 5. The variable $x_1$ appears only once in it while each of the variables $x_2, x_3$ appears twice. We therefore say that the occurrence vector of the formula is $(1, 2, 2)$. The occurrence vector of the formula for $y_0$ is $(1, 1, 1)$. Combining these vectors we get that the occurrence matrix of the implementation is $\begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 2 \end{pmatrix}$.
An alternative $B_2$-implementation of an FA$_3$ is given in Fig. 2.3. This implementation has a worse delay matrix $\begin{pmatrix} 1 & 2 & 2 \\ 3 & 3 & 3 \end{pmatrix}$ but a better occurrence matrix $\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 3 \end{pmatrix}$. It can be checked that no other $B_2$-implementation has a better delay or occurrence matrix than those given.
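The delay and occurrence matrices of a formula-based full adder can be read off mechanically. The sketch below does this for one standard $B_2$ full adder, $y_0 = x_1 \oplus (x_2 \oplus x_3)$ and $y_1 = (x_1 \wedge (x_2 \oplus x_3)) \oplus (x_2 \wedge x_3)$. These formulae are an assumption chosen to be consistent with the occurrence vectors $(1,1,1)$ and $(1,2,2)$ quoted above; the figure's exact formulae may differ. The tuple encoding and all function names are illustrative.

```python
from itertools import product

# Formulae as nested tuples (op, left, right); leaves are variable names.
S = ("xor", "x2", "x3")
Y0 = ("xor", "x1", S)                                   # sum bit (assumed form)
Y1 = ("xor", ("and", "x1", S), ("and", "x2", "x3"))     # carry bit (assumed form)
VARS = ("x1", "x2", "x3")

def evaluate(f, env):
    if isinstance(f, str):
        return env[f]
    op, left, right = f
    u, v = evaluate(left, env), evaluate(right, env)
    return u ^ v if op == "xor" else u & v

def depth_of(f, var):
    """Longest path from an occurrence of var to the root (None if var is absent)."""
    if isinstance(f, str):
        return 0 if f == var else None
    below = [d for d in (depth_of(f[1], var), depth_of(f[2], var)) if d is not None]
    return 1 + max(below) if below else None

def occurrences(f, var):
    """Number of leaves of the formula labelled var."""
    if isinstance(f, str):
        return int(f == var)
    return occurrences(f[1], var) + occurrences(f[2], var)

delay_matrix = [[depth_of(f, v) for v in VARS] for f in (Y0, Y1)]
occurrence_matrix = [[occurrences(f, v) for v in VARS] for f in (Y0, Y1)]

# Carry-save property: y0 + 2*y1 equals the number of ones among the inputs.
adds_correctly = all(
    evaluate(Y0, dict(zip(VARS, bits))) + 2 * evaluate(Y1, dict(zip(VARS, bits))) == sum(bits)
    for bits in product((0, 1), repeat=3)
)
```

Each entry of `delay_matrix` is the longest input-to-output path, in line with the definition of $m_{ij}$ given in the introduction, and each entry of `occurrence_matrix` counts leaves.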
Both the implementations of Fig. 2.2 and Fig. 2.3 are also minimal with respect to circuit size. This, however, will not happen in general. Since we are not concerned here with the size of the circuits (which will always be $O(n^2)$) we can think of an implementation of an FA as a set of formulae, one for each output bit. This also stresses the fact that we may try to optimise the structure of each formula separately.
A $U_2$-implementation of an FA$_3$ (Fig. 2.4) has the delay matrix $\begin{pmatrix} 2 & 4 & 4 \\ 2 & 3 & 3 \end{pmatrix}$ together with a corresponding occurrence matrix. It could be checked that both these are optimal. Note that this time it is not clear how to schedule the inputs to this unit. If $x_1$ is supplied two units of time after $x_2, x_3$ then delays are introduced in the circuitry for $y_1$; if $x_1$ is supplied one unit of time after $x_2, x_3$, then delays will be introduced in the $y_0$ circuitry. These alternatives give rise to the two gadgets shown on the right in Fig. 2.5. This is a simple example of non-modularity.
The construction of the building blocks that are used to get our best results is technically involved. Since we want to concentrate on the general theory we will not describe their construction here. These details will appear in a forthcoming paper. We would just point out here that CSA's could be built using any bit adder, not necessarily using full adders. Our currently best $U_2$ results, for example, are obtained using a bit adder that adds 7 bits with weight 1 together with 4 bits of weight 2. It is also not necessary for the result to be supplied in a non-redundant form.
CSA-networks
A CSA is regarded henceforth as a black box with $k$ inputs and $\ell$ outputs ($\ell < k$) with the property that the sum of its outputs is always equal to the sum of its inputs. The delay and formula size characteristics of a CSA are described by its $\ell \times k$ delay and occurrence matrices. We assume that all the entries in the delay matrix are positive and that all the entries in the occurrence matrix are at least 1. This corresponds to the assumption that every output depends on every input and that no computation is instantaneous. A more general setting could allow the outputs to depend on subsets of the inputs. In such cases, $-\infty$ entries will appear in the delay matrices and 0 entries will appear in the occurrence matrices. Almost all the results of this paper could be extended to cover this more general situation. The complete proofs of these results are much longer however, and although we do need these generalisations in order to get our best results mentioned in Table 1.1, we will not present them here. Some will appear in the planned subsequent paper.
In the sequel, whenever a delay matrix is considered, it is assumed that all its entries are positive and whenever an occurrence matrix is considered it is assumed that all its entries are at least one. A CSA-network is an acyclic network composed of CSA units of a fixed type. An inductive argument shows that any CSA-network has the property that the sum of its outputs is equal to the sum of its inputs. Using the delay and occurrence matrices, $M$ and $N$ respectively, of the CSA unit used we can assign a delay and (formula) size to each 'wire' in the network. The inputs to a network are assigned delay 0 and size 1. If the $k$ inputs to a CSA have delays $x_1, \ldots, x_k$ then the $i$-th output of this CSA will have delay $y_i = \max_{1 \le j \le k}\{m_{ij} + x_j\}$.
We express this by $y = M \circ x$ where the $\circ$ denotes the $\{\max, +\}$ inner product. If the $k$ inputs to a CSA have sizes $x_1, \ldots, x_k$ then the $i$-th output of this CSA will have size $y_i = \sum_{j=1}^{k} n_{ij} x_j$. We abbreviate this by writing $y = Nx$ where this time the usual $\{+, \times\}$ inner product is used. The delay (respectively, size) of a network is the maximum of the delays (respectively, sizes) of its outputs.
Given a fixed CSA$_{k\to\ell}$ with delay matrix $M$ and occurrence matrix $N$, our task is to construct using these units networks with $n$ inputs and $\ell$ outputs with minimal delay or formula size. We denote by $D_M(n)$ the minimal delay of such an $n \to \ell$ network and by $F_N(n)$ the minimal (formula) size of such a network. Strictly speaking, $n \to \ell$ networks exist only if $(k - \ell) \mid (n - \ell)$ (and then use exactly $(n - \ell)/(k - \ell)$ CSA's). This is relaxed however by allowing constant zero inputs. Such dummy inputs will also simplify the presentation of the constructions described in Sections 7 and 8. The optimal networks for some small values of $n$, constructible using the CSA$_{3\to 2}$ of Fig. 2.2, are shown in Fig. 3.1. For every fixed CSA unit there is a polynomial time algorithm which constructs for every $n$ the set of $n \to \ell$ networks with minimal delay.
We are interested in the asymptotic behaviour of the functions $D_M(n)$ and $F_N(n)$. The next theorem states that $D_M(n)$ behaves logarithmically and $F_N(n)$ polynomially.
Proof: By collapsing first the columns then the rows of an $n \times m$ array of inputs, we see that
$$D_M(nm) \le D_M(n) + D_M(m) + D_M(\ell^2).$$
It follows that the function $D_M(n) + D_M(\ell^2)$ behaves sub-additively (as its argument $n$ multiplies), and thus the limit $\delta(M) = \lim_{n\to\infty} D_M(n)/\log n$ exists. By a similar argument we get that the limit $\epsilon(N) = \lim_{n\to\infty} \log F_N(n)/\log n$ exists. These two constants satisfy the required conditions. Our goal in the next sections will be to determine $\delta(M)$ and $\epsilon(N)$ as functions of $M$ and $N$.
Lower bounds
Define the following functions : 
Theorem 4.2 (i) $D_M(n) \ge \delta'(M) \log(n/\ell)$; (ii) $F_N(n) \ge (n/\ell)^{\epsilon'(N)}$.
Proof: Consider an $n \to \ell$ network composed of CSA's with delay matrix $M$. If the inputs to a CSA in the network have delays $x_1, \ldots, x_k$ then the outputs will have delays $y_1, \ldots, y_\ell$ where $y = M \circ x$. The definition of $\delta = \delta'(M)$ ensures that $\sum_{j=1}^{k} 2^{x_j/\delta} \le \sum_{i=1}^{\ell} 2^{y_i/\delta}$. Using induction we get a similar relation for the inputs and outputs of the whole network. The $n$ inputs have delay 0. If the outputs all have delay at most $d$ then we get that $n \le \ell\, 2^{d/\delta}$ or equivalently $d \ge \delta \log(n/\ell)$.
Similarly, if the CSA's in the network have occurrence matrix $N$ we get that the sizes of the inputs and the outputs to every CSA in the network satisfy the relation $\|x\|_{1/\epsilon} \le \|y\|_{1/\epsilon}$ where $\epsilon = \epsilon'(N)$. Using induction we get a similar relation for the whole network. The $n$ inputs have size 1. If the $\ell$ outputs all have size at most $f$ then we get that $f \ge (n/\ell)^{\epsilon}$. □
The argument used in this proof is similar to the one used in a proof that a binary tree of depth $t$ can have at most $2^t$ leaves using Kraft's inequality ($\sum_i 2^{-d_i} \le 1$ where the $d_i$ are the depths of the leaves in a binary tree).
As an immediate consequence we get
Corollary 4.3 (i) $\delta'(M) \le \delta(M)$; (ii) $\epsilon'(N) \le \epsilon(N)$.
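The lower bound of Theorem 4.2(i) can be compared with the delay of a concrete network. The following sketch builds a 3 → 2 network greedily, always combining the three earliest available wires and giving the latest of them to the fast input. This greedy rule is only an illustrative heuristic, not a construction from the paper. The delay matrix and the value $\lambda \approx 1.2056$ are taken from the FA$_3$ discussed in the numerical examples, and the relation $\delta' = 1/\log_2 \lambda$ is assumed.

```python
import heapq
import math

M = [[1, 2, 2], [2, 3, 3]]   # assumed delay matrix (rows: outputs, columns: inputs)
LAMBDA = 1.2056              # value quoted for this unit in the numerical examples

def maxplus(M, x):
    return [max(m + xj for m, xj in zip(row, x)) for row in M]

def greedy_delay(n):
    """Delay of a network that repeatedly feeds the three earliest wires to one unit."""
    wires = [0] * n                      # all inputs are ready at time 0
    heapq.heapify(wires)
    while len(wires) > 2:
        t1 = heapq.heappop(wires)
        t2 = heapq.heappop(wires)
        t3 = heapq.heappop(wires)
        # the latest of the three wires goes to the first (fast) input
        for y in maxplus(M, [t3, t1, t2]):
            heapq.heappush(wires, y)
    return max(wires)

def lower_bound(n, outputs=2):
    """delta' * log2(n / outputs), with delta' = 1 / log2(lambda) (assumed relation)."""
    return math.log2(n / outputs) / math.log2(LAMBDA)

for n in (3, 4, 16, 256, 4096):
    print(n, greedy_delay(n), round(lower_bound(n), 2))
```

Any valid network, greedy or not, must have delay at least the printed bound; how close the greedy heuristic comes to it is not claimed here.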

The delay problem
In this section we show that in order to compute $\delta'(M)$ we need only consider a finite number of points $x \in \mathbb{R}^k$. This provides a practical way of computing $\delta'(M)$. We will also gain some insight into how a CSA with delay matrix $M$ could be used optimally.
Lemma
If $\max_j x_j < \min_i y_i$ and $\ell < k$ then the equation $P_{x,y}(\lambda) = 0$ has a unique root $\lambda(x, y)$ in the interval $(1, \infty)$.
Proof: If $\lambda = 1$ then $P_{x,y}(1) = k - \ell > 0$ and if $\lambda \to \infty$ then $P_{x,y}(\lambda) \to -\infty$. Thus the equation has at least one root in the interval $(1, \infty)$.
Since a translation of $x$ and $y$ by the same amount leaves the roots invariant, we may assume, without loss of generality, that $\max_j x_j < 0 < \min_i y_i$. Every positive contribution to $P_{x,y}(\lambda)$ is now decreasing in $\lambda$ while every negative contribution is increasing in magnitude. Therefore $P_{x,y}(\lambda)$ as a whole is decreasing and the uniqueness of the root is guaranteed. □
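The root $\lambda(x, y)$ is easy to compute numerically. The sketch below assumes $P_{x,y}(\lambda) = \sum_j \lambda^{x_j} - \sum_i \lambda^{y_i}$, a form inferred from the proof above (it gives $P_{x,y}(1) = k - \ell$ and tends to $-\infty$); the function names are illustrative.

```python
def P(x, y, lam):
    """Assumed form of P_{x,y}: sum over inputs minus sum over outputs."""
    return sum(lam ** xj for xj in x) - sum(lam ** yi for yi in y)

def root(x, y, tol=1e-12):
    """Bisection for the root in (1, inf); requires max(x) < min(y) and len(y) < len(x)."""
    lo, hi = 1.0, 2.0
    while P(x, y, hi) > 0:          # enlarge the bracket until P changes sign
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if P(x, y, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

lam = root([1, 0, 0], [2, 3])            # schedule of the modular matrix (2 3)^T - (1 0 0)
lam_shifted = root([4, 3, 3], [5, 6])    # the same schedule translated by 3
```

Translating both vectors multiplies $P$ by a power of $\lambda$, so the two calls agree up to the bisection tolerance.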
If $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^\ell$, we denote by $y - x^T$ the $\ell \times k$ matrix whose elements are $y_i - x_j$. A pair $(x, y)$ of vectors $x \in \mathbb{R}^k$ and $y \in \mathbb{R}^\ell$ satisfying $y - x^T \ge M$ will be called a schedule for $M$. If we impose the additional requirement that $x_1 = 0$ we get a one-to-one correspondence between schedules and modular matrices dominating $M$. This set of modular matrices dominating $M$ will be denoted by $P(M)$ and will be called the modular polyhedron over $M$.
If $M = b - a^T$ is a modular matrix and $y - x^T \ge M$ then $y_i \ge \max_{1 \le j \le k}\{b_i - a_j + x_j\} = b_i + c$ where $c = \max_{1 \le j \le k}\{x_j - a_j\}$. If we define $x'_j = c + a_j$ then we still have $y - x'^T \ge M$, although $x' \ge x$ and therefore $\lambda(x', y) \ge \lambda(x, y)$. Since translating both $x$ and $y$ by the same amount leaves the roots invariant, we may assume that $c = 0$. So $x' = a$, $y = b$, and we have proved the following Theorem.
Theorem 5.2 If $M$ is modular, $M = b - a^T$ say, then $\lambda(M) = \lambda(a, b)$.
As an immediate consequence of this theorem we get that
The set $P(M)$ is defined using a finite set of linear inequalities and it is therefore a polyhedron. As mentioned before, we can identify a point $M' \in P(M)$ with the unique schedule $(x, y)$ which satisfies $x_1 = 0$ and $y - x^T = M'$. For every schedule $(x, y)$, define a bipartite graph $\Gamma(x, y)$ in which the elements of $x$ and $y$ are the nodes and in which $x_j$ and $y_i$ are connected by an edge if and only if $y_i - x_j = m_{ij}$. It is easy to check that $(x, y)$ is a vertex of the polyhedron if and only if the graph $\Gamma(x, y)$ is connected. A vertex of a polyhedron is an extremal point of it, that is, a point which is not a convex combination of any other two points in the polyhedron. The polyhedron $P(M)$ has only a finite number of vertices. We denote this finite set of vertices by $P^*(M)$.
Our aim is to prove that the maximum in the definition of $\lambda(M)$ is attained at some vertex in $P^*(M)$. Suppose that $(x, y) \in P(M)$ is not a vertex of $P(M)$, so the graph $\Gamma(x, y)$ is disconnected. We will show that there exists a schedule $(x', y')$ such that $\Gamma(x', y')$ is connected and $\lambda(x', y') \ge \lambda(x, y)$. Suppose that $A$ is the set of variables in the nonempty connected component of $\Gamma(x, y)$ containing $x_1$. Let $B$ denote the complementary set of variables. We can break the definition of $P_{x,y}(\lambda)$ in the following way
common constant decreases $P_B(\lambda)$. While if $P_B(\lambda)$ is non-negative, then decreasing these variables by a common constant does not increase $P_B(\lambda)$.
In either case the variables of $B$ can be shifted in the appropriate direction until one more of the constraints $y_i - x_j \ge m_{ij}$ is satisfied with equality. The result is a schedule $(x', y')$ for which $\Gamma(x', y')$ has all the edges of $\Gamma(x, y)$ together with at least one edge between $A$ and $B$. Furthermore $P_{x',y'}(\lambda) \ge 0$ for $\lambda = \lambda(x, y)$, so that $\lambda(x', y') \ge \lambda(x, y)$.
By repeating this procedure we arrive at a schedule for which the graph is connected (so that the schedule corresponds to a vertex of $P(M)$), without ever decreasing $\lambda$. We have thus proved
Theorem 5.3 $\lambda(M) = \max\{\lambda(M') : M' \in P^*(M)\}$.
In Section 7 we will see that the lower bounds of the previous section are tight. Theorem 5.3 thus says that if $M$ is a non-modular matrix then there exists a modular matrix $M'$ that dominates it (and is a vertex of the modular polyhedron over $M$) such that, asymptotically, we can do as well using $M'$ as we can using $M$. In Fig. 2.5 we tried to describe the behaviour of a non-modular CSA. The two gadgets in Fig. 2.5(b)(c) turn out to be the vertices of the modular polyhedron. The gadget in (c) turns out to be the optimal one. As we noted in Section 2, internal delays are inevitable when using a non-modular gadget. A priori, it might seem that the ability to delay some of the inputs in some cases and others in other cases is advantageous. This however is not the case. In order to get optimal performance we should always choose the same internal delays.
In the example of Fig. 2.5, for instance, we should always delay $x_1$ internally and use the gadget shown in (c).
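For a small integer delay matrix the vertices of the modular polyhedron can be found by brute force, using the connectivity criterion stated above. The sketch below is a minimal illustration under several assumptions: the non-modular matrix is taken to be $\begin{pmatrix} 2 & 4 & 4 \\ 2 & 3 & 3 \end{pmatrix}$ (the $U_2$ FA$_3$ as read from the numerical examples), $P_{x,y}(\lambda)$ is taken to be $\sum_j \lambda^{x_j} - \sum_i \lambda^{y_i}$, and the search is restricted to integer schedules with $x_1 = 0$ in a small window. All function names are illustrative.

```python
from itertools import product

M = [[2, 4, 4], [2, 3, 3]]   # assumed non-modular delay matrix

def maxplus(M, x):
    return [max(m + xj for m, xj in zip(row, x)) for row in M]

def is_connected(M, x, y):
    """Connectivity of the bipartite graph of tight constraints y_i - x_j = m_ij."""
    k, l = len(x), len(y)
    adj = {v: set() for v in range(k + l)}   # nodes 0..k-1 inputs, k..k+l-1 outputs
    for i in range(l):
        for j in range(k):
            if y[i] - x[j] == M[i][j]:
                adj[j].add(k + i)
                adj[k + i].add(j)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adj[v] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == k + l

def root(x, y, tol=1e-12):
    """Bisection for the root of sum(lam**x_j) - sum(lam**y_i) in (1, inf)."""
    P = lambda lam: sum(lam ** xj for xj in x) - sum(lam ** yi for yi in y)
    lo, hi = 1.0, 2.0
    while P(hi) > 0:
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if P(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

def vertices(M, span=6):
    """Integer schedules (x, y) with x_1 = 0, y = M o x and a connected tight graph."""
    k = len(M[0])
    found = []
    for rest in product(range(-span, span + 1), repeat=k - 1):
        x = (0,) + rest
        y = tuple(maxplus(M, x))
        if is_connected(M, x, y):
            found.append((x, y))
    return found

found = sorted(vertices(M))
best = max(found, key=lambda s: root(*s))
```

In this sketch a schedule with $x_2 = x_3 = -1$ means that $x_1$ is supplied one unit after the other two inputs, the alternative recommended above.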
The results of this section could be generalised to cover the case in which a finite set of basic gadgets, each with an associated delay matrix, is given to us. In order to get optimal performance it is always enough to use only one of the available gadgets.
The formula size problem
Our aim in this section is to prove that if $N$ is an occurrence matrix and $\epsilon = \epsilon'(N)$ then there exists a unique direction $x \in (\mathbb{R}^+)^k$ for which $\|x\|_{1/\epsilon} = \|Nx\|_{1/\epsilon}$. Furthermore, all the components of this direction are strictly positive. The existence of such a positive direction will be needed in Section 8 where constructions achieving the lower bound on formula size from Section 4 are obtained. We will need the following lemma.
Lemma 6.1 If $N$ is an occurrence matrix then $\epsilon'(N) > 1$.
Proof: It is clear that if $N' \ge N$ then $\epsilon'(N') \ge \epsilon'(N)$.
The $k \times \ell$ occurrence matrix $\mathbf{1}$ all of whose entries are 1 is dominated by every other $k \times \ell$ occurrence matrix. A direct computation shows that $\epsilon'(\mathbf{1}) = \log_{k/\ell} k > 1$. □
We are now ready to prove by scaling and it is therefore also strictly convex. The function $f'(x)$ is obtained by summing strictly convex functions and it is therefore strictly convex also. The set $B'$ is convex and therefore $f'$ has a unique minimum point on it. Finally, we would like to prove that the minimum point $x^*$ lies in the interior of $B'$ so that none of its co-ordinates is 0. Let $f_i^{\epsilon}(x) = [f'_i(x)]^{\epsilon} = \sum_j (n'_{ij} x_j)^{\epsilon}$. Suppose, on the contrary, that one of the co-ordinates of $x^*$ is 0. We know of course that at least one of the co-ordinates of $x^*$ is non-zero. Without loss of generality assume that $x^*_1 = 0$, $x^*_2 > 0$.
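It is easy to check that the function $(n'_{i1}\lambda)^{\epsilon} + (n'_{i2}(x^*_2 - \lambda))^{\epsilon}$ has a negative derivative at $\lambda = 0$. For small values of $\lambda$ we would therefore get that $f_i^{\epsilon}(x^{\lambda}) < f_i^{\epsilon}(x^*)$, or equivalently $f'_i(x^{\lambda}) < f'_i(x^*)$, where $x^{\lambda} = (\lambda, x^*_2 - \lambda, \ldots, x^*_k)$. Since this holds for every $i$ we get that for sufficiently small $\lambda$, $f'(x^{\lambda}) < f'(x^*)$, which contradicts the minimality of $x^*$. □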
The strict convexity of the functions involved makes the numerical task of finding $\epsilon = \epsilon'(N)$ and the direction $x = x(N)$ satisfying $\|x\|_{1/\epsilon} = \|Nx\|_{1/\epsilon}$ a very easy one. As we shall see in Section 8, the components of the direction $x(N)$ give the ratios between the sizes of the inputs that should be fed into this gadget if it is to be used optimally.
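A rough numerical version of this task is sketched below for the occurrence matrix $\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 3 \end{pmatrix}$, which is the matrix implied by the inequality in the numerical examples. The sketch relies on two simplifying assumptions that are not taken from the paper: the first two inputs play symmetric roles, so the search is restricted to directions $(1, 1, x_3)$, and a plain grid over $x_3$ combined with bisection on $\epsilon$ replaces a proper convex solver.

```python
N = [[1, 1, 1], [1, 1, 3]]   # assumed occurrence matrix

def gap(eps, x):
    """sum_i (Nx)_i^p - sum_j x_j^p with p = 1/eps; non-negative iff the norm inequality holds at x."""
    p = 1.0 / eps
    lhs = sum(xj ** p for xj in x)
    rhs = sum(sum(n * xj for n, xj in zip(row, x)) ** p for row in N)
    return rhs - lhs

GRID = [i / 1000 for i in range(1, 3001)]   # candidate x3 values, with x1 = x2 = 1

def worst(eps):
    """Smallest gap over the grid, together with the x3 at which it occurs."""
    return min((gap(eps, (1.0, 1.0, x3)), x3) for x3 in GRID)

def epsilon_prime(lo=1.0, hi=6.0, iters=40):
    """Largest eps for which the inequality holds on the whole grid (bisection)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if worst(mid)[0] >= 0:
            lo = mid
        else:
            hi = mid
    return lo

eps = epsilon_prime()
x3_star = worst(eps)[1]
```

The returned `eps` is the exponent estimate and `(1, 1, x3_star)` the corresponding direction, up to the resolution of the grid.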
Depth constructions
As we saw in Section 5, for every delay matrix $M$ there exists a modular delay matrix $M'$ which dominates it and for which $\lambda(M) = \lambda(M')$. It is therefore enough to consider in this section only modular gadgets. A general modular gadget is shown in Figure 7.1. It has the modular delay matrix $M = b - a^T$ where we assume that $0 = a_1 \le a_2 \le \cdots \le a_k < b_1 \le b_2 \le \cdots \le b_\ell$ and $\ell < k$. We assume here that all the outputs are produced after all the inputs were supplied, which is always the case if all the entries in the delay matrix are positive. As mentioned in Section 3, the results of this section could be extended to cover more general cases but we shall not do so here. The characteristic equation of this gadget is
$$\sum_{j=1}^{k} \lambda^{a_j} = \sum_{i=1}^{\ell} \lambda^{b_i}.$$
The number of signals processed at time 0 is therefore $O(n)$ and the number of signals processed at time $\log_\lambda n$ is $O(1)$. This however does not correspond to a concrete construction for two reasons. First, the above process is infinite: it begins at time $-\infty$ and never ends. Secondly, the number of gadgets that should be used at each time unit is generally not integral. These problems are however easily overcome.
We choose
where $c = (\ell - 1)/(k - \ell)$.
We now verify the following facts.
(i) If $0 \le d \le a_k$ then
so we may input at least $n$ inputs at time 0, and even some more at times $1, \ldots, a_k$.
(ii) Since $\mathrm{Bal}_d$ is integral we get that $\mathrm{Bal}_d \ge 0$. In other words no outputs are produced at times less than or equal to $\lfloor \log_\lambda n \rfloor$.
(iii) We therefore get a total of at most $L$ outputs at times less than or equal to $\lfloor \log_\lambda n \rfloor + b_\ell$. Note that $L$ is fixed, independent of $n$.
(iv) Finally, if $d > \lfloor \log_\lambda n \rfloor + b_\ell$ then $\mathrm{Bal}_d = 0$ so no more outputs are produced.
A final stage could now be used to reduce the $L$ outputs obtained to only $\ell$. Thus, if $a, b$ are integral then we even have $D_M(n) \le \log_\lambda n + O(1)$.
Note that without the assumption that $a_1 \le a_2 \le \cdots \le a_k < b_1 \le \cdots \le b_\ell$ this construction may fail. It will not always be guaranteed that $\mathrm{Bal}_d \ge 0$ for $1 \le d \le b_\ell$.
Reducing the timescale in this construction by a factor of $q$ throughout produces a network of delay $\log_\lambda n + O(1)$ with $O(q)$ outputs. The final stage to reduce these outputs to $\ell$ requires a delay of $O(\log q)$. This construction satisfies the Theorem provided that $q$ is chosen so that $\log q = o(\log n)$.
Formula size constructions
Let $N$ be an occurrence matrix with entries greater than or equal to one. Choose a vector $x \in (0, \infty)^k$ and define the following two vectors: $a_j = \log x_j$ for $1 \le j \le k$, and $b_i = \log \sum_{j=1}^{k} n_{ij} x_j$ for $1 \le i \le \ell$.
Associate with every CSA with occurrence matrix $N$ a delay matrix $M(x) = b - a^T$. Note that all the entries of $M(x)$ are positive.
Suppose that $\Gamma$ is a network composed of CSA's with occurrence matrix $N$, and therefore with delay matrix $M(x)$. To every wire $w$ in $\Gamma$ we can now assign both a size $f(w)$ and a delay $d(w)$. We can generalise the observation that the size of a formula is at most 2 to the power of its depth.
Lemma 8.1 For any wire $w$ in $\Gamma$ we have $f(w) \le 2^{d(w)}$.
Proof: If $w$ is an input wire then $f(w) = 1$ and $d(w) = 0$ so the relation holds with equality. Suppose now that $u_1, \ldots, u_k$ are the inputs of some CSA in the network, that $f(u_j) \le 2^{d(u_j)}$ for every $j$, and that $v_1, \ldots, v_\ell$ are the outputs of this CSA. Let $t = \max_j\{d(u_j) - a_j\}$. Then $t$ may be regarded as the time at which this CSA is activated. The delays of the inputs satisfy $d(u_j) \le t + a_j$ while the delays of the outputs satisfy $d(v_i) = t + b_i$. The size of $v_i$ will now be
$$f(v_i) \le \sum_{j} n_{ij} 2^{t + a_j} = 2^{t} \sum_{j} n_{ij} x_j = 2^{t + b_i} = 2^{d(v_i)}.$$
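The passage from an occurrence matrix to the associated delay matrix $M(x)$ can be sketched as follows. The sample matrix and direction are the ones appearing in the numerical examples and are used here only as an illustration; logarithms are taken to base 2, as the bound $f(w) \le 2^{d(w)}$ suggests, and the function names are hypothetical.

```python
import math

N = [[1, 1, 1], [1, 1, 3]]    # assumed occurrence matrix
x = [1.0, 1.0, 0.3926]        # assumed direction of input sizes

def schedule_from_sizes(N, x):
    """a_j = log2 x_j and b_i = log2 sum_j n_ij x_j."""
    a = [math.log2(xj) for xj in x]
    b = [math.log2(sum(n * xj for n, xj in zip(row, x))) for row in N]
    return a, b

def delay_matrix(a, b):
    """Modular matrix M(x) = b - a^T."""
    return [[bi - aj for aj in a] for bi in b]

a, b = schedule_from_sizes(N, x)
Mx = delay_matrix(a, b)

# Key identity of the proof: 2^{b_i} = sum_j n_ij 2^{a_j}
identity_holds = all(
    abs(2 ** bi - sum(n * 2 ** aj for n, aj in zip(row, a))) < 1e-9
    for row, bi in zip(N, b)
)
```

Every entry of `Mx` is positive because each output size exceeds each single input size when all occurrence counts are at least one.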
Numerical examples
The delay matrix of the FA$_3$ described in Fig. 2.1 is the modular matrix $M = (2\ 3)^T - (1\ 0\ 0)$. Therefore $\lambda(\mathrm{FA}_3)$ is the unique positive root of the cubic equation $\lambda^3 + \lambda^2 - \lambda - 2 = 0$. We can verify that $\lambda \approx 1.2056$ and that $\log_\lambda n \approx 3.71 \log n$.
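The delay matrix of the FA$_3$ described in Fig. 2.4 is the non-modular matrix $\begin{pmatrix} 2 & 4 & 4 \\ 2 & 3 & 3 \end{pmatrix}$. Its modular polyhedron has two vertices, and the second of them gives $\lambda \approx 1.1365$. The second possibility is clearly better and the circuits that we would build would have depth $\log_\lambda n \approx 5.42 \log n$.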
Khrapchenko [14] designed a $U_2$-FA$_7$. Using ad hoc methods he was able to construct with it networks of depth $5.12 \log n$. The delay matrix of Khrapchenko's FA$_7$ is non-modular. The optimal vertex in the modular polyhedron of this matrix is
$$\begin{pmatrix} 5 & 6 & 6 & 6 & 6 & 6 & 6 \\ 6 & 7 & 7 & 7 & 7 & 7 & 7 \\ 5 & 6 & 6 & 6 & 6 & 6 & 6 \end{pmatrix}$$
and therefore $\lambda(\mathrm{FA}_7)$ is the unique positive root of the equation $\lambda^7 + 2\lambda^6 - \lambda - 6 = 0$. We find that $\lambda \approx 1.1465$ and that $\log_\lambda n \approx 5.07 \log n$. We can thus improve Khrapchenko's construction even using his own gadget. We can reduce the depth of the $U_2$-circuits to $5.02 \log n$ using a novel design of a CSA$_{11\to 4}$.
The size of the optimal formulae for multiple carry save addition that can be obtained using CSA's based on the FA$_3$ described in Fig. 2.2 is $n^{\epsilon + o(1)}$ where
$$\epsilon = \max\left\{ \frac{1}{p} : \forall x_1, x_2, x_3 > 0,\ x_1^p + x_2^p + x_3^p \le (x_1 + x_2 + x_3)^p + (x_1 + x_2 + 3x_3)^p \right\}.$$
A numerical solution gives $\epsilon \approx 3.2058$ and equality is achieved when $x_1 = x_2 = 1$, $x_3 \approx 0.3926$. This yields formulae of size $O(n^{3.21})$. As mentioned before we can get better results using more complicated CSA's.
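The depth constants quoted in this section can be recomputed from the characteristic equations. The sketch below uses Newton's iteration, started to the right of the root, on $f(\lambda) = \sum_i \lambda^{b_i} - \sum_j \lambda^{a_j}$ and reports $1/\log_2 \lambda$. The vectors $a$ and $b$ are read off the equations above; the pair used for the $U_2$ FA$_3$ (equation $\lambda^4 + \lambda^3 - \lambda - 2 = 0$) is an assumption corresponding to delaying $x_1$ by one unit. The dictionary keys and function names are illustrative.

```python
import math

def char_root(a, b, lam=2.0, steps=60):
    """Newton iteration on f(lam) = sum(lam**b_i) - sum(lam**a_j)."""
    for _ in range(steps):
        f = sum(lam ** bi for bi in b) - sum(lam ** aj for aj in a)
        df = sum(bi * lam ** (bi - 1) for bi in b) - sum(aj * lam ** (aj - 1) for aj in a)
        lam -= f / df
    return lam

def depth_constant(a, b):
    """Coefficient q in the depth q * log2(n), i.e. 1 / log2(lambda)."""
    return 1.0 / math.log2(char_root(a, b))

GADGETS = {
    "B2 FA3": ([0, 0, 1], [2, 3]),                        # lam^3 + lam^2 = lam + 2
    "U2 FA3 (assumed vertex)": ([0, 0, 1], [3, 4]),       # lam^4 + lam^3 = lam + 2
    "U2 FA7 (optimal vertex)": ([0] * 6 + [1], [6, 6, 7]),  # lam^7 + 2 lam^6 = lam + 6
}

constants = {name: depth_constant(a, b) for name, (a, b) in GADGETS.items()}
for name, q in constants.items():
    print(f"{name}: {q:.2f} log n")
```

Since every exponent in $a$ is 0 or 1 for these gadgets, $f$ is convex for $\lambda > 0$ and the iteration decreases monotonically to the root from the starting point 2.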
Concluding remarks
Many related open problems still remain. The hardest of them all is probably to determine the exact depth and formula size of multiplication, multiple addition and multiple carry save addition. In this paper upper bounds on these complexities were obtained. Although these bounds may be close to the real values, we believe that they may be further improved by devising better basic building blocks.
A more tractable problem, perhaps, is the question of whether or not the optimal depth and formula complexities of the above mentioned problems can be obtained, or at least approached, using carry save networks.
