Parallel evaluation of arithmetic circuits  by Revol, Nathalie & Roch, Jean-Louis
ELSEVIER Theoretical Computer Science 162 (1996) 133-150
Nathalie Revol *, Jean-Louis Roch
LMC-IMAG, 100 rue des Mathématiques, BP 53X, 38041 Grenoble Cedex, France
Abstract 
In this paper, a generic algorithm for the parallel evaluation of arithmetic circuits is given. This algorithm can be used in VLSI design to obtain tight upper bounds on the computing time of a circuit. It can also be used in the automatic parallelization of numerical programs, as a guide for detecting predefined schemes such as dot-products or reductions. More generally, the (theoretical) algorithm presented in Section 2 quickly evaluates arithmetic straight-line programs, and its evaluation time serves as a good upper bound. This algorithm generalizes the algorithm of Miller, Ramachandran and Kaltofen (1988) in the sense that it deals with a great variety of algebraic structures: semi-rings, rings or lattices. Our contribution lies on the one hand in a new bound for the evaluation of circuits over lattices, which improves previous results (Miller and Teng, 1987), and on the other hand in the unified formulation of the evaluation algorithm. This algorithm runs in $O(\min((\log n + \log d) \log n,\ (h_a + \log n) \log n))$ parallel time, d being the “algebraic degree” (in an extended sense) of the circuit and $h_a$ the maximal number of alternances of $\oplus$ and $\otimes$ on a path of the circuit if the $\oplus$ and $\otimes$ operations define a lattice, with M(n) processors, where M(n) is the number of processors necessary for the multiplication of two n × n matrices in the structure in $O(\log n)$ parallel time. After presenting this algorithm, we show its efficiency on particular cases: taking as input a simple sequential algorithm, it can be used as a “compiler” to produce a sorting circuit as fast as Cole’s circuit, with logarithmic depth, or an adder equivalent to Brent and Kung’s adder in terms of size and depth. These academic examples confirm the relevance of the algorithm presented here to the design of fast VLSI arithmetic operators.
1. Introduction 
In the domain of VLSI design, two questions naturally arise: the first one is to design 
a circuit computing the solution of a given problem, the second one is to determine 
whether this circuit can be improved. It appears that in this area, the computation time 
of a circuit is an important criterion. Our work allows one to derive tight upper bounds
on the computation time of a multi-valued boolean function. 
* Corresponding author. 
0304-3975/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved
SSDI 0304-3975(95)00252-9
Actually, in order to measure the quality of a VLSI circuit, two measures are used 
(cf. [16]): the first one is A, the area of the chip surface that is taken up by the elec- 
tronic components devoted to the considered computation; the second one is the time, 
T, which represents the number of clock cycles spent in a computation. It is assumed 
that the components of the circuit are totally synchronous. Even if this assumption is not very realistic, it is useful since it makes it possible to derive tight upper bounds on the complexity of actual circuits. These two quantities can be combined to get a new measure of the quality of a circuit, which exhibits the trade-off between the area A and the time T. It turns out that some circuits that are optimal with respect to the quantity AT are not considered to be good VLSI circuits; as a result, the quantity that is widely used is $AT^2$. Thus, it is particularly interesting to get good upper bounds on the time T, both in terms of performance (for the user of the circuit) and of hardware cost ($AT^2$ cost).
Since our work focuses on T, an apparent limitation is that the quantity A is often overestimated. It turns out, however, that there exist circuits achieving our bounds on T with reasonable A; for instance, we automatically derive a logarithmic time bound from the boolean equations modeling the addition of two n-bit integers, and it is well known that Brent and Kung [3] proposed an adder with logarithmic time and small (linear) area. (This formed a test case for our work.) Another apparent weakness of our result is that it gives asymptotic bounds on T. However, the constants are small (between 1 and 2). Moreover, our algorithm takes advantage of the algebraic properties of the boolean operations in order to derive the upper bounds, and in fact the result indicates, for instance, whether associativity should be used to reorganize the computations or whether commutativity would help more.
Briefly stated, our work takes as input boolean equations describing the computations 
and describes how they can be evaluated quickly in parallel. The time of this evaluation 
is very often a tight upper bound on the time needed by a circuit to perform these 
computations; the way the boolean operations are performed often indicates how to 
design a real VLSI circuit achieving this time complexity with a reasonable area A. In 
the last section of this paper, some upper bounds for already known problems illustrate 
this point. 
To place this in a more general framework, we consider the problem of the evaluation of arithmetic circuits when the + and × operations define various algebraic structures such as semi-rings, rings or lattices. The boolean algebra, which is the basic algebraic structure used for VLSI design, is in fact a particular application. The term “circuit” will be used in a more general sense than the VLSI one; indeed, parallel complexity theory is based on the notion of (uniform) boolean
[2,7,21] and arithmetic [9] circuits, also called straight-line programs. We present in 
this paper a generic algorithm for the parallel evaluation of arithmetic circuits when 
the underlying algebraic structure is commutative and is either a semi-ring, a ring or 
a lattice. Our first contribution is a unified presentation of known and new algorithms 
designed for each case. 
Boolean circuits constitute a theoretical model for parallel computations, and the complexity classes $NC^k$ and NC are defined upon this model [7]. Thus, any parallel computation is equivalent to the parallel evaluation of the corresponding boolean circuit. Since the boolean algebra ({True, False}, ∨, ∧) is a lattice, algebraic properties can be taken into account in order to speed up the parallel evaluation of boolean circuits. Our other contribution is a new algorithm for the parallel evaluation of lattice circuits; its complexity improves the best-known results [19]. We introduce a new measure, the maximal number of alternances, to which the complexity is related. Note that Ladner has shown that the boolean circuit evaluation problem is P-complete.
In the framework of parallel evaluation, two cases can be distinguished: expressions 
and circuits. An expression is a formula where every variable and every intermediate 
result can serve only once as an operand. It can be represented as a tree, and optimal 
algorithms exist with an $EREW_S(n/\log n, \log n)$ complexity¹ [1,4,10,15,17], where n
is the number of nodes in this tree. Note that this problem is $NC^1$-complete [14].
The evaluation of an arithmetic circuit with operations in a commutative semi-ring 
(SR, +, ×, 0, 1) can be done by the Miller et al. algorithm² [18]. It has a complexity of $CREW_{SR}(M(n), \log n (\log n + \log d))$, where d denotes the arithmetic degree of the circuit and n the number of nodes. It consists in repeatedly applying a sequence of three procedures. The first one groups two successive + nodes into one, the second evaluates + nodes having their operands evaluated, and the last one evaluates or shunts × nodes (the shunt is the equivalent of the rake of Kosaraju and Delcher [15], or of the prune of Cole and Vishkin [4]).
A lattice $(L, \oplus, \otimes)$ is a set L in which two internal operations, $\oplus$ and $\otimes$, are commutative and associative and satisfy the absorption law: $\forall a, b \in L,\ (a \oplus b) \otimes a = (a \otimes b) \oplus a = a$. In this paper we shall restrict the work to distributive lattices. Miller and Teng [19] proposed two algorithms for the contraction of distributive lattice circuits. They are based on MRK’s algorithm [18], designed for circuits with operations in a semi-ring. The first one consists in two simultaneous executions of this algorithm, one considering $\oplus$ as the addition and $\otimes$ as the multiplication, and the other inverting the roles of $\oplus$ and $\otimes$. The first execution to complete its computation stops the second. The other algorithm consists in applying the basic contraction procedure twice at each step, the first time with $\otimes$ as multiplication, the second with $\oplus$ as multiplication.
If $d_\otimes$ stands for the arithmetic degree of the circuit when $\oplus$ represents the addition and $\otimes$ the multiplication, and $d_\oplus$ is the degree when the roles of $\oplus$ and $\otimes$ are inverted,
1 The notation $EREW_S$(nb of proc, time) is due to Karp and Ramachandran [14]; it means that the computation time of the algorithm on an EREW (Exclusive Read Exclusive Write) PRAM is O(time), where an operation on the structure S is done in one unit of time, while the number of processors required to achieve this time is O(nb of proc). The notation $CREW_S$(nb of proc, time) differs only in the parallel machine performing the computation: it is a CREW (Concurrent Read Exclusive Write) PRAM. Similarly, when the parallel machine is a CRCW (Concurrent Read Concurrent Write) PRAM, the notation $CRCW_S$(nb of proc, time) will be employed. For this last case, it does not matter which kind of CRCW-PRAM is used.
2 In the following, Miller, Ramachandran and Kaltofen will be abbreviated as MRK.
d being the minimum of $d_\oplus$ and $d_\otimes$, and if n is the number of nodes in the circuit,
then the complexity of Miller and Teng’s algorithms is
$CREW_L(M(n), \log n (\log n + \log d))$.
The main benefit of Miller and Teng’s algorithms is their simplicity, since you just have 
to run MRK’s algorithm twice. However, this simplicity is counterbalanced by a loss 
of performance. Actually, its major drawback comes from the difference of treatment 
between $\oplus$ and $\otimes$, whereas in a lattice they have exactly the same properties.
The algorithm presented in this paper (Section 2) is a generalization of MRK’s 
algorithm; on the one hand, it is designed to handle algebraic structures of different 
kinds; on the other hand, it fully exploits the symmetric properties of $\oplus$ and $\otimes$ in the lattice case. It is composed of four procedures. The first one, called Group, groups + nodes by means of matrix multiplications, and it groups both $\oplus$ and $\otimes$ nodes in the lattice case. The second one, Eval, evaluates + (resp. $\oplus$) nodes as well as × (resp. $\otimes$) nodes having their operands already evaluated. The third one, PartialEval, is a partial evaluation of the nodes having some of their operands evaluated. A generalization of the Shunt is then performed on × nodes, or on $\oplus$ and $\otimes$ nodes in the lattice case, suppressing chains of unary × (resp. $\oplus$ and $\otimes$) nodes; thus it is called Suppress.
This algorithm can be modulated in order to be a simple extension of the tree 
contraction technique, to correspond to MRK’s algorithm for circuits over semi-rings, 
or to be a very efficient algorithm in the lattice case. Its complexity is thus the same 
as MRK’s complexity in the semi-ring case. For the lattice case, this algorithm has a first complexity upper bound of $CREW_L(M(n), \log n (\log n + \log d))$, which means that it is (at least) as efficient as Miller and Teng’s algorithms. Another upper bound is $CREW_L(M(n), (h_a + \log n) \log n)$, where $h_a$ is the maximal number of alternances of $\oplus$ and $\otimes$ on any path between a value node and a result node in the circuit. Proofs of these bounds are to be found in Section 3.
In Section 4, some applications of this algorithm are presented. These examples are classical test problems, chosen to check whether the evaluation algorithm performs well on well-known problems. The first one illustrates the power of this algorithm as a complexity predictor: it gives an $O(\log n)$ time complexity on a CRCW-PRAM for sorting, when given as input the insertion sort algorithm. The second problem is the addition and the multiplication of two n-bit numbers: the practical complexity matches the theoretical one, since the experimental time is logarithmic. Used as a compiler, the algorithm produces in this case an adder of linear width and logarithmic depth starting from the boolean equations of the addition; that is, it automatically produces an adder equivalent to Brent and Kung’s adder [3]. Lastly, the limits of our algorithm are given: for the P-complete lexicographic maximal independent set problem [6], it evaluates the corresponding circuit in linear parallel time. Since the evaluation algorithm gives satisfactory results on these problems, some real applications are considered as future work.
2. Algorithm 
2.1. Definitions and notations
A commutative semi-ring (SR, +, x, 0, 1) is a set SR in which two internal operations, 
+ and x, are associative and commutative, have a unit element (0 for + and 1 for 
x) and x is distributive with respect to +. In a commutative ring, every element has 
an inverse for +. 
A lattice $(L, \oplus, \otimes)$ is a set L in which two internal operations, $\oplus$ and $\otimes$, are commutative and associative and satisfy the absorption law: $\forall a, b \in L,\ (a \oplus b) \otimes a = (a \otimes b) \oplus a = a$. From the absorption law, we can deduce the idempotency of $\oplus$ and $\otimes$: $\forall a \in L,\ a \oplus a = a \otimes a = a$. In this paper, we will restrict the work in two directions: on the one hand, we consider only distributive lattices, i.e. lattices where $\otimes$ is distributive with respect to $\oplus$, which implies that $\oplus$ is also distributive with respect to $\otimes$; on the other hand, we limit ourselves to lattices with a greatest element e (a unit element for $\otimes$) and a smallest element $\varepsilon$ (a unit element for $\oplus$). Actually, this second assumption is not restrictive, since it is possible to add dummy e and $\varepsilon$ elements to L; we also have $\forall x \in L,\ e \oplus x = e$ and $\varepsilon \otimes x = \varepsilon$ because of the absorption law.
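As an illustration of the definition, the following minimal Python sketch checks the lattice axioms by brute force on a finite carrier set. The function name `is_distributive_lattice` and the two example structures (the boolean algebra and a five-element chain with max and min) are assumptions made for this example only; they are not taken from the paper.

```python
from itertools import product

def is_distributive_lattice(elements, join, meet):
    """Brute-force check of the axioms used in the text on a finite set.

    `join` plays the role of the first lattice operation and `meet` the second one.
    """
    elements = list(elements)
    for a, b in product(elements, repeat=2):
        # commutativity of both operations
        if join(a, b) != join(b, a) or meet(a, b) != meet(b, a):
            return False
        # absorption law: (a join b) meet a = (a meet b) join a = a
        if meet(join(a, b), a) != a or join(meet(a, b), a) != a:
            return False
    for a, b, c in product(elements, repeat=3):
        # associativity of both operations
        if join(join(a, b), c) != join(a, join(b, c)):
            return False
        if meet(meet(a, b), c) != meet(a, meet(b, c)):
            return False
        # distributivity in both directions
        if meet(a, join(b, c)) != join(meet(a, b), meet(a, c)):
            return False
        if join(a, meet(b, c)) != meet(join(a, b), join(a, c)):
            return False
    return True

# Example 1 (assumed): the boolean algebra ({False, True}, or, and).
boolean_ok = is_distributive_lattice([False, True],
                                     lambda a, b: a or b,
                                     lambda a, b: a and b)
# Example 2 (assumed): a totally ordered lattice ({0..4}, max, min).
chain_ok = is_distributive_lattice(range(5), max, min)
# Idempotency follows from absorption; checked here directly on the chain.
idempotent = all(max(a, a) == a == min(a, a) for a in range(5))
```

In the chain example the greatest element 4 is a unit for min and absorbing for max, and the smallest element 0 is a unit for max and absorbing for min, which matches the roles of e and the smallest element in the text.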
Notations. In what follows, S stands for an arbitrary algebraic structure, whereas 
L stands for a lattice, and + and x represent operations from a semi-ring or a ring, 
$\oplus$ and $\otimes$ stand for lattice operations.
In each of these structures, an arithmetic operation (either an addition or a multi- 
plication) is assumed to be performed in unit time. In the case of a totally ordered 
lattice, it happens that the addition or multiplication of n elements can be performed 
in constant time with $O(n^2)$ processors on a CRCW-PRAM.
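The text does not say which constant-time method is meant. One standard textbook technique for a totally ordered set is the all-pairs comparison, sketched below as a sequential simulation; the function name `crcw_max` and the tie-breaking rule are assumptions of this sketch.

```python
def crcw_max(values):
    """Simulate a constant-time CRCW maximum using n*n virtual processors.

    Sequential simulation only: each (i, j) iteration stands for one processor.
    """
    n = len(values)
    loses = [False] * n  # shared memory: one flag per element
    # Parallel step 1: processor (i, j) compares values[i] with values[j].
    # Ties are broken by index so that exactly one element never loses.
    for i in range(n):
        for j in range(n):
            if values[i] < values[j] or (values[i] == values[j] and i > j):
                loses[i] = True  # all concurrent writers write the same value
    # Parallel step 2: the unique element that never lost is the maximum.
    winners = [i for i in range(n) if not loses[i]]
    return values[winners[0]]
```

For instance, `crcw_max([3, 9, 2, 9, 5])` returns 9. Exchanging the roles of the comparisons gives the minimum in the same way.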
In the following, arithmetic circuits are represented by DAGs. The vertices are labeled as leaves, + or × nodes, or $\oplus$ or $\otimes$ nodes. The out-degree of leaves is 0, the out-degree of +, $\oplus$ and $\otimes$ nodes is ≥ 1 (operations of variable arity), the out-degree of × nodes is ≤ 2. The edges are directed top-down, from the operator to the operands; they are weighted with (a × x)-like linear functions or ($a \otimes x \oplus b$)-like affine functions.
Notations. v and w denote nodes of the DAG, usually with w representing any child
of v. The adjacency matrix associated to the DAG is denoted by U; its coefficients
represent the linear or affine functions weighting the edges: if the function is linear, it 
is simply represented as a single coefficient whereas an affine function is represented 
as a pair of coefficients. 
The four basic operations of the contraction algorithm are detailed in the following. 
2.2. Parallel evaluation algorithm 
Eval. The most obvious procedure is Eval (Fig. 1): when a node knows the value of all its operands, it computes a value and becomes a leaf, i.e. it disconnects itself from its children.
Fig. 1. The Eval procedure.
This can be done with $CREW_S(n^2, \log n)$ complexity and $CRCW_L(n^2, 1)$ complexity in a totally ordered lattice.
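A minimal sequential simulation of repeated Eval steps is sketched below. The dictionary encoding of the DAG (`node -> (operation, children)`), the operation names "join" and "meet", and the helper names are assumptions of this sketch. It uses Eval alone, so the number of rounds equals the depth of the DAG; the full algorithm combines Eval with the other procedures to do better.

```python
def eval_round(dag, values, join, meet):
    """One synchronous Eval step: nodes whose operands are all known get a value."""
    newly_evaluated = {}
    for node, (op, children) in dag.items():
        if node in values:
            continue
        if all(child in values for child in children):
            combine = join if op == "join" else meet
            result = values[children[0]]
            for child in children[1:]:
                result = combine(result, values[child])
            newly_evaluated[node] = result
    # Values are published only at the end of the round, as on a synchronous PRAM.
    values.update(newly_evaluated)
    return len(newly_evaluated)

def evaluate_by_eval_only(dag, leaves, join, meet):
    """Apply Eval until every node has a value; return the values and the round count."""
    values = dict(leaves)
    rounds = 0
    while len(values) < len(dag) + len(leaves):
        if eval_round(dag, values, join, meet) == 0:
            raise ValueError("DAG has a cycle or refers to an unknown node")
        rounds += 1
    return values, rounds

# Example (assumed): a depth-3 circuit over the lattice (integers, max, min).
example_dag = {
    "c1": ("join", ["a", "b"]),
    "c2": ("meet", ["c1", "a"]),
    "out": ("join", ["c2", "b"]),
}
example_values, example_rounds = evaluate_by_eval_only(
    example_dag, {"a": 1, "b": 3}, max, min)
```

Here `example_values["out"]` is 3 and three rounds are needed, one per level of the circuit.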
Group. We generalize the “MM” operation of MRK’s algorithm (Fig. 2). In the latter, 
they use only linear functions on the edges, and they group + operations by matrix 
products (which correspond to one step of a transitive closure computation), so as 
to transform two successive + nodes into a single one, and to compute the result of 
n-ary +. 
They define two matrices from the adjacency matrix:
- $U^{++}$ is the matrix of weights of the edges between two + nodes,
- $U^{+\bullet}$ is the matrix of weights of the edges from a + node to a × node or to a leaf.
MRK’s “MM” operation can then be defined as
$$U^{+\bullet} \leftarrow U^{++} \cdot U^{+\bullet} + U^{+\bullet}$$
followed by
$$U^{++} \leftarrow U^{++} \cdot U^{++}.$$
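The effect of the “MM” operation can be seen on a small chain of + nodes. The sketch below is an illustration only: it uses the integers as the semi-ring, stores both matrices as full n × n lists indexed by all nodes, and the names `mat_mul`, `mat_add`, `mm_step`, `u_pp` and `u_pd` are assumptions of this example.

```python
def mat_mul(a, b):
    """Ordinary matrix product over the integers (one commutative semi-ring)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def mat_add(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def mm_step(u_pp, u_pd):
    """One MM operation; both updates read the old value of u_pp.

    u_pp holds the weights of (+ node, + node) edges,
    u_pd the weights of edges from a + node to a x node or a leaf.
    """
    new_u_pd = mat_add(mat_mul(u_pp, u_pd), u_pd)
    new_u_pp = mat_mul(u_pp, u_pp)
    return new_u_pp, new_u_pd

# Nodes 0, 1, 2 are + nodes and node 3 is a leaf x, with
#   node0 = 2*node1,   node1 = 3*node2 + 5*x,   node2 = 7*x.
u_pp = [[0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
u_pd = [[0, 0, 0, 0], [0, 0, 0, 5], [0, 0, 0, 7], [0, 0, 0, 0]]
for _ in range(2):  # two steps are enough for a chain of three + nodes
    u_pp, u_pd = mm_step(u_pp, u_pd)
```

After the two steps no edge between + nodes remains, and node 0 is attached directly to the leaf with weight 52 = 2 × (3 × 7 + 5), so each + node has become a single n-ary + over leaves.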
As far as lattices are concerned, since $\oplus$ and $\otimes$ are symmetric, $\otimes$ nodes are grouped as well as $\oplus$ nodes, and thus we define four auxiliary matrices from the adjacency matrix U:
- $U^{\oplus\oplus}$ (resp. $U^{\otimes\otimes}$) is the matrix of weights of the edges between two $\oplus$ (resp. $\otimes$) nodes,
- $U^{\oplus\bullet}$ (resp. $U^{\otimes\bullet}$) is the matrix of weights of the edges from a $\oplus$ (resp. $\otimes$) node to a $\otimes$ (resp. $\oplus$) node or to a leaf.
Fig. 2. MRK’s Group.
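The grouping operation can be written as matrix operations:
$$U^{\oplus\bullet} \leftarrow U^{\oplus\oplus} \cdot U^{\oplus\bullet} + U^{\oplus\bullet}$$
followed by
$$U^{\oplus\oplus} \leftarrow U^{\oplus\oplus} \cdot U^{\oplus\oplus}$$
and
$$U^{\otimes\bullet} \leftarrow U^{\otimes\otimes} \cdot U^{\otimes\bullet} + U^{\otimes\bullet}$$
followed by
$$U^{\otimes\otimes} \leftarrow U^{\otimes\otimes} \cdot U^{\otimes\otimes}.$$
Since the coefficients of the matrices represent affine functions, usual matrix products cannot be used and have to be slightly modified:
$$(A \cdot B)_{i,j} = \bigoplus_{k=1}^{n} A_{i,k}(x) \circ B_{k,j}(x)$$
when $\oplus$ nodes are grouped, and
$$(A \cdot B)_{i,j} = \bigotimes_{k=1}^{n} A_{i,k}(x) \circ B_{k,j}(x)$$
when $\otimes$ nodes are grouped.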
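To make the modified product concrete, here is a hedged sketch for the case where the first lattice operation is used to combine terms. It assumes the lattice (integers, max, min), represents an affine function x ↦ (a meet x) join b as the pair `(a, b)`, and uses `None` for a missing edge; all of these encoding choices and names are assumptions of this example, not the paper's data structures.

```python
from functools import reduce

JOIN, MEET = max, min  # assumed example lattice: integers with max and min

def apply_affine(f, x):
    """Evaluate the affine function stored as the pair f = (a, b) at x."""
    a, b = f
    return JOIN(MEET(a, x), b)

def compose(f, g):
    """Pair representing x -> f(g(x)); relies on distributivity of the lattice."""
    (a1, b1), (a2, b2) = f, g
    return (MEET(a1, a2), JOIN(MEET(a1, b2), b1))

def join_functions(f, g):
    """Pair representing x -> f(x) JOIN g(x)."""
    (a1, b1), (a2, b2) = f, g
    return (JOIN(a1, a2), JOIN(b1, b2))

def affine_product(A, B):
    """Modified matrix product: entry (i, j) joins the compositions A[i][k] o B[k][j]."""
    n = len(A)
    C = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            terms = [compose(A[i][k], B[k][j]) for k in range(n)
                     if A[i][k] is not None and B[k][j] is not None]
            if terms:
                C[i][j] = reduce(join_functions, terms)
    return C

# Example (assumed): a path 0 -> 1 -> 2 with one affine weight on each edge.
A = [[None, (5, 2), None], [None, None, None], [None, None, None]]
B = [[None, None, None], [None, None, (7, 1)], [None, None, None]]
C = affine_product(A, B)
```

In this example `C[0][2]` is `(5, 2)`, the composition of the two edge functions, which is the weight of the single edge that replaces the path. The symmetric product is obtained by exchanging the roles of `JOIN` and `MEET`.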
The Group procedure has the same complexity as a matrix product: 
$CREW_S(M(n), \log n)$,
where M(n) is the minimal number of processors required to multiply two n × n matrices in time $O(\log n)$, and $CRCW_L(M'(n), 1)$ in a totally ordered lattice, where M'(n) is the minimal number of processors required to multiply two n × n matrices in time $O(1)$. M(n) is roughly bounded by $O(n^3)$, and M'(n) by $O(n^4)$.
PartialEval. This procedure (Fig. 3) and the following one form a generalization of MRK’s Shunt of × nodes; they allow one to shunt both $\oplus$ and $\otimes$ nodes.
Any node having leaf children computes its partial result and puts the result on one (or any, in the lattice case³) edge to a non-leaf child if one exists; otherwise the node keeps only one child (a leaf) and puts its value on the edge between them. This procedure has a $CREW_S(n^2, \log n)$ complexity, and a $CRCW_L(n^2, 1)$ complexity in a totally ordered lattice.
Suppress. The previous procedure may have created unary nodes. We now “compress” 
the chains of unary nodes (only the × nodes in the semi-ring case, both $\oplus$ and $\otimes$
nodes in the lattice case). A pointer-jumping technique is used; it consists in repeating 
the following process until no node has a unary child: each node which has a unary 
child disconnects itself from this child and connects to the only grand-child originated 
3 Using the idempotency of $\oplus$ and $\otimes$.
Fig. 3. The partial evaluation procedure.
from the unary child. For the whole evaluation of the DAG, the total cost of Suppress is $CREW_S(n, \log n)$ and $CRCW_L(n, \log n)$.
The combination of PartialEval and Suppress corresponds to the Shunt procedure of MRK.
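The pointer-jumping step of Suppress can be sketched as follows for a chain of unary × nodes over the integers. The encoding (a dictionary `child` mapping each unary node to its only child and a dictionary `weight` holding the coefficient a of the edge function a × x) and the name `suppress` are assumptions of this example. In this simulation bypassed nodes are simply left in the dictionaries rather than disconnected.

```python
def suppress(child, weight):
    """Pointer jumping over chains of unary nodes.

    child[v]  : the only child of unary node v (a node absent from `child` is not unary)
    weight[v] : coefficient a of the linear function a*x on the edge v -> child[v]
    Returns the new dictionaries and the number of synchronous rounds.
    """
    child, weight = dict(child), dict(weight)
    rounds = 0
    while any(child[v] in child for v in child):
        # The old copy is only read during a round, the new copy is only written.
        old_child, old_weight = child, weight
        child, weight = dict(old_child), dict(old_weight)
        for v, w in old_child.items():
            if w in old_child:  # w is itself unary: jump over it
                child[v] = old_child[w]
                weight[v] = old_weight[v] * old_weight[w]
        rounds += 1
    return child, weight, rounds

# Example (assumed): a chain 0 -> 1 -> ... -> 7 -> leaf 8, every edge carrying 2*x.
chain_child = {i: i + 1 for i in range(8)}
chain_weight = {i: 2 for i in range(8)}
new_child, new_weight, n_rounds = suppress(chain_child, chain_weight)
```

For this chain of eight unary nodes, three rounds suffice, every node ends up pointing at the leaf, and node 0 carries the weight 2^8 = 256, illustrating the logarithmic number of rounds.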
Algorithm. The main procedure of the evaluation algorithm consists of a preprocessing step followed by the “Group-Eval-PartialEval-Suppress” sequence, applied as many times as required.
Algorithm 1. (DAG evaluation)
  Preprocessing
    Group*: do ⌈log n⌉ times Group
    Suppress
  repeat
    Phase
      Group
        $\oplus$ nodes
        $\otimes$ nodes
      Eval
        all nodes in parallel
      PartialEval
        all nodes in parallel
      Suppress
        every chain of unary nodes
    end Phase
  until every node is evaluated.
In the next section, an upper bound of the number of applications of Phase is given. 
Remark. For the CREW version of this algorithm, the read/write protocol is the fol- 
lowing: we work with two copies of the DAG, an old copy used to read the old values 
and a new, modified copy, so as to avoid write conflicts; the nodes and the adjacency 
matrices are then updated. This allows the matrix products to be performed in parallel,
using the old matrices and then updating them.
3. Complexity 
In this section, an upper bound of the complexity of the previous algorithm is given. 
In the lattice case, it is split into three parts. The first one involves an algebraic measure, depending on the function computed by the DAG: $h_\oplus$ corresponds to the height defined by MRK, when $\oplus$ represents the addition and $\otimes$ the multiplication. The second one inverts the roles of $\oplus$ and $\otimes$ and involves $h_\otimes$, the corresponding height. For the last part of the proof, we introduce $h_a$, the maximal number of alternances of $\oplus$ and $\otimes$ on any path of the DAG. Using these measures, we exhibit three different upper bounds. Two characteristic quantities have to be defined upon the DAG: the degree of a DAG and its number of alternances.
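Definition 1. Let us define $d_\oplus$, $d_\otimes$ and d as follows:
$$d_\oplus(v) = \begin{cases} 1 & \text{if } v \text{ is a leaf},\\ \sum_{w} d_\oplus(w \text{ child of } v) & \text{if } v \text{ is a } \oplus \text{ node},\\ \max_{w} d_\oplus(w \text{ child of } v) & \text{if } v \text{ is a } \otimes \text{ node},\end{cases}$$
$d_\oplus$ of a DAG is the maximum of the $d_\oplus$ degrees of its nodes,
$$d_\otimes(v) = \begin{cases} 1 & \text{if } v \text{ is a leaf},\\ \sum_{w} d_\otimes(w \text{ child of } v) & \text{if } v \text{ is a } \otimes \text{ node},\\ \max_{w} d_\otimes(w \text{ child of } v) & \text{if } v \text{ is a } \oplus \text{ node},\end{cases}$$
$d_\otimes$ of a DAG is the maximum of the $d_\otimes$ degrees of its nodes.
The degree d of a DAG is the minimum of its $d_\oplus$ and $d_\otimes$ degrees in the lattice case, $d = d_\times$ otherwise.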
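The degree can be computed by a direct recursion over the DAG, as in the sketch below. It assumes a dictionary encoding `node -> (operation, children)` with operations named "join" and "meet", treats any name absent from the dictionary as a leaf of degree 1, and reads the subscript of each degree as naming the operation at which the degrees of the children are summed. This reading is an interpretation; the value of d, being the minimum of the two, does not depend on it.

```python
from functools import lru_cache

def degree(dag, summing_op):
    """Degree of the DAG when nodes labelled `summing_op` add the degrees of their
    children and nodes with the other label take the maximum."""
    @lru_cache(maxsize=None)
    def d(v):
        if v not in dag:  # leaf
            return 1
        op, children = dag[v]
        child_degrees = [d(w) for w in children]
        return sum(child_degrees) if op == summing_op else max(child_degrees)
    return max(d(v) for v in dag)

def dag_degrees(dag):
    """Return (degree summing at join nodes, degree summing at meet nodes, d = their minimum)."""
    d_join = degree(dag, "join")
    d_meet = degree(dag, "meet")
    return d_join, d_meet, min(d_join, d_meet)

# Example (assumed): out = (a meet b meet c) join a.
example = {"p": ("meet", ["a", "b", "c"]), "out": ("join", ["p", "a"])}
example_degrees = dag_degrees(example)
```

For this example the two degrees are 2 and 3, so d = 2: choosing the better of the two role assignments is what gives the lattice algorithm its advantage.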
This notion of degree does not exactly match the usual definition of “algebraic degree”: a DAG computing $(x + x^2) - (x^2)$ computes the identity function, whose algebraic degree is equal to one, whereas the degree of the DAG is 2. This definition makes it possible to handle such pathological cases; however, d can be thought of as the usual degree as a first approximation.
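Definition 2. We then define $h_a$ as the maximal number of alternances of $\oplus$ and $\otimes$ nodes on a path from an output node to a leaf.⁴
More formally, $h_a(v)$ is equal to 1 if v is a leaf,
$$h_a(v) = \max_{w} \begin{cases} h_a(w) + 1 & \text{if } w \text{ is a } \otimes \text{ child of } v,\\ h_a(w) & \text{if } w \text{ is a } \oplus \text{ child of } v,\\ h_a(w) + 1 & \text{if } w \text{ is a leaf child of } v \end{cases}$$
if v is a $\oplus$ node, and
$$h_a(v) = \max_{w} \begin{cases} h_a(w) + 1 & \text{if } w \text{ is a } \oplus \text{ child of } v,\\ h_a(w) & \text{if } w \text{ is a } \otimes \text{ child of } v,\\ h_a(w) + 1 & \text{if } w \text{ is a leaf child of } v \end{cases}$$
if v is a $\otimes$ node. The number of alternances $h_a$ of a circuit is the maximum of the $h_a$ numbers of alternances of its nodes in the lattice case, $h_a = +\infty$ otherwise.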
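The recursive definition translates directly into a short function. The sketch below assumes the same hypothetical encoding as before (`node -> (operation, children)`, with names absent from the dictionary treated as leaves); the function name `alternances` is an assumption of this example.

```python
def alternances(dag):
    """Number of alternances of the DAG: the maximum over its nodes of h_a."""
    memo = {}

    def h(v):
        if v not in dag:  # leaf
            return 1
        if v not in memo:
            op, children = dag[v]
            # A child with the same operation adds nothing;
            # a child with the other operation, or a leaf child, adds 1.
            memo[v] = max(
                h(w) if (w in dag and dag[w][0] == op) else h(w) + 1
                for w in children
            )
        return memo[v]

    return max(h(v) for v in dag)

# A chain of nodes that all carry the same operation has only 2 alternances ...
same_op_chain = {
    "n1": ("join", ["a", "b"]),
    "n2": ("join", ["n1", "c"]),
    "n3": ("join", ["n2", "d"]),
}
# ... whereas alternating the two operations adds one per level.
alternating_chain = {
    "n1": ("join", ["a", "b"]),
    "n2": ("meet", ["n1", "c"]),
    "n3": ("join", ["n2", "d"]),
}
```

Here `alternances(same_op_chain)` is 2 while `alternances(alternating_chain)` is 4, which shows that the measure ignores long runs of a single operation and counts only the changes of operation along a path.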
These two quantities make it possible to measure the parallel complexity of the DAG, as
shown in the following theorem.
Theorem 1. An upper bound for the complexity of Algorithm 1 is 
$CREW_L(M(n), \log(n) \cdot \min(\log(nd), h_a + \log(n)))$
and
$CRCW_L(M'(n), \min(\log(nd), h_a + \log(n)))$.
Proof. The complexity of the procedure Phase is easy to establish: it is $CREW(M(n), \log n)$, and $CRCW(M'(n), 1)$ in a totally ordered lattice. □
The number of applications of the procedure Phase is somewhat trickier to establish. First of all, an upper bound is established using a new quantity, the height of a DAG (this first upper bound is based on MRK’s proof).
4 This notion of alternance should not be confused with the notion of alternance defined for alternating Turing machines: in the latter context, the notion of alternance refers to the number of random choices made during a given execution.
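Let us define $h_\oplus$ as follows: v being a node of the DAG, let $h_\oplus(v)$ be 1 if v is a leaf,
$$h_\oplus(v) = \max_{w} \begin{cases} h_\oplus(w) + \frac{1}{2} & \text{if } w \text{ is a } \oplus \text{ child},\\ h_\oplus(w) & \text{if } w \text{ is a } \otimes \text{ child},\\ h_\oplus(w) & \text{if } w \text{ is a leaf child} \end{cases}$$
if v is a $\oplus$ node, and
$$h_\oplus(v) = \sum_{w} h_\oplus(w \text{ child})$$
if v is a $\otimes$ node.
A dominant child w of a $\oplus$ node v is a child such that $h_\oplus(v) = h_\oplus(w) + \frac{1}{2}$ if w is a $\oplus$ node, $h_\oplus(v) = h_\oplus(w)$ otherwise. $h_\otimes(v)$ is the height obtained by exchanging $\oplus$ and $\otimes$ in the previous definition.
The height $h_\oplus$ (resp. $h_\otimes$) of a DAG is the maximum of the heights of its nodes.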
We assume that there is no unary node in the input DAG (thanks to the preprocessing ending with Suppress). Following the scheme of MRK’s proof, let us show that $h_\oplus$ is divided by 2 by one application of Phase. Let us consider Phase, Group, Eval, ... as maps from circuits to circuits. They are surjective on nodes, and modify only the edge structure. We denote by U = (X, E) a circuit and by U' = (X', E'), with X' = X, its image by Phase, by v a node in X and by v' ∈ X' its image, and by $U_v$ and $U'_{v'}$ the subcircuits they induce (i.e. the subcircuits computing their value). w' will represent any child of v', and w its antecedent by Phase.
Let v be a $\otimes$ node having one $\oplus$ child w. If w', its image, is not a child of v', the only possibility is that w has become a unary node with child x after PartialEval, and has been suppressed by Suppress. Thus the v-w edge has been replaced by only one edge v'-x'. From this point we deduce:
Corollary 1. If v' is a $\otimes$ node, then its height is $h_\oplus(v') = \sum h_\oplus(w' \text{ child of } v')$ and the height of its antecedent is $h_\oplus(v) \ge \sum h_\oplus(w \text{ antecedent of } w')$.
The same holds when $\oplus$ and $\otimes$ are exchanged.
Theorem 2. If U and U' are arithmetic circuits, if v' is a node of U' which is neither a leaf nor an output node (a node without parents) and v is its antecedent, then $h_\oplus(v) \ge 2 h_\oplus(v')$.
Proof. Let us prove it by induction on the size of $U'_{v'}$, the subcircuit induced by v' in U'.
Initialization. Let v' be neither a leaf nor an output node of U', and let v' have only leaf children.
If v' is a $\oplus$ node, $h_\oplus(v') = 1$. Let us prove that $h_\oplus(v) \ge 2$ by contradiction. If $h_\oplus(v) < 2$, either $h_\oplus(v) = 1$ or $h_\oplus(v) = \frac{3}{2}$. If $h_\oplus(v) = 1$, since there is no unary node, v is either a leaf or a $\oplus$ node whose children are leaves. After Eval, v is a leaf, and thus v' is a leaf, which contradicts the assumption. If $h_\oplus(v) = \frac{3}{2}$, then v is a $\oplus$ node whose children are either $\oplus$ nodes with only leaf children, or possibly leaves. After Group, every child of v is a leaf, and after Eval v itself is also a leaf. Thus, v' is a leaf, and this is a contradiction. The case where v' is a $\oplus$ node is completed.
If v' is a $\otimes$ node, $h_\oplus(v') = \#(w' \text{ child of } v')$. It is enough to prove that every w antecedent of w' has a height ≥ 2, thanks to Corollary 1. If $h_\oplus(w) < 2$, $h_\oplus(w) = 1$ or $h_\oplus(w) = \frac{3}{2}$. If $h_\oplus(w) = \frac{3}{2}$, then (cf. the previous case) after Group w is a $\oplus$ node having only leaf children. If $h_\oplus(w) = 1$, after Group w is also a $\oplus$ node having only leaf children. In both cases, after Eval w is a leaf. Since w' is a child of v', w must be the only child of v after PartialEval. Thus, v is a unary node and after Suppress it is disconnected from its parents. This means that v' is an output node, which is a contradiction. Hence, every w antecedent of w' child of v' has a height ≥ 2, and $h_\oplus(v) \ge \sum h_\oplus(w) \ge 2 \sum h_\oplus(w') = 2 h_\oplus(v')$ (by Corollary 1).
The base case of the induction is established; let us now treat the general case.
General case. The induction hypothesis is that for any subcircuit $U_w$ of size ≤ k (i.e. with a number of nodes ≤ k), and such that w', the image of w, is neither a leaf nor an output node, the height of w is divided by (at least) 2 by Phase.
Let v be a node such that $U_v$ is of size k + 1; let us prove that $h_\oplus(v)$ is divided by 2 by Phase.
• If v is a $\oplus$ node, v' is also a $\oplus$ node. Let w' be a dominant child of v' and w its antecedent. If w' is a $\otimes$ node, then $h_\oplus(v') = h_\oplus(w') \le \frac{1}{2} h_\oplus(w)$ by a straightforward application of the induction hypothesis, and since w is a descendant of v, $h_\oplus(w) \le h_\oplus(v)$. Thus, $2 h_\oplus(v') \le h_\oplus(v)$.
• If w' is a $\oplus$ node, we only need to show that there exists a path of length at least 2 between v and w. If such a path does not exist, then v and w are adjacent, and Group disconnects them. Hence, we have $h_\oplus(w) + 1 \le h_\oplus(v)$, and $h_\oplus(v') = h_\oplus(w') + \frac{1}{2} \le \frac{1}{2}(h_\oplus(w) + 1) \le \frac{1}{2} h_\oplus(v)$.
In conclusion, the property is true for $\oplus$ nodes.
• If v is a $\otimes$ node, v' is also a $\otimes$ node. $h_\oplus(v') = \sum h_\oplus(w' \text{ child of } v')$. Thanks to Corollary 1, we only need to prove that for each w', $h_\oplus(w') \le \frac{1}{2} h_\oplus(w)$. w' is not an output node since it has a parent v'. If w' is not a leaf, then the induction hypothesis applies, and $h_\oplus(w') \le \frac{1}{2} h_\oplus(w)$.
The only delicate case occurs when w' is a leaf. Let us show that $h_\oplus(w) \ge 2$. By contradiction, if $h_\oplus(w) < 2$, as for the initial case, v' would be an output node. This contradicts the initial assumption on v'. Hence, we deduce that $h_\oplus(w) \ge 2$ if w' is a leaf. So $h_\oplus(v') = \sum h_\oplus(w' \text{ child of } v') \le \frac{1}{2} \sum h_\oplus(w) \le \frac{1}{2} h_\oplus(v)$.
By this induction, we proved that for any node in the DAG which is not transformed into a leaf or an output node by Phase, its height $h_\oplus$ is divided by 2. □
Let $h = \min(h_\oplus, h_\otimes)$.
Theorem 3. Each application of Phase on a circuit divides its height $h = \min(h_\oplus, h_\otimes)$ by 2.
Proof. It is true for $h_\oplus$. Since Algorithm 1 deals with $\oplus$ and $\otimes$ in a symmetric manner, it is also true for $h_\otimes$; hence, it is true for h, the minimum of $h_\oplus$ and $h_\otimes$. □
By this theorem, after $\lceil \log_2 h \rceil$ applications of Phase, a circuit of height h is transformed into a circuit with only leaves and output nodes. One Eval is enough to evaluate every node. (Note that the preprocessing does not increase the height, and thus its influence can be neglected.)
MRK proved that $h_\oplus \le \frac{1}{2} e_\oplus d_\otimes + d_\otimes$, where $e_\oplus$ is the number of $\oplus$-$\oplus$ edges. Similarly, $h_\otimes \le \frac{1}{2} e_\otimes d_\oplus + d_\oplus$, where $e_\otimes$ is the number of $\otimes$-$\otimes$ edges.
Thus, after $\lceil \log_2 h \rceil + 1 = O(\log(nd))$ applications of Phase, a circuit of degree d with n nodes is evaluated.
Let us tackle the last part of the proof. Intuitively, what limits the progress of the Phase procedure is the alternance of $\oplus$ and $\otimes$ nodes.
Let us show that at most $(h_a + \log n)$ applications of Phase suffice to evaluate the DAG.
Firstly, none of the Group, Eval, PartialEval or Suppress procedures increases $h_a$.
Secondly, the preprocessing plays an important role: Group* computes a transitive closure of $\oplus$ nodes and $\otimes$ nodes. After Group*, $h_a$ is the length of the longest path of the DAG + 1. Obviously, the parallel evaluation time of a DAG is at most the length of its longest path: $h_a$ applications of Eval are enough to evaluate the DAG; a fortiori, $h_a$ applications of Phase suffice to evaluate it.
If these results are put together, they imply Theorem 1: at most $O(\min(\log(nd), h_a + \log(n)))$ applications of Phase are enough in order to evaluate every node of the DAG.
Remark. The complexity of the preprocessing is bounded by the complexity of log n applications of Phase; the preprocessing can even be replaced by log n applications of Phase if a homogeneous algorithm is preferred instead.
4. Applications 
The aim of the following examples is only to illustrate the efficiency of Algorithm 1. 
They have been chosen because their time complexity is already well-known and thus 
the comparison between the best implementation and the performances of our algorithm 
is possible. The results we obtain are encouraging. Real applications are mentioned at 
the end of this section and will constitute our future work. 
4.1. Addition and multiplication of two n-bit numbers 
In order to illustrate the complexity of Algorithm 1, we have simulated its execution 
and counted the number of applications of Phase. The input straight-line programs are 
146 N. Revel, J.-L. RochITheoretical Computer Science 162 (1996) 133-150 
[Figure 4: plot with horizontal axis from 0 to 120; legend entries: tree contraction (o), log n + 2, lattice algorithm (+), 1.35 log n + 2.]
Fig. 4. Experimental results for the addition.
classical "paper-and-pencil" algorithms for infinite precision arithmetic (either integer or
fixed-point real), described with boolean gates. Let us consider the addition
of two integers, the adaptation to real fixed-point addition being straightforward. Let
$a = (a_{n-1} \ldots a_0)$ and $b = (b_{n-1} \ldots b_0)$ be two n-bit integers. Introducing carries $c_i$, the
equations defining the result $r = (r_{n-1} \ldots r_0)$, which we assume to be given in the VLSI
specification, are the following:
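\[
\begin{aligned}
c_{-1} &= 0, \\
c_i &= (a_i \wedge b_i) \vee ((a_i \vee b_i) \wedge c_{i-1}), \\
r_i &= \bigl([(a_i \wedge \bar{b}_i) \vee (\bar{a}_i \wedge b_i)] \wedge \bar{c}_{i-1}\bigr) \vee \bigl([(\bar{a}_i \wedge \bar{b}_i) \vee (a_i \wedge b_i)] \wedge c_{i-1}\bigr), \\
r_n &= c_{n-1} \quad (r_n \text{ is an overflow flag}).
\end{aligned}
\]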
It may be noticed that we consider here the simplest possible VLSI specification.
The corresponding boolean circuit has O(n) nodes. Its degree is linear in n, and
$h_a$ is also linear. Thus, the predicted complexity is CRCWL(M'(n), log n).⁵ In Fig. 4,
we can check that this complexity is achieved, and that the constants are small. Since 
the parallel time is logarithmic for the tree contraction technique, and since this al- 
gorithm requires only a linear number of processors, this result means that an adder 
with a linear number of gates and a logarithmic delay exists. It can even be built if 
this contraction algorithm is used as a compiler instead of being used as an interpreter. 
An adder with the same properties has been proposed by Brent and Kung [3]. Our 
algorithm presents the advantage that it can build the circuit automatically from the 
classical boolean equations of the addition. 
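As an illustration of the boolean equations of the addition, here is a minimal sequential sketch that evaluates them gate by gate, taking the carry into the least significant bit to be 0. The helper names (`add_bits`, `to_bits`, `from_bits`) are hypothetical; the code only checks the specification and is not the contraction algorithm itself.

```python
def add_bits(a_bits, b_bits):
    """Evaluate the carry and sum equations bit by bit.

    a_bits[i] and b_bits[i] hold bit i (least significant first).
    Returns [r_0, ..., r_n], where r_n is the overflow flag.
    """
    carry = False  # carry entering bit 0
    result = []
    for a, b in zip(a_bits, b_bits):
        a, b = bool(a), bool(b)
        xor_ab = (a and not b) or (not a and b)
        xnor_ab = (not a and not b) or (a and b)
        # r_i: XOR of a_i, b_i and the incoming carry, written with AND/OR/NOT
        result.append((xor_ab and not carry) or (xnor_ab and carry))
        # c_i = (a_i AND b_i) OR ((a_i OR b_i) AND c_{i-1})
        carry = (a and b) or ((a or b) and carry)
    result.append(carry)  # r_n = c_{n-1}
    return result


def to_bits(value, n):
    """n-bit representation of value, least significant bit first."""
    return [bool((value >> i) & 1) for i in range(n)]


def from_bits(bits):
    """Integer encoded by a list of bits, least significant bit first."""
    return sum(1 << i for i, bit in enumerate(bits) if bit)


print(from_bits(add_bits(to_bits(11, 4), to_bits(6, 4))))  # 17
```

The loop follows the carry chain sequentially, which is exactly the linear dependency that the contraction algorithm is meant to shorten.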
The same results (logarithmic time, small constants) occur when the boolean circuit 
for the multiplication of two n-bit numbers is evaluated [20]. 
5 In the boolean algebra, $M'(n) = O(n^3)$.
To obtain these results, only one test with arbitrary inputs has been done, in order 
to determine the number of steps needed to compute the result; indeed, there is no 
trick using the actual values of the inputs in the algorithm; thus, the number of steps 
is the same, whatever the inputs are. 
The addition requires the NOT operator. Goldschlager [11] has proven that a boolean
circuit with NOT gates can be transformed into a monotone boolean circuit (without
NOT gates). In fact, we did not use this transformation; instead, we slightly modified
the algorithm so that it can handle the NOT operation: the weighting functions are
True, False, $x$, $\neg x$ (or more generally $((a \otimes x) \oplus (b \otimes \neg x) \oplus c)$-like functions); since
NOT is a unary operation, the NOT nodes are suppressed; since $a \oplus \neg(b \otimes c) =
a \oplus (\neg b) \oplus (\neg c)$, $\oplus$ and $\otimes$ nodes are grouped when they are connected by a $\neg x$ edge;
since $a \otimes \neg(b \otimes c) = a \otimes ((\neg b) \oplus (\neg c))$, $\otimes$ nodes are not grouped when they are
connected by a $\neg x$ edge, neither are the $\oplus$ nodes. The evaluation time is the same for
the monotone circuit as for the circuit with NOT gates; we can thus save half of the
memory.
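The two identities used for the grouping rules are instances of De Morgan's law. The short check below verifies them exhaustively, assuming that in the boolean algebra the first operation is OR and the second is AND; the function names are hypothetical. It also shows how the generalized weighting function specializes to True, False, x and NOT x.

```python
from itertools import product


def weighting(a, b, c, x):
    """Generalized weighting function (a AND x) OR (b AND NOT x) OR c."""
    return (a and x) or (b and not x) or c


def check_not_rules():
    """Exhaustively check both rewriting identities over the booleans."""
    for a, b, c in product([False, True], repeat=3):
        # a OR NOT(b AND c) == a OR (NOT b) OR (NOT c)
        assert (a or not (b and c)) == (a or (not b) or (not c))
        # a AND NOT(b AND c) == a AND ((NOT b) OR (NOT c))
        assert (a and not (b and c)) == (a and ((not b) or (not c)))
    return True


print(check_not_rules())  # True
for x in (False, True):
    # Special cases: the functions x and NOT x
    assert weighting(True, False, False, x) == x
    assert weighting(False, True, False, x) == (not x)
```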
4.2. Parallel sort 
Algorithm 1 can be used as a predictor of parallel complexity: computing the
degree and the maximal number of alternations of a program provides an estimate of
its parallel complexity, which is often non-trivial. For instance, let us consider the
problem of sorting an array of n distinct numbers $x_1, \ldots, x_n$ and of placing them in an array
$y_1, \ldots, y_n$ in increasing order. The structure $(\mathbb{R} \cup \{+\infty\} \cup \{-\infty\}, \min, \mathrm{Max}, +\infty, -\infty)$
is a distributive lattice. The easiest way to solve this problem is to program an
insertion sort. An obvious parallelization of the insertion procedure leads to the following
algorithm, the upper index denoting the step:
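Algorithm 2. (Insertion sort)
for i = 1 to n do
    if i = 1 then
        $[y_1^{(1)}] \leftarrow [x_1]$
    if i = 2 then
        $[y_1^{(2)}, y_2^{(2)}] \leftarrow [\min(x_1, x_2), \mathrm{Max}(x_1, x_2)]$
    if i > 2 then
        $y_1^{(i)} \leftarrow \min(x_i, y_1^{(i-1)})$
        $y_i^{(i)} \leftarrow \mathrm{Max}(y_{i-1}^{(i-1)}, x_i)$
        for j = 2 to i - 1 do
            $y_j^{(i)} \leftarrow \mathrm{Max}(y_{j-1}^{(i-1)}, \min(y_j^{(i-1)}, x_i))$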
Actually, the computation of $y_j^{(i)}$ from $y_{j-1}^{(i-1)}$, $y_j^{(i-1)}$ and $x_i$ corresponds to the
computation of the second element of the sorted array containing these three elements.
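The insertion sort of Algorithm 2 can be sketched as follows, using only min and max as in the lattice formulation. This is an illustrative sequential transcription (the function name `insertion_sort_minmax` is hypothetical); within one step i, every entry of the new array depends only on the previous array and on the inserted element, so these entries could be computed in parallel.

```python
def insertion_sort_minmax(xs):
    """Sort xs with the min/max recurrences of the insertion sort circuit."""
    y = []
    for i, x in enumerate(xs, start=1):
        if i == 1:
            y = [x]
        elif i == 2:
            y = [min(xs[0], x), max(xs[0], x)]
        else:
            prev = y                      # sorted array after step i-1
            new = [min(x, prev[0])]       # first entry
            for j in range(2, i):         # j = 2 .. i-1 (1-based)
                # median of prev[j-2] <= prev[j-1] and x
                new.append(max(prev[j - 2], min(prev[j - 1], x)))
            new.append(max(prev[i - 2], x))  # last entry
            y = new
    return y


print(insertion_sort_minmax([3, 1, 4, 5, 9, 2, 6]))  # [1, 2, 3, 4, 5, 6, 9]
```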
The maximal number of alternations of the corresponding circuit is at least linear:
$h_a \ge h_a(y_2^{(n)})$ and $h_a(y_2^{(i)}) = 2 + h_a(y_2^{(i-1)}) = 2(i - 1)$. The degree $d_{\mathrm{Max}}$ is exponential:
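for $2 \le i \le n$,
$d_{\mathrm{Max}}(y_1^{(i)}) = d_{\mathrm{Max}}(y_1^{(i-1)})$
$d_{\mathrm{Max}}(y_j^{(i)}) = d_{\mathrm{Max}}(y_{j-1}^{(i-1)}) + d_{\mathrm{Max}}(y_j^{(i-1)})$, $2 \le j \le i - 1$
$d_{\mathrm{Max}}(y_i^{(i)}) = d_{\mathrm{Max}}(y_{i-1}^{(i-1)}) + 1$
and they correspond to the binomial coefficients. The degree $d_{\min}$ is linear: for $2 \le i \le n$,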
$d_{\min}(y_1^{(i)}) = d_{\min}(y_1^{(i-1)}) + 1$
$d_{\min}(y_j^{(i)}) = \max(d_{\min}(y_j^{(i-1)}) + 1, d_{\min}(y_{j-1}^{(i-1)}))$, $2 \le j \le i - 1$
Thus, Algorithm 1 evaluates the insertion sort circuit in logarithmic time on a CRCW- 
PRAM. The parallel complexity of the best-known algorithms [12] for this problem is 
thus automatically predicted, using a simple and a priori not highly parallel algorithm. 
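The contrast between the two degrees of the insertion sort circuit can be checked numerically. The sketch below assumes the usual degree rule: an input has degree 1, the operation playing the multiplicative role adds degrees, and the other operation takes their maximum. The function name `degrees` and its `mult_op` parameter are hypothetical.

```python
import math


def degrees(n, mult_op):
    """Degrees of y_1 .. y_n after step n of the insertion sort circuit.

    mult_op ("min" or "max") names the operation that adds degrees;
    the other operation takes the maximum of the degrees of its operands.
    """
    def combine(op, left, right):
        return left + right if op == mult_op else max(left, right)

    leaf = 1                              # degree of an input x_i
    y = [leaf]                            # step 1: y_1 = x_1
    for i in range(2, n + 1):
        prev = y
        new = [combine("min", leaf, prev[0])]
        for j in range(2, i):             # j = 2 .. i-1 (1-based)
            inner = combine("min", prev[j - 1], leaf)
            new.append(combine("max", prev[j - 2], inner))
        new.append(combine("max", prev[i - 2], leaf))
        y = new
    return y


print(degrees(5, "max"))  # [1, 5, 10, 10, 5]
print(degrees(5, "min"))  # [5, 4, 3, 2, 1]
```

Under these assumptions the degrees with Max in the multiplicative role follow Pascal's triangle, hence grow exponentially, while those with min in that role never exceed the step number.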
4.3. A P-complete boolean circuit
A problem of particular interest is the lexicographic maximal independent set problem,
denoted by LMIS; it is a P-complete problem. As it happens, the LMIS is one of our
worst cases. The LMIS problem is the following:
let G = (V, E) be a graph, and V be a linearly ordered set: $V = \{v_1, \ldots, v_n\}$ with
$v_1 > v_2 > \cdots > v_n$. The LMIS problem consists in determining a maximal set of
vertices that form an empty subgraph; this set must be maximal for the order on V.⁶
A greedy sequential algorithm gives the solution, and the corresponding boolean
circuit has an exponential degree and a linear $h_a$; the complexity bound for its parallel
evaluation using Algorithm 1 is thus O(n).
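A minimal sketch of a greedy sequential algorithm for the LMIS problem is given below, assuming vertices are numbered 1 to n in decreasing order of priority (vertex 1 is the greatest). The function name `lexicographic_mis` and the edge-list input format are illustrative choices, not taken from the paper.

```python
def lexicographic_mis(num_vertices, edges):
    """Greedy lexicographically maximal independent set.

    Vertices are 1 .. num_vertices, vertex 1 being the greatest in the order.
    edges is an iterable of pairs (u, v).
    """
    neighbours = {v: set() for v in range(1, num_vertices + 1)}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    chosen = set()
    for v in range(1, num_vertices + 1):
        # v is selected iff none of its already examined neighbours is selected
        if not (neighbours[v] & chosen):
            chosen.add(v)
    return sorted(chosen)


print(lexicographic_mis(4, [(1, 2), (2, 3), (3, 4)]))  # [1, 3]
```

Each decision depends on the decisions made for all greater neighbours, which gives a long chain of alternating negations and conjunctions.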
4.4. Application fields: circuits for fixed-point arithmetic
First of all, some other boolean circuits can be studied in the same way as the addi- 
tion or multiplication circuits. Thus an estimation of their depth can be easily obtained. 
If this estimation is good, it is then possible to “compile” the original circuit into an- 
other one, achieving the depth bound: Algorithm 1 has to be applied “symbolically” 
on the circuit, i.e. each operation such as disconnect from the child and connect to a 
grand-child has to be physically realized, but no evaluation is performed. Furthermore, 
for the addition and multiplication circuits, it appears experimentally that the adjacency 
matrices are very sparse. Thus, if only the useful computations are performed for the 
Group operations, the number of simultaneous operations decreases significantly, i.e. 
6 Without the lexicographic order constraint, this problem can be solved efficiently in $O(\log^2(n))$ parallel
time.
the size of the circuit is small. For the addition circuit, only a linear number of
operations are performed at each step, and then the size of the "compiled" circuit is small
compared to the $O(n^3)$ theoretical bound.
It seems that the usual arithmetic operations (implemented either by power series
or in a "CORDIC way", for instance) present the same characteristics: very good
theoretical time, and a theoretically overestimated size which appears to be reasonable in
practice. In such cases, Algorithm 1 can be used as a preprocessor for VLSI design,
in order to build a circuit with a good cost.
In other areas, Algorithm 1 cannot be used because it evaluates a straight-line pro- 
gram with too many processors. However, the complexity result of Section 3 can be 
used in order to estimate an upper bound of the parallel time required to solve a prob- 
lem, as a predictor. Such areas include for instance graph theory: problems such as 
the computation of connected components are expressed in terms of lattice operations 
(min-max), optimization problems or reliability studies require either lattice (min-max)
or semi-ring (max-+, A-x) operations and finally enumeration problems use a semi- 
ring. Lastly, simulations of discrete events systems (modeled by timed Petri nets for 
instance) perform computations in the semi-ring (R, max, +). 
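Since the algebraic structures listed above differ only in their two operations, a matrix product can be written once and parameterized by the structure. The sketch below is a plain sequential illustration (the function name `semiring_matmul` and its parameters are hypothetical), shown for the (max, +) semi-ring and for the min-max lattice.

```python
import math
from functools import reduce
from operator import add as plus


def semiring_matmul(a, b, add, mul, zero):
    """Product of matrices a and b over the structure (add, mul).

    zero is the neutral element of add.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [
            reduce(add, (mul(a[i][k], b[k][j]) for k in range(inner)), zero)
            for j in range(cols)
        ]
        for i in range(rows)
    ]


m = [[1, 2], [3, 4]]
# (max, +) semi-ring: "addition" is max, "multiplication" is +
print(semiring_matmul(m, m, max, plus, -math.inf))   # [[5, 6], [7, 8]]
# min-max lattice: "addition" is min, "multiplication" is max
print(semiring_matmul(m, m, min, max, math.inf))     # [[1, 2], [3, 3]]
```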
5. Conclusion 
In this paper, an algorithm for the parallel contraction of arithmetic circuits has 
been presented. Firstly, it unifies the various algorithms designed for different algebraic 
structures. Secondly, it improves previous algorithms by the use of the lattice's
algebraic properties, and by a symmetric treatment of the lattice's $\oplus$ and $\otimes$ operations.
Its complexity is CREW(M(n), log n · min(log n + log d, $h_a$ + log n)), where d is the
algebraic degree of the circuit, and $h_a$ is the maximal number of alternations of $\oplus$ and
$\otimes$ in the DAG in the lattice case, $+\infty$ otherwise. In most problems, this bound is
rather tight; this algorithm thus appears as a predictor for the time complexity of an
algorithm, and as an indicator of the algebraic properties that have to be taken into 
account, in order to reach this time. The mapping issue is not covered by this approach: 
firstly, the material resources criterion is not minimized (the number of processors in a 
parallel computation or the area of a VLSI circuit); secondly, the problem of reusing 
the processing components is not considered. Further work has to be done in order to 
attain a trade-off between time and material resources - reflecting the $AT^2$ measure of
quality in VLSI design. 
More generally, this easy-to-compute complexity estimation provided by Algorithm 1
is particularly interesting: it can be used to detect the existence of reductions in
numerical programs. A constant $h_a$, for instance, means that one operation prevails,
and that reductions based on this operation probably exist. A linear $d_\oplus$ means that
reductions of the "dot-product" kind are worth searching for, whereas a linear $d_\otimes$ indicates
that $p \leftarrow p \otimes x_i \oplus y_i$ patterns should preferably be looked for. This is valid even if the
$\oplus$ and $\otimes$ operations do not define a lattice. Thus, the complexity results established
in Section 3 can be integrated in automatic parallelizing tools, in order to guide the 
detection of reductions: indeed, the reduction procedures are now integrated in most of 
the parallel languages (HPF, MPI, . . .) because they are performed efficiently on most 
of the highly parallel arithmetic units. Since the detection algorithms are rather costly 
and cannot be applied to the whole numerical program to be parallelized, they really 
need such an expert tool to guide them. 
References 
[1] K. Abrahamson, N. Dadoun, D.G. Kirkpatrick and T. Przytycka, A simple parallel tree contraction
algorithm, J. Algorithms 10 (1989) 287-302.
[2] A. Borodin, On relating time and space to size and depth, SIAM J. Comput. 6 (1977) 733-744.
[3] R.P. Brent and H.T. Kung, A regular layout for parallel adders, IEEE Trans. Comput. C-31 (1982)
260-264.
[4] R. Cole and U. Vishkin, Approximate and exact parallel scheduling with applications to list, tree and
graph problems, in: 27th IEEE Symp. on Foundations of Computer Science (1986) 478-491.
[5] R. Cole and U. Vishkin, Approximate parallel scheduling. Part I: the basic technique with applications
to optimal list ranking in logarithmic time, SIAM J. Comput. 17 (1988) 128-142.
[6] S.A. Cook, Towards a complexity theory of synchronous parallel computations, Enseignement
Mathématique 27 (1981) 99-124.
[7] S.A. Cook, A taxonomy of problems with fast parallel algorithms, Inform. Control 64 (1985) 2-22.
[8] D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progression (full paper), J.
Symbolic Comput. 9 (1990) 251-280.
[9] J. von zur Gathen and G. Seroussi, Boolean circuits versus arithmetic circuits, Proc. 6th Internat. Conf.
Computer Science, Santiago, Chile (1986) 171-184.
[10] A. Gibbons and W. Rytter, Optimal parallel algorithms for dynamic expression evaluation and
context-free recognition, Lecture Notes in Computer Science, VLSI Algorithms and Architecture, Vol. 319
(Springer, Berlin, 1988) 32-45.
[11] L.M. Goldschlager, The monotone and planar circuit value problems are log space complete for P,
SIGACT News 9 (1977) 25-29.
[12] J. JáJá, An Introduction to Parallel Algorithms (Addison-Wesley, Reading, MA, 1992).
[13] E. Kaltofen, Greatest common divisors of polynomials given by straight-line programs, J. ACM 35
(1988) 231-264.
[14] R.M. Karp and V. Ramachandran, in: J. van Leeuwen, ed., Handbook of Theoretical Computer Science,
ch. Parallel Algorithms for Shared-Memory Machines (Elsevier, Amsterdam, 1990) 869-941.
[15] S.R. Kosaraju and A.L. Delcher, Optimal parallel evaluation of tree-structured computations by raking,
Lecture Notes in Computer Science, VLSI Algorithms and Architecture, Vol. 319 (Springer, Berlin,
1988) 103-110.
[16] T. Lengauer, in: J. van Leeuwen, ed., Handbook of Theoretical Computer Science, ch. VLSI
Theory (Elsevier, Amsterdam, 1990) 835-868.
[17] G.L. Miller and J.H. Reif, Parallel tree contraction and its application, in: 26th IEEE Symp. on
Foundations of Computer Science (1985) 478-489.
[18] G.L. Miller, V. Ramachandran and E. Kaltofen, Efficient parallel evaluation of straight-line code and
arithmetic circuits, SIAM J. Comput. 17 (1988) 687-695.
[19] G.L. Miller and S.-H. Teng, Dynamic parallel complexity of computational circuits, J. ACM (1987)
254-263.
[20] N. Revol, Complexité de l'évaluation parallèle de circuits arithmétiques, Ph.D. Thesis, LMC, INPG,
1994.
[21] W.L. Ruzzo, On uniform circuit complexity, J. Comput. System Sci. 22 (1981) 365-383.
[22] V. Strassen, Gaussian elimination is not optimal, Numerische Mathematik 13 (Heft 4) (1969) 354-356.
[23] V. Strassen, Vermeidung von Divisionen, J. Mathematik 264 (1973) 184-202.
