Area-Time Optimal Division for $T = \Omega((\log n)^{1+\epsilon})$
by
K. Mehlhorn* and F.P. Preparata**
* Fachbereich 10, Informatik, Universität des Saarlandes, 6600, Saarbrücken, 
West Germany.
** Coordinated Science Laboratory, University of Illinois, Urbana, IL 61801.
This work was supported by the DFG, SFB 124, VLSI Entwurf und Parallelität, 
and by NSF Grant ECS-84-10902.
A family of area-time ($AT^2$) optimal networks for the computation of the
inverse of an n-bit number (referred to here as "dividers") was proposed some
time ago by Mehlhorn [1]. A network of this type can be constructed for each
computation time $T$ in the range $[\Omega(\log^2 n), O(\sqrt{n})]$. Since then
considerable progress has been made in the design of faster dividers [2],
culminating in the result of Beame-Cook-Hoover [3], which exhibits an
$O(\log n)$-time divider (i.e., a time-optimal network under the hypothesis of
bounded-fan-in components). However, the Beame-Cook-Hoover network (referred to
here as the BCH network) does not achieve area optimality. Thus, it is natural
to ask whether area-time optimal dividers exist for $T = o(\log^2 n)$. This
paper provides an affirmative answer for
$T \in [\Omega((\log n)^{1+\epsilon}), O(\log^2 n)]$ for any positive constant
$\epsilon < 1$. It must be pointed out that the proposed networks are so
complicated, notwithstanding their area-time optimality, that they are
exclusively of theoretical interest.
The network (see Figure 1) consists of $\lceil 1/\epsilon \rceil + 2$ cascaded
modules. (For simplicity we assume that $1/\epsilon$ is an integer.) The first
$1/\epsilon$ modules are modified dividers of the BCH type, computing a sequence
of approximations of the inverse with increasing numbers of bits
$l_1 < l_2 < \cdots < l_{1/\epsilon} < n$.
Fig. 1: Block structure of the divider
The last two modules are designed to complete the build-up of the result size
from $l_{1/\epsilon}$ to $n$ bits by implementing the Newton approximation
method, which, at each iteration, doubles the length of the result. This is
carried out in two phases, respectively executed by the "fast" and "slow"
approximators. The fast approximator basically consists of a single area-time
optimal fastest multiplier, used to execute the initial iterations; the slow
approximator is instead a cascade of affordably slow multipliers, each executing
one of the final iterations. Both approximators execute $O(\log\log n)$
iteration steps. Note that the cascade of the two Newton approximators
structurally coincides with Mehlhorn's divider [1].
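The doubling behaviour of the Newton approximation can be illustrated with a small sketch in exact rational arithmetic. It is not the circuit of [1]: the helper names `truncate` and `newton_inverse`, the starting precision, and the choice of keeping $2b - 2$ bits after a step from $b$ bits (two guard bits, so the length only nearly doubles) are assumptions made for this illustration.

```python
from fractions import Fraction

def truncate(value, bits):
    """Drop everything beyond `bits` binary places right of the point."""
    scale = 1 << bits
    return Fraction(int(value * scale), scale)

def newton_inverse(x, n, start_bits=4):
    """Approximate 1/x for x in [1/2, 1) to better than 2**-n.

    Hypothetical sketch: the start value stands in for the output of the
    upstream modules; each step applies v <- v * (2 - x * v).
    """
    bits = start_bits
    v = truncate(1 / x, bits)
    steps = 0
    while bits < n + 1:
        bits = 2 * bits - 2              # nearly doubles the result length
        v = truncate(v * (2 - x * v), bits)
        steps += 1
    return v, steps

v, steps = newton_inverse(Fraction(3, 5), 32)
print(steps, float(v))
```

Since the exact step satisfies $1/x - v(2 - xv) = x(1/x - v)^2$, the number of steps grows only logarithmically in the number of bits still to be produced.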
The paper is organized as follows. In Section 2 we present a more efficient
implementation of the BCH method leading to a circuit referred to as the
"modified BCH divider". In Section 3 we discuss an alternative method for the
computation of the inverse, which uses the modified BCH method as a subroutine.
Finally, in Section 4 we illustrate the combination of the previous techniques
with the Newton approximation, to yield our proposed network, while Section 5
contains a few closing remarks.
2. An efficient implementation of the BCH method.
In this section we first describe (a variant of) the BCH method [3] and then modify 
it so as to reduce its area requirement.
The original BCH method computes the inverse of an n-bit number $x$ by adding
the first $n$ powers of $u = 1 - x$ and truncating the $n^2$-bit result to its
leading $n$ bits. Each power of $u$ is computed individually and the $n$ powers
are subsequently added together; so we just consider the computation of $u^n$.
The approach consists of taking the "logarithm" of $u$, multiplying it by $n$,
and then taking the "antilogarithm". Since taking logarithms of large numbers is
very hard, the method resorts to a modular representation and works as follows:
Algorithm INVERSE1(x)
Input: an n-bit number $x$ in the range $[1/2, 1)$. Given are primes
$p_1, \ldots, p_m$ such that $\prod_{j=1}^{m} p_j > 2^{n^2}$.
(Note that $m \approx n^2/\log n$; $n$ is assumed to be a power of two.)
Output: an $(n+2)$-bit number $v$ in the range $(1, 2]$, so that
$v \times x = 1 + \delta$ with $\delta < 2^{-n-1}$ ($v$ is given by the first
$n + 2$ bits of $\sum_{i=0}^{n-1} (1 - x)^i$)
(1) begin $u := (1 - x) 2^n$; (* $u$ is an integer *)
(2) for $j$, $1 \le j \le m$
(3) pardo $b_j := u \bmod p_j$;
(4) compute $r_j$ so that $a_j^{r_j} = b_j$, where
$a_j$ is a generator of the multiplicative group of $Z^*_{p_j}$;
(5) for $l = 0$ to $\log n - 1$
(6) do $m_j^{(l)} := a_j^{r_j 2^l \bmod (p_j - 1)} \bmod p_j$ (* $m_j^{(l)} = u^{2^l} \bmod p_j$ *)
(7) od;
(8) $V_j := \prod_{l=0}^{\log n - 1} (m_j^{(l)} + 1) \bmod p_j$
(* $V_j = \sum_{i=0}^{n-1} u^i \bmod p_j$ *)
(9) $V_j := V_j M_j \bmod (p_1 \cdots p_m)$
(* Chinese remaindering *)
(10) odpar;
(11) $v := \sum_{j=1}^{m} V_j \bmod (p_1 \cdots p_m)$;
(12) $v$ := truncate $v$ to the first $n + 2$ bits and set
point after the second bit from the left
(13) end
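The modular core of the algorithm (lines 2 to 11) can be mimicked in software. The following is a minimal sketch under stated assumptions, not the circuit itself: the logarithm and antilogarithm tables are replaced by a dictionary and by `pow`, the primes are simply the smallest ones whose product exceeds the sum, and the scaling by $2^n$ and the truncation of line 12 are not modelled. All function names are hypothetical.

```python
def small_primes():
    """Yield 2, 3, 5, ... by trial division (adequate for a toy example)."""
    candidate = 2
    while True:
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            yield candidate
        candidate += 1

def generator(p):
    """Smallest generator of the multiplicative group modulo the prime p."""
    for g in range(1, p):
        if len({pow(g, r, p) for r in range(p - 1)}) == p - 1:
            return g

def power_sum_mod(u, n, p):
    """Lines 3-8: sum of u**i for i < n, modulo p; n must be a power of two."""
    b = u % p                                      # line 3
    if b == 0:
        return 1 % p                               # only the i = 0 term survives
    a = generator(p)
    log = {pow(a, r, p): r for r in range(p - 1)}  # "logarithm" table, line 4
    r = log[b]
    V = 1
    for l in range(n.bit_length() - 1):            # l = 0 .. log n - 1
        m_l = pow(a, (r << l) % (p - 1), p)        # line 6: u**(2**l) mod p
        V = V * (m_l + 1) % p                      # line 8
    return V

def power_sum(u, n):
    """Lines 2-11: recombine the residues by Chinese remaindering."""
    bound = n * u ** (n - 1) + 1                   # exceeds the sum itself
    primes, modulus = [], 1
    for p in small_primes():
        if modulus > bound:
            break
        primes.append(p)
        modulus *= p
    total = 0
    for p in primes:
        P = modulus // p
        M = P * pow(P, -1, p)                      # CRT coefficient M_j
        total += power_sum_mod(u, n, p) * M        # line 9
    return total % modulus                         # line 11

print(power_sum(91, 8) == sum(91 ** i for i in range(8)))
```

The sketch relies on `pow(P, -1, p)` for the modular inverse, which requires Python 3.8 or later.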
Let us next describe the different steps of this algorithm in more detail. In this 
description we will make frequent use of the following two facts.
1) One can multiply two k-bit integers in time $T$ and area $A$ where
$AT^2 = O(k^2)$ and $T \in [\Omega(\log k), O(\sqrt{k})]$. This is the result of [6].
2) One can add $m$ k-bit integers in time $O(\log m + \log k)$ and area
$O(km \log m)$. This can be achieved by expressing the $m$ integers in redundant
representation (see, e.g., [4,5,6]) and then adding them in a tree-like fashion.
The tree has depth $O(\log m)$ and requires area $O(m \log m)$ for every bit
position. Each level of the tree introduces a delay of just $O(1)$ thanks to the
redundant number representation.
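Fact 2 can be illustrated with carry-save addition, one common redundant representation; whether [4,5,6] use exactly this form is not stated here, so the sketch below should be read as a stand-in technique. Each level replaces three numbers by two without propagating carries, and only the final pair is added conventionally. The names `carry_save` and `add_many` are hypothetical.

```python
def carry_save(a, b, c):
    """3-to-2 compressor: returns (s, t) with s + t == a + b + c.

    Every bit position is handled independently, so there is no carry ripple.
    """
    s = a ^ b ^ c
    t = ((a & b) | (a & c) | (b & c)) << 1
    return s, t

def add_many(numbers):
    """Sum nonnegative integers with levels of compressors plus one real addition."""
    nums = list(numbers)
    levels = 0
    while len(nums) > 2:
        reduced = []
        for i in range(0, len(nums) - 2, 3):
            reduced.extend(carry_save(nums[i], nums[i + 1], nums[i + 2]))
        reduced.extend(nums[len(nums) - len(nums) % 3:])   # leftover operands
        nums = reduced
        levels += 1
    return sum(nums), levels

total, levels = add_many(range(1, 101))
print(total, levels)
```

Each level shrinks the operand count by a factor of roughly 2/3, so the number of levels grows logarithmically in the number of operands.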
We are now ready to describe the circuit in more detail. We start with the 
parallel loop, lines 2-10.
Line 3: This line is easily executed in time $O(\log n)$ and area
$O(n (\log n)^2)$ by expressing $u$ by its binary expansion
$u = \sum_{t=0}^{n-1} u_t 2^t$, $u_t \in \{0, 1\}$, storing the numbers
$2^t \bmod p_j$ in a table and performing the required additions in redundant
number representation. We leave the details of this step to the reader.
Line 4: Step 4 is realized by a table look-up, i.e., by a look-up in a table
which gives the value of $r_j$ for each possible value of $b_j$. Since $p_j$ can
certainly be expressed using $2 \log n$ bits, this table has $n^2$ entries of
$2 \log n$ bits each. We realize this table by $2 \log n$ H-trees, each
requiring area $O(n^2)$. Thus the total area is $O(n^2 \log n)$ and a table
look-up takes time $O(\log n)$.
Note that the $2 \log n$ slices of the table are accessed in parallel. Also note
that this circuit is pipelinable (its period is $O(1)$ in technical terms) and
therefore $O(\log n)$ look-ups can also be performed in time $O(\log n)$ using
the same area. This observation is important for step 6.
Lines 5, 6, 7: Consider a fixed $l$ first. We first compute
$$R_j^{(l)} = r_j 2^l \bmod (p_j - 1)$$
as outlined in line 3. Note that the $l$-place shift does not have to be
executed explicitly; it only determines which powers of two need to be looked
up. The computation of $R_j^{(l)}$ takes time $O(\log n)$ and area
$O(n (\log n)^2)$. We perform this computation in parallel for all $l$,
$0 \le l \le \log n - 1$.
The integer $m_j^{(l)}$ is computed from $R_j^{(l)}$ by look-up in a table of
"antilogarithms". The $\log n$ look-ups are pipelined and take time $O(\log n)$
and area $O(n^2 \log n)$ (refer to the description of line 4).
Finally note that
$m_j^{(l)} = a_j^{r_j 2^l \bmod (p_j - 1)} \bmod p_j = b_j^{2^l} \bmod p_j = u^{2^l} \bmod p_j$.
Line 8: We use a tree of multipliers. This tree has depth $O(\log\log n)$ and
has $\log n$ nodes. Each node contains a circuit multiplying two
$2 \log n$-bit numbers and reducing the result mod $p_j$ in time
$O(\log\log n)$ and area $O((\log n)^2)$. This shows that step 8 takes time
$O(\log n)$ and area $O(n)$. Both estimates are very generous.
Finally note that
$$\prod_{l=0}^{\log n - 1} (1 + m_j^{(l)}) = \prod_{l=0}^{\log n - 1} (1 + u^{2^l}) = \sum_{i=0}^{n-1} u^i \pmod{p_j}.$$
Line 9: Let $P_j = (p_1 \cdots p_m)/p_j$ and
$M_j = P_j \cdot (P_j^{-1} \bmod p_j) \bmod (p_1 \cdots p_m)$. Then $M_j$ is the
coefficient of $V_j$ required for Chinese remaindering [7]. The number $M_j$ is
precomputed and stored in a register of length $O(n^2)$. We multiply $V_j$ by
$M_j$ by dividing $M_j$ into $n^2/\log n$ pieces of length $O(\log n)$,
performing $n^2/\log n$ multiplications in parallel and then summing the
results. This can certainly be done in time $O(\log n)$ and area
$O(n^2 \log n)$. Also the reduction mod $(p_1 \cdots p_m)$ can be done in that
area and time.
Summary: Lines (3) to (9) take time $O(\log n)$ and area $O(n^2 \log n)$ for
each $p_j$. Since $u^n$ has $n^2$ bits we have $m = O(n^2/\log n)$ and each
modulus is representable in $2 \log n$ bits. We realize loop (2) to (10) by
having a module for each modulus and hence the loop takes time $O(\log n)$ and
area $O(n^4)$.
Line 11: In line 11 we add $m$ numbers of $n^2$ bits each. This takes time
$O(\log n)$ and area $O(m \log m \cdot n^2) = O(n^4)$.
Lemma 1. There exists a circuit which computes the n-bit inverse of an n-bit
number in time $O(\log n)$ and area $O(n^4)$.
Proof: Immediate from the discussion above. □
The enormous space requirement of the method sketched above is essentially due
to the fact that the powers of $u$ are computed with $\Theta(n^2)$ bits of
precision. However, only the leading $n + \log n$ bits are truly needed for the
computation of $v$. This observation is the key to the "modified" BCH method, to
be described next. In the modified method we compute the powers of an l-bit
integer $u$ in $m$ rounds (this $m$ has nothing to do with the $m$ in algorithm
INVERSE1), where $m$ is a design parameter to be selected. In each round we
compute the sum of $s = l^{1/m}$ consecutive powers using the method of Lemma 1.
We call $s$ the depth of the method. This takes time $O(\log l)$ and area
$O((ls)^2)$ and yields a result of $O(ls)$ bits. The space requirement results
from the fact that only $ls/\log(ls)$ different prime moduli, each of length
$2 \log(ls)$ bits, must be used. We truncate this result to
$l + \lceil \log 12m \rceil$ bits and start the next round. The details are as
follows.
Algorithm INVERSE2(x)
Input: an l-bit number $x \in [1/2, 1)$ and an integer $s = l^{1/m}$.
Output: an $(l+2)$-bit number $v \in (1, 2]$
begin $u_0 := 1 - x$;
for $i = 0$ to $m - 1$ do
begin compute $u_i^0, u_i^1, \ldots, u_i^{s-1}, u_i^s$;
$\sigma_i := \sum_{j=0}^{s-1} u_i^j$;
$u_{i+1}$ := truncate $u_i^s$ to $q = l + \lceil \log 12m \rceil$ bits right of point;
end;
$v$ := truncate $\sigma_0 \sigma_1 \cdots \sigma_{m-1}$ to $l$ bits right of point;
end.
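The round structure of INVERSE2 can be followed in exact rational arithmetic. This is a hedged sketch: the power sums are formed directly instead of by the method of Lemma 1, the helper names are hypothetical, and the final check uses a deliberately loose tolerance of $4 \cdot 2^{-l}$, which also absorbs the tail of the truncated power series.

```python
from fractions import Fraction
from math import ceil, log2

def truncate(value, bits):
    """Drop everything beyond `bits` binary places right of the point."""
    scale = 1 << bits
    return Fraction(int(value * scale), scale)

def inverse2(x, l, m, s):
    """Approximate 1/x for x in [1/2, 1); s**m is assumed to be at least l."""
    q = l + ceil(log2(12 * m))
    u = 1 - x                                  # u_0
    product = Fraction(1)
    for _ in range(m):
        sigma = sum(u ** j for j in range(s))  # sigma_i = sum of s powers of u_i
        product *= sigma
        u = truncate(u ** s, q)                # u_{i+1}
    return truncate(product, l)

l, m, s = 16, 2, 4
x = Fraction(3, 4)
v = inverse2(x, l, m, s)
print(v, float(1 / x - v))
```

With $l = 16$, $m = 2$ and $s = 4$, two rounds of four powers each cover the $s^m = 16$ powers of $u$ that a direct summation would need.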
To prove the correctness of this algorithm we must show that $v$ gives the
$(l + 2)$ leading bits of $1/(1 - u)$ (of which the rightmost $l$ bits represent
the fractional part). To this end, we must show that the error of the
approximation is $< 2^{-l}$.
For any variable $a$ used by the above algorithm let $\bar{a}$ denote the
corresponding exact value (note that, since all numbers are nonnegative, the
truncation mechanism gives $\bar{a} \ge a$), and $\delta(a)$ the absolute error
on $a$, such that $a = \bar{a} - \delta(a)$. Recall also that
$\delta(a \cdot b) \le \delta(a)\bar{b} + \delta(b)\bar{a}$ and that
$\delta(a + b) = \delta(a) + \delta(b)$. Using these relationships, we readily
have
$$\delta(\sigma_0 \cdots \sigma_{m-1}) \le \bar\sigma_0 \cdots \bar\sigma_{m-1} \left( \frac{\delta(\sigma_0)}{\bar\sigma_0} + \cdots + \frac{\delta(\sigma_{m-1})}{\bar\sigma_{m-1}} \right).$$
Since $\bar\sigma_0 \cdots \bar\sigma_{m-1} < 3$ and $\bar\sigma_i \ge 1$
($i = 0, \ldots, m - 1$), we obtain
$$\delta(\sigma_0 \cdots \sigma_{m-1}) \le 3(\delta(\sigma_0) + \cdots + \delta(\sigma_{m-1})).$$
From $\sigma_i = \sum_{j=0}^{s-1} u_i^j$ we have
$$\delta(\sigma_i) = \sum_{j=0}^{s-1} \delta(u_i^j) \le \sum_{j=0}^{s-1} j\,\bar{u}_i^{\,j-1}\,\delta(u_i) \le \delta(u_i)/(1 - \bar{u}_i)^2 \le 4\,\delta(u_i)$$
since $u_i \le 1/2$ for $i = 1, \ldots, m - 1$. (Obviously $\delta(u_0) = 0$.)
Thus $\delta(\sigma_0 \cdots \sigma_{m-1}) \le 12 m \max_i \delta(u_i)$ and the
condition
$$12 m \max_i \delta(u_i) \le 2^{-l}$$
ensures the correctness of the method. We claim that $\delta(u_i) \le 2^{-q}$ as
a result of truncating to $q$ bits right of the point. Indeed
$\delta(u_1) \le 2^{-q}$, trivially. For $i \ge 1$, assuming
$\delta(u_i) \le 2^{-q}$, let $u^*_{i+1} = u_i^s$ (before the truncation). Then
$$\delta(u^*_{i+1}) \le s\,\bar{u}_i^{\,s-1}\,\delta(u_i) \le \frac{s}{2^{s-1}}\,\delta(u_i)$$
since $u_i \le 1/2$. If we assume $s \ge 4$, then
$\delta(u^*_{i+1}) \le 2^{-q-1}$, which shows that its $(q + 1)$ bits to the
right of the point are correct. Thus, the prescribed truncation yields
$\delta(u_{i+1}) \le 2^{-q}$, and the induction step is complete. In conclusion,
we choose
$$q \ge l + \log 12m.$$
(Note that for any choice of $s$, $\lceil \log 12m \rceil \le 4 + \log\log l$ by
the definition of $m$.) Noting that $m \cdot O(\log l) = O(\log^2 l/\log s)$, we
have:
Lemma 2. For any $2 \le s \le l$ there exists a circuit computing the l-bit
inverse of an l-bit number in time $O(\log^2 l/\log s)$ and with area
$O((ls)^2)$.
The $AT^2$-performance of the above circuit is given by
$$AT^2 = O\left( l^2 \log^4 l \cdot \frac{s^2}{\log^2 s} \right) \qquad (1)$$
By choosing the depth $s$ as $s = l^\epsilon$ ($\epsilon > 0$), the resulting
circuit achieves $T = O((1/\epsilon) \log l)$ and
$AT^2 = O(l^{2(1+\epsilon)})$, i.e., it is a moderately $AT^2$-suboptimal
divider still achieving $T = O(\log l)$, for fixed $\epsilon$. We are aware that
this result had been previously obtained by F. T. Leighton [8], presumably by a
similar argument.
3. An Accelerating Technique
We now describe an alternative approach to the computation of the inverse of an
l-bit number, which capitalizes on the presence of leading zeros in the
representation of the number to be inverted. This method is best described for
an l-bit number $x \in [1, 2)$.
The number $x \in [1, 2)$ can be written as
$$x = x_1 + 2^{-l_1} w$$
where $x_1$ is an $l_1$-bit number (the leading $l_1$ bits of $x$) and $w$ is an
$(l - l_1)$-bit number (the trailing $l - l_1$ bits of $x$). Then
$x_1 \in [1, 2)$ and $w \in [0, 2)$. Let $v_1$ be an $l_1$-bit approximation to
$1/x_1$ (i.e., $x_1 v_1 = 1 + \eta$, $\eta < 2^{-l_1}$). Then
$$v_1 x = v_1 x_1 + v_1 w 2^{-l_1} = 1 + \eta + v_1 w 2^{-l_1},$$
that is, $v_1 x$ has at least $l_1 - 1$ consecutive 0's immediately to the right
of the point. Define
$$y = v_1 x.$$
Then, if $v_2$ denotes an approximation of $1/y$, we have
$v_1 v_2 \approx 1/x$. Also, if $v_2 y = 1 + \eta'$ then
$v_1 v_2 x = 1 + \eta'$, i.e., $v_1 v_2$ is an approximation of precision
$\eta'$. The process can be iterated for the computation of the inverse of $y$,
thereby obtaining
$$1/x \approx v_1 v_2 \cdots v_k.$$
This leads to the following algorithm:
Algorithm INVERSE3(x)
Input: an l-bit number $x \in [1, 2)$, and an integer sequence
$l_1 < l_2 < \cdots < l_k = l$.
Output: an l-bit number $v \in (1/2, 1]$, such that $vx = 1 + \epsilon$,
$\epsilon < 2^{-l}$
(1) begin $z_1 := x$;
(2) for $i = 1$ to $k$ do
(3) begin $x_i$ := leftmost $l_i$ bits of $z_i$;
(4) $v_i$ := $(l_i + 1)$-bit inverse of $x_i$;
(5) $z_{i+1} := z_i v_i$
end;
(6) $v := v_1 v_2 \cdots v_k$
end
The correctness of the method follows from the fact that
$\delta(v) = v_1 \cdots v_{k-1} \cdot \delta(v_k) \le 2 \cdot 2^{-l-1} = 2^{-l}$.
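The control flow of INVERSE3 can be traced with exact rationals. The sketch below makes several assumptions that go beyond the text: "leftmost $l_i$ bits" is read as truncation to $l_i$ places right of the point, the inverse in step 4 is obtained by exact division followed by truncation (standing in for INVERSE2), $l_1 \ge 2$, and the accuracy check uses the loose tolerance $2^{-l+1}$. The function names are hypothetical.

```python
from fractions import Fraction

def truncate(value, bits):
    """Drop everything beyond `bits` binary places right of the point."""
    scale = 1 << bits
    return Fraction(int(value * scale), scale)

def inverse3(x, lengths):
    """x in [1, 2); lengths is the increasing sequence l_1 < ... < l_k = l."""
    z = x                                   # step 1: z_1
    v = Fraction(1)
    for l_i in lengths:
        x_i = truncate(z, l_i)              # step 3: leading part of z_i
        v_i = truncate(1 / x_i, l_i + 1)    # step 4: stand-in for INVERSE2
        z = z * v_i                         # step 5: z_{i+1}, close to 1
        v *= v_i                            # step 6, accumulated on the fly
    return v

x = Fraction(3 * 2 ** 31 + 12345, 2 ** 32)  # some value in [1, 2)
v = inverse3(x, [4, 8, 16, 32])
print(float(v * x - 1))
```

After each pass the running product $z$ agrees with 1 in roughly $l_i$ leading places, which is what allows the next inverse to be formed from few powers of $1 - x_{i+1}$.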
Step 4 is the crucial action in the above algorithm. To analyze its performance,
we need the following result.
Lemma 3. If an l-bit number $x \in [1, 2)$ has $l_1 - 1$ zeros immediately to
the right of the point, the l-bit inverse of $x$ can be computed in time
$T = O(\log(l/l_1) \cdot \log l/\log s)$ and area $A = O((ls)^2)$ for any
$2 \le s \le l/l_1$. (Note that this result subsumes Lemma 2 for $l_1 = 1$.)
Proof: Indeed $u = 1 - x$ is a (negative) number with $l_1$ zeros immediately to
the right of the point. This implies that $|u|^{\lceil l/l_1 \rceil} < 2^{-l}$,
so that only the first $\lceil l/l_1 \rceil$ consecutive powers of $u$ need to
be computed. □
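The counting argument in the proof can be checked numerically. The sketch assumes that $x - 1 < 2^{-l_1}$ (the "$l_1$ zeros" reading used in the proof) and sums the powers of $u$ directly; `short_series_inverse` is a hypothetical name.

```python
from fractions import Fraction
from math import ceil

def short_series_inverse(x, l, l1):
    """Sum only ceil(l / l1) powers of u = 1 - x, assuming 0 <= x - 1 < 2**-l1."""
    u = 1 - x
    terms = ceil(l / l1)
    return sum(u ** j for j in range(terms)), terms

l, l1 = 32, 8
x = 1 + Fraction(2 ** 24 - 1, 2 ** 32)      # x - 1 is just below 2**-8
approx, terms = short_series_inverse(x, l, l1)
print(terms, float(abs(1 / x - approx)))
```

The remainder of the geometric series is $|u|^t / x$ with $t = \lceil l/l_1 \rceil$, which stays below $2^{-l}$ because $x \ge 1$.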
The numbers $x_i$, $i = 1, \ldots, k$, used in Step 4 meet the conditions of
Lemma 3, since $1 - z_i v_i$ is a (negative) number with $l_i$ leading zeros.
Step 4 is therefore carried out by applying Algorithm INVERSE2 so that the i-th
iteration is characterized by length $l_i$ and depth $s_i$. An implementation of
this accelerating technique is therefore completely specified by the two
sequences:
$$l_1, l_2, \ldots, l_k$$
and
$$s_1, s_2, \ldots, s_k.$$
Before closing this section we note that Step 5 involves a multiplication of
$(l_i + 1)$-bit numbers at the i-th step; thus this operation is no more complex
than the execution of the homologous Step 4, and will not be further mentioned
in this discussion.
4. The Divider Network
We have all the premises to illustrate in detail the structure of the divider
sketched in Figure 1.
The first $1/\epsilon$ stages are collectively designed to implement the
accelerating technique; each module implements the modified BCH algorithm. For
$i = 1, 2, \ldots, 1/\epsilon$, let $l_i$ be the (output) operand length, $s_i$
the depth, $A_{1,i}$ the area, and $T_{1,i}$ the time of the i-th module. We
seek a solution where all such modules have identical area (i.e.,
$A_{1,i} = A'$ for $i = 1, \ldots, 1/\epsilon$) and identical computation time,
equal to the target time (i.e., $T_{1,i} = O((\log n)^{1+\epsilon})$,
$i = 1, \ldots, 1/\epsilon$). By the requirement of optimality, we have
$$\sqrt{A_{1,i}} = \frac{n}{T_{1,i}} = \frac{n}{(\log n)^{1+\epsilon}}. \qquad (2)$$
We also choose:
$$l_i = \frac{n}{(\log n)^{1+\epsilon} s_i}, \qquad s_i = 2^{(\log n)^{1-\epsilon}/(1 + (\log n)^\epsilon)^{i-1}} \qquad (i = 1, \ldots, 1/\epsilon).$$
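Since the area of the i-th module is $O((l_i s_i)^2)$, condition (2) is
obviously verified. Next note that $\log l_i = O(\log n)$,
$i = 1, \ldots, 1/\epsilon$. We therefore infer from Lemma 2:
$$T_{1,1} = O\left(\log l_1 \cdot \frac{\log l_1}{\log s_1}\right) = O\left((\log n)^2/(\log n)^{1-\epsilon}\right) = O((\log n)^{1+\epsilon})$$
and for $i = 2, \ldots, 1/\epsilon$
$$T_{1,i} = O\left(\log\frac{l_i}{l_{i-1}} \cdot \frac{\log l_i}{\log s_i}\right)$$
$$= O\left(\log\frac{s_{i-1}}{s_i} \cdot \frac{\log l_i}{\log s_i}\right) \quad \text{since } l_i s_i = l_{i-1} s_{i-1}$$
$$= O\left(\log l_i \cdot \frac{\log s_{i-1}}{\log s_i}\right) \quad \text{since } s_i \ge 1$$
$$= O\left(\log n \cdot \frac{(\log n)^{1-\epsilon}\,(1 + (\log n)^\epsilon)^{i-1}}{(1 + (\log n)^\epsilon)^{i-2}\,(\log n)^{1-\epsilon}}\right)$$
$$= O((\log n)^{1+\epsilon})$$
thus verifying the objective for the computation time.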
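A small numeric instance may help in reading the choice of $l_i$ and $s_i$. The sketch works with base-2 logarithms throughout (an assumption, since the text does not fix the base) and returns $\log s_i$ and $\log l_i$ rather than the values themselves; `module_parameters` is a hypothetical helper.

```python
from math import log2

def module_parameters(log_n, eps):
    """Return [(log2 s_i, log2 l_i)] for i = 1 .. 1/eps, where log_n = log2 n."""
    count = round(1 / eps)
    base = 1 + log_n ** eps
    params = []
    for i in range(1, count + 1):
        log_s = log_n ** (1 - eps) / base ** (i - 1)
        log_l = log_n - (1 + eps) * log2(log_n) - log_s
        params.append((log_s, log_l))
    return params

# n = 2**64 and eps = 1/2, so log n = 64 and (log n)**eps = 8
for log_s, log_l in module_parameters(64, 0.5):
    print(round(log_s, 3), round(log_l, 3), round(log_s + log_l, 3))
```

In this instance $\log(l_i s_i)$ is the same for both modules, consecutive depths satisfy $\log s_{i-1}/\log s_i = 1 + (\log n)^\epsilon$, and the last depth is at most 2.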
With these choices, each module of the chain is $AT^2$-optimal, and the global
computation time is $c_1 (1/\epsilon)(\log n)^{1+\epsilon} = O((\log n)^{1+\epsilon})$,
for some constant $c_1$. The value of $l_{1/\epsilon}$, the number of bits of
the result, is bounded from below as follows:
$$l_{1/\epsilon} = \frac{n}{(\log n)^{1+\epsilon}\, 2^{(\log n)^{1-\epsilon}/(1 + (\log n)^\epsilon)^{1/\epsilon - 1}}} \ge \frac{n}{(\log n)^{1+\epsilon} \cdot 2}.$$
This value $l_{1/\epsilon}$ represents the length of the operand supplied to the
cascade of the two Newton approximators, to be described next.
Starting with the downstream approximator, we recall (see Figure 2) that this
module is in turn the cascade of $p$ submodules ($p$ is an integer to be defined
shortly), where the i-th submodule has area and time $A_{3,i}$ and $T_{3,i}$,
respectively, and
$$A_{3,i} = 2 A_{3,i-1}, \qquad T_{3,i} = \sqrt{2}\, T_{3,i-1}, \qquad i = 2, 3, \ldots, p.$$
With this choice (originally proposed in [1]), the global area and time of the
slow approximator are respectively proportional to the area $A_{3,p}$ and time
$T_{3,p}$ of the p-th (last) submodule.
Fig. 2: The module structure of the slow "Newton approximator" (a cascade from the 1st submodule to the p-th submodule).
Since we are aiming for an $AT^2$-optimal network with computation time $O(T)$,
we must have
$$A_{3,p} T_{3,p}^2 = O(n^2)$$
and
$$T_{3,p} = T.$$
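That the totals are proportional to the last submodule follows from two geometric series, which the following sketch evaluates in units of $A_{3,p}$ and $T_{3,p}$. The function name `cascade_totals` is hypothetical, and the sketch only sums the stated recurrences; it says nothing about constant factors in an actual layout.

```python
from math import sqrt

def cascade_totals(p):
    """Total area and time of p submodules, relative to the last submodule."""
    area = sum(2.0 ** -(p - i) for i in range(1, p + 1))        # A_{3,i} / A_{3,p}
    time = sum(sqrt(2.0) ** -(p - i) for i in range(1, p + 1))  # T_{3,i} / T_{3,p}
    return area, time

area, time = cascade_totals(10)
print(round(area, 4), round(time, 4))
```

For every $p$ the area total stays below $2 A_{3,p}$ and the time total below $T_{3,p}/(1 - 1/\sqrt{2})$, i.e. below about $3.42\, T_{3,p}$.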
This condition enables us to specify the parameter $p$. Indeed, the speed of the
submodules increases as we proceed upstream (by decreasing submodule index), and
each submodule must satisfy the condition that its multiplication time is at
least logarithmic in the operand length. Since the operand length is halved in
going from index $i$ to index $i - 1$ (due to the mechanism of the Newton
approximation), and the most stringent condition occurs for $i = 1$, we have
$$\frac{T}{2^{p/2}} \ge \log\frac{n}{2^p},$$
which is certainly satisfied if we select $p$ as follows:
$$p = 2 \log\frac{T}{\log n} = 2\epsilon \log\log n. \qquad (3)$$
Finally we turn our attention to the "fast approximator". This module receives
an approximation of length $l_{1/\epsilon} = n/((\log n)^{1+\epsilon} \cdot 2)$
and delivers an approximation of length $n/(\log n)^{2\epsilon}$. Thus, this
module must execute $(1 - \epsilon) \log\log n + 1$ iteration steps, each of
them within time $O(\log n)$. The module essentially consists of a "fastest"
multiplier of numbers of length $n/(\log n)^{2\epsilon}$, and can be realized
with area $A_2$ such that $A_2 (\log n)^2 = O((n/2^p)^2)$, i.e.,
$A_2 = O((n/(\log n)^{1+2\epsilon})^2)$. Thus, the resulting $AT^2$-measure for
this module is
$$A_2 T_2^2 = O\left( \left(\frac{n}{(\log n)^{1+2\epsilon}}\right)^2 \cdot \left(\log n \cdot (1 - \epsilon) \log\log n\right)^2 \right) = O(n^2)$$
and the optimality condition is clearly satisfied.
Since each of the three major units of our divider (the chain of modified BCH
dividers, the fast Newton approximator and the slow Newton approximator) has
area $O((n/(\log n)^{1+\epsilon})^2)$ and time $O((\log n)^{1+\epsilon})$, we
conclude with the following result:
Theorem 1. For any fixed $1 > \epsilon > 0$, the n-bit inverse of an n-bit
number can be calculated with optimal $AT^2$-performance for any
$T \in [\Omega((\log n)^{1+\epsilon}), O((\log n)^2)]$.
5. Conclusion.
We constructed an $AT^2$-optimal divider with computation time
$(\log n)^{1+\epsilon}$ for any $\epsilon > 0$. The reader may wonder whether
one can choose $\epsilon$ as a decreasing function of $n$ (tending to zero as
$n$ goes to infinity). This is indeed the case if the construction is slightly
modified. In the construction as it is now we use a chain of modified BCH
dividers each with the same area and speed. Thus both area and time grow as
$1/\epsilon$ and hence $AT^2$ grows (at least) as $(1/\epsilon)^3$.
If $\epsilon$ is chosen as a function of $n$, then this simple chain of equally
sized modules does not suffice. Rather one has to use a chain of increasingly
larger (and slower) modules as we did for the Newton iteration. Omitting the
tedious and not particularly illuminating details we have:
Theorem 2. There is an $AT^2$-optimal divider for n-bit integers for any
$T \in [\Omega(\log n \cdot 2^{(\log\log n)^{3/4}}), O((\log n)^2)]$.
Note that $2^{(\log\log n)^{3/4}} = O((\log n)^\epsilon)$ for any $\epsilon > 0$.
References
[1] K. Mehlhorn: "AT2-optimal VLSI Integer Division and Integer Square Rooting",
Integration, 2, 163-167, 1984.
[2] J. Reif: "Logarithmic Depth Circuits for Algebraic Functions", 24th FOCS,
138-145, 1983.
[3] P.W. Beame, S.A. Cook, H.J. Hoover: "Log Depth Circuits for Division and
Related Problems", 25th FOCS, 1-6, 1984.
[4] W.K. Luk, J. Vuillemin: "Recursive Implementation of Optimal Time VLSI Integer
Multipliers", VLSI 83, Trondheim, Norway, 1983.
[5] O. Spaniol: "Arithmetik in Rechenanlagen", Teubner Verlag, Stuttgart, 1976.
[6] K. Mehlhorn, F.P. Preparata: "AT2-optimal VLSI Integer Multiplier with Minimum
Computation Time", Information and Control, 58, 1-3, 137-156, 1983.
[7] D.E. Knuth: "The Art of Computer Programming", Vol. 2: Seminumerical
Algorithms, Addison-Wesley, Reading, Mass., 1981, 2nd ed.
[8] F.T. Leighton, personal communication, May 1985.
