Area-Time Optimal VLSI Networks Based on Cube-Connected-Cycles by Preparata, Franco P. & Vuillemin, Jean E.
SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)
REPORT DOCUMENTATION PAGE (READ INSTRUCTIONS BEFORE COMPLETING FORM)
1. REPORT NUMBER 2. GOVT ACCESSION NO. 3. RECIPIENT'S CATALOG NUMBER
4. TITLE (and Subtitle)
AREA-TIME OPTIMAL VLSI NETWORKS BASED ON THE CUBE-CONNECTED-CYCLES
5. TYPE OF REPORT & PERIOD COVERED
Technical Report
6. PERFORMING ORG. REPORT NUMBER
1-875(ACT-21); UILU-ENG 80-2207
7. AUTHOR(s)
F. P. Preparata
J. E. Vuillemin
8. CONTRACT OR GRANT NUMBER(s)
MCS-78-13642; N00014-79-C-0424
9. PERFORMING ORGANIZATION NAME AND ADDRESS
Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Urbana, Illinois 61801
10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
11. CONTROLLING OFFICE NAME AND ADDRESS
National Science Foundation;
Joint Services Electronics Program Contract
12. REPORT DATE
February, 1980
13. NUMBER OF PAGES
14
14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)
15. SECURITY CLASS. (of this report)
UNCLASSIFIED
15a. DECLASSIFICATION/DOWNGRADING SCHEDULE
16. DISTRIBUTION STATEMENT (of this Report)
Approved for public release; distribution unlimited
17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)
18. SUPPLEMENTARY NOTES
19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
Parallel processing, optimal networks, cube-connected-cycles, cyclic shifts,
Discrete Fourier transform, integer multiplication
20. ABSTRACT (Continue on reverse side if necessary and identify by block number)
We present designs for VLSI circuits computing Cyclic Shifts, Discrete
Fourier Transforms, and Integer Multiplication, all based on a machine
architecture, the Cube-Connected-Cycles (CCC), introduced by the authors in
[10]. All of our designs match, to within a constant factor, the known
theoretical lower bounds [3], [4], [8] for area $\times$ (time)$^2$ products.
AREA-TIME OPTIMAL VLSI NETWORKS BASED ON
CUBE-CONNECTED-CYCLES
by
Franco P. Preparata and Jean Vuillemin
This work was supported in part by the National Science Foundation 
under Grant MCS 78-13642 and Joint Services Electronics Program under 
Contract N00014-79-C-0424.
Reproduction in whole or in part is permitted for any purpose of 
the United States Government.
This report is issued simultaneously by the Coordinated Science 
Laboratory and by the Institut National de Recherche d'Informatique et
d'Automatique, 78150 Rocquencourt, France.
Approved for public release. Distribution unlimited
AREA-TIME OPTIMAL VLSI NETWORKS BASED ON
THE CUBE-CONNECTED-CYCLES
F.P. PREPARATA
Coordinated Science Laboratory
University of Illinois 
Urbana, IL 61801 
U.S.A.
J.E. VUILLEMIN 






We present designs for VLSI circuits computing Cyclic Shifts, Discrete
Fourier Transforms, and Integer Multiplication, all based on a machine architecture,
the Cube-Connected-Cycles (CCC), introduced by the authors in [10]. All of our
designs match, to within a constant factor, the known theoretical lower bounds
[3], [4], [8] for area $\times$ (time)$^2$ products.
This work was partially supported by National Science Foundation Grant
MCS-78-13642, by the Joint Services Electronics Program Contract
N00014-79-C-0424, by I.R.I.A., Institut de Recherche en Informatique et Automatique,
78150 Le Chesnay, France, and by ERA 452 "Al Khowarizmi" of the Centre National
de la Recherche Scientifique.
1. INTRODUCTION
Very-Large-Scale Integration (VLSI) is revolutionizing the methodologies
of digital system design. The traditional criteria of component count, whether
applied to processors or to simpler devices, are no longer adequate to establish
a scale of comparison among various solutions to a given problem. Indeed,
number-of-elements criteria are substantially based on the fact that processing
elements and their interconnections are realized by different media. This
difference disappears in VLSI, which "integrates" both processing elements and
their interconnection in a two-dimensional geometry, the surface of the silicon
chip. Thus, a meaningful figure-of-merit is represented by the area occupied by
the total system, since area captures the complexity of both computation and
data communication. As a result, the solution to a given computational problem
involves the conception of an interconnection architecture, its layout, and the
design of an algorithm for that architecture. It is clear that the traditional
hardware-software antinomy disappears in VLSI technology.
Pioneering and fundamental work in the area has been done by Mead-Conway [1]
and by Thompson [2], both as regards the development of a VLSI model of
computation and in the design of computations (architecture + algorithms) for
specific problems. As is typical of the methodology of concrete computational
complexity, for a given problem and selected complexity measures one seeks both
lower bounds to these measures which hold for any realization in the computation
model and upper bounds by exhibiting explicit realizations which comply with the
model. In spite of its relative novelty, the great interest of the topic is
attested to by the additional contributions of Thompson [3], Abelson and
Andreae [4], Kung and Leiserson [5], Guibas et al. [6], Brent and Kung [7,8],
Savage [9], and Preparata-Vuillemin [10].
The VLSI computation models of Mead-Conway [1] and Thompson [2] are not
significantly different. We briefly recall the latter one for the benefit of the
reader. A VLSI computing system (or network) is viewed as a communication graph,
whose vertices and edges are called nodes and wires, respectively. Nodes store
and process local information; wires transmit information between nodes. Nodes
and wires are laid out on a grid of unit squares, where "unit" relates to the
so-called "feature width", a basic parameter characterizing the resolution of
current fabrication techniques. Wires have unit width and must be partitionable
into no more than $\nu$ sets of non-intersecting segments, where $\nu$ is the
number of conducting layers. In this work, we assume that $\nu = 2$, the almost
universal two-layer standard. It is assumed that a bit of information takes unit
time to propagate from node to node, independently of the wire length (this
implies that longer wires have more powerful drivers, of area proportional to
the wire length); node processing time is absorbed into wire propagation time,
and the total time for a given computation is the number of time units needed
to execute it.
The usual metric selected for complexity is an area-time product $AT^{2\alpha}$,
where A is the chip area, T is the computation time, and $\alpha$ is a real
parameter satisfying $0 \le \alpha \le 1$. This metric allows a flexible
trade-off (based on $\alpha$) between the production cost (area) and the
incremental cost (time) of computation.
For several interesting problems, lower bounds to the area-time product have
been obtained. A crucial notion in obtaining such lower bounds is the minimal
bisection width $\omega$ of a given communication graph $G=(V,E)$, which is
defined as the smallest integer such that
$\omega = |\{(u,v) \in E : u \in V_1, v \in V_2\}|$, where $\{V_1,V_2\}$ is a
partition of V with $|V_1| \le |V_2| \le |V_1| + 1$. Thompson has shown [2] that
for any n-node communication graph with minimal bisection width $\omega$,
$A \ge \omega^2/4$ (in unit squares). Therefore, lower bounds to $AT^{2\alpha}$
are obtained by bounding the computation time T in terms of $\omega$.
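To make the definition concrete, the following Python sketch computes the minimal bisection width of a small graph by exhaustive search and evaluates the corresponding area bound. The function names and the choice of the 3-cube as a test graph are illustrative assumptions, not part of the report, and the search is only practical for very small graphs.

```python
from itertools import combinations

def bisection_width(nodes, edges):
    """Smallest number of edges cut by any partition {V1, V2} with
    |V1| <= |V2| <= |V1| + 1 (exhaustive search, small graphs only)."""
    nodes = list(nodes)
    best = None
    # V1 is taken as the smaller side, so it has floor(n/2) nodes.
    for part in combinations(nodes, len(nodes) // 2):
        side = set(part)
        cut = sum((u in side) != (v in side) for u, v in edges)
        if best is None or cut < best:
            best = cut
    return best

def hypercube_edges(s):
    """Edges of the binary s-cube on vertices 0 .. 2**s - 1 (illustrative graph)."""
    return [(u, u ^ (1 << d))
            for u in range(1 << s) for d in range(s)
            if u < (u ^ (1 << d))]

omega = bisection_width(range(8), hypercube_edges(3))
# Area bound A >= omega^2 / 4, as read from the text above.
area_lower_bound = omega ** 2 / 4
```

For the 3-cube the search finds that halving the graph along any one dimension is the best possible cut.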
In this paper we restrict ourselves to the following problems: cyclic shifts,
integer multiplication, and radix-2 Discrete-Fourier-Transform. As regards cyclic
shifts, it has been shown in [4] and [10] that $T \ge n/(2\omega)$ for any VLSI
design which performs any cyclic shift of an array of n one-bit terms; using a
technique due also to Thompson [2, p. 72], this leads to the lower bound
$AT^{2\alpha} = \Omega(n^{1+\alpha})$. Since the ability to perform an arbitrary
cyclic shift of an n-bit string is reducible to the multiplication of two
(n/2)-bit integers, the lower bound to $AT^{2\alpha}$ for cyclic shift becomes a
lower bound to integer multiplication [4]; however, an independent proof of the
latter, in a slightly more general model, has been supplied by Brent and
Kung [8]. Finally, as regards radix-2 DFT, Thompson [2] has shown that
$AT^{2\alpha} = \Omega(n^{1+\alpha} \log^{2\alpha} n)$ for any communication
graph which computes the DFT of n numbers each represented with $O(\log n)$ bits.
The purpose of this paper is to provide upper bounds to the chosen metric
of complexity for the problems mentioned above. The paper is organized as follows.
Section 2 reviews the structure and the layout of a general computation network,
called the cube-connected-cycles [10], which is remarkably suited to VLSI design.
Sections 3 and 4 discuss optimal designs based on the cube-connected-cycles;
specifically, Section 3 considers a network for cyclic shifts, while Section 4
considers networks for integer multiplication and radix-2 Fast-Fourier-Transform.
2. THE CUBE-CONNECTED-CYCLES
The cube-connected-cycles (CCC) interconnection has been proposed in [10] 
as a general-purpose network of processing modules, suited for the implementation 
of various combinatorial algorithms. The specifications, the operating modes, 
and the performance of the CCC are now briefly reviewed.
An $h \times 2^s$ CCC-interconnection consists of $2^s$ cycles, indexed from 0
to $2^s-1$; each cycle is the circular interconnection of h modules
($h \ge s$), indexed from 0 to h-1. Thus each module is addressed by a pair
$(\ell,p)$, where $\ell$ and p are respectively the cycle and the module
indices, and is denoted $M[\ell,p]$. Each module has three ports: F, B, and
L,(1) and the connection is completely specified by
    $F(\ell,p) \leftrightarrow B(\ell,(p+1) \bmod h)$
    $B(\ell,p) \leftrightarrow F(\ell,(p-1) \bmod h)$
    $L(\ell,p) \leftrightarrow L(\ell+\varepsilon 2^p,p)$ if $p \le s-1$ (unconnected if $p \ge s$)
where $\varepsilon = 1 - 2\,\mathrm{BIT}_p(\ell)$ (here $\mathrm{BIT}_p(\ell)$
is the coefficient of $2^p$ in the binary expansion of $\ell$). In the
hypothesis that modules reduce to nodes (i.e., they can be placed at vertices of
a uniform grid of squares) and that wires are laid out on the grid, a layout of
a $6 \times 2^4$ CCC is shown in figure 1. Notice that if all nodes of every
cycle are ideally collapsed into a single node, the resulting set of nodes is
connected as a binary s-dimensional cube (s-cube). This justifies the CCC
denotation. (In the layout of figure 1, vertical and horizontal wires realize
the cycle and cube connections, respectively.)
(1) F, B, and L are respectively mnemonic for "forward", "backward", and 
"lateral".
Figure 1. A standard layout for an $h \times 2^s$ CCC (h=6, s=4).
The dimensions of this s-cube are numbered 1,2,...,s, and the set of horizontal
wires realizing dimension i are collectively denoted as sheaf i.
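The connection rules can be checked mechanically. The sketch below builds the link set of an $h \times 2^s$ CCC from the F/B/L rules and confirms that collapsing each cycle to a single node leaves an s-cube. The function names are illustrative, and the sketch assumes $h \ge 3$ so that each cycle is a proper ring.

```python
def ccc_links(h, s):
    """Undirected links of an h x 2**s CCC; a module is a pair (cycle, position)."""
    links = set()
    for l in range(1 << s):
        for p in range(h):
            # F(l,p) <-> B(l,(p+1) mod h): the cycle connection.
            links.add(frozenset({(l, p), (l, (p + 1) % h)}))
            if p <= s - 1:
                # L(l,p) <-> L(l + eps*2**p, p), eps = 1 - 2*BIT_p(l):
                # this flips bit p of the cycle index.
                eps = 1 - 2 * ((l >> p) & 1)
                links.add(frozenset({(l, p), (l + eps * (1 << p), p)}))
    return links

def collapse_cycles(links):
    """Edges between distinct cycles once every cycle is merged into one node."""
    collapsed = set()
    for link in links:
        (l1, _), (l2, _) = tuple(link)
        if l1 != l2:
            collapsed.add(frozenset({l1, l2}))
    return collapsed

links = ccc_links(6, 4)  # the 6 x 2**4 instance of figure 1
cube = {frozenset({u, u ^ (1 << d)}) for u in range(16) for d in range(4)}
```

With h=6 and s=4 there are 96 cycle links and 32 lateral links, and the collapsed graph coincides with the 4-cube.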
As a paradigm of computation, we consider the following type of algorithms.
Abstractly, there are $n = 2^r$ data items (operands), assigned addresses from 0
to $2^r-1$ (or equivalently, each operand is addressed by an r-dimensional
binary vector and assumed to be placed at the corresponding vertex of the
r-cube). The algorithm is a sequence of $r = \log_2 n$ steps, each executable in
parallel, with the property that at each of these steps each operand interacts
with another operand, which is adjacent to it on a specified r-cube dimension;
specifically, either the i-th (ASCEND type algorithms) or the (r-i)-th dimension
(DESCEND type algorithms) pertains to step i. (Typical instances of such
algorithms are the Radix-2 Fast-Fourier-Transform and Bitonic merging of sorted
sequences.) We see that these algorithms are supported by an r-cube
interconnection.
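As an illustration of this paradigm, here is a minimal sequential sketch of an ASCEND/DESCEND driver. The name `run_cube_algorithm` and its `pair_op` callback are hypothetical, dimensions are numbered from 0 (bit index) rather than from 1, and the inner loop stands in for what the network does in a single parallel step.

```python
def run_cube_algorithm(data, pair_op, order="ascend"):
    """Apply pair_op to every pair of operands adjacent on one cube dimension,
    sweeping the dimensions upward (ASCEND) or downward (DESCEND)."""
    n = len(data)
    r = n.bit_length() - 1
    assert n == 1 << r, "the number of operands must be a power of two"
    data = list(data)
    dims = range(r) if order == "ascend" else range(r - 1, -1, -1)
    for d in dims:
        bit = 1 << d
        for k in range(n):          # conceptually one parallel step
            if (k & bit) == 0:
                data[k], data[k | bit] = pair_op(data[k], data[k | bit])
    return data

def compare_exchange(a, b):
    return (min(a, b), max(a, b))

# Bitonic merging as a DESCEND computation: the input rises, then falls.
merged = run_cube_algorithm([1, 4, 6, 7, 8, 5, 3, 2], compare_exchange,
                            order="descend")

# An ASCEND computation in which every operand ends up holding the total.
totals = run_cube_algorithm([1, 2, 3, 4], lambda a, b: (a + b, a + b))
```

Only the per-pair operation changes between applications; the communication pattern is fixed by the cube dimensions.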
We now show how algorithms of the type just described can be implemented on
a $2^p \times 2^{r-p}$ CCC. Processing occurs in two consecutive phases. Making
reference for concreteness to the ASCEND mode of operation, the first phase
(referred to conventionally as LOWSHEAVES) pertains to r-cube dimensions
1,2,...,p (which are subsumed by the CCC cycle connection), while the second
phase (denoted HIGHSHEAVES) pertains to r-cube dimensions p+1, p+2,...,r (which
correspond, in order, to CCC sheaves 1,2,...,r-p).
The LOWSHEAVES phase emulates in general the r-cube behavior as follows.
Since operand-interaction can occur in the cycles only between adjacent modules,
it is necessary to successively realize the adjacencies corresponding to p-cube
dimensions 1,2,...,p. The key permutation for this task is the perfect unshuffle
[11], and it is shown in [10] that the required adjacencies are globally realizable
in time proportional to $2^p$, thereby showing that the first phase runs in time
$O(2^p)$.
In the second phase the r-cube behavior is emulated as follows. The parallel
step pertaining to r-cube dimension p+j can no longer be executed in one time
unit; however, using repeated circular shifts within the cycles, each operand can
be successively brought to reside for one time unit in that module in its cycle
which is connected in sheaf j. Although this processing of all operands in a cycle
on sheaf j now requires $O(2^p)$ time units, this computation can be pipelined
(overlapped) with the analogous computations corresponding to all other sheaves,
according to the scheme illustrated in figure 2. The sequence of steps during
which a given sheaf is active is called the active phase of that sheaf (for
example, steps 3-6 for sheaf 3 in figure 2).
Thus, the second phase also runs in time $O(2^p)$, and, when p is chosen equal
to $\lceil \log_2 (r-p) \rceil$, processing time on the CCC is $O(\log n)$. We
see therefore that, by combining the principles of pipelining and parallelism,
the CCC can emulate the cube with no significant loss of performance. In the
sequel, we shall assume that the cycle length $2^p$ satisfies the limitations
corresponding to
    $\lceil \log_2 (r-p) \rceil \le p \le \lfloor r/2 \rfloor$    (1)
Figure 2. Illustration of the pipelining of parallel computations.
A "X" denotes a step at which a given sheaf is active.
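The timing argument can be sketched numerically. The code below assumes, from the example "steps 3-6 for sheaf 3" with a cycle of length 4, that sheaf j is active during steps j through j + 2^p - 1; this staggering by one step per sheaf is an inference from the text, since the figure itself is not reproduced here, and all function names are illustrative.

```python
def active_steps(sheaf, cycle_len):
    """Steps during which a sheaf is active, assuming sheaf j starts at step j."""
    return range(sheaf, sheaf + cycle_len)

def highsheaves_last_step(r, p):
    """Last step of the pipelined second phase on a 2**p x 2**(r-p) CCC."""
    num_sheaves = r - p
    return active_steps(num_sheaves, 1 << p)[-1]

def fast_cycle_exponent(r):
    """Smallest p with 2**p >= r - p, i.e. the lower limit in condition (1)."""
    p = 0
    while (1 << p) < r - p:
        p += 1
    return p

p_fast = fast_cycle_exponent(20)           # n = 2**20 operands
steps = highsheaves_last_step(20, p_fast)  # (r - p) + 2**p - 1
```

Under this assumption the second phase takes (r-p) + 2^p - 1 steps, which is O(2^p) whenever the cycle length is at least the number of sheaves.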
Finally, as concerns the area of the layout, by referring to figure 1 we
readily see that a $2^p \times 2^{r-p}$ CCC can be laid out on a chip of height
$(2^{r-p}+2^p-r)$ and width $(2^{r-p+1}-1)$ (in the chosen units).
3. CYCLIC SHIFT
Let $T[0:2^r-1]$ be an array of $n=2^r$ one-bit operands (for any other operand
length, both the area and the time will be multiplied by a constant).
We describe cyclic shifts to the left by t<n positions of the operands 
of this array ; although dual implementations are possible, for concreteness, 
we describe a cyclic shift scheme which corresponds to an algorithm in the 
ASCEND class.
We now note the following property: Assuming that $T[0:2^{r-1}-1]$ and
$T[2^{r-1}:2^r-1]$ have both been subjected to a left-cyclic-shift by
$t \bmod 2^{r-1}$ positions, the desired final configuration is obtained by the
following alternative exchanges:
    if $(t \ge 2^{r-1})$ then foreach k: $0 \le k \le 2^{r-1} - t \bmod 2^{r-1} - 1$
        pardo $T[k] \leftrightarrow T[2^{r-1}+k]$ odpar
    else foreach k: $2^{r-1} - t \bmod 2^{r-1} \le k < 2^{r-1}$
        pardo $T[k] \leftrightarrow T[2^{r-1}+k]$ odpar
    (2)
The proof of this property is straightforward (and it is basically supplied
by figure 3).
Figure 3. Illustration for the proof of Rule (2)
The adjacent pairs on dimension r (i.e., pairs of the form $(j, j+2^{r-1})$) are
partitioned into two sets, of respective sizes $t \bmod 2^{r-1}$ and
$(2^{r-1} - t \bmod 2^{r-1})$, and the members of the former or of the latter
set are exchanged depending upon whether $t < 2^{r-1}$ or $t \ge 2^{r-1}$.
Furthermore, since the two halves of the array $T[0:2^r-1]$ are treated in
exactly the same way as regards dimensions 1,2,...,r-1, it follows by induction
on decreasing j that the exchanges pertaining to dimension $1 \le j \le r$ are
completely described by:
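    if $(t \bmod 2^j \ge 2^{j-1})$ then foreach k: $k = 2s \cdot 2^{j-1} + v$,
        $0 \le v \le 2^{j-1} - t \bmod 2^{j-1} - 1$, $0 \le s \le 2^{r-j}-1$
        pardo $T[k] \leftrightarrow T[k+2^{j-1}]$ odpar
    else foreach k: $k = 2s \cdot 2^{j-1} + v$,
        $2^{j-1} - t \bmod 2^{j-1} \le v \le 2^{j-1}-1$, $0 \le s \le 2^{r-j}-1$
        pardo $T[k] \leftrightarrow T[k+2^{j-1}]$ odpar
    (3)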
Notice that in the above rule (3) a crucial role is played by the parameter 
v, which defines the range of the pairs to be exchanged.
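Rule (3) can be exercised directly. The following sketch applies the exchanges for dimensions j = 1,...,r in ASCEND order on an ordinary Python list; the function name is illustrative, and the loops over s and v are a sequential stand-in for the parallel `pardo`.

```python
def cyclic_shift_ascend(T, t):
    """Left cyclic shift of T by t positions using the exchanges of rule (3)."""
    n = len(T)
    r = n.bit_length() - 1
    assert n == 1 << r, "the array length must be a power of two"
    T = list(T)
    for j in range(1, r + 1):
        half = 1 << (j - 1)                  # 2**(j-1)
        t_low = t % half                     # t mod 2**(j-1)
        if t % (1 << j) >= half:
            v_range = range(0, half - t_low)     # first branch of rule (3)
        else:
            v_range = range(half - t_low, half)  # second branch of rule (3)
        for s in range(1 << (r - j)):
            for v in v_range:
                k = 2 * s * half + v
                T[k], T[k + half] = T[k + half], T[k]
    return T

shifted = cyclic_shift_ascend(list("abcd"), 3)
```

After the exchanges for dimension j, every aligned block of length 2^j has been shifted left by t mod 2^j, which is the induction invariant behind the rule.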
We now propose to implement the described cyclic shift operation on a CCC-like
network. We select a $2^p \times 2^{r-p}$ CCC-interconnection so that dimension
j of the previous abstract description (for j>p) corresponds to CCC sheaf j-p.
We now observe the following facts:
(i) Referring to the standard layout of figure 1, in sheaf $\ell$, for $\ell$
in the range (1,r-p), adjacent pairs on the same horizontal line are
characterized by the same value of the parameter v, as defined above. This means
that all pairs on a given horizontal line of the layout will behave identically
during the execution of the shift algorithm, whence the behavior of sheaf $\ell$
is completely specified by the behavior of modules $M[i,\ell-1]$ in cycles
$i=0,1,\ldots,2^{\ell-1}-1$. Therefore, as regards sheaf $\ell$, we may restrict
ourselves to the subarray $T[0:2^{p+\ell-1}-1]$; in particular, according to
rule (3), this array is partitioned into
$T[0:2^{p+\ell-1} - t \bmod 2^{p+\ell-1} - 1]$ and
$T[2^{p+\ell-1} - t \bmod 2^{p+\ell-1} : 2^{p+\ell-1}-1]$, and either the first
or the second of them is exchanged on sheaf $\ell$.
(ii) $t \bmod 2^{p+\ell-1} = q_\ell 2^p + t \bmod 2^p$, where
$q_\ell = \lfloor (t \bmod 2^{p+\ell-1})/2^p \rfloor$. Thus modules
$M[i,\ell-1]$ ($0 \le i \le 2^{\ell-1}-1$) are divided into three sets,
$\{M[0,\ell-1],\ldots,M[2^{\ell-1}-q_\ell-2,\ell-1]\}$,
$\{M[2^{\ell-1}-q_\ell-1,\ell-1]\}$, and
$\{M[2^{\ell-1}-q_\ell,\ell-1],\ldots,M[2^{\ell-1}-1,\ell-1]\}$, such that
modules in the first and third sets have fixed behavior during their active
phases (either exchange or no-exchange), while $M[2^{\ell-1}-q_\ell-1,\ell-1]$
changes its behavior after the first $2^p - t \bmod 2^p$ steps of its active
phase.
(iii) For any value of $\ell$ (i.e., for all sheaves) the quantity
$(2^p - t \bmod 2^p)$ is independent of $\ell$. This means that for all modules
with mixed behavior, the change of behavior (from exchange to no-exchange, or
vice versa) occurs after the same number of steps during their active phases.
(iv) The LOWSHEAVES phase is void. Indeed the effect of a left cyclic shift
by $t \bmod 2^p$ positions within each cycle is implicitly achieved by the
timing of the exchanges in the HIGHSHEAVES phase. All that is needed initially
for each cycle is a forward cyclic shift by one position so that $T[i2^p-1]$
resides in M[i,0], for $i=0,1,\ldots,2^{r-p}-1$.
In summary the shift operation can be controlled as follows. Each module of
the CCC is assigned two bits, $b_1$ and $b_2$, which respectively control the
module operation during the first $(2^p - t \bmod 2^p)$ and the last
$t \bmod 2^p$ steps of the module active phase. Bit $b_i$ is set to 1 to denote
"exchange" and to 0 otherwise. The timing of the possible change of behavior
(between step $(2^p - t \bmod 2^p)$ and step $(2^p - t \bmod 2^p + 1)$) is
controlled by the bit sequence $(0)^{2^p - t \bmod 2^p}(1)^{t \bmod 2^p}$, which
circulates in each cycle along with the operands. Thus we conclude that three
control bits per module are sufficient, i.e., the cyclic shift operation has
a finite-state module control.
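The decomposition in facts (i)-(iii) can be cross-checked against rule (3). The sketch below derives the two control bits of a module of sheaf l from t and compares them with the exchange decision that rule (3) makes for each value of v. It assumes, for illustration only, that the operand with v = i*2^p + u meets its sheaf module at step u of the active phase; the function names are hypothetical.

```python
def rule3_exchanges(t, j, v):
    """True if rule (3) exchanges the pair with parameter v on dimension j."""
    half = 1 << (j - 1)
    boundary = half - t % half
    if t % (1 << j) >= half:
        return v < boundary
    return v >= boundary

def control_bits(t, p, l, i):
    """(b1, b2) for module M[i, l-1] of sheaf l: behavior during the first
    2**p - t mod 2**p steps and during the remaining t mod 2**p steps."""
    half = 1 << (p + l - 1)
    q = (t % half) >> p                    # q_l of fact (ii)
    mixed = (1 << (l - 1)) - q - 1         # cycle index of the mixed module
    high = t % (2 * half) >= half          # which branch of rule (3) applies
    if i < mixed:
        return (high, high)
    if i == mixed:
        return (high, not high)
    return (not high, not high)

def module_exchanges(t, p, l, v):
    """Exchange decision for parameter v as seen by the module that handles it."""
    i, u = v >> p, v & ((1 << p) - 1)
    b1, b2 = control_bits(t, p, l, i)
    switch = (1 << p) - t % (1 << p)       # independent of l, as in fact (iii)
    return b1 if u < switch else b2
```

Only the module with index 2^(l-1) - q_l - 1 ever receives two different bits, which is why two behavior bits plus the circulating timing bit suffice.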
Since the layout of figure 1 is used without modification (and the
nodes have constant area), we reach the following conclusions. For
$p = \lfloor r/2 \rfloor$ we obtain a CCC whose computation time is
$O(2^p) = O(\sqrt{n})$, i.e. a "slow" realization, which, however, is optimal
for the $AT^{2\alpha}$ metric ($0 \le \alpha \le 1$): in fact, referring to
the expression for the CCC height and width obtained at the end of Section 2,
we have $A = O(2^r)$ and $T = O(2^{r/2})$, whence
$AT^{2\alpha} = O(n^{1+\alpha})$. If, on the other hand, we seek minimum
computation time $O(\log n)$, the corresponding "fast" CCC is obtained by
choosing $p = \lceil \log_2 (r-p) \rceil \approx \mathrm{llog}_2\, n$.(1) In
this case we obtain $T = O(2^p) = O(\log n)$ and $A = O((n/\log n)^2)$, whence,
setting $\alpha = 1$ in $AT^{2\alpha}$, we obtain $AT^2 = O(n^2)$, i.e. the
network is optimal (notice that this occurs only for $\alpha = 1$).
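These conclusions can be checked numerically from the layout dimensions of Section 2. The sketch below takes T to be the cycle length 2^p, dropping the constant factor, and normalizes the cost by the lower bound n^(1+alpha); the function names and the chosen values of r are illustrative assumptions.

```python
def ccc_layout(r, p):
    """Height and width (unit squares) of the standard 2**p x 2**(r-p) layout."""
    height = 2 ** (r - p) + 2 ** p - r
    width = 2 ** (r - p + 1) - 1
    return height, width

def normalized_cost(r, p, alpha):
    """A * T**(2*alpha) / n**(1+alpha), with T taken as the cycle length 2**p."""
    height, width = ccc_layout(r, p)
    area, time, n = height * width, 2 ** p, 2 ** r
    return area * time ** (2 * alpha) / n ** (1 + alpha)

def fast_cycle_exponent(r):
    """Smallest p with 2**p >= r - p (the "fast" design)."""
    p = 0
    while (1 << p) < r - p:
        p += 1
    return p

# "Slow" design, p = r/2: the ratio stays bounded for an intermediate alpha.
slow = [normalized_cost(r, r // 2, 0.5) for r in range(8, 22, 2)]
# "Fast" design: the ratio stays bounded for alpha = 1.
fast = [normalized_cost(r, fast_cycle_exponent(r), 1) for r in range(8, 22, 2)]
```

Both lists stay within a small constant range as r grows, which is the numerical counterpart of matching the lower bound to within a constant factor.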
4. INTEGER MULTIPLICATION AND DISCRETE FOURIER TRANSFORM
The design of hardware multipliers is not a new problem in computer science.
The classical shift and add method multiplies two n-bit integers in time
$T = O(n)$, within a circuit area $A = O(n)$. Furthermore, such a circuit is
laid out in a rectangle of constant width $O(1)$, corresponding to a few wires,
and height $O(n)$ proportional to the number of bits. Although this multiplier
does not meet the $AT = \Omega(n^{3/2})$ and $AT^2 = \Omega(n^2)$ bounds of
Brent-Kung [8], it proves to be useful in designing optimal VLSI for the DFT and
binary multiplication.
(1) The notation "llog" is used in this paper instead of the more common
"loglog".
4.1. CIRCUITS FOR DFT
In [10], we indicate that the radix-2 FFT algorithm can be implemented
on the CCC in time T proportional to the cycle size h. Each module of
the machine performs one of seven tasks at a given time: it may transmit
operands, in either direction, on one of its three communication lines, or it
may be performing an internal operation. Internal operations, in this context,
are linear combinations of the form $(U,V) \leftarrow (U+aV, U-aV)$ where U and
V are two operands present in the module, and a is an appropriate power of
$\omega = e^{2j\pi/n}$, a primitive n-th root of unity. In fact, the successive
values of a vary with time, taking the form $\omega_0 \cdot \omega_1^{T}$, where
$\omega_0$ and $\omega_1$ are appropriate powers of $\omega$; keeping the value
of $\omega_1$ in a special register allows a to be updated with a single
multiplication. Internal operations can thus be computed in each module by a
multiplication $a \leftarrow a \cdot \omega_1$, another multiplication
$V \leftarrow a \cdot V$, and a final add-subtract step
$(U,V) \leftarrow (U+V, U-V)$. Using a shift and add multiplier, and a
few registers, such a butterfly module can be implemented on a chip of area
$O(\log n)$ proportional to the number of bits used for representing each of the
n inputs. As for the multiplier, this butterfly can be laid out in a rectangle
of width $O(1)$ and height $O(\log n)$. The n butterflies are then placed on the
CCC of figure 1 as indicated in figure 4. It should be pointed out that in
figure 4 the horizontal wires realizing the sheaf connection can be interleaved
with the horizontal wires belonging to the butterfly modules; this
interleaving at most doubles the height of the latter modules. Thus, the width
of this new layout for a $2^p \times 2^{r-p}$ CCC (with $n = 2^r$) is
$O(2^{r-p}) = O(n/2^p)$. We shall now evaluate the height of the CCC.
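The internal operation of a butterfly module can be sketched in a few lines. The code below follows the three-step order given in the text (update a, scale V, add-subtract) using Python complex numbers; the function name and the particular choice of powers of the root of unity are illustrative assumptions, and a real module would work in fixed-point arithmetic on O(log n) bits.

```python
import cmath

def butterfly_step(U, V, a, w1):
    """One internal operation: a <- a*w1, V <- a*V, (U, V) <- (U+V, U-V).
    Returns the new pair together with the updated coefficient a."""
    a = a * w1
    V = a * V
    return U + V, U - V, a

n = 8
omega = cmath.exp(2j * cmath.pi / n)   # primitive n-th root of unity
w0, w1 = omega, omega ** 2             # hypothetical choice of powers

# After k operations the coefficient equals w0 * w1**k, so only the register
# holding w1 is needed to keep it up to date.
a = w0
for _ in range(3):
    _, _, a = butterfly_step(1.0, 1.0, a, w1)

U2, V2, a2 = butterfly_step(2, 1, 1, 1j)
```

Keeping the running coefficient in the module trades a table of n powers for one extra multiplication per step.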
As we see from figure 1, in each cycle there are two sets of modules; 
the set of the sheaf-modules, whose lateral port is used, and the possibly
empty set of non-sheaf modules, which we shall consider first. Let H be the 
module height.
Each row of $2^{r-p}$ non-sheaf modules (there are $2^p-(r-p)$ such rows) can
be laid out in an obvious way, using height H; the chip height used to
accommodate these rows is thus $(2^p-r+p)H$. As regards sheaf modules, although
more compact placements are possible, we just assume the standard placement
shown in figure 4, where sheaf i uses height $2(H + 2^i)$.(2) Thus the (r-p)
sheaves contribute height $2^{r-p+2}+2H(r-p)$, and the total chip height is
$(2^{r-p+2} + H(2^p+r-p)) = O(n/2^p + 2^p \log n)$, since $r = \log n$ and
$H = O(\log n)$. It follows that (provided $2^p$, the cycle length, is
upper-bounded by $\sqrt{n/\log n}$; we already know that $2^p \ge \log n$) the
total CCC area is $A = O(n^2/2^{2p})$.
Figure 4. Placement of the butterflies (vertical rectangles) on the 
CCC.
Processing time is devoted to butterfly operations and operand transmission,
each of which requires time $O(\log n)$. There are $2^p + \log(n/2^p)$ steps
in the computation, thus total time is $T = O(2^p \log n)$. Since we have just
observed that $2^p$ is bounded as $\log n \le 2^p \le \sqrt{n/\log n}$, for any
choice of T within the bounds $\log^2 n \le T \le \sqrt{n \log n}$ we have just
designed networks of area $A = O(n^2 \log^2 n/T^2)$, thus achieving the
lower bound of Thompson [2].
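The step from the cycle length to the area-time product can be written out as a short derivation. The constants $c_1$ and $c_2$ below are placeholders introduced here for the hidden constants in the two $O(\cdot)$ expressions; they do not appear in the report.

```latex
\begin{align*}
T &= c_1\, 2^p \log n
  &&\Longrightarrow\quad 2^p = \frac{T}{c_1 \log n},\\
A &= c_2\, \frac{n^2}{2^{2p}}
  = c_1^2 c_2\, \frac{n^2 \log^2 n}{T^2},\\
A\,T^2 &= c_1^2 c_2\, n^2 \log^2 n = O\!\left(n^2 \log^2 n\right).
\end{align*}
```

The two ends of the range follow the same way: $2^p = \log n$ gives $T = c_1 \log^2 n$, and $2^p = \sqrt{n/\log n}$ gives $T = c_1 \sqrt{n \log n}$. The product agrees with the DFT lower bound quoted in the Introduction for $\alpha = 1$.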
Of independent interest is the product AT which, as observed by Thompson
[2], is proportional to the amount of energy spent in the computation. In
this regard, a lower bound $AT = \Omega(n^{3/2} \log n)$ is obtained in [2] for
the computation of the DFT; from the lower bound $AT = \Omega(n^{3/2})$ obtained
by Brent-Kung [8] for binary multiplication, it is straightforward, however, to
conclude that any DFT circuit satisfies the bound
$AT = \Omega((n \log n)^{3/2})$. This last bound is met by a "slow" CCC design,
with the choice of $2^p = O((n/\log n)^{1/2})$ for cycle size, yielding a
circuit for the DFT with values $A = O(n \log n)$ and
$T = O(\sqrt{n \log n})$. The fastest circuit, among those described, is
obtained for a cycle size $2^p = O(\log n)$; it uses an area
$A = O(n^2/(\log n)^2)$ for computing the DFT in time $O((\log n)^2)$.
(2) The factor 2 is due to the interleaving of module wires and sheaf connections.
4.2. CIRCUITS FOR INTEGER MULTIPLICATION
It is well-known [13] that the Discrete Fourier Transform can be used to
compute convolution products. From the preceding section, we know that we can
construct circuits computing the convolution of two sequences of q integers,
each integer being represented on $\log_2 q$ bits, with the following
characteristics:
    $A \cdot T^2 = O(q^2 \log^2 q)$ for any T such that $(\log q)^2 \le T \le \sqrt{q \log q}$.
In [13], Schönhage and Strassen show that, if $\omega = e^{2j\pi/q}$ and its
powers are represented with $5 \log_2 q$ bits, and the arithmetic is carried out
with this precision, the approximation error in computing convolutions via FFT
remains confined to the fractional parts of the terms involved.
In order to compute the product of two n-bit integers, we divide each
operand into $q = n/\log n$ blocks of length $\log n$ bits each, and we compute
the convolution of the two sequences of q integers. The exact product is then
found by "releasing the carries", in a straightforward manner.
Setting $q = n/\log n$ in the expression above for convolution shows that we
have designed circuits for binary integer multiplication having the
characteristics:
    $A \cdot T^2 = O(n^2)$ for any T such that $(\log n)^2 \le T \le \sqrt{n}$.
These circuits meet the $AT^2 = \Omega(n^2)$ bound of Brent and Kung [8]; in
this class, the slow circuit corresponding to $T = O(\sqrt{n})$ also meets the
$AT = \Omega(n^{3/2})$ lower bound [8].
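The block decomposition and the final carry release can be illustrated in software. In the sketch below the convolution is computed directly with exact integers, which is a stand-in for the FFT-based convolution of the report rather than a model of the circuit; the function name and the `block_bits` parameter are illustrative.

```python
def multiply_by_block_convolution(x, y, block_bits):
    """Multiply non-negative integers by splitting them into blocks,
    convolving the block sequences, and then releasing the carries."""
    mask = (1 << block_bits) - 1

    def blocks(value):
        out = []
        while value:
            out.append(value & mask)   # least significant block first
            value >>= block_bits
        return out or [0]

    a, b = blocks(x), blocks(y)

    # Convolution of the two block sequences (direct method, not an FFT).
    conv = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            conv[i + j] += ai * bj

    # Releasing the carries: reduce every term to block_bits bits and
    # propagate the excess to the next position.
    result, carry = 0, 0
    for i, term in enumerate(conv):
        term += carry
        result |= (term & mask) << (i * block_bits)
        carry = term >> block_bits
    result |= carry << (len(conv) * block_bits)
    return result

product = multiply_by_block_convolution(12345678901234567890,
                                        98765432109876543210, 8)
```

Each convolution term can exceed the block size, which is why the carries have to be released afterwards instead of reading the product off the terms directly.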
Although the circuits in Section 4 appear to be too complex to be feasible
on one chip in the present state of the technology, the sheer existence of, say,
an $A = O(n)$, $T = O(\sqrt{n})$ multiplier raises interesting and very
practical prospects.
5. REFERENCES
[1] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley,
Reading, Mass., 1979.
[2] C.D. Thompson, "A complexity theory for VLSI," Ph.D. Thesis, Department
of Computer Science, Carnegie-Mellon University, Pittsburgh, Penn.,
September 1979.
[3] C.D. Thompson, "Area-time complexity for VLSI," Proc. of the 11th Annual
ACM Symposium on the Theory of Computing (SIGACT), pp. 81-88, May 1979.
[4] H. Abelson and P. Andreae, "Information transfer and area-time trade-offs
for VLSI multiplication," to appear in the Communications of the ACM (1980).
[5] H.T. Kung and C.E. Leiserson, "Algorithms for VLSI processor arrays,"
Symposium on Sparse Matrix Computations, Knoxville, Tenn., Nov. 1978.
[6] L.J. Guibas, H.T. Kung and C.D. Thompson, "Direct VLSI implementation of
combinatorial algorithms," Proc. Conference on VLSI Architecture, Design,
Fabrication, Calif. Inst. of Techn., January 1979.
[7] R.P. Brent and H.T. Kung, "A regular layout for parallel adders," Research
Report, Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, Penn., June 1979.
[8] R.P. Brent and H.T. Kung, "The area-time complexity of binary multiplication,"
Research Report, Department of Computer Science, Carnegie-Mellon University,
Pittsburgh, Penn., July 1979.
[9] J.E. Savage, "Area-time trade-offs for matrix multiplication and transitive
closure in the VLSI model," Proc. of the 17th Annual Allerton Conference
on Communications, Control, and Computing, October 1979.
[10] F.P. Preparata and J. Vuillemin, "The cube-connected-cycles: a versatile
network for parallel computation," Proceedings of 20th Annual IEEE
Symposium on Foundations of Computer Science, Puerto Rico, October 1979.
[11] H.S. Stone, "Parallel processing with the perfect shuffle," IEEE Transactions
on Computers, Vol. C-20, pp. 153-161, 1971.
[12] C.S. Wallace, "A Suggestion for a Fast Multiplier," IEEE Transactions
on Computers, Vol. 12, pp. 14-17, February 1965.
[13] A. Schönhage and V. Strassen, "Schnelle Multiplikation grosser Zahlen,"
Computing, Vol. 7, pp. 281-292, 1971.
