We introduce a network of processing elements, the cube-connected-cycles (CCC), complying with the present technological constraints of VLSI design. By combining the principles of parallelism and pipelining, the CCC can emulate the cube-connected machine with no significant degradation of performance but with a much more compact structure. We describe in detail how to program the CCC for efficiently solving a large class of problems, which includes Fast-Fourier-Transform, sorting, permutations, and derived algorithms. The CCC can also be used as a general purpose parallel processor.
INTRODUCTION
The great technological progress embodied by VLSI has made it possible to conceive large systems of processing elements cooperating in the execution of parallel algorithms. This has motivated considerable research interest in parallel computation. Unfortunately, here the situation is very different from that of serial computation, where the RAM machine [1] represents a universally accepted model. The difficulty of choosing a specific interconnection is frequently bypassed by assuming a model (shared-memory-machine) where each pair of processors is connected (or an equivalent system) [2-5]. Although not without merit, because it aims at uncovering the inherent data-dependence of given problems, this approach ignores the technological constraints of VLSI, particularly as regards the communication among the processing elements [6]. At the opposite end, other workers [7] [8] [9] [10] [11] suggest that processor interconnection should be limited to planar links between topologically neighboring cells (arrays or meshes). Such designs are certainly well suited for current VLSI technology, and they have cleverly been used in implementing matrix or graph problems [9] [10] [11] [12], for example. This type of connection is, however, not suited for efficiently implementing various fundamental problems, such as sorting and convolution. Indeed, good algorithms for solving these problems intrinsically require data movement between processors which are topologically far apart; for example, sorting on an n-processor array such as ILLIAC IV requires time $\Omega(\sqrt{n})$ [8]. The purpose of the paper is to propose and analyze a new interconnection of processors, called the cube-connected-cycles, which is remarkably suited for implementing efficient algorithms such as Fast-Fourier-Transform (FFT), sorting, etc. The geometric structure underlying the interconnections is that of the k-dimensional cube.
This structure, which has already been studied in relation to parallel computation [12], is not readily usable for VLSI design, since each of the $2^k$ processors is connected to k other processors.
By combining parallelism and pipelining we are able to achieve the following results:
(1) The number of connections per processor is reduced to 3;
(2) Processing time is not significantly increased with respect to that achievable on the k-cube structure;
(3) Programs for the individual modules are obtained in a systematic way from a standard description of the global algorithms;
(4) The overall structure complies with the basic requirements of VLSI realization: modularity, ease of layout, simplicity of communication among the processing elements, simplicity in timing and control of the entire system [14]. We also propose a wire layout of the CCC, which can be physically realized with two orthogonal layers of wires. This layout is optimal for several problems, according to a recently proposed VLSI model [18].
(5) Finally, we are able, without resorting to any drastic departure from classical algol-like languages, to provide fully accurate and hopefully easily understandable descriptions of our parallel programs. This is a favorable sign that VLSI design may possibly be endowed with suitable high-level programming languages.
DESCRIPTION OF THE SCHEME
Our parallel computing system, the cube-connected-cycles (CCC), is a network of identical processors. Each of these processors, called a module, contains an operand register T, a few memory locations, and possesses basic arithmetic and logical capabilities. It is controlled by a stored program. For the time being, we make the hypothesis of unlimited parallelism, that is, the number of modules is tailored to the problem size; under this hypothesis, the one or two memories mentioned earlier suffice. Subsequently, under the hypothesis of limited parallelism, we shall endow each module with a small private RAM. In either case, each module is somewhat simpler than a current microprocessor but not basically different from it.
A module has 3 interconnection ports. Each interconnection line linking two modules can be used for the bidirectional transmission of one operand, and it is irrelevant whether operand transmission is serial or parallel. For correctly executing the algorithms described in the following sections, it makes no difference whether the entire system is synchronized through a central clock, which defines time units for all modules, or whether synchronization problems are settled at the level of each communication line, thus achieving an asynchronous system. In order to describe the interconnections, we assume for simplicity that n, the number of modules, is a power of two, i.e., $n = 2^k$, and, moreover, assume that k is of the form $k = r + 2^r$; the modifications resulting when k is arbitrary are straightforward (in the latter case, r is the smallest integer for which $r + 2^r \ge k$).
Each module has a k-bit address m, which in turn is expressed as a pair $(\ell, p)$ of integers represented with $(k-r)$ and $r$ bits respectively, such that $\ell \cdot 2^r + p = m$.
As mentioned earlier, each module has three ports: F, B, and E (mnemonic for forward, backward, external), whose connection is entirely determined by the module address $(\ell, p)$, that is:
$E(\ell, p)$ is connected to $E(\ell + \epsilon 2^p, p)$, where $\epsilon = 1 - 2\,\mathrm{BIT}_p(\ell)$. Here, $\mathrm{BIT}_p(\ell)$ is the coefficient of $2^p$ in the binary expansion of $\ell$.
It is interesting to examine the just described CCC within the framework of the "VLSI model of computation" recently proposed [14, 18]. In this model, each wire has unit width on the silicon chip and transmits a unit of information in a unit of time; information is taken from or delivered to special areas on the chip, called nexuses, each associated with a module. Within this model, which takes realistic account of the placement of modules and interconnection, C.D. Thompson has studied the implementation of the Fast-Fourier-Transform [18] and has elucidated significant relationships between input size n, chip area A, processing time T, and the so-called minimal bisection width w.(1) Thompson has shown that $A \ge w^2/4$ in general, and that, for the n-point FFT, $T \ge n/(2w)$, thus achieving the bound $AT^2 \ge n^2/16$. This lower bound to the time applies to a wider class of problems, as shown by the following proposition which we state without proof:
Proposition: In the VLSI model (Thompson [18]), $T \ge n/(2w)$ to merge two sorted sequences of length $n/2$, or to realize the data rearrangement specified by some permutation drawn from a transitive group of permutations.(2)
With respect to the CCC, we are able to show that operations such as FFT, merging, cyclic shifts, shuffles, etc., are all realizable in time $T = O(\log n)$. We now demonstrate that $A = O((n/\log n)^2)$, thus achieving the lower bound $AT^2 \ge n^2/16$ to within a constant factor; this means that the CCC is optimal in the VLSI model for FFT, merging of sorted sequences, and realization of permutations drawn from a transitive group.
To realize that $A = O((n/\log n)^2)$, consider a layout which uses two sheaves of evenly spaced wires, horizontal and vertical, used respectively for cube and cycle connections. Figure 2 pictorially provides base, inductive hypothesis, and extension, to prove that a $j 2^j$-module CCC can be placed on a $2^j \times (2 \cdot 2^j - 1)$ chip; thus letting $n = j 2^j$ we obtain $j \approx \log_2(n/\log_2 n)$, whence the chip size is about $(n/\log_2 n) \times (2n/\log_2 n - 1) = O((n/\log n)^2)$. Slightly more complicated constructions yield somewhat more efficient module placements.
[Figure: CCC system with $4 \cdot 2^4$ modules.]
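The addressing scheme described in this section can be made concrete with a small sketch. The Python below is illustrative only: the function names `smallest_r` and `ccc_links` are hypothetical, the E rule is the one stated in the text, and the F/B wiring (F to the next position in the cycle, B to the previous one) is an assumption, since the surviving text does not spell it out.

```python
def smallest_r(k):
    """Smallest r with r + 2**r >= k (the rule quoted for arbitrary k)."""
    r = 0
    while r + 2 ** r < k:
        r += 1
    return r


def ccc_links(r):
    """Port wiring of a CCC with cycles of 2**r modules and k = r + 2**r.

    Returns a dict mapping (port, cycle, position) to the port it is wired to.
    """
    cycle_len = 2 ** r              # modules per cycle
    n_cycles = 2 ** cycle_len       # 2**(k - r) cycles, because k - r = 2**r
    links = {}
    for l in range(n_cycles):
        for p in range(cycle_len):
            bit = (l >> p) & 1      # coefficient of 2**p in the cycle index
            eps = 1 - 2 * bit       # +1 when the bit is 0, -1 when it is 1
            links[("E", l, p)] = ("E", l + eps * 2 ** p, p)
            # Assumed cycle wiring, not taken from the text:
            links[("F", l, p)] = ("B", l, (p + 1) % cycle_len)
            links[("B", l, p)] = ("F", l, (p - 1) % cycle_len)
    return links


links = ccc_links(2)                # r = 2, k = 6
n_modules = len(links) // 3         # three ports per module
```

With r = 2 this gives 16 cycles of 4 modules each, and every link is symmetric: following a port and then following the port it reaches leads back to the start.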
FUNDAMENTAL MODES OF OPERATION

Classes ASCEND and DESCEND
Assume momentarily that the $2^k$ modules of the parallel system are interconnected as a k-cube, where a module m is connected to all modules m' such that the binary expansions of the integers m and m' differ in exactly one position. We shall now demonstrate how the choice of the appropriate cycle length and the use of pipelining on the previously described scheme can emulate a k-cube interconnection without a significant degradation of performance.
As mentioned earlier, the k-cube links can be subdivided into sheaves $0, 1, \ldots, k-1$, according to their dimension. We introduce two dual classes of algorithms: DESCEND uses the sheaves in descending order $k-1, k-2, \ldots, 0$, and ASCEND uses them in ascending order $0, 1, \ldots, k-1$. Each algorithm in our classes is completely specified by some basic operation of the type OPER(M,J;U,V), where
M := address of a module, $0 \le M < 2^k$
J := order of a sheaf of links, $0 \le J < k$
U,V := operands
and $(U,V) \leftarrow \mathrm{function}_{M,J}(U,V)$ is the operation performed.(3)
It must be emphasized that M and J are parameters which specify the nature of the operation performed on the operands U and V. We expect our algol-like description of parallel programs to be self-evident. Sequential iteration is controlled by a loop: for I ← A step S until I = C, while parallel operations are controlled by foreach I : p(I) pardo OP odpar. As a general rule, we use ";" to denote sequencing of operations, while instructions separated by "," are to be executed in parallel.
Algorithms in the DESCEND class are then specified as:
(3) According to a frequent convention, we tend to use capital letters, say J, to denote memory locations, and the corresponding lower case letters to denote the content j of that memory location.
To obtain ASCEND, we obviously change the control loop to:
for J ← 0 step 1 until J = K-1. In both cases, the number of parallel steps is clearly k. In the following discussion we shall refer to programs in the DESCEND class, although the treatment of ASCEND is analogous.
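As an illustration of the two classes, the following Python sketch simulates one sheaf per parallel step on an idealized k-cube. It is not the program of the text: the name `run_class`, the four-argument form of `oper`, and the choice to apply the operation to the pair (m, m + 2^j) for every m whose bit j is 0 are assumptions made for the example.

```python
def run_class(T, k, oper, descending=True):
    """Apply OPER along every sheaf of a k-cube, one sheaf per parallel step.

    T    -- list of 2**k operands; T[m] is held by module m
    oper -- oper(m, j, u, v) -> (u', v'), applied to the pair (m, m + 2**j)
            for every m whose bit j is 0 (assumed pairing)
    """
    T = list(T)
    sheaves = range(k - 1, -1, -1) if descending else range(k)
    for j in sheaves:                    # sequential loop over the sheaves
        for m in range(2 ** k):          # conceptually a pardo: pairs are disjoint
            if (m >> j) & 1 == 0:
                T[m], T[m + 2 ** j] = oper(m, j, T[m], T[m + 2 ** j])
    return T


def add_both(m, j, u, v):
    """Example OPER: both partners keep the sum of the pair."""
    return u + v, u + v


totals = run_class([1, 2, 3, 4, 5, 6, 7, 8], 3, add_both, descending=False)
```

With `add_both`, every module holds the total of all eight operands after k = 3 steps, in either class.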
Implementation on the CCC
In order to implement DESCEND on the CCC, we prune the k-cube so as to use only existing connections. The first stage consists in removing the sheaves corresponding to dimensions $0, 1, \ldots, r-1$, and using instead the cycle connections F and B, as described in section 2. Our original DESCEND program is thus transformed into:
Here procedure DLOOPOPER(L) processes the data within cycle (loop) $\ell$ to compute the desired result in $O(2^r)$ parallel steps, as we show later. Note that the running time is still $O(k-r) + O(2^r) = O(\log n)$.
The second transformation consists in removing, for all $j = 0, \ldots, k-r-1$, the k-cube links pertaining to sheaf $(r+j)$, except those existing between modules whose addresses are of the form $(\cdot, j)$: the resulting interconnection is then exactly the one of the CCC, as described in section 2.
The computation corresponding to the for loop of the above algorithm can no longer be performed in one parallel step. Using repeated circular shifts within cycles, however, each operand in the cycle can be successively brought to reside for one time unit in module $(\cdot, j)$, where OPER(·,J;·,·) can then be executed. Although the execution of OPER(·,J;·,·) for all operands in a cycle now re[...]
The inner operation of the for loop is executed in two time units: one for OPER, then one for BSHIFT. The total running time is thus $4 \cdot 2^r$ plus the time for executing DLOOPOPER. If we can ensure that DLOOPOPER can be processed in time linear in the cycle size, the entire procedure is thus executed on the CCC in time $O(\log n)$.
Computation Within the Cycles
The next question to be answered is the implementation of DLOOPOPER(L) so that it runs in time linear in the cycle length. Obviously, we are constrained to using only the F and B cycle links existing in the CCC. Our objective is to emulate, on the cycle of length $2^r$, the operation OPER as it would be executed on hypothetical r-cube sheaves. Since OPER may take place in the cycle only between adjacent modules, particular care must be exercised to ensure that the desired adjacencies, corresponding to all sheaves, be globally realized in time linear in the cycle length. The key permutations for this task (with reference to DLOOPOPER) are based on the so-called perfect shuffle [16, 17].
We can now elucidate the general format of DLOOPOPER, which consists of a sequence of shuffle-operation pairs, each emulating a sheaf operation. This is preceded by BRP, so that at completion the results are in the correct order (see Figure 3). In the description below the parameter A gives the original address of the operand which would be brought to module ( [...]
[Figure 3: operand positions in a cycle of eight operands through successive OPER and SHUFFLE steps.]
Procedure DLOOPOPER runs in r parallel steps, plus the times taken by BRP and the SHUFFLE sequence, which are $O(2^r)$, thus proving an earlier claim.
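The idea of emulating a sheaf operation by a shuffle followed by an operation on adjacent positions can be checked with a short simulation. The Python sketch below is not the DLOOPOPER of the text: it uses a full perfect shuffle as a primitive (rather than realizing it through the F and B links), and under that assumption r shuffles restore the original order, so no BRP step appears. The function names and the three-argument `oper` are hypothetical.

```python
def shuffle(data):
    """Perfect shuffle: interleave the first half with the second half."""
    half = len(data) // 2
    out = []
    for a, b in zip(data[:half], data[half:]):
        out += [a, b]
    return out


def descend_direct(data, r, oper):
    """Reference: DESCEND on an r-cube with direct sheaf links."""
    data = list(data)
    for j in range(r - 1, -1, -1):
        for m in range(2 ** r):
            if (m >> j) & 1 == 0:
                data[m], data[m + 2 ** j] = oper(j, data[m], data[m + 2 ** j])
    return data


def descend_by_shuffles(data, r, oper):
    """Same computation using only shuffles and adjacent-pair operations."""
    data = list(data)
    for j in range(r - 1, -1, -1):
        data = shuffle(data)                 # brings sheaf-j partners side by side
        for i in range(0, 2 ** r, 2):
            data[i], data[i + 1] = oper(j, data[i], data[i + 1])
    return data                              # r shuffles restore the original order


def sample_oper(j, u, v):
    """An order-sensitive test operation that depends on the sheaf index."""
    return u + (j + 1) * v, u - (j + 1) * v


keys = [3, 1, 4, 1, 5, 9, 2, 6]
same = descend_by_shuffles(keys, 3, sample_oper) == descend_direct(keys, 3, sample_oper)
```

The shuffle rotates each address one bit to the left, so after the first shuffle the operands whose addresses differ in the top bit sit in adjacent positions, after the second those differing in the next bit, and so on down to bit 0.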
From the preceding global description of DESCEND, it is rather straightforward to produce the sequential program of module $(\ell, p)$. The program DMODULE(L,P) for a given DESCEND algorithm is of the form: HIGHSHEAVES(L,P);LOWSHEAVES(L,P), which respectively implement the (k-r)-cube operation and DLOOPOPER. The entire DMODULE(L,P) is of a very simple nature: it basically counts up time and, at each time unit numbered T, it tests a simple logical condition involving L, P, and T; depending on this test, either it does nothing, or it exchanges operands, or it exchanges operands and performs an operation on them. The details, however, are omitted.
APPLICATIONS
In this section, we present applications of the CCC to the solution of interesting problems. Some applications, such as bitonic merge and cyclic shift, are directly within the ASCEND or DESCEND classes (simple algorithms); for these applications, all we have to do is specify OPER(M,J;U,V), since the individual programs for each module can be automatically derived from it. Other applications (such as permutation, shuffle, unshuffle, bit-reversal (BRP), odd-even-merge, Fast-Fourier-Transform, convolution, matrix transposition) have programs which are cascades of a fixed (small) number of applications in the preceding class (cascaded algorithms); the corresponding algorithms on the CCC all run in $O(\log n)$ parallel steps.
We also have applications, such as bitonic sort, odd-even-sort, and calculation of symmetric functions, for which the combining step of the two results of a recursive call is itself an algorithm in one of the two preceding categories. These algorithms, which we call composite, run in $O((\log n)^2)$ parallel steps on the CCC. We also mention that, in some instances, improvements can be found with respect to the general implementation scheme, especially in the LOOPOPER part of the program.
Finally, the CCC can also be used to solve a wide class of unrelated important problems, such as matrix multiplication.
Bitonic Merge
The elegant algorithm for bitonic merge, due to K. E. Batcher [15], is ideally suited for implementation within the DESCEND class. All that is needed is to specify OPER(A,I;U,V) as a comparison-exchange. Precisely, in order to handle sequences which are sorted either in increasing or in decreasing order, we define ORIENTCOMPEXCHANGE(A,I;U,V) as
Of course, the running time of DLOOPOPER(L) can be improved with respect to the general format given earlier by using a straightforward "transposition sort."
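To illustrate how a comparison-exchange placed in the DESCEND class merges a bitonic sequence, here is a minimal Python sketch on an idealized k-cube. The definition of ORIENTCOMPEXCHANGE is not reproduced here; `comp_exchange` with an explicit `ascending` flag is a hypothetical stand-in, and `bitonic_merge` is likewise an assumed name.

```python
def comp_exchange(u, v, ascending=True):
    """Return the pair ordered in the requested direction."""
    lo, hi = min(u, v), max(u, v)
    return (lo, hi) if ascending else (hi, lo)


def bitonic_merge(keys, ascending=True):
    """Sort a bitonic sequence of length 2**k in k DESCEND steps."""
    keys = list(keys)
    k = len(keys).bit_length() - 1
    for j in range(k - 1, -1, -1):           # sheaves k-1, ..., 0
        for m in range(len(keys)):           # conceptually a pardo
            if (m >> j) & 1 == 0:
                keys[m], keys[m + 2 ** j] = comp_exchange(
                    keys[m], keys[m + 2 ** j], ascending)
    return keys


# A bitonic input: increasing, then decreasing.
merged = bitonic_merge([1, 4, 6, 7, 8, 5, 3, 2])
```

The same loop with `ascending=False` yields the sequence in decreasing order, which is the role an orientation parameter plays when merges of both directions are needed.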
Radix-2 Fast-Fourier-Transforms
The important FFT algorithm can be set in the ASCEND class. Let $\omega$ be a primitive root of unity of order $n = 2^k$. If $\langle A_0, \ldots, A_{n-1} \rangle$ is the Fourier [...] $\langle a_1, a_3, \ldots, a_{2^k-1} \rangle$; we call the $\omega^j$'s the combining root powers.
The above relationships indicate that the sequence $\langle a_0, \ldots, a_{n-1} \rangle$ must be initially rearranged by means of the bit-reversal permutation. Once the desired reconfiguration has been achieved, we may proceed with the actual FFT computation, which is in the ASCEND class.
Its basic operation OPER(M,J;U,V) is specified by:
It can also be shown that a can be computed efficiently at each step; precisely, the time used by each module to compute the required combining root powers for the entire algorithm is $O(r^2) = O((\log\log n)^2) = o(\log n)$.
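The structure described here (bit-reversal permutation followed by an ASCEND pass of butterfly operations) corresponds to the standard radix-2 decimation-in-time FFT. The Python sketch below shows that standard algorithm on an idealized k-cube; it is not the OPER of the text. The function names, the sign convention for the root of unity, and the way the combining root power is indexed are assumptions of the example.

```python
import cmath


def bit_reverse(m, k):
    """Reverse the k-bit binary expansion of m."""
    out = 0
    for _ in range(k):
        out = (out << 1) | (m & 1)
        m >>= 1
    return out


def fft_ascend(a):
    """Radix-2 FFT of a sequence of length 2**k: BRP, then k ASCEND steps."""
    n = len(a)
    k = n.bit_length() - 1
    w = cmath.exp(-2j * cmath.pi / n)        # primitive n-th root (sign is a convention)
    T = [a[bit_reverse(m, k)] for m in range(n)]
    for j in range(k):                       # sheaves 0, ..., k-1
        for m in range(n):                   # conceptually a pardo
            if (m >> j) & 1 == 0:
                e = (m % (2 ** j)) * (n >> (j + 1))   # exponent of the root power
                x = (w ** e) * T[m + 2 ** j]
                T[m], T[m + 2 ** j] = T[m] + x, T[m] - x
    return T


def naive_dft(a):
    """Direct O(n**2) transform with the same convention, for checking."""
    n = len(a)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(a[p] * w ** (p * q) for p in range(n)) for q in range(n)]


sample = [1, 2, 3, 4, 0, -1, 2.5, 7]
max_error = max(abs(x - y) for x, y in zip(fft_ascend(sample), naive_dft(sample)))
```

Each module pair uses one root power per step, so the only per-step work beyond the butterfly itself is obtaining that power, which is the cost the paragraph above bounds.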
Data Rearrangements
Being able to efficiently permute the data on the CCC is obviously important for many applications. For example, the BRP rearrangement is a necessary preliminary step to the FFT algorithm of the preceding section.
In general, the CCC can emulate a Benes permutation network [21]. Specifically, assuming that the settings of the network switches are known for a given arbitrary permutation of n items, the CCC can realize that permutation by a cascaded algorithm, i.e., in time $O(\log n)$.
To substantiate this claim, we make reference to the simpler (and somewhat redundant) version of the Benes network with $2^k$ inputs, which consists of [...]
The operation of the rightmost $k-1$ stages is emulated by an analogous program in the DESCEND class. Notice that, once the switch settings have been precomputed, each module must be provided with additional information consisting of O(k) bits in order to realize the specified data rearrangement.
Special types of permutations, such as cyclic shifts, shuffle, unshuffle, BRP, matrix transposition, of course admit of straightforward and somewhat faster realizations on the CCC, and are such that the exchange information need not be externally precomputed and stored in the individual modules but may be locally calculated as functions of the module address M and of the sheaf number I. The details, however, are omitted.
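For the special permutations named above, the destination of each item is a simple function of its address bits, which is why nothing needs to be precomputed externally. The Python sketch below shows only these destination maps for shuffle, unshuffle, and BRP; how the exchanges are scheduled on the sheaves is omitted in the text and is not attempted here. All function names are hypothetical.

```python
def shuffle_address(m, k):
    """Perfect shuffle: rotate the k address bits one place to the left."""
    top = (m >> (k - 1)) & 1
    return ((m << 1) & (2 ** k - 1)) | top


def unshuffle_address(m, k):
    """Inverse shuffle: rotate the k address bits one place to the right."""
    low = m & 1
    return (m >> 1) | (low << (k - 1))


def brp_address(m, k):
    """Bit-reversal permutation: mirror the k address bits."""
    out = 0
    for i in range(k):
        if (m >> i) & 1:
            out |= 1 << (k - 1 - i)
    return out


def permute(data, dest):
    """Move the item at address m to address dest(m)."""
    out = [None] * len(data)
    for m, x in enumerate(data):
        out[dest(m)] = x
    return out


k = 3
data = list("abcdefgh")
shuffled = permute(data, lambda m: shuffle_address(m, k))
reversed_bits = permute(data, lambda m: brp_address(m, k))
```

Unshuffle undoes shuffle, BRP is its own inverse, and k successive shuffles return every item to its starting address.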
Odd-Even Merge (OEMERGE)
It is well-known that Batcher's odd-even-merge [15, 16] is realizable as the cascade: BRP; OECOMBINE, where OECOMBINE merges adjacent sorted strings of two keys, then of four, and so on, in a standard hierarchical fashion. The ensuing algorithm OECOMBINE(0,K) is obviously in the ASCEND class with OPER(M,J;U,V) $\equiv$ COMPEXCHANGE(U,V). Its efficiency can be improved if the standard ALOOPOPER is replaced by an ad-hoc sorting procedure which is naturally suited to linearly connected arrays such as the CCC cycles: transposition sort. In order to correctly implement the HIGHSHEAVES portion of OECOMBINE, the correct positioning (off-set) of the sorted sequence in each cycle must also be realized. Specifically, the [...]
Sorting Algorithms
The previously described merging routines may obviously be used to implement sorting algorithms.
We first consider a sorting procedure based on bitonic merging. We call ORIENTLOOPBITONIC(L) the DLOOPOPER corresponding to ORIENTBITONICMERGE(0,K). We immediately obtain:
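Sorting by repeated bitonic merging can be sketched as follows on an idealized k-cube. This is the standard iterative bitonic sort, given here only as an illustration: it is not the CCC program of the text, the name `bitonic_sort` is hypothetical, and taking the merge orientation from one address bit is an assumption of the example. The step counter makes the quadratic dependence on log n visible.

```python
def bitonic_sort(keys):
    """Sort 2**k keys with k stages of oriented bitonic merges.

    Returns the sorted list and the number of parallel cube steps used.
    """
    keys = list(keys)
    n = len(keys)
    k = n.bit_length() - 1
    steps = 0
    for stage in range(1, k + 1):            # merge runs of length 2**stage
        for j in range(stage - 1, -1, -1):   # DESCEND over sheaves stage-1, ..., 0
            for m in range(n):               # conceptually a pardo
                if (m >> j) & 1 == 0:
                    # Orientation of this merge, read from bit `stage` of the address.
                    up = (m >> stage) & 1 == 0
                    u, v = keys[m], keys[m + 2 ** j]
                    if (u > v) == up:
                        keys[m], keys[m + 2 ** j] = v, u
            steps += 1
    return keys, steps


sorted_keys, steps = bitonic_sort([5, 2, 7, 1, 8, 3, 6, 4])
```

Stage s uses s sheaf steps, so the total is k(k+1)/2 steps on the cube, in line with the $O((\log n)^2)$ bound stated earlier for composite algorithms.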
Matrix Multiplication
To compute the matrix product $C = A \times B$ of two $n \times n$ matrices, we must obviously first store the output format (say, row major). Although the details of this algorithm are a bit tedious to describe, it should be clear that matrix multiplication can be computed on the CCC in time $O(\log n)$.
LIMITED PARALLELISM
So far, we have assumed that the size n of the CCC was tailored to the application. To cope with the realistic situation where the number N of inputs is larger than the size n of the CCC, we suggest letting each module of the CCC be a full-fledged microprocessor endowed with a private RAM memory.
Assuming for simplicity that $N = tn$, with $t = 2^q$ an integer, we require that the RAM memory of [...] $\log n$ for algorithms in the classes ASCEND or DESCEND.
