An interconnection pattern of processing elements, the cube-connected cycles (CCC), is introduced which can be used as a general purpose parallel processor. Because its design complies with present technological constraints, the CCC can also be used in the layout of many specialized large scale integrated circuits (VLSI). By combining the principles of parallelism and pipelining, the CCC can emulate the cube-connected machine and the shuffle-exchange network with no significant degradation of performance but with a more compact structure. We describe in detail how to program the CCC for efficiently solving a large class of problems that include Fast Fourier transform, sorting, permutations, and derived algorithms.
I. Introduction
The great technological progress embodied in very large scale integration (VLSI) of electronic circuits has made it possible to conceive of large systems of processing elements cooperating in the execution of parallel algorithms. This has motivated considerable research interest in parallel computation. Unfortunately, the situation is very different from that of serial computation, where the RAM machine [1] represents a universally accepted model. The difficulty of choosing a specific interconnection is frequently bypassed by assuming a model (shared-memory machine) where each pair of processors is connected (or an equivalent system) [7, 8, 18, 24]. Because it aims at uncovering the inherent data dependence of given problems, this approach, although not without merit, ignores the technological constraints of VLSI, particularly in regard to communication among the processing elements [5]. At the opposite end, other workers [2, 6, 11, 14, 23] suggest that processor interconnection should be limited to planar links between topologically neighboring cells (arrays or meshes). Such designs are certainly well-suited for current VLSI technology, and have been used cleverly in implementing algorithms for matrices or graph problems [6, 11, 14, 12]. However, this type of interconnection is not suited for efficiently implementing algorithms for various fundamental problems such as sorting and convolution. Indeed, good algorithms for solving these problems may intrinsically require data movement between processors that are topologically far apart; for example, sorting on an n-processor array such as ILLIAC IV requires time $\Omega(\sqrt{n})$ [23]. For a wide class of problems, such as Discrete Fourier Transform, sorting, etc., there are algorithms whose data exchange pattern corresponds to the links of the binary multidimensional cube.
The cube, which has already been studied in relation to parallel computation [15], is not readily usable for VLSI design, since each of the $2^k$ processors in the system is connected to k other processors. A feasible substitute for the cube-connected network is the so-called shuffle-exchange network [20]. This paper proposes and analyzes a new interconnection, the cube-connected cycles, which is also a feasible substitute for the cube-connected network; it has all the desirable features of the shuffle-exchange network and a VLSI layout which is not only more compact but more regular. It is demonstrated that for a wide class of problems the CCC is optimal with respect to the area × (time)² measure of complexity in the VLSI model [21, 22].
The operation of the CCC is based on the combination of pipelining and parallelism, which leads to the following results:
(1) The number of connections per processor is reduced to three;
(2) Processing time is not significantly increased with respect to that achievable on the cube-connected network;
(3) Programs for the individual modules are obtained in a systematic way from a standard description of the global algorithms;
(4) The overall structure complies with the basic requirements of VLSI technology: modularity, ease of layout, simplicity of communication among the processing elements, simplicity in timing and control of the entire system [13]. (Indeed, as mentioned above, the proposed CCC layout is optimal for several problems.)
(5) Finally, without resorting to any drastic departure from classical ALGOL-like languages, fully accurate and, hopefully, easily understandable descriptions of our parallel programs can be provided. This is a favorable sign that parallel processing may be endowed with suitable high level programming languages.

This paper is organized as follows. Section II introduces a class of algorithms comprising many important applications such as merging, sorting, Fourier Transform, data rearrangement, etc. Section III presents models of module connections, including the CCC, allowing for efficient parallel execution of the algorithms in Sec. II. Section IV describes the implementation of such algorithms on the CCC, and Sec. V is devoted to optimality considerations regarding a layout of the machine for VLSI realizations.
II. A Class of Highly Parallel Algorithms
The paradigm of the algorithms to be considered, and to which the proposed parallel computing system is particularly attuned, is the iterative rendition of a divide-and-conquer scheme. Specifically, (1) The input and output of the algorithms are each a vector of n data items; (2) "Divide" refers to two subproblems of equal size and there is a one-to-one correspondence between the data items of the subproblem results; (3) The "marry step," which combines the results of two subproblems, consists of executing a single operation on corresponding pairs of data items. As simple as this scheme may seem, some of the fundamental algorithms--merging, Fast Fourier Transform, permutations, sorting, convolution, matrix operations, etc.--are either instances of the scheme or simple combinations of such instances as shown below.
To be specific, the paradigm informally described above can be reformulated as follows. Assume that input data $t_0, t_1, \ldots, t_{n-1}$ are stored, respectively, in storage locations T[0], T[1], ..., T[n − 1], and that $n = 2^k$, i.e., the number of inputs is a power of 2. An algorithm is in the DESCEND class if it performs a sequence of basic operations on pairs of data that are successively $2^{k-1}, 2^{k-2}, \ldots, 2^0 = 1$ locations apart. (In terms of the above divide-and-conquer model, the marry step involves pairs $2^0$ locations apart.) Each basic operation OPER(m, j; U, V) modifies the two data items present in storage locations U and V; the computation performed affects only the contents of U, V and may depend upon parameters m and j, which are integers $0 \le m < n$, $0 \le j < k$.
Algorithms in the DESCEND class are then specified by a control loop of the form for j ← k − 1 step −1 until j = 0, with the operations OPER on all pairs $2^j$ locations apart executed in parallel at step j. We also introduce the dual class ASCEND, where the control of the algorithm is changed to for j ← 0 step 1 until j = k − 1, i.e., OPER is performed on data that are successively $1 = 2^0, 2^1, \ldots, 2^{k-1}$ locations apart. (Again, in terms of the divide-and-conquer scheme, the marry step involves pairs $2^{k-1}$ locations apart.)
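The two control schemes can be sketched in a few lines of Python. This is only an illustrative serial rendition, not the paper's own listing: the function names `descend` and `ascend`, and the convention that `oper(m, j, u, v)` returns the new pair of values, are assumptions made here. The inner loop over m stands for a single parallel step.

```python
def _run(T, oper, dims):
    """Apply oper to all pairs 2**j apart, for each j in dims (serial sketch)."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    T = list(T)
    for j in dims:
        for m in range(n):                  # conceptually one parallel step
            if (m >> j) & 1 == 0:           # bit_j(m) = 0
                partner = m + (1 << j)
                T[m], T[partner] = oper(m, j, T[m], T[partner])
    return T


def descend(T, oper):
    """Pairs 2**(k-1), ..., 2, 1 locations apart, in that order."""
    k = len(T).bit_length() - 1
    return _run(T, oper, range(k - 1, -1, -1))


def ascend(T, oper):
    """Pairs 1, 2, ..., 2**(k-1) locations apart, in that order."""
    k = len(T).bit_length() - 1
    return _run(T, oper, range(k))


def sum_oper(m, j, u, v):
    """Hypothetical OPER: both locations receive the sum of the pair."""
    return u + v, u + v


totals = descend([1, 2, 3, 4], sum_oper)
```

With `sum_oper`, every location ends up holding the total of all inputs after log n steps, in either class.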
To clarify the duality between ASCEND and DESCEND, consider the binary representation of $m = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^i$ and define $\bar{m} = \sum_{0 \le i < k} \mathrm{bit}_i(m) \cdot 2^{k-i-1}$, the integer whose binary representation is the reversal of that of m. Once k is fixed, the function $m \to \bar{m}$ is an involutory permutation of $0, 1, \ldots, 2^k - 1$ known as the bit reversal permutation (BRP). For example, if k = 3, then the BRP of (0 1 2 3 4 5 6 7) is (0 4 2 6 1 5 3 7). By first applying the BRP to its inputs, an ASCEND algorithm can be transformed into a dual DESCEND algorithm whose basic operation OPER' is related to the original OPER by OPER'(m, j; U, V) = OPER($\bar{m}$, k − 1 − j; U, V).
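As a check on this duality, the following sketch implements the BRP and runs an ASCEND algorithm as a DESCEND algorithm on bit-reversed data. The helper names (`bit_reverse`, `brp`, `run`, `ascend_via_descend`) and the sample `oper` are hypothetical; the final BRP, which restores the natural output order, is an assumption of this sketch rather than something the text prescribes.

```python
def bit_reverse(m, k):
    """Integer whose k-bit binary representation is the reversal of m's."""
    out = 0
    for _ in range(k):
        out = (out << 1) | (m & 1)
        m >>= 1
    return out


def brp(T):
    """Bit reversal permutation of a list whose length is a power of two."""
    k = len(T).bit_length() - 1
    return [T[bit_reverse(m, k)] for m in range(len(T))]


def run(T, oper, dims):
    """Apply oper to all pairs 2**j apart for each j in dims."""
    T = list(T)
    for j in dims:
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                p = m + (1 << j)
                T[m], T[p] = oper(m, j, T[m], T[p])
    return T


def ascend_via_descend(T, oper):
    """ASCEND emulated by BRP, a dual DESCEND, and a final BRP."""
    k = len(T).bit_length() - 1

    def dual(m, j, u, v):                   # OPER'(m, j; U, V)
        return oper(bit_reverse(m, k), k - 1 - j, u, v)

    return brp(run(brp(T), dual, range(k - 1, -1, -1)))


def oper(m, j, u, v):
    """Arbitrary order-sensitive operation, used only for checking."""
    return u + v, (u - v) * (j + 2) + m


data = [3, 1, 4, 1, 5, 9, 2, 6]
direct = run(data, oper, range(3))          # plain ASCEND for k = 3
dual_result = ascend_via_descend(data, oper)
```

Both routes give the same output vector, and applying `brp` twice returns the original order, as expected of an involution.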
Algorithms for solving specific interesting problems are now considered. For some applications, such as bitonic merge and cyclic shift, the corresponding algorithms are directly within the ASCEND or DESCEND classes (simple algorithms), in which case we merely need to specify OPER(m, j; U, V). These algorithms run in O(log n) steps.
Other applications (such as permutation, shuffle, unshuffle, bit-reversal (BRP), odd-even-merge, Fast Fourier Transform, convolution, matrix transposition) have programs consisting of a short sequence of algorithms in the preceding class and run in O(log n) parallel steps (cascaded algorithms).
Other applications such as bitonic sort, odd-even-sort, and calculations of symmetric functions, are such that the combining step of the two results of a recursive call is itself an algorithm in one of the two preceding categories. The discussion of sample algorithms in each of the three classes is not essential to the understanding of the rest of this paper, and may distract some readers. However, to substantiate our previous claim of compliance with the given paradigm, specific algorithms are described in the Appendix.
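To make the notion of a simple algorithm concrete, here is a sketch of bitonic merge written as a DESCEND instance. It uses the textbook compare-exchange formulation, which is not necessarily the exact OPER given in the Appendix; the names `descend`, `compare_exchange`, and `bitonic_merge` are illustrative.

```python
def descend(T, oper):
    """Serial sketch of the DESCEND control scheme on n = 2**k items."""
    n = len(T)
    k = n.bit_length() - 1
    assert n == 1 << k, "n must be a power of two"
    T = list(T)
    for j in range(k - 1, -1, -1):
        for m in range(n):                  # conceptually one parallel step
            if (m >> j) & 1 == 0:
                p = m + (1 << j)
                T[m], T[p] = oper(m, j, T[m], T[p])
    return T


def compare_exchange(m, j, u, v):
    """OPER for bitonic merge: smaller item to U, larger item to V."""
    return min(u, v), max(u, v)


def bitonic_merge(T):
    """Sort a bitonic sequence (ascending then descending) in log n steps."""
    return descend(T, compare_exchange)


merged = bitonic_merge([1, 4, 6, 7, 5, 3, 2, 0])
```

Here OPER ignores m and j; other members of the class, such as cyclic shift, would use them.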
III. Description of the CCC Interconnection
In order to efficiently implement algorithms in the DESCEND or ASCEND classes, the most natural interconnection of modules is the k-dimensional binary cube (k-cube) where each of the $2^k$ processors is numbered from 0 to $2^k - 1$ and is connected to each of the k processors whose binary numbering differs in exactly one binary position (Figure 2). Although an ASCEND or DESCEND algorithm can be implemented on such a machine in $\log_2 n$ parallel steps, this proposal is not feasible mainly because the number $k = \log_2 n$ of connections for each processor is too large. The unfolded k-cube and the shuffle-exchange interconnections have been proposed [20] (Figure 3) as attempts to remedy this difficulty. Both structures emulate the performance of the k-cube with respect to ASCEND-DESCEND algorithms (i.e., their computation times are comparable), and have a bounded number (4) of connections per processor. However, the topologies of these structures make their physical layout inferior (see Sec. V) to the scheme now described, which also emulates the k-cube (Sec. IV).
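The k-cube adjacency just described amounts to flipping one address bit, as the following small sketch shows (the function name `cube_neighbors` is an assumption made for illustration).

```python
def cube_neighbors(m, k):
    """Addresses of the k processors adjacent to processor m in the k-cube."""
    return [m ^ (1 << i) for i in range(k)]


# Step j of an ASCEND or DESCEND algorithm uses only the dimension-j link:
# processor m exchanges data with processor m ^ (1 << j).
neighbors_of_5 = cube_neighbors(5, 3)
```

The list has length k for every processor, which is exactly the fan-out that grows with log n and motivates the bounded-degree alternatives.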
Our parallel computing system, the cube-connected cycles (CCC), is a network of identical processors called modules. A module has three interconnection ports. Each interconnection line linking two modules can be used for the bidirectional transmission of one operand, and it is irrelevant whether operand transmission is serial or parallel. To correctly execute the algorithms described in the preceding section, it is possible either to synchronize the entire system through a central clock that defines time units for all modules, or to let synchronization problems be settled at the level of each communication line, thus achieving a globally self-timed system. In order to describe the interconnections, we assume for simplicity that n, the number of modules, is a power of two, i.e., $n = 2^k$ and, moreover, that k is of the form $k = r + 2^r$; the modifications resulting when k is arbitrary are straightforward (in the latter case, r is the smallest integer for which $r + 2^r \ge k$). Each module has a k-bit address, written in Sec. IV as a pair (l, p) that identifies a cycle l and a position p within that cycle. If $(x_0, \ldots, x_{k-r-1})$ are the dimensions of the (k − r)-cube, then all edges along dimension $x_i$, called collectively sheaf i, link modules whose addresses are of the form (·, i). The total number of interconnection links is at most $3 \cdot 2^{k-1} = (3/2)n$. To provide intuitive appreciation for the structure and the denomination, a CCC with k = 5 (whence r = 2 and n = 32) is illustrated in Figure 4(b).
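The link count can be checked by constructing the network explicitly. The sketch below assumes the usual CCC wiring, which this copy of the text does not spell out in full: module (l, p) is joined to its two cycle neighbors (l, p ± 1 mod $2^r$) and, laterally, to (l with bit p flipped, p). The names `ccc_links` and `module_degrees` are hypothetical, and the sketch is meant for r ≥ 2 (for r = 1 the two cycle links of a length-2 cycle coincide in this set representation).

```python
from collections import Counter


def ccc_links(r):
    """Set of undirected links of a CCC with k = r + 2**r (assumed wiring)."""
    cycle_len = 1 << r                 # modules per cycle
    num_cycles = 1 << cycle_len        # one cycle per vertex of the (k-r)-cube
    links = set()
    for l in range(num_cycles):
        for p in range(cycle_len):
            # cycle link to the next module of the same cycle
            links.add(frozenset({(l, p), (l, (p + 1) % cycle_len)}))
            # lateral link along cube dimension p (sheaf p)
            links.add(frozenset({(l, p), (l ^ (1 << p), p)}))
    return links


def module_degrees(links):
    """Number of links incident to each module."""
    deg = Counter()
    for link in links:
        for module in link:
            deg[module] += 1
    return deg


links = ccc_links(2)                   # k = 6, n = 64 modules
degrees = module_degrees(links)
```

For r = 2 this gives 96 links on 64 modules, i.e. (3/2)n, with every module using exactly three ports.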
Each module contains an operand register T, a few memory locations, and basic arithmetic and logical capabilities. It is controlled by a stored program or a circuit implementation of such a program. For the time being, we make the hypothesis of unlimited parallelism, that is, the number of modules is tailored to the problem size; under this hypothesis, the one or two memory locations mentioned earlier suffice. Subsequently (Sec. IV.C) under the hypothesis of limited parallelism, each module is endowed with a small private memory. In either case, each module is somewhat simpler than a current microprocessor but not fundamentally different from it.
IV. Emulation of the k-Cube on the CCC
We refer, for concreteness, to the DESCEND algorithmic scheme. To show that the CCC can successfully emulate the k-cube in executing a DESCEND algorithm, the k-cube is transformed into the CCC in two steps; for each step the corresponding transformation of the program is shown and the resulting computation time discussed.
The first step of the transformation consists of removing the sheaves corresponding to dimensions 0, 1, ..., r − 1 from the k-cube and introducing instead the cycle connections F and B as described in Sec. III. Note that each module is still connected to k − r + 2 other modules. Our original DESCEND program is thus transformed so that the steps for the removed dimensions are replaced by a procedure LOOPOPER applied to each cycle.
Procedure LOOPOPER(l) processes the data within cycle l in order to compute the desired result in $O(2^r)$ parallel steps, as shown later. Note that the running time is still $O(k - r) + O(2^r) = O(\log n)$.
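The effect of this first transformation can be sketched as a two-phase program. The sketch assumes that the global address of module (l, p) is $m = l \cdot 2^r + p$, so that dimensions below r stay inside a cycle; the function names are illustrative, and the low-dimension phase is written here as a plain loop rather than with the F and B links.

```python
def descend(T, oper):
    """Reference DESCEND over all k dimensions."""
    T = list(T)
    k = len(T).bit_length() - 1
    for j in range(k - 1, -1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                q = m + (1 << j)
                T[m], T[q] = oper(m, j, T[m], T[q])
    return T


def descend_two_phase(T, r, oper):
    """High sheaves on the cube links, then one LOOPOPER-like pass per cycle."""
    T = list(T)
    n = len(T)
    k = n.bit_length() - 1
    # Phase 1: dimensions k-1, ..., r use the remaining (lateral) cube links.
    for j in range(k - 1, r - 1, -1):
        for m in range(n):
            if (m >> j) & 1 == 0:
                q = m + (1 << j)
                T[m], T[q] = oper(m, j, T[m], T[q])
    # Phase 2: dimensions r-1, ..., 0 involve only data of a single cycle.
    cycle_len = 1 << r
    for l in range(n >> r):                 # all cycles, conceptually in parallel
        base = l * cycle_len
        for j in range(r - 1, -1, -1):
            for p in range(cycle_len):
                if (p >> j) & 1 == 0:
                    m = base + p
                    q = m + (1 << j)
                    T[m], T[q] = oper(m, j, T[m], T[q])
    return T


def oper(m, j, u, v):
    """Arbitrary order-sensitive operation, used only for checking."""
    return u + v, (u - v) * (j + 2) + m


data = [3, 1, 4, 1, 5, 9, 2, 6]             # k = 3, r = 1
```

Because the low dimensions never cross cycle boundaries, the per-cycle passes are independent of one another.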
The second step of the transformation consists in removing, for all i = 0, ..., k − r − 1, the k-cube links pertaining to sheaf (r + i) except those existing between modules with addresses of the form (·, i): the resulting interconnection is then exactly the CCC introduced in Sec. III.
For any fixed j the computation corresponding to the for loop of the above algorithm can no longer be executed in one parallel step, since in each cycle only one module, with address (·, j − r), is connected to a module $2^j$ positions apart. However, by means of repeated circular shifts within the cycle, each operand in the cycle can be successively brought to reside for one time unit in module (·, j − r), where OPER(·, j; ·, ·) can then be executed. Although, according to this mode of operation, the execution of OPER(·, j; ·, ·) for all operands in a cycle now requires $2^r$ time units (the length of the cycle), this computation can be pipelined (overlapped) with the analogous operations OPER(·, i; ·, ·) for $r \le i < k$. The whole sequence of events is best illustrated with the aid of the timing diagram of Figure 5; for a CCC of 64 modules (r = 2), the four modules in the generic cycle are considered, and the diagram shows the activity of each of them over successive time units.
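One possible pipelined schedule is simulated below. It is a sketch under stated assumptions, not a transcription of Figure 5: operands shift by one module per time unit, the operand whose home position is p sits in module (p − t) mod $2^r$ at time t, it starts its sequence when it first reaches module $2^r − 1$, and the address of module (l, p) is taken to be $m = l \cdot 2^r + p$ with $k − r = 2^r$. The array `T` is indexed by home address; all names are hypothetical.

```python
def high_sheaves_direct(T, r, oper):
    """Dimensions k-1, ..., r executed one full sheaf at a time."""
    T = list(T)
    k = len(T).bit_length() - 1
    for j in range(k - 1, r - 1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                q = m + (1 << j)
                T[m], T[q] = oper(m, j, T[m], T[q])
    return T


def high_sheaves_pipelined(T, r, oper):
    """Same computation, scheduled on rotating cycles (assumed schedule)."""
    L = 1 << r                          # cycle length; assumes k - r == L
    assert len(T) == L << L             # n = 2**(r + 2**r)
    T = list(T)
    time_units = 2 * L - 1
    for t in range(time_units):
        for p in range(L):              # operands with home position p
            i = t - (p + 1) % L         # sheaves already done by these operands
            if not 0 <= i < L:
                continue                # waiting to start, or already finished
            pos = (p - t) % L           # module currently holding the operand
            assert pos == L - 1 - i
            j = r + pos                 # the only cube dimension wired at pos
            for l in range(1 << L):
                if (l >> pos) & 1 == 0:
                    m = (l << r) | p
                    q = m + (1 << j)    # partner sits at pos in cycle l ^ 2**pos
                    T[m], T[q] = oper(m, j, T[m], T[q])
    return T, time_units


def oper(m, j, u, v):
    """Arbitrary order-sensitive operation, used only for checking."""
    return u + v, (u - v) * (j + 2) + m


data = list(range(64, 0, -1))           # r = 2, k = 6
pipelined, steps = high_sheaves_pipelined(data, 2, oper)
```

Under these assumptions the high sheaves complete in $2 \cdot 2^r − 1$ time units (7 for r = 2), compared with $(k − r) \cdot 2^r = 16$ if each sheaf waited for a full rotation.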
IV.A. Computation Within the Cycles
The next question to be addressed is the implementation of LOOPOPER(l) so that it runs in time linear in the cycle length. Obviously, we are constrained to using only the F and B cycle links existing in the CCC. Our objective is to emulate, on the cycle of length $2^r$, the operation OPER as it would be executed on hypothetical r-cube sheaves. Since OPER may take place only between adjacent modules in the cycle, the operands must first be rearranged; the perfect unshuffle serves this purpose [10, 20]. Recall the implementation of the unshuffle of the items stored in a one-dimensional array T[0 : $2^i$ − 1]. Unshuffle can be viewed as packing in a stable manner (i.e., without altering their original order) the items initially placed in even positions into the left half of the array, and those in odd positions into the right half. A specialization of the well-known transposition-sort algorithm clearly achieves this result:
PERFECT-UNSHUFFLE
  for b ← $2^{i-1}$ step −1 until b = 2 do
    foreach m: m = $2^{i-1}$ + p, −b < p < b, (p mod 2 = b mod 2)
      pardo T[m − 1] ↔ T[m] odpar
  od PERFECT-UNSHUFFLE
An illustration of this procedure on an array of length 8 is provided in Figure 7 .
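The procedure translates almost line by line into Python. In this sketch the function name `perfect_unshuffle` and the returned step counter are additions for illustration; each iteration of the outer loop corresponds to one parallel step of adjacent exchanges.

```python
def perfect_unshuffle(T):
    """Unshuffle T (length 2**i) using only exchanges of adjacent items.

    Returns the rearranged list and the number of parallel steps used.
    """
    n = len(T)
    i = n.bit_length() - 1
    assert n == 1 << i, "length must be a power of two"
    T = list(T)
    half = n // 2                        # 2**(i-1)
    steps = 0
    for b in range(half, 1, -1):         # b = 2**(i-1), ..., 2
        for p in range(-b + 1, b):       # -b < p < b
            if p % 2 == b % 2:
                m = half + p
                T[m - 1], T[m] = T[m], T[m - 1]
        steps += 1                       # the exchanges above are simultaneous
    return T, steps


result, steps = perfect_unshuffle(list(range(8)))
```

On 8 items the three steps exchange positions (1,2), (3,4), (5,6), then (2,3), (4,5), then (3,4), which moves the even-indexed items to the left half in their original order.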
Returning to LOOPOPER, use is made of the operation UNSHUFFLE(l, i), which performs perfect unshuffles within cycle l. The general format of LOOPOPER, which consists of a sequence of unshuffle-operation pairs, each emulating a sheaf operation, can now be elucidated. This is preceded by BRP, so that upon completion the results are in the correct order (see Figure 8). In the description below the parameter a gives the original address of the operand which is brought to module (l, q) by the sequence: BRP; UNSHUFFLE(l, r − 1); UNSHUFFLE(l, r − 2); ...; UNSHUFFLE(l, j + 1). (Recall that $\bar{q}$ denotes the integer whose binary representation is the reversal of that of the integer q.)

proc LOOPOPER(l)
  BRP(l);
  for j ← r − 1 step −1 until j = 0 do
    foreach q: 0 ≤ q < $2^r$, bit_0(q) = 0 pardo OPER(a, j; ...

With respect to execution time, we note that UNSHUFFLE(·, i) runs in ($2^i$ − 1) steps; thus BRP and LOOPOPER jointly run in $O(1 + 2 + 2^2 + \cdots + 2^{r-1}) = O(2^r)$ steps, linear in the cycle length.
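The scheme can be exercised on a single cycle with the sketch below. Since the listing is incomplete in this copy, two points are assumptions inferred from the step counts quoted in the text: UNSHUFFLE(l, i) is taken to unshuffle each block of $2^{i+1}$ consecutive modules, and it is applied right after the operation for dimension i. Each item carries its original address a so that OPER(a, j; ·, ·) can be checked; the unshuffle is done by slicing rather than by adjacent exchanges, for brevity.

```python
def bit_reverse(q, r):
    """Integer whose r-bit binary representation is the reversal of q's."""
    out = 0
    for _ in range(r):
        out = (out << 1) | (q & 1)
        q >>= 1
    return out


def descend_direct(T, r, oper):
    """Reference: DESCEND on the 2**r items of one cycle."""
    T = list(T)
    for j in range(r - 1, -1, -1):
        for m in range(len(T)):
            if (m >> j) & 1 == 0:
                T[m], T[m + (1 << j)] = oper(m, j, T[m], T[m + (1 << j)])
    return T


def loopoper(T, r, oper):
    """Same result using only operations between adjacent modules."""
    n = 1 << r
    # BRP: module q receives the item whose original address is reverse(q).
    ring = [(bit_reverse(q, r), T[bit_reverse(q, r)]) for q in range(n)]
    for j in range(r - 1, -1, -1):
        for q in range(0, n, 2):                  # bit_0(q) = 0
            (a, u), (b, v) = ring[q], ring[q + 1]
            assert b == a + (1 << j)              # neighbors are 2**j apart
            u, v = oper(a, j, u, v)
            ring[q], ring[q + 1] = (a, u), (b, v)
        if j > 0:
            size = 1 << (j + 1)                   # assumed block size
            ring = [x for s in range(0, n, size)
                    for x in ring[s:s + size:2] + ring[s + 1:s + size:2]]
    assert [a for a, _ in ring] == list(range(n))  # results in natural order
    return [v for _, v in ring]


def oper(m, j, u, v):
    """Arbitrary order-sensitive operation, used only for checking."""
    return u + v, (u - v) * (j + 2) + m


data = [3, 1, 4, 1, 5, 9, 2, 6]                   # one cycle with r = 3
```

With these block sizes the unshuffles cost $(2^{r-1} − 1) + \cdots + (2 − 1)$ exchange steps in total, consistent with the $O(2^r)$ bound stated above.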
IV.B. Programs for each Module of the CCC
From the preceding global description of DESCEND, it is rather straightforward to produce the sequential program of module (l, p). The program MODULE(l, p) for a given DESCEND algorithm is of the form HIGHSHEAVES(l, p); LOWSHEAVES(l, p), which, respectively, implement the (k − r)-cube operation and LOOPOPER. The entire MODULE(l, p) is of a very simple nature; it basically counts up time and at each time unit numbered t, it tests a simple logical condition involving l, p, and t. Depending on this test, either it does nothing, or it exchanges operands, or it exchanges operands and performs an operation on them. The details of these programs are omitted for the sake of brevity.
The precise execution time of DESCEND (or ASCEND) on the CCC is expressed in terms of two quantities: $T_{CCC}$, the time required for stepping up the control variable t, testing it, and performing one data exchange on some of the links; and $T_{OPER}$, the time required for computing OPER(m, j; U, V) within each module.
IV.C. Limited Parallelism
So far, we have assumed that the size n of the CCC was tailored to the application. To cope with the realistic situation where the number N of inputs is larger than the size n of the CCC, we suggest letting each module of the CCC be a more complex processor endowed with a private memory of adequate size. With regard to this private memory, a distinction is in order: (1) If the modules are to be thought of as microprocessors, the time unit is determined by the microprocessor's clock and we may assume uniform cost criterion for all operations. In this case the module's private memory may be viewed as a conventional RAM; (2) If the CCC is to be realized on a single chip,¹ then the memory is to be viewed as a large shift register (in fact, as will be seen below, only sequential access is really used).
Assuming for simplicity that N = sn, with $s = 2^q$ an integer, we require that the private RAM memory of each module be of size s. It should be clear by now that all of the algorithms described in Sec. II can be applied here. A direct analysis shows that on a CCC consisting of n processors, each processor having memory N/n, N inputs can be processed in time $O((N/n) \cdot \log N)$ for algorithms in the classes ASCEND or DESCEND, thus achieving the optimal speedup possible with n processors.
V. Layout of the CCC for VLSI

Up until now the CCC appears as a feasible interconnection of processing modules capable of emulating the binary cube with no degradation of performance with respect to the ASCEND-DESCEND algorithms. On the other hand, the same capability is exhibited by the shuffle-exchange interconnection whose modules have the same bounded fan-out as those of the CCC. Is the CCC just an alternative to the shuffle-exchange? In light of our current knowledge, it is more than that; in the VLSI model of computation [13, 21, 22], the CCC has a layout of smaller area than the best known layout for the shuffle-exchange (or the binary cube, obviously). Indeed, it is now shown that the CCC has a minimal area VLSI layout for several applications.
In the VLSI models of computation proposed by Thompson [21] and Brent-Kung [4], a parallel computing system is viewed as a "computation graph," consisting of "nodes" and "wires" whose respective functions are computation and communication. Each wire has unit width on the silicon chip and transmits a unit of information in a unit of time; information is taken from, or delivered to, special areas on the chip called nodes (each associated with a module). Within this model, which takes realistic account of the placement of modules and interconnection, Thompson has studied the implementation of the Fast Fourier Transform [22] and has elucidated significant relationships between input size n, chip area A, and processing time T, by proving the bound $AT^2 \ge c \cdot n^2$ where c is a constant dependent upon the technology. Further results by Brent-Kung [4] and Vuillemin [25] extend and sharpen his area-time tradeoff for various problems.
Proposition ([22], [25], [17]). Any VLSI network capable of
-- computing the DFT on p points, each represented with k-bit accuracy;
-- computing the product of two n-bit integers;
-- computing the product of two degree d polynomials, coefficients being represented with k bits;
-- realizing the data rearrangement specified by permutations drawn from a transitive group² on n bits;
-- merging two sorted sequences of length p/2, each element being coded with at most k bits;
is subject to area-time tradeoffs expressed by $AT^2 \ge c_0 \cdot n^2$ and $AT \ge c_1 \cdot n\sqrt{n}$, where A is the chip area, T the computation time, n the problem size measured in bits (n = pk for DFT and merging and n = dk for polynomial product) and $c_0$, $c_1$ are constants dependent upon the technology.

¹ Were the memory to be realized as a RAM of M locations, in the VLSI model of computation (see Sec. V) log M time would be required for address decoding.

In [17] the authors show that specific VLSI circuits can be designed from a general CCC layout that achieve the optimal bounds $AT^2 \le c_0' \cdot n^2$ for $O(\log_2 n) \le T \le O(\sqrt{n})$ and $AT \le c_1' \cdot n\sqrt{n}$ for $T = O(\sqrt{n})$, where $c_0'$, $c_1'$ are positive constants. This general CCC layout is now described, starting with its fastest realization.

The preceding sections exhibit algorithms for executing DFT, merging, and various data rearrangement in time O(log n) on the CCC. To achieve the corresponding minimal area $A = O(n^2/\log^2 n)$, consider a layout which uses two layers of evenly spaced wires, horizontal and vertical, corresponding, respectively, to cube and cycle connections. Figure 9 pictorially provides a base, inductive hypothesis, and extension, to prove that an $n = s \cdot 2^s$ module CCC can be placed on a $2^s \times (2 \cdot 2^s - 1)$ chip; since $s = \log_2(n/\log_2 n)$, the chip size is about $(n/\log_2 n) \times (2n/\log_2 n - 1) = O((n/\log n)^2)$. Slightly more complicated constructions yield somewhat more efficient module placements.

For pedagogical reasons, the CCC introduced so far has a number $n = s \cdot 2^s$ of processing modules with $s = 2^r$, a power of 2. A more general version of the CCC can be designed, comprising $n = h \cdot 2^s$ modules. Each of the $2^s$ cycles of the machine has $h \ge s$ modules. The lower $s \times 2^s$ modules of the cycles exhibit the horizontal interconnection of a standard CCC, while the $(h - s) \times 2^s$ higher modules only have vertical (cycle) connections, as indicated in Figure 11. Such a layout has height $2^s + h - s$ and width $2^{s+1}$ (in unit wire width). The programs presented in Sec. IV can be adapted to run on such a machine by simply ignoring operations pertaining to nonexisting horizontal (lateral) links, and their running time is proportional to the cycle length h. We see that for any value of h satisfying $\log_2 n < h \le \sqrt{n}$, the area × (time)² product
$$AT^2 = \left(\frac{n}{h} + h - \log\frac{n}{h}\right) \times \frac{n}{h} \times h^2 = n^2 + nh^2 - nh\log\frac{n}{h} = O(n^2)$$
meets the optimal theoretical bound to within a constant factor. Of particular interest is the choice $h = O(\sqrt{n})$, which leads to a running time $T = O(\sqrt{n})$ and uses the minimal achievable area A = O(n). This particular layout achieves the optimal $AT = O(n\sqrt{n})$ bound for AT, which is of special importance since AT is proportional to the energy spent in the computation. Also note that such a network emulates the $\sqrt{n} \times \sqrt{n}$ square mesh both in chip area and in computation time (see, e.g., [23] and [14]). The fact that the CCC interconnection can be specialized to emulate such diverse systems as the (fast) binary cube and the (slow) mesh demonstrates its versatility.
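A quick numerical check of this tradeoff is sketched below. It takes the layout dimensions quoted above (height $2^s + h − s$, width $2^{s+1}$, with $2^s = n/h$), treats n/h as a continuous quantity, and measures time simply as the cycle length h; the function name `ccc_area_time` and the sample values of n and h are illustrative assumptions.

```python
import math


def ccc_area_time(n, h):
    """Area and time (up to constants) of the generalized CCC layout."""
    cycles = n / h                            # 2**s
    height = cycles + h - math.log2(cycles)   # 2**s + h - s
    width = 2 * cycles                        # 2**(s+1)
    return height * width, h                  # (area, time)


n = 2 ** 20
ratios = {}
for h in (20, 64, 256, 1024):                 # from about log2(n) up to sqrt(n)
    area, time = ccc_area_time(n, h)
    ratios[h] = area * time ** 2 / n ** 2     # AT^2 normalized by n^2
```

Across the whole range the normalized product stays between 2 and 4, while the area itself falls from roughly $(n/\log n)^2$ toward a small multiple of n as h approaches $\sqrt{n}$.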
V.C. I/O in the CCC
The input/output mechanism for the CCC is of special importance with regard to VLSI circuits. It is suggested that I/O ports (or pins) be located at both ends of the "wrap around" links of each of the CCC cycles (for example, at the bottom rim in Figure 11); input/output operations are thus series/parallel, and the total I/O time is proportional to the cycle length, thus to the computation time within the network.
V.D. Comparison Between the CCC and Shuffle-Exchange
A comparison between the CCC and its competitors, the unfolded k-cube and shuffle-exchange, is in order. A VLSI layout of the k-cube requires area $O(n^2)$, which is larger than that of the CCC; indeed, the CCC can be regarded as an emulation of a pipelined k-cube, whence the size reduction in the layout. Thompson [21] has given a layout for the shuffle-exchange of area $O(n^2/\sqrt{\log n})$. More recently, Hoey and Leiserson [9] applied a general technique based on the graph-separator theorems, improving the result to $O(n^2/\log n)$. Extending the Hoey-Leiserson approach, Steinberg and Rodeh [19] have obtained area $O(n^2/(\log n)^{3/2})$. Thus, to date, the best known layout for the shuffle-exchange is by a factor $\sqrt{\log n}$ inferior to that of the CCC. Another apparent advantage of the CCC is the high regularity of its layout (as evidenced by Figure 9), which is contrasted with the seeming intricacy of the shuffle-exchange network. It has also been pointed out in Sec. V.B that the CCC admits layouts of various sizes and speeds, all achieving the optimal $AT^2 = O(n^2)$ value provided that $O(\log n) \le T \le O(\sqrt{n})$. No such feature is known for the shuffle-exchange network.
VI. Conclusion
A structure has been proposed which can be used for direct hardware implementation of specific useful algorithms or, as suggested in Sec. IV.C, as a general purpose parallel processing system.
The CCC is expected to be practically feasible in the present state of the technology, and to be capable of efficiently executing a wide variety of algorithms. The extent of the class of algorithms amenable to efficient CCC processing is not yet well understood, but it goes beyond the applications described in Sec. II; in particular, it includes a variety of matrix and graph algorithms, as well as arithmetic and algebraic problems. The strength of such an interconnection network comes from its ability to efficiently permute data, sending them to the location where the next processing step will occur. Indeed, this feature of the CCC is exploited by Galil and Paul,ᵃ who claim to exhibit an efficient general purpose parallel machine.
Another salient feature of this work is the possibility of developing a high level, general purpose language for parallel programming that would nevertheless be automatically compilable. The CCC is also well suited for VLSI design. In [17] the authors describe circuits using the CCC interconnection for computing cyclic shifts, polynomial and integer products, that meet the optimal $AT^2 = O(n^2)$ bound for any choice of time within the bounds $O(\log n) \le T \le O(\sqrt{n})$. In particular, this demonstrates the existence of an n-bit multiplier having linear area A = O(n) and computing time $T = O(\sqrt{n})$, a result of some practical interest. Finally, if technological progress keeps up its current pace, we can envision integrating a general purpose (programmable) CCC on a single chip having a few million transistors. This would indeed be a very fast microprocessor whose general architecture would be drastically different from the ones currently used.
ᵃ Private communication (1981).
