Abstract: We illustrate the potential of techniques and results from the theory of network emulations to enhance the performance of a parallel architecture. The vehicle for this demonstration is a suite of algorithms that endow an N-processor bit-serial processor array $\mathcal{A}$ with a "meta-instruction" GAUGE k, which (logically) reconfigures $\mathcal{A}$ into an $N/k$-processor virtual machine $\mathcal{B}_k$ that has: 1) a datapath and memory bus whose emulated width is k bits, as opposed to $\mathcal{A}$'s 1-bit width, and 2) an instruction set that operates on k-bit words, in contrast to $\mathcal{A}$'s instruction set, which operates on 1-bit words. In order to stress the strength of the approach, we show (via pseudocode) how our emulation techniques can be implemented efficiently even if $\mathcal{A}$ operates in strict SIMD mode, with only single-bit masking capabilities and with no indexed memory accesses. We describe at an algorithmic level how to implement our technique, including datapath conversion ("corner-turning") and the creation of the word-parallel instruction sets, on arrays of any regular network topology. We instantiate our technique in detail for arrays based on topologies with quite disparate characteristics: the hypercube, the de Bruijn network, and a genre of mesh with reconfigurable buses. Importantly, the emulations that underlie our technique do not alter the native machine's instruction set, hence allowing an invariant programming model across gauges.
INTRODUCTION
AMONG the many notable contributions of the field of parallel algorithmics over the past two decades has been the subfield of network emulations, with landmark studies such as [3], [11], [14]. Virtually all applications of network emulations, to this point, have been in the development and/or analysis of parallel algorithms for scientific problems; cf. [1], [8], [23]. In the present study, we apply network emulations to a new class of computational problems, namely, the use of algorithmic techniques to enhance the performance of parallel architectures. We have selected the problem of endowing a processor array with a multigauging capability (the ability to change the apparent width of the array's datapath and memory bus$^{1}$) as a vehicle for our study, since such a capability is desirable in certain important applications, yet is prohibitively expensive to realize in hardware [5], [16], [25], [26].
Multigauging. A fundamental issue in the design of high-performance parallel computers is the appropriate granularity of the processing elements (PEs). The quest for a "best" PE-width seems to have diminished in importance as the historical stream of ever larger arrays of small PEs has largely been replaced in recent years by arrays of tens to hundreds of microprocessors. However, there has been a recent resurgence of interest in arrays of small PEs, due to the recognition that memory chips offer a vast amount of internal bus bandwidth that is untapped by traditional microprocessors. The resulting "smart memory" or "intelligent RAM" architectures are once again exploring the design space of PE gauges to identify a "best" width. Additionally, for many applications, synchronous arrays of small PEs remain a cost-effective solution. Indeed, it is well recognized that different computations favor different granularities, often within the same application or even the same function; cf. [25], [5], [16]. The significance of the question of appropriate PE-width would diminish if an architecture could dynamically change its gauge efficiently, becoming a large array of small PEs or a smaller array of proportionally larger PEs, as mandated by the current computation. Techniques for endowing an architecture with multigauge behavior via hardware enhancements are studied in [5], [26], [10], [2]. These sources comment on three problems associated with hardware-enabled multigauging: the costs in hardware, the difficulty of enabling a large repertoire of gauges, and the extent to which a commitment to hardware multigauging interferes with subsequent design decisions.$^{2}$ Notably, the algorithmic approach to multigauging we present here encounters none of these problems, offering great flexibility at modest operating cost. Indeed, our approach provides, at least in principle, virtually any conceivable gauge at an operational cost that grows slowly with the target gauge (cf. Section 5).
We present our emulation algorithms via register-level pseudocode rather than the more traditional high-level description. Thereby, we illustrate how our techniques can be implemented efficiently even if the host architecture operates in strict SIMD mode, with only single-bit masking capabilities, and with no indexed memory accesses; moreover, the pseudocode affords us easy, accurate estimates of the constants hidden in our big-O analyses.
Networks and Emulations. A processor array consists of many identical PEs (each having a local memory module) that communicate via a point-to-point interconnection network. We identify a processor array with its network, which we view as a directed graph whose nodes are PEs and whose arcs are internode communication links; we represent bidirectional links in networks as mated opposing arcs in the directed-graph model. We logically transform a bit-serial processor array $\mathcal{A}$ to a gauge-k array by using network emulations to logically aggregate $\mathcal{A}$'s PEs into groups of size k, which then cooperate to act as a single width-k PE. Importantly, our emulation techniques operate on an instruction-by-instruction basis, thereby enabling multigauging even under our spartan computing regimen. The idea of using aggregates of PEs to cooperate in simultaneously solving subproblems is not new, appearing, e.g., in [29] as a formalism for arrays of finite automata to cooperate on certain computations and in [9] as a mechanism for nonuniform aggregates of PEs to perform spatially mapped computations. Our use of aggregation differs somewhat from those in these sources in our focus on the programming, as well as algorithmic, aspects of emulations. Specifically, we aim for a "clean" implementation of our emulated aggregations, so as to operate strictly within the programming model that is native to the host array.
Our View of Multigauge Processor Arrays. In order to emphasize how little our algorithmic emulation strategy demands of its host architectural platform and to illustrate our techniques on a testbed where they are most likely to be of significance, we focus henceforth on processor arrays $\mathcal{A}$ that are bit-serial (i.e., have unit gauge) and that operate in an SIMD regimen. With a more powerful host, which has, e.g., a global router and/or MIMD capability, our suite of algorithms could be implemented even more efficiently.
One can view the algorithmically enhanced architecture $\mathcal{A}$ as having a meta-instruction, GAUGE k, whose functionality is summarized in Fig. 1. This meta-instruction, which has the apparent effect of reconfiguring the N-PE bit-serial host $\mathcal{A}$ to an $N/k$-PE gauge-k array $\mathcal{B}_k$, allows $\mathcal{A}$ to change its (apparent) gauge even in the midst of a computation. Thereby, the single bit-serial processor array $\mathcal{A}$ can act as though it were a family $\{\mathcal{B}_k\}_{k \in N}$ of word-parallel processor arrays that differ in gauge (the parameter k) but that share uniform instruction-level semantics across all gauges.
A Roadmap. Section 2 describes the detailed architectural framework of our study. Section 3 presents our emulation strategy in a topology-independent fashion. Section 4 describes and analyzes our emulations on three substantively different array topologies that have been used in computer architectures: the hypercube [27] (Section 4.1), the de Bruijn network [22] (Section 4.2), and the extended coterie network, a mesh with a reconfigurable bus which abstracts the UMass CAAPP architecture [30] (Section 4.3). In Section 5, we show that the operational performance of our emulation-based multigauging is superior to having the host $\mathcal{A}$ perform k-bit operations in the straightforward bit-serial way.
ARCHITECTURAL DETAILS
Our use of register-level pseudocode to describe algorithms necessitates a more detailed specification of architectural details than is common in algorithmic papers. It will be seen from the following discussion that our model merely captures the essential features of a conventional SIMD-array PE. We can describe both physical and virtual architectures simultaneously because of our uniform operational semantics across gauges: The physical host array $\mathcal{A}$ is merely the $k = 1$ instance $\mathcal{B}_1$ of the family $\{\mathcal{B}_k\}_{k \in N}$ of virtual arrays.
Control. We focus on processor arrays that compute via alternating computation steps and communication steps. During a computation step, each PE of $\mathcal{B}_k$ refers to its own memory module for a k-bit operand and/or performs an arithmetic or logical operation on k-bit operand(s). During a communication step, each PE of $\mathcal{B}_k$ receives/sends a single k-bit word through one of its I/O ports from/to a PE that is adjacent in the underlying network. The specific operation to be performed, the address(es) of the operand(s), and any relevant disabling information are found in the native SIMD instruction that is broadcast at each step by the SIMD controller.
Instruction Repertoire. Each PE in a gauge-k processor array is capable of the following operations on words of width k: algebraic addition, logical operations (both bitwise and accumulative, e.g., the OR of a word's bits), numerical comparison, and circular shifts. We assume only that each bit-serial PE has the capabilities of a full adder.

Memory. Since all PEs in a processor array are identical, their memory modules have the same capacity, say M bits. We implement the meta-instruction GAUGE k in a manner that ensures the uniform capacity of the memory modules of the resulting word-parallel array's (virtual) PEs. The emulations that achieve multigauge behavior may tie up some small (constant) amount of memory within each virtual PE, so the total available memory in a virtual array $\mathcal{B}_k$ may be slightly less than in the physical host array $\mathcal{A}$.
Our algorithms can be implemented to accommodate a variety of modes of addressing physical PE-memory, most notably direct access, wherein accessed location(s) are named in instructions' address fields, and indirect access, wherein accessed location(s) are named in designated local address registers.$^{3}$ Throughout, the addressing mode(s) of the bit-serial host array are inherited by any emulated word-parallel guest array.
Control Registers. Each PE of an array has a read-only PE index register (PIR), which contains its unique "name," and an activity bit register (AB), which allows the SIMD controller to isolate classes of PEs in order to enable or disable them for the execution of an instruction. Control registers are addressed in the same manner as memory.
Data Transfer. Each PE has an input port and an output port (each having the same width and access mode as a memory location) for each incident arc in the network. A communication instruction transfers the contents of the output port of the PE at the source end of an arc into the input port of the PE at the target end of the arc. Only one of a PE's output ports can be active at each communication step.
Data I/O. In order not to alter the user interface of an array, we neither model nor emulate the I/O subsystems of virtual arrays, relying on the bit-serial host array to load and unload the physical memory. Specifically, during physical I/O, each word-parallel data item is handled by one bit-serial PE. The major consequence of this decision is that every change in gauge must begin and end with a corner-turning operation, the counterpart of the corner-turning circuitry of machines that provide hardware support for multigauging. When one goes from bit-serial to word-parallel processing, the operation builds the words for the word-parallel PEs; when one returns from word-parallel to bit-serial processing, the operation repositions data for bit-serial I/O.
THE IMPLEMENTATION STRATEGY

The High-Level Strategy
We have a bit-serial array $\mathcal{A}$ behave as a gauge-k array $\mathcal{B}_k$ by logically reconfiguring the network$^{4}$ $A$ that underlies $\mathcal{A}$ into disjoint isomorphic copies of a k-PE aggregate array $\mathcal{K}$, each of which will act as a logical PE of $\mathcal{B}_k$. To emphasize the logical status of the various entities in our emulation, we call $\mathcal{B}_k$ a macro-array, $B_k$ a macro-network, and each copy of $\mathcal{K}$ a macro-PE. The major challenge in implementing this strategy resides in the question of how to partition $\mathcal{A}$ in a way that:
• allows the SIMD controller to "address" all macro-PEs efficiently;
• allows the PEs within each macro-PE to "cooperate" efficiently.

We address these challenges via the principle of self-similarity: Choose arrays $\mathcal{K}$ and $\mathcal{B}_k$ to be smaller versions of array $\mathcal{A}$. Importantly, self-similarity allows one to retain across gauges algorithmic strategies that are optimized for specific network topologies. It is achievable with all popular network topologies since they tend to be defined via infinite families of like-structured networks. One disadvantage of self-similarity is that it usually restricts one's range of gauges since most topologies come only in a limited range of sizes, e.g., powers of 2 for hypercubes or perfect squares for meshes. On balance, we feel that the benefits of self-similarity outweigh this disadvantage.$^{5}$
The Detailed Strategy
We now describe the three major ingredients of our emulation strategy. To strengthen our results, we assume that the only data type in the host array $\mathcal{A}$ is bit; for simplicity, we assume that the central (SIMD) controller operates on integer data.
Ingredient 1: Emulating Direct-Product Arrays
Say that the host array $\mathcal{A}$ has N PEs. For each desired gauge k, we choose a k-PE macro-PE $\mathcal{K}$ and an $N/k$-PE macro-array $\mathcal{B}_k$.$^{6}$ We then have $\mathcal{A}$ emulate the array $\mathcal{K} \times \mathcal{B}_k$ that is based on the direct-product network$^{7}$ $K \times B_k$. Since the direct-product operator yields identical copies of the "factor" networks, this approach automatically solves several implementation problems for our strategy. First, since $\mathcal{K} \times \mathcal{B}_k$ can be viewed as array $\mathcal{B}_k$ with "fat" nodes, each of which is a copy of array $\mathcal{K}$, all macro-PEs in a direct-product solution have the same node-set. This solves the "addressing" problem for the SIMD macro-controller (i.e., $\mathcal{A}$'s SIMD controller acting as a controller for the macro-array $\mathcal{B}_k$): The controller uniformly addresses the ith PE of the jth macro-PE as PE $\langle i, j \rangle$ of the (emulated) direct-product array. Additionally, if copies $\mathcal{K}_1$ and $\mathcal{K}_2$ of macro-PE $\mathcal{K}$ are "adjacent" within the macro-array $\mathcal{B}_k$, then every node of $\mathcal{K}_1$ is adjacent to its homologous node in $\mathcal{K}_2$ within $\mathcal{K} \times \mathcal{B}_k$. This renders trivial the problem of implementing inter-(macro-)PE data transfers: A word-transfer between macro-PEs h and j is effected via parallel bit-transfers between PEs $\langle i, h \rangle$ and $\langle i, j \rangle$ of $\mathcal{K} \times \mathcal{B}_k$ (one bit-transfer for each i).
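The addressing scheme and the word transfer described above can be illustrated with a small sequential model. This is only a sketch: the dictionary-based storage and the function names `make_array`, `store_word`, `load_word`, and `transfer_word` are hypothetical, and the loop over slices stands in for k bit-transfers that the array would perform simultaneously.

```python
def make_array(k, num_macro_pes):
    """One bit of storage per PE (i, j): slice i of macro-PE j."""
    return {(i, j): 0 for i in range(k) for j in range(num_macro_pes)}


def store_word(bits, k, j, value):
    """Spread a k-bit integer over macro-PE j; PE (i, j) holds bit i."""
    for i in range(k):
        bits[(i, j)] = (value >> i) & 1


def load_word(bits, k, j):
    """Reassemble the k-bit integer held by macro-PE j."""
    return sum(bits[(i, j)] << i for i in range(k))


def transfer_word(bits, k, src, dst):
    """Word transfer between macro-PEs src and dst, realized as one
    bit transfer (i, src) -> (i, dst) per slice i.
    The loop models k transfers that happen in parallel."""
    for i in range(k):
        bits[(i, dst)] = bits[(i, src)]


array = make_array(k=4, num_macro_pes=3)
store_word(array, 4, 0, 0b1011)
transfer_word(array, 4, src=0, dst=2)
word_at_2 = load_word(array, 4, 2)
```

Because every macro-PE has the same node-set, the same slice index i names the sending and the receiving bit-serial PE, which is why no per-PE address computation is needed.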
In more detail: Each copy $\mathcal{K}_i$ of $\mathcal{K}$ operates as a macro-PE operating on k-bit operands in $\mathcal{B}_k$ by emulating a k-bit PE that has a k-bit ALU, with each bit-serial PE of $\mathcal{K}_i$ responsible for one bit-wide slice. The copies of network $B_k$ realize the connections of the macro-network that underlies $\mathcal{B}_k$. Slices are assigned by numbering the PEs in $\mathcal{K}$ from 0 to $k-1$ and implementing $\mathcal{B}_k$ on array $\mathcal{A}$ so that: 1) the ith bit of every k-bit operand in $\mathcal{B}_k$ resides in the ith PE of the corresponding copy of $\mathcal{K}$ and 2) all the bits have the same memory, port, or register address. The transfer of a k-bit operand between adjacent macro-PEs $\mathcal{K}_1$ and $\mathcal{K}_2$ is effected by (emulating) k parallel single-bit transfers in $\mathcal{B}_k$: For all $0 \le i \le k-1$ in parallel, the ith PE in $\mathcal{K}_1$ transfers a single bit to the ith PE in $\mathcal{K}_2$. For efficiency, we assign physical ports to macro-ports uniformly across macro-PEs.

3. We easily support shift-register memory also, wherein each PE has direct access only to the bit with local address 0, accessing all other bits via shifts of the register.

4. Henceforth, uppercase script letters denote processor arrays and the corresponding uppercase italic letters denote their underlying networks.

5. The approach to multigauge emulations in [19] makes less restrictive demands than self-similarity.

6. Of course, we condition our choices in a way that maintains self-similarity.

7. The nodes of $K \times B_k$ are all ordered pairs $\{\langle u, v \rangle \mid u$ is a node of $K$ and $v$ is a node of $B_k\}$. There is an arc in $K \times B_k$ connecting $\langle u, v \rangle$ to $\langle u', v' \rangle$ just when either $u = u'$ and $v$ is connected to $v'$ in $B_k$, or $v = v'$ and $u$ is connected to $u'$ in $K$.
Memory accesses are straightforward in $\mathcal{K} \times \mathcal{B}_k$ since all bits of a memory word within a macro-PE have the same bit-serial address, which becomes the word's macro-address. For definiteness, we concentrate henceforth on direct memory access, but our techniques allow $\mathcal{B}_k$ to inherit other addressing modes from $\mathcal{A}$.
To implement control in $\mathcal{K} \times \mathcal{B}_k$, we must provide the equivalent of loading the activity bit AB depending on some bit(s) of the PE index. The most natural solution retains the bit-serial PE indices while endowing them with a different interpretation. Each binary PE index is divided into two fields: one for the macro-PE index in the macro-array (MIR) and one for the node within the macro-PE (AIR). In this way, a (physical) bit-group in the PIR can characterize either individual like-named PEs within all macro-PEs (for arithmetic/logical operations) or entire macro-PEs as units (for data transfer).
Ingredient 2: Emulating Complete Binary Tree Sweeps
For any integer k, the k-leaf complete binary tree $T_k$ is the undirected graph$^{8}$ whose nodes comprise the set $\{1, 2, \ldots, 2k-1\}$ and whose edges connect each node $i > 1$ with its parent node $\lfloor i/2 \rfloor$. Each node $2j$ (respectively, $2j+1$, for $j > 0$) is the left (respectively, right) child of node j. Nodes $j \ge k$, which have no children, are leaves of $T_k$; node 1 is its root;$^{9}$ see Fig. 2. For each $\ell \in \{1, 2, \ldots, \lg(2k-1)\}$,$^{10}$ nodes $2^{\ell-1}, \ldots, \min(2^{\ell}-1, 2k-1)$ comprise level $\ell$ of $T_k$.
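The definition can be made concrete with a few helper functions. This is a minimal sketch with hypothetical names (`parent`, `children`, `is_leaf`, `levels`); it relies on the observation that node i lies on level $\ell$ exactly when its binary representation has $\ell$ bits.

```python
def parent(i):
    """Parent of node i > 1 in T_k."""
    return i // 2


def children(j, k):
    """Left and right child of node j, or () if j is a leaf (j >= k)."""
    return (2 * j, 2 * j + 1) if j < k else ()


def is_leaf(j, k):
    return k <= j <= 2 * k - 1


def levels(k):
    """Map each level number to the sorted list of nodes on that level.
    Level l holds nodes 2**(l-1) .. min(2**l - 1, 2k - 1)."""
    by_level = {}
    for node in range(1, 2 * k):
        by_level.setdefault(node.bit_length(), []).append(node)
    return by_level


tree4 = levels(4)   # a full tree with 4 leaves
tree3 = levels(3)   # 3 leaves: the bottom level is only partly filled
```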
The Role of Trees in Our Strategy. Efficient implementation of the meta-instruction GAUGE k requires efficient orchestration of several logical and arithmetic operations on our k-PE macro-PEs. It is well-known (cf. [13]) that k-leaf binary trees are a vehicle for efficiently realizing a large repertoire of such operations. For instance, a single $\lg(2k-1)$-step leaves-to-root sweep suffices to compute the reduction (or, accumulation) operator $x_1 * x_2 * \cdots * x_k$ for any binary associative operation $*$, when each $x_i$ is stored in PE i of the macro-PE. This class of operations includes (logical and arithmetic) addition and multiplication, minimization and maximization, among others. The repertoire of efficiently realizable operations grows significantly when one uses a leaves-to-root-to-leaves up-down sweep of the tree to compute the parallel-prefix (or, scan)$^{11}$ of a binary associative operation $*$; a sampler of such operations appears in [4]. Notably, for our purposes, carry-lookahead addition can be computed efficiently using scan [13].
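The up-down sweep can be sketched sequentially as follows. This is an illustration under stated assumptions, not the pseudocode of the appendices: the function name `tree_scan` is hypothetical, k is assumed to be a power of 2 so that leaves k, ..., 2k-1 appear in left-to-right order, and `None` is used as a marker for "nothing to the left" so that the operation needs no identity element.

```python
import operator


def tree_scan(xs, op):
    """Inclusive scan of xs under an associative op via one up-sweep and
    one down-sweep of the heap-numbered tree T_k, k = len(xs).
    Returns (list of prefixes, reduction of all of xs)."""
    k = len(xs)
    assert k >= 1 and (k & (k - 1)) == 0, "sketch assumes k is a power of 2"

    # Up-sweep: total[j] combines all leaves in the subtree rooted at j.
    total = [None] * (2 * k)
    total[k:] = xs
    for j in range(k - 1, 0, -1):
        total[j] = op(total[2 * j], total[2 * j + 1])

    # Down-sweep: before[j] combines all leaves strictly left of j's subtree.
    before = [None] * (2 * k)
    for j in range(1, k):
        before[2 * j] = before[j]
        left = total[2 * j]
        before[2 * j + 1] = left if before[j] is None else op(before[j], left)

    scan = [x if b is None else op(b, x) for b, x in zip(before[k:], xs)]
    return scan, total[1]


sums, grand_total = tree_scan([3, 1, 4, 1, 5, 9, 2, 6], operator.add)
words, whole = tree_scan(["a", "b", "c", "d"], operator.add)
```

The string example uses concatenation, which is associative but not commutative, to show that the sweep preserves the left-to-right order of the operands.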
Implementing Tree-Sweep Algorithms. Importantly, one can implement (up and/or down) sweeps on k-leaf binary trees efficiently via a sequence of $\lg(2k-1)$ simple routing operations, provided that each tree is mapped appropriately onto its k-PE macro-PE. As we shall see in Section 3.2.3, the mapping that places the leaves of the tree in numerical order into the PEs of the macro-PE and then recursively maps each nonleaf node i into the same PE as its left child $2i$ (see Fig. 2) works well for our purposes.
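A short sketch of this mapping follows; the names `pe_of` and `level_transfers` are hypothetical. Since a nonleaf node shares a PE with its left child, only the edges to right children require communication, and the example lists those (source PE, destination PE) pairs for one upward step.

```python
def pe_of(node, k):
    """PE that hosts tree node `node`: leaf k + p sits in PE p, and a
    nonleaf node sits in the same PE as its left child."""
    while node < k:
        node *= 2
    return node - k


def level_transfers(level, k):
    """(source PE, destination PE) pairs needed to move values from the
    nodes on `level` up to their parents. Left children need no transfer
    because they share a PE with their parent."""
    lo = 2 ** (level - 1)
    hi = min(2 ** level - 1, 2 * k - 1)
    return [(pe_of(c, k), pe_of(c // 2, k))
            for c in range(lo, hi + 1) if c > 1 and c % 2 == 1]


k = 8
up_from_leaves = level_transfers(4, k)
up_from_level3 = level_transfers(3, k)
up_from_level2 = level_transfers(2, k)
```

When k is a power of 2, the two PE indices in each pair differ in exactly one bit position, so each transfer crosses a single hypercube dimension.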
Ingredient 3: Permutation Routing
A permutation route in a processor array is a set of inter-PE messages wherein each PE is the source and the destination of (at most) one message, so that the source-destination pairs form a (partial) permutation of the PEs. Permutation routes efficiently implement a variety of requisite internal data transfers in our emulations. Importantly, the emulations mandated by our strategy (of direct-product arrays and binary-tree sweeps) do not require the (typically expensive) routing of arbitrary permutations. We need only four specific types of permutations, all of which yield to efficient solutions within our algorithmic setting. Indeed, in many important networks, such routing can be performed in time only slightly exceeding the diameter (= maximum internode distance) of the network. We emphasize that the algorithmic fact that sequences of such routings achieve the necessary data transfers is not surprising; it is the efficient implementability of the required routings within the very constrained setting of SIMD arrays with 1-bit masking that is interesting here. The major implementational challenge is to orchestrate these routings in such a way that items to be moved are accessed efficiently within local memories. We now consider our varied uses of permutation routing.
Emulation Routing. An emulation of a guest array $\mathcal{G}$ by a host array $\mathcal{H}$ entails two mappings: Each PE p of $\mathcal{G}$ is assigned to a PE of $\mathcal{H}$ that will perform its role; each communication link of $\mathcal{G}$, say, from PE p to PE q, is assigned a routing-path in $\mathcal{H}$ between the images of p and q, along which the link's messages will be sent. We orchestrate communication within $\mathcal{G}$ by "coloring" its links so that all links entering the same PE receive distinct colors and all links leaving the same PE receive distinct colors; it is well-known that at most $d + 1$ colors are needed, where d is the largest indegree or outdegree of any PE of $\mathcal{G}$.$^{12}$ Easily, each set of like-colored links of $\mathcal{G}$ specifies a permutation of the array's PEs; therefore, one can emulate each communication step of array $\mathcal{G}$ via $d + 1$ consecutive permutation routes in $\mathcal{H}$.
Word Shifting. Shifting a k-bit word within a macro-PE whose constituent PEs each contains one bit of the word is transparently a permutation route within the macro-PE.
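As a small illustration of why a shift is a permutation route, the sketch below moves each bit from PE i to PE (i + s) mod k. The function name `circular_shift` and the convention that PE i holds bit i (least significant bit in PE 0) are assumptions made for the example.

```python
def circular_shift(bits, s):
    """Circularly shift a word whose bit i is held by PE i.
    PE i sends its bit to PE (i + s) mod k, so every PE is the source
    and the destination of exactly one message."""
    k = len(bits)
    shifted = [0] * k
    for i in range(k):          # models k simultaneous transfers
        shifted[(i + s) % k] = bits[i]
    return shifted


word = [1, 0, 0, 1]             # bit 0 first: the value 0b1001
rotated = circular_shift(word, 1)
```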
Binary-Tree Sweeping. When a k-leaf binary tree is mapped onto a k-PE macro-PE, as illustrated in Fig. 2, the transitions from one tree level to the next (either up or down), which compose to form a tree-sweep, clearly form the partial permutation represented by the edges between parents and right children on the two levels. Moreover, since each such edge connects a PE labeled i with one labeled $2i+1$, these permutations admit efficient SIMD implementations.

Datapath Conversion. The meta-instruction GAUGE k converts the datapath to width k concurrently within all macro-PEs. The conversion procedure transposes the memories within all macro-PEs so that the memory of the ith PE in each macro-PE starts out with the k-bit word $b_{i,0} b_{i,1} \cdots b_{i,k-1}$ and ends up with the ith bits $b_{0,i}, b_{1,i}, \ldots, b_{k-1,i}$ of all words in the collective memory of the macro-PE.

8. The level of detail of our algorithmic specifications demands that we explicitly define graph structures whose general nature is well known.

9. Our version of "complete binary tree" thus enjoys the structure that underlies heaps.

10. We denote by $\lg n$ the quantity $\lceil \log_2 n \rceil$.

11. The scan operator starts with argument $x_i$ stored in PE i of the macro-PE and ends with the ith prefix-product $x_1 * x_2 * \cdots * x_i$ stored there.

12. Efficient algorithms for computing such colorings abound; cf. [7].
Datapath conversion is the most demanding and least obvious operation necessary to implement multigauging via emulations.$^{13}$ We consider it worthwhile to present a detailed topology-independent algorithm for this operation because it allows us to illustrate the influence of the mode of memory access on the implementation. Our analysis suggests that, absent knowledge of the topology of the network underlying the host array, the SIMD regimen we are assuming favors a shift-register style of memory access.
The Language. All of our detailed procedure specifications are phrased in the following pseudoprogramming language. The instructions in the language are "high level" in that their implementation may take different numbers of basic instructions on different physical host arrays. Although these numbers may not be equal across our instruction set, even for a specific host array $\mathcal{A}$, the variations across instructions are within small constant factors which are easily derived from the pseudocode (once $\mathcal{A}$ is specified). For illustration, an instruction such as:
may translate into a sequence of low-level instructions with the following effects:
1. load the ALU from the ith bit of the PIR;
2. load the AB according to this bit;
3. load the ALU from the memory bit M[j];
4. transfer the ALU bit to the port named port;
5. clear the AB.
The cost of a macro-instruction can be estimated from the number of basic instructions it takes on a conventional host array. We focus only on the host array's instructions, ignoring the cost of the controller program.$^{14}$

The topology-independent k-bit datapath-conversion procedure assumes a given (but arbitrary) k-node interconnection network $K_k$. For purposes of specifying and analyzing the procedure, we posit the existence of a SIMD protocol for routing the following permutations in network $K_k$ within $R(k)$ steps; cf. [19]. The permutation indexed by $(j, k)$ maps the set $\{a_0, a_1, \ldots, a_{k-1}\}$ as follows:
The Procedure. We convert $\mathcal{K}$'s gauge via the sequence of corner-turning permutations indexed by $(j, k)$. For definiteness, we assume that PEs have shift-register access to local memory; this requires the sequence of permutations to be bracketed by local (left-)circular shifts that make each element $b_{i,k-i-1}$ accessible in the memory of the ith PE (see Fig. 3). In order to enhance legibility, we relegate to a series of appendices all of the pseudocode specifications that verify our claims about, e.g., 1-bit masking; the pseudocode for this section appears in Appendix A. The reader who does not want to wade through the pseudocode can see that the required shifts yield to SIMD specification with 1-bit masking by observing that an i-place shift is effected by instructing all PEs to shift $2^m$ times for each position m of their index register, using the value of the bit at that position as the enable flag.
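The masked-shift observation at the end of the preceding paragraph can be checked with a small simulation. This is a sketch only, not the LeftShift pseudocode of Appendix A: the function name `shift_each_pe_by_its_index`, the list-based model of a shift-register memory, and the way broadcasts are counted (one per mask load, one per one-place shift) are all assumptions.

```python
def shift_each_pe_by_its_index(memories):
    """Left-circularly shift PE i's memory by i places using only
    broadcast instructions and a 1-bit mask per PE.
    memories[i] is PE i's shift-register memory (a list of cells).
    Returns the number of instructions the controller broadcasts."""
    k = len(memories)
    broadcasts = 0
    for m in range((k - 1).bit_length()):       # one round per index bit
        # Every PE loads its mask from bit m of its own index register.
        mask = [(pe >> m) & 1 for pe in range(k)]
        broadcasts += 1
        for _ in range(2 ** m):                 # 2**m one-place shifts
            broadcasts += 1
            for pe in range(k):                 # same instruction everywhere
                if mask[pe]:                    # disabled PEs ignore it
                    mem = memories[pe]
                    mem.append(mem.pop(0))
    return broadcasts


mems = [list("abcd") for _ in range(4)]
issued = shift_each_pe_by_its_index(mems)
```

For k a power of 2, the controller issues k - 1 shift instructions and lg k mask loads, which agrees in form with the $k + O(\log k)$ instruction count quoted in the analysis.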
The Analysis. Operation LeftShift performs $k + O(\log k)$ bit-serial instructions per conversion (k memory accesses and $O(\log k)$ adjustments to activity bits); the constant within the big-O is estimated to be under 5. The total datapath-conversion cost is, thus, $2k + O(\log k) + kR(k)$. Absent details about network topology, the term $kR(k)$ cannot be improved. When the algorithm for LeftShift is implemented on a direct-addressed memory which lacks shift-register capabilities, conversion time rises to $2k^2 + O(\log k) + kR(k)$ since each one-place shift of a word requires k memory accesses. On an indirect-addressed memory which lacks shift-register capabilities, one would perform the conversion via referencing rather than physical shifting; this leads to a conversion time of $O(k(R(k) + \log k))$ when address arithmetic is bit-serial (so that incrementing the address register takes time proportional to the length of the register) or $O(kR(k))$ when address arithmetic is word-wide. The only addressing mechanism which outperforms a shift register here is an address index register whose contents can be added to the value of the address field to obtain the actual address; such a mechanism reduces the conversion time to $O(kR(k))$.

13. In some network topologies, other ingredient algorithms, such as emulating a direct-product network, could also be quite complex.

14. The costs of the controller are small, residing only in the time for procedure calls and some small amount of space that does not depend on k.
A Final Word. When assessing the cost of our strategy, one must keep in mind that the routings which implement word-shifting, tree-sweeping, and datapath conversion take place within the k-PE macro-PEs, and that k will usually be much smaller than the size of the physical network.
THREE CASE STUDIES
We now flesh out our strategy by describing its implementation on three quite different array topologies: hypercubes, which are direct-product networks (Section 4.1); de Bruijn networks, which are not direct-product networks, hence needing a heavier dose of emulation (Section 4.2); the mesh-structured extended coterie network, which supports our strategy more efficiently if one ignores its native structure in favor of an artificial one that is achieved via emulation (Section 4.3).
Hypercube Networks
The n-dimensional (Boolean) hypercube network $Q_n$ has node-set$^{15}$ $Z_2^n$; each node x is labeled int(x), thereby linearly ordering the network (and the array $\mathcal{Q}_n$). The arcs of $Q_n$ connect every pair of nodes u and v that differ in just one bit-position, i.e., $u = x\beta y$ and $v = x\bar{\beta}y$, for some $\beta \in Z_2$, $y \in Z_2^m$, and $x \in Z_2^{n-m-1}$; in $\mathcal{Q}_n$, the mth (I/O) port connects PEs u and v. See Fig. 4.
Algorithmic Issues
Emulating a Direct-Product Network. $Q_n$ is (isomorphic to) each direct-product $Q_{\lg k} \times Q_{n-\lg k}$: node $\langle x, y \rangle$ of $Q_{\lg k} \times Q_{n-\lg k}$ corresponds to node xy of $Q_n$. Therefore, the host array $\mathcal{A} \stackrel{\mathrm{def}}{=} \mathcal{Q}_n$ can emulate the product $\mathcal{Q}_{\lg k} \times \mathcal{Q}_{n-\lg k}$ of the k-PE macro-PE $\mathcal{K} \stackrel{\mathrm{def}}{=} \mathcal{Q}_{\lg k}$ and the $2^n/k$-PE macro-array $\mathcal{B}_k \stackrel{\mathrm{def}}{=} \mathcal{Q}_{n-\lg k}$ with no time loss. The emulation becomes only slightly more complicated when k is not a power of 2.

Routing Permutations. In Section 4.1.2, we provide efficient SIMD implementations of the necessary permutation-routes.
Emulating
Implementation Issues
The pseudocode for this section appears in Appendix B.
Computation, Communication, and Control. Because $Q_n$ is a direct-product network, data transfers in the macro-array $\mathcal{B}_k$ are emulated with no slowdown since the PIR of each bit-serial PE is the concatenation of the AIR and the MIR:

$$\mathrm{AIR} = \mathrm{PIR}[n - \lg k, \ldots, n - 1]; \qquad \mathrm{MIR} = \mathrm{PIR}[0, \ldots, n - \lg k - 1].$$
Letting $g = 2^{n-\lg k}$, we order the copies of macro-PE $\mathcal{K}$ so that the ith macro-PE comprises exactly those PEs of $\mathcal{Q}_n$ whose names form the set $A_i = \{a \mid a \equiv i \bmod g\}$. Within a copy of macro-PE $\mathcal{K}$, we order PEs so that the ith bit position in all macro-PEs is occupied exactly by those PEs of $\mathcal{Q}_n$ whose names form the set $\{a \mid i = \lfloor a/g \rfloor\}$. To output a k-bit word to a port j (by all macro-PEs in parallel), all bit-serial PEs output a bit to their port j, as in macro-instruction OUTPUT (j: Port). To output a bit to a port m by all bit-serial PEs within each macro-PE $\mathcal{K}$ (in parallel), all bit-serial PEs output a bit to their port $m + n - \lg k$, as in procedure AgOutput (m: Port).
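The index arithmetic of this naming scheme can be sketched as follows. The function names `split_pir`, `intra_macro_port`, and `neighbor` are hypothetical, and the sketch assumes that port m connects PEs whose indices differ in bit m counted from the least significant end, which matches the definition of $Q_n$ if int(x) reads the rightmost bit of x as least significant.

```python
def split_pir(a, n, lg_k):
    """Split PE index a into (AIR, MIR): the bit position inside the
    macro-PE (high lg k bits) and the macro-PE index (low n - lg k bits)."""
    g = 1 << (n - lg_k)
    return a // g, a % g


def intra_macro_port(m, n, lg_k):
    """Physical port that realizes port m inside a macro-PE."""
    return m + n - lg_k


def neighbor(a, port):
    """PE reached from PE a through the given physical port."""
    return a ^ (1 << port)


n, lg_k = 5, 2                      # 32 PEs grouped into 8 macro-PEs of 4 PEs
a = 0b10011                         # PE 19
air, mir = split_pir(a, n, lg_k)

across = neighbor(a, 1)             # macro-array port 1 is physical port 1
inside = neighbor(a, intra_macro_port(0, n, lg_k))
```

Moving through a macro-array port changes only the MIR field, whereas moving through an intra-macro-PE port changes only the AIR field; this is the sense in which both kinds of transfer take unit time.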
Both output and input data transfers are performed in unit time in both $\mathcal{B}_k$ and $\mathcal{K}$, hence incurring no slowdown compared to bit-serial data transfers in $\mathcal{Q}_n$. In contrast, if an $(n - \lg k)$-dimensional subcube of $\mathcal{Q}_n$ were used instead of the macro-network to compute bit-serially on k-bit-wide data, then the cost of a communication operation would be $\Theta(k)$.$^{16}$

16. In analogy to the big-O, which translates loosely to "less than or equal," the big-$\Theta$ means "exactly proportional to."

Parallel-prefix computations within each macro-PE $\mathcal{K}$ require exactly the PEs on level h of the emulated tree $T_k$ to be active at stage $\lg k - h$ of the ascending phase of the algorithm and at stage h of the descending phase. This emulation incurs no slowdown because of our PE assignment. The cost of emulating parallel-prefix is thus $O(\log k)$, as compared to its bit-serial cost of $\Theta(k)$. The constant hidden in the emulation's big-O is slightly larger than in the bit-serial expression's because the emulation operates on 2-bit data; hence, the emulated parallel-prefix is slower than the bit-serial version for very small gauges. Compensating for this slowdown are two types of speedup.

• Computational instructions are accelerated from their linear bit-serial times either to logarithmic emulated time, for arithmetic operations (in all gauges except the initial few), or to constant emulated time, for logical operations.
• Communication instructions are accelerated from linear to constant time for all gauges.

Note, therefore, that whereas many parallel algorithms accelerate computation but slow down for communication, our approach improves as the emulated computation calls for more communication!

Datapath Conversion. Our corner-turning algorithm for macro-PEs is similar to the matrix transposition algorithm MTADEA described in [21], except that we provide a detailed memory layout, augmented with a precise description of the local data movements under the SIMD regimen. The macro-instruction GAUGE(k) essentially transposes the $k \times k$ bit-matrix formed by k bit-serial k-bit memories, operating in $\lg k$ rounds, corresponding to the dimensions of the macro-PE. For $0 \le i \le \lg k - 1$, round i transposes two $2^i \times 2^i$ blocks; Fig. 5 illustrates the three-dimensional case. In contrast to [21], where entire $2^i$-bit "rows" are exchanged between adjacent PEs at once, we are careful to send one bit at a time, thereby using 1-bit buffering per bit-serial PE (the bit variable save). GAUGE k performs $k \log k$ bit-serial data transfers and $O(k \log k)$ bit-serial computation instructions, with the constant in the big-O estimated to be under 20.
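The round structure of such a transposition can be sketched with a sequential model. This is an illustration of the general dimension-by-dimension idea, not the GAUGE(k) pseudocode of Appendix B: the function name `hypercube_transpose` is hypothetical, k is assumed to be a power of 2, and the nested loops stand in for exchanges that all PEs would perform in lockstep, one memory cell at a time.

```python
def hypercube_transpose(mem):
    """Transpose the k x k matrix whose row p is the memory of PE p,
    using lg k rounds. In round d, a cell (p, c) whose row and column
    indices differ in bit d is exchanged with cell (p ^ 2**d, c ^ 2**d),
    which lives in the neighbor of PE p across hypercube dimension d."""
    k = len(mem)
    assert k >= 1 and (k & (k - 1)) == 0, "sketch assumes k is a power of 2"
    for d in range(k.bit_length() - 1):          # lg k rounds
        bit = 1 << d
        for c in range(k):
            if c & bit:
                continue                         # handled from column c ^ bit
            for p in range(k):
                if p & bit:                      # bit d of p and c differ
                    q, c2 = p ^ bit, c ^ bit
                    # One-cell exchange between neighboring PEs p and q.
                    mem[p][c], mem[q][c2] = mem[q][c2], mem[p][c]
    return mem


k = 8
memory = [[(p, c) for c in range(k)] for p in range(k)]
hypercube_transpose(memory)
```

Each round swaps bit d of the row index with bit d of the column index for every cell where the two differ, so after all $\lg k$ rounds every cell has exchanged its row and column coordinates.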
de Bruijn Networks
The base-2, order-$n$ de Bruijn network $D_n$ has node-set $Z_2^n$; each node $x$ is numbered by $\mathrm{int}(x)$, thereby linearly ordering the network (and the array). The arcs of $D_n$ and the corresponding (I/O) ports of $h_n$ are summarized in Table 1; cf. Fig. 6. Techniques similar to those described here work with relatives of $D_n$ such as shuffle-exchange [28] or perfect-shuffle [24] networks.
Algorithmic Issues
Emulating a Direct-Product Network. We sketch the algorithmic basis of an efficient emulation by the $(N = 2^n)$-PE host array $e \stackrel{\mathrm{def}}{=} h_n$ of the direct-product array $h_{\lg k} \times h_{n - \lg k}$, wherein $u = h_{\lg k}$ is the macro-PE and $f_k = h_{n - \lg k}$ is the macro-array. A nontrivial emulation is needed here because de Bruijn networks do not enjoy a direct-product structure.

Node assignment. Assign nodes of $D_{\lg k} \times D_{n - \lg k}$ to nodes of $D_n$ by concatenating node-names: node $\langle x, y \rangle \in Z_2^{\lg k} \times Z_2^{n - \lg k}$ is assigned to node $xy \in Z_2^n$.

For the routing of arcs, we:
1. implement only "successor" moves, SHUFFLE and SHUFFLE-EXCHANGE, leaving the "predecessor" moves, UNSHUFFLE and UNSHUFFLE-EXCHANGE, to the reader;
2. assume that $\lg k \le n/2$, which mandates our rewriting the first-coordinate string of node $\langle x, y \rangle$ in Fig. 7, rather than the second. (The reverse inequality mandates the complementary choice.)
• Route the SHUFFLE(-EXCHANGE) arc $(\langle x, y \rangle, \langle x', y \rangle)$ within copy $y$ of $D_{\lg k}$ via the length-$2\lg k$ path from node $xy$ to node $x'y$ in $D_n$ depicted in the left-hand side of Fig. 7 (wherein move names are abbreviated to save space).
• Route the SHUFFLE(-EXCHANGE) arc $(\langle x, y \rangle, \langle x, y' \rangle)$ between copies $y$ and $y'$ of $D_{\lg k}$ via the length-$(2\lg k + 1)$ path from node $xy$ to node $xy'$ in $D_n$ depicted in the right-hand side of Fig. 7.

It is shown in [1] how to orchestrate the traversals of the indicated link-routing paths in $h_n$ so that only constantly many messages traverse any single link at one time, independent of $n$ and $k$. This orchestration is consistent with an SIMD regimen and assures that our emulation incurs only $O(\log k)$ slowdown, with a small constant in the big-$O$.
Routing Permutations. In Section 4.2.2, we provide efficient SIMD implementations of the necessary permutation-routes.
Emulating a Complete Binary Tree (Level by Level). Our level-by-level emulation of the complete binary tree by $h_{\lg k}$ derives from [24] (wherein the technique is used on the perfect-shuffle network). We assign each node $x$ of the tree to node $0^{\lg k - \mathrm{lgth}(x)}x$ of $h_{\lg k}$, and we route arcs of the tree via shortest paths.
• Route arc $(x, x0)$ of the tree in $h_{\lg k}$ via the SHUFFLE arc leaving the image of $x$.
• Route arc $(x, x1)$ of the tree in $h_{\lg k}$ via the SHUFFLE-EXCHANGE arc leaving the image of $x$.
• Route the predecessor arc $(xb, x)$, for $b \in \{0, 1\}$, of the tree in $h_{\lg k}$ via the "inverse" of the arc that is used to route $(x, xb)$.

Since each arc of the tree is routed along a single arc of $h_{\lg k}$, this emulation incurs no slowdown.
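The node assignment and arc routing for the tree emulation can be checked mechanically for small sizes. The Python sketch below makes two assumptions not spelled out in this section: SHUFFLE is taken to be a cyclic left shift of the node's bit string, and SHUFFLE-EXCHANGE a cyclic left shift followed by complementing the last bit (the usual conventions; Table 1 is authoritative). Tree nodes are bit strings, with the root as the empty string, and all function names are illustrative.

```python
from itertools import product


def shuffle(node):
    """Assumed SHUFFLE move: cyclic left shift of the bit string."""
    return node[1:] + node[0]


def shuffle_exchange(node):
    """Assumed SHUFFLE-EXCHANGE move: cyclic left shift, last bit complemented."""
    return node[1:] + ("1" if node[0] == "0" else "0")


def host(x, m):
    """Image of tree node x (a bit string of length <= m): 0^(m - lgth(x)) x."""
    return "0" * (m - len(x)) + x


def level_nodes(level):
    """All tree nodes on a given level, as bit strings (root = empty string)."""
    return ["".join(bits) for bits in product("01", repeat=level)]


def check_tree_emulation(m):
    """Check the assignment level by level for a tree of height m."""
    for level in range(m + 1):
        nodes = level_nodes(level)
        # Within one level, distinct tree nodes must get distinct host nodes.
        if len({host(x, m) for x in nodes}) != len(nodes):
            return False
        if level == m:
            continue  # leaves have no children
        for x in nodes:
            if shuffle(host(x, m)) != host(x + "0", m):
                return False
            if shuffle_exchange(host(x, m)) != host(x + "1", m):
                return False
    return True
```

Under these assumptions the assignment is one-to-one only within a level: for example, tree nodes 1 and 01 share a host node, which is why the emulation proceeds level by level.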
Implementation Issues
The pseudocode for this section appears in Appendix C.
Computation, Communication, and Control. For all $k \le 2^n$, the direct-product $h_{\lg k} \times h_{n - \lg k}$ can be emulated by the order-$n$ de Bruijn array $h_n$ with slowdown $O(\min(\lg k, n - \lg k))$; cf. Section 4.2.1, Emulating a Direct-Product Network. Assuming that $\lg k \le n/2$, we let the PIR of each bit-serial PE be the concatenation of the AIR and the MIR.
Letting $g = 2^{n - \lg k}$, we order the copies of macro-PE $u$ so that the $i$th macro-PE comprises exactly those PEs of $h_n$ whose names form the set $A_i = \{a \mid a \equiv i \bmod g\}$. The communication macro-instruction that outputs a $k$-bit word to a macro-port SHUFFLE in $f_k$ (other macro-ports being handled analogously) follows the routing function in the right-hand side of Fig. 7. Starting at PE $xy$, the routing makes $\lg k + 1$ SHUFFLE or SHUFFLE-EXCHANGE hops to node $y'x$ and, thence, $\lg k$ UNSHUFFLE hops to the target node $xy'$. During each of the first $\lg k + 1$ hops, the SIMD regimen forces us to spend one transfer cycle for the communication through SHUFFLE ports and another for the communication through SHUFFLE-EXCHANGE ports. Two more cycles per hop are spent on "memorizing" the most significant PIR bit of the previous node in the sequence, to make it equal the least significant bit of the next node. PEs with 0 in the most significant bit of the PIR send in the first two cycles; PEs with 1 in this bit send in the last two cycles. Four bits of memory buffer the bits as they are input during one 4-cycle hop and output during the next. OUTPUT(SHUFFLE) performs $5\lg k + 4$ bit-serial data transfers and $O(\log k)$ bit-serial computation instructions, where the constant in the big-$O$ is estimated to be under 10.

Fig. 7. Left: The length-$2\lg k$ path from node $xy$ to node $x'y$ in $D_n$. Right: The length-$(2\lg k + 1)$ path from node $xy$ to node $xy'$ in $D_n$.
In contrast to the situation with hypercubes, the $O(\log k)$ cost of a communication step in a multigauge de Bruijn array cannot be compared to the cost incurred in bit-serial processing of the same data, as it is not clear that bit-serial processing is even possible. To wit, whereas $Q_{n - \lg k}$ is a subnetwork of $Q_n$ for any $\lg k \le n$, there is no known way to identify a copy of $D_{n - \lg k}$ within $D_n$ in any nontrivial case.
Within a copy of macro-PE $u$, we order PEs so that the $i$th bit-position in all macro-PEs is occupied exactly by those PEs of $h_n$ whose names form the set $A_i = \{a \mid i = \lfloor a/g \rfloor\}$.
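Both orderings can be read off a PE's serial name with integer arithmetic. The Python sketch below assumes $g = 2^{n - \lg k}$ and the reading in which the macro-PE index of PE $a$ is $a \bmod g$ and its bit position is $\lfloor a/g \rfloor$; this matches the concatenated name $xy$, since $\mathrm{int}(xy) = \mathrm{int}(x) \cdot g + \mathrm{int}(y)$. The function names are illustrative and do not appear in the paper.

```python
def macro_pe_index(a, n, lg_k):
    """Index of the macro-PE that contains PE a (assumed to be a mod g)."""
    g = 1 << (n - lg_k)
    return a % g


def bit_position(a, n, lg_k):
    """Bit position that PE a occupies inside its macro-PE (assumed floor(a / g))."""
    g = 1 << (n - lg_k)
    return a // g


def layout(n, lg_k):
    """Map each macro-PE index to its PEs, listed in order of bit position."""
    g = 1 << (n - lg_k)
    k = 1 << lg_k
    return {i: [i + g * pos for pos in range(k)] for i in range(g)}
```

For example, with $n = 4$ and $\lg k = 2$, PE 13 (binary 1101) has $x = 11$ and $y = 01$, so it occupies bit position 3 of macro-PE 1.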
The macro-PE data-transfer primitive AgOutput requires one bit in the PIR of the PE at hop $\lg k$ to depend on a bit in the PIR of the first PE in the sequence; cf. Fig. 7. This precludes memorizing this bit on the way; therefore, two complete rounds of $2\lg k$ cycles (one for each value of the bit) must be performed in sequence. The number of bit-serial data transfers is $6\lg k$, while the number of computation instructions is $O(\log k)$, with the constant in the big-$O$ estimated to be under 10.
As described in Section 4.2.1 (Emulating a Complete Binary Tree (Level by Level)), parallel-prefix computations within each macro-PE are performed by emulating a complete binary tree in the manner of [24]. In this emulation, each nonleaf tree-PE $x$ communicates with its left child $x0$ through its SHUFFLE port and with its right child $x1$ through its SHUFFLE-EXCHANGE port; each nonroot tree-PE $x0$ (respectively, $x1$) communicates with its parent $x$ through its UNSHUFFLE port (respectively, its UNSHUFFLE-EXCHANGE port). The emulation takes time $O(\log k)$; but, recall, $u$ is itself implemented via an emulation (of $h_{\lg k} \times h_{n - \lg k}$ by $h_n$) which takes time $O(\log k)$ to emulate each communication step of $u$. Therefore, each $k$-bit parallel-prefix computation takes $O(\log^2 k)$ steps of $h_n$.
Datapath Conversion. Our pseudocode for datapath conversion in $u$ is designed to reveal the similarity between the implementations for $h_n$ and the hypercube array (in global communication and data layout). Communication for $h_n$ is slower by a factor of $O(\log k)$, reflecting the emulation overhead for macro-PE data transfers. So, the "corner-turning" procedure takes $O(k \log^2 k)$ steps per $k$-word block of $k$-bit words.
Extended Coterie Networks
The final networks we study are extended coterie networks (ECNs), which differ from hypercubes and de Bruijn networks in two respects. Each ECN:
1. is the union of a graph and a hypergraph, a graphlike structure whose "edges" may "connect" sets of nodes of any cardinality;
2. is a dynamic structure whose hypergraph component may change at each communication step.

The dynamic quality of ECNs requires us to parameterize them with a time-index as well as a size-index. For each $t = 0, 1, 2, \ldots$, the time-$t$, $n \times n$ ECN $C^t_n$ has node-set $Z_n^2$. Each node $\langle i, j \rangle$ of $C^t_n$ is incident to arcs leading to and from the (at most four) nodes $\langle i', j' \rangle$ for which $|i - i'| + |j - j'| = 1$; thus, the graph component of $C^t_n$ is a directed $n \times n$ mesh. Additionally, each node $v$ of $C^t_n$ is incident to precisely one coterie-hyperedge: a subset $S$ of $Z_n^2$ that contains $v$ and that induces a connected subgraph of the $n \times n$ mesh. Initially (i.e., in $C^0_n$) each $v$ resides in a singleton coterie-hyperedge; in subsequent steps, coteries form in unpredictable ways, with no dependencies from time $t$ to time $t + 1$.

We linearize the nodes of $C^t_n$ (and of array $g^t_n$) in row-major order; for economy, we let "$i, j$" denote both the node-name $\langle i, j \rangle$ of a PE of $g^t_n$ and the PE's serial-name, $in + j$; context will always disambiguate the intended usage. Each PE in $g^t_n$ has input and output ports that link it to its North, South, East, and West neighbors, PEs $(i \pm 1, j)$ and $(i, j \pm 1)$. (Of course, boundary-PEs lack some of these neighbors.) Additionally, each PE has an I/O port connecting it to precisely one coterie: PEs $\langle i, j \rangle$ and $\langle k, l \rangle$ of $g^t_n$ are connected at time $t$ to the same coterie just when nodes $\langle i, j \rangle$ and $\langle k, l \rangle$ of $C^t_n$ are incident to the same coterie-hyperedge. A coterie is a bus in the sense that, at any time, a single incident PE can "talk" while all other incident PEs "listen." See Fig. 8.
Techniques similar to those described here will work on relatives of the ECN, such as the RMesh [17] , the Polymorphic Torus [15] , and the networks of the Illiac-III [18] and Clip-4 [6] .
Algorithmic Issues
We assess unit time for each point-to-point communication in $g^t_n$ as with other arrays. For each coterie communication, such an assessment is unrealistic (because of switch traversals and distances) unless we restrict the side-dimension $n$ to a fixed bound ($n = 512$ in the implemented ECN of [30]). Our algorithmic techniques extend readily to a scalable version of the ECN, but scalability would demand a delay model that assesses time linear in the diameter of a coterie for each communication step. Choosing practicality over scalability, we henceforth fix $n$ (without specifying its value) and assess one step for broadcasting a message along a coterie. (We justify this cost assessment in Section 4.3.2.)
Emulating a Direct-Product Network. In deference to self-similarity, we have the host array $e \stackrel{\mathrm{def}}{=} g^t_n$ emulate the direct product (footnote 17) $g^t_{\sqrt{k}} \times g^t_{n/\sqrt{k}}$, wherein $u = g^t_{\sqrt{k}}$ is the macro-PE and $f_k = g^t_{n/\sqrt{k}}$ is the macro-array. Such aggregation provides data-transfer times that are symmetric in the row and column directions and, more importantly, ...

• We route tree-link $x \to x0$ via the coterie $P_x$.
• We route tree-link $x \to x1$ via a "null" link: tree-PEs $x$ and $x1$ reside in the same ECN-PE.
• We route the predecessor tree-link $xb \to x$, for $b \in \{0, 1\}$, by reversing the path used to route link $x \to xb$.
Implementation Issues
The pseudocode for this section appears in Appendix D.
We illustrate the implementation of our EC-array emulations via the Content Addressable Array Parallel Processor (CAAPP) [30]. The CAAPP's two communication networks (a nearest-neighbor mesh and a reconfigurable "coterie network") correspond, respectively, to the mesh links and coteries of an EC-array. All CAAPP instructions take a single cycle. Data transfers in the mesh take one instruction for data in any PE register. Transfers in the coterie network require three instructions: data is transferred to the transceiver (X) register, broadcast, then read. Coterie switch settings are modified by a (single-instruction) write to the MR register.
Macro-PE to Macro-PE Data Transfer.

The Mesh Network. The primitive operation of the mesh network is a parallel transfer of data to the neighboring PE in a specified direction. The corresponding macro-array instruction is identical except that it transfers $k$-bit words rather than single bits. Since the geometry of the macro-PE is a $\sqrt{k} \times \sqrt{k}$ mesh, the macro instruction requires only $\sqrt{k}$ communication instructions to perform the transfer, whereas the equivalent bit-serial procedure requires $k$ such instructions.

The Coterie Network. Implementing the macro-array coterie network requires two primitive operations:
• setting the switches of the macro-array $g^t_{n/\sqrt{k}}$;
• transferring data among macro-PEs that belong to the same coterie.
The first of these operations involves setting the appropriate switches in the host array $g^t_n$; the second involves time-multiplexing the communication links when multiple PEs in a copy of macro-PE $g^t_{\sqrt{k}}$ are sending data. We consider each in turn.
Setting macro-PE switches. The switches of each macro-PE of $g^t_{n/\sqrt{k}}$ are emulated by switches of its $k$ constituent host-array PEs. During each macro-array computation step, all internal switches (those that connect PEs within the same macro-PE) are closed so that each macro-PE will act as a unit. During each communication step, all external switches (those that connect PEs within distinct macro-PEs) in a given (say, easterly) direction are set according to the setting of the corresponding (logical) macro-PE switch. For example, if the E switch of the macro-PE is closed, then the E switches of all $\sqrt{k}$ PEs on the east side of the macro-PE will also be closed.
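The effect of this switch-setting rule on the host array can be illustrated with a small simulation. The Python sketch below uses a deliberately simplified switch model that is an assumption of this example, not a description of the CAAPP: each macro-PE has one logical E switch and one logical S switch, a host link inside a macro-PE is always closed, and a host link that crosses a macro-PE boundary is closed exactly when the logical switch of the macro-PE on its west (or north) side is closed. Host PEs are numbered in row-major order, and the name `host_coteries` is hypothetical.

```python
def host_coteries(n, s, macro_east, macro_south):
    """Coteries induced in an n x n host array by macro-PE switch settings.

    s is the macro-PE side length; macro_east[r][c] (macro_south[r][c]) is
    True when the logical E (S) switch of macro-PE (r, c) is closed.
    Returns the coteries as sorted lists of row-major PE numbers.
    """
    parent = list(range(n * n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(n):
        for j in range(n):
            r, c = i // s, j // s           # macro-PE containing host PE (i, j)
            if j + 1 < n:                   # link to the east neighbor
                internal = (j + 1) // s == c
                if internal or macro_east[r][c]:
                    union(i * n + j, i * n + j + 1)
            if i + 1 < n:                   # link to the south neighbor
                internal = (i + 1) // s == r
                if internal or macro_south[r][c]:
                    union(i * n + j, (i + 1) * n + j)

    groups = {}
    for v in range(n * n):
        groups.setdefault(find(v), []).append(v)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

With all logical switches open, every macro-PE forms its own coterie; closing the E switch of one macro-PE merges it with its eastern neighbor, so the host coteries mirror the macro-array coteries.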
One programs the macro-coterie network by setting the macro-PE equivalent of the MR with the low-order 4 bits from a specified memory location starting at, say, location MemLoc. The program first creates coteries corresponding to each macro-PE (closed internal, open external switches) and then has PEs 0 through 3 successively broadcast their MemLoc values. In response, each PE in the macro-PE sets its switches according to the value of the input signal and its position within the macro-PE. The following masks are used.
IsolateMacroPEMask: contains the initial switch settings.
MacroPEMask(0, 1, 2, 3): specifies selection of macro-PE MR PEs.
SwitchMask(N, E, W, S): indicates which PEs participate in the setting of the macro-PE switches in each direction.
The procedure requires 18 cycles: two cycles for overhead (setting the initial mask and performing the final load) and four cycles per direction (set send mask, send, set receive mask, receive).
Intermacro-PE Data Transfers. The actual data transfer proceeds by having macro-PEs broadcast $k$-bit words (from location MemLocSend to location MemLocReceive) bit-by-bit. (Recall that the PEs in each macro-PE belong to the same coterie.) The locations of the broadcasting and receiving macro-PEs are stored in variables BroadcastMask and ReceiveMask, respectively. The macro-array instruction that accomplishes this broadcast takes $4k$ steps: for each of $k$ bits, the appropriate sending PE is masked, the data are sent, the appropriate receiving PE is masked, and the data are received.
Datapath Conversion. Whereas a permutation-routing implementation of gauge-$k$ corner-turning requires time $\Theta(k^{5/2})$ on the CAAPP, one can achieve the same result in time $O(k^2)$ via the CAAPP's coterie-based broadcast capability. Each macro-PE isolates itself as a coterie, then (sequentially within each coterie, but in parallel among distinct coteries) has each PE broadcast its bit to the appropriate destination. This procedure takes $4k^2 + 1$ machine cycles by using the same masking and broadcast technique as above for each of the $k$ memory bits in each of the $k$ PEs per macro-PE.
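The cycle counts quoted for the CAAPP procedures can be tallied directly. The short Python sketch below simply encodes the counts as read from the text (2 overhead cycles plus 4 cycles per direction for switch setting, 4 cycles per bit for a macro-PE broadcast, and $4k^2 + 1$ cycles for corner-turning); the function names are illustrative, and the formulas should be checked against Appendix D before being relied upon.

```python
def switch_setting_cycles(directions=4):
    """2 overhead cycles plus 4 cycles (set send mask, send, set receive mask,
    receive) for each of the N, E, W, S directions."""
    return 2 + 4 * directions


def transfer_cycles(k):
    """Broadcast of one k-bit word: mask sender, send, mask receiver, receive per bit."""
    return 4 * k


def corner_turning_cycles(k):
    """Coterie-based corner-turning: 4 cycles for each of k bits in each of k PEs,
    plus 1 cycle, as read from the text."""
    return 4 * k * k + 1
```

For gauge $k = 16$, for instance, these formulas give 18 cycles to set the macro-PE switches, 64 cycles per word transfer, and 1025 cycles per corner-turn.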
Parallel-Prefix Computations. Employing a strategy from [17], we can use the CAAPP's coterie-broadcast capability to halve the number of communication steps required by the standard parallel-prefix tree-sweep algorithm. In the broadcast-based algorithm, as data passes up the tree, being combined with sibling data at each node, we broadcast the partial result computed at each node to all descendants of its right-hand sibling. This strategy obviates the second phase of the standard algorithm, wherein partial results are passed down the tree.

The algorithm is implemented by creating coteries during each phase $i = 0, \ldots, \log k - 1$ to emulate level $i$ of the tree. Each phase-$i$ coterie contains: one level-$(\log k - i - 1)$ PE, both of its children, and all descendants of its right child. These coteries are constructed as follows. Since our binary-tree emulation has the same EC-PE emulate each tree node and the node's right child, the link from each left child to its parent also contains the right child. Moreover, for the emulation of the lower $\log \sqrt{k}$ tree-levels, all descendants of the right child are also part of that coterie. The emulation of the upper $\log \sqrt{k}$ levels of the tree is not quite so direct, but is still straightforward.

The following operations take place during each phase $i$: $2^{\log k - i - 1}$ coteries are formed (indexed by $j$), each consisting of $2^i + 1$ tree-nodes. One PE in each coterie broadcasts its data; the remaining PEs receive that data and combine it with their own. Using row-major indexing within macro-PEs, the PEs within the $j$th coterie during iteration $i$ are specified as follows: the sender PE is computed by taking $i$ ones (i.e., $2^i - 1$) and adding $j 2^{i+1}$; the receivers are the $2^i$ PEs numbered consecutively from the sender. The resulting procedure requires $6 \log k$ cycles: for each level, the coterie switches must be set, the broadcaster masked, the data sent, the receiver masked, the data received, and the data combined.
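The sender and receiver indexing of the broadcast-based algorithm can be simulated sequentially. The Python sketch below assumes the reading in which, during phase $i$, the sender of coterie $j$ is PE $(2^i - 1) + j 2^{i+1}$ and its receivers are the next $2^i$ PEs; it models each broadcast as a plain loop and says nothing about switch setting or masking. The name `broadcast_prefix` and the `combine` parameter are illustrative.

```python
from operator import add


def broadcast_prefix(values, combine=add):
    """Inclusive prefix computation in log k broadcast phases (k a power of two).

    combine must be associative; the sender's value is always the left operand,
    so non-commutative operators are handled correctly.
    """
    k = len(values)
    assert k > 0 and (k & (k - 1)) == 0, "k must be a power of two"
    data = list(values)
    phases = k.bit_length() - 1                   # log k
    for i in range(phases):
        block = 1 << (i + 1)
        for j in range(k // block):               # 2^(log k - i - 1) coteries
            sender = ((1 << i) - 1) + j * block   # i ones, plus j * 2^(i+1)
            sent = data[sender]                   # the broadcast value
            for r in range(sender + 1, sender + (1 << i) + 1):
                data[r] = combine(sent, data[r])  # each receiver combines
    return data
```

Before phase $i$, every PE holds the prefix result within its aligned block of $2^i$ PEs; the sender is the last PE of the left half of a block of $2^{i+1}$ PEs, so its value is exactly what the right half still lacks.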
CONCLUSION
We have presented and illustrated a strategy for emulating a family $\{f_k\}_{k \in N}$ of word-parallel SIMD processor arrays on the family's bit-serial instance $e = f_1$. Our goal has been a collection of semantically consistent, computationally efficient virtual-machine instruction sets, indexed by gauge. We believe that the most appropriate method of assessing the efficiency of our approach is to compare the cost of the various macro-instructions we have implemented to the cost of achieving the same functionality within a purely bit-serial computation regimen. While an exact assessment would require details of network topology and specifics of the node-architecture, the big-$O$ assessments that we present indicate that our approach outperforms its bit-serial alternative, at least for sufficiently large values of $k$. Since the constants hidden in the big-$O$s are small (as noted throughout the text), the "eventually" of the asymptotics eventuates at modest values of $k$. This comparison is more notable since we have taken no pains to optimize our multigauge virtual machines, because the problem of achieving multigauge behavior has been a vehicle for illustrating a general algorithm-based approach to achieving architectural enhancements, rather than an end in itself.
The following tables compare the time required for three classes of $k$-bit operations when implemented on bit-serial hypercube, de Bruijn, and EC arrays: first using our emulation algorithms, and then using straightforward software implementation, without emulations. See Table 2, Table 3, and Table 4 for items 1, 2, and 3, respectively.
1. Operations that are cheap ($O(1)$ circuitry) if multigauge behavior is implemented in hardware [26] (e.g., communication, bit-wise logic, memory reference).
2. Operations that are moderate in cost ($O(k)$ circuitry) if multigauge behavior is implemented in hardware [26] (e.g., arithmetic).
3. Operations whose hardware complexity is substantial and detail-dependent [26] (e.g., shifting and multiplication). Operations in this class would not typically appear in the native instruction set of a bit-serial machine but are common in a word-parallel one.

The preceding assessment considers only recurring costs, ignoring the cost of datapath conversion, which is incurred precisely once each time a new gauge is selected. As noted in Section 3.2, "turning the corner" from bit-serial mode to gauge $k$ is never more expensive than $k$ deterministic offline permutation routes within the emulated $k$-PE macro-PEs (multiplied by any overhead needed to emulate these macro-PEs); significantly, the particular forms of the needed permutations usually allow more efficient routings than for general permutations.
APPENDIX A PSEUDOCODE FOR TOPOLOGY-INDEPENDENT MULTIGAUGING
Topology-independent datapath conversion macro-instruction GAUGE(k):

{ LeftShift                        -- shift within each macro-PE to make data accessible
  for j := 0 to k - 1 do {         -- for all bits
    route(j, k)                    -- invoke the permutation
    right }                        -- shift next bit
  RightShift }                     -- shift within each macro-PE to restore data in memory

(i + 1)-position left shift in the ith PE:

procedure LeftShift:
{ left                             -- shift by 1 position
  for m := 0 to lg k - 1 do {      -- for bits in PIR
    if (PIR[right] = 1) then {     -- for each 1
      for j := 1 to 2^m do {       -- as many shifts
        left }}}}
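The way a single SIMD instruction stream can give every PE a different shift amount may be easier to see in a simulation. The Python sketch below rests on several assumptions that the pseudocode does not confirm: the primitive left is modeled as a cyclic one-position rotation of a PE's memory, the test in iteration $m$ is taken to inspect bit $m$ of the PE's index within its macro-PE, and the intended net effect is an $(i + 1)$-position left shift in the $i$th PE. All names are illustrative.

```python
def left(memory):
    """Assumed model of the `left` primitive: cyclic rotation by one position."""
    return memory[1:] + memory[:1]


def left_shift_all(memories):
    """Run one SIMD instruction stream over all k PEs of a macro-PE.

    memories[i] is the memory of PE i; k must be a power of two. Every PE sees
    the same k instruction slots and is masked by the bits of its own index.
    """
    k = len(memories)
    assert k > 0 and (k & (k - 1)) == 0, "k must be a power of two"
    lg_k = k.bit_length() - 1
    mems = [left(list(mem)) for mem in memories]      # unconditional 1-position shift
    for m in range(lg_k):                             # for each index bit
        for _ in range(1 << m):                       # 2^m instruction slots
            # One SIMD instruction: only PEs whose index has bit m set take part.
            mems = [left(mem) if (i >> m) & 1 else mem
                    for i, mem in enumerate(mems)]
    return mems
```

Every PE executes the same $1 + (2^0 + 2^1 + \cdots + 2^{\lg k - 1}) = k$ instruction slots, but PE $i$ is active in only $i + 1$ of them, so no indexed memory access is needed.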
