Abstract-Three-dimensional packaging technologies are critical for enabling ultra-compact, massively parallel processors (MPPs) for embedded applications. Through-wafer optical interconnect has been proposed as a useful technology for building ultra-compact MPPs since it provides a simplified mechanism for interconnecting stacked multichip substrates. This paper presents the offset cube, a new network topology designed to exploit the packaging benefits of through-wafer optical interconnect in ultra-compact MPP systems. We validate the offset cube's topological efficiency by developing deadlock-free adaptive routing protocols with modest virtual channel requirements (only two virtual channels per link needed for full adaptivity). A preliminary analysis of router complexity suggests these protocols can be efficiently implemented in hardware. We also present a 3D mesh embedding for the offset cube. Network simulations show the offset cube performs comparably to a bidirectional 3D mesh of equal size under uniform, hot-spot, and trace-driven traffic loads. While the offset cube is not proposed as a general replacement for the mesh topology, it leverages the benefits of through-wafer optical interconnect more effectively than a mesh by completely eliminating chip-to-chip wires for data signals. Hence, the offset cube is an effective topology for interconnecting ultra-compact MCM-level MPP systems.
INTRODUCTION
The performance, cost, and physical complexity of massively parallel processing systems (MPPs) are greatly influenced by the chosen network topology and its physical packaging. To optimize cost-performance, the network architect must view the system as a topology-technology pair, selecting the topology that most effectively leverages the benefits of a particular packaging approach. Especially desirable for MPP systems are topology-technology pairings that distribute computing resources and interconnect uniformly throughout a volume. Such 3D packaging solutions improve signaling speed by minimizing the length of chip-to-chip signal paths and improve throughput by maximizing the number of interconnects crossing the network bisection. 3D packaging technologies also enable ultra-compact MPPs [33], [44], [52] for embedded applications by minimizing system volume and footprint.
In keeping with these design goals, a number of studies have advocated the use of direct network topologies such as low-dimensional meshes and tori [1], [2], [14], [43]. A frequently cited rationale is that the simple point-to-point connections in such networks can be embedded into a physical, three-dimensional structure without extensive folding of network links. Unfortunately, exclusive use of packaging technologies involving planar substrates (e.g., VLSI, printed circuit boards (PCBs), and multichip modules (MCMs)) does not result in an optimal embedding since these technologies provide high-density interconnects in only two physical dimensions. For a balanced system, high-performance interconnects must be provided in the third dimension as well, necessitating a technology for vertical communication between stacked substrates.
Methods for interconnecting stacked substrates can be classified according to how spatially centralized or distributed the vertical interconnect is. An extreme example of centralized interconnect is a conventional wire-based backplane in which vertical interconnects are routed to the periphery of each substrate and then through the backplane itself. This approach is an industry standard for packaging of electronic PCBs, and there is a great deal of industrial interest in applying optical technologies to the backplane packaging model. Despite its popularity, the traditional backplane exhibits some disadvantages: It increases wire lengths, constrains the number of connections between substrates, and uses in-plane wiring resources that could be otherwise devoted to connecting chips occupying the same substrate. At the other extreme, vertical interconnects could be uniformly distributed across the substrates being connected. This enhances system scalability by keeping signal paths short and using a minimal amount of the substrate's in-plane wiring for out-of-plane communication. Intermediate approaches which localize vertical interconnects at a smaller number of distributed sites are also possible.
Through-wafer optical signaling offers one method for realizing distributed vertical interconnects between stacked multichip substrates. In through-wafer schemes, vertical interconnects are obtained using aligned optical emitter/detector pairs distributed across the area of the unpackaged chips populating the substrate. The integrated optical I/O devices can be low-cost, hybrid, monolithic, thin-film InGaAsP/InP emitters/detectors [6]. Because these devices operate at a wavelength (1.3-1.5 µm) to which silicon is transparent, optical signals can be exchanged between vertically aligned chips through the intervening chip carrier substrate. Through-wafer interconnects keep vertical signaling paths short since they are "line-of-sight" optical connections which do not have to be routed through multiple levels of the packaging hierarchy using wires or waveguides. Moreover, through-wafer interconnects can be uniformly distributed over the surface of a populated multichip substrate. The highly integrated nature of through-wafer interconnects allows the implementation of ultra-compact MCM stacks.
Although a number of studies have investigated the technical feasibility of through-wafer optical links [6], [24], [35], no studies have explored what network topologies best exploit this technology from a packaging standpoint. An obvious evolutionary strategy is to use through-wafer signaling to enhance the physical implementation of an existing network topology. For example, through-wafer optical links could be used to implement the z-dimension channels of a 3D mesh network built using stacked MCMs. This approach is attractive since it merely involves a technology refresh for a well-understood and widely used topology. The disadvantage is that an established topology is not guaranteed to make optimal use of the new technology.
In this paper, we present and evaluate a new 3D network topology, the offset cube, which is particularly amenable to a through-wafer optoelectronic implementation. The offset cube has the important property that all network links can be implemented with vertical, through-wafer optical connections; no chip-to-chip wires are necessary for data transmission. In contrast, a 3D mesh using through-wafer connections would likely use optical links only for the z-dimension channels; x- and y-dimension channels would still require high-density wiring layers within each planar substrate. The chip carrier substrate used in the offset cube is much simpler since wiring is needed only for power and ground distribution. Because the cost of an advanced MCM substrate with multiple layers of high-density interconnects is nontrivial, the total system cost of the all-optical offset cube implementation is potentially lower than the hybrid optical-electrical mesh implementation. Another potential disadvantage of the mesh implementation is that the routing of in-plane connections is restricted by the need to avoid blocking the transmission of through-wafer optical signals, limiting the extent to which signals can be routed under chip sites. If a network's size and channel widths are fixed, this either increases substrate area or reduces the number of chips per substrate (increasing the number of substrates required). Finally, if technology trends eventually favor optical over wire-based interconnect (for speed, power, or density reasons), the offset cube's all-optical implementation will provide more balanced performance than networks using both electrical and optical links. This paper does not advocate the use of through-wafer technology over other wire-based and optical interconnection approaches for every situation, nor does it propose the offset cube as a general replacement for other established network topologies (such as meshes, tori, and hypercubes).
We simply note that the offset cube topology potentially simplifies the cost and complexity of the packaging hierarchy in ultra-compact MCM-level MPP systems built using through-wafer optical interconnect and therefore warrants further investigation at an architectural level. The results in this paper show that topological efficiency in the offset cube is not sacrificed for these packaging benefits.
In our architectural evaluation of the offset cube, we use the 3D bidirectional mesh as a standard for comparison because its performance advantages under bisection bandwidth and pin-out constraints over other direct topologies (such as binary hypercubes) have been demonstrated in a number of studies [1] , [2] , [14] , [43] . The objective is not to outperform the mesh topology, but to maintain competitive performance while preserving the packaging benefits of through-wafer optical interconnects. Our evaluation of the offset cube topology is driven by four metrics:
• Topological Efficiency. Section 3.3 analyzes the offset cube using traditional figures of merit such as average routing distance, network diameter, bisection bandwidth, and no-load message latency.
• Visualization/Usability. Section 3.4 presents a method of embedding a 3D mesh network into the offset cube so that the visualization and usability advantages of a 3D mesh can be leveraged.
• Routing Protocol Performance. Section 4 develops low-overhead deadlock-free routing protocols for the offset cube. Flit-level network simulations (Section 6) of these protocols assuming an approximate constant pin-out constraint show that the latency-throughput characteristics of the offset cube are comparable to those of an equal-size 3D mesh under a variety of synthetic and trace-driven traffic loads.
• Routing Logic Complexity. Because the offset cube's routing protocols are more complex than those used in a mesh, the impact on hardware routing delays must be considered. Although an in-depth study is beyond the scope of this paper, a preliminary analysis in Section 5 suggests that routing delays in the offset cube can be made comparable with those in a mesh router.
The remainder of this paper is organized as follows. Section 2 summarizes related work on stacked substrate architectures, optical interconnect technologies, and the offset cube topology. Section 3 formally defines the topology after briefly describing its through-wafer optoelectronic implementation. Section 4 develops deadlock-free routing protocols for the offset cube. Section 5 briefly considers the hardware implementation complexity of these protocols. Section 6 presents simulation results comparing the performance of deadlock-free routing protocols for the offset cube and 3D mesh topologies, and Section 7 contains our conclusions and agenda for future research.
RELATED WORK
A number of MCM-level/wafer-scale stacked substrate architectures have been put forth in the literature [19], [33], [44], [52], [54]. Proposed wire-based approaches for communicating between stacked substrates include thermomigrated aluminum feedthroughs [48], laser-drilled through-wafer vias [28], microspring bridges [33], chemically etched through-wafer vias [32], and collapsible pressure contacts such as "fuzz-buttons" [19]. Various guided-wave and free-space optical interconnect technologies (OITs) have been proposed and/or demonstrated at all levels of the interconnect hierarchy. Most interest currently focuses on the hybrid (interchip) and backplane (interboard) levels since it has been established that electrical interconnects outperform optical interconnects over sufficiently short distances [20], [21]. However, a recent study [55] indicates that free-space optical interconnections can offer a speed-energy product advantage as high as 30 over electrical interconnect technologies at distances characteristic of on-chip (up to wafer scale) and off-chip (MCM) situations. Thus, optical interconnects have significant potential for providing high-performance vertical interconnects in ultra-compact stacked wafer systems.
The optical interconnect (OI) technological models appearing in the literature have provided insight into the intrinsic limits of free-space and guided-wave optical interconnect technologies (power, density, etc.), the conditions that make a particular optical link superior to an equivalent electrical link, and the scalability of the OI configuration. A number of guided-wave interconnect technological studies have been put forth which focus on the impact attenuation, dispersion, and fan-out have on signal transmission [45], and on the development of crosstalk models for optical waveguide arrays [4], [29], [47]. In the case of free-space optical interconnects, technological models have been put forth that allow the determination of the break-even length for which a free-space optical link becomes superior to an electrical link in terms of power and speed considerations and interconnection density limitations [20], [21]. While specific optical free-space and guided-wave optical backplane configurations have been proposed and/or developed [41], [42], [46], [49], these technologies can be quite bulky and as a result are not always applicable to chip-to-chip interconnections in ultra-compact system implementations, which is the focus of this paper. For such ultra-compact (MCM-level) systems, a variety of approaches have been put forth [10], with through-wafer optical interconnects [24] having been demonstrated to provide a potential solution to the problem of high-bandwidth intermodule interconnects.
Optical interconnection studies concerning the interaction between multicomputer packaging and interconnection topologies have traditionally focused on the technological component, with the topological component held constant [7] , [8] , [16] , [25] , [30] , [31] , [49] . While these studies have led to convincing arguments regarding the relative superiority of certain optical interconnect technologies over electrical interconnect technologies, the question of what the optimal topology-technology pair is for particular application classes has been addressed only recently [11] , [38] , [40] . This paper contributes an additional datapoint to this investigation, considering the design of an efficient topology for exploiting through-wafer optical interconnects.
While a variant of the offset cube topology is considered in [37] (referred to as "Topology B"), this study does not include implementation-specific constraints in evaluating performance. For example, although analytic expressions for no-load message latency and bisection bandwidth are presented in [37] , an idealized spherical boundary for the topology was assumed. In addition, the simulation results in [37] were obtained under a fully adaptive routing algorithm which used an inexhaustible number of virtual channels to avoid deadlock. The routing algorithms presented in Section 4 require either a modest number of virtual channels (i.e., two per physical channel) or none at all.
THE OFFSET CUBE NETWORK
This section introduces the offset cube network by first describing its through-wafer optoelectronic implementation. The offset cube's topology, naming scheme, and minimal routing strategy are discussed next, followed by an analytical comparison of the offset cube and 3D mesh networks. An efficient mapping between the two topologies is presented in Section 3.4.
Through-Wafer Optoelectronic Implementation
Physically, the offset cube is a collection of processing planes organized in a vertical stack. Each processing plane consists of a silicon carrier substrate which provides mechanical support for an array of chips. Each chip contains one or more fine-grained processing elements and an integrated network router, both implemented using Si-based VLSI.
The chips are attached to each substrate in the pattern shown in the upper half of Fig. 1a. The surface of each chip is covered with thin-film optoelectronic devices in the pattern shown in the lower half of Fig. 1a. Each device is labeled as an emitter (E) or detector (D). All network links in the offset cube consist of vertically aligned emitter-detector pairs which transmit optical signals through the intervening chip and carrier substrates. Each emitter and detector has a fixed direction (up or down) in which it emits or detects optical signals. E_u denotes an emitter which emits "up" to a detector, D_d, on the substrate above it, whereas E_d denotes an emitter which emits "down" to a detector, D_u, on the substrate below it. The dotted lines in the lower half of Fig. 1a divide the chip surface into quadrants. In a system with bit-serial channels, each quadrant contains two emitters and two detectors. However, each emitter/detector site in Fig. 1a can be replaced with a group of optical devices operating as an arbitrarily wide parallel bus. The I/O devices in each quadrant allow a single bidirectional link with the chip lying above the quadrant and a single bidirectional link with the chip below it. Each chip, therefore, maintains bidirectional communication with eight neighbors (two per quadrant).
To ensure correct device alignment between chips on adjacent planes, every other plane in the stack is rotated by 90°, as shown in Fig. 1b . Note that the pattern of devices on each chip is invariant under this rotation. Aligning the edges of each substrate in the stack causes each chip to be overlapped by four chips above and four chips below. This is depicted in Fig. 1c by superimposing the chip patterns of two adjacent substrates. The lower half of the figure shows the correspondence between quadrants of overlapping chips. Inspection of the device patterns in Figs. 1a and 1b shows that all emitters and detectors are correctly aligned according to the intended direction of communication.
Because no electrical connections besides power and ground are needed within a substrate, each substrate can support a large number of chips, enabling an ultra-compact MPP system. Admittedly, heat removal and optical alignment of emitter/detector sites present formidable engineering challenges to the offset cube's physical implementation. However, innovative solutions to the heat removal problem are currently being investigated in connection with other dense 3D packaging techniques [19], [54]. Optical alignment is sensitive to mechanical, thermal, and optical effects. Although the offset cube's large number of distributed optical I/O sites makes correct alignment challenging, the ultra-compact nature of the system alleviates some of these difficulties by reducing the distance between emitters and detectors. Recent advances indicate that low-cost alignment techniques may be available in the near future. For example, microconnectors for self-alignment of free-space optical interconnections have been demonstrated to allow low-cost alignment accuracies of 20 µm [36]. A more detailed discussion of system-level design issues regarding the offset cube's through-wafer implementation can be found in [35], [52].
A final issue that must be considered is the currently high cost of optical interconnect relative to electrical alternatives. In addition to costs arising from optical alignment issues, there is the cost of integrating optical emitters and detectors onto each unpackaged chip. Despite the performance benefits of distributing the optical components over each chip, this approach is admittedly more expensive than concentrating them at a smaller number of sites since the cost of optical integration per substrate plane cannot be amortized over a large number of nonoptical ICs. The feasibility of distributing the optical interconnect clearly depends on achieving low-cost optical integration per chip; given the lack of a mature commercial infrastructure for optical integration, this goal will not likely be achieved in the near future. However, this fact should not discourage the investigation of appropriate network topologies for exploiting various optical interconnection approaches. Also, it should be noted that the optoelectronic integration techniques to be used in the offset cube are expected to cost significantly less than typical free-space optical interconnect approaches used for backplane interconnections. This is due to 1) the integration technology proposed (ELO integration of optical receivers/transmitters with CMOS circuitry) and 2) the compact nature of the interconnection structure (minimizes optical routing and alignment problems).
unidirectional optical channels. Black nodes represent chips on even-numbered z-layers, whereas white nodes represent chips on odd-numbered z-layers. In Fig. 2b, nodes connected with solid lines lie in planes with odd x-coordinates, whereas nodes connected with dashed lines lie in planes with even x-coordinates. The yz projection is virtually identical to the view in Fig. 2b. Network nodes are named by their x-, y-, and z-coordinates. Adjacent nodes differ in their z-coordinate and in either their x- or y-coordinate.
A node at coordinates (x, y, z) communicates directly with eight neighbor nodes at the following coordinates: (x − 1, y, z − 1), (x + 1, y, z − 1), (x, y − 1, z − 1), (x, y + 1, z − 1), (x − 1, y, z + 1), (x + 1, y, z + 1), (x, y − 1, z + 1), (x, y + 1, z + 1). The offset cube can also be generated by modifying a rectangular 3D mesh network with x, y, and z dimensions of k × k × 2k. Nodes with even z-coordinates are removed if their x- and y-coordinates are neither both even nor both odd; nodes with odd z-coordinates are removed if their x- and y-coordinates are both even or both odd. Replacing the six-neighbor connections present in the original mesh with the eight-neighbor connections described above completes the topology. Since half the original nodes are removed, the offset cube consists of k^3 nodes, the same number as in a cubic 3D mesh with radix k. We call the resulting network a k-ary offset cube.
In hops. For z-dominant messages, δ_z must be reduced at each routing step, requiring δ_z hops. Again, since the number of required hops exceeds the sum of the distances in x and y, both positive and negative routing in x and/or y is needed to reduce these components to zero in exactly δ_z hops. For co-dominant messages, δ_z and either δ_x or δ_y must be reduced at each routing step. In the co-dominant case, no direction changes in any of the three coordinates are required. It follows that the shortest path distance in hops between two nodes, A and B, in an offset cube network is given by (1).
Because the offset cube is a multipath topology, adaptive routing can be exploited to improve throughput and fault tolerance. At each node visited by the message header, an adaptive routing function examines the current state of the header's distance vector and returns a set of feasible output channels, one of which is chosen by a selection function.
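The node-removal construction and the eight-neighbor rule described in this section can be sketched in a few lines of Python. This is only an illustrative sketch: the names `OFFSETS`, `offset_cube_nodes`, and `neighbors` are hypothetical, and the parity test `(x + y + z) % 2 == 0` is our compact restatement of the two removal rules rather than notation taken from the text.

```python
from itertools import product

# Every hop changes z by +/-1 and exactly one of x, y by +/-1 (eight offsets).
OFFSETS = [(dx, dy, dz)
           for dz in (-1, 1)
           for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1))]

def offset_cube_nodes(k):
    """Nodes of a k-ary offset cube carved out of a k x k x 2k mesh."""
    # Even z keeps nodes whose x and y are both even or both odd;
    # odd z keeps the remaining nodes. Both rules reduce to: x + y + z even.
    return {(x, y, z)
            for x, y, z in product(range(k), range(k), range(2 * k))
            if (x + y + z) % 2 == 0}

def neighbors(node, k):
    """In-range neighbors of a node; boundary nodes have fewer than eight."""
    x, y, z = node
    result = []
    for dx, dy, dz in OFFSETS:
        nx, ny, nz = x + dx, y + dy, z + dz
        if 0 <= nx < k and 0 <= ny < k and 0 <= nz < 2 * k:
            result.append((nx, ny, nz))
    return result
```

For example, `len(offset_cube_nodes(4))` gives 64 nodes, matching k^3 for k = 4, and an interior node such as (1, 1, 2) has all eight neighbors.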
Table 1 summarizes the possible output channels a message header can take at a given network node under a minimal, fully adaptive protocol, given the state of its distance vector. In Table 1 , each output channel is expressed as a vector indicating the x or y direction and z direction taken by a message header (for example, (+x, −z)). The sign function, sgn(x), returns the sign (+ or −) of its argument. The procedure in Table 1 does not ensure freedom from deadlock. Routing constraints for adaptive deadlock-free routing are developed in Section 4.
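A minimal, fully adaptive routing function can also be stated without reference to the table: a hop is feasible exactly when it brings the header one hop closer to the destination. The sketch below takes that brute-force view, using the shortest-path distance max(δ_x + δ_y, δ_z). It is an illustration under that assumption, not a transcription of Table 1; the names `distance` and `feasible_outputs` are hypothetical, and deadlock freedom is not addressed.

```python
# The eight output directions, as (dx, dy, dz) hop vectors.
HOPS = [(dx, dy, dz)
        for dz in (-1, 1)
        for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1))]

def distance(a, b):
    """Shortest-path distance in hops: max(dx + dy, dz)."""
    dx, dy, dz = (abs(p - q) for p, q in zip(a, b))
    return max(dx + dy, dz)

def feasible_outputs(cur, dst, k):
    """Hop vectors that keep the header on some minimal path to dst."""
    remaining = distance(cur, dst)
    feasible = []
    for hop in HOPS:
        nxt = tuple(c + h for c, h in zip(cur, hop))
        in_range = 0 <= nxt[0] < k and 0 <= nxt[1] < k and 0 <= nxt[2] < 2 * k
        # A minimal hop reduces the remaining distance by exactly one.
        if in_range and distance(nxt, dst) == remaining - 1:
            feasible.append(hop)
    return feasible
```

A selection function would then pick one element of the returned list, for instance the first output channel that is currently free.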
Topology, Node-Naming Scheme, and Minimal Routing
$$d_{ab} = \max\left(|x_a - x_b| + |y_a - y_b|,\ |z_a - z_b|\right) = \max\left(\delta_x + \delta_y,\ \delta_z\right) \qquad (1)$$
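As a sanity check on (1), the closed form can be compared against a breadth-first search over the finite network. The sketch below assumes a k-ary offset cube with k ≥ 2 whose nodes are the (x, y, z) triples in a k × k × 2k box with x + y + z even; the function names are hypothetical.

```python
from collections import deque
from itertools import product

HOPS = [(dx, dy, dz)
        for dz in (-1, 1)
        for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1))]

def formula_distance(a, b):
    """Equation (1): max(dx + dy, dz)."""
    dx, dy, dz = (abs(p - q) for p, q in zip(a, b))
    return max(dx + dy, dz)

def bfs_distances(src, k):
    """Hop counts from src to every reachable node, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        for hop in HOPS:
            nxt = tuple(c + h for c, h in zip(cur, hop))
            in_range = 0 <= nxt[0] < k and 0 <= nxt[1] < k and 0 <= nxt[2] < 2 * k
            if in_range and nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist

def formula_matches_bfs(k):
    """True if (1) agrees with BFS for every source/destination pair."""
    nodes = [n for n in product(range(k), range(k), range(2 * k))
             if sum(n) % 2 == 0]
    for src in nodes:
        dist = bfs_distances(src, k)
        if any(dist.get(dst) != formula_distance(src, dst) for dst in nodes):
            return False
    return True
```

The check is exhaustive, so it is practical only for small radices such as k = 2 or k = 3.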
Topological Comparison of k-ary Offset Cube and 3D Mesh
Past studies have generally compared network topologies under either constant bisection bandwidth [14], [43] or constant node-size/pin-out [1], [2] constraints. The constant bisection width assumption is valid when network connectivity is limited by the internode wiring capacity of a particular packaging approach. Constant node-size/pin-out is more appropriate when channel widths are constrained by the number of I/Os available from the single die or multichip package forming a network node. In this study, we assume the mesh is implemented using a wire-based 3D packaging technique, while the offset cube is implemented using through-wafer optics. Comparing the two systems under constant bisection width is not meaningful under these assumptions since the width of the offset cube's line-of-sight optical links is not constrained by the wiring capacity of a planar medium. Constant pin-out provides a more realistic constraint since both systems utilize area-based I/O (assuming flip-chip connections for the mesh) and there is currently no compelling reason to assume a difference in I/O densities between flip-chip and through-wafer schemes. Because each offset cube node has eight bidirectional links, as opposed to six for 3D mesh nodes, all performance studies in this paper assume that the offset cube's links are 0.75 times the width of links in the mesh. We note this is not exactly a constant pin-out constraint since the offset cube's larger number of links implies a greater number of I/Os for flow control and synchronization. However, the assumed approach is a reasonable approximation for the wide channels possible with area-based I/O (using either surface-mounted optoelectronic devices or flip-chip connections) and it simplifies the analysis. Table 2 compares topological properties of the k-ary offset cube and bidirectional 3D mesh. The equations for average distance were derived using empirical curve fits.
Network capacity is the combined bandwidth of all network channels in bits per cycle. Bisection bandwidth is the combined bandwidth of the minimum set of channels that must be cut to divide the network into equal halves. The parameters W_mesh and W_oc denote the channel widths in bits of the mesh and offset cube networks.
No-load message latency provides a reasonable approximation of performance below saturation. For wormhole routed networks, it consists of two terms: the distance traveled by the header in hops and the message "aspect ratio" obtained by dividing the message length in bits by the channel width. Assuming a constant pin-out constraint, an offset cube message has 4/3 the aspect ratio of a mesh message with the same number of bits. However, because the offset cube's average routing distance is lower than that of an equal-sized mesh for k > 2, a point is reached as network size increases and message size decreases at which the offset cube offers lower no-load latency. In general, the offset cube will have lower average no-load latency under uniform traffic patterns if the following relationship is satisfied, where L is the message length in bits:
The break-even aspect ratio L/W_mesh for a system with 512 nodes (k = 8) is approximately two flits. Even when the network size is scaled up to 4,096 nodes, the break-even aspect ratio has only increased to about five flits. Consequently, the offset cube is expected to offer lower latency than the mesh primarily for fine-grained systems (i.e., large system sizes and short messages). However, five flits is realistic for many practical applications, especially as improvements in packaging technology continue to increase the number of off-chip I/Os, decreasing the aspect ratio of messages. For W_mesh = 64 bits, five flits represents a message length of 320 bits, comparable to the message length of 200 bits assumed in Dally's analysis of k-ary n-cube networks [14]. Although the I/O capacities of the offset cube and mesh nodes are equal under the constant pin-out assumption, the offset cube has a lower network capacity since its boundary conditions differ from those of the mesh. As k increases, the network capacity of the offset cube approaches that of the mesh, achieving 93.7 percent of the mesh's capacity for k = 16.
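The trade-off between routing distance and aspect ratio can be made concrete with a small calculation. The sketch below assumes that no-load latency is simply the sum of the two terms named above (hops plus L/W) and that W_oc = 0.75 W_mesh; under those assumptions the offset cube wins when L/W_mesh < 3 (D_mesh − D_oc), where D denotes average distance in hops. This is our own rearrangement, not the paper's relationship expressed through the Table 2 curve fits, and all function and parameter names are hypothetical.

```python
def offset_cube_width(w_mesh, mesh_links=6, oc_links=8):
    """Channel width giving the same total data I/O per node (approximate pin-out)."""
    return w_mesh * mesh_links / oc_links

def no_load_latency(avg_hops, length_bits, width_bits):
    """Assumed model: header distance in hops plus message aspect ratio."""
    return avg_hops + length_bits / width_bits

def breakeven_aspect_ratio(d_mesh, d_oc, width_ratio=0.75):
    """Largest L / W_mesh at which the offset cube is no slower than the mesh.

    Derived from d_oc + L / (r * W) <= d_mesh + L / W with r = width_ratio,
    which rearranges to L / W <= (d_mesh - d_oc) * r / (1 - r).
    """
    return (d_mesh - d_oc) * width_ratio / (1.0 - width_ratio)
```

For instance, with purely hypothetical average distances of 7 hops for the mesh and 6 hops for the offset cube, the break-even aspect ratio is three flits; at that point both networks have the same no-load latency under this model.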
Embedding a 3D Mesh in the Offset Cube Topology
The similarities between the k-ary offset cube and 3D mesh networks enable a straightforward mapping between the two topologies. When adjacent planes of the offset cube are considered in pairs, as in Fig. 2a, the nodes appear connected as a k × k 2D mesh, identical to a single plane from a 3D mesh with radix k. In Fig. 2b, planes with z-coordinates n and n + 1 for n ∈ {0, 2, 4, ..., 2k − 2} can be paired to form a stack of k 2D mesh planes, similar to a 3D mesh. This suggests the coordinate transformation in (3) from a cubic 3D mesh with radix k to a k-ary offset cube. The second term in the expression for z_oc in (3) evaluates to zero if x_mesh and y_mesh are both odd or both even; otherwise, it evaluates to one.
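One transformation consistent with this description keeps x and y unchanged and sets z_oc = 2 z_mesh + ((x_mesh + y_mesh) mod 2). The closed form is our assumption, inferred from the plane-pairing argument and the stated behavior of the second term; the sketch below checks that it lands on valid offset cube nodes and reports how mesh links are stretched. Function names are hypothetical.

```python
from itertools import product

def mesh_to_offset_cube(x, y, z):
    """Assumed form of the mapping: x, y unchanged; z doubled plus a parity term."""
    # The parity term is 0 when x and y are both even or both odd, else 1.
    return (x, y, 2 * z + (x + y) % 2)

def oc_distance(a, b):
    """Offset cube shortest-path distance, max(dx + dy, dz)."""
    dx, dy, dz = (abs(p - q) for p, q in zip(a, b))
    return max(dx + dy, dz)

def link_dilations(k):
    """Offset cube distance between the images of each pair of mesh neighbors."""
    dilation = {"x": set(), "y": set(), "z": set()}
    for x, y, z in product(range(k), repeat=3):
        here = mesh_to_offset_cube(x, y, z)
        if x + 1 < k:
            dilation["x"].add(oc_distance(here, mesh_to_offset_cube(x + 1, y, z)))
        if y + 1 < k:
            dilation["y"].add(oc_distance(here, mesh_to_offset_cube(x, y + 1, z)))
        if z + 1 < k:
            dilation["z"].add(oc_distance(here, mesh_to_offset_cube(x, y, z + 1)))
    return dilation
```

Under this assumed mapping, mesh links in x and y map to single offset cube hops, while mesh links in z map to two-hop paths.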
For xy-dominant and co-dominant messages, the offset cube shortest path always requires δz_mesh fewer hops than the mesh shortest path. For z-dominant communications, the difference in shortest paths is:
In this case, communication locality can be either gained or lost under the mapping, depending on the sum of the distances in x and y relative to the distance in z. 
The lower bound in (7) is derived by substituting in (6) the maximum value of δx_mesh + δy_mesh that preserves z-dominance, namely δz_oc − 1. The upper bound is derived by setting the distances in x and y both to zero in (6) and expressing δz_oc in terms of δz_mesh. Three upper bounds are obtained, depending on how δz_mesh maps to δz_oc (see left column of Table 3).
As the above results imply, the mapping is most efficient for mesh communication patterns that exhibit most of their nonlocality in the x and y dimensions. Because a cubic 3D mesh is isotropic, the xyz coordinates of source/destination nodes can be permuted so that the percentage of nonlocal traffic in the z dimension is minimized for a particular application. This allows the mapping technique to preserve maximum communication locality when porting the application to the offset cube.
DEADLOCK-FREE ROUTING ON THE OFFSET CUBE
This section develops the necessary routing constraints for practical, deadlock-free routing protocols (Sections 4.1-4.3) using techniques presented in [18], [23]. While detect-and-recover schemes, such as Compressionless Routing [26] and DISHA [3], can be readily adapted to the offset cube, they are not considered here.
Positive-Z-First Algorithm
The Positive-Z-First (PZF) algorithm was developed by applying the Turn model described in [23] to the offset cube. The Turn model is based on analyzing the directions in which a message header can turn in a given topology and identifying the resource cycles created by combinations of turns. Prohibiting some minimum set of turns removes all cycles from the channel dependency graph (CDG) while retaining maximum routing flexibility. The absence of cycles in the CDG guarantees freedom from deadlock [15] .
Let the offset cube's channels be divided into two sets, C+ and C−, where C+ contains all channels in the +z direction and C− contains all channels in the −z direction. Since any hop in the offset cube changes a message header's z-coordinate, all cyclic paths involve at least one turn from a channel in C+ to a channel in C− and at least one turn from a channel in C− to a channel in C+. Prohibiting either type of turn will break every cyclic path in the network. At each network node (excluding boundary nodes), there are 48 possible turns, assuming that 180-degree turns are excluded. There are 12 turns from C+ to C− and 12 turns from C− to C+.
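The turn counts quoted above can be reproduced by enumeration. The sketch below treats a turn as an ordered pair of distinct travel directions and excludes both straight-through moves and 180-degree reversals; counting straight-through moves as non-turns is our reading of how the figure of 48 arises. The names are hypothetical.

```python
# The eight travel directions at an interior node, as (dx, dy, dz) vectors.
DIRECTIONS = [(dx, dy, dz)
              for dz in (1, -1)
              for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]

def reverse(direction):
    """The direction pointing exactly the opposite way (a 180-degree turn)."""
    return tuple(-c for c in direction)

# A turn is (incoming direction, outgoing direction), excluding
# continuing straight and reversing.
TURNS = [(a, b) for a in DIRECTIONS for b in DIRECTIONS
         if b != a and b != reverse(a)]

# Turns from a +z channel to a -z channel, and vice versa.
PLUS_TO_MINUS = [(a, b) for a, b in TURNS if a[2] == 1 and b[2] == -1]
MINUS_TO_PLUS = [(a, b) for a, b in TURNS if a[2] == -1 and b[2] == 1]
```

Here `len(TURNS)` is 48 and each of the two cross-sign sets contains 12 turns, so prohibiting one set removes one quarter of the turns.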
Prohibiting either set of turns removes only 1/4 of the possible turns from the network, leaving significant routing flexibility. The routing restrictions imposed on a particular message depend on its initial distance vector. If the vector is z-dominant or co-dominant, no changes in the z-direction are ever required and the message can be routed with full adaptivity (no turn restrictions) from source to destination. An xy-dominant message requires at least one change in its z-direction movement. Since the PZF algorithm prohibits turns from C− to C+, xy-dominant messages must complete all movement in the +z direction before routing in the −z direction. Unfortunately, this produces a disconnected routing function for a finite-sized offset cube network since it causes certain routes to collide with the top surface of the cube before +z routing is complete. For example, messages cannot route between nodes inhabiting the top z-layer of the cube since it is clearly impossible in this case to complete +z routing before −z routing. The problem can be resolved by prohibiting a different set of turns for message headers occupying nodes on the top two z-layers of the cube. Let this set of nodes be denoted N_top and let the total set of network nodes be denoted N. For all nodes in the set N − N_top, the turns from C− to C+ are prohibited, whereas for the nodes in N_top, all turns from channels in the yz-planes to channels in the xz-planes are prohibited.
The PZF routing protocol is summarized as follows. z-dominant and co-dominant messages will never experience collisions with the top of the cube and can be routed with full adaptivity from source to destination as described in the first and third entries of Table 1. xy-dominant routing requires what is potentially a three-phase procedure. In Phase one, messages attempt to complete all +z routing before colliding with the top of the cube.
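Whether a message will collide with the top of the cube can be determined from its source and destination alone. On a minimal path of d = max(δ_x + δ_y, δ_z) hops, the number of +z hops is (d + z_dst − z_src)/2, and if all of them are taken first, the header reaches layer z_src plus that count. The sketch below applies this observation; the arithmetic is our own derivation from the distance formula, and the function names are hypothetical.

```python
def classify(src, dst):
    """Dominance class of a message from its distance vector."""
    dx, dy, dz = (abs(p - q) for p, q in zip(src, dst))
    if dx + dy > dz:
        return "xy-dominant"
    if dz > dx + dy:
        return "z-dominant"
    return "co-dominant"

def plus_z_hops(src, dst):
    """Number of +z hops on any minimal path from src to dst."""
    dx, dy, dz = (abs(p - q) for p, q in zip(src, dst))
    total_hops = max(dx + dy, dz)
    # ups + downs = total_hops and ups - downs = z_dst - z_src.
    return (total_hops + (dst[2] - src[2])) // 2

def collides_with_top(src, dst, k):
    """True if taking every +z hop first would overshoot the top z-layer."""
    z_max = 2 * k - 1
    return src[2] + plus_z_hops(src, dst) > z_max
```

For k = 4, a message between (0, 1, 7) and (2, 1, 7), both on the top z-layer, needs one +z hop and therefore collides, which matches the example given above.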
Although movement in z is constrained to be positive during Phase one, the order of routing in x and y is arbitrary, allowing partial adaptivity. If a collision with the top z-layer occurs before +z routing is complete, Phase two routing occurs in which the message header is allowed to oscillate or "bounce" between the top two z-layers until it completes all +z routing. Note that an equal amount of −z routing occurs as well. During Phase two, the message does not perform routing in y unless all routing in x is complete (i.e., δ x = 0). When +z routing is complete, the message routes in the −z direction until it reaches its destination (Phase three). During Phase three, the order of x and y routing is unconstrained, as in Phase one. xy-dominant messages that do not collide with the top of the cube require Phases one and three only. Although this protocol does not require virtual channels for deadlock avoidance, it can use virtual lanes to improve network throughput.
that the total CDG is acyclic, we need only show that the channels within each set can be numbered such that messages are always routed over channels with increasing numbers. Assign the number n to each channel in C int + connecting a node on z-layer n with a node on z-layer n + 1. Since messages traveling over the C int + network always move to nodes with higher z-coordinates, these channels are always traversed in order of increasing number. Assign the number z max − n to each channel in C int − connecting a node on z-layer n + 1 to a node on z-layer n. Since messages traveling over the C int − network always move to nodes with lower z-coordinates, these channels are also traversed in order of increasing number. Finally, note that the subnetwork consisting of the nodes in N top and the channels in C top is isomorphic to a 2D mesh (see Fig. 2a ). When moving in this subnetwork, messages must complete all x routing before y routing, the equivalent of dimension-order routing (DOR). 
It is well known that the channels in a 2D mesh can be numbered such that DOR always traverses them in order of increasing number [15] . Thus, the complete CDG is acyclic and PZF is deadlock-free. □
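The channel numbering used in the proof for the +z and −z channel sets can be checked with a few lines of code. This is only a sketch of the numbering rule stated above; the function names, the value of z_max, and the example paths are hypothetical.

```python
def number_up(n):
    """Number assigned to a +z channel from z-layer n to z-layer n + 1."""
    return n

def number_down(n, z_max):
    """Number assigned to a -z channel from z-layer n + 1 to z-layer n."""
    return z_max - n

def strictly_increasing(seq):
    """True if every element is larger than the one before it."""
    return all(a < b for a, b in zip(seq, seq[1:]))

z_max = 7  # hypothetical index of the top z-layer

# A message climbing from layer 2 to layer 6 uses channels 2->3, ..., 5->6.
up_path = [number_up(n) for n in range(2, 6)]
# A message descending from layer 6 to layer 1 uses channels 6->5, ..., 2->1.
down_path = [number_down(n, z_max) for n in range(5, 0, -1)]

print(up_path, strictly_increasing(up_path))      # [2, 3, 4, 5] True
print(down_path, strictly_increasing(down_path))  # [2, 3, 4, 5, 6] True
```

Both sequences increase along the direction of travel, which is the property the numbering argument relies on within each set.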
Planar PZF Routing
Although the high adaptivity of PZF is attractive for improving network performance, the increased cycle-time cost of routing and flow-control operations at each routing node could potentially negate the latency and throughput gains [9] . The routing freedom of PZF can be restricted to trade off adaptivity for simplified (and faster) hardware, providing an offset cube analog to dimension-order routing on a mesh. The Planar PZF (PPZF) algorithm is identical to the PZF algorithm with the added restriction that turns from channels in the yz-plane to channels in the xz-plane are never allowed, yielding a deterministic version of PZF.
ASSERTION 2. The Planar PZF algorithm is deadlock-free.
PROOF. Given a current node and a destination node, the PPZF routing function returns a subset of the channels returned by the PZF routing function. Because the CDG for PZF was shown to be acyclic in Section 4.2, the CDG for PPZF must also be acyclic since it is simply a subgraph of the CDG for PZF. Therefore, the PPZF algorithm is deadlock-free. □
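The subset relationship behind this proof can be illustrated at the level of individual turns. The sketch below is a simplified turn model, not the routing functions themselves: it assumes the eight diagonal channel directions (±1, 0, ±1) and (0, ±1, ±1), and the predicate names are hypothetical.

```python
from itertools import product

# Assumed channel directions: diagonals in the xz-planes and yz-planes.
DIRECTIONS = [(dx, 0, dz) for dx, dz in product((1, -1), repeat=2)] + \
             [(0, dy, dz) for dy, dz in product((1, -1), repeat=2)]

def plane(d):
    """Plane containing direction d: 'xz' if it has no y component."""
    return "xz" if d[1] == 0 else "yz"

def yz_to_xz(inc, out):
    """True for a turn from a yz-plane channel to an xz-plane channel."""
    return plane(inc) == "yz" and plane(out) == "xz"

def pzf_allows(inc, out, in_top):
    """PZF turn rule: N_top nodes forbid yz->xz turns; others forbid -z -> +z."""
    if in_top:
        return not yz_to_xz(inc, out)
    return not (inc[2] < 0 and out[2] > 0)

def ppzf_allows(inc, out, in_top):
    """PPZF adds the yz->xz prohibition at every node."""
    return pzf_allows(inc, out, in_top) and not yz_to_xz(inc, out)

def is_subset(in_top):
    """Every turn PPZF permits is also permitted by PZF."""
    return all(pzf_allows(a, b, in_top)
               for a in DIRECTIONS for b in DIRECTIONS
               if ppzf_allows(a, b, in_top))

print(is_subset(True), is_subset(False))  # True True
```

Because PPZF only removes turns, any dependency it creates already exists under PZF, which mirrors the subgraph argument.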
Fully Adaptive Routing Using Duato's Method
The PZF and PPZF algorithms avoid deadlock via acyclic channel dependency graphs. Duato has demonstrated [18] that an acyclic CDG is unnecessarily restrictive for deadlock avoidance. Cycles are permitted in the CDG of a routing function, R, provided there exists a subset of the total network channels, C 1 ⊆ C, defining a "routing subfunction," R 1 , which is connected and has no cycles in its "extended" channel dependency graph (ECDG) [18] . The ECDG is formed by augmenting the CDG of R 1 with "indirect" dependencies created by message routes that use the channels C − C 1 between uses of the channels supplied by R 1 . An acyclic ECDG guarantees that any output channels belonging to C 1 will eventually become free. Duato's theorem only applies to routing functions of the form R: N × N → P(C), where N is the set of network nodes, C the set of network channels, and P(C) is the power set of C. The power of this approach is that it allows a fully adaptive routing protocol to be built using a partially adaptive routing protocol as the routing subfunction discussed above. Applying Duato's methodology to the offset cube produces a fully adaptive routing protocol dubbed DUPOC (DUato's Protocol for the Offset Cube). DUPOC uses the partially adaptive PZF algorithm as the routing subfunction, R 1 . Each physical channel, c i , is shared by n virtual channels: a i,n−1 , a i,n−2 , ... , a i,1 , b i . The channels a i form the set C − C 1 , while the channels b i form the set C 1 . Routing is fully adaptive (unrestricted) on the set C − C 1 while routing restrictions are imposed on the set C 1 to avoid deadlock. Because each physical channel has at least one associated virtual channel from each set, the protocol requires a minimum of two virtual channels per physical channel. 
The DUPOC algorithm uses the following procedure to route message headers at each network node:
1) Attempt to route the message with full adaptivity using any of the unallocated virtual channels a i belonging to the physical channels c i returned by the procedure described in Table 1 . If none of these virtual channels are available, proceed to Step 2.
2) Attempt to route the message over any of the unallocated virtual channels b i belonging to the physical channels c i returned by the PZF routing function. If none of these virtual channels are available, proceed to Step 3.
3) Repeat Steps 1 and 2 until a free output channel is available.
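The three-step procedure can be sketched as a selection function that a router would call once per cycle for a waiting header. The data layout is an assumption made for illustration: channels are opaque labels, adaptive virtual channels are named "a1", "a2", and so on, the restricted virtual channel is named "b", and the two channel lists are supplied by routing functions that are not modeled here.

```python
def dupoc_select(adaptive_channels, pzf_channels, free):
    """Return (channel, virtual_channel) for a header, or None if it must wait.

    adaptive_channels: physical channels permitted with full adaptivity.
    pzf_channels: physical channels returned by the PZF routing function.
    free: dict mapping a channel to the set of its unallocated virtual channels.
    """
    # Step 1: try any free adaptive virtual channel a_i.
    for ch in adaptive_channels:
        lanes = sorted(vc for vc in free.get(ch, ()) if vc != "b")
        if lanes:
            return ch, lanes[0]
    # Step 2: fall back to the restricted virtual channel b_i.
    for ch in pzf_channels:
        if "b" in free.get(ch, ()):
            return ch, "b"
    # Step 3: nothing is free; the caller retries on a later cycle.
    return None

# Hypothetical state: one channel has only its restricted lane free.
free = {"+x+z": {"b"}, "+y+z": {"a1", "b"}}
print(dupoc_select(["+x+z", "+y+z"], ["+x+z"], free))  # ('+y+z', 'a1')
```

Returning None instead of looping keeps the sketch free of any assumption about how the router schedules retries.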
ASSERTION 3. The DUPOC algorithm is deadlock-free.
PROOF. The proof consists of showing that DUPOC satisfies the conditions of Duato's theorem as stated above. First, PZF is clearly of the form N × N → P(C) since routing decisions are made solely on the basis of the distance vector in the message header (which is itself a function of only the current node and the destination node). Moreover, PZF is connected and its CDG is acyclic by the proof in Section 4.1. Thus, we need only show that indirect dependencies cannot introduce cycles in the ECDG. Any cycle requires at least one indirect dependency from a −z channel to a +z channel. However, when using channels in the set C − C top (as defined in Section 4.1), PZF will never route over a −z channel while it is still possible to route in the +z direction; hence, no dependencies from −z channels to +z channels are possible for the set C − C top and no cycles can be formed. This argument also rules out cycles created using channels from both C − C top and C top . The only other possibility is a cycle using only channels from the set C top . Such a cycle requires channels from both the xz- and yz-planes of the cube. Because PZF disallows turns from yz-plane channels to xz-plane channels when routing over the C top subnetwork, no such cycle is possible. Therefore, the ECDG is acyclic and the DUPOC algorithm is deadlock-free. □
OFFSET CUBE ROUTER COMPLEXITY
Although the offset cube routing protocols presented in Section 4 have low virtual channel requirements, they are conceptually more complex than mesh protocols offering comparable routing flexibility. This section briefly considers the impact of these complexities on the path-setup hardware for an offset cube router implementing the fully adaptive DUPOC protocol. A more detailed treatment of this issue is pursued in [27] . Fig. 3 illustrates a simple delay model for a path-setup operation in an adaptive router. The model is similar to the one presented in [9] . In this discussion, a path-setup operation consists of deciding which output channel an incoming header flit should be routed to and establishing an input-output connection that will allow subsequent data flits to follow the path taken by the header. The delay of moving the header flit through the router's internal switch is not included in path-setup. In Fig. 3 , the routing function logic generates a set of feasible output channels based on routing information in the incoming message header. The channel selection logic selects a particular output channel from this feasible set, subject to the current busy/idle status of these channels. Once an output channel is selected, the connection setup logic performs any functions necessary for establishing the input-output connection used by the data flits following the header (e.g., configuring the router's internal switch). If the relative distance from the destination node is encoded in the header (relative addressing), the header must be updated in a manner dependent on which output channel is selected. Path-setup latency can be reduced by computing all possible header updates in parallel with the routing and channel selection operations; the appropriate modification to the incoming header is then selected based on the chosen output channel.
Because the channel selection and connection setup mechanisms are identical in the offset cube and mesh routers, the T setup and T chan_sel delays in the two routers are expected to be similar. A slight increase in T chan_sel is expected in the offset cube router since the number of output channels is larger. However, the difference is probably not severe since this delay tends to grow logarithmically with the number of channels [9] .
The delay T route in the DUPOC router is potentially degraded by two sources of routing complexity. First, messages must be routed differently depending on whether they are co-dominant, z-dominant, or xy-dominant; computing the dominance class of an incoming message in the routing function logic would significantly lengthen T route . Second, the three-phase procedure for xy-dominant messages under the PZF protocol requires routing to be restricted in different ways for each phase; hence, the routing function logic must rapidly determine the current phase for xy-dominant routes.
Fortunately, both problems are readily addressed in hardware. The first issue is resolved by encoding the dominance class of a message in its header (using two status bits). The status bits are initialized at the source node before injecting the message into the network. The routing function logic can immediately examine these precomputed status bits at each hop. Because z-dominant and xy-dominant messages eventually become co-dominant at some point along the route, this change must be detected and the header bits updated accordingly. As described below, this overhead can be moved into the header update and header selection logic, allowing it to overlap the T route , T chan_sel , and T setup delays. With regard to the second issue, detecting the correct phase of PZF routing can be shown to require the following information [27] : 1) the dominance status of the incoming message, 2) a check of whether the δ x component of the header is zero, 3) a status bit indicating if the current node resides on the top z-layer, and 4) a status bit indicating if the current node resides on either of the top two z-layers.
Item 1 is available from the header and Item 2 is already required to support relative addressing in both the offset cube and mesh routers. Items 3 and 4 can be readily determined by each routing node at power-up and stored in a register for use by the routing function logic. With this hardware organization, the estimated increase in the T route delay of a fully adaptive offset cube router over a comparable fully adaptive 3D mesh router implementing the protocol described in [17] is only two gate delays [27] .
To compute all possible header modifications in parallel, the header update logic of a fully adaptive 3D mesh router requires three increment/decrement units, one for each distance component in the header. In contrast, the fully adaptive offset cube router requires five increment/decrement units (this follows from the z-dominant case in Table 1 ). The delay for incrementing/decrementing the distance components is approximately the same in both routers (representing the δ z component in the offset cube requires one additional bit as compared to the mesh).
The offset cube router requires additional header update logic for detecting changes in the message dominance status. This logic exploits the fact that, on each hop, the quantity d =δ x  + δ y  − δ z  either remains unchanged or is decreased by two units until reaching zero, at which point the message becomes co-dominant. The logic works as follows. First, the quantity d is computed using the unmodified distance components of the incoming header. If the result of d equals the value two, then the message status should be changed to co-dominant if either of the following additional conditions are true: 1) the message is currently xy-dominant and δ z  will be incremented on the next hop, or 2) the message is currently z-dominant and either δ x  or δ y  will be incremented on the next hop.
Note that the computation of d and its comparison with the value two can be overlapped with the T route and T chan_sel delays, reducing (or potentially eliminating) its impact on path-setup latency. Conditions 1 and 2 can be quickly determined in parallel with the T setup delay using the result of the channel selection logic. The computation of d can be structured such that it is equivalent to the addition of three distance components [27] . Using carry-save addition, this delay is only a small additive constant longer than the delay for adding two distance components. A more detailed study is necessary to quantify the actual difference in path-setup latency between the offset cube and mesh routers. The analysis is complicated by the fact that considerable overlap is possible between the various delay components in Fig. 3 , requiring accurate estimates of each component. However, the potential for latency hiding coupled with the other observations in this section show that the conceptual complexities of offset cube routing need not impose a commensurate degradation in path-setup latency. This provides a reasonable degree of optimism that the offset cube router's performance can be made competitive with that of a mesh router. It is true that the offset cube router requires more hardware resources than a mesh router to achieve comparable path-setup latency. However, the added hardware consists primarily of increment units and adders which can be implemented compactly in VLSI and are not likely to consume as large a percentage of chip area as other on-chip structures (e.g., flit queues and the internal crossbar switch).
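The carry-save technique mentioned above reduces three operands to two without propagating carries, so that only one carry-propagate addition remains. The sketch below shows the generic technique on non-negative integers; it does not model the sign handling needed to form d, and the function names are hypothetical.

```python
def carry_save(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    Each bit position is handled independently, as in a row of full adders,
    so no carry ripples across the word at this stage.
    """
    partial_sum = a ^ b ^ c                     # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted left
    return partial_sum, carry

def add3(a, b, c):
    """Three-operand addition: one carry-save stage, then one ordinary add."""
    partial_sum, carry = carry_save(a, b, c)
    return partial_sum + carry

print(add3(5, 3, 6))  # 14
```

The only stage with a carry chain is the final addition, which is why the extra operand costs roughly one full-adder delay rather than a second full-width add.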
NETWORK SIMULATION RESULTS
This section presents simulation results for the offset cube routing algorithms discussed in Section 4. The simulator is a 10,000-line C++ program capable of modeling multicomputer wormhole networks at the flit level. Each experiment performed for the offset cube was also performed for an equal-sized bidirectional 3D mesh network under the approximate constant pin-out constraint discussed in Section 3.3. The mesh algorithms selected for comparison are dimension-order routing (DOR), a well-known deterministic algorithm, and Duato's Protocol (DP) [17] , a fully adaptive algorithm requiring a minimum of two virtual channels per physical channel. The two networks were driven with the following traffic loads: uniform random traffic, hot-spot traffic, and trace-driven loads from actual parallel kernels/applications. These results assume that all protocols have the same delay for routing and flit-level flow-control operations. While this is not realistic, normalizing cycle times to absolute time units such as nanoseconds would require detailed hardware modeling of both router and packaging delays and is beyond the scope of this paper (although certainly not of this research). A discussion of offset cube router implementation complexity and its effect on path-setup latency is given in Section 5. The following additional assumptions apply to all experiments presented in this section:
• Two virtual channels are associated with each physical channel in the network. This number was chosen because it corresponds to the minimum required for both fully adaptive protocols (DUPOC and DP). Protocols which do not require virtual channels for deadlock avoidance use them as "virtual lanes" to enhance throughput [13] .
• The virtual channel buffers associated with router inputs are six flits deep.
• The physical input/output links to the network do not use virtual lanes.
• The channel widths for the offset cube and mesh networks are 24 bits and 32 bits, respectively; each channel can transmit up to one flit per cycle.
• Source queuing is used at the inputs to the network. The maximum length of each source queue is 100 messages and messages arriving at a full queue are dropped.
• Message latency is measured as the number of network cycles from the time a message is added to the appropriate source queue until the time the last flit of the message is received at its destination node.
• Throughput is measured as the average number of eight-bit bytes of data arriving per destination node per cycle after the network has reached steady state.
Uniform Random Traffic
In this mode, each source node generates messages with exponentially distributed interarrival times and uniformly distributed destination nodes. Figs. 4 and 5 illustrate the latency-throughput curves for 512-node systems using two message lengths, 192 bits and 480 bits. For 192-bit messages, the offset cube provides performance comparable to the mesh under fully adaptive routing. Below saturation, the average latency for DUPOC is only slightly higher than the latency of DP. While fully adaptive routing on the mesh provides only a 9 percent improvement in saturation throughput over DOR, the performance difference between deterministic and adaptive offset cube algorithms is much more pronounced (PZF and DUPOC provide improvements of 23 percent and 39 percent over PPZF). The adaptivity gain depends on how efficiently deterministic routing distributes uniform traffic on each network. As observed in past studies of meshes and k-ary n-cubes [23] , DOR distributes uniform random traffic evenly over the available bandwidth due to the mesh's orthogonal arrangement of network links. PPZF is hard-pressed to match DOR's efficiency since the offset cube's diagonal paths tend to concentrate traffic toward the center of the cube, creating more of a hot-spot than exists in the mesh. The offset cube benefits greatly from the increased adaptivity of the PZF and DUPOC protocols since they dynamically route traffic away from the center of the cube, mitigating the hot-spot effect.
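The uniform random workload described at the start of this subsection can be sketched as a small event generator. This is not the simulator used in the paper; the function name and parameters are hypothetical, and excluding the source from its own destination choices is an assumption.

```python
import random

def uniform_traffic(num_nodes, mean_interarrival, horizon, seed=0):
    """Return a time-sorted list of (time, source, destination) events.

    Each source draws exponentially distributed interarrival times and picks
    destinations uniformly among the other nodes (assumed: never itself).
    """
    rng = random.Random(seed)
    events = []
    for src in range(num_nodes):
        t = rng.expovariate(1.0 / mean_interarrival)
        while t < horizon:
            dst = rng.randrange(num_nodes - 1)
            if dst >= src:
                dst += 1  # skip over the source node
            events.append((t, src, dst))
            t += rng.expovariate(1.0 / mean_interarrival)
    events.sort()
    return events

events = uniform_traffic(num_nodes=512, mean_interarrival=50.0, horizon=1000.0)
print(len(events) > 0, events[0][1] != events[0][2])
```

Lowering mean_interarrival raises the offered load, which is how a latency-throughput curve would be swept in such a model.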
Increasing message lengths to 480 bits highlights the advantage of wider channels in the mesh (Fig. 5) . Near the saturation region, DP exhibits only 80 percent of the latency of DUPOC. This correlates well with the fact that the average no-load message latency in the mesh is approximately 84 percent of the no-load latency in the offset cube. The closeness of the calculated and observed results suggests this performance difference is primarily due to the dominance of message aspect ratio over the distance component (for the offset cube simulations in Fig. 5 , the average distance is 7.14 hops while the number of flits per message is 20). Despite the latency results, the saturation throughputs attained under the two fully adaptive algorithms remain comparable.
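The flit counts behind this comparison follow from the channel widths listed in the simulation assumptions. The sketch below assumes one flit is as wide as one channel, which is consistent with the 20 flits quoted for 480-bit messages on the 24-bit offset cube channels; the function name is hypothetical.

```python
def flits_per_message(message_bits, channel_bits):
    """Number of flits needed, assuming one flit is one channel width."""
    return -(-message_bits // channel_bits)  # ceiling division

for length in (192, 480):
    offset_cube = flits_per_message(length, 24)  # 24-bit offset cube channels
    mesh = flits_per_message(length, 32)         # 32-bit mesh channels
    print(length, offset_cube, mesh)
# 192 8 6
# 480 20 15
```

At 480 bits the offset cube message is five flits longer than the mesh message, so serialization time weighs more heavily than hop count.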
Hot-Spot Traffic
The 3D mesh and offset cube networks were also subjected to the commonly employed bit-reversed and dimension-reversed traffic patterns. Although these patterns are defined with respect to a 3D mesh, they can be mapped to the offset cube using the embedding described in Section 3.4.
In the bit-reversed pattern, a source node is mapped to a destination node by reversing the source's address bits (i.e., 〈a n−1 a n−2 ... a 1 a 0 〉 → 〈a 0 a 1 ... a n−2 a n−1 〉). For this traffic pattern, Figs. 6 and 7 show that the offset cube exhibits considerably higher performance under deterministic routing than does the mesh. This behavior contrasts sharply with the results observed for uniform traffic, in which the offset cube's deterministic performance was worse. Unlike uniform traffic, the hot-spot effect of the bit-reversal traffic pattern is not significantly compounded by the offset cube's diagonal links. Hence, PPZF performs reasonably well under this mapping.
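The bit-reversed mapping defined above is straightforward to compute. The sketch below assumes node addresses are plain integers of a fixed width, for example 9 bits for a 512-node system; the function name is hypothetical.

```python
def bit_reverse(address, num_bits):
    """Reverse the low num_bits bits of address."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (address & 1)  # move the lowest bit across
        address >>= 1
    return result

print(bit_reverse(0b000000001, 9))  # 256
print(bit_reverse(0b000000110, 9))  # 192
```

Applying the mapping twice returns the original address, so sources and destinations pair up symmetrically.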
As expected, the performance of both networks for bit-reversed traffic improves considerably under adaptive routing, particularly that of the mesh. For 192-bit messages, the offset cube maintains a performance edge over the mesh under fully adaptive routing, achieving comparable latency below saturation and a 15 percent greater saturation throughput. The performance of PZF is quite competitive with that of DP for messages of this size. For 480-bit messages, the saturation throughput of DP exceeds that of PZF and is comparable to the throughput achieved by DUPOC. Interestingly, the PZF curves in Figs. 6 and 7 and the DP curve in Fig. 7 each exhibit a small range of injection rates above saturation in which message latency temporarily increases at a much slower rate than expected. For example, the PZF simulations in Fig. 6 show this behavior for measured throughputs between 0.5 and 0.8 bytes/node/cycle. A possible explanation is that as the network begins to saturate, the bit-reversal traffic pattern causes messages traveling longer distances to experience congestion at lower injection levels than messages with high locality. For a range of injection rates, this would allow a greater proportion of short-distance messages to enter the network, temporarily skewing the average latency in favor of messages with high locality. As the injection rate is increased further, even local messages experience long delays and the average latency increases rapidly again for subsequent loading levels.
In the dimension-reversed pattern, a source node is mapped to a destination node by reversing the x and y coordinates of the source address (an xy-transpose) and reflecting the z-coordinate across the middle plane in the system. This pattern causes a large number of paths to cross the center of the network. When this pattern is applied to the offset cube under PPZF routing, the predominantly diagonal paths of the offset cube cause a hot-spot in the center of the network, which in turn leads to the poor PPZF performance results shown in Figs. 8 and 9 . Fortunately, the adaptivity of the DUPOC and PZF algorithms helps circumvent this problem.
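The dimension-reversed mapping can be written in one line for mesh coordinates. The sketch assumes z-layers are indexed 0 to k − 1, so that reflection across the middle plane sends z to k − 1 − z; the function name and the value k = 8 are illustrative.

```python
def dimension_reverse(x, y, z, k):
    """Swap x and y, and reflect z across the middle plane of k layers."""
    return (y, x, k - 1 - z)

print(dimension_reverse(1, 2, 0, 8))  # (2, 1, 7)
print(dimension_reverse(3, 3, 4, 8))  # (3, 3, 3)
```

Every route changes z-layer parity-independent position from one half of the system to the other, which is why so many paths pass near the center.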
Adding adaptivity to the 3D mesh and offset cube routing algorithms provides varying levels of performance improvement for dimension-reversed traffic. For L = 192 bits, the DUPOC algorithm is seen to provide a 228 percent improvement in saturation throughput over the PPZF algorithm, while a 42 percent improvement is registered by adopting the DP algorithm rather than DOR. The DUPOC algorithm is seen to provide a 10 percent higher saturation throughput than the DP algorithm, although it results in latency values near the saturation region that are approximately 10 percent higher than those of the DP algorithm. Similar trends are registered for the L = 480 bits case, although in this case the advantage of the wider channel widths assumed for the 3D mesh is accentuated due to the long message lengths. Nevertheless, the DUPOC algorithm is seen to provide a slightly higher saturation throughput than DP, but at the expense of an approximately 26 percent higher latency near the saturation region.
Trace-Driven Traffic
To complete the study, four trace-driven simulations based on two parallel kernels and two full-scale applications were carried out. The first kernel, thermal relaxation (TR), solves Laplace's equation (∇²Φ = 0) using the Gauss-Jacobi method over a 2D, N × N grid. The resulting communication patterns of this kernel correspond to a logical nearest-neighbor type access. The second kernel, matrix multiplication (MM), is based on the subblock decomposition method [22] and yields communication traces characterized by both logical nearest-neighbor and logical local neighborhood (broadcasting) patterns. The first application, JPEG image compression, is based on the Discrete Cosine Transform (DCT) [50] . The logical communication patterns in this application correspond to logical localized neighborhoods. The second application studied is a maximum likelihood-expectation maximization (ML-EM) algorithm for the reconstruction of Positron Emission Tomography (PET) images [12] . The dominant communication pattern in this application is that of angular rotations. The characteristics of the traces are given in Table 4 , where the problem size refers to: the number of elements in the grid partitioning for thermal relaxation, the number of elements in the matrices multiplied in matrix multiplication, the number of pixels in the compressed image in JPEG, and the number of pixels in the reconstructed image in PET. More details regarding the actual parallelization of these applications can be found in [51] .
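The nearest-neighbor access pattern of the thermal relaxation kernel follows directly from the Gauss-Jacobi update, in which each interior point is replaced by the average of its four neighbors. The sketch below is a minimal sequential version on a small grid; it does not reproduce the parallel partitioning used to generate the traces, and the names are hypothetical.

```python
def jacobi_step(grid):
    """One Gauss-Jacobi sweep for Laplace's equation on a square grid.

    Boundary values are held fixed; each interior value becomes the mean
    of its north, south, west, and east neighbors from the previous sweep.
    """
    n = len(grid)
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return new

# Hypothetical 3 x 3 grid: hot top edge, cold everywhere else.
grid = [[4.0, 4.0, 4.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
print(jacobi_step(grid)[1][1])  # 1.0
```

When the grid is split across processors, only the values along partition edges need to be exchanged each sweep, which yields the logical nearest-neighbor traffic described above.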
The network latency simulation results for the applications under the different routing protocols are presented in Fig. 10 . The small difference obtained across routing algorithms for the PET and TR traces is due to the low burstiness, low message injection rate, and short message lengths associated with these traces. Of these two traces and topologies, the only combination that benefits significantly from adaptivity is that of the TR trace applied to the 3D mesh, where a 14 percent performance improvement is registered in going from the DOR to the DP algorithm. The reason why the other routing algorithm-topology mappings do not benefit much from adaptivity is that 1) the small system size in PET (128 nodes) limits the amount of adaptivity that can be taken advantage of for both the 3D mesh and offset cube and 2) the embedding of the 2D TR problem to three physical dimensions results in a traffic pattern that is dealt with very efficiently by the PPZF algorithm, limiting any performance improvements to be gained from the PZF and DUPOC adaptive routing algorithms.
CONCLUSIONS
This paper has presented the offset cube, a new network topology that is well-suited to ultra-compact MCM-level MPP systems built using through-wafer optical interconnect. The topology-technology pairing of the offset cube with through-wafer optics yields a network with compact, scalable packaging and good system-level performance. Fully adaptive routing on the offset cube is possible using only two virtual channels per network link. A preliminary analysis suggests this fully adaptive protocol can be efficiently implemented in hardware. Flit-level network simulations show that the offset cube performs comparably to a bidirectional 3D mesh under fully adaptive routing for fine-grained applications in which large numbers of computing nodes exchange relatively short messages. Future research will continue to stress both topological and technological aspects of the offset cube. Topological studies will focus on the development of parallel algorithms optimized for the offset cube and will more clearly delineate the application classes for which the topology performs well. Technological studies will pursue a more detailed analysis of offset cube router implementation complexity and will address the challenges that through-wafer technology introduces into system design. The most pressing design issues include: 1) the design of low-latency optoelectronic transceiver circuitry, 2) emitter-detector alignment techniques, 3) emitter-detector crosstalk caused by spurious optical reflections, 4) emitter-detector coupling efficiency, and 5) low-cost thermal management.
Although certainly challenging, these problems do not pose fundamental limitations and effective solutions are being investigated in the systems integration, optoelectronics, and thermal management thrust areas of Georgia Tech's Low Cost Packaging Research Center. 
