Abstract: Multiprocessor architectures demand efficient interprocessor communication to maximize system utilization and performance. To meet future demands, these interconnects must communicate at significantly higher speeds while operating more efficiently to meet system size, weight, power, and energy requirements. As high-performance parallel computing architectures make their way into portable systems, compact, efficient, and error-tolerant computing and communication mechanisms will be required. This paper presents the High-Performance Efficient Router (HiPER), an efficient multidimensional router supporting high-throughput error-corrected communication channels. HiPER is a proof-of-concept vehicle for efficient implementations of routing, switching, and error control mechanisms. It combines mad postman (bit-pipelined) switching with dimension-order routing, producing a low-latency router that is less sensitive to message distance than a word-parallel crossbar router. To maintain robust communication as link speeds increase and link power budgets decrease, HiPER employs flit-level hop-by-hop retransmission of erroneous flits, which provides built-in error control at the network level. Data presented on the implemented bit-serial version of HiPER offer insight into future router designs with channel sizes between bit-serial and word-wide.
INTRODUCTION
The performance of the interconnection network in a parallel multiprocessor architecture is critical to overall system performance. The interconnection network must exhibit both low latency and high throughput or system performance will suffer. As parallel processing is incorporated in systems that require compact designs and energy-efficient operation (hand-held computing, space-borne computing, autonomous systems), the interconnection network must also exhibit these characteristics. The High-Performance Efficient Router (HiPER) research presented in this paper focuses on a high-performance network switching element that is compact and energy efficient.
In recent years, high data-rate communication channels of up to several gigabits per second have been achieved for both serial and parallel channels. For example, an 8 Gb/s serial link [3] and a 2.4 Gb/s/pin bidirectional parallel link [4] have been demonstrated. Parallel channels deliver higher throughput than serial channels by transferring bits simultaneously. However, crosstalk can limit parallel channels to lower clock frequencies and shorter distances than their serial counterparts.
The HiPER architecture can support channel widths from word-parallel to bit-serial. However, the implementation presented in this paper employs serial data channels both internal and external to the router for two reasons. First, serial channels represent the case of minimum phit size, which imposes the largest latency penalty on traditional routers based on parallel crossbars. Second, the use of serial channels reduces the router's size and increases its energy and power efficiency. Serial crossbars require less energy per bit transferred, due to the dominance of interconnect capacitance in parallel crossbars. Serial data channels are more efficient off-chip as well. Packages and modules are smaller with serial channels and the problem of escape routing [18] is greatly reduced. In addition, serial links are more energy efficient as bit rates increase due to transmission line effects [1].
As link speeds increase, the network routing function accounts for an increasingly large portion of overall message latency. To address this issue, the switching scheme used in HiPER is mad postman (bit-pipelined) [12], which is an extension of wormhole routing [6], [8] (flit-pipelining), and the routing algorithm is dimension-order routing [7]. This combination produces a very low latency routing function. Mad postman also makes HiPER's latency less sensitive to the distance a message travels.
As link speeds increase and power budgets decrease, the operating margin for a router's data channels is reduced. This results in less robust physical links with an increased probability of errors. To address this issue, HiPER supports hop-by-hop flit-level retransmission of data that has encountered errors during transmission. May et al. [13] discuss work in flit-level error control in multicomputer interconnection networks. The flit-level retransmission capability of the HiPER router places the responsibility of error control on the network itself, freeing the processor from the additional work required with an end-to-end retransmission scheme. Flit-level retransmission is a more efficient mechanism than end-to-end retransmission in terms of both bandwidth and energy since retransmissions involve small segments of the message (flits) rather than the entire message [14]. The mad postman switching protocol has been extended to support flit-level retransmissions by harnessing the flit-invalidation mechanism required by mad postman switching. Additionally, the error control mechanism in HiPER provides a way to reduce link power (e.g., by reducing the driver voltage swing or the light intensity in optoelectronic links) while maintaining a system bit error rate (BER) target.
The remainder of this paper presents the background, design, implementation, and evaluation of the HiPER router, including area and latency models derived from the implementation. It concludes with a summary and future directions for HiPER research.
BACKGROUND

Mad Postman Switching
Mad postman switching was first described by researchers at the University of Southampton in 1989 [12] . Mad postman is an extension of wormhole switching. Wormhole switching pipelines flits in the network, making routing decisions based on information contained in a head flit. The head flit is observed at each node in the path from source to destination and a routing decision is made based on its addressing information. If the physical channels carrying the flits are narrower than the width of the flits, multiple cycles will be required for a flit transfer, reducing the latency benefits of wormhole switching.
The idealized latency for a wormhole switched message and a mad postman message with no contention are described by the following equations which are normalized to the link data transfer time (one bit transferred over one wire). The latency for wormhole switching is given by:
where $D$ is the distance from the source node to the destination node in hops; $L_{flit}$, $L_{message}$, and $W_{channel}$ are the flit length, message length, and channel width, all in bits; and $n_{rs}$ is the number of data transfer cycles required to make a routing/switching decision. The following equation describes the same latency for mad postman switching:
Subtracting the two equations gives the following result.
The previous equation shows that a wormhole switched network exhibits longer latencies than a mad postman switched network for channel widths less than the width of a flit. When the flit width and the channel width are equal, the two switching protocols are equivalent (equal latency) and, when serial channels are employed, the difference is the greatest, $D(L_{flit} - 1)$.
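As a numerical illustration, the sketch below evaluates one pair of latency expressions that is consistent with the variable definitions above and that reproduces the serial-channel difference of $D(L_{flit} - 1)$. The closed forms and the function names are assumptions made for illustration; they are not necessarily the authors' exact equations.

```python
def wormhole_latency(D, L_flit, L_message, W_channel, n_rs):
    """Assumed model: every hop waits for the whole head flit
    (L_flit / W_channel transfer cycles) plus n_rs routing cycles,
    after which the message drains over the channel."""
    return D * (L_flit / W_channel + n_rs) + L_message / W_channel


def mad_postman_latency(D, L_flit, L_message, W_channel, n_rs):
    """Assumed model: bit pipelining forwards data after a single
    transfer cycle, so the per-hop flit-assembly term collapses to 1."""
    return D * (1 + n_rs) + L_message / W_channel


if __name__ == "__main__":
    # Illustrative values: 5 hops, 32-bit flits, 9-flit message, serial link.
    args = dict(D=5, L_flit=32, L_message=9 * 32, W_channel=1, n_rs=1)
    gap = wormhole_latency(**args) - mad_postman_latency(**args)
    print(gap)  # equals D * (L_flit - 1) under these assumed forms
```

Under these assumed forms the gap vanishes when `W_channel == L_flit` and grows to `D * (L_flit - 1)` for a serial channel, matching the behavior described in the text.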
As shown above, mad postman switching eliminates the latency penalty associated with narrow data channels by bit-pipelining the message. The problem with this solution is that the head flit can no longer be completely observed by the node making the routing decision, so a speculative route must be chosen. Since dimension-order routing is generally employed with mad postman switching, the speculative route is always along the same dimension and in the same direction that the message arrived on. This means that an incorrect routing decision will be made only once per dimension, at the point when the message should have turned to the next network dimension (or ejected).
Mad postman switching employs a flit invalidation mechanism that allows the router to recover from incorrect speculation. Since bit pipelining spreads flits out over several routing nodes, the incorrectly routed flit cannot be recovered and rerouted. Instead, it is simply tagged as invalid and either routed off the edge of the network or eliminated if and when it arrives at a destination node. In HiPER, this invalidation mechanism is harnessed by the flit-level retransmission protocol to invalidate flits that are determined to be in error by the error control mechanism.
Traffic Workload
Some of the message traffic workload parameters used in this paper were collected from applications running on a message passing architecture (PICA) being developed at Georgia Tech [17] . The application suite includes both scientific and image processing programs running on system configurations of 100 to 4,000 nodes. Because of the highly parallel, communications-oriented implementations of the algorithms, the high volume of message traffic includes relatively short, local messages. The average message characteristics used in this paper are a length of nine flits and a distance traveled of five hops [14] . The router handles messages that are a maximum of 64 32-bit flits in length.
HIPER DESIGN
To make the router as fast as possible, the results of Chien's analysis for wormhole router performance [5] were applied and the decision was made to use dimension order routing with no virtual channels. The switching scheme chosen strongly affects the performance of the router; therefore, to maximize the performance of the channels narrower than the flit size, such as serial channels, mad postman switching is employed. The advantage of mad postman switching manifests itself in message latency. The bit-pipelined nature of mad postman switching produces an end-to-end message latency approaching that of a router with parallel channels equal in width to the size of a flit [10] . The remainder of this section presents a high-level overview of the HiPER design. For a detailed presentation of the HiPER design, see [14] .
The configuration of HiPER is n unidimensional routers, one for each dimension in the network. This is the same configuration as the canonical dimension-order router presented in [5] (partitioned router), which was based on other routers such as the torus routing chip [6] . Fig. 1 shows the configuration of a two-dimensional routing node composed of two unidimensional routers.
The target network size for the initial implementation of HiPER is a two-dimensional mesh with no more than 64 × 64 (4,096) total nodes. This limitation is imposed by the flit size and the size of the addressing information in the head flit. Fig. 2 illustrates the configuration of one unidimensional router showing only one of its inputs and one of its outputs. The routing decision and control are performed by the Arbiter block and the switching function is performed by a crossbar. The error control mechanism for the first implementation of HiPER is single-bit parity that is checked in the Parity block. The result of this parity calculation is sent back to the transmitting node to initiate a retransmission when an error is detected. The Parity block is a modular design, so more robust error control mechanisms can be easily designed into future versions of HiPER.
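The single-bit parity check can be illustrated with a short sketch. Even parity is assumed here (the text does not state the polarity), and the function names and bit patterns are hypothetical. The example also shows why a more robust code may be desirable: a single-bit error is caught, but a double-bit error is not.

```python
def parity_bit(payload_bits):
    """Even-parity bit (assumed polarity): makes the total count of 1s even."""
    return sum(payload_bits) % 2


def parity_ok(flit_bits):
    """True when a received flit (payload plus parity bit) has even parity."""
    return sum(flit_bits) % 2 == 0


if __name__ == "__main__":
    payload = [1, 0, 1, 1, 0, 0, 1, 0]          # hypothetical 8-bit payload
    flit = payload + [parity_bit(payload)]

    one_error = flit.copy()
    one_error[3] ^= 1                            # single bit flipped on the link

    two_errors = one_error.copy()
    two_errors[5] ^= 1                           # second bit flipped

    print(parity_ok(flit), parity_ok(one_error), parity_ok(two_errors))
```

A failed check corresponds to the Parity Error indication that is sent back to the transmitting node.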
Router Block Diagram
The two elements of the router that support retransmission of flits are the Retransmit Buffer and the Slack Buffer. The Retransmit Buffer stores the last two flits sent out over a link. When a retransmission is initiated, the data flits that are retransmitted are taken from the Retransmit Buffer. The Slack Buffer provides storage for input data while a retransmission is in progress. This storage prevents the loss of input flits while the output is engaged in retransmission. A more detailed explanation of how the Slack Buffer and Retransmit Buffer support retransmission is presented in Section 3.4.
Flit Format
The flit format is organized to facilitate serial implementation of the data channels and the channel control hardware, as well as to handle incorrect routing speculation and errors in the links. Fig. 3 shows the formats for both head and data flits. The Discard bit serves two purposes. First, it is set if an error is detected in the flit so that the destination node will discard it. Second, when speculative routing fails, the flit that has missed its turn to the ejection channel (the current head flit) is tagged as a discard flit. If this flit blocks in the network, hardware can be designed to detect this and delete the flit. In the head flit, the Eject bit is used to speed up routing decisions at the injection channel, allowing the injection channel to make its decision based only on the first two bits of the header (Sign and Eject), which indicate one of the three possible output ports (+, -, or eject). There is one head flit for each dimension in each message. The current head flit is stripped from the message when the message changes dimensions.
Mad Postman Switching in HiPER
A description of mad postman switching in HiPER begins with the notion of bit-pipelining. In a system with flits larger than phits, additional latency is introduced for each message by requiring that each flit be completely transferred between routing nodes before routing decisions are made. This was described in Section 2. To eliminate this additional latency, the routing decisions must be made at the bit level. This, of course, is not feasible as the number of bits required to make a routing decision and to store the state of the route is greater than one.
In bit pipelining using the mad postman approach, routing decisions are made speculatively (speculative dimension order routing). In a dimension order router, the distance between source and destination is entirely exhausted in one dimension before routing begins in the next. Because of this regularity of routes, it is easy to speculate which direction is taken at each node. The channel that the flit leaves the node on will be along the same dimension and in the same direction as the one that it entered on. This varies only when the message makes a turn and when it is ejected from the network. As explained in [12] , this means that only one incorrect decision will be made in each dimension. This incorrect decision must be handled in the network. HiPER handles this incorrect decision by tagging the head flit as invalid by setting its Discard bit.
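The "one incorrect decision per dimension" property can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, the injection node is excluded because it routes deterministically, and each node reached over a router-to-router channel is assumed to forward the first bits straight ahead before the head flit has been observed.

```python
def speculate_along_dimension(hops):
    """Outcome of the straight-through guess at each node a message
    reaches while travelling `hops` links in one dimension."""
    outcomes = []
    for node in range(1, hops + 1):
        # The guess is right at every node except the last one, where the
        # message should have turned to the next dimension or been ejected.
        should_continue = node < hops
        if should_continue:
            outcomes.append("correct")
        else:
            # The already-forwarded head flit is tagged via its Discard bit.
            outcomes.append("misspeculated")
    return outcomes


if __name__ == "__main__":
    print(speculate_along_dimension(5))
```

Regardless of how far the message travels in a dimension, the list contains exactly one `"misspeculated"` entry, which is the single flit that HiPER invalidates.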
Mad Postman Switching and Flit-Level Error Control
The fact that HiPER implements bit-pipelining makes retransmission of erroneous data more complicated. This is because flits are spread across many nodes, each of which holds only two or three bits in its data path. The error detection mechanism cannot detect the presence or absence of an error until the entire flit in question has been observed. At this point, the flit in question has been routed to several other nodes due to mad postman's bit-pipelined nature.
The mechanism that enables a flit-level retransmission under these conditions consists of two phases. The first phase is invalidation of the erroneous flit as it leaves the node. The flit is tagged as bad (Discard bit set) and handled when it arrives at the destination. The second phase consists of retransmitting a buffered copy of the flit while buffering incoming data in the node's Slack Buffer.
Since the error detection and signaling process will likely take more time than the transmission of a single bit, the retransmit buffer holds two flits. This also means that the Parity Error signal will arrive at the upstream node sometime during the transmission of the next flit (the one following the erroneous flit), assuming that wires hold no more than four to five bits in flight. In the case of long wires, where more than a few bits can be in flight, the Parity Error signal may arrive after two or more flits following the erroneous flit have been transmitted. This requires additional retransmit buffers and slack buffers. Fig. 4 shows the timing relationship of the retransmission relative to the flit times. This figure takes into account the very short bit-times and pipelining of bits on the interconnect "wires."
Flit n + 1 contains one or more errors accumulated during transmission from node m to node m + 1. The error is detected at node m + 1 and the Parity Error signal is activated after node m has begun transmitting flit n + 2. Because HiPER guarantees the in-order arrival of flits, both flit n + 1 and flit n + 2 will be retransmitted. The time between the arrival of the error signal at node m and the time that node m finishes transmitting flit n + 2 is the time that node m has to set up for the retransmission. Fig. 4 shows the timeline for this retransmission, including the initial error detection and subsequent error signaling.
In Fig. 4, the flits with the "x" through them have been marked as "bad" by node m + 1, which sets their Discard bits. In this example, only flit n + 1 was actually in error, but both flits are invalidated due to the in-order delivery requirement. Following the delivery of flit n + 2, node m retransmits flits n + 1 and n + 2. In HiPER, a Parity Error signal received during the retransmission of flit n + 1 is ignored. This signal would indicate that flit n + 2 was received in error and, since this flit is being retransmitted anyway, the Parity Error signal may be disregarded for flit n + 2. Extension of this protocol to handle multiple retransmissions is simple since buffering holds the data to be retransmitted at all times.
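The two-phase behavior described above can be sketched as a small simulation of one link. This is an illustrative model only: the function name is hypothetical, a single corrupted flit per message is assumed, and the Parity Error signal is assumed to reach the sender while the following flit is already being transmitted (the short-wire case).

```python
from collections import deque


def send_over_link(flits, corrupted_index):
    """Return the (flit, discard) stream leaving the downstream node."""
    retransmit_buffer = deque(maxlen=2)  # last two flits sent over the link
    stream = []
    i = 0
    while i < len(flits):
        retransmit_buffer.append(flits[i])
        if i != corrupted_index:
            stream.append((flits[i], False))
            i += 1
            continue
        # Phase 1: invalidate the erroneous flit and the flit already in flight.
        stream.append((flits[i], True))
        in_flight = 1
        if i + 1 < len(flits):
            retransmit_buffer.append(flits[i + 1])
            stream.append((flits[i + 1], True))
            in_flight = 2
        # Phase 2: replay the buffered copies in their original order.
        replay = list(retransmit_buffer)[-in_flight:]
        stream.extend((flit, False) for flit in replay)
        corrupted_index = None  # this sketch models one error per message
        i += in_flight
    return stream


if __name__ == "__main__":
    for flit, discard in send_over_link(["n", "n+1", "n+2", "n+3"], 1):
        print(flit, "DISCARD" if discard else "ok")
```

Filtering out the discarded entries leaves the flits in their original order, which is the in-order delivery guarantee that forces both flits to be replayed.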
The Slack Buffer is required to provide a path for incoming flits that prevents them from colliding with flits being retransmitted. The Slack Buffer is also used to store flits that enter a node while the output path for those flits is being used by another message (message blocked). The Slack Buffer for a given node can be activated in one of three ways:
- initiation of the retransmission process,
- blocking of the current message by another message (crossbar arbitration failure), and
- activation of the Hold line by a downstream node.
The following paragraphs explain each of these processes.
During the retransmission of a flit (detailed in the previous section), there is still data streaming into the node. In order to prevent this data from being dropped, it is buffered in the Slack Buffer. The Slack Buffer is activated when the retransmission begins and, at the end of the retransmission, the buffer select switch is set to allow the data in the buffer to be routed to the output (see Fig. 2 and Fig. 5). Retransmission requires that two flit-sized registers in the Slack Buffer be allocated to the datapath so that no flits are lost. In other words, the Slack Buffer registers become part of the datapath. Fig. 5a shows the router just before flit n + 1 will be retransmitted. In Fig. 5a, flit n + 2 has just finished leaving the router, flit n + 3 is beginning to enter the router, and flit n + 1 is about to be retransmitted. To prevent data collisions, flit n + 3 is being routed to the Slack Buffer. Fig. 5b shows the same router after flits n + 1 and n + 2 have been retransmitted. The first two flit buffers in the Slack Buffer are now part of the datapath.
Simulations on the PICA architecture indicate that at most one retransmission is encountered during the transmission of any message for links with BER less than $10^{-5}$. Therefore, a two-flit Slack Buffer should be satisfactory in most cases. However, in the cases of multiple retransmissions (which would rarely occur) and head-of-line blocking, a large Slack Buffer could be required. Since the resources of any one router are finite, another mechanism is required to prevent flit loss. This mechanism is the Hold signal.
When multiple retransmissions occur within the same node, the local Slack Buffer will quickly fill up. When this happens, "slack" must be taken up in the upstream node. Another situation where the Slack Buffer fills up is when the head flit blocks to wait for output resources to free up. If the path through the node were to remain passing through the Slack Buffer after it has completely filled up, there would be no slack available for future retransmissions. A mechanism must be put in place to recover this slack to make HiPER more robust.
The mechanism employed by HiPER requires that the Slack Buffer of the downstream node completely empty before the Hold signal is released to the upstream node. This will reset the slack absorbing capabilities of the node before it accepts any new data. For a detailed treatment of the Slack Buffer and the Hold signal, refer to [14] .
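The release rule for the Hold signal can be sketched as a small state model. The class name, the default capacity, and the choice to assert Hold exactly when the buffer becomes full are assumptions for illustration; only the rule that Hold is released once the buffer has completely emptied is taken from the text.

```python
from collections import deque


class SlackBuffer:
    """Toy model of a Slack Buffer with a Hold line to the upstream node."""

    def __init__(self, capacity=2):
        self.capacity = capacity   # flit-sized registers (assumed value)
        self.flits = deque()
        self.hold = False          # Hold signal seen by the upstream node

    def push(self, flit):
        if self.hold:
            raise RuntimeError("upstream must not send while Hold is asserted")
        self.flits.append(flit)
        if len(self.flits) == self.capacity:
            self.hold = True       # assumed trigger: no slack left

    def pop(self):
        flit = self.flits.popleft()
        if not self.flits:
            self.hold = False      # released only when completely empty
        return flit


if __name__ == "__main__":
    buf = SlackBuffer()
    buf.push("n+3")
    buf.push("n+4")
    print(buf.hold)    # Hold asserted: buffer is full
    buf.pop()
    print(buf.hold)    # still asserted: buffer not yet empty
    buf.pop()
    print(buf.hold)    # released: full slack recovered
```

Waiting for the buffer to drain completely, rather than releasing Hold as soon as one register frees up, is what restores the node's full slack-absorbing capacity before new data arrives.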
HIPER IMPLEMENTATION
The goal of the HiPER design and implementation effort was the physical design of a functional router that combines mad postman switching and flit-level retransmission (FLRT). The implementation was carried out so as to produce a routing element that highlights the differences between HiPER and a parallel crossbar router. The comparisons made were in the areas of chip area usage and message latency. The full flit-size parallel crossbar router was chosen for comparison since that style of implementation is commonly employed in router designs. Since a complete flit has to be received before the parallel crossbar router can make a routing decision, conversion circuitry is required at its inputs and outputs when it is used with channels narrower than the flit size (serial channels in this implementation).
To highlight the areas of greatest difference between HiPER and the comparison router, a complete unidimensional router (see Fig. 1 and Fig. 2 ) was implemented, but without the arbitration granting and resource conflict resolution state machines. This arbitration functionality was not implemented because it would be nearly identical for both HiPER and the comparison router. The arbitration interface, however, was fully specified, so the arbitration functionality can be added at a later time with relative ease.
Datapath and Control Implementation
The HiPER datapath was implemented using fully custom CMOS logic. The logic gates were implemented using static circuit design techniques and the flip-flops were implemented using dynamic charge storage to hold data values. A two-phase nonoverlapping clock is generated for the entire chip and distributed using a balanced distribution tree.
The datapaths of the HiPER router are pipelined to distribute the datapath functionality and increase the clock frequency. The number of pipeline stages in the datapath for HiPER is a function of the input channel. For the injection channel, the number of stages is five and, for both router-to-router input channels, the number of stages is four. The injection channel pipeline is longer than the router-to-router pipeline because the injection channel does not perform speculative routing. Instead, the injection channel observes the Sign and Eject bits and routes deterministically.
The crossbar is a serial structure, so its contribution to the overall router area is small. This is in contrast to a full flit-size parallel crossbar, which contributes a substantial portion of the total router area. The serial crossbar also presents a smaller parasitic load than a parallel crossbar because shorter metal traces are required to connect all the switch transistors. This reduced parasitic load will allow for higher data frequencies in the HiPER datapath. The area of the HiPER crossbar is only 2.6 percent of the total area of the router.
The control circuitry consists of communicating finite state machines that generate the control signals and control the operation of the datapath. The circuits were automatically generated using SIS [15] and pla2mag [14].
HiPER was evaluated against a comparison router that incorporates a parallel datapath (crossbar and registers) and a serial-to-parallel conversion function at the router inputs and outputs. The comparison router block diagram appears in Fig. 7. The comparisons made between HiPER and the comparison router were chip area consumption and message latency, with models created for both. For the latency models, the actual datapath bit rate was simulated and incorporated into the evaluation, producing router latency based upon actual datapath implementations.
Router Layout
Area Models
This section presents models for chip area consumption for both HiPER and the comparison router. These models characterize the area consumed by the two routers vs. the flit size of the system. The basic form of the models for both routers is:

$$\text{Area} = F_{ex}\,\left(A + B f + C f^{2}\right)$$
This model is based upon extracting the area of macro structures that comprise the router (e.g., crossbar, flit buffers, state machines) and applying an empirical expansion factor $F_{ex}$ to convert the areas of the macro structures into the actual router layout area. The value $F_{ex}$ was obtained by examining the global layout of HiPER shown in Fig. 6. The terms A, B, and C represent the combined areas of all macro structures that are: (A) independent of flit size, (B) proportional to flit size, and (C) proportional to the square of the flit size, respectively. The value f is the size of a flit in bits. Table 1 shows the values of A, B, and C extracted directly from the HiPER layout. The first column in the table (block) gives the name of the macro structure being measured. The second column (constant area) gives the partial area of the macro structure that does not vary with flit size. The third column (per-bit area) gives the portion of the macro structure's area that varies with flit size in units of λ²/bit. The quantity column indicates how many of these macro structures are present in the HiPER layout. The last two columns show the total areas consumed by each macro type.
The total constant column is summed to give the value of A for the HiPER router ($A_H$ = 4,764,647), and the total per-bit column is summed to give the value of B ($B_H$ = 205,985). An interesting thing to note about HiPER is that there is no C term ($C_H$ = 0). This is because none of the macro structures used to create HiPER are proportional to the square of the flit size. Another interesting observation about Table 1 is that most of the router area is due to the constant area blocks (control circuitry) and very little is proportional to the flit size. This is because the datapath of this HiPER implementation is serial and most of the datapath structures are one bit wide independent of the flit size. Exceptions are the Retransmit Buffer and Slack Buffer.
The total area of HiPER, which operates on 10-bit flits (Fig. 3), is 10,186,500 λ². This quantity was extracted directly from the HiPER layout. Taking the data from Table 1 and applying a 10-bit flit size produces a total macro structure area of 6,824,497 λ². Dividing these two numbers produces the estimated expansion factor, $F_{ex}$ = 1.49, which is used for both HiPER and the comparison router to estimate macrocell wiring area. It is assumed that this area is proportional to gate complexity and that neither router design would be pathologically easy or difficult to wire. This interconnect consists of both wiring and glue logic. Similarly, blocks of the comparison router were both laid out (datapath elements) and estimated from similar circuits in the HiPER layout (control circuitry) [14]. Given these layouts and estimates, the data in Table 2 was compiled. The format of Table 2 is the same as that of Table 1.
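The expansion-factor arithmetic can be checked directly from the quoted figures. The variable names below are illustrative; the numbers are the ones given in the text.

```python
# Values quoted in the text for the HiPER layout.
A_H = 4_764_647            # total constant macro area
B_H = 205_985              # total per-bit macro area
FLIT_BITS = 10             # flit size of the implemented router
LAYOUT_AREA = 10_186_500   # total area extracted from the layout

# Macro structure area for a 10-bit flit (no quadratic term for HiPER).
macro_area = A_H + B_H * FLIT_BITS

# Empirical expansion factor: layout area over macro structure area.
F_ex = LAYOUT_AREA / macro_area

if __name__ == "__main__":
    print(macro_area)        # 6824497
    print(round(F_ex, 2))    # 1.49
```

The macro area reproduces the 6,824,497 figure exactly, which confirms that the 10-bit flit size in this calculation is the full flit, including the Parity and Discard bits.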
An important characteristic of Table 2 and, thus, the comparison router is the inclusion of a C term. This comes about because the parallel crossbar's area is proportional to the square of the flit size.
The modeling presented above produces the following equations for HiPER area and the comparison router area vs. flit size. For comparison, the size of the flit (f) refers to the data payload portion of the flit. For HiPER, the actual flit size is two bits larger than the indicated flit size (due to the Parity and Discard bits). For the comparison router, the actual flit is one bit larger (due to the Parity bit only). This is indicated by the f + 2 and f + 1 factors in the following equations. $Area_H$ is the equation for HiPER and $Area_C$ is the equation for the comparison router.
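A minimal sketch of the two area models follows, assuming they take the general form $F_{ex}(A + Bf + Cf^2)$ with the payload size f replaced by f + 2 for HiPER and f + 1 for the comparison router, as the text describes. The HiPER constants are the values quoted from Table 1; the comparison router constants come from Table 2, which is not reproduced here, so they are left as function arguments. Function names are hypothetical.

```python
F_EX = 1.49       # expansion factor quoted in the text
A_H = 4_764_647   # HiPER constant area (Table 1 total)
B_H = 205_985     # HiPER per-bit area (Table 1 total)


def hiper_area(f):
    """Assumed HiPER model: no quadratic term, flit is f + 2 bits
    (payload plus Parity and Discard)."""
    return F_EX * (A_H + B_H * (f + 2))


def comparison_area(f, A_C, B_C, C_C):
    """Assumed comparison-router model: flit is f + 1 bits (payload plus
    Parity). A_C, B_C, and C_C must be supplied from Table 2."""
    bits = f + 1
    return F_EX * (A_C + B_C * bits + C_C * bits ** 2)


if __name__ == "__main__":
    for f in (8, 16, 32, 64):
        print(f, round(hiper_area(f)))
```

Because HiPER has no quadratic term, its area grows only linearly with payload size, whereas any nonzero `C_C` makes the comparison router's area grow with the square of the flit size.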
These relationships are plotted in Fig. 7 for flit sizes of 8, 16, 32, and 64 bits. Fig. 8 compares HiPER to the comparison router as well as to a purely parallel router implementation. The purely parallel curve is simply the comparison router without the serial-to-parallel and parallel-to-serial functionality. In this figure, the quadratic nature of the comparison router area model can be clearly seen. One interesting observation is that the comparison router consumes less area for 8-bit flits, but consumes more area for flit sizes of 16 bits or greater. This is due to the effect of the parallel crossbar on the comparison router's total area and the fact that the constant contribution to the area of HiPER is greater than for the comparison router, due to HiPER's more complicated control circuitry.
As flit size increases, the $C_C$ term (which is multiplied by $f^2$) begins to dominate until, at 64 bits, the crossbar is the largest block in the comparison router. If the area data in Fig. 8 is decomposed into datapath and control circuitry, with the datapath circuitry further divided into crossbar and noncrossbar circuitry, the differences between the two routers can be more clearly understood. The noncrossbar datapath area component of the two routers is nearly equal, with the comparison router's noncrossbar datapath area component being just slightly larger. The control circuitry area components, however, are not equal. HiPER has a larger portion of its total area contribution coming from the control circuitry of the router (state machines). HiPER's crossbar, being a serial structure, does not contribute significantly to the router's overall area: it comprises only 2.6 percent of HiPER's total area. In the comparison router, on the other hand, the crossbar begins to dominate at 32-bit flits and is the largest single contributor at 64-bit flit sizes. This brings up another important point: when considering chip area consumption, the HiPER router is less sensitive to variations in system flit size.
Latency Models
The latency modeling was partitioned into two efforts: router datapath clock frequency prediction and message latency in bit-times. The two results were combined to determine the average router latencies in seconds.
Latency in Bit-Times
The latency models presented in this section were derived assuming no contention in the network in order to highlight the full advantages of mad postman switching. When contention is present, messages begin to block, latency is no longer strongly affected by speculative routing (mad postman), and traffic patterns play a bigger role in message latency. In [10], there is a very thorough treatment of the effects of accepted traffic and routing algorithms on latency. The routing algorithms for both HiPER and the comparison router are the same (dimension order), so that factor cancels for both routers. HiPER buffers four flits per router during a blocking condition, whereas the comparison router can buffer only two flits. This gives HiPER a more "cut-through" look, allowing HiPER messages to hold only half as many link resources for a given message size. This comparison is also presented in [10].
In this section, the clock frequency of the two routers is taken to be the same. For this reason, the models presented here give latency in terms of bit-times. The latency for the two routers will be modeled vs. message size and path length, with both values based upon the average message characteristics presented in [14] . This equal bit-time comparison will prove accurate when technology advances allow the router cores to be run at the speeds currently achievable in high-speed serial links [2] , [19] , [9] .
Comparison Router. The comparison router has a relatively simple model for latency. Message latency has two components, the time for the head flit to reach the destination and the time required for the message to drain out of the network after the head flit has arrived at the destination. The following variables will be used in the modeling:
- D = distance from the source to the destination in hops,
- L = length of the message in flits, and
- f = size of the data portion of the flits in bits.
The model for the comparison router latency is constrained by the serial channels used to feed the router. The equation representing this latency is given below.
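As a quick numerical check, the comparison router model can be sketched in Python. The function name and the product form (f + 1)(2D + L)T_bit are assumptions read off the descriptions of the f + 1, 2D, L, and T_bit terms in the text; this is not a transcription of the typeset equation.

```python
def comparison_latency_bits(D, L, f, t_bit=1):
    """Latency of the comparison router in bit-times (assumed form).

    D: hops from source to destination
    L: message length in flits
    f: data payload bits per flit (one parity bit is added per flit)
    """
    flit_bits = f + 1              # payload plus parity
    head = 2 * D * flit_bits       # two register stages per hop, one flit time each
    drain = L * flit_bits          # time to drain the message behind the head flit
    return (head + drain) * t_bit


# Average message from the text: L = 9 flits, D = 5 hops, 32-bit flits.
print(comparison_latency_bits(D=5, L=9, f=32))
```

With t_bit left at one, the result is in bit-times, matching the equal-clock assumption of this section.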
The term f + 1 is the length of the flits in bits, which is one bit larger than the data payload due to parity. The $T_{bit}$ term is the bit time, which is set to one for this comparison. The 2D term represents the time required to get the head flit from the source to the destination. The factor of two comes about because the comparison router has registers on either side of the crossbar. The L term represents the time required to drain the message and is equal to the length of the message in flits times the length of the flit in bits.
HiPER. The HiPER latency models are slightly more complex due to misspeculation effects. Since misspeculation occurs once per dimension in the network, the latency depends on the number of dimensions in the network and on the probability that any given route will make dimension transitions (and how many). For this modeling, a two-dimensional and a three-dimensional network will be considered. The variables presented in the modeling of the comparison router will be used here, as well as the following additional terms.
- P(1 turn) = probability that a message makes a single dimension transition,
- P(2 turns) = probability that a message makes two dimension transitions (zero for a 2D mesh),
- P(no turns) = probability that a message makes no dimension transitions.
Each dimension transition has a latency penalty associated with it, which will be described in the following paragraphs.
For the two-dimensional case, the probability of making a dimension transition must be analyzed first. This analysis assumes all possible destinations a distance D from the source are equally probable. The total number of such destinations in a 2D mesh in the general case is 4D, ignoring edge effects. The number that are reachable without turning (no dimension transition) is four. The number reachable with a turn is therefore 4D - 4 = 4(D - 1). This produces a probability of turning that is given by:
$$P(\text{1 turn}) = \frac{4(D-1)}{4D} = \frac{D-1}{D}$$
This results in a probability of not turning that is given by:
$$P(\text{no turns}) = 1 - P(\text{1 turn}) = \frac{1}{D}$$
The latency associated with a hop that does not involve a turn is four, which is the number of flip-flops in the datapath of a single HiPER router. In a hop where the speculation is incorrect (dimension transition), the latency is equal to the flit size in bits plus five. This is because the misrouted flit has already passed through the router (flit-size latency), and the misspeculation then requires that the ejection-injection path be taken, which has five flip-flops in its datapath. These latency numbers, in conjunction with the above probabilities, produce the following result for the turn contribution to latency:
$$\text{Latency}_{turn} = \frac{D-1}{D}\,[(f+2)+5] + \frac{1}{D}\cdot 4$$
This result is the probability of a turn multiplied by the turn latency plus the probability of no turn multiplied by the no-turn latency. The f + 2 term appears because the HiPER flit size is the payload size (f) plus the Parity bit and the Discard bit.
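The two-dimensional turn probabilities and the resulting turn contribution can be checked by brute-force enumeration. The sketch below is illustrative only: the function names are hypothetical, and it assumes an unbounded mesh (no edge effects) with equally likely destinations, as in the analysis above.

```python
from fractions import Fraction


def turn_stats_2d(D):
    """Return (P(1 turn), P(no turns)) for destinations at distance D."""
    total = 0
    turning = 0
    for x in range(-D, D + 1):
        for y in range(-D, D + 1):
            if abs(x) + abs(y) != D:
                continue
            total += 1
            # Dimension-order routing turns only if both offsets are nonzero.
            if x != 0 and y != 0:
                turning += 1
    return Fraction(turning, total), Fraction(total - turning, total)


def turn_latency_2d(D, f):
    """Expected latency of the hop that may turn, in bit-times."""
    p_turn, p_straight = turn_stats_2d(D)
    turn_cost = (f + 2) + 5      # flit size (payload + Parity + Discard) plus five
    straight_cost = 4            # flip-flops in the straight-through datapath
    return p_turn * turn_cost + p_straight * straight_cost


print(turn_stats_2d(5))          # probabilities as exact fractions
print(turn_latency_2d(5, 32))    # expected turn contribution for 32-bit payloads
```

Using exact fractions avoids rounding noise when comparing the enumeration against the closed forms (D - 1)/D and 1/D.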
The equation for the latency of HiPER in a two-dimensional mesh is:
In the previous equation, the term enclosed in the square brackets represents the latency required to get the head of the message from the source to the destination. The (f + 2) + 5 term represents the latency for the one turn required of all messages to eject from the network. The L(f + 2) term describes draining the message from the network after the head flit has arrived. The $T_{bit}$ factor will be set to one for this comparison.
The analysis for the three-dimensional network is similar, but more complicated, and is not fully presented here. For the complete analysis, see [14]. Using the same reasoning used to derive the two-dimensional relationship, once again ignoring edge effects, the following relationships result:
The total number of possible destinations for a message traveling a distance D is:
$$N(D) = 6 + 12(D-1) + 4(D-1)(D-2) = 4D^2 + 2$$
All of these destination nodes lie on an octahedral surface centered at the message source. The first term in the above equation corresponds to the six vertices of this octahedral surface (no-turn destinations). The second term corresponds to the twelve edges (one-turn destinations), and the last term corresponds to the eight faces (two-turn destinations).
The following are the probabilities associated with each turn situation:
$$P(\text{no turns}) = \frac{6}{4D^2+2}, \qquad P(\text{1 turn}) = \frac{12(D-1)}{4D^2+2}, \qquad P(\text{2 turns}) = \frac{4(D-1)(D-2)}{4D^2+2}$$
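The three-dimensional destination counts can be verified the same way. This is a minimal sketch under the same assumptions as the analysis (unbounded mesh, dimension-order routing, equally likely destinations); the function name is hypothetical.

```python
from itertools import product


def destination_counts_3d(D):
    """Count destinations at Manhattan distance D by number of turns (0, 1, 2)."""
    counts = {0: 0, 1: 0, 2: 0}
    for offset in product(range(-D, D + 1), repeat=3):
        if sum(abs(c) for c in offset) != D:
            continue
        # A route turns once for every dimension used beyond the first.
        dims_used = sum(1 for c in offset if c != 0)
        counts[dims_used - 1] += 1
    return counts


counts = destination_counts_3d(5)
total = sum(counts.values())
print(counts, total)
for turns, n in counts.items():
    print(turns, "turn(s): probability", n / total)
```

For any D of at least one, the three counts correspond to the vertex, edge, and face destinations of the surface described above, and their sum gives the total number of destinations.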
The penalties for each turn are identical to those presented for the two-dimensional router and this produces a turn latency given by:
The factor of eight multiplying the no-turn probability is the latency for two straight passes through the HiPER datapath. The 11 + f term multiplying the one-turn probability is the latency for one straight-through route (4) and one turn ((f + 2) + 5). The term multiplying the two-turn probability is two times the one-turn latency. The total latency in the three-dimensional mesh network is given by:
This equation is virtually identical to the equation for the two-dimensional latency, so no detailed description will be presented.
Results. The results of applying these latency models to average message sizes may now be plotted. Fig. 9 shows the latency of average messages (L = 9, D = 5, see [14]) for the comparison router and for both the two-dimensional and three-dimensional HiPER models. Flit sizes of 8, 32, and 64 bits are plotted.
The two-dimensional HiPER latency is lower than that of the comparison router for all flit sizes, and the three-dimensional HiPER latency is lower than that of the comparison router for all flit sizes other than eight bits. Another important feature of HiPER is that its latency is less sensitive to variations in the flit size. This is due to the speculative nature of the mad postman routing algorithm. Since the no-turn latency of HiPER is independent of the size of the flits, the flits can grow without strongly affecting the overall latency. The latency associated with misspeculation and turning is all that grows with flit size, and this latency is only incurred once per dimension in the network.
Fig. 10 shows how the latency models are affected by the distance the message travels (D). This plot shows the effect of the portion of latency associated with routing the head flit from the source to the destination. It varies the message distance from one to nine hops while holding the message length constant at L = 9.
HiPER is less sensitive to message distance variation because of its speculative routing algorithm. HiPER's inclusion of mad postman is what produces this behavior and is exactly what mad postman was designed to accomplish. The portion of HiPER's latency due to getting the head of the message to the destination is much lower than that of the comparison router. Another interesting characteristic of the HiPER plots is the flattening of the HiPER latency curves at larger message distances. This is due to the dimensionality of the network and the ratio of misspeculations (turns) to correct speculations (straight-through routes). As the distance the message travels increases, the number of misspeculations becomes smaller relative to the entire message distance, and the effect of incorrect speculation plays a smaller part in overall latency. The comparison router's latency is directly proportional to both message distance and flit size, as indicated in the plot.
The results of varying the message lengths are plotted in Fig. 11. The message distance is held constant at D = 5 and the message length varies between four and fourteen 32-bit flits. This figure shows the effect of the portion of latency that represents draining the message from the network after the message head has arrived at the destination. Both HiPER and the comparison router have the same dependency on message length, since mad postman only affects the portion of latency that depends on getting the head of the message to the destination. As a result, the plots in Fig. 11 all have roughly the same slope. The stairstep characteristic of the 64-bit flit plots comes about because the message size is increased in increments of 32 bits.
Datapath Clock Frequency
HiPER outperforms the comparison router in terms of latency expressed in bit-times, but the speed of the datapath has not yet been explored. The simulations presented in this section were performed to determine the maximum datapath clock frequency for HiPER and the comparison router in HP26G 0.8 μm CMOS. For this study, only the datapath elements of both routers were simulated. The control circuitry was generated using automated design tools (Berkeley SIS tools) for the purpose of area comparisons only. Further optimizations of this circuitry could be included in the future. The datapath, on the other hand, was laid out by hand for comparisons of both area and latency.
Comparison Router. The comparison router performs a serial-to-parallel conversion on the incoming data before routing the flits through its datapath. The clock frequency of the input bits is f + 1 times greater than the datapath clock frequency, where f is the size of the data payload in a flit. This means that the datapath of the comparison router operates at a frequency many times lower than that of the serial-to-parallel conversion function.
SPICE simulation of both the datapath and the serial-to-parallel conversion function showed that the datapath is capable of operating at a maximum rate of approximately one-fourth the rate of the serial-to-parallel conversion function, indicating that the serial-to-parallel circuitry is the limiting factor in this router. So, for the purpose of determining the maximum clock frequency, the serial-to-parallel circuitry was extracted and simulated in SPICE. The SPICE parameters were obtained through the MOSIS foundry service. The simulated maximum data rate of this circuitry is 490 Mbps.
HiPER. For HiPER, the entire datapath is serial, so the entire datapath must be extracted to determine its data rate. The router-to-router channel is the most complicated and heavily loaded of the datapath channels, so that circuitry was extracted and simulated. The datapath elements were extracted directly from the global layout, as were all loads presented to the datapath. The maximum simulated data rate of this circuitry is 325 Mbps.
Latency in Seconds
Combining the results from the previous two sections, the router latencies can be expressed in units of time (seconds). Fig. 12 shows the effect of applying the actual simulated data rate to the latency models presented in the previous section. The faster bit clock of the comparison router shifts its latency curve down, making the comparison router a lower latency router for all flit sizes below 64 bits in the 3D case, and for all flit sizes below 32 bits in the 2D case.
For HiPER to realize its full potential to reduce latency, the data rate of the network must be limited by the link speed, not the router core speed. When this is realized, the assumption made in the previous section's latency analysis (constant bit rate for both routers) will be true.
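The conversion from bit-times to seconds can be sketched as follows. The two data rates are the simulated values reported above; the function names and the sample bit-time values in the usage lines are hypothetical and serve only to illustrate the break-even condition.

```python
COMPARISON_RATE_BPS = 490e6   # simulated serial-to-parallel data rate
HIPER_RATE_BPS = 325e6        # simulated HiPER router-to-router channel data rate


def latency_seconds(bit_times, rate_bps):
    """Convert a latency in bit-times to seconds at the given data rate."""
    return bit_times / rate_bps


def hiper_is_faster(hiper_bit_times, comparison_bit_times):
    """True if HiPER has the lower latency in seconds at the simulated rates."""
    return (latency_seconds(hiper_bit_times, HIPER_RATE_BPS)
            < latency_seconds(comparison_bit_times, COMPARISON_RATE_BPS))


# HiPER must need fewer than about 66 percent of the comparison router's
# bit-times to come out ahead once the slower bit clock is accounted for.
print(HIPER_RATE_BPS / COMPARISON_RATE_BPS)
print(hiper_is_faster(650, 1000), hiper_is_faster(700, 1000))
```

The ratio of the two data rates is the break-even point: a bit-time advantage smaller than that ratio is not enough to offset the slower HiPER datapath.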
Energy Models
Using the HiPER prototype and the comparison router design, the energy consumption can be estimated for different configurations, each with a different channel width. As described earlier, HiPER's sub-flit-wide channel (bit-serial in the implementation) leads to a more complicated controller than that of the traditional parallel crossbar router, while the crossbar portion is substantially smaller. The energy evaluation is therefore divided into two parts. First, the control and datapath circuitry is evaluated for a 100 nm technology using GENESYS [11], an analytical technology modeling tool. Then, the crossbar energy is evaluated using parameters from the same technology.
Throughout this paper, a flit with a 10-bit data portion is used. The flit size is 12 bits for HiPER and 11 bits for the parallel crossbar router due to the additional control bits. To evaluate energy consumption, channel width is varied from the smallest width of one bit (bit-serial) to the case where channel width matches flit size. Four configurations are thus chosen for comparison. The first three use HiPER with 1) serial channels, 2) 3-bit parallel channels, and 3) 6-bit parallel channels. The fourth is a parallel crossbar router with 11-bit (full flit size) parallel channels.
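To make the four configurations concrete, the sketch below tabulates how many channel transfers one flit would need in each case. It assumes a flit crosses a channel in ceil(flit size / channel width) transfers, which is an illustrative assumption and is not stated in the text; the function name is hypothetical.

```python
import math


def transfers_per_flit(flit_bits, channel_width):
    """Channel transfers needed to move one flit (assumed: ceil(flit / width))."""
    return math.ceil(flit_bits / channel_width)


# (label, flit size in bits, channel width in bits), taken from the text.
configurations = [
    ("HiPER, serial channel", 12, 1),
    ("HiPER, 3-bit channel", 12, 3),
    ("HiPER, 6-bit channel", 12, 6),
    ("Parallel crossbar, 11-bit channel", 11, 11),
]

for label, flit_bits, width in configurations:
    print(label, transfers_per_flit(flit_bits, width))
```

The chosen HiPER widths all divide the 12-bit flit evenly, so no transfer carries padding under this assumption.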
For the control and datapath circuitry, the number of transistors for each configuration is extracted and input into GENESYS. Technology parameters are set to 100 nm technology as described in the ITRS roadmap. $V_{th}$ is set to 0.297 V, approximately one-third of $V_{dd}$ (0.9 V). As expected, the control part of the parallel crossbar router consumes the least, at 1.201 mW, while the HiPER control circuits consume 1.459, 1.744, and 2.172 mW for 1-, 3-, and 6-bit channels, respectively.
The energy consumption model for crossbars (w = channel width) is shown in (18). The model incorporates the continuous switching of the crossbar control lines, input channels, and output channels. Crossbars are laid out with dimensions based on lambda = 50 nm. GENESYS provides the other parameters (e.g., wire capacitance, gate capacitance, and driver input capacitance). An operating frequency of 1,200 MHz is predicted by GENESYS. The energy consumption result is shown in Table 3.
The total energy consumption is dominated by the control and datapath circuitry. For HiPER with serial channels in particular, the crossbar represents only 1 percent of total energy consumption. With wider channels, the crossbar grows to a higher proportion of the total (1.5, 3.5, and 14 percent for 3-, 6-, and 11-bit channels, respectively).
Total energy consumption is comparable among all four configurations since only 3 × 3 channel crossbars are implemented. In many system implementations, the channel energy budget (i.e., the links that connect routers) will dominate router energy. In these circumstances, a designer can use a HiPER-type router to trade off the bandwidth advantage of parallel channels against the smaller form factor (size, weight) and lower energy consumption of narrower channels. When high bandwidth is desirable, wide parallel channels can be used. The HiPER architecture is suitable when a small channel is desirable (e.g., when using small, lightweight external connections).
HiPER successfully combines flit-level error control and mad postman, producing a routing element well suited to interprocessor communication in high-performance, power- and energy-constrained systems. The studies presented in this paper show HiPER to have superior performance with respect to chip area consumption and message latency when compared with a basic parallel crossbar router operating over serial channels. In addition, HiPER's area consumption is less sensitive to flit size variation, and its latency is less sensitive to message variations. HiPER outperforms a parallel crossbar router in absolute latency (measured in seconds) when considered at the same router datapath frequency. While there are many competitive variations of the basic parallel crossbar router presented here, the HiPER approach merits consideration when non-flit-sized channels are desirable.
Future work can be pursued in several areas, building upon what was learned from HiPER. The Slack Buffer sizing may be optimized to reduce the effects of congestion. The HiPER design could be reimplemented using high-performance circuit techniques, quantifying the amount of datapath speedup this affords. HiPER routing and error control techniques could be applied to table-based routers and source-routed hypercubes to determine whether speculative routing is advantageous in either type of router.
ACKNOWLEDGMENTS
This work is supported by US National Science Foundation contracts #ECS-9422552, EEC-9402723, and ECS-9058144.
