Abstract
Introduction
Message latency in parallel computer networks is directly affected by the underlying switching technique. Virtual cut-through switching forwards a message immediately after its routing computation once the output resources are acquired [10]. Hence, it reduces latency compared with store-and-forward switching for messages that encounter no contention. In wormhole switching, a message is divided into a number of flits (flow control digits) so that flit buffers, or virtual channels (VCs), can be implemented as shallow, fast FIFOs [3]. Mad postman switching tries to eliminate the routing computation latency as well as the serialization delay of the message header at each router in networks built from serial links [8]. This strategy works well with dimension-order routing in 2-D mesh networks because messages turn at most once, from the first dimension to the second.
Another factor affecting latency is the pipeline depth of the router. Most routers process message headers in several pipeline stages, e.g., routing computation and VC allocation, followed by switch allocation. As router clock frequencies rise and routing complexity increases, the routing pipeline tends to become deeper. One approach to reducing the cycles required by the pipeline is routing speculation, which performs several pipeline stages in parallel [16]. The BlackWidow network allows speculative message transmission even to the next router before verifying the cyclic redundancy check code, which is attached as a message trailer [17]. Lookahead routing, which performs the routing computation one hop ahead, also contributes to efficient pipelines [7].
Unfortunately, none of the above techniques actively utilizes the communication regularity that can be seen in many parallel applications. Recently, we proposed a predictive switching technique for 2-D tori that effectively utilizes such regular communication patterns [19]. Although we have shown the potential for reducing message latency based on communication prediction, its performance has not yet been evaluated. We believe that the effectiveness of predictive switching relies on the treatment of prediction failures. Prediction failures lead not only to incorrect path traversals, but also to disturbances in non-predictive communication. In this paper, we propose a method to safely detect and discard all mis-predicted packets. We also present a technique to minimize undesirable packet traversals caused by prediction failures. By statically classifying messages based on their output direction at the source node and introducing simple hardware in each router to validate predictive switching, we can efficiently reduce occurrences of mis-predicted packets.
The remainder of this paper is organized as follows: Section 2 briefly explains predictive switching based on router pipelines. In Section 3, we explain measures to properly treat prediction failures for a dimension-ordered oblivious routing algorithm and Duato's adaptive routing algorithm [6]. Section 4 describes the prediction algorithms, and Section 5 presents network simulation results. We conclude this paper in Section 6.

Predictive Switching
Router Pipeline
Figure 1 shows the time-space diagrams for transferring an 8-phit (physical transfer digit) packet by (a) non-predictive and (b) predictive switching. The packet consists of four header phits, H0 to H3, and four payload phits, P0 to P3. We assume that router-A receives the packet phit-by-phit from cycle 0 to 7, and transfers it to router-B as illustrated in (c). In the case of non-predictive switching, each router forwards the packet through five pipeline stages: input buffering (IB), routing computation (RC), output virtual channel allocation (VA), switch allocation (SA), and switch traversal (ST). We also assume that the RC stage requires three cycles to process a packet header (footnote 1), and the other stages take one cycle per phit. Therefore, router-A executes the VA and SA stages in cycles 4 and 5, respectively, followed by ST stages from cycle 6 to 13. If there is a two-cycle cable delay between two adjacent routers, router-B receives the packet from cycle 9 until 16; then it repeats the same pipeline stages to forward the packet. In total, each router takes 13 cycles to hop an 8-phit packet, and the second router, router-B, sends it out at cycle 22 if there is no contention.
In the case of predictive switching, VA and SA stages can be performed speculatively in parallel with the IB stage based on output port prediction [19]. Hence, we can eliminate five cycles, including three cycles for RC plus two cycles for VA and SA, per hop per packet. Consequently, ten cycles are eliminated in predictive switching of Fig. 1(b) in comparison with the non-predictive switching of Fig. 1(a). Actually, speculative execution of SA in parallel with VA is possible even in the case of non-predictive switching. In this scenario, shortened cycles may be limited to RC cycles. However, a deserialization latency is not negligible for narrow channel routers because a packet header is composed of multiple phits. Dally claims that it would be inefficient to execute the RC stage in parallel with the other speculative stages [4]. The reason is uncertainty of the output port. We believe that predictive switching with efficient elimination of prediction misses is a key to resolving this problem.

Footnote 1: The multiple cycles for the RC stage include deserialization for the packet header phits.
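The cycle counts quoted for Figure 1 can be checked with a small timing model. The sketch below is illustrative only: the function names are hypothetical, and the model assumes that a phit leaving the switch in cycle t reaches the next router in cycle t + 1 + cable delay, an assumption chosen because it reproduces the cycle numbers given in the text. The non-predictive pipeline spends six cycles (IB, three RC cycles, VA, SA) before the first ST cycle; the predictive pipeline is assumed to spend one cycle, with VA and SA overlapped with IB.

```python
def hop_schedule(first_arrival, n_phits, header_cycles, cable_delay):
    """Return (first ST cycle, last ST cycle, first arrival cycle at the next router)."""
    st_start = first_arrival + header_cycles      # cycles spent before the first phit traverses
    st_end = st_start + n_phits - 1               # one ST cycle per phit
    next_arrival = st_start + 1 + cable_delay     # assumed link timing (see lead-in)
    return st_start, st_end, next_arrival


def last_send_cycle(hops, n_phits, header_cycles, cable_delay=2):
    """Cycle in which the last phit leaves the final router, with no contention."""
    arrival = 0
    st_end = None
    for _ in range(hops):
        _, st_end, arrival = hop_schedule(arrival, n_phits, header_cycles, cable_delay)
    return st_end


NON_PREDICTIVE_HEADER_CYCLES = 6  # IB (1) + RC (3) + VA (1) + SA (1)
PREDICTIVE_HEADER_CYCLES = 1      # IB with VA and SA overlapped (assumption)

normal = last_send_cycle(2, 8, NON_PREDICTIVE_HEADER_CYCLES)
predictive = last_send_cycle(2, 8, PREDICTIVE_HEADER_CYCLES)
print(normal, predictive, normal - predictive)
```

Under these assumptions the model gives ST cycles 6 to 13 at router-A, arrival at router-B from cycle 9, and a difference of ten cycles over two hops, matching the figures quoted above.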
Detection of Mis-Predicted Packets
Predictive switching has a risk of misrouting packets when the output port prediction fails. In this paper, we call the misrouted packets mis-predicted packets. To guarantee correct packet delivery, the normal RC stage is also executed in parallel with the predictive pipeline. When the computed output port matches the predicted output port, it means predictive switching succeeds. Otherwise, predictive switching fails and the packet is retransmitted to the correct output port by the non-predictive pipeline. After several dead phits have been sent out, the router immediately sends a tail phit along the mis-predicted path to minimize the overhead. Because mis-predicted packets consume network resources and bandwidth, they should be properly detected and discarded. This is explained in section 2.4.
For minimal routing algorithms, such as X-Y dimension-order routing (DOR) and adaptive routing based on Duato's protocol [6], referred to as DPR later, mis-predicted packets can be detected by checking whether they stay in the minimal quadrant of the network from source to destination. We add this detection mechanism to the RC stage. Since mis-predicted packets are detected in the RC stage, they can be discarded when they are blocked by other packets. In other words, mis-predicted packets propagate until they are blocked.

To avoid deadlocks caused by a torus cycle, we also set up routers that can disable predictive switching. Figure 2 shows an example of a router arrangement in which some input ports, marked as solid black, only work in non-predictive mode. In this example of a 2-D torus, north and south input ports in the first row as well as east and west input ports in the right edge column do not execute predictive pipelines. Let us consider a message sent from source node 16 to destination node 13. It will take a minimal path such as 16 → 17 → 18 → 13. If this packet is transferred in the wrong direction as a result of prediction failures and is not blocked by any other packet at intermediate routers, it is still detected at routers in the right edge column or in the top row. For example, when the packet is misrouted at router 18 in the east direction, it is discarded at router 19. The packet is then retransmitted from the input buffer on router 18 in the north direction through the non-predictive pipeline. By setting up more routers that disable predictive switching in the network, we can reduce the average propagation distance of the mis-predicted packets. It is also possible to reduce the number of routers that disable predictive switching, e.g., by placing them at diagonal positions only, to lengthen the predictive switching distance.
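The minimal-quadrant test can be sketched in a few lines. This is a minimal illustration rather than the router's actual logic: the function names are hypothetical, and row-major node numbering (node = y * k + x) is an assumption inferred from the 16 → 17 → 18 → 13 example. When both directions around a ring are equally short, the sketch accepts either one.

```python
def on_minimal_arc(src, dst, cur, k):
    """True if coordinate cur lies on a shortest ring path from src to dst (ring size k)."""
    fwd = (dst - src) % k          # hops when travelling in the + direction
    bwd = (src - dst) % k          # hops when travelling in the - direction
    ok = False
    if fwd <= bwd:                 # + direction is minimal (or tied)
        ok = ok or (cur - src) % k <= fwd
    if bwd <= fwd:                 # - direction is minimal (or tied)
        ok = ok or (src - cur) % k <= bwd
    return ok


def in_minimal_quadrant(src, dst, cur, k):
    """True if node cur stays inside the minimal quadrant between src and dst."""
    sx, sy = src % k, src // k     # assumed row-major numbering
    dx, dy = dst % k, dst // k
    cx, cy = cur % k, cur // k
    return on_minimal_arc(sx, dx, cx, k) and on_minimal_arc(sy, dy, cy, k)


# The example from the text on a 5-ary 2-cube: source 16, destination 13.
for node in (17, 18, 13, 19):
    print(node, in_minimal_quadrant(16, 13, node, 5))
```

In this sketch nodes 17, 18, and 13 pass the check, while node 19, reached by the eastward misroute at router 18, fails it.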
Deadlocks can be avoided using the same techniques as in non-predictive switching networks, e.g., virtual channel flow control [3] or bubble flow control [18] .
Reduction of Mis-Predicted Packets
In minimal routing algorithms, such as DOR and DPR, the output direction of packets is fixed prior to the transfer. We encode this static information in the packet header to reduce propagation of mis-predicted packets. Note that the static information is never changed after the packet is injected into the network. Therefore, it differs from lookahead routing, which dynamically updates routing information. Our static information for packet direction eliminates obvious errors of incorrect packet propagation caused by prediction failures. It does not, however, guarantee that the static information names the correct output port at every router. For example, a packet with an "X+" flag may eventually be received through the consumption port at its destination node instead of leaving through the X+ output port.
This section discusses a technique to reduce the number of mis-predicted packet occurrences. Figure 3 shows the classification of packet directions for a 5-ary 2-cube. Since a torus network is homogeneous, namely every node has the same topological relation with the other nodes, we show an example of the classification from node 12. Packets are classified into eight groups based on their directions, and their directions are encoded into four bits, $D_{x+} D_{x-} D_{y+} D_{y-}$.
Classification of Packet Directions
Each bit corresponds to a valid output port that the packet will take for the X+, X−, Y+, and Y− directions. For instance, a packet that should be routed only in the X+ direction has 1000 in its packet header. A packet to X+ and Y+ is encoded as 1010, and so on.

Figure 4 shows the block diagram of the input port of the predictive switching router. It consists of one or more virtual channels, memory to save communication histories, and a decoder circuit to create an enable signal for predictive switch traversal (PST). The predictor may be a shared resource among all the input ports. It generates a five-bit value, $P_{x+} P_{x-} P_{y+} P_{y-} P_c$, in which only a single bit is set. Each bit corresponds to an output port (X+, X−, Y+, or Y−) or to the consumption port (C) at the destination node; $P_c$ is set only in predictions for the network ports and never for the injection port (footnote 2).
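The four-bit direction code could be computed at the source node roughly as follows. This is a sketch under stated assumptions, not the encoding logic of the actual router: `direction_bits` is a hypothetical name, nodes are assumed to be numbered row-major (node = y * k + x), "+" is assumed to mean increasing coordinate, and ties on even-sized rings are resolved toward the "+" direction.

```python
def direction_bits(src, dst, k):
    """Return the direction code as a string of four bits: X+, X-, Y+, Y-."""
    sx, sy = src % k, src // k     # assumed row-major numbering
    dx, dy = dst % k, dst // k

    def dim_bits(s, d):
        fwd = (d - s) % k          # hops in the + direction
        if fwd == 0:
            return (0, 0)          # no movement needed in this dimension
        bwd = k - fwd              # hops in the - direction
        return (1, 0) if fwd <= bwd else (0, 1)   # tie goes to + (assumption)

    xp, xm = dim_bits(sx, dx)
    yp, ym = dim_bits(sy, dy)
    return f"{xp}{xm}{yp}{ym}"


# Classification from node 12 of a 5-ary 2-cube, as in the example in the text.
groups = {direction_bits(12, d, 5) for d in range(25) if d != 12}
print(sorted(groups))
```

With these assumptions, the destinations reachable from node 12 fall into exactly eight distinct codes, consistent with the eight groups described above.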
Conditions for Predictive Switching
In this subsection, we show the conditions that enable predictive switching. Simple bit operations determine whether the prediction value is adequate. The table lists the condition for each transfer from an input port to an output port under X-Y dimension-order routing. Blank cells in the table mean prohibited turns in the routing algorithm, such as 180-degree turns and turns from the Y to the X dimension. The PST enable signal for an input port is computed as the sum of products of each row in the table.

An injection packet will be output from one of the network ports. The PST enable signal for the X+ and X− directions is valid when $D_{x+} \cdot P_{x+} = 1$ and $D_{x-} \cdot P_{x-} = 1$, respectively. For the Y+ and Y− directions, $D_{y+} \cdot P_{y+} = 1$ and $D_{y-} \cdot P_{y-} = 1$ are required together with $D_{x+} = D_{x-} = 0$, since packet movement in the X-dimension must always precede movement in the Y-dimension. This additional condition is not necessary for an adaptive routing algorithm, which allows packets to be routed in any order between the X and Y directions.
For X-dimensional ports, packets may be transferred under one of the following three options: straight to the opposite side of the X-dimension, a turn into the Y+ or Y− direction, or reception at the consumption port. The condition for transferring a packet in a straight line in the X-dimension is just $P_{x+} = 1$ or $P_{x-} = 1$, since the injection condition guarantees that the current direction in the X-dimension is correct. The condition for turning into one of the Y directions simply checks whether the packet can move in that particular Y-direction. Prediction failures that would send a packet in the Y-direction opposite to the one indicated by the $D_{y+}$ and $D_{y-}$ bits in its header are therefore avoided. A predictive request sent to the consumption port is always valid for the network ports when $P_c = 1$, regardless of the value in the packet header.
Conditions for the Y-dimensional ports are much simpler, because accepted packets continue to advance in the same direction until they reach the destination nodes. Hence, a valid prediction is the opposite Y-dimensional port or consumption port.
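The conditions described in this subsection can be collected into one predicate. The sketch below is a software paraphrase of the prose for X-Y DOR, not the decoder circuit itself: the names `pst_enable` and `bits` and the port labels are hypothetical, an input port is assumed to forward "straight" to the oppositely named output port, and the rule that an injected packet may enter a Y port only when no X movement is required is an assumption inferred from the X-before-Y ordering.

```python
PORTS = ("X+", "X-", "Y+", "Y-")
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


def bits(code):
    """Convert a direction code such as '1010' into a dict keyed by port name."""
    return dict(zip(PORTS, (c == "1" for c in code)))


def pst_enable(in_port, predicted, d):
    """True if predictive switch traversal may proceed.

    in_port   : 'INJ', 'X+', 'X-', 'Y+' or 'Y-'
    predicted : the single port named by the predictor ('X+', 'X-', 'Y+', 'Y-' or 'C')
    d         : direction bits from the packet header (see bits())
    """
    if in_port == "INJ":
        if predicted == "C":
            return False                       # P_c is never set for the injection port
        if predicted in ("X+", "X-"):
            return d[predicted]
        # Y output from injection: only if no X movement remains (assumption)
        return d[predicted] and not (d["X+"] or d["X-"])
    if predicted == "C":
        return True                            # always valid for network ports
    if in_port in ("X+", "X-"):
        if predicted == OPPOSITE[in_port]:
            return True                        # straight in the X-dimension
        if predicted in ("Y+", "Y-"):
            return d[predicted]                # turn only into a permitted Y direction
        return False                           # 180-degree turn
    return predicted == OPPOSITE[in_port]      # Y input: straight only


print(pst_enable("X+", "Y-", bits("0101")), pst_enable("X+", "Y-", bits("0110")))
```

A real decoder would evaluate these terms as a sum of products over the one-hot prediction bits; the dictionary form is used here only for readability.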
Shortening Mis-Predicted Packets
The mechanism referred to in the previous section completely removes incorrect propagations of mis-predicted packets from the injection ports, although they may still appear at the network ports. Because the mis-predicted packets waste network bandwidth, incorrect packet propagation should be stopped as early as possible.

Figure 5. Shortening mis-predicted packet. (Annotations in the figure: the whole packet is transferred to the correct path; the mis-predicted packet is shortened by setting a tail flag.)
We explain how to do this by using the example in Figure 5. Suppose a predictor supplies Y− as the output port for the next incoming packet at the X+ port. Later, a long packet reaches the X+ input port, and it is predictively forwarded in the Y− direction because the $D_{y-}$ bit is set in its packet header. The non-predictive routing pipeline is also executed in parallel with the predictive one, as we explained in section 2.1. Therefore, the correct path can be learned at the RC stage by decoding the destination field of the packet header. When the correct path is X− and the routing logic recognizes that the predicted path is wrong, predictive switching is stopped by setting a tail flag on the current phit in the predictive pipeline. Predictive ST stages are not executed for the rest of the phits, so the Y− port immediately becomes available for other packets. The packet stored in the buffer of the X+ input port is retransmitted from the header phit to the correct output port through the non-predictive pipeline (footnote 3).
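The effect of the tail flag can be illustrated with a toy model of the phit stream. This sketch is illustrative only: `shorten_mispredicted` is a hypothetical name, phits are modeled as (name, tail flag) pairs, and the index of the phit in flight when the RC stage detects the failure is supplied by the caller, since it depends on pipeline timing.

```python
def shorten_mispredicted(packet, current_index):
    """Model a mis-prediction detected while packet[current_index] is being traversed.

    Returns (phits sent on the wrong path, phits retransmitted on the correct path),
    each as a list of (name, tail_flag) pairs.
    """
    # Phits already sent predictively; the current one is marked as the tail.
    wrong_path = [(name, False) for name in packet[: current_index + 1]]
    wrong_path[-1] = (wrong_path[-1][0], True)
    # The whole packet is retransmitted from the header phit on the correct path.
    correct_path = [(name, i == len(packet) - 1) for i, name in enumerate(packet)]
    return wrong_path, correct_path


packet = ["H0", "H1", "H2", "H3", "P0", "P1", "P2", "P3"]
wrong, correct = shorten_mispredicted(packet, 3)
print(len(wrong), len(correct))
```

In this example only four phits occupy the mis-predicted output port instead of eight, after which the port is free for other packets.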
Communication Prediction
Prediction algorithms are a crucial performance factor in a network using predictive switches. We briefly describe three practical prediction algorithms that can be applied to the predictive switching architecture.
Sampled Pattern Matching (SPM) Prediction
This prediction algorithm selects a value which has the highest probability of appearance after positions of a particular pattern in its communication history. The original algorithm was proposed as a universal predictor [9] .
Let us assume that a sequence $X_1^n = X_1, X_2, \ldots, X_n$ is a communication history, where $X_1, X_2, \ldots, X_n$ denote output port numbers. The SPM algorithm predicts the next value $X_{n+1}$ by finding the most frequently occurring number at sampled positions. First, it searches for the longest suffix of $X_1^n$ whose copy appears somewhere in $X_1^n$. That is, $X_l, \ldots, X_n = X_{l-j}, \ldots, X_{n-j}$ for some $1 \le j \le n$, where the length of the repeated sequence is $D_n = n - l + 1$. Then, it defines a marker that is a sequence of length $k = \alpha D_n$, where $0 < \alpha \le 1$. Such a marker sequence $X_{n-k+1}^n$ appears $O(n^{1-\alpha})$ times in $X_1^n$ with high probability. Finally, the predicted value is calculated by applying a majority rule to all numbers appearing at positions just after the markers. Although the original SPM uses $0 < \alpha < 1$, we initially use $\alpha = 1$ since longer matches showed higher hit rates in our preliminary evaluation [19]. When two or more numbers appear equally often at the sampled positions, we rerun the algorithm with a shorter marker sequence and choose the value that appears most often.

In the following example, the longest matching suffix is 0012 and the sampled numbers are 3, 2 and 2, respectively. Hence, the SPM algorithm predicts 2 for the next number.

00 0012 312 0012 233 0012 21 0012 ?
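A direct, unoptimized rendering of the SPM procedure with $\alpha = 1$ is sketched below. It is not the predictor hardware evaluated in the paper: `spm_predict` is a hypothetical name, the search is a naive scan, and returning `None` when no untied majority exists at any marker length is an assumption made for this sketch.

```python
from collections import Counter


def spm_predict(history):
    """Predict the next value from a list of past output port numbers."""
    n = len(history)
    # Try marker lengths from longest to shortest. The first length with any earlier
    # occurrence corresponds to the longest repeated suffix (alpha = 1).
    for k in range(n - 1, 0, -1):
        marker = history[n - k:]
        # Values that followed each earlier occurrence of the marker.
        samples = [history[i + k] for i in range(n - k) if history[i:i + k] == marker]
        if not samples:
            continue
        ranked = Counter(samples).most_common()
        if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
            return ranked[0][0]    # clear majority
        # Tie: fall through and retry with a shorter marker.
    return None                    # no prediction available (assumption)


# The example from the text: 00 0012 312 0012 233 0012 21 0012 ?
history = [int(c) for c in "00001231200122330012210012"]
print(spm_predict(history))
```

For this history the longest repeated suffix is 0012, the sampled values are 3, 2 and 2, and the sketch returns 2.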
Static Straight (SS) Prediction
This strategy predicts that all incoming packets from the network will be forwarded straight in the same dimension in the same manner as mad-postman switching [8] . For example, packets arriving at the north input port are always predicted to be output from the south port in 2-D tori. In the dimension-order routing algorithm, this strategy fails at most two times per packet in 2-D tori where the packets turn from one dimension to another, and at the destination nodes. Therefore, packets which travel long distances increase prediction hit rates, whereas communication locality negatively affects this strategy. Predictions are not made for injection packets since all network ports will be allocated to the opposite network input ports in this prediction algorithm.
The static predictor does not require a memory history, hence its implementation cost is low.
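The claim that SS fails at most twice per packet under DOR can be checked by counting direction changes along a path. In this sketch, `ss_misses` is a hypothetical helper and a path is written as the sequence of output ports taken at each router, ending with the consumption port `"C"`; the first entry is the injection at the source, for which SS makes no prediction.

```python
def ss_misses(path_ports):
    """Count SS prediction failures along a path.

    path_ports: output port chosen at each router from source to destination,
    e.g. ['X+', 'X+', 'Y-', 'C']. SS predicts that a packet keeps travelling in
    the direction of its previous hop, so every change of port is a miss.
    """
    return sum(1 for prev, cur in zip(path_ports, path_ports[1:]) if cur != prev)


print(ss_misses(["X+", "X+", "Y-", "C"]))   # one turn plus the arrival
print(ss_misses(["X+", "X+", "X+", "C"]))   # straight path: only the arrival misses
```

A path with one turn misses at the turn and at the destination, while a straight path misses only at the destination, so long straight segments raise the hit rate.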
Latest Port Matching (LPM) Prediction
A latest port matching method predicts that a next packet will use the same output port as the previous packet for each input port. This method requires only a single history record in each port so that the prediction can be performed in a short time. Although this strategy is quite simple, it may work for several patterns, such as straight after straight communication which appears often in DOR, and repeated communication between two neighbors.
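The single-record history used by this method can be sketched as a small class. The class name, method names, and the `count_hits` helper are hypothetical; the sketch only illustrates that one stored value per input port suffices.

```python
class LatestPortPredictor:
    """Keeps one history record per input port: the output port last used."""

    def __init__(self):
        self.last = {}

    def predict(self, in_port):
        # No prediction until the port has forwarded at least one packet.
        return self.last.get(in_port)

    def update(self, in_port, out_port):
        self.last[in_port] = out_port


def count_hits(predictor, in_port, outputs):
    """Feed a sequence of actual output ports and count correct predictions."""
    hits = 0
    for out in outputs:
        if predictor.predict(in_port) == out:
            hits += 1
        predictor.update(in_port, out)
    return hits


print(count_hits(LatestPortPredictor(), "X+", ["X-", "X-", "X-", "Y+", "X-"]))
```

Runs of identical outputs, such as straight-after-straight forwarding, are predicted correctly after the first packet, while each change of output port costs one miss.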
Experiments
Simulation
To evaluate the performance of predictive switching, we conducted experiments using a network simulator. We implemented the predictive router pipeline with three prediction algorithms, SS, SPM, and LPM (abbreviated LP below), on the booksim network simulator [4].
Our experimental simulation conditions are as follows:
Network: k-ary 2-cube (k = 8, 16, 32).
Routing algorithm: X-Y Dimension Order Routing (DOR).
Virtual channel: Two VCs per physical channel, and each has an 8-phit FIFO.
Traffic pattern: Uniform random, bit-reversal, and a communication pattern in LU (size of W) from NAS parallel benchmarks [2] .
Packet length: 8 phits.
Switching: Virtual cut-through.
Pipelines: Non-predictive and predictive switching pipelines take six and two cycles, respectively, until they traverse the first phit of an input packet as shown in Figure 1 . Subsequent phits are traversed in one cycle per phit.
Cable delay: Two-cycle propagation delay between two routers.
Non-predictive port: X-dimensional ports of nodes whose X-addresses are zero and k/2, and Y-dimensional ports of nodes whose Y-addresses are zero and k/2 perform only the non-predictive pipeline.
SS and LP: No prediction delay is incurred, and a prediction value is available for every packet.
SPM: We assume that each router is equipped with a central predictor which is shared by all the input ports. Prediction requests are issued when the communication histories are updated in each input port after completing the RC stage of the non-predictive pipeline. A prediction becomes available four cycles after it is requested. When the prediction is not ready for an input packet, the packet is sent through the non-predictive pipeline. The maximum length of the communication history is 512 for each input port.
Data collection: After warming up the network, we executed the simulation until 12,000 packets were received. Then, average latency, prediction hit rates, and PST execution rates were measured for the received packets. The latter two values were calculated as the number of hops with a prediction hit and with PST execution, respectively, divided by the total number of hops for all received packets.
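The placement of non-predictive ports listed in the conditions above can be expressed as a short predicate, which also shows how the share of such ports shrinks with network size. The function names and port labels are hypothetical, and counting all four network input ports of every node equally is an assumption of this sketch.

```python
def is_non_predictive(port, x, y, k):
    """True if the given network port of node (x, y) runs only the non-predictive pipeline."""
    if port in ("X+", "X-"):
        return x in (0, k // 2)    # X-dimensional ports at X-addresses 0 and k/2
    if port in ("Y+", "Y-"):
        return y in (0, k // 2)    # Y-dimensional ports at Y-addresses 0 and k/2
    return False


def non_predictive_fraction(k):
    """Fraction of network ports in a k-ary 2-cube that are non-predictive."""
    ports = ("X+", "X-", "Y+", "Y-")
    total = flagged = 0
    for y in range(k):
        for x in range(k):
            for port in ports:
                total += 1
                flagged += is_non_predictive(port, x, y, k)
    return flagged / total


for k in (8, 16, 32):
    print(k, non_predictive_fraction(k))
```

Under this placement a quarter of the ports are non-predictive for k = 8, an eighth for k = 16, and a sixteenth for k = 32, so larger networks leave longer stretches in which predictive switching can operate.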
Uniform Random Traffic
Figures 6 and 7 show experimental results for uniform random traffic on a 16-ary 2-cube and a 32-ary 2-cube, respectively. Panel (a) of each figure shows the average packet latency as the offered traffic rate of each network is varied. Panel (b) shows prediction hit rates for all the received packets (line graphs) and PST execution rates (bar graphs). The latter is the percentage of packet hops at which predictive switching actually reduced latency.
The results indicate that predictive switching with the SS, LP, and SPM algorithms reduces the average packet latency compared with non-predictive switching (denoted as normal in the graphs). The corresponding reduction in cycles, $T_c$, of average latency can be represented as follows:

$$T_c = D_{ave} \times T_{hop} \times R_{PST}$$

where $D_{ave}$ is the average distance of packets, $T_{hop}$ is the number of cycles saved per hop, and $R_{PST}$ is the PST execution rate. $D_{ave}$ reflects the fact that larger networks potentially save more cycles of latency than smaller networks. $T_{hop}$ takes a value between one and four, covering all or some of the cycles for the RC and VA stages. Because of $R_{PST}$, the latency reduction becomes larger as the prediction hit rates and PST execution rates increase.
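A rough numerical feel for this relation can be obtained as follows. The sketch assumes that the reduction is the product of the three quantities defined above, that destinations are uniform over all nodes (including the source), and that the values passed for the per-hop saving and the PST execution rate are hypothetical inputs, not measured results.

```python
def torus_average_distance(k, dims=2):
    """Average minimal hop count in a k-ary n-cube under uniform destinations."""
    per_dim = sum(min(d, k - d) for d in range(k)) / k
    return dims * per_dim


def latency_reduction(d_ave, t_hop, r_pst):
    """Assumed form: cycles saved = average distance x cycles saved per hop x PST rate."""
    return d_ave * t_hop * r_pst


for k in (16, 32):
    d_ave = torus_average_distance(k)
    # t_hop = 4 and r_pst = 0.5 are illustrative values only.
    print(k, d_ave, latency_reduction(d_ave, 4, 0.5))
```

Doubling the radix doubles the average distance, so for the same per-hop saving and execution rate the expected reduction also doubles, which is the scaling effect the text attributes to the average distance.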
The prediction hit rate of SS is the highest of the three prediction algorithms, approximately 65% for the 16-ary 2-cube and 80% for the 32-ary 2-cube. Because communication regularity is not inherent in the uniform random traffic, prediction hit rates rely only on the regularity of the routing algorithm. In X-Y DOR, packets correct the X-dimensional addresses before they correct the Y-dimensional addresses. This rule increases straight movements of packets, especially for long-distance packet transfers in large networks. LP and SPM have lower prediction hit rates than SS since history-based prediction does not match the random communication pattern. The prediction hit rates of SPM slightly decrease as offered traffic rates increase because the prediction delay leaves some packets without prediction values. The PST execution rates decrease as offered traffic rates increase because predicted output ports tend to be busy under heavy traffic. This tendency is common to both network sizes, although the PST execution rates of the larger network are higher than those of the smaller one. Packets have more chances of predictive switching in the 32-ary 2-cube than in the 16-ary 2-cube because the average distance between non-predictive switching ports is longer in the 32-ary 2-cube.

Bit-Reversal Traffic
Bit-reversal traffic has communication regularity, although packets with long and short distances are mixed. Because of the communication regularity, prediction hit rates for LP and SPM are higher than those of SS. LP and SPM achieve almost equal prediction hit rates, approximately 72% for the 16-ary 2-cube and 83% for the 32-ary 2-cube. These values do not vary much with the offered traffic rate. Execution rates of PST for LP and SPM are also higher than those of SS, although the differences become smaller on the 32-ary 2-cube. These results imply that the communication regularity of bit-reversal traffic is still strongly affected by DOR, especially on a large k-ary 2-cube.

Average latency is reduced except for SS at an offered traffic rate of 0.05 on the 16-ary 2-cube. In that case the prediction hit rate is low, so the number of mis-predicted packets increases. These mis-predictions may disturb proper packet switching when the network is close to saturation. Hence, the overhead of mis-predicted packets appears when network congestion becomes significant.

Traffic Pattern in LU
Figure 10 shows experimental results for a traffic pattern in LU on an 8-ary 2-cube. This traffic pattern is based on an MPI trace obtained by executing an LU program on a 64-node PC cluster. In the LU program, all communication is neighbor-to-neighbor traffic. In addition to the communication locality, it also exhibits a fixed chronological order for packet destinations among four neighboring nodes [11]. In our network simulation, 8-phit packets were used, and their issue times were accelerated to vary the offered traffic rate. Therefore, we use accepted throughput for the horizontal axis in Figure 10.
Because of the strong communication regularity, SPM achieves a prediction hit rate of approximately 88% when the accepted throughput is low. Note that this prediction hit rate increases to 99% if we subtract the number of hops at the non-predictive ports referred to in section 2.2. The hit rate decreases to 65% at high accepted throughput because the predictor's delay leaves some packets without a prediction. LP achieves a prediction hit rate of 45% because network ports always match the prediction when receiving packets from neighboring nodes, whereas packets injected to four neighbors do not match this prediction at the source nodes. The prediction hit rate of SS is around 15% because packets are never transferred more than two hops. As shown in Figure 10(a), the average latency of SPM is largely reduced and saturation throughput is also improved. On the other hand, the latency reduction of LP is limited, and SS cannot reduce latency at all. Based on these results, we can say that precise prediction algorithms are the key to the predictive switching technique.
Related Works
Ding et al. proposed predictive multiplexed switching for indirect networks [5] . Their time division multiplexing requires a global scheduler to establish end-to-end connections; therefore it may not be suitable for direct networks. Our predictive switching technology does not use a global communication scheduler, and each router predicts the next output paths.
Routing speculation [16] and lookahead routing [4, 7] reduce packet hop latency, but deserialization of the packet header followed by routing computation (RC) must still be performed before the switch traversal stage inside the router. For example, BlueGene/L processes packet headers in an eight-stage input pipeline [1]. It needs to update hint bits that are used for lookahead routing at each router before the packet header is sent out. Our predictive switching does not require updates to routing information in the packet header, so it can output the packet before completing the RC stage. Express virtual channels (express VCs) reduce latency by bypassing intermediate router pipelines [14]. They need a lookahead signal to use the express VCs, so they may be feasible only for networks whose links are wide enough to pack a packet header into a single phit.
Path-sensitive router [12] and Guide flit queuing [13] embed a VC identifier in the packet header for the next router. Destination-based head-of-line blocking elimination also manages input buffers associated with packet destination groups [15] . Our classification of packets for reducing occurrences of mis-predicted packets does not necessarily require individual VCs.
Mad-postman switching starts to transfer packets before completing the RC stage [8] . However, it does not use dynamic prediction to decide the output paths for packets.
Conclusion
We developed an efficient implementation of predictive switching for 2-D torus networks. We proposed techniques to reduce overhead caused by mis-predicted packets. Firstly, we arranged non-predictive switching ports in the network so that we can safely detect and discard mis-predicted packets. Secondly, we statically classified packets according to their output direction, and used the classified information to validate predictive switch traversal. Finally, we shortened the mis-predicted packet to minimize wasteful bandwidth consumption.
We implemented these mechanisms in a network simulator, and measured the effect of predictive switching. The results are summarized as follows.
• Because DOR has routing regularity in its path selection, predictive switching can reduce communication latency, especially for large networks and packets sent long distances.
• Dynamic prediction algorithms which utilize a communication history improve prediction hit rates for regular traffic patterns, and they are more effective than the static prediction.
• The effect of reducing communication latency increases as regularity of traffic increases. Predictive switching also contributes to raising the network saturation throughput by shortening the routing pipeline stages on each router.
Our future work will include a cost- and performance-efficient design of a dynamic predictor, and applications of predictive switching to a variety of networks.
