Low-dimensional k-ary n-cubes have been popular in recent multicomputers. These networks, however, suffer from high switching delays due to their high message distance. To overcome this problem, Dally has proposed a variation, called express k-ary n-cubes, with express channels that allow non-local messages to partially bypass clusters of nodes within a dimension. K-ary n-cubes are graph topologies where a channel connects exactly two nodes. This study argues that hypergraph topologies, where a channel connects any number of nodes, thus providing total bypasses within a dimension, represent potential candidates for future high-performance networks. It presents a comparative analysis of a regular hypergraph, referred to as the distributed crossbar switch hypermesh (DCSH), and the express k-ary n-cube. The analysis considers channel bandwidth constraints which apply in different implementation technologies. The results show that the DCSH's total bypass strategy yields superior performance characteristics to the partial bypassing of its express cube counterpart.
INTRODUCTION
The success of large-scale multicomputers, consisting of thousands of processing elements (PEs), greatly depends on the efficiency of their interconnection network. This network is constructed from switching elements (SEs) and channels: SEs are responsible for moving information between PEs through the channels.
In practice, implementation technology places bandwidth constraints on network channels and these are an important factor in determining how well the theoretical properties of a particular network topology can be exploited. For VLSI implementation, the network wiring density determines the overall system cost and performance. Dally [1] has shown that for fixed wiring density, the torus outperforms the hypercube as the former has wider channels to compensate for its larger diameter. Dally's study has had a considerable influence on the design of current multicomputers. The high-dimensional hypercube originally used in the iPSC/2 [2] and Cosmic Cube [3] has been replaced by the low-dimensional torus (and related mesh) in more recent systems, such as the J-Machine [4] and iWarp [5].
Wormhole routing [6] has also promoted the use of low-dimensional k-ary n-cubes [4, 5] due, first, to its low buffering requirements, allowing efficient SEs. Second, and more importantly, it makes the message latency independent of the distance in the absence of blocking in the network. In wormhole routing, a message is broken into flits (typically, a few bytes each) for transmission and flow control. The header flit governs the route and the remaining data flits follow in a pipelined fashion. If the header is blocked, the data flits are blocked in situ.
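The distance-insensitivity of wormhole routing can be illustrated with a first-order, no-load latency comparison. The sketch below is not taken from the paper: it assumes one flit per cycle per channel, a hypothetical per-hop routing delay `decision_time`, and simplified textbook expressions for store-and-forward and wormhole switching.

```python
def store_and_forward_latency(hops, length_flits, decision_time=1):
    """No-load latency (cycles) when each node buffers the whole message
    before forwarding it: the full message time is paid at every hop."""
    return hops * (decision_time + length_flits)


def wormhole_latency(hops, length_flits, decision_time=1):
    """No-load latency (cycles) when data flits are pipelined behind the
    header: the message length is paid once, the routing delay per hop."""
    return hops * decision_time + length_flits


if __name__ == "__main__":
    # Assumed example: a 128-flit message over increasing distances.
    for hops in (2, 8, 32):
        print(hops,
              store_and_forward_latency(hops, 128),
              wormhole_latency(hops, 128))
```

Under these assumptions the wormhole latency grows only by the routing delay per extra hop, which is why the distance term matters mainly when messages are short or the decision time is large.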
It has been argued in [7, 8], however, that the recent move towards the use of low-dimensional topologies is not fully justifiable. Dally's results are not applicable to current multiple-chip implementations, which are pin-out rather than wire-limited. Technology that allows an entire system to be accommodated on a single chip will not be achievable for many years. Furthermore, SE switching delays are still significant and dominate wire delays [8, 9]. Such delays cause performance degradation in low-dimensional k-ary n-cubes due to their higher average message distance. To overcome this problem, especially in large k-ary n-cubes, Dally [10] proposed an enhanced variation, called express k-ary n-cubes (or express cubes for short), that use express channels, allowing non-local messages to partially bypass groups of nodes within a dimension.
Common multicomputer network topologies, such as k-ary n-cubes and their express cube equivalents, can be formally modelled as graphs of the form G(E, V), defined over a set of vertices V and a set of edges E. Each vertex typically represents a node and each edge between two vertices represents a channel connecting the two nodes. A fundamental constraint of the graph model is that each edge joins exactly two vertices. If a network channel is permitted to connect any number of nodes, a new general and powerful class of topologies emerges [11]. Using a graph-theoretical framework, members of this class can be modelled as hypergraphs [12], which are generalizations of the conventional graph in which individual edges are able to join an arbitrary number of vertices.
One sub-class of regular multi-dimensional hypergraph topologies, hypermeshes [11, 13, 14] , has several significant advantages [11, 13, 14, 15, 16, 17] . First, their diameter grows slowly with the network size, compared to most graph networks. Second, they can host parallel applications that map naturally onto hypercubes, tori and binary trees. Third, they can emulate SIMD permutations that can be performed on the omega and inverse-omega multi-stage networks. Finally, they are extremely effective at important global operations such as broadcast.
A particular class of regular hypermeshes, known as the distributed crossbar switch hypermesh (DCSH), has been shown to possess several desirable topological and performance advantages over common k-ary n-cube topologies, notably the torus and hypercube, when implemented in VLSI and multiple-chip technology [15, 16, 17]. A node in the DCSH has a dedicated channel that connects it to every node along a dimension, thus providing a total bypass for non-local messages within a dimension; a message experiences switching delay only when it needs to cross to another dimension. This paper evaluates the relative performance merits of total and partial bypassing provided by the DCSH and the express cube respectively, taking into account channel bandwidth constraints imposed by implementation technology. While Dally [10] has considered constraints for VLSI technology only, this study considers both VLSI and multiple-chip technologies. Furthermore, whereas Dally has used the average message distance as a performance measure, which does not truly reflect network behaviour in the presence of message blocking, the present analysis develops queueing models to compute message latency in the DCSH and express cube under message blocking. The models have two important features: first, they make realistic assumptions, ignored in previous studies [1, 7, 10], such as including effects due to switching delays and the use of pipelined-bit transmission to lower the effects of long wires; second, they take into account the multiplexing effects of multiple virtual channels [18] onto individual physical channels, yielding more accurate results. The comparative analysis reveals that in most cases considered here the total bypassing strategy inherent in the DCSH topology provides better performance than the partial bypassing strategy used in express cubes. Furthermore, the DCSH can employ slower SEs than the express cube and still maintain superior performance levels.
The remainder of the paper is organized as follows. Sections 2 and 3 describe the DCSH and express cube. Section 4 outlines the queueing models for the two networks. Section 5 compares their relative performance merits for equal implementation costs in VLSI and multiple-chip technology. Finally, Section 6 draws some conclusions from this study. 
DISTRIBUTED CROSSBAR SWITCH HYPERMESH (DCSH)
Several implementation schemes of hypermeshes have been suggested. These are based on either shared buses [19], crossbar switches [20] or complete connections [21]. All these schemes, however, suffer from bandwidth limitations as system size grows. Some authors have recently suggested the use of optical technology offering high communication bandwidth [11, 14]. However, as integrated single-chip nodes become feasible, it will be necessary to develop optoelectronic transmitters and receivers that are capable of operating at high rates and do not cause a disproportionate increase in the site area occupied by the node. Commercial technologies with these parameters appear to be some way off [22]. A hypermesh implementation called the distributed crossbar switch hypermesh (DCSH) has been shown to possess better topological properties and latency characteristics than existing networks, including tori, hypercubes and multistage networks [15, 16, 17]. The one-dimensional DCSH, referred to as a cluster, is a hypergraph consisting of k nodes connected by a distributed crossbar switch; the multiplexing/demultiplexing functions of the conventional 'centralized' crossbar switch are performed in the SEs. Figure 1 depicts the basic structure of a cluster. Every node possesses a uniquely owned channel that connects it to the other (k − 1) nodes in the cluster. At each of these (k − 1) destinations, there is a (k − 1)-to-1 multiplexer with buffered inputs. A k-ary n-dimensional DCSH is a regular hypergraph with N = k^n nodes, formed by taking the Cartesian n-product of the cluster topology. This has the effect of imposing the cluster organization in every dimension, making each node equally a member of n independent orthogonal clusters. When k = 2, the DCSH reduces to a hypercube.
Let dimensions in an n-dimensional DCSH be numbered 0, . . . , (n − 1). A node, v, can then be labelled by an (n × 1) address vector with v_i being the node position in its dimension (cluster) i. Each node is connected to n(k − 1) other nodes with which it differs in only one address digit, i.
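The addressing scheme above can be sketched in a few lines. The function names `dcsh_nodes` and `dcsh_neighbours` are illustrative only and do not come from the paper; nodes are represented as tuples of n digits, each in the range 0 to k − 1.

```python
from itertools import product


def dcsh_nodes(k, n):
    """All N = k**n node addresses of a k-ary n-dimensional DCSH."""
    return list(product(range(k), repeat=n))


def dcsh_neighbours(node, k):
    """Nodes reachable in one hop: those whose address differs from
    `node` in exactly one digit (one cluster per dimension)."""
    neighbours = []
    for dim, digit in enumerate(node):
        for other in range(k):
            if other != digit:
                neighbours.append(node[:dim] + (other,) + node[dim + 1:])
    return neighbours


if __name__ == "__main__":
    k, n = 4, 2
    print(len(dcsh_nodes(k, n)))             # number of nodes, k**n
    print(len(dcsh_neighbours((0, 0), k)))   # node degree, n * (k - 1)
```

With k = 2 the neighbour set of a node is exactly the set of addresses at Hamming distance one, which is the hypercube case mentioned earlier.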
EXPRESS K-ARY N-CUBES (EXPRESS CUBE)
Express k-ary n-cubes [10] are obtained by periodically inserting an interchange into a conventional k-ary n-cube, after each cluster of k_n nodes; the interchange provides a short path for non-local messages. An interchange is connected to its neighbouring interchanges by one or more, say m, express channels. It receives messages, arriving on either express or local channels, and directs them either to a local channel, if the next cluster contains the destination, or to one of the m express channels otherwise. Only nodes have the capability to consume messages if they are the destination or to decide on switching messages to another dimension. When a message is generated at a cluster node, it traverses only local channels if its destination is in that cluster. Otherwise, it has to cross some local channels to reach the last one leading up to the level of interchanges. At that level, if the destination is in the neighbouring cluster, the interchange forwards the message down through the local channel. Otherwise, the message keeps on traversing interchanges and express channels until it reaches the last interchange connected to the cluster containing the destination node. Figure 2 shows a two-dimensional express cube with k = 6, k_n = 2 and one express channel connecting interchanges.
Low-dimensional k-ary n-cubes suffer from high average message distance for large k. Express cubes aim at reducing the message distance and thus reducing the switching delay that a message experiences. They achieve this, however, at the cost of adding interchanges and increased wiring density. The DCSH can be considered as an extreme case of the express cube with a total bypass path along a dimension. This total bypassing considerably reduces the diameter and average distance at the expense of a higher wiring density requirement compared to the express cube. This results in the DCSH having narrower channels than its express cube equivalent (as we shall see later in Section 5). This study investigates whether the DCSH's total bypass strategy can compensate for its narrower channels and provide superior performance to the partial bypassing of the express cube.
MODELS
The modelling approach described in [1] is adapted for this study. Several models of wormhole routing have recently been suggested in the literature (e.g. [23]) and have been shown to be more accurate than that in [1]. Nonetheless, we have opted to use the modelling approach of [1] as it allows the development of simpler and more computationally efficient models. More importantly, the general conclusions reached using these models have been found to be in close agreement with those obtained through simulation experiments [24]. Detailed derivations of the models presented later can be found in [24]. The models are based on the following assumptions, which have been widely used in the literature [1, 7, 8, 15, 16, 23].
1. Nodes generate traffic independently of each other, following a Poisson process with a mean rate of m_g messages/cycle. Message lengths are exponentially distributed with a mean of L_m flits, each flit requiring one cycle of transmission time across a physical channel.
2. Message destinations are uniformly distributed. Although many evaluation studies assume uniform traffic, it is not always a true reflection of reality. There are, for example, parallel applications that exhibit communication locality. Nonetheless, there are situations where the assumption is justifiable. For instance, the shared-memory model has recently been implemented on a number of multicomputers as it is easier to use than message-passing [5]. In this model, practices such as memory interleaving and distributed software-tree implementations of barrier-synchronization tend to spread access uniformly over all nodes [8].
3. Messages generated by the PE are put in a queue of infinite buffering capacity. Messages at the destination are consumed as soon as they arrive.
4. Messages are transmitted between SEs (and interchanges) using wormhole routing.
5. Deadlock during message routing is avoided by using the virtual channel algorithm [25]. Restricted routing, where messages visit network dimensions in a strict order, is a special case of this algorithm and ensures deadlock-free routing in the DCSH. In the express cube, however, in addition to restricted routing between dimensions, each physical channel is divided into V = 2 virtual channels to avoid message deadlock within a dimension, caused by the wrap-around connections. Restricted routing has been popular in multicomputers [2, 3, 4, 5] as it requires a minimal number of virtual channels and allows the design of efficient SEs [9]. The dimensions in the DCSH and express cube are numbered from 0 to n − 1 and messages visit lower numbered dimensions first.
6. [...] of switching delays on performance and stressed those of long wires. This is unrealistic, given current and foreseeable technology.
Switching delays are still significant and much higher than wire delays in current implementation technology [8, 9] . Delays due to long wires can be made less of an issue by using bit-pipeline transmission, as suggested in [26] . Figure 3 shows the SE structure in the two-dimensional DCSH. Discussions can be easily extended to higher dimension DCSHs. Due to restricted routing, only messages from the PE may access dimension one. Furthermore, only messages from the PE and dimension one may access dimension two. However, messages can be transferred from the network to the PE from any dimension. Buffers are provided at the input side of the SE and are connected by means of a (3 × 3) crossbar switch to the output channels.
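Restricted routing in the DCSH can be sketched as follows. This is an illustrative reading of the assumption that messages visit lower numbered dimensions first, with dimensions indexed from 0; the function name `dcsh_route` is hypothetical. Because every cluster offers a direct channel between any two of its nodes, each dimension in which source and destination differ costs exactly one hop.

```python
def dcsh_route(src, dst):
    """Return the nodes visited from src to dst when dimensions are
    corrected in ascending order, one hop per differing dimension."""
    if len(src) != len(dst):
        raise ValueError("addresses must have the same number of digits")
    current = list(src)
    path = [tuple(current)]
    for dim in range(len(current)):
        if current[dim] != dst[dim]:
            # The source's own channel in this cluster reaches the target
            # position directly, bypassing all intermediate nodes.
            current[dim] = dst[dim]
            path.append(tuple(current))
    return path


if __name__ == "__main__":
    print(dcsh_route((1, 2), (3, 0)))
```

The number of hops is the number of address digits in which the two nodes differ, so it never exceeds n.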
The DCSH model
Messages from the PE are buffered in a queue to be injected later into the network through the injection channel. There are also two (k − 1)-to-1 multiplexers per dimension for transit and local messages, respectively. At each input of a multiplexer, there is a queue for each of the (k − 1) senders in a dimension. As soon as a message is granted a channel, the transmission of the header flit resumes along with the data flits in a pipelined fashion. Messages are consumed by the local PE through the consumption channel.
Let N_j and P_j be, respectively, the number of nodes j hops away from the source and the probability that a message travels j hops. They are given by
The average message distance in the DCSH is then
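The expressions themselves are not reproduced in this text, so the sketch below reconstructs them from the topology rather than quoting the paper. It assumes that a node j hops away differs from the source in exactly j address digits, giving N_j = C(n, j)(k − 1)^j, and that under uniform traffic a message is equally likely to go to any of the other k^n − 1 nodes, giving P_j = N_j / (k^n − 1). The average distance is then the sum of j P_j, which simplifies to n(k − 1)k^(n−1) / (k^n − 1).

```python
from math import comb


def dcsh_distance_distribution(k, n):
    """Return (N_j, P_j) as dicts keyed by hop count j = 1..n."""
    others = k ** n - 1  # every node except the source
    counts = {j: comb(n, j) * (k - 1) ** j for j in range(1, n + 1)}
    probs = {j: c / others for j, c in counts.items()}
    return counts, probs


def dcsh_average_distance(k, n):
    """Average number of hops under uniform traffic."""
    _, probs = dcsh_distance_distribution(k, n)
    return sum(j * p for j, p in probs.items())


if __name__ == "__main__":
    # 256-node, two-dimensional DCSH (k = 16).
    counts, probs = dcsh_distance_distribution(16, 2)
    print(counts, dcsh_average_distance(16, 2))
```

For the 256-node case this gives 30 nodes at one hop and 225 at two hops, so the average distance stays below the diameter of 2.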
Under light traffic (m_g ≈ 0), the message latency, L (in cycles), can be approximated by
as message blocking is negligible. The first term in this equation accounts for the average number of SEs that a header flit visits while the second is the transmission time of the data flits. With increased traffic however, a message experiences blocking over network resources. Since latency at a dimension depends on the subsequent higher dimensions, all dimensions are taken into account when computing the mean message latency. Latency is determined first at the destination (dimension n-1), and then propagated back to dimension zero. The total message rate,
The first term is the rate of messages that are generated by the PE and sent across dimension i, while the second is the rate of messages that arrive from dimension j (0 ≤ j < i) and pass through dimension i. Messages arriving at dimension i are split into two streams. The first stream terminates at that dimension and messages are sent to the PE with probability P_{t_i}. The second is the transit messages which pass through transit multiplexer i with the probability (1 − P_{t_i}). P_{t_i} is found to be [23]
66 S. LOUCIF, M. OULD-KHAOUA AND L. M. MACKENZIE
Since messages are serviced as soon as they arrive at their destination, the service time at the destination, S_n, is simply L_m + D_t. Let S_i be the service time seen by a message entering dimension i. It consists of the service time at the subsequent dimension, S_{i+1}, increased by the SE decision time D_t, the average blocking delay at the output channel, B_{C_i}, and the average blocking delay at the input multiplexer (transit or local) at that dimension, B_{M_i}. S_i is then
Once having crossed dimension i, a fraction m_{c_i} P_{t_i} will be consumed by the local PE and therefore needs to pass the local multiplexer. The other fraction m_{c_i} (1 − P_{t_i}) passes through the transit multiplexer, in order to cross subsequent dimensions. The average traffic rate arriving at a multiplexer entry, for instance a local multiplexer (a similar analysis applies to the transit multiplexer), is m_{c_i} P_{t_i}/(k − 1). A message at each entry of the multiplexer competes with messages of the k − 2 other entries to pass through the multiplexer. When a collision occurs, given that a message requires an average service time of S_{i+1}, the blocked message waits on average S_{i+1}/2 [1]. The average blocking delay at multiplexer i, B_{M_i}, can then be written as
and B_{C_i}, the average blocking delay at an output channel at dimension i, can be written as
The effects of queueing that occur at the source must also be included. It is assumed that messages are serviced at the source according to an exponential distribution with an average service time S_0. M/M/1 queueing theory gives the mean latency as [27]
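The paper's own expression is not reproduced in this text. As a hedged illustration, the sketch below uses the standard M/M/1 result for the mean time a customer spends in the system (waiting plus service), S_0 / (1 − m_g S_0), with arrival rate m_g and mean service time S_0; whether the paper writes the result in exactly this form is an assumption. The function name is illustrative.

```python
def mm1_mean_latency(arrival_rate, mean_service_time):
    """Mean time in an M/M/1 system: service time inflated by queueing.

    arrival_rate      -- messages per cycle offered to the source queue
    mean_service_time -- average service time in cycles
    Returns infinity when the queue is saturated (utilisation >= 1).
    """
    if arrival_rate < 0 or mean_service_time <= 0:
        raise ValueError("rate must be >= 0 and service time must be > 0")
    utilisation = arrival_rate * mean_service_time
    if utilisation >= 1.0:
        return float("inf")
    return mean_service_time / (1.0 - utilisation)


if __name__ == "__main__":
    # Assumed example values: 100-cycle service time, rising load.
    for rate in (0.0, 0.001, 0.005, 0.009):
        print(rate, mm1_mean_latency(rate, 100.0))
```

The latency equals the service time at zero load and grows without bound as the utilisation approaches one, which is the saturation behaviour discussed in the results.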
The express cube model
The present analysis considers express cubes with m express channels between each pair of interchanges. The SE structure in express cubes is the same as in k-ary n-cubes; the structure of SEs and interchanges can be found in [10]. The derivation which follows is similar to that in the DCSH and, for brevity, only differences are shown. The average number of dimensions crossed in the express cube, d^{dim}_{avg}, is given by
where N^{dim}_j and P^{dim}_j are the number of nodes j dimensions away from a node and the probability that a generated message is sent j dimensions away, respectively.
The average number of hops crossed within a dimension, k_avg, is the sum of the average number of local channels crossed before reaching the first interchange, k_i, the average number of express channels, k_x, and the average number of local channels crossed after the last interchange, k_f. k_avg is therefore given by [10]
where h is the average Manhattan distance crossed by a message along a dimension and is given by [7, 8] 
Under light traffic, latency can be written as
Messages, arriving at dimension i at a rate m_{c_i} (Equation 5), take either the right or left direction. The average message rate on a direction is therefore given by
Without loss of generality, let us focus on one direction. Once a message enters a direction, it crosses, on average, k_avg hops. On arriving at an interchange, it either skips the next k_n nodes, using an express channel, with probability 1/k_n, or routes through those nodes using local channels with probability (1 − 1/k_n). The average message rates on a local or express channel, m_{lc_i} and m_{ec_i} respectively, are therefore
The average service time, S_i, seen by a message entering dimension i, is also the average service time of the first channel at that dimension:
THE COMPUTER JOURNAL, Vol. 42, No. 1, 1999
PERFORMANCE MERITS OF BYPASS CHANNELS 67
The average service time, S_{i,j}, seen by a message entering the (j + 1)th channel at dimension i, is the average service time at the next channel, increased by the decision time, and the average blocking delay at that channel, B^{[L|E]}_{i,j}, depending on whether it is local or express. S_{i,j} is found to be
The average message blocking at an output channel can be written as
where P^{[L|E]}_{i,j,v} is the probability that v virtual channels of the (j + 1)th local or express channel at dimension i are busy (and V (= 2) is the number of virtual channels). P^{[L|E]}_{i,j,v} is the steady state solution of a Markov chain described by the following equations [18]:
The multiplexing delay of virtual channels at either a local or express channel is given by
The overall average multiplexing delay in the network, V_avg, is therefore
The average service time seen by a message entering the network is S_0 V_avg. Including the waiting time at the source yields the mean message latency as [27]
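The virtual-channel occupancy probabilities and the resulting multiplexing degree can be sketched as below. The equations are not reproduced in this text, so the recurrence is an assumption: it follows the form in which the Markov model of [18] is commonly restated, with traffic rate `rate` on the physical channel and mean service time `service_time`. The function names are hypothetical and the exact expressions should be checked against [18] and [24].

```python
def virtual_channel_occupancy(rate, service_time, num_vcs=2):
    """Probabilities P_0..P_V that v virtual channels are busy.

    Assumed recurrence: q_0 = 1, q_v = q_{v-1} * rate * S for 0 < v < V,
    and q_V = q_{V-1} * rate / (1/S - rate); P_v = q_v / sum(q).
    """
    load = rate * service_time
    if service_time <= 0 or not 0.0 <= load < 1.0:
        raise ValueError("need service_time > 0 and 0 <= rate * S < 1")
    weights = [1.0]
    for v in range(1, num_vcs + 1):
        if v < num_vcs:
            weights.append(weights[-1] * load)
        else:
            weights.append(weights[-1] * rate / (1.0 / service_time - rate))
    total = sum(weights)
    return [w / total for w in weights]


def multiplexing_degree(probs):
    """Average degree of multiplexing, sum(v^2 P_v) / sum(v P_v)."""
    busy = sum(v * p for v, p in enumerate(probs))
    if busy == 0.0:
        return 1.0  # idle channel: a message sees no multiplexing
    return sum(v * v * p for v, p in enumerate(probs)) / busy


if __name__ == "__main__":
    probs = virtual_channel_occupancy(rate=0.005, service_time=100.0)
    print(probs, multiplexing_degree(probs))
```

With V = 2 the degree lies between 1 (no sharing) and 2 (both virtual channels always busy), so the service time is scaled by at most a factor of two.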
PERFORMANCE COMPARISON
If the channel width, in practice, is w bits, a message length of M bits is broken up into L_m = M/w channel words or phits as they are sometimes known, each of which contains w bits; L_m is referred to as the message aspect ratio. In practice, a flit in wormhole routing may be composed of one or more phits [10], but in the results presented later we assume that a flit is equal to one phit, since the general conclusions do not change for the other cases. Furthermore, we assume that express channels have the same width as local channels in the express cube. The low-dimensional versions of the DCSH (i.e. two- and three-dimensional (3D)) are the most interesting cases as they map naturally into the physical dimensions, yielding more compact and regular implementations [17]. The following discussion compares the two-dimensional DCSH to its express cube counterpart with a fixed network size N and equal implementation cost in VLSI and multiple-chip technologies (the conclusions are the same for the 3D case). In a pure VLSI implementation, the bisection width, i.e. the number of wires that cross the middle of the network, is a rough measure of implementation cost, since wiring density is constrained. Assuming that a network is implemented in two-dimensional space with √N nodes along a given dimension [1], the bisection widths (B) of the DCSH and express cube, with channel widths of w_{DCSH} and w_{Express Cube} respectively, are found to be
If bisection width is held constant, the channel width of the express cube in terms of that of the DCSH is given by
In multiple-chip implementations, where a node is fabricated on a chip, pin-out, which is the (node degree × channel width), is a more suitable metric [7, 8] . The pin-out (P) of the DCSH and express cube can be written as
P_{Express Cube} = 4 n w_{Express Cube} at a node,
P_{Express Cube} = 4(m + 1) w_{Express Cube} at an interchange.
Since in the express cube the number of express channels must be at least one (m ≥ 1), the pin-out of the interchange dominates the cost. Assuming a constraint of constant pin-out, the channel width relationship is therefore
For illustration, two network sizes are considered: a small system of 256 nodes and a moderately large system of 4096 nodes. The cluster size, k_n, is set at 4 and 8 for the 256-node and 4096-node express cubes, respectively: these figures yield the optimal message distance along a dimension, as suggested in [10]. Two cases for the number of express channels are considered: m = 1 and m = 4. Multiple express channels have the advantage of making better use of the express channels since a message chooses any one of them. This flexibility, however, is achieved at the expense of higher pin-out and increased wiring density requirements.
Let us set a channel width (or phit) in the DCSH to 8 bits. The channel width in the express cube is normalized to this. Two message lengths are considered: long messages of 128 (DCSH) phits, a typical figure considered by other authors [1, 7, 8], and short messages of 32 (DCSH) phits, which reflects a possible scenario where wide channels are employed; message aspect ratios can become short, perhaps even less than the channel width itself. In this and the following, all message lengths are quoted in terms of (DCSH) phits. Although we are interested in latency taking into account the effects of decision time, results when decision time is ignored are also presented to assess its impact on performance.
Results for VLSI implementation
[...] outperforms the DCSH under all traffic conditions at both long and short message lengths. This is because the wider channels of the express cube reduce its message aspect ratio and ultimately message blocking in the network.
When the decision time is increased to one cycle, which is an optimistic figure given current technology [8, 9], the express cube loses its edge to the DCSH for both long and short messages. As the message length decreases, its sensitivity to decision time becomes more critical because message distance dominates latency. Increasing the decision time further to 2 cycles, which is still a realistic figure [8, 9] in terms of practical hardware implementations, offsets the advantage of the wider channels in the express cube. These models show that message blocking increases quadratically with the message length. Increasing the decision time has the same effect and again results in a quadratic increase in blocking. The express cube is more sensitive to the effects of decision time because it has a larger average message distance and therefore a message requires a longer service time to reach its destination.
When the number of express channels is increased to m = 4, Figures 4b and 5b show that the express cube saturates at earlier traffic loads, making its performance worse than that of the DCSH even with zero-cycle decision time. This is because, first, increasing m in a small system reduces the channel width and, as a result, the message aspect ratio in the express cube becomes longer than that in the DCSH. Second, messages travel, on average, longer distances in the express cube and therefore encounter higher blocking in the network.
Figures 6 and 7 evaluate performance when the system size is N = 4096 nodes. A larger system implies wider channels in the express cube, but also a longer average distance, compared to the DCSH. Both Figures 6a and 7a show that zero decision time favours the express cube for both long and short messages. Including the effects of decision time, however, even when it is only one cycle, causes considerable degradation in the performance of the express cube. This degradation is more significant when messages are short. There are two reasons for this. First, the effect of decision time on performance is higher as messages become shorter because message distance dominates message latency. Second, shorter messages do not enable the express cube to compensate for its higher message distance because they reduce the ratio between the message aspect ratios in the express cube and DCSH and consequently reduce the difference in message blocking in the two networks.
Figures 6b and 7b reveal that the increase in m degrades performance less than in the 256-node case. However, the DCSH gives better latency characteristics under all traffic conditions and message lengths even with zero decision time. The figures also show that when the decision time in the DCSH is set higher (D_t = 2) than that in the express cube (D_t = 1), the former still provides better performance. This finding reveals an important advantage of using the total bypass strategy in the DCSH. It gives the DCSH the option of using slower and therefore cheaper SEs while still maintaining superior performance levels.
Results for multiple-chip implementation
Under the constant pin-out constraint, the channels in the express cube are four times wider than those in the DCSH. Figures 8 and 9 show that ignoring the decision time in a 256-node system favours the express cube. The express cube outperforms the DCSH at long messages even when the decision time is 1 cycle. However, the wider channels of the express cube do not bring any performance advantages over the DCSH when messages are short. This is because short messages yield short message aspect ratios, causing message latency to become more sensitive to the distance. Figures 10 and 11, on the other hand, show that when the system is scaled up to 4096 nodes, the express cube with one or multiple express channels can outperform the DCSH only when the decision time is set to zero. Increasing the decision time always favours the DCSH. When multiple express channels are used, the DCSH still delivers better performance even with higher decision time.
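The effect of channel width on the message aspect ratio L_m = M/w can be made concrete with the figures quoted above: an 8-bit DCSH phit, long and short messages of 128 and 32 DCSH phits, and express cube channels four times wider under the constant pin-out constraint. The helper name is illustrative, and rounding up to whole flits is an assumption of this sketch.

```python
def message_aspect_ratio(message_bits, channel_width_bits):
    """L_m = M / w in flits (one flit = one phit), rounded up."""
    if message_bits < 0 or channel_width_bits <= 0:
        raise ValueError("message size must be >= 0 and width must be > 0")
    return -(-message_bits // channel_width_bits)  # ceiling division


if __name__ == "__main__":
    dcsh_width = 8                 # bits, as set in the text
    express_width = 4 * dcsh_width # four times wider under constant pin-out
    for phits in (128, 32):        # long and short messages, in DCSH phits
        bits = phits * dcsh_width
        print(phits,
              message_aspect_ratio(bits, dcsh_width),
              message_aspect_ratio(bits, express_width))
```

A long message therefore occupies 128 flits in the DCSH but 32 in the express cube, while a short message shrinks from 32 flits to 8, at which point the per-hop decision time is no longer small compared with the transmission time.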
CONCLUSIONS
Express k-ary n-cubes (or express cubes) have been introduced to alleviate the problem of higher message distances in large k-ary n-cubes, by allowing non-local messages to partially bypass clusters of nodes within a dimension. This paper has shown that distributed crossbar switch hypermeshes (DCSHs), based on the hypergraph concept and providing total bypass channels along a dimension, have superior performance characteristics to those of the express cubes.
Even though the express cube has wider channels than the DCSH for equal implementation costs in both VLSI and multiple-chip technology, its extreme sensitivity to switching delays, due to decision time, offsets this advantage.
As the express cube size increases, its performance degrades sharply when realistic figures for decision time are included. This is because, although the express channels allow the express cube to reduce its average message distance, relative to the torus, a message still crosses, on average, a larger number of switching elements (and interchanges) during its network journey compared to the DCSH and thus encounters higher blocking over network channels. Furthermore, the results have also shown that the DCSH has the option of using slower, and therefore cheaper, switching elements and still delivers better performance.
At present and for the foreseeable future, wormhole routing will continue to be a popular switching technique as it requires low storage and allows the design of fast switching elements. As the results have shown, however, switching delays, due to decision time, have an immense impact on the performance of wormhole-routed networks, and therefore must not be ignored in the design of future high-performance multicomputer networks. As switching delay will dominate wire delay for the foreseeable future, the hypermesh offers a superior solution to the latency problem under most of the practical constraints considered here.
