Multistage interconnection networks (Banyan networks) 
Introduction
Multistage interconnection networks (MINs) are often proposed for high-performance environments such as multiprocessor systems [1] and high-bandwidth communication networks (e.g. ATM or Gigabit/10G Ethernet), and they are candidates for application in distributed real-time systems.
MINs are also used in embedded applications, where they provide the underlying interconnect between processing elements [11]. As the number of network nodes increases, their lower count of cross-points offers a considerable advantage over fully-meshed crossbars if cost or space constraints arise. With higher levels of integration becoming reality, multiprocessors on single chips are emerging. Several Network-on-Chip architectures for the interconnection network are the topic of current discussion [7]. Choosing a specific network design and topology is considered very application-dependent [5] and continues to be an active topic of research.
Being able to evaluate the performance of such networks during the design phase is important because it makes it possible to determine whether a certain design meets the QoS requirements. Methods employed for performance evaluation include both analysis and simulation, each with its advantages and disadvantages. Specifically, there is a tradeoff between the execution speed of analysis and the accuracy of simulation. In addition, [3] shows that obtaining quantile measures by simulation requires considerable computational effort compared to lower-order measures such as mean values. In the following, our focus will be on analytical performance evaluation and its use.
Applications of the proposed analysis method include the design evaluation of Network-on-Chip topologies under multicast conditions and of interconnection networks that provide links between processing elements in a large-scale multiprocessor environment. Being able to gauge performance measures, including delay time distributions, is an important part of a system development process.
In [4], Jenq developed an analytical model to cope with buffered MINs using just one input buffer to represent an entire network stage. Tutsch and Hommel [10, 8] extended Jenq's model to include packet multicasting and allowed for arbitrary (but finite) buffer sizes. They also considered dependence between packets of successive clock cycles. Their model yielded average throughputs and delays but no delay time distributions and thus was not suited to modeling real-time systems. To the best of the authors' knowledge, the proposed method is the only approach to obtain delay time distributions analytically. This paper presents a model introduced in [2] and extends it to a networking application example, delivering delay time distributions while supporting multicast traffic. Delay time distribution analysis is inherently more complex than the computation of traditional expected (mean) values. Moreover, there are no numerically tractable exact methods. The presented method is able to cope with arbitrary network sizes, arbitrary buffer capacities for the internal switching elements, and arbitrary (but uniform) multicast traffic patterns. In order to validate the results obtainable with the analytical model, a state-based simulation was used.
The remainder of this paper is organized as follows: Section 2 gives a short overview of MINs, and Section 3 describes key ideas involved in developing the model. Section 4 presents an application of the proposed method to a 128 × 128 node network, focusing in particular on the impact of multicast traffic on the delay time distribution. Finally, Section 5 gives a conclusion and points out directions for further studies.
Multistage Interconnection Networks
The N × N MINs considered in this paper (connecting N input ports to N output ports) consist of buffered 2 × 2 switching elements which are arranged in n = ⌈log_2 N⌉ stages. The MIN is internally clocked with all packet sending and receiving operations occurring simultaneously.
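As a rough illustration of the sizes involved, the following Python sketch computes the number of stages, switching elements and cross-points of an N × N MIN built from 2 × 2 elements and compares the cross-point count with that of a full crossbar. It assumes that N is a power of two, that each stage holds N/2 switching elements, and that a 2 × 2 element counts as four cross-points. The function name min_dimensions is hypothetical and not taken from the paper.

```python
def min_dimensions(N):
    """Sizes of an N x N MIN made of 2 x 2 switching elements (sketch).

    Assumption: N is a power of two and every stage holds N/2 elements.
    """
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 1, without float rounding
    stages = (N - 1).bit_length()
    elements = (N // 2) * stages
    return {
        "stages": stages,
        "switching_elements": elements,
        "min_crosspoints": 4 * elements,   # four cross-points per 2 x 2 element
        "crossbar_crosspoints": N * N,     # fully-meshed N x N crossbar
    }


if __name__ == "__main__":
    print(min_dimensions(128))
```

Under these assumptions, a 128 × 128 network has 7 stages with 448 switching elements, i.e. 1792 cross-points compared to 16384 for a crossbar of the same size.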
Figure 2. Multicasting while routing
Packet multicasting can be handled by multistage interconnection networks in a bandwidth-conserving manner: packets that have multiple output ports as their destination are copied within the corresponding 2 × 2 switching element as required (Fig. 2; the packet is copied at the latest possible stage).
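The effect of copying as late as possible can be sketched in a few lines of Python. The sketch assumes destination-tag routing in which stage i inspects one bit of the destination address (most significant bit first); this routing rule and the function name copies_per_stage are assumptions made for illustration, since the routing scheme is not spelled out here.

```python
def copies_per_stage(destinations, n):
    """Number of packet copies leaving each of the n stages (sketch).

    Assumption: stage i routes on bit (n - 1 - i) of the destination address;
    a packet is duplicated only where its destination set needs both outputs.
    """
    groups = [set(destinations)]          # one group per packet copy in flight
    counts = []
    for stage in range(n):
        bit = n - 1 - stage
        next_groups = []
        for group in groups:
            upper = {d for d in group if not (d >> bit) & 1}
            lower = {d for d in group if (d >> bit) & 1}
            # keep only the outputs that are actually needed
            next_groups.extend(s for s in (upper, lower) if s)
        groups = next_groups
        counts.append(len(groups))
    return counts


if __name__ == "__main__":
    print(copies_per_stage({1, 3}, 3))        # multicast to two outputs of an 8 x 8 MIN
    print(copies_per_stage(set(range(8)), 3)) # full broadcast
```

For the destinations {1, 3} in an 8 × 8 network, a single packet traverses the first stage and is duplicated only in the second stage, whereas a full broadcast doubles the number of copies in every stage.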
To be able to handle MINs by analytic means, one has to lower the complexity of the model. Here, this is accomplished by reducing each stage in the network to a single switching element. For this simplification to be valid, some prerequisites usually have to be assumed: the same input load is offered to all inputs, all packets have equal size, conflicts between packets are resolved randomly, and multicast traffic is uniformly distributed among the network outputs.
These conditions ensure that the buffers can be treated equally so that just one buffer can represent the behavior of all the buffers in the same network stage. Although they prohibit the analysis of non-uniform behavior occurring with hot-spot traffic or highly correlated data streams, these assumptions can be considered valid for an average network usage scenario.
The model presented in the next section was motivated by the observation that delay measures deviated considerably more from simulation results than the throughput measures considered in [9] did.
Model and Analysis
Informally speaking, the discrete-time model describes one-step state transition probabilities. It is an approximation because not all individual model states are considered; similar ones are treated alike to save state space. The model is then applied iteratively to an initial state probability vector until convergence is reached (fixed-point iteration). Performance measures of interest can then be derived from the resulting probability vector.
Due to space limitations, this section describes only key aspects of the proposed model. In particular, the state transition equations are omitted and an overview of their derivation is presented instead.
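The iteration scheme itself can be illustrated with a minimal Python sketch. It uses a fixed, purely hypothetical two-state transition matrix P; the transition probabilities of the actual model are not reproduced here, and the names step and fixed_point are likewise illustrative.

```python
def step(pi, P):
    """Apply the one-step transition probabilities once: pi' = pi * P."""
    n = len(pi)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def fixed_point(pi0, P, tol=1e-12, max_iter=10_000):
    """Iterate until the state probability vector no longer changes."""
    pi = list(pi0)
    for _ in range(max_iter):
        nxt = step(pi, P)
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence within max_iter iterations")


if __name__ == "__main__":
    # Hypothetical toy matrix, NOT taken from the model described in the text.
    P = [[0.9, 0.1],
         [0.5, 0.5]]
    print(fixed_point([1.0, 0.0], P))
```

For this toy matrix the iteration converges to the stationary vector (5/6, 1/6). Performance measures would then be computed from the converged vector.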
As in [10], a stage of the MIN is described by two quantities: first, the type of packet in the first buffer position of the switching element, and second, the number of packets waiting in the queue to be sent. All feasible combinations of packet types are considered states. The individual packet types used are: type 0 (empty buffer), type n (unicast packet), type nb (blocked unicast packet), type b (broadcast packet), type bb1 (broadcast packet, one target buffer blocked), type bb2 (broadcast packet, both target buffers blocked), and type fb (unicast packet that is not in conflict with the unicast packet in the other buffer).
Based on these packet types the actual states of the model are determined. The states are composed of feasible combinations of packet types in the first positions of both buffers in a switching element. Considering these restrictions, the following 18 states (corresponding to their respective state probabilities π) have been identified to represent the first buffers of a switching element: (0, 0), (n, 0), bb2) and (fb, fb). Two of these require additional explanation: in the case of two unicast packets, each with a blocked target buffer in the next network stage (states (nb, nb, *)), a differentiation is made regarding their destinations. If both packets have the same buffer as their target, the state (full or non-full) of the other target buffer is unknown. This state is designated (nb, nb, c). If these two packets do not compete for the same target (state (nb, nb, nc)), both their target buffers must be full.
Closely coupled with the actual states are the probabilities r_*(k, t) for sending some or all packets in the switching element to the next stage. For every state there exists a set of sending probabilities that describes all possibilities of packets leaving the switching element in that particular state.
In order to obtain delay time distributions, an additional quantity l_m(k, t) is introduced that holds the probabilities of a switching element's buffer containing a packet at position m (where 1 is the first buffer position) with a certain number of clock cycles expired.
Initially (when the analysis starts with an empty network), this vector is set to zero for all m and k. New packets that enter the network have a zero delay time associated with them. Fig. 3 presents an example run for the first three clock cycles of an iteration under simplified conditions: 4 × 4 MIN, buffer size 1, offered input load 1, every multicast is a broadcast in each stage. At t = 1, two broadcast packets enter the first network stage; the second stage is still empty. In the next iteration step, either one copy of each broadcast packet is sent, resulting in the state (nb, nb, nc) (probability 0.5), or one of the broadcast packets is sent completely while the other remains in that stage (probability 0.5). In the latter case, the freed buffer would be filled with a new broadcast packet immediately (because the probability of a packet being offered to the network inputs equals 1), and this packet would turn into a blocked broadcast packet since both target buffers are full. In both cases, stage 1 contains two broadcast packets at t = 2. Using fixed-point iteration over the state probabilities, a steady state is reached from which the result parameters of interest are determined.
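The bookkeeping behind such a delay vector can be sketched for a deliberately simplified case. The Python sketch below tracks a single buffer position only: l[d] is the probability that the buffer holds a packet that has already waited d clock cycles, a packet is sent with a fixed probability r per cycle, and a freed or empty buffer is refilled with probability load. The constant sending probability, the truncation at max_delay and the function names are assumptions made for illustration; in the model described above the sending probabilities depend on the state and the stage.

```python
def age_delay_vector(l, r, load=1.0):
    """Advance a delay-indexed probability vector by one clock cycle (sketch)."""
    occupied = sum(l)
    new = [0.0] * len(l)
    # New packets start with zero delay; they enter an empty or just-freed buffer.
    new[0] = load * ((1.0 - occupied) + r * occupied)
    # Packets that are not sent stay in the buffer and age by one cycle.
    for d in range(len(l) - 1):
        new[d + 1] = (1.0 - r) * l[d]
    # Lump probability mass beyond the last bin into the last bin.
    new[-1] += (1.0 - r) * l[-1]
    return new


def iterate_delay_vector(r, max_delay=60, cycles=200):
    """Start with an empty buffer (all zeros) and iterate for some cycles."""
    l = [0.0] * (max_delay + 1)
    for _ in range(cycles):
        l = age_delay_vector(l, r)
    return l


if __name__ == "__main__":
    l = iterate_delay_vector(0.5)
    print(l[:4])
```

In this toy setting with load 1, the vector settles into a geometric shape, l[d] = r(1 - r)^d; the delay time distributions of the full model result from combining such vectors over all buffer positions and stages.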
Results for an Application Example
The proposed method makes it possible to efficiently determine the distribution of packet delay times in multistage interconnection networks (MINs) composed of 2 × 2 switching elements. In addition, mean throughputs and mean delay times are determined as described in [2]. The freely selectable input parameters of the analysis are the number of stages (which determines the network size), the buffer size of each stage's switching elements, the offered load, and the multicast traffic pattern (when examining multicast traffic).
The following example assumes a 128 × 128 node network which is used as the underlying interconnect of a multiprocessor system. Because the ability to efficiently broadcast or multicast information is considered important during parallel and distributed computation tasks, a MIN is a viable choice for such an interconnect. Due to its property of performing broadcast at the appropriate stage during packet routing, bandwidth is used economically. For a characterization of the multicasts, it is assumed that tasks run localized on a small number of nodes but also require information to be updated on many or almost all nodes. Multicasting causes the network to saturate: due to the multiplication of packets, the buffers are almost always occupied at an offered load of 0.5. Only a small number of packets is able to traverse the network in just 7 clock cycles, i.e. one per stage, and the distribution of delay times becomes wider due to the uncertainty of the random packet-choosing process in the switching elements. Fig. 4 shows the corresponding delay distribution. This and the subsequent figure also compare the result of the analysis to simulation; for that purpose, the individual values have been connected by lines.
At a lower offered load (0.1), the corresponding delay time distribution shows a delay of fewer than 10 clock cycles for most packets, with a rapid decline in probability towards higher delays. Depending on the application's requirements, this could be an acceptable delay figure when specifying the maximum load the network will be able to handle within its performance specification.
In order to validate the proposed method, simulation has been used. This was done at the packet level, measuring each delay time bin separately and allowing for a 5% margin of error at a 95% confidence level. The highest deviation can be observed for packets that have been blocked multiple times because of network congestion. Compared to simulation, the runtime of the analysis is considerably shorter: obtaining delay-time figures by simulation typically takes several hours of computation time when adhering to the desired confidence levels and margins of error. Table 1
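Why per-bin accuracy targets make simulation expensive can be illustrated with a standard sample-size estimate. The Python sketch below assumes independent observations, a normal approximation for the estimated bin probability, and that the 5% margin is meant relative to the bin probability; none of these details is confirmed by the text, and the function name required_samples is hypothetical.

```python
import math


def required_samples(p, rel_error=0.05, z=1.96):
    """Observations needed to estimate a bin probability p (sketch).

    Assumptions: independent samples, normal approximation, and a margin of
    error relative to p; z = 1.96 corresponds to a 95% confidence level.
    """
    # Half-width of the interval: z * sqrt(p * (1 - p) / n) <= rel_error * p
    return math.ceil((z / rel_error) ** 2 * (1.0 - p) / p)


if __name__ == "__main__":
    for p in (0.5, 0.1, 0.01, 0.001):
        print(p, required_samples(p))
```

Under these assumptions the required number of observed packets grows roughly in inverse proportion to the bin probability, so the sparsely populated bins in the tail of the delay distribution dominate the simulation effort.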
Conclusion
The proposed model makes it possible to analytically determine several important performance measures of multistage interconnection networks: mean throughputs at the inputs and outputs, mean delay times, and additionally distributions of delay times. The latter measure is especially important when considering real-time applications such as audio/video transmission, other multimedia scenarios, or network-on-chip communication architectures.
Due to the short amount of time required for the method to complete the analysis process, one can evaluate several sets of MIN parameters and thus improve network design to meet target specifications at low cost early in the design process.
Because of the model's state complexity, it does not appear viable to expand it to MINs that are composed of switching elements larger than 2 × 2. Because every combination of packet types in the first buffer position can (with the exception of infeasible states) become a model state, increasing the switching element size would cause an explosion of the state space. Simpler models do not provide satisfactory results when one is interested in distributions of delay times and the network uses packet multicasting. This is because these models do not consider the dependencies between adjacent buffers. These dependencies strongly affect the delay time performance measures and in particular the vector that is used to determine the distributions of delays.
Further studies will include a characterization of the delay time distributions obtained from the analysis as well as support for packet deadlines. Packets that have exceeded their deadline and thus would be useless (e.g. in the context of a multimedia application such as an audio/video transmission) could then be discarded.
