Abstract
Introduction
Multicast communication is one of the most important collective communication operations [1] and is in high demand in parallel applications as well as in other communication environments. For example, multicast is required to propagate updates in replicated and distributed databases; multiprocessor systems require multicast for cache coherence and message passing; and multicast is a critical operation for video/tele-conference calls and video-on-demand services in a telecommunication environment. Clearly, providing multicast support at the hardware/interconnection network level is the most efficient way of supporting such communication operations [1]. Switching networks that can realize arbitrary multicast communication without any blocking are referred to as multicast networks. This type of network has been investigated by several researchers [2]-[8].
In this paper, we design a new type of multicast network using an approach based on recursive decompositions of multicast networks. This approach was first introduced by Nassimi and Sahni [2] and was later adopted by Lee and Oruç [6] in their design of a multicast network. The $n \times n$ multicast network proposed in [2] uses $O(kn^{1+1/k}\log n)$ $2 \times 2$ switches and has $O(k\log n)$ depth and $O(k\log n)$ set-up time for any $k$, $1 \le k \le \log n$. Since the routing algorithm in [2] relies on a cube or perfect shuffle connected parallel computer consisting of $O(n^{1+1/k})$ processors, as mentioned in [6], the routing process actually takes $O(k\log^2 n)$ gate delays. Lee and Oruç [6] designed a multicast network with a special built-in routing circuit. Their network uses $O(n\log^2 n)$ logic gates, and has $O(\log^2 n)$ gate delay and $O(\log^3 n)$ set-up time, where the unit of time is a gate delay.
Another notable feature of our multicast network design is the self-routing scheme, in which the routing circuit is distributed into individual switches. Self-routing is a promising routing scheme which usually yields a network with faster switch setting and less hardware complexity. However, most self-routing network designs described in the literature are for permutation networks, for example, [9] and [10]. In a recent work, Cheng and Chen [10] designed a new self-routing permutation network constructed from reverse banyan networks.
In this paper, we propose a design for a new self-routing multicast network, which is also based on reverse banyan networks. We explore the properties of a reverse banyan network so that it can be used to handle arbitrary multicast connections in a self-routing manner. Unlike earlier multicast networks, the new multicast network is conceptually simple and highly modular. The network design is based on the binary radix sorting concept, and all functional components of the network are recursively constructed reverse banyan networks. It therefore offers the potential to greatly reduce the network cost by reusing part of the network. The new multicast network uses $O(n\log^2 n)$ logic gates, and has $O(\log^2 n)$ gate delay and $O(\log^2 n)$ set-up time, where the unit of time is a gate delay. Moreover, by reusing part of the network, the feedback version of our design can reduce the network cost to $O(n\log n)$.
Multicast Networks Based on Binary Radix Sorting
We consider an $n \times n$ interconnection network with $n$ inputs and $n$ outputs, where $n = 2^m$ and each input or output address can be expressed as an $m$-bit binary number $a_0a_1\cdots a_{m-1}$. For a multicast connection from network input $i$ ($0 \le i \le n-1$) to a subset of network outputs, let $I_i$ denote the subset of the outputs that input $i$ is connected to. $I_i$ is referred to as the destination set of the multicast connection, or simply the destination set of input $i$. Then a multicast assignment can be expressed as a vector $I_0, I_1, \ldots, I_{n-1}$, where $I_i \cap I_j = \emptyset$ for $i \ne j$ and $\bigcup_{i=0}^{n-1} I_i \subseteq \{0, 1, \ldots, n-1\}$. Clearly, a permutation assignment is a special case of a multicast assignment where every $I_i$ has at most one element.
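The definition above can be checked mechanically. The following is a minimal sketch, assuming destination sets are given as Python sets of integer output addresses; the function names and the 8-input example assignment are illustrative and not taken from the paper.

```python
from typing import List, Set


def is_multicast_assignment(dest_sets: List[Set[int]], n: int) -> bool:
    """Return True if dest_sets is a valid multicast assignment on n outputs."""
    if len(dest_sets) != n:
        return False
    seen: Set[int] = set()
    for dests in dest_sets:
        # Every destination must be a legal output address in 0..n-1.
        if any(d < 0 or d >= n for d in dests):
            return False
        # Destination sets of different inputs must be pairwise disjoint.
        if seen & dests:
            return False
        seen |= dests
    return True


def is_permutation_assignment(dest_sets: List[Set[int]], n: int) -> bool:
    """A permutation assignment is a multicast assignment with |I_i| <= 1."""
    return is_multicast_assignment(dest_sets, n) and all(
        len(dests) <= 1 for dests in dest_sets
    )


# Illustrative (hypothetical) assignment for n = 8.
example = [{0, 1}, {3, 4, 7}, {2}, set(), set(), set(), set(), {5, 6}]
print(is_multicast_assignment(example, 8))
print(is_permutation_assignment(example, 8))
```

Idle inputs are modelled as empty sets, which matches the requirement that the union of the destination sets is only a subset of the outputs.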
In this paper, we design a multicast network based on the binary radix sorting concept and refer to it as a binary radix sorting multicast network (BRSMN). An $n \times n$ BRSMN is recursively constructed from an $n \times n$ binary splitting network (BSN) followed by two $\frac{n}{2} \times \frac{n}{2}$ BRSMNs, as shown in Figure 1. For an input $i$, if all elements in its destination set $I_i$ are in the upper (or lower) half of the network outputs, i.e. the most significant bit of every binary address in $I_i$ is 0 (or 1), there is a single connection from input $i$ via the $n \times n$ BSN to the upper (or lower) $\frac{n}{2} \times \frac{n}{2}$ BRSMN; otherwise, input $i$ is connected via the BSN to both the upper and the lower $\frac{n}{2} \times \frac{n}{2}$ BRSMNs. Then, for an $\frac{n}{2} \times \frac{n}{2}$ BRSMN, we need to check the second most significant bit of the binary addresses in the corresponding destination set, and so on. Finally, for a $2 \times 2$ BRSMN (i.e. a $2 \times 2$ switch), realizing a multicast or a unicast connection is straightforward. Figure 2 shows the routing for a multicast assignment, including the destination sets $\{3, 4, 7\}$, $\{2\}$, and $\{5, 6\}$, in an $8 \times 8$ BRSMN.
Now the problem of constructing a self-routing multicast network is transformed to the problem of designing a binary splitting network (BSN) described above.
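The recursive decomposition can be sketched at a purely logical level: at each level a connection is split by the most significant bit of its destinations and handed to the upper or lower half-size network. The sketch below ignores where the BSN physically places each connection and does not model switch settings; the function names and the example assignment are assumptions for illustration.

```python
from typing import Dict, List, Set, Tuple

Conn = Tuple[int, Set[int]]  # (source input, destinations within this sub-network)


def brsmn_route(conns: List[Conn], m: int, offset: int = 0) -> Dict[int, int]:
    """Map each reached output address to its source input for a 2^m-output BRSMN."""
    if not conns:
        return {}
    if m == 1:
        # A 2 x 2 switch: unicast or broadcast is realized directly.
        return {offset + d: src for src, dests in conns for d in dests}
    half = 1 << (m - 1)
    # Binary splitting: most significant bit 0 goes up, 1 goes down.
    upper = [(src, {d for d in dests if d < half}) for src, dests in conns]
    lower = [(src, {d - half for d in dests if d >= half}) for src, dests in conns]
    upper = [c for c in upper if c[1]]
    lower = [c for c in lower if c[1]]
    # Disjoint destination sets guarantee at most n/2 connections per half.
    assert len(upper) <= half and len(lower) <= half
    result = brsmn_route(upper, m - 1, offset)
    result.update(brsmn_route(lower, m - 1, offset + half))
    return result


def route_assignment(dest_sets: List[Set[int]], m: int) -> Dict[int, int]:
    """Route a whole multicast assignment; idle inputs (empty sets) are skipped."""
    conns = [(i, set(d)) for i, d in enumerate(dest_sets) if d]
    return brsmn_route(conns, m)


# Illustrative (hypothetical) assignment for an 8 x 8 network (m = 3).
example = [{0, 1}, {3, 4, 7}, {2}, set(), set(), set(), set(), {5, 6}]
routes = route_assignment(example, 3)
print(sorted(routes.items()))
```

In this example input 1, with destination set {3, 4, 7}, is the only connection that must be split at the first level, since its destinations lie in both halves.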
The Binary Splitting Network
The binary splitting network splits the multicast connection on each of its inputs into two multicast connections (when necessary) according to whether the outputs in the destination set of the multicast connection belong to the upper half or the lower half of the network outputs. Since the half to which an output belongs is determined by the most significant bit of its binary address, the routing in a BSN can be simplified by using a routing tag with four values for each link: 0, 1, $\alpha$, and $\varepsilon$, where 0 and 1 indicate that all destination addresses of the multicast passing on the link have a 0 and a 1 in the most significant bit, respectively; $\alpha$ means that at least one destination address of the multicast on the link has a 0 in the most significant bit and at least one destination address of the multicast has a 1 in the most significant bit; and $\varepsilon$ means the link is idle (i.e. carries no packets). In an $n \times n$ BSN using the four-value routing tags, let $n_0$, $n_1$, $n_\alpha$, and $n_\varepsilon$ denote the numbers of inputs with value 0, 1, $\alpha$, and $\varepsilon$, respectively. It can be verified that they satisfy the constraints:

$$n_0 + n_1 + n_\alpha + n_\varepsilon = n \qquad (1)$$

$$n_0 + n_\alpha \le \frac{n}{2} \quad \text{and} \quad n_1 + n_\alpha \le \frac{n}{2} \qquad (2)$$

$$n_\alpha \le n_\varepsilon \qquad (3)$$
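As an illustration of the four-value tags, the sketch below derives a tag from a destination set and checks the constraints (1)-(3). The string labels "alpha" (destinations in both halves) and "eps" (idle link) are labels chosen for this sketch, as are the function names and the example assignment.

```python
from collections import Counter
from typing import List, Set


def routing_tag(dests: Set[int], n: int) -> str:
    """Return the routing tag of a link carrying a multicast to dests."""
    half = n // 2
    has_upper = any(d < half for d in dests)   # some address has MSB 0
    has_lower = any(d >= half for d in dests)  # some address has MSB 1
    if has_upper and has_lower:
        return "alpha"
    if has_upper:
        return "0"
    if has_lower:
        return "1"
    return "eps"


def satisfies_bsn_constraints(tags: List[str]) -> bool:
    """Check constraints (1)-(3) on the input tags of an n x n BSN."""
    n = len(tags)
    count = Counter(tags)
    n0, n1 = count["0"], count["1"]
    n_alpha, n_eps = count["alpha"], count["eps"]
    return (
        n0 + n1 + n_alpha + n_eps == n      # (1)
        and 2 * (n0 + n_alpha) <= n         # (2), upper half
        and 2 * (n1 + n_alpha) <= n         # (2), lower half
        and n_alpha <= n_eps                # (3)
    )


# Illustrative (hypothetical) assignment for n = 8.
example = [{0, 1}, {3, 4, 7}, {2}, set(), set(), set(), set(), {5, 6}]
tags = [routing_tag(d, 8) for d in example]
print(tags, satisfies_bsn_constraints(tags))
```

Constraint (3) follows from (1) and (2): adding the two inequalities in (2) and subtracting (1) leaves $n_\alpha \le n_\varepsilon$.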
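Now, the function of a BSN is to transform the input tags 0's, 1's, $\alpha$'s, and $\varepsilon$'s to the output side such that all $\alpha$'s are eliminated, all 0's are in the upper half of the outputs, and all 1's are in the lower half of the outputs. Let $\tilde{n}_0$, $\tilde{n}_1$, $\tilde{n}_\alpha$, and $\tilde{n}_\varepsilon$ denote the numbers of outputs with value 0, 1, $\alpha$, and $\varepsilon$, respectively. Since any $\alpha$ paired with one $\varepsilon$ will be transformed to a pair of 0 and 1, we must have

$$\tilde{n}_0 = n_0 + n_\alpha, \quad \tilde{n}_1 = n_1 + n_\alpha, \quad \tilde{n}_\varepsilon = n_\varepsilon - n_\alpha, \quad \tilde{n}_\alpha = 0 \qquad (4)$$

$$0 \le \tilde{n}_0, \tilde{n}_1 \le \frac{n}{2}, \quad \tilde{n}_0 + \tilde{n}_1 + \tilde{n}_\varepsilon = n \qquad (5)$$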
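The input/output behaviour of a BSN described above can be modelled at the level of tag counts. The sketch below is a functional model only: it does not compute switch settings, and the positions of the idle tags within each output half are chosen arbitrarily. The labels "alpha" and "eps" and the function name are assumptions of this sketch.

```python
from typing import List


def bsn_output_tags(tags: List[str]) -> List[str]:
    """Return one admissible output tag pattern of an n x n BSN."""
    n = len(tags)
    half = n // 2
    n_alpha = tags.count("alpha")
    # Each alpha, paired with one idle link, becomes one 0 and one 1, as in (4).
    out0 = tags.count("0") + n_alpha
    out1 = tags.count("1") + n_alpha
    out_eps = tags.count("eps") - n_alpha
    if out_eps < 0 or out0 > half or out1 > half:
        raise ValueError("input tags violate constraints (2) or (3)")
    # 0's go to the upper half, 1's to the lower half; idle tags fill the rest.
    upper = ["0"] * out0 + ["eps"] * (half - out0)
    lower = ["1"] * out1 + ["eps"] * (half - out1)
    assert (upper + lower).count("eps") == out_eps  # consistent with (5)
    return upper + lower


print(bsn_output_tags(["0", "alpha", "0", "eps", "eps", "eps", "eps", "1"]))
```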
An $n \times n$ BSN can be constructed by cascading two $n \times n$ reverse banyan networks (RBNs) as shown in Figure 4. The first RBN scatters all $\alpha$'s to 0's and 1's and is referred to as a scatter network. The transformation from the inputs to the outputs of the first RBN is $\{0, 1, \alpha, \varepsilon\} \rightarrow \{0, 1, \varepsilon\}$. The second RBN transfers all 0's to the upper half of the network outputs and all 1's to the lower half of the network outputs, while each of the $\varepsilon$'s may go to either the upper half or the lower half. The second RBN is referred to as a quasi-sorting network. In Figure 5, we show how an input pattern in a BSN is scattered in the first sub-network and then quasi-sorted in the second sub-network. Next, we discuss our basic component network, the reverse banyan network, and see how it can perform the functions of a scatter network and a quasi-sorting network.
The Reverse Banyan Network
An $n \times n$ reverse banyan network (RBN) is recursively constructed from two $\frac{n}{2} \times \frac{n}{2}$ RBNs followed by an $n \times n$ merging network. An $n \times n$ merging network consists of one stage of $\frac{n}{2}$ $2 \times 2$ switches, such that both the input and the output links of the stage are connected according to the perfect shuffle interconnection function. The network construction is shown in Figure 6.

We now discuss a useful property of a reverse banyan network. Suppose only two values, say, $\beta$ and $\gamma$, are to be passed in a reverse banyan network. We define an $n$-bit circular compact sequence of $\beta$'s and $\gamma$'s, denoted $C^n_{s,l;\beta,\gamma}$, where $0 \le s < n$ and $0 \le l \le n$. The meaning of $C^n_{s,l;\beta,\gamma}$ is that, in an $n$-bit sequence, all $l$ $\gamma$-bits are compacted together, followed by the $n - l$ $\beta$-bits, also compacted, in a circular way (modulo $n$), and $s$ is the starting position of the $\gamma$-bit sequence. Cheng and Chen [10] considered the circular compact sequence of 0's and 1's in their permutation network, and found an interesting property. We state it below in a more general form. In the following, we provide a different proof of Theorem 1 that is much easier to understand. The technique used in the proof and some observations will be applied to the design of our multicast network in later sections.
Theorem 1 For any -values on the inputs of an RBN, a circular
Suppose the inputs of the $n \times n$ RBN consist of $l$ $\gamma$'s and $n - l$ $\beta$'s, among which the upper half $\frac{n}{2}$ inputs contain $l_0$ $\gamma$'s and the lower half $\frac{n}{2}$ inputs contain $l_1$ $\gamma$'s, where $l_0 + l_1 = l$. Assume Theorem 1 holds for an $\frac{n}{2} \times \frac{n}{2}$ RBN. We will prove that it also holds for an $n \times n$ RBN by giving a positive answer to the following question. Then exchange($a$), denoted as $\bar{a}$, is the other input of the switch.
The inputs of the merging network with addresses shuffle($a$) and shuffle($\bar{a}$) are linked to inputs $a$ and $\bar{a}$ of the switch, respectively, and the outputs $a$ and $\bar{a}$ of the switch are linked to the outputs of the merging network with addresses shuffle($a$) and shuffle($\bar{a}$), respectively (see Figure 7). We can see that connections are only possible between the inputs and the outputs of a merging network with addresses shuffle($a$) and shuffle($\bar{a}$). Since only one-to-one mapping is considered, the switch has only two settings: parallel and crossing.

Figure 6: The recursive definition of an $n \times n$ RBN. The dashed box is an $n \times n$ merging network.
Note that $|\text{shuffle}(a) - \text{shuffle}(\bar{a})| = \frac{n}{2}$, that is, two inputs that are $\frac{n}{2}$ apart in their addresses can be connected to outputs with the same addresses in a parallel or crossing way. Let $r_i$ denote the setting for switch $i$ ($r_i = 0$: set to parallel; $r_i = 1$: set to crossing). We can also use a circular compact sequence to represent the switch setting. To avoid confusion in notation, we use $W^{n/2}_{s,l;0,1}$ to specify the compact switch setting of the $\frac{n}{2}$ switches in an $n \times n$ merging network, where $l$ consecutive switches have setting 1 with starting position $s$, and the rest of the switches have setting 0 in a circular way.
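To make the notation concrete, the following sketch builds circular compact sequences and applies one merging stage. It assumes, following the observation above, that switch $i$ pairs the two positions $i$ and $i + \frac{n}{2}$ on both sides of the merging network. The helper `find_merge` searches by brute force for starting positions of two half-size sequences that merge into a target sequence; it is not the paper's constructive switch setting, and all names are illustrative.

```python
from typing import List, Optional, Tuple


def circular_compact(n: int, s: int, l: int, run: str, rest: str) -> List[str]:
    """n-element sequence with l consecutive `run` values starting at s (mod n)."""
    seq = [rest] * n
    for k in range(l):
        seq[(s + k) % n] = run
    return seq


def merge(upper: List[str], lower: List[str], settings: List[int]) -> List[str]:
    """One merging stage; settings[i] is 0 (parallel) or 1 (crossing)."""
    half = len(upper)
    out = [""] * (2 * half)
    for i, r in enumerate(settings):
        if r == 0:  # parallel: each input keeps its own address
            out[i], out[i + half] = upper[i], lower[i]
        else:       # crossing: the two inputs swap addresses
            out[i], out[i + half] = lower[i], upper[i]
    return out


def find_merge(n: int, s: int, l0: int, l1: int,
               run: str = "g", rest: str = "b"
               ) -> Optional[Tuple[int, int, List[int]]]:
    """Search for (s0, s1, settings) merging two compact halves into the target."""
    half = n // 2
    target = circular_compact(n, s, l0 + l1, run, rest)
    for s0 in range(half):
        upper = circular_compact(half, s0, l0, run, rest)
        for s1 in range(half):
            lower = circular_compact(half, s1, l1, run, rest)
            settings = []
            for i in range(half):
                want = (target[i], target[i + half])
                if (upper[i], lower[i]) == want:
                    settings.append(0)
                elif (lower[i], upper[i]) == want:
                    settings.append(1)
                else:
                    break  # this pair of starting positions cannot work
            else:
                return s0, s1, settings
    return None


print(circular_compact(8, 6, 3, "g", "b"))
print(find_merge(8, 0, 2, 2))
```

For $n = 8$, $s = 0$, and $l_0 = l_1 = 2$, the search returns $s_0 = 0$, $s_1 = 2$ with settings 0, 0, 1, 1, so the crossing switches themselves form a compact run.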
The following lemma gives a positive answer to Question 1.

Case 1: $b = 0$. As shown in Figure 8(a), all inputs in both sub-segments $x_0$ and $x_1$ on the left are routed to the outputs on the right in a parallel way, and all inputs in both sub-segments $y_0$ and $y_1$ on the left are routed to the outputs on the right in a crossing way. Therefore, the segment on the right is $x_0 y_1 x_1 y_0$. We now prove that the sequence represented by the segment on the right is indeed the circular compact sequence $C^n_{s,l;\beta,\gamma}$. In the segment $x_0 y_1 x_1 y_0$, $x_0$ ending with $\gamma$ and $y_1$ starting with $\gamma$ make the $\gamma$'s in $C^{n/2}_{s_0,l_0;\beta,\gamma}$ and $C^{n/2}_{s_1,l_1;\beta,\gamma}$ consecutive at the joint of $x_0$ and $y_1$. Also, $x_1$ ending with $\beta$ and $y_0$ starting with $\beta$ make the $\beta$'s in $C^{n/2}_{s_1,l_1;\beta,\gamma}$ and $C^{n/2}_{s_0,l_0;\beta,\gamma}$ consecutive at the joint of $x_1$ and $y_0$. Moreover, the consecutiveness of $\beta$ or $\gamma$ is preserved at the joint of $y_1$ and $x_1$, since $x_1 y_1$ is a circular compact sequence. A similar argument applies to the joint of $y_0$ and $x_0$. Thus, the segment $x_0 y_1 x_1 y_0$ is a circular compact sequence, since the $l_0$ $\gamma$'s from $C^{n/2}_{s_0,l_0;\beta,\gamma}$ and the $l_1$ $\gamma$'s from $C^{n/2}_{s_1,l_1;\beta,\gamma}$ are concatenated in the segment. Also, from $s_1 = (s + l_0) \bmod \frac{n}{2}$, we have $s = (s_1 - l_0) \bmod n$. In the sequence $x_0 y_1 x_1 y_0$, the ending point (i.e. bottom) of $x_0$ is at position $s_1$. From this point going upward (in a circular way), there are $l_0$ consecutive $\gamma$'s, and the last $\gamma$ is at the starting position of the $\gamma$ sequence in $x_0 y_1 x_1 y_0$, which is $(s_1 - l_0) \bmod n$ in this case.
The RBN as a Scatter Network
In the last section, we considered only two values (0 and 1) for a full permutation assignment in an RBN. For a multicast assignment, we must deal with four values 0, 1, $\alpha$, and $\varepsilon$. The following theorem, Theorem 2, gives the main result of this section: the outputs of an RBN used as a scatter network satisfy (4) and (5). Its proof is given at the end of the section.
It is worth pointing out that although $n_\alpha \le n_\varepsilon$ holds globally for the original $n \times n$ RBN which is the scatter network of an $n \times n$ BSN (see (3)), for a recursively defined $n' \times n'$ sub-network RBN of the $n \times n$ RBN we may have $n'_\alpha > n'_\varepsilon$, where $n'_\alpha$ and $n'_\varepsilon$ are the numbers of inputs (of this $n' \times n'$ RBN) with values $\alpha$ and $\varepsilon$, respectively. This is because the $\alpha$'s and $\varepsilon$'s are distributed nonuniformly. To simplify the problem, we can combine 0 and 1 into a single value $\chi$. A link has a value $\chi$ if it has a single value 0 or 1.
If the two inputs of a $2 \times 2$ switch have values $\alpha$ and $\varepsilon$ respectively, $\alpha$ can be scattered so that both outputs of the switch have value $\chi$. Let $n_\alpha$ and $n_\varepsilon$ be the numbers of inputs with value $\alpha$ and $\varepsilon$, respectively. By exploring the properties of the circular compact sequence, we can obtain the following general result for an $n \times n$ RBN with any numbers of $\alpha$'s and $\varepsilon$'s on its inputs.
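Theorem 3 For any values on the inputs of an $n \times n$ RBN, if $n_\alpha \le n_\varepsilon$ (or $n_\alpha \ge n_\varepsilon$), a circular compact sequence $C^n_{s,n_\varepsilon - n_\alpha;\chi,\varepsilon}$ (or $C^n_{s,n_\alpha - n_\varepsilon;\chi,\alpha}$) with any starting position $s$ ($0 \le s < n$) can be achieved at the outputs of the RBN under a proper setting of the switches in the network.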
In the above theorem, we say that $\varepsilon$ (or $\alpha$) is the dominating type among $\alpha$ and $\varepsilon$ if $n_\alpha \le n_\varepsilon$ (or $n_\alpha \ge n_\varepsilon$).
We now further explore the property of the circular compact sequence in order to deal with multicast assignments. In the case of merging two circular compact sequences with the same set of binary values (e.g. merging two sequences with $\chi$'s and $\varepsilon$'s, $C^{n/2}$ [...] cover all possible intervals. Applying an approach similar to the proof of Lemma 1, we represent the input and output sequences as vertical segments as shown in Figure 9(a)-(d), with one sub-figure for each of the four cases.
Note that all $\alpha$'s on the inputs are in the upper half and all $\varepsilon$'s on the inputs are in the lower half. Let us first look at the two sub-segments $y_0$ and $y_1$ on the left in Figure 9(a). In the lower half of the inputs (i.e. $C^{n/2}_{s_1,l_1;\chi,\varepsilon}$), the sub-segment $y_1$ which starts from position $s_1$ and is of length $l_1$ (in a circular way, modulo $n$ [...]

Question 2 has three other symmetric variants, which can be obtained by changing the condition in Lemma 2 to $l_1 \ge l_0$ and $l = l_1 - l_0$, and/or swapping $\alpha$ for $\varepsilon$. The corresponding solutions to these three variants are also symmetric. Lemmas 3-5 give such solutions.
Lemma 3 Given $n$, $s$, $l$, $l_0$, and $l_1$ satisfying that $n$ is an even number, $0 \le s < n$, $0 \le l \le n$, $0 \le l_0 \le l_1 \le \frac{n}{2}$, and $l = l_1 - l_0$.

Now we are in a position to prove Theorem 3.
Proof of Theorem 3. By induction on the network size n.
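For $n = 2$ (base case), the network is a $2 \times 2$ switch. It is easy to check that Theorem 3 holds in this case (see [11]). For the inductive step, let single and double primes denote the counts on the inputs of the upper and lower $\frac{n}{2} \times \frac{n}{2}$ RBNs, respectively. Clearly, we have $n'_\alpha + n''_\alpha = n_\alpha$ and $n'_\varepsilon + n''_\varepsilon = n_\varepsilon$. We first assume $n_\alpha \le n_\varepsilon$ and consider the following cases.

Case 1: $n'_\alpha \le n'_\varepsilon$ and $n''_\alpha \le n''_\varepsilon$. By the inductive hypothesis, Theorem 3 holds for both the upper and lower $\frac{n}{2} \times \frac{n}{2}$ RBNs. That is, for any given integers $s_0$ and $s_1$ ($0 \le s_0, s_1 < \frac{n}{2}$), $C^{n/2}_{s_0,n'_\varepsilon - n'_\alpha;\chi,\varepsilon}$ and $C^{n/2}_{s_1,n''_\varepsilon - n''_\alpha;\chi,\varepsilon}$ can be achieved at the outputs of the upper and lower $\frac{n}{2} \times \frac{n}{2}$ RBNs, respectively. Now, let $l_0 = n'_\varepsilon - n'_\alpha$, $l_1 = n''_\varepsilon - n''_\alpha$, and $l = n_\varepsilon - n_\alpha$, which implies $l = l_0 + l_1$. Then by Lemma 1, given any integer $s$ ($0 \le s < n$), there exist integers $s_0$ and $s_1$ ($0 \le s_0, s_1 < \frac{n}{2}$) such that $C^{n/2}_{s_0,l_0;\chi,\varepsilon}$ and $C^{n/2}_{s_1,l_1;\chi,\varepsilon}$ can be merged to $C^n_{s,l;\chi,\varepsilon}$ through the $n \times n$ merging network under a proper switch setting. Hence, Theorem 3 holds.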
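Case 2: $n'_\alpha \le n'_\varepsilon$ and $n''_\alpha \ge n''_\varepsilon$. By the inductive hypothesis, for any given integers $s_0$ and $s_1$ ($0 \le s_0, s_1 < \frac{n}{2}$), $C^{n/2}_{s_0,n'_\varepsilon - n'_\alpha;\chi,\varepsilon}$ and $C^{n/2}_{s_1,n''_\alpha - n''_\varepsilon;\chi,\alpha}$ can be achieved at the outputs of the upper and lower $\frac{n}{2} \times \frac{n}{2}$ RBNs, respectively. Let $l_0 = n'_\varepsilon - n'_\alpha$, $l_1 = n''_\alpha - n''_\varepsilon$, and $l = n_\varepsilon - n_\alpha$. Then we have $l = l_0 - l_1$ and $l_0 \ge l_1$. By Lemma 4, given any integer $s$ ($0 \le s < n$), there exist integers $s_0$ and $s_1$ ($0 \le s_0, s_1 < \frac{n}{2}$) such that $C^{n/2}_{s_0,l_0;\chi,\varepsilon}$ and $C^{n/2}_{s_1,l_1;\chi,\alpha}$ can be merged to $C^n_{s,l;\chi,\varepsilon}$ through the $n \times n$ merging network under a proper switch setting. Thus, the result is also true in this case.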
Case 3: $n'_\alpha \ge n'_\varepsilon$ and $n''_\alpha \le n''_\varepsilon$. Similar to Case 2, by the inductive hypothesis and Lemma 3, we can see that Theorem 3 holds.
Thus, we have proved Theorem 3 for $n_\alpha \le n_\varepsilon$. Symmetrically, we can show that the theorem holds for $n_\alpha \ge n_\varepsilon$.
$\Box$
In the proof of Theorem 3, we can see that the number of tag values of the dominating type at the outputs of an RBN is the sum of those in its two sub RBNs, if the two sub RBNs have the same dominating type. In this case, Lemma 1 is applied, and we refer to it as $\varepsilon$/$\alpha$-addition. On the other hand, the number of tag values of the dominating type is the difference between those of its two sub RBNs, if the two sub RBNs have different dominating types. In this case, Lemmas 2-5 are applied, and we refer to it as $\varepsilon$/$\alpha$-elimination.
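The addition and elimination rules can be illustrated on tag counts alone. The sketch below computes the dominating type and its surplus for a sub-network and combines the results of two halves. Reporting a tie as "eps" with surplus 0 is a convention of this sketch, and the labels and function names are illustrative.

```python
from typing import List, Tuple

Dominance = Tuple[str, int]  # (dominating type, number of surplus tags)


def dominating(tags: List[str]) -> Dominance:
    """Dominating type among alpha and eps, with its surplus count."""
    n_alpha = tags.count("alpha")
    n_eps = tags.count("eps")
    if n_eps >= n_alpha:
        return "eps", n_eps - n_alpha
    return "alpha", n_alpha - n_eps


def combine(upper: Dominance, lower: Dominance) -> Dominance:
    """Combine the results of the two half-size sub-networks."""
    (kind0, l0), (kind1, l1) = upper, lower
    if kind0 == kind1:
        return kind0, l0 + l1  # addition: same dominating type
    # Elimination: different dominating types, surpluses cancel pairwise.
    if l0 > l1:
        return kind0, l0 - l1
    if l1 > l0:
        return kind1, l1 - l0
    return "eps", 0            # tie, reported as eps by convention


tags = ["alpha", "alpha", "0", "1", "eps", "eps", "eps", "1"]
print(combine(dominating(tags[:4]), dominating(tags[4:])), dominating(tags))
```

In this example the upper half is dominated by alpha with surplus 2 and the lower half by eps with surplus 3, so elimination leaves eps dominating with surplus 1.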
Finally, we are in the position to prove Theorem 2.
Proof of Theorem 2. Since $n_\alpha \le n_\varepsilon$ holds for the original $n \times n$ RBN which is the scatter network of an $n \times n$ BSN (see (3)), by Theorem 3 we can always eliminate the $\alpha$'s at the outputs of the RBN under a proper switch setting. On the other hand, from the proof of Theorem 3, we know that a tag value 0 or 1, once present on a link at some stage of the RBN, will be passed in a unicast way to some output without any value change. We also know that value changes among 0's, 1's, $\alpha$'s, and $\varepsilon$'s occur only in some $2 \times 2$ switches. In this case, the two inputs of the switch have $\alpha$ and $\varepsilon$ respectively, the switch setting is upper or lower broadcast (this is always true in our design), and the two outputs of the switch have 0 and 1 respectively. Consequently, a pair of $\alpha$ and $\varepsilon$ is transformed to a pair of 0 and 1. In total, $n_\alpha$ such pairs are transformed. Hence, $\tilde{n}_0$, $\tilde{n}_1$, $\tilde{n}_\alpha$, and $\tilde{n}_\varepsilon$, which are the numbers of outputs with value 0, 1, $\alpha$, and $\varepsilon$, respectively, satisfy the following: $\tilde{n}_0 = n_0 + n_\alpha$, $\tilde{n}_1 = n_1 + n_\alpha$, $\tilde{n}_\varepsilon = n_\varepsilon - n_\alpha$, and $\tilde{n}_\alpha = 0$. Also, by using (1) and (2), we have $0 \le \tilde{n}_0, \tilde{n}_1 \le \frac{n}{2}$ and $\tilde{n}_0 + \tilde{n}_1 + \tilde{n}_\varepsilon = n$. $\Box$
The RBN as a Quasi-Sorting Network
Theorem 1 can be used to perform bit sorting in an RBN, that is, under a proper switch setting, all 0's and 1's on the inputs of the RBN can be routed to its outputs in an ascending order. However, this works only for full permutation assignments, in which each input of the RBN has a tag value either 0 or 1, and cannot be directly applied to partial permutation or multicast assignments.
In the following, we discuss how an RBN can be used as a quasi-sorting network for partial permutation or multicast assignments. As can be seen in the last section, the outputs of the $n \times n$ RBN as a scatter network have tag values 0, 1, and $\varepsilon$ only, which are passed to the inputs of the $n \times n$ RBN as a quasi-sorting network, and the numbers of such 0's and 1's are each no more than $\frac{n}{2}$. The function of an $n \times n$ RBN as a quasi-sorting network is to route all 0's and 1's on the inputs of the RBN to the upper and lower halves of its outputs respectively, and to route the $\varepsilon$'s to the remaining positions at the outputs. We can let some of the $\varepsilon$'s be dummy 0's (denoted as $\varepsilon_0$'s) and the rest of the $\varepsilon$'s be dummy 1's (denoted as $\varepsilon_1$'s), such that the number of all 0's (including all the real 0's and the dummy 0's) and the number of all 1's (including all the real 1's and the dummy 1's) are both equal to $\frac{n}{2}$. Then by applying the bit sorting results in Theorem 1 we can achieve the quasi-sorting.
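The dummy-value idea can be sketched as follows. The sketch relabels idle tags as dummy 0's and dummy 1's so that each kind totals $\frac{n}{2}$, then models the sorted result at the tag level. Which idle links become dummy 0's is chosen here in input order, which is an arbitrary choice of this sketch rather than the rule of the distributed algorithm in [11]; the labels "eps0" and "eps1" and the function names are illustrative.

```python
from typing import List


def assign_dummies(tags: List[str]) -> List[str]:
    """Relabel idle tags so that 0-like and 1-like tags each total n/2."""
    half = len(tags) // 2
    need0 = half - tags.count("0")  # dummy 0's still required
    if need0 < 0 or tags.count("1") > half:
        raise ValueError("more than n/2 real 0's or real 1's")
    out = []
    for t in tags:
        if t == "eps" and need0 > 0:
            out.append("eps0")
            need0 -= 1
        elif t == "eps":
            out.append("eps1")
        else:
            out.append(t)
    return out


def quasi_sort(tags: List[str]) -> List[str]:
    """Tag-level model: 0's and dummy 0's go up, 1's and dummy 1's go down."""
    labelled = assign_dummies(tags)
    upper = [t for t in labelled if t in ("0", "eps0")]
    lower = [t for t in labelled if t in ("1", "eps1")]
    # After sorting, dummy values are simply idle links again.
    return ["eps" if t.startswith("eps") else t for t in upper + lower]


pattern = ["eps", "0", "1", "0", "eps", "1", "eps", "0"]
print(assign_dummies(pattern))
print(quasi_sort(pattern))
```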
The Self-Routing Algorithms for RBNs
Lemmas 1-5 actually provide a way to perform switch setting for all the switches in an RBN as a bit sorting network and as a scatter network, while switch setting for an RBN as a quasi-sorting network can be transformed to that for an RBN as a bit sorting network after all the $\varepsilon$'s on the inputs are properly divided into $\varepsilon_0$'s and $\varepsilon_1$'s. Based on the recursive construction of an RBN, we can formulate the structure of an RBN as a complete binary tree, shown in Figure 10(a). The root node of the tree represents the original RBN as a bit sorting network, a scatter network, or a quasi-sorting network; the two child nodes of the root represent the two recursively defined sub RBN networks of the original RBN; and so on; and finally the leaves of the tree represent the inputs of the original RBN. The distributed routing algorithms are performed by each node of the binary tree. The algorithms start from the leaves and perform forward computations all the way to the root, and then start from the root and perform backward computations all the way to the leaves. The detailed description of the distributed algorithms can be found in [11].
Circuit Design Issues and the Complexity
Based on the distributed algorithms, we can design a self-routing circuit for our RBN-based multicast network in a similar approach to Cheng and Chen's [10] self-routing circuit design for their RBN-based permutation network. In this section, we only briefly discuss some circuit design issues.
First of all, as can be seen, five different tag values 0, 1, $\alpha$, and $\varepsilon$ (including $\varepsilon_0$ and $\varepsilon_1$) are used in the RBN as a scatter network and as a quasi-sorting network. We need to use 3 bits to represent a tag value. The encoding scheme for these tag values is shown in [11]. Secondly, the binary tree structure in all distributed algorithms can be embedded into an RBN. For a balanced hardware distribution, we separate the tree into a forward tree and a backward tree as shown in Figure 10(b), where the first and the last switches of the last stage of a sub RBN network can serve as the nodes in the forward tree and the backward tree, respectively, and the switches in between can use the results from these two nodes for their switch settings. Thirdly, we can see that the most frequently used operation in the distributed algorithms is addition (or addition-like operations). For an $n \times n$ RBN, the maximum values of $n_\alpha$, $n_\varepsilon$, or $n_1$ are $n$, which can be represented by using at most $\log n$ bits. We can implement the adder in a pipelined fashion such that the $\log n$-bit adder is reduced to a one-bit adder. Also, since the distributed algorithms work in a pipelined fashion, the delay caused by running the forward and the backward processes is also reduced.
Next, we analyze the hardware cost, network depth, and set-up time of the new multicast network. Since we add only a constant cost (a constant number of one-bit adders) to each switch for the self-routing circuit and there are $\frac{n}{2}\log n$ switches in an $n \times n$ RBN, the hardware cost for the RBN is $O(n\log n)$, and is the same for an $n \times n$ BSN. Let $C(n)$ denote the cost of an $n \times n$ BRSMN. By the recursive construction in Figure 1, we have $C(n) = O(n\log n) + 2C\left(\frac{n}{2}\right)$, which implies $C(n) = O(n\log^2 n)$. Since the network depth of an $n \times n$ RBN is $O(\log n)$, so is that of an $n \times n$ BSN. Let $D(n)$ denote the network depth of an $n \times n$ BRSMN. From $D(n) = O(\log n) + D\left(\frac{n}{2}\right)$, we immediately have $D(n) = O(\log^2 n)$.
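For completeness, the two recurrences can be unrolled explicitly. In the sketch below, $c$ and $c'$ are hypothetical constants bounding the per-level BSN cost and depth, logarithms are base 2, and the $2 \times 2$ switch is treated as the last level of the recursion.

```latex
\begin{align*}
C(n) &= 2\,C\!\left(\tfrac{n}{2}\right) + c\,n\log n
      = \sum_{i=0}^{\log n - 1} 2^{i}\cdot c\,\frac{n}{2^{i}}\,\log\frac{n}{2^{i}} \\
     &= c\,n \sum_{i=0}^{\log n - 1} (\log n - i)
      = c\,n\,\frac{\log n\,(\log n + 1)}{2}
      = O(n\log^{2} n), \\[4pt]
D(n) &= D\!\left(\tfrac{n}{2}\right) + c'\log n
      = c' \sum_{i=0}^{\log n - 1} (\log n - i)
      = c'\,\frac{\log n\,(\log n + 1)}{2}
      = O(\log^{2} n).
\end{align*}
```

At level $i$ there are $2^i$ sub-networks of size $\frac{n}{2^i}$, so every level contributes $c\,n(\log n - i)$ to the cost, while only one sub-network per level lies on any input-to-output path and contributes to the depth.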
Note that the distributed switch setting algorithms work in a pipelined fashion. It takes $O(\log n)$ unit time delay for the first bit (from an input) to reach a switch in the last stage of an $n \times n$ RBN. Then it takes only $O(1)$ unit time for each of the subsequent $\log n - 1$ bits to reach a switch in the last stage. Hence, the propagation delay in the forward phase is $O(\log n)$ unit time, and so is that in the backward phase. By a similar analysis, the total propagation delay of the switch setting algorithm for an $n \times n$ BRSMN is $O(\log^2 n)$.
Moreover, note that in our design all major functional components are recursively defined reverse banyan networks. As shown in [11], by reusing part of the network we can construct a feedback version of our multicast network with $O(n\log n)$ cost.
Finally, we can see that the RBN-based multicast network presented in this paper achieves the same order of complexities of network cost, network depth, set-up time as those of the RBN-based permutation network proposed by Cheng and Chen [10] .
Conclusions
In this paper, we have proposed a design for a new self-routing multicast network based on recursive decompositions of multicast networks. Unlike earlier multicast networks, the new multicast network is conceptually simple and highly modular. The network design is based on the binary radix sorting concept, and all functional components of the network are recursively constructed reverse banyan networks. In addition, the routing circuit is completely distributed into each switch of the network so that the network operates in a self-routing manner. The new multicast network compares favorably with previously proposed multicast networks. It uses $O(n\log^2 n)$ logic gates, and has $O(\log^2 n)$ gate delay and $O(\log^2 n)$ set-up time, where the unit of time is a gate delay. By reusing part of the network, the feedback version of our design can further reduce the network cost to $O(n\log n)$.
