Abstract: Benes/Clos networks have been used in many areas, such as interconnection networks in parallel computers, multiprocessor systems, and networks-on-chip. A parallel switch setting algorithm is the key to satisfying the requirements of high-performance switching networks. Lee's routing algorithm is by far the most efficient parallel routing algorithm for Benes networks. However, there is no hardware implementation of this algorithm. In this paper, Lee's routing algorithm is fully implemented in RTL and synthesised. We have refined the data structure and the initialisation/updating of relation values to make the algorithm suitable for hardware implementation. The simulation and synthesis results of the switch setting circuits for 4 × 4 to 64 × 64 Benes networks confirm that the timing, area, and power consumption of the circuit are consistent with the complexity of Lee's algorithm. To the best of our knowledge, this is the first complete hardware implementation of the parallel switch setting algorithm which can handle all types of permutations including partial ones.
Introduction
Both Benes and Clos networks are rearrangeably non-blocking multi-stage interconnection networks. The Benes network is a special case of the Clos network which has $N = 2^n$ inputs and outputs. The Benes network is constructed recursively from 2 × 2 switching nodes. Due to their non-blocking property and relatively small number of cross-points, Benes/Clos networks have received much attention in both academia and industry. Benes/Clos networks have been used in many areas, such as interconnection networks in parallel computers, multiprocessor systems (Levitt et al., 1968), and networks-on-chip (Kao et al., 2011; Liu et al., 2014; Joshi et al., 2009; Richter, 2013; Moussa et al., 2007). Compared with direct networks (Bahn et al., 2007; Liu et al., 2012), Benes/Clos networks can provide uniform latency and throughput, which are very important for cache-coherent many-core systems (Sewell et al., 2012). In packet switching systems, the switch fabric must be able to provide internally conflict-free paths for the requesting packets in each time slot (Lee and Liew, 1996). This is implemented by setting the states of all switches in the network. It is clear that the routing assignment (i.e., switch setting) scheme in Benes/Clos networks has a strong impact on the efficiency of the Benes/Clos networks.
A number of switch setting algorithms have been developed in the past few decades, including sequential algorithms and parallel algorithms. Sequential algorithms such as looping algorithms (Yeh and Feng, 1992) are designed for circuit switching systems where the switching configuration can be rearranged at relatively low speed. A simple setting algorithm (Yeh and Feng, 1992) with time complexity O(NlogN) is proposed based on Waksman's proof (Waksman, 1968). As a matter of fact, using a sequential algorithm, the N × N Benes network cannot be set up in less than O(NlogN) time, because there are O(NlogN) switches. The setup time is much longer than the latency in Benes networks, which is O(logN) for an N × N network. In order to obtain a switch setting algorithm of complexity comparable to the network latency, parallel algorithms are needed.
Nassimi and Sahni (1982) developed a parallel setup algorithm which runs significantly faster than the sequential algorithm based on Waksman's proof. The complexity of this algorithm depends on the parallel computer model and the number of processing elements (PEs) available. Four SIMD models with different topologies are studied as follows:
4 Perfect shuffle computer (PSC): This model employs the shuffle connection of Stone's work (Stone, 1971). The time complexity is $O(\log^4 N)$.
We can see that the time complexity of topologies other than CIC is fairly high. However, CIC is simply too complex to be realised. In addition, this parallel algorithm (Nassimi and Sahni, 1982) cannot handle partial permutations. Implementing the algorithm in SIMD systems is not efficient enough compared with its complexity. The authors also proposed a self-routing algorithm for Benes networks (Nassimi and Sahni, 1982) to route through the network using destination tags. However, this algorithm cannot route all permutations (Rathod et al., 2015). A fast parallel algorithm with pipelining (Lee and Oruc, 1995) is proposed which achieves an O(logN) speedup over Nassimi and Sahni's algorithm for unicast assignments on both CIC and the extended shuffle-exchange network. Lu and Zheng (2005) propose a fast parallel algorithm which can route K connections in O(logNlogK) time for rearrangeable non-blocking networks based on edge-colourings of bipartite graphs. A list of parallel routing algorithms is surveyed in Rathod et al. (2015).
Lee and Liew present a parallel routing algorithm (Lee and Liew, 1996; Lee and Liew, 2002) for Benes networks. It has time complexity $O(\log^2 N)$, which is the same as that on CIC, but uses only N/2 PEs (Nassimi and Sahni, 1982). This algorithm was developed based on the previous work (Waksman, 1968; Nassimi and Sahni, 1982), but can handle the partial permutation problem. In addition, the algorithm can be extended and applied to Clos networks in which the number of central modules is a power of two. In the literature, there is nearly no hardware implementation of this parallel algorithm. A simple hardware design (Hamada et al., 2009) based on Lee's algorithm for a 16 × 16 Benes network in FPGA is presented. However, no detailed design and simulation results are shown in that paper. Another problem is that the work is limited to the switch setting unit for the first stage of the 16 × 16 Benes network. Without the design of the switch setting circuit for networks of different sizes, there is no way to tell how the hardware cost would increase when the network size grows.
In this paper, we present the hardware design of Lee's parallel routing algorithm for Benes networks in different sizes ranging from 4 × 4 to 64 × 64. The algorithm is refined to make it more suitable for hardware design. The RTL level design of the algorithm is coded in Verilog, simulated, and synthesised using Cadence tools under 65 nm technology. The timing delay trend is consistent with the time complexity trend of Lee's algorithm. The switch setting hardware design can be integrated with Benes network circuit to be used in high-performance network-on-chip systems.
The rest of the paper is organised as follows. Section 2 presents the basic knowledge of Benes network and its routing constraints. Section 3 presents the parallel routing algorithm. Section 4 presents the RTL design and improvement of Lee's parallel routing algorithm. Section 5 presents the synthesis results and analysis of the results.
Benes network and routing constraints
The Benes network is a special instance of the Clos network. An N × N Benes network is basically built with two symmetrical butterfly networks. A Benes network can be considered as a cascaded combination of an omega network and a reverse omega network overlapped at the middle stage. As such, the link patterns of the Benes network are symmetric about the centre stage. Besides, the Benes network is inherently recursive. An N × N Benes network can be built recursively from two N/2 × N/2 Benes networks, $S_{up}$ and $S_{down}$, which represent the up and down Benes subnetworks, respectively. As shown in Figure 1, the 8 × 8 Benes network can be divided into two 4 × 4 Benes networks and two extra stages, each composed of four 2 × 2 switching nodes, at the input side and output side, respectively. A complete path of the Benes network can be decomposed into the forward sub-path and the backward sub-path, routed in the omega network and the reverse omega network, respectively. The two sub-paths must meet at one of the switches in the middle stage. Therefore, between any pair of input and output ports of an N × N Benes network, there exist N/2 routing paths, one through each of the N/2 middle-stage switches.
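The recursive construction above fixes the size of the network. As a minimal sketch (the function name `benes_dimensions` and the dictionary keys are illustrative and not part of the original design), the stage and node counts of an N × N Benes network built from 2 × 2 switching nodes can be tabulated as follows:

```python
def benes_dimensions(n_ports: int) -> dict:
    """Stage and node counts of an N x N Benes network of 2 x 2 switching nodes."""
    if n_ports < 2 or n_ports & (n_ports - 1):
        raise ValueError("N must be a power of two, N >= 2")
    n = n_ports.bit_length() - 1        # n = log2(N)
    stages = 2 * n - 1                  # two butterflies sharing the middle stage
    per_stage = n_ports // 2            # every 2 x 2 node serves two ports
    return {
        "stages": stages,
        "nodes_per_stage": per_stage,
        "total_nodes": stages * per_stage,
    }


for size in (4, 8, 16, 32, 64):
    print(size, benes_dimensions(size))
```

For the 8 × 8 network of Figure 1 this gives five stages of four switching nodes each, which agrees with two 4 × 4 subnetworks plus one extra stage on each side.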
The non-blocking routing in Benes networks is achieved if the following constraints are satisfied:
• Symmetric routing constraint: To route from input s to output d, either the $S_{up}$ or the $S_{down}$ subnetwork must be assigned to the sub-paths on the omega network and the reverse omega network simultaneously. This constraint must hold for each inner stage, recursively. As such, when the output state of the switching node at the forward stage in the omega network is determined, the input state of the switching node at the symmetric backward stage in the reverse omega network is also determined.
• Internally conflict-free constraint: To avoid conflicts between connection requests, the two input ports (resp. output ports) of each input switching node (resp. output switching node) cannot be assigned to the same output port (resp. input port).
Each switching node has two states: '0' (i.e., straight) and '1' (i.e., cross), as shown in Figure 2. Combined with Figure 1, we can see that any input port of a switching node must connect to the '0' output port to reach $S_{up}$, or connect to the '1' output port to reach $S_{down}$. The output state at each stage can be represented as a binary bit (namely, the routing bit). The routing bits ('0' or '1'), as shown in Figure 2, at all stages compose the path in the Benes network. Table 1 shows the relation between the switch state and the routing bit corresponding to its input ports. The state of a switching node determines the routing bit value of a port, and vice versa. Following the internally conflict-free constraint, the routing bits of the two input ports of a switching node have to be distinct.
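The relation between switch state and routing bit described above can be written as a single XOR. The sketch below assumes that the upper port of a node in the straight state is the one that reaches $S_{up}$ (routing bit 0); the helper names `routing_bit` and `switch_state` are illustrative only.

```python
def routing_bit(port: int, state: int) -> int:
    """Routing bit of a port, given the state of its node (0 = straight, 1 = cross).

    Assumption: the even (upper) port of a straight node has routing bit 0.
    """
    return (port & 1) ^ state


def switch_state(port: int, bit: int) -> int:
    """Inverse mapping: the node state that gives `port` the routing bit `bit`."""
    return (port & 1) ^ bit


for state in (0, 1):
    # The two ports of one node always receive distinct routing bits.
    print(state, routing_bit(0, state), routing_bit(1, state))
```

Because the mapping is its own inverse, fixing the routing bit of either port fixes the state of the node, and the two ports of a node can never share a routing bit.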
Lee's parallel routing algorithm
Lee's parallel algorithm can be decomposed into four major steps: initialisation, searching, merging and calculating the permutation for subnetworks. Denote the sets of input and output ports as I and O, respectively, i.e., I = O = {0, 1, …, N − 1}, and let π: I → O be an input-output permutation indicating connection requests. We use (i, j) to indicate that the i-th input port is going to connect to the j-th output port in the permutation. In this section, we will use an example permutation to elaborate the main concept of this algorithm. In the permutation below, 'X' means that this input port has no output request.
0 1 2 3 4 5 6 7 0 3 2 6 4 7 5 π X
According to the symmetric routing constraint, in Lee's algorithm, the output side switch setting is determined first, and then the input side switch setting is derived.
Initialisation
The first step of Lee's algorithm is to build the connections between output switching nodes using relation values. The algorithm needs to group switching nodes with the same relation together, and assign the switch state values to them consistently. Here, we adopt the same notation as Lee and Liew (1996). We use $a_i$ and $b_i$ to denote the switch state values of the input switching node $a_i$ and the output switching node $b_i$, respectively. Let 
The internal conflict-free constraint requires that
The combination of (1) and (2) gives ( )
Then we have
For the given permutation, we have: 
For the i th input switching node, we refer to the output port pair (k, l) corresponding to the input port pair (2i, 2i + 1) as a connection pair. Then we obtain:
Based on equation (3), we have:
Considering the given permutation as shown in Figure 3, we can obtain a set of N / 2 initialising equations as follows:
As shown in Figure 3, each initialising equation is used to establish a pointer, in which the state variable with the larger index points to the one with the smaller index. After the initialisation step, all output switching nodes can be grouped into equivalence classes. The representative node of each class is the switching node with the smallest index number. Regardless of the Benes network radix, the initialisation step is processed at all PEs at the same time with time complexity O(1).
Searching
As shown in Figure 3 , there are two pointer types, Type 0 Pointer indicating the two state variables are equal, and Type 1 Pointer indicating the two state variables are not equal. All switching nodes except the representative node in the group will go through the searching step to point to the representative node. The time complexity of searching step is O(logN). Figure 4 shows the searching result of Figure 3 . 
Merging
Usually, among the nodes belonging to the same class, there should be only one endpoint, which is the representative node of the class. If there are two endpoints in one class, then the merging step is needed to eliminate one of them. The time complexity of this merging step is O(1). Figure 5 shows that the two endpoints $b_0$ and $b_2$ are pointed to by $b_3$, which means the value of $b_3$ would be determined by the values of both $b_0$ and $b_2$, causing a conflict. As shown in Figure 5, after the merging step, the direct connection between the two endpoints $b_0$ and $b_2$ is found. By applying the symmetric routing constraint, the state values of the input switching nodes should be set up as:
$\text{State}(a_0, a_1, a_2, a_3) = (0, 0, 1, 0)$
Figure 6 shows the settings of input/output switching nodes for the given permutation π.
Permutation for subnetwork
After the state values of input/output switching nodes are determined, the switch settings of the two inner N/2 × N/2 subnetworks can be determined recursively. The permutations of the two inner subnetworks can be derived by tracing the routing paths from both input and output sides. The time complexity to calculate those permutations for subnetworks is O(1). In a recursive manner, the state values of all stages will be computed by Lee's parallel routing algorithm. Figure 7 shows the connections of the two inner subnetworks and the two derived permutations, $π_0$ for $S_{up}$ and $π_1$ for $S_{down}$.
This process continues until the state values of the middle stage switching nodes are determined. As we can see from the description above, the searching step is the only procedure whose duration depends on the radix of the Benes network. All the other procedures can be finished in O(1). The time complexity for each round is therefore determined by the searching procedure, which is O(logN).
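The per-round bound can be combined with the recursion depth to recover the overall $O(\log^2 N)$ complexity quoted in the introduction. As a sketch, assume that one round on a $2^m \times 2^m$ network costs at most $c\,m$ time units, where the constant $c$ is introduced here only for illustration, and that rounds are run for sizes $2^n, 2^{n-1}, \ldots, 4$:

```latex
T(N) \;=\; \sum_{m=2}^{n} c\,m
     \;=\; c\left(\frac{n(n+1)}{2} - 1\right)
     \;=\; O(n^2)
     \;=\; O(\log^2 N),
\qquad n = \log_2 N .
```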
Hardware design of Lee's algorithm
Design flow
As shown in Figure 8, the hardware design of Lee's algorithm follows the common RTL design flow which consists of four steps:
1 specification
2 RTL design
3 simulation of the RTL code
4 synthesis of the RTL design.
In the second step, we use Verilog HDL to implement the RTL design of Lee's parallel algorithm. As shown in Figure 9, the switch setting circuit of an N × N Benes network has N / 2 PEs, each representing an output switching node, connected by the main frame. Each PE i holds several variables. The two major parts of the main frame are the control logic and the shared memory. Table 2 lists the variables used in our design. For an N × N Benes network, each variable storing a port index has $n = \log_2 N$ bits. The global variables are shared among all PEs. As each output switching node (represented by one PE) has two ports, 0 and 1, we adopt a two-register structure for each output switching node to store the pointers associated with port 0/1. In the searching step of Lee's algorithm, each PE may need to search in two directions. The two-register structure allows each PE to keep searching in two directions until both reach the representative nodes. Here, four variables are used for storing the index of the node pointed to by the port 0/1 pointer (nodeValue0/1) and the corresponding relation value (nodeS0/1), respectively. The size of these shared registers is determined by the radix of the Benes network. For an N × N Benes network, the size of nodeValue0/1 is (N / 2) * logN bits, as there are N / 2 output switching nodes and logN bits are needed to represent the index of each port. The size of nodeS0/1 is N / 2 bits, as one bit is needed to represent the relation value between two connected switching nodes, '0' representing equal and '1' representing not equal, in line with the two pointer types of Section 3. The control logic is responsible for the following functions:
1 Maintaining and updating the registers' data and status respectively, according to the newest information received from PEs.
2 Calculating the setting value for switching nodes on the inputs/outputs stage.
3 Calculating the input/output permutation for the subnetworks.
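The register sizes stated above for nodeValue0/1 and nodeS0/1 grow with the network radix. A short sketch (the function name `shared_register_bits` is illustrative) that tabulates the width of one nodeValue register and one nodeS register for the network sizes considered in this paper:

```python
def shared_register_bits(n_ports: int) -> dict:
    """Bit widths of one nodeValue register and one nodeS register for an N x N network."""
    if n_ports < 4 or n_ports & (n_ports - 1):
        raise ValueError("N must be a power of two, N >= 4")
    index_bits = n_ports.bit_length() - 1   # log2(N) bits per stored index
    nodes = n_ports // 2                    # one entry per output switching node
    return {"nodeValue": nodes * index_bits, "nodeS": nodes}


for size in (4, 8, 16, 32, 64):
    print(size, shared_register_bits(size))
```

Since there are two registers of each kind (port 0 and port 1), the totals for the shared memory are twice the figures printed here.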
Every clock cycle, the control logic gathers the updated values of nodeValue0/1 and nodeS0/1 from all PEs. All PEs have full access (write and read) to all bits of both nodeValue0/1 and nodeS0/1, so that each PE can modify any bit of these registers at any time. As such, the design must guarantee that no more than one PE writes the same bit of these registers in the same clock cycle. Lee's algorithm ensures that when there is no conflict in the permutation, each element of nodeValue0/1 and nodeS0/1 will only be updated by one PE in each step. This intrinsic exclusivity guarantees that, for each bit of nodeValue0/1 and nodeS0/1, only one PE modifies it in each clock cycle and no write conflict can happen. The second task of the control logic is to calculate the state values for the input/output switching nodes. The state values of output switching nodes can be obtained from nodeS0/1. The state values for the input switching nodes are based on the symmetric routing constraint.
The last task of the control logic is calculating the input/output permutation for the subnetworks. Lee's algorithm calculates the switch setting values recursively from the outermost stages to the innermost stages. Taking the 16 × 16 Benes network as an example, according to the state values of the input/output switching nodes, the control logic will derive the permutations for the two inner 8 × 8 Benes networks. These tasks are discussed in detail in the following subsections.
Finite state machine
In this section, the RTL design of Lee's parallel algorithm is presented. Following the process of Lee's parallel routing algorithm, we derive the finite state machine of each PE as shown in Figure 10, which comprises five steps:
1 IDLE
2 INIT
3 SEARCH
4 MERGE
5 DONE
Each step can be divided into several states to complete the function that this step is supposed to perform. Those states named with the suffix 'WAIT' are used to synchronise PEs. All the PEs need to wait one clock cycle so that the register values updated by other PEs become valid in all PEs. In the remainder of this section, we describe these five main steps.
IDLE
At the starting point, all PEs are in the IDLE state to wait for the new permutation between input and output ports. When the new permutation arrives by setting input ports of all input switching nodes, all PEs will enter the INIT state to conduct initialisation functions. Before the PE enters the INIT state, the control unit needs one clock cycle to synchronise with all other PEs. In the IDLE state, all register values are reset to default values, where nodeValue0/1 and prenodeValue0/1 are set to the current node index and nodeS0/1 are all reset to 0.
INIT
In Lee's parallel routing algorithm, the first step is to initialise the pointers and relation values between output switching nodes. This initialisation process is determined by the permutation between inputs and outputs of Benes network. Consider the following permutation for a 16 × 16 Benes network: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 10 14 9 2 8 13 12 15 1 7 11 5 0 4 6
As discussed in Section 3, there are two types of relation between two connected output switching nodes, equal or not equal, represented as '0' or '1', respectively. In Lee's parallel routing algorithm, in order to find the relation between two such output switching nodes, the equations between the routing bits of input/output switching nodes need to be derived first. In our design, the relation between two output switching nodes can be derived directly from the parity of the two output port indexes corresponding to the two input ports of each PE. Given the connection pair (k, l) for an input port pair (2i, 2i + 1) (i.e., port0 and port1 in our design), according to equations (4), (5) and (7), we derive the four possibilities of the above equation:
Case 1 k is even and l is even, we have
After the initialisation step, all output switching nodes will be divided into one or more classes depending on the permutation of inputs/outputs, as shown in Figure 11. All output switching nodes in the same class are bound together such that once the state value of any switching node is determined, the state values of all the other switching nodes are determined as well. For the example shown above, if the switch setting value of $b_0$ is 0, then the state values of the whole class are as shown in Figure 12.
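The parity rule used in the INIT step can be sketched in software as follows. The sketch assumes the routing-bit convention in which an even port of a straight node has routing bit 0, under which two output ports of the same parity force their output switching nodes to differ (relation 1) and ports of opposite parity force them to be equal (relation 0). It also assumes that the entry missing from the printed 16 × 16 permutation belongs to an idle port of the last input switching node. The names `init_relations` and `EXAMPLE_16` are illustrative.

```python
def init_relations(perm):
    """Return one (larger_node, smaller_node, relation) triple per connection pair.

    perm[s] is the output port requested by input port s, or None if idle.
    relation: 0 = the two output switching nodes are equal, 1 = not equal.
    """
    edges = []
    for i in range(len(perm) // 2):
        k, l = perm[2 * i], perm[2 * i + 1]
        if k is None or l is None:
            continue                       # a single connection imposes no relation
        u, v = k >> 1, l >> 1              # output switching nodes of ports k and l
        if u == v:
            continue                       # both ports on one node: nothing to link
        relation = 1 - ((k ^ l) & 1)       # same parity -> 1, opposite parity -> 0
        edges.append((max(u, v), min(u, v), relation))
    return edges


# Assumption: the last input port is idle (None).
EXAMPLE_16 = [10, 14, 9, 2, 8, 13, 12, 15, 1, 7, 11, 5, 0, 4, 6, None]
for edge in init_relations(EXAMPLE_16):
    print(edge)
```

Each triple corresponds to one pointer from the node with the larger index to the node with the smaller index, matching the pointer direction used in the initialisation step.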
SEARCH
As discussed in Section 3.2, in the searching step, all PEs search in parallel and update the node pointers until reaching the representative node of the class, i.e., the switching node with the smallest index number in the class. The number of searching steps is bounded by $\log_2 N$.
As shown in Figure 10, right after the state machine enters the SEARCH state, each PE $P_i$ updates the locally stored pointers nodeValue0/1[i] and relation values nodeS0/1[i] until the pointer values no longer change in the current searching iteration. To detect the ending condition of the searching step, the node pointer's current value nodeValue0/1 is stored in prenodeValue0/1 before each search in the SEARCH state. Figure 13 shows that, after searching, all PEs point to one endpoint except the one representing b7, which reaches two endpoints, b1 (node[1]) and b0 (node[0]). In each class, there can be only one representative node. To solve this problem, we must merge the two end nodes pointed to by the same PE, as shown in Figure 5. This process is done in the MERGE state.
The following two conditions must be satisfied before transferring to the MERGE state.
• After one searching step, the value contained in register preNodeValue does not change.
• The switching node has type value 'nodeType == 2'b11', which means the switching node points to two endpoints.
At each PE $P_i$, the following code is used to determine whether to transit to the MERGE state.
if ((preNodeValue0 = nodeValue0) and
If the two pointers of the switching node point to the same endpoint, then the FSM transits to the MERGE_SN state, in which one of the two pointers of the switching node is reset to its initial value; otherwise, the FSM transits to the MERGE state.
MERGE
When the PE reaches the endpoints in both directions and the two endpoints are different, the merging step is conducted. As in the initialisation step, the pointer of the node with the larger index is updated with the smaller node index. As shown in Figure 14, the PE merges the endpoints of b7 by overwriting the nodeValue register of b1 with b0. We can also see that, after the merging process, the switching nodes previously pointing to node b1 need to be updated to point to b0. For the example in Figure 14, after the merging step, nodes b6 and b4 need to go through the searching step again to update their pointers to the representative node b0. For each PE $P_i$, the following code is used to update pointers.
Figure 14 MERGE
After the merging step, the PE notifies the other PEs so that all of them transit to the SEARCH state. As shown in Figure 15, after the searching step, all the switching nodes point to the representative node of this class. The initial state value for the representative node 'b0' of this class is '0'; the state values of all other switching nodes can then be determined in parallel from the relation value nodeS. The switch state values shown in Figure 15 are exactly the same as those shown in Figure 12.
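The SEARCH, MERGE and second SEARCH sequence described above can be replayed in software. The following is a behavioural sketch of the two-register structure, not the RTL code: each node holds two (pointer, relation) registers, searching follows the non-self register of the node pointed to, and merging links the larger endpoint to the smaller one. The pointer list `EDGES` assumes the parity rule (same parity means not equal) and an idle last input port for the 16 × 16 example; all function names are illustrative.

```python
# (larger node, smaller node, relation) triples assumed for the 16 x 16 example.
EDGES = [(7, 5, 1), (4, 1, 0), (6, 4, 0), (7, 6, 0), (3, 0, 1), (5, 2, 1), (2, 0, 1)]


def build_registers(n_nodes, edges):
    """Two registers per node, each initialised to (own index, 0) as in IDLE."""
    regs = [[(i, 0), (i, 0)] for i in range(n_nodes)]
    for src, dst, rel in edges:
        slot = 0 if regs[src][0][0] == src else 1   # first register still unused
        regs[src][slot] = (dst, rel)
    return regs


def search(regs):
    """Pointer jumping on both registers until no pointer changes."""
    rounds = 0
    while True:
        new = []
        for i, pair in enumerate(regs):
            updated = []
            for p, r in pair:
                if p == i:                           # unused register stays put
                    updated.append((p, r))
                    continue
                # Follow the pointer held by node p, if it has one.
                q, s = next(((q, s) for q, s in regs[p] if q != p), (p, 0))
                updated.append((q, r ^ s))
            new.append(updated)
        if new == regs:
            return regs, rounds
        regs, rounds = new, rounds + 1


def merge(regs):
    """Link two distinct endpoints reached by one node (larger points to smaller)."""
    regs = [list(pair) for pair in regs]
    merged = False
    for i, ((p0, r0), (p1, r1)) in enumerate(regs):
        if p0 != i and p1 != i and p0 != p1:
            lo, hi = min(p0, p1), max(p0, p1)
            regs[hi][0] = (lo, r0 ^ r1)
            merged = True
    return regs, merged


def solve_class(n_nodes, edges):
    """Return output node states (representatives set to 0) and sweeps per search."""
    regs = build_registers(n_nodes, edges)
    sweeps = []
    while True:
        regs, rounds = search(regs)
        sweeps.append(rounds)
        regs, merged = merge(regs)
        if not merged:
            break
    states = [pair[0][1] if pair[0][0] != i else 0 for i, pair in enumerate(regs)]
    return states, sweeps


print(solve_class(8, EDGES))
```

Under these assumptions the sketch needs two searching sweeps before the merge and one after it, and the node representing b7 is the only one that reaches two endpoints (b0 and b1), as described for Figures 13 and 14.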
In our design, after all PEs are in DONE state, the mainframe will set the state values of output and input switching nodes. 
Setting state values of output switching nodes and input switching nodes
The state values for output switching nodes 
After the state values of output switching nodes are determined, the state values of input switching nodes are determined too. According to the symmetric routing constraint, the state value of an input switching node is either equal or opposite to the state value of its corresponding output switching node, depending on the parities of the input and output port index numbers. Given a permutation pair (k, l), where k is the input port number and l is the output port number, due to the symmetric routing constraint, i.e., equations (4) and (5) in Section 3.1, we have:
$$a_{\lfloor k/2 \rfloor} = \begin{cases} b_{\lfloor l/2 \rfloor} & \text{if } k, l \text{ are same parity} \\ \bar{b}_{\lfloor l/2 \rfloor} & \text{if } k, l \text{ are opposite parity} \end{cases}$$
where $\lfloor k/2 \rfloor$ and $\lfloor l/2 \rfloor$ give the corresponding input/output switching node indexes. As we can see, either port[2i] or port[2i + 1] can be used to determine the relation between the state values of an input switching node and its corresponding output switching node. Here, we use port[2i] to do the calculation, and port[2i][0] gives the parity of the output port.
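The symmetric routing constraint above translates into one XOR per input switching node. The sketch below is illustrative only: it falls back to port 2i + 1 when port 2i is idle, which is an assumption rather than something stated in the text, and it is exercised on the 8 × 8 example of Section 3 under the assumption that input port 4 is the idle one, with output states (0, 0, 0, 1) obtained from the parity relations for that placement.

```python
def input_states(perm, b):
    """State of every input switching node from the output node states b.

    perm[s] is the output port requested by input port s, or None if idle.
    Same parity of input and output port -> equal states, otherwise opposite.
    """
    a = []
    for i in range(len(perm) // 2):
        for s in (2 * i, 2 * i + 1):       # prefer port 2i, fall back to port 2i + 1
            d = perm[s]
            if d is not None:
                a.append(b[d >> 1] ^ ((s ^ d) & 1))
                break
        else:
            a.append(0)                    # fully idle node: state chosen arbitrarily
    return a


# Assumed placement of the idle port (input 4) in the 8 x 8 example.
perm8 = [0, 3, 2, 6, None, 4, 7, 5]
print(input_states(perm8, [0, 0, 0, 1]))
```

With these assumptions the result is (0, 0, 1, 0), the input state vector quoted in Section 3.3.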
Permutation configuration for sub Benes network
After the state values of input/output switching nodes are set, the permutation for subnetworks will be determined.
Given the state values of the input/output switching nodes, the permutations for the two subnetworks sub0 and sub1 can be calculated by the control unit as below:
Figure 16 shows the timing diagram for the whole process of the example permutation. In this example, there are three searching steps, two consecutive ones and one after the merging step, which is consistent with Lee's algorithm. Each step needs two clock cycles to finish, because each step needs one more clock cycle to update the data in the shared memory. After the state values of input/output switching nodes are determined, one more clock cycle is needed to calculate the permutations for the two subnetworks. In total, 17 clock cycles are used to finish the whole process. Consistent with Lee's algorithm, only the number of searching steps depends on the radix of the Benes network during the whole process. All other steps take constant time.
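The derivation of the two subnetwork permutations can be sketched as follows. The sketch assumes the usual recursive wiring in which output r of input switching node i is connected to port i of subnetwork r, and symmetrically on the output side; the function name `sub_permutations` and the placement of the idle port in the 8 × 8 example are assumptions.

```python
def sub_permutations(perm, a):
    """Split an N x N request into the requests seen by sub0 (S_up) and sub1 (S_down).

    perm[s] is the output port requested by input port s, or None if idle;
    a[i] is the state of input switching node i.
    """
    half = len(perm) // 2
    subs = [[None] * half, [None] * half]
    for s, d in enumerate(perm):
        if d is None:
            continue
        bit = a[s >> 1] ^ (s & 1)          # routing bit selects the subnetwork
        subs[bit][s >> 1] = d >> 1         # node indexes become subnetwork ports
    return subs


perm8 = [0, 3, 2, 6, None, 4, 7, 5]        # idle port placement is an assumption
sub0, sub1 = sub_permutations(perm8, [0, 0, 1, 0])
print(sub0, sub1)
```

Both results are themselves (partial) permutations without repeated outputs, so the same procedure can be applied to each of them in the next round.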
Special case
Because of the simplicity of the 4 × 4 Benes network, there is no need to run the whole process. As shown in Figure 17, there are $P_4^4 = 24$ permutations between input and output ports, which fall into two cases:
1 the two output switching nodes are in the same class
2 the two output switching nodes are in two separate classes.
For both cases, there is no need to do the searching and merging procedure thus significantly reducing the logic complexity of each PE. Consider the permutation of the 4 × 4 Benes network shown in Figure 17 , we can derive below:
Consider the connection pair (i, j) there are two cases:
Case 1 Both i and j belong to the same output switching node; then k and l must belong to the other switching node. There is no connection between the two output switching nodes, i.e., they belong to two separate classes. The state value of each output switching node can be assigned independently.
Case 2 i and j belong to two different output switching nodes. Then there exists a connection between the two output switching nodes, i.e., they belong to the same class. Their state values must be assigned jointly, according to the relation between them.
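The two cases can be checked exhaustively in software. The sketch below sets all three stages of a 4 × 4 Benes network without any searching or merging and then routes every input through the resulting settings. It assumes the parity rule (same parity means not equal), the routing-bit convention in which an even port of a straight node has routing bit 0, and the recursive wiring in which output r of node i feeds port i of middle node r; the names `set_4x4` and `route_4x4` are illustrative.

```python
from itertools import permutations


def set_4x4(perm):
    """Return (a, m, b): states of the input, middle and output switching nodes."""
    b = [0, 0]
    k, l = perm[0], perm[1]                # connection pair of input node 0
    if k >> 1 != l >> 1:                   # Case 2: the output nodes are related
        b[1] = 1 - ((k ^ l) & 1)           # same parity -> not equal
    # Case 1 keeps b = [0, 0]: the two nodes are independent.
    a = [b[perm[2 * i] >> 1] ^ (perm[2 * i] & 1) for i in range(2)]
    m = [0, 0]
    for s in (0, 1):                       # both ports of input node 0
        bit = a[0] ^ s                     # middle node reached from port s
        m[bit] = perm[s] >> 1              # cross iff it must reach output node 1
    return a, m, b


def route_4x4(a, m, b):
    """Follow every input port through the three stages and list where it ends."""
    out = []
    for s in range(4):
        i = s >> 1
        r = a[i] ^ (s & 1)                 # middle node selected by input node i
        j = i ^ m[r]                       # output node selected by middle node r
        out.append(2 * j + (r ^ b[j]))     # output port selected by output node j
    return out


ok = all(route_4x4(*set_4x4(p)) == list(p) for p in permutations(range(4)))
print("all 24 permutations routed:", ok)
```

Only full permutations are covered here; partial permutations would additionally need the idle-port handling used for the larger networks.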
Algorithm pseudo codes
The pseudocode of the implemented parallel switch setting algorithm for N × N Benes network (N > 4) is listed below. All variables are defined in Table 2 . 
} } // Calculating permutation for subnetworks 
and
// Merging procedure
Experiment results
We have implemented Lee's algorithm for finding the switch settings for the input/output stages of 4 × 4 to 64 × 64 Benes networks in Verilog, and simulated and synthesised the designs using Cadence tools. The RTL code is written in a parameterised way so that it is easy to expand to larger sizes. In the simulation process, ModelSim is adopted as the simulation tool. For each design, five categories of permutations are used for validation: bit reversal, perfect shuffle, butterfly, matrix transpose, and random permutations. Under each category, one or more different permutations have been tested. The worst-case permutation causes all output switching nodes in the same group to be connected in order. As such, the algorithm needs to run logN steps to find the representative node. For each network size, one worst case is tested. In the synthesis process, Cadence Encounter RTL Compiler is used with the TSMC 65 nm technology library. Designs of all sizes are synthesised under the same settings. The synthesised results of timing, area in number of cells, and power consumption are presented below.
The timing delay is mainly decided by the time complexity of the algorithm. The size of the PE affects the timing delay much less than it affects area and power consumption, as shown in Table 3 and Figure 18. As discussed in Section 3.2, the time complexity of the algorithm is determined by the number of searching steps. The simulation results show that the number of searching steps follows O(logN). Except for the 4 × 4 network, the synthesised timing result has about the same trend as the time complexity of Lee's algorithm. For the 4 × 4 Benes network (explained in Section 4.5), there is no searching step in the switch setting algorithm. That is why its timing delay is much lower than that of the 8 × 8 Benes network. Table 4 and Figure 19 show the area result in terms of number of cells, the basic design unit used to measure the logic complexity.
When the network size is doubled, the number of cells increases by about four times, except for the 4 × 4 network. In Lee's algorithm, when the network size is doubled, the number of PEs needed in each stage is doubled. For example, the 8 × 8 Benes network has four PEs and the 16 × 16 Benes network has eight PEs. Besides, the logic complexity of each PE nearly doubles when the network size is doubled. Overall, the logic complexity of the circuit should therefore increase by about four times when the network size is doubled. This explains the trend of the number of cells in Table 4.
Table 4 Cell number and area
Table 5 shows the power consumption of the design in terms of static (internal) power, dynamic (mainly switching), net and leakage power. Each portion of power increases significantly as the radix of the Benes network increases. The increasing trend of power consumption is consistent with the increasing trend of the number of cells. As shown in Figure 20, the switching power is the most significant portion, followed by internal (static) and leakage power; they account for 36%, 28% and 27% of total power, respectively. Together, the three portions account for more than 90% of the power consumption.
Conclusions
This paper presents the RTL design of a parallel switch setting algorithm for Benes networks. We have refined the data structure and the initialisation/updating of relation values to make the algorithm suitable for hardware implementation. The RTL code is written in a parameterised way so that it is easy to expand to larger sizes. The RTL design of the switch setting circuit for 4 × 4 to 64 × 64 Benes networks is simulated and synthesised using Cadence tools. The simulation and synthesis results confirm that the timing, area, and power consumption of the circuit are consistent with the complexity of Lee's algorithm. Future work includes the integration of the switch setting circuit with the Benes network circuit.
