Abstract-The crossbar switch is a key component of the communication switches used in networks. The allocation of its resources has a direct impact on packet transmission: a poor allocation results in long transmission delays. Hence, switches must include an arbiter that efficiently allocates the crossbar's resources. In this brief, a novel VLSI CMOS implementation of a high-performance wrapped wave front arbiter (WWFA) for crossbar switches is described. Arbitration time is one of the critical factors that affect network performance. WWFA requires a two-dimensional arbitration that incorporates a rotating priority to provide fair arbitration. In this brief, we describe the design and implementation of this arbiter. The arbiter is capable of performing an arbitration in 1.15 ns using 0.5-µm technology for a 4 × 4 crossbar. We also include the description of the interface and crosspoint (CP) control circuitry, i.e., the request-acknowledge circuit and the CP controller circuit, respectively.
I. INTRODUCTION
Computer networks utilize routers to receive, forward, and deliver packets. To achieve high performance, a router that provides high bandwidth and low latency is required. A router can be considered as a collection of network interfaces, a connection fabric connecting those interfaces, and logic that determines how to route packets among those interfaces [8]. A crossbar switch may serve as the switching fabric to provide a nonblocking network configuration [5]. Crossbars are one of the key components in most communication switches used in today's networks.
In order to better utilize the available input buffer space, the input ports usually have multiqueue buffers, which have been shown to significantly increase network throughput [2], [7], [11], [12]. Each multiqueue input buffer is able to transmit through the crossbar the packet at the head of any of its queues. Each queue can be considered to be a user and each internal crossbar bus (input and output) to be a server (resource). Hence, the switch consists of $n^2$ users and $2n$ resources. Fig. 1 shows a user-resource perspective of an $n \times n$ crossbar switch matrix. An input-port bus ($IB_i$) intersects an output-port bus ($OB_j$) at crosspoint $CP_{i,j}$. Any IB can connect to an OB by closing the appropriate CP switch. A queue $q_{i,j}$ can transmit data to an output port $j$ by acquiring $IB_i$ and $OB_j$. These two resources are essential for a complete data path. This translates into a request to close the CP switch $CP_{i,j}$. Input- and output-bus conflicts occur when two or more queues request the same input bus or output bus, respectively. Only one user can be granted access to a resource. Hence, an arbiter needs to be employed to resolve these conflicts and assign resources to users. The arbitration scheme should produce high throughput by providing maximum user-resource utilization. To avoid port cross connections, the arbiter has to grant at most one CP per row and per column of the crossbar matrix. Since there are $n$ resource pairs, maximum throughput would require the maximum number ($n$) of connected pairs. Assuming that a CP, when requested, has a value of 1 ($CP_{i,j(\text{request})} = 1$), we express the arbiter input as
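$$0 \le \sum_{i=1}^{n} \sum_{j=1}^{n} CP_{i,j(\text{request})} \le n^2.$$

This implies that any number of CPs on the crossbar can be requested at the same time. Depending on the CP request pattern, the output port status, and the arbitration scheme, the number of granted CPs has the following range:

$$0 \le \sum_{i=1}^{n} \sum_{j=1}^{n} CP_{i,j(\text{grant})} \le n.$$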
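A request $CP_{i,j(\text{request})}$ can be granted if and only if the following condition is met:

$$\sum_{k \ne j} CP_{i,k(\text{grant})} + \sum_{k \ne i} CP_{k,j(\text{grant})} = 0.$$

This condition guarantees that no grant exists in the column and row where $CP_{i,j(\text{request})}$ is located. Thus, the maximum throughput can be obtained when

$$\sum_{i=1}^{n} \sum_{j=1}^{n} CP_{i,j(\text{grant})} = n.$$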
The arbiter can be decomposed into a group of arbiter cells with one arbiter cell associated with each CP. The arbiter considers the requests for each CP and depending upon the arbitration scheme, determines the ones to be granted.
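The row and column constraint on granted CPs can be illustrated with a short check. This is only a sketch: the function names and the representation of the grants as an n × n list of 0/1 values are assumptions made for illustration and are not part of the design.

```python
def is_valid_configuration(grants):
    """True if no row and no column of the grant matrix holds more than one grant."""
    n = len(grants)
    rows_ok = all(sum(row) <= 1 for row in grants)
    cols_ok = all(sum(grants[i][j] for i in range(n)) <= 1 for j in range(n))
    return rows_ok and cols_ok


def is_maximum_throughput(grants):
    """True if the configuration is valid and connects all n input/output pairs."""
    n = len(grants)
    return is_valid_configuration(grants) and sum(map(sum, grants)) == n


# Two grants in the same row violate the one-CP-per-row rule.
conflicting = [[1, 1], [0, 0]]
# One grant per row and per column: valid and maximal for n = 2.
permutation = [[0, 1], [1, 0]]
```

A configuration that passes the second check corresponds to a permutation of the output ports, which is the case in which all n input ports are serviced.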
The crossbar arbitration policy has a significant impact on the overall performance of the crossbar switch [10]. Tamir and Chi [10] have studied three arbiter schemes: the skewed two-step arbiter (STSA), the wave front arbiter (WFA), and the WWFA. These arbiters differ primarily in the manner in which priorities are assigned to individual arbiter cells. The throughput and average latency of each arbiter have been evaluated using static probabilistic analysis and event-driven simulations [10]. These evaluations have shown that WFA and WWFA achieve nearly the same performance as more complex schemes. Further, WWFA has been shown to be approximately twice as fast as the WFA. WFA and WWFA require a two-dimensional (2-D) priority, whereas most of today's encoded priority controllers [3] provide a one-dimensional priority. In this brief, we report a novel VLSI design of the WWFA arbiter. To our knowledge, this is the first study of the VLSI design of this arbiter.

Resource conflicts frequently occur in crossbars and are resolved by an arbiter. The primary step toward resolving conflicts is to assign priorities to the arbiter cells (CACs) in such a way that the arbitration results in maximum throughput from the crossbar switch. Maximum throughput is achieved when all the $n$ input ports to the crossbar are serviced. This can be accomplished only if the CPs servicing these input ports are on different rows and columns of the crossbar matrix.
The dashed lines (shown in Fig. 2) indicate the wrapped diagonals of the arbiter. All the CACs on a diagonal have the same priority. In this example, diagonal (2) has the highest priority, which is indicated by a darker dashed line. The following diagonals, 3, 4, and 1, have decreasing priorities. As explained earlier, two distinct conflicts can occur: an input-bus conflict (x direction) and an output-bus conflict (y direction), when two or more requests occur per input or output bus, respectively. Only one connection per input port and per output port can be established. This in turn requires the arbitration scheme to consider both the x and y directions. With the priority diagonals of the WWFA, the CACs associated with any diagonal have no conflicts, since they all lie on separate rows and columns.
The arbitration wave front begins with the n CACs that lie on the highest priority diagonal. It is important to point out that the priorities can be rotated every arbitration cycle to provide fairness [1]. In the next arbitration cycle, priority is given to the diagonal following the one which held it last; in our example in Fig. 2, the next diagonal would be 3.
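The wrapped diagonals and the rotation of priority can be sketched in a few lines. The numbering used here is an assumption inferred from the examples in the text (CACs (4,1) and (2,3) on diagonal 1, CAC (4,4) on diagonal 4): cell (i, j) is taken to lie on diagonal k when (i + j) mod n equals k mod n. The function names are hypothetical.

```python
def wrapped_diagonal(k, n):
    """Cells (row, column), 1-indexed, on wrapped diagonal k of an n x n arbiter.

    Assumed numbering: (i + j) mod n == k mod n.
    """
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(1, n + 1)
            if (i + j) % n == k % n]


def diagonal_order(priority, n):
    """Diagonals in decreasing priority, starting at `priority` and wrapping around."""
    return [(priority - 1 + step) % n + 1 for step in range(n)]


def next_priority(priority, n):
    """Highest-priority diagonal for the following arbitration cycle."""
    return priority % n + 1


print(wrapped_diagonal(1, 4))  # [(1, 4), (2, 3), (3, 2), (4, 1)]
print(diagonal_order(2, 4))    # [2, 3, 4, 1]
print(next_priority(2, 4))     # 3
```

Every diagonal produced this way contains exactly one cell per row and per column, which is why the CACs on a single diagonal never conflict with one another.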
A. Rules for WWFA
The fundamental rule of this arbitration scheme is that only one CAC per row and per column can be granted. For example, in Fig. 2, if CAC (1,1) receives a grant, then no other CAC in row 1 (CACs $(1, j)$) or column 1 (CACs $(i, 1)$) can be granted. If a CAC is not granted, then the priority is passed on to the CACs in its row and column, in the positive x and positive y direction, respectively. Thus, when the wave front moves to diagonal (3), CAC (2,1) gets a grant if neither CAC (1,1) nor CAC (2,4) does. Another rule is that a grant cannot be generated by a CAC unless the associated CP switch is requested and its output port is not busy. Summarizing these rules, a CAC can be granted if and only if the following three conditions are met:
1) Request: the associated CP switch is requested.
2) Priority: the CAC has the highest priority, or none of the CACs with higher priority in its corresponding row and column have been granted.
3) Output port: the output port associated with the CP is ready to receive data.
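Thus a grant can be expressed as

$$G = R \cdot OR \cdot \overline{X_{in}} \cdot \overline{Y_{in}}$$

where $\cdot$ denotes the logical AND. The priority is passed to the adjacent cells as follows:

$$X_{out} = X_{in} + G, \qquad Y_{out} = Y_{in} + G$$

where $+$ denotes the logical OR.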
Fig. 3 shows the WWFA rules in graphical form. Concentric circles indicate a requested CP, while shaded concentric circles indicate that the request has been granted. A "0" at $X_{in}$ or $Y_{in}$ indicates that this CAC has priority in the x or y direction, respectively. The grant output (G) is asserted if, and only if, there is a request (R), the output-port ready signal (OR) is "1", and both $X_{in}$ and $Y_{in}$ are "0". If the CAC grants the request, it pulls its $X_{out}$ and $Y_{out}$ signals to "1". This case is shown in Fig. (3a). In this figure, the arbitration cell's request input and its arbitration output(s) are given in the format input/outputs, where the input is either request or no request, while the outputs are grant or no grant and pass or block priority. No request for the CP results in passing the priorities to the following CACs (Fig. (3b)). If the CAC does not receive a grant, then it makes itself transparent, and the $X_{out}$ and $Y_{out}$ signals take the values of $X_{in}$ and $Y_{in}$, respectively. Two cases are shown in Fig. (3c) and (d). These cases are also shown in Table I.

Fig. 4 shows an example of a 4 × 4 crossbar WWFA. In this example, it has been assumed that all the output ports are ready and that diagonal (1) has the priority. It should be pointed out that before the start of the arbitration cycle, the x and y inputs of all the CACs except the ones on the highest priority diagonal are initialized to '1'. This means that the CACs on diagonals (2), (3), and (4) yield a grant only if the ones on the priority diagonal (1) pass the priority. If a particular CAC receives a grant, it does not disturb the x and y inputs of the neighboring cells. As a result, the priority is blocked. If a CAC does not receive a grant, then it passes on its x and y inputs.
In the example of Fig. 4, diagonal (1) has two requests, at CACs (4,1) and (2,3), which are granted. This in turn denies priority to any request in rows 4 and 2 and columns 1 and 3. Since diagonal (2) has no requests, the priority is passed to diagonal (3). This diagonal has three requests (CACs (4,3), (3,4), and (1,2)). CAC (4,3) does not receive priority in both directions and is therefore denied; however, CACs (3,4) and (1,2) get priority and are granted. This in turn denies priority to any request in the following diagonal in rows 3 and 1 and columns 4 and 2. Any request in diagonal 4 will be denied, since all the rows and columns already have a granted CP. Fig. 4 shows a configured arbiter at the end of the arbitration cycle.
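The arbitration cycle walked through above can be reproduced with a small behavioural model. This is a sketch under stated assumptions, not the circuit: it keeps one x value per row and one y value per column in place of the per-cell wave front signals, it assumes the diagonal numbering (i + j) mod n = k mod n implied by the example, and it assumes that the five requests named in the text are the only ones present. The function name `wwfa_arbitrate` is hypothetical.

```python
def wwfa_arbitrate(requests, output_ready, priority):
    """Behavioural model of one wrapped wave front arbitration cycle.

    requests[i][j]  -- truthy if CP (i+1, j+1) is requested (0-indexed lists)
    output_ready[j] -- truthy if output port j+1 is ready to receive data
    priority        -- number (1..n) of the highest-priority diagonal
    Returns the granted CPs as a set of 1-indexed (row, column) pairs.
    """
    n = len(requests)
    x = [0] * n  # x[i] == 0: row i still holds priority in the x direction
    y = [0] * n  # y[j] == 0: column j still holds priority in the y direction
    grants = set()
    for step in range(n):
        diagonal = (priority - 1 + step) % n + 1
        for i in range(n):
            j = (diagonal - i - 2) % n  # 0-indexed column of row i on this diagonal
            if requests[i][j] and output_ready[j] and x[i] == 0 and y[j] == 0:
                grants.add((i + 1, j + 1))
                x[i] = 1  # block priority for the rest of the row
                y[j] = 1  # block priority for the rest of the column
    return grants


# Requests of the Fig. 4 example: (4,1), (2,3), (4,3), (3,4) and (1,2).
requests = [[0, 1, 0, 0],
            [0, 0, 1, 0],
            [0, 0, 0, 1],
            [1, 0, 1, 0]]
granted = wwfa_arbitrate(requests, [1, 1, 1, 1], priority=1)
print(sorted(granted))  # [(1, 2), (2, 3), (3, 4), (4, 1)]
```

The model grants CACs (4,1), (2,3), (3,4), and (1,2) and denies CAC (4,3), in agreement with the walk-through. Because the cells of one diagonal never share a row or column, the order in which they are visited inside the inner loop does not affect the result.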
III. CAC DESIGN
Based on the WWFA scheme requirements, we have designed a CAC [4]. This cell should take into account priority in both the x and y directions. Fig. 5 shows the circuit diagram of cell $(i, j)$. The grant and the pass-priority (PPriority) blocks form the primary blocks of the CAC. There are two PPriority blocks, one per direction. The directional inputs ($X_{i,j}$ and $Y_{i,j}$) of all the cells, except the ones on the priority diagonal, are precharged to '1'. The directional inputs of the cells on the priority diagonal are discharged to '0'. Priority to a diagonal is indicated by $P_k$ = '1'. We have considered the worst case, in which the priority may change every clock cycle. The $P_k$ (= '1' or '0') and PC (= '0') signals arrive at approximately the same time and assign the appropriate values to $X_{i,j}$ and $Y_{i,j}$. This operation initializes the wavefront. The grant block determines the status of the grant signal $G_{i,j}$. As stated earlier, a CAC can receive a grant if and only if its associated CP is requested ($R_{i,j}$), it has priority in both the x and y directions ($X_{i,j}$ = '0' and $Y_{i,j}$ = '0'), and the output port serviced is ready to receive data ($OR_j$). This is expressed as

$$G_{i,j} = R_{i,j} \cdot OR_j \cdot \overline{X_{i,j}} \cdot \overline{Y_{i,j}}.$$

If the CAC $(i, j)$ lies on the priority diagonal and receives a request, its grant signal is asserted.
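The logical behaviour of a single CAC, as distinct from its transistor-level implementation, can be sketched as a small function. The function name and the use of 0/1 integers are assumptions for illustration; the precharge mechanism, the priority signal $P_k$, and all timing are deliberately left out.

```python
def cac(request, output_ready, x_in, y_in):
    """Logic-level model of one arbiter cell.

    x_in / y_in follow the convention of the text: 0 means the cell has
    priority in that direction, 1 means priority is blocked.
    Returns (grant, x_out, y_out).
    """
    grant = int(bool(request) and bool(output_ready) and x_in == 0 and y_in == 0)
    # A granting cell blocks priority (outputs forced to 1); otherwise the
    # cell is transparent and simply forwards its directional inputs.
    x_out = x_in | grant
    y_out = y_in | grant
    return grant, x_out, y_out


print(cac(1, 1, 0, 0))  # (1, 1, 1): request granted, priority blocked
print(cac(0, 1, 0, 0))  # (0, 0, 0): no request, priority passed on
print(cac(1, 1, 1, 0))  # (0, 1, 0): blocked in x, cell stays transparent
```

The three calls correspond to a granted request, an idle cell that passes priority in both directions, and a request that is denied because priority is blocked in the x direction.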
The time delay in generating a grant, once the inputs arrive, depends upon the position of the arbiter cell in the crossbar matrix. Fig. (6a) shows the signals associated with the grant block of cell (4,1). Cell (4,1) lies on the priority diagonal. Hence, inputs $X_{4,1}$ and $Y_{4,1}$ are both '0'. Also, one of the inputs of the NAND gate, $OR_1$, is high. If the CP is requested, it will definitely receive a grant. The simulation shows that when $R_{4,1}$ goes high, $G_{4,1}$ also goes high after a certain delay. This delay is the sum of the delays of the NAND and NOR gates. The total delay, from the moment the clock (CLK) arrives to the time $G_{4,1}$ goes high, is denoted by $t_{\text{grant-pd}}$, where "-pd" indicates that the cell lies on the priority diagonal. Simulations show that $t_{\text{grant-pd}} = 0.67$ ns. Fig. (6b) displays the signals associated with cell (4,4). This cell is farthest from the priority diagonal. None of the cells in row 4 and column 4, except cell (4,4), are awarded grants. This implies that cell (4,1) passes the priority in the x direction to cell (4,4), and cell (1,4) passes its priority to cell (4,4) in the y direction. The cell receives the priority when signals $X_{4,4}$ and $Y_{4,4}$ go to "0". It should be noted that the request and output-ready signals to all the cells are asserted at the same time, which means that by the time $X_{4,4}$ and $Y_{4,4}$ arrive at the input of the NOR gate, the NAND gate output is already set to high. Hence, the delay to assert $G_{4,4}$ after the arrival of $X_{4,4}$ and $Y_{4,4}$ is the delay of the NOR gate only. This delay is denoted by $t_{\text{NOR}}$ and is equal to 0.22 ns. The total time taken to generate a grant is denoted by $t_{\text{grant-4}}$, where "-4" indicates that cell (4,4) lies on diagonal 4. This delay can be seen to be equal to 1.15 ns.
The primary function of the PPriority block is to pass priority to the following CAC. Both the PPriority-x and PPriority-y blocks perform similar operations; for simplicity, only the former is explained here. Since the directional inputs of the cells are precharged to '1', a priority is passed by discharging them. A priority is passed only if $X_{i,j}$ is '0', the CAC does not receive a grant ($G_{i,j}$ = '0'), and the following cell does not lie on the priority diagonal ($P_{k+1} \neq$ '1'). Once these conditions are met, the output of the NOR gate goes to '1', discharging (passing the priority to) $X_{i,j+1}$. The time taken by the arbiter cell to pass the priority to the following cell is an essential component in determining the total arbitration time, since this is a part of the critical path.
Both the grant and PPriority-x blocks require the value of signal $X_{i,j}$. However, the PPriority-x block needs the grant block signal before passing or blocking the input priority. If $X_{i,j}$ were fed to the grant and PPriority-x blocks at the same time, the PPriority-x block would see that the cell is not yet granted and would pass the priority to CAC $(i, j+1)$. Even if the output of the grant block then turned out to be affirmative, the priority would already have been passed to CAC $(i, j+1)$. In the worst case, all the CACs in the associated row and column may generate a grant, resulting in a violation of the arbitration rules. Thus, the input $X_{i,j}$ is delayed through NAND gate 1 by an amount equal to the delay of the grant block.
In case an accidental discharge of $X_{i,j+1}$ does occur, its correct value is restored through the p-type transistor in the PPriority-x block. Here also, the signal $X_{i,j+1}$ is disturbed only if it does not lie on the highest priority diagonal. The arbitration cycle is completed when the wavefront reaches the diagonal with the lowest priority, generating all possible grants. The maximum arbitration time is observed for the configuration in which a grant signal is to be asserted for a CAC on the diagonal farthest from the priority diagonal. Fig. 7 shows a simulation of the arbiter to determine the maximum arbitration time. This figure shows the wavefront signals as seen in the CACs of row 4. CAC (4,1) passes the priority to CAC (4,2) in time $t_{\text{pass-pd}}$. This is the delay taken by a CAC lying on the priority diagonal to pass the priority after the rising edge of the clock. The priority moves further through CAC (4,2) and CAC (4,3), each with a delay of $t_{\text{pass-npd}}$, where $t_{\text{pass-npd}}$ represents the time taken by a CAC not lying on the priority diagonal to pass the priority to the following cell. The same occurs simultaneously in the y direction from CAC (1,4) to CAC (3,4). Once the priorities are received by CAC (4,4), the grant signal $G_{4,4}$ is asserted after a delay of $t_{\text{NOR}}$ (refer to Fig. (6b)).
From this information, we can generalize the grant time $t_{\text{grant}(n)}$ for an $n \times n$ arbiter as

$$t_{\text{grant}(n)} = t_{\text{pass-pd}} + d \cdot t_{\text{pass-npd}} + t_{\text{NOR}}$$

where $d$ is the number of diagonals (or CACs) between the lowest and the highest priority diagonal. It should be noted that the magnitudes of $t_{\text{pass-pd}}$, $t_{\text{pass-npd}}$, and $t_{\text{NOR}}$ depend on technology and implementation. From our SPICE simulations, using a 0.5-µm technology, the values of these delays are $t_{\text{pass-pd}} = 0.47$ ns, $t_{\text{pass-npd}} = 0.23$ ns, and $t_{\text{NOR}} = 0.22$ ns. From this data, the total arbitration time $t_{\text{grant}(4)}$ of a 4 × 4 arbiter can be calculated as follows:

$$t_{\text{grant}(4)} = t_{\text{pass-pd}} + 2 \cdot t_{\text{pass-npd}} + t_{\text{NOR}} = 0.47 + 2(0.23) + 0.22 = 1.15\ \text{ns}.$$

Fig. 8 shows the layout of the arbiter cell CAC $(i, j)$. In comparison, the WFA designed by [10] (which has a relatively simpler scheme) had a worst case arbitration time of 15.5 ns for 2-µm technology.
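The generalized grant time can be evaluated numerically as a quick sanity check. In this sketch the function name is hypothetical, the default delays are the simulated values reported above, and taking d = n - 2 is an inference from the definition of d (the diagonals lying strictly between the highest and the lowest priority diagonal), which matches d = 2 for the 4 × 4 case.

```python
def grant_time_ns(n, t_pass_pd=0.47, t_pass_npd=0.23, t_nor=0.22):
    """Worst-case grant time (ns) of an n x n arbiter from the delay model above.

    Assumes d = n - 2 intermediate diagonals; the default delays are only
    meaningful for the technology and implementation reported in the text.
    """
    d = n - 2
    return t_pass_pd + d * t_pass_npd + t_nor


print(f"{grant_time_ns(4):.2f} ns")  # 1.15 ns
```

Under this model each additional row and column adds one $t_{\text{pass-npd}}$ to the critical path, so the worst-case arbitration time grows linearly with n. Extrapolating the 4 × 4 delays to larger arbiters is only indicative, since wire loading is not part of the formula.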
A more recent implementation of a 2-D priority arbiter is presented in [6]. Hurt et al. designed a rotating priority arbiter using 0.35-µm technology. The approximate delay for a 4 × 4 arbiter, operating at a 3.3-V supply, is 2.5 ns.
IV. REQUEST-ACKNOWLEDGE CIRCUIT
Each input port of the crossbar switch must be able to transfer data to an output port by requesting the service of the corresponding CP.
For a configuration having $n$ output ports, $n$ CP switches are necessary per input channel. Hence, each input port will require $n$ request lines. Consequently, an $n \times n$ matrix will result in $n^2$ request lines. The input port places data on the data bus after it receives the crossbar configuration. This information is handed to the input buffers by the CP cells with the help of acknowledge signals. Thus, an additional $n^2$ lines are required for acknowledge signals, resulting in a line count of $2n^2$ for handshaking signals only. In this design, however, the request, acknowledge, and data signals are multiplexed over the same data lines, reducing the line overhead to $n$. It should be pointed out that having dedicated request, acknowledge, and data lines does not alter the functionality of the crossbar switch.
The input port sends a request signal for a required CP in the first half of the clock cycle. If granted, an acknowledge signal is sent to the input port in the second half of the clock cycle. Once the acknowledgment is received, the input-port buffer places the data on the data bus in the following clock cycle. This process is illustrated in Fig. 9. The request-acknowledge block serves as an interface between the arbiter cell and the input port. It generates the request signal $R_{i,j}$ for the arbiter cell and the acknowledge signal $A_{i,j}$ for the input port.
The schematic in Fig. 10 shows the manner in which the request-acknowledge block is placed between the data bus and the arbiter cell. The circuit diagram of one of the blocks is also shown. $Req_i$ is pulled low by the input port to indicate that the input buffer has data destined for the output ports. Data lines $D_{i,1}$ to $D_{i,4}$ serve as request signals for CACs $(i, 1)$ to $(i, 4)$, respectively. Signal $R_{i,1}$ is asserted high if and only if CLK = 1, $Req_i$ = 0, and $D_{i,1}$ = 1. This indicates that the CP has been requested by the input port. After the arbitration stage, the configuration of the crossbar is known. The availability of all the CPs that received a grant is communicated to the input ports by discharging the corresponding data line. Once the communication between the arbiter and the input port is complete, the data bus is used only to transfer data. Our multiplexed scheme does not require dedicated request lines.
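The request condition stated above can be written as a one-line predicate. The sketch below models only the logic of the request signal, not the circuit of Fig. 10 or the acknowledge path; the function names and the 0/1 encoding are assumptions for illustration.

```python
def request_signal(clk, req_n, d):
    """R_{i,j}: asserted iff CLK = 1, Req_i = 0 (active low) and D_{i,j} = 1."""
    return int(clk == 1 and req_n == 0 and d == 1)


def row_requests(clk, req_n, data_lines):
    """Request signals for all CACs of one input port from its shared data lines."""
    return [request_signal(clk, req_n, d) for d in data_lines]


# Input port i asks for output ports 1 and 4 during the request phase.
print(row_requests(clk=1, req_n=0, data_lines=[1, 0, 0, 1]))  # [1, 0, 0, 1]
# With Req_i high the same data lines carry data and raise no requests.
print(row_requests(clk=1, req_n=1, data_lines=[1, 0, 0, 1]))  # [0, 0, 0, 0]
```

Gating the data lines with $Req_i$ and CLK is what allows the same wires to carry requests during one phase and data afterwards.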
V. CP CONTROLLER
The arbitration cycle configures the switch matrix. This is done by generating grant signals for the appropriate CPs. These grant signals are used to close the CP switches and connect the input ports to the output ports. A CP switch has to remain closed as long as the input port is transferring data to the output port. These functions are executed by the CP Controller. The process of a successful transmission of data from input port to output port requires: a request, a data capture and a release of the bus. This process is illustrated in Fig. 11 .
To transmit data to an output port $j$, the input port $i$ first sends a request to the appropriate CP $(i, j)$. After arbitration, if the request is granted, an acknowledge signal is sent to the input port. This is called the request stage. In the following clock cycle, the CP switch enters the capture stage, where the switch is closed, connecting the input port to the output port. The $Req_i$ line goes high. The switch is kept closed for the number of clock cycles required to transmit the data. Once the data transfer is complete, the input port pulls the $Req_i$ line low. The data bus stops the data transfer. The CP switch is opened, disconnecting the input port $i$ from the output port $j$. The arbiter cell re-enters arbitration. This is the release event. A request stage may immediately follow the release event if the output port is requested through another CP in the same column. It should be noted that during the capture stage the $OR_j$ signal is disabled, preventing all the CPs on column $j$ from being granted.

State $Q_A$. In this state, the controller waits for the CP to receive a grant. If signal $G_{i,j}$ is '0', the CP has not received a grant. As a result, the input port $i$ should not be connected to output port $j$. Hence, the controller sets signal $CP_{i,j}$ to "0". On the other hand, if $G_{i,j}$ becomes "1", the CP has received a grant and the controller moves to state $Q_B$ at the rising edge of the following clock cycle.
State $Q_B$. When the controller enters state $Q_B$, it checks the status of signal $Req_i$. As seen in Fig. 11, this signal is high. Hence, the CP switch is closed by setting $CP_{i,j}$ to "1". The input port $i$ is thus connected to output port $j$. This connection is maintained until $Req_i$ goes low again. The output port is said to be captured by the input port. $Req_i$ = '0' indicates that the data transfer is complete. The CP switch is opened, isolating the output port from the input port. The controller returns to state $Q_A$ and waits for the grant signal. Fig. (13a) shows the circuit for the CP controller and Fig. (13b) gives the circuit for the control of the $OR_j$ line.
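The two controller states can be summarized as a small state machine. This is a behavioural sketch only: the class name is hypothetical, each call to `step` stands for one rising clock edge with the grant and $Req_i$ values sampled at that edge (an assumption about timing that the text does not spell out), and the control of the output-ready line is not modelled.

```python
class CPController:
    """Two-state model of the CP controller (states QA and QB)."""

    def __init__(self):
        self.state = "QA"  # QA: waiting for a grant, CP switch open

    def step(self, grant, req):
        """Advance one rising clock edge and return the CP switch value (0 or 1)."""
        if self.state == "QA" and grant:
            self.state = "QB"  # grant received: capture the output port
        elif self.state == "QB" and not req:
            self.state = "QA"  # Req_i low again: transfer complete, release
        # The switch is closed only while in QB with Req_i held high.
        return int(self.state == "QB" and bool(req))


controller = CPController()
# (grant, Req_i) per clock edge: idle, granted, two data cycles, release.
trace = [controller.step(g, r) for g, r in [(0, 0), (1, 1), (0, 1), (0, 1), (0, 0)]]
print(trace)             # [0, 1, 1, 1, 0]
print(controller.state)  # QA
```

The trace shows the switch staying closed for as long as $Req_i$ remains high and opening on the release event, after which the controller waits for the next grant.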
VI. CONCLUDING REMARKS
In this brief, we have presented a novel crossbar switch VLSI design that uses a WWFA scheme. The crossbar switch fulfills the 2-D arbitration requirements of a wrapped wave front arbiter. It provides flexible priority setting by means of a rotating priority, which is used to provide fair arbitration and to avoid packet starvation. Our design achieves high performance by having minimal circuitry in the critical path [9], [13]. The performance of the crossbar switch depends largely upon the arbiter's speed. Using a 0.5-µm technology and a 4 × 4 crossbar switch, the arbitration scheme produces a valid configuration with a latency of only 1.15 ns. In comparison, the WFA designed by [10] (which has a relatively simpler scheme) had a worst case arbitration time of 15.5 ns for 2-µm technology. A description of the other components of our crossbar, namely the request-acknowledge circuit and the CP controller circuit, has also been included.
