Abstract - A neural network-based controller is presented for the real-time arbitration of routing paths in large crossbar switches constructed from one-sided crosspoint chips. This controller is suitable for a synchronous environment where a number of connection requests are simultaneously presented to the switch. The controller aims to maximize the effective bandwidth of the switch and to minimize the simultaneous-switching noise in the individual chips. The controller uses multiple winner-take-all networks coupled with some competitive-cooperative mechanisms to achieve the joint optimization. The effects of various network parameters are studied through simulation, and cases leading to nonoptimal solutions are analyzed. The results show that the arbitration complexity and time scale well with the size of the switches, and that the throughput achieved is close to the theoretical maximum. We also introduce a hierarchical neural network controller for a packet-switched environment where connections are established and broken asynchronously. This controller provides almost the same level of performance as the first one, but with significantly reduced computation for each connection request.
I. INTRODUCTION
CROSSBAR switches are used extensively as multiprocessor interconnection networks and for communication switching. Although small crosspoint arrays are easy to implement on a single chip, several problems arise when the size is increased. Current VLSI and packaging technologies are capable of meeting the circuit-density and the pin-count requirements of large crosspoint arrays. For example, the single-chip implementation of a 72 × 72-line crossbar has been reported [20], and an array of size 256 × 256 is considered feasible using multiple chips [3]. The primary limiting factor to the cost-effective single-chip realization of such large arrays now is the inductive noise generated by the line-drivers driving the output leads of the chip. When many of these drivers are active simultaneously, a substantial transient current passes through the inductance of the power distribution system, causing a noise spike to emerge on the power lines. This phenomenon is known as Delta-I noise, or simultaneous-switching noise. While the Delta-I noise problem is present in all VLSI chips with a large number of output leads, it is particularly severe in large crossbar switching chips because they have a large number of data output lines. This problem can be alleviated by constructing crossbar networks using one-sided crosspoint switching chips [6], [28]. As explained in Section II, these networks allow a pair of ports to be connected using one of many available internal switching paths. By choosing the switching paths properly, it is possible to distribute the active off-chip drivers uniformly over the chip matrix, thus reducing the Delta-I noise in each chip to acceptable levels [29].
In a nonblocking interconnection network, any request to connect two idle ports can be satisfied without disturbing existing connections. For a one-sided crossbar with a limit on the allowable number of concurrently active line-drivers per chip, it is possible that a connection request is rejected even when a switching path is available to make the desired connection. This reduces the effective overall network bandwidth. Situations in which this can happen are detailed in [8]. Thus smart path allocation policies are needed to maximize the number of requests that are fulfilled simultaneously for a given sequence of connection/disconnection requests. In summary, a joint optimization of network throughput and noise reduction is needed. Moreover, the time taken to reconfigure the network should be much smaller than the average duration of a connection. This motivates the use of a highly parallel artificial neural network controller for path allocation and arbitration.
Hopfield-Amari networks and their variations have been widely used to solve combinatorial optimization problems in which the cost function can be expressed as a quadratic Lyapunov "energy function" [7].
Recently, some researchers have studied applications of artificial neural networks to control problems in switching and communications [2], [12], [19]. The continuous Hopfield neural net model (also known as the Grossberg additive model [10]) has been used to design a controller for a packet-switched two-sided crossbar switch [19]. The authors assume that packets are sent in batches. For each batch, the packet destinations are specified by a request matrix. The controller attempts to find a configuration matrix that has the maximum possible overlap with the request matrix, so as to maximize the throughput. The entire crossbar is reconfigured after each set of packet transmissions. Under these assumptions, the authors found the average throughput efficiency to be over 98% for an 8 × 8 crossbar.
Brown used "winner-take-all networks" (WTA's) [5] as the basic control mechanism for two-sided crossbar switches, which are used as subcomponents in Clos networks and other multistage networks [2]. He introduced neural cell pairs with different time constants as a way of marking switches that have been set more recently. With this mechanism, the sequential algorithms for finding routing paths in multistage networks could be emulated in a macroscopically sequential manner by a neural network controller.
In this paper, we propose and evaluate neural network controllers for setting switches in large one-sided crossbar networks constructed from multiple chips. The controller problem here is much more involved than that for two-sided crossbars, since there are multiple choices for the path selected for making a connection, and because of the need to spread the switching activity among all the chips. The paper is organized as follows: In Section II, we introduce one-sided crosspoint networks that provide added flexibility over conventional two-sided crossbar switches because of the presence of multiple paths between any source-destination pair. The parameters of the switching network are defined, and different ways of receiving and processing connection requests are outlined. Section III explains the controller architecture based on WTA's. In Section IV, experimental results for determining the effectiveness of the neural network controller are described, and cases that yield invalid solutions are analyzed. Section V presents an alternative controller that uses hierarchical neural net control mechanisms for processing asynchronous connection requests. Concluding remarks are given in Section VI.
II. ONE-SIDED CROSSPOINT NETWORKS
A conventional (two-sided) crossbar network consists of a set of input lines and a set of output lines, conceptually placed perpendicular to each other, with switches placed at each point where the lines cross. Large crossbars are implemented by partitioning the switching matrix into smaller rectangular blocks and assigning each block to a chip. Fig. 1 illustrates a two-sided crosspoint matrix with 16 inputs and outputs, constructed from sixteen 4 × 4 switching chips. Note that there is a unique path between a pair of input and output ports. This restriction can lead to an uneven distribution of the active drivers among the switching chips. For example, if the inputs to the first row of chips are connected to the output lines in the last column of chips, then all the drivers in the top-rightmost chip are used (see Fig. 1), but none of the drivers in the other chips of the first row or last column are active. Thus the chips in the network have to be designed to work reliably even in the event that all their line-drivers become active simultaneously.
An alternate way of designing crossbar networks is by means of one-sided crosspoint switching chips. A one-sided crosspoint matrix consists of a set of port-lines and a set of bus-lines placed perpendicular to each other, with switching elements placed at the points of intersection. A connection between two port-lines can be established through any of the internal (column) busses. In the full-duplex implementation, each port-line is actually two wires, one for communication in each direction. Similarly, each bus-line actually consists of two wires. The architecture and implementation details of one-sided crossbar networks are given in [6] and [8]. Fig. 2 shows a one-sided network with N = 32 input/output ports and 16 internal busses. The switch matrix is constructed out of 16 crosspoint chips, each of size 8 × 4. Note that each horizontal line in Fig. 2 represents a full-duplex channel and each vertical bus a pair of bidirectional lines in a full-duplex implementation. A crucial feature of the one-sided network is the presence of multiple paths connecting any two ports. Any of the available column busses can be used to make the desired connection. This flexibility can be exploited by a smart bus arbitrator to distribute the active line-drivers more uniformly among the switching chips than is possible in a conventional two-sided design.
A connection between a source port and a destination port in the one-sided switch is established by locating an unused column-bus (internal bus) and then turning on the two crosspoints where the source and destination rows intersect with the selected bus. If these two crosspoints are within the same chip, no column drivers need be activated. Otherwise the signal needs to be routed off-chip. This requires the activation of a column driver in the source-chip as well as the destination-chip to provide a full-duplex path. Fig. 3 shows the active drivers for the two cases. In case (a), a row-driver is activated for each active port and no column-drivers are activated. This is referred to as an internal connection. In case (b), a row-driver and a column-driver are activated for every active port. This is referred to as an external connection. Since any unused column bus can be selected for making a connection between two ports, the choice can be made to optimize various criteria. In particular, the connecting busses can be chosen so as to minimize the maximum number of drivers active in any chip at that instant.
In this paper, we consider one-sided crosspoint matrices with $N$ ports and $M$ internal busses, constructed from $r$ rows and $c$ columns of $n \times m$ chips. Thus $N = r \times n$ and $M = c \times m$. We use chip$(i, j)$, $0 \le i \le r - 1$, $0 \le j \le c - 1$, to denote chip addresses, and switch$(i, j)$, $0 \le i \le N - 1$, $0 \le j \le M - 1$, to label the switch crosspoints. If the type is evident from the context, only the indices $(i, j)$ are given. When all the ports are in use and all connections are external, the maximum number of off-chip column-drivers that are simultaneously active on any chip is at least $\lceil n/c \rceil$. This bound is attained if these drivers are spread evenly. Let $d$ be the maximum allowable number of active column-drivers per chip. This corresponds to an actual limit of $2d$ active drivers per chip, considering both row- and column-drivers. For nonblocking operation, the pertinent range of $d$ is from $\lceil n/c \rceil$ to $m$. As $d$ is made smaller, the difficulty of allocating paths in a given network increases, thereby reducing its throughput.
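The relations among the network parameters can be made concrete with a short sketch. The function names below are illustrative and not part of the original design; the example values are those of the Fig. 2 network (4 rows and 4 columns of 8 × 4 chips).

```python
from math import ceil


def network_size(r, c, n, m):
    """Return (N, M): ports and internal busses for r x c chips of size n x m."""
    return r * n, c * m


def min_driver_limit(n, c):
    """Lower bound ceil(n/c) on the busiest chip's active column-drivers
    when all ports are in use and every connection is external."""
    return ceil(n / c)


def pertinent_d_range(n, c, m):
    """Values of the per-chip limit d that are of interest: ceil(n/c) .. m."""
    return range(min_driver_limit(n, c), m + 1)


if __name__ == "__main__":
    N, M = network_size(r=4, c=4, n=8, m=4)
    print("ports:", N, "busses:", M)
    print("d values:", list(pertinent_d_range(n=8, c=4, m=4)))
```

For this configuration the sketch gives 32 ports, 16 busses, and a limit $d$ ranging from 2 to 4 column-drivers per chip.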
There are two modes of making connection requests to the controller. In a synchronous mode or batch mode, all crosspoints are deactivated initially. When a set of requests is made, the controller tries to simultaneously set up as many pairs of connections as possible within the given constraints. After the desired communication is over, a new set of requests is presented, and the entire network is reconfigured to cater to this new batch. Such batch requests are encountered in a multiprocessor system where the processors operate synchronously and submit their requests to the network simultaneously. An example of such a system is the IBM GF-11 supercomputer designed for efficient execution of quantum chromodynamics and related problems [1].
Alternatively, the switch can be used in incremental mode, wherein connection or disconnection requests from the ports are made asynchronously and independently. This is a more accurate model for many real situations such as requests for calls in a telecommunication system.
For the batch mode, the input to the controller is an $N \times N$ binary, symmetric request matrix, $R$. $R(i, j)$ is 1 iff a (bidirectional) connection is desired between ports $i$ and $j$, and zero otherwise. For a given request matrix $R$, the maximum number of port pairs that can be connected is given by the cardinality of a maximum matching of the graph for which $R$ is an adjacency matrix. Let this be $C_{\max}(R)$. Let the actual number of connections made by the controller be $C_{\mathrm{batch}}(R)$. Then, the instantaneous throughput efficiency, $\eta$, is

$$\eta = \frac{C_{\mathrm{batch}}(R)}{C_{\max}(R)} \times 100\%.$$

If we neglect the overheads in reconfiguring the switches, the average efficiency is obtained by finding the average of the instantaneous efficiency over the appropriate time period. This is one of the measures used to evaluate the quality of the assignments made by the controller.
The maximum load on the network occurs when all $N$ ports are active. This is possible only if the connection requests have no conflicts. The corresponding request matrix has a maximum matching of size $N/2$. For a given request matrix $R$, we define the offered network load as

$$\text{network load} = \frac{C_{\max}(R)}{N/2} \times 100\%.$$
In general, the difficulty of finding a bus-allocation satisfying the $d$-constraint increases with the offered network load. To impose maximum severity, the network load was set at 100% in our simulations unless mentioned otherwise.
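As an illustration of these two measures, the following sketch computes the size of a maximum matching by exhaustive search and derives the efficiency and the offered load from it. It assumes that both quantities are ratios expressed as percentages; the function names are hypothetical, and the brute-force search is only practical for small request matrices (a polynomial-time matching algorithm would be used for realistic sizes).

```python
def max_matching_size(R):
    """Size of a maximum matching of the graph whose adjacency matrix is R.

    Exhaustive search over the free ports; exponential time, small N only.
    """
    def best(free):
        if not free:
            return 0
        i, rest = free[0], free[1:]
        # Option 1: leave port i unconnected.
        result = best(rest)
        # Option 2: connect port i to any free port j that it requests.
        for k, j in enumerate(rest):
            if R[i][j]:
                result = max(result, 1 + best(rest[:k] + rest[k + 1:]))
        return result

    return best(tuple(range(len(R))))


def throughput_efficiency(R, c_batch):
    """Connections actually made, as a percentage of the maximum possible."""
    c_max = max_matching_size(R)
    return 100.0 * c_batch / c_max if c_max else 100.0


def offered_load(R):
    """Maximum matching size as a percentage of N/2."""
    return 100.0 * max_matching_size(R) / (len(R) / 2)


if __name__ == "__main__":
    # Four ports with requests 0-1, 1-2 and 2-3 (symmetric matrix).
    R = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
    print(max_matching_size(R), offered_load(R), throughput_efficiency(R, 1))
```

In the example, ports 0-1 and 2-3 can be connected simultaneously, so the offered load is 100%; a controller that sets up only one of the two connections would score an efficiency of 50%.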
The controller described in the next section maintains the number of active drivers in every chip within a given limit while allocating the internal busses to connection requests. This constrained allocation can reduce the throughput efficiency. However, as shown in the next section, such degradation is very small in the examples studied.
III. NEURAL NETWORK CONTROLLER ARCHITECTURE

3.1. Hopfield Nets and Winner-Take-All Mechanisms

Before detailing the controller architecture, we briefly describe two well-known neural network structures that will be used by the controller. The continuous Hopfield model [14] consists of an ensemble of neurons or cells, with cell $i$ being connected to cell $j$ through a "synapse" with transconductance strength $T_{i,j}$. The input activation, $u_i$, of cell $i$ is governed by

$$C \frac{du_i}{dt} = \sum_j T_{i,j} V_j - \frac{u_i}{R} + I_i \qquad (1)$$

where $V_j = g(u_j)$ is the output of cell $j$, and $I_i$ is an external current (bias) to cell $i$. $C$ and $R$ are analogous to the capacitance and resistance, respectively, and determine the time constant for evolution of cell activity. If the $T_{i,j}$ matrix is symmetric and the activation function $g(x)$ is sigmoidal, then it can be shown that there is a Lyapunov energy or cost function that decreases monotonically to a (local) minimum as the cells evolve according to (1). This has prompted the use of the continuous Hopfield network for solving combinatorial optimization problems [13], [11].
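A minimal numerical sketch of the continuous Hopfield dynamics is given below, assuming the standard additive form in which $C\,du_i/dt$ equals the weighted sum of the other cells' outputs, minus the leakage $u_i/R$, plus the bias. The forward-Euler integration, the logistic activation with gain 4, and the two-cell example are illustrative choices and are not taken from the paper.

```python
import math


def logistic(x, gain=4.0):
    """Sigmoidal activation g(x); the gain value is an illustrative choice."""
    return 1.0 / (1.0 + math.exp(-gain * x))


def hopfield_euler(u, T, I, dt=0.01, steps=2000, C=1.0, R=1.0):
    """Integrate C du_i/dt = sum_j T[i][j] V_j - u_i/R + I_i by forward Euler.

    All cells are updated synchronously from the outputs of the previous step.
    Returns the final activations and the corresponding outputs.
    """
    u = list(u)
    size = len(u)
    for _ in range(steps):
        V = [logistic(x) for x in u]
        du = [(sum(T[i][j] * V[j] for j in range(size)) - u[i] / R + I[i]) / C
              for i in range(size)]
        u = [x + dt * d for x, d in zip(u, du)]
    return u, [logistic(x) for x in u]


if __name__ == "__main__":
    # Two mutually inhibiting cells with a symmetric T matrix and equal bias.
    T = [[0.0, -2.0], [-2.0, 0.0]]
    I = [1.0, 1.0]
    u, V = hopfield_euler([0.1, -0.1], T, I)
    print([round(v, 3) for v in V])
```

With a symmetric, mutually inhibitory $T$ the small initial difference between the two cells is amplified, and the network settles with the first cell's output close to 1 and the second close to 0.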
A WTA network [5], [9] is a connectionist mechanism that employs competitive learning [26] to identify the most active cell from among a group of $N$ cells. Each cell competes with the others by sending positive feedback to itself and negative feedback (lateral inhibition) to all other cells, as shown in Fig. 4(a). If the thresholds and weights are chosen correctly, then eventually only one cell, the winner, remains active while the outputs of all other cells decay to some quiescent value. The WTA circuit can be transformed into the one shown in Fig. 4(b), where the net outputs of all cells are collected and transmitted as a common inhibition signal to the cells. The resultant circuit needs only $O(N)$ instead of $O(N^2)$ interconnects, and can be efficiently fabricated using CMOS integrated circuits [16]. Moreover, the circuit can be readily extended to a "K-WTA" network in which the $K$ most active cells are selected [MI. We extend the schematic representation introduced by Brown [2] by denoting a K-WTA as in Fig. 4(c).
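The effect of a common inhibition signal can be sketched with a MAXNET-style discrete-time iteration. This is a swapped-in textbook scheme used purely for illustration, not the circuit of Fig. 4; the inhibition strength `epsilon` and the function name are assumptions. Each cell needs only its own value and the shared total, which mirrors the $O(N)$ interconnect structure.

```python
def winner_take_all(activations, epsilon=None, max_iters=10_000):
    """MAXNET-style iteration: each cell is inhibited by the sum of the others.

    Assumes a unique largest activation. Returns the final outputs, in which
    only the winning cell is still positive.
    """
    x = [max(0.0, a) for a in activations]
    n = len(x)
    if epsilon is None:
        epsilon = 1.0 / (2 * n)  # illustrative choice, kept below 1/n
    for _ in range(max_iters):
        if sum(1 for v in x if v > 0.0) <= 1:
            break
        total = sum(x)  # common inhibition signal shared by all cells
        x = [max(0.0, v - epsilon * (total - v)) for v in x]
    return x


if __name__ == "__main__":
    out = winner_take_all([0.2, 0.5, 0.45, 0.1])
    print([i for i, v in enumerate(out) if v > 0.0])
```

While two cells remain active, the gap between them grows by a factor of $1 + \epsilon$ per step, so the weaker cell is eventually driven to zero and only the initially most active cell survives.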
3.2. Controller for Batch Mode
The controller architecture shown in [...]

In the above equations, $d$ is the Delta-I noise limit for the chip, and $U$ is the unit step function. The evolution of the switch cells is simulated in discrete time steps, using the update equation

$$u_{i,j}^{t+1} = u_{i,j}^{t} + (\delta t)\,(-t_1 - t_2 - t_3 + t_4) \qquad (4)$$

where

i) $t_1 = A$ [...]

ii) $t_2 = B \times g_{\mathrm{chip}}(u_{i \bmod n,\, j \bmod m})$. This denotes the inhi[...] cells [14].

The output of switch cell [...] This choice led to better results than commonly used sigmoidal activation functions such as the hyperbolic tangent or the logistic function, $f(x) = 1/(1 + e^{-x})$.

For each switch cell column, we want either none or exactly two of the cells to be "on," and the rest to be "off." This is enforced by the column control cells, whose activation function is given by

$$G_j = \sum_i v_{i,j}, \quad \text{if } \sum_i v_{i,j} < 1, \; [\ldots]$$

The output of a column-control cell is used to inhibit all the switch cells in the same column if the net activation of all switch cells in a column strays from zero or two. A WTA network with $O(M)$ connections is implemented for every row of the cell matrix, using the row-control cell for that row. The output of the row-control cell is kept at zero if there is at least one 1 in the corresponding row of the request matrix. Otherwise, [...]

IV. EXPERIMENTAL RESULTS AND ANALYSIS

For our experiments, the time increment $\delta t$ was 10[...] several sizes were generated through Monte-Carlo techniques. For each size, 1000 matrices were generated, and the results averaged over these instances. The simulations were performed on a Sun-4, and the user times given in Table I are the net times taken (including request generation and set-up times) averaged over the entire run of 1000 instances.

Initially, each switch cell has a low, randomly chosen activation value, with a mean of $1/N$. The controller network is simulated through a sequence of iteration cycles. In each iteration, every switch cell is selected exactly once, and its output value is updated according to (4). The cell to be updated next is chosen in a random but equitable manner as follows: A signature register of size $p = \lceil \log_2(N \times M) \rceil$ is used to implement a primitive polynomial of GF($2^p$) [17]. The polynomial generates all numbers from 1 to $2^p - 1$ in some order depending on the seed chosen, and these numbers determine the cell to be updated next. At the end of each iteration, the chip cells are examined and updated to reflect the change in activity of the switch cells. The switch cell updates are stopped when each requesting row has a cell with output 0.9 or more. If this does not happen in 100 iterations, we presume that the network fails to converge, and start again with a new set of random values for the cell activations.

Table I presents the simulation results. To calculate the throughput efficiency, a maximum matching algorithm is run for each request matrix to obtain $C_{\max}$, which is then compared with $C_{\mathrm{batch}}$ obtained from the connection matrix. We notice that the efficiency is almost 100%, showing that the quality of the solutions is good. The request matrices were such that the network load was 100% in almost all cases. Since the $d$ limit severely restricts the choice of internal busses once many ports are active, getting a high throughput is not trivial.
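The cell-ordering scheme based on a signature register can be sketched as a linear feedback shift register. The sketch below makes several assumptions that the text does not state: a Galois-form register, the particular primitive polynomials in the table, the skipping of register values larger than $N \times M$, and the row-major mapping of values to cells. It also sizes the register by the bit length of $N \times M$, which differs from $\lceil \log_2(N \times M) \rceil$ only when $N \times M$ is an exact power of two, so that every cell remains reachable.

```python
# Exponents of well-known primitive polynomials over GF(2), e.g. x^4 + x + 1.
# The constant term is implied. Extend the table for larger registers.
PRIMITIVE_TAPS = {2: (2, 1), 3: (3, 1), 4: (4, 1), 5: (5, 2),
                  6: (6, 1), 7: (7, 1), 8: (8, 4, 3, 2)}


def lfsr_sequence(p, seed=1):
    """Yield the 2^p - 1 nonzero states of a p-bit Galois LFSR, one full cycle.

    The seed must be a nonzero value below 2^p.
    """
    mask = sum(1 << (e - 1) for e in PRIMITIVE_TAPS[p])
    state = seed
    for _ in range((1 << p) - 1):
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= mask


def update_order(N, M, seed=1):
    """Return the (row, column) switch cells in the order they are visited."""
    cells = N * M
    p = cells.bit_length()  # assumption: large enough to reach every cell
    return [((k - 1) // M, (k - 1) % M)
            for k in lfsr_sequence(p, seed) if k <= cells]


if __name__ == "__main__":
    print(update_order(4, 3, seed=5))
```

Each iteration cycle visits every switch cell exactly once, and changing the seed changes the starting point of the cycle, and hence the visiting order, without any bookkeeping of which cells have already been updated.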
Even though the convergence rate is rather high as compared with many other applications using Hopfieldtype networks, we did encounter cases when the network failed to converge. The frequency of this happening increased when there were more column busses available to make a connection, so that the search space was not constrained enough. By observing the evolution of cell activations for these cases, two cases were discerned.
1) The controller would oscillate between choosing columns $k$ and $l$ for implementing a connection between ports $i$ and $j$. This could happen when $v_{i,k} = v_{j,l} = 1$; $v_{i,l} = v_{j,k} = a < 1$, and all other cells in rows $i$, $j$ and columns $k$, $l$ were off. This situation is expressed in Fig. 6(a). Such situations could be overcome by increasing the positive coupling along the columns (term $t_4$ in (4)) and increasing the inhibition of cells in the same row (term $t_1$).
2) There were a few situations where more than two cells were quite active in one column whereas another column was not utilized at all, as depicted in [...]

[...] by the number of crosspoints. In a serial simulation, the time taken per iteration is $O(NM)$. The number of iterations required for convergence is experimentally observed to increase sublinearly with the network size. Thus the computational requirements scale well with increase in problem size. In a parallel implementation with one processor per cell, the time taken per iteration is almost independent of the network size. This implies that a VLSI implementation of the controller has the potential of performing bus arbitration in a time that is largely independent of the crossbar size.
V. A HIERARCHICAL CONTROLLER

In the incremental mode, a pair of entries in the $R$ matrix is modified whenever a new request is received, without affecting the other entries. For this mode, a smaller and more efficient neural network can be used in a two-level controller architecture. The key observation is that, within a given chip column, all the available internal busses are equivalent in their effect on Delta-I noise or on system throughput. Similarly, these parameters are independent of which specific port, among the set of ports accessing a particular row of chips, is making a connection request. Therefore, one can reduce the problem of selecting an internal bus to selecting the appropriate pair of chips (in the same chip column) used to make a connection. Once this is done, any of the available internal busses running through these two chips can be selected. The proposed architecture thus has two levels, as shown in Fig. 7. The top level uses a neural network that has $rc$ chip block cells arranged in an $r \times c$ matrix augmented by an extra row and column of control cells. If we need to connect or disconnect port $i'$ to $j'$, then an incremental $r \times c$ request matrix, $R'$, is created, where $R'(i, j) =$ [...]

The row and column control cells behave as before, except that now they get inputs from block cells instead of switch cells. For the second (lower level) controller, a digital mechanism, outlined in [8], is used. This controller receives a signal from the neural net arbiter once the chips are chosen, and then activates the appropriate crosspoint pair after selecting any one of the available busses. Similarly, when a release request arrives, this controller deactivates a crosspoint pair, and decrements the corresponding two $M(i, j)$ values by one. The exception is for an internal connection involving row $i$, in which case $M(i, i)$ is decremented by 2.
For our experiments, the time increment $\delta t$ was 0.05, and typical values for the coefficients A', B,, B,, D', and E' were 50, 25, 100, 50, and 75, respectively. Connection requests were generated using Monte-Carlo techniques until the desired capacity was reached. After that, a randomly chosen connection was removed and another connection chosen from among the free ports. For the special case of 100% network load, i.e., when all ports were busy, new connection/disconnection requests were generated by performing pairwise interchange between two pairs of communicating ports. Table II summarizes the performance of the top level arbiter. The individual crosspoint chips are taken to be of size 64 × 32, and the $d$ limit was chosen as $\lceil n/c \rceil + 1$. Thus, even though only one more off-chip column driver was allowed than the lower bound, 100% convergence was observed for almost all cases. The average convergence times are also very low. We observe that the time taken to converge is quite independent of the load or capacity at which the network is operated. The only cases in which the neural network did not converge to a valid solution occurred when the inhibitory terms were strong enough to cause all the cells to become inactive. This situation could be alleviated in two ways:
1) by adding an extra inhibitory term, $C'\left(\sum_{i,j} v_{i,j} - 2\right)$, which tends to enforce that exactly two cells be active. A value of $C' = 50$ was found to be suitable; or

2) by scaling B , in inverse proportion to the $d$ limit.
To observe the performance of the controller in a crosspoint matrix with no extra chip columns, we simulated incremental operation of networks with such configurations. In all the examples studied, the chip size was fixed at 48 × 24 and the number of chip columns was chosen identical to the number of rows. This forces all the busses to be used when the network load is 100%. The results are shown in Table III for various network sizes. Here, the $d$ limit is set to $\lceil n/c \rceil$, the best achievable, in the first table entry for each switch size. First, a series of connection requests is generated until the desired loading level is achieved. Then, a connection is picked at random for removal, followed by a request for a new connection between two of the free ports that belong to different chip rows. The results are recorded after 10 000 connections/disconnections. Since the parameter $d$ provides only a soft constraint in (4), it is possible that the number of active column-drivers actually reached in a chip exceeds $d$. The maximum of the observed number of active drivers per chip during each simulation is given as $d_{\max}$ in Table III. For large networks, these values closely matched the best possible.
Among the traditional controllers, such as those using the first-fit and best-fit algorithms [8], the best performance is observed to be that of a best-fit algorithm that selects a chip column $j$ for connecting a port in chip row $i$ to one in row $k$ according to

$$\min_j \max(M_{i,j}, M_{k,j}).$$
This algorithm attempts to keep the highest number of active drivers in a chip as low as possible by avoiding the use of chips with the maximum number of active drivers for a new connection unless there is no other choice. We compared the performance of the neural network controller with that of the best-fit algorithm described above.
The results are shown in Table IV. For each switch size and loading value, we recorded the maximum $d$ reached over a sequence of 10 000 connections/disconnections for both controllers. We observe that the controllers are comparable at lower loads, but the neural network controller is distinctly superior when the network loading approaches 100%.
In a parallel implementation of the neural network controllers, with one processor per cell, the convergence time is indicated by the number of iterations in the corresponding serial simulation. Fig. 8 shows the estimated average time taken to converge in the incremental mode. We observe experimentally that the arbitration time scales almost linearly with network size. This makes a VLSI implementation attractive. The 8 × 8 neural network arbitrator described in [19] uses a Hopfield model and 2-μm CMOS technology, and typically computes configurations within 120 ns. We expect the hierarchical controller presented in this paper to fulfil a single connection request in $O(10^2)$ ns for networks with $O(10^2)$ ports. More specific timing estimates demand an actual implementation of the controller in VLSI and are dependent on technological parameters.
VI. CONCLUDING REMARKS

In this paper, we presented two neural network architectures for the controller of multiple-chip crossbar networks, and analyzed their performance for batch and incremental connection requests, respectively. Since each communication line in a one-sided network is established through the activation of a pair of crosspoints, and since there are several choices of these pairs (one per available internal bus), the controller designs are much more involved than the one given in [19].
The hierarchical controller operates faster and provides better results than the batch controller. This is primarily because the reduced size of the neural network makes the number of constraints that need to be simultaneously satisfied smaller. For combinatorial optimization problems such as the traveling salesman problem, where the multiple symmetries in the problem lead to the formation of several well-spaced global minima, poor convergence rates for the Hopfield net have been observed [30]. The application considered in this paper does not suffer from this disadvantage, and converges to valid solutions in most trials. Moreover, the quality of the solutions is typically superior to that obtained by more conventional controllers using first-fit or best-fit schemes [8].
When the simultaneous switching noise is the limiting constraint on the size of a crosspoint chip, an effective controller allows the synthesis of larger crossbar networks than would be possible otherwise. Moreover, the massive parallelism afforded by neural networks can be exploited to meet the fast reconfiguration demands of these switches. The WTA network, a key component of the arbiter architecture, has already been fabricated on CMOS integrated circuits using $O(N)$ interconnects [16]. A VLSI implementation of the controller facilitates the design of a large crossbar switch serving as an interconnection network for highly parallel multiprocessor systems.
