INTRODUCTION
The use of Recurrent Neural Networks (RNNs) to learn Synchronous Sequential Machines (SSMs) from examples is a problem which has been studied extensively. A related topic that, to the authors' knowledge, has not been studied previously is the use of RNNs to learn SSMs for which several distinct initial states are possible.
This problem is interesting because it maps directly into the problem of learning the structure of an Interconnection Network (IN) from examples. Learning an IN from examples is an unusual approach. Traditionally, INs have been designed (and not learned) based on several criteria, including speed, complexity, ease of route calculation, and fault tolerance. Numerous different types of INs have been proposed. A detailed description of many of the INs that have been applied to parallel computing can be found in Siegel's book (Siegel, 1990). This paper is adapted from (Goudreau, 1993, Chapter 6); a shortened version of this paper was published in (Goudreau & Giles, 1993).
In this paper, the learning of Self-Routing Interconnection Networks (SRINs) is discussed. SRINs are described in detail in Section 2. They can be used to describe many commonly used INs. If one considers a parallel computing system, the idea is that the processors have certain communication requirements with other processors, and certain message headers (also described in Section 2) must be used that allow the message to pass through the SRIN and reach the desired destination processor. The message headers provide routing information to the switches in the SRIN.
The method that is proposed makes use of a second-order Single-Layer Recurrent Neural Network (SLRNN) to learn the training data. The training data is a table of source processors, message headers, and destination processors. Once the training data has been learned, the structure of the SRIN can be extracted from the SLRNN.
One topic that is related to the learning of INs was presented by Hillis (Hillis, 1990). In that paper, Hillis makes use of simulated evolution to construct sorting networks. It should also be mentioned that neural networks have previously been used for interconnection network routing: for example, see (Brown, 1989; Brown & Liu, 1990; Funabiki et al., 1991; Funabiki et al., 1993; Goudreau & Giles, 1992; Hakim & Meadows, 1990; Lee & Chang, 1993; Marrakchi & Troudet, 1989; Melsa et al., 1990a; Melsa et al., 1990b; Thomopoulos et al., 1991; Troudet & Walters, 1991). However, none of these methods learned the structure of the interconnection network; the structure of the interconnection network was always directly mapped into the neural network.
The learning of interconnection networks is a new idea; as such there are no existing applications. That said, this paper can be viewed as an attempt to look at interconnection network design in a different light. Rather than start with the design of an interconnection network and have the structure of the interconnection network determine the routes, it is possible to start with a set of desired routes and use them to determine the structure of an interconnection network. The possibility that this technique can be useful has been made more likely by automata rule encoding and extraction methods recently developed for recurrent neural networks (Andrews et al., 1995; Giles & Omlin, 1993; Maclin & Shavlik, 1993) .
SELF-ROUTING INTERCONNECTION NETWORKS
In this section we describe SRINs. The purpose of a SRIN is to allow a set of processors to communicate amongst themselves using a store-and-forward methodology. For store-and-forward routing, a message travels along the path towards its destination one switch at a time. A switch can be thought of as a simple processor that accepts a message and then routes the message to the appropriate output line. Once a switch has sent a message to the next switch, it is free to be used to route a different message. A SRIN does not use an external controller to route messages. Rather, the switches in a SRIN are smart; they examine the message that is being sent and decide which way to route it.
A message, as it is defined in this paper, consists of two parts: a header and a body. The header and the body are separated by an end-symbol, which will be denoted by ε. A schematic diagram of a message is shown in Figure 1.
The SRIN uses self-routing switches to route messages. A self-routing switch with M inputs and N outputs will be called an M × N self-routing switch. A drawing of a 1 × N self-routing switch is shown in Figure 2. Figure 3 is a drawing of an M × N self-routing switch. Intuitively, the routing works in the following manner. When a message arrives at a switch, the switch strips off the first symbol of the message header and examines it. If the symbol is not the end-symbol, ε, the switch sends the message (minus the header symbol that it just examined) through the appropriate output port. If the symbol is the end-symbol, then the message should be given to the processor that is associated with the switch. In other words, once the header has been stripped down so that only the end-symbol is left, the message does not get passed through the SRIN any longer.
Although it is not shown in Figures 2 and 3, it must be remembered that there is a connection from each switch to its associated processor. This can be thought of as another output port for the end-symbol, ε.
The switches work in a First-In, First-Out (FIFO) manner. If a message can not be routed immediately, it is buffered until it can be routed.
A Formal Description of a Self-Routing Interconnection Network
We will now present a more formal description of an SRIN. The SRIN will have a set of M processors, P = {p0, p1, ..., pM-1}. Each processor, pj, will be associated with a set of switches, Qj. Each set Qj must contain at least one switch. (Otherwise, the processor would have no way to communicate with the other processors.)
Note that not all the switches in the SRIN need to be associated with a processor. Some switches in an SRIN might never be used to connect to a processor. Such switches are called don't care switches, or free switches. A message can be routed through a free switch, but a free switch should never be the first switch nor the last switch in a route; to do so would imply that the free switch is associated with some processor. For the sake of convenience, we will associate some processor with each free switch, even though such an association is meaningless since it is never used. Now, the SRIN has the set of switches Q = Q0 ∪ Q1 ∪ ... ∪ QM-1. The processor function, β, performs the mapping β : Q → P. That is, if q is a switch, then β(q) is the processor associated with that switch. We will let R be the finite input alphabet for the header and the body. The end-symbol, ε, is not a member of R; that is, ε ∉ R. The end-symbol is only used to separate the header from the body. One typical alphabet would be R = {r0, r1}.
In general, however, the size of the alphabet can be greater than two. Since most computing environments are binary, the situation becomes more complicated when the alphabet has more than two symbols. In such cases, the members of the alphabet must be encoded in some way.
There does not need to be a size limitation on either the header or the body. In a binary system, the end-symbol might consist of a string of zeros and ones that is illegal in the header. Alternatively, one might send the header and the body separately, in which case the position of the end-symbol will be understood by the receiving switch. Another approach would be to designate the first byte of the header to represent the length of the header. There are many different ways to implement the end-symbol, but for our purposes here we will assume the end-symbol is just a symbol that can be transmitted in one time step.
We now define the switch transition function, φ, which performs the mapping φ : Q × R → Q. If q is a switch and r is the input symbol that is taken from the front of the header, then φ(q, r) is the next switch that the message will be sent to.
Finally, when processor pj sends a message, it starts the message off from one of the switches in the set Qj. Each processor will have a switch that is designated for this purpose. We define the switch function, γ, which performs the mapping γ : P → Q. If p is a processor, then γ(p) is the switch that performs the first stage of the routing for any messages that p sends. We will call the switch γ(p) the designated switch for processor p.
The SRIN can now be defined formally. DEFINITION 1. A self-routing interconnection network is a 7-tuple, (P, Q, R, φ, β, γ, ε), where:
• P is a finite, nonempty set of processors.
• Q is a finite, nonempty set of switches.
• R is a finite, nonempty set of input symbols.
• φ : Q × R → Q is the switch transition function.
• β : Q → P is the processor function.
• γ : P → Q is the switch function.
• ε is the end-symbol.

An Example of a Self-Routing Interconnection Network

Figure 4 shows an example of a SRIN. This SRIN has the set of switches Q = {q0, q1, q2, q3, q4, q5}. In Figure 4, the switches are shown as white boxes with their labels in the upper left corner. The outputs, labeled r0 and r1, are on the right side of each switch. The inputs come to the left side of the switch. A switch with a * in its lower left corner is a designated switch.
The set of processors is P = {p0, p1, p2, p3}. In Figure 4, the processors are represented by shaded areas.
For this example, we have the input alphabet R = {r0, r1}. Thus, each switch has two output ports. In practice, not all of the output ports need to be connected; some can be don't cares if they are never used for routing.
The number of input ports for each switch can be zero or any larger integer. If a switch has zero input ports, it must be a designated switch or it will have no purpose in the SRIN. The processor function, β, is shown here:
The switch transition function, φ, is shown here:
Finally, the switch function, γ, is shown here:
Suppose processor p1 has data to send to processor p2. One possible way to send the data there is to use the header r1r0r1. The message starts in switch q2 = γ(p1). The switch q2 strips off the left-most symbol in the header, in this case r1, and routes the message to q4 = φ(q2, r1). The message then goes to switch q1 = φ(q4, r0), and at last to switch q3 = φ(q1, r1). At this point, the header has been spent and the message is led by the end-symbol. Switch q3 therefore delivers the message to processor p2 = β(q3).
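To make the routing procedure concrete, the following Python sketch traces the route just described. It is an illustration rather than anything from the paper: the dictionaries contain only the transitions named in the example above, so the full φ, β, and γ tables of Figure 4 are not reproduced here.

```python
# Minimal sketch of self-routing in an SRIN (illustrative only).
# Only the transitions explicitly mentioned in the worked example are included.

END = "e"  # the end-symbol, written epsilon in the text

gamma = {"p1": "q2"}                     # designated switch for each source processor
beta  = {"q3": "p2"}                     # processor associated with each switch
phi   = {("q2", "r1"): "q4",             # switch transition function
         ("q4", "r0"): "q1",
         ("q1", "r1"): "q3"}

def route(source, header):
    """Route a message from `source` with the given header (a list of symbols
    ending in the end-symbol); return the destination processor."""
    switch = gamma[source]
    for symbol in header:
        if symbol == END:
            return beta[switch]          # deliver to the associated processor
        switch = phi[(switch, symbol)]   # strip the symbol, forward the message
    raise ValueError("header did not terminate with the end-symbol")

# Reproduces the example: p1 sends with header r1 r0 r1 and the message reaches p2.
print(route("p1", ["r1", "r0", "r1", END]))   # -> p2
```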
SYNCHRONOUS SEQUENTIAL MACHINES (SSMS)
In this section we discuss the relationship between SSMs and SRINs. SSMs are thoroughly described in (Hopcroft & Ullman, 1979; Kohavi, 1978). We will use the definition of SSMs that is provided in (Kohavi, 1978); specifically, our definition is for a Moore machine. (A finite state automaton is a restricted case of a sequential machine whose output alphabet is reduced to acceptance or rejection of input sequences.) DEFINITION 2. A synchronous sequential machine is a quintuple, (O, S, I, δ, λ), where:
• O is a finite, nonempty set of output symbols.
• S is a finite, nonempty set of states.
• I is a finite, nonempty set of input symbols.
• δ : S × I → S is the state transition function.
• λ : S → O is the output function.
From Definitions 1 and 2, it is clear that SRINs and SSMs are very similar. In fact, it only takes a slight expansion of the definition of SSMs to make them directly equivalent to SRINs. We will describe how SRINs are equivalent to Augmented SSMs (ASSMs), which will be defined below.
Let each processor in P be an output symbol in O. Similarly, let each switch in Q be a state in S, and each input symbol in R be an input symbol in I. The switch transition function, φ, becomes the state transition function, δ. The processor function, β, becomes the output function, λ.
Now the only components of the SRIN that are not equivalent to components in the SSM are the end-symbol, ε, and the switch function, γ. The ASSM will have an end-symbol, ε. The meaning of the end-symbol in this context is merely that the input string has reached its conclusion, and the ASSM can now output the value corresponding to the input string. The ASSM will also have a state function, ρ. The state function ρ performs the mapping ρ : O → S. In this context, the state function allows for some set of initial states in the ASSM. Thus, each input string that is to be entered into the ASSM must have an output symbol associated with it. This output symbol allows the ASSM to choose the correct starting state.
The ASSM can now be defined formally.
DEFINITION 3. An augmented synchronous sequential machine is a 7-tuple, (O, S, I, δ, λ, ρ, ε), where:
• O is a finite, nonempty set of outputs.
• S is a finite, nonempty set of states.
• I is a finite, nonempty set of inputs.
• δ : S × I → S is the state transition function.
• λ : S → O is the output function.
• ρ : O → S is the state function.
• ε is the end-symbol.
It is now clear from Definitions 1 and 3 that SRINs and ASSMs are equivalent.
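The equivalence can be made concrete in a few lines: an ASSM is an SRIN with its components relabeled (δ = φ, λ = β, ρ = γ). The sketch below is illustrative only and reuses the hypothetical dictionaries from the routing example in Section 2.

```python
# An SRIN (P, Q, R, phi, beta, gamma, eps) read as an ASSM (O, S, I, delta, lam, rho, eps):
# outputs O = processors P, states S = switches Q, inputs I = R,
# delta = phi, lam = beta, rho = gamma.
def assm_output(rho, delta, lam, start_output, input_string, END="e"):
    """Run the ASSM: the associated output symbol selects the initial state,
    the input string is consumed, and the output of the final state is returned."""
    state = rho[start_output]            # several distinct initial states are possible
    for symbol in input_string:
        if symbol == END:
            return lam[state]
        state = delta[(state, symbol)]

# With rho = gamma, delta = phi, lam = beta from the routing sketch,
# assm_output(gamma, phi, beta, "p1", ["r1", "r0", "r1", "e"]) again yields "p2".
```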
MACHINE INFERENCE
Since SRINs and ASSMs are equivalent, there are many issues that have been explored for ASSMs that can now be used for SRINs. For example, just as one can minimize the size of an ASSM by merging equivalent states (Kohavi, 1978) , one can minimize the size of a SRIN by merging equivalent switches.
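As an illustration of the state-merging idea, the sketch below applies the standard partition-refinement procedure for Moore machines (Kohavi, 1978) to a machine given as complete transition and output tables; the function and variable names are ours, not the paper's.

```python
def minimize(states, inputs, delta, lam):
    """Merge equivalent states of a Moore machine.
    delta: dict (state, input) -> state; lam: dict state -> output.
    Returns a dict mapping each state to a canonical representative."""
    # Start by grouping states with identical outputs.
    block = {s: lam[s] for s in states}
    while True:
        # Refine: two states stay together only if, for every input,
        # their successors currently lie in the same block.
        signature = {s: (block[s], tuple(block[delta[(s, a)]] for a in inputs))
                     for s in states}
        if len(set(signature.values())) == len(set(block.values())):
            break                          # partition is stable
        block = signature
    rep = {}
    for s in states:                       # pick one representative per block
        rep.setdefault(block[s], s)
    return {s: rep[block[s]] for s in states}
```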
What we are interested in is the inference of a SRIN from examples. A great deal of work has been done on the problem of machine inference. It has been shown that, in the worst case, inferring a SSM from sparse data is an intractable problem (Angluin, 1978; Gold, 1978; Kearns & Valiant, 1989; Pitt & Warmuth, 1989) . Approaches that can be used to infer SSMs will now be examined.
Recurrent Neural Network Approaches
The literature on the use of neural networks for grammatical inference and finite-state machine learning is now well-established (Cleeremans et al., 1989; Giles et al., 1992a; Giles et al., 1992b; Mozer & Bachrach, 1991; Pollack, 1991; Watrous & Kuhn, 1992; Zeng et al., 1993). These approaches use RNNs to represent SSMs. For the work done in this paper, the approach described in (Giles et al., 1992a; Giles et al., 1992b) will be used (see Section 5.2). We refer readers who are interested in the details to those references. In Section 5, there is a qualitative explanation of the RNN approach to learning SRINs.
Until recently, the RNN approach for SSM inference that is used in this paper had only been possible for unknown SSMs with a small number of states (approximately 30). It should be pointed out that this limited success is due to the learning algorithms; in general, RNNs have rich representational capabilities. However, recent work has shown that certain types of large SSMs, with thousands of states, are learnable (Clouse et al., 1994; Giles et al., 1995). Furthermore, the performance of the RNNs can sometimes be improved by using "hints" if partial information about the structure of the SSM is known (Giles & Omlin, 1993).
Other approaches that use neural networks for grammatical inference exist that will not be used in this paper. For example, the use of update graphs has been proposed by Rivest and Schapire (Rivest & Schapire, 1987a; Rivest & Schapire, 1987b; Schapire, 1988 ). An update graph is an alternate representation of a SSM that can be much smaller than the SSM for certain environments that often arise in practice. Update graphs can be mapped to a connectionist system that can learn the environment from examples (Mozer & Bachrach, 1990; Mozer & Bachrach, 1991) .
Traditional Approaches
Other methods for grammatical inference, which do not use neural networks, have demonstrated some promising results. In fact, a polynomial time algorithm proposed by Trakhtenbrot and Barzdin (Trakhtenbrot & Barzdin, 1973 ) has been shown to be able to infer some very large finite automata. The algorithm produces a machine that is consistent with a sparsely labeled tree, but the machine that is produced is not necessarily the minimum machine that is consistent with the data. Lang (Lang, 1992) performed several experiments using this algorithm for random finite automata with 1000 states and 2000 transitions. Given enough training examples, the algorithm was almost always able to construct a machine that was similar to the correct machine.
RECURRENT NEURAL NETWORKS TO LEARN INTERCONNECTION NETWORKS
The problem that we are trying to solve is posed in the following form. We have a training list of source processor, header, and destination processor combinations that must be implemented by a SRIN. For example, Table 1 contains data for some such problem. The data in Table 1 is consistent with the SRIN in Figure 4 . We must infer a SRIN that can accomplish all of the routings described in the training list. Hopefully, the SRIN will also be able to generalize. That is, we would like the SRIN to perform correct routings even for examples that are not on the training list. In Section 5.1, the recurrent neural network that is used to learn the interconnection network is described. The training algorithm is also discussed. Section 5.2 contains a training example and explains the specific encodings used. Finally, extracting the SRIN from the trained recurrent neural network is described in Section 5.3.
Recurrent Neural Network with Several Distinct Initial States
The structure of a general SLRNN is shown in Figure 5. There are M input lines, x1, x2, ..., xM. The value of input xi (1 ≤ i ≤ M) at time t is xi^t. There is a single layer of N neurons, y1, y2, ..., yN. The output value of neuron yi (1 ≤ i ≤ N) at time t is yi^t. At each time step, these output values are stored in a bank of latches to act as the "state" of the network. The state is fed back as an input to the layer of neurons on the subsequent time step. In general, all of the state values can be considered as output values, but here only the first K of the N state values (K ≤ N) are treated as outputs. For the second-order SLRNN used in this paper, the next state is computed as

yi^(t+1) = g( Σj Σk Wijk yj^t xk^t ).

This multiplication and summing occurs inside the neurons shown in Figure 5. The activation function, g(x), is the sigmoid function shown here:

g(x) = 1 / (1 + e^(-x)).
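A minimal sketch of this forward pass, assuming a single second-order weight tensor W of shape (N, N, M) and no separate bias term (the paper only gives the general form used in Giles et al., 1992a, 1992b):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slrnn_step(W, y, x):
    """One time step of a second-order SLRNN.
    W: weight tensor of shape (N, N, M); y: current state (N,); x: input vector (M,).
    Returns y^(t+1), where y_i^(t+1) = g(sum_{j,k} W[i,j,k] * y_j^t * x_k^t)."""
    net = np.einsum("ijk,j,k->i", W, y, x)
    return sigmoid(net)

def run(W, y0, input_vectors):
    """Apply a sequence of input vectors (the encoded header followed by the
    end-symbol) starting from initial state y0; return the final state."""
    y = y0
    for x in input_vectors:
        y = slrnn_step(W, y, x)
    return y
```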
The second-order SLRNN is used to infer the ASSM that is equivalent to the unknown SRIN. Again, the approach used in (Giles et al., 1992a; Giles et al., 1992b) will be used here. The SLRNN will learn the training data, and the ASSM will be extracted from the SLRNN.
The training algorithm that is used is a variation of the Real-Time Recurrent Learning (RTRL) algorithm proposed by Williams and Zipser (Williams & Zipser, 1989). The original RTRL algorithm was proposed for first-order SLRNNs, but the version used here is for second-order SLRNNs. The RTRL algorithm is an on-line, gradient-descent-based algorithm. Other recurrent training algorithms could be used for this application, e.g., backpropagation through time or the extended Kalman estimator.
Recall that the training data is in the form of source processor, header, and destination processor combinations. In order to use the SLRNN, it is necessary to encode the symbols of a table entry into binary vectors that can be recognized by the SLRNN. The source processor defines the initial state vector of the SLRNN (i.e., the values of yi^0 for 1 ≤ i ≤ N). Thus, each source processor must be assigned a distinct, N-bit binary vector. The input symbols in the header (along with the end-symbol) correspond to the input vectors of the SLRNN (i.e., the values of xi^t for 1 ≤ i ≤ M). Thus, each input symbol (and the end-symbol) must be assigned a distinct, M-bit binary vector. Note that the inputs to the SLRNN change over time, with the binary vector of the first input symbol applied at t = 1, the binary vector of the second input symbol applied at t = 2, etc. After the binary vectors for the sequence of input symbols have been applied, the binary vector for the end-symbol is applied as the final input vector. At this point it is possible to check the resulting output vector of the SLRNN (i.e., the values of yi^T for 1 ≤ i ≤ K, where T is the final time step) against the desired result. The desired result is a K-bit binary vector that represents the destination processor, so each destination processor must be assigned a distinct, K-bit binary vector.
Intuitively, the input vectors of the SLRNN represent the inputs and the end-symbol of the ASSM (and therefore the input symbols and the end-symbol of the SRIN). The state vectors of the SLRNN represent the states of the ASSM (and the switches of the SRIN). And the output vectors of the SLRNN represent the outputs of the ASSM (and the processors of the SRIN).
It is important to note that the binary vectors that are chosen to represent the source processors, the input symbols, the end symbol, and the destination processors are arbitrary. However, we will use simple one-hot encodings for all of the necessary binary vectors. Recall that a one-hot code is a code for which each symbol is represented by a vector that has one element equal to one while all of the other elements are equal to zero. This structure is chosen because it is known that (given enough neurons) a solution will exist to map the SLRNN to the desired ASSM (Goudreau et al., 1994) . The solution that is known to exist requires the use of one-hot codes for the states and the inputs. The representation that the SLRNN actually learns, however, can have states that are not in a one-hot code. The SLRNN might construct a solution that is different from the one-hot solution.
Clearly, there must be enough neurons to represent the processors (outputs) with a one-hot code. Therefore, the number of neurons must at least be equal to the number of processors. For the onehot solution to exist, however, there must be one neuron for each switch as well. Unfortunately, one does not generally know the number of switches beforehand. It is necessary to estimate the number of switches, and provide at least that many neurons. This is one of the weaknesses that is common to many neural network approaches: often it is not clear what size neural network would be best. One approach is to start with as many switch neurons as reasonably possible; if training is successful, then reduce the number of neurons using a destructive heuristic (Giles & Omlin, 1994) .
Assume that the values ỹi^T for 1 ≤ i ≤ K constitute the binary vector that represents the desired output. Training occurs for this table entry if, for any i (1 ≤ i ≤ K), we have |yi^T − ỹi^T| > η, where η = 0.2 for our simulations. If it is determined that training should occur for this table entry, we use the learning algorithm to incrementally change the weights in the direction opposite that of the gradient of the mean-square-error cost function, ET:

ET = (1/2) Σi (ỹi^T − yi^T)^2.
The weights of the SLRNN will therefore potentially be updated after each table entry is examined. As a result, the learning algorithm does not perform true gradient descent for the entire table of data; rather, it approximates true gradient descent. The table of training data is cycled through repeatedly until none of the table entries require updating of the weights. Once the entire table is learned, the structure of the underlying ASSM is extracted from the SLRNN. This procedure is described in Section 5.3. The state vectors that are created through the use of the learning algorithm are not examined during training, but they are examined after training to describe the structure of the underlying ASSM. As mentioned before, these states actually represent switches in the SRIN. Since the neurons have a continuous activation function, it is generally necessary to use some partitioning and clustering techniques to make a group of states equivalent. After this is done, an ASSM can be extracted from the SLRNN. The extracted ASSM might not be minimal, but in this case standard SSM minimization techniques can be used.
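The per-entry training discipline just described can be sketched as follows. The gradient step itself is left abstract (it would be supplied by RTRL or another recurrent training algorithm), `run` is the forward pass sketched above, and the function names, argument layout, and tolerance handling are assumptions rather than the authors' code.

```python
import numpy as np

def train(W, table, grad_step, encode_entry, eta=0.2):
    """Cycle through the training table until no entry forces a weight update.
    `table` holds (source, header, destination) entries; `encode_entry` maps an
    entry to (y0, input_vectors, target); `grad_step` performs one gradient-descent
    update on W for that entry (e.g., via RTRL) and returns the new W."""
    while True:
        updated = False
        for entry in table:
            y0, xs, target = encode_entry(entry)
            yT = run(W, y0, xs)                      # forward pass over header + end-symbol
            # Train on this entry only if some output neuron misses by more than eta.
            if np.abs(yT[:len(target)] - target).max() > eta:
                W = grad_step(W, y0, xs, target)     # descend on E_T = 0.5 * sum (target - yT)^2
                updated = True
        if not updated:
            return W
```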
It is instructive to compare learning an ASSM with learning a SSM, which has been extensively studied by the neural network community. The problem of inferring an ASSM, as opposed to inferring an SSM, is that in an ASSM there will generally be several possible initial states. In an SSM, a single initial state is generally assumed. With this in mind, the reader should be able to see that the SLRNN that is trying to learn an ASSM will, in fact, be trying to learn a SSM for each possible initial state. However, all of these SSMs will have the same structure; the only difference is that the initial state varies. Since all of the SSMs have the same structure, it can reasonably be hoped that the SLRNN will try to merge these multiple SSMs into a single ASSM. In fact, this turns out to be the case. Rather than learning several different SSMs in different sections of the state space, the SLRNN will take advantage of the identical structures of the SSMs and merge them.
A Training Example
Perhaps the easiest way to describe the mapping to the SLRNN would be to describe the structure of the SLRNN given data for the SRIN from Figure 4. We will choose to have N = 8 neurons in the SLRNN. The choice of N = 8 is somewhat arbitrary; however, we did want to give the recurrent network enough neurons to easily learn the interconnection network. Recall that we do not know the structure of the SRIN and are trying to learn it from routing strings. Since one-hot codes will be used, four initial state vectors are necessary, corresponding to the four source processors that are present in the data table. In reality, however, it should be remembered that the state vectors in the SLRNN will actually correspond to switches from the SRIN, rather than processors. Thus, when we choose state vectors for the four source processors, we are really choosing state vectors for their respective designated switches (as shown in eqn 3). Let hn,m be the vector of dimension n that has a value "1" in position m and a value "0" in all of the other positions. These are the initial values of the yi vectors. For example, h5,2 = [0, 1, 0, 0, 0]^T. When the source processor is processor p0 (which is equivalent to saying that the designated switch is q0, see eqn 3) the initial state vector will be h8,1. Similarly, source processor p1 will make the initial state vector h8,3, source processor p2 will make the initial state vector h8,4, and source processor p3 will make the initial state vector h8,5. Messages always start at these source processors. This mapping of source processors to initial state vectors can be made arbitrarily. There is no reason not to use vector h8,2, for example. The only reason that this vector was not used was because of the way the training data was generated.
Now the input symbols, r0 and r1, as well as the end-symbol, ε, can be mapped to input vectors. There are three symbols that must be mapped to input vectors, so we will have three vectors of length three for this purpose. Let r0 correspond to input vector h3,1, r1 correspond to input vector h3,2, and ε correspond to input vector h3,3.
Finally, the processors can now be mapped to output vectors. There are eight neurons in the SLRNN, but only the outputs of four of them are needed for the output vectors. Therefore, the output values will only be taken from neurons 0, 1, 2, and 3. Destination processor p0 will correspond to output vector h4,1, destination processor p1 will correspond to output vector h4,2, destination processor p2 will correspond to output vector h4,3, and destination processor p3 will correspond to output vector h4,4.
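For illustration, the encodings of this section can be collected into a small helper; the mapping follows the text exactly (source processor to 8-bit initial state, symbols r0, r1, ε to 3-bit inputs, destination processor to 4-bit target), while the function names themselves are invented here.

```python
import numpy as np

def one_hot(n, m):
    """h_{n,m}: the length-n vector with a 1 in position m (1-indexed, as in the text)."""
    v = np.zeros(n)
    v[m - 1] = 1.0
    return v

INITIAL = {"p0": one_hot(8, 1), "p1": one_hot(8, 3),
           "p2": one_hot(8, 4), "p3": one_hot(8, 5)}
INPUT   = {"r0": one_hot(3, 1), "r1": one_hot(3, 2), "e": one_hot(3, 3)}
TARGET  = {"p0": one_hot(4, 1), "p1": one_hot(4, 2),
           "p2": one_hot(4, 3), "p3": one_hot(4, 4)}

def encode_entry(entry):
    """Map a (source, header, destination) table entry to (y0, input_vectors, target)."""
    source, header, destination = entry
    xs = [INPUT[s] for s in header] + [INPUT["e"]]   # end-symbol applied last
    return INITIAL[source], xs, TARGET[destination]

# Example: the route of Section 2 as a training entry.
y0, xs, target = encode_entry(("p1", ["r1", "r0", "r1"], "p2"))
```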
In summary, the second-order SLRNN network that is used for this example is shown in Figure 5 and has N = 8 neurons and M = 3 input bits. When the output values are examined, only the first four neurons are used (K = 4). Now that the specifics of the encoding have been explained, we can use the training data to train the SLRNN. The data table contained an entry for every header up to length 11 (including the end-symbol) for every processor. The data table starts with strings of length one and concludes with strings of length 11. The SLRNN does not try to learn all of the data in this table simultaneously. Rather, it first learns the first 20 lines of the data table for each possible starting state. Then the resulting SLRNN is checked against the rest of the data table to see how it generalizes. If perfect generalization does not occur, the SLRNN adds 20 more lines to its training data. Once these 40 lines are learned, generalization is checked again. This process is repeated until all of the lines in the data table have been learned. This heuristic approach, which involves incremental expansion of the training data, has proven to be quite successful in practice.
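The incremental expansion heuristic amounts to an outer loop around the training procedure. The sketch below is a simplification (it ignores the per-starting-state bookkeeping), and `fit` and `generalizes` are assumed callables, e.g. the training and checking routines sketched earlier with their extra arguments bound.

```python
def train_incrementally(W, table, fit, generalizes, chunk=20):
    """Learn the table in 20-line increments: train on the first k lines, then
    check generalization on the remainder; if it fails, enlarge the training set."""
    k = chunk
    while True:
        k = min(k, len(table))
        W = fit(W, table[:k])                     # learn the current prefix to completion
        if k == len(table) or generalizes(W, table[k:]):
            break                                 # perfect generalization on the rest
        k += chunk
    return W
```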
For our experiment, by the time the SLRNN learned the first 280 lines of the table, the SLRNN was able to generalize for all of the remaining strings in the table. The full table had 8188 lines.
Extraction of the Interconnection Network from the Trained Neural Network
It was mentioned in Section 5.1 that this problem can be thought of as the problem of learning several separate SSMs, in this case four. Given this fact, one can examine the four separate SSMs that are generated from the four different initial states. Table 2 shows the unminimized SSM, with initial state corresponding to switch q0, that was extracted from the SLRNN. This machine shall be called M1.
One of the advantages of using second-order SLRNNs is the ease with which automata can be extracted from the trained or training networks. However, first-order SLRNNs could also be used (Manolios & Fanelli, 1994; Miller & Giles, 1993). Details on the method of SSM extraction that we used can be found in (Giles et al., 1992a; Giles et al., 1992b). The left column (S) contains the number of each state. State "1" in Tables 2 to 9 is the initial state. Each extracted state is associated with a representative state vector; when a state vector produced by the SLRNN is sufficiently close to an existing representative state vector, it is assumed that the two vectors implement the same state. To simplify our analysis, the real-valued representative state vectors are quantized. Any value greater than or equal to 0.5 in a representative state vector is set to 1 after quantization, while any value less than 0.5 is set to 0. Henceforth, when state vectors are mentioned, we will actually be talking about these quantized representative state vectors. Thus, in general, the state vectors in Table 2 actually represent some volume of the state space. The unminimized SSMs given designated switches q2, q3, and q4 are shown in Tables 3 (machine M2), 4 (machine M3), and 5 (machine M4), respectively.
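The quantization-based extraction just described can be sketched as a breadth-first exploration of the SLRNN's state space; states are identified by their quantized vectors, and `slrnn_step` is the forward step sketched in Section 5.1. This follows the description in the text rather than the extraction code of (Giles et al., 1992a; Giles et al., 1992b), and the function names are illustrative.

```python
def quantize(y):
    """Quantize a state vector: values >= 0.5 become 1, the rest 0."""
    return tuple(int(v >= 0.5) for v in y)

def extract_ssm(W, y0, input_vector, symbols, max_depth=10):
    """Breadth-first extraction of an (unminimized) SSM from a trained SLRNN:
    start from y0, apply each input symbol, and identify states by their
    quantized representative state vectors."""
    start = quantize(y0)
    states = {start: y0}                 # quantized vector -> a representative real vector
    delta = {}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for q in frontier:
            y = states[q]
            for s in symbols:
                y_next = slrnn_step(W, y, input_vector[s])
                q_next = quantize(y_next)
                delta[(q, s)] = q_next
                if q_next not in states:
                    states[q_next] = y_next
                    next_frontier.append(q_next)
        frontier = next_frontier
    return states, delta
```

Applying the end-symbol input vector to each representative state and reading the first K neurons then gives the output (λ) associated with each extracted state.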
Once the unminimized SSMs and their state vector representations are known, the SSMs can be minimized and merged. Table 6 contains the minimized SSM that is extracted from the SLRNN when the initial state vector corresponds to designated switch q0. Each state in the minimized machine can be associated with one or more state vectors. For example, state 1 of the SSM in Table 6 is associated with two state vectors, 10000000 and 10000001.
The minimal SSMs with initial states corresponding to designated switches q2, q3, and q4 are shown in Tables 7, 8, and 9, respectively. With the information in Tables 6 to 9, we can merge the SSMs into an ASSM. States in the SSMs with any overlapping vector representations are assumed to be equivalent. The results show that the SLRNN has indeed "merged" the SSMs.
In fact, if we ignore the quantized representations of the states, the four minimal SSMs are equivalent except for the fact that they have different initial states. That is, if we ignore the initial states, the four SSMs can be relabeled so that they are identical.
It should be clear how Table 6 corresponds to the SRIN in Figure 4. State 1 corresponds to switch q0, state 2 corresponds to switch q4, state 3 corresponds to switch q2, state 4 corresponds to switch q1, state 5 corresponds to switch q5, and state 6 corresponds to switch q3. Simple observation will show that the other three SSMs also correspond to the SRIN in Figure 4. Table 10 merges this information to show how the SLRNN represents the switches of Figure 4. State vectors are shown and the machines that utilized them are presented.
Table 10 (in part). State vectors used by the SLRNN to represent the switches of Figure 4, and the machines that utilized them:

Switch  State vector  Machines
q2      00100000      M2
q2      00100001      M1, M2, M3, M4
q2      00100101      M1, M2, M3, M4
q2      01100101      M2, M3, M4
q3      00010000      M3
q3      00010101      M1, M2, M3, M4
q4      00001000      M4
q4      01010100      M3, M4
q4      01011100      M1, M2, M3, M4
q4      01011110      M2
q4      01111100      M1, M2, M3, M4
q5      11000101      M4
q5      11001101      M1, M2, M3, M4
q5      11010100      M1, M2, M3, M4
q5      11011101      M1, M2, M3, M4

Examination of Table 10 shows that, for the most part, the four machines shared state vector representations for each switch. Had a clustering technique been used to group nearby state vectors (Zeng et al., 1993), it is possible that minimal SSMs could have been extracted directly from the SLRNN.
Furthermore, values such as 0.49 and 0.51 will have different values after quantization, although they are in fact quite close in the state space. Thus, two state vectors that are quite close in terms of Euclidean distance (before quantization) might be quite far from each other after quantization. This is another possible explanation for the fact that the extracted SSM was not minimal. Another observation is that the SLRNN did not learn anything approaching the one-hot solution that was described in Section 5. In fact, the only state vectors that were one-hot after quantization were the initial state vectors that were forced upon the SLRNN. Although these initial state vectors were never returned to, they did seem to give the SLRNN a bias in its state representation.
Finally, the fact that the correct SRIN was extracted means that good generalization for strings longer than those in the training data has obviously been achieved.
CONCLUSIONS
A radical approach to the construction of interconnection networks has been presented. This approach uses training data from an unknown interconnection network to teach a recurrent neural network (RNN) to generate an interconnection network that is capable of routing the training data.
It was shown that this problem maps directly to the problem of learning an SSM with several distinct initial states. The proposed approach took advantage of previous work on the use of RNNs to infer synchronous sequential machines (SSMs). However, it should be noted that the relationship between interconnection networks and SSMs might also allow non-neural-network approaches to be applied to the same problem: just as the RNN method for SSM inference was varied slightly here to perform interconnection network inference, other SSM inference methods could likely be varied in the same way.
It was demonstrated that given a table of training data, it is possible to use a second-order, single-layer RNN to generate the structure of an interconnection network that is capable of routing the training data. Furthermore, the interconnection network that is generated might be able to generalize for inputs that are not in the training data. A sample problem was used to illustrate the methodology. It is an open question as to whether other RNN models and/or training methods can outperform these results.
This work clearly pointed out the need for further research into the use of RNNs to infer larger SSMs. To date, RNNs have had limited success for large problems in grammatical inference, but some recent results are promising (Giles et al., 1995; Clouse et al., 1994).
The concept of learning interconnection networks is an unusual one for the interconnection network community. It remains to be seen whether such learning approaches will become a useful method for interconnection network design or analysis.
