Using Recurrent Neural Networks to Learn the Structure of
   Interconnection Networks by Goudreau, Mark W. & Giles, C. Lee
University of Maryland Technical Report UMIACS-TR-94-20 and CS-TR-3226

Mark W. Goudreau (a) and C. Lee Giles (b, c)

(a) Department of Computer Science, University of Central Florida, Orlando, FL 32816
(b) NEC Research Institute, 4 Independence Way, Princeton, NJ 08540
(c) Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742

15 February 1994

Abstract

A modified Recurrent Neural Network (RNN) is used to learn a Self-Routing Interconnection Network (SRIN) from a set of routing examples. The RNN is modified so that it has several distinct initial states. This is equivalent to a single RNN learning multiple different synchronous sequential machines. We define such a sequential machine structure as augmented and show that a SRIN is essentially an Augmented Synchronous Sequential Machine (ASSM). As an example, we learn a small six-switch SRIN. After training we extract the network's internal representation of the ASSM and corresponding SRIN.

This paper is adapted from (Goudreau, 1993, Chapter 6). A shortened version of this paper was published in (Goudreau & Giles, 1993).
1 Introduction

The use of Recurrent Neural Networks (RNNs) to learn Synchronous Sequential Machines (SSMs) from examples is a problem which has been studied extensively. A related topic that, to the authors' knowledge, has not been studied previously is the use of RNNs to learn SSMs for which several distinct initial states are possible. This problem is interesting because it maps directly into the problem of learning the structure of an Interconnection Network (IN) from examples. Learning an IN from examples is an unusual approach. Traditionally, INs have been designed (and not learned) based on several criteria, including speed, complexity, ease of route calculation, and fault tolerance. Numerous different types of INs have been proposed. A detailed description of many of the INs that have been applied to parallel computing can be found in Siegel's book (Siegel, 1990).

In this paper, the learning of Self-Routing Interconnection Networks (SRINs) is discussed. SRINs are described in detail in Section 2. They can be used to describe many commonly used INs. If one considers a parallel computing system, the idea is that the processors have certain communication requirements with other processors, and certain message headers (also described in Section 2) must be used that allow the message to pass through the SRIN and reach the desired destination processor. The message headers provide routing information to the switches in the SRIN.

The method that is proposed makes use of a second-order Single-Layer Recurrent Neural Network (SLRNN) to learn the training data. The training data is a table of source processors, message headers, and destination processors. Once the training data has been learned, the structure of the SRIN can be extracted from the SLRNN.

One topic that is related to the learning of INs was presented by Hillis (Hillis, 1990). In that paper, Hillis makes use of simulated evolution to construct sorting networks.
It should also be mentioned that neural networks have been previously used for interconnection network routing: for example, see (Brown, 1989; Brown & Liu, 1990; Funabiki et al., 1991; Funabiki et al., 1993; Goudreau & Giles, 1992; Hakim & Meadows, 1990; Lee & Chang, 1993; Marrakchi & Troudet, 1989; Melsa et al., 1990a; Melsa et al., 1990b; Takefuji & Lee, 1991; Thomopoulos et al., 1991; Troudet & Walters, 1991). However, none of these methods learned the structure of the interconnection networks; the structure of the interconnection network was always directly mapped into the neural network.

2 Self-Routing Interconnection Networks

In this section we describe SRINs. The purpose of a SRIN is to allow a set of processors to communicate amongst themselves using a store-and-forward method-
[Figure: message format, showing the header, the end-symbol ε, and the body.]
Figure 3: An $M \times N$ self-routing switch. Again, input messages are routed on a First-In, First-Out (FIFO) basis.

symbol is the end-symbol, then the message should be given to the processor that is associated with the switch. In other words, once the header has been stripped down so that only the end-symbol is left, the message does not get passed through the SRIN any longer.

Although it is not shown in Figures 2 and 3, it must be remembered that there is a connection from each switch to its associated processor. This can be thought of as another output port for the end-symbol, $\varepsilon$.

The switches work in a First-In, First-Out (FIFO) manner. If a message cannot be routed immediately, it is buffered until it can be routed.

2.1 A Formal Description of a Self-Routing Interconnection Network

We will now present a more formal description of an SRIN. The SRIN will have a set of $M$ processors, $P = \{p_0, p_1, \ldots, p_{M-1}\}$. Each processor, $p_j$, will be associated with a set of switches, $Q_j$. Each set $Q_j$ must contain at least one switch. (Otherwise, the processor would have no way to communicate with the other processors.)

Note that not all of the switches in the SRIN need to be associated with a processor. Some switches in an SRIN might never be used to connect to a processor. Such switches are called don't care switches, or free switches. A message can be routed through a free switch, but a free switch should never be the first switch nor the last switch in a route; to do so would imply that the free switch is associated with some processor. For the sake of convenience, we will associate some processor with each free switch, even though such an association is meaningless since it is never used.

Now, the SRIN has the set of switches $Q = Q_0 \cup Q_1 \cup \cdots \cup Q_{M-1}$. The processor function, $\pi$, performs the mapping $\pi : Q \to P$. That is, if $q$ is a switch, then $\pi(q)$ is the processor associated with that switch.

We will let $R$ be the finite input alphabet for the header and the body. The end-symbol, $\varepsilon$, is not a member of $R$; that is, $\varepsilon \notin R$. The end-symbol is only used to separate the header from the body. One typical alphabet would be $R = \{r_0, r_1\}$. In general, however, the size of the alphabet can be greater than two. Since most computing environments are binary, the situation becomes more complicated when the size of the alphabet is greater than two. In such cases, the members of the alphabet must be encoded in some way.

There need not be a size limit on either the header or the body. In a binary system, the end-symbol might consist of a string of zeros and ones that is illegal in the header. Alternatively, one might send the header and the body separately, in which case the position of the end-symbol will be understood by the receiving switch. Another approach would be to designate the first byte of the header to represent the length of the header. There are many different ways to implement the end-symbol, but for our purposes here we will assume the end-symbol is just a symbol that can be transmitted in one time step.

We now define the switch transition function, $\tau$, which performs the mapping $\tau : Q \times R \to Q$. If $q$ is a switch and $r$ is the input symbol that is taken from the front of the header, then $\tau(q, r)$ is the next switch that the message will be sent to.

Finally, when processor $p_j$ sends a message, it starts the message off from one of the switches in the set $Q_j$. Each processor will have a switch that is designated for this purpose. We define the switch function, $\sigma$, which performs the mapping $\sigma : P \to Q$. If $p$ is a processor, then $\sigma(p)$ is the switch that performs the first stage of the routing for any messages that $p$ sends. We will call the switch $\sigma(p)$ the designated switch for processor $p$.

The SRIN can now be defined formally.

Definition 1 A self-routing interconnection network is a 7-tuple, $(P, Q, R, \tau, \pi, \sigma, \varepsilon)$, where:
  $P$ is a finite, nonempty set of processors.
  $Q$ is a finite, nonempty set of switches.
  $R$ is a finite, nonempty set of input symbols.
  $\tau : Q \times R \to Q$ is the switch transition function.
  $\pi : Q \to P$ is the processor function.
  $\sigma : P \to Q$ is the switch function.
  $\varepsilon$ is the end-symbol.
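As an illustration of Definition 1, the following is a minimal Python sketch of an SRIN as a data structure, together with the routing rule in which each switch strips the front symbol of the header. The class and field names (`SRIN`, `transition`, `processor_of`, `designated`, `route`) are hypothetical. The six-switch network used here is an assumed reading of the example network, taken from the minimal machine in Table 6 and the state-to-switch correspondence stated in Section 5.3, not copied from Figure 4 itself.

```python
from dataclasses import dataclass


@dataclass
class SRIN:
    """Sketch of Definition 1; field and method names are illustrative only."""
    transition: dict    # switch transition function: (switch, symbol) -> switch
    processor_of: dict  # processor function: switch -> processor
    designated: dict    # switch function: processor -> designated switch

    def route(self, source, header):
        """Follow a header from the source processor's designated switch.

        Returns the destination processor and the list of switches visited.
        """
        switch = self.designated[source]
        visited = [switch]
        for symbol in header:  # each switch strips the front symbol
            switch = self.transition[(switch, symbol)]
            visited.append(switch)
        # Only the end-symbol is left: deliver to the associated processor.
        return self.processor_of[switch], visited


# Assumed network: NEXT[q] = (next switch on r0, next switch on r1),
# read off Table 6 and the correspondence given in Section 5.3.
NEXT = {"q0": ("q4", "q2"), "q1": ("q3", "q3"), "q2": ("q1", "q4"),
        "q3": ("q0", "q4"), "q4": ("q1", "q5"), "q5": ("q5", "q2")}

network = SRIN(
    transition={(q, f"r{i}"): nxt
                for q, pair in NEXT.items() for i, nxt in enumerate(pair)},
    processor_of={"q0": "p0", "q1": "p0", "q2": "p1",
                  "q3": "p2", "q4": "p3", "q5": "p3"},
    designated={"p0": "q0", "p1": "q2", "p2": "q3", "p3": "q4"},
)

destination, visited = network.route("p1", ["r1", "r0", "r1"])
print(destination, visited)
```

Under these assumptions, a message sent by processor p1 with header r1 r0 r1 passes through switches q2, q4, q1, and q3 and is delivered to processor p2. An empty header is delivered at the designated switch itself.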





































Figure 4. A sample self-routing interconnection network.
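The training data used in this paper is a table of source processors, message headers, and destination processors. A minimal sketch of how such a table could be enumerated from a known network is given below. The network constants are an assumed reading of the six-switch example (taken from Table 6 and the correspondence in Section 5.3), the names `NEXT`, `PROCESSOR_OF`, `DESIGNATED`, and `routing_table` are hypothetical, and the row ordering within each header length is an assumption.

```python
from itertools import product

# Assumed six-switch network (read off Table 6 and Section 5.3).
# NEXT[q] = (next switch on r0, next switch on r1).
NEXT = {"q0": ("q4", "q2"), "q1": ("q3", "q3"), "q2": ("q1", "q4"),
        "q3": ("q0", "q4"), "q4": ("q1", "q5"), "q5": ("q5", "q2")}
PROCESSOR_OF = {"q0": "p0", "q1": "p0", "q2": "p1",
                "q3": "p2", "q4": "p3", "q5": "p3"}
DESIGNATED = {"p0": "q0", "p1": "q2", "p2": "q3", "p3": "q4"}


def routing_table(max_symbols):
    """List (source, header, destination) rows for every header of up to
    max_symbols routing symbols, shortest headers first."""
    rows = []
    for length in range(max_symbols + 1):
        for source in sorted(DESIGNATED):
            for header in product((0, 1), repeat=length):
                switch = DESIGNATED[source]
                for symbol in header:
                    switch = NEXT[switch][symbol]
                rows.append((source, header, PROCESSOR_OF[switch]))
    return rows


# Up to 10 routing symbols, i.e. length 1 to 11 once the end-symbol is counted.
table = routing_table(10)
print(len(table))
```

With four source processors and a binary alphabet, this enumeration gives 4 x (2^11 - 1) = 8188 rows, which agrees with the size of the full data table reported in Section 5.2.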
switch $q_2$ strips off the left-most symbol in the header, in this case $r_1$, and routes the message to $q_4 = \tau(q_2, r_1)$. The message then goes to switch $q_1 = \tau(q_4, r_0)$, and at last to switch $q_3 = \tau(q_1, r_1)$. At this point, the header has been spent and the message is led by the end-symbol. Switch $q_3$ therefore delivers the message to processor $p_2 = \pi(q_3)$.

3 Synchronous Sequential Machines

In this section we discuss the relationship between SSMs and SRINs.

Finite State Automata (FSAs) and SSMs are thoroughly described in (Hopcroft & Ullman, 1979; Kohavi, 1978). We will use the definition of SSMs that is provided in (Kohavi, 1978); specifically, our definition is for a Moore machine.

Definition 2 A synchronous sequential machine is a quintuple, $(O, S, I, \delta, \lambda)$, where:
  $O$ is a finite, nonempty set of output symbols.
  $S$ is a finite, nonempty set of states.
  $I$ is a finite, nonempty set of input symbols.
  $\delta : S \times I \to S$ is the state transition function.
  $\lambda : S \to O$ is the output function.

From Definitions 1 and 2, it is clear that SRINs and SSMs are very similar. In fact, it only takes a slight expansion of the definition of SSMs to make them directly equivalent to SRINs. We will describe how SRINs are equivalent to Augmented SSMs (ASSMs), which will be defined below.

Let each processor in $P$ be an output symbol in $O$. Similarly, let each switch in $Q$ be a state in $S$, and each input symbol in $R$ be an input symbol in $I$. The switch transition function, $\tau$, becomes the state transition function, $\delta$. The processor function, $\pi$, becomes the output function, $\lambda$.

Now the only components of the SRIN that are not equivalent to components in the SSM are the end-symbol, $\varepsilon$, and the switch function, $\sigma$. The ASSM will have an end-symbol, $\varepsilon$. The meaning of the end-symbol in this context is merely that the input string has reached its conclusion, and the ASSM can now output the value corresponding to the input string. The ASSM will also have a state function, $\phi$. The state function $\phi$ performs the mapping $\phi : O \to S$. In this context, the state
function allows for some set of initial states in the ASSM. Thus, each input string that is to be entered into the ASSM must have an output symbol associated with it. This output symbol allows the ASSM to choose the correct starting state.

The ASSM can now be defined formally.

Definition 3 An augmented synchronous sequential machine is a 7-tuple, $(O, S, I, \delta, \lambda, \phi, \varepsilon)$, where:
  $O$ is a finite, nonempty set of outputs.
  $S$ is a finite, nonempty set of states.
  $I$ is a finite, nonempty set of inputs.
  $\delta : S \times I \to S$ is the state transition function.
  $\lambda : S \to O$ is the output function.
  $\phi : O \to S$ is the state function.
  $\varepsilon$ is the end-symbol.

It is now clear from Definitions 1 and 3 that SRINs and ASSMs are equivalent.

4 Machine Inference

Since SRINs and ASSMs are equivalent, there are many issues that have been explored for ASSMs that can now be used for SRINs. For example, just as one can minimize the size of an ASSM by merging equivalent states (Kohavi, 1978), one can minimize the size of a SRIN by merging equivalent switches.

What we are interested in is the inference of a SRIN from examples. A great deal of work has been done on the problem of machine inference. It has been shown that, in the worst case, inferring a SSM from sparse data is an intractable problem (Angluin, 1978; Gold, 1978; Kearns & Valiant, 1989; Pitt & Warmuth, 1989). Approaches that can be used to infer SSMs will now be examined.

4.1 Recurrent Neural Network Approaches

The literature on the use of neural networks for grammatical inference and finite-state machine learning is now well-established (Cleeremans et al., 1989; Giles et al.,
1992; Giles et al., 1992; Mozer & Bachrach, 1991; Pollack, 1991; Watrous & Kuhn, 1992; Zeng et al., 1993). These approaches use RNNs to represent SSMs. For the work done in this paper, the approach described in (Giles et al., 1992; Giles et al., 1992) will be used (see Section 5.2). We refer readers who are interested in the details to those references. In Section 5, there is a qualitative explanation of the RNN approach to learning SRINs.

Until recently, the RNN approach for SSM inference that is used in this paper had only been possible for unknown SSMs with a small number of states (approximately 30). It should be pointed out that the limited success of these approaches is due to the learning algorithms. Generally, the RNNs have rich representational capabilities. However, recent work has shown that certain types of large SSMs, with thousands of states, are learnable (Giles & Horne, 1994). Furthermore, the performance of the RNNs can sometimes be improved by using "hints" if partial information about the structure of the SSM is known (Giles & Omlin, 1993).

Other approaches that use neural networks for grammatical inference exist that will not be used in this paper. For example, the use of update graphs has been proposed by Rivest and Schapire (Rivest & Schapire, 1987a; Rivest & Schapire, 1987b; Schapire, 1988). An update graph is an alternate representation of a FSA that can be much smaller than the FSA for certain environments that often arise in practice. Update graphs can be mapped to a connectionist system that can learn the environment from examples (Mozer & Bachrach, 1990; Mozer & Bachrach, 1991).

4.2 Traditional Approaches

Other methods for grammatical inference, which do not use neural networks, have demonstrated some promising results. In fact, a polynomial time algorithm proposed by Trakhtenbrot and Barzdin (Trakhtenbrot & Barzdin, 1973) has been shown to be able to infer some very large finite automata.
The algorithm produces a machine that is consistent with a sparsely labeled tree, but the machine that is produced is not necessarily the minimum machine that is consistent with the data. Lang (Lang, 1992) performed several experiments using this algorithm for random finite automata with 1000 states and 2000 transitions. Given enough training examples, the algorithm was almost always able to construct a machine that was similar to the correct machine.
Figure 5: A Single-Layer Recurrent Neural Network (SLRNN). There are M input bits, N state bits, and (up to) N output bits. The bank of N latches is shown on the right.

The activation function, $g(x)$, is the sigmoid function shown here:

$$g(x) = \frac{1}{1 + e^{-ax}} \qquad (5)$$

The second-order SLRNN is used to infer the ASSM that is equivalent to the unknown SRIN. Again, the approach used in (Giles et al., 1992; Giles et al., 1992) will be used here. The SLRNN will learn the training data, and the ASSM will be extracted from the SLRNN. A gradient descent algorithm that is similar to the one proposed by Williams and Zipser (Williams & Zipser, 1989) will be used to train the SLRNN. If it were not possible to extract the ASSM from the SLRNN, then the SLRNN would not be useful for this problem. Without ASSM extraction, the SLRNN could still generalize, but when the structure of a SRIN is the desired result such generalization is not useful.

The problem of inferring an ASSM, as opposed to inferring an SSM, is that in an ASSM there will generally be several possible initial states. In an SSM, a single initial state is generally assumed. With this in mind, the reader should be able to see that the SLRNN that is trying to learn an ASSM will, in fact, be trying to learn a SSM for each possible initial state. However, all of these SSMs will have the same structure; the only difference is that the initial state varies. Since all of the SSMs have the same structure, it can reasonably be hoped that the SLRNN will try to merge these multiple SSMs into a single ASSM. In fact, this turns out to be the case. Rather than learning several different SSMs in different sections of the state
space, the SLRNN will take advantage of the identical structures of the SSMs and merge them.

Intuitively, the input vectors of the SLRNN will represent the inputs and the end-symbol of the ASSM (and therefore the input symbols and the end-symbol of the SRIN). The state vectors of the SLRNN represent the states of the ASSM (and the switches of the SRIN). And the output vectors of the SLRNN represent the outputs of the ASSM (and the processors of the SRIN).

We will use simple one-hot encodings for the input symbols and the output symbols. Recall that a one-hot code is a code for which each symbol is represented by a vector that has one element equal to one while all of the other elements are equal to zero. Additionally, the initial state, which depends on the source processor, will be a one-hot code. This structure is chosen because it is known that a solution will exist to map the SLRNN to the desired ASSM (Goudreau et al., 1994). The solution that is known to exist requires the use of one-hot codes for the states and the inputs. The representation that the SLRNN actually learns, however, can have states that are not in a one-hot code. The SLRNN might construct a solution that is different from the one-hot solution.

Clearly, there must be enough neurons to represent the processors (outputs) with a one-hot code. Therefore, the number of neurons must at least be equal to the number of processors. For the one-hot solution to exist, however, there must be one neuron for each switch as well. Unfortunately, one does not generally know the number of switches beforehand. It is necessary to estimate the number of switches, and provide at least that many neurons. This is one of the weaknesses that is common to many neural network approaches: often it is not clear what size neural network would be best.
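The one-hot encodings described above, and the way a state vector is carried from one input symbol to the next, can be sketched as follows. This is a minimal illustration under stated assumptions, not the trained network of this paper: the weights are random and untrained, the second-order update S_i <- g(sum over j, k of W_ijk S_j I_k) is assumed to be the usual second-order form of (Giles et al., 1992), the gain a is set to 1, and the sizes (eight neurons, three input vectors, four output neurons) are those of the example in Section 5.2. The names `one_hot`, `g`, and `step` are hypothetical.

```python
import math
import random


def one_hot(n, m):
    """Vector of length n with a 1 in position m (1-based) and 0 elsewhere."""
    return [1.0 if i == m - 1 else 0.0 for i in range(n)]


def g(x, a=1.0):
    """Sigmoid activation of Equation 5 with gain a."""
    return 1.0 / (1.0 + math.exp(-a * x))


def step(weights, state, symbol):
    """One assumed second-order update: S_i <- g(sum_jk W[i][j][k] S_j I_k)."""
    return [g(sum(w_i[j][k] * state[j] * symbol[k]
                  for j in range(len(state))
                  for k in range(len(symbol))))
            for w_i in weights]


N, K = 8, 3  # neurons; input vectors for r0, r1, and the end-symbol
random.seed(0)
# Untrained, random weights W[i][j][k]; a trained network would replace these.
weights = [[[random.gauss(0.0, 1.0) for _ in range(K)]
            for _ in range(N)] for _ in range(N)]

R0, R1, END = one_hot(K, 1), one_hot(K, 2), one_hot(K, 3)
state = one_hot(N, 1)                # one-hot initial state for one source
for symbol in [R1, R0, R1, END]:     # header r1 r0 r1 followed by the end-symbol
    state = step(weights, state, symbol)
output = state[:4]                   # read only after the end-symbol is applied
print(output)
```

The output is examined only after the end-symbol has been applied, and only the first four neurons are read, one per processor. With untrained weights the values carry no meaning; the sketch shows the data flow only.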
One approach is to start with as many switch neurons as reasonably possible; if training is successful, then reduce the number of neurons using a destructive heuristic (Omlin & Giles, 1993).

The SLRNN will work in the following manner. The input to the SLRNN will be an initial state vector and a series of input vectors. The initial state vector and the first input vector will be applied to the SLRNN at time step one, the second input vector will be applied at time step two, and so on. Note that the last input vector will always correspond to the end-symbol. The outputs of the neurons are not examined until the entire input has been applied. After the end-symbol is entered, gradient descent techniques can be used to train the SLRNN so that the actual output gets closer to the desired output. There are many techniques that can be used to help the SLRNN correctly handle all of the training data.

The state vectors that are created through the use of the gradient descent techniques are not examined during training, but they are examined after training to describe the structure of the underlying SRIN. As mentioned before, these states actually represent switches in the SRIN. Since the neurons have a continuous activation
function, it is generally necessary to use some partitioning and clustering techniques to make a group of states equivalent. After this is done, an ASSM can be extracted from the SLRNN. The extracted ASSM might not be minimal, but in this case standard SSM minimization techniques can be used.

5.2 A Training Example with Analysis

Perhaps the easiest way to describe the mapping to the SLRNN would be to describe the structure of the SLRNN given data for the SRIN from Figure 4. We will choose to have $N = 8$ neurons in the SLRNN. Since one-hot codes will be used, four initial state vectors are necessary, corresponding to the four source processors that are present in the data table. In reality, however, it should be remembered that the state vectors in the SLRNN will actually correspond to switches from the SRIN, rather than processors. Thus, when we choose state vectors for the four source processors, we are really choosing state vectors for their respective designated switches (as shown in Equation 3). Let $h_{n,m}$ be a vector of dimension $n$ that has a value "1" in position $m$ and a value "0" in all of the other positions. For example, $h_{5,2} = [0, 1, 0, 0, 0]^T$.

When the source processor is processor $p_0$ (which is equivalent to saying that the designated switch is $q_0$, see Equation 3) the initial state vector will be $h_{8,1}$. Similarly, source processor $p_1$ will make the initial state vector $h_{8,3}$, source processor $p_2$ will make the initial state vector $h_{8,4}$, and source processor $p_3$ will make the initial state vector $h_{8,5}$. This mapping of source processors to initial state vectors can be made arbitrarily. There is no reason not to use vector $h_{8,2}$, for example. This vector went unused only because of the way the training data was generated.

Now the input symbols, $r_0$ and $r_1$, as well as the end-symbol, $\varepsilon$, can be mapped to input vectors. There are three symbols that must be mapped to input vectors, so we will have three vectors of length three for this purpose.
Let $r_0$ correspond to input vector $h_{3,1}$, $r_1$ correspond to input vector $h_{3,2}$, and $\varepsilon$ correspond to input vector $h_{3,3}$.

Finally, the processors can now be mapped to output vectors. There are eight neurons in the SLRNN, but only the outputs of four of them are needed for the output vectors. Therefore, the output values will only be taken from neurons 0, 1, 2, and 3. Destination processor $p_0$ will correspond to output vector $h_{4,1}$, destination processor $p_1$ will correspond to output vector $h_{4,2}$, destination processor $p_2$ will correspond to output vector $h_{4,3}$, and destination processor $p_3$ will correspond to output vector $h_{4,4}$.

Now that the specifics of the encoding have been explained, we can use the training data to train the SLRNN. The data table contained an entry for every header up to length 11 (including the end-symbol) for every processor. The data table starts with
strings of length one and concludes with strings of length 11. The SLRNN does not try to learn all of the data in this table simultaneously. Rather, it first learns the first 20 lines of the data table. Then the resulting SLRNN is checked against the rest of the data table to see how it generalizes. If perfect generalization does not occur, the SLRNN adds 20 more lines to its training data. Once these 40 lines are learned, generalization is checked again. This process is repeated until all of the lines in the data table have been learned. This heuristic approach, which involves incremental expansion of the training data, has proven to be quite successful in practice.

For our experiment, by the time the SLRNN learned the first 280 lines of the table, the SLRNN was able to generalize for all of the remaining strings in the table. The full table had 8188 lines.

5.3 Extraction of the Interconnection Network from the Trained Neural Network

It was mentioned in Section 5.1 that this problem can be thought of as the problem of learning several separate SSMs, in this case four. Given this fact, one can examine the four separate SSMs that are generated from the four different initial states.

Table 2 shows the unminimized SSM with initial state corresponding to switch $q_0$ that was extracted from the SLRNN. This machine shall be called M1. Details on the method of SSM extraction that was used can be found in (Giles et al., 1992; Giles et al., 1992). The left column (S) contains the number of each state. State "1" in Tables 2 to 9 corresponds to the initial state. The next column (O) contains the output value associated with that state. The following two columns contain the next state given input zero (NS0) and given input one (NS1). The final column (QR) contains the quantized representation for the state in the SLRNN. It should be kept in mind that the SLRNN actually uses real valued state vectors, as does the clustering algorithm that was used for SSM extraction.
The SSM extraction algorithm makes use of a real valued representative vector for each state. In the SLRNN, whenever a state vector is "close" to a representative state vector, it is assumed that the two vectors implement the same state. To simplify our analysis, the real valued representative state vectors are quantized. Any value greater than or equal to 0.5 in a representative state vector is set to 1 after quantization, while any value less than 0.5 is set to 0. Henceforth, when state vectors are mentioned, we will actually be talking about these quantized representative state vectors. Thus, in general, the state vectors in Table 2 actually represent some volume of the state space.

The unminimized SSMs given designated switches $q_2$, $q_3$, and $q_4$ are shown in Tables 3 (machine M2), 4 (machine M3), and 5 (machine M4), respectively.

Once the unminimized SSMs and their state vector representations are known, the
S    O    NS0   NS1   QR
1    1    2     3     10000000
2    4    4     5     01111100
3    2    4     6     00100001
4    1    7     7     11000011
5    4    8     9     11010100
6    4    4     5     01011100
7    3    10    6     00010101
8    4    11    9     11001101
9    2    4     2     00100101
10   1    6     3     10000001
11   4    11    9     11011101

Table 2: Machine M1, the unminimized SSM with initial state corresponding to switch $q_0$. Column S contains the state. Column O contains the output. Column NS0 contains the next state given input 0. Column NS1 contains the next state given input 1. Column QR contains the quantized representation for the state in the SLRNN.

S    O    NS0   NS1   QR
1    2    2     3     00100000
2    1    4     4     11000011
3    4    2     5     01011110
4    3    6     7     00010101
5    4    8     9     11010100
6    1    7     10    10000101
7    4    2     5     01011100
8    4    11    12    11001101
9    2    2     7     01100101
10   2    2     7     00100001
11   4    11    12    11011101
12   2    2     13    00100101
13   4    2     5     01111100

Table 3: Machine M2, the unminimized SSM with initial state corresponding to switch $q_2$.
S    O    NS0   NS1   QR
1    3    2     3     00010000
2    1    3     4     10000101
3    4    5     6     01011100
4    2    5     3     00100001
5    1    7     7     11000011
6    4    8     9     11010100
7    3    2     10    00010101
8    4    11    12    11001101
9    2    5     3     01100101
10   4    5     6     01010100
11   4    8     9     11011101
12   2    5     13    00100101
13   4    5     6     01111100

Table 4: Machine M3, the unminimized SSM with initial state corresponding to switch $q_3$.

S    O    NS0   NS1   QR
1    4    2     3     00001000
2    1    4     4     11000011
3    4    5     6     11000101
4    3    7     8     00010101
5    4    9     10    11001101
6    2    2     11    01100101
7    1    11    12    10000101
8    4    2     13    01010100
9    4    9     10    11011101
10   2    2     14    00100101
11   4    2     13    01011100
12   2    2     11    00100001
13   4    5     6     11010100
14   4    2     13    01111100

Table 5: Machine M4, the unminimized SSM with initial state corresponding to switch $q_4$.
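The minimization of an unminimized machine such as M1 in Table 2 can be sketched as follows. This is a minimal sketch of standard Moore-machine partition refinement, offered as an illustration and not necessarily the procedure used by the authors. States are first grouped by output, and groups are then split until states in the same group also agree on the groups of their successors. The names `M1`, `refine`, and `merged` are hypothetical.

```python
# Machine M1 from Table 2: state -> (output, next state on 0, next state on 1).
M1 = {1: (1, 2, 3), 2: (4, 4, 5), 3: (2, 4, 6), 4: (1, 7, 7),
      5: (4, 8, 9), 6: (4, 4, 5), 7: (3, 10, 6), 8: (4, 11, 9),
      9: (2, 4, 2), 10: (1, 6, 3), 11: (4, 11, 9)}


def refine(machine):
    """Return a map from each state to the id of its equivalence class."""
    block = {s: out for s, (out, _, _) in machine.items()}  # group by output
    while True:
        # Signature: own block plus the blocks of both successors.
        sig = {s: (block[s], block[n0], block[n1])
               for s, (_, n0, n1) in machine.items()}
        ids = {key: i for i, key in enumerate(sorted(set(sig.values())))}
        new_block = {s: ids[sig[s]] for s in machine}
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block


classes = {}
for state, block_id in refine(M1).items():
    classes.setdefault(block_id, []).append(state)
merged = sorted(sorted(c) for c in classes.values())
print(merged)
```

For M1 the refinement stabilizes at six classes, {1, 10}, {2, 6}, {3, 9}, {4}, {5, 8, 11}, and {7}. These match the groups of quantized state vectors listed for the six states of the minimal machine in Table 6.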
S    O    NS0   NS1   QR
1    1    2     3     10000000, 10000001
2    4    4     5     01011100, 01111100
3    2    4     2     00100001, 00100101
4    1    6     6     11000011
5    4    5     3     11001101, 11010100, 11011101
6    3    1     2     00010101

Table 6: The minimal SSM with initial state corresponding to switch $q_0$. This SSM is equivalent to machine M1 in Table 2.

SSMs can be minimized and merged. Table 6 contains the minimized SSM that is extracted from the SLRNN when the initial state vector corresponds to designated switch $q_0$. Each state in the minimized machine can be associated with one or more state vectors. For example, state 1 of the SSM in Table 6 is associated with two state vectors, 10000000 and 10000001.

The minimal SSMs with initial states corresponding to designated switches $q_2$, $q_3$, and $q_4$ are shown in Tables 7, 8, and 9, respectively. With the information in Tables 6 to 9, we can merge the SSMs into an ASSM. States in the SSMs with any overlapping vector representations are assumed to be equivalent. The results show that the SLRNN has indeed "merged" the SSMs.

In fact, if we ignore the quantized representations of the states, the four minimal SSMs are equivalent except for the fact that they have different initial states. That is, if we ignore the initial states, the four SSMs can be relabeled so that they are identical.

It should be clear how Table 6 corresponds to the SRIN in Figure 4. State 1 corresponds to switch $q_0$, state 2 corresponds to switch $q_4$, state 3 corresponds to switch $q_2$, state 4 corresponds to switch $q_1$, state 5 corresponds to switch $q_5$, and state 6 corresponds to switch $q_3$. Simple observation will show that the other three SSMs also correspond to the SRIN in Figure 4.

Table 10 merges this information to show how the SLRNN represents the switches of Figure 4. State vectors are shown and the machines that utilized them are presented.

Examination of Table 10 shows that, for the most part, machines M1, M2, M3, and M4 make use of state vectors that are approximately equal. For example, all four
S    O    NS0   NS1   QR
1    2    2     3     00100000, 00100001, 00100101, 01100101
2    1    4     4     11000011
3    4    2     5     01011100, 01011110, 01111100
4    3    6     3     00010101
5    4    5     1     11001101, 11010100, 11011101
6    1    3     1     10000101

Table 7: The minimal SSM with initial state corresponding to switch $q_2$. This SSM is equivalent to machine M2 in Table 3.

S    O    NS0   NS1   QR
1    3    2     3     00010000, 00010101
2    1    3     4     10000101
3    4    5     6     01010100, 01011100, 01111100
4    2    5     3     00100001, 00100101, 01100101
5    1    1     1     11000011
6    4    6     4     11001101, 11010100, 11011101

Table 8: The minimal SSM with initial state corresponding to switch $q_3$. This SSM is equivalent to machine M3 in Table 4.
S    O    NS0   NS1   QR
1    4    2     3     00001000, 01010100, 01011100, 01111100
2    1    4     4     11000011
3    4    3     5     11000101, 11001101, 11010100, 11011101
4    3    6     1     00010101
5    2    2     1     00100001, 00100101, 01100101
6    1    1     5     10000101

Table 9: The minimal SSM with initial state corresponding to switch $q_4$. This SSM is equivalent to machine M4 in Table 5.

switch   QR         machine
q0       10000000   M1
         10000001   M1
         10000101   M2, M3, M4
q1       11000011   M1, M2, M3, M4
q2       00100000   M2
         00100001   M1, M2, M3, M4
         00100101   M1, M2, M3, M4
         01100101   M2, M3, M4
q3       00010000   M3
         00010101   M1, M2, M3, M4
q4       00001000   M4
         01010100   M3, M4
         01011100   M1, M2, M3, M4
         01011110   M2
         01111100   M1, M2, M3, M4
q5       11000101   M4
         11001101   M1, M2, M3, M4
         11010100   M1, M2, M3, M4
         11011101   M1, M2, M3, M4

Table 10. How the SLRNN represents the switches from the SRIN in Figure 4.
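The machine column of Table 10 can be recovered from the QR columns of Tables 2 to 5 by recording, for each quantized vector, the machines in which it appears. The sketch below does this and also shows the 0.5 quantization threshold and the Hamming distance between two vectors. It is an illustration only; the names `quantize`, `hamming`, `QR`, and `users` are hypothetical, and the grouping of vectors by switch in Table 10 is not recomputed here.

```python
def quantize(vector):
    """Set values >= 0.5 to 1 and values < 0.5 to 0, as a bit string."""
    return "".join("1" if v >= 0.5 else "0" for v in vector)


def hamming(u, v):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(a != b for a, b in zip(u, v))


# QR columns of Tables 2 to 5, in state order.
QR = {
    "M1": ["10000000", "01111100", "00100001", "11000011", "11010100",
           "01011100", "00010101", "11001101", "00100101", "10000001",
           "11011101"],
    "M2": ["00100000", "11000011", "01011110", "00010101", "11010100",
           "10000101", "01011100", "11001101", "01100101", "00100001",
           "11011101", "00100101", "01111100"],
    "M3": ["00010000", "10000101", "01011100", "00100001", "11000011",
           "11010100", "00010101", "11001101", "01100101", "01010100",
           "11011101", "00100101", "01111100"],
    "M4": ["00001000", "11000011", "11000101", "00010101", "11001101",
           "01100101", "10000101", "01010100", "11011101", "00100101",
           "01011100", "00100001", "11010100", "01111100"],
}

# For each quantized vector, the machines that use it.
users = {}
for machine, vectors in QR.items():
    for vector in vectors:
        users.setdefault(vector, []).append(machine)

print(users["00010101"], users["10000000"], len(users))
print(hamming("01011100", "01111100"))
```

Nineteen distinct quantized vectors occur across the four machines, which is the number of rows in Table 10. The two vectors that machine M1 uses for switch q4 differ in a single bit.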
machines use the state region 00010101 to represent switch $q_3$ from Figure 4.

Another interesting fact is that the state vectors that represent equivalent states tend to be close to each other in the state space. That is, the state vectors for equivalent states tend to have small Hamming distances from one another. For example, machine M1 uses two state regions to represent switch $q_4$, 01011100 and 01111100. This fact leads one to believe that the SLRNN actually uses something like a cohesive sub-space for each state. It seems likely that the unminimized SSMs that were extracted have equivalent states due to the nature of the SSM extraction algorithm that is used. If other clustering approaches were used (Das & Mozer, 1994; Watrous & Kuhn, 1992; Zeng et al., 1993), it is possible that minimal SSMs could have been extracted directly from the SLRNN. Furthermore, values such as 0.49 and 0.51 will have different values after quantization, although they are in fact quite close in the state space. Thus, two state vectors that are quite close in terms of Euclidean distance (before quantization) might be quite far from each other after quantization. This is another possible explanation for the fact that the extracted SSM was not minimal.

Another observation is that the SLRNN did not learn anything approaching the one-hot solution that was described in Section 5. In fact, the only state vectors that were one-hot after quantization were the initial state vectors that were forced upon the SLRNN. However, while these initial state vectors were not returned to, they did seem to give the SLRNN a bias on its state representation.

Finally, the fact that the correct SRIN was extracted means that good generalization for strings longer than those in the training data has been achieved.

6 Conclusions

A radical approach to the construction of interconnection networks has been presented.
This approach uses training data from an unknown interconnection network to teach a RNN to generate an interconnection network that is capable of routing the training data.

It was shown that this problem maps directly to the problem of learning a SSM with several distinct initial states. The proposed approach took advantage of previous work on the use of RNNs to infer SSMs. However, it should be noted that the relationship between interconnection networks and SSMs might also allow for some non-neural network approaches to be used for the same problem. It seems likely that such methods could be varied slightly to accommodate the interconnection network inference problem, just as the RNN method for SSM inference can be varied slightly to perform interconnection network inference.
It was demonstrated that given a table of training data, it is possible to use a RNN to generate the structure of an interconnection network that is capable of routing the training data. Furthermore, the interconnection network that is generated might be able to generalize for inputs that are not in the training data. A sample problem was used to illustrate the methodology.

This work clearly pointed out the need for further research into the use of RNNs to infer larger SSMs. To date, RNNs have had limited success for large problems in grammatical inference, but some recent results are promising (Giles & Horne, 1994).

The concept of learning interconnection networks is an unusual one for the interconnection network community. It remains to be seen whether such learning approaches will become a useful method for interconnection network design.

7 Acknowledgement

The authors would like to acknowledge useful discussions with Sanjeev Kulkarni and Cliff B. Miller.

References

Angluin, D. (1978). On the complexity of minimum inference of regular sets. Information and Control, 39, 337-350.

Brown, T. X. (1989). Neural networks for switching. IEEE Communications Magazine, 27 (11), 72-81.

Brown, T. X. & Liu, K.-H. (1990). Neural network design of a Banyan network controller. IEEE Journal on Selected Areas in Communications, 8 (8), 1428-1438.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1, 372-381.

Das, S. & Mozer, M. (1994). A hybrid clustering/gradient descent architecture for finite state machine induction. In Cowan, J., Tesauro, G., & Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann.

Funabiki, N., Takefuji, Y., & Lee, K. C. (1991). A neural network model for traffic controls in multistage interconnection networks. In Proceedings of the International Joint Conference on Neural Networks 1991, (pp. A898).
Funabiki, N., Takefuji, Y., & Lee, K. C. (1993). Comparisons of seven neural network models on traffic control problems in multistage interconnection networks. IEEE Transactions on Computers, 42 (4), 497-501.
Giles, C. L. & Horne, B. G. (1994). Some large DFA's are learnable with recurrent neural networks. NEC Research Institute, Inc., Princeton, NJ.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4 (3), 393-405.
Giles, C. L., Miller, C. B., Chen, D., Sun, G. Z., Chen, H. H., & Lee, Y. C. (1992). Extracting and learning an unknown grammar with recurrent neural networks. In Moody, J., Hanson, S., & Lippmann, R. (Eds.), Advances in Neural Information Processing Systems 4, (pp. 317-324). San Mateo, CA: Morgan Kaufmann Publishers.
Giles, C. L. & Omlin, C. W. (1993). Extraction, insertion and refinement of symbolic rules in dynamically-driven recurrent neural networks. Connection Science, 5 (3,4), 307-337. Special Issue on Architectures for Integrating Symbolic and Neural Processes.
Gold, E. M. (1978). Complexity of automaton identification from given data. Information and Control, 37, 302-320.
Goudreau, M. W. (1993). Neural Network Applications for Interconnection Networks. PhD thesis, Princeton University, Princeton, NJ.
Goudreau, M. W. & Giles, C. L. (1992). Routing in random multistage interconnection networks: Comparing exhaustive search, greedy and neural network approaches. International Journal of Neural Systems, 3 (2), 125-142.
Goudreau, M. W. & Giles, C. L. (1993). Discovering the structure of a self-routing interconnection network with a recurrent neural network. In Alspector, J., Goodman, R., & Brown, T. X. (Eds.), Proceedings of the International Workshop on Applications of Neural Networks to Telecommunications, (pp. 52-59). Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Goudreau, M. W., Giles, C. L., Chakradhar, S. T., & Chen, D. (1994). First-order vs. second-order single layer recurrent neural networks. To be published in IEEE Transactions on Neural Networks.
Hakim, N. Z. & Meadows, H. E. (1990). A neural network approach to the setup of the Benes switch. In Infocom 90, (pp. 397-402).
Hillis, W. D. (1990). Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D, 42, 228-234.
Hopcroft, J. E. & Ullman, J. D. (1979). Introduction to Automata Theory, Languages, and Computation. Reading, MA: Addison-Wesley Publishing Company, Inc.
Kearns, M. & Valiant, L. (1989). Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing. ACM Press.
Kohavi, Z. (1978). Switching and Finite Automata Theory (second ed.). New York, NY: McGraw-Hill, Inc.
Lang, K. (1992). Random DFA's can be approximately learned from sparse uniform examples. In Proceedings of the Fifth ACM Workshop on Computational Learning Theory, (pp. 45-52). ACM Press.
Lee, S.-L. & Chang, S. (1993). Neural networks for routing of communication networks with unreliable components. IEEE Transactions on Neural Networks, 4 (5), 854-863.
Marrakchi, A. M. & Troudet, T. (1989). A neural net arbitrator for large crossbar packet-switches. IEEE Transactions on Circuits and Systems, 36 (7), 1039-1041.
Melsa, P. J. W., Kenney, J. B., & Rohrs, C. E. (1990a). A neural network solution for call routing with preferential call placement. In Proceedings of the 1990 Global Telecommunications Conference, (pp. 1377-1382).
Melsa, P. J. W., Kenney, J. B., & Rohrs, C. E. (1990b). A neural network solution for routing in three stage interconnection networks. In Proceedings of the 1990 International Symposium on Circuits and Systems, (pp. 483-486).
Mozer, M. C. & Bachrach, J. (1990). Discovering the structure of a reactive environment by exploration. Neural Computation, 2 (4), 447-457.
Mozer, M. C. & Bachrach, J. (1991). SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learning, 7 (2/3), 139-160.
Omlin, C. W. & Giles, C. L. (1993). Pruning recurrent neural networks for improved generalization performance. Technical Report TR 93-6, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY.
Pitt, L. & Warmuth, M. (1989). The minimum DFA consistency problem cannot be approximated within any polynomial. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing. ACM Press.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7 (2/3), 227-252.
Rivest, R. L. & Schapire, R. E. (1987a). Diversity-based inference of finite automata. In Proceedings of the Twenty-Eighth Annual Symposium on Foundations of Computer Science, (pp. 78-87).
Rivest, R. L. & Schapire, R. E. (1987b). A new approach to unsupervised learning in deterministic environments. In Langley, P. (Ed.), Proceedings of the Fourth International Workshop on Machine Learning.
Schapire, R. E. (1988). Diversity-based inference of finite automata. Master's thesis, Massachusetts Institute of Technology, Cambridge, MA.
Siegel, H. J. (1990). Interconnection Networks for Large Scale Parallel Processing. New York: McGraw-Hill.
Takefuji, Y. & Lee, K. C. (1991). An artificial hysteresis binary neuron: A model suppressing the oscillatory behavior of neural dynamics. Biological Cybernetics, 64, 353-356.
Thomopoulos, S. C. A., Zhang, L., & Wann, C. D. (1991). Neural network implementation of the shortest path algorithm for traffic routing in communication networks. In Proceedings of the International Joint Conference on Neural Networks 1991, (pp. 2693-2702). Singapore.
Trakhtenbrot, B. & Barzdin, Y. (1973). Finite Automata: Behavior and Synthesis. Amsterdam: North-Holland Publishing Company.
Troudet, T. P. & Walters, S. M. (1991). Neural network architecture for crossbar switch control. IEEE Transactions on Circuits and Systems, 38 (1), 42-56.
Watrous, R. L. & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4 (3), 406-414.
Williams, R. J. & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1 (2), 270-280.
Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5 (6), 976-990.
