Reconfigurable communication networks for massively parallel multiprocessor systems make it possible to meet a number of application demands, such as special communication patterns or real-time requirements. This paper presents the design principle of a reconfigurable network which is able to realize any graph of maximal degree four. The architecture is based on a special multistage Clos network, constructed out of a number of static routing switches of equal size. Upper bounds on the cut size of 4-regular graphs that are split into a number of clusters allow us to minimize the number of switches and connections while still offering the desired reconfiguration capabilities as well as large scalability and flexible multiuser access. Efficient algorithms for configuring the architecture are based on an old result by Petersen [27] on the decomposition of regular graphs. The concept presented here is the basis for the Parsytec SC series of reconfigurable MPP systems. The currently largest realization, with 320 processors, is presented in greater detail.
Introduction
Over the last few years, distributed memory parallel computers have evolved from simple systems connecting processors directly by buses or integrated communication links to highly complex systems using a number of communication networks for different functionalities, e.g. basic communication, I/O, and synchronization. This progress is mainly supported and even promoted by the development of specialized communication switches providing advanced features like randomization, adaptive routing or pipelining. Today's distributed memory parallel systems are usually built by connecting a number of processors with associated local memory via at least one communication network consisting of routing switches. A network constructed from dynamic routing switches provides communication paths between [...] reconfiguration during runtime has to be compared to its costs, not only in terms of the time needed for reconfiguration but also in terms of the additional hardware costs which are certainly necessary for providing this ability [19, 20]. Therefore, systems of this kind are often used as general purpose parallel computing systems which are able to realize user specific processor topologies. Of course, such systems are usually not able to realize every possible communication structure directly, but very often applications get along with a communication graph of bounded (small) degree [20].
Reconfigurable MIMD systems, designed as general purpose parallel computers, can also be used as real-time systems if the communication requirements of the real-time part of the application are fully realized by the architecture without any routing. In this case, the communication time depends only on the communication startup time and the transfer time, which itself depends only on the amount of transmitted data. As these numbers can be predicted exactly, the real-time behavior of an application can be guaranteed.
A reconfigurable MIMD system with up to 64 processors was presented by Hromkovič and Monien [13]. It is constructed out of four 96 × 96 crossbar switches, with hardware requirements minimized by using upper bounds on the bisection width of graphs of degree k. The system was manufactured and commercialized by Parsytec Computer, currently one of the largest European producers of parallel computing systems. A different approach to the design of a reconfigurable MIMD architecture based on the Transputer processor was taken during the ESPRIT project P1085 [26, 31]. This architecture is based on the decomposition of a regular network of degree four into two regular networks of degree two [27], which can then be realized by four permutation networks.
The aim of this paper is to describe the architecture of a reconfigurable MIMD system able to realize any network of degree up to four. The architecture is based on a special Clos network and combines the following features:
- Scalability to any number of processors
- Minimized hardware requirements
- Logarithmic diameter
- Efficient configuration algorithms
- Multi-user capability
- Integration of external devices
The design of the architecture had to meet industrial demands. For the realization of the general concept presented in this paper, it is assumed that each processing node has up to four communication links, which makes it necessary to investigate the reconfiguration of regular networks of degree four. The general technique can also be applied to any larger number of communication links per processing node, but will be demonstrated here for Transputer systems, which have four communication links per node. In the following, we describe the basic design principle of the network architecture in greater detail and show its realization for an existing commercial system, the Parsytec SC320 (Sec. 2).
Architecture
The architecture of a circuit switching network can be classified by several characteristics. According to Broomell and Heath [5], one is the connection capability of the communication switches used. They distinguish single sided and double sided switches. Single sided (or one sided) switches consist of one set of connections which are all treated equally. Any port of these switches can be connected to any other port. Since every port may be both source and destination of messages, the ports must be bidirectional. A double sided (or two sided) $m \times m$ switch connects $m$ input ports to $m$ output ports. Such a switch is able to realize any permutation of connections between its input and output ports. The distinction between single sided and double sided switches is not always clear. In fact, many single sided networks, e.g. telephone networks, are constructed out of double sided switches.
Multistage Interconnection Networks
The architecture presented in this paper is closely related to the Clos network [5, 6], which originates from the design of telephone networks. A Clos network with $n = m \cdot k$ inputs and outputs typically consists of three stages of switches [...]. Throughout the rest of this paper, we will study reconfigurable networks of degree four. The architecture presented here to realize these networks is a multistage interconnection network which, like the Clos network, is built in stages, but is constructed out of single sided switches. In fact, it is a folded complementary Beneš network constructed out of single sided switches of equal size (cf. Fig. 2). As in the complementary Beneš network, the first and last stages are recursively decomposed. The architecture uses a minimized number of switches. It is based on assumptions that will be discussed in greater detail in Section 3:
Claim 1 Almost every graph of degree 4 can be partitioned into subsets of size $p$ in such a way that each of these subsets has at most $2p$ connections to other subsets (external edges).
Using this claim, we can recursively develop the architecture in a top-down fashion by partitioning the processor network into a number of subsets, each containing $p$ processors and $2p$ edges to other subsets. To connect the external edges of each subset, we use a result of Petersen [27], proven in 1891:
Fact 1 (Petersen [27])
The edges of a regular graph of degree $2d$ can be partitioned to form $d$ regular graphs of degree 2.
If we treat each subset of $p$ processors as a node of a $2p$-regular supergraph, we can partition this supergraph into $p$ graphs of degree 2 using Petersen's result. Each of these $p$ graphs contains all nodes of the supergraph. Thus, $p$ switches, each having 2 links to each subset of processors, are sufficient to connect all external links.
Claim 1 and Fact 1 allow the following construction (cf. Fig. 3): Connect $k$ subnetworks, each containing $p$ processors and $2p$ external links, by $p$ switches. In detail, the outgoing links numbered $2i-2$ and $2i-1$ of each subnetwork are connected to the $i$-th switch (for any $i \in \{1, \dots, p\}$).
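The wiring rule of this construction can be written down as a small table generator. The following is only an illustrative sketch: the function name `top_stage_wiring` and the numbering of subnetworks from 0 are assumptions made for the example, while the link numbers $2i-2$ and $2i-1$ follow the text.

```python
def top_stage_wiring(k: int, p: int) -> dict[int, list[tuple[int, int]]]:
    """Return, for every switch i in 1..p, the (subnetwork, link) pairs wired to it.

    Each of the k subnetworks has 2p external links numbered 0..2p-1.
    Links 2i-2 and 2i-1 of every subnetwork go to switch i.
    """
    wiring: dict[int, list[tuple[int, int]]] = {}
    for i in range(1, p + 1):
        ports = []
        for s in range(k):  # hypothetical numbering: subnetworks 0..k-1
            ports.append((s, 2 * i - 2))
            ports.append((s, 2 * i - 1))
        wiring[i] = ports
    return wiring


# Small example: 3 subnetworks with p = 2, i.e. 4 external links each.
w = top_stage_wiring(k=3, p=2)
for switch, ports in w.items():
    print(switch, ports)
```

Every switch ends up with exactly two links to each subnetwork, so a switch with $2k$ ports is enough to realize one of the degree-2 graphs of Fact 1.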
This principle has a number of important features influencing the reliability, usability, and scalability of the overall architecture. First of all, the architecture is to some extent fault tolerant as far as the communication capability is concerned. This is due to the large number of parallel paths between any two subnetworks. Concerning the usability of parallel computer architectures, multiuser access is a very important property. Our construction allows the partitioning of the machine into physically independent subnetworks. These partitions can be assigned to different users without the risk of disturbing each other. Even if different users share a communication switch, the applications do not influence each other, since the switches provide independent routing paths for each user. Thus, we can realize a very flexible partitioning of the resources to provide multi-user access. One of the most important features of the machine is its good scalability. On the one hand, the general principle allows scalability to any number of processors. On the other hand, even for a very small number of stages (resulting in very small routing times), a large number of processors can be connected using existing routing switches. This will be further described in the next section. [...] n − 2 switches. Thus we are able to connect more processors (factor 4/3) with fewer switches (factor 3/2) with the presented architecture, using the same number of stages. Compared to the network presented by Nicol et al. [26], our architecture has a much smaller [...]
Embedding of Networks into the Architecture
To use a reconfigurable architecture like the one presented above, the user has to describe the communication pattern to be realized for the application. In our case, this is a graph of degree four. Two problems have to be solved in order to map a regular network of degree four onto the architecture. First, the network has to be partitioned into subsets of at most $m/6$ processors such that each set has at most $m/3$ external edges. Afterwards, the external edges have to be placed onto the switches of the different stages.
Partitioning
In the following lemma we give an upper bound on the number of external edges if the network is partitioned into subsets of size $l$.
Lemma 1 Let $G = (V, E)$ be a regular graph of degree 4. $G$ can be partitioned into subsets of size $l$ in such a way that each subset has at most $2l + 4$ external edges.
Proof: As $G$ is regular of degree 4, it contains an Eulerian cycle. The edges along this cycle can be colored black and white in turn. Both subsets of equally colored edges form a subgraph of degree two, i.e., both subgraphs consist of cycles covering every node of the graph. If the nodes are assigned to clusters of size $l$ by walking along the black cycles one after another, each cluster consists of complete black cycles and at most two paths of black edges (cf. Fig. 6), which contribute at most 4 external black edges. If all $2l$ white edges of a cluster are external as well, then each subset has a total of at most $2l + 4$ external edges. □
Figure 6: Assigning cycles to clusters
Lemma 1 implies a bound of $m/3 + 4$ external edges if we partition a network into clusters of size $m/6$. Considering our architecture, which is mainly based on Claim 1, we have to reduce the number of external edges by 4 to find a partitioning with at most $m/3$ external edges. We claim that this is always possible for graphs of limited size, since the proof of Lemma 1 does not take the white edges into account. Extensive experiments using a heuristic optimization technique (cf. Sec. 3.3) have shown that $m/3$ external edges are always sufficient, at least for 4-regular graphs of up to 640 nodes. However, results from graph theory show that this is not possible in general. Note that, if a cluster of size $l$ has at most $2l$ external edges, then it has at least $l$ internal edges and therefore contains a cycle of length at most $l$. It is known [4] that for every $l > 0$, there exist some $n = n_l$ and some graph of degree four with $n_l$ nodes whose shortest cycle has length larger than $l$. The numbers $n_l$ arising from the construction of Bollobás [4] are very large. Our system in its current realization contains 320 processors in clusters of size $m/6 = 16$ (cf. Sec. 3.3). It is easy to see [4] that a 4-regular graph with 320 nodes has a cycle of length smaller than 12. We conjecture that every graph of degree four with at most 320 nodes can be partitioned into clusters of size 16 in such a way that every cluster has at most 32 external edges. However, [...] ports of the switches are used. These additional links can be used for fault tolerance purposes.
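The statement about short cycles in 4-regular graphs with 320 nodes can be checked with the classical Moore bound, which gives the minimum number of nodes a $d$-regular simple graph of a given girth must have. The sketch below is not taken from the paper; the function names are chosen for illustration only. It also relies on the counting step behind the remark on internal edges: a cluster of $l$ nodes of degree 4 has $4l$ edge endpoints, so at most $2l$ external edges leave at least $2l$ endpoints, i.e. at least $l$ edges, inside the cluster.

```python
def moore_bound(d: int, g: int) -> int:
    """Minimum number of nodes of a d-regular simple graph with girth g."""
    r = g // 2
    s = sum((d - 1) ** i for i in range(r))
    # Odd girth: breadth-first tree rooted at a node.
    # Even girth: breadth-first tree rooted at an edge.
    return 1 + d * s if g % 2 else 2 * s


def max_possible_girth(n: int, d: int) -> int:
    """Largest girth that the Moore bound does not rule out for n nodes."""
    g = 3
    while moore_bound(d, g + 1) <= n:
        g += 1
    return g


print(moore_bound(4, 12))          # nodes needed for girth 12
print(max_possible_girth(320, 4))  # upper bound on the girth for 320 nodes
```

Under these assumptions, girth 12 would need at least 728 nodes, so every 4-regular simple graph with 320 nodes contains a cycle shorter than 12, in line with the text.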
Mapping of External Edges
The second step of mapping a user network onto the architecture requires an assignment of the external edges of each subset to the switches of the different stages. We treat each subnetwork $S_{i-1}$ as a node of a regular graph with degree $2d$, for some $d$. The assignment makes use of a result shown by Petersen in 1891, which was already mentioned in Fact 1:
Lemma 2 (Petersen [27]) The edges of a $k$-regular graph $G$, $k$ even, can be partitioned to form $k/2$ regular graphs of degree 2.
Proof ([27]): As $G$ is $k$-regular and $k$ is even, the graph contains an Eulerian cycle. Color the edges along this cycle black and white in an alternating way. Then, each node is incident to exactly $k/2$ edges of each color. Partition the graph into two graphs, the "black" and the "white" graph. Both graphs are $k/2$-regular. A recursive application of the described method delivers the desired result. □
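The Euler-cycle argument of the proof translates into a short recursive procedure. The sketch below is an illustration under stated assumptions, not the implementation used in the system: it assumes that $k$ is a power of two (so that every component has an even number of edges and the alternating coloring closes up consistently), that the graph is given as a list of node pairs, and the names `euler_split` and `two_factors` are hypothetical.

```python
from collections import defaultdict


def euler_split(edges):
    """Color the edges along an Euler circuit of every component alternately
    and return the two color classes (assumes all node degrees are even)."""
    adj = defaultdict(list)  # node -> list of (neighbour, edge id)
    for eid, (u, v) in enumerate(edges):
        adj[u].append((v, eid))
        adj[v].append((u, eid))
    used = [False] * len(edges)
    halves = ([], [])
    for start in list(adj):
        # Hierholzer's algorithm, collecting the circuit as a sequence of edges.
        stack, circuit = [(start, None)], []
        while stack:
            node, via = stack[-1]
            while adj[node] and used[adj[node][-1][1]]:
                adj[node].pop()  # discard edges already traversed
            if adj[node]:
                nxt, eid = adj[node].pop()
                used[eid] = True
                stack.append((nxt, eid))
            else:
                stack.pop()
                if via is not None:
                    circuit.append(via)
        for pos, eid in enumerate(circuit):
            halves[pos % 2].append(edges[eid])
    return halves


def two_factors(edges, k):
    """Decompose a k-regular (multi)graph, k a power of two, into k/2
    regular graphs of degree 2 by repeated Euler splitting."""
    if k == 2:
        return [edges]
    black, white = euler_split(edges)
    return two_factors(black, k // 2) + two_factors(white, k // 2)


# Example: the complete graph on 5 nodes is 4-regular.
k5 = [(i, j) for i in range(5) for j in range(i + 1, 5)]
for part in two_factors(k5, 4):
    print(part)
```

Each recursion level touches every edge once, so for $k$ a power of two the total work is proportional to the number of edges times $\log k$.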
Petersen's algorithm computes the decomposition in linear time for all $k$-regular graphs where $k$ is a power of 2. The result was generalized by König in 1935 [17], who presented an algorithm for the partitioning of arbitrary regular graphs with complexity $O(n^2)$. His result was improved in 1982 by Gabow and Kariv [11], who described an algorithm with time complexity $O(n \log n)$. After a decomposition of the $2d$-regular graph of stage $i-1$ into $d$ graphs of degree 2 has been found, each of the $d$ switches at stage $i$ can be used to realize one of these parts.
3.3. The Configuration Software of the SC320
In order to realize a user requested network, the switches of the architecture are configured before the program is executed. As described at the beginning of Section 3, two problems have to be solved to configure the system: the network has to be split into subsets of size 16 with not more than 32 external edges, and the external edges have to be assigned to switches of the top stage. The assignment of edges uses Petersen's algorithm described in the proof of Lemma 2. It treats each cluster of stage 0 as a node of degree 32 and decomposes the resulting graph into 8 regular graphs of degree 4. Each of the 8 switches in stage 1 realizes one of these graphs.
As mentioned in Section 3.1, the problem of partitioning a graph into clusters of size $l$ with at most $2l$ external links is not solvable in general. The configuration software of the SC320 uses Simulated Annealing [16] as a heuristic approach to find an appropriate partition of user requested networks [7]. Simulated Annealing (SA) is a probabilistic optimization technique based on local search which allows deteriorations of the cost function. This enables the algorithm to escape from locally optimal solutions, which often leads to an improved solution quality compared to deterministic local search methods. The configuration software should be fast but does not have to minimize the total number of external edges. It is sufficient to find a partition into clusters of size 16 with at most 32 external edges. Thus, we define the following particular combinatorial optimization problem:
Definition 1 ($\lambda$-Configuration Problem) Given a graph $G = (V, E)$, find a partition $P = \{P_0, \dots, P_k\}$ with $P_i \cap P_j = \emptyset$ for all $i \neq j$, $\bigcup_{i=0}^{k} P_i = V$, $|P_i| = \lambda$ for all $i \in \{1, \dots, k\}$, and $|P_0| \le \lambda$, minimizing
[...]
where $\delta(x) = \max(0, x)$ and $\mathrm{ext}(P_i) = |\{\{v, w\} \in E : v \in P_i, w \notin P_i\}|$. The cost function expresses the requirements of the architecture. It counts the external edges of a cluster only as long as their number exceeds $2\lambda$. Additionally, it shows how graphs are configured if $|V|$ is not a multiple of $\lambda$. $P_0$ is a so-called splitter cluster containing at most $\lambda$ nodes. The external edges of $P_0$ are minimized in total because the remaining processors of this cluster are supposed to be assigned to another user. This increases the multi-user capabilities (cf. Sec. 3.4). For the existing system, we set $\lambda = 16$.
The simulated annealing algorithm uses a hand-optimized cooling schedule and a swap neighborhood, i.e. in each step it chooses two nodes from two arbitrary clusters and tries to exchange them [7]. The algorithm is able to partition arbitrary graphs of degree 4 and size 320 according to the described configuration problem in about 10 s on a normal front-end workstation (Sparc-Station SS10). Extensive tests with random graphs as well as standard networks have shown that all test instances with 512 nodes can be partitioned as required by the algorithm. Tests with larger graphs, now minimizing the total number of external edges per cluster, have shown that 32 external edges are sufficient for (nearly) all 4-regular graphs with up to 576 nodes. Graphs with 768 nodes can be partitioned (with our algorithm) at a rate of 97%. Thus, for systems of size larger than 576, the architecture has to be modified as described in Sec. 3.1. Table 2 shows the maximum number of external links per cluster that are necessary if the graphs are partitioned into clusters of size 16. The entries show mean values of the maximal numbers of external edges from tests on several hundred randomly generated 4-regular graphs.
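To make the optimization problem concrete, the following sketch shows one cost function that is consistent with the verbal description (external edges of the splitter cluster counted in full, external edges of every other cluster counted only above twice the cluster size) together with a plain swap-based annealing loop. Both are assumptions for illustration: the exact cost function of the paper is not reproduced here, and the geometric cooling schedule with parameters `t0` and `alpha` is a stand-in for the hand-optimized schedule mentioned in the text.

```python
import math
import random


def ext(cluster, edges):
    """Number of edges with exactly one endpoint inside the cluster."""
    inside = set(cluster)
    return sum((u in inside) != (v in inside) for u, v in edges)


def cost(partition, edges, size):
    """Assumed cost: partition[0] is the splitter cluster P_0."""
    splitter, *full = partition
    return ext(splitter, edges) + sum(
        max(0, ext(c, edges) - 2 * size) for c in full
    )


def anneal(partition, edges, size, steps=2000, t0=2.0, alpha=0.995, seed=0):
    """Swap-neighbourhood annealing; returns the best partition found."""
    rng = random.Random(seed)
    current = [list(c) for c in partition]
    cur_cost = cost(current, edges, size)
    best, best_cost = [list(c) for c in current], cur_cost
    t = t0
    for _ in range(steps):
        nonempty = [i for i, c in enumerate(current) if c]
        if len(nonempty) < 2 or best_cost == 0:
            break
        a, b = rng.sample(nonempty, 2)
        i, j = rng.randrange(len(current[a])), rng.randrange(len(current[b]))
        current[a][i], current[b][j] = current[b][j], current[a][i]
        new_cost = cost(current, edges, size)
        # Always accept improvements; accept deteriorations with
        # probability exp(-(increase) / temperature).
        if new_cost <= cur_cost or rng.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = [list(c) for c in current], cur_cost
        else:
            current[a][i], current[b][j] = current[b][j], current[a][i]  # undo
        t *= alpha
    return best, best_cost


# Toy instance: 4-regular circulant graph on 8 nodes, clusters of size 2.
edges = [(i, (i + 1) % 8) for i in range(8)] + [(i, (i + 2) % 8) for i in range(8)]
start = [[], [0, 4], [1, 5], [2, 6], [3, 7]]
print(cost(start, edges, 2), anneal(start, edges, 2)[1])
```

Recomputing the full cost after every swap keeps the sketch short; a practical implementation would update only the two clusters affected by the swap.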
3.4. Multi-User Access to the SC320
The system allows multi-user access in a space sharing mode. It can be configured to arbitrary user networks even if parts of it are already in use. Any sequence of connection requests can be realized at stage 1. The configuration software partitions a requested network as described and assigns it to a number of complete subsets of the system. If the network size is not divisible by 16, the splitter cluster $P_0$ is assigned to a subset that is partly in use according to a best-fit strategy. Note that $P_0$ can only be assigned to a partly occupied subset if enough free processors are available and if the number of free links to stage 1 is sufficient.
For this reason, the number of external edges of splitter clusters is always minimized in total (cf. Sec. 3.3). Practical experience with up to 24 user entries shows that the number of processors blocked due to non-available external connections for clusters which are shared between different users is very low (below 1 percent). The total waste due to non-fitting splitters and non-available links to stage 1 ranges from 2 to 5 percent of the system size.
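The placement rule for splitter clusters can be illustrated with a few lines of code. This is a hedged sketch only: the text names a best-fit strategy and the two feasibility conditions (free processors and free links to stage 1), whereas the interpretation of "best" as "fewest processors left over", the tuple layout, and the name `best_fit` are assumptions.

```python
def best_fit(splitter_size, splitter_ext, subsets):
    """Pick the subset that fits the splitter cluster most tightly.

    subsets: list of (free_processors, free_links_to_stage_1) per subset.
    Returns the index of the chosen subset, or None if none is feasible.
    """
    candidates = [
        (free_p - splitter_size, idx)
        for idx, (free_p, free_l) in enumerate(subsets)
        if free_p >= splitter_size and free_l >= splitter_ext
    ]
    return min(candidates)[1] if candidates else None


# A splitter with 5 processors and 8 external edges:
# subset 2 has too few free links, subset 1 leaves the fewest processors unused.
print(best_fit(5, 8, [(16, 32), (6, 10), (5, 4), (9, 20)]))
```

The example also shows why keeping the external edges of the splitter cluster small matters: a subset with enough free processors can still be rejected because of missing links.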
Conclusions
We have described the architecture of a fully reconfigurable static switching network. This architecture is suitable for any number of processors. It has logarithmic diameter and a sufficiently large number of parallel paths between any two clusters of processors. This [...]
For single sided switches with 96 links and 4 parallel links between switches, we presented a realization for up to 384 processors by a two-stage network. This architecture has been realized by Parsytec Computer for 320 processors, integrating two additional I/O clusters, and is in daily use. The basic design principles of the architecture are based on bounds on the cut size of 4-regular graphs that are split into a number of parts, and on methods to decompose the edges of a $2k$-regular graph to form $k$ 2-regular graphs.
The principle presented here can also be used for the cost-efficient realization of regular networks of larger, even degree. Additionally, it is a good candidate for dynamic switching networks because of its short diameter and large number of parallel paths between any two nodes in the network.
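The figure of 384 processors can be retraced from the parameters given in the paper. The arithmetic below is a sketch under one assumption that is not stated explicitly in the text: the 4 parallel links are taken to run between each cluster switch and each top-stage switch. The function name and the dictionary keys are illustrative.

```python
def two_stage_capacity(m=96, links_per_pair=4, node_degree=4):
    """Size of a two-stage network built from single sided switches with m links."""
    cluster_size = m // 6   # processors per cluster switch
    external = m // 3       # links from a cluster switch to the top stage
    # A cluster switch must hold all processor links plus the external links.
    assert cluster_size * node_degree + external == m
    top_switches = external // links_per_pair  # switches in the top stage
    clusters = m // links_per_pair             # clusters per top-stage switch
    return {
        "cluster_size": cluster_size,
        "top_switches": top_switches,
        "clusters": clusters,
        "processors": clusters * cluster_size,
    }


print(two_stage_capacity())
```

With these numbers a top-stage switch serves 24 clusters, which leaves room for the 20 processor clusters of the 320-processor system and its two I/O clusters.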
