Packaging technologies impose various physical constraints on bisection bandwidth, pinout, and channel width of a system, whereas processor and interconnect technologies lead to a certain demanded throughput on the network bisection. Earlier studies in the literature have proposed hierarchical and clustered interconnections by considering the effect of limited packaging constraints. Pinout technologies and capacity of packaging modules have been ignored, often leading to configurations which are not design-feasible. In this paper, we solve this design problem by proposing a new supply-demand optimization framework. This generalized framework uses a parameterized representation of processor board area, pinout technologies (periphery or surface), channel width, and channel speed. The family of flat k-ary n-cube topologies and their clustered variations (k-ary n-cube cluster-c) are evaluated to derive optimal configurations which can lead to cost-effective design of scalable parallel systems using wormhole routing. The analysis identifies processor board area, channel width, and pinout density as critical parameters. The study indicates that cluster-based parallel systems can deliver better performance with lower cost. It is predicted that optimal configurations for future systems will be cluster-based (2-10 processors per cluster) with 3D/4D/5D inter-cluster interconnection. The framework is general enough to capture technological trends of future years. The framework is validated by the design solutions of current machines using contemporary technologies.
Introduction
Rapid developments in the fields of processor, interconnect, and packaging technologies make the efficient design of large multiprocessor systems a difficult task [1, 2, 7, 11, 12, 14, 17]. Design guidelines for deriving the best interconnection have to take changes in these technologies into account to yield the best results. Several previous studies have considered packaging constraints [3] while selecting the best system configuration. These studies include Dally's [11] analysis of k-ary n-cube interconnection under a VLSI model with constant bisection bandwidth, and Abraham and Padmanabhan's study [1] under a constant pinout from a processing node. Agrawal's [2] analysis of the same class of networks considers three different constraints: constant bisection width, constant channel width, and constant pinout, and is based on a more comprehensive model of node and wire delays. However, Ranade [23, 24] and Yew [15] argued that neither Dally's VLSI model with limited bisection bandwidth nor the limited pinout model proposed in [1, 2] is adequate for very large systems. Both models are confined to only one level of packaging hierarchy, whereas large systems typically employ several levels of packaging.
It was demonstrated in [24] that considering two-level hierarchical/clustered systems widens the scope of the design space. The architectural levels can be chosen to closely match the packaging hierarchy, leading to better designs. A variety of two-level hierarchical configurations have been proposed by researchers in the past to build scalable systems. Though a larger number of levels might be more general, it is commonly believed that most systems in the near future will fit in two hierarchies. Moreover, the techniques which work for two-level hierarchies can be easily extended to design more levels. Examples of previous work in this area include two-level systems based on hypercube and other network topologies [12, 20, 24], MINs and n-hop networks [24], two-level systems with k-ary n-cube topologies [4], and combinations of bus and mesh/hypercube networks [14]. However, most of these analyses did not take into account packaging and interconnection constraints. Guidelines developed in [14, 24] were based only on fixed board sizes with fixed pinouts, and did not consider changes in board sizes and alternate pinout technologies. The impact of changes in processor and interconnect technologies in relation to packaging technology was also not studied.
Analogously, in our analysis we focus on two-level clustered architectures based on a k-ary n-cube cluster-c [4, 6] interconnection. These are clustered extensions of the flat k-ary n-cube systems. Our design model is more comprehensive and flexible in terms of considering packaging and interconnect constraints. Among other things, we allow varying board sizes, only reasonable channel widths, and flexible pinout from a board, which depends on its size. System design under alternate pinout technologies like periphery and surface is also explored. The increase in level of integration with time in terms of smaller chips, larger wiring boards, and wiring and pinout densities is also captured, and the impact of each on system design is studied.
In this paper we propose a new framework, as shown in Fig. 1, for designing and developing clustered architectures. We consider a packaging technology with two levels: the lower level made up of chips/multi-chip modules and the upper level made up of processor wiring boards. We analyze the problem of designing an efficient system with N processors using two levels. Theoretically, a large number of alternate configurations are possible to build such a system. The configurations which conform to the packaging restrictions, such as maximum board size, the pinout technology being used (peripheral or surface), and the pinout densities, are defined as design-feasible configurations. Only the design-feasible configurations can translate into real machines. Each design-feasible configuration offers a level of performance depending on its architectural characteristics. We refer to the set of design-feasible configurations and their performance characteristics, that is, what the packaging technology offers, as the supply side. On the other hand, sustaining processor performance in terms of throughput places a demand on the size of the required bisection bandwidth [4, 24] of the system. The design-feasible configurations which offer performance greater than what is demanded are defined as good configurations. Among the good choices, the one which provides the desired performance at optimal cost is defined as the best configuration. Our goal is to derive this best topology. Simulation modeling is also employed to determine exact performance to decide between very close good candidates. The paper is organized as follows. In Sec. 2 we present the two-level k-ary n-cube cluster-c architecture used in later discussion. Processor speeds, communication link speeds, and the performance demand are discussed in Sec. 3. In Sec. 4 we discuss the trends in growth of processor board sizes, alternate pinout technologies, and channel width technology. In Sec. 5, expressions for offered channel width and bisection bandwidth and the design-feasible configurations under different packaging, processor, and interconnect technologies are derived. In Sec. 6 we derive the good configurations and discuss important considerations in choosing the best configuration. Finally, concluding remarks and future work are presented in Sec. 7.
2 Two-Level Clustering with k-ary n-cube cluster-c Organization
Many current parallel systems like the CRAY T3D [10], Intel Paragon [16], and the Stanford DASH [13] are taking a two-level clustering approach. Recently, we have introduced a new k-ary n-cube cluster-c organization [4, 5, 21] to capture this upcoming trend in building scalable parallel systems. In this organization, the lower level consists of $k^n$ clusters of processors. These clusters are interconnected by a higher-level direct k-ary n-cube network (also referred to as the inter-cluster network or internet). Each cluster consists of $c$ processors, leading to a total of $N = k^n \cdot c$ processors in the system. This interconnection achieves two main design objectives: a) the direct internet provides easy scalability and b) processor clusters allow easy exploitation of communication locality. Figure 2 shows the overall configuration of such a system.
Figure 2: Two-level clustering with k-ary n-cube cluster-c organization. (Labels: inter-cluster network, a k-ary n-cube; cluster interface (CI); intra-cluster network, bus/MIN/direct; c-processor clusters.)
The interconnection within a cluster (also referred to as the intra-cluster network or intranet) can be chosen as a bus, MIN, star network, or direct network, as shown in Fig. 3. Each cluster is connected to the rest of the system through a cluster interface. The main task of the cluster interface is to handle the volume of communication to/from the cluster. Other functionalities may be added to the cluster interface to efficiently implement various communication, synchronization, and cache-coherence operations to enhance overall system performance [13, 21]. However, such discussion is beyond the scope of this paper.
The memory in the system is distributed physically across the clusters to increase overall memory bandwidth. Organization of memory within a cluster is left as an open choice depending on the size of the cluster and its configuration. The exact nature of this distribution is not critical to the analysis presented in this paper. Therefore, without loss of generality, we assume the cluster memory to be distributed uniformly amongst the processors in the cluster.
Architectural Alternatives
To build an N-processor system, there is a vast number of possible alternatives. The degrees of freedom are the number of processors (size) in each cluster and the topologies of the two levels (inter-cluster and intra-cluster). Let us consider designing a system with N = 1024 processors. This system can be designed with 64 clusters of 16 processors each, 16 clusters of 64 processors each, and so on. Note that for a given system size, fixing the size of one level automatically determines the size of the other level. Having fixed the size of each level, there is still freedom to vary the topology in each level. For example, in a system with 64 clusters of 16 processors each, the topologies can be: 4-ary 3-cube internet with bus-based clusters, 8-ary 2-cube internet with MIN-based clusters, and so on. Our objective is to select the configuration which offers the best system performance at minimum cost.
It is easy to observe that the flat k-ary n-cube can be derived as a special case of the k-ary n-cube cluster-c family by choosing cluster size $c = 1$. Since inter-cluster interconnections are more expensive than intra-cluster interconnections [24], in this paper we primarily emphasize the internet and the cluster size. In the following sections, we develop a supply-demand optimization framework to derive optimal k-ary n-cube cluster-c organizations ($c \geq 1$). Table 1 provides a summary of the main symbols and notations used in this paper.
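The size of this design space can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function name `cluster_configurations` is hypothetical, the inter-cluster network is assumed symmetric (the same radix k >= 2 in every dimension), and the intra-cluster topology is not modeled.

```python
def cluster_configurations(total_processors):
    """Enumerate (k, n, c) with k**n * c == total_processors, k >= 2, n >= 1.

    Assumption: symmetric k-ary n-cube internet; c is the cluster size.
    """
    configs = []
    for c in range(1, total_processors + 1):
        if total_processors % c:
            continue  # cluster size must divide the system size
        clusters = total_processors // c
        n = 1
        while 2 ** n <= clusters:
            k = round(clusters ** (1.0 / n))
            # Check neighbouring integers to guard against float rounding.
            for candidate in (k - 1, k, k + 1):
                if candidate >= 2 and candidate ** n == clusters:
                    configs.append((candidate, n, c))
            n += 1
    return configs


if __name__ == "__main__":
    for k, n, c in cluster_configurations(1024):
        if c == 16:
            print(f"{k}-ary {n}-cube cluster-{c}")
```

For N = 1024 and c = 16 the enumeration includes the 4-ary 3-cube and 8-ary 2-cube internets mentioned in the text.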
Demand on Network Bisection Bandwidth
In this section we first analyze the impact of processor speed on the rate of injection of traffic into the interconnection network. We then show the impact of communication link technology on the bisection bandwidth of a system. The demanded bisection size of the system to support given processor speed and communication link speed technologies is then presented.
Effect of Interconnect Speed
The bisection size of a network is defined as the minimum number of wires that need to be cut in order to divide the network into two equal parts [11]. It limits the number of bits that can cross from one half of the network to the other. For example, in a k-ary n-cube interconnection with $W$-bit wide channels, the bisection size can be derived as $4k^{n-1}W$ [11]. Let $t$ be defined as the cycle time, the time in seconds to transfer a bit across a wire. In other words, a set of $W$ parallel wires can transfer $W$ bits simultaneously in time $t$. The bisection size of a network therefore corresponds to a bisection bandwidth of (bisection size)$/t$ bits/sec. A reduction in $t$ allows more data to be sent across any wire in a given time and increases the bisection bandwidth for the same bisection size.
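As a quick numerical illustration of these two relations, the following Python sketch evaluates the bisection size $4k^{n-1}W$ and the corresponding bisection bandwidth. The function names and the sample values are assumptions chosen for illustration only.

```python
def bisection_size(k, n, channel_width):
    """Bisection size (in wires) of a k-ary n-cube with W-bit channels: 4 * k**(n-1) * W."""
    return 4 * k ** (n - 1) * channel_width


def bisection_bandwidth(k, n, channel_width, cycle_time):
    """Bisection bandwidth in bits/sec: bisection size divided by the cycle time t."""
    return bisection_size(k, n, channel_width) / cycle_time


if __name__ == "__main__":
    # Hypothetical example: 8-ary 2-cube, 24-wire channels, 10 ns cycle time.
    print(bisection_size(8, 2, 24))
    print(bisection_bandwidth(8, 2, 24, 10e-9))
```

Halving the cycle time doubles the bandwidth for the same bisection size, which is the effect of interconnect speed described above.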
Characterizing Demanded Bisection Size
Given that each processor injects messages into its cluster network at $D$ bits/sec, let us derive the expressions for the minimal demanded bisection bandwidth in the internet and intranet to sustain the associated communication requirement. Let the traffic rate to be supported across the bisection in the internet (upper-level network) be $B^d_u$ and that in each intranet (lower-level network) be $B^d_l$. Let $p$ be the probability that a generated request does not go outside its lower-level network. The effective injection rate of a cluster into the inter-cluster network can then be derived as $D_{clus} = cD(1 - p)$ bits/sec. This leads to a combined total injected traffic of $(D_{clus} N/c)$ into the internet.
Under the uniform traffic assumption, a message generated by a processor is equally likely to be destined for any other processor in the system. Thus, $p$ can be derived as $c/N$, which approaches 0 for large systems with small cluster sizes. It also indicates that, on average, half the messages are destined to clusters on the other half of the system and thus need to cross the network bisection.
Thus, the minimal bisection bandwidth demanded in the inter-cluster network can be derived as:
$$B^d_u = \frac{1}{2} \cdot \frac{D_{clus} N}{c} = \frac{N D (1 - p)}{2} \approx \frac{N D}{2}$$
All expressions in this section have been simplified by assuming $p \to 0$. A localized traffic model would yield $p > c/N$, leading to a smaller demanded bisection bandwidth $B^d_u$. However, while designing a general-purpose machine without prior knowledge of the nature of the applications to be executed on it, a uniform traffic model is considered more representative [24].
The demand on bisection bandwidth inside a cluster can be divided into three components, shown as flows $f_1$, $f_2$, and $f_3$ in Fig. 4. The flow $f_1$ corresponds to the local traffic across the bisection and is given by $cDp/2$. The flow $f_2 = cD(1 - p)/2$ is the fraction that goes to the upper level across the bisection, and the flow $f_3 = cD(1 - p)/2$ is a similar contribution coming from the upper level that crosses the bisection. This leads to a minimal bisection bandwidth demand in the intra-cluster network as:
$$B^d_l = f_1 + f_2 + f_3 = \frac{cD(2 - p)}{2} \approx cD$$
Figure 4 (labels): upper-level network (interconnecting N/c lower networks); cluster interface; lower-level networks, each interconnecting c processors.
To derive representative values of in current and future systems, we consider predictions made by Patterson [22]. As shown in Table 2, the future values of will broadly lie in the range of 4-10. To support larger clusters, the intra-cluster bisection bandwidth should scale linearly with size $c$. Since packaging constraints are less rigid in lower hierarchies, it is possible to provide thicker channels/buses for higher bandwidth [24]. Hence, in this paper we focus on the minimum demanded bisection size in the upper level. The demanded bisection is parameterized with respect to thus capturing advances in both processor and interconnection speeds. In the following section we compare this demanded bisection size with that being offered by the packaging technology.
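The demand-side quantities of this section can be put together in a small Python sketch. It assumes, as described above, that half of the traffic injected into the internet crosses the inter-cluster bisection and that the intra-cluster demand is the sum of the three flows $f_1$, $f_2$, and $f_3$. The function name and the sample numbers are hypothetical.

```python
def demanded_bandwidth(N, c, D, p=None):
    """Return (D_clus, B_u, B_l) in bits/sec for an N-processor system.

    N: total processors, c: cluster size, D: injection rate per processor.
    p: probability that a request stays inside its cluster; defaults to
       the uniform-traffic value c / N used in the text.
    """
    if p is None:
        p = c / N
    d_clus = c * D * (1 - p)            # effective injection rate of one cluster
    total_injected = d_clus * N / c     # traffic injected into the internet
    b_upper = total_injected / 2        # assumed: half crosses the bisection
    f1 = c * D * p / 2                  # local traffic across the cluster bisection
    f2 = c * D * (1 - p) / 2            # traffic leaving for the upper level
    f3 = c * D * (1 - p) / 2            # traffic arriving from the upper level
    b_lower = f1 + f2 + f3
    return d_clus, b_upper, b_lower


if __name__ == "__main__":
    d_clus, b_upper, b_lower = demanded_bandwidth(N=1024, c=16, D=100.0)
    print(d_clus, b_upper, b_lower)
```

As $p \to 0$ the two demands approach $ND/2$ and $cD$, so the intra-cluster demand grows linearly with the cluster size.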
Modeling and Parameterization of Packaging Technologies
The packaging of high performance systems has an extreme impact on their performance. It is quite often the case that system-level design is dictated by the available packaging technology. We first present a model to capture hierarchical packaging schemes. We then focus on a two-level packaging with the first level as clusters inside a board and the second as across the boards. Current trends in board and pinout technologies are investigated and a general scheme is presented to parameterize the architectural characteristics with respect to these technologies.
Hierarchical Packaging Model
Some earlier models proposed to capture packaging constraints, like the VLSI model with limited bisection bandwidth as proposed by Dally [11] and the limited pincount model as proposed by Agrawal [2]
This generalized packaging model can capture and characterize a wide range of packaging hierarchies and technologies by appropriately choosing the parameters. With respect to our two-level k-ary n-cube cluster-c organization shown in Fig. 2, we consider a two-level packaging hierarchy. The lower-level modules consist of processors inside a cluster. The dimensions of this cluster-module depend on the size of the cluster and the level of processor-memory integration. Boards represent the upper-level modules. Depending on the cluster size and board area, multiple clusters may be put onto a single board. The inter-cluster network may therefore be partly intra-board and partly across boards. We do not allow a single cluster to span multiple boards. In this research we study the interplay of the cluster-modules and board-modules on overall system design as a function of cluster size, board size, and technological advances.
Parameterizing Packaging Technologies
In this section we discuss current trends in board and pinout technologies. Then we present a general scheme in which architectural characteristics like bisection bandwidth and channel width are parameterized with respect to the board and pinout technologies.
Processor board technology
Processor boards cannot be arbitrarily large in size. Their size is restricted by electrical, mechanical, and board fabrication constraints. Let us define the area required by a processor chip, its local memory, and a per-processor fraction of the network and other interface logic to be $a$. We treat this area as a square with a perimeter of $4\sqrt{a}$ and measure all board sizes in terms of $a$. Given a board of size $(ba)$, we can accommodate up to $b$ processors on it. This can represent $b' = b/c$ clusters with $c$ processors in each cluster. For the sake of simplicity, in further discussions we drop the area unit $a$ and refer to a board area of $(ba)$ as simply $b$.
Over the years board sizes have grown in physical dimensions. We now have systems like the MIT J-machine [19] which put 64 processors on a board. However, the J-machine uses fine-grained nodes [9] with an integrated processor and router organization. The memory provided per processor is also small compared to other contemporary commercial systems and research prototypes. The current trend in building large parallel systems is to use off-the-shelf commodity processors and to provide a few megabytes of memory per processor. Hence, for illustrating our framework we consider board sizes that hold a maximum of 16 processors. To build a system with $N$ processors we need a total board area $\geq N$. However, board area being precious [6], we suggest using a total board area $= N$. All boards used in a system are assumed to be of the same size, with $1 \leq b \leq 16$. The total number of boards used in designing an N-processor system is defined as $N_{boards} = N/b$. Another important consideration in the design problem is the on-board bisection bandwidth, $B_b$. This parameter reflects the maximum number of wires that can be laid across the bisection of the board. Multi-layered wiring is commonly employed to support a higher bisection bandwidth density.
Pinout Technologies
The pin-count $P_b$ out of a board has a direct influence on the data volume that can flow in/out of a given processor board. Currently two different types of technologies are being employed by the computer industry: periphery pinout and surface pinout. It is natural to expect that the future trend will be towards a mixed pinout technology employing a convenient combination of surface and peripheral pinouts. The pincount under mixed technology may be expressed as a function $f(p_p, p_s)$ of the periphery and surface pinout densities. The exact nature of this function is difficult to predict as it depends on specific fabrication advancements. However, reasonable suggestions may be a max function or a weighted mean. Without loss of generality, in this paper we present our results with respect to the peripheral and surface technologies. The results for a mixed pinout technology can be derived by simple extension. Figure 5 shows the growth of pincount with board area under the respective surface and peripheral technologies for different relationships between $p_s$ and $p_p$. The vertical clipping line in each graph reflects that the board size cannot grow continuously and is limited by some maximum size.
(Figure 5: pincount plotted against board size.)
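A minimal Python sketch of how pincount might scale with board size is given below. The two scaling laws are assumptions, not taken verbatim from this section: periphery pincount is assumed to grow with the board side length, $P_b = p_p\sqrt{b}$, and surface pincount with board area, $P_b = p_s b$, which is consistent with the channel-width expression used later for periphery pinout. The `mixed` case uses the max function, one of the two suggestions mentioned above. All names are hypothetical.

```python
import math


def pincount(board_size, technology, p_p=0.0, p_s=0.0):
    """Pins available from a board of size b (in units of the per-processor area a).

    Assumed scaling: periphery -> p_p * sqrt(b), surface -> p_s * b,
    mixed -> max of the two (one of the suggestions in the text).
    """
    periphery = p_p * math.sqrt(board_size)
    surface = p_s * board_size
    if technology == "periphery":
        return periphery
    if technology == "surface":
        return surface
    if technology == "mixed":
        return max(periphery, surface)
    raise ValueError(f"unknown pinout technology: {technology}")


if __name__ == "__main__":
    for b in (1, 4, 9, 16):
        print(b, pincount(b, "periphery", p_p=256), pincount(b, "surface", p_s=64))
```

Under these assumptions the surface pincount overtakes the periphery pincount once $\sqrt{b} > p_p/p_s$.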

Channel Width Technology
Most current machines have 16-bit data channels. This corresponds to a channel width of $W \approx 24$ including control, acknowledgment, and parity wires. Many factors, like path width inside routers and connector technology, restrict channel widths from being arbitrarily large. In the near future it is expected that technologies will allow channels to carry 32-bit and 64-bit data [6, 22], corresponding to $W \approx 40$ and 72, respectively. Larger channel widths would make effective message lengths shorter, leading to lower message latencies and less contention in routing [8, 11], thus yielding better system performance. In this paper we consider design under all three channel widths.
Under-Utilization of Board Area and Pinout Capacity
It is not always necessary either to fill up a processor board to its maximum capacity or to utilize all the pins coming out from it. In our k-ary n-cube cluster-c organization, consider the $N' = N/c$ clusters to be interconnected by a generalized k-ary n-cube topology with a different radix in each dimension. This leads to $N' = k_1 \times k_2 \times \cdots \times k_n$. The interconnection is assumed to be laid out across the boards. Consider a sub-topology of the inter-cluster network, consisting of a number of clusters that together occupy a board area equal to that number times $c$, to be placed on a board. Let $P$ wires be required to connect the chosen sub-topology to other clusters outside the board. This requires the board area $b$ to be at least the occupied area and $P_b \geq P$. A perfect matching, in which $b$ equals the occupied area and $P_b = P$ so that both board capacity and pinout are fully used, is usually difficult to achieve. Choosing a board whose capacity $b$ exactly matches the occupied area may imply $P_b > P$, leading to under-utilization of pin capacity. Defining $u_p = P/P_b$, this under-utilization factor can be expressed as $(1 - u_p)$. Similarly, choosing a board with optimal pincount $P_b = P$ may leave $b$ larger than the occupied area, which reflects board area under-utilization or under-population of board estate. Defining $u_b$ as the ratio of the occupied area to $b$, the factor of under-population can be expressed as $(1 - u_b)$.
It is to be noted that under-population of board area allows fewer nodes to share the pincount from a board, making wider channels possible. The disadvantage is clearly the wastage of precious board estate, leading to a larger number of boards, greater system volume, and higher cost. In the discussion that follows we assume board area to be fully utilized unless mentioned otherwise. Under-utilization of pin capacity is used if larger board sizes are required to fit the desired topology but the resulting larger pincount cannot be utilized with wider channels, as discussed in Sec. 4.2.3. It is also possible to choose a design where both board and pinout capacity are under-utilized. This is not required, because we can always select a board size where one of the capacities is optimally utilized but not the other. Such a design rule can be captured as $(u_b = 1) \lor (u_p = 1)$.
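The two utilization factors and the design rule can be expressed in a few lines of Python. This is a sketch with hypothetical function and argument names; `occupied_area` stands for the board area taken up by the clusters actually placed on the board.

```python
def utilization_factors(board_size, pin_capacity, occupied_area, pins_used):
    """Return (u_b, u_p): board-area and pin-capacity utilization of one board."""
    if occupied_area > board_size or pins_used > pin_capacity:
        raise ValueError("sub-topology does not fit on this board")
    u_b = occupied_area / board_size
    u_p = pins_used / pin_capacity
    return u_b, u_p


def follows_design_rule(u_b, u_p):
    """Design rule (u_b = 1) or (u_p = 1): at least one capacity is fully used."""
    return u_b == 1 or u_p == 1


if __name__ == "__main__":
    # Hypothetical board: area 16, 1024 pins, fully populated, 768 pins used.
    u_b, u_p = utilization_factors(16, 1024, 16, 768)
    print(u_b, u_p, 1 - u_p, follows_design_rule(u_b, u_p))
```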
Impact of Packaging on Architectural Parameters
In this section we determine possible channel widths and bisection bandwidths to design k-ary n-cube cluster-c systems for given board, pinout, and channel width technologies. From the expressions we derive in this section for arbitrary cluster sizes, the results for flat architectures can be obtained by choosing $c = 1$.
Offered Inter-Cluster Channel Width
Let the sub-topology of the inter-cluster network $N' = N/c = k_1 \times k_2 \times \cdots \times k_n$ being placed on a board be $b' = b_1 \times b_2 \times \cdots \times b_n$, where $b_i < k_i, \forall i$, and some $b_i$'s may be 1 to depict that only one processor exists on the board along that dimension. This implies that no inter-cluster dimension can be fully contained within a board, which is reasonable to expect for large system sizes. The number of rows along dimension $i$ on a board can be derived as $r_i = b'/b_i$. Note that inter-board channels go out/come in from the two ends of each row, leading to $4r_i$ inter-board channels along dimension $i$. Hence, the total number of inter-board channels from all $n$ dimensions is derived as $4\sum_{i=1}^{n} r_i = 4b'\sum_{i=1}^{n}(1/b_i)$. To support a channel width of $W$, a total pincount of $4b'W\sum_{i=1}^{n}(1/b_i)$ is required from each board. Thus, the maximum supportable inter-board channel width from pinout restrictions, denoted by $W_p$, is derived as:
$$W_p = \frac{P_b}{4b'\sum_{i=1}^{n}(1/b_i)} \qquad (8)$$
Let us now consider the required on-board bisection size to support channels of width $W$ in the $(b_1 \times b_2 \times \cdots \times b_n)$ sub-topology on a board. Clearly, the available board bisection should be able to support the largest bisection in the sub-topology. The largest bisection in the sub-topology is the one orthogonal to the smallest dimension. However, a smallest dimension $j$ with $b_j = 1$, implying only one processor along that dimension, does not require any on-board bisection. Thus, the largest bisection occurs orthogonal to the dimension $j$ such that $b_j$ is the smallest dimension not equal to one. There are $b'/b_j$ nodes adjacent to this bisection, and assuming bidirectional channels we have $2b'/b_j$ channels crossing this bisection. With a channel width of $W$, this requires $2Wb'/b_j$ wires across the bisection. With a board bisection $B_b$, the maximum supportable intra-board channel width from board bisection restrictions, $W_b$, is given by:
$$W_b = \frac{B_b\, b_j}{2b'}$$
Thus, the channel width supported in a system is restricted by two constraints: $W_p$ from pinout and $W_b$ from board bisection. This can potentially lead to different inter-board and intra-board channel widths. For designing an architecture with uniform inter-cluster channel width, we choose $W = \min(W_p, W_b)$. With multi-layered interconnection, it is reasonable to assume the on-board connection density to be higher than the off-board density. This leads to the channel width being limited by the pinout constraint, resulting in $W \approx W_p$. Unless otherwise specified, we therefore use $W$ and $W_p$ interchangeably. In the following discussion we assume full utilization of board area and pinout resources. Design with under-utilization is analyzed later.
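The two channel-width limits derived above can be evaluated with a short Python sketch. It follows the counting argument in the text: $4b'\sum_i (1/b_i)$ inter-board channels share the board pincount, and $2b'/b_j$ channels cross the largest on-board bisection. The function names are hypothetical, and the sub-topology is passed as a tuple of per-dimension extents $(b_1, \ldots, b_n)$.

```python
import math


def pinout_channel_width(pins, dims):
    """W_p: pincount P_b divided by the 4 * b' * sum(1/b_i) inter-board channels."""
    b_prime = math.prod(dims)
    return pins / (4 * b_prime * sum(1 / d for d in dims))


def bisection_channel_width(board_bisection, dims):
    """W_b: board bisection B_b divided by the 2 * b' / b_j crossing channels.

    b_j is the smallest extent not equal to one; if every extent is one,
    no on-board bisection is needed and the limit is unbounded.
    """
    b_prime = math.prod(dims)
    non_unit = [d for d in dims if d > 1]
    if not non_unit:
        return math.inf
    b_j = min(non_unit)
    return board_bisection * b_j / (2 * b_prime)


def offered_channel_width(pins, board_bisection, dims):
    """Uniform inter-cluster channel width W = min(W_p, W_b)."""
    return min(pinout_channel_width(pins, dims),
               bisection_channel_width(board_bisection, dims))


if __name__ == "__main__":
    # Hypothetical board: 1024 pins, bisection of 512 wires, 2 x 2 clusters.
    print(offered_channel_width(1024, 512, (2, 2)))
```

In this example the pinout limit is the binding one, which matches the assumption that on-board wiring is denser than off-board connections.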
From the expression for $W_p$ in Eqn. 8, it can be observed that the channel width is determined by board size, cluster size, pinout technology from the board, and dimensionality of the inter-cluster network. It does not depend on the total system size. For a given pinout technology and board size, let us consider the impact of varying the inter-cluster dimensionality while keeping the cluster size fixed. In order to support larger dimensional networks, we need more channels. This leads to thinner channels with a fixed pinout from a board. Similarly, for a given pinout technology and board size, decreasing the cluster size while keeping the inter-cluster dimensionality fixed also leads to thinner channels. A smaller cluster size results in a larger number of clusters on a board. This implies that the pinout from the board gets shared among more clusters, leading to thinner channels. We summarize these as:
Observation 1 For a given pinout technology ($P_b$) and board size ($b$), the offered channel width falls with (a) an increase in the inter-cluster network dimensionality ($n$) while keeping the cluster size ($c$) fixed, or (b) a fall in the cluster size ($c$) while keeping the inter-cluster dimensionality ($n$) fixed.
Figure 7 shows the trend as cluster size is increased for different inter-cluster dimensionalities. In the figure each board is assumed to be large enough to hold exactly one cluster ($b' = 1$) and only channel widths up to 128 lines are shown. For a given cluster size, the offered channel width drops with increasing inter-cluster dimensionality. It can be observed that the drop in channel width is more significant at lower dimensions (from 1D to 2D) than at higher dimensions (from 3D to 4D). Therefore, for reasonably high dimensions, the channel widths are expected to be close. Similar results are also obtained for $b' > 1$. For a given message size, a wider channel reduces the effective message length in the inter-cluster network. This has the potential to reduce message contention, leading to smaller message latency.
Thus, it is desirable to choose the largest channel width supportable by technology at a given time, as discussed in Sec. 4.2.3.
Under periphery pinout technology, the offered channel width can be expressed as $W = \frac{p_p\, c^{(n-1)/n}}{4 n\, b^{(n-2)/(2n)}}$. Thus, for a given cluster size and inter-cluster dimensionality $n > 2$, an increase in board size $b$ leads to a fall in channel width. The channel width in a 2D inter-cluster network remains constant while that in a 1D network rises sharply. This is shown in Fig. 8. Under surface pinout technology, for any given cluster size and inter-cluster dimensionality, an increase in board size $b$ leads to a rise in channel width. However, this rise is not very significant for higher dimensions. Such trends are shown in Fig. 8(a) for $p_s = 64$ and a cluster size of 2. This leads to:
Observation 2 Under periphery pinout technology, keeping the cluster size $c$ fixed and the dimensionality of the inter-cluster network $n > 2$, an increase in board size causes the channel width to fall. However, under surface pinout technology, it leads to a rise in channel width for all inter-cluster dimensionalities.
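The trends in Observation 2 can be checked numerically with the following Python sketch. It rests on assumptions that are not stated explicitly in this section: a symmetric sub-topology with $b_i = (b/c)^{1/n}$ in every dimension (an idealization, since the extents need not be integers), periphery pincount $P_b = p_p\sqrt{b}$, and surface pincount $P_b = p_s b$. The function name `offered_width` is hypothetical.

```python
import math


def offered_width(b, c, n, density, technology):
    """Offered channel width from Eqn. 8 for a symmetric sub-topology.

    With b' = b / c clusters per board and b_i = b' ** (1/n), the channel
    count 4 * b' * sum(1/b_i) reduces to 4 * n * b' ** ((n - 1) / n).
    """
    if technology == "periphery":
        pins = density * math.sqrt(b)   # assumed: pins grow with board side
    elif technology == "surface":
        pins = density * b              # assumed: pins grow with board area
    else:
        raise ValueError(f"unknown pinout technology: {technology}")
    clusters_on_board = b / c
    return pins / (4 * n * clusters_on_board ** ((n - 1) / n))


if __name__ == "__main__":
    for technology, density in (("periphery", 256), ("surface", 64)):
        for n in (1, 2, 3, 4):
            widths = [round(offered_width(b, 2, n, density, technology), 1)
                      for b in (2, 4, 8, 16)]
            print(technology, f"n={n}", widths)
```

Under these assumptions the periphery width is flat for n = 2 and falls for n > 2 as the board grows, while the surface width rises for every n, with a slower rise at higher dimensions.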
Supporting a Fixed Inter-Cluster Channel Width
Most of the prior studies on system design, while proposing guidelines under different constraints like constant bisection bandwidth [11], did not impose any restrictions on the values that the channel width can take while satisfying other constraints. Supporting an arbitrary channel width is not easy because it has an impact on router design. Agrawal realized the importance of this constraint and analyzed system designs under fixed channel width [2]. For a simpler interface design, the width of the data lines in a channel is expected to maintain an integral relationship with that of the processor and memory, which are typically in multiples of a byte. In Eqn. 8, we showed that $W$ ($W_p$) gets determined once other system parameters like $n$, $b$, and $c$ are specified. Given a channel width technology with $W_0$ ($W_0$ = 24, 40, 72, ...), an obvious design objective while selecting values for $n$, $b$, and $c$ is to ensure that the offered channel width ($W$) is equal to the supportable channel width. Together with Eqn. 8, this leads us to the following set of observations on the cluster size that can be placed on a board while other parameters like inter-cluster dimensionality, pinout density, and channel width are varied:
Observation 3
For a fixed channel width ($W_0$), board size ($b$), and pinout technology, an increase in inter-cluster network dimensionality ($n$) leads to an increase in the cluster size ($c$), resulting in fewer clusters on the same board area.
For a fixed channel width ($W_0$), board size ($b$), and inter-cluster dimensionality ($n$), an increase in pinout density ($p_p$ or $p_s$, and hence $P_b$) leads to a fall in the cluster size ($c$), resulting in more clusters on the same board area.
Increasing the channel width ($W_0$) leads to a larger cluster size ($c$) while other parameters are kept constant. This leads to fewer clusters on a given board area.
While maintaining a fixed channel width ($W_0$), it may also be desired to maintain a given fixed number of clusters ($b' = b/c$) on each board. Under such a condition, supporting larger clusters also implies using larger boards to maintain a fixed $b'$. Let us consider the effect of increasing board size (or equivalently cluster size) on the inter-cluster dimensionality in Eqn. 8 while maintaining a fixed $W_0$ and $b'$. A larger board area allows more pinout per board, thus leading to an increase in $P_b$. This implies that $n$ should also increase to maintain the same channel width. This can be stated as the following observation:
Observation 4 For a fixed number of clusters on a board ($b'$) and supportable channel width ($W_0$), an increase in board (cluster) size leads to an increase in the dimensionality of the inter-cluster network.
Let us consider the impact of pinout technology on supporting a fixed channel width. Under periphery pinout technology, as discussed earlier, we need to satisfy $W_0 = \frac{p_p\, c^{(n-1)/n}}{4 n\, b^{(n-2)/(2n)}}$. Thus, for a given inter-cluster dimensionality $n > 2$ and pinout density ($p_p$), supporting a larger cluster size requires an increase in the size of the boards used. For $n = 2$, the cluster size cannot be varied because changing board sizes has no impact. Similarly, for $n = 1$, the board size cannot be varied because changing cluster size has no impact. Under surface pinout technology, we need to satisfy $W_0 = \frac{p_s\, c^{(n-1)/n}\, b^{1/n}}{4n}$. Thus, for a given $n > 1$ and pinout density ($p_s$), supporting larger cluster sizes requires a fall in the size of the boards used. For $n = 1$, the board size cannot be varied because changing cluster size has no impact. The above interplay can be summarized as the following observation:
Observation 5 In supporting a fixed channel width ($W_0$) in an inter-cluster network with a given dimensionality ($n$), for periphery pinout technology with $n > 2$, supporting a larger cluster size requires an increase in board size. In comparison, under surface pinout technology with $n > 1$, supporting a larger cluster size requires a fall in board size.
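The interplay in Obs. 5 can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes Eqn. 8 has the form $W = P_b / (4n (b/c)^{(n-1)/n})$, with $P_b = p_p \sqrt{b}$ for periphery pinout and $P_b = p_s b$ for surface pinout, and the function names are hypothetical.

```python
import math


def board_pinout(b, density, technology):
    """Pin count P_b of a board holding b processors (assumed model)."""
    if technology == "periphery":
        return density * math.sqrt(b)   # pins scale with the board edge
    if technology == "surface":
        return density * b              # pins scale with the board area
    raise ValueError(f"unknown pinout technology: {technology}")


def channel_width(n, b, c, density, technology):
    """Channel width W under the assumed form of Eqn. 8."""
    clusters_per_board = b / c
    return board_pinout(b, density, technology) / (
        4 * n * clusters_per_board ** ((n - 1) / n)
    )


if __name__ == "__main__":
    # Periphery, n = 4, p_p = 512: growing c from 1 to 2 needs b to grow from 2 to 16.
    print(channel_width(4, 2, 1, 512, "periphery"))
    print(channel_width(4, 16, 2, 512, "periphery"))
    # Surface, n = 2, p_s = 64: the larger cluster (c = 3) sits on the smaller board.
    print(channel_width(2, 8, 1, 64, "surface"))
    print(channel_width(2, 3, 3, 64, "surface"))
```

Under these assumptions the two periphery calls give the same width (up to rounding), and both surface values lie within 10% of a target width of 24.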
Realizing Designs with Under Utilization
The expression in Eqn. 8 yields a channel width $W$ assuming full utilization of board area and pinout resources. However, when building a real machine it is unlikely that both board area and pinout resources are fully utilized. To make the study more realistic, we extend the expression in Eqn. 8 to allow for under-utilization of board and pinout capacities, as discussed in sec. 4.3. Eqn. 8 can now be rewritten as,
where $P_b u_p$ is the utilized pin count of a board, $u_b b'$ is the actual number of clusters placed on a board (out of the maximum $b' = b/c$), and $W$ is the channel width obtained assuming full utilization of resources. We assume a reasonable bound on both under-utilizations such that $u_p, u_b \ge u_{min}$ for some $u_{min}$ close to 1. Thus, the observations made earlier assuming full utilization ($u_{min} = 1$) continue to hold. For illustrative purposes, in this paper, we choose $u_{min} = 0.9$. Given a pinout density and a channel width technology, we formulate the following search problem: determine the $(n, b, c)$ tuples which satisfy the inequalities of Eqn. 11.
Figure 9 shows the plots of Fig. 7 with such under-utilization tolerance lines, which provide a flexibility of 10% around the channel widths of 24, 40, and 72. The configurations which fall within these tolerance boundaries are valid or design-feasible configurations. We summarize the solutions to this search problem in Table 3 for surface pinout technology of $p_s = 64, 256$ and in Table 4 for periphery pinout technology of $p_p = 256, 512$. In both tables, three different supportable channel width technologies corresponding to $W = 24$, 40, and 72 and different allowable numbers of clusters per board are considered. The examples in Fig. 9 and Tables 3 and 4 indicate that most current and near-future systems will have 1 or 2 clusters on each board. It may also be observed that, as pinout densities increase under both surface and periphery technologies, the number of valid configurations with $b' = 4, 8$ increases. This suggests that future systems with higher pinout densities will be better able to support 4 to 8 clusters on each board.
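The search problem can be sketched as a brute-force enumeration. This is an illustration under stated assumptions rather than the procedure used to build Tables 3 and 4: the channel width is taken as $W = p_s b / (4n (b/c)^{(n-1)/n})$ (an assumed form of Eqn. 8 for surface pinout), the acceptance band is assumed to be $|W - W_0| \le (1 - u_{min}) W_0$, and the function names and search bounds are hypothetical.

```python
def channel_width_surface(n, b, c, p_s):
    """Assumed form of Eqn. 8 with surface pinout, P_b = p_s * b."""
    return p_s * b / (4 * n * (b / c) ** ((n - 1) / n))


def feasible_tuples(p_s, w_target, clusters_per_board, u_min=0.9,
                    max_n=8, max_c=16):
    """Enumerate (n, b, c) tuples whose width falls in the tolerance band.

    Assumption: the band is |W - w_target| <= (1 - u_min) * w_target.
    The exact inequalities of Eqn. 11 are not reproduced here.
    """
    tolerance = (1 - u_min) * w_target
    found = []
    for n in range(1, max_n + 1):
        for c in range(1, max_c + 1):
            b = clusters_per_board * c      # board size for b' clusters of size c
            width = channel_width_surface(n, b, c, p_s)
            if abs(width - w_target) <= tolerance:
                found.append((n, b, c))
    return found


if __name__ == "__main__":
    # One cluster per board, p_s = 64, target width 24.
    for config in feasible_tuples(64, 24, 1):
        print(config)
```

With $p_s = 64$, $W_0 = 24$, and one cluster per board, this enumeration includes $c = 3$ with $n = 2$ and $c = 13$ with $n = 8$, the two end points quoted from Table 3 in the text.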
From Table 3 note that for given values of $p_s$ (64), $W$ (24), and $b'$ (1), as the cluster size $c$ is increased from 3 to 13 the required dimensionality of the inter-cluster network also rises from 2 to 8. This confirms Obs. 4. Next consider the board sizes required for configurations under surface pinout technology corresponding to $p_s = 64$, $W = 24$, $n = 2$ with two different cluster sizes, $c = 1$ and 3. The configuration with the smaller cluster size ($c = 1$) and $b' = 8$ requires boards of size $b = 8$, while the one with the larger cluster size ($c = 3$) and $b' = 1$ requires a smaller $b = 3$. Under periphery pinout technology a different trend is noted. In Table 4 the board sizes required for configurations corresponding to $p_p = 512$, $W = 24$, $n = 4$ with two different cluster sizes $c = 1$ and 2 are $b = 2$ and 16, respectively. Therefore, as cluster size increases so does board size. This confirms Obs. 5.
Offered Inter-Cluster Bisection
Now let us consider the impact of packaging on the offered inter-cluster bisection. Given the inter-cluster network as a $(k_1 \times k_2 \times \dots \times k_n)$ mesh/torus and a fixed channel width, the size of the inter-cluster bisection can be computed in the following manner. A $(k_1 \times k_2 \times \dots \times k_n)$ mesh/torus has various bisections which divide the network into two halves. For example, in a 3D torus with x, y, and z dimensions, there are three possible bisections: one orthogonal to the x dimension along the yz plane, and so on. We are interested in the size of the smallest bisection in the system because it maximally constrains the performance of the system under random traffic. Clearly, the smallest bisection has to be orthogonal to the dimension having the largest radix, given by $k_{max} = \max_{i=1..n}(k_i)$. The number of nodes in such a bisection is given by $N'/k_{max}$. Assume each of these nodes is connected to another across this bisection using bidirectional channels of width $W$. The offered inter-cluster bisection is then
$$B_u^s = \frac{4 N'}{k_{max}} W = \frac{N_{boards}^{(n-1)/n} P_b}{n} \qquad (12)$$
which is simplified by assuming the inter-cluster network is regular, i.e. $k_{max} = (N/c)^{1/n}$, and using Eqn. 8 to replace $W$ and $N_{boards} = N/b$.
Under periphery pinout technology with $P_b = p_p \sqrt{b}$, Eqn. 12 reduces to $B_u^s = \frac{N^{(n-1)/n} p_p}{b^{(n-2)/(2n)} n}$. Thus, for a given system size ($N$) and inter-cluster network dimensionality ($n$), the size of the bisection falls with an increase in board size for $n > 2$. This can be explained by the fact that the total board pin count in the system falls as board sizes are increased. For $n = 2$, the value of $B_u^s$ remains fixed, and for $n = 1$ the value rises slowly. The key point is that under this pinout technology it is ideal to work with smaller boards, as they offer larger bisection sizes and hence higher performance. Under surface pinout technology with $P_b = p_s b$, Eqn. 12 reduces to $B_u^s = \frac{N^{(n-1)/n} p_s b^{1/n}}{n}$. Thus, for a given system size ($N$) and inter-cluster network dimensionality ($n$), the size of the bisection increases with board size for all $n$.
These observations can be summarized as:
Observation 6 Under periphery pinout technology, for a given system size ($N$), keeping the dimensionality of the inter-cluster network ($n$) fixed at a value greater than 2, an increase in board size ($b$) leads to a fall in the inter-cluster bisection size. However, under surface pinout technology, as board size ($b$) is increased, bisection size increases for all inter-cluster dimensionalities.
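The trends in Obs. 6 can be illustrated with a short numerical sketch. It assumes the offered bisection has the form $N_{boards}^{(n-1)/n} P_b / n$ with $N_{boards} = N/b$, $P_b = p_p \sqrt{b}$ for periphery pinout, and $P_b = p_s b$ for surface pinout; the function name is hypothetical.

```python
import math


def offered_bisection(N, n, b, density, technology):
    """Offered inter-cluster bisection, N_boards^((n-1)/n) * P_b / n (assumed form)."""
    if technology == "periphery":
        pins = density * math.sqrt(b)
    elif technology == "surface":
        pins = density * b
    else:
        raise ValueError(f"unknown pinout technology: {technology}")
    boards = N / b
    return boards ** ((n - 1) / n) * pins / n


if __name__ == "__main__":
    N = 1024
    for n in (1, 2, 3, 4):
        # Compare a small board (b = 4) with a larger one (b = 16).
        periphery = [offered_bisection(N, n, b, 256, "periphery") for b in (4, 16)]
        surface = [offered_bisection(N, n, b, 64, "surface") for b in (4, 16)]
        print(n, periphery, surface)
```

For the periphery case the printed pair falls with board size when $n > 2$, stays constant at $n = 2$, and rises at $n = 1$; for the surface case it rises for every $n$.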
Next, let us define $V_w$ as the volume (number) of inter-board wires required to interconnect a desired inter-cluster configuration. This parameter can be derived as $V_w = N_{boards} P_b / 2$, where $N_{boards} = N/b$, $P_b$ is the pinout from each board, and each wire connects two pins. Under periphery pinout technology, $V_w$ simplifies to $\frac{N p_p}{2\sqrt{b}}$, and under surface pinout technology to $\frac{N p_s}{2}$. From these two expressions we conclude:
Observation 7 For a given system size, the volume of inter-board wires falls with an increase in board size under periphery pinout technology, while under surface pinout technology it remains fixed.
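As a quick check of Obs. 7, the wire volume $V_w = N_{boards} P_b / 2$ can be evaluated directly. The sketch below assumes $P_b = p_p \sqrt{b}$ for periphery pinout and $P_b = p_s b$ for surface pinout; the function name is hypothetical.

```python
import math


def inter_board_wires(N, b, density, technology):
    """Volume of inter-board wires, V_w = (N / b) * P_b / 2 (assumed pin models)."""
    if technology == "periphery":
        pins = density * math.sqrt(b)
    elif technology == "surface":
        pins = density * b
    else:
        raise ValueError(f"unknown pinout technology: {technology}")
    return (N / b) * pins / 2


if __name__ == "__main__":
    # Periphery pinout, p_p = 256, N = 1024: wire volume falls as boards grow.
    for b in (4, 14):
        print(b, inter_board_wires(1024, b, 256, "periphery"))
    # Surface pinout, p_s = 64, N = 1024: wire volume does not depend on b.
    for b in (3, 8):
        print(b, inter_board_wires(1024, b, 64, "surface"))
```

For $N = 1024$ and $p_p = 256$ this gives 65536 (64K) wires for boards of size 4 and roughly 35000 (about 34K) for boards of size 14, in line with the figures quoted in the Cray T3D example later in the paper; the surface case gives 32768 (32K) for any board size.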
Indirect impact of cluster size
From Eqn. 12 note that $B_u^s = 4 (N/c)^{(n-1)/n} W$. Thus, keeping the channel width $W$ fixed at $W_0$, an increase in the cluster size $c$ (maintaining $W = W_0$ may also require the board size to be varied) causes the offered bisection to fall. Fig. 11 shows this for a system size of $N = 1024$ processors, a given channel width $W_0 = 24$, and different inter-cluster dimensionalities ($n = 1$ to 4).
6 Deriving Optimal System Configurations
Demand vs Supply: An Optimization Problem
For a given system size, our goal is to derive the optimal configuration. This configuration should satisfy the packaging and bisection demand constraints and offer the best performance. We formulate this as an optimization problem of selecting the best configuration from among the various configurations. The problem is solved in three phases. In the first phase we derive the set of design-feasible configurations, i.e. the system configurations which can be synthesized under the packaging constraints discussed earlier. This was studied in sec. 5, and the valid or design-feasible configurations derived under various technological parameters are listed in Tables 3 and 4. For each design-feasible configuration we then compute the offered bisection in the inter-cluster ($B_u^s$) and intra-cluster ($B_l^s$) networks. We refer to the set of design-feasible configurations and their offered characteristics as the supply side. The demand side refers to the demanded bisection sizes $B_u^d$ and $B_l^d$ in the inter-cluster and intra-cluster networks of such design-feasible configurations, derived earlier in sec. 3.3 on the basis of processor and interconnect speeds.
The second phase of the optimization problem selects the design-feasible configurations which meet the demanded bisections, i.e. $B_u^s \ge B_u^d$ and $B_l^s \ge B_l^d$. The inter-cluster bisection constraint states that the offered inter-cluster bisection from packaging constraints should be no smaller than the demanded inter-cluster bisection from processor and interconnect speed considerations. Using Eqn. 3 and Eqn. 12 in the above inequality we obtain the following: $4 (N/c)^{(n-1)/n} W \ge B_u^d$. In Table 5 we summarize the range of cluster sizes which meet the demanded bisection bandwidth, with the processor demand parameter set to 6.0, for each combination of system size ($N = 1024$, 2048, and 8192), channel width technology ($W = 24$, 40, and 72), and inter-cluster dimensionality ($n = 1$ to 8).
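The inter-cluster bisection constraint can be turned into a small filter over cluster sizes. The sketch below is illustrative only: it uses the offered bisection $4 (N/c)^{(n-1)/n} W$ from Eqn. 12 and takes the demanded bisection as an input rather than deriving it from Eqn. 3. The value 3072 used in the example is the demand quoted later in the text for $N = 1024$; the function names and the upper bound on $c$ are hypothetical.

```python
def offered_bisection(N, c, n, W):
    """Offered inter-cluster bisection, 4 * (N / c)^((n-1)/n) * W."""
    return 4 * (N / c) ** ((n - 1) / n) * W


def cluster_sizes_meeting_demand(N, n, W, demand, max_c=16):
    """Cluster sizes c (1..max_c) whose offered bisection meets the demand."""
    return [c for c in range(1, max_c + 1)
            if offered_bisection(N, c, n, W) >= demand]


if __name__ == "__main__":
    # N = 1024 processors, W = 24, demanded bisection 3072.
    for n in (3, 4, 5):
        print(n, cluster_sizes_meeting_demand(1024, n, 24, 3072))
```

Because the offered bisection falls as $c$ grows, the surviving cluster sizes for each $n$ form a contiguous range starting at $c = 1$, and higher-dimensional inter-cluster networks admit larger clusters.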
The intra-cluster bisection constraint states that the offered intra-cluster bisection from packaging constraints should be no smaller than the demanded intra-cluster bisection from processor and interconnect speed considerations. We have not considered any limitations or restrictions on the size and scalability of the intra-cluster network in this work, and we assume the intra-cluster bisection demand to be satisfied. However, in bus-based clusters the cluster size is limited by the bus bandwidth. Contemporary bus-based clusters support up to 4 processors [13, 16]. Future bus technology is expected to support larger clusters. In current technology, large clusters with higher intra-cluster bisection can be supported by using MIN-based or direct network-based clusters.
The second phase of the optimization leads us to the set of good configurations. The final optimization step is to select the best configuration from the good configurations. Next, we take up an example to show how to derive the set of good configurations and the best configuration under a given set of current technological parameters. Tables 6 and 7 show the good configurations for a system size of $N = 1024$ processors under surface and periphery pinout technologies, respectively. Similar tables can be generated for other system sizes of $N = 2048$ and 8192 processors. Note that most design-feasible configurations in Tables 3 and 4 meet the bisection demands corresponding to a system size of $N = 1024$ processors with a processor demand parameter of 6, and qualify as good configurations in Tables 6 and 7. However, for larger system sizes with greater processor demand on the network, it can be shown that fewer design-feasible configurations meet the higher bisection demands and qualify as good configurations.
Deriving the Best Configuration
For any given set of technological parameters, the set of good configurations narrows down the design-choice space. Using any of these good configurations, it is possible to actually fabricate a machine under the packaging constraints, and such a machine is guaranteed to offer at least the desired bisection bandwidth. However, there are two other important considerations when selecting a configuration: it should offer low average message latency, and it should have low cost. Thus, among the good configurations we are interested in the best configuration [23], which provides the best performance at the lowest cost.
From Tables 6 and 7, it is to be noted that under most combinations of pinout density and channel width technology, we are left with around 15-20 good choices from which we have to select the best one. This search, however, is not a clear-cut process. Unknown and inestimable factors such as exact link cost, connector cost, cluster technology, delay due to longer wires, and other packaging costs make this a hard decision. The scalability of a configuration to larger sizes is also very important. In this section, we use a combined qualitative, quantitative, and simulation approach to derive the best configurations.
We first demonstrate the design choices under the current technologies of channel width ($W = 24$), surface pinout density ($p_s = 64$), and periphery pinout density ($p_p = 256$). Each entry in Table 6, corresponding to a good configuration, can be expanded to show detailed characteristics of its connector and wiring costs together with its potential for future scalability. Table 8 shows such detailed characteristics. For example, the entry corresponding to $n = 4$, $p_s = 64$, $W = 24$, $b' = 1$, and a cluster size of 6 in Table 6 expands to the first entry in Table 8. The desired number of clusters in this system with 1024 processors is $1024/6 \approx 171$. However, the closest 4D internet topology is 4D:4x4x3x4, which has 192 clusters. This leads to an actual system size of $192 \times 6 = 1152$ processors, which is larger than the desired size of 1024 by 12.5%. We allow such deviations in system size up to 15%. The good configurations which yield actual system sizes outside this range are dropped.
The actual bisection size offered by this configuration is 4608 (obtained using Eqn. 5.4). The average distance to be traveled in the internet is 3.75, obtained as $\sum_{i=1}^{n} k_i / 4$ [11]. The total number of internet channels can be obtained as $2n(N/c)$, yielding a value of 1536 for this configuration. This is a measure of the number of inter-cluster connectors required. Connector cost being quite high [23], a natural design guideline is to opt for configurations which require fewer connectors. The board size of 6, used to place the clusters in this configuration, is the same as the cluster size. Therefore only one cluster is placed on each board. The number of inter-board wires, which is a measure of the volume of wiring, is derived as 32K (based on the discussion in Sec. 5.4). Other entries in Table 6 are similarly expanded. Table 9 shows similar characteristics for entries in Table 7 under periphery pinout of $p_p = 256$.
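The figures quoted for the 4D:4x4x3x4 example can be reproduced with a few lines of arithmetic. The following sketch uses only the expressions stated in the text (bisection $4 (N'/k_{max}) W$, average distance $\sum k_i / 4$, and $2n$ channels per cluster); the function and key names are hypothetical.

```python
from math import prod


def internet_metrics(radices, c, W, target_N):
    """Characteristics of a k_1 x ... x k_n inter-cluster network of size-c clusters."""
    clusters = prod(radices)                 # N', the number of clusters
    actual_size = clusters * c               # processors actually built
    return {
        "clusters": clusters,
        "actual_size": actual_size,
        "size_deviation": (actual_size - target_N) / target_N,
        # Smallest bisection cuts the dimension with the largest radix.
        "bisection": 4 * (clusters // max(radices)) * W,
        "avg_distance": sum(radices) / 4,
        "channels": 2 * len(radices) * clusters,
    }


if __name__ == "__main__":
    metrics = internet_metrics((4, 4, 3, 4), c=6, W=24, target_N=1024)
    for name, value in metrics.items():
        print(name, value)
```

For the radices (4, 4, 3, 4) with $c = 6$ and $W = 24$ this yields 192 clusters, 1152 processors (a 12.5% deviation from 1024), a bisection of 4608, an average distance of 3.75, and 1536 channels, matching the values above.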
Both Tables 8 and 9 are split into three parts for ease of presentation: a) one cluster is put on one board, implying $b' = 1$, i.e. the board size is chosen to fit the cluster size, b) two clusters are put on a board, leading to $b' = 2$, and c) three or more clusters are put on a board, leading to $b' \ge 3$.
The average distance to be traveled in the internet is a measure of the average latency a message incurs. While choosing the best configuration, a design objective should be to choose a configuration that offers low average message latency. Since messages can travel across clusters, this average latency measure should include both the inter-cluster network delay and the delay incurred inside the cluster. However, not having restricted our study to a particular intra-cluster topology, we primarily emphasize the inter-cluster delay.
Let us consider the scalability aspect in choosing a good configuration. When building a system with 1K processors, once a good configuration (say $n = 4$, $c = 5$, $b = 10$, and $W = 24$) is chosen, these parameters become fixed. Now assume that it is required to scale this system to 2K processors. To maintain a homogeneous system, the new processors should be added while keeping the parameters $n$, $b$, $c$, and $W$ fixed at their prior values. Otherwise, the system requires complete redesign and refabrication. Let $(n, b, c, W)_{1K}$ denote the set of all $n$, $b$, $c$, and $W$ values that represent good configurations to build a system of $N = 1K$ processors. Similarly, let $(n, b, c, W)_{2K}$ and $(n, b, c, W)_{8K}$ denote such sets for system sizes of $N = 2K$ and 8K processors, respectively. The intersection set $(n, b, c, W)_{1K} \cap (n, b, c, W)_{2K}$ therefore denotes good configurations under both $N = 1K$ and 2K processors. This implies that these configurations, used to build a system with 1K processors, can be scaled up to 2K processors. Similarly, the intersection set $(n, b, c, W)_{1K} \cap (n, b, c, W)_{2K} \cap (n, b, c, W)_{8K}$ represents configurations possible for systems with 1K, 2K, and 8K processors. Therefore, such configurations can be scaled up to 8K processors. The last column in Tables 8 and 9 shows the maximum system size to which the configuration being considered can be scaled. Note that not all good configurations for 1K processors scale up to 2K or 8K processors.
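The scalability test amounts to intersecting the sets of good configurations for successive system sizes. The sketch below shows one way to compute the maximum scalable size of a configuration. The sets in `EXAMPLE_GOOD_SETS` are made-up placeholders chosen for illustration, not the contents of Tables 8 and 9, and the function name is hypothetical.

```python
# Placeholder data: system size -> set of good (n, b, c, W) tuples.
# These tuples are illustrative only and are NOT taken from the paper's tables.
EXAMPLE_GOOD_SETS = {
    1024: {(4, 10, 5, 24), (3, 4, 2, 24)},
    2048: {(4, 10, 5, 24), (3, 4, 2, 24)},
    8192: {(4, 10, 5, 24)},
}


def max_scalable_size(config, good_sets):
    """Largest size N such that config is good for every listed size up to N."""
    limit = None
    for size in sorted(good_sets):
        if config not in good_sets[size]:
            break                      # scaling stops at the first failing size
        limit = size
    return limit


if __name__ == "__main__":
    # Configurations that are good for all three sizes (the full intersection).
    print(set.intersection(*EXAMPLE_GOOD_SETS.values()))
    print(max_scalable_size((4, 10, 5, 24), EXAMPLE_GOOD_SETS))
    print(max_scalable_size((3, 4, 2, 24), EXAMPLE_GOOD_SETS))
```

Walking the sizes in increasing order is equivalent to taking the running intersection of the sets, and it yields the kind of value reported in the last column of Tables 8 and 9.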
Choosing the best configuration
A) Surface pinout $O(A)$ technology: In this technology, irrespective of the internet configuration, the total volume of inter-board wires (total number of pins) is fixed. Larger boards offer greater bisection bandwidth under this technology. Thus a reasonable approach is to go for larger board sizes for any given internet dimensionality. To decide among the good configurations we should also keep in mind factors such as minimal latency, connector costs, cost of inter-board wires, and extent of future scalability. Let us consider the good configurations under the $p_s = 64$ and $W = 24$ technology in Table 8. We chose the 4D:4x4x4x3 with $c = 5$ as the best topology, as it offers a reasonably high bisection size (recall that for a system of 1024 processors the demanded bisection size is 3072). It also offers low average latency with low connector cost, while being scalable to around 8K processors.
B) Periphery pinout $O(\sqrt{A})$ technology: Under this technology, the total volume of inter-board wires in the system falls with increasing board size. This decreasing volume could translate to a fall in packaging costs. Consider the good configurations under the $p_p = 256$ and $W = 24$ technology in Table 9. Note that the 3D configurations are scalable up to only 2K processors. For scalability beyond 2K we need to choose among the higher-dimensional systems. The 4D with $c = 4$ and 5D with $c = 6$ under $b' = 2$, and the 4D with $c = 4$ under $b' \ge 3$, provide such scalability while offering reasonably high bisection bandwidth at low connector and inter-board wire volume. Among these three configurations, the 4D under $b' \ge 3$ is clearly better than the 4D under $b' = 2$, as it offers similar characteristics with a lower volume of inter-board wires. We therefore consider the two configurations: 5D with $c = 6$ and $b' = 2$, and 4D with $c = 4$ and $b' = 3$. Simulation modeling was used to decide the best configuration by comparing the offered throughput versus average message latency. For this experiment bus-based clusters were assumed. We observed, as shown in Fig. 13, that at high system loads the cluster bus became a greater bottleneck in the 5D system with larger clusters. Message latencies in both systems at lower loads were comparable. These observations lead us to conclude that the 4D:4x4x4x4 with $c = 4$ is the best system configuration.
[Figure 13. Axis label: Avg. latency of a message (100 cycles).]
Applying the Framework to Contemporary Design Trends
From Tables 8 and 9 the following interesting comments can be made.
Cray T3D example
Employing the technology combination of $p_p = 256$ and $W = 24$, the Cray T3D uses boards of size 4 containing 2 clusters (nodes), each of size 2, interconnected through a 3D inter-cluster network [10]. Our framework derives this configuration as the first entry in Table 9 under $b' = 2$. This may be treated as a validation of our framework in predicting real system design. Also note the first entry in Table 9 under $b' \ge 3$. This predicts that if, in a similar system, boards of size 14 were used instead of 4, with 7 clusters placed on each board, the volume of inter-board wiring in a Cray T3D system could be reduced from 64K to 34K, i.e. almost halved, leading to a much less expensive system.
Flat systems are not always possible
Our framework is general enough to allow flat configurations ($c = 1$). However, it is interesting to note that when designing a system under some technology combinations, the flat configuration may not be possible at all because it is not a good configuration. This can be seen in Table 9, where none of the good configurations has $c = 1$.
Higher inter-cluster dimensionality may be required
It has been recently demonstrated that it is easier to build lower dimensional systems with 2D/3D interconnections [10, 11]. Higher dimensional networks require longer wires, which may lead to higher wiring costs and packaging effort. However, it is possible that system design under a given set of packaging constraints and technologies may not offer any good configuration with a 2D/3D interconnection. For example, consider the design problem of a scalable system with $N = 1024$ processors under current technologies of $p_p = 256$ and $W = 24$. Let the packaging constraints be such that we can only place one cluster on each board ($b' = 1$). The possible good configurations are shown in Table 9 under $b' = 1$. Note that no good configuration exists for building such a system with a 2D/3D interconnect. Under such constraints a system designer would therefore be forced to choose a higher dimensional system.
Summary of Best Scalable Configurations
We extended our previous analysis to predict the best system configurations for other channel width and pinout density parameters reflecting near-future technologies. Combinations of channel width technologies ($W = 24$, 40, and 72) were considered in conjunction with surface pinout technologies ($p_s = 64$ and 256) and periphery pinout technologies ($p_p = 256$ and 512). We first studied the problem of choosing the best configurations while restricting the design to low-dimensional (2D/3D) inter-cluster networks with small cluster sizes ($c < 8$). Table 10 shows such best configurations for building a system of $N = 1024$ processors under various technologies. A best system configuration depicted using the shorthand (3D c-2) represents an (8-ary 3-cube cluster-2) system. Other entries can be similarly expanded to the k-ary n-cube cluster-c notation. Note that the larger channel widths available under future technology make larger cluster sizes possible under a given surface or periphery pinout density. However, an increase in pinout density without growth in channel width will force systems toward smaller cluster sizes. We also observe that a future periphery pinout density of $p_p = 512$ yields best system choices similar to those of a future surface pinout density of $p_s = 256$. This suggests that although surface pinout theoretically has the potential to outperform periphery pinout technology, as shown in sec. 4.2.2, the state of the art is such that even if surface pinout technology were to improve 4 times while periphery pinout technology only grew twice, they would still be comparable.
We next considered the design of systems with 1024 processors which are scalable up to 8K processors under the same technologies. In this case we did not restrict ourselves to only low-dimensional networks with small cluster sizes. Table 11 shows the best configurations obtained. Note that, interestingly, all these configurations use 3D/4D/5D inter-cluster networks and cluster sizes from 2 to 10. These are reasonable values, and it is reasonable to expect that such systems will be available in the near future. The entry corresponding to a technology of $p_s = 64$ and $W = 72$ is
