Existing interprocessor connection networks are often plagued by poor topological properties that result in large memory latencies for distributed shared-memory (DSM) computers or multicomputers. On the other hand, scalable networks with very good topological properties are often impossible to build because of their prohibitively high very large scale integration (VLSI) (e.g. wiring) complexity. One such network is the generalized hypercube (GH). The GH supports full connectivity of all of its nodes in each dimension and is characterized by outstanding topological properties. Also, low-dimensional GHs have very large bisection widths. We present here the class of highly-overlapping windows (HOWs) networks, which offer lower complexity than GHs, comparable performance and better scalability.
INTRODUCTION
The demand for ever greater performance by many computational problems has been the driving force for the development of computers with hundreds or thousands of processors. Many applications require performance near or beyond one petaflops (i.e. $10^{15}$ floating-point operations per second). The high-performance computing field generally targets applications that require performance in the many-gigaflops to petaflops range. The High-End Computing Panel of the President's Information Technology Advisory Committee (PITAC) emphasizes in its 1999 report that innovations are required in high-end systems to deliver petaflops performance [1].
Several high-performance computers have been developed in recent years, such as the SGI Blue Mountain capable of 3 teraflops, and the IBM ASCI Blue [2] and ASCI White; the latter machine is capable of about 12 trillion calculations per second. IBM is also expected to deliver in 2005 the IBM Blue Gene/L for simulating phenomena such as aging of materials, fires and explosions, at the 200 teraflops rate. The ultimate IBM objective in this field is to build a 1 petaflops machine for the life sciences by the end of this decade [3]. The common objective of the highly-overlapping windows (HOWs) networks and the latter efforts is to support the implementation of computers with thousands of processors. Our work differs from these other efforts because we primarily emphasize the implementation, with current or future technologies (e.g. optics), of scalable interconnection networks. Our systems are scalable because they can easily take advantage not only of ever increasing wiring and/or channel densities (i.e. the number of channels per mm$^2$) but also of more powerful processing nodes that reduce the volume occupied by machines. The saved volume can then be used to increase each processor's degree of connectivity.
Very high performance is very difficult to achieve or sustain, primarily because of the difficulty, currently viewed as insurmountable, of developing low-complexity, high-bisection-bandwidth and low-latency networks to interconnect thousands of processors. To quote Dally, 'wires are a limiting factor because of power and delay as well as density' [4]. Several interconnection networks have been proposed including, among others, regular meshes and tori [5, 6], enhanced meshes [7], (direct binary) hypercubes [8], hypernets [9], fat trees [10] and hypercube variations [11, 12, 13]. The hypercube dominated the high-performance computing field in the 1980s because of good topological properties and rich interconnectivity that permits efficient emulation of many other topologies [8, 14].
Nevertheless, these properties come at the cost of often prohibitively high VLSI (primarily wiring) complexity due to a dramatic increase in the number of communication channels with any increase in the number of processing elements (PEs). Its high VLSI complexity is undoubtedly its dominant drawback, which limits scalability [14] and does not permit the construction of powerful systems. The versatility of the hypercube in emulating other important topologies efficiently constitutes an incentive for the introduction of hypercube-like interconnection networks of lower complexity (i.e. better scalability) that preserve, to a large extent, the former's topological properties [11, 12]. The current trend in the high-performance computing field is to build distributed shared-memory (DSM) computers where the main interconnection network connects clusters of shared-memory processor (SMP) subsystems [15, 16, 17]. The underlying message-passing (i.e. distributed) interconnection network in DSM machines also serves as the vehicle for remote memory accesses. The proposed HOWs are also applicable to this case. Therefore, the terms processor and node (very often implying an SMP cluster) will be used interchangeably from now on in this paper. Moore's law predicts the doubling of transistor density for chips about every 18 months. As a consequence, shared-memory multiprocessor chips are slowly appearing in the market; they may eventually serve as the nodes of high-performance computing machines.
To support scalability, several approaches to massively-parallel processing use bounded-degree networks, such as meshes or p-ary n-cubes (i.e. tori), with low node degree (e.g. the FLASH [18], Cray T3E [19] and Tera computers). For example, hypernets are constructed hierarchically with identical cubes, trees or buses as the basic building blocks [9]. Their focus is on maintaining a constant node degree with increases in machine size. Nevertheless, low-degree networks result in large diameter, large average internode distance and small bisection width. Relevant approaches that employ reconfiguration to enhance the capabilities of the basic mesh architecture (e.g. reconfigurable mesh, mesh with multiple broadcasting and mesh with separable broadcast buses) will not become feasible for massively-parallel processing in the foreseeable future because of the requirements for long clock cycles and precharged switches to facilitate the transmission of messages over long distances [7].
The high VLSI (wiring) complexity problem is most severe for generalized hypercubes (GHs). In contrast to nearest-neighbor p-ary n-cubes that form rings with p nodes in each dimension, GHs implement fully-connected systems with p nodes in each dimension [20]. The n-D (symmetric) generalized hypercube GH(p, n) contains $p^n$ nodes. The address of a node is $x_{n-1} x_{n-2} \ldots x_1 x_0$, where each $x_i$ is a radix-p digit with $0 \le x_i \le p - 1$. This node is a neighbor to the nodes with addresses $x_{n-1} x_{n-2} \ldots x'_i \ldots x_1 x_0$, for all $0 \le i \le n - 1$ and $x'_i \ne x_i$. Therefore, two nodes are neighbors if and only if their n-digit addresses differ in a single digit. For the sake of simplicity, we restrict our discussion to symmetric GHs where the nodes have the same number of neighbors in all dimensions. Therefore, each node has p − 1 neighbors in each dimension, for a total of n(p − 1) neighbors per node. The diameter of the n-D GH(p, n) is only n. Low-dimensional GHs have very impressive bisection widths. When a network is cut into two equal halves (with the same number of nodes), its bisection width is the number of edges that run between these two halves [21]; dense/heavy communication operations can benefit from a large bisection width. For n = 2 and p an even number, the bisection width of the GH(p, 2) is the immense $p^3/4$. The outstanding topological properties of GHs are the result of their high node degree (that is, the large number of connections per node) which, however, has a negative effect on the wiring complexity. The increased VLSI/wiring cost of GHs results in outstanding performance that permits optimal emulation of hypercubes and p-ary n-cubes, and efficient implementation of complex communication patterns [22, 23, 24].
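The GH definitions above can be checked on a small instance. The following is a minimal sketch, not part of the original analysis: the function names `gh_neighbors` and `gh_cut_width` are ours, and the cut examined is simply the one that splits one dimension in half, which yields the $p^3/4$ figure quoted in the text.

```python
from itertools import product

def gh_neighbors(node, p):
    """Neighbors of `node` (a tuple of n radix-p digits) in GH(p, n).

    Two nodes are neighbors iff their addresses differ in exactly one digit.
    """
    result = []
    for i, digit in enumerate(node):
        for other in range(p):
            if other != digit:
                result.append(node[:i] + (other,) + node[i + 1:])
    return result

def gh_cut_width(p):
    """Edges of GH(p, 2) crossing the cut that halves the first digit (p even)."""
    half = p // 2
    crossing = 0
    for node in product(range(p), repeat=2):
        if node[0] < half:
            # Count only edges leaving the lower half, so each edge is counted once.
            crossing += sum(1 for nb in gh_neighbors(node, p) if nb[0] >= half)
    return crossing

p, n = 6, 2
print("node degree:", len(gh_neighbors((0,) * n, p)), "expected", n * (p - 1))
print("cut width:", gh_cut_width(p), "p^3/4 =", p ** 3 // 4)
```

Enumerating edges this way is only practical for small p; it is meant to make the degree and bisection figures concrete, not to serve as a scalable tool.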
In order to reduce the number of communication channels in systems similar to the GH, the spanning bus hypercube uses a shared bus for the implementation of each fully-connected subsystem in a given dimension. However, shared buses result in significant performance degradation because of the overhead imposed by the arbitration protocol that determines ownership of the bus for each transfer. Similarly, hypergraph architectures implement all possible permutations of their nodes in each dimension by employing crossbar switches [25]. Reconfigurable GHs interconnect all nodes in each dimension dynamically via a scalable mesh of very simple, low-cost programmable switches [26]. However, all these proposed reductions in hardware complexity may not be sufficient for sustained high-performance computing.
Fat trees have received major attention because it has been proved that, for a given amount of communications hardware, a fat tree is nearly the best routing network (universality theorem) [10]; that is, the fat tree can simulate any other network containing the same amount of communications hardware in a time that may be greater by a polylogarithmic factor. The fat tree structure is based on the mesh-of-trees network [27]. The processors of a fat tree are represented by the leaves of a complete binary tree; all internal nodes in the binary tree are switches. The number of wires (and bandwidth) connecting neighboring nodes in the fat tree increases exponentially with each new level, as the root is approached. It is worth noting that the butterfly network is so versatile that it can be redrawn to look like a fat tree.
The Tera-op Reliable Intelligently adaptive Processing System (TRIPS) is a hierarchical system composed of multiple chips; each chip is composed of eight processors and several memory elements [28]. Each processor has a grid processor architecture (GPA) consisting of an 8 × 8 array of arithmetic-logic units (ALUs), a local register file, local instruction and data caches and control circuits. The grid processor and the memory arrays are configurable. The objective of TRIPS is to scale well with technology, similarly to our objective for HOWs. Similarly, the Raw architecture emphasizes parallelism within a processor chip that contains simple configurable tiles interconnected in a large structure visible directly by the compiler [29]. Therefore, the Raw hardware can be customized by the compiler to the specific application. Only short wires are used in Raw, in contrast to HOWs that employ short intra-chip and longer inter-chip connections for substantial scalability. Each simple tile in Raw contains instruction and data memories, an ALU and a register file. Contrary to HOWs, however, where the objective is to scale well within and between processor chips in order to support very-high-performance computing, the TRIPS and Raw architectures emphasize scalability within processor chips. Each node in a HOW system could actually be a scalable TRIPS or Raw processor. Therefore, HOWs and TRIPS/Raw complement each other in the areas of technology and architecture scalability.
To summarize, low-dimensional massively-parallel architectures with full connectivity for their nodes in each dimension, such as GHs, are very desirable because of their outstanding topological properties, but their electronic implementation is a Herculean task because of packaging (and primarily wiring) constraints. We propose in this paper a new class of interprocessor connection architectures, namely HOWs, which employ the GH [20, 22, 24, 30] with outstanding topological properties (e.g. extremely small diameter and average internode distance, and immense bisection width) as their basic building block. HOWs are also obtained from GHs by removing some of their processor interconnections in order to reduce their wiring complexity and render them viable structures for very-high-performance computing. Large GHs have outstanding topological properties; however, they are characterized by very high wiring complexity that prohibits their implementation. In contrast, HOWs can be viable while having simultaneously topological properties comparable to those of GHs. This paper focuses on 2-D HOWs because of their potential for ease of implementation and their large bisection width. It is organized as follows. Section 2 introduces the architecture of HOWs and presents a simple, fault-tolerant routing algorithm to find shortest paths. Section 3 discusses cost analysis for HOWs. Section 4 presents the embedding of various interconnection networks into 2-D HOW systems. Section 5 presents and analyzes communication operations for 2-D HOW systems. Section 6 presents performance comparisons involving hypercubes (binary and generalized) and 2-D HOW systems. Conclusions are presented in Section 7.
THE CLASS OF HOW ARCHITECTURES
HOWs are designed recursively. We first introduce the class of 1-D HOW node interconnections. HOW(p, w, 1) denotes the 1-D HOW system with p sequentially-numbered nodes and window size w. Each node with unique address k, where 0 ≤ k ≤ p − 1, is connected directly to all nodes within the windows of size w immediately to its left and right. More specifically, its neighbors are all the nodes with addresses 0 ≤ k ± i ≤ p − 1, for i = 1, 2, 3, . . . , w. Therefore, all connections are local in this 1-D system and span up to w nodes to the left and w nodes to the right of the referenced node.
Each node k belongs to as many as w + 1 maximal-sized 1-D GHs GH(w + 1, 1) (i.e. fully-connected subsystems); they can be derived by starting with the subsystem spanning node k and all its left neighbors in the collinear representation of the HOW(p, w, 1), and shifting the window by one position to the right each time until the last subsystem spans node k and all its right neighbors. Therefore, each such pair of successively-derived GH(w + 1, 1)s has a very large overlap that forms a GH(w, 1). The HOW(p, w, 1) can also be derived from the GH(p, 1) by removing for each node, in the collinear representation of the GH(p, 1), those edges that connect it to nodes outside of the left and right windows defined by w. Therefore, existing algorithms for GHs can be modified easily to run on HOWs for the following reasons:
• HOWs are derived from GHs by removing some edges;
• HOWs contain many smaller, highly-overlapping GHs.
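The 1-D construction can be made concrete with a short sketch. It is illustrative only and assumes the collinear numbering 0, ..., p − 1 used above; the function names `how1d_neighbors` and `windows_containing` are hypothetical.

```python
def how1d_neighbors(k, p, w):
    """Neighbors of node k in HOW(p, w, 1): all nodes within distance w of k."""
    return [j for j in range(max(0, k - w), min(p - 1, k + w) + 1) if j != k]

def windows_containing(k, p, w):
    """All runs of w + 1 consecutive nodes that contain node k.

    Any two nodes in such a run are at most w apart, so each run is a
    fully-connected GH(w + 1, 1).
    """
    first_start = max(0, k - w)
    last_start = min(k, p - 1 - w)
    return [list(range(s, s + w + 1)) for s in range(first_start, last_start + 1)]

p, w, k = 12, 4, 5
print(how1d_neighbors(k, p, w))
for window in windows_containing(k, p, w):
    print(window)
```

An interior node such as k = 5 belongs to w + 1 such windows, and consecutive windows share w nodes; nodes close to either end of the line belong to fewer windows, which is why the text says "as many as" w + 1.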
The (symmetric) n-D HOW(p, w, n) with p nodes per dimension is constructed recursively, so that each node has up to 2wn neighbors. A node has address $x_{n-1} x_{n-2} \ldots x_1 x_0$, where each $x_i$ is a radix-p digit with $0 \le x_i \le p - 1$.
This node has neighbors with addresses that differ from its own address only in a single radix-p digit; that is, they have addresses $x_{n-1} x_{n-2} \ldots x'_i \ldots x_1 x_0$, where $1 \le |x'_i - x_i| \le w$, for any i = 0, 1, . . . , n − 1. This HOW system contains $p^n$ nodes. It contains many highly-overlapping GHs GH(w + 1, n). The HOW(p, w, n) can also be derived from the GH(p, n) by removing in each dimension all connections for each node that do not fall into its left and right neighborhood windows defined by w. Figure 1 shows the HOW(7, 3, 2). The HOW(p, p − 1, n) is identical to the GH(p, n), the HOW(p, 1, n) is identical to the n-D mesh and the HOW(2, 1, n) is identical to the n-D binary hypercube. This paper focuses on 2-D HOWs because of their simplicity, high bisection width, and ease of implementation.
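The n-D neighbor rule and the three special cases listed above can be checked by brute force on small systems. This is a sketch under our own naming (`how_neighbors`, `edge_count`); the edge-count formulas in the comments are standard counts for the GH, the mesh and the binary hypercube rather than figures taken from this paper.

```python
from itertools import product

def how_neighbors(node, p, w):
    """Neighbors of `node` in HOW(p, w, n): one digit changes by 1..w."""
    result = []
    for i, digit in enumerate(node):
        for other in range(max(0, digit - w), min(p - 1, digit + w) + 1):
            if other != digit:
                result.append(node[:i] + (other,) + node[i + 1:])
    return result

def edge_count(p, w, n):
    """Number of undirected edges of HOW(p, w, n), by enumeration."""
    edges = set()
    for node in product(range(p), repeat=n):
        for nb in how_neighbors(node, p, w):
            edges.add(frozenset((node, nb)))
    return len(edges)

# HOW(7, 3, 2): the centre node reaches the upper bound of 2wn = 12 neighbors.
print(len(how_neighbors((3, 3), 7, 3)), len(how_neighbors((0, 0), 7, 3)))
# HOW(p, p-1, n) = GH(p, n): N * n * (p - 1) / 2 edges.
print(edge_count(4, 3, 2), 16 * 2 * 3 // 2)
# HOW(p, 1, n) = n-D mesh: n * p^(n-1) * (p - 1) edges.
print(edge_count(4, 1, 2), 2 * 4 * 3)
# HOW(2, 1, n) = binary n-cube: n * 2^(n-1) edges.
print(edge_count(2, 1, 3), 3 * 4)
```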
Not only do HOWs have reduced wiring complexity compared to GHs of similar size, but also the locality of node interconnections in HOWs can prove a viable solution for very-high-performance computing.
• Intrachip and/or local interchip connections could be implemented efficiently with current and expected electronic technologies for reasonable values of the window size w; in contrast, the global interconnections required in GHs are much more difficult to realize. Improvements in intrachip and/or interchip interconnection technologies can increase the value of w, so HOWs are technology scalable.
• Free-space optical interconnects are expected to become viable in the near future for the local interconnection of chips [30] . Very substantial work is carried out in research laboratories, quite often with federal support, for the efficient realization of free-space interconnects within computer systems. Wavelength-division multiplexing (WDM) will be employed for the transmission of multiple bits in parallel [30] . Because of the fact that chromatic dispersion becomes a major problem in WDM for distances larger than about a meter, the global interconnections required in GHs will still be very difficult to implement. Therefore, HOWs will increase further their advantage over GHs with respect to interconnection complexity.
All of the above points also demonstrate that HOWs are more amenable than GHs to scaling with technological advancements. Current commercial-off-the-shelf (COTS) chips have about 1000 pins. Assuming 64-bit unidirectional channels and about 800 of these pins available exclusively for data transfers, we can have w = 6. Around the year 2010, COTS chips will have about 3000 pins (Semiconductor Industry Association (SIA) prediction). Assuming about 2600 of these pins exclusively for data transfers, we will have w = 20. Application-specific integrated circuit (ASIC) chips can generally have up to double the number of pins on COTS chips, and therefore for ASIC chips we can currently have w = 12.
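The three window sizes quoted above are consistent with a simple pin budget. The sketch below is our own reading of the numbers, not a formula stated in the paper: it assumes that the data pins are divided into 64-bit channels and that two channels are consumed per unit of window size. The helper name `window_from_pins` is hypothetical.

```python
def window_from_pins(data_pins, channel_bits=64):
    """Largest window size w supported by `data_pins` data pins.

    Assumption (ours): w = floor(floor(data_pins / channel_bits) / 2).
    """
    channels = data_pins // channel_bits
    return channels // 2

for label, pins in [("COTS today", 800), ("COTS around 2010", 2600), ("ASIC today", 1600)]:
    print(label, pins, "data pins -> w =", window_from_pins(pins))
```

The ASIC case takes 1600 data pins, i.e. double the 800 assumed for current COTS chips.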
The mesh, binary hypercube and GH are extremes in the spectrum of networks that HOWs are proposed for. They correspond to HOWs with w = 1, p = 2 (of course, we can only have w = 1 in this case) and w = p − 1, respectively. Therefore, the proposed HOWs are compared primarily against these extreme cases, which correspond to widely used networks in theory and practice. Some comparisons with the fat tree network are included as well. For a comparative evaluation of these networks, a trade-off analysis must be carried out between performance and cost. As pointed out in [15] , network topology has been a point of major debate primarily because 'different positions make sense under different cost models and the technology keeps changing'. Using HOW architectures, we can take advantage of wiring and/or optical technologies to ever increase the window size for ever higher performance, while supporting program portability.
HOWs can have very good topological properties and reasonable cost. The following proposition and theorem are pertinent. 
Routing in HOWs and support for fault tolerance
A major requirement for any new interconnection network is to support ease of routing. We show in this subsection that HOWs adhere to this requirement. For the sake of illustration, let us first describe a simple routing algorithm for the 1-D HOW(p, w, 1). This routing algorithm is distributed and chooses a shortest path. Assume that the source and destination addresses are A and B, respectively. Routing is carried out in $\lambda = \lceil |B - A|/w \rceil$ steps. Assume that the required direction of transfer is shown by the value of the parameter sign(B − A); it is +1 if B − A ≥ 0, otherwise it is −1. Also, R is the remainder of the division |B − A|/w.
The randomized routing algorithm that results in a shortest path is as follows. The current node A sends the message to any node with address A + sign(B − A) * τ , where R ≤ τ ≤ w. Therefore, the node chooses any of these w − R + 1 neighbors. Because of these potentially multiple choices and the distributed nature of the routing algorithm, multiple paths of minimum length may be available for the transmission of a message. 
There is a single path of minimum length for R = 0.
Proof. There exists a minimum length path that first follows $\lfloor |B - A|/w \rfloor$ links of length w and up to one link of length R in the 1-D representation of the HOW system. For R > 0, however, the link of length R can be chosen in any phase during the transfer; that is, it can be selected either at the source or at any other intermediate node in the path, without increasing the total distance traveled; this distance is λ hops. The total number of such paths is equal to λ, thus the first term of the equation. Each remaining path of minimum length contains, anywhere in the path, two links of length R + i and w − i with all its other links being of length w, for all integers i with $0 < i \le \lfloor (w - R)/2 \rfloor$. The total number of permutations of these pairs of links in the path for a given value of i is λ(λ − 1), i assumes $\lfloor (w - R)/2 \rfloor$ possible values and pairs of paths are identical for i = (w − R)/2 (if it is an integer). Therefore, the second and third terms of the equation represent these additional paths.
For example, in the HOW(64, 7, 1) there exist 22 paths of minimum length 4 between any pair of nodes whose addresses differ by 24; 55 paths of minimum length 5 between pairs differing by 29; 117 paths of minimum length 9 between pairs differing by 59; and 32 paths of minimum length 4 between pairs differing by 23. For a multidimensional HOW system, the total number of shortest paths is given by the product of the numbers of shortest paths in all dimensions. This may be a very large number in practical cases. Thus, HOW systems are highly fault tolerant. The routing algorithm is deadlock free if we adhere to dimension-order routing. Since a large number of shortest paths exist for most pairs of nodes, faulty links or nodes can be easily avoided with minimal overhead. For pairs of nodes corresponding to R = 0, there exist numerous alternative paths that increase the shortest distance by just one.
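The 1-D routing rule described in this subsection can be sketched as follows. This is an illustrative implementation under two stated assumptions of ours: when R = 0 the hop length is forced to w (consistent with the single shortest path for R = 0), and the hop never overshoots the destination. The function name `route_1d` and the `rng` parameter are hypothetical.

```python
import math
import random

def route_1d(a, b, w, rng=random):
    """Nodes visited on a shortest path from a to b in HOW(p, w, 1).

    At each node the hop length tau is drawn at random with R <= tau <= w,
    where R is the remainder of the remaining distance divided by w.
    """
    path = [a]
    while a != b:
        dist = abs(b - a)
        sign = 1 if b > a else -1
        r = dist % w
        low = r if r > 0 else w      # assumption: R = 0 forces a full-length hop
        high = min(w, dist)          # assumption: never jump past the destination
        tau = rng.randint(low, high)
        a += sign * tau
        path.append(a)
    return path

w = 7
path = route_1d(3, 27, w, random.Random(1))
print(path, "hops:", len(path) - 1, "lambda:", math.ceil(24 / w))
```

Every choice made by a node is local, so different runs may return different shortest paths between the same pair of nodes; a node that finds its preferred neighbor faulty can simply draw another value of tau in the allowed range.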
COST ANALYSIS
Our main objective here is to show that HOWs have reduced VLSI/wire complexity compared to GHs, despite the fact that they can deliver comparable performance (as shown in subsequent sections). This reduced complexity of HOWs is investigated with respect to the number of channel wires and the complexity of the VLSI/wire layout. We also compare the numbers of channel wires in HOWs with those of other popular interconnection networks, for systems of comparable size. Additionally, we compare the bisection width of HOWs with that of the torus, a popular network. Another major interconnection network nowadays is the fat tree. Therefore, a comparison is also made with the latter network, despite the fact that it is actually an indirect network (since it employs switches) and, therefore, a comparison of topological properties is not really easy. The following proposition shows the complexity of the fat tree.
PROPOSITION
Proof. The fat tree has (m + 1) levels and the total number of wires between any pair of consecutive levels is equal to Nk. Therefore, the total number of wires that implement the channels is equal to mNk. The switch in the apex is connected to $2^{m-1}k$ wires to its left and $2^{m-1}k$ wires to its right, for a total of $2^m k$ or Nk wires. To be able to implement all possible permutations of its inputs, it is a crossbar switch with N inputs and N outputs, where the width of each input or output is k. This crossbar switch is implemented with $N^2 k$ binary switches. Level i (at distance i from the leaves), where $1 \le i \le m - 1$, contains $2^{m-i}$ crossbar switches; each crossbar switch has $2^{i+1}$ inputs and the same number of outputs, where each line contains k wires. Thus, level i contains a total of $2^{m-i} 2^{2(i+1)} k$ or $4N 2^i k$ binary switches. As a result, the total number of binary switches in the fat tree is equal to $N^2 k + \sum_{i=1}^{m-1} 4N 2^i k = N^2 k + 4Nk(N - 2)$.
Table 1 compares the numbers of channels in the binary hypercube, the p-ary n-cube (i.e. n-D torus), the GH(p, n), the binary fat tree, the 2-D HOW and the n-D HOW, all with the same number $N = p^n$ of nodes. This table shows that the number of channels in the HOW(p, w, n) is smaller than that in the GH(p, n).
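The counts discussed here can be reproduced with a few lines of arithmetic. The sketch below is not a reconstruction of Table 1. The HOW channel count is our own direct count of node pairs at distance 1 to w along each line (no wraparound), and the fat-tree total simply adds the per-level terms given in the proof, with $N = 2^m$. The function names are hypothetical.

```python
def how_channels(p, w, n):
    """Bidirectional channels in HOW(p, w, n) without wraparound.

    Each 1-D line of p nodes has p - i node pairs at distance i (1 <= i <= w),
    and there are n * p^(n-1) such lines.
    """
    per_line = sum(p - i for i in range(1, w + 1))
    return n * p ** (n - 1) * per_line

def fat_tree_wires(m, k):
    """Total channel wires mNk of the fat tree with N = 2^m leaves."""
    return m * 2 ** m * k

def fat_tree_binary_switches(m, k):
    """Apex crossbar (N^2 k) plus levels 1..m-1 (4 N 2^i k each)."""
    n_leaves = 2 ** m
    apex = n_leaves * n_leaves * k
    lower = sum(4 * n_leaves * 2 ** i * k for i in range(1, m))
    return apex + lower

p = 128  # a 2-D system with N = 16,384 nodes
for w in (1, 6, 20, p - 1):
    print("HOW(128, %d, 2) channels:" % w, how_channels(p, w, 2))
print("fat tree, m = 3, k = 1:", fat_tree_wires(3, 1), "wires,",
      fat_tree_binary_switches(3, 1), "binary switches")
```

With w = p − 1 the first function returns the GH(p, n) channel count, and with w = 1 the mesh count, which makes the growth of the channel count with w easy to see.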
As an example, assume systems with N = 16,384 nodes (i.e. m = 14) and 64-bit data channels with two sets of 64 wires for full-duplex bidirectional data transfers. The total numbers of wires for inter-node data transfers are:
For the comparative analysis of these results, we emphasize again that HOW systems with reasonable window size w are scalable, and could be implemented with current and expected electronic and/or optical technologies because of the locality of their interconnects. In contrast, binary hypercubes are not scalable because the node degree increases with an increase in the number of nodes and, therefore, they are difficult to build. Also, large GHs are impossible to build because of their very large wiring complexity and their global interconnectivity. Finally, fat trees of large size are very difficult to build because they contain very large numbers of binary switches. When wraparound connections are also included, the diameter of the HOW(p, w, n) is reduced by half to $n \lceil (p - 1)/(2w) \rceil$. However, we do not study here HOW systems with wraparound connections because we want to facilitate the implementation of large-dimensional HOW systems (using local interconnects only). In addition, the diameter does not change in the asymptotic O(·) notation because of the wraparound connections.
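The effect of wraparound connections on the diameter can be checked by breadth-first search on small systems. This is a sketch with hypothetical names (`neighbors`, `diameter`); the diameter without wraparound, $n \lceil (p - 1)/w \rceil$, is our own statement of the quantity that wraparound halves, since it does not appear explicitly in this section.

```python
from collections import deque
from itertools import product
from math import ceil

def neighbors(node, p, w, wrap):
    """Neighbors of `node` in HOW(p, w, n), optionally with wraparound."""
    for i, digit in enumerate(node):
        for step in range(1, w + 1):
            for other in (digit - step, digit + step):
                if wrap:
                    other %= p
                elif not 0 <= other < p:
                    continue
                if other != digit:
                    yield node[:i] + (other,) + node[i + 1:]

def diameter(p, w, n, wrap=False):
    """Largest shortest-path length, found by BFS from every node."""
    worst = 0
    for source in product(range(p), repeat=n):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nb in neighbors(node, p, w, wrap):
                if nb not in dist:
                    dist[nb] = dist[node] + 1
                    queue.append(nb)
        worst = max(worst, max(dist.values()))
    return worst

p, w, n = 7, 3, 2
print("no wraparound:", diameter(p, w, n), "formula:", n * ceil((p - 1) / w))
print("wraparound:   ", diameter(p, w, n, wrap=True), "formula:", n * ceil((p - 1) / (2 * w)))
```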
Let us also compare the bisection widths of the torus, the HOW with wraparound connections and the GH. We have included the torus in this comparison because it is easy to construct and many real computers contain this network. A very large bisection width may make the network difficult to build but it also reduces the probability of communication bottlenecks. The bisection width of the GH(k, n) is $O(k^{n+1})$; it is $O(p^{3/2})$ for the GH($\sqrt{p}$, 2). For the 1-D HOW($\sqrt{p}$, w, 1) with wraparound connections, the bisection width is $2(1 + 2 + 3 + \ldots + w) = w(w + 1) = O(w^2)$. For the 2-D HOW($\sqrt{p}$, w, 2) with wraparound connections, the bisection width is $w(w + 1)\sqrt{p}$, since each of the $\sqrt{p}$ rows contributes $w(w + 1)$ crossing edges.
Let us now compare in detail the bisection width of 2-D HOWs (the main focus of this paper) with that of tori, for networks comparable in cost. Assume the symmetric 2-D HOW($\sqrt{p}$, w, 2) with p nodes (and wraparound connections in each dimension) and the symmetric m-D torus with the same number of nodes $p = \beta^m$. Thus, $\beta = p^{1/m}$. For the two systems to contain the same number of wires, we need 4w = 2m or m = 2w. The bisection width of the torus is $\beta^{m-1}$. For the bisection width of the HOW to be larger than or equal to that of the torus, we need to have $w(w + 1)\sqrt{p} \ge \beta^{m-1}$. Table 2 shows the maximum number, as a function of w, of nodes required for the HOW to have better bisection width than the torus. We can infer from this table that, in practical cases, HOWs have better bisection width than tori of the same cost.
The study of further issues related to the VLSI/wire cost is now in order. A VLSI cost comparison between 1-D HOWs and GHs is presented. Since the focus of our attention is 2-D systems with p nodes in each dimension, this 1-D comparison is assumed to be carried out for each of the p rows and p columns in the 2-D systems (i.e. for their building block). The next definition is pertinent.
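Before turning to the layout definitions, the bisection-width figures used in this comparison can be checked numerically. The sketch below rests on assumptions of ours: the 1-D cut splits the ring into two contiguous halves, the 2-D HOW bisection width is taken as $\sqrt{p}$ times the 1-D value, and the torus value $\beta^{m-1}$ with m = 2w is the expression used in the text. It does not reproduce Table 2, and the function names are hypothetical.

```python
from math import isqrt

def ring_how_cut(p, w):
    """Edges of the wraparound HOW(p, w, 1) crossing the cut between
    nodes {0..p/2-1} and {p/2..p-1} (p even, p/2 >= w)."""
    half = p // 2
    edges = set()
    for a in range(p):
        for step in range(1, w + 1):
            edges.add(frozenset((a, (a + step) % p)))
    return sum(1 for e in edges if len({x < half for x in e}) == 2)

def bisection_2d_how(p_nodes, w):
    """Assumed 2-D value: sqrt(p) rows, each contributing w(w + 1) edges."""
    return w * (w + 1) * isqrt(p_nodes)

def bisection_torus_same_degree(p_nodes, w):
    """beta^(m-1) for the m-D torus with m = 2w and beta = p^(1/m)."""
    m = 2 * w
    return p_nodes ** ((m - 1) / m)

print(ring_how_cut(12, 3), "vs w(w+1) =", 3 * 4)
for p_nodes in (256, 1024, 4096):
    print(p_nodes, bisection_2d_how(p_nodes, 3), round(bisection_torus_same_degree(p_nodes, 3)))
```

For a fixed w the HOW value grows as $\sqrt{p}$ while the torus value grows almost linearly in p, so the HOW advantage holds only up to some system size, which is why the text speaks of a maximum number of nodes for each w.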
DEFINITION 1. The crossing number of a graph is the minimum number of edge crossings needed to draw the graph in the plane [31].
This number is related to the area needed to lay out the graph for VLSI implementation. To eliminate all edge crossings, several printed-circuit layers may have to be implemented. Not only does the number of layers affect the VLSI cost, but the thickness of each layer also contributes to the cost measure. To determine the VLSI/wire cost, we measure the complexity of each system based on the minimum number of layers required in the collinear layout of the circuit for zero edge crossings and the width of each layer. In the collinear layout, all nodes in the 1-D system lie on the same straight line. The chosen rules of routing the wires for 1-D systems are: (a) we consecutively number the nodes 0, 1, 2, . . . , p − 1, from left to right; (b) going from left to right, for even-numbered nodes the wires go to the top half of the printed-circuit board; (c) for odd-numbered nodes, the wires go to the bottom half of the printed-circuit board. These rules of routing minimize the maximum collective width, MCW (expressed in number of wires), in the y (vertical) dimension. Figure 2 shows the collinear layout of the 1-D HOW(12, 4, 1) and its brute-force decomposition for its implementation with two layers that eliminate all edge/wire crossings. However, the number of layers that eliminate all wire crossings depends on w, and thus it increases with increases in the window size. The following theorems are pertinent.
For even w, MCW corresponds to the lower half of the layer because $PE_{w-1}$, which is the right-most PE in the left-most window, is the last PE that contributes to MCW and it contributes to the lower half (because it has an odd address). Therefore, we have $PE_1$ contributing two wires because it is connected to $PE_w$ and $PE_{w+1}$ outside of this left-most window. $PE_3$ contributes four wires because it is connected to $PE_w$, $PE_{w+1}$, $PE_{w+2}$ and $PE_{w+3}$, and so on. Therefore, we have $MCW = 2 + 4 + 6 + 8 + \ldots + w$, or $\sum_{i=1}^{w/2} 2i$ where w/2 is an integer or, finally, $(w/2)((w/2) + 1)$.
For odd w, however, MCW corresponds to the upper half of the layer because $PE_{w-1}$, which is the right-most PE in the left-most window, is the last PE that contributes to MCW and it contributes to the upper half (because it has an even address). Therefore, we have $PE_0$ contributing one wire because it is connected to $PE_w$. $PE_2$ contributes three wires because it is connected to $PE_w$, $PE_{w+1}$ and $PE_{w+2}$, and so on. Therefore, we have $MCW = 1 + 3 + 5 + 7 + \ldots + w$, or $\sum_{i=1}^{(w+1)/2} (2i - 1) = ((w + 1)/2)^2$.
To obtain these results, we have assumed that all w wires leaving $PE_{w-1}$ exist and therefore w − 1 + w < p or w < (p + 1)/2. This should be expected to be the practical case for HOWs. However, the results do not cover GHs because for them we have w = p − 1. Therefore, GHs must be treated separately. Because of the symmetry in 1-D GHs, without loss of generality we can find the MCW by focusing on the upper half of the printed circuit. In fact, we can count the contribution of each PE in a left-to-right order. Let α be equal to p − 1. $PE_0$ contributes α wires because it is connected to α neighbors to its right. $PE_2$ contributes α − 4 wires to MCW because it is connected to α − 2 neighbors to its right and two levels of wires emanating from $PE_0$ can be reused (therefore, $PE_2$ can also use the same wire levels). Similarly, $PE_4$ contributes α − 8 wires to MCW because it is connected to α − 4 neighbors to its right and four levels of wires emanating from $PE_0$ can be reused. Similarly, $PE_6$ contributes α − 12 wires to MCW because it is connected to α − 6 neighbors to its right and six levels of wires emanating from $PE_0$ can be reused. In general, $PE_i$, where i = 2j, contributes α − 4j wires to MCW because it has α − 2j neighbors to its right and it can reuse 2j levels of wires emanating from $PE_0$. However, even-numbered PEs i for which α − 2i is negative or zero do not contribute to MCW. Therefore, contributing PEs have addresses 2i, with α − 4i ≥ 0 or $i \le \lfloor \alpha/4 \rfloor$. The value of MCW is then given by $MCW = \sum_{i=0}^{\lfloor \alpha/4 \rfloor} (\alpha - 4i)$.
Proof.
Assuming the wire routing rules defined earlier and the brute-force decomposition to produce zero wire crossings, we focus for the proof on a single window. Each layer deals with a pair of consecutive nodes within the window and there are w/2 pairs. Thus, we need a total of w/2 layers for the HOW(p, w, 1). For the GH, going from left to right in the collinear representation of the system, each layer contains two successive nodes that connect to all other nodes to their right. However, up to four rightmost nodes can be combined on the last layer with zero wire crossings, and thus the total number of layers for the GH is 1
We observe that the numbers of layers in HOWs and GHs of similar size are O(w) and O(p), respectively. This is another advantage of HOWs that renders them more viable for implementation than GHs. Let us now deal with another wire routing technique, namely restricted routing [32], that requires only two layers for the implementation of any system represented in the 2-D space. As a result, both HOWs and GHs require two printed-circuit layers regardless of their size. Horizontal and vertical wire segments are laid on two different wiring layers. Figure 3 demonstrates this technique for the HOW(12, 4, 1). Horizontal and vertical wires can then cross over each other without any electrical connection. If a connection is needed, a contact is placed at the respective intersection; these contacts contribute to the VLSI cost. Therefore, the total wiring cost with restricted routing has four components.
• The total number of wires: this number is $O(wp^2)$ and $O(p^3)$ for 2-D HOWs and GHs, respectively.
• The maximum collective width of wires, MCW: this number is $O(w^2)$ and $O(p^2)$ for HOWs and GHs, respectively.
• The length of the wires: the maximum length of horizontal wires is $O(w)$ and $O(p)$ for HOWs and GHs, respectively.
• The total number of electrical connections (contacts) between the two layers: this number is double the total number of wires; therefore, it is $O(wp^2)$ and $O(p^3)$ for HOWs and GHs, respectively.
Therefore, HOWs are superior to GHs even with restricted routing. We can conclude that HOWs are more amenable to implementation than GHs for reasonable values of $w$. The following sections also show that HOWs can deliver very high performance.
EMBEDDINGS INTO 2-D HOWS
We propose embeddings of popular interconnection networks into 2-D HOW systems. Such embeddings could prove very beneficial as HOW and related systems demonstrate significant promise in scalable parallel processing [33]. Some definitions are pertinent for the analysis of results. Given two graphs $G(V, E)$ and $G'(V', E')$, embedding the graph $G$ into the graph $G'$ maps each vertex in the set $V$ onto a vertex in $V'$ and each edge in the set $E$ onto an edge, or a set of edges (a path), in $E'$. Three important parameters determine the quality of a mapping. The dilation of a source edge in $E$ is the number of edges in $E'$ (the length of the path) that this edge is mapped onto. The congestion of a target edge in $E'$ is the number of source edges mapped onto this edge. The expansion is the ratio of the number of nodes in the set $V'$ to that in $V$. We try, whenever possible, to limit the scope of the discussion to cases where the expansion is one, for the sake of cost effectiveness.
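The three quality measures can be computed mechanically for a given vertex mapping. The sketch below is illustrative only: the function names are hypothetical, the target graph is assumed to be connected, and every source edge is assumed to be routed along one breadth-first shortest path in the target graph, which is not necessarily the path an actual embedding in this paper would choose.

```python
from collections import Counter, deque


def shortest_path(adj, src, dst):
    """Return one BFS shortest path from src to dst as a list of vertices."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def embedding_metrics(src_edges, tgt_adj, vertex_map):
    """Return (max dilation, max congestion, expansion) of an embedding."""
    load = Counter()  # number of source edges routed over each target edge
    dilation = 0
    for a, b in src_edges:
        path = shortest_path(tgt_adj, vertex_map[a], vertex_map[b])
        dilation = max(dilation, len(path) - 1)
        for u, v in zip(path, path[1:]):
            load[frozenset((u, v))] += 1
    src_vertices = {v for edge in src_edges for v in edge}
    expansion = len(tgt_adj) / len(src_vertices)
    return dilation, max(load.values(), default=0), expansion


if __name__ == "__main__":
    # Embed a 4-node ring into a 4-node linear array with the identity map.
    ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
    line = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(embedding_metrics(ring, line, {i: i for i in range(4)}))
```

In this toy case the wraparound ring edge must be stretched over the whole array, so the dilation is three and every target edge carries two source edges, while the expansion stays at one.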
Embedding a ring
We visit the nodes in a serpentine-like, column-wise way where the first column is scanned sequentially for an even number of rows. In this case, even with $w = 1$, we produce an optimal mapping. For an odd number of rows, the nodes on the first column cannot be visited sequentially, but an optimal mapping still exists for $w \ge 2$, as shown in Figure 4. If the ring has fewer nodes than the HOW system, we simply use one or more links connecting nodes at distance two to bypass several nodes in the 2-D HOW system for optimal mapping, as shown in Figure 5.
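For the even-row case, one possible serpentine ordering can be generated and verified as follows. This is a sketch under stated assumptions: the 2-D HOW with $w = 1$ is modeled as a grid whose nodes are linked to their immediate row and column neighbors without wraparound, the first column is reserved as the sequential return path, and the function names are hypothetical. The ordering in the paper's figures may differ in detail.

```python
def ring_on_grid(rows: int, cols: int):
    """Return a cyclic visiting order of all grid nodes using unit hops only.

    Columns 1..cols-1 are swept row by row in serpentine fashion, and
    column 0 is then scanned sequentially back to the start node.
    """
    if rows < 2 or rows % 2 or cols < 2:
        raise ValueError("need an even number of rows and at least two columns")
    cycle = [(0, 0)]
    for r in range(rows):
        sweep = range(1, cols) if r % 2 == 0 else range(cols - 1, 0, -1)
        cycle.extend((r, c) for c in sweep)
    # The last row has an odd index, so the sweep ends next to column 0.
    cycle.extend((r, 0) for r in range(rows - 1, 0, -1))
    return cycle


def is_dilation_one_ring(cycle, rows: int, cols: int) -> bool:
    """Check that every node is visited once and every hop has length one."""
    if len(cycle) != rows * cols or len(set(cycle)) != rows * cols:
        return False
    closed = cycle + cycle[:1]
    return all(abs(r1 - r2) + abs(c1 - c2) == 1
               for (r1, c1), (r2, c2) in zip(closed, closed[1:]))


if __name__ == "__main__":
    order = ring_on_grid(4, 3)
    print(order)
    print(is_dilation_one_ring(order, 4, 3))
```

Because every ring edge lands on a single grid link, the dilation is one, which is what makes the mapping optimal even for $w = 1$.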
Embedding a binary tree
Binary trees can be embedded into 2-D HOW systems in several ways. Such an embedding could be used for the implementation of data reduction operations [34]. Consider a full binary tree of depth $d$ containing $2^d - 1$ nodes and the 2-D HOW($\lceil \sqrt{2^d - 1} \rceil$, $w$, 2) system for the smallest expansion. We assume that $w \ge 2$. The two basic building blocks used in our optimal binary tree mapping are for the three-level tree, and are shown in Figures 6 and 7. These two building blocks and their mirror images are employed for the mapping of larger trees. For example, Figure 8 shows a mapping where the building block at the upper-left corner of Figure 7 and its three mirror images are used for the mapping of the four distinct three-level trees containing leaves of the original five-level tree. The mirror images are employed to minimize the distances between the roots of these trees for connections at the next level. The largest dilation of edges is two in this case (there is no way to directly connect processor-1 and processor-4 or processor-2 and processor-6; we use two edges to connect them together, as shown with the bold lines in Figure 8). In general, a large binary tree of depth $d$ is viewed as four appropriately connected subtrees of depth $d - 2$ for which embeddings into a 2-D HOW system are easily obtained recursively; the interconnection of their roots after the embeddings is then easily derived. An example is shown in Figure 9. The maximum dilation is two for binary trees with an odd number of levels. For an even number of levels, we have optimal embeddings.
Embedding a binary hypercube
Based on the desired expansion, we can embed a (direct binary) hypercube into a 2-D HOW system with two different methods. into the 'building block' HOW(3, 2, 2) is used recursively. As shown in Figure 11, the embedding into the building block results in only one unused node. Also, the source edges (000, 100), (100, 101) and (110, 111) have dilation two in the building block, and the congestion is two for the target edge (110, 100). Figure 11 shows the embedding of a 5-D hypercube using four 3-D hypercube building blocks mapped onto HOW(3, 2, 2)s. This example shows that there are only four unused nodes. In the general case, with the second method for an odd $d$ the chosen target system is the HOW($3 \times 2^m$, $w$, 2) for the best mapping with minimum expansion, where $m = (d - 3)/2$. The expansion is equal to $9 \times 2^{2m}/2^d = 9/8$.
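The expansion and the unused-node counts quoted above follow directly from the sizes involved, as the short check below illustrates. The function names are hypothetical; the only inputs taken from the text are the side length $3 \times 2^m$ with $m = (d - 3)/2$ and the $2^d$ hypercube nodes.

```python
from fractions import Fraction


def how_side(d: int) -> int:
    """Side length 3 * 2**m of the target 2-D HOW for an odd dimension d >= 3."""
    if d < 3 or d % 2 == 0:
        raise ValueError("d must be an odd integer >= 3")
    m = (d - 3) // 2
    return 3 * 2 ** m


def expansion(d: int) -> Fraction:
    """Ratio of target nodes (side squared) to hypercube nodes (2**d)."""
    side = how_side(d)
    return Fraction(side * side, 2 ** d)


def unused_nodes(d: int) -> int:
    """Target nodes left without a hypercube node mapped onto them."""
    return how_side(d) ** 2 - 2 ** d


if __name__ == "__main__":
    for d in (3, 5, 7):
        print(d, how_side(d), expansion(d), unused_nodes(d))
```

For $d = 3$ and $d = 5$ this gives one and four unused nodes, respectively, matching the counts stated for the building block and for the 5-D example.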
COMMUNICATION OPERATIONS ON 2-D HOWS
The communication latency, that is the time consumed to communicate a message between two nodes in the system, depends on the following parameters [35, 36, 37].
• Startup time ($t_s$): the time consumed by the sending node. It comprises the time to prepare the message (producing the header, trailer and error correction information), the time for the routing algorithm at the source and the time to send the first word of the message to the appropriate output communication port.
• Per-word channel transfer time ($t_w$): the time taken by a word to traverse a channel.
• Combining time ($t_c$): the time consumed by an intermediate node to switch a message from an input to an output port; it also includes the time to combine incoming messages, if needed, and send them to the appropriate output port.
We calculate only the time taken by a message to reach the input port of the destination. Additional time may be needed to get the data from that port. In store-and-forward (SF) routing, with a message traversing a path with multiple links, each intermediate node forwards the message to the next node in the path after it has received the entire message. To reduce the communication time, wormhole routing divides a message into flits (flow-control digits) [5]. As the header flit advances along the chosen path, the remaining flits follow in a pipelined fashion. If the header flit encounters a channel already in use, all flits are blocked until the channel becomes available [35]. Normally, the flit size coincides with the channel width. The combining time $t_c$ is ignored in wormhole routing.
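To make the difference between the two routing modes concrete, the sketch below evaluates a simple latency model built from the parameters just defined. The formulas are assumptions in the style of common textbook models, not expressions quoted from this paper: store-and-forward is taken to cost the startup time, a full message transfer per hop, and one combining delay per intermediate node; wormhole routing is taken to cost the startup time plus a pipeline of one-word flits with no blocking.

```python
def sf_latency(t_s: float, t_w: float, t_c: float, words: int, hops: int) -> float:
    """Assumed store-and-forward model.

    Each of the `hops` links carries the whole message (words * t_w), and
    each of the hops - 1 intermediate nodes adds a combining delay t_c.
    """
    if hops < 1 or words < 1:
        raise ValueError("need at least one hop and one word")
    return t_s + hops * words * t_w + (hops - 1) * t_c


def wormhole_latency(t_s: float, t_w: float, words: int, hops: int) -> float:
    """Assumed contention-free wormhole model with one-word flits.

    The header flit needs hops * t_w to arrive; the remaining words - 1
    flits follow in a pipeline, one per t_w. The combining time is ignored.
    """
    if hops < 1 or words < 1:
        raise ValueError("need at least one hop and one word")
    return t_s + (hops + words - 1) * t_w


if __name__ == "__main__":
    for hops in (1, 3, 6):
        print(hops,
              sf_latency(10, 1, 2, words=8, hops=hops),
              wormhole_latency(10, 1, words=8, hops=hops))
```

Under these assumptions the two models coincide for a single hop, and the store-and-forward time grows with the product of path length and message size, whereas the wormhole time grows with their sum.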
We compare the communications capabilities of 2-D HOWs, binary hypercubes [35] and 2-D GHs, all with the same number $p$ of nodes. For the sake of simplicity, we assume store-and-forward routing. Table 4 also summarizes the performance of these systems and compares them using the product of the 'communication time' and the 'node pin-out' as cost measure [39]; systems with lower cost value are preferable. The node pin-out for a network is the number of wires per node; it is the product of the node degree and the channel width (we assume constant width here). It is a very widely used measure of the VLSI cost. The cost of implementing these communications operations is asymptotically identical for HOWs and GHs; this is very important as HOWs are much easier to implement than GHs. Therefore, HOWs are proven viable networks in the field of very-high-performance computing.
One-to-one communication
It becomes obvious that GHs perform better than HOWs from the communication time point of view. However, the GH has a fundamental design disadvantage. Let us now redefine the cost of an interconnection network as the product of the 'communication time' and the 'bisection width'. This is another reasonable cost measure because we would like to achieve small communication time with a small VLSI system complexity. Table 5 shows the costs of the HOW($\sqrt{p}$, $w$, 2) and the GH($\sqrt{p}$, 2) for $\sqrt{p} \ge w$.
This table also shows that reductions in the cost of HOWs are proportional to reductions in the value of $w$ and this leads to predictability in their design.

We also carried out computer simulations to test the robustness of the proposed architecture. The simulation results are shown in Table 6. The total time shown is expressed in number of communication cycles. A single communication cycle is consumed for the transmission of a message between two neighboring nodes. All messages have the same size. The uniform distribution is used to determine the location of message initiators. All messages are generated in cycle number zero and dimension-order routing is applied. Each message buffer can hold up to five messages in all simulation runs. Each case was simulated 40 times and the average time is shown in the table. To test the architectures under many different loads and random communication patterns, the destinations are chosen randomly without assuming any specific distribution. Random communication patterns are much more demanding of symmetric networks than regular patterns, such as permutations. Simulations are shown here under various loads, where the number of sending nodes ranges from 19.63 to 96.35% of the total number of nodes. Therefore, the behavior of the proposed architecture is also tested under very heavy loads. We can observe that the proposed architecture yields outstanding performance. In the worst cases, the total time is comparable to the diameter of the system. Therefore, the simulation results also support our claim that HOWs are capable of delivering outstanding performance.
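The dimension-order routing used in the simulations, and the diameter against which the total times are compared, can be illustrated with a contention-free sketch. It is not the simulator used for Table 6. It assumes that in each dimension a node is linked to every node at distance at most $w$, with no wraparound links, and it ignores buffering and channel conflicts; the function names are hypothetical.

```python
def dimension_order_route(src, dst, w: int):
    """Route from src to dst, correcting the first coordinate, then the second.

    Each hop moves at most w positions along the current dimension.
    Returns the list of visited nodes, including src and dst.
    """
    if w < 1:
        raise ValueError("w must be at least 1")
    pos = list(src)
    route = [tuple(pos)]
    for axis in (0, 1):
        while pos[axis] != dst[axis]:
            gap = dst[axis] - pos[axis]
            pos[axis] += max(-w, min(w, gap))  # clamp the step to the window
            route.append(tuple(pos))
    return route


def diameter(n: int, w: int) -> int:
    """Largest hop count between two nodes of an n x n system with window w."""
    per_dimension = -(-(n - 1) // w)  # ceiling of (n - 1) / w
    return 2 * per_dimension


if __name__ == "__main__":
    path = dimension_order_route((0, 0), (5, 3), w=2)
    print(path, len(path) - 1)
    print(diameter(12, 4), diameter(12, 1), diameter(12, 11))
```

With $w = n - 1$ the model reduces to a 2-D GH with diameter two, and with $w = 1$ it reduces to a mesh, so the window size trades wiring for hop count between these two extremes.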
CONCLUSIONS
We have introduced a class of scalable architectures, namely HOWs, which are capable of very high performance. We have proved that HOWs have lower cost than fat trees of similar size and higher bisection width than tori of similar cost. We have also proposed graph embedding algorithms and demonstrated the implementation of various important communications operations. We have also compared the performance of this class of architecture with those of the binary and generalized hypercubes for the aforementioned communications operations. Our results show that not only are our architectures scalable and feasible with current and expected technologies, but they also perform better than the binary hypercube and comparably to the generalized hypercube for several highly demanding communications operations. Simulation results also show that HOWs can deliver very good performance under highly demanding communication loads.
