I. INTRODUCTION
Sequential computers have steadily increased in speed to meet computational demand, but that growth has reached saturation. Thus, the only way to meet the increasing demand for computation power to solve the grand challenge problems is to use parallel computers. Massively Parallel Computer (MPC) systems with thousands of nodes have been commercially available, and efforts have been made to build MPC systems with millions of nodes. In such computers, with millions of nodes, the large diameter of conventional topologies makes them infeasible. Hierarchical interconnection networks (HIN) [1] are a cost-effective way to interconnect a large number of nodes. A variety of hypercube-based hierarchical interconnection networks have been proposed [2]-[6], but for MPC systems the number of physical links becomes prohibitively large. To alleviate this problem, several k-ary n-cube based HINs [7]-[10] have been proposed. However, the performance of these networks does not yield any obvious choice of interconnection network for MPC systems; no network is a clear winner in all aspects of network design.
A Tori connected mESH (TESH) network [11], [12] is an HIN aimed at large-scale 3D MPC systems; it consists of multiple basic modules (BMs), which are 2D-mesh networks. The BMs are hierarchically interconnected by a 2D-torus to build higher level networks. The restricted use of physical links between BMs in the higher level networks and within the BMs reduces the dynamic communication performance of this network [14]. With increased inter-level connectivity, the dynamic communication performance of the TESH network is shown to be better than that of a mesh network; however, it is still not as good as that of a torus network. It has already been shown that a torus network has better dynamic communication performance than a mesh network [20]. We have therefore replaced the 2D-mesh of the TESH network with a 2D-torus network, and the modified HIN is called the Tori connected Torus Network (TTN) [16].
In our other studies, yet to be published, the TTN is seen to be suitable for a few tens of thousands of nodes; for millions of nodes, it does not yield better performance. In the TTN, the free links of the BM are assigned to higher level interconnections asymmetrically. We instead assign the free links in a symmetric order. This new interconnection network is called the Symmetric Tori connected Torus Network (STTN). It scales up to a million nodes at lower cost. However, the length of the longest wire is a limiting factor for an STTN with millions of nodes. The operating speed of a network is limited by the physical length of its links, so such long links may result in excessive latency or require slower signaling rates. This problem can be mitigated by folding the network [20]. After folding each level of the STTN, the resultant network is called the Folded Tori connected Torus Network (FTTN).
In this paper, we address the architectures of the STTN and FTTN and evaluate their static network performance. The static network performance is evaluated in terms of node degree, network diameter, cost, average distance, bisection width, and arc connectivity.
The remainder of the paper is organized as follows. In Sections II and III, we briefly describe the basic architecture of the STTN and FTTN, respectively. Addressing of nodes and the routing of messages are discussed in Section IV and Section V, respectively. The static network performance of the STTN and FTTN is discussed in Section VI. The evaluation of the longest wire length is presented in Section VII. Finally, in Section VIII, we conclude this paper.
II. INTERCONNECTION OF THE STTN
The Symmetric Tori connected Torus Network (STTN) is an HIN consisting of multiple basic modules (BMs) that are hierarchically interconnected to form a higher level network. A $(2^m \times 2^m)$ BM consists of a 2D-torus network of $2^{2m}$ processing elements (PEs) arranged in $2^m$ rows and $2^m$ columns, where m is a positive integer. With m = 2, a BM of size (4 × 4) is depicted in Figure 1 (a). Each BM has $2^{m+2}$ free ports at the contours for higher level interconnection. All ports of the interior nodes are used for intra-BM connections. All free ports of the exterior nodes, either one or two, are used for inter-BM connections to form higher level networks. In this paper, unless specified otherwise, BM refers to a Level-1 network.
Successive higher level networks are built by recursively interconnecting $2^{2m}$ immediate lower level subnetworks in a $(2^m \times 2^m)$ 2D-torus network. As portrayed in Figure 1 (b), with m = 2 a Level-2 STTN is formed by interconnecting $2^{2 \times 2} = 16$ BMs. Similarly, a Level-3 network is formed by interconnecting 16 Level-2 subnetworks, and so on. Each BM is connected to its logically adjacent BMs. To avoid clutter, the wraparound links of the BMs are not shown. For each higher level interconnection, a BM uses $4 \times 2^q = 2^{q+2}$ of its free links: $2(2^q)$ free links for vertical interconnections and $2(2^q)$ free links for horizontal interconnections. Here, $q \in \{0, 1, \ldots, m\}$ is the inter-level connectivity; q = 0 leads to minimal inter-level connectivity, while q = m leads to maximum inter-level connectivity. As shown in Figure 1 (a), for example, the (4 × 4) BM has $2^{2+2} = 16$ free ports. If we choose q = 0, then $4(2^0) = 4$ of the free ports and their associated links are used for each higher level interconnection, 2 for horizontal and 2 for vertical interconnection. Of the 2 links in each direction, one is used as the incoming link and the other as the outgoing link, i.e., a single link is used for each of vertical in, vertical out, horizontal in, and horizontal out. If we increase the inter-level connectivity q, then more than one link can be used for each of the vertical in, vertical out, horizontal in, and horizontal out connections, which in turn reduces the number of levels in the hierarchy. Note that the last digit in the labels denotes the link number, for example, 2V_out_k as shown in Figure 1.
In principle, m could be any positive integer. However, if m = 1, then the network degenerates to a hypercube network. The hypercube is not a suitable network, because its node degree increases with network size. The case m = 2 is considered the most interesting, because it has better granularity than larger BMs.
If m ≥ 3, the granularity of the family of networks is coarse. If m = 3, the BM is of size (8 × 8) with 64 nodes, and correspondingly a Level-2 network has 64 BMs. In this case, the total number of nodes in a Level-2 network is $N = 2^{2 \times 3 \times 2} = 4096$, and a Level-3 network has 262144 nodes. In the rest of this paper we consider m = 2; therefore, we focus on a class of ...
... $2^{m+2}$ free ports at the contours for higher level interconnection, and BM refers to a Level-1 FTTN.
Successive higher level networks are built by recursively interconnecting $2^{2m}$ immediate lower level subnetworks in a $(2^m \times 2^m)$ folded 2D-torus network. As portrayed in Figure 2 (b), with m = 2 a Level-2 FTTN is formed by interconnecting $2^{2 \times 2} = 16$ BMs in a folded fashion. Similarly, a Level-3 network is formed by interconnecting 16 Level-2 subnetworks in a folded fashion, and so on. Each BM is connected to its logically adjacent BMs. As in the STTN, for each higher level interconnection a BM uses $4 \times 2^q = 2^{q+2}$ of its free links: $2(2^q)$ free links for vertical interconnections and $2(2^q)$ free links for horizontal interconnections. Here q is the inter-level connectivity. As shown in Figure 2 ...
Note that the choice of $2^{2m}$ subnetworks to build the higher level networks is natural. This choice maintains the regularity of the network structure and thus makes the addressing of the nodes more convenient. The addressing of nodes will be explained in the next section.
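The port and size arithmetic described for both networks can be tabulated with a few lines of Python. This is a minimal sketch rather than the authors' code: the function names are hypothetical, and `max_level` is an inference from port counting (each higher level consuming $2^{q+2}$ of the $2^{m+2}$ free ports), not a formula stated in the text.

```python
def free_ports(m: int) -> int:
    """Free ports on the contour of a (2^m x 2^m) basic module."""
    return 2 ** (m + 2)


def links_per_level(q: int) -> int:
    """Free links a BM spends on each higher level: 4 * 2^q."""
    return 2 ** (q + 2)


def num_nodes(m: int, level: int) -> int:
    """Nodes in a Level-`level` network built from (2^m x 2^m) modules."""
    return 2 ** (2 * m * level)


def max_level(m: int, q: int) -> int:
    """Highest level reachable before the free ports run out.

    Assumption (not from the paper): levels 2, 3, ... each consume
    links_per_level(q) ports, and Level-1 is the BM itself.
    """
    if not 0 <= q <= m:
        raise ValueError("q must satisfy 0 <= q <= m")
    return free_ports(m) // links_per_level(q) + 1


if __name__ == "__main__":
    print(free_ports(2), links_per_level(0), max_level(2, 0))
    print(num_nodes(3, 2), num_nodes(3, 3), num_nodes(2, 5))
```

With m = 2 and q = 0 the sketch reports 16 free ports and 4 links per higher level, and `num_nodes(3, 2)` reproduces the 4096 nodes quoted for an (8 × 8) BM at Level-2.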
The question may arise whether we need massively parallel computers with thousands or millions of nodes. The answer is 'yes'. Solving the most challenging problems in many areas of science and engineering, such as defense (maintaining national security), aerospace (space exploration and shuttle operation), disaster management (recovering from natural disaster), and weather forecasting (predicting and tracking severe weather), requires tera-flops performance for more than a thousand hours at a time. This is why, in the near future, we will need computer systems capable of computing at the tens of peta-flops level or even the exa-flops level. To achieve this level of performance, we need MPC systems with thousands or millions of nodes.
IV. ADDRESSING OF NODES
Base-4 numbers are used for convenience of address representation. As seen in Figures 1 and 2, nodes in the networks are addressed by two digits, the first representing the row index and the next representing the column index. More generally, in a Level-L STTN/FTTN, the node address is represented by:
$A = A_L A_{L-1} \cdots A_1 = (a_{2L-1}\, a_{2L-2}) \cdots (a_3\, a_2)(a_1\, a_0)$
Here, the total number of digits is n = 2L, where L is the level number. $A_L$ is the address of level L and ... for a Level-L network. Pairs of digits run from group number 1 for Level-1, i.e., the BM, to group number L for the L-th level. Specifically, the l-th group $(a_{2l-1}\, a_{2l-2})$ indicates the location of a Level-(l − 1) subnetwork within the l-th group to which the node belongs; 1 ≤ l ≤ L. In a two-level network the address becomes $A = (a_3\, a_2)(a_1\, a_0)$. The pair of digits $(a_3\, a_2)$ identifies the BM to which the node belongs, and the pair of digits $(a_1\, a_0)$ identifies the node within that BM. The assignment of inter-level ports for the higher level networks has been done carefully so as to minimize the higher level traffic through the BM. The address of a node $n_1$ contained in $BM_1$ is represented as $n_1 = a \ldots$
All digits of $n_1$ in $BM_1$ are compared with those of $n_2$ in $BM_2$. If only one digit differs, and the difference is either ±1 or $\pm(2^m - 1)$ for the STTN, then the two BMs are connected in the differing co-ordinate position by a bidirectional link using the free ports. As portrayed in Fig. 1(b), the Level-2 STTN is constructed using 16 BMs.
... $(a_3, a_2) = (3, 0)$ for horizontal interconnection. Similarly, as portrayed in Fig. 2(b), the Level-2 FTTN is constructed using 16 BMs.
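The addressing scheme and the STTN linking rule for BMs can be illustrated with a short Python sketch. It assumes m = 2 (base-4 digits), takes the first digit of each pair as the row index, and uses hypothetical function names; it only restates the rule that two BMs are linked when exactly one digit differs by ±1 or $\pm(2^m - 1)$.

```python
def split_address(addr: str) -> list[tuple[int, int]]:
    """Split a base-4 digit string into (row, col) pairs, highest level first."""
    if len(addr) % 2 != 0 or any(ch not in "0123" for ch in addr):
        raise ValueError("address must be an even-length base-4 digit string")
    return [(int(addr[i]), int(addr[i + 1])) for i in range(0, len(addr), 2)]


def bms_linked(bm1: tuple[int, int], bm2: tuple[int, int], m: int = 2) -> bool:
    """STTN rule: exactly one digit differs, by +-1 or +-(2^m - 1)."""
    diffs = [a - b for a, b in zip(bm1, bm2) if a != b]
    return len(diffs) == 1 and abs(diffs[0]) in (1, 2 ** m - 1)


if __name__ == "__main__":
    print(split_address("231112"))       # three (row, col) pairs
    print(bms_linked((0, 0), (3, 0)))    # wraparound neighbours
    print(bms_linked((0, 0), (2, 0)))    # two apart: not linked
```

The difference of $2^m - 1$ captures the wraparound link of the torus, which is why BM (0, 0) and BM (3, 0) count as adjacent for m = 2.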
V. ROUTING ALGORITHM
Routing of messages in the STTN/FTTN is performed from top to bottom, as in the TTN [16]. That is, it is first done at the highest level network; then, after the packet reaches its highest level sub-destination, routing continues within the subnetwork to the next lower level sub-destination. This process is repeated until the packet arrives at its final destination. When a packet is generated at a source node, the node checks its destination. If the packet's destination is the current BM, the routing is performed within the BM only. If the packet is addressed to another BM, the source node sends the packet to the outlet node which connects the BM to the level at which the routing is performed.
Because of its simplicity and speed, we have considered the dimension order routing algorithm for the STTN/FTTN. At each level, vertical routing is performed first; once the packet reaches the correct row, horizontal routing is performed. Routing in the STTN/FTTN is strictly defined by the source node address and the destination node address. Let a source node address be ..., and a routing tag be $t = (t_{2L-1}, t_{2L-2}), (t_{2L-3}, t_{2L-4}), \ldots, (t_1, t_0)$, where ...
Let us consider an example in which a packet is to be routed from source node 000000 to destination node 231112. In this case, routing is to be done at Level-3; therefore, the source node sends the packet to the outlet node of the Level-3 STTN, 00 00 31, as shown in the bottom-left corner of Figure 4, whereupon the packet is routed at Level-3. Here again, the wraparound links of the Level-1, Level-2, and Level-3 networks are not shown to avoid clutter. After the packet reaches the Level-2 (2, 3) network, routing within that network continues until the packet reaches BM (1, 1). Finally, as shown in the bottom-right corner of Figure 4, the packet is routed to its destination Node(1, 2) within the destination BM. Routing of a packet for the same source-destination pair (000000, 231112) in the FTTN is shown in Figure 5. Message routing in the FTTN proceeds in the same way as in the STTN. However, because of the difference in structure, the FTTN follows different inter-subnetwork route steps, as can be seen by comparing Figures 4 and 5.
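The top-to-bottom order of the routing example can be sketched in Python. This is an illustration under assumptions, not the paper's routing algorithm: it only finds the highest level at which the source and destination addresses differ and lists the destination digit pairs that serve as sub-destinations from that level downward. It does not model outlet nodes or the routing tag, and the name `routing_plan` is hypothetical.

```python
def split_address(addr: str) -> list[tuple[int, int]]:
    """Split a base-4 digit string into (row, col) pairs, highest level first."""
    return [(int(addr[i]), int(addr[i + 1])) for i in range(0, len(addr), 2)]


def routing_plan(src: str, dst: str) -> tuple[int, list[tuple[int, int]]]:
    """Return (starting level, sub-destinations from that level down to Level-1).

    A starting level of 0 means source and destination coincide.
    """
    if len(src) != len(dst):
        raise ValueError("addresses must have the same number of digits")
    s, d = split_address(src), split_address(dst)
    levels = len(s)
    for idx, (s_pair, d_pair) in enumerate(zip(s, d)):
        if s_pair != d_pair:
            # Pairs are stored highest level first, so index 0 is Level-L.
            return levels - idx, d[idx:]
    return 0, []


if __name__ == "__main__":
    level, targets = routing_plan("000000", "231112")
    print(level, targets)
```

For the pair (000000, 231112) the sketch starts at Level-3 and lists the sub-destinations (2, 3), (1, 1), and (1, 2), matching the Level-2 network, BM, and node named in the example.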
VI. STATIC NETWORK PERFORMANCE
Although the actual performance of a network depends on many technological and implementation issues, several topological properties and performance metrics can be used to evaluate and compare different network topologies in a technology-independent manner. Most of these properties are derived from the graph model of the network topology. In this section, we discuss some performance metrics that characterize the cost and performance of an interconnection network. For the performance evaluation, we have considered mesh, torus, and k-ary 2-cube based HIN such as TESH network, TTN, proposed STTN, and FTTN. For fair comparison, we did not consider k-ary 3-cube based HIN.
A. Node Degree
The node degree is defined as the maximum number of physical links emanating from a node. Since each exterior node of the BM has six links, the degree of the STTN and FTTN is 6, independent of network size. Constant degree networks are easy to expand, and the network interface cost of a node remains unchanged with increasing network size, since the I/O interface cost of a node is proportional to its degree. Table I shows that the degree of the STTN and FTTN is equal to that of the TTN and higher than that of the mesh, torus, and TESH networks.
B. Diameter
Data is transferred from one node to an adjacent node through the communication link connecting them, whereas messages between nodes that are not directly connected must pass through intermediate nodes. The length of a path is the number of links it traverses from the source node to the destination node. By definition, the distance between adjacent nodes is unity. The diameter of a network is the maximum inter-node distance, i.e., the maximum, over all distinct pairs of nodes, of the number of links on the shortest path between them. The diameter is commonly used to describe and compare the static network performance of network topologies. Networks with small diameters are preferable: the smaller the diameter, the shorter the time to send a message from one node to the node farthest away from it. For the k-ary n-cube, the diameter is given by $D(k, n) = n \lfloor k/2 \rfloor$. The STTN is an HIN that consists of k-ary 2-cubes. According to the routing algorithm, a packet travelling from source to destination traverses three kinds of k-ary 2-cube: the source BM, the higher level networks used for inter-BM transfer, and the destination BM. Therefore, the diameter of the STTN is calculated from the diameters of the source BM, the destination BM, and the higher level 2D-tori, together with the maximum distance from one level's free links to the free links of the next lower level. It is calculated using the following equations:
Here D represents the diameter, and the subscript s refers to the source BM, d to the destination BM, and i to the various levels traversed in the hierarchy. The diameters of the networks are plotted in Figure 6. Clearly, the STTN has a much smaller diameter than the TESH, torus, and mesh networks, and a slightly larger one than the TTN. However, the difference diminishes as the number of nodes increases. Figure 6 also shows that the diameter of the FTTN is slightly larger than that of the STTN; the difference is trivial.
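The k-ary n-cube diameter used in this subsection can be cross-checked numerically for the 2D case. The sketch below takes the closed form as $n \lfloor k/2 \rfloor$ (the floor for odd k is an assumption about the intended formula) and compares it with a breadth-first search on a k × k torus. It covers only the plain torus, not the STTN or FTTN, whose diameter equations are not reproduced here.

```python
from collections import deque


def kary_ncube_diameter(k: int, n: int) -> int:
    """Closed form n * floor(k / 2) for a k-ary n-cube."""
    return n * (k // 2)


def torus2d_diameter_bfs(k: int) -> int:
    """Diameter of a k x k torus by BFS from node (0, 0).

    A torus looks the same from every node, so one source is enough.
    """
    dist = {(0, 0): 0}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = ((r + dr) % k, (c + dc) % k)
            if nb not in dist:
                dist[nb] = dist[(r, c)] + 1
                queue.append(nb)
    return max(dist.values())


if __name__ == "__main__":
    for k in (4, 5, 8):
        print(k, kary_ncube_diameter(k, 2), torus2d_diameter_bfs(k))
```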
C. Cost
Inter-node distance, message traffic density, and fault tolerance depend on the diameter and the node degree. The product (diameter × node degree) is a good criterion for measuring the relationship between cost and performance of a multiprocessor system [4], [17]-[19]. An interconnection network with a large diameter has a very low message passing bandwidth, and a network with a high node degree is very expensive to implement. In addition, a network should be easily scalable; there should be no changes in the basic node configuration as the number of nodes increases. Therefore, low cost implies low engineering complexity. The cost of the different networks is plotted in Figure 7, which shows that the cost of the STTN and FTTN is far lower than that of the mesh and torus networks and slightly higher than that of the TTN and TESH networks. Also, the cost of the FTTN is slightly higher than that of the STTN.
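As a worked instance of the (diameter × node degree) product, the following sketch evaluates it for square 2D mesh and torus networks. The degree of 4 and the diameters $2(\sqrt{N} - 1)$ and $2 \lfloor \sqrt{N}/2 \rfloor$ are the standard values for these topologies, not numbers read from Figure 7, and the function names are hypothetical. No values are computed for the hierarchical networks, since their diameter equations are not available here.

```python
import math


def _side(num_nodes: int) -> int:
    """Side length of a square layout; rejects non-square node counts."""
    side = math.isqrt(num_nodes)
    if side * side != num_nodes:
        raise ValueError("num_nodes must be a perfect square")
    return side


def mesh_cost(num_nodes: int) -> int:
    """Degree (4) times diameter 2 * (side - 1) of a square 2D mesh."""
    return 4 * 2 * (_side(num_nodes) - 1)


def torus_cost(num_nodes: int) -> int:
    """Degree (4) times diameter 2 * floor(side / 2) of a square 2D torus."""
    return 4 * 2 * (_side(num_nodes) // 2)


if __name__ == "__main__":
    for n in (256, 4096, 65536):
        print(n, mesh_cost(n), torus_cost(n))
```

For both networks the product grows with $\sqrt{N}$ because the degree is fixed and the diameter is not, which is the behaviour a hierarchical design tries to avoid.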
D. Average Distance
The diameter may not always be indicative of the actual performance of a network, because most of the time data travel along paths shorter than the diameter. Therefore, it is important to calculate the average distance that data travel. The average distance is the mean distance between all distinct pairs of nodes in a network. A small average distance results in small communication latency, especially for distance-sensitive routing such as store-and-forward. It is also crucial for distance-insensitive routing, such as wormhole routing, since short distances imply the use of fewer links and buffers, and therefore less communication contention. We have evaluated the average distances of the STTN, FTTN, TTN, and TESH network by simulation and those of the mesh and torus networks by their corresponding formulas; the results are plotted in Figure 8. The average distance of the STTN and FTTN is remarkably lower than that of the TESH network, and far lower than that of the mesh and torus networks. The difference between the average distance of the STTN/FTTN and that of the TTN diminishes as the number of nodes increases. With a large number of nodes, i.e., $2^{20} \approx 1$ million nodes, the average distance of the STTN is lower than that of its rival, the TTN.
Although the dynamic communication performance of a program on a multicomputer depends on the actual times taken for data transfer, a smaller average distance and diameter of an interconnection network yield a smaller communication latency for that network.
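The definition of average distance as the mean over all distinct pairs of nodes can be made concrete for small mesh and torus networks. The sketch below is illustrative only: it enumerates all pairs on a k × k grid and uses the per-dimension shortest offset, which is exact for these two topologies. It does not reproduce the simulation used for the hierarchical networks, and the function names are hypothetical.

```python
from itertools import product


def axis_distance(a: int, b: int, k: int, wrap: bool) -> int:
    """Shortest hop count along one dimension, with or without wraparound."""
    d = abs(a - b)
    return min(d, k - d) if wrap else d


def average_distance(k: int, wrap: bool) -> float:
    """Mean distance over all ordered pairs of distinct nodes on a k x k grid."""
    if k < 2:
        raise ValueError("k must be at least 2")
    nodes = list(product(range(k), repeat=2))
    total = sum(
        axis_distance(r1, r2, k, wrap) + axis_distance(c1, c2, k, wrap)
        for (r1, c1) in nodes
        for (r2, c2) in nodes
    )
    n = len(nodes)
    # Pairs of a node with itself add 0 to the total and are excluded from the count.
    return total / (n * (n - 1))


if __name__ == "__main__":
    print("4x4 mesh :", average_distance(4, wrap=False))
    print("4x4 torus:", average_distance(4, wrap=True))
```

Enumerating all pairs costs time quadratic in the number of nodes, so this approach suits only small networks; for large N the closed-form expressions are the practical choice.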
E. Bisection Width
The Bisection Width (BW) of a network is defined as the minimum number of links that must be removed to partition the network into two equal halves. Many problems can be solved in parallel using binary divide-and-conquer: split the input data set into two halves, solve them recursively on both halves of the interconnection network in parallel, and then merge the results from both halves into the final result. A small bisection width implies low bandwidth between the two halves, which can slow down the final merging phase. On the other hand, a large bisection width is undesirable for the VLSI design of the interconnection network, since it implies a lot of extra chip wires, as in the hypercube [11]. The bisection width of the STTN(m, L, q) is given in Eq. 4, where m indicates the size of the BM, L is the level of hierarchy ($2 \le L \le L_{max}$), and q is the inter-level connectivity.
Similarly, the bisection width of the FTTN(m, L, q) is given in Eq. 5. For L = 1, the network is the basic module, which is a 2D-torus for the STTN and a folded 2D-torus for the FTTN. The BW of the STTN and FTTN are calculated using Eq. 4 and Eq. 5, respectively. They give the number of links that need to be removed to partition the highest level (Level-L) torus network. We have also calculated the bisection width of the TTN, TESH, mesh, and torus networks by their respective static formulas, and the results are plotted in Figure 9. The bisection width of the STTN and FTTN is exactly equal to that of the TTN and TESH network, and beyond 4096 nodes it is higher than that of the conventional mesh and torus networks. Therefore, very large STTN and FTTN networks can be expected to yield high throughput owing to this large bisection width.
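For the conventional networks used as baselines, the bisection width can be illustrated by counting the links that cross the middle cut of a k × k layout. This is a sketch under assumptions: it counts one particular cut (between the two middle columns), which for a square mesh or torus with even k coincides with the standard bisection, and it does not implement Eq. 4 or Eq. 5. The function name is hypothetical.

```python
def middle_cut_links(k: int, wrap: bool) -> int:
    """Links crossing the vertical middle cut of a k x k mesh or torus."""
    if k < 4 or k % 2 != 0:
        raise ValueError("k must be even and at least 4")
    edges = set()
    for r in range(k):
        for c in range(k):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if wrap:
                    nr, nc = nr % k, nc % k
                elif nr >= k or nc >= k:
                    continue  # a mesh has no wraparound links
                edges.add(frozenset(((r, c), (nr, nc))))
    half = k // 2
    # A link crosses the cut when its endpoints lie in different column halves.
    return sum(1 for e in edges if len({col < half for (_, col) in e}) == 2)


if __name__ == "__main__":
    print("mesh :", middle_cut_links(8, wrap=False))
    print("torus:", middle_cut_links(8, wrap=True))
```

The torus count is twice the mesh count because each row contributes one wraparound link in addition to the link between the two middle columns.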
F. Arc Connectivity
Arc connectivity measures the robustness of a network. It is a measure of the multiplicity of paths between processors, defined as the minimum number of links that must be removed in order to break the network into two disjoint parts. High arc connectivity improves performance during normal operation by avoiding link congestion, and also improves fault tolerance. The ratio between the arc connectivity and the degree of a node gives a measure of static fault tolerance performance. A network is maximally fault-tolerant if its connectivity is equal to its degree. The arc connectivity of various networks is shown in Table I. Clearly, the arc connectivity of the STTN and FTTN is exactly equal to that of the TTN and torus network and higher than that of the mesh and TESH networks. However, the arc connectivity of the torus network is exactly equal to its degree; thus, the torus is more fault tolerant than all the other networks. The STTN and FTTN are as fault tolerant as the TTN and more fault tolerant than the mesh and TESH networks.
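The ratio of arc connectivity to node degree can be worked out for a few of the networks discussed. The sketch below uses standard values for the 2D mesh (arc connectivity 2, degree 4) and 2D torus (arc connectivity 4, degree 4), and for the STTN/FTTN it takes the degree of 6 and an arc connectivity equal to that of the torus, as stated in the text. These inputs are assumptions entered by hand, not values read from Table I.

```python
from fractions import Fraction


def fault_tolerance_ratio(arc_connectivity: int, degree: int) -> Fraction:
    """Static fault tolerance measure: arc connectivity divided by degree."""
    if degree <= 0 or not 0 <= arc_connectivity <= degree:
        raise ValueError("need 0 <= arc_connectivity <= degree and degree > 0")
    return Fraction(arc_connectivity, degree)


# (arc connectivity, degree); hand-entered assumptions, see the text above.
NETWORKS = {
    "2D-mesh": (2, 4),
    "2D-torus": (4, 4),
    "STTN/FTTN": (4, 6),
}

if __name__ == "__main__":
    for name, (arc, deg) in NETWORKS.items():
        print(name, fault_tolerance_ratio(arc, deg))
```

A ratio of 1 corresponds to the maximally fault-tolerant case, which among these entries only the torus attains.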
G. Some Generalization
It is mentioned in Section II that m = 2 gives better granularity than larger BMs. For a more general study of the static network performance of the STTN and FTTN, we have considered networks with 4096 nodes. We have evaluated the static network performance of various networks with 4096 nodes for different m and q, and tabulated the results in Table II. The diameter, cost, and average distance of the TESH, TTN, STTN, and FTTN with m = 3 are higher than with m = 2. That is, as m increases, the static network performance of an HIN becomes worse. Moreover, for m = 3, i.e., for an (8 × 8) BM, the longest links of the higher level networks become very long, which outweighs the benefit of the hierarchy. Therefore, m = 2 is the best choice for the STTN and FTTN, and the configuration STTN(2,3,q)/FTTN(2,3,q) is better than STTN(3,2,q)/FTTN(3,2,q).
VII. LONGEST WIRE LENGTH
The cost of a VLSI system is predominantly that of its connecting wires, and its performance is limited by the delay introduced by these interconnecting links. Thus, to achieve the required performance, the network must make efficient use of the available wires. The length of the longest wire [7] is an important parameter in the design of an interconnection network. The performance of a network is strongly influenced by the longest links.
The operating speed of a network is limited by the physical length of its links. Thus, the length of the longest wire can be used to describe and compare the maximum physical speeds that the various networks can attain. The length of the longest wire may become more important than the diameter of the network. We will assume that all networks have a 2D-planar implementation. The formula commonly used to describe the longest wire length (LWL) of a k-ary n-cube, $LWL_{k,n}$ [20], is given in Eq. 6. In a k-ary n-cube network, n represents the dimension of the network and k is the radix. The dimension (n), radix (k), and number of nodes (N) are related by $N = k^n$.
Eq. 6 assumes a square layout of nodes with each side having $\sqrt{N}$ nodes. It underestimates the maximum length because it does not take into account the length of the wrap-around link. For a regular layout, the length of the wrap-around link is given by:
A similar formula can be developed for the FTTN, as shown in Eq. 8. In a regular 2D layout, the longest wire is found in the inter-subnetwork links of the highest level torus, and it always spans only one subnetwork. That is, for a Level-3 network it spans one Level-2 network. We assume that each node is implemented in one tile area. In a 2D-planar realization, one subnetwork apart means $k^{L-1}$ tiles, i.e., nodes, apart. Here k represents the number of subnetworks in one array for the FTTN, and it is $k = 2^m$.
The width of a node (in tiles) depends upon the underlying CMOS technology used to integrate it, and the length of a link between nodes is the number of tiles it passes in a 2D-planar implementation. The length of the longest link is the largest among all interconnections. Since the actual distance depends upon the underlying technology, we evaluate the longest wire length of the different networks as the number of nodes (tiles) a link passes to connect its two end nodes; the results are shown in Figure 10. We did not consider the 2D-mesh network in the evaluation of longest wire length, because it does not have any wrap-around links. The longest wire lengths of the 2D-torus and STTN are equal, as are those of the TTN and TESH. However, the longest wire length of the FTTN is far lower than that of the 2D-torus, TESH, TTN, and STTN.
The longest wires in a planar Level-4 FTTN are the inter-Level-3 links, which interconnect the folded Level-3 networks to construct the Level-4 FTTN; their length is 64 tiles. With 65 nm CMOS technology [21], the tile size is 2 mm × 1.5 mm = 3 mm². The length of this longest interconnect is therefore 64 × the width of a processor (in tiles) = 64 × 2.0 mm = 12.80 cm.
With a 2D-planar implementation, the longest links of the Level-3 and Level-4 STTN have lengths 63 and 255, respectively. These are the wrap-around links of the higher level interconnection. The BM of the STTN is a 2D-torus network, so some additional medium-length links are also needed. The main demerit of the STTN is thus that it requires some medium-length and long links; however, this cost yields better performance. To overcome this problem, we have folded the STTN in the manner of a folded 2D-torus network, and we find that the longest links of the Level-3 and Level-4 FTTN have lengths 16 and 64, respectively. For the calculation of the longest link, we have considered m = 2, i.e., k = 4, for the STTN, FTTN, TTN, and TESH network. The longest link length of the FTTN is about one-fourth of that of the STTN. For the FTTN, the longest link spans just one subnetwork: one BM for a Level-2 network, one Level-2 network for a Level-3 network, and so on. In general, the longest link length of the FTTN is about $1/2^m$ of that of the STTN.
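The longest-wire figures quoted in this section can be reproduced with a short calculation. In this sketch the FTTN length $k^{L-1}$ follows the statement that the longest link spans one subnetwork, while the STTN length $k^L - 1$ is an inference from the quoted values 63 and 255 (a wrap-around link across a side of $k^L$ tiles), not a formula given in the text. The function names and the 2.0 mm default tile width are illustrative.

```python
def fttn_longest_wire(m: int, level: int) -> int:
    """Longest FTTN link in tiles: one Level-(level - 1) subnetwork, k^(level-1)."""
    k = 2 ** m
    return k ** (level - 1)


def sttn_longest_wire(m: int, level: int) -> int:
    """Longest STTN link in tiles, taken as k^level - 1.

    Assumption: inferred from the quoted lengths 63 (Level-3) and 255 (Level-4).
    """
    k = 2 ** m
    return k ** level - 1


def wire_length_cm(tiles: int, tile_width_mm: float = 2.0) -> float:
    """Physical length in cm of a link spanning `tiles` tiles."""
    return tiles * tile_width_mm / 10.0


if __name__ == "__main__":
    for level in (3, 4):
        f, s = fttn_longest_wire(2, level), sttn_longest_wire(2, level)
        print(level, f, s, round(f / s, 3))
    print(wire_length_cm(fttn_longest_wire(2, 4)), "cm")
```

Under these assumptions the ratio of FTTN to STTN longest wire length is $k^{L-1}/(k^L - 1)$, which approaches $1/k = 1/2^m$ as L grows, in line with the one-fourth figure for m = 2.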
The effect of the longest wire length can be further reduced by replacing the long electronic links with optical links; that is, short links remain electronic while long links are optical. Studying the architecture and performance of an opto-electronic FTTN, or hybrid FTTN, is left as future work.
VIII. CONCLUSION
Two new hierarchical interconnection networks, called the Symmetric Tori connected Torus Network (STTN) and the Folded Tori connected Torus Network (FTTN), are proposed for high performance MPC systems. The architecture of the STTN and FTTN, the addressing of nodes, and the routing of messages were discussed in detail. We have evaluated the static network performance of the STTN and FTTN, as well as that of several other interconnection networks. The evaluation shows that the STTN and FTTN possess several attractive features, including constant node degree, small diameter, low cost, small average distance, better bisection width, and better fault tolerance. The diameter and average distance of the STTN are lower than those of the TTN, TESH, torus, and mesh networks for very large networks. The diameter and average distance of the FTTN are slightly higher than those of the STTN; however, the difference is trivial. The STTN and FTTN are as fault tolerant as the TTN and more fault tolerant than the mesh and TESH networks. The STTN and FTTN yield better static network performance at reasonable cost for a network consisting of millions of nodes. The longest wire length of the FTTN is far lower than that of the 2D-torus, TESH, TTN, and STTN; it is about $1/2^m$ of that of the STTN. An interconnection network with good static network performance, reasonable cost, good fault tolerance, and a very small longest wire length is indispensable for next generation high-performance massively parallel computer systems. Therefore, the FTTN would be a good choice of interconnection network for millions of nodes. This paper focused on the architectural structure and static network performance. Issues for future work include the following: (1) evaluation of the dynamic communication performance of the proposed networks using dimension order routing and (2) assessment of the performance improvement of the STTN and FTTN with an adaptive routing algorithm.
