Abstract: As an interconnection topology, the two-dimensional mesh is widely used in network-on-chip (NoC) designs that integrate dozens of cores on a VLSI chip, because of its very simple structure and ease of on-chip implementation. However, as IC technology progresses, it becomes possible to integrate a large-scale system containing more than one thousand processing elements or cores on a single chip. In such a case, the mesh topology degrades performance because the communication time among cores increases. This paper investigates the topologies and IC layout schemes of mesh, torus, hypercube, and metacube for achieving good cost-performance tradeoffs. We propose an analytical model for evaluating the cost-performance ratio that considers the NoC's topology and layout. The model is parameterized with the node degree, the graph diameter, the number of routers, the router complexity, the bandwidth of the connection for the router, the number of processing cores, the total length of links, and the cost ratios of the link section and the router section. This model helps us find the optimal topology and layout for an NoC with a given network size. We found that when the network size is small, mesh has better cost-performance than the others; as the network size increases, torus and hypercube outperform mesh; and metacube has the best cost-performance among them.
I. INTRODUCTION
Network-on-chip (NoC) is a subset of system-on-chip (SoC) that provides a communication fabric among the processing cores in a single chip [1]. A number of research studies have shown the feasibility and advantages of NoC over traditional bus-based architectures [2], especially for designing a large-scale SoC that contains a large number of processing cores. NoC connects on-chip cores together to form a parallel system.
Conventional interconnection networks for constructing large-scale supercomputers consist of routers and links. A router is usually implemented with a crossbar switch that has a number of ports for exchanging data or messages among ports. Links are high-speed cables that connect ports according to a certain topology. Because the cables are off-chip, supercomputers can be implemented with high-dimensional, complex interconnection networks. Research on conventional interconnection networks focuses on achieving low cost and high performance, which are affected mainly by node degree and diameter. A node consists of a router and one or more compute nodes. Node degree is the number of ports used for connecting to ports of other nodes. It does not include the ports used to connect compute nodes.
The diameter of an interconnection network is defined as the maximum of the shortest distances between any two nodes. Achieving low cost and high performance means reducing the node degree and shortening the diameter.
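As an illustration of these two definitions, the following Python sketch builds a small 2D mesh and measures its node degree and diameter by breadth-first search. The helper names (`mesh_adjacency`, `max_degree`, `diameter`) are hypothetical and introduced only for this example; the sketch assumes one router per grid point and bidirectional links.

```python
from collections import deque


def mesh_adjacency(n):
    """Adjacency lists of an n x n 2D mesh; nodes are (row, col) tuples."""
    adj = {}
    for r in range(n):
        for c in range(n):
            neighbors = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n and 0 <= cc < n:
                    neighbors.append((rr, cc))
            adj[(r, c)] = neighbors
    return adj


def max_degree(adj):
    """Largest number of router-to-router links attached to any node."""
    return max(len(neighbors) for neighbors in adj.values())


def eccentricity(adj, src):
    """Longest shortest-path distance (in hops) from src, found by BFS."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def diameter(adj):
    """Maximum of the shortest distances between any two nodes."""
    return max(eccentricity(adj, src) for src in adj)


if __name__ == "__main__":
    adj = mesh_adjacency(4)
    print(max_degree(adj), diameter(adj))
```

For a 4 × 4 mesh, the two corner routers are six hops apart, which is the diameter; interior routers have four links each.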
Unlike conventional interconnection networks, NoC implements links with on-chip wires in the integrated circuit (IC) design. Due to the limitations of the IC process, it is difficult for NoC to implement high-dimensional networks. Multidimensional topologies must be projected onto a two-dimensional (2D) plane, although 3D layout is expected to become available in the future. To achieve low cost and high performance in NoC, we must consider, in addition to node degree and diameter, the IC layout, the length of links, and the area ratio of routers and wires. The two-dimensional mesh is a popular topology for NoC design because of its very simple structure and ease of on-chip implementation. However, as IC technology progresses, it becomes possible to integrate a large-scale system containing more than one thousand cores on a single chip. In such a case, the mesh topology degrades performance because the communication time among cores increases. It is therefore necessary to evaluate the on-chip implementations of other topologies and their cost-performance.
The main contributions of this paper are an investigation of the on-chip network topologies and layout schemes of mesh, torus, hypercube, and metacube, and an analytical model for evaluating the cost-performance of on-chip interconnection networks based on their topological properties, architecture, and IC layout characteristics. The model takes the following parameters into consideration: node degree, graph diameter, the number of routers, the router complexity, the bandwidth of the connection for the router, the number of processing cores, the total length of links, and the cost ratio of the link section and the router section. By using this model, we can design a large-scale high-performance SoC at low cost.
II. COST AND PERFORMANCE METRICS FOR NOC
For the NoC implementation, some parameters about layout must be considered. In the interconnection network of supercomputers, the important feature is the number of hops, not the physical distance or link length. However, in the NoC, links between routers are implemented on the silicon. They have an impact on the area and power consumption of the chip. Therefore, we will use the length of links as one of the parameters to evaluate the cost-performance.
A. Topological Properties

1) Node Degree:
The number of links connected to a node. Smaller or fixed node degree is a preferable characteristic of the topology. This is because the cost of the router increases non-linearly as the node degree increases.
2) Diameter: The maximum number of links of the shortest path between any two nodes. The communication time between two nodes is proportional to the diameter. Larger diameter worsens the communication delay.
3) Network Size: The total number of nodes in the whole network. Note that the network size does not equal the number of processing cores. For supercomputers, the number of cores reaches millions and these systems have a huge network. On the other hand, recent NoCs have dozens of nodes [3], [4]. The network size of NoCs is expected to increase in the future. However, since the area of one chip and transistor scaling are limited, we consider network sizes in a range up to a few thousand.
4) Link Complexity:
The total number of links in the network. Larger link complexity reduces the network diameter but increases the node degree. In this paper, we assume that links are bidirectional rather than unidirectional.
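The four topological properties above can be tabulated for the regular topologies using their standard closed-form values. The sketch below is illustrative only: the `Topology` record and the constructor functions are hypothetical names, an n × n layout with n ≥ 3 is assumed for mesh and torus, and links are counted as bidirectional. Metacube is omitted because its parameters are not defined in this text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    name: str
    nodes: int      # network size
    degree: int     # maximum node degree
    diameter: int   # maximum shortest-path length in hops
    links: int      # link complexity (bidirectional links)


def mesh(n):
    """n x n 2D mesh, n >= 3 so that interior nodes have degree 4."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return Topology("mesh", n * n, 4, 2 * (n - 1), 2 * n * (n - 1))


def torus(n):
    """n x n 2D torus, n >= 3 so that wrap-around links are distinct."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return Topology("torus", n * n, 4, 2 * (n // 2), 2 * n * n)


def hypercube(m):
    """m-dimensional binary hypercube, m >= 1."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return Topology("hypercube", 2 ** m, m, m, m * 2 ** (m - 1))


if __name__ == "__main__":
    for topo in (mesh(32), torus(32), hypercube(10)):
        print(topo)
```

With 1024 nodes, the mesh has diameter 62, the torus 32, and the hypercube 10, while the hypercube needs degree 10 instead of 4. This is the tradeoff between diameter and node degree that the model has to weigh.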
B. Architectural Properties
We must consider some properties based on the architecture of the network and the limitation of 2D layout. Architectural properties concern the system configuration and circuit design of components, including the number of processing cores for one router, router complexity, link thickness (connection width), cost ratios of the router section and the link section, and external ports for connecting off-chip components.
1) The Number of Processing Cores for One Router (p):
A router can connect multiple processing cores directly. If each router connects more cores, the total number of processing cores becomes larger; equivalently, a smaller network can connect the same number of cores. At the same time, the cost of each router increases because the number of ports in a router increases.
2) Router Complexity (λ):
Router complexity is the exponent used in the router cost estimation. The router consists of various components such as the crossbar switch, FIFO buffers for virtual channels, and the routing control logic. The cost of each component increases linearly or quadratically with the number of ports. For example, an m × m crossbar switch consumes an area proportional to the square of m. The order of the total cost must be chosen according to the router architecture. We let the router complexity λ be in the range from 1.0 to 2.0.
3) Thickness (t): A coefficient for adjusting the connection width for a router with multiple processing cores. Considering a fat-tree or Fig. 1(b), when p is doubled, the router-to-router connection width can also be doubled. Meanwhile, considering a thin-tree [5] or Fig. 1(c), the connection width is decreased compared to that of Fig. 1(b). In this paper, we define a parameter, thickness t, for adjusting the connection width of routers. The connection width equals p when t = 1, as shown in Fig. 1(a) and Fig. 1(b). When we let t < 1, the connection width is decreased to tp, as shown in Fig. 1(c), where t = 1/2.

Fig. 1. Router-to-router connection width adjustment by t
4) Cost Ratios of the Router Section and Link Section (α):
The total cost of the network is the sum of the costs of the router section and the link section, and it is affected significantly by the dominant section [6]. For example, when the tile size is large and the router is relatively small, the impact of links becomes large. The result in [7] shows that the link section significantly dominates in both area and power. Meanwhile, in another study [3], the router section dominates and the links consume 17% of the communication power. The cost balance depends on the chip architecture, and it will also change in the future. Therefore, we let α be the cost ratio of the router section to the sum of the router and link sections, with 0 < α < 1. For example, α = 0.6 means that the router section and the link section account for 60% and 40%, respectively, of the total cost of the network.
5) External Ports for Connecting Off-chip Components:
An actual NoC must have external interfaces for connecting to memory, other processor chips, etc. A mesh network has free ports on its outer boundary routers, and these free ports can be used as external interfaces. Symmetric networks such as torus and hypercube have no free ports, so it is not appropriate to compare them with mesh using the same number of routers. In this paper, we insert some routers that have no processing cores to serve as external interfaces, and compare the large-scale networks with the mesh (baseline) network.
C. Layout Properties
Layout properties concern the length and cost of links.
1) Unit Cost of Links:
We define the unit length of links as the length of the shortest link, i.e., the distance between two adjacent routers. It is equal to the side length of the tile, i.e., the square root of the tile size. Because the tile size is mainly determined by the area of the processing section, we estimate the unit cost of links approximately using the number of cores in a tile. That is, the unit link cost is √p. For example, in Fig. 1, the unit length with p = 4 is approximately twice that with p = 1. We assume that all tiles are squares of the same size.
2) Total Length of All Links:
The area of links is proportional to the total link length. That is, the total link length linearly affects the cost. In the logical aspect, the total length is equivalent to the link complexity. And it depends on the node diameter.
3) Maximum Link Length:
The maximum link length is the length of the longest link among all the links. Excessively long links may prevent messages from being transferred in one clock cycle, or may force a reduction of the network clock frequency. For example, a 2D torus has long wrap-around links on the edges of the graph. However, the folded torus [8] can eliminate the wrap-around links by doubling the length of all other links. In higher-order hypercubes, many long edges appear and they cannot be eliminated by folding. In this case, it is necessary to insert repeaters in long links for high-frequency systems. Wiring delay depends on the manufacturing process and other physical or electrical properties. We assume that the wiring delay is less than the cycle time of the processing cores; that is, messages can be transferred along the longest wire in one clock cycle.
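The effect of folding on link length can be checked on one dimension of a torus, i.e., a ring of n routers placed in a row. The following sketch is an illustration under stated assumptions: lengths are measured in unit lengths (tile pitches), and the interleaved placement in `folded_positions` is one common way to realize folding, not necessarily the exact layout of [8]. All function names are hypothetical.

```python
def linear_positions(n):
    """Place ring node i at slot i; the wrap-around link spans the whole row."""
    return list(range(n))


def folded_positions(n):
    """Interleave the two halves of the ring so that no link spans the row."""
    half = (n + 1) // 2
    return [2 * i if i < half else 2 * (n - 1 - i) + 1 for i in range(n)]


def ring_link_lengths(positions):
    """Length of each ring link (node i to node i+1 mod n) in unit lengths."""
    n = len(positions)
    return [abs(positions[i] - positions[(i + 1) % n]) for i in range(n)]


if __name__ == "__main__":
    n = 8
    for name, layout in (("linear", linear_positions(n)),
                         ("folded", folded_positions(n))):
        lengths = ring_link_lengths(layout)
        print(name, "max:", max(lengths), "total:", sum(lengths))
```

For n = 8, the linear layout has a maximum link length of 7 and the folded layout has a maximum of 2, while the total link length is 14 in both cases. Under these assumptions, folding shortens the longest link without increasing the total wire length of the ring.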
III. COST-PERFORMANCE ANALYTICAL MODEL
This section proposes an analytical model for evaluating the cost-performance of NoCs. The model takes the networks' topological, architectural, and layout properties into consideration.
There are two major components in on-chip interconnection networks: routers and links. We estimate the total cost of each section as follows.

$$\mathrm{Cost}_{router} = R\,(d + p)^{\lambda}, \qquad \mathrm{Cost}_{link} = L\,\sqrt{p}$$

where d is the node degree of the network, p is the number of processing cores connected to a router directly, (d + p) is the number of ports in a router including the ports for cores, λ is the router complexity, R is the number of routers, and L is the total length of links in the whole network.
As described before, the cost of each router can be estimated by the λ-th power of the number of ports. And the total cost of the link section is the product of the total length of links and the unit cost of links.
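The two section costs described above can be written as small functions. This is a minimal sketch that follows the verbal description only (router cost as the λ-th power of the port count, link cost as total length times the unit cost √p); it does not include the weighting by α and t used in the total cost, and the function names are hypothetical.

```python
import math


def router_section_cost(R, d, p, lam):
    """Cost of R routers, each with (d + p) ports and complexity exponent lam."""
    return R * (d + p) ** lam


def link_section_cost(L, p):
    """Cost of links: total length L (in unit lengths) times unit cost sqrt(p)."""
    return L * math.sqrt(p)


if __name__ == "__main__":
    # A 4 x 4 mesh: 16 routers, degree 4, 24 links of unit length.
    for p in (1, 4):
        print(p,
              router_section_cost(R=16, d=4, p=p, lam=2.0),
              link_section_cost(L=24, p=p))
```

For the 4 × 4 mesh with λ = 2.0, going from p = 1 to p = 4 raises the router section cost from 400 to 1024 and the link section cost from 24 to 48, while the number of processing cores rises from 16 to 64.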
The total cost of the network is the sum of the costs of the router section and the link section. To reflect the degree of influence of each section, we introduce a cost ratio. The connection width for a router with multiple processing cores is also considered. We define the total cost of NoC as follows.
where α is the cost ratio of the router section to the sum of router section and link section with 0 < α < 1, and t is the thickness. We adjust the connection width by tp with 0 < t ≤ 1.
Our model adopts the communication latency for evaluating the cost-performance ratio. The latency is obtained from the network diameter.
Consequently, we assume that the network performance is inversely proportional to the diameter. We define the performance of the whole network as

$$\mathrm{Performance}_{net} \propto \frac{P}{D}$$

where P is the number of processing cores and D is the diameter of the network.
The cost-performance is defined as the cost over the performance. The network's cost-performance (CP) is given by

$$CP = \frac{\mathrm{Cost}_{net}}{\mathrm{Performance}_{net}} = \frac{\mathrm{Cost}_{net} \times D}{P}$$

where Cost_net denotes the total cost of the network. We also use the relative cost-performance (RCP) for comparison with a baseline network: the RCP of a network is its CP divided by the CP of the baseline network. We use mesh as the baseline network.
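The definitions of CP and RCP can be sketched as follows. The total network cost is passed in as a plain number, and the proportionality constant in the performance definition is taken as 1, which cancels in RCP. The function names and the numbers in the usage example are arbitrary placeholders for illustration, not results from this paper.

```python
def performance(P, D):
    """Network performance, taken as P / D (proportionality constant of 1)."""
    return P / D


def cost_performance(cost, P, D):
    """CP: cost over performance, so a lower value is better."""
    return cost / performance(P, D)


def relative_cost_performance(cost, P, D, base_cost, base_P, base_D):
    """RCP: CP of a network divided by the CP of the baseline network."""
    return cost_performance(cost, P, D) / cost_performance(base_cost, base_P, base_D)


if __name__ == "__main__":
    # Placeholder values: equal cost and core count, half the diameter.
    print(relative_cost_performance(cost=100.0, P=16, D=4,
                                    base_cost=100.0, base_P=16, base_D=8))
```

An RCP below 1 means the network has a better (lower) cost-performance ratio than the baseline. In the placeholder example, halving the diameter at equal cost and core count gives an RCP of 0.5.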
When evaluating CP or RCP , we need to compare the networks that have the same computing capability. Therefore we use the number of processing cores (P ) as a horizontal axis, not the number of routers. We use the torus with increased outer boundary routers that provide external interfaces to outside memory modules and/or other chips. In this case, R, D, L are increased for the same P .
For the hypercube and metacube, we remove a processing core from each outer boundary router; the released ports are used to connect outside memory modules and/or other chips.

Fig. 3 shows the effects of p (the number of processing cores for one router) and t (the thickness of the connection) on the RCP of torus with α = 0.6 and λ = 2.0. When p increases with t = 1, the required number of routers is reduced for the same number of processing cores, but the cost of the router, the unit cost of links, and the connection width increase. As a result, for the same number of processing cores, a higher p yields a worse RCP. Interestingly, in this configuration, the torus has better cost-performance than mesh when the network has 100 × p cores.
If the communication traffic is sparse, we can reduce t to lower the cost. When p = 4 and t = 3/4, as shown in Fig. 3, the connection width is reduced from 4 to 3. In this case, the torus with p = 4 achieves better cost-performance than that with p = 2 when it has roughly more than 2^7 (128) processing cores, and it outperforms that with p = 1 when it has roughly 500 or more cores.

Fig. 3. RCP comparison on p and t for torus with α = 0.6, λ = 2.0 (vertical axis: relative cost-performance; horizontal axis: number of processing cores)

Fig. 4 shows the RCPs of mesh, torus, hypercube, and metacube in the range up to 100,000 cores with λ = 2.0, p = 1, α = 0.6, and t = 1. There are three curves for metacube MC(k, m), corresponding to k = 1, k = 2, and k = 3, respectively. Because their RCP values are close, we do not label the curves individually in the figure. In this configuration, the RCPs of torus and hypercube become equal at about 2^16 cores. We expect that such extra-large-scale NoCs can be realized in the next decade. From Fig. 4, we can see that metacube has the best cost-performance among all the topologies for large-scale NoCs. This is because, compared to hypercube, metacube reduces the node degree dramatically while increasing the diameter only slightly. The disadvantage of metacube is that there are large gaps in node count between configurations. For example, a metacube with 2^8 (256) cores is not configurable. Nevertheless, for the configurable numbers of cores, metacubes show very low CP ratios.

Fig. 4. RCP in the range up to 100,000 cores (λ = 2, p = 1, α = 0.6, t = 1)
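The gaps between metacube configurations can be seen by enumerating the possible node counts. The sketch below assumes the usual definition of the metacube, in which MC(k, m) has 2^(m·2^k + k) nodes; this formula is not stated in this text and is brought in as an assumption. It also counts nodes rather than cores, ignoring the cores removed for external ports. The function names are hypothetical.

```python
def metacube_nodes(k, m):
    """Node count of MC(k, m), assuming the 2^(m * 2^k + k) definition."""
    return 2 ** (m * 2 ** k + k)


def configurable_sizes(max_nodes, ks=(1, 2, 3)):
    """Map each reachable node count up to max_nodes to its (k, m) pairs."""
    sizes = {}
    for k in ks:
        m = 1
        while metacube_nodes(k, m) <= max_nodes:
            sizes.setdefault(metacube_nodes(k, m), []).append((k, m))
            m += 1
    return dict(sorted(sizes.items()))


if __name__ == "__main__":
    for nodes, configs in configurable_sizes(100_000).items():
        print(nodes, configs)
```

Under this assumption, the sizes available up to 100,000 nodes for k = 1, 2, 3 are 8, 32, 64, 128, 512, 1024, 2048, 8192, 16384, and 32768, so a 256-node metacube is indeed absent from the list.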
IV. CONCLUSION
We proposed an analytical model for evaluating the cost-performance ratio of large-scale on-chip interconnection networks. We also evaluated the cost-performance ratio of the torus, hypercube, and metacube relative to the mesh baseline network. These results are helpful for designing large-scale high-performance NoCs at low cost in the near future.
Most current NoCs have adopted the mesh topology. As the scale of NoCs grows year by year, the performance of mesh suffers from its long diameter, which increases the communication time. We expect that torus and metacube achieve better cost-performance than mesh when the NoC contains roughly more than 100 processing cores.
