I. INTRODUCTION
Driven by advances in semiconductor technology, future complex Systems-on-Chip (SoCs) consisting of billions of transistors, fabricated in technologies with feature sizes of 65 nm and below, will soon be a reality [1] [2] [3]. Such an SoC implies the seamless integration of hundreds, or even more, intellectual property (IP) cores, including processing and storage resources that perform different functions and operate at different clock frequencies.
Existing on-chip interconnect architectures will not scale to such systems. The most frequently used on-chip interconnect architecture is the shared-medium arbitrated bus, where all communicating devices share the same transmission medium. Every additional IP core connected to the bus adds to its parasitic capacitance, which in turn increases the propagation delay. As the bus length and/or the number of IP cores increases, the delay of a bit transfer over the bus grows and may eventually exceed the targeted clock period. In practice, this limits the number of IP cores that can be connected to a bus and thereby limits system scalability [4].
Consequently, several research groups have advocated a communication-centric approach to integrating IP cores in complex SoCs. This new SoC model decouples the processing elements (i.e., the IP cores) from the communication fabric (i.e., the network), so the need for global synchronization can disappear. The approach employs explicit parallelism, exhibits modularity to minimize the use of global wires, and exploits locality to minimize power [5]. In a communication-centric approach, communication between IP cores can take the form of packets. The packet-switched Network-on-Chip (NoC) has been proposed in [6, 7] as a viable and attractive alternative for future complex SoCs.
The ability of the NoC to efficiently disseminate information depends largely on the underlying topology. Besides having a paramount effect on the network latency, throughput, area, fault-tolerance and power consumption, the topology plays an important role in designing the routing strategy and mapping the cores to the network nodes [5, 8] .
Generally speaking, the problem of determining the optimal topology for a given application has no known theoretical solution [8]. Although synthesizing customized architectures is desirable for better performance, lower power consumption and reduced area, altering the regular grid-like structure raises significant implementation issues, such as floorplanning and uneven wire lengths (hence, poorly controlled electrical parameters). Consequently, ways need to be developed to determine efficient topologies that trade off high-level performance against detailed implementation constraints at the micro- or nano-scale level.
A network topology can be regular or irregular, and it is non-blocking if it can serve all the requests offered to it. In the packet-switched case, such a network is also called a non-interfering network; it can deliver all packets within a guaranteed time [9]. For NoC, several basic regular topologies have been proposed, such as 2D Mesh, 2D Torus, Folded 2D Torus, Fat-tree, Chordal Ring and Bi-directional Ring. 2D Mesh is the most studied topology (over 50% of the cases) [10, 11].
In this paper, a regular NoC architecture, named the Double-Loop (DL(2m)) interconnection network, is proposed. The DL(2m) topology is simple, symmetric and scalable: it is a 3-regular planar topology with 4m nodes. The nodes of DL(2m) are labelled with a Johnson coding scheme, which makes the design of routing algorithms simpler and more efficient. DL(2m) was compared, by simulation and analysis, with Ring and 2D Mesh, which are also planar graph topologies (the torus is a non-planar topology). The results show that DL(2m) is a better NoC topology when the number of network nodes is not too large.
The remainder of the paper is organized as follows. Section 2 describes the topology and the properties of DL(2m). Section 3 presents a routing algorithm for DL(2m). Section 4 compares the performance of DL(2m), Ring and 2D Mesh by simulation and analysis. Section 5 summarizes the results and outlines directions for future research.
II. TOPOLOGY OF DL(2m)
A. Topology of DL(2m)
Definition 1. A binary unit-distance cyclic code is a binary code in which any two adjacent codewords differ in one and only one bit (the unit-distance characteristic), and the first and the last codewords also differ in one and only one bit (the cyclic characteristic).
Definition 2. Binary code represents each number in the descending sequence of integers {n-1, n-2, …, 2, 1, 0} as a binary string of length m = n/2 by an order. The binary code has the properties of Definition 1 and as follows: i) for 0<k<m,
Definition 3. The Double-Loop (DL(2m)) interconnection network is a network topology with the following characteristics: 1) DL(2m) has 4m nodes and 6m links, and consists of two rings, an outer ring and an inner ring, each containing 2m nodes; 2) the nodes of the outer and inner rings of DL(2m) are marked with an m-bit Johnson code plus a 1-bit ring sign in the most significant bit, where the outer ring has sign 1 and the inner ring has sign 0; 3) the coding rule for links is as follows: when the codes of two nodes differ in exactly one bit, there is a link between them, that is, the two nodes are neighbours.
An example of DL(2m) is shown in figure 1, where m equals 4; it is composed of 4×4 = 16 nodes and 6×4 = 24 links. The nodes of the outer and inner rings are marked with a 4-bit Johnson code plus a 1-bit ring sign in the most significant bit, where the outer ring has sign 1 and the inner ring has sign 0.
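As an illustration, Definition 3 can be sketched in Python. The function names `johnson_codes` and `build_dl` are hypothetical, and the codeword order assumes a standard Johnson (twisted-ring) counter starting from the all-zero code. The sketch only checks the node count, link count and node degree stated in the definition.

```python
from itertools import combinations


def johnson_codes(m):
    """Return the 2m codewords of an m-bit Johnson counter, in ring order."""
    code, out = [0] * m, []
    for _ in range(2 * m):
        out.append(tuple(code))
        # Shift by one position, feeding the complement of the last bit
        # into the first position (assumed shift direction).
        code = [1 - code[-1]] + code[:-1]
    return out


def build_dl(m):
    """Build DL(2m): node = (ring sign,) + Johnson code.

    Two nodes are linked when their codes differ in exactly one bit.
    """
    nodes = [(ring,) + c for ring in (1, 0) for c in johnson_codes(m)]
    links = [(u, v) for u, v in combinations(nodes, 2)
             if sum(a != b for a, b in zip(u, v)) == 1]
    return nodes, links


nodes, links = build_dl(4)

# Count how many links touch each node.
degree = {n: 0 for n in nodes}
for u, v in links:
    degree[u] += 1
    degree[v] += 1
```

For m = 4 this yields 16 nodes and 24 links, with every node having degree 3, in agreement with the example of figure 1.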
B. Network properties of DL(2m)
Some of the most common NoC architectures belong to the classes of the Ring and the k*l 2D Mesh. This section presents the principal properties of DL(2m) and compares it with 2D Mesh and Ring.
Some of the most interesting characteristics of DL(2m) are: a regular topology; vertex symmetry (the same topology appears from any node); edge-transitivity; a constant node degree of 3, which makes the router hardware design simpler and more effective; homogeneous building blocks (the same router structure can be used to compose the entire network); the code sequence of each ring of DL(2m) is a binary unit-distance cyclic code; each node of DL(2m) has three and only three adjacent node codes; better scalability; the granularity of size scaling is , the distance of two random , and a simple routing scheme.
Assuming an NoC with bidirectional links and N = 4*m = k*l nodes, the number of links, network degree, network diameter, average distance and symmetry are shown in table I. A significant worst-case index, the network diameter, is defined as the maximum shortest-path length between any pair of nodes in the topology. The average network distance is defined as the average length of the shortest paths between all distinct pairs of nodes in the network. The network diameter of real 2D Mesh (N = k*l) topologies with N nodes fluctuates quite unpredictably between the ideal Mesh (N = k*l, k = l) values and the Ring diameter values, as shown in figure 2. In figure 3, we show the average network distance for Ring, ideal and real 2D Mesh, and DL(2m).
The zero-load latency is proportional to the average network distance. Increasing the network degree can reduce the average distance of an interconnection network, so it is difficult to compare the latency of interconnection networks with different degrees accurately using the average distance alone. We therefore use the normalized average distance when analyzing latency: the average network distance multiplied by the network degree [16]. Figure 4 compares the normalized average distance of DL(2m), Ring and 2D Mesh. It indicates that the zero-load latency of DL(2m) is lower than that of 2D Mesh when the network size is not greater than 44 nodes. The analysis shows that, when the number of nodes is not too large, DL(2m) is a better candidate for constructing the interconnection network of an SoC, taking into account the node degree, number of links and diameter.
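The indices discussed in this section can be computed by breadth-first search. The following Python sketch is illustrative only. The graph builders and the `metrics` helper are hypothetical names, the average is taken over shortest paths between all ordered pairs of distinct nodes, and the network degree is assumed to be the maximum node degree (2 for Ring, 3 for DL(2m), 4 for 2D Mesh). DL(2m) nodes are addressed here by (ring, position) rather than by their Johnson codes, and m is assumed to be at least 2.

```python
from collections import deque


def dl_graph(m):
    """DL(2m) as adjacency lists; node (r, p) is position p on ring r."""
    n = 2 * m
    return {(r, p): [(r, (p + 1) % n), (r, (p - 1) % n), (1 - r, p)]
            for r in (0, 1) for p in range(n)}


def ring_graph(n):
    """Bidirectional ring with n nodes."""
    return {p: [(p + 1) % n, (p - 1) % n] for p in range(n)}


def mesh_graph(k, l):
    """k*l 2D mesh without wrap-around links."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(x, y): [(x + dx, y + dy) for dx, dy in steps
                     if 0 <= x + dx < k and 0 <= y + dy < l]
            for x in range(k) for y in range(l)}


def hop_counts(graph, src):
    """Breadth-first search: shortest hop count from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def metrics(graph):
    """Return (diameter, average distance, normalized average distance)."""
    all_d = [d for s in graph
             for t, d in hop_counts(graph, s).items() if t != s]
    diameter = max(all_d)
    average = sum(all_d) / len(all_d)
    degree = max(len(nbrs) for nbrs in graph.values())
    return diameter, average, average * degree
```

For example, `metrics(dl_graph(4))`, `metrics(ring_graph(16))` and `metrics(mesh_graph(4, 4))` return the three indices for the 16-node networks.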
III. ROUTING ALGORITHM OF DL(2m)
The routing algorithm is a key factor affecting the communication efficiency of an NoC. Distributed dimension-order routing is adopted in this paper, and it makes full use of the characteristics of DL(2m) and of the node codes. In this approach, each node, upon receiving a packet, decides whether the packet should be delivered to the local node or forwarded to an adjacent node. The routing decision does not need state information about the complete network; it uses only the codes of the current node and the destination node, which reduces both the network communication overhead and the node storage overhead. The distance d between each of the three adjacent nodes and the destination node is then obtained, and d_min denotes the smallest of these distances.
Source node S sends the packet to the adjacent node corresponding to d_min, and S is updated to the code of that adjacent node. The value of d is then computed: if d = 0, node S is the destination node; otherwise the process is repeated.
Within the same ring, the code of an adjacent node is generated simply by a register shift. The code of the adjacent node on the other ring is obtained by applying a NOT operation to the most significant bit. The node coding of the DL(2m) network can be changed dynamically, since the node codes only express the relative positions of nodes. In the worst case, the routing path cannot exceed the network diameter. The routing mechanism is very simple and easy to implement in hardware at low cost, and the complexity of the algorithm is O(m+1).
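The hop-by-hop decision described in this section can be sketched as follows. This is a minimal Python sketch under stated assumptions, not necessarily the exact procedure used in the paper: it assumes that the distance between two nodes is the number of differing bits in their codes (Hamming distance), and the names `neighbours`, `distance` and `route` are hypothetical. Node codes are bit tuples with the ring sign first, followed by the Johnson code.

```python
def neighbours(node):
    """Return the three adjacent codes of a DL(2m) node."""
    ring, code = node[0], node[1:]
    # Johnson-counter shift in one direction along the ring.
    forward = (1 - code[-1],) + code[:-1]
    # The inverse shift, i.e. the other direction along the ring.
    backward = code[1:] + (1 - code[0],)
    # Same Johnson code on the other ring: NOT of the ring sign bit.
    return [(ring,) + forward, (ring,) + backward, (1 - ring,) + code]


def distance(a, b):
    """Assumed metric: number of differing bits between two node codes."""
    return sum(x != y for x, y in zip(a, b))


def route(src, dst):
    """Greedy routing: always forward to the adjacent node closest to dst."""
    path = [src]
    while path[-1] != dst:
        nxt = min(neighbours(path[-1]), key=lambda n: distance(n, dst))
        path.append(nxt)
    return path


# Route between two nodes of the m = 4 network.
path = route((1, 0, 0, 0, 0), (0, 1, 1, 1, 1))
```

Under this assumed metric, each hop reduces the distance to the destination by one, so the decision at a node needs only the current and destination codes.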
IV. PERFORMANCE EVALUATION
Performance evaluation is a key aspect of NoC design. To compare and contrast different NoC architectures, a standard set of performance metrics can be used. For example, it is desirable that an MPSoC interconnect architecture exhibit high throughput, low latency, energy efficiency and low area overhead [17]. In general, the average network latency and the average network throughput are of great importance [15]. To evaluate the proposed architecture, DL(2m) was compared with Ring and 2D Mesh by simulation and analysis at three network sizes (16, 32 and 64 nodes).
We have developed a discrete-event, cycle-accurate NoC simulator. It provides substantial support for experimenting with NoC designs in terms of routing algorithms and applications on various topologies. The simulator is written in SystemC. Dimension-order routing and wormhole packet switching are adopted in all three topologies, which ensures a fair comparison. DL(2m), Ring and 2D Mesh nodes are defined with the same node architecture, except for the number of links.
Each node has an external network interface that connects the IP core to the NoC. The external IP core can act as a packet source and/or a packet destination (sink), depending on the simulated scenario. In our simulations, each source IP core generates packets and sends them to other IP cores. Each packet consists of three 32-bit flits (flow control units). The first (head) flit of a packet is passed to the routing mechanism of the node and then transferred to the output of the target channel (if the input channel of the next node has room). Once the head flit has been processed by the routing element of a node, a switching mechanism forwards all immediately following flits of the packet to the outgoing link of the target path towards the destination node. We varied the flit injection rate from 0.05 flit/cycle/node to 0.5 flit/cycle/node. Each input channel has an 8-flit FIFO buffer, and each output channel has a one-flit buffer. The clock frequency of the NoC is 1 GHz.
For the single hot-spot destination scenario and the homogeneous destinations scenario, we simulated and analysed the performance of Ring, DL(2m) and 2D Mesh at the three network sizes, as follows.
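The two traffic scenarios differ only in how each source chooses the destination of a packet. The Python sketch below illustrates this choice. It is not taken from the simulator: the function name `pick_destination` and its parameters are hypothetical, and it assumes that in the homogeneous scenario a node never addresses a packet to itself.

```python
import random


def pick_destination(src, nodes, scenario, hot_spot=None, rng=random):
    """Choose the destination of a new packet generated at node src.

    scenario is "hot_spot" (every source targets one node) or
    "uniform" (every other node is equally likely).
    """
    if scenario == "hot_spot":
        # The hot-spot node itself does not generate packets.
        return None if src == hot_spot else hot_spot
    if scenario == "uniform":
        # Assumption: a node does not send packets to itself.
        return rng.choice([n for n in nodes if n != src])
    raise ValueError(f"unknown scenario: {scenario}")


# Example with 16 nodes identified by integers (illustrative only).
example_nodes = list(range(16))
hot = pick_destination(3, example_nodes, "hot_spot", hot_spot=0)
uni = pick_destination(3, example_nodes, "uniform", rng=random.Random(1))
```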
A. Single hot-spot destination scenario
The hot-spot destination node was placed at different positions in the Mesh topology (in the symmetric Ring and DL(2m) topologies, the position makes no difference). All nodes except the destination node act as sources and send their packets to the single hot-spot destination. Figure 5 shows the average latency and the average throughput obtained by DL(2m), 2D Mesh and Ring under one single hot-spot destination node, as a function of the number of nodes N and the injection rate of the source nodes. Assuming a homogeneous injection rate, the results show that, for a given topology, the larger the network, the worse the average latency and the average throughput; and, for a given network size, the simpler the topology, the better the performance. When all sources increase their injection rate homogeneously, the average latency and the average throughput increase with the injection rate until the destination node saturates.
Therefore, the system bottleneck under the single hot-spot destination scenario is the destination node and the NoC topology. The performance of DL(2m) is much better than that of 2D Mesh. From the hot-spot communication viewpoint, the scalable and symmetric architecture of DL(2m) offers the same advantages as the simple Ring.
B. Homogeneous destinations scenario
Figure 6 a) shows the average latency obtained by DL(2m), 2D Mesh and Ring under the homogeneous source and destination distribution scenario: all nodes act as sources and can be addressed as destinations with uniform probability. Latency is shown as a function of the number of nodes N and the injection rate of the source nodes. The results show that DL(2m) and Ring topologies outperform 2D Mesh, and scale better when the number of nodes is high. Under this scenario, 2D Mesh shows a smaller average latency than DL(2m) only when the number of nodes is small and the local injection rate of all source nodes is greater than 0.2 flits/cycle/node.
Figure 6 b) shows the throughput with respect to the NoC topology and the number of nodes under the same scenario. When all source nodes increase their injection rate, the average latency and the average throughput increase with it until the set of destination nodes and/or the network becomes saturated. The throughput results show that DL(2m) and 2D Mesh topologies outperform Ring, and scale better when the number of nodes is low. Under this scenario, 2D Mesh shows a better throughput than DL(2m) only with many nodes and when the local injection rate of all source nodes is greater than 0.15 flits/cycle/node. On the other hand, this situation rarely occurs in real systems, so it is not a good reason to prefer 2D Mesh over the DL(2m) topology. As expected, the bottleneck in this scenario is essentially the communication infrastructure. This is also confirmed by the worst performance being obtained by the Ring topology.
V. CONCLUSION
Focusing on decreasing node degree, reducing the number of links and reusing IP cores, we have proposed a new NoC topology, DL(2m). The DL(2m) topology is very simple, planar, symmetric and scalable; it is a 3-regular planar graph with 4m nodes. This paper presents a novel Johnson coding method for the nodes that is adapted to the DL(2m) topology and makes the design of the routing algorithm simpler and more efficient. DL(2m) was compared with Ring and 2D Mesh by simulation and analysis, both under uniform load and under more realistic load assumptions, at several network sizes. The results show that the DL(2m) topology is a good trade-off between performance and cost. It is a better candidate NoC topology when the number of network nodes is not too large.
In future research, we will map wireless communication applications onto the DL(2m) topology. When application behaviour cannot be predicted at compile time, on-line scheduling approaches are usually needed. Significant work is needed to develop efficient performance- and energy-aware on-line scheduling algorithms for NoC.
She is now Director of the Micro-Electronics Research and Teaching Group in the Computer Science Department at Xi'an Institute of Posts and Telecommunications, P.R. China. Her research interests are mainly in ASIC design, formal verification for hardware design, and computer architecture.
