The 2-D mesh topology is widely used in high-performance multicore/manycore systems, mainly for its simplicity. In some 2-D mesh architectures, such as the MIT Raw Architecture Workstation (RAW), each node/tile has both communication and computation components. The switching component of such a node consumes power while the node is only computing (and vice versa). The large number of switches used in a mesh topology is primarily responsible for its high power consumption and high communication latency. In this paper, we propose an energy-efficient interconnection network topology for multicore architectures. In the proposed topology, nodes are divided into network switches and computing cores. The topology uses the folded torus concept to decide which nodes are switches and which are cores, and how they are connected. Although the number of switches is cut down significantly, each switch provides adequate communication channels. Using a synthetic workload, we simulate RAW, Triplet Based Architecture (TriBA), Logic-Based Distributed Routing (LBDR), and the proposed topology with up to 64 nodes. Experimental results show that the proposed topology outperforms RAW, TriBA, and LBDR. Compared with RAW, the proposed topology reduces the number of switches by up to 60%, the average hop count by up to 38%, and the power consumption per task by up to 80%.
Introduction
Effective networking support is the backbone of high-performance distributed and parallel computing. A network is a collection of computing systems interconnected by communication channels to support sharing of information and resources. In a multicore architecture, multiple cores communicate with each other to process jobs. Each core in a multicore system can be considered a node that needs to be connected to other nodes effectively. In networking, topologies such as bus, mesh, and ring are used extensively. In multicore architectures, topologies such as crossbar switching [1, 2] and torus (an extension of mesh) [3] are used. Mesh and torus are popular for their simplicity. However, both suffer from long latency and high power consumption because each node has a switching component. In this work, we propose a networking topology for multicore architectures based on the folded torus concept [3]. Folded torus is an extension of the torus topology, which has wrap-around links between edge nodes. Figure 1 (a) illustrates a 4x4 torus architecture. As the number of nodes increases, the wrap-around links between the edge nodes become a drawback of the torus topology. To overcome this drawback, the folded torus keeps a layout similar to that of the torus, but its links are physically arranged in a folded manner to equalize wire lengths. Unlike the torus topology, this eliminates the long wrap-around links. A partial folded torus is shown in Figure 1 (b). In the folded torus topology, every node has a link to its alternate node (the node two positions away) in both the horizontal and vertical directions. In this topology, a source and destination pair has a lower hop count, which leads to lower communication delay and lower energy consumption. In multicore system design, a number of cores are connected via a network and work together to increase processing speed [4].
In a multicore architecture, computational capacity is increased through parallel computing. Present-day multicore architectures adopt an isomorphic architecture [5], where each core has its own private first-level cache (CL1) and there is a shared second-level cache (CL2). TriBA (which follows a hierarchical interconnection topology) and LBDR (where only switches are connected to each other) do not scale well for multicore architectures. Studies suggest that the folded torus topology has the potential to reduce the number of switches by having multiple links among the nodes in a given network. Multiple links in a folded torus also help increase the reliability of the network. This paper is organized as follows. Section 2 briefly discusses related published work. Section 3 introduces the proposed folded torus based multicore architecture. Section 4 describes the simulation details. Section 5 presents some important simulation results. Finally, Section 6 concludes this work.
Literature Survey
Various network topologies are proposed for designing effective multicore/manycore architectures. Some selected articles, closely related to our work, are summarized in this section.
A group of MIT researchers introduced RAW (short for Raw Architecture Workstation), in which the cores are organized using a 2-D mesh topology [6]. RAW is a wire-efficient multicore architecture that scales with increasing VLSI gate densities. The RAW architecture implements a tile-based design where each tile consists of a switching component and a computing component. The tiles in this design are interconnected with several components such as routers, programmable switches, data memory, and ALUs. In RAW, scheduling and routing are handled without any conflict between the cores for shared resources. This architecture has static and dynamic networks for communication among the tiles. The static networks define fixed communication channels at compile time, so the compiler knows exactly where to send each message. The dynamic networks handle situations in which the memory requirement cannot be decided at compile time [7, 8]. The main goal of the RAW architecture is to improve performance over existing architectures and to give multicore compilers more flexibility by implementing fine-grain parallelism. However, the core-to-core communication delay and power consumption caused by the large number of switching components in a RAW architecture have become the prime concern.
To overcome the latency caused by the use of a shared medium, TriBA (short for Triplet Based Architecture) was introduced. In TriBA, three nodes are connected to each other in a triangular pattern. Each node has its own local memory (like CL1), while three nodes share a common memory (like CL2) [8]. The TriBA multicore architecture also addresses the drawbacks of having a large number of switching components. TriBA is a direct interconnection network. It consists of a 2-D grid of programmable processing units, each physically connected to its three neighbors so that group locality can be exploited fully and efficiently in an on-chip interconnection of cores. Cores on the same chip are connected via a triplet-based hierarchical interconnection network, which has a simple topology and computing locality. TriBA mainly addresses the concern that interconnected cores use the same transmission medium. In this architecture, at each level the program decides where to send an incoming message: to a local node or to a neighboring node. Efficient routing algorithms support this mechanism and improve the performance of communication between the interconnected nodes. A distributed deterministic routing algorithm is implemented for TriBA, and an addressing scheme identifies each node at each level of the hierarchical interconnection network. Because TriBA implements a layered architecture that is defined at different levels, it is difficult to implement the design in a 2-D mesh topology. In addition, VLSI concerns such as silicon area are compromised in TriBA.
Designers of networks-on-chip (NoC) generally use 2-D mesh topologies. When the network contains irregularities, managing routing tables is reported to be a challenging task. To reduce the complexity of handling routing tables in the switches of a multicore design, a method known as LBDR (short for Logic-Based Distributed Routing) was proposed. In LBDR, four cores are connected to a single switch and all switches are connected to each other. The LBDR mechanism [9] was extended to support multiple cores per switch. In LBDR, every computing core is connected to only one switch, which is considered a shortcoming. Recent studies indicate that the folded torus topology has the potential to be used in multicore architectures. Folded torus reduces the number of hops between a source and destination pair, which leads to lower power consumption and communication delay.
The impact of cache parameters on overall system performance and total power consumption in a homogeneous multicore system is studied in [10]. VisualSim modeling and simulation techniques are used in that study. Two 4-core architectures are simulated, both with private CL1s, but one with a shared CL2 and the other with private CL2s. Average memory latency per task is used to represent performance. Simulation results show that a significant reduction in power consumption and an improvement in performance are possible by optimizing cache parameters. The simulation platform developed in [10] is modified and used in this work to evaluate the proposed and selected (RAW, TriBA, and LBDR) architectures.
Proposed Interconnection Topology
According to the proposed networking topology, a node can be a computing component or a switching component. As an exception, a node can act as both a computing component and a switching component; this connects different layers without increasing the power requirement. As shown in Figure 3, a solid node indicates a switch, a plain node indicates a computing core, and a striped node indicates a special node with both a switch and a core.
Node Selections
Determining whether a node should be a switch, a core, or a special node is very important. We now discuss our node selection algorithm.
Switching components: In a given NxN topology, starting from the first node (say, the top-left node), every third node (that is, the node after each two skipped nodes) is considered to be a switch. The same selection pattern is followed both row-wise and column-wise. Figure 3 depicts this selection pattern in an 8x8 mesh topology. As already mentioned, the first (top-left) node is a switch, and the fourth node in that row and the fourth node in that column are also switches. Hence, the distance saved between two switches is 2 units (taking the distance between any two consecutive nodes as 1 unit). Using this technique, we cut down the number of switches and the hop distance.
Computing components: Computing components are the processing 'cores' that perform the actual computation. Each core is a full-blown computer with its own CPU, cache, etc. All nodes other than the switching and special nodes are considered to be cores.
Switching-Computing 'special' components: In this design, cores are connected to switches, and switches are connected to each other to form the network. The switches may form layers with no connection between them. Figure 3 illustrates this situation with three different layers. Therefore, we propose a node (the special striped node in the diagrams) that is common to two layers. These special nodes have both switching and computing components and provide full connectivity throughout the mesh.
Figure 3. Switch-to-switch connections in an 8x8 mesh topology; special nodes connect different layers.
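The switch selection described in this subsection can be sketched in Python under one possible reading. The diagonal rule `(row + col) % 3 == 0` is an assumption, not taken from the text: it makes the top-left node a switch, places a switch at every third node along each row and column, and yields three groups that are not connected to each other when switches are linked at a distance of 3 units, which agrees with the three layers mentioned above. The function names are hypothetical, special nodes are not modeled, and the exact layout in Figure 3 may differ.

```python
def select_switches(n, stride=3):
    """Return the (row, col) positions treated as switches in an n x n grid.

    Assumption (not from the paper): a node is a switch when
    (row + col) is a multiple of `stride`, so the top-left node is a
    switch and every third node along each row and column is one too.
    """
    return {(r, c) for r in range(n) for c in range(n) if (r + c) % stride == 0}


def group_into_layers(switches, stride=3):
    """Group switches into hypothetical "layers".

    A link of length `stride` along a row or a column leaves
    row % stride unchanged, so that value is used as the layer label.
    """
    layers = {}
    for r, c in switches:
        layers.setdefault(r % stride, set()).add((r, c))
    return layers


# Example: the 8x8 grid discussed in the text.
switches = select_switches(8)
layers = group_into_layers(switches)
```

Under this reading, an 8x8 grid has 21 plain switches in three layers before any special nodes are added; the remaining nodes would be cores or special nodes.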
Node Connections
How the nodes in a multicore architecture are connected, and how they communicate, is very important for efficient processing of applications. In a multicore architecture, switches are the major components that synchronize the computation data of the cores and produce a final result by combining the results obtained from each core involved in a process. The proposed design ensures that every switch is connected to other switches over different routes for improved connectivity.
Switch-to-Switch connections: Every switch has a link to its adjacent switch at a regular distance of 3 units (saving 2 units). Figure 3 shows the connections among the switches. There are three layers in this case. Figure 3 also shows how the special nodes ensure full connectivity among the switches. It should be noted that the computing nodes communicate with each other only through the switches.
Core-to-Switch connections: In the folded torus topology, each node has a connection to a node at a distance of 2 units. In the proposed architecture, however, each computing component has a link to a switch at a distance of either 1 unit or 2 units. Figure 4 (the top-left 4x4 nodes from Figure 3) is an example of cores connected to switches. Cores have links only to switches; they have no link to other cores.
Special connections for special nodes: Each special node, which acts as both a core and a switch, has two direct links to two adjacent nodes that are in two different layers. In Figure 4, the dotted lines between the striped nodes and normal nodes show such special connections between a special node and two switching nodes.
Evaluation
We model and simulate our proposed multicore architecture with a network topology based on the folded torus concept. Our goal is to reduce the total power consumption by cutting down the number of switches without compromising performance. We compare our proposed multicore architecture with some other popular, history-making architectures. In the following subsections, we briefly discuss the assumptions, simulated architectures, workload, and output parameters.
Assumptions
In this work, we make the following assumptions:
• The same total number of nodes is considered for all simulated architectures (RAW, TriBA, LBDR, and proposed).
• The same synthetic workload is used when calculating the power consumption and communication delay per task.
• For LBDR and the proposed architecture, a computing component is assumed to consume more power than a switching component. This assumption is based on the fact that a switch only checks the packet header, whereas a core actually processes the packet, and the packet is much larger than its header. The power needed by a switch is taken as 1 unit and the power required by a computing core is taken as 4 units.
• The power consumption due to data transfer between two adjacent nodes is assumed to be negligible because adjacent nodes are very close to each other.
• For the RAW and TriBA designs (where each node has a core and a switch), the power consumed by a node to process a packet is four times the power consumed by the same node to transfer the packet.
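As a rough illustration of how these unit costs combine, the following Python sketch adds up the power of the nodes on a source-to-destination path. The 1-unit and 4-unit values come from the assumptions above. The function names, the node labels, the choice to charge every node on the path exactly once, and the rule that only the two endpoints of a RAW/TriBA path process the packet are assumptions made for illustration; special nodes are not modeled.

```python
SWITCH_POWER = 1  # units, from the stated assumptions
CORE_POWER = 4    # units, from the stated assumptions


def path_power(path):
    """Sum the power of every node on a path (LBDR or proposed design).

    `path` is a list of labels, each either "core" or "switch".
    """
    cost = {"core": CORE_POWER, "switch": SWITCH_POWER}
    return sum(cost[kind] for kind in path)


def combined_node_path_power(num_nodes, process_power=4, transfer_power=1):
    """Power for a RAW/TriBA-style path where each node has a core and a switch.

    Assumption: the source and destination nodes process the packet,
    and the nodes in between only transfer it.
    """
    if num_nodes < 2:
        raise ValueError("a path needs a source and a destination")
    return 2 * process_power + (num_nodes - 2) * transfer_power


# A core reaching another core through two switches.
example = path_power(["core", "switch", "switch", "core"])
```

Link power is left out, in line with the assumption that transfer between adjacent nodes costs a negligible amount.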
Simulated Architectures
In addition to the proposed multicore architecture, we model and simulate RAW, TriBA, and LBDR.
Proposed architecture: The switches, cores, and special nodes in the proposed NxN folded torus multicore system are selected by applying the proposed methodology.
MIT RAW architecture: NxN 2-D mesh RAW architectures are considered, where N = 4, 5, 6, 7, and 8. Each node consists of a switch and a core. Each node is connected to its four neighboring nodes.
TriBA architecture: The nodes in TriBA are numbered in a triangular fashion starting from the bottom-left node.
LBDR architecture: In LBDR architecture, cores are connected to switches and each switch is connected to its four neighboring switches.
Workload
In this experiment, we use a synthetic workload to represent communications. We randomly generate two numbers between 1 and the maximum number of nodes (inclusive); the smaller number is taken as the source and the larger number as the destination. Table 1 shows five different possible communications in 16-node RAW, TriBA, LBDR, and the proposed architectures. For each transaction, we list the source node (S), destination node (D), and communication path.
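The pair generation described above might look like the following Python sketch. The function name `generate_pairs` and the `seed` parameter are hypothetical. The text does not say how two equal draws are handled, so this sketch simply keeps them; that choice is an assumption.

```python
import random


def generate_pairs(num_nodes, count, seed=None):
    """Generate `count` (source, destination) pairs for a synthetic workload.

    Two node numbers are drawn uniformly from 1..num_nodes (inclusive);
    the smaller becomes the source and the larger the destination.
    Assumption: equal draws are kept, giving source == destination.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        a = rng.randint(1, num_nodes)
        b = rng.randint(1, num_nodes)
        pairs.append((min(a, b), max(a, b)))
    return pairs


# Example: 25 communications in a 16-node system, with a fixed seed
# so that every simulated architecture sees the same workload.
pairs = generate_pairs(16, 25, seed=0)
```

Fixing the seed is one simple way to reuse the same workload across all simulated architectures, as the assumptions require.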
Output Parameters
Output parameters include switch count, power consumption per task, and hop count. Switch count is the total number of nodes in an architecture that act as networking components. Power consumption per task is the average energy consumed by the system per task while processing all the jobs. Hop count represents the distance between the source and destination nodes: it is the number of point-to-point links in a transmission path (not the number of network devices between the starting node and the destination node). We use hop count to represent communication delay (i.e., performance).
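For reference, the hop count of a shortest path in a plain 2-D mesh and in a torus can be computed directly from node coordinates. This Python sketch assumes that nodes are numbered 1 to N*N in row-major order and that messages follow a shortest path; both are assumptions for illustration, and the functions do not model the proposed topology, TriBA, or LBDR.

```python
def node_to_coord(node_id, n):
    """Map a 1-based, row-major node number to (row, col) in an n x n grid."""
    return divmod(node_id - 1, n)


def mesh_hops(src, dst, n):
    """Links on a shortest path in an n x n 2-D mesh (Manhattan distance)."""
    r1, c1 = node_to_coord(src, n)
    r2, c2 = node_to_coord(dst, n)
    return abs(r1 - r2) + abs(c1 - c2)


def torus_hops(src, dst, n):
    """Links on a shortest path in an n x n torus with wrap-around links."""
    r1, c1 = node_to_coord(src, n)
    r2, c2 = node_to_coord(dst, n)
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    # In each dimension, go directly or around the wrap-around link.
    return min(dr, n - dr) + min(dc, n - dc)


# Corner-to-corner communication in a 16-node (4x4) system.
corner_mesh = mesh_hops(1, 16, 4)
corner_torus = torus_hops(1, 16, 4)
```

In the 4x4 case, the corner-to-corner path takes 6 links in the mesh and 2 links in the torus, which shows how wrap-around links shorten the longest paths.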
Results and Discussion
We present some important simulation results to evaluate our proposed reduced-switch multicore architecture based on the folded torus concept. We focus on the impact of our proposed topology on the number of switches, power consumption per task, and hop count. The proposed architecture is compared with some history-making multicore architectures.
Number of Switches
First, we simulate the RAW, TriBA, LBDR, and proposed architectures for 16, 25, 36, 49, and 64 nodes. As shown in Figure 6, the average number of switches increases with the total number of nodes. Simulation results show that the proposed architecture needs far fewer switches on average than the RAW and TriBA architectures. Although the proposed architecture needs more switches on average than LBDR, it is more robust, because each core in the proposed design is connected to two or more switches, whereas each core in LBDR is connected to only one switch.
Power Consumption per Task
Second, we explore the impact of the proposed folded torus based topology on power consumption per task. To calculate the total power consumed, only the nodes on the source-to-destination path are considered. Figure 7 illustrates the power consumed per task by 16-node systems for the five cases in Table 1. Simulation results show that the proposed architecture requires less power in all five cases. This is because the best-case distance between the source and destination is smallest for the proposed architecture. Considering 25 different cases for each {architecture, nodes} combination, we obtain the power consumption per task for all combinations, as shown in Figure 8. Power consumption per task increases for all architectures as the number of nodes increases, because the average number of switches/nodes increases with the total number of nodes. However, power consumption per task in the proposed architecture is less than in LBDR and much less than in RAW and TriBA.
Communication Latency
Finally, we study the impact of different network topologies on communication latency. The fewer hops a message traverses from source to destination, the lower the delay. Here, we represent latency using the hop count. Figure 9 illustrates the hop count in a 64-node topology for the five cases in Table 2. Simulation results show that the proposed topology requires fewer hops in all cases except Case 5. That means the communication latency of our proposed topology is smaller in almost all cases.
Figure 9. Hop count in a 64-node network topology.
Again, considering 25 different cases for each {topology, nodes} combination, we obtain the average hop count for all combinations, as shown in Figure 10. The average hop count increases for all topologies as the number of nodes increases. However, it is important to note that the average hop count in the proposed architecture is less than in LBDR and much less than in the RAW and TriBA architectures.
Conclusion
Core-to-core communication latency and power consumption are two crucial factors in designing high-performance multicore/manycore systems. Although the 2-D mesh networking topology is very popular for its simplicity, it is not effective for state-of-the-art multicore architectures because of its large number of switches. For example, each tile/node of the 2-D RAW architecture has communication and computation components; therefore, switching components consume power while the nodes are only computing (and vice versa). Various approaches have been proposed to reduce the number of switches; two important ones are the TriBA and LBDR architectures. TriBA follows a hierarchical interconnection topology. In LBDR, only switches are connected to each other. Neither TriBA nor LBDR scales well for multicore/manycore architectures. According to recent studies, the folded torus topology has the potential to reduce the number of switches by having multiple links among the nodes in a given network. In this paper, we propose a folded torus based reduced-switch multicore architecture to lower the communication delay and power consumption. In the proposed topology, nodes are divided into network switches and computing cores, and a scheme is proposed to connect them. Although the proposed architecture has fewer switches, each switch provides adequate communication channels. Using a synthetic workload, we simulate our proposed architecture along with RAW, TriBA, and LBDR with up to 64 nodes. Considering communication latency (i.e., hop count) and power consumption, experimental results show that the proposed topology outperforms RAW, TriBA, and LBDR. Compared with RAW, the proposed topology reduces the average hop count by up to 38% and the power consumption per task by up to 80%. This improvement is due to the fact that the proposed architecture requires up to 60% fewer switches than RAW does.
A directory-based approach, in which a centralized table contains information about shared memory blocks, should help reduce the average access time and total power consumption in a multicore/manycore system. We plan to explore the impact of such a directory on the performance/power ratio of a multicore system in our next endeavour.
