

GLOBAL JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY NETWORK, WEB & SECURITY Volume 13 Issue 14 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals Inc. (USA) Online ISSN: 0975-4172 & Print ISSN: 0975-4350

# Energy Efficient Mapping in 3D Mesh Communication Architecture for NoC

### By Pranav Wadhwani, Naveen Choudhary & Dharm Singh

Maharana Pratap University of Agriculture and Technology, India

*Abstract* - By the end of this decade we will enter the era of thousand-core SoCs. 3D integration technologies have opened new opportunities for NoC architecture design in SoCs, providing higher efficiency than 2D integration by reducing the long path lengths of 2D NoCs. The application-to-core mapping on a NoC architecture can significantly affect the system's dynamic communication energy consumption, and considerable energy savings can be achieved by optimizing this mapping. This paper presents a Branch-and-Bound heuristic for application-to-core mapping in a 3D Mesh NoC architecture. Experimental results show that the proposed heuristic saves about 42%-55% of dynamic communication energy consumption compared to random mapping in a 3D NoC communication architecture, and about 19%-28% compared to energy-aware mapping in a 2D NoC architecture of the same size.

Keywords : branch and bound, energy-aware mapping, NoC, 3D mesh, ULC, ILC.

GJCST-E Classification : C.2.1






© 2013. Pranav Wadhwani, Naveen Choudhary & Dharm Singh. This is a research/review paper, distributed under the terms of the Creative Commons Attribution-Noncommercial 3.0 Unported License (http://creativecommons.org/licenses/by-nc/3.0/), permitting all non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



#### I. INTRODUCTION

The network-on-chip technology has brought a paradigm shift from computation-centric to communication-centric design architecture. Using an on-chip network has advantages in structure, performance, and modularity. Next-generation SoCs will contain a large number of on-chip cores, and the main challenge will be the network-on-chip (NoC) bottleneck of these systems, which restricts scalability. 3D integration technologies have opened new opportunities for architecture design in SoCs. The fusion of two emerging paradigms, NoC and 3D IC, facilitates the design of novel structures with considerable improvements in quality metrics over traditional solutions (Rahmani, Latif, Liljeberg, Plosila & Tenhunen, 2010).

One of the important stages in the NoC design flow is application-to-core mapping. This stage significantly affects the dynamic communication energy consumption and the quality metrics of the system. To this end, this paper formulates the mapping problem and then illustrates the effect of various application-to-core mappings on the dynamic communication energy consumed by a given system. The paper presents a Branch-and-Bound heuristic for application-to-core mapping in 3D NoC architecture that minimizes the total dynamic communication energy consumption of the system. Several experiments are carried out on various random benchmarks to verify the efficiency of the algorithm.

Authors α σ ρ: Department of Computer Science & Engineering, College of Technology and Engineering, Maharana Pratap University of Agriculture and Technology, Udaipur, Rajasthan, India. E-mails: pranav.udraj@gmail.com, naveenc121@yahoo.com, dharm@mpuat.com

Feero & Pande (2009) and Xu, Du, Zhao, Zhou, Zhang & Yang (2009) demonstrated that, besides reducing the footprint of a fabricated design, 3D network structures tend to deliver better performance, in terms of lower energy dissipation, higher throughput, and smaller latency, than traditional 2D NoC architectures. Hu & Marculescu (2005) addressed the mapping problem for 2D regular tile-based architectures.

#### II. Energy-aware Mapping Problem for 3D NoC

This paper uses the energy model presented in (Hu & Marculescu, 2005). The chip under consideration is composed of L×M×N tiles interconnected according to the underlying 3D Mesh infrastructure. A tile in 3D NoC (Fig. 1) is composed of an IP core, Virtual Channels (VCs) and seven communication links (East, West, North, South, Front, Rear and Core). For the 3D mesh NoC with XYZ routing, Eq. 1 shows that the average dynamic energy consumed in sending a single bit of information from core i to core j depends only on the number of hops, which is determined by the Manhattan distance between the two cores.

$$E_{bit}^{c_i,c_j} = n_{hops} \times E_{Sbit} + (n_{hops} - 1) \times E_{Lbit}$$
(1)

Where $E_{bit}^{c_i,c_j}$ is the dynamic energy consumed in transmitting a single bit of information from core i to core j, $E_{Sbit}$ is the energy consumed by a single bit transported through a switch, $E_{Lbit}$ is the energy consumed by a single bit on the link between two switches, and $n_{hops}$ is the number of hops a single bit encounters when it is transported from source tile i to destination tile j (i.e. core i to core j).
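As a minimal sketch of Eq. 1, the per-bit energy can be computed from tile coordinates. The sketch assumes that $n_{hops}$ counts the switches traversed, so it equals the Manhattan distance plus one (with $n_{hops} - 1$ links between them); this reading, the function names and the coordinate convention are assumptions made for illustration. The default energies are the values quoted in the experimental section, whose unit is not stated in the paper.

```python
from typing import Tuple

Coord = Tuple[int, int, int]  # (x, y, z) position of a tile in the 3D mesh

# Per-bit energies quoted in the experimental section (unit not stated there).
E_SBIT = 0.54    # energy per bit through one switch
E_LBIT = 0.0007  # energy per bit over one inter-switch link


def manhattan(a: Coord, b: Coord) -> int:
    """Manhattan distance between two tiles."""
    return sum(abs(p - q) for p, q in zip(a, b))


def n_hops(src: Coord, dst: Coord) -> int:
    """Switches traversed under XYZ routing (assumption: distance + 1)."""
    return manhattan(src, dst) + 1


def e_bit(src: Coord, dst: Coord,
          e_sbit: float = E_SBIT, e_lbit: float = E_LBIT) -> float:
    """Eq. 1: n_hops * E_Sbit + (n_hops - 1) * E_Lbit."""
    hops = n_hops(src, dst)
    return hops * e_sbit + (hops - 1) * e_lbit
```

Under these assumptions, two tiles at opposite corners of a 2×2×2 mesh are three links apart, so a bit crosses four switches and three links.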

#### a) Formulation of Energy-aware mapping problem

The problem is to find a mapping of applications to cores in 3D NoC architecture, such that the overall dynamic communication energy consumption is minimized.



*Figure 1*: A tile in 3D NoC

The Communication Routing Graph (CRG), G = G(C, P), also referred to as the Interconnection Network graph, is a directed graph in which each vertex c<sub>i</sub> corresponds to one core in the architecture and each directed edge $p_{i,j}$ corresponds to the routing path between c<sub>i</sub> and c<sub>j</sub>, determined using the XYZ routing algorithm. $e(p_{i,j})$ is the average energy (in joules) consumed in transporting a single bit of information from core c<sub>i</sub> to c<sub>j</sub>, i.e., $E_{bit}^{c_i,c_j}$. $l(p_{i,j})$ is the set of links that form the routing path $p_{i,j}$.

Application Communication Graph, ACG, G = G(A, M) is a directed graph, where each vertex  $a_i$  corresponds to one application, and each directed edge  $m_{i,j}$  corresponds to the communication from  $a_i$  to  $a_j$ .  $V(m_{i,j})$  denotes the communication volume (bits) from  $a_i$  to  $a_j$ .  $bw(m_{i,j})$  stands for the minimum bandwidth (bits/sec.) that the underlying communication architecture should provide.

The problem of minimizing dynamic communication energy consumption can be formulated as a one-to-one mapping of applications to cores. Given a CRG and an ACG that satisfy the condition in Eq. 2:

$$size(ACG) \le size(CRG)$$
 (2)

derive a one-to-one mapping function bmap() from ACG to CRG that achieves:

$$\text{minimize} \left\{ Energy = \sum_{\forall m_{i,j}} V(m_{i,j}) \times e\left(p_{bmap(a_i),\,bmap(a_j)}\right) \right\}$$
(3)

Such that:

$$\forall \text{ link } l_k, \quad \text{Bw}(l_k) \ge \sum_{\forall i, j, a_i, a_j \in M} \text{bw}(m_{i,j}) \times \text{is\_link}\left(l_k, p_{\text{bmap}(a_i),\,\text{bmap}(a_j)}\right)$$
(4)

Where $Bw(l_k)$ is the bandwidth of link $l_k$, and is\_link() returns true (i.e. 1) if $l_k$ is one of the links that form the routing path $p_{bmap(a_i),bmap(a_j)}$; otherwise it returns false (i.e. 0).

© 2013 Global Journals Inc. (US)

Eq. (4) ensures that the traffic load on any link does not exceed its allocated bandwidth.
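The objective in Eq. 3 and the constraint in Eq. 4 can be sketched for a complete mapping as follows. This is an illustration under stated assumptions, not the authors' implementation: `bmap` is taken to be a dictionary from application index to tile coordinate, XYZ routing is assumed to resolve the X offset first, then Y, then Z, every link is assumed to have the same bandwidth `link_bw`, and the number of switches on a path is taken as the number of links plus one.

```python
from collections import defaultdict


def xyz_route(src, dst):
    """Directed links (tile, tile) used by XYZ routing: X first, then Y, then Z."""
    links, cur = [], list(src)
    for axis in range(3):
        step = 1 if dst[axis] > cur[axis] else -1
        while cur[axis] != dst[axis]:
            nxt = cur.copy()
            nxt[axis] += step
            links.append((tuple(cur), tuple(nxt)))
            cur = nxt
    return links


def e_bit(src, dst, e_sbit=0.54, e_lbit=0.0007):
    """Eq. 1, assuming switches traversed = links traversed + 1."""
    n_links = len(xyz_route(src, dst))
    return (n_links + 1) * e_sbit + n_links * e_lbit


def total_energy(volume, bmap):
    """Eq. 3: sum of V(m_ij) * e(p_bmap(a_i),bmap(a_j)) over all ACG edges."""
    return sum(v * e_bit(bmap[i], bmap[j]) for (i, j), v in volume.items())


def bandwidth_ok(bw, bmap, link_bw):
    """Eq. 4: the demand aggregated on every link must not exceed link_bw."""
    load = defaultdict(float)
    for (i, j), demand in bw.items():
        for link in xyz_route(bmap[i], bmap[j]):
            load[link] += demand
    return all(used <= link_bw for used in load.values())
```

Here `volume` and `bw` are dictionaries keyed by ACG edges `(i, j)`, holding $V(m_{i,j})$ and $bw(m_{i,j})$ respectively. Two flows whose routes share a link add up on that link, which is exactly the case Eq. 4 guards against.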

#### b) Branch and Bound Heuristic for 3D Mesh NoC

The proposed branch-and-bound heuristic traverses a search tree that represents the solution space, as shown in Fig. 2, to find an optimal mapping, i.e. the one with the least communication cost.



*Figure 2*: Search tree example for mapping eight applications onto  $2 \times 2 \times 2$  NoC architecture

Every node except root node is assigned a label. For example, node 547xxxxx represents an internal node where Application number  $A_0$ ,  $A_1$  and  $A_2$  of ACG are mapped on Core<sub>5</sub>, Core<sub>4</sub> and Core<sub>7</sub> of CRG respectively, whereas application number  $A_3$  to  $A_7$  of ACG are still unmapped.

*Communication Matrix:* It is used to store the communication requirements of the applications. The Communication Matrix is derived as shown in Eq. 5:

$$CommMatrix[i][j] = \sum_{\forall j \neq i} \{V(m_{i,j}) + V(m_{j,i})\}$$
(5)

For any node in the search tree, the energy consumed by communication among mapped applications represents the cost of that node. For example, the cost of node 546xxxxx can be computed as:

$$\begin{array}{l} V(m_{0,1}) \times e(p_{5,4}) + V(m_{1,0}) \times e(p_{4,5}) + V(m_{0,2}) \times e(p_{5,6}) \\ + \, V(m_{2,0}) \times e(p_{6,5}) + V(m_{1,2}) \times e(p_{4,6}) + V(m_{2,1}) \times e(p_{6,4}) \end{array}$$
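The node cost above can be sketched as a sum over the ACG edges whose two endpoints are both mapped. The label format follows the paper (position k holds the core of application $A_k$, `x` means unmapped), which limits this sketch to at most ten cores. The numbering of cores to coordinates (`core_coord`, with x varying fastest) and the reading of $n_{hops}$ as Manhattan distance plus one are assumptions made for illustration only.

```python
def core_coord(core_id, dims=(2, 2, 2)):
    """Hypothetical numbering: core id -> (x, y, z), x varying fastest."""
    lx, ly, _ = dims
    return (core_id % lx, (core_id // lx) % ly, core_id // (lx * ly))


def e_bit(src, dst, e_sbit=0.54, e_lbit=0.0007):
    """Eq. 1, assuming n_hops = Manhattan distance + 1."""
    dist = sum(abs(p - q) for p, q in zip(src, dst))
    return (dist + 1) * e_sbit + dist * e_lbit


def node_cost(label, volume, dims=(2, 2, 2)):
    """Energy of the traffic among the applications already mapped in `label`."""
    placed = {app: core_coord(int(ch), dims)
              for app, ch in enumerate(label) if ch != "x"}
    return sum(v * e_bit(placed[i], placed[j])
               for (i, j), v in volume.items()
               if i in placed and j in placed)
```

Because every term is non-negative and mapping one more application can only add terms, `node_cost("546xxxxx", ...)` is never smaller than `node_cost("54xxxxxx", ...)` for the same `volume`.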

The cost of a child node is never less than the cost of its parent node. This property is used later in the heuristic to prune ineligible tree branches. A node is called a legal node if the bandwidth requirement between the currently mapped applications is satisfied. This condition is given in Eq. 6:

$$Bw(l_k) \ge \sum_{\forall i,j,a_i,a_j \in M} bw(m_{i,j}) \times is\_link(l_k, p_{bmap(a_i),bmap(a_j)})$$
(6)

Here, M denotes the set of mapped applications.

A node is said to be illegal if it violates the condition represented in Eq 6. The child nodes of an illegal parent node are also illegal.

*The upper limit cost (ULC)* of a node is a cost that is no less than the minimum communication cost of its legal descendant leaf nodes (i.e. nodes where all the applications have been mapped). To obtain a ULC that is as low as possible for a node, a greedy approach for mapping the applications onto cores is adopted.

*The lower limit cost (LLC)* of a node is a lower bound on the communication cost that any of its legal descendant leaf nodes (i.e. nodes where all the applications have been mapped) can attain.

The run time of the proposed heuristic grows with the system size, so there is a trade-off between solution quality and run time. The speed-up techniques used in the proposed heuristic reduce the search time by detecting as many unpromising nodes as possible early in the search and trimming them away. To this end, the applications are ranked by their communication requirement so that applications with a higher communication requirement are mapped at the earlier stages. A node preference queue stores the legal nodes that are eligible for further expansion; the node with the lowest cost has the highest preference for branching. The length of the node preference queue affects the run time of the heuristic. When the queue length reaches a threshold value, child nodes are selected for insertion into the queue using stricter criteria.
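The ranking and the node preference queue described above can be sketched with a sort and a binary heap. The communication requirement of an application is taken here as the total volume it sends plus receives, which is one reading of Eq. 5; the names `rank_applications` and `NodeQueue` are hypothetical. The stricter insertion criteria applied beyond the queue-length threshold are not specified in the text, so they are left out.

```python
import heapq


def rank_applications(volume, n_apps):
    """Order applications by total traffic sent plus received, highest first."""
    demand = [0] * n_apps
    for (src, dst), bits in volume.items():
        demand[src] += bits
        demand[dst] += bits
    # sorted() is stable, so ties keep the lower application index first.
    return sorted(range(n_apps), key=lambda app: -demand[app])


class NodeQueue:
    """Node preference queue: the node with the lowest cost is expanded first."""

    def __init__(self):
        self._heap = []

    def push(self, cost, label):
        heapq.heappush(self._heap, (cost, label))

    def pop(self):
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)
```

A heap keeps both insertion and extraction of the cheapest node logarithmic in the queue length, which matters because the queue length directly affects run time.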

The heuristic alternates between the following two steps until it finds an optimal solution.

#### i. Branch

New child nodes are generated in this step by selecting the next unexpanded node from the node preference queue and then mapping the next unmapped application with the highest communication requirement onto each of the cores that are not yet occupied.

#### ii. Bound

Each child node generated in the previous step is examined to see whether it can lead to the best leaf nodes. The ULC and LLC of the node under inspection are computed, and the node is pruned if either the communication cost among the mapped applications on the occupied cores or the LLC is higher than the lowest ULC found so far during the search. The following method is used to minimize the computation time of ULC and LLC without compromising the quality of these bounds.
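The Branch and Bound steps can be combined into a compact skeleton. This is a sketch under assumptions rather than the authors' code: `cost`, `ulc` and `llc` are hypothetical callables supplied by the caller, each taking a partial mapping `{application: core}`; `ulc` is assumed to return a pair (upper bound, complete mapping that attains it) and may return infinity for an illegal node. The queue-length threshold is omitted, and `order` must contain at least one application.

```python
import heapq
from itertools import count


def branch_and_bound(order, cores, cost, ulc, llc):
    """order: applications ranked by demand; cores: all core identifiers."""
    best_cost, best_map = float("inf"), None  # lowest ULC seen so far
    tie = count()                             # tie-breaker: dicts are never compared
    queue = [(0.0, next(tie), {})]            # node preference queue, keyed on cost
    while queue:
        node_cost, _, mapping = heapq.heappop(queue)
        if node_cost > best_cost:             # became unpromising while queued
            continue
        app = order[len(mapping)]             # Branch: next application by demand
        for core in cores:
            if core in mapping.values():
                continue
            child = {**mapping, app: core}
            upper, leaf = ulc(child)          # Bound: refresh the lowest ULC
            if upper < best_cost:
                best_cost, best_map = upper, leaf
            child_cost = cost(child)
            if child_cost > best_cost or llc(child) > best_cost:
                continue                      # pruned
            if len(child) < len(order):       # leaves are already covered by ulc
                heapq.heappush(queue, (child_cost, next(tie), child))
    return best_cost, best_map
```

The skeleton returns the lowest ULC found and the leaf mapping that produced it. As long as `llc` never overestimates and `ulc` returns the cost of a real leaf, pruning with a strict comparison cannot discard an ancestor of an optimal leaf.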

#### ULC Computation

As stated above, the ULC of a node can be set equal to the communication cost of any legal descendant leaf node. A greedy mapping is used to find a legal descendant leaf node with a low communication cost. In the process of greedy mapping, the next unmapped application $a_k$ with the highest communication requirement is chosen and its ideal position on the CRG is calculated in terms of x, y and z coordinates using Eq. 7:

$$\begin{aligned} x &= \frac{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k})) \times a_{i}^{x}}{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k}))} \\ y &= \frac{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k})) \times a_{i}^{y}}{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k}))} \\ z &= \frac{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k})) \times a_{i}^{z}}{\sum_{\forall a_{i} \in M} (V(m_{k,i}) + V(m_{i,k}))} \end{aligned}$$
(7)

Here, $a_i^x$, $a_i^y$ and $a_i^z$ represent the row number, column number and slice number of the core that $a_i$ is mapped onto, respectively; M is the set of applications that have already been mapped, and it is updated at each stage of mapping. $a_k$ is then mapped to the unoccupied core whose topological position has the least Manhattan distance to (x, y, z).

This step is repeated until a single descendant leaf node is reached. If this leaf node is legal, the ULC of the node under examination is set equal to its cost; otherwise the ULC is set to infinity and the node is trimmed away.
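The greedy completion used for the ULC can be sketched as below. `ideal_position` evaluates Eq. 7 as a traffic-weighted centroid of the cores hosting the already mapped peers of $a_k$; `greedy_complete` then places each remaining application on the free core nearest to that point. Both names are hypothetical. Eq. 7 is undefined when $a_k$ exchanges no traffic with any mapped application, and the paper does not cover that case, so the fallback to the first free core is purely an assumption. The legality check of the resulting leaf is not shown.

```python
def ideal_position(app, placed, volume):
    """Eq. 7: traffic-weighted centroid of the cores hosting a_k's mapped peers."""
    weights = {i: volume.get((app, i), 0) + volume.get((i, app), 0) for i in placed}
    total = sum(weights.values())
    if total == 0:
        return None  # Eq. 7 is undefined here; the caller falls back (assumption)
    return tuple(sum(w * placed[i][axis] for i, w in weights.items()) / total
                 for axis in range(3))


def greedy_complete(placed, order, free_cores, volume):
    """Extend a partial mapping {app: (x, y, z)} to a leaf node, greedily."""
    placed, free = dict(placed), list(free_cores)
    for app in order:  # `order` lists applications by decreasing demand
        if app in placed:
            continue
        target = ideal_position(app, placed, volume)
        if target is None:
            core = free[0]  # assumption: any free core, the paper is silent
        else:
            core = min(free, key=lambda c: sum(abs(p - q) for p, q in zip(c, target)))
        placed[app] = core
        free.remove(core)
    return placed
```

The ULC of the node is then the communication cost of the returned leaf if that leaf satisfies Eq. 6, and infinity otherwise.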

#### LLC Computation

The LLC is composed of 3 components, as shown in Eq. 8:

$$LLC = C_{mm} + C_{mu} + C_{uu}$$
(8)

 $C_{mm}$  is the cost of communication among mapped applications.  $C_{mm}$  can be calculated exactly as positions of the mapped applications are known.  $C_{mu}$  is the communication cost between mapped and unmapped applications.  $C_{mu}$  can be derived from Eq.9:

$$C_{mu} = \sum_{i=0}^{N_{map}} \sum_{j=N_{map}}^{N_p} CommMatrix[i][j] \times LUC_i$$
(9)

$$LUC_{i} = \min_{\forall a_i \in M,\, c_k \in \breve{O}} e(p_{bmap(a_i),k})$$
(10)

*LUC* is the lowest unit cost. $N_{map}$ represents the total number of mapped applications; $N_p$ represents the total number of applications, which includes both mapped and unmapped applications. $\breve{O}$ represents the set of unoccupied cores.

The last component is $C_{uu}$, which represents the cost of communication among all unmapped applications. It can be derived as shown in Eq. 11:

$$C_{uu} = vol \times LUUC$$
(11)

$$vol = \sum_{i=N_{map}}^{N_p} \sum_{j=N_{map}+1}^{N_p} CommMatrix[i][j]$$
(12)

$$LUUC = \min_{\forall c_m, c_n \in \breve{O}} e(p_{m,n})$$
(13)

Here, *vol* represents the total communication volume among all the unmapped applications. *LUUC* is the lowest unmapped unit cost.
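Eqs. 8 to 13 can be put together in a short sketch. It rests on several assumptions that go beyond the text: `comm[i][j]` is read as the pairwise volume $V(m_{i,j}) + V(m_{j,i})$, each unordered pair of applications is counted once, the two cores in Eq. 13 are taken to be distinct, and $n_{hops}$ is the Manhattan distance plus one. The function and parameter names are hypothetical.

```python
from itertools import combinations


def e_bit(src, dst, e_sbit=0.54, e_lbit=0.0007):
    """Eq. 1, assuming n_hops = Manhattan distance + 1."""
    dist = sum(abs(p - q) for p, q in zip(src, dst))
    return (dist + 1) * e_sbit + dist * e_lbit


def llc(placed, unmapped, free_cores, comm):
    """Eq. 8: LLC = C_mm + C_mu + C_uu for a partial mapping.

    placed: {app: (x, y, z)}; unmapped: list of apps; free_cores: list of
    coordinates; comm[i][j]: assumed pairwise volume V(m_ij) + V(m_ji).
    """
    # C_mm: exact cost among mapped applications, each pair counted once.
    c_mm = sum(comm[i][j] * e_bit(placed[i], placed[j])
               for i, j in combinations(placed, 2))
    if not unmapped:
        return c_mm
    # C_mu (Eqs. 9-10): traffic to unmapped peers, priced at the cheapest
    # free core reachable from each mapped application.
    c_mu = 0.0
    for i, coord in placed.items():
        luc = min(e_bit(coord, core) for core in free_cores)
        c_mu += luc * sum(comm[i][j] for j in unmapped)
    # C_uu (Eqs. 11-13): traffic among unmapped applications, priced at the
    # cheapest pair of distinct free cores.
    vol = sum(comm[i][j] for i, j in combinations(unmapped, 2))
    luuc = min((e_bit(m, n) for m, n in combinations(free_cores, 2)), default=0.0)
    return c_mm + c_mu + vol * luuc
```

Each component prices traffic at the cheapest placement still available, so under these assumptions the result does not exceed the cost of any leaf below the node.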

#### III. EXPERIMENTAL RESULTS AND ANALYSIS

The performance of the application-to-core mapping derived from the proposed heuristic was evaluated on the NoC simulator NC-G-SIM. NC-G-SIM is a discrete-event, cycle-accurate simulator that supports regular 3D, 2D and irregular topology frameworks with XYZ and distributed table-based routing. In 3D ICs, the length of the heat conduction path and the power density per unit area increase as more dies are stacked vertically (Bernstein, Andry, Cann, Emma, Greenberg, Haensch & Young, 2007; Ebrahimi, Daneshtalab, Liljeberg, Plosila & Tenhunen, 2011; Hassanpour, Khadem & Hessabi, 2013). Hence, the maximum number of slices in the 3D topologies is limited to 4 in the experimental setup. $E_{Lbit}$ is set to 0.0007 with the help of the analytical energy model presented in (Choudhary, Gaur & Laxmi, 2011; Hu & Marculescu, 2003); $E_{Sbit}$ is assumed to be 0.54 and 0.52 for 6-port and 4-port routers respectively, based on estimates from Orion (Kahng, Li, Peh & Samadi, 2009) for 0.18 µm technology. The packet size and flit interval are set to 8 bytes and 2 clock cycles respectively. The number of cores used in the experiments ranges from 8 to 512.

The heuristic is applied to five sets of 100 different topologies each to generate the energy-aware mapping as well as a random mapping of applications to cores in both 3D and 2D Mesh NoC architectures of the same sizes and for the same traffic scenarios. A thousand categories of benchmarks were randomly generated using TGFF (Dick, Rhodes & Wolf, 1998), with diverse bandwidth requirements of the IP cores and communication volumes randomly generated according to the specified distribution. The average total energy consumption per flit reaching its destination is taken as the performance metric. The simulation is run for 5000 clock cycles with the applied packet injection interval.

As shown in Fig. 3, the proposed heuristic saves about 42%-55% of dynamic communication energy consumption compared to random mapping because the heuristic maps applications on the basis of traffic characteristics. Applications are ordered so that those with a higher communication requirement are mapped earlier, each to an unoccupied core that has the lowest cost to the occupied cores.





As shown in Fig. 4, the proposed heuristic saves about 19%-28% of dynamic communication energy consumption in 3D Mesh architecture in comparison to the energy-aware mapping proposed in [4] for a 2D Mesh architecture of the same size. For the same source-destination pair, a data packet has to traverse fewer links to reach its destination in a 3D NoC than in a 2D NoC, which also means fewer switch arbitrations, leading to lower dynamic energy consumption in the 3D Mesh architecture.



*Figure 4*: Average total Energy/flit for the received flits at destinations after applying energy aware mapping in 3D Mesh and energy aware mapping in 2D mesh of same sizes

#### IV. Conclusion

In this paper, a Branch and Bound algorithm for energy-aware mapping of applications to cores in a 3D Mesh NoC architecture is proposed. The experimental results clearly show that application-to-core mapping significantly impacts the dynamic communication energy consumption of the system. The proposed heuristic is fast and results in significant energy savings. The proposed Branch and Bound heuristic can be extended to support application-to-core mapping in irregular NoCs, where the application modules vary in shape and size and deadlock-free routing is more challenging.

#### V. Acknowledgment

The work is supported by the Department of Science and Technology, Jaipur, Rajasthan, India under the research project "Network-on-Chip simulation framework for regular, irregular and 3D-Mesh Interconnection Architecture".

#### References

- Bernstein, K., Andry, P., Cann, J., Emma, P., Greenberg, D., Haensch, W., & Young, A. (2007, June). Interconnects in the third dimension: Design challenges for 3D ICs. In *Proceedings of the 44th annual Design Automation Conference* (pp. 562-567). ACM.
- Choudhary, N., Gaur, M. S., & Laxmi, V. (2011). Energy Efficient Network Generation for Application Specific NoC. *Global Journal of Computer Science and Technology*, *11*(16).
- Dick, R. P., Rhodes, D. L., & Wolf, W. (1998, March). TGFF: task graphs for free. In *Proceedings of the 6th international workshop on Hardware/software codesign* (pp. 97-101). IEEE Computer Society.
- Ebrahimi, M., Daneshtalab, M., Liljeberg, P., Plosila, J., & Tenhunen, H. (2011, May). Exploring partitioning methods for 3D Networks-on-Chip utilizing adaptive routing model. In *Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on* (pp. 73-80). IEEE.
- Feero, B. S., & Pande, P. P. (2009). Networks-on-chip in a three-dimensional environment: A performance evaluation. *Computers, IEEE Transactions on, 58*(1), 32-45.
- Hassanpour, N., Khadem, P., & Hessabi, S. (2013). A Task Migration Technique for Temperature Control in 3D NoCs. In 27th IEEE International Conference on Advanced Information Networking and Applications (AINA). Manuscript submitted for publication.
- Hu, J., & Marculescu, R. (2005). Design methodologies for application specific networks-on-chip. Ph.D. dissertation. Carnegie Mellon University.
- Kahng, A. B., Li, B., Peh, L. S., & Samadi, K. (2009, April). Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In *Proceedings of the conference on Design, Automation and Test in Europe* (pp. 423-428). European Design and Automation Association.
- Rahmani, A. M., Latif, K., Liljeberg, P., Plosila, J., & Tenhunen, H. (2010, November). Research and practices on 3D networks-on-chip architectures. In *NORCHIP, 2010* (pp. 1-6). IEEE.
- Xu, Y., Du, Y., Zhao, B., Zhou, X., Zhang, Y., & Yang, J. (2009, February). A low-radix and low-diameter 3D interconnection network design. In *High Performance Computer Architecture, 2009. HPCA 2009. IEEE 15th International Symposium on* (pp. 30-42). IEEE.
