Application-specific network-centric architectures, such as Networks-on-Chip (NoCs), have recently become an effective solution to support high-bandwidth communication in Multiprocessor Systems-on-Chip (MPSoCs). Moreover, introducing hierarchy into the NoC design exploits the predominantly local nature of communication in MPSoC architectures. This paper presents a methodology to design Application-Specific Hierarchical NoC (ASHiNoC) architectures considering floorplanning information. The presented approach targets heterogeneous clustered architectures where intra-cluster communication is managed by a low-latency circuit-switched crossbar, while inter-cluster communication is managed by a high-bandwidth packet-based NoC, allowing regular topologies. The proposed design flow addresses the problem starting from cluster selection down to the floorplanning-aware estimation of the interconnect performance in terms of latency, power and area, within each cluster and for the backbone NoC. Experimental results show that the ASHiNoC architecture reduces interconnection power and latency by 49% and 33% respectively, at the cost of a 78% area increase with respect to a flat topology.
INTRODUCTION
The current trend in System-on-Chip (SoC) design is to integrate a large number of processors, memories and hardware accelerators onto a single die, making the concept of the Multiprocessor System-on-Chip (MPSoC) a reality. The enormous degree of integration also raises new challenges in designing the interconnection infrastructure, since old communication paradigms no longer scale. Due to the complexity of future MPSoC designs, systems require alternatives for interconnecting processing elements (PEs) other than large and somewhat inefficient conventional networks-on-chip (NoCs).
NoCArc '11, December 4, 2011, Porto Alegre, Brazil. Copyright 2011 ACM 978-1-4503-0947-9.
The performance and energy dissipation of a system depend heavily on the network topology and on how the cores are connected in the NoC. Selecting the network topology is one of the first steps in designing such a complex interconnect, since the routing strategy, the flow control methods and the mapping of cores onto network nodes heavily depend on the topology [1].
The use of a regular Network-on-Chip topology, such as those used in macro-networks (e.g. mesh, torus, ring), is an a priori decision that makes sense especially when targeting homogeneous systems or when the system design characteristics (type and size of cores, traffic patterns) are not completely known at the beginning of the design phase. In fact, regular NoC topologies enable design predictability by controlling wire delay, which is sometimes enough to satisfy general-purpose requirements. However, these general-purpose alternatives cannot deliver higher performance and/or lower power dissipation, since regular topologies exhibit poor performance and a large power overhead [2, 3]. Besides, a general-purpose solution is not a good fit for the embedded system context, where communication patterns are irregular and strongly application-dependent, and where the cores are heterogeneous also in terms of size, which affects link lengths and consequently the network latency and its maximum frequency [3, 4, 5, 6].
In the context of application-specific NoC design, knowledge of both the communication patterns and the core dimensions should be exploited for optimization purposes, avoiding designs for either the worst case or the average case, which respectively lead to wasted area or possible communication bottlenecks. In fact, knowledge of the communication patterns helps to find the best core allocation, reducing the number of intermediate network hops for high-bandwidth edges and consequently reducing the network power and latency. On the other hand, knowledge of the core sizes can be used to customize the topology, considering that not all router-to-router links can have the same length and that small cores can be connected to the same router instead of having one router per core.
Analyzing the interconnect architectures, one way to obtain higher performance is to use a circuit-switching solution employing crossbar switches in the interconnection. To circumvent the scalability problems of this component, a hierarchical topology that uses crossbar switches for local communication and packet-switched routers for global communication appears to be a viable solution.
This paper presents the ASHiNoC (Application-Specific Hierarchical NoC) analyzer framework, which produces an optimized topology for this kind of application-specific hierarchical interconnection. The main contributions of this paper are:
• The exploitation of a hierarchical interconnection where the local interconnections are made by circuit-switched crossbars while the global interconnection uses a packet-switched network;
• The introduction of an automatic Design Space Exploration (DSE) framework for designing hierarchical network topologies based on floorplanning considerations;
• The application-specific optimization in terms of area, power and latency of the hierarchical interconnection infrastructure, while maintaining communication requirements, considering both core mapping and clustering problems and architectural parameters tuning.
Experimental results present the power/performance/area trade-offs for the Application-Specific Hierarchical NoC, obtained by changing mapping and architectural parameters on a state-of-the-art real application.
The paper is organized as follows. Section 2 briefly introduces the background and state of the art, while the hierarchical NoC architecture (ASHiNoC) is presented in Section 3. The methodology is proposed in Section 4, where we detail the flow used to find an optimal hierarchical topology. Section 5 shows the results obtained with the proposed tool, and finally Section 6 concludes the paper.
PREVIOUS WORKS
A wide range of works in the literature present tools and methodologies to efficiently design application-specific NoC topologies, such as [2, 3, 4, 5, 6, 7].
In [5] the authors proposed four different approaches for application-specific mapping and topology customization applied to the STNoC architecture. The proposed methods are based on the orthogonalization of the mapping and topology design concepts. The topologies are derived from ring and spidergon topologies, and the proposal considers routers with only three input/output ports for the connections among NoC routers and one port to connect the core. However, the topology exploration in this work is limited, since the proposal concerns only the STNoC architecture.
In [3] a method was presented that considers the wiring complexity in the floorplan during the NoC topology synthesis process. The proposed mechanism integrates the synthesis, generation, simulation and physical design processes. First, the tool considers a set of constraints defined by the user. In the next step the NoC architecture is synthesized and the topology that best approximates the design constraints is chosen, ensuring the bandwidth required by the application. After obtaining the synthesized topology, the floorplanning information is extracted. For cluster partitioning the authors used a min-cut strategy to group the cores. The connections among clusters were made using a path allocation algorithm.
Chan and Parameswaran presented in [4] an iterative methodology to obtain an energy-efficient topology supporting point-to-point and packet-switched connections. This work considers the effects of wire length according to the topology. The topology generation algorithm consists of two phases: a phase that uses a topology as an initial starting point, and a refinement phase that analyzes the possibilities of adding or removing links to compose a point-to-point or a router connection.
Another proposal for an application-specific network-on-chip architecture was presented in [6]. This work proposed two heuristic algorithms to examine the partitioning possibilities for communication flows. The first algorithm, called CLUSTER, reduces the number of set partitions from a smaller subset of set partitions. The second algorithm, called DECOMPOSE, starts with a single cluster and creates new partitions at each iteration. Both approaches generate the network topology using a Rectilinear-Steiner-Tree based algorithm to evaluate the cost of each partitioned group.
In [7] a floorplan-aware method focused on minimizing power consumption was presented. The authors use partition-driven floorplanning to obtain the physical information. The switches and network interfaces are defined using a heuristic method and a min-cost max-flow algorithm, respectively. The distinguishing feature of this proposal is that it considers the network interfaces in the topology generation. Power consumption is minimized using an incremental path allocation. A similar proposal is found in [2]. That work attacks the same problem but uses a genetic algorithm to synthesize the application-specific NoC topology. However, the majority of these proposals consider clusters connected by switches with buffers and additional control logic that increase the area and power dissipation.
Although the previous works benefit from application-specific NoC customization, their approaches remain within the concept of a flat topology. In fact, core clustering is not used to separate local and global communications by defining two levels of interconnection; it is used mainly to reduce the number of NoC nodes, increasing the number of ports of each router.
In this paper we present an automatic application-specific framework to exploit hierarchical interconnection based on floorplanning information, jointly considering the core clustering, NoC mapping and interconnection synthesis problems.
TARGET NOC ARCHITECTURE
A hierarchical NoC topology brings many advantages for complex MPSoCs, since it can exploit the communication locality of the system while maintaining the NoC advantages. Besides, the hierarchical NoC topology has been shown not only to reduce the number of hops when compared to a regular packet-switched topology, but also to provide suitable bandwidth, power and QoS results.
In this context, the target Application-Specific Hierarchical Network-on-Chip (ASHiNoC) is composed of small clusters, each interconnected by a local crossbar (LXBar) that drastically reduces the power consumption and improves the communication performance of the system; the clusters are in turn interconnected by a regular packet-switched backbone NoC (BNoC). Figure 1 shows an example of the target ASHiNoC topology. Moreover, placing cores that communicate heavily with each other in the same cluster minimizes the communication energy [7].
Each cluster is composed of a crossbar and an arbiter that drives the multiplexer selection for each output. The arbiter is based on the Round-Robin algorithm, and in the designed cluster architecture data are sent to the destination core only if that core is available to receive them. In this case, up to n communications can take place simultaneously, where n is the number of cores present in the cluster. Each core sends a control signal indicating its availability, and a handshake protocol is used to manage data reception. Data are transmitted according to this control and to the request of the source core (req, which contains the ID of the destination core). Message transmission uses a protocol that marks the end-of-message (eom). Each output port has its own arbiter, built around a finite state machine (FSM) that controls the requests for the same port. The implementation of the Round-Robin algorithm takes only two cycles independently of the crossbar switch size, at the possible cost of a lower LXBar maximum frequency. In the first cycle, the arbiter checks all requests from the crossbar input ports for an output port, and in the second cycle it serves the next request. In summary, the arbiter examines only the input ports with active requests and, in an orderly manner, selects the next input port to be served, ensuring starvation freedom.
BNoC. For the global level we decided to use conventional packet-switching routers with a wormhole mechanism. This strategy has been the most popular since it is simple, regular and presents suitable features for this level of the hierarchy. The BNoC router is similar to a conventional NoC router but, instead of a PE being coupled to the local channel, the local channel is connected to a cluster port. At the global level, different regular topologies can be considered, such as mesh, ring, torus and spidergon.
In any case, for an application-specific NoC, only low-bandwidth communications are assigned to the global level.
The architecture of the crossbar switches and routers is illustrated in Figure 2. The bridge modules are the boundaries between the two network levels and integrate the synchronization logic needed when the LXBars and the BNoC work at different frequencies.
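The per-output-port round-robin arbitration described in this section can be illustrated with a small behavioral sketch. It is not the hardware FSM of the LXBar: the class name `OutputPortArbiter`, the `destination_ready` flag and the choice of serving input 0 first are assumptions made only for illustration.

```python
class OutputPortArbiter:
    """Behavioral round-robin arbiter for one crossbar output port (illustrative)."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        # Assumption: start so that input 0 is the first candidate to be served.
        self.last_granted = num_inputs - 1

    def arbitrate(self, requests, destination_ready=True):
        """Return the input port to serve next, or None if nothing can be granted.

        requests[i] is True when input i requests this output port.
        """
        if not destination_ready:
            # Data are sent only when the destination core is available.
            return None
        # Scan inputs in circular order, starting just after the last one served,
        # so that every requesting input is eventually granted (no starvation).
        for offset in range(1, self.num_inputs + 1):
            candidate = (self.last_granted + offset) % self.num_inputs
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


arbiter = OutputPortArbiter(num_inputs=4)
requests = [True, False, True, True]
grants = [arbiter.arbitrate(requests) for _ in range(4)]
```

With inputs 0, 2 and 3 requesting continuously, the grants rotate among the three active inputs and skip the idle one, which is the starvation-free behavior the text attributes to the arbiter.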
THE PROPOSED METHODOLOGY
The advantages of a hierarchical interconnection topology for heterogeneous systems can only be exploited with automatic tools for design space exploration during NoC synthesis. In fact, the large design space created by the mapping and clustering decisions, together with the BNoC and LXBar architecture configuration, cannot be handled manually.
The application-specific network-on-chip synthesis that we address is carried out by selecting the NoC parameters (such as frequency and router buffer depth) and the topology, and by mapping the cores onto the network nodes. In particular, for the target hierarchical NoC, the synthesis also includes the generation of the local Xbar-based cluster interconnections. The synthesis step takes into account the estimated performance, power consumption and latency, considering floorplanning information.
Problem Formulation
To define the problem of designing an Application-Specific Hierarchical NoC, let us introduce the following concepts: the Core Graph, used to describe the target application; the NoC Topology Graph, used in our approach to describe the Backbone NoC; and the Core-to-Node mapping function, used to correlate the two graphs.
The Core Graph is a directed graph $G(V, E)$ where $V$ is the set of PEs belonging to the target System-on-Chip and $E$ is the set of edges representing the communication between the cores $v_i \in V$ and $v_j \in V$. The weight of the edge $e_{i,j} \in E$, denoted by $comm_{i,j}$, represents the bandwidth of the directed communication from $v_i$ to $v_j$.
The NoC Topology Graph is a directed graph $P(U, F)$ where $U$ is the set of network nodes and $F$ is the set of directed edges $(u_i, u_j)$, each representing an existing link between the network nodes $u_i \in U$ and $u_j \in U$. Each edge $f_{i,j} \in F$ has a weight $bw_{i,j}$ which represents the bandwidth available across $f_{i,j}$.
The Core-to-Node mapping function $M : V \to U$ is defined as the set of Core-to-Node mappings $(v_i, u_j)$, each representing the core $v_i \in V$ mapped to the network node $u_j \in U$. The set of possible mappings $M(P, G)$ depends on a given network topology graph $P$ and a Core Graph $G$.
Although neither the NoC Topology Graph nor the Core-to-Node mapping function explicitly models the cluster hierarchy, we handle the hierarchy in the mapping problem formulation by considering that two (or more) cores mapped onto the same network node belong to the same cluster and are therefore interconnected by a crossbar. Moreover, to evaluate the performance of each mapping and of the related Application-Specific Hierarchical NoC including the floorplanning information, we consider an enhanced version of the Core Graph where each core is annotated with its equivalent area and aspect ratio.
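As a minimal sketch of how the hierarchy follows from the formulation above, the snippet below derives the clusters from a Core-to-Node mapping and separates the traffic handled by the local crossbars from the traffic that must cross the backbone NoC. The core names, bandwidth values and function names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def clusters_from_mapping(mapping):
    """Group cores by network node: cores sharing a node form one cluster."""
    clusters = defaultdict(list)
    for core, node in mapping.items():
        clusters[node].append(core)
    return {node: sorted(cores) for node, cores in clusters.items()}


def split_traffic(core_graph, mapping):
    """Return (intra_cluster, inter_cluster) total bandwidth.

    core_graph maps a directed edge (src, dst) to its bandwidth comm_{i,j}.
    """
    intra = 0
    inter = 0
    for (src, dst), bandwidth in core_graph.items():
        if mapping[src] == mapping[dst]:
            intra += bandwidth  # served by the local crossbar
        else:
            inter += bandwidth  # crosses the backbone NoC
    return intra, inter


# Hypothetical three-core example (bandwidths in arbitrary units).
core_graph = {("cpu", "mem"): 400, ("cpu", "dsp"): 50, ("dsp", "mem"): 100}
mapping = {"cpu": 0, "mem": 0, "dsp": 1}

clusters = clusters_from_mapping(mapping)
intra_bw, inter_bw = split_traffic(core_graph, mapping)
```

In this toy mapping the heaviest edge stays inside a cluster, which is the kind of allocation the exploration is expected to favor.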
Overview of the application specific methodology
The tool flow proposed in this paper is shown in Figure 3. It takes as input the definition of the target topology in terms of number of routers and pattern (e.g. mesh, ring), the routing functions associated with the topology and the enhanced version of the Core Graph (annotated also with the core dimensions), and generates as output the best ASHiNoC architecture together with the power/area/latency evaluations. It is composed of 5 main steps performed iteratively: core allocation, floorplanning, cluster estimation, backbone NoC estimation and hierarchical NoC estimation.
Core Allocation. This step is in charge of cluster generation: each core of the Core Graph G(V, E) is mapped onto a specific cluster. It has been implemented using the NSGA-II [8] multi-objective genetic algorithm, since the design space of the target problem is too vast to be analyzed exhaustively. The adopted chromosome structure is composed of a number of genes equal to the number of cores in the Core Graph. Each gene identifies the cluster onto which the related core has been mapped. This chromosome structure can be extended by encoding additional genes for other architectural parameters related to the Hierarchical NoC (e.g. data-path width, BNoC frequency).
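The chromosome structure described for the Core Allocation step can be sketched as a flat list of integers. The snippet below decodes one gene per core plus two additional genes for the number of BNoC routers and the BNoC frequency, using the value sets reported in the experimental section. The index-based encoding of the extra genes and the modulo folding of out-of-range cluster genes are assumptions of this sketch, not details taken from the paper.

```python
# Value sets from the experimental setup; the extra genes index into them.
ROUTER_COUNTS = (2, 4, 6, 9, 12, 16)
BNOC_FREQS_MHZ = (250, 500, 750, 1000)


def decode_chromosome(chromosome, num_cores):
    """Split a chromosome into (core_to_cluster, num_routers, bnoc_freq_mhz).

    Layout (assumed): num_cores cluster genes, then one router-count gene
    and one frequency gene.
    """
    cluster_genes = chromosome[:num_cores]
    num_routers = ROUTER_COUNTS[chromosome[num_cores]]
    bnoc_freq_mhz = BNOC_FREQS_MHZ[chromosome[num_cores + 1]]
    # Assumed repair rule: fold cluster ids onto the available routers.
    core_to_cluster = [gene % num_routers for gene in cluster_genes]
    return core_to_cluster, num_routers, bnoc_freq_mhz


# Four cores, router-count index 1 (4 routers), frequency index 2 (750 MHz).
example = [0, 5, 3, 1, 1, 2]
core_to_cluster, num_routers, bnoc_freq_mhz = decode_chromosome(example, num_cores=4)
```

Keeping every gene an integer lets standard crossover and mutation operators act on the whole chromosome uniformly, which is one reason this kind of encoding is convenient for a genetic algorithm.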
Floorplanning. This step analyzes the core-to-cluster assignment, performing the floorplan estimation for each cluster and for the backbone NoC. The floorplan has been derived using the Hotfloorplan tool [9] considering a two-dimensional layout, where each core is represented by a rectangular region with the size and aspect ratio specified in the input Core Graph, and each local crossbar and backbone router (together with its related Network Interface) is represented by a square whose size depends, respectively, on the number of connected cores and on the BNoC topology (see footnote 2). The floorplan is created according to the topology of the entire hierarchical network, by minimizing a linear combination of the silicon area of the entire floorplan and the distance between interconnected resources.
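The floorplanning objective mentioned above, a linear combination of silicon area and distance between interconnected resources, can be sketched as follows. This is not the Hotfloorplan implementation or its interface: the block representation, the use of the bounding box as area, the Manhattan center-to-center distance and the two weights are all assumptions chosen for illustration.

```python
def floorplan_cost(blocks, connections, area_weight=1.0, wire_weight=1.0):
    """Cost = area_weight * bounding-box area + wire_weight * total wire length.

    blocks maps a name to (x, y, width, height) of a placed rectangle;
    connections is a list of (name_a, name_b) pairs of interconnected blocks.
    """
    min_x = min(x for x, y, w, h in blocks.values())
    min_y = min(y for x, y, w, h in blocks.values())
    max_x = max(x + w for x, y, w, h in blocks.values())
    max_y = max(y + h for x, y, w, h in blocks.values())
    area = (max_x - min_x) * (max_y - min_y)

    def center(name):
        x, y, w, h = blocks[name]
        return x + w / 2, y + h / 2

    wire_length = 0.0
    for name_a, name_b in connections:
        ax, ay = center(name_a)
        bx, by = center(name_b)
        wire_length += abs(ax - bx) + abs(ay - by)  # Manhattan distance

    return area_weight * area + wire_weight * wire_length


# Two 2x2 blocks placed side by side and connected to each other.
blocks = {"core_a": (0, 0, 2, 2), "xbar": (2, 0, 2, 2)}
cost = floorplan_cost(blocks, [("core_a", "xbar")], area_weight=1.0, wire_weight=0.5)
```

Raising `wire_weight` relative to `area_weight` would push a floorplanner toward shorter links at the expense of a less compact layout, which matters here because link length drives the achievable cluster frequency.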
Clusters Estimation. This step uses the previous floorplanning phase to design each LXBar-based interconnection and to estimate its performance in terms of maximum working frequency, average power and area. First, each cluster floorplan is used to estimate the length of the links attached to the local Xbars. Next, the estimated link lengths are processed by an in-house model that extracts the maximum working frequency of each cluster depending on the wire latency and the crossbar size. The model was derived from HSPICE results for a 65nm process technology. We used a distributed RLC-π model [11] which emulates the electrical behavior of the wire. The resistance and capacitance values of this model for the 65nm technology were obtained from [12]. Figure 4 shows the maximum frequency allowed as a function of the link length, considering two repeater-insertion strategies: one repeater every 1mm or every 2mm. Finally, area and average power values for both crossbars and links have been derived using Orion 2.0 [10], taking into consideration the estimated link lengths, the crossbar size in terms of ports, the data-path width and the communication traffic. This step concludes by verifying that each crossbar link is able to sustain the allocated bandwidth; otherwise a bandwidth constraint violation is reported. The number of violated bandwidth constraints is used by the exploration algorithm to discriminate between good and bad designs and to select the next mapping function to analyze.
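The bandwidth verification that closes the cluster estimation step can be sketched as a simple count of overloaded links. The capacity model used here (one data-path word transferred per cycle, expressed in MB/s) and all names and numbers are assumptions for illustration; the paper does not state how link capacity is computed.

```python
def link_capacity_mbytes_per_s(width_bits, freq_mhz):
    """Assumed peak capacity: one data-path word per cycle."""
    return width_bits / 8 * freq_mhz


def count_bandwidth_violations(link_loads, width_bits, link_freq_mhz):
    """Count links whose allocated bandwidth exceeds their capacity.

    link_loads maps a link name to its allocated bandwidth in MB/s;
    link_freq_mhz maps the same link name to its working frequency in MHz.
    """
    violations = 0
    for link, load in link_loads.items():
        capacity = link_capacity_mbytes_per_s(width_bits, link_freq_mhz[link])
        if load > capacity:
            violations += 1
    return violations


# Hypothetical cluster with two crossbar links running at 250 MHz, 32-bit wide.
loads = {"core0-xbar": 800, "core1-xbar": 1200}
freqs = {"core0-xbar": 250, "core1-xbar": 250}
violations = count_bandwidth_violations(loads, width_bits=32, link_freq_mhz=freqs)
```

Returning a count rather than a pass/fail flag gives the exploration algorithm a graded signal, so that a design with one overloaded link can be ranked above a design with many.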
Backbone NoC Estimation. Once the per-cluster analysis is concluded, this step estimates the backbone network performance in terms of maximum sustainable frequency, area and average power consumed by each router and by each router-to-router link. This estimation is done by considering the BNoC topology and the inter-cluster-level floorplan for deriving the length of the network links, and using again Orion 2.0 [10] for the technology parameters. As previously done for the local crossbars, this step concludes by verifying that each BNoC link is able to sustain the allocated bandwidth; otherwise a bandwidth constraint violation is reported as feedback for the exploration process.
Footnote 2: The area model for both local crossbars and BNoC routers has been derived using Orion 2.0 [10].
Hierarchical NoC Estimation. Finally, the last step of the methodology puts together all the metrics derived for the local crossbars and the backbone NoC. The area and average power of the ASHiNoC are obtained as the sum of the crossbar and BNoC values, while the average latency is the weighted sum, over the edges of the communication graph, of the ideal latency (IL), which can be expressed as follows:
$$IL = CL_s + CL_d + \frac{hops \cdot cycles}{freq_{BNoC}}$$
where $CL_s$ and $CL_d$ are the latencies due to the source and destination clusters, respectively, $freq_{BNoC}$ is the backbone NoC working frequency, and $hops$ and $cycles$ are respectively the number of intermediate hops and the number of network cycles introduced by each hop.
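The latency aggregation can be sketched numerically under explicit assumptions. The snippet takes the ideal latency of an edge to be the two cluster latencies plus the backbone traversal time, hops × cycles / freq_BNoC, which is a form consistent with the symbols defined above but is assumed here rather than quoted. It further assumes that each edge is weighted by its bandwidth, with cluster latencies in ns and the BNoC frequency in GHz.

```python
def ideal_latency_ns(cl_src_ns, cl_dst_ns, hops, cycles_per_hop, freq_bnoc_ghz):
    """Assumed form: cluster latencies plus backbone traversal time (ns)."""
    return cl_src_ns + cl_dst_ns + hops * cycles_per_hop / freq_bnoc_ghz


def average_latency_ns(edges, freq_bnoc_ghz):
    """Bandwidth-weighted average of the ideal latency over all edges.

    Each edge is (bandwidth, cl_src_ns, cl_dst_ns, hops, cycles_per_hop).
    """
    total_bandwidth = sum(edge[0] for edge in edges)
    weighted = 0.0
    for bandwidth, cl_src, cl_dst, hops, cycles in edges:
        latency = ideal_latency_ns(cl_src, cl_dst, hops, cycles, freq_bnoc_ghz)
        weighted += bandwidth * latency
    return weighted / total_bandwidth


# Hypothetical edges: one with zero backbone hops, one crossing two hops.
edges = [
    (400, 2.0, 2.0, 0, 3),
    (100, 2.0, 4.0, 2, 3),
]
avg_latency = average_latency_ns(edges, freq_bnoc_ghz=0.5)
```

Because the cluster terms depend on the local crossbar frequency, a mapping with fewer backbone hops does not automatically yield a lower average latency; both terms have to be evaluated together.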
EXPERIMENTAL RESULTS
In this section, we show the experimental results obtained by applying the proposed methodology to the design of a 38-core TVOPD (Triple Video Object Plane Decoder) case study, where the communication core graph structure, presented in Figure 5, has been derived from [13] and the core area values from [3]. Regarding the network evaluation, synthesis and estimation values have been obtained for a 65nm process technology, considering a router architecture with an input queue depth of 4 slots.
The analyzed design space X for the target use case is composed of all the possible configurations x combining the following 40 parameters:
• 38 core cluster assignment and network mapping parameters, varying from 0 to Max BNoC routers (one for each core);
• 1 parameter for the number of mesh backbone NoC routers, varying in the set {2, 4, 6, 9, 12, 16};
• 1 BNoC frequency parameter, varying in the set {250 MHz, 500 MHz, 750 MHz, 1000 MHz}.
The resulting design space is composed of more than $10^{46}$ architectural configurations, and it is infeasible to explore without an automatic design space exploration technique.
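The order of magnitude of the design space can be checked with exact integer arithmetic. The count below assumes that, for a backbone with n routers, each of the 38 cores can be assigned to any of the n clusters, and that every assignment can be combined with each of the four BNoC frequencies; the precise counting used by the authors may differ.

```python
NUM_CORES = 38
ROUTER_COUNTS = (2, 4, 6, 9, 12, 16)
NUM_FREQUENCIES = 4  # 250, 500, 750 and 1000 MHz


def design_space_size(num_cores, router_counts, num_frequencies):
    """Number of configurations under the counting assumption stated above."""
    assignments = sum(n ** num_cores for n in router_counts)
    return num_frequencies * assignments


size = design_space_size(NUM_CORES, ROUTER_COUNTS, NUM_FREQUENCIES)
exceeds_1e46 = size > 10 ** 46
```

The 16-router case dominates the sum, so the total is driven almost entirely by the largest backbone considered.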
For the application-specific hierarchical NoC generation, we formalized the multi-objective optimization problem as follows:
$$\min_{x \in X} \left[ Area(x),\ Power(x),\ Latency(x),\ Routers(x) \right] \qquad (1)$$
subject to the application communication bandwidth being respected.
Figure 5: Communication Core Graph for the TVOPD benchmark.
In particular, the minimization problem we addressed for this use case consists of four objective functions. Three of them are related to the entire ASHiNoC interconnection: the area, the average power consumption and the average latency. The fourth refers to the Backbone NoC component of the ASHiNoC interconnection, and in particular to the number of routers. The minimization problem has no constraints other than the QoS requirements of the TVOPD application: the resulting design configurations must be able to support the data traffic presented in Figure 5.
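Since all four objectives are minimized, the feasible configurations can be compared through Pareto dominance. The sketch below is a plain quadratic-time filter, not the non-dominated sorting used inside NSGA-II, and the objective tuples are made-up values ordered as (area, power, latency, routers).

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (area, power, latency, routers) tuples.
candidates = [
    (1.0, 2.0, 3.0, 4),
    (2.0, 3.0, 4.0, 5),  # dominated by the first point
    (0.5, 5.0, 3.0, 4),  # trades lower area for higher power
]
front = pareto_front(candidates)
```

Configurations that violate the bandwidth requirements would be excluded or penalized before this comparison, since the problem is constrained only by the application traffic.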
To help the system architect select among the large number of feasible solutions composing the Pareto front, and to better analyze the impact of the hierarchy in the different design configurations, we decided to first cluster the Pareto solutions with respect to the number of routers in the Backbone NoC and then select a champion solution for each cluster by using a decision-making mechanism based on the following product:
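The two-stage selection described above, grouping the Pareto solutions by number of BNoC routers and then picking one champion per group, can be sketched as follows. The scoring product used here, area × power × latency, is an assumption made for illustration, since the exact expression is not reproduced in the text; the solution records are also made up.

```python
def select_champions(solutions):
    """Return, for each router count, the solution with the lowest score.

    Each solution is a dict with keys "routers", "area", "power", "latency".
    The score (assumed) is the product area * power * latency.
    """
    best = {}
    for solution in solutions:
        score = solution["area"] * solution["power"] * solution["latency"]
        current = best.get(solution["routers"])
        if current is None or score < current[0]:
            best[solution["routers"]] = (score, solution)
    return {routers: solution for routers, (score, solution) in best.items()}


# Hypothetical Pareto solutions.
pareto_solutions = [
    {"routers": 4, "area": 2.0, "power": 3.0, "latency": 5.0},
    {"routers": 4, "area": 1.0, "power": 4.0, "latency": 6.0},
    {"routers": 9, "area": 1.5, "power": 5.0, "latency": 4.0},
]
champions = select_champions(pareto_solutions)
```

A product-based score does not require choosing weights for metrics with different units, but it also means the champion of one group is not guaranteed to follow the trend of its neighbors, which is consistent with the local irregularities noted for Figure 6.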
The results of the NSGA-II based exploration phase after 50K evaluations are reported in Figure 6, where we plot the interconnection average power and area (Figures 6(a) and 6(b), respectively) and the average latency and BNoC hops (Figure 6(c)) for the best configurations found according to Equation 2 by varying the number of routers. Before commenting on the results, we underline that, although the global trends are clear in Figure 6, some unexpected local trends may appear due to the Decision Making Mechanism (DMM) represented by Equation 2.
Figure 6(a) shows the ASHiNoC average power, split into the Xbar and Backbone NoC components. Except for a small reduction when passing from 4 to 9 nodes, the global power trend is to increase with the number of BNoC nodes. While the backbone NoC power component, which is mainly responsible for the global trend, increases with the number of nodes, the opposite happens for the Xbar component. This is mainly due to the increase of data traffic that moves from intra-cluster to inter-cluster communications, relieving the pressure on the local Xbars. Passing from 4 to 9 BNoC nodes (probably due to the Equation 2 based selection), an almost constant BNoC power together with a power reduction in the Xbar components causes a global reduction in the ASHiNoC power.
Figure 6(b) shows the ASHiNoC area, split into the Xbar and Backbone NoC components. The global trend is an ASHiNoC area reduction as the number of BNoC nodes increases. The main reason can be found by looking at the two components, where the Xbar area is the larger one. In fact, large clusters (found when the number of BNoC nodes is small) require longer local interconnections than small clusters (found when the number is large), and thus a larger area.
For the target case study, the area overhead introduced by the additional BNoC routers is smaller than the benefit of reducing the cluster dimension, producing the observed area reduction trend. Figure 6(c) shows the ASHiNoC average latency and BNoC hops. Passing from 2 to 16 nodes, the average number of hops increases, mainly due to the increase of data traffic that moves from intra-cluster to inter-cluster communications and to the increase of the average network distance [1].
This expected trend is not completely reflected in the average interconnection latency, since the average hops metric completely hides the impact of the clusters' local frequency. In fact, larger clusters have a lower working frequency than smaller clusters (considering the same area for the cores), which also means a larger latency for all the intra-cluster communications. For the target case study, the average interconnection latency presents a trade-off point at 9 BNoC nodes. For values smaller than 9 nodes, the increase of the working frequency is counterbalanced by the increase in the network latency, while for values larger than 9 the advantage of having smaller clusters is no longer evident.
Moreover, Table 1 outlines the best configuration among all the Pareto solutions, selected with the same decision-making criterion presented in Equation 2. The best solution is the one with 9 routers in the Backbone NoC. As can also be noted from Figure 6, this solution is characterized by a higher power contribution from the Backbone NoC and a higher area contribution from the local Xbars. Regarding the cluster organization, this solution generates clusters that are well balanced in terms of number of nodes (except for cluster 4) and able to sustain a working frequency of 500MHz in most cases. It is also possible to note the trend outlined in the discussion of Figure 6(c), where larger clusters have a lower working frequency (such as cluster 4 with 8 cores and 250MHz), while smaller clusters can sustain a higher working frequency (such as cluster 7 with 2 cores and 1000MHz).
Finally, Table 2 compares the previous ASHiNoC configuration with the best mapping found by considering a flat 5x8 mesh topology for the same TVOPD application. Comparing the results, the ASHiNoC architecture outperforms the flat mesh topology, as expected, both in terms of power and latency (-49% and -33%), thanks to the majority of communications occurring at the LXBar level. On the other hand, the ASHiNoC architecture has a worse area value (+78%), because the pure NoC philosophy of the flat mesh architecture relies on shorter links shared by different cores, unlike the ASHiNoC architecture.
CONCLUSIONS
In this paper, we have introduced a methodology to fully exploit and support the design of Application-Specific Hierarchical NoCs. The proposed methodology is able to explore the large number of alternative solutions for a hierarchical NoC, taking into consideration not only the communication rates among the cores but also the core sizes and the interconnect parameters. The methodology uses floorplanning information to estimate synthesis and performance results for each cluster and for the backbone NoC. We have presented experimental results for the TVOPD benchmark, where we analyzed the possible hierarchical topologies by varying the number of nodes and the cluster sizes of the NoC. The experimental results show how the proposed methodology makes it possible to quickly compare hierarchical topology solutions considering the area/power/performance trade-offs. Finally, the comparison with a flat mesh highlighted the advantage, especially in terms of power, that such a solution can reach.
