Abstract-This paper presents Nessie [1], an original multi-criteria exploration tool that supports decisions on functional and non-functional aspects of SoCs, based on application and platform descriptions and on costs (energy, execution time, area, ...). The performance prediction tools found in the literature, which aim to help designers make decisions and find efficient solutions, have limitations that we wanted to overcome. To validate our tool, we have chosen a real case study. In particular, we present an example based on a 3MF MPSoC platform executing real-time AVC/H.264 video decoding, extended to 3D-stacked architectures. The results given by Nessie, based on the average wire power consumption, confirm the usefulness of such a tool in helping designers make decisions, especially in huge design spaces such as those of embedded SoCs.
I. INTRODUCTION
In today's highly constrained embedded electronics world, the increasing complexity of applications and platforms (heterogeneity, IP and communication infrastructure possibilities) makes it ever harder for designers to take rapid and optimal decisions that satisfy both the specification and the time-to-market constraints. In typical industrial top-down design flows, automatic tools generally appear at low abstraction levels and often explore a single aspect of the system (functionality or architecture) while taking into account a single type of constraint (cycles, energy). Decisions made at higher abstraction levels are mostly left to experts, who make intuitive assumptions about the design and its final performance and restrict the exploration space in order to make first decisions quickly, sometimes without information about the other side of the system (functionality or platform). This inevitably leads to non-optimal solutions and long design times. To predict performance from the early stages of the flow, several tools have been developed to guide designers. The state of the art in the domain, presented in the next section, has led us to develop our own original multi-criteria high-level exploration tool, Nessie, which is more flexible and general. The validation of this tool constitutes the second phase of the work and consists in simulating a real case to assess whether a given design space can be explored and relevant solutions generated for the user in a short time. In this paper, we thus present a case study based on a real-time AVC/H.264 video decoding application running on a 3MF MPSoC. While three functionality mapping scenarios with three data resolutions were considered, we added an extra degree of freedom by placing the hardware components on 3D-stacked architectures.
For different functionality/platform possibilities, we have estimated the average power consumption gain that can be achieved, in particular on the Network-on-Chip (NoC), the communication infrastructure. The results obtained demonstrate the value of such a tool.
The remainder of the paper is structured as follows. Section II presents a brief overview of existing performance estimation tools. Section III describes the tool we have developed. Section IV shows the use of the framework on a real case study and the results obtained. Finally, conclusions are drawn in Section V.
II. RELATED WORK
We have classified high-level estimation tools into two categories. Among all the tools we have found, we present and compare the most relevant ones on a few chosen criteria (description language, hierarchical modelling, multi-objective and automatic exploration capabilities).
Mapping-based tools are considered first. The SESAME [2] system-level modeling environment takes the functionality and architecture descriptions (in SystemC) at transaction level and enables their refinement, but it is dedicated to multiprocessor platforms. Furthermore, its mapping does not perform automatic exploration among functionality/architecture possibilities, focuses on timing aspects and does not support a synthesis phase. Koski [3] and Design Trotter [4] both enable the automatic exploration of multiprocessor-like platforms, without refinement or synthesis possibilities. The former unifies the description of both the functionality and the platform with UML but restricts it to the highest level of abstraction. The latter represents the functionality with HCDFG and the platform with XML files, and takes technological parameters into account, enabling multi-objective modelling. In contrast, Metropolis [5], which uses metamodels to describe the functionality, the platform and a multi-objective mapping policy, enables refinement across different abstraction levels and offers synthesis capability, but without automatic exploration of the design space. Synthesis is also possible in the Cofluent [6] environment and the Chinook [7] tool, both limited to one abstraction level. The former uses SystemC to describe the functionality/platform and automatically explores different multiprocessor/multicore/multitask architectures by estimating timing criteria. The latter targets FPGAs and also enables compilation for processors. It is based on a Verilog description of the functionality/architecture.
Besides these tools, tools based on closed-form expressions were developed some time ago. The first generation of performance prediction tools essentially focused on clock cycles for (multi-)processor platforms (e.g. the Codrescu model [8]). GENESYS [9] then introduced modeling hierarchy. BACPAC [10] went further by enabling multiple-objective modeling (timing-related metrics, power consumption, silicon area, etc.). The most complete and federative one, GTX [11], aimed at providing users with an environment able to incorporate existing models, avoiding redundant development of modeling tools. Further implementation of that tool has been abandoned, but we have identified its limitations to guide future development. Large simulation campaigns are tedious, since the building, loading and execution of successive models have to be performed each time by the user, and the framework does not implement a rigorous grammar for the definition of these models. Furthermore, GTX cannot be interfaced with other tools, as it is GUI-based.
The limitations of the reviewed tools have led us to propose a performance-centric tool that integrates the interesting properties of existing high-level simulation tools (automatic and multi-objective exploration, hierarchical description, flexible models, customisable mapping policy), enables explicit mapping and supports easy-to-build closed-form models. This framework, called Nessie, is presented in the following section.
III. A MULTI-CRITERIA ESTIMATION TOOL
Nessie is a more general and flexible multi-criteria performance simulation tool. The framework, shown in Figure 1, has been built with C++ classes and is interfaced entirely by means of XML files. The C++ mapping core is based on a hierarchical description of both the functionality and the architecture at different abstraction levels.
To feed the mapping core, the user fills in an XML simulation file describing the applications (SW structures), the platforms (HW structures) and the input parameters to explore. To describe the application, we have chosen Petri nets as the Model of Computation (MoC). Thanks to the modularity of the framework, however, the user could easily add other MoCs if necessary. Petri nets enable the representation of data dependency, concurrency and parallelism. Places represent tasks, while transitions control their dependencies. The platform is described as a netlist built from three types of blocks: computation, memory and interconnection. All these blocks have input and output ports. Different states can also be considered: sleeping, memorizing, transmitting, idle. For each computational block, the user has to define cost models associated with the compatible executable tasks and based on the input parameters. These models are entered in separate XML files and are evaluated by the YETI model estimation tool [12], which communicates with the Nessie core. YETI (which can also be used as a stand-alone tool) has been developed to enable flexible evaluation of reusable, easy-to-build closed-form expressions and output sensitivity analysis.
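The Petri-net view described above (places as tasks, transitions controlling their dependencies) can be illustrated with a minimal Python sketch. It is not Nessie's data model or XML schema: the `PetriNet` class, its methods and the task names are hypothetical and only show how a fork/join dependency expresses concurrency between two tasks.

```python
from dataclasses import dataclass, field


@dataclass
class PetriNet:
    """Toy Petri net: places hold tokens, transitions move them (illustrative only)."""
    marking: dict = field(default_factory=dict)      # place name -> token count
    transitions: dict = field(default_factory=dict)  # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (tuple(inputs), tuple(outputs))
        # Make sure every referenced place exists in the marking.
        for place in (*inputs, *outputs):
            self.marking.setdefault(place, 0)

    def enabled(self):
        """Names of transitions whose input places all hold at least one token."""
        return sorted(
            name for name, (ins, _) in self.transitions.items()
            if all(self.marking[p] > 0 for p in ins)
        )

    def fire(self, name):
        """Consume one token per input place and produce one per output place."""
        ins, outs = self.transitions[name]
        if any(self.marking[p] == 0 for p in ins):
            raise ValueError(f"transition {name!r} is not enabled")
        for p in ins:
            self.marking[p] -= 1
        for p in outs:
            self.marking[p] += 1


# Hypothetical tasks: after "parse", "predict" and "transform" may run concurrently,
# and "reconstruct" can only start once both have completed.
net = PetriNet(marking={"parse": 1})
net.add_transition("fork", ["parse"], ["predict", "transform"])
net.add_transition("join", ["predict", "transform"], ["reconstruct"])
```

Firing `fork` puts a token in both `predict` and `transform`, which is how the net expresses that the two tasks are independent; `join` only becomes enabled once both places are marked.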
The mapping policy is based on a dynamic event-based engine in which scheduling, allocation and routing are performed in three phases. Allocation and routing are based on user-defined rules (allocationWeight.xml and routingWeight.xml files). Among the platform blocks compatible with the functional block elected for the current execution, the one that minimizes the cost (e.g. energy, area) is chosen, and the minimum-weight route is found using Dijkstra's algorithm.
The outputs of Nessie are the performance criteria of interest to the user for each simulated functionality/architecture couple; the user can define constraints on these criteria to filter the simulated points. The XML output files generated by the tool thus contain the performance results requested by the user, the activity of the different components of the simulated architectures (relative percentage of each of their states) and a timeline showing the behavior of the mapping core for each functionality/architecture couple.
Nessie is thus particularly interesting when addressing embedded systems-on-chip, where the design is constrained by multiple criteria (high performance, low power dissipation, low cost, ...) across several degrees of freedom (interconnection, memory, number and type of computation units, ...). The validation of our framework on such a case study constitutes the second phase of the research.
The next section presents the chosen industrial design case study. Through this example, the goal is to show the ability of our framework to model a complex system, explore different aspects of a real problem (functionality, platform, models, ...) and give the user interesting and possibly non-intuitive solutions quickly and in a single simulation.
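The allocation and routing steps can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Nessie's C++ implementation: the function names, the dictionary-based graph and the cost table are hypothetical, and the weights that Nessie reads from allocationWeight.xml and routingWeight.xml are replaced by plain numbers.

```python
import heapq


def allocate(task, blocks, cost):
    """Pick the compatible platform block with the lowest cost for `task`.

    `cost` maps (task, block) pairs to a scalar cost; a missing pair means
    the block cannot execute the task (hypothetical convention).
    """
    candidates = [b for b in blocks if (task, b) in cost]
    if not candidates:
        raise ValueError(f"no compatible block for task {task!r}")
    return min(candidates, key=lambda b: cost[(task, b)])


def min_weight_route(graph, src, dst):
    """Dijkstra's algorithm on `graph` (node -> {neighbour: non-negative weight}).

    Returns (total weight, list of nodes from src to dst), or (inf, []) if
    dst is unreachable.
    """
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale queue entry
        visited.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]


# Hypothetical interconnect: a processor A reaches a memory M through routers R0..R2.
example_graph = {
    "A": {"R0": 1},
    "R0": {"R1": 4, "R2": 1},
    "R2": {"R1": 1},
    "R1": {"M": 1},
}
```

In this example the route through `R2` (total weight 4) is preferred over the direct `R0` to `R1` link (total weight 6), which is the kind of choice a user-defined routing weight would steer.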
IV. A REAL CASE STUDY
To demonstrate the value of Nessie, we have modelled a case study based on the work published in [13]: a 3MF MPSoC system running a video decoding application (AVC/H.264, SVC standards). In such a system, designers are notably interested in reducing the power consumption as much as possible. In particular, that paper deals with reducing the power consumption of the NoC used as communication infrastructure. In this context, the 3D-stacking paradigm offers a way to decrease the interconnection length, and hence the delays and the communication power consumption. This emerging technology consists in placing the platform components on several layers that communicate vertically, instead of using a classical 2D layout with a horizontal communication infrastructure. Although 3D-stacked architectures are interesting to explore, they considerably enlarge the design space in which a solution meeting the desired performance must be found. The design of such a system therefore requires a performance estimation tool at the early stages of the flow. The remainder of this section describes the case study in more detail, then presents its integration into our framework and the simulation results.
A. Description of the system
The reference paper proposes only one 2D layout for the 3MF MPSoC platform, composed of six ADRES processors as computational nodes [14], two instruction memories (L2Is1 and L2Is2), two data memories (L2D1 and L2D2), one FIFO memory (buffer for the data streams), one external memory interface (EMIF) and one ARM processor for control. All these nodes are interconnected by Arteris NoCs: one for the datapath (with a 2x2 mesh topology) and one for the instructions (one router with 6 inputs and 2 outputs).
The authors considered three different mapping scenarios for the functionality:
• data split, where video streams are divided into six equal parts processed by the six ADRES processors. This scenario puts the stress on the instruction NoC.
• functional split, which distributes the different operations over the six ADRES processors. Compared to the data split, the functional split is friendlier to the instruction NoC but doubles the amount of data flowing on the data NoC.
• hybrid, where the heaviest computational task is mapped onto three ADRES processors using data split. The remaining tasks are mapped onto the three others.
Independently, three data resolutions are considered: HDTV, 4CIF and CIF, with the ratio of transferred instructions to data increasing as the resolution decreases. The architecture topology, the functional block diagram and the associated data sizes can be found in the reference paper.
B. Experimental setup
As a first validation of Nessie, we were particularly interested in the estimation of the total wire power consumption for different 3D-architecture possibilities.
To feed our tool, we have thus defined nine different Petri nets (three application scenarios for each of the three resolutions) in the same simulation file. Places are the different functions executed on the processors, or memorizing operations. For the platform, we have defined ten variants of 3D stacks, on 2 or 3 layers. The placement of the hardware components on layer 2 or 3 is as follows (all other components are placed on the first layer):
(1) L2I1 and L2I2 on layer 2;
(2) L2I1 on layer 2, L2I2 on layer 3;
(3) L2I1, L2I2, L2D1, L2D2 and EMIF on layer 2;
(4) FIFO and EMIF on layer 2;
(5) L2I1, L2I2, L2D1, L2D2 and EMIF on layer 2, ADRES1 and ADRES4 on layer 3;
(6) ADRES1, ADRES2, ADRES3 and FIFO on layer 2, L2I1 and L2I2 on layer 3;
(7) L2I1 and L2D1 on layer 2, L2I2 and L2D2 on layer 3;
(8) L2I1, ADRES1, ADRES2 and ADRES3 on layer 2, L2I2, ADRES4, ADRES5 and ADRES6 on layer 3;
(9) switch of the instruction NoC, EMIF, L2D1 and L2D2 on layer 2;
(10) ADRES1, ADRES4 and L2D1 on layer 2, ADRES3, ADRES6 and L2D2 on layer 3.
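To make the size of this design space concrete, the ten placements listed above can be written down as plain data and crossed with the scenarios and resolutions. The sketch below is only an illustration in Python: Nessie takes this information from XML files whose schema is not shown here, the identifiers (`VARIANTS`, `INSTR_NOC_SWITCH`, the scenario labels) are hypothetical, and "ADRE5" in the source is read as ADRES5. The 2D reference layout is not included in the enumeration.

```python
from itertools import product

SCENARIOS = ("data_split", "functional_split", "hybrid")
RESOLUTIONS = ("HDTV", "4CIF", "CIF")

# Variant number -> {component: layer}. Components not listed stay on layer 1.
VARIANTS = {
    1: {"L2I1": 2, "L2I2": 2},
    2: {"L2I1": 2, "L2I2": 3},
    3: {"L2I1": 2, "L2I2": 2, "L2D1": 2, "L2D2": 2, "EMIF": 2},
    4: {"FIFO": 2, "EMIF": 2},
    5: {"L2I1": 2, "L2I2": 2, "L2D1": 2, "L2D2": 2, "EMIF": 2,
        "ADRES1": 3, "ADRES4": 3},
    6: {"ADRES1": 2, "ADRES2": 2, "ADRES3": 2, "FIFO": 2,
        "L2I1": 3, "L2I2": 3},
    7: {"L2I1": 2, "L2D1": 2, "L2I2": 3, "L2D2": 3},
    8: {"L2I1": 2, "ADRES1": 2, "ADRES2": 2, "ADRES3": 2,
        "L2I2": 3, "ADRES4": 3, "ADRES5": 3, "ADRES6": 3},
    9: {"INSTR_NOC_SWITCH": 2, "EMIF": 2, "L2D1": 2, "L2D2": 2},
    10: {"ADRES1": 2, "ADRES4": 2, "L2D1": 2,
         "ADRES3": 3, "ADRES6": 3, "L2D2": 3},
}


def layer_of(variant, component):
    """Layer on which `component` sits in the given variant (default: layer 1)."""
    return VARIANTS[variant].get(component, 1)


def num_layers(variant):
    """Number of layers used by the variant (2 or 3)."""
    return max(VARIANTS[variant].values())


def simulation_points():
    """All (scenario, resolution, variant) combinations of the 3D exploration."""
    return list(product(SCENARIOS, RESOLUTIONS, sorted(VARIANTS)))
```

With three scenarios, three resolutions and ten stacks, `simulation_points()` yields 90 functionality/platform couples.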
Finally, to estimate the wire power consumption, we have used an analytical model giving the link dynamic power consumption $P_l$ as the product of the segment power consumption $P_s$ and the length of the link $l$ (see eq. 1):

$$P_l = P_s \cdot l \qquad (1)$$

$P_s$ depends on the following parameters: the number of wires $w$, the NoC frequency $f_{NoC}$, the capacitance per segment unit $C$, the activity of the wire $A$, the supply voltage $V_{dd}$ and the average toggle probability of a wire $p_{toggle}$.
The parameters of the equation have been extracted from the design flow and the information presented in the reference paper. Likewise, we chose a vertical distance of 5 µm between layers (consistent with current technology), so that reasonable link power savings could be expected from moving the initial 48.66 mm² die to a 3D-stacked architecture.
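The link power model can be sketched in Python as follows. Only the relation "link power = segment power times link length" comes from the text; the expression used for the segment power is an assumption (a plain product of the listed parameters, with the supply voltage squared as in the usual CMOS dynamic-power form), since its exact form is not reproduced here. Function names are hypothetical.

```python
def segment_power(w, f_noc, c, a, v_dd, p_toggle):
    """Dynamic power per unit of link length.

    ASSUMPTION: product of the number of wires, NoC frequency, capacitance per
    unit length, wire activity, squared supply voltage and toggle probability.
    The source lists these parameters but not how they are combined.
    """
    return w * f_noc * c * a * v_dd ** 2 * p_toggle


def link_power(p_s, length):
    """Link dynamic power: segment power multiplied by the link length."""
    return p_s * length


def relative_saving(length_2d, length_3d):
    """Fraction of link power saved when a link shrinks from length_2d to length_3d.

    Because link power is linear in the length, the segment power cancels out.
    """
    return 1.0 - length_3d / length_2d
```

Since the model is linear in the link length, any layout change that halves a link's length halves its dynamic power regardless of the segment parameters, which is why shortening heavily used links through 3D stacking is attractive.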
C. Results and discussion
We first estimated the NoC power consumption of the AVC application running on the 2D MPSoC, based on values taken from the reference paper. Figure 2 shows the original results, those estimated by Nessie and the relative difference between them for the three application scenarios and resolutions. As can be seen, the power consumption estimated by Nessie is very close to the original values: the mean absolute error is 0.77% when averaged over all the experiments.
Secondly, the wire power consumption for the ten platform variants, for each functionality scenario and resolution, has been estimated by our mapping tool. Results for the HDTV and CIF resolutions are presented in Figures 3 and 4. Both figures show that interesting power reductions can be achieved by switching from a 2D layout (first column) to a well-chosen 3D layout. We have calculated that the average gain ranges from 32.8% to 55.3%. We can also observe that, since the instruction/data ratio to transmit increases as the resolution decreases, a larger relative gain is obtained for the CIF resolution by shortening the instruction path rather than the data path. Variants 4, 9 and 10 do not really benefit from this technology, as the traffic from/to the EMIF and FIFO represents only 5% of the total data bandwidth. Finally, we see that for HDTV (and also for 4CIF, which is not shown here), the hybrid scenario is the most power-efficient, while the functional split beats the hybrid for the CIF resolution. Although this result is not intuitive at first sight, it can be explained: the functional split reduces the instruction bandwidth, which becomes more critical as the resolution decreases (the instruction/data ratio increases). It is important to note that these results cannot be obtained without a tool such as Nessie, which enables a global design space exploration.
These results show the ability of our tool to generate correct and sometimes non-intuitive solutions rapidly (a few minutes for the simulation). The charts enable the designer to directly reject uninteresting solutions and to adapt their choices to the resolution of the application.
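For reference, an error figure of this kind can be computed as in the Python sketch below, assuming that it is the mean of the absolute relative differences between the reference and estimated values over all experiments. The function name is hypothetical, and the numbers in the usage note are placeholders, not the measurements of Figure 2.

```python
def mean_absolute_relative_error(reference, estimated):
    """Mean of |estimated - reference| / |reference| over paired experiments."""
    if not reference or len(reference) != len(estimated):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(
        abs(e - r) / abs(r) for r, e in zip(reference, estimated)
    ) / len(reference)
```

For instance, with placeholder values `reference = [100.0, 200.0]` and `estimated = [101.0, 198.0]`, each experiment is off by 1%, so the function returns a mean error of 1%.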
V. CONCLUSIONS
By quickly estimating different trade-offs and efficient solutions before running the real design flow, Nessie makes it possible to reduce development cost by decreasing design time. Different functionality and platform structures can be explored in one run to find a satisfactory solution for different constraints, making it possible to size a scalable system at high-level design time. The validation of the tool on a real and complex case study has shown the value of this methodology and the ability of the tool to give relevant solutions. Future work will consist in integrating other case studies to further test the generality and flexibility of the framework.
