Abstract-The speed-up estimation of parallelized code is crucial to efficiently compare different parallelization techniques or task graph transformations. Unfortunately, most of the time, during the parallelization of a specification, the information that can be extracted by profiling the corresponding sequential code (e.g., the most executed paths) is not properly taken into account. In particular, correlating sequential path profiling with the corresponding parallelized code can help in the identification of code hot spots, opening new possibilities for automatic parallelization. For this reason, starting from a well-known profiling technique, the Efficient Path Profiling, we propose a methodology that estimates the speed-up of a parallelized specification using only the corresponding hierarchical task graph representation and the information coming from the dynamic profiling of the initial sequential specification. Experimental results show that the proposed solution outperforms existing approaches.
I. INTRODUCTION
Today, Multiprocessor Systems-on-Chip (MPSoCs) are the de facto standard for embedded system design [1]. Normally, to design these systems, the programmer iteratively divides the application into tasks, describing the parallelism with proper annotations (e.g., OpenMP [2]), analyzes the resulting performance and, eventually, transforms the code until it meets the requirements. To shorten this process, static analysis, if accurate, is usually preferred to dynamic execution of the parallel code on the target platform.
Thus, performance analysis [3] is a key step in the design of multiprocessor systems, and it is fundamental for embedded architectures, where program performance, memory occupation and code compactness are critical aspects. Profiling is one of the best known and most studied techniques for performance analysis, used for hand-tuning of programs or for various smart compilation techniques. In fact, common compilers implement control flow profiles that, through code instrumentation or statistical sampling of the program counter, count how many times basic blocks (i.e., portions of code without branches), branch transitions and paths (sequences of branch transitions) are executed. Among these approaches, Efficient Path Profiling [4] (EPP) is a well-known technique to gather important information on code hot spots and load balancing for sequential specifications. Unfortunately, it fails when applied to parallel code, where the profiling information is usually exploited only to estimate the performance of the single tasks, which are then composed to obtain the best, average or worst performance estimation of the whole task graph, without considering the correlations among the different parts of the code that have been parallelized.
In this paper we propose a solution that estimates the performance of a parallelized specification, trying to overcome the limits of previous approaches. In particular, to identify the correlations among the different tasks, we extend the profiling on the related sequential code with the Hierarchical Path Profiling (HPP). HPP has a natural correspondence with the structure of typical parallel embedded applications, which are naturally described with cycles and represented through Hierarchical Task Graphs (HTGs) [5]. In fact, in HTGs, a vertex may itself be associated with another HTG in a hierarchical way, which makes HTGs more expressive than Directed Acyclic Graphs (DAGs), where feedback edges are not allowed. Furthermore, we propose a methodology that efficiently represents the paths and is able to accurately estimate the speed-up of a parallel application, by identifying what actually contributes to the overall execution time.
The main contributions of this paper can be summarized as follows:
-it extends the EPP [4] with a new solution, which better identifies the correlations among basic blocks, in particular when cycles are involved;
-it applies this approach to cyclic task graphs with a more compact and efficient representation of the information;
-it proposes a methodology for the speed-up estimation of parallel code starting from the information gathered from the related sequential one.
We applied this methodology to different parallel applications at different optimization levels. The results demonstrate that we are able to estimate the speed-up introduced by the parallelization much better than common techniques for performance analysis, including EPP.
The remainder of this paper is organized as follows. Section II is about the related work. Section III gives some preliminary definitions while Section IV presents the motivation of this work. The proposed methodology is detailed in Section V and the experimental results are presented in Section VI. Finally, Section VII concludes the paper.
II. RELATED WORK
One of the main purposes of profiling is the identification of the most executed paths inside a program, on which the optimization algorithms will focus. There are several classes of profiling techniques which provide this type of information. Among them, we find edge profiling [6] and whole program profiling [7]. Edge profiling is a simple technique, but not necessarily cheap in terms of execution overhead and instrumentation code size, which only aims at recording how many times each branch transition occurs. From this information the path executions can only be approximately estimated. On the other hand, whole program profiling usually gives exact information about path execution, but with a larger execution overhead. Path profiling, which counts sequences of edges and basic blocks, is a trade-off between these two techniques. One of the most important works on path profiling is the Efficient Path Profiling [4], which will be detailed in Section III-B. This basic algorithm has been extended by different authors to support inter-procedural paths [8] and inter-iteration paths [9].
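The loss of information incurred by edge profiling can be illustrated with a small sketch. The CFG below (two consecutive if-then-else diamonds with blocks named A to G) and the two runs are purely hypothetical and are not taken from the paper's example; the helper `edge_profile` is likewise an illustrative name.

```python
from collections import Counter

def edge_profile(path_counts):
    """Collapse a path profile into per-edge execution counts."""
    edges = Counter()
    for path, freq in path_counts.items():
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += freq
    return edges

# Hypothetical path profiles: in run_a the two branches take the same
# side, in run_b they always take opposite sides.
run_a = {("A", "B", "D", "E", "G"): 5, ("A", "C", "D", "F", "G"): 5}
run_b = {("A", "B", "D", "F", "G"): 5, ("A", "C", "D", "E", "G"): 5}

# The path profiles differ, yet the edge profiles are identical.
same_edges = edge_profile(run_a) == edge_profile(run_b)
print(same_edges)
```

Under these assumptions the two runs are indistinguishable for an edge profiler, whereas a path profiler separates them.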
In [10] performance estimation for real-time embedded systems is discussed. This work considers best and worst case execution exploiting the concept of path-based analysis, but without leveraging the effectiveness of path decomposition proposed by [4] . Furthermore, it mainly considers the estimation of sequential applications. Several static timing analysis techniques, targeting the estimation of performance for embedded systems, are described in [11] . The target architecture considered is based on a single processor, and almost all the techniques discussed have high computational complexity, since they target the verification of real-time systems with hard or soft constraints. Furthermore, the presented average-case performance estimation techniques are limited by the number of paths generated, since they do not exploit any techniques for path decomposition.
Bammi et al. [12] propose a technique to estimate the performance of embedded applications without the need for a cycle-accurate processor model. Instrumented source code, annotated with timing information, is generated by analyzing the object code, and then compiled and executed on the host to get an estimation of its performance on the target architecture. This technique allows performance estimates to be obtained much faster than solutions based on Instruction Set Simulators (ISSs) [13], [14], [15], which cannot be easily exploited for the (fast) trade-off analysis required by an optimizing compiler. We adopt a similar approach to profile the code, but instead of analyzing the object code, we start from GIMPLE [16]. Fine-grain instrumentation is also used in [18] to obtain accurate execution time and memory statistics. Similarly to [12], it is faster than ISS-based techniques, but still too slow with respect to the performance analysis tools used for the exploration of parallelization. Our estimation technique, instead, statically and efficiently estimates the task graph performance by exploiting the path profiling information. Thus, such performance analysis tools could obtain better results by including this methodology.
In [19] , synchronization operations are speculatively anticipated if they are on the most executed paths. In this case, path profiling information has been used to optimize communication between the threads, rather than performing estimation of parallelized specifications. It is straightforward to extend our methodology to also consider communication.
Profiles can also be used to estimate trip counts of loops [20] . Common loop-oriented optimization techniques may have benefits from a proper estimation of this number. Real time constraints analysis also benefits from trip counts estimations [21] . Considering paths, our solution is also able to compute this kind of information.
III. PRELIMINARIES
In this section we introduce the basic elements to understand our estimation procedure for parallelized code. We describe the intermediate representations used by our methodology, we briefly present path profiling and, finally, we discuss the model of concurrency that has been adopted.
A. Intermediate representation
The proposed methodology works on the following intermediate representations, widely used in compilers:
-the Control Flow Graph (CFG) [22], a directed graph $G_{CFG} = (N, E_{CFG})$ which is an abstract representation of the paths (sequences of branches) that might be traversed during the execution of a function;
-the Control Dependence Graph (CDG) [23], a directed graph $G_{CDG} = (N, E_{CDG})$ representing the control dependences of basic blocks, that is, whether a basic block can control whether or not another basic block will be executed;
-the Control Dependence Regions (CDR) [23], a partitioning of the basic blocks into equivalence classes; two basic blocks are in the same region if they have the same set of control dependences in the CDG;
-the Loop Forest [24], a representation of the hierarchy of the loops contained in the CFG;
where $N$ represents the basic blocks contained in the initial specification. The function $\gamma$, with $C_i = \gamma(BB_j)$, returns the identifier of the Control Dependence Region $C_i$ associated with the basic block $BB_j$.
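As an illustration of how control dependence regions partition the basic blocks, the following minimal sketch groups blocks whose control-dependence sets coincide. The input encoding (a set of `(controlling block, branch label)` pairs per block) and the function name are assumptions made for this example; the dependence sets only mirror the fragment described for blocks 1 to 4 and are not the complete CDG of the example.

```python
from collections import defaultdict

def control_dependence_regions(control_deps):
    """Map each basic block to a region id; blocks with the same
    control-dependence set share the same region (the role of gamma)."""
    groups = defaultdict(list)
    for bb, deps in control_deps.items():
        groups[frozenset(deps)].append(bb)
    gamma = {}
    for region_id, members in enumerate(groups.values()):
        for bb in members:
            gamma[bb] = region_id
    return gamma

# Assumed fragment: BB2 runs when the condition in BB1 is true, BB3 when
# it is false, BB4 does not depend on BB1, BB2 or BB3.
control_deps = {
    1: {("Entry", "T")},
    2: {(1, "T")},
    3: {(1, "F")},
    4: {("Entry", "T")},
}
gamma = control_dependence_regions(control_deps)
print(gamma)
```

In this sketch blocks 1 and 4 land in the same region, while blocks 2 and 3 each get a region of their own.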
Given the example of Fig. 1, its CFG is represented in Fig. 2. Its CDG and the control dependence region to which each basic block belongs are instead shown in Fig. 3, where, for example, $e_{1,2}$ represents that $BB_2$ is executed iff $BB_1$ has been executed and the value of its condition was true. On the other hand, the operations in $BB_4$ have no control dependences with $BB_1$, $BB_2$ and $BB_3$, and they can be executed in parallel if the data dependences are respected. Only one reducible loop, with $BB_5$ as header, is shown in the example. For the sake of simplicity, in the rest of the paper a loop will be identified with its header number (i.e., $L_5$ is the loop whose header is $BB_5$).
B. Path Profiling
Before defining Path Profiling, we need to introduce the concept of path. Let $G_{CFG} = (N, E_{CFG})$ be a CFG. The path $P_p$ is defined as the sequence:
$$P_p = \{BB_i, BB_{i+1}, \ldots, BB_j\}$$
where $BB_i \in N$ and each pair of basic blocks $BB_i$, $BB_{i+1}$ has the corresponding edge $e_{i,i+1} \in E_{CFG}$. Note that two basic blocks contiguous in a path are also contiguous in the execution trace from which the path is extracted. As described above, since the CFG represents all the paths that might be traversed during a program execution, it is possible to count the frequency of each path with an appropriate profiling of this representation. This technique is usually called path profiling.
Ball and Larus [4] proposed an algorithm to efficiently profile the execution frequency of paths in CFGs. This algorithm is known as Efficient Path Profiling (EPP) and it uses the concept of state to model the valid paths (i.e., paths which are counted). These paths are only the ones that connect Entry to Exit. CFGs with loops are managed by substituting each back-edge $e_{j,k}$ with two new edges, connecting basic block Entry with $BB_k$ and basic block $BB_j$ with Exit. The graph so obtained is named Path Graph (PG). Figure 2 shows the CFG associated with the example and the related Path Graph.

Fig. 3. The Control Dependence Graph of the function fun_0.
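The back-edge substitution that turns a CFG into a Path Graph can be sketched as follows. This is a minimal illustration, assuming the CFG is given as a list of edges and that the back-edges are already known (e.g., from the loop forest); the function name and the small CFG are hypothetical.

```python
def build_path_graph(cfg_edges, back_edges, entry="Entry", exit_="Exit"):
    """Return the edge list of the Path Graph derived from a CFG edge list.

    Each back-edge (j, k) is removed and replaced by Entry -> k and j -> Exit.
    """
    back = set(back_edges)
    pg_edges = [e for e in cfg_edges if e not in back]
    for src, dst in back_edges:
        for new_edge in ((entry, dst), (src, exit_)):
            if new_edge not in pg_edges:
                pg_edges.append(new_edge)
    return pg_edges

# Hypothetical CFG with one loop: header 1, body 2, back-edge 2 -> 1.
cfg_edges = [("Entry", 0), (0, 1), (1, 2), (2, 1), (1, 3), (3, "Exit")]
pg_edges = build_path_graph(cfg_edges, back_edges=[(2, 1)])
print(pg_edges)
```

The resulting graph is acyclic, so every valid path connects Entry to Exit, which is the property the profiling relies on.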
C. Model of Execution
In this work, we target embedded Multiprocessor Systems-on-Chip (MPSoCs) composed of different processing elements that communicate through a shared memory. We adopt explicit fork and join operations as the model of concurrency. This programming model requires that each task spawning threads (called fork task) has a corresponding join task, which can be executed only after all the created threads have completed their execution. This concurrency model is well supported by the OpenMP [2] standard and the corresponding programs can run on such shared memory MPSoCs with a minimal operating system layer. Architecture properties are important for the correct performance evaluation of a specification as well as for path profiling. We take into account the architecture properties during the mapping of the GIMPLE nodes (which are language and processor independent) to the target assembler statements. Following an approach similar to the ones proposed in [12], [18], [25], we use an analytical model that, given the list of assembler statements associated with a GIMPLE node, is able to return an estimation of the number of cycles required by the target processors. We know that this estimation introduces a degree of error. However, we can accept this kind of approximation since the accuracy of such operation estimation is usually high, as shown in the literature, and we are focusing on a fast estimation technique to be used in task optimization.
Similarly to [5], we adopt the Hierarchical Task Graph (HTG) as the intermediate representation of a parallel program. In particular, the HTG is a directed graph whose vertices can be: simple (i.e., a task with no sub-tasks), compound (i.e., a task that consists of other tasks in an HTG, for example higher level structures such as subroutines), or loop (i.e., a task that represents a loop whose iteration body is an HTG itself). The hierarchical task graph can be extracted from the control flow graph of a sequential program by identifying the edges through data and control dependence analysis. This results in an acyclic graph, where the tasks can be classified as: fork (i.e., tasks with multiple successors), join (i.e., tasks with multiple predecessors), or normal (i.e., all the remaining tasks).
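The fork/join/normal classification of the tasks at one level of the hierarchy can be sketched as below. The task names and the helper are hypothetical; the sketch also assumes that a task with both multiple successors and multiple predecessors is reported as a fork, a case the text does not discuss.

```python
def classify_tasks(tasks, edges):
    """Classify tasks as fork (several successors), join (several
    predecessors) or normal, given the edges of one HTG level."""
    succ = {t: 0 for t in tasks}
    pred = {t: 0 for t in tasks}
    for u, v in edges:
        succ[u] += 1
        pred[v] += 1
    kinds = {}
    for t in tasks:
        if succ[t] > 1:
            kinds[t] = "fork"      # assumption: fork takes precedence
        elif pred[t] > 1:
            kinds[t] = "join"
        else:
            kinds[t] = "normal"
    return kinds

# Hypothetical task graph: T0 spawns T1 and T2, which are joined by T3.
tasks = ["T0", "T1", "T2", "T3"]
edges = [("T0", "T1"), ("T0", "T2"), ("T1", "T3"), ("T2", "T3")]
kinds = classify_tasks(tasks, edges)
print(kinds)
```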
IV. MOTIVATION
Given the profiling of a sequential specification, it can be difficult to estimate the speed-up introduced by one of its possible parallelizations, even admitting some approximations and supposing that architectural effects (task creation/destruction/synchronization and communication) can be predicted and modeled as additive. For example, consider the function in Fig. 1, executed 10 times. The number of cycles required by each operation in the sequential specification is analyzed following an approach similar to [12], [18], [25] and an example is shown in Table I. For the sake of simplicity, in the rest of the example we assume that the execution time of the sub-functions fun_1, fun_2 and fun_3 is fixed and not data-dependent. Nevertheless, this information is not sufficient to compute a good estimation of the speed-up obtained by one of its possible parallelizations (e.g., the one shown in Fig. 4, whose corresponding task graph is shown in Fig. 5). In fact, there are several issues that should be considered when estimating how long the execution of the task graph takes. First, the number of loop iterations has to be accurately estimated, since, as in this case, it heavily affects the execution time of the task where the loops are contained (i.e., Task2). More precise information about the average number of loop iterations can be obtained using edge profiling techniques, but this information is still not sufficient to produce correct estimation results. In fact, the speed-up of the function depends on how the values of the conditions (i.e., c1 and c2) are correlated, activating or not the execution of functions fun_1 and fun_3.
In particular, let us consider the two following situations:
A) c1 and c2 always have opposite values; this means that the basic blocks executed in the same path are $BB_2$ and $BB_{12}$ or $BB_3$ and $BB_{11}$.
B) c1 and c2 always have the same value (true or false); this means that the basic blocks executed in the same path are $BB_2$ and $BB_{11}$ or $BB_3$ and $BB_{12}$.
Let us also assume that the probability of condition c1 being true is 0.50 and that Task2 has an estimated execution time of 10,520 cycles (the loop is executed 10 times and the condition c3 is always true).
In the first situation the execution of the sequential specification requires 31,130 cycles. The execution of the parallel code requires 20,580 cycles if c1 is true, 20,560 otherwise. The average parallel execution time is 20,570 cycles, so the real speed-up is $\mu_A = 1.5133$. In the second situation the execution of

Unfortunately, in these cases techniques like edge profiling or EPP are not able to detect this difference. In particular,

In conclusion, when a loop is executed for at least one iteration between two conditional statements, the EPP algorithm loses the correlations among the paths before and after the loop, giving the same results for both cases. Therefore, all the methodologies that use this information would estimate the same speed-up.
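The arithmetic behind the speed-up of situation A can be checked with a few lines. The cycle counts and the probability are the ones stated in the text; the variable names are illustrative only.

```python
# Figures for situation A, as given in the text.
seq_cycles = 31_130           # sequential execution
par_cycles_c1_true = 20_580   # parallel execution when c1 is true
par_cycles_c1_false = 20_560  # parallel execution when c1 is false
p_c1_true = 0.50              # probability of c1 being true

# Average parallel time weighted by the branch probability.
avg_parallel = (p_c1_true * par_cycles_c1_true
                + (1 - p_c1_true) * par_cycles_c1_false)
speedup_a = seq_cycles / avg_parallel
print(avg_parallel, speedup_a)
```

The average parallel time evaluates to 20,570 cycles and the ratio to about 1.513, in line with the value of $\mu_A$ reported above.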
V. PROPOSED METHODOLOGY
The proposed methodology aims at providing a static estimation of the speed-up introduced by parallelization. It can be divided into three steps. Firstly, a path profiling of the sequential specification is performed, considering the loop hierarchy. Secondly, the profiling results are organized into a more compact representation based on the control dependence regions. Thirdly, the speed-up of the parallelized code is statically estimated, efficiently combining the information obtained from this representation on the related HTG.
A. Hierarchical Path Profiling
In this section we describe the profiling technique, the Hierarchical Path Profiling (HPP), that extends the EPP and is able to maintain the correlation between what happens before and after a loop. The HPP has some analogies with Structural Path Profiling (SPP) [26] , in particular regarding the loop hierarchy. However, in HPP the paths can cross loop boundaries, while in SPP they cannot.
The HPP is applied to the PG described in Section III-B, with a different definition of valid path. In particular, a path $P_p = \{Entry, BB_i, BB_{i+1}, \ldots, BB_j, Exit\}$ is valid if it starts from Entry and ends in Exit, as in EPP [4], and $BB_i$ and $BB_j$ either are connected by a back-edge (i.e., $e_{j,i} \in E_{CFG}$) or belong only to the loop $L_0$. According to this definition, in the proposed example, the path En, 1, 2, 4, 5, 6, 7, 9, Ex is no longer valid, since there is no edge $e_{9,1}$ in the CFG and $BB_9$ belongs to $L_5$. The path En, 5, 6, 7, 9, Ex is still valid, since the edge $e_{9,5}$ is in the CFG.
Then, all the HPP paths are clustered with respect to the loop to which they belong. In particular, the path $P_p$ is said to belong to the loop $L_l$ and, thus, to the cluster $HP_l$, if $L_l$ is the innermost loop to which $BB_i$ (i.e., the first basic block besides Entry) belongs. For example, the path En, 5, 6, 7, 9, Ex is contained in $HP_5$. We also partially relax the assumption that each consecutive pair of basic blocks of a path has to be contiguous in the execution trace. In this way, differently from EPP, the path En, 1, 2, 4, 5, 10, 11, 13, Ex can effectively be extracted from the execution trace also when the loop has been executed for at least one iteration.
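The clustering rule can be sketched as follows, using the two paths quoted in the text. The `innermost_loop` map is an assumption for this example: it places blocks 5 to 9 in $L_5$ (the text only states this explicitly for $BB_5$ and $BB_9$) and treats every other block as belonging to $L_0$.

```python
from collections import defaultdict

def cluster_paths(paths, innermost_loop):
    """Group paths by the innermost loop of their first basic block
    (the block right after Entry); loop 0 stands for the whole function."""
    clusters = defaultdict(list)
    for path in paths:
        first_bb = path[1]  # path[0] is Entry
        clusters[innermost_loop.get(first_bb, 0)].append(path)
    return dict(clusters)

# Assumed loop membership: blocks 5..9 belong to L5.
innermost_loop = {5: 5, 6: 5, 7: 5, 8: 5, 9: 5}
paths = [
    ("En", 5, 6, 7, 9, "Ex"),
    ("En", 1, 2, 4, 5, 10, 11, 13, "Ex"),
]
clusters = cluster_paths(paths, innermost_loop)
print(clusters)
```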
To apply HPP we classify PG edges into four categories:
-Starting Path: the edges which connect a basic block outside a loop with a loop header ($e_{4,5}$ is the only such edge in the example);
-Ending Path: the edges which directly connect a basic block to Exit (i.e., the edges $e_{13,Ex}$ and $e_{9,Ex}$);
-Exit Loop: the edges which connect a basic block inside a loop with a basic block outside it (i.e., the only edge $e_{5,10}$);
-Normal: all the other edges.

Algorithm 1 Hierarchical Path Profiling
 1: ...
 2: ...
 3: a new path curr path = {En} starts
 4: while the function execution has not terminated do
 5:   BBnew = get currently executed basic block
 6:   if e(BBlast, BBnew) ∈ StartingPath then
 7:     current loop = Lnew
 8:     append BBnew to curr path
 9:     add curr path to idle paths
10:     curr path = {En}
11:   else if e(BBlast, Ex) ∈ EndingPath then
12:     append Ex to curr path and increment its frequency
13:     a new path p = En starts
14:   else if e(BBlast, BBnew) ∈ ExitLoop then
15:     update current loop
16:     idle path of the current loop becomes curr path
17:   end if
18:   BBnew is appended to curr path
19:   BBlast = BBnew
20: end while
The proposed algorithm operates as described in Algorithm 1. Considering the example of Figure 2 and the situation in which c1 and c2 are both true, it behaves as follows. When the function execution begins, we start a new path $p_i$ = En (line 3). The path is updated (line 18) with the executed basic blocks until we reach $BB_5$. At this point the current path is En, 1, 2, 4. Since $e_{4,5}$ is a Starting Path edge (line 6), when the execution of $BB_5$ starts (i.e., $BB_{new} = BB_5$), $p_i$ becomes idle (line 9) and a new path $p_j$ starts (line 10), also including the current basic block (line 18). At this point we have $p_j$ = En, 5, which is updated until the execution of $BB_9$. Since $e_{9,Ex}$ is an Ending Path edge (line 11), the current path $p_j = P_8$ = En, 5, 6, 7, 8, 9, Ex is then terminated (line 12), and its frequency is incremented by one. Subsequently, a new path $p_j$ starts (line 13) and the algorithm behaves as described above for all the ten iterations of the loop. At the end, when the execution of $BB_{10}$ starts (i.e., $e_{5,10} \in ExitLoop$ is traversed), $p_i$ becomes active again as $p_i$ = En, 1, 2, 4, 5, 10, and the current path is deleted (line 16). The path is updated until we reach $e_{13,Ex}$, which is an Ending Path edge (line 11), incrementing the frequency of the path $p_i = P_2$ = En, 1, 2, 4, 5, 10, 11, 13, Ex. Then, the algorithm stops, since the execution of the function has terminated.
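The walkthrough above can be reproduced with a compact trace-driven sketch. It is a simplification under stated assumptions rather than the authors' implementation: the trace is a plain list of executed block numbers, the back-edges of the CFG are tested directly (in place of the corresponding Ending Path edges of the PG), each Exit Loop edge leaves exactly one loop, and the function name and arguments are hypothetical.

```python
from collections import Counter

def hierarchical_path_profile(trace, starting, back_edges, exit_loop):
    """Count HPP paths over one execution trace of basic-block ids."""
    counts = Counter()
    idle = []        # stack of suspended outer paths, one per open loop
    curr = ["En"]
    last = None
    for bb in trace:
        edge = (last, bb)
        if edge in starting:
            # Entering a loop: park the outer path, header included.
            idle.append(curr + [bb])
            curr = ["En"]
        elif edge in back_edges:
            # An iteration ended: close the loop path, open a new one.
            counts[tuple(curr + ["Ex"])] += 1
            curr = ["En"]
        elif edge in exit_loop:
            # Leaving the loop: drop the partial loop path and
            # resume the idle outer path.
            curr = idle.pop()
        curr.append(bb)
        last = bb
    counts[tuple(curr + ["Ex"])] += 1  # the function reaches Exit
    return counts

# Trace for c1 and c2 both true, with ten iterations of the loop.
trace = [1, 2, 4] + [5, 6, 7, 8, 9] * 10 + [5, 10, 11, 13]
counts = hierarchical_path_profile(
    trace, starting={(4, 5)}, back_edges={(9, 5)}, exit_loop={(5, 10)})
for path, freq in counts.items():
    print(freq, path)
```

With this trace the sketch counts the loop path En, 5, 6, 7, 8, 9, Ex ten times and the outer path En, 1, 2, 4, 5, 10, 11, 13, Ex once, matching the walkthrough. The stack of idle paths holds one entry per open loop, so its depth follows the loop nesting level.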
The presence of idle paths is one of the most important differences between EPP and HPP. At each instant, EPP allows only one path to be live. Conversely, the number of live paths in HPP is the nesting level of the current loop. Algorithm 1, which analyzes one basic block at each iteration, is linear in the number of basic blocks executed in the trace. In Table III we show the results obtained by applying the HPP technique to the example of Fig. 1, for the two cases discussed in Section IV. Comparing them with Table II, it can be noticed that HPP is able to maintain the correlation between what happens before and after the execution of $L_5$ (i.e., between the execution of basic blocks $BB_2$ and $BB_{11}$).
B. Control Region Paths
Once the HPP profiling has been performed and the paths have been hierarchically clustered, we project the control dependence regions onto the paths. In particular, let $L_l$ be a loop; we define the Hierarchical Control Dependence region (HC) as:
$$HC_{l,i} = \{ BB_j \in L_l \mid \gamma(BB_j) = C_i \}$$
For the example shown in Fig. 1, the hierarchical control dependence regions are represented in Figure 6. In particular, the dashed lines represent the regions for the loop $L_5$ and the filled ones those for the loop $L_0$. Note that $BB_5$, being the header of $L_5$, is included both in $HC_{0,1}$ and in $HC_{5,1}$. In fact, regardless of whether the loop is executed or not, at the higher level of the hierarchy the header $BB_5$ (i.e., the test of the condition) is always executed at least once. Therefore, each region $HC_{l,i}$ contains all the basic blocks of $L_l$ that are dependent on the same value of the control condition. Thus, when this value has been evaluated, the region and all the related basic blocks will certainly be executed. As a consequence, we can represent each path (i.e., sequence of basic blocks) of the loop $L_l$ as the set of control regions to be executed, i.e., the Control Region Paths.
In particular, let $P_p \in HP_l$ be a path of loop $L_l$; the Control Region Path (CRP) $CRP_p$ associated with path $P_p$ is defined as:
$$CRP_p = \{ HC_{l,i} \mid \exists\, BB_j \in P_p : BB_j \in HC_{l,i} \}$$
Since the function $\gamma$ is surjective for each loop $L_l$ of the hierarchy, the size of the control region path $CRP_p$ is less than or equal to the size of the corresponding path $P_p$. This produces a more compact representation of the paths, without losing any information. The CRPs of the example in Fig. 1 are shown in Table III.
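The projection of a path onto its control regions can be sketched as below. The region labels in `gamma` are hypothetical placeholders chosen for this illustration (the actual regions are those of Figure 6, which are not reproduced here), and Entry and Exit are omitted from the path for brevity.

```python
def control_region_path(path, gamma):
    """Return the distinct control regions touched by a path,
    in order of first appearance."""
    crp = []
    for bb in path:
        region = gamma[bb]
        if region not in crp:
            crp.append(region)
    return crp

# Hypothetical region assignment for the blocks outside the loop body.
gamma = {1: "C0", 2: "C1", 3: "C2", 4: "C0", 5: "C0",
         10: "C0", 11: "C3", 12: "C4", 13: "C0"}
path = (1, 2, 4, 5, 10, 11, 13)
crp = control_region_path(path, gamma)
print(crp)
```

Under this assumed assignment a path of seven basic blocks collapses to three regions, which illustrates why the CRP is never longer than the path it represents.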
C. Static Task Graph Execution Time Estimation
Let $G_{CFG} = (N, E_{CFG})$ be the CFG of a sequential specification and $HTG_0 = (V, E)$ be the HTG related to one of its parallelizations. $HTG_0$ is recursively analyzed with the procedure described by Algorithm 2.
In particular, the methodology analyzes all the tasks of $HTG_l = (V_l, E_l)$ in topological order and starts by computing, for each $CRP_p$ in $HP_l$, the contribution $CC_{l,i,t}$ (line 4) given by task $v_t \in V_l$ to each region $HC_{l,i}$:
$$CC_{l,i,t} = \sum_{o_s \in v_t \,\wedge\, o_s \in HC_{l,i}} c_s \qquad (2)$$
where $c_s$ is the number of cycles required by the operation $o_s \in v_t$. Then, the contribution $PC_{p,t}$ (line 19) given by task $v_t$ to the path $P_p$ is computed as:
$$PC_{p,t} = \sum_{HC_{l,i} \in CRP_p} CC_{l,i,t} \qquad (3)$$
i.e., the sum of all the contributions of the regions that belong to $P_p$. Note that if $HC_{l,i}$ contains the header of a loop $L_n$ completely contained in the task $v_t$ (lines 5-13), Equation 3 is also applied to the CRPs of the loop $L_n$ and the execution time $LC_n$ (line 12) associated with the loop $L_n$ can be estimated as:
$$LC_n = N_n \cdot \frac{\sum_{P_q \in HP_n} f_q \cdot PC_{q,t}}{\sum_{P_q \in HP_n} f_q} \qquad (4)$$
where $f_q$ is the frequency associated with the path $P_q$ and $N_n$ is the average number of iterations for the loop $L_n$. Thus, the related $CC_{l,i,t}$ is updated (line 13) as follows:
$$CC_{l,i,t} = CC_{l,i,t} + LC_n \qquad (5)$$
In fact, when $HC_{l,i}$ contains the basic block $BB_n$, which represents the header of a loop $L_n$ nested in $L_l$, the execution of $HC_{l,i}$ must consider the additional cycles due to the loop $L_n$. This process continues, as described above, until the contributions of the task have been computed for all the paths in $HP_l$ (lines 3-20). Defining $ACC_{p,t}$ as the execution time needed to execute the operations of the path $P_p$ in the tasks from Entry to $v_t \in V_l$ (and assuming $ACC_{p,Entry} = 0 \;\forall P_p$), the task $v_t$ contributes (line 21) as follows:
$$ACC_{p,t} = \max_{v_u \in pred(v_t)} \left( ACC_{p,u} \right) + PC_{p,t} + c_t \qquad (6)$$
where $v_u \in pred(v_t)$ is a predecessor of $v_t$ in $HTG_l$ (i.e., $e_{u,t} \in E_l$) and $c_t$ is the overhead that can be associated with the creation/destruction/synchronization of task $v_t$. Note that Eq. 6 can also be applied to task graphs that are not compliant with the fork/join model. In fact, this model of execution only refers to the programming model supported by the target architecture and is not a limit of the methodology. The overall task graph execution time for $HTG_l$ (line 24) is then computed as a weighted average of the contributions given by all the paths:
$$HTC_l = \frac{\sum_{P_p \in HP_l} f_p \cdot ACC_{p,Exit}}{\sum_{P_p \in HP_l} f_p} \qquad (7)$$
If $BB_l \in HC_{j,i}$, and $HTG_l$ is associated with only one task $v_l \in V_j$, the contribution $CC_{j,i,l}$ (line 16) is then updated:
$$CC_{j,i,l} = CC_{j,i,l} + HTC_l \qquad (8)$$
Thus, the analysis can recursively continue until the computation of $HTC_0$, associated with $L_0$ and representing the estimation of the parallelized specification, has been completed. Let $E_s$ be the performance of the original specification; the estimated speed-up introduced by the parallelization is then computed as:
$$\mu = \frac{E_s}{HTC_0} \qquad (9)$$
Algorithm 2 estimate(HTG l)
 1: for all vt ∈ V l in topological order do
 2:   for all Pp ∈ HP l do
 3:     for all HC l,i contained into CRPp do
 4:       compute the region contribution CC l,i,t
 5:       if BBn ∈ HC l,i and Ln completely nested in vt then
 6:         for all HC n,i associated to Ln do
 7:           compute the region contribution CC n,i,t
 8:         end for
 9:         for all Pq ∈ HP n do
10:           compute the path contribution PC q,t
11:         end for
12:         compute the loop execution time LC n
13:         update CC l,i,t with LC n
14:       else
15:         if BBn ∈ HC l,i and HTG n is associated to vt then
16:           update CC l,i,t with HTC n = estimate(HTG n)
17:         end if
18:       end if
19:       compute the path contribution PC p,t
20:     end for
21:     update the cost of the path ACC p,t
22:   end for
23: end for
24: compute the overall task graph execution time HTC l

As shown by Algorithm 2, the procedure for estimating $HTG_l = (V_l, E_l)$ is composed of an outermost loop repeated $|V_l|$ times. For each task, all the paths at the current level of the hierarchy are analyzed and they can contain, in the worst case, at least one operation for each hierarchical control dependence region. Therefore, the analysis at lines 3-20 is repeated, in the worst case, $|C|$ times (i.e., the number of control regions contained in the specification) and the analysis at lines 2-22 is repeated $|P|$ times, where $|P|$ is the number of paths. The complexity of the estimation for task graph $HTG_l$ is, thus, $O(|V_l| \cdot |P| \cdot |C|)$.
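The final steps of the estimation (accumulating path costs over the task graph and averaging them by path frequency) can be sketched as follows. This is a minimal illustration under explicit assumptions: the accumulated cost of a path at a task is taken as the largest accumulated cost among the task's predecessors plus the task's own contribution and overhead, the per-task path contributions are given as input, and all task names, path names and cycle counts are hypothetical.

```python
def estimate_htc(tasks_topo, preds, pc, overhead, freq, exit_task):
    """Frequency-weighted average, over all paths, of the accumulated
    cost reached at the last task of the graph."""
    acc = {}
    for path in freq:
        acc[path] = {}
        for task in tasks_topo:  # tasks must be in topological order
            ready = max((acc[path][u] for u in preds.get(task, ())),
                        default=0)
            acc[path][task] = (ready + pc[path].get(task, 0)
                               + overhead.get(task, 0))
    total = sum(freq.values())
    return sum(freq[p] * acc[p][exit_task] for p in freq) / total

# Hypothetical fork/join graph: T0 forks T1 and T2, T3 joins them.
tasks_topo = ["T0", "T1", "T2", "T3"]
preds = {"T1": ["T0"], "T2": ["T0"], "T3": ["T1", "T2"]}
# Hypothetical contribution (cycles) of each task to each path.
pc = {
    "Pa": {"T0": 10, "T1": 100, "T2": 50, "T3": 10},
    "Pb": {"T0": 10, "T1": 20, "T2": 50, "T3": 10},
}
freq = {"Pa": 1, "Pb": 1}
overhead = {}  # task management overheads ignored in this sketch

htc = estimate_htc(tasks_topo, preds, pc, overhead, freq, "T3")
# Sequential time: every task of a path runs one after the other.
e_s = sum(freq[p] * sum(pc[p].values()) for p in freq) / sum(freq.values())
speedup = e_s / htc
print(htc, e_s, speedup)
```

Because the cost is accumulated per path before averaging, the slower branch of the fork is chosen separately for each path. This per-path maximum is what lets the estimate reflect correlations between conditions that a per-task average would hide.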
Applying the methodology to the example presented in Fig. 1 and Fig. 4, we obtain the results reported in the right side of Table III. First, we compute $PC_{p,t}$ for all the paths and all the tasks. The execution time required by loop $L_5$ has been estimated as $LC_5$ = 10,500 (the only path executed is $CRP_8$). Therefore, $PC_{i,2}$ = 10,520 for all the $P_i \in HP_0$. The sequential execution time $E_s$ is 31,130. Note that it can also be computed by using the proposed methodology and considering all the operations in the same task $v_a \in V_0$. Finally, we can compute the speed-up for the two situations presented in Section IV:
A) the paths executed are $P_3$ and $P_6$, so the execution time estimated for the parallel version is:

In conclusion, the proposed methodology, differently from EPP, is able to correctly estimate the speed-up in the two situations discussed in Section IV.
VI. EXPERIMENTAL RESULTS
The proposed methodology has been implemented in C++ inside PandA [27] , a hardware/software co-design framework based on the GNU GCC compiler [17] .
The considered target architecture is an embedded MPSoC composed of eight ARM920 processors with a shared memory. Each processor has 16KB of instruction cache and 8KB of data cache. The data cache is write-through and adopts a write-update coherency policy. We slightly modified the SimIt-ARM cycle-accurate simulator [13] to model such an architecture. In particular, since SimIt-ARM does not support multi-core simulation, we modified it to support concurrent task execution on different cores with private caches. Communication costs are not addressed in this work, but the extension is straightforward (i.e., by modifying Eq. 6). In this section we compare the two profiling techniques discussed in this paper (EPP and HPP) from the point of view of the instrumentation. Then, we compare the methodology presented in Section V against three other common speed-up estimation models and against the real speed-up obtained by executing the source code on the simulator, on a set of manually partitioned applications extracted from MiBench [28], Splash 2 [29] and OmpSCR [30], which are three free suites of representative benchmarks for embedded and parallel computing. The benchmarks and their characteristics are reported in Table IV.
A. Path profiling
Both the EPP and the HPP techniques have been implemented inside our framework, without any optimizations. The results related to instrumentation and path counting are reported in Table V. The instrumented source code is generated starting from the GIMPLE code at the end of the target independent optimization flow. For this reason, different results (see Inc.&Init. and $N_P$) are obtained when changing the optimization level of the GNU GCC compiler. This also means that the estimation takes into account the middle-end optimizations, when activated. Then, we executed the instrumented code 100 times and averaged the resulting execution times. The profiling has been performed on a host Linux machine with an Intel Xeon X5355 CPU (4 cores at 2.33 GHz with 4 MB of L2 cache per pair of cores).
Starting from the same path graph, both techniques count a path each time they reach the end of a function or of a loop. This results in the same number of path counter writes (write) for the two methods. The number of paths ($N_P$) obtained with HPP is instead lower than with EPP. In fact, HPP performs a different path composition when loops are involved. As described above, EPP considers the path which enters the loop and the path which exits from it as two distinct paths, when at least one iteration is performed, while HPP fuses these two paths into a single one. Finally, the instrumentation overhead introduced by both techniques ($oh_{diff}$) ranges from 20% to 200%. This is not a limitation, since we profile on a host system much faster than the target architecture or its cycle-accurate simulator. However, these numbers cannot be directly compared to Ball and Larus's implementation. Their performance analysis tool, in fact, instruments (SPARC) binary executables, reducing the overheads by performing data-flow analysis to exploit the architectural registers. Our tool, instead, implements the two techniques completely in software, acting on the architecture independent intermediate representation before the object code generation and producing an architecture agnostic instrumentation. Table V shows that HPP has an overhead systematically lower than EPP. In fact, since HPP uses a different definition of valid paths, we have been able to reduce the number of activated paths, reducing the data structure needed to store them.
B. Speedup estimation
In this section we compare the real speed-up, obtained by simulating the sequential and parallelized source codes with SimIt, against the following estimation models:
- Case A: the contribution of each task is based on its worst-case execution time [10], and the average number of iterations for non-countable loops is set to 5, similarly to [31];
- Case B: the contribution of each task is based on its average execution time [10], where all branches are considered equiprobable. The number of iterations is set as in Case A;
- Case C: the estimation is performed as in Case B, but the number of iterations and the branch probabilities are based on the results obtained by dynamic profiling with EPP;
- Case D: this case refers to the methodology proposed in this paper and detailed in Section V.
These methods have been applied to the benchmarks in Table IV with different levels of compiler optimizations. The results are reported in Table VI. Note that no existing methodology is able to exploit all the information coming from EPP while preserving the HTG structure, since EPP identifies paths that cross the boundaries of the single task graph. Thus, the resulting information can be used only to estimate the branch probabilities and the number of iterations.
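The difference among Cases A, B and C can be illustrated with a simplified Python sketch. It assumes a task made of a single loop whose body is a two-way branch, and a fork-join of independent tasks with no communication or scheduling overhead. These assumptions, the function names and all the numbers are hypothetical and only serve the illustration; they are not the estimation models of [10] or of Section V.

```python
DEFAULT_ITERATIONS = 5  # value used in the text for non-countable loops


def loop_time_worst_case(branch_costs, iterations=DEFAULT_ITERATIONS):
    """Case A style: every iteration pays the most expensive branch."""
    return iterations * max(branch_costs)


def loop_time_equiprobable(branch_costs, iterations=DEFAULT_ITERATIONS):
    """Case B style: every branch is taken with the same probability."""
    return iterations * sum(branch_costs) / len(branch_costs)


def loop_time_profiled(branch_costs, branch_counts, loop_entries):
    """Case C style: branch probabilities and the average number of
    iterations are derived from profile counters."""
    executed = sum(branch_counts)          # iterations observed overall
    if executed == 0 or loop_entries == 0:
        return 0.0
    iterations = executed / loop_entries   # average iterations per entry
    mean_cost = sum(c * n for c, n in zip(branch_costs, branch_counts)) / executed
    return iterations * mean_cost


def fork_join_speedup(task_times):
    """Speed-up of independent tasks executed in parallel: sequential time
    over the longest task (assumption: all overheads are ignored)."""
    return sum(task_times) / max(task_times)


# Hypothetical task X: branch bodies of 100 and 10 cycles, taken 90 and 10
# times over 10 executions of the loop.
costs_x, counts_x, entries_x = (100, 10), (90, 10), 10
time_y = 200  # hypothetical second task with a fixed cost

for name, time_x in (
    ("A", loop_time_worst_case(costs_x)),
    ("B", loop_time_equiprobable(costs_x)),
    ("C", loop_time_profiled(costs_x, counts_x, entries_x)),
):
    print(name, time_x, round(fork_join_speedup([time_x, time_y]), 3))
```

Even on this small example the three models assign different contributions to the same task, and therefore predict different speed-ups for the same parallelization.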
Since the target compiler for ARM processors may perform target-dependent optimizations, some inaccuracies could occur. However, we have verified that their impact is similar on both the parallelized and the sequential code, so the speed-up ratio estimation is not affected. Analyzing the results in Table VI with respect to the benchmark characteristics shown in Table IV, we can make the following considerations. In basicmath and blowfish, the profiling-based techniques obtain far better results when no optimizations are applied. On the other hand, the differences become negligible when the code is restructured by the optimizations. One of the reasons is the introduction of loop optimizations by the compiler. Nevertheless, the profiling techniques reach an accurate estimation in both cases. In the smoothing, fmm and Delayline benchmarks, the technique based on EPP information (C) obtains very poor results compared to the other techniques. These benchmarks contain several control constructs and loops, and EPP loses all the correlations among the paths, leading to a highly inaccurate estimation. The HPP-based technique (D), instead, is able to correctly model such cases. The Delayline and fmm benchmarks are very interesting when analyzing the quality of the parallelization. In fact, approaches A, B and C estimate the presence of a speed-up in the parallelized code. However, when such code is executed on the simulator, we see that the parallelization is not efficient at all, and no speed-up is obtained (μ SimIt = 1).
In both these cases, the proposed technique correctly estimates the lack of any speed-up. This means that our methodology may be highly suitable during the design space exploration, allowing the designer to obtain a fast preliminary evaluation of different parallelization approaches without requiring multiple, time-consuming simulations or executions on the target platform. In smoothing, the control constructs are very unbalanced (i.e., some branches have a larger probability of being taken) and the approach based on the worst case (A) is able to model this situation, obtaining results that are very close to the proposed methodology. However, code restructuring due to optimizations changes the situation, and only the proposed methodology (D) is able to accurately model it. Note that, in general, when the branch probabilities are unbalanced, the approach based on the worst case (A) behaves better than the probabilistic one (B). Graphsearch and JPEG are examples of this scenario. In jacobi1 and openmpbench, the number of control constructs and loops is very limited. Again, the branch probabilities are very unbalanced and the approach based on the worst case (A) obtains the best results. However, the error of the proposed methodology is still acceptable (less than 13%). In fft6, the effect of the control is negligible and the speed-up of the parallelized code is correctly predicted by all the models with a very limited error. In dijkstra, instead, the branches of the control constructs are almost equiprobable. Consequently, the approach based on equiprobable branches (B) accurately models this situation. In all the remaining cases, the profiling techniques C and D systematically outperform the non-profiling ones (A and B). On these benchmarks, C and D obtain very similar results.
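The observation on unbalanced branches can be reproduced on a toy case. The Python sketch below compares the relative error of a worst-case and of an equiprobable estimate for a single two-way branch, under the assumption that the more expensive branch is also the more likely one. The costs and probabilities are hypothetical and are not taken from the benchmarks.

```python
def expected_cost(costs, probabilities):
    """Average cost of a branch given the probability of each outcome."""
    return sum(c * p for c, p in zip(costs, probabilities))


def relative_error(estimate, reference):
    """Relative error of an estimate with respect to a reference value."""
    return abs(estimate - reference) / reference


def model_errors(costs, p_first):
    """Errors of the worst-case and equiprobable estimates for a two-way
    branch whose first outcome is taken with probability p_first."""
    reference = expected_cost(costs, (p_first, 1.0 - p_first))
    worst_case = max(costs)
    equiprobable = sum(costs) / len(costs)
    return (relative_error(worst_case, reference),
            relative_error(equiprobable, reference))


costs = (100.0, 10.0)  # hypothetical cycle counts of the two branch bodies

skewed = model_errors(costs, 0.9)    # expensive branch taken 90% of the time
balanced = model_errors(costs, 0.5)  # both branches equally likely

print("skewed:   worst-case %.3f, equiprobable %.3f" % skewed)
print("balanced: worst-case %.3f, equiprobable %.3f" % balanced)
```

With the skewed probabilities the worst-case estimate is the closer one, while with balanced probabilities the equiprobable estimate is exact, which mirrors the behaviour reported for dijkstra. If the cheaper branch were the likely one, the worst-case estimate would instead be the less accurate of the two.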
VII. CONCLUSIONS AND FUTURE WORK
In this paper we proposed a methodology that effectively combines a new path profiling technique, the Hierarchical Path Profiling (HPP), with the information coming from the control dependence regions to obtain the static speed-up estimation of a parallel code, represented by a Hierarchical Task Graph. We applied our methodology to a set of common benchmarks for embedded and parallel computing, showing that it produces more accurate estimations than other standard approaches. Such a solution may be integrated in future auto-parallelizing compilers or performance analysis tools for MPSoCs to obtain a fast evaluation of the quality of the parallelization.
Future work will focus on the integration of all the optimizations proposed in the literature to reduce the instrumentation overhead, on the analysis of the correlations among the operations, and on the effects due to the target architecture (e.g., hits and misses on instruction and data caches, or communication) to further improve the estimation accuracy.
