Abstract-In this paper we present a new technique for automatically measuring the performance of tasks, functions or arbitrary parts of a program on a multiprocessor embedded system. The technique instruments the tasks described by OpenMP, used to represent the task parallelism, while ad hoc pragmas in the source indicate other pieces of code to profile. The annotations and the instrumentation are completely target-independent, so the same code can be measured on different target architectures, on simulators or on prototypes.
I. INTRODUCTION
Performance analysis is a crucial phase of the embedded design flow. Multi-Processor Systems-on-Chip (MPSoCs), composed of several processors, often heterogeneous, interconnected with memories, application-specific accelerators and Input/Output components, are the standard architectures for modern embedded systems [1].
The complex organization of these architectures, the parallelism, the synchronization and the communication mechanisms introduce several issues in the optimization process. In particular, accurate measurements or estimations of the execution time of the applications are required to help the designer in meeting the performance constraints. However, estimation methodologies cannot be used if some parts of the application are provided only within libraries, or the input data are available only on the target architecture. In such cases, the only way to gather accurate performance information is to instrument the code and execute it directly on the target architecture.
In this paper, we propose a technique for automatic code instrumentation of parallel applications for MPSoCs. The technique is selectively applied only to the most interesting parts of the code, identified by pragma annotations. The parallelism is described through OpenMP [2], while ad hoc annotations are used to measure the execution times of functions and arbitrary parts of code. Several methodologies and tools have been proposed for code instrumentation and analysis.
In [3] the authors apply a fine-grained instrumentation to estimate the performance of embedded applications inside a Transaction Level Model (TLM) tool for MPSoC design. The instrumentation sums the clock cycles required by a specific instruction each time it is executed. The execution delay of each instruction, in clock cycles, has to be provided to the tool. Garcia et al. present Pet [4], a tool for monitoring and analyzing parallel applications on embedded multiprocessors. The tool analyzes the execution trace and identifies all the interactions among the tasks and their occurrence. However, it only works on a specific embedded architecture and cannot manage general parallelism annotations like OpenMP. Some works discuss the performance analysis of OpenMP parallel applications in the High Performance Computing field. SCALEA [5] allows the evaluation of the overheads introduced by parallel programming paradigms. It deals not only with OpenMP but also with Message Passing Interface (MPI) and High Performance Fortran (HPF). Nevertheless, SCALEA is tightly integrated with its Fortran compiler framework, and is geared towards distributed systems. OPARI [6] is a source-to-source translation tool that exploits the idea of OpenMP pragma/directive rewriting, automatically adding all the necessary calls to a runtime measurement library. The tool, however, does not allow measuring specific parts of the code. We follow the ideas proposed in those solutions for the performance analysis of parallel OpenMP code, but apply those techniques to embedded architectures. Konkin et al. in [7] describe a methodology to identify the points where instrumentation for performance measurement should be added. The authors define as suitable instrumentation points the interfaces among different software modules, like function call points. This solution allows complete automatic instrumentation of the source code, but the large number of instrumentation points introduces significant overheads.
Our proposal, instead, allows selecting which points to instrument. Our solution is somewhat inspired by DTrace [8]. With DTrace, the code is instrumented by the programmer during development, and the profiled data are collected at runtime through scripts that enable the probes. In our case, the developer specifies with pragmas what he or she wants to measure, and at compilation time the tool inserts the required instrumentation. Our approach is thus more suitable for embedded systems: performance optimization is usually done at design time, and not adding performance libraries and probes (even if only activated when required) reduces the memory footprint of the application.
Embedded system design also imposes further constraints. Often, during the design of the applications, we cannot rely on the final target platform to run the code, but only on a prototype which does not represent all the details of the final architecture. Usually, for example, we have FPGA prototypes that, for area reasons, have fewer processors than the final ASIC architecture. Thus, solutions able to correctly estimate the speedup and the overhead due to the higher degree of parallelism are required. Our methodology can be used to estimate the performance of parallel code, starting from a single processor solution.
978-1-4244-4467-0/09/$25.00 ©2009 IEEE
The contributions of this paper can be summarized as follows:
• it introduces a technique for measuring OpenMP-partitioned applications on embedded systems;
• it proposes a fast, target-independent technique for arbitrary code instrumentation that, by limiting the overheads to the interesting parts of the code, may be adopted in an embedded design flow;
• it uses the instrumentation technique to estimate the performance of a dual processor embedded platform, starting from a single processor solution.
The paper is organized as follows. Section II describes the proposed technique. Section III describes our case study, while Section IV discusses the experimental results. Finally, Section V concludes the paper.
II. PROPOSED METHODOLOGY
The proposed methodology is integrated in Zebu, a tool provided in PandA [9], a framework for hardware/software co-design. The methodology consists of a performance analysis flow composed of three different phases, detailed in the following:
• instrumentation (Section II-1): the original source code is parsed, analyzed and then reproduced with the instrumentation added in the proper points;
• compilation (Section II-2): the produced source code is compiled and executed on the target architecture;
• data collection (Section II-3): the measured data are collected and associated with the parts of the code which they refer to.
1) Instrumentation: Zebu exploits the GNU/GCC compiler to translate the initial C source code into the GIMPLE intermediate representation, considering three different types of pragmas:
• OpenMP pragmas: we consider only the omp parallel sections, omp sections and omp parallel for pragmas;
• function pragmas: they identify the functions whose execution time we are interested in measuring;
• custom measurement pragmas: they identify arbitrary blocks of structured code that may require measurement.
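As an illustration of the three annotation kinds listed above, the following sketch marks up a small C kernel. The OpenMP pragmas are standard; the spellings `profile_function` and `profile_block` are hypothetical placeholders, since the concrete syntax of the function and custom measurement pragmas is not given here. A compiler that does not recognize them simply ignores them (possibly with a warning), so the annotated code still builds and runs unmodified.

```c
#include <stddef.h>

/* HYPOTHETICAL pragma name: marks scale() so that each call point is timed. */
#pragma profile_function
static void scale(double *v, size_t n, double k)
{
    size_t i;
    for (i = 0; i < n; i++)
        v[i] *= k;
}

/* Scales a and b as two independent tasks, then reduces them to one value. */
double process(double *a, double *b, size_t n)
{
    double sum = 0.0;
    size_t i;

    /* Standard OpenMP: each section is one task whose time is measured. */
#pragma omp parallel sections
    {
#pragma omp section
        scale(a, n, 2.0);
#pragma omp section
        scale(b, n, 0.5);
    }

    /* HYPOTHETICAL pragma name: measure this arbitrary structured block. */
#pragma profile_block
    {
        for (i = 0; i < n; i++)
            sum += a[i] + b[i];
    }
    return sum;
}
```

Compiled without OpenMP support, the two sections simply execute one after the other, which is the configuration later used for the single-processor measurements.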
The intermediate representation is then modified by adding the requested instrumentation. A unique identifier is assigned to each omp parallel sections or omp sections pragma, when encountered, and the instrumentation code is added immediately before and after the corresponding code block. Next, a unique identifier is also assigned to each omp section in the sections and the instrumentation is added at its beginning and at its end. In this way, it is possible to measure both the execution time of each task of the omp sections region and the overall synchronization costs. The omp for pragmas are treated in two different ways. When the number of threads active at that point is known, the pragma is replaced with an omp parallel sections region and the loop iterations are statically partitioned and assigned to different omp section blocks. When the number of threads is unknown, the omp for pragma is treated like a custom measurement pragma and only the execution time of the whole loop is measured.
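The transformation of a parallel loop for a known thread count of two can be sketched as follows. This is only an illustration of the idea, not the code Zebu emits: `profile_begin` and `profile_end` are hypothetical names for the inserted calls, and here they merely count invocations instead of reading a timer so that the sketch stays deterministic.

```c
#include <assert.h>

/* One identifier for the region, one for each section (hypothetical names). */
enum { REGION_0, SECTION_0, SECTION_1, NUM_POINTS };

static unsigned begun[NUM_POINTS];
static unsigned ended[NUM_POINTS];

/* Stand-ins for the inserted instrumentation: count events, no timing. */
static void profile_begin(int id)
{
    assert(id >= 0 && id < NUM_POINTS);
    begun[id]++;
}

static void profile_end(int id)
{
    assert(id >= 0 && id < NUM_POINTS);
    ended[id]++;
}

/* Original form: "#pragma omp parallel for" over i in [0, n), v[i] += 1. */
void add_one(int *v, int n)
{
    int half = n / 2; /* static partition of the iteration space in two */

    profile_begin(REGION_0); /* immediately before the region */
#pragma omp parallel sections
    {
#pragma omp section
        {
            int i; /* private to this section */
            profile_begin(SECTION_0);
            for (i = 0; i < half; i++)
                v[i] += 1;
            profile_end(SECTION_0);
        }
#pragma omp section
        {
            int i; /* private to this section */
            profile_begin(SECTION_1);
            for (i = half; i < n; i++)
                v[i] += 1;
            profile_end(SECTION_1);
        }
    }
    profile_end(REGION_0); /* immediately after the region */
}
```

The region identifier brackets the whole construct, so the difference between its time and the section times exposes the fork, join and synchronization costs.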
The second type of considered pragmas is the function pragma, which is associated with the declaration of a function. When a call point of an annotated function is detected, a unique identifier is associated with it and the instrumentation code is added immediately before and after the call. Instrumenting the code in this way allows distinguishing the execution time of the function according to the call points. Furthermore, the overhead of the function call is also included in the measured execution time.
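A minimal sketch of call-point instrumentation is given below, assuming hypothetical helpers `call_begin` and `call_end` and using the standard C `clock()` function as a stand-in timer. Each call point of the annotated function gets its own identifier, so the two calls accumulate into separate entries even though they invoke the same function.

```c
#include <assert.h>
#include <time.h>

/* One identifier per call point of the annotated function (hypothetical). */
enum { CALL_A, CALL_B, NUM_CALL_POINTS };

static clock_t started[NUM_CALL_POINTS];
static clock_t elapsed[NUM_CALL_POINTS];

static void call_begin(int id)
{
    assert(id >= 0 && id < NUM_CALL_POINTS);
    started[id] = clock();
}

static void call_end(int id)
{
    assert(id >= 0 && id < NUM_CALL_POINTS);
    elapsed[id] += clock() - started[id];
}

/* The annotated function: its body is left untouched. */
static long checksum(const int *v, int n)
{
    long s = 0;
    int i;
    for (i = 0; i < n; i++)
        s += v[i];
    return s;
}

long run(const int *v, int n)
{
    long a, b;

    call_begin(CALL_A); /* wraps the first call point */
    a = checksum(v, n);
    call_end(CALL_A);

    call_begin(CALL_B); /* wraps the second call point */
    b = checksum(v, n / 2);
    call_end(CALL_B);

    return a + b;
}
```

Because the probes surround the call rather than the function body, argument passing and return are part of what is timed.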
Finally, when a custom measurement pragma is found, the instrumentation code is simply added at each entry and exit point of the annotated code block. Also in this situation, we associate a unique identifier with the portion of code, adding the related instrumentation code to the produced code.
As for the instrumentation itself, Zebu adds a numeric array, which stores the measures, and some function calls that record the overall execution time of the application and allow the data to be collected at the end of the execution. When all the needed instrumentation code has been added, the GIMPLE code is translated back to C source code.
2) Compilation: Since the instrumentation code added in the previous phase is completely target-independent, in this phase it is customized for a particular target through an architecture-specific definition file composed of two different parts. The first part defines the type of the array elements which record the measures. The second part contains the implementation of the functions which actually measure the execution time, since these implementations heavily depend upon both the considered architecture and the operating system, if present. The instrumented source code and the architecture-specific definition file are compiled and then linked, obtaining the executable object code for the application on the specific target architecture.
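The two parts of an architecture-specific definition file might look like the sketch below for a target running an operating system that provides the POSIX `gettimeofday` call. The names `measure_t`, `measures`, `measure_start` and `measure_stop` are assumptions made for this example, as is the choice of microseconds as the unit; a bare-metal target would instead read a hardware timer in the second part.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Part 1: element type of the measurement array (microseconds here). */
typedef long long measure_t;

#define NUM_POINTS 8 /* assumed number of instrumentation points */

measure_t measures[NUM_POINTS];          /* accumulated time per identifier */
static measure_t start_stamp[NUM_POINTS]; /* timestamp of the pending start */

/* Part 2: target-dependent timing functions, here based on gettimeofday. */
static measure_t now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (measure_t)tv.tv_sec * 1000000 + (measure_t)tv.tv_usec;
}

void measure_start(int id)
{
    assert(id >= 0 && id < NUM_POINTS);
    start_stamp[id] = now_us();
}

void measure_stop(int id)
{
    assert(id >= 0 && id < NUM_POINTS);
    measures[id] += now_us() - start_stamp[id];
}
```

Only this file changes when moving to another target; the instrumented source keeps calling the same two functions.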
3) Data Collection: The last phase of the flow consists in collecting the measurement data by executing the application on the target architecture. Different runs with different inputs have to be executed if the application is strongly data dependent, producing different datasets to be collected. Note that, if a piece of code is executed more than once during a single run of the application, the technique measures the related average execution time. If the application has been executed more than once, the average execution time of each annotated code section is computed across the runs. Finally, since Zebu maintains the correspondence between the unique identifiers and the parts of the code, it easily assigns to each task, annotated function or annotated part of code, its measured execution time.
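The averaging described above can be sketched as follows, under the assumption that each identifier keeps a running total and an execution count within a run, and that the per-run averages are then averaged with equal weight across runs. All names are hypothetical.

```c
#include <assert.h>

#define NUM_POINTS 4 /* assumed number of instrumentation points */

struct point_stats {
    double total;        /* accumulated time within the current run */
    unsigned long count; /* how many times the code was executed */
};

static struct point_stats stats[NUM_POINTS];

/* Called once per execution of the annotated code with its elapsed time. */
void record_sample(int id, double elapsed)
{
    assert(id >= 0 && id < NUM_POINTS);
    stats[id].total += elapsed;
    stats[id].count++;
}

/* Average execution time of one identifier within a single run. */
double run_average(int id)
{
    assert(id >= 0 && id < NUM_POINTS);
    return stats[id].count ? stats[id].total / (double)stats[id].count : 0.0;
}

/* Equal-weight mean of the per-run averages collected over several runs. */
double average_over_runs(const double *per_run, int runs)
{
    double sum = 0.0;
    int i;
    assert(runs > 0);
    for (i = 0; i < runs; i++)
        sum += per_run[i];
    return sum / (double)runs;
}
```

Keeping the count alongside the total is also what later allows a per-region gain to be scaled by the number of times the region was executed.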
III. CASE STUDY: LEON 3 MP
In this section, we show how the proposed technique is applied to LEON 3 based systems [10]. In particular, we demonstrate that our approach allows the performance estimation of a parallel application, annotated with OpenMP, on a multiprocessor system, while measuring its execution time on a single processor architecture.
We implemented two different architectures. The first uses a single LEON 3, the second integrates two processors. In both designs, we enabled the Memory Management Unit (MMU) and instantiated 16 KB of instruction cache and 8 KB of data cache. We enabled the Gaisler Floating Point Unit.
We now describe the steps required to perform the speedup estimation on an example, the first version of the Loop with Dependencies benchmark from the OpenMP Source Code Repository [11]. The significant part of its source code is reported in Figure 1.
The function loop, which is the kernel of the application, has two omp parallel for regions. The number of threads executing the application is two (the number of processors of our target architecture), so each omp parallel for region is replaced by an omp parallel sections with two sections. The resulting instrumented source code is shown in Figure 2. The instrumented application is then compiled with the architecture-specific definition file for sequential execution, ignoring the OpenMP pragmas, and is executed on the single processor platform. In particular, the functions implemented in the architecture-specific definition file for LEON 3 based systems are based on the Linux system function gettimeofday. At the end of the application execution, Zebu annotates each task of the application with the measured execution time. The next step consists in the actual estimation of the parallel execution time, accomplished by accounting for the differences between the sequential and the parallel execution. On the parallel architecture, the execution time $c^p_P$ of each parallel region $P$, composed of tasks $t \in P$, can be computed as:

$$c^p_P = c_f + \max_{t \in P} c_t + c_j$$

where $c_f$ is the fork cost, $c_t$ the execution time of task $t$, and $c_j$ is the join cost. On the other hand, in the sequential execution, the time $c^s_P$ needed to execute the code of the same parallel region $P$ is $c^s_P = \sum_{t \in P} c_t$. So the time saved by executing the parallel region on a multiprocessor architecture ($g_P$) can be estimated as:

$$g_P = c^s_P - c^p_P = \sum_{t \in P} c_t - \left( c_f + \max_{t \in P} c_t + c_j \right)$$
The execution times have to be combined with the profiling information to produce a correct estimation. Consider the Loop with Dependencies example presented in Figure 2. The two parallel regions are the body of a loop which is executed numiter times, so $g_P$ has to be multiplied by numiter.
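The estimation step can be sketched in a few lines of C. The sketch assumes that every task of a region runs on its own processor, so that the parallel time of a region is the fork cost plus the longest task plus the join cost; the gain is the sequential time of the region minus that value, scaled by how many times the region executes. Function and parameter names are hypothetical.

```c
#include <assert.h>

/* Time saved by one execution of a parallel region, from sequential
 * per-task measurements and known fork/join costs (same time unit). */
double region_gain(const double *task_times, int num_tasks,
                   double fork_cost, double join_cost)
{
    double sum = 0.0;     /* sequential time: all tasks back to back */
    double longest = 0.0; /* parallel time is bounded by the slowest task */
    int i;

    assert(num_tasks > 0);
    for (i = 0; i < num_tasks; i++) {
        sum += task_times[i];
        if (task_times[i] > longest)
            longest = task_times[i];
    }
    return sum - (fork_cost + longest + join_cost);
}

/* Estimated parallel time: sequential total minus each region's gain
 * multiplied by the number of times that region was executed. */
double estimate_parallel_time(double sequential_time, const double *gains,
                              const unsigned long *executions, int num_regions)
{
    double saved = 0.0;
    int i;

    for (i = 0; i < num_regions; i++)
        saved += gains[i] * (double)executions[i];
    return sequential_time - saved;
}
```

For instance, two tasks of 4 and 6 time units with unit fork and join costs give a gain of 2 per execution; if the region runs 3 times inside a program that takes 100 units sequentially, the estimate is 94 units. A negative gain means that the fork and join costs outweigh the benefit of running the tasks in parallel.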
Two aspects of a multi-processor system are not taken into account in estimating the execution time of the parallel application using sequential information: the contention in accessing shared resources, such as the memory, and the cache conflicts.
IV. EXPERIMENTAL RESULTS
We validate the proposed technique on a set of benchmarks extracted from the OpenMP Source Code Repository [11] and from MiBench [12] on both the architectures described above. The OpenMP Source Code Repository [11] benchmarks are already annotated with OpenMP pragmas. The MiBench benchmarks have been parallelized by hand, splitting the kernels of each application into one or more pairs of parallel tasks through OpenMP annotations. In particular, the data obtained on the single-processor architecture have been used to estimate the execution on the multi-processor architecture. The number of instrumentation points for each benchmark is reported in Table I. Each benchmark has been compiled with a GNU/GCC SPARC cross-compiler without optimization (-O0) and with the -O2 optimization level. The results of all the executions are reported in the left part of Table II. In the three columns labeled Sequential we show the overhead introduced by the instrumentation for measuring the task performance on the first architecture (single-processor platform). In particular, in Real we report the execution time of the benchmark measured without instrumentation. OH, instead, reports the overhead introduced by the instrumentation, which is usually very small (it ranges from 0.0% to 0.3%). In three cases (fft_6, Loops with Dependencies and jpeg encoder) the overhead is more relevant, due to a higher number of measures performed during the execution. In particular, in Loops with Dependencies, the number of measures is very large with respect to the small size of the benchmark.
The central part of Table II shows the instrumentation overhead on the second architecture (dual-processor platform). Real and OH report the execution time without task performance instrumentation and the overhead introduced by the instrumentation, respectively. For most of the benchmarks the instrumentation overhead is larger than that observed on the single-processor architecture, mainly for two reasons. First, the overall execution time of the applications on the dual-processor platform is smaller than on the single LEON solution, so the impact of the instrumentation is more relevant. Second, the instrumentation in parallel tasks generates contention while accessing the common structures used to perform the profiling.
It is worth noting the results of the Loops with Dependencies benchmark. In this case, the overhead due to task creation, destruction and synchronization is bigger than the benefits introduced by the parallelization. Thus, the parallel version of the application takes longer than the sequential one and, as a consequence, the relative instrumentation overhead of the parallel version is smaller.
The last two columns of the table show the results of the estimation of parallel execution computed using the method described in Section III. Estim. is the estimated overall execution time and Error is the error of the estimation. The cost for forks and joins has been obtained by applying our technique to the OpenMP MicroBenchmark Suite [13] and the error obtained in estimation ranges from 0.0% to 6.9%. Specifically, we observe that the parallel execution time is underestimated for all the benchmarks, because of the contention for accessing shared resources and the cache conflicts, as previously detailed in Section III.
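For reference, the two percentages reported in the tables can be computed as in the sketch below. The exact definitions are assumptions: the overhead is taken relative to the uninstrumented execution time, and the estimation error relative to the execution time measured on the dual-processor platform. The function names are hypothetical.

```c
#include <assert.h>

/* Instrumentation overhead in percent, relative to the time measured
 * without instrumentation (assumed definition of the OH columns). */
double overhead_percent(double real_time, double instrumented_time)
{
    assert(real_time > 0.0);
    return 100.0 * (instrumented_time - real_time) / real_time;
}

/* Absolute estimation error in percent, relative to the measured
 * parallel execution time (assumed definition of the Error column). */
double error_percent(double measured_time, double estimated_time)
{
    double diff = estimated_time - measured_time;

    assert(measured_time > 0.0);
    if (diff < 0.0)
        diff = -diff; /* an underestimate counts the same as an overestimate */
    return 100.0 * diff / measured_time;
}
```

With these definitions, an uninstrumented time of 200 units and an instrumented time of 201 units give an overhead of 0.5%, and an estimate of 95 units against a measured 100 units gives an error of 5%.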
V. CONCLUSION
In this paper we presented a technique to automatically measure the performance of the different parts of an application on an MPSoC. The technique uses different pragmas to identify the parts of the code to be measured, and the related target-independent instrumentation is directly inserted into the source code. The proposed technique has been validated on a set of benchmarks for parallel and embedded computing on a LEON 3 platform. The results show that the overhead introduced is small and that the performance profiled on a sequential version of an application can be used to estimate the execution time of its parallel version.
Future work will focus on the extension of the proposed methodology to heterogeneous architectures, considering in particular the performance analysis of functions offloaded to hardware accelerators, and on refinements of the methodology to allow the estimation of the effects of resource contention.
