ABSTRACT
INTRODUCTION
Embedded software is the key contributor to embedded system performance and power consumption. Program execution tends to spend most of its time in a small fraction of the code, a behavior known as the "90-10 rule": 90% of the execution time comes from 10% of the code. By their very nature, embedded applications tend to follow the 90-10 rule even more closely than desktop applications.
Tools seeking to optimize the performance and/or energy consumption of embedded software should therefore focus first on finding that critical code. Possible optimizations include aggressive recompilation, customized instruction synthesis, customized memory hierarchy synthesis, and hardware/software partitioning [10, 2], all focusing on the critical code regions. Of those critical code regions, about 85% are inner loops, while the remaining 15% are functions. A partitioning tool should therefore focus first on finding the most critical software loops and understanding their execution statistics, after which the tool can explore partitioning alternatives coupled with loop transformations in the hardware (such as loop unrolling). Our particular interest is in the hardware/software partitioning of programs, but our methods can be applied to the other optimization approaches as well.
Many profiling tools have been developed. Some tools, like gprof, only provide function-level profiling and do not provide the more detailed information, such as loop statistics, necessary for partitioning. Tools that do profile at a more detailed level tend to focus on statements or basic blocks; a user interested in loops must implement additional functionality on top of those profilers. Furthermore, many profiling tools, like ATOM [12] or Spix [11], are specific to a particular microprocessor family.
Instruction-level profiling tools can be tuned to provide useful information regarding the percentage of time spent in different loops of a program. Instruction profiling tools can be broadly classified into two categories: compilation-based instruction profilers and simulation-based instruction profilers. A compilation-based profiler instruments the program by adding counters to its basic blocks; during execution, the counter values are written to a separate file. A simulation-based instruction profiler uses an instruction set simulator, and such profilers can be further classified as static or dynamic. A simulation-based dynamic instruction profiler obtains the instruction profile while the code executes on the simulator, whereas in static profiling the execution is written to a trace and the trace is then processed to obtain instruction counts. For very large applications, the trace generated by static profiling can grow to unmanageable proportions. Even though dynamic profiling is slow compared to compiler-based instrumentation, a variety of architectural parameters can be tuned and studied while the program is profiled on a full system simulator.
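To make the static (trace-based) approach concrete, the following is a minimal sketch of how a post-processing pass could turn an instruction-address trace into per-region instruction counts. The trace file name, its one-address-per-line format, and the address ranges are all hypothetical.

```python
# Sketch of static, trace-based instruction profiling: count how many executed
# instructions fall inside each address range of interest.
from collections import defaultdict

# (start, end) address ranges of code regions to profile -- placeholder values
regions = {"loop_a": (0x400100, 0x400140), "loop_b": (0x400200, 0x400260)}

counts = defaultdict(int)
total = 0
with open("trace.txt") as trace:          # hypothetical trace: one hex address per line
    for line in trace:
        addr = int(line.strip(), 16)
        total += 1
        for name, (start, end) in regions.items():
            if start <= addr < end:
                counts[name] += 1

for name, count in counts.items():
    print(f"{name}: {count} instructions ({100.0 * count / total:.1f}%)")
```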
We have developed a profiling tool that focuses on collecting loop-level information for a very large variety of microprocessor platforms. Our profiling tool supports both the instrumentation and the simulation paradigms. We achieved this goal by building on top of two very popular tools, gcc for instrumentation and Simics [9] for simulation, while keeping the output identical for the two paradigms, enabling easy switching between them. Both gcc and Simics, and hence our tool, support dozens of popular microprocessors. We call our toolset the Frequent Loop Analysis Toolset (FLAT).
RELATED WORK
Profilers like gprof are helpful for determining the time spent in function calls. However, to make judicious hardware/software partitioning decisions, knowledge of the program at the granularity of loops is imperative.
ATOM [11] provides a toolset that lets the user track a program's behavior by inserting analysis routines at interesting parts of the program. When the program is executed, the analysis routines collect information about various parts of the program and dump the results to a separate file. ATOM provides the following tools for instruction profiling: hiprof, pixie and uprof. The hiprof tool is capable of providing sampled program counters for different program events. The pixie tool provides basic block profile information. The uprof tool is useful for profiling non-time events and can provide procedure, source line and assembler profiles for a program.
The Harvard Atom Like Tool (HALT) [3] provides a flexible way to add routines to programs produced by the SUIF compiler. Users indicate interesting parts of the program by labeling them with SUIF annotations. HALT looks for these annotations and inserts function calls to analysis routines that match the type of the annotation. Using different analysis routines, HALT provides a number of hardware simulators, performs branch stream analysis, and records statistics for profile-driven optimizations. HALT is helpful for obtaining information regarding branch prediction, code layout, instruction scheduling, and register allocation. It has been ported to the MIPS and Alpha processors.
Optimally Profiling and Tracing Programs [4] inserts counters into the control flow graph (CFG) in order to record the execution counts of the basic blocks of a program. QPT [15] is an instruction-profiling tool based on the algorithms described in [4] and is targeted at the SPARC architecture. It supports two modes of instruction profiling: a quick mode and a slow mode. The slow mode inserts a counter for every basic block, while the quick mode relies on inserting counters on an infrequently executed subset of edges in the control flow graph. CPROF [15] processes program traces generated by QPT and annotates source lines and data structures with the appropriate cache miss statistics.
ProfileMe [5] samples instructions as they move through an out-of-order issue pipeline and reports statistics like cache miss rates. LooAn [1] is a profiling tool that gives loop- and function-level information. However, since it is a static profiler, its trace files grow to unmanageable proportions for very large programs. Shade [15] combines instruction set simulation with trace generation capability. It uses a user-specified trace analyzer to control program execution and the extent of trace generation. The analyzer code is generated dynamically and is cached for reuse.
ALTO [16] develops whole-program data flow analysis and code optimization techniques for link-time program optimization and is targeted at the DEC Alpha architecture. SpixTool [11] is an instruction profiling toolset intended for the SPARC architecture; it consists of two tools, Spix and Spixstat. Spix generates basic block execution profiles, while Spixstat generates statistics on instruction count, branch behavior, opcode usage, etc. Loop information can be easily deduced from the tool's output.
The VTune [13] Performance Analyzer collects, analyzes and displays software performance data from the program level down to a specific function, module or instruction in a developer's source code. VTune runs on Windows and Linux and is targeted at all Intel processors. IDtrace [17] is an instrumentation tool for the Intel architecture on Unix platforms. It produces a variety of trace types, such as profile, memory reference, and full execution traces. Primitive post-processing tools, which read output files, view traces, and compute basic profile data, are included in the IDtrace package.
Cacheprof [18] is an execution-driven memory simulator for the x86 architecture. It annotates each instruction that reads or writes memory and links a cache simulator into the resulting executable. Upon execution, the data references are trapped and sent to the simulator. Besides producing a procedure-level summary, Cacheprof reports the number of memory references and the number of misses for each line of the source code.
FLAT is intended to provide loop- and function-level information for a wide variety of platforms. FLATC works for all platforms to which the GNU C Compiler (gcc) has been ported. FLATSIM is capable of producing loop-level statistics for a variety of platforms such as x86, StrongARM, MIPS and SPARC.
FLAT: FREQUENT LOOP ANALYSIS TOOLS
Instruction profiling tools provide information based on which useful hardware/software partitioning decisions can be made. The Frequent Loop Analysis Toolset (FLAT) is a profiling toolset written in Python that reports the execution time of a given application at the granularity of both loops and functions. Loop profiles can be obtained in two different ways. The first method is to instrument the program at compile time so that it outputs the frequency of each loop; the other is to use an instruction set simulator to find the execution count of each loop. Both methods have their own advantages.

During hardware/software partitioning, frequently executed functions often prove to be the favorite candidates for hardware mapping. However, a frequently executed function could contain many infrequently executed loops that together contribute to the total execution time of the function. Since loops perform the bulk of the computation, the return on silicon real estate is maximized if a frequently executed loop of the program is chosen instead of the frequent function itself. The output provided by FLAT is useful in deciding whether a loop or a function should be mapped onto hardware. FLAT treats functions as loops that iterate once per call.

FLAT comprises two profiling tools: FLATC and FLATSIM. Figure 1 shows the tool flow for FLATC. FLATC uses gcc to obtain the basic block counts of a program. Loop names and function calls are obtained from the disassembled instructions. Once the loops and function calls are identified, the execution percentages can be determined from the execution counts of the basic blocks. Since FLATC uses gcc, it is portable across a variety of platforms and there are no restrictions on the kind of code that can be profiled: if a program can be compiled with gcc, it can be profiled with FLATC. Compile-time instrumentation adds roughly 15% more instructions to the binary in order to accomplish profiling.

Figure 2 shows the tool flow for FLATSIM. FLATSIM uses Virtutech's Simics instruction set simulator to perform the instruction profiling. Simics is a full system simulation platform, capable of simulating high-end target systems with sufficient fidelity and speed to boot and run operating systems and commercial workloads. Simics provides a controlled, deterministic, and fully virtualized environment for a variety of hardware and software engineering tasks. Hence, we decided to instrument the Simics modules to obtain realistic instruction profile estimates. Simics is not an open-source simulator; however, the source code for its add-on modules is included with the distribution.
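Returning briefly to FLATC's compile-time path, the following is a minimal sketch of the kind of post-processing it performs: combining per-basic-block execution counts (as produced by gcc/gcov-style instrumentation) with a block-to-loop mapping recovered from the disassembly to obtain each loop's share of the dynamic instruction count. All names and numbers below are illustrative, not output from the actual tool.

```python
# Sketch of combining per-basic-block execution counts with a block-to-loop
# mapping to get each loop's share of the dynamic instruction count.
blocks = {
    # block id: (execution count, instructions in block, enclosing loop or None)
    "bb1": (1_000_000, 12, "NN_AddDigitMult.1"),
    "bb2": (1_000_000, 4,  "NN_AddDigitMult.1"),
    "bb3": (50_000,    20, "NN_Add.1"),
    "bb4": (10_000,    30, None),   # code outside any loop
}

loop_insts = {}
total_insts = 0
for execs, insts, loop in blocks.values():
    dynamic = execs * insts
    total_insts += dynamic
    if loop is not None:
        loop_insts[loop] = loop_insts.get(loop, 0) + dynamic

for loop, insts in sorted(loop_insts.items(), key=lambda kv: -kv[1]):
    print(f"{loop}: {100.0 * insts / total_insts:.2f}% of dynamic instructions")
```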
The functionality of Simics can be extended by modifying the existing modules or by creating custom modules. One such module supplied with the Simics distribution is the id-splitter module, which handles all cache accesses and redirects them to the instruction or data cache accordingly. FLATSIM relies on getting the instruction profile from a modified version of the id-splitter module. The suggested modification is as follows. A tree structure containing all the loop addresses is introduced into the id-splitter module. During execution, if an instruction belongs to one of the loops, the counter associated with that loop is incremented. Finally, information about loops and function calls is written to a file. FLATSIM analyzes this file and prints out information regarding the loop execution. Table 1 shows the output of FLAT for the Diffie-Hellman (DH) key exchange application from the NetBench [19] benchmark suite.
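The real id-splitter module is a C module inside Simics; the following Python sketch only illustrates the lookup-and-count logic described above, with a sorted table searched by bisection standing in for the tree structure. Addresses and loop names are illustrative.

```python
# Sketch of the per-instruction lookup-and-count logic of the modified
# id-splitter module (logic only; the real module is C code inside Simics).
import bisect

# Loop table sorted by start address: (start, end, loop name) -- illustrative
loops = sorted([(0x400100, 0x400140, "NN_AddDigitMult.1"),
                (0x400200, 0x400260, "NN_Add.1")])
starts = [start for start, _, _ in loops]
counters = {name: 0 for _, _, name in loops}

def count_instruction(addr):
    """Called once per executed instruction address."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0:
        start, end, name = loops[i]
        if start <= addr < end:
            counters[name] += 1

# Example: a few fetched instruction addresses
for addr in (0x400104, 0x400108, 0x400210, 0x400300):
    count_instruction(addr)
print(counters)
```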
FLAT maintains a DAG-like representation to hold the data structures for loops and functions. Every loop and function is associated with a name, and the loops and functions are named in a hierarchical fashion. For example, the loop name <NN_AddDigitMult.1> in Table 1 refers to the first loop in the function NN_AddDigitMult.
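As a small illustration of this naming scheme (with hypothetical function and loop data), each loop can be labeled with its enclosing function's name plus its position within that function:

```python
# Illustrative sketch of the hierarchical naming scheme as we read it: a loop is
# named after its enclosing function plus its position within that function, so
# the first loop of NN_AddDigitMult becomes "NN_AddDigitMult.1".
functions = {
    "NN_AddDigitMult": ["loop at 0x400100", "loop at 0x400180"],
    "NN_Add": ["loop at 0x400200"],
}

loop_names = {}
for func, loops_in_func in functions.items():
    for idx, loop in enumerate(loops_in_func, start=1):
        loop_names[f"{func}.{idx}"] = loop

for name, loop in loop_names.items():
    print(name, "->", loop)
```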
CORE IDENTIFICATION
We define the core as the set of all loops whose execution time is higher than a threshold value. If no loop of an application contributes more than the threshold value, we classify the application as coreless. We refer to cores whose contribution is closer to 90% as strong cores, and to cores whose contribution is closer to 50% as weak cores. Tables 2, 3 and 4 show the loop contributions from the first 10 loops of the SPECINT, MediaBench and NetBench/security applications, respectively. We find that the loop contribution decreases steadily as we move towards the tenth loop.
We identify the cores across the different benchmarks in the following manner. For each application, all frequent loops that take up more than 5% (a fixed threshold) of the total execution time are considered part of the core. From Tables 3 and 4, it is clear that embedded system applications like MediaBench, NetBench and the cryptographic algorithms have a very high percentage of core contribution. Cryptographic applications tend to have much smaller code size than the media or network applications and are hence characterized by the presence of very strong cores. On average, the cores of the cryptographic applications consist of two loops. The NetBench applications have strong cores, and their core size is three loops for most of the applications. Media applications have moderately strong cores, while the core contributions of the SPECINT applications are less significant. For the unoptimized and the optimized versions of each benchmark, Tables 6, 7 and 8 show the total number of instructions in the benchmark, the number of instructions in the core of the benchmark, and the percentage of instructions contributed by the core. The benchmarks were optimized using the GNU C Compiler (gcc), operating at the highest level of optimization (O3), with loop unrolling enabled.
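Putting the definitions above into code, a minimal sketch of this core-identification rule might look as follows. The profile data and the cut-off used to call a core "strong" are our own illustrative assumptions; only the 5% threshold comes from the text.

```python
def identify_core(loop_shares, threshold=5.0):
    """loop_shares: dict mapping loop name -> % of total execution time."""
    core = {name: pct for name, pct in loop_shares.items() if pct > threshold}
    if not core:
        return core, 0.0, "coreless"
    total = sum(core.values())
    kind = "strong" if total >= 80.0 else "weak"  # the 80% cut-off is our assumption
    return core, total, kind

# Hypothetical loop profile (% of execution time per loop)
profile = {"price_out_impl.3": 30.0, "primal_bea_mpp.3": 15.9,
           "refresh_potential.2": 12.0, "misc.1": 2.5}
core, total, kind = identify_core(profile)
print(core, f"-> core contribution {total:.1f}% ({kind} core)")
```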
CORE OPTIMIZATION
In general, compiler optimization reduces the overall dynamic instruction count of a program. It also involves a great deal of code movement and code redistribution. Hence, after optimization, the distribution of cores in a program often changes. Table 5 shows the percentage contribution of the top four loops of the MCF application from the SPECINT suite. One can observe that optimization results in extensive code movement. In Table 5, for the unoptimized version, the second loop in the function <price_out_impl> has the highest contribution. Due to compiler optimizations such as function inlining, an additional loop is introduced into the function, and hence the third loop of <price_out_impl> becomes the most frequent loop after optimization. The contribution of loop <primal_bea_mpp.3> increases from 8.28% to 15.91% while that of loop <refresh_potential.2> decreases. Overall, the contribution of the top four loops to the execution time increases from 57% to 75%. Compiler optimizations thus have a strong impact not only on the size of the computation core but also on its composition and the distribution of the frequently executed loops.
Graph 1. Percentage of execution time spent in the first four loops of the Security and NetBench applications.
If the core is more conducive to optimization than the rest of the program, then most of the optimization will be centered on the core. However, the extent to which the core and the rest of the program are affected by optimization largely depends on the nature of the application. In order to quantify the impact of optimization on the core, we define a new metric called the Core to Program Reduction Ratio (CPRR), the ratio of the decrease in core size to the decrease in program size. CPRR is computed as follows:

CPRR = (Core_unopt - Core_opt) / (Program_unopt - Program_opt) x 100%

where Core_unopt and Core_opt are the dynamic instruction counts of the core before and after optimization, and Program_unopt and Program_opt are the corresponding total dynamic instruction counts.
One can visualize any program as consisting of core and non-core portions. In the definition of CPRR, the numerator denotes the decrease in core instructions between the unoptimized and optimized programs. The denominator is the decrease in the total dynamic instruction count due to optimization. If the decrease in the dynamic instruction count can be considered proportional to the extent of optimization, then a CPRR of 50% would mean that the core and the rest of the program were optimized equally. A CPRR value higher than 50% implies that the core is more amenable to optimization than the rest of the program, while a CPRR of less than 50% implies that the impact of optimization on the non-core portion of the program is higher.
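As a small worked example, using made-up dynamic instruction counts rather than numbers from the tables:

```python
# CPRR = (decrease in core instructions) / (decrease in total program instructions)
def cprr(core_unopt, core_opt, program_unopt, program_opt):
    return 100.0 * (core_unopt - core_opt) / (program_unopt - program_opt)

# Hypothetical dynamic instruction counts: unoptimized vs. -O3
print(f"CPRR = {cprr(800_000, 500_000, 1_000_000, 650_000):.1f}%")
# The core shrank by 300k instructions out of a 350k total reduction, so CPRR is
# about 85.7%, i.e. the core absorbed most of the optimization.
```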
As illustrated in Tables 6, 7 and 8, programs like the security and NetBench applications have high CPRR values, while applications like the SPECINT benchmarks tend to exhibit lower CPRR. We computed the CPRR for SPECINT in order to show the effect of compiler optimization on the kernels of large applications. Of course, the embedded applications (MediaBench, NetBench and the security applications) consist of small kernels, while the SPECINT benchmarks are complete programs. This explains why the computation core in the embedded applications is more explicit and dominant.
PARTITIONING FOR CSoC
In this section we evaluate the potential speedups that can be achieved by mapping the optimized cores to hardware. Configurable System-on-Chip (CSoC) platforms like the Xilinx Virtex-II Pro [7], the Altera Excalibur [8] and the Triscend A7 [6] are examples of architectures that are well suited for migrating the core loops to hardware. The obvious objective of migrating code to hardware is the speedup that can be achieved; one should also note that not all loops are conducive to hardware mapping. Figure 3 shows a target architecture that could benefit from mapping the core loops to hardware. We estimate the execution time on the CSoC as

CSoC_time = (SW_only_time - SW_loop_time) + SW_loop_time / HS

where SW_only_time is the time for software-only execution and SW_loop_time is the time taken on the CPU by the loop that will be mapped to hardware. HS (hardware speedup) denotes the speedup expected on the loop by mapping it to hardware. From past results [1] we have computed this speedup to be 19 in terms of cycle count. However, our experience shows that the clock frequency that can be obtained on an FPGA is about 10 times lower than a CPU frequency. For the remainder of this analysis we therefore assume HS = 1.9.
The overall speedup is the ratio of SW_only_time to the CSoC time. From Tables 9, 10 and 11, it is clear that mapping the first five most frequent loops gives average speedups of 1.17, 1.67 and 1.94 for the SPECINT, MediaBench and NetBench/Security applications, respectively. It should be noted that not all loops are suitable for mapping onto hardware.
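For illustration, the following sketch applies this estimate to a hypothetical profile; the loop shares are made up, and HS = 1.9 follows the assumption above.

```python
# Estimate the CSoC execution time and overall speedup when a set of loops is
# mapped to hardware, following the formula in the previous section.
def csoc_speedup(loop_fractions, hs=1.9):
    """loop_fractions: fraction of SW-only execution time spent in each mapped loop."""
    mapped = sum(loop_fractions)
    csoc_time = (1.0 - mapped) + mapped / hs   # normalized to SW_only_time = 1
    return 1.0 / csoc_time

# e.g. five mapped loops covering 70% of the software-only execution time
print(f"overall speedup = {csoc_speedup([0.30, 0.15, 0.10, 0.10, 0.05]):.2f}x")
```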
CONCLUSION
We propose a loop analysis toolset to support hardware/software partitioning. We provide a Simics-based loop analyzer to profile an application and to fine-tune various architectural aspects for the application. We also provide an instrumentation-based loop analyzer that profiles an application without any significant slowdown compared to the actual execution. For a wide range of benchmarks, we identify the cores of the programs and then study the effect of compiler optimization on the distribution of the cores. We find that the cores are optimized more than the rest of the program. On average, the contribution from the first two to four loops of the embedded applications is roughly 90%, while the first six loops of the SPECINT suite contribute almost 55% of the execution time. We observe that mapping the first five most frequent loops to hardware is beneficial for the MediaBench, network and cryptographic applications.