Embedded processors have been increasingly exploiting hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of these have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. How well this hardware parallelism can be exploited is directly related to the amount of parallelism inherent in a target application. In this paper we evaluate the performance potential of different types of parallelism, viz., true thread-level parallelism, speculative thread-level parallelism and vector parallelism, when executing loops. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0 and the MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of different types of thread parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include parallelization of libraries such as libc and design of parallel algorithms to allow maximal exploitation of parallelism. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution.
INTRODUCTION
In recent years many hardware techniques have been incorporated into embedded processors for exploiting parallelism. These include vector units, multiple processors or cores, hyper-threading and special-purpose accelerators. There are several reasons for this trend. One is the increasing transistor budget, which has enabled the realization of multi-core processors; another is the increasing emphasis on low-power design. The trend is further stimulated by the increased performance requirements in many application areas, e.g., networking, xDSL, security, wireless and game applications.
This has led to the development of heterogeneous and application-specific MpSoCs [1, 2, 3, 4, 5] and embedded processors with vector capabilities [6, 7] or multithreading capabilities [8]. Such systems may also be augmented with co-processors which serve as accelerators for application-specific functions. For example, Intel's IXP2850 network processor [8] integrates cryptography engines to facilitate fast encryption/decryption; TI OMAP chips have an integrated DSP co-processor [9].
These trends are in fact similar to those in high-performance processor design, e.g., Intel's dual-core Yonah processor [10], Intel's dual-core and hyper-threaded Xeon® processor [11] and the IBM/Sony/Toshiba Cell processor [12] with its eight specialized SIMD units for data-intensive processing. The trend towards integrating more and more cores on an MpSoC is ramping up [13]. Some trends in this domain have not yet migrated to the embedded processor domain, e.g., speculative multithreading [14], but it may be only a matter of time.
The aforementioned hardware techniques help to achieve better performance than a standard uniprocessor by exploiting hardware parallelism. However, this holds only for applications that are written or compiled to exploit the available hardware parallelism. The actual improvement is limited by a "mismatch" between the type of parallelism inherent in a given application and the type of parallelism supported by the underlying hardware. The types in question are instruction-level parallelism (ILP), SIMD (single instruction multiple data) parallelism, MIMD (multiple instruction multiple data) [15] or true thread-level parallelism (TLP), and speculative thread-level parallelism (sTLP). Thus the improvement is limited by how well the compiler can generate code that "matches" the hardware parallelism to the application parallelism. For applications which contain multiple types of parallelism, it is even more challenging for the compiler to exploit the available parallelism efficiently.
Loop-level parallelization has been one of the most widely used techniques for program parallelization. This can be attributed to the fact that in most applications loops account for a large percentage of the total execution time. Given a multi-core processor, the iterations of a DOALL loop are executed in parallel by mapping them onto different threads. This corresponds to an instance of exploitation of TLP. On the other hand, loops with dependences between their iterations (referred to as Non-DOALL loops in the rest of the paper) can be parallelized either speculatively or with explicit synchronization. The former corresponds to an instance of exploitation of sTLP.
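To make the distinction concrete, the following minimal C sketch contrasts a DOALL loop with a Non-DOALL loop. The function names and the use of an OpenMP pragma are illustrative assumptions and are not taken from the benchmarks; when OpenMP support is not enabled, the pragma is skipped and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* DOALL: each iteration reads and writes only its own element, so the
   iterations can be mapped onto different threads (exploitation of TLP).
   Assumes dst and src do not overlap. */
void scale(float *dst, const float *src, float k, size_t n)
{
    assert(dst != NULL && src != NULL);
#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (long i = 0; i < (long)n; i++)
        dst[i] = k * src[i];
}

/* Non-DOALL: iteration i reads the value written by iteration i-1
   (a loop-carried flow dependence). Running the iterations in parallel
   requires explicit synchronization or speculation (sTLP). */
void prefix_sum(float *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

In this sketch `scale` is the kind of loop an auto-parallelizing compiler can thread directly, whereas `prefix_sum` would be left serial unless it is restructured or executed speculatively.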
In this paper we evaluate the performance potential of different types of parallelism in embedded applications. The evaluation is performed at the loop level. Due to space limitations, we only present the analysis for the innermost loops. The loop coverage, defined as the percentage of the total execution time spent in the loops, is obtained by first instrumenting the code of each application with hardware performance counters during compilation and then executing it on a 3.6 GHz Intel® Xeon® processor. We use this processor instead of an embedded processor because it integrates vector and multithreading support and because an auto-parallelizing compiler is available for it.
The main contributions of the paper are as follows:
• First, a loop-level characterization of the industry-standard EEMBC 1.1, EEMBC 2.0 and the MiBench embedded benchmark suites is presented.
• Second, the performance potential of true TLP is estimated. In other words, an optimistic upper bound on the speedup achievable (at the loop level) via auto-parallelization [17] is determined. Given an application, this provides an estimate of the number of cores required for maximal exploitation of TLP.
• Third, the performance potential of sTLP is estimated. In other words, an optimistic upper bound on the speedup achievable via thread-level speculation (TLS) is determined.
• Fourth, the impact of vectorization on performance is evaluated. To our surprise, we find that the amount of SIMD parallelism in the different application classes, except multimedia applications, is rather limited. This is explained, in part, by the compiler's choice of exploiting TLP or MIMD parallelism.
• Fifth, we identify limitations of parallelization of the innermost loops and suggest ways to alleviate them with user and compiler assistance.
This type of analysis exposes the parallelism inherent in a given application. The analysis can be used in a variety of ways. A hardware designer can use it to (a) make design decisions such as choosing between multiple cores, multithreading support and vector units, based on the performance, power and cost trade-offs; (b) design application-specific processors. Application developers can use it to modify programs to better exploit a given type or types of hardware parallelism, or to assist the compiler with "hints" in the form of directives/pragmas which guide it to generate better parallel code. Compiler writers can use it (i) to develop better code generation strategies; (ii) for profitability analysis; and (iii) to develop ways to exploit multiple types of hardware parallelism. For example, in susan (an application in MiBench), one observes that parallel (non-vector) loops account for 68.4% of the total execution time. In contrast, vector loops account for less than 1% of the execution time. This is in part due to the code generation strategy, which puts a premium on MIMD parallelism and leaves no vector parallelism "left over". Given efficient support for MIMD parallelism (or TLP) in loops, it may not be useful to provide additional hardware support for vector execution in this case. Instead, it is better to increase the number of cores or provide multithreading support for exploiting TLP.
The rest of the paper is organized as follows. Loop-level evaluation of the performance potential of the different types of parallelism is presented in Section 2. Specifically, the loop-level characterization of the EEMBC 1.1, EEMBC 2.0 and MiBench suites is presented in subsection 2.1, the evaluation of the performance potential of TLP and sTLP is presented in subsection 2.2, and the evaluation of the impact of vectorization on performance is presented in subsection 2.3. Finally, we conclude in Section 3.
LOOP-LEVEL PARALLELISM ANALYSIS
In this section, we first present a loop-level characterization (as loop parallelization has been one of the most widely used techniques for program parallelization) of EEMBC 1.1, EEMBC 2.0 (Networking) and MiBench (described below). To our knowledge, this is the first characterization of this kind for these suites. Note that no source code changes were made in any of the benchmark suites during the performance evaluation. Next, we discuss the loop-level speedup achievable by exploiting loop-level TLP and sTLP. Finally, we estimate the performance potential of vectorization for embedded applications.

Let us start with a brief overview of the applications in the industry-standard embedded benchmark suites EEMBC 1.1 and EEMBC 2.0 [18] and the academic embedded benchmark suite MiBench 1.0 [19]. Both EEMBC 1.1 and MiBench are divided into multiple classes which are representative of different embedded application domains such as automotive, consumer, networking and office (see Tables 1 and 3). The EEMBC 2.0 benchmark suite is currently under development; as of now, it has applications from the networking domain only (see Table 2). All benchmarks are written in the C language. Interestingly, benchmarks range from a modest 100 lines of code to ≈ 208K lines of code. As will be shown in the next section, memory and I/O operations account for a large percentage of the total execution time in many benchmarks. From Tables 1, 2 and 3 we note that the suites differ in the selection of the benchmarks. In view of this, we carried out the performance analysis on all three suites so as to cover a wider spectrum of applications.
The results presented in the rest of this section were obtained by running the benchmarks with the reference data sets (large data sets in case of MiBench) on the Intel® Xeon® processor. The configuration details of the system are given in Table 4.
Loop-level Characterization
Table 4: Experimental Setup

In this subsection, we present the innermost loop coverage, defined as the percentage of the total execution time spent in such loops, for applications in EEMBC 1.1, EEMBC 2.0 and MiBench. Note that the coverage numbers presented in this subsection correspond to only those loops that are present in the (optimized) application code; loops in the library functions are excluded. Also, the coverages correspond to single-thread execution of the parallelized code. Of course, the coverages shown later in this subsection depend on the particular algorithm selected, its implementation and the compiler used.
To obtain the coverage, the code generator of the Intel® compiler was modified to insert hardware performance counters automatically. The point at which these counters are inserted (amongst the different phases of the compilation process) has a direct effect on the coverage analysis, because inserting them early in the compilation process can potentially disable some of the optimizations. Therefore, it is critical to make sure that the counters are inserted only during the code generation phase. In our experiments, we account for the overhead incurred due to the insertion of these counters. A detailed discussion of our instrumentation support is beyond the scope of the paper. The total loop coverage is computed as follows: first, tick%, defined as the total time spent in a given loop, is determined for each loop. Next, self%, defined as the execution time spent in a loop excluding the time spent in any function or any other loop that may be embedded in it, is determined for each loop. Finally, the self% values of all the loops are summed to obtain the total loop coverage for a given application.
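The coverage computation described above can be sketched in C as follows. This is only an illustration under stated assumptions: the record layout `struct loop_profile`, its field names and the function names are hypothetical and do not describe the actual instrumentation support in the compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-loop profile record (all field names are assumptions). */
struct loop_profile {
    double tick_pct;    /* tick%: share of execution time spent in the loop */
    double nested_pct;  /* share of tick% spent in embedded loops and callees */
};

/* self% = tick% minus the time spent in embedded loops and functions. */
double self_pct(const struct loop_profile *lp)
{
    assert(lp->nested_pct <= lp->tick_pct);
    return lp->tick_pct - lp->nested_pct;
}

/* Total loop coverage = sum of self% over all loops of the application. */
double total_loop_coverage(const struct loop_profile *loops, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += self_pct(&loops[i]);
    return total;
}
```

Summing self% rather than tick% avoids counting the time of an inner loop twice, once for itself and once for each loop that encloses it.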
The number of innermost loops executed and their coverage for select applications in EEMBC 1.1, 2.0 and MiBench are given in Tables 6, 5 and 7 respectively. In case of EEMBC 1.1 and EEMBC 2.0, results are shown only for those applications which were executed using the testing framework that comes with the corresponding suite.
From the tables we observe that in some benchmarks the innermost loops have a low coverage. The reasons for this are discussed below.
In many benchmarks the low loop coverage can be attributed to the large amount of time spent in library calls. For example, the I/O function al_write_con in the ttsprk816 benchmark has a coverage of 17.25% (ttsprk816 has a very low (< 1%) innermost loop coverage, which is why we do not list it in Table 6). However, the function does not contain any loops (see below). Most of the coverage of the function is due to the fwrite library call.
Similarly, the library calls vsprintf and strlen account for the entire coverage (= 35.65%) of the function i_printf of the ttsprk816 benchmark. Further, the library calls printf, strcmp and strcat account for almost the entire coverage of the function th_main. In a similar vein, in many benchmarks, I/O accounts for a large part of the total execution time.
The above observation provides valuable guidance for design of high performance embedded systems and optimization of embedded applications: for applications such as ttsprk816, it is critical to design better I/O mechanisms and address optimization of library routines rather than optimizing the application itself, in order to achieve high performance.
In benchmarks such as canrdr01 the low coverage of innermost loops can be in part attributed to their very small loop bodies (in number of instructions). Also, there exist quite a few functions with large coverage but with no loops. For instance, the function WriteOut (shown below, taken from bmark.c:1182) accounts for 15.79% of the total execution time but has no loops. In benchmarks such as canrdr01 the outermost loops have a large coverage (self%). For instance, the function t_run_test (bmark.c:186) has a large coverage of 62.95%. The outermost loop at line 342 in this function accounts for most of the function's coverage. The aforementioned functions are called in this loop. This illustrates the need for exploiting parallelism at higher levels, such as the outermost-loop level or the function level. However, this may be difficult to exploit due to the I/O involved. Likewise, exploitation of hierarchical parallelism as in parallel multi-way loops [21] can potentially yield better performance. A discussion of techniques exploiting the above is beyond the scope of the paper.
In applications such as bitcount, the low loop coverage is due to the fact that a large amount of time is spent in recursive execution. For instance, the function ntbl_bitcnt (bitcount/bitcnt_4.c:38) is implemented recursively and has a large coverage. In such cases, conversion of recursion to loops may help program parallelization [22, 23]. Alternatively, parallel recursive algorithms must be used.
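As an illustration of converting recursion to a loop, the sketch below counts set bits four bits at a time, first recursively and then iteratively. It is written in the spirit of a table-driven bit counter and is not the benchmark source; the names `nibble_bits`, `count_recursive` and `count_iterative` are assumptions made for this example.

```c
/* Number of set bits in each of the 16 possible 4-bit values. */
static const unsigned char nibble_bits[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Recursive form: count the low nibble, then recurse on the remaining bits. */
unsigned count_recursive(unsigned long x)
{
    unsigned cnt = nibble_bits[x & 0xFUL];
    x >>= 4;
    return x != 0 ? cnt + count_recursive(x) : cnt;
}

/* Loop form: the same computation expressed as a loop, which exposes it
   to loop-level analysis and transformation by the compiler. */
unsigned count_iterative(unsigned long x)
{
    unsigned cnt = 0;
    for (; x != 0; x >>= 4)
        cnt += nibble_bits[x & 0xFUL];
    return cnt;
}
```

Once the computation is a loop, the per-nibble table lookups form a sum reduction, a pattern that loop-level parallelization techniques are designed to handle.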
Another reason for low loop coverage is that in many applications a large amount of time is spent in memory allocation. For instance, the function nbuf_alloc in the tcp benchmark accounts for 14.48% of the total execution time. nbuf_alloc calls the memset library function as part of allocating memory. Similarly, the function tcp_memcpy (which internally calls the memcpy function) has a coverage of 6.02%. To better understand this, let us analyze, as an example, the code of the memset library function (taken from the GNU C library, version 2.4 [24]; see below).
In the code snippet, the while loop at line 23 accounts for most of the time of the entire function. Parallelization or SIMDization of this while loop can potentially lead to better performance.
The above exemplifies the importance of parallelization of libraries, which is part of our current research.
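To illustrate why a fill loop of this kind is a candidate for SIMDization, the following simplified C sketch contrasts a byte-at-a-time fill with a variant that stores eight bytes per iteration. It is not the GNU C library implementation; the function names are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte-at-a-time fill: one store per iteration. */
void fill_bytes(unsigned char *dst, unsigned char c, size_t len)
{
    while (len > 0) {
        *dst++ = c;
        len--;
    }
}

/* Word-at-a-time fill: the byte is replicated into a 64-bit pattern and
   stored eight bytes per iteration; the remaining bytes are filled singly.
   memcpy is used for the wide store so that unaligned destinations stay
   well defined. */
void fill_words(unsigned char *dst, unsigned char c, size_t len)
{
    uint64_t pattern = 0x0101010101010101ULL * c;
    while (len >= sizeof pattern) {
        memcpy(dst, &pattern, sizeof pattern);
        dst += sizeof pattern;
        len -= sizeof pattern;
    }
    while (len > 0) {
        *dst++ = c;
        len--;
    }
}
```

The iterations of either loop write disjoint memory and carry no dependences, so the same idea extends to wider vector stores or to splitting the buffer across threads.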
TLP vs. sTLP
Recall that parallel execution of DOALL loops corresponds to exploitation of TLP, whereas speculative parallel execution of Non-DOALL loops corresponds to an instance of sTLP. Figure 1 shows the DOALL/Non-DOALL breakdown of the coverage of innermost loops for EEMBC 1.1, EEMBC 2.0 and MiBench respectively. The coverage of all DOALL loops corresponds to an upper bound on the performance gain achievable via TLP. In other words, assuming an oracle TLP mechanism whereby the execution time of a candidate loop can be reduced to zero, the fraction of the execution time that can be eliminated via TLP is equal to the total coverage of the DOALL loops. In practice, however, the gain achievable via TLP is limited by many factors such as the threading overhead. Thus, the coverages shown in Figure 1 are very optimistic upper bounds.
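The relation between DOALL coverage and conventional speedup can be made explicit with Amdahl's law. The sketch below is an illustration only; the function names are assumptions, coverage is expressed as a fraction between 0 and 1, and all threading overheads are ignored.

```c
#include <assert.h>

/* Speedup bound when the DOALL loops, which account for the fraction
   `coverage` of the serial execution time, are spread evenly over
   `cores` threads with no overhead (Amdahl's law). */
double tlp_speedup_bound(double coverage, unsigned cores)
{
    assert(coverage >= 0.0 && coverage < 1.0 && cores > 0);
    return 1.0 / ((1.0 - coverage) + coverage / cores);
}

/* Oracle limit: the execution time of the DOALL loops is reduced to zero. */
double tlp_oracle_speedup(double coverage)
{
    assert(coverage >= 0.0 && coverage < 1.0);
    return 1.0 / (1.0 - coverage);
}
```

For example, a DOALL coverage of 24% gives an oracle speedup of 1 / (1 - 0.24), roughly 1.32, no matter how many cores are available.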
From Figure 1(a) we see that in 6 out of 16 applications DOALL loops account for most of the total loop coverage. On average, TLP has a performance potential of 24%, which is rather small. On the other hand, from Figure 1(b) we note that in only 5 out of 20 applications DOALL loops account for most of the total loop coverage. On average, TLP has a performance potential of 22%. In view of the increasing emphasis on putting more cores on a chip, this gives rise to a need to revisit the design of parallel algorithms and parallel programming models.
From Figure 1 we see that Non-DOALL innermost loops have a large coverage: on average, 31% in EEMBC 1.1 and 2.0 and 39% in MiBench. This corresponds to an upper bound on the speedup achievable via sTLP. However, this is a loose upper bound, as threaded execution of some loops can potentially result in performance degradation because of the threading and misspeculation overhead. Most of these Non-DOALL loops are DO loops with inter-iteration dependences or are true WHILE loops. Various schemes, viz., data dependence speculation (DDS), control speculation (CS) and data value speculation (DVS), have been proposed for extracting parallelism from such loops.
For example, the loop at bmark.c:286 in rgbhpg01 (with a coverage of 75.7%) can be parallelized using DVS.
On the other hand, our analysis shows that in most cases any one of the three aforementioned sTLP techniques, applied on its own, has very low performance potential. For instance, the loop at bmark.c:320 in a2time01 (with a coverage of 66.2%) can be parallelized iff all the three types of sTLP

             10    0   10
cjpeg       107   18   98
djpeg       110   16   90
filters      30    0   13
pktflow      19    0   19
bezier01     12    0   12
rotate01     11    0   11
autocor00    17    3   15
conven00     15    0   15
fbital00      6    5   14
fft00        22    0   22
viterb00     16    0
Impact of Vectorization
Previous studies have shown that vector execution yields high speedups in video and other multimedia applications [27, 28]. In this subsection, we evaluate the impact of SSE-like vectorization on the performance of several applications taken from EEMBC 1.1 and 2.0. For this, we determine the difference in loop coverage with and without vectorization. Without vectorization, certain transformations such as loop distribution may not kick in during the optimization phase, since such optimizations are applied (in some cases) specifically to enable vectorization. Consequently, the dynamic loop counts of runs with and without vectorization differ (refer to Table 8).
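The effect of loop distribution on both vectorizability and loop count can be seen in the following minimal C sketch. The functions are hypothetical examples, not benchmark code, and assume that the arrays a, b and c do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: the recurrence on a[] carries a dependence from one
   iteration to the next, which blocks vectorization of the whole body. */
void fused(float *a, float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 1; i < n; i++) {
        a[i] = a[i - 1] + c[i];  /* loop-carried recurrence */
        b[i] = 2.0f * c[i];      /* independent of other iterations */
    }
}

/* After loop distribution: the recurrence stays in a serial loop, while
   the independent statement gets its own loop that can be vectorized.
   One loop has become two, which changes the dynamic loop count. */
void distributed(float *a, float *b, const float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 1; i < n; i++)
        a[i] = a[i - 1] + c[i];
    for (size_t i = 1; i < n; i++)
        b[i] = 2.0f * c[i];
}
```

Both functions compute the same results; only the loop structure seen by the later compiler phases, and hence by the coverage measurement, differs.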
Interestingly, from the table we see that very few loops are actually vectorized by the Intel® compiler (the vector instructions are executed on the MMX units). It is important to note that the number of loops vectorized is different from the number of vectorizable loops. To our surprise, we observe that in applications such as FFT, no loops are vectorized. This is contrary to previous studies such as [29], where Franchetti and Püschel showed that FFT is vectorizable. This "anomaly" can be attributed to the specific implementation style of FFT in EEMBC 1.1. For example, let us consider the innermost loop of the FFT computation (taken from fft00:fft00.c:171).

Clearly, the above loop cannot be vectorized at compile time because the values of the variables n1, n2 and DataSize are unknown, which can potentially lead to a flow dependence [30] between successive iterations of the loop. The above loop (and such loops in general) can be vectorized at run time (dynamically [31]), subject to the run-time values of variables such as n1 and n2.

In a similar vein, the set of optimizations, their parameters (such as the loop unroll factor) and the order in which they are applied differ depending on whether vectorization is enabled. This has a direct influence on the loop coverage: it increases in most cases and decreases in some. The decrease can be in part attributed to the limitation of the heuristic which determines the benefit of vectorizing a loop.
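The idea of vectorizing at run time, once the offsets are known, can be sketched as follows. The loop shown is a hypothetical stand-in and not the FFT loop from the benchmark; the function names, the vector length VLEN and the scaling by 0.5 are assumptions made for this example. The chunked version mimics vector execution by performing all loads of a chunk before any of its stores.

```c
#include <stddef.h>

enum { VLEN = 4 };  /* assumed vector length, in elements */

/* Reference loop: data[i + n2] = 0.5f * data[i + n1]. */
void scalar_loop(float *data, size_t count, size_t n1, size_t n2)
{
    for (size_t i = 0; i < count; i++)
        data[i + n2] = 0.5f * data[i + n1];
}

/* No flow dependence exists if a later iteration never reads an element
   written by an earlier one: either writes trail the reads (n2 <= n1) or
   the written range lies entirely beyond the elements that are read. */
int no_flow_dependence(size_t count, size_t n1, size_t n2)
{
    return n2 <= n1 || n2 - n1 >= count;
}

/* Vector-style execution: load and compute a whole chunk, then store it. */
void chunked_loop(float *data, size_t count, size_t n1, size_t n2)
{
    size_t i = 0;
    for (; i + VLEN <= count; i += VLEN) {
        float tmp[VLEN];
        for (size_t k = 0; k < VLEN; k++)
            tmp[k] = 0.5f * data[i + k + n1];
        for (size_t k = 0; k < VLEN; k++)
            data[i + k + n2] = tmp[k];
    }
    for (; i < count; i++)  /* leftover iterations */
        data[i + n2] = 0.5f * data[i + n1];
}

/* Run-time dispatch: take the vector-style path only when it is safe. */
void scale_shifted(float *data, size_t count, size_t n1, size_t n2)
{
    if (no_flow_dependence(count, n1, n2))
        chunked_loop(data, count, n1, n2);
    else
        scalar_loop(data, count, n1, n2);
}
```

The check is conservative: a flow dependence whose distance is at least VLEN would also permit chunked execution, but the sketch falls back to the scalar loop in that case.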
The impact of vectorization on the loop coverage of applications with vector loops is shown in Table 9. From the table we see that for each application, except autocor00, the reduction in loop coverage (which relates to achievable speedup) due to vectorization is rather small. This calls for the development of new algorithms, data structures and coding guidelines which are amenable to the exploitation of the SIMD parallelism inherent in embedded applications. Further, profitability analysis techniques need to be developed to balance the trade-off between TLP and SIMD parallelism.
CONCLUSIONS
We presented an innermost-loop-level characterization of the industry-standard EEMBC 1.1, EEMBC 2.0 and MiBench embedded benchmark suites. It showed that in many programs innermost loops, while perhaps easiest to parallelize, may have low coverage. This is in part due to frequent library function calls, which cannot be analyzed by the compiler, and to the code in outer loops.
We also evaluated the loop-level speedup achievable via auto-parallelization and SIMDization. It showed that the speedup achievable via the latter is rather low (as compared to CPU applications [32]). The number of loops parallelized is also quite low in many applications. This is in part due to the compiler's profitability analysis for the target architecture: small parallel loops cannot be parallelized efficiently, even though the task startup overhead is quite modest, and are therefore marked serial.
The results indicate that while loop auto-parallelization is a good starting point, it needs a number of additional "improvements". First, parallelism also needs to be sought at higher levels, such as outer loops with high coverage. However, at such levels auto-parallelization is hard and user assistance, for instance via OpenMP pragmas, is needed. Second, our analysis indicates that libraries have to be further analyzed and optimized to expose parallelism; e.g., the sequential FFT code in EEMBC is unsuitable for parallel systems. All of these improvements will help in exploiting the performance potential of emerging multi-core systems efficiently and in meeting the performance requirements of embedded applications.
