Abstract
Introduction and Related Work
While application-specific, customized logic can dramatically improve the performance of an application, that approach is typically too expensive to justify for most embedded applications. This has given rise to increased adoption of soft core processors, which are reconfigurable general-purpose processors. Examples of soft core processors include Tensilica [26] with Stretch [24], MicroBlaze [29], and LEON [19].
There has been significant work centered around the idea of customizing a processor for a particular application or application set. Arnold and Corporaal [3] describe techniques for compilation given the availability of special function units. Atasu et al. [17] describe the design of instruction set extensions for flexible ISA systems. Choi et al. [8] examine a constrained application space in their instruction set extensions for DSP systems. Gschwind [11] uses both scientific computations and Prolog programs as targets for his instruction set extensions. Gupta et al. [12] developed a compiler that supports performance-model-guided inclusion or exclusion of four functional units: multiply-accumulate, floating point, multiported memory, and a pipelined vs. non-pipelined memory unit. Systems that use exhaustive search for the exploration of the architecture parameter space are described in [13, 18, 22]. Heuristic design-space exploration for application-specific processors is considered in [9]. Pruning techniques are used to diminish the size of the necessary search space in order to find a Pareto-optimal design solution. In [5], the authors use a combination of analytic performance models and simulation-based performance models to guide the exploration of the design search space. Here, the specific application is in the area of sensor networks. Analytic models are used early, when a large design space is narrowed down to a "manageable set of good designs", and simulation-based models provide greater detail on the performance of specific candidate designs. The AutoTIE system [10] is a development tool from Tensilica that assists in instruction set selection for Tensilica processors. This tool exploits profile data collected from executions of an application on the base instruction set to guide the inclusion or exclusion of candidate new instructions.
Sponsored by NSF under grant CNS-0313203.
The work in [2] performs an analytical (hierarchical) search of parameters in their own dimensions, with some full parameter exploration to avoid local minima, for tuning multi-level caches in low-energy embedded systems. The work in [23] explores design options for instruction and data caches, the branch predictor, and the multiplier by dividing the search space into piecewise linear models and solving them using integer linear programming.
There are two main problems with most of these approaches. First, many of the approaches consider only a few parameters for customization or consider only a specific subsystem (such as cache) for a specific purpose (such as energy conservation). Such approaches do not scale well for the large number of reconfigurable parameters in a soft core processor. The second problem is the way application runtime is estimated using analytical models or measured using simulators, which are discussed subsequently.
Performance Measurement
Analytic models can provide the quickest estimations of application performance, and such models are often derived directly from the source code. Examples of the use of analytic models include: [6], which describes an approach for the analytical modeling of runtime, idealized to the extent that cache behavior is not included; and [25], a classic paper on estimating software performance in a codesign environment, which reports accuracy of about ±20%. However, for purposes of application performance improvement, ±20% is a wide deviation. These inaccurate predictions are due to the simplifying assumptions that are necessary to make analysis tractable and are notoriously common when analytic models are used. Moreover, application models often require sophisticated knowledge of the application itself. By contrast, simulation and the direct execution we use are both "black box" approaches that do not require knowledge of the application's implementation.
The method normally used to improve accuracy beyond modeling is simulation. Simulation toolsets commonly used in academia include SimpleScalar [4], IMPACT [7], and SimOS [21]. Given the long runtimes associated with simulation modeling, it is common practice to limit the simulation to a single application run, excluding the OS and its associated performance impact. SimOS does support modeling of the OS, but requires the simulation user to manage the time/accuracy tradeoffs inherent in simulating such a large, complex system. We improve on this through the statistics module of the Liquid Architecture platform [15], which uses a hardware-based, non-intrusive profiler to count the number of clock cycles taken by an application executing directly on a soft core processor. Because it gives accurate runtime measures, we use this profiler in our work.
Background

Liquid Architecture Platform
For our experiments, we use our Liquid Architecture platform [20]. Briefly, this platform instantiates a LEON2 [19] processor and provides a web interface to control the processor, run applications on it, and profile runtime and microarchitecture parameters. The profiling is performed on the application's direct execution on the processor and is hardware-based, non-intrusive and cycle-accurate.
LEON
LEON is an open source implementation of the SPARC V8 architecture, used by the European Space Agency. LEON's microarchitecture is parameterized along the systems of processor, bus, memory controller, peripherals, synthesis, clock, boot and debug. The processor comprises the subsystems of cache (separate instruction and data caches), Integer Unit (IU), Floating-point Unit (FPU), co-processor, Memory Management Unit (MMU) and Debug Support Unit (DSU). Boot and clock options are set once. We do not debug applications at runtime, nor do our applications use peripherals or the MMU. 64KB exceeds the available BRAM by 33%. The FPU is excluded because two of the three supported interfaces (Sun's Meiko and Gaisler's GRFPU) are not free and the third one (LTH) is incomplete. Features such as SDRAM access, which are not part of the out-of-the-box LEON distribution and require custom coding, are also excluded. Figure 1 shows the parameters that impact the performance of our applications, along with their default values. Some are simple "enable or disable" options while others take a range of values.
Constrained Binary Integer Nonlinear Programming
Linear Programming (LP) in standard form is the problem of minimizing a linear function, subject to a finite number of inequality constraints [16] . The following LP problem (or simply, linear program) has n variables, and k inequality constraints.
Minimize $Z(x) = c^T x$ subject to $Ax \ge b$,
where $x \in \mathbb{R}^n$ is the vector of decision variables to be solved for, $Z: \mathbb{R}^n \to \mathbb{R}$ is the objective function, $c \in \mathbb{R}^n$ is the constant vector of cost coefficients, $A \in \mathbb{R}^{k \times n}$ is the matrix of coefficients of the $k$ functional constraints, and $b \in \mathbb{R}^k$ is the constant vector of right-hand sides of the functional constraints. Other forms of LP include maximization and inequalities that are $\le$.
If the decision variables are restricted to be integers, the problem becomes an Integer Linear Program (ILP). If they are further restricted to be binary-valued, the problem is a Binary ILP. In addition, if the objective function or a constraint is nonlinear, then the problem is a Binary Integer Nonlinear Program (BINLP).
Solving an ILP takes time exponential in the number of variables in the worst case. Therefore, depending on how large the number of variables is and on whether there are special structures in the problem that some algorithms can exploit, some ILP problems may not be solvable to optimality. With nonlinear optimization in the standard form, if the nonlinear objective function is not convex or if not all of the nonlinear constraints are concave functions, the algorithm is no longer guaranteed to find the (global) minimum [14].
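To make these definitions concrete, the following minimal sketch solves a tiny BINLP by enumerating every binary assignment. The objective, its coefficients, and the constraint are made up for illustration and are not taken from our formulation; the function name `solve_binlp_bruteforce` is likewise hypothetical. The loop visits 2^n points, which illustrates why enumeration stops being practical as the number of variables grows.

```python
from itertools import product


def solve_binlp_bruteforce(objective, constraints, n):
    """Enumerate all 2**n binary vectors and return the feasible minimizer.

    Returns (best_x, best_value), or (None, None) if nothing is feasible.
    """
    best_x, best_value = None, None
    for x in product((0, 1), repeat=n):
        # Skip assignments that violate any constraint.
        if not all(is_satisfied(x) for is_satisfied in constraints):
            continue
        value = objective(x)
        if best_value is None or value < best_value:
            best_x, best_value = x, value
    return best_x, best_value


# Hypothetical toy problem: the x[0]*x[2] product makes the objective nonlinear.
def toy_objective(x):
    return -3 * x[0] - 2 * x[1] - 4 * x[2] + 5 * x[0] * x[2]


# Hypothetical constraint: at most two of the three variables may be set.
toy_constraints = [lambda x: x[0] + x[1] + x[2] <= 2]

best_x, best_value = solve_binlp_bruteforce(toy_objective, toy_constraints, 3)
print(best_x, best_value)
```

A dedicated solver avoids full enumeration, but the feasible set and objective it works with have the same shape as in this sketch.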
Cost Function
Chip Resource Cost
Instantiating a soft core processor on an FPGA utilizes resources, and the two fundamental ones are lookup tables (LUTs) and Block RAM (BRAM). Their utilizations are measured by actually building the processor from its source VHDL. The total LUTs and BRAM available on the Xilinx Virtex XCV2000E FPGA are 38,400 and 160 respectively; of these, the default (out-of-the-box) LEON configuration utilizes 14,992 (39%) and 82 (51%). Given the difference in their magnitudes, they are normalized as percentages and added together for a unified chip resource cost metric.
Application Runtime Cost
Application runtime is measured by executing the application directly on the soft core processor (LEON) and counting the number of clock cycles the execution takes. We use the non-intrusive, cycle-accurate, hardware-based profiler available through the Liquid Architecture platform.
Total Cost
To be compatible with chip resource cost, application runtime cost is also normalized as a percentage, and the two are added together.
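The normalization can be illustrated with a short sketch. The FPGA totals and the base utilization are the figures quoted above; expressing runtime as a percentage of the base configuration's cycle count is an assumption made here for illustration, and the function names are hypothetical.

```python
TOTAL_LUTS = 38_400  # LUTs available on the Xilinx Virtex XCV2000E
TOTAL_BRAM = 160     # BRAM blocks available on the same FPGA


def chip_cost_percent(luts, bram):
    """Unified chip cost: LUT and BRAM utilization, each as a percentage, summed."""
    return 100.0 * luts / TOTAL_LUTS + 100.0 * bram / TOTAL_BRAM


def runtime_cost_percent(cycles, base_cycles):
    """Runtime as a percentage of the base runtime (assumed normalization)."""
    return 100.0 * cycles / base_cycles


def total_cost(luts, bram, cycles, base_cycles):
    """Sum of the two normalized costs."""
    return chip_cost_percent(luts, bram) + runtime_cost_percent(cycles, base_cycles)


# Default LEON configuration: 14,992 LUTs and 82 BRAM blocks.
base_lut_percent = 100.0 * 14_992 / TOTAL_LUTS
base_bram_percent = 100.0 * 82 / TOTAL_BRAM
base_chip_cost = chip_cost_percent(14_992, 82)
print(round(base_lut_percent), round(base_bram_percent), round(base_chip_cost, 2))
```

Summing percentages keeps the large LUT count from swamping the much smaller BRAM count.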
Benchmarks
The following applications are executed directly on LEON, without an operating system. Hence, they have been modified to avoid making system calls and using stdio. Liquid architecture platform now supports Linux and therefore, future work can make use of it.
Benchmark I - BLASTN
Basic Local Alignment Search Tool (BLAST) [1] programs are the most widely employed set of software tools for comparing genetic material. BLASTN ("N" for nucleotide) is a variant of BLAST used to compare DNA sequences (lower-level than proteins) [20]. BLASTN is computation and memory-access intensive. It has approximately 163 lines of code and its runtime on the default LEON configuration is 10.6 seconds.
Benchmark II - CommBench DRR
DRR is a Deficit Round Robin fair scheduling algorithm used for bandwidth scheduling on network links, as implemented in switches [28]. DRR is computation intensive. It has 117 lines of code and its runtime on the default LEON configuration is 5 minutes.
Benchmark III - CommBench FRAG
Frag is an IP packet fragmentation application. IP packets are split into multiple fragments, for which some header fields have to be adjusted and a header checksum computed before the fragments are forwarded [28]. Frag is computation intensive. It has 150 lines of code and its runtime on the default LEON configuration is 2.5 minutes.
Benchmark IV - BYTE Arith
Arith performs simple arithmetic (addition, multiplication and division) in a loop. It has been used to test processor speed for arithmetic. Arith is not memory intensive. It has 77 lines of code and its runtime on the default LEON configuration is 32 seconds.
Approach
Our goal is to improve performance of a given application through automatic reconfiguration of processor microarchitecture to meet the application's requirements and constraints closely. The approach we take is to consider all parameters that have a bearing on application runtime or hardware resources and to use actual measurements (costs) rather than estimates, to obtain more accurate customization. Despite these, we want our optimization technique to be feasible and scalable.
The first challenge of considering all parameters is that it makes the search space huge. The 79 parameter values in Figure 1, which are really a subset of the reconfigurable parameters in LEON, result in 3,641,573,376 exhaustive configurations. The second challenge comes from measuring actual costs. Costs are measured for all the dimensions that are being optimized and/or constrained. Currently, we optimize application runtime and FPGA resources. The execution times for our benchmarks range from 16 seconds to 9 minutes. We leave it for future work to address very long execution times, possibly through a smart sampling technique. Hardware resource utilizations are measured by actually building processor configurations from the source VHDL. Each build is very time-consuming, on the order of 30 minutes, even on modern computers. These two challenges make the customization harder than a traditional optimization problem because they make it infeasible to enumerate all configurations exhaustively, build an exact model, and search it for the best solution.
The next best approach is to build an approximate model and solve it for an exact solution. We build the model by assuming parameter independence and restricting each parameter to its own dimension. Though our results are no longer guaranteed to be optimal in all cases, Section 5 demonstrates that they are near-optimal in practice. With the assumption of parameter independence, the number of configurations is linear in the number of parameter values: 52 for the parameters in Figure 1. Even if the remaining parameters were included because they benefit other applications, there would still be only 100 configurations, which remains feasible and scalable.
We solve for the optimal solution by formulating the model as a constrained Binary Integer Nonlinear Program. Although the search space is built by considering parameters in their own dimensions, the optimization algorithm evaluates points in between. These points represent configurations that have more than one parameter changed simultaneously. The solver assigns costs to these points through an approximation of actual costs that we provide in the model.
The approach to building the model is summarized as follows. We begin with the default LEON configuration that comes out-of-the-box. We call this the base configuration. We then perturb one parameter at a time, build the processor configuration and measure its chip cost. Next, we execute the application on each configuration and measure the runtime. Finally, we formulate these costs into a BINLP problem and solve for the optimal solution using the commercial Tomlab Mixed Integer Nonlinear Programming solver [27]. Tomlab is a plug-in to Matlab and solves our formulation in seconds. The solution obtained is the recommended microarchitecture configuration for the given application.
Problem Formulation
We formulate the problem of automatic application-specific customization of soft core processor microarchitecture as a Binary Integer Nonlinear Program (BINLP). The objective is to meet the application's runtime requirements and FPGA resource restrictions. The constraint is to select a valid microarchitecture configuration that fits in the available chip resources.
We begin the reconfiguration with the default (out-of-the-box) LEON configuration, which we call the base configuration. The %LUTs and %BRAM remaining unutilized after the base configuration are denoted by $L$ and $B$ respectively. From the base configuration, we change the parameter values one at a time, build a new processor configuration $x_i$ and execute the application on it. For each $x_i$, the differences (in percentage) in LUTs, BRAM and application runtime over the base configuration are denoted by $\lambda_i$, $\beta_i$ and $\rho_i$ respectively.
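The one-at-a-time measurement loop can be sketched as follows. The callables `measure_chip` and `measure_runtime` are hypothetical stand-ins for a real processor build and a real profiled run, and the stub functions return made-up numbers. Expressing the LUT and BRAM deltas as percentages of the FPGA totals and the runtime delta as a percentage of the base runtime is an assumption made here for illustration.

```python
TOTAL_LUTS = 38_400  # LUTs on the Xilinx Virtex XCV2000E (from the text)
TOTAL_BRAM = 160     # BRAM blocks on the same FPGA (from the text)


def build_cost_table(base_config, perturbations, measure_chip, measure_runtime):
    """Perturb one parameter at a time and record percentage deltas vs. the base.

    measure_chip(config) -> (luts, bram) and measure_runtime(config) -> cycles
    are hypothetical callables standing in for a real build and a real run.
    """
    base_luts, base_bram = measure_chip(base_config)
    base_cycles = measure_runtime(base_config)
    table = []
    for param, value in perturbations:
        config = dict(base_config)
        config[param] = value  # change exactly one parameter from the base
        luts, bram = measure_chip(config)
        cycles = measure_runtime(config)
        table.append({
            "param": param,
            "value": value,
            "lambda": 100.0 * (luts - base_luts) / TOTAL_LUTS,
            "beta": 100.0 * (bram - base_bram) / TOTAL_BRAM,
            "rho": 100.0 * (cycles - base_cycles) / base_cycles,
        })
    return table


# Made-up measurement stubs, for illustration only.
def fake_chip(config):
    luts = 14_992 + 384 * (config["dcache_sets"] - 1)
    bram = 82 + 2 * (config["dcache_setsize_kb"] - 4)
    return luts, bram


def fake_runtime(config):
    return (1_000_000
            - 20_000 * (config["dcache_sets"] - 1)
            - 12_500 * (config["dcache_setsize_kb"] - 4))


base = {"dcache_sets": 1, "dcache_setsize_kb": 4}
table = build_cost_table(base, [("dcache_sets", 2), ("dcache_setsize_kb", 8)],
                         fake_chip, fake_runtime)
for row in table:
    print(row)
```

Each row of the resulting table corresponds to one binary variable in the formulation, so the table grows linearly with the number of parameter values.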
Objective Function
The objective function is to minimize the costs of the dimensions being optimized. The dimensions that we optimize are application runtime and chip resources and the following equation minimizes their costs. We use weights to optimize certain dimensions over others.
$w_1$ and $w_2$ are independent. $w_1$ is made to dominate $w_2$ for application runtime optimization and $w_2$ is made to dominate $w_1$ for FPGA resource optimization.
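The objective equation itself is not reproduced in this text. As a hedged sketch only, one form consistent with the definitions of $\rho_i$, $\lambda_i$, $\beta_i$ and the weights $w_1$, $w_2$ is shown here. It assumes that costs add linearly over the selected $x_i$ and that the LUT and BRAM deltas are summed into a single chip cost term; the formulation actually solved may differ, in particular in any nonlinear terms.

```latex
\min_{x \in \{0,1\}^n} \;\; w_1 \sum_{i=1}^{n} \rho_i \, x_i \;+\; w_2 \sum_{i=1}^{n} \left( \lambda_i + \beta_i \right) x_i
```

Under this sketch, a large $w_1$ relative to $w_2$ favors configurations with negative $\rho_i$ (shorter runtime), and a large $w_2$ favors configurations that reduce LUT and BRAM utilization.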
Constraints
Parameter Validity Constraints
$x_i$ represents a new processor configuration resulting from a change in one parameter value from the base configuration. $x_i$ is binary, i.e., it selects between two values: on/off or two integer values. This implies that parameters with more than two values need more than one $x_i$. Therefore, for such parameters, we need to ensure that only one variable is selected. All such constraints are presented below. There are additional constraints imposed by LEON. The icache and dcache replacement policy of LRR (Least Recently Replaced) can be used only with 2-way associativity (2 sets), while LRU (Least Recently Used) can be used with any multi-way associativity.
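The validity rules described above can be sketched as simple checks. The variable indices and grouping are hypothetical, the sketch assumes that selecting no variable of a multi-valued parameter keeps its base value (so the check is "at most one"), and any replacement policy other than LRR and LRU is assumed to be unrestricted.

```python
def at_most_one_per_parameter(x, groups):
    """Check that each multi-valued parameter has at most one selected variable.

    x is a list of 0/1 values; groups lists, per parameter, the indices of the
    binary variables that encode that parameter's alternative values.
    """
    return all(sum(x[i] for i in group) <= 1 for group in groups)


def replacement_policy_ok(policy, n_sets):
    """LEON restriction: LRR needs exactly 2 sets, LRU needs 2 or more sets."""
    if policy == "LRR":
        return n_sets == 2
    if policy == "LRU":
        return n_sets >= 2
    return True  # other policies are assumed to carry no such restriction


# Hypothetical encoding: variables 0-2 are alternative values of one
# parameter, variables 3-4 are alternative values of another.
groups = [[0, 1, 2], [3, 4]]
valid_selection = [1, 0, 0, 0, 1]
invalid_selection = [1, 0, 0, 1, 1]  # two values chosen for the second parameter

print(at_most_one_per_parameter(valid_selection, groups))
print(at_most_one_per_parameter(invalid_selection, groups))
print(replacement_policy_ok("LRR", 4))
```

In the optimization model, such rules appear as algebraic constraints on the binary variables, for example a sum over a group bounded by 1, not as procedural checks.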
FPGA Resource Constraints
For the FPGA resources considered, the utilization for each $x_i$ should fit in what is available after the base configuration.
Cache size (of both icache and dcache) is expressed in terms of two parameters in LEON viz. number of cache sets and size of each set. Accounting for this, the constraint equations for hardware resources become:
The convexity of these nonlinear functions is conditional on the values of $x_i$. This means that the optimization algorithm is no longer guaranteed to find the global optimum in all cases. Therefore, to simplify the problem formulation, we leave the constraint on LUTs as a linear function, since the variation in LUT utilization is minimal. We analyze the effect of this in Section 6.
Analysis
In this section, we analyze the impact of our assumption of parameter independence. The naive approach of comparing our solution to the one obtained by generating all configurations exhaustively is infeasible for us. The next logical approach is to scale down the problem space such that it becomes feasible to generate all configurations exhaustively. When these two solutions compare favorably, we show that our optimization algorithm works as well as can be expected. We chose the dcache subsystem for this purpose because we had manually optimized the cache subsystem for the BLASTN application in [20]. More fundamentally, the cache subsystem shows tangible variations in application performance and chip resource utilization for changes in parameter values. As enumerated in Section 4, dcache has 7 reconfigurable parameters: number of sets, size of each set, associativity, line size, replacement policy, and the fast read and write options. The numbers of integer values of these parameters are 4, 7, 4, 2, 3, 2 and 2 respectively. Their exhaustive combinations number 2,688, and it would take at least 56 days to generate them all. That is not scalable and therefore, we consider only two parameters, number of sets and set size, which result in 28 combinations. We chose these two parameters because perturbing them affects both LUT and BRAM utilization, to varying degrees. The base configuration has 1 set of 4KB size. Figure 2 shows BLASTN runtimes and chip resource costs for the exhaustive combinations of dcache sets and set size. Optimizing for runtime, a simple sort yields the optimal configuration of 2 sets of 16KB each, i.e., a total of 32KB. The performance gain is 3.63% over the base configuration, utilizing no additional LUTs but 39% more BRAM than the base configuration.
Figure 2. Dcache exhaustive for BLASTN
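The counts quoted above can be checked with a few lines of arithmetic. The value counts and the 30-minute build time come from the text; the dictionary keys are descriptive labels chosen here, and counting one-at-a-time perturbations as the number of non-base values per parameter is an assumption for illustration.

```python
from math import prod

# Number of values per dcache parameter, in the order listed in the text.
value_counts = {
    "number of sets": 4,
    "set size": 7,
    "associativity": 4,
    "line size": 2,
    "replacement policy": 3,
    "fast read": 2,
    "fast write": 2,
}

BUILD_MINUTES = 30  # approximate time to build one processor configuration

# Exhaustive enumeration multiplies the value counts.
exhaustive = prod(value_counts.values())
exhaustive_days = exhaustive * BUILD_MINUTES / (60 * 24)

# Restricting to two parameters leaves a small grid.
two_parameter_grid = value_counts["number of sets"] * value_counts["set size"]

# One-at-a-time perturbation adds the non-base values (assumed counting).
one_at_a_time = sum(count - 1 for count in value_counts.values())

print(exhaustive, exhaustive_days, two_parameter_grid, one_at_a_time)
```

The product grows multiplicatively with each added parameter while the one-at-a-time count grows additively, which is the gap that the independence assumption exploits.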
BLASTN
We then compare this to the configurations that we evaluate under our approach, the optimizer. Figure 3 shows this comparison. Optimizing only for application runtime, the configuration we select has a set size of 32KB, which is the same cache size as selected by the exhaustive search but organized slightly differently. The performance gain with this configuration is 3.61%, which is 0.02% less than that of the optimal configuration from the exhaustive approach; LUT utilization is 1% less here and BRAM utilization is the same. The fact that our optimization achieved a performance gain within 0.02% of the exhaustive solution, with a 1% reduction in LUTs (chip resource cost), in spite of the assumption of parameter independence, is very encouraging.
Other Benchmarks
Results for the other benchmarks discussed in Section 2.5 are even better, as they match the solution from the exhaustive approach. They are shown in Figure 4. This gives us further confidence that our customization finds valid and near-optimal configurations, despite the assumption of parameter independence.
Further Observations
The results for DRR and FRAG are 2x16 configurations. These are not configurations that we provide directly in the model. This demonstrates that, while we construct the search space reconfiguring parameters in their own dimensions, the optimization algorithm considers points in between, which are points reconfigured simultaneously in many dimensions. Next, the fact that we are able to build the solutions proves that we generate valid configurations. Finally, the configurations selected are indeed application-specific.
Results
The research objectives of our experiments are to find out how much improvement we gain from application-specific microarchitecture customization and to demonstrate that the customization is indeed application-specific. The results in Section 5 address both, for a subset of dcache parameters. This section presents results for all LEON parameters shown in Figure 1. Section 6.1 shows the performance gains to be 6.15%-19.39%, and Figure 5 and Figure 7 show that the customization is indeed application-specific. Due to space constraints, we are restricted here to showing chip resource and application runtime costs only for our solutions, rather than for all the configurations that we consider.
We first present results of optimizing runtime over chip resources, by setting $w_1$ to be much higher than $w_2$, and then vice versa. Figures show only the parameters that are reconfigured from the base configuration.
Application Performance Optimization
We optimize application performance over chip resources by setting $w_1 = 100$ and $w_2 = 1$. Figure 5 shows the parameters reconfigured from the base configuration, along with results from the actual build of the solution. Based on the latter, the runtime decreases for the four applications BLASTN, DRR, FRAG and Arith are 11.59%, 19.39%, 6.15% and 6.49%, respectively, over the runtimes on their respective base configurations. The linear approximations performed by the optimizer estimate the performance improvements to be 11.77%, 39.14%, 7.67% and 6.49%, respectively. The range of overestimation is 0-19.75%. Due to space constraints, we present costs only for the BLASTN reconfigurations, in Figure 6.
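The overestimation range follows directly from the figures above, as this short check shows. The numbers are those quoted in the text; the variable names are chosen here for illustration.

```python
# Runtime decrease (%) measured on the actual build of each solution.
actual_gain = {"BLASTN": 11.59, "DRR": 19.39, "FRAG": 6.15, "Arith": 6.49}

# Runtime decrease (%) estimated by the optimizer's linear approximation.
estimated_gain = {"BLASTN": 11.77, "DRR": 39.14, "FRAG": 7.67, "Arith": 6.49}

# Overestimation in percentage points: estimate minus measurement.
overestimation = {
    app: round(estimated_gain[app] - actual_gain[app], 2)
    for app in actual_gain
}

lowest = min(overestimation.values())
highest = max(overestimation.values())
print(overestimation, lowest, highest)
```

The largest gap belongs to DRR, while the estimate for Arith matches the measurement.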
The performance gains come at the expense of additional chip resources. The increases in chip resource utilization, expressed as tuples of (LUTs, BRAM), are (0%, 39%), (0%, 39%), (8%, 42%) and (1%, −3%) respectively. The approximations performed by the optimizer are (−4%, 36%), (−4%, 41%), (−4%, 44%) and (−2%, −4%), respectively. We consistently underestimate LUT utilization; our BRAM estimates are mixed, with underestimation ranging from −2% to 3%.
Cost Approximations
As we saw in Section 4, we simplified the cost function for LUTs to be linear while leaving it nonlinear for BRAM. To evaluate the simplification, we present what the nonlinear approximations would be for LUTs in Figure 5. As seen there, our underestimations would be slightly higher and hence, worse. In addition, to demonstrate how much better the nonlinear cost function is than the linear one for BRAM, we also present the linear approximations. Space constraints preclude a similar analysis for Section 5.
Figure 6. BLASTN runtime optimization costs
Comparison with Dcache Optimization
Given our assumption of parameter independence, it is interesting to compare the dcache customization here to the one obtained by optimizing only dcache in Section 5. However, the weights in the objective function are slightly different: for the former, $w_1 = 100$ and $w_2 = 1$, and for the latter, $w_1 = 100$ and $w_2 = 0$. The resulting dcache configurations are identical for all applications except Arith. For Arith, it was 1x4 in Section 5 but here it is 1x1. This is because of the chip resource consideration resulting from $w_2 = 1$.
FPGA Resource Optimization
We optimize chip resources over application performance by setting $w_1 = 1$ and $w_2 = 100$. Figure 7 shows the parameters reconfigured from the base configuration, along with results from the actual build of the solution. Based on the latter, the decreases in chip resource utilization, as (LUTs, BRAM) tuples, are (2%, 3%), (2%, 3%), (3%, 3%) and (1%, 3%). The approximations performed by our optimization algorithm estimate the chip resource savings to be (5%, 4%), (7%, 4%), (7%, 4%) and (5%, 4%). We consistently overestimate the chip savings; for LUTs, the range is 3-5% and for BRAM, it is always 1%. As with application runtime optimization, we also present the nonlinear approximations for LUTs and the linear approximations for BRAM in Figure 7.
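The stated overestimation ranges for chip savings can be verified from the tuples above. The values are those quoted in the text, and the assumption that the tuples are listed in the order BLASTN, DRR, FRAG, Arith follows the ordering used for the runtime results.

```python
APPS = ["BLASTN", "DRR", "FRAG", "Arith"]  # assumed ordering of the tuples

# Savings (% LUTs, % BRAM) measured on the actual build of each solution.
actual_savings = [(2, 3), (2, 3), (3, 3), (1, 3)]

# Savings (% LUTs, % BRAM) estimated by the optimizer.
estimated_savings = [(5, 4), (7, 4), (7, 4), (5, 4)]

# Overestimation in percentage points, per resource.
lut_over = [est[0] - act[0] for est, act in zip(estimated_savings, actual_savings)]
bram_over = [est[1] - act[1] for est, act in zip(estimated_savings, actual_savings)]

for app, lut, bram in zip(APPS, lut_over, bram_over):
    print(app, lut, bram)
print(min(lut_over), max(lut_over), set(bram_over))
```

Every difference is positive, which matches the observation that the savings are consistently overestimated.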
The savings in chip resources come at a loss of application performance, often a significant one: 30.66% for BLASTN, 16.76% for DRR, 0.43% for FRAG and 36.34% for Arith.
Conclusion and Future Work
We have presented a heuristic for automatic application-specific reconfiguration of a soft core processor microarchitecture. This approach is linear in the number of reconfigurable parameter values, under an assumption of parameter independence, which makes the approach feasible and scalable. The performance gains over the base configuration are near-optimal in practice, despite our simplifying assumption. More importantly, our technique empowers application developers to make performance-resource tradeoffs in hours and without detailed knowledge of the architecture.
Future work can recast our nonlinear constraints so that they are convex functions for all values of $x_i$. This will guarantee that the optimization algorithm finds the global optimum. We can also analyze the cost approximations performed by the optimization algorithm and explore more sophisticated approximations. As extensions to our model, we can include power and energy optimizations, runtime sampling to facilitate analysis of long-running applications, running applications on an operating system, and support for ISA-level customization. By integrating our solution with open source soft core processors, we can contribute back to the community. Finally, and more interestingly, we can evaluate our technique on other configuration and feature management problems.
