The correctness of a real-time system does not depend on the correctness of its calculations alone but also on the non-functional requirement of adhering to deadlines. Guaranteeing these deadlines by static timing analysis, however, is practically infeasible for current microarchitectures with out-of-order scheduling pipelines, several hardware threads, and multiple (shared) cache layers. Novel timing-analyzable features are required to sustain the strongly increasing demand for processing power in real-time systems.
INTRODUCTION
Real-time embedded systems are ubiquitous in everyday life, for example, in safety-critical domains such as automotive, avionics, or robotics. The correctness of a real-time system depends not only on the correctness of its calculations but also on the non-functional requirement of adhering to the deadlines by which its output signals, which may be safety-critical, are produced. Failing to meet a deadline may lead to severe malfunctions, and therefore deadlines need to be guaranteed in a process called timing validation [Wilhelm et al. 2009]. As part of the timing validation, a schedulability analysis is performed to guarantee that a given task set can be scheduled at runtime under any circumstances. To perform a schedulability analysis, the worst-case execution time (WCET) of every task from the task set needs to be known [Wilhelm et al. 2009].
Determining an accurate upper WCET bound of a task is a complex problem, because performance-enhancing features of modern processors like pipelining, caches, and branch prediction introduce a microarchitectural state. This microarchitectural state makes the latency of instructions depend on the execution history. Assuming worst-case behavior of a microarchitectural component, for example, a cache miss, in situations where the microarchitectural state cannot be determined statically does not necessarily result in a safe WCET bound for the whole task. The state of one component may influence other microarchitectural components, for example, whether a cache access is a hit or miss can influence whether a branch condition is calculated in time or potentially mispredicted. Such an effect is called a timing anomaly [Reineke et al. 2006], and it enforces exhaustive exploration of every possible microarchitectural state when determining the worst-case bound for executing a sequence of instructions.
Modern high-performance processors like the IBM POWER8 [Sinharoy et al. 2015] feature microarchitectures with out-of-order scheduling pipelines, several hardware threads, and multiple (shared) cache layers. These features, which enhance average-case performance, cause an explosion of possible microarchitectural states that renders timing analysis practically infeasible [Axer et al. 2014]. However, the demand for processing power in real-time systems is strongly increasing, for example, automated driving requires vast amounts of sensor data to be processed under timing constraints. Therefore, high-performance microarchitectures amenable to WCET analysis are called for [Thiele and Wilhelm 2004; Edwards and Lee 2007; Axer et al. 2014].
Recent advances in timing analysis of tasks on reconfigurable processors have proven instruction set extensions by reconfigurable custom instructions (CIs) to be an effective means to achieve predictable performance [Damschen et al. 2016]. CIs initiate execution of hardware accelerators configured on a reconfigurable fabric that is tightly coupled to a processor core (see Tessier et al. [2015] for an overview of reconfigurable architectures). An application binary in such an architecture provides directives to a Reconfiguration Unit to configure the CIs' accelerators onto the reconfigurable fabric. Reconfigurations are performed for the requirements of an upcoming kernel (also known as hot spot), that is, a compute-intensive part of the application, for example, a loop nest. In Damschen et al. [2016], it was shown that, in addition to a considerable speedup, the overestimation of a task's WCET can be reduced by moving calculations from software code to hardware CIs. CIs typically implement functionality that corresponds to several hundred instructions when executed on the central processing unit (CPU) pipeline, possibly including conditional branches and other control flow. While analyzing instructions for worst-case latency may introduce pessimism due to, for example, pipeline hazards or instruction cache misses, the latency of the hardware accelerators executed on the reconfigurable fabric is precisely known.
In this work, we propose an approach of selecting WCET-optimizing sets of CIs for computational kernels that seamlessly integrates into state-of-the-art timing analysis. While we do not target the reduction of overestimation of a task's WCET bound or resolving the problem of timing anomalies in this work, we present an effective approach to statically select sets of reconfigurable CIs to optimize a task's WCET bound and advance research on timing-analyzable high-performance architectures. One main problem in selecting WCET-optimizing CIs is the instability of the worst-case path, that is, when reducing the latency of the worst-case path by inserting a CI, an entirely different path can become the new worst-case path. Therefore, WCET bound estimation is an integral part of WCET-optimizing CI selection. Figure 1 shows our envisioned toolflow. CI selection, also referred to as instruction set selection, is the second of the two main steps in the so-called instruction set extension problem [Galuzzi and Bertels 2011]. The first step is the CI generation that is performed when compiling the application source code. In this step, kernels are identified in the application and partitioned into segments of code to execute in software and segments to execute in hardware. For the segments to execute in hardware, several alternatives that differ in resource demands as well as latencies are generated and then synthesized into configurations for the reconfigurable fabric. CIs provide an assembly-level interface to execute the hardware segments. Which CIs are implemented in hardware instead of the original software code and how many resources to allocate per CI is determined by the CI selection according to an optimization goal, for example, average-case performance. Several approaches to CI generation exist that can provide CIs and implementation alternatives as input to CI selection [Galuzzi and Bertels 2011].
Differing from existing CI selection approaches targeting average-case performance, our novel WCET-optimizing selection requires the application binary, as it is the only way to be able to obtain precise WCET bound estimates [Wilhelm et al. 2008] . To obtain a finished binary with generated CIs while keeping the flexibility to execute the original software, we introduce CI super blocks, which we will detail in Section 3. Essentially, we move the selection step of the instruction set extension problem from the compiler to the timing analyzer, that is, post-compilation, by introducing a conditional jump that either jumps to the hardware CI, if configured, or the original software code otherwise. This results in an effective technique that considerably reduces the guaranteed WCET bound compared to the original task that does not use CIs.
The novel contributions of our work are as follows:
- Modeling the WCET-optimizing instruction set selection problem with support for global program flow information and reconfiguration delay by extending state-of-the-art models used in timing analyzers like AbsInt aiT [AbsInt 2016] or OTAWA [Ballabriga et al. 2010].
- An optimal solution that effectively reduces the search space by mapping selection candidates to weak compositions of an integer, that is, the algorithm recursively generates all distributions of reconfigurable fabric area to CIs while adhering to area constraints. Recursion subtrees corresponding to distributions of area that cannot be utilized in CI implementations are pruned early. In our evaluation, we show that fewer than 1% of the 570,240 possible selections need to be evaluated when optimizing the EncodeMacroBlock kernel as part of the H.264 encoder with our optimal search algorithm.
- A heuristic solution that performs a number of WCET estimates that is at most linear in the number of area partitions available for configuring CIs on the reconfigurable fabric. It reduces the runtime of optimization down to 11.18% of that of the optimal search algorithm for the aforementioned EncodeMacroBlock kernel, the most complex kernel evaluated. In our evaluations, its speedups on the WCET were at most 2.52% lower than optimal.
We show that previous work targeting optimization of the worst-case path, for example, instruction cache locking or scratchpad memory allocation of program code, shares similarities with the WCET-optimizing instruction set selection problem but cannot be adapted to obtain optimal solutions. For introducing runtime instruction set reconfiguration as an enabling feature to provide timing-analyzable performance, novel models and solutions are required.
RELATED WORK AND MOTIVATION
WCET-optimizing instruction set selection bears resemblance to other static optimizations targeting the worst-case path like instruction cache locking or scratchpad memory allocation of program code. In this section, we point out the differences of these problems to WCET-optimizing instruction set selection. Additionally, we discuss state-of-the-art solutions to instruction set selection specifically and explain their shortcomings.
Caches are used to effectively reduce the average memory access latency of a CPU. It is very difficult to predict statically whether a memory access can be served by the cache (cache hit) or needs to be served by main memory (cache miss). WCET analysis always needs to assume a cache miss when it cannot guarantee a cache hit. This typically leads to overestimation of the WCET bound. Cache locking is a software-controlled mechanism to load code segments into the cache and prevent them from being evicted. Several works utilize instruction cache locking to reduce overestimation resulting from cache analysis and thus lower the WCET bound [Falk et al. 2007; Liu et al. 2009; Plazar et al. 2012]. Similarly, the instruction cache can be replaced by allocating program code directly to predictable scratchpad memory [Falk and Kleinsorge 2009]. Even though these techniques are complementary to instruction set selection, the question arises whether the same algorithms can be applied. Similarly to instruction set selection, the instruction cache locking and program code allocation problems entail WCET estimation to determine the worst-case path and using this information to select code segments that can be most profitably sped up for lowering the WCET bound. However, both need to choose between only two alternatives for a code segment: utilizing the fast memory (i.e., locking it in the cache or allocating it in scratchpad memory) or main memory. Instruction set selection has several alternatives to choose from: the original software or different CIs implementing the same functionality with different degrees of parallelism and therefore different delays and resource requirements. Even with extensions for evaluating multiple alternatives to choose from (e.g., the different CI implementations), existing algorithms for cache locking would remain unsuitable for our problem.
Falk et al. [2007] and Liu et al. [2009] model the problem similarly using Execution Flow Graphs and Execution Flow Trees, respectively. However, the execution flow is modeled on the level of function calls. As we target kernels, we aim to model the function-internal control flow. Plazar et al. [2012] as well as Falk and Kleinsorge [2009] model function-internal control flow similarly to the instruction set selection presented by Yu et al. [2005], which in turn is an integer linear programming (ILP) formulation of a WCET estimation technique called timing schema [Park and Shaw 1990]. Timing schema is a tree-based WCET estimation technique (see Ermedahl et al. [2005] for an overview of estimation techniques). In current timing analyzers, it was succeeded by the more powerful Implicit Path Enumeration Technique (IPET) [Li et al. 1995], which we will discuss in Section 3.1. However, timing schema is still commonly used in state-of-the-art WCET optimization approaches, because it is computationally cheap and it enables WCET optimization to be modeled as a single ILP (as opposed to the combinatorial problem that we present in Section 4). In timing schema, the estimate is calculated by building a representation that generally corresponds to the abstract syntax tree of the program and traversing it bottom-up by simple recursive rules. Infeasible path information cannot be applied efficiently, because the recursive rules are local to program statements [Ermedahl et al. 2005]. This can lead to imprecise WCET estimates as shown in the simple example in Figure 2: The rules are unable to capture the global information that the true case of the if statement can appear at most 5 times in the worst-case path. In this example, timing schema produces an estimate based on a program path that executes the true case 100 times, and therefore this case seems to be the most profitable candidate to be optimized. However, this path never appears in an actual execution of the program.
State-of-the-art timing analyzers can correctly determine that the false case dominates the WCET in the example in Figure 2 using value analysis and generating constraints for IPET. Therefore, when utilizing a computationally cheap, but imprecise, WCET estimation technique like timing schema during WCET optimization, the allocated resources may not even be utilized in the final WCET bound that is obtained using a timing analyzer. Additionally, state-of-the-art timing analyzers support powerful annotation languages to provide global path information [Kirner et al. 2011] (the impact on WCET optimization is evaluated in Section 7.3). Thus, we propose to extend state-of-the-art timing analysis using IPET to support WCET optimization, as opposed to treating WCET optimization and timing analysis as two separate processes.
Yu and Mitra [2005] perform WCET-optimizing instruction set selection for instruction set extensible processors. These processors contain custom functional units that can be configured to implement frequently used instruction patterns for speedups by exploiting instruction level parallelism and operator chaining [Yu and Mitra 2004]. According to their processor model, the presented heuristic assumes a uniform cost per selected pattern (i.e., occupation of one custom functional unit). The WCET-optimizing instruction set is selected per task, that is, during task execution the instruction set is fixed. Therefore, the cost of configuring a selected pattern is not taken into account in their approach. In our work, we target dynamic reconfiguration of custom instructions with varying area demands (1 up to A units of the reconfigurable fabric area). For evaluating the profit of an instruction on reducing the WCET estimate, we need to factor in its required area as well as its reconfiguration delay. The impact of reconfiguration delay on WCET optimization is evaluated in Section 7.2.
SYSTEM MODEL
Our optimization is applied to the reconstructed control-flow graph (CFG) of an application in binary form, as it is the only way to obtain safe and precise WCET estimates [Wilhelm et al. 2008]. Besides the binary, we require additional compile-time information: potential CIs and their possible configurations to choose from (see Galuzzi and Bertels [2011] for an overview). The granularity of a CI, that is, the amount of software it replaces, depends on the specific target architecture. In our evaluation, CIs replace 12 to 123 lines of C code (see Section 7.1). For configuring the CIs in hardware, we assume reconfigurable fabric area to be allocatable in up to $A$ discrete units. This corresponds to the common area model of dividing the fabric area into $A$ equally sized partitions like in the 1D or 2D partitioned area models in Steiger et al. [2003]. Reconfigurations are performed before beginning execution of a kernel, for example, as shown in Figure 4(a). Let $CI$ be the set of all CIs. We assume a specific configuration $j$ of a CI $k \in CI$ in hardware to have a constant delay $t_{k,j}$ (cycles spent in the pipeline's execution stage), to require area $a_{k,j} \in [1, A]$ on the reconfigurable fabric, and to take a constant reconfiguration delay $r_{k,j}$ for configuring it on the fabric. For a constant reconfiguration delay, a constant bandwidth for transferring configuration data to the reconfigurable fabric's configuration memory needs to be guaranteed. We assume the CPU to be stalled during reconfiguration in this work, and therefore the system bus could be utilized for reconfiguration at a guaranteed bandwidth. For more details on achieving a constant reconfiguration delay, see, for example, the Reconfiguration Unit in Damschen et al. [2016].
In addition to hardware configurations, a CI can be implemented using its original software code, denoted $j = 0$. The software implementation does not have a constant delay $t_{k,0}$, because it is subject to, for example, cache and pipeline analysis in the specific context that it is executed in. It requires neither fabric area nor reconfiguration delay (i.e., $a_{k,0} = r_{k,0} = 0$). For providing the flexibility to execute the original software for generated CIs, we introduce CI super blocks. As shown in Figure 3, CI super blocks begin with a conditional branch before every CI (the actual instruction in the binary) that jumps to the functionally equivalent software code when the CI is not implemented in hardware. If a configuration for the CI is available on the reconfigurable fabric, then the CI is executed instead of jumping to the software. The CI super block ends by joining the paths of hardware CI and software. Multiple CI super blocks in the binary can execute the same CI $k$. Let $B$ be the set of all blocks, that is, basic blocks (not contained in super blocks) as well as super blocks. The function $ci(i)$, defined in Equation (1), determines which CI $k$ is executed by a super block $i \in B$.
The context-dependent delay for executing implementation $j$ of CI super block $i$ is denoted as $e_{i,j}$ for hardware as well as software implementations. While CI execution on the reconfigurable fabric itself is context independent ($t_{ci(i),j}$ is constant for $j > 0$), invoking the CI from the CPU pipeline can add additional cycles, for example, because of pipeline hazards or an instruction fetch miss of the CI. Therefore, $e_{i,j} \ge t_{ci(i),j}$ for $j > 0$. Consider the example of Figure 4(a): It provides input to the WCET-optimizing instruction set selection. In this example, two CIs were generated, one with $m_1 = 2$ and the other with $m_2 = 3$ different hardware implementations. From microarchitectural analysis [Wilhelm et al. 2008], we obtain the worst-case bound per block that considers, for example, cache, pipeline, or branch prediction effects (see Figure 4(b)). This way, $e_{4,0}$ and $e_{6,0}$ can be unequal, even when they execute the same CI ($ci(4) = ci(6) = 1$) in the same implementation ($j = 0$). $e_{i,j}$ is the main parameter that we use to calculate WCET estimates based on a specific selection of implementations in Section 4. Equation (1) is used to concisely formulate Equations (5)-(9) on which our WCET estimation is based. Effectively, we obtain a CFG that is parameterizable by a chosen selection using CI super blocks. In the following, we will introduce the WCET bound estimation technique we utilize and show how we can extend it to our problem formulation to evaluate and direct our optimization.
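The system model above can be written down as a small data structure. The following is a minimal sketch, not the authors' implementation: the class and function names are made up for illustration, and all cycle counts, areas, and reconfiguration delays are hypothetical. Only the shape follows the text ($m_1 = 2$, $m_2 = 3$, super blocks 4 and 6 both executing CI 1, and $e_{i,j} \ge t_{ci(i),j}$ for $j > 0$); super block 7 is an invented user of CI 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HwConfig:
    """One hardware configuration j > 0 of a CI (all values hypothetical)."""
    delay: int   # t_{k,j}: constant cycles in the execute stage
    area: int    # a_{k,j}: fabric units, between 1 and A
    reconf: int  # r_{k,j}: constant reconfiguration delay in cycles

@dataclass(frozen=True)
class SuperBlock:
    ci: int      # ci(i): the CI k executed by this super block
    e: tuple     # e[j] = e_{i,j}; index 0 is the software implementation

A = 4  # hypothetical number of fabric partitions

# CI 1 has m_1 = 2 and CI 2 has m_2 = 3 hardware configurations.
CIS = {
    1: (HwConfig(20, 1, 300), HwConfig(8, 2, 600)),
    2: (HwConfig(30, 1, 300), HwConfig(18, 2, 600), HwConfig(10, 3, 900)),
}

# Blocks 4 and 6 execute the same CI, yet their software delays differ
# because e_{i,0} depends on the context of each block.
SUPER_BLOCKS = {
    4: SuperBlock(ci=1, e=(120, 22, 10)),
    6: SuperBlock(ci=1, e=(135, 23, 11)),
    7: SuperBlock(ci=2, e=(200, 32, 20, 12)),
}

def check_model(cis, blocks, total_area):
    """Return True if the model respects the assumptions of the text."""
    for block in blocks.values():
        configs = cis[block.ci]
        if len(block.e) != len(configs) + 1:      # software + m_k configs
            return False
        for j, cfg in enumerate(configs, start=1):
            if not 1 <= cfg.area <= total_area:   # a_{k,j} in [1, A]
                return False
            if block.e[j] < cfg.delay:            # e_{i,j} >= t_{ci(i),j}
                return False
    return True
```

Storing $e_{i,j}$ per super block and $t_{k,j}$, $a_{k,j}$, $r_{k,j}$ per CI mirrors the split between context-dependent and context-independent parameters in the text.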
Calculating WCET Bounds Using IPET
We utilize IPET [Li et al. 1995] for WCET bound calculation during optimization, as it is the program path analysis technique that state-of-the-art timing analyzers rely on [Wilhelm et al. 2008; Altmeyer et al. 2015]. IPET models program flow as arithmetic constraints in an ILP-formulated problem. The objective function determines the CPU cycles executed on a path in the task's CFG. To find the WCET path, it needs to be maximized. Variables in the objective function represent the execution count of a single basic block ($x_i$) in the CFG and are weighted with the execution cycles of that basic block ($c_i$), which are determined in the microarchitectural analysis. For a program with $N$ basic blocks, the objective function is given as [Li et al. 1995]:
\[ \max \sum_{i=1}^{N} c_i \, x_i \tag{2} \]
Constraints restrict the variables by modeling the control flow and capturing relative execution counts of basic blocks. The more infeasible paths can be excluded by constraints, the tighter the WCET bound will be. An overview of how to generate IPET constraints is given in Li et al. [1995]. Besides the variables $x_i$ representing the execution counts of basic blocks, variables $d_i$ for every edge in the CFG are used. For example, consider Figure 4. IPET generates the program structure constraints in Figure 4(c) as follows. The loop header (represented by $x_2$) can be entered from outside using the edge represented by $d_1$ or from a previous iteration using $d_9$. The same basic block can be exited when the loop condition is false and the kernel is exited using $d_2$, or it can proceed to another iteration when the loop condition is true using $d_3$. Therefore, $x_2 = d_1 + d_9 = d_2 + d_3$.
Global information like the upper bound of 100 loop iterations can be given by the constraint $x_3 \le 100 \cdot d_1$.
IPET is a standard technique in state-of-the-art timing analyzers as it can capture global control flow information without the need to explicitly enumerate program paths. Several extensions, for example, for complex flows and hardware timing effects depending on the history of executed instructions, have been published [Ballabriga et al. 2010; Ermedahl et al. 2005; Wilhelm et al. 2008 ].
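To make the difference between local rules and global flow constraints concrete, the following sketch contrasts a timing-schema-style bound with an IPET-style bound for a loop with an if statement, in the spirit of the Figure 2 discussion. It is only an illustration: the cycle costs are hypothetical, the function names are invented, and exhaustive enumeration of execution counts stands in for the ILP solver a real timing analyzer would use.

```python
def timing_schema_bound(cost_true, cost_false, loop_bound):
    """Local rule: every iteration is charged the more expensive branch."""
    return loop_bound * max(cost_true, cost_false)

def ipet_bound(cost_true, cost_false, loop_bound, max_true=None):
    """Maximize sum(c_i * x_i) over all execution counts that satisfy the
    flow constraints (enumeration replaces the ILP solver here)."""
    best = 0
    for x_body in range(loop_bound + 1):      # x_body <= loop_bound
        for x_true in range(x_body + 1):      # x_true + x_false = x_body
            if max_true is not None and x_true > max_true:
                continue                      # global flow fact on the branch
            x_false = x_body - x_true
            best = max(best, cost_true * x_true + cost_false * x_false)
    return best

# Hypothetical costs: the true case is expensive but can run at most 5 times.
COST_TRUE, COST_FALSE, LOOP_BOUND = 50, 10, 100

schema = timing_schema_bound(COST_TRUE, COST_FALSE, LOOP_BOUND)
ipet = ipet_bound(COST_TRUE, COST_FALSE, LOOP_BOUND, max_true=5)
```

With these numbers the local rule charges 100 executions of the true case, while the constrained maximization charges 5 executions of the true case and 95 of the false case, so the false case dominates the bound.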
PROBLEM FORMULATION
In order to obtain precise WCET estimates that utilize global program flow information during instruction set selection, we integrate the system model from Section 3 and global bound calculation using IPET. Selecting an instruction set to optimize the WCET bound essentially means we aim to minimize the WCET over all possible selections, that is, we aim to minimize the maximum execution time. In the following, we extend the ILP formulation of IPET (see Section 3.1) for capturing the implementation alternatives of a CI $k \in CI$. We introduce new variables $y_{k,j} \in \{0, 1\}$ for every implementation $j$ with $y_{k,j} = 1$ if CI $k$ is implemented using alternative $j$ and $y_{k,j} = 0$ otherwise. For example, $y_{k,0} = 1$ would mean we do not implement CI $k$ in hardware but utilize its original software instead (see Section 3 and Figure 3). We introduce the constraint
\[ \sum_{j=0}^{m_k} y_{k,j} = 1 \quad \forall k \in CI \tag{3} \]
to ensure that exactly one implementation is chosen, either in software ($j = 0$) or in hardware ($j > 0$), with $m_k$ being the number of hardware configurations of CI $k$. To only allow solutions that fit onto the reconfigurable fabric, we introduce the following area constraint:
\[ \sum_{k \in CI} \sum_{j=0}^{m_k} a_{k,j} \, y_{k,j} \le A \tag{4} \]
that is, the sum of area on the reconfigurable fabric $a_{k,j}$ required to implement all CIs $k$ using the selected implementation $j$ (for which $y_{k,j} = 1$) needs to be lower than or equal to the total fabric area $A$. Any $y \in \{0, 1\}^{|CI| \times M}$, with $M = \max_{k \in CI} m_k + 1$, satisfying Equation (3) and Equation (4) is a feasible instruction set selection. As shown in Figure 4(c), the obtained constraints are used to extend the constraints generated by IPET.
In the following, we will develop the objective function for optimizing the WCET in the presence of CI super blocks. Our system model from Section 3 enables us to capture every implementation alternative as a single super block in the CFG (see Figure 3) . The total cycle contribution of CI k's super block i to the WCET bound is given as follows:
For example, when choosing the software implementation, the cycle contribution becomes $e_{i,0} \, x_i$, which directly resembles the contribution of a basic block in IPET's objective function (see Equation (2)). The WCET for a given selection $y$ without accounting for reconfiguration delay can be determined as follows:
Additionally, we need to account for the reconfiguration delay of our selection. Neglecting it could result in suboptimal selections in which the time spent configuring the selected CIs outweighs the time saved by performing hardware-accelerated calculations (more details in Section 7.2). Every CI super block utilized in a kernel is configured exactly once before entering the kernel (with zero reconfiguration delay for software implementation). Therefore, we obtain the WCET including reconfiguration delay as:
For every selection y, we obtain an ILP instance determining the resulting WCET for y. For example, when selecting the software implementation for every CI, we obtain the following objective function that resembles an objective function of an IPET problem instance without any CIs (see Equation (2)):
Note that for every choice of $y$, only $WCET_r(y)$ changes while the constraints remain static once they have been generated.
Putting it all together, the WCET-optimizing instruction set selection problem becomes a combinatorial problem with the following objective function:
The objective function for our example in Figure 4 is shown in Figure 4(d). As we have finite choices for $y \in \{0, 1\}^{|CI| \times M}$ ($|CI|$ and $M$ are finite), we could transform Equation (9) into a single ILP by resolving the min of Equation (6) into one constraint per choice of $y$. However, this would result in up to $2^{|CI| \cdot M}$ constraints of high complexity, which becomes practically infeasible even for small values. Also note that we do not need to evaluate the ILPs for the IPET instance of the whole application, but only per kernel. Therefore, the ILPs are considerably less complex (fewer variables and constraints) than the ILP for determining the WCET of the whole application. In the following section, we will show how the search space can be pruned and how feasible $y$ are generated efficiently.
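For reference, the quantities introduced in this section can be written out as follows. This is a sketch assembled from the surrounding definitions and may differ from the typeset Equations (5)-(9) in detail. In particular, it assumes that a plain basic block is treated like a block with a single software implementation, so that its term reduces to $e_{i,0} \, x_i$, and that the maximization ranges over the execution counts $x$ admitted by the IPET constraints.

```latex
\begin{align*}
% cycle contribution of super block i with k = ci(i), cf. Equation (5)
  &\Bigl(\sum_{j=0}^{m_k} e_{i,j}\, y_{k,j}\Bigr)\, x_i \\
% WCET of a fixed selection y without reconfiguration delay, cf. Equation (6)
  \mathit{WCET}(y) &= \max_{x} \sum_{i \in B}
      \Bigl(\sum_{j=0}^{m_{ci(i)}} e_{i,j}\, y_{ci(i),j}\Bigr)\, x_i \\
% adding the one-time reconfiguration delay, cf. Equation (7)
  \mathit{WCET}_r(y) &= \mathit{WCET}(y)
      + \sum_{k \in CI} \sum_{j=0}^{m_k} r_{k,j}\, y_{k,j} \\
% all-software selection y_{k,0} = 1 for all k, cf. Equation (8)
  \mathit{WCET}_r(y) &= \max_{x} \sum_{i \in B} e_{i,0}\, x_i \\
% selection problem, subject to Equations (3) and (4), cf. Equation (9)
  &\min_{y}\; \mathit{WCET}_r(y)
\end{align*}
```

For a fixed $y$ the inner sums collapse to constants, so each evaluation of $WCET_r(y)$ is an ordinary IPET instance with unchanged constraints.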
OPTIMAL SOLUTION
In theory, we can have up to $2^{|CI| \cdot M}$ possible selections $y$ that we need to evaluate. In practice, however, the search space is considerably smaller for the following reasons:
- The number of possible hardware configurations $m_k$ per CI $k$ varies considerably; for example, in our evaluation, CIs within one kernel had between 1 and $M = 78$ implementations (including the software implementation; more details in Section 7). Of all these CI implementations, only a small subset is relevant in practice. For the CI with 78 different implementations, many implementations had different degrees of parallelism and latencies but required the same amount of area and reconfiguration delay when synthesized to the reconfigurable fabric. When considering only the minimum-latency implementation per required fabric area, our algorithm was able to prune the number of implementations to 10 relevant ones. Therefore, in practice the relevant number of implementations per CI $k$ is much smaller than $m_k + 1$.
- Additionally, the possible selections can be pruned considerably when applying the area constraint early (see Equation (4)). Let us consider the inner sum of Equation (4); it gives us the allocation of area per CI for one selection as a tuple $a = (a_1, a_2, \ldots, a_{|CI|})$.
To prune the search space, we want to find the number of unique tuples fulfilling Equation (4). Having a total area of $A$ on the reconfigurable fabric means the number of selections utilizing the whole fabric is equal to the number of possibilities to distribute the area such that $\sum_{k=1}^{|CI|} a_k = A$ (allowing $a_k = 0$ for the software implementation), that is, the number of so-called weak compositions of the integer $A$ into exactly $|CI|$ parts, which is $\binom{A + |CI| - 1}{|CI| - 1}$ [Heubach and Mansour 2004]. The number of all unique tuples fulfilling Equation (4), that is, $\sum_{k=1}^{|CI|} a_k \le A$, is $\sum_{a=0}^{A} \binom{a + |CI| - 1}{|CI| - 1} = \binom{A + |CI|}{|CI|}$.
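The counts from the weak-composition argument can be checked numerically. The sketch below compares the binomial closed forms with a brute-force enumeration of area tuples; the function names and the values of $A$ and $|CI|$ are chosen only for illustration.

```python
from itertools import product
from math import comb

def weak_compositions(total_area, num_cis):
    """Tuples (a_1, ..., a_n) of non-negative integers with sum == A."""
    return comb(total_area + num_cis - 1, num_cis - 1)

def tuples_within_area(total_area, num_cis):
    """Tuples (a_1, ..., a_n) of non-negative integers with sum <= A."""
    return comb(total_area + num_cis, num_cis)

def brute_force(total_area, num_cis, exact):
    """Count the same tuples by enumerating every candidate allocation."""
    count = 0
    for alloc in product(range(total_area + 1), repeat=num_cis):
        used = sum(alloc)
        if used == total_area or (not exact and used < total_area):
            count += 1
    return count

A, NUM_CIS = 4, 3  # hypothetical fabric size and number of CIs
```

For $A = 4$ and three CIs there are 15 allocations that use the whole fabric and 35 that fit within it, far fewer than the $5^3 = 125$ unconstrained tuples.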
Combining both observations leads to an additional opportunity for pruning, which our optimal search algorithm shown in Algorithm 1 exploits. We recursively generate the weak compositions of $A$ into exactly $|CI|$ parts as tuples $a = (a_1, a_2, \ldots, a_{|CI|})$. In the initial call OPTSEARCH($A$, 1, $y$), the algorithm enumerates the possible values of $a_1 \in \{0, \ldots, A\}$ (Line 3). For every value of $a_1$, it tries to find the best implementation (minimum latency) of CI 1 requiring exactly $a_1$ area (Line 4). The recursive calls OPTSEARCH($A - a_1$, 2, $y$) take place only if an implementation for a chosen $a_1$ is found. Otherwise, the whole recursion subtree for the value of $a_1$ is pruned. Every leaf of the recursion tree ($k = |CI| + 1$) defines a unique selection $y$ fulfilling Equation (3) as well as Equation (4) and is evaluated by solving the ILP of Equation (9) (Line 12). Figure 5 visualizes how pruning is applied and how generated tuples correspond to selection candidates for the input provided by the example in Figure 4. The effectiveness of our approach of pruning the search space by recursively generating weak compositions of $A$ is evaluated in Section 7.4.1. While it proves effective in practice, the number of candidates to be evaluated can still grow exponentially in $A$ and $|CI|$. Therefore, we also present a heuristic solution in the following section.
ALGORITHM 1: Recursive Search for Optimal Selection
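The recursive search described above can be sketched as follows. This is an illustrative reading of the prose, not the original listing: the input data are hypothetical, CIs are indexed from 0, and the ILP solve at each leaf is replaced by a maximum over an explicit list of paths (each path gives the worst-case execution count of every CI), which is only viable for toy inputs.

```python
import math

# Hypothetical input. Per CI: index 0 is software, the rest are hardware
# configurations, each given as (area, latency, reconfiguration delay).
IMPLS = [
    [(0, 100, 0), (1, 30, 40), (2, 12, 80), (2, 20, 80)],  # CI 0
    [(0, 150, 0), (1, 60, 40), (3, 15, 120)],              # CI 1
]
PATHS = [(10, 2), (3, 8)]  # stand-in for the IPET constraints
TOTAL_AREA = 3

def wcet_r(selection):
    """Longest path for the selection plus its reconfiguration delay."""
    longest = max(
        sum(n * IMPLS[k][j][1] for k, (n, j) in enumerate(zip(path, selection)))
        for path in PATHS
    )
    return longest + sum(IMPLS[k][j][2] for k, j in enumerate(selection))

def best_impl(k, area):
    """Minimum-latency implementation of CI k using exactly `area` units."""
    candidates = [j for j, (a, _, _) in enumerate(IMPLS[k]) if a == area]
    return min(candidates, key=lambda j: IMPLS[k][j][1], default=None)

def opt_search(area_left, k=0, selection=(), best=None):
    """Enumerate area tuples recursively; prune areas without implementation."""
    if best is None:
        best = {"wcet": math.inf, "selection": None, "evaluated": 0}
    if k == len(IMPLS):                      # leaf: one feasible selection
        best["evaluated"] += 1
        value = wcet_r(selection)
        if value < best["wcet"]:
            best["wcet"], best["selection"] = value, selection
        return best
    for area in range(area_left + 1):        # candidate a_k
        j = best_impl(k, area)
        if j is None:
            continue                         # prune the whole subtree
        opt_search(area_left - area, k + 1, selection + (j,), best)
    return best

result = opt_search(TOTAL_AREA)
```

In this toy input, only 7 of the 12 combinations of implementations are evaluated: dominated implementations with equal area are skipped by `best_impl`, and allocations exceeding the fabric are never generated.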

HEURISTIC SOLUTION
We introduce a greedy heuristic that performs a number of WCET estimates linear in the number of partitions into which the reconfigurable fabric area was divided, that is, at most $A$ estimates (for $A > 0$). It is shown in Algorithm 2. The heuristic starts with implementing all CIs in software, that is, not allocating any area of the reconfigurable fabric for CIs. For every CI, it assigns a profit that calculates the WCET reduction on the current worst-case path when choosing an alternative implementation.
ALGORITHM 2: Greedy Heuristic for WCET-Optimizing Instruction Set Selection
Let $j(y_k)$ be the implementation selected for CI $k$ in $y$. We define the profit of selecting $y'_k$ over $y_k$ for a CI $k$ as follows:
\[ \mathrm{profit}(y'_k, y_k) = \underbrace{\sum_{i \in B,\; ci(i) = k} \bigl(e_{i,j(y_k)} - e_{i,j(y'_k)}\bigr)\, x_i}_{\text{latency reduction on current worst-case path}} \;-\; \underbrace{\bigl(r_{k,j(y'_k)} - r_{k,j(y_k)}\bigr)}_{\text{additional reconfiguration delay}} \]
where $x$ is obtained by solving Equation (6) and keeping the values of the variables $x_i$, that is, $x$ determines $WCET(y)$. Note that the profit can become negative if the latency reduction on the current worst-case path is smaller than the additional reconfiguration delay for the additional area. The heuristic calculates the profit for selecting the next best implementation $y^+_k$ instead of $y_k$ for every CI $k$ (Line 18). The implementation $y^+_k$ for a CI $k$ is the implementation that can be chosen with minimum increase in the amount of area over $y_k$ resulting in a positive profit. There might be several implementations according to this definition with the same required area. In this case, $y^+_k$ is the implementation with minimum latency $t_{k,j}$ (and minimum $j$). If no such implementation $y^+_k$ exists (i.e., no implementation with positive profit was found), then the CI is not considered for selecting a different implementation. Among the CIs for which $y^+_k$ exists, the algorithm greedily chooses the one with the maximum profit and upgrades $y$ to select $y^+_k$ for the chosen CI $k$ (Line 23). This process is repeated such that in every iteration the CI $k$ with maximum $\mathrm{profit}(y^+_k, y_k)$ is upgraded. In every iteration, we either select a CI for increasing its allocated area by a minimum of one or the algorithm terminates. For every iteration but the last, we perform one WCET estimate. Therefore, we perform a maximum of $A$ WCET estimates.
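The greedy procedure can be sketched in the same toy setting. This is an illustrative reading of the description, not the original Algorithm 2: the data are hypothetical, CIs are indexed from 0, the worst-case path is found by a maximum over explicitly listed paths instead of an ILP, and the profit uses one execution count per CI instead of one per super block.

```python
# Hypothetical input, per CI: (area, latency, reconfiguration delay);
# index 0 is the software implementation.
IMPLS = [
    [(0, 100, 0), (1, 30, 40), (2, 12, 80), (2, 20, 80)],  # CI 0
    [(0, 150, 0), (1, 60, 40), (3, 15, 120)],              # CI 1
]
PATHS = [(10, 2), (3, 8)]  # worst-case execution counts of each CI per path
TOTAL_AREA = 3

def worst_path(selection):
    """Execution counts x on the current worst-case path (one estimate)."""
    return max(PATHS, key=lambda path: sum(
        n * IMPLS[k][j][1] for k, (n, j) in enumerate(zip(path, selection))))

def profit(k, new_j, cur_j, counts):
    """Latency reduction on the worst-case path minus extra reconfiguration."""
    _, lat_new, r_new = IMPLS[k][new_j]
    _, lat_cur, r_cur = IMPLS[k][cur_j]
    return (lat_cur - lat_new) * counts[k] - (r_new - r_cur)

def next_best(k, cur_j, free_area, counts):
    """y+_k: smallest area increase with positive profit, then min latency."""
    cur_area = IMPLS[k][cur_j][0]
    options = [(a, lat, j) for j, (a, lat, _) in enumerate(IMPLS[k])
               if cur_area < a <= cur_area + free_area
               and profit(k, j, cur_j, counts) > 0]
    return min(options)[2] if options else None

def greedy_select():
    selection = [0] * len(IMPLS)             # start with all software
    estimates = 0
    while True:
        used = sum(IMPLS[k][j][0] for k, j in enumerate(selection))
        free_area = TOTAL_AREA - used
        if free_area == 0:
            break                            # nothing left to allocate
        counts = worst_path(selection)
        estimates += 1
        upgrades = {}
        for k, cur_j in enumerate(selection):
            new_j = next_best(k, cur_j, free_area, counts)
            if new_j is not None:
                upgrades[k] = (profit(k, new_j, cur_j, counts), new_j)
        if not upgrades:
            break                            # no positive profit remains
        chosen = max(upgrades, key=lambda k: upgrades[k][0])
        selection[chosen] = upgrades[chosen][1]
    return tuple(selection), estimates

selection, estimates = greedy_select()
```

In this example the worst-case path switches after each of the first two upgrades, which is why the path is re-estimated before each new round of profit calculations. Since every upgrade consumes at least one unit of area, the number of estimates stays within $A$.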
EXPERIMENTAL EVALUATION
Evaluation Setup
We evaluated this work on a reconfigurable processor presented in Henkel et al. [2011, 2012] that we extended for execution with hard real-time guarantees. The implementation is based on the Gaisler LEON3 SoC, with a SPARC V8 in-order microarchitecture, and supports several real-time operating systems. Its seven-stage pipeline was extended into a reconfigurable core: The Execute stage allows us to stall the pipeline and pass operands to the fabric. A CI is executed by a so-called CI Execution Controller. Its protocol is similar to other multi-cycle instructions like division and directly accesses register operands or non-cacheable Scratchpad Memory (see Bauer et al. [2008] for more details). This way, in microarchitectural analysis, when determining WCETs of basic blocks, a CI is just another multi-cycle instruction that does not influence data cache analysis (if applied). The reconfigurable fabric is divided into $A$ equally sized partitions, complying with common models of allocating reconfigurable fabric area as assumed in our system model (see Section 3). A Reconfiguration Unit with private memory to store configurations provides predictable reconfiguration of CIs. Initiating a specific configuration is done by a single store of the CPU using the memory-mapped interface of the Reconfiguration Unit (see Damschen et al. [2016] for more details).
Our timing analysis of tasks on reconfigurable processors has been detailed in previous work and was evaluated using the commercial timing analyzer AbsInt aiT [AbsInt 2016] (see Damschen et al. [2016]). In this work, we extend WCET analysis as an integral part of WCET optimization. Therefore, we implemented our optimal search and heuristic selection algorithms as processors within the open-source WCET estimation framework OTAWA [Ballabriga et al. 2010]. We extended the existing analysis support for the LEON3 CPU in OTAWA to support CI opcodes, CI super blocks with configuration-dependent latency, and reconfiguration delay. We evaluated our analysis with an H.264 encoder application that uses nine CIs covering the most compute-intensive kernels shown in Table I. Multimedia applications in general are regularly subject to hard real-time constraints in the domain of computer vision. Notable examples are advanced driver assistance systems, for example, vehicle detection and tracking [Betke et al. 2000], but also consumer electronics, for example, face recognition in digital cameras [Yang et al. 2006]. The H.264 encoder contains complex control flow with numerous decisions and nested loops. Most of the properties tested in the Mälardalen or TACLeBench WCET benchmarks are covered, for example, the Discrete Cosine Transform is contained in all three. The H.264 decoder, which is part of TACLeBench, performs a subset of the computations performed in the H.264 encoder that we evaluate. The EncodeMacroBlock kernel in particular stresses our selection heuristic (more details in Section 7.4.2), as it contains separate compute-intensive paths that share some CIs. The kernel iterates over macroblocks (MBs). Which path is executed within a kernel iteration depends on the type of MB, either I-MB or P-MB, determined by the MotionEstimation kernel, that is, it is input dependent.
The I-MB and P-MB paths also contain separate CIs, leading to instability of the worst-case path, that is, adding more partitions to the current worst-case path can result in the other path becoming the worst case. We compiled the application using BCC 4.4.2 (Gaisler's extended GCC 4.4.2) at O1 and performed our selection on the encoder for a frame size of 99 MBs (QCIF resolution). At higher optimization levels, GCC emitted irreducible loops, that is, complex loop structures that cannot be extracted as well-defined loop routines by the timing analyzer. Therefore, O1 provided the lowest WCET bound for the baseline executing all CI super blocks in software. The selection is performed offline and runs on a workstation with an AMD FX-6300 CPU and 12GB of RAM. The result is used to generate a single configuration for every kernel that includes CI super blocks. The configurations are supplied to the optimized application on the target system by loading them into the private memory of the Reconfiguration Unit before executing the application. Before entering a kernel that includes CI super blocks, its specific configuration is triggered. The pipeline stalls for the reconfiguration delay and continues with entering the kernel once reconfiguration finishes.
The parameters evaluated were different numbers of partitions A (300 slices each on a Xilinx Virtex 7), reconfiguration bandwidths, as well as ratios of CPU frequency to fabric frequency f CPU / f fabric . f fabric stays constant at 100MHz, and we choose multiples of it for f CPU that resemble realistic setups. For example, running the CPU at f CPU = 400MHz, the frequency at which the LEON3 CPU is advertised to run as an application-specific integrated circuit (ASIC) implementation, corresponds to the parameter f CPU / f fabric = 4. The successor of the LEON3, the LEON4, is advertised as running at 1500MHz, corresponding to f CPU / f fabric = 15. The commercially available Xilinx Zynq-7000 SoC couples an ARM Cortex A9 at 866MHz with a Xilinx 7-Series reconfigurable fabric, corresponding to f CPU / f fabric ≈ 9. Note that while the WCET in seconds (WCET cycles / f CPU ) is anticipated to get lower (better) with higher f CPU , the WCET cycles increase (at a constant f fabric ), because hardware CIs perform fewer computations on the reconfigurable fabric within one CPU cycle.
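The relation between WCET cycles and WCET in seconds can be made concrete with a small calculation. The sketch below assumes, for illustration only, that the software part of a task costs a fixed number of CPU cycles regardless of f CPU and that a hardware CI occupies the fabric for a fixed number of fabric cycles while the pipeline stalls. The cycle counts (1000 and 12) and the function names are hypothetical.

```python
F_FABRIC_HZ = 100_000_000  # fabric frequency is held constant at 100 MHz


def ci_cpu_cycles(fabric_cycles: int, f_cpu_hz: int,
                  f_fabric_hz: int = F_FABRIC_HZ) -> int:
    """CPU cycles spent waiting for a CI that runs fabric_cycles on the fabric."""
    num = fabric_cycles * f_cpu_hz
    return (num + f_fabric_hz - 1) // f_fabric_hz  # ceiling division


def wcet(software_cycles: int, ci_fabric_cycles: int, f_cpu_hz: int):
    """Return (WCET in CPU cycles, WCET in seconds) for the toy task."""
    cycles = software_cycles + ci_cpu_cycles(ci_fabric_cycles, f_cpu_hz)
    return cycles, cycles / f_cpu_hz


for f_cpu in (400_000_000, 866_000_000, 1_500_000_000):
    cycles, seconds = wcet(1000, 12, f_cpu)
    print(f"f_CPU/f_fabric = {f_cpu / F_FABRIC_HZ:5.2f}: "
          f"{cycles} cycles, {seconds * 1e6:.3f} us")
```

Under these assumptions the cycle count grows with the frequency ratio while the time in seconds shrinks, which is the trend noted above.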
In Sections 7.2 and 7.3, we focus on the effects of considering reconfiguration delay and infeasible path information on the selection result, respectively. In Section 7.4, we analyze the effectiveness of pruning during optimal search and compare runtime as well as quality of selection results of our optimal search and heuristic algorithms. Note that all discussed results are upper bounds of the actual WCET. In general, it is not possible to obtain the actual WCET [Wilhelm et al. 2008] . 
Impact of Reconfiguration Delay on WCET-Optimizing Selection
In this section, we evaluate the impact of reconfiguration delay on WCET-optimizing CI selection. Figure 6 shows results obtained by applying our optimal search algorithm (see Section 5) to the EncodeMacroBlock kernel of the H.264 encoder for f CPU / f fabric ∈ [1 : 16] and a reconfiguration bandwidth of 200MB/s (half of the theoretical maximum in current Xilinx FPGAs). We compare the results obtained by considering the reconfiguration delay during selection, as in Equation (9), with results obtained by ignoring it (i.e., r k, j = 0 ∀k ∈ CI, j ∈ m k ). The final WCET bound always includes the reconfiguration delay required to configure the selection result. Figure 6(a) shows the results for A = 7, that is, the algorithm can allocate up to 7 partitions for the selection to optimize the WCET of this kernel. For f CPU / f fabric ∈ [1 : 3], the selections and the resulting WCET bounds are equal. For higher CPU frequencies, the WCET bound obtained by ignoring the reconfiguration delay during selection is higher than the WCET bound obtained by considering it, with a maximum increase of 4.08% at f CPU / f fabric = 16. More importantly, the lower WCET bounds are obtained with fewer partitions. It is not beneficial to use all 7 partitions for f CPU / f fabric ∈ [4 : 16], because the CIs having the biggest effect on reducing the WCET bound are implemented in hardware already. Increasing the number of allocated partitions for these CIs yields diminishing returns in their latency reduction. In total, this leads to an increase of the WCET bound, because the additional reconfiguration delay outweighs the latency reduction on the WCET path. This effect becomes even more apparent with A = 10, as shown in Figure 6(b), keeping all other parameters as in Figure 6(a).
In this case, ignoring the reconfiguration delay already yields a WCET bound that is 4.02% higher at f CPU / f fabric = 1 and up to 17.14% higher at f CPU / f fabric = 16 than when considering the reconfiguration delay. Furthermore, at f CPU / f fabric = 16 only half the partitions are required when considering the reconfiguration delay (5 partitions, as compared to 10 when ignoring it). The effect of obtaining lower WCET bounds with fewer partitions when considering the reconfiguration delay during selection becomes more severe with higher reconfiguration delay (measured in CPU cycles) per allocated partition. The reconfiguration delay per partition increases when f CPU / f fabric is increased (e.g., when using a higher-frequency CPU) or the reconfiguration bandwidth is lowered (e.g., when using cheaper memory). Additionally, raising A further would again lead to worse selections when not considering the reconfiguration delay, as more partitions would be allocated for only little CI latency improvement.
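How the reconfiguration delay per partition scales with CPU frequency and bandwidth can be illustrated with a short calculation. The sketch assumes that the delay equals the time to transfer a partition's configuration data at the given bandwidth, converted into CPU cycles. The configuration size of 100,000 bytes per partition is a hypothetical value chosen for round numbers, and 1MB is taken as 10^6 bytes.

```python
def reconf_delay_cycles(partitions: int, bytes_per_partition: int,
                        bandwidth_bytes_per_s: int, f_cpu_hz: int) -> int:
    """Reconfiguration delay in CPU cycles for the given number of partitions."""
    num = partitions * bytes_per_partition * f_cpu_hz
    return (num + bandwidth_bytes_per_s - 1) // bandwidth_bytes_per_s  # ceiling


BYTES_PER_PARTITION = 100_000  # hypothetical configuration size

for ratio in (1, 4, 16):
    for bandwidth in (400_000_000, 200_000_000):
        delay = reconf_delay_cycles(1, BYTES_PER_PARTITION, bandwidth,
                                    ratio * 100_000_000)
        print(f"ratio {ratio:2d}, {bandwidth // 1_000_000} MB/s: {delay} cycles")
```

In this model the delay in cycles is proportional to f CPU / f fabric and inversely proportional to the bandwidth, so each additional partition must save correspondingly more cycles on the worst-case path to pay off.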
In sum, not considering the reconfiguration delay during WCET-optimizing CI selection can lead not only to suboptimal results but also to higher WCET bounds that allocate more partitions than required in the optimal results (considering reconfiguration delay). Existing approaches for selecting CIs to optimize the WCET target application-specific instruction set processors (ASIPs) instead of reconfigurable processors and therefore do not consider reconfiguration delay (see Section 2). While runtime reconfiguration provides the flexibility to utilize the whole fabric area per kernel and was proven to provide substantial WCET reductions [Damschen et al. 2016], the reconfiguration delay needs to be considered during selection to avoid suboptimal results. Previous approaches targeting ASIPs can therefore not be applied to reconfigurable processors, as the results obtained by our approach show.
Impact of Infeasible Path Information on WCET-Optimizing Selection
As motivated in Figure 2, previous WCET-optimizing selection and allocation approaches relying on timing schema cannot utilize information about the global program flow. Therefore, the global flow information provided by annotation languages in state-of-the-art timing analyzers (see Kirner et al. [2011] for an overview), which is crucial to precise WCET bounds, cannot be utilized during optimization, and decisions are made on imprecise WCET estimates. In the evaluations of our approach, the CFG was annotated with infeasible path information using the XML-based FFX language [Bonenfant et al. 2012] supported by OTAWA. During WCET bound estimation, these annotations are translated into IPET constraints (see Section 3.1). Similarly to all other IPET constraints used in our optimization approach, the constraints need to be generated only once for the whole optimization process and can be reused for all WCET bound estimations. Figure 7 shows results with and without infeasible path information obtained by applying our optimal search algorithm (see Section 5) to the EncodeMacroBlock kernel of the H.264 encoder for several parameters. For evaluating the effects of infeasible path information, we annotate the path encoding I-MBs (see Section 7.1) as infeasible, which becomes the worst-case path only at some point when adding partitions to CIs (the exact point depends on f CPU / f fabric and the reconfiguration bandwidth). For most selections and especially when not allocating any partitions (original software instead of hardware CIs only), the P-MB path is the worst-case path. Still, we can show that annotating the I-MB path as infeasible has a considerable effect on the resulting WCET bound. A reconfiguration bandwidth of 400MB/s, the theoretical maximum in current Xilinx FPGAs, is used in Figure 7(a) for allocating A = 5 partitions.
At f CPU / f fabric = 1, the difference between the optimized WCET bounds with and without infeasible path information is maximal, with 12.71% more WCET cycles when not utilizing the infeasible path information. The additional WCET cycles are a result of allocating partitions to CIs that lie on the path marked as infeasible. Our approach enables us to utilize this information during optimization and therefore does not allocate any partitions to the infeasible path when this information is provided. For f CPU / f fabric ∈ [2 : 4], the difference decreases down to 3.45% at f CPU / f fabric = 4, because the increased speed of the CPU relative to the fabric compensates for the partition wasted on the infeasible path when not utilizing the global flow information. One might suspect that this effect is only possible at this high reconfiguration bandwidth, because the reconfiguration delay of adding an additional partition to the worst-case path while utilizing infeasible path information might be too high to reduce the WCET bound otherwise (see Section 7.2). However, with half the reconfiguration bandwidth (200MB/s), the results remain valid with 11.95% additional cycles at f CPU / f fabric = 1 and 3.08% at f CPU / f fabric = 4 when comparing the selection not utilizing WCET path information with the selection that does. For values of f CPU / f fabric higher than 4, the instability of the worst-case path leads to the infeasible path not appearing for A = 5. Keeping the reconfiguration bandwidth at 200MB/s and increasing A leads to an additional effect shown in Figure 7(c). The difference between the WCET bounds obtained with infeasible path information and without is generally lower, with a maximum of 3.90% additional WCET cycles at f CPU / f fabric = 3.
The reason is that with infeasible path information the optimal choice allocates only 6 of the A = 7 available partitions, because the reconfiguration delay of adding an additional partition to the worst-case path is now too high to effectively reduce the WCET bound. Adding an additional partition to the infeasible path when not utilizing this information therefore adds reconfiguration delay to the WCET bound without providing actual benefit. Therefore, similarly to the impact of reconfiguration delay evaluated in Section 7.2, utilizing infeasible path information during WCET-optimizing CI selection provides better results and can even provide better results with fewer partitions. Previous WCET-optimizing approaches for CI selection and memory allocation relied on timing schema and therefore were unable to utilize global flow information (see Section 2). However, utilizing this information is crucial to obtain good selection results.
While our approach enables utilizing global flow information and considering the reconfiguration delay during optimization, it adds complexity by utilizing IPET over simpler techniques like timing schema to obtain WCET estimates. In the following, we will evaluate the quality of the results of our heuristic compared to optimal search as well as runtimes to demonstrate the practicality of our approach.
Runtimes, Pruning, and Quality of Heuristic Selection
Sections 7.2 and 7.3 demonstrated the importance of considering reconfiguration delay as well as global program flow information during WCET optimization. In contrast to previous approaches, our approach enables us to consider both types of information. However, this requires evaluating several IPET instances and therefore raises the question whether the runtimes of the optimization remain within acceptable bounds. Tables II to V show evaluation results for all major kernels of the H.264 encoder application. We fixed f CPU / f fabric at 4 and the reconfiguration bandwidth at 400MB/s, as this reflects the realistic setup of running the CPU at 400MHz (the frequency at which the LEON3 processor is advertised to run when implemented as an ASIC) and the reconfigurable fabric at 100MHz, as well as running the configuration port at its maximum speed. The scalability and effectiveness of pruning during the optimal search as well as the heuristic are evaluated by running the algorithms for A ∈ [0 : 21]. Giving optimal search the freedom to allocate up to A = 21 partitions results in the maximum number of candidates for the most complex kernel EncodeMacroBlock (see Table V). The maximum number of candidates is reached for lower A for the MotionEstimation (A = 11) and LoopFilter (A = 4) kernels, and the additional measurements for these kernels are therefore omitted. Tables II to V are in increasing order of kernel complexity (number of instructions and number of CI super blocks). The first line of the tables is the total number of possible selections calculated as $\prod_{k \in CI} (m_k + 1)$, that is, the number of all combinations of configurations, plus the original software implementation, per CI without any restrictions. Weak compositions of A were explained in Section 5 as a technique we apply for pruning selection candidates during optimal search. The number of weak compositions of all s ≤ A into exactly |CI| parts is calculated as $\sum_{s=0}^{A} \binom{s+|CI|-1}{|CI|-1}$ [Heubach and Mansour 2004]. Opt. Estimates and Heur. Estimates are the number of WCET estimates calculated using Equation (7) during optimization using optimal search and the heuristic, respectively. The last lines of the tables are the runtimes of the optimizations and the speedups obtained on the WCET estimate of the kernel, comparing the selection result to the software-only implementation.
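The candidate counts reported for the kernels can be checked with a few lines of code. The sketch below assumes that the reported number of compositions sums the weak compositions of every s from 0 to A into exactly |CI| parts, a reading that reproduces the 78 compositions stated for MotionEstimation at A = 11. The function names are illustrative and not taken from the implementation.

```python
from math import comb, prod


def total_selections(options_per_ci):
    """Product over all CIs of (m_k + 1): hardware alternatives plus software."""
    return prod(options_per_ci)


def weak_compositions_up_to(area, n_ci):
    """Weak compositions of every s in 0..area into exactly n_ci parts."""
    return sum(comb(s + n_ci - 1, n_ci - 1) for s in range(area + 1))


# MotionEstimation: two CIs with 4 and 78 options, A = 11.
print(total_selections([4, 78]), weak_compositions_up_to(11, 2))

# EncodeMacroBlock: six CIs, 570,240 possible selections, A = 21.
compositions = weak_compositions_up_to(21, 6)
print(compositions, f"{100 * compositions / 570_240:.2f}%")
```

The sum has the closed form C(A + |CI|, |CI|), so the count grows polynomially in A with degree |CI|, which is why compositions alone prune less effectively for the six-CI kernel.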
7.4.1. Effectiveness of Pruning and Scalability of Optimal and Heuristic Selection. Table II shows the results obtained for the evaluated kernel of least complexity, LoopFilter. The kernel includes one CI super block only, with seven implementation alternatives and therefore seven possible selections in total. For only one CI, the optimal search performs as many WCET estimate calculations as the number of weak compositions of A into exactly |CI| parts (denoted as number of compositions for the remainder of this text) until A = 4. For higher A, the number of estimates remains constant, as no possible implementation for the CI requiring more than four partitions exists. The heuristic never performs more than three estimates while the optimal search performs a maximum of five; however, the complexity of the kernel is too low to show any measurable runtime effect.
The kernel of next higher complexity is MotionEstimation, and its evaluation results are shown in Table III. It includes two CI super blocks having 4 and 78 different implementations, for a total of 312 possible selections. This is enough to demonstrate the effectiveness of pruning by finding selections that correspond to weak compositions of A (see Section 5), as even at A = 11 the 78 possible compositions are only a quarter of the total number of possible selections. The additional pruning of the search space in our recursive search further reduces the search space to 30 candidates, which is 9.62% of the total selection candidates and 38.46% of the number of compositions. The heuristic further reduces the number of WCET estimations performed to 11, which is 36.67% of the estimations performed by the optimal search. The runtime benefit is barely measurable, however.
EncodeMacroBlock is the most complex kernel in our evaluation. It contains six CI super blocks, resulting in a total of 570,240 possible selections. The results are shown in Table IV and Table V. Again, finding selections that correspond to weak compositions of A prunes the search space effectively, but for |CI| = 6 the number of possible compositions of A already grows rapidly, reaching 51.91% of the total number of possible selections at A = 21. However, the number of estimates calculated during optimal search stays much lower, with a maximum of 4,200 at A = 21, which is 0.74% of the total number of possible selections. This is possible by pruning recursive subtrees early in our optimal search algorithm. Still, the runtime of the optimal search algorithm does not scale well with increasing A. At A = 0 it takes 4.5s, doubling already at A = 6 with 9.0s, and again doubling at A = 9 to 18.88s. For higher values of A, the runtime growth stagnates, reaching its maximum of 41.43s at A = 21. Especially because there may be numerous kernels within the application under optimization, these runtime values may hinder design space exploration. To reduce the optimization runtime at the potential cost of quality of the result (discussed in the following section), the heuristic can be applied. It performs a maximum of 10 estimate calculations, leading to a runtime of at most 4.63s, of which 4.25s are spent in CFG reconstruction and microarchitectural analysis, which are preparation steps to WCET bound estimation before any optimization can take place. Therefore, solving ILPs for estimate calculations during optimization only takes a fraction of the total runtime of the heuristic. As the dominating part of the runtime, the microarchitectural analysis, only needs to be performed once, an additional estimate requires roughly under 40ms.
This value is often dominated by the noise.
In sum, the pruning techniques for the optimal search algorithm have proved to be very effective but can still lead to runtimes unsuitable for design space exploration. In these cases the heuristic can reduce the runtime down to 11.18% of the optimal search algorithm. However, the heuristic can lead to suboptimal results in certain cases, which we will evaluate in the following section.
7.4.2. Quality of Heuristic Selection. For the kernel of least complexity, LoopFilter, and of medium complexity, MotionEstimation, our heuristic and the optimal search always find the same solution for all values of A, as shown in Table II and Table III, respectively. Therefore, we focus on the EncodeMacroBlock kernel and the evaluation results shown in Table IV, which exhibit heuristic selections that differ from optimal search (highlighted with yellow background). More specifically, the heuristic finds selections that produce 2.52% and 1.46% lower speedups at A = 5 and 6 than the optimal solution, respectively (while finding the optimal solution in all other cases). The reason for this is the calculation of the profit function in Equation (10), which tries to estimate the effect of a CI implementation on a previously calculated WCET bound. The problem is that the profit is calculated for the current worst-case path. Due to the instability of the worst-case path, adding a CI implementation y j that benefits the WCET bound can have a smaller effect on the total bound than on the current worst-case path. However, this is not sufficient for the heuristic to make a suboptimal choice. We additionally need a CI implementation that was assigned a lower profit than y j but actually has a higher effect on the total WCET bound than y j . This can happen when a CI configuration can appear in the current worst-case path as well as in the next longest path in the program. This is the case for the dct4x4 CI within the EncodeMacroBlock kernel. For example, for A = 5 the heuristic chooses the suboptimal selection as follows: After allocating four partitions for the P-MB path, the I-MB path becomes the worst-case path. The heuristic calculates a profit of 257,496 cycles for implementing ipredhdc in hardware, allocating the last partition. However, the P-MB path is only 13,659 cycles longer at this point.
Therefore, implementing dct4x4 with a profit of 46,032 cycles in hardware and effectively reducing the WCET bound by 32,373 cycles, because it appears in the P-MB as well as the I-MB path, would have been the better choice. For A < 5, the I-MB path never becomes the worst-case path. For A > 6, the heuristic has enough partitions available to fully compensate for the suboptimal decision. Therefore, the heuristic finds the optimal results in these cases.
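The gap between the profit on the current worst-case path and the effect on the total bound can be reproduced with a two-path toy model. The sketch below uses hypothetical path lengths and savings (not the values measured for EncodeMacroBlock) and ignores reconfiguration delay for brevity. The candidate names `only_on_worst` and `shared` are illustrative.

```python
def bound(paths):
    """WCET bound of the toy model: length of the longest path."""
    return max(paths.values())


def profit_on_worst_path(paths, savings):
    """Cycles saved on the current worst-case path only (what greedy sees)."""
    worst = max(paths, key=paths.get)
    return savings.get(worst, 0)


def effective_reduction(paths, savings):
    """Actual reduction of the bound after applying the savings to all paths."""
    after = {p: cycles - savings.get(p, 0) for p, cycles in paths.items()}
    return bound(paths) - bound(after)


paths = {"I-MB": 1000, "P-MB": 990}       # I-MB is the current worst case
only_on_worst = {"I-MB": 200}             # CI that appears on one path only
shared = {"I-MB": 40, "P-MB": 30}         # CI that appears on both paths

for name, savings in (("only_on_worst", only_on_worst), ("shared", shared)):
    print(name, profit_on_worst_path(paths, savings),
          effective_reduction(paths, savings))
```

Here the candidate with the higher profit (200) lowers the bound by only 10 cycles, because the other path immediately becomes the worst case, whereas the shared candidate with profit 40 lowers the bound by 40 cycles.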
CONCLUSION
In this work, we presented how to extend state-of-the-art timing analysis using IPET to perform WCET optimization on runtime-reconfigurable processors. We formulated the WCET-optimizing instruction set selection problem, that is, selecting the WCET-optimal set of reconfigurable custom instruction implementations. Techniques for generating and pruning potential instruction set selections were discussed and realized in an optimal search algorithm. We demonstrated the effectiveness of pruning in our optimal search algorithm, which needed to evaluate fewer than 1% of all 570,240 possible selections when optimizing the EncodeMacroBlock kernel as part of the H.264 encoder. However, as the optimal search algorithm still did not scale well for large problem instances, we introduced a heuristic that performs at most as many evaluations as there are partitions to allocate on the reconfigurable fabric. The heuristic was an order of magnitude faster than the optimal search for the previously mentioned kernel. Additionally, an analysis of suboptimal solutions (up to 2.52% lower speedup) obtained from the heuristic in our evaluation showed that they are a result of competing worst-case paths during optimization that share CIs. Our problem formulation and algorithms were implemented based on the timing analyzer OTAWA, showing the seamless integration into a state-of-the-art timing analysis tool.
We showed the consequences of utilizing timing schema in WCET optimization, a WCET estimation technique that does not support global program flow information but is still commonly used in state-of-the-art WCET optimization approaches. Our novel problem formulation of WCET optimization enables considering global information, such as program flow information and reconfiguration delay, during optimization, and the importance of considering this information was demonstrated. Not considering global information can lead to higher WCET bounds that require more resources than the optimal solution.
In sum, we provide novel WCET optimization approaches and introduce runtime instruction set reconfiguration as an enabling feature for timing-predictable performance. To the best of our knowledge, our model is the first formulation of a WCET optimization problem with support for global program flow information and we can envision applications to problems other than instruction set extension.
