University of Karlsruhe, Chair for Embedded Systems, Germany {lars.bauer, shafique, henkel}@informatik.uni-karlsruhe.de
INTRODUCTION AND MOTIVATION
State-of-the-art reconfigurable architectures use so-called Special Instructions (SIs) to expedite application execution. Application profiling is used to identify and extract SIs at compile time (similar to SI detection for ASIPs at design time). For each SI, these architectures provide one hardware implementation that may be reconfigured into a region of the reconfigurable fabric at run time. The application is extended to call these SIs and to issue prefetches that start the reconfigurations in time. Prefetching is important because the reconfiguration time of fine-grained reconfigurable fabrics is in the range of milliseconds and even a single application might demand repeated reconfigurations during execution. For instance, an H.264 video encoder consists of three hot spots (namely the Motion Estimation, the Encoding Engine, and the In-Loop Deblocking Filter) that are executed one after another for each video frame. Each of them demands different SIs, which necessitates repeated reconfigurations to retarget the available reconfigurable fabric towards the specific hot spot when it executes. Note that a video frame rate of 25 frames per second corresponds to 40 ms per frame, which leaves sufficient time for reconfigurations. The application may execute in parallel to the reconfigurations, even when the reconfigurations for a demanded SI are not completed yet. In such a case, the corresponding SI is executed using the core Instruction Set Architecture (cISA), i.e. the non-reconfigurable instruction set. The cISA execution is noticeably slower than the SI execution that uses the hardware accelerators, but it avoids stalling the pipeline, i.e. waiting until the reconfigurations are completed.
State-of-the-art reconfigurable architectures face a conceptual drawback concerning the parallelism that SIs exploit when executing on the reconfigurable fabric and the corresponding reconfiguration time. A higher degree of parallelism leads to better performance. However, the correspondingly longer reconfiguration time may reduce the performance, as the SIs may have to execute using the cISA for a longer time. The trade-off between parallelism and reconfiguration overhead is typically determined at compile time.
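The trade-off can be made concrete with a small back-of-the-envelope model. The following Python sketch is only an illustration under simplifying assumptions: SI calls are issued back to back, every call that starts before the reconfiguration finishes falls back to the cISA, and all latencies and reconfiguration times are made-up numbers, not measurements from this work.

```python
def total_cycles(executions, cisa_latency, hw_latency, reconfig_cycles):
    """Cycles needed for `executions` back-to-back SI calls while the hardware loads.

    Simplified, hypothetical model: a call that starts before the
    reconfiguration has finished runs on the cISA, all later calls run in hardware.
    """
    # Number of calls that start during the reconfiguration (ceiling division).
    fallback_calls = min(executions, -(-reconfig_cycles // cisa_latency))
    return fallback_calls * cisa_latency + (executions - fallback_calls) * hw_latency


CISA_LATENCY = 1000          # cycles per SI execution on the cISA (made up)
SMALL = (100, 100_000)       # (hardware latency, reconfiguration cycles), low parallelism
LARGE = (20, 400_000)        # higher parallelism, longer reconfiguration

for executions in (500, 10_000):
    small = total_cycles(executions, CISA_LATENCY, *SMALL)
    large = total_cycles(executions, CISA_LATENCY, *LARGE)
    print(executions, "executions:", "small =", small, "large =", large)
```

With these assumed numbers, the less parallel implementation wins for 500 executions, while the more parallel one wins for 10,000 executions. The better choice therefore depends on the execution frequency, which is not always known at compile time.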
This work presents the novel concept of modular SIs (see Section 2) that allows determining the parallelism/overhead trade-off efficiently at run time by preparing multiple SI implementation alternatives (varying in their parallelism/overhead) at compile time. Besides this SI model, the core of this work is RISPP's novel run-time system that provides adaptivity by dynamically selecting SI implementation alternatives at run time depending on the expected SI execution frequency (using online monitoring) and the size (i.e. capacity) of the reconfigurable fabric (see Section 3).
MODULAR SPECIAL INSTRUCTIONS
Fig. 1 shows an example of a modular SI that is obtained at compile time by partially unrolling an inner loop of the Motion Estimation from an H.264 video encoder. This SI consists of four different data paths (DPs) that are demanded multiple times to realize the overall functionality. When up to four Transform DPs are available in hardware, they may execute in parallel to realize the SI functionality. However, when only one instance of the Transform DP is available in hardware, it has to be used multiple times sequentially to achieve the SI functionality. The number of available DPs directly determines different performance/area trade-offs that are prepared at compile time. The DPs are provided as elementary reconfigurable units, which means that the run-time system may decide to reconfigure more or fewer instances of each DP to obtain different performance/area trade-offs. Additionally, DPs can be shared between different SIs (as at most one SI executes at a time), for instance the Transform DP in Fig. 1
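The relation between the number of loaded DP instances and the resulting SI latency can be sketched in a few lines of Python. This is a simplified illustration, not the latency model of RISPP: it assumes that the DP types of an SI are used one after another without overlap, and the DP name `OtherDP`, the demand counts, and all cycle numbers are hypothetical.

```python
from math import ceil


def si_latency(demand, dp_cycles, available, cisa_latency):
    """Latency of one SI execution for a given set of loaded DP instances.

    demand:    DP type -> number of uses per SI execution
    dp_cycles: DP type -> cycles per use (hypothetical values)
    available: DP type -> instances currently loaded in the fabric
    """
    # If any demanded DP type is missing, the SI falls back to the cISA.
    if any(available.get(dp, 0) == 0 for dp in demand):
        return cisa_latency
    # Otherwise each DP type is reused sequentially: ceil(uses / instances) rounds.
    return sum(ceil(uses / available[dp]) * dp_cycles[dp]
               for dp, uses in demand.items())


DEMAND = {"Transform": 4, "OtherDP": 1}     # made-up demand per SI execution
DP_CYCLES = {"Transform": 2, "OtherDP": 3}  # made-up cycles per DP use
CISA = 200                                  # made-up cISA latency

for transforms in (1, 2, 4):
    loaded = {"Transform": transforms, "OtherDP": 1}
    print(transforms, "Transform DP(s):", si_latency(DEMAND, DP_CYCLES, loaded, CISA))
```

Each additional Transform instance shortens the SI in this model, which is the kind of performance/area step that the run-time system can choose between.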

RUN-TIME SYSTEM
To exploit the unique features of modular SIs, a supporting run-time system is required. At compile time, SIs and Forecasts (i.e. prefetching instructions) are added to the application. A Forecast triggers the run-time system to determine and start the next reconfigurations. Online monitoring is used to provide information about the expected SI execution frequencies. When a Forecast triggers the run-time system, these execution frequencies are used as the input for the SI Selection. Here, for each forecasted SI, one of the implementation alternatives provided at compile time is selected. The Selection problem is modeled on a formal basis and a proof that it is NP-hard is provided [1]. Furthermore, a greedy heuristic is presented to obtain a practicable solution. It is near optimal (only 11.9% slower application execution time compared to the optimal solution) and allows starting the first reconfiguration after the first iteration over all SI implementations. This hides the subsequent algorithm execution time, as the Selection may run in parallel to the application. The resources of its hardware implementation (556 slices and 11 embedded 18x18 multipliers) are mainly used to calculate the profit function and do not affect the overall system size significantly.
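To illustrate the general idea of a greedy selection, the following Python sketch repeatedly picks the implementation alternative with the highest profit per additionally required DP until the fabric is full. It is not the heuristic or the profit function from [1]. The profit definition, the assumption that larger alternatives contain the DPs of smaller ones, the names `SI_A`, `SI_B`, and `Filter`, and all numbers are hypothetical.

```python
def union_max(a, b):
    """DPs needed to host both requirement sets; shared DP types are counted once."""
    return {dp: max(a.get(dp, 0), b.get(dp, 0)) for dp in a.keys() | b.keys()}


def greedy_select(alternatives, cisa_latency, expected_execs, capacity):
    """Pick one implementation per SI (None means cISA) within `capacity` DPs."""
    selected = {si: None for si in alternatives}
    latency = dict(cisa_latency)
    loaded = {}  # DP type -> instances required by the current selection
    while True:
        best = None
        for si, impls in alternatives.items():
            for dps, lat in impls:
                need = union_max(loaded, dps)
                gain = expected_execs[si] * (latency[si] - lat)
                if gain <= 0 or sum(need.values()) > capacity:
                    continue  # no improvement or does not fit into the fabric
                extra = sum(need.values()) - sum(loaded.values())
                profit = gain / max(extra, 1)  # assumed profit: gain per extra DP
                if best is None or profit > best[0]:
                    best = (profit, si, dps, lat, need)
        if best is None:
            return selected, loaded
        _, si, dps, lat, loaded = best
        selected[si], latency[si] = dps, lat


# Alternatives as (required DPs, latency in cycles); all values are made up.
ALTERNATIVES = {
    "SI_A": [({"Transform": 1}, 100), ({"Transform": 2}, 60), ({"Transform": 4}, 40)],
    "SI_B": [({"Transform": 1, "Filter": 1}, 80), ({"Transform": 2, "Filter": 2}, 50)],
}
CISA_LATENCY = {"SI_A": 500, "SI_B": 400}
EXPECTED_EXECS = {"SI_A": 1000, "SI_B": 100}  # would come from online monitoring

selection, fabric = greedy_select(ALTERNATIVES, CISA_LATENCY, EXPECTED_EXECS, capacity=4)
print(selection)
print(fabric)
```

With a capacity of four DPs, this sketch ends up with two Transform and two Filter instances, where both SIs share the Transform DPs. The first pick is available after a single pass over all alternatives, which mirrors the property that the first reconfiguration can start early.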
After the SI implementations are selected, the demanded DPs realizing these implementations need to be reconfigured. As at most one reconfiguration can be performed at a time and the performance of an SI depends on its available DPs, the Scheduling of the reconfiguration requests is important to exploit the feature of upgrading the SI implementations. The problem is modeled on a formal basis and four different scheduling algorithms for loading the DPs of modular SIs are developed [2]. The proposed Highest Efficiency First scheduler outperforms the other schedulers by up to 1.52x by first upgrading those SIs that provide the best expected performance improvement per additionally available DP. Its hardware requirements are moderate (549 slices and five 18x18 multipliers) compared to the average size of a DP (421 slices).
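The ordering principle of such a scheduler can be sketched as follows. This Python example is a loose illustration of "best expected improvement per additional DP", not the scheduler from [2]. The efficiency formula, the upgrade paths, the names `SI_A`, `SI_B`, and `Filter`, and all numbers are assumptions.

```python
def missing(loaded, dps):
    """DP instances that still have to be loaded to reach the requirement `dps`."""
    return {dp: n - loaded.get(dp, 0) for dp, n in dps.items() if n > loaded.get(dp, 0)}


def highest_efficiency_first(upgrade_paths, cisa_latency, expected_execs):
    """Return the order in which DPs are loaded, one reconfiguration at a time."""
    loaded, latency, order = {}, dict(cisa_latency), []
    while True:
        best = None
        for si, steps in upgrade_paths.items():
            for dps, lat in steps:
                gain = expected_execs[si] * (latency[si] - lat)
                if gain <= 0:
                    continue  # this step does not improve the SI any further
                todo = missing(loaded, dps)
                # Assumed efficiency: expected gain per DP that still has to load.
                efficiency = gain / max(sum(todo.values()), 1)
                if best is None or efficiency > best[0]:
                    best = (efficiency, si, lat, todo)
        if best is None:
            return order
        _, si, lat, todo = best
        for dp, n in sorted(todo.items()):
            order.extend([dp] * n)
            loaded[dp] = loaded.get(dp, 0) + n
        latency[si] = lat


# Upgrade steps towards the selected implementations; all values are made up.
PATHS = {
    "SI_A": [({"Transform": 1}, 100), ({"Transform": 2}, 60)],
    "SI_B": [({"Transform": 1, "Filter": 1}, 80), ({"Transform": 2, "Filter": 2}, 50)],
}
CISA = {"SI_A": 500, "SI_B": 400}

print(highest_efficiency_first(PATHS, CISA, {"SI_A": 1000, "SI_B": 100}))
print(highest_efficiency_first(PATHS, CISA, {"SI_A": 1000, "SI_B": 2000}))
```

In this sketch, a rarely executed `SI_B` is upgraded only after `SI_A` is complete, whereas a frequently executed `SI_B` causes Transform and Filter loads to interleave. The load order thus follows the monitored execution frequencies rather than a fixed compile-time sequence.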
Let us analyze a detailed run-time scenario to demonstrate the benefits of our proposed RISPP architecture. Fig. 2 shows the execution of the first two hot spots ("Motion Estimation" and "Encoding Engine") of an H.264 video encoder. The SI Selection is performed at the beginning of each hot spot. The lines show the latencies (i.e. demanded cycles per SI execution) for four SIs using a logarithmic scale. Whenever a latency line decreases, the DPs to upgrade the SI implementation have just finished loading. The lines thus illustrate the reconfiguration Scheduling. After all reconfigurations for a hot spot are performed, the targeted SI implementation is obtained, which shows the result of the SI Selection. The bars show the number of SI executions per 100,000 cycles and thereby the performance improvement due to upgrading the modular SIs.
RESULTS AND CONCLUSION
This work presents the RISPP architecture with its novel concept of modular SIs that allow run-time adaptations and upgrading of SI implementations. Furthermore, a novel run-time system is used to exploit the provided potential by dynamically selecting SI implementation alternatives and determining the reconfiguration sequence for the DPs that implement the selected SIs. Although RISPP is not dedicated to or optimized for any specific application, a complete H.264 video encoder is used to perform exhaustive benchmarks (using the simulation environment and the running FPGA prototype [3]) because this application represents an actual challenge for state-of-the-art ASIPs and reconfigurable architectures. Depending on the amount of provided reconfigurable fabric, a performance improvement of up to 25.7x (on average 17.6x) is obtained in comparison to a GPP [4]. An ASIP may provide the same SIs using non-reconfigurable DPs. Due to RISPP's increased resource utilization (achieved by dynamically reconfiguring parts of the fabric), a performance improvement of up to 3.1x (on average 1.4x) is achieved in comparison to ASIPs [4]. Especially when the hardware resources to implement the DPs are limited (e.g. due to area constraints or sharing of the area among multiple applications), RISPP unveils its potential to increase performance and efficiency. In comparison to state-of-the-art reconfigurable architectures, performance improvements of up to 2.38x [2] and 7.19x [1] are achieved when comparing with Molen [5] and Proteus [6], respectively.
