ABSTRACT
INTRODUCTION
In the context of embedded processing, there are two main ways to improve performance. In the first approach, one first selects a processor that meets macro constraints such as cost and footprint, and then optimizes the intended application for that processor. In the current state of the art, this often entails programming the core sections of the code in assembly. Alternatively, given the application, one can optimize the hardware to be used to execute the application. This gives the designers better control over the design and helps them meet the optimization criteria. However, engineering a new processor is often a very expensive proposition.
The introduction of tightly coupled reconfigurable processors offers a new degree of freedom in the design space. They give designers the cost-effectiveness of off-the-shelf silicon together with the flexibility of optimizing parts of the hardware for specific applications.
In an earlier work, we introduced a dynamically reconfigurable processor architecture that we called Adaptive EPIC (Explicitly Parallel Instruction Computing) [20]. The basic design consists of a segmented reconfigurable array that is tightly coupled with an EPIC processor. The AEPIC architecture combines the advantages of EPIC, with its simpler architecture backed by well-known compiler technology, and those of programmable logic, which exploits fine-grain parallelism through explicit control over micro-architectural features. In that paper, we briefly described the design of the processor and the compiler considerations that are needed to work with such a processor. The contributions of the present work are as follows:
We describe a simulation infrastructure that realistically simulates both the main EPIC core and the reconfigurable component. We use state-of-the-art FPGA technology to obtain realistic measurements for the latter.
We present simulation results on embedded benchmarks showing that the concept significantly benefits embedded computing, especially when the processor needs to operate at lower power and hence lower frequencies.
PREVIOUS WORK
The earliest known computing system based on reconfigurable devices was proposed and implemented by Gerald Estrin at UCLA [8]. It was a hybrid machine consisting of a general-purpose processor interconnected with high-speed logic devices, which were reconfigured manually. In the last two to three years, even some FPGA manufacturers have shown interest in combining FPGAs with standard processor cores. These devices target the embedded market, which is reflected in the processor cores chosen to accompany the programmable logic (from an 8-bit microprocessor at 40 MHz up to a 32-bit RISC processor at 166 MHz). Some released or announced devices are Triscend's E5 and A7, Atmel's FPSLIC, and Chameleon Systems' CS2000. All of these devices except Chameleon's use a fine-grained reconfigurable array in which the FPGA logic blocks operate on one- or two-bit-wide data. The exception, the CS2000, contains up to eighty-four 32-bit datapath units (DPUs), each of which includes a 32-bit arithmetic-logic unit (ALU). Enzler and Platzner [7] survey the current products in this domain and the future trends.
ADAPTIVE EXPLICITLY PARALLEL INSTRUCTION COMPUTING
The AEPIC architecture provides a dynamically varying EPIC-style architectural interface to the executing process. This means that the interface observed by the executing program in any machine cycle is that of an explicitly scheduled EPIC architecture [16, 19]. The variation can be in the number and types of instructions that can be executed in any given machine cycle. A machine that implements the AEPIC architecture may be composed of hardwired functional units and some programmable logic that can be reconfigured to implement application-specific instructions. On an AEPIC machine, the running program also controls the adaptation. However, the decisions of when and how to reconfigure are pre-determined by the compiler and embedded into the code generated for the given application.
Key features of the architecture
The AEPIC architecture is motivated by a desire to (1) enable efficient reconfiguration of the processor data-path at runtime, (2) allow the compiler to determine the reconfiguration decisions in a flexible and efficient manner, and (3) allow AEPIC researchers to study a wide variety of AEPIC machine configurations. To achieve these goals, the AEPIC architecture proposes the following novel features.
Compiler specified resource allocation. Here we are referring to resources that are intended for hosting Configured Functional Units (CFUs). AEPIC delegates to the compiler the task of specifying which regions of the program code will execute on the programmable logic and when they are allocated to (and de-allocated from) the programmable resources on the processor.
Architecturally transparent resource assignment. Although the AEPIC compiler decides which particular piece of the computation should be performed on the programmable logic each cycle, the processor determines which region of the programmable logic resource is used for hosting that computation.
Support for efficient context switching and modular software development. The AEPIC architecture allows multiple CFUs to be instantiated simultaneously and groups them into distinct sets so that, on any cycle, a particular set of CFUs is considered active. These active CFUs are the ones on which operations can be executed. The architecture also provides special instructions to alter these CFU sets or to switch between sets to make a different one active.
Explicitly controlled configuration cache hierarchy. AEPIC provides architectural mechanisms to explicitly control the data placement in the configuration cache hierarchy. This feature is a natural extension of the explicitly controlled data cache hierarchy mechanisms provided in some EPIC architectures [16]. It is expected to play an even more significant role in AEPIC processing, where configuration cache misses can be more expensive than conventional data cache misses. Since applications are expected to have far fewer configurations than program values (which go through the traditional cache hierarchy), explicit control of configuration data placement is expected to be feasible and advantageous.
Implicitly specified operands for configured functional units. Unlike typical RISC operations, some of the operations performed by CFUs may take a large number of input/output operands. To simplify the instruction decode logic and to keep the instruction format simple, operands for CFU operations are not specified as part of the instruction itself. Instead, the AEPIC architecture specifies operand assignment operations that associate specified registers as sources (destinations) for input (output) operands of CFU operations.
We now describe the Multi-context Reconfigurable Logic Array (MRLA) in a little more detail. The structure of the MRLA is shown in Figure 2.
Programming AEPIC
In this subsection, we will describe how the CFUs can be used by means of pseudo-code. Programming the CFUs consists of two parts: the configuration and the usage.
The CFU configuration code is a sequence of five instructions that allocates space and registers for a configuration and loads it into the C-cache and the MRLA. The calloc in line 1 allocates an adequate number of blocks in the C-cache for the configuration located at the memory address pointed to by reg. It also associates configuration register cr with the configuration. The malloc instruction in line 2 allocates the required number of slices on the MRLA, in the context specified by the literal cid, for the configuration associated with cr. The number of slices required by the CFU is obtained from the information stored in the configuration cr. In line 3, the configuration data associated with cr is transferred from the C-cache to the MRLA. The instructions inp and outp (lines 4 and 5) associate array registers as input and output registers for cr. There can be several input and output registers, with the base given by ar and the count given by the literal lit. Several configurations can also be associated with the same register. This is possible because only one configuration can be called at a time, and the association matters only during the input and output of the computation implemented by that configuration.
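The five configuration steps described above can be sketched as a small Python model. This is an illustrative sketch only, not the paper's simulator or instruction encoding: the class `AepicModel`, the method name `load` for the unnamed line-3 instruction, the dictionary that stands in for the configuration found at the address in reg, and all sizes are assumptions.

```python
class AepicModel:
    """Toy model of the configuration steps; all sizes are hypothetical."""

    def __init__(self, cache_blocks, slices_per_context):
        self.free_blocks = cache_blocks               # free C-cache blocks
        self.free_slices = list(slices_per_context)   # free MRLA slices per context
        self.cr = {}                                  # configuration registers

    def calloc(self, cr, config):
        # Reserve C-cache blocks for the configuration and bind it to cr.
        if config["blocks"] > self.free_blocks:
            raise MemoryError("not enough C-cache blocks")
        self.free_blocks -= config["blocks"]
        self.cr[cr] = {"config": config, "ctx": None, "loaded": False,
                       "inputs": [], "outputs": []}

    def malloc(self, cr, cid):
        # Reserve MRLA slices on context cid; the count comes from the configuration.
        need = self.cr[cr]["config"]["slices"]
        if need > self.free_slices[cid]:
            raise MemoryError("not enough MRLA slices")
        self.free_slices[cid] -= need
        self.cr[cr]["ctx"] = cid

    def load(self, cr):
        # Hypothetical name for the line-3 instruction: C-cache to MRLA transfer.
        if self.cr[cr]["ctx"] is None:
            raise RuntimeError("malloc must precede load")
        self.cr[cr]["loaded"] = True

    def inp(self, cr, ar, lit):
        # Associate lit array registers, starting at ar, as inputs of cr.
        self.cr[cr]["inputs"] = list(range(ar, ar + lit))

    def outp(self, cr, ar, lit):
        # Associate lit array registers, starting at ar, as outputs of cr.
        self.cr[cr]["outputs"] = list(range(ar, ar + lit))


machine = AepicModel(cache_blocks=16, slices_per_context=[8, 8])
bitstream = {"blocks": 4, "slices": 3}    # hypothetical configuration in memory
machine.calloc(cr=0, config=bitstream)    # line 1
machine.malloc(cr=0, cid=1)               # line 2
machine.load(cr=0)                        # line 3
machine.inp(cr=0, ar=0, lit=2)            # line 4
machine.outp(cr=0, ar=2, lit=1)           # line 5
```

The model only tracks resource bookkeeping; it does not attempt to model configuration load latency or C-cache misses.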
Each time a CFU is activated, a five-line code sequence is performed. Before the call, in line 1, the necessary input data must be transferred to the CFU input registers using standard instructions such as register moves or memory loads. In line 2, the setctx instruction sets the context holding the configuration to be executed as the current context. This instruction is needed only if the required configuration is not in the current context. The instruction in line 3 calls the operation opid on the CFU associated with cr, which triggers the execution of the CFU on the given input. The next instruction, wtc cr in line 4, waits for the execution on the CFU to complete, effectively stalling the processor. Finally, in line 5, the output of the CFU's computation is transferred out of the output registers.
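The invocation sequence can be illustrated with a minimal Python sketch. The `Cfu` class, the `invoke` helper, the constant latency, and the use of a Python function in place of the configured circuit are all assumptions made for illustration; the sketch also assumes that the core and the CFU run at the same frequency, so one CFU cycle stalls the core for one cycle.

```python
class Cfu:
    """Toy configured functional unit; operations and latency are hypothetical."""

    def __init__(self, context, ops, latency):
        self.context = context    # MRLA context holding this configuration
        self.ops = ops            # opid -> function standing in for the circuit
        self.latency = latency    # CFU cycles per call (assumed constant)
        self.inputs = []
        self.outputs = []
        self.busy = 0             # CFU cycles still to run

    def call(self, opid):
        # Start the operation; the result becomes visible once busy reaches 0.
        self.outputs = [self.ops[opid](*self.inputs)]
        self.busy = self.latency


def invoke(cfu, state, opid, args):
    """Run one CFU operation and return (result, core cycles stalled)."""
    cfu.inputs = list(args)              # line 1: move data to the input registers
    if state["ctx"] != cfu.context:      # line 2: setctx, only when needed
        state["ctx"] = cfu.context
    cfu.call(opid)                       # line 3: trigger the operation
    stalled = 0
    while cfu.busy:                      # line 4: wtc stalls the core
        cfu.busy -= 1
        stalled += 1
    return cfu.outputs[0], stalled       # line 5: move data out


state = {"ctx": 0}
adder = Cfu(context=1, ops={0: lambda a, b: a + b}, latency=3)
result, stalled = invoke(adder, state, opid=0, args=(2, 5))
```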
AEPIC SIMULATOR
The AEPIC simulator used for this study is based on the cycle-level simulator of the HPL-PD EPIC architecture.
AEPIC Simulator Components
An important issue is how the application can be optimized to exploit the AEPIC features. For this purpose, we need to identify the parts of an application that are most suited to run on the MRLA. Although ideally the compiler should do this automatically, we do not yet have good compiler algorithms for this automatic partitioning. We therefore did this partitioning manually. Using runtime information from the EPIC simulator, we identified the compute-intensive parts of the application. By considering the speedup gained by performing the computation on the CFU, the estimated time to reconfigure the MRLA, and the estimated time needed to transfer input/output data to and from the CFU, we selected sections of the code to be performed in the CFU.
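The trade-off weighed during this manual partitioning can be expressed as a simple cost model. The sketch below is a hypothetical illustration, not the procedure used in the study: the function names, the assumption that reconfiguration is paid once per region, and the profile numbers are all invented for the example.

```python
def cfu_gain(sw_cycles, cfu_cycles, transfer_cycles, calls, reconfig_cycles):
    """Estimated cycles saved by moving a code region to a CFU.

    sw_cycles, cfu_cycles and transfer_cycles are per call; reconfig_cycles
    is assumed to be paid once. All inputs are designer-supplied estimates.
    """
    software = calls * sw_cycles
    hardware = reconfig_cycles + calls * (cfu_cycles + transfer_cycles)
    return software - hardware


def select_regions(candidates):
    """Keep the names of candidate regions whose estimated gain is positive."""
    return [name for name, est in candidates.items() if cfu_gain(**est) > 0]


# Hypothetical profile data, for illustration only.
candidates = {
    "hot_loop": dict(sw_cycles=400, cfu_cycles=60, transfer_cycles=20,
                     calls=1000, reconfig_cycles=50_000),
    "rare_helper": dict(sw_cycles=400, cfu_cycles=60, transfer_cycles=20,
                        calls=100, reconfig_cycles=50_000),
}
selected = select_regions(candidates)
```

With these made-up numbers, the frequently called region amortizes the reconfiguration cost while the rarely called one does not, which is the kind of judgment the manual selection had to make.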
To obtain realistic estimates of the cycle time and the number of execution cycles, we used FPGA technology to approximate the MRLA. The chosen parts of the applications are implemented on a Xilinx Virtex XCV1000 FPGA using the high-level hardware language Handel-C [2]. We decided to use Handel-C instead of a more efficient hardware description language because of the ease of converting C code to Handel-C. The conversion generally involves inserting parallel constructs into the C code. An example of this conversion is shown in the Appendix. Note that even though Handel-C has a very well defined statement-based timing model that makes it 
EPIC Simulation
The final step is to add the AEPIC instructions into the application to reconfigure data-paths and control the CFU.
The resulting AEPIC application is compiled and run with the same input. As the FPGA setup runs on Microsoft Windows while Trimaran runs on Linux, we use a remote FPGA server that offers an RPC-like service to the AEPIC simulator. The FPGA server loads and executes the compiled Handel-C code and reports the execution cycles back to the AEPIC simulator. The whole process is shown in Figure 5. This co-simulation framework gives us a more realistic picture of AEPIC's performance.
RESULTS
We used four benchmarks to evaluate the AEPIC architecture. They consist of two encryption algorithms, IDEA [IS] and Pegwit, and two audio decoder algorithms, G721 and ADPCM. The last three benchmarks are from the MediaBench suite [17].
We used the RC1000 development board from Celoxica [3] with a Xilinx Virtex XCV1000 FPGA [26] to simulate the MRLA. The AEPIC core, with 4 integer units and 2 load-store units, was simulated on a Pentium III processor.
Basic Speedups
The speedup obtained for each application is presented in Table 1. This speedup was computed by assuming that the EPIC main processor and the reconfigurable unit run at the same frequency. Read another way, it tells us that an AEPIC processor running at a lower frequency can match the performance of a plain EPIC processor running at a higher frequency. In the same table, we also show the number of FPGA gates used to implement the application CFU. As can be seen, the CFUs are relatively small. We therefore assume that the configuration can be loaded into the cache and the CFU configured well ahead of its use. Table 2 shows the total number of input and output registers needed for each benchmark. It should be noted that our current implementation, which stalls the main processor while the CFU executes, ensures that the CFU is the sole bus master should it need to obtain data from memory.
Performance trade-off between Core and CFU frequencies
Using the simulation data, we performed a further study by assuming that the core processor runs at a higher frequency than the reconfigurable unit. This is not an unlikely situation if we project the speed difference between the current generation of embedded processors and FPGAs onto the AEPIC architecture. The key question is then where the break-even point lies, in other words, at what speed differential it is no longer useful to have a reconfigurable unit because the main core processor is fast enough to handle the computation. The simulator reports the total number of cycles the execution took, $Cycles_{AEPIC}$, which assumes that the core and the reconfigurable component are running at the same frequency. To adjust for the difference in frequency between the core and the reconfigurable component, we recompute the execution cycles as follows. Let $Cycles_{CFU}$ be the component of $Cycles_{AEPIC}$ that is the estimated number of cycles consumed by the reconfigurable component. We obtain this from the FPGA implementation of the computation core using Handel-C. The same implementation, after placement and routing, also reports the number of gates needed to implement the logic as well as the clock frequency ($f_{FPGA}$) at which the circuit can be executed. From the execution logs of the simulations, we computed the AEPIC execution cycle count for a specific main core processor frequency $f$ by rescaling the cycles consumed by the reconfigurable component according to the ratio between $f$ and $f_{FPGA}$.
Table 3. AEPIC speedup relative to core frequencies
After the above recalibration, we compute the speedups for various core frequencies. The results are shown in Table 3. They show that AEPIC is particularly effective when the main core processor is running at a low frequency.
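The recalibration can be sketched in Python as follows. The exact formula is not reproduced in this text, so the form used here is an assumption consistent with the surrounding definitions: cycles spent in the reconfigurable component are rescaled by the ratio of the core frequency to the FPGA frequency, and the remaining cycles are left unchanged. The function names and all numbers are hypothetical and are not the paper's measurements.

```python
def recalibrated_cycles(cycles_aepic, cycles_cfu, f_core_mhz, f_fpga_mhz):
    """Core-clock cycle count when the CFU runs at f_fpga instead of f_core."""
    core_only = cycles_aepic - cycles_cfu          # cycles spent on the core
    return core_only + cycles_cfu * f_core_mhz / f_fpga_mhz


def speedup(cycles_epic, cycles_aepic, cycles_cfu, f_core_mhz, f_fpga_mhz):
    """Speedup of AEPIC over plain EPIC at the given core frequency."""
    return cycles_epic / recalibrated_cycles(
        cycles_aepic, cycles_cfu, f_core_mhz, f_fpga_mhz)


def break_even_mhz(cycles_epic, cycles_aepic, cycles_cfu, f_fpga_mhz):
    """Core frequency at which the recalibrated AEPIC count equals the EPIC count."""
    core_only = cycles_aepic - cycles_cfu
    return f_fpga_mhz * (cycles_epic - core_only) / cycles_cfu


# Hypothetical cycle counts, for illustration only.
EPIC, AEPIC, CFU, F_FPGA = 1_000_000, 400_000, 100_000, 25.0
same_clock = speedup(EPIC, AEPIC, CFU, 25.0, F_FPGA)
fast_core = recalibrated_cycles(AEPIC, CFU, 100.0, F_FPGA)
limit = break_even_mhz(EPIC, AEPIC, CFU, F_FPGA)
```

Under this model the speedup shrinks as the core frequency rises and drops to 1 at the break-even frequency, which matches the qualitative trend described for Table 3.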
Multiple, Smaller CFU
The clock frequencies we obtained after placement and routing of the portion of the code identified for execution in the reconfigurable part of the processor are typically between 15 and 30 MHz. The realizable clock frequency is determined by the complexity of the circuit, which affects its critical path. Since the design of AEPIC allows a number of CFU slices to be dynamically loaded, we experimented with splitting the code (and hence the circuit) to be realized in the CFU into smaller pieces. We need to recalibrate the counting of execution cycles reported by the simulator. Using the AEPIC simulator, we counted the number of CFU calls and the number of AEPIC cycles. In our applications, the main core processor waits for the CFU to complete its operation before proceeding. Therefore, we can compute the number of cycles spent by the main core processor in waiting for the CFU to complete its work as follows:
$$Cycles_{wait} = \sum_{i} TotalCycles_{CFU_i} \times \frac{f}{f_{CFU_i}}$$
where $f$ is the frequency of the main core, $f_{CFU_i}$ is the frequency of CFU slice $i$ estimated using FPGA technology, and $TotalCycles_{CFU_i}$ is the total number of cycles executed by CFU $i$. This is the sum of all cycles executed by CFU $i$ over every call to it. We split the code for the reconfigurable unit in both audio decoder applications that were tested in the first experiments. The total number of gates in the split CFUs is about the same as that for a single CFU. In fact, in some cases, because of further circuit simplifications, it is slightly lower than in the single-CFU case. The code for the CFUs of the other benchmarks is too simple to be split. For ADPCM, the total number of CFU cycles is 121,228,705 for eight CFUs. In Table 4 we show the number of calls for every CFU and the attained frequency.
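The wait-cycle computation defined above, in terms of the core frequency, the per-slice CFU frequencies, and the per-CFU total cycle counts, can be sketched in Python. The function name and the example numbers are hypothetical; they are not taken from Table 4 or Table 6.

```python
def wait_cycles(f_core_mhz, cfus):
    """Core cycles spent waiting for the CFUs.

    cfus is a list of (total_cfu_cycles, cfu_frequency_mhz) pairs, one per
    CFU slice. Each CFU's cycles are converted to core cycles by the
    frequency ratio f_core / f_cfu and the results are summed.
    """
    return sum(total * f_core_mhz / f_mhz for total, f_mhz in cfus)


# Hypothetical data: one large, slow CFU versus the same work split
# into two smaller CFUs that place and route at higher frequencies.
single = [(3000, 20.0)]
split = [(1000, 25.0), (2000, 50.0)]

single_wait = wait_cycles(100.0, single)
split_wait = wait_cycles(100.0, split)
```

With these made-up values the split CFUs cost fewer core cycles for the same number of CFU cycles, purely because each piece runs at a higher clock.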
The speedups obtained are presented in Table 5. For the G721 benchmark, the total number of CFU cycles is 2,266,857 for three CFUs. Table 6 presents the number of calls for every CFU and its frequency. The speedups for G721 are presented in Table 7. For both benchmarks, the results show that splitting the CFUs produced smaller CFUs that can be realized at higher frequencies. This extended the speedups afforded by AEPIC by closing the gap between the core's and the CFUs' frequencies.
CONCLUSION
In this paper, we described a simulation environment and provided an evaluation of a fine-grain, dynamically reconfigurable processor that consists of an EPIC core tightly coupled with a reconfigurable unit. An evaluation on four embedded benchmarks, with FPGA technology standing in for the CFUs, shows that AEPIC has particular potential for low-frequency, and hence low-power, systems. Under such assumptions, we achieved a speedup of up to 2.43 times.
As an extension of the current work, we would like to investigate how to automatically identify the parts of an application that have the right granularity and can be executed efficiently on AEPIC. We would also like to explore issues relating to structuring applications so as to operate the CFUs in parallel with the main core.
ACKNOWLEDGMENT

