future plans are discussed in more depth in section 6, and our conclusions given in section 7.
[...] of implementing an actual system in hardware. However, the design of a RC system is a complex process. Depending on the needs and goals of the designers, the system may be built from commercial off-the-shelf components, including commercially-available FPGAs. In this case, commercial tools provide critical infrastructure. In particular, if the system uses a complete commercial board, the system is essentially already implemented, and may not require simulation beyond the capabilities of the tools that come with it.

In other cases, a designer may want to create a fundamentally new system or reconfigurable architecture. Although one can sometimes successfully model a new design on an existing FPGA platform, the difficulty of the task can increase at least proportionally to the difference between the new design and the available FPGA platform.

[...] combination to implement, either in an FPGA or in one or more custom chips. Modular design would allow simulation of different reconfigurable hardware (RH) designs without changing the core functionality. The use of a modifiable simulation platform would permit designers to alter the coupling of the RH with other parts of the system, such as the processor and memory. A full-system simulator would also allow a degree of realism not available with simpler estimation tools. Currently, researchers tend to use simulators and estimation tools that only model user-space code, and ignore system-level activities. These tools may also lack realistic memory models, and may model non-mainstream ISAs such as Alpha or PISA [2]. In these cases, it can be difficult to contextualize the results in terms of current pro[...]

[...]-loaded on the simulated system's hardware, the simulator sets Rs to 1, and the program flow will fall through to the hardware support code at the following bz instruction. In the hardware branch, we add another special instruction: RC_HW_func, which is intended to actually trigger hardware kernel execution in a real RC system, and is therefore captured by our simulator for this purpose. For functional correctness, the kernel must somehow be actually computed at this point, as discussed in the next section.

Due to common hardware design techniques such as pipelining, a single call to a kernel's hardware implementation may be the equivalent of several executions of that kernel's software equivalent. In these cases, the application designer (or preferably the compiler) would simply unroll that number of software iterations in the section of the binary for the kernel's software code in order to maintain equivalency. Accommodating more complex differences between the software and hardware kernel implementation interfaces will be part of future work.
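The software/hardware branch and the unrolling rule described above can be illustrated with a small functional model. This is only a sketch under stated assumptions: the function names, the stand-in kernel, and the choice of two software iterations per hardware call are hypothetical, and the branch-taken-on-zero behaviour of bz is inferred from the fall-through description rather than taken from the simulator's code.

```python
# Hypothetical functional model of the kernel dispatch; all names are illustrative.

ITERATIONS_PER_HW_CALL = 2  # assumption: one pipelined hardware call covers two software iterations


def software_kernel(x):
    """Stand-in for the kernel's software implementation (not the paper's kernel)."""
    return x * x + 1


def check_kernel_loaded(loaded_kernels, kernel_id):
    """Models the check instruction: the value placed in Rs (1 if loaded, else 0)."""
    return 1 if kernel_id in loaded_kernels else 0


def rc_hw_func(inputs):
    """Models RC_HW_func as captured by a simulator: the result is computed
    functionally by running the software equivalent on the inputs."""
    return [software_kernel(x) for x in inputs]


def run_kernel(data, loaded_kernels, kernel_id):
    """Processes data in chunks sized to one hardware call, choosing a branch per chunk."""
    results, trace = [], []
    for i in range(0, len(data), ITERATIONS_PER_HW_CALL):
        chunk = data[i:i + ITERATIONS_PER_HW_CALL]
        rs = check_kernel_loaded(loaded_kernels, kernel_id)
        if rs == 0:
            # Assumed bz behaviour: branch taken, software code unrolled to match one hardware call.
            trace.append("sw")
            for x in chunk:
                results.append(software_kernel(x))
        else:
            # Fall through to the hardware support code.
            trace.append("hw")
            results.extend(rc_hw_func(chunk))
    return results, trace
```

Because the software branch is unrolled to the same number of iterations as one hardware call, both paths consume and produce the same amount of data per pass through the branch, which is what keeps the two versions of the binary equivalent.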
Although this method was appropriate for our previous [...]

[...] the hardware must be loaded from the CPU's memory hierarchy into the kernel's local buffer (the "pre-load" phase). After the call, the output results can be read from the buffer to store into the CPU's memory hierarchy (the "post-store" phase). Another ISA extension would allow applications to directly read from and write to kernel buffers. The CPU is therefore responsible for all memory accesses, simplifying the problem of virtual addressing. The special loads and stores must be added to the application code within the hardware part of the kernel branch, as shown in Fig. 5. The use of the local buffer also simplifies the interface from a hardware designer's perspective, as they do not need to use a specialized interface, or have to worry about variable memory latency. Memory timing and bandwidth limitations are implicitly addressed by Simics and GEMS for these CPU-initiated memory accesses. Advanced memory interfaces discussed in the Future Work section that do not use the CPU as an intermediary will have to interface more explicitly with [...]

[...] proximity to the related hardware kernel call. The pre-load phase models loading the input values into a kernel's local data buffer, while the post-store phase models loading the kernel outputs from the buffer into the CPU's memory. The CPU performs the required virtual-physical memory translations, and loads required memory pages.

The remaining problem is to actually put the correct kernel output values into the required memory locations. Remember that when choosing the hardware implementation of a kernel, for functional accuracy the kernel's software equivalent is executed within the simulator itself, not on the simulated platform. Therefore, the results of that execution are in the memory of the computer running the simulation, not the simulated system itself. Since the post-stores are executed on the simulated platform, but the actual results are not known within that scope, the application binary writes dummy values to all output memory locations. Inside the simulator, the Simics API allows us to intercept these store operations as they complete execution.
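The dummy-store mechanism can be pictured with a toy model in which a store callback swaps the dummy value for the host-computed result once the simulated CPU has translated the address. This is a minimal sketch, not the Simics API: the classes, the callback signature, the page size, and the dictionary-based memory are all assumptions made for illustration.

```python
# Toy model of dummy-store interception. Everything here is hypothetical and
# does not use or mirror the real Simics API.

PAGE_SIZE = 4096   # assumed page size
DUMMY_VALUE = 0    # value the application binary writes in place of the real result


class SimulatedMachine:
    """Minimal simulated platform: a page table, physical memory, and store callbacks."""

    def __init__(self, page_table):
        self.page_table = dict(page_table)   # virtual page number -> physical page number
        self.physical_memory = {}            # physical address -> value
        self.store_callbacks = []            # called after each store completes

    def translate(self, vaddr):
        """Virtual-to-physical translation, performed by the simulated CPU."""
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        return self.page_table[vpage] * PAGE_SIZE + offset

    def store(self, vaddr, value):
        """A CPU-initiated store: translate, write, then notify the callbacks."""
        paddr = self.translate(vaddr)
        self.physical_memory[paddr] = value
        for callback in self.store_callbacks:
            callback(self, vaddr, paddr)


class ResultPatcher:
    """Holds kernel results computed on the host, keyed by virtual address."""

    def __init__(self, host_results):
        self.pending = dict(host_results)

    def on_store_complete(self, machine, vaddr, paddr):
        # The physical address is only known now, after the dummy store was translated.
        if vaddr in self.pending:
            machine.physical_memory[paddr] = self.pending.pop(vaddr)
```

In this model the application would call `machine.store(vaddr, DUMMY_VALUE)` for each output location, and the registered `on_store_complete` callback would overwrite the dummy with the real value at the translated physical address.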
Although the simulator knows the virtual memory addresses for the result data, it is not aware of the corresponding physical address until the simulated CPU performs address translation during the dummy store. The CPU loads the memory page if required and performs address translation, generating the correct physical addresses for the result data. The simulator then replaces the dummy values at the stored locations with the actual values it had computed. This ensures that the correct results will be stored at the correct addresses with correct memory timing.

[...] since we are only using one key, that the subkeys are precomputed. The hardware implementation is pipelined and encrypts two blocks for every hardware call. The subkeys are initially loaded in 11 cycles, and it takes 22 hardware cycles to execute the kernel with a cycle time of 4.35ns, for a total latency of 96ns per two blocks encrypted. This translates to 192 CPU cycles, at a 2GHz CPU clock rate. We assume a 64-bit bus width for the data connection between the CPU and RC HW, an easily achievable configuration. The buffer copy therefore requires 4 hardware cycles to get the input, and another 4 for writing the output.

2. Charge a fixed number of cycles for local buffer copy operations, which assumes an L1 cache hit on all data accesses

3. Fully simulate the memory overhead (which may include cache misses) using our pre-load/post-store augmented binary

[...]ing that all data will be in L1 data cache when needed) improves timing estimate accuracy, the results still differ significantly from the full-system simulation results with the pre-load and post-store operations. This data indicates that semi-reasonable yet at least partially inaccurate assumptions and estimates used in evaluating a RC application or system may adversely affect the validity of the evaluation results. In the case of the extremely simple model, this led to a 2.5x error in performance results.
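The kernel timing figures quoted above (22 hardware cycles at 4.35ns, a 2GHz CPU clock, and a 64-bit bus) can be re-derived with a few lines of arithmetic. The sketch below assumes 128-bit blocks, which is not stated in this section; the constant and function names are illustrative.

```python
import math

HW_CYCLE_NS = 4.35        # hardware cycle time, from the text
KERNEL_HW_CYCLES = 22     # hardware cycles per kernel call, from the text
CPU_CLOCK_GHZ = 2.0       # CPU clock rate, from the text
BUS_WIDTH_BITS = 64       # CPU-to-RC-hardware bus width, from the text
BLOCKS_PER_CALL = 2       # blocks encrypted per hardware call, from the text
BLOCK_BITS = 128          # ASSUMPTION: block size is not given in this section


def kernel_latency_ns():
    """Latency of one hardware kernel call in nanoseconds (about 95.7, quoted as 96)."""
    return KERNEL_HW_CYCLES * HW_CYCLE_NS


def ns_to_cpu_cycles(latency_ns):
    """Converts a latency in nanoseconds to CPU cycles at the assumed clock rate."""
    return latency_ns * CPU_CLOCK_GHZ


def buffer_copy_hw_cycles():
    """Hardware cycles needed to move one call's worth of data across the bus."""
    return math.ceil(BLOCKS_PER_CALL * BLOCK_BITS / BUS_WIDTH_BITS)
```

Under the 128-bit assumption, two blocks amount to 256 bits, which takes four transfers on a 64-bit bus; this matches the 4 hardware cycles stated for each direction of the buffer copy.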
The simulated machine has a 2GHz 4-way out-of-order SPARC-9 processor with 256MB memory and 8GB SCSI disk running Solaris 9, augmented with RH based on a Xilinx Virtex-4. Configuration details are shown in Table 1.

[...] 4 FPGA to obtain timing information. Since this is the only kernel implemented, we assume for all cases that the kernel fits in hardware, and is pre-loaded. We also assume that [...]

6. FUTURE WORK

In future simulator versions, users will be able to choose between several different memory interfaces (and corresponding processor interfaces), or even add their own with the help of the simulator's code base. The simulator will be able to
