We describe an in-progress approach to reconciling the opposing goals of speed and accuracy in designing, prototyping, and evaluating the system-level performance of near-memory accelerators. Hardware emulators offer high fidelity and speed at the expense of flexibility in exploring the design space. Software simulators can be enormously flexible, but simulation speed suffers if high accuracy is required. Building on past experience with both methods, we report on progress in using SystemC simulation for full application-level evaluation, with eventual synthesis of the accelerator hardware components for emulation.
Introduction
Designing, prototyping, and evaluating near-memory accelerators is an exercise in triage among the competing goals of speed of design, speed of evaluation, and accuracy of metrics. It is desirable to generate design alternatives quickly to help narrow down the large search space. Evaluation speed is required to run the proposed hardware under realistic scenarios with appropriately sized data sets and use cases. Fast evaluation also provides coverage in exploring the large number of design alternatives. However, fast turnaround is of little use without confidence in the accuracy of the results and an understanding of the modeling errors.
Background
In prior work [2, 6] we aimed for accuracy of the near-memory hardware design and speed of evaluation by developing the Logic in Memory Emulator (LiME) [3], a hardware/software "Swiss Army knife" of FPGA-based functions to prototype acceleration modules adjacent to a variety of memory types, while simultaneously capturing and logging external memory requests from both the main CPU and the accelerator. This emulation approach was very successful in speed of evaluation, with only a 20x slowdown relative to the predicted run time. In contrast, software simulation typically incurs a slowdown of thousands of times. The speed gain was due to the acceleration units being instantiated in FPGA hardware and the application running on hard ARM processors in the Multiprocessor System on Chip (MPSoC).
However, developing FPGA implementations at the hardware description language level is very slow, which limits design space exploration experiments. While latencies could be varied on the emulation platform for different types of memories, the latency associated with a memory was fixed for the duration of an experiment. Additionally, the emulation studies were conducted with a single core and a single accelerator. While these are not fundamental restrictions of the emulation platform, the time needed to design FPGA modules and embedded software infrastructure was a significant barrier.
To mitigate these drawbacks, we used simulation [4] to explore other designs that are difficult to implement in hardware. We ported one of the near-memory accelerators, a Key/Value Store lookup engine [6], to the Structural Simulation Toolkit (SST) [7], a pluggable discrete event simulation framework. In this simulation we re-implemented the lookup accelerator as an SST component. We validated the simulator using a fixed latency memory model similar to LiME's, and then replaced the memory model with a more realistic HMC model [5] to evaluate features not implemented in LiME. We introduced optimizations and experimented with different numbers of lookup accelerators to evaluate quantitatively the relationships of bandwidth, latency, and load as the number of concurrent lookups increased. In these scaling experiments, we could easily modify the architecture and therefore also studied the effects on performance of doubling the accelerator bus width and increasing the number of outstanding requests to the HMC. The simulation results showed that having a reasonable number of outstanding memory requests is critical to performance, and that eight accelerators could be accommodated before saturating the bandwidth.
With the insight gained from software simulation, it is desirable to continue rapid design space exploration with a path to hardware synthesis. For this effort, we have turned to SystemC [1], a widely used modeling and verification framework that includes a synthesizable subset supported by many CAD companies. In the modeling and simulation domain, SystemC enables rapid prototyping and experimentation that can be both bit and cycle accurate and is decoupled from the CAD tool flow. When design elements are constrained to the synthesizable subset, those components can be synthesized through the CAD High Level Synthesis (HLS) tool flow. This is particularly valuable when the required system performance evaluation encompasses both hardware and software. The interfaces between software APIs and hardware modules can be modeled at a cycle-accurate level. When desired, the hardware modules can be pushed through HLS to obtain estimates of area and clock frequency. If an FPGA is targeted, the resulting designs can be run in emulation to regain the speed advantage of FPGA emulation. This is a very desirable outcome, as we have measured SystemC simulation to be 31,000 times slower than the predicted run time and 1,500 times slower than LiME FPGA emulation.
Implementation
To work towards a simulation-to-emulation path that combines flexible simulation during initial design with synthesis to a fast emulation platform for promising design options, we have ported the Key/Value Store lookup accelerator to SystemC. A synthesizable subset of SystemC was used to enable a path to emulation. The SystemC accelerator model and the C++ application program are compiled together and execute on an x86 platform, with results functionally identical to those of LiME emulation. The simulated system includes a simple cache management model derived from emulator platform measurements.
A SystemC Transaction Level Modeling (TLM) 2.0 interface has been incorporated into the lookup accelerator for flexibility in connecting various memory models. TLM provides a standard to describe how models can communicate. This allows the exchange of "black box" models to enable prototyping, performance analysis, and hardware verification. With TLM, we can interface to externally developed, possibly proprietary, memory models that have unique characteristics without having to see the model internals.
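The "black box" exchange of memory models described above can be illustrated with a minimal sketch in plain C++. This is not the SystemC TLM 2.0 API and not the accelerator's actual interface: `Transaction`, `MemoryTarget`, `FixedLatencyTarget`, `read_word`, and `write_word` are hypothetical names chosen for illustration. The sketch only assumes what the text states, namely that the initiator talks to any memory model through one agreed-upon transport call and never sees the model internals.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical, simplified stand-in for a transaction payload (not the TLM type).
struct Transaction {
    enum class Command { Read, Write };
    Command command;
    std::uint64_t address;
    std::uint64_t data;  // one 64-bit word, for simplicity
    bool ok;             // set by the target when the request is served
};

// The only thing an initiator knows about a memory model.
class MemoryTarget {
public:
    virtual ~MemoryTarget() = default;
    // Serves the request and adds the modeled latency to delay_ns.
    virtual void transport(Transaction& trans, std::uint64_t& delay_ns) = 0;
};

// One interchangeable model: fixed read and write latencies, sparse storage.
class FixedLatencyTarget : public MemoryTarget {
public:
    FixedLatencyTarget(std::uint64_t read_ns, std::uint64_t write_ns)
        : read_ns_(read_ns), write_ns_(write_ns) {}

    void transport(Transaction& trans, std::uint64_t& delay_ns) override {
        if (trans.command == Transaction::Command::Write) {
            store_[trans.address] = trans.data;
            delay_ns += write_ns_;
        } else {
            auto it = store_.find(trans.address);
            trans.data = (it != store_.end()) ? it->second : 0;  // unwritten words read as 0
            delay_ns += read_ns_;
        }
        trans.ok = true;
    }

private:
    std::uint64_t read_ns_;
    std::uint64_t write_ns_;
    std::unordered_map<std::uint64_t, std::uint64_t> store_;
};

// Initiator-side helpers: they depend only on the abstract interface.
inline void write_word(MemoryTarget& mem, std::uint64_t address,
                       std::uint64_t value, std::uint64_t& delay_ns) {
    Transaction t{Transaction::Command::Write, address, value, false};
    mem.transport(t, delay_ns);
    assert(t.ok);
}

inline std::uint64_t read_word(MemoryTarget& mem, std::uint64_t address,
                               std::uint64_t& delay_ns) {
    Transaction t{Transaction::Command::Read, address, 0, false};
    mem.transport(t, delay_ns);
    assert(t.ok);
    return t.data;
}
```

Because `read_word` and `write_word` take a `MemoryTarget&`, a more detailed model could be substituted for `FixedLatencyTarget` without changing the initiator code, which is the property the standardized interface is meant to provide.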
Two memory models, also written in SystemC, have been used with the accelerator: a fixed latency model similar to LiME's and a more realistic proprietary memory model for the HMC. Both models have TLM-compliant interfaces and support at least 16 outstanding read and write transactions for high throughput. The fixed latency model has configurable average read and write latencies. Values of 85 ns for a read and 106 ns for a write were used in this experiment; these were determined experimentally from an actual HMC device for a similar workload. The fixed latency model does not model the internal structure of an HMC, i.e., it does not simulate vaults, banks, switches, or links. In contrast, the proprietary model simulates all of these HMC components at a clock cycle and bit level of accuracy. However, the binary-only HMC memory model can only be accessed through the links, so the near-memory computing scenario cannot be fully simulated when this memory model is used: we cannot simulate accelerator logic placed on the base-layer die with reduced latency. In this case, the memory latency seen by the accelerator includes the extra link latency for off-chip access.
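To make the role of outstanding transactions concrete, the following is a minimal plain C++ sketch of a fixed latency memory with a bounded number of requests in flight. It is an illustration under stated assumptions, not the SystemC model used in this work: `Op`, `FixedLatencyMemory`, and `drain_time` are hypothetical names, all requests are assumed to be ready at time 0, and issue bandwidth is assumed to be unlimited, so only the latency and the in-flight limit constrain progress.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

enum class Op { Read, Write };

// Hypothetical fixed latency memory with a cap on requests in flight.
struct FixedLatencyMemory {
    std::uint64_t read_ns;        // e.g. 85 ns, the read latency quoted in the text
    std::uint64_t write_ns;       // e.g. 106 ns, the write latency quoted in the text
    std::size_t max_outstanding;  // e.g. 16

    // Returns the time in ns at which the last request completes.
    // Requests are issued in order, each as soon as an in-flight slot is free.
    std::uint64_t drain_time(const std::vector<Op>& requests) const {
        assert(max_outstanding > 0);
        // Min-heap of completion times for the requests currently in flight.
        std::priority_queue<std::uint64_t, std::vector<std::uint64_t>,
                            std::greater<std::uint64_t>> in_flight;
        std::uint64_t now = 0;
        std::uint64_t last_done = 0;
        for (Op op : requests) {
            if (in_flight.size() == max_outstanding) {
                // All slots busy: wait for the earliest completion.
                now = std::max(now, in_flight.top());
                in_flight.pop();
            }
            const std::uint64_t done = now + (op == Op::Read ? read_ns : write_ns);
            in_flight.push(done);
            last_done = std::max(last_done, done);
        }
        return last_done;
    }
};
```

Under these assumptions, 32 reads with 16 outstanding requests drain in 170 ns (two waves of 85 ns), whereas the same 32 reads with a single outstanding request take 32 × 85 = 2720 ns. The sketch does not model vaults, banks, switches, or links.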
Validation
Figure 1 compares the lookup accelerator's performance using the LiME emulator model with that of the SystemC simulation models. In this figure, the y-axis measures lookups/s into the key/value store, and the x-axis is the load factor (how full the table is relative to its total capacity). The LiME model shows near-exact correspondence with the fixed latency SystemC model as the load factor increases. Due to the limitations of the binary-only, proprietary HMC model, the results shown in Figure 1 for the HMC model predict performance when the lookup accelerator is off-chip. A block transfer optimization similar to that used in the SST simulation [4] was also used, since the HMC model does not include a scratchpad memory. The slightly higher performance predicted by the HMC model is likely due to the block transfer optimization, which issues a single request to the HMC for a block of keys rather than individual requests for each key.
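For readers unfamiliar with load factor as an experimental variable, the sketch below shows one way lookup cost can depend on how full a table is. It is an assumption-laden illustration: the text does not describe the hashing scheme of the key/value store, so the open-addressing table with linear probing used here, along with the names `ProbeTable`, `insert`, `lookup`, and `load_factor`, is hypothetical and is not the accelerator's design. The `probes` count stands in for the number of memory accesses a lookup would issue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical open-addressing hash table with linear probing.
class ProbeTable {
public:
    explicit ProbeTable(std::size_t capacity) : slots_(capacity), used_(0) {
        assert(capacity > 0);
    }

    // Fraction of slots occupied: how full the table is relative to capacity.
    double load_factor() const {
        return static_cast<double>(used_) / static_cast<double>(slots_.size());
    }

    // Inserts or updates a key. Returns false only if the table is full
    // and the key is not already present.
    bool insert(std::uint64_t key, std::uint64_t value) {
        std::size_t i = home(key);
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            if (!slots_[i].used) {
                slots_[i].used = true;
                slots_[i].key = key;
                slots_[i].value = value;
                ++used_;
                return true;
            }
            if (slots_[i].key == key) {
                slots_[i].value = value;
                return true;
            }
            i = (i + 1) % slots_.size();
        }
        return false;
    }

    // Looks up a key. probes reports how many slots were examined.
    bool lookup(std::uint64_t key, std::uint64_t& value, std::size_t& probes) const {
        probes = 0;
        std::size_t i = home(key);
        for (std::size_t n = 0; n < slots_.size(); ++n) {
            ++probes;
            if (!slots_[i].used) return false;  // empty slot ends the probe chain
            if (slots_[i].key == key) {
                value = slots_[i].value;
                return true;
            }
            i = (i + 1) % slots_.size();
        }
        return false;
    }

private:
    struct Slot {
        bool used = false;
        std::uint64_t key = 0;
        std::uint64_t value = 0;
    };

    // Multiplicative hash; the constant is an arbitrary odd 64-bit multiplier.
    std::size_t home(std::uint64_t key) const {
        return static_cast<std::size_t>((key * 0x9E3779B97F4A7C15ull) >> 32) % slots_.size();
    }

    std::vector<Slot> slots_;
    std::size_t used_;
};
```

Averaging `probes` over many lookups at several values of `load_factor()` would give a rough, software-only view of how lookup cost varies with table occupancy in such a scheme. The measured results in Figure 1 come from the emulator and the SystemC models, not from this sketch.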
This preliminary validation lends confidence to our proposed approach of using SystemC for design space exploration with eventual synthesis of the accelerators. Using SystemC as the modeling language allows the strengths of both simulation and emulation to be used in design space exploration without the need to write accelerator components in separate languages. Our next steps will be to reproduce the optimizations and enhancements from the SST simulation, such as using multiple accelerators attached to the same memory module. While our current experience with synthesis to FPGA has challenged the advertised capabilities of FPGA CAD synthesis, we will continue to work towards our goal of synthesizing the hardware modules and running them on the FPGA platform.
