Abstract: As memory accesses increasingly limit the overall performance of reconfigurable accelerators, it is important for high-level synthesis (HLS) flows to discover and exploit memory-level parallelism. This paper develops 1) a framework in which parallelism between memory accesses is revealed from runtime profiles of applications and provided to a high-level synthesis flow, and 2) a novel multi-accelerator/multi-cache architecture that supports parallel memory accesses, taking advantage of the high aggregate memory bandwidth found in modern FPGA devices. Our experimental results show that, for 10 accelerators generated from 9 benchmark applications, circuits using our proposed memory structure achieve on average a 52% performance improvement over accelerators using a traditional memory interface. We believe that our study represents a solid advance towards achieving memory-parallel embedded computing on hybrid CPU+FPGA platforms.
I. Introduction
The performance benefits of FPGA computing are becoming much more attainable as high-level synthesis flows give developers easy ways to offload the most compute-intensive portions of an application to hardware accelerators. However, as the accelerators aggressively parallelize compute operations, the conventional memory model, which serializes all memory accesses, becomes a performance bottleneck. Any deviation from it, on the other hand, would often require complicated memory alias analysis to be applied to the target programs. For applications written for conventional general-purpose processors, memory access patterns may be complex and the dependencies may not be statically determinable. Consequently, only limited opportunities for parallelizing memory accesses can be exposed.
Memory-level Parallelism Discovery. Our proposed tool flow captures and examines the dynamic traces of applications to establish independence between partitions of memory accesses. Subsequently, the operation reordering and loop pipelining performed during accelerator generation are no longer restricted by unnecessary serialization of memory accesses. Meanwhile, the access pattern of each partition of memory operations is used to optimize the associated cache structure, resulting in a higher cache hit rate and ultimately better performance.
Multi-cache Architecture. Complementing our tool flow, a new multi-cache architecture template is designed to accommodate parallel accesses from the accelerator. Each cache, customized according to the access pattern, serves one memory access partition. As the tool flow is profile-based, our architecture has built-in mechanisms to ensure the coherence of the system when the assumed independence is violated. In addition, given the coherence guarantee provided by the architecture, the accelerators and the CPU can each have their own cache yet access the same unified memory address space.
Accelerator Synthesis. As part of our study, an accelerator generator is developed. Each hot region in the captured runtime traces is extracted and rolled back to the original network of basic blocks. It is then converted to C syntax and processed by the front-end compiler of the LLVM project [1]. The resulting static single assignment (SSA) intermediate representation groups LLVM instructions into basic blocks. Branch predication is then performed on control transfer and memory operations. Finally, instruction rescheduling and loop pipelining are done before each operation in the dataflow graph is mapped using a library of pre-built hardware primitives.
II. Related Work & Background
Recently, high-level synthesis (HLS) in CPU+FPGA systems has attracted significant interest on both the academic and commercial fronts [2], [3]. Many of these HLS tools use profiling information to discover kernels, whose source code is then transformed into FPGA circuits. The Warp processor [4], on the other hand, performs translation from the binary running on the processor, directly utilizing the dynamic profile of the applications. Our work fits into the overall system architecture of these works, where the CPU performs control tasks while the computation is offloaded to accelerators. However, instead of focusing on converting software to hardware, we investigate the effect of memory-level parallelism on accelerator performance and how the benefit can be obtained.
A previous project did attempt to address memory access parallelism for reconfigurable accelerators [5]. Its approach was to have the user specify independent memory accesses, so that neither complex alias analysis nor hardware support for cache coherence is needed. Ultimately, however, this approach restricts the range of suitable applications to those with easily determined, static memory access patterns, such as those found in scientific computing. We take a much more general approach and rely solely on runtime profiling to determine the memory access behavior, and can therefore potentially address a wider range of applications.
III. Extracting Memory-Level Parallelism
The runtime traces of applications capture the memory addresses accessed by each instruction, providing the basis for dividing memory operations into partitions. Memory accesses from different partitions can be parallelized or performed out of order because either they do not access the same addresses or they are both read operations, making an ordering between them unnecessary. Within each partition, however, the RAW, WAR and WAW ordering must be preserved. These rules are observed during hardware synthesis when memory accesses are rescheduled to achieve the best performance. The process of partitioning memory accesses is illustrated in figure 1. Memory accesses are initially placed into separate partitions, and later merged into the same partition if they are observed to access the same addresses during execution. As the observation window slides forward in time, more accesses are clustered together, until a stable partitioning is reached.
In this process, accesses by two instructions to the same data have to be in the observation window at the same time for them to be placed in the same partition. As the partitioning is to be used by HLS, the size of the observation window should be equal to or greater than the worst-case relative movement of instructions in HLS.
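The merging procedure described above can be sketched with a union-find structure over instructions. This is only an illustrative reading of the text, not the tool flow's actual implementation: the trace format `(instr, address, is_write)`, the function name `partition_accesses`, and the choice to measure the window in number of accesses are all assumptions. Following the rule that two reads need no ordering, the sketch merges two instructions only when they touch the same address inside the window and at least one of them writes.

```python
from collections import deque


def partition_accesses(trace, window):
    """Group memory instructions into partitions from a dynamic trace.

    trace  : iterable of (instr, address, is_write) tuples (assumed format)
    window : number of most recent accesses kept in the observation window
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    recent = deque()
    for instr, addr, is_write in trace:
        find(instr)  # every instruction starts in its own partition
        for other, other_addr, other_write in recent:
            # Same address, and not a read/read pair: an ordering is needed.
            if other_addr == addr and (is_write or other_write):
                union(instr, other)
        recent.append((instr, addr, is_write))
        if len(recent) > window:
            recent.popleft()  # slide the observation window forward

    groups = {}
    for instr in parent:
        groups.setdefault(find(instr), set()).add(instr)
    return sorted(sorted(g) for g in groups.values())


example_trace = [
    ("ld_a", 0x10, False),
    ("st_b", 0x10, True),
    ("ld_c", 0x20, False),
    ("ld_d", 0x20, False),
    ("st_e", 0x30, True),
]
partitions = partition_accesses(example_trace, window=4)
```

In this example `ld_a` and `st_b` end up in one partition, while the two loads of `0x20` stay separate because neither writes. Shrinking `window` can hide a dependence between accesses that are far apart in the trace, which is why the window must cover the worst-case instruction movement.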
Assuming the most aggressive loop pipelining can be achieved during accelerator generation, we would initiate one iteration per cycle. If we also observe the latency of an inner loop iteration to be L cycles, the maximal number of iterations in flight simultaneously, N, would be L/1 = L. Coupled with the most aggressive intra-iteration instruction reordering, the worst-case relative instruction movement, and hence the minimal size of the observation window, would be N*I, I being the number of instructions per iteration. Since this process is profile-based, the program may behave differently given a different set of inputs.
To ensure correct program execution, protection mechanisms against the violation of partition independence are implemented, as will be described in section IV-B.
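As a worked instance of the window-size bound N*I derived above, the following minimal sketch computes the minimal observation window. The function name and the `initiation_interval` parameter are assumptions added for illustration; the text itself only considers an initiation interval of one cycle, and the latency and instruction count in the example are made-up numbers.

```python
import math


def min_observation_window(iteration_latency, instrs_per_iteration,
                           initiation_interval=1):
    """Lower bound on the observation window, in instructions.

    iteration_latency    : L, latency of one inner-loop iteration in cycles
    instrs_per_iteration : I, instructions per iteration
    initiation_interval  : cycles between iteration starts (1 in the text)
    """
    # N = L / 1 in the text; ceil() only matters for the assumed general case.
    iterations_in_flight = math.ceil(iteration_latency / initiation_interval)
    return iterations_in_flight * instrs_per_iteration


# Hypothetical loop: 12-cycle iteration latency, 8 instructions per iteration.
window = min_observation_window(iteration_latency=12, instrs_per_iteration=8)
```

With these hypothetical numbers, 12 iterations can be in flight at once, so the window must span at least 96 instructions.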
IV. Multi-Cache Architecture
From the partitioning of memory accesses, an application-specific memory access network, shown in figure 2, is synthesized. Reflecting the multiple memory partitions associated with each accelerated code segment, multiple caches are connected to each accelerator.
A. Cache Structure and Operation
Each cache attached to an accelerator has its own associativity, line size and index bits, chosen according to the observed memory access pattern of the instructions it serves. In addition, if a partition contains only load instructions, its cache has only a read port connected to the accelerator, while a store-only partition is associated with a cache that has no read port. Write-back and write-allocate policies are used on misses for these write-only caches as well as for the normal read-write caches.
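The port-selection rule above can be written down as a small sketch. The `CachePorts` type and the `ports_for_partition` function are hypothetical names introduced here for illustration; the associativity, line size and index-bit choices are not modeled because the text does not give the selection procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachePorts:
    read_port: bool   # accelerator-side read port present
    write_port: bool  # accelerator-side write port present


def ports_for_partition(ops):
    """Choose accelerator-facing ports from a partition's operation kinds.

    ops: iterable of "load" / "store" strings, one per instruction
         in the partition (assumed encoding).
    """
    kinds = set(ops)
    if not kinds or not kinds <= {"load", "store"}:
        raise ValueError("partition must contain only loads and/or stores")
    # Load-only partitions get no write port; store-only ones get no read port.
    return CachePorts(read_port="load" in kinds, write_port="store" in kinds)


load_only = ports_for_partition(["load", "load"])
store_only = ports_for_partition(["store"])
mixed = ports_for_partition(["load", "store", "load"])
```

Dropping the unused port is what lets a load-only or store-only partition use a simpler cache than a general read-write one.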
When one or more cache misses occur during execution, the accelerator is stalled. Misses from multiple caches are serialized on the internal memory request bus, shown in figure 2. This bus also forwards the request to off-chip RAM if the desired data is not in the multi-cache network. The internal memory response bus, also in figure 2, feeds the response from the off-chip RAM or sibling caches to the requesting cache.
B. Handling Inter-Partition Memory Dependence
To guarantee the correct state of memory, the application-specific memory access network has to protect against violations of the inter-partition independence. It is possible that the accesses to a common piece of data by two different partitions are close temporally, such that the reordering of memory operations in our generated accelerator has already caused a RAW, WAR or WAW violation. In the other scenario, the second partition accesses the common data long after the first, in which case the reordering we have performed during accelerator generation remains valid. These scenarios are handled using the mechanisms described in the following sections.
1) Vulnerability Window:
To distinguish these two scenarios, we introduce a new concept: the vulnerability window for memory partition reordering. It captures how far a memory access is rescheduled with respect to its predecessors in the original program order. The size of the window is determined by the scheduling of memory operations in the synthesized hardware. Given the original program order S_1 and the rescheduled instruction order S_2, we can determine the vulnerability window for each partition as follows.
1) In each partition P_m, find the set of instructions A_m, each of which has been rescheduled in S_2 to before its preceding memory instructions in S_1.
2) For each instruction I_i in A_m:
• Find the set of instructions preceding it in S_1. Among these instructions, pick the one instruction I_l that comes latest in S_2.
• Find the window covering I_i and I_l in S_2.
3) The largest window obtained in the previous step is used as the vulnerability window for P_m.
This process is exemplified in figure 3, where partitions 1 and 5 have vulnerability window sizes of 4 and 3, respectively. When a different partition's access to the same data falls outside of this window, we have a violation outside the vulnerability window (VOVW). A cache coherence scheme is required to move the data item so that the newest access has it in its cache. On the other hand, if the second access falls within the window, the reordered memory operations have already resulted in an incorrect execution. This violation in the vulnerability window (VIVW) raises an exception, and the processor takes over the execution. This is a high-cost error, but it should be rare: two instructions using memory for communication within such a short temporal distance should already have been observed and taken into account by the partitioning process. Mechanisms for periodic commits and restoration are built into the accelerators and the memory network, so that the processor can resume execution from a correct checkpoint.
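The three steps for sizing the vulnerability window can be sketched as follows. This is a simplified reading under stated assumptions: both schedules are treated as total orders of the same memory instructions (a real hardware schedule may issue several operations per cycle), the window size is counted inclusively as the number of slots in the rescheduled order from the hoisted instruction to its latest predecessor, and the function name and data layout are hypothetical.

```python
def vulnerability_windows(s1, s2, partition_of):
    """Vulnerability window size per partition.

    s1           : memory instructions in original program order
    s2           : the same instructions in rescheduled order
    partition_of : dict mapping instruction -> partition id
    """
    pos2 = {instr: i for i, instr in enumerate(s2)}
    windows = {}
    for idx, instr in enumerate(s1):
        part = partition_of[instr]
        windows.setdefault(part, 0)
        predecessors = s1[:idx]
        if not predecessors:
            continue
        # Step 2a: the predecessor that comes latest in the new schedule.
        latest = max(pos2[p] for p in predecessors)
        # Step 1: only instructions hoisted above a predecessor matter.
        if latest > pos2[instr]:
            # Step 2b: window covering both instructions (inclusive count).
            size = latest - pos2[instr] + 1
            # Step 3: keep the largest window seen for this partition.
            windows[part] = max(windows[part], size)
    return windows


program_order = ["a", "b", "c", "d"]
rescheduled = ["c", "a", "d", "b"]
partition_of = {"a": "P0", "b": "P1", "c": "P2", "d": "P0"}
sizes = vulnerability_windows(program_order, rescheduled, partition_of)
```

In this made-up schedule, `c` is hoisted above both `a` and `b`, giving its partition a window of 4, while `d` is hoisted only above `b`, giving a window of 2. A partition whose instructions are never hoisted gets a window of 0.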
2) Cache Coherence Scheme: The cache coherence scheme required by VOVW, described in section IV-B1, is built on top of the two internal memory buses. This bus-snooping protocol ensures that only one valid copy of the data is in the cache network unless all the owner partitions contain only load instructions. The behaviors of different caches when a miss is placed onto the request bus are detailed in 
3) Exception Scheme for VIVW:
The exclusive ownership of data by the caches, imposed by the cache coherence scheme, guarantees that cache hits do not result in any data inconsistencies in the system. Therefore, the detection of VIVW only involves comparing the cache misses on the request bus with the most recent accesses to each cache. Physically, a shift register is added to each cache to keep track of these past accesses. It records the address of each access, as well as the original value that is overwritten by each write access. This shift register, updated by every cache access, has a number of entries equal to the size of the partition's vulnerability window.
After detecting VIVW, the system needs to be restored to a known good state before the processor takes over. We have devised a checkpointing mechanism for registers and a restoration scheme for memory. In the accelerators, explicit store operations are inserted. Whenever a new iteration of the inner loop is started, the values corresponding to the original processor registers are written into a special cache, which can be loaded into the processor. For the memory state, besides undoing the memory writes in the vulnerability window for each cache, the evicted cache lines are held in a delay buffer before being committed to the main memory. When the VIVW is detected, the evicted cache lines are written back into the caches, and then the old value stored in each cache's shift register would be written to the memory, invalidating the corresponding cache entries.
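The per-cache shift register and the undo of recent writes can be modeled in software as a bounded history. This is a behavioral sketch only: the class and method names are hypothetical, memory is modeled as a plain dictionary, and the delay buffer for evicted lines and the register checkpoint cache are omitted. The sketch flags any address match as a VIVW; whether load/load matches are filtered out in hardware is not stated in the text.

```python
from collections import deque


class AccessHistory:
    """Software model of one cache's shift register of recent accesses."""

    def __init__(self, window_size):
        # One entry per slot of the partition's vulnerability window;
        # the oldest entry is shifted out automatically.
        self.entries = deque(maxlen=window_size)

    def record(self, address, old_value=None, is_write=False):
        # For writes, old_value is the value about to be overwritten.
        self.entries.append((address, is_write, old_value))

    def is_vivw(self, miss_address):
        # A miss from another cache that matches a recent access falls
        # inside the vulnerability window.
        return any(addr == miss_address for addr, _, _ in self.entries)

    def undo_writes(self, memory):
        # Restore overwritten values, newest first, so the oldest
        # recorded value for an address is the one that survives.
        while self.entries:
            addr, is_write, old_value = self.entries.pop()
            if is_write:
                memory[addr] = old_value


memory = {0x100: 1, 0x104: 2}
history = AccessHistory(window_size=3)

history.record(0x100, old_value=memory[0x100], is_write=True)
memory[0x100] = 7
history.record(0x104)  # a read: nothing to restore
history.record(0x100, old_value=memory[0x100], is_write=True)
memory[0x100] = 9

violation = history.is_vivw(0x100)
if violation:
    history.undo_writes(memory)
```

After the undo, `memory[0x100]` holds its value from before the window, which is the state the processor needs when it takes over.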
V. Evaluation
We implemented our prototype system on a Virtex-5 FPGA (XC5VLX155T-2). Taking nine applications from SPEC2006 and MiBench as input, our flow identified and synthesized ten hot regions into accelerators. The baseline system is a conventional processor platform using Xilinx's MicroBlaze. Each accelerator using a multi-cache memory access network was compared against its counterpart with a single cache, both implemented on the same device. To keep the silicon area for storage constant, the amount of memory used in the caches of the various implementations is normalized to be equivalent to a 64KB direct-mapped cache.
Shown in table II are the performance numbers of the accelerators. Because of limitations in our current infrastructure, applications were not run to completion during profiling. However, in all cases we are confident that the employed trace sufficiently represents the nature of the application. The first column in the table, % of captured trace, represents the amount of the execution trace that is eventually mapped to accelerator(s). T_d refers to the time during which the generated accelerators are actually running, and T_c refers to the time during which the accelerators are stalled for cache misses. The baseline represents the amount of time spent by the MicroBlaze-based platform in executing all the instructions in the selected region.
From the table, it can be observed that when the multi-cache network is used, T_d is reduced, demonstrating the benefit of exploiting memory-level parallelism. However, as the original cache is divided into multiple smaller caches, capacity misses occur more frequently. Customization of each cache alleviates the negative impact by reducing conflict misses, but there is still an overall increase in the miss rate. Meanwhile, the communication between caches also causes some performance degradation. Nevertheless, all these costs are outweighed by the benefit of better accelerator throughput, resulting in an overall gain in performance.
On average, when using a single-cache accelerator, we observed a performance improvement of 4.5x over the baseline. When memory-level parallelism is identified and exploited for each of the accelerators, the overall performance improves by another 52%, to 6.9x over the baseline implementation.
VI. Conclusion
For many memory-intensive embedded applications, sustaining high peak memory access bandwidth is the key to maximizing computing performance. In this work, a novel approach is developed to parallelize application-level memory accesses, using the abundant block RAMs and hardware flexibility of FPGAs. A performance improvement of 52% is achieved in experiments using our new multi-cache architecture and its complementary tool flow. We have shown that the proposed hardware template and methodology are effective in helping to overcome the performance bottleneck imposed by the traditional accelerator memory model.
