Abstract
Introduction
Many video applications require substantial storage and bandwidth to maintain an acceptable quality of service. However, these resources are often limited, particularly on hand-held devices. Further, the existence of multiple video communication and processing standards implies a need for a flexible implementation platform. These factors motivate the acceleration of video compression algorithms, such as the Quad-tree Structured Difference Pulse Code Modulation (QSDPCM) algorithm [4], on FPGA platforms. To accelerate video applications, memory sub-systems are frequently built on top of video frame memories to exploit data re-use. Indeed, much work has been done for this application on processor-based platforms [1]. These techniques involve either a caching methodology or a fully static memory sub-system.
Our contributions are as follows. First, we introduce a parameterisable, hybrid memory sub-system, consisting of a scratchpad memory (SPM) and a custom parallel cache, which efficiently exploits data re-use in spite of data dependence. Second, the data dependent nature of memory accesses in the QSDPCM application is demonstrated using real video data. Third, compared with an implementation that only employs an SPM, the proposed memory sub-system is found to provide speed-ups of up to 1.7× and 1.4× respectively on two popular FPGA platforms. In addition, memory reductions of up to 3.2× are achieved.
Data dependent memory accesses
In the QSDPCM application, compression is achieved by exploiting spatial correlation in video images. Indeed, previous studies indicate that correlation is effectively captured by accounting for inter-frame movements [4], implying that pixel accesses are dependent on the motion vectors of the video. This data dependent feature is demonstrated in Figure 1 for two different video sequences. To handle data dependent memory accesses, a custom parallel caching methodology, shown in Figure 2, is employed. This scheme allows parallel accesses to sub-cache banks and arbitrates external memory accesses when sub-cache misses occur. Also, the address mapping scheme exploits two-dimensional spatial locality by ensuring that neighbouring pixels map to separate sub-cache banks.
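The exact address mapping is not given here, so the following is only a minimal sketch of one mapping with the stated property. It assumes a hypothetical grid of `banks_x` by `banks_y` sub-cache banks and interleaves pixels across them using the low-order part of each coordinate. The function names and the 2×2 default are illustrative assumptions, not the design described above.

```python
def bank_index(x, y, banks_x=2, banks_y=2):
    """Map pixel (x, y) to a sub-cache bank number.

    Assumption: banks form a banks_x by banks_y grid and pixels are
    interleaved across it, so horizontally and vertically adjacent
    pixels always fall in different banks.
    """
    return (y % banks_y) * banks_x + (x % banks_x)


def banks_for_block(x0, y0, width, height, banks_x=2, banks_y=2):
    """List the bank hit by each pixel of a width x height block at (x0, y0)."""
    return [
        bank_index(x0 + dx, y0 + dy, banks_x, banks_y)
        for dy in range(height)
        for dx in range(width)
    ]
```

Under this assumed mapping, any block of `banks_x` by `banks_y` neighbouring pixels touches every bank exactly once, so all of its pixels could in principle be fetched in parallel.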
Results
Experiments are carried out to measure the benefits of the hybrid memory sub-system. To make meaningful comparisons, the search window size for each video is set to capture 75% of the motion vectors. The platforms used are the Celoxica RC250 [2] and RC300 [3] boards, which contain the Stratix 2 and Virtex 2 FPGAs respectively. Four different schemes are considered. First, scheme Spm contains an SPM that accommodates the largest possible motion vector, so that data accesses are strictly limited to the SPM. Second, scheme Spm d uses an SPM that allows accesses to external memory if the required data is not in the SPM. Finally, schemes Hybrid 1 and Hybrid 2 contain SPMs and caches with capacities of 512 and 1024 pixels respectively.
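As an illustration of how a search window could be sized to capture a given fraction of the motion vectors, the sketch below picks the smallest radius that covers at least 75% of a measured set. It assumes a square window centred on the reference block and measures each vector by its larger absolute component. The function name and the input format (a list of `(dx, dy)` pairs) are hypothetical.

```python
import math


def search_radius(motion_vectors, coverage=0.75):
    """Return the smallest radius r such that at least `coverage` of the
    vectors satisfy max(|dx|, |dy|) <= r.

    Assumption: the search window is a square of side 2*r + 1 centred on
    the reference block.
    """
    if not motion_vectors:
        raise ValueError("need at least one motion vector")
    # Size of the square window each vector would need, in ascending order.
    magnitudes = sorted(max(abs(dx), abs(dy)) for dx, dy in motion_vectors)
    # Number of vectors that must fall inside the window.
    needed = max(math.ceil(coverage * len(magnitudes)), 1)
    return magnitudes[needed - 1]
```

For example, with the vectors `[(0, 0), (1, 0), (0, -2), (5, 3)]` a radius of 2 covers three of the four vectors, whereas covering all of them would need a radius of 5.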
For the hybrid schemes, an initial reduction in run time occurs as the SPM size increases. This result is expected because pixels in the neighbourhood of the reference block, which have high re-use frequencies, are buffered. However, execution time rises again as the SPM size continues to increase, because of the overhead incurred in buffering pixels farther away from the reference block. Table 1 shows the realised speed-ups, compared with Spm d. For small inter-frame movements ('hkmovement'), the hybrid schemes use Figure 3 . Conversely, for large inter-frame movements ('parachute'), the hybrid schemes use up to 3.2× fewer block RAMs.
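The access path of a hybrid scheme can be pictured with a small behavioural model. The sketch below is an assumption-laden illustration, not the hardware design: it treats the SPM as a square region of radius `spm_radius` around the current block centre, models the cache as a single direct-mapped array rather than parallel sub-cache banks, and ignores timing. The class name `HybridMemory` and all of its parameters are hypothetical.

```python
class HybridMemory:
    """Behavioural model: SPM first, then cache, then external memory."""

    def __init__(self, frame, spm_radius, cache_lines):
        self.frame = frame              # external frame memory, frame[y][x]
        self.spm_radius = spm_radius    # assumed square SPM region
        self.cache_lines = cache_lines  # assumed direct-mapped, one pixel per line
        self.cache = {}                 # line index -> ((x, y), pixel)
        self.counts = {"spm": 0, "cache": 0, "external": 0}

    def read(self, x, y, cx, cy):
        """Read pixel (x, y) while processing the block centred at (cx, cy)."""
        # Pixels close to the block centre are modelled as already in the SPM.
        if max(abs(x - cx), abs(y - cy)) <= self.spm_radius:
            self.counts["spm"] += 1
            return self.frame[y][x]
        # Pixels outside the SPM region go through the cache.
        line = (y * len(self.frame[0]) + x) % self.cache_lines
        entry = self.cache.get(line)
        if entry is not None and entry[0] == (x, y):
            self.counts["cache"] += 1
            return entry[1]
        # Cache miss: fetch from external memory and fill the line.
        self.counts["external"] += 1
        pixel = self.frame[y][x]
        self.cache[line] = ((x, y), pixel)
        return pixel
```

In such a model, enlarging `spm_radius` moves accesses from the cache and external counters to the SPM counter. The model does not capture the cost of filling a larger SPM, which is the overhead behind the upturn in execution time noted above.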
Conclusion
In conclusion, performance gains and memory resource savings of up to 1.7× and 3.2× respectively are realised using a hybrid memory sub-system on contemporary FPGA platforms. Greater gains are realised for video sequences with greater inter-frame movements. The results emphasise the need for dynamic memory sub-systems in custom hardware to deal with data dependent memory accesses. In future, we intend to develop techniques to determine the optimum parameters of the memory sub-system.
