Abstract-Advances in FPGA technology have dramatically increased the use of FPGAs for computer vision applications. The availability of on-chip processors (such as the PowerPC) has made it possible to design embedded systems using FPGAs for video processing applications. The objective of this research is to evaluate the performance of the different memory components available on FPGA boards for embedded/platform-based implementations of image/video processing applications. A clustering based change detection algorithm for Ubiquitous Multimedia Environment is selected for evaluating the effect of different memory components (DDR/BRAM) on the performance of the system in terms of frame rate (frames per second).
I. Introduction
The main challenge for computer vision/image processing applications lies in achieving real-time performance. For this reason, different technologies and design methodologies have been used to build computer vision systems. These technologies range from general purpose processors and special purpose digital signal processors to application specific integrated circuits (ASICs), application specific instruction set processors (ASIPs), and field programmable gate arrays (FPGAs). Recent advances in FPGA technology have increased the use of FPGAs for computer vision applications. Current generation FPGAs have size and performance comparable to those of ASICs and provide the flexibility to make algorithmic changes in later stages of system development. Moreover, their structure is able to exploit the spatial and temporal parallelism inherent in many image processing applications. These features have increased the interest of researchers in FPGAs for implementing computer vision/image processing systems [1], [2]. The availability of on-chip embedded processors (such as the PowerPC and/or MicroBlaze) has made it possible to design a whole embedded system for video/image processing applications using hardware/software co-design based methodologies. Recently, many research papers related to the implementation of video/image processing applications on FPGAs have been presented in the literature [3]-[7] (a thorough discussion of this body of work is, however, beyond the scope of this paper).
For any software program running on an embedded processor, the two main sections that require storage are the program/instructions and the intermediate data objects. In this research, we study the effect of storing the program/instructions and intermediate data objects in different memory components (Block RAM, DDR, or a combination of both) on the performance (frame rate) of an image processing application. The clustering based change detection scheme [8] is chosen for this study for two reasons. First, the program/instruction memory size is large enough to study the effect of different memory components on program execution time. Second, this algorithm requires the storage of background related information for each block of pixels in the image. Therefore, a large amount of memory is needed for storing intermediate data objects, which is important for studying the effect of the memory components selected for storage of intermediate data objects.
The rest of the paper is organized as follows. Section II briefly describes the clustering based change detection scheme. Section III presents the implemented FPGA based embedded system architecture and the different memory partitionings that are evaluated. Section IV discusses the results and evaluates the performance. Finally, Section V concludes the paper.
II. Clustering Based Change Detection Scheme
1. Each incoming gray frame is partitioned into 4x4 blocks.
2. Each block is represented by a group of clusters, and each cluster contains a centroid value and the number of the frame that most recently updated the cluster.
3. Initially, the cluster set of each block is initialized with a cluster whose centroid is set to the average color value of the corresponding block of the first frame.
4. Each block has 3 cluster nodes.
5. For every new frame, each block is compared with the corresponding cluster group. The difference is computed as the Manhattan distance between the average color value of the block and the centroid of each cluster.
6. If the difference is below a threshold value, the cluster is considered a matching cluster.
7. For a matching cluster, the frame number and centroid associated with the cluster node of the corresponding block are updated.
8. For a given block, if no matching cluster is found, a new cluster node is created by replacing the existing cluster node that has not been updated for the longest period of time.
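The steps above can be sketched in C for a single block as follows. This is a minimal illustration, not the code used in the paper: the type and function names (cluster_t, block_average, clusters_init, clusters_update), the in_use flag, and the choice of averaging the old centroid with the new block average on a match are all assumptions, since the update rule and threshold value are not specified here. For gray frames, the Manhattan distance between two scalar averages reduces to an absolute difference.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCK_SIZE 4          /* 4x4 pixel blocks (step 1) */
#define CLUSTERS_PER_BLOCK 3  /* three cluster nodes per block (step 4) */

/* Hypothetical layout of one cluster node. */
typedef struct {
    uint8_t  centroid;    /* average gray value represented by the cluster */
    uint32_t last_frame;  /* frame number that most recently updated it */
    uint8_t  in_use;      /* 0 until the node holds a valid cluster */
} cluster_t;

/* Average gray value of the 4x4 block whose top-left pixel is (x0, y0). */
uint8_t block_average(const uint8_t *frame, int width, int x0, int y0)
{
    unsigned sum = 0;
    for (int y = 0; y < BLOCK_SIZE; y++)
        for (int x = 0; x < BLOCK_SIZE; x++)
            sum += frame[(y0 + y) * width + (x0 + x)];
    return (uint8_t)(sum / (BLOCK_SIZE * BLOCK_SIZE));
}

/* Step 3: seed the cluster group with one cluster taken from the first frame. */
void clusters_init(cluster_t c[CLUSTERS_PER_BLOCK], uint8_t avg, uint32_t frame_no)
{
    for (int i = 0; i < CLUSTERS_PER_BLOCK; i++) {
        c[i].centroid = 0;
        c[i].last_frame = 0;
        c[i].in_use = 0;
    }
    c[0].centroid = avg;
    c[0].last_frame = frame_no;
    c[0].in_use = 1;
}

/* Steps 5-8: returns the index of the matching cluster, or -1 if no cluster
 * matched and a node was (re)created for this block. */
int clusters_update(cluster_t c[CLUSTERS_PER_BLOCK], uint8_t avg,
                    uint32_t frame_no, int threshold)
{
    int victim = 0;  /* node to overwrite if nothing matches */
    for (int i = 0; i < CLUSTERS_PER_BLOCK; i++) {
        if (c[i].in_use && abs((int)avg - (int)c[i].centroid) < threshold) {
            /* Step 7: refresh the match (simple averaging is an assumption). */
            c[i].centroid = (uint8_t)(((int)c[i].centroid + (int)avg) / 2);
            c[i].last_frame = frame_no;
            return i;
        }
        /* Prefer an unused node, otherwise the least recently updated one. */
        if (!c[i].in_use) {
            if (c[victim].in_use)
                victim = i;
        } else if (c[victim].in_use && c[i].last_frame < c[victim].last_frame) {
            victim = i;
        }
    }
    /* Step 8: replace the chosen node with a new cluster. */
    c[victim].centroid = avg;
    c[victim].last_frame = frame_no;
    c[victim].in_use = 1;
    return -1;
}
```

A caller would compute block_average for every 4x4 block of a frame and pass the result to clusters_update together with the frame number. How a match or a miss is turned into a change-detection output is left to the caller, since it is not described in the steps above.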
III. System Architecture
This section discusses three different embedded/platform-based implementations of the clustering based change detection scheme on a Virtex-II Pro FPGA board. The basic block diagram of the proposed embedded/platform-based implementation is shown in Fig. 1. There is one embedded processor block (PPC405) for executing the software programs, three memory modules (DDR memory, BRAM memory, and Flash memory), one TFT controller module connected to the display monitor, one JTAG controller connected to the host computer via a JTAG cable, one UART controller connected to the host computer via an RS232 cable, two buses (PLB and OPB) for interconnecting the different modules, and one PLB2OPB bridge for communication between the two buses.
The embedded platform for executing the clustering based change detection scheme is implemented using the Xilinx Embedded Development Kit (EDK) tool chain. All the blocks are configured using this software. The system architecture produced by EDK is shown in Fig. 3. The embedded processor ppc405 is connected to the processor local bus (plb) through two interfaces: data PLB (DPLB) and instruction PLB (IPLB). The JTAG PowerPC controller (jtagppc_cntlr) connects the PowerPC405 to the host computer for programming and debugging. The DDR and Block RAM memories are connected to the PLB through the plb_ddr and plb_bram_if_controller interfaces respectively. The TFT display controller is also connected to the PLB. The Device Control Register (DCR) bus is used to configure the registers of the plb_tft-cntlr_ref module. The System ACE Compact Flash and RS232 UART modules are connected to the on-chip peripheral bus (opb), which is slower than the PLB. The PLB and OPB communicate through plb2opb_bridge, and the DCR bus and OPB communicate through opb2dcr_bridge.
The architecture of Fig. 3 is used to evaluate the performance of the different memory components for image/video processing applications, using the clustering based change detection scheme described in Section II.
2. A C/C++ program running on the PowerPC405 reads the image files from Flash memory and stores them in DDR memory.
3. The video frames stored in DDR memory are converted to grayscale and processed by the change detection C/C++ code running on the PowerPC405, and the results are stored back in DDR memory.
4. The processed video frames stored in DDR memory are displayed on the monitor by a C/C++ program running on the PowerPC405. This program provides the addresses of the processed video frames to the TFT controller.
5. The complete process is monitored using HyperTerminal on the host computer through the RS232 UART port.
6. The executable program files are downloaded from the host system to the FPGA PowerPC405 using the JTAG interface and the Xilinx Microprocessor Debugger (xmd) tool.
7. The time taken by the change detection C/C++ program running on the FPGA PowerPC405 is measured and the frame rate is computed.
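The measurement in the last step can be sketched in portable C as below. This is only an illustration under stated assumptions: the callback type frame_fn, the helper count_frame, and the use of the standard clock() function are hypothetical stand-ins. On the target, the elapsed time would come from whatever timer the platform provides, which is not specified here.

```c
#include <time.h>

/* Hypothetical callback type: processes the frame with the given index. */
typedef void (*frame_fn)(int frame_index, void *ctx);

/* Example callback used only for illustration: counts processed frames. */
void count_frame(int frame_index, void *ctx)
{
    (void)frame_index;
    (*(int *)ctx)++;
}

/* Runs fn over n_frames frames and returns the elapsed CPU time in seconds,
 * measured with the standard C clock() function. */
double time_frames(frame_fn fn, void *ctx, int n_frames)
{
    clock_t start = clock();
    for (int i = 0; i < n_frames; i++)
        fn(i, ctx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Frame rate in frames per second; returns 0 if the elapsed time is not
 * positive, to avoid dividing by zero on very short runs. */
double frame_rate(int n_frames, double elapsed_seconds)
{
    return elapsed_seconds > 0.0 ? n_frames / elapsed_seconds : 0.0;
}
```

In use, the change detection routine would be passed as the callback in place of count_frame, and the returned time would be fed to frame_rate together with the number of frames processed.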
Three implementations are considered and executed on the PowerPC405 embedded processor for this evaluation. These are discussed below.
Implementation I
In this implementation, only the video frames are stored in DDR memory. All program execution related sections (Boot Section, Program/Instructions, Intermediate Data Objects, Stack, and Heap) are stored in Block RAMs. The memory partitioning for this implementation is shown in Fig. 4. A snapshot of the download of the executable.elf file for this implementation is shown in Fig. 5. It is clear that all sections reside in Block RAMs (in the address range 0xfffe0000 to 0xffffffff).
Implementation II
In this implementation, the intermediate data objects section is moved to DDR memory. Therefore, DDR memory contains the video frames and the intermediate data objects. All remaining program execution related sections (Boot Section, Program/Instructions, Stack, and Heap) are stored in Block RAMs. The memory partitioning for this implementation is shown in Fig. 6. A snapshot of the download of the executable.elf file for this implementation is shown in Fig. 7. It is clear that the intermediate data objects section resides in DDR memory (in the address range 0x00000000 to 0x00011067) and all other sections reside in Block RAMs (in the address range 0xfffe0000 to 0xffffffff).
Implementation III
In this implementation, all program execution related sections except the Boot Section are moved to DDR memory. Therefore, DDR memory contains the video frames and all remaining program execution related sections (Program/Instructions, Intermediate Data Objects, Stack, and Heap). The Boot Section must stay in Block RAMs for the program to execute successfully. The memory partitioning for this implementation is shown in Fig. 8. A snapshot of the download of the executable.elf file for this implementation is shown in Fig. 9. It is clear that the sections moved to DDR memory reside in the address range 0x00000000 to 0x0005da1f, while the Boot Section resides in Block RAMs (within the address range 0xfffe0000 to 0xffffffff).
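To make the address ranges quoted for the three implementations easier to compare, the sketch below computes the size of each inclusive range. The type and names (addr_range_t, range_size, range_contains, BRAM_RANGE, DDR_IMPL2, DDR_IMPL3) are illustrative assumptions; only the addresses themselves are taken from the text.

```c
#include <stdint.h>

/* Inclusive address range, as quoted from the download snapshots. */
typedef struct {
    uint32_t start;
    uint32_t end;
} addr_range_t;

/* Size of the range in bytes. The arithmetic is done in 64 bits so that a
 * range ending at 0xffffffff does not overflow. */
uint64_t range_size(addr_range_t r)
{
    return (uint64_t)r.end - r.start + 1;
}

/* Returns 1 if addr lies inside the range, 0 otherwise. */
int range_contains(addr_range_t r, uint32_t addr)
{
    return addr >= r.start && addr <= r.end;
}

/* Ranges reported for the three implementations. */
const addr_range_t BRAM_RANGE = { 0xfffe0000u, 0xffffffffu }; /* Block RAM  */
const addr_range_t DDR_IMPL2  = { 0x00000000u, 0x00011067u }; /* Impl. II   */
const addr_range_t DDR_IMPL3  = { 0x00000000u, 0x0005da1fu }; /* Impl. III  */
```

With these values, range_size gives 131072 bytes (128 KB) for the Block RAM range, 69736 bytes for the DDR range used in Implementation II, and 383520 bytes for the DDR range used in Implementation III.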
IV. Results
The embedded system/platform for the change detection algorithm is designed using the Xilinx Embedded Development Kit (EDK). For performance evaluation, CIF (352x288) gray video frames are used. The video frames are stored in Flash memory. Storing the program/instructions as well as the intermediate data objects in Block RAMs (Implementation I) processed 50 video frames in 10 seconds (a frame rate of 5 frames per second). Storing the intermediate data objects of the algorithm in DDR and the program/instructions in Block RAMs (Implementation II) resulted in a frame rate of approximately 4.25 fps (50 frames processed in about 12 seconds). Storing both the intermediate data objects and the program/instructions in DDR (Implementation III) resulted in the slowest processing: it takes 17 seconds to process 50 frames (a frame rate of approximately 3 frames per second).
Fig. 10 shows a snapshot of the host system monitor screen, where the program execution is monitored on HyperTerminal (connected to the FPGA through the UART port). It shows the successful reading of the video frames from Flash memory (and their storage in DDR memory), the display of the original video, the program execution, and the display of the processed video. Fig. 11 shows the initial frame 1, when no object is present in the scene; the intermediate frame 23, when a cow starts entering the scene; and the change detection results produced by all three implementations for frame 23. Fig. 12 shows the initial frame 1, when no object is present in the scene; the intermediate frame 38, when the cow has entered and is in the middle of the scene; and the change detection results produced by all three implementations for frame 38.
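The reported figures can be cross-checked with a few lines of C. The sketch below is illustrative only: the names run_t, RUNS, fps, and slowdown are assumptions, and the inputs are the whole-second times quoted above. Note that 50 frames in 12 seconds works out to about 4.17 fps, slightly below the quoted 4.25 fps. A rounded elapsed time could account for the gap, but this cannot be confirmed from the text.

```c
/* One measured run: number of frames processed and elapsed time. */
typedef struct {
    const char *name;
    int frames;
    double seconds;
} run_t;

/* Figures as reported in this section (times in whole seconds). */
const run_t RUNS[3] = {
    { "Implementation I",   50, 10.0 },
    { "Implementation II",  50, 12.0 },
    { "Implementation III", 50, 17.0 },
};

/* Frame rate in frames per second. */
double fps(const run_t *r)
{
    return r->frames / r->seconds;
}

/* Ratio of per-frame processing time of r to that of the baseline run. */
double slowdown(const run_t *r, const run_t *baseline)
{
    return (r->seconds / r->frames) / (baseline->seconds / baseline->frames);
}
```

Relative to Implementation I, these inputs give a per-frame slowdown of 1.2 for Implementation II and 1.7 for Implementation III.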
V. Conclusion
This paper has presented a performance evaluation of the different memory components available on a Virtex-II Pro FPGA board. Using this study, a designer should be able to find an optimal memory system for designing an embedded/platform-based system for a given image/video processing application. It is found that storing program/instructions in Block RAMs and intermediate data objects in DDR makes the best trade-
