We have designed an automated visual surveillance system for monitoring sleeping infants. The low-level image processing is implemented on an embedded Xilinx Virtex II XC2v6000 FPGA and quantifies the level of scene activity using a specially designed background subtraction algorithm. We present our algorithm and show how we have optimised it for this platform.
Introduction
Advances in surveillance technology have exposed many potential areas of application, such as detection of criminal activity [4] and supportive care [3]. We have designed a surveillance system to monitor sleeping infants. Our design employs a distributed architecture in which low-level processing is implemented on-camera using an embedded FPGA unit, and the results are transmitted wirelessly to a PC-based monitoring station. The bandwidth requirement of the processed data is minimal, so multiple cameras may simultaneously transmit results to a single station. We term the camera/embedded processor setup a "smart camera".
Thus far we have built a PC-based prototype of the system for the purposes of functional evaluation. We have reimplemented the low-level processing functions on a Xilinx Virtex II XC2v6000 FPGA unit, so that we can evaluate the performance of the smart camera unit. We present our algorithms and the results of our evaluations, and show how we have optimised our low-level algorithm for the platform.
Algorithm Design and Evaluation
Our premise is that different levels of activity (characterised by intensity and duration) are indicative of waking state. Low-intensity activity is associated with the normal sleep-wake cycle. High-intensity movement indicates that attention may be required. Low-level processing estimates the level of activity from one frame to the next, and high-level processing models the activity level over time. For testing we used 720×576 monochromatic video images captured using a DV camcorder under infra-red illumination, which suffered from some level of noise.
Low-level (FPGA) processing is based on Stauffer's background subtraction algorithm [5]. We model one process per pixel, and use integer (fixed-point) arithmetic. We model each pixel's background process as a mean intensity value µ and a variance σ². For each new frame, each pixel value is classified as foreground or background based on this model. We apply morphological "erode" and "dilate" operations to the mask of foreground pixels, then calculate the proportion of foreground, θ, which defines the magnitude of activity for the frame.
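The per-pixel model described above can be illustrated with a minimal Python sketch. It is not the FPGA implementation: the fixed-point format (`FRAC_BITS`), the learning rate (`ALPHA_SHIFT`), the classification threshold (`K`), the variance floor (`MIN_VAR`), the 3×3 structuring element, and the choice to update the model only for background pixels are all assumptions made for illustration.

```python
# Illustrative single-Gaussian background model in integer arithmetic.
# All constants below are assumed values, not taken from the paper.
FRAC_BITS = 8               # fractional bits of the fixed-point format
ALPHA_SHIFT = 4             # learning rate of 1/16, done as a right shift
K = 3                       # foreground if more than K sigma from the mean
MIN_VAR = 4 << FRAC_BITS    # variance floor so static pixels tolerate noise


def to_fixed(value):
    """Convert an integer intensity to fixed point."""
    return value << FRAC_BITS


def update_pixel(pixel, mean, var):
    """Classify one pixel; return (is_foreground, new_mean, new_var).

    mean and var are fixed-point values with FRAC_BITS fractional bits.
    """
    diff = to_fixed(pixel) - mean
    sq = (diff * diff) >> FRAC_BITS          # squared distance, fixed point
    is_foreground = sq > K * K * var         # same as |diff| > K * sigma
    if not is_foreground:
        # Exponential update of mean and variance using shifts only.
        mean += diff >> ALPHA_SHIFT
        var += (sq - var) >> ALPHA_SHIFT
        var = max(var, MIN_VAR)
    return is_foreground, mean, var


def process_frame(frame, means, variances):
    """Update the model in place for a flat list of pixels; return the mask."""
    mask = []
    for i, pixel in enumerate(frame):
        fg, means[i], variances[i] = update_pixel(pixel, means[i], variances[i])
        mask.append(1 if fg else 0)
    return mask


def morph(mask, width, height, op):
    """3x3 erode (op=min) or dilate (op=max) on a flat mask, borders clamped."""
    out = [0] * (width * height)
    for y in range(height):
        for x in range(width):
            window = [
                mask[min(max(y + dy, 0), height - 1) * width
                     + min(max(x + dx, 0), width - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y * width + x] = op(window)
    return out


def activity(mask):
    """Proportion of foreground pixels (theta) in the mask."""
    return sum(mask) / len(mask)
```

Applying `morph(mask, w, h, min)` followed by `morph(..., max)` removes isolated foreground pixels before θ is computed, which is the purpose of the erode and dilate steps in the text.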
The monitoring station maintains a history of activity Θ = {θ_n, ..., θ_m}, where m is the index of the last value received. Each value in Θ is regarded as significant if it exceeds a threshold θ_s. A corresponding set of values Γ = {γ_n, ..., γ_m} is also maintained, and is related to the number of values θ ≥ θ_s in Θ. The monitoring station enters an alert state if the most recent value held in Γ reaches a threshold γ_s.
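One possible reading of this scheme is sketched below in Python. The text states only that Γ is related to the number of significant values in Θ, so the sketch assumes that γ is a simple count over a sliding window of fixed length; the class name `ActivityMonitor` and the `window` parameter are hypothetical.

```python
from collections import deque


class ActivityMonitor:
    """Sliding-window alert logic (an assumed interpretation of Theta/Gamma)."""

    def __init__(self, theta_s, gamma_s, window):
        self.theta_s = theta_s                 # significance threshold
        self.gamma_s = gamma_s                 # alert threshold
        self.history = deque(maxlen=window)    # Theta: recent activity values
        self.gamma = deque(maxlen=window)      # Gamma: recent counts

    def push(self, theta):
        """Record a new activity value; return True if an alert is raised."""
        self.history.append(theta)
        # Assumption: gamma counts the significant values currently in Theta.
        count = sum(1 for t in self.history if t >= self.theta_s)
        self.gamma.append(count)
        return count >= self.gamma_s
```

For example, with `theta_s=0.05`, `gamma_s=3` and `window=5`, a single significant frame does not raise an alert, whereas three significant frames within any five consecutive frames do.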
We used our PC prototype to evaluate the ability of our algorithm to discriminate between perceived levels of activity. Our test data comprises 34 hours of video footage of sleeping infants. We manually inspected the data set, recorded the start and end times of each period of activity, and graded its intensity. We compared this to the output of our system, and by varying the values of θ_s and γ_s generated a set of Receiver Operating Characteristic (ROC) curves. Figure 1 shows two example curves which demonstrate the ability of the system to differentiate perceived levels of activity.
The FPGA Implementation
Figure 2 shows how we have engineered the processing to use the parallelisation and pipelining capabilities of the platform, achieving a throughput of one complete pixel process per clock cycle. Each RGB pixel value is converted to greyscale, whilst the camera output addresses the corresponding statistical data in SRAM. The pixel value and statistics arrive simultaneously at the pipelined pixel processing unit. During processing, the updated statistics are written back to SRAM, and a 1-bit foreground pixel value is output.
Memory Access Control
Memory resources are a typical source of performance constraint [2, 1]. The storage required for the pixel statistics (µ and σ²) exceeds that available as BlockRAM. We therefore used the 4 banks of external SRAM to store them. These banks are single-port, so for optimal performance the data is double-buffered: it is arranged on 2 pairs of switchable banks, where one pair holds the current read-values and the other the write-values, and all read and write functions are automatically directed to the correct pair. The banks are switched each frame by a memory controller.
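The bank-switching scheme can be modelled in software as follows. This Python sketch is illustrative only: each bank pair is represented by a single list of (µ, σ²) tuples, since the text does not specify how the statistics are split between the two banks of a pair, and the class and method names are hypothetical.

```python
class DoubleBufferedStats:
    """Software model of two switchable bank pairs holding pixel statistics."""

    def __init__(self, num_pixels, initial):
        # One list per bank pair; each entry is a (mean, variance) tuple.
        self.pairs = [[initial] * num_pixels, [initial] * num_pixels]
        self.read_sel = 0                      # index of the current read pair

    def read(self, addr):
        """Read statistics written during the previous frame."""
        return self.pairs[self.read_sel][addr]

    def write(self, addr, value):
        """Write updated statistics to the other pair."""
        self.pairs[1 - self.read_sel][addr] = value

    def end_of_frame(self):
        """Swap the roles of the two pairs, as the memory controller would."""
        self.read_sel = 1 - self.read_sel
```

Because reads and writes within a frame never target the same pair, a single-port memory can serve one read and one write per cycle across the two pairs. The scheme relies on every pixel's statistics being rewritten in every frame; otherwise the pair that becomes readable after the swap would contain stale entries.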
We exploit the parallel architecture available on the FPGA to run different processing stages concurrently. Pixel values from the camera are converted through the RGB-to-greyscale converter, and pass directly into the 32-bit fixed-point converter. These two processes run in parallel, with a one-cycle latency. The pixel background classification and model update operations are recomposed as a four-stage processing pipeline, with each stage taking 1 clock cycle. This enables us to concurrently process four pixels, maintaining an effective throughput of 1 pixel per cycle. The resulting 1-bit-per-pixel foreground mask is stored in BlockRAM. Morphological operations (erode and dilate) are applied to this output concurrently with the four-stage pixel processing pipeline, and the results are displayed on a VGA monitor. The latency from pixel capture to completion of the four-stage processing pipeline is 8 cycles, with a further 14 cycles to traverse the morphological processing and be expressed on the VGA display.
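The relationship between pipeline depth, latency and throughput can be checked with a small cycle-level simulation. The Python sketch below models only a generic linear pipeline; the stage functions are placeholders, since the contents of the four stages are not detailed here, and the function name `run_pipeline` is hypothetical.

```python
def run_pipeline(pixels, stages):
    """Clock a linear pipeline one cycle at a time; return (outputs, cycles).

    regs[i] models the register at the output of stage i. None marks an
    empty slot (a bubble), so stage functions must not return None.
    """
    regs = [None] * len(stages)
    outputs = []
    cycles = 0
    feed = iter(pixels)
    pending = len(pixels)
    while pending:
        # Update from the last stage backwards so that each register is
        # consumed before it is overwritten, as on a clock edge.
        for i in range(len(stages) - 1, 0, -1):
            prev = regs[i - 1]
            regs[i] = stages[i](prev) if prev is not None else None
        nxt = next(feed, None)
        regs[0] = stages[0](nxt) if nxt is not None else None
        cycles += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
            pending -= 1
    return outputs, cycles
```

With four single-cycle stages, N pixels complete in N + 3 cycles: the first result appears after 4 cycles and one further result appears on every subsequent cycle. This model covers the four pixel-processing stages only, not the conversion, memory access or morphological stages that contribute to the total latencies quoted above.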
We compared the FPGA implementation with the corresponding PC prototype running on a 3.0 GHz Pentium 4 system. The PC prototype is written in C++ (compiled for maximum speed) and is able to process a frame in 40 ms. Coincidentally, this corresponds to the capture rate of our data. The FPGA system is synchronised directly to the input camera, and therefore operates at a fixed rate of 25 Hz. However, performance analysis indicates that much higher execution speeds are achievable. Since our processing is optimally pipelined, we process 1 pixel per clock cycle, which at a clock speed of 25 MHz corresponds to a total time of 16.6 ms for a 720×576 frame (60 Hz). This corresponds to around a 2.5× speed improvement.
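The arithmetic behind these figures can be reproduced as follows. The Python sketch uses only the frame size, clock speed and PC frame time quoted in the text, and assumes exactly one pixel per clock cycle with no allowance for pipeline fill or blanking intervals.

```python
WIDTH, HEIGHT = 720, 576       # frame size quoted in the text
CLOCK_HZ = 25_000_000          # 25 MHz clock
PC_FRAME_MS = 40.0             # PC prototype time per frame

pixels = WIDTH * HEIGHT                    # 414720 pixels per frame
frame_ms = pixels / CLOCK_HZ * 1000.0      # time per frame at 1 pixel/cycle
frame_rate_hz = 1000.0 / frame_ms          # achievable frame rate
speedup = PC_FRAME_MS / frame_ms           # relative to the PC prototype
```

Under these assumptions the frame time is just under 16.6 ms, the frame rate is slightly above 60 Hz, and the ratio to the PC prototype is about 2.4, consistent with the approximate figures given above.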
