Abstract
Introduction
Computer vision and video processing often involve computationally intensive tasks that need to be applied to data streams in real time, with applications ranging from image-guided surgery, security and surveillance to home-care monitoring.
General-purpose computers can support a widevariety of tasks, but are often too slow or too power hungry for vision applications. FPGAs (Fieldprogammable Gate Arrays) provide an attractive alternative: they combine the flexibility of software with a speed approaching that of custom hardware technology. It has been shown that, for selected applications, an FPGA at tens of MHz can run up to 1000 times faster than a microprocessor with a GHz clock [3] , while moving critical software loops into hardware can result in average energy savings of 35% to 70% with an average speedup of 3 to 7 times, depending on the particular device used [12] .
In this paper we focus on the design and implementation of an FPGA-based architecture for human posture analysis. Our goal is to build a pervasive visual sensing environment that can monitor and assess daily activities of home care patients. Instead of using body sensors, we rely on images captured by video cameras to identify multiple visual cues, such as projection histogram and radial shape description, to estimate changes in posture and gait. A number of clinical studies have shown that changes in posture and gait indicate the onset or progress of various diseases, such as early signs of neurological abnormalities linked to dementia [13] .
Our method is based on previous work on ubiquitous sensing for managed homecare of the elderly [15] , and includes the following four contributions:
1. requirements for a posture analysis system with hardware support (Section 3);
2. a design for a posture analysis system that fulfils these requirements (Section 4);
3. new architectures and their implementation targeting a Xilinx XC2V6000 FPGA using a high-level hardware design approach (Section 5);
4. performance evaluation and comparison to software for our derived implementations (Section 6).
The rest of this paper is structured as follows. Section 2 covers background material, and provides an overview of our work. Section 3 explores the requirements of a posture analysis system with hardware support. Section 4 describes a design workflow for producing a hardware system that would meet the requirements, while Section 5 discusses its implementation. Section 6 evaluates the performance and accuracy of our approach, and Section 7 summarises the paper.
Background
Analysis of human motion has in recent years become one of the most active areas of research in the field of computer vision. It provides the opportunity of using an unobtrusive domestic health monitoring system for home-care patients by detecting changes in posture and gait to determine the onset of an adverse event or worsening of an existing condition. In this paper we consider a posture analysis system based on a frameby-frame technique [15] which comprises three stages: human blob detection, posture type matching based on blob metrics, and behaviour profiling.
Human blob detection.
A blob describes the shape of an object against a blank background. A common method to extract blobs is to employ image differencing and thresholding as shown in Fig. 1 (c) and (d) respectively. The former compares an image with a reference (background) frame to see which parts of the image have changed. This is a simple process of taking the difference between two images for each pixel in turn
To take into account variations in the background, such as changes in lighting conditions, the reference image can be adapted progressively. On the other hand, the thresholding process generates a 1-bit (binary) image where pixels with values above the chosen threshold are part of the blob representation, and below the threshold define the background. Binary images are ideal representation for blobs since they are fast to process and store. Noise and distortion can be removed using Gaussian filters as can be seen in Fig. 1 (b) .
Blob metrics. Once a blob is extracted, we can represent it in a number of ways (Fig. 2) . Different representations of the blob shape may reveal different features of the shape of the blob. Examples of blob representations include:
• Projection histogram representation. Projection histograms are one-dimensional representations that describe the distribution of pixels of an object along the horizontal and vertical axes. Projection histograms can be generated by projecting the binary image on each of the axes. Fig. 2 (c) and (d) show the horizontal and vertical projections of the blob in Fig. 2 
(a).
• Radial shape representation. Radial shape representation describes the outline of the blob by measuring the distance of the outline from the shape centroid at various angles. The radial shape bears resemblance to the blob contour but, as the shape is described as a function of the angle instead of along the contour, it will not be the same. Fig. 2 (b) shows an example of the radial shape of the blob of Fig. 2 (a) , plotted in polar coordinates to show the resemblance of the shape and the object.
Posture type matching. Posture estimation can be determined by fusing two or more blob metrics, such as the projection histogram, radial shape and elliptical fitting. In particular, blob metrics are used to classify a Behaviour profiling. To analyse posture changes, an accumulate score based on appearance information is kept so that if a score is beyond a certain range considered as normal, then the system registers it as a change in behaviour. Also, the optical-flow technique can be used to monitor subtle changes in posture from certain parts of the body, such as head, hand and legs.
For the scope of this paper, we focus on human blob detection for hardware implementation, and posture type matching for software execution.
Previous work. The recognition and analysis of human motion and activity are currently some of the most active research topics in the field of computer vision [16] . For example, W 4 [6] uses a combination of shape analysis and tracking. It is a sophisticated feature-based system that can be used to track more than one person and recognise various activities. Its posture recognition uses projection histograms as well as silhouette shape.
Some of the most sophisticated systems are not aimed for real-time detection. A system containing an automatically calibrating, distributed set of sensors has been proposed that learns common patterns of activity, and can detect patterns that are out of the ordinary [5] .
Other systems use statistical modelling with multiple cameras and can detect interesting activity even with random movement in the background.
Design Approach and Requirements
This section introduces the selection criteria and requirements for implementing the posture analysis system into hardware. The basic workflow and the scope of the system is shown in Fig. 3 .
For an algorithm to be worth considering for hardware implementation, it must be able to behave reliably and predictably in a variety of conditions. Reallife scenes include complications such as image noise, changing illumination and camera shake. The aim of the hardware architecture is to input a video frame, locate an object in the frame, then output data describing the shape of this object. Some of the considerations taken in account when designing the architecture include:
Noise reduction. The purpose of noise reduction in the posture analysis system is to ensure that the binary image with the blob to be analysed is as clear as possible ( Fig. 1(b) ). In practice an initial low-pass filtering of the image seems to give the best results. Blur filtering usually reduces detail in the image; sophisticated structureadaptive filters are sometimes used to achieve best results with minimal loss of detail. In the case of the posture analysis workflow, the loss of detail is not a great concern as, because of later thresholding, the filtering has minimal effect on the shape of the blob. Applying morphological filters including erosion and dilation further corrupts the outline of the blob, and does not remove large groups of pixels. A large kernel causes more dramatic changes to the blob outline, while a small kernel does not remove larger groups of noise pixels.
Reference image updating.
Because the image processing system should perform well under a variety of conditions, it should be able to adapt to changing conditions. Reference image updating should therefore be used to enable the system to adapt to gradual changing conditions. The actual update rate should be slow enough to adapt to the changing environmental conditions while ensuring that interesting objects that merely move slowly do not blend into the background.
It is possible to adjust the rate at which the reference image is updated by controlling two variables. First, the update frequency can be adjusted so that the reference image need not be updated with each frame. Second, because the reference image is a weighted average of the previous reference image and the latest frame, the ratio between the two source can be adjusted. A combination of these two parameters provides finergrain control on how the reference image is updated.
Colour space considerations. The selection of the colour space is important to the functionality of the thresholding filter. Ideally, the image should be in a format where the contrast between the object and the background is as high as possible in all circumstances, while noise blends well into the background.
For physiological reasons, the eye is more sensitive to differences in brightness (luminance) than colour (chrominance). This fact is used in many applications, including video capture and transmission (where the luminance channel holds more information that the chrominance channel), and lossy compression. In the context of this project this means that when pictures are acquired from a video camera or an image sequence, the quality of individual channels in the frame may be low. Hence decomposing the image may lead to poor results even if the channel increases contrast.
Because of these problems, and due to the extra overhead required for colour space conversion, the RGB colour model is chosen for this project. This approach works well in most cases for thresholding, and the quality of the results can be improved by thresholding each channel separately and ORing the result.
Monitoring home care patients. Since the intended use of this system is to monitor home care patients, we need to take into account two considerations:
• Privacy. One of the possible advantages of a posture analysis system, as opposed to a traditional video monitoring system, is that it provides a level of privacy. It should not be possible to reconstitute an image frame from the posture descriptions.
• Bandwidth. Most residential buildings are not equipped with high-speed data links. Any data to be transmitted between a posture analysis node and some central monitoring location should have low bandwidth requirements.
The system assumes a single occupancy environment for elderly who live alone. This is the most critical situation where continuous sensing is required. A similar assumption is made by a number of existing tracking systems [10, 11, 15] . Systems to track multiple people exist, both for isolated people [1, 7] or people in groups [6] . Object tracking has also been implement in hardware [8] .
Workflow Design
This section introduces the posture analysis workflow, taking into account the requirements set in the previous section. It outlines the main points about the design of some of the algorithms used in the system.
The main workflow is shown in Fig. 3 , and is further expanded from an implementation perspective in Fig. 4 . First, a frame is acquired from a video source. Next a blur filter (BlurBlock) is applied to the frame to reduce noise, as explained in the previous section. A difference filter (DiffBlock) finds the difference between the blurred frame and a blurred reference image. Then, a threshold filter (ThreshBlock) is applied to create a binary image. This image is then used by the histogram (HistBlock) and radial (RadialBlock) algorithms to output posture descriptions.
Projection histogram.
A projection histogram describes the distribution of pixels across the image. The nth element of the horizontal projection histogram is a count of the number of white pixels in the nth column of the binary (blob) image; similarly, the mth element of the vertical projection histogram is a count of the number of white pixels in the mth row in the image.
Radial shape. The radial shape describes the outline of the blob as an array of distances from the centre of the blob over a full rotation around the blob.
The projection histogram can be used to find the centre of the blob. First, count all white pixels in the image and store the value in sum.
Then, using a loop, find n such that
where n is the point in the histogram where half of the binary image pixels are on either side, and therefore it marks a coordinate of the centre of the blob along one axis.
To find the radial shape, go over all integral angles from 0 to 359 degrees in a loop. For each angle, follow a straight line from the centre of the blob to the direction of the angle until the edge of the shape (black pixel) is found.
Hardware Implementation
This section describes the hardware implementation of the posture analysis system, which is shown in Fig. 5 . The design is pipelined and contains three stages that are run in parallel using a dual-buffering technique. The first stage (BlobGen) generates a binary image containing the blob. The second stage produces the horizontal and vertical projection. This data is used not only as a metric to determine posture position, but can also help determine the blob centroid. The last stage determines the radial shape.
Next we highlight some aspects of the implementation that are specific to hardware implementation. Ring buffer. While microprocessors work generally well with algorithms that access data randomly, hardware-based architectures work best with algorithms that are based on processing streams of elements in a pipeline. For the case of a convolution (BlurBlock), memory accesses can be reduced by introducing a ring buffer to store pixels needed for subsequent stages of filtering.
The ring buffer is usually implemented using block RAMs in FPGAs, and must be large enough to store pixels needed for the convolution mask. In the case of a 3 × 3 pixel filter kernel and an image of width w, the ring buffer must be able to store at least 2w + 3 pixels. The advantage of this technique is that the 9 memory accesses required to process a pixel (for a 3 × 3 kernel mask) are reduced to a single access.
Reference image updating.
Updating the reference image without using floating-point operations places some limitations on how to perform the operation. If the decay rate is high, for example w = 1/4, integer operations can give acceptable results. In this case, the operation is:
Problems arise if w is very low. For example, if w = 1/128, then:
In this case the value that the current pixel contributes to the result is so small that rounding errors may mean it does not affect the end result. There are two possible solutions to the problem. The first solution is to set w to a reasonably high value can reduce the effects of rounding errors. This can then be compensated by lowering the reference image update frequency. The second solution is to increase the precision of the stored reference image and treating the reference image pixels as fixed point values. Instead of storing each colour component as an 8-bit value, it could be stored in 12 bits. Now, the operation with w = 1/128 is:
This ensures that the pixel values are not stored as integers but rather as fixed-point values with 4-bit decimal portion. In addition, the operation can be done with no multiplies or divisions, with shifts and adds:
When shifts are used, the weight must be a power of two. This is usually an acceptable limitation, though, because this granularity can be compensated with a suitable combination of w and the reference image update frequency.
Approximating trigonometric functions. In order to find a radial description of a blob, the algorithm needs to iterate over different angles for a complete rotation. The software version, running on a CPU with a floatingpoint processor, used sine and cosine functions to convert between Cartesian and polar coordinate systems. Such functions are not readily available for programming hardware.
A number of algorithms exist for approximating trigonometric functions without a floating-point unit [9] . CORDIC [14] is an iterative algorithm for calculating trigonometric functions to an arbitrary precision using only shifts and adds. Table lookup is a simple and effective method for trigonometric approximation. Multipartite tables can be used for greater precision, either using lookup with add or lookup with multiply. Polynomial approximation such as the Taylor series can be used as well.
The radial description algorithm aims to find the distance of the blob outline over a complete rotation. This is done by iterating over a set of angles between 0 and 359 degrees at 1 degree intervals. For the purposes of this algorithm and for the frame sizes considered, this is a reasonable precision. For each angle, starting from the blob centroid, the algorithm moves outward one step at a time until it finds the blob boundary. The distance of the boundary from the blob centroid is then recorded in an array.
A lookup table of 91 entries is used to do direct table lookup for integral angles between 0 and 359 degrees. This table stores a quarter of a cosine wave between 0 and 90 degrees. All angles between 0 and 359 degrees for both sine and cosine can be generated by shifting and mirroring the quarter-wave along each of the axes.
In order to look up the sine and cosine values in the same clock cycle, dual-ported RAMs are used to store the lookup table. A C program is used to generate a table using any arbitrary fixed-point precision. The current architecture uses a 32-bit lookup table.
Pipelined radial block. The radial analysis algorithm is pipelined. The pipeline stages are: 
3. Multiply the angles by the radius to get the offset from the centre in Cartesian coordinates.
4. Add the offset coordinates to the centroid coordinates to get the absolute position of the point in considerations.
5. Translate each (x, y) coordinate pair into a location in the memory. The memory location is simply y × [framewidth] + x. For certain known frame widths, this can be made more efficient with a combination of shifts and adds. For example, for a frame width of 320 pixels, the location is (y << 8) + (y << 6) + x.
6. Read memory at the location.
7. If the pixel read is black (background colour), we have found a shape boundary so break and repeat for the next angle. Otherwise increase radius and repeat.
Hardware design with Haydn-C. We use the Haydn design flow [4] for generating hardware designs. Haydn extends DK3 [2] capabilities of simulation and hardware synthesis with automatic source-to-source transformations. In particular, Haydn supports two main features: deriving architectures that meet performance goals involving metrics such as resource usage and execution time, and inferring design behaviour by generating behavioural code that is easy to verify and modify from scheduled designs such as pipeline architectures. Fig. 6 shows the hardware compilation design flow. All hardware architectures have been implemented using the Haydn-C language [4] , which is an extension to the Figure 6 . The Haydn design flow, which performs hardware synthesis, simulation and source-tosource transformations. The Haydn-C language is used to describe hardware designs, whereas C++ is used to implement the host which runs on a CPU.
Handel-C language. It contains an annotation language that directs the source-to-source transformation process as well as including powerful macro capabilities. On the other hand, the software module executing on the host, such as the posture type matching, is built using Visual Studio C++. One main feature of the Haydn design flow is that simulation and hardware synthesis can both be performed without changing the hardware (Haydn-C) and software (C++) source-code when targeting the RC2000 board. For hardware synthesis, a bitstream and the host application are generated, and both communicate with each other using the PCI bus. Simulation involves linking both hardware and host designs into a single multi-threaded application to simulate the behaviour and communication protocols.
Evaluation
Accuracy. The accuracy of our results is an important criterion for building a posture estimation system. When a posture is compared against a set of known postures, the most useful definition of accuracy is the percentage of correctly identified matches. Image data used in testing the algorithm is a selection of video sequences portraying different postures. The postures are classified as standing, sitting and ly-ing. The test data comprises 2,943 frames totalling 1,830 MB of raw image data. The data are acquired with a stationary Samsung SCC-641 camera and stored in AVI format with Motion JPEG (MJPEG) compression. We chose reference images from the entire data set by selecting typical examples for each posture (standing, sitting and lying). The amount of movement in these images is smaller than that in the test data, so that we can evaluate the tolerance of the algorithm. Table 2 shows the accuracy of the different algorithms. The horizontal projection description returns the most accurate matches for both the standing and the sitting postures. For the lying posture, radial shape is slightly more accurate than projection. Note that the system is usable even with accuracies below 100%, as we can still derive statistical information about the change of posture over time.
Overall, horizontal projection seems to give a higher accuracy for all three postures. This may partly be characteristic to the selected set of three postures. The horizontal projection of a standing person is likely to be significantly different from that of a sitting or a lying posture.
The combined posture description is obtained by comparing the three descriptions and selecting the posture with most support from the algorithms. In other words, a posture that matches lying in two algorithms and sitting in one is considered likely to be the lying posture.
Performance. It is important that, regardless of which posture estimation method is used, it should be able to process frames at a rate equal to or faster than the video source to provide real-time computation. The test data used in this project was captured at a rate of 15 frames per second, a common rate for computer-based applications. Common frame rates vary between 15 and 30 frames per second. The frame size used for this project is 320 × 240 pixels. Table 1 shows eight designs that implement the posture analysis system shown in Fig. 5 , including the projection histogram and radial shape analysis algorithms. Note that frame rate is projected for the design maximum frequency reported by Xilinx tools. The projected frame rate is for data streamed into the processing core, and does not take into account other I/O constraints such as external memory and video input. We exploit the flexibility of FPGA devices to derive different architectures using two types of multipliers (block and LUT multipliers), and different colour depths (12, 24 , 36 and 48 bits per pixel).
The projected video processing speed is calculated taking into account the maximum frequency of a design as reported by Xilinx tools. The projected speed is for the processing core and does not include the speed restrictions of the I/O bus. On the other hand, the software version is able to process a frame of size 320 × 240 at the 8 frames per second on a 3 GHz Pentium-4. As the input video is captured at a rate of 15 frames per second, it is clear that the current software version is not able to perform real-time posture analysis.
In summary, the hardware version running on a RC2000 board executes 106 times faster than the software version on a 3 GHz Pentium-4, and 145 times faster when considering the maximum frequency reported by Xilinx tools.
Conclusion
This paper describes the design and implementation of an FPGA-based architecture for human posture analysis that can be used to monitor and assess the daily activities of home care patients. The techniques applied include the use of multiple visual cues, such as projection histogram and radial shape description, to estimating changes in posture. The accuracy of our results is around 86%, and our hardware implementation at about 80-100 MHz can run from 106 to 145 times faster than a software version running on a 3 GHz Pentium-4 computer.
Current and future work includes refining our architecture and tools. For example, currently the position of the blob is found by analysing the histograms. A more sophisticated object tracking algorithm capable of tracking multiple people can be used instead.
Work is currently under way to develop new efficient designs that analyse blob metrics. In particular, blob radial description is used as a basis for many al- gorithms, such as skeletonization, to measure changes in stride and gait frequency. Such algorithms have been proving to be more useful to track changes in gait compared to simpler posture-based algorithms. The architecture runs fast enough to analyse more image data than one camera could supply, so another interesting improvement would be adding support for multiple cameras. These cameras could work either independently of each other to monitor a large area, or together to improve detection accuracy. Alternatively, further speed/area/power consumption tradeoffs can be examined in order to target low-cost devices while still keeping performance sufficient for real-time use.
