Abstract. Posture analysis is an active research area in computer vision for applications such as home care and security monitoring. This paper describes the design of a system for posture analysis with hardware acceleration, addressing the following four aspects: (a) a design workflow for posture analysis based on radial shape and projection histogram representations; (b) the implementation of different architectures based on a high-level hardware design approach with support for automating transformations to improve parallelism and resource optimisation; (c) accuracy evaluation of the proposed posture analysis system; and (d) performance evaluation for the derived designs. One of the designs, which targets a Xilinx XC2V6000 FPGA at 90.2 MHz, is able to perform posture analysis at a rate of 1,164 frames per second with a frame size of 320 by 240 pixels. This represents a 3.5-fold speedup over optimised software running on a 2.4 GHz AMD Athlon 64 3700+ computer. The frame rate is well above that of real-time video, which enables the sharing of the FPGA among multiple video sources.
Introduction
Computer vision and video processing often involve computationally intensive tasks that need to be applied to data streams in real time, with applications ranging from image-guided surgery, security and surveillance to home-care monitoring.
General-purpose computers can support a wide variety of tasks, but are often too slow or too power hungry for vision applications. FPGAs (Field-Programmable Gate Arrays) provide an attractive alternative: they combine the flexibility of software with a speed approaching that of custom hardware technology. It has been shown that, for selected applications, an FPGA at tens of MHz can run up to 1,000 times faster than a microprocessor with a GHz clock [5, 16], while moving critical software loops into hardware can result in average energy savings of 35 to 70% with an average speedup of 3-7 times, depending on the particular device used [19].
In this paper, we focus on the design and implementation of an FPGA-based architecture for simple posture analysis, which determines if the target is standing up, sitting down or lying on the floor. The goal is to build a smart and independent video-camera system that can monitor the daily activities of home care patients. Instead of using body sensors, we rely on images captured by the video camera to identify multiple visual cues to determine posture. A number of clinical studies have shown that changes in posture and gait can indicate the onset or progress of various diseases, such as early signs of neurological abnormalities linked to dementia [21].
Our method [12] is based on previous work on ubiquitous sensing for managed homecare of the elderly [14] and includes the following four contributions:
1. A design workflow for a posture analysis system, summarising the main processing components (Section 3);
2. System architectures and their implementation targeting a Xilinx XC2V6000 FPGA using a high-level hardware design approach (Section 4);
3. Evaluation of the accuracy of our hardware implementations for three posture estimation algorithms (Section 5.1);
4. Performance evaluation and comparison to software for our hardware implementations (Section 5.2).
The rest of the paper is structured as follows. Section 2 covers the background material and provides an overview of our work. Section 3 describes the posture analysis design workflow, while Section 4 discusses its implementation. Section 5 evaluates the performance and accuracy of our approach, and Section 6 summarises the paper.
Background
Analysis of human motion has in recent years become one of the most active areas of research in the field of computer vision. It provides the opportunity of using an unobtrusive domestic health monitoring system for home-care patients by detecting changes in posture and gait to determine the onset of an adverse event or worsening of an existing condition.
Technique Overview
In this paper we consider a posture analysis system based on a frame-by-frame technique [14] which comprises three stages: blob detection, blob metric generation and posture classification.
Blob detection. A blob describes the shape of an object against a blank background. A common method to extract blobs is to employ image differencing and thresholding, as shown in Fig. 1c and d respectively. Differencing compares an image with a reference (background) frame to see which parts of the image have changed, by taking the difference between the image frame and the reference frame for each pixel in turn. To take into account variations in the background, such as changes in lighting conditions, the reference image can be adapted progressively. Thresholding then generates a 1-bit (binary) image in which pixels with values above the chosen threshold are defined as part of the blob, and those below as the background. Binary images are an ideal representation for blobs since they are fast to process and store. Noise and distortion can be removed using Gaussian filters, as shown in Fig. 1b.
Blob metrics. Once a blob is extracted, we can represent it in a number of ways (Fig. 2). Different representations may reveal different features of the blob shape. Examples of blob representations include:
- Projection histogram representation. Projection histograms are one-dimensional representations that describe the distribution of pixels of an object along the horizontal and vertical axes. Projection histograms can be generated by projecting the binary image on each of the axes. Fig. 2c and d show the horizontal and vertical projections of the blob in Fig. 2a.
- Radial shape representation. Radial shape representation describes the outline of the blob by measuring the distance of the outline from the shape centroid at various angles. The radial shape resembles the blob contour, but the two are not the same, since the shape is represented as a function of the angle rather than of the position along the contour. Fig. 2b shows an example of the radial shape of the blob of Fig. 2a, plotted in polar coordinates to show the resemblance between the shape and the object.
Posture classification. Finally, posture classification determines the posture type (standing, sitting or lying down) for a particular blob target. One common technique to estimate posture is to match a particular blob metric (T) against a set of reference patterns (T_i) and find the instance in the database that minimises the similarity distance between the blob metric and the reference pattern, that is:
$$c = \arg\min_i \, d(T, T_i)$$
where d is the similarity measure for a particular blob representation, i is the instance number of the reference pattern in the posture matching database, and c is the resulting posture type.
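The matching step can be illustrated with a short sketch. The paper does not specify the similarity measure d, so the sum of absolute differences used here is an assumption, as are the function names and the representation of the database as a list of (posture, pattern) pairs.

```python
def l1_distance(a, b):
    """Assumed similarity measure: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(metric, references, distance=l1_distance):
    """Return the posture label of the reference pattern closest to `metric`.

    `references` is a list of (label, pattern) pairs standing in for the
    posture matching database (a hypothetical layout).
    """
    label, _pattern = min(references, key=lambda ref: distance(metric, ref[1]))
    return label


references = [
    ("standing", [0, 1, 5, 1, 0]),
    ("lying", [2, 2, 2, 2, 2]),
]
posture = classify([0, 1, 4, 1, 0], references)
```

Any other distance function with the same two-argument signature can be passed in through the `distance` parameter.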
Previous Work
The recognition and analysis of human motion and activity are active research topics in the field of computer vision [7, 23]. For example, W4 [10] uses shape analysis to track more than one person and to recognise various activities. In this case, posture recognition uses projection histograms as well as silhouette shapes.
Pfinder [24] is a sophisticated system that is in use in many applications. It employs a multiclass statistical model of colour and shape to track the human body. Pfinder is limited to a single-camera, single-person setup, although there is a version which uses a stereo camera to obtain three-dimensional models [2]. The single-person tracking assumption is also made by a number of existing tracking systems [14, 17, 18]. Systems to track multiple people exist, both for isolated people [3, 13] and for people in groups [10].
Some of the more sophisticated systems do not aim for real-time detection. A system containing an automatic calibration scheme and a distributed set of sensors has been proposed which learns common patterns of activity, and can detect patterns that are out of the ordinary [9].
Various hardware implementations based on FPGAs have been proposed. These include systems for applications such as collaborative and reconfigurable object tracking [8], augmented reality [15], and automatic target tracking [22]. However, to the best of our knowledge, the designs proposed in this paper are the first reported FPGA-based implementations for posture analysis.
Design Workflow
The main workflow is shown in Fig. 3. First, a frame is acquired from a video source. Next, a blur filter (BlurBlock) is applied to the frame to reduce noise, as explained in the previous section. A difference filter (DiffBlock) then finds the difference between the blurred frame and a blurred reference image, and a threshold filter (ThreshBlock) is applied to the result to create a binary image. This image is passed through the HistBlock to generate a histogram. Finally, the RadialBlock receives the binary image and the histogram to output the blob metrics (both histogram and radial descriptions). Both descriptions are subsequently matched against reference patterns stored in a database (PostMatchBlock) to estimate the posture type described in that particular video frame.
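As a rough software illustration of the differencing and thresholding steps, the sketch below builds a binary blob image from a frame and a reference image. It assumes single-channel (greyscale) images stored as lists of rows and omits the blur stage; the function name and the use of an absolute difference are assumptions, not details taken from the hardware design.

```python
def detect_blob(frame, reference, threshold):
    """Return a binary image: 1 where the frame differs from the reference
    by more than `threshold`, 0 elsewhere."""
    return [
        [1 if abs(f - r) > threshold else 0 for f, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


reference = [[10, 10, 10],
             [10, 10, 10]]
frame = [[10, 12, 10],
         [10, 200, 10]]
blob = detect_blob(frame, reference, threshold=50)
```

Small differences, such as the pixel that changes from 10 to 12, fall below the threshold and stay part of the background.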
Our approach uses three algorithms for estimating posture: vertical projection, horizontal projection, and radial shape. We describe them in Sections 3.3 and 3.4, and their accuracy is evaluated in Section 5.1.
Note that this system assumes a single-occupancy environment. However, it can be scaled to deal with multiple occupants by employing an object tracking module and then processing each object individually using our system.
Noise Reduction
The purpose of noise reduction in the posture analysis system (BlurFilter) shown in Fig. 3 is to ensure that the binary image representing the blob is as clear as possible (Fig. 1b). In practice an initial low-pass filtering of the image seems to give the best results. Blur filtering usually reduces detail in the image; sophisticated structure-adaptive filters are sometimes used to achieve best results with minimal loss of detail. In the case of the posture analysis workflow, the loss of detail is not a great concern because after thresholding the filtering has minimal effect on the shape of the blob.
Reference Image Updating
Because the image processing system should perform well under a variety of conditions, it should be able to adapt to changing conditions. The actual update rate should be slow enough to adjust to the changing environmental conditions while ensuring that interesting objects that move slowly do not blend into the background. In Fig. 3, the RefBlock filter is responsible for updating the reference image.
There are two parameters that can be combined to provide finer-grain control on how the reference image is updated. The first parameter is the update frequency, which can be adjusted so that the reference image need not be updated with each frame. The second parameter is the weight used to compute the reference image, which is the weighted average between the previous reference image and the latest frame, that is, $R(i,j)_{n+1} = w\,F(i,j)_n + (1 - w)\,R(i,j)_n$ for reference image R, frame F and weight w.
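A minimal sketch of the two update parameters follows, assuming greyscale images stored as lists of rows. The names `update_reference`, `maybe_update` and `period` are hypothetical; `period` stands for the update frequency expressed as "every period-th frame".

```python
def update_reference(reference, frame, w):
    """Weighted average of the latest frame and the previous reference:
    R_next = w * F + (1 - w) * R, applied per pixel."""
    return [
        [w * f + (1 - w) * r for f, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


def maybe_update(reference, frame, w, frame_index, period):
    """Update the reference only on every `period`-th frame."""
    if frame_index % period == 0:
        return update_reference(reference, frame, w)
    return reference


reference = [[100.0]]
frame = [[200.0]]
updated = update_reference(reference, frame, w=0.25)   # [[125.0]]
skipped = maybe_update(reference, frame, 0.25, frame_index=1, period=2)
```

A small weight and a long period both slow the adaptation, which keeps slow-moving objects from blending into the background.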
Projection Histogram
A projection histogram describes the distribution of pixels across the image. The nth element of the horizontal projection histogram is a count of the number of white pixels in the nth column of the binary (blob) image; similarly, the mth element of the vertical projection histogram is a count of the number of white pixels in the mth row of the image. In other words, if T is the binary image, with T(i, j) the pixel in row i and column j, and H_h and H_v denote the horizontal and vertical projection histograms, then:
$$H_h(n) = \sum_i T(i, n)$$
$$H_v(m) = \sum_j T(m, j)$$
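The two projections can be sketched in a few lines, assuming the binary image is a list of rows of 0/1 values; the function name is illustrative only.

```python
def projection_histograms(binary):
    """Return (horizontal, vertical) projection histograms.

    horizontal[n] counts the white pixels in column n,
    vertical[m] counts the white pixels in row m.
    """
    horizontal = [sum(column) for column in zip(*binary)]
    vertical = [sum(row) for row in binary]
    return horizontal, vertical


binary = [[0, 1, 0],
          [0, 1, 1],
          [0, 1, 0]]
horizontal, vertical = projection_histograms(binary)
```

Both histograms sum to the total number of white pixels in the image, which is a convenient consistency check.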
Radial Shape
The radial shape describes the outline of the blob as an array of distances from the centre of the blob over a full rotation around the blob (Fig. 2b). The first step in calculating the radial distribution is to find the blob centroid coordinates $(c_x, c_y)$. This is done by counting all white (blob) pixels in the image and storing the value in sum. Then, we find $c_x$ and $c_y$ so that:
$$\sum_{n \le c_x} \sum_i T(i, n) \approx \sum_{m \le c_y} \sum_j T(m, j) \approx \frac{\mathit{sum}}{2}$$
where $c_x$ and $c_y$ mark the point in the histogram where half of the binary image pixels are on either side, and therefore correspond to the coordinates of the centre of the blob along their respective axis. Once the centroid $(c_x, c_y)$ is computed, we calculate the radial distribution as follows, where T is the binary image:
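The sketch below illustrates one plausible way to compute the centroid and a radial description in software. The centroid follows the half-count rule described above. The radial part is an assumption rather than the formulation used in the hardware: it divides the full rotation into `n_angles` directions and records, for each direction, the distance of the farthest blob pixel whose angle is nearest to it. All function names are hypothetical.

```python
import math


def median_index(histogram):
    """First index at which the running count reaches half of the total."""
    total = sum(histogram)
    running = 0
    for index, count in enumerate(histogram):
        running += count
        if 2 * running >= total:
            return index
    return 0


def radial_shape(binary, cx, cy, n_angles=8):
    """Farthest blob pixel from (cx, cy) for each of n_angles directions."""
    radii = [0.0] * n_angles
    for y, row in enumerate(binary):
        for x, value in enumerate(row):
            dx, dy = x - cx, y - cy
            if not value or (dx == 0 and dy == 0):
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            # Assign the pixel to the nearest sampled direction.
            direction = round(angle / (2 * math.pi) * n_angles) % n_angles
            radii[direction] = max(radii[direction], math.hypot(dx, dy))
    return radii


# A 5 by 5 cross centred on pixel (2, 2).
cross = [[1 if x == 2 or y == 2 else 0 for x in range(5)] for y in range(5)]
cx = median_index([sum(column) for column in zip(*cross)])
cy = median_index([sum(row) for row in cross])
radii = radial_shape(cross, cx, cy, n_angles=4)
```

For the cross, every arm extends two pixels from the centre, so the four sampled directions all report the same radius.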
Hardware Implementation
In this section we describe the Haydn approach [6] (Section 4.1) and how it is used to derive different design architectures for the posture analysis system (Section 4.2). We also provide details of our development and execution tools (Section 4.3).
Haydn Approach
Hardware synthesis tools tend to fall into two distinct approaches: cycle-accurate approach and behavioural approach. Each has its own benefits and drawbacks.
The behavioural approach usually employs a software-based language, such as C, to describe hardware functionality, and provides an annotation facility to guide the scheduling process. The behavioural approach provides several advantages, namely: (1) ease of use for software developers, (2) high productivity for design implementation, and (3) maintainable designs. However, the behavioural approach has a major drawback: hardware synthesis is performed with little human guidance. High-level synthesis often suffers from lack of user control and transparency over the implementation process.
On the other hand, cycle-accurate description languages (such as those based on RTL) give developers more control over low-level implementation details. At this level of abstraction, developers are able to make decisions that would be left to the compiler in a behavioural approach. This allows developers to fine-tune their hardware implementations to achieve an optimal solution. However, cycle-accurate design methodology can have two major disadvantages over high-level synthesis, namely low productivity and poor maintainability, which make it highly ineffective for implementing large designs.
The Haydn approach is unique in that it combines both cycle-accurate and behavioural design methodologies. Developers can opt to use the behavioural approach to rapidly derive a hardware implementation from a high-level design description and constraint annotations. Alternatively, manual intervention can be exerted either at the beginning or at the end of the design cycle to fine-tune a design. We believe that, by combining both models, manual development and computerised optimisations can be interleaved to achieve the best effect.
We have developed the Haydn-C language [6] to support this methodology. Haydn-C is based on the Handel-C [4] language, but contains significant differences, which we enumerate next. First, Haydn-C is a component-based language like VHDL. This makes it easy for importing and exporting library blocks (such as IP cores) and working with other HDL tools. Second, Haydn-C also provides a metalanguage to support source-to-source transformations, additional data structures such as pipelined FIFOs, hardware timers to count the number of cycles for parts of the design, and extended macro capabilities, such as replicators. Most ANSI-C constructs are supported, such as loops, control and assignment statements.
Our hardware design flow is shown in Fig. 4, which performs source-level transformations, cycle-accurate simulation and hardware synthesis for Haydn-C designs. The source-to-source transformation process is guided by annotations in the program that describe design constraints. In particular, the transformation process scans for blocks of code that are enclosed by curly braces and that are annotated with requests for a particular action, such as scheduling. In this case, the block is removed from the rest of the code and analysed, and the transformed code is put back in place of the original code. Developers can immediately synthesise the new implementation, simulate it or perform another transformation, either by manually revising the code or by requesting another computerised optimisation.
The source-level transformation process (Fig. 4) is able to transform both sequential and parallel descriptions by deriving a data-flow graph (DFG) with the unscheduling process. A DFG describes the dependencies between operations. The scheduling process, on the other hand, places operations in a particular time-order without violating program dependencies and user-provided constraints.
Figure 4. This figure illustrates our hardware compilation approach, which performs source-level transformations, hardware synthesis and simulation of Haydn-C designs. Optimisations can be applied to any block of code (enclosed by curly braces) that contains annotations requesting a particular transformation. An example of a user annotation is shown in Listing 1, line 21. In this case a new block with transformed code is placed over the original code after running the source-level transformation process.
Architecture Design
Figure 5 shows the hardware architecture for the proposed posture analysis system. The hardware design is pipelined and contains two coarse-grained stages to maximise throughput. The first stage generates the blob and histogram descriptions, while the second stage computes the radial distribution. Both pipeline stages can work concurrently by storing and accessing two different memory locations alternately. Hence, the design latency is given by the number of pixels in one frame, and the design is subsequently able to output results at a rate of one pixel per cycle.
Listing 1. Non-optimised (sequential) Haydn-C description of the posture analysis system. Lines 5 and 21 instruct the source-level transformation process to replace each block of code with a pipelined description (Fig. 4) that can generate a result every cycle, that is, with an initiation interval (II) of one.
The initial Haydn-C code for the posture analysis design is shown in Listing 1. The code specifies two tasks (lines 3 and 19 respectively) corresponding to each pipeline stage, and the top-level module (line 30). The top-level module instantiates both tasks in lines 34-35. The memindex register value indicates which memory to use in the dual-buffer scheme shown in Fig. 5 . At each frame, Stage1 and Stage2 need to read and write to different buffers. Hence, the memindex register value and its negation are passed as input arguments to both tasks respectively. The par block in line 33 specifies that tasks Stage1 and Stage2 are executed simultaneously. Note that Listing 1 only shows part of the design, and hides details such as variable declarations, definition of the filter blocks, and communication between host and hardware.
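The dual-buffer scheme can be mimicked in software to show how the memindex value alternates between frames. This is only a behavioural sketch in Python, not Haydn-C: the two stages run one after the other inside each frame period rather than in parallel, and `run_pipeline`, `stage1` and `stage2` are hypothetical stand-ins for the two tasks.

```python
def run_pipeline(frames, stage1, stage2):
    """Process frames through two stages that share two buffers.

    In each frame period, stage 1 writes buffers[memindex] while
    stage 2 reads the buffer that stage 1 filled in the previous period.
    """
    buffers = [None, None]
    memindex = 0
    results = []
    # One extra period with no input flushes the last frame out of stage 2.
    for frame in list(frames) + [None]:
        previous = buffers[1 - memindex]
        if previous is not None:
            results.append(stage2(previous))
        buffers[memindex] = stage1(frame) if frame is not None else None
        memindex = 1 - memindex  # swap buffer roles for the next frame
    return results


outputs = run_pipeline([1, 2, 3], stage1=lambda f: f + 1, stage2=lambda b: b * 2)
```

Each result appears one frame period after its input, which mirrors the one-frame latency of the pipelined design.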
To derive different architectures (shown in Table 2) we parameterise the design for different pixel depths by setting the appropriate value in line 1 of Listing 1, and selecting the appropriate multipliers (block or LUT) in the resource table.
The source-level transformation process is guided by user-defined annotations which manage and control the compiler's backend objects, such as the resource table and the scheduler process. Annotations start with an "@" symbol. For instance, the scheduler annotations (lines 5 and 21 of Listing 1) identify parts of the design that need to be optimised. In this case, we specify that we wish to fully pipeline the two blocks of code. Other design configurations can be derived, such as generating a result every n cycles to facilitate resource sharing and multiple video sources. For instance, if the initiation interval is 2, then one can process frames from two sources, where each frame is processed in alternate cycles. This halves the frame rate for each source (assuming one can achieve the same cycle time).
The source-level transformation process supports design exploration at three levels: setting the throughput (through the initiation interval), controlling the resources used by the scheduler and their sharing level, and selecting the bitwidth for operations and expressions.
System Development and Execution
We use the Haydn design flow [6] for generating designs that can run in both hardware and software platforms (Fig. 6). The software module that implements the host is built in C++. The hardware, on the other hand, is described in Haydn-C. A parser converts Haydn-C into Handel-C (HyHC), and we use DK4 capabilities for hardware synthesis and generating a simulation model that runs on a software platform. Simulation involves linking both hardware and host descriptions into a single multithreaded application to simulate behaviour and communication protocols. On the other hand, hardware execution involves generating a bitstream to configure the FPGA, which in turn communicates with the host through the PCI bus.
Figure 5. The pipelined design for the posture analysis system. The design contains two stages that run concurrently using dual-buffers.
Figure 6. This figure shows the Haydn design flow, which performs hardware synthesis, simulation and source-level transformations. The Haydn-C language is used to describe hardware designs, whereas C++ is used to implement the host which runs on a software platform. The hardware synthesis process configures the FPGA device with the posture analysis bitstream and selects the iTools block filter to communicate with the FPGA board. The simulation process, on the other hand, creates an iTools block filter that incorporates the Haydn-C description code for software execution.
The posture analysis system has been developed using iTools (Fig. 7) . This framework has a GUI interface that enables developers to build a system by connecting block filters.
Evaluation

Accuracy
The accuracy of our results is an important criterion for building a posture estimation system. When a posture is compared against a set of known postures, the most useful definition of accuracy is the percentage of correctly identified matches.
The image data used in testing the algorithm is a selection of video sequences portraying different postures. The postures are classified as standing, sitting and lying. The test data comprises 2,943 frames totalling 1,830 MB of raw image data. The data are acquired with a stationary Samsung SCC-641 camera and stored in AVI format with Motion JPEG (MJPEG) compression. We choose reference images from the entire data set by selecting typical examples for each posture (standing, sitting and lying). The amount of movement in these images is smaller than that in the test data, so that we can evaluate the tolerance of the algorithm.
Table 1 shows the accuracy of the different algorithms. The horizontal projection description returns the most accurate matches for both the standing and the sitting postures. For the lying posture, radial shape is slightly more accurate than projection. Overall, horizontal projection seems to give the highest accuracy across the three postures. This may partly be a characteristic of the selected set of three postures. The horizontal projection of a standing person is likely to be significantly different from that of a sitting or a lying posture. The accuracy of the vertical projection, on the other hand, is not very high for the lying posture. It would seem logical to assume that the vertical projection would describe the lying posture in much the same way as the horizontal projection describes the standing posture. One possible reason for the lower accuracy may be the lack of contrast between the person's shirt and the carpet in the image sequence, possibly distorting the blob shape.
The classification process can combine the results of different posture matching algorithms to improve the accuracy of the system by selecting the posture with most support. For instance, if both horizontal and vertical projection algorithms match the standing posture then the classification algorithm can be more confident about the outcome. Note that the system is usable even with accuracies below 100%, as we can still derive statistical information about the change of posture over time.
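One simple way to select the posture with most support is a majority vote over the outputs of the individual matching algorithms. The sketch below is an illustration under that assumption; the function name and the support ratio it returns are not taken from the implemented system.

```python
from collections import Counter


def combine_votes(votes):
    """Return the posture with most support and the fraction of
    algorithms that agree with it."""
    counts = Counter(votes)
    posture, support = counts.most_common(1)[0]
    return posture, support / len(votes)


# Hypothetical outputs of the horizontal projection, vertical projection
# and radial shape matchers for one frame.
posture, agreement = combine_votes(["standing", "standing", "sitting"])
```

The agreement value gives a rough confidence figure: it reaches 1.0 only when all algorithms match the same posture.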
Performance
The test data used in this project are captured at a rate of 15 frames per second, a common rate for computer-based applications. Common frame rates vary between 15 and 30 frames per second. The frame size used for this project is 320 by 240 pixels. Table 2 shows eight designs that implement the posture analysis system described in Section 4. Note that frame rate is projected for the design maximum clock frequency reported by Xilinx tools. The projected frame rate is for data streamed into the processing core, and does not take into account other I/O constraints such as video input and output. We use the Haydn hardware design-flow (Section 4.1) to derive these architectures automatically using two types of multipliers (block and LUT multipliers), and different colour depths (12, 24, 36 and 48 bits per pixel). As expected, different resource and depth configurations provide performance tradeoffs in execution time and area.
We compare the 24-bit (RGB) FPGA version (XC2V6-blk2) against the software versions running on different instruction processors in Table 3, including the AMD Athlon 64 3700+ [1], the Intel Xeon [11] and the Trimedia TM1300 [20]. As one can see, the FPGA design outperforms the Athlon 64 3700+ with a 3.5-fold speedup, while running at a 90.2 MHz clock rate. The software versions have been implemented with full optimisations on all platforms, and frame rate results do not take into account video capturing and rendering.
Table 3. The frame rate is calculated for a 320 by 240 frame using 24-bit pixels. The bandwidth corresponds to the number of bits processed per second for each implementation.
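The reported figures can be cross-checked with simple arithmetic. The sketch below assumes an idealised pipeline that accepts one pixel every initiation interval and has no per-frame overhead, so it gives an upper bound rather than the measured rate; the function names are hypothetical.

```python
def max_frame_rate(clock_hz, width, height, initiation_interval=1):
    """Upper bound on frames per second for a pipeline that accepts one
    pixel every `initiation_interval` clock cycles."""
    cycles_per_frame = width * height * initiation_interval
    return clock_hz / cycles_per_frame


def sources_supported(fpga_fps, source_fps):
    """Number of video sources that could share the processing core."""
    return int(fpga_fps // source_fps)


bound = max_frame_rate(90.2e6, 320, 240)        # about 1,174 frames/s
speedup = 1164 / 330                            # about 3.5
cameras_at_15_fps = sources_supported(1164, 15) # 77
cameras_at_30_fps = sources_supported(1164, 30) # 38
```

The bound of about 1,174 frames per second is slightly above the reported 1,164; the difference presumably reflects overhead that this simple model ignores. Like the projected frame rate, the camera counts ignore I/O constraints such as video input and output.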
Conclusion
This paper describes the design and implementation of an FPGA-based architecture for human posture analysis to monitor and assess the daily activities of home care patients. We show the use of multiple visual cues, such as projection histogram and radial shape description, to estimate changes in posture.
The hardware implementation can run 3.5 times faster (1,164 frames/s) than a software version running on a 2.4 GHz AMD Athlon 64 3700+ computer (330 frames/s). Current and future work includes refining our architecture and tools. For example, currently the position of the blob is found by analysing the histograms. A more sophisticated object tracking algorithm capable of tracking multiple people would be a useful extension.
In the future we intend to develop more efficient designs that analyse blob metrics. In particular, blob radial description is a basis for many algorithms, such as skeletonisation, to measure changes in stride and gait frequency. Such algorithms are more useful for tracking changes in gait than simpler posture-based algorithms.
The architecture runs fast enough to analyse more image data than one camera supplies, so another interesting improvement consists of adding support for multiple cameras. Furthermore, smart video cameras could work either independently of each other to monitor a large area, or together to improve detection accuracy. Alternatively, further tradeoffs, involving metrics such as speed, area and power consumption, can be examined in order to target low-cost devices while still keeping performance sufficient for real-time operation.
