A system-level design methodology is applied to the embedded system design for a typical sensor network application: face detection for security purposes. A tradeoff analysis is performed for hardware and software implementations of the tasks in this application. The best system design is achieved with limited hardware resources.
can be searched in a global database located at the central server for identification purposes (e.g., face recognition). The functionality of the sensor node includes (1) image data collection from the camera, (2) data pre-processing by the embedded system processing unit, and (3) communication and coordination with other sensor nodes for data transmission to the network base stations.
We argue that the architecture and software of the sensor node need to be aggressively co-optimized, so that real-time constraints are met using processors operating at low frequencies, thus minimizing power consumption. The necessary speed increase is obtained by including special-purpose hardware accelerators in the architecture, tuned to the image processing algorithm. As shown in Section 2, these algorithms have significant operation-level parallelism and computational uniformities, which can be exploited by vector processing or a customized ASIC. To identify the best sensor node architecture, we used a hardware-software co-design methodology [9]. The methodology consists of system specification, performance profiling, and hardware-software partitioning. We explored different data path resources for the accelerators, and various vector lengths for the vector processor, and observed the resulting execution speedup.
The proposed sensor architecture differs from other sensor nodes, which employ general-purpose microprocessors or micro-controllers, like SmartDust [17], motes [15], PicoNet [14], and WINS [13, 16]. We propose a customized architecture with higher execution speed than a general-purpose architecture and a minimized silicon cost. Our experiments also showed that systematic exploration is needed, because the best architecture is far from being intuitive.
Section 2 details the face detection application and the image processing algorithm. Section 3 presents the behavioral model for the vector processor. Section 4 introduces the co-design methodology that was used for the sensor unit. Section 5 provides experiments, and finally, our conclusions are offered.
The Application: Face Detection
Identifying human faces in an arbitrary image is a fundamental step in video monitoring for security applications. Face detection algorithms have been widely studied in several disciplines, such as image processing, computer vision, and neural networks. The existing face detection techniques include knowledge-based methods, feature invariant approaches, template matching methods, and appearance-based methods [4]. Knowledge-based methods use general rules to capture the relationship between facial features, for example that the nose is between the two eyes. One disadvantage of the approach is the difficulty in generating general rules from human knowledge. Feature invariant approaches define constant features of a face, and use them as a measure for face detection. For color images, the skin color is also an effective feature. The YCrCb color space can be used in skin detection, and filtering on Cr and Cb can define the skin tone pixels [5]. The template matching approach uses either a predefined template (face, eyes, nose, etc.), or a deformable template, which is an a priori elastic model that describes facial features in parameters. Face detection algorithms for deformable templates are more complex than predefined face template matching algorithms. The last category of face detection methods is appearance-based. Well-known approaches include eigenfaces [6] and neural networks [7].
Since our research was focused on sensor node architecture design, selecting the best face detection algorithm became a secondary issue. Considering its popularity, we used the template matching method. The template is a full human face, and it was obtained from Michigan State University [8]. To improve the face detection result, we also incorporated a skin feature extraction method, which converts the RGB color space into the YCbCr space, and defines the skin tone pixels by selecting Cb, Cr values within a certain region. The face detection algorithm was modelled in SystemC. SystemC is a recent C++-based system specification language, which contains constructs for expressing both hardware and software modules. The application and the sensor node architecture were described in SystemC. In addition, compared to languages like UML, StateCharts, or SpecCharts, SystemC has an abundance of hardware simulation and synthesis tools, which made the architecture development process much easier for us. Figure 1 presents the structure of the SystemC module for face detection, which contains two major modules, one for skin extraction and the other for template matching. In order to observe the algorithm effectiveness, an image I/O module was also specified in SystemC to load and store testing images. Figure 2 shows the pseudocode for the skin extraction and face template matching algorithms. The color space YCbCr was used for skin feature extraction. The RGB to YCbCr color space conversions were defined as:
(1)
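As an illustration of the skin extraction step, the following minimal Python sketch assumes the widely used ITU-R BT.601 full-range conversion coefficients and an illustrative Cb/Cr skin window. Neither the coefficients nor the thresholds (`CB_RANGE`, `CR_RANGE`) are taken from this paper, and all function names are hypothetical.

```python
# Assumed BT.601 full-range coefficients; the paper's own values may differ.
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

# Hypothetical skin-tone window in the Cb/Cr plane (illustrative only).
CB_RANGE = (77, 127)
CR_RANGE = (133, 173)

def is_skin(r, g, b):
    """Return True when the pixel's Cb and Cr both fall inside the window."""
    _, cb, cr = rgb_to_ycbcr(r, g, b)
    return (CB_RANGE[0] <= cb <= CB_RANGE[1]
            and CR_RANGE[0] <= cr <= CR_RANGE[1])

def skin_filter_green(pixels):
    """Keep the green value of skin pixels and set all other pixels to 0."""
    return [g if is_skin(r, g, b) else 0 for (r, g, b) in pixels]
```

Only the green channel is kept here because the later template matching tasks operate on the green color index.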
The face template matching algorithm was implemented by correlating the face template with the image pixels. The absolute correlation coefficient is:
A face template and two experimental images were retrieved from [8] for validating the template matching algorithm. Figure 3 shows the face template, the two original images, and their face template matching results. These results support the correctness of our SystemC specification.
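The template matching step can be sketched as follows. This is a minimal illustration that assumes the standard Pearson form of the correlation coefficient and a hypothetical acceptance threshold; it is not the SystemC specification used in the experiments, and the function names are invented for the example.

```python
from math import sqrt

def correlation(template, window):
    """Absolute Pearson correlation between two equal-size 2-D lists."""
    t = [v for row in template for v in row]
    w = [v for row in window for v in row]
    t_mean = sum(t) / len(t)
    w_mean = sum(w) / len(w)
    cov = sum((a - t_mean) * (b - w_mean) for a, b in zip(t, w))
    t_var = sum((a - t_mean) ** 2 for a in t)
    w_var = sum((b - w_mean) ** 2 for b in w)
    if t_var == 0 or w_var == 0:
        return 0.0  # a flat region carries no shape information
    return abs(cov) / sqrt(t_var * w_var)

def match_template(image, template, threshold=0.9):
    """Slide the template over the image; return (x, y) of accepted windows."""
    th, tw = len(template), len(template[0])
    hits = []
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            window = [row[x:x + tw] for row in image[y:y + th]]
            if correlation(template, window) >= threshold:
                hits.append((x, y))
    return hits
```

Every window position is independent of the others, which is the kind of uniform, data-parallel work that a vector processor or ASIC can exploit.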
Sensor Node Architecture
The smart sensor node is composed of a sensing element, a communication block, and a processing unit, as shown in Figure 4. The sensing element collects data (image, sound, temperature, etc.) from the physical environment, and then sends the data to the preprocessing unit. In our application it is the digital camera. The camera controller controls data access to the digital camera. The communication block includes the coprocessor for the wireless communication protocol, and the RF transceiver, which incorporates analog and RF circuits such as low noise amplifiers, high-frequency filters, sample and hold, multiplexer, oscillators, PLL, and data converters. Communication in a sensor network is different from traditional wireless networks in that it requires cooperation of the sensor nodes to aggregate information and reduce redundancy. Existing communication approaches for sensor networks are (a) localized algorithms, such as directed diffusion [1], in which packets are forwarded between neighboring nodes with direction control, (b) the distributed tracking algorithm IDSQ, which enables sensor node collaboration based on the transmitting cost and resource constraints [2], and (c) mobile-agent-based sensor networks, which use mobile agents with integrated data to reduce the communication bandwidth required for the network [3].
The processing unit is composed of a vector processor, an ASIC coprocessor, and shared memory. The processing unit performs the face detection algorithm. The template of the architecture follows a typical embedded system processor-coprocessor architecture [9], as shown in the left part of Figure 5. However, it does not instantiate the attributes of the architectural blocks, like the length of the vector registers, the number of operations executed in parallel by the vector processor, the size of the RAM memory, and the resource set of the ASIC.
As explained in Section 4, we suggest using a hardware-software co-design methodology for systematic exploration of the architectural attributes, so that the speed of the architecture is maximized, and the silicon area constraint for the architecture is met. The co-design method was applied to partition the face detection algorithm onto the vector processor and ASIC, and hence to identify the architectural resources of the vector processor and ASIC. The experiments in Section 5 show that this process is neither trivial nor intuitive. This embedded system design methodology can be generalized to other sensor network applications, such as environmental and sound monitoring, which have similar functional attributes and performance requirements (timing constraints, massive data preprocessing before transmission to the base station, etc.).
The sensor node must have sufficient processing capability, so that the input image data (set as 128 × 128 pixels per frame) can be processed within the time limit set for this real-time system. To fulfill the performance requirement, we used a vector processor as the core processor for the system. It contains vector registers with many elements (typically 64 elements) and is capable of parallel vector operation. Since the application is not control centric, the vector processor can be implemented as a simplified RISC processor. The vector processor has its own data memory. The co-processor is an ASIC, and its size is limited by the available area on the silicon chip. The shared memory was included for the processor and coprocessor to exchange and store processing data.
For system-level design, the register transfer level (RTL) description is too detailed. Instead, we specified the vector processor at the behavioral level in SystemC. The Synopsys CoCentric development tool was used for the SystemC module design. Figure 6 shows the resulting SystemC modules. The behavioral SystemC specification divides the vector processor unit into modules including instruction fetch, instruction decode, execution, and memory access/write back. A vector operation is assumed to finish in one instruction cycle. Pipelining is omitted to keep power consumption low. An additional instruction memory module was included in the SystemC specification to test the correctness of the vector processor specification. The SystemC model of the vector processor has a complete instruction set, as shown in Table 1.
Processing Unit HW/SW Co-design
The sensor processing unit for the face detection application was designed using a top-down hardware-software co-design methodology. This general design methodology is applicable to various applications. It considers trade-offs among timing, area, and power constraints between different implementations, and provides an optimal partitioning of the application tasks onto the software implementation and hardware implementation. Without such a thorough exploration at the system level at the early stage of the design, the actual implementation may either fail to meet the performance requirement or be over-designed, which is a waste of resources. The hardware-software co-design methodology was adapted from the COSYMA methodology [9]. Figure 7 depicts the co-design flow. It can be divided into the following steps. The first one is co-specification, which specifies both hardware and software components of the sensor node. In Section 2 and Section 3 we have introduced the SystemC specification for our vector processor and the face detection application. Following this, the data profiling step extracts the hardware and software timing and area estimation data for the specification. The next step is co-synthesis, in which the interdependencies between hardware and software components are explored, and the application is partitioned onto hardware and software implementations to meet timing and area constraints. In the co-synthesis process, the system architecture is developed based on hardware-software partitioning, resource allocation, functionality mapping, and scheduling.
Table 1. Vector processor instructions
Logical: AND, OR, NAND, NOR, ANDI, ORI, NANDI, NORI
Arithmetic: ADD, SUB, MULT, DIV, ADDI, SUBI, MULTI, DIVI, ADDV, SUBV, MULTV, DIVV, ADDSV, SUBSV, MULTSV, DIVSV, SUBVS, DIVVS
Load/store: LB, SB, LH, SH, LW, SW, LV, SV, LVWS, SVWS, LVI, SVI, CVI
Control: JSR, RET, BEQZ, BNEZ
Figure 7. Hardware-software co-design flow
Hardware-software partitioning is the central concept, and the primary goal for partitioning is to meet a performance requirement. The secondary goal is to minimize hardware cost. 
Application data profiling
The data profiling procedure starts from the observation of the specification and the division of the code into smaller task blocks. Each block may contain several lines of instructions or an instruction loop. The detailed rules guiding the software timing profiling are: (1) The execution time for a task block equals the number of basic operations this task block contains. (2) The execution time for a loop in the specification equals the number of basic operations inside the loop times the number of iterations. The basic operations are the instructions of the core processor, which are shown in Table 1.
The hardware timing profiling steps are: (1) Given a set of hardware resources, a hardware resource block graph is generated. (2) Based on the hardware block graph and proper operation scheduling, the time steps needed for the task blocks can be estimated.
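The two software timing rules can be sketched in a few lines of Python. The function names and the example loop body are hypothetical; the sketch only assumes, as stated for the behavioral model, that each instruction (including a vector instruction) takes one cycle.

```python
import math

def block_cycles(ops):
    """Rule 1: a straight-line block costs one cycle per basic operation."""
    return len(ops)

def loop_cycles(body_ops, iterations):
    """Rule 2: a loop costs (operations in the body) x (iteration count)."""
    return len(body_ops) * iterations

def vector_iterations(n_elements, vector_length):
    """Loop trips needed to cover n_elements with vector_length-wide registers."""
    return math.ceil(n_elements / vector_length)

# Hypothetical loop body built from Table 1 mnemonics:
# load a vector, scale it, add another vector, store the result.
body = ["LV", "MULTSV", "ADDV", "SV"]
frame_cycles = loop_cycles(body, vector_iterations(128 * 128, 64))
```

For a 128 × 128 frame and 64-element registers the loop runs 256 times, so this illustrative body costs 1024 cycles; halving the register length to 32 elements doubles the estimate.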
The communication timing profiling steps are: (1) Determine the protocol setup time by selecting the proper communication protocol between processing elements. (2) Determine the bus width, then estimate the data transmission time needed for each execution block, and the time needed for data multiplexing. These profiling steps are general to various applications, and many estimation approaches can be found in the literature [10] [11] [9].
Task graph
Following the general data profiling steps discussed above, the face detection algorithm was broken into small execution tasks, and the task graph shown in Figure 8 was extracted to capture the data dependencies between the tasks. A software execution representation and an ASIC execution representation are also shown for each task. The fifteen tasks involved in the face detection algorithm are:
R, G, B: Get the red, green, and blue pixel color information from the image data.
Cr, Cb: Convert the RGB color space into the YCbCr color space, and get the corresponding values of Cr and Cb for each image pixel.
Cr_th, Cb_th: Compare each pixel's Cr and Cb with their thresholds to determine the face and skin region in the image.
G_new: Leaves the green color value of the face and skin region unchanged, and sets other regions in the image to black.
G_mean: Calculates the mean of the green color index over all the pixels in the image.
G_f: Gets the green color index in the face template.
G_f_mean: Calculates the mean of the green color index over all the pixels in the face template.
Cov: Calculates the covariance between each pixel in the face template and each pixel in the image.
Sum: Sums up the covariance in one region as large as the face template.
Th: Compares the correlation coefficient with a threshold to filter out the non-face regions in the image.
Pick: Selects the face region, and marks it with a square.
SW/HW timing and area estimation
During co-design, each of the tasks in Figure 8 can be assigned either to the vector processor or to the ASIC coprocessor. To explore the quality of different hardware-software partitionings, the software and hardware timing and the hardware area estimations were performed for each task. The hardware area refers to the silicon area required for the implementation and is based on the standard 0.25 micron macro cell library provided by SAMSUNG [12] for their embedded system products. The timing estimation was performed by counting the number of instruction cycles for each task. Table 2 shows the vector processor timing estimation in terms of instruction cycles. Table 3 shows the estimated hardware area of the vector processor. Table 4 shows the ASIC performance estimation.
Table 4. ASIC performance estimation
Communication overhead estimation
A handshaking protocol was used for communication among the processor, co-processor, and shared memory. When an application task is assigned to the coprocessor, a corresponding communication overhead is added to the system timing. The communication overhead was defined as the time for data transfer through the shared memory. Since the vector processor has internal memory, no communication overhead needs to be added if two consecutive execution blocks are both mapped to software. The data bus capacity was set equal to the vector register size. Since the whole system is on one chip, the intra-module data transfer time is estimated to be the same as the inter-module data transfer time, and both equal one instruction cycle.
The estimation of the total communication overhead is:
where N is the number of transmissions required, t_prot is the handshaking protocol time to set up the transmission, t_multiplex is the time required to multiplex data bits onto the data bus, and t_unit is the time required to transmit one multiplexed unit (64 × 32 bit). N_result denotes the total number of bits resulting from the execution, and N_bus denotes the bus capacity in bits. N_data is the number of bits in a single execution result. For each execution block, two communication directions are defined: t_com(in) refers to the communication overhead from the previous block to the current block, and t_com(out) refers to the communication overhead from the current block to the next block. Table 5 shows the estimated t_com(in/out).
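As a rough illustration of how the quantities defined above could combine, the sketch below assumes that N = ceil(N_result / N_bus) transfers are needed, that each transfer costs t_multiplex + t_unit, and that the protocol setup time t_prot is paid once. This composition and the numeric values are assumptions made for the example, not the expression used in the experiments.

```python
import math

def comm_overhead(n_result_bits, n_bus_bits, t_prot, t_multiplex, t_unit):
    """Estimated cycles to move one block's results over the shared bus.

    Assumed model: one protocol setup, then N bus-wide transfers, each
    needing one multiplexing step and one unit transmission.
    """
    n_transfers = math.ceil(n_result_bits / n_bus_bits)
    return t_prot + n_transfers * (t_multiplex + t_unit)

# Hypothetical example: a 128 x 128 frame of 32-bit results over a
# 64 x 32-bit bus, with illustrative cycle costs.
cycles = comm_overhead(128 * 128 * 32, 64 * 32,
                       t_prot=2, t_multiplex=1, t_unit=1)
```

Under these assumptions the frame needs 256 transfers, so the estimate is dominated by the per-transfer cost rather than by the protocol setup.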
Hardware-software partitioning
The hardware-software partitioning algorithm uses the simulated annealing heuristic under the guidance of the following cost function [9]:
The terms of the cost function are the hardware timing, the software timing, and the communication overhead. t_ov is the hardware and software overlap time. Each execution block B is used only once in our application, therefore the iteration factor It(B)=0. The partitioning cost function is similar to the COSYMA cost function [9]. However, we implemented the partitioning control parameters external to the cost function to avoid over-design.
Table 5. Communication overhead
Figure 9. Hardware extraction statistics
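The general shape of an annealing-based partitioner can be sketched as follows. This is a simplified illustration, not the COSYMA cost function: it minimizes total execution time under a hard area limit, charges a fixed penalty whenever two consecutive tasks sit on different sides, and uses invented per-task numbers. `TASKS`, `COMM`, and all timing and area values are hypothetical.

```python
import math
import random

# Hypothetical estimates per task: (software cycles, hardware cycles, area).
TASKS = {
    "Cov": (4000, 500, 40),
    "Sum": (3000, 400, 30),
    "Th": (1000, 300, 10),
    "Pick": (500, 450, 25),
}
ORDER = ["Cov", "Sum", "Th", "Pick"]  # tasks execute as a simple chain
COMM = 100  # assumed cycles for each software/hardware boundary crossing

def total_time(in_hw):
    """Execution time of the chain for a given set of hardware tasks."""
    t = 0
    for i, name in enumerate(ORDER):
        sw, hw, _ = TASKS[name]
        t += hw if name in in_hw else sw
        if i > 0 and (name in in_hw) != (ORDER[i - 1] in in_hw):
            t += COMM
    return t

def total_area(in_hw):
    """Hardware area used by the tasks mapped to the co-processor."""
    return sum(TASKS[name][2] for name in in_hw)

def anneal(area_limit, steps=2000, temp=1000.0, cooling=0.995, seed=0):
    """Return the best feasible hardware task set found by annealing."""
    rng = random.Random(seed)
    current = frozenset()  # start from the pure software solution
    best = current
    for _ in range(steps):
        candidate = current ^ {rng.choice(ORDER)}  # move one task across
        if total_area(candidate) <= area_limit:
            delta = total_time(candidate) - total_time(current)
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta < 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if total_time(current) < total_time(best):
                    best = current
        temp *= cooling
    return best
```

Calling `anneal(area_limit=80)` returns a set of task names whose area fits the limit and whose total time is no worse than the pure software mapping; running it for several area limits mirrors the iterative exploration described in Section 5.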
Preliminary partitioning results
Applying the partitioning algorithm, we mapped the tasks in the face detection application onto software and hardware, and achieved a speed-up of about 2.0 compared to a pure software implementation, using an additional ASIC co-processor that contains 1024 adders, 1024 multipliers, 1 divider, and 512 NAND gates. The software timing is 839,952 cycles, while the combined hardware and software timing is t_SW+HW = 407,734 cycles. The best partitioning solution that the annealing process found was to move tasks 11, 12, 13, and 14 to hardware. The hardware extraction statistics in Figure 9 show that these tasks are the most frequently extracted partitioning solutions.
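The reported speed-up follows directly from the two cycle counts. The short check below uses only the figures quoted above; the function name is hypothetical.

```python
def speedup(sw_cycles, hw_sw_cycles):
    """Speed-up of the partitioned design over the pure software design."""
    return sw_cycles / hw_sw_cycles

SW_CYCLES = 839_952      # pure software timing reported above
HW_SW_CYCLES = 407_734   # combined hardware and software timing reported above

print(round(speedup(SW_CYCLES, HW_SW_CYCLES), 2))  # 2.06
```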
Silicon area estimation
The silicon area of the vector processor was estimated to be 54 mm². The co-processor area contains the required hardware area for the extracted execution blocks, which is shown in Table 6. The final hardware extraction solution for the face detection application was to move tasks 11, 12, 13, and 14 to hardware. Therefore, the co-processor area plus the shared memory is about 268 mm² (ignoring the NAND and divider area). The total silicon area is 323 mm². Hence, the silicon implementation of the sensor node requires the largest MOSIS package (PGA391L), which has a die area of 18² = 324 mm². Since the data profile is obtained from an industrial library, the partitioning result provides a practical solution for the system implementation.
Table 6. Partitioned block area
Discussion of Co-design Tradeoffs
The previous section introduced the hardware-software co-design flow. A pre-optimized structure with a typical vector processor and ASIC was obtained. In practical system-level design, careful exploration of the relationship between system cost (silicon area, etc.) and performance requirement (real-time limit) is needed to get the most economic design. Therefore, this section provides an experimental methodology for such exploration. The results show that careful evaluation of the application specifics is important, because the system cost and the speedup in performance are not necessarily contradictory.
ASIC accelerator
The previous implementation (Subsections 4.3 and 4.4) has a much larger ASIC area than the vector processor. To find the best ASIC accelerator with a large speedup/area ratio, we conducted more experiments for different ASIC areas. A data profiling process was performed for each of the cases. The software timing and area estimation remain the same. The ASIC time profiling for each task in the application is shown in Table 7.
As the ASIC silicon area decreases, the execution time does not necessarily increase on the same scale. Therefore, it is possible to obtain a design with smaller hardware area, but similar execution time. Table 8 shows the tasks extracted to hardware, the speed-up compared to a pure software implementation, and the ASIC area for each case. It is shown that Case 2 takes only around half of the hardware area of Case 1, whereas the timing remains almost the same. Hence, it can be concluded that an upper limit exists for the speed-up as the ASIC area increases. Above the limit, additional hardware area no longer improves performance. Comparing Case 3 and Case 4, it can be seen that the number of adders contributes more to the speed-up than the number of multipliers. This is because the tasks in the face detection application have more addition steps than multiplication steps. Thus, allocating more adders results in a higher speed-up. Another advantage of adders is their much smaller area compared with multipliers. Case 5 shows that when the ASIC area is strictly limited, assigning tasks to the ASIC coprocessor may not achieve a significant speedup. In conclusion, Case 3 offers the best ASIC accelerator, with a speed-up of 1.56 at the expense of a small ASIC area (76 mm²) comparable to the vector processor area (54 mm²).
Vector processor design
The vector processor's computational capability is proportional to the number of elements inside the vector register. Since the vector register takes up a large area on the chip, it is necessary to find the minimum number of vector register elements needed for a given software timing requirement.
Since the best ASIC accelerator was found in the last section, this section combines different vector processor structures with the best ASIC accelerator, and Table 9 shows the execution time results. In these experiments, the communication loads are kept in the same range with the same bus capacity. The optimal solution from Table 9 provides the shortest HW+SW timing with minimum silicon area. It can be seen that, when a large ASIC area is possible, the HW+SW timing with a 32-element vector processor can be reduced to the same range as with a 64-element vector processor. This is because when the ASIC area is sufficient, the partitioning algorithm moves most of the data-intensive tasks to the ASIC to meet the timing constraint. This leaves light tasks for the vector processor, and not much timing is gained with vector processing. By iteratively applying the hardware-software partitioning algorithm for different vector processor and ASIC structures, the best system configurations can be found. Finally, as described before, because of data dependencies between operations, an upper limit exists for the system performance, and neither a larger vector register nor a larger ASIC will shorten the timing beyond this limit.
Table 9. Vector processor sizing vs. timing
Vector processor vs. ASIC
The vector processor is a good candidate for data-intensive applications compared to a general RISC processor, because it is capable of parallel computation. The experiments show that when the co-processor ASIC area is strictly limited (for example, in the case of 128 adders and 128 multipliers), the execution speed-up is also reduced, and moving tasks to the co-processor does not offer much speed-up. When hardware area is not the limiting factor, assigning the tasks to the ASIC coprocessor can achieve a significant speed-up, as shown in our experimental results.
Reducing shared memory size
The shared memory in the embedded processing unit stores the intermediate results for the image processing tasks that are mapped to the co-processor. In many image processing cases the information stored in the memory cells is highly repetitive. By eliminating the redundant information, the shared memory size can be reduced. However, this approach sometimes requires recomputing the intermediate results, and may increase the processing time. Careful exploration of the tradeoff between memory size and processing speed needs to be performed to get the optimal implementation.
Conclusion
This paper proposes a sensor node architecture for image processing applications, like infrastructure monitoring and security. The processing unit is customized for face detection applications. A top-down hardware-software co-design methodology was used to design the embedded sensor processing unit. The application data profiling and the hardware/software partitioning are the two most important steps in the co-design process. By iteratively applying the partitioning algorithm for possible ASIC and vector processor configurations, the experiments find the best ASIC accelerator to be a structure with 512 adders and 128 multipliers. With this ASIC accelerator and a 64-element vector processor, the HW+SW timing can be reduced to 538,432 clock cycles. The total area is composed of the ASIC (76 mm²) and the vector processor (54 mm²). Experiments with different vector processor sizes show that the SW+HW timing with a 32-element vector processor is as good as the SW+HW timing with the 64-element vector processor; therefore, further reduction of the silicon area is possible with a smaller vector processor. These experimental results show that it is important to explore the proper system architecture to avoid over-design of the system. The design methodology proposed in this paper is a general system-level design method, which can be applied to various applications and performance requirements.
