Advances in DSP technology create important avenues of research for embedded vision. One such avenue is the investigation of tradeoffs amongst system parameters which affect the energy, accuracy, and latency of the overall system. This paper reports work on benchmarking the performance and cost of Scale Invariant Feature Transform (SIFT) for visual classification on a Blackfin DSP processor. Through measurements and modeling of the camera sensor node, we investigate system performance (classification accuracy, latency, energy consumption) in light of image resolution, arithmetic precision, location of processing (local vs. server-side), and processor speed. A case study on counting eggs during avian nesting season is used to experimentally determine the tradeoffs of different design parameters and discuss implications to other application domains.

Index Terms: embedded vision, system tradeoffs, DSP, SIFT, object recognition

1. INTRODUCTION

While it is intuitive that, in wireless camera sensor networks, local processing of images followed by transmission of processed data is generally more efficient than direct transmission of raw images, there has been little quantitative study of the specific cost and performance tradeoffs characterizing these two approaches. This is due in part to the complexity of vision algorithms and the volume of imaging data that until very recently has been incompatible with the processing, storage and energy constraints of camera sensor nodes. Hence, the emphasis of the research in computer vision has been historically given to enhancing the performance of the algorithms with little interest in improving them under constraints of embedded environments. However, with continuous advances of Digital Signal Processing and computing technology [1], there are new avenues of research to study the application of state of the art computer vision approaches in more computationally-capable camera sensor networks.

In this paper, we empirically study a camera sensor node which uses Scale Invariant Feature Transform (SIFT) [...] identify objects in the environment. The sensor extracts SIFT features from the images and matches them to the features of the object(s) of interest through application of a local Support Vector Machine (SVM). By using SIFT features which are invariant to scale, rotation and substantial range of affine distortion and varying illumination, the sensor node can reliably identify objects in a cluttered or occluded environment. While local SVM requires a training phase, we assume that the SVM on the camera sensor is trained initially and rather focus on the object classification problem.

To study the performance and cost of the SIFT based sensor against various system parameters, we model a camera sensor which consists of a Blackfin DSP processor [3], a low-power CMOS image sensor, an 802.15.4 CC2420 radio and acquisition and processing memory. We have chosen the Blackfin processor due to a portfolio of features including high computation performance to power consumption ratio, hybrid architecture that supports efficient computation as well as control oriented applications (processing images vs. communicating with radio or imager), multiple power states and agile transition among them, and finally, rapid frequency and voltage rescaling to adjust performance during various operation episodes [4, 5]. The CC2420 radio, which is widely used in various sensor network nodes [6], is chosen due to its low power consumption and moderate transmission speed.

Our study is generally divided into two sets of experimentation to benchmark 1) local computation cost and performance and 2) overall camera sensor performance. To benchmark the computation cost, we have implemented the SIFT feature extraction and SVM classification on the Blackfin processor and benchmarked the computation in the VisualDSP++ instruction level simulator. To expedite accurate benchmarking of classification accuracy on test datasets, we ran the same code on a simulation server where many instances of the classification performance are evaluated. Finally, we feed our results to an analytical camera sensor model to evaluate overall node performance under various system parameters.

Through experimentation, we study the sensing accuracy, latency and energy consumption in light of various system parameters [...] SIFT computation with minimum loss in classification accuracy. In addition, although local SIFT classification (i.e., compared to sending the raw images) leads to significant reduction of bandwidth utility in a camera network and has an important benefit in and of itself, we find that in some instances it also leads to reduction in node energy consumption or sensing latency. For instance, the energy consumption of a camera node that performs optimized SIFT classification is less than its energy consumption when it sends raw images of the size that results in the same classification accuracy at a backend server (which runs a standard implementation of SIFT).

Clearly, the optimization of the application-specific parameters is only meaningful in the context of an application. Throughout this paper, our experiments are focused on a case study of counting eggs during avian nesting season. While the numbers are representative of this application, we discuss generalization of the basic tradeoffs to a broad range of application domains.

The primary contribution of this paper is an empirical evaluation of SIFT classification on a model of a Blackfin-based camera sensor. We identify tradeoffs among the key system parameters that affect the sensing accuracy, latency and energy consumption and explore the design space where those parameters can be adjusted to fulfill the requirements of the application. In addition, we investigate the tradeoffs of local processing of the images followed by transmission of the result data vs. transmission of the raw images. A secondary contribution of our work is the implementation of the application and architecture optimized SIFT algorithm on the embedded Blackfin processor, previously considered the exclusive realm of high-end computers.

The rest of this paper is organized as follows. Section 2 describes the camera sensor model. Section 3 outlines various parameters that affect the performance and cost of sensing. Section 4 describes the implementation of the SIFT feature extraction and SVM classification on Blackfin processor. Finally, Section 5 describes the experiments followed by the presentation of the results and discussion in Section 6.

2. SYSTEM MODEL

To evaluate the energy consumption and detection latency, we [...]

Fig. 1. System Model Block Diagram

[...] ADCM-1700 CMOS Camera Module [7]. This module has an I2C control bus and an 8-bit parallel CCIR656-compliant data bus, both of which are connected gluelessly to the CPU, the Blackfin ADSP-BF533.

The Blackfin DSP is at the heart of the camera sensor node platform and performs algorithmic as well as control functions. The processor has a direct peripheral interface to synchronous DRAM memory, in our case, a low-power Micron part [8]. This memory is used as temporary storage for image data captured from the imager, for intermediate data generated by the SIFT algorithm and for the final results. The hashed lines in Figure 1 denote the flow of image and results data from the imager to memory to the CPU and from the CPU to the radio component respectively.

The radio component used is the popular Texas Instruments CC2420 Single-Chip 2.4 GHz RF Transceiver which has been extensively studied in [6]. The radio component receives both the control commands as well as the results from the CPU along the SPI bus. For purposes of our evaluation, we assume a lightly loaded one-hop network to form a baseline for comparing costs of local processing versus image transmission. Channel contention, lossy transmissions and hop delays in loaded and multi-hop networks favor the local processing argument.

In our model, we assume each component in the system spends time and energy in one of a set of distinct states. This approximation is similar to that in [9] and is commonly used to model hardware at the micro-architecture level [10] and full system level [11, 12]. However, our approach is slightly different from [9] by modeling the state of each component rather than the entire system. The full system model is then constructed as a combination of the states of each of the components over the course of time. This feature allows greater flexibility in modeling the operation of the system and makes it convenient for the user to evaluate the effect of changes to individual components.

As a measure of standardizing the taxonomy of states across widely heterogeneous components, we distill the various states into a set of seven basic states. We then map the basic states into component-specific states.

Table 2. System Parameters and System Variables

* [...] interval taken to go between states. In our study, we have found that the only non-negligible time is the wake up delay of a component.
* Data Processing (DP): The data processing state models the component performing some algorithmic or functional task. For the imager, the DP state represents the image capture time, whereas for the radio, it represents the time spent transmitting data wirelessly. The DP time spent by the CPU is the time spent in the SIFT algorithm.
* Data Communication (DC): The data communication state represents the interval when components are exchanging data with each other. The time each component spends in this state depends on the bus interface speed between components.
* Control Processing (CP) and Control Communication (CC): Control processing is the time that each module spends during the configuration whereas control communication is the time it spends exchanging configuration information and commands.
* Idle (I): Idle time is spent in components waiting for each other to complete a task or subtask. In the case of imager and radio, since the CPU co-ordinates their operation, the idle time can be ideally made zero (by waking up at the right instant and staying up for only the time required to perform their task). The CPU itself idles while the imager is capturing an image and the radio is transmitting results data.

In our system we assume a schedule-driven approach, where the sensor wakes up periodically to perform the image acquisition and classification. Figure 2 illustrates the sequence of operations in one round of the node's scheduled activity.

The camera sensor node is initially in the sleep state with each component powered down. Even in its low power mode, the CPU sets a real time clock timer to wake up the CPU on a specific schedule. When the real time clock expires, the CPU (and hence memory) wakes up, boots and carries out perfunctory control processing (CP) tasks. The CPU then communicates with the imager to configure image acquisition (CC block) but both CPU and memory move to the idle state while the image acquisition is in progress. At the end of the acquisition period, the imager stores the image data in the memory (DC block) via the CPU's direct memory access (DMA) and returns back to sleep upon receiving appropriate configuration commands from the CPU.

The CPU then runs the SIFT feature extraction and SVM classification and writes the results back to the memory. These tasks are memory-intensive and the memory load/store operations are intermixed with arithmetic operations running on the CPU's execution units. However, to simplify the illustration we depict memory accesses (DC blocks) distinct from the main processing (DP block). The CPU then wakes up the radio and configures it to transfer the results to the radio (DC block) to be subsequently sent over the radio (DP block).

[...] derive first-order analytical relationships for each term with respect to hardware specific system parameters and algorithm dependent system variables. The top half of Table 2 lists the parameters derived from respective data sheets and the bottom half represents system variables that are varied (as discussed in Section 3) for system evaluation.

Table 3 provides the simplified expressions for significant time intervals used to compute energy consumption and latency. Note that the key arguments in this computation are the CPU cycle count and memory access count, which depend on the algorithmic load as well as the data. To accurately model the system, these values are evaluated from real execution traces using the cycle accurate simulator provided within VisualDSP++. Using this approach, we compare the costs of local classification to image transmission with server-side processing across a number of design variables. Counts derived from execution traces are used to determine times for local computation whereas server-side computation implies $P_{O,cyc} \approx 0$ and $P_{R,size} = P_{I,res}$. Simplified analytical models have the drawback of not capturing subtle higher-[...]

[...] 500 MHz, and 600 MHz. By reducing the frequency of the Blackfin processor, power consumption is reduced, but may ultimately result in increases in the overall latency and energy consumption due to idling components in the system.

3.2. Application designs

Number of Octaves: 1, 2, 3, 4, 5,..., N. Invariance to scale is achieved by consistently finding feature descriptors at the same scale regardless of the scale at which the object is captured. Detection of these features is guaranteed by repeatedly searching at different scales. Depending on the application, it may be possible to reduce the number of octaves (set of scales) searched. Note that the search space is implicitly reduced by the selection of the capture resolution.

Scale space sampling: Direct or Inferred. To broaden the scale space search, the generic SIFT algorithm upsamples the original image and detects features at that octave. We distinguish between the SIFT descriptors extracted through upsampling because the resulting histogram is different than if the scene was captured at the higher resolution and the computational cost is substantially different.

Fig. 3. SIFT breakdown on a 160x160 image for a single octave at two types of precision, fixed-point and floating-point. Here fixed point includes using native bit-width data types, compiler optimizations, and fixed-point implementation of particular functions.

4. BLACKFIN PORT

The compiler influences the amount of effort required to port code to a particular platform; if the compiler is unable to make appropriate platform-specific optimizations, the programmer must re-implement those sections of the code so that they do. But before re-writing parts of the code, the application should be profiled to identify key bottlenecks in the implementation.

In our study, a C++ implementation [13] of SIFT was used. Having tuned the parameters of SIFT for our particular application, the code was compiled and executed on the EZ-Kit Development Board using VisualDSP++. As can be seen on the left side of Figure 3, [...] loops was re-organized to make them more compiler-friendly as discussed in [14]. Together these allow the compiler to make more efficient use of dedicated hardware resources such as MAC units.

A second optimization replaced math related function calls with fixed-point alternatives. Two functions, arctan() and sqrt(), were replaced with a well-known function approximation and LUT techniques, respectively, with errors in the order of 0.06 radians and 5%, respectively.

The effects of our changes on algorithmic accuracy are empirically evaluated during experimentation on real data sets (see Section 5). An important point to note is that for robustness, SIFT quantizes keypoint orientations and descriptors by binning them at certain granularities; consequently certain er[...]

In conclusion, we see that given reasonable compiler support such as the one found in VisualDSP++, C++ code can be relatively easy to optimize without having to deal with assembly. Also, since the number of features is application dependent, it might be the case that optimizations should be more focused at the descriptor stage. For this reason it is important to profile the application in order to guide further optimization that could further increase the benefits of local computation.

5. RESULTS

In this section, we describe the experiments used to evaluate the tradeoffs between accuracy, energy, and latency. These findings will be used to propose pairings between system design and application domain in the following section. The energy consumption of running SIFT on the node is modeled as:

$E_{total} = E_{fixed} + E_{variable}$

$E_{fixed}$ represents a fixed cost given input resolution of image [...] Figure 4 shows that the traditional SIFT algorithm which scans the entire scale space (all octaves) consumes more energy than transmitting the image for both the floating point and fixed point implementations, and that this difference is exacerbated as images get larger. Additional energy will be consumed by searching for and computing the SIFT features. If the generic SIFT algorithm is needed, transmitting the image and processing at the server-side is better in terms of accuracy, latency, and power.

Accuracy under Fixed-Point. As described in Section 4, fixed-point arithmetic would allow the Blackfin DSP to utilize processor specific optimizations, reducing computation by an order of magnitude. However, this energy/latency reduction comes at the cost of accuracy. In Table 4 [...] distance of the estimated count from the true count. The images are subdivided by the number of eggs in the image. The average error is shown in Table 5, showing that on average error increases imperceptibly when fixed-point arithmetic is used. It shows that while the inferred sampling has higher recall than direct sampling, it also has lower precision. While more SIFT features are classified as eggs, more non-eggs are [...]

Table 5. Recall/Precision on egg detection using different implementations. Expected difference from true egg count over a nesting cycle is also given.

[...] Figure 8. The best energy consumption and latency for transmitting an image is at the lowest CPU frequency. This is due to the fact that the radio dominates in this situation. On the other hand, local processing achieves the best results at the highest frequencies. Under any resolution, there exists a CPU frequency in which it is more efficient in terms of latency to process locally. The [...]

[...] recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF or the ONR.
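As a rough illustration of the per-component state model of Section 2 and the local vs. server-side comparison of Section 5, the following minimal Python sketch sums power times time over every (component, state) pair for one round of activity. It is a sketch under stated assumptions, not the model used in the paper: the function names, the per-state power values, the 16-byte result size and the SIFT cycle counts are all hypothetical placeholders, and only the DP and I states of the CPU and radio are modeled. The 250 kbit/s figure is the nominal 802.15.4 data rate at 2.4 GHz, ignoring protocol overhead.

```python
# Illustrative sketch only: every power, size, and cycle number below is a
# hypothetical placeholder, not a value taken from the paper's Table 2.

RADIO_RATE_KBPS = 250.0  # nominal 802.15.4 rate at 2.4 GHz, no overhead


def tx_time_ms(payload_bytes, rate_kbps=RADIO_RATE_KBPS):
    """Time the radio spends in its DP state sending payload_bytes."""
    return payload_bytes * 8 / rate_kbps  # bits / (kbit/s) = ms


def cpu_time_ms(cycles, freq_mhz):
    """Time the CPU spends in its DP state executing `cycles` cycles."""
    return cycles / (freq_mhz * 1e3)  # cycles / (cycles per ms) = ms


def round_energy_uj(power_mw, time_ms):
    """Sum P * t over each (component, state); mW * ms = microjoules."""
    return sum(
        power_mw[comp][state] * t
        for comp, states in time_ms.items()
        for state, t in states.items()
    )


# Hypothetical per-state power draw in mW for each component.
POWER_MW = {
    "cpu": {"DP": 150.0, "I": 30.0},
    "radio": {"DP": 60.0},
}


def local_round(sift_cycles, freq_mhz, result_bytes=16):
    """Classify on the node, then transmit only the small result."""
    return {
        "cpu": {"DP": cpu_time_ms(sift_cycles, freq_mhz)},
        "radio": {"DP": tx_time_ms(result_bytes)},
    }


def remote_round(image_bytes):
    """Transmit the raw image; the CPU idles while the radio sends."""
    t = tx_time_ms(image_bytes)
    return {"cpu": {"I": t}, "radio": {"DP": t}}


if __name__ == "__main__":
    image_bytes = 160 * 160  # 8-bit grayscale image, 160x160 pixels
    e_remote = round_energy_uj(POWER_MW, remote_round(image_bytes))
    for cycles in (50e6, 500e6):  # hypothetical SIFT cycle counts
        e_local = round_energy_uj(POWER_MW, local_round(cycles, 600.0))
        print(f"cycles={cycles:.0e}: local={e_local:.0f} uJ, "
              f"remote={e_remote:.0f} uJ")
```

With these placeholder numbers, the smaller cycle count makes local classification cheaper than sending the raw image and the larger one makes it more expensive, which mirrors the qualitative tradeoff between optimized and generic SIFT discussed in Section 5. Wake-up, control and memory states would be added as further entries in the same dictionaries.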
