Abstract-Wireless Sensor Networks (WSNs) are a key technology for future social and industrial developments. With increasingly complex applications, the compute demands for in-node / in-network processing have been growing steadily, while the capabilities of energy-harvesting or -storage systems have advanced only slowly. We present the Hardware-Accelerated Low Power Mote (HaLoMote), a heterogeneous system architecture for a wireless sensor node that achieves significantly better energy efficiency than traditional approaches, even for demanding applications requiring sensor sample rates of hundreds of Hertz. The paper discusses an evolution of the node hardware architecture and details the implementation of a signal processing chain required for Structural Health Monitoring (SHM) applications. The measurement accuracy of the WSN-based data acquisition system is compared against a wire-bound laboratory system, showing that the dominant eigenfrequencies of a monitored structure can be detected with less than 1 % error. Furthermore, the runtime and power requirements of the HaLoMote are compared against various software processors typically used for conventional WSN architectures. The HaLoMote is shown to be 2.3 times more energy-efficient than a state-of-the-art ARM Cortex-M3 based device.
I. INTRODUCTION
While heterogeneous computing architectures have been employed successfully in high-performance server, desktop, and even embedded systems such as mobile phones, they have seen only limited use in highly energy-critical settings, such as WSN nodes. But the increasing sophistication of WSN applications is accompanied by a corresponding demand for in-node computation, communication, security, and availability, which often conflicts with the still-limited energy supply. To meet the timing requirements, low-power microcontrollers (MCUs) (as used on the TelosB or Mica2 mote) are replaced by more powerful processors (e.g., 32 bit ARM on Imote2) at the cost of an increased power consumption in active mode (>100 mW) and longer wake-up times from deep sleep modes (>100 ms). Active power management on these nodes thus can only occur when measurements are completely suspended. However, improvements in energy storage and harvesting are made only slowly, and do not keep up with the growth in node capabilities. Dynamic Power Management (DPM) and in-network data aggregation are well known strategies to achieve energy efficiency, but they become hard to realize at higher sampling frequencies.
As a more energy-efficient approach to homogeneous node architectures, we propose the use of a heterogeneous system architecture to handle even complex WSN scenarios. The architecture is able to handle compute-intensive tasks (e.g., data preprocessing) as well as long-term low-intensity operations (e.g., the RF communications protocol). To this end, our HaLoMote architecture employs a low-power Field Programmable Gate Array (FPGA), acting as Reconfigurable Compute Unit (RCU), for the first use, and a small 8 bit MCU with integrated RF functionality for the latter.
The architecture was designed for applications in which multiple sensors are sampled by each mote at several hundred hertz, thus producing large amounts of data that have to be preprocessed and aggregated. Such applications include, e.g., the condition monitoring of vibrating machinery [11], vibration-based SHM [13], acoustic object localization [1] and video surveillance [20]. For such high-data rate applications, node-local processing can be used to reduce the necessary wireless communication bandwidth by techniques such as application-specific feature extraction (e.g., modal properties, object location), or generic data aggregation (e.g., lossless compression). But even low-data rate applications, such as environmental and industrial monitoring, may require more intense processing (e.g., for encryption protocols). The HaLoMote hardware accelerator is suitable for both use-cases.
While prior attempts at using reconfigurable computing in wireless sensor nodes have been made, their success has only been limited, as we will discuss in Section II. We present the evolution of the HaLoMote in Section III, while Sections IV and V detail an SHM application and its hardware-accelerated implementation. In Section VI, the accuracy of the proposed wireless data acquisition system is compared against a wire-bound laboratory system. Furthermore, the energy required per acquired sample is compared between the HaLoMote and various software processors.
II. RELATED WORK
WSNs have become popular for environmental monitoring in the last decades. Several research groups started to equip various bridge structures with wireless sensor networks ranging from small models with artificial excitation [3] over medium size pedestrian bridges [8] up to large size traffic bridges [5], [6], [17], [18], [23]. The number of involved sensor nodes ranges from 8 [8] to 70 [6]. The sampling rates and synchronization accuracy required for monitoring the distributed vibration of large structures are much harder to realize than monitoring slowly changing temperature or humidity. As the power supply of the sensor nodes is limited and the radio transceiver typically is the major power consumer of a sensor node, sensor data aggregation is required to reduce the overall communication and power requirements and thus extend the network lifetime or maintenance intervals. Very few of the related research projects actually support this feature, such as [6], [8]. Both execute complex in-sensor computations such as Frequency Domain Decomposition, or Stochastic Subspace Identification of a Filtered Hilbert-Huang Transformation on an XScale ARM processor. Compared to the traditional MCUs typically used in wireless sensor platforms, the XScale processor is very power-hungry, thus depleting even a relatively large 21 Ah battery in less than two months [6].
FPGA-based RCUs can perform complex computations more efficiently than MCUs and Digital Signal Processors (DSPs) [14]. In real-time applications, the use of RCUs often enables computations that cannot be performed by MCUs or DSPs at all under the given constraints. This made them attractive for use in sensor nodes performing compute-intensive applications (e.g., video and image compression) [4], [15], [27]. However, none of these systems could achieve truly low power operation: They all relied on FPGAs using Static RAM (SRAM) for configuration storage, which thus could not be powered down completely without losing the configuration data itself.
When energy actually becomes a first-class design goal, Flash-based FPGAs are far more suitable [21] for the RCU. Sensor nodes using a combination of a Flash-based Microsemi Igloo FPGA and a wireless transceiver have already been proposed [22], [25], [26]. However, despite the power advantages of Flash configuration storage, these architectures also turn out to be sub-optimal: All processing is performed on the RCU (even long-term low-intensity tasks), and when powered down, the radio transceiver is required to wake up the FPGA again. Thus, at least one of the two power-hungry devices has to be enabled all the time. Refinements which use very simple timekeeping on the RCU, employing inverter ring-based oscillators [16], to only power up the receiver periodically at pre-agreed times for data reception, are still sub-optimal: Due to the large timing inaccuracy (drift) of these oscillators, the power-down windows have to be shortened conservatively, leading to the system drawing higher power for longer intervals.
A better choice is a heterogeneous architecture combining an RCU and a low-power MCU. The Cookie WSN [24] and the PowWow Mote [2] have joined a small Microsemi Igloo FPGA with a TI MSP430 MCU and an additional radio transceiver. However, both systems utilize the FPGA only for low-level handling of radio messages, instead of preprocessing the sensor data stream. Furthermore, the use of discrete MCU and RF components carries the burden of slower communication as well as more complex power management.
III. HALOMOTE HARDWARE ARCHITECTURE
To overcome the deficits of the WSN-architectures described in the last section, we proposed the HaLoMote architecture. In this section, we present its evolution driven by experiences gained in different applications.
The basic architecture of the original HaLoMote is shown in Figure 1. It heterogeneously combines software-programmable and reconfigurable compute units. The MCU handles less complex computations, such as the radio protocol and basic (yet precise) time-keeping. It is integrated with the RF components into a single System-on-Chip (SoC) and can directly access the human-machine interface (HMI) peripherals (e.g., LEDs) on the mainboard. The RCU is realized as a discrete FPGA based on non-volatile memory (NVM), allowing for deep sleep modes with fast shutdown and wakeup times as well as a very low static power draw. External sensors and additional memories are connected to the RCU to support the efficient preprocessing of the sampled data stream. Only the aggregated results are transferred to the MCU for RF transmission into the network.
For the first realization of the architecture, a Microsemi IGLOO AGL1000 FPGA and a TI CC2531 RF-SoC have been used and application-specific sensors and memories were flexibly attached using expansion headers. While it already exceeded the performance and power efficiency of homogeneous systems for real applications [13], the practical experiences gained identified a number of design weaknesses. This led to design improvements for a second implementation of the architecture, shown in Figure 2.
Detailed power profiling revealed that the 8051-based MCU component of the CC2531 RF-SoC required significant energy even with the RF transceiver completely shut down: With the control software on the MCU just initiating RCU operations (i.e., sensor sampling and data accumulation) at 128 Hz, the RF-SoC consumed between 34 % and 48 % of the overall system energy (depending on the actual load at the RCU) [13]. Furthermore, more than half of the active time of the MCU (31 µs out of 58 µs) was spent waiting for the clock source of the RCU to become stable after waking it up. This led to a more refined clocking scheme which allows the MCU to provide an auxiliary clock for the RCU until the main oscillator has started up.
In the second version, the TI RF-SoC itself was replaced by a more recent Atmel ATmega256RFR2 device, which not only has more RF throughput, but also more General Purpose Input/Output (GPIO) pins for communicating with the RCU. Furthermore, it can be operated with just a 1.8 V supply (instead of the 2.5 V used for the CC2531), thus allowing for more efficient switching regulators.
We excluded the power and area-consuming HMI-peripherals from the mainboard. While they can still be attached for debugging purposes, unused peripheral pins are now used to increase the communication bandwidth between MCU and RCU.
Most monitoring applications require a significant amount of external memory. By directly integrating four 1 Mbit serial SRAM devices on the mainboard, less demand was imposed on the expansion headers (now 40 pins, down from 148). This significantly shrunk the overall system size from 100 mm × 62 mm down to 46 mm × 30 mm (see Figure 3). We chose multiple serial memories instead of a single parallel memory to enable parallel, independently addressed memory accesses by the RCU. Furthermore, we can now selectively replace one or more SRAMs with pin-compatible non-volatile Ferroelectric RAMs (FRAMs) for even more aggressive power management. FRAM was chosen as persistent data storage as it clearly outperforms Flash-based memory in terms of write performance (i.e., access time, granularity and power consumption) [19]. In particular, the MB85RS1MT modules can be written at 25 Mbit/s while drawing only 17 mW from the 1.8 V supply rail.
Finally, the new implementation allows over-the-air reconfiguration of the RCU by exposing the FPGA JTAG interface to the MCU. This comes, however, at the cost of an additional 3.3 V regulator responsible for providing the higher programming voltage.
IV. RANDOM DECREMENT TECHNIQUE FUNDAMENTALS
Civil infrastructures such as bridges are prone to fatigue and other load-induced damage. With increasing age, inspection intervals have to be scheduled more frequently to assure the safe operation of the infrastructure. Costly manual inspections thus have to be complemented by automated SHM. By periodically observing modal properties such as the eigenfrequencies, mode shapes, or the damping of the structures, damage or fatigue can be identified by significant deviations of these properties from reference measurements [9].
Modal parameters of an object are typically derived by observing its response to a well defined artificial excitation. While large structures can be excited by appropriate equipment, a significant amount of energy is required to drive such large shakers. Furthermore, the excited structures often have to be taken out of service to assure safety and measurement accuracy. For continuous automated SHM, the identification of the modal properties thus has to be based on the natural ambient vibration of the structure caused by wind or traffic. The major challenge of this Operational Modal Analysis (OMA) is the separation of the random components of the observed signals caused by the unknown excitation from the structure's actual response to this excitation. The Random Decrement Technique (RDT) was proposed for this purpose [7] .
For the RDT, a set of sensors $S \subset \mathbb{N}$ is distributed all over the structure to acquire its vibrations in terms of acceleration or deflection as time series $(x_s : T \mapsto V)_{s \in S}$. For a finite sampling rate and measurement duration, the time domain is also finite and discrete, i.e., $T = \{0, \ldots, n_t - 1\} \subset \mathbb{N}$. For simplicity, $V \subset \mathbb{R}$ can be assumed by abstracting from the finite measurement accuracy. As the OMA aims for the dynamic characteristics of the observed structure, the static components of the acquired signals (gravity or prestress) have to be eliminated by a high-pass filter, e.g., by applying a Finite Impulse Response (FIR) filter
$$\tilde{x}_s(t) = \sum_{k=0}^{n_f} c_k \cdot x_s(t-k) \quad \forall (s,t) \in S \times T$$
of order $n_f$ with $n_f + 1$ appropriate coefficients $c_k \in \mathbb{R}$.
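The high-pass FIR filter described above can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware implementation: the helper `highpass_coeffs` and its coefficients (a unit impulse minus a moving average) are assumptions chosen so that a constant offset such as gravity is removed, and samples before the start of the record are assumed to be zero.

```python
from collections import deque


def fir_filter(samples, coeffs):
    """Direct-form FIR: y[t] = sum_k coeffs[k] * x[t - k], with x[t] = 0 for t < 0."""
    # Tap buffer holding the most recent len(coeffs) samples, newest first.
    taps = deque([0.0] * len(coeffs), maxlen=len(coeffs))
    out = []
    for x in samples:
        taps.appendleft(x)  # the oldest tap is dropped automatically
        out.append(sum(c * v for c, v in zip(coeffs, taps)))
    return out


def highpass_coeffs(order):
    """Hypothetical coefficients: unit impulse minus a moving average."""
    n = order + 1
    coeffs = [-1.0 / n] * n
    coeffs[0] += 1.0
    return coeffs


coeffs = highpass_coeffs(order=8)    # 9 coefficients, illustrative only
signal = [1.0] * 20                  # constant offset, e.g. gravity
filtered = fir_filter(signal, coeffs)
```

Once the tap buffer is filled, the constant input is suppressed because the illustrative coefficients sum to zero; a real deployment would use properly designed coefficients for the desired cut-off frequency.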
To eliminate the random signal components, the RDT selects a subset of the sensors as references $R \subseteq S$ and a trigger level $l_r \in V$ for each reference $r \in R$. The points in time $t \in T$, at which a reference signal $\tilde{x}_r$ crosses $l_r$, are referred to as trigger events
$$E_r := \{ t \in T : (\tilde{x}_r(t) \ge l_r \wedge \tilde{x}_r(t-1) < l_r) \vee (\tilde{x}_r(t) < l_r \wedge \tilde{x}_r(t-1) \ge l_r) \} \quad \forall r \in R.$$
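The level-crossing condition that defines a trigger event translates directly into code. The following sketch is a software illustration only; the function name `trigger_events` and the sample values are hypothetical.

```python
def trigger_events(filtered, level):
    """Return all indices t at which the signal crosses `level` in either direction."""
    events = []
    for t in range(1, len(filtered)):
        rising = filtered[t] >= level and filtered[t - 1] < level
        falling = filtered[t] < level and filtered[t - 1] >= level
        if rising or falling:
            events.append(t)
    return events


signal = [0.0, 0.5, 1.2, 0.9, 1.1, 1.5, 0.3]   # made-up filtered samples
events = trigger_events(signal, level=1.0)
```

Both rising and falling crossings are reported, which is what later allows the velocity response to cancel out during accumulation.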
A signal window $(\tilde{x}_r(t+k))_{0 \le k < n_w}$ of fixed length $n_w \in \mathbb{N}$ starting at a trigger event is composed of the structure's response to its initial displacement $\tilde{x}_r(t) = l_r$, its response to the initial velocity and the random ambient excitation, as shown in Figure 4. Assuming a zero-mean excitation, the random components cancel out when accumulating a sufficient number of these triggered windows. The velocity response will also be eliminated, as each rising signal edge (with positive initial velocity) is followed by a falling signal edge (with negative initial velocity). Thus, the accumulated signal windows converge to the displacement response, which describes the structure's free decay and can thus be used to derive its modal properties.
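To estimate the mode shapes of the structure, spatial correlations between different sensor positions are required. Therefore, signal windows from all sensors are accumulated for each trigger event, resulting in $|S| \cdot |R|$ correlation functions
$$D_{s,r}(k) = \sum_{t \in E_r} \tilde{x}_s(t+k) \quad \forall (s, r, k) \in S \times R \times \{0, \ldots, n_w - 1\}.$$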
Finally, the accumulated functions must be normalized by the number of detected trigger events, the trigger level and the standard deviation of the reference signals:
The normalized correlation functions $\hat{D}_{s,r}$ are the input of the subsequent modal analysis, which is performed on a central gateway and is thus not covered in this paper. In addition to eliminating the random parts of the sampled signals, the RDT aggregates the $|S| \cdot |T|$ raw sensor samples down to $|S| \cdot |R| \cdot n_w$ correlation samples. The compression factor $\frac{|T|}{|R| \cdot n_w}$ increases with the measurement duration. $n_w$ is typically chosen such that the correlation functions show the free decay of the structure, which may take several seconds for large bridges. For a measurement duration of several hours, which is required to collect a sufficient number of trigger events, the compression factor typically exceeds two orders of magnitude. This is the major benefit of the RDT for the distributed WSN implementation of SHM applications.
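To make the accumulation and the resulting data reduction more concrete, the following sketch sums the signal windows of all sensors for each trigger event and evaluates the compression factor. It is a minimal software model under stated assumptions: the function names, the dictionary-based data layout and the choice to skip windows that run past the end of the record are illustrative, and the normalization by trigger level and standard deviation is deliberately left out.

```python
def accumulate(signals, references, events, n_w):
    """Sum signal windows of length n_w over all trigger events.

    signals:    dict mapping sensor id -> list of filtered samples
    references: iterable of reference sensor ids
    events:     dict mapping reference id -> list of trigger times
    Returns a dict mapping (s, r) -> (window sum, number of windows used).
    """
    result = {}
    for r in references:
        for s, x in signals.items():
            acc = [0.0] * n_w
            used = 0
            for t in events[r]:
                if t + n_w <= len(x):  # assumption: skip incomplete windows
                    for k in range(n_w):
                        acc[k] += x[t + k]
                    used += 1
            result[(s, r)] = (acc, used)
    return result


def compression_factor(n_t, n_refs, n_w):
    """Raw samples per sensor divided by correlation samples per sensor."""
    return n_t / (n_refs * n_w)


# Tiny made-up example: two sensors, sensor 1 acts as the reference.
signals = {1: [0.0, 1.0, 2.0, 3.0, 4.0, 5.0], 2: [5.0, 4.0, 3.0, 2.0, 1.0, 0.0]}
sums = accumulate(signals, references=[1], events={1: [1, 3]}, n_w=2)

# One hour at 400 Hz with two references and 256-sample windows.
factor = compression_factor(n_t=400 * 3600, n_refs=2, n_w=256)
```

For the one-hour example the factor is already above 2800, which illustrates why the reduction grows with the measurement duration.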
However, the RDT increases the demand for in-sensor preprocessing. The computational complexity of the RDT is dominated by the FIR filter and the memory accesses required for the accumulation of the correlation functions. The latter becomes particularly complex if the correlation functions cannot be stored in the few kilobytes of RAM provided by most WSN processing units, thus requiring access to external memory. The RDT preprocessing scales linearly with the number of sensor channels to be processed by each sensor node. Each sensor typically provides three channels to capture the multidimensional movement of the structure, and multiple nearby sensors may be connected to a single sensor node to simplify the deployment. Thus, assuming three to twelve sensor channels per mote is not unrealistic. As the sensor channels can be processed independently of each other, most of the RDT preprocessing can be parallelized. The next section details the RDT implementation on the HaLoMote RCU.
V. HARDWARE ARCHITECTURE FOR RDT
Figure 5 shows the sequential FIR implementation requiring one multiplier per sensor channel. The FIR taps are buffered in a Block RAM (BRAM) and the filtered value is passed to an additional First In, First Out (FIFO) buffer, which is integrated into the same BRAM as the FIR taps. The additional delay between the filtering of a sample and its further processing is required to ensure that trigger events captured at other sensor nodes can be flooded over the entire network [10].
Figure 6 shows the computational logic required for each sensor channel. A sensor-specific control module requests the samples over the digital sensor interface. Although most of the control lines of the sensor interfaces (e.g., SPI or I²C) could be shared between multiple sensors, each sensor channel is controlled by a dedicated interface to allow for parallel independent sensor sampling. The samples are filtered and delayed as described above. The filtered samples are used to detect trigger events by comparing the current and the last value against the trigger level to drive a generate flag. Furthermore, the filtered values and their squares are accumulated as sx and sxx to derive the standard deviation of the sensor channel. Both the trigger event detection and the standard deviation calculation are only required for sensor channels configured as RDT references. To simplify the network configuration, the reference-specific hardware is provided for each sensor channel and the software processor decides which of the results to use for further processing.
Figure 7 shows the module used for accumulating the delayed samples to a correlation function stored in BRAM. External trigger event specific logic decides whether and which memory position to modify. Additional clear logic is required to initialize the correlation functions at the start of each measurement.
Figure 8 shows the handling of trigger events for a specific reference channel. This logic is required at all sensor nodes, not only at the sensor node sampling the reference signal. A trigger event is characterized by its age, i.e., the number of sampling cycles since it was generated by the logic shown in Figure 6. Trigger events older than the length of the delay FIFO (shown in Figure 5) cause an accumulation of the output of the delay FIFO to all correlation functions corresponding to the reference channel that generated the trigger event. Trigger events are removed after $n_w$ accumulations.
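The sx and sxx accumulators mentioned for the reference channels are sufficient to derive a standard deviation without storing the samples. The sketch below shows one way this could be done in software; the class name `ChannelStats` and the use of the population (not sample) standard deviation are assumptions, since the exact post-processing on the MCU is not spelled out here.

```python
import math


class ChannelStats:
    """Running sums of the filtered values (sx) and their squares (sxx)."""

    def __init__(self):
        self.n = 0
        self.sx = 0.0
        self.sxx = 0.0

    def update(self, x):
        self.n += 1
        self.sx += x
        self.sxx += x * x

    def std(self):
        """Population standard deviation derived from the two sums."""
        mean = self.sx / self.n
        variance = self.sxx / self.n - mean * mean
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


stats = ChannelStats()
for value in [1.0, -1.0, 1.0, -1.0]:   # made-up zero-mean samples
    stats.update(value)
```

Keeping only two sums per channel maps well onto a pair of hardware accumulators, with the division and square root deferred to the software side.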
The main difficulty of the trigger event management is that the sequence of event insertions does not necessarily have to match the sequence of event removals. For example, a trigger event generated at the local node will be inserted immediately with an age of 0. In the next sampling cycle, another trigger event generated at a remote node may arrive that already traveled for 10 sampling cycles on a multi-hop path to the local node. This second trigger event will thus be inserted after the first event, but it will be removed before the first event.
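The out-of-order insertion and removal can be illustrated with a rotate-through queue in which every event is dequeued, aged and re-enqueued once per sampling cycle. This is a software sketch only: the class `TriggerQueue`, the event identifiers, and the exact off-by-one convention (an event starts accumulating once its age equals the FIFO delay) are assumptions made for the illustration.

```python
from collections import deque


class TriggerQueue:
    """Rotate-through queue of (event id, age) pairs for one reference channel."""

    def __init__(self, fifo_delay, n_w):
        self.fifo_delay = fifo_delay  # length of the delay FIFO in sampling cycles
        self.n_w = n_w                # window length in samples
        self.events = deque()
        self.removed = []             # ids in order of removal, for illustration

    def insert(self, event_id, age=0):
        self.events.append((event_id, age))

    def step(self):
        """Process one sampling cycle; return (event id, window offset) pairs."""
        hits = []
        for _ in range(len(self.events)):
            event_id, age = self.events.popleft()
            k = age - self.fifo_delay          # offset into the signal window
            if k >= 0:
                hits.append((event_id, k))     # accumulate FIFO output at offset k
            if k < self.n_w - 1:
                self.events.append((event_id, age + 1))
            else:
                self.removed.append(event_id)  # n_w accumulations done
        return hits


queue = TriggerQueue(fifo_delay=12, n_w=3)
queue.insert("local", age=0)
queue.step()
queue.insert("remote", age=10)   # already travelled for 10 sampling cycles
for _ in range(14):
    queue.step()
```

After these 15 cycles the remote event has been removed before the local one, although it was inserted later, and the queue never contains gaps.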
To avoid the fragmentation of the data structure storing the trigger events, a shift register based queue is used. In each sampling cycle, each trigger event is dequeued and enqueued again after incrementing its age, unless it is old enough to be removed. New events are enqueued afterwards. The actual length of the queue is managed by dedicated logic and corresponds to the number of currently active (i.e., overlapping) signal windows triggered by the reference channel.
Figure 9 shows the combination of all these modules to realize an RDT kernel for $|S| = 4$ sensor channels and $|R| = 2$ reference signals. The SPI ports of this module are connected to the discrete digital sensors. All other ports are controlled by the HaLoMote MCU using the communication infrastructure described in [13]. The BRAM addressing for the correlation functions is overridden for clearing and readout of individual correlation functions. As all correlation functions of a single reference channel share the same addressing signals, BRAM can be shared among those functions.
Figure 11 details the scheduling of the RDT operations. At the start of each sampling cycle, the MCU wakes up the RCU, which starts requesting the next sensor samples. The number of RCU cycles required for this operation depends on the number of bits to read, which typically does not exceed $3 \cdot 16$ bit. The high-pass filter operates on the sample from the previous sampling cycle and can thus be executed in parallel to the sensor sampling. The filtering takes $n_f + 1$ RCU cycles. The trigger event detection, the computation for the standard deviation and the handling of the delay FIFO require one additional RCU cycle after the filtering operation. The accumulation of the delayed sample value from the last sampling cycle to the correlation functions is performed in parallel to the filtering. It requires one RCU cycle per registered trigger event. If available, new trigger events are inserted just before the RCU is shut down again. As long as the number of registered trigger events does not exceed the order of the FIR filter, the overall execution time of the RDT kernel is fixed.
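The scheduling described above can be approximated by a simple cycle-count model in which the three parallel activities are bounded by the slowest one. This is a rough sketch, not a timing specification: the function `rcu_cycles`, the assumption of one RCU cycle per serial sensor bit, and the omission of wake-up and event-insertion cycles are all simplifications introduced here.

```python
def rcu_cycles(sample_bits, n_f, n_events):
    """Estimate the active RCU cycles of one sampling cycle.

    sample_bits: bits read from the sensor (assumed one RCU cycle per bit)
    n_f:         order of the FIR filter
    n_events:    number of currently registered trigger events
    """
    sampling = sample_bits        # runs in parallel to the filter
    filtering = (n_f + 1) + 1     # FIR taps plus one cycle for trigger/std/FIFO
    accumulation = n_events       # one cycle per event, parallel to the filter
    return max(sampling, filtering, accumulation)


few_events = rcu_cycles(sample_bits=48, n_f=64, n_events=0)
many_events = rcu_cycles(sample_bits=48, n_f=64, n_events=64)
overloaded = rcu_cycles(sample_bits=48, n_f=64, n_events=100)
```

In this model the filter path dominates for up to 64 registered events, so the execution time stays constant until the accumulation path becomes the longest one.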
VI. EVALUATION

A. Measurement Accuracy
To evaluate the measurement accuracy of the proposed wireless data acquisition system, a laboratory scale testbed was set up as shown in Figure 10. A warren truss railroad bridge was modeled by connecting 54 metal rods with 24 metal joints (Figure 10a), resulting in 51 kg overall weight and a span width of 246 cm. The test structure can be excited by a G-scale railway model or an impact hammer.
Five HaLoMotes were attached to this structure (Figure 10b), each connected to four ADXL362 micro-electro-mechanical (MEMS) acceleration sensors (upper part of Figure 10c). Thus, the movement of all inner joints of the bridge can be observed in three dimensions with a resolution of 1 mg. However, only the acceleration orthogonal to the bridge deck is taken into account as this is the main direction of the ambient excitation caused by traffic. Due to the relatively large stiffness of the small bridge model, the relevant structural modes to be observed are located between 50 Hz and 100 Hz. Thus, to safely meet the Shannon-Nyquist lower limit, a sampling rate of 400 Hz was chosen. The wireless sensor nodes are synchronized with an accuracy of a few microseconds. The time synchronization protocol [12] is not detailed in this paper, but the achieved accuracy is sufficient for the required sampling rate.
The WSN-based data acquisition system does not capture the actual excitation of the structure, so an OMA is required as described in Section IV. An $n_f = 64$ tap high-pass filter with a cut-off frequency of 20 Hz is applied to each sensor channel. This configuration was chosen as a trade-off between computational complexity and the filter quality. Static acceleration is damped by 60 dB, while all frequencies above 40 Hz are damped by less than $4 \times 10^{-3}$ dB. After the high-pass filtering, the RDT with two reference channels (nodes 3 and 13 as marked in Figure 10a) is applied. The trigger level of $l_3 = l_{13} = 200$ mg was determined experimentally and corresponds to the peak excitation injected by the train set. A window length of $n_w = 256$ is applied to capture the free decay of the structure within the first 640 ms after each trigger event.
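The relation between the chosen parameters can be checked with a few lines of arithmetic. The helper names below are hypothetical; the numbers are the ones given in the text (256-sample windows, 400 Hz sampling, modes up to 100 Hz).

```python
def window_duration_ms(n_w, f_s):
    """Duration covered by a window of n_w samples at sampling rate f_s (Hz)."""
    return 1000.0 * n_w / f_s


def nyquist_margin(f_s, f_max):
    """Ratio of the Nyquist frequency f_s / 2 to the highest frequency of interest."""
    return (f_s / 2.0) / f_max


duration = window_duration_ms(n_w=256, f_s=400)   # window length in milliseconds
margin = nyquist_margin(f_s=400, f_max=100)       # headroom above the highest mode
```

The window covers 640 ms, and the Nyquist frequency of 200 Hz lies a factor of two above the highest mode of interest.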
The manual excitation of the structure for 10 s resulted in 60 trigger events registered at node 3 and 40 trigger events registered at node 13. The 40 resulting correlation functions $D_{1,3}, \ldots, D_{20,3}, D_{1,13}, \ldots, D_{20,13}$ were transmitted to a base station for the subsequent modal analysis. As shown in Figure 12 for $D_{3,13}$, these correlation functions characterize the free decay of the structure.
A second, wire-bound data acquisition system was installed in parallel, consisting of 12 PCB 356A16 integrated circuit piezoelectric (ICP) accelerometers (lower part of Figure 10c) controlled by the LMS Test.Lab 14A (LMS). Due to the limited number of input channels available at the SCADAS sensor front-end, only one half of the bridge can be completely observed by the reference system. In principle, both sides of the bridge could be analyzed independently, but moving the sensors from one side to another is not practical for long term observations as required by SHM. Instead, only two additional ICP sensors are installed in the other half of the bridge to assess the symmetry of the observed mode shapes. All sensors are sampled at 512 Hz with a resolution of 0.1 mg. The LMS system also captures the excitation provided by an impact hammer, so an Experimental Modal Analysis (EMA) can be performed, thus providing more accurate results than the OMA-based WSN.
Compared to the wireless data acquisition system, the cabling required for the LMS system becomes rather complex (Figure 10d) even though only 60 % of the structure is covered. For the LMS measurement, five strokes with an impact hammer on node 17 (see Figure 10a) were injected in vertical direction at intervals of about 3 s to excite the vertical bending modes of the bridge model. The EMA results are averaged over those five individual measurements. The resulting frequency response function at the lower central joint (marked as node 3 in Figure 10a) is shown in Figure 13. Below 40 Hz and above 110 Hz, the structure's characteristics cannot be captured properly by the WSN-based system. However, the two dominating modes are located outside of the fuzzy frequency bands and can be captured with an error of less than 1 % as summarized in Table I.
In addition to the eigenfrequencies, the actual mode shapes are of special interest for an SHM system as minor damage will be reflected in the deformation of the mode shapes before significant changes in the eigenfrequencies can be detected. Figure 14a shows the asymmetric vertical bending mode captured by both monitoring systems. Remember that the wire-bound LMS system has a limited view on the rear side of the structure due to input channel restrictions. However, the two nodes on the rear side are sufficient to detect the asymmetric character of the mode, i.e., the rear side is bending up while the front side is bending down. But only the WSN system provides a detailed view on both sides of the structure, which is essential for the subsequent SHM analysis. This also holds true for the symmetric vertical bending mode shown in Figure 14b. Although the reduced accuracy of the WSN system is clearly visible in the mode shapes, the principal behavior of the structure can still be observed without the need for extensive cabling and well-known controlled excitation.
B. Resource Requirements
Finally, the runtime and energy required by the hardware accelerator of the HaLoMote architecture for the RDT computation is compared against software processors typically used in mobile and WSN applications. Namely, the TI CC2530 and the Atmel ATmega256RFR2 were chosen as representative 8 bit MCUs as these RF-SoCs have also been integrated into the HaLoMote. Furthermore, the TI CC430 RF-SoC was chosen as representative 16 bit MCU, as the MSP430 architecture is widely used in many WSN motes. Most recent software processors for mobile applications are based on the ARM Cortex-M architecture, so the STM32F407 and the TI CC2650 were chosen as particularly powerful and energy-efficient state-of-the-art references.
For a fair comparison, all systems were configured with the same RDT settings as described in Section VI-A, i.e., 400 Hz sampling frequency, four sensor channels, two reference channels, 64-tap FIR filter and 256 samples per correlation function. As the actual work-load heavily depends on the number of detected trigger events, all systems were fed with the same pre-recorded sensor samples stored in the processors' code memories, and the trigger levels were chosen such that the actual number of triggered events is 60 and 40, respectively, as observed in Section VI-A. The firmware for all systems was built with the most recent compilers configured to optimize for execution speed. The average runtime per sampling cycle measured with peripheral timers was combined with data-sheet information about the power consumption and the wakeup time from sleep mode as shown in Table II. The deepest sleep mode with memory retention and enabled real-time clock was chosen for each system. To derive the overall energy spent per sampling cycle ($E_{overall}$), a power consumption of $P_{active}$ was assumed during wakeup, as the capacitors of the internal switching regulators have to be charged during the ramp-up. To better illustrate the energy consumed per sample by the different processor architectures, the corresponding system lifetime ($t_{alive}$) achievable when supplied by a 1 Wh energy buffer (e.g., a typical NiMH cell) was derived. Note that this estimation takes only the processor into account, disregarding the sensors and the radio transceiver (which are assumed to be identical across the platforms examined).
As shown in Table II, the 8 bit MCUs do not achieve the required sampling period of 2500 µs. The CC430 requires about 50 % of the sampling period for the RDT computations, thus consuming 15.4 µJ per sampling cycle. The powerful Cortex-M4 device is nearly 30 times faster than the CC430, but its comparatively large power draw in idle mode still results in 11.4 µJ consumed per sampling cycle. The TI CC2650 proves to be the most energy-efficient software processor under consideration as it requires only 1.6 µJ per sampling cycle. However, the HaLoMote FPGA requires only 28 % of the energy of the most efficient software processor. Note that the HaLoMote MCU adds an overhead that is mainly due to its wakeup (228 nJ) and idle (6 nJ) time. The combination of the hardware accelerator and the Atmel MCU, as used by the latest HaLoMote implementation, thus consumes only 44 % of the energy of the most efficient software processor.
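The energy and lifetime estimate can be written down as a small model. The sketch below follows the description of the method (active power during wakeup and runtime, sleep power for the rest of the period, 1 Wh budget), but the function names are hypothetical and any concrete power or timing values passed to `energy_per_cycle` would have to come from the respective data sheets. The usage example only relies on the per-cycle energies reported in the text.

```python
def energy_per_cycle(p_active, p_sleep, t_wakeup, t_run, period):
    """Energy (J) per sampling cycle: active during wakeup and runtime, asleep otherwise."""
    t_on = t_wakeup + t_run
    if t_on > period:
        raise ValueError("processor cannot meet the sampling period")
    return p_active * t_on + p_sleep * (period - t_on)


def lifetime_hours(e_cycle, period, budget_wh=1.0):
    """Hours of operation from an energy budget given in watt-hours."""
    average_power = e_cycle / period   # watts
    return budget_wh / average_power   # Wh / W = h


PERIOD = 2.5e-3                        # 400 Hz sampling, in seconds
E_CC2650 = 1.6e-6                      # reported energy per cycle, in joules
E_HALOMOTE = 0.44 * E_CC2650           # 44 % of the CC2650, as reported

hours_cc2650 = lifetime_hours(E_CC2650, PERIOD)
hours_halomote = lifetime_hours(E_HALOMOTE, PERIOD)
gain = hours_halomote / hours_cc2650
```

With these inputs the CC2650 would last roughly 1560 hours on 1 Wh, and the lifetime gain of about 2.3 follows directly from the 44 % energy ratio.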
The energy efficiency of the HaLoMote easily exceeds that of the other platforms while actively performing computations, but it suffers when the node has to remain powered, but stays idle. In that case, its idle power consumption is 53 µW, which is nearly thirty times the power drawn by the CC2650 MCU in idle mode. As sensor motes spend most of their lifetime in idle mode, special care must be taken on the HaLoMote to address this issue. In the SHM application, this can be achieved by having a supervisory power manager suspend the sensor sampling when no traffic is present on the structure. For the specific use-case of railway bridges, these times may be more than 95 % of the overall operating time. For these quiet periods, the non-volatile FRAM (see Section III) can be employed to store the internal state of the hardware kernels, which has to be preserved once the supervisory manager completely powers down the hardware accelerator. Note that the FPGA configuration data is not affected by such a shutdown, as it is held on-chip in non-volatile Flash memory.
For the concrete SHM configuration discussed in this section, about 48 kbit of runtime state has to be preserved across shutdowns (i.e., 41 kbit for the correlation functions, 3 kbit for the FIR taps, 3 kbit for the delay FIFO, and 1 kbit for the trigger events). Writing this state requires about 2 ms when using only one FRAM module. A state transfer between FRAM and FPGA requires a total of 188 µJ for both directions (state save and restore). This energy would be consumed by the FPGA in idle mode in about 3.5 s. Thus, state preservation and FPGA power-gating pays off for an idle duration of at least 4 s for the SHM application. Beyond the railway bridge use-case, such short idle times occur even on many less frequently traveled automotive bridges. Thus, the capability of quickly and power-efficiently preserving the system state, while completely shutting down the accelerator, is attractive for a variety of applications.
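The break-even reasoning above can be reproduced with the figures given in the text. The helper names are hypothetical, and the sketch assumes that 1 kbit equals 1000 bit and that the full 25 Mbit/s write rate of a single FRAM module is available.

```python
def break_even_idle_s(transfer_energy_j, idle_power_w):
    """Idle duration after which saving and restoring state beats staying idle."""
    return transfer_energy_j / idle_power_w


def fram_write_ms(state_bits, rate_bits_per_s, modules=1):
    """Time to write the state, optionally striped over several FRAM modules."""
    return 1000.0 * state_bits / (rate_bits_per_s * modules)


state_bits = (41 + 3 + 3 + 1) * 1000          # correlation, taps, FIFO, events
write_ms = fram_write_ms(state_bits, 25e6)    # single FRAM module
idle_s = break_even_idle_s(188e-6, 53e-6)     # 188 uJ transfer, 53 uW idle power
```

The results, about 1.9 ms for the write and about 3.5 s for the break-even idle duration, match the rounded values quoted in the paragraph.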
VII. CONCLUSION
With the growing complexity of WSN applications, the use of heterogeneous computing architectures was shown to be profitable even for space- and power-constrained sensor nodes. The evolution of the HaLoMote was driven by practical experiences and resulted in a significantly more compact and efficient WSN mote. When used as a data acquisition system for distributed SHM applications, its accuracy can compete with a wire-bound laboratory system, at least in the limited frequency range relevant for those applications, while significantly reducing the deployment effort. Finally, due to its hardware-accelerated signal processing chain, the energy efficiency of the HaLoMote outperforms even the most recent ARM-based processors designed for WSN applications by a factor of 2.3, thus improving the system lifetime when powered by a limited energy supply.
