In this article we describe a low-power processor platform for use in Wireless Sensor Network (WSN) nodes (motes). WSN motes are small, battery-powered devices comprising a processor, sensors, and a radio frequency transceiver. It is expected that WSNs consisting of large numbers of motes will offer long-term, distributed monitoring and control of real-world equipment and phenomena. A key requirement for these applications is long battery life. We investigate a processor platform architecture based on an application-specific programmable processor core, System-on-Chip bus, and a hardware accelerator. The architecture improves on the energy consumption of a conventional microprocessor design by tuning the architecture for a suite of TinyOS-based WSN applications. The tuning method minimizes changes to the instruction set architecture, facilitating rapid software migration to the new platform. The processor platform was implemented and validated in an FPGA-based WSN mote. For an ASIC implementation running a typical TinyOS application suite, the energy consumption is estimated to be 48% lower than that of a conventional programmable processor, without use of voltage scaling.
INTRODUCTION
Wireless Sensor Networks (WSNs) consist of large numbers of small network devices or motes [Hill 2003]. Each battery-powered mote consists of a processor, sensors, and a Radio Frequency (RF) transceiver. It is envisaged that large numbers of motes will provide long-term monitoring and control of real-world equipment and phenomena. Previously proposed applications include early detection of wildfire, building monitoring, structural monitoring, tracking wild animals, military surveillance, and medical research [Akyildiz et al. 2002]. One of the limiting factors in the deployment of WSN technology is battery lifetime. Currently, research is being conducted on all aspects of WSNs in order to meet the stringent requirements of small device size and long-term, untethered operation.
Previous work on reducing the power consumption of WSN processors has focused on full custom processor development and circuit-level optimization for low power, including the use of asynchronous logic and voltage scaling. In contrast, in this work, we investigate application-specific tuning of the processor Instruction Set Architecture (ISA) combined with offloading of computationally complex Operating System (OS) functions to hardware accelerators. The tuning method aims to reduce energy consumption while minimizing changes to the ISA such that software migration to the new platform is straightforward. The proposed tuned BlueDot ISA is based on the conventional Atmel ATmega128L ISA but frequently used instructions are accelerated, infrequently used instructions are decelerated, commonly used instruction pairs are combined, RAM accesses are pipelined, and a commonly used function is offloaded. In this way, the cycle count of TinyOS applications is significantly reduced, with a corresponding reduction in energy consumption. Software binary compatibility with the ATmega128L is maintained by means of a postcompilation software tool that remaps existing machine instructions to new machine instructions.
To the authors' knowledge, this tuning approach has not been applied previously to WSN processors. The advantage of the approach is that significant energy savings can be achieved with minimal changes to the application software and tool flow.
The processor platform was implemented on an FPGA layer that was incorporated in a WSN mote. Verification and validation were significantly accelerated by implementation of the processor platform on the FPGA. The processor platform was synthesized for ASIC and the power consumption estimated by gate-level simulation. It was found that the tuning process reduced energy consumption by 48%, on average, over a range of TinyOS applications.
The remainder of the article is structured as follows. Section 2 describes the background to the work covering previous WSN processor designs and details the characteristics of TinyOS. Section 3 presents an analysis of the power consumption of a typical WSN application scenario. Section 4 presents the BlueDot architecture. Section 5 describes implementation of the processor platform. Section 6 gives the results of implementation in terms of clock frequency, area, and energy consumption. Finally, Section 7 concludes the article. 
WSN Processors
The first WSN motes, developed at UC Berkeley, were based on Atmel AVR general-purpose 8-bit microcontrollers [Atmel 2004]. The Atmel processor was selected due to its low power consumption, its support for analog-to-digital converters and real-time clocks, and the availability of programming tool support. The ATmega128L is an 8-bit Harvard RISC architecture with 16-bit instructions, 32×8 bit general-purpose registers, peripheral control registers, and a 2-cycle multiplier. Currently, the most popular Commercial Off-The-Shelf (COTS) processors used in WSN motes are the Atmel ATmega128L and the Texas Instruments MSP430 [Texas Instruments 2004]. Both are general-purpose microcontrollers. As such, their ISAs are not optimized for WSN applications. Further information on the performance and power consumption of these and other commercially available WSN processors can be found in Lynch et al. [2005].
Due to the need for lower power consumption and longer battery life, research is ongoing on developing low-power Application-Specific Processors (ASPs) for WSN motes [Hempstead et al. 2008].
SNAP is a 16-bit fully asynchronous processor with an event-driven architecture [Ekanayake 2004]. It has a FIFO task dispatcher with message and timer coprocessors. An asynchronous Atmel AVR clone was described in Necchi et al. [2006]. These processors achieve low power consumption. However, it can be difficult to integrate asynchronous logic into conventional, commercial synchronous design flows for low-cost System-on-Chip (SoC) solutions. For this reason, asynchronous logic is considered unattractive for many applications.
Subthreshold sensor processors, primarily for biomedical sensor network applications, are described in Nazhandali et al. [2005b] and Hanson et al. [2008]. These designs achieve low power consumption. However, as described in Hanson et al. [2008], subthreshold logic is highly susceptible to temperature and process variations. In addition, due to the low voltage swing, noise arising from other on-chip SoC components could be an issue for commercial WSN applications.
In contrast, Spec is an 8-bit synchronous WSN mote processor with 16-bit instructions [Hill 2003]. The ALU operations are single cycle but the memory operations take two cycles. The ISA contains 32 general-purpose registers and can address a 16-bit address space. The processor has a special bank of context switching registers to improve interrupt-handling efficiency. A number of functions are offloaded to hardware accelerators. However, the design of the RF hardware accelerator is such that it needs the processor to handle transmission and reception. As a result, the frequency of processor interrupts to handle communication is nearly the same as for the conventional Mica processor.
The Smart Dust microcontroller is a 12-bit synchronous processor that runs at 1 V and employs component-level clock gating and guarded ALU inputs [Warneke and Pister 2004]. Power optimization focuses mainly on circuit-level optimizations to reduce switching activity. The Nimbus processor was developed by optimization of an Atmel AVR microprocessor clone for low power [Lorentzen 2004; Leopold 2004]. Power reduction was achieved by restricting the design to a subset of the AVR ISA, by using low-power circuit design, and by scaling the supply voltage. The project team also investigated an asynchronous variant of the Nimbus processor, called Disa.
The Hempstead platform consists of an event processor and a secondary general-purpose nonpipelined microcontroller [Hempstead et al. 2005, 2009]. Low-power optimizations focus mainly on voltage scaling and the use of hardware accelerators. System bus, timer, radio interface, and message processor subsystems are included. Circuit-level optimizations are considered in further work by the same authors [Hempstead et al. 2006].
μAMPS consists of a 16-bit synchronous Digital Signal Processor (DSP) and a Fast Fourier Transform (FFT) coprocessor connected via a shared bus [Finchelstein 2005]. The architecture includes voltage scaling and has a DMA engine.
Of these processors, only the Spec has extensive programming tool support, due to the fact that it is compatible with the Atmel AVR ISA. The other processors have custom ISAs. To the authors' knowledge, TinyOS has not been ported to the other processors, making software migration a time-consuming task. None of the previous work considers optimization of the Atmel microprocessor ISA, other than the simple subsetting approach used in the Nimbus processor. The application-specific ISA tuning approach described herein may be combined with circuit-level optimizations, such as subthreshold voltage and asynchronous logic, to further reduce power consumption.
WSN Processor Workload
TinyOS [Hill et al. 2000] is an application-specific operating system for WSN motes originally developed by UC Berkeley. TinyOS has become the de facto standard in the WSN research community due to its open-source availability, small footprint, portability, and development tools [Hill et al. 2004; Levis et al. 2005; Gay et al. 2007].
TinyOS has a component-based architecture and uses a C-like programming language, nesC. TinyOS components interact with each other in one of three ways: via commands, events, or tasks. A command is a request to perform some service, such as to initialize a sensor, while an event signals completion of this service. An event is a notification of a condition from a lower-level module to a higher-level handling module. Rather than performing a computation immediately, command and event handlers post tasks to a scheduler. The posted task is then scheduled and initiated later by TinyOS.
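The deferred-execution model described above can be illustrated with a minimal sketch, written here in Python rather than nesC. The names `TaskScheduler`, `post`, `run_next`, and `on_adc_data_ready`, the queue capacity, and the FIFO ordering are assumptions made for illustration and are not taken from the TinyOS sources.

```python
from collections import deque
from typing import Callable, Deque, List


class TaskScheduler:
    """Toy FIFO task queue; names and capacity are illustrative assumptions."""

    def __init__(self, capacity: int = 8) -> None:
        self._queue: Deque[Callable[[], None]] = deque()
        self._capacity = capacity

    def post(self, task: Callable[[], None]) -> bool:
        """Queue a task for later execution; report failure if the queue is full."""
        if len(self._queue) >= self._capacity:
            return False
        self._queue.append(task)
        return True

    def run_next(self) -> bool:
        """Run the oldest pending task; return False when nothing is pending."""
        if not self._queue:
            return False
        self._queue.popleft()()
        return True


log: List[str] = []
scheduler = TaskScheduler()


def on_adc_data_ready(sample: int) -> None:
    """Hypothetical event handler: note the event and defer the real work."""
    log.append("event")
    scheduler.post(lambda: log.append(f"task processed sample {sample}"))


on_adc_data_ready(512)
log.append("handler returned")
# The scheduler loop drains pending tasks after the handler has returned.
while scheduler.run_next():
    pass
```

In this sketch the handler returns before the posted task runs, which is the property that keeps interrupt handlers short.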
Simulators for TinyOS are described in Shnayder et al. [2004]. Avrora, a fast cycle-accurate instruction-level simulator for WSN motes, was used in this work [Titzer et al. 2005].
To the authors' knowledge, there are no publicly available software benchmarks for WSN applications. The authors of Hempstead et al. [2004] proposed the creation of TinyBench, a standardized benchmark suite for TinyOS. However, the benchmark suite, although proposed, was not actually released. The authors of Nazhandali et al. [2005a] proposed a set of C applications representing real-time workload applications of a WSN processor. Again, the software is not publicly available. A WiSeNBench workload was proposed and assessed on the ARM processor in Mysore et al. [2008]. The work did not go on to propose or assess the effect of ISA optimizations.
The TinyOS distribution itself includes a comprehensive set of applications [TinyOS 2009 ], a subset of which is used in this research for workload analysis in place of a formal benchmark.
WORKLOAD ANALYSIS
In order to determine the most effective processor platform architecture, a workload analysis was conducted for a reasonable, computationally tractable WSN application scenario. A 3×3 grid of 9 nodes was considered, as shown in Figure 1. All nodes are Mica2 motes with the Atmel ATmega128L processor on-board. All nodes run TinyOS-based application software. Three leaf nodes (nodes 5, 7, and 8) sense light and send measurements to the intermediate routing nodes (Surge) to be forwarded to the base node, node 0. A simple 4-tap Finite Impulse Response (FIR) averaging filter application was developed to represent the workload of a typical WSN processor-intensive application. Node 8 was programmed with this application. It filters the ambient light samples read by the ADC and sends the data via RF. Nodes 1, 3, and 4 sense light, send their data to the base node, and forward packets coming from leaf nodes. Node 2 was programmed to generate regular radio traffic by broadcasting counter values. Node 6 monitors the activity in the network by displaying any packet received on the Light Emitting Diodes (LEDs). Node 0 executes the TransparentBase application that sends packets from the host PC to the WSN and receives packets from the WSN and forwards them to the host PC via a UART.
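The kind of computation performed by the 4-tap FIR averaging filter on node 8 can be sketched as follows. This is an illustrative Python model, not the nesC application used in the study; the zero-filled initial window, the integer division, and the sample values are assumptions.

```python
from collections import deque
from typing import Deque, Iterable, List

NUM_TAPS = 4  # the text specifies a 4-tap averaging filter


def moving_average(samples: Iterable[int], taps: int = NUM_TAPS) -> List[int]:
    """Average the most recent `taps` samples (equal coefficients of 1/taps).

    The window starts zero-filled and integer division is used; both are
    assumptions made for this sketch.
    """
    window: Deque[int] = deque([0] * taps, maxlen=taps)
    out: List[int] = []
    for sample in samples:
        window.append(sample)  # the oldest sample drops out automatically
        out.append(sum(window) // taps)
    return out


adc_samples = [400, 404, 408, 412, 416]  # made-up light readings
filtered = moving_average(adc_samples)
```

The first few outputs are dominated by the zero-filled start-up window; once four real samples have arrived, each output is the mean of the last four readings.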
An automated software system was developed for profiling the workload. An XML file was used to store the configuration parameters. The executable was generated from the nesC source using the avr-gcc compiler and input to Avrora to obtain the instruction trace and the function trace. A further program postprocessed and summarized the log files. The network was simulated for 300 seconds of virtual time. Use of a larger network or longer simulation time was desirable but the log files generated became impractical to process. The instruction execution trace was sorted and analyzed to identify frequently used instructions and pairs of instructions. In addition, the function execution trace was analyzed to identify frequently used functions, their hierarchy, and the number of instructions executed as part of them.
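The instruction and instruction-pair counting step can be sketched as below. This is a minimal Python illustration, not the authors' profiling tool; the function name `profile_trace` and the trace fragment are hypothetical, and the real Avrora log format is not modeled.

```python
from collections import Counter
from typing import List, Tuple


def profile_trace(mnemonics: List[str]) -> Tuple[Counter, Counter]:
    """Count single instructions and consecutive instruction pairs in a trace."""
    singles = Counter(mnemonics)
    # zip() of the trace with itself shifted by one yields every adjacent pair.
    pairs = Counter(zip(mnemonics, mnemonics[1:]))
    return singles, pairs


# Made-up trace fragment, for illustration only.
trace = ["push", "push", "push", "ld", "add", "st", "pop", "pop", "pop", "ret"]
singles, pairs = profile_trace(trace)

# Share of the trace covered by the two most frequent instructions.
top_two_share = sum(count for _, count in singles.most_common(2)) / len(trace)
```

Ranking the two counters by frequency gives the kind of information summarized in the instruction and instruction-pair tables.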
Because TinyOS is an event-driven OS, most of its functionality is contained in interrupt routines. Table I summarizes the TinyOS functions with the greatest workload for each node. The interrupt handler routine for the SPI-RF interface dominates, consuming 30%-50% of active processor cycles. Main, the OS kernel function, consumes 10%-20% of the total workload across all of the applications. The other dominant functions are the nesC-atomic-, Timer-, and ADC-related interrupt routines. Table II lists the 10 most frequently used instructions across all applications. These instructions constitute 40%-70% of the total instruction workload, depending on application. The most frequently used are the stack manipulation instructions, push and pop. Results show that 60%-80% of accesses to data memory are to the stack, that is, the bottom 20-25 locations in the Atmel ATmega128L address space. Table III lists the most frequently occurring consecutive pairs of instructions. These instruction pairs constitute 25%-35% of all of the pairs encountered in each application trace. The most frequently occurring instruction pairs are pop-pop and push-push.
PROPOSED ARCHITECTURE
Processor Platform
Herein, we propose a processor platform, called BlueDot, which supports energy-efficient execution of TinyOS-based WSN applications. The processor platform is intended to replace the conventional processor in a WSN node. Since replacing the conventional processor with a BlueDot processor platform does not alter the functionality of a node, a node containing the new platform can be used together with conventional nodes in the same network. The BlueDot processor platform architecture consists of an application-specific programmable processor core and a hardware accelerator interconnected via a System-on-Chip (SoC) bus.
Processor Core
The BlueDot processor core ISA was determined by tuning the conventional ATmega128L ISA based on the results of the workload analysis. The following optimizations were applied.
-Frequently used multicycle instructions were modified to only use one clock cycle.
• R. K. Raval et al.
-Frequently used pairs of instructions were merged into single instructions.
-Interrupt handling time was significantly reduced.
-Instructions were added to enhance communication with hardware accelerators.
-In some cases, Data RAM accesses were advanced to speed up execution.
In addition, to facilitate software compatibility with the conventional Atmel microcontroller, two modes of operation are supported.
-Compatible Mode: ATmega128L binary compatible.
-Advanced Mode: close to ATmega128L binary compatible.
Software compiled for the Atmel ATmega128L will execute without modification on the BlueDot processor core when in compatible mode. Taking advantage of the optimizations available in advanced mode requires that compiled ATmega128L machine code be postprocessed, either manually or automatically, to convert some ATmega128L instructions to optimized BlueDot instructions. In this work, the postprocessing was done after compilation using a script that substitutes BlueDot accelerated equivalents for ATmega128L machine instructions. Table IV lists the accelerated instructions and gives the speedup achieved in both compatible and advanced mode. Compatible mode accelerates commonly used existing instructions. Some instructions cannot be accelerated due to data RAM access limitations. Others cannot be accelerated due to the length of their opcodes. For example, an instruction with a 32-bit opcode, which requires a 2-cycle fetch, takes up two slots in the pipeline regardless of the speed of the execution step. In advanced mode, opcodes can be altered. Hence some frequently used instructions can be further accelerated. However, due to the limited number of opcodes available, some 16-bit instructions had to be decelerated, that is, adjusted to 32-bit opcodes, to achieve this. RCALL and ANDI were decelerated for this reason. The new compound instructions added to advanced mode are described in more detail in Table V. Overall, advanced mode provides acceleration over compatible mode due to the inclusion of these new compound instructions, even though a small number of instructions are actually decelerated relative to compatible mode.
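The idea behind the postcompilation remapping step can be sketched as a peephole pass over a decoded instruction list. This is a hedged illustration only: the compound mnemonics `push2` and `pop2` are hypothetical placeholders (the actual compound instructions are those of Table V), the pass works on mnemonics rather than machine code, and it ignores the branch-target and address fix-ups that a real tool would need.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, Tuple[str, ...]]  # (mnemonic, operands)

# Hypothetical compound mnemonics; the real ones are defined in Table V.
COMPOUND: Dict[Tuple[str, str], str] = {
    ("push", "push"): "push2",
    ("pop", "pop"): "pop2",
}


def remap(program: List[Instr]) -> List[Instr]:
    """Greedily fuse adjacent instruction pairs that have a compound equivalent."""
    out: List[Instr] = []
    i = 0
    while i < len(program):
        if i + 1 < len(program):
            (m1, ops1), (m2, ops2) = program[i], program[i + 1]
            fused = COMPOUND.get((m1, m2))
            if fused is not None:
                out.append((fused, ops1 + ops2))  # keep both operand lists
                i += 2
                continue
        out.append(program[i])
        i += 1
    return out


original: List[Instr] = [
    ("push", ("r28",)), ("push", ("r29",)),
    ("ldi", ("r24", "0x01")),
    ("pop", ("r29",)), ("pop", ("r28",)),
    ("ret", ()),
]
remapped = remap(original)
```

In this example six instructions become four, which is the mechanism by which merged pairs reduce the cycle count.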
Three additional instructions were added to allow for direct communication between the processor core and the SoC Bus. These are listed in Table VI.
The RAM used for the processor is a synchronous memory with combinational inputs. In pipelined execution, writes can be performed in one cycle. Reading takes two cycles: the first to set the address and the second to obtain the data; see Figure 2(a). This arrangement minimizes the critical path of the circuit. Instructions such as POP, RET, RETI, LD, and LDS must access the RAM in read mode. The performance impact is significant, given that these instructions are among the most frequently used. The solution adopted herein is to bring the read process forward one cycle, as shown in Figure 2(b), putting the address on the address bus while still executing the previous instruction. This solution works well when the previous instruction does not make use of the address bus to access the memory; otherwise a resource conflict occurs. If a conflict is detected, the read is not brought forward, which resolves the conflict. The processor core switches between forward and normal read mode depending on the instructions that are being executed. Table VII lists all possible resource conflicts. If any of the instructions in the second column are executed after any of the instructions in the first column, then the forward read process is disabled and the instruction in the second column executes the read process normally.
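The forward-read decision can be modeled with a short sketch. The set of read instructions is taken from the text, but `ASSUMED_BUS_USERS` is a placeholder for the conflict list of Table VII and is an assumption, as is the simplification that each forwarded read saves exactly one cycle.

```python
from typing import FrozenSet, List, Optional

# Instructions that read data RAM, as listed in the text.
READ_INSTRS: FrozenSet[str] = frozenset({"pop", "ret", "reti", "ld", "lds"})

# Placeholder conflict set: predecessors assumed to occupy the address bus.
# The authoritative list is Table VII; this set is only an assumption.
ASSUMED_BUS_USERS: FrozenSet[str] = frozenset({"push", "st", "sts"}) | READ_INSTRS


def forwarded_reads(
    trace: List[str], bus_users: FrozenSet[str] = ASSUMED_BUS_USERS
) -> List[bool]:
    """For each instruction, report whether its RAM read can start one cycle early."""
    flags: List[bool] = []
    prev: Optional[str] = None
    for instr in trace:
        is_read = instr in READ_INSTRS
        # Forwarding needs a predecessor that leaves the address bus free.
        can_forward = is_read and prev is not None and prev not in bus_users
        flags.append(can_forward)
        prev = instr
    return flags


trace = ["add", "ld", "st", "pop", "mov", "ret"]
flags = forwarded_reads(trace)
cycles_saved = sum(flags)  # one cycle per forwarded read in this simple model
```

Here `ld` and `ret` are forwarded because they follow register-only instructions, while `pop` falls back to the normal read because the preceding `st` is assumed to use the address bus.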
SoC Bus
The SoC Bus was designed to provide low-power communication between the Processing Units (PUs) in the design. A shared 8-bit bus was chosen to provide for simple layout. To allow for Dynamic Voltage and Frequency Scaling (DVFS), the bus clock operates asynchronously to the PU clocks and is retimed in the bus interface module of each PU. Delay tolerance was designed into the bus to allow for PUs in deep sleep mode [Hu 2004]. Messages are buffered on an as-needed basis. The bus logic comprises three main hardware blocks: the arbiter, the scheduler, and the interfaces. The arbiter controls the bus, the scheduler stores messages missed by sleeping or busy PUs, and the interfaces connect the PUs to the bus. The bus and the bus side of the interface modules are synchronized to a clock generated by the arbiter. The PUs themselves have independent clocks. There is typically no need to segment the bus into smaller sections, although this can be done. Transmission requests are signaled to the arbiter via a dedicated line per PU. In the case of a large number of PUs, The interface module is divided into two separate blocks: reception and transmission. The two parts of the module operate independently. The bus side of the blocks is clocked using the SoC bus clock. The PU side is clocked with the PU clock. A handshaking protocol operating via a shared RAM is used to retime and queue messages.
When the PU requests that the transmit block send a message, the block requests a bus grant from the arbiter using the dedicated line. Once the arbiter grants permission, the transmit block starts sending the message via the bus, indicating this to the other PUs by setting the "message being sent" signal. The PU clears the request signal as soon as the transmission has begun. The message consists of the destination PU Id and a series of data bytes. The data bytes are sent synchronously to the bus clock. The receiving module clocks the data into the interface component if the PU Id on the bus matches its own hard-wired PU Id. The transmission finishes when the last byte is sent, which is indicated with a "bus last byte" signal. At this moment, the transmitter clears the "message being sent" signal to indicate that the bus is available to transmit a new message.
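The message format and the two status signals described above can be modeled at a behavioral level as follows. This is a sketch under stated assumptions, not the bus implementation: the class `BusCycle`, the functions `transmit` and `receive`, the placement of the PU Id as the first byte on the bus, and the Id value `0x02` are all hypothetical, and arbitration, retiming, and the scheduler are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BusCycle:
    data: int         # value on the shared 8-bit bus
    sending: bool     # "message being sent" signal
    last_byte: bool   # "bus last byte" signal


def transmit(dest_id: int, payload: bytes) -> List[BusCycle]:
    """Serialize a message as the destination PU Id followed by the data bytes."""
    if not payload:
        raise ValueError("payload must contain at least one byte")
    frame = bytes([dest_id]) + payload  # raises ValueError if dest_id > 255
    last = len(frame) - 1
    return [BusCycle(b, True, i == last) for i, b in enumerate(frame)]


def receive(own_id: int, cycles: List[BusCycle]) -> Optional[bytes]:
    """Clock in the data bytes only if the leading PU Id matches own_id."""
    if not cycles or cycles[0].data != own_id:
        return None
    return bytes(c.data for c in cycles[1:])


RADIO_PU_ID = 0x02  # hypothetical hard-wired PU Id
cycles = transmit(RADIO_PU_ID, b"\x10\x20\x30")
accepted = receive(RADIO_PU_ID, cycles)
ignored = receive(0x03, cycles)
```

Only the interface whose hard-wired Id matches captures the payload; every other interface sees the same bus cycles and discards them.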
The SoC bus protocol and architecture are described in more detail in Fernandez et al. [2007].
Hardware Accelerator
As described in Section 3, the interrupt handler routine for the radio-SPI component is the most frequently used function in the WSN workload. This routine initializes the radio chip and transmits and receives packets over the radio interface. Communication with the radio chip is performed using a SPI interface. In the Mica2 mote, the SPI interface interrupts the Atmel processor whenever a byte is received or transmitted, which keeps the processor active and awake for long periods of time with little activity. Porting this functionality to dedicated hardware can save energy, as it can reduce processor activity and speed up execution. The processor itself is only interrupted when a complete frame is received, rather than every time a byte is received. Similarly, the processor sends the complete frame to the dedicated hardware block and switches to sleep mode, leaving the handling of the transmission to the PU.
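The effect of moving from per-byte to per-frame interrupts can be made concrete with a small calculation. The function name `processor_interrupts`, the frame length, and the traffic volume below are assumptions chosen for illustration, not measured values.

```python
import math


def processor_interrupts(total_bytes: int, frame_bytes: int, offloaded: bool) -> int:
    """Count processor interrupts needed to move total_bytes over the radio interface."""
    if offloaded:
        # With the radio-SPI PU: one interrupt per complete frame.
        return math.ceil(total_bytes / frame_bytes)
    # Without offload: one interrupt per byte transferred.
    return total_bytes


ASSUMED_FRAME_BYTES = 32   # illustrative frame length, not from the article
ASSUMED_TOTAL_BYTES = 3200  # illustrative traffic volume

per_byte = processor_interrupts(ASSUMED_TOTAL_BYTES, ASSUMED_FRAME_BYTES, offloaded=False)
per_frame = processor_interrupts(ASSUMED_TOTAL_BYTES, ASSUMED_FRAME_BYTES, offloaded=True)
```

Under these assumptions the processor is woken 3200 times without offload and 100 times with it, so the reduction in wake-ups scales with the frame length.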
A functional block diagram of the radio-SPI PU is shown in Figure 3. The control module state machine is the main controller for the PU. It decodes the commands on the SoC bus, accesses the internal registers, and controls the transmission and reception of the frames via the two interfaces (the radio and SoC bus). To communicate with the radio chip, the necessary SPI signals are generated. This is done by means of the SPI state machine, which sends these signals and implements internal control, status, and extended status registers. There are two blocks of memory for transmission and reception of messages over the SPI interface. These are used in a FIFO manner. Interrupts to the processor are also generated in this block.
The SPI protocol uses four wires to synchronously communicate with the radio, with the radio-SPI processing unit acting as Master. A set of registers is used to configure communication with the radio chip.
Further hardware accelerators can be added to the architecture without modifying the existing PUs. New PUs are simply interfaced to the SoC bus and allocated a new PU Id.
IMPLEMENTATION
The BlueDot processor platform was implemented in Verilog. Depending on the software, the processor core within the platform can run in one of three modes. In baseline mode it operates as a clone of the ATmega128L with exactly the same cycle count. In compatible mode it executes the same ATmega128L machine code but some instructions are accelerated, reducing the cycle count. In advanced mode, the machine code is modified to allow for use of the new compound instructions, further reducing cycle count.
Verification was performed at the block and module level using ModelSim. The processor platform was validated using a Tyndall WSN node [O'Flynn et al. 2005]. The node consists of a number of PCB layers and a coin battery. The layers include an ATmega128L processor, light and temperature sensors, memory, Analog-to-Digital Converters (ADCs), a Xilinx Spartan FPGA, and a T.I. CC2420 radio chip. The processor and FPGA are interconnected so as to allow communication between the processor and FPGA and to allow control of the node hardware by either the processor or FPGA. This enables rapid development by allowing initial software implementation on the processor and incremental offload of functionality to the processor platform on the FPGA. The test environment is illustrated in Figure 4. The UART provides for monitoring of SoC bus traffic and manual transmission of bus messages via a PC terminal window application. This feature allows manual control of the system and significantly accelerated debugging of the design. A simple RF transmission and reception application was built to validate the design. Sensor and LED control PUs were constructed to test the functionality of the SoC bus and to assess the flexibility of the functional offload concept.
To evaluate the performance of the design, the cycle count figures obtained from ModelSim instruction simulations were compared with those of the conventional ATmega128L. To obtain estimates of area, power, and timing, the processor was synthesized for ASIC using Synopsys DesignCompiler with a Faraday-UMC 0.13 μm low-leakage CMOS technology library. The memory access time was assumed to be 4.86 ns, worst case. The Synopsys PrimePower tool was used for power estimation. The area figures provided exclude interconnect and scan test. The power and energy estimates were obtained prelayout but include leakage and manual estimates for memory access and clock tree energy.
RESULTS
The ASIC area and maximum clock frequency of the various BlueDot components are provided in Table VIII. The average energy/instruction of the processor core was 26.42 pJ. The average energy per cycle measured for the processor core was 16.10 pJ. The number of cycles and energy required for execution of the TinyOS applications was assessed for the baseline processor only and for the full BlueDot processor platform including advanced mode processor core and RF-PU. The results are presented in Table IX. The full BlueDot platform provides significant energy savings of up to 56% over the baseline processor. Speedups in terms of cycle count of up to 61% were achieved, with an average of 55%. The number of bytes transferred between the ATmega128L and the CC2420 radio transceiver in the SenseToRfm application was 719745 bytes, estimated over 300 virtual seconds of simulation of the 9-node sensor network. On the BlueDot platform, the energy/bit to communicate this data over the SoC bus was 5 pJ per bit measured over a data length of 16 bytes.
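Two derived quantities follow from the figures quoted above and are worked through in the short calculation below. Both rest on assumptions that are not stated in the article: that the per-instruction and per-cycle energy averages were taken over the same workload, and that the 5 pJ/bit figure applies uniformly to all 719745 bytes if they were carried over the SoC bus.

```python
# Figures quoted in the text.
ENERGY_PER_INSTRUCTION_PJ = 26.42
ENERGY_PER_CYCLE_PJ = 16.10
BYTES_TRANSFERRED = 719_745
BUS_ENERGY_PER_BIT_PJ = 5

# Implied average cycles per instruction (assumes both averages share a workload).
implied_cpi = ENERGY_PER_INSTRUCTION_PJ / ENERGY_PER_CYCLE_PJ

# Estimated SoC bus energy for the SenseToRfm traffic (assumes a uniform 5 pJ/bit).
bus_energy_pj = BYTES_TRANSFERRED * 8 * BUS_ENERGY_PER_BIT_PJ
bus_energy_uj = bus_energy_pj / 1e6  # 1 uJ = 1e6 pJ
```

The implied average is roughly 1.64 cycles per instruction, and the estimated bus energy is about 28.8 μJ over the 300-second simulation.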
The gate-level estimated energy consumption averaged over all applications in the workload for various architectural instances is listed in Table X and shown in Figure 5 . All three processor core variants are considered. All of the processor variants were assessed with and without the RF-PU. Adding the RF-PU to the baseline processor (instance 1 to instance 4) provides the largest power savings (24%). Moving from the baseline processor to the BlueDot processor in compatible mode (instance 1 to 2) provides comparable savings (20%). Switching from compatible to advanced mode (instance 2 to 3) provides a small energy saving (4%). The full BlueDot platform with the processor operating in advanced mode provides an energy saving of 48%.
The memory footprint improvement of the advanced mode BlueDot processor over the Atmel ATmega128L is 3.0%-5.75%, depending on application.
CONCLUSION
This article described an application-specific processor platform for TinyOS-based WSN nodes. The platform provides significant energy savings but maintains near binary compatibility with the conventional ATmega128L microcontroller. The article describes a workload analysis of WSN applications, the processor platform architecture, an application-specific processor core, the SoC bus interconnect, and the RF interface hardware accelerator. The platform was validated in an FPGA embedded in a WSN mote. The energy savings due to the various energy optimizations were assessed and compared with conventional implementation on a microprocessor with an ATmega128L ISA. The BlueDot platform consumes 48% less energy than the baseline ATmega128L-equivalent processor when executing the same WSN application suite. The work shows that minor changes to the ISA and offload of a small, but frequently used, function can provide significant energy savings. Making minor changes to the ISA has the significant advantage of ensuring near binary compatibility, which eases adoption of the new platform.
The speedups introduced in this work may allow for further energy savings in other mote components. For example, faster interfacing may allow the radio to be switched off more quickly, thus saving energy consumption in the radio. The authors intend to investigate the impact of these processor optimizations on higher-layer workloads. The authors also plan to extend the work to investigate the effect of circuit-level optimizations. The ISA tuning approach used herein is generally applicable; we plan to study its use in other application areas.
