A microcontroller specifically tailored to the processing needs and energy budget of battery-powered sensor-based microsystems is described. The microsystem controller contains both a microprocessor core and a data computation engine to support the operation control and data processing needs common to sensor systems.
This paper presents a microcontroller designed specifically for the features and performance demands of sensor-based microsystems. Because controller power consumption is typically a limiting factor in battery-powered microsystems [2], the reported microsystem controller maps several key functions into energy-optimized hardware to extend microsystem operating lifetime. The controller contains a microprocessor core and a data computation engine, which share a memory block to pass data back and forth. These two primary blocks are described in turn.
MICROPROCESSOR CORE
The processor was designed to work in conjunction with a network of sensors, whose requirements were used to define the features of the processor. These sensors measure analog data from the environment that are then quantified into measurements of pressure, temperature, humidity, etc. The role of the processor is to monitor these sensor nodes and arbitrate the processing of the data received from them. To minimize the complexity (and therefore the size and power) of the controller, a Reduced Instruction Set Computer (RISC) load-store processor was chosen over a Complex Instruction Set Computer (CISC). The low-power design requirements of the microsystem application called for architectural and circuit-level techniques to reduce the processor power consumption [3]. A power-saving standby mode was incorporated to cater to the sporadic operation common in sensor-based systems. Other features, such as gated clocks and interrupts, were used to further reduce the power consumption of the processor. A Harvard architecture, with independent program and data memory, was chosen over the von Neumann architecture. This allows the instruction and data memories to be accessed simultaneously, thereby increasing the memory bandwidth. Keeping the instruction and data memories separate also simplifies memory access, which improves the speed of the processor and removes the need for dual-ported memory.
Architecture
The processor architecture is shown in Figure 2. The CPU has a 16-bit arithmetic logic unit and a register file that consists of 8 data registers. The CPU has additional registers, such as the program status register and the condition code register, that are used to set flags. To control the flow of execution, the CPU uses a 10-bit program counter and a 10-bit memory address register. The CPU is pipelined into three stages to improve the throughput of the processor and control the sequence of operations in the processor. The instruction memory is synthesized as a 2 Kbyte (1024 x 16 bit) static RAM and the data memory as a 512 byte (256 x 16 bit) RAM block. In addition to the instruction and data memory, a separate 1 Kbyte (256 x 32 bit) sensor data memory has been designed to store sensor data and interface with the data correction unit. The I/O ports are divided into a general-purpose I/O port, used to interface with blocks such as a Direct Memory Access (DMA) controller, and a dedicated sensor bus I/O port, designed to implement the Intra Module Multielement (IM²) bus. The IM² bus is based on the IEEE 1451.2 standard, modified to support networks of sensor nodes within a microsystem module [1]. The sensor bus port maps to hardware what would otherwise be a complex set of instructions needed for communication between the processor and the sensor nodes.
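The memory sizes and address widths quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration; the helper names `memory_bytes` and `address_bits` are hypothetical and not part of the design.

```python
def memory_bytes(words, bits_per_word):
    """Capacity in bytes of a memory organized as words x bits_per_word."""
    return words * bits_per_word // 8


def address_bits(words):
    """Number of address bits needed to select one of `words` locations."""
    return (words - 1).bit_length()


# (words, bits per word) as stated in the text
instruction_memory = (1024, 16)
data_memory = (256, 16)
sensor_data_memory = (256, 32)

for name, (words, width) in [
    ("instruction", instruction_memory),
    ("data", data_memory),
    ("sensor data", sensor_data_memory),
]:
    print(name, memory_bytes(words, width), "bytes,",
          address_bits(words), "address bits")
```

Under this reading, the 10-bit program counter addresses exactly the 1024 instruction words.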
The sleep unit in the processor is used to shut down the processor and manage a sleep mode to save power. During normal operation, the processor periodically issues a request for data to a sensor node and then processes the data received from that sensor. In sleep mode the processor clock is turned off, and the sensor ports are monitored for an interrupt that turns the processor back on. After receiving an interrupt, the processor returns to normal operation mode and resumes executing instructions.
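The sleep behavior described above can be pictured as a two-state machine. The following is a minimal behavioral sketch, not the actual hardware. The class and method names are hypothetical, and it assumes that instruction execution resumes on the clock tick after the waking interrupt.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SLEEP = "sleep"


class SleepUnit:
    """Behavioral model of the sleep/wake control (names are illustrative)."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.clock_enabled = True
        self.cycles_executed = 0

    def enter_sleep(self):
        # Sleep mode gates the processor clock off.
        self.mode = Mode.SLEEP
        self.clock_enabled = False

    def tick(self, interrupt=False):
        if self.mode is Mode.SLEEP:
            if interrupt:
                # An interrupt on the sensor port turns the clock back on.
                self.mode = Mode.NORMAL
                self.clock_enabled = True
            return  # assumption: execution resumes on the next tick
        self.cycles_executed += 1  # normal operation: execute instructions


unit = SleepUnit()
unit.tick()                 # one active cycle
unit.enter_sleep()
unit.tick()                 # clock gated, nothing executes
unit.tick(interrupt=True)   # wake-up
unit.tick()                 # active again
print(unit.mode, unit.cycles_executed)
```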
General Purpose & Sensor Bus I/O Ports
The general-purpose I/O port consists of a 16-bit data line and a 10-bit address line. The processor supports direct memory access (DMA) operations and provides signals such as dma_req, dma_ack and dma_rw on the I/O port. These signals are used to perform DMA operations, which can read data from and write data to the memory block. A flag in the program status register indicates when a DMA operation is in progress.
A special port was implemented in the processor to arbitrate communication between the controller and the front-end sensor nodes using the IM² sensor bus protocol [1]. The signals, functions, and drivers for the IM² bus are listed in Table 1. The required functions of all communication signals (i.e., not power signals) are implemented in the sensor bus port of the processor. In addition to managing serial data transfer and handshaking signals, the sensor bus port monitors the NINT line for interrupts and implements software-controlled asynchronous triggering on NTRIG.
In addition to the typical Arithmetic, Load/Store, Logical, Control/Branch, and Rotate/Shift instructions, the processor implements several special instructions, defined in Table 2. The Snd and Rcx instructions take two clock cycles to complete, so a stall propagates through the pipeline after either of these instructions is executed. The Str instruction has to be executed immediately after Rcx to store the received data in the sensor data memory. The Pull instruction selects between NIOE and NTRIG with bits [10:8] of the instruction and chooses between pulling the line high or low with bit [0] of the instruction.
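As an illustration of the Pull field layout, the sketch below extracts bits [10:8] and bit [0] from a 16-bit instruction word. Only the bit positions come from the text. The selector codes in `PULL_LINES`, the convention that bit [0] = 1 means "high", and the function name `decode_pull` are assumptions made for the example.

```python
# Hypothetical selector codes: the text gives the field position, not its values.
PULL_LINES = {0b001: "NIOE", 0b010: "NTRIG"}


def decode_pull(instruction, lines=PULL_LINES):
    """Return (line, level) for a Pull instruction word."""
    instruction &= 0xFFFF                 # 16-bit instruction word
    select = (instruction >> 8) & 0b111   # bits [10:8]: which line to drive
    level = instruction & 0b1             # bit [0]: pull direction
    if select not in lines:
        raise ValueError(f"unknown line selector {select:#05b}")
    # Assumption: 1 pulls the line high, 0 pulls it low.
    return lines[select], "high" if level else "low"


print(decode_pull(0b010_0000_0001))
print(decode_pull(0b001_0000_0000))
```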
DATA COMPUTATION ENGINE
The other primary block of the controller is designed to process sensor data. It can be configured for general computation but includes special hardware to perform sensor data correction in an energy efficient manner.
Data Correction Unit
Sensor signals contain various types of error inherent in the sensor manufacturing process. Correcting offset (non-zero output for a zero input condition) and nonlinearity (nonlinear output to a linear input) is known as calibration, while correcting cross-sensitivity (output that depends on parameters other than the one being measured) is known as compensation. These error correction processes can consume up to 40% of the energy spent by a microsystem, and correcting these errors can add up to 50% to the cost of the microsystem. Error correction is typically performed using an off-the-shelf microcontroller [4, 5]. However, the significant impact that calibration and compensation can have on the cost and power budget of a microsystem warrants building effective and efficient error correction techniques into the microsystem controller.
The data correction unit, shown in Figure 3, performs the calibration and compensation operations prescribed by the IEEE P1451.2 standard by evaluating a single multinomial [5]. The entire operating region of the sensor is divided into many segments defined by the calibration Transducer Electronic Data Sheet (TEDS), and in each segment, the signal transfer curve is represented by a multinomial given by: 
Floating Point Unit
The floating-point arithmetic units are deeply pipelined to perform error correction in a short period of time. The input signals are pre-normalized before the integer multiplication and addition stages. The results produced by the multiplier and the adder are post-normalized to maintain the accuracy of the floating-point computations. Before performing the actual floating-point multiplication and summation, the significance of each sum term in the multinomial is predicted by calculating the tentative exponent of that term. To avoid loss of accuracy in the floating-point summation, the summation is performed starting from the term of least significance. Alternatively, if the summation is started from the term of largest significance, a coarse result can be obtained with fewer computations. This method can save power by halting the computation when successive terms will not add significant value to the result. The computation engine can be configured to operate in either of these modes, allowing the user to make the accuracy versus power tradeoff.
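The two summation orders can be sketched in software as follows. This is a rough illustration only. It assumes each sum term is a product of factors, estimates the tentative exponent by adding the binary exponents of the factors (so no multiplication is needed), and uses a hypothetical `cutoff_bits` threshold as the halting rule, since the text does not state the actual criterion.

```python
import math


def tentative_exponent(factors):
    """Estimate a product's binary exponent by adding the factors' exponents."""
    return sum(math.frexp(f)[1] for f in factors)


def sum_terms(terms, mode="accurate", cutoff_bits=24):
    """Sum products of factors; returns (total, number of terms evaluated).

    mode="accurate": add the least significant terms first.
    mode="coarse":   add the most significant terms first and halt once a
                     term lies more than cutoff_bits below the largest one
                     (the threshold is an assumption of this sketch).
    """
    nonzero = [t for t in terms if all(f != 0.0 for f in t)]
    ranked = sorted(nonzero, key=tentative_exponent)  # least significant first
    if mode == "accurate":
        total = 0.0
        for term in ranked:
            total += math.prod(term)
        return total, len(ranked)
    ranked.reverse()  # most significant first
    total, used = 0.0, 0
    top = tentative_exponent(ranked[0]) if ranked else 0
    for term in ranked:
        if top - tentative_exponent(term) > cutoff_bits:
            break  # remaining terms are too small to matter; stop early
        total += math.prod(term)
        used += 1
    return total, used


terms = [[1.0, 2.0], [2.0 ** -30, 1.0], [0.5, 0.5]]
print(sum_terms(terms, mode="accurate"))
print(sum_terms(terms, mode="coarse"))
```

In the coarse mode the smallest term is never multiplied out, which is where the saving in computation comes from.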
Hardware Sorter
A hardware sorter capable of efficiently sorting sixteen 8-bit wide unsigned integers has been designed to sort the tentative exponents of the sum terms (Figure 4). When a new integer arrives at the sorter, it is compared with all the existing integers. The comparator produces a '1' if the existing integer is smaller than the new integer. Hence the number of 1's in the comparator results represents the rank of the new integer. Once the rank of the new integer is determined, the rank of every existing integer whose rank is greater than or equal to that of the new integer is incremented. Thus, at any point in time, the ranks of the integers are unique. The new integer is stored in the next vacant register along with its rank.
Because the required operations are parallel, finding the rank of a new integer and updating the ranks of all other integers takes no more than one clock cycle. The scheme therefore maps efficiently onto a pipelined hardware implementation for the data correction unit. Moreover, it requires none of the time- and energy-consuming register-swapping operations used in software sorting schemes. The integers are stored in content addressable memory (CAM) locations with the rank as the 'key' field. If an integer of a particular rank is needed, the rank is given as input to the CAM and the integer is obtained by matching the input against the ranks of all existing integers. Using this scheme, the integers can be recalled from the sorter in either ascending or descending order by giving the output of an up or down counter as the rank input.
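The ranking and recall scheme can be modeled in software as shown below. This is a sequential sketch of what the hardware does in parallel. The class name `RankSorter` and its methods are hypothetical, while the capacity of sixteen 8-bit entries follows the text.

```python
class RankSorter:
    """Software model of the rank-based sorter (illustrative names)."""

    def __init__(self, capacity=16, width=8):
        self.capacity = capacity
        self.max_value = (1 << width) - 1
        self.values = []  # registers, filled in arrival order
        self.ranks = []   # rank ("key" field) stored with each register

    def insert(self, value):
        if len(self.values) >= self.capacity:
            raise OverflowError("sorter is full")
        if not 0 <= value <= self.max_value:
            raise ValueError("value does not fit the register width")
        # Each comparator outputs 1 if the existing integer is smaller;
        # the number of 1's is the rank of the new integer.
        rank = sum(1 for v in self.values if v < value)
        # Increment every existing rank that is >= the new rank.
        self.ranks = [r + 1 if r >= rank else r for r in self.ranks]
        # Store the new integer in the next vacant register; no swapping.
        self.values.append(value)
        self.ranks.append(rank)
        return rank

    def recall(self, rank):
        # CAM-style lookup: match the input against every stored rank.
        for value, stored_rank in zip(self.values, self.ranks):
            if stored_rank == rank:
                return value
        raise KeyError(rank)

    def ascending(self):
        # Rank input driven by an up counter.
        return [self.recall(r) for r in range(len(self.values))]

    def descending(self):
        # Rank input driven by a down counter.
        return [self.recall(r) for r in reversed(range(len(self.values)))]


sorter = RankSorter()
for exponent in [5, 5, 3, 7]:
    sorter.insert(exponent)
print(sorter.ascending())
print(sorter.descending())
```

Note that equal integers still receive distinct ranks, because the strict "smaller than" comparison places a new duplicate just below the existing one.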
Re-configurable Computation
The data correction unit can adapt to a variety of operations in addition to error correction. It can be programmed to work as a dedicated integer/floating-point multiply-and-accumulate unit with the hardware sorter shut down. It can also switch to a rounding scheme in which the rounding bit is forced to zero, bypassing the post-normalization blocks entirely and completing computations in fewer clock cycles. The microcontroller can take control of all hardware resources to operate the unit as a general-purpose floating-point co-processor for filtering operations and data fusion algorithms.
SUMMARY OF ENERGY EFFICIENCY
The microsystem controller was implemented in a top-down design process using a custom library of energy-optimized cells. For example, since flip-flops are used extensively in the pipelined microprocessor and data correction unit, the energy demands of five different flip-flop structures (in-house flip-flop with and without reset, push-pull isolation flip-flop [6], transmission gate flip-flop and a regular master-slave flip-flop) were thoroughly analyzed (Figure 5) and the two most efficient structures were included in the cell library. Significant energy savings are provided by the configurable sorting options, rounding schemes, and perturbation-only calculations within the data correction unit. Clock gating schemes and sleep mode operations, which shut down the unused blocks, further contribute to the low energy objective.
CONCLUSION
A microcontroller developed for sensor network control and data computation applications has been described. The controller implements several microsystem-specific instructions and utilizes a configurable computation engine to process sensor data. A novel data correction engine, which implements several techniques to maximize energy efficiency, was presented. The controller simplifies microsystem implementation while facilitating the use of sensors that require complex calibration and compensation.
