Abstract-In this paper, we present a hardware accelerator for receiving custom-protocol data in networked sensors. The accelerator is built on a novel architecture that we propose, which gives it the flexibility to be composed into application-specific protocols to improve communication rates and synchronization accuracy.
I. INTRODUCTION
A networked sensor is a node in a wireless sensor network. It is a device that integrates communication, power sources, sensors and actuators with computational elements in a very small physical size.
For a networked sensor, the fundamental constraint is its energy consumption, since it may be impossible to replace its energy source. In a wireless sensor node, the radio consumes the vast majority of the system energy [1]. This power consumption can be reduced by decreasing the radio duty cycle, and increasing the bit rate is the primary way to do so. By increasing the bit rate without increasing the amount of data being transmitted, transmission time decreases and the radio can remain off as much as possible [2]. However, a high bit rate requires a high receiving capability: the microcontroller (MCU) has to run at a high frequency, which may itself increase the power consumption greatly.
In a system with a fixed protocol, such a problem can be solved by using an ASIC to handle low-level data receiving and transmitting, but the wide range of applications of wireless sensor networks makes it difficult to develop a single protocol. The hardware platforms for networked sensors must support a suite of application-specific protocols to achieve system-level optimization. In systems based on commercial MCUs, communication tasks have to be processed by software at a very low level (bit level) due to the lack of efficient hardware support. This not only takes too much CPU time but also consumes a significant amount of energy, since general-purpose CPUs do not have an efficient instruction set and data-path for low-level protocol
For a system with a standard protocol, all the tasks mentioned can be processed by hardware automatically, but for a system using a custom protocol, we expect the accelerator to provide efficient implementations of the low-level operations that are inefficient on a general-purpose data-path. As the most inefficient operations for a general-purpose CPU are the bit-level operations, and most of these occur in signal encoding and decoding tasks, we discuss some familiar signal encoding and receiving methods. Although there may seem to be too many signal encodings to process with a single circuit, an analysis of several typical encodings shows that many of them can be received with very similar methods. There are 3 kinds of encodings in Figure 1. The first encoding we must process is Non-return to Zero (NRZ) code, which is a straight binary encoding widely used in MCU-based systems. Although NRZ code cannot be used directly for wireless transmission, since a DC component of zero must be maintained for a radio receiver, it can be modulated into FSK or PSK signals before transmission, after which the demodulated signals can be processed.
Manchester code is another very popular encoding used in networked sensors. Manchester encoding is a synchronous clock encoding technique used to encode the clock and data of a synchronous bit stream. In this technique, the actual binary data to be transmitted are not sent as a sequence of logic 1's and 0's as in NRZ code. Instead, the bits are translated into a slightly different format that has a number of advantages over straight binary encoding. Manchester code encodes the logical value "0" as a low-to-high transition and "1" as a high-to-low transition. When there is no transition at the middle point of a bit, the symbol is called "Non-data high" or "N+" if the logical level is high, and "Non-data low" or "N-" if the logical level is low [4]. These special symbols are very useful because they can be used to define the start delimiter and the end delimiter that identify the boundaries of data packets. In effect, Manchester encoding uses two bits of binary code to represent one bit of data: "0", "1", "N+" and "N-" are represented by "01", "10", "11" and "00" respectively, so we should sample the signal twice when receiving one data bit. Non-return to Zero Inverted (NRZI) code is another encoding that can be used. In such a system, logical "1" is defined as any transition and "0" as no transition.
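The Manchester sample-pair mapping described above can be sketched in Python as follows. This is only an illustrative software model, not the accelerator circuit. The name `decode_manchester_pairs` is hypothetical, and the sketch assumes the input already holds exactly two samples per bit, aligned to bit boundaries.

```python
# Illustrative model only: two samples per bit, already aligned to bit boundaries.
MANCHESTER_SYMBOLS = {
    (0, 1): "0",   # low-to-high transition
    (1, 0): "1",   # high-to-low transition
    (1, 1): "N+",  # non-data high (no transition, level high)
    (0, 0): "N-",  # non-data low (no transition, level low)
}


def decode_manchester_pairs(samples):
    """Map consecutive sample pairs to Manchester symbols."""
    if len(samples) % 2 != 0:
        raise ValueError("expected two samples per bit")
    return [
        MANCHESTER_SYMBOLS[(samples[i], samples[i + 1])]
        for i in range(0, len(samples), 2)
    ]
```

For example, the sample sequence 0,1,1,0,1,1,0,0 maps to the symbols "0", "1", "N+", "N-".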
When receiving a signal encoded with NRZI code, as with Manchester code, we should also sample the signal twice to detect the transition, but the sample values "01" and "10" are both logical "1", and "00" and "11" are both logical "0".
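A minimal sketch of this NRZI rule, under the same assumption of two aligned samples per bit (one on each side of the possible transition point), is given below. The function name `decode_nrzi_pairs` is hypothetical.

```python
def decode_nrzi_pairs(samples):
    """Decode NRZI from sample pairs: a transition is 1, no transition is 0."""
    if len(samples) % 2 != 0:
        raise ValueError("expected two samples per bit")
    bits = []
    for i in range(0, len(samples), 2):
        before, after = samples[i], samples[i + 1]
        # "01" and "10" both contain a transition; "00" and "11" do not.
        bits.append(1 if before != after else 0)
    return bits
```

The only difference from the Manchester case is the interpretation of the pair, which is why the same two-sample receiving path can serve both encodings.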
The most complex encoding we considered in this design is 4B/5B encoding, where 4 bits of actual data are encoded into a 5-bit code to ensure that there are no long runs of consecutive "1"s or "0"s. 4B/5B code can be transmitted using NRZ or NRZI, so 4B/5B data are received with the corresponding NRZ or NRZI method.
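The 4-bit to 5-bit translation can be illustrated with a small lookup table. The paper does not state which code table its protocols use, so the sketch below assumes the widely used FDDI / 100BASE-X data code groups. The function names and the choice to send the high nibble first are also assumptions made for illustration.

```python
# Assumed table: FDDI / 100BASE-X data code groups, indexed by nibble value.
ENCODE_4B5B = [
    "11110", "01001", "10100", "10101",
    "01010", "01011", "01110", "01111",
    "10010", "10011", "10110", "10111",
    "11010", "11011", "11100", "11101",
]
DECODE_5B4B = {code: nibble for nibble, code in enumerate(ENCODE_4B5B)}


def encode_byte(value):
    """Encode one byte as two 5-bit code groups (high nibble first, assumed)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("value must fit in one byte")
    return ENCODE_4B5B[value >> 4] + ENCODE_4B5B[value & 0xF]


def decode_byte(code_bits):
    """Decode a 10-character string of '0'/'1' back into one byte."""
    if len(code_bits) != 10:
        raise ValueError("expected two 5-bit code groups")
    try:
        high = DECODE_5B4B[code_bits[:5]]
        low = DECODE_5B4B[code_bits[5:]]
    except KeyError:
        raise ValueError("invalid 5-bit code group") from None
    return (high << 4) | low
```

With this table, `encode_byte(0xA5)` yields the code groups for A and 5 in sequence, and `decode_byte` rejects any 5-bit group that is not a data code.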
Several conclusions follow from the above analysis. The encodings differ primarily in the clock frequency selection, the bit-synchronization delay time, the location of the sample points and the sampling method, because a receiving process can only begin once a transition is detected. For NRZ code, the receiving clock frequency must be at least double that of the signal clock; we call this a 2X clock. The sample point should be located at the middle of a bit, as shown in Figure 2. For any encoding that uses transitions to represent data, we should sample at least twice, with the sample points located on both sides of the transition point, so the receiving clock frequency must be at least 4 times that of the transmitting clock (4X clock). To obtain accurate synchronization, however, the edge detection part should use a much higher clock frequency, such as a 16X or 32X clock. Because there may be considerable noise in our application environment, a digital filter should also be used. We therefore select a 16X clock for the accelerator, so that the signal can be sampled five times at a single sample point. The drawback of a 16X clock is that power consumption is proportional to clock frequency, while the accelerator's power must remain low compared with the other components, so low power must be considered early in the design flow. To meet these constraints, the most important task is the design of the architecture, which must be flexible and power saving.
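The effect of taking five samples at one sample point under a 16X clock can be sketched as a majority vote. The exact sample positions are defined by Figure 2, which is not reproduced here, so the positions used below (mid-bit for NRZ, quarter and three-quarter points for transition codes) are assumptions, as are all names in the sketch.

```python
OVERSAMPLE = 16  # 16X receiving clock: 16 samples per bit period
WINDOW = 5       # five samples are taken at each sample point


def filtered_sample(bit_samples, center):
    """Majority vote over WINDOW samples centered on index `center`."""
    start = center - WINDOW // 2
    if start < 0 or start + WINDOW > len(bit_samples):
        raise ValueError("sample window falls outside the bit period")
    window = bit_samples[start:start + WINDOW]
    return 1 if sum(window) > WINDOW // 2 else 0


def sample_nrz_bit(bit_samples):
    """NRZ: one sample point, assumed to sit at the middle of the bit."""
    return filtered_sample(bit_samples, OVERSAMPLE // 2)


def sample_transition_bit(bit_samples):
    """Transition codes: one sample point on each side of mid-bit (assumed)."""
    return (
        filtered_sample(bit_samples, OVERSAMPLE // 4),
        filtered_sample(bit_samples, 3 * OVERSAMPLE // 4),
    )
```

A single corrupted sample inside a window does not change the result, which is the purpose of the digital filter.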
IV. THE ARCHITECTURAL DESIGN
The architecture is designed based on the following principles:
* Configurable by users. Serial communication parts used in MCUs, such as USART or SPI, are designed for fixed protocols: most of their registers are transparent to the CPU, and only some interface registers can be accessed by users. In contrast, the accelerator should have an open architecture in which most registers and connections are controlled by the CPU.
* Scalable for power optimization. As the requirements of different protocols can differ greatly, some components may not be needed by the protocol actually in use. The clocks of unused parts must be gated off to reduce the power consumption [4]. Even when most components are used, it is better if they work in a "one by one" mode, in which only a small part of the circuit is actually working at any time.
Based on these principles, we propose a novel architecture in which the overall circuit is made up of a set of function units (FUs). Each FU has a CPU interface and connects to the signal buses we define. A register in each FU configures its connections to the signal buses.
We abstract the circuit units into five types. The first type is the bFU (bit out function unit), through which the incoming bit stream can shift in and shift out. The second type is the pFU (phase function unit), which generates a control signal to decide at what phase in a cycle an FU can work. The third type is the eFU, which can generate an event to change the working state. The fourth type is the sFU, which is the state machine. There is only one sFU in a system, but there may be many FUs of the other types. The final type is the dFU, into which data shift but which has no event signals connected to any FU.
We define three types of signal buses to connect these FUs. The first is the bBUS, which is the set of bit-stream signals that may shift into an FU. The second is the pBUS, which comprises all phase signals that control the action of an FU. The third is the eBUS, which is the collection of event signals.
There are three MUXs in the interface of each FU, and a register (connected to the CPU) selects the actual input signals, so that the connections are configurable.
Some of the FUs are shown in Figure 4. As matching a symbol sequence is a typical protocol processing task, an FU for matching at the bit-stream level is designed ((a) in Figure 3). For scalability, all of its registers are 8 bits long. The SR is a shift register. When the "en" signal is active, the bit-stream data shift in. When the data in the SR are equal to the data in register RA, the FU produces a pulse for one clock cycle to indicate a matching event. It is therefore an eFU, and is named emFU. RA and RB exchange data in the next clock cycle if "en2" is configured to connect to the event of the FU itself, which enables this FU to be used twice in an auto-processing sequence.
When the incoming bit-stream is not encoded in NRZ code, the "deFU" is used at the data receiving stage. Several typical extracting methods are available, including Manchester code, NRZI code and 4B/5B code, which can be selected by a MODE register. To reduce power consumption, the shift register in the "deFU" is designed to be 10 bits long for 4 bits of data. Two deFUs are used for a byte; since they do not work at the same time, the power consumption can be controlled by clock gating [5].
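The behavior of the matching unit (emFU) described above can be approximated by the following cycle-level software model. It is a sketch under stated assumptions, not the actual circuit: the class and attribute names are hypothetical, bits are assumed to shift in from the least significant end, and `swap_on_match` stands for the case in which "en2" is connected to the FU's own event.

```python
class MatchUnit:
    """Software model of an 8-bit matching FU with registers SR, RA and RB."""

    def __init__(self, ra, rb, swap_on_match=False):
        self.sr = 0
        self.ra = ra & 0xFF
        self.rb = rb & 0xFF
        self.swap_on_match = swap_on_match
        self._swap_pending = False

    def clock(self, bit, en=True):
        """Advance one clock cycle; return True for a one-cycle match pulse."""
        if self._swap_pending:
            # RA and RB exchange data in the cycle after the match event.
            self.ra, self.rb = self.rb, self.ra
            self._swap_pending = False
        if not en:
            return False
        # Shift direction is an assumption of this model.
        self.sr = ((self.sr << 1) | (bit & 1)) & 0xFF
        match = self.sr == self.ra
        if match and self.swap_on_match:
            self._swap_pending = True
        return match
```

Loading RA with one pattern and RB with another lets the same unit match the first pattern, swap, and then match the second, which is how one FU can be used twice in a sequence.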
The ecFU is an auto-reload counter, designed to be 4 bits long for the same reason as the "deFU". When the counter overflows, the initial value is reloaded from RB into CNT, so the counter can work continuously.
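A small model of such an auto-reload counter is sketched below. The paper does not specify the counting direction or the exact cycle in which the overflow event is raised, so the up-counting behavior and the event timing here are assumptions, and the class name is hypothetical.

```python
class AutoReloadCounter:
    """Model of a 4-bit counter CNT that reloads from RB on overflow."""

    MAX = 0xF

    def __init__(self, reload_value):
        self.rb = reload_value & self.MAX
        self.cnt = self.rb

    def clock(self):
        """Advance one clock; return True in the cycle the counter overflows."""
        if self.cnt == self.MAX:
            self.cnt = self.rb  # reload so that counting continues
            return True
        self.cnt += 1
        return False
```

In this model the event period is 16 minus the reload value, so the value in RB sets the delay between events.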
The peFU starts when a falling edge of its input signal is detected, and it generates 3 phase signals in the next 3 clock cycles to control other FUs.
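The peFU behavior can be modeled as an edge detector followed by a three-step sequencer. The sketch below is an assumption-laden illustration: the one-hot representation of the three phase signals, the initial input level, and the choice to ignore edges while a sequence is running are not taken from the paper.

```python
class PhaseUnit:
    """Model of a unit that emits 3 phase pulses after a falling edge."""

    def __init__(self):
        self.prev = 0        # previous input level (initial value assumed)
        self.remaining = 0   # phase pulses still to be emitted

    def clock(self, level):
        """Advance one clock; return the three phase signals as a tuple."""
        phases = [0, 0, 0]
        if self.remaining > 0:
            phases[3 - self.remaining] = 1
            self.remaining -= 1
        falling = self.prev == 1 and level == 0
        if falling and self.remaining == 0:
            self.remaining = 3  # phases appear in the next 3 cycles
        self.prev = level
        return tuple(phases)
```

Each phase signal can then enable a different FU, so that the controlled units operate one after another rather than simultaneously.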
The sFU is composed of several shift registers, and it works as a programmable state machine.
Some other FUs are also used but are not shown in Figure 4. The "bfFU" is a digital filter composed of an accumulator, a counter and a register, all 4 bits long. The "psFU" is a sample point generator, implemented as a 16-bit circular shift register. In addition, there is a clock generator to produce the 16X clock.
All FUs share a clock and a reset signal. As each FU has an "enable" signal and most of them are controlled by phase signals, the power consumption can be optimized by logic synthesis tools using clock gating [6].
When a match or overflow event occurs, it is recorded in registers and generates an interrupt request, which enables the user to perform processing not supported by the hardware. In the ideal situation, the accelerator can be configured to process a complete receiving task automatically, and these interrupts can be masked.
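A 16-bit circular shift register used as a sample point generator can be modeled as follows. The bit pattern in the usage note, the rotation direction and the choice of the most significant bit as the output are assumptions made for illustration; the class name is hypothetical.

```python
class SamplePointGenerator:
    """Model of a 16-bit circular shift register clocked at 16X."""

    def __init__(self, pattern):
        self.reg = pattern & 0xFFFF

    def clock(self):
        """Rotate left by one bit and return the bit that was at the top."""
        out = (self.reg >> 15) & 1
        self.reg = ((self.reg << 1) | out) & 0xFFFF
        return out
```

Loading the pattern `0b0000001111100000` would mark five consecutive 16X clock cycles near the middle of every bit period as sample cycles, and the pattern repeats every 16 clocks without any further control.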
V. FUNCTION EVALUATION
To evaluate the function of the accelerator, we wrote a prototype of the proposed architecture in a hardware description language. Several protocols using the 4 kinds of encoding mentioned in Section III have been tested. For any protocol that uses Manchester code, the accelerator can receive a complete data packet when the start and end delimiters are each defined as any pattern not longer than 16 bits. For the other encodings, it can automatically match any start symbol of up to 32 bits.
A typical configuration, which is an automatic processing procedure, is shown in Figure 5. First, all registers must be initialized before a receiving procedure starts. After the start, beFU1 continually looks for a falling edge of the RXD signal. When one is detected, it starts ecFU1 to generate a delay for bit-synchronization; then the psFU is started, which runs continuously until reset. The bfFU (filter) accumulates the logical value of the RXD signal whenever the output of the sample point generator is "1". Because the sample point generator is configured as shown in Figure 2, the RXD signal is accumulated five times at each sample point. The peFU detects the falling edge of the psFU output. When a falling edge is found, it produces 3 phase signals to control the other FUs. emFU1 and emFU2 are used to detect the start delimiter and the end delimiter, each of which is 16 bits of Manchester code. The high bytes of the start and end delimiters are stored in the A and B registers of emFU1 respectively, and the low bytes in those of emFU2. In the first state, only emFU1 is enabled. When its match signal becomes active, all registers in the sFU shift by one bit, which enables emFU2 and disables emFU1. At the same time, the A and B registers of emFU1 exchange data, so that when emFU1 is enabled again it will match the end delimiter. When the complete start delimiter is found, deFU1 and deFU2 are enabled in order to convert the Manchester code to NRZ data, and emFU1 is then enabled again to match the end delimiter. ecFU3 counts the number of bits shifted into deFU1 and deFU2. When a byte of data has been received, the DMA circuit reads the data and saves it in memory. When the end delimiter is found or a data error is detected, all FUs are reset to their initial states.
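The overall receiving sequence for a Manchester packet can be summarized by the following software sketch. It models only the logical steps (match the start delimiter, decode sample pairs into bytes, stop at the end delimiter or on a data error), not the FU wiring or timing. The function name, the list-based sample format with two samples per bit, and the most-significant-bit-first byte assembly are assumptions.

```python
MANCHESTER_BITS = {(0, 1): 0, (1, 0): 1}


def receive_packet(samples, start_delim, end_delim):
    """Return the payload bytes between the delimiters, or None on failure."""
    n = len(start_delim)
    pos = None
    # Stage 1: shift samples in until the start delimiter matches.
    for i in range(len(samples) - n + 1):
        if samples[i:i + n] == start_delim:
            pos = i + n
            break
    if pos is None:
        return None

    payload = bytearray()
    value, bit_count = 0, 0
    m = len(end_delim)
    # Stage 2: decode sample pairs until the end delimiter matches.
    while pos + m <= len(samples):
        if samples[pos:pos + m] == end_delim:
            return bytes(payload)
        pair = (samples[pos], samples[pos + 1])
        if pair not in MANCHESTER_BITS:
            return None  # data error: everything returns to the initial state
        value = (value << 1) | MANCHESTER_BITS[pair]
        bit_count += 1
        if bit_count == 8:
            payload.append(value)  # stands in for the DMA transfer to memory
            value, bit_count = 0, 0
        pos += 2
    return None
```

The delimiters in such a scheme would contain non-data symbols ("11" or "00" pairs), so that they cannot be confused with encoded payload bits.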
VI. KNOWN PROBLEMS AND FUTURE WORK
The advantage of this accelerator is its flexibility, but there is a drawback in chip area because many registers are used for configuration. Tradeoffs must be made in future work. We have found that only a few configurations are used for the protocols we have tested, so some hardware connections can be used between some special FUs without
VII. CONCLUSION
An accelerator for networked sensor applications is designed. A novel architecture is proposed for the design, in which both the data-path and the state machine are defined as configurable function units. This gives the accelerator the flexibility to be composed into application-specific protocols to improve communication rates, synchronization accuracy and, ultimately, power consumption. The efficiency of the accelerator is demonstrated with several typical protocols using different encodings and start symbols.
