Wireless sensor network nodes require low-power signal processing because of their limited battery life. We previously proposed a reconfigurable array of DSP acceleration functional units for such a sensor node. The array maximizes operating life by matching system power consumption to available energy through power-scalable approximate signal processing [1]. This paper presents the detailed architecture and implementation of the functional unit in the array. The use of low-power building blocks and bit-serial processing enables energy-scalable implementation of several DSP functions. Post-layout simulation of a semicustom implementation in 0.25 µm CMOS technology demonstrates a factor of three power scalability with input bitwidth for an FIR matched filter.
INTRODUCTION
Wireless sensor network technology promises to provide a new information gathering and distribution infrastructure, enabling numerous applications in medical monitoring, environmental science, and security. When wireless communication dominates the sensor node power consumption, processing the gathered information before transmitting it to other nodes is necessary to maximize operating lifetime. We have proposed an energy-scalable computational array for such sensor signal processing [1], which consists of functional units embedded in an island-style interconnect structure. The energy scalability of the computational array stems from the energy-scalable implementation of the functional units. The functional unit block diagram is shown in Fig. 1. FFT requires complex addition and multiplication. Low-power techniques at both the architecture and circuit levels have been applied to the functional unit design. At the architecture level, clock gating, block partitioning, guarded inputs, and memory banking reduce power consumption. At the circuit level, an SRAM-based multiported register file replaces a flip-flop-based input shift register and significantly reduces active power. The functional unit provides energy-scalable computation by varying (1) the input bitwidth, (2) the LUT word width, and (3) the number of operation iterations. A configuration word is dedicated to energy scalability control. In Section 2, we describe two custom circuit blocks that enable mechanisms (1) and (2). The implementation of mechanism (3) is described in Section 3. Results and conclusions are presented in Sections 4 and 5, respectively.
Fig. 4. SRAM-based input shift memory architecture.
One way to support power scalability via varying input bitwidth is to implement a shift register using flip-flops and bypass multiplexers [2]. Through appropriate clock gating and multiplexer input selection, the shift register offers a tradeoff between power and input bitwidth.
However, providing finer-grained power scalability requires multiple clock gating circuits and bypass multiplexers, and the incurred overhead may not be justified by the resulting power scalability. Moreover, the flip-flop-based shift register does not allow parallel load. For hardware constrained design, the input shift memory must
The memory cell is an augmented version of the LUT cell shown in Fig. 2, with the addition of one differential write port ydi and two read ports ydo and xdoe. The bitlines for the X and Y ports are routed orthogonally to allow parallel loading along the X direction and shifting of multiple parallel bit streams in the Y direction. The y-axis read port, YDO, addresses the LUT. A 2x3 memory block (Fig. 4) illustrates the architecture of the input shift memory. By controlling the activation sequence of the read and write signals on the wordlines, inputs of arbitrary bitwidth are supported, which provides bit-level power scalability with minimal overhead. When the read and write signals are asserted for half a clock cycle, the shift memory achieves the same throughput as its flip-flop-based counterpart. Fig. 5 shows the post-layout power simulation results for variable input bitwidth. SRAM-based shift memories offer an approximately 10X power reduction over flip-flops [3].
ENERGY SCALABLE SERIAL ARITHMETIC
In older VLSI technologies, bit-serial algorithms were used to reduce the area of arithmetic blocks. As CMOS scales, leakage current becomes a major contributor to power consumption. At the low frequencies (kHz to MHz) at which most sensor DSP applications operate, serial implementations consume less power than parallel ones because of their reduced transistor count [4]. Moreover, in serial arithmetic, the computation result is successively refined as more bits are processed, which naturally provides a power-precision tradeoff. In this section, we describe three functions to illustrate energy-scalable implementation using serial arithmetic.
Vector Dot Product
The vector dot product (VDP) is implemented using the bit-serial, word-parallel Distributed Arithmetic algorithm [5]. Consider the dot product in Eq. 1. Since each $b_{kn}$ equals 0 or 1 only, the bracketed term in Eq. 1 has $2^M$ possible values, which are precomputed and stored in a LUT. The variable vector traverses the input shift memory MSB first to address the LUT, whose contents are accumulated to obtain the outer sum of Eq. 1. For an N-bit $x_k$ vector, the final result $y$ is produced after N cycles. Suppose the bitwidth of the variable vector $x$ can be adjusted at single-bit granularity. Truncating each trailing bit of $x$ eliminates one shift, one table lookup, and one accumulator load. The truncated version can run at a lower clock speed while maintaining the same throughput, which also saves dynamic power. The precision degrades because of the increased input quantization noise. Our functional unit implements a 4-tap FIR filter using the VDP function. Fig. 6 shows the power consumption vs. event recognition scalability of an FIR matched filter for a biomedical monitoring application using a flip-flop-based input shift memory. Power consumption can be reduced further by using the SRAM-based input shift memory described in Section 2.2.
The computable range is further increased by shifting out the leading zeros (equivalent to scaling up x). The integer part of log x becomes int(log x) = N - lznum, where lznum is the number of leading zeros in x.
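For the distributed-arithmetic vector dot product described in this subsection, the following is a minimal Python sketch of the LUT-and-accumulate loop and of the effect of truncating trailing input bits. It is an illustration under stated assumptions, not the functional unit's actual datapath: the inputs are treated as unsigned integers for simplicity, and the names `build_lut`, `da_dot`, and `used_bits` are hypothetical.

```python
def build_lut(coeffs):
    """Precompute all 2**M partial sums of the M fixed coefficients.

    Bit k of the LUT address selects whether coeffs[k] is included.
    """
    m = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << m)
    ]


def da_dot(coeffs, xs, nbits, used_bits=None):
    """Bit-serial, word-parallel dot product of coeffs with xs.

    Assumption: each x in xs is an unsigned nbits-wide integer.
    Bits are processed MSB first. Setting used_bits < nbits drops the
    trailing LSBs, saving one lookup and one accumulation per dropped bit.
    """
    if used_bits is None:
        used_bits = nbits
    lut = build_lut(coeffs)
    acc = 0
    for n in range(nbits - 1, nbits - 1 - used_bits, -1):
        # Gather bit n of every input word to form the LUT address.
        addr = 0
        for k, x in enumerate(xs):
            addr |= ((x >> n) & 1) << k
        # Shift-and-accumulate: earlier (more significant) bits weigh double.
        acc = (acc << 1) + lut[addr]
    # Restore the weight of the bit positions that were skipped.
    return acc << (nbits - used_bits)
```

With `coeffs = [3, -1, 4, 2]` and `xs = [5, 9, 12, 7]` at 4 bits, the full-precision call reproduces the exact dot product, while `used_bits=2` returns the dot product of the inputs with their two trailing bits cleared, which is the quantization effect the text attributes to bitwidth truncation.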
Two levels of energy scalability can be obtained. At the coarse-grained level, disabling the LUT lower bank eliminates the linear approximation, which involves serial multiplication and addition. When linear approximation is activated, controlling the number of serial multiply iterations offers a fine-grained power-precision tradeoff. Fig. 8 illustrates the coarse-grained energy scalability.
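The coarse-grained tradeoff for the logarithm can be illustrated with a small Python sketch. Everything about the table organization here is an assumption made for illustration: the upper bank is taken to hold log2 of the normalized mantissa at the start of each segment, the lower bank is taken to hold the rise across that segment, and `ADDR_BITS`, `build_banks`, and `approx_log2` are hypothetical names. Floating point stands in for the fixed-point, serial multiply-and-add hardware.

```python
import math

ADDR_BITS = 3  # hypothetical LUT address width (8 segments)


def build_banks(addr_bits):
    """Return (upper, lower) tables for a mantissa m in [0.5, 1).

    upper[i]: log2(m) at the start of segment i (assumed upper bank).
    lower[i]: change in log2(m) across segment i (assumed lower bank).
    """
    segs = 1 << addr_bits
    upper, lower = [], []
    for i in range(segs):
        m0 = 0.5 + 0.5 * i / segs
        m1 = 0.5 + 0.5 * (i + 1) / segs
        upper.append(math.log2(m0))
        lower.append(math.log2(m1) - math.log2(m0))
    return upper, lower


UPPER, LOWER = build_banks(ADDR_BITS)


def approx_log2(x, nbits, use_lower_bank=True):
    """Approximate log2(x) for an unsigned nbits-wide integer x > 0.

    Requires nbits - 1 >= ADDR_BITS.
    """
    assert 0 < x < (1 << nbits)
    lznum = nbits - x.bit_length()      # number of leading zeros
    norm = x << lznum                   # shift out the leading zeros
    int_part = nbits - lznum            # x = m * 2**int_part, m in [0.5, 1)
    frac_field = norm - (1 << (nbits - 1))  # bits below the leading one
    rest_bits = nbits - 1 - ADDR_BITS
    idx = frac_field >> rest_bits       # segment index addresses the LUT
    result = int_part + UPPER[idx]
    if use_lower_bank:
        # Linear approximation inside the segment: one multiply, one add.
        rest = frac_field & ((1 << rest_bits) - 1)
        result += LOWER[idx] * rest / (1 << rest_bits)
    return result
```

Under these assumptions, calling `approx_log2(x, 8, use_lower_bank=False)` skips the multiply and add entirely at the cost of a larger approximation error, which mirrors the coarse-grained mode described above.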
Serial Multiply
The signed serial multiplication (SMUL) is computed by iteratively executing a sequence of adds and shifts based on the value of the LSB of the multiplier (mbit). If mbit = 1, the multiplicand is added to the MSBs of the partial product and the resulting value is right-shifted by one bit with the sign bit preserved; if mbit = 0, the partial product is only right-shifted. At the very last iteration, if mbit = 1 (the multiplier is negative), the complemented multiplicand is added to the partial product and the sum is right-shifted to produce the final product.
Fig. 9. SMUL average power vs. result quality (horizontal axis: number of non-energy-scaled computation iterations).
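The add-and-shift sequence described for SMUL can be sketched in Python as follows. This is a behavioral illustration only: Python integers stand in for the hardware registers, and the names `smul` and `skipped_iters` are hypothetical. In particular, modeling energy scaling by skipping the iterations for the least significant multiplier bits is an assumption, since the exact scaling scheme is not spelled out in the text.

```python
def smul(multiplicand, multiplier, nbits, skipped_iters=0):
    """Signed (two's complement) serial multiply by add and shift.

    multiplicand, multiplier: integers in [-2**(nbits-1), 2**(nbits-1) - 1].
    skipped_iters: assumed energy-scaling knob; the iterations for the
    lowest multiplier bits are not executed, so those bits act as zeros.
    """
    acc = 0  # upper, signed part of the partial product
    low = 0  # lower part, filled from the top as bits shift out of acc
    for i in range(skipped_iters, nbits):
        mbit = (multiplier >> i) & 1  # current LSB of the multiplier
        if mbit:
            if i == nbits - 1:
                # Last iteration: the sign bit has negative weight, so the
                # complemented multiplicand is added (a subtraction).
                acc -= multiplicand
            else:
                acc += multiplicand
        # Arithmetic right shift of the combined (acc, low) register.
        low = (low >> 1) | ((acc & 1) << (nbits - 1))
        acc >>= 1  # Python's >> preserves the sign
    return acc * (1 << nbits) + low
```

For 4-bit operands, `smul(a, b, 4)` matches `a * b` for every operand pair. With `skipped_iters=2`, the result equals the multiplicand times the multiplier with its two lowest bits cleared, which gives a feel for how result quality could degrade as iterations are removed.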
than 1101 x 1011. Number representation also impacts result accuracy. Energy scaling affects the worst case multiplier error in Q.15 fractional representation much less than in integer representation. Fig. 9 shows the power-quality tradeoff as the number of iterations scales.
IMPLEMENTATION AND RESULTS
Power results presented here are based on post-layout simulation of a semicustom implementation in 0.25 µm CMOS. The SRAM-based input shift memory and LUT layouts were created manually. The remaining blocks are implemented with the OSU standard cell library [6]. Cadence Encounter is used to place and route the design (Fig. 10), and the extracted netlist is simulated using Synopsys VCS-NanoSim.
Fig. 11. Average power at 10 MHz.
Fig. 11 lists the average current for each function running at 10 MHz and operating on random vectors. With the nominal power supply for the 0.25 µm CMOS process at 2.5 V, the functional unit consumes an average power of a few mW. The most power-hungry function is complex multiply because it involves four multiplications, two additions, and several operand swaps. Based on the device count and a frequency of 10 MHz, we estimate that the active power consumption should be around 1 mW. Several factors may contribute to the excess power consumption, for instance the under-buffered interface signals between the custom blocks and the standard cells, and the suboptimal standard cell synthesis of the controller. With a full custom implementation, a significant power reduction is expected.
CONCLUSION AND FUTURE WORK
We have described a functional unit that can compute several essential DSP functions in an energy-scalable way. We have shown the architecture and circuit-level implementation of an SRAM-based input shift memory and lookup table. We have demonstrated the advantage of serial arithmetic in energy-scalable design by describing energy-scalable implementations of several functions using serial algorithms. The simulated power for the semicustom implementation, while suboptimal, nevertheless demonstrates the concept of power-scalable implementation. In future work, we will focus on controller optimization and transistor-level design of the remaining datapath blocks.
