Reducing power dissipation in CMOS circuits is essential for portable battery-powered devices. The selection of an appropriate transistor library to minimise leakage current, the implementation of low power design architectures and power management, and the choice of chip packaging all affect power dissipation and are important considerations in the design and implementation of integrated circuits for low power applications. Energy-efficient architecture is highly desirable for battery operated systems, which operate under a wide range of operating scenarios. An energy-efficient design aims to reconfigure its own architecture to scale down energy consumption depending upon the throughput and quality requirements. An energy-efficient system should be able to decide its minimum power requirements by dynamically scaling its own operating frequency, supply voltage or threshold voltage according to a variety of operating scenarios. The increasing product demand for application specific integrated circuits or processors for independent portable devices has led designers to implement dedicated processors with ultra low power requirements. One of these dedicated processors is the Fast Fourier Transform (FFT) processor, which is widely used in signal processing for numerous applications such as wireless telecommunication and biomedical applications, where the demand for extended battery life is extremely high. This paper presents the design and performance analysis of a low power shared memory FFT processor incorporating dynamic voltage scaling. Dynamic voltage scaling enables the power supply to be scaled to various supply voltage levels. The concept behind the proposed solution is that the speed of the main logic core can be adjusted according to the input load or the amount of computation required, so that it runs "just enough" to meet the requirement.
The design was implemented using 0.12 µm ST-Microelectronic 6-metal layer CMOS dual-process technology in the Cadence Analogue Environment.
INTRODUCTION
Power dissipation has become a major concern in CMOS devices, primarily in wireless battery operated portable systems where energy consumption is heavily constrained. In a wireless battery operated system, such as a wireless sensor network, increasing the system lifetime means minimising the system power dissipation by reducing computation speed and communication power. A microsensor node, one of tens to thousands in a wireless sensor network, senses the environment and delivers the information back to the user, and is required to operate for up to 5 years from a single "AA" battery. If a microsensor node runs on a battery with an average rating of 1500-1800 mAh for 5 years, the system is required to have an average power dissipation of less than 10 µW. Therefore the key factor in such applications is energy dissipation per function rather than clock speed or silicon area. It has become increasingly important for battery operated portable systems to be aware of their energy or power dissipation.
A power aware or energy aware system is not identical to a low power design. Although both techniques aim to reduce power consumption, low power design targets the worst case path or scenario to reduce power dissipation. An energy-efficient system, in contrast, should be able to decide its minimum power requirements by dynamically scaling its own operating frequency, supply voltage or threshold voltage so that it remains energy-efficient under a variety of operating scenarios.
A microprocessor based portable system generally has power management scheduling inside the kernel, which is responsible for adjusting the clock frequency and operating voltage in low power sleep mode. Some recent microprocessors, such as the StrongARM processor, support external dynamic voltage scaling (DVS) scheduling and internal operating frequency scheduling [1]. Typically, in a system based on a dynamic voltage scaling processor, Vdd and the clock frequency can be varied dynamically according to the required throughput to significantly extend battery life. It is therefore important to understand the sources of power dissipation in CMOS circuits.
SOURCE OF POWER DISSIPATION IN CMOS CIRCUITS
Fundamentally, the two main sources of power dissipation in CMOS circuits are dynamic and static power dissipation. The total power dissipation is the sum of the dynamic and static power dissipation, as shown by equation (1) [2].
$$P = C V_{dd}^{2} f N + Q_{sc} V_{dd} f N + I_{leak} V_{dd} + I_{static} V_{dd} \qquad (1)$$

where
P = total power
C = circuit capacitance
Vdd = supply voltage
f = clock frequency
N = switching gate transitions per cycle
Qsc = short circuit charge quantity
Ileak = leakage current
Istatic = static current

Dynamic power dissipation (Pdynamic) is dominant in the majority of applications. Pdynamic, by its nature, depends directly on the switching activity (N) and the supply voltage (Vdd).
As seen from equation (1), dynamic power dissipation can be reduced effectively by lowering Vdd, because the switching term depends quadratically on it. Clock frequency depends on the device application, and a higher clock frequency is still preferred in present-day processors. However, an effective Vdd reduction can only be achieved together with a reduction in total capacitance, which means smaller transistor sizes.
Short circuit charge (Qsc) is another element of dynamic power dissipation which cannot be neglected. This paper refers to Qsc rather than short circuit current, because two components are involved in a CMOS short circuit. The common understanding of short circuit current is that a direct path from the power supply to ground is established while the transistors leave the cut-off region for saturation and vice versa. However, the reverse charge on the gate-to-drain coupling capacitance also contributes to the total short circuit power. Therefore, it is more accurate to represent the total short circuit power in terms of Qsc.
The second part of the total power dissipation equation is the total static power (Pstatic). Pstatic is formed mainly of two components: the subthreshold leakage power, due to the small subthreshold leakage current (Ileak) conducted between source and drain, and the static leakage power, due to the reverse-biased pn-junction leakage current (Istatic) between source/drain and substrate, commonly known as substrate leakage current.
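The quadratic dependence on Vdd can be checked numerically. The sketch below assumes the total power takes the usual form built from the symbols listed for equation (1) (switching term C·Vdd²·f·N, short circuit term Qsc·Vdd·f·N, and a static term proportional to Vdd); the function names and any component values passed to them are illustrative and are not taken from the design.

```python
def total_power(c, vdd, f, n, q_sc, i_leak, i_static):
    """Total power in watts under the assumed form of equation (1)."""
    p_switching = c * vdd ** 2 * f * n      # charging/discharging of capacitance
    p_short_circuit = q_sc * vdd * f * n    # short circuit charge per transition
    p_static = (i_leak + i_static) * vdd    # subthreshold + junction leakage
    return p_switching + p_short_circuit + p_static


def switching_power_ratio(vdd_new, vdd_old):
    """Ratio of the switching term after and before a supply change,
    with capacitance, frequency and activity held constant."""
    return (vdd_new / vdd_old) ** 2
```

For example, `switching_power_ratio(0.7, 1.2)` evaluates to about 0.34, so scaling the supply from 1.2 V to 0.7 V at an unchanged frequency would leave roughly a third of the switching power. In practice the frequency is lowered at the same time, which reduces the power further.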
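The subthreshold leakage current, which is known to have an exponential characteristic, can be expressed as

$$I_{leak} = I_{0} \exp\left(\frac{V_{gs} - V_{th}}{n V_{TH}}\right) \qquad (2)$$

where I0 is the drain current at Vgs = Vth, n is the process parameter, Vth is the threshold voltage and VTH is the thermal voltage at room temperature [3].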
The exponential increase of the subthreshold leakage drain current with decreasing Vth for a given Vgs is shown in equation (2). In other words, a transistor with a higher threshold voltage has a lower leakage current. However, Vth must be chosen carefully, as even a slight increase in Vth means a larger delay. Another important factor is to use a transistor with a steep slope or transfer characteristic. The slope is measured by plotting the drain current on a semi-logarithmic scale against Vgs, which gives a straight line in the subthreshold region. The larger the slope, the closer the transistor's behaviour is to that of an ideal switch.
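The trade-off between threshold voltage and leakage can be made concrete with a short calculation. The sketch below assumes the common exponential model in which leakage scales as exp((Vgs − Vth)/(n·VTH)); the default n = 1.5 is a placeholder rather than a value from the 0.12 µm process, and the function names are illustrative. It also reports the subthreshold swing, the inverse of the slope discussed above, in volts per decade of drain current.

```python
import math

BOLTZMANN = 1.380649e-23         # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage kT/q in volts."""
    return BOLTZMANN * temp_kelvin / ELECTRON_CHARGE


def leakage_ratio(delta_vth, n=1.5, temp_kelvin=300.0):
    """Factor by which subthreshold leakage changes when Vth is raised
    by delta_vth volts at a fixed Vgs (assumed exponential model)."""
    return math.exp(-delta_vth / (n * thermal_voltage(temp_kelvin)))


def subthreshold_swing(n=1.5, temp_kelvin=300.0):
    """Gate voltage change in volts needed for a tenfold change in
    drain current; a smaller swing means a steeper slope."""
    return n * thermal_voltage(temp_kelvin) * math.log(10)
```

Under this model, raising Vth by exactly one subthreshold swing reduces the leakage by a factor of ten, which is why a steeper slope lets a modest Vth increase buy a large leakage reduction.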
The architecture for reducing the total power (dynamic and static) in an FFT chip is illustrated in Figure 1. Dynamic power management is accomplished by the Dynamic Voltage Scaling (DVS) block, and static power management is realised in the Threshold Voltage Scaling (VT scaling) block.
This paper focuses on the implementation of DVS for dynamic power reduction in the FFT processor, which is discussed in Sections 3, 4 and 5.
DYNAMIC VOLTAGE SCALING POWER MANAGEMENT
The previous section showed that lowering Vdd directly reduces power consumption. However, the drawback of lowering Vdd is a larger delay, which means slower performance or processing speed in integrated circuits [4] [5] [6] [7]. Intel Corporation has developed a power dissipation solution for laptop/notebook computer users. The technique is named "Speed-Step Technology" and can be found in Intel's Mobile processors. "Speed-Step Technology" provides dual power supply scaling management and frequency scaling. The dual power supply scaling management enables the processor to operate in two power domains, the battery mode and the external adapter mode, and steps the processing speed up or down according to the power source. Power dissipation is thus reduced by slowing down the processing speed in battery mode, resulting in longer battery life. However, "Speed-Step Technology" cannot be applied to battery-only portable devices, where a single power source is used.
One of the proposed solutions for lowering power dissipation is dynamic Vdd scaling, which enables the power supply to be scaled to different supply voltage levels. It is known that lowering the supply voltage means less charge or current flowing inside the circuit, which results in a larger delay [8] [9] [10] [11]. The concept behind Vdd scaling is that, if the frequency of the processor or logic cores can be adjusted according to the input throughput load or the amount of computation required, then the processor can run at a "just enough" operating frequency to meet the requirement, as shown in Figure 2. The supply voltage can be reduced for processes such as background tasks, which can be executed at a reduced frequency, thus minimising power consumption. The facts behind power reduction in DVS are:
- A system is not always required to work at 100% performance.
- In pre-submicron technologies, the total power dissipation is dominated by dynamic power.
The performance level is reduced during low utilisation periods in such a way that the processor finishes its task "just in time" by reducing the working frequency. While the operating frequency is lowered, the supply voltage Vdd can be reduced at the same time.
As the most effective way to manage power dissipation depends heavily on the application, current DVS implementations focus on algorithm level power scheduling, which resides in the kernel of the operating system, rather than on hardware implementation. The portability of the DVS system across multiple platforms with different requirements is the main concern of many DVS designers, and this leads them to rely on DVS at the algorithm level. However, the DVS architecture can still be optimised by realising some of the voltage scheduling algorithm modules in hardware, which results in an effective power management scheme.
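The "just enough" selection described above can be sketched as a small scheduling routine. The operating points below (frequency and supply voltage pairs) are hypothetical placeholders chosen only for illustration, as are the function names; a real system would use the voltage and frequency pairs characterised for its own process.

```python
# Hypothetical (frequency in Hz, supply voltage in V) operating points.
OPERATING_POINTS = [
    (25e6, 0.7),
    (50e6, 0.8),
    (100e6, 1.0),
    (150e6, 1.2),
]


def pick_operating_point(cycles, deadline_s, points=OPERATING_POINTS):
    """Return the slowest (frequency, vdd) pair that still completes
    `cycles` clock cycles within `deadline_s` seconds."""
    for freq, vdd in sorted(points):
        if cycles / freq <= deadline_s:
            return freq, vdd
    raise ValueError("deadline cannot be met even at the highest frequency")


def relative_switching_energy(vdd, vdd_max):
    """Switching energy of a fixed-cycle task relative to running it at
    vdd_max; the energy per cycle scales with Vdd squared."""
    return (vdd / vdd_max) ** 2
```

With these placeholder points, a task of one million cycles with a 25 ms deadline would be assigned the 50 MHz, 0.8 V point, and its switching energy would be about 44% of the energy needed at 1.2 V.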
IMPLEMENTATION OF DYNAMIC VOLTAGE SCALING
This section of the paper discusses the system modules required for the power aware FFT processor with DVS. The power management module discussed is the DVS architecture, which includes a DAC, a tuneable ring oscillator, a pulse width modulator, a phase frequency detector, a current driver and a loop filter [12].
Generally, there are three key components for implementing DVS in a system with a central processing unit (CPU), namely:
- An operating system capable of varying the processor speed.
- A regulator which generates the minimum voltage for a particular frequency.
- A central processing unit which is capable of working over a wide range of operating voltages.
As discussed earlier, DVS cannot be realised in hardware alone. The processor's speed is controlled by the operating system, whose task is to gather the load requirements of each process from a power profiler module. In order to minimise energy dissipation, the supply voltage must vary as the processor frequency varies. However, the operating system is not capable of controlling the minimum supply voltage required for a given frequency, so a hardware implementation is required to provide this functionality. The overall architecture of the DVS system is shown in Figure 3.
Pulse width modulation (PWM) is a common modulation technique in signal processing. PWM generates a series of pulses whose width varies with the weighted binary input value. The PWM module used in this DVS system has the same functionality as a conventional one. The module generates a series of weighted pulses from the comparison of Vref, as the reference voltage, with the oscillation frequency from the clock reference. The series of pulses from the PWM is then compared by the Phase and Frequency Detector (PFD) with the tuneable oscillator frequency.
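The relation between a weighted binary input and the pulse width can be illustrated with a counter-and-comparator model. This is a simplified digital sketch and an assumption about behaviour, not the analogue comparison against Vref used in the design; the 5-bit default width and the function names are illustrative.

```python
def pwm_duty(code, bits=5):
    """Duty cycle (0.0 to 1.0) for a binary input code, assuming the
    all-ones code corresponds to a constantly high output."""
    full_scale = (1 << bits) - 1
    if not 0 <= code <= full_scale:
        raise ValueError("code out of range for the given bit width")
    return code / full_scale


def pwm_waveform(code, bits=5, periods=1):
    """One sample per clock tick: the output is high while a free-running
    counter is below `code`, and low for the rest of the period."""
    period = (1 << bits) - 1
    if not 0 <= code <= period:
        raise ValueError("code out of range for the given bit width")
    return [1 if tick % period < code else 0
            for tick in range(period * periods)]
```

In this model a larger input code keeps the output high for more ticks of each period, so the average value of the pulse train follows the binary input.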
The PFD is commonly used as the phase detector for frequency locking in Phase Locked Loop (PLL) systems. As the phase difference between the two input signals, Ref-Clk (coming from the PWM output) and Clk (the actual VCO clock), changes, so do the two output signals, UP and DOWN. The bigger the phase difference, the larger the pulse width produced at the UP and DOWN terminals. The DOWN output is high when Clk leads Ref-Clk, and the UP output is high when Clk lags Ref-Clk. These series of small pulses control the charge current injected by the charge pump circuit.
The purpose of the charge pump circuit is to transform the time domain pulse trains into a continuous steady voltage for the VCO, depending on the signals from the PFD. If the reference clock signal lags the VCO signal, the PFD will discharge the charge pump and lower the output voltage, and vice versa. The loop filter removes jitter and smooths the voltage from the charge pump into an analogue voltage for VCO frequency control.
The VCO used in this DVS architecture is a ring oscillator with transmission gate switches. A wide frequency range can be generated by switching on one of the transmission gates, which selects the number of inverter stages in the loop. The minimum oscillation frequency is obtained by using all the inverter stages. The ring oscillator also comprises a current controlled inverter connected in parallel with a conventional inverter, for gain control and for the different inverter frequency stages. The output frequency of the VCO can be programmed from 100 kHz to 334 MHz.
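The interaction between the PFD and the charge pump can be summarised in a behavioural model. The sketch below follows the polarity stated above (UP when Clk lags Ref-Clk, DOWN when Clk leads); the pump current, filter capacitance and voltage limits are hypothetical values chosen for illustration, and the loop filter is reduced to a single ideal capacitor.

```python
def pfd_pulses(ref_edge_s, clk_edge_s):
    """Return (up_width, down_width) in seconds for one pair of rising
    edges. UP is asserted when Clk lags Ref-Clk, DOWN when Clk leads."""
    diff = clk_edge_s - ref_edge_s
    if diff > 0:
        return diff, 0.0       # Clk arrives late: speed the VCO up
    if diff < 0:
        return 0.0, -diff      # Clk arrives early: slow the VCO down
    return 0.0, 0.0


def charge_pump_step(v_ctrl, up_width, down_width,
                     i_pump=10e-6, c_filter=10e-12,
                     v_min=0.0, v_max=1.2):
    """Update the control voltage after one PFD comparison, treating the
    loop filter as an ideal capacitor (hypothetical component values)."""
    delta_v = i_pump * (up_width - down_width) / c_filter
    return min(v_max, max(v_min, v_ctrl + delta_v))
```

Calling `pfd_pulses` once per reference cycle and feeding the result to `charge_pump_step` moves the control voltage in the direction that reduces the phase error, which is the mechanism by which the loop settles.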
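The dependence of the oscillation frequency on the number of stages follows from the standard first-order ring oscillator relation f = 1/(2·N·td), where N is the (odd) number of inverter stages in the loop and td is the delay of one stage. The sketch below applies that relation; the function name and the example delay are illustrative, and the model ignores the current controlled inverter used for fine tuning.

```python
def ring_oscillator_frequency(stages, stage_delay_s):
    """First-order oscillation frequency in Hz of a ring of `stages`
    identical inverters, each with propagation delay `stage_delay_s`."""
    if stages < 3 or stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of stages, at least 3")
    if stage_delay_s <= 0:
        raise ValueError("stage delay must be positive")
    return 1.0 / (2 * stages * stage_delay_s)
```

For a hypothetical stage delay of 1 ns, three stages would oscillate at about 167 MHz, and each additional pair of stages lowers the frequency, which is consistent with the minimum frequency being obtained when all stages are in the loop.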
FAST FOURIER TRANSFORM
The Fast Fourier Transform (FFT) algorithm has been widely used in sensor and wireless applications. In this paper, the FFT is used primarily for wireless acoustic sensor signal processing, with applications in environment, healthcare and transportation. The passive, independent nature of these ultra low power wireless sensors requires efficient power management.
The most suitable FFT architecture for low power sensor applications is the shared memory FFT architecture [13] [14]. The advantages of shared memory architectures are area efficiency and lower overall power consumption. However, shared memory architectures cannot achieve high speed operation, because more computation cycles are required. The main trade-off in the FFT processor is between hardware overhead and speed.
The low power and low speed characteristics of the shared memory FFT processor architecture make it suitable for low power applications such as wireless sensor networks. A shared memory FFT processor generally consists of a butterfly core datapath, a data storage memory and a twiddle factor look up table, as shown in Figure 4. The main computation is handled by the radix butterfly core. Conventionally, the computational core in a dedicated FFT processor can range from a single-multiplier arithmetic logic unit to a high order radix FFT.
Control Logic block
The control logic manages the timing of the four functions in an N-point real valued FFT:
- reordering the data inputs and storing them in data memory
- performing the N/2-point complex valued FFT
- processing the complex valued result into the real valued FFT output
- placing data onto the output databus
The control logic not only sequences these four functions, but also generates the memory data addresses, controls the input and output bus interfaces, and sends the re-configurability signals to the butterfly core that enable energy awareness.
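The sequence of operations for an N-point real valued FFT can be sketched in software. The code below uses the standard packing method, in which even and odd samples form the real and imaginary parts of an N/2-point complex input, followed by a post-processing pass; that the processor uses exactly this formulation is an assumption, and the recursive `fft` helper merely stands in for the hardware butterfly datapath.

```python
import cmath


def fft(z):
    """Recursive radix-2 complex FFT; len(z) must be a power of two."""
    n = len(z)
    if n == 1:
        return list(z)
    even, odd = fft(z[0::2]), fft(z[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def real_fft(x):
    """N-point FFT of a real sequence using one N/2-point complex FFT.
    len(x) must be a power of two and at least 2."""
    n = len(x)
    half = n // 2
    # Reorder: pack even samples as real parts, odd samples as imaginary parts.
    packed = [complex(x[2 * i], x[2 * i + 1]) for i in range(half)]
    # N/2-point complex valued FFT.
    zf = fft(packed)
    # Post-process: separate the even/odd spectra and recombine them.
    out = [0j] * n
    for k in range(half):
        mirrored = zf[(half - k) % half].conjugate()
        spectrum_even = (zf[k] + mirrored) / 2
        spectrum_odd = (zf[k] - mirrored) / 2j
        t = cmath.exp(-2j * cmath.pi * k / n) * spectrum_odd
        out[k] = spectrum_even + t
        out[k + half] = spectrum_even - t
    return out
```

The benefit of this arrangement is that the butterfly core only processes N/2 complex points, at the cost of one extra pass over the data.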
Butterfly Core
The butterfly core computes a single butterfly per clock cycle and processes a normal complex valued FFT followed by the real valued FFT stage. The complex valued stage comprises complex multiplication and complex addition, while additional adders and subtractors process the complex data into real values.
In this paper, the FFT core simulated with the DVS system comprises an 8-bit Baugh Wooley (BW) multiplier [15]. The Baugh Wooley multiplier is used for two's complement multiplication because of its efficiency in handling sign bits, which makes it a common processing core in FFT processors.
Conventionally, in a non-reconfigurable design, the BW multiplier core is built as a single block sized for the largest bit-width. In this example, a 16-bit multiplier block would be used to compute both 8-bit and 16-bit multiplications. However, a single multiplier block of the largest bit-width is not optimal, because of the sign bit switching in two's complement computation. The proposed solution is to implement multipliers of several bit-widths, with the control block selecting the multiplier block to use.
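A single radix-2 butterfly can be written out in terms of the real multiplications and additions it requires, which is what the multiplier and adder blocks of the core carry out. The sketch below is a floating point illustration with complex values held as (real, imaginary) tuples; the radix and the function names are assumptions made for the example.

```python
def complex_multiply(ar, ai, br, bi):
    """Complex product (ar + j*ai)(br + j*bi) using four real
    multiplications and two real additions."""
    return ar * br - ai * bi, ar * bi + ai * br


def butterfly(a, b, twiddle):
    """Radix-2 butterfly on (real, imag) tuples: returns
    (a + b*twiddle, a - b*twiddle)."""
    tr, ti = complex_multiply(b[0], b[1], twiddle[0], twiddle[1])
    return (a[0] + tr, a[1] + ti), (a[0] - tr, a[1] - ti)
```

Each butterfly therefore costs one complex multiplication and two complex additions, so the multiplier dominates the arithmetic power of the core.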
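The way a Baugh Wooley multiplier handles sign bits can be shown at the bit level. The sketch below implements the commonly described modified Baugh Wooley scheme, in which the partial products involving exactly one sign bit are complemented and two constant ones are added; it is a software illustration under that assumption and not a description of the circuit in this design.

```python
def baugh_wooley_multiply(a, b, n=8):
    """Multiply two n-bit two's complement integers by summing modified
    Baugh Wooley partial products; returns the signed 2n-bit result."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n-bit two's complement")
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    total = 0
    # Magnitude bits: ordinary AND partial products.
    for i in range(n - 1):
        for j in range(n - 1):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Product of the two sign bits has positive weight.
    total += (a_bits[n - 1] & b_bits[n - 1]) << (2 * n - 2)
    # Partial products with exactly one sign bit are complemented.
    for k in range(n - 1):
        total += (1 - (a_bits[n - 1] & b_bits[k])) << (n - 1 + k)
        total += (1 - (a_bits[k] & b_bits[n - 1])) << (n - 1 + k)
    # Correction constants at bit positions n and 2n-1.
    total += (1 << n) + (1 << (2 * n - 1))
    total &= (1 << (2 * n)) - 1            # keep 2n bits
    if total >= 1 << (2 * n - 1):          # reinterpret as signed
        total -= 1 << (2 * n)
    return total
```

Because every partial product is added rather than subtracted, the array needs no sign extension. Running a narrow operand through a wider array would still toggle the upper partial products, which is the motivation for selecting a multiplier of matching bit-width.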
Data Memory Cells
Another important component in the shared memory FFT processor architecture is the memory cells. Seventy-five percent of the total power consumption in an FFT processor is due to memory cell data access and complex number multiplication [17]. The larger the number of bits in the multipliers, or the longer the FFT, the larger the word-length required for the memory cells. Memory cells require a large chip area and consume considerable power. Another aspect that determines power consumption in memory cells is the number of access ports. Single port memory access can be efficient with regard to power dissipation; however, it also creates a bottleneck in high speed FFT operation. The shared memory FFT architecture performs repeated stationary computation, meaning that the processed data can be written back to the location from which it was read. Therefore, the FFT architecture uses only 512K-bit data memory cells to read and write 32-bit input data.
The MTCMOS or Multi-Threshold CMOS topology was chosen as the memory architecture technique in the FFT processor because of the number of inactive (standby) cells [18]. Initially, the MTCMOS principle was applied in the design of SRAM to reduce the power dissipation of peripheral circuits such as row decoders and input/output buffers. MTCMOS is a circuit technique that combines two different transistor types: low-Vt transistors for the high speed core logic and high-Vt transistors as power switches to reduce leakage current. The main principle of MTCMOS is shown in Figure 6. MTCMOS has been a popular technique because of the simplicity of the design. Ideally, the larger the threshold level, the lower the leakage current; however, an optimum threshold level must be chosen for the power switch (high-Vt devices) relative to the logic core (low-Vt devices), as the recovery delay tends to increase with a higher threshold level. A power switch with thicker oxide (tox) must be considered to prevent source-drain current blow up.
In this paper, the MTCMOS SRAM was designed using the conventional gated-Vdd and Gnd structure, which was introduced in [16] [17] [18]. This technique reduces the leakage current by placing high-Vt transistors between Vdd and Gnd to cut off the power supply of the low-Vt memory cell when the cell is in sleep mode. The structure was modified by adding virtual Vdd and Gnd lines to prevent data loss, as shown in Figure 6. The two virtual lines maintain the stored charge of the memory cells while the power lines are cut off. This technique introduces a slight delay in the write and read times due to the activation of the sleep transistors. However, the delay is necessary for the memory cells to recover from sleep mode to active mode.
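The stationary, in-place computation described above can be illustrated in software. The sketch below is a generic iterative radix-2 decimation-in-time FFT in which each butterfly reads two locations and writes its results back to the same two locations, so a single buffer of N words suffices; it illustrates the memory access pattern only and is an assumption about, not a description of, the addressing scheme in this processor.

```python
import cmath


def bit_reverse_permute(data):
    """Reorder `data` in place into bit-reversed index order."""
    n = len(data)
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            data[i], data[j] = data[j], data[i]


def fft_in_place(data):
    """Radix-2 FFT that overwrites `data` (a list of complex numbers
    whose length is a power of two) with its transform."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bit_reverse_permute(data)
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                top = data[start + k]
                bottom = data[start + k + half] * w
                # Results go back to the two locations just read.
                data[start + k] = top + bottom
                data[start + k + half] = top - bottom
        size *= 2
```

Since no second buffer is needed, the memory size is set by the transform length and word-length alone, which is what keeps the data memory small in a shared memory architecture.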
RESULTS
The DVS system, the multiplier and the SRAM were designed and simulated in the Cadence Design Framework II Analog Environment using the 0.12 µm low leakage ST-Microelectronic library. The voltage scaling at the output, which in turn is the supply voltage of the FFT, results from the variation of the VCO frequency, as illustrated in Figure 7. The VCO frequency varies with the DAC weighted binary input. The external clock shown in Figure 7a is set to 150 MHz, while the VCO frequency, shown in Figure 7e, is adjusted to be slightly above 150 MHz. It can be clearly seen in Figure 7b that, before the 1 µs mark, the DVS output voltage (Vdd) ramps up to 1.2 V when the full set of binary inputs is applied (i.e. Bit<4:0> are all 1). In addition, the supply voltage output in Figure 6b only ramps up to 700 mV when the VCO frequency is reduced below the external clock of 150 MHz.
The performance of the DVS system is summarised in Table 1. Overall, the DVS system without any load consists of 366 transistors and dissipates 174.1 µW at an operating frequency of 150 MHz. The effect of temperature in conjunction with DVS on the current performance is also addressed. Figure 8 shows the dependence of the system power on the operating temperature for various voltages (Vdd scaling) in 0.12 µm technology. The figure shows that total power increases as the temperature increases; however, there is an optimum voltage at which the effect of temperature on power dissipation is minimal. The effectiveness of Vdd scaling on the FFT multiplier core is presented in Figure 9. DVS is more effective in a larger multiplier, with an average power reduction of approximately 25%. The result of synthesising the FFT processor in Synopsys Design Compiler with medium mapping effort is shown in Figure 10. The compiled FFT architecture comprises 414,360 standard cell gates.
CONCLUSION
In this paper, a novel full custom approach to dynamic voltage scaling that can be used as part of a power management system has been presented. The architecture was described through an FFT processor design. The power dissipation of the FFT multiplier core with dynamic voltage scaling was simulated, along with the MTCMOS SRAM used as data memory. Power consumption expressions as functions of three control parameters (frequency, supply voltage and body bias voltage) have also been examined and presented in the paper.
