A high-speed 4-bit ALU has been designed for 1 V operation to demonstrate the usefulness of the back-gate forward substrate bias (BGFSB) method in 1.2 μm N-well CMOS technology. The 4-bit ALU employs a ripple carry adder and is capable of performing eight operations: four arithmetic and four logical operations. The BGFSB voltage has been limited to |0.4| V. Delay time measurements are taken for all operations from the SPICE simulations with and without the back-gate forward substrate bias. A speed advantage of a factor of about 2 to 2.5 is obtained with BGFSB over the conventional design.
INTRODUCTION
Digital integrated circuits commonly use CMOS circuits as building blocks. The continuing decrease in feature size of CMOS circuits and the corresponding increase in chip density and operating frequency have made power consumption a major concern in VLSI design [1, 2]. Excessive power dissipation in integrated circuits not only discourages their use in portable environments but also causes overheating, reduces chip life and degrades performance. Minimizing power dissipation is therefore important, both for increasing levels of integration and for improving reliability, feasibility and cost [3]. Different power-reducing techniques, such as reducing the supply voltage, the load capacitance or the switching frequency of the output node, are being used to design low-power, high-performance chips based on CMOS. At a given clock rate and for a known load capacitance, the dynamic power dissipation is proportional to the square of the power supply voltage [4]. Therefore, reducing the power supply voltage results in a quadratic improvement in the power dissipation of a CMOS circuit, and is the most common and effective way of reducing the power consumption [5]. However, lowering the supply voltage causes two design problems. One problem is that the chip throughput is degraded due to increased circuit delays at reduced voltage [6]. The other problem is that there is a significant loss in performance as the supply voltage approaches the sum of the thresholds of the PMOS and NMOS transistors [6]. Recently, it has been shown that low-threshold voltage devices can be used whenever high performance is required [7-9]. The threshold voltage can be reduced by the back-gate forward substrate bias (BGFSB) method for low-voltage digital circuit design [10]. This method reduces the threshold voltage of the P-MOSFET and the N-MOSFET and thus leads to reduced circuit delays and power. This method is suitable for supply voltages between 0.6 and 1.5 V.
In the other approach, several voltages are used on board to selectively bias different transistors [6]. The problem with this method is that the optimal voltages may vary across the chip from one conducting block to another, depending on performance requirements and circuit types. Furthermore, interfacing circuits that operate under different supply voltages requires complicated and expensive hardware and device structures [6]. Because of the problems associated with the latter approach, the preferred approach is to use a global supply voltage and electrically reduce the threshold voltage of the MOSFETs, for example by the BGFSB method.
An arithmetic logic unit is an important part of a digital computer. It is where all arithmetic and logical operations are performed. For most applications, the two important design goals for digital circuits are maximizing speed and minimizing power consumption. The speed of the different modules used in the design will dominate the overall performance of the system. For the ALU design, the most important part is the adder. The adder can be implemented in many ways, such as the carry look-ahead adder (CLA), carry-save adder (CSA) and ripple carry adder (RCA). The carry look-ahead adder is fast, but the layout area and the fan-out on some of the signals tend to grow progressively with the number of bits, slowing down the adder [3, 11]. The ripple carry adder gives a smaller layout area, but its worst-case delay time is a little longer.
In the following, we present the design of a 4-bit ALU for operation at a 1 V power supply voltage and demonstrate the advantage of using the BGFSB method. The methodology, circuit design and technology considerations, transient simulations, post-layout measurements and results are discussed in the following sections.
METHODOLOGY

Power Dissipation in CMOS Circuits
Power dissipation in CMOS can be attributed to three main components: short circuit, static and dynamic power. Static power is defined as the product of the power supply voltage and static or dc current. The static current is due to parasitic diodes common to all bulk MOS devices, and is generally ignored except in battery-powered applications with long standby or sleep times. With I_{S,i} denoting the static current of the i-th device, the static power dissipation (P_S) is obtained from [3]

$$P_S = \sum_{i=1}^{n} I_{S,i}\, V_{DD} \qquad (1)$$

where n is the number of devices. The dynamic power is the dominant source of power consumption in modern integrated circuits, resulting from the charging and discharging of the gate capacitances during switching. In CMOS digital circuits, the power consumption depends on the number of switching events [3]. The dynamic power dissipation (P_D) is proportional to the square of the power supply voltage and is given by [7, 12]

$$P_D = C_L\, V_{DD}^{2}\, f_C \qquad (2)$$

where C_L is the load capacitance, V_DD is the power supply voltage and f_C is the switching frequency. The power consumption is summed for all gates operating at a switching frequency, f_C. If groups of gates operate at different frequencies, the total dynamic power dissipation is the sum of the dynamic power of the gates operating at each frequency. Neglecting the short circuit power dissipation in comparison to the dynamic power, the total power dissipation (P_T) can be obtained from the sum of two components,

$$P_T = P_S + C_L\, V_{DD}^{2}\, f_C \qquad (3)$$

Equation (3) allows us to reduce the power consumption by reducing the power supply voltage, the load capacitance and the switching frequency. Reducing the effective capacitance is important in terms of gate power, as the power consumed by a gate is directly proportional to the total capacitance in the gate circuit [13]. As most of the capacitance in combinational logic like ours is due to transistor capacitance (gate and diffusion), it is necessary to keep these contributions to a minimum when designing low-power circuits. Careful sizing of the transistors in the design can help minimize the transistor capacitance.
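The quadratic dependence of dynamic power on supply voltage can be illustrated with a minimal Python sketch of the relation P_D = C_L V_DD^2 f_C. The 0.1 pF load and 0.5 MHz frequency are values that appear elsewhere in this paper; the 5 V reference supply is only an assumption chosen for comparison and is not taken from the paper.

```python
def dynamic_power(c_load, v_dd, f_switch):
    """Dynamic power P_D = C_L * V_DD^2 * f_C, in watts."""
    return c_load * v_dd ** 2 * f_switch


C_LOAD = 0.1e-12   # 0.1 pF load capacitance (value used for the inverter plots)
F_SWITCH = 0.5e6   # 0.5 MHz switching frequency (value used for delay tests)
V_REFERENCE = 5.0  # ASSUMPTION: hypothetical conventional supply, for comparison only
V_LOW = 1.0        # 1 V supply targeted by the ALU design

p_reference = dynamic_power(C_LOAD, V_REFERENCE, F_SWITCH)
p_low = dynamic_power(C_LOAD, V_LOW, F_SWITCH)

# Power scales with the square of the supply-voltage ratio.
power_ratio = p_reference / p_low

print(f"P_D at {V_REFERENCE} V: {p_reference:.3e} W")
print(f"P_D at {V_LOW} V: {p_low:.3e} W")
print(f"Reduction factor: {power_ratio:.1f}")
```

Under these assumptions, lowering the supply by a factor of five lowers the dynamic power of a single node by a factor of twenty-five, which is why supply scaling is attractive despite the delay penalty it introduces.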
Back-gate Forward Substrate Bias (BGFSB)
The method of forward biasing the back gate or bulk substrate with respect to the source in the dynamic active mode is schematically shown in Fig. 1. The gate width to length (W/L) ratios of the P- and N-MOSFETs are 3.6/1.2 and 1.8/1.2, respectively. A |0.4| V forward bias voltage for V_BP and V_BN is applied between the respective substrate and the source of the PMOS and NMOS transistors. Figure 2 shows a plot of the inverter delay versus the power supply voltage, V_DD, with and without BGFSB, and with a 0.1 pF capacitive load at the output. V_SS is set to zero. We can observe from Fig. 2 that applying BGFSB at low voltage operation improves the circuit delays considerably and gives an improvement factor of about 1.5 to 2. The circuit delay of the conventional inverter design operating at 1 V is about 1.2 ns, while that of the inverter designed using BGFSB is about 0.7 ns. It should be noted that reducing the power supply below three times the threshold voltage (3V_T) degrades circuit speed significantly [12]. Therefore, scaling of the power supply should be accompanied by threshold voltage reduction in order to maintain high operating speed.
4-BIT ALU DESIGN
There is one multiplexer at the input section and one at the output. The multiplexers have two select inputs, S_0 and S_1. The 4-bit ALU uses a 4-to-1 MUX designed in CMOS pass-transistor logic [4] for low power. On the input side, the MUXs provide the proper input signal to the adder circuit depending on the operation being performed; on the output side, they pass the output of the full adder to the output pin. Figure 4(a) shows the block diagram of a 4-to-1 MUX where a select pin S_2 is connected to the AND-OR logic. For S_2 = 0, one of the four arithmetic operations is performed, while for S_2 = 1, one of the four logical operations is performed, as shown in Table I. Figure 4(b) shows the logic diagram of the full adder, from which the following four logic functions have been derived: AND, OR, EXOR and EXNOR. OR is implemented from a NOR-INVERT configuration in CMOS. The output stage in Fig. 4(b) includes a 4-to-1 MUX with added AND-OR logic at the output. S_2p is the complement of the S_2 bit used in the input stage of the 4-to-1 MUX with AND-OR logic. For the S_2 bit equal to 1, or the S_2p bit equal to zero, we get a logic function at the output. Since the logical operations are performed using the basic logic gates, the delay for each logic operation is the delay through the gate. However, the arithmetic operations make use of the complete adder. Increment and decrement operations are special cases of addition and subtraction. The increment operation is equivalent to an addition of 1, and subtraction is equivalent to 2's complement addition.
The delay for each arithmetic operation is more complex than that of the logical operations, as it depends not only on the type of logic used to construct the SUM and CARRY units of the full adder, but also on the input pattern and the critical paths in the circuit. Optimizing the design of the full adder optimizes all operations to some extent. The Boolean expressions for the SUM and CARRY are described as follows:

$$\mathrm{SUM} = A \oplus B \oplus C_{IN} \qquad (4)$$

$$\mathrm{CARRY} = AB + BC_{IN} + AC_{IN} \qquad (5)$$

where A and B are two inputs and C_IN is the CARRY input to the full adder.
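The arithmetic path described above can be sketched behaviorally in Python. This is a minimal illustration only, assuming the standard full-adder forms (SUM as the exclusive-OR of A, B and C_IN, CARRY as their majority function) and a 4-bit ripple-carry chain; the function names are hypothetical, and the sketch does not model the multiplexers, the select-line encoding of Table I, or any timing.

```python
def full_adder(a, b, c_in):
    """One-bit full adder: returns (sum, carry_out) for bits a, b, c_in."""
    s = a ^ b ^ c_in
    c_out = (a & b) | (b & c_in) | (a & c_in)
    return s, c_out


def to_bits(value, width=4):
    """Split an integer into a list of bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Rebuild an integer from a list of bits, least significant bit first."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Chain full adders so each stage consumes the previous stage's carry."""
    sum_bits = []
    carry = c_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        sum_bits.append(s)
    return sum_bits, carry


def add4(a, b):
    """4-bit addition: returns (4-bit result, carry out)."""
    bits, carry = ripple_carry_add(to_bits(a), to_bits(b))
    return from_bits(bits), carry


def sub4(a, b):
    """4-bit subtraction as 2's complement addition: a + (~b) + 1."""
    inverted_b = [bit ^ 1 for bit in to_bits(b)]
    bits, carry = ripple_carry_add(to_bits(a), inverted_b, c_in=1)
    return from_bits(bits), carry


def increment4(a):
    """Increment as a special case of addition (add 1)."""
    return add4(a, 1)


def decrement4(a):
    """Decrement as a special case of subtraction (subtract 1)."""
    return sub4(a, 1)
```

For example, `add4(15, 1)` exercises the full carry chain from the least significant to the most significant stage, which corresponds to the kind of input pattern used for worst-case delay measurements.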
As shown in Table I, a particular operation of the ALU is performed based on the three select signals (S_0, S_1 and S_2), thus allowing one of the eight operations to be performed. S_0 is the LSB and S_2 is the MSB. For the logical operations, each bit output is obtained in parallel, as the operation on each bit is independent of the others. For all arithmetic operations, each successive stage depends on the previous stage for the CARRY bit. After the full adder performs the necessary operation, the output multiplexer selects the correct output. The value of signal S_2 decides whether it is a logical or an arithmetic operation. Figure 5 shows the topology of a 4-bit ripple carry adder. The carry ripples from one stage to the next. For some input patterns no rippling occurs, while for others, rippling occurs all the way from the LSB to the MSB position. The propagation delay of such a structure, also called the critical path delay, is defined as the worst case delay over all input patterns [3]. In the case of a ripple carry adder, the worst-case delay occurs when a carry generated at the least significant bit position propagates all the way to the most significant bit. The delay is then proportional to the number of bits in the input words, N, and is given by [3]:

$$T_{ADDER} = (N - 1)\, t_{CARRY} + t_{SUM} \qquad (6)$$

where t_CARRY and t_SUM are propagation delays from one stage to another. The following two important conclusions are drawn from Eq. (6). (1) The propagation delay of the RCA is linearly proportional to N, the number of bits. This property becomes increasingly important when designing adders for wide data-paths. (2) When designing the full adder cell for a fast ripple carry adder, it is far more important to optimize t_CARRY than t_SUM, since the latter (t_SUM) has only a minor influence on the total value of T_ADDER. Worst case delay calculations for all arithmetic operations are performed using Eq. (6). For logical operations, as each bit will have the same delay, it is sufficient to measure the delay through one stage of the ALU.
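The two conclusions above can be checked numerically with a short Python sketch of the worst-case ripple carry delay, T_ADDER = (N - 1) t_CARRY + t_SUM. The per-stage delays used here (1.0 ns carry, 1.5 ns sum) are hypothetical placeholders chosen only for illustration; they are not simulation results from this design.

```python
def ripple_carry_delay(n_bits, t_carry, t_sum):
    """Worst-case ripple carry adder delay: (N - 1) * t_carry + t_sum."""
    return (n_bits - 1) * t_carry + t_sum


# ASSUMPTION: hypothetical per-stage delays in nanoseconds, not measured values.
T_CARRY_NS = 1.0
T_SUM_NS = 1.5

delay_4_bit = ripple_carry_delay(4, T_CARRY_NS, T_SUM_NS)
delay_16_bit = ripple_carry_delay(16, T_CARRY_NS, T_SUM_NS)

# Halving t_carry helps far more than halving t_sum as N grows.
delay_16_fast_carry = ripple_carry_delay(16, T_CARRY_NS / 2, T_SUM_NS)
delay_16_fast_sum = ripple_carry_delay(16, T_CARRY_NS, T_SUM_NS / 2)

print(delay_4_bit, delay_16_bit, delay_16_fast_carry, delay_16_fast_sum)
```

With these placeholder values, halving the carry delay of a 16-bit adder saves 7.5 ns, while halving the sum delay saves only 0.75 ns, which is the reason the carry path is the one to optimize.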
The ALU was designed in 1.2 μm, n-well SCMOS (scalable CMOS) technology. This technology uses two levels of polysilicon and two levels of metal for interconnection. The level-one polysilicon is used for the gate as well as for interconnection. The technology is used for the design of both analog and digital circuits. The minimum size MOSFET has a 1.8 μm channel width and a 1.2 μm channel length. Figure 6 shows the layout design of the 4-bit ALU. All PMOS transistors have a W/L size of 3.6/1.2 and all NMOS transistors have a W/L size of 1.8/1.2. In the design, provision has been made to apply back-gate forward substrate bias to all the transistors externally. Independent bonding pads have been assigned for V_BN and V_BP to the p-substrate and n-wells, as shown in Fig. 6. Thus, the design could be tested with and without BGFSB. In standard CMOS design, by contrast, the p-substrate and n-well are directly connected to the V_SS (GND) and V_DD pads, respectively.
RESULTS AND DISCUSSION
To verify the functionality and advantage of the back-gate forward substrate bias method to lower the threshold (V_T) of the transistors, SPICE simulations were performed for the 4-bit ALU. The inverter design was used as the basis for designing a fast ALU for low voltage operation. Figure 7 shows the transient current characteristics versus input gate voltage for V_DD = 1 and 1.5 V, with no BGFSB. Figure 8 shows the transient current characteristics without BGFSB and with BGFSB varying from 0.1 to 0.3 V. We observe from Figs. 7 and 8 that (1) the peak current decreases as V_DD decreases and (2) as BGFSB is increased, the transient current increases and the sub-threshold current also starts increasing. Careful observation of Fig. 8 shows that when we apply a small bias of 0.1 V, there is an increase in the transient current and a negligible amount of sub-threshold current. As BGFSB is increased to 0.3 V, the sub-threshold current increases to about 2 nA, still lower than the current in the above-threshold region. Figure 9 shows the sub-threshold currents for a BGFSB of |0.3| and |0.4| V. In Fig. 10, we have compared the sub-threshold currents for a bias of |0.4| and |0.5| V. The sub-threshold current is seen to increase from 2 nA for a bias of |0.3| V to 100 nA for a bias of |0.4| V, rising to as much as 4 μA for a bias of |0.5| V. Thus, we get an increase in the transient current with a small rise in the sub-threshold current with the application of BGFSB.
It is equally important to see the effect of latch-up in CMOS since BGFSB is applied between source and substrate of transistors as shown in Fig. 11(a). The corresponding equivalent circuit is shown in Fig. 11(b). Figure 12 shows that latch-up action triggers at substrate bias above |0.4| V. The latch-up is negligible below |0.4| V for normal operating conditions of the device. Therefore, the lower limit for the threshold voltage is set by the amount of off-state or leakage current (due to standby power considerations in static circuits including the latch-up) that can be tolerated [8, 12]. A BGFSB of more than |0.4| V can degrade the circuit performance severely. The threshold voltage should ideally not be less than 0.4 V [12] and the forward substrate bias not more than |0.4| V. Figure 13 shows the inverter delay versus the substrate to source bias. The circuit delay decreases with increasing bias. However, applying a bias of 0.5 V or more severely degrades the performance of the circuit as seen from the voltage transfer characteristics of Fig. 14. Figure 15 is a plot of threshold voltage (V_T) versus substrate bias (V_BS), for NMOS and PMOS transistors. With the application of the forward substrate bias, there is a reduction in the threshold voltages for NMOS and PMOS transistors leading to higher device currents. For a BGFSB of |0.4| V, the threshold voltage reduces from 0.6 to 0.4 V in case of NMOS transistors and from |0.92| to |0.8| V for PMOS transistors. The value of the threshold voltage for the NMOS transistor is in the allowable range of 0.4 to 0.6 V. This indicates about a 0.19 V reduction in the threshold voltage for NMOS and about 0.12 V in the case of the PMOS transistor. This reduction in threshold leads to higher device currents, which in turn leads to faster circuits. A BGFSB of 0.5 V or higher will cause the threshold of NMOS to fall to 0.35 V and lower values, which can cause undesirable operation.
Therefore, a BGFSB of |0.3| to |0.4| V is acceptable for obtaining better speed performance. Figure 16 shows the DC voltage transfer characteristics of the inverter. From Fig. 16 we can see that with the application of the BGFSB to the NMOS, there is a decrease in the input voltage needed to turn on the N-MOSFET and pull the output down to ground. On the other hand, there is an increase in the critical input voltage needed to turn on the P-MOSFET and pull the output up to the supply voltage, V_DD. Thus, application of the BGFSB to the P-MOSFET shifts the transition along the input voltage axis toward the high voltage level, while BGFSB on the N-MOSFET pushes the transition region toward the low voltage level [10]. The BGFSB method therefore provides an efficient way to adjust the transition region to around midway between the high and low levels to obtain the needed noise margins. A high noise margin of 0.5 V is obtained in the design. We used this method to design a fast ALU for low voltage operation and set V_BS to |0.4| V to lower the threshold voltage of the transistors.
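The trend of threshold voltage falling with forward substrate bias can be illustrated with the textbook first-order body-effect relation for an NMOS transistor, V_T = V_T0 + γ(√(2φ_F − V_BS) − √(2φ_F)). The sketch below is not the SPICE model used for the simulations. The zero-bias threshold of 0.6 V is the NMOS value quoted above, but the body-effect coefficient γ = 0.6 V^0.5 and the surface potential 2φ_F = 0.6 V are assumed values picked only to give a plausible curve; they are not parameters reported for this process.

```python
import math

V_T0 = 0.6       # zero-bias NMOS threshold voltage in volts (value quoted in the text)
GAMMA = 0.6      # ASSUMPTION: body-effect coefficient in V^0.5, not from the paper
TWO_PHI_F = 0.6  # ASSUMPTION: surface potential 2*phi_F in volts, not from the paper


def threshold_voltage(v_bs, v_t0=V_T0, gamma=GAMMA, two_phi_f=TWO_PHI_F):
    """First-order NMOS threshold voltage for a substrate-to-source bias v_bs.

    A positive v_bs is a forward bias and lowers the threshold voltage.
    """
    if v_bs >= two_phi_f:
        raise ValueError("v_bs must be smaller than 2*phi_F for this model")
    return v_t0 + gamma * (math.sqrt(two_phi_f - v_bs) - math.sqrt(two_phi_f))


for bias in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(f"V_BS = {bias:.1f} V -> V_T = {threshold_voltage(bias):.3f} V")
```

With these assumed parameters, the model gives a threshold of roughly 0.40 V at a forward bias of 0.4 V, in line with the NMOS reduction reported from Fig. 15. The first-order model says nothing about junction leakage or latch-up, which are the effects that actually limit the usable bias.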
The input frequency used to measure the worst case delays for all operations is around 0.5 MHz, both with and without forward substrate bias. This was done to get a fair comparison of the delays for each operation of the circuit with and without back-gate forward substrate bias. The input pattern is chosen so as to obtain the worst case delay, or critical path delay. Table II gives the input patterns for the worst case delays of the arithmetic operations. Propagation delays (t_pLH and t_pHL) are measured from the 50% point of the input to the 50% point of the output voltage response. The worst case delay or critical path delay for the ADDITION and INCREMENT operations occurs when the input pattern is such that the carry ripples from the first to the last stage, i.e. from the LSB to the MSB. Similarly, the worst case delay for SUBTRACTION and DECREMENT operations occurs when the carry bit propagates from the MSB stage to the LSB stage. The delay relation of Eq. (6) is used for calculation of delays for all arithmetic operations of the ALU. For all logical operations the delay is calculated for a single stage only, since the output of each stage is independent of the previous stage.
Tables III and IV summarize the delays obtained from the simulation of the layout of the 4-bit ALU and full adder with and without back-gate forward substrate bias. The supply voltage is 1 V. The rise and fall times of the input signal are 5 ns. The improvement in the speed performance of the circuit is measured through a factor, K, which is the ratio of the propagation delay for any operation of the 4-bit ALU without BGFSB to that with BGFSB. The computed values of K are also summarized in Tables III and IV. K lies between 2 and 2.5, which shows that there is a significant reduction in delay for all operations when back-gate forward substrate bias is used.
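The improvement factor K defined above is a simple delay ratio, sketched here in Python. Because the entries of Tables III and IV are not reproduced in this text, the example applies the same ratio to the inverter delays quoted earlier (about 1.2 ns without BGFSB and about 0.7 ns with BGFSB at 1 V); the function name is hypothetical.

```python
def improvement_factor(delay_without_bias, delay_with_bias):
    """Speed improvement K: delay without BGFSB divided by delay with BGFSB."""
    if delay_with_bias <= 0:
        raise ValueError("delay_with_bias must be positive")
    return delay_without_bias / delay_with_bias


# Inverter delays at V_DD = 1 V quoted in the text, in nanoseconds.
INVERTER_DELAY_WITHOUT_NS = 1.2
INVERTER_DELAY_WITH_NS = 0.7

k_inverter = improvement_factor(INVERTER_DELAY_WITHOUT_NS, INVERTER_DELAY_WITH_NS)
print(f"Inverter improvement factor K = {k_inverter:.2f}")
```

For the inverter this gives K of about 1.7, inside the 1.5 to 2 range read from Fig. 2. The same function applied to each ALU operation would reproduce the K column of Tables III and IV.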
CONCLUSION
The potential of the BGFSB method has been highlighted for low-voltage and high-speed applications. A fast 4-bit ALU has been designed in 1.2 μm, N-well CMOS technology for 1 V operation to demonstrate the usefulness of the BGFSB method. A BGFSB of |0.4| V is applied to all NMOS and PMOS transistors in the circuit to lower the threshold voltage. The applied BGFSB is set on the basis of the bias at which latch-up action triggers in the circuit. Latch-up action triggers above |0.4| V and is negligible below this voltage. An improvement factor K between 2 and 2.5 is obtained for all operations performed by the 4-bit ALU with BGFSB. In the design, all transistors are subjected to the same BGFSB because of the limitation of the single well in the standard CMOS process. In the steady state, the sub-threshold current increases due to the reduction in threshold voltage caused by BGFSB. In the present design, the steady state current was significantly lower than the above-threshold current of the MOSFET. However, in high-density chips, the current could be suppressed to a significant level by the use of the switched source impedance method [14].
