Abstract-Reduction in average and peak power during test application is important to improve battery lifetime in portable electronic devices employing periodic self-test and to improve reliability/cost of testing. This paper proposes an integrated solution for peak and average power reduction in test-per-scan BIST by targeting power reduction in both combinational block and scan chain. First, we present a novel circuit technique, called First Level Supply gating (FLS), to virtually eliminate power dissipation in combinational logic by masking signal transition at the logic inputs during scan shifting. We realize the masking effect by inserting an extra supply gating transistor in the VDD to GND path for the first level gates at the output of scan flip-flops. Simulation results on ISCAS89 benchmarks show an average reduction of 65% in area overhead, 119% in power overhead (in normal mode), and 104% in delay overhead compared to lowest-cost known signal masking alternative. To reduce the leakage power of the combinational block, which is considerably high in scaled technologies, we propose input vector control using FLS during scan shifting. Experiments on a set of ISCAS89 benchmarks show about 38% average reduction in leakage power with the proposed leakage reduction technique. Second, to address the power in the scan chain, we propose an efficient scan partitioning technique that reduces both average and peak power in the scan chain during shift and functional cycles. Experiments on a set of ISCAS89 benchmarks show 12.6% average reduction in peak power with the proposed partitioning method over partitioning according to RTL description.
I. INTRODUCTION
Power dissipation during testing can be significantly higher than that during functional mode, since the input vectors during functional mode are usually strongly correlated, whereas consecutive input vectors during testing are statistically independent. Zorian in [1] showed that the test power could be twice as high as the power consumed during the normal mode. Test power is an important design concern for increasing battery lifetime in hand-held electronic devices that incorporate BIST circuitry for periodic self-test. It is also important for reducing test cost, since reduced test power of a module allows parallel testing of multiple embedded cores in an IC [7]. Peak and average power reduction during test also helps to enhance the reliability of test and hence to improve yield.
There are two components of power dissipation during testing: power in the combinational block and power in the scan chain. An integrated solution for low power test has to consider reduction in both components of test power. There has been a multitude of research exploring efficient techniques to reduce test power in scan-based circuits. Wang et al. proposed an automatic test pattern generation technique to reduce power dissipation during scan testing [3]. Scan-latch reordering [5] and input vector reordering [6] techniques have been proposed for reduction in test power. In [7], Whetsel provided a solution for average and peak power dissipation by transforming the conventional scan architecture into a desired number of selectable separate scan paths. Each scan path is in turn filled with stimulus and emptied of response. Sankaralingam et al. proposed a solution to the peak power problem during external testing by selectively disabling the scan chain [8]. These techniques, however, target reducing the amount of switching in the scan chain and cannot completely prevent redundant power loss in the combinational logic.
Inserting blocking logic into the stimulus path of the scan cells to prevent propagation of the scan-ripple effect to logic gates offers a simple and effective solution to significantly reduce test power in the combinational logic, independent of the test set. Gerstendörfer et al. have proposed a NOR or NAND gate-based blocking method in [10]. Blocking gates (NOR or NAND) are controlled by the test enable signal, and the stimulus paths remain fixed at either logic '0' or logic '1' during the entire scan shift operation. Zhang et al. have used multiplexers at the output of the scan cells, which hold the previous state of the scan register during shifting [11] and thus prevent activity in the combinational logic. Another method for reducing combinational power using blocking is to use a scan-hold circuit as a sequential element. This technique, referred to as enhanced scan [9], helps in delay fault testing by allowing application of an arbitrary two-pattern test. The problem with blocking logic is that it adds significant delay in the signal propagation path from the scan flip-flops to logic gates [10]. Moreover, it has a large overhead in terms of area and switching power during normal operation of the circuit.
In this paper, we propose a unified solution for test power reduction, which eliminates redundant power in combinational logic as well as minimizes power dissipation in the scan chain. In particular, the paper makes the following contributions:
• We present an elegant signal blocking technique, referred to as First Level Supply gating or FLS, to reduce power dissipation in the combinational logic during scan shifting. This is achieved by selectively inserting a supply gating transistor in the first level of logic connected to the scan cell outputs, which essentially "gates" the ripple in scan flip-flops. The proposed method is as effective as the other blocking methods in terms of reducing peak power and total energy dissipation during scan testing. However, since we introduce just one transistor in the discharge path of the first level logic, the delay penalty is significantly reduced over other blocking methods, which insert additional level of logic into signal propagation path. The overhead incurred in die-area and switching power in normal mode of operation due to extra DFT logic is also significantly lower than the methods using NOR, MUX, and hold-latch.
• Even though redundant switching in combinational logic can be completely eliminated by isolating the combinational part from activity in the scan chain using blocking logic, leakage current flows in the combinational block through the entire shift operation. We have observed that the energy dissipated by leakage power can be significant (about 25% of total active power on average in 70nm technology node for a set of ISCAS89 benchmarks). In order to reduce leakage power in the combinational block, the paper proposes, for the first time, a technique to reduce leakage power in the combinational logic, during scan shifting. It utilizes FLS gating technique with very low DFT overhead to reduce leakage power by input vector control [14] .
• Another component of test power is the power dissipated in the scan chain. The scan partitioning technique presented in [7] is an effective solution for significantly reducing power in the scan chain (switching power of the scan flip-flops and power in the clock line) during testing. However, it does not propose any method for grouping individual scan elements into partitions to deterministically achieve saving in test power. We have noticed that for a given number of partitions, depending on how the scan flip-flops are partitioned, both peak and average power can change significantly. Hence, it is important to devise an efficient partitioning method that can guarantee optimization of test power. We present a low-complexity algorithm to judiciously partition the scan chain and primary inputs such that power saving in the scan chain is maximized.
II. FIRST LEVEL SUPPLY GATING FOR POWER REDUCTION IN SCAN MODE
The dynamic power dissipation in the combinational circuit can be reduced by lowering the activity of the circuit. In this 
A. Active Power Reduction in Test Mode Using Supply Gating
We propose to use the supply gating for dynamic power reduction by reducing the activity of the combinational block during scan shift. We observe that if supply gating is applied to all gates in the active mode, it can prevent propagation of input switching to the circuit, thereby reducing the switching power. However, application of supply gating to the whole circuit has considerable performance and area overhead [2] . To overcome such a significant overhead, we have proposed a novel First Level Supply-gating (FLS) technique, where only the first level logic gates connected to the scan flip-flops are gated using supply gating transistors. Insertion of the supply gating transistor in the first level logic will screen the rest of the combinational logic from the state-input (scan-input) transitions.
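The masking effect can be illustrated with a small behavioural model. The sketch below assumes a gated-GND 2-input NAND as the first-level gate, with a pull-up that forces the output to '1' while the gating control is asserted. The function names and the polarity convention (`gc=True` meaning scan shift) are illustrative assumptions, not taken from the paper.

```python
import random

def fls_nand(a, b, gc):
    """Behavioural model of a gated-GND first-level NAND gate (assumed example).

    gc=True  (scan shift): the GND path is cut and the pull-up forces the
             output to 1, whatever the scan inputs are.
    gc=False (normal mode): the gate behaves as an ordinary NAND.
    """
    if gc:
        return 1
    return 0 if (a and b) else 1

def count_output_toggles(gc, shifts=1000, seed=0):
    """Count output transitions while random scan bits ripple past the gate."""
    rng = random.Random(seed)
    prev = fls_nand(0, 0, gc)
    toggles = 0
    for _ in range(shifts):
        out = fls_nand(rng.randint(0, 1), rng.randint(0, 1), gc)
        if out != prev:
            toggles += 1
        prev = out
    return toggles

masked = count_output_toggles(gc=True)     # gating asserted during shifting
unmasked = count_output_toggles(gc=False)  # no gating
print("toggles with FLS:", masked, "without:", unmasked)
```

With the gating control asserted, the first-level output never toggles, so no transition reaches the rest of the combinational logic in this model.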
The general scheme of the proposed supply gating is shown in Fig. 1. In order to avoid floating nodes at the outputs of the first level gates, which can result in short-circuit currents in the following gates, the outputs of the first level gates need to be forced to logic one or zero in the supply-gated mode. If the GND is gated, the outputs of the first level gates can be forced to VDD by a pull-up PMOS driven by the GATING CONTROL (GC) signal. If the VDD is gated, the outputs of the first level gates can be forced to ground using NMOS pull-down transistors driven by the GC signal. Gated-GND is the more suitable technique because NMOS supply gating transistors have smaller area overhead and lower delay and power penalties than PMOS supply gating transistors [2]. Fig. 2 shows the proposed FLS gating technique applied to a general circuit. To further reduce the overhead, all the first level gates share a single supply gating transistor. Sharing reduces area overhead because a shared supply gating transistor can be smaller than the sum of the sizes of all supply gating transistors in the unshared case. In the unshared case, the size of each supply gating transistor is chosen to be 10 times the minimum transistor size, regardless of the type of the gate. Statistically speaking, for random input data patterns, approximately half of the first level gates are switching at any instant, while the rest do not experience any switching. Therefore, the size of the supply gating transistor in the shared FLS can be half the sum of the sizes of all supply gating transistors in the unshared FLS.
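The sizing rule above amounts to a one-line calculation. The following sketch works it through for a hypothetical block. The gate count and the unit of width are assumptions made for illustration; only the factor of 10 and the one-half switching fraction come from the text.

```python
MIN_SIZE = 1.0        # minimum transistor size, arbitrary units (assumed)
UNSHARED_FACTOR = 10  # each unshared gating transistor is 10x minimum size

def unshared_total(num_first_level_gates):
    """Total gating-transistor size when every first-level gate has its own."""
    return num_first_level_gates * UNSHARED_FACTOR * MIN_SIZE

def shared_size(num_first_level_gates, switching_fraction=0.5):
    """Size of one shared gating transistor, scaled by the fraction of
    first-level gates expected to switch at the same instant."""
    return switching_fraction * unshared_total(num_first_level_gates)

gates = 40  # hypothetical number of first-level gates
print(unshared_total(gates), shared_size(gates))
```

For 40 first-level gates this gives 400 units unshared versus 200 units shared, i.e. the shared transistor halves the gating-transistor area under the random-pattern assumption.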
B. FLS Scan Test Scheme

III. LEAKAGE REDUCTION IN TEST MODE BY INPUT VECTOR CONTROL (IVC)
The FLS scheme eliminates the switching power on combinational blocks during the scan mode. However, the combinational circuits can still dissipate power due to leakage current of transistors. In our experiment, power dissipation due to leakage was 17% of the total test power (in case of FLS, 25% for MUX-based method [11] ) on an average at a 100MHz test clock. The standby leakage increases exponentially with technology scaling and temperature increase [2] . The vector bits are often scanned in using a slow clock to reduce switching power consumption and the chance of errors occurring due to scan chain delays. This increases the scan time and therefore the leakage component of energy dissipation. Therefore, it is important to address the leakage power issue even during the test mode.
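The dependence of the leakage share on the scan clock can be made concrete with a first-order energy model. The sketch below assumes that leakage power is constant during shifting and that switching energy per shift cycle does not depend on the clock rate. All numeric values are made up for illustration, picked so that the 100 MHz case lands in the neighbourhood of the share quoted above.

```python
def shift_energy(p_leak_w, e_switch_per_cycle_j, cycles, f_clk_hz):
    """Return (leakage energy, switching energy) in joules for one scan load."""
    shift_time_s = cycles / f_clk_hz          # slower clock -> longer shift
    e_leak = p_leak_w * shift_time_s          # leakage accumulates with time
    e_switch = e_switch_per_cycle_j * cycles  # first-order: clock independent
    return e_leak, e_switch

def leakage_share(p_leak_w, e_switch_per_cycle_j, cycles, f_clk_hz):
    """Fraction of the shift energy that is due to leakage."""
    e_leak, e_switch = shift_energy(p_leak_w, e_switch_per_cycle_j,
                                    cycles, f_clk_hz)
    return e_leak / (e_leak + e_switch)

# Hypothetical values, not measurements from the paper.
P_LEAK = 1e-3      # leakage power in watts
E_SWITCH = 5e-11   # switching energy per shift cycle in joules
CYCLES = 1000      # shift cycles per scan load

share_100mhz = leakage_share(P_LEAK, E_SWITCH, CYCLES, 100e6)
share_50mhz = leakage_share(P_LEAK, E_SWITCH, CYCLES, 50e6)
print(f"leakage share: {share_100mhz:.3f} at 100 MHz, {share_50mhz:.3f} at 50 MHz")
```

Halving the clock doubles the leakage energy while leaving the switching energy unchanged in this model, so the leakage share grows as the scan clock slows.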
Leakage of a combinational circuit is a strong function of the state of its inputs [14]. Therefore, by selecting the best input vector for a combinational circuit during the standby mode, its leakage power dissipation can be minimized. Input vector control is one of the potent techniques for leakage reduction and is based on forcing a particular logic state onto the combinational gates [14]. Algorithms for finding the best input vector have been proposed in the literature [14]. The existing gating techniques fix the state of the inputs during scan shift. However, this input state may not correspond to the best input vector that minimizes overall leakage power in the combinational block. In the hold-latch [9] and MUX-based [11] gating techniques, the state of the inputs is fixed at the state of the scan flip-flops before scan shifting starts. Therefore, the state of the inputs cannot be set to the best vector in the MUX and hold-latch based gating techniques. In the NOR and NAND based gating techniques, all inputs are forced to logic '0' and '1', respectively. However, all '0' or all '1' may not correspond to the best input vector. NAND and NOR gating can be used together to provide the best input vector for the combinational block during scan shift. In this case, NOR masking is used at the inputs that are to be at logic '0', and NAND masking is used at the inputs that are to be at logic '1', to generate the best input vector. Inverted masking signals are then required to gate the NOR and NAND gates. Although this mixed use of NOR and NAND gating can result in minimum leakage in the combinational block, the masking gates (NAND and NOR) themselves consume leakage power. Applying the best vector through FLS can result in more leakage savings, because FLS does not introduce extra gates to mask the input switching.
Moreover, stacking effect at the first level of logic gates helps to significantly reduce leakage in those gates, contributing to reduction in total leakage.
In the FLS scheme of Fig. 1(a) or 1(b), the outputs of all the first level gates are forced to logic level '1' or '0', respectively. However, this state may not correspond to the best input vector for minimum leakage. By selective use of gated GND (Fig. 1(a)) or gated VDD (Fig. 1(b)) for individual inputs, the state of the circuit can be set to the best input vector during the scan test in order to minimize leakage power dissipation in the combinational circuit. Fig. 3 shows the scan architecture with input vector control using FLS. It is worth noting that sharing of the gating transistors is still possible. However, to avoid a possible short-circuit condition, sharing has to be limited to logic gates with the same type of gating, i.e., all the NMOS GND-gating transistors can be shared among the GND-gated first level gates and all the PMOS VDD-gating transistors can be shared among the VDD-gated first level gates (see Fig. 3). In this case, an inverted gating control signal is required to control the PMOS VDD-gating transistors.
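The selection of gating type per first-level gate, and the grouping for transistor sharing, can be sketched as a simple mapping. The example assumes the desired leakage-minimising output value of each first-level gate is already known (for instance from one of the algorithms in [14]); the gate names and the target vector are hypothetical.

```python
def assign_gating(target_vector):
    """Group first-level gates by the FLS variant that forces the wanted value.

    target_vector maps a gate name to the logic value its output should hold
    during scan shift: 1 -> gated GND with PMOS pull-up (output forced high),
    0 -> gated VDD with NMOS pull-down (output forced low).
    Gates in the same group may share one supply gating transistor.
    """
    groups = {"gated_gnd": [], "gated_vdd": []}
    for gate, bit in target_vector.items():
        if bit not in (0, 1):
            raise ValueError(f"target for {gate} must be 0 or 1, got {bit!r}")
        groups["gated_gnd" if bit == 1 else "gated_vdd"].append(gate)
    return groups

# Hypothetical best vector for four first-level gates.
best_vector = {"g0": 1, "g1": 0, "g2": 1, "g3": 0}
groups = assign_gating(best_vector)
print(groups)
```

Keeping the two groups separate reflects the sharing restriction: the GND-gated gates hang off one shared NMOS, the VDD-gated gates off one shared PMOS driven by the inverted control signal.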
IV. PROPOSED PARTITIONING TECHNIQUE
A. Modified Scan Architecture
The basic idea is to split the scan chain and the primary inputs into multiple partitions and to allow only one partition to make transitions at a time [7]. Since switching occurs only in a small segment of the circuit, the number of simultaneous transitions in the circuit is restricted, which reduces peak power. During shift cycles, only one partition of the scan chain loads new scan-in values and shifts out previous state outputs. The other partitions of the scan chain remain disabled and do not incur any transitions. Hence, rippling of scan transitions is limited to only one partition. Since only one of the scan partitions is active during shifting, we can disable the clock in the other partitions, saving clock power significantly.
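A first-order count of flip-flop clock events shows why partitioning saves scan-chain energy. The sketch below assumes equal-sized partitions, one clock event per clocked flip-flop per shift cycle, and ignores data-dependent switching; it is a simplified model introduced here for illustration, not the paper's power estimation.

```python
def shift_clock_events(num_flops, k=1):
    """Flip-flop clock events needed to load the whole scan chain once.

    Each of the k partitions holds num_flops/k flip-flops, needs that many
    shift cycles to fill, and only its own flip-flops are clocked meanwhile.
    """
    if k < 1 or num_flops % k != 0:
        raise ValueError("num_flops must be a positive multiple of k")
    segment = num_flops // k
    return k * segment * segment

def clock_event_saving(num_flops, k):
    """Fractional reduction in clock events relative to a single chain."""
    baseline = shift_clock_events(num_flops, 1)
    return 1 - shift_clock_events(num_flops, k) / baseline

saving_k2 = clock_event_saving(120, 2)
saving_k3 = clock_event_saving(120, 3)
print(f"k=2: {saving_k2:.3f}, k=3: {saving_k3:.3f}")
```

Under these assumptions the saving is 1 - 1/k, i.e. one half for two partitions and two thirds for three, which is of the same order as the scan-chain energy savings reported in Section V.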
To accommodate the proposed test procedure, we need to make a few simple modifications to the conventional test-per-scan BIST architecture, as in [7]. Test patterns generated by the Linear Feedback Shift Register (LFSR) are applied to the scan chain and to primary inputs. Outputs from the combinational logic [...] Fig. 4 shows the schematic of the modified scan architecture with two partitions of the scan chain (scan chain A and B) and primary inputs (partition I and II). Supply gating transistors are inserted in the first level of logic as shown. Each scan partition has its own clock, generated from a clock tree using gating signals CtrlA and CtrlB. The gating signals are generated by a modulo-2 counter and a decoder. During test mode, the clocks CLK1 and CLK2 are mutually exclusive, so only one scan chain is clocked in any test clock cycle. For normal operation, the clocks are mapped to the system clock. Each partition of the scan chain also has its own independent scan-in and scan-out port. Depending on the value of the control signals (CtrlA and CtrlB), the LFSR feeds each of the scan chains separately. It is worth observing that increasing the number of partitions would reduce the peak power dissipation during both shift and capture cycles, but would also incur hardware overhead and more capture cycles.
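The counter-and-decoder control can be sketched behaviourally. The example below assumes the counter advances once a partition has been fully loaded, which the text does not specify, and generalises the modulo-2 counter to modulo-k; the function names are illustrative.

```python
def decode_one_hot(count, k):
    """Decoder: turn the counter value into k clock-enable signals."""
    return [1 if i == count else 0 for i in range(k)]

def control_schedule(k, cycles_per_partition):
    """Per-cycle clock enables (e.g. [CtrlA, CtrlB] for k=2) for one scan load.

    Assumption: the modulo-k counter steps after each partition is filled.
    """
    schedule = []
    for count in range(k):  # counter values 0 .. k-1
        for _ in range(cycles_per_partition):
            schedule.append(decode_one_hot(count, k))
    return schedule

schedule = control_schedule(k=2, cycles_per_partition=3)
for cycle, enables in enumerate(schedule):
    print(cycle, enables)
```

Every entry of the schedule has exactly one enable set, which is the mutual exclusion of CLK1 and CLK2 described above.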
B. Partitioning Method for Primary Inputs and Scan Chain
For the proposed technique to be effective, judicious partitioning of the primary inputs and scan chain is important. Circuits with certain specific characteristics cannot deterministically produce desired result with respect to peak power saving when the logic cone associated with few primary inputs is very dense as compared to others. Optimal partitioning of the inputs to reduce peak power dissipation depends on both the circuit topology and the input vector set. Since a totally exhaustive search for finding the optimal partition is of exponential complexity and requires inordinate amount of computation, we resort to heuristics.
We formulate the partitioning problem as a combinatorial optimization problem and apply a greedy strategy to obtain an approximate solution. Like many partitioning algorithms, our algorithm is based on assigning a weight to each element and then partitioning the elements into different sets. The objective is to obtain a set of partitions such that the total weight of one partition is comparable to that of the others. The Boolean Difference [4] for an input provides an effective measure of the switching activity in the logic circuit corresponding to a transition at that input. Depending on the circuit structure, transitions at some inputs cause more activity in the internal nodes than those at other inputs. Thus, weights are assigned to each primary and scan input depending on the cumulative signal activity of all nodes in the circuit when the particular input is switched. Although not precise, this gives a fairly good idea of the response of the circuit to a change in the particular input value. Once the weights are assigned to each input, we partition the set of m inputs into k disjoint sets such that the total weight of one set is similar to that of another. When k is two, this is done by sorting the inputs by weight and then assigning alternate elements of the sorted list to the same group. The partitions formed are then optimized iteratively by a greedy strategy: two elements are exchanged if they have similar weights, and the circuit is simulated again to check for improvement. Since the complexity of the proposed algorithm is dominated by the sorting operation, it has an algorithmic complexity of O(m log m). Note that the number of elements in each partition does not need to be equal, as long as the weights of the individual partitions are comparable. Being computationally inexpensive, the algorithm can be easily applied to large designs.
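The sort-and-alternate step and the greedy exchange can be sketched as follows. This is a minimal illustration, not the authors' C implementation: the weights are hypothetical, and the exchange is accepted when it reduces the weight imbalance between partitions, which stands in for the re-simulation check described above.

```python
def initial_partition(weights, k=2):
    """Sort inputs by weight (descending) and deal them out round-robin."""
    order = sorted(weights, key=weights.get, reverse=True)
    parts = [[] for _ in range(k)]
    for i, name in enumerate(order):
        parts[i % k].append(name)
    return parts

def imbalance(parts, weights):
    """Difference between the heaviest and lightest partition."""
    totals = [sum(weights[n] for n in p) for p in parts]
    return max(totals) - min(totals)

def refine(parts, weights, max_passes=10):
    """Greedily exchange neighbours in weight order (similar weights) that
    sit in different partitions, keeping a swap only if imbalance drops."""
    order = sorted(weights, key=weights.get, reverse=True)
    where = {n: i for i, p in enumerate(parts) for n in p}
    best = imbalance(parts, weights)
    for _ in range(max_passes):
        improved = False
        for a, b in zip(order, order[1:]):
            pa, pb = where[a], where[b]
            if pa == pb:
                continue
            # Tentatively exchange a and b.
            parts[pa].remove(a)
            parts[pb].remove(b)
            parts[pa].append(b)
            parts[pb].append(a)
            new = imbalance(parts, weights)
            if new < best:
                best, improved = new, True
                where[a], where[b] = pb, pa
            else:
                # Undo the exchange.
                parts[pa].remove(b)
                parts[pb].remove(a)
                parts[pa].append(a)
                parts[pb].append(b)
        if not improved:
            break
    return parts

# Hypothetical activity weights for two primary inputs and four scan inputs.
weights = {"pi0": 9, "pi1": 7, "sc0": 6, "sc1": 3, "sc2": 2, "sc3": 1}
start = initial_partition(weights, k=2)
before = imbalance(start, weights)
final = refine(start, weights)
after = imbalance(final, weights)
print(final, before, after)
```

In this example the alternate assignment leaves an imbalance of 6 weight units, and two accepted exchanges bring it to 0. The naive imbalance recomputation makes this refinement costlier than the sort; an implementation aiming for the stated complexity would update the partition totals incrementally.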
V. EXPERIMENTAL RESULTS
To check the effectiveness of the proposed power saving techniques, we performed simulations on a set of ISCAS89 benchmark circuits. The gate-level netlists were first technology-mapped to a LEDA 0.25µm standard cell library using Synopsys Design Compiler. The benchmark circuits were then translated to Hspice and scaled to 70nm. The simulation was performed with the 70nm BPTM models [13] to observe the effect of gating in a sub-100nm scaled technology. The power-optimized test vector set was generated using the method described in [14]. NanoSim from Synopsys was used for both leakage and dynamic power estimation. Delay was measured by simulating the critical path of a circuit in Hspice. We assumed full-scan implementation of the benchmarks. The partitioning algorithm was implemented in the C programming language.
Table I shows reductions in overhead (area, delay, and power in normal mode) achieved by FLS. Since the layout rules for the 70nm node are not available, the measure used for area is the total transistor active area (W * L for a transistor). NOR-based gating has the least area penalty among the existing gating techniques. The proposed FLS gating technique exhibits the smallest area overhead for all benchmark circuits (less than 10%), with an average improvement of 65% over NOR-based gating. The FLS technique also has the least impact on circuit delay: its delay overhead is less than 1.5% for all the benchmark circuits. Note that the delay overhead of NOR-based gating would be larger if the input logic polarity is to be preserved. In that case, an extra inverter needs to be added to the inputs to correct the logic level, which further adds to the delay overhead of the NOR-based gating technique. FLS shows an average reduction of 104% in delay overhead compared to NOR-based gating. Significant power savings are observed for all the benchmark circuits (Table I).
In fact, the power dissipation of the FLS circuits is very close to the power dissipation of the original combinational circuit without any gating. This is because in FLS, the supply gating transistor and the pull-up PMOS do not switch in the active mode. The only source of power overhead is the diffusion capacitance added to the outputs of the first level gates by the PMOS pull-up. However, this capacitance is negligible compared to the gate capacitance of the second level gates. FLS shows an average reduction of 119% in power overhead compared to NOR-based gating.
The results of leakage savings by input vector control (IVC) using mixed NOR/NAND and mixed gated GND/VDD FLS for different benchmark circuits are shown in Table II. The best input vectors are found using the algorithms described in [14]. As observed, depending on the benchmark, significant savings can be achieved by applying the best input vector through selective use of gated GND or gated VDD FLS schemes for individual inputs. The mixed FLS gating technique shows average improvements of 38%, 36%, and 31% in leakage power compared to the NOR, NAND, and mixed NOR/NAND masking techniques, respectively. These improvements can be attributed to two facts: a) FLS eliminates the extra gating logic circuits (NOR/NAND), which themselves leak, and b) FLS reduces the leakage of first level gates due to the stacking effect [2]. The leakage reduction is an additional advantage of the mixed FLS technique on top of the benefits in terms of area, delay, and power in the normal mode.
Due to the exponential increase of leakage with technology scaling and temperature, the leakage reduction of the mixed FLS becomes more effective as the technology scales or the temperature increases. Table III shows the improvement in the effectiveness of this technique in reducing the overall test power in scaled technologies. As the technology scales to smaller feature sizes, the leakage power in the combinational block becomes a larger fraction of the total test power. Therefore, leakage reduction by mixed FLS gating can result in a more dramatic reduction in the total test power. As observed from Table III, compared to NOR-based gating, the mixed FLS gating results in an average reduction of 5.3% in the overall test power in the 70nm technology. This reduction improves to 25% in the more scaled 45nm technology. These results demonstrate the scalability of the mixed FLS gating technique.
Table IV shows the peak power improvement in both the combinational part and the scan chain. In Table IV, k denotes the number of partitions in the scan chain and primary inputs. It is clear from the table that substantial reduction in peak power can be achieved by the proposed technique, with a maximum of 43.3% (64%) in the combinational block and 51.4% (63.1%) in the scan chain when partitioned into two (three) sets. Since we prevent switching in the combinational block during scan shifting using supply gating, peak power in the combinational logic corresponds to functional cycles only. Columns 9 and 10 show the percentage saving in total energy in the scan chain. About 51% of the energy in the scan chain can be saved on average with just two partitions, and the saving increases to an average of 64% for three partitions. The reduction in energy in the scan chain comes from reduced rippling of scan values during the shift operation.
Table V shows the percentage improvement in peak power saving with heuristic-based input partitioning over partitioning with respect to the RTL description. It shows that the simple heuristic-based partitioning described in Section IV-B gives an average improvement of about 12.6% and 11.5% in peak power for k = 2 and k = 3, respectively, over partitioning according to the RTL description. It is worth noting that the proposed partitioning method gives a deterministic nature to the solution of the peak power problem and guarantees reduction in peak power for any vector set. The improvement in peak power reduction can be significantly larger when compared to the worst-case possible partitions. The last column presents the CPU time in milliseconds for running the partitioning program on a Sun Sparc Ultra workstation. The algorithm for partitioning is computationally efficient and can be applied to large circuits.
VI. CONCLUSIONS
We have proposed an integrated solution for low power scan design targeting reduction of power in both the scan chain and the combinational block. The proposed FLS gating technique minimizes power in the combinational block. It not only eliminates redundant switching activity in the combinational block, but also provides leakage minimization through application of the best input vector during scan shifting. The proposed scan partitioning technique reduces average and peak 
