Abstract: Power dissipation during scan testing is becoming an important concern as design sizes and gate densities increase. While several approaches have recently been proposed for reducing power dissipation during the shift cycle (minimum-transition don't care fill, special scan cells, and scan chain partitioning), limited work has been carried out toward reducing the peak power during test response capture, and the few existing approaches for reducing capture power rely on complex automatic test pattern generation (ATPG) algorithms. This paper proposes a scan architecture with mutually exclusive scan segment activation which overcomes the shortcomings of previous approaches. The proposed architecture achieves both shift- and capture-power reduction with no impact on the performance of the design, and with minimal impact on area and testing time (typically 2%-3%). An algorithmic procedure for assigning flip-flops to scan segments enables reuse of test patterns generated by standard ATPG tools. An implementation of the proposed method has been integrated into an automated design flow using commercial synthesis and simulation tools, which was used on a wide range of benchmark designs. Reductions of up to 57% in average power, and up to 44% and 34% in peak-power dissipation during shift and capture cycles, respectively, were obtained when using two scan segments. Increasing the number of scan segments to six leads to reductions of 96% in average power and 80% in the maximum number of simultaneous transitions.
Index Terms: Design for testability, low power, scan testing.
I. INTRODUCTION
Scan architectures represent an attractive solution for both built-in and external testing of digital integrated circuits (ICs). This is because they increase the controllability and observability of internal nodes of the circuit, are easy to implement, and have relatively low impact on area and performance. A scan-based test cycle comprises two distinct cycles: shift and capture. Shifting a test pattern into the scan chain occurs simultaneously with shifting out the circuit's response to the previous test pattern. In the capture cycle, the test pattern loaded into the scan chain during the shift cycle is applied to the circuit under test, and the response of the circuit is captured into the scan chain.
Limited battery capacity, high cooling costs, and circuit reliability are only some of the factors which have made it necessary to consider power consumption during IC design [11]. Clock gating is probably the most efficient and commonly used approach for reducing power dissipation at the register-transfer and logic levels [5], [17]. However, traditional scan insertion cancels the effect of the clock gating logic during test [5]. During scan testing, clock gating logic is disabled and, hence, all flip-flops in the design are clocked in every clock cycle. During normal operation, only the flip-flops which have to be updated are clocked, while all remaining flip-flops are disabled by the clock gating logic. Hence, internal switching activity during testing can exceed the level corresponding to the normal operation of the circuit. Sustained intense switching activity causes overheating and electromigration, which can permanently damage the chip under test or seriously affect its reliability. Moreover, the parasitic resistance of the power supply rails, combined with the large current drawn from the power grid by the large number of internal nodes which switch at the same time, reduces the voltage delivered to cells. Ignoring this reduction in voltage, referred to as IR drop (a voltage drop caused by the current I flowing through power/ground lines characterized by an electrical resistance R), increases the probability of noise-induced test failures. Fixing IR-drop-related problems requires redesigning the power grid, and hence, a design respin. Given today's tight market windows and the high cost of a design respin, it is desirable that such late design failures are preempted, if possible, from early design stages.
Several methods aiming to solve power-related problems associated with scan-based test have been proposed recently. They fall into the following broad categories.
Low transition test patterns [3], [12], [18]. These methods reduce the number of transitions in the scan-in vectors, and consequently the shift-power component caused by scan-in transitions. These methods have no direct control over the number of transitions in the scan-out vectors; thus, overall reduction in power cannot be guaranteed. Moreover, these methods do not address peak-power problems during the capture cycle.
Power conscious ATPG algorithms [13], [14], [19]. These are special ATPG algorithms which aim to decrease the number of transitions in scan-in and scan-out vectors for shift-power reduction, and also to decrease the Hamming distance between test stimulus vectors and the corresponding test response vectors for capture-power reduction. These ATPG algorithms, while overcoming the shortcomings of minimum-transition don't care filling methods, are complex, and the generated test sets are generally much larger than test sets generated with regular ATPG algorithms.
Special scan cells [6], [15]. The approach proposed in [6] inserts blocking logic on the outputs of the scan cells in order to block the shift ripple at the inputs of the circuit. Although this method substantially reduces power dissipation during the shift cycle, the blocking logic introduces undesired delay on the data path, which has a negative impact on the circuit's performance. The work presented in [15] improves the solution from [6] by inserting blocking logic only on the outputs of a limited number of flip-flops which are not on critical paths. The blocking logic is enabled/disabled in two additional clock cycles inserted before/after the capture clock. This way, the switching caused by enabling/disabling the blocking logic does not add to the switching caused by the test response capture. Neither of these approaches addresses the problem of peak power during capture cycles.
Scan chain partitioning [1], [10], [16], [20]. The method proposed in [1] uses two nonoverlapping clocks running at half the frequency of the main clock to operate the odd and the even scan cells of the scan chain. This technique reduces shift-power dissipation by a factor of approximately two, without affecting the testing time or the performance of the circuit. The approach proposed in [10] splits the scan chain into multiple segments based on a compatibility relation between the flip-flops and activates only one segment in each shift clock. An extra test vector, computed using a special ATPG algorithm, is applied during the shift cycle to the primary inputs of the circuit under test in order to further reduce switching due to the shift ripple. A simpler yet very efficient approach, first proposed in [20] and extended later in [16], splits the scan chain into length-balanced segments and enables only one in each shift clock. The maximum number of scan cell outputs which are rippling in each shift clock can be tuned by selecting the appropriate number of scan segments. No blocking logic is inserted on the stimulus path; thus, the performance of the design is not affected. Moreover, this method reuses test sets generated for standard scan architectures; hence, it does not require special ATPG algorithms. Operating only during shift cycles (which dominate the overall testing time), these methods reduce average power, hence eliminating the risks of overheating and electromigration. However, in all these approaches, the capture clock is applied simultaneously to all scan cells, leaving the designs prone to noise-induced test failures during capture cycles.
Power dissipation during test capture cycles is likely to be higher than during the functional operation, especially for circuits designed for low power operation. One category of examples are low power finite state machines, where the encodings of "next states" are correlated with the "present states," such that transitions between pairs of "reachable" states cause low switching activity in the circuit. During test, however, any values can be shifted into the state register, including values corresponding to states unreachable during the normal operation. This breaks the correlation between consecutive values loaded in the state register, and may cause higher switching activity in the circuit. Another category of examples are circuits with clock gating. During the normal operation, the clock gating logic disables a fraction of the flip-flops in the design, thus reducing the maximum number of flip-flop outputs which can change their value. During scan testing, however, the clock gating logic is disabled by test-specific signals. Therefore, all flip-flops in the design are clocked in each test clock, which inherently leads to higher switching activity in the circuit compared to the normal operation mode.
New approaches, easily integrable into existing automated design flows, for reducing switching activity not only during shift cycles but also during capture cycles are needed in order to provide a comprehensive and practical solution to the power-related problems associated with scan-based testing. Methods based on scan chain partitioning [1], [10], [16], [20] appear to be efficient solutions in terms of shift-power reduction and integrability into existing design flows versus area and testing time overhead, when compared to other approaches. Hence, these methods merit further investigation and provide the foundation for the work presented in this paper. A scan architecture with mutually exclusive scan segment activation is proposed in Section II for reducing both shift and capture power. Basically, the scan chain is split into a given number of length-balanced segments, and only one segment is enabled during each test clock (shift or capture) through the use of a clock gating scheme. Unlike standard scan architectures and previously proposed low power scan architectures [6], [10], [16], [20], which apply the capture clock at the same time to all scan cells, the proposed scan architecture applies the capture clocks sequentially to the segments of the scan chain. As only a fraction of the flip-flops in the design can change their values simultaneously in each test clock, the proposed architecture reduces not only shift-power dissipation but also capture-power dissipation. Hence, this method eliminates the risks of overheating and electromigration as well as the risk of high IR drops during capture cycles, which can lead to noise-induced test failures. An algorithmic procedure for assigning flip-flops to scan chain segments enables reuse of test vectors generated for single-clock capture. Hence, the proposed low power scan architecture does not require special ATPG algorithms to handle the multiclock capture cycle.
Section III presents experimental results on several benchmark circuits.
II. SCAN ARCHITECTURE WITH MUTUALLY EXCLUSIVE SCAN SEGMENT ACTIVATION
With the goal of reducing the number of scan cells which are switching simultaneously during testing, the method presented in this paper splits the scan chain into a given number of length-balanced segments, and enables only one scan segment during each test clock. At each shift clock, a test stimulus bit is shifted into the active scan segment while a test response bit from the previous test pattern is shifted out from the scan segment. Unlike all previously proposed methods based on scan chain partitioning, instead of applying the same single capture clock to all flip-flops in the design, this scan architecture captures the test response for each test pattern over a sequence of clock cycles, one for each scan segment. Hence, only a fraction (given by the length of the scan segments) of the flip-flops in the design will be clocked in each test clock. This limits the maximum number of flip-flops which can toggle simultaneously, and consequently both shift and capture clock cycles will generate only a limited amount of switching activity in the circuit. This method, replacing standard scan insertion, reduces both average and peak-power dissipation during test. This enables shifting of test data at high frequencies without the risk of overheating the chip under test, and also eliminates the risk of noise-induced test failures, hence avoiding unnecessary respins of the design.

Fig. 1 presents the proposed low power scan architecture. The scan chain is divided into N length-balanced segments. If the number of scan cells is not a multiple of N, the sum of the differences between the scan segment lengths is upper bounded by N - 1 (the maximum remainder of division by N). In order to account for the small length differences between the scan segments without increasing the complexity of the scan control unit, the test vectors are padded with dummy bits.
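The length balancing and padding described above can be illustrated with a short sketch. The function names `segment_lengths` and `dummy_bits` are hypothetical, and the rule of giving the leftover cells to the first segments is an assumption made for illustration, not necessarily the assignment used in the paper.

```python
def segment_lengths(num_cells: int, num_segments: int) -> list[int]:
    """Split num_cells scan cells into num_segments length-balanced segments."""
    base, extra = divmod(num_cells, num_segments)
    # Assumption: the first `extra` segments receive one more cell than the rest.
    return [base + 1 if i < extra else base for i in range(num_segments)]


def dummy_bits(num_cells: int, num_segments: int) -> int:
    """Padding bits needed so that every segment looks as long as the longest one."""
    lengths = segment_lengths(num_cells, num_segments)
    longest = max(lengths)
    return sum(longest - length for length in lengths)


lengths = segment_lengths(1000, 3)
padding = dummy_bits(1000, 3)
print(lengths, padding)
```

Under this assumption, the total padding never exceeds N - 1 bits per test vector, in line with the bound stated in the text.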
The size of the scan control unit depends only on the number of scan segments and, hence, it is not affected by the size of the design. Fig. 3(b) shows the simulation waveforms for the scan control unit for a low power scan chain architecture with three scan segments. During test mode (test mode = 1), the three clock signals generated for the three scan segments are mutually exclusive during both shift and capture. Initially (t = 0 ns), the scan chain is in shift mode (scan enable = 1). The scan segments are clocked in a cyclic sequence (Segment 0, Segment 1, Segment 2, Segment 0, ...) until all bits of a test pattern are loaded into the scan chain. At t = 360 ns, the test pattern has been fully loaded into the scan chain and the architecture is put into capture mode by driving the scan enable signal low. Three capture clocks are applied in sequence, one for each scan segment. After the first capture clock (scan clk[0] = 00100), the first third of the circuit response is latched into Segment 0; in the second capture clock (scan clk[1] = 00100), another third of the test response is stored into Segment 1; and in the third and last capture clock (scan clk[2] = 00100), the last part of the circuit response is stored into Segment 2. The multiclock capture cycle is the fundamental difference between this approach and all previously proposed low power scan architectures, which capture the entire test response in a single clock. While in the case of single-clock capture all flip-flops in the scan chain can change their values simultaneously, the multiclock capture cycle allows at most 1/N of the flip-flops in the design to change their value simultaneously. After N capture clocks, the entire test response is available in the scan chain, and thus a new shift cycle can start.

1 One possible solution for making the scan control unit testable is to scan its sequential part, i.e., the flip-flops of the modulo-N counter, and add observation
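The clocking sequence described for Fig. 3(b) can be sketched behaviorally as follows. This is only a software model of which segment is clocked on each test clock, not the scan control unit itself; the function name `clock_schedule` and its parameters are hypothetical.

```python
def clock_schedule(bits_per_pattern: int, num_segments: int) -> list[tuple[str, int]]:
    """Return (phase, active_segment) for every test clock of one test pattern."""
    # Shift: a modulo-N counter selects the segments in a cyclic sequence.
    schedule = [("shift", clock % num_segments) for clock in range(bits_per_pattern)]
    # Capture: one capture clock per segment, applied in sequence.
    schedule += [("capture", segment) for segment in range(num_segments)]
    return schedule


for phase, segment in clock_schedule(bits_per_pattern=6, num_segments=3):
    print(phase, "-> Segment", segment)
```

Each entry names exactly one segment, which reflects the mutual exclusion of the segment clocks during both shift and capture.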
During normal operation (test mode = 0), all three clocks are mapped to the system clock. Clock gating circuitry corresponding to the normal operation mode should be built on the scan clk signals and it should be disabled by asserting high the test mode signal.
As the scan segments are length-balanced and only one scan segment is active during each test clock, the number of simultaneously clocked flip-flops (i.e., the sources of switching activity in the circuit under test) can be tuned at scan insertion by selecting the appropriate number of segments for the scan chain. It should be noted that increasing the number of scan segments also increases the number of capture clocks, and hence, the overall testing time. However, the increase in testing time is insignificant for circuits with long scan chains where the testing time is dominated by the shift cycles. For example, for a circuit with a 1000 flip-flop scan chain, partitioning the scan chain into two segments will reduce the number of simultaneously clocked scan cells to 50% while increasing the length of a test cycle by only one extra capture clock, which represents 0.1% of the original testing time.
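The arithmetic behind the 1000 flip-flop example can be checked with a small sketch. It assumes one shift clock per scan cell and one capture clock per segment, and it ignores dummy padding bits; the helper names are hypothetical.

```python
def clocks_per_pattern(chain_length: int, num_segments: int = 1) -> int:
    """Shift clocks (one per scan cell) plus one capture clock per scan segment."""
    return chain_length + num_segments


def relative_overhead(chain_length: int, num_segments: int) -> float:
    """Increase in test cycle length relative to a standard single-capture scan chain."""
    standard = clocks_per_pattern(chain_length, 1)
    proposed = clocks_per_pattern(chain_length, num_segments)
    return (proposed - standard) / standard


overhead = relative_overhead(chain_length=1000, num_segments=2)
print(f"extra capture clocks: 1, overhead: {overhead:.2%}")
```

With two segments, at most half of the scan cells are clocked at a time, while the test cycle grows from 1001 to 1002 clocks, i.e., by roughly 0.1%.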
There are two basic types of testing: dc testing, which is done to verify the circuit structure independent of frequency or timing, and ac testing, which assesses frequency and timing compliance [4]. AC scan testing means applying the scan capture clocks at the operating frequency of the circuit under test. The proposed mutually exclusive capture clock generation scheme has been developed specifically to target dc tests. If ac tests are required too, the proposed scan chain architecture can be treated as a standard scan chain by using the same capture clock for all segments.

In order to reduce the overall testing time, it is desirable to increase the test concurrency at system level. However, power is a constraining factor for the maximum test concurrency at system level. Previously proposed low power scan architectures reduce power dissipation during shift cycles, but the capture-power dissipation remains unchanged. Let us assume the peak shift power is X and the peak capture power is 0.8X for a given design when using a standard scan architecture. The low power scan architecture proposed in [16] and [20] with three scan segments will reduce shift power by a factor of three (to 0.33X), while capture power remains 0.8X. Hence, the global peak power (shift and capture) is decreased by only 20%, compared to a potential reduction of 66%. This leads to suboptimal test concurrency at system level for a given power constraint, and hence, to longer test times. Our architecture, however, reduces both shift and capture power, hence enabling increased test concurrency at system level under the given power constraint. Shortening the duration of the stuck-at test session allows more time for the at-speed tests, which can be executed in a more sequential fashion to comply with the given power constraint. It should be noted that the proposed architecture allows at-speed tests to be applied by using the same capture clock for all scan segments.
In conclusion, a complete test session will consist of two subsessions: a short and highly parallel subsession for stuck-at faults, when the mutually exclusive clocking scheme is used during both shift and capture, followed by a low concurrency subsession of at-speed tests, when the mutually exclusive clocking scheme is active only during shift cycles. As testing time is an important factor in the overall cost of test, the proposed scan architecture represents an efficient solution for reducing the cost of testing complex chips under power constraints.
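The peak-power comparison above can be reproduced numerically. The sketch below takes the global peak as the larger of the shift and capture peaks, and it assumes, as an idealization, that splitting into N segments divides the affected peak by N; real designs will deviate from this.

```python
def global_peak(shift_peak: float, capture_peak: float) -> float:
    """The global peak power is the larger of the shift and capture peaks."""
    return max(shift_peak, capture_peak)


def reduction(before: float, after: float) -> float:
    """Relative reduction of `after` with respect to `before`."""
    return 1.0 - after / before


X = 1.0  # peak shift power of the standard scan architecture (normalized)
N = 3    # number of scan segments

standard = global_peak(X, 0.8 * X)
shift_only = global_peak(X / N, 0.8 * X)             # shift-only partitioning
shift_and_capture = global_peak(X / N, 0.8 * X / N)  # idealized: capture also scales by 1/N

print(f"shift-only reduction: {reduction(standard, shift_only):.0%}")
print(f"shift+capture reduction: {reduction(standard, shift_and_capture):.0%}")
```

When only shift power is reduced, the capture peak of 0.8X becomes the bottleneck, which is why the global peak drops by 20% rather than by the roughly two thirds that reducing both peaks could give.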
A. Structural Dependencies and Capture Violations
In order to reuse test stimuli and test responses generated using traditional ATPG tools for single-clock capture cycles, it is necessary to ensure throughout the multiclock capture cycle that stimulus data bits are overwritten with test response bits only after they have become unnecessary.

According to the timing diagram shown in Fig. 4, FF1 and FF2 are assigned to different scan segments, and hence, their capture clocks do not occur simultaneously. As FF2 depends on FF1, after applying the capture clock to FF1 (clk1 = 00100), the value held by FF1, representing stimulus data for FF2, is overwritten with the test response bit.
Definition 3: The situation when a capture clock applied to a flip-flop in the design overwrites a necessary stimulus bit is referred to as a "capture violation."
The structural dependencies between flip-flops in the design have to be analyzed in order to identify all possible "capture violation" situations. For this purpose, a structural dependency graph (SDG) can be derived from the netlist of a design. Each node in the SDG corresponds to a flip-flop in the design, and a directed edge from node Vi to node Vj means there is a combinational path from the output of flip-flop Vi to the input of flip-flop Vj. According to the SDG model, Vi depends on Vj if there is a path in the SDG from Vj to Vi. In the case of a bidirectional dependency between two nodes Vi and Vj, i.e., when Vi and Vj belong to a cycle in the SDG, flip-flops Vi and Vj must receive the same capture clock in order to avoid a "capture violation" situation. Generalizing this observation, all nodes from a strongly connected component (or simply strong component) [7] of the SDG must share the same capture clock, as there is a path between each pair of nodes of a strong component. Consider, for example, the SDG shown in Fig. 5. Nodes FF4, FF5, FF6, FF7, and FF8 form a strong component, as there is a path between each ordered pair of them. Applying the capture clock to one of these flip-flops before applying it to the others will result in a capture violation. For example, capturing first in FF4 will overwrite the test stimulus needed by FF5, FF6, and FF8, and so on. Therefore, flip-flops FF4, FF5, FF6, FF7, and FF8 must be assigned to the same scan segment in order to receive the same capture clock.
From the above discussion, it can be concluded that structural dependencies between flip-flops have to be taken into account when assigning flip-flops to scan segments in order to preserve test stimulus and test response vectors computed for single-clock capture. Section II-B presents a systematic method for partitioning the flip-flops in the design into equal-length scan segments and scheduling segment capture clocks while avoiding "capture violations."
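Identifying the strong components of an SDG can be sketched as follows. The edge list is a hypothetical graph chosen so that FF4 to FF8 lie on a cycle; it is not the actual graph of Fig. 5. The sketch uses Kosaraju's two-pass depth-first search, which is one of several linear-time algorithms and not necessarily the one used by the authors.

```python
from collections import defaultdict


def strong_components(nodes: list[str], edges: list[tuple[str, str]]) -> list[set[str]]:
    """Kosaraju's algorithm: DFS on the graph, then DFS on the reversed graph."""
    succ, pred = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    order, seen = [], set()

    def visit(u: str) -> None:
        seen.add(u)
        for v in succ[u]:
            if v not in seen:
                visit(v)
        order.append(u)  # record finishing order

    for u in nodes:
        if u not in seen:
            visit(u)

    components, assigned = [], set()

    def collect(u: str, component: set[str]) -> None:
        assigned.add(u)
        component.add(u)
        for v in pred[u]:
            if v not in assigned:
                collect(v, component)

    for u in reversed(order):
        if u not in assigned:
            component: set[str] = set()
            collect(u, component)
            components.append(component)
    return components


# Hypothetical SDG: an edge (a, b) means b depends on a.
nodes = ["FF1", "FF4", "FF5", "FF6", "FF7", "FF8", "FF9"]
edges = [("FF1", "FF4"), ("FF4", "FF5"), ("FF5", "FF6"), ("FF6", "FF7"),
         ("FF7", "FF8"), ("FF8", "FF4"), ("FF8", "FF9")]
components = strong_components(nodes, edges)
largest = max(components, key=len)
print(sorted(largest))
```

All flip-flops in `largest` would have to share one capture clock, so its size is a lower bound on the scan segment length unless the component is broken up.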
B. Scan Chain Partitioning
Partitioning the flip-flops in the design into scan segments must meet the following two constraints.
1) The scan segments have to be length-balanced.
2) There is at least one ordering of the segment capture clocks which does not lead to any "capture violations" between the scan segments.
According to the low power scan architecture presented in Fig. 1, all flip-flops assigned to a scan segment share the same clock signal. As explained earlier, all flip-flops covered by a strong component in the SDG must share the same capture clock in order to avoid "capture violations," and consequently, they must all be assigned to the same scan segment. This implies that the length of the scan segments is lower bounded by the size of the largest strong component in the SDG. However, the scan segment length is imposed by the given number of scan segments, as the scan segments are length-balanced. It might happen that the size of the largest strong component in the SDG exceeds the scan segment length imposed by the number of scan segments. In this case, it is necessary to "break" the largest strong component into smaller ones, which can be fitted into scan segments of the desired length. "Breaking" a strong component means removing some of the bidirectional dependencies between two or more nodes in the strong component. This can be achieved by replacing a node in the strong component with a pair of nodes: an input-only node and an output-only node. This pair will be further referred to as an "extended node." The input-only node holds the stimulus bit for the fan-out logic cone, while the output-only node captures the test response bit from the fan-in logic cone. As there is just a one-way dependency between the input-only node and the output-only node (more precisely, the output-only node depends on the input-only node), the two nodes can have different capture clocks, and hence, they can be assigned to different scan segments.
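The effect of replacing a node with an "extended node" can be sketched on an edge list. The `_in` and `_out` suffixes, the function names, and the four-node ring are hypothetical; the edge from the input-only node to the output-only node models the one-way dependency mentioned above and is added unconditionally as a conservative assumption.

```python
def split_node(edges: list[tuple[str, str]], node: str) -> list[tuple[str, str]]:
    """Replace `node` with an input-only node (stimulus) and an output-only node (response)."""
    stim, resp = node + "_in", node + "_out"
    new_edges = []
    for src, dst in edges:
        new_src = stim if src == node else src  # fan-out cone is driven by the input-only node
        new_dst = resp if dst == node else dst  # fan-in cone is captured by the output-only node
        new_edges.append((new_src, new_dst))
    new_edges.append((stim, resp))  # assumed one-way dependency: resp depends on stim
    return new_edges


def reachable(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Nodes reachable from `start` through at least one edge."""
    succ: dict[str, list[str]] = {}
    for src, dst in edges:
        succ.setdefault(src, []).append(dst)
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in succ.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


ring = [("FF4", "FF5"), ("FF5", "FF6"), ("FF6", "FF7"), ("FF7", "FF4")]
broken = split_node(ring, "FF7")
print("FF4 on a cycle before:", "FF4" in reachable(ring, "FF4"))
print("FF4 on a cycle after: ", "FF4" in reachable(broken, "FF4"))
```

Because the output-only node has no outgoing edges and the input-only node has no incoming edges, no cycle can pass through the extended node, so the strong component it belonged to falls apart into smaller ones.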
There are two alternatives for implementing the "extended nodes" in hardware. The first possible solution is illustrated in Fig. 6 . Flip-flop SFF1 corresponds to a node selected for breaking the largest strong component in the SDG. In this "extended node" implementation, SFF1 is used as an input-only node, and an extra flip-flop SFF2 is added to act as the corresponding output-only node. In this solution, no extra logic is inserted on the functional data path, thus, the impact on the performance of the original circuit is minimal. The delay introduced by the capacitance of the D input of SFF2 can be compensated, if necessary, by resizing the driving gate. For this implementation of "extended nodes," the test vectors have to be padded with dummy bits on the positions corresponding to output-only nodes, as these nodes are used only for test response capture.
If the performance of the circuit is not critical, another solution is to implement the pair of nodes using a scan-hold flip-flop [2, p. 483], as shown in Fig. 7. This solution incurs less area overhead compared to the first approach at the cost of an extra delay introduced on the functional data path by the "hold" latch. The HOLD line of the scan-hold flip-flop is driven by the scan enable signal. During the shift cycle (scan enable = 1), HOLD is asserted to one and, hence, the "hold"

ments Nseg (line 1). Next, the set of strong components SSC of the SDG is identified (line 2) using a linear time search algorithm [7, p. 30]. If the size of the largest strong component exceeds the scan segment length Lseg imposed by the given number of scan segments, the largest component is broken into smaller ones by replacing one of its nodes with an "extended" node (line 5). This step is repeated until the size of the largest strong component in the SDG no longer exceeds the required scan segment length Lseg. Once the sizes of strong components in the SDG have been adjusted according to the segment length, the algorithm proceeds to assigning nodes in the SDG to scan segments (line 6). The set of covered nodes Cnodes and the first scan segment Sseg0 are initialized to empty sets (lines 6 and 7). An iterative procedure starts to assign flip-flops to scan segments. At each iteration, the algorithm identifies the strong component sc in the SDG which has all fan-out nodes, if any, already covered, i.e., in the covered node set Cnodes, and adds the nodes in sc to the current scan segment (line 11). Hence, during the first iterations, the primary outputs of the design, which also include the output-only parts of "extended nodes," will be assigned to the first scan segment, as they do not have any fan-out nodes, i.e., no flip-flops in the design depend on them.
When the number of nodes in the current scan segment reaches the scan segment length Lseg (line 10), the nodes in the current segment are marked as covered and a new empty segment is started. This process is repeated until all nodes in the SDG have been assigned to scan segments. If not all nodes could be fitted into the given number of scan segments (line 13), the algorithm breaks the largest strong component in the SDG and repeats the procedure of assigning strong components to scan segments. The order in which the capture clocks will be applied is the same as the order in which the scan segments were created according to Algorithm 1. This ensures that each capture clock overwrites only stimulus data which has become unnecessary for the current capture cycle. The following example shows how scan chain partitioning works.

Example 1: Consider the SDG shown in Fig. 5, where nodes FF1, FF2, and FF3 are primary inputs, nodes FF9, FF10, and FF11 are primary outputs, and nodes FF4, FF5, FF6, FF7, and FF8 represent internal flip-flops. The largest strong component in this case contains five nodes, FF4, FF5, FF6, FF7, and FF8, as there is a path between each ordered pair of these nodes. Assuming the given number of scan segments Nseg is four, the scan segment length is three. It can be seen that for the original SDG, the size of the largest strong component exceeds the scan segment length.
The algorithm selects node FF7 as the "breaking" node for the largest strong component in the SDG. Thus, node FF7 will be replaced with an extended node comprising the pair (FF7a, FF7b) [Fig. 8], where FF7a is the output-only node, while FF7b is the input-only node. The largest strong component now has only two nodes, FF4 and FF5, which already complies with the imposed scan segment length. Analysis of the resulting SDG, shown in Fig. 8, shows that flip-flops FF7a, FF4, and FF5, and FF7b, FF6, and FF8, respectively, can be assigned to different scan segments without causing "capture violations," as long as the first three flip-flops receive the capture clock after the latter three. The scan chain partitioning algorithm continues with assigning nodes to scan segments. As initially the set of covered nodes is empty, the algorithm assigns the three primary output nodes, FF9, FF10, and FF11, to the first scan segment (Fig. 9). This segment will receive the first capture clock in the multiclock capture cycle, as none of the remaining flip-flops in the design depend on the values of the primary outputs, and hence, no "capture violation" can occur. Next, the algorithm assigns nodes FF6, FF7b, and FF8 to the following scan segment, as only nodes in Segment 0 depend on them, and Segment 0 has already been scheduled for earlier capture. In a similar fashion, the algorithm assigns nodes FF4, FF5, and FF7a to Segment 2, and nodes FF1, FF2, and FF3 to Segment 3, respectively. From examining Fig. 9, it can be observed that by applying capture clocks to Segment 0, Segment 1, Segment 2, and Segment 3 in this order, no necessary stimulus bits will be overwritten, and hence, no "capture violation" will occur.
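A capture order can be checked against the SDG with a few lines of code. The sketch below is a checker rather than Algorithm 1 itself: it flags every dependency whose source flip-flop would be overwritten before the dependent flip-flop captures. The edge list is hypothetical (it is not the graph of Fig. 8 or Fig. 9), and the two halves of the extended node are left out for brevity.

```python
def capture_violations(edges, segment_of, capture_order):
    """Return edges (u, v), where v depends on u, such that u is captured strictly before v."""
    slot = {segment: position for position, segment in enumerate(capture_order)}
    return [(u, v) for u, v in edges if slot[segment_of[u]] < slot[segment_of[v]]]


# Segment assignment following Example 1, without the extended-node halves.
segments = {
    0: ["FF9", "FF10", "FF11"],  # primary outputs, captured first
    1: ["FF6", "FF8"],
    2: ["FF4", "FF5"],
    3: ["FF1", "FF2", "FF3"],    # primary inputs, captured last
}
segment_of = {ff: seg for seg, ffs in segments.items() for ff in ffs}

# Hypothetical dependencies: an edge (u, v) means v depends on u.
edges = [("FF1", "FF4"), ("FF2", "FF5"), ("FF3", "FF6"), ("FF4", "FF5"),
         ("FF5", "FF4"), ("FF5", "FF6"), ("FF6", "FF8"), ("FF8", "FF9"),
         ("FF6", "FF10"), ("FF4", "FF11")]

print(capture_violations(edges, segment_of, capture_order=[0, 1, 2, 3]))
print(capture_violations(edges, segment_of, capture_order=[3, 2, 1, 0]))
```

Flip-flops in the same segment share a capture clock and therefore never violate each other, which is why mutually dependent nodes such as FF4 and FF5 are acceptable when placed together.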
III. EXPERIMENTAL RESULTS
The efficiency of the low power scan architecture described in Section II was validated by running two sets of experiments using the seven largest ISCAS89 benchmark circuits. Ten additional designs were generated by concatenating two to seven of the largest ISCAS89 circuits, in order to assess the scalability of the proposed approach to larger designs. The number of flip-flops in the designs considered for experiments ranged from 300 to 7000.
A preliminary analysis was performed on the ISCAS89 circuits with standard scan architectures to determine the fraction of shift and capture cycles which cause high power consumption. The results of this analysis are shown in Table I. For this analysis, we used ATPG-generated (Mintest [9]) test vectors with the don't cares mapped to zeros. Experimental data shows that the fraction of capture clocks for which the power dissipation exceeds 80% of the global peak capture power ranges from 5% (circuit s5378) to 75% (circuit s13207). Therefore, avoiding capture-power peaks by removing the "problem" test patterns from the test set is not feasible for some designs without seriously affecting the fault coverage of the test set. Designs which exhibit high power dissipation during a significant fraction of the capture cycles could benefit from the proposed scan architecture, which reduces both shift and capture power without affecting the fault coverage of the original test set.
The goal of the first set of experiments was to estimate the reduction in average power which can be achieved using the proposed method. Six experiments were performed for each design: one experiment using standard scan chain insertion, and five experiments using the proposed scan architecture with two to six scan segments. The following flow was used in each experiment.
1) Each design was synthesized using the Alcatel MTC35000 technology library.
2) The appropriate type of scan chain (standard or low power) was inserted into the synthesized design.
3) The design was simulated using Mentor Graphics' ModelSim [8] simulator using five pseudorandomly generated scan patterns in order to capture the toggle activity of internal nodes.
4) The toggle activity was back-annotated to the synthesized design, and an average power estimation was obtained using Synopsys' Power Compiler [17].
Table II shows the relation between the average power dissipation and the number of scan chain partitions. Column 2 corresponds to the standard single-segment scan chain, while the remaining columns show the results for the proposed low power scan chain architecture using two to six scan segments. For each of the five versions of the low power architecture, Table II reports the average power dissipation (Pavg) as well as the relative reduction (%red) obtained over the standard scan chain. It should be noted that the reported values correspond to the power dissipated by the circuit under test, including the scan chains. The power consumed by the clock tree is not considered. For example, for circuit s38584, the proposed scan architecture with two scan segments reduced the average power by 50% compared to the standard scan architecture. The three scan segment architecture further reduces average power by an additional 42%, which represents a 92% reduction compared to the standard scan architecture. The last two rows in Table II show the average and worst case reductions in average power dissipation.

Table III shows the overhead associated with the proposed low power scan architecture. The increase in testing time due to the multiclock capture cycle can be derived from the number of scan segments and the total number of flip-flops in the design. The number of flip-flops in the original designs is shown in Column 2 (FF). Columns xFF show the number of extended nodes needed to implement the proposed low power scan chain for each experiment.
Columns % show the number of extended nodes as a percentage of the total number of flip-flops in the original design. Depending on the solution used to implement an extended node, the number of extended nodes represents:
1) The number of extra scan cells which have to be added to the design, and also the number of additional shift clocks per test pattern, when extended nodes are implemented using extra scan flip-flops (Fig. 6).
2) The number of scan cells which have to be replaced with scan-hold flip-flops when extended nodes are implemented using scan-hold flip-flops (Fig. 7). It should be noted that, in this case, the total number of scan cells in the design does not increase.
Generally, the percentage of extended nodes decreases, and can get as low as zero, as the number of flip-flops in the design increases. This is because, for large designs, the length of the scan segments tends to be much larger than the size of the largest strong component in the SDG and, thus, only a few or no extended nodes are necessary during scan chain partitioning. The last two rows in Table III report the average and worst case percentages of extended nodes. Even for the worst case scenarios, reductions up to nearly 70% can be achieved by using the proposed low power scan architecture at the cost of having at most 12% extended nodes from the total number of flip-flops in the design. The last column in Table III shows the worst case CPU times (in seconds) required to perform the scan chain partitioning algorithm and to insert the resulting scan chain into the designs. The proposed scan chain partitioning and scan insertion were performed using a tool written in C++ running on a 1.6-GHz Pentium 4 Linux machine with 512 MB of random access memory.
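The derived columns of Tables II and III follow from simple ratios. The sketch below shows one way they might be computed; the function names and all numeric values are hypothetical and chosen only for illustration (they are not taken from the tables). It also makes explicit that successive reductions are expressed relative to the standard scan chain, so they add up in percentage points.

```python
def percent_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` with respect to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - value) / baseline


def extended_node_percentage(num_ff: int, num_xff: int) -> float:
    """Extended nodes as a percentage of the flip-flops in the original design."""
    if num_ff <= 0:
        raise ValueError("num_ff must be positive")
    return 100.0 * num_xff / num_ff


# Illustrative values only (assumptions, not measurements from Table II).
p_standard = 100.0   # average power, standard scan chain (arbitrary units)
p_two_seg = 50.0     # average power, two scan segments
p_three_seg = 8.0    # average power, three scan segments

red_two = percent_reduction(p_standard, p_two_seg)      # reduction vs. standard
red_three = percent_reduction(p_standard, p_three_seg)  # reduction vs. standard
additional = red_three - red_two                        # percentage points gained

# Illustrative values only (assumptions, not entries of Table III).
num_ff, num_xff = 1000, 120
xff_percent = extended_node_percentage(num_ff, num_xff)

# With extra scan flip-flops, each extended node adds one shift clock per pattern.
extra_shift_clocks_per_pattern = num_xff

print(red_two, red_three, additional, xff_percent)
```

With scan-hold flip-flops, the same count of extended nodes would instead give the number of replaced scan cells, and no shift clocks would be added.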
A second set of experiments was performed in order to estimate the reduction of the peak-power dissipation achieved using the proposed scan architecture. Cycle-accurate power simulation is necessary for determining the peak-power dissipation. As transistor-level simulation is time consuming, the number of transitions in the circuit occurring in each clock cycle was used as a cycle-accurate measure of power dissipation. The six versions of each design (standard scan architecture and proposed scan architectures with two to five scan segments) were simulated using 20 linear feedback shift register (LFSR)-generated test patterns and 20 test patterns generated using Mintest [9], with don't cares mapped to zeros. The number of transitions in the circuit was recorded for each clock cycle. In order to compensate for the disproportion between the number of shift cycles and capture cycles, the scan control unit was modified for these experiments to apply five consecutive captures, instead of a single one, for each test pattern. Tables IV and V show the peak power values, in terms of number of transitions per clock, for ATPG and LFSR-generated test patterns, respectively. Columns 2 and 3 show the peak-power dissipation during shift and capture cycles, respectively. As can be seen, peak-power dissipation during capture cycles is comparable with peak-power dissipation during shift cycles. Thus, reducing peak power in capture cycles is as important as reducing peak power during shift for avoiding IR-drop related test failures. Columns 4 to 19 in Tables IV and V show the shift and capture peak power for the proposed scan architecture with two to five scan segments. The %r columns show the reductions obtained over the values corresponding to the standard scan architecture.
For example, for circuit s38584, the proposed scan architecture with two segments obtained reductions of 21% for ATPG-generated test vectors, and of 27% for pseudorandom vectors in capture peak-power dissipation over the standard scan architecture. Reductions of 26% and 25% in shift peak power were obtained for pseudorandom and ATPG-generated vectors, respectively. The last two rows show the average and worst relative reductions for all experiments.
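The transition-count metric used above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation flow used in the experiments: the trace format, the function names, the use of the character X for don't cares, and the tiny six-node trace are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fill_dont_cares(pattern: str) -> str:
    """Map don't-care bits (written here as 'X', an assumed notation) to zeros."""
    return pattern.upper().replace("X", "0")


def count_transitions(prev: Sequence[int], curr: Sequence[int]) -> int:
    """Number of nodes whose value differs between two consecutive clock cycles."""
    if len(prev) != len(curr):
        raise ValueError("state vectors must have the same length")
    return sum(1 for a, b in zip(prev, curr) if a != b)


def peak_transitions(trace: List[Tuple[str, Sequence[int]]]) -> Dict[str, int]:
    """Peak transitions per clock, kept separately for shift and capture cycles.

    `trace` holds (cycle_kind, node_values) pairs, with cycle_kind being
    "shift" or "capture"; each cycle is compared against the previous one.
    """
    peaks = {"shift": 0, "capture": 0}
    for (_, prev), (kind, curr) in zip(trace, trace[1:]):
        peaks[kind] = max(peaks[kind], count_transitions(prev, curr))
    return peaks


# Hypothetical trace of a six-node circuit (values are illustrative only).
trace = [
    ("shift",   [0, 0, 0, 0, 0, 0]),
    ("shift",   [1, 0, 1, 0, 0, 0]),
    ("shift",   [1, 1, 0, 0, 1, 0]),
    ("capture", [0, 1, 1, 1, 1, 1]),
    ("shift",   [0, 1, 1, 1, 1, 0]),
]

peaks = peak_transitions(trace)
filled = fill_dont_cares("1X0XX1")
print(peaks, filled)
```

Running the same procedure on the standard and on the segmented scan architectures, and comparing the resulting peaks, would yield figures analogous to the %r columns of Tables IV and V.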
IV. CONCLUSION
This paper presented a scan chain architecture using mutually exclusive scan segment activation, where the scan chain is split into length-balanced segments and only one segment is enabled in each test clock (shift or capture). Thus, this architecture not only reduces average power but also eliminates peak-power problems during capture cycles, which have not been addressed by previous approaches based on scan chain partitioning. The maximum number of flip-flops which can change their values simultaneously is limited to the scan segment length. Increasing the number of scan segments reduces the switching activity in the circuit under test and, consequently, the power dissipation. The algorithmic procedure proposed for assigning flip-flops to scan segments enables full reuse of test vectors generated using standard ATPG tools without affecting the fault coverage. An implementation of the proposed method was integrated into an automated design flow using commercial synthesis and simulation tools, and this flow was used for a set of experiments performed on 17 benchmark designs. These experiments showed that significant reductions in both peak and average power are achieved when using the proposed scan architecture, without affecting the performance of the designs and with minimal impact on area and testing time (typically 1%-3%). Hence, this method represents a potential solution to power-related issues associated with scan-based testing.
