This paper discusses an automated method to build scan chains at the register-transfer level (RTL) for power-constrained at-speed testing. By analyzing a circuit at the RTL, where design complexity is lower than at the gate netlist level, one can divide a circuit into multiple partitions, which can be tested independently in order to reduce test power. Although only one partition is activated at a time, we show how, through conscious construction of scan chains, high transition fault coverage can be achieved while reducing the test time of the circuit when employing third-party test generation tools. Furthermore, as shown in the experimental results, by constructing scan chains for the partitioned circuit at the RTL, the area and performance penalty of the design-for-test hardware may be reduced.
I. Introduction
Structural tests targeting the single stuck-at fault model applied through scan chains (SCs) have been successfully used to detect physical defects that affect the static circuit behavior [1]. As the geometric feature size of digital integrated circuits decreases, the number of physical defects that affect the dynamic behavior (i.e., timing failures) is on the rise. One problem with the existing current-based test methods used to screen these types of defects, e.g., IDDQ, is that it is becoming increasingly difficult to distinguish the quiescent current of a faulty device from that of a fault-free one [2]. As early as in [3], it has been shown that by applying the same set of stuck-at test vectors at the operational frequency, timing-related faults can also be detected. As a result, at-speed testing has established itself as an essential step in manufacturing test. However, applying at-speed tests using scan poses unique challenges, as discussed next.
To detect timing-related defects, two test patterns V1 and V2 need to be used to initialize the logic into a known state, and to trigger the targeted transitions in the circuit at the operating frequency [4]. In this paper we consider the Skewed-Load test application strategy, where the second pattern V2 is obtained by shifting the first pattern V1. Despite the need for a scan enable that can switch between the scan and capture modes at-speed, this method can reuse the existing infrastructure for testing stuck-at faults and it also eliminates the need for sequential automatic test pattern generation (ATPG) [1]. However, it is important to note that for the Skewed-Load approach the coverage of delay faults (i.e., fault models of timing-related defects) is limited by the correlation between V1 and V2. This problem is further aggravated when additional constraints are imposed on SCs by the available power budget during test.
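The correlation between the two patterns can be made concrete with a minimal sketch. It assumes a single scan chain modeled as a Python list whose first element is the cell nearest the scan input; the function names and the example pattern are hypothetical and serve only as an illustration.

```python
def skewed_load_pair(v1, scan_in_bit):
    """Return (V1, V2), where V2 is V1 shifted by one scan position.

    v1 lists the scan-chain contents starting at the scan input;
    scan_in_bit is the value shifted in during the last shift cycle.
    """
    v2 = [scan_in_bit] + v1[:-1]
    return v1, v2


def launched_transitions(v1, v2):
    """Indices of scan cells whose value changes between V1 and V2."""
    return [i for i, (a, b) in enumerate(zip(v1, v2)) if a != b]


v1 = [0, 1, 1, 0, 1]  # hypothetical initialization pattern
_, v2 = skewed_load_pair(v1, scan_in_bit=1)
transitions = launched_transitions(v1, v2)
print(v2, transitions)
```

In this model, cell i of V2 always equals cell i-1 of V1, so only the scan-in bit is free once V1 is fixed. This dependence is the correlation that limits which transitions can be launched.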
The elevated power dissipation during test has become a major concern that limits the test throughput and manufacturing yield and, consequently, new power-conscious test methodologies have emerged in the past decade [5]. A method independent of the test vectors that is guaranteed to reduce the test power in the circuit under test (CUT) is to divide the circuit into multiple partitions, such that each partition can be tested separately. This, however, influences the correlation between at-speed vectors applied using Skewed-Load, thus adversely affecting the delay fault coverage. The focus of this paper is to enable the use of scan chain divisions for power-constrained at-speed test using the Skewed-Load test application strategy. By utilizing information obtained from the circuit description at the RTL, the circuit can be partitioned by consciously controlling the flip-flops (FFs) from different SCs via separate scan enables.
The rest of the paper is organized as follows. Section II discusses the related work and gives the motivation for the proposed solution. Section III details our proposal, while results and conclusion are given in Sections IV and V.
II. Related Work and Motivation
A power-aware ATPG algorithm is proposed by Wen et al. in [6]. By consciously filling the don't cares in the test patterns, the number of transitions in the circuit during capture is decreased. However, this method may increase the number of test patterns, and is more complex than a regular ATPG algorithm. Butler et al. [7] combine an ATPG algorithm with design partitioning, which limits the number of active FFs at a given time. However, this method requires the circuit to be partitioned manually. Lee et al. proposed a method to reduce capture power [8] by assigning to each SC a time when it should capture the test responses. To solve the problem of data dependencies between FFs, a new ATPG algorithm is introduced to find the upper bound of the number of capture cycles needed. Despite reducing both shift and capture power, this method may introduce more test patterns and area overhead from the partitioning of SCs. To lower the power during shift, Whetsel proposed in [9] to divide the SCs in a design such that they are shifted at different times. Since the number of FFs that are active during shift is reduced, the number of transitions in the combinational logic is also decreased. This can effectively reduce the power consumption during shift. Rosinger et al. proposed a method for reducing both the shift power and capture power [10]. By employing a scan architecture with mutually exclusive scan segments, a circuit is divided such that the test patterns and responses for each segment are shifted and captured, respectively, at different times.
None of the existing methods for scan chain division is suitable for at-speed test using the Skewed-Load test application strategy. This is because when generating patterns that trigger the targeted transitions, all the FFs driving a logic cone must be controlled by an ATPG tool. As a consequence, as shown later in this paper, scan chain divisions need to share FFs between them. Thus, one will need to carefully analyze the design to create the multiple scan chain divisions such that the number of FFs shared between partitions is minimized. Since design sizes will continue to increase in the future, the complexity of analyzing and identifying scan chain divisions at the gate level of design abstraction can become prohibitively high. To ensure that the complexity of the analysis algorithm remains low, such investigation should therefore be done at the RTL, rather than at the gate netlist level. Although a number of methods have been proposed in the literature, such as [11, 12], to construct functional SCs at the RTL, none of them explores CUT partitioning for managing test power. Therefore, the main motivation for this paper is to investigate the suitability of creating scan chain divisions for power-constrained at-speed test.
III. Creating Scan Chain Divisions at RTL using a New Partitioning Algorithm
In this section we first explain the problem of circuit partitioning. We then introduce the architecture for controlling the partitioned circuit during test. Next, we present an algorithm for dividing a circuit into smaller partitions such that they can be tested independently. After determining how each FF and the associated combinational logic should be allocated to the appropriate partition, the scan synthesis algorithm from [13] for inserting SCs into the design at the RTL is applied. This algorithm helps create SCs with reduced correlation between test pattern pairs for the Skewed-Load test application strategy. Once the scan circuit is constructed, the corresponding RTL description of the partitioned circuit can be interfaced to an RTL-to-GDSII tool flow for logic synthesis and test pattern generation.
A. Architecture for the Partitioned Circuit
The idea of partitioning is to locate the independent logic cones that are driven and captured by mutually exclusive sets of FFs. When such logic cones are found, they can be put into different smaller sections in a design. In order to better understand the problems, an example of a partitioned circuit is shown in Figure 1 
Fig. 2. Architecture for controlling partitions during test application
its corresponding triggering and capturing FFs. Since each of the four partitions is independent of the others, testing the whole circuit becomes the task of testing four smaller circuits, each with its own logic cone. Since the partition-under-test is smaller than the whole circuit, the power dissipated during shift and capture will be lowered.
There is one major problem that needs to be solved when locating the independent partitions of a circuit. Due to the functional data dependencies in a circuit, it is difficult to locate mutually exclusive triggering and capturing FFs for the selected logic cones in different partitions. An easy way to solve this problem is to insert dummy FFs between partitions to break these conflicts of shared FFs. However, this could incur excessive area overhead, since the dummy FFs are inserted only for the purpose of test.
Figure 1(b) shows an enlargement of Partition 2 in the circuit. As in other partitions, the selected logic cone and the corresponding FFs that trigger transitions and capture circuit responses are identified. As mentioned above, it is difficult to find a mutually exclusive set of triggering and capturing FFs for a selected logic cone in a circuit. Thus, we divide the FF sets into three categories in order to identify the conflicting FFs: (i) local FFs, (ii) outgoing FFs and (iii) incoming FFs. Local FFs drive and capture responses for the logic cone only in the specified partition. Outgoing FFs capture responses from the logic cone in the specified partition, but drive the logic in another partition. The FFs labeled Outgoing FFs (1) in Figure 1 When testing a partition, the local FFs, the incoming FFs and the outgoing FFs will have to be activated together. Thus, the incoming FFs and outgoing FFs will be active not only when the targeted partition is under test, but also when the adjacent partitions are tested, in order to maintain the desired level of fault coverage. As a result, it is beneficial to have a large number of local FFs and a small number of incoming and outgoing FFs in order to reduce the test power of a single partition. It is also desirable that the outgoing FFs of a partition be shared with the fewest possible adjacent partitions. This not only simplifies the control of the partitioned circuit, but also ensures that the outgoing FFs are active for as little time as possible. This can be done by carefully identifying the independent logic cones when partitioning the circuit. The algorithm for doing so will be presented in the next subsection. One point the reader should note is that if a logic cone is too large to be partitioned, the number of incoming and outgoing FFs can become excessive if two partitions are created. 
In this case, since all these shared FFs need to be active at the same time, partitioning a large cone will likely not help reduce its test power.
In order to activate the appropriate sets of FFs for each partition during test, the architecture in Figure 2 will have to be employed. In this architecture, a separate scan enable (SE) signal is assigned to each partition. The local FFs in a partition are controlled by the corresponding SE signal. An OR gate is inserted in order to combine the SE signals from multiple partitions since the outgoing FFs of a partition will also be activated when the adjacent partitions are being tested. Figure 2 (2) in Partition 1 will be enabled to test Logic cone 2. To prevent excessive tester channel occupation, a simple decoder can be inserted to activate the SE signals one at a time, since, to lower test power, only one partition should be active at a time.
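The scan-enable scheme described above can be sketched behaviorally as follows. This is only an illustrative model, not the hardware of Figure 2: partitions are numbered from 0, the FF names and their partition memberships are hypothetical, and the one-hot decoder and per-FF OR gate are modeled as plain functions.

```python
def decode(selected, num_partitions):
    """One-hot decoder: assert the scan enable of exactly one partition."""
    return [int(p == selected) for p in range(num_partitions)]


def ff_scan_enable(se, partitions_of_ff):
    """OR of the SE signals of every partition in which the FF must be active."""
    return int(any(se[p] for p in partitions_of_ff))


# Hypothetical FFs: each maps to the partitions that need it during test.
FF_PARTITIONS = {
    "local_p0": {0},
    "outgoing_p0_to_p1": {0, 1},  # shared FF: also active when partition 1 is tested
    "local_p1": {1},
    "local_p2": {2},
}


def active_ffs(selected, num_partitions=3):
    """Names of the FFs whose scan enable is asserted for the selected partition."""
    se = decode(selected, num_partitions)
    return sorted(name for name, parts in FF_PARTITIONS.items()
                  if ff_scan_enable(se, parts))


print(active_ffs(1))
```

Selecting partition 1 enables its local FFs together with the shared FF from partition 0, while the FFs that belong only to partitions 0 and 2 stay inactive.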
B. The Partitioning Algorithm
A pre-processing step consists of extracting information about the data dependencies between FFs in the design from the RTL description. This can easily be done by building a sequential graph (S Graph), where the nodes represent FFs in the design and data dependencies are shown as edges. Once the S Graph is built, the problem of dividing a circuit into multiple small partitions, each with a large number of local FFs and a small number of outgoing FFs, becomes equivalent to splitting the S Graph into smaller subgraphs with the fewest cross-edges between them. This is because each cross-edge between two subgraphs represents an outgoing FF between two partitions in a circuit. As a result, the partitioning problem can be formulated as the minimal cut set problem, which is known to be NP-hard in graph theory. However, since we have an additional constraint that the outgoing FFs should be shared by a minimum number of neighboring partitions, we have developed a simple greedy algorithm instead of reusing the existing heuristics.
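As a small illustration of this formulation, the sketch below represents an S Graph as a list of directed edges between FF names and counts the cross-edges for a given assignment of FFs to partitions. The graph, the FF names and the partition assignment are hypothetical.

```python
def cross_edges(edges, partition_of):
    """Edges whose source and destination FFs lie in different partitions."""
    return [(src, dst) for src, dst in edges
            if partition_of[src] != partition_of[dst]]


def outgoing_ffs(edges, partition_of):
    """FFs that drive logic captured in another partition."""
    return {src for src, _ in cross_edges(edges, partition_of)}


# Hypothetical S Graph: an edge (a, b) means FF a feeds logic captured by FF b.
edges = [("FF1", "FF2"), ("FF2", "FF3"), ("FF1", "FF4"), ("FF3", "FF4")]
partition_of = {"FF1": 1, "FF2": 1, "FF3": 2, "FF4": 2}

cut = cross_edges(edges, partition_of)
shared = outgoing_ffs(edges, partition_of)
print(len(cut), sorted(shared))
```

A partitioning with fewer entries in `cut` leaves fewer FFs that must be kept active while a neighboring partition is under test.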
We define four variables in Table I. Before detailing the algorithm, the gain function is described as:
(1) This equation gives a higher gain when a node in the S Graph has more edges connecting to a node in a neighboring partition than to a node in the local partition.
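One form of gain that matches this description, assumed here purely for illustration and not necessarily identical to Equation 1, is the number of foreign edges minus the number of local edges, i.e. (N_FI + N_FO) - (N_LI + N_LO) in terms of the variables of Table I. The sketch below computes it on a hypothetical S Graph; ignoring self-loops is a further assumption.

```python
def gain(node, edges, partition_of):
    """Assumed gain: foreign edges minus local edges of a node (self-loops ignored)."""
    home = partition_of[node]
    n_li = n_lo = n_fi = n_fo = 0
    for src, dst in edges:
        if src == dst:
            continue
        if dst == node:                      # incoming edge
            if partition_of[src] == home:
                n_li += 1
            else:
                n_fi += 1
        if src == node:                      # outgoing edge
            if partition_of[dst] == home:
                n_lo += 1
            else:
                n_fo += 1
    return (n_fi + n_fo) - (n_li + n_lo)


def count_cross_edges(edges, partition_of):
    return sum(1 for s, d in edges if partition_of[s] != partition_of[d])


# Hypothetical graph: C sits in partition 1 but mostly talks to partition 2.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
partition_of = {"A": 1, "B": 1, "C": 1, "D": 2, "E": 2}

gain_c = gain("C", edges, partition_of)
gain_b = gain("B", edges, partition_of)
before = count_cross_edges(edges, partition_of)
after = count_cross_edges(edges, {**partition_of, "C": 2})
print(gain_c, gain_b, before, after)
```

With only two partitions, moving a node to the other partition changes the number of cross-edges by exactly the negative of this gain, which is why a greedy selection would favor the node with the highest value.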
The algorithm for partitioning a circuit is shown in Algorithm 1. The designer can specify how many partitions are needed in order to meet the power constraint. The function PartitionSize returns the number of FFs in the specified partition. At the beginning of the algorithm, all nodes in the S Graph are assigned to Partition 1. Then, line 2 uses the gain formula in Equation 1 to calculate the gain of all nodes in the S Graph. At this point, the greedy algorithm starting at line 5 of Algorithm 1 is applied repeatedly until all partitions are created. It starts by selecting a node with the highest gain from Partition i − 1 at line 6. The reason for choosing the highest gain can be shown using Figure 3. In this figure, the algorithm is trying to select a node in Partition 1 and move it to
TABLE I. Variables for the gain function
Variable Name    Representation
N_LI             Number of incoming edges from local partition
N_LO             Number of outgoing edges to local partition
N_FI             Number of incoming edges from foreign partition
N_FO             Number of outgoing edges to foreign partition
Fig. 3. Example of an S Graph
Partition 2. The two candidate nodes are FF3 and FF4, with gains -1 and 2 respectively. By moving FF4 to Partition 2, the number of cross-edges between the two partitions can be decreased by 2, while moving FF3 will increase the number of cross-edges by 1. After a node is selected, lines 8 and 9 update the lists that contain all the incoming FFs and outgoing FFs of a partition. These incoming and outgoing FF lists are used to update the gain of the parent and child nodes of the selected node at line 10 of the algorithm. This is repeated until Partition i − 1 reaches the targeted size, at which point line 11 locks the nodes in Partition i that are driven by Partition i − 1. The reason for this is to prevent an FF from being shared by more than two partitions. This can be shown using FF7 in Figure 3. Assuming the algorithm is trying to select a node from Partition 2 for creating Partition 3, locking FF7, which is driven by FF4 in Partition 1, prevents FF4 from being shared between Partitions 1, 2 and 3 at the same time, since FF4 also drives FF6, which is located in Partition 2. Algorithm 1 will terminate at line 11 if the number of locked FFs is larger than the targeted size.
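The greedy selection can be illustrated with a deliberately simplified sketch. It is not Algorithm 1: it omits the locking step and the incoming/outgoing FF lists, it grows each partition from a pool of unassigned FFs, and it uses an assumed gain (edges to the growing partition minus edges to the remaining pool). All names and the example graph are hypothetical.

```python
import math


def gain(node, edges, members, assigned):
    """Edges linking node to the growing partition minus edges to the pool."""
    g = 0
    for src, dst in edges:
        if src == dst or node not in (src, dst):
            continue                      # skip self-loops and unrelated edges
        other = dst if src == node else src
        if other in members:
            g += 1                        # edge would become local after the move
        elif other not in assigned:
            g -= 1                        # edge would become a cross-edge
    return g


def greedy_partition(nodes, edges, num_partitions):
    """Assign every FF to a partition number in 1..num_partitions."""
    target = math.ceil(len(nodes) / num_partitions)
    pool = list(nodes)                    # unassigned FFs, in a fixed order
    assigned = {}
    for part in range(1, num_partitions):
        members = set()
        while pool and len(members) < target:
            best = max(pool, key=lambda n: gain(n, edges, members, assigned))
            pool.remove(best)
            members.add(best)
            assigned[best] = part
    for node in pool:                     # the remainder forms the last partition
        assigned[node] = num_partitions
    return assigned


# Hypothetical S Graph: two 3-FF loops joined by the single edge c -> d.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "a"),
         ("d", "e"), ("e", "f"), ("f", "d"), ("c", "d")]
result = greedy_partition(nodes, edges, num_partitions=2)
cut_size = sum(1 for s, d in edges if result[s] != result[d])
print(result, cut_size)
```

On this example the sketch keeps each loop in its own partition, so only the edge from c to d crosses the boundary. Ties are broken by the order of `nodes`, which is an arbitrary choice of the sketch.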
IV. Experimental Results
In this section we discuss our results for a DMA circuit [14]. It is important to note that this circuit contains 2050 FFs when using gate level scan and 2115 FFs when When dividing a circuit into k partitions, it is expected that the number of active FFs in a partition should be 1/k of the total number of FFs. However, in the proposed solution, for the purpose of launching patterns for detecting delay faults, not only the local FFs, but also the incoming FFs in neighboring partitions must be activated. For example, for three partitions, there will be 1044 (or 49.4%) active FFs in the largest partition; for eight partitions, the largest partition will have 522 (or 24.7%) active FFs. Note that the power when testing a single partition is assumed to be directly proportional to the scan chain division size. This is consistent with prior literature [9]. Tables II and III show the testability results generated by a commercial ATPG tool [15] for the DMA circuit with three and eight partitions. The timing constraints and number of SCs are listed in columns 1 and 2 respectively. Note that the number of SCs for the DMA circuit with three and eight partitions is chosen to be different to show that the benefit of our approach is irrespective of the number of SCs in the design. The column labeled TF represents the total number of transition delay faults in the circuit. The column labeled AU represents ATPG untestable faults, which are faults that are untestable due to the limitation of scan cell arrangement. FC corresponds to fault coverage for transition faults, CTP denotes the number of compressed test patterns, ST is the scan time in thousands of clock cycles, and CPU represents test generation time in seconds. For the columns labeled Gate Full, full scan is inserted at the gate level after the circuit has been synthesized. All the columns for the gate level case are generated without partitioning the circuit. 
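The percentages of active FFs quoted above can be checked with a few lines of arithmetic. The sketch assumes that the percentages are taken relative to the 2115 FFs of the DMA circuit with RTL scan, which is consistent with the quoted values, and compares them with the ideal 1/k share.

```python
TOTAL_FFS = 2115                     # assumed reference: FF count with RTL scan
LARGEST_PARTITION = {3: 1044, 8: 522}  # active FFs in the largest partition


def active_share(active, total=TOTAL_FFS):
    """Percentage of FFs that are active, rounded to one decimal place."""
    return round(100.0 * active / total, 1)


shares = {k: active_share(n) for k, n in LARGEST_PARTITION.items()}
ideal = {k: round(100.0 / k, 1) for k in LARGEST_PARTITION}
for k in LARGEST_PARTITION:
    print(f"k={k}: active {shares[k]}% vs ideal {ideal[k]}%")
```

The gap between the measured share and the ideal 1/k share reflects the shared FFs that must also be activated when a partition is tested.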
For the columns named RTL Full, the testability results for TF, AU, CTP, ST and CPU are calculated by summing the corresponding data over the multiple partitions obtained by our approach. Note that the TF for RTL scan is higher than that of gate level scan due to the presence of redundant FFs and the added DFT logic for controlling the multiple partitions. Despite the increase in TF, the number of AU faults for RTL scan with three and eight partitions is on average 942 and 2260 faults less, respectively, than that of gate level scan. This decrease in AU faults in turn improves the fault coverage of the RTL scan by 3.25% and 3.33%. This improvement is due to reusing the method from our prior work [13]. However, by employing the scan chain division method proposed in this paper, although the number of CTP increases by over 2000 and 5000 on average, the scan time is actually reduced on average by 104 and 95 thousand clock cycles. This also translates into an 18.7% and 45.9% reduction in the volume of test data for the DMA circuit with three and eight partitions respectively. This is because within a single partition, there are fewer active FFs that need to be scanned. Our proposed solution also improves the test generation time.
Tables IV and V compare the area and performance results of gate level scan and RTL scan with three and eight partitions for the DMA circuit. Column 1 shows the timing constraints used by the synthesis tool [16]. Column 2 provides the total number of SCs, and Columns 3, 5 and 7 indicate whether the timing constraints were met during synthesis for the non-scan circuit, the circuit with gate level scan and the RTL scan circuit. Columns 4 and 6 show the area overhead when compared to the non-scan circuit for gate level scan and RTL scan respectively. Column 8 shows the difference in area overhead between gate level scan and RTL scan. As can be seen in the tables, despite the presence of the redundant FFs and the additional logic for controlling the partitions during test, the area for RTL scan is 3.39% and 1.19% less on average, respectively, than that of gate level scan. Moreover, from Columns 3 and 5 of Tables IV and V, the non-scan circuit and gate level scan fail to meet timing beyond 1.75 ns and 1.80 ns respectively. However, the performance is improved to 1.7 ns for the RTL scan with three and eight partitions, as indicated in Column 7 of both tables. We attribute this improvement to the fact that the logic synthesis tool can better optimize the circuit by generating the scan paths and functional logic simultaneously when the scan infrastructure is provided in the RTL description. However, one point to note is that the synthesis tool failed to meet the timing constraints at 1.8 ns for the DMA circuit with three partitions and 12 SCs, and at 1.8 ns and 1.75 ns for the DMA circuit with eight partitions and 24 SCs. We consider this anomaly to be caused by the heuristic nature of the logic synthesis engine.
TABLE V. Area data for DMA with eight partitions
It is also important to note that partitioning the circuit and inserting scan at the RTL takes only a few minutes of computation for the DMA circuit with 2115 FFs on a 1.5 GHz PowerPC G4 with 1 GB of RAM.
V. Conclusion
This paper described how, by dividing a circuit into multiple partitions at the RTL for power-constrained at-speed testing, the testability of the circuit can be improved by consciously constructing the SCs.
