With the advancement of VLSI manufacturing technology, entire electronic systems can be implemented in a single integrated circuit. Due to the complexity of SoC design, circuit testability has become one of the most challenging tasks. Without careful planning in Design For Testability (DFT), circuits consume more power in test mode than in normal functional mode. This elevated test power may cause problems including overall yield loss and instant circuit damage. In this paper, we present two approaches to minimize scan-based DFT power dissipation. The first methodology includes routing cost consideration in scan chain reordering after cell placement, while the second provides test pattern compression for lower power. We formulate the first problem as a Traveling Salesman Problem (TSP), with a cost evaluation different from [18], [19], and apply an efficient heuristic to solve it. For the second problem, we provide a selective scan chain architecture and a simple yet effective encoding scheme for lower scan testing power dissipation. The experimental results on ISCAS'89 benchmarks show that the first methodology obtains up to 10% average power saving under the same low routing cost compared with a recent result in [19]. The second methodology reduces test power by over 17% compared with filling all don't care (X) bits with 0 in one of the ISCAS'89 benchmarks. We also provide the integration flow of these two approaches in this paper.
Introduction
In the System-on-Chip (SoC) era, chip design and testing engineers have encountered more and more new design challenges. Due to the need of DFT, modern design has made external testing more difficult than before. During scan testing, power dissipation is a critical issue because test vectors require a large number of shift operations and cause high transition activity in the circuit [1]. In fact, circuits may consume twice or more power in test mode than in normal functional mode. The scan power issue may cause several problems, including increased product cost, reduced circuit reliability, instant circuit damage, decreased overall yield, and decreased autonomy in portable systems [2], [20]. Some works [34], [35], [37] formulate these problems in terms of IR drop and apply pattern filling techniques to deal with them.
When dealing with high-performance modern ASICs and SoCs, a nondestructive test must satisfy all the power constraints defined in the design phase. In the past, because tests needed only to cover stuck-at faults, they typically ran at a lower speed than the normal circuit working frequency. Moreover, scan-based architectures consume a lot of power because each test vector requires a shifting operation to initialize the scan cells and evaluate the test responses. Several approaches have been proposed to reduce average power during scan testing. It is feasible to reduce test power consumption by lowering the scan frequency, but this increases test application time and is not suitable for complex, advanced SoC designs which may need at-speed testing. Low power Automatic Test Pattern Generation (ATPG) algorithms address both fault coverage and low power [21]. However, this approach needs more test vectors to achieve the same fault coverage compared with the traditional ATPG approach, and it also increases test time. Test power reduction by chaining scan cells in a low power order is proposed in [22], [23]. However, [22] cannot guarantee a short scan chain interconnection length and may cause congestion problems during scan routing. In order to avoid routing congestion, [18] constrains the power-optimized scan chain connection length by partitioning the chip into several regions and chaining scan cells in a low power order within each region. Since this approach partitions the chip by geographical criteria, the number of scan cells in each region may not be well distributed. When there are few scan cells in a cluster, that cluster suffers from a poor power optimization ratio. [19] further proposes to partition the chip by balancing the number of scan cells in each region and achieves more power reduction compared with [18].
Although [18] and [19] shorten the scan chain connection length by partitioning the chip, they still have an impact on routing wire length. Moreover, the stronger the constraint on the longest scan chain length, the larger the number of clusters, and the larger the number of clusters, the smaller the test power reduction.
On the other hand, in addition to the scan chain reordering approach, test data compression can be another way to achieve low power scan testing. In [3], [24], [25], test data compression is used for test volume and test application time reduction. Moreover, [4], [8]-[10] use various test data compression techniques for reducing scan power, test volume, and testing time. [5]-[8] apply the multiple scan chain concept to test power reduction. [4], [9], [10], [15]
In this paper, we propose a methodology to reduce test power consumption by considering routing congestion in a scan-based architecture together with selective test data compression. For routing congestion consideration in scan chain design, we formulate the problem as a traveling salesman problem (TSP) with a cost function different from previous approaches [18], [19], and discuss the tradeoff between power and routing overhead. Although the approaches in [18], [19] can reduce routing overhead when chaining scan cells with a power driven strategy and design partitioning, they may omit some good choices of scan cell pairs which have both a low transition number and a short connection length. Since we do not constrain power driven chaining of scan cells to specific regions, our approach can achieve up to 10% power saving under the same routing cost in the s9234 benchmark compared with [18]. Our experimental result is also better than [19], which has 1-3% improvement compared with [18]. We have also obtained at most 57% routing cost improvement under the same test power consumption. With the test data compression technique to further reduce test power, we obtain over 11% power reduction on average with a small decoder circuit overhead, compared with the test power at the original test pattern length. The test volume reduction rate is 37% on average for six ISCAS'89 benchmarks. Our new scan architecture is easy to implement with synthesis tools.
We further integrate these two approaches and obtain a better power reduction percentage, compared with filling don't care (X) bits with 0 in the original test patterns, while also taking shorter scan chain wire length into consideration.
The remainder of this paper is organized as follows. In Sect. 2, we briefly describe power optimization scan chaining proposed in [22] , routing constraint driven approaches on low power scan chain (LPSC) proposed in [18] , [19] , and test data compression techniques for low power in [8] - [10] . In Sects. 3 and 4, we present our approaches on how to consider both power and routing constraint issues, and our new scan architecture and selective test data compression techniques. In Sect. 5, we show the experimental results and conclude the paper in Sect. 6. This paper is an extended version of [14] .
Low Power Scan Testing by Chain Design Considering Routing Cost and Test Data Compression
In IC design flow, scan cells and testing circuit are inserted after synthesis procedure. The scan chain connection will be broken before going into placement phase to prevent scan chain from having great impact on routing congestion. Then scan cells will be reordered with layout driven chaining algorithm. As for scan chain reordering, there have been many papers about shortening scan chain length and reducing scan chain routing congestion [26] , [27] . As chip complexity and operating frequency evolve dramatically, scan chain reordering should not only focus on area overhead reduction, but also should take care of power consumption issue, such as [28]. In addition to scan chain ordering schemes, test data compression is an effective approach to reduce test power and test volume at the same time. We review some low power driven scan chain ordering methods in [18] , [19] , [22] , [28] and some test data compression techniques for low power issues [8] - [10] .
Power Driven Scan Chain Reordering
The dynamic power is $P = \sum_i \frac{1}{2} C_{li} \, f \, V_{dd}^2 \, S_i$, where $C_{li}$ is the equivalent output capacitance of node $i$, which is strongly correlated to its fan-out number, $f$ is the operating frequency, $V_{dd}$ is the power supply voltage, and $S_i$ is the switching probability. According to the equation, we can reduce test power by decreasing the number of scan transition activities. In [22], a low power-driven scan chain ordering approach applies a heuristic algorithm to minimize scan chain power. The algorithm uses the test data generated by a scan cell insertion tool and an automatic test pattern generator.
The procedure of this approach is as follows. First, we construct a complete undirected graph in which each vertex represents a scan flip-flop and each edge represents a possible connection between two scan cells. The weight of each edge is the total number of bit differences between the two scan cells. Figure 1(a) illustrates scan flip-flops and scan vectors. We calculate the bit differences between each pair of flip-flops on the scan-in vectors and output responses. Figure 1(b) shows the corresponding weighted graph. Second, we use a greedy algorithm to find a low-cost Hamiltonian cycle. From this cyclic solution, we estimate the power for each flip-flop by determining the scan-in and scan-out ports and find a minimum-cost cutting edge. The solution is a near-optimal scan cell order with a much lower transition number during the scan chain shifting operation.
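The graph construction step can be sketched as follows. This is a minimal illustration, assuming each scan cell's data is available as a string of fully specified bits (one bit per scan-in vector or output response). The names `bit_diff` and `build_weights` and the sample data are hypothetical and are not taken from [22].

```python
from itertools import combinations


def bit_diff(a: str, b: str) -> int:
    """Number of positions where two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def build_weights(cell_bits: dict) -> dict:
    """Edge weights of the complete undirected graph over scan cells.

    Keys are (cell_u, cell_v) pairs with cell_u < cell_v; values are
    the bit differences between the two cells' data.
    """
    return {
        (u, v): bit_diff(cell_bits[u], cell_bits[v])
        for u, v in combinations(sorted(cell_bits), 2)
    }


# Hypothetical data: three flip-flops observed over four vectors.
cells = {"FF1": "0110", "FF2": "0111", "FF3": "1001"}
weights = build_weights(cells)
```

In this toy example the pair FF1/FF2 gets the smallest weight, so a power driven ordering would prefer to place these two cells next to each other.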
Routing Constrained Low Power Scan Chains
The power driven scan chain reordering approach from the previous subsection has drawbacks, mainly in creating routing congestion and long scan connections in the design. To illustrate this point, Fig. 2 shows the routing result of the s9234 benchmark (which has 211 flip-flops) with power driven scan chain reordering. In Fig. 2, nodes represent the positions of scan cells in the design and edges are connections between scan cells. Although the power driven approach can efficiently reduce power (27% power reduction compared with length driven scan chain reordering), the routing result is not optimal and has high routing congestion.
In order to reduce the routing overhead in power driven scan chain ordering, [18] proposed a method that partitions the chip by geographic criteria and chains the scan flip-flops in each cluster in a low power driven order. This approach can shrink the low power scan chain length, but the test power reduction may be low when there are few flip-flops in a cluster. To improve the power reduction ratio in each cluster, [19] proposed an improved way of partitioning the chip with well distributed scan flip-flops in each cluster. This approach slightly increases the test power reduction ratio, and the total wire length of the scan chain remains almost the same as in [18]. From the experimental results described in [18], [19], we observe that the test power reduction and the scan chain connection length are strongly correlated with the number of clusters. If there are more clusters, the power reduction ratio will be lower and the scan chain length will be shorter. In fact, we will show that we do not have to partition the chip and can obtain the same tradeoff more easily.
[28] proposed a technique for reordering scan cells to minimize power dissipation that is also capable of reducing the area overhead of the circuit compared with a random ordering of the scan cells. They use dynamic minimum transition fill (MT-fill) to fill the unspecified bits in the test vector. Their greedy heuristic achieves a locally optimal scan cell ordering; however, this sequential approach may miss the global picture in lowering scan test power. Furthermore, although they use a tradeoff parameter λ to control the relative importance of the two terms, it is not easy to specify a good value. We will show that our tradeoff parameter provides better flexibility in choosing between power and routing cost minimization.
Test Data Compression to Achieve Lower Scan Testing Power
Some previous works use compression techniques to achieve low power test. For example, [8]-[10] use test data compression techniques for reducing scan power, test volume, and testing time. In [9], [10], Golomb coding and alternating run-length codes are used for low power scan testing and test data compression. Moreover, [16] uses a dictionary based method with memory to compress test data. This method uses memory storage inside the chip to save the compressed code and reduces the shift-in data size. Another strategy is to save the encoding information in the circuit [17]. It provides an inverter-interconnect based decompression network to decode the test data. We use a decoding scheme to implement a low power architecture that achieves both test data compression and low test power. This architecture needs some decoders to decode fewer bits of shift-in data. The experimental results show that this approach reduces both scan power and test data size considerably. Although our compression scheme needs some routing resources, this cost is unavoidable. In the integrated flow, our reordering technique can alleviate the overall routing cost.
Simultaneous Power and Routing Cost Minimization in Scan Chain Design
In previous works ( [18] , [19] ), they focus on chaining scan cells with power driven order in each cluster and find that test power reduction ratio is strongly correlated with the number of clusters. They then chain each cluster with a knowledge based architecture. Although [18] and [19] both can reduce routing overhead which is induced by power driven scan cell chaining by partitioning the design, these approaches may omit some good choices of scan cell pairs which have both low transition number and short connection length. In order to consider test power and routing length minimization, we propose reordering scan chain with cost function which can take both power and routing length into consideration. In this way, this approach can make scan chain reordering process to find a better scan chain order solution without limiting in specific regions. In the following subsections, we will delineate how the edge weight is defined and how good quality of the scan chain order can be found.
Weighted Graph Construction Using Power And Routing Cost
Power driven scan chain ordering [22] can reliably obtain a low-transition scan chain order. However, that approach does not consider the physical information of the scan chain in the design, which results in a routing cost issue. In order to improve the routing overhead in LPSC ordering, our approach constructs a weighted undirected graph G(V, E) and combines the distance cost and the bit difference cost between each pair of scan cells as

$$\mathrm{Edge\_weight}(i, j) = (1 - \beta)\,\frac{\mathrm{Dist}(i, j)}{L} + \beta\,\frac{\mathrm{Bit\_Diff}(i, j)}{N} \qquad (1)$$

where Dist(i, j) is the direct connecting distance between the i-th scan cell and the j-th scan cell, Bit_Diff(i, j) is the number of bit differences between the test vectors in the i-th scan cell and the j-th scan cell, L is the diagonal length of the chip, N is the total number of scan-in and response vectors, and β is the parameter that controls how much effort we pay to scan power consumption; it ranges from 0 to 1 †. Because Dist(i, j) and Bit_Diff(i, j) are on different scales, we normalize the direct connecting distance of scan cells i and j by the diagonal length of the chip and normalize the number of bit differences by the total number of scan-in and response vectors.
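A short sketch of an edge weight of this kind is given below. It assumes that the normalized distance is weighted by (1 − β) and the normalized bit difference by β, which is consistent with β = 0 giving the best routing cost and β = 1 the best power. It further assumes that the direct connecting distance is Euclidean. The function name `edge_weight` and its parameters are hypothetical.

```python
import math


def edge_weight(pos_i, pos_j, bits_i, bits_j, chip_diagonal, beta):
    """Blend normalized distance and normalized bit difference.

    pos_i, pos_j  : (x, y) placement coordinates of the two scan cells
    bits_i, bits_j: equal-length bit strings, one bit per vector
    chip_diagonal : diagonal length L of the chip
    beta          : tradeoff in [0, 1]; 0 = routing only, 1 = power only
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    if len(bits_i) != len(bits_j) or not bits_i:
        raise ValueError("bit strings must be non-empty and equal length")
    dist = math.dist(pos_i, pos_j)  # assumed Euclidean distance
    diff = sum(a != b for a, b in zip(bits_i, bits_j))
    n_vectors = len(bits_i)
    return (1.0 - beta) * dist / chip_diagonal + beta * diff / n_vectors


# Hypothetical cells 5 units apart on a chip with diagonal 20,
# differing in 2 of 4 vector bits.
w_routing = edge_weight((0, 0), (3, 4), "0101", "0110", 20.0, 0.0)
w_power = edge_weight((0, 0), (3, 4), "0101", "0110", 20.0, 1.0)
w_mixed = edge_weight((0, 0), (3, 4), "0101", "0110", 20.0, 0.5)
```

Because both terms are normalized to the range 0 to 1, the blended weight also stays in that range for any valid β.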
Efficient Heuristic in Finding Min-Cost Scan Chain Order
After the weighted graph is constructed, we need to find a path with minimum cost. This problem can be formulated as a TSP, which is known to be NP-hard. In order to generate an acceptable solution efficiently, we implement a heuristic algorithm to obtain a competitive low cost solution. Our algorithm is shown in Fig. 3. The complexity of this algorithm is O(|V||E|), where |V| is the number of flip-flops and |E| is the number of edges in the weighted graph.
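To give a feel for how a low cost order can be extracted from the weighted graph, the following sketch uses a generic nearest-neighbor heuristic. It is not the algorithm of Fig. 3 and is shown only as an assumed stand-in; the weight matrix is hypothetical.

```python
def nearest_neighbor_order(weight, start=0):
    """Greedy path: always move to the cheapest unvisited node.

    weight is a symmetric matrix (list of lists) of edge weights.
    Ties are broken by the smaller node index. Runs in O(|V|^2).
    """
    order = [start]
    unvisited = set(range(len(weight))) - {start}
    while unvisited:
        last = order[-1]
        nxt = min(unvisited, key=lambda j: (weight[last][j], j))
        order.append(nxt)
        unvisited.remove(nxt)
    return order


def path_cost(weight, order):
    """Total weight of the edges between consecutive nodes."""
    return sum(weight[a][b] for a, b in zip(order, order[1:]))


# Hypothetical 4-cell weight matrix.
w = [
    [0, 2, 9, 4],
    [2, 0, 3, 7],
    [9, 3, 0, 1],
    [4, 7, 1, 0],
]
order = nearest_neighbor_order(w, start=0)
cost = path_cost(w, order)
```

A nearest-neighbor pass gives no optimality guarantee, so in practice it is usually followed by a local improvement step or repeated from several start nodes.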
For circuits which contain a large number of flip-flops, solving the TSP could be time-consuming. We can shrink the graph by grouping nodes based on a threshold value on the edge cost: nodes are grouped when the cost of the edge between them is below the threshold. Then we apply the same algorithm. Since we want a low cost trip in the TSP, this technique can reduce the complexity and achieve comparable results. After we find the low cost scan chain order, we estimate the power for the two scan cells at the ends of the scan chain to decide the scan-in port and the scan-out port.
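One possible reading of the grouping step is sketched below with a union-find structure. It assumes that grouping is transitive (if a and b are grouped and b and c are grouped, all three share a group); the text does not state this, so it is an assumption, as are the function name `group_nodes` and the sample edges.

```python
def group_nodes(n, edges, threshold):
    """Merge nodes joined by an edge whose cost is below threshold.

    n     : number of nodes, labeled 0 .. n-1
    edges : iterable of (u, v, cost) tuples
    Returns the groups as a sorted list of sorted node lists.
    """
    parent = list(range(n))

    def find(x):
        # Walk to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, cost in edges:
        if cost < threshold:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_v] = root_u

    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


# Hypothetical edge costs between five scan cells.
edges = [(0, 1, 0.10), (1, 2, 0.50), (2, 3, 0.05), (3, 4, 0.20), (0, 4, 0.90)]
groups = group_nodes(5, edges, threshold=0.15)
```

Each resulting group can then be treated as a single node of the reduced graph before the ordering heuristic is applied.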
Power Estimation for Scan-in and Scan-out Ports Determination
When the scan chain order has been decided, we need to define the scan-in and scan-out ports, because the number of transition activities is related not only to the number of bit differences between scan cells but also to their relative positions. In order to choose which of the two end cells of the scan chain should be the first one, we need to estimate the power for both orientations. To estimate the scan power dissipation, we use the weighted transition model proposed in [29].
The estimation of scan-in and scan-out power in our scan chain order is the formula (2) (from [29] ):
From this power estimation function, we can decide which of the two end cells of the scan chain is the scan-in port and which is the scan-out port for this low power scan chain.
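The port decision can be illustrated with a simplified weighted transition count. The sketch assumes a commonly used form of the model in which a transition between bit positions i and i+1 of a length-l scan-in vector is weighted by l − i, with position 1 being the first bit shifted in; the exact formula (2) may differ. Only scan-in vectors are considered here, and the function names are hypothetical.

```python
def weighted_transitions(vector: str) -> int:
    """Weighted transition count of one scan-in vector.

    vector[0] is the first bit shifted in. A transition between the
    i-th and (i+1)-th bits (1-indexed) travels through l - i cells.
    """
    length = len(vector)
    return sum(
        (length - i) * (vector[i - 1] != vector[i])
        for i in range(1, length)
    )


def choose_scan_in_end(vectors):
    """Compare both chain orientations and keep the cheaper one.

    Reversing the chain is modeled by reversing every vector.
    Returns (orientation, total weighted transitions).
    """
    forward = sum(weighted_transitions(v) for v in vectors)
    backward = sum(weighted_transitions(v[::-1]) for v in vectors)
    if forward <= backward:
        return ("forward", forward)
    return ("reversed", backward)


# Hypothetical scan-in vectors for a 4-cell chain.
choice = choose_scan_in_end(["1000", "1100"])
```

In this example the transitions sit near the first-shifted end of the vectors, so swapping the scan-in and scan-out ports lowers the estimate.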
Selective Test Data Compression Technique for Low Power Scan Testing
For further test power reduction after scan chain reordering, we propose a simple yet effective selective test data compression method to reduce the input test pattern size and total power consumption. In this method, we will compress some selected test patterns and use another compressed scan chain (CSC) as input data. We then shift the compressed data into CSC. The CSC decodes the compressed data to the normal scan chain (NSC) by additional decoders. The following subsections introduce our selective scan chain architecture and scan test power minimization strategy via test data compression. Our integrated methodology flow is shown in Fig. 4 . Our scan chain reordering technique and low power test compression technique are applied after traditional ATPG, which is suitable for modern design methodology.
The Selective Scan Chain Architecture for Lower Scan Power
The proposed scheme applies a new architecture and an optimization flow to achieve lower power and smaller test data volume. The selective scan chain architecture is shown in Fig. 5. All the test patterns are divided into two groups: the first group is used for the CSC, and its shift-in patterns are in compressed form; the second group is used in the NSC, and its shift-in patterns are not compressed. In this architecture, the CSC uses the first group of test patterns, which has more X bits to manipulate. The X bit ratio is the ratio of X bits in a single pattern length (SPL). We use the X bit omit ratio as a threshold to separate the test patterns: if the X bit ratio of a test pattern is smaller than the selected X bit omit ratio, the pattern belongs to the NSC group; otherwise it belongs to the CSC group. Compressed patterns need special decoders to extract the test patterns to the NSC. Figure 5 shows the one scan unit to four bits (1-unit-to-4-bit) decoder structure with the original scan chain and the compressed scan chain. Each compressed scan unit is composed of one or more scan cells and provides decoding results to the normal scan cells as test data. Figure 6 shows decoder circuit examples. Figure 6(a) has one scan cell in the compressed scan unit, and Fig. 6(b) has two scan cells in the compressed scan unit. Because the decoding scheme is simple, the decoder circuit is small as well. The number of scan cells in each scan unit is generated by the optimization methodology. After introducing our test architecture, the following subsections present the optimization methodology on test data.

Fig. 4 Our proposed methodology to reduce scan testing power by power-driven scan chain reordering and selective scan chain optimization.

Fig. 7 Pattern selection stage. After we get the partitioned test data set, we use X bit omit ratio to determine which test data row should be put into CSC data set and which test data row should be put into NSC data set.

† We apply this tradeoff parameter β, which can be used by the designer to specify the relative importance of power and routing cost. Different from [28], our parameter is very easy to specify (ranging from 0 to 1).
Optimization Methodology for Test Data Compression and Further Power Reduction
Our methodology consists of three stages. The first stage is pattern selection, which sets the X bit omit ratio in order to select patterns for the CSC. The second stage is pattern compression, which merges 4-bit blocks of test patterns in the same column of the test set. The first column is the first 4 bits of each test pattern. The third stage is power optimization. This stage uses the shorter pattern length and applies a greedy search method to find the code with the smallest power consumption in the CSC, column by column. Each X bit omit ratio provides one result for test data volume and power consumption. By evaluating all ratios, we can find the best ratio for power minimization.
Pattern Selection Stage
This stage separates the test patterns into two groups by the X bit omit ratio. The test patterns are generated by an automatic test pattern generation (ATPG) tool, such as SyntestTurboScan [11] or TetraMax [12]. The first group of test patterns is for the CSC, and the second group is for the NSC. The first group needs further compaction, while the second group uses the normal shift method to test the circuit. If a test pattern's X bit ratio is smaller than the given X bit omit ratio, the test pattern belongs to the NSC group; otherwise it belongs to the CSC group. Figure 7 shows that patterns with fewer X bits are put into the NSC group. The total original test size is given in equation (3) and the total new test size in equation (4), where SPL new is the CSC test data length that comes from the compressed scan units (shown in Fig. 5).
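The selection rule and the resulting test sizes can be sketched as follows. The size formulas used here are assumptions inferred from the description: the original size is taken as the number of patterns times SPL, and the new size as the number of NSC patterns times SPL plus the number of CSC patterns times SPL new. The function names and the sample patterns are hypothetical.

```python
def x_bit_ratio(pattern: str) -> float:
    """Fraction of don't care (X) bits in one test pattern."""
    return pattern.count("X") / len(pattern)


def select_patterns(patterns, omit_ratio):
    """Split patterns into (csc_group, nsc_group).

    A pattern goes to the NSC group when its X bit ratio is smaller
    than omit_ratio, otherwise to the CSC group.
    """
    nsc = [p for p in patterns if x_bit_ratio(p) < omit_ratio]
    csc = [p for p in patterns if x_bit_ratio(p) >= omit_ratio]
    return csc, nsc


def total_sizes(patterns, omit_ratio, spl_new):
    """Return (original_size, new_size) in bits under the assumed formulas."""
    spl = len(patterns[0])  # single pattern length
    csc, nsc = select_patterns(patterns, omit_ratio)
    original = len(patterns) * spl
    new = len(nsc) * spl + len(csc) * spl_new
    return original, new


# Hypothetical 8-bit patterns; spl_new = 5 is an arbitrary example value.
patterns = ["0X1XXXXX", "01101X00", "XXXX1XXX", "10X01011"]
csc_group, nsc_group = select_patterns(patterns, omit_ratio=0.5)
sizes = total_sizes(patterns, omit_ratio=0.5, spl_new=5)
```

Sweeping `omit_ratio` over a range of values and recomputing the sizes mirrors the idea of evaluating all ratios to find the best one.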
Pattern Compression Stage
After the test patterns for the CSC are determined, these patterns are compressed by the pattern compression stage. This step merges patterns using their X bits. For example, X000 and 0000 can be merged into the pattern 0000. Figure 8(a) lists the first four test patterns which are selected for the CSC. The test data is separated into columns of four bits. In this case, there are 5 columns and 1 residual bit at the end. The procedure merges test patterns from the first pattern of the first column to the last pattern of the first column until the first column is completely merged. Then, it starts to merge from the first pattern of the second column. Patterns in each column are merged independently. Figure 8(b) is an example of the merged pattern results. At the end of this stage, all remaining X bits in each merged pattern are filled with 0. Our pattern compression algorithm is shown in Fig. 9.
Column 1 in Fig. 8(b) shows that it has 8 merged patterns. It means that this column needs 3 bits to encode the 8 merged patterns. Similarly, columns 2, 3, 4, and 5 need 3-bit, 4-bit, 2-bit, and 1-bit respectively. Because column 3 still needs 4 bits to encode in this case, there is no gain on volume reduction. Test data in this column is not changed. Finally, the new test data length in this case is 14 bits since the residual bit is also added at the end of new test data.
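The merging and the bit count arithmetic above can be reproduced with a small sketch. It assumes a first-fit greedy merge, which may differ from the algorithm in Fig. 9, and it assumes every compressed scan unit holds at least one bit. Only the count of 8 merged patterns in column 1 comes from the example; the other counts are hypothetical values chosen to match the stated bit widths.

```python
def compatible(a: str, b: str) -> bool:
    """Two blocks can merge if every position agrees or has an X."""
    return all(x == y or "X" in (x, y) for x, y in zip(a, b))


def merge(a: str, b: str) -> str:
    """Merge two compatible blocks, keeping every specified bit."""
    return "".join(y if x == "X" else x for x, y in zip(a, b))


def merge_column(blocks):
    """First-fit greedy merge of one column, then fill X with 0."""
    merged = []
    for block in blocks:
        for k, existing in enumerate(merged):
            if compatible(existing, block):
                merged[k] = merge(existing, block)
                break
        else:
            merged.append(block)
    return [m.replace("X", "0") for m in merged]


def code_bits(n_merged: int) -> int:
    """Bits needed to index n_merged patterns (at least one bit)."""
    return max(1, (n_merged - 1).bit_length())


def new_length(merged_counts, block_bits=4, residual_bits=0):
    """New pattern length; columns with no gain keep block_bits bits."""
    return residual_bits + sum(
        min(code_bits(count), block_bits) for count in merged_counts
    )


column = merge_column(["X000", "0000", "1X00", "1100", "0X01"])
length = new_length([8, 8, 16, 4, 2], residual_bits=1)
```

With these counts the columns need 3, 3, 4, 2, and 1 bits, and the single residual bit brings the total to 14, matching the example.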
Power Optimization Stage
In order to minimize the shift-in power of the new test data, a greedy search method is applied to each column. Figure 8(c) shows the new test data in compressed form. This stage maps each new test data code to a merged pattern from the previous compression stage, so that the new code can be decoded back to the original test pattern. For example, the first column of Fig. 8(b) has 8 different merged codes. After the power optimization procedure, we get the result in Fig. 8(c). It shows that the codes of the first 4 test patterns in column 1 are 110, 110, 110, and 100. That means that, for the first column of the first test pattern, the shift-in data for the CSC is 110 and the decoded result for the NSC is 0000 (shown in Fig. 8(d)).
In this stage, we search for the code mapping with the fewest transitions, starting from the first column. Because the first column in Fig. 8(b) has 8 different codes, it needs 3 bits to encode them. The number of ways to assign the 8 three-bit codes to the 8 merged patterns is 8! = 40320. The optimization method evaluates all of these encodings and selects the one with the lowest switching power for the first column. Next, the procedure encodes the next column and again selects the encoding with the lowest switching power. The greedy method obtains the low power encoding results at the end of this stage. The encoding results become the new test data for the CSC. Figure 10 shows our power optimization algorithm.
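The exhaustive search over code assignments for one column can be sketched as below. The cost model is a simplifying assumption: it counts only the bit transitions inside each code word, summed over all test patterns of the column, and ignores transitions across column boundaries. The function names and the sample column are hypothetical, and Fig. 10 may define the cost differently.

```python
from itertools import permutations


def transitions(code: str) -> int:
    """Number of adjacent bit flips inside one code word."""
    return sum(a != b for a, b in zip(code, code[1:]))


def best_encoding(column_ids, n_bits):
    """Try every assignment of distinct codes to merged patterns.

    column_ids: merged-pattern index used by each test pattern
    n_bits    : code width for this column
    Returns (minimum cost, mapping from pattern index to code).
    """
    symbols = sorted(set(column_ids))
    codes = [format(v, "0{}b".format(n_bits)) for v in range(2 ** n_bits)]
    best_cost, best_map = None, None
    for perm in permutations(codes, len(symbols)):
        mapping = dict(zip(symbols, perm))
        cost = sum(transitions(mapping[s]) for s in column_ids)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, mapping
    return best_cost, best_map


# Hypothetical column: pattern 0 is used three times, the rest once.
cost, mapping = best_encoding([0, 0, 0, 1, 2, 3], n_bits=2)
```

The most frequently used merged pattern ends up with a transition-free code, which is the effect the column-by-column search aims for. With 3-bit codes and 8 merged patterns the same loop visits all 40320 assignments.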
Experimental Results
The experimental results of our approach on the ISCAS'89 benchmark circuits [30] are shown in this section. In order to simplify scan-based test power estimation, we use the number of transitions in the scan chain as the dynamic power unit and normalize it to that of routing driven scan chain ordering to highlight the power reduction ratio. The results are verified using PrimePower [13], and the estimation error is within 3%. For the interconnection length between scan cells, we use the direct connection length between scan cells.
The circuit characteristics and test vector information are shown in Table 1. The deterministic test vectors are generated by Syntest-TurboScan [11]. The loss of fault coverage (FC) is due to circuit design and aborted faults. The second column of the first part of the table shows the number of gates, including NOT gates, and the third column shows the number of D flip-flops. In the second part of the table, we show the number of test vectors for each benchmark circuit and the corresponding fault coverage. Note that the values in Table 1 do not change during the process.
The Results for Routing Cost Aware Scan Chain Ordering
The first part of the experimental results of our approach is shown in Table 2. In the second row, PR represents the power reduction ratio and WL represents the routing length of all scan cells. The placement result is generated by a placer which follows the design of the Dragon standard-cell placement tool [31], [32]. We also follow the assumption from [18] that there is a strong connection between routing length and routing congestion. We start by setting β to 0 and increase the value in steps of 0.2 until β is equal to 1. For each benchmark, we show its power reduction ratio and scan chain interconnection length. All power reduction ratios are normalized to the result with β=0 for each benchmark, which has the best routing cost and the poorest power consumption. From Table 2, we can see the tradeoff between test power and scan chain routing cost. For example, although s13207 has a 6.9% higher power reduction ratio at β=0.4 than at β=0.2, it costs 43.9% more wire length. With our well-defined cost function, we observe that the power saving ratio increases almost linearly with β from 0 to 1. From this result, test designers can control the tradeoff between power consumption and routing overhead more intuitively. We have implemented [18] to compare its experimental results with our approach. Table 3 shows the experimental results of [18] on our platform. As before, all power reduction ratios of each benchmark are normalized to the results of routing congestion driven scan chain reordering. The experimental results show that the more clusters we use, the smaller the power reduction and the shorter the scan chain length, which is the same trend as shown in [18]. Furthermore, the advantage of our approach is clearly shown in Fig. 11, in which we compare the results from Table 3 with our proposed approach. The horizontal axis is the routing wire length and the vertical axis represents the power reduction ratio compared with pure routing driven scan chain reordering.
We can see that our approach achieves a higher power reduction ratio under the same routing overhead. It also has less routing overhead under the same power reduction ratio. Table 4 validates the above statement and shows a 43% routing cost improvement on average.
In Fig. 12, we show the scan chain routing graph of the s9234 benchmark with four clusters in (a) and with β=0.6 in (b). We can clearly observe the advantage in routing congestion under the same test power saving ratio, which is around 19% in this example. As for the approach in [19], which clusters the chip with well distributed flip-flops, it has 1-3% improvement in power reduction ratio and about the same routing overhead compared with [18] in benchmark s9234. We then deduce from Table 3 that our approach has both better power reduction and better routing cost compared with the approach in [19].
The Results for Selective Scan Chain Architecture
The second part of the experimental results is shown in Table 5 and Fig. 13. We use the 1-unit-to-4-bit compressed scan unit strategy in our experimental architecture. The results show that different circuits need different X bit omit ratios to obtain the smallest test data size and the lowest test power. The results are normalized to the result of filling all X bits with 0. For test data volume reduction, we compare our results with [39] since the two approaches are similar. In Table 6, our reduction rate and that of the selected code method in [39] are similar. Both methods use 4-bit blocks as the encoding source. The selected code uses a simple-to-decode strategy, which reduces the complexity of the decoder design, but its test data volume is higher than that of optimal Huffman encoding. Our method optimizes the encoding column by column rather than over all columns at one time, which also degrades the compression ratio. Overall, both methods provide similarly good compression results.

Fig. 12 (a) By [18]. The design is partitioned into four clusters and has power saving ratio 19.02%. (b) By our approach. The β is set to 0.6 and has power saving ratio 19.30%. It is clear that our approach provides better wiring with the same power saving ratio.
The Integrated Methodology
We further propose an integrated flow of these two approaches to obtain more reduction in test power and, at the same time, a better tradeoff in scan chain wire length. The flow is shown in Fig. 4. Since we do not define the X bit pattern in our chain ordering technique, we can define cost weights for those patterns with X bits. Table 7 shows the pattern cost example.

Fig. 13 (a) For s38584, we can get the total optimal size reduction rate 50% at X bit omit ratio of 75%. (b) The power reduction rate of s38584 at X bit omit ratio of 90% is 8.4%.

Table 7 The cost weights for the patterns containing X bit in chain ordering technique of our combined approach.

Pattern       Cost
XX            0
X0 or 0X      0.5
X1 or 1X      0.5
10 or 01      1
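The cost weights of Table 7 can be turned into an X-aware bit difference as sketched below. The costs for the pairs XX, X0/0X, X1/1X, and 10/01 follow the table; the cost of 0 for the equal pairs 00 and 11 is an assumption, since the table does not list them. The function names are hypothetical.

```python
def pair_cost(a: str, b: str) -> float:
    """Cost of placing bits a and b in adjacent scan cells."""
    if a == "X" and b == "X":
        return 0.0
    if a == "X" or b == "X":
        return 0.5
    # Both bits specified: 10 or 01 costs 1; 00 and 11 assumed to cost 0.
    return 1.0 if a != b else 0.0


def bit_diff_with_x(bits_i: str, bits_j: str) -> float:
    """X-aware bit difference between the data of two scan cells."""
    if len(bits_i) != len(bits_j):
        raise ValueError("bit strings must have equal length")
    return sum(pair_cost(a, b) for a, b in zip(bits_i, bits_j))


# Hypothetical data for two scan cells over four vectors.
diff = bit_diff_with_x("X01X", "X1X0")
```

A value of this kind could replace the plain bit difference count in the edge weight when the chain ordering is run on patterns that still contain X bits.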
Conclusion
In this paper, we propose two approaches to alleviate the test power issue: routing cost driven scan chain design and selective test data compression. Since it does not constrain the chaining of scan cells to a specific cluster region, the first approach has more freedom to choose better scan cell connections. With a well-defined cost function for the weighted graph, it obtains better test power and routing cost optimization more explicitly. It also provides a tradeoff parameter that gives more flexibility in power or routing cost optimization. To further reduce test power consumption after scan cell reordering, selective test data compression is provided for lower test power and smaller test data volume with limited area overhead. However, the limitation of the decoder's performance also needs to be explored in this low scan power architecture design.
