Abstract-Three-dimensional (3-D) stacking of integrated circuits (ICs) using through-silicon-vias (TSVs) is a promising integration platform for next-generation ICs. Since TSVs are not fully accessible prior to bonding, it is difficult to test the combinational logic between scan flip-flops and TSVs at a prebond stage. In order to increase testability, it has been advocated that wrapper cells (WC) be added at both ends of a TSV. However, a drawback of WC is that they incur area overhead and lead to higher latency and performance degradation on functional paths. Prior work proposed the reuse of scan cells to achieve high testability, thereby reducing the number of WC that need to be inserted; however, practical timing considerations were overlooked and the number of inserted WC was still high. We show that the general problem of minimizing the WC count is equivalent to the graph-theoretic minimum clique-partitioning problem, and is therefore NP-hard. We adopt efficient heuristic methods to solve the problem and describe a timing-guided and layout-aware solution. We evaluate the heuristic methods using an exact solution technique based on integer linear programming. We also present a design-for-test optimization technique to leverage the reuse-based method during post-bond testing.
industry is headed toward further exploitation of the benefits provided by 3-D integration in a variety of product lines, such as 3-D NoC [4] , 3-D memory-on-processor [5] , and 3-D FPGA [6] . The emergence of 3-D logic-logic stacks has also been predicted for the near future [7] . Motivated by rapid advances in design methods and integration technology, researchers have started investigating test and design-for-testability techniques for 3-D ICs [8] [9] [10] [11] [12] [13] [14] .
3-D-integration is an attractive design platform because it utilizes the vertical dimension for creating a stack of multiple dies, thereby leading to a smaller footprint along with a high-density packing of transistors per unit volume [12] . In contrast to off-chip wire bonds [15] , a TSV-based 3-D-integration approach overcomes barriers in interconnect scaling. The TSVs are fabricated as cylindrical copper nails providing electrical connection from the active front-side of a silicon die through the silicon substrate to the back-side [12] . Prior to bonding, TSVs are not fully accessible because one of their ends is not connected to logic on other dies. Such one-ended TSVs lead to prebond testability problems of reduced controllability and observability. The combinational part of die logic between the last level of scan cells and outbound TSVs cannot be observed, and that between inbound TSVs and the first level of scan cells cannot be controlled. While this problem is more serious for logic-on-logic stacks, it is also a concern for the logic die in memory-on-logic designs. Prebond testing is, however, necessary for achieving high stack yield; it is desirable to have known-good-dies (KGDs) before stacking. Without design-for-test (DfT) innovations, the dies cannot be adequately tested before bonding.
In order to enable prebond testing by overcoming controllability and observability bottlenecks, wrapper cells (WC) are needed at the two ends of a TSV [12] . Such WC also facilitate post-bond testing of dies and the interconnects between dies. Since the number of TSVs on a die can be of the order of tens of thousands, the use of WC for each TSV can lead to significant area overhead. Moreover, WC on functional paths can lead to higher latency and performance degradation.
To reduce the overhead of WC, the reuse of existing primary inputs (PIs), primary outputs (POs), and scan cells for increasing the controllability and observability was proposed in [16] and [17] . In both these approaches, WC are required only for TSVs that cannot be controlled or observed using existing scan cells in the design. Heuristic methods were developed to reduce the number of WC without affecting fault coverage. However, prior work provides no insights on the minimum number of WC that are required for full testability, i.e., comparable to the case when the die is a stand-alone chip with I/O pads. These heuristics are therefore ad hoc and the effectiveness of the solutions provided by them needs to be carefully evaluated. Moreover, prior work did not consider the adverse timing impact of additional capacitive loading on functional paths if scan cells are reused for TSV controllability/observability enhancement.
In this paper, we show that the general problem of minimizing the WC count is NP-hard. The proof of complexity is based on polynomial-time reduction from the well-known NP-complete problem of finding a minimal clique partition in a graph [18] . We adopt efficient heuristic algorithms to solve this problem, and compare our results with the heuristic method proposed in [16] . We also compare our results with exact results obtained using integer linear programming (ILP) on smaller benchmark circuits. The proposed method allows timing constraints to be incorporated in the following ways.
1) Scan cells that lie at the ends or beginnings of critical paths can be discarded from the set of candidates for reuse. 2) An upper limit can be placed on the number of scan cells that are considered for reuse. 3) An upper limit can be placed on the number of TSVs whose testability is enhanced by the reuse of any single scan cell. To avoid routing congestion due to the reuse of existing scan cells, we place a practical constraint on the maximum distance between any (TSV and scan cell) pair. A scan cell is not reused for a TSV if the distance between the scan cell and the TSV exceeds a given threshold. Experimental results are presented for benchmark circuits from the OpenCore [19] and ITC'99 [20] benchmarks.
The remainder of this paper is organized as follows. Section II provides details about the motivation for this paper and discusses related prior work. The problem statement is formulated in Section III, and we show that the optimization problem is NP-hard. In Section IV, we adopt heuristic graph-theoretic algorithms and propose a generic approach that can be used to incorporate timing and layout information. Experimental results are discussed in Section V. Finally, Section VI concludes this paper.
II. MOTIVATION AND PRIOR WORK
The TSVs in a 3-D-stacked integrated circuit (IC) are used to connect metal layers using landing pads across multiple dies. In accordance with terminology used in literature, we refer to TSVs that drive gates in a die as inbound TSVs for that die, and outbound TSVs for a die are TSVs driven by that die. A TSV that is both inbound and outbound TSV for a die is called a bidirectional TSV. We only consider inbound and outbound TSVs in this paper, but the solutions for them can be extended for bidirectional TSVs.
Various options for partitioning a design into multiple active layers are explored in [5] . Trade-offs between design effort and latency-power-area benefit are analyzed for different granularity of partitioning, i.e., partitioning at the core level, functional-block level, gate level, or transistor level. At the granularity of cores, a multicore design can be built by stacking different cores on top of each other, or stacking a layer of in-package L2/L3 cache over a layer of cores. At the next level of granularity, stages of processor pipeline can be split across multiple dies, as also proposed in [7] . A functional block can be split across multiple dies at the next level of granularity, e.g., splitting of cache lines, or bit-slicing of functional units. At the finest level is the transistor level, e.g., all nMOS transistors can be placed on one die and all pMOS devices on another die, but the state-of-the-art via technology is not ready to support this scheme of partitioning in the near future [5] . Moreover, it requires a tremendous amount of redesign effort. Applying prebond tests will also require fault modeling at the transistor level. For all other granularity of partitioning, redesign effort is less and simpler gate-level fault models can be used for prebond testing. Therefore, we do not consider transistor-level granularity of partitioning in this paper.
For the first three levels, partitioning can be done in three ways [21] : 1) combinational to sequential, where sequential elements drive one end of TSVs, and the TSVs drive combinational logic on the other side; 2) sequential to combinational, where combinational logic drives the outbound TSVs on one die and the corresponding inbound TSVs on another die drive sequential elements; and 3) combinational to combinational, where combinational logic is present on both ends of TSVs. It is argued in [21] that a fourth approach of partitioning, namely, "sequential to sequential," is unlikely for most designs, as buffers and delay elements are always present in such paths to avoid potential hold violations due to clock skew. Our experimental circuits are a hybrid of the above partition approaches. Note that in the first approach, it is not possible to observe the faults activated in the combinational blocks during prebond testing, whereas controllability is affected during prebond testing of the combinational blocks in the second approach. In the third approach, testability is compromised during prebond as well as post-bond testing phases.
Appropriate DfT solutions are required to provide controllability and observability to TSVs and the logic connected to them. In this section, we first assess the impact of open-ended TSVs on stuck-at and transition fault coverage during prebond testing. We show that the fault coverage is considerably reduced if WC are not used. Next, we review [16] and [17] , which address this problem by reusing existing scan cells. Testability issues related to post-bond testing are addressed in Section IV.
A. Fault Coverage Without WC
Table I shows the number of combinational logic gates before the first level and after the last level of scan cells (as a percentage of total gate count) in the netlist for a two-die stack implementation of OpenCore benchmarks [19] , namely cf_fft_256_8, cf_rca_16, and des_perf. Note that we are reporting the count of those gates that only have inbound TSVs in their fan-in (FI) cone for the first level of combinational logic, and those gates that only have outbound TSVs in their fan-out (FO) cone for the last level of combinational logic. In addition to these gates, the faults on gates that Table I are conservative with respect to the decrease in fault coverage that will be observed in the absence of WC. Table II shows how the stuck-at and transition fault coverage is affected when no WC are added to make the TSVs controllable and observable at the prebond stage. The circuits from the OpenCore benchmarks are each partitioned into two dies for performance optimization, and the fault coverage for the two dies in each benchmark is listed for two cases: 1) WC are added for increasing testability and 2) WC are not inserted. For the same two cases, Table III shows stuck-at and transition fault coverage for a four-die implementation of cf_rca_16. It further highlights the importance of equipping die logic with sufficient testability for prebond testing. A commercial ATPG tool was used to calculate the fault coverage in each case.
Based on these results, we conclude that it is necessary to incorporate a DfT solution for prebond testing.
However, the addition of WC leads to die-area overhead. As shown in Fig. 1 , the WC are typically based on either IEEE Standard 1149.1 or IEEE Standard 1500 for supporting various test modes. Using these cells requires an additional area of one or two flip-flops per end for each TSV. As discussed in [12] , when dies are designed by different teams or companies, it is recommended that ripple-protected WC be used; in that case, an even higher overhead (four flip-flops) is required at each TSV to achieve testability. Therefore, the insertion of WC at every TSV imposes significant overhead.
B. Related Prior Work
It is shown in [16] and [17] how existing scan cells can be reused to provide testability to dies prior to bonding. For an inbound TSV, a scan cell is selected such that it does not have an overlapping FO cone with the TSV, as shown in Fig. 2(a) . The output of the selected scan cell is multiplexed with the TSV [see Fig. 2(b) ]. Fig. 3 illustrates how fault coverage is affected if the FO cones of a TSV and the selected scan cell overlap. The error due to the s-a-1 fault at line A cannot be propagated to the output. Although the fault is activated, it is not possible to create a sensitized path to an observable output. If the FO cones are nonoverlapping, then there is no loss in testability if the corresponding scan cell is selected. Similarly, for outbound TSVs, the criterion for choosing a scan cell is that the FI cones of an outbound TSV and the selected scan cell do not overlap. In order to use a scan flop to detect faults in the FI cone of an outbound TSV, a multiplexer and an XOR gate are required, as shown in Fig. 4 . A heuristic was proposed in [16] for selecting a scan cell based on the above requirements, and the number of reused scan cells was maximized. It was shown how the reuse of scan cells results in a significant reduction in the number of WC required. This DfT solution leads to an increase in the pattern count, but it was reported in [16] that the test-data volume was reduced due to a reduction in the number of bits per pattern. A drawback of this solution is that it is targeted only at prebond testing; post-bond interconnect testing becomes more complex and it has to rely on ATPG for the in-test mode.
Fig. 3. Example showing an untestable s-a-1 fault at line A due to overlapping FO.
A limitation of [16] is that after reusing a scan flop for a TSV, the scan flop is not considered for further reuse by other TSVs. The heuristic proposed in [16] does not update the FO (FI) of scan cells that are reused. Moreover, the ad hoc approach in [16] does not provide any insights on the minimum number of WC required for full testability. Another limitation of [16] is that it does not consider timing constraints that impose restrictions on the use of scan flops that are at the ends of critical paths.
In [17] , a hypergraph-based partitioning algorithm was presented to partition a netlist into two dies such that the number of scan flops connected to TSV ends is maximized. However, such a partitioning criterion is impractical, since partitioning a circuit needs to exploit the benefits offered by TSVs, such as higher performance and reduced power consumption. Kumar et al. [17] proposed a heuristic method for assigning flops to TSVs using concepts from graph theory. However, their approach searches for a solution locally; only scan flops that are connected to TSV ends are reused in a greedy manner. Our solution satisfies the conditions required for testability, but instead of being an ad hoc approach as in [16] and [17] , it is based on a formal method that provides globally optimal results. We first show that the general problem of reducing the number of additional WC is NP-hard. We construct a polynomial-time reduction from the minimal clique-partitioning problem, which is known to be NP-complete [18] , and subsequently adopt heuristic algorithms to solve this problem. We also implement an exact approach for solving this problem using ILP, and compare between solutions obtained using various algorithms.
III. PROBLEM DESCRIPTION
In this section, we present more details about the optimization problem. We assume a full-scan design, and refer to a scan flip-flop simply as a flop in the subsequent discussion.
A. Selection Criteria for Flops
Definition 1: The FO cone of a gate is obtained by tracing downstream from every output of that gate, and the tracing is bounded by flops and POs. Similarly, the FI cone of a gate is obtained by tracing upstream from every input signal of that gate, and the tracing is bounded by flops and PIs. The FO (FI) cone of an inbound (outbound) TSV is defined in a similar manner.
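Definition 1 can be illustrated with a minimal sketch. It assumes the netlist is available as a plain dictionary mapping each node to the nodes it drives; the names `fanout_cone`, `cones_overlap`, and the toy netlist are hypothetical and not part of the original flow. As a further assumption, the bounding flops and POs themselves are left out of the cone, so that only logic gates are compared. The FI cone would follow the same pattern with a fan-in map.

```python
from collections import deque

def fanout_cone(start, fanout, boundary):
    """Gates reachable downstream from `start`; tracing stops at flops/POs.

    Assumption: boundary nodes (flops, POs) bound the trace and are not
    themselves recorded as part of the cone.
    """
    cone = set()
    queue = deque(fanout.get(start, ()))
    while queue:
        node = queue.popleft()
        if node in cone or node in boundary:
            continue
        cone.add(node)
        queue.extend(fanout.get(node, ()))
    return cone

def cones_overlap(a, b, fanout, boundary):
    """True if the FO cones of nodes a and b share at least one gate."""
    return not fanout_cone(a, fanout, boundary).isdisjoint(
        fanout_cone(b, fanout, boundary))

# Hypothetical toy netlist: one inbound TSV, four flops, one PO.
fanout = {
    "tsv1": ["g1"], "g1": ["g3"],
    "ff_a": ["g2"], "g2": ["g3", "ff_c"],
    "g3": ["ff_b"],
    "ff_d": ["g4"], "g4": ["po1"],
}
boundary = {"ff_a", "ff_b", "ff_c", "ff_d", "po1"}

overlap_a = cones_overlap("tsv1", "ff_a", fanout, boundary)  # share gate g3
overlap_d = cones_overlap("tsv1", "ff_d", fanout, boundary)  # disjoint cones
```

In this toy netlist, `ff_d` would be an acceptable reuse candidate for `tsv1` under the FO-cone criterion, whereas `ff_a` would not.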
We first consider the following problem. Given a circuit consisting of TSVs and flops, assign a flop to each TSV, such that: 1) the FO cones of an inbound TSV and the assigned flop do not overlap; 2) the FI cones of an outbound TSV and the assigned flop do not overlap. If no assignment is possible for a TSV, a WC is added. Fig. 5 shows how an added WC can be reused for multiple TSVs. A WC can be assigned to multiple TSVs; but note that once an assignment is made, the FO/FI cone of the assigned flop (or WC) is changed. If a flop is assigned to an inbound TSV, then the new FO cone of the flop will also include the FO cone of the inbound TSV. Similarly, the FI cone of a flop includes the FI cones of the outbound TSVs assigned to it. Based on this information, the following lemma can be stated.
Lemma 1: Any two inbound TSVs that are assigned to a common flop do not have overlapping FO cones. Similarly, any two outbound TSVs that are assigned to a common flop do not have overlapping FI cones.
Proof: The lemma can be proved by contradiction. Suppose that two inbound TSVs have overlapping FO cones. Then any flop that is assigned to one TSV cannot be assigned to the other; otherwise the selection criterion is violated. This is because the FO cone of the assigned flop is now a superset of the FO cone of the first TSV after the assignment. However, we assumed that the two TSVs were assigned the same flop, which is a contradiction. The proof for the case of outbound TSVs is similar.
Next, we discuss our optimization problem and the clique-partitioning problem from graph theory, and establish the relationship between these two problems.
B. WC-Count Minimization (WCM) Problem
Given a design with flops and TSVs, minimize the number of additional WC that are required to provide complete testability to all TSVs (and connected logic), such that the flop selection criteria are not violated. We refer to this problem as WCM. Note that if no additional WC are required, then all the TSVs are assigned the existing flops in the design.
C. Clique-Partitioning Problem
Definition 2: A clique is a complete graph, i.e., a graph in which each node is connected to every other node. It can be easily seen that the number of edges in a clique with n nodes is $\binom{n}{2}$, i.e., $n(n-1)/2$. Given a graph, the problem of partitioning the vertex set of the graph into disjoint sets, such that each set forms a clique and the number of sets is minimized, is known as the minimal clique-partitioning problem. This problem is known to be NP-hard [18] . The decision version of this problem, i.e., whether a given graph can be partitioned into k cliques, is known to be NP-complete. We use the clique-partitioning problem, or CPP, to prove that the WCM problem is NP-hard.
Note that solving CPP is equivalent to solving a graph coloring problem [18] , and any method for solving one problem can be used to solve the other problem. We use this relationship to utilize known ILP models for the graph coloring problems to obtain an exact solution for WCM, which can be applied to solve small problem instances.
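The relationship between CPP and graph coloring can be sketched as follows: nodes that receive the same color in the complement graph are pairwise adjacent in the original graph, so each color class is a clique. The sketch below uses a simple greedy coloring, which is only a heuristic and is not the exact ILP-based method; the function names and the example graph are hypothetical.

```python
from itertools import combinations

def complement(nodes, edges):
    """Edges of the complement graph: pairs that are NOT adjacent in G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(nodes, 2)
            if frozenset((u, v)) not in present]

def greedy_coloring(nodes, edges):
    """Assign each node the smallest color unused by its colored neighbors."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in nodes:
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def clique_partition_via_coloring(nodes, edges):
    """Color the complement graph; each color class is a clique in G."""
    color = greedy_coloring(nodes, complement(nodes, edges))
    groups = {}
    for v, c in color.items():
        groups.setdefault(c, []).append(v)
    return list(groups.values())

# Hypothetical example: a triangle a-b-c with a pendant node d attached to c.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
parts = clique_partition_via_coloring(nodes, edges)
```

Because the coloring is greedy, the number of cliques returned is an upper bound on the minimum, not necessarily the minimum itself.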
D. Polynomial-Time Reduction From CPP
Given a general instance of CPP, we describe a construction that can be used to prove that WCM is NP-hard. Given a graph G, and an integer k, we construct an instance of WCM, and show that G can be partitioned into k cliques if and only if the instance of WCM can be solved using k cells (includes both existing flops and additional WC). We use the following definition in our construction.
Definition 3: A set of nodes in a graph is called an independent set if no two nodes of the set are connected.
Without loss of generality, let us assume that the vertex set V of graph G is ordered in any arbitrary way, i.e., $V = \{v_1, v_2, \ldots, v_{|V|}\}$. Then using vertex $v_1$, a polynomial-time and deterministic labeling procedure, such as in Fig. 6 , can be developed that returns a maximal independent set containing vertex $v_1$. Let the size of the returned set be m. Then G consists of m + n nodes, where n = |V| − m. The labeling procedure also labels these m nodes as flops, hereafter referred to as "flop nodes," and labels the remaining n nodes as TSV ("TSV nodes"). The labeling corresponds to an instance of WCM; the m nodes correspond to m existing flops in the design that can be assigned to TSVs, and n TSV nodes correspond to n TSVs. In this graph model, an inbound TSV is not differentiated from an outbound TSV because this does not affect the proof. Because of this labeling scheme, there are no edges between flop nodes. An edge between two TSV nodes exists only if the corresponding TSVs do not have overlapping FOs/FIs. Similarly, an edge between a TSV node and a flop node exists only if FOs/FIs of the corresponding flop and TSV do not overlap. An edge between a flop node and a TSV node in the graph means that the corresponding flop and the inbound (outbound) TSV do not have overlapping FO (FI) cones in the circuit. Similarly, if two TSV nodes in the graph have an edge between them, then the FO (FI) cones of the corresponding inbound (outbound) TSVs do not overlap in the circuit. We prove below that partitioning G into k cliques is equivalent to using k cells (both reused flops and WC) for all the TSVs.
Lemma 2: If it is possible to partition G into k cliques, then k ≥ m.
Proof: Since G has an independent set of size m, these m nodes will belong to different cliques, i.e., a clique will have at most one of these m nodes. Therefore, the proof follows from the definition of a clique.
Let us assume that the labeled graph G can be partitioned into k cliques. Then it immediately follows from the labeling scheme that a flop can be assigned to a set of TSVs, only if the corresponding nodes in G form a clique. Since the edges represent the "nonoverlapping condition" between nodes, a clique in G corresponds to a feasible assignment.
From the above lemma, a clique cannot have more than one flop node. If a clique consists only of one flop node, then the node is not assigned to any TSV. If a clique consists only of TSV nodes (there are k − m such cliques), then an additional WC is required. If G can be divided into k cliques, then the number of additional WC needed is equal to k − m.
Conversely, it can be seen that if k cells are needed for providing testability to all the TSVs (and connected logic), then G can be partitioned into k cliques. This follows from the fact that a flop node and a set of TSV nodes form a clique in G only if the corresponding flop node can be assigned to the corresponding TSVs. Also, if a group of TSVs is assigned a new WC, then the corresponding TSV nodes form a clique in G. Fig. 7 shows a one-to-one correspondence between a general instance of CPP and the transformed instance of WCM. In this example, m = 2, n = 3, and the number of cliques (k) formed is 2. Therefore, the number of additional WC needed is k − m = 0. The graph is labeled after running the labeling procedure in Fig. 6 . Since the labeling procedure is deterministic, there is a one-to-one mapping between the two instances. Moreover, the run-time complexity of the labeling procedure is $O(|V|^2)$; therefore, the instance transformation step runs in polynomial time. Fig. 8 shows another example with m = 3, n = 10, and k = 5, and the number of additional wrappers inserted is k − m = 2.
We conclude that WCM is NP-hard, since we have carried out a polynomial transformation of a given instance of CPP to an instance of WCM.
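The labeling step used in the reduction can be sketched as a greedy construction of a maximal independent set that starts from the first vertex. This is one possible procedure with the stated properties (deterministic, quadratic in |V|); it is not claimed to be the procedure of Fig. 6. The function name and the five-node graph are hypothetical, and the graph is not the one in Fig. 7.

```python
def label_nodes(vertices, edges):
    """Split vertices into flop nodes (a maximal independent set that
    contains vertices[0]) and TSV nodes (everything else)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    flops = []
    for v in vertices:  # fixed order, so the labeling is deterministic
        # v joins the set only if it is adjacent to no node already chosen
        if all(u not in adj[v] for u in flops):
            flops.append(v)
    tsvs = [v for v in vertices if v not in flops]
    return flops, tsvs

# Hypothetical instance with m = 2 flop nodes and n = 3 TSV nodes.
vertices = ["v1", "v2", "v3", "v4", "v5"]
edges = [("v1", "v2"), ("v1", "v3"), ("v2", "v3"), ("v3", "v4"), ("v4", "v5")]
flops, tsvs = label_nodes(vertices, edges)

# One valid partition of this graph into k = 2 cliques:
cliques = [["v1", "v2", "v3"], ["v4", "v5"]]
additional_wc = len(cliques) - len(flops)  # k - m
```

Here every clique contains exactly one flop node, so no additional WC would be needed for this hypothetical instance.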
IV. DESIGN ENHANCEMENT: DIE-LEVEL MODULAR TESTING
A DfT architecture based on IEEE Standard 1500 for supporting modular testing of dies and embedded cores is proposed in [22] . This wrapper design is also the basis for ongoing IEEE P1838 standardization efforts for 3-D-stacked ICs [23] . Fig. 9 shows an IEEE Standard 1500 WC. The combination of the multiplexer $m_1$ and the flip-flop shown in the figure can be viewed as a scan flip-flop; therefore, the select line of $m_1$ is referred to as scan enable or SE. Note that the WC enables data capture on the line CFI, which is not possible through the reuse methodology described in Fig. 2(b) for inbound TSVs. Similarly, using the solution proposed in Fig. 4(b) , the flop that is used for observing faults present in the FI cone of an outbound TSV cannot be used for controlling the TSV. In contrast, the 1500 output WC can control the outbound TSV that is wrapped. These shortcomings make the reuse scheme incompatible with modular testing of dies during post-bond test, as envisioned in [22] . Therefore, we propose DfT enhancements that support the various operating modes of IEEE Standard 1500. Fig. 10 shows how a multiplexer is added for capturing the values on an inbound TSV. When the capture line is asserted, the scan flop captures the value on the TSV line; otherwise, the flop captures the output of its FI cone. Therefore, this design does not interfere with the normal mode. Since we want a flop to be reused multiple times, XOR gates are used for compacting the values on multiple TSV lines (see Fig. 11 ).
For solving the controllability problem of the outbound TSVs, we require an extra multiplexer per TSV. Fig. 12 describes the proposed scheme. A high value on the transfer line pushes the content of the scan flip-flop to the outbound TSVs. The design enhancements made for both inbound and outbound TSVs together mimic the behavior of the IEEE Standard 1500 WC under various operating modes. Just as the WC are stitched together to form a ring around a die in the die-wrapper architecture [22] , the scan flops that are being reused are stitched together to form a chain. Table IV lists various modes that are supported by a wrapper boundary register in IEEE Standard 1500 [24] . The table compares signal values of IEEE Standard 1500 WC with those of the proposed DfT to support those modes, for both inbound and outbound TSVs. These modes collectively form the basis of the rich instruction set of IEEE Standard 1500. In the inward-facing mode, the inputs are controlled in IEEE Standard 1500. In the proposed scheme, the inbound TSVs are controlled by setting the appropriate signals, as shown in the table. The outputs or the outbound TSVs are controlled in the outward-facing mode. The inputs (outputs) are observed during the outward-facing (inward-facing) mode. The safe state allows the setting of inputs and outputs to a known state. The symbol "X" indicates a don't-care. While shifting test patterns during INTEST, the capture signal can be a don't-care (because the scan flops will be in the shift mode), but it should be set to 0 during the capture cycle to capture the values on the FI cones of the reused scan flops.
For a bidirectional TSV, DfT structures must be added for both FI and FO cones of the TSV. Fig. 13 illustrates the proposed design for a bidirectional TSV.
V. HEURISTIC METHODS AND ILP MODEL
As discussed in Section III, CPP and graph coloring are equivalent problems; since WCM can be solved by targeting CPP, graph-coloring methods can also be applied to it. We adopt Tseng's algorithm from [25] and method-1 from [26] . We refer to these adaptations as "one-clique-at-a-time" (OCAT) and "multiple-cliques-at-a-time" (MCAT), respectively. We describe these algorithms next, and then discuss an ILP model for generating optimal (exact) solutions.
A. OCAT Algorithm
Tseng and Siewiorek [25] used clique partitioning for solving optimization problems in the design of processors at the register-transfer level. The problems of allocation of registers, data operators and interconnect units were modeled as CPP, and a heuristic algorithm was developed. This algorithm constructs maximal cliques one at a time, by selecting an edge that is "best" connected, and instantiating a clique with the selected edge. Edge contraction, as explained in Fig. 14 , is applied repeatedly with the selected edge, until no edges are left in the graph. The pseudocode for the adaptation of this algorithm (OCAT) is presented in Fig. 15 . Fig. 16 shows an example of edge contraction on an edge connecting vertices p and q.
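The one-clique-at-a-time idea can be sketched as follows. This is a simplified reading of the description above, not the pseudocode of Fig. 15: it assumes that the "best" connected edge is the one whose endpoints share the most common neighbors, and that contracting an edge (p, q) leaves the merged node adjacent only to the common neighbors of p and q. The names `contract` and `ocat_partition` and the example graph are hypothetical.

```python
def contract(adj, members, p, q):
    """Merge q into p; the merged node keeps only the common neighbors."""
    common = adj[p] & adj[q]
    for w in (adj[p] | adj[q]) - {p, q}:
        adj[w].discard(p)
        adj[w].discard(q)
    for w in common:
        adj[w].add(p)
    adj[p] = common
    del adj[q]
    members[p].extend(members.pop(q))

def ocat_partition(nodes, edges):
    """Grow one clique at a time by repeated edge contraction."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    members = {v: [v] for v in nodes}  # merged node -> original nodes
    cliques = []
    while adj:
        # Score every remaining edge by its number of common neighbors.
        seeds = [(len(adj[u] & adj[v]), u, v) for u in adj for v in adj[u]]
        if not seeds:  # no edges left: the rest are single-node cliques
            cliques.extend(members[v] for v in adj)
            break
        _, p, q = max(seeds, key=lambda t: t[0])
        while True:
            contract(adj, members, p, q)
            if not adj[p]:  # the clique around p cannot grow further
                break
            q = max(adj[p], key=lambda w: len(adj[p] & adj[w]))
        cliques.append(members.pop(p))
        del adj[p]
    return cliques

# Hypothetical example: two triangles joined by the single edge c-d.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
cliques = ocat_partition(nodes, edges)
```

Ties between equally scored edges are broken arbitrarily in this sketch, so the order of nodes inside each clique may vary between runs.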
B. MCAT Algorithm
This approach also selects edges one by one for edge contraction. However, the order in which edges are selected is different from [25] . Moreover, it does not find maximal cliques in a greedy manner. In [25] , maximal cliques are formed one at a time, but in this approach, multiple cliques can be formed simultaneously. We refer to this approach as MCAT. The algorithm is detailed in Fig. 17 .
C. Incorporating Layout Information
Until now, we considered all flops in the die for assignment to a TSV. However, flop assignments should be made under wire-length constraints. The assignment of a flop to a TSV such that the flop is far from the TSV in the circuit layout is not desirable because it can lead to routing congestion, and subsequently to performance degradation. Therefore, we have developed a layout-aware approach that accounts for wire lengths in the optimization method. First, a layout is generated for the original netlist optimization for prebond testability. After a graph is constructed corresponding to a WCM instance, an edge between a flop and a TSV is dropped if the distance (wire-length) between the corresponding flop and the TSV is above a preset threshold (l_th). Dropping this edge ensures that the corresponding nodes are not grouped in the same clique. Another practical constraint arises from the placement of TSVs in clusters. Therefore, reusing a scan flop for TSVs belonging to different clusters should be avoided. If TSVs are placed in clusters, edges between TSVs lying in 
D. Incorporating Timing Constraints
The size of a clique is equal to the number of gates the corresponding flop from the clique has to drive, in addition to the existing gates that the flop was driving. Therefore, flop reuse leads to an increase in the capacitive load for that flop, which increases the delay of paths that originate from that flop (see Fig. 18 ). This can lead to performance degradation. The size of a clique also determines the number of XOR gates used for compacting the output response on the corresponding TSVs. Although the functional path faces the additional delay of only one multiplexer, the delay in the test path during test mode is exacerbated by the presence of multiple XOR gates. For these two reasons, namely the increased capacitive load on the driving flop and the increased delay in test paths, we limit the size of the clique that can be formed. The parameter q is used for controlling the size of a clique; q is the number of times a flop can be reused.
We must avoid adding multiplexers and XOR gates on critical paths. We must also find all paths that have the potential to suffer performance degradation due to the proposed DfT methodology. We remove from reuse consideration those flops and TSVs that are either sources or sinks of paths that are critical or have the potential to become critical after DfT insertion. We find the delay of the worst path from every scan flop and TSV, and sort them in order of path delay (highest to lowest). Next, we discard a fixed fraction (say 5%) of the flops and TSVs from the top of the list, and thus eliminate potential timing violations. We repeat the same method for flops that are sinks for the longest paths in the circuit.
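The layout and timing constraints described above amount to two pruning steps on the WCM graph, sketched below. Several details are assumptions made only for illustration: Manhattan distance stands in for wire length, the discarded fraction is rounded up to a whole number of nodes, and the names `prune_far_edges` and `drop_timing_critical` together with all coordinates and delays are hypothetical.

```python
import math

def prune_far_edges(edges, position, flops, l_th):
    """Drop flop-TSV edges whose endpoints are farther apart than l_th.

    Assumption: Manhattan distance between placements approximates
    wire length. TSV-TSV edges are left untouched.
    """
    kept = []
    for u, v in edges:
        if (u in flops) != (v in flops):  # exactly one endpoint is a flop
            (x1, y1), (x2, y2) = position[u], position[v]
            if abs(x1 - x2) + abs(y1 - y2) > l_th:
                continue
        kept.append((u, v))
    return kept

def drop_timing_critical(worst_delay, fraction=0.05):
    """Discard the given fraction of nodes with the largest worst-path delay."""
    ranked = sorted(worst_delay, key=worst_delay.get, reverse=True)
    n_drop = math.ceil(fraction * len(ranked))  # assumption: round up
    return set(ranked[n_drop:]), ranked[:n_drop]

# Hypothetical data: delays in ns, placements in arbitrary layout units.
worst_delay = {"ff1": 1.9, "ff2": 0.7, "tsv1": 1.2, "tsv2": 0.4}
kept, dropped = drop_timing_critical(worst_delay, fraction=0.25)

position = {"ff2": (0, 0), "tsv1": (3, 4), "tsv2": (1, 1)}
edges = [("ff2", "tsv1"), ("ff2", "tsv2"), ("tsv1", "tsv2")]
pruned = prune_far_edges(edges, position, flops={"ff2"}, l_th=5)
```

Both steps only remove nodes or edges before clique partitioning, so the heuristics themselves can run unchanged on the reduced graph.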
E. Limiting the Reuse of Flops
In the above heuristic methods (see [16] , [17] ), no limits are placed on the size of cliques that are formed during optimization. In case of a clique corresponding to inbound TSVs, the size of the clique is equal to the number of gates the corresponding flop from the clique has to drive. Hence, flop reuse leads to an increase in the capacitive load for that flop, which increases the delay of paths that originate from that flop (see Fig. 18 ). If layout data is available, parasitic and timing analysis can be used to rule out some flops from being considered as reuse candidates. A limit must also be placed on the number of times a flop can be reused, say q. Moreover, we can preselect a fixed number of flops, say p, that can be reused. The search process is guided by the values of p and q. During contraction of edges in the proposed method, one of the following actions is taken: 1) if a contracted edge creates a clique of size q, all edges from this clique to other TSVs or intermediate cliques are deleted, so that only a flop node can be selected in subsequent iterations; 2) if the contraction results in a clique of size q + 1, all edges are dropped from the clique. It can be easily seen that, using the above edge-deletion rules, no flop or newly added WC is reused more than q times.
Similarly, a limit should be placed on the size of cliques corresponding to outbound TSVs. The XOR gates are placed on FI path of the scan flop. Reusing a flop multiple times for providing observability to the logic connected to outbound TSVs leads to performance degradation. Therefore, a limit is also placed on the maximum number of outbound TSVs that can be assigned to the flop.
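To illustrate the effect of the reuse limit q, the sketch below groups TSVs under a per-cell cap. It is a deliberately simple greedy grouping, not the edge-contraction heuristics (OCAT or MCAT) referred to in the text; the function names and the representation of the graph as a set of compatible pairs are assumptions made for the example.

```python
def grow_clique(members, pool, ok, cap):
    """Greedily pick up to `cap` TSVs from pool that are compatible with all members."""
    chosen = []
    for t in pool:
        if len(chosen) == cap:
            break
        if all(ok(t, m) for m in members + chosen):
            chosen.append(t)
    return chosen

def capped_greedy_partition(flops, tsvs, compatible, q):
    """Assign TSVs to flops or new wrapper cells, at most q TSVs per cell.

    compatible: set of frozenset({u, v}) pairs that may share a cell
    (hypothetical encoding of the edges of the graph model).
    """
    ok = lambda u, v: frozenset((u, v)) in compatible
    pool = list(tsvs)
    reuse = {}  # flop -> TSVs that reuse it
    for f in flops:
        chosen = grow_clique([f], pool, ok, q)
        if chosen:
            reuse[f] = chosen
            pool = [t for t in pool if t not in chosen]
    wrapper_cells = []  # each entry lists the TSVs sharing one new WC
    while pool:
        seed = pool[0]
        group = [seed] + grow_clique([seed], pool[1:], ok, q - 1)
        wrapper_cells.append(group)
        pool = [t for t in pool if t not in group]
    return reuse, wrapper_cells

# Toy graph: one flop, four TSVs.
pairs = [("f1", "t1"), ("f1", "t2"), ("f1", "t3"),
         ("t1", "t2"), ("t1", "t3"), ("t2", "t3"), ("t3", "t4")]
compatible = {frozenset(p) for p in pairs}
reuse, wcs = capped_greedy_partition(["f1"], ["t1", "t2", "t3", "t4"], compatible, q=2)
print(reuse)  # {'f1': ['t1', 't2']}
print(wcs)    # [['t3', 't4']]
```

In this toy case, lowering q to 1 leaves three TSVs without a shared cell, which shows how a tighter reuse limit trades load on each flop against the number of additional WC.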
F. ILP Formulation for CPP
An ILP model allows us to leverage existing solvers to obtain optimal solutions. Even though ILP is not always scalable for large problem sizes, it allows us to evaluate the quality of heuristics by comparing heuristic solutions with exact solutions for smaller problem sizes. In order to develop an ILP model, we consider the following definition.
Definition 4: The complement of a graph G is another graph H having the same set of vertices such that any two vertices are adjacent in G if and only if they are not adjacent in H.
The graph-coloring problem can be formulated as an ILP model. The goal of this problem is to minimize the number of colors required to color all the nodes of a given graph, such that no two adjacent nodes are assigned the same color. It can easily be seen that the nodes that are assigned the same color in graph G form a clique in the complement graph H, and minimizing the number of colors used in G is equivalent to minimizing the number of clique partitions in H. The number of vertices (N) in the graph can be used as a bound on the number of colors used. The binary variables $y_k$ and $x_{ik}$ are defined in the following way. If color k (k = 1, 2, . . . , N) is used, then $y_k = 1$; otherwise $y_k = 0$. The binary variable $x_{ik}$ (i = 1, 2, . . . , N) indicates whether node i is assigned color k. With E denoting the edge set of G, the ILP model is given below.

$$\text{minimize} \quad \sum_{k=1}^{N} y_k$$

subject to

$$\sum_{k=1}^{N} x_{ik} = 1, \quad i = 1, \ldots, N$$

$$x_{ik} \le y_k, \quad i, k = 1, \ldots, N$$

$$x_{ik} + x_{jk} \le 1, \quad \forall (i, j) \in E, \; k = 1, \ldots, N$$

$$x_{ik}, \, y_k \in \{0, 1\}$$
The first constraint ensures that each node is colored. The second constraint makes sure that the color k is counted only if color k is used by a node, and the last constraint guarantees that the connected nodes do not get the same color.
We extend the above basic formulation to put a constraint on the number of times a flop or a WC can be reused. The constraint can be stated as $\sum_{t} x_{tk} \le q$ for each color k, where t varies over the indexes of all TSV nodes.
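For very small graphs, the coloring model and the reuse constraint can be checked by exhaustive enumeration instead of an ILP solver. The sketch below is such a brute-force check, written only to make the equivalence between coloring G and clique-partitioning its complement H concrete; it is not an ILP implementation, and the function names and the toy graph are assumptions.

```python
from itertools import product

def complement_edges(n, edges):
    """Edges of the complement of a graph on nodes 0..n-1."""
    present = {frozenset(e) for e in edges}
    all_pairs = {frozenset((i, j)) for i in range(n) for j in range(i + 1, n)}
    return all_pairs - present

def min_coloring(n, edges, q=None, tsv_nodes=None):
    """Smallest proper coloring of nodes 0..n-1, found by enumeration.

    If q is given, at most q of the nodes in tsv_nodes (default: all nodes)
    may share a color, mirroring the reuse constraint.
    """
    capped = list(range(n)) if tsv_nodes is None else list(tsv_nodes)
    for num_colors in range(1, n + 1):
        for colors in product(range(num_colors), repeat=n):
            # Adjacent nodes must not share a color.
            if any(colors[i] == colors[j] for i, j in edges):
                continue
            # Reuse limit: at most q capped nodes per color.
            if q is not None and any(
                sum(1 for t in capped if colors[t] == k) > q
                for k in range(num_colors)
            ):
                continue
            return num_colors, colors
    return n, tuple(range(n))

# Toy compatibility graph H; its complement G is the graph to be colored.
h_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
g_edges = complement_edges(4, h_edges)
print(min_coloring(4, g_edges))       # (2, (0, 0, 0, 1))
print(min_coloring(4, g_edges, q=2))  # (2, (0, 0, 1, 1))
```

Each color class returned here, such as {0, 1, 2} and {3} in the unconstrained case, is a clique in H. The enumeration grows exponentially with the number of nodes, so it is useful only as a sanity check on tiny instances.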
We use the results obtained by the above ILP model to evaluate the results obtained using the proposed heuristics for smaller benchmarks.
VI. RESULTS
In this section, we first discuss the benchmark circuits that we use for our experiments. Next, we compare the results obtained using the heuristic algorithms discussed in Section V with those obtained using the method described in [16]. The results are compared to assess the following.
1) The number of existing scan flops that are reused, and the number of (additional) WC that are needed.
2) The impact of the Df T solution on stuck-at and transition fault coverage, and the number of test patterns.
3) The area overhead.

For our experiments, we use three benchmark circuits, namely des_perf, cf_rca_16, and cf_fft_256_8, from the OpenCore benchmark suite [19], and nine benchmark circuits (b01, b02, b03, b04, b06, b15, b20, b21, and b22) from the ITC'99 benchmarks [20]. For the OpenCore designs, the partitioning was done for performance optimization. For the ITC'99 benchmark circuits, hMetis [27], a hypergraph-partitioning tool, was used. The designs from OpenCore are each partitioned into four dies, and those from ITC'99 are partitioned into two dies. Tables V and VI show design-related details about the two sets of benchmark circuits, respectively.

We further find that, compared to MCAT, the OCAT algorithm is more effective in reducing the number of reused flops (N_F). This can be explained by observing that OCAT creates one clique at a time, and it creates each clique in a greedy manner. However, MCAT performs better than OCAT in minimizing the objective function, i.e., N_WC, especially when the number of scan flops that are available for reuse is limited. While MCAT leads to fewer WC, it tends to require more flop reuse. Hence the choice of the heuristic can be based on the relative priorities of these two measures. Moreover, for solving the clique-partitioning problem, the computational complexity of MCAT is lower than that of OCAT by a factor that is linear in the number of vertices of the given graph [26].
The CPU time for each experiment (each row of Table VII) on a 16-core Intel Xeon CPU, running at 2.5 GHz with 16 GB RAM, was in the range of minutes for smaller benchmarks for each method. As the size of the circuit increases, MCAT starts performing better than OCAT in terms of execution time. This is because OCAT also stores information about the common neighbors of every two nodes, and has to update this information after each edge contraction. With the largest benchmark, cf_fft_256_8, MCAT and the approach discussed in [16] ran within a few minutes, while OCAT ran for 30 min.
In order to study the impact of timing constraints on flop selection, we carried out a series of experiments as described below. Since we do not have layouts for the netlists for the 3-D stack designs, we were not able to extract accurate timing information. Instead, we set the fraction of flops, referred to as α, that cannot be reused (due to timing constraints) to a range of values, e.g., α = 0.1, 0.2, and 0.3. For each choice of α, we randomly select αN flops, where N is the total number of flops in the die, and mark these flops as not reusable. The random selection is repeated 20 times to obtain some level of statistical significance. Table VIII shows the results (mean and standard deviation) on the pair of flops that are reused and the number of WC required for testability in these cases. These results are compared to those obtained with no constraints on specific reuse, but with an upper limit of 1 − α set on the fraction of flops that can be reused (under the MCAT column). We also report comparative data obtained using the ad hoc heuristic from [16]. We note that more WC are required when specific flops are deemed nonreusable, since flops can no longer be freely reused. In practice, when layout data and accurate timing information are available, our method can be used by simply deleting the nodes in the graph that correspond to flops that cannot be reused.

TABLE IX: STUCK-AT FAULT COVERAGE AND TEST PATTERN COUNT FOR VARIOUS METHODS FOR THE BENCHMARK CIRCUITS

TABLE X: TRANSITION FAULT COVERAGE AND TEST PATTERN COUNT FOR VARIOUS METHODS FOR THE ITC'99 BENCHMARK CIRCUITS
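The random-selection experiment with the fraction α can be outlined as below. This is a hedged sketch of the experimental loop only: `evaluate` is a hypothetical callback standing in for whichever optimization method is being measured, and `toy_wc_count` is a made-up stand-in so that the example runs on its own.

```python
import random
import statistics

def timing_constraint_trials(flops, alpha, evaluate, trials=20, seed=0):
    """Mark a random alpha-fraction of flops as not reusable, `trials` times.

    Returns the mean and standard deviation of evaluate(reusable_flops).
    """
    rng = random.Random(seed)
    n_blocked = round(alpha * len(flops))
    outcomes = []
    for _ in range(trials):
        blocked = set(rng.sample(flops, n_blocked))
        reusable = [f for f in flops if f not in blocked]
        outcomes.append(evaluate(reusable))
    return statistics.mean(outcomes), statistics.stdev(outcomes)

# Toy stand-in for the optimizer: each TSV can reuse only one preferred flop,
# and needs its own wrapper cell when that flop is blocked.
preferred = {"t0": "f0", "t1": "f1", "t2": "f2", "t3": "f3"}
flops = [f"f{i}" for i in range(10)]

def toy_wc_count(reusable):
    return sum(1 for f in preferred.values() if f not in reusable)

for alpha in (0.1, 0.2, 0.3):
    mean_wc, std_wc = timing_constraint_trials(flops, alpha, toy_wc_count)
    print(alpha, mean_wc, std_wc)
```

Fixing the seed makes a run repeatable; a real study would replace `toy_wc_count` with a call to the heuristic applied to the graph with the blocked flop nodes deleted.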
B. ATPG Results
To demonstrate that the overall fault coverage is not adversely affected by the proposed Df T method, we compare fault coverage for different methods. After the assignment of flops to TSVs was made, the netlist was edited to insert multiplexers and XOR gates at the locations specified by the results from each method. A commercial ATPG tool was run to measure the fault coverage and the number of patterns needed to achieve that coverage. The results obtained from each method are compared with the case in which WC are used for each TSV (see Table IX). As expected, irrespective of the approach used to optimize the WC count, the fault coverage does not deviate from its value when WC were inserted at each TSV. The number of test patterns is only moderately higher in most of the cases, and even lower in some cases. While it is not clear which method is more effective in minimizing the pattern count, the number of bits per pattern for MCAT and OCAT is much lower than for [16] because the number of WC inserted using these methods is significantly lower. As shown in Table X, similar results were observed for transition fault coverage.
C. Comparison With ILP Model
Table XI shows the optimum number of additional WC (N_WC) needed for each of the small circuits taken from the set of ITC'99 benchmarks. The table reports the cumulative number of additional WC needed for both the dies. The variable q was fixed to a value of 4 and p was fixed at 0.7·N_TSV for all the experiments. The table also reports the number of flops that need to be reused. While the MCAT procedure yielded an optimal solution (in terms of N_WC) in all the cases, the OCAT procedure deviated from the optimum value for the b03 benchmark.
These results indicate that the heuristic methods are effective and tend to provide optimal or near-optimal results.
D. Layout and Timing-Aware Assignment of Flops to TSVs
We used Cadence Encounter to perform place-and-route and clock-tree synthesis for all dies separately using the 45 nm NanGate standard cell library. In the floor plan, we distributed the TSVs evenly across a die, forming a grid, in order to keep the TSV density (number of TSVs per unit area) uniform across the die area. The TSVs have a keep-out zone with a width of four times that of the minimum-sized inverter, and the height of a standard cell.
The distance between a TSV and a flop is estimated by the Euclidean distance between the coordinates of the TSV and the D-pin of the flop. Similarly, the distance between two TSVs is calculated using the Euclidean metric. If this distance exceeds a threshold value (l_th), then the corresponding edge, if it exists, is dropped from the graph. We also filtered out TSVs and flops from the graph model based on the discussion in Section V-D. Synopsys PrimeTime was used for creating a sorted array of flops based on the timing delays of paths to and from the flops. We ran experiments on the benchmark circuits by varying l_th, and measured its effect on the values of N_F, N_WC, stuck-at fault coverage, and pattern count. The results are shown in Table XII. Since MCAT is computationally less expensive and provides results as good as the OCAT approach, we used the former for obtaining our results. As expected, on increasing l_th, the number of existing flops that were reused increased and the number of additional WC required decreased. There is no significant change in the pattern count or the fault coverage.
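The layout-aware pruning of the graph can be sketched as below. The sketch assumes that an edge is kept only when its two endpoints lie within l_th of each other, which is the reading consistent with more flop reuse being observed at larger l_th; the function name, the coordinate dictionary, and the toy coordinates are hypothetical.

```python
import math

def prune_by_distance(edges, coords, l_th):
    """Keep only edges whose endpoints are at most l_th apart (Euclidean)."""
    kept = []
    for u, v in edges:
        # coords holds (x, y) positions, e.g. a TSV location or a flop D-pin.
        if math.dist(coords[u], coords[v]) <= l_th:
            kept.append((u, v))
    return kept

# Toy placement in micrometers (made-up values).
coords = {"t1": (0.0, 0.0), "f1": (30.0, 40.0), "t2": (300.0, 400.0)}
edges = [("t1", "f1"), ("t1", "t2")]
print(prune_by_distance(edges, coords, l_th=100.0))  # [('t1', 'f1')]
```

Raising `l_th` retains more edges, which gives the clique-partitioning step more opportunities for reuse at the cost of longer wires between a TSV and the cell it shares.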
In order to verify that the modified netlists do not suffer from performance degradation, we compared the slack time of the critical paths of the modified netlists with that of the corresponding netlists having one WC dedicated to each TSV. For filtering flops and TSVs that are either sources or sinks to the critical paths, we used the technique described in Section V-D. When no flops or TSVs were filtered, we saw performance degradation along critical paths because of the addition of multiplexers on those paths. Moreover, we saw that the worst path in our approach is always an "XOR path" added in the designs to observe the faults on FI cones of multiple outbound TSVs (see Fig. 12). However, these paths cannot be sensitized in functional mode; they are activated only during test mode. Moreover, only single stuck-at faults are detected along such XOR paths, and the clock frequency needed for test is much lower than the functional frequency. Therefore, transition faults will not be targeted along these paths, and they can be discarded from consideration as the "worst-case" longest paths. After removing XOR paths and further removing the top 5% of flops and TSVs that are potential sources or sinks to critical paths, we did not find performance degradation in the designs.

To evaluate the benefits provided by our Df T technique to support die-level modular testing, we compared the test time obtained with the modular approach to that obtained using a nonmodular approach, i.e., ATPG on the flattened netlist. We generated a netlist for each die of a design without adding the Df T structures needed for post-bond testing; however, the optimization necessary for providing prebond testability was carried out. A top-level netlist was generated for each design that instantiates the individual dies, and ATPG results were obtained on the new netlists. The results are compared in Table XIII to the ATPG results obtained when we supported modular testing.
In the latter case, we can retarget the patterns generated during prebond testing; therefore, the test time for post-bond testing is obtained by adding the test times of the individual dies. Since interconnect testing only marginally affects the overall test time, it is ignored in this example. The value of l_th was set to 100 μm and q = 4 was used in each case. We observed significant savings in test time with the modular approach. Note that in practice, some amount of ATPG on the flattened design might be desirable to test the interconnects between dies by sending signals across die boundaries.

Table XIV compares the area overhead due to the reuse-based method with that from the case when one WC is dedicated to each TSV. We see considerable savings in overhead because of the reuse-based method. The gain is expected to be even more significant for larger circuits with tens of thousands of TSVs and hundreds of thousands of flops.
VII. CONCLUSION
We have studied the problem of minimizing the WC count in 3-D-stacked ICs using a formal approach based on graph theory. We have proven that this problem is NP-hard by constructing a polynomial-time reduction from the clique-partitioning problem, which is known to be NP-complete. Since the clique-partitioning problem is also equivalent to the graph-coloring problem, this paper paves the way for the adoption of heuristic algorithms for graph coloring, instead of less effective ad hoc techniques. We have implemented two such heuristic algorithms for the WC minimization problem. We have further shown how timing and layout information can be incorporated in the graph model to address the problem of increased capacitive load and delay on critical paths due to multiple flop reuse. An ILP model has also been presented to evaluate the quality of results obtained from the heuristic methods. Experimental results for benchmark circuits show that the proposed methods can reduce the number of WC significantly without affecting fault coverage or pattern count.
