Interconnect Testing and Test-Path Scheduling for
Interposer-Based 2.5-D ICs
I. INTRODUCTION

INCREASING wire delay and higher interconnect power consumption are serious problems faced today by the semiconductor industry. A promising solution to these problems lies in the use of through-silicon vias (TSVs) for 3-D chip stacking, also referred to as 3-D integration [1]. TSVs are short metal pillars that pass through the silicon substrate and connect the metal layers on the front with another die or package on the back. Despite the numerous benefits offered by 3-D integration, such as high bandwidth and low power, commercial exploitation of 3-D integrated circuits (ICs) will only be feasible after pressing concerns about heat dissipation, yield, and test cost are adequately addressed. At the present time, interposer-based 2.5-D ICs have been identified as a precursor to 3-D integration [2].
In 2.5-D integration, dies are mounted on a common silicon interposer through micro-bumps. A cross-sectional view of a typical 2.5-D IC with three dies stacked on the interposer is shown in Fig. 1 [3], [4]. The interposer includes two types of interconnects: 1) horizontal interconnects, composed of micro-bumps and a structure of multiple metal layers at the top of the interposer that connects the various dies and is referred to as the redistribution layer (RDL); and 2) vertical interconnects, composed of micro-bumps, a cluster of TSVs, and C4 bumps, which connect the dies to the package. As 2.5-D integration emerges as a mainstream technology, its test challenges must be addressed to ensure effective defect screening. In particular, the testing of interconnects is especially important to ensure that high-yielding dies do not have to be discarded because of defects elsewhere in the 2.5-D IC. Defects in a low-cost interposer or in the micro-bumps can otherwise lead to the loss of defect-free dies that are considerably more expensive.
Defects in interposer interconnects can arise during interposer fabrication, as well as during die bonding and assembly. Typical defects include hard shorts and opens as well as resistive shorts and resistive opens that lead to increased interconnect delay [5] . Moreover, even though the interconnect is often fabricated using a mature process, its actual electrical characteristics can still deviate from its expected behavior due to process variation [5] , leading to small-delay defects. Efficient defect screening must therefore be performed as part of the production process.
The IEEE 1149.1 Standard test-access port (TAP) and boundary-scan architecture [6] are obvious candidates for the testing of interposer interconnects. Boundary scan can connect the I/O pins of the dies on the interposer in series using special cells and can be controlled using a finite-state machine (FSM). It uses five pins for external connections, which standardizes the interface so that it is identical for all devices. However, the use of the boundary-scan architecture alone is not sufficient for detecting interposer interconnect defects in 2.5-D ICs. With the standard TAP controller, since the Capture_DR and Update_DR states are separated by more than one clock cycle, small-delay defects cannot be detected through at-speed testing.
In this paper, we present an efficient interconnect-test solution that targets TSVs, RDL wires, and micro-bumps for shorts, opens, and delay faults. The proposed test technique is fully compatible with the IEEE 1149.1 Standard. An enhancement is made to the standard TAP controller so that the proposed test architecture can be used with at-speed interconnect testing. A new boundary-scan structure is proposed, and scan paths are grouped and divided into scan-in paths and scan-out paths. With this technique, the test patterns shifted in and the test responses shifted out can be carried in parallel and independently from one another. In order to reduce the test cost, we present a test-path design and scheduling technique that minimizes a composite cost function based on test time and the design-for-test (DfT) overhead in terms of additional TSVs and micro-bumps needed for test access. The locations of the dies on the interposer are taken into consideration in order to determine the order of dies in a single test path. We present simulation results to demonstrate the effectiveness of fault detection, and synthesis results to evaluate the hardware cost per die over that required by the IEEE 1149.1 Standard. We also present test-path design and test-scheduling results to highlight the effectiveness of the optimization technique.
The rest of this paper is organized as follows. Section II reviews related prior work. Section III presents the proposed test architecture that facilitates the detection of opens, shorts, and small-delay defects. In Section IV, we describe our optimization technique to design the scan paths [i.e., the group boundary-scan cells (BSCs) and interposer interconnects] and to schedule testing using different scan paths so that the test cost is minimized. Section V presents experimental results for fault detection with the proposed test instructions, area costs based on the synthesis of the BSCs, and test scheduling based on the proposed optimization technique. Finally, Section VI concludes this paper.
II. RELATED PRIOR WORK
The concept of 2.5-D ICs can be traced back to the early 1990s. At that time, multichip modules (MCMs) emerged, wherein different dies were assembled in a single ceramic package substrate [7]. The system-in-package (SiP) technology, which became popular starting around 2000, improved upon MCMs in terms of integration densities [8]. The test problems mentioned in this paper were also of concern then and were discussed in [9] and [10]. The main difference between an MCM (or SiP) and a 2.5-D IC lies in a passive silicon interposer that is placed between the package and the dies. The interposer can provide more than 10 000 die-to-die interconnects and connect to over 1200 I/O pins. With such high-density interconnects and I/O pins, testing of 2.5-D ICs is considerably more challenging than the testing of MCMs and SiPs [11].
Due to probe-technology limitations, test access is impaired and the horizontal interconnects cannot be accessed from the front (active) side via probing after dies are mounted on the micro-bumps [12] . Thus, only vertical interconnect can be accessed by standard probe needles via C4 bumps on the back-side of the interposer. In view of the importance of the test-access problem, research on testing of 2.5-D ICs has surged in the past few years, and a number of test and DfT solutions have been proposed [13] - [16] .
Early work on a post-bond test and DfT strategy for 2.5-D ICs containing a passive silicon interposer base was reported in [13]. Each die on the interposer is equipped with a 3-D-enhanced die wrapper based on the IEEE 1149.1 and IEEE 1500 Standards [17] to support interconnect testing. However, this method is aimed at die testing and does not effectively target the various interposer faults. Chi et al. [18] presented an approach for detecting and diagnosing interconnect opens and shorts. Redundant tri-state buffers, TSVs, and micro-bumps were introduced in their architecture to provide alternative interconnects when a functional interconnect is faulty. Although the architecture supports diagnosis and repair of the interposer, the added redundancy can lead to an increase in fabrication cost. In addition, the requirement of a larger number of redundant TSVs and micro-bumps may be prohibitive.
Huang and Huang [19] proposed a delay testing and characterization method for interposer wires. A ring-oscillator structure is modified based on the original design proposed for TSV-based 3-D stacked ICs [20]. Two oscillation periods are measured for two different variable output-thresholds applied to the ring oscillator. Since the difference between these two oscillation periods (ΔT) is linearly correlated to the delay of the interposer wire, it is claimed in [20] that a small-delay defect in an interposer wire can be detected when its ΔT is an outlier among all the fault-free samples. However, this approach relies on a fixed clock frequency generated by the ring oscillator, which can lead to test escapes or over-testing when different functional frequencies (corresponding to the different operating frequencies of the dies on the interposer) are used for response capture. Therefore, a major drawback of this approach is its inflexibility and potential ineffectiveness for at-speed testing in realistic scenarios.
Huang et al. [21] presented a general at-speed test method for die-to-die interconnects and demonstrated its application to the testing of interposer wires in a 2.5-D IC. The main idea of this approach is to send a pulse through an interconnect and use this signal as a clock in the capture flip-flops. If the interconnect delay exceeds a certain limit, the flip-flop fails to toggle, and the error can be observed by shifting out the captured data. However, the use of an extra driver or a custom driver may not be practical because drivers are part of I/O cells that are carefully designed for the functional mode. The modification of I/O cells requires additional design effort and introduces overhead.
A case study on the testing of interposer-based 2.5-D ICs was presented in [12]. The test and debug strategies are described for an industry production setting. For interconnect testing, the concept of "pretty-good-die" (PGD) is introduced. In the PGD method, additional dummy metal is used to connect separated interconnects so that test loops are formed. Probe needles can be used to shift test patterns through such a test loop. However, the specific micro-bump/TSV structure, wherein a set of eight TSVs and micro-bumps are connected together for testing, does not correspond to actual interconnect structures in 2.5-D ICs. Moreover, while dummy metal interconnect can be used for interposer testing, it is desirable to remove it before the 2.5-D IC is packaged. There is no discussion in [12] of the technology used to remove it; any imperfections can lead to defects.
More recently, another method for interconnect testing was presented in [22] . This test architecture is based on extensions to IEEE 1149.1. Only slight enhancements are made to the standard boundary-scan architecture: scan paths are grouped and divided into scan-in paths and scan-out paths; scan-in paths and scan-out paths are equipped with different BSCs. Opens, shorts, and small-delay defects, as well as defects and imperfections in the micro-bumps can be effectively detected. However, the BSCs used in [22] are not compatible with the IEEE 1149.1 Standard. Additional test pins are required to implement the proposed method, which may increase test cost. In addition, this method lacks an automatic test flow, such as that embodied by the boundary-scan description language (BSDL) file in IEEE 1149.1.
III. PROPOSED TEST ARCHITECTURE
A major challenge in the testing of interposer interconnects is that only the bottom side of the interposer can be accessed with standard probe needles. In order to address this problem, a test loop that begins at one C4 bump and ends at another must be formed for each pair of C4 bumps so that all the interconnects of 2.5-D ICs can be tested. Test loops can be implemented via scan-chain structures.
A. Boundary-Scan Structure
During interconnect testing, a scan chain is used to load the stimuli at one end of an interposer wire and to capture the response at the other end. Two types of BSCs are needed for interconnect testing: 1) launch cells and 2) capture cells. Because the two types of cells are functionally independent and serve different purposes during testing, their structures also differ. Thus, in our design, the boundary-scan chain is divided into two separate chains, namely the scan-in and scan-out chains. In the scan-in chain, all of the launch cells are grouped and connected together; similarly, all of the capture cells are grouped and connected together in the scan-out chain. In this way, the scan-in of the test stimuli and the scan-out of the test responses can be conducted in parallel, which reduces the interconnect testing time.
The details of the proposed test architecture are shown in Fig. 2 . The boundary-scan interface inside the die is used to control the proposed test architecture and is composed of four functional elements: 1) a TAP controller; 2) an instruction register; 3) five TAPs (TCK, TMS, TRST, TDI, and TDO); and 4) a group of data registers. The TAP controller is a synchronous FSM, and it coordinates two important test operations: 1) the DR cycle and 2) the IR cycle. The DR cycle is used to load the test signals to the selected data registers, and the IR cycle is used to load instructions to the instruction register.
In order to shift in test patterns and shift out test responses in parallel, both the scan-in chain and the scan-out chain must be enabled simultaneously. Therefore, two additional ports, TDI_new and SO_new, are added to the IEEE standard boundary-scan chain. TDI_new connects to the scan-out chain of the previous die on the interposer, and SO_new connects to the scan-out chain of the next die on the interposer. Similarly, TDI and TDO connect to the scan-in chains of the previous and next dies on the interposer, respectively. In Fig. 2, the green line represents the scan-in chain and the red line represents the scan-out chain. These chains are independent and can transfer signals at the same time. Two multiplexers, M1 and M2, are added to the IEEE standard boundary-scan chain and are used to switch between test modes. When SIO_select is 1 and BSC_select is 0, the proposed boundary-scan structure is enabled and the scan-in of test patterns and the scan-out of test responses can be carried out in parallel. When SIO_select is 0 and BSC_select is 1, the standard boundary-scan structure is enabled, which is used for testing of the dies on the interposer.
A diagram of the connections between the boundary-scan chains of different dies is shown in Fig. 3. There are three dies stacked on the interposer, and each die is equipped with the proposed test architecture. All three test architectures are controlled by the global ports TCK, TMS, TRST, TDI, and TDO. Since Die 1 is the first die on the interposer and does not need to receive test responses from a previous scan-out chain, both TDI and TDI_new of Die 1 are connected to the global TDI port. Similarly, since Die 3 is the last die on the interposer and does not need to send test patterns to a next scan-in chain, a multiplexer M3 is used to accept signals from either the TDO or SO_new port of Die 3. The output of M3 is connected to the global TDO port.
B. Modified TAP Controller
Interconnect testing for shorts and opens can be conducted using the boundary-scan architecture. However, with the TAP controller defined in the IEEE 1149.1 Standard, the test pattern is applied in the Update_DR state, whereas the response is captured in the Capture_DR state. As a result, the launch and capture procedures are separated by more than one clock cycle, which prohibits at-speed testing.
In order to detect timing-related faulty behaviors, two private instructions need to be added in addition to the public instructions (e.g., BYPASS, IDCODE, and EXTEST) in the IEEE 1149.1 Standard. We refer to these instructions as OPENTEST and DELAYTEST. The OPENTEST instruction is used to test opens and shorts, and the DELAYTEST instruction is used to detect small-delay defects. When the instruction is either OPENTEST or DELAYTEST, the proposed boundary scan chain will be enabled between TDI and TDO. However, since the FSM of the standard TAP controller prohibits at-speed testing, additional states must be added to the original FSM. The proposed FSM is shown in Fig. 4 . Three new states are added: 1) Idle_DR; 2) Prepare_DR; and 3) Update_Capture_DR.
The Idle_DR state is a temporary controller state that replaces Capture_DR during at-speed testing. In the Idle_DR state, when a rising edge is applied to TCK, the controller enters the Shift_DR state if TMS is held at 0. The Prepare_DR state is a temporary controller state in which several control signals are set up so that the boundary-scan chains are ready to implement launch and at-speed capture in the following state. If TMS is asserted and a rising edge is applied to TCK in the Prepare_DR state, the controller enters the Update_Capture_DR state. If TMS is held low and a rising edge is applied to TCK, the controller enters the Shift_DR state. The Update_Capture_DR state is used to launch patterns from the launch cells and capture the responses in the capture cells. If a rising edge is applied to TCK in this state, the controller enters the Select_DR_Scan state if TMS is held at 1, or the Run_Test/Idle state if TMS is held at 0.
During at-speed testing, the launch and capture procedures cannot be separated into different controller states. As a result, the controller does not enter the Capture_DR state; instead, it makes a transition to the Idle_DR state. Note that in the DR cycle under the EXTEST instruction, the controller goes through Select_DR_Scan → Capture_DR → Shift_DR → Exit1_DR → Pause_DR → Exit2_DR → Update_DR. In the DR cycle under the DELAYTEST instruction, the controller goes through Select_DR_Scan → Idle_DR → Shift_DR → Exit1_DR → Pause_DR → Prepare_DR → Update_Capture_DR. Therefore, the number of states through which the controller proceeds in one DR cycle under the DELAYTEST instruction is the same as that under the EXTEST instruction. As a result, the original BSDL can be reused; only the boundary register description parts need to be updated. After test patterns are shifted in or test responses are shifted out in the Shift_DR state, the controller goes through several temporary controller states and makes a transition to the Prepare_DR state to prepare to enter the Update_Capture_DR state. After the controller moves to the Update_Capture_DR state, both the launch cells and the capture cells are activated. Test patterns are launched out of the parallel outputs of the launch cells on the falling edge of the UpdateDR_out signal (introduced later). After this procedure, test patterns are applied to interconnects and then captured by the capture cells on the rising edge of ClockDR.
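The two DR cycles compared above can be traced with a small state-table sketch. This is a minimal illustration, not the controller of Fig. 4: only the DR column is modeled, the function names `next_state` and `walk` are hypothetical, and every branch the text does not spell out (for example, TMS = 1 in Idle_DR, or the use of a `delay_enable` flag to choose between Capture_DR and Idle_DR) is an assumption that mirrors the standard IEEE 1149.1 transitions.

```python
def next_state(state, tms, delay_enable):
    """Return the next DR-column state for one rising TCK edge.

    Each entry is (target if TMS == 0, target if TMS == 1).
    Branches marked 'assumed' are not stated in the text.
    """
    table = {
        "Run_Test/Idle": ("Run_Test/Idle", "Select_DR_Scan"),
        # At-speed mode enters Idle_DR instead of Capture_DR.
        "Select_DR_Scan": ("Idle_DR" if delay_enable else "Capture_DR",
                           "Select_IR_Scan"),
        "Capture_DR": ("Shift_DR", "Exit1_DR"),
        "Idle_DR": ("Shift_DR", "Exit1_DR"),          # TMS=1 branch assumed
        "Shift_DR": ("Shift_DR", "Exit1_DR"),
        "Exit1_DR": ("Pause_DR", "Update_DR"),        # unchanged, assumed
        # At-speed mode enters Prepare_DR where EXTEST enters Exit2_DR.
        "Pause_DR": ("Pause_DR",
                     "Prepare_DR" if delay_enable else "Exit2_DR"),
        "Exit2_DR": ("Shift_DR", "Update_DR"),
        "Prepare_DR": ("Shift_DR", "Update_Capture_DR"),
        "Update_DR": ("Run_Test/Idle", "Select_DR_Scan"),
        "Update_Capture_DR": ("Run_Test/Idle", "Select_DR_Scan"),
    }
    return table[state][tms]


def walk(start, tms_bits, delay_enable):
    """Apply a TMS sequence and return the list of visited states."""
    states = [start]
    for tms in tms_bits:
        states.append(next_state(states[-1], tms, delay_enable))
    return states


# The same TMS sequence drives both DR cycles.
tms_sequence = [0, 0, 1, 0, 1, 1]
extest_path = walk("Select_DR_Scan", tms_sequence, delay_enable=False)
delaytest_path = walk("Select_DR_Scan", tms_sequence, delay_enable=True)
print(" -> ".join(extest_path))
print(" -> ".join(delaytest_path))
```

Because one TMS sequence produces both paths, the sketch makes the equal-length property of the two DR cycles easy to check.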
C. BSCs and Circuit Block
The structure of the standard BSC ("BSC_1" cell in the IEEE 1149.1 Standard) is shown in Fig. 5(a) . The standard BSC has two flip-flops, the capture scan flip-flop and the update hold flip-flop, that both operate during the DR cycle. The capture scan flip-flop is controlled by the ShiftDR signal and the update hold flip-flop is controlled by the UpdateDR signal. During at-speed testing, the test patterns on different dies need to be launched simultaneously and the test responses need to be captured at speed. However, the standard BSC is not designed to implement these operations during a single clock cycle. In order to allow at-speed delay testing, the proposed architecture features different structures for the launch cells and the capture cells than those in the standard BSC.
The novel structures for the launch cells and capture cells are shown in Fig. 5(b) and (c), respectively. In the launch cell, since the test patterns on different dies must be launched simultaneously, a flip-flop that triggers on the negative edge is added. Thus, even though the UpdateDR signals on different dies may have skew, they will be synchronized on the falling edge of the global TCK signal and the UpdateDR_out signal will launch the test patterns on the different dies simultaneously. Although the TCK signal can itself have skew, this skew is typically small enough to be negligible: it is as small as 10 ps in a balanced TCK system, based on data provided by our industry collaborator and also reported in [23].
In the capture cell, an extra delay block and multiplexer are added. The delay block takes UpdateDR_out as input and generates the time delayed signal ClockDR_new, which serves as the at-speed capture signal for the capture cells. Since the UpdateDR_out signal initiates on the falling edge and the ClockDR_new signal initiates on the rising edge, the magnitude of the delay is the width of the UpdateDR_out signal plus a functional clock period. The multiplexer determines whether the original ClockDR signal or the newly generated ClockDR_new signal will be used as the clock signal for the capture cell. When ClockDR_select is 0, the multiplexer outputs ClockDR, which is used for open/short testing. Otherwise, the multiplexer outputs ClockDR_new, which is used for at-speed testing.
The circuit block diagram is shown in Fig. 6 . In contrast to the standard TAP controller, the proposed TAP controller has four inputs: 1) TCK; 2) TMS; 3) TRST; and 4) Delay_enable. The signals TCK, TMS, and TRST are provided by the tester. The signal Delay_enable is generated from the decoder, which stores operational codes (opcodes) for all instructions, including the newly defined instructions OPENTEST and DELAYTEST. The Choose_DR signal selects the appropriate register connected between TDI and TDO based on the opcode output from the instruction register. When the opcode from the instruction register matches DELAYTEST, the decoder sets Delay_enable to 1. Otherwise, it sets Delay_enable to 0. When the opcode from the instruction register matches OPENTEST or DELAYTEST, the decoder sets BSC_select to 0. Otherwise, it sets BSC_select to 1.
The proposed TAP controller generates several outputs. The outputs ClockIR, ShiftIR, and UpdateIR are fed to the instruction register during the IR cycle. The signals ClockDR, ShiftDR, and UpdateDR are fed to the boundary-scan register, bypass register, and identification register during the DR cycle. The reset signal is used to reset all registers when the FSM is in the Test_Logic_Reset state. When the FSM is in the Shift_DR state and the controller is under the OPENTEST or DELAYTEST instruction, the SIO_select signal is set to 1 so that the proposed boundary-scan chains can connect to the scan-in and scan-out chains on the previous and next dies separately. Otherwise, the SIO_select signal is set to 0 and the standard boundary-scan chain is enabled. Note that the capture signal for the capture cell is different from the signals applied to the launch cell, bypass register, and identification register; we refer to this signal as ClockDR_out. As described in Fig. 5(c), ClockDR_out is derived from either ClockDR or ClockDR_new. The selection is implemented by ClockDR_select, which is another output of the proposed FSM. When the FSM is at the end of the Prepare_DR state, ClockDR_select is set to 1 and ClockDR_new is selected as the capture signal. Otherwise, ClockDR_select is set to 0 and ClockDR is selected as the capture signal. In addition to these output signals, the select signal determines whether the data registers or the instruction register is connected between TDI and TDO.
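The decoder and SIO_select rules described in this subsection can be summarized as a truth-table sketch. This is an illustration only: instructions are represented by their names rather than by opcodes (the opcodes are not given in the text), the function names `decode` and `sio_select` are hypothetical, and the timing of ClockDR_select is deliberately left out because it depends on the position within the Prepare_DR state.

```python
PROPOSED_INSTRUCTIONS = ("OPENTEST", "DELAYTEST")


def decode(instruction):
    """Decoder outputs for a given instruction name.

    Delay_enable is 1 only for DELAYTEST; BSC_select is 0 for the two
    new instructions and 1 for every other instruction.
    """
    return {
        "Delay_enable": 1 if instruction == "DELAYTEST" else 0,
        "BSC_select": 0 if instruction in PROPOSED_INSTRUCTIONS else 1,
    }


def sio_select(instruction, state):
    """SIO_select is 1 only in Shift_DR under OPENTEST or DELAYTEST."""
    if state == "Shift_DR" and instruction in PROPOSED_INSTRUCTIONS:
        return 1
    return 0


for name in ("EXTEST", "OPENTEST", "DELAYTEST"):
    print(name, decode(name), "SIO_select in Shift_DR:",
          sio_select(name, "Shift_DR"))
```

With these rules, the combination SIO_select = 1 and BSC_select = 0 that enables the proposed boundary-scan structure occurs only while shifting under one of the two new instructions.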
D. Test Procedures
With the proposed on-chip test architecture, probe needles can be used to shift test patterns and capture responses. The test procedures for detecting opens, shorts, and delay defects are described in this subsection.
For the horizontal interconnects (composed of micro-bumps and RDL wires), test patterns launched from the launch cells will pass through the micro-bumps, RDL wires, and other micro-bumps and are then captured by the capture cells. Therefore, the micro-bumps connected with RDL wires can be tested at the same time as the RDL wires inside the interposer.
There are two types of vertical interconnects (composed of micro-bumps and TSVs): 1) the vertical interconnects connected to the launch cells and 2) the vertical interconnects connected to the capture cells. We refer to the two types as V-launch and V-capture, respectively. When test patterns are launched, certain patterns will go through the V-launch paths and are captured by probe needles. This procedure is used for testing V-launch. In order to test V-capture, certain patterns are launched directly from the probe needles and pass through the V-capture paths to be captured by the capture cells. This procedure is used for testing V-capture.
The test procedure for opens and shorts can be summarized in terms of the following steps.
IV. TEST-PATH DESIGN AND SCHEDULING
The successful integration of up to four dies on a passive interposer has been reported for a 2.5-D IC [3], [24], and the stacking of even larger numbers of dies on interposers has also been discussed [25], [26]. A total of 12 dies on an interposer has been reported in [13]. For such designs, the length of a single test path (consecutive boundary-scan chains connected between a pair of TDI and TDO pins) can lead to extremely high test time. In order to reduce the interconnect test time, it is more efficient to divide a single long test path into multiple, shorter test paths. However, multiple test paths require additional TSVs and micro-bumps and therefore increase the test cost; hence, manufacturing and test cost must be considered in the design of test paths.
A large number of different test-path configurations are possible for any given 2.5-D IC test architecture. Fig. 7 assumes three dies on an interposer and shows all possible configurations for these three dies. For four dies on an interposer, Fig. 8 shows some possible configurations, and the total number of possible test-path configurations is 73; for 12 dies on an interposer, the number of possible test-path configurations is 12 470 162 233, calculated using the recursive equation [13], where N(k) is the number of possible configurations for k dies. Therefore, efficient optimization methods are needed to search for the optimal test-path configurations. Three optimization methods with different objectives are illustrated in this section.
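The recursive equation itself is not reproduced in this text, so the following sketch uses an assumed counting model rather than the original formula: a configuration is taken to be a partition of the k dies into test paths, with the order of dies inside each path being significant. Under that assumption, N(k) can be computed by choosing the path that contains one fixed die (i − 1 of the other k − 1 dies, arranged in i! orders) and recursing on the remaining dies. The function name `count_configurations` is hypothetical. The model reproduces both figures quoted above, which supports but does not prove that it matches the original recursion.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def count_configurations(k):
    """Number of ways to split k dies into ordered test paths.

    Assumed model: N(0) = 1 and
    N(k) = sum_{i=1..k} C(k-1, i-1) * i! * N(k-i),
    where i is the length of the path containing one fixed die.
    """
    if k == 0:
        return 1
    return sum(
        comb(k - 1, i - 1) * factorial(i) * count_configurations(k - i)
        for i in range(1, k + 1)
    )


for dies in (3, 4, 12):
    print(dies, count_configurations(dies))
```

The rapid growth of this count is what motivates an optimization model instead of exhaustive enumeration over full configurations.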
We use integer linear programming (ILP) to solve the design and scheduling problems. ILP models are computationally intractable and often infeasible for large problem instances; however, with a limited number of dies per interposer (e.g., if we limit the number of dies to 10), the problem instance is small enough to be amenable to an optimal ILP solution.
The test-path scheduling problem was introduced in [27] . The goal was to minimize the total interconnect test cost (Section IV-B). In this paper, we examine several additional facets of the optimization problem. In Section IV-A, we develop the hardware cost model for additional test-paths. In Section IV-C, we present an optimization technique to minimize either the test time or the hardware area under a constraint on the overall test cost. Section IV-D introduces a method to determine the order of dies in a single test path based on their locations on the interposer.
A. Structure of Additional Test Paths
We refer to a test path that contains one or more scan-in chains as an input test-access mechanism (in-TAM), and a test path that contains one or more scan-out chains as an output TAM (out-TAM). The in-TAM and out-TAM connection between two dies without additional test paths is illustrated in Fig. 9(a). We refer to this connection as a basic connection. In the basic connection, the scan-in chains of two dies are connected by path 1 (P1) between TDO and TDI, and the scan-out chains of two dies are connected by path 2 (P2) between SO_new and TDI_new. Note that P1 is always connected between two dies because it is the original path in the IEEE 1149.1 Standard. It is used not only to connect scan-in chains but also to connect the instruction registers, identification registers, and bypass registers between two dies.
When a single, long in-TAM or out-TAM is divided into multiple, shorter in-TAMs or out-TAMs, each in-TAM needs a vertical path to shift in test patterns from the probe needles, and each out-TAM needs a vertical path to shift out test responses from the 2.5-D IC. Therefore, additional TSVs and micro-bumps are required to form these vertical paths. Fig. 9(b) illustrates the connection between Dies 1 and 2 when the original out-TAM is broken to form two shorter out-TAMs. We refer to this connection as a scan-out connection. Since Die 1 is now the last die on a new out-TAM, an additional TSV is added connecting to the SO_new port of Die 1 so that the test responses can be shifted out of the 2.5-D IC through the TSV. Since Die 2 is now the first die on a new out-TAM and it does not need to receive test responses from the previous scan-out chain, the TDI_new port of Die 2 can be either floated or connected to the TDI port of Die 2. Fig. 9(c) shows the connection between Dies 1 and 2 when the original in-TAM is broken to form two shorter in-TAMs; we refer to this connection as a scan-in connection. The out-TAM configuration does not change, so P2 still interconnects SO_new and TDI_new. Although Die 1 is the last die on a new in-TAM, P1 still connects TDO of Die 1 and TDI of Die 2 because it is the original path in the IEEE 1149.1 Standard. Since Die 2 is the first die on a new in-TAM, a vertical path is added so that test patterns can be shifted into Die 2 from probe needles outside the 2.5-D IC. However, the added vertical path cannot share the same micro-bump on Die 2 as P1; otherwise, the test patterns shifted into Die 2 would be affected by signals passing through P1.
In summary, to fabricate each additional out-TAM, it is necessary to add one additional TSV. For each additional in-TAM, it is necessary to add one TSV and one micro-bump.
B. Minimization of Total Interconnect Test Cost
The use of a single, long test path leads to high test time; multiple short test paths lower test time but they increase hardware cost. In this subsection, we present an optimization technique that balances the test time and the area overhead to minimize the overall interconnect test cost. The problem can be defined as follows: given a 2.5-D IC with a set of m dies, let the test cost on the automatic test equipment (ATE) per unit length (corresponding to one BSC) be a, let the cost of fabricating one additional in-TAM be b_1, and let the cost of fabricating one additional out-TAM be b_2. The goal is to determine an optimal test-path design and schedule such that the total test cost C (defined below) is minimized. We minimize the overall test cost, which includes both test time and area overhead.
To develop an ILP model for this problem, we need to define a set of variables and constraints. We first define integer variables p (the number of in-TAMs) and q (the number of out-TAMs). Constraints on p and q are defined as follows:
where m is the total number of dies on a common interposer. These two constraints indicate that the number of in-TAMs and out-TAMs cannot exceed the number of dies on the interposer and are imposed by the test architecture, which has only one TDI port and one TDO port per die. Next, we define two binary variables x_ij and y_ik. The variable x_ij is equal to 1 if the scan-in chain of die i is included in in-TAM j and 0 otherwise. Similarly, y_ik is equal to 1 if the scan-out chain of die i is included in out-TAM k and 0 otherwise. Constraints on the variables x_ij and y_ik can be defined as follows:
The first constraint indicates that each die can only be included in one in-TAM and one out-TAM. The second constraint indicates that each in-TAM or out-TAM must contain at least one die on the interposer.
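The displayed constraints on p, q, x_ij, and y_ik are not reproduced in this text. A reconstruction consistent with the verbal description might read as follows; the lower bounds of 1 on p and q and the index ranges are assumptions, not taken from the original equations.

```latex
% Assumed reconstruction from the surrounding prose.
\begin{align}
  1 \le p \le m, \qquad & 1 \le q \le m \\
  \sum_{j=1}^{p} x_{ij} = 1, \qquad & \sum_{k=1}^{q} y_{ik} = 1
      && \forall\, i \in \{1,\dots,m\} \\
  \sum_{i=1}^{m} x_{ij} \ge 1 \quad \forall\, j \in \{1,\dots,p\}, \qquad
      & \sum_{i=1}^{m} y_{ik} \ge 1 \quad \forall\, k \in \{1,\dots,q\}
\end{align}
```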
We let variable L represent the longest test path among all in-TAMs and out-TAMs. The constraints on variable L are defined as follows: The first constraint is related to the maximum length of the in-TAMs, where I_i denotes the number of micro-bumps that are connected to the input ports of die i. Similarly, the second constraint is related to the maximum length of the out-TAMs, where O_i denotes the number of micro-bumps that are connected to the output ports of die i. With the variables defined above, the total test cost C for a 2.5-D IC with a set of m dies is defined as follows:
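The constraints on L and the expression for C are likewise missing from this text. One form consistent with the definitions of I_i, O_i, a, b_1, and b_2 is sketched below. Whether the hardware terms count every TAM or only those beyond the first is not recoverable from the prose; this sketch charges only the additional ones, in line with the phrase "one additional in-TAM", and that choice is an assumption.

```latex
% Assumed reconstruction; the (p-1) and (q-1) factors are a guess.
\begin{align}
  L &\ge \sum_{i=1}^{m} I_i \, x_{ij} && \forall\, j \in \{1,\dots,p\} \\
  L &\ge \sum_{i=1}^{m} O_i \, y_{ik} && \forall\, k \in \{1,\dots,q\} \\
  C &= a \cdot L + b_1 \cdot (p - 1) + b_2 \cdot (q - 1)
\end{align}
```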
The cost parameters a, b 1 , and b 2 , introduced earlier in this section, are defined as follows:
Note that c ATE is the tester usage cost per second (as considered in [28]); f is the test frequency; N is the total number of interconnects being tested; area TSV is the area of a TSV; c interposer is the cost of the interposer per unit area; area μbump is the area of a micro-bump; and c die is the die cost per unit die area. We assume that the true/complement algorithm [29] is adopted for testing. Thus, the number of test patterns for N interconnects is 2 · ⌈log_2 (N + 2)⌉. The complete ILP model is shown in Table I.
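The trade-off captured by this model can be illustrated on a toy instance. The following is a minimal sketch, not the ILP formulation itself: it replaces the solver with exhaustive enumeration, and it assumes that L is the largest sum of I i (or O i ) over any single TAM and that the cost takes the form C = a · L + b 1 · p + b 2 · q. The function names, port counts, and cost values are hypothetical.

```python
from itertools import product


def best_length(ports, n_tams):
    """Smallest possible longest-chain length when every die is assigned to
    exactly one of n_tams TAMs and no TAM is left empty."""
    best = None
    for assignment in product(range(n_tams), repeat=len(ports)):
        if len(set(assignment)) < n_tams:
            continue  # some TAM would be empty
        totals = [0] * n_tams
        for die, tam in enumerate(assignment):
            totals[tam] += ports[die]
        if best is None or max(totals) < best:
            best = max(totals)
    return best


def minimize_cost(inputs, outputs, a, b1, b2):
    """Return (cost, p, q, L) minimizing the assumed cost a*L + b1*p + b2*q."""
    m = len(inputs)
    # In-TAMs and out-TAMs are assigned independently, so the best L for a
    # given (p, q) is the larger of the two per-side optima.
    l_in = {p: best_length(inputs, p) for p in range(1, m + 1)}
    l_out = {q: best_length(outputs, q) for q in range(1, m + 1)}
    best = None
    for p in range(1, m + 1):
        for q in range(1, m + 1):
            length = max(l_in[p], l_out[q])
            cost = a * length + b1 * p + b2 * q
            if best is None or cost < best[0]:
                best = (cost, p, q, length)
    return best


# Hypothetical four-die example: I_i, O_i and the cost parameters are made up.
result = minimize_cost([4, 3, 2, 1], [2, 2, 3, 3], a=1.0, b1=1.5, b2=1.0)
print(result)
```

For this toy instance, a single path on each side gives L = 10, whereas two in-TAMs and two out-TAMs halve L at the price of two extra TAMs; adding further TAMs no longer pays for itself. Enumeration is only practical for a handful of dies, which is why an ILP solver is used for realistic instances.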
C. Optimization in Alternative Scenarios
The optimization solution described above is motivated by the observation that a single long test path provides the minimum hardware cost, while multiple short test paths lead to the minimum test-time cost. The optimization technique in Section IV-B yields the minimum overall test cost when test time and area overhead must be jointly optimized. An alternative optimization scenario can also arise in practice. It might be desirable to minimize test time (area overhead) for a given upper limit on the area overhead (test time). In this subsection, we present an optimization technique that minimizes either the test time or the hardware area, depending on which has the higher priority, while satisfying a constraint on the overall test cost.
The problem can be defined as follows: given a 2.5-D IC with a set of m dies, let the cost of test on the ATE per unit length (corresponding to one BSC) be a, let the cost of fabricating one additional in-TAM be b 1 , let the cost of fabricating one additional out-TAM be b 2 , and let the maximum overall test cost that the 2.5-D IC can support be C max . The overall test cost C, which includes both test time and area overhead as components, cannot exceed C max . The optimization goal is to select a test-path configuration that minimizes either the test time or the area overhead subject to this upper limit.
The constraints of this ILP model are the same as those for the ILP model in Section IV-B, except for the following additional constraint:
where the variable L is the length of the longest test path, p is the number of in-TAMs, and q is the number of out-TAMs. If the test time has a higher priority than the hardware area, the objective of the ILP model is to minimize the longest test path, L. If the hardware area has a higher priority than the test time, the objective of the ILP model is to minimize the total number of in-TAMs and out-TAMs, p + q. The complete ILP model is shown in Table II.
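The two alternative objectives can be sketched on a small example. This is an illustrative enumeration under assumptions, not the ILP model of Table II: the additional constraint is assumed to read a · L + b 1 · p + b 2 · q ≤ C max , L is assumed to be the largest per-TAM sum of port counts, and all names and numbers are hypothetical.

```python
from itertools import product


def min_longest(ports, n_tams):
    """Minimum, over all assignments with no empty TAM, of the longest chain."""
    best = None
    for assignment in product(range(n_tams), repeat=len(ports)):
        if len(set(assignment)) < n_tams:
            continue
        totals = [0] * n_tams
        for die, tam in enumerate(assignment):
            totals[tam] += ports[die]
        if best is None or max(totals) < best:
            best = max(totals)
    return best


def feasible_configs(inputs, outputs, a, b1, b2, c_max):
    """List (p, q, L, cost) for every (p, q) whose cheapest schedule fits c_max."""
    m = len(inputs)
    configs = []
    for p in range(1, m + 1):
        for q in range(1, m + 1):
            length = max(min_longest(inputs, p), min_longest(outputs, q))
            cost = a * length + b1 * p + b2 * q  # assumed cost form
            if cost <= c_max:
                configs.append((p, q, length, cost))
    return configs


def minimize_length(configs):
    """Test time has priority: smallest L, ties broken by cost."""
    return min(configs, key=lambda c: (c[2], c[3]))


def minimize_tams(configs):
    """Hardware area has priority: smallest p + q, ties broken by cost."""
    return min(configs, key=lambda c: (c[0] + c[1], c[3]))


# Hypothetical four-die instance with a cost ceiling of 12 units.
cfgs = feasible_configs([4, 3, 2, 1], [2, 2, 3, 3], 1.0, 1.5, 1.0, c_max=12.0)
print(minimize_length(cfgs))
print(minimize_tams(cfgs))
```

On this instance the two objectives select different configurations: prioritizing test time spends the available budget on additional TAMs, while prioritizing area keeps the TAM count low and accepts a longer path.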
D. Placement of Dies on the Test Path
The optimization techniques introduced in Sections IV-B and IV-C can determine which dies on the interposer are placed in specific in-TAMs and out-TAMs, but they do not determine the order of the dies. The die order is important because it affects the complexity of interchip routing; ordering the dies randomly can lead to longer test wire lengths. Note that the definition of "test wire length" is different from "test length." In particular, test wire length refers to the physical wire length of a test path, and test length refers to the number of BSCs in a test path. Although the test length (test time) is not affected by the test wire length, longer test wires may cause timing problems, degrade the test quality, and lead to congestion. In this subsection, we present an optimization technique that considers the location of each die to determine the order of the dies in a test path.
The problem can be defined as follows: given a 2.5-D IC with a set of m dies, let the ATE test cost per unit length (corresponding to one BSC) be a, let the cost of fabricating one additional in-TAM be b 1 , let the cost of fabricating one additional out-TAM be b 2 , let the maximum test wire length of an in-TAM be WL in , and let the maximum test wire length of an out-TAM be WL out . The physical distance between die i and die j is given as d ij . The goal is to determine an optimal test-path design and schedule such that the total test cost C is minimized while enforcing the constraint that the test wire length of any in-TAM or out-TAM cannot exceed WL in or WL out , respectively.
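The wire-length constraint in this problem statement can be illustrated for a single in-TAM. The sketch below assumes that the test wire length of a path is the sum of the distances d ij between consecutive dies, and it searches all orderings exhaustively rather than through an ILP; the distance matrix and function names are hypothetical.

```python
from itertools import permutations


def wire_length(order, dist):
    """Sum of distances between consecutive dies in the given order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def best_order(dies, dist, wl_max):
    """Return (length, order) for the shortest ordering of `dies`,
    or None if even the shortest ordering exceeds wl_max."""
    best = min((wire_length(order, dist), order) for order in permutations(dies))
    return best if best[0] <= wl_max else None


# Hypothetical normalized distances for four dies placed along a line
# at positions 0, 2, 5 and 7.
d = [
    [0, 2, 5, 7],
    [2, 0, 3, 5],
    [5, 3, 0, 2],
    [7, 5, 2, 0],
]

print(wire_length((0, 2, 1, 3), d))      # an arbitrary ordering
print(best_order([0, 1, 2, 3], d, 12))   # shortest ordering within the limit
print(best_order([0, 1, 2, 3], d, 6))    # limit too tight for this TAM
```

When no ordering of a TAM satisfies the limit, the dies must be redistributed over the TAMs, which is why the assignment and ordering decisions are coupled in a single model.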
Since scheduling the in-TAM dies is very similar to scheduling the out-TAM dies, we only present the ILP formulation for scheduling the in-TAM dies. We first define a binary variable z ih that is equal to 1 if die h is directly behind die i in the same in-TAM and 0 otherwise. Constraints on variable z ih can be defined as follows:
The first constraint in (10) indicates that no die can be directly behind itself. The second constraint in (10) states that if die h is directly behind die i, then die i cannot be directly behind die h. The two constraints in (11) ensure that each die must have exactly one neighbor die directly before it and exactly one directly after.
The total test wire length for an in-TAM is the sum of the distances between all consecutive dies in the in-TAM. Using variables x ij and z ih , a constraint on the test wire length of the in-TAM can be defined as follows:
The quantity $\sum_{h=1}^{m} d_{ih} \cdot z_{ih}$ represents the distance between die i and the die after it in the in-TAM. Equation (12) represents the total test wire length of the in-TAM j. Note that (12) includes nonlinear elements, namely the product of variable z ih and variable x ij . We linearize it by introducing a new binary variable w ihj that represents the product z ih · x ij . The linearized function for the total test wire length for each in-TAM can then be written as follows:
The constraints on variable w ihj can be defined as follows:
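For reference, a product of two binary variables is commonly linearized with three inequalities: w ≤ z, w ≤ x, and w ≥ z + x − 1. The short check below enumerates all binary combinations to confirm that these inequalities leave w = z · x as the only feasible value. It is a sketch of the standard technique; that the third inequality matches one of the paper's numbered constraints is an assumption, and the function names are hypothetical.

```python
from itertools import product


def linearization_feasible(w, z, x):
    """True if (w, z, x) satisfies the three standard linearization inequalities."""
    return w <= z and w <= x and w >= z + x - 1


def check_linearization():
    """Verify that, for every binary z and x, the only feasible w is z * x."""
    for z, x in product((0, 1), repeat=2):
        allowed = [w for w in (0, 1) if linearization_feasible(w, z, x)]
        if allowed != [z * x]:
            return False
    return True


print(check_linearization())
```

The two upper bounds force w to 0 whenever either factor is 0, and the lower bound forces w to 1 only when both factors are 1, so no nonlinear term remains in the model.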
Equations (14) through (16) arise from the definition of w ihj ; specifically, the product of two binary values must be less than or equal to either value. Equations (17) through (19) are inherited from (10) and (11), and (20) is inherited from (2). The constraints for the ILP model are shown in Table III.
TABLE III: CONSTRAINTS FOR SCHEDULING THE IN-TAM DIES
V. EXPERIMENTAL RESULTS
In this section, we present experimental results for the proposed test-architecture design and path-scheduling methods. The test architecture was specified using Verilog. It was then simulated using ModelSim and synthesized using Design Compiler. The path optimization problem was solved using the ILP solver Xpress-Mosel [30].
A. Test Architecture Simulation Results
The structure shown in Fig. 2 is employed for the functional simulation. A total of 16 launch cells and 16 capture cells are used to form the boundary-scan register. Therefore, when the proposed test architecture works as the standard boundary-scan architecture, its boundary-scan register has 32 cells. In addition, since we limit ourselves to interconnect testing, we do not consider the die logic in our evaluation of the test architecture.
The frequency of the scan test clock, TCK, is set to a typical value of 10 MHz. The process is controlled by a sequence of TMS signal values, stored in advance in the BSDL file.
It should be noted that PI is not the primary input of the logic die, but a 32-bit signal that refers to the PI ports of the launch cells and the capture cells, as shown in Fig. 5(b) and (c); it is 32 bits wide because there are 32 BSCs. The higher-order 16 bits of PI represent the PI ports of the launch cells, and the lower-order 16 bits represent the PI ports of the capture cells. PO is defined similarly in our architecture.
The public instruction EXTEST in IEEE 1149.1 is simulated first. The results are shown in Fig. 10. In the IR cycle, the EXTEST instruction code (03) is shifted into the 8-bit instruction register. The instruction reaches the decoder on the falling edge of the UpdateIR signal and is recognized by the controller. BSC_select is set to 1 so that the boundary-scan register is reconfigured as the standard boundary-scan structure. In the DR cycle, when the controller enters the Capture_DR state, test responses (f78290ef) are captured by the BSCs (launch cells and capture cells) from PO. Then, the controller switches to the Shift_DR state. In the next 32 clock cycles, test responses are shifted out from the BSCs to TDO, and new test patterns (ab84f648) are shifted into the BSCs from TDI. Finally, the test patterns are launched to the interconnects on the falling edge of the UpdateDR_out signal in the Update_DR state. The simulation therefore shows that the proposed test architecture executes the EXTEST instruction correctly, in compliance with the IEEE 1149.1 Standard.
Fig. 11 illustrates the OPENTEST instruction. The proposed controller enters the IR cycle to load the instruction code (09) and then transitions to the DR cycle. The BSC_select signal is set to 0 to enable the proposed boundary-scan structure. In the DR cycle, the controller first enters the Capture_DR state and captures test responses (e7c572cf) from PI. The lower-order 16 bits (72cf) are captured by the capture cells and the higher-order 16 bits (e7c5) are captured by the launch cells. Then, the controller moves into the Shift_DR state. The lower-order 16 bits of the responses are shifted out from the capture cells through the scan-out chain to SO_new, and the higher-order 16 bits are shifted out from the launch cells through the scan-in chain to TDO.
Meanwhile, the SIO_select signal is set to 1 in this state so that the scan-in chain accepts signals from TDI and the scan-out chain accepts signals from TDI_new. Therefore, test patterns (ab84) are shifted in from TDI through the scan-in chain to the launch cells. Test responses (1ac8) from the previous die are shifted in from TDI_new through the scan-out chain to the capture cells. After the controller moves into the Update_DR state, the test patterns are launched to the interconnects on the falling edge of the UpdateDR_out signal. The controller then moves into the IR cycle again and waits for the next instruction to take effect.
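The Shift_DR behavior described above can be mimicked with a simple behavioral model of a scan chain. The sketch below is not the Verilog implementation: it assumes that bits leave the chain most-significant bit first, that new bits enter at the opposite end, and that each 16-cell chain is shifted for 16 cycles. The function name is hypothetical; the hexadecimal values are those quoted in the text.

```python
def shift_chain(captured, incoming, width):
    """Shift a scan chain for `width` cycles.

    Captured bits leave MSB-first while incoming bits enter at the LSB end.
    Returns (final register contents, word observed at the chain output).
    """
    mask = (1 << width) - 1
    register = captured & mask
    shifted_out = 0
    for cycle in range(width):
        out_bit = (register >> (width - 1)) & 1
        in_bit = (incoming >> (width - 1 - cycle)) & 1
        shifted_out = (shifted_out << 1) | out_bit
        register = ((register << 1) | in_bit) & mask
    return register, shifted_out


# OPENTEST: two 16-bit chains shifted concurrently.
launch_cells, seen_on_tdo = shift_chain(0xE7C5, 0xAB84, 16)      # scan-in chain
capture_cells, seen_on_so_new = shift_chain(0x72CF, 0x1AC8, 16)  # scan-out chain
print(hex(launch_cells), hex(seen_on_tdo))
print(hex(capture_cells), hex(seen_on_so_new))

# EXTEST: one 32-bit boundary-scan register between TDI and TDO.
bsr, seen_on_tdo_extest = shift_chain(0xF78290EF, 0xAB84F648, 32)
print(hex(bsr), hex(seen_on_tdo_extest))
```

After a full shift, each chain holds the word that was shifted in and the chain output has seen the previously captured word, which matches the exchange of patterns and responses described for both instructions.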
The DELAYTEST instruction is illustrated in Fig. 12. The DELAYTEST instruction code (0a) is loaded into the decoder, which sets the delay_enable signal to 1 and transmits it to the controller. Rather than moving into the Capture_DR state, …
Analog simulation and the robustness analysis are discussed in [31]. The small-delay defect simulation results are shown in [31, Fig. 14].
B. Case Study
Since interconnect testing requires two or more dies on the interposer, we consider a case study that has two test architectures connected together. We assume that the operating frequencies of all the dies are 500 MHz. Both test architectures contain 16 launch cells and 16 capture cells. The launch cells of architecture 1 (A1) are connected to the capture cells of architecture 2 (A2) through the interconnects in the interposer. The capture cells of A1 and the launch cells of A2 are connected to other dies, which are not considered in this case.
In Fig. 13, two consecutive DELAYTEST instructions are presented. Since a total of 32 launch cells and 32 capture cells are included in the scan-in chain and scan-out chain, 32 clock cycles are required in the Shift_DR state. Although A1 and A2 are controlled separately by two controllers, they share the global TCK, TMS, and TRST signals. Therefore, the values of their control signals are the same and only the control signals of A1 are shown in Fig. 13. Fig. 13(a) shows the first DELAYTEST instruction. Test patterns (547b547b) are shifted into the launch cells of A1 and A2 from the TDI_1 port. In the Update_Capture_DR state, test pattern 547b is launched from A1 to A2 on the falling edge of the UpdateDR_out signal. Meanwhile, another test pattern 547b is launched from A2 to the die following A2 (not shown here). Shortly after the launch operations (2 ns, since the functional frequency is 500 MHz), the ClockDR_out signal implements at-speed capture. After the test responses are captured by A1 and A2, the controller transitions to the second DELAYTEST instruction. Note that no die loads stimulus to A1; A1 captures signals (9b7e) from the PI_1 port. Fig. 13(b) shows the second DELAYTEST instruction. The test responses captured by A2 are the same as the test patterns launched by A1. This is because we do not consider any interconnect delay during simulation of this case study. If an interconnect has a delay larger than the functional period (2 ns in this case), the test responses captured by A2 differ from the test patterns launched by A1, and a small-delay defect is detected. As with the first DELAYTEST instruction, test patterns are launched and test responses are captured at speed when the controller moves to the Update_Capture_DR state.
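The pass/fail decision in this case study amounts to comparing the launched pattern with the response captured one functional period later. The following sketch illustrates that comparison; it assumes that an interconnect whose delay exceeds the capture window shows up as a mismatching bit, and the function names and the faulty response are hypothetical.

```python
def functional_period_ns(freq_mhz):
    """Capture window in nanoseconds for a functional clock given in MHz."""
    return 1000.0 / freq_mhz


def failing_bits(launched, captured, width=16):
    """Bit positions (LSB = 0) where the captured response differs from the
    launched pattern; each one flags a suspect interconnect."""
    diff = (launched ^ captured) & ((1 << width) - 1)
    return [bit for bit in range(width) if (diff >> bit) & 1]


print(functional_period_ns(500))      # capture window for the 500 MHz dies
print(failing_bits(0x547B, 0x547B))   # defect-free case from the case study
print(failing_bits(0x547B, 0x547A))   # hypothetical response with one late bit
```

Because the comparison is bitwise, a mismatch also identifies which interconnect is suspected of exceeding the timing window, not merely that a failure occurred.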
C. Area Overhead
There are two types of area overhead in the proposed architecture: the control overhead and the BSC overhead. The control overhead refers to the controller and the decoder. The BSC overhead is due to the redesign of the standard BSC as a launch cell and a capture cell. We synthesized the proposed test architecture and the standard boundary-scan architecture using a 45-nm CMOS process [32]. Table IV shows the area overhead numbers. Compared to the standard boundary-scan architecture, the control overhead is 10.1%. The launch-cell and capture-cell overheads are 31.9% and 17.9%, respectively, and the average BSC overhead is 24.9%. The area overhead is nevertheless small relative to the total area of a die. For example, the total area overhead is only 0.6% of an Ethernet die, and it remains below 1% for large designs with a million gates. In summary, the total overhead is in a reasonable range, so the architecture can be implemented with low hardware cost.
In order to analyze the impact of the proposed DfT method on functional performance, additional experiments have been conducted. We take an Ethernet die for example. The slack 
D. Test-Path Design and Scheduling Results
In order to evaluate the effectiveness of the optimization techniques presented in Section IV, we next present results obtained using the ILP models. We considered a 2.5-D IC design crafted using the ITC'02 SoC test benchmarks [33]; these benchmarks were considered as dies on the interposer. Table V lists the number of I/O ports for each die. These numbers reflect recent and forthcoming 2.5-D IC designs. For example, an industry collaborator at a major semiconductor company reports that the number of inputs and outputs of one die in their forthcoming 2.5-D products is 3360 and 2816, respectively. Thus, our benchmarks are representative and they highlight the effectiveness of the proposed test-path scheduling method. We assume a volume of 100 000 2.5-D ICs, i.e., 100 000 chips are tested. Table V also lists the parameter values used in the evaluation of our scheduling and optimization framework, which are based on the published data from [18] and [34]. The parameter area μbump is set to 1600 μm² for a typical micro-bump pitch of 40 μm. The parameter area TSV is set to 10 000 μm² for a typical TSV pitch of 100 μm.
In the 2.5-D IC test case, ten dies are stacked on an interposer: u226 (Die 1), d281 (Die 2), d695 (Die 3), h953 (Die 4), g1023 (Die 5), f2126 (Die 6), p22810 (Die 7), p34392 (Die 8), p93791 (Die 9), and a586710 (Die 10). Fig. 14 shows the optimization results derived using the ILP model. The total cost for interconnect testing is minimized when three in-TAMs (p = 3) and three out-TAMs (q = 3) are used. Neither the test time nor the test-path length is reduced if one of the parameters p or q is increased while the other one remains constant. In contrast, the test cost increases sharply due to the additional test paths. In addition, an increase in p leads to a sharper increase in the total cost than an increase in q. This is because the fabrication of one additional in-TAM (one TSV and one micro-bump) costs more than one additional out-TAM (one TSV). If we increase p and q simultaneously, the test time is reduced but the cost due to the additional test paths increases sharply. If we decrease both p and q at the same time, the test cost increases slowly. The test time can increase dramatically in the absence of an optimal solution. In this test case, the optimal test length is 4576 and the test length without an optimal solution can be as large as 13 691. In addition, if the scan-in chain and scan-out chain structure proposed in this paper is not used for interconnect testing, the test length can be even larger and reach 26 090. Therefore, the proposed test-schedule optimization method is desirable in practice.
With advances in technology, the parameter values listed in Table V are likely to change, and in particular, we expect that unit test and fabrication costs will decrease. Suppose that the cost of test on the ATE per unit length (a) and the cost of fabricating an additional TAM for a test path (b 1 and b 2 ) are each reduced by 50%. With this new set of cost parameters, we report the optimal test paths using ILP (Table VI), as well as the best-case and worst-case test-cost figures. We observe that when a alone is reduced by 50%, the best values of p and q are both reduced to 2. When b 1 and b 2 are reduced by 50%, the best values of p and q are both increased to 4. The reduction in test cost is also more significant when a is reduced than when both b 1 and b 2 are reduced. When a, b 1 , and b 2 are all reduced by 50%, the best values of p and q remain unchanged but the best-case test cost is reduced by 50%.
For the 2.5-D IC test case, we next assume that the maximum overall test cost C max is given. The number of test paths is minimized under this constraint on C max . The results for this alternative optimization are presented in Fig. 15 when C max is varied from $70 to $160 with a step size of $10. The number of test paths is the same as the optimized result shown in Table VI when C max is $70. This is because C max is close to the minimum test cost. The number of test paths decreases when C max is relaxed (C max is increased). Since there is a tradeoff between the number of test paths and the test length, the test length increases as the number of test paths decreases. Fig. 16 shows the minimum test length when C max is varied from $46 to $55 with a step size of $1. The minimum test length decreases when C max is relaxed. The number of test paths increases as the test length decreases.
The placement results for dies on the test path are shown in Table VII 6. These numbers are hypothetical but they are representative normalized values of actual distances for a 2.5-D IC. Actual wire-length information based on placement information about the dies can be used instead of these hypothetical values. In Table VII, both WL in and WL out are set to 12. Note that the scheduling results are different from those shown in Table VI. This is because the ILP model reschedules test paths due to the constraints on WL in and WL out . However, the optimization method ensures a balance between the minimized test cost and the total wire length needed for testing. Fig. 17 shows the overall test cost when both WL in and WL out are varied from 7 to 15. When the constraints on WL in and WL out are tight, the test paths are rescheduled and the overall test cost is increased. When the constraints on WL in and WL out are relaxed, the overall test cost is reduced, eventually reaching the minimum value corresponding to the scenario when these constraints are not imposed.
VI. CONCLUSION
Interposer-based 2.5-D ICs are gaining exposure as an alternative choice for next-generation ICs as a precursor to 3-D integration. We have introduced a new architecture that enables interconnect testing of 2.5-D ICs. The proposed technique targets TSVs, RDL wires, and micro-bumps for opens, shorts, and small-delay defects. A simple extension to the standard boundary-scan structure and TAP controller makes it fully compatible with the IEEE 1149.1 Standard. We have also described a test-path design and scheduling technique to reduce the overall test cost, test time, and hardware area. The proposed scheduling technique can also determine the order of dies in a single test path. We have presented comprehensive ModelSim simulation results, synthesis results, and test-path design results to demonstrate the effectiveness of the proposed approach.
