Abstract: Interposer-based 2.5D integrated circuits (ICs) are seen today as a precursor to 3D ICs based on through-silicon vias (TSVs). All the dies and the interposer in a 2.5D IC must be adequately tested for product qualification. This work provides solutions to new challenges related to the testing of 2.5D ICs. We propose a test architecture using e-fuses for pre-bond interposer testing. We design a test architecture that is fully compatible with the IEEE 1149.1 standard and relies on an enhancement of the standard test access port (TAP) controller. We present an efficient built-in self-test (BIST) technique that targets the dies and the interposer interconnects. We next describe two efficient ExTest scheduling strategies that implement interconnect testing between tiles within a system-on-chip (SoC) die on the interposer. Finally, we present a programmable method for shift-clock stagger assignment to reduce power supply noise during SoC die testing in 2.5D ICs.
I. INTRODUCTION
Interposer-based 2.5D ICs have emerged as an attractive technology option for integrating multiple dies within a small package, and also as a precursor to 3D integration [1] . In 2.5D ICs, a passive silicon interposer is used for chip-scale assembly. Multiple active dies are not vertically stacked; rather, they are placed side-by-side on the silicon interposer. The cross section of a 2.5D IC is shown in Fig. 1 [2] , [3] . Three dies are stacked with fine-pitch micro-bumps on an interposer. The interposer includes the silicon substrate and multiple metal layers of wires. The silicon substrate contains a cluster of TSVs that provide vertical interconnections between dies and the package. Metal layers at the top of the interposer (referred to as the redistribution layer) provide the connectivity between different dies. The interconnects in the interposer's metal layers are fabricated using the same processes as the interconnects in the silicon dies. Today, high density I/O ports are available for the dies in a 2.5D IC, and a large number of die-to-die connections are available inside the interposer. Therefore, 2.5D ICs can provide enhanced system performance, reduced power consumption, and support for heterogeneous integration [4] .
High integration complexity increases the likelihood of defects during the fabrication of 2.5D ICs. In particular, since the structure of 2.5D ICs differs from that of traditional 2D ICs, new test challenges have emerged. These challenges are described below.
Pre-Bond Interposer Testing: The interposer cannot be easily tested before it is stacked with other dies, for several reasons [5]. Testing the interposer requires targeting both horizontal and vertical interconnects. If both sides of the interposer could be probed at the same time, pre-bond interposer testing could be easily accomplished. However, double-sided probing of the interposer is not feasible today due to limitations of wafer handling and probe-card design. In addition, it is difficult to probe the micro-bumps on the top side of the interposer due to their high density. Interconnect testing requires connecting the interconnects in a loop so that a logic value can be applied at one end and the resulting value can be observed at the other end. However, interconnects are separated and independent from each other at the pre-bond stage. Therefore, new and innovative solutions are needed for pre-bond testing.
Limited Ability for At-Speed Testing: Post-bond testing is difficult due to limited access to the TSVs and the multiple metal layers inside the interposer. As a result, the IEEE 1149.1 test-access port (TAP) and the associated boundary-scan architecture [6] have been used in 2.5D ICs for testing interconnects. However, the use of IEEE 1149.1 alone is not sufficient for detecting interposer interconnect defects in 2.5D ICs. Because the Update-DR and Capture-DR states of the standard TAP controller are separated by more than one clock cycle, a launched transition cannot be captured at speed, and small-delay defects therefore escape detection.
High-Density I/O Ports and Interconnects: The silicon interposer in a 2.5D IC provides more than 10,000 die-to-die interconnects, with as many as 1,200 I/O ports [7]. With such high-density interconnects and I/O ports, testing of 2.5D ICs is far more challenging than testing of traditional 2D ICs. For instance, there are 186k micro-bumps but only 25k C4 bumps in AMD Fiji [8].
Logic dies are typically equipped with full scan and boundary scan [6]. Test access for a 2.5D IC can be achieved using boundary scan. However, the high density of interconnects typically leads to a large test-data volume. If this large volume of test data has to be applied through a one-bit serial boundary-scan chain, the test will take a very long time to execute and hence become prohibitively expensive.
Reduced Number of Test Pins: Although a large number of I/O ports are available for the dies in a 2.5D IC, the majority of I/Os are connected to other dies through horizontal interconnects inside the interposer. External I/O ports that are connected to TSVs are far fewer than the package pins available for the same die in a 2D IC [3]. As a result, the number of test pins available for testing a die in a 2.5D IC is much smaller than that in a 2D package.
High Power Consumption: The dies in 2.5D ICs are typically system-on-chip (SoC) designs, which can provide increased functionality and higher performance [9]. However, the power consumption during die testing has also grown dramatically for 2.5D ICs [10]. One potential solution to reduce power consumption is to apply a single input clock to the die and derive multiple test clocks inside each block [11]. However, since all test clocks have the same activity, all the clock domains toggle simultaneously; thus, a large number of scan flip-flops are likely to toggle together, leading to peak power that can be much higher than in functional mode. The resulting increase in power-supply noise (PSN) on the power rails can slow down the circuit and cause test failures. Moreover, power rails within and around blocks are not designed for this amount of activity; excessive PSN can corrupt the state of the flip-flops.
This paper addresses the above challenges and provides practical solutions that have resulted from collaborations with industry partners.
Section II presents an efficient method to locate defects in a passive interposer before stacking. The proposed test architecture uses e-fuses that can be programmed to connect or disconnect functional paths inside the interposer. The concept of die footprint is utilized for interconnect testing, and the overall assembly and test flow is described in this section. Moreover, the concept of weighted critical area is defined and utilized to reduce test time. In order to fully determine the location of each e-fuse and the order of functional interconnects in a test path, we also present a test-path design algorithm.
Section III presents an efficient interconnect-test solution that targets TSVs, the redistribution layer, and micro-bumps for shorts, opens, and delay faults. The proposed test technique is fully compatible with the IEEE 1149.1 standard. To reduce test cost, we also present a test-path design and scheduling technique that minimizes a composite cost function based on test time and the design-for-test (DfT) overhead in terms of additional TSVs and micro-bumps needed for test access. The locations of the dies on the interposer are taken into consideration in order to determine the order of dies in a single test path.
Section IV presents an efficient built-in self-test (BIST) technique that targets the dies and the interposer interconnects in 2.5D ICs. The proposed BIST architecture can be enabled by the standard TAP controller in the IEEE 1149.1 standard. The area overhead introduced by this BIST architecture is negligible; it includes two simple BIST controllers, a linear-feedback shift register (LFSR), a multiple-input signature register (MISR), and some extensions to the boundary-scan cells in the dies on the interposer. With these extensions, all boundary-scan cells can be used for self-configuration and self-diagnosis during interconnect testing.
To reduce the overall test cost, a test scheduling and optimization technique under power constraints is described. Section V presents two efficient ExTest scheduling strategies that implement interconnect testing between tiles inside an SoC die, while satisfying the practical constraint that the number of required test pins cannot exceed the number of available pins at the chip level. In order to minimize the test time, two optimization solutions are introduced. The first solution minimizes the number of input test pins, and the second solution minimizes the number of output test pins. In addition, two subgroup configuration methods are further proposed to generate subgroups inside each test group.
Finally, Section VI presents a programmable method for shift-clock stagger assignment to reduce power supply noise during SoC die testing in 2.5D ICs. An SoC die in a 2.5D IC is typically composed of several blocks, and two neighboring blocks that share the same power rails should not be toggled at the same time during shift. Therefore, the proposed programmable method does not assign the same stagger value to neighboring blocks. A mathematical model is presented to derive optimal results for small-to-medium-sized problems. For larger designs, a heuristic algorithm is proposed and evaluated using commercial SoCs and silicon data.
II. PRE-BOND TESTING OF THE SILICON INTERPOSER
In order to minimize the yield loss resulting from the stacking of good dies on a defective interposer, it is necessary to test the interposer before die stacking. In addition, the interposer is the least expensive component in the entire stack. As a result, pre-bond testing of the silicon interposer is needed to reduce cost; in this way, we can prevent a cheap but faulty interposer from rendering the expensive 2.5D IC unusable. In this section, we present an efficient solution to locate defects in the passive interposer at the pre-bond stage [12].
E-fuses have been used extensively in a variety of applications due to their programmability [13]-[15]. They can be programmed using a voltage pulse; the schematic of an e-fuse is shown in Fig. 2. Before programming, the resistance of the e-fuse is small and it can be treated as an interconnect wire. When a high voltage is applied to point B, a high current blows open the e-fuse. The state of the e-fuse changes and it is programmed. After programming, the resistance of the e-fuse is high enough (above $10^4$ Ω) that it can be treated as an open [16]. In order to program an e-fuse in the interposer, we must have a discharge path to the substrate. However, since the interposer is a passive device, no programmable field-effect transistor (FET) can be used as a discharge path. Industry practice today is to form a substrate tap/tie using one or two additional masks such that a Schottky contact can be formed to act as a discharge path [17].
Paper TC. 3 INTERNATIONAL TEST CONFERENCE
The general test architecture to target horizontal interconnects is shown in Fig. 3. An interposer example is utilized to illustrate the proposed test architecture. E-fuses are inserted inside the interposer, and horizontal interconnects, which are not connected in functional mode, are now connected for testing. Because the micro-bumps on the top of the interposer have very high density, it is difficult to use them to probe the interposer.
Therefore, additional test paths are inserted into the interposer for probing purposes. As shown in Fig. 3 , each additional test path is composed of a probe pad and an e-fuse. Once the test paths are formed, test patterns are applied to the test paths and the horizontal interconnects can be tested.
After all horizontal interconnects are tested, the e-fuses will be programmed and their resistance will increase to a significantly large value so that they will not affect chip functionality.
The test architecture for vertical interconnects is similar. E-fuses are inserted inside the interposer, and separated vertical interconnects are then connected. Since C4 bumps at the bottom of the interposer can be probed directly with standard probe needles, no additional test paths are required to test vertical interconnects.
For vertical interconnects, there are three types of faults: break faults, void faults, and pin-hole faults. Fig. 4 shows images of the three faults. The physical mechanisms underlying these three faults are discussed in [18] . Break faults and void faults increase the TSV resistance by different amounts based on the defect dimensions. Pin-holes create a conduction path from the TSV to the substrate, resulting in a leakage fault. In order to detect these faults, a high voltage is applied to the vertical test paths. The three types of faults can be detected and identified based on the differences in response voltage.
Three types of faults can typically occur in the horizontal interconnects: open faults, inter-bridge faults, and inner-bridge faults. Open faults refer to any hard or resistive opens, regardless of the fault location. Inter-bridge faults are bridge faults that occur between two test paths. Inner-bridge faults are bridge faults that occur inside a single test path. Since each test path can be viewed as a single interconnect, a traditional interconnect test algorithm (the True/Complement Algorithm) [19] is used to test for open faults and inter-bridge faults. However, this method is not applicable to inner-bridge faults, because the faulty response is identical to the correct response. Therefore, e-fuses have to be programmed serially to test for inner-bridge faults. If the e-fuses in all test paths were programmed serially, the test time would be extremely long. To avoid this, we introduce the concept of a weighted critical area, which refers to a region of the interposer with an above-average density of interconnects. We have developed a systematic method to determine weighted critical areas. The interposer is divided into $n \times n$ subareas, where $n$ is defined as the division resolution of the interposer. The number of active micro-bumps is counted for each subarea $i$ ($num_i$); the average value ($avg$) and standard deviation ($\sigma$) are then calculated. If $num_i$ is greater than or equal to $avg + \sigma$, subarea $i$ is defined as a weighted critical area. If a test path contains functional interconnects that connect to the same weighted critical area, it is referred to as a dense test path; otherwise, it is called a non-dense test path. In non-dense test paths, no two interconnects are in the same weighted critical area and hence inner-bridge faults are less likely to occur; thus, it is not necessary to program the e-fuses in such a test path serially, and the programming time is reduced.
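The weighted-critical-area rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the representation of micro-bumps as (x, y) coordinates on a square interposer of side `size`, the use of the population standard deviation for σ, and the simplification that each interconnect maps to a single subarea are all assumptions.

```python
from statistics import mean, pstdev

def weighted_critical_areas(bumps, size, n):
    """Return the set of subarea indices (row, col) that are weighted critical areas.

    bumps: list of (x, y) micro-bump coordinates, 0 <= x, y < size (assumed layout).
    n:     division resolution; the interposer is split into n x n subareas.
    """
    counts = {(r, c): 0 for r in range(n) for c in range(n)}
    cell = size / n
    for x, y in bumps:
        r = min(int(y // cell), n - 1)
        c = min(int(x // cell), n - 1)
        counts[(r, c)] += 1
    values = list(counts.values())
    threshold = mean(values) + pstdev(values)  # avg + sigma (population sigma assumed)
    return {area for area, num in counts.items() if num >= threshold}

def is_dense(path_areas, critical):
    """A test path is dense if two of its interconnects touch the same critical area.

    path_areas: one subarea index per functional interconnect in the path
                (hypothetical simplification: one area per interconnect).
    """
    seen = set()
    for area in path_areas:
        if area in critical and area in seen:
            return True
        seen.add(area)
    return False

# Example: a 2 x 2 division in which one subarea holds most of the micro-bumps.
bumps = [(0.5, 0.5)] * 6 + [(2.5, 0.5), (0.5, 2.5), (2.5, 2.5)]
critical = weighted_critical_areas(bumps, size=4, n=2)
```

Only test paths for which `is_dense` returns true would need serial e-fuse programming, which is where the reduction in programming time comes from.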
In order to analyze the influence of weighted critical area, we have analyzed a commercial interposer from a major foundry; this interposer is referred to as X5. Five dies are stacked on the interposer: one ASIC die and four high-bandwidth-memory (HBM) dies. Interconnects run between the ASIC die and each HBM die. Based on data from the manufacturer, the time from applying a test pattern to observing the test response is 25 µs; the time to program an e-fuse ($t_e$) is 7.5 µs.
The division resolution ($n$) of X5 is varied from 2 to 10, so that X5 is divided into between 2 × 2 and 10 × 10 subareas, and pre-bond testing is carried out for each division. The experimental results are shown in Fig. 5. We use the acronym "WCA" to refer to the testing approach that considers the weighted critical area; "nWCA" refers to the testing approach that does not. The number of probe pads for WCA is larger than that for nWCA when $n$ is small because WCA introduces more constraints when test paths are formed. As $n$ increases, the constraint becomes weaker because functional interconnects that are in the same area at low resolution now fall in different areas; the number of probe pads eventually reaches the same value for WCA and nWCA. As shown in Fig. 5(b), WCA significantly reduces the testing time, which is at most 20% of that for nWCA. Therefore, it is advisable to consider the weighted critical area in pre-bond testing.
Fig. 5. Comparison of WCA and nWCA for X5 (plot legend: WCA, nWCA).

III. AT-SPEED INTERCONNECT TESTING
A major challenge in the testing of interposer interconnects is that only the bottom side of the interposer can be accessed with standard probe needles. In order to address this problem, a test loop that begins at one C4 bump and ends at another must be formed for each pair of C4 bumps so that all the interconnects of 2.5D ICs can be tested. Test loops can be implemented via boundary-scan structures. However, the standard boundary-scan structure has limited ability for at-speed interconnect testing. In this section, we present an efficient interconnect-test solution that targets TSVs, RDL wires, and micro-bumps for delay faults [20] , [21] .
The proposed test architecture is shown in Fig. 6. The boundary-scan interface inside the die is used to control the proposed test architecture and is composed of four functional elements: a TAP controller, an instruction register, five test-access ports (TCK, TMS, TRST, TDI, TDO), and a group of data registers. The TAP controller is a synchronous FSM, and it coordinates two important test operations: the DR cycle and the IR cycle. The DR cycle is used to load the test signals into the selected data registers, and the IR cycle is used to load instructions into the instruction register.
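As a reference point for the DR cycle, the standard IEEE 1149.1 TAP state machine can be modeled as a transition table indexed by the TMS value sampled on each rising TCK edge. The sketch below is illustrative only (the dictionary layout and the helper name `walk` are assumptions); it shows that the unmodified controller needs two state transitions to move from Update-DR back to Capture-DR.

```python
# Standard IEEE 1149.1 TAP controller: state -> (next state if TMS=0, next state if TMS=1)
TAP = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR", "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR", "Exit1-DR"),
    "Shift-DR":         ("Shift-DR", "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR", "Update-DR"),
    "Pause-DR":         ("Pause-DR", "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR", "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR", "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR", "Exit1-IR"),
    "Shift-IR":         ("Shift-IR", "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR", "Update-IR"),
    "Pause-IR":         ("Pause-IR", "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR", "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}

def walk(state, tms_bits):
    """Apply a sequence of TMS values and return the list of states visited."""
    visited = [state]
    for tms in tms_bits:
        state = TAP[state][tms]
        visited.append(state)
    return visited

# One DR cycle from Run-Test/Idle, spending three TCK cycles in Shift-DR.
dr_cycle = walk("Run-Test/Idle", [1, 0, 0, 0, 0, 1, 1, 0])

# Shortest route from Update-DR to the next Capture-DR: TMS = 1, then TMS = 0.
update_to_capture = walk("Update-DR", [1, 0])
```

Because Update-DR has no direct edge to Capture-DR, a value launched at update cannot be captured one clock period later, which is the limitation that motivates a modified controller for at-speed testing.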
In our design, the boundary-scan chain is divided into two separate chains, namely the scan-in and scan-out chains. In this way, the scan-in of the test stimuli and the scan-out of the test responses can be conducted in parallel, which reduces the interconnect testing time. Two additional ports, TDI_new and SO_new, are added to the IEEE standard boundary-scan chain. TDI_new connects to the scan-out chain of the previous die on the interposer, and SO_new connects to the scan-out chain of the next die on the interposer. Similarly, TDI and TDO connect to the scan-in chains of the previous and next dies on the interposer, respectively. In Fig. 6, the green line represents the scan-in chain and the red line represents the scan-out chain. These chains are independent and can transfer signals at the same time. Two multiplexers, M1 and M2, are added to the IEEE standard boundary-scan chain and are used to switch between test modes. When SIO_select is 1 and BSC_select is 0, the proposed boundary-scan structure is enabled. Otherwise, the standard boundary-scan structure is enabled.
The standard TAP controller defined in IEEE 1149.1 is modified in order to detect timing-related faulty behaviors. Two private instructions are added in addition to the public instructions (e.g., BYPASS, IDCODE, and EXTEST) in IEEE 1149.1. We refer to these instructions as OPENTEST and DELAYTEST. The OPENTEST instruction is used to test opens and shorts, and the DELAYTEST instruction is used to detect small-delay defects. When the instruction is either OPENTEST or DELAYTEST, the proposed boundary-scan chain is enabled between TDI and TDO. However, since the FSM of the standard TAP controller prohibits at-speed testing, additional states must be added to the original FSM. The proposed FSM is shown in Fig. 7. Note that the number of states through which the controller proceeds in one DR cycle under the DELAYTEST instruction is the same as that under the EXTEST instruction. As a result, the original Boundary-Scan Description Language (BSDL) file can be reused; only the boundary-register description needs to be updated.
The circuit block diagram is shown in Fig. 8. In contrast to the standard TAP controller, the proposed TAP controller has four inputs: TCK, TMS, TRST, and Delay_enable. The signals TCK, TMS, and TRST are provided by the tester. The signal Delay_enable is generated by the decoder, which stores the operational codes (opcodes) for all instructions, including the newly defined instructions OPENTEST and DELAYTEST. The Choose_DR signal selects the appropriate register to be connected between TDI and TDO based on the opcode output from the instruction register. When the opcode from the instruction register matches DELAYTEST, the decoder sets Delay_enable to 1; otherwise, it sets Delay_enable to 0. When the opcode matches OPENTEST or DELAYTEST, the decoder sets BSC_select to 0; otherwise, it sets BSC_select to 1.
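The decoder behavior described above reduces to two comparisons on the current instruction. The following sketch is only illustrative: the opcode values are placeholders (the text does not list the actual encodings), and the function name `decode` is hypothetical.

```python
# Hypothetical opcode assignments; the real encodings are design-specific.
OPCODES = {
    "EXTEST": 0b0000,
    "IDCODE": 0b0001,
    "OPENTEST": 0b1000,
    "DELAYTEST": 0b1001,
    "BYPASS": 0b1111,
}

def decode(opcode):
    """Return (Delay_enable, BSC_select) for the opcode held in the instruction register."""
    is_delay = opcode == OPCODES["DELAYTEST"]
    is_private = is_delay or opcode == OPCODES["OPENTEST"]
    delay_enable = 1 if is_delay else 0
    bsc_select = 0 if is_private else 1  # 0 corresponds to the proposed boundary-scan structure
    return delay_enable, bsc_select
```

With this mapping, only DELAYTEST activates the additional FSM states, while both private instructions switch the multiplexers away from the standard boundary-scan structure.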
IV. BUILT-IN SELF-TEST ARCHITECTURE
BIST architectures have been used in the past to test MCMs and SiPs [22]-[24]. However, more than 1,200 I/O pins and 10,000 die-to-die interconnects are available in the silicon interposer of a 2.5D IC [7]. With such high-density interconnects and I/O pins, testing of 2.5D ICs is far more challenging than testing of MCMs and SiPs. In this section, we present an efficient BIST technique that targets the dies' internal logic and the interposer interconnects in 2.5D ICs [25], [26].
The block diagram of the proposed BIST architecture is shown in Fig. 9. The main components and the relationships between them are illustrated. Some of the blocks are typical for any BIST design, such as the Pattern Generator (PG) and the Response Compactor (RC). In the proposed BIST architecture, PG is used to generate test patterns for both die testing and interconnect testing. RC is used to compress test responses and generate signatures. These components are controlled either by the die-test BIST controller (intest BIST controller) or by the interconnect BIST controller (extest BIST controller). The multiplexers in Fig. 9 (M1, M2, and M3) are controlled by the instructions in the Instruction Register, switching between "normal", "test-by-1149", "intest BIST", and "extest BIST" modes.
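To make the roles of the pattern generator and the response compactor concrete, the sketch below models a 4-bit maximal-length LFSR as PG and a MISR built on the same feedback taps as RC. The register width, the tap positions (stages 4 and 3), the seed, and the function names are assumptions chosen for illustration; they are not specified in the text.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1

def lfsr_step(state):
    """One shift of a 4-bit Fibonacci LFSR with feedback from stages 4 and 3."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & MASK

def generate_patterns(seed, count):
    """Pattern generator: return `count` successive LFSR states starting from `seed`."""
    patterns, state = [], seed
    for _ in range(count):
        patterns.append(state)
        state = lfsr_step(state)
    return patterns

def misr_signature(responses, seed=0):
    """Response compactor: shift the register, then XOR in each parallel response word."""
    signature = seed
    for word in responses:
        signature = lfsr_step(signature) ^ (word & MASK)
    return signature

patterns = generate_patterns(seed=0b0001, count=15)
golden = misr_signature(patterns)   # fault-free responses modeled as equal to the stimuli
faulty = patterns.copy()
faulty[5] ^= 0b0100                  # inject a single-bit error into one response word
```

Comparing the compacted signature with the golden value yields a pass/fail result without shifting out every individual response; a single erroneous response word always changes the signature in this linear model.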
During die testing, the BIST circuitry is in the "intest BIST" mode. All control signals are from the intest BIST controller. After all test patterns are applied to the die under test (DUT), TDO is used to shift out signatures. Finally, the intest BIST controller generates a finish signal, indicating the termination of "intest BIST" mode. During interconnect testing, the BIST circuitry is in the "extest BIST" mode. All control signals are from the extest BIST controller. After test application, pass/fail or diagnosis results are shifted out through TDO, and a finish BIST signal is generated by the extest BIST controller. In order to support interconnect testing, two types of boundary-scan cells are needed. Test patterns are shifted and launched by the launch cells; test responses are captured and compressed by the capture cells.
In the proposed BIST architecture, two BIST controllers are incorporated: an intest BIST controller and an extest BIST controller. Both controllers are synchronous FSMs that are used to activate self-testing and to coordinate the overall sequence of events. All the components are integrated and controlled by these two BIST controllers.
The state-transition diagram of the intest BIST controller is shown in Fig. 10. When the Go inner BIST signal is applied to the intest BIST controller, the controller enters the Begin BIST state. In this state, the lengths of the LFSR and MISR are selected by the Enable signals and the seeds are loaded into the LFSR and MISR. The controller then makes a transition to the Prepare BIST state. Several control signals are generated in this state, including the load counter signal to initialize the Counter, and the se signal to set the DUT to test mode. The controller then repeats the Shift BIST, Pause BIST, and Capture BIST states several times to shift in the test patterns and to capture and compress the test responses. Once all the test patterns are applied to the DUT, the controller transitions to the MISR BIST state to shift out the signature from the MISR. Finally, the controller enters the End BIST state and waits for the next initialization.
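The state sequence of the intest BIST controller can be traced with a small generator. This is a behavioral sketch under assumptions: the function name and the `num_patterns` parameter are hypothetical, and each state is listed once per visit rather than once per clock cycle (in hardware, Shift BIST would last as many cycles as the scan chain is long).

```python
def intest_bist_trace(num_patterns):
    """Yield the controller states visited while applying `num_patterns` test patterns."""
    yield "Begin BIST"        # select LFSR/MISR lengths, load seeds
    yield "Prepare BIST"      # initialize the Counter, set the DUT to test mode
    for _ in range(num_patterns):
        yield "Shift BIST"    # shift in a test pattern
        yield "Pause BIST"
        yield "Capture BIST"  # capture and compress the test response
    yield "MISR BIST"         # shift out the signature from the MISR
    yield "End BIST"          # wait for the next initialization

trace = list(intest_bist_trace(num_patterns=2))
```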
The state-transition diagram of the extest BIST controller is shown in Fig. 11. The controller gives commands for defect detection or diagnosis operations. The switch between detection and diagnosis is controlled by the diag signal: the controller conducts diagnosis when the diag signal is logic 1 and detection when it is logic 0. When the controller receives the Go BIST signal, it enters the Begin BIST state. In this state, several components are initialized, including the LFSR, the Counter, and the MUX. The controller then goes through the Shift BIST, Launch BIST, Capture BIST, and Prepare BIST states. During these cycles, test patterns are shifted into the in-BSC and launched to the interconnects; test responses are captured and compressed in the capture cells. After all of the test patterns have been applied to the interconnects, the controller enters either the Pass/Fail BIST state or the Begin Diag state, depending on the value of diag. In the Pass/Fail BIST state, the detection result is shifted out of the chip. In the Begin Diag state, several control signals are activated to prepare for diagnosis. Then the first signature is shifted out during the Shift Diag state. A pair of 1-0 test patterns is generated during the Set Diag state and the Reset Diag state, and the test responses are captured in the
We synthesized the proposed test architecture and the standard boundary-scan architecture using the 45 nm Synopsys TSMC standard-cell library and Synopsys Design Compiler [27]. In order to evaluate the area overhead, we also synthesized a medium-sized IWLS 2005 benchmark [28]. The synthesis results are shown in Table I. Although the BIST area overhead is considerable in comparison to the standard boundary-scan architecture, it can be neglected compared to the total area of a die. For example, the total BIST area overhead is only 3.7% of an Ethernet die. The overhead is therefore negligible (< 1%) for large designs with a million gates.
The slack data of the critical path was obtained using the 45 nm Nangate library and Synopsys PrimeTime. Table I shows that the BIST leakage power overhead is only 0.2% (BIST is idle in functional mode). Therefore, the performance and power impact of BIST are also negligible.
V. EXTEST SCHEDULING
It is more difficult to test dies in a 2.5D IC than in a 2D IC due to the reduced number of test pins available in a 2.5D implementation. Previous work in 2.5D ICs mainly focuses on testing of interposer interconnects [29] - [31] . However, the feature of reduced test pins in 2.5D ICs has never been analyzed. In this section, we present an efficient ExTest scheduling strategy that reduces CPU run time and increases fault coverage while satisfying the constraint that the number of test pins required does not exceed the number of available test pins at the chip level [32] .
The typical structure of an SoC die in a 2.5D IC is shown in Fig. 12. The die consists of multiple tiles, which are located in the four regions of the die: top-left (TL), top-right (TR), bottom-left (BL), and bottom-right (BR). Each tile can only be accessed by the test pins that are in the same region.
The scheduling problem can be defined as follows. We are given a die with a set of $M$ tiles. The parameters considered for scheduling are defined in Fig. 13. Note that some test pins are bidirectional; hence, the sum of $TR_{input}$ and $TR_{output}$ is larger than $TR$. The test pins in the other three regions have similar parameters, namely $TL$, $TL_{input}$, $TL_{output}$, $BR$, $BR_{input}$, $BR_{output}$, $BL$, $BL_{input}$, and $BL_{output}$. If Tile $i$ is located in the top-right region, $tr_i$ is 1 and the other three parameters ($tl_i$, $br_i$, and $bl_i$) are 0. The goal is to find a test group such that the number of tested interconnects is maximized. We enforce the constraint that the number of test pins required cannot exceed the total number of test pins available in the four regions. In addition, all interconnects driving the tile under test must be tested concurrently.
During the testing of a test group, the tiles that belong to the group are referred to as "enabled tiles", and tiles that are not in the group are referred to as "disabled tiles". Each enabled tile can be classified as an "assistive tile" or a "tested tile" based on its functionality. The assistive tiles can only be used to launch patterns, and the interconnects under test are only connected to primary outputs of assistive tiles. The tested tiles are used to capture responses, and these interconnects are connected to the primary inputs of tested tiles. An illustration of the different types of tiles is shown in Fig. 14.
1. $T$: the total number of test pins in SoC dies with dedicated wrappers;
2. $T_{input}$: the total number of test pins that can serve as inputs in $T$;
3. $T_{output}$: the total number of test pins that can serve as outputs in $T$;
4. $TR$: the total number of test pins in the top-right region of large SoC dies;
5. $TR_{input}$: the number of test pins that can serve as inputs in $TR$;
6. $TR_{output}$: the number of test pins that can serve as outputs in $TR$;
7. $I_i$: the number of input scan channels for Tile $i$;
8. $O_i$: the number of output scan channels for Tile $i$;
9. $tr_i$, $tl_i$, $br_i$, and $bl_i$: binary values that indicate the location of Tile $i$;
10. $w_{ij}$: the number of interconnects from Tile $i$ to Tile $j$.
Fig. 13. Definition of parameters in the scheduling problem.
Based on the definition of enabled tiles and tested tiles, two binary variables $x_i$ and $y_i$, $1 \le i \le M$, are defined to develop an ILP model for this problem. The variable $x_i$ is equal to 1 if Tile $i$ is included in the test group and utilized as a tested tile. Similarly, $y_i$ is equal to 1 if Tile $i$ is included in the test group and utilized as an enabled tile. Two constraints on variables $x_i$ and $y_i$ are first defined as follows:
The first constraint defines the relationship between $x$ and $y$ for the same tile: a tested tile must be an enabled tile. This is based on the definitions of an enabled tile and a tested tile. In the second constraint, if $x_j$ is equal to 1 and $w_{ij}$ is at least 1, then variable $y_i$ must be equal to 1. This indicates the relationship between $x$ and $y$ for different tiles: if Tile $j$ is a tested tile, all tiles whose primary outputs connect to Tile $j$ must be enabled tiles.
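One way to write the two constraints just described as linear inequalities is sketched below. The exact form is an assumption reconstructed from the prose, reading the condition on $w_{ij}$ as "at least one interconnect runs from Tile $i$ to Tile $j$"; it is not taken from the paper's equations.

```latex
\begin{align}
x_i &\le y_i, & 1 \le i \le M, \\
x_j &\le y_i, & \forall\, i, j \ \text{with}\ w_{ij} \ge 1 .
\end{align}
```

The first inequality makes every tested tile an enabled tile; the second forces every tile that drives a tested tile to be enabled.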
Since all interconnects feeding a tested tile must be tested concurrently, the tested tiles in one test group cannot be reused as tested tiles in other test groups, which leads to two other constraints on $x_i$ and $y_i$:
If Tile $i$ does not share the same input scan channels with other tiles: $s_{ii} = 1$; $s_{ij} = 0$, $\forall\, i \ne j$.
If $S_i$ exists and Tile $i$ is the representative tile of
Fig. 15. Definition of the binary parameter $s_{ij}$ in different cases.
Constraint (3) addresses the situation where $x_i$ is equal to 0. If there are no interconnects feeding Tile j, Tile j cannot be a tested tile. Constraint (4) considers the case when $y_i$ is equal to 0. If Tile i is not a tested tile and no interconnects connect to its primary outputs, then Tile i cannot be an enabled tile.

The sharing of inputs can be easily implemented in a die. The set $S$ is defined as a set of tiles that share the same input scan channels. A new binary parameter $s_{ij}$ is defined in Fig. 15.
When inputs are shared, not all enabled tiles are considered in the input constraint (Constraint (7)); only the representative tiles and the tiles that do not share input scan channels are considered. Thus, a new binary variable, $z_i$, is defined. The variable $z_i$ is equal to 1 if Tile i is considered in the input constraint; otherwise, it is 0. There are two constraints on variable $z_i$, one for each value of $z_i$, that are defined as follows:
For purely assistive tiles during a test round, no responses are recorded from their output scan channels. As a result, the output scan channels of assistive tiles need not be connected to the output test pins. The constraints on the number of test pins in the top-right region are defined as:
These constraints indicate that the sum of the input/output/total scan channels of all of the enabled tiles in the top-right region cannot exceed the number of available input/output/total test pins. Because the sum of $TR_{input}$ and $TR_{output}$ is larger than $TR$, these constraints provide the flexibility for arranging the input and output test pins inside a region. The constraints on the number of test pins for the other three regions are similar to Constraints (7), (8), and (9).

With the variables defined above, our objective is to maximize the total number of interconnects tested in one test group for a die with a set of $M$ tiles, which is given by:

$$\text{maximize} \quad \sum_{j=1}^{M} \left( \sum_{i=1}^{M} w_{ij} \right) \cdot x_j$$

The quantity $\sum_{i=1}^{M} w_{ij}$ represents the total number of interconnects feeding Tile j. The product $\left( \sum_{i=1}^{M} w_{ij} \right) \cdot x_j$ adds these interconnects to the objective function only if Tile j is a tested tile. Finally, the total number of tested interconnects is the total number of interconnects feeding all the tested tiles in one test group.
The proposed ExTest scheduling strategy has been applied to an SoC die used in 2.5D IC production. The die has 50 million flip-flops and 35 asynchronous clock domains; we refer to it as A531. The A531 die has 531 tiles and 479 test pins, distributed as follows: 131 tiles and 120 test pins in the top-right region, 132 tiles and 120 test pins in the top-left region, 131 tiles and 119 test pins in the bottom-right region, and 137 tiles and 120 test pins in the bottom-left region. A 531 × 531 interconnection matrix is generated based on the design netlists. Since the average width of input scan channels is 1.7 for all tiles in A531, each tile is assumed to have either one or two input scan channels. Table II shows the scheduling results. If each tile has one input scan channel, the proposed strategy targets all 531 tiles as tested tiles in 6 test rounds; in other words, all the interconnects are successfully tested. As the number of input scan channels per tile increases, the number of test rounds also increases because each enabled tile requires more test pins.
In the method previously used in industry, all tiles in A531 were compiled together to run DRC and ATPG. When the entire design was analyzed on a server with 512 GB of memory and a single core, it took 30 days, 11 hours, and 20 minutes to generate 512 test patterns. In the proposed method, the largest test group has 374 enabled tiles and the smallest test group has 221 enabled tiles. In the same environment, generating 512 test patterns takes 11 days, 2 hours, and 40 minutes for the largest test group and 6 days, 2 hours, and 20 minutes for the smallest test group. In addition, DRC and ATPG can be run in parallel for the different groups, so the overall run time is determined by the largest group. Therefore, the run time for the proposed method is 11 days, 2 hours, and 40 minutes, which is roughly one-third (about 36%) of the run time of the previous method.
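The run-time comparison can be reproduced with a few lines of arithmetic. This is a small sketch using only the durations quoted above; the helper `to_minutes` is a hypothetical name, and the sketch assumes that every other test group finishes no later than the largest one when all groups run in parallel.

```python
def to_minutes(days: int, hours: int, minutes: int) -> int:
    """Convert a days/hours/minutes duration to minutes."""
    return (days * 24 + hours) * 60 + minutes


previous = to_minutes(30, 11, 20)        # whole design compiled together
largest_group = to_minutes(11, 2, 40)    # test group with 374 enabled tiles
smallest_group = to_minutes(6, 2, 20)    # test group with 221 enabled tiles

# With groups processed in parallel, wall-clock time is set by the slowest group.
proposed = max(largest_group, smallest_group)
ratio = proposed / previous              # fraction of the previous run time
```

The resulting ratio is about 0.36, which matches the statement that the proposed flow takes roughly one-third of the previous run time.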
VI. A PROGRAMMABLE METHOD FOR LOW-POWER SCAN SHIFT IN SOC DIES
With technology scaling and the relentless increase in design sizes, the power consumption during testing has also grown dramatically. Power savings can be achieved by staggering the test clocks during shift. Stagger can be achieved by ensuring that the clocks for different blocks have different duty cycles or different phases, thereby reducing the number of simultaneous transitions. The shift-clock stagger is implemented at the block level, which allows one block's scan chain to toggle at a time when the scan chains for other blocks remain quiet. The differences in the toggling activities for the blocks are achieved by assigning different stagger values to each block.

Paper TC. 3 INTERNATIONAL TEST CONFERENCE

For each block in the block list {
  If the block is not assigned a stagger value {
    Determine to which group the block belongs (01, 10, or 11);
    If the block is in "01" {
      Select a random stagger value j;
      Assign stagger value j to the block;
    } Else {
      For each neighbor that has a stagger value {
        Find neighbor (index l_ind) that has the largest shared length;
      }
      If digit 2 is 0 { /* block belongs to "10" */
        /* blocks a and b are conflict blocks: they have the same */
        /* stagger value and s_ab ≥ min{thr · p_a, thr · p_b} */
        For each stagger value j {
          If j does not generate conflict blocks {
            Assign j to the block;
            Exit the loop;
          }
        }
        If every stagger value generates conflict blocks {
          Find the stagger value j that generates the minimum number of conflict blocks;
          Assign j to the block;
        }
      }
      If digit 2 is 1 { /* block belongs to "11" */
        Set the target stagger value to be 0 (t = 0);
        Put all stagger values used by neighbors in Array;
        For each j in Array {
          If j does not generate conflict blocks {
            Set t = j;
            Exit the loop;
          }
        }
        If t != 0 { /* we can reuse a used stagger value */
          Assign t to the block;
        } Else { /* all used values generate conflict blocks */
          Get block l_ind's stagger value k;
          Find the unused value j that is farthest from k;
          Assign j to the block;
        }
      }
    }
  }
}
Return block to stagger value mapping.
Fig. 17. Pseudo-code of the proposed SCSA algorithm.
The number of stagger values is limited due to the limited number of clock phases. However, an SoC can contain several hundred blocks. Therefore, we are faced with the problem of assigning a small number of stagger values to a large number of blocks to minimize toggle activity. We refer to this problem as shift-clock stagger assignment (SCSA). In this section, we present a programmable method for SCSA [33] .
In order to implement SCSA, the position of each block B inside the SoC must be known; this information also includes the neighboring blocks for B and their shared boundary length. This relationship can be characterized by a position matrix $S = [s_{ij}]$. Each element $s_{ij}$ represents the shared length between block i and block j. If block i and block j are not neighbors of each other, then $s_{ij}$ is zero.
It is inevitable that some neighboring blocks will be assigned the same stagger value. To handle this case, a threshold (thr) is defined on the ratio of the shared length to the block's perimeter; the power supply noise (PSN) on the power rails is deemed to be acceptable when the shared length of two blocks, relative to their perimeters, is smaller than thr. In this case, we can assign the same stagger value to these blocks.
The SCSA problem can now be formally defined as follows. We are given an SoC design with a set of $M$ blocks. The perimeter of block i is $p_i$; the shared length between block i and block j is $s_{ij}$. The threshold is thr and the number of available stagger values is $A$. Two blocks a and b are conflict blocks if they have the same stagger value and $s_{ab} \ge \min\{thr \cdot p_a, thr \cdot p_b\}$. Our goal is to assign stagger values to the blocks such that the number of conflict blocks is minimized.
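The quantity being minimized can be evaluated directly from the position matrix. The sketch below applies the conflict-block condition stated in the Fig. 17 pseudo-code (same stagger value and shared length of at least min{thr · p_a, thr · p_b}); the function name `conflict_pairs` and the 3-block layout are hypothetical, and the extra check that the blocks are actual neighbors (non-zero shared length) is an assumption added so that a threshold of zero does not flag non-adjacent blocks.

```python
from typing import List, Tuple


def conflict_pairs(s: List[List[float]], p: List[float],
                   stagger: List[int], thr: float) -> List[Tuple[int, int]]:
    """Return all pairs (a, b), a < b, of conflict blocks.

    s[a][b] = shared boundary length, p[a] = perimeter of block a,
    stagger[a] = stagger value assigned to block a.
    """
    pairs = []
    m = len(p)
    for a in range(m):
        for b in range(a + 1, m):
            same_value = stagger[a] == stagger[b]
            is_neighbor = s[a][b] > 0          # assumption: only neighbors conflict
            too_long = s[a][b] >= min(thr * p[a], thr * p[b])
            if same_value and is_neighbor and too_long:
                pairs.append((a, b))
    return pairs


# Hypothetical layout: block 0 shares 10 units with block 1 and 2 units with block 2.
s = [[0, 10, 2],
     [10, 0, 0],
     [2, 0, 0]]
p = [40, 40, 20]
all_same = conflict_pairs(s, p, stagger=[1, 1, 1], thr=0.2)
staggered = conflict_pairs(s, p, stagger=[1, 2, 1], thr=0.2)
```

With a single stagger value, blocks 0 and 1 conflict because their shared length exceeds the threshold; blocks 0 and 2 may keep the same value because their shared boundary is short.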
For a block that has not been assigned a stagger value, its neighbors can have several configurations, represented by two binary values: digits 1 and 2. Digit 1 indicates whether the neighbors have stagger values. When it is 0, no neighboring block has been assigned a stagger value; otherwise, at least one neighboring block has a stagger value and digit 1 is 1. Digit 2 indicates the use of stagger values by the neighboring blocks. When it is 0, all available stagger values are used by the neighboring blocks; otherwise, at least one stagger value is not used by the neighboring blocks and digit 2 is 1. Therefore, the blocks can be classified into three groups based on their neighboring conditions: 01, 10, and 11. The condition 00 can never happen because a 0 in digit 1 (no neighbor has a stagger value) contradicts a 0 in digit 2 (every stagger value is used by some neighbor). These conditions are illustrated by an example with 4 stagger values in Fig. 16, where "0" indicates that no stagger value is assigned to the block. The assignment operation for each group is different, and the pseudo-code of the proposed algorithm is shown in Fig. 17.
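The grouping and the per-group assignment rules of Fig. 17 can be sketched in Python as follows. This is an illustrative reading of the pseudo-code, not the implementation used for the silicon results: the names `group_of`, `conflicts_if`, and `assign_staggers` are hypothetical, stagger values are assumed to be the integers 1..A with `None` marking an unassigned block, blocks are visited in index order, and "farthest" is assumed to mean the largest absolute difference between stagger values.

```python
import random
from typing import List, Optional, Set


def group_of(used: Set[int], num_values: int) -> str:
    """Return the neighbor configuration "01", "10", or "11"."""
    digit1 = 1 if used else 0                       # some neighbor has a value
    digit2 = 0 if len(used) == num_values else 1    # some value is still unused
    return f"{digit1}{digit2}"


def conflicts_if(b, j, s, p, stagger, thr) -> int:
    """Count neighbors that become conflict blocks if block b takes value j."""
    return sum(1 for n in range(len(p))
               if n != b and s[b][n] > 0 and stagger[n] == j
               and s[b][n] >= min(thr * p[b], thr * p[n]))


def assign_staggers(s, p, num_values: int, thr: float,
                    seed: int = 0) -> List[Optional[int]]:
    rng = random.Random(seed)
    values = list(range(1, num_values + 1))
    stagger: List[Optional[int]] = [None] * len(p)
    for b in range(len(p)):
        if stagger[b] is not None:
            continue
        assigned = [n for n in range(len(p))
                    if s[b][n] > 0 and stagger[n] is not None]
        used = {stagger[n] for n in assigned}
        group = group_of(used, num_values)
        if group == "01":
            stagger[b] = rng.choice(values)
            continue
        # Neighbor with the largest shared length (index l_ind in Fig. 17).
        l_ind = max(assigned, key=lambda n: s[b][n])
        if group == "10":
            # First conflict-free value if one exists, else the value with
            # the fewest conflicts (min returns the first minimum).
            stagger[b] = min(values,
                             key=lambda j: conflicts_if(b, j, s, p, stagger, thr))
        else:  # group "11"
            reusable = [j for j in sorted(used)
                        if conflicts_if(b, j, s, p, stagger, thr) == 0]
            if reusable:
                stagger[b] = reusable[0]
            else:
                k = stagger[l_ind]
                unused = [j for j in values if j not in used]
                stagger[b] = max(unused, key=lambda j: abs(j - k))
    return stagger


# Hypothetical chain of four blocks, each sharing 10 units with the next one.
s = [[0, 10, 0, 0],
     [10, 0, 10, 0],
     [0, 10, 0, 10],
     [0, 0, 10, 0]]
chain = assign_staggers(s, p=[40, 40, 40, 40], num_values=2, thr=0.2)
```

For this chain with two stagger values, adjacent blocks end up with alternating values, so no pair of neighbors toggles in the same phase.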
Next, we present measured silicon results for a GPU design using SCSA. Fig. 18 presents holistic silicon results that demonstrate the benefits of SCSA. A total of 500 patterns were applied to the GPU and the voltage droop was measured for each pattern. We plot the normalized $V_{droop}$ (100% refers to maximum droop) in the figure, where each dot refers to a test pattern. For the sake of completeness, we also report the capture clock frequency utilized for the corresponding test pattern. In Fig. 18(a), the cluster on the left lies within the permissible droop window that can be tolerated during test. All the points to the right are indicative of excessive voltage drop (as high as 75-100% of the maximum droop) due to scan shifting. Fig. 18(b) shows that when the same 500 patterns are applied to the GPU with 8 stagger values, the droop profile shifts to the left and falls within the acceptable window. Hence, the silicon data conclusively demonstrates the benefits of SCSA for SoC testing.
VII. CONCLUSION
Several challenges in 2.5D ICs have been identified as potential roadblocks for volume manufacturing and commercial exploitation. These challenges include: (1) pre-bond interposer testing, (2) limited ability for at-speed testing, (3) high-density I/O ports and interconnects, (4) reduced number of test pins, and (5) high power consumption. This paper has addressed the above challenges and presented new testing techniques, which are expected to bring 2.5D integration technology closer to adoption in mainstream products.
