Abstract-Three-dimensional (3D) stacking using through-silicon vias (TSVs) promises higher integration levels in a single package, keeping pace with Moore's law. Testing has been identified as a showstopper for volume manufacturing of 3D-stacked integrated circuits (3D ICs). This work provides solutions to new challenges related to 3D test content, test access, diagnosis and debug. We analyze the impact of thermo-mechanical stress due to the TSV fabrication process on test quality. We propose a test-generation flow that takes TSV-induced stress into account by using stress-aware circuit models. Pre-bond TSV test is a challenge due to limited accessibility of TSVs at the pre-bond stage. We develop a non-invasive method for TSV test and diagnosis using ring oscillators, duty-cycle detectors, and a regression model based on artificial neural networks. In order to efficiently deliver test content, 3D design-for-test (DfT) architectures are needed. We propose an optimization approach that takes uncertainties in input parameters into account and provides a solution that is efficient in the presence of input-parameter variations and minimizes test time. Finally, post-silicon debug is a major challenge due to continuously increasing design complexity. We develop a low-cost debug architecture for massive signal tracing in 3D-stacked ICs with wide-I/O DRAM dies that significantly increases the observation window compared to traditional methods that use trace buffers.
I. INTRODUCTION
Three-dimensional IC stacking using through-silicon vias (TSVs) is a relatively new technology that offers a number of advantages over conventional stacking methodologies [1] . TSVs are vertical copper or tungsten conducting nails passing through a thinned die. Typical TSV dimensions are 5 μm diameter and 50 μm height. The actual connection to the next die can be a direct copper-to-copper bond of the TSV onto a small landing pad, but today is often implemented by means of a CuSn micro-bump, of which typical dimensions are 25 μm diameter at 40 μm pitch [2] . As they form direct vertical interconnects between stacked dies, TSVs allow for a much larger number and higher density of interconnects than conventional wire-bonds. Due to their geometry, TSVs have relatively low capacitance and inductance, hence they enable high bandwidth and reduce power consumption [3] .
Despite the numerous benefits offered by 3D integration, test challenges for 3D-stacked integrated circuits (ICs) must be addressed before volume manufacturing and defect screening can be feasible [4] - [6] . These challenges include the following.
• Test content
As in any semiconductor product, a 3D IC should be tested for defects that lead to errors in functional operation. These defects can be divided into two categories: (i) defects that are specific to 3D ICs, and (ii) defects that occur both in traditional (2D) ICs and 3D ICs. The first category includes defects in TSVs and micro-bumps, as well as defects caused by mechanical stress during TSV and micro-bump manufacturing. New test techniques need to be developed to screen for these defects. The second category includes defects in the internal die logic. Traditional test methods, such as scan test, have been successfully used to test for these defects. However, test cost tends to multiply with the increasing chip complexity; therefore, more efficient test solutions are required to keep test cost low.
• Test access
Test access is difficult both in pre-bond and post-bond test of 3D ICs. In pre-bond test, the small dimensions of micro-bumps and TSVs make probing very difficult. Therefore, special techniques are required to test TSV-based connections prior to die bonding. In post-bond test, only one die in a 3D stack has external connections; other dies require test access through this die, necessitating a 3D design-for-test (DfT) architecture that allows for test access to all components in the stack. In addition, 3D DfT architectures must be optimized for on-chip area requirements and test time, both of which have a major impact on test cost.
• Diagnosis and Debug
Silicon debug requires a relatively large engineering effort, accounting for a significant part of the total time-to-market [7], [8]. In order to keep pace with advances in system-level integration, including 3D stacking, significant enhancements are needed for traditional bug-localization methods.
This paper addresses the above challenges and provides practical solutions to these challenges. First, in Section II, we address the issue of thermo-mechanical stress due to TSV fabrication. TSV stress changes the timing profile of the digital logic surrounding TSVs, which has an impact on delay-fault testing. We analyze this effect and show quantitatively that test quality is significantly reduced if the test patterns are generated with TSV stress-oblivious circuit models. The detrimental impact of TSV stress on pattern effectiveness and test quality can be overcome by using stress-aware circuit models for test generation.
In Section III, we present a technique for contactless pre-bond TSV test and diagnosis using ring oscillators and duty-cycle detectors. TSVs are used as capacitive loads of their driving gates that are configured in a ring oscillator. By measuring the oscillation period and the duty cycle of the signal generated by the ring oscillators, we can detect resistive open and leakage faults. A regression model based on artificial neural networks can predict the fault type and the fault size; this model uses the oscillation period and the duty cycle measured at multiple voltage levels. The accuracy of the regression model is evaluated through simulation using realistic models for a 45 nm CMOS technology.
Paper DDC.1 978-1-4673-6578-9/15/$31.00 ©2015 IEEE INTERNATIONAL TEST CONFERENCE
Section IV addresses the challenge of robust optimization of 3D test-access architectures. Traditional optimization frameworks suffer from the drawback that they ignore potential uncertainties in input parameters. In realistic scenarios, however, the input parameter values used in the design phase may differ from the actual values that are known only after the design phase. Examples of such parameters include test power and configuration of the die test-access mechanism. We propose a robust-optimization framework that takes uncertainties in input parameters into account and provides robust solutions. We evaluate the proposed framework and show that robust solutions are superior to single-point solutions in terms of average test time when there are uncertainties in the values of input parameters.
Finally, Section V presents a method for massive signal tracing using on-chip DRAM for in-system silicon debug. Traditional debug techniques using signal tracing suffer from the limited capacity of on-chip trace buffers. We propose a low-cost debug architecture for signal tracing that exploits large amounts of fast on-chip memories available in wide-I/O 3D ICs. The key idea of the proposed architecture is to store the trace data to functional memory, thereby significantly increasing the signal-observation window compared to traditional methods that use dedicated trace buffers. We evaluate the proposed method and show that the observation window can be increased by orders of magnitude compared to prior work at comparable hardware cost.
II. TSV STRESS-AWARE ATPG
TSV-induced thermo-mechanical stress is a negative side effect of TSV manufacturing that requires special attention. The thermal expansion coefficient of copper, a common TSV fill material, is significantly higher than that of silicon: $17 \times 10^{-6}$/K versus $3 \times 10^{-6}$/K [9]. Due to this mismatch, TSVs are likely to cause residual stress in the silicon during fabrication and thermal cycling. One of the effects of thermal stress is mobility variation in MOS devices in the proximity of TSVs. These variations lead to a change in the timing profile of the circuit [10], [11], which affects delay-fault testing. In this section, we present a study of the impact of timing variations due to TSV stress on the quality of test patterns generated to screen small-delay defects (SDDs) [12]. We show that the use of TSV stress-oblivious circuit models results in a significantly increased escape rate of faulty chips. The level of this increase depends on the yield of the fabrication process; we conclude that accurate modeling of TSV stress is more important for processes with lower yields.
The impact of TSV stress on pattern effectiveness is quantified using the statistical delay quality level (SDQL) metric [13]. This is a key metric in our approach, since the SDQL of a chip correlates with the expected test escape rate due to small-delay defects. We also show that the test escape rate can be reduced considerably by incorporating TSV stress in cell timing libraries and using these libraries with a commercial timing-aware ATPG tool. Therefore, any detrimental impact of TSV stress on pattern effectiveness and test quality can be overcome by using stress-aware models for test generation. We also show that TSV stress-aware testing leads to a negligible increase, if any, in pattern count.
Our approach consists of two major parts. The flow used to obtain the TSV stress-aware circuit model is shown in Fig. 1. We first create a timing library for a range of mobility values. The next step is stress calculation, which is performed as outlined in [11]. We perform FEA simulations of the stress generated by a single TSV using the FEA software ABAQUS [14]. For any given physical layout, we then perform full-chip stress analysis using linear superposition. With the stress tensor at each point in the design, the corresponding change in mobility of electrons (NMOS) and holes (PMOS) can be computed using the measured piezoresistive coefficients given in [15], assuming (100) silicon. With this approach, we obtain the change in mobility due to TSV stress for each standard cell in the design. The appropriate timing library for each cell can then be picked from the pre-characterized set of timing libraries. We feed all the libraries, netlists, and parasitics into Synopsys PrimeTime to obtain TSV-stress-aware timing results.
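The library-selection step can be illustrated with a minimal sketch. It assumes that each pre-characterized library corresponds to one discrete mobility corner and that the nearest corner is chosen for each cell; the corner values, cell names, and library naming scheme below are hypothetical and not taken from the flow in Fig. 1.

```python
from bisect import bisect_left

# Hypothetical pre-characterized corners: relative mobility change per library.
MOBILITY_CORNERS = [-0.08, -0.04, 0.0, 0.04, 0.08]

def nearest_corner(delta_mu, corners=MOBILITY_CORNERS):
    """Return the characterized corner closest to the computed mobility change."""
    i = bisect_left(corners, delta_mu)
    if i == 0:
        return corners[0]          # below the characterized range: clamp
    if i == len(corners):
        return corners[-1]         # above the characterized range: clamp
    lo, hi = corners[i - 1], corners[i]
    return lo if delta_mu - lo <= hi - delta_mu else hi

def assign_libraries(cell_mobility):
    """Map each cell to a library name built from its (NMOS, PMOS) corners."""
    return {
        cell: "lib_n{:+.2f}_p{:+.2f}".format(nearest_corner(dn), nearest_corner(dp))
        for cell, (dn, dp) in cell_mobility.items()
    }

# Hypothetical per-cell mobility changes (NMOS, PMOS) from the stress analysis.
cells = {"U1": (0.013, -0.052), "U2": (-0.001, 0.075), "U3": (0.2, -0.2)}
libs = assign_libraries(cells)
```

Clamping out-of-range values to the outermost corner is a design choice of this sketch; a production flow would more likely extend the characterized range.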
To evaluate the impact of TSV-induced stress on test quality, we have developed a tool flow using conventional timing analysis and ATPG tools: Synopsys PrimeTime and TetraMax, respectively. Fig. 2 gives an overview of the flow. As input, we use the original (non-stress-aware, NSA) models and the modified (stress-aware, SA) models. First, we perform timing analysis with PrimeTime to extract the slack data. Next, we generate two delay test pattern sets with TetraMax: one using the NSA and the other using SA models. Finally, we perform fault simulation and compute SDQL with TetraMax using the following combinations.
1) The NSA pattern set on the NSA model.
Table I gives an overview of the design data, including gate, scan flip-flop and TSV count. We partitioned the netlist and created three different stacks for each core: two-die, three-die, and four-die stacks. For each die, we used a 3D force-directed placer to place the gates [17]. This placer places TSVs in a regular fashion, and assigns nets to TSVs using a 3D Minimum Spanning Tree approach. In the next step, we routed each die separately in Cadence Encounter and performed timing analysis in Synopsys PrimeTime. We treated dies as modules, and TSVs as top-level interconnections. PrimeTime was also used to get die-level timing constraints, and these were used to perform timing optimization in Cadence Encounter. We also obtained stress-aware timing from the methodology described above. Some sample layouts of des_perf are shown in Fig. 3, along with the obtained PMOS mobility maps.
Next, we applied the flow depicted in Fig. 2 to the benchmarks. Table II shows the test escape rate for one of the benchmarks, des_perf, expressed in defective parts per million (DPPM). The DPPM is calculated as SDQL normalized by the number of delay faults $N_f$: $\mathrm{DPPM} = \mathrm{SDQL}/N_f$. The results indicate that the actual escape rate of the NSA patterns (Row 2) is significantly higher than the estimated one (Row 1). This implies that neglecting TSV stress in the ATPG flow by using an NSA model leads to decreased test quality. We made similar observations for the other benchmarks.
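To make the normalization concrete, the following sketch computes an SDQL-style value and the resulting DPPM. It assumes one common formulation of SDQL, in which each fault contributes the integral of a delay-defect distribution $F(s)$ over the defect sizes that cause a functional failure but escape the test, and it assumes an exponential $F(s) = a e^{-\lambda s}$. The bounds `t_mgn` and `t_det`, the parameters `a` and `lam`, and all numbers are hypothetical; the commercial tool flow described above is not reproduced here.

```python
import math

def escape_integral(t_mgn, t_det, a, lam):
    """Integral of F(s) = a*exp(-lam*s) for s between t_mgn and t_det.

    t_mgn: smallest defect size that causes a functional failure (assumed).
    t_det: smallest defect size that the applied patterns detect (assumed).
    """
    if t_det <= t_mgn:
        return 0.0  # every harmful defect size is detected
    return (a / lam) * (math.exp(-lam * t_mgn) - math.exp(-lam * t_det))

def sdql(faults, a=1.0, lam=2.0):
    """Sum of the per-fault escape integrals."""
    return sum(escape_integral(t_mgn, t_det, a, lam) for t_mgn, t_det in faults)

def dppm(faults, a=1.0, lam=2.0):
    """DPPM = SDQL / N_f, as defined in the text."""
    return sdql(faults, a, lam) / len(faults)

# Hypothetical (t_mgn, t_det) pairs in ns for the same pattern set:
# bounds as estimated with one timing model, and as seen with another.
estimated = [(0.10, 0.30), (0.20, 0.20), (0.50, 0.40)]
actual = [(0.10, 0.45), (0.20, 0.35), (0.50, 0.40)]

dppm_estimated = dppm(estimated)
dppm_actual = dppm(actual)
```

In this toy data set the second evaluation widens the undetected range of two faults, so its escape estimate is higher than the first.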
The results presented in Table II also show that the test escape rate of the optimized patterns (Row 3) almost equals the one initially targeted (Row 1). Therefore, if TSV stress is taken into account during ATPG, it has no significant impact on the test quality. This is also true for small keep-out zones: regardless of the KOZ size, the test quality can reach the level of a circuit without the timing variation effects caused by TSV stress. We also observed that there is no significant variation in pattern count between the NSA and SA patterns for the same implementation. This implies that a stress-aware ATPG flow has negligible impact on pattern count compared to a conventional ATPG flow. In addition, we observed that the size of the KOZ has no impact on pattern count.
III. CONTACTLESS PRE-BOND TSV TEST AND DIAGNOSIS USING RING OSCILLATORS AND DUTY-CYCLE DETECTORS
One of the challenges associated with 3D test arises from new defects due to the TSV manufacturing process; such defects include voids and pinholes. Voids, as shown in Fig. 4, are formed due to insufficient filling [18]. A pinhole is an oxide defect that creates a short between the TSV and the substrate [19]. Many of these defects arise prior to the bonding process. Therefore, they can be targeted during pre-bond testing, increasing the probability of getting a known good die (KGD) prior to bonding and therefore increasing the product yield. It has been widely acknowledged that the lack of KGD can be a serious yield limiter for 3D stacking [4], [20], [21].
However, pre-bond testing of TSVs is difficult because of test-access limitations. Prior to wafer thinning, TSVs are buried in the silicon substrate and are only indirectly accessible through the circuitry connected to the TSVs. Even when the back side of the TSVs is exposed after wafer thinning, probing on those is challenging because of strict requirements on the probing equipment. Recent studies report success in mechanical probing at array pitches of 40 μm [2]; however, such probing solutions are still being researched and it remains to be seen how easily they can be used in practice. Therefore, probe-less solutions for pre-bond TSV test should also be investigated as an alternative to solutions that rely on probing.
We propose a contactless pre-bond TSV test based on ring oscillators [22]. We target two types of TSV faults: resistive opens and leakage faults. Several TSV defects can be modeled by these faults. For instance, micro-voids increase the TSV resistance at the defect location and thus can be modeled as a resistive-open fault. Pinholes create a conduction path from a TSV to the substrate, resulting in a leakage fault. Fig. 5 shows the electrical TSV models we used for three cases: (a) fault free, (b) micro-void, and (c) pinhole. We model a fault-free TSV as a capacitance $C_{TSV}$ between the TSV filling material and the substrate. The value of $C_{TSV}$ depends on the TSV technology; recent studies report $C_{TSV} = 59$ fF [23]. A micro-void is modeled as an increased TSV resistance $R_O$ at the location $x$. The value of $R_O$ can vary from a few Ω in the case of a micro-void to infinity in the case of a full open in the TSV. TSV leakage, e.g., due to a pinhole defect, is modeled by the conductance $G_L = 1/R_L$. The value of $G_L$ can vary from zero in the fault-free case to several hundred μS in the case of strong leakage. The value of $G_L$ can increase over time, since leakage faults tend to deteriorate; therefore, it is necessary to detect even small leakage to ensure the reliability of TSVs.
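The effect of a leakage conductance on the signal at the TSV node can be illustrated with a first-order sketch. It assumes a lumped model that is not taken from Fig. 5: an ideal driver with output resistance `r_drv` (hypothetical, 1 kΩ here) charges or discharges $C_{TSV}$ while $G_L$ pulls the node toward ground. The function returns the times at which the node crosses $V_{dd}/2$ on a rising and on a falling edge.

```python
import math

def rise_fall_times(r_drv, c_tsv, g_leak, vdd=1.0):
    """Crossing times of vdd/2 for a rising and a falling edge (seconds).

    Assumed model: driver resistance r_drv in series with the supply or
    ground, TSV capacitance c_tsv to ground, leakage g_leak to ground.
    """
    x = r_drv * g_leak
    if x >= 1.0:
        raise ValueError("leakage too strong: node never reaches vdd/2")
    tau = r_drv * c_tsv / (1.0 + x)      # same time constant for both edges
    v_high = vdd / (1.0 + x)             # steady-state high level with leakage
    t_rise = tau * math.log(v_high / (v_high - vdd / 2.0))
    t_fall = tau * math.log(v_high / (vdd / 2.0))
    return t_rise, t_fall

R_DRV = 1e3        # ohm, hypothetical driver resistance
C_TSV = 59e-15     # farad, value reported in the text

t_rise_ok, t_fall_ok = rise_fall_times(R_DRV, C_TSV, 0.0)
t_rise_leak, t_fall_leak = rise_fall_times(R_DRV, C_TSV, 10e-6)  # 10 uS leakage
```

Under this model a weak leakage lengthens the rising edge and shortens the falling edge, so the two shifts partly offset each other in their sum while they add up in their difference.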
In order to detect variations of R O and G L , we perform a parametric test using ROs. Deviations in these parameters due to faults lead to variations in the propagation delay of the net connected to the TSV. These variations can be measured by ring oscillators (ROs). An RO is a feedback loop containing an odd number of inverters. Fig. 6 shows the configuration of an RO with TSVs as loads. The RO is comprised of multiple non-inverting I/O cells and an inverter. The oscillation period and the duty cycle of the generated signal are captured by on-chip DfT hardware based on binary counters and time-to-digital converters.
The number of TSVs in an RO ($N_{TSV}$) can be determined based on the desired oscillation frequency. The signal TE (test enable) controls the multiplexers selecting between the functional outputs coming from the internal logic and the oscillator loop. The signals BY[1] ... BY[$N_{TSV}$] (bypass) control the multiplexers that include or exclude a TSV from the oscillator loop. OE (output enable) controls the tri-state drivers of the I/O cells. In functional mode, this signal is set by the internal logic. In test mode, OE is set to 1 to enable the drivers.
Propagation delays in digital gates are affected by random process variations [24]. Since the oscillation period and the duty cycle are functions of gate propagation delays, they are sensitive to these variations. This has a negative impact on fault-diagnosis accuracy and can lead to aliasing. In order to reduce the noise effect introduced by random process variations, we measure the oscillation period twice: once with the TSV under test included in the oscillator loop ($T_{osc}$) and once with it bypassed ($T_{osc,b}$). By subtracting $T_{osc,b}$ from $T_{osc}$, we can eliminate the common part, i.e., the propagation delay of the circuitry that is common to both configurations.
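The benefit of the subtraction can be shown with a small numerical sketch. It assumes that the oscillation period splits additively into a die-dependent common part and a part that depends on whether the TSV is in the loop, and that $T_{osc,b}$ denotes the period with the TSV bypassed; all delay values are hypothetical.

```python
import random
import statistics

def measure_periods(common_delay, tsv_delay, bypass_delay):
    """Return (T_osc, T_osc_b) for one die under an additive delay model."""
    t_osc = common_delay + tsv_delay        # TSV included in the loop
    t_osc_b = common_delay + bypass_delay   # TSV bypassed
    return t_osc, t_osc_b

rng = random.Random(0)
raw, diff = [], []
for _ in range(1000):
    # Common loop delay varies from die to die (hypothetical: 2000 ps +/- 100 ps).
    common = rng.gauss(2000.0, 100.0)
    t_osc, t_osc_b = measure_periods(common, tsv_delay=120.0, bypass_delay=20.0)
    raw.append(t_osc)
    diff.append(t_osc - t_osc_b)

spread_raw = statistics.pstdev(raw)     # dominated by process variation
spread_diff = statistics.pstdev(diff)   # common part cancels
```

In this idealized model the difference is the same on every die, whereas the raw period inherits the full spread of the common delay. Variation in the TSV driver itself is not cancelled and is not modeled here.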
Even though the effects of weak leakage on rise and fall times of the TSV driver virtually cancel each other for the oscillation period, they strongly affect the duty cycle of the oscillating signal. In the following, we define the duty cycle $D$ of an oscillating signal as the ratio of the on-time $T_{on}$ of one cycle to the oscillation period $T_{osc}$:
$$D = \frac{T_{on}}{T_{osc}} = \frac{T_{on}}{T_{on} + T_{off}},$$
where $T_{osc}$ is the oscillation period, $T_{on}$ is the on-time, defined as the time for which the oscillating signal is above $V_{dd}/2$ during one cycle, and $T_{off}$ is the off-time, defined as the time for which the oscillating signal is below $V_{dd}/2$ during one cycle. The opposite effects of weak leakage faults on rise and fall times of the TSV driver create a difference between $T_{on}$ and $T_{off}$, which can be effectively detected by measuring the duty cycle. This is the main motivation for using duty-cycle detectors in addition to frequency detectors: they allow weak leakage faults to be detected more accurately. As the duty-cycle detector, we use a time-to-digital converter based on the circuit proposed in [25]. The key idea of this detector is to integrate a constant current during (a) $T_{on}$ and (b) $T_{off}$ until a certain threshold voltage is reached and to compare the integration times of (a) and (b).
Next, we describe the DfT structures that are used to perform on-chip measurement of the RO oscillation period. Fig. 7 presents an overview of the design. It consists of a control-logic block (TEST CTRL) that is connected to the IEEE 1149.1 (JTAG) TAP controller [26]. We assume that a TAP controller is already in place for other test and debug purposes; hence we can reuse it to control the TSV-test DfT circuitry from the test equipment. The only requirement on the TAP controller is two extra instructions to control the TSV test procedure, which is explained below. TEST CTRL is a logic block that is shared between multiple groups of ROs and that generates signals to control the ROs, the counters, the clocks, and the data flow. Each RO group contains $M_{RO}$ ROs, a binary counter, an RO-select register (SEL), a register X to
In order to perform TSV diagnosis based on measured oscillation frequency and duty cycle, we employ artificial neural networks (ANNs) [27].
ANNs are considered universal tools to efficiently model such complex systems with a large number of input variables [28], and are therefore suitable for building a regression model that can accurately diagnose a TSV based on multiple measured data points.
In the case of class dual = 1, we can determine that the TSV under test has both leakage and resistive-open faults. However, due to masking effects, it is difficult to predict the size of the faults; therefore, we ignore the values of $G_L$ and $R_O$. Even though the model will not provide a reliable analysis of the fault size in this case, it will still classify the TSV as being faulty.
In order to verify the proposed regression model based on ANNs, we have generated two independent large sets of training and test data, each with a sample size over 10,000. First, we created Class-net in MATLAB and trained it using the training-data set. As the performance evaluation metric, we used the mean squared error (MSE). The training function for the network was selected following the guidelines in [29]. Then, Class-net was evaluated using the test-data set. Fig. 9 shows the confusion matrix from that evaluation. The entry (i,j) shows the number of samples of type j classified as i by the network, and their percentage. For instance, the entry (2,1) shows 37 (or 0.1%), which means that 0.1% of all test points were classified as $R_O$, although the actual fault is $G_L$. According to this matrix, the majority of the samples were classified correctly (green cells), and only a small percentage of the samples were mispredicted (red cells), which shows the high accuracy of the classification network.
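The indexing convention of the confusion matrix can be illustrated as follows. This is a minimal sketch in which entry (i, j) holds the count and the percentage of samples of actual class j that were classified as class i; the class labels beyond $G_L$ and $R_O$ and the sample data are hypothetical and unrelated to Fig. 9.

```python
def confusion_matrix(actual, predicted, classes):
    """matrix[i][j] = (count, percent of all samples) of actual class j
    that the classifier assigned to class i."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("need two non-empty label lists of equal length")
    index = {c: k for k, c in enumerate(classes)}
    n = len(classes)
    counts = [[0] * n for _ in range(n)]
    for a, p in zip(actual, predicted):
        counts[index[p]][index[a]] += 1
    total = len(actual)
    return [[(c, 100.0 * c / total) for c in row] for row in counts]

def accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(c for row in matrix for c, _ in row)
    return sum(matrix[k][k][0] for k in range(len(matrix))) / total

CLASSES = ["G_L", "R_O", "dual"]  # hypothetical label set
actual = ["G_L", "G_L", "R_O", "R_O", "dual", "G_L"]
predicted = ["G_L", "R_O", "R_O", "R_O", "dual", "G_L"]

cm = confusion_matrix(actual, predicted, CLASSES)
# cm[1][0] is the entry (2,1) in one-based indexing:
# samples whose actual fault is G_L but that were classified as R_O.
```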
The network $G_L$-net was trained using a subset of the generated samples with $R_O = 0$ (leakage faults only). In order to show the improvement in accuracy that we achieve by adding the duty cycle to the regression model as extra information, we performed the following simulation. We created a "reduced" regression model $G_L$-net$_r$ with the same settings as $G_L$-net, with the only difference that it uses only $T_{osc}$ and $T_{osc,b}$ pairs as input data, with $2K_T = 16$ inputs in total. Fig. 10 shows a comparison between $G_L$-net and $G_L$-net$_r$ in terms of MSE for a range of leakage faults. As expected, the model using the duty cycle as an input variable provides significantly better predictions for weak leakage faults with $G_L \le 150$ μS, which will help to reduce test escapes as well as over-testing.
The network $R_O$-net was generated using a network architecture similar to that of $G_L$-net. It was trained using a subset of the generated samples with $G_L = 0$ (resistive-open faults only). Since resistive-open faults affect the rise and fall times of the TSV drivers in the same way, the duty cycle is insensitive to this type of fault. Nevertheless, the duty cycle is essential for classifying faults and detecting weak leakage faults.
We also performed a simulation to show the impact on the diagnosis accuracy if $T_{osc}$ and $D$ are only measured at one voltage level. We created regression models that only use {$T_{osc}$, $T_{osc,b}$, $D$, $D_b$} at a specific voltage level instead of multiple levels and trained them using settings similar to those of the full models $G_L$-net and $R_O$-net. Fig. 11 shows a comparison of $G_L$-net to the model that uses the four input variables measured at $V_{dd} = 0.85$ V, as leakage faults are better detectable at voltage levels below nominal $V_{dd}$. As we can observe, a reduction of the test to one voltage level would result in a significant loss of diagnosis accuracy. Similar observations were made for a reduced version of $R_O$-net that only takes the four input variables {$T_{osc}$, $T_{osc,b}$, $D$, $D_b$} measured at a single voltage level.
These results support our premise that a regression model using multiple voltage levels is more effective than that using a single voltage level.
Fig. 11: MSE of $G_L$-net using multiple vs. single $V_{dd}$.
IV. UNCERTAINTY-AWARE ROBUST OPTIMIZATION OF TEST-ACCESS ARCHITECTURES
Increasing test complexity and test cost are undesirable consequences of integration of large SoCs in a single 3D stack. Recent work has addressed this issue and proposed several methods for test architecture optimization and test scheduling [30] - [32] . A drawback of these methods is that they consider known (constant) values for input parameters, which may differ from the actual values. This can lead to non-optimal decisions made at the design stage, increasing the test time. Variations in input parameters can be attributed to several reasons.
• At the design stage, some input parameters, such as power consumption and test pattern count, are not known exactly and hence we need to rely on estimates, which can be inaccurate.
• In a 3D scenario, a die can be used as an "off-the-shelf" component in various 3D stack designs with different constraints on power consumption and available bandwidth for test data. A die that is optimized for a particular 3D stack can result in non-optimal test times in other stacks.
Figure 12 highlights an example of such a reconfigurable architecture with a variable number of test inputs. The figure shows the partitioning of cores at die level into four groups and the routing within groups as well as between groups. In this example, there are three possible configurations with different TAM widths: n-bit, 2n-bit, and 4n-bit interfaces. Depending on the configuration, a different set of I/Os is used to match the width. For instance, only TI1 and TO1 are used as test I/Os in the n-bit configuration, such that all groups are concatenated. Two cores can only be tested in parallel if they are assigned to different TAM wires, irrespective of their group assignment. For example, c1 and c12 cannot be tested in parallel in this configuration because they share wire 3. With this flexible 3D test architecture, the die can be integrated in different stacks that use n, 2n, or 4n test inputs. Hence the die-level TAM width in test-architecture optimization for the stack becomes a distributed input parameter at the design stage.
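The parallel-test rule can be expressed as a disjointness check on TAM wire sets. In the sketch below, only the fact that c1 and c12 share wire 3 comes from the example above; the remaining wire assignments and the core c7 are hypothetical.

```python
from itertools import combinations

def can_test_in_parallel(wires_a, wires_b):
    """Two cores may run concurrently only if they share no TAM wire."""
    return wires_a.isdisjoint(wires_b)

def conflicts(assignment):
    """List all core pairs that cannot be tested in parallel."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(assignment), 2)
        if not can_test_in_parallel(assignment[a], assignment[b])
    )

# Hypothetical wire assignment in the n-bit configuration.
tam = {"c1": {1, 2, 3}, "c12": {3, 4}, "c7": {5, 6}}
pairs = conflicts(tam)
```

The group membership of a core does not appear in the check, which mirrors the statement that the rule holds irrespective of the group assignment.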
To deal with the issue of uncertainties in input parameters, we have developed a robust-optimization method for 3D ICs [33], [34]. This method is based on a technique that has been proposed in the past [35]. The main idea of [35] is to consider a number of scenarios in each of which variable input parameters take certain values and to find a solution that stays near-optimum in all scenarios. The robust optimization explicitly takes into account input-parameter variations, as a range for each input parameter is specified. The robust solution might not be the absolute optimum for a given point, but is optimized to be not too far off from that optimum as long as the input parameters are in the anticipated, pre-specified variation range. Even though a robust solution sacrifices optimality in some scenarios, it performs better on average than a single-point solution. In [35], robust optimization is used for linear programming (LP). However, many practical optimization problems are NP-hard and cannot be solved in polynomial time using LP [36]. Examples of such problems include test-architecture optimization and test scheduling. These problems can be modeled by integer linear programming (ILP).
We developed a mathematical model for the robust optimization problem with the objective to find a solution that is "close" to point-optimal solutions in every scenario, where the closeness is defined as the normalized difference between the total test time of a point-optimal solution and that of the robust solution. We demonstrate our framework using three uncertain parameters: (i) core-test power, (ii) TAM configuration, and (iii) core-test times. The set of uncertain parameters can be extended depending on the application. As obtaining an exact solution for this robust-optimization problem for 3D-test architectures is intractable even for small dies, we propose an efficient heuristic based on simulated annealing (SA). Figure 13 provides an overview of the proposed SA algorithm. In the beginning, we initialize the artificial variable Temp, and find an initial grouping of cores and a TAM assignment of each core to the bus of n wires. Next, we randomly perturb the TAM assignment by rearranging some of the cores. The new test architecture is evaluated by finding a near-optimal test schedule for each scenario s, and the cost function $F$, which is based on the closeness measure and on the expectation of test time $T_{exp}$ over all scenarios, is recorded as $F_{new}$. If the cost of the new solution is not higher than that of the current solution, the new test architecture is accepted. Otherwise, it is accepted with a probability that decreases as the cost increase grows and as Temp falls.
If Temp is above a specified threshold $Temp_{min}$, Temp is decreased by a specified factor and the algorithm goes back to the perturbation phase. This time, the grouping of the cores is slightly perturbed and the new test architecture is evaluated again. Perturbations of the TAM assignment and of the grouping alternate from one iteration to the next. Once Temp has fallen under $Temp_{min}$, the algorithm stops and the best solution so far is provided as the result.
The evaluation of a TAM architecture, as shown on the right-hand side of Figure 13, is another SA-based heuristic. The key idea of this algorithm is to start with an initial schedule and then perturb the order of tests at every iteration. A schedule is constructed using the following greedy approach. We go through the list of all cores one by one in a particular order and assign the start of each core test to the earliest time possible, without violating the given constraints. For example, consider a list of four cores {$C_3$, $C_2$, $C_4$, $C_1$}. The algorithm assigns the start time of the first core in the list, $C_3$, to $t = 0$. This is always a feasible step as no other test has been scheduled yet. Next, the algorithm assigns $C_2$ to $t = 0$ and checks for feasibility. If no constraint is violated, the algorithm proceeds to the next core. Otherwise, $C_2$ is rescheduled to the next potential start point, which coincides with the finish time of $C_3$. This procedure repeats until all four tests are scheduled. The algorithm loops through all $N_c$ cores in the list, and in the worst case the last core requires $N_c$ placement attempts. Therefore, the algorithm has the runtime complexity $O(N_c^2)$ and requires low CPU time even for large SoCs.
As the order of cores defines the constructed schedule, we can explore the solution space by perturbing this order. In our approach, we start with an initial (random) core-test order and move a randomly selected core to another position in the list.
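The greedy schedule construction and the order perturbation can be sketched as follows. The sketch assumes two constraints, a shared-TAM-wire conflict and a maximum test power `p_max`, and uses the standard Metropolis criterion for acceptance; the exact constraints, cost function, and acceptance expression of the algorithm in Figure 13 are not reproduced here, and the core data are hypothetical.

```python
import math
import random

def feasible(start, end, power, wires, sched, cores, p_max):
    """Check TAM-wire and power constraints for a test in [start, end)."""
    overlapping = [n for n, (s, e) in sched.items() if s < end and start < e]
    if any(not wires.isdisjoint(cores[n][2]) for n in overlapping):
        return False
    # Power is piecewise constant: check at the start and at every later
    # start time of an overlapping test.
    points = {start} | {sched[n][0] for n in overlapping if sched[n][0] > start}
    for t in points:
        load = power + sum(cores[n][1] for n in overlapping
                           if sched[n][0] <= t < sched[n][1])
        if load > p_max:
            return False
    return True

def greedy_schedule(order, cores, p_max):
    """cores: name -> (test_time, power, set_of_wires). Returns name -> (start, end)."""
    sched = {}
    for name in order:
        t_len, power, wires = cores[name]
        if power > p_max:
            raise ValueError(name + " exceeds the power limit on its own")
        # Candidate start points: time 0 and the finish times of scheduled tests.
        for start in sorted({0} | {end for _, end in sched.values()}):
            if feasible(start, start + t_len, power, wires, sched, cores, p_max):
                sched[name] = (start, start + t_len)
                break
    return sched

def makespan(sched):
    return max(end for _, end in sched.values())

def perturb(order, rng):
    """Move one randomly selected core to a different position in the list."""
    new = list(order)
    i = rng.randrange(len(new))
    core = new.pop(i)
    new.insert(rng.choice([k for k in range(len(order)) if k != i]), core)
    return new

def accept(f_new, f_cur, temp, rng):
    """Metropolis criterion (assumed): always accept if not worse."""
    return f_new <= f_cur or rng.random() < math.exp(-(f_new - f_cur) / temp)

# Hypothetical cores: (test time, test power, TAM wires).
cores = {"C1": (4, 3, {1, 2}), "C2": (3, 2, {3}), "C3": (5, 4, {1}), "C4": (2, 3, {4})}
order = ["C3", "C2", "C4", "C1"]
sched = greedy_schedule(order, cores, p_max=7)
```

With these values, C2 starts together with C3, C4 has to wait for C2 to finish because of the power limit, and C1 has to wait for C3 because both use wire 1.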
Next, we present simulation results to evaluate the proposed heuristic method for robust optimization, which was implemented in C++. Since there are no publicly available 3D-IC benchmarks, we applied our methods to SoCs from the ITC'02 SoC Test Benchmark set [37] and two industrial SoCs that we use as dies in a 3D stack in our simulations. We approximated the test time of each core by multiplying the length of the longest scan chain by the test pattern count. In order to test the performance of the framework, we created an SoC with 100 cores, comprising 16 large identical cores with 32-bit TAM and relatively high test power and long test time, 32 medium identical cores with 16-bit TAM and medium test power and test time, and 52 small cores with randomly chosen parameters.
Without loss of generality, we assume three discrete points for the maximum available test power $P_{max}$, such that the nominal value occurs with the probability of 0.6 and the other two points $P_{max} \pm \Delta P$ with the probability of 0.2 each. Furthermore, we assume three different sets of $T_i$: (1) a set with nominal $T_i$, (2) a set in which some of the core test times are reduced by 10%, and (3) a set in which some of the core test times are reduced by 20%. Each set can occur with equal probability. We limit the possible configurations to n-bit, 2n-bit, and 4n-bit configurations, as shown in Figure 12, and assign equal probabilities to them. As all three input parameters can take three different values independently from each other, the total number of scenarios we consider is $S = 3^3 = 27$. In contrast to the exact ILP model, the complexity of our heuristic algorithm grows only linearly with the number of scenarios, as we solve the scheduling problem for each scenario independently. Therefore, we can easily extend our framework to handle more uncertain parameters while maintaining reasonable CPU runtimes.
In order to demonstrate the advantage of robust solutions, we performed the following simulation. For each SoC, we obtained an optimal solution assuming a single point in the input parameter space (the nominal value of $P_{max}$ and $T_i$, and the n-bit configuration). For the robust solutions, we considered 27 scenarios for variable $P_{max}$, $T_i$, and conf. The non-robust solutions were evaluated for the same scenarios as the robust solutions in order to compare their performance in the presence of uncertainties in the input parameters. Table III shows the input parameters $W_{m,n}$ (TAM width for the case of the n-bit configuration), $P_{m,n}$ (nominal value of $P_{max}$), and a comparison of non-robust and robust solutions in terms of $T_{exp}$ and max. Note that the difference in max for the 100-core benchmark is only 16%.
Due to the combination of the input parameters for this benchmark, the nonrobust algorithm was able to find a non-robust solution that was close to the optimum ( max = 0.22), hence the robust algorithm had little room to improve the solution.
As the results indicate, test architectures that were optimized using the proposed robust optimization method lead to significantly lower max , i.e., the robust solutions stay "closer" to point-optimal solutions compared to non-robust solutions. In addition, we observe a measurable improvement of the expectation of test time for most SoCs.
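The scenario space used in this evaluation can be sketched as follows. The probabilities are those stated above. The function names, the scenario labels, and the definition of the worst-case metric as the largest relative gap to the point-optimal test time are assumptions made for illustration, not the authors' implementation.

```python
from itertools import product

# Discrete distributions from the text: (label, probability).
P_MAX = [("nominal", 0.6), ("plus_dP", 0.2), ("minus_dP", 0.2)]
T_SETS = [("nominal", 1 / 3), ("reduced_10", 1 / 3), ("reduced_20", 1 / 3)]
CONFIGS = [("n", 1 / 3), ("2n", 1 / 3), ("4n", 1 / 3)]


def enumerate_scenarios():
    """Return all (scenario, probability) pairs; parameters vary independently."""
    return [((p, t, c), pp * tp * cp)
            for (p, pp), (t, tp), (c, cp) in product(P_MAX, T_SETS, CONFIGS)]


def expected_test_time(test_time_of, scenarios):
    """Probability-weighted test time of one architecture over all scenarios."""
    return sum(prob * test_time_of(s) for s, prob in scenarios)


def max_relative_gap(test_time_of, point_optimal_of, scenarios):
    """ASSUMED worst-case metric: largest relative gap to the point-optimal time."""
    return max((test_time_of(s) - point_optimal_of(s)) / point_optimal_of(s)
               for s, _ in scenarios)


scenarios = enumerate_scenarios()   # 27 scenarios whose probabilities sum to 1
```

Because each scenario is evaluated independently, the cost of both metrics grows linearly with the number of scenarios, which mirrors the scaling argument made for the heuristic.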
V. MASSIVE SIGNAL TRACING USING ON-CHIP DRAM FOR SILICON DEBUG
Traditional debug solutions focus on using dedicated design-for-debug (DfD) circuitry. However, today's system-on-chip designs (SoCs) offer resources, such as embedded processors and large amounts of fast embedded memories, that can be exploited for efficient test and post-silicon debug. Examples of such systems are 3D ICs with wide-I/O DRAM [38] or traditional ICs with embedded DRAM (eDRAM) [39]. We propose a debug solution that utilizes these functional on-chip resources and provides benefits such as better test compression, on-chip fault diagnosis, as well as signal tracing with significantly increased observation windows [40].
Several design-for-debug solutions have been proposed in the past to provide observability of a circuit's internal signals [7], [41], [42]. Signal tracing is a commonly used technique for post-silicon debug. The main idea of this method is to localize bugs in digital logic by observing internal nodes of the circuit during test-program execution. The traced data is usually stored in on-chip trace buffers that are read after program execution. Due to the limited size of the trace buffers, only a few signals can be observed over a relatively short period of time, which is typically a fraction of the total run time of a test program.
In order to overcome these limitations, a number of innovative methods have been proposed in the past [43], [44]. These methods increase the observation window by running multiple iterative debug sessions with intermediate processing steps. In our approach, we use available on-chip DRAM for trace-data storage, thereby extending the observation window by orders of magnitude compared to previously proposed methods. In addition, we use multiple-input signature registers (MISRs) in order to generate compact signatures based on the traced data. These signatures are compared to "golden" signatures in order to identify erroneous intervals of trace data.
Fig. 14 shows an overview of the proposed debug flow. In Step 1, prior to the actual debug session, "golden" signatures are generated by simulation of the test program using a behavioral model of the circuit. Alternatively, the signatures can be calculated using emulation on an FPGA prototyping board. This approach is especially useful for large SoCs, simulations of which may be impractical. As multiple sets of internal signals can be tapped for tracing, a separate set of golden signatures is needed for each signal set. However, all signature sets can be generated in one simulation run. Steps 2-4 are applied on the debug equipment. In Step 2, a set of golden signatures for the selected internal bus to trace is uploaded to the circuit under test (CUT), for instance, through an IEEE 1149.1 (JTAG) interface or through shared I/O chip pins. In Step 3, the test program is executed, the selected signals are traced, and the trace data is stored into the DRAM. In Step 4, the stored data is downloaded to the external equipment and stored for analysis.
Next, another debug session can be executed with a different setup, for instance, with (a) different trace signals, (b) a different test program, or (c) different temperature and voltage settings. Finally, in Step 5, the obtained trace data is analyzed, which can be performed offline on a workstation, freeing the test equipment for other experiments.
Fig. 15 provides an overview of the proposed architecture, including the data flow between modules. We capture L t signals at the functional clock frequency and shift the sampled values into an L t × M t trace buffer, where L t is the width and M t is the depth of the buffer. This buffer serves as a temporary storage of trace data captured in one time interval of length M t. At the same time, we feed the observed signals through an L t -to-K t compactor into a K t -bit multiple-input signature register (MISR) to identify failing intervals and skip the storage of error-free intervals. The compactor is optional and its purpose is to match the input width of the MISR to the width of the observed bus; it can be implemented as an XOR tree. The MISR calculates a K t -bit long signature for each interval of M t cycles that is used to identify failing intervals by comparing it with expected signatures that are stored in DRAM. Once a signature miscompare is detected, the captured data from the current interval is uploaded to the shadow buffer, from where it is shifted into the DRAM, together with the number of the current interval as time stamp S t. The trigger module is optional and it can be used to start capturing trace signals upon given transactions occurring on the bus. This module constantly tries to match the observed signals with a pre-loaded pattern and when a match occurs, a trigger signal is asserted, which triggers the starting point of signature generation.
The control logic starts and ends the debug session and provides control signals to other DfD blocks and DRAM. By storing only erroneous intervals, we effectively achieve compression of trace data. We can estimate the compression ratio based on the frequency of errors in the stream of trace data, using the total amount of observed trace data as a baseline. The data that we store includes the actual trace data, the expected signatures, and the time stamps. Fig. 16 shows the compression ratio r as a function of the error probability p for the case L t = K t = 32, S t = 32, and different values of M t. For practical reasons, all implementation-related input parameters are powers of two. For low error probabilities, only a few intervals are erroneous and need to be stored, hence the stored data is dominated by the expected signatures, allowing for high compression. In the case of a high error probability, nearly all intervals are erroneous and need to be stored, in addition to the expected signatures, hence no compression is achieved. This observation is consistent with Fig. 16, as M t = 128 provides the highest compression for small values of p, and M t = 1024 provides the highest compression for large values of p.
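The compression estimate described above can be written down as a short sketch. It counts, per interval, the raw trace bits against the stored bits (one expected signature per interval, plus trace data and time stamp for erroneous intervals). The closed form and the interpretation of p as an independent per-cycle error probability are assumptions made for illustration; the exact model behind Fig. 16 is not given in this section.

```python
def compression_ratio(p, l_t=32, k_t=32, s_t=32, m_t=128):
    """Estimated ratio of observed trace bits to stored bits.

    ASSUMPTION: p is the probability that the traced word is erroneous in
    any given cycle, independently of other cycles.
    """
    # Probability that an interval of m_t cycles contains at least one error.
    q = 1.0 - (1.0 - p) ** m_t
    raw_bits = l_t * m_t                       # trace data observed per interval
    stored_bits = k_t + q * (raw_bits + s_t)   # signature + stored failing data
    return raw_bits / stored_bits


# Sweep over the interval lengths discussed in the text.
ratios = {m: compression_ratio(1e-3, m_t=m) for m in (128, 256, 512, 1024)}
```

Under these assumptions and with L t = K t = S t, the ratio equals M t for p = 0 and approaches M t / (M t + 2) as p approaches 1, i.e., slightly below one, which matches the statement that no compression is achieved at high error probabilities.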
We created a Verilog RTL model of the proposed architecture shown in Fig. 15 and synthesized it using a 45 nm CMOS library [45]. For the case L t = K t = 32, S t = 32, J t = 8, H t = 4, and M t = 128, the synthesized circuit contained 8357 sequential and 8783 combinational gates. The standard-cell area of the design was 0.062 mm², which is negligible for realistic chips. For instance, on a 25 mm² die, the DfD circuitry would only occupy 0.25% of the die area.
We verified the debug module with a Verilog model of a hyperpipelined microprocessor OpenRISC 1200 HP [16] . We connected the debug module the decoder to an address bus within the CPU.
We executed a random instruction sequence over 20,000 clock cycles and recorded 313 golden signatures. Next, we injected bugs in the control flow of the program-counter generation unit. Similar to the previous simulation, some bugs corrupted all intervals. However, a certain bug resulted in only six erroneous intervals that were scattered over the entire observation window of 313 intervals. Therefore, only 2% of the intervals had to be stored in the DRAM for analysis. Another bug in the datapath of the ALU caused several short error bursts, resulting in only four erroneous intervals (1.3%) that occurred at the end of the observation window. In practice, this kind of bug requires an extensive effort for localization, as the observed data is erroneous for only a few clock cycles during the entire observation window. Using the proposed method, we can efficiently identify and store the erroneous intervals in only one iteration.
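The interval-based filtering used in these experiments can be sketched in software as follows. This is a behavioral illustration, not the synthesized RTL: the feedback polynomial (the CRC-32 polynomial), the zero initial state, and all function names are assumptions, and the optional compactor and trigger module are omitted.

```python
K_T = 32                    # signature width in bits (value used in the text)
MASK = (1 << K_T) - 1
POLY = 0x04C11DB7           # ASSUMPTION: feedback polynomial, not from the paper


def misr_step(state, word):
    """One clock cycle: shift with polynomial feedback, then XOR in the input."""
    msb = (state >> (K_T - 1)) & 1
    state = (state << 1) & MASK
    if msb:
        state ^= POLY
    return state ^ (word & MASK)


def interval_signature(words):
    """Signature of one interval, starting from an all-zero MISR state."""
    state = 0
    for w in words:
        state = misr_step(state, w)
    return state


def failing_intervals(trace, golden, m_t):
    """Return (time stamp, interval data) for each interval that miscompares."""
    stored = []
    for start in range(0, len(trace), m_t):
        interval = trace[start:start + m_t]
        stamp = start // m_t
        if interval_signature(interval) != golden[stamp]:
            stored.append((stamp, interval))   # only failing intervals are kept
    return stored


# Toy example: 16 traced words, intervals of 4 cycles, one corrupted word.
reference = list(range(16))
golden = [interval_signature(reference[i:i + 4]) for i in range(0, 16, 4)]
observed = list(reference)
observed[9] ^= 0x1
flagged = failing_intervals(observed, golden, m_t=4)
```

In the toy example only the interval with time stamp 2 is stored, together with its four trace words; the three error-free intervals are skipped, which is the source of the compression effect.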
VI. CONCLUSIONS
This paper has covered several research topics related to testing, design-for-test, and debug for 3D-stacked ICs.
We have presented a solution for TSV-stress-aware test-pattern generation. We have shown that neglecting TSV stress leads to a higher test escape rate compared to that obtained using TSV-stress-aware models. The use of the proposed TSV-stress-aware flow reduces the test escape rate to what can be achieved in the absence of TSV-induced stress, without a noticeable impact on the test-pattern count.
Detecting defects in TSVs at the pre-bond stage is challenging due to their limited accessibility. We have proposed a solution that employs ring oscillators and duty-cycle detectors to target TSV defects that can be modeled as resistive-open faults and leakage faults. In order to diagnose these faults, we use a regression model based on artificial neural networks. This model distinguishes between resistive-open and leakage faults and estimates their size. The oscillation period and the duty cycle are measured at multiple voltage levels for higher diagnosis accuracy. Simulation results show that this method is able to accurately diagnose TSV faults even in the presence of process variations.
We have developed a framework for robust optimization of test-access architectures for 3D ICs. In contrast to traditional, non-robust optimization methods, the proposed approach takes uncertainties in input parameter values into account, such as test power limits and test-data bandwidth. Simulation results showed that non-robust solutions may be far away from point-optimal solutions when the input parameter values change, whereas robust solutions remain close to point-optimal solutions for a variety of scenarios.
Finally, we have presented a method for on-chip signal tracing for 3D ICs that integrate large amounts of DRAM. The proposed debug architecture detects erroneous behavior of the observed signals using pre-calculated signatures and stores the signal values into functional DRAM, from which they can be transferred to external debug equipment. Compared to traditional signal tracing methods that use dedicated on-chip trace buffers, the proposed method allows for debug sessions that are orders of magnitude longer, significantly simplifying the debug effort.
In summary, this paper has addressed challenges in testing and design-for-test of 3D ICs that have been identified as a showstopper for volume manufacturing and commercial exploitation. The presented techniques enable low-cost test of 3D ICs and hence bring 3D integration technology closer to adoption in mainstream products.
