I. INTRODUCTION
Three-dimensional integrated circuit (IC) technology with through-silicon vias (TSVs) is a new technology that allows the vertical stacking and interconnecting of multiple die into one 3-D IC [10], [14]. There are a number of benefits and motivations for developing 3-D ICs, including 1) a better form factor realized from the increased density from vertical integration [3]; 2) increased performance due to the improvement in interconnect delay because of short TSV length; 3) heterogeneous integration, where different functional die, such as memory, logic, and sensors, are fabricated separately and then integrated together; and, finally, 4) cost, as 3-D technology might offer a cheaper alternative path to increase semiconductor integration without the need to resort to prohibitively expensive 2-D lithographic geometric shrinking.
There are a number of integration methods used in 3-D IC fabrication: wafer-to-wafer (WTW), die-to-wafer (DTW), and die-to-die (DTD). These methods play an important role in determining the final yield of 3-D ICs [1], [2], [10], [14]. In wafer-to-wafer integration, entire wafers are directly bonded together. WTW offers the highest throughput and allows for the thinnest wafers. Since the minimum TSV diameter is limited by the via's aspect ratio, WTW supports TSVs with the smallest via diameters, as it has the thinnest wafers, which in turn allows for greater TSV density. However, WTW can incur a serious yield loss, as there is no way to separate the good die in advance. With WTW integration, a bad die from one wafer can end up integrated with a good die in another wafer, yielding an overall bad 3-D IC. Die-to-wafer and die-to-die integration can improve the yield of 3-D ICs, as they allow the die to be diced and tested in advance so that only the good ones are used during the 3-D integration process. DTD and DTW also allow the use of different wafer and die sizes.
This flexibility, however, comes at additional test and bonding costs [11], lower throughput, and lower TSV density. Yield loss can be mitigated through the use of redundancy, as in the case of 3-D DRAM ICs or 3-D multicore processors [10]. Furthermore, some applications, especially in high-end systems, require a small pitch that is only attainable through WTW, irrespective of the yield. The objective of this paper is to develop techniques that improve the yield of WTW integration. As a wafer lot contains many wafers (typically 25), one way to improve the yield of WTW integration is to first test the wafers in the different wafer lots, and then match the wafers together during integration so as to increase the number of good 3-D ICs. Fundamentally, we should match wafers from different lots to reduce (or, at best, avoid) the chance that a good die from one wafer ends up integrated with a bad die from another wafer. In this paper, we thoroughly investigate this flexibility and develop optimal methods that maximize the yield of WTW 3-D integration. The contributions of this paper are as follows.
• We formulate the yield maximization problem in wafer-to-wafer 3-D integration technology. We provide hardness results for this problem and show special cases where it can be solved optimally in polynomial time.
• We propose a number of effective heuristic and optimal solutions to solve the problem. Our algorithms offer a graceful tradeoff in terms of quality of results as measured by yield and scalability as measured by runtime and memory requirements.
• Using realistic defect models and yield analysis simulations, we provide comprehensive experimental results that demonstrate the effectiveness of our proposed algorithms in improving the yield of wafer-to-wafer 3-D integration for large numbers of wafer stacks.
• Our results demonstrate that our proposed optimal integration techniques can improve the yield (reaching up to 25%) in comparison to yield-oblivious integration strategies.
The organization of this paper is as follows. Section II provides a brief overview of the related research. In Section III, we formulate the main problem of maximizing the yield of wafer-to-wafer integration and propose a number of solutions. Section IV provides a comprehensive set of experimental results and conclusions that demonstrate the effectiveness of our proposed approaches.
II. PREVIOUS WORK
Despite the importance of the yield on the cost-effectiveness of 3-D technology [1], [10], there are few works that directly address the yield problem [1], [5], [10], [11], [12]. Yield loss in WTW integration can happen either due to defects in the individual wafers that constitute the stack, or due to defects that result from the 3-D integration process (e.g., during TSV creation or bonding). The defects that impact the individual wafers result from typical random defect mechanisms that impact 2-D ICs. Generally, the larger the die area, the larger the chance it includes one or more defects; thus, wafers with large die printed on them will have a lower die yield than wafers with small die. If two types of wafers are made in the same fabrication process, then they are subject to the same defect density. If the wafers are made with different fabrication processes, a possibility with 3-D ICs, then they are likely to have different defect densities. Defects impacting different wafers are typically uncorrelated, and the modeling of such defects has been researched to maturity in the past [7], [8], [13]. For example, the negative binomial distribution [13] is typically used as a good model for the distribution of defects on semiconductor wafers.
To address yield loss in 3-D ICs, a few techniques have been proposed so far. Patti [10] suggests incorporating redundant resources into the 3-D IC to make potential stacked devices (such as memories and FPGAs) repairable in the presence of defects. More recently, Ferri et al. [5] suggest improving the parametric yield of DTW and DTD integration by carefully matching the speed of the die that are integrated in the 3-D stack, and Smith et al. [12] suggest matching the wafers in WTW integration to improve the yield. Finally, Smith et al. [11] investigate the implications of 3-D IC yield on the cost of WTW, DTW, and DTD integration methodologies.
III. PROBLEM FORMULATION AND PROPOSED SOLUTIONS
The defect wafer map of some wafer $W_i$ can be represented as a
It is easy to show that for the general case of $K \geq 3$, the classical NP-hard 3-D matching problem (one of the original six NP-hard problems considered by Garey and Johnson [6]) is reducible to the functional yield maximization problem. While this result diminishes the possibility of finding optimal solutions for increasing $N$ and $K$ in a feasible runtime, we will later show that it is possible to obtain optimal solutions for $K = 2$ in polynomial time, and we will demonstrate in the experimental results section (Section IV) optimal results for up to $K = 4$ wafer stacks. The hardness result also points out the importance of developing heuristic solutions that scale in performance, runtime, and memory requirements for general values of $N$ and $K$.
A. Greedy Heuristic
As discussed earlier, there are $N^K$ possible different 3-D integration stacks. In an attempt to find the best $N$ wafer stacks that maximize the total yield, it is possible to devise a greedy heuristic to solve the yield maximization problem. A greedy heuristic first forms a list of all possible $N^K$ wafer stacks. Then, for every wafer stack, the heuristic calculates the number of resultant good 3-D ICs after taking into account the distribution of good die on each wafer as given by the wafers' defect maps. The heuristic then sorts the list in descending order according to the number of good 3-D ICs of each stack. The list is then traversed in order, where a wafer stack is chosen as long as none of its constituent wafers participated in an earlier chosen wafer stack. Fig. 2 gives a summary of the greedy algorithm. Note that the runtime complexity of the algorithm is equal to $O(K N^K \log N)$, and the memory requirement is equal to at least $(K + 1)N^K$ bytes needed to store the list of possible wafer stacks for sorting purposes. As our experimental results later show, the memory requirement turns out to be a limiter toward the application of the greedy algorithm for wafer stacks with large numbers of wafers, $K > 5$. For example, for $N = 25$ (industrial lot size for 300 mm wafers) and $K = 6$, the algorithm would require at least 1.7 GB of memory, and 48 GB of memory for $K = 7$.
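The greedy procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation from Fig. 2: it assumes each wafer's defect map is a tuple of 0/1 flags (1 = good die at that position), and the function names `stack_good_count` and `greedy_stacks` are hypothetical.

```python
from itertools import product

def stack_good_count(maps):
    """Count die positions that are good on every wafer of the stack."""
    return sum(all(bits) for bits in zip(*maps))

def greedy_stacks(lots):
    """lots[k][i] is the assumed 0/1 defect map of wafer i in lot k."""
    n = len(lots[0])
    # Enumerate all N^K candidate stacks (one wafer index per lot).
    candidates = []
    for idx in product(range(n), repeat=len(lots)):
        good = stack_good_count([lots[k][i] for k, i in enumerate(idx)])
        candidates.append((good, idx))
    # Sort by the number of good 3-D ICs, best first.
    candidates.sort(key=lambda c: c[0], reverse=True)
    used = [set() for _ in lots]   # wafers already consumed, per lot
    chosen = []
    for good, idx in candidates:
        # Accept a stack only if none of its wafers was used earlier.
        if all(i not in used[k] for k, i in enumerate(idx)):
            chosen.append((idx, good))
            for k, i in enumerate(idx):
                used[k].add(i)
            if len(chosen) == n:
                break
    return chosen
```

The candidate list holds all $N^K$ stacks at once, which is the memory bottleneck the text points out; the sketch is only practical for small $N$ and $K$.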
B. Iterative Matching Heuristic (IMH)
To understand the proposed iterative matching heuristic, we first consider the special case where $K = 2$, i.e., where there are only two wafer lots $\{L_1, L_2\}$. This special case can be solved optimally using a graph-theoretical framework as follows. First, we construct a bipartite graph composed of $2N$ vertices and $N^2$ edges as shown in Fig. 1. The first set of $N$ vertices corresponds to the wafer maps of the first lot $L_1$, and the second set of $N$ vertices corresponds to the wafer maps of the second lot $L_2$. Each edge is labeled by the number of good die produced from integrating the wafers at its end points. In this case, an optimal matching or assignment that maximizes the total yield, as measured by the total number of good die, can be found using the Hungarian algorithm [9]. We will use a left-precedence operator to denote the optimal matching operation on two wafer map lots; thus, the set of wafer maps resulting from optimally integrating lots $L_1$ and $L_2$ can be expressed by $L_1\,L_2$.
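The two-lot case can be illustrated with a compact implementation of the Hungarian (Kuhn-Munkres) method. This is a sketch under stated assumptions rather than the code used in the paper: wafer maps are assumed to be tuples of 0/1 flags, the helper names `pair_weights` and `max_weight_matching` are hypothetical, and maximization is obtained by negating the edge weights inside a standard minimum-cost formulation.

```python
def pair_weights(lot1, lot2):
    """Edge weights of the bipartite graph: good die when wafer a meets wafer b."""
    return [[sum(x & y for x, y in zip(a, b)) for b in lot2] for a in lot1]

def max_weight_matching(weight):
    """Return (assignment, total) maximizing sum(weight[i][assignment[i]]).

    weight is a square N x N list of lists; runs in O(N^3).
    """
    n = len(weight)
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials
    v = [0] * (n + 1)    # column potentials
    p = [0] * (n + 1)    # p[j]: row matched to column j (1-based, 0 = none)
    way = [0] * (n + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    # Negated weight turns maximization into minimization.
                    cur = -weight[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[p[j] - 1] = j - 1
    total = sum(weight[i][assignment[i]] for i in range(n))
    return assignment, total
```

A call such as `max_weight_matching(pair_weights(lot1, lot2))` returns, for each wafer of the first lot, the index of its partner in the second lot, together with the total number of good stacked die.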
We propose to extend the matching algorithm heuristically by applying it iteratively. Given a set of wafer lots $L = \{L_1, L_2, \ldots, L_K\}$, the final wafer map can be iteratively calculated as follows:
One issue that needs to be considered is finding a good iteration order, i.e., the values of $i_1, \ldots, i_K$, in which to carry out the matching. To resolve this issue, at any iteration $j$ our algorithm picks the lot $L_i$ that gives the largest number of good die when optimally matched to the wafer maps $L_{i_1} \cdots L_{i_{j-1}}$ resulting from the previous $j - 1$ iterations. The first wafer lot can be chosen either randomly or according to the number of good die. The algorithm is formally described in Fig. 3. The runtime of the algorithm is O(K (assuming the Hungarian algorithm is used for pair-wise lot matching) and the memory requirement is $O(N^2)$. We stress that IMH is guaranteed to be optimal for only two wafer lots ($K = 2$). For more than two lots, IMH is no longer guaranteed to be optimal and is only a heuristic. Our experimental results in Section IV show that it provides results very close to optimal. Note that the order of wafer lot integration in the algorithm has no relationship whatsoever with the order of integration of the actual wafers during fabrication. The final output of the algorithm is the assignment of each wafer to a wafer stack. The integration of the wafers that belong to a wafer stack will be carried out in order during 3-D fabrication.
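The iteration described above can be sketched as follows. This is an illustrative reading of the heuristic, not the algorithm of Fig. 3 verbatim. Two assumptions are made for brevity: wafer maps are tuples of 0/1 flags, and the pairwise optimal matching is found by exhaustive search over permutations, a stand-in for the Hungarian algorithm that is only usable for very small $N$. All function names are hypothetical.

```python
from itertools import permutations

def merge(a, b):
    """AND two 0/1 wafer maps: a position survives only if good on both."""
    return tuple(x & y for x, y in zip(a, b))

def best_pairing(current, lot):
    """Optimally match partial stacks in `current` with wafers in `lot`.

    Exhaustive search; a stand-in for the Hungarian algorithm.
    """
    n = len(current)
    def good(perm):
        return sum(sum(merge(current[i], lot[perm[i]])) for i in range(n))
    perm = max(permutations(range(n)), key=good)
    return perm, good(perm)

def iterative_matching(lots):
    """Return (stacks, total_good); stacks[s][k] is the wafer index from lot k."""
    k_lots, n = len(lots), len(lots[0])
    # Seed with the lot holding the most good die (one of the two
    # options the text allows for the first lot).
    first = max(range(k_lots), key=lambda k: sum(map(sum, lots[k])))
    current = list(lots[first])
    stacks = [{first: i} for i in range(n)]
    remaining = set(range(k_lots)) - {first}
    total = sum(map(sum, current))
    while remaining:
        # Try every remaining lot and keep the one whose optimal
        # matching preserves the largest number of good die.
        results = {k: best_pairing(current, lots[k]) for k in sorted(remaining)}
        k = max(sorted(remaining), key=lambda r: results[r][1])
        perm, total = results[k]
        current = [merge(current[i], lots[k][perm[i]]) for i in range(n)]
        for i in range(n):
            stacks[i][k] = perm[i]
        remaining.remove(k)
    return [tuple(s[k] for k in range(k_lots)) for s in stacks], total
```

Only the $N$ merged wafer maps and one $N \times N$ set of pairings are alive at any time, which reflects why the heuristic avoids the $N^K$ memory growth of the greedy and ILP approaches.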
C. Optimal Integration Using ILP
To find the optimal integration strategy for general values of $K$,
While the computational runtime complexity incurred from using ILP solvers can be significant, memory will turn out to be the real limiter, as specifying the indices and values of the non-zero entries of the sparse constraint matrix requires 3 2 $(K + 1)N^K$ bytes.
D. Upper Bounds to the Optimal Solution
An upper bound to the optimal solution can be found by relaxing the ILP and allowing the program variables $x_{i_1, i_2, \ldots, i_K}$ to take fractional values. In this case, the constraint $0 \leq x_{i_1, i_2, \ldots, i_K} \leq 1$ is added for each variable in the program, and then the program is solved using standard linear programming techniques (e.g., the simplex method or interior point methods). Standard linear programming solvers are typically quite fast; however, in our case, the main bottleneck will be the memory needed to specify the sparse constraint matrix, especially as $K$ and $N$ increase in value, as explained in the previous subsection.
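For concreteness, one plausible form of the relaxed program is sketched below. It is an assumption, not the paper's own statement of the ILP: the symbol $c_{i_1, \ldots, i_K}$ is introduced here to denote the number of good 3-D ICs obtained by stacking wafer $i_k$ from each lot $L_k$, and the equality constraints express that every wafer is used in exactly one stack.

```latex
\begin{aligned}
\max\ & \sum_{i_1=1}^{N} \cdots \sum_{i_K=1}^{N}
        c_{i_1,\ldots,i_K}\, x_{i_1,\ldots,i_K} \\
\text{s.t.}\ & \sum_{(i_1,\ldots,i_K)\,:\; i_k = a} x_{i_1,\ldots,i_K} = 1,
        \qquad k = 1,\ldots,K,\quad a = 1,\ldots,N, \\
& 0 \le x_{i_1,\ldots,i_K} \le 1 .
\end{aligned}
```

Under this form, each of the $N^K$ variables appears in $K$ constraints plus the objective, i.e., $K + 1$ coefficients per variable, which would be consistent with the $(K + 1)N^K$ factor in the memory requirement quoted for the ILP. Replacing the bound constraint by $x_{i_1,\ldots,i_K} \in \{0, 1\}$ would recover the integer program.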
IV. EXPERIMENTAL RESULTS
In this section, we demonstrate the effectiveness of the proposed algorithms in maximizing the functional yield of wafer-to-wafer 3-D integration through a set of comprehensive experiments. The following settings apply to all of our experiments.
• The classical negative binomial distribution [13] is used to generate defect wafer maps, where the yield of an individual wafer is given by $(1 + A D_0/\alpha)^{-\alpha}$, where $\alpha$ is the defect clustering ratio, $D_0$ is the defect density, and $A$ is the area of the die. We use $\alpha = 4$ for the defect clustering ratio in all experiments. We assume 300-mm wafers with 3-mm edge exclusion on the periphery.
The gross number of die per wafer is given by $\pi R_e^2/A - 2\pi R_e/\sqrt{A} + \pi$ [4], where $R_e$ is the effective wafer radius.
For all experiments but one, we assume a standard wafer lot size of 25 wafers. We vary the die area, defect density, and number of wafers in the 3-D stack depending on the experiment.
• All proposed algorithms are implemented in C++ and compiled with -O3 optimizations. The basic Hungarian algorithm is implemented to compute the optimal matching of wafers in two wafer lots, and the GNU Linear Programming Kit (GLPK) is used to compute the solution to the integer linear program together with the solution to the relaxed linear program.
Impact of Defect Density. In the first set of experiments, we investigate the impact of the defect density per wafer on the final yield of the produced 3-D ICs. We compare the performance of the proposed integration algorithms at different defect densities. The die area is assumed to be 1 cm$^2$, which gives about 590 die per wafer for a 300-mm wafer. We set the number of wafers in the 3-D stack to $K = 3$ and vary the defect density to result in yields from 30% to 90% per wafer. In Table I, we report two values for each integration algorithm: 1) the overall yield of 3-D ICs, and 2) the number of produced good 3-D ICs normalized to the number of 3-D ICs produced from random assignment. The latter value gives the advantage of deploying our techniques over a yield-oblivious random assignment integration. Furthermore, the normalized value gives the direct increase in revenue from using our algorithms.
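The experimental settings above can be checked numerically with a short sketch. The function names and default arguments are hypothetical; the formulas are the negative binomial yield model and the gross-die-per-wafer expression as stated in the settings, and `defect_density_for_yield` simply inverts the yield model to find the defect density that produces a target per-wafer yield.

```python
import math

def die_yield(area_cm2, d0_per_cm2, alpha=4.0):
    """Negative binomial yield model: (1 + A*D0/alpha) ** (-alpha)."""
    return (1.0 + area_cm2 * d0_per_cm2 / alpha) ** (-alpha)

def defect_density_for_yield(target_yield, area_cm2, alpha=4.0):
    """Invert the yield model to get D0 for a target per-wafer yield."""
    return alpha * (target_yield ** (-1.0 / alpha) - 1.0) / area_cm2

def gross_die_per_wafer(area_mm2, wafer_diameter_mm=300.0, edge_exclusion_mm=3.0):
    """pi*Re^2/A - 2*pi*Re/sqrt(A) + pi, with Re the effective radius in mm."""
    r_e = wafer_diameter_mm / 2.0 - edge_exclusion_mm
    return (math.pi * r_e ** 2 / area_mm2
            - 2.0 * math.pi * r_e / math.sqrt(area_mm2)
            + math.pi)

# A 1 cm^2 (100 mm^2) die on a 300-mm wafer with 3-mm edge exclusion.
print(round(gross_die_per_wafer(100.0)))
```

With an effective radius of 147 mm, the expression evaluates to roughly 589.6, in line with the figure of about 590 die per wafer quoted in the text.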
The results show that the proposed integration algorithms consistently lead to an improved overall yield compared to a random yield-oblivious assignment. The defect density and, hence, the yield per wafer is a factor of the design, the process technology and the fabrication facility. Thus, for a given wafer yield dictated by these factors, the proposed techniques result in quite significant improvements. For example, at 50% yield per wafer, the optimal technique (ILP) gives a 21.9% improvement over random assignment, i.e., the revenues will be multiplied by 1.219. The results also show that the upper bounds calculated through relaxing the ILP are quite close to the optimal solution.
One may wonder if the random assignment technique might give comparable results to the proposed algorithms if different random assignments are simulated and the best one is picked and applied during integration. To test that possibility we executed 10000 different random integration assignments for the case at yield = 50%. The different simulations give results around the reported average of 12.37% with a standard deviation of 0.206 and a maximum of 13.02%; these results are far from the optimal yield value 15.08%.
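The numbers quoted above can be cross-checked with a short calculation. The first line is a simplifying assumption introduced here, not a result from the paper: if each die is good with probability $y$ independently on each of the $K$ wafers, the expected yield of a randomly assigned stack is $y^K$. The second line uses only the yields reported in the text.

```latex
\begin{aligned}
Y_{\text{random}} &\approx y^{K} = 0.5^{3} = 12.5\%
  && \text{(reported average: } 12.37\%\text{)} \\
\frac{Y_{\text{ILP}}}{Y_{\text{random}}} &= \frac{15.08}{12.37} \approx 1.219
  && \text{(the reported } 21.9\%\text{ improvement)}
\end{aligned}
```

Even the best of the 10000 random assignments, at 13.02%, corresponds to a ratio of only about $13.02/12.37 \approx 1.05$.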
Impact of Die Area. Increasing the die area decreases the number of produced die per wafer and also reduces the yield, as a defect would destroy a larger portion of the wafer when the die are larger. To study the performance of the proposed algorithms under various die sizes, we choose a defect density of 0.4 defects/cm$^2$ and vary the die sizes from 50 mm$^2$ to 250 mm$^2$. We assume the number of wafers in the 3-D stack is equal to $K = 3$. The results are reported in Table II. As expected, the yield decreases as the die area increases; however, the improvements in yield from using the proposed integration strategies increase in magnitude as the die area increases.
TABLE II. IMPACT OF DIE AREA ON THE FINAL YIELD FOR THE VARIOUS INTEGRATION
Impact of Number of Wafers in the 3-D Stack.
In this important experiment, we study the scalability of the proposed algorithms in quality and runtime as the number of wafers in the 3-D stack increases. We initially assume a defect rate resulting in 80% yield per wafer and a die area of 1 cm$^2$. The yield and runtime results are given in Table III. A "-" in the table indicates the algorithm failed because of memory allocation problems. The obvious part of the results is that yield generally degrades, as expected, as the number of wafers in the stack increases. In comparing the various algorithms, we find the following.
• The upper bounds on the optimal solutions stay tight for up to $K = 4$; however, both the ILP and the relaxed LP run out of memory for values of $K \geq 5$. Furthermore, the runtime of the ILP dramatically increases as the number of wafers $K$ increases. The scalability of the optimal ILP algorithm can be improved by using more powerful workstations and better commercial ILP solvers.
• The greedy algorithm produces good results up to K = 5. For larger values of K, it runs into memory problems that prevent it from scaling gracefully.
• The iterative matching heuristic is the most scalable of all algorithms. All instances are solved in less than 1 second and furthermore the quality of the solution is close to the optimal. It also dominates the greedy algorithm in both yield and runtime. Compared to other methods, the iterative matching heuristic is the only technique that is scalable in memory requirements.
• Overall, the yield loss due to wafer-to-wafer integration at large values of $K$ will be unacceptable unless the yield per wafer is extremely high or the 3-D structure has redundant resources to cope with the defects (as is the case with error correction codes in memory stacks).
Impact of Wafer Lot Size. One possibility to improve the results of wafer-to-wafer integration is to batch or aggregate wafer lots to effectively increase the size of the wafer lot. For example, it is possible to aggregate two wafer lots, each with 25 wafers, to produce a larger wafer lot of 50 wafers. The aggregated wafer lot will then be used with other aggregated wafer lots to drive the integration process. A random assignment will not benefit from such batching, as the yield will stay the same on average. However, the proposed algorithms can exploit the larger wafer lots to find better assignments that further maximize the functional yield. To test this hypothesis, we carry out an experiment where we try four different wafer lot sizes, $N = 25$, 50, 75, and 100 (we assume $K = 3$, an individual wafer yield of 80%, and a die area of 1 cm$^2$). We plot the yield per wafer stack for both the random assignment and optimal assignment integration strategies in Fig. 4. As hypothesized, the yield of the random assignment strategy stays constant on average; however, as the wafer lot size increases, the optimal strategy is able to exploit this flexibility and increase the yield.
Cost Considerations. Our proposed methods require wafer testing in comparison to randomly assigning wafers. The cost of testing should be evaluated in comparison to the improvement in revenues attained from the increased yield from our methods. Providing exact cost numbers requires many factors, but we consider here some hypothetical estimates for the purpose of illustration. Let us assume a 3-D processor based on the die of an Intel Core 2 Duo integrated with two DRAM die. The Intel Core 2 Duo has a die area of approximately 143 mm$^2$.
Using our die calculating formula in Section IV, the number of die per wafer is 418. 
V. CONCLUSION
We have formulated the problem of yield maximization in wafer-to-wafer integration. We have proposed optimal techniques and scalable heuristics with near-optimal performance to maximize the yield. The proposed assignment techniques provide significant improvements to wafer-to-wafer integration yield, increasing the overall number of good die in many cases. Our proposed methods require wafer testing in comparison to randomly assigning wafers. The cost of testing should be evaluated in comparison to the improvement in revenues attained from the increased yield from our methods.
Low-Power Snoop Architecture for Synchronized Producer-Consumer Embedded Multiprocessing
Chenjie Yu and Peter Petrov
Abstract-We introduce a cross-layer customization methodology where application knowledge regarding data sharing in producer-consumer relationships is used to aggressively eliminate unnecessary and predictable snoop-induced cache lookups, even for references to shared data, thus achieving significant power reductions with minimal hardware cost. The technique exploits application-specific information regarding the exact producer-consumer relationships between tasks as well as information regarding the precise timing of synchronized accesses to shared memory buffers by their corresponding producers and/or consumers. Snoop-induced cache lookups for accesses to the shared data are eliminated when it is ensured that such lookups will not result in extra knowledge regarding the cache state with respect to the other caches and the memory. Our experiments show average power reductions of more than 80% compared to a general-purpose snoop protocol.
Index Terms-Low-power cache coherence, low-power multiprocessor systems-on-a-chip (MPSoC), producer-consumer communication in MPSoC.
I. INTRODUCTION
The abundance of wireless connectivity, coupled with the ever-growing increase in integration densities, has resulted in a multitude of handheld and wearable embedded applications such as portable media players, mobile phones with aggregate data functions, personal organizers, etc. Battery life and power consumption have become primary implementation constraints for these applications. Due to the integration of multiple functionalities and the ever-increasing demand for performance, it has become a natural design practice to utilize multiprocessor systems-on-a-chip (MPSoC) in embedded systems. Typically these systems feature several processor cores, possibly of heterogeneous natures, that access a shared memory. For reasons of low complexity and high speed, the most common approach is to use a shared system bus. In order to provide the required bandwidth to the shared memory, local caches at each processor node are usually employed. Local caching in multiprocessor systems, however, introduces the possibility of cache incoherence, a situation that occurs when a processor updates a data object after that same object is cached somewhere else. To resolve this issue, cache coherence protocols are used.
The snoop-based cache coherence protocols are the most widely deployed as they rely on the inherent broadcast nature of the common bus connecting the processor nodes to the memory. Each cache controller "snoops" the bus for memory transfers, for each of which a cache lookup is performed in order to determine whether a cache block state should be changed in the local cache. Easily extendable multiprocessor structures and software-transparent implementation have made snoop protocols easy to understand, deploy, and reuse, with minimal impact on the performance of memory subsystem [1] . Quite often, however, shared data are cached in just a few nodes. Snooping in the others leads to a waste of energy. It was shown in [2] that only around 10% of the application memory references actually require cache coherence tracking.
