ABSTRACT 
INTRODUCTION
Three-dimensional (3D) technology, which integrates multiple silicon dies with short and dense through-silicon vias (TSVs), provides abundant interconnect resources with improved performance and lower communication energy, and has become a promising solution to the ever-growing demand for integration capability [3]. In particular, to meet the ever-increasing demand for storage capacity, many semiconductor companies have implemented 3D-stacked memories that use TSVs as vertical buses across multiple DRAM layers, and it is believed that such memory products will be commercialized in the near future [7, 10].
Due to their extremely high density, memory circuits are prone to manufacturing defects. Consequently, to avoid yield loss, redundant rows and columns are added on-chip so that most faults can be repaired by replacing the rows/columns containing faulty bits with redundant ones. Since only a limited amount of redundant resources can be implemented on-chip for cost reasons, and lengthy repair time cannot be afforded for throughput reasons, numerous redundancy analysis and repair algorithms have been presented in the literature for effective and efficient memory repair [2].
Memory circuits typically contain multiple memory blocks and spare rows/columns are added to each memory block. If one of the blocks is not reparable, the entire memory circuit has to be discarded. Intuitively, we can increase memory yield by letting neighboring blocks share the precious redundant rows/columns [15] , but this strategy involves quite high routing overhead and hence is not used in practice for traditional two-dimensional (2D) memory circuits. With the emerging 3D-stacked memory, however, sharing redundant resources between neighboring vertical blocks for yield enhancement becomes feasible because short TSVs can be used for routing.
With the above redundancy sharing strategy, a memory block that is not self-reparable can borrow spare resources from its vertical neighbors (if any) and possibly become reparable after bonding. Consequently, compared to the case in which we only bond self-reparable dies together to form the 3D-stacked memory, we have the opportunity to achieve significant yield enhancement, especially when the defect density is high and/or the redundant resources are limited. At the same time, whether we can realize this opportunity depends heavily on the matching strategy for the memory dies. That is, if a self-reparable memory die is bonded with a non-reparable one and they cannot form a functional 3D-stacked memory circuit, the memory yield might even be sacrificed. Therefore, given the distinct defect bitmaps of different memory dies obtained from pre-bond testing, how to selectively match them to maximize the yield of the bonded 3D-stacked memory is an interesting and relevant problem. In this paper, we present novel solutions to tackle the above problem. Experimental results with various memory organizations and defect distributions show that the proposed methodology is quite effective for yield enhancement.
The remainder of this paper is organized as follows. Section 2 reviews related work and motivates this paper. The 3D-stacked memory architecture that supports redundancy sharing across neighboring dies is described in Section 3. Next, we discuss the memory repair strategy with redundancy sharing and our memory die matching algorithms in Section 4 and Section 5, respectively. Section 6 presents experimental results with various memory organizations and defect distributions. Finally, Section 7 concludes this paper.
PRELIMINARIES AND MOTIVATION
Prior work
In this subsection, we briefly discuss related work in memory repair and the bonding strategies in 3D circuits.
Memory Repair
The memory repair problem can be formulated as a constrained vertex cover problem on bipartite graphs and has been proved to be NP-complete in [9]. It is possible to obtain an optimal repair solution by exhaustive search, but this is too time-consuming to be used in practice. To address this problem, various redundancy analysis and fast memory repair techniques (e.g., [4, 9]) were presented in the literature to reduce repair time at the cost of a slight memory yield loss.
The above memory repair strategies assume that a full fault bitmap of the memory is available and that the repair is conducted by external memory testers. With the increasing usage of embedded memories, built-in self-repair (BISR) has become more popular. As it is not cost-effective to store the entire fault bitmap before repair, various memory repair techniques with a limited fault bitmap have been developed in recent years. In particular, a so-called essential spare pivoting (ESP) algorithm is presented in [5]. In this work, the authors classify faults into three types: 1) suitable for row repair; 2) suitable for column repair; 3) orthogonal faults, which can be repaired by either a spare row or a spare column. This work is shown to be quite effective and efficient.
A memory circuit typically contains multiple memory blocks. If one of these memory blocks is irreparable, the entire memory circuit is deemed defective and has to be discarded. Clearly, if a self-irreparable memory block can borrow some redundant resources from other reparable blocks, memory yield can be increased. Motivated by the above, [15] proposed a distributed global replaceable redundancy scheme, which allows the spare rows/columns in one memory block to repair faults in other memory blocks. In [1], the author proposed a memory built-in self-repair (BISR) algorithm with sharable redundant resources, using a programmable decoder. However, such a decoder design is much more complex than a conventional one, which not only increases area overhead but also significantly deteriorates the routability of the memory. Because of this, redundancy sharing is not utilized in practice for today's 2D memory devices.
Bonding Strategies in 3D ICs
3D ICs can be built by stacking multiple silicon layers in several manners: wafer-to-wafer (W2W) bonding, die-to-wafer (D2W) bonding (for 3D ICs built on two semiconductor wafers), or die-to-die (D2D) bonding. The main advantage of W2W bonding is the simplicity of the manufacturing process. However, the yield of the 3D chips can be quite low without "known good die" information [3]. To mitigate this problem, [12] proposed several wafer assignment algorithms that selectively bond wafers together for yield enhancement, assuming that individual dies are tested on-wafer before bonding. Although helpful, bonding wafers together has the inherent weakness that some good dies are bonded to bad ones and have to be discarded, and hence the manufacturing yield can still be unsatisfactory, especially when the die size is large and/or the defect density is high. D2W/D2D bonding, on the other hand, requires a more complex manufacturing process. However, as we are able to attach known good dies, the manufacturing yield can be significantly higher than with W2W bonding [8]. In this work, we assume D2D bonding is utilized to form the 3D-stacked memory chips.
Motivation
In 3D-stacked memory, bit-arrays are stacked vertically on each other and TSVs are utilized as buses to link the stacked dies together, as shown in Fig. 1 . Such organization provides us the opportunity to conduct redundancy sharing across neighboring dies with short TSVs without incurring much routing overhead. With redundancy sharing, an irreparable memory die is likely to become reparable by borrowing spares from its neighboring dies, but whether we can realize the above yield enhancement opportunity highly depends on the matching strategy.
Consider the eight memory dies shown in Fig. 2, each of them containing four memory blocks with different fault densities. For simplicity, we classify these dies into four categories. The memory blocks in Type A are all self-reparable and provide extra spare resources (see A1 and A2 in Fig. 2(a)). Type B memory dies cannot repair themselves but become reparable by borrowing a small number of redundant resources from other dies, and we assume that two Type B memory dies stacked together are reparable. Type C memory dies must borrow plenty of spares to become reparable, while the memory blocks in Type D have a high fault density and are irreparable even if they are allowed to borrow spares from others.
For this example, if we bond only self-reparable memory dies together (see Fig. 2(a)), the yield is only 25%. As shown in Fig. 2(b), stacking A1 on B2 and B1 on A2 produces two reparable 3D-stacked memory circuits, and the overall yield in this case is 50%. With an aggressive repair strategy as shown in Fig. 2(c), which attempts to repair Type D memory dies using the spare resources in Type A dies, the yield can be 0% in the extreme case. In other words, a bad matching strategy can even reduce the overall memory yield since good dies might be wasted. A good matching strategy as shown in Fig. 2(d), on the other hand, results in a maximum yield of 75%.
From the above, we can conclude that the die matching strategy is critical for the final yield of the 3D-stacked memory circuits. If we could quickly determine, for any pair of memory dies, whether the matched 3D-stacked memory is reparable, the matching problem would be quite similar to the wafer-to-wafer bonding selection problem in [12]. However, with both redundant rows and columns in each memory block, we cannot afford the time needed to run a repair algorithm on every possible pair of memory dies. Consequently, how to conduct efficient and effective matching to achieve the maximum memory yield is a challenging problem, which motivates this work.
3D-STACKED MEMORY ARCHITECTURE WITH REDUNDANCY SHARING
Fig. 3 depicts the 3D-stacked memory architecture that supports redundancy sharing across dies. The spare rows/columns on each memory die are connected not only to the programmable decoder on their own layer, but also to the decoders of other layers using TSVs. The routing overhead to support redundancy sharing across dies is quite low because short TSVs are used for routing. A pre-fabricated multiplexer controls which memory block uses the corresponding spare row/column. For a memory block with n spare rows and m spare columns, sharing all the redundant resources between two neighboring dies requires n + m TSVs, as shown in Fig. 3(a) (only spare rows are shown in this figure). With the continuing improvement of TSV manufacturing technology, this overhead is of less concern. Moreover, we can reserve some spare resources for repairing their own die and/or share the same TSVs among different redundant rows/columns, reducing the number of TSVs used for redundancy sharing. An example is shown in Fig. 3(b).
The above architecture can tolerate some TSV defects. Consider the memory blocks in Fig. 3(a). If one of the three TSVs is defective, we can reserve the corresponding spare row for repairing its own memory block, while two spare rows remain available for sharing among neighboring blocks. As long as a memory block does not need to borrow all three redundant rows to become reparable, this is still sufficient.
MEMORY REPAIR WITH REDUNDANCY SHARING
When conducting self-repair of a memory block, there might be multiple repair strategies to replace the faulty rows/columns with redundant ones, and any one of them suffices. With resource sharing between neighboring dies, however, how each individual die is repaired determines whether the other die can be repaired or not. For example, two memory blocks are shown in Fig. 4(a)-(b), in which black vertices denote faulty cells and solid/dashed lines denote spare columns/rows, respectively. The memory block in Fig. 4(a) is self-irreparable. The memory block in Fig. 4(b) is self-reparable, but it can lend no spares to the other block if we insist on repairing it first. The stack formed by bonding these two dies is reparable only if the memory block in Fig. 4(a) repairs itself with three spare columns (one of which is borrowed) while the memory block in Fig. 4(b) repairs itself with two spare rows (one of which is borrowed) and one spare column. This example shows that the memory repair algorithm with redundancy sharing must be aware of the fault information of both stacked dies and be conducted at a global level.
Problem Formulation
The problem of memory repair with redundancy sharing can be formulated as follows:
Problem: Given
• The number of memory blocks N_b on the memory die;
• The fault bitmap of each memory block obtained from pre-bond testing;
• The number of spare rows/columns R_a, C_a in each memory block;
Our objective is to determine the repair scheme of each memory block so that the stacked memory dies can be reparable, whenever possible.
Memory Repair Strategy
A single memory repair problem is formulated as a constrained vertex cover problem on a bipartite graph. Based on the fault bitmap of the memory block, we can build the corresponding bipartite graph (see Fig. 4). The left set of vertices denotes the row numbers and the right set denotes the column numbers of the faulty bits, respectively. An edge between two vertices represents the position of the corresponding faulty bit. With the given numbers of spare rows and columns as constraints, our objective is to find a set of vertices that covers at least one end of every edge.
We propose a novel algorithm to tackle this problem, inspired by the concept of so-called irrespective error matrices in [1]. We first extract the rows and columns that contain faulty bits from the two memory blocks with sharable redundancies, and integrate them into a new fault bitmap by placing them in diagonal positions without intersection, as shown in Fig. 4(c). For the new fault bitmap, the redundant resources are also doubled. By translating the problem of repairing two dies with redundancy sharing into this new problem, we guarantee that a sharable redundant row/column can be flexibly used to replace a faulty one in either of the two memory blocks, without enumerating all possible repair solutions for each die. Then, we can apply any redundancy analysis and repair algorithm to solve this problem. In this example (see Fig. 4(c)), a solution is found so that a self-irreparable block and a self-reparable one form a reparable stacked memory circuit.
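The merging step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the (row, column) fault lists, the use of whole-block offsets instead of the compacted bitmap of Fig. 4(c), and the exhaustive covering search (standing in for "any redundancy analysis and repair algorithm") are assumptions made for illustration. The two toy blocks, each assumed to have one spare row and two spare columns, are chosen to mimic the situation described for Fig. 4 and are not the actual bitmaps of that figure.

```python
def merge_bitmaps(faults_a, faults_b, shape_a):
    """Place block b diagonally after block a so that no row or column is shared."""
    rows_a, cols_a = shape_a
    return list(faults_a) + [(r + rows_a, c + cols_a) for r, c in faults_b]


def reparable(faults, spare_rows, spare_cols):
    """Exhaustive search: the first uncovered fault must be covered by a spare
    row or a spare column; recurse on the faults that remain uncovered."""
    if not faults:
        return True
    r, c = faults[0]
    if spare_rows and reparable([f for f in faults if f[0] != r],
                                spare_rows - 1, spare_cols):
        return True
    if spare_cols and reparable([f for f in faults if f[1] != c],
                                spare_rows, spare_cols - 1):
        return True
    return False


# Hypothetical 8x8 blocks, each with 1 spare row and 2 spare columns.
block_a = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2)]  # needs three spare columns
block_b = [(0, 0), (0, 1), (1, 2), (1, 3), (2, 3)]          # self-reparable, uses all its spares

merged = merge_bitmaps(block_a, block_b, shape_a=(8, 8))
print(reparable(block_a, 1, 2))  # block a on its own
print(reparable(block_b, 1, 2))  # block b on its own
print(reparable(merged, 2, 4))   # stacked pair with pooled spares
```

With these toy bitmaps the three calls print False, True and True: block a cannot repair itself, block b can only do so by consuming all of its own spares, yet the merged bitmap with pooled spares admits a solution (three columns for block a, two rows and one column for block b).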
MATCHING FOR YIELD ENHANCEMENT

Problem Formulation
As discussed earlier, how to conduct efficient and effective die matching has a significant impact on the final yield of the 3D-stacked memory circuits. We model this problem by constructing a graph with the memory dies as vertices and adding an edge between two vertices if the corresponding memory dies are reparable after stacking. This problem can then be solved in polynomial time as the classical maximum matching problem.
The challenge here lies in the fact that it is extremely time-consuming to run a repair algorithm on every pair of dies, and hence it is impractical to obtain the exact edge information as modeled above. We therefore need a more efficient method to address the die matching problem, as formulated in the following.
Problem: Given a number of memory dies with the following information:
• The number of memory blocks in each die;
• The fault bitmap of each memory block in every die obtained from pre-bond testing;
• The number of spare rows/columns of each memory block;
Our objective is to match the given dies in a pairwise manner to maximize the total number of reparable 3D-stacked memory circuits.
It is important to note that even though a 3D-stacked memory can contain more than two layers, we consider pairwise matching only in this work. There are two reasons for this: (i) selectively matching more than two dies is a much more complicated problem since the number of possible combinations grows rapidly; (ii) generally speaking, the die yield cannot be a small value (otherwise the product is not profitable), and a good pairwise matching algorithm is able to recover most self-irreparable dies. By simply combining the matched memory pairs to form 3D-stacked memory circuits with more layers, we can maintain a high memory yield.
Direct Matching
As discussed earlier, we cannot afford the computational complexity of running a repair algorithm on every possible pair of memory dies to determine exactly whether two dies matched together can form a reparable stacked circuit (referred to as the matching condition). If, however, we can efficiently estimate the matching conditions with sufficient accuracy, we should be able to achieve good matching results. Motivated by the above, we develop two kinds of matching conditions to solve our die matching problem. Before describing them in detail, we first present our die matching algorithm as follows.
We first build a graph with the memory dies as vertices (Line 1), and check for every pair of memory dies whether they are considered to be reparable according to our matching condition (Line 4). After checking all pairs and adding the corresponding edges, an undirected graph is constructed (Lines 1-6). We then use the classic 'Blossom' maximum matching algorithm [11] to get the matched die pairs (Line 7). Finally, to verify whether every pair is reparable and get the final yield (Lines 8-10), we use a two-step memory repair algorithm: (i) we first utilize the efficient ESP algorithm [5] for repair; (ii) if a solution cannot be found, we resort to a branch-and-bound method for another try and abort the repair if a solution still cannot be found within a runtime limit.
Die Matching Algorithm
   If not, remove S_k from S, x = x - 1
11 Output S
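The flow described above (build the graph from a matching condition, compute a maximum matching, then verify each matched pair) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' listing: `maximum_matching` is an exhaustive search that only suits a handful of dies and stands in for the Blossom algorithm, `match_dies` and both predicates are hypothetical names, and the final repair step is replaced by a lookup table. The toy data assumes two dies of each type from the Fig. 2 discussion and that a Type A die can rescue a Type C die.

```python
from functools import lru_cache
from itertools import combinations


def maximum_matching(n, edges):
    """Exhaustive maximum matching on a small graph (stand-in for Blossom)."""
    adj = [0] * n
    for u, v in edges:
        adj[u] |= 1 << v
        adj[v] |= 1 << u

    @lru_cache(maxsize=None)
    def best(free):
        if free == 0:
            return ()
        u = (free & -free).bit_length() - 1      # lowest-numbered free vertex
        rest = free & ~(1 << u)
        result = best(rest)                      # option: leave u unmatched
        cand = adj[u] & rest
        while cand:
            v = (cand & -cand).bit_length() - 1
            cand &= cand - 1
            trial = ((u, v),) + best(rest & ~(1 << v))
            if len(trial) > len(result):
                result = trial
        return result

    return list(best((1 << n) - 1))


def match_dies(dies, may_pair, is_reparable):
    """Graph construction, maximum matching, then verification of every pair."""
    edges = [(i, j) for i, j in combinations(range(len(dies)), 2)
             if may_pair(dies[i], dies[j])]
    pairs = maximum_matching(len(dies), edges)
    return [(dies[i], dies[j]) for i, j in pairs
            if is_reparable(dies[i], dies[j])]


# Toy data: die types only; which type pairs are reparable is an assumption.
REPARABLE_TYPES = {("A", "A"), ("A", "B"), ("A", "C"), ("B", "B")}

def truly_reparable(a, b):
    return tuple(sorted((a, b))) in REPARABLE_TYPES

def self_reparable_only(a, b):
    return a == "A" and b == "A"

dies = ["A", "A", "B", "B", "C", "C", "D", "D"]
conservative = match_dies(dies, self_reparable_only, truly_reparable)
exact = match_dies(dies, truly_reparable, truly_reparable)
print(len(conservative), len(exact))
```

Under these assumptions the conservative condition yields one functional stack out of four and the exact condition yields three, which is consistent with the 25% and 75% figures quoted for Fig. 2(a) and Fig. 2(d).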
Reparability Condition
Let us first consider the reparability condition, which guarantees that any pair of memory dies passing this condition must be reparable. Clearly, matching only self-reparable dies together is one type of reparability condition, as shown in Fig. 6(a). However, such a matching strategy is too conservative and far from the optimal matching solution (see footnote 1). We therefore propose to extend the ESP algorithm presented in [5] to build a more effective reparability condition efficiently.
1 The "optimal matched dies" shown in Fig. 6 refers to the dies obtained when we can run the final repair algorithm on every possible memory die pair to determine whether they are reparable.
Similar to [5], for a memory block i with R_i spare rows and C_i spare columns, we classify the faulty bits into three types: (i) F^r_i bits that are suitable for row repair; (ii) F^c_i bits that are suitable for column repair; and (iii) F^o_i orthogonal bits that can be repaired by either spare rows or spare columns. Then, we use the following formula to determine whether two blocks a and b are reparable after stacking:

$$(R_a + R_b - F^r_a - F^r_b) + (C_a + C_b - F^c_a - F^c_b) \geq F^o_a + F^o_b,$$

where (R_a + R_b - F^r_a - F^r_b) and (C_a + C_b - F^c_a - F^c_b), both of which must be non-negative, are the numbers of spare rows/columns left after repairing the corresponding faulty cells that require dedicated row/column repair. If their sum suffices to repair the remaining orthogonal faulty bits, we can guarantee that the stack formed by bonding the two memory blocks is reparable. By checking all pairs of memory blocks between the two memory dies, the above condition is used as our new reparability condition, which can provide yield enhancement when compared to matching self-reparable dies only, as shown in Fig. 6(b).
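As a minimal Python sketch of such a check, assume the condition takes the form suggested by the description above: the spare rows and columns left over after the dedicated row and column repairs, pooled across the two blocks, must cover the orthogonal faults. The `BlockSummary` class, its field names, the choice to count F^r, F^c and F^o in units of spares required, and the pairing of blocks by position are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class BlockSummary:
    spare_rows: int   # R_i
    spare_cols: int   # C_i
    row_faults: int   # F^r_i: spare rows demanded by row-type faults (assumed unit)
    col_faults: int   # F^c_i: spare columns demanded by column-type faults
    orth_faults: int  # F^o_i: orthogonal faults, one spare of either kind each


def blocks_surely_reparable(a, b):
    """Sufficient (not necessary) test for two vertically stacked blocks."""
    left_rows = a.spare_rows + b.spare_rows - a.row_faults - b.row_faults
    left_cols = a.spare_cols + b.spare_cols - a.col_faults - b.col_faults
    if left_rows < 0 or left_cols < 0:
        return False  # dedicated row/column repairs already exceed the pool
    return left_rows + left_cols >= a.orth_faults + b.orth_faults


def dies_surely_reparable(die_a, die_b):
    """Blocks at the same position are assumed to be vertical neighbors."""
    return all(blocks_surely_reparable(a, b) for a, b in zip(die_a, die_b))


heavy = BlockSummary(spare_rows=2, spare_cols=2, row_faults=3, col_faults=0, orth_faults=1)
light = BlockSummary(spare_rows=2, spare_cols=2, row_faults=0, col_faults=1, orth_faults=1)
print(blocks_surely_reparable(heavy, light))  # the light block lends a spare row
print(blocks_surely_reparable(heavy, heavy))  # six row repairs against four spare rows
```

The test is cheap because it only needs five counters per block, which is what makes it usable on every candidate die pair.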
Irreparability Condition
The above reparability condition guarantees that all found die pairs are reparable. Such a stringent requirement prevents us from finding reparable pairs that cannot pass the reparability condition check, and hence inherently limits the achievable maximum yield. In this subsection, we consider another type of matching condition based on irreparability checking. By eliminating only those pairs that are guaranteed to be irreparable, more candidate die pairs can be found, but we cannot guarantee that they are reparable. We obtain the irreparability condition by analyzing the properties of the bipartite graph constructed from the fault bitmap.
Given a bipartite graph G = (V, E), where the vertices are partitioned into two sets X and Y (x_i ∈ X, y_i ∈ Y) and each edge (x_i, y_i) ∈ E, we have the theorem that the minimum number of vertices that cover all the edges is equal to the number of edges in any maximum bipartite matching of the graph [9]. For a memory block i with R_i spare rows and C_i spare columns, it is easy to deduce from the above that the memory block is not reparable if R_i + C_i is smaller than the number of edges in a maximum bipartite matching of its fault graph (which can be obtained efficiently as in [9]). Now let us consider the matching condition for two memory blocks a and b. We have the following lemma.
Lemma 1: Given two memory blocks with spare rows R_a, R_b and spare columns C_a, C_b, let M_a and M_b be the numbers of edges in the maximum bipartite matchings of the bipartite graphs mapped from these two blocks. If

$$R_a + R_b + C_a + C_b \geq M_a + M_b,$$

the stacked memory can possibly become reparable.
The die matching algorithm is then conducted on the graph constructed according to the above irreparability condition. With this strategy, we are able to find more die pairs that are possibly reparable, but reparability is not guaranteed. The two examples shown in Fig. 7 both require at least four spares (either spare rows or spare columns) to be possibly reparable. Even though both are deemed possibly reparable according to the above lemma, if in total we have 2 spare rows and 2 spare columns with redundancy sharing, the faults in (a) are reparable (rows 1, 2 and columns 6, 9), but the faults in (b) cannot be repaired (a possible repair solution exists only when we have 4 spare columns). Fig. 8 presents the matching results according to the irreparability condition. As can be observed, only part of the matched dies are reparable.
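The irreparability check can be sketched in Python with a standard augmenting-path bipartite matching. The sketch assumes that the lemma compares the pooled number of spares against the maximum matching size of the merged fault graph; because the two blocks share no rows or columns after merging, that size is the sum of the two per-block matching sizes. The function names and the toy fault list are hypothetical.

```python
def max_bipartite_matching(faults):
    """Size of a maximum matching between faulty rows and faulty columns,
    computed with the classical augmenting-path method."""
    adj = {}
    for r, c in faults:
        adj.setdefault(r, set()).add(c)
    match_col = {}  # column -> row currently matched to it

    def augment(r, seen):
        for c in adj[r]:
            if c in seen:
                continue
            seen.add(c)
            if c not in match_col or augment(match_col[c], seen):
                match_col[c] = r
                return True
        return False

    return sum(augment(r, set()) for r in adj)


def possibly_reparable(faults_a, faults_b, total_spares):
    """False only when the pair is guaranteed to be irreparable."""
    needed = max_bipartite_matching(faults_a) + max_bipartite_matching(faults_b)
    return total_spares >= needed


# Hypothetical block: four faulty columns with two faults each.
four_columns = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 3)]
print(max_bipartite_matching(four_columns))
print(possibly_reparable(four_columns, [], total_spares=4))
print(possibly_reparable(four_columns, [], total_spares=3))
```

This toy block passes the check with four pooled spares even if they are two rows and two columns, in which case it cannot actually be repaired. That is the same kind of gap between "possibly reparable" and "reparable" as in the Fig. 7(b) case described above.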
Iterative Matching
Directly matching memory dies according to either the reparability condition or the irreparability condition has its limitations. In this subsection, we consider matching dies iteratively. That is, we conduct die matching multiple times, and in each iteration we keep the good pairs and redo the matching for the remaining dies with a changed matching condition.
We start from the irreparability condition (footnote 2) as discussed earlier, since this strategy gives us more candidate die pairs and our die matching algorithm will try to match them by aggressively making use of the redundant resources. Then, we keep those pairs that are reparable, and for the dies in those found pairs that are not reparable, we conduct matching again with a tightened irreparability condition as follows:

$$R_a + R_b + C_a + C_b \geq M_a + M_b + k,$$

where M_a and M_b denote the maximum bipartite matching sizes of the two blocks, and the value of k (initialized as 0) is gradually increased.
As shown in Fig. 9, every time we tighten the irreparability condition, more reparable pairs are found because more spare rows/columns are available between the matched dies. At the same time, the number of found pairs decreases, and the above procedure terminates when we cannot find any more pairs that can be matched.
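The iterative procedure can be sketched in Python as below. Several simplifications are assumptions of this sketch and not taken from the text: a greedy matcher replaces the Blossom algorithm to keep the example short, `slack(a, b)` is a hypothetical callback returning the pooled spares minus the matching-based lower bound for a die pair, `is_reparable` stands in for the final repair algorithm, and `max_k` is a safety cap. The toy dies are described only by how many spare rows and columns they need.

```python
def greedy_matching(dies, may_pair):
    """Simple maximal matching (stand-in for a maximum matching algorithm)."""
    pairs, used = [], set()
    for i, a in enumerate(dies):
        if a in used:
            continue
        for b in dies[i + 1:]:
            if b not in used and may_pair(a, b):
                pairs.append((a, b))
                used.update((a, b))
                break
    return pairs


def iterative_matching(dies, slack, is_reparable, max_k=16):
    """Match, keep verified pairs, then rematch the rest with a tighter condition."""
    remaining, accepted = list(dies), []
    for k in range(max_k + 1):
        pairs = greedy_matching(remaining, lambda a, b: slack(a, b) >= k)
        if not pairs:
            break  # no pair passes the tightened condition any more
        good = [p for p in pairs if is_reparable(*p)]
        accepted.extend(good)
        matched = {d for p in good for d in p}
        remaining = [d for d in remaining if d not in matched]
    return accepted


# Toy model: each die has 2 spare rows and 2 spare columns (8 spares per pair)
# and needs the listed (rows, columns) to be repaired.
NEED = {"a0": (3, 1), "a1": (3, 1), "b0": (0, 3), "b1": (0, 3)}

def slack(x, y):
    return 8 - sum(NEED[x]) - sum(NEED[y])

def is_reparable(x, y):
    return NEED[x][0] + NEED[y][0] <= 4 and NEED[x][1] + NEED[y][1] <= 4

result = iterative_matching(["a0", "a1", "b0", "b1"], slack, is_reparable)
print(result)
```

In this toy run the first round (k = 0) pairs a0 with a1 and b0 with b1, and both pairs fail verification. After tightening to k = 1 the a0-a1 edge disappears, the dies are re-paired as (a0, b0) and (a1, b1), and both pairs are reparable.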
EXPERIMENTAL RESULTS
Experiment Setup
We consider a total of 1000 1Gb memory dies to be formed as 2-layer memory circuits, and hence we can obtain a maximum of 500 functional stacked memories when the yield is 100%. Each memory die contains 4 × 4 memory blocks, and each memory block is with the size of 8k × 8k (Row ×Column) bit-cells.
For fault injection, we consider two distributions to obtain the number of faults in each die: (i) Poisson distribution with λ = 2.130 [6]; (ii) Polya-Eggenberger distribution with λ = 2.130 [5]. For the Polya-Eggenberger distribution, we also tune another parameter α with α = 2.382 [13] and α = 0.6232 [14], representing the case with clustered faults and that with evenly distributed faults, respectively. We assume that all the spare rows/columns can be borrowed between neighboring vertical memory blocks, and we inject random TSV faults with a fault rate of 0.1%. Experiments are conducted on two cases with different percentages of six kinds of faults (see the following table).
Results and Discussion
In our experiments, we compare four matching strategies: (i) matching self-reparable dies only; (ii) matching according to the reparability condition; (iii) matching according to the irreparability condition; and (iv) iterative matching. Table 2 presents our experimental results with a gradual increase in the number of spare rows/columns.
