This paper presents a systematic design methodology for yield enhancement of asynchronous logic circuits using 3-D (3-Dimensional) integration technology. In this design, the target asynchronous circuits on one planar device layer are fabricated with aggressive technology and built on fault tolerant graph models with extra spare resources. In the presence of hard errors, these circuits can be reconfigured by autonomous reconfiguration logic on another planar device layer fabricated with conservative technology. The yield analysis shows that this method can result in 20-30% overall yield enhancement. This design method can be conveniently applied to clocked designs without significant changes.
INTRODUCTION
Aggressive technology scaling results in continuously increasing process complexity and wafer handling cost, making it more difficult and expensive to achieve high fabrication yield by identifying and eliminating most defects [15]. Due to increased time-to-market pressures, however, foundries are often forced to start volume manufacturing on a given technology before the fabrication process becomes completely mature. Hence, optimizing circuit design for better yield and shortening the yield learning stage have become important issues for the semiconductor industry.
With the challenges facing conventional device scaling, 3-D (3-Dimensional) integration technology allows scaling to continue by shifting the focus from device scaling to circuit scaling. In 3-D integrated circuits (IC), planar device layers are stacked, one on top of another in a 3-D structure where adjacent device planes can be connected by short, vertical vias [6] . Using 3-D integration, the designer can construct VLSI systems that exhibit lower interconnect latencies; higher packing densities of circuits; and heterogeneous integration of devices of different materials (such as SiGe and III-V materials) or technologies (such as submicron and nanometer) [6] . During the fabrication of a 3-D IC, each planar device layer can be manufactured and tested separately. All layers are then stacked and assembled together with vertical inter-layer interconnects. Figure 1 shows the diagram of a 3-D IC structure with two device layers.
Relentless technology scaling also results in higher clock frequency and presents a growing challenge to global clock distribution. This makes it difficult and expensive to design a singly-clocked, globally synchronous VLSI system. On the other hand, the absence of a clock in asynchronous circuits makes timing much less critical, and the expected performance is decided by average-case (instead of worst-case) path delay. Besides, an asynchronous system has the potential to achieve high energy efficiency because computations are purely data-driven and no dynamic energy is consumed by an idle component. All these facts make asynchronous design an increasingly practical alternative [4]. Future VLSI systems could utilize both types of circuits: clocked logic for local computations facilitates circuit design, and asynchronous logic for global computations alleviates clock distribution and timing headaches.
Previous work on circuit design for yield enhancement is primarily based on hardwired N-modular redundancy (NMR) [7, 14] or fully programmable logic/gate arrays [1, 2, 8, 9]. However, it is non-trivial to apply the hardwired NMR method to asynchronous circuits without significant timing assumptions [13], because potential deadlock due to faults can permanently block the voting procedure, making the replication-and-voting methodology ineffective. For programmable logic/gate arrays, the large number of configuration bits significantly slows down the circuit. Besides that, explicit fault diagnosis is required for manual reconfiguration, which considerably increases product test cost.
In this paper, we propose a new design method for yield enhancement of asynchronous circuits. By utilizing 3-D integration technology and adding self-reconfiguration logic onto a separate reliable device layer, the VLSI system can recover from hard errors automatically, which results in lower product failure probability. Unlike the NMR method, no critical timing assumption has to be made, so this design is suited for asynchronous circuits. The partial redundancy required by this design results in smaller silicon area and helps reduce overall hard errors. Compared with fully programmable arrays, only a limited number of configuration bits are added by this design, which largely reduces performance and wiring overheads. Also, self-healing behavior spares fault diagnosis of target circuits, resulting in lower product test cost. Section 2 presents the design in detail. Section 3 presents yield analysis and shows the experimental evaluation results. Section 4 draws the conclusions.
DEFECT TOLERANT ASYNCHRONOUS DESIGN WITH 3-D IC
A systematic way to build a defect tolerant system is to make each module defect tolerant. This bottom-up method not only largely reduces design complexity, but also results in defect tolerance at fine granularity. Figure 2 shows the framework of this defect tolerant design.
... In Figure 2, self-checking logic is added to each hardware module so that the circuit deadlocks when it becomes faulty. Each module is built on a fault tolerant (FT) graph with extra spare resources. Pass-gates with control inputs (configuration bits) coming from the reconfiguration logic are applied to all inputs/outputs and necessary internal wires of each VLSI module so that the circuit topology can be changed dynamically. Whenever a module becomes faulty, this module deadlocks due to the fail-stop logic. Because of handshake-based data communication, a single deadlocked module can cause the whole asynchronous system to permanently stall. Thus, only one deadlock detector (implemented as a delay line of a current-starved inverter chain [11]) is required to detect the deadlock and trigger the self-reconfiguration logic. The latter reconfigures the modules by replacing the faulty sub-modules with workable spares. After reconfiguration completes, computation restarts from either the beginning or the last architectural checkpoint.
Due to the absence of voting procedure, no significant timing assumption is required, making this design suited for asynchronous logic. Unlike fully-programmable arrays, reconfiguration is implemented at coarse granularity (module-level instead of gate-level), resulting in less configuration overhead. Automatic reconfiguration spares fault diagnosis and manual reprogramming, and significantly reduces product test cost. It should be noted that the extra reconfiguration circuitry does not increase packaging expense (which is the significant part of overall manufacturing cost) because no additional I/O pin is introduced. The following subsections further explain this design framework.
Implementation with 3-D IC technology
In Figure 2, all circuitry (self-reconfiguration and deadlock detection) in the dashed box is critical (error-sensitive) and must be made highly reliable. Otherwise, the system cannot be reconfigured in the presence of errors. With 3-D integration technology, the system can be implemented as follows. (i) All the error-sensitive transistors are placed onto a single planar device layer (shown as the top layer in Figure 1) which is fabricated with conservative technology (such as micrometer or submicron technology with very high yield). (ii) The target VLSI circuits (the VLSI modules which are outside the dashed box) are placed onto other planar device layers which are fabricated with aggressive technology (such as deep-submicron or nanometer technology with immature fabrication processes). It should be noted that adding an extra layer of reconfiguration circuitry has little impact on overall heat dissipation: the extra layer sits at the bottom (the farthest from the heat spreader) when the chip is flipped and plugged onto the socket, and there is no switching activity on the extra layer unless the system is being reconfigured.
The 3-D implementation has the following advantages. (1) By fabricating error-sensitive transistors with reliable technology, this 3-D layout makes the critical circuitry unlikely to be affected by hard errors. Because the critical circuitry sits at the bottom (when plugged onto the socket), the energy consumption of the upper layers causes little thermal interference to the error-sensitive circuits. (2) Inter-device-layer vias can reduce wiring overhead between reconfiguration logic and target circuit, resulting in less planar silicon area. (3) The error-sensitive circuitry stands by when the system is workable. Using reliable technology results in less leakage current, and helps reduce the overall power overhead of the defect tolerant design without noticeably hurting system performance.
Fail-stop behavior and fault tolerant graph models
A widely-used asynchronous circuit template, precharge half-buffer (PCHB) [10], is chosen for asynchronous logic implementation. A PCHB circuit can have multiple inputs and outputs, and it can be used to construct almost any asynchronous logic. For example, an asynchronous MiniMIPS microprocessor [12] uses PCHBs for more than 90% of its circuits. Similar to a precharge domino circuit in synchronous designs, a PCHB circuit performs computations using pull-down (NMOS) networks, making it fast and compact. In this circuit, each input/output variable X is usually dual-rail encoded ($X_0$, $X_1$) with an explicit acknowledge ($X_e$). Validity and neutrality of the variables are checked and synchronized, generating the common acknowledge to all the inputs as well as the precharge/enable signal for computation of the current stage. Further details can be found in [10].
Due to the multi-rail encoded data and explicit handshake-based event-ordering, the authors in [3] proved that any failure by a single stuck-at fault in the PCHB circuit either deadlocks the circuit or produces an illegally dual-rail encoded output ($X_0 X_1$ = '11'). Thus, fail-stop behavior can be implemented by adding a checker (NAND gate) onto each output of a PCHB circuit. The checker watches the output, and blocks the incoming handshake signals (and stalls the current handshake cycle) whenever the output becomes illegally encoded.
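The checker behavior described above can be illustrated with a small behavioral model. This is only a sketch: the function names are hypothetical, and the way the checker outputs gate the handshake (a simple AND over all outputs) is a simplifying assumption, not the circuit from [3] or [10].

```python
def checker_nand(x0: int, x1: int) -> int:
    """NAND of the two rails: 0 only for the illegal '11' code word."""
    return 0 if (x0 and x1) else 1


def handshake_enabled(outputs) -> bool:
    """Assumed gating rule: the handshake may proceed only if every
    output's checker reports 1 (no illegally encoded output)."""
    return all(checker_nand(x0, x1) == 1 for x0, x1 in outputs)


if __name__ == "__main__":
    # Neutral (0,0) and valid (0,1)/(1,0) outputs let the handshake proceed.
    print(handshake_enabled([(0, 0), (0, 1), (1, 0)]))
    # A single '11' output stalls the stage (fail-stop).
    print(handshake_enabled([(0, 1), (1, 1)]))
```

In this model a module with one illegally encoded output simply stops acknowledging, which is the fail-stop condition that the deadlock detector later observes.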
To achieve reconfiguration, each VLSI module is built on a fault tolerant graph with extra spare resources so that faulty components can be replaced with workable spares. In [13] , we developed a K-FT linear array (called min-spare array) by adding the minimum (K) spare nodes and necessary redundant internal connections between the nodes. The left subgraph in Figure 3 shows a construction example of 2-FT 2-node linear array, where 2 spare nodes (shaded), 4 external edges (dashed), and 5 internal edges (bold) are added for any 2-node fault tolerance. As to VLSI modules which consist of identical components, they generally can be modeled as a linear array (e.g., full adder) or a collection of linear arrays (e.g., multiplier, FIR filter), given that either data or control propagates linearly through them. The collection of linear arrays usually can be further simplified into a single array through coalescing all nodes of the same row (or column). For all these VLSI modules, min-spare array model can be applied. For VLSI modules of other topologies, the K-FT graph can be simply (K +1) replicas of the target module (called full-duplication graph, shown as the right subgraph in Figure 3 ). One and only one replica is selected by enabling its connections with the environment. Compared with NMR method, only K (instead of 2K) full replicas are added to this graph.
It should be noted that this defect-tolerant design methodology is independent of the implementation of fail-stop behavior and the construction of FT graph models. As long as other efficient fail-stop logic and graph models exist, they can be applied to further reduce the overall hardware cost.
Self-reconfiguration
Reconfiguration in Figure 2 is both dynamic and automatic, and no manual fault diagnosis and external reprogramming is required. This can significantly reduce product test cost. For asynchronous circuits, on-the-fly fault location is non-trivial: it generally results in large hardware overhead and compromises the overall reliability by exposing more transistors to the unreliable environment. Instead of relying on fault diagnosis, we propose a new method here: self-reconfiguration logic searches all legal configurations by itself, until it finds a workable one. To implement the exhaustive search, finite state machines are utilized to remember the current configuration and decide which configuration to take next. Reconfiguration is activated repeatedly until a workable setting is found. The rest of this subsection explains further details.
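The search-until-workable idea can be sketched in a few lines. This is a behavioral illustration only: `self_reconfigure` and `is_workable` are hypothetical names, and `is_workable` stands in for "the system runs without the deadlock detector timing out", which in hardware is observed rather than computed.

```python
def self_reconfigure(configurations, is_workable):
    """Try each legal configuration in turn, as a cyclic counter would,
    and stop at the first one that works.

    Returns (configuration, number_of_tries), or (None, number_of_tries)
    if every configuration was tried without success.
    """
    tries = 0
    for config in configurations:
        tries += 1
        if is_workable(config):
            return config, tries
    return None, tries


if __name__ == "__main__":
    replicas = ["A", "B", "C"]       # three copies of one module
    faulty = {"A", "B"}              # hypothetical defect map, unknown to the logic
    print(self_reconfigure(replicas, lambda r: r not in faulty))
```

Note that the search never needs to know which component is faulty; it only needs a pass/fail observation per configuration.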
Reconfiguration. Reconfiguration logic can be implemented using synchronous circuits which take the time-out signal(s) from deadlock detector(s) as the clock(s). Conservative timing is assumed to guarantee circuit functionality.
Let an asynchronous system consist of M modules and K defects are to be tolerated. Each module is built on a K-FT graph. Whenever hard error(s) occur, the system stalls (due to fail-stop) and the deadlock detector activates the top finite state machine (shown as Block FSM in Figure 2 ) of self-reconfiguration logic. Block FSM selects K modules and reconfigures those modules concurrently by activating their per-module reconfiguration circuitry (shown as blocks Reconfig I· · · III). Each per-module reconfiguration logic includes a finite-state machine which is used to search a workable configuration from all local configurations of the current VLSI module.
Since each hardware module is K-FT, it has at least K +1 different configurations. Configuration of the whole asynchronous system can be modeled as a matrix with at least K+1 rows (configurations per module) and an arbitrary number of columns (modules). Let each matrix entry be either '1' (faulty configuration of the module) or '0' (workable configuration of the module). If there are at most K faults, this matrix can have at most K entry-'1's. It should be clear that there must be at least one row (a configuration of the asynchronous system) with all '0's. Hence this concurrent reconfiguration procedure is guaranteed to find a workable system configuration.
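The counting argument above can be checked by brute force for small sizes. The sketch below uses hypothetical helper names, enumerates every placement of at most K faulty entries in a matrix with K+1 rows, and confirms that a fault-free row always remains; with only K rows the guarantee fails.

```python
from itertools import combinations, product


def has_fault_free_row(matrix) -> bool:
    """True if some row (system configuration) contains only 0 entries."""
    return any(all(entry == 0 for entry in row) for row in matrix)


def guarantee_holds(rows: int, cols: int, k: int) -> bool:
    """Check every placement of at most k '1' entries in a rows x cols matrix."""
    cells = list(product(range(rows), range(cols)))
    for n_faults in range(k + 1):
        for faulty in combinations(cells, n_faults):
            faulty_set = set(faulty)
            matrix = [[1 if (r, c) in faulty_set else 0 for c in range(cols)]
                      for r in range(rows)]
            if not has_fault_free_row(matrix):
                return False
    return True


if __name__ == "__main__":
    K, M = 2, 4
    print(guarantee_holds(K + 1, M, K))  # K+1 rows: always a clean row
    print(guarantee_holds(K, M, K))      # only K rows: can be defeated
```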
Block FSM has M outputs. During a reconfiguration, exactly K outputs can be valid to activate K modules for concurrent reconfigurations. Thus, the core of Block FSM is a $\lceil \log_2 \binom{M}{K} \rceil$-bit cyclic counter which is used to select all possible K modules, together with the necessary combinational logic to derive the M outputs according to the current counter output. The implementation of per-module reconfiguration logic depends on the fault tolerant graph topology used in the local VLSI module. For a VLSI module of full-duplication graph, the per-module reconfiguration logic is simply a (K + 1)-bit one-hot counter (a cyclic shift register with a unique bit-'1') for switching to a different copy. For a VLSI module of min-spare array, per-module reconfiguration is completed by selecting N nodes out of N + K nodes (suppose an N-node array). Hence a dedicated $\lceil \log_2 \binom{N+K}{N} \rceil$-bit counter is used to search all $\binom{N+K}{N}$ configurations. All configuration bits (which correspond to an embedded isomorphic array of specific N nodes) are then derived through supporting combinational logic, according to the current counter output. Further details of reconfiguration for min-spare array can be found in [13]. Figure 4 shows an example of self-reconfiguration for a 1-FT 2-module system.
In Figure 4, Block FSM is a 2-bit one-hot counter which enables one module for reconfiguration. Per-module reconfiguration circuitry is shown in the two dashed boxes. Module I is built on a 1-FT 3-node min-spare array: a 2-bit counter searches all 4 configurations, and combinational logic derives all configuration bits (for 11 edges) accordingly. Module II is built on a 1-FT full-duplication graph (composed of 2 full replicas): a 2-bit one-hot counter chooses the corresponding replica by enabling its connections with the external environment. Primary input TO comes from the deadlock detector. It becomes high when the system deadlocks, activating the reconfiguration circuitry. Note that TO is delayed (TD) for per-module reconfiguration logic so that Block FSM is updated before per-module reconfiguration starts. At the end of each reconfiguration, TO is reset to low.
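The counter sizes in this example can be reproduced with a short calculation. The sketch below assumes binary-encoded counters sized as the ceiling of the base-2 logarithm of the number of states, and one-hot counters sized at one bit per state; the function names are hypothetical.

```python
from math import comb


def binary_counter_bits(n_states: int) -> int:
    """Bits needed by a binary counter with n_states states (at least 1)."""
    return max(1, (n_states - 1).bit_length())


def block_fsm_bits(m: int, k: int) -> int:
    """Binary counter that steps through all choices of K modules out of M."""
    return binary_counter_bits(comb(m, k))


def min_spare_configs(n: int, k: int) -> int:
    """Ways to select N working nodes out of N + K."""
    return comb(n + k, n)


def min_spare_bits(n: int, k: int) -> int:
    return binary_counter_bits(min_spare_configs(n, k))


def full_duplication_onehot_bits(k: int) -> int:
    """One-hot counter with one bit per replica."""
    return k + 1


if __name__ == "__main__":
    # Module I of Figure 4: 1-FT 3-node min-spare array.
    print(min_spare_configs(3, 1), min_spare_bits(3, 1))
    # Module II of Figure 4: 1-FT full-duplication graph.
    print(full_duplication_onehot_bits(1))
    # Block FSM for M = 2, K = 1 in binary encoding.
    print(block_fsm_bits(2, 1))
```

For M = 2 and K = 1 a binary encoding needs a single bit, whereas the Block FSM in Figure 4 is described as a 2-bit one-hot counter; the two encodings select among the same two states.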
Defect recovery time.
Defect recovery time is decided by the number of configurations the system has tried before it finds a workable one. Although the concurrent per-module reconfigurations might unnecessarily reconfigure non-faulty modules, they significantly reduce defect recovery time. Let $\mu_d$ be the total time required by one reconfiguration. Depending on the system size, $\mu_d$ is generally on the order of microseconds or milliseconds. The worst defect recovery time $T_r$, which can be used to estimate the expected defect recovery time, occurs when the system has tried all possible configurations and only the last one is workable. In this hierarchical reconfiguration framework, we have $T_r$ = . M can be reduced by partitioning the VLSI system at coarser granularity, and N decreases by dividing linear array(s) into larger nodes. During product testing, the defect recovery (test and reconfiguration) time can be minutes or hours. Hence the total number of configurations of this defect tolerant design can be up to hundreds of thousands or even millions (if $\mu_d$ is on the order of milliseconds). Therefore, this defect-recovery method can be conveniently applied to large VLSI designs.
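The order-of-magnitude claim can be checked with simple arithmetic: the number of configurations that fit into a test-time budget is the budget divided by the time per reconfiguration. The sketch below is illustrative only; the function name and the specific budgets are assumptions, and integer microseconds are used to avoid floating-point rounding.

```python
def max_configurations(budget_seconds: int, reconfig_time_us: int) -> int:
    """Number of reconfigurations that fit in the budget,
    with the per-reconfiguration time given in microseconds."""
    return (budget_seconds * 1_000_000) // reconfig_time_us


if __name__ == "__main__":
    one_ms = 1_000  # per-reconfiguration time of 1 ms, in microseconds
    print(max_configurations(60, one_ms))      # one minute of recovery time
    print(max_configurations(3_600, one_ms))   # one hour of recovery time
```

With a per-reconfiguration time of 1 ms, one minute covers tens of thousands of configurations and one hour covers a few million, which is consistent with the range quoted above.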
EVALUATION
In this section, we investigate yield enhancement of the defect tolerant design with 3-D implementation. One problem for traditional yield modeling is that model parameters are specific to the fabrication process and layout, and they are generally hard to estimate reasonably [7]. In order to draw general conclusions, we examine the yield of the defect tolerant design based on the assumed yields of baseline circuitry, instead of physical parameters such as critical area and fault density. Let $Y_o$ be the yield of baseline circuitry, and $Y_{ft}$ be the yield of K-defect tolerant circuitry. To make the conclusions independent of fault clustering factors, we assume faults occur independently. Thus, the calculation of $Y_{ft}$ is pessimistic and the results of yield enhancement are conservative.
For the defect-tolerant circuit of min-spare array topology (suppose it is an N-node array and $Y_n$ is the survival rate of each node), the yield is equivalent to the probability that at least N out of N + K nodes survive.
Because fault clustering is not considered and faults are assumed to occur independently (for pessimistic evaluation of yield enhancements), we have
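The displayed equation is not present in this copy of the text. Under the stated independence assumption, the probability that at least N of N + K nodes survive is a binomial tail, so a plausible reconstruction (presumably the paper's equation (1)) is given below. The relation $Y_n = Y_o^{1/N}$ is an added assumption: it holds if the baseline N-node array works only when all N nodes survive.

```latex
Y_n = Y_o^{1/N}, \qquad
Y_{ft} = \sum_{i=N}^{N+K} \binom{N+K}{i} \, Y_n^{\,i} \, (1 - Y_n)^{N+K-i}
```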
For the defect-tolerant circuit of full-duplication graph topology, the yield is the probability that at least one out of K + 1 replicas is workable.
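The corresponding displayed equation is also missing here. With independent faults and each replica having the baseline yield $Y_o$, the probability that at least one of K + 1 replicas works is the complement of all replicas failing, which suggests the following reconstruction (presumably the paper's equation (2)):

```latex
Y_{ft} = 1 - (1 - Y_o)^{K+1}
```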
By choosing defect tolerance at coarse granularity (e.g., large modules and big array nodes), we can easily make the overhead of pass gates and extra wiring in fault tolerant graphs negligible so that the extra area of vias between configuration logic and target circuits can be reasonably omitted. In this case, the calculated yields using equations (1) and (2) are good approximations to the real results. We next extend the yield calculations for 3-D structures.
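The two yield descriptions above can be turned into a small calculator. This is a sketch under stated assumptions rather than the authors' evaluation code: faults are independent, pass-gate and wiring overheads are ignored, and the per-node survival rate is taken as the N-th root of the baseline yield (an assumption, valid if the baseline array needs all N nodes). Function names are hypothetical.

```python
from math import comb


def yield_min_spare(y_o: float, n: int, k: int) -> float:
    """Probability that at least n of n + k nodes survive."""
    y_n = y_o ** (1.0 / n)          # assumed per-node survival rate
    total = n + k
    return sum(comb(total, i) * y_n ** i * (1.0 - y_n) ** (total - i)
               for i in range(n, total + 1))


def yield_full_duplication(y_o: float, k: int) -> float:
    """Probability that at least one of k + 1 replicas is workable."""
    return 1.0 - (1.0 - y_o) ** (k + 1)


if __name__ == "__main__":
    for y_o in (0.3, 0.5, 0.7, 0.9):
        print(y_o,
              round(yield_min_spare(y_o, n=4, k=1), 4),
              round(yield_full_duplication(y_o, k=1), 4))
```

For the same K, the min-spare array spends far less spare area than full duplication, so comparing these raw probabilities alone does not capture cost; the normalized yields discussed later address that.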
Regarding the baseline (non-FT) circuit of 3-D IC with L layers, we assume that all device layers are fabricated with the same technology and the yield of each layer is the same ($Y_o$). The overall yield can be calculated as
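The displayed equation is missing in this copy. If the L layers fail independently and the stack works only when every layer works, the baseline yield is the product of the per-layer yields. The symbol $Y_{base}$ below is a hypothetical name introduced for this reconstruction:

```latex
Y_{base} = Y_o^{\,L}
```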
For the defect-tolerant circuit of 3-D IC, an extra layer (called configuration layer) is added for reconfiguration (and deadlock detection) for the target circuitry on the other L layers (called target layers). For simplicity, we assume the yield of each target layer is the same ($Y_{ft}$). We also assume that all circuits on target layers are of uniform fault tolerant graph topology (either min-spare array or full-duplication graph), and investigate the corresponding yield enhancement. The yield of a general defect tolerant design with hybrid graph topologies can be estimated as some intermediate between these two extremes. The overall primitive yield of the defect-tolerant design is
L . Due to the extra configuration layer, however, some silicon dies which are supposed to be used for target circuitry are allocated for reconfiguration logic. To count such silicon cost penalty, the overall (effective) yield of defect-tolerant circuit is calculated as follows,
where $Y_{cfg}$ is the yield of the configuration layer. $Y_{ft}$ in equation (4) can be derived from $Y_o$ using equation (1) or (2), depending on the graph topology. The configuration layer uses conservative technology, so $Y_{cfg}$ should be calculated in a different way. Suppose the transistor count of the reconfiguration circuit for the ith target layer is $T^i_{cfg}$ and the transistor count of the circuit on the ith target layer is $T^i_o$. Let $\lambda_{cfg}$ and $\lambda_o$ be the minimum feature sizes of the configuration and target layers respectively. Since all layers of the 3-D structure have the same area and all reconfiguration logic shares the same planar layer, the technology scaling factor for the configuration layer can be estimated using transistor counts as follows,
With yield scaling model in [5] , the yield of configuration layer can be estimated as
where P is a constant of the defect-size probability density function and usually P ≈ 3 [5]. By choosing an appropriate fault tolerance granularity, it is easy to make $T^i_{cfg}$ no more than 4-5% of $T^i_o$ [13]. Hence S can be approximated to be 5/ .
Impact of N.
According to equations (1) and (2), only the yield of the circuit with min-spare array depends on N. The impact of N on the overall (effective) yield enhancement is investigated with given K and L. We choose the number of target layers (L) to be 3, as this number has been shown to be reasonable in terms of area-performance efficiency and heat dissipation [6]. Figure 5 shows the results with K = 1. The curves are similar for other Ks. Figure 5 shows that the defect tolerant design can improve the overall yield by up to 45%. When the baseline yield is low enough (<30%), only a very small yield improvement can be earned from adding redundancies, resulting in only a little overall (effective) yield increase. When the baseline yield is high enough (>95%), the potential for primitive yield increase becomes very limited, so that even the maximum primitive yield improvement cannot compensate for the inherent silicon cost penalty (due to the extra configuration layer), resulting in even lower overall (effective) yield. For other cases, there is significant yield improvement (20%-30% on average). It can be concluded that this defect tolerant design is able to noticeably reduce fabrication loss for immature technologies, helping to increase the profitability of foundries in the presence of increased time-to-market pressures.
For the cases of different Ns, finer fault tolerance granularity (larger N) helps reduce overall failure probability, resulting in higher yield enhancement (given all other parameters). Such yield increase, however, diminishes with larger N. Although larger N helps yield enhancement, it cannot be too large. Otherwise, the transistor count of the reconfiguration logic increases dramatically [13], which may break the assumption made for the scaling factor approximation and make the aforementioned yield estimate inaccurate. In the remainder of this section, we choose N to be 4 in order to preserve that assumption without causing significant yield loss compared with other Ns (according to Figure 5).
Impact of K.
Figure 6 shows the overall yield enhancements of both fault tolerant graphs, with respect to different Ks. Generally speaking, more faults can be tolerated with higher K, and the system is more likely to survive defects. Hence larger K tends to increase the overall yield enhancement (given all other parameters). Because the min-spare array achieves fault tolerance at finer granularity, it generally results in more yield increase than the full-duplication graph (given K and L).
Impact of L.
We now examine the changes of overall yield enhancement with respect to different numbers of target layers (L). Figure 7 shows the results for K = 1, and the curves are similar for other Ks. In this figure, X-l denotes X layers in total and thus L = X − 1 (target layers). The overall (effective) yield calculation takes the silicon cost penalty (due to the extra configuration layer) into account, and this constant penalty becomes more significant for fewer target layers, resulting in reduced yield enhancement. One extreme case is 2-l: the penalty (due to the configuration layer) becomes 100% and there is no yield enhancement at all. Unlike the overall (effective) yield, however, the primitive yields (without considering the configuration layer penalty) always increase (see the X-l-p curves). The overall (effective) yield improves only when the primitive yield increase can compensate for the configuration cost. Although more target layers (larger L) tend to amortize the configuration layer penalty, they require more configuration circuitry, and thus more transistors, on the configuration layer, which in turn makes the fabrication technology for reconfiguration logic less conservative and reduces $Y_{cfg}$. For low $Y_o$, the reduction of $Y_{cfg}$ due to larger L is more significant than the savings from amortizing the configuration layer penalty. Thus, fewer target layers result in higher yield enhancement. When $Y_o$ becomes high (>40-50%), the savings from amortizing the configuration cost penalty begin to dominate, resulting in better yield enhancement with larger L. However, there cannot be too many device layers in a 3-D structure; otherwise, inter-layer via overhead becomes non-negligible, and heat dissipation and electromagnetic interaction become problematic [6].
Comparison with NMR.
We compare the overall yields of this defect-tolerant design with the traditional NMR method. Both designs allow defect tolerance without external intervention. Our design, however, is more suited for asynchronous circuits because it does not require critical timing assumptions. With different amounts of extra hardware resources, these two design methods result in different primitive yields. For fair comparison, we also take the spare resource cost into the effective yield calculations. The resulting yields are called normalized yields, which reflect the cost-effectiveness of a defect-tolerant design. Let the 3-D structure have L target layers.
Regarding the K-defect-tolerant design of min-spare array (suppose N -node array), K out of N + K nodes are spare resources. Hence the overall normalized yield is,
For the K-defect-tolerant design of full-duplication graph, spare resources are the extra K full replicas. The overall normalized yield is,
When it comes to NMR design, 2K full replicas are spare resources. The overall normalized yield is,
where,
Several observations can be made from Figure 8. Firstly, this defect tolerant design results in higher normalized yield than NMR because of fault tolerance at finer granularity: less silicon penalty (due to fewer spare resources) is required, which can compensate for its primitive yield loss compared with NMR. Higher $Y_o$ tends to reduce the primitive yield increase, making the extra silicon cost (primarily due to spare resources) dominant. This explains why all the normalized yield enhancement (to NMR) curves monotonically increase with $Y_o$. When $Y_o$ approaches 100%, there is little yield enhancement for both designs, but our design results in less normalized yield loss than NMR. Therefore, this defect-tolerant design is more cost-effective than NMR. Secondly, larger K results in more fault tolerance capability but requires more spare resources. When those extra hardware costs become more significant (with larger K) than the primitive yield increase of our defect-tolerant design, larger K results in less normalized yield enhancement to NMR.
Comparison with PLA Method.
Since the PLA hardware structure strongly depends on the circuit functions and it is hard to make general quantitative comparisons, we only give a coarse comparison of our defect-tolerant design to the PLA method. Although the PLA method could result in lower hardware overhead due to its fault tolerance at even finer granularity (gate-level instead of module-level), explicit fault diagnosis and manual reprogramming force the defect tolerance to be offline, resulting in significant product test cost. Besides, PLA could cause noticeable performance degradation due to the large number of programmable gates. On the other hand, our design implements reconfiguration at coarse granularity, resulting in small performance overhead. Also, this design achieves fault tolerance through automatic reconfiguration and noticeably reduces product test cost.
CONCLUSION
This paper proposed a general asynchronous design for yield enhancement. With extra spare resources and reconfiguration-specific logic, the VLSI circuits can maintain functionality in the presence of hard errors. The evaluation showed that this design results in significant yield enhancement (∼20-30% on average, even with pessimistic evaluation). Unlike the traditional NMR method, this design is applicable to asynchronous circuits (due to no voting procedure) and more cost-effective for yield enhancement (due to fault tolerance at finer granularity). Compared with PLA and FPGA-based methods, this design significantly reduces product test cost (because of automatic reconfiguration) and results in smaller performance overhead (because of fewer programming bits).
This defect tolerance design methodology can be conveniently applied to synchronous designs, and similar yield enhancement results are expected (although the numbers might be different). With the same reconfiguration logic, fault tolerant graph models and 3-D implementation, the only difference is the implementation of fail-stop behavior. One straightforward way to achieve fail-stop in synchronous logic is to duplicate the clocked circuit and compare the outputs of the two replicas, off the critical path, on each clock cycle. If there is any mismatch, the global clock is cut off and the online reconfiguration is activated. The rest of the procedure is unchanged.
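The duplicate-and-compare scheme can be modeled behaviorally as follows. This is a minimal sketch with hypothetical names: each replica is represented as a function from an input to an output, and "cutting off the clock" is modeled by returning the cycle at which the first mismatch is seen.

```python
def run_with_compare(replica_a, replica_b, inputs):
    """Feed the same inputs to both replicas, one per clock cycle.

    Returns the index of the first cycle whose outputs differ
    (where the clock would be stopped), or None if no mismatch occurs.
    """
    for cycle, x in enumerate(inputs):
        if replica_a(x) != replica_b(x):
            return cycle
    return None


if __name__ == "__main__":
    def good(x):
        return x + 1

    def defective(x):
        # Hypothetical defect: wrong result once the input reaches 2.
        return x + 1 if x < 2 else 0

    print(run_with_compare(good, good, [0, 1, 2, 3]))
    print(run_with_compare(good, defective, [0, 1, 2, 3]))
```

A mismatch only shows that one of the two replicas is faulty, not which one, which fits the search-based reconfiguration described earlier since no fault location is required.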
ACKNOWLEDGMENTS
We especially thank Christianto C. Liu for his insights and comments regarding yield modeling. We also thank anonymous reviewers for their valuable feedback.
