Abstract-This paper addresses the question of how to add redundancy to a collection of physical objects so that the overall system is more robust to failures. Physical redundancy can (generally) only be achieved by employing copy/substitute procedures. This is fundamentally different from information redundancy, where a single parity check simultaneously protects a large number of data bits against a single erasure. We propose a bipartite graph model for designing defect-tolerant systems where defective objects are repaired by reconnecting them to strategically placed redundant objects. The fundamental limits of this model are characterized under various asymptotic settings, and both asymptotic and finite-size optimal systems are constructed.
I. INTRODUCTION
Classical Shannon theory established principles of adding redundancy to data for combatting noise and, dually, of removing redundancy from data for more efficient storage. The central object of the classical theory is information, which, unlike physical objects, can be freely copied and combined. In fact, the marvel of error-correcting codes is principally based on the counter-intuitive property that multiple unrelated information bits $X_1, \ldots, X_k$ can be simultaneously protected by adding "parity-checks" such as
$$Y = X_1 + \cdots + X_k \bmod 2, \tag{1}$$
which allows the data to be recovered when the vector $(X_1, \ldots, X_k, Y)$ undergoes an erasure of an arbitrary element.
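The parity-check property described above can be illustrated with a minimal Python sketch. The function names `parity` and `recover`, and the use of `None` to mark an erased position, are assumptions made for this illustration only and do not come from the paper.

```python
from functools import reduce
from operator import xor


def parity(bits):
    """Return X_1 + ... + X_k mod 2 for a list of 0/1 values."""
    return reduce(xor, bits, 0)


def recover(word):
    """Fill in the single erased position (marked None) of (X_1, ..., X_k, Y)."""
    erased = [i for i, b in enumerate(word) if b is None]
    if len(erased) != 1:
        raise ValueError("exactly one erasure is expected")
    known = [b for b in word if b is not None]
    repaired = list(word)
    # All k + 1 symbols of a valid word sum to 0 mod 2, so the missing
    # symbol equals the parity of the surviving ones.
    repaired[erased[0]] = parity(known)
    return repaired


data = [1, 0, 1, 1]
codeword = data + [parity(data)]   # one extra bit protects all k data bits
damaged = list(codeword)
damaged[2] = None                  # erase an arbitrary element
print(recover(damaged) == codeword)
```

A single redundant bit repairs an erasure in any of the k + 1 positions, which is exactly the kind of combining operation that has no counterpart for physical objects.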
Physical objects (e.g. transistors in a chip) may also be subject to erasures (failures) and thus it is natural to ask about ways of insuring the system against probable failure events. Necessarily, any such solution would entail addition of spare (redundant) elements. Note, however, that for physical objects operations such as (1) are meaningless: generally the only operations that apply to physical objects are copy/substitute. It may, therefore, seem that nothing better than simple replication can guard against failures. This paper shows otherwise. Indeed, there exist non-trivial ways to add redundancy as long as the objects' diversity does not exceed their number, that is, if the number of types of objects is smaller than the total number of objects.
Specifically, we study the following problem formulation: Given $k$ objects ("functional nodes"), connect each one of them to some of the available $m$ spares ("redundant nodes") in such a way that in the event that $t \geq 1$ of the objects fail (originals or spares) the overall system can be made to function after a repair step. Such a repair step consists of replacing each failed functional node with one of the spares that it is connected to. The key restrictions are: 1) each functional node is of one of $q$ types; 2) the spares have to be programmed to one of the $q$ types before the failure events are known; and 3) the same connections need to repair all possible choices of types for the $k$ functional nodes. We are interested in minimizing the redundancy $m/k$ and the number of connections to spare nodes.
Our motivation for studying this model comes from the following applications:
• Objects are digital gates of one of $q$ types on a silicon chip. An imperfect manufacturing process causes certain gates to fail. As part of post-manufacture testing, a configurable interconnect fabric is programmed to route around defective gates. Details are discussed further in Section II-B.
• Objects are elements in a programmable logic device (e.g. look-up tables (LUTs) in an FPGA). As part of a periodic firmware update, the manufacturer assigns values of LUTs (both functional and redundant) without knowledge of the locations of device-specific failures. Then, for each failed LUT T, a built-in algorithm reconnects it to an adjacent spare LUT R, with the requirement that R and T be equivalent. The key here is for the local algorithm to be computationally non-demanding.
It is not hard to come up with other potential applications in warehouse planning, operations research, public safety, etc.
In short, we are looking for a $k \times m$ bipartite graph with the property that for any $q$-coloring of the left nodes there is a $q$-coloring of the right nodes such that each of the $k$ left nodes is connected to at least $t$ nodes of its color. The goal is to trade off redundancy $m/k$ vs. number of edges. For $q = 2$ our problem is equivalent to the sparsity vs. edge-size tradeoff for $(t, t)$-colorable hypergraphs, cf. [1]. It may be instructive to look at simple non-trivial graphs in Figures 3-4. (In all figures, circles are original or functional nodes and squares are spare or redundant nodes.)
To summarize our main findings, if the number of types $q \geq k$ then no strategy is better than straightforward $t$-fold replication. However, as long as $q < k$ there exist designs that provide savings compared to repetition, as we will see in Section III. Consequently, we characterize the fundamental tradeoff between redundancy $m/k$ and the number of edges (connections) in the following cases: 1) $q, t$ fixed and $k, m \to \infty$; 2) $q$ fixed and $k, m, t \to \infty$; 3) $q, k$ fixed and $m, t \to \infty$. Perhaps surprisingly, in this (combinatorial) problem it is possible to obtain exact answers for asymptotics.
II. PROBLEM SETUP AND MAIN RESULTS
A. Defect-tolerance model
Definition 1. Fix a finite alphabet $\mathcal{X}$ with $|\mathcal{X}| = q$. A bipartite graph with $k$ functional (left) nodes and $m$ redundant (right) nodes is called a $t$-error correcting design if for any labeling of the $k$ functional nodes by elements of $\mathcal{X}$ there exists a labeling of the $m$ redundant nodes by elements of $\mathcal{X}$ such that every functional node labeled $x \in \mathcal{X}$ has at least $t$ neighbors labeled $x$. We will call such a graph a $(k, m, t, E)_q$-design, with $E$ denoting the number of edges.
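Definition 1 can be checked mechanically on small graphs. The following is a minimal brute-force sketch, not a method from the paper: the function names and the adjacency-list representation are assumptions of this illustration, and the two constructors encode our reading of the "complete design" (all $k$ functional nodes joined to $qt$ spares) and the "repetition block" ($t$ private spares per functional node). Its running time grows as $q^{k+m}$, so it is only usable for very small parameters.

```python
from itertools import product


def is_design(adj, m, q, t):
    """Check Definition 1 by exhaustive search.

    adj[i] is the set of redundant-node indices (0..m-1) adjacent to
    functional node i; labels are the integers 0..q-1.
    """
    k = len(adj)
    for w in product(range(q), repeat=k):          # every functional labeling
        repairable = any(
            all(sum(r[v] == w[i] for v in adj[i]) >= t for i in range(k))
            for r in product(range(q), repeat=m)   # some redundant labeling
        )
        if not repairable:
            return False
    return True


def complete_design(k, q, t):
    """Every functional node is connected to all m = q*t redundant nodes."""
    m = q * t
    return [set(range(m)) for _ in range(k)], m


def repetition_block(k, t):
    """Every functional node gets t private redundant nodes (m = k*t)."""
    return [set(range(i * t, (i + 1) * t)) for i in range(k)], k * t


adj, m = complete_design(k=3, q=2, t=1)
print(is_design(adj, m, q=2, t=1))           # True
adj, m = repetition_block(k=3, t=2)
print(is_design(adj, m, q=2, t=2))           # True
print(is_design([{0}, {0}], 1, q=2, t=1))    # False: one shared spare cannot match two labels
```

The last call shows the basic obstruction: two functional nodes that share their only spare cannot both be repaired when they carry different labels.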
This paper is devoted to the fundamental tradeoff between the two basic parameters of t-error correcting designs:
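• the redundancy of a $(k, m, t, E)_q$-design is $\rho = m/k$;
• the wiring complexity (or average degree) of a $(k, m, t, E)_q$-design is $\varepsilon = E/k$.
For a fixed $q$ and $t \geq 1$ we define the region $R_t$ as the closure of the set of all achievable pairs $(\varepsilon, \rho)$:
$$R_t = \mathrm{closure}\left\{ \left( \frac{E}{k}, \frac{m}{k} \right) : \text{there exists a } (k, m, t, E)_q\text{-design} \right\}. \tag{2}$$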
To interpret the relation between Definition 1 and defect tolerance we consider one particular application, namely reconfigurable circuits. Consider a chip design process in which the chip is composed of many similar cells (e.g. standard cell designs of ASICs). The cell structure is dictated by the chip manufacturer (fab). Each cell has $k$ input/output buses and $k$ placeholders (nodes) that can be filled in with logic realizing one of $q$ functions. Because of manufacturing defects, not all $k$ functional elements will operate correctly. For this reason, each cell also contains $m$ placeholders for redundant elements. The designer then selects what type of logic to instantiate into these redundant elements. Once the chip is manufactured and placed on the testbed, the testing equipment goes over each cell and checks which functional elements came out defective. The programmable switches can then be used to reconnect input/output buses from the defective functional elements to one of the redundant nodes containing the same logic.
With respect to this application, our goal is to understand what cell topologies the fab should try to implement in order to attain the optimal tradeoff between the number of redundant elements, provisional wires (buses) and defect-tolerance. The exact relation to the previous definition of the $t$-error correcting design is as follows: the $k$ functional nodes in our model represent the placeholders intended for the components which are necessary for the chip to operate, and the redundant nodes represent the placeholders for the redundant components. The labeling we apply to the nodes is the choice of components for each space. The edges correspond to the provisional wires. Note that our performance metrics, the redundancy $\rho$ and the wiring complexity $\varepsilon$, are meaningful for this circuit interpretation: they correspond to the extra silicon area and wiring (and fan-out) required for defect tolerance. There are certainly other metrics (such as geometric constraints) which are interesting for circuit applications, but we leave that to future work.
One may argue that the interconnect should be allowed to depend on the labeling of functional nodes. Indeed, the latter will be known before the final topology for the chip is made. In contrast, our procedure insists on laying out provisional wires before the specific choice of elements in the placeholders is known. The explanation is that our work attempts to find a universal design, which would be independent of the chosen functional node labels and thus could serve as the new standard cell for all defect-tolerant circuits. Nevertheless, we will discuss variations of this procedure in Section VI.
Relation to prior work: The subject of designing digital electronics robust to errors has traditionally been approached with the goal of combatting dynamic noise. This is epitomized in the large body of work started by von Neumann [2]. Although significant progress has been made in understanding fundamental limits in von Neumann's model, the practical applications are limited due to the prohibitively high level of redundancy required [3].
Instead, we are interested in fighting static manufacturing failures, a setting which has the advantage that one can test which parts of the circuit failed and configure out (or "wire around") the defective parts. This side information enables significant savings in redundancy [4] and it is rather popular in practice: multi-core CPUs [5], analog-to-digital converters [6], sense-amplifiers [7], parallel computing [8], [9], etc.
In summary, fighting dynamic noise (von Neumann's model) has good theoretical understanding, but requires huge redundancy. Static defects are practically handled via reconfigurability. This paper is an attempt to provide theoretical foundations for the latter method. The two main results that characterize the redundancy-wiring complexity tradeoff are for the small $t$ case (Theorems 3 and 4) and the asymptotic $t$ case (Theorem 5).
Theorem 5 parametrically defines a region of achievable designs, whereas Theorem 3 and Theorem 4 explicitly define the respective achievable regions. (Note that evaluation of the bound (7) presents non-trivial technical difficulties.) The resulting achievable regions for $q = 2$ are plotted in Figure 1. This plot, for instance, shows that at redundancy level 10% we can:
• correct 1 error if each functional node is connected on average to about 1.9 redundant nodes;
• correct 2 errors if each functional node is connected on average to about $1.9 \times 2$ redundant nodes;
• correct $10^3$ errors if each functional node is connected on average to about $1.7 \times 10^3$ redundant nodes.
The optimal designs for Theorem 5 are what we call subset designs, which are discussed in Section III-B. The proofs of Theorems 3 and 5 are given in Sections IV and V respectively. See Section IV of [10] for Theorem 4.
D. Implications and extensions of results
The result for $R_1$ and $R_2$ demonstrates that for correcting small numbers of defects the best solution in the limit of a large number of functional nodes is a linear combination of two basic designs, the repetition block and the complete design (see Figure 2), and designs with finite $k$ can do no better.
We note that while we do not know $R_t$ for $t > 2$, according to (4) all regions $\frac{1}{t} R_t$ will lie between $R_1$ and $R_\infty$, approaching the latter as $t \to \infty$, making Theorem 5 the fundamental limit for the tradeoff between redundancy and wiring complexity. It is perhaps surprising that, unlike most known asymptotic combinatorial problems, this one admits a relatively simple solution. Theorem 5 also holds for an arbitrary alphabet, whereas computing regions $R_t$ for large $q$ is not covered by Theorem 3 or 4.
We also study asymptotics in the regime of fixed $k$ and $m, t \to \infty$ (see [10], Section V).
[Figure: plot of $\frac{1}{t} R_t$ for $q = 2$, with curves for $t = 1$ and $t = 2$; horizontal axis: normalized wiring complexity $E/k$.]
III. EXAMPLES OF GOOD DESIGNS
First, notice that the minimal possible $m$ equals $qt$, like in the complete design. However, some of the edges can be removed.
Optimal designs with $k = q + 1$, $m = q$ and $t = 1$ are shown in Figure 3. Optimal designs with $k = q + 1$, $m = 2q$ and $t = 2$ are shown in Figure 4. While these designs are optimal for a given value of $k$, we can find other designs with larger values of $k$ which are better in terms of the redundancy-wiring complexity ($\rho$-$\varepsilon$) tradeoff.
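For very small parameters, the fewest edges needed for given $k$, $m$, $q$ and $t$ can be found by exhaustive search. The sketch below is our own illustration under that brute-force assumption (the helper names are hypothetical); it does not claim to reproduce the specific graphs drawn in Figures 3-4, only to find the minimum edge count for a given $(k, m, q, t)$.

```python
from itertools import product


def is_design(adj, m, q, t):
    """Exhaustive check of Definition 1; adj[i] is the neighbor set of node i."""
    k = len(adj)
    for w in product(range(q), repeat=k):
        if not any(
            all(sum(r[v] == w[i] for v in adj[i]) >= t for i in range(k))
            for r in product(range(q), repeat=m)
        ):
            return False
    return True


def min_edges(k, m, q, t):
    """Return (edge count, adjacency) of a t-error correcting design with the
    fewest edges among all bipartite graphs on k + m nodes, or None."""
    best = None
    # Each functional node picks a subset of the m spares, encoded as a bitmask.
    for masks in product(range(1 << m), repeat=k):
        adj = [{v for v in range(m) if (mask >> v) & 1} for mask in masks]
        edges = sum(len(a) for a in adj)
        if (best is None or edges < best[0]) and is_design(adj, m, q, t):
            best = (edges, adj)
    return best


edges, adj = min_edges(k=3, m=2, q=2, t=1)   # the case k = q + 1, m = q, t = 1
print(edges)                                  # 5, one fewer than the complete design
```

For $q = 2$ the complete design on $k = 3$, $m = 2$ uses 6 edges, while the search returns 5: one functional node can be given a single spare, as long as the other two keep both connections.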
Proof Sketch. To show the bound on $t$ in terms of $F(P_S)$, fix any labeling $w^k \in \mathcal{X}^k$ of the $k$ functional nodes of $G$. Let $r^m$ be the optimal labeling of the redundant nodes. Let
• $P_X$ denote the empirical distribution of the labels in $w^k$;
• $\underline{\ell} = (\ell_1, \ldots, \ell_q)$ be the type of a redundant node $v$, where $\ell_j$ is the number of functional nodes with label $j$ that are neighbors of $v$;
• $P_{Y|\underline{\ell}}(j|\underline{\ell})$ be the proportion (empirical distribution) of redundant nodes of type $\underline{\ell}$ which are labeled $j$ in the labeling $r^m$.
The distribution of $\underline{\ell}$ for degree-$s$ redundant nodes is approximately given by (10). For each label $j$, we can count the average number of neighboring redundant nodes with label $j$ that a functional node with label $j$ has. This quantity is given by (8) without the maximums and minimums, which appear after taking the worst-case label $j$ and $P_X$ together with the best possible $P_{Y|\underline{\ell}}(j|\underline{\ell})$. We can show by random coding that there is a way for $r^m$ to attain this average for each functional node.
Proof. Note that the corner points $(t, t)$ and $(qt, 0)$ are achieved by the repetition block and the complete design, respectively. By Proposition 1 the region $R_t$ is convex and hence must contain $R_t^{(K)}$. □
By merging the repetition block and the complete design, we can get designs in $R_t^{(K)}$ with each functional node having the same degree.
Proof. Notice that every functional node clearly should have degree at least $t$. Let us define $\pi_j$, $j = t, t + 1, \ldots, qt - 1$, to be the fraction of functional nodes of degree $j$ and $\pi_{qt}$ to be the fraction with degree $qt$ or larger. This satisfies (13). We only need to show (14). For each labeling $r^m \in \mathcal{X}^m$ of redundant nodes let $Q_t(r^m)$ be the set of functional node labelings for which the conditions of Definition 1 are satisfied (we say that $r^m$ covers $Q_t(r^m)$ of the labelings). It is clear that the design is $t$-error correcting if and only if
$$\bigcup_{r^m \in \mathcal{X}^m} Q_t(r^m) = \mathcal{X}^k. \tag{15}$$
Two functional nodes of degree $t$ should have disjoint neighborhoods (otherwise labeling them with different values clearly violates Definition 1). Thus $Q_t(r^m)$ is empty unless each such neighborhood has a constant label. This shows that for the $tk\pi_t$ redundant nodes we are restricted to only $q^{k\pi_t}$ choices, while the rest contribute $q^{m - tk\pi_t}$ more choices.
Given any of the $q^{m - (t - 1)k\pi_t}$ choices of $r^m$ we can estimate $|Q_t(r^m)|$ from above by assuming that each functional
Proof. Let $G$ be a $(k, m, t, E)_q$-design. Choose an ordering of the functional nodes in $G$. For each $\sigma \in S_k$ (the full symmetric group on $k$ elements), let $G_\sigma$ be isomorphic to the design $G$, with the order of its functional nodes transformed by $\sigma$. Then merge $G_\sigma$ for all $\sigma \in S_k$, identifying functional nodes with the same order, and call the result $G_{\mathrm{PERM}}$. $G_{\mathrm{PERM}}$ is constructed to be permutation invariant (and thus a subset design) and by Proposition 6 $G_{\mathrm{PERM}}$ is a $(k, m \cdot k!, t \cdot k!, E \cdot k!)_q$-design.
This paper studies a defect-tolerance model where steps proceed as follows:
a. bipartite graph is designed;
b. functional nodes get $q$-ary labeling;
c. redundant nodes are assigned $q$-ary labels (so that each functional node has $t$ neighbors with matching label).
There are two natural variations where the sequence of steps is interchanged:
• adaptive graph: b. $\to$ a. $\to$ c.
• non-adaptive redundancy: a. $\to$ c. $\to$ b.
In the first case, the graph is a function of the $q$-ary labels, while in the second case the redundant nodes are not allowed to depend on the labeling of functional nodes. The setting considered in this paper (a. $\to$ b. $\to$ c.) is an intermediate case.
The fundamental redundancy-wiring complexity tradeoff for these cases is defined similarly to (2). Both tradeoffs are rather easy to determine for any $t \geq 1$:
• adaptive graph: $R_t = \{(\varepsilon, \rho) : \varepsilon \geq t, \rho \geq 0\}$.
• non-adaptive redundancy: $R_t = \{(\varepsilon, \rho) : \varepsilon \geq qt, \rho \geq 0\}$.
These observations are summarized in Figure 5 .
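The corner points of these two regions can be illustrated with a short Python sketch. The constructions below are our own reading of why the regions take this form, not constructions quoted from the paper: in the adaptive case each functional node is wired only to $t$ spares of its own (already known) type, and in the non-adaptive case every functional node is wired to $t$ spares of every type. All function names are hypothetical.

```python
from itertools import product


def spare_labels(q, t):
    """m = q*t spares: t spares of each of the q types (assumed layout)."""
    return [x for x in range(q) for _ in range(t)]


def adaptive_graph(labels, q, t):
    """Graph built after the functional labels are known (b -> a -> c)."""
    return [set(range(x * t, (x + 1) * t)) for x in labels]


def non_adaptive_graph(k, q, t):
    """Graph that must work for spares labeled before the functional nodes."""
    return [set(range(q * t)) for _ in range(k)]


def repairs(adj, spares, labels, t):
    """True if every functional node has at least t neighbors of its own label."""
    return all(sum(spares[v] == x for v in a) >= t for a, x in zip(adj, labels))


def tradeoff_point(adj, m):
    """Return (wiring complexity E/k, redundancy m/k)."""
    k = len(adj)
    return sum(len(a) for a in adj) / k, m / k


q, t = 2, 3
labels = [0, 1, 1, 0]
spares = spare_labels(q, t)

adj = adaptive_graph(labels, q, t)
print(repairs(adj, spares, labels, t), tradeoff_point(adj, len(spares)))

adj = non_adaptive_graph(len(labels), q, t)
print(all(repairs(adj, spares, w, t) for w in product(range(q), repeat=len(labels))),
      tradeoff_point(adj, len(spares)))
```

In both cases $m = qt$ stays fixed, so the redundancy $m/k$ vanishes as $k$ grows, while the wiring complexity is $t$ in the adaptive case and $qt$ in the non-adaptive case.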
B. Stochastic defects
This work considers correcting arbitrary (worst-case) defect patterns. One of the conclusions is that to correct a fraction $\alpha$ of defects (i.e. $t = \alpha k$) on $k$ functional nodes, the number of edges should grow as $k^2$. Instead, we can relax the requirement to correcting i.i.d. Bernoulli($\alpha$) defects. Each defect pattern will occur with some probability and we only want all defects in the design to be corrected with high probability (computed over the distribution of defects and functional labels). It turns out that in such a probabilistic model, correcting a fraction $\alpha$ of defects is possible with designs possessing $O(k \log k)$ edges and $O(k)$ redundant nodes. See Section 4.4 in [11].
C. Open Problems
Regions which are still to be determined include:
• $R_t$ for $t > 3$ and $q = 2$
• $R_t$ for $t > 1$ and $q \geq 3$
For $q = 2$, it is also unknown what the smallest value of $t$ is for which $R_t$ does not equal the region defined in Equation (5).
