Abstract-Entering the nanometer era, a major challenge to current design methodologies and tools is how to effectively address the high defect densities projected for nanoelectronic technologies. To this end, we proposed a reconfiguration-based defect-avoidance methodology for defect-prone nanofabrics. It judiciously architects the nanofabric, using probabilistic considerations, such that a very large number of alternative implementations can be mapped into it, enabling defects to be circumvented at configuration time, in a scalable way. Building on this foundation, in this paper we propose a synthesis framework aimed at implementing this new design paradigm. A key novelty of our approach with respect to traditional high level synthesis is that, rather than carefully optimizing a single ('deterministic') solution, our goal is to simultaneously synthesize a large family of alternative solutions, so as to meet the required probability of successful configuration, or yield, while maximizing the average performance of the family of synthesized solutions. Experimental results generated for a set of representative benchmark kernels, assuming different defect regimes and target yields, empirically show that our proposed algorithms can effectively explore the complex probabilistic design space associated with this new class of high level synthesis problems.
I. INTRODUCTION

EMERGING nanotechnologies have seen significant advances in recent years [2], [3], [4], [5], and it is predicted that the manufacturing of nanoelectronic-based systems is likely to become practical within 10-15 years [6]. Besides the inevitable challenges in terms of complexity and scalability, the ability to handle defective devices will be a critical element of any future architecture, since defect rates are expected to be much higher than current values [4], [6], [7]. Building on the success of the TERAMAC experiment [8], Heath et al. identified the possibility of utilizing reconfiguration to achieve defect tolerance in systems targeted at emerging nanoelectronic technologies [7]. Since then, several nanostructures well suited to creating reconfigurable computational fabrics have been successfully demonstrated, see e.g., [2], [3], [5]. Yet, design approaches implementing defect tolerance via reconfiguration have to contend with a major scalability challenge: defect mapping and configuration must be performed on a per-chip basis [7], [5], [9], [10]. Recently, we proposed a probabilistic design paradigm aimed at enabling both such complex tasks to be performed 'on chip', relying on the processing power of the fabric itself, which is a critical step towards ensuring scalability [11], [12]. Our approach is based on structuring designs as hierarchies of carefully dimensioned (re)configurable fabric regions, while decomposing and assigning functional flows to each region. By restricting the functionality preassigned to a specific nanofabric region, we effectively limit the scope and complexity of the associated defect mapping and configuration tasks; see details in Section II.
This is an extended version of a paper by the same authors presented at the ACM/IEEE Design Automation and Test in Europe (DATE) conference, 2006 [1]. This extended journal version provides a more detailed discussion (supported by new experimental results) on delay variations exhibited by components synthesized with the DAS-NANO framework, and more details on yield estimation and on the metrics used in the cost function of the clustering algorithm. We have also included more experimental results assuming different defect regimes, extended the background information on the target nanofabric architecture, and provided a more detailed contrast to previous work.
Chen He is with Freescale Semiconductor, Inc., Austin, TX 78735, USA (e-mail: chen.he@freescale.com). Margarida F. Jacome is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, USA (e-mail: jacome@ece.utexas.edu).
Beyond providing a promising foundation towards addressing the scalability challenge of reconfiguration-based defect-avoidance techniques, the approach in [11], [12] also provides a framework in which to explore critical new tradeoffs among performance, yield, and cost/complexity. As will be seen, the probabilistic nature of these tradeoffs makes this new class of 'defect-aware' high-level synthesis (HLS) problems quite unique. In particular, rather than carefully optimizing a single ('deterministic') solution, as done in traditional HLS, the defect-aware HLS problem requires the joint synthesis and optimization of a sufficiently large family of alternative solutions, so as to enable actual defects to be circumvented at configuration time, which is critical towards meeting the target probability of successful configuration, i.e., yield.$^{1}$ The need to provide 'redundant' (or 'extra') configuration capacity, so as to enable such multiple solutions, is non-existent in traditional HLS, and fundamentally impacts system performance, thus leading to a substantial departure in the way the design space should be explored. For example, judiciously increasing the 'size' of the behavioral flows (or 'instructions') to be atomically executed on application-specific functional units, usually critical to achieving high performance in traditional HLS [13], [14], [15], [16], [17], may actually hurt average (or expected) performance in this new context. This is so because 'larger' flows may require substantially more redundant configuration capacity to meet the target yield, thus decreasing the degree of locality of a design, see [11], [12].
Due to those key differences with respect to traditional HLS, this new class of problems requires the definition of a new HLS framework and associated support algorithms, appropriately exploring the design space. In this paper, we propose a Defect-Aware Synthesis framework for reconfigurable NANOfabrics (DAS-NANO), aimed at systematically solving this new class of HLS problems. Specifically, given a target application kernel, the broad goal of DAS-NANO is to generate a design that achieves the specified target yield with best expected performance, i.e., best average component latency over a large number of nanochip instances.
We start by introducing DAS-NANO's main synthesis flow, and then discuss in some detail the different algorithms implementing its various phases. Then, relying on extensive experimental data generated for a set of representative benchmark kernels, assuming different defect regimes and target yields, we empirically show that the proposed framework can effectively explore the complex design space defined by this new class of HLS problems.
The paper is organized as follows. In Section II we briefly review the architecture of our target nanofabric. An overview of DAS-NANO is given in Section III, and Sections IV, V and VI discuss in detail the various algorithms implemented in the framework. Experimental results are presented in Section VII. Our work is contrasted with previous research in Section VIII. Finally, Section IX concludes the paper.
II. BACKGROUND ON TARGET NANOFABRIC ARCHITECTURE
In this section, we first briefly introduce a concrete nanowire crossbar-based memory organization that could potentially be used as the basic building block for our target nanofabric. Although this specific design is not part of our work, suggesting a potential realization path for our memory-based reconfigurable nanofabric architecture, grounded in state-of-the-art work on emerging nanotechnologies, is critical towards establishing its potential viability and promise. Then we review the abstractions and design hierarchy of our proposed nanofabric architecture. Finally, we discuss the defect model assumed in our experiments.
A. Technology Background: Crossbar-Based Nanomemory
Computers built solely of wires, switches and memory-based look-up tables (LUTs), i.e., requiring no traditional logic gates (or very few of these), are very appealing in the context of nanotechnologies [7]. The appeal of such memory-only computers lies in the fact that one can rely mostly on simple, highly regular, and ultra dense fabrics, comprised of crossbar structures, to build powerful substrates capable of performing arbitrarily complex computations.
For concreteness, Fig. 1 shows a promising nanowire crossbar-based memory structure, denoted the Harvard-CalTech nanomemory [18], [19]. This particular architecture contains: (1) a crossbar nanowire memory array comprised of nonvolatile nanoscale cross-switches, with possible realizations for the latter including suspended nanotube switches [3], crossed nanowire diodes [18], or rotaxane-based molecular switches [2]; and (2) a row decoder and a column decoder, with possible implementations of such decoders relying on crossed semiconductor nanowire (cNW) field-effect transistor (FET) arrays [20], or FET arrays formed by modulation-doped nanowires with top-gated microwires [19]. Work in [18], [19] has shown that such a nanomemory structure is practical and that tremendous density can be achieved.
Fig. 1. Nanowire crossbar-based memory structure [18], [19].
B. Abstractions and Design Hierarchy
Our target nanofabric, architected as proposed in [11], [12], is shown in Fig. 2. The basic configuration unit of the nanofabric, called a region, is a grid of eight processing elements (PEs) and eight switching elements (SEs); see the bottom part of Fig. 2. The PEs perform standard 8-bit arithmetic/logic operations, and are comprised of a small set of look-up tables (LUTs), implementing the various operations, and simple control logic. Such LUT-based (i.e., memory-based) PEs are likely to be a good solution in that they may be realizable in a very compact and regular way, exploiting the favorable characteristics of the above-mentioned array-based nanomemory architecture. SEs are used to route between adjacent PEs. As established in [12], [21], the use of such PEs and SEs as the primitive programmable elements enables a simple and scalable defect mapping and configuration methodology. In contrast, 'finer-grained' approaches, e.g., [9], [22], require offline processing specific to individual chips, which may be a serious impediment to low-cost and fast mass-production (see also Section VIII-B). Still, it should be noted that some level of internal redundancy may be needed in order to enable our 'coarse' PEs to be treated as primitive elements in our approach, that is, to keep the corresponding likelihood of failure within acceptable limits (say, no more than 20% or 25%, as suggested in [4]).
As discussed in [11], [12], each region of the target nanofabric can be configured to execute a small behavioral segment, or 'complex instruction', comprising a small number of (interdependent) arithmetic/logic operations, called a basic flow (see the bottom right part of Fig. 2, and the formal definition in Section V-A). In order to enhance the probability of successful configuration when defects are present in the target region, we require basic flows to contain fewer than eight operations/nodes; for example, the basic flow shown in the bottom frame of Fig. 2 has only four operations. Naturally, the larger the resource redundancy in a region, the larger the number of alternative configurations for a basic flow, and thus the higher the probability of successfully instantiating it on that region, in the presence of defects.
To achieve a sufficiently high probability of successful configuration in a scalable way, we have introduced a second hierarchical/aggregation level in the fabric architecture, denoted mapping unit (MU) [11], [12], shown in the middle part of Fig. 2. Namely, a flow cluster with $m$ basic flows can be instantiated in an MU containing $n$ regions, where $m \le n$. The middle frame of Fig. 2, for example, shows a flow cluster with three basic flows being mapped to an MU with four regions. Thus, the mapping unit abstraction creates a second level of redundancy, while retaining the original simplicity of the region-based defect mapping and configuration.
Finally, MUs are grouped together to form a component of the nanofabric, implementing an application kernel. For example, the application kernel shown on the top of Fig.2 is comprised of two MUs, each with four regions -the top three flows of the kernel are assigned to MU1, while the bottom three flows are assigned to MU2. As discussed in [11] , [12] , to simplify scheduling and control, we require the set of flows assigned to the various MUs of a component to satisfy convexity constraints [16] , that is, there cannot be 'circular' (input/output) data dependencies among basic flows assigned to different MUs.
A few brief comments on routing. We start by reiterating the well-known fact that it is much harder to achieve defect and fault tolerance on generic transformational elements (e.g., functional units and PEs) than it is on non-transformational elements (e.g., switching and transport/routing structure) -see, e.g., the huge complexity associated with coded computation [23] , [24] in contrast to the much simpler error detecting/correcting codes used for data storage and transport [25] . Accordingly, most of our work on defect mapping and defect tolerance in [11] , [12] focuses on the more challenging transformational elements of our fabric architecture -i.e., the regions -rather than on routing structures. Still, a brief discussion on the routing structure assumed in our fabric is in order. Inside a region, SEs are used to route between adjacent PEs. For the various defect regimes assumed in our experiments (see Section VII), SEs supporting up to two routing channels among adjacent PEs proved to be sufficient [11] , [12] , yet this may need to be revisited in the future -see Section IX. We assume that there are two input and two output pads placed on each side of a region, so that up to two primary inputs and two primary outputs can be routed to/from each PE. Furthermore, each region within an MU is surrounded by a routing channel containing an identical number of routing tracks on each of its sides, and a switch box is placed at the crossing points of such routing channels. At the component level, inter-MU routing is done using a similar architecture, comprised of long lines along each side of an MU, and switch boxes placed at the crossing points of such long lines. In [11] , [12] , we have shown that the complexity of routing inside an MU can be effectively controlled, by limiting the maximum number of regions per MU. 
For instance, with no more than nine regions per MU, one can simply use four routing rings to route signals among the regions in an MU, and rely on efficient look-up-table based algorithms to explicitly program the switch boxes with the possible path between any two such regions. By limiting the number of MUs per component, a similarly simple routing strategy can also be used for inter-MU routing.
The top level component abstraction discussed above implements an interface that hides/encapsulates the particular defect realizations for a nanofabric instance -that is, all operational (i.e., successfully configured) systems are structurally 'identical' at the component level [11] , [12] . Note, however, that there still is uncertainty associated with the actual performance, i.e., delay/latency, of such components. Namely, there will be delay variability across different component instances, and components are still susceptible to transient/soft faults, and may thus malfunction. Still, the component abstraction provides a basic foundation towards controlling the complexity associated with handling the remaining sources of uncertainty.
As discussed in [11] , [12] , by architecting the nanofabric in terms of such hierarchy of abstractions, one decomposes the nanosystem's complex defect mapping and configuration problems into a set of quasi-independent subproblems, each with the scope of a single region and basic flow. This makes our approach inherently scalable. Moreover, because each subproblem is small and relatively simple, solutions can be explicitly 'enumerated' and are thus appropriate for implementation with simple table-look-up algorithms -critical towards enabling on-chip self-test and self-configuration methods.
To summarize, when designing a reconfigurable nanofabric to implement a nanosystem's components with a given target yield, we need to allocate the resources required by each component, i.e., determine the number of MUs and the number of regions within each MU, as well as the number of routing tracks forming the interconnect structure. We need also to decompose the kernel to be executed by each such component into a number of basic flows -we may think of this as the 'instruction selection' phase of the synthesis process. Finally, we need to assign subsets of such basic flows to the component's MUs, so as to achieve best expected performance, i.e., best average component latency over all nanofabric chips, while meeting the target yield. In this paper, we propose a synthesis framework and associated algorithms to automate this process.
C. Defect Model
Since nanotechnologies are still in their infancy, we limit ourselves to an abstract notion of defects, similar to that adopted in [9]. Specifically, we assume that a defective PE or SE will malfunction without affecting other surrounding resources. Furthermore, we assume that a defect always manifests itself as a permanent fault, such as a stuck-open or stuck-at fault [9], and, thus, we do not consider defects that only affect parameters such as delay and power consumption, i.e., we do not consider 'parametric' yield issues. Finally, we do not consider latent/aging defects, although our reconfiguration-based approach could be extended to tolerate defects occurring in the field as well.
In order to validate our approach empirically, as well as perform trend analysis, we consider defect regimes parameterized by the tuple $(P_e, P_a, P_c)$, where $P_e$, $P_a$, and $P_c$ denote the probabilities of failure for PEs; PEs operating as arbiters$^{2}$; and connections$^{3}$, respectively. We assume a wide range of potential defect distributions for future nanotechnologies, namely, $P_e$ in the range of 1-20%, $P_a$ in the range of 0.5-10%,$^{4}$ and $P_c$ in the range of 0.1-2%.$^{5}$ Finally, we assume a uniform distribution of defects, and that such defects are independent and identically distributed (i.i.d.) across regions and routing tracks. As nanotechnologies mature, we expect to be able to consider defect models that more precisely represent the defect characteristics expected of such technologies; see Section IX.
III. OVERVIEW OF DEFECT-AWARE SYNTHESIS FLOW
Given an application kernel, the goal is to generate a component design with best expected performance, i.e., best average component latency over all nanofabric chips, while meeting the specified probability of successful configuration, or yield [1]. Fig. 3 shows the overall synthesis flow implemented in DAS-NANO. For simplicity, the target yield for intra- and inter-MU communication resources, $P_{tc}$, is specified separately from the yield for the component's regions, $P_{tt}$.$^{6}$ As shown in Fig. 3, the main steps of DAS-NANO's synthesis flow are:
(1) DAS-BehavioralBounds: setting the maximum number of operations (or nodes) allowed on each basic flow ('instruction size'), and the maximum number of basic flows that can be mapped into a single mapping unit (MU) (discussed in Section IV).
(2) DAS-Allocation: deciding on the number of MUs to be instantiated in the component, and on the number of regions within each such MU (discussed in Section IV).
(3) DAS-InstructionSelection: generating a flow cover for the kernel's dataflow graph (DFG), using basic flows with no more than the allowed number of nodes (discussed in Section V).
(4) DAS-Binding: clustering and assigning basic flows to MUs, subject to convexity constraints; unlike conventional HLS, this is a many-to-many binding (discussed in Section V).
(5) DAS-Routing: specifying the communication structure, i.e., the number of intra- and inter-MU tracks to be instantiated in the nanofabric (discussed in Section VI).
The decisions made in steps DAS-BehavioralBounds and DAS-Allocation are based on rough, preliminary yield estimates -Section IV discusses how such estimates are generated. Once all design details are fully defined, one still needs to check if the actual yield meets the target value. As shown in Fig. 3 , if it does not, the level of redundant configuration capacity provided on the particular component design needs to be increased (via step DAS-AdjustCapacity), and then a new design cycle is initiated.
Final Configuration Step. Once the design is finished and a corresponding batch of programmable nanofabric chips is manufactured, those chips still need to be configured. Namely, the defects on each chip first need to be individually mapped (using the TMR-tile based group testing method described in [12], [21]) and then, based on such results, an exact placement of each basic flow onto a region of the appropriate MU is determined, such that the identified faulty resources are avoided. Both topics were extensively addressed in [12], [21], and are beyond the scope of this paper.
IV. DAS-BEHAVIORALBOUNDS AND DAS-ALLOCATION
The first two steps of DAS-NANO's synthesis flow define the level of 'redundant' configuration capacity that needs to be provided in the nanofabric, so as to achieve the target yield with best average latency for the particular component. Specifically, they define the 'size' of the behavioral abstractions (basic flows and clusters of flows) and the 'size' of the corresponding structural abstractions (number of MUs and number of regions per MU). Note that these two problems, coupled to the subsequent instruction selection and binding problems shown in the synthesis flow in Fig. 3 , are essentially a probabilistic version of the classical instruction selection and binding problems in high-level synthesis (HLS), which are well known to be NP-hard [26] , [27] , [28] . Specifically, beyond the fact that our objective function is expected latency (rather than 'deterministic' latency), we must also meet an extra (complex) optimization constraint, not present in traditional HLS -yield -making the defect tolerant version of these traditional HLS problems even harder than their original counterparts. Therefore, we use (greedy) heuristic algorithms to tackle the joint problems in a divide-and-conquer manner, as discussed below.
A. Algorithm to Determine Configuration Capacity
In order to enhance a component's yield, one should increase its level of 'redundant' configuration capacity, by: 1) instantiating more regions in its MUs; 2) using a kernel cover with smaller basic flows; and/or 3) instantiating more MUs and assigning fewer basic flows to each MU. These alternatives are listed in increasing order of their impact on yield and average component latency; see experimental results in [11], [12]. Accordingly, since our goal is to meet the yield constraint with best average component latency, we have developed a greedy algorithm that starts from a design with minimum configuration capacity, and thus with the highest possible locality, and then increases configuration capacity, in the above order, until the target yield is met.
The pseudo-code of the proposed algorithm is shown in Fig. 4. Its inputs are: $n_G$, the number of nodes in the kernel's DFG $G$; and $P_{tt}$, the target yield for the component's transformational resources. The algorithm's outputs are: $n_{max}$, the maximum number of operations allowed on any basic flow$^{7}$; $f_{max}$, the maximum number of basic flows that can be mapped into one MU; $n_{MU}$, the number of MUs to be instantiated in the component; and $n_{reg}$, the number of regions in each MU.
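As an illustration only, the greedy search described above might be sketched as follows. This is not the pseudo-code of Fig. 4: the loop nesting is one possible reading of "increase capacity in the listed order", the region-level yields in `REGION_YIELD` are made-up numbers, the nine-region cap is borrowed from the routing example in Section II-B, and `choose_capacity`, `mu_yield`, and `max_mus` are hypothetical names. MU yield is approximated with a binomial tail under the i.i.d. assumption, treating all basic flows as identical and all MUs as equally loaded.

```python
from math import ceil, comb

# Hypothetical region-level yields P_r, indexed by basic-flow size (illustrative only).
REGION_YIELD = {1: 0.99, 2: 0.97, 3: 0.94, 4: 0.90, 5: 0.84, 6: 0.75, 7: 0.62}
MAX_REGIONS_PER_MU = 9  # assumption, taken from the nine-region routing example


def mu_yield(p_r, f_max, n_reg):
    """Probability that at least f_max of n_reg i.i.d. regions can host the flow."""
    return sum(comb(n_reg, i) * p_r**i * (1 - p_r) ** (n_reg - i)
               for i in range(f_max, n_reg + 1))


def choose_capacity(n_g, p_tt, max_mus=16):
    """Return (n_max, f_max, n_mu, n_reg) for the first design meeting yield p_tt.

    Capacity is increased in the order: more regions per MU (innermost loop),
    then smaller basic flows, then more MUs (outermost loop).
    """
    for n_mu in range(1, max_mus + 1):
        for n_max in sorted(REGION_YIELD, reverse=True):
            n_flows = ceil(n_g / n_max)   # flows needed to cover the DFG
            f_max = ceil(n_flows / n_mu)  # flows per MU, worst case
            for n_reg in range(f_max, MAX_REGIONS_PER_MU + 1):
                p = mu_yield(REGION_YIELD[n_max], f_max, n_reg) ** n_mu
                if p >= p_tt:
                    return n_max, f_max, n_mu, n_reg
    return None  # target not reachable within the assumed bounds
```

With these illustrative numbers, `choose_capacity(14, 0.9)` keeps the largest flows and a single MU, and only adds regions until the estimated yield passes the target.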
B. Yield Estimation
Preliminary Yield Estimation. The algorithm in Fig. 4 requires estimating the transformational component yield $P$ (see line 8) so as to get the proper configuration capacity. We solve this hierarchically, by estimating the yield at each level of the design hierarchy. First, as in [11], [12], we estimate basic flow yield at the region level using Monte Carlo simulation, for a specific defect regime $(P_e, P_a, P_c)$. Specifically, we generate a large number of defect realizations on a region, and then use the TMR-based group testing method described in [12], [21] to obtain a defect map for each such region instance. We then use a simple table-look-up algorithm to determine whether a feasible configuration exists for the particular basic flow on that region instance. The probability of successful configuration, or yield, for each basic flow is given by the fraction of region instances for which a feasible configuration has been found.
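The Monte Carlo procedure can be illustrated with a deliberately simplified toy model. The sketch below assumes a region of eight PEs that fail independently with probability `p_e`, and declares a basic flow configurable whenever enough PEs work. It ignores SEs, arbiters, connectivity requirements, and the group-testing step, all of which the actual flow takes into account; the function names are hypothetical.

```python
import random
from math import comb


def estimate_flow_yield(n_nodes, p_e, n_pes=8, runs=20000, seed=0):
    """Toy MC estimate: fraction of region instances with >= n_nodes working PEs."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        # One defect realization: each PE is defective with probability p_e.
        working = sum(rng.random() >= p_e for _ in range(n_pes))
        if working >= n_nodes:
            successes += 1
    return successes / runs


def exact_toy_yield(n_nodes, p_e, n_pes=8):
    """Closed-form yield for the same toy model, used only as a sanity check."""
    p_ok = 1 - p_e
    return sum(comb(n_pes, k) * p_ok**k * p_e ** (n_pes - k)
               for k in range(n_nodes, n_pes + 1))
```

In this toy setting the estimate can be checked against the closed form; for real basic flows no such closed form is available, which is why simulation is used.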
We run such Monte Carlo (MC) simulations for essentially all possible basic flows of various sizes, considering different defect regimes $(P_e, P_a, P_c)$ (see Section VII-A for more details on the MC simulation). Fig. 5 shows a sample of our results, namely, minimum and maximum yields for basic flows containing one to seven nodes, assuming defect regime $(P_e, P_a, P_c) = (10, 5, 1)\%$. The results were obtained with one million MC simulation runs, to ensure that the worst relative error of each probability estimate was no more than 10% with 95% confidence. As one would expect, basic flow yield decreases as the number of nodes in the basic flow increases, yet there are some variations for basic flows of identical size, caused by their distinct connectivity requirements. Thus, when initially estimating the yield for a basic flow of a given size (under a specific defect regime), one may select more or less conservative values, depending on how aggressively one wishes to optimize expected performance, knowing that choosing less conservative values may require iterating over several design cycles.
Yield at the next level of hierarchy, i.e., the MU level, is roughly estimated assuming that all basic flows are identical. For this special case, i.e., $f_{max}$ identical basic flows being mapped into $n_{reg}$ regions, yield at the MU level is equal to the probability of having at least $f_{max}$ regions on which that basic flow can be successfully configured. Given the i.i.d. assumption on the defects across regions (see Section II-C), yield is directly given by:
$$P_{MU} = \sum_{i=f_{max}}^{n_{reg}} \binom{n_{reg}}{i} P_r^{\,i} (1 - P_r)^{n_{reg}-i} \qquad (1)$$
where $P_r$ is the estimated yield for the particular basic flow being considered, obtained as discussed previously. Finally, the yield estimate at the component level, $P$, is given by
$$P = \prod_{i=1}^{n_{MU}} P_{MU_i} \qquad (2)$$
where $P_{MU_i}$ is the yield of the component's $i$th MU.
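A minimal sketch of the preliminary estimates, assuming i.i.d. defects across regions and independent MUs: the MU-level yield is then a binomial tail probability (at least $f_{max}$ of $n_{reg}$ regions can host the flow), and the component-level yield is the product of the MU yields. The function names are hypothetical.

```python
from math import comb, prod


def mu_yield(p_r, f_max, n_reg):
    """Probability that at least f_max of n_reg regions can host the basic flow."""
    return sum(comb(n_reg, i) * p_r**i * (1 - p_r) ** (n_reg - i)
               for i in range(f_max, n_reg + 1))


def component_yield(mu_yields):
    """Component succeeds only if every one of its MUs is successfully configured."""
    return prod(mu_yields)


# Example: three identical flows per MU, four regions per MU, region yield 0.8.
p_mu = mu_yield(0.8, f_max=3, n_reg=4)
p_component = component_yield([p_mu, p_mu])
```

The example mirrors the two-MU component of Fig. 2 (three flows on four regions each); the region yield of 0.8 is an arbitrary illustrative value.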
Final yield estimation. The method discussed above generates the rough yield estimates used to drive the design of the fabric, for each particular kernel/component. Once the design is concluded, the procedure used to estimate the yield of the resulting detailed component design is very similar, except that now we know the exact flow cover, as well as the resource allocation and assignment decisions implemented in the design, and can thus be more accurate. In particular, at the MU level, we now need to consider the case of mapping different basic flows to an MU. An interesting and common case consists of mapping $i$ basic flows of type $f$, and $j$ basic flows of type $g$, to an MU consisting of $n_{reg}$ regions, where basic flow $f$ is 'dominated' by basic flow $g$, denoted $f \subseteq g$. Informally, we say that $f \subseteq g$ if the graph representing basic flow $f$ is a subgraph of that representing $g$. In order to compute the probability of successfully configuring these basic flows in the $n_{reg}$ regions of the MU, we can first select at least $i + j$ regions that are configurable for the 'dominated' basic flow $f$, and then pick at least $j$ regions among them to configure the 'dominating' basic flow $g$. So, we have
$$P_{MU} = \sum_{k=i+j}^{n_{reg}} \binom{n_{reg}}{k} P_r(f)^{k} \bigl(1 - P_r(f)\bigr)^{n_{reg}-k} \sum_{l=j}^{k} \binom{k}{l} P_r(g|f)^{l} \bigl(1 - P_r(g|f)\bigr)^{k-l} \qquad (3)$$
where $P_r(f)$ is the estimated yield of basic flow $f$ on a region, and $P_r(g|f)$ is the conditional probability of successfully configuring the 'dominating' basic flow $g$ on a region, given that the 'dominated' basic flow $f$ can be configured on that region, which is given by
$$P_r(g|f) = \frac{P_r(g)}{P_r(f)} \qquad (4)$$
where $P_r(g)$ is the estimated yield of basic flow $g$ on a region.
Equation (3) can be extended to the case of mapping a set of basic flows containing more than two basic flow types, but with a strict '⊆' (dominance) ordering among all flows. Finally, for the general case where m different basic flows, without such dominance ordering, are mapped into n regions of an MU, the yield estimate is obtained by exhaustively enumerating all possible configuration combinations.
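The two-type dominance case can be sketched and cross-checked numerically. The code below assumes that any region able to host $g$ can also host $f$ (so the conditional yield is the ratio of the two region yields, which requires `p_g <= p_f`), and compares the closed-form double sum against a brute-force enumeration of per-region states. The function names and the three-state region model are illustrative assumptions.

```python
from itertools import product
from math import comb


def binom_tail(n, k_min, p):
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def mu_yield_dominated(p_f, p_g, i, j, n_reg):
    """Yield for i flows of type f and j of type g (f dominated by g) on n_reg regions."""
    p_g_given_f = p_g / p_f
    total = 0.0
    for k in range(i + j, n_reg + 1):  # k regions can host f
        p_k = comb(n_reg, k) * p_f**k * (1 - p_f) ** (n_reg - k)
        total += p_k * binom_tail(k, j, p_g_given_f)  # at least j of them host g
    return total


def mu_yield_enumerated(p_f, p_g, i, j, n_reg):
    """Same quantity by exhaustive enumeration of region states.

    State 0: hosts neither flow; 1: hosts f only; 2: hosts both f and g.
    """
    probs = (1 - p_f, p_f - p_g, p_g)
    total = 0.0
    for states in product(range(3), repeat=n_reg):
        n_g = states.count(2)
        n_f = n_g + states.count(1)
        if n_g >= j and n_f >= i + j:
            pr = 1.0
            for s in states:
                pr *= probs[s]
            total += pr
    return total
```

The enumeration grows as $3^{n_{reg}}$, which stays small for MUs of at most nine regions; it is also the kind of exhaustive evaluation that the general, unordered case falls back on.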
V. DAS-INSTRUCTIONSELECTION AND DAS-BINDING
In the two subsequent steps of DAS-NANO's synthesis flow, also called the 'clustering' phase, a flow cover for the kernel's dataflow graph (DFG) is first generated (DAS-InstructionSelection), and then an assignment of the resulting basic flows to MUs is performed (DAS-Binding), satisfying $n_{max}$ and $f_{max}$, respectively, and aiming at minimizing average (or expected) latency. Furthermore, as alluded to above, the sets of basic flows (or 'flow clusters') assigned to the various MUs must also satisfy convexity constraints, that is, there cannot be 'circular' data dependencies among them.
A. Problem Formulation
We first introduce some notation. We use a dataflow graph (DFG) representation for the kernels of interest. A DFG is a directed acyclic graph (DAG), denoted as $G = (V, E)$, where the set of vertices $V$ represents operations and the set of edges $E \subseteq V \times V$ models data dependencies between operations. Given an edge $e$ from $u$ to $v$, denoted $e = (u, v)$, $u$ is a predecessor of $v$, and $v$ is a successor of $u$. We use $pred(v)$ and $succ(v)$ to represent the sets of $v$'s predecessors and successors, respectively. The union of $pred(v)$ and $succ(v)$ forms the set of $v$'s adjacent nodes, denoted $adj(v)$.
A basic flow is a subgraph of
We use $|f|$ to denote the number of nodes in $f$. A flow cluster, $C = (V_C, E_C)$, is a subgraph of $G$ containing a set of basic flows $f_i$ $(i = 1, \ldots, |C|)$, where $|C|$ stands for the number of basic flows in $C$, and
We say that a flow cluster $C$ is convex if there exists no path in $G$ from a node $v \in C$ to another node $u \in C$ that contains a node $x \notin C$.
Our clustering problem comprises two steps: DAS-InstructionSelection and DAS-Binding. Fig. 6 symbolically illustrates the results of these two steps. The output of the DAS-InstructionSelection step is a node-clustered DFG, denoted $G_f$ (see resulting basic flows $f_1$, $f_2$, $f_3$ and $f_4$ in Fig. 6(b)), and the output of the DAS-Binding step is a flow-clustered DFG, denoted $G_{fc}$ (see resulting flow clusters $C_1$ and $C_2$ in Fig. 6(c)), where each flow cluster is assigned to an MU. Note that $G_{fc}$ has three types of edges: intra-flow edges, inter-flow edges, and inter-MU (or inter-flow-cluster) edges. As discussed in [11], the intra-flow edges do not cause extra delay, while the inter-flow and inter-MU edges do incur data transfer delays, corresponding to moving data between regions belonging to the same or to different MUs.
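The convexity condition can be checked with a simple reachability search. The sketch below assumes the DFG is given as a list of `(u, v)` edges and the cluster as the set of DFG nodes it contains (i.e., the union of the nodes of its basic flows); `is_convex` is a hypothetical helper, not part of DAS-NANO.

```python
from collections import defaultdict


def is_convex(edges, cluster):
    """Return True if no path leaves the cluster and later re-enters it."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    cluster = set(cluster)

    # Seed the search with outside nodes fed directly by a cluster node.
    stack = [v for u in cluster for v in succ[u] if v not in cluster]
    seen = set(stack)
    while stack:
        x = stack.pop()
        for y in succ[x]:
            if y in cluster:
                return False  # found a path: cluster -> outside node(s) -> cluster
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return True


# Small example: a -> b -> c and a -> c.
edges = [("a", "b"), ("b", "c"), ("a", "c")]
```

Here `{"a", "c"}` is not convex, because the path a, b, c passes through the outside node b, whereas `{"a", "b"}` is convex.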
We denote the schedule latency $L_G$ as the number of clock cycles required to complete the execution of all nodes (or operations) in $G$. For a given target latency, the mobility $\mu(v)$ [...]
Cluster each basic flow $f$ in $G_f$ into a flow cluster $C$ ($cluster(f) = C$), such that $\forall C$, $|C| \le f_{max}$, $C$ is convex, and $L_{G_{fc}}$ is minimized.
Unlike conventional HLS problems, in our probabilistic design paradigm the delay incurred by such data transfers may vary among different component instances, since the mapping of basic flows to regions is not fixed, and depends upon the actual defect distributions on each chip. Fig. 7 illustrates this, using a hypothetical kernel, where basic flows $f1$ and $f3$ in flow cluster C1 are assigned to MU1, and $f2$ in C2 is assigned to MU2, and both MU1 and MU2 are assumed to contain nine regions. For illustration purposes, we fix the position of basic flow $f1$ in MU1 and vary the positions of $f3$ and $f2$ in MU1 and MU2, respectively. Clearly, the data transfer delay of the inter-flow edge $e1$ (from $f1$ to $f3$) varies substantially for the two alternative placements of $f3$ in MU1, i.e., $e1_{delay2} > e1_{delay1}$. Similar considerations can be made with respect to the data transfer delay of the inter-MU edge $e2$ (from $f1$ to $f2$) for the two alternative placements of $f2$ in MU2. Since the region in the appropriate MU to which each basic flow is actually mapped will not be determined until defect mapping and configuration is performed for each particular chip, our clustering algorithm uses expected values for such delays, derived using a combination of analysis and simulation; see details in Section VII.
B. Algorithm for Clustering Phase: DAS-TPC
DAS-NANO's clustering phase bears considerable resemblance to clustering problems defined in the context of traditional HLS and compilers, see e.g., [29] , [26] , [27] , [30] . In fact, we were able to successfully adapt TPC (Two-Phase Clustering), a state-of-the-art algorithm proposed by Lapinskii et al. [29] , to address our 'defect-aware' clustering phase. As discussed in the sequel, our version of the algorithm, denoted DAS-TPC, is used in both the node clustering and the flow clustering phases of DAS-NANO. 9
1) Node Clustering Step: DAS-InstructionSelection: Similarly to the original TPC algorithm, DAS-TPC starts by performing a fast greedy clustering, and then iteratively improves on that initial solution; both phases of the algorithm are briefly discussed below.
1) Initial Clustering Phase.
The greedy algorithm used to generate the initial clustering is shown in Fig. 8; lines in bold represent our additions/enhancements to the original TPC. The order in which nodes are considered for clustering (line 1) is determined by a ranking function identical to that proposed in [29] , which is composed of the following three elements (in priority order): 1) the ALAP value of the candidate operation/node, with earlier operations being considered first; 2) the mobility value of the nodes, with the one having lower mobility being considered first; and 3) the number of successors, with the one having more successors being considered first. Note that this ranking function gives priority to operations on the critical path(s), thus providing more flexibility for those more difficult/constrained operations.
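The three-level ranking just described amounts to a lexicographic sort. A minimal sketch follows; the `Node` record, its field names, and the sample values are hypothetical and serve only to illustrate the priority order (ALAP first, then mobility, then number of successors).

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    alap: int            # ALAP time step of the operation
    mobility: int        # scheduling slack of the operation
    num_successors: int  # fan-out in the DFG


def ranking_key(n):
    # Earlier ALAP first, then lower mobility, then MORE successors (hence the minus).
    return (n.alap, n.mobility, -n.num_successors)


def clustering_order(nodes):
    """Return the nodes in the order in which they would be considered for clustering."""
    return sorted(nodes, key=ranking_key)


# Hypothetical candidates
nodes = [
    Node("a", alap=2, mobility=0, num_successors=1),
    Node("b", alap=1, mobility=3, num_successors=1),
    Node("c", alap=1, mobility=0, num_successors=1),
    Node("d", alap=1, mobility=0, num_successors=2),
]
order = [n.name for n in clustering_order(nodes)]
```

For these sample values, node d is considered first: it ties with c on ALAP and mobility but has more successors.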
Then, for the selected node v, we evaluate each possible alternative clustering to a basic flow f (cluster(v) = f ), using cost function trcost(v, f ) (line 4) which, similarly to [29] , is defined as follows:
where trcost dd (v, f ) denotes the direct data dependency cost and trcost cc (v, f ) denotes the common consumer cost. Specifically, we add one to trcost dd (v, f ) for each predecessor of v, u ∈ pred(v), such that cluster(u) = f . We add one to trcost cc (v, f ) for each successor of v, u ∈ succ(v), for which there exists a predecessor w ∈ pred(u) with cluster(w) = f . Clearly, trcost dd (v, f ) favors solutions that place consumer and producer operations into the same basic flow, and trcost cc (v, f ) favors solutions that place multiple producers to a common consumer into the same basic flow [29] .

9 We have also developed a defect-aware version of HP's Partial Component Clustering (PCC) algorithm [26] , denoted DAS-PCC, yet our version of DAS-TPC consistently outperformed DAS-PCC, and thus we only present results for the former.
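The two cost terms described above can be sketched as simple counting functions. This is an illustration under stated assumptions: the dictionary-based graph representation, the function names, and the example DFG are hypothetical, and the way the two terms are combined into trcost(v, f ) is deliberately left out, since it is not restated here.

```python
def trcost_dd(v, f, pred, cluster):
    """Count predecessors of v that are already clustered to basic flow f."""
    return sum(1 for u in pred.get(v, ()) if cluster.get(u) == f)


def trcost_cc(v, f, pred, succ, cluster):
    """Count successors u of v that have some predecessor w already clustered to f."""
    return sum(
        1
        for u in succ.get(v, ())
        if any(cluster.get(w) == f for w in pred.get(u, ()))
    )


# Hypothetical DFG: a -> v -> c and b -> c; node v is not clustered yet.
pred = {"v": ["a"], "c": ["v", "b"]}
succ = {"a": ["v"], "v": ["c"], "b": ["c"]}
cluster = {"a": "f1", "b": "f2"}
```

Here trcost_dd(v, f1) is 1 because producer a sits in f1, while trcost_cc(v, f2) is 1 because v and b (already in f2) share the consumer c.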
However, different from the original TPC, we perform size and convexity constraint checks during the node clustering step (line 3). Note, first, that non-convex data dependencies among basic flows assigned to the same MU are allowed, in order to preserve fine-grain parallelism [11] , [12] . Thus, convexity constraints need to be enforced only across MUs. However, the joint consideration of n max (maximum number of operations on a basic flow) and f max (maximum number of basic flows that can be mapped into one MU) does limit the maximum number of nodes that can be mapped into a single MU to n max × f max . Therefore, although allowing non-convex data dependencies during this phase, we need to make sure that the total number of nodes in basic flows exhibiting 'circular' data dependencies does not exceed that limit, so as to enable all such basic flows to be later mapped to a single MU; otherwise, convexity constraints at the MU level would be violated.
We use the depth-first search based algorithm proposed in [31] to check for violations of the convexity constraints. If one such violation is detected, it must be eliminated. Meeting convexity constraints with a greedy clustering algorithm is somewhat challenging, since convexity is a global constraint, while our algorithm makes clustering decisions greedily/locally. In [27] and [30] , convexity constraints are considered during instruction selection, yet both algorithms have worst-case exponential computing complexity. In contrast, we devise a low cost heuristic backtracking strategy (see lines 9-15 in Fig. 8) that has so far performed very well. Fig. 9 illustrates the heuristic. Consider a DFG G containing 4 nodes, and assume n max = 2, f max = 1. After clustering node 1 to basic flow 1, node 2 to basic flow 2, and node 3 to basic flow 1, when the algorithm tries to cluster node 4, no admissible clustering can be found, since clustering it to basic flow 1 violates the flow size constraint, and clustering it to basic flow 2 violates convexity constraints at the MU level. In order to handle the problem, our algorithm starts by backtracking to a previously clustered node, selected as specified in lines 11 to 13 in Fig. 8. Specifically, following backwards the ranking order for the previously clustered nodes, the algorithm backtracks to the first node v b that meets the following three conditions: 1) v b was clustered to a basic flow that the current node cannot be clustered to without violating convexity constraints; 2) at least one of v b 's successors, denoted s, was clustered to a basic flow different from its own; 3) v b was not backtracked to before. Once the backtracking node v b is selected, the algorithm re-clusters it to the basic flow of its successor s, and then restarts the regular greedy algorithm from the next node, in the ranking order.
For the example in Fig.9 , node 2 would be the first one to satisfy the three conditions, and hence would be selected to be the backtracking node. Node 2 would then be re-clustered to basic flow 1, and the normal clustering process would resume with node 3, eventually generating the solution shown in Fig.9 .
Although simple, this backtracking strategy has performed quite effectively in practice. Still, there is always the possibility that this heuristic cannot generate a feasible solution, either because a convex clustering (at the MU level) does not actually exist, or because the heuristic has failed. If this happens, as shown in lines 16 and 17 in Fig. 8, we cluster the problematic node to a new basic flow, in order to avoid size and convexity constraint violations. Of course, this might adversely impact performance, since the use of 'smaller' basic flows results in more inter-flow data transfer delays. Still, our heuristic has so far proven to be quite effective in identifying those clustering decisions that may have caused an avoidable constraint violation. Specifically, across all of our experiments, it has failed to backtrack successfully only once (for one of the DCT-DIT experiments discussed in Section VII), resulting in one more basic flow than necessary for that case, with insignificant impact on performance.
2) Iterative Improvement Phase. Although our initial clustering algorithm performs quite well (see Section VII), improvement is in general still possible. To take advantage of these opportunities, similarly to [29] , we have developed a relatively low-cost iterative improvement algorithm -see Fig.10 , where lines in bold represent our enhancements to the original TPC algorithm.
The iterative improvement algorithm is based on boundary permutations [29] . A boundary node (line 5 in Fig. 10) is a node that has at least one predecessor or successor node clustered to a different basic flow. Such nodes will be moved around different basic flows, providing opportunities for eliminating or collapsing associated inter-flow data transfers. Differently from the original TPC algorithm, though, our boundary permutations need to satisfy constraint n max , as well as convexity constraints. Namely, after moving a boundary node to a different basic flow, the latter may contain more than n max nodes, and hence we need to make sure that it will also export a boundary node to another basic flow. Such a chain of moves continues until the last basic flow to receive a boundary node contains no more than n max nodes (see lines 7 and 8 in Fig. 10). Note also that, at any step of the chain of moves, if there are multiple options, the one minimizing cost function (6) (discussed below) is selected.
An example of a chain of moves is shown in Fig.11 , where n max = 4. After moving a boundary node v1 from basic flow f 1 to basic flow f 2, f 2 contains five nodes and violates the n max constraint. Thus, we need to move a boundary node in f 2 (say, v2) out to another basic flow (f 3). Note that, among all possible moves of a boundary node from f 2 to other basic flows, v2 and f 3 are selected because they minimize the temporary clustering cost. After that movement, f 3 contains four nodes (thus ≤ n max ), and so the chain of moves ends. 10 After completing one such chain of moves, we obtain a new temporary clustering solution and evaluate it using a suitable cost function (line 11 in Fig.10 ) -if the resulting cost improves, we accept the chain of moves and update the current clustering solution. The algorithm terminates when all possible chains of moves fail to improve cost, or an upper bound on the number of iterations is reached.
The following cost function is used in DAS-TPC:
Algorithm DAS-IterativeImprovement:
1. progress = 0; iteration = 0;
2. compute the initial clustering cost;
3. do {
4.   iteration++;
5.   for each boundary node v in G {
6.     for each node p ∈ adj(v) and cluster(p) ≠ cluster(v) {
7.       temporarily move v to cluster(p);
8.       perform a chain of temporary moves of boundary nodes until target flow contains no more than n max nodes;
9.     }
10.    if the temporary clustering satisfies convexity constraint {
11.      compute the new clustering cost;
12.      if the clustering cost improves {
13.        commit the chain of moves and update the clustering;
14.        progress = 1; }
15.    } }
16. } while (progress == 1) and (iteration <= iterationmax);

The cost component M b is aimed at discouraging the clustering of nodes with large mobility differences into the same basic flow, since this will likely decrease the exposed instruction level parallelism, and thus potentially harm performance. 11 M b is defined as the sum of mobility differences over all flows, i.e.,

$$M_b = \sum_{f} \left( \mu_{f,\max} - \mu_{f,\min} \right)$$

where µ f,max and µ f,min denote the maximum and minimum mobility associated with the nodes in basic flow f , respectively. Fig. 12 explains this latter cost component with an example kernel. As can be seen, clustering node 24 of zero mobility (i.e., a node in the critical path) to the basic flow containing nodes 1, 2, 3, of high mobility, is not a good choice. Indeed, if any basic flow containing a predecessor to node 24 is clustered to a different MU, nodes 1, 2, 3 will have to wait for the basic flow containing such node to complete execution, leading to unnecessary execution delay. M b aims at avoiding such problematic clustering choices.

11 Since in our target fabric, an MU cannot start execution until all of its input data is ready [11] , [12] , nodes with high mobility will have to wait until the data for the low mobility nodes arrives, if such data is produced by a different MU.
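The mobility-spread term M b can be computed directly from its definition as the sum, over all basic flows, of the difference between the largest and smallest node mobility. The sketch below is illustrative only: the function name and the mobility values are hypothetical (they are not taken from Fig. 12), and only the node labels echo the example discussed above.

```python
def mobility_spread(flows, mobility):
    """Sum over all basic flows of (max mobility - min mobility) of their nodes.

    flows    : dict mapping a flow name to the list of nodes clustered into it
    mobility : dict mapping each node to its mobility value
    """
    total = 0
    for nodes in flows.values():
        if nodes:  # an empty flow contributes nothing
            values = [mobility[v] for v in nodes]
            total += max(values) - min(values)
    return total


# Hypothetical mobilities: nodes 1-3 have slack, node 24 is on the critical path.
mobility = {1: 5, 2: 5, 3: 4, 24: 0}
separate = {"f1": [1, 2, 3], "f2": [24]}
merged = {"f1": [1, 2, 3, 24]}
```

With these values, keeping node 24 in its own flow gives a spread of 1, whereas merging it with nodes 1, 2, 3 raises the spread to 5, so a cost function that includes this term would penalize the merged clustering.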
2) Flow Clustering Step: DAS-Binding: After clustering the original DFG G into basic flows, we contract each resulting basic flow to a node v c , and construct a corresponding contracted graph G c . 12 When there is a 'circular' data dependency among basic flows (this is possible since non-convexity is allowed at the node clustering phase), we further contract all the basic flows (i.e., their corresponding contracted nodes v c ) contained in the 'circular' dependency path into a single node, thus ensuring that all such basic flows will necessarily be mapped to a single MU, and the previously contracted graph G c thus becomes acyclic. (Recall that, as discussed in Section V-B.1, the actual number of basic flows contained in each such contracted node will necessarily satisfy f max .) After generating the acyclic contracted graph G c , we simply re-apply the DAS-TPC algorithm (used in the previous step) to G c , so as to derive the set of flow clusters to be assigned to each MU, where cluster size is now limited by f max (rather than n max ).
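One way to realize the contraction of 'circular' dependencies described above is to collapse each strongly connected component of the flow-level graph into a single node. The sketch below uses Tarjan's algorithm for this purpose; this choice, the function name `contract_cycles`, and the example graph are assumptions made for illustration, not a description of the actual DAS-NANO implementation.

```python
def contract_cycles(flow_succ):
    """Collapse every cycle of basic flows into one contracted node.

    flow_succ : dict mapping EVERY basic flow to a list of successor flows
    Returns (group_of, contracted), where group_of[f] is the frozenset of flows
    merged with f, and contracted maps each group to its successor groups.
    """
    index, low, on_stack, stack, group_of = {}, {}, set(), [], {}
    counter = [0]

    def visit(f):  # recursive Tarjan SCC search (fine for small kernels)
        index[f] = low[f] = counter[0]
        counter[0] += 1
        stack.append(f)
        on_stack.add(f)
        for g in flow_succ.get(f, ()):
            if g not in index:
                visit(g)
                low[f] = min(low[f], low[g])
            elif g in on_stack:
                low[f] = min(low[f], index[g])
        if low[f] == index[f]:  # f is the root of a component
            members = []
            while True:
                g = stack.pop()
                on_stack.discard(g)
                members.append(g)
                if g == f:
                    break
            for g in members:
                group_of[g] = frozenset(members)

    for f in flow_succ:
        if f not in index:
            visit(f)

    contracted = {}
    for f, outs in flow_succ.items():
        src = group_of[f]
        contracted.setdefault(src, set())
        contracted[src].update(group_of[g] for g in outs if group_of[g] != src)
    return group_of, contracted


# Hypothetical flow graph with one cycle: f1 -> f2 -> f3 -> f1, and f2 -> f4
flow_graph = {"f1": ["f2"], "f2": ["f3", "f4"], "f3": ["f1"], "f4": []}
group_of, contracted = contract_cycles(flow_graph)
```

In this example, f1, f2 and f3 end up in the same contracted node, which has a single outgoing edge to the node containing f4; the resulting graph is acyclic and can be passed to the flow clustering step.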
VI. DAS-ROUTING
After decisions on MU allocation and binding are made, we still need to determine how many routing tracks are needed to support intra- and inter-MU data transfers, for the given probability of a track being defective (P et ). For simplicity, we assume P et to have the same first order value as P c , the probability of a connection inside a region being defective. Since regularity is desirable, we assume a target nanofabric with uniform routing channels [11] , [12] , i.e., with the same number of tracks on all channels.
Note that, differently from previous approaches, the exact placement of basic flows in the internal regions of a component's MUs is not known. Thus, our goal is to determine the number of tracks required to support all potential alternative placements. In order to do so, we consider a number of distinct basic flow placements, aimed at exposing different routing configuration requirements, namely: 1) 'compact' placement, where the basic flows are mapped to regions all located at one corner of the corresponding MU; 2) 'spread' placement, where the basic flows are mapped to regions as far from each other as possible on the appropriate MU; and 3) random placements, where the basic flows are randomly mapped to regions of the appropriate MU. For each such placement, we start by assuming that all routing tracks are defect-free, thus reducing our problem to the so-called 'symmetrical FPGA array routing' [32] . We then execute the well-known Pathfinder congestion negotiation routing algorithm, used in VPR [32] , to obtain the number of tracks required by that particular solution. We repeat the process for all alternative placements, and then select the highest number of defect-free routing tracks required by any such solution. Then, given P et (probability of a track being defective) and the target yield for communication resources P tc , we use a binomial distribution to estimate the number of tracks required in the presence of defects.
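The final binomial step can be illustrated as follows. This is a minimal sketch under simplifying assumptions that are ours: track defects are taken to be independent, the yield requirement is applied to a single channel, and the target is expressed as a maximum failure probability (i.e., 1 − P tc ) to avoid floating-point cancellation for yields very close to one. The function name `tracks_needed` and its parameters are hypothetical.

```python
from math import comb


def tracks_needed(t_req, p_et, max_fail):
    """Smallest channel width T such that at least t_req tracks are defect-free
    with probability >= 1 - max_fail, assuming independent track defects.

    t_req    : number of defect-free tracks required by routing
    p_et     : probability that a single track is defective
    max_fail : tolerated probability of having too few good tracks
    """
    if not 0.0 <= p_et < 1.0:
        raise ValueError("p_et must lie in [0, 1)")
    if max_fail <= 0.0:
        raise ValueError("max_fail must be positive")
    t = t_req
    while True:
        spare = t - t_req
        # Failure means more than `spare` of the t tracks are defective.
        fail = sum(
            comb(t, d) * p_et**d * (1.0 - p_et) ** (t - d)
            for d in range(spare + 1, t + 1)
        )
        if fail <= max_fail:
            return t
        t += 1
```

For instance, with a hypothetical requirement of 10 defect-free tracks, a 1% track defect probability, and a tolerated failure probability of 10^−5, this model asks for three spare tracks per channel.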
VII. EXPERIMENTAL VALIDATION
In this section, we present experimental results that empirically show that our DAS-NANO framework can effectively explore the complex probabilistic design space defined by the defect-aware HLS problem. Table I shows the set of representative HLS benchmark kernels used in our experiments. They include a simple two-butterfly Fast Fourier Transform (FFT), an Auto-Regression filter (AR), an Avenhaus Filter (AF), a Finite Impulse Response filter (FIR) and its unrolled version (FIRu), an Elliptic Wave Filter (EWF), a version of the Fast Fourier Transform (FFTm) used in MediaBench [33] , and various Discrete Cosine Transform (DCT) algorithms [34] .
A. Experimental Methodology
Assuming a defect regime (P e , P a , P c ), and a given target yield, the DAS-NANO synthesis flow was used to generate a component design for each benchmark kernel in Table I . Specifically, we first derived the number of MUs and the number of regions per MU, using the algorithm described in Fig. 4 . Then, we executed the DAS-TPC algorithm to perform node clustering (DAS-InstructionSelection) and flow assignment (DAS-Binding). Afterwards, we verified whether the resulting design met the target yield; it did for all of our experiments. Finally, the required number of routing tracks was obtained, using the method described in Section VI.
To evaluate the resulting design, we estimated the experimental component latencies when defects are present, denoted L expr , via Monte Carlo (MC) simulations. Specifically, considering the defect densities specified for the particular group of experiments (i.e., P e , P a and P c ), random defective fabrics were generated in each MC run. For each such fabric instance, we used our TMR-tile based group testing method [12] , [21] to obtain the defect map of each region, and then used the heuristic algorithm proposed in [11] , [12] to determine the exact placement of basic flows in the component's MUs. In this way, we obtained the actual delay of each inter-flow and inter-MU edge, and hence L expr , for each MC run representing a particular fabric instance and corresponding chip configuration. In all experiments, we assumed the same basic delay model adopted in [11] , [12] , where each operation takes two cycles to complete and have its results routed to its consumer operation inside the particular basic flow, and a signal takes one cycle to traverse the length of a region.
The method just discussed allowed us to determine inter-flow and inter-MU delays for a particular chip configuration. Still, when designing a component, the exact delays are not known, since they depend not only on the particular design being considered, but also on the specific configuration implemented in each individual fabric/chip. So, during the iterative improvement step of the clustering phase, we use a predicted component latency, denoted L pred , in the clustering cost function (see Section V-B.1). 13 We estimate L pred using a combination of simulations and analysis to determine expected delays for inter-flow and inter-MU edges, as discussed below.
Specifically, we considered a number of different designs, and ran each of them through a large number of MC simulations, assuming the defect regime of interest. Fig. 13 shows the minimum, maximum, and average experimental delays for inter-flow and inter-MU edges, obtained for different benchmark kernels, assuming a number of different designs for each such kernel and the defect regime (P e , P a , P c ) = (10, 5, 1)%. Note that these sample results were obtained across 10000 MC simulation runs to ensure that the worst relative error of each delay estimate was no more than 10% with 99% confidence. As shown, the average experimental delay of inter-flow edges is consistently close to two cycles, for all cases. The average experimental delay of inter-MU edges varies between four and six cycles. Thus, in all the experiments that follow, the expected delay of an inter-flow edge is assumed to be two cycles (by the clustering algorithms), and the expected delay of an inter-MU edge is assumed to be either four cycles or six cycles. The impact of these two values on the resulting average L expr of the corresponding designs generated by the algorithms will be discussed in detail in the experiments below.
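A relative-error criterion of the kind mentioned above can be checked with a standard normal-approximation confidence interval on the sample mean. The sketch below shows one common way of doing so; it is not necessarily the procedure used to produce Fig. 13, and the function name and the delay samples are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def relative_error(samples, confidence=0.99):
    """Half-width of the confidence interval of the mean, divided by the mean.

    Uses the normal approximation, which is reasonable for thousands of runs.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    half_width = z * stdev(samples) / sqrt(len(samples))
    return half_width / abs(mean(samples))


# Hypothetical inter-MU edge delays (in cycles) collected over 10000 MC runs
delays = [4, 5, 6, 5] * 2500
meets_target = relative_error(delays, confidence=0.99) <= 0.10
```

If the criterion were not met, one would simply continue running simulations until the relative error drops below the chosen threshold.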
B. Contrast to Deterministic Synthesis Methodology
Our reconfiguration-based defect-aware synthesis approach enables non-deterministic or probabilistic mapping of applications to the nanofabrics. We start by contrasting its effectiveness with that of a deterministic synthesis methodology based on triple-module-redundancy (TMR). For this experiment, we use a simple two-butterfly FFT kernel (see Table I ). We let P f denote the probability that a component fails to be configured, i.e., P f = 1 − P tt P tc . By a TMR-based design, we mean a design based on a priori allocation of redundant resources and arbiters, so that a target P f is met. This approach requires a single synthesis step and no defect mapping or configuration. Two defect regimes (technologies) were considered, (P e , P c ) = (10, 1)% and (P e , P c ) = (1, 0.1)%, see Fig. 14. For each target P f , we generated two TMR-based designs: TMR1, assuming perfect arbiters, i.e., P a = 0 (unrealistic for nanotechnologies); and TMR2, assuming P a = P e /1000. We assumed optimistically that data transfer delays across regions and MUs for TMR-based designs would be zero. For component designs based on our DAS-NANO approach, we assumed pessimistically that P a = P e /2. Fig. 14 shows the worst/maximum L expr for the designs resulting from our DAS-NANO approach and for the TMR-based deterministic component designs; note that there is no delay variability in the latter. As expected, TMR-based designs with non-ideal arbiters eventually reach an arbitration bottleneck beyond which they can no longer reduce P f . Also note that (unrealistic) TMR-based designs with perfect arbiters can achieve P f targets, yet exhibit a much longer delay than the worst case for the component designs resulting from our DAS-NANO approach. This is due to the large amount of a priori redundancy required to achieve a small target P f .
Thus, from the perspective of creating defect tolerant designs, deterministic synthesis based on TMR is inappropriate, i.e., delivers low yield and performance, and is likely to consume more power since more PEs need to be executed simultaneously. Yet, a TMR-based design is at the same time robust to soft/transient faults. We shall return to this point in Section VIII.
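The arbitration bottleneck of TMR with non-ideal arbiters can be illustrated with a textbook single-stage model, in which a stage works if at least two of its three replicas work and its arbiter works. This simplified model, the function name, and the numeric values below are assumptions for illustration; it is not claimed to be the exact model behind Fig. 14.

```python
def tmr_failure(p_unit, p_arbiter):
    """Failure probability of one TMR stage with independent defects.

    p_unit    : probability that a single replica is defective
    p_arbiter : probability that the arbiter (voter) is defective
    """
    ok = 1.0 - p_unit
    majority_ok = ok**3 + 3.0 * ok**2 * p_unit  # at least 2 of 3 replicas work
    return 1.0 - (1.0 - p_arbiter) * majority_ok


# With a perfect arbiter, better replicas keep lowering the failure probability.
ideal = [tmr_failure(p, 0.0) for p in (0.1, 0.01, 0.001)]
# With a defective arbiter, the failure probability never drops below p_arbiter.
real = [tmr_failure(p, 1e-4) for p in (0.1, 0.01, 0.001)]
```

In this model the stage failure probability is bounded from below by the arbiter defect probability, regardless of how reliable the replicas are, which is the essence of the bottleneck discussed above.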
C. Results on Design Space Exploration
In this section we provide experimental evidence to support the claim that our DAS-NANO framework can effectively explore the design space induced by the defect-aware HLS problem. Table II shows samples of the results generated by our DAS-NANO framework for the last nine kernels in Table I , obtained for various target yield values (P tt and P tc ), and assuming defect regime (P e , P a , P c ) = (10, 5, 1)%. Recall that n max , f max , n MU and n reg denote the maximum basic flow size, the maximum number of flows mapped to one MU, the number of MUs, and the number of regions per MU used in a particular design. L pred denotes the predicted component latency estimated early on for the particular design: the first number given in that column is obtained when the expected delay of inter-MU edges is assumed to be four cycles, and the second number is obtained when the expected delay of inter-MU edges is assumed to be six cycles. (As mentioned before, the expected delay of inter-flow edges is assumed to be two cycles in both cases.) Note that the designs generated by the clustering algorithms were identical for both L pred values in all but one of our experiments, namely the EWF design with P tt = 1 − 10^−15; the implications of this are addressed below.
L expr in the table denotes the experimental component latency for the actual design, where the three numbers correspond to the minimum, maximum, and average latency over 10000 MC runs. The row with "*" for the EWF design with P tt = 1 − 10^−15 shows the experimental latency of the resulting design when the predicted delay of inter-MU edges is assumed to be six cycles. As mentioned before, this is the only case for which the resulting designs are different when the expected delay of inter-MU edges is six cycles versus four cycles, yet the difference is essentially not significant, in terms of both the performance/latency and resource requirements of the two resulting designs. Finally, T req in the table denotes the number of required defect-free routing tracks (worst case among eight different basic flow placement scenarios, as described in Section VI), and T actual denotes the total number of tracks required to meet the target yield P tc , assuming the probability of a track being defective P et = P c .
As can be seen, the average component latency increases with increases in target yield. This is exactly what one would expect, since increases in redundant configuration capacity are needed to achieve higher yields, but they have a deleterious effect on locality. In particular, as we can see from Table II , as the target yield increases, the number of MUs required to achieve such yield increases, and thus the expected delay of the inter-MU edges tends to increase as well. Accordingly, for most benchmarks, when the target yield is low, e.g. P tt = 1 − 10^−5, L pred computed assuming the inter-MU delays to be four cycles is closer to the average L expr . Yet, when the target yield is high, e.g. P tt = 1 − 10^−15, L pred computed assuming the inter-MU delay to be six cycles is closer to the average L expr . Still, no matter what the expected delay of inter-MU edges is, L pred exhibits the same trend as L expr , i.e., it increases as target yield increases. Such fidelity of L pred with regard to L expr allows us to use L pred safely while exploring the design space, thus enabling a much quicker practical exploration, since it avoids the complex Monte Carlo simulations required for deriving L expr . Another interesting observation pertains to the actual delay variation across different component instances, shown in the column representing L expr . For all cases, the minimum L expr and the maximum L expr are within 10-15% of the average L expr , for the particular defect regime and target yield considered in the experiment. This inherent variability is to be expected in a reconfiguration-based approach such as ours.
Our experiments show as well that the number of defect-free tracks decreases with increases in yield, since increasing redundant configuration capacity leads to less congested designs. However, when we simultaneously (and more realistically) consider the presence of defective routing tracks, again the number of tracks tends to increase, as the target yield increases.
We performed the same set of experiments discussed above for a number of different defect regimes, and observed consistent trends and results. For instance, Table III shows a similar set of experimental results generated by DAS-NANO, assuming (P e , P a , P c ) = (1, 0.5, 0.1)%. For this defect regime, we found that the average delay of inter-flow edges was still consistently around two cycles for all cases, while the average delay of inter-MU edges was roughly two to four cycles. This was not surprising, since the number of MUs required to achieve the target yield is much reduced for this lower defect density. Accordingly, as before we still used two cycles for the expected delay of inter-flow edges, yet now used two or four cycles for inter-MU edges. We found that, for all cases, the resulting designs were identical for those two different delays, and so was L expr . We still present L pred for the two cases (four-cycle case first and then two-cycle case), for completeness. Note further that, since the defect density is lower for the experimental results shown in Table III , the number of MUs required to achieve the target yield tends to be smaller, and the maximum flow sizes larger, than those used in the designs generated for the previous case ((P e , P a , P c ) = (10, 5, 1)%). As a result, component latencies are smaller than those of the previous designs for the same target yield, by 16-56%. Also, the performance variations (differences between the minimum and maximum experimental latencies) across all component designs are considerably smaller, due to increased locality of the resulting designs.
The execution time of our DAS-NANO framework is essentially determined by its clustering phase, DAS-TPC (see the discussion on the time complexity of TPC in [29] ). To assess the effectiveness of the somewhat costly iterative improvement step of our algorithm, we implemented a simpler version, without iterative improvement, denoted DAS-INIT. For each kernel, we recorded the execution time of both versions of the algorithm, as well as the average experimental latency of their generated solutions. Fig. 15 and Fig. 16 show samples of our results, for defect regime (P e , P a , P c ) = (10, 5, 1)%, assuming a target yield (P tt ) of 1 − 10^−15 and 1 − 10^−10, respectively. The execution time is in milliseconds, and was obtained on a SparcV9 750MHz processor. On average over all our experiments, DAS-TPC achieves a 17% improvement in average experimental latency with respect to DAS-INIT, at the cost of a roughly fourfold increase in execution time. Accordingly, in the current version of DAS-NANO, designers can choose whether to enable iterative improvement, based on their specific applications, optimization goals, and sensitivity to execution time.
VIII. PREVIOUS RELEVANT WORK
A. Work on High Level Synthesis and Clustering
The synthesis framework proposed in this paper builds on state-of-the-art work in the areas of high level synthesis (HLS) [35] , [14] , [15] , [16] , instruction-set extensions for performance acceleration relying on application-specific processors and FPGAs [36] , [27] , [28] , [30] , and instruction scheduling/assignment for clustered datapaths and processors [37] , [38] , [39] , [26] , [29] . Although we propose several enhancements to existing HLS algorithms, these individual enhancements are not the key innovation in this paper. Instead, the paper's main contribution is the definition of a new defect-aware HLS problem, and associated synthesis flow, such that one can jointly synthesize and optimize a large family of alternative solutions, rather than a single deterministic solution, so as to achieve a specified target yield with best average component latency.
B. Work on Defect-Tolerant Nanosystems
A good review of previous work in defect tolerance for nanotechnologies and proposed fabric architectures can be found in [40] . Several reconfiguration-based approaches to defect avoidance have been proposed for defect-prone nanotechnologies [7] , [41] , [42] , [9] , [5] , [10] , [11] , [12] , [21] . Two key research directions have emerged in this area: (1) 'fine-grained' reconfiguration-based approaches [41] , [9] , [43] , [44] ; and (2) 'coarse-grained' reconfiguration-based approaches [42] , [11] , [12] , [21] , [45] , [46] . 'Fine-grained' approaches, as the name suggests, use simple primitive programmable elements, e.g., a nanoblock in [41] , [9] may only hold a few gates. The advantage of such 'fine-grained' approaches is that the probability of a primitive element being defective is relatively small, even under the high defect density projected for nanotechnologies, thus potentially enabling a more efficient use of defect-free resources. However, it is very challenging for such 'fine-
