This paper presents preliminary work exploring adaptive field programmable gate arrays (AFPGAs). An AFPGA is adaptative in the sense that the functionality of subcircuits placed on the chip can change in response to changes observed on certain control signals. We describe the high-level architecture which adds additional control logic and SRAM bits to a traditional FPGA to produce an AFPGA. We also describe a synthesis method that identifies and resynthesizes mutually exclusive pieces of logic so that they may share the resources available in an AFPGA. The architectural feature and its associated synthesis method helps reduce circuit size by 28% on average and up to 40% on select circuits.
INTRODUCTION
Field programmable gate arrays (FPGAs) have become an increasingly popular medium for implementing digital circuits due to the high costs associated with application specific integrated circuit (ASIC) implementations. An FPGA can be bought off the shelf and can be configured to implement an arbitrary digital circuit. This programmability allows reduced development and verification times, and avoids the large costs involved in chip fabrication. The programmability of FPGAs comes with a steep price, however. An FPGA-based implementation is estimated to be forty times larger and three times slower than a comparable implementation using standard cells [1] . A number of studies have examined methods of improving the area and speed efficiency of FPGAs [2] - [5] . Here we consider an architectural modification that improves the area efficiency of existing FPGA architectures by dynamically changing the functionality of the circuit implemented in the programmable logic fabric.
The key difference between an AFPGA and FPGA is in the structure of the configuration element. In an FPGA, each configuration element determines the functionality of a single programmable resource as illustrated in Figure 1a . Although most of the configuration elements in an AFPGA are the same as in an FPGA, a fraction of them use the alternative structure illustrated in Figure 1b . In this structure, two configuration elements are connected to a single pro- grammable resource through a multiplexer. The functionality of the programmable resource is determined by the configuration element selected by the multiplexer. The motivation for the study of this alternative structure is best illustrated with an example. Consider an FPGA architecture where a logic block can implement any function of three inputs. A possible implementation of a four-bit add-subtract unit for this architecture is given in Figure 2 . A subtraction is performed when the control signal as (AddSub) is equal to 1 and an addition is performed when it is equal to 0. This implementation uses 12 logic blocks. The add-subtract unit can be specialized to perform either an addition or a subtraction by setting as to zero or one, respectively. After setting as to a constant, logic simplification removes the four XOR blocks reducing the number of logic blocks required to 8. The specialized versions of the add-subtract unit are illustrated in Figures 3a and 3b . Although the logic blocks have been specialized to perform different functions, the routing structure is still identical. By allowing the logic blocks to use the multiplexed configuration element, the two specialized circuits can be combined into a single circuit as illustrated in Figure 4 . In the combined circuit, each logic block implements the functionality required by both addition and subtraction. The appropriate function is activated by as which is assumed to connect to a special configuration select signal. This example shows that there are potential area benefits when the multiplexed configuration element is used selectively in an FPGA architecture.
The architectural modification we are proposing adds a control signal and some extra configuration memory in order to reduce the size of the circuit to be implemented. In modern FPGAs [6, 7] configuration memory is a small fraction of total chip area, and a small portion of this configuration memory is duplicated in order to allow two distinct subcircuits to share the same set of programmable resources. A detailed study of this area tradeoff can be found in [8] .
Here we present a high-level overview of the architecture and present a synthesis method that helps reduce circuit size significantly when targeting AFPGAs.
AFPGAS: HIGH-LEVEL ARCHITECTURE
AFPGAs are built by modifying an existing SRAM configurable FPGA. Although other programming methods such as antifuses and floating gate devices can be considered, these are less popular and are not considered in this work.
Before we describe the additional circuitry needed to create AFPGAs, we briefly review the academic FPGA architecture [3] that is the basis for our work. At the highest level, an FPGA architecture is comprised of I/O pins and logic clusters. Programmable routing resources comprised of programmable multiplexers and routing switches allow connections to be made between logic clusters, and between logic clusters and I/O pins. Each logic cluster is made up of a number of basic logic elements (BLEs), and each BLE contains a lookup table (LUT) and a register. Here, we assume that the target architecture uses 4-input LUTs. Previous work has established the area effiency of this particular LUT size [2, 4] .
The distinguishing feature of AFPGAs is their use of the adaptive configuration element illustrated in Figure 1b . This is in contrast to FPGAs which use the traditional configuration element illustrated in Figure 1a . In the traditional configuration element, each programmable resource is attached to a single SRAM cell which determines its functionality. In the adaptive configuration element, two SRAM cells connect to the programmable resource through a multiplexer, and the SRAM cell that gets to control the programmable resource is determined by a configuration select signal attached to the multiplexer. If the two SRAM cells contain differing values, the programmable resource being controlled changes its functionality dynamically in response to changes in the select signal.
An AFPGA architecture is obtained by making two modifications to an existing FPGA architecture. First, a portion of the configuration elements in an FPGA are replaced by adaptive configuration elements. Since the adaptive configuration elements require a select signal, the second modification adds select signals, each controlling the configuration elements in a region of the chip. The select signals are accessible from the programmable routing network allowing them to be driven by I/O pins and basic BLE outputs.
The adaptive BLE, illustrated in Figure 5a , is obtained from a traditional BLE by modifying the configuration elements that determine LUT functionality. Using adaptive configuration elements in place of traditional configuration elements, the LUT inside the adaptive BLE has the ability to perform two different functions of its inputs. A select signal, s, connects to the adaptive configuration elements and provides a way of selecting the function computed by the LUT.
Multiplexers and routing switches present in an FPGA can be made adaptive by replacing their configuration elements with adaptive configuration elements. The adaptive An AFPGA uses adaptive BLEs, multiplexers and routing switches to allow two subcircuits to share the same set of programmable resources.
COMPARISON WITH PREVIOUS WORK
The AFPGA architecture proposed here has similarities to previous work on multicontext FPGAs. An FPGA's context is the set of configuration elements that determine its functionality. A traditional FPGA contains a single context which allows it to implement a single function. A multicontext FPGA has a number of contexts, but only one of them is active at any time. There has been both academic and industrial interest in multicontext FPGAs. Academic works include the Dharma architecture [9] , dynamically programmable gate array (DPGA) [10] and virtual gate element array (VEGA) [11] while industrial work carried out at Xilinx considers a version of the Xilinx XC4000E FPGA [12] with several contexts.
An AFPGA can be viewed as a multicontext FPGA with 2 contexts. However, unlike a true multicontext FPGA, only a portion of an FPGA's context is duplicated to obtain the second context. A second difference is in the way that the contexts are switched. In a multicontext FPGA, the context switch is global; a chip wide signal determines the context that is made active. This signal is usually connected to a control circuit that sequences through the contexts or to an external signal provided by the user. On an AFPGA, a number of different context activation signals are used, each one serving as the activation signal for a different region on the chip. Furthermore, a synthesized subcircuit within the AFPGA determines when the context is to be switched. A third difference is in the nature of the transformation used to [12] which is a transformation based on sequential techniques, while AFPGAs use adaptation which is a transformation based on combinational techniques.
AFPGAS: CAD FLOW
The CAD flow for AFPGAs is illustrated in Figure 6 . The first two steps, synthesis and technology mapping, are the same as in an FPGA CAD flow. However, starting with the third step, adaptation, there are significant differences between the two CAD flows.
The process of adaptation creates pairs of subcircuits that share programmable resources in an AFPGA. There are two technology mapping steps in the AFPGA CAD flow, the first precedes adaptation and the second follows adaptation. The first mapping step is useful because it improves the performance of adaptation. During adaptation, the selection of control signals is guided by area concerns, and the accuracy of the selection process is greatly enhanced if the circuit has been mapped into LUTs. The second technology mapping step is useful because it produces a smaller circuit by remapping parts of the circuit affected by the simplification operations (described in Section 5.2) that take place during adaptation.
An AFPGA architecture is based on an existing FPGA architecture, thus the final four steps in the AFPGA CAD flow, packing, clustering, placement and routing, perform their traditional FPGA CAD tasks in addition to new tasks that are AFPGA specific. In addition to the task of grouping LUTs and registers into BLEs, the packing step groups pairs of LUTs selected from pairs of subcircuits to form adaptive BLEs. Clustering and placement have the additional task of ensuring that the BLEs that are part of a subcircuit pair are clustered and placed in a region with adaptive structures (BLEs, multiplexers and routing switches) whose functionality changes in response to a single configuration select signal. Finally, the routing step has two additional tasks. First, it has to make use of adaptive multiplexers and routing switches when making connections for subcircuit pairs. Second, it has to connect the control signals that generated subcircuit pairs to the appropriate configuration select signals.
ADAPTATION

Introduction
A circuit may be viewed as being composed of several subcircuits. The process of adaptation replaces some of these subcircuits with adaptive subcircuits capable of sharing the adaptive resources present in an AFPGA. An adaptive subcircuit consists of two half circuits, each a specialized version of the subcircuit being replaced. The creation of an adaptive subcircuit begins with the selection of a control signal which is then used to identify a subcircuit whose area is greatly reduced when the control signal is assumed to be a constant. Two specialized versions of the subcircuit can be generated, one assuming that the control signal is zero and the other assuming that the control signal is one. Since the control signal can only be in one of two possible states, zero or one, the specialized subcircuits are not needed at the same time and can be combined to form the two halves of an adaptive subcircuit. The control signal connects to the adaptive subcircuit and determines which of the two halves is to be active at any time.
Preliminaries and Problem Definition
The combinational portion of a boolean circuit can be represented as a directed acyclic graph (DAG) G = (V (G), E(G)). A node in the graph v ∈ V (G) represents a logic gate, primary input or primary output, and a directed edge in the graph e ∈ E(G) with head, u = head (e), and tail, v = tail (e), represents a signal in the logic circuit that is an output of gate u and an input of gate v. The set of input edges for a node v, iedge (v) , is defined to be the set of edges with v as a tail. Similarly, the set of output edges for v, oedge (v) , is defined to be the set of edges with v as a head. A primary input (PI) node has no input edges and a primary output (PO) node has no output edges. An internal node has both input edges and output edges. The set of distinct nodes that supply input edges to v are referred to as input nodes and is denoted inode (v) . Similarly, the set of distinct nodes that connect to output edges from v are referred to as output nodes and is denoted onode (v) .
The circuit of Figure 7a and its graph representation, Figure 7b , will be used as an example when illustrating the notions defined below. A node v in the graph is considered to be a boolean variable v whose value is determined by the boolean function F v at node v. Although we use the same notation for both a node and the variable it is associated with, it will be clear from the context which one of the two is meant. An edge e from u to v indicates that the function used to compute v, F v , depends on variable u. In the example, the function at d depends on a and b, and is defined to be
In addition to the use of the standard logic values 0 and 1, the symbol ' * ' is used to represent an unknown or undefined logic value.
The ordered pair Note that the implication set of an assignment includes the assignment itself.
Two laws regarding implications will be useful when generating implication sets. The transitive law states that
If the output or any of the inputs of a function F v have been assigned values, then a justification operation, JUSTIFY(v), can be performed on v. The operation returns a set of assignments that are a consequence of the assignments incident on v. The scope of the operation is limited to the assignments that can be derived by examining F v ; the assignments returned only apply to the functions's inputs or to its output. Returning to our example, if d = 1 and f = * then JUSTIFY(f ) returns the assignment Figure 8 illustrates the transformation that takes place during adaptation. First, a control variable, v, and the subgraph it controls, H v , are identified. Then, two specialized versions of H v are generated; H 0 is obtained by simplifying H v after applying the assignments in impl [v, 0] and H 1 is obtained by simplifying H v after applying the assignments in impl [v, 1] . Finally, the two subgraphs are combined to form the adaptive subgraph, H v . Variable v remains connected to the adaptive subgraph and determines which one of H 0 and H 1 is to be active at any time. The area of a subgraph H, area(H), is defined to be the number of nodes in it. In an adaptive subgraph, only one of the component subgraphs is active at any time. Thus, the area of an adaptive subgraph is defined to be the maximum of the areas of the component subgraphs,
Note that the area of an adaptive subgraph H v is always smaller than or equal to the area of H v because H v is composed of the two subgraphs H 0 and H 1 which are both simplified versions of H v . The goal of adaptation is to replace controlled subgraphs with adaptive subgraphs such that the area of the resulting graph is minimized. Consider the process of adaptation on the graph of Figure 7b using b as the control variable. We identified the subgraph controlled by b earlier. The subgraphs of Figures 9a and 9b are obtained when the controlled subgraph is simplified assuming b is 0 and 1, respectively. These two subgraphs are combined into the adaptive subgraph illustrated in Figure 9c . Although the adaptive subgraph reduces the area of the graph by two nodes, even more gains are possible if the adaptive subgraph is remapped. In the original graph, all nodes had two inputs, and if we assume that the target architecture consists of 2-LUTs then remapping produces the graph of Figure 9d . The final graph has a three-fold reduction in area compared to the original.
Generating Implication Sets
Implication sets are central to the problem of adaptation. They are used to identify controlled subgraphs as well as to generate adaptive subgraphs. Our use of implication sets is not new; they have found several uses in the literature. They have been used to solve satisfiability (SAT), automatic test
end forever Fig. 10 . The procedure used to generate adaptive subgraphs.
pattern generation (ATPG), and logic synthesis problems. Thus, there have been a number of algorithms proposed to generate implication sets [13] - [17] . Our work uses a modified version of the algorithm proposed in [17] . The primary benefits of this algorithm over the others are its speed, its ability to generate all possible implication sets simultaneously and its ability to discover a large number of implications. We briefly review the algorithm here. The algorithm maintains implication sets for each signal in the circuit. Initially, each implication set will contain a single entry: the assignment itself. As the algorithm progresses the implication sets are expanded. The algorithm stops when there are no further changes observed on any of the implication sets. An implication set is expanded by first making the assignments within it. Then justification operations are used to discover new implications. The transitive law is used to discover further implications of the newly discovered implications. In addition to the transtive law, during the process of discovery, the contrapositive law is used to add new entries to other implication sets.
Generating Adaptive Subgraphs
The procedure used to generate adaptive subgraphs is presented in Figure 10 . It begins with a call to IMPLICATION-SETS which generates all implication sets in the graph. Then, before any modifications are made to the graph, a copy of the variables is placed in C. The implication sets that were generated only apply to the variables that are present in the unmodified graph, but some of these variables will disappear as the graph is modified during adaptation. Set C allows us to keep track of the variables for which implication sets are available.
The inner loop (lines 6-14) uses a greedy method to find a control variable that can be used to generate an adaptive subgraph, and the outer loop (lines 4-20) keeps replac-
u ← * 10 end for 11 return H Fig. 11 . The procedure used to generate a specialized version of H v under the assumption that v has logic value l.
ing subgraphs with their adaptive equivalents until the inner loop fails to find a control variable. For every variable that is present in both C and the modified graph G, the inner loop determines a cost that reflects its suitability as a control variable. The cost of a variable v is determined by COST and is based on the controlled subgraph, H v , as well as the two subgraphs H 0 and H 1 produced by SPECIALIZE. The variable with the highest non-zero cost is selected as the control variable to be used in generating an adaptive subgraph. Given a variable v and a logic value l, the specialization procedure, illustrated in Figure 11 , produces a version of H v simplified using the implications in impl [v, l] . Note that some assignments in impl [v, l] cannot be used (lines 3 and 8) as the variables they apply to may no longer be a part of the modified graph G.
Returning to Figure 10 , having selected a control variable in the inner loop, the outer loop removes the subgraph controlled by the variable and replaces it with an adaptive subgraph composed of the two specialized versions of the removed subgraph.
Since the goal of adaptation is to minimize the area of the resulting circuit, the cost computed for a variable v is based on the number of edges eliminated in H v to obtain H 0 and H 1 . Although the area savings can be measured directly by counting the number of nodes eliminated, the edge-based measure is finer grained and better reflects the area obtained after the technology mapping step that follows adaptation; the technology mapping step may be able to pack more logic into the nodes where some edges have been eliminated. The cost is computed as the minimum of two values c 0 and c 1 where c 0 is the number of edges eliminated to obtain H 0 and c 1 is the number of edges eliminated to obtain H 1 . Furthermore, there are two special cases that may cause the cost of a variable to be zero preventing its selection by the inner loop (lines [11] [12] [13] . It is assumed that at minimum each control signal in the target architecture determines the functionality of the adaptive structures within a region that is approximately the size of a cluster. Thus, to produce adaptive subgraphs that are at least as big as a cluster, the cost of a variable is set to zero if it does not eliminate as many edges as there are inputs to a typical cluster in the target archi- tecture. For example, if the target architecture uses clusters with four BLEs, then a variable's cost is set to zero if it does not eliminate at least 10 edges [3] . If a variable v controls a large subgraph H v and has disproportionate values for c 0 and c 1 then it is unlikely to be a good control variable and its cost is set to zero. For example, consider a circuit with a global reset signal. The reset signal controls most of the circuit, and most of the circuit can be removed if it is assumed to be one. However, if the reset signal is assumed to be zero, most of the circuit remains unchanged. If an adaptive subcircuit is generated using the reset signal, area would not be reduced significantly and a large part of the circuit would no longer be able to participate in other specialization operations. Far better area results can be obtained for the circuit by using a number of control signals, each of which controls a smaller subcircuit and is able to produce balanced values for c 0 and c 1 .
The complexity of generating adaptive subgraphs (lines 4-20) can be derived as follows. The number of nodes in the graph is denoted n, and the average number of nodes in the subgraph controlled by a variable is denoted k. The inner loop simplifies the subgraph controlled by each variable to determine a cost, thus its complexity is O(kn). The outer loop will iterate O(n/k) times as each iteration replaces a subgraph of size O(k) with an adaptive subgraph. Therefore, the complete process has a complexity of O(n 2 ). On a Pentium III 1 GHz computer, adaptive subgraphs can be generated for the 20 largest MCNC [18] circuits in 19 minutes. Table 1 . The effect of adaptation on the 20 largest MCNC circuits.
TECHNOLOGY MAPPING AFTER ADAPTATION
that are not a part of any adaptive subgraph in a cone at v. For example, in Figure 12 , d can be a part of cone at f , e can be a part of a cone at g, and c can be a part of a cone at e. However, a 0 /a 1 cannot be a part of a cone at b or d.
RESULTS
The 20 largest MCNC [18] circuits were used to study the benefits of the adaptation step. Before undergoing adaptation, each circuit was first synthesized using SIS (scriptrugged) [19] and then technology mapped into 4-LUTs using IMap [20] . IMap was also used for the second technology mapping step following adaptation. For each circuit, Table 1 presents the number of control variables used during adaptation (Ctrls), the area before adaptation (Unadapted), the area following adaptation (Adapted), the area following the second technology mapping step (Mapped), and the fraction of LUTs in the final circuit that are adaptive (Adaptive).
Although an average of 7.1 control signals were used during adaptation, there are significant differences in the number of control variables used to adapt each circuit. Some circuits such as bigkey produce significant area reductions with a single control variable while circuits such as clma produce similar area gains with a much larger number of control variables (13) . Architectural decisions will be guided by the larger number of control variables as an architecture that supports several control variables can be made to mimic a single control variable, whereas an architecture that supports a single control variable cannot mimic the presence of several control variables.
When adaptation is used in isolation, an average area savings of 18% is observed. The addition of the second technology mapping step increases the area savings to 28%.
A large fraction of the circuit produced by adaptation (88%) is adaptive. From an architectural point of view, this suggests that most of the LUTs in an AFPGA must be of the adaptive variety to effectively support the circuits being produced by adaptation.
WORK DESCRIBED ELSEWHERE
Although adaptive circuits use 28% fewer LUTs than nonadaptive circuits, this is only one of the factors in the area comparison between adaptive circuits. To determine the true area benefits of AFPGAs, two other factors need to be quantified. First, the area of the extra circuitry required to implement adaptive circuits needs to measured. Second, the structural changes that a circuit underwent to become adaptive may produce an associated increase or decrease in the demand for routing resources, and this effect needs to be measured. A detailed area study considering these effects is presented in [8] . The study showed that even after the area overheads were considered, an AFPGA was found to be approximately 8-14% more area efficient than a comparable FPGA.
SUMMARY
We presented a high-level overview of the AFPGA architecture and detailed the adaptation step which was used to exploit the architectural features present in an AFPGA. Two algorithms that were part of the adaptation step were described. The first algorithm generated implication sets which were then used by the second algorithm to generate adaptive subcircuits. Following adaptation, a technology mapping step was run to reduce both the area and depth of the resulting circuits. The entire procedure reduced the area of the 20 largest MCNC circuits by 28%.
