huge inefficiencies in the implementation of asynchronous designs as circuits require a variety of separate devices. As a result, most asynchronous designers focus on custom or semicustom integrated circuits, incurring greater expense in time and money. The net effect has been that optimized and robust asynchronous circuits have not become a part of typical system designs. The asynchronous circuits that must be included are usually designed in an ad hoc manner with many underlying assumptions. Such a highly error-prone process causes implementations to be unnecessarily susceptible to delay variations.
Field-programmable gate arrays, one of today's dominant media for prototyping and implementing digital circuits, are also inappropriate for constructing more than the simplest asynchronous interfaces. They lack the critical elements at the heart of current asynchronous designs. Unfortunately, resolving this problem is not just a simple matter of adding these elements to the programmable array. The FPGA must also have predictable routing delay and must not introduce hazards in either the logic or routing. Furthermore, the mapping tools must also be modified to handle asynchronous concerns, especially the proper decomposition of logic to fit into the programmable logic blocks and the proper routing of signals to ensure that required timing relationships are met.
Ideally, we need an FPGA that can support both synchronous and asynchronous circuits with comparable efficiency. As a step in this direction we present Montage, an integrated system of FPGA architecture and mapping software designed to support both asynchronous circuits and synchronous interfaces. The architecture provides circuits with hazard-free logic and routing, mutual exclusion elements to handle metastability, and methods for initializing unclocked elements. The mapping software generates placement and signal routing sensitive to the timing demands of asynchronous methods. With these features, the Montage system forms a prototyping and implementation medium for asynchronous designs, providing asynchronous circuits with a powerful tool from the synchronous designer's toolbox.
Requirements for FPGA support
There are numerous reasons why synchronous FPGAs and mapping techniques cannot be used for asynchronous circuits. They fall into the general categories of hazards, timing constraints, state-holding elements, analog components, and decomposition.
Hazards. In a synchronous circuit a clock determines when a signal is sampled. The value of the signal is only important near the sampling clock edge, allowing the designer to largely ignore any extraneous signal transitions. In contrast, an asynchronous circuit is constantly sampling its signals. Because of this, any extraneous transitions (hazards) may cause incorrect results, and thus must be avoided.
Hazards may be inherent in a Boolean function or arise because of the way it is implemented. For example, if both inputs to an XOR are allowed to change simultaneously, an unavoidable hazard results because one input may change before the other. Thus, slight differences in signal arrival times will cause the circuit to generate spurious transitions. Asynchronous circuits either do not use elements with unavoidable hazards or do not allow the hazardous situations to occur. However, during circuit mapping to FPGAs, especially during decomposition, the circuit logic may be altered, possibly adding hazards. Careful decomposition techniques must be used to restructure the logic so that the resulting circuit remains free of hazards. (We discuss this later.)
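The XOR hazard described above can be illustrated with a minimal discrete-time sketch. The function names and the integer arrival times are assumptions made for illustration only, and the gate is modeled as ideal with zero delay; it is not a model of any particular FPGA logic block.

```python
def xor_waveform(a_change_time, b_change_time, horizon=6):
    """Sample the output of an ideal XOR while both inputs rise from 0 to 1.

    Each input switches from 0 to 1 at its own (hypothetical) arrival time.
    """
    outputs = []
    for t in range(horizon):
        a = 1 if t >= a_change_time else 0
        b = 1 if t >= b_change_time else 0
        outputs.append(a ^ b)
    return outputs


def count_transitions(waveform):
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(waveform, waveform[1:]) if prev != cur)


# Perfectly simultaneous arrival: the output never leaves 0.
aligned = xor_waveform(a_change_time=2, b_change_time=2)
# One time unit of skew: the output pulses to 1 and back, which is a hazard.
skewed = xor_waveform(a_change_time=2, b_change_time=3)
```

In a clocked design the pulse in `skewed` would be harmless if it settled before the sampling edge; an asynchronous consumer of this signal could react to it.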
Designers do not have precise control over how logic is implemented within the logic block of an FPGA. These logic blocks must be designed such that they do not introduce new hazards into implementations. Again, circuits may have unavoidable hazards that an implementation cannot avoid, but the implementation must not introduce hazards that do not exist in the original circuit. If one has only simple gates as logic elements (such as ANDs and XORs, as in the CFA FPGA), making them hazard-free is easy. However, lookup tables (LUTs), the element generally used in FPGAs to implement arbitrary n-input functions, are much more complex than simple gates. As shown in the next section, certain implementations of LUTs do not introduce hazards into the logic they implement.
A less obvious concern is that FPGA routing typically involves not just simple wires, but includes routing functions. For example, the interconnection of several wires in an FPGA will often be accomplished by a multiplexer, which must be free of hazards.
However, it is easy to design a multiplexer that suffers from charge sharing. Charge sharing allows a value to be stored on an unused capacitance, and when this value is reconnected to a signal, it may momentarily alter the signal's value. In this way, the routing functions can introduce hazards even when the specification did not call for logic.
The point to be made is that hazard avoidance in FPGAs is a subtle issue, and while it is possible to remove hazards from an FPGA architecture by careful design, an FPGA not specifically targeted to asynchronous circuits will most likely generate hazards.
Timing constraints. All asynchronous design methodologies make assumptions about how and where delays are encountered in the resulting circuits. These circuits depend on the assumptions, and an FPGA system must meet these assumptions to properly map the circuits. Bounded-delay methodologies require upper bounds on the delays in all circuit elements, and insert extra delays into feedback or other paths in response.
The inserted delay need only exceed a bound computed from other delays in the circuit, so its exact value can be left to the placement and routing tools. However, the FPGA system must be able to insert these delays in the circuit. Fine-grained architectures generally leave many logic cells unused in their mappings, and the paths to be delayed could be fed through unused cells configured as buffers, thus delaying the signal.
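As a rough illustration of inserting delay with buffer cells, the sketch below computes how many buffers are needed to push a path's delay strictly above a required bound. The function name, the uniform per-buffer delay, and the delay units are all assumptions for illustration; a real tool would use measured cell and routing delays.

```python
import math


def buffers_needed(required_delay, current_delay, buffer_delay):
    """Return how many buffer cells push a path's delay strictly above the bound.

    All delays share one (hypothetical) unit; buffer_delay must be positive.
    """
    if buffer_delay <= 0:
        raise ValueError("buffer_delay must be positive")
    if current_delay > required_delay:
        return 0
    shortfall = required_delay - current_delay
    # The delay must be strictly greater than the bound, so an exact
    # multiple of buffer_delay still needs one more buffer.
    return math.floor(shortfall / buffer_delay) + 1
```

For example, a path with delay 4 that must exceed 10 using buffers of delay 2 needs four buffers, since three would only reach 10.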
All FPGAs could delay signals by using more circuitous paths. However, the router must be efficient enough to find these paths. Routers are usually based on finding the shortest path under some cost metric, which can be computed efficiently. Unfortunately, finding the shortest path with delay greater than some value is a more complex problem. Other methodologies have bundled-data constraints, which require the delay along one path (where a path includes both logic and routing) to be greater than the delay on other paths. This is a similar, but more difficult, version of the same problem, since the burden of meeting the timing constraint must be fairly shared by all segments of a path.
Quasi-delay-insensitive methodologies [6,7] contain isochronic fork constraints. These are either symmetric, where all ends of a fork must be reached nearly simultaneously, or asymmetric, where one end of the fork must be reached before the other. While speed-independent methodologies [8,9] assume there is no delay in any wire, in practice this assumption can be replaced by isochronic forks. In many FPGAs, the routing resources are very complex, with delays often greater than the logic delays. In such a system, meeting isochronic constraints can be almost impossible. While asymmetric forks can be handled simply by routing to the required earlier destination first, and then to other destinations, symmetric forks are much harder to design. Unless there is a relatively fast path from some shared routing resource to all fork destinations, there is very little chance that the symmetric isochronic fork assumption will be met.
The final timing constraint used is atomic, multioutput gates. Specifically, some methodologies use gates with more than one output, assuming that the logic for all gate outputs will react to a new input at about the same time. For example, a toggle element is a one-input, two-output element in which one output responds to odd input transitions and the other to even input transitions. This method assumes that by the time one output sends an output transition, the other output has sensed the input. Thus, the environment can then send in a new transition without worrying that the unfired output hasn't sensed the previous input.
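The toggle's intended behavior can be written as a small behavioral model. This is only a sketch: the class and attribute names are illustrative, and because each call updates the whole element at once, the model assumes exactly the atomicity that a real implementation has to guarantee through timing.

```python
class Toggle:
    """Behavioral model of a one-input, two-output toggle element."""

    def __init__(self):
        self.transitions_seen = 0
        self.odd_output = 0   # flips on the 1st, 3rd, 5th, ... input transition
        self.even_output = 0  # flips on the 2nd, 4th, 6th, ... input transition

    def input_transition(self):
        """Apply one input transition and return which output fired."""
        self.transitions_seen += 1
        if self.transitions_seen % 2 == 1:
            self.odd_output ^= 1
            return "odd"
        self.even_output ^= 1
        return "even"
```

If the logic driving the two outputs were mapped to separate cells with very different delays, the output that did not fire might still be processing the previous transition when the next one arrives, which this atomic model cannot show.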
One way to handle this constraint is to carefully craft a module set to guarantee that the atomic gate constraint is met [10]. These modules will contain not only logic but also routing designed to ensure the constraints. To allow this, the mapping tools must respect the structure of the modules, ensuring that all resources are assigned as dictated, and the architecture must be reasonably uniform so that a module has many potential placements.
State-holding elements. Synchronous circuits require some mechanism for storing information from one clock cycle to the next. (locally clocked latches with metastability handling).
While adding each of these elements into the standard logic cell would be expensive, all but the Q latches can be implemented out of standard combinational logic. To implement these n-input state-holding functions, one can express them as an (n + 1)-input combinational function, with the function's output fed back as the new input. However, the methodologies usually consider logic elements as atomic gates (as just discussed). Thus, we must be able to ensure that this feedback path is fast enough for a changing element to stabilize before another input arrives. Current FPGAs route this feedback path the same as all other signals, so meeting the timing constraints can be difficult.
It is also important to consider the starting state of the FPGA. After an SRAM-based FPGA is programmed, or after an antifuse-based FPGA is powered up, the programming will be established, but the signal values may be incorrect. In a synchronous circuit, we can simply set the latches to some preset state (a feature provided in many FPGAs) and wait for the circuit to stabilize before starting the clock. Unfortunately, an asynchronous circuit has no global clock to stall, and the state-holding functions often have no latches to preload. We could require the circuit logic to have an explicit reset signal, but it would require a large amount of extra logic. An alternative is to provide an underlying mechanism to keep state-holding functions at a preset value until the circuit settles.
Analog components. Many asynchronous circuits contain elements that reliably sample a signal at a given point (a synchronizer) or determine which of two signals arrives first (an arbiter). What is special about these elements is that while the elements may take an arbitrarily long time to respond, the responses must always be correct and free of hazards.
For example, an arbiter with two inputs and two outputs raises an output when the corresponding input is raised, while ensuring that at most one output is raised at a time. When the arbiter raises an output, it will not lower the output until the corresponding input is lowered. This behavior cannot reliably be implemented in a purely digital circuit. Thus, we cannot use the standard digital logic elements provided in FPGAs to map these elements. One solution is to include a mutual exclusion element (Figure 1) into the FPGA architecture. With the addition of appropriate digital logic around it, the element can reliably perform synchronization and arbitration functions.
Decomposition. To map a synchronous circuit into an FPGA, we must restructure it so that its basic elements fit in the FPGA's logic elements. This process is called decomposition or technology mapping. For LUT-based FPGAs, we break all logic elements into individual gates with no more inputs than the LUT can handle. For other FPGA logic blocks, this may require changing the types of gates used as well.
For example, an FPGA whose only logic element is a NAND cannot implement an XOR directly. Instead, the XOR would be replaced with an equivalent sum-of-products form, with the proper number of inputs per gate, which can then be implemented by NAND gates.
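One possible NAND-only realization of this example is sketched below, using the sum-of-products form a·b' + a'·b with inverters built from NAND gates. The function names are illustrative, and the check covers Boolean equivalence only; it says nothing about whether the decomposed circuit is hazard-free.

```python
from itertools import product


def nand(x, y):
    """Two-input NAND on 0/1 values."""
    return 1 - (x & y)


def xor_from_nands(a, b):
    """XOR built only from two-input NANDs via a*(not b) + (not a)*b."""
    not_a = nand(a, a)          # a NAND with tied inputs acts as an inverter
    not_b = nand(b, b)
    term1 = nand(a, not_b)      # complement of the product a AND (NOT b)
    term2 = nand(not_a, b)      # complement of the product (NOT a) AND b
    # By De Morgan's law, the NAND of the complemented products is their OR.
    return nand(term1, term2)


# Exhaustive comparison against the built-in XOR operator.
equivalent = all(xor_from_nands(a, b) == (a ^ b) for a, b in product((0, 1), repeat=2))
```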
While decomposition for synchronous circuits is well understood, these techniques are not sufficient for all asynchronous methodologies. For synchronous circuits, we can apply operations such as De Morgan's law, associativity, and Boolean minimization. We can use many of these techniques for bounded-delay methodologies, especially purely algebraic operations [11,12].
Decomposition for other methodologies, particularly quasi-delay-insensitive [6,7] and speed-independent [8,9], is suspect. For example, consider the circuit in Figure 2a. It contains a ring oscillator of three inverters and a three-input AND gate attached to the inverter outputs. In this circuit the AND gate will never fire. If we use standard decomposition techniques to map this to an FPGA with two-input LUTs, the AND gate will break into two cascaded AND gates (shown in Figure 2b). Note that this resynthesis is one of those allowed for both synchronous and bounded-delay circuits.
As we can see from the state graph in Figure 2c, this circuit can reach state 1011, where the top AND gate might become true. Thus, this decomposition is incorrect.
A correct decomposition appears in Figure 2d. While the original circuit is not useful since it never generates an output, it represents a large number of situations in asynchronous circuits in which a gate is partially but not completely activated. That is, in many situations a gate comes within one input transition of an output change, but the circuit changes some other gate input. In fact, a circuit could reach all states within one input of the gate firing without actually firing the output. In such a situation any simple decomposition can fail, since some new signal transition is unsensed by the rest of the circuit. What is necessary is a more complex resynthesis of the circuit, ensuring that no gate transition is unsensed. Unfortunately, we are currently aware of no work that addresses this problem.
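The failure of the cascaded decomposition can be reproduced with a small state-space exploration. The sketch below assumes a speed-independent model (any excited gate may fire next, wires have no delay). The signal names `x1`, `x2`, `x3`, `a`, and `y`, the choice of which two inverter outputs feed the first AND gate, and the starting state are assumptions, since the figure itself is not reproduced here.

```python
from collections import deque


def reachable_states(gates, initial):
    """Explore every state reachable when any excited gate may fire next.

    `gates` maps a signal name to a function of the state dict that returns
    the value the gate is driving toward; a gate is excited when that value
    differs from its current output.
    """
    names = sorted(initial)
    start = tuple(initial[n] for n in names)
    seen, queue = {start}, deque([start])
    while queue:
        state = dict(zip(names, queue.popleft()))
        for name, func in gates.items():
            if func(state) != state[name]:           # this gate is excited
                nxt = dict(state, **{name: func(state)})
                key = tuple(nxt[n] for n in names)
                if key not in seen:
                    seen.add(key)
                    queue.append(key)
    return [dict(zip(names, s)) for s in seen]


def output_can_fire(gates, initial, output="y"):
    """True if some reachable state excites the output gate."""
    return any(gates[output](s) != s[output] for s in reachable_states(gates, initial))


# Ring oscillator of three inverters.
ring = {
    "x1": lambda s: 1 - s["x3"],
    "x2": lambda s: 1 - s["x1"],
    "x3": lambda s: 1 - s["x2"],
}
# Original circuit: one three-input AND gate.
original = dict(ring, y=lambda s: s["x1"] & s["x2"] & s["x3"])
# Naive decomposition: two cascaded two-input AND gates (assumed pairing).
decomposed = dict(ring, a=lambda s: s["x1"] & s["x2"], y=lambda s: s["a"] & s["x3"])

start_original = {"x1": 1, "x2": 0, "x3": 1, "y": 0}
start_decomposed = {"x1": 1, "x2": 0, "x3": 1, "a": 0, "y": 0}
```

Under these assumptions the three-input AND is never excited, while the cascaded version can reach a state with `x1 x2 x3 a = 1011`, where the intermediate gate still holds a stale 1 and the second AND becomes excited.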
Technology mapping must also deal with hazards in an FPGA's logic elements. While a logic element might be able to implement a given specification, it could also introduce new hazards. + a&) . Thus, unless the circuit to be mapped does not allow this transition to occur, the circuit cannot be implemented with this element. Techniques exist for handling such situations in bounded-delay circuits [13].
The Montage architecture
As discussed earlier, asynchronous circuits are not well served by current FPGA architectures. Asynchronous logic implementations must consider hazards, synchronization and arbitration of events, and strict adherence to the timing assumptions of the design methodologies. Unfortunately, current FPGAs do not address these issues. Some of the required elements cannot be implemented in the standard digital logic found in these devices. In addition, the logic and routing elements must be designed more carefully to avoid hazards, since in asynchronous circuits every transition is important. Finally, routing resources must have predictable, optimizable delays to help meet timing assumptions.
The Montage FPGA is a version of the Triptych architecture designed to handle synchronous interface and asynchronous circuits [14]. Like Triptych, Montage is an SRAM-based FPGA, which has the advantage over an antifuse-based FPGA of allowing the chip to be programmed for delay testing without permanently configuring it. Note that while we discuss a specific instance of the Montage architecture, we are currently considering architectural variations, including alteration of the vertical interconnection and increases to lookup table size.
b). Merging two copies of (b), with data flowing in opposite directions in the two copies, gives the structure (c). Shown in (d) are the connections between the two copies at diagonal crossings.
...
Figure 4. Top half of a segmented channel (on its side). The bottom half is identical to the top.
The Montage global routing structure is identical to the Triptych routing structure, with diagonal connections between local cells as well as augmented vertical segmented channels (Figure 3). This structure effectively maps general synchronous circuits. It is even better suited to asynchronous circuits in which we expect to find more tightly connected subcircuits and less random global routing. Also shared with Triptych is the philosophy of allowing mappings to fix the trade-off between logic and routing resources by having logic blocks capable of performing routing functions.
Montage's short, diagonal connections handle most routing, providing fast signal propagation. The vertical segmented channels handle longer range connections and large fan-out nodes. They are implemented as a set of segmented channel wires (Figure 4) that connect the center outputs of routing and logic blocks (RLBs) to the center inputs of RLBs flowing in the same direction in the next column. Needless to say, this flexibility leads to slower signal propagation, and speed-critical designs will avoid using the vertical channels for critical paths. There are seven tracks in a vertical channel; six handle inter-RLB routing and the seventh carries a pin input. The six inter-RLB tracks are broken up into two tracks each of 8-, 16-, and 32-RLB-high segments. The basic Montage array is 64 RLBs high by 16 wide.
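The track arithmetic above can be checked with a short sketch. The segment lengths and array height come from the text; the assumption that segments start at multiples of their length (rather than being staggered) and the function names are illustrative only.

```python
ARRAY_HEIGHT = 64                          # RLBs per column, from the text
SEGMENT_LENGTHS = [8, 8, 16, 16, 32, 32]   # the six inter-RLB tracks


def segments_per_track(array_height=ARRAY_HEIGHT, lengths=SEGMENT_LENGTHS):
    """Number of segments each track is broken into."""
    return [array_height // length for length in lengths]


def tracks_joining(row_a, row_b, lengths=SEGMENT_LENGTHS):
    """Indices of tracks on which both rows fall inside a single segment.

    Assumes (hypothetically) that segments start at multiples of their length.
    """
    return [i for i, length in enumerate(lengths)
            if row_a // length == row_b // length]
```

Under the alignment assumption, rows 7 and 8 straddle a boundary of the 8-high segments and can only be joined on the longer tracks, while rows 31 and 32 straddle every boundary and cannot share any single segment.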
A Montage RLB (Figure 5) has three inputs, three outputs, and a functional unit (FU) that operates on the inputs. There are two different types of functional units. The first is a logic block, which implements logic functions and state-holding elements. As shown in Figure 6, the logic block has a lookup table capable of implementing any function of three inputs. We chose to show this switch logic function block because it does not suffer from charge sharing. This is important because asynchronous circuits require very clean signals, with absolutely no extraneous transitions.
The function output can be fed through a D latch. We can configure this latch either with one of two clocks in synchronous mode (allowing two independently clocked synchronous circuits to coexist on a chip) or with a choice of initialization state in asynchronous mode. In the asynchronous initialization mode the latch is set to a value during programming. The latch holds the function output to this value until the circuit stabilizes, at which point the latch is bypassed. Each RLB can choose independently how to use the D latch, so a single circuit can have two separately clocked synchronous circuits, asynchronous elements initialized with the built-in circuitry, and unlatched logic blocks.
Note that we can replace any one of the three logic block inputs with a feedback line carrying the function's output value. This feature allows us to build asynchronous state-holding elements. We do so by expressing the state-holding function of n inputs as a combinational function of (n + 1) inputs. Here, the extra input is the function's previous value. In this way a single logic block can implement any three-input combinational function, or a two-input state-holding function such as an asynchronous S-R flip-flop or a Muller C element.
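As an illustration of this feedback technique, the sketch below tabulates a Muller C element as a three-input lookup table whose third input is the previous output. The helper names and the bit ordering of the table index are assumptions for illustration; they do not describe the actual Montage configuration format.

```python
def make_lut(func, n_inputs=3):
    """Tabulate `func` over all input combinations, like a configured LUT."""
    table = []
    for index in range(2 ** n_inputs):
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        table.append(func(*bits))
    return table


def c_element_next(a, b, prev):
    """Muller C element: follow the inputs when they agree, otherwise hold."""
    return (a & b) | (prev & (a | b))


C_LUT = make_lut(c_element_next)


def step(lut, a, b, prev):
    """Evaluate the LUT with the previous output fed back as the third input."""
    return lut[a | (b << 1) | (prev << 2)]
```

Driving the element with the input pairs (1,0), (1,1), (0,1), (0,0) from an initial output of 0 yields outputs 0, 1, 1, 0: the output rises only when both inputs are high and falls only when both are low.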
The second type of functional unit is an arbiter block, which can implement an arbiter, an enabled arbiter, or a synchronizer. We can also combine these units with logic blocks to form more complex functions such as Q flops. All inputs are completely permutable and can be inverted. Although we expect these blocks to be used infrequently, the roles they serve in asynchronous circuits are essential and are not implementable in standard digital logic. Thus, they must appear as special, built-in blocks in any FPGA designed to implement asynchronous circuits but which does not allow mappings to program circuits at the transistor level. (Crosspoint is an example of an antifuse-based FPGA that might allow sufficient transistor-level programming to implement an arbiter [15].)
As an example of Montage's power, Figure 7 shows all of Ebergen's [7] basic elements mapped to Montage. The C element, NCEL, RCEL, and XOR can all be mapped into a single RLB. Since locations in gray are only used for routing and can easily be used for logic from other circuit elements, a toggle requires two RLBs, and the sequencer needs 10. The sequencer is the only element including an arbiter block, which is used in the center left RLB of the sequencer mapping. Since we can fit approximately eight Montage RLBs into the space of one Xilinx CLB, these are very efficient mappings. (The CLB is the basic tile of one of the most popular current FPGAs, which can implement at most two functions in one CLB.) Figure 8 shows larger hand-mappings, including Martin's fair arbiter [6] and a Sutherland-style FIFO [5].
Currently, we plan a 15:1 ratio between the number of logic blocks and arbiter blocks, as shown in Figure 9. We chose this number based on the relative infrequency of arbiters and synchronizers in typical asynchronous circuits.
Since we found that typical Triptych mappings used at least 25% of their RLBs for routing only, jobs which the arbiter RLBs in Montage are equally capable of handling, we believe most unused arbiters will be absorbed into this factor. However, we have taken care to ensure arbiter blocks occupy the same amount of area as logic blocks, allowing easy alteration of the arbiter mix in Montage implementations.
An important point to be made about the architecture is how Montage handles bundled data, inserted delays for bounded-delay circuits, and isochronic forks. For bundled data and inserted delays, the Montage routing structure's simplicity makes it easier to design a router that ensures delayed signals take longer paths. Also, since Montage mappings will typically leave up to 25% of the logic blocks unused, these unused logic blocks can serve as inserted delays by configuring the blocks as buffers.
Montage uses different implementations for the two types of isochronic forks (Figure 10). For asymmetric isochronic forks, in which one end must be reached before the other, Montage routes the signal to the critical end of the fork and then from there to the other end of the fork. Thus, the dual routing and logic nature of a Montage RLB ensures the signal reaches one end before the other. For symmetric isochronic forks, in which all ends must be reached simultaneously, Montage places the ends of the fork either off the same interconnection line or off diagonals flowing from a shared-source RLB. In this way, the isochronic fork depends on the delays of very localized elements, delays which can easily be checked during initial chip verification.
Montage mapping software
A more difficult requirement is for the placement and routing tools to ensure isochronic constraints. For the placer, we require that all destinations of a symmetric isochronic fork be placed so that the constraint can be met. Specifically, the destinations must be able to share a single vertical segmented channel, or the diagonals from a shared neighbor RLB. To incorporate this requirement into the annealer's cost function, we could simply add a penalty for all isochronic forks that do not meet this constraint. Unfortunately, forks with large numbers of destinations will rarely happen to line up as a proper isochronic fork, and the annealer has little chance of meeting all the constraints.
Our solution is to extend the fork penalty to recognize when a fork constraint is getting close to being met. Specifically, the penalty is decreased when two or more terminals are positioned so that the constraint can be met between those pins, with larger locally correct groups decreasing the penalty even more. In this way the annealer is encouraged to get closer and closer to a proper placement, while allowing it to try different fork positionings. In practice, fork constraints are almost always met.
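One way such a graded penalty could be computed is sketched below. This is an assumed formulation, not the cost function used in the Montage tools: each destination is described by the set of fork points (hypothetical identifiers such as a channel or a neighbor RLB) from which it could be reached, and the penalty counts the destinations left outside the largest group sharing one fork point.

```python
from collections import Counter


def fork_penalty(candidate_fork_points):
    """Penalty for one symmetric isochronic fork.

    `candidate_fork_points` holds, for each destination, the set of fork
    points (hypothetical identifiers) from which that destination could be
    reached. The penalty is the number of destinations left out of the
    largest group sharing one fork point, so it shrinks as more
    destinations line up and reaches zero when all of them do.
    """
    if not candidate_fork_points:
        return 0
    counts = Counter()
    for points in candidate_fork_points:
        counts.update(set(points))
    # With no shared fork point at all, each destination is a group of one.
    largest_group = max(counts.values(), default=1)
    return len(candidate_fork_points) - largest_group
```

Because the value drops by one each time another destination joins the largest compatible group, an annealer receives credit for partial progress instead of seeing a flat penalty until the whole fork happens to line up.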
Routing of these symmetric isochronic forks also requires special handling. Specifically, we cannot simply attempt to reach each fork destination individually, since they may take paths inconsistent with the isochronic assumption. Instead, we check the placement of the destinations of an isochronic fork to determine all valid fork points. For example, if the isochronic fork ran to exactly two different destinations, and they were one cell apart in the same column, the fork point could either be on a shared vertical segmented channel or either of the two RLBs that are direct neighbors of both of these destinations. Then, instead of routing to the function blocks of the destinations, we route to these fork points.
The Triptych router uses a straightforward shortest-path algorithm to route individual source-sink pairs. This algorithm maintains a queue of the neighbors of short paths found so far and repeatedly removes the shortest neighbor from this queue, adding back in the neighbors of the removed node. Once a fork point is reached, we calculate the cost for routing from this point to each of the destinations of the symmetric isochronic fork, add it to the cost of the current route up to this point, and reinsert it into the queue. When we finally reach one of these complete routes in the queue, we know this is the preferred route and accept it. In this way, we can directly extend all of our work on performance optimization and congestion avoidance to isochronic fork routing without excess special casing.
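The queue-based extension described above can be sketched as a Dijkstra-style search. This is a minimal illustration, not the Triptych router itself: the graph representation, the function name, and the precomputed `fork_completion_cost` table are assumptions, and all costs are taken to be nonnegative.

```python
import heapq


def route_to_fork(graph, source, fork_completion_cost):
    """Shortest-path search that ends at the cheapest complete fork route.

    graph: dict mapping node -> list of (neighbor, edge_cost) pairs.
    fork_completion_cost: dict mapping each valid fork point to the cost of
        routing from it to every fork destination (hypothetical, precomputed).
    Returns (total_cost, fork_point), or None if no fork point is reachable.
    """
    queue = [(0, 0, source, False)]   # (cost, tie_breaker, node, is_complete)
    counter = 1
    settled = set()
    while queue:
        cost, _, node, is_complete = heapq.heappop(queue)
        if is_complete:
            # The first complete route to reach the front is the cheapest one.
            return cost, node
        if node in settled:
            continue
        settled.add(node)
        if node in fork_completion_cost:
            # Reinsert the fork point with the cost of finishing the fork added.
            heapq.heappush(queue, (cost + fork_completion_cost[node], counter, node, True))
            counter += 1
        for neighbor, edge_cost in graph.get(node, []):
            if neighbor not in settled:
                heapq.heappush(queue, (cost + edge_cost, counter, neighbor, False))
                counter += 1
    return None
```

A nearby fork point with an expensive completion does not win automatically: its complete route waits in the queue while the search continues, and a more distant fork point with a cheaper total is accepted if its complete route surfaces first.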
Placing asymmetric isochronic forks, in which one destination must be reached before another, simply requires that the distance metric be extended to properly reflect the resulting routing. Since we will route the signal through the earlier destination and then on to the later destination, we simply treat this segment between destinations as a separate signal. A similar extension works for the router, with the addition that it routes the new signal not from a function block output but instead from wherever the signal enters the earlier destination's RLB.
By building on our Triptych work, we have developed, in a fairly short time, an integrated tool set for placing and routing asynchronous circuits with isochronic forks. We have not yet extended these tools to handle bundled data or inserted delays.
The development of an FPGA for asynchronous circuits opens up several new avenues of exploration. The entire process of mapping for FPGAs must be reevaluated for this domain. Most obviously, placement algorithms must take into account the constraints generated by bundled data and inserted delays, and routers must ensure these constraints are met. Decomposition tools must also be developed for properly breaking down circuit elements into sizes accommodated by the target FPGAs.
