Previous work has suggested that "soft" synthesizable programmable logic cores can efficiently provide small amounts of post-fabrication flexibility to integrated circuits. Previous architectures restrict the circuitry assigned to the core to be combinational. In this paper, we present two methods to enhance these architectures to support sequential logic. We apply these methods to a previously developed fabric, and optimize and compare them. We also describe a proof-of-concept chip employing one of our techniques.
INTRODUCTION
Embedded programmable logic cores are increasingly being seen as an attractive way to provide post-fabrication flexibility to an integrated circuit. A programmable logic core is a flexible logic fabric that can be customized to implement any digital circuit after fabrication [1-6]. Before fabrication, the designer embeds a programmable fabric (consisting of many uncommitted gates and programmable interconnects between the gates). After fabrication, the designer can then program these gates and the connections between them.
Post-fabrication flexibility is attractive for a number of reasons. First, it may be possible to postpone some design details until late in the design cycle. Secondly, as products are upgraded, or as standards change, it may be possible to incorporate these changes using the programmable part of the chip, without fabricating an entirely new device. Finally, it may be possible to fabricate a single chip for an entire family of devices. The characteristics that differentiate each member of the family can be implemented using the programmable logic. Several integrated circuits containing programmable logic cores have been described [7, 8] .
Despite these compelling advantages, the integration of programmable logic cores into ASICs still faces many challenges. One of the challenges is that design tools that allow seamless integration of fixed and programmable logic are still not mature. Timing analysis, power distribution, and verification are difficult when the function to be implemented in the core is not known.
In [9], an alternative technique is described which addresses this concern by shifting the burden from the SoC designer to mature standard-cell synthesis tools. In this technique, core vendors supply a synthesizable version of their programmable logic core (a "soft" core) and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. A "soft core" is one in which the designer obtains a description of the behaviour of the core, written in a hardware description language. Note that this is distinct from the behaviour of the circuit to be implemented in the core, which is determined after fabrication. Here, we are referring to the behaviour of the programmable logic core itself. Although this technique incurs increased speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating "hard" cores into a fixed-function chip. For very small amounts of logic, this ease of use may be more important than the increased overhead.
The synthesizable architectures presented in [9, 10] can only implement combinational logic circuits. In this paper, we describe how the architecture can be enhanced to support sequential circuits. A straightforward embedding of flip-flops within each logic block, as is done for stand-alone FPGAs, will not suffice. After describing our baseline architecture in Section 2, Section 3 will explain why simply embedding flip-flops in each logic block will not suffice, and will outline two solutions. Section 4 will describe a novel placement algorithm for one of our architectures. Section 5 will then present experiments aimed at optimizing and comparing our architectures, and Section 6 will present a proof-of-concept chip that illustrates our approach.
We will illustrate our techniques using the product-term based synthesizable architecture described in [10]. In this architecture, each logic block is a product-term block (PTB), which contains an AND plane to programmably generate product terms, and an OR plane to programmably combine these product terms. We describe a PTB using the tuple (i,p,o) where i is the number of inputs, p is the number of product terms (p-terms), and o is the number of outputs. In [10], the PTB outputs are not registered. The PTB's are arranged in levels, and connected using a unidirectional network. A PTB in level i can drive a PTB in any level greater than i, but cannot drive a PTB in a level less than or equal to i. This directional interconnect architecture ensures that the core can be synthesized using standard off-the-shelf synthesis and verification tools. Many of these tools (such as those performing timing verification) have difficulty handling combinational loops. An unprogrammed standard FPGA architecture contains many of these loops (although a good designer will rarely configure the FPGA to implement combinational loops; before configuration, they exist). The directional interconnect proposed in [10] does not contain combinational loops.
ENHANCED ARCHITECTURES
Standard FPGA architectures support sequential logic through the use of one or more flip-flops embedded in each logic block, as shown in Figure 1(a). Each logic block output can be programmably registered or unregistered.
The problem with this approach for a synthesizable programmable logic architecture lies in the interconnect. Applying a directional interconnect to an architecture containing flip-flops as in Figure 1(a) will not suffice, since there will be no way to implement registered feedback loops; such loops are an important part of most sequential circuits such as state machines (where the state variables must be used as inputs to the next-state logic). However, simply adding feedback signals to the architecture to support these registered feedback loops will re-introduce combinational loops in the unprogrammed fabric (through the combinational output of the logic block).
A. Architecture 1: Dual-Network Architecture
Our approach is to provide both the registered output and combinational output signals outside of the PTB, and connect the registered outputs and unregistered outputs to the other PTB's using two separate routing networks, as shown in Figure 1(b). The unregistered network is "directional", as in [9, 10], while the registered network can connect an output to a PTB in any level. In this way, the unprogrammed fabric contains loops, but each loop contains at least one flip-flop, meaning the synthesis tools will be able to process the circuit effectively. Figure 2 shows our architecture. As in [10], we assume each PTB has 10 inputs, 3 outputs, and either 9 or 18 product terms depending on the size of the core. Experimentally, we found that a "triangular" arrangement of PTB's, in which the number of PTB's in level i is 50% of the number of PTB's in level i-1, provides the best tradeoff between area and delay, for both combinational and sequential circuits. As shown in the figure, the combinational output of each PTB can drive PTB's in any subsequent level [...]. Although the interconnect grows quadratically with the number of PTB's, in practice the number of PTB's is small, and this interconnect does not become unwieldy.
Unlike the approach in stand-alone FPGA's, this technique requires two routing networks, which significantly increases the size of the fabric. This area can be reduced by only including flip-flops for some logic block outputs. In the next section, we will present experiments to determine how many logic blocks should contain registered outputs.
B. Architecture 2: Decoupled Architecture
Depopulating the flip-flops as described above reduces the flexibility of the architecture. One way to combat this is to replace the flip-flops within the logic blocks with a global array of registers. All logic block outputs connect to a number of "general" registers through an array of multiplexors. The outputs of the registers then feed back into the logic block inputs. By not associating any particular register output with any specific logic block output, utilization of the registers can be greatly increased. This added flexibility, however, comes at a cost of increased area for signal routing. In the next section, we will determine whether this decoupling leads to an overall improvement in density, and determine how many global registers are required.
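The loop argument behind the dual-network fabric of Architecture 1 can be checked on a small model. The following is a minimal sketch, not the authors' tooling: the function names, the graph encoding, and the choice to let a registered output reach every PTB (including its own) are assumptions made for illustration. It builds the unprogrammed connectivity for a "triangular" fabric and confirms that the combinational (directional) network alone has no loops, so every loop in the fabric must pass through a flip-flop.

```python
def build_fabric(ptbs_per_level):
    """Return (comb, reg) adjacency dicts for an unprogrammed fabric.

    A PTB is identified by (level, index). Hypothetical model:
    comb[p] lists PTB's reachable through the directional network,
    reg[p] lists PTB's reachable through the registered network.
    """
    ptbs = [(lvl, k) for lvl, n in enumerate(ptbs_per_level) for k in range(n)]
    # Directional network: a PTB may only drive strictly higher levels.
    comb = {p: [q for q in ptbs if q[0] > p[0]] for p in ptbs}
    # Registered network: a flip-flop output may drive a PTB in any level.
    reg = {p: list(ptbs) for p in ptbs}
    return comb, reg


def has_cycle(edges):
    """Depth-first search for a directed cycle in an adjacency dict."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in edges}

    def visit(n):
        colour[n] = GREY
        for m in edges[n]:
            if colour[m] == GREY or (colour[m] == WHITE and visit(m)):
                return True
        colour[n] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in edges)


comb, reg = build_fabric([4, 2, 1])        # each level has 50% of the previous
comb_links = sum(len(dsts) for dsts in comb.values())
print(has_cycle(comb), has_cycle(reg), comb_links)
```

In this model the combinational network is acyclic while the registered network is not, which matches the claim that the unprogrammed fabric contains loops only through flip-flops. Adding even one combinational feedback edge from a higher level to a lower one makes `has_cycle(comb)` return `True`.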
MAPPING ISSUES
In order to map a logic circuit to the architecture, an automatic placement and routing tool is necessary. Due to our rich interconnect, routing is trivial. In this section, we describe a novel placement algorithm that maps a netlist of PTB's (generated by PLAmap [11]) to our architecture.
The primary goal of the placement tool is two-fold: it must find a placement that satisfies the directional constraints imposed by the routing network, and ensure that sequential elements in the circuits are mapped to PTB's containing flip-flops. The algorithm initially assigns PTB's to levels in a depth-first manner, based on dependencies in the circuit. Such an assignment may result in more PTB's assigned to a level than there are physical PTB's in that level, or may lead to a mismatch between registered PTB's and registers in the architecture. To resolve these conflicts, levels are visited in order (starting from the inputs), and the excess PTB's are demoted to subsequent levels. The selection of PTB's from within a level to demote is done on the basis of the PTB's slack (as defined in [12]).
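The level-assignment and demotion steps described above can be sketched as follows. This is an illustrative simplification, not the authors' tool: the function name `place`, the netlist encoding, and the slack measure (latest legal level minus current level) are assumptions, since the definition from [12] is not reproduced here. The sketch also ignores the matching of registered PTB's to flip-flops and handles only the directional constraint and the per-level capacity.

```python
from collections import defaultdict


def place(fanout, capacity):
    """Assign PTB's to levels; return {ptb: level} or None if it does not fit.

    fanout:   dict mapping a PTB name to the PTB's it drives combinationally.
    capacity: capacity[k] is the number of physical PTB's in level k
              (level 0 is nearest the inputs).
    """
    nodes = set(fanout) | {s for succs in fanout.values() for s in succs}
    fanin = defaultdict(list)
    for src, succs in fanout.items():
        for dst in succs:
            fanin[dst].append(src)

    level = {}

    def asap(n):
        # Depth-first: one level past the deepest driver.
        if n not in level:
            level[n] = 1 + max((asap(p) for p in fanin[n]), default=-1)
        return level[n]

    tail = {}

    def chain_below(n):
        # Longest chain of PTB's that must still follow n.
        if n not in tail:
            tail[n] = 1 + max((chain_below(s) for s in fanout.get(n, [])), default=-1)
        return tail[n]

    def demote(n, new_level):
        # Move n down and push its fanout further so directionality holds.
        if level[n] < new_level:
            level[n] = new_level
            for s in fanout.get(n, []):
                demote(s, new_level + 1)

    for n in nodes:
        asap(n)
        chain_below(n)

    num_levels = len(capacity)
    for k in range(num_levels):
        # Assumed slack: latest legal level minus current level k.
        here = sorted((n for n in nodes if level[n] == k),
                      key=lambda n: (num_levels - 1 - tail[n] - k, n))
        for n in here[capacity[k]:]:      # the PTB's with the most slack move
            demote(n, k + 1)

    if any(lv >= num_levels for lv in level.values()):
        return None                        # a larger core is needed
    return level


print(place({"a": ["c"], "b": ["c"], "c": [], "d": []}, [2, 2]))
```

In the example, three PTB's initially land in level 0, which holds only two; `d` has the most slack and is demoted to level 1. Returning `None` corresponds to the case where the placement tool must try a larger core.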
ARCHITECTURE OPTIMIZATION
In this section, we experimentally optimize several architectural parameters for the architectures in Section 3. We also compare the decoupled architecture to the dual-network architecture.
A. Dual-Network Architecture Optimization
As described in the previous section, we can reduce the area by only including flip-flops in some PTB's. By examining benchmark circuits mapped on an architecture with registers in every PTB, we observed that the utilization of registers in lower levels (near the left side of Figure 2) was significantly lower than the utilization of registers in higher levels (near the right side of Figure 2). Intuitively, this makes sense; combinational values may need to be computed using several levels of PTB's before they are registered. Thus, rather than depopulating the flip-flops uniformly, we assume the number of PTB's with registered outputs in level i is equal to $r_i = n_i \cdot v^{(l-i+1)}$, where $n_i$ is the number of PTB's in level i, l is the number of levels, and v is a constant that we optimize in this section.
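To make the non-uniform depopulation concrete, the sketch below evaluates the rule per level. It rests on stated assumptions: the rule is read as r_i = n_i * v^(l-i+1) with levels numbered from 1 at the inputs, and fractional counts are rounded to the nearest integer (the text does not say how rounding is done). The function name is hypothetical.

```python
def registered_ptbs(ptbs_per_level, v):
    """Registered PTB's per level under the assumed rule r_i = n_i * v**(l-i+1).

    ptbs_per_level[0] is level 1 (nearest the inputs); l is the level count.
    Rounding to the nearest integer is an assumption, not taken from the text.
    """
    l = len(ptbs_per_level)
    return [round(n_i * v ** (l - i + 1))
            for i, n_i in enumerate(ptbs_per_level, start=1)]


# Triangular fabric with 8, 4 and 2 PTB's in levels 1 to 3.
for v in (1.0, 0.5):
    print(v, registered_ptbs([8, 4, 2], v))
```

With v=1.0 every PTB keeps its registered outputs, while smaller values of v remove flip-flops fastest from the levels nearest the inputs, where utilization was observed to be lowest.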
To find an optimum value for v, we mapped 47 MCNC sequential benchmark circuits, each containing between 14 and 296 equivalent 4-LUT's, to PTB's using PLAmap [11].
For each value of v, the placement tool described in Section 4 was used to find the smallest architecture onto which each circuit could be mapped. The area and depth of this architecture were measured, and these quantities were averaged over all benchmark circuits. We have used depth instead of estimated delay because we found these quantities were well correlated.
Intuitively, as v increases, the number of flip-flops and the size of the non-directional network increase, so we would expect the area to rise. On the other hand, if v is too small, the placement tool will find it more difficult to find a legal mapping for circuits with many sequential elements, and thus will compensate by choosing a larger core. Thus, we would expect these two competing trends to cause a minimum. The depth drops slightly as v increases; this is because the placement tool finds it easier to find legal placements, meaning, in some cases, it can get by with a smaller core. The product of area*depth as a function of v is shown in Figure 3(a). Overall, the area*depth product shows a minimum at v=0.8.
B. Decoupled Architecture Optimization
We next investigate the effect of replacing the registered PTB outputs with a global array of registers, as described in Section 3B. We will use the parameter d to denote the number of global registers as a fraction of the total number of PTB output signals. Thus, if there are n PTB's, each with o outputs, there are r = n*(d*o) flip-flops in the register file.
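As a small worked example of the sizing rule r = n*(d*o), the sketch below computes the register-file size for a given fabric. The function name is hypothetical, and rounding up to a whole flip-flop is an assumption, since the text does not say how fractional results are handled.

```python
import math


def register_file_size(n, o, d):
    """Global flip-flops for n PTB's with o outputs each and fraction d.

    Implements r = n*(d*o); rounding up is an assumption. The inner round()
    only guards against floating-point noise such as 16.080000000000002.
    """
    return math.ceil(round(n * d * o, 9))


# Eight (10, p, 3) PTB's expose 24 output signals in total.
print(register_file_size(8, 3, 0.5))   # half as many registers as outputs
print(register_file_size(8, 3, 1.0))   # one register per output
```

With d=1.0 the register count matches the fully populated dual-network case, so d below 1.0 measures how much sharing the global array is relied on to provide.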
Intuitively, if d is too low, the placement tool will not find enough flip-flops to implement the benchmark circuits, and will increase the size of the fabric to compensate. On the other hand, if d is too large, some flip-flops will remain unused. As shown in Figure 3(b), a compromise of d=0.67 provides the best area*depth tradeoff.
C. Dual-Network and Decoupled Architecture Comparison
Comparing the minima of the two curves in Figures 3(a) and (b), it is clear that a well-optimized decoupled architecture is more efficient than a well-optimized dual-network architecture. The area*delay product of the best decoupled architecture is roughly 8% lower than that of the best dual-network architecture.
As a proof-of-concept, we laid out a programmable logic core using the dual-network architecture with v=1.0, and used it to implement a parallel network interface module from [13] as our target application, as described in [14]. Unlike [14], in which only the combinational next-state logic was implemented, in our architecture, we can implement the entire state machine, including the flip-flops.
[...] and "PTB6-8" are logic blocks belonging to the first and second level, respectively; and the multiplexors labelled "M1", "M2", and "Mout" are interconnect blocks in the first, second, and last levels, respectively.
The speed and area required by our core depend on the timing constraints supplied to the synthesis tool. Figure 5 shows an area*delay graph for several different timing constraints. The implementation with the lowest area*delay product had a delay of 6 ns, and an area of 348,000 µm² in a 0.18 µm TSMC process. Table 1 compares these measurements with those obtained from the lookup-table based fabric in [14]. Unlike our implementation, the fabric in [14] does not include flip-flops, and thus can only be used to implement the combinational parts of the state machine. As the table shows, our architecture is 12% smaller and 40% faster than the lookup-table based core. The difference is primarily due to the use of a product-term based architecture rather than a lookup-table based architecture.
CONCLUSIONS
In this paper, we have described two methods to support the implementation of sequential logic in synthesizable programmable logic cores. In one method, flip-flops are embedded in each logic block, and registered and unregistered outputs of each logic block drive two separate routing networks. In the second method, the flip-flops are arranged in a single shared global register file that can be accessed by all logic blocks. Overall, this second scheme results in an 8% lower area*delay product. We have implemented our architecture in a 0.18 µm TSMC process and shown that it is 12% smaller and 40% faster than a lookup-table based core of similar capacity.
