Abstract-As integrated circuits become increasingly complex and expensive, the ability to make post-fabrication changes will become much more attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of "hard" rectangular layouts. In this paper, we focus on an alternative approach for fine-grain programmability: vendors supply a synthesizable RTL version of their programmable logic core (a "soft" core) and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. Although this technique suffers in terms of speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating "hard" cores into an ASIC or SoC. When the required amount of programmable logic is small, this ease of use may be more important than the increased overhead. This paper presents two synthesizable "soft" programmable logic core architectures and describes their associated place and route issues. We compare the two architectures to each other, and to a "hard" programmable logic core. We also show how these cores can be made more efficient by creating a nonrectangular architecture, an option not usually available to "hard" core vendors. Finally, a proof-of-concept integrated circuit containing one of these cores is described.
I. INTRODUCTION
RECENTLY, we have witnessed impressive improvements in the achievable density of integrated circuits. In order to maintain this rate of improvement, designers need new techniques to manage the increased complexity inherent in these large chips. One such emerging technique is the system-on-a-chip (SoC) design methodology. In this methodology, pre-designed and pre-verified blocks, often called cores or intellectual property (IP), are obtained from internal sources or third parties, and combined on a single chip. These cores may include embedded processors, memory blocks, interface blocks and components that handle application specific processing functions. Large productivity gains can be achieved using this approach. In fact, rather than implementing each of these components separately, the role of the SoC designer is to integrate them onto a chip to implement complex functions in a relatively short amount of time.
One major issue today in SoC design is the overall design cost in terms of engineering costs, the cost of IP blocks and the rising costs of masks in advanced technologies. For this reason, it is desirable to construct programmable SoCs to amortize the cost of a single design across many related applications. Furthermore, the cost of errors in the design can be significant. No matter how seamless the SoC design flow is made, and no matter how careful an SoC designer is, there will inevitably be some chips that have problems that are found after fabrication. This may be due to design errors not detected by simulation or it may be due to a change in design requirements. While this type of problem is not unique to chips designed using the SoC methodology, it lends itself to the use of an elegant solution to the problem: one or more programmable logic cores can be incorporated into the SoC.
A programmable logic core (PLC) is a flexible logic fabric that can be customized to implement any digital circuit after fabrication. Before fabrication, the designer embeds a programmable fabric, consisting of many uncommitted gates and programmable interconnects between the gates, onto the chip. After the fabrication, the designer can then program these gates and the connections between them to serve different applications or implement design changes. These configurable logic blocks and connections have also been commonly referred to as embedded FPGAs (field programmable gate arrays), as opposed to stand-alone FPGAs that have been available for two decades.
Several companies already provide programmable logic cores [1]-[4]. Yet, the use of these cores is still far from mainstream. There are a number of reasons for this:
1) Tools for the design and integration of programmable fabrics are not yet widely available. This is somewhat of a chicken-and-egg problem: existing tools and flows will not be enhanced to support the seamless integration of programmable logic cores until this design technique becomes mainstream, and the design technique will not become mainstream until the tools are enhanced to support programmable logic cores. However, as chip design costs escalate, the economics of chip design will be a strong driver for increased hardware programmability.
2) Programmable logic cores come in relatively fixed formats. That is, the integrated circuit designer cannot modify the overall size of the fabric or the internal structure of the programmable logic core. The integrated circuit designer must choose a programmable logic core that is closest to the desired size; this could lead to wastage of chip area. This can be addressed by providing tiles of programmable logic that can be snapped together to form a design logic fabric of the desired size to minimize the area penalty.
3) Embedded programmable logic is not as efficient as hardwired logic in terms of area, power and speed. There are, however, special-purpose fabric generators emerging that can provide a better tradeoff between these specifications, depending on the target application.
In spite of these barriers, we believe that the use of embedded programmable fabrics will continue to increase on both ASIC and SoC designs. There will be a need for large-grain, medium-grain and fine-grain fabrics to serve a variety of needs on the chip. Of particular interest in this paper is the use of fine-grain programmable fabrics.
0018-9200/$20.00 © 2005 IEEE
There are many cases where an integrated circuit designer would prefer to have many very small regions of programmable logic, rather than a single or handful of large programmable logic regions. As a simple example, consider a control logic block which coordinates the operation of the rest of the chip; it may be beneficial to map selected parts of this control logic to programmable logic, rather than the entire control logic block.
In this paper, we describe a novel method for incorporating fine-grain programmable logic cores into an SoC. Rather than providing "hard" rectangular layouts, core vendors would provide "soft" descriptions of their programmable logic cores (PLC). Alternatively, the user could develop these cores themselves without much difficulty. These descriptions would typically be written at the register transfer level (RTL) in a hardware description language (HDL), such as VHDL or Verilog. We refer to this as a soft PLC. The integrated circuit designer could then incorporate the soft PLC description into the RTL description for the rest of the (nonprogrammable) chip, and then synthesize the entire chip using existing synthesis tools. The advantages and certain limitations of this approach are the subject of this paper.
In [5], Phillips and Hauck describe the Totem architecture, which is a coarse-grained programmable logic fabric. Phillips and Hauck describe several ways of implementing their fabric, one of which is to use a soft description mapped to standard cells. Unlike our approach, however, they focus on large coarse-grained fabrics rather than the small fabrics that might be incorporated into an SoC. Reference [6] also describes a standard-cell implementation of a programmable logic fabric, but again, it does not specifically target the SoC domain.
This paper is organized as follows. First, the soft PLC technique is described in more detail in Section II. Sections III and IV describe new architectures and place-and-route algorithms for these cores. Since the soft cores are intended to be synthesized using standard synthesis tools, it is unlikely that traditional FPGA architectures, optimized for full-custom layout, will be appropriate. We provide two novel architectures [7], [8] that are designed specifically for these soft cores. Section V identifies key parameters for our architectures, and seeks optimum values for these parameters. Finally, Section VI describes our experiences with a test chip that was fabricated using one of our synthesizable programmable logic cores. Conclusions are provided in Section VII.
II. SOFT PLC DESIGN FLOW
As described in the introduction, integrated circuit designers who wish to use a programmable logic core typically receive a "hard core" which contains the actual physical transistor layout information. The size and shape of the core are fixed; the only freedom the designer has is where to position the core on the chip and how to connect the I/O to the block. However, using our scheme, the designer receives the core in the form of a "soft core". A "soft core" is one in which the designer obtains an RTL description of the behavior of the core, written in Verilog or VHDL. In this sense, it is similar to the definition of a soft IP core used in SoC designs [15]. The distinction is that, in a soft PLC, the user circuit to be implemented in the core is programmed after fabrication.
The value of this approach is derived from the tools needed to implement the fabric. Since the designer receives only an RTL description of the behavior of the core, synthesis tools must be used to map the behavior to gates and eventually to layout. These tools can be the same ones that are used in the standard ASIC flow. In fact, the primary advantage of the new method is that existing ASIC tools can be used to implement the chip. No modifications to the tools are required, and the flow follows a standard integrated circuit design flow. This will significantly reduce the design time of chips containing these cores.
A second advantage is that this technique allows small blocks of programmable logic to be positioned very close to the fixed logic that connects to the programmable logic to improve routability and shorten wire lengths. The use of a "hard core" requires that all the programmable logic be grouped into a small number of relatively large blocks. A third advantage is that the new technique allows users to customize the programmable logic core to better support the target application. This is because the description of the behavior of the programmable logic core is an RTL description that can be understood and edited by the user. Finally, it is easy to migrate the programmable block to new technologies; new programmable logic cores from the core vendors are not required for each technology node [15] .
Of course, the main disadvantage of the proposed technique is that the area, power, and speed overhead will be significantly increased, compared to implementing programmable logic using a hard core. Thus, for large amounts of circuitry, this technique would not be suitable. It only makes sense if the amount of programmable logic required is small. In Section V, we will quantify this tradeoff, but first we explore the issues of design flow and architecture suitable for such an approach.
The basic design flow employing soft PLCs is as follows:
1) The integrated circuit designer partitions the design into functions that will be implemented using fixed logic and programmable logic, and describes the fixed functions using a hardware description language. At this stage, the designer must determine the size of the largest function that will be supported by the core; this can be done either by considering example configurations, or based on the experience of the designer.
2) The designer obtains an RTL description of the behavior of a programmable logic core. This behavior is also specified in the same hardware description language.
3) The designer merges the behavioral description of the fixed part of the integrated circuit (from step 1) and the behavioral description of the programmable logic core (from step 2), creating a behavioral description of the block.
4) Standard ASIC synthesis, place, and route tools are then used to implement the soft PLC behavioral description from step 3. In this way, both the programmable logic core and fixed logic are implemented simultaneously.
5) The integrated circuit is fabricated.
6) The user configures the programmable logic core for the target application.
Note that in Step 4 of the design flow, there is an important difference in the implementation of the programmable logic for a standard FPGA fabric and a soft PLC fabric, as illustrated in Fig. 1. Consider the simplified view of a 3-input lookup table (3-LUT) used in an FPGA. The standard fabric uses SRAM cells to store configuration bits and pass transistors to implement the 3-LUT shown in Fig. 1(a). In the soft PLC case shown in Fig. 1(b), a standard-cell library is used to implement the same 3-LUT. In fact, all desired functions of the soft PLC are constructed from NANDs, NORs, inverters, flip-flops (FF) and multiplexers from the standard cell library. The same holds true for the programmable interconnect in the FPGA and soft PLC.
To emphasize this point further, consider how the complete fabric would be constructed in the two cases. For the soft PLC, the final logic schematic and layout are determined by the logic synthesis tool, technology mapping algorithms, and the place-and-route tool. In the case of a hard fabric, a custom layout approach is used to create a "tile" for the FPGA. Then the FPGA fabric is assembled by replicating the tiles horizontally and vertically. Clearly, the standard FPGA approach is more area efficient but the soft PLC has the advantage of ease of use.
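To make the standard-cell view of the 3-LUT concrete, the following is a minimal behavioral sketch in Python rather than in an HDL. It assumes the LUT is simply eight stored configuration bits feeding an 8:1 multiplexer whose select lines are the three inputs; the names `lut3` and `program_lut3` are hypothetical and are not taken from any vendor core.

```python
def lut3(config_bits, a, b, c):
    """Behavioral model of a 3-input LUT: an 8:1 multiplexer over 8 stored bits."""
    if len(config_bits) != 8:
        raise ValueError("a 3-LUT needs exactly 8 configuration bits")
    index = (a << 2) | (b << 1) | c  # the three inputs form the multiplexer select
    return config_bits[index]


def program_lut3(func):
    """Derive the 8 configuration bits from any 3-input Boolean function."""
    return [int(bool(func((i >> 2) & 1, (i >> 1) & 1, i & 1))) for i in range(8)]


# Example: configure the LUT as a 3-input majority function.
majority_bits = program_lut3(lambda a, b, c: (a + b + c) >= 2)
```

In a synthesized soft PLC, the eight bits would be held in flip-flops and the multiplexer would be mapped to library cells by the synthesis tool; the model above only captures the logical behavior.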
III. PROPOSED ARCHITECTURES FOR SOFT PLC
Now that the main features of the approach have been outlined, we describe two alternative architectures for a soft programmable logic core. The first proposed architecture is very similar to a standard FPGA architecture with some adjustments.
However, this approach still has a significant area penalty. Since the desired fabric is intended for fine-grain programmability, one would expect the architecture to be different from standard FPGAs. As will be shown in Section V, we can reduce the area of our core by removing some degree of flexibility; the second architecture contains fewer programmable switches and hence is more area-efficient, yet contains enough flexibility to implement small circuits.
A. Architecture 1: Directional Architecture
The most straightforward way to implement a synthesizable programmable logic core is to describe the behavior of a standard FPGA at the RTL level using a hardware description language. The standard FPGA blocks are fairly complex and allow for both combinational and sequential elements. It is important to carefully consider the target applications and the required complexity of the programmable blocks. In doing so, we can make the following observations. Observation 1: Synthesizable programmable logic cores only make sense for very small amounts of programmable logic. An envisaged application would be the next state logic in a state machine. In that case, only combinational functions are needed.
Observation 2: Many CAD tools (the tools that will be used to synthesize the programmable logic core, perform timing verification, etc.) have problems with combinational loops.
These observations motivate us to modify a standard FPGA architecture. First consider Observation 1. Since we are targeting small amounts of logic, we began with an architecture that will only implement combinational logic, allowing us to remove all flip-flops needed for sequential logic functions. Flip-flops can be added at the inputs and outputs of the programmable logic core by the IC designer if desired. Removing flip-flops reduces area and simplifies timing analysis. Of course, the flip-flops associated with the programming cells are still required for both logic and interconnect blocks.
Observation 2 leads to a more interesting problem since an un-programmed PLC contains many combinational loops. Although these loops are ultimately false paths, they can still pose problems for CAD tools and during the actual configuration bit programming process. Thus, we have created a "directional" architecture in which the flow between logic blocks can only occur from left to right. Since our architecture only implements combinational circuits, this will not allow any loops in the logic; any feedback loops that are required would be implemented outside of the core.
Based on these observations, we have created the architecture shown in Fig. 2(a). Each switch block is a standard switch block, with all right-to-left connections removed, as shown in Fig. 2(b).
A simplified view of the 3-LUT is shown again in Fig. 2(c) . The choice of a 3-LUT (as opposed to a 4-LUT or 5-LUT) was based on the observation that the ratio of logic area divided by routing area is larger in a synthesized core than a hand-optimized core; thus, we found that a smaller LUT is more efficient.
B. Architecture 2: Gradual Architecture
We can consider more efficient architectures by making the following additional observations.
Observation 3: Since we are implementing such small circuits, we should consider removing some flexibility to improve area efficiency.
Observation 4: Since the core will be hardwired into a fixed-function chip, we will require additional flexibility on the inputs and outputs.
Observation 5: Unlike a hard FPGA layout, it is not critical that each tile be identical. In a hard layout, FPGA vendors do not wish to lay out multiple tiles; in our case, the fabric is synthesized and laid out automatically by CAD tools. Therefore, we have some freedom in defining the structure of the underlying fabric.
These observations lead to the architecture in Fig. 3 , which we call the "Gradual Architecture." Like the Directional Architecture, signals in the Gradual Architecture flow from left to right, and the logic resources consist only of 3-LUTs. However, in this architecture, the number of horizontal routing channels gradually increases from left to right, since more outputs are generated in each level that can be used as inputs by the downstream LUTs. The vertical tracks are only accessible through LUT outputs (each vertical track can be driven by one LUT), and can be connected to horizontal tracks using a dedicated multiplexer at each grid point. Note that, except for this multiplexer, no switch block is required in this architecture. The extension of this architecture to any number of rows and columns is straightforward.
The routing multiplexers in the first column are different from the others. We have performed experiments showing that primary inputs are frequently required in many different columns. Thus, we have included several routing multiplexers in each row (we will vary the number of these multiplexers in Section V). For each row there are one or more output select multiplexers to choose a primary output of the circuit. The output multiplexers choose between the outputs of all LUTs located in the last column and any horizontal line located above or below that specific row. The exception to this is that only one routing multiplexer per row from the first column passes a signal to the output select multiplexers.
IV. PLACEMENT AND ROUTING ISSUES
Once a programmable logic core has been embedded into a chip design, and the chip has been manufactured, the user-defined circuit can be implemented on the core. A CAD tool is usually employed to determine the programming bits needed to implement the user-defined circuit. Since our architectures contain novel routing structures, some modifications must be made to standard FPGA placement and routing algorithms. In this section, we describe these modifications for the two architectures described in Section III.
It is important to note that we are not referring to the standard cell placement and routing tools needed to implement the programmable fabric itself onto the chip. Rather, the algorithms in this section are used to implement a user circuit on the programmable fabric after the chip has been fabricated. For example, the VPR tool [9] determines where to place the logic functions and how to form the connections between the logic functions on a given FPGA fabric. At the end of the process, the programming bits are generated for the fabric. These bits must be shifted into the fabricated chip to implement a user-defined circuit. The process is repeated if a different user circuit is to be implemented.
A. Placement Algorithms
1) Directional Architecture: The placement algorithm for the Directional Architecture described in Section III is based on the original simulated annealing placement algorithm of VPR [9]. The only change was to impose a restriction on the placer which stipulates that input sources for all blocks must originate from the left of that block. Otherwise, it is viewed as an illegal placement. During the annealing, we never allow a move that would result in an illegal placement.
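The left-to-right restriction can be illustrated with a short sketch. It is not the VPR implementation; it assumes a placement is a mapping from block name to column index and that connectivity is a list of (source, sink) pairs, and the function names are hypothetical.

```python
def is_legal_directional_placement(columns, nets):
    """Return True if every sink sits strictly to the right of its source.

    columns: dict mapping block name -> column index (assumed representation)
    nets:    iterable of (source_block, sink_block) pairs
    """
    return all(columns[src] < columns[dst] for src, dst in nets)


def propose_move(columns, nets, block, new_column):
    """Apply a trial move only if the resulting placement is legal.

    Returns the new placement when the move is accepted, or the original
    placement unchanged when the move would be illegal.
    """
    trial = dict(columns)
    trial[block] = new_column
    return trial if is_legal_directional_placement(trial, nets) else columns
```

In an annealer, a check of this kind would run before the usual cost-based accept/reject decision, so that illegal placements are never evaluated at all.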
The cost function used in the VPR placement algorithm depends on the delay of potential connections as well as on the Manhattan distance between pins. In a synthesized core, the delay between pins depends on where the individual cells that make up the core are positioned; blocks that are adjacent in the conceptual representation of Fig. 2(a) may be positioned far apart in the actual layout. However, for convenience, we base our placement cost function on the distances and delays in the conceptual representation. Improvements can be made by supplying the VPR tool with the extracted delay and distance information from the actual layout of the synthesized core. Instead of relying on the conceptual representation, we can then use the "physical" representation to obtain better delay estimates during placement and routing.
2) Gradual Architecture: In the Gradual Architecture, the routing fabric is less flexible than a standard FPGA. Poor placements can easily lead to un-routable implementations. We use a simulated annealing based algorithm with a unique cost function for this architecture, as described below. Fig. 4 shows two examples of "good" placements on a simplified view of the Gradual architecture. In Fig. 4(a) , a source logic block drives two sink logic blocks in the adjacent column. The corresponding net can be routed without any conflicts since no shared resources are required. Note that the input multiplexer used to feed each input pin of a logic block is not a shared resource; there is one such multiplexer per input pin. Any number of sinks in the column immediately adjacent to the source can be connected in this way as shown in Fig. 4(a) for the case of two sinks.
On the other hand, nets that drive logic blocks that are not in the immediately adjacent column must make use of routing multiplexers; these are shared resources. In the example of Fig. 4(b) , a net drives four sinks but only needs one routing multiplexer, since the sinks are all in two vertically adjacent rows (meaning that the track between the two rows can be used to drive all sinks). If another net also required the shaded routing multiplexer, a conflict would arise when we tried to route the two nets. Since these routing multiplexers are shared resources, we wish to minimize the number of routing multiplexers used by each net. Therefore, we should penalize placements that generate many such potential conflicts for the router. Again note that the input multiplexers used to feed the input pins of each logic block are not shared resources, and thus should not play a role in the cost of a given placement.
Based on these considerations, a new cost function was developed for the placement algorithm that directly relates to overuse of routing multiplexers. Before presenting the cost function itself, we first describe certain factors that will be used in the function. Consider the nets in Fig. 5(a) that would connect the indicated source and sink. In this case, we consider it equally likely that the final routed net will use one of the two indicated routing multiplexers; therefore, we define the demand for each of the two multiplexers as 0.5 relative to the indicated source and sink. In Fig. 5(b) , it is almost certain that the routed net will use the indicated routing multiplexer, since that single multiplexer can be used to feed both sinks, so the demand for that net is close to 1. Note that a valid route could be found that does not use this multiplexer; however, such a route would require two routing multiplexers. During placement, we assume that this will not happen, and thus, set the demand term for all other routing multiplexers for this net to 0. Of course, this does not mean the router is constrained to use this routing multiplexer. It is simply an assumption made to compute the cost function during placement. Fig. 6 shows a net that drives four vertically adjacent rows. In this case, we assume that the two indicated routing multiplexers are used with probability 1 during placement. Experimentally, we have determined that this leads to better results than if we assign all five routing multiplexers in that column the same value (which would be about 1/2). Again, note that the router is not constrained to actually use the indicated multiplexers.
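The demand assignments described for Figs. 5 and 6 can be approximated with a simple greedy rule. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes that track t runs between row t-1 and row t (so a sink in row r can be fed from track r or track r+1), that all sinks of the net lie in one non-adjacent column, and that a demand "close to 1" is modeled as exactly 1.0. The function name `estimate_demand` is hypothetical.

```python
def estimate_demand(sink_rows):
    """Estimate per-track routing-multiplexer demand for one net in one column.

    sink_rows: rows containing sinks of the net.
    Returns a dict mapping track index -> demand in [0, 1].
    """
    demand = {}
    rows = sorted(set(sink_rows))
    i = 0
    while i < len(rows):
        r = rows[i]
        if i + 1 < len(rows) and rows[i + 1] == r + 1:
            # Two vertically adjacent rows share the track between them,
            # so a single routing multiplexer is assumed to be used.
            demand[r + 1] = 1.0
            i += 2
        else:
            # An isolated row can be fed from the track above or below
            # with equal likelihood.
            demand[r] = demand.get(r, 0.0) + 0.5
            demand[r + 1] = demand.get(r + 1, 0.0) + 0.5
            i += 1
    return demand
```

With four vertically adjacent sink rows, this rule selects two of the five tracks with demand 1, which matches the choice described for Fig. 6; other row patterns are handled by the same pairing heuristic, which is an assumption of this sketch.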
To derive the cost function, we start by defining an occupancy function, $\mathrm{Occ}(c, r)$, of a routing multiplexer as an estimate of how many nets would like to use that routing multiplexer. We can write this as the sum of the estimated demand for a given multiplexer by each net:

$$\mathrm{Occ}(c, r) = \sum_{n \in \mathrm{Nets}} \mathrm{demand}(c, r, n)$$

where $\mathrm{demand}(c, r, n)$ is the estimated demand for the routing multiplexer at column and row $(c, r)$ by net $n$. As already described, the demand is a number in the range between 0 and 1; 0 implies that there is little chance that the router will use this multiplexer to route net $n$, while 1 means that the router will, with high probability, use this multiplexer when routing net $n$.
Next we define the capacity function, $\mathrm{Cap}(c, r)$, of a routing multiplexer as the number of output lines available from a given set of input lines. It is an estimate of the ability to satisfy the routing demand at a given location. Typically, the capacity of all routing multiplexers is set to 1 since each one has a single output. However, for those muxes in the first column, the capacity is equal to the number of horizontal lines that can be driven from primary inputs. Referring back to Fig. 3, the capacity function would be 3 since three muxes drive 3 adjacent horizontal lines from the same set of primary inputs at each location.
With these definitions in place, the cost of a given placement on a C-column, R-row core is given by

$$\mathrm{Cost} = \sum_{c=1}^{C} \sum_{r=1}^{R} \max\bigl(0,\ \mathrm{Occ}(c, r) - \mathrm{Cap}(c, r) + \gamma\bigr)$$

where $\mathrm{Occ}(c, r)$ is the occupancy demand of a routing multiplexer at location $(c, r)$, and $\mathrm{Cap}(c, r)$ is the output capacity of multiplexers at location $(c, r)$. We take the difference between $\mathrm{Occ}(c, r)$ and $\mathrm{Cap}(c, r)$ to incorporate the fact that one or more outputs are available at each location. If the difference is negative, we set the cost of that routing mux to 0 using the max function. The term $\gamma$ is a small bias value (set to 0.2 for our experiments).
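As an illustration of how such a cost could be evaluated during annealing, the sketch below sums per-net demands into an occupancy map and then accumulates the overuse at every location. It assumes the cost has the form of a sum over locations of max(0, occupancy - capacity + bias); the exact placement of the bias term is an assumption of this sketch, as are the data layout and the function name `placement_cost`. Locations are indexed from zero here.

```python
def placement_cost(demands, capacity, num_cols, num_rows, bias=0.2):
    """Estimate routing-multiplexer overuse for a candidate placement.

    demands:  dict mapping net name -> {(col, row): demand in [0, 1]}
    capacity: dict mapping (col, row) -> number of outputs; missing
              locations default to 1 (a single-output multiplexer)
    """
    # Occupancy of a location is the summed demand of all nets.
    occupancy = {}
    for per_net in demands.values():
        for loc, d in per_net.items():
            occupancy[loc] = occupancy.get(loc, 0.0) + d

    cost = 0.0
    for c in range(num_cols):
        for r in range(num_rows):
            occ = occupancy.get((c, r), 0.0)
            cap = capacity.get((c, r), 1)
            # Locations comfortably below capacity contribute nothing.
            cost += max(0.0, occ - cap + bias)
    return cost
```

A placer would call this after each trial move (or update it incrementally) and prefer placements with lower cost, since those are expected to leave the router with fewer conflicts over shared routing multiplexers.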
B. Routing Algorithms
The negotiated-congestion based routing algorithm from VPR [9] was used without modification for both architectures. For the Gradual Architecture, the routing task is very easy since there are only a few potential routes for each net. For the Directional Architecture, there are many potential routes so the routing is more complex. The use of the advanced router within VPR gave us the ability to evaluate different architectures and placement schemes during our architectural investigation.
V. EXPERIMENTAL RESULTS
In this section, we experimentally compare the two architectures described in Section III. We used 19 small combinational MCNC benchmark circuits [14]. We selected small circuits since these are the type of circuits we expect to be used with our architecture; large circuits would likely be implemented using hard programmable logic cores. For each circuit, we initially found the minimum-size square core on which the circuit can be placed and routed. We then created a VHDL description of each core, and synthesized it using Synopsys Design Compiler™ and a standard 0.18-μm CMOS library. The cell area reported by the Synopsys tool was used as the basis for comparison in Table I.
A. Directional Architecture Versus Gradual Architecture
The first four columns of Table I show the results for the Directional Architecture. For each benchmark circuit, we varied both the core size and the number of tracks in each channel, and chose the configuration which resulted in the minimum area; the chosen size and channel width are shown in columns two and three of the table. For each configuration, we then synthesized the architecture using Synopsys; the fourth column in the table shows the cell area required to implement the core.
The final three columns show the results for the Gradual Architecture. In this case, we varied both the core size and the number of input multiplexers per row, and chose the configuration which resulted in the lowest area. These numbers are reported in columns five and six of the table, and the synthesized cell area from Synopsys is shown in the final column. From the last row of the table, the geometric average of the area required to implement the circuits on the Gradual Architecture is 18.9% less than that required to implement the same circuits using the Directional Architecture.
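For readers reproducing this kind of comparison, the sketch below shows one way to compute a geometric-average area saving from per-circuit areas. The area values in the example are hypothetical placeholders, not entries from Table I, and the helper name `geometric_mean` is an assumption of this sketch.

```python
import math


def geometric_mean(values):
    """Geometric mean of a non-empty sequence of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-circuit cell areas (arbitrary units), one entry per circuit.
directional = [40000.0, 90000.0, 250000.0]
gradual = [30000.0, 81000.0, 200000.0]

# Per-circuit area ratios, then the average saving of Gradual over Directional.
ratios = [g / d for g, d in zip(gradual, directional)]
saving = 1.0 - geometric_mean(ratios)
```

The geometric mean of the per-circuit ratios equals the ratio of the two geometric means, so either formulation gives the same percentage. It also keeps a few large circuits from dominating the average.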
B. Soft Versus Hard Programmable Logic Cores
As mentioned in Section II, the primary disadvantage of using a "soft" programmable logic core is the reduced density, speed, and increased power consumption. In this subsection, we estimate the area penalty of a soft core compared to a hard core.
The most accurate way to compare the area required by soft and hard programmable logic cores would be to lay out (by hand) a hard core, and compare its area with the numbers in Table I. This is a time-consuming task. Instead, we estimated the size of a hard core using a detailed transistor-count model, following the methodology described in [9]. We focus on a 4 × 4 Gradual Architecture with three input multiplexers per row. By estimating the number of minimum transistor equivalents (MTEs) required to implement the circuit, and converting this to area in our 0.18-μm technology, we estimate the layout area of such a core to be 12868 μm². A soft core was generated using these same parameters, and the size (after synthesis using Synopsys and physical design using Cadence) was 81092 μm². Thus, the synthesized core requires approximately 6.4 times more area than the hard core.
This number is significant. Clearly, for large programmable logic cores, our approach would not be suitable. However, if only small amounts of programmable logic are required, this density penalty may be acceptable. In addition, the use of a hard core will usually require the selection of a core from a library. Since it is unlikely that a library would contain all sizes and shapes of cores, in most cases, a designer would end up choosing a larger core than is required. Using a soft core, the designer can create a core of any size. Even if a core of the appropriate size was created, the difficulty inherent in embedding hard cores may make the use of hard cores less attractive than our soft approach.
We have also compared our sizes to commercial FPGA layouts using publicly available information. These comparisons yield little insight, however, since the commercial devices contain far more tracks per channel, and contain additional elements such as flip-flops in the logic blocks.
C. Sensitivity of Results
As described in [11], it is critical to analyze results for their sensitivity to experimental assumptions. Table II shows two of our sensitivity results for the data in Table I. The first part of the table shows how the conclusions change if we alter the number of input/output connections per grid. In the experiments in Section V-A, it was assumed that a Directional Architecture has input/output connections along each of the four edges of the core, and that a Gradual Architecture has input/output connections along the left and right edges of the core. We attempted to use two other input/output ratios, and gathered the results in Table II. Although the Gradual Architecture always produced higher density than the Directional Architecture, the margin by which the Gradual was better varied (we do not have enough data to conclude that this is a result of anything other than experimental "noise"). According to the methodology in [11], we classify this experiment as sensitive to the input/output ratio, even though the conclusion that Gradual is better than Directional was the same in all cases.
The second part of the table shows how a less aggressive placement schedule (fewer moves per temperature and larger temperature drops during the annealing) and a less aggressive routing schedule (fewer routing attempts) affect the conclusions. In this case, the margin was smaller, meaning the experiment was only slightly sensitive to the choice of algorithm.
D. Nonrectangular Fabric
The grid of logic blocks in standard FPGAs is usually square or rectangular. According to [12], however, logic circuits often have a "triangular" shape, as shown in Fig. 7(a). In standard FPGAs, this does not present a problem, since the routing resources are flexible enough that signals can be routed left, right, up, or down, as shown in Fig. 7(b). This means that in a standard FPGA, the physical implementation of a circuit need not match the fanout shape of the circuit. In the architectures described in this paper, however, the signal flow is restricted from left to right. As shown in Fig. 7(c), this can lead to unused logic blocks if the circuit does not have a naturally square shape.
We can alleviate this problem somewhat by creating a programmable logic core that is not square. We have observed that in many implementations, several logic blocks in the rightmost columns remain unused. We can take advantage of this by removing logic blocks from the last few columns, as indicated with shading in Fig. 7(c). We quantify the number of logic blocks removed using a single removal parameter, defined as the proportion of the logic blocks in the top row that have been removed. In Fig. 7(c), is . In all cases, we remove blocks in a "triangular" fashion; if we remove blocks from column , we remove blocks from column . A value of 0 for the removal parameter indicates a rectangular core; a value of 1 indicates a triangular core. Note that a nonzero value does not imply a nonrectangular final layout. The diagram in Fig. 7(c) is a conceptual representation; the core will be synthesized into gates, and the gates will be placed into rows of standard cells regardless of the shape of the conceptual representation. Intuitively, as the removal parameter is increased, the area of the implementation will go down. If it is increased too much, however, the area will rise, since a larger virtual grid will be needed to fit the circuit. This effect can be seen in Fig. 8. Fig. 8(a) shows how the implementation area depends on the removal parameter for each circuit implemented on the Gradual Architecture (each line represents a different circuit). Because we were unable to synthesize large triangular cores using our synthesis tools, results are only shown for 11 of the 19 benchmark circuits. The geometric average over these 11 circuits is shown in Fig. 8(b).
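The relationship between the removal parameter and the number of logic blocks left in the conceptual grid can be sketched as follows. This is an illustration only: the names `remaining_blocks` and `removal_fraction` are hypothetical, and the rule that the leftmost affected column loses one block and each column to its right loses one more is an assumption made for this sketch.

```python
def remaining_blocks(rows: int, cols: int, removal_fraction: float) -> int:
    """Count logic blocks left after a triangular cut from the top-right corner.

    removal_fraction is the proportion of top-row blocks removed
    (0 = rectangular core, 1 = triangular core).
    Assumption for this sketch: the leftmost affected column loses one block
    and each column to its right loses one more, capped at the column height.
    """
    if not 0.0 <= removal_fraction <= 1.0:
        raise ValueError("removal_fraction must be between 0 and 1")
    # Number of columns whose top block is removed.
    affected = round(removal_fraction * cols)
    # Column k of the affected region (k = 1, 2, ...) loses k blocks.
    removed = sum(min(step, rows) for step in range(1, affected + 1))
    return rows * cols - removed


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, remaining_blocks(8, 8, fraction))
```

Under these assumptions, fewer blocks for a given grid does not guarantee lower area: if a circuit no longer fits, a larger grid is needed, and the block count rises again.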
Although each individual circuit in Fig. 8(a) exhibits its own characteristics, the results in Fig. 8(b) indicate that the overall gain obtained using a nonzero value of the removal parameter is relatively small. From Fig. 8(a), the "breakpoint" (the point at which a larger grid is needed) is not the same for each circuit. Thus, the average results show that only a modest improvement can be achieved. Overall, the parameter value that gave the lowest area was 0.6, which resulted in an 11.1% lower area than a square core, averaged over the 11 circuits.
VI. PROOF-OF-CONCEPT IMPLEMENTATION
To investigate the implementation issues of our synthesizable embedded core approach, we have chosen a module derived from a chip testing application. This module acts as a bridge between a test access mechanism (TAM) circuit [13] and an IP core under test. In the research work described in [13], the TAM is actually a communication network that transfers test data to/from internal IP blocks on the chip in the form of packets. The module we selected allows the TAM and the IP core to run at different frequencies, resulting in higher overall TAM throughput. A chip designed with this type of network TAM would contain one of these selected modules for each IP core on the chip. Fig. 9(a) shows a block diagram of the module. The module consists of a buffer memory, a packet assembly/disassembly block, and two state machines. Packets received from the TAM circuit are optionally buffered before being converted to a form usable by an IP core under test. A key component in the module is the Packet Assembly/Disassembly block, which controls the assembly and disassembly of test packets based on a given packet format. The packet format changed from time to time during the course of the research described in [13], and each change required a redesign of this block.
A. Reference Version

B. Programmable Version
When packet formats are modified to adjust header, data, and address information, the control circuitry must also be modified. Noting this fact, we decided that the next-state logic would benefit from programmability. This would allow the user to modify some packet processing and control operations simply by reprogramming the block. If the next-state logic of the state machine is made programmable, as shown in Fig. 9(b), new schemes can be implemented after fabrication of the integrated circuit. Although a hard programmable logic core could also be used here, this application is better suited to the soft PLC approach because of its fine-grain nature.
C. Implementation Issues
We designed two versions of this module: 1) the reference version with no configurability, and 2) the programmable version, in which the assembly/disassembly control is removed and replaced with a soft programmable logic fabric. The fabric uses the Gradual Architecture as it was found to be more efficient than the Directional Architecture. When adding the programmable component to our module, a number of other interesting issues arose. This section summarizes these issues.
1) Programmable Logic Core Size:
The first issue was how much programmable logic is needed to replace the fixed next-state logic. Without knowing the actual logic function that will eventually be implemented in the core, it is difficult to estimate the amount of programmable logic required. However, in this case, we have domain knowledge regarding the types of functions that will be implemented, and we can use this knowledge to make reasonable decisions. We designed two user logic functions that would be implemented in the core, and determined the size of the core that would be required to implement each function using VPR [9]. For our circuit, we found that a core consisting of 49 LUTs (i.e., a 7 × 7 array of 3-LUTs) would be sufficient for both potential logic functions; however, to allow some safety margin in anticipation of larger functions, a core of 64 LUTs (an 8 × 8 array) was used.
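The sizing step above can be expressed as a small calculation. The sketch below is illustrative only: the function name and the idea of expressing the safety margin as extra rows and columns are assumptions, since the text states only that a 64-LUT core was chosen over the 49-LUT minimum.

```python
import math


def square_array_side(required_luts: int, margin: int = 0) -> int:
    """Side length of the smallest square LUT array holding required_luts,
    enlarged by `margin` extra rows and columns (hypothetical safety margin)."""
    if required_luts < 1:
        raise ValueError("required_luts must be positive")
    side = math.isqrt(required_luts)
    if side * side < required_luts:
        side += 1  # round up when required_luts is not a perfect square
    return side + margin


if __name__ == "__main__":
    minimum = square_array_side(49)        # array that just fits 49 LUTs
    chosen = square_array_side(49, 1)      # one extra row and column
    print(minimum, minimum ** 2, chosen, chosen ** 2)
```

The LUT requirement itself still has to come from a place-and-route experiment such as the VPR runs described above; the calculation only converts that requirement into an array size.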
2) Connections Between the Core and the Fixed Logic: A second issue is how the programmable logic core is connected to the rest of the module. Although the core itself is programmable, specific inputs and outputs must be connected to the core in advance. This will dictate which functions are possible to implement in the core. Again, we have domain knowledge to assist us with this decision. We can select which inputs are connected to the core and which outputs will be made available from the core. In our design, the two user logic functions required 9 inputs and 10 inputs, respectively, and required 11 outputs and 12 outputs, respectively. We afforded ourselves some flexibility by hardwiring a selected set of 10 inputs and 13 outputs to our core.
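A simple count-based check of the hardwired interface against the two user logic functions might look like the sketch below. The names are hypothetical, and the check covers pin counts only; it does not verify that the specific signals each function needs are among those wired to the core, which is the harder part of the decision described above.

```python
# (inputs, outputs) required by the two user logic functions, from the text.
USER_FUNCTIONS = [(9, 11), (10, 12)]


def pins_cover(functions, core_inputs: int, core_outputs: int) -> bool:
    """True if every function fits within the hardwired pin counts."""
    return all(f_in <= core_inputs and f_out <= core_outputs
               for f_in, f_out in functions)


def spare_pins(functions, core_inputs: int, core_outputs: int):
    """Unused (inputs, outputs) relative to the most demanding function."""
    max_in = max(f_in for f_in, _ in functions)
    max_out = max(f_out for _, f_out in functions)
    return core_inputs - max_in, core_outputs - max_out


if __name__ == "__main__":
    print(pins_cover(USER_FUNCTIONS, 10, 13))
    print(spare_pins(USER_FUNCTIONS, 10, 13))
```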
3) Routing the Programming Clock Signal: During the physical design process, it became apparent that our synthesizable core was placing an extra burden on the router due to the large number of flip-flops in the design. A programmable logic core contains many configuration bits to store the state of individual routing switches and the contents of lookup tables; in a synthesizable core, these configuration bits are built using flip-flops that have clock inputs to enable programming. As shown in Fig. 10(a), there are configuration bits for input muxes and output muxes, as well as for the LUTs themselves. Each of these FFs must be connected to a common clock signal for programming purposes, as indicated by the bold line.
To determine how flip-flop-intensive our core is, we compared its flip-flop density to that of a nonprogrammable design. We analyzed an ASIC implementation of a 68HC11 core, and found that its flip-flop density (number of flip-flops per unit area) was only a fraction of the flip-flop density in our programmable logic core. Thus, we realized that the clock tree in our core would be more complex and consume more chip area than that of a typical ASIC. This was confirmed; in our implementation, 45% of the layout area was consumed by the clock tree, power striping, and signal routing (experience with other ASICs of this size has shown that 25% is usually enough). Furthermore, the FFs must be connected as one long shift register for programming purposes, and this also added to the routing complexity.
The results of the physical layout of the bit configuration clock routing are shown in Fig. 10(b). Our core contains 1803 such flip-flops, each connected to the bit configuration clock signal. The clock net highlighted in white is the configuration clock; this routing is clearly more complex than the other nets (shown in grey). This extra clock complexity increases the area overhead of the design beyond what would be estimated by considering only the standard cell area. In our case, this is a notable source of area overhead, since the original next-state logic was purely combinational, with no FFs or clocks. Note that this clock tree overhead would occur in both a soft and a hard programmable logic core.
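The programming mechanism described above, in which all configuration flip-flops share one clock and form a single shift register, can be modeled behaviorally as below. This is a sketch, not the RTL of the core: the class and function names are hypothetical, and the bit ordering (new bits enter at position 0) is an assumption.

```python
class ConfigChain:
    """Behavioral model of configuration flip-flops wired as one shift register."""

    def __init__(self, length: int):
        self.bits = [0] * length

    def clock(self, serial_in: int) -> int:
        """One configuration clock edge: every flip-flop captures its neighbor."""
        serial_out = self.bits[-1]
        self.bits = [serial_in] + self.bits[:-1]
        return serial_out


def program(chain: ConfigChain, bitstream) -> int:
    """Shift a full bitstream into the chain; returns the clock cycles used."""
    cycles = 0
    for bit in bitstream:
        chain.clock(bit)
        cycles += 1
    return cycles


if __name__ == "__main__":
    chain = ConfigChain(1803)  # flip-flop count reported for the core
    bitstream = [i % 2 for i in range(1803)]
    print(program(chain, bitstream))
```

Because every flip-flop in the chain is clocked on every cycle, the configuration clock must reach all of them, which is why the clock net dominates the routing in this style of core.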
D. Implementation Results
1) Area Overhead:
We implemented both the programmable and nonprogrammable versions of the module using the same tool flow to further quantify the area overhead. The reference module (without the programmable logic core) required 369 700 μm² in a 0.18-μm TSMC process, of which 1217 μm² is the area due to the assembly/disassembly controller next-state logic. The programmable module (containing 64 LUTs as described above) required 1 025 000 μm², of which 684 600 μm² was due to the programmable next-state logic.
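The overhead implied by these figures can be checked with a short calculation. The constant names below are hypothetical; the values are the four areas quoted above, all taken to be in the same unit.

```python
# Areas quoted in the text (same unit for all four values).
REFERENCE_TOTAL = 369_700
REFERENCE_NEXT_STATE = 1_217
PROGRAMMABLE_TOTAL = 1_025_000
PROGRAMMABLE_NEXT_STATE = 684_600


def area_ratio(programmable: float, reference: float) -> float:
    """How many times larger the programmable version is."""
    return programmable / reference


# Overhead of the programmable fabric relative to the fixed logic it replaces.
next_state_ratio = area_ratio(PROGRAMMABLE_NEXT_STATE, REFERENCE_NEXT_STATE)
# Overhead seen at the level of the whole module.
module_ratio = area_ratio(PROGRAMMABLE_TOTAL, REFERENCE_TOTAL)

if __name__ == "__main__":
    print(f"next-state logic: {next_state_ratio:.1f}x")
    print(f"whole module:     {module_ratio:.2f}x")
```

The two ratios differ widely because the replaced next-state logic is a very small part of the reference module, so a large relative overhead on that block translates into a much smaller overhead for the module as a whole.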
The layout areas are summarized in Table III. Clearly, the differences in these numbers are significant. Our synthesizable programmable logic core required 560× more chip area than the fixed logic that it replaced. From the analysis in Section V, the synthesizable core requires 6.4× more area than a hard programmable logic core. However, a hard core may not be suitable for such fine-grain applications: it would require the same integration considerations as any other hard IP, plus additional ones for programmability. For the size of fabric being used, the soft PLC would provide a more seamless approach.
TABLE III. AREA RESULTS SUMMARY
TABLE IV. SPEED RESULTS
Further investigation into the area overhead showed that 53% of the area of our programmable logic core was due to routing multiplexers and the configuration bits that control these multiplexers, as shown in Fig. 10(a). These multiplexers are large; the largest in our core has 26 inputs. Our standard cell library contains only two- and four-input multiplexer cells; larger multiplexers are built by cascading these smaller multiplexers. Clearly, the area overhead could be improved significantly by either supplementing our cell library with larger multiplexers, or modifying the architecture to employ smaller multiplexers.
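To give a feel for the cost of cascading, the sketch below counts the cells needed to build a wide multiplexer from two- and four-input cells. The greedy level-by-level decomposition and the function names are assumptions for illustration; a synthesis tool may choose a different structure, and the select-bit count assumes a binary-encoded select.

```python
def mux_tree_cells(num_inputs: int) -> dict:
    """Greedy decomposition of a wide mux into 4:1 and 2:1 mux cells."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be positive")
    counts = {"mux4": 0, "mux2": 0, "levels": 0}
    signals = num_inputs
    while signals > 1:
        groups_of_four, leftover = divmod(signals, 4)
        counts["mux4"] += groups_of_four
        next_signals = groups_of_four
        if leftover == 3:
            counts["mux4"] += 1   # 4:1 cell with one data input tied off
            next_signals += 1
        elif leftover == 2:
            counts["mux2"] += 1
            next_signals += 1
        elif leftover == 1:
            next_signals += 1     # single signal passes to the next level
        counts["levels"] += 1
        signals = next_signals
    return counts


def select_bits(num_inputs: int) -> int:
    """Select bits for a binary-encoded mux with num_inputs data inputs."""
    return (num_inputs - 1).bit_length()


if __name__ == "__main__":
    print(mux_tree_cells(26), select_bits(26))
```

For the 26-input case, this decomposition uses ten cells across three levels, and each select bit needs its own configuration flip-flop, which illustrates why routing multiplexers account for such a large share of the core area.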
2) Delay Overhead: We measured the speed of our reference and programmable modules before and after physical design. Table IV shows the post-physical-design results. In this case, we configured the core using the two user-defined logic functions mentioned above, and measured the length of the critical path through the logic circuit in each case. As the table shows, the programmable core has approximately twice the critical path delay of the reference design for both user-defined functions.
The module containing the programmable fabric was fabricated in 0.18-μm TSMC CMOS and tested using the same two user-defined logic functions. The speed results correlated well with the results shown above. The chip design had a critical path of about 40 ns compared to the expected 50 ns, well within the error tolerances of the models used in the CAD tools and the statistical variations of the CMOS process.
VII. CONCLUSION
In this paper, we have presented two new architectures for synthesizable programmable logic cores. Synthesizable programmable logic cores differ from the programmable cores currently available from vendors in that they are obtained as an HDL description and synthesized using standard synthesis tools. The use of these cores has significant area overhead; we have estimated an overhead of 6.4× compared to using "hard" programmable logic cores. Yet, for small logic circuits, these "soft" cores have a number of advantages: they are easy to integrate with fixed logic, we can create cores of any size and shape, and they are easy to migrate to a new technology.
One of the primary applications we envisage for these cores is the implementation of small combinational logic blocks, such as the next-state logic or output logic of state machines. As a result, our architectures differ from traditional FPGAs in that they support only combinational circuits, and are "directional" in that signals flow in only one direction through the fabric. In addition, the interconnect pattern is less flexible and the routing resources are less plentiful. We have performed experiments to show that small combinational circuits can be implemented on these cores efficiently.
This paper has also illustrated, through a proof-of-concept chip, some of the issues that arise when such a core is used: the choice of the size of the core, the choice of inputs and outputs, and the difficulty in routing the clock to the configuration flip-flops.
Better synthesis results could be obtained by adding specialized cells to the standard-cell library to implement our programmable logic fabric. We have not considered this in this paper, since our goal was to create architectures that can be implemented using the standard synthesis tools, cell libraries, and design flows that are already familiar to integrated circuit designers. However, initial experiments have shown that, by removing unnecessary features, we can create a replacement for our flip-flop standard cell that is 40% of the size of the standard cell version. Since the flip-flops account for 43% of the chip area of the entire fabric, we would expect significant savings if this cell were used to construct our fabric. We also expect that significant improvements can be obtained using custom-designed multiplexer standard cells. Clearly, if this design technique is to become mainstream, specialized standard cells should be created.
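A first-order estimate of the saving from the smaller flip-flop cell follows directly from the two percentages above. The sketch below assumes that only the flip-flop area changes and that all other area, including routing and the clock tree, is unaffected; that assumption and the function name are not from the experiments reported here.

```python
def fabric_area_saving(ff_area_share: float, replacement_size_ratio: float) -> float:
    """Fraction of total fabric area saved when the flip-flops, which occupy
    ff_area_share of the fabric, shrink to replacement_size_ratio of their size."""
    if not (0.0 <= ff_area_share <= 1.0 and 0.0 <= replacement_size_ratio <= 1.0):
        raise ValueError("both arguments must be between 0 and 1")
    return ff_area_share * (1.0 - replacement_size_ratio)


if __name__ == "__main__":
    saving = fabric_area_saving(0.43, 0.40)
    print(f"estimated saving: {saving:.1%} of fabric area")
```

With the quoted figures this gives roughly a quarter of the fabric area, which is consistent with the expectation of significant savings stated above.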
Although these soft cores are less efficient than their fixed counterparts, the use of programmable logic cores, and especially synthesizable programmable logic cores, is still important. The post-fabrication flexibility that these cores provide will be vital as integrated circuits get larger and as masks get more expensive. Synthesizable programmable logic cores are a sensible solution when only small amounts of programmable logic are required, since they can be treated much like regular logic during the design process. The results of this paper clearly show that there is still work to be done improving their area and speed, but as new architectures are uncovered, and new CAD techniques are developed, it is likely that both hard and soft cores will become an important part of future integrated circuits.
His research interests include field-programmable gate arrays and integrated-circuit CAD development.
Kimberly A. Bozman received the B.Eng. degree from the University of British Columbia, Canada, in 2002. After graduation, she joined the System on a Programmable Chip group at the University of British Columbia where she researched synthesizable programmable logic IP for System on a Chip design.
Since October 2002, she has been with Altera Corporation, Canada, where she is engaged in the development of commercial place and route tools. Her research interests include field programmable gate array (FPGA) architectures and CAD tools for FPGAs. 
