As integrated circuits become more complex, the ability to make post-fabrication changes will become more attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of a "hard" layout. An alternative approach is to use a "soft", or synthesizable, programmable logic core that can be synthesized using standard library cells. In this paper, we describe the design of an integrated circuit that incorporates such a synthesizable programmable logic core. We focus on implementation issues that arose; specifically, the choice of core size, the connection of the core to the rest of the integrated circuit, and clock tree synthesis. We also present area and delay overhead results.
Introduction
Recent years have seen impressive improvements in the achievable density of integrated circuits. Accompanying this improvement in technology is a significant increase in cost. The mask set for a single integrated circuit is approaching one million dollars, and this number will likely increase. Due to the increasing complexity, the design time of these integrated circuits is growing as well. This increased cost and design time is especially troublesome if a chip must be "re-spun" multiple times. No matter how careful an integrated circuit designer is, there will always be some chips that are designed, manufactured, and then deemed unsuitable. This may be due to design errors not detected by simulation or it may be due to a change in requirements.
One technique that partially alleviates this problem is to use one or more programmable logic cores within an integrated circuit. A programmable logic core is a flexible logic fabric that can be customized to implement any digital circuit after fabrication [1-6]. Before fabrication, the designer embeds a programmable fabric (consisting of many uncommitted gates and programmable interconnects between the gates). After fabrication, the designer can then program these gates and the connections between them. This technique is attractive for a number of reasons. First, some design details can be left until late in the design cycle. In a communications application, for example, the development of a chip can proceed while standards are being finalized. Once the standards are set, they can be incorporated into the programmable portion of the chip. A second reason is that as products are upgraded, or as standards change, it may be possible to incorporate these changes using the programmable part of the chip, without fabricating an entirely new device. Finally, the use of a programmable logic core may make it possible to fabricate a single chip for an entire family of devices. The characteristics that differentiate each member of the family can be implemented using the programmable logic. This would, in effect, amortize the cost of developing the ASIC over several products. Several integrated circuits containing programmable logic cores have been described [7, 8].
Despite these compelling advantages, the use of programmable logic cores has not become mainstream. In fact, many companies that develop these cores have either changed focus or gone out of business. There are a number of reasons for this. One reason is that designers often find it difficult to identify a subsystem that can be implemented in programmable logic. A second reason is that embedding a core with an unknown function makes timing, power distribution, and verification difficult. This extra design complexity limits the use of programmable logic cores to only the very best VLSI designers.
In [9], an alternative technique is described which speaks to the second of these concerns. In this technique, core vendors supply a synthesizable version of their programmable logic core (a "soft" core) and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. Although this technique suffers increased speed, density, and power overhead, the task of integrating such cores is far easier than the task of integrating "hard" cores into a fixed-function chip. For very small amounts of logic, this ease of use may be more important than the increased overhead.
In this paper, we describe an integrated circuit implemented using the "synthesizable embedded core" technique from [9]. A small network interface circuit was divided into a programmable portion and a fixed portion. The fixed portion was implemented using standard cells, and the programmable portion was implemented using a synthesizable programmable logic core. The primary purpose of this paper is to illustrate some of the implementation issues that emerge when such a core is embedded onto a fixed chip. This is important; although there have been various publications describing integrated circuits containing embedded programmable logic cores (usually "hard" cores), in this paper we specifically focus on implementation issues such as the choice of core size, the connection of the core to the rest of the design, and clock tree synthesis. We also present measured results that quantify the area and speed overhead that we observed.
Chip Architecture
To investigate the applicability of our "synthesizable embedded core", we have chosen a parallel network interface (PNI) module. This module acts as a bridge between a test access mechanism (TAM) circuit [10] and an IP core under test. The module allows the TAM and the IP core to run at different frequencies; this results in higher TAM throughputs. A chip designed using a System-on-Chip (SoC) design flow might contain one of these modules for each intellectual property (IP) core on the chip. Figure 1 shows a block diagram of the module. The module consists of a buffer memory, a packet assembly/disassembly block, and two state machines. Test packets from the TAM circuit are optionally buffered before being converted to a form usable by an IP core under test.
A. Baseline Architecture

B. Enhanced Architecture
A key component in Figure 1 is the Packet Assembly/Disassembly block, which controls the assembly and disassembly of test packets. The heart of this block is a small state machine (labeled "Assembly/Disassembly Control" in Figure 1). We have identified the next-state logic in this state machine as a part of the chip that would benefit from programmability. When debugging the integrated circuit, it may be desirable to develop new test patterns and strategies that would not have been foreseeable when the chip was designed.
These strategies may require new assembly/disassembly schemes; if the next-state logic of the state machine is programmable, these schemes can be modified during the testing of the integrated circuit.
Because this next state logic is small, however, a hard programmable logic core would not be an efficient solution.
Instead, the "synthesizable embedded core" is more suitable. Thus, we have chosen this module as our proof-of-concept vehicle, and have identified the next state logic within the Assembler Control as our programmable component.
Implementation Issues
We designed two versions of our module: (1) the baseline architecture with no configurability, and (2) the enhanced architecture, in which the assembly/disassembly control next state logic has been removed and replaced with a synthesizable programmable logic fabric. When adding the programmable component to our module, a number of important issues arose. This section summarizes these issues.
A. Programmable Logic Core Size
The first issue was how much programmable logic is needed to replace the fixed next state logic. Without knowing the logic function that will eventually be implemented in the core, it is difficult to estimate the amount of programmable logic required. We designed two potential logic functions that might be implemented in the core, and measured the size of the core that would be required to implement each function (using custom-built CAD tools described in [9]). For our circuit, we found that a core consisting of forty-nine 3-input LUTs would be sufficient for both potential logic functions; however, to allow some safety margin in anticipation of larger functions, a core consisting of sixty-four 3-input LUTs was used.
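The margin implied by this choice can be made explicit with a little arithmetic. The sketch below is only an illustration: the function name `lut_headroom` is hypothetical, and the only figures taken from the text are the 49 LUTs required and the 64 LUTs provided.

```python
def lut_headroom(required_luts: int, provided_luts: int) -> tuple[int, float]:
    """Return (spare LUTs, utilization) for a core with provided_luts LUTs.

    Hypothetical helper for illustration; not part of the CAD flow in [9].
    """
    if provided_luts < required_luts:
        raise ValueError("core is too small for the mapped function")
    spare = provided_luts - required_luts
    utilization = required_luts / provided_luts
    return spare, utilization


# Figures from the text: 49 LUTs needed, 64 LUTs provided.
spare, utilization = lut_headroom(required_luts=49, provided_luts=64)
print(f"spare LUTs: {spare}, utilization: {utilization:.1%}")
```

With these numbers the core has 15 spare LUTs, so the larger of the two candidate functions fills roughly three quarters of the fabric.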
B. Connections between the Core and the Fixed Logic
A second issue is how the programmable logic core is connected to the rest of the module. Clearly, even though the core itself is programmable, the inputs and outputs that are connected to the core will dictate which functions are possible to implement in the core.
Unlike an embedded memory (which could also be used to provide a programmable implementation of the next state logic), the programmable logic core described in [9] is generous in the number of inputs. Furthermore, the size of the core increases only linearly as the number of inputs increases (as opposed to the exponential increase in the size of an embedded memory). In our design, our two potential logic functions required 9 inputs and 10 inputs respectively, and required 11 outputs and 12 outputs respectively. We maintained flexibility by hardwiring 10 inputs and 13 outputs to our core.
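To see why an embedded memory scales poorly with input count, consider the number of storage bits a memory-based implementation would need. The sketch below assumes the simplest possible organization, a single lookup memory addressed by all inputs with one word bit per output; the function name `memory_bits` is hypothetical and the paper does not describe a specific memory design.

```python
def memory_bits(num_inputs: int, num_outputs: int) -> int:
    """Storage bits for a lookup memory that can implement any function
    of num_inputs inputs and num_outputs outputs (assumed organization:
    2**num_inputs words, num_outputs bits per word)."""
    return (2 ** num_inputs) * num_outputs


# Interface chosen in the text: 10 inputs, 13 outputs.
bits_10_inputs = memory_bits(10, 13)
# Adding a single input doubles the memory.
bits_11_inputs = memory_bits(11, 13)
print(bits_10_inputs, bits_11_inputs)
```

Under this assumption the 10-input, 13-output interface needs 13312 bits, and every extra input doubles that figure, which is the exponential growth referred to above.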
C. Clock Tree
During physical design, it became clear that our synthesizable core places extra stress on the clock tree router. A programmable logic core contains many configuration bits to store the state of individual routing switches and the contents of lookup-tables; in a synthesizable core, these configuration bits are built using flip-flops. Our core contains 1803 such flip-flops, each connected to a common clock. To illustrate how flip-flop-intensive our core is, we can compare its flip-flop density to that of a non-programmable ASIC. We analyzed an implementation of a 68HC11 ASIC, and found that the flip-flop density (number of flip-flops per unit area) was 37% of the flip-flop density in our programmable logic core. Thus, we would expect that the clock tree in our core will be more complex and consume more chip area than that of a typical ASIC. This was confirmed; in our implementation, we had to reserve 45% of the chip area for the clock tree, power striping, and signal routing (experience with other ASICs of this size has shown that 25% is usually enough). The complexity of the resulting clock tree is shown in Figure 2. The clock net highlighted in white is the configuration clock; this routing is clearly more complex than the other regular clock nets (shown in grey).
This extra clock complexity increases the area overhead of the design, beyond what would be estimated by just considering the increase in cell area. In our case, this is especially a concern, since the next state logic that the core replaces is combinational, and thus needs no clock tree at all.
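A rough breakdown of the configuration flip-flops helps explain where the clock load comes from. The sketch below rests on two assumptions that the text does not state explicitly: that all 1803 flip-flops are configuration bits, and that each 3-input LUT stores exactly 2^3 = 8 content bits and nothing else. The remainder is then attributed to routing switches.

```python
NUM_LUTS = 64               # from the text
LUT_INPUTS = 3              # from the text
TOTAL_CONFIG_FLOPS = 1803   # from the text

# Assumption: one flip-flop per truth-table entry of each LUT.
lut_content_bits = NUM_LUTS * 2 ** LUT_INPUTS

# Assumption: every remaining flip-flop configures a routing switch.
routing_bits = TOTAL_CONFIG_FLOPS - lut_content_bits
routing_share = routing_bits / TOTAL_CONFIG_FLOPS

# The 68HC11 ASIC had 37% of the core's flip-flop density, so the core
# is denser in flip-flops by the reciprocal of that figure.
density_ratio = 1 / 0.37

print(lut_content_bits, routing_bits)
print(f"routing share: {routing_share:.1%}, density ratio: {density_ratio:.2f}x")
```

Under these assumptions 512 flip-flops hold LUT contents and 1291 (about 72%) control routing, and the core carries roughly 2.7 times as many flip-flops per unit area as the reference ASIC.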
Implementation Results
This section provides an overview of the area and speed overhead incurred by our programmable logic core.
A. Area Overhead of Soft-PLC
The area overhead incurred by our programmable logic core is significant. Our implementation of the baseline module (without the programmable logic core) required 369700 µm² in a 0.18 µm TSMC process, of which 1217 µm² is the area due to the assembly/disassembly controller next state logic. The implementation of our module in which this next state logic was replaced with a programmable logic core (containing sixty-four 3-LUTs as described above) required 1025000 µm², of which 684600 µm² was the programmable logic core.
Clearly, the differences in these numbers are significant. Our synthesizable programmable logic core required 560x more chip area than the fixed logic that it replaced. Using estimates from [9], the synthesizable core requires 6.4x more area than a hard programmable logic core. These numbers are summarized in Table 1.
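The ratios quoted here follow directly from the measured areas. The short calculation below reproduces them; the variable names are illustrative, and the hard-core area it prints is only the value implied by the 6.4x estimate, not a figure reported in the text.

```python
# Measured areas from the text, in square micrometres.
BASELINE_MODULE = 369_700
BASELINE_NEXT_STATE = 1_217
ENHANCED_MODULE = 1_025_000
SOFT_CORE = 684_600

core_vs_fixed = SOFT_CORE / BASELINE_NEXT_STATE   # soft core vs. fixed logic
module_growth = ENHANCED_MODULE / BASELINE_MODULE  # whole-module growth
core_share = SOFT_CORE / ENHANCED_MODULE           # core as fraction of module

# Derived, not reported: hard-core area implied by the 6.4x estimate.
implied_hard_core = SOFT_CORE / 6.4

print(f"core vs. fixed next-state logic: {core_vs_fixed:.0f}x")
print(f"module growth: {module_growth:.2f}x, core share: {core_share:.1%}")
print(f"implied hard-core area: {implied_hard_core:.0f} um^2")
```

The first ratio comes out at about 563, consistent with the rounded 560x figure; the enhanced module as a whole is about 2.8 times the size of the baseline, with the core accounting for roughly two thirds of it.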
Further investigation into the area overhead showed that 53% of the area of our programmable logic core was due to routing multiplexers (as described in [9]), and the configuration bits that control these multiplexers. These multiplexers are large; the largest in our core has 26 inputs. Our standard cell library contains only two- and four-input multiplexer cells; larger multiplexers are built by cascading these smaller multiplexers. Clearly, the area overhead could be improved significantly by either supplementing our cell library with larger multiplexers, or modifying the architecture to employ smaller multiplexers.
We are currently investigating these issues.
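To give a feel for the cost of cascading, the sketch below counts the cells in one plausible decomposition of a wide multiplexer into 4:1 and 2:1 cells. The greedy grouping strategy and the name `mux_tree_cells` are assumptions made for illustration; the actual structure produced by the synthesis tool may differ.

```python
def mux_tree_cells(num_inputs: int) -> dict[str, int]:
    """Count 4:1 and 2:1 mux cells in a greedy tree selecting one of
    num_inputs signals. Hypothetical decomposition, for illustration only."""
    if num_inputs < 1:
        raise ValueError("a multiplexer needs at least one input")
    counts = {"mux4": 0, "mux2": 0, "levels": 0}
    signals = num_inputs
    while signals > 1:
        full_groups, leftover = divmod(signals, 4)
        counts["mux4"] += full_groups
        next_signals = full_groups
        if leftover == 1:
            next_signals += 1          # lone signal passes to the next level
        elif leftover == 2:
            counts["mux2"] += 1        # pair handled by a 2:1 cell
            next_signals += 1
        elif leftover == 3:
            counts["mux4"] += 1        # 4:1 cell with one data input unused
            next_signals += 1
        signals = next_signals
        counts["levels"] += 1
    return counts


# Largest routing multiplexer reported in the text: 26 inputs.
largest = mux_tree_cells(26)
print(largest)
```

For 26 inputs this decomposition uses eight 4:1 cells and two 2:1 cells arranged in three levels, which illustrates why a library cell for wider multiplexers, or an architecture with narrower ones, could reduce area.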
B. Delay Overhead
We measured the speed of our baseline and enhanced circuitry before and after physical design. During synthesis, the target clock speed was reduced, and the resulting slack measured. Figure 3 shows these pre-physical design results. A negative slack means that the synthesizer was unable to find an implementation meeting the target clock speed. As the graph shows, a clock period of 20 ns was achievable in the baseline architecture, while a clock period of 50 ns was achievable in the enhanced architecture. The design provides inputs to the programmable logic core on the falling edge of the clock and samples the outputs on the rising edge; thus, the delay of the core is half of the clock period. It is important to note that during synthesis, the function that will eventually be implemented on the core is unknown.
Thus, the critical path used for optimization during synthesis is obtained by finding the worst case delay through all potential paths through the core. Table 2 shows the post-physical design results. In this case, we configured the core using the two potential logic functions identified in Section 3(a), and measured the maximum clock speed in each case. As the table shows, the results indicate that the enhanced architecture runs approximately half as fast as the baseline architecture, for both potential functions.
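The pre-physical-design clock periods quoted above translate into frequencies and a core delay budget as follows. This is a minimal sketch using only the 20 ns and 50 ns periods from the text; the function names are hypothetical, and the half-period budget reflects the falling-edge launch and rising-edge capture scheme described earlier.

```python
def max_frequency_mhz(period_ns: float) -> float:
    """Clock frequency in MHz corresponding to a clock period in ns."""
    return 1000.0 / period_ns


def core_delay_budget_ns(period_ns: float) -> float:
    """Time available to the core when its inputs launch on the falling
    edge and its outputs are sampled on the next rising edge."""
    return period_ns / 2.0


BASELINE_PERIOD_NS = 20.0   # from the text (pre-physical design)
ENHANCED_PERIOD_NS = 50.0   # from the text (pre-physical design)

baseline_mhz = max_frequency_mhz(BASELINE_PERIOD_NS)
enhanced_mhz = max_frequency_mhz(ENHANCED_PERIOD_NS)
core_budget_ns = core_delay_budget_ns(ENHANCED_PERIOD_NS)
slowdown = ENHANCED_PERIOD_NS / BASELINE_PERIOD_NS

print(baseline_mhz, enhanced_mhz, core_budget_ns, slowdown)
```

Before physical design, the enhanced architecture therefore runs at 20 MHz against 50 MHz for the baseline (a 2.5x slowdown), with 25 ns available for the worst-case path through the core.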
Table 2: Speed Results (table body not recoverable; one surviving entry reads 51.08 ns)
Conclusions
This paper has described the implementation of an integrated circuit containing a synthesizable programmable logic core, and in doing so, has illustrated some of the issues that arise when such a core is used. One issue involved the size of the programmable logic core selected. If a core is too small, it will be unable to implement logic functions that may be required in the future. In our integrated circuit, although we estimated that forty-nine 3-LUTs would be enough to implement a variety of next state functions, we chose a sixty-four 3-LUT architecture. Similarly, we chose a core with more inputs and outputs than we expected we would need. These two issues apply to both "hard" programmable logic cores and synthesizable logic cores. The third issue, the difficulty in routing the clock tree, applies only to synthesizable programmable logic cores. This difficulty in routing the clock tree results in a lower overall density than would have been predicted from [9].
Synthesizable programmable logic cores are attractive when only small amounts of programmable logic are required, since they can be treated much like regular logic during the design process. The results of this paper clearly show that there is still work to be done improving their area and speed efficiency, but as new architectures are uncovered, and new CAD techniques are developed, it seems likely that these synthesizable cores will become an important part of many future integrated circuits.
