As integrated circuits become increasingly complex, the ability to make post-fabrication changes will become more important and attractive. This ability can be realized using programmable logic cores. Currently, such cores are available from vendors in the form of a "hard" layout.
Introduction
Recent years have seen an impressive improvement in the achievable density of integrated circuits. In order to utilize this excess capacity, while maintaining reasonable design costs, the System-on-a-Chip (SoC) design methodology has emerged. In this methodology, pre-designed and pre-verified blocks, often called cores, are obtained from internal sources or third parties, and combined onto a single chip. Although this technique partially alleviates some of the complexity, the design and test of a correctly functioning integrated circuit is still a difficult task.
One way to ease this task is to use embedded programmable logic cores. A programmable logic core is a flexible logic fabric that can be customized to implement any digital circuit after fabrication [1-5].
Before fabrication, the designer embeds a programmable fabric (consisting of many uncommitted gates and programmable interconnects between the gates) onto an SoC. After fabrication, the designer can then program these gates and the connections between them. This technique is attractive for a number of reasons. In some cases, it may be possible to leave some design details until late in the design cycle. In a communications application, for example, the development of a chip can proceed while standards are being finalized. Once the standards are set, they can be incorporated into the programmable portion of the chip. A second reason is that, as products are upgraded or standards change, it may be possible to incorporate these changes using the programmable part of the chip, without fabricating an entirely new device. Finally, the use of a programmable logic core may make it possible to fabricate a single chip for an entire family of devices. The characteristics that differentiate each member of the family can be implemented using the programmable logic. This would amortize the cost of developing the ASIC over several products. Several integrated circuits containing programmable logic cores have been described [6-8].
Despite these compelling advantages, the use of programmable logic cores has not become mainstream. In fact, many companies that develop these cores have either changed focus or gone out of business. There are a number of reasons for this. One reason is that designers often find it difficult to identify a subsystem that can be implemented in programmable logic. A second reason is that embedding a core with an unknown function makes timing, power distribution, and verification difficult. Lastly, embedded programmable logic cores must address physical connection and placement issues. This is difficult when the regions of fixed and programmable logic are tightly coupled together and/or when there are a number of small programmable pieces distributed over the entire SoC. This extra design complexity limits the use of programmable logic cores to only the very best VLSI designers.
In [9], an alternative technique is described which addresses the last two concerns. In this technique, core vendors supply a synthesizable version of their programmable logic core (a "soft" core) and the integrated circuit designer synthesizes the programmable logic fabric using standard cells. A "soft core" is one in which the designer obtains a description of the behaviour of the core, written in a hardware description language. Note that this is distinct from the behaviour of the circuit to be implemented in the core, which is determined after fabrication. Here, we are referring to the behaviour of the programmable logic core itself.
Since the designer receives only a description of the behaviour of the core, he or she must use synthesis tools to map the behaviour to gates. These synthesis tools can be the same ones that are used to synthesize the fixed (ASIC) portions of the chip. The primary advantage of the new method is that existing ASIC tools can be used to implement the chip. No modifications to the tools are required, and the flow follows a standard integrated circuit design flow that designers are familiar with. This will significantly reduce the design time of chips containing these cores. A second advantage is that this technique allows small blocks of programmable logic to be positioned very close to the fixed logic that connects to the programmable logic. The use of a "hard core", by contrast, requires that all the programmable logic be grouped into a small number of relatively large blocks. A third advantage is that the new technique allows a user to customize the programmable logic core to support his or her needs precisely. This is because the description of the behaviour of the programmable logic core is a text file which can be edited and understood by the user. Finally, it is easy to migrate the circuit to new technologies; new programmable logic cores from the core vendors are not required.
The primary disadvantage of the proposed technique is that the area, power, and speed overhead will be significantly increased, compared to implementing programmable logic using a hard core. Thus, for large amounts of circuitry, this technique would not be suitable. It only makes sense if the amount of programmable logic required is small. An envisaged application might be the next state logic in a state machine.
In this paper, we present a new family of architectures for a synthesizable embedded programmable logic core (PLC). Compared to the architecture in [9], the new architecture is significantly more area and speed efficient. Unlike the architectures in [9], which are based on lookup-tables (LUT's), our new family of architectures is based on a collection of product-term array blocks. It is well known that product-term array blocks can result in density and speed improvements for small circuits [10]; in this paper, we show that the small combinational circuits envisaged for these synthesizable cores are very suitable for product-term based architectures. In addition, this paper shows that the nature of synthesizers and synthesized circuits places unique demands on a product-term based architecture.
This paper is organized as follows. Section 2 describes the new architecture family, and shows how it differs from standard commercial product-term based architectures. Section 3 then describes the experimental results aimed at optimizing several architectural parameters. Finally, Section 4 compares the new architecture to the LUT-based architecture from [9].
Synthesizable Product-Term Based Architecture Family
In this section, we describe our family of product-term based architectures. Each member of the family is composed of one or more product-term based blocks (PTB's) connected using a novel interconnect architecture.
The number of PTB's, as well as the number of input and output pins, vary between members of the family. Product-term based blocks (PTB's) are essentially circuits that can implement any Boolean function in sum-of-products form. A PTB consists of two planes: the AND plane and the OR plane. The AND plane is a product term generator; it is used to create product terms that can be fed into the OR plane. The OR plane is used to "sum" the product terms to create the desired Boolean function. Clearly, the size of a PTB can be scaled by altering the number of primary inputs, the number of product terms (the number of AND gates in the AND array) and the number of outputs (which is often equal to the number of OR gates in the OR array). In this paper, we will refer to the size of a PTB using the tuple (i,p,o) where i is the number of inputs, p is the number of product terms (p-terms), and o is the number of outputs. Unlike commercial PTB based architectures, our PTB's do not contain registers on their outputs. As in [9], we are targeting small combinational circuits (such as the next state logic in a state machine). Registers can be attached to the periphery of the core if desired. We focus on PLA-type logic cores (which we call PTB's) instead of PAL-type logic cores because their flexibility allows for more efficient logic implementation and because of available PLA-based technology mapping algorithms.
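The two-plane structure can be illustrated with a small behavioural model. The following is only a sketch under assumed conventions: the function name `eval_ptb` and the configuration encoding (1, 0, or None per AND-plane crosspoint, a boolean per OR-plane crosspoint) are hypothetical and are not taken from the synthesizable description used in this work.

```python
from typing import List, Optional, Sequence

def eval_ptb(and_plane: Sequence[Sequence[Optional[int]]],
             or_plane: Sequence[Sequence[bool]],
             inputs: Sequence[int]) -> List[bool]:
    """Evaluate a PTB of size (i, p, o).

    and_plane has p rows of i entries (assumed encoding): 1 uses the true
    literal, 0 uses the complemented literal, None leaves the input unused.
    or_plane has o rows of p booleans selecting which p-terms are summed.
    """
    # AND plane: a p-term is true when every literal it uses is satisfied.
    # A term with no literals evaluates to constant true, so unused terms
    # should simply not be selected in the OR plane.
    products = [
        all(lit is None or bool(x) == bool(lit) for lit, x in zip(term, inputs))
        for term in and_plane
    ]
    # OR plane: each output is the OR of its selected p-terms.
    return [any(sel and prod for sel, prod in zip(row, products))
            for row in or_plane]

# A (2,2,1) PTB configured as XOR: a AND (NOT b), OR, (NOT a) AND b.
xor_and = [[1, 0], [0, 1]]
xor_or = [[True, True]]
print(eval_ptb(xor_and, xor_or, [0, 1]))
```

In a synthesized fabric each entry of `and_plane` and `or_plane` would correspond to stored configuration state rather than a Python list; the model only shows how the two planes compose.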
A programmable logic core can be implemented using a single PTB. The behaviour of the single PTB can be written in a hardware description language, synthesized, and combined with the rest of the integrated circuit. This works well for small cores, but as the number of inputs, product terms, and outputs grow, the size of the synthesized fabric will grow.
The number of programming bits (which would be implemented using flip-flops in the synthesized fabric) is proportional to the sum of the product of the number of inputs and product terms, and the product of the number of product terms and outputs. For large cores, this becomes unwieldy. Because of this, large product-term based devices usually contain a collection of smaller PTB's connected using a very flexible interconnect switch matrix.
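As a rough illustration of this scaling, the quantity i*p + p*o can be computed directly. This is a sketch only: the function name is hypothetical, and the constant of proportionality is assumed to be 1, whereas a real AND plane may need more than one bit per crosspoint (for example, to choose between the true literal, the complemented literal, and no connection).

```python
def programming_bit_estimate(i: int, p: int, o: int) -> int:
    """Return i*p + p*o, the quantity the bit count is proportional to.

    Assumes a proportionality constant of 1 (one bit per crosspoint).
    """
    and_plane_crosspoints = i * p   # inputs x product terms
    or_plane_crosspoints = p * o    # product terms x outputs
    return and_plane_crosspoints + or_plane_crosspoints

# The two PTB sizes considered later in the paper.
for size in [(12, 9, 3), (12, 18, 3)]:
    print(size, programming_bit_estimate(*size))
```

Under this assumption, doubling p from 9 to 18 doubles the estimate from 135 to 270, since both terms of the sum scale with p.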
Interconnect Architecture
Unfortunately, the single global interconnect switch matrix found in commercial devices is not appropriate for a synthesizable architecture. This is because it allows PTB outputs to be connected to any other PTB input (or a large fraction of them). This can create combinational loops in the unprogrammed fabric (for example, if the output of a PTB is connected to the input of the same PTB). These combinational loops are normally not a problem, since it is up to the user to configure the fabric in such a way that these combinational loops do not occur. In our case, however, we wish to synthesize the fabric itself using standard synthesis tools. Standard synthesis tools have problems synthesizing circuits with combinational loops. Thus, we need a fabric without combinational loops.
Figure 1 shows our novel routing architecture that can be used to connect PTB's. We have considered two interconnect strategies. In the first strategy, in Figure 1(a), the PTB's are arranged in a rectangular grid, while in the second strategy, in Figure 1(b), the PTB's are arranged in a triangular shape. In both cases, each core contains PTB's arranged in several levels. The outputs of PTB's in one level can be connected to the inputs in all subsequent levels (levels to the right) but cannot drive any PTB's in the same level, or preceding levels (levels to the left). This results in a directional architecture, and eliminates the possibility of combinational loops, as described above.
The size of a routing multiplexor is calculated by summing the number of primary inputs and the number of signal outputs from the PTB's in the preceding levels, and subtracting the number of inputs in a PTB block. In this calculation, C_s denotes the number of PLA blocks at level s.
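The directional rule and the multiplexor-size description can be sketched as follows. This is an illustrative reading of the prose, not the exact formula used in this work: all function names are hypothetical, `mux_sizes` applies the verbal description literally (candidate signals minus the number of PTB inputs), and the example core (20 primary inputs, 4, 2, and 1 PTB's in successive levels) is made up.

```python
from typing import List, Sequence

def can_drive(src_level: int, dst_level: int) -> bool:
    """A PTB may only drive PTB's in strictly later levels, so no
    connection can ever close a combinational loop."""
    return src_level < dst_level

def candidate_signals(primary_inputs: int,
                      ptbs_per_level: Sequence[int],
                      ptb_outputs: int) -> List[int]:
    """Signals visible to each level: the primary inputs plus every
    output of the PTB's in all preceding levels."""
    counts = []
    preceding_outputs = 0
    for count in ptbs_per_level:
        counts.append(primary_inputs + preceding_outputs)
        preceding_outputs += count * ptb_outputs
    return counts

def mux_sizes(primary_inputs: int, ptbs_per_level: Sequence[int],
              ptb_inputs: int, ptb_outputs: int) -> List[int]:
    """Literal reading of the text (an assumption): candidate signals
    minus the number of inputs in a PTB block."""
    return [n - ptb_inputs
            for n in candidate_signals(primary_inputs, ptbs_per_level, ptb_outputs)]

# Hypothetical core built from (12, 9, 3) PTB's.
levels = [4, 2, 1]
print(candidate_signals(20, levels, 3))
print(mux_sizes(20, levels, 12, 3))
```

Because `can_drive` admits only left-to-right connections, any configuration of the fabric is acyclic by construction, which is the property the synthesis tools require. The candidate count grows with depth, which is why the multiplexors are largest in the last level.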
Allowing a "full-connect" fabric may not seem to be very area efficient, especially since the multiplexors grow in size with the depth of the core. Because of this, [9] suggested a sparsely populated routing fabric used in their LUT-based architecture. However, in our case, the number of PTB blocks and the depth of the fabric are comparatively small, meaning the size of the multiplexors does not become unwieldy. An advantage of a fully connected architecture is that the placement and routing tasks become trivial.
Architectural Parameters
There are two classes of parameters we use to describe a specific design within our architectural family: high-level parameters and low-level parameters. Consider a VLSI designer who wishes to employ one of our cores. The designer would have a rough idea of how much logic should fit in the core (and hence, the size of the core in terms of number of LUT's or PTB's) as well as the number of inputs and outputs of the core. The designer would use these quantities, which we refer to as high-level parameters, to choose a specific core from a library or a core generator. The high-level parameters are summarized in Table 1.
Low-level parameters, on the other hand, would not normally be specified by the VLSI designer. These parameters describe the details of the core layout (size of each product-term block, details of the interconnect between the blocks, etc.). The designer of the core library itself (as opposed to the VLSI designer who uses the library) would like to use optimum values for these parameters in the design of each core in the library. The low-level parameters for both rectangular and triangular cores are listed in Table 2; Section 3 will seek optimum values for these parameters.
Low-Level Parameter Optimization
In this section, we seek to find optimum values of the low-level parameters in Table 2. We do not attempt to find optimum values for the high-level parameters in Table 1, since these are parameters that would usually be determined by the VLSI designer depending on the applications expected to be mapped to the programmable logic core. Although we have not investigated all possible combinations of values for the parameters in Table 2, we have varied three of the key parameters (p, r, and a, to be explained in the following sections), and have identified their impact on the area and delay of the resulting programmable logic core. In addition, this section will answer the question of whether a rectangular or triangular core is more area and delay efficient.
Number of Product Terms per PTB:
We used combinational MCNC benchmark circuits as the basis of our experiments. Their sizes range from 10 to 300 equivalent 4-LUT's with primary inputs and outputs ranging from 4 to 200 and 1 to 70, respectively. To limit our optimization space, we initially fixed the number of PTB inputs, i, to 12 and the number of outputs, o, to 3; these values are in line with results from previous work [10, 14]. We chose to optimize the number of product terms first, rather than the number of inputs or outputs in a PTB, because in our experiments product terms accounted for a large amount of total PTB area and were found to greatly influence the utilization efficiency of the PTB blocks [14]. After obtaining the optimal number of PTB product terms, we focused on finding the optimal values of PTB inputs and outputs. Experimentally, we confirmed that a PTB with 12 inputs and 3 outputs is a good choice.
To find the optimal number of PTB product terms, p, we did an experimental parameter sweep ranging from 6 to 21 terms. We mapped each benchmark circuit to PTB's using PLAmap [11], and chose the minimum triangular core size (with a=0.5) into which the circuit would fit. A behavioural description of each core was generated, and synthesized using Synopsys Design Compiler with the Virtual Silicon 0.18 µm library.
For each value of p, we measured the area and delay after synthesis, and plotted the geometric average of the area*delay product in Figure 3(a). The best value for p is 9. However, we have observed that small circuits tend to prefer a smaller p, while larger circuits tend to prefer a larger p. We repeated our experiments, but partitioned the benchmark circuits into two sets: one set for small circuits (less than 50 equivalent 4-LUT's) and the other set for larger circuits. The results are shown in Figures 3(b) and 3(c). We see small circuits prefer p=9, with p=12 or 15 resulting in a relative difference of 12-13%. Breaking down the graph into area and delay components (not shown), p=12 or 15 results in a 23% degradation in area compared to 9 product terms. For larger circuits, we see p=18 or 15 provides the best area and delay results. When p=9, we get an 11% increase in area-delay product. As a result, we propose that cores aimed at small circuits use PTB's with p=9, and cores aimed at larger circuits use PTB's with p=18. This combination results in a 5% improvement in area-delay product over what would be obtained by just setting p to 9 for all cores.
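The figure of merit used in this sweep, the geometric average of the area*delay product over a set of benchmark circuits, can be computed as in the sketch below. The function name and the two (area, delay) pairs are hypothetical placeholders chosen for easy arithmetic; they are not measured results.

```python
import math
from typing import Sequence, Tuple

def geomean_area_delay(results: Sequence[Tuple[float, float]]) -> float:
    """Geometric mean of area*delay over (area, delay) pairs.

    Computed in the log domain so that a large benchmark set does not
    overflow the running product.
    """
    if not results:
        raise ValueError("at least one benchmark result is required")
    log_sum = sum(math.log(area * delay) for area, delay in results)
    return math.exp(log_sum / len(results))

# Hypothetical numbers for two circuits: products 16 and 4.
example = [(2.0, 8.0), (4.0, 1.0)]
print(geomean_area_delay(example))
```

A geometric rather than arithmetic average keeps one very large benchmark from dominating the comparison between values of p, since each circuit contributes its relative change.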
Value of r for rectangular cores:
To find the optimum value of r (the ratio of the number of levels to the number of PTB's in each level) for rectangular cores, we used a similar procedure. The same benchmark circuits were used, and PTB's of either (12,9,3) or (12,18,3) were used (depending on the core size, as described above). Figure 4 shows the impact of this parameter on area, circuit depth, and area*depth averaged over our benchmark circuits (again, geometric average is used). We have used depth instead of estimated delay in these circuits because, with a fixed-size PTB, these quantities are well correlated. As the graph shows, as r increases, the number of levels in the core increases, leading to a longer delay. On the other hand, the area decreases as r increases. A shallow core (small r) places more constraints on the placement of logic, since more stringent precedence relationships must be obeyed (recall, each PTB can drive PTB's in subsequent levels only).
Overall, a value of r=0.4 gives the best area*depth result.
... levels. For y>3, there are x levels, where x is the smallest value for which n_x is less than or equal to 3. Figure 5 shows the impact of a on area, depth, and area*depth of our benchmark circuits. As the graphs show, a value of a=0.5 is a good choice. We see that both delay and area increase for a larger than 0.5. A large a implies a larger depth and more PTB's, thereby increasing both delay and area. When a is less than 0.5, mapping depth decreases, but area increases. This is because there are not enough levels to map the benchmark circuits for small a, so the number of PTB's in the first level must increase to provide the required mapping depth.
Comparison of triangular and rectangular cores:
Comparing the results from Figures 4 and 5, we can see that the best triangular core leads to 23% smaller area but a 5.2% larger depth (and hence delay) than a rectangular core, on average. The best area*delay product is 19% lower in a triangular core than in a rectangular core.
Thus, we use triangular cores with a=0.5 in the remainder of this paper.
Comparison to LUT-based Architectures
In this section, we compare the area and delay efficiency of our synthesizable core with that of the LUT-based architecture in [9]. Based on the results of Section 3, we used a triangular core with a=0.5 and PTB inputs i=12, product terms p=9 or 18, and outputs o=3.
The results are shown in Table 3. Overall, the PTB-based architecture is 35% smaller and 72% faster than the LUT-based architecture. In the architecture of [9], most of the area and delay was due to large routing multiplexors. In our architecture, since we have larger, and hence fewer, product-term blocks, the routing fabric is simpler and faster. This result is explained in Figure 6. Figure 6 shows area and delay as a function of the number of logic blocks (LUT's or PTB's) in the first level. The results show that the area of the LUT-based architecture increases at a faster rate than the area of the PTB-based architecture. This is because there are more logic blocks in the LUT architecture than PTB's in the product-term architecture for a given circuit. Fewer logic blocks means fewer routing multiplexors are required, and thus, area increases more slowly. We see a similar trend with delay; because the product-term architecture uses larger blocks, fewer logic levels are required compared to the LUT-based architecture.
From the results in Table 3, we observe that larger circuits tend to result in higher area improvements. This is because the LUT-based architecture in [9] contains large routing multiplexors; the sizes of these multiplexors grow as the core increases. Although our multiplexors also grow as the core size increases, our multiplexors form a far smaller portion of the overall fabric than the corresponding multiplexors in the LUT-based architecture.
Table 4: Comparison to a LUT-based architecture
Conclusions
In this paper, we have presented a product-term based synthesizable programmable logic device, and compared it to the lookup-table based device in [9]. Overall, we found that our new architecture is 35% smaller and 72% faster, primarily due to a dramatic reduction in the amount of circuitry needed to route signals. We also investigated the effects of various architectural parameters on the efficiency of our core.
Better synthesis results could be obtained by "tweaking" the standard-cell library to include cells specifically optimized to implement our programmable logic fabric. We have not considered this in this paper, since our goal was to create architectures that can be implemented using the standard synthesis tools, cell libraries, and design flow that integrated circuit designers are already familiar with. Nonetheless, if this design technique were to become mainstream, specially-designed standard cells could be created.
