Abstract
Introduction
Recent years have seen impressive improvements in the achievable density of integrated circuits. In order to maintain this rate of improvement, designers need new techniques to handle the increased complexity inherent in these large chips. One such emerging technique is the System-on-a-Chip (SoC) design methodology. In this methodology, pre-designed and pre-verified blocks, often called cores or intellectual property (IP), are obtained from internal sources or third parties, and combined onto a single chip. These cores may include embedded processors, memory blocks, or circuits that handle specific processing functions. The SoC designer, who would have only limited knowledge of the structure of these cores, could then combine them onto a chip to implement complex functions.
No matter how seamless the SoC design flow is made, and no matter how careful an SoC designer is, there will always be some chips that are designed, manufactured, and then deemed unsuitable. This may be due to design errors not detected by simulation, or it may be due to a change in requirements. This problem is not unique to chips designed using the SoC methodology. However, the SoC methodology provides an elegant solution to the problem: one or more embedded programmable logic cores (EPLC's) can be included on the chip. An embedded programmable logic core is a flexible logic fabric that can be customized to implement any digital circuit after fabrication. Before fabrication, the designer embeds a programmable fabric (consisting of many uncommitted gates and programmable interconnects between the gates). After fabrication, the designer can then program these gates and the connections between them. Several companies, including Actel, Atmel, and eASIC, already provide EPLC's [10, 12, 13].

Figure 1 shows two hypothetical examples of where EPLC's may be beneficial. In Figure 1(a), an EPLC implements interface logic between the other ASIC cores inside the chip and the peripherals outside the chip. As standards change, it is beneficial if this interface logic is flexible. Figure 1(b) shows another example, in which the EPLC's are used as test logic controllers in SoC design [1]. This allows a test engineer to implement new test stimulus and/or test analysis circuits on the programmable core after the chip is fabricated. As testing proceeds, if errors are found, new tests can be devised and the new on-chip test circuitry can be implemented in the EPLC.

Figure 1: (a) An "L"-shaped core used for interface logic. (b) One "U"-shaped core and four "O"-shaped cores used for test wrappers and control.
The architecture of most EPLC's inherits much from the architecture of stand-alone FPGAs. In these devices, configurable logic elements (usually sets of lookup tables) are placed in a grid, separated by horizontal and vertical channels containing fixed metal tracks. These metal tracks are connected to each other and to the logic elements using programmable switches. This is often termed an island-style architecture. The number of logic elements on an FPGA die is usually dictated by cost and/or die-size constraints. Invariably, however, the shape of the FPGA die is either square or rectangular.
In SoC design, however, it may be desirable to use an EPLC which is not square or rectangular. In Figure 1(a), for example, the EPLC is "L"-shaped; not only may this better mesh with the other cores, but an "L" shape may be very suitable if the I/O associated with this block spans more than one edge of the chip. In Figure 1(b), both "O"-shaped and "U"-shaped EPLC's can be seen. "O"-shaped EPLC's can be used as a "wrapper" around other (possibly fixed-function) cores, in this case to map test signals to the cores. "U"-shaped EPLC's may also be used when a complete wrapping is not required. In any case, it is clear that EPLC's should be able to take on shapes other than rectangular.
Constructing a non-rectangular core is straightforward. The island-style architecture in most EPLC's provides a natural way to implement "L", "U", and "O"-shaped cores; one or more basic logic elements (and the routing around these logic elements) can be removed to form an "L", "U", or "O". An example is shown in Figure 2; in this case, the logic elements in the shaded region can be removed to form a "U"-shaped core. In order to use a non-rectangular core, however, placement and routing tools that map user circuits to the EPLC are required. It is important to note that we are not referring to the tool that positions the EPLC on the chip with the other cores; we are referring to the tool that places and routes the gates of the user circuit on the EPLC after chip fabrication. Such place and route tools have been well studied for stand-alone FPGA's; however, as described above, EPLC's may differ in that they may not be square or rectangular. Since FPGA place and route tools use algorithms that are inherently geometrical, it seems likely that the same algorithms may not perform well for "L", "U", and "O"-shaped EPLC's. In this paper, we investigate place and route algorithms for these non-rectangular EPLC's. In particular, we answer the following questions:
1. Can standard FPGA place and route algorithms map user circuits to "L"-shaped, "O"-shaped, and "U"-shaped EPLC's, and if not, how can they be modified so that they do?
2. What is the penalty (if any) for using non-rectangular cores compared to rectangular cores?
3. How does the algorithm perform as the EPLC becomes more non-rectangular (for example, as the "hole" removed to create an "O" becomes larger)?
Routing in an FPGA is similar to the global/detailed routing in standard cell and custom chip tools. These tools can handle non-rectangular shapes by breaking these shapes into rectangular channels, and routing within the channels separately. It is conceivable that we could do the same thing here: break the EPLC into rectangular regions, and place and route within each region separately. This is unlikely to work well, however, since:
1. We would need to partition logic among these regions (for example, partition logic among each arm of an "L"). This may lead to an inefficient implementation, since the partitioning algorithm would likely be performed before any placement and routing. Standard cell routers only perform routing, and thus do not need to partition logic.
2. Even if we were able to partition the user circuit into rectangular regions, we would need to impose additional constraints on the positions of the interconnects that connect between these regions. Since the flexibility in EPLC's is so much smaller than that in custom cells, fixing the entry and exit points of connections between the regions would significantly hurt routability.
Thus, we must use a placement and routing tool that can map user circuits to non-rectangular EPLC's without breaking the EPLC into rectangular regions. In Section 2, we describe an existing stand-alone FPGA placement and routing tool. In Section 3, we show how this tool can be modified to better support "L", "U", and "O"-shaped EPLC's. In Section 4, we experimentally investigate how the algorithm performs, and in Section 5, we experimentally quantify the penalty for using non-rectangular cores, and determine how the algorithm performs as the core becomes less and less rectangular.
Placement and Routing for Stand-Alone FPGA's
We have based our algorithm on the existing VPR place and route tool [2]. The following subsections describe the relevant details of the algorithms; a more complete description is in [3, 4].
a) Placement
Placement is the process of assigning a physical logic element to each logic element in the user circuit. VPR employs a timing-driven, simulated-annealing-based placement algorithm. The cost function employed by the annealer is:

$$\text{Cost} = \lambda \cdot (\text{normalized timing cost}) + (1 - \lambda) \cdot (\text{normalized wiring cost})$$

where $\lambda$ is a constant. The wiring cost is the normalized sum of the scaled bounding boxes of each net. The timing cost is

$$\text{Timing Cost} = \sum_{(i,j)} \text{Timing Cost}(i,j)$$

where

$$\text{Timing Cost}(i,j) = \text{Delay}(i,j) \cdot \text{Criticality}(i,j)^{exp}$$

In this equation, the Criticality of a connection is an estimate of how close a connection is to the critical path, and exp is a constant. The delay of a connection clearly depends on the placement; calculation of this delay during placement is difficult. VPR uses a pre-computed matrix that contains the delay of each potential connection (an initial fast route of each connection is performed, and Elmore delay is used to estimate the delay). Note that it is not necessary to compute the delay between every pair of logic elements; in most stand-alone FPGA's, the delay between locations (x, y) and (x+Δx, y+Δy) is approximately independent of the values of x and y. This is shown in Figure 3; the delay of a net connecting a and b will be approximately the same as the delay of a net connecting c and d. Thus, a single array indexed on the span of each wire is enough. For each potential placement, the x and y span of each wire can be found, and the matrix can be used to quickly find the delay of each connection. For more details, see [4].
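As a rough illustration of the span-indexed delay lookup and the annealer cost described above, the following Python sketch may help. The function names, the table layout (`delay_table[dx][dy]`), and the assumption that both cost terms are passed in already normalized are illustrative choices, not details taken from VPR.

```python
from typing import List, Tuple

Loc = Tuple[int, int]  # (x, y) grid location of a logic element


def connection_delay(delay_table: List[List[float]], src: Loc, dst: Loc) -> float:
    """Look up a connection delay using only its x and y span (assumed table layout)."""
    dx = abs(dst[0] - src[0])
    dy = abs(dst[1] - src[1])
    return delay_table[dx][dy]


def timing_cost(delay: float, criticality: float, exp: float) -> float:
    """Per-connection timing cost: Delay(i, j) * Criticality(i, j) ** exp."""
    return delay * criticality ** exp


def annealer_cost(norm_timing: float, norm_wiring: float, lam: float) -> float:
    """Weighted sum of the (already normalized) timing and wiring costs."""
    return lam * norm_timing + (1.0 - lam) * norm_wiring
```

Because the table is indexed by span only, two connections with the same (Δx, Δy) receive the same delay regardless of where they sit on the device. This is exactly the property that stops holding once part of the grid is removed.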
b) Routing
The goal of an FPGA router is to find paths for the connections between the placed logic blocks. Unlike a channel router, each route must be constructed using fixed metal tracks connected using programmable switches. Thus, the flexibility available to an FPGA router is far less than that available to a channel router. Because of this, single-step combined global/detailed routers are often used when targeting FPGAs, rather than separate global and detailed routers.

The router is based on the Pathfinder algorithm [5]. Nets are routed sequentially using an A* maze-routing algorithm. Initially, nets are allowed to share physical tracks. Once all nets have been routed, the cost of two nets sharing a track is increased slightly. Each net is then ripped up and re-routed. This is repeated for several iterations; each time, the cost of sharing becomes slightly higher. When, at the end of an iteration, no track is shared by more than one net, a legal routing has been found, and the algorithm terminates. During maze-routing, the fitness of each potential segment n that might be added to the net is evaluated using the following cost function:

$$\text{Cost}(n) = \text{Criticality} \cdot \text{delay}(n) + (1 - \text{Criticality}) \cdot b(n) \cdot h(n) \cdot p(n)$$

where delay(n) is the Elmore delay of the segment n; b(n), h(n), and p(n) are the base cost, the historical congestion cost, and the present congestion cost of using segment n; and the Criticality is a measure of how close to the critical path a given segment is. The first term in the cost function represents the delay of the currently routed net, while the second represents the congestion cost for the current segment. Nets with a high value of Criticality (i.e., nets on or near the critical path) are thus routed primarily for speed, while other nets are routed primarily for congestion. This ensures that, as routing progresses, nets which are not critical are moved away from congested regions.
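The trade-off between the delay term and the congestion term can be made concrete with a small Python sketch. It assumes the cost takes the form Criticality · delay(n) + (1 − Criticality) · b(n) · h(n) · p(n), which matches the two-term description in the text; the segment names and numeric values are hypothetical.

```python
from typing import Dict, Tuple

# (delay, base cost, historical congestion cost, present congestion cost)
Segment = Tuple[float, float, float, float]


def segment_cost(criticality: float, delay: float,
                 base: float, historical: float, present: float) -> float:
    """Delay weighted by criticality plus congestion weighted by (1 - criticality)."""
    congestion = base * historical * present
    return criticality * delay + (1.0 - criticality) * congestion


def cheapest_segment(criticality: float, candidates: Dict[str, Segment]) -> str:
    """Return the name of the candidate segment with the lowest cost."""
    return min(candidates, key=lambda name: segment_cost(criticality, *candidates[name]))


# Two hypothetical candidates competing for the same connection.
CANDIDATES: Dict[str, Segment] = {
    "fast_but_congested": (1.0, 1.0, 2.0, 2.0),
    "slow_but_free": (3.0, 1.0, 1.0, 1.0),
}
```

With these values, a connection with criticality 0.9 picks the fast, congested segment, while one with criticality 0.1 picks the slower, uncongested one, which is the behaviour the text describes.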
For large FPGAs, standard maze-routing can become very slow and consume a significant amount of memory. Thus, VPR uses an A*-type algorithm. Rather than choosing a segment using the above cost function, the following is used:

$$\text{Total Cost}(n) = \text{Cost}(n) + \text{Estimated Cost}(n)$$

where the Estimated Cost term is computed using the Manhattan distance between the current routing segment and the sink of the connection. Figure 4 shows this graphically; the Estimated Cost term for the segment would be x+y. In this way, most connections are found very quickly; the algorithm reverts to a standard maze-router if there is significant congestion.
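The effect of the Estimated Cost term can be sketched with a small grid search in Python. This is a simplified stand-in, not VPR's router: it searches over grid cells rather than routing segments, charges a unit cost per step, and the `use_estimate` switch is a hypothetical parameter added only to compare against a plain maze-router.

```python
import heapq
from typing import Optional, Set, Tuple

Node = Tuple[int, int]


def manhattan(a: Node, b: Node) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def maze_route(width: int, height: int, blocked: Set[Node],
               source: Node, sink: Node,
               use_estimate: bool = True) -> Tuple[Optional[int], int]:
    """Return (path cost, nodes expanded); cost is None if the sink is unreachable."""
    def estimate(node: Node) -> int:
        return manhattan(node, sink) if use_estimate else 0

    best = {source: 0}
    # Heap entries: (total cost, estimate, cost so far, node).
    # Ties in total cost are broken in favour of nodes nearer the sink.
    frontier = [(estimate(source), estimate(source), 0, source)]
    expanded = 0
    while frontier:
        _, _, cost, node = heapq.heappop(frontier)
        if cost > best[node]:
            continue  # stale entry superseded by a cheaper path
        expanded += 1
        if node == sink:
            return cost, expanded
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height) or nxt in blocked:
                continue
            new_cost = cost + 1  # unit cost per step in this sketch
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                est = estimate(nxt)
                heapq.heappush(frontier, (new_cost + est, est, new_cost, nxt))
    return None, expanded
```

On an empty grid both variants find the same path cost, but the version with the estimate expands far fewer nodes because it heads directly toward the sink.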
To further speed up the routing algorithm, a bounding box around the net being routed is computed. The size of this bounding box is then increased by two logic elements in each direction. When routing, the router does not expand to segments that lie outside this bounding box. In [3], it is shown that this has a negligible effect on route quality, but speeds up routing somewhat.
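The bounding-box restriction amounts to a simple coordinate test, sketched below in Python. The helper names are hypothetical, and the sketch does not clip the box to the device edges, which a real tool would presumably do.

```python
from typing import Iterable, Tuple

Loc = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), inclusive


def routing_box(terminals: Iterable[Loc], margin: int = 2) -> Box:
    """Bounding box of a net's terminals, grown by `margin` logic elements per side."""
    xs, ys = zip(*terminals)
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


def may_expand(box: Box, loc: Loc) -> bool:
    """True if the router is allowed to expand to a segment at `loc`."""
    xmin, ymin, xmax, ymax = box
    return xmin <= loc[0] <= xmax and ymin <= loc[1] <= ymax
```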
Place and Route for Non-Rectangular EPLC's
The VPR placement and routing algorithms described in Section 2 are both geometry-based. In the following two subsections, we will show why these algorithms are not suitable for non-rectangular cores, and will show enhancements to better support these cores. 
a) Placement
The current placement tool does not produce good solutions on "O" and "U"-shaped cores. Figure 5 shows the problem. As described in Section 2, the delay between a pair of logic blocks is found using a precomputed delay lookup table, indexed by the x and y span of the net. As shown in Figure 5, the Manhattan distance between two blocks may not correctly represent the shortest path distance between the two nodes in "U" and "O"-shaped cores. Therefore, the delay value stored in the precomputed delay lookup table gives a very poor estimate of the actual delay of this connection. This poor delay estimation also affects the Criticality of this connection. If the connection shown in Figure 5 is on or near the critical path, the placer would not pay as much attention as it should to minimizing the delay of this connection, which could eventually result in a slower circuit. In a "U" or an "O"-shaped core, the same delay table cannot be used for each block; different (x, y) locations should have different delay tables. One possible solution is to compute a separate delay table for each block location (x, y). However, this does not scale, and could make the placer run very slowly and require too much memory.

Figure 5: Difference between the Manhattan distance and the shortest path distance in a "U"-shaped core. (Label in figure: shortest path distance between the two terminals.)
Our solution is to compute separate delay tables for all blocks in Regions 1 and 2 in Figure 5; these delay tables are used for all nets that span the two regions. The original delay table is used for all other nets. A similar technique can be used for "O"-shaped cores.
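One way to picture this table selection is the Python sketch below. Several details are assumptions made only for illustration: Regions 1 and 2 are taken to be the two arms of the "U" above the floor of the notch, the notch is described by the hypothetical tuple `(notch_x0, notch_x1, notch_y0)`, and the cross-region table is keyed by the ordered pair of endpoints. The text does not specify how the additional tables are indexed.

```python
from typing import Dict, List, Tuple

Loc = Tuple[int, int]
Notch = Tuple[int, int, int]  # (notch_x0, notch_x1, notch_y0): columns removed from row notch_y0 upward


def region_of(loc: Loc, notch_x0: int, notch_x1: int, notch_y0: int) -> int:
    """Return 1 or 2 for the two arms of the "U" (assumed regions), else 0."""
    x, y = loc
    if y >= notch_y0 and x < notch_x0:
        return 1
    if y >= notch_y0 and x > notch_x1:
        return 2
    return 0


def connection_delay(src: Loc, dst: Loc,
                     default_table: List[List[float]],
                     cross_table: Dict[Tuple[Loc, Loc], float],
                     notch: Notch) -> float:
    """Use the cross-region table only for nets spanning Regions 1 and 2."""
    regions = {region_of(src, *notch), region_of(dst, *notch)}
    if regions == {1, 2}:
        return cross_table[(src, dst)]
    dx, dy = abs(dst[0] - src[0]), abs(dst[1] - src[1])
    return default_table[dx][dy]
```

All other nets still use the single span-indexed table, so the extra memory grows only with the number of block pairs that straddle the notch.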
b) Routing

The current routing algorithm does not produce good solutions on "O" and "U"-shaped cores. Figure 6 shows the first problem. As described in Section 2, the router finds a bounding box around each net and expands it by two logic elements in each direction; the router then does not explore routes outside this bounding box. As shown in Figure 6, this can cause a problem in "U" and "O"-shaped cores. In the figure, the net will not be successfully routed, because all potential routes must pass outside the bounding box.
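The failure mode can be reproduced with a small reachability check in Python. This is a simplified model under stated assumptions: the core is treated as a grid of cells rather than a routing-resource graph, the "U" is a 10 x 10 grid with a hypothetical notch cut from the top edge, and the terminal locations are made up for the example.

```python
from collections import deque
from typing import Optional, Set, Tuple

Cell = Tuple[int, int]
Box = Tuple[int, int, int, int]  # (xmin, ymin, xmax, ymax), inclusive


def u_shaped_core(width: int, height: int,
                  notch_x0: int, notch_x1: int, notch_y0: int) -> Set[Cell]:
    """Cells of a width x height grid minus a notch cut down from the top edge."""
    return {(x, y) for x in range(width) for y in range(height)
            if not (notch_x0 <= x <= notch_x1 and y >= notch_y0)}


def reachable(cells: Set[Cell], source: Cell, sink: Cell,
              box: Optional[Box] = None) -> bool:
    """Breadth-first search over `cells`, optionally confined to `box`."""
    def allowed(cell: Cell) -> bool:
        if cell not in cells:
            return False
        if box is None:
            return True
        return box[0] <= cell[0] <= box[2] and box[1] <= cell[1] <= box[3]

    seen = {source}
    queue = deque([source])
    while queue:
        x, y = queue.popleft()
        if (x, y) == sink:
            return True
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in seen and allowed(nxt):
                seen.add(nxt)
                queue.append(nxt)
    return False


CORE = u_shaped_core(10, 10, 3, 6, 4)
SOURCE, SINK = (2, 8), (7, 8)   # one terminal on each arm of the "U"
BOX = (0, 6, 9, 10)             # terminal bounding box grown by two in each direction
```

Here `reachable(CORE, SOURCE, SINK, BOX)` is false because every path between the arms must dip below the notch, outside the box, whereas the unconstrained search succeeds.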
The solution to this problem is straightforward; we simply remove the bounding box constraint, and allow the router to explore the entire graph. Although this slows down the algorithm somewhat, experiments have shown that the impact on run-time is very small.

The second problem is that the estimated cost term in the A* search may be inaccurate. As described in Section 2, the expected cost is computed as the Manhattan distance between the current routing segment and the sink of the connection. Figure 5 shows an example where this estimation is incorrect. In this case, the preferred route is to leave the source in a downward direction. However, using the current cost function, the upward direction appears equally attractive. Although the correct route will still eventually be found, the portion of the routing fabric that will be explored is large, especially considering that the bounding box constraint has been removed, as described above. This leads to long run times.

Our solution is to explicitly add terms to better estimate the distance from the current routing segment to the sink in "U" and "O"-shaped cores. This is shown in Figure 7.

Figure 7: Pseudo-code of the correct shortest path distance calculation for "U" and "O"-shaped cores.
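A corrected distance estimate for a "U"-shaped core might look like the following Python sketch. It is not the pseudo-code of Figure 7, which is not reproduced here; it assumes a hypothetical geometry in which columns `notch_x0..notch_x1` are removed from row `notch_y0` upward (with `notch_y0 >= 1`), and it measures distance in grid cells.

```python
from typing import Tuple

Loc = Tuple[int, int]


def u_shape_distance(a: Loc, b: Loc,
                     notch_x0: int, notch_x1: int, notch_y0: int) -> int:
    """Shortest rectilinear path length between a and b inside the assumed "U"."""
    (ax, ay), (bx, by) = a, b
    opposite_arms = ((ax < notch_x0 and bx > notch_x1) or
                     (bx < notch_x0 and ax > notch_x1))
    if opposite_arms and ay >= notch_y0 and by >= notch_y0:
        # Both endpoints sit beside the notch on different arms, so the path
        # must drop to the highest full-width row, cross, and climb back up.
        floor = notch_y0 - 1
        return abs(ax - bx) + (ay - floor) + (by - floor)
    # Otherwise a monotone path exists and the Manhattan distance is exact.
    return abs(ax - bx) + abs(ay - by)
```

For the notch `(3, 6, 4)` and terminals `(2, 8)` and `(7, 8)`, the Manhattan distance is 5 while this estimate gives 15, so the search is steered downward toward the base of the "U" instead of treating both directions as equally attractive.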
Note that none of these enhancements are needed for "L"-shaped cores. Circuits can be placed and routed on "L"-shaped cores using the same tools as for square and rectangular cores.
Algorithm Evaluation
a) Methodology
To evaluate the proposed algorithmic enhancements, we experimentally mapped sixteen large benchmark circuits from the Microelectronics Corporation of North Carolina (MCNC) [9] onto a model EPLC. We assumed an island-style EPLC, where each logic block contains four 4-input lookup tables and four flip-flops. It was assumed that each fixed wiring track spans one logic block and that a Wilton switch block [14] is employed. We assumed a 0.18 µm CMOS process available from TSMC.
Each circuit was first mapped to 4-input lookup tables and flip-flops using Flowmap/Flowpack [6]. The lookup tables and flip-flops were then packed into logic blocks using a timing-driven packing algorithm [7]. The logic blocks were then placed and routed on an appropriately sized EPLC using both the original and enhanced algorithms. For each circuit, we sized the EPLC to be the smallest shape that meets the relative aspect ratios shown in Figure 8. Only "U" and "O"-shaped core results are described in this section since, as described above, existing place and route tools work well with "L"-shaped cores. Routing was performed twice; the first route was used to find the minimum number of routing tracks needed for 100% routability. This number was then increased by 20%, and the routing repeated. This "low-stress" routing is representative of the routing performed in real industrial designs.

Figure 8: The relative aspect ratios of a "U"-shaped core and an "O"-shaped core used in algorithm evaluation.

b) Results

Table 1 shows the routing area (in terms of the number of Minimum Transistor Equivalents [3]), critical path delay, and algorithm run-time for all sixteen circuits implemented on a "U"-shaped EPLC. Columns 2 to 5 show results for the original VPR placement and routing tool, columns 6 to 9 show the results for the original VPR placement algorithm and the enhanced routing algorithm, and columns 10 to 13 show the results for the enhanced placement and routing algorithms. As the table shows, the enhanced router produces similar results to the original router, but with a 62% faster run-time. When the enhanced placer is used, the run-time is increased somewhat, but the critical path delay is reduced by approximately 12%.

Table 2 shows the results for an "O"-shaped core. Again, the improvement in run-time of the enhanced router is significant (40%). The improvement in critical path delay when the enhanced placer is used is about 4%.
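The two-pass "low-stress" routing procedure from the methodology can be sketched in Python as follows. The binary search, the `is_routable` callback, and rounding the 20% increase up to a whole track are all assumptions made for illustration; the text only states that the minimum track count is found and then increased by 20%.

```python
from typing import Callable


def find_min_tracks(is_routable: Callable[[int], bool], lo: int, hi: int) -> int:
    """Smallest track count in [lo, hi] that routes, assuming routability is monotone."""
    if not is_routable(hi):
        raise ValueError("circuit does not route even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if is_routable(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def low_stress_tracks(min_tracks: int) -> int:
    """Minimum track count plus 20%, rounded up (integer arithmetic avoids float error)."""
    return (min_tracks * 120 + 99) // 100
```

For example, a circuit that first routes at 17 tracks per channel would be re-routed at 21 tracks under this rounding assumption.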
Architecture Study
The results in Tables 1 and 2 show that the algorithm works well for a particular aspect ratio. In this section, we answer two questions. First, we investigate how the efficiency of an "L", "U", and "O"-shaped core compares to a square core. Second, we investigate how this efficiency changes as the dimensions of the core change. In [8], it was shown that the efficiency of a rectangular core is less than that of a square core, and that the efficiency drops as the core gets more and more rectangular. In this section, we investigate whether this is true for our non-rectangular cores.
a) Methodology
We use the same experimental set-up and benchmark circuits as in Section 4. We define the thinness of an "L", "U", and "O"-shaped core as the proportion of a square core that is removed to create the irregularly-shaped core. This quantity can range from 0 to 1. A value of 0 would describe a square core; as the thinness increases, the core becomes more and more non-square. This is shown graphically in Figure 9.

b) Results

Figure 10 shows the area and critical path results as a function of thinness, for "L", "U", and "O"-shaped cores. As expected, as the thinness increases, the area and delay efficiency decrease. The area penalty is significant; for a thinness of 0.64, there is a penalty of between 50% and 150%, depending on the shape. The delay impact is not as significant. Figure 11 shows that this is directly a result of the improved algorithms described in this paper. Using the enhanced algorithms, for a thinness of 0.64, the delay penalty of a "U"-shaped core is approximately 60%, while if the original algorithms had been employed, the delay penalty would be almost 150%. The results for an "O"-shaped core are similar.
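The thinness metric defined in the methodology above can be made concrete with a short Python sketch. The 10 x 10 grid and the hole coordinates are hypothetical values chosen so that each shape has a thinness of 0.64; the actual core dimensions used in the experiments are not given in the text.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]
Hole = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive


def core_cells(side: int, hole: Hole) -> Set[Cell]:
    """Cells of a side x side core after removing the rectangular hole."""
    x0, y0, x1, y1 = hole
    return {(x, y) for x in range(side) for y in range(side)
            if not (x0 <= x <= x1 and y0 <= y <= y1)}


def thinness(side: int, cells: Set[Cell]) -> float:
    """Proportion of the original square core that was removed."""
    total = side * side
    return (total - len(cells)) / total


SIDE = 10
L_SHAPE = core_cells(SIDE, (2, 2, 9, 9))  # hole in a corner
U_SHAPE = core_cells(SIDE, (1, 2, 8, 9))  # hole touching one edge
O_SHAPE = core_cells(SIDE, (1, 1, 8, 8))  # hole in the interior
```

Each of the three example shapes removes an 8 x 8 block from the 10 x 10 square, so all three have the same thinness even though their geometry, and therefore their routing behaviour, differs.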
Conclusions
In this paper, we have presented enhanced placement and routing algorithms for "U"-shaped and "O"-shaped embedded programmable logic cores. For a typical "U"-shaped core (thinness = 0.64), the algorithms give a 12% reduction in the critical path of the resulting circuit compared to algorithms optimized for square and rectangular cores. A critical path reduction of 4% was obtained for an "O"-shaped core. For both "U" and "O"-shaped cores, the enhanced router runs much faster than the original router, but the enhanced placer runs slower than the original placer because additional delay tables must be calculated. Overall, the run-time of the two algorithms together remains roughly the same for an "O"-shaped core and is reduced by 25% for the "U"-shaped core. We have also shown that the penalty for using "U" and "O"-shaped programmable logic cores is significant, even with the new placement and routing algorithms. It is important to note that we are not suggesting that "U" and "O"-shaped cores are a bad idea. In many cases, a non-rectangular core will be required, either because of I/O constraints or because it is the only shape that will fit well with other cores in an SoC. Instead, our results show that if such a core is used, the enhancements to the placement and routing can significantly reduce the area and delay penalty. For a typical "U"-shaped core (thinness = 0.64), the proposed algorithms reduce the delay penalty from 150% to 60%.
