The importance of effective and efficient accounting of layout effects is well established in High-Level Synthesis (HLS), since it allows more realistic exploration of the design space and the generation of solutions with predictable metrics. This feature is highly desirable in order to avoid unnecessary iterations through the design process. In this article, we address the problem of layout-driven register-transfer-level (RTL) binding as this step has a direct relevance to the final performance of the design. By producing not only an RTL design but also an approximate physical topology of the chip-level implementation, we ensure that the solution will perform at the predicted metric once implemented, thus avoiding unnecessary delays in the design process.
INTRODUCTION
High-level synthesis (HLS) typically uses generic abstract models of hardware during the tasks of scheduling, allocation, and binding. The use of these models simplifies HLS algorithms and standardizes the output of HLS to a generic format so that it can then be implemented in a particular technology through register-transfer-level (RTL) synthesis (e.g., logic synthesis, technology mapping, and physical design).
However, experimental evidence indicates that there is significant variation in hardware attributes based not only on the specific technology chosen, but also on the physical design of each implementation. Decisions made at this level have a pronounced impact on the final design, yet their impact cannot be determined until later in the design process. This introduces a large amount of unpredictability into the results produced by HLS and is one of the main reasons for its lack of acceptance in industry. McFarland [1986], Brewer and Gajski [1990], and Knapp [1990] clearly indicated the significance of interconnect and other layout effects (traditionally considered second-order in HLS) on the overall implementation area and delay. For HLS algorithms (e.g., scheduling, allocation, and binding) to make effective decisions that eventually result in high-quality layouts, we need to incorporate physical design information during HLS. We must account not only for place and route effects, but also for global considerations such as RT wiring, component styles, aspect ratio, floorplanning, and combinations of these. Without such information, the RTL designs may produce unpredictable results when implemented on silicon.
The work presented here proposes a paradigm to incorporate layout information into the tasks of HLS. As the first step towards solving the problem, we turn our attention to the task of binding. Binding is typically the final task in HLS which follows scheduling and allocation. In binding, there are three subtasks: functional unit (FU) binding, where operations are assigned to hardware modules, storage binding, where values are assigned to hardware registers, and interconnection binding, where interconnections are bound to specific buses or multiplexors.
One way to account for layout information is to actually go through a physical design procedure every time a candidate solution is generated since existing CAD systems treat binding and physical design independently. Figure 1(a) shows the flow of a typical design methodology of an automatic behavioral synthesis system. This traditional flow suffers from the following four major drawbacks.
- One must explore a large number of solutions;
- evaluating each solution takes too long, since it is not known whether the design will meet the constraints until the end of the time-consuming phase of place and route;
- when the constraints are not met, it is difficult to identify where the problem comes from and at which level the design should be modified; furthermore, there is no way to identify the constraint that caused the solution to be infeasible; and finally,
- the three subtasks (FU, storage, and interconnection binding) are tightly related to each other, and the deadlock situation among them is still an open problem in HLS.
In contrast with previous approaches, we incorporate physical information into the task of binding. The approach relies on:
- accurate and efficient prediction of design metrics, which reduces the run-time needed to evaluate a design solution, and
- a means of incorporating layout information in the design process to find an efficient subset for design space exploration.
As shown in Figure 1(b), the main features of this work are the following.
- It uses ChipEst-FPGA, a chip-level area and performance estimator that provides accurate and efficient prediction of design metrics;
- the final result is evaluated without actually going through the time-consuming phase of place and route;
- when time constraints are met, the algorithm outputs not only a structural RTL netlist, but also its corresponding physical topology, which can be carried through to a silicon implementation in a predictable manner;
- whenever time constraints are not met, our binding techniques provide a means of exploring the design space in a realistic and efficient way; with this exploration, our techniques provide feedback to the previous tasks if the constraints cannot result in any feasible solution, and output the best implementation that can be achieved; and
- we break the deadlock situation among FU, storage, and interconnection binding by performing these three subtasks simultaneously; physical design information is taken into account as well.
Although our proposed approach is valid for any technology, we benchmark the results with respect to the field programmable gate array (FPGA) design style, since the ability to shorten development cycles has made FPGAs an attractive alternative to standard cells and mask-programmed gate arrays for the realization of application-specific integrated circuits. Specifically, the Xilinx XC4000 [Xilinx 1994] series is assumed to be the layout design style for the remainder of this article.¹
2. BACKGROUND
2.1 Overview of the Xilinx XC4000 Family of FPGAs
The Xilinx XC4000 consists of an array of CLBs embedded in a configurable interconnect structure and surrounded by configurable I/O blocks, as shown in Figure 2(a). The Xilinx XC4000 family consists of 10 members. The family members differ in the number of CLBs (ranging from 8 × 8 to 24 × 24) and I/O blocks (ranging from 64 to 192). The typical gate capacity varies from 2,000 to 13,000.
¹ For a different FPGA family, our approach is still suitable, but corresponding estimation tools should be developed. For custom design, we have our CompEst and ChipEst tools [Ramachandran et al. 1992].
Layout-Driven RTL Binding Techniques
2.1.1 XC4000 Configurable Logic Blocks and Lookup Tables.
Xilinx XC4000 CLBs mainly consist of two 4-input LUTs, called the F-LUT and G-LUT, and one 3-input LUT, called the H-LUT, as shown in Figure 2(b). A K-input LUT is a memory that can implement any Boolean function of K variables. The K inputs are used to address a $2^K \times 1$-bit memory that stores the truth table of the Boolean function. All the CLB outputs can be direct, inverted, or registered.
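To make the memory view of a LUT concrete, the following minimal sketch models a K-input LUT as a $2^K$-entry truth table addressed by its inputs. The helper names (`make_lut`, `lut_read`) and the bit ordering (first input is the most significant address bit) are illustrative assumptions and are not taken from the XC4000 documentation.

```python
from itertools import product

def make_lut(func, k):
    """Store the truth table of a K-input Boolean function as 2**k bits."""
    # Entry i holds func evaluated on the k-bit binary expansion of i.
    return [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

def lut_read(table, inputs):
    """Use the inputs as an address into the truth-table memory."""
    address = 0
    for bit in inputs:  # assumed ordering: first input is the MSB
        address = (address << 1) | bit
    return table[address]

# A 4-input function, the size handled by an F-LUT or G-LUT.
f = lambda a, b, c, d: (a and b) ^ (c or d)
table = make_lut(f, 4)
```

Any other 4-input function could be substituted for `f` without changing the table size, which is the sense in which a LUT can implement any Boolean function of K variables.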
XC4000 Programmable Interconnect Point and Routing Resources.
Xilinx XC4000 routing resources are connected by switch matrices. There are 8 (6 for smaller devices) intersections containing 6 programmable interconnect points (PIPs) each. The PIP, shown schematically in Figure 3(c), is a pass transistor controlled by a configuration memory cell. XC4000 routing resources include single-length (general-purpose) lines (SLs), shown in Figure 3(a), double-length lines (DLs), shown in Figure 3(b), and long lines (LLs). LLs run the width or the height of the chip with negligible delay variations. SLs connect every pair of adjacent switch matrices² and DLs bypass alternate switch boxes.³ Thus the wirability of a net is no longer a simple function of its length and the congestion of its routing region. Moreover, since signal delay depends more on the number of PIPs through which a signal passes than on the length of the segments, the double-length lines allow a signal to travel twice the distance in the same amount of time, or the same distance in half the time, compared with single-length lines.⁴ The delay of a wire is therefore also no longer a simple function of its length, as shown in Figure 4.
Overview of the FPGA Design Methodology
HLS generates an architecture from a behavioral specification subject to constraints on area and delay. Following that, the design process of FPGAs can be decomposed into three major steps, as shown in Figure 1(a). First, partitioning (or technology mapping), which includes lookup table (LUT) mapping and configurable logic block (CLB) construction, partitions the incoming logic into a netlist of CLBs. Next, placement determines a good assignment for the CLBs in the FPGA array. Once the placement is known, routing decides the type of routing resources and the route for each net. This is a flat design approach, since the netlist fed into partitioning is a gate-level netlist and the partitioning is done on the whole netlist (a more detailed discussion can be found in Xu and Kurdahi [1996]).
² The wire between two adjacent switch matrices is an SL segment.
³ The wire connecting every other switch matrix is a DL segment.
⁴ Experiments show that SL segments and DL segments have approximately the same delay.
In contrast to this flat design flow, Figure 1(b) shows a hierarchical HLS design flow aimed at FPGAs. The hierarchy is kept during the whole process and the design is not flattened. The RTL components are either bound to precharacterized components⁵ or laid out using layout tools. This way, the structural information is preserved in each component.
Maintaining this hierarchy is beneficial for the following reasons.
- It is easy to debug and to add or change logic, since design changes in one component can be made without affecting the rest of the design.
- It is easy to adapt to a different technology.
- It is easy to improve the design routability by grouping and floorplanning the RT components according to the dataflow. It is easy to improve the design's performance.
- It matches well with the HLS design paradigm since the hierarchy is maintained throughout the design process.
- With proper binding and component selection, it is possible to optimize the overall design by selecting different component implementations for Gajski et al. [1994].
To avoid unnecessary iterations and shorten the design cycle, it is very helpful to have an estimator that gives area and timing estimates quickly, before actually going through the time-consuming placement and routing phases, as shown in Figure 1(b). It is very important that the estimator has a realistic and accurate model that takes into account not only component area and timing, but also wiring effects.
One important class of FPGAs, implemented by Xilinx, uses lookup tables (LUTs, discussed in Section 2.1) to implement combinatorial logic and is called LUT-based FPGAs. Xilinx has three logic cell array families of LUT-based FPGAs: XC2000, XC3000, and XC4000. They share a common structure, an array of CLBs surrounded by a configurable interconnect, and they differ in the details of the logic and interconnect structures. The main features of this technology are briefly explained in Section 2.1.
PREVIOUS WORK
Recently, there has been increased interest in developing accurate and efficient prediction of design metrics. Specifically, Sastry and Parker [1984], Feuer [1982], Gura and Abraham [1989], Gamal [1977], and Donath [1979] presented preplacement estimation for wire length. An excellent survey of timing estimation [Ousterhout 1985; LaPotin and Chen 1989; Sutanthavibul and Shragovitz 1991; Dunlop et al. 1984; Jackson and Kuh 1989] is presented in Benkoski and Strojwas [1991]. All those papers discuss custom design style only. Recently, some work has been done for FPGAs. Several fast mapping heuristics for LUT-based FPGAs are surveyed in Francis [1992]. Such heuristics can be used to obtain an estimate of CLB count. However, techniques for accurate timing estimation have not been proposed so far. Xilinx's [1994] partitioning, placement, and routing (PPR) software package has its own built-in estimation tool. This estimation is very accurate since it performs the actual mapping using Chortle [Francis et al. 1990], but the tool does not provide performance estimation.
Other than Xilinx, Synopsys [Xilinx95 1995] also provides accurate area estimation by performing the actual mapping. Moreover, it can provide an estimate of the number of logic levels for the design. Nevertheless, it does not take wiring delay into account.
The research presented in Schlag et al. [1991] empirically examines the performance of multilevel logic minimization tools for a LUT-based FPGA technology and suggests that there is a linear relationship between the number of literals and the number of routed CLBs. It provides estimation for both area and timing but the work is only applicable to the XC3000 series.
All those approaches are suitable for estimating component-level and chip-level designs, but only with a flat design methodology. None of them supports a hierarchical design methodology.
The work presented here is an extension of our previous work, LAST&TELE [Ramachandran et al. 1992] and CompEst-FPGA. LAST&TELE described estimation techniques for custom design. CompEst-FPGA presented a gate-level area and timing estimation approach for LUT-based FPGAs at the component and chip levels, but with a flat design methodology. ChipEst-FPGA, presented here, has a realistic and accurate model since it takes into account not only the component area/delay but also the wiring effects. It mainly handles hierarchical design methodology for high-level applications. It can handle Xilinx XC4000 estimation and is easy to adapt to other Xilinx series such as XC2000 and XC3000 with minor modifications.
For design space exploration, there are several papers that address the means of incorporating layout information in the design process.
- 3-D "scheduling." Weng and Parker [1991] presented an approach to the problem of binding while simultaneously considering floorplanning. Operators are assigned (and placed) as close as possible to their predecessors in order to minimize the interconnection cost. However, this approach did not consider the cost and delay of registers, multiplexors, and wiring space overhead.
- GBA. Jang and Pangrle [1993] and BITNET [Mujumdar et al. 1994] also considered binding with physical information. However, GBA applies only to one-dimensional bit-slice design, and BITNET does not consider interconnection delay.
- Ewering [1990] and ApplaUSE [Frank and Lengauer 1995] addressed the problem of binding with physical information by moving placement earlier, before bus and register assignment, but no physical information is taken into account when FU binding is performed.
- SMB [Fang and Wong 1994] presented an integrated approach for minimizing critical path delay by simultaneously performing FU binding and floorplanning. However, their approach has to start with a fixed floorplan and RT solution and does not account for the shape and delay of multiplexors that affect the delay of the critical path. Furthermore, it is not clear whether SMB can handle multicycled FUs, a consideration that is central to realistic design.
Our approach, on the other hand, does not rely on any particular floorplan or RT solution, and we take the shape and delay of multiplexors and other components into consideration. Furthermore, we take the clock period (register-to-register delay) in the datapath as our main objective, rather than the FU-to-register or register-to-FU delay as in SMB. Finally, our approach handles multicycled FUs as well as chaining of FUs in one clock cycle.
ARCHITECTURAL MODEL AND PROBLEM DEFINITION
In high-level synthesis, an RTL system that consists of FUs, storage elements, and interconnections is synthesized from behavioral descriptions. In order to explore the impact of physical design information in HLS, we need to define a target architecture. In our approach, we consider two styles of target architectures: multiplexor-based and bus-based. Although our approach can handle both architectures, we confine our scope in this article to the multiplexor-based architectural model. We also assume FUs are 2-input, 1-output combinatorial circuits, and registers are 1-input, 1-output circuits. Operation chaining is supported in this model by allowing connections from the output ports of some FUs directly to the input ports of other FUs. Moreover, operations can execute over several clock cycles, so multicycled operations are possible.
Our problem can be informally defined as follows.
Given (1) a scheduled dataflow graph (SDFG), (2) a number of FUs, registers, an input and output multiplexor, and (3) maximum clock period, which is usually part of the system specification, identify whether there is a feasible RTL datapath solution. If there is, perform binding, generate an RTL netlist and its corresponding floorplan; otherwise, report it to previous tasks in HLS and output the best solution that can be achieved.
The example in Figure 5 illustrates the problem. We are given a scheduled dataflow graph consisting of two control steps and an allocation of resources that includes two adders and four registers. The shape function and the corresponding delay information of the components can be obtained from the component library. Our algorithm outputs an RTL datapath netlist with all the binding information. A corresponding floorplan and the clock period (register-to-register delay), which includes wire delay, are also generated. We assume that the controller is implemented as a Moore-style FSM with status and control registers. This way, the clock cycle is determined by the worst case register-to-register delay, which will fall either completely inside the datapath or completely within the controller. Our work in this article concentrates on the datapath area and delay metrics. Once a controller netlist is obtained, our estimation tool, CompEst-FPGA, can easily estimate the worst case register-to-register delay within the controller. This in turn can be compared to the datapath delay to determine the performance bottleneck, if any.
OVERVIEW OF OUR APPROACH
The flow of our algorithm is shown in Figure 6. Given a scheduled dataflow graph, we first construct an initial fully connected netlist in which each FU is connected to every register and each register is connected to every FU. This imposes no limitations on binding.⁶
Then we use our physical-level estimation tools ChipEst-FPGA and CompEst-FPGA [Xu and Kurdahi 1996] to obtain an approximate topology of the layout. CompEst-FPGA is a component-level estimation tool that predicts the area and delay of a given RTL component netlist. Given a specification of a particular component as a set of Boolean equations, we use CompEst-FPGA to predict the shape function of that component. CompEst-FPGA predicts the effects of some logic synthesis tasks such as technology mapping as well as the effects of physical design. This shape function can be obtained by estimating the dimensions of a component with a varying number of rows. In addition, CompEst-FPGA estimates the critical path delay of each configuration, with wiring delay as well as false paths being taken into account. Benchmarking has shown that CompEst-FPGA can estimate area to within about 2.5% and static delay to within about 2-13%.
⁶ The redundant connectivity will be stripped once binding is finished. We explain how the algorithm explores different RT netlists in Section 7.1.
Once we have obtained a shape function for each component, ChipEst-FPGA is used to generate an approximate topology of the overall design. The layout information obtained from ChipEst-FPGA is subsequently used by the binding algorithm to generate the RTL solution. Section 6 describes the details of ChipEst-FPGA and Section 7 describes the binding approach.
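The first step of this flow, the fully connected initial netlist, can be sketched as follows. This is a minimal illustration that assumes components are identified by name and that a connection is a (source, destination) pair. The function names and the example binding are hypothetical and only show how redundant connections could be stripped once binding has fixed which transfers are used.

```python
def fully_connected_netlist(fus, registers):
    """Connect every FU to every register and every register to every FU."""
    nets = {(fu, reg) for fu in fus for reg in registers}
    nets |= {(reg, fu) for reg in registers for fu in fus}
    return nets

def strip_unused(nets, used):
    """Keep only the connections that the final binding actually uses."""
    return nets & set(used)

# Allocation from the Figure 5 example: two adders and four registers.
fus = ["add1", "add2"]
regs = ["r1", "r2", "r3", "r4"]
nets = fully_connected_netlist(fus, regs)

# Hypothetical binding result, used only to illustrate the stripping step.
used = [("r1", "add1"), ("r2", "add1"), ("add1", "r3")]
final_nets = strip_unused(nets, used)
```

The sketch covers only FU-to-register and register-to-FU connections, as in the description above; direct FU-to-FU connections for chaining are left out.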
CHIP ESTIMATION

Problem Definition
Given an RT-level description, the goal of chip area estimation is to predict the area of the chip in terms of number of CLBs as well as the most
Given an RT-level description, the goal of chip timing estimation is to estimate the performance of the chip in terms of minimum clock period by using the delay information of all the RT-level components along with the estimated topology information obtained from chip area estimation.
Chip Area Estimation
Our chip-level area model uses slicing tree techniques derived from Chen and Bushnell [1988] for evaluating the area of designs implemented using RT-level components.
Component Shape Function.
To improve the density of the chip, designers may try different floorplans by varying the topological placements of each component. A component shape function represents the different topological placements in the actual layout and their corresponding delay information. For example, a 4 × 1 mux needs three CLBs. It can have three different topological placements in the actual layout (Figure 7(a)). This results in a shape function such as the one shown in Figure 7(b).
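One simple way to enumerate candidate shapes, assuming a component occupies a rectangular bounding box and the number of rows is varied, is sketched below. The function name is hypothetical and the per-shape delay is omitted. For a three-CLB component the sketch yields three candidate shapes, in line with the 4 × 1 mux example, although the exact shapes drawn in Figure 7 may differ.

```python
from math import ceil

def candidate_shapes(num_clbs):
    """Enumerate (rows, cols) bounding boxes that hold num_clbs CLBs."""
    shapes = []
    for rows in range(1, num_clbs + 1):
        cols = ceil(num_clbs / rows)
        # Skip shapes dominated by one already kept (same cols, fewer rows).
        if shapes and shapes[-1][1] == cols:
            continue
        shapes.append((rows, cols))
    return shapes

mux_shapes = candidate_shapes(3)  # [(1, 3), (2, 2), (3, 1)]
```

The 2 × 2 shape leaves one CLB unused, which is the kind of wasted area that a shape function lets the floorplanner trade against aspect ratio.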
At the RT-level, the shape functions of some components can be obtained from our component library, which is a collection of hard macros with shape function and delay information. The collection includes components that are frequently used in designs, for which we have precharacterized the shape functions, as well as vendor-supplied predesigned components such as hard macros in the Xilinx library. For components whose shape function is not known a priori (the controller, for example), the shape function can be obtained by invoking the component estimator, CompEst-FPGA, described in Xu and Kurdahi [1996]. CompEst-FPGA estimates the area and delay of combinatorial circuits described either at the gate level or using Boolean equations. It estimates the outcome of the technology mapping, placement, routing, and timing optimization phases of the design procedure. CompEst-FPGA has been benchmarked with respect to a wide variety of gate-level designs with and without postlayout optimization. The results indicate that the estimation is not only accurate but also time-efficient, taking one to two orders of magnitude less run-time than the actual Xilinx design tools.
Chip-Level Area Model.
The chip-level slicing tree technique involves slicing down to the leaf blocks that consist of either RT-level components or the controller. This constructive approach does not consume excessive run-time since the number of leaf blocks is relatively small. This technique is illustrated in Figure 8 . The slicing tree is built by recursively partitioning the input design. Because of specific characteristics of FPGAs, partitioning objectives have to be selected accordingly. One of the objectives is minimization of routing resource consumption. It is mainly accomplished by devising data objects that will partition in such a way as to permit the greatest number of signals to traverse the shortest distances along the fewest routing channels with the least crossovers. This most often means placing interconnected objects adjacent to each other with related elements aligned to the routing axes.
Because of the granularity of FPGAs (the area is in terms of CLBs rather than in terms of microns),⁷ reducing unused area is a very important objective. To achieve this, objects with similar sizes are placed adjacent to each other because this minimizes the wasted area. Sometimes, this will conflict with the objective of putting strongly connected blocks adjacent to each other. We introduce a cutting edge threshold in our algorithm to trade off between area and performance. The cutting edge threshold is a parameter obtained by calculating the average size of all the blocks to be partitioned; if some block's size exceeds the cutting edge threshold (meaning it is far bigger than the rest of the blocks), it is isolated from the rest of the blocks and becomes a subslice of the current slice. For example, the netlist shown in Figure 9(a) contains four components: Mult needs 60 CLBs, two registers need 16 CLBs each, and one Mux needs 8 CLBs. If we only consider the interconnection between them, we end up with a 12 × 12 CLB device as shown in Figure 9(b); if we consider the cutting edge threshold, the Mult is isolated from the rest of the blocks and becomes one subslice of slice 1234. The result with the cutting edge threshold is a 10 × 10 CLB device as shown in Figure 9(c); slicing with the cutting edge threshold thus produces more area-efficient results. The shape function of the entire design is computed by constructively adding the shape functions of these leaf blocks. In addition to the area of leaf blocks, the routing area used by the nets connecting these blocks also needs to be taken into account.
⁷ For example, the Xilinx XC4013 has 24 by 24 CLBs, rather than thousands by thousands of square microns as in custom design.
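The cutting edge threshold test in the Figure 9 example can be reproduced in a few lines. The sketch below assumes, as the text states, that the threshold is the average block size. The function name is hypothetical, and the square-device figure is simply the smallest square that holds the total CLB count, ignoring routing area.

```python
from math import isqrt

def split_by_threshold(blocks):
    """Separate blocks larger than the average size from the rest."""
    threshold = sum(blocks.values()) / len(blocks)
    isolated = {n: s for n, s in blocks.items() if s > threshold}
    rest = {n: s for n, s in blocks.items() if s <= threshold}
    return threshold, isolated, rest

# Block sizes in CLBs from the Figure 9 example.
blocks = {"Mult": 60, "Reg1": 16, "Reg2": 16, "Mux": 8}
threshold, isolated, rest = split_by_threshold(blocks)

# Lower bound on the side of a square device: total area with no waste.
total = sum(blocks.values())
min_side = isqrt(total - 1) + 1
```

Here the threshold is 25 CLBs, so only Mult is isolated, and the 100 CLBs in total cannot fit in anything smaller than the 10 × 10 result reported for Figure 9(c).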
The flexibility and symmetry of the CLB architecture facilitate placement and routing. For leaf blocks, the inputs, the outputs, and the function generators themselves can freely swap positions within the CLBs of the components to avoid routing congestion. So, when we build up the shape function for a slice using the shape functions of two sibling slices at level i, our main concern about routing is the routing area between those two sibling slices. Routing budgets are given to each routing area; when the expected routing resources needed exceed the budget, the upper-level slice is correspondingly "dilated" so as to meet routing needs. The expected routing resources can be obtained by estimating the interconnection count between two sibling slices, as described next.
The available routing resource budget will depend on the shapes and sizes of the two sibling blocks, as shown in Figure 10. In the intervening routing channel between the two sibling blocks, there are six single-length lines between every pair of adjacent switch matrices that are parallel to the slice orientation. In addition, we assume that double-length lines perpendicular to the slice orientation are also used in that channel, whereas the ones parallel to the slicing orientation are reserved for the parent level in the slicing tree. Let $W_p$ be the width of the current parent slice and $H_p$ be its height (both measured in units of CLBs); the total available routing budget can be calculated based on the size of the slicing cut (i.e., the length of the routing channel) between the two sibling blocks as follows.
[Equation missing: total routing budget, given separately for horizontal slicing and for vertical slicing]
where $\alpha$ is the single-length line count: 6 for bigger devices and 4 for smaller devices.
Once we get the routing resource budget for the parent slice from the shapes of the two sibling slices, we can decide whether adjustment is needed according to whether the number of interconnections exceeds the budget. If there is a need to adjust (whether horizontally or vertically), the budget is increased by adding more single-length lines so as to accommodate the extra routing requirements. Take a vertical slice as an example: whenever we add one CLB in the horizontal direction, we gain two additional columns' worth of single-length lines. Again, let $W_p$ and $H_p$ be the width and height of the parent slice and $num_{con}$ be the expected number of interconnections between the two sibling slices; we can then estimate the increased area, which is added to the total estimated area.
[Equation missing: increased area, followed by a "where" clause whose definitions are also missing]
This process of building the composite blocks is performed in a post-order manner from the leaves of the slicing tree towards the root. Thus the area of the entire design is determined. For each two sibling blocks, there exist two possible ways of generating the parent block depending on the orientation of the slice. Since only a prediction of the chip dimensions is desired, we need not perform an actual floorplan of the chip from the slicing tree. Therefore, we need not decide on the orientation of the slice line when traversing the slicing tree bottom-up. For each two siblings, two shape functions of the parent block are generated: one assuming a horizontal slice and another assuming a vertical slice. The two curves are then superimposed and a "lower bound" curve is generated by keeping only the smaller of the two slice orientations at each x as shown in Figure 11 (c). The resulting shape function is taken as the set of predicted dimension pairs for the optimal layout area of the parent block.
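The bottom-up composition of two sibling shape functions can be sketched as below. This is a simplified illustration under stated assumptions: a shape function is a list of (width, height) pairs, a vertical slice places the siblings side by side and a horizontal slice stacks them, and the routing dilation described earlier is left out. The function names are hypothetical, and the "lower bound" curve is implemented by discarding dominated points.

```python
def combine(left, right, orientation):
    """Combine two shape functions, given as lists of (width, height)."""
    shapes = []
    for w1, h1 in left:
        for w2, h2 in right:
            if orientation == "vertical":    # blocks side by side
                shapes.append((w1 + w2, max(h1, h2)))
            else:                            # horizontal: blocks stacked
                shapes.append((max(w1, w2), h1 + h2))
    return shapes

def lower_bound(shapes):
    """Keep only non-dominated (width, height) pairs."""
    best = []
    for w, h in sorted(set(shapes)):
        # Sorted by width, so keep a point only if it lowers the height.
        if not best or h < best[-1][1]:
            best.append((w, h))
    return best

def parent_shape_function(left, right):
    """Superimpose both slice orientations and keep the lower bound."""
    both = combine(left, right, "horizontal") + combine(left, right, "vertical")
    return lower_bound(both)

parent = parent_shape_function([(1, 3), (3, 1)], [(2, 2)])
```

Applying `parent_shape_function` in post-order over the slicing tree gives the shape function of the root without committing to a slice orientation at any node.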
At the end of this phase, we can estimate the area of the overall chip, according to the number of I/Os, and we can predict whether the design can be fitted into one FPGA device. If it can be fitted, we can also predict the specific XC4000 device that will be the best choice. Let $W$, $H$, and $num_{io}$ be the estimated width, height, and number of I/Os of the chip, respectively, and let $W_1$, $W_2$, $H_1$, $H_2$, $num_{io_1}$, and $num_{io_2}$ be the widths, heights, and numbers of I/Os of two consecutive devices, $device_1$ and $device_2$, respectively. If
[Condition missing]
then $device_2$ is the best choice. At this moment, we also have an approximate topology of the chip that can be used in the subsequent timing models described next.
Chip-Level Wiring Delay Estimation
The chip delay includes component delay, wire segment delay, and programmable interconnect point (PIP) delay. The wiring estimation model has two phases: predicting the pin location on each leaf block, and predicting the wiring delay.
Predict-the-Pin-Location on Each Leaf Block.
Given an input RT level design, our chip-level area model described in Section 6.2.2 outputs an approximate floorplan that provides estimates of the relative locations of the constituent blocks. To better estimate chip-level timing, pin location must be either known or estimated. On those blocks that have been predesigned, the pin locations are known. For other components that have not been laid out yet, we must estimate "preferred" location for each pin. Pin location can be determined by evaluating the approximate topology of the design. The chip area estimation process determines the approximate locations of the blocks in the design, taking routing area into account. For each net, first we identify the source pin; then we identify load pins and their associated blocks. By evaluating the mean location of these blocks, a "preferred" location of each source pin is first determined. Then, by finding the shortest Manhattan distance between each pair of source and destination blocks, a "preferred" location of each sink pin can also be determined.
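The two pin-placement steps can be sketched as follows. The representation is an assumption made for illustration: each block is a rectangle (x, y, width, height) in CLB units, block location is taken as the rectangle center, and a pin is placed at the point of its block nearest to the target. All function names are hypothetical.

```python
def center(block):
    """Center of a block given as (x, y, width, height)."""
    x, y, w, h = block
    return (x + w / 2, y + h / 2)

def nearest_point(block, target):
    """Point of the block rectangle closest to target (Manhattan distance)."""
    x, y, w, h = block
    tx, ty = target
    return (min(max(tx, x), x + w), min(max(ty, y), y + h))

def preferred_pins(source, dests):
    """Source pin faces the mean destination; sink pins face the source pin."""
    centers = [center(d) for d in dests]
    mean = (sum(c[0] for c in centers) / len(centers),
            sum(c[1] for c in centers) / len(centers))
    source_pin = nearest_point(source, mean)
    sink_pins = [nearest_point(d, source_pin) for d in dests]
    return source_pin, sink_pins

# One source block with two destination blocks to its right.
source_pin, sink_pins = preferred_pins((0, 0, 2, 2), [(4, 0, 2, 2), (4, 4, 2, 2)])
```

Clamping the target into the rectangle gives the closest point of the block in both coordinates at once, which is why it also minimizes the Manhattan distance.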
For the example shown in Figure 12, the source pin of source block S has four destination blocks D1 through D4 as its loads. By evaluating the mean location of D1 through D4, we can get the preferred source pin location shown in Figure 12(a). Then, by finding the shortest Manhattan distance between S and D1, S and D2, S and D3, and S and D4, a preferred location of each sink pin can be determined, as shown by the circles in Figure 12(b).
Predict Wiring Delay.
To predict the delay between points A and B, $D(A, B)$, in Figure 13, the Manhattan distance x and y values (in units of CLBs) are first calculated. Then, a wire type (single-length line, double-length line, or long line) is assigned to that wire as described in the following section. This decides the number of PIPs and number of segments between points A and B. Subsequently, the point-to-point delay (pin-to-pin delay without fanout effects) $D_{pp}(A, B)$ can be calculated. Finally, the delay with fanout effects $D(A, B)$ can be obtained by adjusting $D_{pp}(A, B)$ with a fanout factor as described in the following.
To predict the wire type, the algorithm mainly checks the interconnect wire lengths x and y, respectively. First, long lines are assigned to all the wires that are longer than 8 CLBs in either direction. Then, single-length lines are assigned to all wires that are shorter than 2 CLBs. Note that single-length lines cannot be connected to double-length lines. Thus, if one segment of a wire is assigned to a single-length line, then the other segment of the wire is also assigned to a single-length line if its length is
From Section 2.1.2, we know that net length does not necessarily correlate well with the actual delay. Therefore, we use an empirical model to characterize the delay-versus-wiring-type relationship. Our empirical model is based on a large number of observations obtained by using Xilinx's XDM layout tool to place and route a set of benchmarks and analyzing the delay of each point-to-point connection using Xdelay, the Xilinx timing analysis tool. We found that it is satisfactory to approximate the delay as a function of (1) the number of PIPs the connection goes through in the X and Y directions, respectively, and (2) the corresponding segment delays. Let us denote the delay for each PIP in the programmable switch matrices as d_pip, and the delay for each segment as d_seg.
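As a rough illustration of the two ideas above, the sketch below classifies one direction of a connection by its length and evaluates a delay that is linear in the PIP and segment counts. The thresholds of 8 and 2 CLBs come from the text; treating every other length as a double-length line is an assumption, the rule that couples single-length and double-length segments is not modeled, and the function names and any values passed for `d_pip` and `d_seg` are hypothetical rather than the characterized parameters.

```python
def wire_type(length_clbs):
    """Classify one direction (X or Y) of a connection by its length in CLBs."""
    if length_clbs > 8:
        return "long"      # from the text: longer than 8 CLBs
    if length_clbs < 2:
        return "single"    # from the text: shorter than 2 CLBs
    return "double"        # assumption: all remaining lengths use double-length lines

def linear_delay(n_pips, n_segs, d_pip, d_seg):
    """Delay of one direction as PIP count times PIP delay plus
    segment count times segment delay."""
    return n_pips * d_pip + n_segs * d_seg
```

Keeping the counts as explicit arguments leaves the choice of how many PIPs and segments each wire type needs to the caller, which is where the empirical characterization would plug in.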
For a 2-point net (A, B), the point-to-point delay will be the summation of such delays in both the X and Y directions. Let x and y be the Manhattan distances of (A, B) in the X and Y directions, respectively (both measured in units of CLBs). If only single-length lines are used, they will pass through x and y segments, and through x + 1 and y + 1 PIPs, in the X and Y directions, respectively. Double-length lines need one PIP in every other CLB and, similarly, for segments at the same distance as the single-length line interconnection. Long lines with the same length will not go through any PIPs, and the delay is approximated as being proportional to the wire length. Thus the point-to-point delay (pin-to-pin delay) will be:
and the associated parameters are listed in Figure 13. When the number of fanouts of a net, say f, is larger than one, the delay on each sink pin j (j = 1, . . . , f) will be affected by the delay on the rest of the sink pins k (k = 1, . . . , f; k ≠ j) on the net. Let i be the source pin. For each sink pin j (j = 1, . . . , f), the point-to-point delay without fanout effects D_pp(i, j) is first computed. Afterwards, the delay with fanout effects, denoted D(i, j), is obtained by adjusting D_pp(i, j) using the following formula.
where ε, a fanout adjustment factor, is experimentally obtained as 2.5. We can see that the fanout delay effect at the chip level is quite large. This is because, at the chip level, part of the fanout effect can be masked by the components. For example, in Figure 14, net n fans out from block A to two other blocks, B and C, so its RT-level fanout is 2. However, the net actually feeds 5 CLBs when the design is flattened. At the end of this step, we have a netlist that contains the components' delays and the estimates of net delay. This information will be subsequently used during binding.
Predict Clock Cycle Length. A typical timing model for digital systems is shown in Figure 15. The datapath part is composed of datapath logic blocks and the data registers. Data registers are used to store data inputs, outputs, and intermediate values in the datapath. The controller can be implemented either as a Mealy or a Moore model. A Moore model is more widely used for high-speed synchronous systems and is also easier to synthesize automatically. Thus, our timing model assumes that the controller is implemented as a Moore finite state machine with status and control registers. A Moore controller consists of two combinational logic blocks, the next state logic and the output logic; one or more control registers stores
Thus the overall system can be modeled as a network of combinational logic blocks separated by registers. In this case, the worst case register-to-register delay is estimated and is output as a lower bound on the clock period for single phase clocking. Since the controller has status and control registers, the clock cycle is determined by the worst case register-to-register delay that will fall either completely inside the datapath or completely within the controller. Our work concentrates on the datapath area and delay metrics.
Compute Data Path Delay.
A typical datapath operation involves reading operands from the registers, computing the result in the FUs, and finally writing the result back into a destination register. The input multiplexors are at the input ports of FUs, and the output multiplexors are at the output ports of FUs. The path delay is determined by the register-to-register delay. Based on our architectural model shown in Figure 15, we can specify the path delay by the following equation, where w_rm is the wire delay from register to input multiplexor, w_mf is the wire delay from input multiplexor to FU, w_fm is the wire delay from FU to output multiplexor, and w_mr is the wire delay from output multiplexor to register.
From the component library, we can get the delay of the different components. From the distance metrics obtained by the ChipEst-FPGA, we can calculate the wire length and use our estimation tool to get the wire delay [Ramachandran et al. 1992] . One example is shown in Figure 18 (b).
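A minimal sketch of how such a path delay could be accumulated is given below. The four wire delays use the names defined above; the component delay names `d_imux`, `d_fu`, and `d_omux` are assumptions introduced only for this illustration, and the simple sum is a plausible reading of the architectural model rather than a reproduction of the paper's equation.

```python
def datapath_path_delay(w_rm, w_mf, w_fm, w_mr, d_imux, d_fu, d_omux):
    """Combinational delay along register -> input mux -> FU -> output mux -> register.

    w_rm, w_mf, w_fm, w_mr: wire delays between consecutive stages.
    d_imux, d_fu, d_omux:   component delays (hypothetical names), e.g. taken
                            from a precharacterized component library.
    """
    return w_rm + d_imux + w_mf + d_fu + w_fm + d_omux + w_mr
```

With wire delays of 1.0, 0.5, 0.5, and 1.0 and component delays of 2.0, 10.0, and 2.0 (made-up numbers in arbitrary time units), the path delay is 17.0.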
6.3.5 Determining the Clock Cycle Time. The total execution time of a design is given as the number of time steps times the clock period. The number of time steps is determined by scheduling and allocation and is known once the RT-level design is generated. The minimum possible clock period is determined by the worst case register-to-register delay. Given the delay of a combinational block, we can determine the register-to-register delay between its input and output registers as shown in Figure 16. Let t_cb be the worst case delay through that block, t_p(R_in) be the propagation delay through the input register, and t_setup(R_out) be the setup time of the output register. The delay between R_in and R_out is estimated as
$$D(R_{in}, R_{out}) = t_p(R_{in}) + t_{cb} + t_{setup}(R_{out})$$
Note that t_cb can be either in the controller (either state logic or output logic) or in the datapath. In the latter case, Equation (1) can be used, and the minimum possible clock period is estimated as
$$T_{clk}^{min} = \max_{i,j} D(R_i, R_j)$$
where D(R_i, R_j) is the register-to-register delay between R_i and R_j, which are assumed to be connected through a combinational logic block. Note that our timing models are kept simple due to run-time efficiency constraints. Our goal here is not to provide accurate timing analysis of the design. Rather, the aim is to provide the higher-level tools with an early assessment of design cost and performance. However, the designer can easily apply more accurate timing analysis models using the delay estimates of the various blocks and interconnections that are produced by ChipEst-FPGA (i.e., a forward annotated RT-level netlist).
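The clock period estimate of this section can be illustrated with a few lines of code. The sketch assumes each register-to-register path is summarized by three numbers (input register propagation delay, worst case combinational delay, output register setup time); the function names are hypothetical.

```python
def reg_to_reg_delay(t_p_in, t_cb, t_setup_out):
    """Propagation delay of the input register, plus the combinational
    block delay, plus the setup time of the output register."""
    return t_p_in + t_cb + t_setup_out

def min_clock_period(paths):
    """Worst case register-to-register delay over all paths.

    paths: iterable of (t_p_in, t_cb, t_setup_out) tuples.
    """
    return max(reg_to_reg_delay(*p) for p in paths)

def execution_time(num_time_steps, clock_period):
    """Total execution time: number of time steps times the clock period."""
    return num_time_steps * clock_period
```

For two paths with combinational delays of 17.0 and 12.0, a register propagation delay of 1.0, and a setup time of 0.5 (made-up values), the minimum clock period is 18.5, and a 4-step schedule takes 74.0 time units.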
LAYOUT-DRIVEN BINDING TECHNIQUES FOR HIGH-LEVEL SYNTHESIS
Referring to Figure 6, given a scheduled SDFG and a resource allocation, an initial fully connected netlist is generated based on the architectural model of Section 4. The netlist is then run through ChipEst-FPGA [Xu and Kurdahi 1997a] to obtain approximate topology and timing information. The layout information obtained from the estimation (described in Section 6) is used to estimate the delay of every register-to-register path in the RTL design. This allows us to construct a distance matrix, graphically illustrated in Figure 18. This distance matrix is used by our binding algorithm to assess the delay of a candidate solution.
The backbone of our approach is a branch-and-bound search algorithm. We sequentially perform binding one control step at a time. Within each control step, and for each operation in that step, FU and storage binding are performed simultaneously by finding a virtual binding for the operation first, then for its output variables (O_var), and finally for its input variables (I_var). The actual binding will not be finalized until all the virtual bindings have succeeded. The search space can be illustrated with a tree having three levels of hierarchy as shown in Figure 17. The first level is for FU, the second level is for O_var, and the third level is for I_var. At the FU level, the depth of the tree is equal to the number of operators (OP_i: the ith operator) in the control step, and each path is a virtual binding for all the previous objects (an object can be FU, I_var, or O_var). For example, the path from the root to node M means OP_1 is bound to FU_1 and OP_2 is bound to FU_2. After finishing FU binding, the binding procedure proceeds to the O_var and I_var levels.
The search order of the algorithm is decided by a seed value that is provided at the start. Different seeds result in different orders of subtrees and different orders within each level of hierarchy during the search. Also, a backtracking mechanism enables the algorithm to backtrack to the higher level of the virtual binding solution when the current virtual binding fails and to resume the binding process from there. The search space can be huge, and it is unrealistic to evaluate all the possible solutions. Thus the layout information from ChipEst-FPGA is used to confine our exploration space to a subset of promising solutions, as described in Section 7.1.
Let us use a three-dimensional graph to express register-FU-register paths, as shown in Figure 18. For example, the shaded square in Figure 18 stands for a path from r_1 to ADD_1 to r_2 (for operation chaining, the FU plane stands for chained operators). Using layout information, we can calculate the delay for each register-FU-register path. We also define the cutoff point for binding, which is the number that decides whether a path will be used: only paths with delay smaller than the cutoff point can be used in binding. Thus, by setting the cutoff point for binding, we can control the size of our search space. When the cutoff point is smaller than a certain value, there may not be enough paths for binding. We call this limit point the cutoff-point threshold.
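The effect of the cutoff point on the search space can be sketched with a small dictionary keyed by (input register, FU, output register). The delays below are made-up numbers, not values from Figure 18, and the function names are hypothetical. The comparison is written as `<=` so that a cutoff equal to the largest delay keeps every path of the fully connected netlist; how the boundary case is handled in the actual tool is an assumption.

```python
def usable_paths(delay, cutoff):
    """Keep the register-FU-register paths whose delay does not exceed the cutoff."""
    return {path: d for path, d in delay.items() if d <= cutoff}

def resources_on(paths):
    """Input registers, FUs, and output registers that still appear on some path."""
    r_in = {p[0] for p in paths}
    fus = {p[1] for p in paths}
    r_out = {p[2] for p in paths}
    return r_in, fus, r_out

# Hypothetical delays for three register-FU-register paths.
delay = {
    ("r1", "ADD1", "r2"): 21.0,
    ("r1", "ADD1", "r3"): 34.5,
    ("r2", "MUL1", "r3"): 40.0,
}
```

Lowering the cutoff from 40.0 to 35.0 removes the only path through `MUL1`, so any binding that needs a multiplier becomes infeasible. This is the situation the cutoff-point threshold describes.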
Details of each of the remaining steps in the overall flow shown in Figure 6 are discussed in the following sections. The path delay is computed as shown in Section 6.3.4.
Set Cutoff Point for Binding
Knowing all the path delays, we can set the cutoff point to decide whether a given path can be used for binding (for multicycled operations, the partial path is identified). Let us denote the initial cutoff point for binding as CT_init and the cutoff point for the current iteration as CT_current. Let cr_delay_prev be the critical path delay of the previous binding solution and β be the factor used in choosing the current cutoff point. The user can decide whether β should be equal to 10, 100, 1,000, . . . so that a tradeoff can be made between the time spent on exploration and the number of solutions explored.
The initial and current cutoff points can both be obtained by the following equations.
$$CT_{init} = \max\{\, Delay_{i,j,k} \mid i, k = 0, 1, \ldots, r;\ j = 0, 1, \ldots, f \,\} \qquad (2)$$
where Delay_{i,j,k} is the delay from the ith register through the jth FU to the kth register, r is the number of registers, and f is the number of FUs. For different cutoff points, the RT netlist will be different. This way, we can explore different designs by adjusting the cutoff point. The initial binding is performed on the fully connected netlist, and the initial cutoff point equals the maximum register-to-register delay. Then, for each iteration, a new cutoff point is decided based on the current critical path delay. The idea behind this is to drop the current longest path and to explore different RT netlist binding solutions. Although the way of selecting the cutoff point is straightforward, the cutoff point plays an important role in our approach. By decreasing the cutoff point gradually, we effectively categorize the binding solutions into several groups. Once the cutoff point is given, we try to find a solution that meets the constraint instead of finding the best solution (i.e., the one with the lowest clock cycle) for that given cutoff point. Later, we can search for better solutions by further lowering the cutoff point.
The delay of the longest path is reduced every time CT_current is calculated. In this way, we can guarantee that a different binding solution will be generated each time, although the performance of the final layout may not necessarily be better.
Feasibility Check
During binding, a feasibility check is needed to determine if there are enough paths with delay less than the cutoff point to perform binding. A feasibility check includes two tasks: compatibility check and resource check.
Given an operation and a FU, the compatibility check in FU binding determines whether the operator can be bound to the FU. Also, given an input or output variable and a register, the compatibility check in register binding determines whether the variable to be bound is compatible with all the variables already bound to the target register by analyzing the lifetimes of these variables.
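The two compatibility checks can be sketched as follows. The representation is an assumption made for illustration: a variable lifetime is a half-open interval of control steps `(birth, death)`, and an FU is described by the set of operation types it can execute. The function names are hypothetical.

```python
def lifetimes_overlap(a, b):
    """True if two half-open lifetimes (birth, death) share a control step."""
    return a[0] < b[1] and b[0] < a[1]

def register_compatible(var_lifetime, bound_lifetimes):
    """A variable may share a register only if its lifetime overlaps
    none of the lifetimes already bound to that register."""
    return not any(lifetimes_overlap(var_lifetime, other)
                   for other in bound_lifetimes)

def fu_compatible(op_type, fu_op_types):
    """An operation may be bound to an FU only if the FU implements its type."""
    return op_type in fu_op_types
```

With half-open intervals, a variable that dies at step 2 can share a register with one born at step 2, whereas lifetimes `(1, 3)` and `(2, 4)` conflict.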
The other task is called the resource check. Once a cutoff point for binding is given as a constraint, those paths whose delay exceeds the constraint can be eliminated from consideration. We can now compute the number of FUs, input registers, output registers, and input/output registers on the remaining paths. Then we compare them with the required number of FUs, input registers, output registers, and input/output registers, and identify whether the available resources are sufficient to perform the binding successfully. The feasibility check is carried out every time a new object (FU, O_var, or I_var) has been virtually bound. This speeds up our search algorithm and stops the algorithm whenever the cutoff point reaches the cutoff-point threshold.
Binding
As we mentioned in Section 5, we use a branch-and-bound algorithm to search for different binding possibilities sequentially, one control step at a time. The inputs to this algorithm consist of the source object to be bound, a set of target object candidates, and the allocation of resources. Within each control step, the virtual binding is carried out in the order of FU, O_var, I_var, and multiplexors. The FU, O_var, and I_var binding call the same recursive binding procedure (which is outlined in Algorithm 1) to generate all the different possible solutions, and the feasibility check in every step prunes the infeasible solutions as early as possible.
At the end of its execution, the algorithm either generates an actual binding solution if one exists under the given cutoff point or reports that no feasible solution is available together with the best result that can be achieved.
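The recursive procedure can be illustrated by a generic depth-first search with feasibility pruning and backtracking. This is a sketch in the spirit of the description above, not a reproduction of Algorithm 1: the `feasible` callback stands in for the compatibility and resource checks, the candidate lists stand in for the seed-dependent search order, and all names are hypothetical.

```python
def bind(objects, candidates, feasible, partial=None):
    """Bind each object to one candidate target, or return None.

    objects:    list of source objects (operations or variables), in binding order.
    candidates: dict mapping each object to its targets, in search order.
    feasible:   callback taking the partial (virtual) binding and returning
                False as soon as it can no longer lead to a solution.
    """
    partial = {} if partial is None else partial
    if len(partial) == len(objects):
        return dict(partial)              # all virtual bindings succeeded
    obj = objects[len(partial)]
    for target in candidates[obj]:
        partial[obj] = target             # virtual binding
        if feasible(partial):
            result = bind(objects, candidates, feasible, partial)
            if result is not None:
                return result
        del partial[obj]                  # backtrack and try the next target
    return None                           # no feasible binding on this branch
```

For two operations in one control step and two FUs, a feasibility check that forbids sharing an FU within the step yields `OP1 -> FU1, OP2 -> FU2`. If both operations can only use `FU1`, the search reports that no binding exists.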
We need to mention here that, for the interconnection from registers to FUs, there are two different assignments since each FU has two input ports. By assigning interconnections to the ports differently, the multiplexor cost (i.e., size and number) will be different. So our interconnection binding not only includes the compatibility check, which checks whether two interconnections can share the same multiplexor, but also attempts to minimize the size of the multiplexors and the number of interconnections. Basically, we try two different assignments for each interconnection, check the multiplexor cost, and select the cheaper one. Once the FU and O_var have been virtually bound, the interconnection from the FU to O_var has also been bound.
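The port assignment step can be illustrated with a greedy sketch for a single two-input FU. It tries both assignments of each operand pair and keeps the one that adds fewer multiplexor inputs. Using the total number of distinct registers feeding the two ports as the cost is an assumption, as is the premise that the operands may be swapped (which holds only for commutative operations); the function name is hypothetical.

```python
def assign_ports(operand_pairs):
    """Greedy input-port assignment for one two-input FU.

    operand_pairs: list of (reg_a, reg_b), one pair per operation bound to the FU.
    Returns the sets of registers feeding port 0 and port 1; the size of each
    set is the number of inputs of the multiplexor in front of that port.
    """
    port0, port1 = set(), set()
    for a, b in operand_pairs:
        straight = len(port0 | {a}) + len(port1 | {b})
        swapped = len(port0 | {b}) + len(port1 | {a})
        if swapped < straight:
            a, b = b, a               # the swapped assignment is cheaper
        port0.add(a)
        port1.add(b)
    return port0, port1
```

For the operand pairs `(r1, r2)` and `(r2, r1)`, swapping the second pair leaves a single register on each port, so no multiplexor is needed at either input.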
Pruning
If the binding succeeds, the algorithm will proceed to the next step, pruning. In this step, all the unnecessary interconnections will be pruned, all the unnecessary multiplexors will be deleted, and finally the size of the multiplexors will be shrunk according to the actual interconnection information. When the multiplexors are changed, new types of multiplexors may be generated. The algorithm will then update the area and timing information based on the component information in the library or by invoking CompEst-FPGA [Xu and Kurdahi 1996] . At the end of this step, a custom RTL netlist will be generated.
Layout Adjustment
At this point, if the clock period exceeds the maximum clock period, layout adjustment will be invoked to rerun our ChipEst-FPGA on the pruned RTL netlist based on new multiplexors and interconnection information. Usually this will minimize the wasted layout area and improve the performance of the final design.
After layout adjustment, if the cycle time still does not satisfy the maximum clock period constraint, we need to reset the cutoff point and redo the binding. This iteration continues until the cutoff point reaches the threshold. Among all the results from different iterations, the best result is kept and its estimated floorplan information is forwarded to the Xilinx software tool through the constraints file (.cst file). In most cases, the Xilinx software will generate a similar final floorplan during partitioning, placement, and routing. This way, we only evaluate a set of possible solutions to see whether a final solution can be found. Our experimental results in Figure 20 show that a solution is very likely to be found if one exists, even though we only evaluate a small subset of the possible solutions.
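The overall iteration can be summarized by the following sketch. It is a simplification under stated assumptions: `try_bind` stands in for binding, pruning, and layout adjustment and returns the resulting clock period (or `None` when the feasibility check fails), and `next_cutoff` stands in for the rule that derives the current cutoff from the previous critical path delay, which is deliberately left to the caller here. Both callbacks and the function name are hypothetical.

```python
def explore(delay, max_clock, try_bind, next_cutoff):
    """Lower the cutoff point until the clock constraint is met or binding fails.

    delay:       dict mapping each register-FU-register path to its delay.
    max_clock:   maximum clock period constraint.
    try_bind:    callback(paths) -> clock period of a binding, or None.
    next_cutoff: callback(clock) -> new, strictly smaller cutoff point.
    Returns the best clock period found, or None if no binding exists.
    """
    cutoff = max(delay.values())          # initial cutoff, as in Equation (2)
    best = None
    while True:
        paths = {p: d for p, d in delay.items() if d <= cutoff}
        clock = try_bind(paths)
        if clock is None:                 # cutoff-point threshold reached
            return best
        if best is None or clock < best:
            best = clock                  # keep the best result so far
        if clock <= max_clock:
            return best                   # constraint satisfied
        cutoff = next_cutoff(clock)       # drop the current longest path
```

The loop terminates as long as `next_cutoff` returns a value strictly below the clock period it is given, since every iteration then removes at least the current longest path.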
EXPERIMENTAL RESULTS
Experimental Procedure
We have implemented our layout-driven RTL binding techniques for HLS in C on a Sun SPARC workstation. The designs used to test our binding techniques are taken from well-known high-level synthesis benchmarks. The first example is the second-order differential equation solver (HAL), which consists of 6 multiplication operations, 2 additions, 2 subtractions, and 1 comparison. In this example, we assume that multiplications are performed by multipliers, and additions, subtractions, and the comparison are performed by ALUs. The second example is the fifth-order elliptic wave filter (EWF), which consists of 8 multiplications and 26 additions. The third example is the discrete cosine transformer (DCT), which consists of 16 multiplications, 25 additions, and 7 subtractions. In the second and third examples, we assume that multiplications are performed by multipliers, and additions and subtractions by ALUs. The bit-width of all the examples is set to 4.
The datapath components can be obtained from our library in which all the components' layout and timing information is precharacterized. The components can be implemented by different tools such as the Xilinx hard macro library, xblox, and designware. Alternatively, we use GENUS, a generic component generator, to generate the logic equation according to the desired functionality and our component estimator can be invoked to estimate its shape function. Then we use Synopsys synthesis tools to optimize and synthesize the design. After synthesis, the components are translated to gate-level netlist (in xnf format) and fed into the Xilinx partitioning, placement, and routing tool (ppr) by giving different constraints to get different aspect ratios. For each specific placement and routing, we use the Xilinx delay analysis tool, Xdelay, to get the delay information. Thus we generate a shape function for each component similar to the one shown in Figure 5 .
Results
The first set of experiments was run on the DCT, EWF, and HAL examples. Figure 19 shows the results. The estimated clock cycle at the register-transfer level is shown in the third column and only includes FU and register delays, since we typically have no interconnection or layout information at that level. We get the cutoff point for binding (column 4) by using Equations (2) and (3). The clock period without pruning (column 5) is the cycle time after we perform the binding and can be further used to get the next cutoff point. If the binding succeeds, we construct the actual RTL netlist and get its actual cycle time (column 6). If this still cannot satisfy the maximum clock period constraint, we further optimize the cycle time by layout adjustment (column 7). These results clearly indicate that: (1) layout and interconnection delays are significant since they may contribute as much as 50% of the overall delay; and (2) by varying the cutoff point, we can explore a set of alternative binding solutions with varying clock cycle times. Our techniques are efficient since each solution takes less than a minute of CPU time in all cases.
To test the robustness of our branch-and-bound algorithm, we ran another set of experiments using the DCT example. Given different seeds, the algorithm described in Section 7.3 will search for solutions in a different order and may find different bindings for the same cutoff point. We tried 7 different search orders and collected the results for the different cutoff points. We compared their best-case cycle times and cutoff-point thresholds, as shown in Figure 20. In each case, there is only a small variation in the best-case cycle time. This shows that our small set of explorations is not only efficient, but also sufficient for finding the best solution in most cases. For the 7 sets of runs, in the worst case, 7 to 8 minutes were required to find the best solution that could be achieved.
In order to assess the accuracy of our ChipEst-FPGA, we fed the same RT-level VHDL file into our ChipEst-FPGA to produce estimates of the chip area and delay using the models described in Section 6. Several benchmarks can be found in Xu and Kurdahi [1997a]. Our estimation accurately predicted the exact device type needed every time. For performance estimation, there were some differences between estimated and measured values. The average estimation error for performance was about 6.0%, whereas the worst-case error was 18.7%. Figure 21 shows three examples we tested for EWF. The error for performance is less than 4.0%. The ChipEst-FPGA estimates are at least an order of magnitude faster to obtain than the results of the actual layout process. This clearly indicates that ChipEst-FPGA can be efficiently used to provide fast and accurate feedback to our binding process, allowing it to make better informed design decisions. Since the binding methodology hinges on the ability to compare designs, a crucial factor to test for was the fidelity of the estimator. Fidelity is the ability of the estimator to correctly rank different design alternatives vis-à-vis the actual implementation [Kurdahi et al. 1993]. The last two columns in Figure 21 compare the relative ranking of the three EWF designs (in terms of clock cycle length) as determined by ChipEst-FPGA and the corresponding ranking based on the evaluation of the actual design as measured by running PPR and Xdelay. The results show perfect fidelity of the estimator.
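The two quality measures used above, estimation error and fidelity, can be computed as in the sketch below. Fidelity is expressed here as the fraction of design pairs that the estimator orders the same way as the measurements, which is one common way to quantify ranking agreement and is an assumption about the exact definition; pairs that are tied in either list count as disagreements. The function names are hypothetical, and the numbers in the usage note are made up rather than taken from Figure 21.

```python
from itertools import combinations

def percent_error(estimated, measured):
    """Absolute estimation error as a percentage of the measured value."""
    return abs(estimated - measured) / measured * 100.0

def fidelity(estimated, measured):
    """Fraction of design pairs ranked identically by estimate and measurement.

    estimated, measured: equal-length sequences (at least two designs) with
    one value per design alternative.
    """
    pairs = list(combinations(range(len(estimated)), 2))
    agree = sum(
        (estimated[i] - estimated[j]) * (measured[i] - measured[j]) > 0
        for i, j in pairs
    )
    return agree / len(pairs)
```

Three designs with estimated clock cycles of 10, 12, and 15 and measured cycles of 11, 12.5, and 14 are ranked identically, giving a fidelity of 1.0 even though none of the individual estimates is exact.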
CONCLUSION
We presented a binding approach that simultaneously binds FUs, registers, and interconnections and also uses an accurate layout estimator to simultaneously produce an RTL solution and a corresponding floorplan. Future work will incorporate the layout effects into scheduling tasks in HLS.
