In this work we address the problem of managing interconnect timing in high-level synthesis by generating a layout-friendly microarchitecture. A metric called spreading score is proposed to evaluate the layout-friendliness of microarchitectural netlist structures. For a connected piece of netlist, spreading score measures how far the components can be spread from each other with bounded length for every wire. The intuition is that components in a layout-friendly netlist (e.g., a mesh) can spread over the layout region without introducing long interconnects. We propose a semidefinite programming relaxation to allow efficient estimation of spreading score, and use it in a high-level synthesis tool. On a number of test cases, a normalized spreading score shows a stronger bias in favor of interconnect structures that have better timing after layout, compared to the widely used metric of total multiplexer inputs. We also justify our metric and motivate further study by relating spreading score to other metrics and problems for layout-friendly synthesis.
INTRODUCTION
High-level synthesis (HLS) is the process of automatically generating RTL models from behavioral specifications. Compared to the traditional RTL-based design flow, the potential advantages of HLS include better management of design complexity, code reuse across platforms and performance targets, and easy design space exploration. HLS is getting wide adoption. In our experience, a primary challenge to HLS is the generation of results that consistently meet timing (i.e., the frequency target). We consider timing to be a vital factor to the success of HLS, not just because an RTL implementation that cannot meet timing is often unacceptable, but because actions the designer could take to circumvent timing failures tend to seriously undermine the advantages of HLS.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC '12, June 3-7, 2012
A straightforward approach to fixing timing failures is to manipulate the RTL code directly, but this requires an understanding of the generated code, and is time-consuming and error-prone. Some tools allow specification of explicit clock boundaries and/or sharing decisions. For example, the Handel-C language requires cycle-accurate input [1]. While this approach is useful when fine-tuning is needed, it is questionable whether the level of abstraction is really raised with this method. In the extreme case, input to the HLS tool is just an RTL model specified in a high-level language. Furthermore, code reuse and design space exploration can become more difficult when decisions on scheduling/sharing are included in the specification. Another common practice is to add an extra timing margin to tolerate excessive interconnect delays. This solves the problem only partially, because the delay of a long interconnect can still exceed the target clock period, especially when the synthesis engine does not model interconnect complexity well. This approach often leads to unnecessarily long latency and low throughput, and is thus undesirable in many target applications of HLS such as signal processing. In addition, a large timing margin can cause overhead in area and power because the synthesis tool needs to use faster components and insert more registers.
Therefore, an HLS tool needs to manage interconnect delay intelligently in order to fully realize its advantages. This is challenging due to the absence of information about the gate-level netlist and the layout. Most existing solutions to the problem can be categorized as follows.
1. Use a regular architecture for global interconnects. The chip area is typically divided into regular islands (or clusters), where inter-island data transfers are performed on regular multicycle interconnects, such as a mesh [6, 19, 22]. The approach is effective for managing global interconnects; yet the regularity is a strong unnecessary constraint and can lead to suboptimal performance and resource usage.
2. Incorporate a rough layout. There are numerous efforts to combine HLS with floorplanning to help interconnect estimation and optimization [9, 10, 13, 26, 28]. This is quite a natural approach and can potentially work well. However, since layout itself is nontrivial, implementation of a stable and fast layout engine in the inner iteration of microarchitecture optimization is a challenge.
3. Use structural metrics to evaluate netlist structures. It is recognized that interconnect complexity (and thus timing) depends largely on netlist structure. An HLS tool can easily explore many different microarchitectures, guided by structural metrics. Such metrics are usually derived from a graph representation of the netlist without performing layout. Widely used structural metrics include total multiplexer inputs [4, 14, 15, 20], number of global interconnects [7, 21], adhesion [17], etc. These metrics generally lead to efficient heuristics, but they cannot guarantee the interconnect delay after layout.
In this paper we propose a new structural metric called spreading score. The intuition is that components in a layout-friendly netlist can often be spread apart from each other without introducing many long interconnects, and that such long interconnects should have larger allocated slacks. Spreading score captures these properties in a mathematical programming formulation and can be estimated efficiently using semidefinite programming (SDP). Compared to the approach of performing a layout, our metric is stable and fast, because the globally optimal solution to the SDP problem can be obtained in polynomial time. Compared to previous structural metrics, our metric is more layout-oriented. Experimental evaluation using a normalized spreading score to guide HLS optimization shows encouraging timing improvement on a series of test cases without large area overhead, when compared to previous metrics such as total multiplexer inputs.
The remainder of this paper is organized as follows. In Section 2 we describe spreading score as the optimal value of an optimization problem. An SDP relaxation is presented in Section 3 to allow efficient estimation. Experimental results are reported in Section 4. A few interesting connections with other problems and metrics are discussed in Section 5, followed by a conclusion in Section 6.
THE SPREADING METRIC
An HLS tool typically represents an optimized input specification as a control-data flow graph (CDFG). The synthesis engine performs module selection, operation scheduling and resource sharing, and then generates a microarchitecture-level netlist. The netlist often consists of components (including functional units, registers, memories, I/O ports, multiplexers, pre-synthesized modules, etc.) and wires which connect the components. To simplify the discussion, we first consider the simple case where each component has only one output port, and delays between all input ports and the output port are the same.
We construct a directed graph $G = (V, E)$ to model the component-level connectivity, where $V = \{1, 2, \ldots, n\}$ is the set of vertices with each representing a component, and $E \subseteq V \times V$ is the set of directed edges with each representing a wire from the source component to the sink component. Note that an edge is present only when there are data transfers between the two components; if two components are connected in the netlist only because they are both sinks of a net, no edge is created between the corresponding vertex pair. In addition, connections from a component to itself are discarded to avoid self-loops in the graph; this is reasonable because such connections can be regarded as local interconnects within the component.
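The graph construction rules above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the net representation (a driver name paired with a list of sink names), the function name `build_graph`, and the component names are all hypothetical.

```python
def build_graph(nets):
    """Build the component-level directed graph from a list of nets.

    `nets` is a hypothetical representation: each net is a pair
    (driver, sinks). One edge is created per driver-to-sink data
    transfer; sinks of the same net are NOT connected to each other,
    and self-loops are dropped.
    """
    vertices = set()
    edges = set()
    for driver, sinks in nets:
        vertices.add(driver)
        for sink in sinks:
            vertices.add(sink)
            if sink != driver:          # discard connections to itself
                edges.add((driver, sink))
    return sorted(vertices), sorted(edges)


# Illustrative netlist: an adder feeding a register and a multiplexer,
# a register that feeds the adder and holds its own value, and a
# multiplexer feeding the adder.
nets = [
    ("add0", ["reg0", "mux0"]),
    ("reg0", ["add0", "reg0"]),
    ("mux0", ["add0"]),
]
vertices, edges = build_graph(nets)
```

Here `reg0` and `mux0` share a driver but get no edge between them, and the `reg0` feedback connection is dropped as a local interconnect.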
A layout of the netlist is regarded as an embedding of $G$ in the 2-dimensional Euclidean space $\mathbb{R}^2$. Each vertex $i$ is associated with a column vector $p_i = (x_i, y_i)^T$ to represent its position in the embedding. The length of the connection $(i, j) \in E$ can be measured as the Euclidean distance in $\mathbb{R}^2$, i.e., $\|p_i - p_j\|$.
Consider the following optimization problem.
Here $w = (w_1, w_2, \ldots, w_n)^T$ is the nonnegative weight vector with $w_i$ being the area of component $i$; $l_{ij}$ is the maximum allowed length for the wire connecting $i$ and $j$. The objective function measures how far components are spread from their weighted center of gravity, using a weighted 2-norm of the distance vector. Thus the problem in Eqn. 1 asks to maximize component spreading, under the constraint that the length of every connection $(i, j) \in E$ does not exceed $l_{ij}$.
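A formulation consistent with this description can be sketched as follows. It is a reconstruction under assumptions, modeled on the embedding problem of [12], and not necessarily the exact display the authors used. In particular, pinning the weighted center of gravity at the origin is an assumption, taken from the later remark in Section 3 that the origin is the weighted center of gravity.

```latex
\begin{aligned}
\underset{p_1, \ldots, p_n \in \mathbb{R}^2}{\text{maximize}} \quad
  & \sum_{i=1}^{n} w_i \, \|p_i\|^2 \\
\text{subject to} \quad
  & \sum_{i=1}^{n} w_i \, p_i = 0, \\
  & \|p_i - p_j\| \le l_{ij}, \qquad \forall (i, j) \in E .
\end{aligned}
```

With the center of gravity fixed at the origin, $\|p_i\|$ is the distance from component $i$ to the center, so the objective is the weighted sum of squared distances to the center.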
With the proper selection of $l_{ij}$, we claim that the optimal value of the above problem can be used to evaluate the layout-friendliness of a netlist. This is based on the following observation: if components in a netlist can be spread over the chip area without introducing long wires, it will be easy for the layout tool to remove overlaps between components without a significant increase in interconnect delay.
This argument can be supported by examining well-known hand-designed interconnect topologies. For example, mesh [6, 19], ring [16] and counterflow pipeline [24] can all spread without long interconnects, and they are regarded as scalable and layout-friendly topologies; on the other hand, spreading the full crossbar or hypercube on the 2D plane inevitably introduces long interconnects, and these topologies are generally much more expensive in interconnect cost.
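The contrast between a ring and a fully connected structure can be checked with a small calculation. The sketch below assumes unit component weights and a maximum wire length of 1; the helper names are hypothetical. A ring of $n$ components can be placed on a circle whose adjacent chords have length exactly 1, whereas for a fully connected netlist the identity $\sum_i \|p_i - c\|^2 = \frac{1}{n}\sum_{i<j}\|p_i - p_j\|^2$ caps the spread at $(n-1)/2$ for any embedding.

```python
import math


def spread(points):
    """Sum of squared distances to the center of gravity (unit weights)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)


def ring_embedding(n, wire_len=1.0):
    """Place n ring nodes on a circle so that adjacent nodes are wire_len apart."""
    radius = wire_len / (2.0 * math.sin(math.pi / n))
    return [
        (radius * math.cos(2.0 * math.pi * k / n),
         radius * math.sin(2.0 * math.pi * k / n))
        for k in range(n)
    ]


n = 16
ring = ring_embedding(n)
ring_score = spread(ring)            # a feasible value, so a lower bound for the ring

# Fully connected case: every pair is at distance <= 1, hence
# spread = (1/n) * sum_{i<j} ||p_i - p_j||^2 <= (1/n) * n(n-1)/2.
complete_upper_bound = (n - 1) / 2.0

print(f"ring (feasible placement): {ring_score:.2f}")
print(f"fully connected (upper bound): {complete_upper_bound:.2f}")
```

For 16 components the ring placement already scores more than ten times the best value any fully connected netlist could reach, which matches the intuition that rings spread well and crossbar-like structures do not.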
Note that increasing the allowed wire length $l_{ij}$ can often lead to better spreading in Eqn. 1, and this explains why an extra timing margin can help layout. However, $l_{ij}$ is limited by timing constraints in practice. To capture first-order timing information, we use $d_{ij}$ to denote the delay of the wire $(i, j) \in E$, and consider it to be a monotone single-variate function of wire length,
$$d_{ij} = D(l_{ij}). \tag{2}$$
Two additional variables $t_i$ and $\tau_i$ are attached to each $i \in V$, where $t_i$ denotes the arrival time (after the clock edge) at the input ports of $i$, and $\tau_i$ denotes the arrival time at the output port. We then have
$$t_j \ge \tau_i + d_{ij}, \qquad \forall (i, j) \in E. \tag{3}$$
If the corresponding component is combinational, we use $d_i$ to denote its delay, and then
$$\tau_i = t_i + d_i. \tag{4}$$
Otherwise, if the component is sequential, $\tau_i$ will be a constant, and $t_i$ should be bounded by the required time at the input of $i$, $T_i$, that is,
$$t_i \le T_i. \tag{5}$$
For a register $i$, $T_i$ can be regarded as the clock period minus the setup time. A primary output also has a required time that depends on the interface timing specification. We can then treat $l_{ij}$ as variables and optimize spreading under timing constraints.
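Combining the spreading problem with the timing relations described above gives a joint problem over positions, wire lengths and arrival times. The following is a sketch of what such a formulation looks like, assembled from the descriptions in this section. The centering constraint and the exact form of each timing relation are assumptions rather than a verbatim copy of the authors' display.

```latex
\begin{aligned}
\underset{p_i,\; l_{ij},\; t_i,\; \tau_i}{\text{maximize}} \quad
  & \sum_{i=1}^{n} w_i \, \|p_i\|^2 \\
\text{subject to} \quad
  & \sum_{i=1}^{n} w_i \, p_i = 0, \\
  & \|p_i - p_j\| \le l_{ij}, && \forall (i, j) \in E, \\
  & t_j \ge \tau_i + D(l_{ij}), && \forall (i, j) \in E, \\
  & \tau_i = t_i + d_i, && \forall i \text{ combinational}, \\
  & t_i \le T_i, && \forall i \text{ sequential}.
\end{aligned}
```

Longer wires allow more spreading but consume slack through $D(l_{ij})$, so the optimizer has to decide which wires receive the available slack.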
The formulation in Eqn. 6 effectively combines interconnect slack allocation with node spreading, and captures both the structural properties and the timing properties of the netlist. We refer to the optimal value of the above problem as the spreading score of the netlist.
The graph construction and labeling procedure can be extended easily to handle more complex cases. For example, for a component with multiple input ports and multiple output ports, if the delay varies significantly between different inputs and outputs, we can create a vertex for each port, so that the delay between each pair of ports can be characterized individually as done in [18]; constraints on the distances between ports can be enforced to keep the geometry of the component. Similar treatment of a very large component can make the estimation of interconnect length aware of port positions, instead of regarding all ports as being located at the center of the component. The required time at a port can be manipulated easily to capture nontrivial situations like multicycle paths, multiple clock domains, or other complex I/O timing requirements.
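To make the timing side of the formulation concrete, the sketch below checks whether a given assignment of wire lengths satisfies the arrival-time and required-time relations of this section. It is an illustration only: the function name `check_timing`, the dictionary-based data layout, the numeric values, and the quadratic wire-delay function used in the example are all assumptions.

```python
from collections import defaultdict, deque


def check_timing(edges, wire_len, comb_delay, launch_time, required_time, wire_delay):
    """Propagate arrival times and test required times.

    edges         : list of (i, j) wires
    wire_len      : dict (i, j) -> l_ij
    comb_delay    : dict i -> d_i for combinational components
    launch_time   : dict i -> constant tau_i for sequential components
    required_time : dict i -> T_i for sequential components
    wire_delay    : function l -> D(l), assumed monotone
    Returns (ok, t) where t maps each component to its input arrival time.
    """
    fanin, fanout = defaultdict(list), defaultdict(list)
    for i, j in edges:
        fanin[j].append(i)
        fanout[i].append(j)

    def arrival(v, tau):
        # Latest arrival over all incoming wires; 0.0 if there are none.
        return max((tau[u] + wire_delay(wire_len[(u, v)]) for u in fanin[v]),
                   default=0.0)

    tau = dict(launch_time)
    t = {}
    # Topological traversal of the combinational components.
    pending = {v: sum(1 for u in fanin[v] if u in comb_delay) for v in comb_delay}
    ready = deque(v for v, k in pending.items() if k == 0)
    while ready:
        v = ready.popleft()
        t[v] = arrival(v, tau)
        tau[v] = t[v] + comb_delay[v]
        for s in fanout[v]:
            if s in pending:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    if len(t) != len(comb_delay):
        raise ValueError("combinational loop detected")

    for v in required_time:
        t[v] = arrival(v, tau)
    ok = all(t[v] <= required_time[v] for v in required_time)
    return ok, t


# Register -> adder -> register, with an assumed delay model D(l) = 0.1 * l^2.
edges = [("r1", "add"), ("add", "r2")]
comb_delay = {"add": 1.0}
launch_time = {"r1": 0.2, "r2": 0.2}
required_time = {"r1": 2.9, "r2": 2.9}
quadratic = lambda l: 0.1 * l * l

ok_short, t_short = check_timing(
    edges, {("r1", "add"): 1.0, ("add", "r2"): 2.0},
    comb_delay, launch_time, required_time, quadratic)
ok_long, t_long = check_timing(
    edges, {("r1", "add"): 1.0, ("add", "r2"): 5.0},
    comb_delay, launch_time, required_time, quadratic)
```

With the shorter output wire the path meets its required time; stretching that wire to length 5 makes the same path fail, which is the trade-off the joint formulation optimizes over.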
EFFICIENT EVALUATION
It is difficult to solve the problem in Eqn. 6 directly, because maximizing a convex function (like the objective function in Eqn. 6) is generally NP-hard (note that minimizing a convex function is easy). We hereby propose a tractable relaxation and use the solution of the relaxed problem to estimate the spreading score.
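For the graph $G$ with $n$ vertices, we use a $2 \times n$ matrix $P = (p_1, p_2, \ldots, p_n)$ to represent its embedding in $\mathbb{R}^2$, i.e.,
$$P = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \end{pmatrix}. \tag{7}$$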
Let $Q = P^T P$. Then $Q$ is a symmetric positive semidefinite matrix with rank at most 2, and
$$Q_{ij} = p_i^T p_j. \tag{8}$$
We can use Q as variables in the formulation in Eqn. 6 without losing any useful information, as indicated by the following theorem.
Theorem 1. Given a positive semidefinite matrix $Q$, we can always reconstruct the embedding of the graph, in the sense that the distance between any pair of vertices is preserved.
Proof. Since $Q$ is positive semidefinite, we can perform a Cholesky decomposition and get a matrix $U = (u_1, u_2, \ldots, u_n)$ so that $Q = U^T U$. Let $P = (p_1, p_2, \ldots, p_n)$ be another matrix such that
$$P^T P = Q. \tag{9}$$
Then $\|p_i - p_j\|^2 = Q_{ii} + Q_{jj} - 2Q_{ij} = \|u_i - u_j\|^2$. Thus $\|p_i - p_j\| = \|u_i - u_j\|$; this means that pairwise distances between vertices are determined by $Q$.
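The fact that $Q$ alone carries all distance information can be checked numerically. The sketch below uses plain Python lists and hypothetical helper names; it builds $Q = P^T P$ for three points and recovers squared distances and the weighted spreading objective from the entries of $Q$ without looking at the coordinates again.

```python
def gram(points):
    """Return Q = P^T P for a list of 2-D points (the columns of P)."""
    return [[ax * bx + ay * by for (bx, by) in points] for (ax, ay) in points]


def sq_dist_from_gram(Q, i, j):
    """||p_i - p_j||^2 expressed with entries of Q only."""
    return Q[i][i] + Q[j][j] - 2.0 * Q[i][j]


def weighted_spread_from_gram(Q, w):
    """sum_i w_i * ||p_i||^2, which reads only the diagonal of Q."""
    return sum(wi * Q[i][i] for i, wi in enumerate(w))


points = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]
Q = gram(points)

d02 = sq_dist_from_gram(Q, 0, 2)        # squared length of the 3-4-5 hypotenuse
d12 = sq_dist_from_gram(Q, 1, 2)        # squared length of the vertical side
obj = weighted_spread_from_gram(Q, [1.0, 2.0, 1.0])
```

Both the distance constraints and the objective are linear in the entries of $Q$, which is what makes $Q$ a convenient optimization variable.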
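Using Eqn. 8, we can rewrite the objective and constraint functions in Eqn. 6 as follows.
$$\sum_{i=1}^{n} w_i \|p_i\|^2 = \langle \mathrm{diag}(w), Q \rangle \tag{10}$$
$$\|p_i - p_j\|^2 = \langle E_{ij}, Q \rangle \tag{11}$$
Here $\mathrm{diag}(w)$ is the $n \times n$ diagonal matrix with $w$ on its diagonal. $e_i$ is the $i$th standard basis vector in $\mathbb{R}^n$, and we define the matrix $E_{ij} = (e_i - e_j)(e_i - e_j)^T$ to simplify the equations. $\langle X, Y \rangle$ is the element-wise inner product (Frobenius inner product) of matrices $X$ and $Y$, i.e., $\langle X, Y \rangle = \sum_i \sum_j X_{ij} Y_{ij}$. Then we can rewrite the problem in Eqn. 6 to use $Q$ as variables.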
The above problem is equivalent to that in Eqn. 6, and is thus equally hard. Yet after relaxation of the rank constraint, the resulting problem is easy to solve when a quadratic delay model is used, that is, $D(l_{ij}) = \alpha l_{ij}^2$, where $\alpha$ is a constant that depends on technology. Then we get the following relaxed problem.
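Under the quadratic delay model, the wire delay is a linear function of the squared wire length, and the squared length is a linear function of the Gram matrix $Q = P^T P$. A relaxed problem of this kind can be sketched as below. This is a reconstruction under assumptions rather than the authors' exact display: the matrix name $E_{ij} = (e_i - e_j)(e_i - e_j)^T$, the centering constraint written as $\langle w w^T, Q \rangle = 0$, and the use of the wire delays $d_{ij}$ as variables in place of $l_{ij}$ are all choices made here for illustration.

```latex
\begin{aligned}
\underset{Q,\; d_{ij},\; t_i,\; \tau_i}{\text{maximize}} \quad
  & \langle \mathrm{diag}(w),\, Q \rangle \\
\text{subject to} \quad
  & \langle w w^{T},\, Q \rangle = 0, \\
  & \alpha \, \langle E_{ij},\, Q \rangle \le d_{ij}, && \forall (i, j) \in E, \\
  & t_j \ge \tau_i + d_{ij}, && \forall (i, j) \in E, \\
  & \tau_i = t_i + d_i, && \forall i \text{ combinational}, \\
  & t_i \le T_i, && \forall i \text{ sequential}, \\
  & Q \succeq 0 .
\end{aligned}
```

Every constraint is linear in $(Q, d, t, \tau)$ except $Q \succeq 0$, and the only condition dropped from the exact problem is $\mathrm{rank}(Q) \le 2$.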
This problem is convex. In fact, it can be solved as an SDP problem. Like linear programs, SDP problems can be solved to any fixed accuracy in polynomial time, and efficient solvers have been developed in recent years [2]. Due to page limitations, we will not discuss background on convex programming and SDP here. Interested readers may refer to [3, 25] on these topics.
The problem in Eqn. 13 essentially asks for an embedding in $\mathbb{R}^n$ instead of $\mathbb{R}^2$. Because relaxing the rank constraint enlarges the feasible set of a maximization problem, its optimal value is an upper bound on the spreading score. It would be interesting to see how tight the bound is. For this, we refer to the following result.
Theorem 2 (Göring, Helmberg and Wappler [12]). For a relaxed version of the problem in Eqn. 1 with $p_i \in \mathbb{R}^n$, an optimal embedding always exists in $\mathbb{R}^{\mathrm{tw}(G)+1}$, where $\mathrm{tw}(G)$ is the tree-width [23] of $G$.
Although a rigorous proof is yet to be derived, we conjecture that the same result holds for the problem in Eqn. 13; this is based on the intuition that additional variables for slack allocation do not interfere with variables for graph embedding. The result implies low distortion when the optimal solution in $\mathbb{R}^n$ is embedded back in $\mathbb{R}^2$. In the extreme case where $\mathrm{tw}(G) = 1$ (i.e., the netlist is a tree), an optimal solution always exists in $\mathbb{R}^2$, and our relaxation is exact. This can also be explained intuitively as follows: for a vertex $i$ at position $p_i$, the direction from the origin (the weighted center of gravity) to $p_i$ is the direction of steepest ascent for the objective function, and this direction lies within the subspace spanned by the existing position vectors, because $p_i = -\frac{1}{w_i} \sum_{j \neq i} w_j p_j$; thus the objective function intrinsically prefers moves that do not increase the dimension of the embedding.
The quadratic delay model is used in Eqn. 13 to simplify the relaxation. It is also possible to use the linear delay model, but that leads to more variables and less sparse matrices in the formulation.
EXPERIMENTAL EVALUATION

Normalization
While spreading score characterizes the layout-friendliness of a given netlist, using it directly to compare different netlists tends to favor netlists with larger area, because more components and larger weights can increase the spreading score naturally. To avoid this problem, we can normalize the spreading score of a netlist against that of a uniform mesh with the same area, and use the resulting value in comparison.
Consider 
This means that the spreading score of a mesh grows quadratically with regard to the total area. Thus, we can divide the spreading score by $\left(\sum_{i=1}^{n} w_i\right)^2$ to get a normalized value, which can be used to compare different netlists without a bias on area.
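The quadratic growth can be illustrated with a direct calculation. The sketch below assumes an $m \times m$ mesh of unit-area components with unit maximum wire length, placed on an integer grid. The grid is a feasible placement, so its spread is a lower bound on the mesh's spreading score rather than the optimum of the relaxed problem. The function names are hypothetical.

```python
def mesh_spread(m):
    """Spread of an m x m unit grid (unit weights) about its center of gravity."""
    c = (m - 1) / 2.0
    return sum((x - c) ** 2 + (y - c) ** 2 for x in range(m) for y in range(m))


def normalized(score, weights):
    """Divide a spreading score by the square of the total area."""
    return score / sum(weights) ** 2


for m in (4, 8, 16, 32):
    n = m * m                       # total area with unit-area components
    score = mesh_spread(m)
    print(m, score, round(normalized(score, [1.0] * n), 5))
```

In closed form the grid spread is $m^2(m^2 - 1)/6$, so dividing by the squared total area $m^4$ gives a value that approaches $1/6$ as the mesh grows, independent of its size.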
Experiment Setup
We have implemented a simulated annealing algorithm to perform microarchitecture exploration in the xPilot HLS tool [5]. Based on an initial solution, perturbations are performed to generate alternative microarchitectures. Feasibility and cost are evaluated to decide whether the new solution is accepted. We compare two cost functions: (A) a weighted sum of total area and normalized spreading score (with a negative weight, i.e., a larger spreading score leads to a smaller cost), and (B) a weighted sum of total area and total multiplexer inputs. In both cases, the feasibility check includes a legality check (dependency, resource hazard, combinational loop, and performance constraint) and a timing check (without considering interconnect delay). Area and timing information about components is obtained from a precharacterized library. Random perturbations are performed in the following ways.
• Move an operation from one functional unit to another.
• Move a variable from one register to another.
• Merge two functional units or registers.
• Insert (or delete) an additional register before the input port of a component, if the input data have been buffered at least two cycles before their use. This creates a multicycle interconnect and does not require changes in operation scheduling.
• Reschedule an operation one cycle earlier or later, and update the schedules of related operations if necessary.
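The exploration loop described above can be sketched generically. This is not the xPilot implementation: the function names, the geometric cooling schedule, the acceptance rule, and the weights in `cost_a` are assumptions chosen for illustration, and the toy problem at the end only stands in for a real microarchitecture.

```python
import math
import random


def anneal(initial, perturb, cost, feasible,
           steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Generic simulated annealing: perturb, check feasibility, accept or reject."""
    rng = random.Random(seed)
    current, cur_cost = initial, cost(initial)
    best, best_cost = current, cur_cost
    temp = t0
    for _ in range(steps):
        cand = perturb(current, rng)
        if feasible(cand):
            c = cost(cand)
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / temp):
                current, cur_cost = cand, c
                if c < best_cost:
                    best, best_cost = cand, c
        temp *= cooling
    return best, best_cost


def cost_a(area, norm_spreading, w_area=1.0, w_spread=0.5):
    """Cost (A): area plus normalized spreading score with a negative weight."""
    return w_area * area - w_spread * norm_spreading


# Toy stand-in for a design: an integer state with a quadratic cost.
toy_cost = lambda x: (x - 3) ** 2
toy_perturb = lambda x, rng: x + rng.choice((-1, 1))
toy_feasible = lambda x: 0 <= x <= 40
best, best_cost = anneal(20, toy_perturb, toy_cost, toy_feasible)
```

In a real flow, `perturb` would apply one of the five moves listed above, `feasible` would run the legality and timing checks, and `cost` would call either cost (A) or cost (B).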
When estimating spreading score, we construct sparse matrices to describe the problem in Eqn. 13 and use CSDP 6.1.1 [2] to solve the SDP problem. To speed up the solver, we use the solution of the SDP problem for the previous netlist as the starting point when solving the problem for the perturbed netlist. In addition to the primal solution, CSDP is able to give solutions to the dual problem as well. The dual solution can potentially provide information for sensitivity analysis and can be used to guide the perturbation; however, this is not yet implemented.
We perform HLS on several designs and implement the results in a 65nm ASIC technology, using Synopsys Design Compiler and IC Compiler. In all cases, the layout region is a square and the target density is 80%. Since the designs are blocks to be integrated in a larger chip, we only use lower-level metal layers (M1 to M5) to route wires in the design; upper metal layers are reserved for system-level connections and power/clock networks.
Results and Analysis
We take a few snapshots of the simulated annealing process and plot the area, total multiplexer inputs and spreading score for each snapshot in Figure 1 for the design "QAM." For both cases, the area and spreading score are normalized so that either metric in the initial solution (with no sharing) is one, and the total multiplexer inputs metric is normalized so that it is one in the final solution of the case with optimization for cost (A). From Figure 1, we observe that total area is generally reduced in the optimization process. This is probably because the design has many compatible operations that can share functional units, and sharing them does reduce total area despite potential area overheads caused by the added multiplexers. The initial netlist without any sharing has the best normalized spreading score. This is not due to the bias toward larger area because normalization has been performed; it can be explained by the fact that sharing often creates connections between components that were not directly connected, and thus makes the netlist harder to spread. While the optimization of the normalized spreading score and total area tends to limit total multiplexer inputs, optimizing the total multiplexer inputs directly does not necessarily lead to a better normalized spreading score. With a similar number of total multiplexer inputs in the netlist, the normalized spreading score can still vary significantly; this indicates that normalized spreading score and total multiplexer inputs point to different optimization directions.
We report the worst-case slack and area after implementation for each benchmark in Table 1. In Table 1, clock targets and slacks are in ns; area is in nm$^2$. Modern RTL synthesis tools typically have a large amount of freedom in making trade-offs between timing and area/power through logic refactoring, cell selection, buffer insertion, etc. Therefore, the achieved clock period after layout tends to be very close to the clock target, making advantages in timing less obvious. Despite this effect, the approach that optimizes for the normalized spreading score does lead to consistently better timing. More significant advantages are expected with less powerful downstream tools and simpler libraries. Due to the trade-off between area and timing, a better structured netlist tends to have more slack, which in turn gives the RTL synthesis tool more freedom in reducing area. On the other hand, less aggressive sharing may be needed to obtain a more layout-friendly netlist in HLS, and this can increase the overall area. These two effects push area in opposite directions; we are unable to draw a conclusion as to which cost function leads to superior area.
In our experiments, evaluation of the normalized spreading score leads to a significantly longer runtime of the tool. Theoretically, the worst-case complexity of SDP is $O(n^3)$, while feasibility checks and evaluation of total multiplexer inputs can all be finished in $O(n)$. For the design "Industry2" with more than 500 operations, the tool with cost (A) takes about 30 minutes on a workstation with dual 2.6GHz CPUs and 4GB memory, while the tool with cost (B) takes about 5 minutes on the same workstation. Fortunately, a hierarchical design style is often used in engineering practice, and this helps to control the number of components in a given level of hierarchy to make our approach feasible for very large designs.
FURTHER DISCUSSION
In this section we relate spreading score to other problems and metrics, to further study its properties and to motivate future research.
One may want to measure how far the vertices are spread by looking at pairwise distances, instead of distances between vertices and their center. Consider the embedding $(p_1, p_2, \ldots, p_n)$ centered at $c = \frac{1}{n} \sum_{i=1}^{n} p_i$. Then
$$\sum_{i<j} \|p_i - p_j\|^2 = n \sum_{i=1}^{n} \|p_i - c\|^2.$$
Thus the two metrics differ only by a constant factor $n$ in the unweighted version. However, the weighted sum of squares for pairwise distances, i.e., $\sum_i \sum_j w_{ij} \|p_i - p_j\|^2$, does offer more flexibility, because of the larger number of weights ($n^2$, compared to $n$ in the formulation for spreading score). The reformulation and relaxation techniques in Section 3 can still be applied to handle the revised formulation with pairwise distances; the only difference is that the coefficient matrix in the objective function will be dense (as opposed to diagonal in Eqn. 13). The added flexibility is useful for certain purposes. For example, when two components are connected by a path with many registers and plenty of slack, their distance in the embedding is probably long. In such a case, further increasing their distance does not offer a clear advantage, and then we can reduce the corresponding weight in the objective.
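The factor-of-$n$ relation between the pairwise measure and the center-based measure is easy to verify numerically in the unweighted case. The sketch below uses random points and hypothetical function names; the pairwise sum runs over unordered pairs.

```python
import random


def center_spread(points):
    """Sum of squared distances from each point to the centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)


def pairwise_spread(points):
    """Sum of squared distances over all unordered pairs of points."""
    n = len(points)
    return sum(
        (points[i][0] - points[j][0]) ** 2 + (points[i][1] - points[j][1]) ** 2
        for i in range(n) for j in range(i + 1, n)
    )


rng = random.Random(1)
pts = [(rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)) for _ in range(7)]
gap = abs(pairwise_spread(pts) - len(pts) * center_spread(pts))
print(f"difference between the two sides: {gap:.2e}")
```

The difference is zero up to floating-point rounding, so with unit weights either measure ranks embeddings identically; only the weighted pairwise form adds modeling freedom.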
We now discuss the relation between the proposed embedding and placement. We conjecture that our embedding problem is related to the dual of the placement problem in some sense. Roughly speaking, the placement problem asks to minimize wire length, with lower bounds on pairwise distances so that components do not overlap; the embedding problem asks to maximize pairwise distances, with upper bounds on wire length. Such "duality" indicates connections between spreading score and wire length after layout. This further justifies the use of spreading score as an estimator of layout-friendliness.
Kudva, Sullivan and Dougherty propose to use the sum of all-pairs min-cut (in contrast to the sum of all-pairs distance in an embedding) to evaluate the adhesion of a gate-level netlist, and use it in logic synthesis to improve routability [17]. Their idea of using a structural metric to guide the generation of a layout-friendly netlist influences our work. However, we use different approaches: their metric is related to techniques used in the analysis of social networks [27], which have roots in classic graph theory; our technique is influenced by the geometric embeddings of graphs and the associated algebraic structures. We consider our metric advantageous in two aspects: (1) it is more layout-oriented; (2) thus it has the ability to capture timing information by relating interconnect delay to distance. On the other hand, the adhesion metric is probably advantageous in another two aspects: (1) it can be evaluated more efficiently; (2) since cut size is used, it may be more closely related to average wire density in layout, and thus help to reduce congestion. (An approximation of adhesion can be obtained in $O(n^2)$. The complexity of SDP with a fixed error bound is $O(n^3)$ for the dense case; sparsity and incremental solving can improve scalability for our problem.)
Spreading score is indirectly related to cut size as well, as suggested by the following result from [12]. For a simplified version of the problem with $w_i = 1$ and $l_{ij} = 1$, the dual problem of the relaxation in Eqn. 13 can be transformed to an SDP formulation for calculating $n / \hat{a}(G)$. Here $\hat{a}(G)$ is the absolute algebraic connectivity of graph $G$, and it is shown to be related to the node connectivity as well as the edge connectivity of $G$ [11]. According to SDP duality theory, the estimated spreading score we get is equal to $n / \hat{a}(G)$.
One limitation of spreading score is that it does not capture timing related to control signals (from the FSM controller) very well. This is because RTL synthesis tools often change the FSM controller drastically in optimization. Capturing controller timing is intrinsically difficult without logic synthesis. In our implementation, we exclude the FSM when constructing the graph; instead, we generate a Moore-style one-hot FSM to alleviate the problem.
CONCLUSION
A new metric of a netlist, spreading score, is proposed to measure layout-friendliness. It captures both structural properties and timing information, and can be estimated efficiently and stably. The usefulness of spreading score has been justified both theoretically and experimentally. New techniques introduced in this paper can potentially lead to interesting observations and solutions for other related metrics and problems. We consider this an interesting direction for future work.
