The "quadratic placement" methodology is rooted in [6] [14] [16] and is reputedly used in many commercial and in-house tools for placement of standard-cell and gate-array designs. The methodology iterates between two basic steps: solving sparse systems of linear equations, and repartitioning. This work dissects the implementation and motivations for quadratic placement. We first show that (i) Krylov subspace engines for solving sparse systems of linear equations are more effective than the traditional successive over-relaxation (SOR) engine [15] and (ii) order convergence criteria can maintain solution quality while using substantially fewer solver iterations. We then discuss the motivations and relevance of the quadratic placement approach, in the context of past and future algorithmic technology, performance requirements, and design methodology. We provide evidence that the use of numerical linear systems solvers with quadratic wirelength objective may be due to the pre-1990's weakness of min-cut partitioners, i.e., numerical engines were needed to provide helpful hints to min-cut partitioners. Finally, we note emerging methodology drivers in deep-submicron design that may require new approaches to the placement problem.
Introduction
In the physical implementation of deep-submicron ICs, the solution quality of row-based placement is a major determinant of whether timing correctness and routing completion will be achieved. In row-based placement, the first-order objective has always been obvious: place connected cells closer together so as to reduce total routing and lower bounds on signal delay. This implies a minimum-wirelength placement objective. Because there are many layout iterations, and because fast (constructive) placement estimation is needed in the floorplanner for design convergence, a placement tool must be extremely fast. As instance sizes grow larger, move-based (e.g., annealing) methods may be too slow except for detailed placement improvement. Due to its speed and "global" perspective, the so-called quadratic placement technique has received a great deal of attention throughout its development by such authors as Wipfler et al. [16], Fukunaga et al. [9], Cheng and Kuh [6], Tsay and Kuh [15] and others. Indeed, quadratic placement is reputedly an approach that has been used within commercial tools for placement of standard-cell and gate-array designs.
This work was supported by a grant from Cadence Design Systems Inc.
Design Automation Conference
Copyright 1997 by the Association for Computing Machinery, Inc. DAC 97, 06/97, Anaheim, CA, USA.
This work revisits the quadratic placement methodology and addresses both its mode of implementation and its relevance to future design automation requirements. The remainder of our paper is organized as follows. Section 2 develops notation and synthesizes a generic model of quadratic placement. Section 3 briefly summarizes our experiments comparing Krylov-subspace solvers and successive over-relaxation (SOR) solvers. Section 4 proposes a new order convergence approach for the quadratic solver. Section 5 analyzes the interaction between the linear system solver and the coupled min-cut partitioning step. Finally, Section 6 discusses the relevance of the quadratic placement approach in the context of past and future algorithmic technology, as well as performance and design methodology requirements.
The Quadratic Placement Methodology

Notation and Definitions
A VLSI circuit is represented for placement by a weighted hypergraph, with n vertices corresponding to modules (vertex weights equal module areas), and hyperedges corresponding to signal nets (hyperedge weights equal criticalities and/or multiplicities). The two-dimensional layout region is represented as an array of legal placement locations. Placement seeks to assign all cells of the design onto legal locations, such that no cells overlap and chip timing and routability are optimized. The placement problem is a form of quadratic assignment; most variants are NP-hard.
Since the numerical techniques used for "quadratic placement" apply only to graphs (hypergraphs with all hyperedge sizes equal to 2), it is necessary to assume some transformation of hypergraphs to graphs via a net model. Throughout we use the standard clique model, but we have also obtained similar results for the clique model of [15] as well as for a directed star model.¹
Definition:
The $n \times n$ Laplacian $Q = (q_{ij})$ has entry $q_{ij}$ equal to $-a_{ij}$ for $i \neq j$ and diagonal entry $q_{ii}$ equal to $\sum_{j \neq i} a_{ij}$, i.e., the sum of edge weights incident to vertex $v_i$.
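The definition above can be illustrated with a minimal sketch that builds $Q$ from a small hypergraph under a clique net model. The function name `clique_laplacian`, the example netlist, and the per-edge weight `weight / (k - 1)` for a k-pin net are assumptions made for illustration; the text does not specify the clique weighting, and other scalings are in use.

```python
import numpy as np

def clique_laplacian(n, nets):
    """Build the n x n Laplacian Q from nets given as (weight, [vertex indices])."""
    A = np.zeros((n, n))  # symmetric matrix of edge weights a_ij
    for weight, pins in nets:
        k = len(pins)
        if k < 2:
            continue  # a single-pin net contributes no edges
        w = weight / (k - 1)  # assumed clique edge weight; other scalings exist
        for a in range(k):
            for b in range(a + 1, k):
                i, j = pins[a], pins[b]
                A[i, j] += w  # superpose edges derived from different nets
                A[j, i] += w
    # q_ii = sum of incident edge weights, q_ij = -a_ij off the diagonal
    return np.diag(A.sum(axis=1)) - A

# Hypothetical 4-module netlist: one 3-pin net (weight 1) and one 2-pin net (weight 2).
Q = clique_laplacian(4, [(1.0, [0, 1, 2]), (2.0, [2, 3])])
```

Each row of the resulting matrix sums to zero, which is why additional constraints such as fixed pads are needed before the associated linear system has a unique solution.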
"Pad" constraints fix the locations of certain vertices (typically, due to the pre-placement of IIO pads or other terminals); all other vertices are movable. The one-dimensional placement problem seeks to place movable vertices onto the real line so as to minimize an objective function that depends on the edge weights and the vertex coordinates, The n-dimensional placement vector x L ( x i ) gives physical locations of modules "1,. . . ,vn on the real line, i.e., IS the coordinate of vertex vi, The corresponding two-dimensional placement problem is addressed by the means of independent horizontal and vertical placements. Squared Wirelength Formulation:
subject to constraints Hx = b. This function can be rewritten as @(x) = $xTQx.
Minimize the objective 'For il given multipin signal net. the graph edges that represent the net may be constructed in several ways, e.g., a directed star model, an unoriented sttu model or a clique model (see [2] for B review). The resulting weighted graph representation G = ( V , E ) of the circuit topology has edge weights a,] derived by "superposing" all derived edges in the obvious manner.
The vast majority of quadratic placers in the literature solve the 2-dimensional placement problem with a top-down approach, i.e., one-dimensional placement in the horizontal direction is used to divide the circuit into left and right halves, after which a placement in the vertical direction is used to subdivide the netlist into quarters, etc.
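As a small worked example of one-dimensional squared-wirelength placement, the sketch below minimizes $\frac{1}{2} x^T Q x$ when the constraints simply pin some coordinates to given values. This is an assumption about the form of the constraints, chosen for illustration; the function name `place_1d` and the chain example are likewise hypothetical. Setting the gradient with respect to the movable coordinates to zero gives the linear system $Q_{mm} x_m = -Q_{mf} x_f$, where the subscripts m and f denote movable and fixed vertices.

```python
import numpy as np

def place_1d(Q, fixed):
    """Minimize 0.5 * x^T Q x with some coordinates fixed.

    Q     : n x n graph Laplacian
    fixed : dict {vertex index: coordinate} for pads
    Assumes every connected component contains at least one pad,
    so that the reduced matrix Q_mm is nonsingular.
    """
    n = Q.shape[0]
    f = sorted(fixed)                                # fixed (pad) vertices
    m = [i for i in range(n) if i not in fixed]      # movable vertices
    x = np.zeros(n)
    x[f] = [fixed[i] for i in f]
    Q_mm = Q[np.ix_(m, m)]
    Q_mf = Q[np.ix_(m, f)]
    # Stationarity condition of the quadratic objective in the movable block
    x[m] = np.linalg.solve(Q_mm, -Q_mf @ x[f])
    return x

# Chain 0 - 1 - 2 - 3 with unit edge weights; pads 0 and 3 fixed at 0.0 and 3.0.
Q = np.array([[ 1., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  1.]])
x = place_1d(Q, {0: 0.0, 3: 3.0})
```

In this example the two movable modules land evenly spaced between the pads. A dense direct solve is used only for brevity; placement instances of realistic size call for sparse iterative solvers.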
Essential Structure of a Quadratic Placer
We now review essential components of the quadratic placement paradigm to establish the historical couplings of numerical optimizations with min-cut optimizations or other means of "spreading out" a continuous force-directed placement. We will illustrate our discussion by referring to the PROUD algorithm of Tsay et al. This development is similar to that of other "force-directed" or "resistive network" analogies (see, e.g., [6] [9] [13] [16]). The essential tradeoff relaxes discrete slot constraints and changes the "true" linear wirelength objective into a squared wirelength objective, in order to obtain a continuous quadratic minimization for which a global optimum can be found. However, the typical resulting "global placement" concentrates modules in the center of the layout region. The key question is how the "global placement" (actually, a "continuous solution obtained using an incorrect objective") should be mapped back to the original discrete problem.
Two approaches have been used to obtain a feasible placement from a "global placement". The first approach is based on assignment, either in one step (to the entire 2-dimensional array of slots) or in two steps (to rows, and then to slots within rows) [9]. The second and more widely-used approach is partitioning: the global placement result is used to derive a horizontal or vertical cut in the layout, and the continuous squared-wirelength optimization is recursively applied to the resulting subproblems (see [6, 12, 13, 16]).
The main difficulty is making partitioning decisions on the extremely overlapped modules in the middle of the layout (see Figure 1). The obvious median-based partitioning (find the median module and use it as a "pivot") is sensitive to numerical convergence criteria. Thus, iterative improvement is commonly used to refine the resulting partitioning (see, e.g., [12]). A typical objective for the iterative improvement is some form of minimum weighted cut. Hence, quadratic placers can be quite similar in structure to top-down min-cut placers, with initial cuts induced from placements under the squared-wirelength objective.
Krylov-Subspace Solvers
Recall that quadratic placement requires solving a series of sparse systems of linear equations. We now review experiments with iterative solvers such as Successive Over-relaxation (SOR/SSOR), BiConjugate Gradient Stabilized (BiCGS) and others (see [4]), showing that the commonly used SOR/SSOR methods (developed in the early 1950's) are not the best available now.
The time complexity of an iterative solver depends on both the cost of a single iteration (which is constant during the solution of a given system) and the number of iterations needed until iterates adequately reflect the true solution. The theory of iterative methods shows that the number of iterations needed to obtain a good approximation in norm depends on the spectrum of the matrix involved [10]. Hence the idea of a preconditioner: a way to transform the original system to an equivalent one with "improved" spectrum.
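In symbols, the transformation can be written as follows for a generic system $Ax = b$. The matrix $M$ is notation introduced here for illustration only; it stands for any nonsingular approximation of $A$ that is cheap to invert.

```latex
% Left preconditioning: both systems have the same solution x.
A x = b
\quad\Longleftrightarrow\quad
M^{-1} A x = M^{-1} b .
% The iteration count is then governed by the spectrum of M^{-1} A
% rather than that of A. Two simple choices:
%   M = I                  (no preconditioning)
%   M = \mathrm{diag}(A)   (Jacobi preconditioning)
```

Each iteration must apply $M^{-1}$ to a vector, which is the source of the extra per-iteration cost.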
Because most implementations of preconditioners entail additional per-iteration cost, one must carefully examine the overall efficiency of solver/preconditioner combinations on particular classes of instances: more expensive iterations must be balanced against the number of iterations saved. We have solved a number of test systems with multiple combinations of solvers and preconditioners, and recorded number of iterations as well as CPU usage.²
We find that BiConjugate Gradient Stabilized (BiCGS) is among the best solvers; though it does not guarantee convergence, the method is good even for degenerate (not necessarily symmetric) matrices and in our experience provides more robust convergence than conjugate gradient (CG). For preconditioners, Incomplete LU-factorization and the Successive Over-relaxation family (including SSOR) are particularly successful. In many tests, the vacuous preconditioner was surprisingly competitive with the best nontrivial preconditioner (iteration count was worse, but iterations were cheaper). At the same time, using the wrong preconditioner could easily lead to a 3-fold loss in CPU time. The relative performance depicted in Table 1 is representative of our results. Table 2 shows the gain from using BiCGS over SOR.
² See [4] for pseudocodes of solvers and preconditioners that we used and their efficiency comparisons; see [10] for theoretical analyses. A brief review of the relevant numerical methods is as follows. Iterative methods for solving large systems of linear equations can be classified as stationary or non-stationary. Stationary methods include Jacobi, Gauss-Seidel, Successive Over-relaxation (SOR) and Symmetric Successive Over-relaxation (SSOR). They are older, easier to implement and computationally cheaper per iteration. Non-stationary methods include Conjugate Gradient (CG), Generalized Minimal Residual (GMRes) and numerous variations. These are relatively newer and notably harder to implement and debug, but provide for much faster convergence. Additional computational expense per iteration (sometimes by a factor of 7) is normally justified by much smaller numbers of iterations. The Jacobi method solves once for every variable with respect to the other variables. Gauss-Seidel uses updated values as soon as they are computed. SOR is a modification of Gauss-Seidel which depends on the extrapolation parameter ω (finding optimal values of ω is nontrivial, and heuristics are typically applied). SSOR has no computational advantage over SOR as a solver, but is useful as a preconditioner; one iteration is roughly twice as expensive as an SOR iteration.
CG or GMRes will generate a sequence of orthogonal vectors which are residuals of the iterates. CG is very effective when the matrix is symmetric and positive definite, while GMRes is useful for general non-symmetric matrices. Particular modifications of CG include BiConjugate Gradient (BiCG), Conjugate Gradient Squared (CGS), Quasi-Minimal Residual (QMR) and BiConjugate Gradient Stabilized (BiCGS). In BiCG, two sequences are generated which are only mutually orthogonal. The method is useful for non-symmetric nonsingular matrices; convergence is irregular and the method can break down. QMR applies a least-squares update to smoothen the irregularity of BiCG: it is more reliable but rarely faster. Further improvements are given by two variations of Transpose-Free Quasi-Minimal Residual (TFQMR and TCQMR).
CGS is a fast method with even more irregular convergence than BiCG, while BiCGS is just another way to smoothen convergence of BiCG. Finally, for positive definite matrices, the Chebyshev iteration computes coefficients of a polynomial minimizing the residual norm. Solvers which provide smooth convergence can be also used as preconditioners. Direct solvers present a different source of preconditioners for iterative methods, with examples being incomplete Cholesky (ICC), LU-factorization and incomplete LU-factorization (ILU).
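To make the stationary versus Krylov-subspace contrast concrete, the following sketch runs a textbook SOR iteration and a textbook unpreconditioned conjugate gradient iteration on one small symmetric positive definite system. It is an illustration under stated assumptions, not a reproduction of the experiments reported here: plain CG stands in for BiCGS, the relaxation parameter 1.5 is arbitrary rather than tuned, and the test matrix (a chain of 50 movable modules between two pads) is hypothetical.

```python
import numpy as np

def sor(A, b, omega=1.5, tol=1e-8, max_iter=100000):
    """Successive over-relaxation; returns (x, iterations used)."""
    n = len(b)
    x = np.zeros(n)
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        for i in range(n):
            # Row sum excluding the diagonal; uses entries already updated in this sweep
            sigma = A[i] @ x - A[i, i] * x[i]
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.linalg.norm(b - A @ x) <= tol * b_norm:
            return x, it
    return x, max_iter

def cg(A, b, tol=1e-8, max_iter=10000):
    """Unpreconditioned conjugate gradient for SPD A; returns (x, iterations used)."""
    x = np.zeros(len(b))
    r = b - A @ x          # residual of the current iterate
    p = r.copy()           # search direction
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# Hypothetical test system: tridiagonal (2, -1) matrix of a 50-module chain,
# left pad at coordinate 0 and right pad at coordinate 1.
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.zeros(n)
b[-1] = 1.0
x_sor, it_sor = sor(A, b)
x_cg, it_cg = cg(A, b)
```

Both routines use the same relative-residual stopping test, so `it_sor` and `it_cg` can be compared directly. The per-iteration costs differ, so a fair comparison on real instances should also record CPU time, as done in the experiments summarized above.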
Order Convergence Criteria
Any solver from the previous section builds a sequence of iterates that converges to the solution x of Equation (1). How soon the iteration can be stopped determines performance. Typical convergence tests are based on some norm of the residual vector for an iterate,³ which is taken to represent error with respect to the true solution. In practice, most norms are equivalent; heuristics (check convergence every j iterations, check differences of iterates rather than residual vectors, etc.) can reduce the time spent on convergence tests.
We observe that using a placement solution solely to construct an initial min-cut partitioning solution wastes information: all that is retained are memberships of vertices in "left" and "right" groups. If the final iterate will be sorted and split to induce an initial solution for the min-cut partitioner, then the iteration should terminate when further changes will be inessential to the partitioner. This depends on the strength and stability of the partitioner, but the iteration should at least stop when the left and right groups stabilize.
We now define several heuristic alternatives for what we call order convergence criteria. Consider an iterate $x_k$ from some linear system solver, whose i-th coordinate $x_k(i)$ gives the location of module $v_i$ at this iteration. The placer relies on the relative sorted order of the coordinates, rather than their absolute values, to assign modules to sub-blocks. We use a direct permutation $\pi_k^+$ to represent the ordering induced by the k-th iterate. If $v_i$ is the j-th module in the ordering induced by $x_k$, then $\pi_k^+(i) = j$; $\pi_k^-(j) = i$ defines the inverse permutation $\pi_k^-$ (see Table 3).
³ When solving the system $Ax = b$, the residual vector for a given iterate $x_k$ is $b - Ax_k$.
Table 3: Inverse permutations
| sorted index j  | 0 |  1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| $\pi_k^-$       | 4 | 10 | 3 | 8 | 5 | 9 | 2 | 1 | 0 | 6 |  7 |
| $\pi_{k+1}^-$   | 7 | 10 | 3 | 8 | 5 | 9 | 2 | 1 | 0 | 6 |  4 |
with k-th-largest element finding). Some intuition is given by:
Order Convergence Theorem. Given $x_k \to x$, $\pi_k^+$ and $\pi_k^-$ converge if computed with sufficiently small $\varepsilon > 0$.
Proof. Without loss of generality assume convergence in the $L^\infty$-norm. Choose $\varepsilon$ smaller than a suitable fraction of the smallest distance between two distinct limit values for coordinate slots. Then, starting with some N-th iteration, each coordinate will be $\varepsilon$-close to its limit value. Consequently, the values in two coordinate slots will either differ by more than $\varepsilon$ or will be $2\varepsilon$-close (if the limit values are equal).
Note that if the distance between two limit coordinate values is precisely $\varepsilon$, $\pi_k^+$ and $\pi_k^-$ may never become constant. This shows that it may be computationally difficult to reach complete order stabilization if limit values of many coordinates are close (which is the case with quadratic placement; see Figure 1). Rather than try to circumvent this phenomenon, we intend to use it (see below).
We have worked with several order convergence measures, which include the following (due to space limitations, we do not give motivations, but only constructions of certain measures). An important building block in our construction is the maximum change in sorted index that any module experiences between iterates $x_k$ and $x_{k+1}$:
$$\mathrm{maxD{+}{+}} = \max_i \left\{ \left| \pi_k^+(i) - \pi_{k+1}^+(i) \right| \right\}$$
This is the maximal placewise difference between two direct permutations, i.e., for each coordinate slot we compute the two indices to which the values sort in consecutive iterates, then take the maximum difference of these two indices over all slots. In Table 3, maxD++ = 10 since coordinate slot 4 has value 99 (index 10) in one iterate and value -99 (index 0) in the other iterate.
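The maxD++ construction can be sketched in a few lines. The function names are hypothetical, ties are broken by slot number instead of by a tolerance ε (a simplification), and the two arrays are the inverse permutations as read from Table 3.

```python
import numpy as np

def invert(perm):
    """Invert a permutation given as an integer array."""
    inv = np.empty_like(perm)
    inv[perm] = np.arange(len(perm))
    return inv

def direct_permutation(x):
    """Sorted index of each coordinate slot of the iterate x."""
    inverse = np.argsort(x, kind="stable")  # inverse[j] = slot holding the j-th smallest value
    return invert(inverse)                  # direct[i]  = sorted index of slot i

def max_d(x_k, x_k1):
    """Maximum change in sorted index between two consecutive iterates."""
    diff = direct_permutation(np.asarray(x_k)) - direct_permutation(np.asarray(x_k1))
    return int(np.max(np.abs(diff)))

# Inverse permutations as read from Table 3 (sorted index -> coordinate slot).
inv_k = np.array([4, 10, 3, 8, 5, 9, 2, 1, 0, 6, 7])
inv_k1 = np.array([7, 10, 3, 8, 5, 9, 2, 1, 0, 6, 4])
score = int(np.max(np.abs(invert(inv_k) - invert(inv_k1))))  # 10, via slot 4
```

Sorting dominates the cost, so one evaluation takes O(n log n) time for n modules, which is small next to a solver iteration on large instances only if the measure is not evaluated too often.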
In practice, a min-cut partitioner may be biased by locking modules that are at extreme indices in the sorted ordering. This decreases the size of the solution space and reduces runtime. Alternatively, modules might be left unlocked, but the starting solution constructed to capture extreme module coordinates (see the GORDIAN methodology, for example [12]).
Performance Gains
Figure 2 presents order convergence measures between consecutive solver iterates for examples case3 and avq-large with respect to a flow-based measure. One sees three⁵ distinct periods in the convergence history, which we interpret as follows:
• Rapid Order Convergence: The order of coordinates (equal to each other in the initial guess) changes rapidly, approaching that of the true solution.
• Coordinate Adjustment: The order stabilizes while coordinate values still change to approach those of the true solution.
• Coordinate Refinement: Vertices become clustered in small regions, and small changes in their coordinates (immaterial to the partitioner) produce peaks of order convergence scores.
Our intuition behind the use of such order convergence measures is as follows. During early iterations the order of coordinates is expected to change because the initial guess has very little in common with the true solution. These order changes will decrease as the iterates approach the solution. However, when the iterates are very close to the true solution, modules will become concentrated in small areas, and order changes will increase due to small displacements of many closely located vertices, which should be immaterial to a strong partitioner. Therefore, one should stop iterations when order convergence measures start increasing, or earlier if they become close enough to zero.
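The stopping rule described in the preceding paragraph might be sketched as follows. The function name, the `near_zero` threshold, and the comparison of two adjacent windows of scores to detect an increase are all assumptions made for illustration; the text does not prescribe how "start increasing" should be detected.

```python
def stop_iteration(scores, near_zero=1, window=3):
    """Return the index of the first score at which to stop, or None.

    scores    : order convergence scores for consecutive iterate pairs
    near_zero : stop as soon as a score is this small
    window    : number of scores averaged to smooth out single peaks
    """
    for t in range(len(scores)):
        if scores[t] <= near_zero:
            return t  # the order has essentially stabilized
        if t + 1 >= 2 * window:
            recent = sum(scores[t - window + 1 : t + 1]) / window
            earlier = sum(scores[t - 2 * window + 1 : t - window + 1]) / window
            if recent > earlier:
                return t  # smoothed scores have started to increase
    return None

# Hypothetical score history: rapid decrease, then growth as modules cluster.
stop_at = stop_iteration([40, 25, 12, 6, 4, 3, 5, 9, 14])
```

Averaging over a window is one way to keep an isolated peak from triggering an early stop; the window length trades responsiveness against robustness.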
Experimental Results
We now present experimental data which yields insights regarding the proper "solver-partitioner interface" (i.e., the point at which
Figure 2: Order convergence studies for examples case3, avq-small and golem3 (scaled by 0.1) with SOR iterates. From each 5 consecutive order convergence scores computed with the flow-based 20% measure, the best 3 have been averaged. The measure used an error tolerance proportional to the size of the placement interval. The plot shows that the order of coordinates stabilizes quickly.
⁵ We also performed experiments with golem3 (103048 cells, 144949 nets and 339149 pins), but we cannot include results due to space considerations. The behavior observed with golem3 is similar to that observed with case3 and avq-large.
Interaction with the Partitioner
We analyzed SOR iterates for each of our test cases by sorting the coordinates and then inducing initial solutions for a Fiduccia-Mattheyses (FM) [8] min-cut partitioner. Our experimental procedure was as follows. (1) Initial solutions were induced by pre-seeding some percentage of leftmost and rightmost vertices in the ordering into the initial left and right partitions, respectively. Remaining vertices were assigned randomly into the initial left and right partitions. Three pre-seeding percentages of 0%, 20% and 50% were used. A pre-seeding of 0% corresponds to an initially random solution, and 50% corresponds to the solution obtained by splitting the iterate. (2) All vertices were free to move except for fixed pads, which were locked according to whether they were to the left or right of the median coordinate. (3) For each iterate, we generated multiple pre-seeded initial solutions and ran FM from each, using unit module areas and an exact bisection requirement.
Figure 3 shows the minimum cut obtained from 30 pre-seeded solutions as a function of the SOR iterate. It is clear that strong (50%) pre-seeding enables FM to return better solutions than using initially random solutions (0%) or somewhat locked solutions (20%). Note that the benefit of using a later iterate as opposed to a relatively early iterate is marginal; there is indeed a potential gain if we can apply order convergence criteria. We have also observed that not only do the cutsizes remain similar from iterate to iterate, but the solutions themselves do not vary significantly (see [3] for more details, including data showing Hamming distances for various pairs of partitioning solutions).
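Step (1) of the procedure can be sketched as below. The function name and its arguments are hypothetical, and the sketch reads the pre-seeding percentage as the share of vertices taken from each end of the ordering, so that 50% reproduces the split of the iterate at its median. Pad locking and the FM passes themselves are not modeled.

```python
import random

def preseed_bisection(coords, fraction, seed=0):
    """Build an initial bisection from one solver iterate.

    coords   : coordinate of each vertex in the iterate
    fraction : share of vertices pre-seeded at each end (0.0 to 0.5)
    Returns side[i] in {0, 1}, where 0 = left and 1 = right.
    """
    n = len(coords)
    order = sorted(range(n), key=lambda i: coords[i])
    k = int(fraction * n)
    side = [None] * n
    for i in order[:k]:
        side[i] = 0              # leftmost vertices seed the left partition
    for i in order[n - k:]:
        side[i] = 1              # rightmost vertices seed the right partition
    rest = order[k:n - k]        # remaining vertices are assigned randomly
    random.Random(seed).shuffle(rest)
    half = n // 2 - k            # free vertices still needed on the left
    for pos, i in enumerate(rest):
        side[i] = 0 if pos < half else 1
    return side

# Hypothetical 8-vertex iterate.
coords = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4]
split = preseed_bisection(coords, 0.5)     # pure median split of the iterate
partial = preseed_bisection(coords, 0.25)  # two vertices seeded at each end
```

Varying `seed` yields the multiple pre-seeded starting solutions from which a partitioner would be run.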
Modern Partitioners Do Not Need Hints
The previous experiments confirm that a traditional FM min-cut partitioner can certainly benefit from the hints contained in solver iterates. We now propose an interesting notion: that the use of numerical linear systems solvers with quadratic wirelength objective is historically due to the pre-1990's weakness of min-cut partitioners. In the past few years, significant improvements to FM have been made, primarily in the areas of tie-breaking and multilevel schemes (see [2] for a survey). The multilevel approach integrates hierarchical clustering into FM, generalizing the early "two-phase" approach (e.g., [5]). The recent work of [1] developed a multilevel partitioner that reports outstanding solution quality with respect to many other methods in the literature. We have obtained this partitioning code and integrated it into our testbed. We repeated the previous set of experiments using 5 runs of multilevel (ML-FM) [1] instead of FM partitioning. Figure 4 shows how ML-FM solution quality varies with convergence of SOR, and with the amount of information retained from the iterate.⁶ The 20% data was omitted for these plots since the cuts were much worse than the 0% or 50% data (likely due to ML-FM's inability to handle pre-seeding in a natural way). The conclusions are clear: ML-FM dramatically outperforms FM, and furthermore draws no benefit from using solver iterates. In fact, ML-FM solutions are generally worse when constrained to follow the structure of an iterate in its initial solution.
⁶ Given an initial bipartitioning solution and a 0.5 level of pre-assignment, ML-FM determines a bottom-up clustering that is compatible with this solution throughout the hierarchy. With a 0.0 level of pre-assignment, ML-FM has no constraints on its bottom-up clustering. From the results in Figure 4, these constraints actually hurt solution quality: other pre-seeding techniques for guiding ML-FM should be explored.
We have also observed that the ML-FM solutions for different iterates are extremely similar in terms of both structure and cut cost, no matter what initial solution is chosen. Thus, if a minimum-cut solution is desired, a viable approach may be to use ML-FM on a random initial solution, and not use any quadratic placement techniques.
Our experimental data leads to the surprising hypothesis that a linear system solver may be completely avoided in the quadratic placement approach with no loss of placement quality. While strong partitioners "ignore" hints from the linear system solver, it is certain that a partitioner is needed to counteract the effects of clumping and squared wirelength objective. We by no means suggest that placement reduces to partitioning on one level, but rather that such is the case in the top-down context, where rich geometric information is implicit in the partitioning instance.
Discussion and Futures
We have synthesized the motivations and structure for a generic "quadratic placement" methodology, and given both performance improvements and a historical context for the approach. We have also shown that, possibly, an implementation of the approach no longer requires an embedded linear systems solver, and indeed can revert back to "min-cut placement" when armed with the latest partitioning technology. We by no means claim that every existing quadratic placer should discard its numerical engine, but, all else being equal, we suspect that if a minimum weighted cut is the true objective, ML-FM or other recent partitioners might be invoked with no loss of solution quality.
More generally, we believe that there are basic drivers that suggest looking beyond "quadratic placement" technology. First, there are well-known limitations to modeling capability within a quadratic placer, e.g., path timing constraints, invariance of orderings to unequal horizontal and vertical routing, the requirement of pre-placed pads to "anchor" the placement, etc. Second, a top-down, performance- and HDL-driven design methodology will have relatively smaller technology-mapped blocks in order to gain predictability; these will not be large enough for a quadratic placer to show its "global awareness" and runtime advantages. Third, the advent of block-based design may mean fewer large, flat problem instances. Synthesized glue logic will be spread out over disconnected, heterogeneous regions; the resulting flow of chip planning, block building and assembly harkens back to classic block packing and route planning issues, and does not play to the strengths of quadratic placement. Thus, while quadratic placers are the state of the art today, it remains to be seen whether other approaches will more effectively address future placement requirements.
