Cell placement is an important phase of current VLSI circuit design styles such as standard cell, gate array, and Field Programmable Gate Array (FPGA). Although nondeterministic algorithms such as Simulated Annealing (SA) were successful in solving this problem, they are known to be slow. In this paper, a neural network algorithm is proposed that produces solutions as good as SA in substantially less time. This algorithm is based on Mean Field Annealing (MFA) technique, which was successfully applied to various combinatorial optimization problems. A MFA formulation for the cell placement problem is derived which can easily be applied to all VLSI design styles. To demonstrate that the proposed algorithm is applicable in practice, a detailed formulation for the FPGA design style is derived, and the layouts of several benchmark circuits are generated. The performance of the proposed cell placement algorithm is evaluated in comparison with commercial automated circuit design software Xilinx Automatic Place and Route (APR) which uses SA technique. Performance evaluation is conducted using ACM/SIGDA Design Automation benchmark circuits. Experimental results indicate that the proposed MFA algorithm produces comparable results with APR. However, MFA is almost 20 times faster than APR on the average.᭧
Introduction
Cell placement is an important problem arising in various VLSI circuit design styles such as standard cell, gate array and Field Programming Gate Array (FPGA). Given a circuit description, the problem is to find a layout of the circuit while minimizing some cost function. Usually two closely related criteria are used to construct a cost function: minimization of the routing length and minimization of the chip area. In some design styles (e.g. standard cell), minimization of the area is equivalent to minimization of the routing length (Shahookar and Mazumder, 1991) , whereas in some others area is fixed (e.g. FPGA) . If the area is fixed, minimization of the routing length is necessary for the routability of the circuit using the available routing resources. Minimization of the routing length also minimizes the propagation delays of the circuit, hence increasing its speed (Shahookar and Mazumder, 1991) .
Although the cell placement problem has different characteristics related to the technology used in different design styles, key features of the problem remain the same. This enables a general definition for the cell placement problem to be made which is valid for all design styles. The problem is decomposed into two phases such that the first phase is same for all design styles and the second phase depends on the design style. An instance of the first phase of the cell placement problem consists of a hypergraph Q(C, N) representing the circuit to be placed, and a rectangular grid of clusters with P rows and Q columns where the circuit will be placed. Hypergraph Q(C, N) consists of a vertex set C representing the cells of the circuit, a hyperedge set N representing the nets of the circuit, a cell weight function q cell :C → N, and a net weight function q net :N → N, where N represents the set of natural numbers. The aim is to partition the vertex set C into P ϫ Q clusters such that the routing cost is minimized and the weights of the clusters are nearly balanced. The weight of a cluster is the sum of the weights of the cells in that cluster. In general, cell weight function is used to encode the areas of cells, and net weight function is used to increase the importance of some nets which may be crucial for the performance of the circuit. The rectangular grid of clusters is used for estimating the final locations of the cells. The computation of routing cost is discussed in detail in Section 2. (Shahookar and Mazumder, 1991) . The circuit has 3 input (I1, I2, I3) and 2 output (O1, O2) pads. Pads may be interpreted as cells which must be mapped to the boundaries of the cluster grid. The example circuit in Fig. 1 (a) may be represented with a hypergraph Q(C, N) according to the above definition as: 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, I1, I2, I3, O1, O2} N ¼{{I1, 1, 2, 3, 4}, {I2, 1, 2, 3, 4, 11, 12}, {I3, 6, 10, 11, 12, 13}, {1, 8}, {3, 7}, {11, 13}, {5, 6}, {8, 9}, {9, 15}, {13, 16}, {O1, 15}, {2, 5}, {4, 10}, {12, 14}, {6, 8}, {7, 9}, {10, 15}, {14, 16}, {O2, 16}}
Unit cell and net weights are assumed in this example. Fig. 1(b) shows the placement of this circuit to a 4 ϫ 4 grid of 16 clusters.
The second phase of the cell placement problem is the mapping of the cells in the clusters to their final locations in the layout. In standard cell design style, cells are used for constructing rows, and in gate array design style, cells are mapped to rows or grid locations according to the type of the gate array used (Sechen, 1988) . Some gate arrays consist of modules forming a rectangular grid. For this type of gate arrays the second phase of the problem may be skipped by choosing the number of rows and columns of the cluster grid to be equal to the number of rows and columns of the module grid, respectively. Symmetrical FPGAs consist of logic blocks forming a rectangular grid (Rose et al., 1992 , Rose et al., 1993 . Hence, the second phase of the problem can be similarly skipped for symmetrical FPGAs. This two phase modeling enables the development of heuristics for the first phase of the problem which are independent of the design style.
Since cell placement problem is NP-Hard (Lengauer, 1990) , finding efficient placement heuristics is an important research issue. In the last decade, neurocomputing approaches based on Hopfield model were successfully applied to various combinatorial optimization problems such as the traveling salesman problem (Peterson and Söderberg, 1989; VandenBout and Miller, 1989; Takahashi, 1997) , scheduling problem (Gislén et al., 1992) , mapping problem (Bultan and Aykanat, 1992) , knapsack problem (Ohlsson et al., 1993; Ohlsson and Pi, 1997) , communication routing problem (Hökkinen et al., 1998) , graph partitioning problem (Herault and Niez, 1989; Peterson and Söderberg, 1989; VandenBout and Miller, 1990) , graph layout problem (Cimikowski and Shope, 1996) , circuit partitioning problem (Yih and Mazumder, 1990; Bultan and Aykanat, 1995) . In this paper, the Mean Field Annealing (MFA) technique is applied to the cell placement problem. MFA is a new approach for solving combinatorial optimization problems (Peterson and Söderberg, 1989; Miller, 1989, VandenBout and Miller, 1990; Gislén et al., 1992; Aykanat, 1992, Bultan and Aykanat, 1995; Ohlsson et al., 1993; Ohlsson and Pi, 1997; Hökkinen et al., 1998) . MFA combines the collective computation property of Hopfield neural networks (Hopfield and Tank, 1985) with the annealing notion of Simulated Annealing (SA) (Kirkpatrick et al., 1983) . In MFA, discrete variables called spins (or neurons) are used for encoding configurations of combinatorial optimization problems. An energy function written in terms of spins is used for representing the cost function of the problem. Then, using the expected values of these discrete variables, a nondeterministic gradient descent type relaxation scheme is used to find a configuration of the spins which minimizes the energy function associated with them.
In this paper, a MFA-based cell placement algorithm is proposed. In order to show the performance of the proposed algorithm on concrete examples MFA formulations are derived for symmetrical-array FPGA design style. However, the MFA formulations proposed for FPGAs are general enough so that they can easily be applied to the first phase of the cell placement problem in other design styles with minor modifications.
The organization of the paper is as follows. Section 2 discusses the method used for approximating the routing cost of the placement. FPGA design style is briefly summarized in Section 3. Section 4 begins with the presentation of the general guidelines for applying MFA technique to combinatorial optimization problems. Then, the proposed formulation and implementation of the MFA algorithm for the cell placement problem following these guidelines are presented. The encoding scheme used in the proposed formulation is discussed in Section 4.1. The proposed energy function formulation and derivation of the mean field theory equations are presented in Section 4.2 and Section 4.3, respectively. The parameter selection and cooling schedule are discussed in Section 4.4. Finally, experimental results which evaluate the relative performance of the proposed algorithm are discussed in Section 5.
Routing cost
Computation of the routing cost is the crucial part of the cell placement problem. In the first phase of the problem, cells are partitioned to P ϫ Q clusters which form a rectangular grid. Fig. 1(b) shows the partitioning of the circuit in Fig. 1(a) to a 4 ϫ 4 grid. Initially, it is assumed that all clusters have the same size, forming a uniform grid as in Fig. 1(b) . After the cells are mapped to the clusters, areas of the clusters may be different, resulting with a nonuniform grid. If the clusters are balanced, the difference between the uniform grid and the actual nonuniform grid is not significant.
In order to calculate the routing cost the exact locations of the cells in the layout must be known. Each cell is assumed to be placed to the center of the cluster to which it is mapped. During the placement, it is not feasible to calculate the exact routing length for two reasons. Firstly, a feasible placement is not available during the execution of some algorithms (Dunlop and Kernighan, 1985) , secondly, the computation of the exact routing cost necessitates the execution of the global and the detailed routing phases which are as hard as the placement phase. Hence, most of the placement heuristics use a method for approximating the routing cost. An efficient and commonly used approximation is the semi-perimeter method (Shahookar and Mazumder, 1991; Sherwani, 1993) . In this method, the routing cost of a net is approximated by the semi-perimeter length of the smallest bounding rectangle (bounding box) enclosing all the cells connected to that net. Fig. 1(b) shows the bounding box of the net {10, 15} with two cells. This method gives a good approximation to the Steiner tree which is the most efficient routing scheme (Shahookar and Mazumder, 1991) . The shortest way to route a net is to find the minimum length Steiner tree of the cells connected to that net. Steiner trees can also be used as an approximation of the final routing length, but finding the minimum Steiner tree is an NP-Hard problem and its computation may not be feasible. Hence, semi-perimeter method is a good and efficient way of approximating the routing length.
Another way to view the semi-perimeter method is to define the vertical and the horizontal spans for each net (Sechen, 1988) . The vertical and the horizontal spans of a net are the lengths of the vertical and the horizontal sides of its bounding rectangle, respectively. Fig. 1(b) shows the vertical and the horizontal spans of the net {10, 15}. Total routing cost can be computed by adding the vertical and the horizontal spans of all the nets. If vertical and horizontal routings have different costs, then the total routing cost can be approximated by multiplying the vertical and the horizontal spans of the nets by the appropriate unit costs.
FPGA design style
Field Programmable Gate Arrays (FPGAs) were widely used in industry in recent years. Because they provide cheap and flexible usage, fast manufacturing turnaround time and low prototype cost, many designers prefer to use them in their applications. Several types of FPGAs were introduced over the last years, which differ from each other by their programming technologies, logic block architectures and routing network architectures (Rose et al., 1992) . They can be classified into four main categories: symmetricalarray, row-based, hierarchical and sea-of-gates.
A typical symmetrical-array FPGA consists of a twodimensional grid called logic cell array (LCA) which is interconnected with vertical and horizontal channels as shown in Fig. 2(a) . Each point in this two-dimensional grid is called a configurable logic block (CLB). A CLB can implement a set of logic functions. In FPGA design style, CLBs are used to provide the functionality of the circuit by mapping the logic gates of the circuit to CLBs. Logic blocks at the boundaries of the LCA are called inputoutput blocks (IOBs). IOBs are used for external connections of the circuit. Routing network, which consists of vertical and horizontal channels placed in between CLBs, makes connections among CLBs and IOBs. Switch blocks (SBs) that connect wire segments in horizontal and vertical channels are also a part of the routing network. In commercial FPGAs, routing resources are fixed and fairly limited (Xilinx, 1994) . For example, there are only five tracks in each routing channel for Xilinx XC3000 series of FPGAs as in Fig. 2(a) . The placement problem is especially important in designs using such devices, because fixed routing resources make it difficult to achieve 100% automatic routing.
Automated FPGA layout generation can be divided into four major phases, partitioning, technology mapping, placement and routing (Rose et al., 1993) . Partitioning is used for very large logic circuits that require multiple FPGA chips. In technology mapping phase, a logic circuit is transformed to an optimized, generic logic input format that consists of CLBs and IOBs. In the placement phase, the circuit that is formed in the technology-mapping phase is assigned to specific CLBs and IOBs in the LCA. This phase of FPGA layout design is equivalent to the cell placement problem discussed earlier. Most commercial automated design tools for FPGAs use SA algorithm in the placement phase. SA technique provides high quality solutions but it is notably slow. In this paper, a fast placement algorithm is proposed for symmetrical-array FPGAs that produces layouts which are as good as the ones produced by SA.
Applying MFA to the cell placement problem
MFA technique merges the collective computation and the annealing properties of Hopfield neural networks (Hopfield and Tank, 1985) and SA (Kirkpatrick et al., 1983) , respectively, to obtain a general algorithm for solving combinatorial optimization problems. A combinatorial optimization problem consists of a set of configurations and a cost function. For example, for the cell placement problem the set of configurations corresponds to the set of all possible placements of the input circuit. Sometimes, configurations are also referred to as solutions. Cost function assigns a cost to each configuration of the problem. For the cell placement problem, the cost of each configuration (i.e. placement) is the routing length of that placement. Optimum solution of a combinatorial optimization problem is the configuration (i.e. solution) which has the minimum (maximum) cost if the problem is a minimization (maximization) problem. Hence, for the cell placement problem the optimum solution is the placement of the circuit which has the minimum routing length.
In the MFA technique (Peterson and Söderberg, 1989; Miller, 1989, VandenBout and Miller, 1990) , discrete variables called spins (or neurons) are used to encode the configurations of the problem. A configuration in the spin domain is a valuation of these discrete variables. An encoding is defined which is a one-to-one mapping from the set of configurations of the problem to the set of configurations of the spins. Then the cost function of the problem is formulated in terms of spins. This function defines the energy of a configuration in the spin domain. MFA algorithm is a search algorithm in the spin domain which looks for the configuration with the minimum energy. To achieve this goal, expected values of the spins are updated iteratively using a nondeterministic gradient descent algorithm. In the following sections, the formulation of the MFA technique for the cell placement problem is described.
Encoding
The MFA algorithm is derived by analogy to Ising and Potts models which are used to estimate the state of a system of particles, called spins, in thermal equilibrium (Peterson and Söderberg, 1989; Miller, 1989, VandenBout and Miller, 1990 ). In Ising model, spins can be in one of the two-states represented by 0 and 1, whereas in Potts model they can be in one of the K states. For the cell placement problem the Potts model is used for encoding the configurations of the problem.
In the K-state Potts model of S spins, the states of spins are represented using S K-dimensional vectors S i ¼ ½s i1 ; …; s ik ; …; s iK ÿ t , 1 Յ i Յ S, where 't' denotes the vector transpose operation. The spin vector S i is allowed to be equal to one of the principal unit vectors e 1 ,…,e k ,…,e K , and cannot take any other value. Principal unit vector e k is defined to be a vector which has all its entries equal to 0 except its kth entry which is equal to 1. Spin S i is said to be in state k if it is equal to e k . Hence, a K-state Potts spin S i is composed of K two-state variables s i1 ,…,s ik ,…,s ik , where s ik ʦ {0,1}, with the following constraint
To encode the configuration space of the cell placement problem using these K-state Potts spins, one spin is assigned to each cell of the circuit. Each state of a spin corresponds to a location in the layout, i.e. if a spin is in state k this means that the cell associated with that spin is placed to location k. 
Two types of cells are considered in FPGA placement, namely L-cells and IO-cells. That is, in the circuit
Q(C,N), C ¼ C L ∪ C IO ,
Logic cell encoding
In order to encode the configuration space of the placement problem, one Potts spin could be assigned to each Lcell i ʦ C L of the circuit Q(C,N) to be placed. A (K ¼ PQ)-dimensional Potts spin could be used to encode the location of each L-cell, where each state of the Potts spin corresponds to a location in the P ϫ Q LCA. In this encoding, there would be a total of |C L | (PQ)-dimensional Potts spins in the system for encoding L-cells. Since each Potts spin could be in one of the K states at a time, there would be a one-to-one mapping between the configuration space of the problem domain and the spin domain. As each Potts spin consists of K two-state variables, a total of |C L |PQ two-state variables would be required for this encoding. However, a more efficient encoding is to represent the location of each L-cell with two Potts spins with dimensions P and Q. Spins with dimension P are used to encode the rows of the LCA, and spins with dimension Q are used to encode the columns of the LCA. Note that this encoding also constructs a one-toone mapping between the configuration space of the problem domain and the spin domain. However, it is more efficient since it uses a total of |C L |(P þ Q) two-state variables instead of |C L |PQ two-state variables of the previous encoding. Spins with dimensions P and Q are called row and column spins and labeled as S 
Input/output cell encoding
In the Xilinx series of FPGAs, there are four IOBs, two on each side, at the boundaries of each row and column of the layout as shown in Fig. 2 . Therefore, 
If an IO spin is in state m the corresponding IO-cell is assigned to IOB at location m in the layout. In order to simplify the encoding, the FPGA model is extended by adding two new boundary columns and two new boundary rows as shown in Fig. 2(b) . Rows 0 and P þ 1, and columns 0 and Q þ 1 are allocated to IOBs. An L-cell can be assigned to any internal row p, 1 Յ p Յ P, and any internal column q, 1 Յ q Յ Q. An IO-cell can only be assigned to boundary rows 0 and P þ 1 or boundary columns 0 and Q þ 1. IOB locations are numbered in clockwise direction starting from the upper left corner of the layout from 1 to 4P þ 4Q. Two new functions row(m) and col(m) are defined to show the IOB location m in terms of its row and column locations. Using this numbering scheme, s io bm ¼ 1 means that IO-cell b is assigned to IOB at location m, that is IO-cell b is assigned to one of the two IOBs at location pq of the LCA where
Energy function formulation
In the MFA algorithm, the aim is to find the spin values minimizing the energy function of the system. In order to achieve this goal, the average (expected) values of the spin vectors S r i , S c i and S io b are iteratively updated using a nondeterministic gradient descent algorithm. Iterations continue until the system stabilizes at some fixed point. Define The encoding scheme defined here ensures that L-cells are assigned to the CLBs in the internal rows and columns of the LCA. Similarly, it ensures that IO-cells are assigned to the IOBs in the boundary rows and columns of the LCA. However, for the sake of both simplicity of presentation and the efficiency of implementation P þ 2 and Q þ 2 dimensional vectors are maintained for row and column spins, respectively, for each L-cell i ʦ C L ; 
where
. This representation scheme is chosen for IO-cells since IO-cells assigned to the IOBs in the same row and column of the LCA incur the same vertical and horizontal routing cost, respectively. As mentioned earlier, energy function in the MFA algorithm corresponds to formulation of the cost function of the cell placement problem in terms of spins. Since the MFA algorithm iterates on the expected values of the spins the expected value of the energy function is formulated. The gradient of the expected value of the energy function is used in the MFA algorithm to compute the direction of maximum energy decrease, and the expected values of the spins are updated accordingly. The expected value of the energy function is defined as follows for the cell placement problem. Using the expected values of the spin variables defined earlier the following probabilities can be computed:
P{one or more cells of net n is in row p} ¼ 1 ¹ P{no cell of net n is in row p}
where i ʦ n denotes a cell that is in net n. These values may be computed for the columns of the LCA similarly. p r np is defined as the probability of the event that no cell of net n is in row p and p c nq as the probability of the event that no cell of net n is in column q, i.e.
Note that, if i ʦ n is an L-cell then v r ip and v c iq correspond to the actual L-row and L-column spin variables for 1 Յ p Յ P and 1 Յ q Յ Q, respectively, and to dummy 0 variables for p ¼ 0,P þ 1 and q ¼ 0,Q þ 1 respectively, in our representation scheme. If i ʦ n is an IO-cell, then these values correspond to the respective entries of the row and column vectors maintained for IOspins as discussed earlier. The vertical and horizontal routing costs of a net n are defined as q v ϫ q n ϫ (vertical span of net n) and q h ϫ q n (horizontal span of net n), respectively. Here, q v and q h are the unit vertical and horizontal routing costs between two successive cell (cluster) locations on the same column and row, respectively. In FPGA design style, q v ¼ q h ¼ 1 is used. Formulation of the vertical routing cost of net n as an energy term E vn using these definitions is:
ϫP{vertical span of net n is between rows k and ᐉ}
P{net n is not in row s}
P{net n is not in row t}
Here, net n is in row k if and only if one or more cells of net n is in row k, otherwise net n is not in row k. Similarly, energy formulation for the horizontal routing cost of net n is:
Total vertical and horizontal routing cost terms of the energy function (i.e. E v and E h ) can be derived using the formulation given in Eq. (10) and Eq. (11) as
If the routing cost is used as the only factor in the cost function, the optimum solution is mapping all cells of the circuit to one location in the layout. This placement will reduce the routing cost to zero but obviously it is not feasible. Hence, a term in the cost function is needed which will penalize the placements that put more than one cell to the same location. This term is called the overlap cost. The energy term is formulated corresponding to the overlap cost for CLBs and IOBs as:
ϫP{L-cells i and j are in the same CLB location}
Note that this overlap cost term becomes equal to the sum of the inner products of the weights of the cells at each cell (cluster) location when the system converges. In general placement, this term is minimized when weights of all the clusters are equal. If there is an imbalance among the cluster weights, this term increases with the square of the amount of imbalance, penalizing imbalanced clusterings. In FPGA placement, all cell weights are equal to 1 and only one L-cell and one IO-cell can be placed to one CLB and one IOB location, respectively. In addition, |C L | Յ (P ϫ Q), |C IO | Յ M. Hence, the overlap cost is minimized when either a single or no L-cell (IO-cell) is located to each CLB (IOB) location. If there is an overlap in a location, the overlap cost term increases with the square of the amount of overlap, penalizing the overlapped locations. Total energy term can be defined in terms of the routing cost terms and the overlap cost term as:
Parameter b is used to balance the two conflicting objectives of the energy function: minimizing the routing cost and the overlap cost. Note that allocating all cells to the same location minimizes the routing cost while maximizing the overlap cost. Minimization of the above energy function corresponds to distributing the cells of the circuit to the locations in such a way that the semi-perimeter and overlap costs are minimized.
The derivation of the gradient of the energy function using the formulation discussed earlier results in substantially complex expressions. Hence, the total energy function given in Eq. (15) is simplified in order to get more suitable expressions for the gradient. Simplification of the E v and E h terms given in Eq. (12) is as follows. A close examination of Eq. (10) and Eq. (11) reveals the symmetry between E vn and E hn terms. In fact, expressions for E vn and E hn can be obtained from each other by interchanging 'r' with 'c', 'P' with 'Q', and 'q v ' with 'q h '. Hence, algebraic simplifications will only be discussed for the E vn term. Similar steps can be followed for the E hn term. The following notation is introduced for the sake of simplification of the routing cost terms:
Here, F r nk and L r nk denote the probabilities that net n has no cells in the first k þ 1 rows (rows 0,1,2,…,k) and the last P ¹ k þ 2 rows (rows k,k þ 1,…,P,P þ 1), respectively. Similarly, F c nk and L c nk denote the probabilities that net n has no cells in the first k þ 1 and the last Q ¹ k þ 2 columns, respectively. Using this notation, E vn in Eq. (10) can be rewritten as:
Eq. (17) becomes:
The innermost summation in Eq. (20) telescopes to:
since L n,Pþ2 ¼ 1. Substituting Eq. (21) into Eq. (20):
After computing the telescoping outer sum in Eq. (22) and through some algebraic manipulations, expression for E vn simplifies to:
Similarly, the expression for E hn in Eq. (11) simplifies to:
Note that Eq. (23) and Eq. (24) compute the vertical and horizontal routing cost of net n, respectively, in an incremental manner. Hence, total energy function in Eq. (15) can be rewritten as:
Derivation of the mean field theory equations
The expected values V to be in one of the P, Q and M states, respectively, when they converge. In the proposed MFA formulation, L-row, L-column and IO spins are updated in an alternate manner, i.e., each L-row spin update is followed by an L-column spin update which is followed by an IO-spin update.
In the proposed formulation, L-row, L-column and IO mean field vectors F 
Here, N i denotes the set of nets connected to cell i, and 
In Eq. (28) (28) and Eq. (30), respectively. Note that row (column) assignment of a cell does not affect the horizontal (vertical) spans of the nets connected to that cell. The last summation terms in Eqs. (27) and (29) and Eq. (31) represent the increase in the overlap cost term by assigning L-cell i to row p, L-cell j to column q and IO-cell b to IOB location m, respectively. Fig. 3 illustrates the pseudo-code for the MFA algorithm proposed for the placement problem. At step 1, temperature parameters T r , T c and T io are initialized to sufficiently high temperatures for the annealing of L-row, L-column and IO spins, respectively. At step 2, an initial high temperature spin average is assigned to each Potts spin. In general, each spin variable is initialized to 1/K plus a small disturbance term which varies between ¹0.1/K and þ0.1/K. Here, T io are all in the cooling range. At each iteration of the innermost repeat-loop (step 3.1.2), the mean field vector effecting on a randomly selected L-row spin is computed (step 3.1.2.1), then the respective L-row spin average vector is updated (step 3.1.2.2). Similar operations are performed for randomly selected L-column and IO spins as shown in steps 3.1.2.3-3.1.2.6. These spin update operations are repeated for random sequences of L-row, L-column and IO spins as shown in the repeat-loop (step 3.1.2). The system is observed at the end of each repeat-loop in order to detect the convergence to an equilibrium state at the current temperature. If the average energy decrease caused by the spin updates performed in the repeat-loop is below a threshold value, this means that the system is stabilized for the current temperature. Then, T r , T c and T io are decreased according to the cooling schedule (step 3.2) and the overall iterative process (step 3.1) is re-initiated. As mentioned earlier, the proposed MFA algorithm is an iterative process. The complexity of MFA iterations is mainly caused by the mean field computations. As seen in Eqs. (27) and (29) and Eq. (31), calculations of mean field values are computationally very intensive. In this work, an efficient implementation scheme is used which reduces the complexity of individual L-row, L-column and IO iterations to ⌰ðd avg P þ PQÞ;
respectively. Here, avg denotes the average cell degree, i.e. average number of nets connected to a cell. This scheme is based on the techniques developed in (Bultan and Aykanat, 1995) for circuit partitioning problem, and can be derived from the formulations in (Bultan and Aykanat, 1995) . Therefore, its details will not be given here. Note that a sequence of L-row, L-column and IO spin updates can be considered as a single MFA iteration. Hence, a single MFA iteration takes
Յ PQ for sufficiently large P and Q values.
Parameter selection and cooling schedule
The parameters b 
of these two terms using the initial random spin averages and compute b r as:
where constant g is chosen as 0.8. The parameters b c and b io are computed similarly. The same g ¼ 0.8 is used in these computations. Selection of initial temperatures is crucial for obtaining good quality solutions. In previous applications of MFA (Peterson and Söderberg, 1989; VandenBout and Miller, 1990) , it is experimentally observed that spin averages tend to converge at a critical temperature. It is suitable to chose initial temperatures slightly greater than these critical Fig. 3 . MFA algorithm proposed for the placement problem.
temperatures. Although there are some methods proposed for the estimation of critical temperature (Peterson and Söderberg, 1989; VandenBout and Miller, 1990) , an experimental way of computing the initial temperatures is preferred here. After the balance parameters b
are computed using initial random spin averages, respectively. Then,
where j is a constant. Our experiments indicate that it is suitable to chose the parameter j as 100. Note that initial temperatures are inversely proportional to the dimensions of the respective Potts spins which is also observed for the critical temperature formulations presented in other implementations (Peterson and Söderberg, 1989; VandenBout and Miller, 1990 The cooling process is realized in two phases, slow cooling followed by fast cooling, similar to the cooling schedules used for SA. In the slow cooling phase, temperatures are decreased using a ¼ 0.95 until T Ͻ T 0 /1.5. Then, in the fast cooling phase, a is set to 0.85. The cooling process continues until either 90% of the spins are converged or T reduces below 0.01T 0 . At the end of this process, the variable with maximum value in each unconverged spin is set to 1 and all other variables are set to 0. Then, the result is decoded as described in Section 4.1 and the resulting placement is obtained.
The resulting placement may be infeasible, i.e. more than one L-cell or IO-cell may be allocated to the same CLB or IOB location, respectively. In such cases, the spins causing infeasible allocations are re-initialized to random initial values together with the set of unconverged spins at the end of the cooling process. Then, MFA algorithm is executed only for these spins starting from the initial high temperatures according to the same cooling schedule. Note that converged spins are held in their decoded values during this re-heating process. This re-heating process is continued until a feasible placement is found. Fig. 4 illustrates the evolution of the energy corresponding to the total placement cost with MFA iterations for the placement of circuit c432 onto a 10 ϫ 10 FPGA. This figure is constructed by computing the total energy term (Eq. (25) at the end of each random sequence of L-row, L-column and IO spin updates. Three curves in Fig. 4 correspond to the evolution of the total placement cost for three different initial temperatures computed using j ¼ 10 000, j ¼ 100 and j ¼ 1 in Eq. (35). In Fig. 4 , the major decrease in the energy terms for all three cases occurs at the same temperature which corresponds to the critical temperature mentioned earlier. In this figure, j ¼ 10 000 and j ¼ 100 correspond to initial temperatures which are significantly and slightly greater than the critical temperature, respectively. As seen in this figure, both initial temperatures yield almost the same solution quality. Note that initial temperatures corresponding to j ¼ 10 000 and j ¼ 100 yield placement solutions with semi-perimeter costs of 408 and 407, respectively. In contrast, j ¼ 1 corresponds to an initial temperature smaller than the critical temperature. This case results in a significantly worse solution quality with a semi-perimeter cost of 553. In general, starting from initial temperatures which are slightly greater than the critical temperature is sufficient for obtaining good solutions.
Experimental results
This section presents experimental performance evaluation of the proposed MFA algorithm in comparison with Xilinx Automated Placement and Routing (APR 3.30) program which uses simulated annealing algorithm in placement. Our MFA algorithm was implemented in C language and run on Sun-4 ELC workstations. Seven MCNC benchmark circuits were used to test the performance and efficiency of both programs. Xilinx 3000 series chips were used as the target FPGAs. The circuits were mapped into 3000 series logic blocks by using Xilinx XACT tools and these mapping results were used as inputs to the placement programs. Table 1 illustrates the properties of the benchmark circuits. The first two columns illustrate the number of CLBs and IOBs in the circuits to be placed. The third column shows the number of multi-pin nets. The last two columns illustrate the P ϫ Q dimensions of the FPGAs and the names of the target Xilinx chips used for placement.
The placement and routing results are displayed in Table  2 and Table 3 . Both MFA and APR programs were run 10 times for each problem instance. Table 2 displays the average placement costs and the average execution times of 10 runs for each placement instance. The placement results of both MFA and APR placement programs are used as inputs to the routing program of Xilinx APR tool. The average, the minimum and the maximum values for the maximum path delays obtained in 10 runs are displayed in Table 3. Table 3 also displays the average execution times of Xilinx APR tool for routing the placements produced by MFA and APR programs. Maximum path delay values were computed by running Xilinx XDelay program for each routing result.
The APR routing program produced 100% routability for each placement result obtained by both placement programs for all circuits except the largest circuit c3540. The router fails to route all the nets in the placement of this circuit. Infeasibility caused by the assignment of L-cells to the same CLB locations was not experienced in our MFA runs. However, infeasibility caused by the assignment of IOcells to the same IOB locations was experienced in some of our runs. However, a single re-heating pass was sufficient for obtaining feasible solutions in all these placement instances.
The semi-perimeter cost values displayed in Table 2 correspond to the average normalized semi-perimeter costs computed for the placement results of both programs as described in Section 2. Here, normalization refers to assuming a unit square layout. That is, vertical and horizontal spans of the nets are normalized by multiplying them with 1/Q and 1/P, respectively, during the computation of total semi-parameter cost values for Table 2 . The APR cost values correspond to the average costs computed for the placement results of both programs according to APR's placement cost definition. The semi-perimeter costs of the placement results obtained by the MFA program are 105% better than those of the APR program. However, APR-costs of the placement results obtained by the APR program are 16% better than those of the MFA program. Table 4 illustrates the normalized relative performance results of the two placement programs. In this table, the averages of the maximum path delay values obtained by the Xilinx XDelay program after routing the placement results of APR placement program are normalized with respect to those of the MFA program. This table also illustrates the execution times of the APR placement program normalized with respect to those of the MFA program. As seen in this table, the MFA placements yield slightly better routing results in 3 circuits out of seven circuits. APR placements yield 3% better routing results on the overall average. However, as seen in Tables 2 and 4 , MFA placement program is significantly faster than the APR placement program in all instances. MFA placement program is 19.8 times faster than the APR placement program on the overall average. Fig. 5 illustrates sample routing results of the circuit c432 for placements obtained by APR and MFA.
Conclusions
In this paper, a fast nondeterministic cell placement algorithm was proposed for VLSI design automation 
