In this paper, a precorrected-FFT approach for fast and highly accurate simulation of circuits with on-chip inductance is proposed. This work is motivated by the fact that circuit analysis and optimization methods based on the partial element equivalent circuit (PEEC) model require the solution of a subproblem in which a dense inductance matrix must be multiplied by a given vector, an operation with a high computational cost. Unlike traditional inductance extraction approaches, the precorrected-FFT method does not attempt to compute the inductance matrix explicitly, but assumes the entries in the given vector to be the fictitious currents in inductors and enables the accurate and quick computation of this matrix-vector product by exploiting the properties of the inductance calculation procedure. The effects of all of the inductors are implicitly considered in the calculation: faraway inductor effects are captured by representing the conductor currents as point currents on a grid, while nearby inductive interactions are modeled through direct calculation. The grid representation enables the use of the discrete Fast Fourier Transform (FFT) for fast magnetic vector potential calculation. The precorrected-FFT method has been applied to accurately simulate large industrial circuits with up to 121,000 inductors and over 7 billion mutual inductive couplings in about 20 minutes. Techniques for trading off CPU time with accuracy using different approximation orders and grid constructions are also illustrated. Comparisons with a block diagonal sparsification method are used to illustrate the accuracy and effectiveness of this method. In terms of accuracy, memory and speed, it is shown that the precorrected-FFT method is an excellent approach for simulating on-chip inductance in a large circuit.
Introduction
The fast and accurate simulation of circuits with on-chip inductance is a growing problem, and future trends show that the relative contribution of inductive effects on circuit behavior will continue to increase as technologies shrink further and low-k dielectrics are used to diminish capacitive effects. Inductive effects have become important in determining power supply integrity, timing and noise analysis, especially for global clock networks, signal buses and supply grids for high-performance microprocessors.
One of the major problems in determining inductance has been associated with the fact that wire inductances are defined over current loops, and that the current loops are dependent on the circuit context of the switching wires. This leads to a chicken-and-egg problem where the inductance cannot be extracted until the current return paths are known, which, in turn, can only be determined after some knowledge of the inductance. Fortunately, an elegant way around this was found using the PEEC model [1], which does not require the current return paths to be predetermined. The PEEC approach introduces the concept of partial inductance of a wire or a wire segment, corresponding to a return path at infinity. The partial self-inductance is defined as the inductance of a wire segment that is in its own magnetic field, while the partial mutual inductance is defined between two wire segments, each of which is in the magnetic field produced by the current in the other. For two wire segments k and m, the partial mutual inductance is given by 
Here, r km is the distance between any two points on segment k and m. Simplified closed form formulae for partial self and mutual inductances of typical wire topologies that are useful in current-day integrated circuit environments have been calculated in [2] .
One drawback of using the PEEC method directly is that it requires the calculation of nonzero mutual inductances between every pair of nonperpendicular wire segments in a layout. This results in a dense inductance matrix that causes a high computational overhead for a simulator. Although many entries in this matrix are small and have negligible effects, zeroing them out may cause the resulting inductance matrix to lose its desirable positive definiteness property [3] , which is a necessary condition for the matrix to represent a physically realizable inductor system. Consequently, several efforts have been made to develop algorithms to sparsify the dense inductance matrix while maintaining this property.
The shift and truncate method [4] finds a sparse matrix approximation by assuming that the current return of each wire segment is not at infinity, but is distributed on a shell of finite radius R 0 , which must be constant for the analysis of the entire chip. Under this assumption, the inductance formula (1) is altered by subtracting a factor, which is inversely proportional to R 0 , from the partial inductance, and setting the value to zero if the result is negative. Although this method succeeds in removing faraway inductive interactions from consideration and maintains the positive definiteness of the matrix, the subtractive factor can cause errors in calculating nearby inductive interactions if the radius is not large enough. Moreover, finding a reliable global value of R 0 is a nontrivial problem: a high accuracy demands a large R 0 , which, in turn, can result in low sparsification. Although efforts in the direction of determining R 0 have been made in [3] , this is not a solved problem.
An alternative approach that uses return-limited inductances [5] is a shape-based method for sparsifying the inductance matrix using "halo rules". This method is good as a first order approximation. Another approach [6] introduces a block diagonal method that is a heuristic sparsification technique based on a simple partition of the circuit topology. This approach also maintains the positive definiteness of the matrix, but neglects mutual inductances between partitions. In [7] , the circuit element K, defined as the inverse of the traditional PEEC inductance matrix, was introduced as an alternative element for representing an inductance system. The K matrix is shown in [8] to have better properties than the inductance matrix in that it is symmetric, positive semidefinite and diagonally dominant. Similar to a capacitance matrix, the K matrix can be easily sparsified and can obtain a higher accuracy than an inductance matrix for the same sparsification. However, as in the case of the shift and truncate method, the algorithm provided in [7] uses a fixed window size within which local interactions are modeled.
Moreover, it requires that existing simulators are extended such that they handle the new circuit element K.
The shortcomings common to all of these methods are twofold. First, it is difficult to determine how to set the radius or partition size outside which couplings may be ignored. The principal problem is that it is difficult to definitively demarcate a region such that an aggressor wire segment outside this local interaction region is too weak to have a significant effect on a victim wire segment within it. Some of the heuristics proposed to overcome this limitation entail simulation, which may require large CPU times for large circuits. Second, although the individual couplings that are ignored may be small, it is difficult to determine the cumulative effect of ignoring a larger set of such couplings without detailed knowledge of the current distributions.
FastHenry [9] is a multipole -accelerated method for inductance extraction. However, it works in frequency domain and ignores the effects of capacitance on the estimation of current return path. In order to obtain the time domain simulation, an accurate compact model has to be constructed, which is not an easy procedure.
Recently, a number of methods for circuit and layout analysis and optimization for on-chip inductance have been proposed [10, 11, 12, 13, 14] . However these methods have typically used either RL inductance formulations or analytical models, which have limited accuracy for large circuit structures.
In this paper, we propose a precorrected-FFT method that, instead of entirely dropping long-range couplings, approximates these couplings, thereby overcoming the above two shortcomings. The main idea of this method is to represent the long-range part of the vector potential by point currents on a uniform grid and nearby interactions by direct calculations. The grid representation permits the use of the discrete Fast Fourier Transform (FFT) for fast potential calculations. Because of the decoupling of the short and long-range parts of the potentials, this algorithm can be applied to problems with irregular discretizations. Other techniques similar to the precorrected-FFT method, such as the adaptive integral method (AIM) [15] , have been proposed; the advantages of the former over the latter are discussed in [16] .
The idea of using a precorrected-FFT approach for accelerated electromagnetic calculations has been used in the past to accelerate the coulomb potential calculation for solving electromagnetic boundary integral equations for three-dimensional geometries. During the capacitance extraction technique introduced in [17, 18] , each iteration of the algorithm computes the product of a dense matrix with a charge vector to calculate electrical potential on each conductor. The basic precorrected-FFT method presented in this paper is inspired by the method in [17, 18] for capacitance extraction, which also demonstrates that for many realistic structures, the precorrected-FFT method is faster and uses less memory compared with the multipole -accelerated method. In our work, the precorrected-FFT method is adapted to the specific requirements of simulation of on-chip inductance. Unlike [17, 18] , we do not focus on extracting a matrix describing the parasitics (namely, the inductance matrix M in our case), but rather, directly consider how the inductance matrix is used in fast simulation algorithms. As described in Section 2, many simulators do not require M to be explicitly determined, but instead, require the computation of the product of M with a vector I. The approach developed in this paper accelerates the procedure that is used to directly determine the M × I product without explicitly finding M. It proceeds by first assuming that the entries in I are fictitious currents in inductors and then transforming the calculation of the M × I product to the calculation of the integration of the magnetic vector potential A v over the volume of the inductors, as depicted in equation (5) in Section 2. The long-range magnetic interactions are represented by point currents on a discretized grid, while short-range contributions to the M × I product are directly calculated. Several considerations are incorporated to make the algorithm efficient and applicable to large circuits and complex layouts. First, since mutually perpendicular segments do not have any inductive interactions, it is possible to apply the precorrected-FFT method to wire segments in the two perpendicular directions separately. This simplification is applicable to inductance systems and not to capacitance systems. Second, since IC chips typically have much larger sizes in the two planar dimensions than in the third (i.e., they tend to be "flat"), we show that a two-dimensional grid may be used instead of a threedimensional grid.
A comprehensive PEEC model, as described in [6] , is used in this paper. We demonstrate the application of the precorrected-FFT method within a simulation flow based on PRIMA [19] , on circuits of up to 121,000 inductors and nearly 7 billion mutual inductive couplings. These experiments demonstrate the speed, memory consumption and accuracy of the precorrected-FFT method as compared to the block diagonal method [6] . We also illustrate how tradeoffs may be made in order to obtain higher speed implementations with a small reduction in accuracy.
The remainder of this paper is organized as follows. In Section 2, the motivation for calculating the M × I product is presented, and a description of the problem formulation is provided. This is followed by a detailed description of the precorrected-FFT algorithm as applied to the inductance problems in Section 3. Experimental results and an analysis of the relation between the accuracy and speed are performed in Section 4, including a comparison with the block diagonal method in terms of speed, memory cost and accuracy. Concluding remarks are presented in Section 5.
Motivation and problem formulation
It is well known that a direct application of the PEEC model results in dense inductance matrices. The partial inductances of an n-wire segment system can be written as an n×n symmetric, positive semidefinite matrix M ∈ R nxn . Once this inductance matrix has been calculated, it may be incorporated into a circuit model that captures the interactions of R, L, C and active elements in the circuit. If the circuit is linear, it can be solved efficiently using model order reduction techniques such as PRIMA or using a SPICE-like transient simulation flow.
Model order reduction techniques
To understand how the inductance matrix is used within a model order reduction method, let us use PRIMA as a representative model order reduction engine (the application to AWE is very similar) and consider a circuit that is represented by the modified nodal equation
where (G+sC) is the admittance matrix, G is a conductance matrix, C is a matrix that represents the capacitive and inductive elements, X is a vector of unknown node voltages and unknown currents of inductors and voltage sources, and B is a vector of independent time-varying voltage and current sources. Specifically,
where N, Q and M are, respectively, the submatrices representing conductances, capacitances and inductances in the network. E consists of ones, minus ones and zeros, and N, Q, and M must be symmetric and positive definite to guarantee passivity. The submatrix of capacitances, Q, is typically sparse, while the submatrix M of inductances is dense.
Our objective is to reduce the large conductance and capacitance matrices into smaller reduced matrices, so that the reduced linear system may be simulated exactly in conjunction with nonlinear transistor models in a circuit simulator. The vectors of moments, m i , of X can be calculated by solving the equations
These are orthonormalized in each step to obtain an orthonormal X matrix. Note that the right hand side of Equation (4b) involves the multiplication of C by a constant vector. The matrix for the reduced order system can be calculated as follows:
Using these reduced matrices and the transistor models, a netlist for the reduced order system may be constructed and simulated using SPICE.
SPICE-like transient simulation flow
The time-domain modified nodal equation is given by:
where the definition and formation of G, C, X and B are the same as in (3) . Such equations can be solved using the backward-Euler method under a given time step, h, as follows:
Rearranging the above equation, we obtain:
Where P = (G + C/h) and q = (B + C/h X n ). Given the values of X at the n th time step, we can solve the above equation
for X at (n+1) th time step. This equation can be solved by direct methods such as LU factorization, or using an iterative solver such as GMRES [20] . For very large circuits and a dense M matrix in C, LU factorization of G+C/h matrix could become computationally expensive, and therefore the use of iterative methods becomes attractive.
The GMRES procedure requires the evaluation of the product of P by a vector, and therefore it is seen that this procedure, as well as the procedure of determining q, require the multiplication of the C matrix by a constant vector.
Problem formulation
Regardless of whether model order reduction techniques or transient simulations using an iterative solver are employed, we face the problem of the multiplication of C matrix with a constant vector. The product of M with a known vector I ∈ R nx1 for these wire segments can be written as: 
Here, we assume that m I is the fictitious current in wire segment m and km A r . It is the summation of the integration of the magnetic vector potential over wire segment k caused by the current in each aggressor wire segment. The expression (6) can be calculated using approximate formulae available in [2] or using accurate closed form formulae provided in [21] .
If the dense inductance matrix M is used, the computational cost for the matrix-vector product is very high: for a system with n variable s, this is O(n 2 ). The larger the circuit, the larger is the number of moments and ports, and the heavier is the overhead of calculating the dense matrix-vector product. Therefore, methods for sparsifying the M matrix have been widely understood as being vital to solving systems with inductances in an efficient manner.
On closer examination, however, we observe that in order to solve the circuit, it is not the dense inductance submatrix M that needs to be determined, but rather, the product of M with a given vector. This is the motivation for this work, and we present a technique that efficiently finds the product of M with a given vector using the precorrected-FFT approach that accelerates the computation of this matrix-vector product.
Therefore, the proposed method is general in that it can be applied whenever the circuit analyzer relies on the computation of the product of the inductance matrix with a given vector, such as PRIMA and SPICE-like transient analysis in the case where an iterative method is used for the equation solution, although it is not especially useful for an LU-factorization method since the latter requires the elements of the M matrix to be listed explicitly. In this work, we use PRIMA as the simulation engine to test the results of the algorithm.
Precorrected-FFT method
The precorrected-FFT method presented here provides an efficient method for estimating the dense M × I matrixvector product accurately, and is based on dividing the region under analysis into a grid. In the description of this algorithm, we will begin by using a three-dimensional grid, although we will show in the next section that in practice, a two-dimensional grid can also work well in an integrated circuit environment.
Consider the three-dimensional topology of wires that represents the circuit under consideration. After the wires have been cut into wire segments to be represented using the PEEC model, the circuit can be subdivided into
array of cells, with each cell containing a set of wire segments. The contribution to the values of
of wire segments within a cell under consideration (which we will call the "victim cell") that is caused by wires in other cells (referred to as "aggressor cells") can be classified into two categories: longrange interactions and short-range interactions. The central idea of the precorrected-FFT approach is to represent the current distribution in wire segments in the aggressor cell by using a small number of point currents on the grid that can accurately approximate the vector potential for faraway victim cells. After this, the potential at grid points caused by the grid currents is found by a discrete convolution that can be easily performed using the FFT. Figure 1 shows a schematic diagram of a multiconductor system subdivided into a grid of 3 ×3×1 cells. The current distributions of wires in each cell are represented by a 2×2×2 grid of point currents, using an approach that will be described later. 1. Projection: The currents carried by the wire segments that lie in each cell are projected onto a uniform grid of point currents in the same direction as the currents in the wires. Here, the grid is only required to have a constant grid spacing in each dimension, so that for a three-dimensional grid, the grid spacing can be different in each of the three perpendicular directions. The boundary condition that is maintained during projection is that the vector potentials at a set of test points on a collocation sphere surrounding the cell should match the vector potentials due to the actual wires.
FFT:
A multi-dimensional FFT computation is carried out to calculate the grid potentials at the "victim" grid points caused by these "aggressor" grid currents. This computation proceeds by automatically considering all pairs of aggressor-victim combinations within the grid.
Interpolation:
The grid potentials, calculated by the FFT computation, are interpolated onto wire segments in each "victim" cell.
Precorrection:
The projection of wire segments to the uniform grid in step 1 inherently introduces errors into the computation. While these errors are minimal for faraway grid points, they may be more serious in modeling interactions between nearby grid cells. Therefore, the precorrection step directly computes nearby inductive interactions accurately, and "precorrects" to remove the significant errors that could have been introduced as a result of projection.
A detailed description of the four steps is provided in the following subsections.
Projection
The first step in the precorrected-FFT algorithm is projection, which constructs the grid projection operator W.
Using W, the long-range part of the magnetic vector potential due to the current distribution in a given cell can be represented by a small number of currents lying on grid points throughout the volume of the cell. In other words, the current distribution in wire segments can be replaced by a set of grid point currents that are used to calculate the long-range part of the magnetic vector potential. An example of the top view of such a grid representation is shown in Figure 1 , where the current distribution in each cell is represented by a 2×2×2 array of grid currents. Since the grid currents are only a substitution for the current distribution in wire segments, the grid can be coarser or finer than the actual problem discretization.
The scheme for representing the current distribution in a cell by a set of grid currents throughout the cell can be illustrated using the first uniqueness theorem in electromagnetic fields [22] . Suppose the current (charge) distribution is contained within some small volume S 0 with radius R 0 , as shown in Figure 3 , and we are interested in finding the induced magnetic field (electric field) outside of region contained within a surface S. In order to find the magnetic field of a given stationary current distribution (electric field of a given stationary charge distribution) we solve Laplace's equation:
with the boundary condition V s , which is the known potential distribution on the boundary surface S. The first uniqueness theorem is related to the solution of Laplace's equation, and can be stated as follows:
First uniqueness theorem: The solution of Laplace's equation in some region is uniquely determined if the value of the potential is a specified function on all boundaries of the region. This theorem tells us that in order to solve the Laplace's equation, it is not necessary to know the detailed distribution of current sources (charge sources), but that it is sufficient to know the potential distribution on the boundary surface S of the problem region. This suggests a scheme where one current distribution can be replaced by another current distribution provided the two distributions result in the same potential on the boundary surface of the solution of Laplace's equation. For convenience of calculation, we choose the boundary surface as a sphere surface, called the collocation sphere as shown in Figure 2 , and point currents lying on a grid as a current distribution that substitutes the original one.
The radius of the current distribution region R 0 is a little larger than the cell size and the small volume S 0 contains the cell. Suppose there are p grid points on each edge of a cell and a p p p × × grid of currents is used to represent m currents in wire segments in cell k. A set of N t test points is chosen on a collocation sphere that has radius R c >R 0 , and whose center is coincident with the center of cell k. The problem region is outside of the collocation sphere. Then the potentials on these N t test points due to the grid currents are forced to match those induced by the current distribution in wire segments by solving the linear equation: 
is the pseudo-inverse of P gt [23] and can be calculated by singular value decomposition. There are two reasons to use the pseudo-inverse here: first, the number of test points may be larger than the number of grid points for cell k, and second, the possible symmetric positions of the test points on the collocation sphere may cause the P gt matrix to be nearly singular and introduce inaccuracies if the normal matrix inverse is used. This procedure provides us with W(k), the part of the projection operator associated with cell k; the j th column of W(k) is the contribution of the j th wire segment in cell k to the p 3 grid currents. Since P gt is small and is taken to be the same for all of the cells,
can be calculated once in the setup step with a very small computational cost, and is directly used in each step of the precorrected-FFT that requires its value. Note that the grid currents obtained from cell k constitute only a part of the currents on these grid points if they are shared with neighboring cells. The grid current on a grid point shared by multiple cells is calculated as the sum of the contribution from all of the wire segments which reside in those cells.
Calculation of grid potentials by FFT
Once the currents in wire segments are projected to the grid, the grid potentials due to the grid currents are computed through a multi-dimensional convolution, given by:
is the grid potential at the grid point whose index in three dimensions is (i,j,k) and each entry of H is given by:
which is the contribution to the grid potential at grid point (i,j,k) induced by unit point current at grid point (i',j',k').
It can be seen easily that (12) has the form of a convolution operation, and the discrete Fast Fourier Transform (FFT) can be exploited to rapidly implement this convolution. On a practical front, we observe that the matrix H needs to be computed only once during this computation. Moreover, the number of grid points in each dimension is best chosen as a power of two, or as a value with only small values of prime factors, so that the implementation of the FFT is efficient. For further efficiency, the sparsity properties of I g and H can be exploited.
Interpolation
After the grid potential is calculated using the FFT, the values of ∑ ∫ The proof of this theorem is provided in [17, 18] . However, whether the interpolation operator is the transpose of the projection operator or not depends on the discretization scheme used in the discretization of the integral equation [24] . As described in [24] , if a Galerkin scheme is used, so that the entries of the dense matrix include the integration about both the aggressor discrete element as well as the victim discrete element, the dense matrix will be symmetric and positive definite. In this case, the transpose of the projection operator can be used as the interpolation operator. If (1) and (2) are applied to calculate inductance values, the inductance matrix, M, is just the dense matrix resulting from the discretization under the Galerkin scheme, so that the interpolation operator is W T .
Precorrection
The grid representation of the current distribution in a cell is only accurate for potential calculations that correspond to long-range interactions. In practice, nearby interactions have the largest contribution to the total induced potentials, and therefore, they must be treated directly and accurately. Since the nearby interactions have already been included in the potential calculation after the above three steps, it is necessary to subtract this inaccurate part from the result of the interpolation step before the accurate measure of nearby interactions is added in. 
where W(l) and W (k) T are the projection operator in cell l and interpolation operator in cell k, respectively. H(k,l) is the part of the multi-dimensional convolution step that calculates the grid potential throughout cell k due to the grid currents throughout cell l. The precorrection step subtracts
and then adds the accurate direct
is a precorrection operator for cell k corresponding to cell l and is given by:
Although the M × I product may be calculated many times (for example, in the loop of calculating moments in PRIMA), the expense of computing ) , ( l k M is incurred only once in the initial setup step, and can thence be reused. After precorrection, V(k) is a good approximation to the real result of ∑
, because it includes long-range contribution to the potential through the grid representation and short-range contribution through the direct calculation.
Complete precorrected-FFT algorithm
Combining the above steps leads to the complete application of precorrected-FFT algorithm on the dense inductance matrix and vector product problem. The final solution of the induced voltages is:
where W is the sparse projection operator, of which each nonzero entry W ij is the contribution of jth entry in the I vector on to the grid current at the i th grid point. H can also be constructed as a sparse matrix for an efficient implementation of FFT. M is a sparse matrix because the number of cells included in the calculation of nearby interactions is small, and each nonzero entry ) , ( j i M is the error caused by the grid representation during the calculation of the value of
for wire segment i due to the current in wire segment j in a nearby cell. The complete algorithm, including the setup step, is illustrated in pseudo code as follows:
Precorrected-FFT approach to compute M × I:
1. 
Calculate the mutual inductance terms associated with the aggressor cell l and the victim cell k: M(k,l) while the nearby interaction are taken into account by direct calculations. This concept can be used for both the electric field and the magnetic field, and therefore for both capacitance and inductance extraction. The kernel of the calculation of both field potentials is 1/r, where r is the point-to-point distance or the point-to-origin distance.
Although the above description of the precorrected-FFT method is superficially similar to that in [17] , the implementation and the application of precorrected-FFT in this paper differs from that for the capacitance extraction in several ways. These differences, which constitute the contributions of this paper, are:
1. The computation of the projection operator W for capacitance extraction involves a two-dimensional integration, while for the magnetic field, it is a much more complicated three-dimensional integration.
2. Our objective is to solve V=MI fast and accurately, rather than calculating M exactly. Here I is treated as the a set of fictitious inductor currents and V is the summation of the integration of the magnetic vector potential over wire segments caused by the current in each aggressor wire segment. The magnetic field induced by I as well as V are not real world quantities, unlike capacitance extraction, where the method is used to solve V=PQ for a real physical electric field. We show how the calculated MI product can be used in various simulation schemes, including PRIMA and in circuit simulation using iterative solvers.
Computational cost and grid selection
Since VLSI chips are thin and flat, one option is to use only one cell in ẑ (thickness) direction. In addition, there are three parameters that need to be determined before the precorrected-FFT algorithm is applied to a circuit: p, q and d. Parameter p is the number of grid points on each edge of a cell, so that each cell is approximated by p 3 grid points in three-dimensional grid. For example in Figure 1 , there are 2 3 grid points throughout the volume of the three-dimensional cell.
Parameter q is the number of nearby cells that are considered in the precorrection step. For example, if we only consider the first nearest neighbors to each cell (defined as all cells that have a vertex in common with the considered cell, including the cell itself), the value of q is 9. Parameter d is the cell size, defined as the length of a cell's edge in the x and ŷ directions, which we will take to be equal. For a given chip size, the number of cells N c is inversely proportional to d 2 . We reiterate that in order to implement the FFT efficiently, it is convenient to choose the number of grid points as a power of two, or as a number whose prime factors are small.
As the interpolation operator W T is only the transpose of the projection operator W, the construction of W T has virtually no overhead, so that we only consider the projection, the FFT and the precorrection steps in the analysis of the computational cost. The complexity of each of these steps can be analyzed as follows:
• In the projection step, if n wire segments and p 3 test points are used to construct W, then the computational cost is From the above analysis, it is easily seen that the computational complexity of the entire precorrected-FFT procedure is )) log( ( n n O . It is clear that increasing the values of p and q will both increase the computational cost of the algorithm and its accuracy. Of the three parameters, if p and q are fixed, then a larger value of the cell size will result in a smaller number of cells, so that the computational cost of precorrection is increased. On the other hand, if the cell size is decreased and the number of cells is increased, the cost of performing the FFT will increase. This suggests that there is an optimal cell size that yields a minimum value of cost. To search for this optimum, it is possible to perform a search that starts with a larger cell size and a smaller number of grid points, and then decreases the cell size until the minimum run time is reached. The worst-case accuracy is a function of q; in most of the experiments in this paper, only the first nearest neighbors are included in the precorrection step and p is chosen as a small value, so that the cell size is easily selected. In this sense, the method for choosing the cell size is somewhat easier and more reliable than the methods used in [4] [6] to find the local interaction region, since in the precorrected-FFT approach we only need to look for a minimum value of CPU or memory cost with some consideration of accuracy.
Accuracy of the projection step
As stated in the earlier description, the precorrected-FFT method uses a grid representation for a current distribution. An analysis of each of the steps for sources of errors is as follows:
• The principal source of errors in the precorrected-FFT method lies in the projection step, where grid currents are used to replace currents in wire segments.
• The interpolation step results in the same theoretical error as the projection step, so that it is not necessary to separately consider this step in the analysis of accuracy.
• The FFT step, which is applied to calculate grid potentials, is an efficient implementation of the discrete convolution and does not introduce any theoretical error.
• The direct calculation of nearby interactions introduces no theoretical error.
Therefore, in order to maintain the accuracy of the precorrected-FFT method, it is critical to ensure the accuracy of the projection step. 1 One reason why we choose a spherical collocation surface is that the spherical surface has the minimum surface area, over shapes with the same volume as.
For the same number of the collocation points, the smaller surface area of the spherical collocation surface relative to any other surface results in a larger density of the collocation points, which accordingly produces a larger accuracy with the same computational cost. In cases where R c is small, such as 1.5× the cell size, the error decays slowly with the distance of the evaluation point from the current distribution, and falls off sharply when the evaluation points are near the collocation sphere.
It can also be seen that the error decays faster if the radius of the collocation sphere is larger. For example, when R c is 5.5 cell sizes and p equals 4, the error decreases from 3 
10
− at the first evaluation point to 8 
− at the evaluation point that is 25 cell sizes away from the current distribution. When R c is 1.5 cell sizes and p equals 4, the error is nearly level at 4 
− after a sharp change at the collocation sphere. For an R c value of 2.5×, 3.5× or 5.5× the cell size, the worst error is the same. It is also observed that no matter what the radius of the collation sphere is, the accuracy from a higher order approximation is also higher than that of a lower order approximation when the evaluation point is far away from the collocation sphere. These results are seen to be largely consistent for three values of θ. 
Experimental results
A set of experiments was carried out on a 400MHz Sun UltraSparc-II computer server to test the accuracy of the response from the precorrected-FFT method, and to compare the results with those of the block diagonal method [6] in terms of accuracy, speed and memory cost. The test circuit is a four metal layer conductor structure on layers M6, M7, M8 and M9 of a nine-layer chip, as illustrated in Figure 6 , which shows the top view and the cross sectional view of the structure. It lies within an area whose width is 330µm and thickness is 5µm. The circuit consists of three parallel signal wires, each with 0.8µm width, 0.8µm spacing and 0.5µm thickness. The power/ground wires are distributed densely in the four layers and the signal wires are on M8. The width of the test circuit is fixed throughout the experiments and the length changes as the length of the signal is varied in different experiments. The driver sizes for the three signal wires are identical and are altered with the wire length in order to maintain a slope of 40ps at the near end of the signal wires. The drivers are made to switch at the same time so that the inductance effect is maximized and the error incurred by the precorrected-FFT method can be determined for a worst case condition. In the last part of this section, the experiments on a large industrial clock net are carried out the test the efficiency of the precorrected-FFT method in on-chip inductance simulation. 
Accuracy of the precorrected-FFT method
In the accuracy experiments, the value of p is set to 4, and the nearest neighbors and the next nearest neighbors are included in the direct interaction region. The cell sizes in the x and y direction are each chosen to be 15µm, while in the thickness direction, it is set to 7µm, such that the test structure is at the center of the cell. The radius of the collocation sphere is chosen to be 2.5 times the cell size. The cell size is chosen in the following way:
As explained in Section 3, if p and q are fixed, then a larger value of the cell size will result in a smaller number of cells, so that the computational cost of precorrection is increased while that of the FFT will decrease. On the other hand, if the cell size is decreased and the number of cells is increased, the cost of the FFT will increase, but the cost for precorrection will decrease. Of the three matrices W, H and P to be constructed, the run time for constructing W is not affected by the change of cell size. If the cell size is larger, the cost for constructing H and the run time of the FFT will decrease, but the cost for P increases. On the contrary, if the cell size is smaller, the cost for P decreases while that for H and for the FFT computation increase. In all the experiments in Section 4.1 and 4.2, there are 13 ports and the number of moments per port in PRIMA implementation is 5, as in [6] , and it has been demonstrated that the response from the reduced order model converges here even if more moments are used in the simulation. Therefore the FFT have to be carried out 65 times because there are multiple ports and moments. The run time for constructing H and P, the run time for all of the FFT calculations, as well as the total run time for the three are listed in Table 1 . The cell size is chosen to be the one that results in the lowest run time. A simulation for the same circuit is also carried out with the block diagonal approximation [6] . The partition size in the block diagonal approach is 180µm×150µm, which is much larger than the direct interaction region of 75µm×75µm. Figure 7 shows a comparison of the results from the precorrected-FFT and block diagonal methods with the accurate waveforms for 900µm long wires, with waveforms at both the driver and receiver sides of the middle wire being shown. The accurate waveforms are obtained by using the full inductance matrix in PRIMA without any approximation while the approximate waveforms come from the same PRIMA simulator but using the precorrected-FFT or block diagonal method. There are six waveforms in Figure 7 , although only four are clearly visible since the waveforms from the precorrected-FFT almost completely overlap with those from the accurate simulation. The largest error in the response from precorrected-FFT is less than 1mV. With about 100mV oscillation magnitude induced by inductance, the relative error of the oscillation magnitude is 0.1%. The relative error in the 50% delay for the response from precorrected-FFT is even smaller. Although for a victim line segment, more aggressor line segments are considered in the direct interaction region in the block diagonal method than in precorrected-FFT, the error in the response from the block diagonal procedure is still larger than that of precorrected-FFT. The accumulated errors caused by the dropped mutual inductance terms could too large to be ignored if an accurate simulation is desired. Because of the high accuracy that can be obtained by the precorrected-FFT method for this example, we observe that we can sacrifice some of the accuracy for higher speed. Different orders of approximation are tested to study the relation between speed, memory requirements and accuracy. The layout tested is similar to the above experiment but the length of the signal wires is extended to 5400µm, which is the largest tested wire length, so as to show the largest reduction in accuracy with the coarsening of the grid. Since there are more than 31,000 inductors in this circuit, including all of the inductors of signal wires and supply wires, a total of nearly one billion mutual inductances is required for accurate simulation. It is therefore impossible to simulate for the accurate waveforms even in PRIMA, let alone in time domain simulation. To simulate the response most accurately, p is set to 4, the cell size is set to be 15µm, and the first, second and third nearest neighbors are considered in the precorrection step.
The response obtained from this setup is used as the accurate waveform for comparison purposes.
Other precorrected-FFT simulations are carried out with lower accuracy and a coarser grid, where only the nearest neighbors are considered in the direct interaction region and the cell size is 30µm, which doubles the cell size in the above experiment. The cell size and the size of the direct interaction region are fixed in these experiments. The grids are variously chosen to be three-dimensional with p=4, p=3, p=2, and two-dimensional with p=4, p=3, p=2. The two-dimensional grid is in the plane that is parallel to the x-y plane and lies at the mid-point of the thickness of the test structure. In the two-dimensional case, the collocation sphere reduces to a collocation circle in the x-y plane, as shown in Figure 8 . Reduction of the problem to a two-dimensional grid will increase the efficiency of the computation at some cost in accuracy. It is expected that larger cell sizes, smaller values of p, and reduction in the size of the direct interaction region will each contribute to a loss in accuracy, but with an accompanying increase in the speed of the computation and a reduction in the memory requirements. The waveforms at the driver and receiver sides of the middle wire are
shown with different levels of accuracy, corresponding to p=2, 3 and 4, are virtually indistinguishable. Closer examination reveals that the error in the 50% delay is insignificant for the three cases, but the relative error corresponding to the overshoot/undershoot is discernible, and is listed in the last column of Table 2. This table also lists the accuracy, memory requirements and speed for each level of approximation. Table 2 : A comparison of the accuracy, memory requirements and CPU time for different parameter settings for the precorrected-FFT in the simulation of three 5400µm long signal wires. Here, "2D" and "3D" correspond to the twodimensional and three-dimensional cases, respectively. The total CPU time corresponds to the time required for the entire simulation, including the time required by the precorrected-FFT computations.
The setup time is the most time-consuming step in the entire algorithm, and is further divided into two parts.
The first part corresponds to the calculation of the inductance values needed for the construction of the precorrection matrix, which is equal for each order of approximation, while the second relates to the time required for the calculation of the W, H and M matrices. For p=3, under a three-dimensional grid, the error at the peak is less than 1mV. The relative error in the oscillation magnitude at that point is 1%, while the speed is increased by 45% as compared with the accurate result. If p is further reduced to 2 under a three dimensional grid, the error is 9mV but the speed is improved by an additional 16% compared to the p=3 case. The t wo-dimensional grid representation with p=2 results in the largest error of about 10mV and a similar relative error, but the speed is increased only by 6% as compared to its three-dimensional counterpart. The reason for this relatively low speed improvement is that in the case that p=2, the precorrected-FFT is rather fast and the time consumed in the calculation of W, H and M matrices is only a small part of the total setup time. Therefore, even a large increase in the speed of calcula tion of W, H and M matrices will not yield a significant reduction of the total run time.
Another reason is that the number of grid points per cell is only reduced by half by going from the three dimensions to two. On the other hand, if we reduce the three-dimensional grid to two dimensions with p=4, the speed can be increased by 22% because the number of grid points per cell is reduced from 4 3 =64 to 4 2 =16, and the time required for the calculation of W, H and M matrices plays a more important role in the total setup time. In this case, the accuracy is still high even under a two-dimensional grid. The memory requirements show a similar trend as the run time: for p=4 and p=3, the memory requirements are reduced by 27.5% and 32%, respectively, as we go from the three-dimensional grid to a two-dimensional grid. 
Comparison of the precorrected-FFT method with the block diagonal method
The comparison in terms of accuracy between the precorrected-FFT and block diagonal methods has been described in Section 4.1. In this section, comparisons in terms of memory consumption and speed between the precorrected-FFT and the block diagonal methods using a Matlab implementation are carried out for structures of different wire lengths. Performance results using an optimized C++ implementation are reported in Section 4.3. The lengths of the signal wires in different experiments are set to 900µm, 1800µm, 3600µm, 4500µm and 5400µm. In the block diagonal method, the partition size is chosen to be 180µm ×150µm (180µm in the x direction and 150µm in the y direction). For the precorrected-FFT method, a two-dimensional grid is imposed with p=2, and the first nearest neighbors are considered for the precorrection step. The cell size is set to 30µm. Figure 9 shows the waveforms computed by the two methods at the receiver end of the middle wire for wire lengths of 900µm and 5400µm. The accuracy, memory requirements and speed for different wire lengths for the block diagonal and precorrected-FFT methods are listed in Table 3 . For the wire lengths of 900µm and 1800µm, the results of the precorrected-FFT and block diagonal methods are similar to each other, and the block diagonal method is faster.
However, as the wire length increases, the differences in the 50% delay and oscillation magnitude become larger.
For example, the 50% delays calculated by the precorrected-FFT and block diagonal methods are 95ps and 100ps
respectively for a wire length of 3600µm, which is a difference of about 5%. The difference increases to 8% when the wire length is 4500µm and 12.5% when the wire length is 5400µm. For wire lengths that exceed 1800µm, the precorrected-FFT and block diagonal methods perform their computations at approximately the same speed, but the former has nearly half the memory requirements as the latter since the partition size for the block diagonal method is much larger than the direct interaction region in the precorrected-FFT, i.e., the number of inductances per wire segment to be calculated by the block diagonal method is much larger than that for the precorrected-FFT approach.
Moreover, as the circuit size increases, the setup time and memory consumption are seen to increase at a faster rate for the block diagonal method. Similar trends are seen for the differences in the oscillation magnitude as for 50% delay. For example , if the wire length is 4500µm with a 210mV overshoot, the difference is 40mV. If the wire length is increased to 5400µm, the block diagonal method calculates a larger overshoot of about 300mV, which is about 150mV different from that computed by the precorrected-FFT approach. The precorrected-FFT predicts a more reasonable trend in the overshoot magnitude for different wire lengths: the overshoot increases as the wire length is increased from 900µm to 1800µm, and then decreases gradually as the wires grow longer. When the wire length reaches 5400µm, the output has a smaller overshoot compared with the cases when wires are 4500µm, 3600µm and 1800µm long.
However, the trend predicted by the block diagonal method is different: the overshoot magnitude increases from 900µm to 1800µm long wires, and then decreases if the wire length increases from 1800µm to 4500µm, as in the case of the precorrected-FFT method. However, when the wire length increases from 4500µm to the largest tested length of 5400µm, the overshoot is not reduced but is increased in the block diagonal method, which is clearly inconsistent. We observe that the difference between the results from the block diagonal method and those from the precorrected-FFT is larger for longer wires. Overshoot 151mV 120mV  300mV  120mV  123mV  142mV  161mV  Run time  2917s  681s  3235s  5032s  9700s 6hrs. 12hrs. Table 4 : Overshoots and run times at the receiver side of the middle wire with the length of 5400µm from the precorrected-FFT method (PCFFT) and the block diagonal method (BD) with different partition sizes:
30µm×30µm, 180µm×150µm, 330µm×150µm, 330µm×300µm, 330µm×600µm, 330µm×900µm.
When the partition width increases from 180µm to 330µm, the 300mV bump disappears: the reason may be that more power/ground wires are included in each partition, and the inductance effect is greatly reduced. If the partition length is increased from 150µm to 300µm and then to 600µm and 900µm, with a 330µm partition width, the overshoot increases and nears the result from the precorrected-FFT method. It is impractical to increase the partition size further because the simulation time for 330µm×600µm partition is 6hrs, and includes 26.6M mutual inductances, while the simulation time for 330µm×900µm partition is 12hrs, and uses up about 3Gb memory. On the contrary, the precorrected-FFT method produces a similar overshoot within an hour and only 110Mb memory.
We also test the same circuit with a higher level of accuracy in the precorrected-FFT method with the fifth nearest cells included in the precorrection step and the overshoot is only 2mV different. The trends in the overshoots and run time from the precorrected-FFT and block diagonal methods indicate that the precorrected-FFT converges easily, and therefore is a better candidate for fast simulation of large inductive circuits for higher accuracy.
The problem faced here by the block-diagonal method is common to most of the existing algorithms in onchip inductance extraction. As the circuit size is increased, the local interaction region should be larger to maintain the same accuracy in the simulation. However, it is hard to predict this interaction region a priori, and for large circuits, increasing the interaction region gradually is impractical as it could result in very long simulation times.
The precorrected-FFT method, on the other hand, overcomes this difficulty by including the calculation of far away inductance interactions using the grid representation.
Application of precorrected-FFT with optimized implementation on signal lines and a large clock net.
In addition to a Matlab-based implementation of the precorrected-FFT method, an optimized version using C++ was also implemented. To demonstrate the efficiency of the precorrected-FFT method, layout structures with different length of signal wires, as depicted in Section 4.2, and a la rge global clock net of an industrial giga-hertz microprocessor are simulated using this optimized implementation of the precorrected-FFT method.
For layout structures with different length of signal wires, the number of resistances, capacitances and inductors in circuits and the total CPU times of the simulations in the precorrected-FFT method are listed in Table   5 . A three-dimensional grid is imposed with p=3, and the first nearest neighbors are considered for the precorrection step. The cell size is set to 30µm. The simulations in the precorrected-FFT method can be very fast.
For the circuit with 5400µm signal wire, which includes 32.3K resistances, 64.5K capacitances and 32.3K
inductors, the total CPU time is about 6 mins. Table 5 : Circuit parameters and run times for layouts with different length of signal wires from the precorrected-FFT method.
The layout of the clock net is shown in Figure 10 and has 4 ports, 12 sinks and 121,065 inductors, which corresponds to 7.3G inductance terms. With the optimized implementation of the precorrected-FFT algorithm, the run time for generating the reduced order model was 21 minutes, using a three-dimensional grid. The signal responses from simulation of the RC model, the ROM generated using precorrected-FFT and block diagonal methods are shown in Figure 11 and the layout and experimental parameters are listed in Table 6 . On-chip inductance has a strong effect on the clock net responses. The 50% interconnect delay with the precorrected-FFT method is 130ps, compared with 86ps for the response with the RC model. Relative to the 0. Table 6 : Layout and experimental parameters (X, Y, Z: x, y and z directions in Figure 10 )
Conclusions
A precorrected-FFT algorithm for fast and accurate simulation of inductive systems is proposed in this paper, in which long-range components of the magnetic vector potential are approximated by grid currents, while nearby interactions are calculated directly. All inductance interactions are considered in computing the product of the inductance matrix with a given vector, so that the waveforms at the nodes of interest are calculated accurately. A comparison with the block diagonal algorithm showed that the precorrected-FFT method results in more accurate waveforms and less run time with much smaller memory consumption. Different approximations in the precorrected-FFT method, including using a two-dimensional grid structure, were tested and showed that the lower order of approximation greatly increases the speed and reduces the memory consumption without much loss in accuracy. Experiments carried out on large industrial circuits demonstrate that the precorrected-FFT method is a fast and highly accurate approach for on-chip inductance simulation in large circuits.
