Abstract-An algorithm is presented for the fast and accurate simulation of power/ground mesh structures. Our method is a direct (noniterative) approach for simulation based upon a parallel matrix inversion algorithm. The new dimension of flexibility provided by our algorithm allows for a more accurate analysis of power/ground mesh structures using resistance, inductance, capacitance, interconnect models. Specifically, we offer a method that employs a sparse approximate inverse technique to consider more reluctance coupling terms for increased accuracy of simulation. Our algorithm shows substantial computational improvement over the best known direct and iterative numerical techniques that are applicable to these large-scale simulation problems.
. Behavior of the model when square wave voltages were connected to its terminals.
to the Intel Core2 Duo 3 GHz CPU, using only a single core, corresponds to 45 000 clock cycles.
These simulation measurements show similar behavior to the results of the implemented memristor [2] by HP Labs.
IV. Conclusion
We presented a new SPICE macromodel of the recently implemented memristor. This macromodel could be a powerful tool for researchers and electrical engineers to design and experiment new circuits with memristors. Comparing the IV characteristics and also the time dependences of the state variables, our simulation results show the figures with qualitatively and quantitatively similar behavior to the lately published measurements of the physical implementation [2] . The functionality of our macromodel is demonstrated with computer simulations. The source code of our macromodel can be found in the appendix below. 
I. Introduction
The accurate and efficient modeling and simulation of power/ground networks has become a difficult problem for modern design. Increases to integration density have necessitated the use of large-scale power mesh structures, and with the scaling of voltages, the need for accurate simulation of these structures is crucial. Previously employed direct methods for simulation of this problem have become impractical due to both extraordinary memory requirements and prohibitive simulation times. This has prompted several variations of iterative schemes [1] - [7] that attempt to meet these rising computational challenges. The convergence for each of these methods, and therefore, the simulation time, is problem dependent (i.e., both switching activity within the network and branch coupling will affect the simulation time). Although most of these methods have been shown to be quite successful for large-scale simulations (millions of nodes) of resistor capacitor mesh structures, none have clearly demonstrated an efficient and scalable approach to deal with inductive coupling effects. This can be largely attributed to the fact that with the inclusion of inductive coupling, much of the locality for the problem is lost. Specifically, an iterative method that uses small independent or slightly overlapped subsets of the network in order to infer information about the global system dynamics will not converge quickly if there is significant coupling across different regions of the network. In addition, as was alluded to by the authors of [4] , the conditioning of the underlying system matrices would degrade if the interconnects, which constitute the mesh structure, are modeled as resistance, inductance, capacitance (RLC). By employing a parallel matrix inversion technique for simulation, we offer a stable alternative that allows for the efficient simulation of networks with a large amount of branch coupling. The parallel method for solving block tridiagonal systems presented in this paper scales well with the inclusion of additional reluctive coupling effects, within an assumed block tridiagonal structure.
II. Simulation of RLC Mesh Structures
When RLC interconnect models are used for the simulation of mesh structures (see Fig. 1 ), the typical modified nodal analysis representation yields equations of the form
where
Here, R,Ĉ, and L are the resistance, capacitance, and inductance matrices, respectively. The matrices A g and A c transform the conductances and capacitances into node based relationships. The matrices A l and A i link the node voltages and branch currents described by the state variable x. In addition, I s is the current vector that dictates, through the matrix A i , the relationship of the current sinks onto the nodes of the mesh. Considering a uniform discretization of the time axis with resolution h and using the notation v 
It is important to note that with the inclusion of inductance, for the modeling of the interconnects, we must now account for the effect of this additional susceptance term S.
A. Inductance Approximation Methods
We begin first with the construction of the coefficient matrix K from (2), given a regular power mesh topology. If all mutual inductive couplings are considered, both the reluctance matrix L −1 and coefficient matrix K will be dense. In this paper, we investigate the efficiency and accuracy of simulating power mesh structures when considering a block tridiagonal susceptance matrix. A matrix Y is block tridiagonal if it has the form
with N y diagonal blocks of size N x each. We now introduce a flexible method by which reluctance values can be approximated to produce a sparse block tridiagonal susceptance matrix.
In [8] , [9] the accuracy for simulation with interconnects was explored using window based techniques. In those papers, reluctive coupling was considered only to exist between neighbors in a given layer of parallel wires. In this paper, we consider the use of a sparse approximate inverse technique (SPAI) [10] . Given an inductance matrix L, the SPAI method can be used to form another matrix M that is constructed in an attempt to match the inverse of the inductance matrix under the Frobenius norm
where n in the number of columns of L and e i is the ith euclidean basis vector. Therefore, we can solve n independent The column labeled "NNZ" contains the number of non-zero entries in the coefficient matrix and the column labeled "S (%)" contains the percentage of the matrix whose entries are zero.
least squares problems in order to construct the approximate inverse matrix M. In this paper, we employ a threshold based approach to create a matrix whose entries must be significant with respect to the absolute maximum in a column
where the diagonal entries M i,i are always included. If τ is close to zero, this criterion would prevent fill-in and result in a matrix that is very sparse. The value τ = 1 would correspond to a matrix M where the entire pattern of L −1 will be considered. We may form a block tridiagonal coefficient matrix K by evenly separating the nodes in the RLC mesh structure into groups. This block decomposition is illustrated in Fig. 1 , where all nodes enclosed together are considered to be part of the same block. Given this decomposition, a sparse approximation to L −1 is formed so that S and K will be block tridiagonal. Table I examines the accuracy of simulation considering increases to the drop tolerance parameter τ. Next, we address the numerical challenges associated with using a direct (noniterative) approach for the simulation of these structures.
B. Inverses of Block Tridiagonal Matrices
The inverse of a symmetric block tridiagonal matrix can be computed explicitly, as detailed in [11] , [12] . Specifically, there exist two sequences of "ratio" matrices {R i }, {S i } so that the inverse of a block tridiagonal matrix K can be written using a "compact" representation
Here, the diagonal blocks of the inverse, D i , and the ratio sequences can be determined through a series of recursions [11] , [12] . The time complexity associated with determining the parametrization of K −1 by the above approach is O(N In this paper, we build upon these ideas to create a scalable distributed framework for the transient simulation of power mesh structures. We begin by generalizing the method from [11] in order to compute all information necessary to determine a distributed compact representation of K −1 . It is important to note that the method in [11] was developed specifically to determine the diagonal entries for a matrix with structure similar to that of K −1 , but not the entire compact representation. The question then becomes what additional computation is necessary to find the compact representation for K −1 , as shown in (6). If we examine closely the block tridiagonal portion of K −1 , we find the following relations:
where Z i denotes the ith off-diagonal block of K −1 . Thus, we would like to calculate both the diagonal and off-diagonal blocks of K −1 in order to formulate a compact representation for the matrix.
Our divide-and-conquer algorithm is illustrated in Fig. 2 . If K is separated into p sub-matrices {φ i } there will be log p combining levels with a total of p − 1 combining steps needed to form K −1 . Here, φ 1∼2 is the inverse of a matrix comprised of the blocks assigned to both φ 1 and φ 2 . It is important to note that using the matrix inversion lemma repeatedly to join submatrix inverses will result in a prohibitive amount of memory and computation for large simulation problems. This is due to the fact that at each combining step all entries would be computed and stored. For example, from Fig. 2 we can see that the final combining level would require the computation of half the entries from the matrix: N x N y 2 /2. Thus, we introduce matrix mappings in order to avoid any unnecessary computation during the combining process. Specifically, the matrix maps only require the computation of 4N 2 x entries for each combining step [11] , [12] .
It has been shown in [11] and [12] that both the boundary block entries (first block row and last block column) and the block tridiagonal entries from any combined inverse φ −1 i∼j must be attainable (not necessarily computed) for all combining steps. Thus, we assign a total of 12 N x ×N x matrix maps M 1−8;i and C 1−4;i , for each sub-matrix φ i . Fig. 3(a) illustrates the mapping dependencies for the first block row and last block row during the initial combining step, i.e., forming φ −1 1∼2 . Fig. 3(b) illustrates the mapping dependencies for the block tridiagonal portion of K −1 . Here, we see that the maps M 5−8;i are used to produce intra-domain information, while the "cross" maps C 1−4;i are used to produce inter-domain information.
Given these governing responsibilities for the mappings, we must follow the process from Fig. 2 in order to "update" the maps to their correct state during each of the hierarchical combining steps. Although the updates to these maps will mimic the procedure of [11] , in this paper we must also take into account interactions between neighboring divisions. This is due to the inclusion of the inter-domain cross maps C 1−4;i . Under this generalized framework, matrix maps can be used to determine both the diagonal and off-diagonal block entries for K −1 . This subsequently allows for the computation of the ratio sequences for K −1 , via the relationships shown in (7), in a purely distributed fashion. The complete matrix inversion process, including the necessary update procedures, are derived in [12] . If the problem is distributed evenly across p CPUs, the time complexity of the algorithm presented is O N 
C. Parallel Matrix-Vector Product
Given this distributed compact representation, we formulate recursions in order to compute a matrix-vector product for each step in the transient simulation, i.e., evaluate x k = K −1 c k at each time step k. In [12] it is demonstrated that the elements of x k can be found by through two sequences of vectors {W l } and {T l }. The elements from each sequence can be computed through the following recursions:
where c k;l is the lth portion of the vector c k (each of which has N x elements). We can then solve for x k;l = W k;l + T k;l−1 , where T k;0 = 0. Although this multiplication procedure seems to be of a strictly recursive nature (and hence not readily parallel) we show that the vector x k can in fact be computed efficiently in a distributed fashion. We begin by separating the matrixvector into p sub-problems
k ), where c (i) k = 0 outside the range of the sub-matrix φ i . This decomposition of the matrix-vector multiply results in a set of operations that can be cascaded across the processors in a hierarchical fashion [12] . Therefore, if the problem is distributed evenly across the p CPUs, the total complexity of the process is O(N 
III. Numerical Results
There are two main categories of algorithms for solving sparse linear systems of equations: direct and iterative. In this section we will first demonstrate the advantages in scalability of direct methods for the accurate transient simulation of mesh structures. Here, the transient simulation time using both direct and iterative methods are compared for varying levels of reluctive coupling. Finally, the ability to trade-off computing resources for both increased mesh sizes and transient simulation time using the parallel inversion algorithm will be highlighted.
Our algorithm has been implemented, in C, and compared against standard algorithms unsymmetric multifrontal package (UMFPACK), MATLAB Sparse LU, and the conjugate gradient (CG) method using incomplete LU (ILU) preconditioner. All simulations were performed on a cluster of 32-bit 3 GHz Intel Xeon workstations with 2 GB of memory for each node. The simulations considered in this paper are for square power meshes of dimension m×m. The sizes of the variables N x and N Y for the block tridiagonal representation of the coefficient matrix will depend on the amount of inductive coupling considered. All inductance values used as inputs for the windowing and SPAI approximation procedures were generated with the FastHenry extraction tool [13] . We consider V dd pins to be placed at equally spaced positions throughout the grid. A random subset from the remaining nodes are considered current sinks with a square waveform triggered by a rising clock edge. All simulations consisted of 1500 time steps, given a step size of 0.1 ps and clock signal of 50 ps. All interconnects were assumed to be of uniform size: 1 µm × 2 µm × 100 µm.
A. Simulation Time
The total simulation time, given any level of inductance approximation, is dominated by the fixed time cost of inversion or factorization plus the variable time cost to multiply or solve at each time step. When considering transient simulations involving a large number of time steps, any speed-up seen in the variable time cost will dominate the fixed time cost.
1) Comparison Against Direct Solvers:
In order to gain perspective for the computational limitations for each of the direct algorithms considered in this paper, several large-scale simulations m = 128, 192, 256, 384, and 512 were performed. Table II shows the size of coefficient matrix and the number of non-zero entries, considering different amounts of reluctance coupling. Table III shows the transient step solve times of the direct algorithms for these mesh sizes. It was determined that the UMFPACK algorithm was only able to handle up to a mesh size of m = 192 with τ = 0.85. This case corresponded to a coefficient matrix with approximately 110k unknowns, but less than 412k non-zero entries. The memory consumption of the Sparse LU algorithm scaled slightly better, being able to perform the simulation for a mesh size of m = 192 with τ = 0.90 which corresponds to over 2× the number of nonzero entries as compared to τ = 0.85. The divide-and-conquer method was clearly the most memory scalable of the direct algorithms, it was able to perform the largest example in this paper: m = 512 with window based approximation (involving 785K unknowns and 7M non-zero entries). The divide-andconquer approach was able to perform the largest simulation m = 512 with window based approximation using p = 32 computers and the remaining cases using p = 16 or lower. For the case of m = 512, we are considering a compact representation for the inverse of the coefficient matrix, shown in (6) , that would account for nearly 30 GB of memory. Next, in order to give a practical measure for the improvement of the divide-and-conquer approach we examine in more detail the effect of increasing the amount of reluctance coupling. Using several smaller mesh examples, m = 16, 32, 48, and 64, we examine the sensitivity of each algorithm to the inclusion of reluctance coupling terms, i.e., larger values of the parameter τ. Table IV shows the results for the divide-andconquer method and Table V for the Sparse LU and UMF-PACK algorithms. From Table V , first notice that although simulations using the UMFPACK algorithm with the windowing technique matched the accuracy seen across all other algorithms, the simulation time was often substantially larger than that seen using the more dense SPAI based approximation. This can be attributed to the fact that the UMFPACK algorithm was unable to properly decide on an efficient ordering for elimination given the windowing coefficient matrix.
2) Comparison Against Iterative Solvers: We now turn our attention to the performance of iterative methods for the transient simulation of power mesh structures. Specifically, we analyze the lack of scalability for the CG algorithm with respect to the addition of reluctance coupling. Although the CG method has the smallest memory consumption of any algorithm considered in this paper, we observe that the iterative CG method using ILU scaled the worst with respect to the inclusion of reluctance coupling. A comparison of transient step solve times for the example m = 64 are shown in Fig. 4 . It is important to note that the times for the divide-and-conquer method do not change as all terms involved in the parallel matrix-vector multiply are dense. This property does not hold for any other algorithm considered in this paper. Table VI shows the sensitivity of the CG algorithm to the inclusion of additional reluctance coupling terms, again using several smaller mesh examples. On average the time for CG was more than 11× slower when comparing the SPAI approximation with τ = 0.99 to the basic windowing approach. If we use this fact, we can arrive at speed-up factors when comparing the dominant computational task for transient simulation. Specifically, if we consider the maximum scaling of the divide-and-conquer method for the cases m = 256, 384, and 512, we calculate speed-up factors of 8.9×, 7.2×, and 9.2×, respectively, when considering transient solve times for a τ = 0.99 quantity of reluctance coupling.
IV. Conclusion
Currently employed techniques for the simulation of mesh structures attempt to address the issue of increased problem sizes by trading off accuracy for simulation time via iterative schemes. Our algorithm is a direct solver that facilitates the simulation of large mesh structures through a divide-andconquer approach. Due to the inherently parallel nature of the algorithm, computing resources can be flexibly allocated toward either speeding up the simulation of a problem of a given size, or solving problems of larger sizes in comparable time. The scalability of the divide-and-conquer method is clearly demonstrated by the absence of increases to time for the primary computational task for transient simulation, when considering the inclusion of these additional coupling terms. This attribute is not shared by any of the other methods analyzed in this paper. In addition, the divide-andconquer method was able to show substantial computational improvement over the most widely used numerical techniques applicable for these large-scale simulations. Specifically, the divide-and-conquer approach allows for the the simulation of a 512 × 512 RLC mesh with a speed-up factor over 9× when compared to the CG method with ILU. Therefore, we conclude that the divide-and-conquer algorithm presented here offers a framework which can be built upon for the large-scale accurate simulation of power mesh structures.
A Functional Unit and Register Binding Algorithm for Interconnect Reduction

Taemin Kim and Xun Liu
Abstract-This paper describes a simultaneous register and functional unit (FU) binding algorithm in high level synthesis. Our algorithm targets the reduction of multiplexer inputs, shortening the total length of global interconnects. Specifically, our algorithm maximizes the interconnect sharing among FUs and registers by considering flow dependences, common primary inputs, and common register inputs among operations. Experimental results have shown that our scheme achieves more than 20% multiplexer input count reduction, on average, over previously proposed algorithms. Our approach delivers a 18% wirelength reduction of global interconnects with minor area overhead.
Index Terms-DSP synthesis, high level synthesis, interconnect, resource binding, synthesis. 
