Abstract-In the past decades, harmonic balance (HB) has been widely used for computing steady-state solutions of nonlinear radio-frequency (RF) and microwave circuits. However, using HB for simulating strongly nonlinear post-layout RF circuits still remains a very challenging task. Although direct solution methods can be adopted to handle moderate to strong nonlinearities in HB analysis, such methods do not scale efficiently with large-scale problems due to excessively long simulation time and prohibitively large memory consumption. In this paper, we present a novel graph sparsification approach for automatically generating preconditioners that can be efficiently applied for simulating strongly nonlinear post-layout RF circuits. Our approach allows to sparsify time-domain circuit modified nodal analysis matrices that can be subsequently leveraged for sparsifying the entire HB Jacobian matrix. We show that the resultant sparsified Jacobian matrix can be used as a robust yet efficient preconditioner in HB analysis. Our experimental results show that when compared with the prior state-of-the-art direct solution method, the proposed solver can more efficiently handle moderate to strong nonlinearities during the HB analysis of RF circuits, achieving up to 20× speedups and 6× memory reductions.
post-layout RF and microwave circuits that involve a lot of parasitics.
To achieve greater computing efficiency than the traditional direct solution methods, several preconditioned Krylovsubspace iterative methods have been investigated and developed in recent years [1] [2] [3] [4] . However, developing highquality preconditioners for HB analysis has been a very challenging task, since the convergence property of the preconditioned iterative methods for HB analysis strongly depends on the effectiveness of the underlying preconditioners, especially when using the Krylov-subspace iterative methods, such as the GMRES algorithm [5] . For instance, although prior preconditioning methods have shown promising results for trading off the computational efficiency and preconditioning effectiveness, they can be inevitably facing with a variety of limitations and difficulties when handling large and strongly nonlinear postlayout RF and microwave circuits: the block-diagonal (BD) averaging preconditioners are easy to compute but only limited to handle weakly nonlinear systems [6] ; the hierarchical HB preconditioner proposed in [3] is more effective than the BD preconditioner and also suitable for parallel computing, but can lead to poor performance or divergence when handling strongly-nonlinear RF ICs since the frequency domain decomposition scheme will introduce large errors during the preconditioning step; another finite-difference Jacobian preconditioner can easily deal with strongly nonlinear systems, but will not work for more than one tones in HB analysis, as discussed in [1] ; a sparse block direct solver is developed for solving the Jacobian matrices of HB, but it will consume much more computational resources than iterative methods and can not scale well with large RF circuit designs. As a result, there is not a preconditioning method that can work robustly for a wide variety of RF and microwave circuits analysis problems, and at the same time be computationally efficient (scalable to large problems sizes). Consequently, it is very desirable to develop efficient yet robust solvers to facilitate fast HB analysis for addressing the challenges in future large-scale RF and microwave IC design and verification procedures.
In this paper, recent graph sparsification and support-circuit preconditioning (SCP) techniques [7] [8] [9] [10] are exploited for developing scalable Jacobian matrix solvers that can tackle large-scale strongly nonlinear post-layout HB analysis 0278-0070 c 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
problems. Our approach starts with sparsifying the representative modified nodal analysis (MNA) matrices [11] that can be obtained by examining a set of sampled MNA matrices in time domain. Subsequently the above sparsified representative MNA matrices can be leveraged for effectively sparsifying the entire HB Jacobian matrix. We show that the resultant sparsified Jacobian matrix can be used as a robust yet efficient preconditioner in HB analysis. Our experimental results show that when compared with the state-ofthe-art direct solution method [1] , the proposed HB solver can more efficiently handle moderate to strong nonlinearities during the HB analysis of large RF circuits, achieving up to 20x speedups and 6x memory reductions. The main technical contributions of this paper have been summarized as follows. 1) Based on the recent graph sparsification theory, we propose a circuit-oriented SCP approach that can scale almost linearly with large-scale strongly nonlinear postlayout RF circuit. 2) Inspired by the direct solution method for HB analysis [1] , we have developed a highly efficient sparse block lower upper (BLU) solver that can efficiently factorize and solve the preconditioned systems formed by sparsified RF circuit networks. 3) Additionally, to achieve the optimal sparsification of RF circuit networks during HB analysis, we propose a transient (TR)-analysis guided graph sparsification scheme to help automatically compute nearly-optimal preconditioners for iteratively solving HB Jacobian matrices. The rest of this paper is organized as follows. In Section II, we briefly review the HB algorithm for steady-state nonlinear circuit simulations, graph sparsification, and support graph preconditioning techniques. From Sections III-IV, we describe the proposed scalable HB algorithm in details. Section V introduces the implementation of block sparse LU solver. Section VI demonstrates the optimal sparsification scheme. Section VII analyzes the memory and computational complexity of the proposed algorithm. Section VIII demonstrates extensive experimental results of a variety of strongly nonlinear post-layout RF circuits to validate the proposed approach, which is followed by the conclusion of this paper in Section IX.
II. BACKGROUND
We first review the basics of harmonic balance (HB) method for steady-state simulations of RF circuits. Then, we provide a brief introduction to graph sparsification theory and its applications in developing scalable preconditioned iterative matrix solvers.
A. Review of HB Analysis
Compared to time-domain analysis that can be obtained by performing TR circuit simulations, steady-state simulations of RF, and microwave circuits typically require HB analysis [1] [2] [3] [4] , [6] , [12] that can naturally handle frequencydomain data such as S-parameters of linear networks. The basic theory of HB method is introduced as follows. Consider a nonautonomous circuit analysis problem described by the following:
where x(t) ∈ n represents a set of state variables, n is the number of unknowns, y is the matrix-valued impulse response function of frequency-domain linear circuit components (such as S-parameter models), q(•) denotes a function for the nonlinear charge and flux, f (•) represents the static (memoryless) nonlinearities, and b represents the time-dependent excitations that are assumed to be periodic with a time period T. The circuit steady-state response x(t), and functions q(•) as well as f (•) will be periodic with period T. By writing the above equation in frequency domain and applying Newton-Raphson (NR) method, it can be shown that the linearized system in frequency domain becomes
where X and B denote the Fourier-coefficient vector of x(t) and b(t) respectively, is a diagonal matrix denoting the frequency domain differentiation operator, and −1 are the fast Fourier transform (FFT) and inverse FFT (IFFT) matrices, while C and G are BD matrices with BD representing the linearizations of q(•) and f (•) at h time-domain sampled points that can be described as follows for i = 1, . . . , h, respectively:
When the double-sided FFT/IFFT are used, a total number of h = 2k + 1 harmonics are included to represent each signal, where k is the number of positive frequencies being considered.
In each NR step, a linearized system is solved with a Jacobian matrix of (2)
The most time-consuming step in HB analysis is the one for solving the large yet nonsparse Jacobian matrix J hb shown in (5) . It can be shown that the dense blocks in J hb are mainly due to the block-circulant matrices C −1 and G −1 [6] . For instance, the block-circulant matrix G −1 can be expressed as
As a result, directly solving such a Jacobian matrix using LU-based direct solution method can be very runtime and memory costly due to the very dense matrix structure and large number of fill-ins introduced during the factorization procedure. Consider a recent state-of-the-art HB simulator developed in [1] . It has been shown that HB analysis of a post-layout RF circuit (LNA + mixer + filter) with 44 K nodes and 20 harmonics will result in a Jacobian matrix with more than 1.8 million unknowns that would require around 100 h and more than 15 GB memory when using LU-based direct solution method. On the other hand, the runtime and memory costs for another smaller test case (with half problem size) are 10× less than the previous case, indicating a rather poor algorithm scalability.
To avoid the direct factorization of large and dense Jacobian matrices in HB analysis, iterative methods can be applied to dramatically improve the computing efficiency. Krylovsubspace iterative methods, such as GMRES method [5] , are particularly suitable for such problems since only the matrix-vector operations are needed throughout the solution procedures. It has been shown that for HB analysis, the matrix-vector product
can be computed very efficiently, without explicitly forming the real Jacobian matrix. Unfortunately, iterative methods, such as Krylov-subspace iterative methods in particular, may suffer from slow convergence or even divergence issues unless robust preconditioners are adopted. However, finding efficient yet robust preconditioners for tackling general HB simulation tasks still remains an open problem. A good introduction of existing preconditioning methods for HB analysis of RF circuits can be found in [1] .
B. Graph-Based Preconditioning Approaches
Recently, a series of support-graph based preconditioning techniques has been introduced to solve circuit simulation problems. For instance, a support-graph preconditioned solver was presented in [7] for solving linear circuit networks, and SCP techniques for nonlinear transistor-level circuit simulations were proposed in [8] and [9] , which have been shown to achieve nearly-linear runtime and memory efficiency. In this section, the related background and techniques will be described in details.
1) Graph Sparsification Problems:
General linear circuit analysis problems can be converted into equivalent graph problems [11] . For instance, a linear resistive network can be represented by a weighted, undirected graph G = (V, E, w), where V is a set of vertices, E is a set of edges, and w is a weight function that assigns a positive weight to every edge. The Laplacian matrix A of a weighted graph is defined as follows:
where sum(s) = (s,v)∈E w(s, v) denotes the sum of the incident weights of vertex s. From (8), we can observe that Laplacian matrix is a symmetric matrix with nonpositive offdiagonals and zero row sums, which can be considered as an admittance matrix in circuit theory. For a vector x ∈ V , the Laplacian quadratic form of G is defined to be
It can be seen that Laplacian matrix A provides a measure of the smoothness of x over the edges in G [13] , since the more x changes over an edge, the larger the quadratic form becomes.
Graph sparsification is a very important technique that has been playing important roles in designing nowadays efficient graph algorithms [13] [14] [15] [16] [17] . Given a graph G = (V, E), a graph sparsifier (also known as support graph) G is a sparse subgraph of G that can approximate G in some measures such as the pairwise distance, cut values or the graph Laplacian. The goal of graph sparsification is to approximate a given graph G by G on the same set of vertices such that G can be used as a proxy for G in numerical computations without introducing too much error. A good sparsifier should have very few edges that will immediately result in significantly reduced computation and storage cost. More details of this technique can be found in recent research papers [15] [16] [17] .
2) Ultrasparsifier Support Graph Preconditioners: Support-graph preconditioning is to first construct a graph G according to a given graph Laplacian matrix A, and then extract the support graph G of G that can be further used to construct a preconditioner P for iterative solvers such as conjugate gradient or GMRES solvers. In practice, for a given graph a maximum or low-stretch spanning tree can be constructed and used as its support graph, which has been proposed in the past for solving linear systems with symmetric and diagonally-dominant matrices in nearly-linear time [13] , [18] [19] [20] . Support-graph preconditioning seeks to compute the preconditioner P such that the generalized eigenvalues and the condition number of the matrix pencil (A, P) are bounded [21] . If both A and P are symmetric positive definite (SPD) matrices, the convergence of classic Krylov-subspace iterative methods depends on the condition number κ(A, P) computed by
where λ(A, P) denotes the generalized eigenvalues. A stronger theoretical result on convergence can be derived as follows. Define the support of (A, P), denoted by σ (A, P), as follows [18] , [21] :
Subsequently, if one can split A and P into A = A 1 + A 2 + · · · + A m and P = P 1 + P 2 + · · · + P m such that all τ P i − A i are SPD matrices, one can show that the generalized eigenvalue of (A, P) is bounded by τ .
Compared to the original graph, the support graph has fewer edges. Since the spanning tree of a graph that includes n vertices and m (m ≥ n) edges retains only n − 1 edges, the power dissipated on the support graph is much smaller than the one on the original system. If a support-graph preconditioner not only preserves the eigenvalues, but also the power dissipation of the original system, it can be more effective than the previous spanning-tree support graph preconditioner [14] in reducing the number of iterations when using Krylov-subspace iterative methods. Consequently, a much better support graph can be formed by selectively adding extra links to the spanningtree support graph, which is also known as the ultrasparsifier support graph [14] . When using ultrasparsifier support graphs as preconditioners for iterative solvers, it is important to trade off the effectiveness and efficiency by carefully choosing the extra edges to be added to the spanning tree.
III. OVERVIEW OF PROPOSED SCP APPROACH
In this paper, the graph sparsification and support graph theories will be exploited to develop scalable Jacobian matrix solvers for strongly nonlinear post-layout HB analysis. Although direct solution methods for solving the HB Jacobian matrix J hb can handle strongly nonlinear problems, the fast growing cost for solving the J hb matrix in HB analysis due to the block matrix fill-ins during the BLU factorization process will make such methods computationally prohibitive [1] , as shown in Fig. 1 . In our approach, the original RF circuit is first sparsified into a support circuit (a sparsified circuit that can well approximate the original circuit) graph that has much fewer edges and maintains a tree-like structure, such that the number of fill-ins during LU factorization procedures can be dramatically reduced. Such a support circuit 1 can be subsequently used as a preconditioner to facilitate fast HB analysis, thereby significantly improving the runtime and memory efficiency. Although the proposed SCP process will also introduce some block fill-ins during LU factorization, the number of new blocks will increase almost linearly with the problem sizes owing to the tree-like circuit structure, while the original circuit topology will typically result in exponentially increased block fill-ins. As a result, the proposed SCP approach will allow to analyze much larger RF and microwave circuits than ever before, and still maintain a decent convergence rate during Krylov-subspace iterations in presence of moderate to strong nonlinearities.
In this paper, we present a novel graph sparsification approach for generating preconditioners that can effectively and efficiently facilitate HB simulations of strongly nonlinear post-layout RF circuits. In the proposed method, the system MNA matrix of each sampled time point that can be obtained from the g i and c i matrices in (3) and (4), will be first decomposed into a Laplacian (A L ) matrix (with stamped equivalent resistors and capacitors) and a complement (A C ) matrix (with other stamped components, such as inductors, transconductances and voltage sources, etc), as shown in Fig. 2 . Throughout this paper, we define the network derived from the Laplacian matrix as the Laplacian network, and the network derived from the complement matrix as the complement network. Subsequently, a representative Laplacian matrix is obtained by scaling and averaging all the sampled Laplacian matrices, which can be subsequently sparsified into a sparsified representative Laplacian matrix by constructing an ultrasparsifier support graph based on its Laplacian network. In the next step, the sparsification pattern of the system MNA matrices can be obtained by combining the sparsified representative Laplacian matrix with the complement matrix. Next, FFT and IFFT algorithms are applied to compute the sparsified HB Jacobian matrix in frequency domain that can be leveraged as a robust and efficient HB preconditioner in the following Krylov-subspace iterative procedures.
It should be noted that during the iterative solution procedure, the HB Jacobian matrices need not to be constructed explicitly. Instead, only the matrix-vector multiplications are computed at each iteration, which is more computational and memory efficient than the original HB methods that typically express the full Jacobian matrix explicitly. Although the matrix factors L and U of the SCP have to be formed explicitly, the proposed graph (circuit) sparsification technique can greatly reduce the memory and runtime consumption. The sparsified preconditioner matrix is usually much sparser than the original HB Jacobian matrix, and it thus can be more quickly solved using existing direct solution methods, such as the BLU solver proposed in [1] , leading to nearly-linear runtime and memory efficiency.
IV. SCP FOR HB ANALYSIS
This section describes the detailed procedures for computing the proposed SCP for HB analysis.
A. Sparsification of Representative Laplacian Matrices 1) Extraction of Representative Laplacian Matrices:
It is hoped that by examining spatiotemporal MNA matrix patterns obtained through TR circuit analysis, more meaningful SCP can be generated for HB analysis based on graph sparsification techniques. To this end, we propose a simple method for extracting the representative Laplacian matrix by examining time-domain sampled MNA system matrices.
To extract the desired support circuit from the original RF circuit scheme, the equivalent resistors and capacitors of each NR iteration (during TR analysis) are first stamped into the Laplacian matrix A L while other components are stamped into the complement matrix A C for each sampled time point, as shown in Fig. 2 . To properly preserve the impact of energystorage components, a fixed time-step size which is determined by the largest harmonic in HB analysis can be adopted for computing the system MNA matrix. For example, the equivalent conductance C/ t for a capacitor with a value of C will be stamped into the Laplacian matrix A L during TR analysis, where t is the time step size corresponding to the highest frequency harmonic component. The above stamping strategy can assure that the proposed SCP will not miss the edges presenting critical energy-storage elements.
It should also be noted that for different sampled time points, although the entry values of the system MNA matrices obtained from nonlinear device evaluations can be quite different, the corresponding entry locations (patterns) remain the same. As a result, the Laplacian matrices that include the resistors and capacitors derived from nonlinear device linearizations are time-varying during NR iterations. Since the amplitudes of matrix entries sampled at different time points can be quite different from each other, directly averaging these Laplacian matrices may not effectively reflect the influence of some important circuit components. To retain the relatively important circuit components, the sampled Laplacian matrices will be normalized as follows: for each sampled Laplacian matrix obtained at a time point, all the matrix entries are scaled by a common factor such that the largest elements of these Laplacian matrices are always the same. As a result, by averaging these normalized Laplacian matrices, we can obtain a representative Laplacian matrix that can truthfully mimic the average circuit behaviors during TR simulations. Take the r ds of transistor companion model as an example, as shown in Fig. 6 . At different sampled time points the resistance of r ds will be quite different. By normalizing and averaging r ds of all the sampled time points, the new edge weight can reflect the relative importance of this resistor under all the harmonics.
2) Sparsification of Representative Laplacian Matrices: As described in Section II-B2, ultrasparsifier preconditioner can better approximate the original system than spanning-tree preconditioner by adding extra edges to the spanning-tree support graph. However, adding excessive edges to the spanning tree may result in a rather dense graph and thus it can lead to dramatically increased computation cost when using the ultra sparsifier as a preconditioner. In order to find the most important extra edges, the conductivity of the original graph and the degree of each vertex will be analyzed using a weighted degree metric. For a graph Laplacian matrix A L , the weighted degree wd(v) of a vertex v ∈ A L is defined as [22] wd
where w t (v) denotes the total weight (conductance) incident to the vertex v, S(v) represents the set of edges connected with v, and w(u, v) is the weight of the edge that connects vertex u and vertex v.
In this paper, we propose an effective method to determine the edges to be added to the spanning tree as shown in Algorithm 1. In the algorithm, we define α(v) to be the matching factor of node weighted degree
wherewd(v) is the weighted degree of vertex v in the support graph. We also define α th as the threshold of the weighted degree matching factor, which can be used to effectively control the approximation quality of the ultrasparsifier graph. For Compute the weighted degree wd(v) with (12). 3: end for 4: Obtain the sorted the node list swd ∈ n in descending order according to the wd of each node. 5: Arbitrarily pick one starting node and compute the maximum spanning tree support graph of graph A L . 6: for k = 0 to n − 1 do 7: Get node from the sorted node list v = swd(k) 8: Compute the latest weighted degreewd(v) with (12) and the matching factor α(v) with (13). 9: while α(v) < α th do 10: Restore a single previously removed edge that has the largest weight.
11:
Update the weighted degreewd(v) with (12). 12: end while 13 : end for 14: Return the Laplacian matrix B L of the final ultra-sparsifier. a weighted graph, the lower bound of α th can be computed by settingwd(v) = 1 and wd(v) = wd max , where wd max denotes the maximum weighted degree of the original graph. Consequently, it can be shown that α th will always fall within the range 1 wd max < α th ≤ 1.
When α th is set to be close to 1/wd max , the ultrasparsifier support graph will shrink to a spanning tree; on the other hand, when α th = 1, no sparsification is performed, which will result in the original graph.
B. Sparsification Pattern Extraction
After the graph sparsification algorithm (Algorithm 1) is applied to compute the sparsified representative Laplacian matrix by finding the ultrasparsifier support graph according to its Laplacian matrix, the SCP can be subsequently constructed by exploiting the sparsification pattern of the representative MNA matrix. The procedures for finding the sparsification pattern of the representative MNA matrix have been illustrated in Figs. 2 and 3 , while detailed steps are described as follows. 1) After performing circuit linearizations at different time points (Section II-A), the Laplacian matrices that only include the equivalent resistors and capacitors are extracted from the system MNA matrices, as shown in Fig. 2. 2) The representative Laplacian matrix is created by normalizing and averaging the Laplacian matrices sampled at multiple time points, as described in Section IV-A. 3) The representative Laplacian matrix is converted to an undirected graph for which an ultrasparsifier support graph is created using Algorithm 1. Next, the ultra sparsifier is converted to its matrix form that is defined as the sparsified representative Laplacian matrix, as illustrated in Fig. 3 . 4) Finally, the complement matrix is combined with the sparsified representative Laplacian matrix to create the final sparsification pattern matrix, as illustrated in Fig. 3 . Once the sparsification pattern matrix is obtained, it can be adopted to sparsify the HB Jacobian matrix: if one entry in the sparsification pattern matrix is removed, the corresponding block matrix (6) in the HB Jacobian matrix will also be removed. We will show that this sparsification pattern matrix can very efficiently and effectively facilitate the sparsification of large HB Jacobian matrices.
C. HB Jacobian Preconditioner Construction
As discussed in Section II-A, the dense blocks in HB Jacobian matrix J hb are mainly due to the block-circulant matrices similar to (6) . It has been shown that by reordering the unknowns, the HB Jacobian matrix J hb can be converted to an equivalent sparse block matrix that maintains the same sparsity as the MNA matrix for TR analysis [1] , as illustrated in Fig. 4 . For example, if there are three harmonics in the HB analysis, the HB Jacobian matrix can be obtained by replacing every entry in the time-domain system MNA matrix with a 3 × 3 circulant-matrix block. Similarly, it is not difficult to construct the sparsified HB Jacobian matrix when the sparsification pattern is obtained through the previous procedures. For instance, we can compute the circulant-matrix block by applying FFT to the entries of the sparsified time-domain sampled MNA matrices, as described in (6) .
We want to emphasize that the proposed HB preconditioner matrix can be more efficiently factorized than the original HB Jacobian matrix, since only a small number of block fill-ins will be created during the BLU matrix factorization procedures. In this paper, similar to the BLU solver developed in [1] , we have developed a BLU solver (descried in Section V) for factorizing the HB Jacobian matrix preconditioner. Due to the tree-like structure of the sparsified RF circuit, the HB Jacobian preconditioner matrix can be solved in nearly linear time. Since the proposed preconditioning method shares the advantages of prior direct solution methods [1] , it can be efficiently and reliably applied to handle strong nonlinearities in HB analysis of RF circuits.
D. Case Study: Double-Balanced Gilbert Mixer Sparsification
RF mixers serve as key elements in superheterodyne transceivers for wireless systems, but they are also the primary sources of nonlinear distortions. Most mixer designs (Fig. 5) are following the principle that a large local oscillator driven by an RF signal can modulate the incoming RF signal into an intermediate frequency signal.
In this section, we demonstrate a case study to show how an RF circuit can be sparsified into a sparser support circuit for preconditioned HB analysis. First, the equivalent resistors and capacitors of the linearized RF circuit need to be extracted for forming the Laplacian matrix A L . Consider a linearized MOSFET circuit in Fig. 6 that shows equivalent resistors and capacitors, such as C gd , C gs , and r ds as well as two controlled current sources. By keeping only the nongrounded equivalent resistors and capacitors of the linearized mixer circuit, a Laplacian network can be obtained, as shown in Fig. 5 . Like other SPICE simulation algorithms, each inductor will be treated similarly to voltage sources, in that it cannot be modeled using an I-V style Ohmic relationship, and requires an extra current variable when building the MNA matrix. As a result, all the stamped elements related to inductors will be Algorithm 2 LU Factorization PJQ = LU Input: J ∈ n×n Output: L, U ∈ n×n ,Q ∈ n ,P ∈ n 1: Q = AMD( J) {column reordering to reduce fill-ins} 2: L = I n 3: for k = 1 to n do 4: x = LTS(L,J(:, k)) 5 :
5: 6 : end for kept in the complement network. The devices connected with ground, such as C1 and R10 in Fig. 5 , will only introduce diagonal entries in the MNA matrix. It is obvious that these devices will not introduce any entry to the Laplacian matrix. Next, the Laplacian network can be converted into a weighted, undirected graph, and subsequently an ultrasparsifier graph can be created by following the steps in Algorithm 1, as depicted in Fig. 7 .
V. BLOCK SPARSE MATRIX DIRECT SOLVER
When using preconditioned Krylov-subspace iterative methods for HB analysis, such as the GMRES method [5] , the Jacobian preconditioner matrix needs to be factorized once during each NR iteration. Therefore, the efficiency of the preconditioner factorization is critical to the overall performance of HB analysis. Inspired by the method proposed in [1] , we have developed a sparse BLU solver that can efficiently factorize the HB Jacobian preconditioner matrix by leveraging the sparsity patterns extracted from sparsified MNA matrices during TR analysis as well as the high performance basic linear algebra subprograms (BLAS) [23] and linear algebra package (LAPACK) [24] libraries.
A. Test Matrix Factorization
Before performing BLU factorization on the preconditioner matrix, we first examine the sparsified representative MNA matrix (test matrix) obtained from TR analysis. Since the HB preconditioner matrix has the same block entry pattern as the sparsified representative MNA matrix, each entry in the test matrix will correspond to a block entry in the HB Jacobian preconditioner matrix. Consequently, by factorizing this test matrix, the possible fill-in locations and computations during the following BLU factorizations can be precisely predicted.
To factorize the test matrix, the classical left-looking LU factorization algorithm [25] shown in Algorithm 2 is used, which includes two phases: 1) a symbolic analysis phase and 2) a numerical analysis phase. During the symbolic analysis phase, the columns of the matrix will be reordered to reduce fill-ins during the factorization. There are several ordering approaches available, such as the approximate minimum ordering (AMD) [26] , and column approximate minimum degree ordering algorithm (COLAMD) [27] ordering schemes. In this paper, we use AMD to reorder the test matrix as shown in Algorithm 2. During the numerical analysis phase, the L and U factors will be calculated with partial pivoting to reorder the rows to avoid very small diagonal elements. The core of the numerical analysis phase is the Gilbert/Peiels factorization algorithm [28] , which is to solve a lower triangular system Lx = b repeatedly, as shown in Algorithms 2 and 3.
B. Sparse BLU Algorithm
In this subsection, we will extend the above classical LU algorithm to develop our sparse BLU algorithms (Algorithms 4 and 5) for factorizing the HB Jacobian preconditioner matrix efficiently. The key idea of our BLU algorithm is to exploit the results of the previous test matrix LU factorization. After factorizing the test matrix, the column (Q) and row (P) permutation vectors will be obtained respectively, which can be readily leveraged to factorize the HB Jacobian matrix. By replacing the element-wise multiplications and divisions with matrix-wise multiplication and divisions, the HB preconditioner matrix can be factorized very efficiently. Since each block entry of the HB preconditioner matrix is a dense matrix, the matrix-wise multiplications and divisions can be performed using aggressively optimized BLAS and LAPACK implementations for the target CPU architecture. It should be noted that during the lower triangular system solving step, we do not need to perform the depth-first search algorithm any more, since the nonzero locations of the L and U factors (LU nnz ) can be easily predicted based on the result of the previous test matrix factorization. In the last, the forward and backward substitutions are performed to compute final results.
VI. TR ANALYSIS GUIDED SPARSIFICATION
In order to minimize the total HB simulation runtime, the matching factor threshold α th of node weighted degree needs
Algorithm 5 BLTS: Block Lower Triangular System Solver
c 2 = hj 5: for i = j + 1 to n do 6:
r 2 = hi to be carefully determined to achieve the optimal graph sparsification that may lead to the lowest runtime cost. In this section, we introduce a novel method for achieving nearlyoptimal graph sparsification that can potentially lead to the minimum HB simulation runtime.
A. HB Simulation Runtime Profiling
The runtime of a complete NR iteration during HB analysis can be estimated as follows:
where T LU is the matrix factorization time of the HB preconditioner, T GMRES is the runtime for one GMRES iteration and N denotes the total number of GMRES iterations during one NR iteration step. As illustrated in Fig. 8 , a sparser ultrasparsifier support graph will result in smaller preconditioner factorization time T LU , but much slower convergence (more iterations) due to the worse approximation of the original graph. On the other hand, a denser support graph will result in greater cost in factorizing the HB Jacobian preconditioner matrix that can dominate the overall simulation time. Therefore, to achieve the optimal runtime performance of HB simulation using the proposed iterative solver, the sparsity of preconditioner matrix needs to be determined carefully.
B. Runtime Performance Modeling
To achieve the optimal sparsification of the ultrasparsifier support graph that is key to optimizing the runtime performance of HB simulation, we propose a simple yet effective method to efficiently find the nearly-optimal matching factor threshold α th of node weighted degree by using a TR analysis-guided performance modeling approach. Since the MNA matrix in TR analysis is much smaller in size than the Jacobian matrix in HB analysis, while it has the same matrix structure as the HB Jacobian matrix, it can be used as a good surrogate for estimating the efficiency of the HB preconditioner when the same representative sparse matrix pattern is used. As a result, the runtime performance of TR analysis can effectively reflect the efficiency of the corresponding HB preconditioner. For instance, if the GMRES iteration number is large during TR analysis, it indicates that the preconditioner may not be accurate enough and therefore additional edges should be included into the sparsified graph. We want to emphasize that the TR analysis does not need to reach the steady state for performance modeling purpose. Instead, we just need to perform a few steps of TR analysis to proximately estimate the quality of the HB preconditioner based on the convergence behavior. Although the resultant HB preconditioner may not be the optimal one during the entire HB analysis procedure, it should be able to provide a reasonably good initial preconditioner that can be improved later based on the actual numbers of GMRES iterations during the following HB analysis steps (e.g., NR iterations).
In our approach, N different matching factor thresholds, α th,i for i = 1, . . . , N, are uniformly chosen within the range (14) described in Section IV-A2. It is observed that TR simulation runtime results under different α th values correlate well with the corresponding HB simulation runtime results, as shown in Fig. 9 . The above observation suggests to build a parameterized runtime performance model with variable α th under the guidance of TR analysis results. Subsequently, the identification of the optimal α th for HB analysis can be efficiently performed based on the above runtime performance model.
In this paper, we build the following quadratic runtime performance model for predicting TR simulation runtime under different α th values: where T TR represents the overall runtime of TR simulation and a, b, and c are the coefficients of the quadratic function. The above model can be obtained by running a few time steps of TR simulations with different α th values. Since running TR analysis is much faster than running full HB simulations for RF circuits, the cost for building the proposed performance model can be negligible. In the last, the optimal α th value for HB analysis can be easily found based on the above quadratic performance model. It should be noted that this novel HB runtime performance modeling approach allows to automatically and robustly compute the nearlyoptimal sparsified circuit networks for preconditioning purpose, while previous manually tuned sparsification algorithm in [10] may require excessive effort in finding a relatively good preconditioner.
C. Nearly-Optimal Sparsification Scheme
Since the linearized models of nonlinear devices may change drastically during NR iterations, the sparsified circuit networks may also change accordingly, which may require on-the-fly update of the preconditioner. In this paper, we propose to apply the quadratic runtime performance modeling/optimization procedures only when the previous NR iteration runtime has changed dramatically. On the other hand, when the runtime of the latest NR iteration does not change much, the previous sparsified network topology will be reused with updated element values. Fig. 10 illustrates the proposed sparsification scheme during an NR iteration of HB simulation, where T HB (k) denotes the kth NR iteration runtime and β denotes the threshold value of the changing rate for NR iteration runtime. Since the runtime performance models can be efficiently generated on-the-fly for finding the nearly-optimal Evaluate devices and compute the linearized circuit system matrices at each sample time points; 4: Create the Laplacian and complement matrices at each sample time point; 5: if Previous HB NR iteration runtime change drastically then 6: Update the sparsification pattern: 7: a) Create the representative Laplacian matrix P by scaling and averaging all the Laplacian matrices; 8: b) Create the performance model function T TR = f (α th ); 9: c) Compute the optimal matching factor threshold α th ; 10: d) Extract the ultra-sparsifier support graph from the representative Laplacian matrix P; 11: e) Build the sparsified representative Laplacian matrix P usg ; 12: f) Form the sparsification pattern by combining the sparsified representative Laplacian matrix with the complement matrix; 13: else 14: Reuse the previous sparsification pattern; 15: end if 16: Sparsify the HB Jacobian matrix; 17: Construct and factorize the Jacobian preconditioner matrix; 18: Perform preconditioned GMRES iterations; 19: Update the solution vector and transform the solution from frequency domain to time domain using IFFT; 20: Check NR convergence: if converged, stop the NR iteration; otherwise, go to the next NR iteration. 21 : end while 22: Return the final steady-state solution.
α th , high robustness and efficiency of HB simulation using the proposed adaptive preconditioning approach can be always achieved.
We want to emphasize that when the circuit becomes nonlinear device dominant, the proposed preconditioner will be very similar to the original HB Jacobian matrix. As a result, the proposed method will work like a direct solver. As the increase of parasitic components of the post-layout circuits, the proposed preconditioner will be much sparser than the original HB Jacobian matrix, and the proposed method will work as an iterative solver. The automatic and smooth switching between the direct solver (no sparsification) and iterative solver (multiple sparsification levels) allows the proposed method to reliably and efficiently obtain the steady state solution for different RF circuits.
VII. SCALABLE HB ANALYSIS ALGORITHM
The algorithm flow of the proposed support-circuit preconditioned HB (SCPHB) method is summarized in Algorithm 6, while the complexity of the proposed algorithm is discussed as follows. Benefited from the iterative solution method, we only need to explicitly construct the factor matrices L and U of the preconditioner. Since the ultrasparsifier preconditioner maintains a tree-like structure, the memory and computational cost due to the fill-ins introduced during the BLU factorization procedure will scale almost linearly with the problem size. where nnzl sg and nnzu sg denote the nonzeros in the L and U factors of the HB Jacobian preconditioner, while h denotes the number of harmonics for HB analysis. It is obvious that the proposed method has better memory efficiency than the direct solution method whose memory consumption is given by [1] (nnzl + nnzu)h 2 where nnzl and nnzu are the nonzeros in L and U factors. Since the proposed sparsification technique can greatly reduce the number of edges/elements of the original circuit network, (nnzl + nnzu) should be much greater than (nnzl sg +nnzu sg ) due to the dramatically reduced fill-ins during the BLU factorization procedure.
The computation complexity of the BLU solver [1] is
where β > 1 but it is usually very close to 1. We assume the preconditioned GMRES method converges in m iterations, then the complexity of HB matrix-vector products is
where nnzc and nnzg are the numbers of nonzeros in C and G, respectively. The complexity for triangle solves during GMRES iteration can be estimated to be O(m(nnzl
The complexity of applying the proposed preconditioner in m GMRES iterations is O(m 2 nh). Therefore the total runtime complexity including the preconditioner factorization time, preconditioner triangle solve time, and matrix-vector multiplication time can be estimated as
Based on the above analysis of runtime and memory cost using the proposed method, it is obvious that our approach will lead to much better computational efficiency than the prior direct solution method [1] , especially when the number of harmonics during HB analysis is large.
VIII. EXPERIMENTAL RESULT

A. Experimental Setup
We have tested several widely used RF circuits using the proposed SCPHB method. All the test circuits are post-layout To demonstrate the advantages of our proposed method, the averaging BD preconditioning method and the state-of-art direct solution method [1] are also implemented and evaluated. Detailed characteristics of the test cases are summarized in Table I , where "freqs" denotes the number of harmonics, and "unknowns" stands for the problem size. All experiments are performed using a single CPU core of a computing platform running 64-bit RHEW 6.0 with 2.67 GHz 12-core CPU and 48 GB DRAM memory.
B. Experimental Results
First, we would like to demonstrate the runtime and memory efficiencies of the proposed SCP method. Table II shows that the CPU runtime and memory consumption comparisons among the SCPHB method based on manually-optimized parameters, SCPHB method based on automatically generated parameters, direct solution method and BD preconditioning method for a variety of test cases with different circuit sizes and numbers of harmonics. The speedup results of SCPHB method are compared against the direct solution method in [1] . In Table II , "mem" denotes the memory cost measured in GB, "K-its" indicates the total numbers of GMRES iterations for iterative solvers, and "time" is the CPU runtime that is measured in seconds. From the table, we can observe that BD preconditioning method is the fastest one when the circuits are weakly or mildly nonlinear, and the proposed method is 5× slower. However, with the growth of circuit complexity and nonlinearity, BD preconditioning method will become much less efficient: we observe that when solving RF circuits with strong nonlinearities (test cases 5-7), the proposed method can be at least more than 3× faster than the BD preconditioning method. Moreover, when solving strongly nonlinear circuits, BD preconditioning method is not able to converge within an iteration limit of 500 and is reported as "do not finish." When compared with the direct solution method, our proposed method can always provide much better runtime and memory efficiency: up to 20× runtime speedup and 6× memory reduction have been observed in Table II. The results in Table II also show that when compared with the manually optimized SCPHB method, the proposed TR analysis guided sparsification scheme can always automatically find the nearly-optimal simulation configurations (matching factor threshold α th ). Fig. 11 shows the comparison of the simulation runtime between the proposed SCP and the BD preconditioner for an LNA + mixer with two tones of equal power applied at the input. We observe that in low input-power region (RF circuit exhibits weakly nonlinear behaviors), BD preconditioning method is faster than the proposed SCP method. Since for weakly nonlinear circuits, there is almost no large off-diagonal entries in the block circulant matrices of the HB Jacobian matrix, so the BD preconditioner matrix can well approximate the properties of the original HB Jacobian matrix. However, as the input power increases (circuit system becomes strongly nonlinear), the simulation runtime of the BD preconditioning method and the number of GMRES iterations for each NR iteration will grow dramatically. On the other hand, the proposed SCP method always results in the similar numbers of GMRES iterations and achieves desirable nearly-linear runtime and memory efficiency even for these strongly nonlinear RF circuits. Fig. 13 compares the performance between our SCPHB solver and the incomplete LU factorization (ILU) preconditioned GMRES iterative solver, for test case circuit (CKT) 6, by showing the runtime and numbers of GMRES iterations for the first few NR steps. The ILU preconditioner is built from PETSc [29] with ILU factorization level set to be 5. When the ILU levels are set to be 0-4, the ILU-preconditioned GMRES solver cannot converge for the first NR step. From the figure, we observe that the proposed SCPHB method can converge much faster than ILU-preconditioned method with much smaller iteration number and total runtime. In the last, we would like to demonstrate the accuracy of the proposed SCP method. Fig. 12 illustrates the voltage waveforms of double balanced mixer circuit showing that the results of our proposed SCP method can accurately match the one obtained from the direct method. Although other preconditioning methods can also achieve the same level of accuracy, our method can usually converge much faster as observed in our experimental results.
C. Scalability
In this section, we would like to demonstrate the scalability of our proposed method. Fig. 14 illustrates the total HB simulation runtime results of different problem sizes (unknowns). Fig. 15 shows the peak memory consumptions of different problem sizes. It is obvious that the proposed SCPHB method has much better scalability than direct solution method. In Fig. 14, with the increase of the problem size, the runtime cost of direct solution method grows much faster than the proposed method. We observe 3× to 20× speedups and 2× to 6× memory cost reductions in Figs. 14 and 15.
IX. CONCLUSION
We exploit recent graph sparsification and support graph theories to develop scalable HB Jacobian matrix solvers for tackling frequency-domain strongly nonlinear post-layout RF circuit analysis. Our approach sparsifies the representative MNA matrices and subsequently the entire HB Jacobian matrix to significantly reduce the matrix factorization cost. We show that the resultant sparsified Jacobian matrix can be used as a robust yet efficient preconditioner in HB analysis. By leveraging the proposed TR-analysis guided graph sparsification scheme, nearly-optimal SCP can be computed on the fly automatically. Our experimental results show that when compared with existing state-of-the-art direct solution method, the proposed HB solver can more efficiently handle moderate to strong nonlinearities during the HB analysis of RF circuits, achieving up to 20× speedups and 6× memory reductions.
ACKNOWLEDGMENT
Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of these sponsors.
