I. INTRODUCTION
The growing size and structural complexity of VLSI circuits make transistor-level circuit simulation increasingly impractical. In most transient analyses in SPICE tools, simulating some moderate-scale circuit designs takes days. Low simulation efficiency has become a critical bottleneck for modern CAD tools.
The rapid development of multi-core and many-core systems in recent years provides a promising way to solve this problem through parallel simulation. Synopsys HSPICE has already released a dual-core parallel version, which can improve efficiency by 70% for some simulation tasks.
A prerequisite for parallel simulation is circuit partitioning. Partitioning is a classical problem in CAD research with a wide range of applications, and algorithms for this NP-complete problem have been well developed in recent years [1]. However, most traditional methods encounter difficulties in two respects when facing VLSI partition problems for multi-core parallel simulation.
For the direct algorithms [2], [3], [4]-[11], the main problem may be the overwhelming computing time that comes with the enormous problem size. If the partition process takes too much time, overall simulation efficiency is very likely to suffer, no matter how fast the parallel simulation itself runs. For algorithms with pre-assembly steps [12]-[19], such as clustering methods, the pretreatment is often based on certain mathematical search criteria, so the solution quality of the whole algorithm varies with circuit structure, cluster size and clustering strategy. Their scalability and stability are therefore hard to guarantee.
This work was supported by National Natural Science Foundation of China (No.60870001, No.90207002) and TNList Cross-discipline Foundation.
Xiaowei Zhou was in NICS group, EE Department, Tsinghua University, Beijing, China (e-mail: zhouxw3700@hotmail.com).
Yu Wang, the corresponding author, is assistant researcher in NICS group, EE Department, Tsinghua University, Beijing, China (e-mail: yuwang@mail.tsinghua.edu.cn).
Huazhong Yang is professor in NICS group, EE Department, Tsinghua University, Beijing, China (e-mail: yanghz@tsinghua.edu.cn).
978-1-4244-3870-9/09/$25.00 ©2009 IEEE
In this paper we introduce a new circuit partition algorithm designed specifically for VLSI partition and multi-core parallel simulation, based on DCCBs (Direct Current Connected Blocks) and SCCs (Strongly Connected Components). The proposed algorithm shows considerable improvements in efficiency. We run it on some typical circuit designs, and the experimental results show satisfying computing-time speedups and quality improvements.
The remainder of this paper is organized as follows. Section II describes the algorithm framework. Section III presents the terminology for recognizing DCCBs. Section IV gives the analysis and strategies for recognizing SCCs and overweight circles. Section V presents experimental results. Section VI concludes our work with some expectations and plans for future research.
II. CIRCUIT PARTITION FRAMEWORK
To implement the proposed algorithm, we establish a software framework that carries out the process the algorithm describes. Fig. 2 gives an example of DCCBs: the 10 MOSFETs enclosed by the black dashed line form DCCB 1, and the remaining two outside the enclosure form DCCB 2. Fig. 3 gives the pseudo code of DCCB recognition.
III. DCCBs AND DIRECTED CYCLIC GRAPH
In our partition algorithm, we first read the input netlist and form the original graph $G_o$ representing the circuit. We then pre-partition $G_o$ into a digraph $G_{DCCB}$, as detailed in Section III. After that, we transform $G_{DCCB}$ into a DAG $G_{SCC}$ by recognizing SCCs and overweight circles, as described in subsections A and B of Section IV. Finally, we apply a classical k-way partition to the DAG $G_{SCC}$ to obtain the final result [11].
The advantages of the proposed algorithm can be explained in the following aspects:
1) The search space for optimization is greatly reduced, so the partition time may shrink by orders of magnitude. This helps ensure high overall simulation efficiency.
2) Most VLSI designs have a clear functional structure and clear signal flows. Different DCCBs are often unconnected or only loosely coupled, so low communication cost is expected among DCCBs. In addition, the duplication and redundancy commonly used in VLSI designs make the load balance requirement easy to meet. These two characteristics underpin the good final solution quality our method reaches.
3) Since DCCBs commonly correspond to basic functional blocks in VLSI circuits, our method is robust across most application cases. Additionally, the recognition process is simple and deterministic, with few parameters that could cause solution quality to fluctuate.
A. DCCB Recognition
The input netlist of a circuit design can be viewed as an original graph $G_o(V, E)$. Typically, $G_o$ of a VLSI circuit may include tens of thousands of elements. To reduce the problem size and speed up partitioning, we group MOSFETs and related passive device networks that share a direct current path into DCCBs (Direct Current Connected Blocks). This is the first time such a pretreatment has been introduced into VLSI circuit partition algorithms for the parallel simulation task. To better illustrate the idea of DCCBs, we first give some definitions and classifications in Table I.
With a proper data structure, this recognition process has linear time complexity. After recognizing DCCBs, we need to further identify their sequence in order to correctly represent the signal flow in the circuit. A few more corresponding definitions are given in Table II.
By recognizing DCCBs and identifying their sequence relationships, we transform the original input circuit graph $G_o$ into a new directed cyclic DCCB graph $G_{DCCB}$. Fig. 4 gives an example of a $G_{DCCB}$. Root is an artificial node pointing to the DCCBs that are not pointed at by any other DCCB.
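The recognition and sequencing steps described in this subsection can be sketched in Python as follows. This is only an illustration under stated assumptions, not the pseudo code of Fig. 3. We assume that two MOSFETs belong to the same DCCB when their drain or source terminals share a net other than a supply rail, and that a gate driven by a net of another block gives a directed arc from the driving block to the driven one. The function name `recognize_dccbs`, the tuple layout `(name, drain, gate, source)` and the default `supply_nets` are hypothetical, and passive devices are left out for brevity.

```python
def recognize_dccbs(mosfets, supply_nets=("vdd", "gnd", "0")):
    """Group MOSFETs into blocks and derive directed arcs between blocks.

    mosfets: iterable of (name, drain, gate, source) tuples (assumed format).
    Returns (block_of, edges): device name -> block index, and a set of
    (driver_block, driven_block) pairs.
    """
    mosfets = list(mosfets)
    supply = set(supply_nets)  # assumption: rails never merge two blocks
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Step 1: tie every device to its non-supply drain/source nets.
    for name, drain, gate, source in mosfets:
        find(("dev", name))
        for net in (drain, source):
            if net not in supply:
                union(("dev", name), ("net", net))

    # Step 2: number the blocks in order of first appearance.
    block_id, block_of = {}, {}
    for name, *_ in mosfets:
        root = find(("dev", name))
        block_of[name] = block_id.setdefault(root, len(block_id))

    # Step 3: a gate tied to a net of another block gives a directed arc.
    edges = set()
    for name, drain, gate, source in mosfets:
        key = ("net", gate)
        if gate in supply or key not in parent:
            continue  # rail or primary input: no driving block
        src, dst = block_id[find(key)], block_of[name]
        if src != dst:
            edges.add((src, dst))
    return block_of, edges


# Hypothetical example: two CMOS inverters in series.
inverter_chain = [
    ("M1", "out1", "in", "vdd"),
    ("M2", "out1", "in", "gnd"),
    ("M3", "out2", "out1", "vdd"),
    ("M4", "out2", "out1", "gnd"),
]
block_of, edges = recognize_dccbs(inverter_chain)
print(block_of, edges)
```

The sketch uses a union-find structure, so its running time is near-linear in the number of devices, in line with the linear complexity noted above.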
B. Theoretical analysis for algorithm speedup
In this subsection we give an approximate estimate of the expected improvement of DCCB based partition. Let the average efficiency speedup be $\gamma$; we get:

$$\gamma = T_e / T_e'.$$

In the estimation above we neglect the DCCB recognition time itself, which has linear time complexity. Take $O(N^{\alpha})$ to be $N \lg N$ as an example. Based on real VLSI scale, we further assume $N$ to be $10^4$ and $\beta$ to be 10. We get $T_e$ from (1) and $T_e'$ from (2):

$$T_e' = k_e (N/\beta) \lg(N/\beta) = 3 k_e \times 10^3.$$

Still neglecting the recognition time cost:

$$\gamma = T_e / T_e' = 16.67.$$
We see that, in theory, a speedup of more than an order of magnitude is expected from applying the DCCB recognition method.
IV. RECOGNIZING SCC AND OVERWEIGHT CIRCLES

A. Recognizing SCC
The DCCB based graph $G_{DCCB}$ is much simpler than the original graph $G_o$, but it is still not suitable for applying a partition algorithm, because it may contain circles, and assigning the elements of one circle to two or more parts increases the communication cost. Elements of a circle are therefore better assigned to one partition, so we further cluster circling DCCBs into SCCs (Strongly Connected Components). An SCC is recognized as follows:
Theorem 4.1: DCCB i is recognized as a component of SCC j iff DCCB i can reach every other DCCB in SCC j through a series of directed arcs.
By forming SCCs we transform the graph $G_{DCCB}$ into a new DAG $G_{SCC}$. Fig. 5 gives the SCC recognition result for the example in Fig. 4. SCCs are the basic elements for partitioning in $G_{SCC}$, and traditional k-way partition methods are used. Fig. 6 gives the pseudo code of SCC recognition.
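As an illustration of the transformation from a cyclic block graph to a DAG, the following Python sketch collapses every SCC into one node. It uses Kosaraju's standard two-pass algorithm, which is our choice for the example and is not claimed to match the pseudo code of Fig. 6. The function name `condense` and the representation of the graph as a node list plus a list of `(u, v)` arcs are hypothetical.

```python
from collections import defaultdict


def condense(nodes, edges):
    """Collapse each strongly connected component into one node.

    Returns (scc_of, dag_edges): node -> component index, and the set of
    arcs between different components (a DAG by construction).
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)

    # Pass 1: iterative DFS, recording nodes in order of completion.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(succ[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:  # all successors explored: node is finished
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish order.
    scc_of, count = {}, 0
    for root in reversed(order):
        if root in scc_of:
            continue
        scc_of[root] = count
        work = [root]
        while work:
            node = work.pop()
            for prev in pred[node]:
                if prev not in scc_of:
                    scc_of[prev] = count
                    work.append(prev)
        count += 1

    dag_edges = {(scc_of[u], scc_of[v])
                 for u, v in edges if scc_of[u] != scc_of[v]}
    return scc_of, dag_edges


# Hypothetical block graph: blocks 1 and 2 form a feedback loop.
scc_of, dag_edges = condense([0, 1, 2, 3], [(0, 1), (1, 2), (2, 1), (2, 3)])
print(scc_of, dag_edges)
```

Both passes visit each node and arc once, so the cost is linear in the size of the block graph.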
B. Recognizing overweight circles
For some VLSI designs, there may be long feedback paths covering a wide range of stages and a large percentage of the nodes in the circuit. If these are grouped together by the SCC recognition process, the subsequent partition process may have difficulty reaching a satisfying load balance. We call this kind of SCC an "overweight circle". In our algorithm, overweight circles are not recognized as SCCs and remain available for partition. This strategy is carried out in the following steps: first we recognize all SCCs; then, for each SCC larger than a preset threshold size, we find an inner key node and cut off at least one adjacent node pointing to it within the SCC. We then recognize the smaller circles inside the original SCC. Fig. 7 gives the pseudo code of this recognition process.
V. EXPERIMENTAL RESULTS
In this section we give preliminary experimental results on the performance of our algorithm. We use the traditional k-way F-M partition algorithm [11] for comparison. For convenience of program testing and debugging, several typical mini-circuits are chosen as testbenches in this first step. All experiments are run on an Intel 2.66 GHz PC with 512 MB of memory, using VC++ 6.0.
Circuit C1 is a 2-bit calculator, a digital circuit containing 32 MOSFETs and 3 capacitors. Circuit C2 is a typical charge pump block in a PLL, containing 90 MOSFETs and 6 passive devices. Algorithm performance is listed in Table III, where N is the problem size and the "k-way" row lists the performance for a k-way partition task. Because C1 does not have enough DCCBs for 8-way partition or enough SCCs for more than 4-way partition, performance is unavailable in the corresponding rows. From Table III we can see that:
1) The solution quality of our DCCB/SCC based algorithm is better than, or at least not worse than, that of the direct F-M algorithm. This exceeds our theoretical expectation. The reason may be that k-way F-M is very sensitive to the initial solution, and our algorithm is unlikely to begin with a bad one.
2) The load balance of the proposed algorithm is poor in both test circuits. This is not an inherent disadvantage of our algorithm; it is due to the lack of duplicated and similar blocks in the two test circuits. In much larger real circuit designs containing over $10^3$ device elements, duplication and redundancy are common, so we have reason to believe that load balance requirements can be met quite easily in those cases.
3) Speedups are observed in all comparable cases. Note that the speedup does not reach the theoretical expectation, because the neglected DCCB/SCC recognition processes still account for a certain percentage of CPU time. This percentage is likely to decrease as the scale N grows, since the two steps have lower time complexity than the partition algorithm. The speedup for partitioning real VLSI circuits is therefore expected to approach the theoretical expectation.
This estimate is supported to some extent by the larger speedup observed for C2, whose problem size is larger.
VI. CONCLUSIONS AND FUTURE WORK
In this paper we present a fast circuit partition algorithm based on DCCBs and SCCs. DCCB/SCC recognition is simple and deterministic, with good stability and scalability, and the whole algorithm is promising for VLSI parallel simulation. As a first experimental step we run our algorithm on some mini-circuits taken from real designs, and the results are encouraging. Future work may include the following aspects:
1) Complete the experiments on real circuits with more than $10^3$ elements.
2) Develop algorithms that properly assign every partition to a multi-core network with given topology constraints, in order to keep inter-partition communication as low as possible.
