A significant portion of digital design flow runtime is spent in the physical design stages. Partitioning is a critical stage of physical design, and its quality and runtime have a considerable impact on physical design efficiency. In this paper, a new parallel partitioning algorithm suited to GPU systems is proposed. In the proposed algorithm, the coarsening phase of partitioning is accelerated by parallelizing it on a GPU. Experimental results show that runtime can be improved by up to 7x for the attempted circuits with negligible quality degradation.
I. INTRODUCTION
The complexity of integrated circuits has grown exponentially over the last decades, and this trend is predicted to continue in the coming years [1], [2]. This exponential growth is an important challenge for computer-aided design (CAD) tools, because their runtime increases dramatically as they cope with it [3], [4]. Moreover, the gap between complexity growth and the productivity of CAD tools (i.e., the productivity gap) will widen for new ultra-large systems such as systems-on-chip and multi-core systems.
A considerable portion of total design time is spent in the physical design stage. Physical design consists of several important steps: partitioning, floor-planning, placement and routing. In the partitioning phase, the design graph is divided into sub-graphs to simplify the work of the remaining algorithms. Floor-planning finds the location and shape of large blocks. The exact location and orientation of the standard cells are fixed in the placement step and, finally, the geometry of the interconnects is determined in the routing stage. Many physical design problems are NP-complete or NP-hard [5], [6]. Graph partitioning is a critical sub-stage of physical design. The quality and execution time of the partitioning stage have a considerable impact on the overall quality of the design and the total execution time of the physical design flow. Therefore, we focused on parallelizing the graph partitioning stage to improve its runtime while taking partitioning quality into account.
Many contributions on parallelizing CAD algorithms on multi-processor systems were reported in the late 1990s. Most of the proposed algorithms targeted multi-processor and multi-computer systems that were not equipped with efficient shared memory or inter-processor communication, so inter-thread communication could be done only through message passing. Another drawback of these earlier algorithms is that they require expensive hardware resources and do not execute efficiently on ordinary personal computers.
Manuscript received September 5, 2016; revised December 12, 2016.
Atefe Taheri was with the Electronic and Computer Engineering Department, Shahid Beheshti University, Tehran, Iran (e-mail: atefetaheri1439@gmail.com).
Ali Jahanian is with the Electronic and Computer Engineering Department, Shahid Beheshti University, Tehran, Iran (e-mail: jahanian@sbu.ac.ir).
Behin Molaie is with the Computer Engineering Department, Sharif University of Technology, Tehran, Iran (e-mail: molaie@ce.sharif.edu).
According to the previous research introduced in Section II, more than half of the partitioning algorithm's runtime is spent in the coarsening phase. Therefore, this paper focuses on accelerating that phase. A GPU-optimized parallel algorithm is proposed to reduce the execution time of the coarsening phase of hypergraph partitioning without considerable quality degradation. Our algorithm is a combination of sequential and parallel phases. We analyzed the sequential algorithm and parallelized its most time-consuming section on the underlying GPU architecture. The important point is that the parallelization should have as little negative impact as possible on the quality of partitioning. In the proposed algorithm, the circuit netlist database is divided into n sections that are processed by parallel threads on the GPU platform. Each section is coarsened in a separate environment, and the result sets are merged on the CPU.
This paper is organized as follows. Section II reviews related work. Section III describes the concept of and existing algorithms for multi-level hypergraph partitioning, and the coarsening phase is explained in Section IV. Section V describes GPU programming and Section VI describes the proposed coarsening algorithm. Section VII presents the experimental results and, finally, Section VIII concludes the paper.
II. LITERATURE REVIEW
In the last few years, the computing power of main processing units has not kept pace with the growth in the cell count of digital circuits, so the runtime of the algorithms has increased. On the other hand, graphics processing units (GPUs) have revolutionized the performance of general-purpose algorithms with their high computing power and low cost. Many software companies are now trying to improve the performance of their software using GPUs. Our idea was to create a parallel GPU-based algorithm that runs the partitioning phase of the design flow on a high-performance GPU to speed up the design process.
Hypergraph partitioning currently has many applications in scientific computing [7]. Foad Lotfifar and Matthew Johnson presented a new sequential multi-level hypergraph partitioning algorithm. This algorithm uses rough set clustering to categorize the vertices of the hypergraph. It is designed to avoid purely greedy decisions; in effect, it makes a trade-off between local and global decisions. Their results show that the algorithm achieves better partitioning quality.
Hairong Liu and his colleagues [8] presented a new partitioning framework based on divide-and-conquer, called dense subgraph partitioning (DSP). DSP has properties such as revealing all meaningful clusters. In addition to presenting the new framework, they established a relationship with the densest k-subgraph problem (DkS). Their results show that this approach is suitable for parallel processing because it is time-efficient and memory-friendly.
Many contributions in the last few years have aimed to boost the performance of CAD problems using high-performance GPUs. The authors of [9] adapted a parallel algorithm called mPL to the GPU to reduce the time needed to solve placement problems. mPL is an analytical global placement method that can find a reasonable solution in a short time.
In [10], wire-length estimation is parallelized on a GPU with CUDA. The authors created a parallel algorithm for calculating the wire-length of a solution on the GPU and obtained a 160x speedup over a serial CPU algorithm.
Karypis and LaSalle implemented graph partitioning on multi-core systems using OpenMP and MPI [11]. They developed an algorithm for multithreaded CPU environments and reported good results in speedup, memory usage and partitioning quality.
The authors of [12] proposed a parallel version of PathFinder, the most widely used FPGA routing algorithm. They introduced a parallel method for multi-core systems, mainly to improve the runtime of the routing phase. Their experimental results show that the runtime can be reduced by 47.8% and 70.9% on dual-core and quad-core systems, respectively.
In [6], a parallel simulated annealing algorithm for multi-core systems is proposed. Simulated annealing is a well-known optimization method whose underlying Metropolis algorithm dates to 1953. It is used for complex and nonlinear combinatorial optimization problems: the algorithm explores the search space and finds a near-optimal solution, but it takes a long time to finish if the search space is large. The experimental results show that runtime can be improved by 32% on average while preserving the quality of the solution.
Caldwell et al. [13] studied move-based heuristics for multilevel hypergraph partitioning and evaluated the performance of these heuristics in the context of VLSI design. Their first contribution was a software architecture consisting of seven reusable components, which allows a flexible, efficient and accurate assessment of practical implications. Their second contribution was an assessment of the modern context for hypergraph partitioning research in VLSI design applications.
The biggest limitation of the algorithms mentioned above is the runtime needed to find a good solution. Another problem is that existing methods need a resource-rich environment to run.
III. MULTI-LEVEL HYPERGRAPH PARTITIONING
Hypergraph partitioning is an important problem in many engineering and optimization applications, such as the VLSI design flow. The problem is to partition the nodes of a hypergraph into k different sets such that the cut size of the partitioning is minimized and the balance criterion is not violated. In other words, it is an optimization problem whose objective is to minimize the cut size and whose constraint is the balancing rule. Hypergraphs are a generalization of graphs in which each edge can be a hyperedge, that is, an edge that connects a set of two or more vertices.
Hypergraph partitioning is NP-hard in general, and an optimal solution is not practical for large circuits. However, because of the importance of this problem, many heuristic and randomized algorithms have been developed to give a reasonable answer.
HMetis [14] is a hypergraph partitioner that uses a multilevel paradigm. Its main objective is that the local cut size at each level is considered together with the cut size of the next levels. In other words, HMetis is designed to combine global cut size information with the local connection information of the netlist. HMetis uses the hypergraph representation of the netlist, in which nodes are standard cells and hyperedges are the nets of the circuit. The HMetis partitioning algorithm has three basic phases:
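The objective and the balancing rule described above can be illustrated with a minimal Python sketch. It assumes the common definition of cut size as the number of hyperedges that span more than one part; the function names and the `tolerance` parameter are hypothetical and are not taken from the paper.

```python
from collections import Counter

def cut_size(hyperedges, part_of):
    """Count hyperedges whose vertices fall in more than one part."""
    return sum(1 for edge in hyperedges
               if len({part_of[v] for v in edge}) > 1)

def is_balanced(part_of, k, tolerance=0.1):
    """Check that no part holds more than (1 + tolerance) * n / k vertices.

    The tolerance value is an illustrative assumption.
    """
    sizes = Counter(part_of.values())
    limit = (1 + tolerance) * len(part_of) / k
    return all(sizes.get(p, 0) <= limit for p in range(k))

# A small hypergraph with six vertices and four hyperedges.
hyperedges = [(0, 1, 2), (2, 3), (3, 4, 5), (0, 5)]
part_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(cut_size(hyperedges, part_of))   # hyperedges (2, 3) and (0, 5) are cut
print(is_balanced(part_of, k=2))
```

In this toy bisection two of the four hyperedges are cut and both parts hold three vertices, so the balance check passes.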
A. Coarsening Phase
In this phase, highly connected hypergraph nodes are merged (coarsened) together to generate super-nodes. Each super-node consists of nodes that are strongly connected and should therefore be placed in the same cluster. HMetis coarsens the hypergraph recursively until the number of super-nodes in the hypergraph drops to 200 or fewer.
B. Initial Partitioning Phase
After the coarsening phase, the hypergraph is small enough for a classical partitioning algorithm to partition it into two parts. HMetis uses the Fiduccia-Mattheyses algorithm [15] to bisect the hypergraph. At the end of this phase, the coarsened hypergraph is partitioned into two balanced partitions.
C. Uncoarsening and Refinement Phase
In this phase, the partitioned graph is un-coarsened and each super-node is expanded into its basic nodes. After each super-node is un-coarsened, the balance of the partitioned graph may be violated. Therefore, a refinement algorithm is applied at each level of un-coarsening to rebalance the partitioned graph. Fig. 1 shows the process of the HMetis algorithm, with the various phases of multilevel graph bisection. In the coarsening phase, the size of the graph is successively decreased; in the initial partitioning phase, a bisection of the smallest graph is computed; and in the uncoarsening phase, the bisection is successively refined as it is projected to the larger graphs.
[Fig. 1: multilevel graph bisection, with panels labeled Coarsening phase, Initial partitioning phase, and Uncoarsening phase.]
As mentioned before, HMetis is developed in the multilevel framework. The algorithm is very fast and produces high-quality partitions. Hypergraphs with over 100,000 nodes can be bisected in a few minutes. The most time-consuming phase of HMetis is the coarsening phase. Therefore, we focused on accelerating this phase of the algorithm in this paper. The next section describes this phase in more detail.
IV. COARSENING PHASE
In the coarsening phase, the goal is to create a hypergraph with fewer vertices to facilitate the initial partitioning phase. In the general case, a hypergraph with more than 1,000,000 vertices should be reduced to a hypergraph with fewer than 200 vertices. The coarsening process can be performed by three different methods, as follows:
A. Edge Coarsening
In this method, a hyperedge with more than two vertices is selected and two of its vertices are combined. In each iteration, a coarser graph with "v-1" vertices is constructed. This process is shown in Fig. 2.
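A single edge-coarsening step can be sketched in Python as follows. The representation (hyperedges as frozensets of vertex ids) and the helper names `contract_pair` and `vertices` are assumptions made for illustration, not part of HMetis.

```python
def contract_pair(hyperedges, u, v):
    """Merge vertex v into vertex u and return the coarser hyperedge list."""
    coarse = []
    for edge in hyperedges:
        merged = frozenset(u if x == v else x for x in edge)
        if len(merged) > 1:      # drop hyperedges that collapsed to one vertex
            coarse.append(merged)
    return coarse

def vertices(hyperedges):
    """Collect the set of vertices that appear in at least one hyperedge."""
    return set().union(*hyperedges) if hyperedges else set()

edges = [frozenset({0, 1, 2}), frozenset({2, 3}), frozenset({3, 4})]
coarse = contract_pair(edges, 0, 1)   # combine two vertices of the first hyperedge

print(len(vertices(edges)), len(vertices(coarse)))
```

Merging vertices 0 and 1 reduces the vertex count from five to four, which matches the "v-1" behavior described above.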
B. Hyper Edge Coarsening
In this method, hyperedges with fewer vertices are selected first and their vertices are combined to create a coarser graph. This coarsening is repeated until the goal of 200 vertices is reached. This method is shown in Fig. 3.
C. Modified Hyper Edge Coarsening
This method proceeds like hyperedge coarsening, but after the hyperedges have been coarsened, all remaining vertices that have not been combined with any other vertex are combined together. This technique is shown in Fig. 4.
We used the Hyper Edge Coarsening technique in our implementations. 
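One pass of hyperedge coarsening can be sketched in Python as below. This is an illustration under stated assumptions rather than the authors' implementation: hyperedges are visited from smallest to largest, a hyperedge is contracted only if none of its vertices has already been merged in this pass, and the smallest vertex id is used as the super-node id. The function name is hypothetical.

```python
def hyperedge_coarsen_pass(hyperedges, num_vertices):
    """Contract small hyperedges whose vertices are all still unmatched."""
    cluster = list(range(num_vertices))    # cluster[v] = super-node id of v
    matched = [False] * num_vertices
    for edge in sorted(hyperedges, key=len):       # fewest vertices first
        if not any(matched[v] for v in edge):
            rep = min(edge)
            for v in edge:
                cluster[v] = rep
                matched[v] = True
    # Rewrite every hyperedge in terms of super-nodes and drop collapsed ones.
    coarse = {frozenset(cluster[v] for v in edge) for edge in hyperedges}
    coarse = sorted((e for e in coarse if len(e) > 1), key=sorted)
    return cluster, coarse

edges = [(0, 1, 2), (2, 3), (4, 5), (3, 4, 5)]
cluster, coarse = hyperedge_coarsen_pass(edges, num_vertices=6)

print(cluster)
print(coarse)
```

In this example the two-vertex hyperedges (2, 3) and (4, 5) are contracted first, so the six original vertices become four super-nodes. Repeating such passes would continue until the target size is reached.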
V. PROGRAMMING ON GPU
To use the power of a GPU in a general-purpose application, a framework or toolkit is required to provide programming, debugging and testing capabilities. In this section, we give a brief overview of these tools from the early days until now. This field is advancing quickly, and we advise readers to check online for the newest resources [16].
A. SH
SH is an open-source project for writing C++ programs for GPUs. The tool is platform-independent and supports many kinds of graphics cards. The main program can be written in any language, but the source code of the GPU program must be in C++. In the early days, it was a valuable tool for using the performance of the GPU to accelerate general-purpose applications [17].
B. Direct Compute
Microsoft created its own tool for general-purpose programming on all DirectX-supported GPUs in Windows Vista, Windows 7 and Windows 8. Microsoft DirectCompute is an API for GPU programming; it was released with DirectX 11 but is compatible with DirectX 10 devices [18].
C. OpenCL
Open Computing Language, known as OpenCL, is a programming framework for writing programs that run on heterogeneous platforms consisting of CPUs, GPUs, DSPs and other processors. The programming language is based on the C99 standard. This language is used to write kernels (functions that execute on OpenCL processors), and there is an API to define and control the execution of the whole program. With OpenCL, we can write programs that are task-parallel or data-parallel. Many processors currently support OpenCL and can execute OpenCL kernels [19].
D. Nvidia CUDA
CUDA (Compute Unified Device Architecture) is a programming framework created by Nvidia for writing parallel applications that run on CUDA-enabled GPUs [20]. CUDA exposes a virtual instruction set and memory for parallel computation on CUDA GPUs. The main difference between this framework and the others is that programming is easier and performance is higher. CUDA compilers are available for the C, C++ and FORTRAN languages. NVIDIA's compiler for C/C++ CUDA is an LLVM-based compiler called nvcc. CUDA is available on Microsoft Windows, Linux and Mac OS. The SDK and drivers can be downloaded from NVIDIA's developer website. All new Nvidia GPUs are CUDA compatible. There are three lines of NVIDIA GPUs, known as GeForce, Quadro and Tesla. CUDA binaries are compatible with future cards, but some instructions are not available on older GPUs [15].
VI. PARALLELIZING THE COARSENING PHASE OF HYPERGRAPH PARTITIONING ON GPU
As mentioned before, partitioning is widely used in many physical design tools. HMetis is a high-quality and fast partitioning algorithm that is known as the best hypergraph partitioner in various applications such as VLSI CAD. One of the most time-consuming parts of this algorithm is the coarsening phase. To parallelize the coarsening phase and implement it on a GPU, the circuit hypergraph database must first be divided into parts. These parts are then sent to the GPU cores, which coarsen all of them in parallel. It is worth noting that every part is coarsened separately. Because the nodes of a part may be chosen randomly while coarsening relies on the connections between nodes, a part may contain few or no connections among its nodes. Another point to consider is that some edges are eliminated when the hypergraph is divided into parts. These edges must still be reflected in the final result of the coarsening phase.
The quality of a graph coarsened on the GPU is normally lower than that of a graph coarsened on the CPU, because the graph is partitioned into smaller sub-graphs and each core lacks a global view, so parallel coarsening may lead to incorrect decisions. We divided the netlist based on its nets so that the parts are separated along weaker connections. As the experimental results will show, this database distribution gives good results with considerable runtime improvement. Fig. 5 shows the parallelized algorithm.
Parallel coarsening pseudo code
Step 1: Read the input hypergraph g.
Step 2: Partition g into n parts.
Step 3: Create n sub-graphs of g.
Step 4: FOR each sub-graph gi of the n sub-graphs DO
    Step 4-1: Copy sub-graph gi into GPU global memory.
Step 5: Call the GPU kernel to parallelize the following tasks.
    Step 5-1: WHILE (number of vertices in this part > k) LOOP
        Step 5-1-1: Find an edge e in the sub-graph.
        Step 5-1-2: Merge all vertices of e.
Step 6: Wait for all kernels to complete.
Step 7: Take back all the results from GPU global memory.
Step 8: Merge all coarsened sets in g.
Step 9: WHILE the number of vertices in g is greater than 200 LOOP
    Step 9-1: Find an edge e in g.
    Step 9-2: Merge all vertices of e.
Step 10: Write the output.
Fig. 5. Parallelized coarsening algorithm.
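The control flow of the steps above can be sketched on the CPU in Python. This is only a structural illustration under several assumptions: a thread pool stands in for the CUDA kernels, the hypergraph is split into contiguous vertex ranges, hyperedges are visited from smallest to largest, and the names `coarsen` and `parallel_coarsen` are hypothetical. It does not reproduce the authors' CUDA implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def coarsen(vertices, edges, limit):
    """Steps 5-1 and 9: merge the vertices of hyperedges until at most
    `limit` super-nodes remain or no hyperedge is left to contract.
    Returns a dict mapping each vertex to its super-node id."""
    cluster = {v: v for v in vertices}
    count = len(cluster)
    for edge in sorted(edges, key=len):      # assumption: smallest edge first
        if count <= limit:
            break
        ids = {cluster[v] for v in edge}
        if len(ids) > 1:
            rep = min(ids)
            for v in cluster:
                if cluster[v] in ids:
                    cluster[v] = rep
            count -= len(ids) - 1
    return cluster

def parallel_coarsen(num_vertices, hyperedges, n_parts, k, target):
    # Steps 2-3: split the vertices into n contiguous parts, build sub-graphs.
    size = -(-num_vertices // n_parts)       # ceiling division
    parts = [range(i, min(i + size, num_vertices))
             for i in range(0, num_vertices, size)]
    subgraphs = []
    for part in parts:
        local = [tuple(v for v in e if v in part) for e in hyperedges]
        subgraphs.append((part, [e for e in local if len(e) > 1]))
    # Steps 4-7: coarsen every part independently (CPU threads stand in
    # for GPU kernels here).
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        results = list(pool.map(lambda s: coarsen(s[0], s[1], k), subgraphs))
    # Step 8: merge the per-part results into one mapping.
    cluster = {}
    for r in results:
        cluster.update(r)
    # Step 9: finish sequentially on the merged coarse hypergraph, which
    # brings back the hyperedges that crossed part boundaries.
    coarse_edges = [tuple({cluster[v] for v in e}) for e in hyperedges]
    coarse_edges = [e for e in coarse_edges if len(e) > 1]
    top = coarsen(set(cluster.values()), coarse_edges, target)
    return {v: top[c] for v, c in cluster.items()}

edges = [(0, 1), (2, 3), (1, 2), (4, 5), (6, 7), (5, 6), (3, 4), (0, 7)]
result = parallel_coarsen(8, edges, n_parts=2, k=2, target=2)
print(result)
```

In this toy run each half of the eight-vertex ring is first reduced to two super-nodes in isolation, and the sequential step then uses the previously hidden cross-part hyperedges to reach the final target of two super-nodes.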
VII. EXPERIMENTAL RESULTS
We implemented the proposed parallel algorithm in C++ using the CUDA platform on an Nvidia GeForce GTX 295 GPU system. Eleven benchmarks were selected from the IWLS suite [21] to evaluate the quality and runtime speedup of the proposed algorithm. Statistical information on the attempted benchmarks is shown in Table I. We proposed and implemented several methods for the initial assignment of graph nodes to each GPU thread. The simplest idea was to assign the nodes randomly to each thread. The second approach was to divide nodes based on the order in which they occur in the input file. Another idea was to pick nodes based on nets and their neighbors. The last idea was to use the BFS algorithm to pick nodes for each thread. Our tests showed that dividing nodes based on the input file was the best method, because netlist files are generated such that connected nodes are neighbors in the file rather than randomly ordered.
As mentioned before, the quality of a parallelized algorithm may be degraded because each GPU core must decide based on its local information. Therefore, the quality of the coarsening is an important parameter that must be considered. In other words, in addition to a low execution time, the coarsening phase should produce results of acceptable quality. There is no standard measure for the quality of a coarsened graph, so we defined a new metric called internality to evaluate it. The average internality of a graph g is defined as the average of the internality of each hyperedge of the graph.
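Two of the node-assignment strategies described above, file order and BFS order, can be sketched in Python. The helper names `chunk` and `bfs_order`, and the choice to visit neighbors in ascending id order, are assumptions made for illustration; the paper does not give these details.

```python
from collections import deque

def chunk(order, n_parts):
    """Cut an ordered vertex list into n_parts consecutive groups."""
    size = -(-len(order) // n_parts)         # ceiling division
    return [order[i:i + size] for i in range(0, len(order), size)]

def bfs_order(num_vertices, hyperedges):
    """Return the vertices in breadth-first order over shared hyperedges."""
    neighbors = [set() for _ in range(num_vertices)]
    for edge in hyperedges:
        for v in edge:
            neighbors[v].update(u for u in edge if u != v)
    seen, order = [False] * num_vertices, []
    for start in range(num_vertices):        # restart for each component
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for u in sorted(neighbors[v]):
                if not seen[u]:
                    seen[u] = True
                    queue.append(u)
    return order

hyperedges = [(0, 3), (3, 5), (1, 2), (2, 4)]
by_file = chunk(list(range(6)), 2)               # order of appearance in the file
by_bfs = chunk(bfs_order(6, hyperedges), 2)      # connectivity-driven order

print(by_file)
print(by_bfs)
```

In this deliberately shuffled example the BFS split keeps connected vertices together, while the file-order split does not. The paper reports that file order worked best on real netlists, because connected nodes already tend to be adjacent in the file.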
$$\mathrm{AverageInternality}(g) = \frac{\sum_{e \in g} \mathrm{Internality}(e)}{|g|} \qquad (1)$$
where $|g|$ denotes the number of hyperedges of g. The internality of an edge is formulated as the degree of locality of the nodes in the edge. For example, consider an edge with 2 vertices. If both of them are coarsened together, the internality is 100%, and if they are in different sets, the internality is 0%. If an edge with 3 vertices has 2 vertices in one set and one vertex alone, the internality is 33%. The internality of an edge can be computed as: [equation not recoverable from the source text]
Experimental results of running the algorithm on the GPU are shown in Table III and Table IV. The Part variable shows the number of parallel CUDA threads.
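The internality examples given above (100%, 0% and 33%) can be reproduced with a short Python sketch. The per-edge formula used here, the fraction of vertex pairs of a hyperedge that end up in the same coarsened set, is an assumption chosen because it matches all three examples; the authors' exact formula is not available in this text. The function names are hypothetical.

```python
from itertools import combinations

def internality(edge, cluster):
    """Fraction of vertex pairs of `edge` that share a super-node.

    Assumed formula; requires a hyperedge with at least two vertices.
    """
    pairs = list(combinations(edge, 2))
    same = sum(1 for u, v in pairs if cluster[u] == cluster[v])
    return same / len(pairs)

def average_internality(hyperedges, cluster):
    """Equation (1): mean internality over all hyperedges of the graph."""
    return sum(internality(e, cluster) for e in hyperedges) / len(hyperedges)

together = {0: 0, 1: 0, 2: 2}     # vertices 0 and 1 merged, vertex 2 alone
apart = {0: 0, 1: 1}              # vertices 0 and 1 in different sets

print(internality((0, 1), together))       # two vertices coarsened together
print(internality((0, 1), apart))          # two vertices in different sets
print(internality((0, 1, 2), together))    # two of three vertices together
print(average_internality([(0, 1), (0, 1, 2)], together))
```

Under this assumed formula, the three-vertex edge has one internal pair out of three, which gives the 33% quoted in the text.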
It is worth noting that execution of the last benchmark with 8 parts was too time-consuming to report.
As can be seen in Table IV, runtime can be reduced considerably without significant quality degradation. Any loss of quality must be watched, because both runtime and quality are needed. Our results show that if the problem size is small, GPU overhead outweighs the computation itself, but for larger problems the GPU shows its capability and the runtime improves much more. Table V shows the overall comparison between the various implementations of the parallel algorithm and the serial algorithm. In the algorithm columns, RT and INT show the percentage of runtime improvement and quality degradation of the parallel algorithms compared to the serial implementation on the CPU.
VIII. CONCLUSION
A significant portion of total digital design flow runtime is spent in the various stages of physical design, such as partitioning, floor-planning, placement and routing. In this paper, a new parallel partitioning algorithm was proposed for GPU systems. In the proposed algorithm, the coarsening phase of partitioning was accelerated by parallelizing it on a GPU. Experimental results show that runtime can be improved by up to 7x for the attempted circuits with negligible quality degradation. Our analyses show that the results are better for larger circuits with a larger number of parts.
REFERENCES
Atefe Taheri was born in Babol, Mazandaran, Iran in 1987. She received her bachelor's degree in hardware engineering from Shahid Beheshti University, Tehran, Iran in 2010 and her master's degree in hardware architecture from Shahid Beheshti University, Tehran, Iran in 2013.
She then worked for companies such as SAAT Co, Tehran, Iran as a hardware designer. Currently she is employed as a hardware/software designer in the Intelligent Information Solutions Center at Sharif University, Tehran, Iran. Her current research interests are CAD, high performance computing and hardware/software codesign.
Ali Jahanian received the B.Sc. degree in computer engineering from Tehran University, Tehran, Iran in 1996, and the M.Sc. and Ph.D. degrees in computer system architecture from Amirkabir University of Technology, Tehran, Iran in 1998 and 2008, respectively.
He is currently an assistant professor in the Electrical and Computer Engineering Department of Shahid Beheshti University. His current research interests include VLSI design automation, emerging on-chip interconnect technologies, and embedded system design.
Behin Molaie received his bachelor's degree in computer engineering from Sharif University of Technology, Tehran, Iran in 2009 and the M.S. in software engineering from Sharif University of Technology, Tehran, Iran in 2015. He is currently working toward the Ph.D. degree in Software Engineering at Sharif University of Technology, Tehran, Iran. He worked for Kamasystem company, Tehran, Iran as a technical manager for 5 years. Currently he is employed as a technical team manager in Intelligent Information Solutions Center in Sharif University of Technology, Tehran, Iran. His current research interest is IoT, big data and parallel computing.
Mr. Molaie won a gold medal in the Iran National Olympiad in Informatics in 2004.
