GPU platforms are becoming increasingly attractive for implementing accelerators because they feature a larger number of cores with improved programmability. In this paper, we describe our implementation of a state-of-the-art academic multi-level analytical placer mPL [8] on Nvidia's massively parallel GT200 series platforms. We detail our efforts on performance tuning and optimizations. When compared to software implementation on Intel's recent generation Xeon CPU, the speed of the global placement part of mPL is 15X faster on average using a Tesla C1060 card, with comparable WL. (less than 1% WL degradation on average)
INTRODUCTION
Circuit placement, which determines the location of the cells and macros in a layout, is a critical problem in the VL-SICAD domain. The whole design flow needs a fast placement engine to generate the physical hierarchy of the design in order to provide feedback to synthesis and guide the optimizations [10] . As the silicon process technique continues shrinking down to 45nm and 32nm, and the degree of integration continues going up, circuit placement becomes more and more computationally intensive. Recently, the frequency scaling of microprocessors has slowed down, and the trend is that more and more cores are being put in a single chip. We urgently need to implement parallel CAD algorithms [7] , e.g., parallel circuit placement algorithms, to make better use of multi-core or many-core systems.
Placement algorithms can be roughly divided into three categories: partition-based approach such as Capo [6] , iterative refinement (e.g., simulated annealing) based such as TimberWolf [20] , Dragon [22] , and the analytical-based approaches such as mPL [8] , Aplace [16] , FastPlace [21] and Kraftwerk [13] , etc. The partition-based method recursively partitions the set of cells to obtain a placement solution. Iterative refinement mainly employs pair-wise exchange or cell move to optimize the solution. The method is relatively local and lacks a global view. Analytical placement has a much better global view and allows simultaneous movement of all the cells under the density constraint. Analytical methods are much more competitive compared to the partition-based approach and simulated-annealing-based approach. In the ISPD'06 placement contest [2] , the top three placers all used analytical methods. The reference placement engine used in our work is mPL [8] , which generated the best-quality results in that contest, but only end up with the second place due to a longer run-time (8% better wirelength with 4X longer runtime compared to the first place winner) [19] . Ideally, one would like to see shorter runtime without WL degradation.
Parallel placement is one way to improve the runtime of placement. It has been studied in [9, 14, 17] . Most of the placements are either based on the partitioning-based approach or based on simulated annealing with iterative refinements. The approach in [9] first partitions the circuit and then runs simulated annealing in parallel on different parts of the circuit. Alternatively, the approach in [17] tries to use parallel moves in simulated annealing where they evaluate multiple candidate cell swap or cell move simultaneously. Both approaches were implemented in a distributed environment or commodity multi-core systems. A recent study [14] uses a graphics processing unit to accelerate placement algorithms. The authors implemented their algorithm based on FastPlace [21] . They mainly accelerated the linear system solver for the quadratic placement. Many parts in the density computation and spreading force computation were still left to the CPU due to the architecture restrictions of the GPU that was used. The overall speedup and wirelength quality of the placer on any commonly used placement benchmarks were not presented.
In this work we implement the most computational-intensive kernels of mPL on GPUs, to accelerate the analytical global placement part of the placer. The graphics processing unit (GPU) is a new class of many-core system. Compared to the traditional multi-core CPU, it features a much larger num-ber of cores, but each core is simpler than one conventional core in CPU. Programmers need to expose the parallelism in the algorithm in a way that makes the best use of the underlying hardware architecture. For example, in the Nvidia GT200 series architecture, a number of cores (e.g., eight in GT200) are grouped into one multiprocessor, and one multiprocessor only has one instruction unit. SIMD execution will be performed within the group. The SIMD style of execution may have a significant negative impact on programs with many divergent controls. Careful tuning on the load balancing is needed to alleviate the effects. Also CUDA requires a special data access pattern, with a group of contiguous threads access contiguous data, to make better use of memory bandwidths. Our implementations try to address these issues for performance optimization.
1
This paper is organized as following: Section 2 provides an overview of mPL and summarizes the computational kernels of the multi-level analytical circuit placement we are using. Section 3 briefly presents the GT200 GPU architecture and CUDA programming model. Section 4 presents implementation details, including data structures and our efforts on performance tuning. We present our experiment results in Section 5 and conclude the paper in Section 6.
MULTI-LEVEL ANALYTICAL CIRCUIT PLACEMENT
Analytical placement has a better global view and tends to provide a better solution. mPL is one state-of-the-art academic placer that performs analytical placement using the multi-level approach combined with density-constrained non-linear programming. Detailed mathematical derivation is shown in [8, 19] .
Here we briefly outline the key idea of the placer. The placer tries to minimize the total wire-length:
subject to density constraints:
HPWL (half parameter wirelength) is a widely used metric in interconnect prediction. ( , ) is the density at location ( , ), and is the density target. This formulation suffers from a practical difficulty: both the HPWL and the density are not smooth or differentiable. mPL uses smoothed HPWL by log-sum-exp approximation as the optimization objective. Here is the equation for the smoothed HPWL of one hyperedge:
where ( , ) is the location of the one pin in the hyperedge, and is a parameter that can be tuned for different approximation accuracy. For density smoothing, mPL makes use 1 Note the idea presented in this paper can also be used to parallelize mPL on commodity multi-core CPUs or clusters. Paralleling the code using multi-core CPUs and multithreading is a more portable solution. This paper focuses on the implementation using the heterogenous component -GPU to achieve a better speedup with slightly more design effort.
of Helmholtz smoothing:
where △ is Laplacian operator, is a constant smoothing parameter, Ψ is the smoothed density (or the potential of the density field), ( , ) is the density, is the range and ∂ is the boundary. So instead of solving the problem defined by Eq. (1) and Eq.(2), mPL solves a smoothed problem whose objective is smoothed HPWL and whose constraint is that the smoothed density equals the smoothed density target everywhere. This non-linear constrained optimization problem is solved by Augmented Lagrangian method in mPL's latest implementation [11] . The Augmented Lagrangian method tries to minimize
Here is the optimization objective (sum of the smoothed HPWL), is the Lagrangian multiplier, is the penalty coefficient, and˜is the smoothed density target. In the outer iteration, the Lagrangian multiplier and penalty coefficient is updated according to certain rules. In the inner iteration, a gradient descent method with constant stepsize is employed to solve the subproblem specified in Eq. (5) for a given and .
The most costly part in the gradient descent method is the gradient computation. The gradient is composed of two parts: one is the gradient for the smoothed HPWL, and the second is the gradient relating to the density penalty terms. These gradients can also be interpreted as forces on the modules. mPL uses two type of forces in its implementation. The first one is called the native force, which is derived from the gradient of the smoothed HPWL. The x-component of native force on pin is computed as
and the y-component can be derived similarly. The force on one module is the sum of the forces on its pins. The second type of force is the spreading force; this is the force that tries to spread the cells evenly. The x-component of the spreading force on one module can be computed efficiently by the difference of the integration of spreading force on its left boundary and the integration of spreading force on its right boundary [11] . The equation to compute the x-component of the spreading force of one module is
where , ℎ , and denote the location of the corner points of a module. The y-component of the spreading force can be computed similarly. Interpolation is performed for sampling points that are not on the grid points. The justification for using Eq. (7) as the gradient of the density penalty terms was presented in [11] .
Computing the spreading force also needs bin-density computation and density smoothing. The partial differentiable equation in Eq.(4) can be efficiently solved using 2D-DCT/IDCT [8] .
mPL also employs a multi-level framework. It first performs recursive clustering and then places the modules at The placement solution at a coarser level is used to construct an initial placement solution at a finer level. This multi-level approach also saves runtime compared to a flat placement. The above-mentioned mathematic formulation and gradient/force computation works for any level of the placement within the multi-level framework. Note that in different levels, the number of placement objects and the grid size for bin-density computation differ.
We obtained the latest version of mPL source code from the author of [8, 11] . Based on the profiling information we collected, native force computation, bin-density computation, bin-density smoothing and spreading force computation are the key functions that are repeatedly used in the iterative process and are the most time-consuming parts in the analytical global placement. Figure 1 shows the profiling graph for a sample run using benchmark IBM01. Native force computation takes about 60% of the runtime, while bin-density computation, density smoothing and spreading force computation each takes about 10 to 15%; the remaining parts include parser, IO and clustering (less than 10%). Note this is the profiling for benchmark IBM01. The profiling differs for different benchmarks. For larger benchmarks, the core computation takes a larger ratio and the parser, IO and clustering take a smaller ratio (less than 4% of the total runtime typically).
NVIDIA GT200 SERIES GPU ARCHITEC-TURE
GPU is the processing unit specifically designed for handling the computation for computer graphics. Conventional GPU contains a number of processors for vertex processing, texture processing and fragment pipeline. Texture processing is normally vector-based as textures will contain RGB components for visualization. Writing general programs for GPU requires an in-depth understanding of the graphics concepts and a conversion from the general computing task to a graphics processing pipeline. This, often, is not an intuitive and easy task.
The Nvidia GT200 series GPU 2 uses a unified shader architecture, so that tasks for vertex processing, fragment pipelining and texture processing are all performed by the unified shader. Also, it uses scalar processors rather than vector processors; thus it is much friendlier for general com- Figure 2) , and shares the instruction unit. These processors run exactly the same threaded program in the SPMD (single program multiple data) model. The constant cache and texture cache are also shared by the processors in the same multiprocessor. Compared to the previous generation G80 series, GT200 also adds the double precision support. Each multiprocessor has one double precision unit.
CUDA is the C environment for programming the G80, G92 and GT200 series GPUs. It uses massive threading to overlap the computing with the memory access latency, and it also provides the scratchpad-like shared memory for fast access (note the size of the shared memory is very limited). Thread creation and switching are very lightweight on GPU compared to multi-core CPU. In the CUDA environment, users can write a SPMD code to map onto the hardware architecture. Inside the code, users can define the parallelism among different multiprocessors by specifying the number of parallel blocks to use:
. Also the parallelism inside each multiprocessor is specified by the number of threads inside each block: ℎ . A total of * ℎ threads shall be created for that code. The block ID and thread ID are used to identify different threads, and they can access different data based on their IDs. Each thread can access per-thread registers, per-thread-block shared memory, per-thread-block cache and per-device global memory.
The hardware architecture of the Nvidia GPUs also has certain limitations. Because of the SIMD fashion used in the stream multiprocessor, there is significant overhead in handling divergent control flows. No synchronization between thread-blocks is provided except through global memory on the board. Memory access needs to follow a special access pattern (called coalesced access) where a half-wrap (16 in the GT200 architecture) of threads shall access continuous memory locations for optimal memory bandwidth. These architecture restrictions impose considerable challenges for implementation, and require significant tuning efforts for better performance.
OUR IMPLEMENTATIONS

Native force computation
Native force is derived from the gradient of smoothed HPWL (the equation of the force is in Eq. (6)). To compute the force, first we need to compute
and ∑ − / for every hyperedge(net); next, for every pin in the hyperedge, we compute the force on the pin (and thus on the connected module) by computing
force on each module is the sum of the forces on all its pins. Figure 3 shows the pseudo code for the computation involved in Eq.(6). ( , ) are the offsets of the pin locations versus the central locations of the module. So ( + , + ) are the actual location of the pin. Clearly, we know that 
Parallelization schemes
Clearly the outer loop (the loop over the hyperedges) is parallelizable. The inner loop is also parallelizable as it mainly involves reduction operations and a parallelization scheme that is similar to a tree-adder (or a more optimized parallel reduction [15] ) can be implemented. If we parallelize the outer loop only, the inner loop will be executed sequentially. Since the loop bound of the inner loop (the loop over the pins) is the number of pins of the hyperedge (not a constant value), we need to handle load balancing carefully. If we parallelize the inner loop only, because the average number of pins in a hyperedge is a a small number, a parallel reduction implementation will leave many threads idling and thus we will not have a good utilization of the resources. We tried both approaches and found out that parallelizing outer loop delivers better performance.
In order to solve the load balancing issue, we sort the hyperedge array based on the number of pins in each hyperedge. CUDA uses a SIMD scheme within each group of threads (called wrap, one wrap is 32 threads). If we do not do sorting, the difference of the inner loop bound within one wrap may be large, and the threads that terminate earlier (a smaller inner loop bound) have to wait for the threads that terminate later (a larger inner loop bound) due to the restrictions of SIMD scheme. Through sorting, the difference of the inner loop bound within each wrap is minimized, and we can somewhat alleviate the negative effects due to load balancing of the SIMD scheme.
In addition to this, the memory writes into the array (line 17 to 20) need to use atomic writes to avoid possible conflicts, because it is possible that multiple threads try to update the same entry of simultaneously. (Atomic operation is a supported scheme in CUDA to avoid WAW hazard in the parallel scenario.) The device we use only supports integer atomic operations in the global memory; thus we also have to convert the to a fixed-point representation so that we can make use of integer atomic operations.
F o r A l l p i n ( o f t h a t h y p e r e d g e )
While this is a simple solution that works well, there is still room for further improvement. This parallelization scheme needs to use atomic writes. We discovered that the atomic write in the global memory is not very efficient, because atomic operations need to check all the pending requests, and serialize the requests, if necessary. Global memory access is a long-latency operation, and the serialization of the accesses would significantly impact the performance. We further looked into the algorithm and found that it can be rewritten in such a way that no atomic operation is needed.
The new pseudo code is shown in Figure 4 . The major difference is in the second half of the code. Instead of looping over hyperedges, we loop over modules. For each module, we compute the through reduction accordingly. This way, we can get rid of atomic operations. However, we need to store the values of , , and into arrays so that the second loop (line 15 to 26) can access them; the unmodified version in Figure 3 does not need to store them into arrays and can access them in registers directly. This does add some overhead, but the pseudo code in Figure 4 still runs faster than the version with atomic writes.
After we rewrite the code in a way similar to Figure 4 , we also need to reconsider the parallelization scheme for the second nested loop. We also have the option to parallelize either the outer loop (the loop over modules) or inner loop (the loop over pins), and the decision is influenced by the average number of pins in a module. mPL uses the multilevel approach and the number of modules in the placement varies at different levels. When the number of modules is small (coarse-level), each module has a large number of pins. We would like to parallelize the inner loop as the parallel reduction performs very well for large reductions. When the number of modules is large (fine-level), each module has a small number of pins. We would like to parallelize the outer loop. We use a threshold on the average number of pins in a module (32 in our implementation) to determine which scheme to use.
Data structures and their memory layouts for efficient access
In order to perform the actual computation on the GPU, the netlist data structure needs to reside in the memory space of GPU. Because the size of the array data is relatively large, we need to put the data structure in global memory space. It is reported that coalesced data access delivers one magnitude larger bandwidth than uncoalesced access. We need to design and layout the array data so that most data accesses are coalesced.
The initial implementation for pseudo-code in Figure 4 involves the following arrays: expCurrX and expCurrY for looking up / and / ; expPinX and expPinY for looking up / and / ; hedgeArray which stores the IDs of the pins in the hyperedges; moduleArray which stores the IDs of the pins in the modules; and pinToModuleArray which is used to the perform the inverse lookup in the getModule function; pinToHedgeArray which is used to perform the inverse lookup in getHyperEdge function. array and inverse look up from either the pinToModuleArray or the pinToHedgeArray. This two-step process can be reduced to a one-step process if the data arrays take the loop variable rather than the pin ID as the array index. For example, we can create an array called expCurrX hedgeArray where
With this approach, we can get rid of most indirect data accesses.
The CUDA programming manual states that global memory bandwidth is used most efficiently when the simultaneous memory accesses by threads in a half-warp (16) can be coalesced into a single memory transaction. So far we assume many arrays such as expCurrX hedgeArray are 2D arrays (one dimension is module or hyperedge ID and another dimension is for the index of pins in one module or hyperedge). Figure 5(a) shows the 2D array-based data layout. The black dots show a sample data access pattern by a half-warp of threads. We see that this layout can ensure a coalesced access. Clearly, it is not memory efficient, as the number of pins in a module / hyperedge varies a lot. We can also convert the layout into a 1D array and uses another array to record the starting address offset of each module / hyperedge. However, this will make coalesced access difficult and significantly degrade the performance. We designed a stair-case-like data array shown in Figure 5(b) . Instead of using a single 2D array, we use many small 2D arrays. The size of each small array equals the half-wrap (16) multiplied by the maximum number of pins in that half-wrap of modules. We can balance both the allocated memory size and the coalesced access requirements. We try to group the remaining uncoalesced aceesses together. For example, the accesses for SumExpPosX, SumExpPosY, SumExpNegX and SumExpNegY in the second nested loop in Figure 4 are uncoalesced. We merge these four arrays into a single array whose entry has a float4 type, so that the penalty of uncoalesced access is smaller as we access wider data.
Note that the data access pattern is strongly correlated with the parallelization scheme. The above-mentioned data layout is designed to parallelize the outer loop. Also note that we choose to parallelize the inner loop rather than outer loop when each module has a large number of pins. In this case, a 1D array with some padding is sufficient to satisfy the coalesced access requirements.
Bin density computation
Bin density counts the total overlapped area in each bin. We outline the basic nested loop structure for bin density Figure 6 . We also can select whether to parallelize the outer loop or inner loop. In the multi-level approach, the grid size is the same magnitude as the module size. The loop bound of the inner loop (the loop over bins in the list) for many modules is small (e.g, the loop bound is equal to 1 if a module resides completely within one bin), and parallelizing the inner loop will leave many computing units idle. Also, if we parallelize the inner loop only, some computation (finding the bins we need to update for that module) needs to be done sequentially. Parallelizing the outer loop does not carry these restrictions. Thus, we decide to parallelize the outer loop initially. We also use a sorting of the geometrical size of the module to help resolve the load balancing issues. However, there are also some macros that are much larger than standard cells, and these macros overlap with a large number of bins. For these macros, the overhead due to load balancing in the SIMD scheme comes into play if we parallelize the outer loop only. We have implemented two version of code. We parallelize outer loop for standard cell and dummy cells, and parallelize the inner loop for macro cells. This combined approach runs much faster than running any scheme alone.
We use atomic operations to update the bin density, because simultaneous writes into the same bin are possible. We also convert the bin density from floating point to fixedpoint (by multiplying a large number), as atomic operations in CUDA do not support floating point.
Density smoothing
Density smoothing mainly employs 2D-DCT/IDCT as the computational engine to solve the PDE specified in Eq.(4).
Here we use . * to denote point-wise multiplication. ℎ is a matrix derived from the Eigenvalues of the PDE system of Eq.(4) (this weight matrix is a constant matrix for one level of placement). There are two ways to perform 2D-DCT. The first approach is called direct matrix multiplication. The second approach relies on 2D-FFT. The complexity of direct matrix multiplication ( 3 ) is certainly larger than 2D-FFT ( 2 ). But since FFT has to frequently use strided memory access pattern, the FLOPS performance of 2D-FFT on GPU is usually much smaller than FLOPS in matrix multiplication.
We implemented three versions of codes for Eq. (8) . The first version uses direct matrix multiplication to compute DCT and IDCT. This version uses matrix multiplication sample code in the Nvidia CUDA SDK. The second version also uses direct matrix multiplication, but uses the CUBLAS library for matrix multiplication. The third version implements the FFT-based DCT using algorithms presented in [18] , and uses the CUFFT library for the FFT implemen-
End F o r A l l Table 1 . Based on these results, we decide to use direct matrix multiplication (version 1) to perform 2D-DCT when the grid size is no larger than 64 by 64, and use the FFT-based approach for larger grids.
Spreading force computation
The pseudo code for spreading force computation is shown in Figure 7 . This is a loop over all the modules; for each module, we need to compute four integrations (over the four boundaries of the module). We also decided to parallelize the outer loop over modules. The integration is computed using the composite trapezoidal rule [12] . We declared the smoothed density as a 2D texture, and used texture lookup to obtain the data used in integration. The hardware device has a texture cache to speedup the texture access. Linear interpolation is performed in the software implementation. CUDA also provides hardware-based interpolations including both nearest neighbor or linear interpolation. Out-ofboundary access is also handled automatically.
The magnitude of the spreading force and the native force on each module is used to determine the movement of each module. and compute the sum of the square of spreading force on the modules. This is used to detect the convergence or update some penalty scaling coefficients if necessary. A parallel reduction [15] is used to compute the sum.
Accuracy issue and miscellaneous pitfalls
The original software implementation uses double precision. The device we use can support double precision, but we do not use this because it is less efficient compared to single precision. (One multiprocessor (8 cores) shares a double precision unit, while each core has its own single precision unit.) Moreover, we turn on an option in the CUDA compiler "-use fast math" to sacrifice some accuracy for faster execution. Thus we have to convert the floating point to a fixed point due to the limitation of atomic operations. These accuracy compromises lead to a slightly different placement solution.
The approximated HPWL using the logsumexp model in Eq.(3) has a parameter . The software implementation uses = 0.01. The placement region is normalized to a [0, 1] by [0, 1] area. Thus the exp operation in Eq.(3) may generate a maximum value 1/0.01 , which can not be represented using single precision. and occurs in the internal computation if we leave the = 0.01. We use a = 0.012 in our CUDA implementations. The maximum single precision value is 3.4028234E+38. Setting (1/ ) <3.4028234E+38 will get > 0.01127. We enlarge the value of to 0.012 to reduce the out-of-range possibility in the subsequent additons/substractions of native force computation. The selection of is stable for the test cases we use. But there won't 
