In multiple electron beam lithography (MEBL), a layout is split into stripes, the layout patterns are cut by the stripe boundaries, and all the stripes are printed in parallel. If a via pattern or a long vertical wire overlaps a stitch, it may suffer from poor printing quality due to the so-called stitch error, which may degrade circuit performance. In this paper, we present a comprehensive study of stitch-aware detailed placement that simultaneously minimizes stitch errors and optimizes traditional objectives, e.g., wirelength and density. Experimental results on modified ICCAD 2014 benchmarks show that our algorithms guarantee zero stitch errors while the scaled half-perimeter wirelength remains comparable to that of a state-of-the-art detailed placer. In addition, our technique is generic and applicable to other placement targets, such as local congestion optimization, which is also demonstrated in the experimental results.
Introduction
Due to its capability for accurate pattern generation, e-beam lithography (EBL) is a promising candidate among next-generation lithography technologies for sub-14 nm nodes, along with other techniques such as extreme ultraviolet (EUV) lithography and directed self-assembly (DSA) [1-3]. However, low throughput is still the bottleneck of an EBL system. Recently, an extended EBL technique, multiple e-beam lithography (MEBL), has been proposed to improve manufacturing throughput using parallel beam printing [4]. An MEBL system uses thousands of parallel beams to write multiple layout patterns simultaneously. Industry has already explored different MEBL implementations and has demonstrated promising performance in terms of both lithography accuracy and throughput [5, 6].
In the MEBL manufacturing process, a layout is split into stripes, and the boundary between two touching stripes is defined as a stitch line. Each stripe has a width of 50 to 200 µm, and different stripes are printed simultaneously by different electron beams. Although the parallel writing scheme can dramatically improve system throughput, it also introduces serious printability issues. That is, each stitch can introduce a so-called stitch error in an area with a width of around 15 nm [5]. If a pattern overlaps a stitch, it may suffer from poor printing quality due to the stitch error. Therefore, if the layout is not carefully designed, the resulting shape distortion may cause yield issues or even functional errors in an MEBL system.
We observe very significant shape distortions on via patterns and long vertical wires. Fig. 1 shows two SEM images of shape distortion on a via layer and a metal layer, respectively. In Fig. 1(a) we can see that all vias are very regular inside the beam stripes. However, at the stripe boundaries, the vias suffer from obvious distortions and irregular shrinking. In Fig. 1(b) we can see that the vertical wires are malformed in the stitch regions. Similar observations were reported by Fang et al. [7], who found that vertical wires are more susceptible to stitch errors than horizontal wires.

There are several methods to minimize the impact of stitch errors from the lithography perspective, e.g., avoiding dividing a critical pattern into adjacent sub-fields [8], using different field sizes [9], or reducing the field size [10]. Recently, Fang et al. [7] considered stitch errors during the detailed routing stage. However, detailed routing is a very late stage in the physical design flow, so some stitch errors may be difficult to remove. For instance, stitch errors from vias dropped on pins of a standard cell cannot be optimized during the routing stage. There are various detailed placement algorithms that address other emerging issues in advanced technology nodes, such as multiple patterning lithography [11-13], N10 design rules [14], and multiple-row height cells [15-17], which are summarized in [18].
In this work we propose a comprehensive study to consider the stitch error removal in detailed placement. We can directly optimize the positions of both vias and intra-cell vertical wires. In addition, we consider local congestion, thus a router (e.g. [7] ) has more routing options to effectively remove stitch errors in higher metal layers. Fig. 2 shows a placement example with three gates, where the density of vertical metal1 segments varies from cell to cell. Some cells are more susceptible to stitch errors as they have vertical wire segments distributed at every site, while other cells have more space to avoid stitch errors. The comparison between Fig. 2 (a) and Fig. 2(b) shows that it is possible to smartly avoid stitch errors with small cell movement.
To the best of our knowledge, this is the first work taking stitch errors into consideration in placement stage. Our contributions are summarized as follows.
• We propose a comprehensive detailed placement study to simultaneously minimize the stitch error and optimize traditional objectives, e.g., wirelength and density.
• We develop a swap-based detailed placement engine with an optimal stitch aware single row placement.
• We present an $O(nM)$ pruning technique to speed up the single row problem, where n and M are the number of cells in a row and the maximum displacement, respectively.
• Our pruning technique is generic and applicable to conventional placement and other applications. We show that the pruning technique can also be adapted for local congestion optimization.
The rest of this paper is organized as follows. Section 2 introduces the stitch constraints and the problem formulation. Section 3 explains the optimization algorithms in detail. Section 4 presents the experimental results, followed by the conclusion in Section 5.
Preliminaries and problem formulation
In an MEBL system, stitch lines repeat periodically at equal intervals. If a standard cell is not carefully placed and overlaps a stitch line, it may suffer from a stitch error. In this work we consider three kinds of possible stitch errors, as follows. (1) Stitch over via: if a via is cut by a stitch line, it can lead to a potential disconnection. (2) Vertical routing: a vertical routing segment suffers more from stitch lines than a horizontal one. (3) Short polygon: a short horizontal routing segment with vias may also cause problems.
To accurately capture a stitch error, we partition each cell into sites with width equal to the poly pitch. Since some intra-cell segments or vias are very susceptible to stitches, we denote the sites covered by these segments or vias as dangerous sites. For example, Fig. 3 shows the dangerous sites of cell BUF_X8. Note that for simplicity, only intra-cell segments are illustrated here. A stitch error occurs if a dangerous site overlaps a stitch line of the MEBL system.
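The dangerous-site test described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the placer's actual code: coordinates are integers in nanometres, stitch lines are assumed to sit at positive multiples of the stripe width, a site is treated as the half-open interval [left, right), and the names `has_stitch_error`, `site_width` and `stripe_width` are hypothetical.

```python
from typing import Sequence


def has_stitch_error(cell_x: int, dangerous_sites: Sequence[int],
                     site_width: int, stripe_width: int) -> bool:
    """Return True if any dangerous site of the cell overlaps a stitch line.

    cell_x          -- left edge of the cell (nm), assumed non-negative
    dangerous_sites -- site indices inside the cell, counted from its left edge
    site_width      -- width of one site (nm), i.e. the poly pitch
    stripe_width    -- distance between consecutive stitch lines (nm)
    """
    for s in dangerous_sites:
        left = cell_x + s * site_width
        right = left + site_width
        # Index of the first stitch line at or to the right of `left`
        # (assumption: lines exist only at k * stripe_width with k >= 1).
        k = max(1, -(-left // stripe_width))  # ceiling division
        if k * stripe_width < right:
            return True
    return False


if __name__ == "__main__":
    # 50 um stripes, 64 nm sites: the second site of this cell spans x = 50000.
    print(has_stitch_error(49900, [1], 64, 50000))
    print(has_stitch_error(49900, [0], 64, 50000))
```

Because the check only depends on the cell position and its own dangerous sites, it can be evaluated independently for every candidate position of a cell.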
This work adopts the scaled half-perimeter wirelength (sHPWL) from the ICCAD 2013 placement contest, defined as follows.

$$\mathrm{sHPWL} = \mathrm{HPWL} \cdot (1 + \alpha \cdot P_{ABU}), \tag{1}$$

where α is set to 1, and HPWL denotes half-perimeter wirelength. $P_{ABU}$ represents the ABU penalty used to evaluate the placement congestion. Please refer to [19] for more details regarding the $P_{ABU}$ calculation.
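As a small worked example, the metric can be computed as below. The sketch assumes the commonly used contest form sHPWL = HPWL · (1 + α · P_ABU); the ABU penalty itself is taken as a given number (its computation is described in [19]), and the function names are hypothetical.

```python
from typing import List, Tuple

Pin = Tuple[float, float]


def hpwl(pins: List[Pin]) -> float:
    """Half-perimeter of the bounding box of one net's pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def scaled_hpwl(nets: List[List[Pin]], p_abu: float, alpha: float = 1.0) -> float:
    """Total HPWL scaled by the density penalty (assumed form, see lead-in)."""
    total = sum(hpwl(pins) for pins in nets)
    return total * (1.0 + alpha * p_abu)


if __name__ == "__main__":
    nets = [[(0, 0), (3, 4)], [(1, 1), (1, 5), (2, 2)]]
    print(scaled_hpwl(nets, p_abu=0.05))
```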
Problem 1 (Stitch Aware Detailed Placement). Given an initial detailed placement with the information of dangerous sites for each standard cell, we seek a legal placement that minimizes the stitch errors and the sHPWL simultaneously.

After solving Problem 1, we further perform a local congestion refinement step to improve local routability and pin access without introducing any additional stitch error; this step also demonstrates the flexibility of the proposed algorithm.
Problem 2 (Stitch Aware Local Congestion Refinement). Given a detailed placement solution without stitch errors, refine local congestion while minimizing displacement without introducing any stitch error.
It should be noted that Problem 2 aims at smoothing congested regions, as shown in Fig. 4, where cells are shown with their dangerous sites and pins are drawn as cross marks. The region in Fig. 4(a) is very congested due to the large number of pins in the region. We can insert whitespace to relieve congestion, as in Fig. 4(b), while it is necessary to avoid stitch errors at the same time. The problem is well suited to minimizing a maximum cost, which helps relieve congestion without large perturbation to the layout. Further details will be discussed in Section 3.3.

Y. Lin et al., INTEGRATION, the VLSI journal 58 (2017) 47-54
Detailed placement algorithms
In this section we describe the details of our placement algorithms. As shown in Fig. 5 , our framework mainly consists of two stages. In the first stage, single row based approach is applied to optimize wirelength and stitch errors optimally. If all stitch errors are removed successfully by this stage, we directly output placement solutions. Otherwise, in the second stage, cell swapping and movement are introduced to improve both wirelength and congestion. Note that the stitch error is considered through the whole flow.
Single row placement
As a powerful approach in detailed placement, single row based placement is widely studied in both conventional placement [20-22] and lithography-aware applications, such as triple patterning lithography (TPL) compliance [13, 23-26]. If there are fixed macros in the layout, conventional single row algorithms (e.g., Abacus [22]) divide a row into several sub-rows. However, this strategy is not suitable for the MEBL application, as the stitch lines are soft constraints rather than hard constraints. In TPL compliance, the main challenge lies in the distance between abutting cells, while the stitch errors in MEBL are not related to neighboring cells. In addition, in the single row algorithm proposed by [23], a graph-based approach is applied to find the optimal solution in $O(mnK)$. Here m is the number of sites in the row, n is the number of cells, and K is the number of pre-coloring solutions for each cell. Usually m is a very large number, so this algorithm may suffer from long runtime for large circuits.
In this paper we adapt a dynamic programming based algorithm [27] to solve single row detailed placement. Unlike other techniques (e.g., [22]), it can naturally handle both hard constraints (fixed macros) and soft constraints (stitch errors). Each cell is associated with a movable range, which is usually a finite set of candidate sites. The dynamic programming scheme is able to achieve the optimal solution for combined cost functions, such as movement, wirelength, and stitch errors. Note that compared with [27], we significantly improve the runtime complexity while still maintaining optimality.
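For convenience, Table 1 lists the notations used in this section. The single row problem is modeled as a graph in which each cell $c_i$ corresponds to a set of nodes, one for each displacement value from $-M$ to $M$, the value of which is marked in the node. Each edge also carries a cost according to Eq. (2). Two additional nodes, s and t, are inserted into the graph. The problem is then stated as finding the lowest-cost path from node s to node t, which can be solved with dynamic programming.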
The cost function $cost_i(p_i)$ used in the experiments is as follows,

$$cost_i(p_i) = \tau \cdot WL + \phi \cdot MOV + \nu \cdot SP, \tag{2}$$

where WL denotes the wirelength cost, MOV denotes movement, and SP denotes the stitch error penalty. SP is set to a very large number when a stitch error is generated, e.g., the half-perimeter of the layout. In our experiments, τ, ϕ, and ν are set to 10, 1, and 1, respectively. In the legalization step, we simply set τ and ν to zero.
Given an ordered sequence of cells S, to calculate the wirelength cost for cell $c_i$, we need to fix the positions of all other cells. The wirelength cost is determined by the bounding boxes of nets. But if cell $c_i$ is connected to any cell in S, the wirelength cost for cell $c_i$ cannot be determined, since the cells in S are not fixed. To handle this, we adopt the wirelength model in [28], which ensures that the wirelength cost for cell $c_i$ is independent of the other cells in S while the optimality of the solutions is maintained. If cell $c_i$ is connected to any cell $c_j$ to the left of $c_i$ in the same placement row, we regard the position of $c_j$ as the left boundary of the row when computing the wirelength cost for $c_i$. Similarly, if cell $c_i$ is connected to any cell $c_j$ to the right of $c_i$ in the same placement row, we regard the position of $c_j$ as the right boundary of the row. This model is widely used in ordered single row placement and turns out to be equivalent to HPWL [28, 29].

Table 1
Notations used in single row placement.

Symbol | Description
M | Maximum displacement for a cell.
$t_i(p_i)$ | The cost of best placement solution from $c_1$ to $c_i$ in which $c_i$ is placed …
… | The position of $c_{i-1}$ in the optimal solution of $c_1$ to …
… | Whether the solution corresponding to $t$ …
$cost_i(p_i)$ | The cost of $c_i$ when it is placed to $p_i$.
$w_i$ | Width of cell $c_i$.
We can see that the function $cost_i(p_i)$ is quite flexible, since it can include movement, wirelength, and stitch errors. For hard constraints like fixed macros, we only need to set the maximum displacement M to zero. For soft constraints like stitch lines, an additional cost is applied if a cell overlaps them.

Lemma 1. The algorithm shown in Fig. 6 is optimal for the cost function in Eq. (2).
The proof is similar to that in [27] and is omitted here for brevity. The basic idea is that the optimal placement solution can be found through a shortest path from s to t, and the positions of all cells can be derived from the displacement values of the corresponding nodes. Since the constructed graph is a directed acyclic graph, the shortest path can be calculated using topological traversal in $O(nM^2)$ steps, where n is the number of cells in the row and M is the maximum displacement for each cell.
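The shortest-path formulation can be sketched as a direct dynamic program over candidate positions. The following is an illustrative baseline only, assuming positions are site indices, cells keep their left-to-right order and may not overlap, and the per-cell cost is supplied as a callback; the names `single_row_dp`, `x0`, `widths` and `row_sites` are hypothetical. For each of the n cells it scans up to 2M+1 predecessors for each of its 2M+1 candidates, which gives the quadratic dependence on M.

```python
from typing import Callable, List, Optional, Tuple


def single_row_dp(x0: List[int], widths: List[int], M: int,
                  cost: Callable[[int, int], float],
                  row_sites: int) -> Optional[Tuple[float, List[int]]]:
    """Ordered single row placement by exhaustive DP (n >= 1 cells assumed).

    x0[i] is the initial left site of cell i, cost(i, p) the cost of placing
    cell i at site p. Returns (total cost, positions) or None if infeasible.
    """
    n = len(x0)
    cand = [[p for p in range(x0[i] - M, x0[i] + M + 1)
             if p >= 0 and p + widths[i] <= row_sites] for i in range(n)]
    best = [{p: cost(0, p) for p in cand[0]}]  # best[i][p]: cheapest prefix cost
    prev = [{}]                                # prev[i][p]: position of cell i-1
    for i in range(1, n):
        best_i, prev_i = {}, {}
        for p in cand[i]:
            # Cell i-1 must end at or before p (order kept, no overlap).
            options = [(best[i - 1][q], q) for q in best[i - 1]
                       if q + widths[i - 1] <= p]
            if options:
                t, q = min(options)
                best_i[p] = t + cost(i, p)
                prev_i[p] = q
        best.append(best_i)
        prev.append(prev_i)
    if not best[-1]:
        return None
    p = min(best[-1], key=best[-1].get)
    total = best[-1][p]
    pos = [0] * n
    for i in range(n - 1, -1, -1):  # walk the stored predecessors backwards
        pos[i] = p
        if i > 0:
            p = prev[i][p]
    return total, pos


if __name__ == "__main__":
    x0, w = [0, 2], [2, 2]
    # Movement cost plus a large penalty when the cell covers site 3.
    c = lambda i, p: abs(p - x0[i]) + (100 if p <= 3 < p + w[i] else 0)
    print(single_row_dp(x0, w, 2, c, 8))
```

In the usage example, covering site 3 stands in for a stitch penalty, so the second cell is pushed two sites to the right rather than left on the penalized site.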
An $O(nM)$ pruning algorithm
The runtime complexity of the above single row placement is $O(nM^2)$. When M is very large, the runtime becomes unacceptable. Here we propose a set of pruning techniques to achieve further speedup while still keeping optimality. In addition, we can theoretically prove that the runtime complexity is improved from $O(nM^2)$ to $O(nM)$.
Algorithm 1. Single row placement with pruning
Require: A set of ordered cells $c_1$ to $c_n$ of a row.
Ensure: All the cells in the set are placed subject to the optimal objective function.
1: for i ← n down to 2 do
28: end for
The details of our $O(nM)$ implementation are shown in Algorithm 1. The main difference between the problems in [23, 27] and our problem lies in the cost function. That is, the cost functions for a cell in the former problems depend on other cells, such as the distance or coloring cost between two abutting cells, while the cost defined in Eq. (2) is only related to the cell itself; i.e., it is independent of any other cell. Due to this independence in the cost function, we can minimize the total cost with $O(nM)$ time complexity. Our speedup technique is generic and can also be applied to conventional detailed placement and legalization with an objective like wirelength or movement.
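To illustrate why independence of the per-cell cost allows linear time in M, the sketch below replaces the inner scan over predecessor positions with a running prefix minimum. It is not claimed to reproduce Algorithm 1 step by step; it is a standard prefix-minimum dynamic program that relies on the same property, with hypothetical names (`single_row_linear`, `x0`, `widths`, `row_sites`) and positions expressed as site indices.

```python
import math
from typing import Callable, List, Optional, Tuple


def single_row_linear(x0: List[int], widths: List[int], M: int,
                      cost: Callable[[int, int], float],
                      row_sites: int) -> Optional[Tuple[float, List[int]]]:
    """Ordered single row placement in O(nM) for costs that depend only on
    (cell, position). Assumes n >= 1 cells."""
    n = len(x0)
    lo = [max(0, x0[i] - M) for i in range(n)]
    hi = [min(row_sites - widths[i], x0[i] + M) for i in range(n)]
    pm = []    # pm[i][k]: (best cost, position) over candidates lo[i]..lo[i]+k
    back = []  # back[i][k]: position of cell i-1 when cell i sits at lo[i]+k
    for i in range(n):
        row_pm, row_back = [], []
        running = (math.inf, -1)
        for p in range(lo[i], hi[i] + 1):
            if i == 0:
                base, q = 0.0, -1
            else:
                # Rightmost legal candidate of cell i-1 is p - widths[i-1].
                k = min(p - widths[i - 1], hi[i - 1]) - lo[i - 1]
                base, q = pm[i - 1][k] if k >= 0 else (math.inf, -1)
            t = base + cost(i, p)
            row_back.append(q)
            if t < running[0]:
                running = (t, p)  # new prefix minimum
            row_pm.append(running)
        pm.append(row_pm)
        back.append(row_back)
    if not pm[-1] or pm[-1][-1][0] == math.inf:
        return None
    total, p = pm[-1][-1]
    pos = [0] * n
    for i in range(n - 1, -1, -1):
        pos[i] = p
        if i > 0:
            p = back[i][p - lo[i]]
    return total, pos


if __name__ == "__main__":
    x0, w = [0, 2], [2, 2]
    c = lambda i, p: abs(p - x0[i]) + (100 if p <= 3 < p + w[i] else 0)
    print(single_row_linear(x0, w, 2, c, 8))
```

Each candidate position is visited once and looks up a single prefix-minimum entry, so the work per cell is proportional to its number of candidates. If the cost of a cell depended on its neighbor's position, the prefix minimum would no longer be valid as written.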
Flexibility of pruning techniques
In this section, we solve Problem 2 with an extension of the single row algorithm with the pruning technique. It should be noted that our pruning algorithm works for any cost function $cost_i(p_i)$ as long as it depends only on $p_i$ itself. That is, it can be applied to speed up conventional single row detailed placement problems [20-22].
Furthermore, the cost function can also be extended from summation to maximization in line 8 of Algorithm 1. The application comes from the minimization of maximum displacement of cells when optimizing local congestion, where the cost function of each cell is adjusted from Eq. (2) to the following,
where
is the spacing cost between two neighboring cells and UB is a user-defined upper bound for the spacing cost. The spacing cost can be more complicated, such as that in [27], as long as it is non-increasing as the spacing increases; we use a simple version for illustration. We switch the symbol of the cost function from $cost_i(p_i)$ to a form that takes both positions as arguments, because the cost in Eq. (4) depends on the positions of both cells $c_{i-1}$ and $c_i$.
In Algorithm 1, line 8 is adjusted to,
.
It should be noted that enabling minimization of the maximum cost ensures small spacing and displacement cost of the worst case, which facilitates to solve Problem 2.
With the extension of cost function, it is not hard to see that Lemma 2 still holds, which means we can still prune inferior solutions in lines 18, but the correctness of early exit in line 14 needs to be explained. Due to the pruning of inferior solutions, t p ( )
is non-decreasing (increasing) w.r.t p i−1 . The maximization operation between a decreasing function and a non-decreasing function results in the fact that, given p * i−1 as the best position, in the region of p p ≤ *
is non-increasing, while in the region of p p ≥ * 
where the equality comes from the discussion in Fig. 7 
which indicates that current solution of p i−1 and q i is inferior to α p ( )
with $p^*_{i-1}$ and $q_i$. Therefore, we can directly start from $p^*_{i-1}$ when searching for the best solution of $t_i(q_i)$. Although the overlap condition is not mentioned explicitly, it can be integrated into the spacing cost, which still yields a cost function that is non-decreasing w.r.t. $p_{i-1}$. Hence the proof holds for the new cost function with the maximization operation and spacing cost.
Stitch aware global swap
In this step, the main objective is to optimize regions that contain cells involved in stitch errors. After the optimization of single row placement, most stitch errors have been resolved. The remaining ones usually appear in highly congested placement bins. Therefore, we only try to move cells in such bins to alleviate the congestion and meanwhile reduce wirelength.
Due to the congestion of these regions, it is difficult to resolve them with local perturbation such as reordering or sliding window. Thus global swap [28, 30] is adopted where cells are allowed to move anywhere within the displacement constraints. Generalized swap not only enables swapping with cells but also white spaces, which integrates both swapping and moving strategies. The basic procedure for cell swap is iteratively repeating the following three steps: (1) Select a source cell to swap; (2) Identify optimal region for source cell; (3) Find the best cell or white space to swap with the source cell in the optimal region.
In our implementation, we set the score function for a swap as follows,

$$score = \Delta sHPWL - \lambda \cdot P_{ds} - \mu \cdot P_{ov}, \tag{8}$$

where $\Delta sHPWL$ indicates the sHPWL improvement, $P_{ds}$ indicates the penalty for the density increase of dangerous sites, and $P_{ov}$ is the overlap penalty. Suppose cell $c_i$ is in bin $B_i$ and cell $c_j$ belongs to bin $B_j$. The area of both bins is $A_b$. We define the density of dangerous sites as the number of dangerous sites over the total number of available sites in a bin. If a bin overlaps any stitch line, we count only 70% of its total sites as available. Let $D_{ds}(i)$ denote the density of dangerous sites in bin $B_i$ before the swap and $D'_{ds}(i)$ denote the density of dangerous sites in bin $B_i$ after the swap. Then we can define $P_{ds}$ with the following equation:
The overlap penalty is the area difference between the source cell and the target cell or white space. If the target white space is larger than the source cell, the overlap penalty is zero. To achieve a numeric scale equivalent to the wirelength cost, $P_{ds}$ and $P_{ov}$ are divided by the site half-perimeter in the implementation. In this way, all the costs have the same unit as distance. λ and μ are set to 100. Only the swap attempt with the best positive score is accepted. The scoring scheme proposed in Eq. (8) aims at balancing the density of cells and dangerous sites while improving wirelength. Although the penalty from ABU density is able to handle the global density distribution, local control is necessary to avoid extremely dense regions. Furthermore, it is easier to find a stitch-error-free solution for a congested region with very few dangerous sites than for one with many dangerous sites. Thus we introduce $P_{ds}$ as an additional penalty for such regions. Since a row-based legalization engine is applied, the height of the bins for $P_{ds}$ is set to the row height.
Overlap penalty is introduced to control the efforts during legalization. High legalization efforts will incur large displacement for some cells and thereby large wirelength degradation. Hence, after every 5000 swaps, legalization algorithm will be performed to remove overlaps. Legalization algorithm is based on single-row placement (Section 3.1) with minimum movement as an objective.
We observe that the runtime of global swap is highly related to the complexity of the score function. Since wirelength is included in the calculation, querying the bounding box of large nets would be very slow. Thus we develop a data structure in which the pins of a net are stored as an ordered sequence according to pin positions. Cells in a row are kept in a linked list [30] for fast cell swap and movement.
Usually a cell is connected to a limited number of nets, so its degree can be treated as constant. Using the data structures above, it takes only constant time to query the bounding box and $O(\log e)$ to update a cell position in a net with e pins. Since score calculation happens much more frequently than actual cell swap or movement, faster score calculation helps to reduce the overall runtime. Let k be the number of swapping candidates for a cell $c_i$; we can achieve $O(k)$ time complexity for score calculation and $O(\log e_{max})$ for the cell position update if a swap or movement is accepted, where $e_{max}$ is the maximum e over the nets connected to cell $c_i$.
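The ordered pin sequence can be illustrated for one axis as follows. This is a sketch with hypothetical names (`NetPins`, `span`, `move`), not the data structure used in the implementation. It keeps pin coordinates sorted so that the bounding-box extent is read from the two ends in constant time. Note the caveat in the comments: Python's built-in list locates a pin in logarithmic time via `bisect`, but removing and inserting shifts elements, so a balanced tree or similar container would be needed to obtain a true logarithmic update.

```python
import bisect
from typing import Iterable, List


class NetPins:
    """Pin coordinates of one net along one axis, kept in sorted order."""

    def __init__(self, coords: Iterable[float]) -> None:
        self.coords: List[float] = sorted(coords)

    def span(self) -> float:
        """Bounding-box extent along this axis, read from the two ends: O(1)."""
        return self.coords[-1] - self.coords[0]

    def move(self, old: float, new: float) -> None:
        """Relocate one pin from coordinate `old` to `new`.

        The binary search is O(log e); list.pop and insort then shift
        elements, which is linear in the worst case for a Python list.
        """
        i = bisect.bisect_left(self.coords, old)
        if i == len(self.coords) or self.coords[i] != old:
            raise ValueError("pin coordinate not found in net")
        self.coords.pop(i)
        bisect.insort(self.coords, new)


if __name__ == "__main__":
    net_x = NetPins([5, 1, 9])
    print(net_x.span())   # extent before the move
    net_x.move(9, 3)
    print(net_x.span())   # extent after moving the rightmost pin inward
```

The HPWL of a net is then the sum of `span()` over one such structure for x and one for y, so evaluating a candidate swap does not require scanning all pins of a large net.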
Experimental results
Our algorithms were implemented in C++ and tested on a 3.40 GHz Linux machine with 32 GB memory. Since traditional academic placement benchmark suites have no intra-cell wire information, we integrated the NanGate 15 nm standard cell library [31] into the ICCAD 2014 placement benchmarks [19]. The ICCAD 2014 placement contest defines two maximum displacement values for each benchmark, and we choose the smaller ones for less perturbation to the original placements. We applied a state-of-the-art detailed placer, RippleDp [32], to generate the initial placement solutions. We scaled the bin dimensions for ABU density analysis from the ICCAD 2014 benchmarks, so that most generated test cases match the number of bins in the original ones. We pre-computed dangerous sites for all standard cells in the library, which served as input to our placer. We set the stripe width of each single beam to 50 µm.
The metrics of the new benchmarks are shown in Table 2, where columns "#cells" and "#nets" list the total cell number and net number, respectively. Columns "#blk", "d_t" and "Disp." represent the blockage (fixed macro) number, the target density, and the maximum displacement in µm, respectively. Target density d_t is necessary for computing the ABU penalty. Column "Util." denotes the area utilization of the benchmarks. Note that test cases mgc_edit_dist, mgc_matrix_mult and netcard contain mixed-size cells.

Table 3 lists the performance of our placer at different optimization stages. The initial placement solutions (column "Init.") are generated by a traditional detailed placer, RippleDp [32], which aims at minimizing wirelength. As a state-of-the-art detailed placer, RippleDp can produce very high quality placement solutions in terms of both HPWL and sHPWL. Here we set the displacement constraint to a very large number so that RippleDp can produce converged results. Column "SR" stands for single row placement, while column "Full Flow" denotes the whole flow combining global swap and single row placement. To evaluate the effectiveness of our algorithms, the following metrics are introduced. HPWL stands for half-perimeter wirelength, which is used as a metric for wirelength. ST# represents the number of cells that contain stitch errors; it is measured by how many dangerous sites are covered by the beam boundaries. Placement solutions with high congestion are not desired, so we introduce sHPWL as discussed in Section 2. Runtime is the CPU time in seconds, measured with a single thread for consistency of results.
From Table 3 we can see that, with certain displacement constraints, the proposed single row placement is very effective in stitch error cancellation. That is, 99.9% of the initial stitch errors are removed. Meanwhile, an average HPWL improvement of 0.19% and a slight sHPWL increase are observed. However, for some corner cases, such as leon2 and netcard, the single row placement is not powerful enough due to the movement constraints from blockages or congestion. Therefore, global swap is introduced as a follow-up optimization step, and the corresponding results are shown in the last column. We can see that swapping cells between rows improves congestion in dense regions and optimizes wirelength. By applying global swap together with the single row algorithm, we are able to achieve zero stitch errors for all test cases. As only a small number of bins are considered for global swap, the runtime overhead is negligible. Small changes in HPWL and sHPWL also indicate that the algorithm produces little perturbation to the initial placement.

It should be noted that the runtime of single row placement in Table 3 for case netcard is very close to that of leon2, while the former has a much larger cell number. The reason lies in the blockages in netcard. That is, the runtime of single row placement is related not only to the number of cells but also to the amount of maximum displacement. Blockages have zero maximum displacement, so during the propagation of candidate solutions in the dynamic programming process, many infeasible solutions are automatically pruned. Therefore, the solution space is significantly reduced and, as a consequence, the best solution is found in shorter time.

Fig. 8 compares the runtime with and without the pruning techniques for varying numbers of cells in a row. The data is collected directly from the benchmarks in Table 2, and the runtime values of rows with the same number of cells are averaged.
We can see that the runtime grows linearly with the problem size, and the difference in the slopes shows that the pruning techniques effectively reduce runtime. We compare the solutions obtained with and without the pruning techniques and average the runtime in Fig. 8. On average, the $O(nM)$ pruning technique provides around 30× speedup without any loss of optimality.
We also evaluate the effect of the local congestion optimization discussed in Problem 2 and Section 3.3 as a post-refinement step. Table 4 compares the results of the full flow in [33] with those obtained with our congestion refinement. Since the objective of the refinement is to increase the gap between pairs of cells while minimizing the maximum displacement, we introduce a metric called "pair spacing ratio (PSR)" for each pair of horizontally neighboring cells to evaluate the performance,

$$PSR(c_i, c_j) = \frac{size_i + size_j}{p_j + size_j - p_i},$$

where $size_i$ and $size_j$ denote the widths of cells $c_i$ (left) and $c_j$ (right), respectively. The lower left corners of cells $c_i$ and $c_j$ are denoted by $p_i$ and $p_j$, respectively. In other words, $PSR(c_i, c_j)$ is defined as the total width of the two cells divided by their total spanning width including the spacing between them. The overall PSR cost is evaluated as the average PSR of all horizontally neighboring cell pairs. From the table, we can see that the congestion refinement is effective not only in removing local congestion but also in smoothing the density: PSR and sHPWL are improved by 4% and 1.1%, respectively, compared with the flow in [33], while no stitch errors occur. Although there is degradation in HPWL, better density and local congestion are more important for routability and final routed wirelength, considering the improvement in sHPWL. In the refinement, we set the maximum displacement M to 10 to avoid large perturbation to the layout, which also speeds up the algorithm. As a consequence, there is only 7% runtime overhead. The weight for the spacing cost is set to 10 in the experiment, and UB is set to the width of the smaller cell in each cell pair.
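The PSR definition given in words above (total width of the two cells divided by their spanning width) can be evaluated with a few lines. This is a minimal sketch with hypothetical function names; a row is assumed to be given as (position, width) pairs already sorted from left to right, with at least two cells.

```python
from typing import List, Tuple


def pair_spacing_ratio(p_i: float, size_i: float,
                       p_j: float, size_j: float) -> float:
    """PSR of a left cell (p_i, size_i) and its right neighbor (p_j, size_j)."""
    return (size_i + size_j) / (p_j + size_j - p_i)


def average_psr(row: List[Tuple[float, float]]) -> float:
    """Average PSR over all horizontally neighboring pairs of one sorted row."""
    pairs = list(zip(row, row[1:]))
    return sum(pair_spacing_ratio(pi, si, pj, sj)
               for (pi, si), (pj, sj) in pairs) / len(pairs)


if __name__ == "__main__":
    # First pair has a one-site gap (PSR 0.8), second pair abuts (PSR 1.0).
    print(average_psr([(0, 2), (3, 2), (5, 2)]))
```

Abutting cells give a PSR of 1, and the value decreases as whitespace is inserted between the pair, so a lower average indicates more evenly spread cells.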
Conclusion
This work develops the first placement framework that considers e-beam stitch errors during the detailed placement stage. A linear-time single row placement algorithm is proposed with highly adaptable objective functions. Experimental results show its effectiveness in stitch cancellation while maintaining wirelength and congestion. In collaboration with stitch-aware post-placement optimization such as [7], better manufacturability can be achieved. In addition, our high-performance pruning technique can be naturally embedded into existing physical design flows with different metrics (e.g., wirelength, routability, or congestion).
