Design-technology co-optimization in next generation lithography by Zhang, Hongbo
© 2012 Hongbo Zhang
DESIGN-TECHNOLOGY CO-OPTIMIZATION IN NEXT GENERATION
LITHOGRAPHY
BY
HONGBO ZHANG
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2012
Urbana, Illinois
Doctoral Committee:
Professor Martin D. F. Wong, Chair
Associate Professor Deming Chen
Professor Elyse Rosenbaum
Dr. Rasit O. Topaloglu, IBM
ABSTRACT
Lithography continues to be the backbone of the integrated circuit (IC) in-
dustry. While the critical dimension (CD) keeps shrinking with the pace of
Moore's law, the progress in lithography lags far behind. The gap between
manufacturing capability and the expectation of design performance finally
pushes the conventional 193 nm ArF immersion lithography to its limit, call-
ing for a completely new set of design and manufacturing methodologies under
the scope of next generation lithography. Many innovations are needed to
co-optimize both design and process at the same time.
Design-technology co-optimization (DTCO) in next generation lithography
(NGL) could be defined very differently under different circumstances. In
general, progress in NGL happens along four different directions:
• new patterning technique (e.g. litho-etch-litho-etch, self-aligned pattern-
ing)
• new design methodology (e.g. restricted design rule, 1-D design)
• new illumination system (e.g. extreme ultraviolet lithography, electron-
beam, directed self-assembly)
• new simulation and verification approach (e.g. process windows optical
proximity correction, parallel simulation)
Corresponding to these four research directions, in this dissertation, we
propose our research topics as follows:
• Self-aligned double patterning (SADP)/self-aligned quadruple pattern-
ing (SAQP) for new patterning technique (Chapters 2–5)
• 1-D design for new design methodology (Chapters 6–8)
• Extreme ultraviolet (EUV) for new illumination system (Chapters 9–10)
• Graphics processing unit (GPU)-based aerial image for new simulation
method (Chapter 11)
For the research direction of new patterning technique, we mainly study
self-aligned patterning techniques. The SADP process has been studied exten-
sively in the past. This self-aligned double patterning strategy is de-
signed to mitigate the inevitable overlay in other pitch-splitting multiple
patterning techniques, such as litho-etch-litho-etch (LELE). However, SADP-
based design-technology flows are still incomplete in three senses. First, in
the SADP process the mask pattern is no longer the designed pattern, which ne-
cessitates an efficient decomposition approach for mask generation. Second,
finding a decomposition result with the best overlay control is a
non-trivial question. Third, design rule checking should be expanded to
detect the hot spots in the layout that make the whole layout indecompos-
able. In this dissertation, we propose a flow for SADP decomposition that
addresses all three of these gaps. This study leads to
the first published decomposition algorithm, the first overlay minimization al-
gorithm and the first hot spot detection algorithm for SADP decomposition.
As an extension, we then propose a novel characterization flow to provide
the first design enablement estimation method for SAQP lithography.
For the research direction of new design methodology, our major contri-
bution is on 1-D design optimization. The 1-D regular design style is a novel de-
sign style under the requirement of restricted design rules for better process
windows. Depending on the pitch requirements of different technology nodes,
1-D circuit patterns can be manufactured by single patterning or by the print-and-
cut technique. For the single patterning technology, we study the tip-to-tip gap
distribution for a better process window and propose a novel algorithm to
retarget the line-ends and floating dummies with or without performance im-
pact constraints. With our proposed algorithms, we can significantly increase
the process windows with limited impact on the original design. For the
print-and-cut technique, we focus on reducing cut mask complexity. By
optimally extending the line-ends, we can largely reduce the cut mask com-
plexity and thus save manufacturing costs.
For the EUV process as a new illumination system, we focus on a defect
mitigation algorithm. The EUV process is new compared to the conven-
tional 193 nm immersion lithography. With a wavelength of only 13.5 nm, the EUV
process is capable of patterning circuits at sub-20 nm technology nodes.
However, before the final implementation of the EUV process, defective blanks for
mask manufacturing are still a huge problem that needs to be addressed. In
this dissertation, we propose an efficient layout shift and rotation method
to mitigate blank defect impact. Our algorithm shows a significant speedup
compared to the existing commercial tool. We also extend our algorithm to
accept small-angle rotation to increase the success rate
of defect mitigation.
For the new simulation/verification method research direction, we use
GPUs to accelerate aerial image simulation. Aerial
image simulation is a fundamental step in process-related simulation and
verification, and it requires vast numerical computation. The recent advance-
ment of general purpose GPU computing provides an excellent opportunity
to parallelize the aerial image simulation and achieve great speedup. In this
dissertation, we present and discuss two GPU-based aerial image simulation
algorithms. Compared to the previous work, our approach has significant
speedup and much smaller numerical errors.
To my parents
To my wife
To my motherland
ACKNOWLEDGMENTS
I am heartily thankful to my adviser, Prof. Martin D. F. Wong. His en-
couragement, guidance and support have constantly led me forward from the
initial to the final level of my thesis subjects.
Besides my adviser, I would like to show my gratitude to the rest of my
doctoral committee, Prof. Deming Chen, Prof. Elyse Rosenbaum and Dr.
Rasit Topaloglu, for their insightful comments and constructive suggestions.
I also want to express my grateful thanks to Semiconductor Research Cor-
poration for funding and supporting my research on lithography. In particu-
lar, I would like to thank my liaisons Dr. Kai-Yuan Chao, Dr. Will Conley,
Dr. Rasit Topaloglu for sharing their precious experience and knowledge for
my research and experiments. I would also like to thank Dr. Jongwook Kye,
Dr. Yunfei Deng, Dr. Yuansheng Ma, Dr. Fan Jiang, Dr. Pawitter Mangat
and Dr. Chris Clifford for the great help on the research of double pat-
terning and EUV during my internship at GlobalFoundries Inc. My special
thanks also go to Mr. James Hutchinson for offering great instruction
to improve my paper writing skills.
It is my great honor to share my happiness with the members of Prof.
Wong's research team. I would like to thank Yuelin Du for the great team-
work during the past years. I would also like to thank Prof. Mark Po-Hung
Lin, Dr. Tan Yan, Qiang Ma, Ting Yu, Zigang Xiao, and Haitong Tian for
all the great inspiring discussions and productive collaborations we had. I
would like to thank Leslie Hwang for always sharing her jokes and food with
every one of us. I would also like to thank Dr. Hui Kong, Dr. Lijuan Luo,
Dr. Liang Deng and Dr. Yu Zhong for the help with my life and studies.
Last but not least, I owe my deepest gratitude to my family for their
endless love and support. I want to thank my parents for raising me up, and
providing me whatever they have to let me pursue my goal. I also want to
thank my wife for the support and encouragement of my Ph.D. study.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
I INTRODUCTION
CHAPTER 1 INTRODUCTION
1.1 Lithography System
1.2 Next Generation Lithography
1.3 Overview of this Dissertation
II NEW PATTERNING TECHNIQUE
CHAPTER 2 SADP DECOMPOSITION
2.1 Introduction
2.2 Overview of 2D SADP Process
2.3 Problem Formulation
2.4 Negative Tone Process Decomposition
2.5 Experimental Results
2.6 Conclusions
CHAPTER 3 OVERLAY MINIMIZATION FOR SADP
3.1 Introduction
3.2 Layout Decomposition Problem Formulation
3.3 Problem Reduction
3.4 Experimental Results
3.5 Conclusions
CHAPTER 4 HOTSPOT DETECTION FOR SADP
4.1 Introduction
4.2 ILP-Based Algorithm
4.3 Graph-Based Algorithm
4.4 Experimental Results
4.5 Conclusions
CHAPTER 5 CHARACTERIZATION OF SAQP-FRIENDLY DESIGN
5.1 Introduction
5.2 Process Information
5.3 SAQP Patterning Conditions and Feasibility
5.4 Patterning Rules
5.5 SAQP-Friendly Layout and Feature-Region Assignment
5.6 Experimental Results and Analysis
5.7 Conclusions
III NEW DESIGN METHODOLOGY
CHAPTER 6 LAYOUT OPTIMIZATION FOR OPTIMAL PRINTING
6.1 Introduction
6.2 Description of the Problem
6.3 Uniformity-Aware Cell Design Guideline
6.4 Experimental Results
6.5 Conclusions
CHAPTER 7 LAYOUT OPTIMIZATION WITH PERFORMANCE CONSTRAINTS
7.1 Introduction
7.2 The 1-D Layout Modification Problem
7.3 Algorithms
7.4 Experimental Results
7.5 Conclusion
CHAPTER 8 CUT-MASK OPTIMIZATION
8.1 Introduction
8.2 Overview of Print-and-Cut Process for 1-D Design
8.3 Problem Formulation
8.4 Polygon Simplification Algorithm
8.5 Experimental Results
8.6 Conclusion
IV NEW ILLUMINATION SYSTEM
CHAPTER 9 EUV MASK BLANK DEFECTS MITIGATION I
9.1 Introduction
9.2 Background
9.3 Problem Formulation
9.4 Problem Solutions
9.5 Experimental Results
9.6 Conclusion
CHAPTER 10 EUV MASK BLANK DEFECTS MITIGATION II
10.1 Introduction
10.2 Background
10.3 Problem Formulation
10.4 Problem Solutions
10.5 Experimental Results
10.6 Conclusions
V NEW SIMULATION METHOD
CHAPTER 11 ACCELERATING AERIAL IMAGE SIMULATION WITH GPU
11.1 Introduction
11.2 Background
11.3 GPU-Based Aerial Image Simulation
11.4 Experimental Results and Analysis
11.5 Conclusions
VI CONCLUSION
CHAPTER 12 CONCLUSIONS AND FUTURE WORK
REFERENCES
LIST OF TABLES
2.1 Results of Decomposable Standard Cells in the Nangate
Cell Library [24]
3.1 Results of Decomposable Standard Cells in Nangate Stan-
dard Cell Library
5.1 Geometry Constraints for Future Technology Nodes
5.2 Distance between Different Feature Regions
6.1 Different Test Circuit and the Run Time of Our Program
7.1 Runtime of Shortest Path Method
7.2 Runtime of Approximated Method
8.1 Process Control Parameter Used in the Experiment
8.2 Experimental Results from Three 28 nm Benchmarks
9.1 Experimental Results for Different Test Sets
10.1 Experimental Results for Different Test Sets
11.1 Data Set for Our Experiments
11.2 Comparison between the Three Approaches. (Unit of time
is second.)
LIST OF FIGURES
1.1 Schematic diagram of the conventional lithography projec-
tion system.
2.1 An example to show SADP decomposition result with pos-
itive tone process [22].
2.2 Positive tone process and negative tone process [22].
2.3 Overlap needs to be larger than overlay distance to help
avoid feature overlay.
2.4 Impact of overlay on different edges.
2.5 Variable definition on each tile.
2.6 Sidewall adjacency rule setup based on the sidewall width
from the process setting.
2.7 Minimum corner-corner rule setup based on the MinCrnr2Crnr
from the process setting.
2.8 Minimum core space rule setup based on the MinCoreSpace
from the process setting.
2.9 Minimum core width rule setup based on the MinCoreWidth
from the process setting.
2.10 Extended work to perform overlay reduction.
2.11 The decomposition result for AND4_X1 [24].
3.1 Variable definition on each tile.
3.2 Constraints setup based on geometry rules.
3.3 Core and sidewall grid setup in the non-feature region for
a single feature.
3.4 A demonstration for the non-feature region split by differ-
ent grids.
3.5 Trim grid setup in the non-feature region for a single feature.
3.6 A test case and its decomposition result.
4.1 The difference between the design rules and mask rules.
4.2 Slack insertion to detect the minimum number of conflict-
ing constraints.
4.3 Different decomposability situation of similar patterns.
4.4 The definition of distance between two features.
4.5 The relationship between distance and the option of pat-
tern assignment, when feature boundaries are defined by
sidewalls.
4.6 The relationship between distance and the option of pat-
tern assignment, when feature boundaries can be defined
by trim mask.
4.7 A non-decomposable layout and its conflicting graph.
4.8 The hot spot detection for NOR2_X1 cell and the decom-
position result after fixing.
5.1 The process flow of the SAQP.
5.2 The demonstration of the SAQP process to generate 2D
patterns.
5.3 Three types of feature separations and their distance definitions.
5.4 The feasibility of different distance from 0 to 1.
5.5 Illustration of NonSingleRule.
5.6 Illustration of SidewallNoBranchRule.
5.7 Illustration of SidewallAsideRule.
5.8 Illustration of SameSideRule.
5.9 A demonstration of conflicting graph construction.
5.10 The process of A-B-C color assignment. With the conflict-
ing graph in (a), A and BC colors are firstly assigned in
(b); B and C colors are then assigned on the nodes with
BC color in (c).
5.11 The definition of connected component set (CCS).
5.12 Feature region assignment on an SAQP-unfriendly layout
(a) and an SAQP-friendly layout (b). The layout in (a)
is SADP-friendly, but this layout can only guarantee no
conflict in the A-BC 2-coloring assignment; the layout will
have other types of geometry rule violations and A-B-C col-
oring conflicts.
5.13 Several examples of SAQP-unfriendly patterns and their
violations.
5.14 Several examples of SAQP-friendly patterns and their fea-
ture region assignments.
6.1 An SEM image to show how line-end gaps affect line width
roughness [31]. Clear wavy shapes can be found when there
is a gap nearby.
6.2 An example to show how gap patterns affect printability
and process windows.
6.3 Test patterns. Each test site is labeled.
6.4 Process windows comparison between sites a and b.
6.5 Process windows comparison among sites b, c, d and e.
6.6 Process windows comparison among sites e, f and g.
6.7 Example of effective gap and critical gap.
6.8 An example of an AOI21 gate for the layout improvement
targeting on printability of poly/MG and metal 1.
6.9 Illustration of the greedy algorithm.
6.10 (a) and (b) are the two cases of the active line-end A. Note
that in (c), although the line-end A is adjacent to a real
wire, it is not an active line-end.
6.11 Two types of critical gaps caused by line-end A. Both
critical gaps would harm the printing of the line-end B.
By fixing the line-end A, as shown in (c), (d) and (e),
critical gap can be removed.
6.12 Three cases on the initialization.
6.13 The extension and dummy insertion without considering
the critical gaps.
6.14 The extension and dummy insertion with considering the
critical gaps.
6.15 Process window comparison on 115 nm pitch process.
6.16 Process window comparison on 75 nm pitch process.
7.1 Demonstration of how the extension wire will impact the
capacitance in the circuit.
7.2 The demonstration of how the extension wire will impact
the resistance in the circuit.
7.3 All the potential candidates for a wire with extension limit
of 2.
7.4 Adjacent tracks have different modifications with different
costs. E is one unit of extension cost, R is one unit of
extension cost, and C is one unit of critical gap cost. Here,
E  R C.
7.5 A graph corresponding to the layout with M tracks. The
shortest path between S and T gives the optimal solution
of the wire-end extension.
7.6 The comparison results with two different approaches in
this chapter.
7.7 The variation trend of EPE along with increasing extension
limit.
7.8 The variation trend of normalized delay along with increas-
ing extension limit.
7.9 The variation trend of normalized power along with in-
creasing extension limit.
8.1 Illustration of 1-D design. From dense lines (a), we need
to trim away unwanted patterns (b). Then after etching
away the patterns covered by the cut polygons, the final
circuit design will be formed.
8.2 Cut for the dense line has low printing requirement. Al-
though the normal edge placement error (EPE) measure-
ment is high in (b), the impact on the final wire is very
limited.
8.3 Cut mask simplification process. (a), (c) and (e) are the
pre-simplification circuit patterns, cut patterns and post-
OPC cut mask, respectively. (b), (d) and (f) are the cor-
responding post-simplification version. By simplification,
the number of cut polygon edges is reduced from 38 to 20,
and the post-OPC polygon edge is reduced from 176 to 96.
8.4 A polygon example. Vertical edges are labeled with num-
bers, and horizontal edges are labeled with hi. S0 and S1
are the starting point for the sub-edge arrays G1 and G2,
respectively.
8.5 The complete CSP graph of polygon P in Fig. 8.4. The
graph is made up of two subgraphs of G1 and G2 and
connecting them to the source and target. Any path from
S to T will represent a modified polygon.
8.6 Build subgraph for an SEA. Each edge from node h0 is
assigned weight as shown in the right box.
8.7 The path comparison in CSP graph. From the polygon
given in Fig. 8.4 and the CSP graph in Fig. 8.5, two paths
A and B and the corresponding polygons are given, which
demonstrates that different path will generate different poly-
gons and path in CSP graph need to be evaluated for a
better simplified cut polygon.
8.8 Building the CSP graph with multiple polygons. (a) shows
the individual polygons and (b) shows the corresponding
CSP graph for each polygon. By linking the CSP graph
together, the complete graph is shown in (c).
8.9 Cut polygon edge reduction vs. extended wire length.
8.10 Cut polygon edge reduction vs. post-OPC mask edge number.
8.11 The comparison of layouts. A small portion of M1 layer
in a large layout is shown to illustrate how k will impact
circuit. The cases of k = kmax, k = kcenter and k = kmin are
provided. (a), (b) and (c) show the intended cut polygons
and (d), (e) and (f) show the simulated real cut polygons.
9.1 A typical structure of EUV system [43].
9.2 Mask defect on the simulated aerial image [51].
9.3 EUV mask: the blank and layout.
9.4 The cross-section of the defect region. Note that the defect
size is exaggerated to better show the Gaussian scheme of
the surface.
9.5 Defect shift to mitigate the impact of defect.
9.6 The 8 possible orientations of the layout on the EUV blank.
The feasibility of the orientation depends on the process
and design requirement.
9.7 Layout shift movement on the blank. Due to the freedom
of the shift location, features A, B and C only need to
consider the impact of defect 3, 1 and 2, respectively.
9.8 Definition of prohibited rectangle (PR).
9.9 Definition of prohibited shift rectangle (PSR).
9.10 Overlapping PSR. The number represents the number of
overlapping PSRs in that region.
9.11 Use tile to reduce the calculation complexity.
9.12 Layout demo for experiments. In (a), 6 edges (red) are
affected by defects; in (b), 1 edge (red) is affected by
defects. Note that the size of the defect represents the
FWHM of each defect, which is not the impact range of
the defect.
9.13 Defect number vs. defect impact.
9.14 Defect size vs. defect impact.
10.1 Rotation helps defect mitigation.
10.2 Mask preparation and fiducial generation flow. The high-
lighted third step is the key step for both alignment meth-
ods and is the focus of this chapter as well.
10.3 The relationship between the rotation and shift.
10.4 One example of bounding octahedron, which defines the
solution space of shift and rotation.
10.5 Definition of prohibited rectangles (yellow) introduced by
the defect (red). The number of prohibited rectangles is de-
termined by the impact region and boundary number; the
size of prohibited rectangles is determined by the defect
size. The feature regions (green) other than the prohibited
rectangle will be the allowable region for covering-only re-
quirement.
10.6 Definition of prohibited relocation movement of the center
of a defect.
10.7 One example of the prohibited relocation cube.
10.8 The defect moveable region demonstration. The outside
defects have much larger movable regions than the inside
ones. Only the prohibited rectangles covered by the cropped
region need to be considered. The size of prohibited rect-
angle is exaggerated for better illustration.
11.1 Rectangle decomposition. The impact of R((x1, y1) (x2, y2))
on the central pixel can be calculated by looking up the im-
pacts of the four shaped rectangles together.
11.2 Rectangle extending outside the lookup table is truncated.
11.3 Partition the image into tiles: one tile per threading block.
11.4 Plot of tile size vs. runtime for PPT const tex.
11.5 The input layout and output aerial image of small 1.
LIST OF ABBREVIATIONS
ALU Arithmetic Logic Unit
AR Allowable Rectangle
ARC Allowable Relocation Cube
BO Bounding Octahedron
CCE Critical Case Elimination
CCS Connected Component Set
CD Critical Dimension
CNF Conjunctive Normal Form
CPU Central Processing Unit
CSP Constrained Shortest Path
CUDA Compute Unified Device Architecture
DFM Design for Manufacturability
DTCO Design-Technology Co-Optimization
DPT Double Patterning Technique
DRC Design Rule Check
DSA Directed Self Assembly
E-Beam Electron Beam
EPE Edge Placement Error
EUV Extreme Ultraviolet
FFT Fast Fourier Transform
FPGA Field-Programmable Gate Array
FWHM Full Width at Half Maximum
GPU Graphics Processing Unit
HP Half Pitch
IC Integrated Circuit
ILP Integer Linear Programming
LELE Litho-Etch-Litho-Etch
LFLE Litho-Freeze-Litho-Etch
LHS Left-Hand Side
Max-CO Maximum Cube Overlapping Problem
MFD Manufacturing for Design
Min-RO Minimum Rectangle Overlapping Problem
ML Multi-Layer
NA Numerical Aperture
NGL Next Generation Lithography
NP Nondeterministic Polynomial
OAI O-Axis Illumination
OLB Overlay Length Bound
OPC Optical Proximity Correction
PPT Pixel Per Thread
PR Prohibited Rectangle
PRC Prohibited Relocation Cube
PSR Prohibited Shift Rectangle
PWOPC Process Windows Optical Proximity Correction
RDR Restricted Design Rule
RET Resolution Enhancement Technology
RPB Rectangle Per Block
SADP Self-Aligned Double Patterning
SAQP Self-Aligned Quadruple Patterning
SAT Boolean Satisfiability
SEA Sub-Edge Array
SEM Scanning Electron Microscope
SID Sidewall Is Dielectric
SIMD Single-Instruction Multiple-Data
SMO Source Mask Optimization
SP Starting Point
SPM Shortest Path Method
VLSI Very Large Scale Integrated Circuit
PART I
INTRODUCTION
CHAPTER 1
INTRODUCTION
As lithography continues to be the backbone of the integrated circuit (IC)
industry, many new techniques in the context of "design-technology co-
optimization (DTCO)" have been developed to improve yield, manufactura-
bility, design performance, time to market, cost-effectiveness, etc. Those
efforts come not only from the manufacturing process perspective, but also
from the perspective of manufacturing-aware design optimization. Mean-
while, manufacturers also have to take the designer's requirements into
consideration and try to optimize the process and improve the design en-
ablement process. Therefore, the concept of DTCO is extremely important:
it uses modeling, patterning and automation to solve those joint
challenges and optimize the trade-off between design and process. Facing
the bottleneck of the conventional 193 nm lithography process and the challenges in
next generation lithography, any contribution in DTCO could be significant.
1.1 Lithography System
A typical lithography system is shown in Fig. 1.1, which is usually made
up of four basic components: light source, projection lens, mask and wafer.
In this technique, a light image of the desired pattern, transmitted through
a mask, is reduced in size and precisely focused onto a resist-coated wafer
using a system of projection lenses. As the basic setup of the most recent
decade, an ArF 193 nm excimer laser is used as the light source, and the complex
2D rectilinear layout patterns on the mask are transferred through an
extremely complex combination of projection lenses onto the wafer with a single
exposure.
[Figure 1.1: Schematic diagram of the conventional lithography projection
system. Labeled elements: light source, mask, projection lens, medium, wafer,
and the maximum angle θmax.]

In this optical illumination system, the minimum resolution $W_{min}$ is
represented by the following equation:

$$W_{min} = k_1 \frac{\lambda}{NA} \tag{1.1}$$
where  is the wavelength of the illumination light, NA is the numerical
aperture, and k1 is process factor. In order to achieve a large NA above 1,
instead of air, water is used as the medium between the last lens to wafer
(called 193 nm immersion lithography) and NA can be increased to 1.35.
With this 193 nm immersion lithography system and optimized o-axis il-
lumination (OAI) light source, the optical illumination system with single
print has been extended up to 45 nm or even 32 nm technology. However,
facing the challenges to print patterns with critical pitch less than 80 nm and
maximum critical dimension (CD) variation less than 2 nm, the conventional
lithography is no longer capable and a new lithography method is called for
3
help.
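As a rough numerical illustration of eq. (1.1), the following sketch evaluates the minimum resolution for the 193 nm wavelength and NA = 1.35 quoted above. The process factors tried here are illustrative assumptions (0.25 is the commonly cited theoretical lower bound for a single exposure), and treating the pitch as twice the resolution assumes equal lines and spaces; neither is taken from this chapter.

```python
def min_resolution_nm(k1: float, wavelength_nm: float, na: float) -> float:
    """Eq. (1.1): W_min = k1 * lambda / NA, with lengths in nanometres."""
    return k1 * wavelength_nm / na


WAVELENGTH_NM = 193.0  # ArF excimer laser, from the text
NA_IMMERSION = 1.35    # water immersion, from the text

# Illustrative process factors (assumed, not from the text).
for k1 in (0.40, 0.30, 0.25):
    w = min_resolution_nm(k1, WAVELENGTH_NM, NA_IMMERSION)
    # Pitch = 2 * W_min assumes equal line and space widths.
    print(f"k1={k1:.2f}: W_min={w:.1f} nm, pitch={2 * w:.1f} nm")
```

Under these assumptions, even the limiting case k1 = 0.25 gives a resolution of about 35.7 nm, i.e. a pitch of roughly 71 nm, which suggests why pitches below 80 nm leave very little margin for single exposure.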
1.2 Next Generation Lithography
Next generation lithography, just as its name implies, comprises novel
lithography techniques other than the conventional ArF 193 nm single exposure
on complex 2D rectilinear layout patterns. The innovations actually come
from all aspects in the lithography system, including the optimization from
both design and process. In other words, since in the next generation
lithography, design and manufacturing are coupled more closely than ever
before, co-optimization including design for manufacturability (DFM) and
manufacturing for design (MFD) is necessary. The major efforts in industry
can be classified into the following sub-areas:
• New patterning technique: to find a cost-effective way to print patterns
with multiple exposures (e.g. optical proximity correction (OPC) and double
patterning technique (DPT)).
• New design methodology: to change the design pattern to further utilize
the properties of the lithography system for a better printability or
performance (e.g. 1-D regular design, restricted design rules (RDRs)).
• New illumination system: to find new energy sources to replace the
193 nm ArF excimer laser (e.g. extreme ultraviolet (EUV) and electron
beam (E-Beam)).
• New simulation and verification approach: to utilize the most recent
computational resource for better and faster simulation and verification
(e.g. parallel programming and data mining).
1.3 Overview of this Dissertation
In this dissertation, we present our research results on the above four
sub-areas: self-aligned double/quadruple patterning (SADP/SAQP) as a new
patterning technique [1–4], 1-D style design optimization as a new design
methodology [5–8], EUV defect mitigation as a new illumination system
study [9, 10], and GPU-based aerial image simulation as a new simulation and
verification approach [11].
In Chapter 2, we present the basic self-aligned double patterning (SADP)
decomposition algorithm. Although SADP is the critical technology to solve
the lithography difficulties in sub-32 nm 2D design, none of the previous
works could decompose a layout with reasonable overlay and perform a
decomposability check, which is essential for the final implementation of the
SADP process. In Chapter 2, by formulating the problem as a SAT problem, we
can solve the above two problems optimally. This is the first work ever
published with a detailed algorithm to perform the SADP decomposition. We can
efficiently check whether a layout is decomposable. For a decomposable
layout, our algorithm guarantees to find a decomposition solution with
reasonable overlay reduction requirement. With small changes to the clauses
in the SAT formula, we can address the decomposition problem for both the
positive tone process and the negative tone process. Experimental results
validate our method, and decomposition results for Nangate Open Cell Library
and larger test cases are also provided with competitive run times.
Although SADP lithography is a promising technology which can reduce
the overlay and print 2D features for sub-32 nm process, how to decompose
a layout to minimize the overlay is still an open problem. In Chapter 3, we
present an algorithm that can optimally solve the SADP decomposition problem
with overlay minimization. For a decomposable layout, our algorithm
guarantees to find a decomposition solution that minimizes overlay.
Experimental results validate our method, and decomposition results for
Nangate Open Cell Library and larger test cases are also provided.
A design-technology co-optimization step, non-decomposable pattern (hot
spot) detection, is necessary to achieve an SADP-friendly design. In
Chapter 4, targeting the hot spot detection difficulties in SADP process, we
first extend our ILP-based SADP decomposition algorithm in Chapter 3 to an
ILP-based hot spot detection method without any preconditions on the design.
Then, with some simple common requirements in 2D random layout, we further
provide a graph-based hot spot detection algorithm. Using the Nangate
standard cell library, our experiment validates the hot spot detection
process and demonstrates that an SADP-friendly design style is necessary for
the upcoming 14 nm technology node.
Following SADP, self-aligned quadruple patterning (SAQP) lithography is
one of the major techniques for the future process requirement after the
16 nm/14 nm technology node. In Chapter 5, based on the existing knowledge of
current 193 nm lithography and the process flow of SAQP, we present an early
study on the definition of SAQP-friendly layout. With the exploration of the
feasible feature regions and possible combinations of adjacent features, we
define several simple but important geometry rules to help define the
SAQP-friendliness. We also introduce a conflicting graph algorithm to
generate the feature region assignment for SAQP decomposition. This is the
first study on SAQP design-friendliness. Our experimental results validate
our SAQP-friendly layout definition, and basic circuit building blocks in the
low level metal layer are analyzed.
In Chapter 6, we introduce our new 1-D style design optimization for
better process windows. When VLSI technology scales down to sub-40 nm
process node, systematic variation introduced by the lithography is a
persistent challenge to the manufacturability. The limitation of the
resolution enhancement technologies (RETs) forces people to adopt a regular
cell design methodology. In Chapter 6, targeted on 1-D cell design, we use
simulation data to analyze the relationship between the line-end gap
distribution and printability. Based on the gap distribution preferences, an
optimal algorithm is provided to efficiently extend the line ends and insert
dummies, which will significantly improve the gap distribution and help
printability. Experimental results on 45 nm and 32 nm processes show that
significant improvement can be obtained on edge placement error (EPE).
Note that poly/gate redistribution techniques in Chapter 6 require layout
modication of the original layout and thus will impact circuit performance
and power consumption. Such potentially undesirable impacts on perfor-
mance and power have to be carefully considered. In Chapter 7, we present
performance-driven gate redistribution algorithms which consider bounds on
line-end extension. Experimental results demonstrate the feasibility of our
algorithms, and lithography simulation and circuit analysis show the trend
of the trade-off between printability, delay, and power.
As CD continues to shrink, two manufacturing steps (dense line print and
tip-tip gap cut) will be necessary for 1-D style pattern generation. As the
dense line printing is highly regular, the major complexity would be on the
cut mask, adding to the sky-rocketing manufacturing cost. In Chapter 8, we
present a cut mask cost reduction method with circuit performance
consideration for 1-D style design. This is the first research to focus on
the mask cost reduction issue for SADP from a design perspective. We simplify
the polygons on the cut mask by formulating the problem as a constrained
shortest path problem and solving it by dynamic programming. Experimental
results show that with a set of layouts in 28 nm technology, we can largely
reduce the complexity of cut polygons, with little impact on performance.
Blank defect mitigation is a critical step for extreme ultraviolet (EUV)
lithography. Targeting the defective blank, a layout relocation method, which
shifts and rotates the whole layout pattern to a proper position, has been
proved to be an effective way to reduce defect impact. In Chapter 9, we
present a novel algorithm that can optimally solve this pattern relocation
problem. Experimental results validate our method. The relocation results
with full scale layouts generated from Nangate Open Cell Library have shown
great advantages with competitive runtimes compared to the existing
commercial tool.
Although pattern relocation is proved to be an effective method for defect
mitigation, when the defect number increases, pattern shift in only X-Y
directions becomes far from enough, requiring the reticle holder to rotate by
a small angle to provide a third exploring dimension. This non-trivial
extension from 2D to 3D exploration requires efficient runtime as well as
enough accuracy to handle different defect sizes and locations on the
different features. In Chapter 10, we present the first work with a detailed
algorithm to find the optimal shift and rotation for layout patterns on
blanks. Compared to the straightforward method, which is to check every pair
of defect and feature at every possible relocation position, our proposed
algorithm can significantly reduce the runtime complexity to scale linearly
with the size of the full solution space. The experimental results validate
our method and show a largely increased success rate of defect mitigation by
shift and rotation.
Aerial image simulation is a fundamental problem for modern VLSI design.
It requires vast numerical computation. The recent advancement of general
purpose GPU computing provides an excellent opportunity to parallelize the
aerial image simulation and achieve great speedup. In Chapter 11, we present
and discuss two GPU-based aerial image simulation algorithms. We show
through experiments that the fastest algorithm we propose can achieve 50X
to 60X speedup over the CPU-based serial algorithm. The error of our
approach is shown to be insignificant.
PART II
NEW PATTERNING
TECHNIQUE
CHAPTER 2
SADP DECOMPOSITION
2.1 Introduction
Because of the existing printing difficulties in the current IC industry,
double patterning lithography (DPL) becomes the most critical technology for
the current sub-32 nm nodes in the 193 nm micro-lithography process [12]. The
conventional DPL, such as litho-etch-litho-etch (LELE) or
litho-freeze-litho-etch (LFLE), which splits the intended patterns onto two
exposures, has drawn great attention to the layout decomposition problem
[13–16]. However, due to the inevitable overlay problem in the process, this
conventional DPL technology has tremendous difficulty controlling the
significant process variations, and these technologies have thus been bogged
down in their progress toward real implementation [17, 18].
Self-aligned double patterning (SADP) is a successful alternative double
patterning lithography which has the intrinsic capability to solve the overlay
problems. In the SADP process, since most of the critical features are
self-aligned to the cores generated from one single exposure, it has the
intrinsic capability to avoid overlay between two exposures [17]. Meanwhile,
by the self-aligned techniques, the intended pitch can be doubled by properly
setting up the core width and pitch, and therefore the printability for the
sub-32 nm process can be greatly improved. Different works have shown the
progress of SADP in the sub-30 nm processes. In [19], SADP has shown the
capability for real implementation in 22 nm logic cells with 1-D gridded
design rules; however, without the flexibility to handle the random 2-D
patterns which are commonly seen in logic design, it was still far from a real
application. Dai [20] and Sun [21] demonstrated the initial idea of
implementing SADP on 2D patterns with the core and trim mask design. Ma [22]
presented an important work which shows the analysis of the two main 2-D SADP
processes: positive tone process and negative tone process. From the
comparison in Ma's work [22], the positive tone process is suggested to be a
better process, due to the large freedom of design and controllability over
overlay. More importantly, in that work an open problem is introduced to
minimize the overlay problem for the positive tone process. With 2- or 3-mask
processes, Chang [23] demonstrated a Cadence SADP decomposition tool without
any algorithm, runtime or explicit cost function released.
[Figure 2.1: An example to show SADP decomposition result with positive
tone process [22]. Legend: target, core mask, sidewall, trim mask. Panels:
(a) target layout; (b) core mask and the surrounding sidewall; (c) trim mask
is applied to cover the target.]
Normally, to print one single layer, SADP process needs two types of masks:
core and trim. By using the core mask, we can generate cores on the wafer,
and therefore we can further deposit sidewall beside the core boundaries
using the self-aligned method. Then the trim mask can be further used to
trim away the unneeded patterns. Unlike conventional double patterning
lithography, in which the patterns on the two masks are directly derived
from the intended layer using a two-coloring formulation, the core and trim
masks in SADP need a novel decomposition strategy to generate. As shown
in Fig. 2.1, the final core pattern can be the original layout feature, a
modified feature, or even an assistant feature which is not shown in the
original layout. The trim patterns can also have their boundaries either
located on the feature boundaries or overlapped with sidewalls. The problem
of generating the core and trim mask from a 2D designed layout is called SADP
decomposition. For a given layout, different decomposition methods will
provide different core and trim masks; thus, an optimal decomposition
strategy is always necessary for better printability and less process
variation. Note that in the rest of the chapter, if no specific notation is
given, we will reuse the color definition provided by Fig. 2.1.
The SADP decomposition process has several major challenges, different
from the conventional two-coloring DPL decomposition. First, compared to
the two-coloring DPL decomposition where the solutions are directly derived
from the designed pattern by stitching and color assignment, in SADP
process, the intended patterns are not the ones on the mask, and both core and
trim masks differ greatly from the designed layout. Second, because of the
indirect mapping from the designed patterns to the mask patterns, the rules
generated from the mask process cannot be used for layout design. Third,
although the SADP process has largely improved the tolerance on overlay
compared to other DPLs, as the trim is needed for feature generation in 2D
SADP process, overlay still exists. Therefore, how to avoid overlay in the
sensitive regions or sensitive pattern edges is a big issue.
In this chapter, targeting the above difficulties in SADP process, we
propose an efficient algorithm for 2D positive-tone SADP process layout
decomposition using a SAT formulation. We can efficiently check whether a
layout is decomposable. For a decomposable layout, our algorithm guarantees to
find a decomposition solution with reasonable overlay reduction requirement.
With simple changes to the clauses in the SAT formulation, we can address the
decomposition problem for both the positive tone process and the negative
tone process. This is the first ever published work to perform the SADP
decomposition.
The rest of the chapter is organized as follows. In Section 2.2, the two
types of SADP process are reviewed and our goal and constraints are ana-
lyzed. Based on the positive tone process, Section 2.3 illustrates the SAT
formulation based on the constraints. In Section 2.4, we further analyze the
SAT formulation for the negative tone SADP process. In Section 2.5, we
carry out experiments to test our algorithm on an existing 2D cell library.
Finally, Section 2.6 concludes this chapter.
2.2 Overview of 2D SADP Process
Normally, SADP can be categorized into two types: positive tone process
and negative tone process [22]. In this section, we will review the process
steps, introduce the process constraints, and analyze the overlay issue for
both SADP processes. Note that, without loss of generality, we define the
trench to be our intended circuit feature during our following analysis.
2.2.1 Positive Tone Process and Negative Tone Process
Positive tone process and negative tone process are two major types of SADP
process. The positive tone process shows a large amount of flexibility on the
feature width and space, which fits the requirement of complex 2D logic
circuit manufacturing. Fig. 2.2(a) shows the primary steps in positive tone
process. In step 1, sidewalls with the uniform width $W_s$ are automatically
aligned to core patterns. In step 2, cores are removed, and the non-sidewall
regions covered by trim patterns get exposed. In step 3, the exposed regions
will be etched to be the required feature. In the end, the final features will
be the non-sidewall regions covered by trim mask.
[Figure 2.2: Positive tone process and negative tone process [22]. Panels:
(a) spacer positive tone process; (b) spacer negative tone process. Each
panel shows steps 1 to 3, with the core, trim pattern, resist, etch layer,
wafer, resulting feature and overlay labeled.]
Negative tone process is the other type of SADP where the sidewall region
defines the trench. Although this process technology suffers from design
inflexibility, the more accurate process control provides an alternative
technology for manufacturers. Fig. 2.2(b) shows the negative tone process.
Similar to the positive tone process, the negative tone process also needs to
form core and self-aligned sidewall in step 1 and apply the second exposure
with trim mask in step 2. In step 3, the trench will be formed where the
sidewall is but not covered by the trim pattern.
2.2.2 Process Rules
Because mask rules are always necessary for manufacturing, the SADP
decomposition should be performed according to those constraints. In this
chapter, we will take the following geometry constraints into consideration:
• MinCoreWidth, denoted as $W_{c,min}$, defines the minimum pattern
width of the core layer.
• MinTrimWidth, denoted as $W_{t,min}$, defines the minimum pattern
width of the trim layer.
• MinCoreSpace, denoted as $S_{c,min}$, defines the minimum space
between the patterns in the core layer.
• MinTrimSpace, denoted as $S_{t,min}$, defines the minimum space
between the patterns in the trim layer.
• MinCoreCrnr2Crnr, denoted as $S_{c2c,min}$, defines the minimum space
between two corners of the patterns in the core layer.
• MinTrimCrnr2Crnr, denoted as $S_{t2t,min}$, defines the minimum space
between two corners of the patterns in the trim layer.
• SidewallWidth, denoted as $W_s$, defines the uniform width of the
sidewalls surrounding the core patterns.
• TrimOverlay, denoted as $W_o$, defines the maximum allowable trim
overlay.
In real implementations, there might be many other types of rules,
which we cannot fully cover in this chapter. However, we can always
categorize those rules into four types: distance rule, pattern rule,
adjacency rule and process rule. In our rule list, MinCoreWidth,
MinTrimWidth, MinCoreSpace and MinTrimSpace are mainly distance rules;
SidewallWidth is mainly an adjacency rule; MinCoreCrnr2Crnr and
MinTrimCrnr2Crnr are mainly a combination of pattern rule and distance rule;
TrimOverlay is a process rule. For a complex mask rule set, we can decompose a
complex rule into a combination of several rules and adopt a formulation
strategy similar to that introduced in the following section.
2.2.3 Overlay Impact
Fig. 2.2 also shows two types of boundary formations, which are called
sidewall-defined boundary and trim-defined boundary. Both boundary types
exist in positive tone process and negative tone process. Since overlay
happens on the second exposure when trim mask is applied and sidewall is
self-aligned to the core from the first exposure process, trim-defined edge
will suffer from overlay and sidewall-defined boundary does not have overlay.
In the decomposition process, we should reduce the number of the trim-defined
edges as much as possible and avoid the trim-defined edge for the critical
features.
However, generating the sidewall around the intended feature region is not
by itself guaranteed to be overlay-free. In order to eliminate overlay, the
trim will also need to overlap with the sidewalls for at least $W_o$ distance.
Note that in the current process condition, this $W_o$ is usually considerably
large (about 1/4 to 1/2 of the minimum feature width depending on the process
condition), and thus this overlay plays a significant role in the SADP
decomposition process. As in the example shown in Fig. 2.3 with positive tone
process, the feature region must be covered by trim patterns, but if the
overlap between the trim and sidewall is not large enough compared to the
overlay, the final printed feature will still probably suffer from the
overlay. So, in the positive tone process, the overlap between the sidewall
and trim must be at least $W_o$ distance; in the negative tone process, the
trim pattern must be $W_o$ distance away from any feature.
[Figure 2.3: Overlap needs to be larger than overlay distance to help avoid
feature overlay. Labeled elements: overlap, overlay.]

However, in a normal 2D design, it might not be possible to have a
decomposition solution without any overlay. It is necessary to locate the
critical boundaries first and assign high priority during the decomposition
process to avoid overlay. Besides the critical circuit features, such as the
poly edges to define channel length and the diffusion region boundaries to
define channel width, it is usually necessary to limit the overlay on the long
edges as well. As shown in Fig. 2.4, if edges a and b, with lengths $l_a$ and
$l_b$, are both defined by the trim mask, then the affected areas will be
$W_o \times l_a$ and $W_o \times l_b$, respectively. When $l_a > l_b$, edge a
is more sensitive to overlay than edge b. In this chapter, we define the
concept of overlay length bound (OLB). When an edge length is larger than
or equal to OLB, we call this edge a sensitive edge, which means it is
intolerant to overlay, and the corresponding boundary should only be defined
by sidewall. A single polygon could have sensitive edges and insensitive edges
at the same time. Therefore, after the sensitive edges are set, we should
always try to avoid the case when the sensitive edges are defined by trim mask
in our decomposition method.
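The sensitive-edge classification above can be sketched in a few lines. This is only an illustration: the `Edge` record, the `critical` flag (standing in for circuit-critical boundaries such as poly edges) and all numeric values are hypothetical and not taken from this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical record for one polygon edge."""
    name: str
    length_nm: float
    critical: bool = False  # e.g. a poly edge that defines channel length


def is_sensitive(edge: Edge, olb_nm: float) -> bool:
    """Sensitive if marked critical or at least OLB long."""
    return edge.critical or edge.length_nm >= olb_nm


def affected_area(edge: Edge, overlay_nm: float) -> float:
    """Area exposed to overlay if the edge is trim-defined: W_o * l."""
    return overlay_nm * edge.length_nm


# Illustrative values only (assumed).
OLB_NM, W_O_NM = 100.0, 10.0
edges = [Edge("a", 200.0), Edge("b", 40.0), Edge("gate", 30.0, critical=True)]
sensitive = [e.name for e in edges if is_sensitive(e, OLB_NM)]
print(sensitive, affected_area(edges[0], W_O_NM))
```

Lowering `OLB_NM` toward 0 marks every edge as sensitive, while a very large value leaves only the explicitly critical edges, which mirrors the trade-off between overlay tolerance and decomposability.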
2.3 Problem Formulation
[Figure 2.4: Impact of overlay on different edges. The figure marks the
overlay on edges A and B and the resulting affected area.]

In this section, we will take the positive tone process as an example to
formulate the layout decomposition problem into a Boolean satisfiability (SAT)
problem and call the standard SAT solver to generate the final masks. Usually,
it is not necessary to worry about the conflict of patterns far apart;
therefore, we can apply a cluster algorithm to divide the layout into several
small groups and then apply the layout decomposition on each of them. We
can also first look into a cell and then combine them when the decomposition
is finished for each cell. This will reduce the load of our task each time.
2.3.1 SAT Formulation and Feature Generation
SAT is the problem of determining if the variables of a given Boolean formula
can be assigned in such a way as to make the formula evaluate to TRUE. In
complexity theory, SAT is a decision problem, whose instance is a Boolean
expression written using only AND, OR, NOT, variables, and parentheses.
A literal is either a variable or the negation of a variable, which we denote
as $x$ or $\neg x$. A clause is a disjunction of literals. The SAT problem is
a typical NP-complete problem, which is commonly seen in the EDA area. Usually
the SAT formula is expressed in a conjunctive normal form (CNF).
The primary property of the positive tone process, as illustrated in
Section 2.2, is that the final feature is generated in the non-sidewall
region, which is covered by trim mask in the meantime. If we use a Boolean
variable Sidewall = TRUE to denote that a sidewall exists in one location and
use another Boolean variable Trim = TRUE to denote that this location is
covered by trim mask, we can describe the above feature generation process
in a geometric Boolean equation as:

$$\text{Feature} = \neg \text{Sidewall} \wedge \text{Trim} \tag{2.1}$$
This Boolean equation suggests a possible way to describe the SADP process
in a single mathematical expression. That is, given a layout, for the
location where the design features exist, eq. 2.1 should be TRUE; for the
location where the design features do not exist, eq. 2.1 should be FALSE. In
order to describe the above relation in detail, we can further map the whole
layout into meshed grids, as shown in Fig. 2.5. Then, for the tiles where a
feature exists, eq. 2.1 is TRUE; for the tiles where no feature exists,
eq. 2.1 is FALSE.
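[Figure 2.5: Variable definition on each tile. Each region i has three
variables: core $C_i$, sidewall $S_i$ and trim $T_i$. For a featured tile
$F_i$: $S_i$ = FALSE and $T_i$ = TRUE; for a non-featured tile $F_j$:
$S_j \vee \neg T_j$ = TRUE.]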
Furthermore, we can assign three Boolean variables $C_i$, $S_i$ and $T_i$ to
represent whether the tile i is core, sidewall or trim mask, respectively. In
consequence, the featured and non-featured tiles can be expressed by the
following two types of Boolean clauses as:

$$\text{for any featured region } i:\quad F_i = \neg S_i \wedge T_i = \text{TRUE} \tag{2.2}$$

$$\text{for any non-featured region } j:\quad \neg F_j = S_j \vee \neg T_j = \text{TRUE} \tag{2.3}$$

Thus, the final target of the SADP decomposition problem will be equivalent
to finding a possible combination of all the Boolean variables $C_i$, $S_i$
and $T_i$ to make all the clauses simultaneously TRUE. In this way, we can
formulate the 2D layout decomposition problem into a SAT problem. Note that, in
order to correctly characterize the overlay and trim for feature boundary, the
meshed grid size should be at most $\min\{W_o,\ W_{c,min}\ W_{f,min}\}$. Here,
$W_{f,min}$ is the minimum feature width.
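A minimal sketch of how the feature clauses of eq. 2.2 and eq. 2.3 could be emitted for a row of tiles is given below. The DIMACS-style integer encoding (positive integer for a variable, negative for its negation) and the helper names `var_id` and `feature_clauses` are illustrative assumptions, not the implementation used in this work.

```python
from typing import List

Clause = List[int]  # DIMACS-style: positive = variable, negative = negation


def var_id(kind: str, tile: int, n_tiles: int) -> int:
    """Map (kind, tile) to a 1-based variable number; kind is 'C', 'S' or 'T'."""
    return "CST".index(kind) * n_tiles + tile + 1


def feature_clauses(is_feature: List[bool]) -> List[Clause]:
    """Emit eq. (2.2) for featured tiles and eq. (2.3) for the others."""
    n = len(is_feature)
    clauses: List[Clause] = []
    for i, feat in enumerate(is_feature):
        s, t = var_id("S", i, n), var_id("T", i, n)
        if feat:
            clauses += [[-s], [t]]   # not S_i, and T_i (two unit clauses)
        else:
            clauses.append([s, -t])  # S_j or not T_j
    return clauses


clauses = feature_clauses([False, True, False])
print(clauses)
```

Note that eq. 2.2 is a conjunction, so in CNF it splits into two unit clauses, whereas eq. 2.3 is already a single clause.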
2.3.2 Geometry Constraint Setup
After the denition of the featured and non-featured clauses in eq. 2.2 and
eq. 2.3, in this subsection, we will continue to formulate the geometry con-
straints into Boolean expressions.
Sidewall Adjacency Rule
[Figure 2.6: Sidewall adjacency rule setup based on the sidewall width from
the process setting. The figure marks a sidewall tile $S_i$ and the core tiles
$C_j, C_{j+1}, C_{j+2}, \ldots, C_{j+m}$ within the sidewall width.]
The first type of the geometry rule is the sidewall adjacency rule, as shown
in Fig. 2.6. Note that the sidewall is always along with the core patterns
and the self-aligned sidewall has uniform width everywhere, except in those
regions where the sidewalls merge with each other. Therefore, for a single
tile i, $S_i$ is TRUE if and only if $C_i$ is FALSE, and among the core
variables $\{C_j, C_{j+1}, \ldots, C_{j+m}\}$ within the sidewall width
distance, there is at least one variable equal to TRUE. The above sentence can
be expressed in the following Boolean equation:

$$S_i \Leftrightarrow \neg C_i \wedge (C_j \vee C_{j+1} \vee \cdots \vee C_{j+m}) \tag{2.4}$$
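We can further adopt De Morgan's laws and transform the above expression
into the following clauses in CNF format, which can be directly used in our
SAT solution.

$$\begin{aligned}
& S_i \vee C_i \vee \neg C_j \\
& \cdots \\
& S_i \vee C_i \vee \neg C_{j+m} \\
& \neg S_i \vee \neg C_i \\
& \neg S_i \vee C_j \vee C_{j+1} \vee \cdots \vee C_{j+m}
\end{aligned} \tag{2.5}$$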
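The equivalence between eq. 2.4 and the clauses of eq. 2.5 can be checked exhaustively for a small neighborhood. The sketch below is illustrative only: the function names and the integer literal encoding are assumptions, and three neighboring core variables are chosen arbitrarily.

```python
from itertools import product
from typing import Dict, List


def sidewall_clauses(s: int, c: int, neighbors: List[int]) -> List[List[int]]:
    """CNF of S <-> (not C) and (N_1 or ... or N_m), following eq. (2.5)."""
    clauses = [[s, c, -n] for n in neighbors]  # (not C and N_k) -> S
    clauses.append([-s, -c])                   # S -> not C
    clauses.append([-s] + list(neighbors))     # S -> some neighbor is core
    return clauses


def satisfied(clauses: List[List[int]], assignment: Dict[int, bool]) -> bool:
    """True if every clause has at least one literal that holds."""
    return all(any(assignment[abs(l)] == (l > 0) for l in cl) for cl in clauses)


# Compare the CNF with eq. (2.4) over all 2^5 assignments.
S, C, N = 1, 2, [3, 4, 5]
cnf = sidewall_clauses(S, C, N)
equivalent = all(
    satisfied(cnf, dict(zip([S, C] + N, bits)))
    == (bits[0] == ((not bits[1]) and any(bits[2:])))
    for bits in product([False, True], repeat=5)
)
print(len(cnf), equivalent)
```

For m + 1 neighboring core tiles the rule costs m + 3 clauses per tile, so the clause count grows linearly with the sidewall width measured in tiles.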
The rest of the geometry constraints, such as minimum corner-corner dis-
tance rule, minimum space rule and minimum width rule, are all typical
distance rules that can be found in the core and trim mask preparation.
Since the properties of the core and trim mask are identical for these types
of rules, we will take core mask as examples in the following subsections.
Minimum Corner-Corner Distance Rule
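[Figure 2.7: Minimum corner-corner rule setup based on the MinCrnr2Crnr
from the process setting. The figure marks a core corner tile $C_i$ with its
neighbors $C_{i+1}$ and $C_{i+2}$, and the tiles
$C_k, C_{k+1}, C_{k+2}, \ldots, C_{k+q}$ within MinCoreCrnr2Crnr distance of
the corner.]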
Minimum corner-corner distance rule defines the minimum distance between
one corner and other patterns. As demonstrated in Fig. 2.7 for the
core mask, within MinCoreCrnr2Crnr distance in the upper right direction
from the corner of the core, all the core variables
$\{C_k, C_{k+1}, \ldots, C_{k+q}\}$ should be FALSE, which means in this
region there is no core pattern. For the tile i, it is a core corner if and
only if $C_i$ is TRUE, and $C_{i+1}$ and $C_{i+2}$ are both FALSE. We can
transform the above sentence into a Boolean equation as:

$$(C_i \wedge \neg C_{i+1} \wedge \neg C_{i+2}) \rightarrow (\neg C_k \wedge \neg C_{k+1} \wedge \cdots \wedge \neg C_{k+q}) \tag{2.6}$$

Using the CNF format to describe the above expression, we can finally
have the following clauses for the minimum core corner-corner distance rule.

$$\begin{aligned}
& \neg C_i \vee C_{i+1} \vee C_{i+2} \vee \neg C_k \\
& \cdots \\
& \neg C_i \vee C_{i+1} \vee C_{i+2} \vee \neg C_{k+q}
\end{aligned} \tag{2.7}$$
Minimum Space Rule
[Figure 2.8: Minimum core space rule setup based on the MinCoreSpace
from the process setting. The figure marks the core boundary tile $C_i$ and
the tiles $C_{i+1}, C_{i+2}, \ldots, C_{i+p}$ between two core patterns.]
Minimum space rule denes the minimum distance between one edge and
another. As demonstrated in Fig. 2.8 for the core mask, within MinCoreSpace
distance in the right boundary of the core pattern, all the core variables
fCi+1; Ci+2; :::; Ci+pg should be FALSE, which means in this region there
is no core pattern. For the tile i, it is a core boundary if and only if Ci
21
is TRUE and Ci+1 is FALSE. We can transform the above sentence into a
Boolean equation as:
(Ci ^ :Ci+1)! (:Ci+1 ^ :Ci+2 ^ : : : ^ :Ci+p) (2.8)
Using the CNF format to describe the above expression, we can nally
have the following clauses for the minimum core space rule.
:Ci _ Ci+1 _ :Ci+2
: : : (2.9)
:Ci _ Ci+1 _ :Ci+p
Minimum Width Rule
[Figure 2.9: Minimum core width rule setup based on the MinCoreWidth
from the process setting. The figure marks the tile $C_i$ just outside the
core and the tiles $C_{i+1}, C_{i+2}, \ldots, C_{i+n}$ inside it.]
Minimum width rule denes the minimum width of one feature. As demon-
strated in Fig. 2.9 for the core mask, within MinCoreWidth distance in the
left boundary of the core pattern, all the core variables fCi+1; Ci+2; :::; Ci+ng
should be TRUE, which means in this region there is no core pattern. For
the tile i, it is just outside the core boundary if and only if Ci is FALSE and
Ci+1 is TRUE. We can transform the above sentence into a Boolean equation
as:
22
(:Ci ^ Ci+1)! (Ci+1 ^ Ci+2 ^ : : : ^ Ci+n) (2.10)
Using the CNF format to describe the above expression, we can nally
have the following clauses for the minimum core width rule.
Ci _ :Ci+1 _ Ci+2
: : : (2.11)
Ci _ :Ci+1 _ Ci+n
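The space and width rules have the same clause shape, which the following sketch makes explicit for a single horizontal row of tiles. It is a simplified illustration: the function names, the one-dimensional setting, and the choice to skip clauses that would reach past the end of the row are assumptions, and `p` and `n` are the rule distances expressed in tiles.

```python
from typing import List

Clause = List[int]


def min_space_clauses(core: List[int], p: int) -> List[Clause]:
    """Eq. (2.9): a right core edge at tile i keeps tiles i+2..i+p core-free."""
    out: List[Clause] = []
    for i in range(len(core) - 1):
        for d in range(2, p + 1):
            if i + d < len(core):  # assumed: no constraint past the row end
                out.append([-core[i], core[i + 1], -core[i + d]])
    return out


def min_width_clauses(core: List[int], n: int) -> List[Clause]:
    """Eq. (2.11): a left core edge at tile i+1 forces tiles i+2..i+n to core."""
    out: List[Clause] = []
    for i in range(len(core) - 1):
        for d in range(2, n + 1):
            if i + d < len(core):
                out.append([core[i], -core[i + 1], core[i + d]])
    return out


def satisfied(clauses: List[Clause], row: List[int]) -> bool:
    """row[k] is the 0/1 value of variable k + 1."""
    return all(any(row[abs(l) - 1] == (l > 0) for l in cl) for cl in clauses)


core_vars = list(range(1, 9))  # one row of 8 tiles, variables 1..8
rules = min_space_clauses(core_vars, 3) + min_width_clauses(core_vars, 2)
ok = satisfied(rules, [0, 1, 1, 0, 0, 0, 1, 1])   # gap of 3 tiles
bad = satisfied(rules, [0, 1, 1, 0, 1, 1, 0, 0])  # gap of 1 tile
print(ok, bad)
```

With `p = 3` the first row passes because the two cores are three tiles apart, while the second row violates the space rule because only one empty tile separates them.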
2.3.3 Overlay Reduction Constraints
Overlay is a big issue for the SADP decomposition. Since it is inevitable
to use trim to define feature boundaries in 2D design, we need to limit the
impact of the overlay. According to the analysis in Section 2.2, we have to
apply circuit analysis and set OLB to set up the sensitive edges for overlay
reduction.
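[Figure 2.10: Extended work to perform overlay reduction. The figure shows a
feature with a sensitive edge and the trim tiles
$T_i, T_{i+1}, \ldots, T_{i+r}$ in the region within TrimOverlay distance
outside that edge.]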
Because in the positive tone process, the trim will always cover the feature
region, in order to force the sensitive edge to be defined by sidewall rather
than by trim, we have to set the trim to cover the region outside the boundary
for at least TrimOverlay distance. As shown in Fig. 2.10, for the highlighted
sensitive edge, all the trim variables in the shaded region
$\{T_i, T_{i+1}, \ldots, T_{i+r}\}$ should be TRUE. Thus, by adding this
constraint, we can further transform the SAT formula to gain a result
considering the sensitive edges.
However, one big drawback of this overlay reduction work is that this
extra constraint for overlay will largely shrink the solution space, and
therefore reduce the possibility of the layout decomposability. To solve this
problem, we can adjust the OLB to trade off the overlay tolerance and the
decomposability. By setting different OLB, the different overlay tolerance can
be set: when OLB is 0, overlay is absolutely forbidden; when OLB is infinite,
overlay is ignored. As a compromise, we can always set some critical
boundaries to be sensitive edges as we wish.
So far, we have described the formulations for the most common geometry
rules used in the SADP process. Note that all the geometry constraints can
be expressed as combinations of pattern, boundary, distance and adjacency
relations. Therefore, although we do not give a formal proof, we expect this
SAT formulation to handle most geometry constraints. With these logic
relations between the variables, the Boolean satisfiability problem is
sufficient for the SADP decomposition process. If the overall SAT formula
evaluates to TRUE for some combination of the variables, we obtain the final
core and mask patterns from the values of the variables. In this way, an
SADP decomposition is finally achieved.
2.4 Negative Tone Process Decomposition
Although the positive tone process has shown great advantages in design
flexibility, the negative tone process still exists as an alternative in
current industrial laboratories. To handle the negative tone process, we can
formulate the process requirements into a SAT problem in a way similar to
the positive tone process. As Fig. 2.2(b) shows for the negative tone
process, without loss of generality, suppose the final required features are
the trenches on the etch layer. Similar to the geometric Boolean equation
eq. 2.1 for the positive tone process, the formation of a feature in the
negative tone process can be written as the following geometric Boolean
expression:
$$\text{Feature} = \text{Sidewall} \wedge \neg\text{Trim} \tag{2.12}$$
Following the same idea of gridding the layout and assigning variables to
tiles, similar to eq. 2.2 and eq. 2.3, we obtain two new expressions for the
negative tone process:
$$\text{for any featured region } i:\quad F_i = S_i \wedge \neg T_i = \text{TRUE} \tag{2.13}$$
$$\text{for any non-featured region } j:\quad \neg F_j = \neg S_j \vee T_j = \text{TRUE} \tag{2.14}$$
Because the geometry constraints are not affected by the tone of the
process, the clauses represented by eq. 2.5, eq. 2.7, eq. 2.9 and eq. 2.11
are valid for both tone processes. Therefore, we can continue to use those
clauses in our SAT formula.
In the negative tone process, overlay exists where the trim boundary is
close to the sidewall region. So, for a given sensitive edge, as shown in
Fig. 2.10, we should set all the trim variables in the shaded region
$\{T_i, T_{i+1}, \dots, T_{i+r}\}$ to FALSE (note that for the positive tone
process, these variables should be TRUE). Following the same strategy
described in Section 2.3, we can carry out the overlay reduction and trade
off the overlay reduction against decomposability.
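The negative-tone clauses can be generated mechanically from eq. 2.13 and eq. 2.14 together with the sensitive-edge rule. The sketch below is an illustration under assumptions: the function names, the literal encoding (kind, tile, polarity) and the way the tile sets are passed in are hypothetical and are not taken from our implementation.

```python
def negative_tone_clauses(feature_tiles, non_feature_tiles, sensitive_trim_tiles=()):
    """Build CNF clauses over literals (kind, tile, polarity), kind in {"S", "T"}."""
    clauses = []
    for i in feature_tiles:
        # Eq. 2.13: S_i AND (NOT T_i) must hold, which gives two unit clauses.
        clauses.append([("S", i, True)])
        clauses.append([("T", i, False)])
    for j in non_feature_tiles:
        # Eq. 2.14: (NOT S_j) OR T_j.
        clauses.append([("S", j, False), ("T", j, True)])
    for k in sensitive_trim_tiles:
        # Negative-tone overlay reduction: trim is forced to FALSE near the edge.
        clauses.append([("T", k, False)])
    return clauses

def satisfied(clauses, sidewall, trim):
    """Evaluate the clauses for per-tile Boolean lists `sidewall` and `trim`."""
    value = {"S": sidewall, "T": trim}
    return all(
        any(value[kind][tile] == polarity for kind, tile, polarity in clause)
        for clause in clauses
    )
```

For a feature tile 0 and a non-feature tile 1, `satisfied(negative_tone_clauses([0], [1]), [True, False], [False, False])` is true, whereas leaving sidewall uncovered on tile 1 (`[True, True]` with no trim) violates eq. 2.14.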
2.5 Experimental Results
Since the positive tone process and the negative tone process are very
similar in the SAT formulation, in this section we mainly use positive tone
decomposition to check the feasibility and efficiency of our proposed
algorithm. To validate the algorithm, we implemented it in C++ on a
workstation running the RedHat 9.0 operating system on an Intel Xeon CPU at
3.00 GHz with 4 GB memory. With a proper variable reduction strategy, we
implement a set of demo rules for the 28 nm design from our industrial
partner: SidewallWidth, MinCoreWidth, MinCoreSpace, MinCoreCrnr2Crnr,
TrimOverlay, MinTrimWidth, MinTrimSpace, and MinTrimCrnr2Crnr.
To test the decomposability of current industrial data, we chose the 45 nm
Nangate Open Cell Library [24] and scaled the cells down to fit into our
test frame. Since the Nangate cell library is not designed for 2D SADP, most
of the cells are not decomposable after scaling. Fig. 2.11 shows a
decomposition result for the metal 1 layer in the AND4_X1 gate. In this
experiment, we set the OLB to $3W_{f,\min}$. The original design of metal 1
is shown in Fig. 2.11(a). After the decomposition, the core patterns (blue)
and the self-aligned sidewall (yellow) are generated in Fig. 2.11(b).
Fig. 2.11(c) finally shows the trim (red) and the sidewall. In Fig. 2.11(c),
it can be seen that the black non-sidewall region covered by the trim is
exactly the same as the original patterns in Fig. 2.11(a), and most of the
trim mask boundaries are covered by sidewall, which means the overlay is
largely eliminated. The solution also contains some small, useless jogs and
patterns in the core and trim masks, which means further work is needed to
reduce the mask complexity.
Figure 2.11: The decomposition result for AND4_X1 [24]. (a) Original design;
(b) core and sidewall; (c) sidewall and trim. (Legend: feature, core,
sidewall, trim.)
Table 2.1 reports results for the remaining decomposable cells in the
Nangate library and for two larger test cases that we created. The runtime
of the SAT solver seems uncorrelated with the problem size, as represented by
the number of features and the layout area. Looking more closely at the
cells, we find that the runtime is mostly correlated with the layout
structure: the more features are correlated with each other, the longer the
runtime. We also tune the grid size to check its impact on the solution and
the runtime. The experimental results show that with finer grids, the
solution provided by the SAT solver is not guaranteed to improve but takes
much longer. This is because the finer grid makes the problem bigger and
more difficult to solve.
Table 2.1: Results of Decomposable Standard Cells in the Nangate Cell
Library [24]
Name        Feature #   Layout area (µm²)   Runtime (s)    Runtime (s)
                                            grid = 10 nm   grid = 5 nm
AND2_X1     6           0.48                2.91           5.56
AND3_X1     7           0.58                6.55           24.23
AND4_X1     8           0.66                3.85           26.98
BUF_X16     5           0.66                15.55          47
BUF_X1      5           0.42                1.52           3.58
CLKBUF_X1   5           0.37                2.41           4.64
INV_X1      4           0.34                1.43           1.77
NAND2_X1    5           0.42                6.58           11.29
OR2_X1      6           0.51                2.05           7.32
OR4_X4      8           0.66                2.09           3.42
TEST        17          1.9                 0.59           0.93
TEST30      510         67.96               57.51          134.33
2.6 Conclusions
In this chapter, we have presented an SADP decomposition method targeting
both tone processes. We successfully formulate the overlay and the geometry
mask rules as a SAT problem. Experimental results validate our method, and
decomposition results for the Nangate open cell library and additional
industrial test cases are provided with reasonable runtimes. This is the
first published work on a real SADP decomposition algorithm with reasonable
overlay reduction, decomposability check and competitive solution speed.
Note that the decomposability check is also an important task that will be
commonly used to guide SADP-friendly circuit design. Since there is no
direct mapping from the designed layout to the final masks, the design rule
check (DRC) cannot directly guide the SADP decomposability check, so the DRC
and the decomposability check are difficult to perform and coordinate with
each other. In this work, as the SADP decomposition problem is formulated as
a SAT problem, the decomposability check is realized naturally. SAT is a
typical NP-complete problem, and the SAT solver always returns "Yes" or
"No". Once we have finished the SAT formulation, we receive the answer
"Yes" if there is at least one solution; if we receive "No", no feasible
solution exists. In this way, the decomposability check is easily performed.
The efficient decomposition algorithm provides a convenient bridge across the
gap between the design rules and the mask rules.
CHAPTER 3
OVERLAY MINIMIZATION FOR SADP
3.1 Introduction
In Chapter 2, we addressed the initial layout decomposition problem with a
limited overlay control strategy. In the 2D SADP process, due to the
difficulty of layout decomposition, some feature boundaries have to be
generated by trim, which introduces overlay. How to decompose a given layout
so as to minimize the overlay is still the major difficulty in the SADP
decomposition problem.
In this chapter, targeting the overlay minimization problem in the SADP
process, we propose an efficient algorithm for 2D positive-tone SADP layout
decomposition that minimizes the overlay and reduces the complexity of the
core and trim masks. We first inspect the correlation between the core, trim
and feature patterns. Then, we map the mask geometry constraints into integer
linear programming (ILP) constraints, and assign binary variables to
different component regions to formulate an ILP problem. Finally, we call an
ILP solver to solve the overall problem. The final result indicates the
validity of our proposed algorithm. We also test a commonly used standard
cell library and analyze it in the context of the SADP process.
The rest of the chapter is organized as follows. Section 3.2 presents the
ILP formulation based on different constraints and criteria. Section 3.3
gives a set of variable reduction methods that largely reduce the
computational load. In Section 3.4, we carry out experiments to test our
algorithm on an existing 2D cell library. Finally, Section 3.5 concludes
this chapter.
3.2 Layout Decomposition Problem Formulation
In this section, based on the properties and constraints of the positive
tone process, we formulate the layout decomposition problem with the aim of
minimizing the overlay. It is not necessary to consider patterns that are
far apart together; therefore, we can apply a clustering algorithm to divide
the layout into several small groups and then apply the layout decomposition
to each of them. We can also first decompose each cell and then combine the
cells when the decomposition of each cell is finished. This reduces the size
of the problem solved each time.
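The clustering step is not tied to a particular algorithm. As one possible illustration, the sketch below groups rectangular features with a union-find structure whenever their spacing is within an interaction distance. The rectangle format (x1, y1, x2, y2), the function names and the `interaction_distance` parameter are assumptions made for this example.

```python
from math import hypot

def rect_gap(a, b):
    """Euclidean gap between two axis-aligned rectangles (x1, y1, x2, y2)."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return hypot(dx, dy)

def cluster_features(rects, interaction_distance):
    """Group rectangle indices so that far-apart groups can be solved separately."""
    parent = list(range(len(rects)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            if rect_gap(rects[i], rects[j]) <= interaction_distance:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(rects)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group can then be decomposed on its own. The pairwise loop is quadratic in the number of features, so a spatial index would be preferable for full layouts.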
3.2.1 Feature Generation and ILP Formulation
In this chapter, for ease of illustration, we call a feature boundary
defined by trim an overlay boundary, and a feature boundary not defined by
trim a non-overlay boundary. Note that each manufacturing pattern has its
own geometry constraints, such as $W_{c,\min}$ for the minimum core width,
$W_{t,\min}$ for the minimum trim width, $S_{c,\min}$ for the minimum core
space, $S_{t,\min}$ for the minimum trim space and $W_o$ for the maximum trim
overlay. These mask constraints/rules are directly used to guide the SADP
decomposition, and thus indirectly impact the designed layout.
The trim mask setting is the key to reducing the overlay and generating the
final features. In order to minimize the overlay boundary, the trim needs to
overlap the sidewalls by at least $W_o$. In current sub-30 nm designs, this
$W_o$ is considerably large (about 1/2 or 1/3 of the minimum feature width).
A trim pattern will still introduce overlay if the trim boundary is too
close to the feature boundary, even if the feature boundary is guarded by
sidewall. Therefore, the overlapping distance between the trim and the
sidewall must be counted when the overlay is measured.
The primary property of the positive tone process, as illustrated in
Section 2.2, is that the final feature is generated in the non-sidewall
region that is covered by the trim mask. Expressed as a geometric Boolean
equation, this relationship can be written as:
$$\text{Feature} = \overline{\text{Sidewall}} \cdot \text{Trim} \tag{3.1}$$
Figure 3.1: Variable definition on each tile. (Region $i$ carries the
variables core $C_i$, sidewall $S_i$ and trim $T_i$; on a feature tile $F_i$:
$S_i = 0$, $T_i = 1$; on a non-feature tile $F_j$: $S_j + 1 - T_j > 0$.)
According to this equation, if we map the whole layout onto a meshed grid,
as shown in Fig. 3.1, then each tile $i$ has three binary variables $C_i$,
$S_i$ and $T_i$ that represent whether this tile is core, sidewall or trim
mask, respectively. In this way, we can formulate the 2D layout
decomposition problem as a 0-1 integer linear program (ILP). The rest of the
section focuses on how to formulate the ILP objective and constraints, using
the mask rules and intended features. Note that, in order to correctly
characterize the overlay and the trim cut for the feature boundary, the mesh
grid size should be at most $\min(W_o, W_{c,\min} - W_{f,\min})$. Here,
$W_{f,\min}$ is the minimum feature width.
3.2.2 Constraint Setup
The first set of constraints in our ILP formulation is the feature and
non-feature region constraint. In eq. 3.1, Feature is true if and only if
this location has trim mask and no sidewall. To express this relation with
the variables defined above, on each tile $i$ where a design feature exists,
we have the following constraints:
$$S_i = 0, \qquad T_i = 1 \tag{3.2}$$
and on each tile $j$ where design features do not exist, we have the
constraint:
$$S_j + 1 - T_j > 0 \tag{3.3}$$
In this way, we set up the basic relationship between the designed layout
features and the manufacturing patterns.
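As a quick sanity check, the four combinations of the binary values of sidewall and trim on one tile can be enumerated to confirm that the feature-tile and non-feature-tile constraints partition them exactly as the positive-tone relation requires. The helper names below are assumptions made for illustration.

```python
from itertools import product

def is_feature(s, t):
    # Positive tone: a tile prints as feature when trim is present and sidewall is absent.
    return t == 1 and s == 0

def feature_tile_ok(s, t):
    # Constraints on a feature tile: S_i = 0 and T_i = 1.
    return s == 0 and t == 1

def non_feature_tile_ok(s, t):
    # Constraint on a non-feature tile: S_j + 1 - T_j > 0.
    return s + 1 - t > 0

# Truth table over the four (S, T) combinations of one tile.
TRUTH_TABLE = {
    (s, t): (is_feature(s, t), feature_tile_ok(s, t), non_feature_tile_ok(s, t))
    for s, t in product((0, 1), repeat=2)
}
```

The only combination rejected by the non-feature constraint is $S = 0$, $T = 1$, which is exactly the combination that would print an unintended feature.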
Figure 3.2: Constraints setup based on geometry rules. (a) Sidewall adjacency
setup; (b) minimum corner-corner setup; (c) minimum space setup; (d) minimum
width setup.
The second set of constraints is the core and trim mask geometry
constraints, which have been introduced in Section 2.2. Taking the core mask
constraints as an example, Fig. 3.2 demonstrates how the sidewall adjacency
rule and the minimum width, minimum space and minimum corner-corner rules are
converted into ILP constraints.
Lemma 1 (Sidewall Adjacency Rule) As shown in Fig. 3.2(a), $S_i$ is 1 if and
only if $C_i$ is 0 and, among the core variables
$\{C_j, C_{j+1}, \dots, C_{j+m}\}$ within the sidewall width distance, at
least one variable is equal to 1. The corresponding ILP constraints are:
$$\begin{aligned}
S_i + C_i + 1 - C_j &\ge 1 \\
&\dots \\
S_i + C_i + 1 - C_{j+m} &\ge 1 \\
1 - S_i + 1 - C_i &\ge 1 \\
1 - S_i + \sum_{l=j}^{j+m} C_l &\ge 1
\end{aligned} \tag{3.4}$$
To prove Lemma 1, we can rewrite the relation between the binary variables
as the Boolean expression
$S_i = \overline{C_i} \cdot (C_j + \dots + C_{j+m})$ and then apply
De Morgan's law to transform the expression into CNF. Each clause in the CNF
expression can then be rewritten as one ILP constraint. The following lemmas
show the ILP conversion for the minimum corner-corner rule, minimum space
rule and minimum width rule.
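Lemma 1 can also be confirmed exhaustively for small neighbourhoods. The sketch below is a check under assumptions (function names are hypothetical): it evaluates the constraints of the lemma for every binary assignment and compares the result with the definition that $S_i$ is 1 exactly when $C_i$ is 0 and at least one neighbouring core variable is 1.

```python
from itertools import product

def lemma1_constraints_hold(s, c, neighbors):
    """Evaluate the ILP constraints of Lemma 1 for binary values."""
    per_neighbor = all(s + c + 1 - cl >= 1 for cl in neighbors)
    not_both = (1 - s) + (1 - c) >= 1
    needs_core = (1 - s) + sum(neighbors) >= 1
    return per_neighbor and not_both and needs_core

def sidewall_definition(c, neighbors):
    # S_i = 1 iff C_i = 0 and some core variable within the sidewall width is 1.
    return int(c == 0 and any(neighbors))

def lemma1_is_exact(m):
    """Check all assignments of C_i and the m + 1 neighbours C_j .. C_{j+m}."""
    for c, *neighbors in product((0, 1), repeat=m + 2):
        for s in (0, 1):
            feasible = lemma1_constraints_hold(s, c, neighbors)
            if feasible != (s == sidewall_definition(c, neighbors)):
                return False
    return True
```

For instance, `lemma1_is_exact(3)` covers all 32 assignments of $C_i$ and four neighbours and, for each, verifies that the only feasible value of $S_i$ is the one given by the definition.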
Lemma 2 (Minimum Corner-Corner Rule) As shown in Fig. 3.2(b), if $C_i$ is 1,
$C_{i+1}$ is 0 and $C_{i+2}$ is 0, then all variables
$\{C_k, C_{k+1}, \dots, C_{k+q}\}$ within the minimum corner-corner distance
should be 0. The corresponding ILP constraints are:
$$\begin{aligned}
1 - C_i + C_{i+1} + C_{i+2} + 1 - C_k &\ge 1 \\
&\dots \\
1 - C_i + C_{i+1} + C_{i+2} + 1 - C_{k+q} &\ge 1
\end{aligned} \tag{3.5}$$
Lemma 3 (Minimum Space Rule) As shown in Fig. 3.2(c), if $C_i$ is 1 and
$C_{i+1}$ is 0, then all variables $\{C_{i+2}, C_{i+3}, \dots, C_{i+p}\}$
within the minimum space distance should be 0. The corresponding ILP
constraints are:
$$\begin{aligned}
1 - C_i + C_{i+1} + 1 - C_{i+2} &\ge 1 \\
&\dots \\
1 - C_i + C_{i+1} + 1 - C_{i+p} &\ge 1
\end{aligned} \tag{3.6}$$
Lemma 4 (Minimum Width Rule) As shown in Fig. 3.2(d), if $C_i$ is 0 and
$C_{i+1}$ is 1, then all variables $\{C_{i+2}, C_{i+3}, \dots, C_{i+n}\}$
within the minimum width distance should be 1. The corresponding ILP
constraints are:
$$\begin{aligned}
C_i + 1 - C_{i+1} + C_{i+2} &\ge 1 \\
&\dots \\
C_i + 1 - C_{i+1} + C_{i+n} &\ge 1
\end{aligned} \tag{3.7}$$
Other kinds of geometry rules can be converted into ILP constraints in the
same way, as long as they can be expressed as logic relations. In this way,
we map the manufacturing rules into decomposition rules and use them to
guide the following min-cost SADP decomposition.
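The conversion used in the lemmas follows one pattern: in each clause, a positive literal contributes the variable $x$, a negated literal contributes $1 - x$, and the sum must be at least 1. The sketch below illustrates this pattern for the minimum space rule. The function names and the literal encoding are assumptions made for this example, and the constants are moved to the right-hand side for compactness.

```python
def clause_to_ilp(clause):
    """Convert a clause of (variable, is_positive) literals into (coeffs, rhs).

    The clause holds for binary x iff sum(coeffs[v] * x[v]) >= rhs.
    """
    coeffs = {}
    negated = 0
    for var, is_positive in clause:
        coeffs[var] = coeffs.get(var, 0) + (1 if is_positive else -1)
        negated += 0 if is_positive else 1
    # Each negated literal contributes a constant 1 that moves to the right-hand side.
    return coeffs, 1 - negated

def min_space_constraints(i, p):
    """Minimum space rule: C_i = 1 and C_{i+1} = 0 force C_{i+2} .. C_{i+p} to 0."""
    return [
        clause_to_ilp([(i, False), (i + 1, True), (i + k, False)])
        for k in range(2, p + 1)
    ]

def satisfied(constraints, x):
    """Check every linear constraint against the binary assignment x."""
    return all(
        sum(a * x[v] for v, a in coeffs.items()) >= rhs
        for coeffs, rhs in constraints
    )
```

With `min_space_constraints(0, 3)`, the assignment `[1, 0, 0, 0]` is accepted, while `[1, 0, 1, 0]` is rejected because a second core tile appears inside the minimum space.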
3.2.3 Objective for Overlay Minimization
After the feature and geometry constraints are constructed, we can build the
cost function for an optimal 2D SADP decomposition. As mentioned previously,
the most critical target for the layout decomposition is to minimize the
total overlay, or in other words, to maximize the total length of the
non-overlay boundaries.
By the definition of a non-overlay boundary, this kind of boundary should be
guarded by sidewalls, and the trim mask overlaps the sidewalls for at least
length $W_o$. By the constraint in eq. 3.3, for a tile $i$ that is in a
non-feature region but within distance $W_o$ of any feature, if the
objective function forces $T_i$ to be 1, the sidewall variable $S_i$ of the
same tile is automatically forced to 1. In this way, the feature is
automatically surrounded by sidewalls, and the boundary becomes a
non-overlay boundary. As a result, setting the objective function of our ILP
to maximize the summation of all the trim mask variables that are within
distance $W_o$ of any feature maximizes the overall non-overlay boundaries
and thus minimizes the overlay.
However, the above objective function alone does not generate simple
polygons. This is because the trim pattern can take any shape, as long as it
stays within the sidewall region, and useless core can appear in otherwise
empty space. To eliminate those complex patterns, or zigzag boundaries, and
make the design as tight as possible, one has to reduce the area of the trim
and core patterns. In this way, the trim sticks to the innermost boundary
and no useless geometry pattern is generated. Let $B$ be the summation of
all the trim variables within distance $W_o$, $A_T = \sum_{\text{all}} T_i$
the full trim pattern area and $A_C = \sum_{\text{all}} C_i$ the core
pattern area; then we have the following new cost function:
$$\min: \; -\alpha \cdot B + \beta \cdot A_T + \gamma \cdot A_C \tag{3.8}$$
where the weight $\alpha$ is much larger than $\beta$ and $\gamma$.
So far, we have finished the ILP formulation for an optimal 2D SADP
decomposition with eqs. 3.2-3.8. With different criteria and preferences,
such as emphasizing the overlay of the critical feature boundaries or
encouraging line-end cuts, we can reformulate the cost function and call any
efficient 0-1 ILP solver to handle the rest of the problem.
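To illustrate how the objective interacts with the feature constraints, the following toy example solves a one-dimensional instance by exhaustive search instead of an ILP solver. It is a sketch under several assumptions: the sidewall width and $W_o$ are both one tile, the minimum width, space and corner rules are ignored, and the weights (100 for the overlay term, 1 for each area term) are arbitrary values chosen so that the overlay term dominates.

```python
from itertools import product

def sidewalls(core):
    """One-tile sidewall: S_i = 1 iff C_i = 0 and a neighbouring tile is core."""
    n = len(core)
    return [
        int(core[i] == 0 and any(core[j] for j in (i - 1, i + 1) if 0 <= j < n))
        for i in range(n)
    ]

def solve_toy(feature, alpha=100, beta=1, gamma=1, w_o=1):
    """Return (cost, core, trim) minimizing -alpha*B + beta*A_T + gamma*A_C."""
    n = len(feature)
    # Non-feature tiles within w_o tiles of a feature tile count towards B.
    near = [
        feature[i] == 0
        and any(feature[j] for j in range(max(0, i - w_o), min(n, i + w_o + 1)))
        for i in range(n)
    ]
    best = None
    for core in product((0, 1), repeat=n):
        side = sidewalls(core)
        for trim in product((0, 1), repeat=n):
            # Feature tiles need S = 0 and T = 1; other tiles need S + 1 - T > 0.
            ok = all(
                (side[i] == 0 and trim[i] == 1) if feature[i]
                else (side[i] + 1 - trim[i] > 0)
                for i in range(n)
            )
            if not ok:
                continue
            b = sum(t for t, flag in zip(trim, near) if flag)
            cost = -alpha * b + beta * sum(trim) + gamma * sum(core)
            if best is None or cost < best[0]:
                best = (cost, core, trim)
    return best
```

For a single feature tile in the middle of five tiles, `solve_toy([0, 0, 1, 0, 0])` places the core on the feature tile, which creates sidewalls on both neighbours, and extends the trim by one tile on each side, so both boundaries are guarded by sidewall. The area terms prevent any additional core or trim from being added.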
3.3 Problem Reduction
So far, we have finished the problem formulation as an ILP. Supposing the
layout can be represented by an $m \times n$ matrix, the corresponding number
of variables is $3 \cdot m \cdot n$ and the number of ILP constraints is
$O(m \times n)$. However, it is not necessary to use so many variables and
constraints for a complete solution, and a problem reduction is suggested.
In the rest of this section, we focus on reducing the number of variables in
our ILP formulation to speed up the solution.
3.3.1 Feature Region Variable Reduction
Eq. 3.2 clearly shows that within a single feature, the trim variable is
always 1 and the sidewall variable is always 0. Since any discontinuity of
the core pattern introduces new sidewalls, the core variable takes a single
value throughout the feature region and cannot be assigned different values
in one solution. Therefore, we can combine the core variables in one
continuous feature; the total number of variables, including trim, core and
sidewall, in the feature region is then equal to the number of features. We
therefore only need to study the variable reduction in the non-feature
region.
3.3.2 Core and Sidewall Variable Reduction
For the core mask, the non-feature region can be split into core, sidewall
and space, as shown in Fig. 3.3. Because non-overlay feature boundaries are
preferred, regions 3 and 4 of width $W_s$ are needed to place sidewalls
beside the feature. These sidewalls can be generated either by the core in
the feature region (case 1) or by the cores in the adjacent regions (case
2); in case 2, extra core regions (regions 2 and 5 of width $W_{c,\min}$) are
needed to support the solution. Because those extra cores in case 2 also
generate sidewalls on their other side, regions 1 and 6 of width $W_s$ are
needed as well. The rest of the space is either empty or left for setting
up other feature regions.
Figure 3.3: Core and sidewall grid setup in the non-feature region for a
single feature. (Regions 1 to 6 surround the feature, with widths $W_s$,
$W_{c,\min}$ and $W_s$ on each side; case 1: sidewalls generated by the core
in the feature region; case 2: sidewalls generated by extra cores in regions
2 and 5.)
Using this grid setting, each feature has its own defined region. Note that
the minimum-width extra core regions (regions 2 and 5 in Fig. 3.3) are
enough for one single feature, and the core regions from different features
can touch each other to allow a larger extra core, which can be shared by
several features. If the regions from different features overlap or cross
each other, the grid lines that create the region boundaries behave like a
Hanan grid: they are extended to create finer regions until they touch other
features or layout boundaries, as shown in Fig. 3.4. On each region, we need
to set individual variables to represent different region combinations. Note
that since sidewalls have the same width everywhere, unless two sidewalls
collapse, we do not need to assign sidewall variables to regions where both
dimensions are larger than $W_s$. Furthermore, we only need to assign
variables to the regions that are within distance $2W_s + W_{c,\min}$ of any
feature (shaded region in Fig. 3.4), and leave the rest of the rectangles
unassigned. Additionally, for a feature whose width is less than
$W_{c,\min}$, we have to add extra grids to make the feature region large
enough to fit a minimum-width core.
Figure 3.4: A demonstration for the non-feature region split by different
grids. (Legend: bound grid, core grid, sidewall grid, feature grid.)
3.3.3 Trim Variable Reduction
Since a pattern is automatically printed wherever the trim pattern and the
non-sidewall region overlap, in order to avoid printing unintended features,
it is strictly required that trim variables be 1 only in the feature and the
surrounding sidewall regions. So, to reduce the number of trim variables, we
only need to assign trim variables to the regions within distance $W_s$ of
any feature, and leave other regions unassigned. Since we are trying to
minimize the area of the trim patterns, each feature boundary is
automatically assigned to be either an overlay boundary or a non-overlay
boundary with surrounding sidewalls within distance $W_o$. Thus, as long as
the constraints on trim pattern width and space are met, only one extra
grid, at distance $W_o$ from the feature boundary, is needed to set up the
trim boundary, as shown in Fig. 3.5.
Figure 3.5: Trim grid setup in the non-feature region for a single feature.
(a) When $d < 2W_o + S_{t,\min}$; (b) when $d \ge 2W_o + S_{t,\min}$.
Using the above variable reduction strategies, we can combine most of the
uniform regions and reduce the number of variables without losing solution
quality. The runtime of the solution then becomes a function of the number
of features rather than of the layout area, which largely benefits our
performance.
3.4 Experimental Results
3.4.1 Experimental Setup and Validation
To validate our proposed algorithm, we implemented it in C++ on a
workstation running the RedHat 9.0 operating system on an Intel Xeon CPU at
3.00 GHz with 4 GB memory. In our program, we call the GUROBI 4.0 ILP
Solver [25] to generate the ILP solution.
We implement a set of demo rules for the 28 nm design from our industrial
partner: SidewallWidth, MinCoreWidth, MinCoreSpace, MinCoreCrnr2Crnr,
TrimOverlay, MinTrimWidth, MinTrimSpace, and MinTrimCrnr2Crnr.
Fig. 3.6(a) shows a test case for our SADP decomposition process. This
layout, which has an area of 1.88 µm², includes some of the most difficult
components in 2D SADP decomposition, such as L, U, T, , E, uniform dense
lines, and line-end cuts. By setting the line spaces equal to 30 nm, which
is the minimum allowable line space, we successfully decompose this input
layout while meeting all our input process constraints. The whole ILP
solution, shown in Fig. 3.6(b), takes about 0.97 s to complete.
Figure 3.6: A test case and its decomposition result. (a) Feature; (b) core,
sidewall and trim.
3.4.2 Real Cell Analysis
To test the decomposability of current industrial data, we also chose the
45 nm Nangate Open Cell Library [24] and scaled the cells down to fit into
our test frame. Since the Nangate cell library is not designed for 2D SADP,
most of the cells are not decomposable after scaling. Table 3.1 reports
results for the remaining decomposable cells in the Nangate library and for
two larger test cases that we created. The runtime of the ILP solver seems
uncorrelated with the problem size, as represented by the number of features
and the layout area. Looking more closely at the cells, we find that the
runtime is mostly correlated with the layout structure: the more features
are correlated with each other, the longer the runtime. Finding a feasible
solution is usually very quick, but finding the optimal solution usually
takes much longer.
We also test a large layout with an area of 241.28 µm². We partition the
layout into small parts and run our program on each part for a quicker
inspection. Excluding the time spent on this partitioning, the total runtime
is 933.61 s. We also solve the problem using a smaller grid. However, a
finer grid is not guaranteed to provide a better solution: the experimental
results show that with finer grids, the solution provided by the ILP solver
is not guaranteed to improve. This is because the finer grid makes the
problem bigger and more difficult to solve optimally. So far, we have not
tried using parallelism in our ILP solver, but a faster and better solution
is expected with parallelism.
Table 3.1: Results of Decomposable Standard Cells in Nangate Standard
Cell Library
Name        Feature #   Layout area (µm²)   Overlay (nm)   Runtime (s)
INV_X1      4           0.34                260            8.13
BUF_X1      5           0.42                490            2.15
NAND2_X1    5           0.42                650            35.87
AND2_X1     6           0.48                680            43.53
OR2_X1      6           0.51                930            2.62
OR4_X4      8           0.66                1270           3.1
BUF_X16     5           0.66                1460           39.38
TEST1       17          1.9                 1130           1.07
TEST2       204         26.16               13400          70.42
3.5 Conclusions
In this chapter, we have presented SADP decomposition with overlay
minimization. We successfully formulate the overlay and the geometry mask
rules as an ILP problem. Based on the formulation, we perform layout
decomposability checks. Experimental results validate our method, and
decomposition results for the Nangate open cell library and additional
industrial test cases are provided with reasonable runtimes. This work
advances the current art of overlay minimization with more detailed
open-layout test cases and competitive solution speed.
CHAPTER 4
HOTSPOT DETECTION FOR SADP
4.1 Introduction
In Chapter 3, we addressed the initial layout decomposition with overlay
minimization. However, before the SADP process can be accepted by layout
designers, there are still two major difficulties to solve:
1. Bridging the design rules with the mask rules. In the SADP process, as
shown in Fig. 4.1, the circuit pattern is no longer what is directly printed
on the wafer. Thus, the mask rules lack a direct correlation with the design
rules.
2. Finding an efficient way to detect non-decomposable locations where the
pattern combination is invalid.
Both of these difficulties call for a novel tool to handle the
decomposability check and the detection of non-decomposable locations.
Figure 4.1: The difference between the design rules and mask rules. (Panels:
design rules, core rules, trim rules.)
4.1.1 Process Constraints
The major contributing factor to hot spots in SADP-unfriendly designs is
mask rule conflict. One purpose of our work in this chapter is to match the
mask rules to the design rules, and thereby help the designer construct
SADP-friendly layouts. As shown in Fig. 4.1, a valid set of design rules is
very different from the mask rules, which must take the pattern combination
into consideration. As introduced in the previous chapters, our valid mask
rules can be classified into four types: distance rule, pattern rule,
adjacency rule and process rule. A complex mask rule can be decomposed into
a combination of several rules and formulated with a strategy similar to
that introduced in the following section. In this way, the feature
generation and geometry constraints are formulated as an ILP, and the whole
problem can be optimally solved by calling commercial software.
In this chapter, we keep the common rules that have been shown in
Section 2.2. Here, MinCoreWidth, MinTrimWidth, MinCoreSpace and MinTrimSpace
are mainly distance rules; SidewallWidth is mainly an adjacency rule;
MinCoreCrnr2Crnr and MinTrimCrnr2Crnr are mainly combinations of a pattern
rule and a distance rule; TrimOverlay is a process rule.
In this chapter, targeting the above difficulties to make the SADP process
design-friendly, we propose two algorithms for the 2D positive-tone SADP
process.
• Building on the ILP-based layout decomposition algorithm, slack insertion
is adopted as an ILP-based algorithm for fundamental hot spot detection
without preconditions on the design patterns.
• With a graph-based formulation and a certain precondition on the geometry
patterns, we can also use 2-coloring assignment to perform hot spot
detection.
Both algorithms are validated in the experiments.
The rest of the chapter is organized as follows. Section 4.2 presents the
ILP-based algorithm. Section 4.3 introduces the graph-based algorithm. In
Section 4.4, we carry out experiments to test our algorithms on an existing
2D cell library. Finally, Section 4.5 concludes this chapter.
4.2 ILP-Based Algorithm
In the previous ILP-based SADP decomposition algorithm, one of the basic
ideas is to translate the input mask rules and geometry constraints into ILP
constraints, as shown by eqs. 3.2-3.7 in Chapter 3.
Following the ILP formulation, once the layout is found to be
non-decomposable, i.e., no feasible solution is found, we need to find the
locations where the geometry constraints are violated or the features that
cause problems. In the context of the ILP formulation, hot spot detection
is equivalent to detecting the conflicting constraints. To perform a
meaningful hot spot detection, we can report several types of violation
information that help designers debug their layout.
Since each constraint in the ILP formulation has its own physical meaning
that can be localized to a specific spot or feature, finding the minimum set
of conflicting constraints is an efficient way to report hot spots for the
design. In other words, we perform hot spot detection by reporting the
minimum set of violated constraints. In our ILP formulation, we do this by
adding an extra binary slack variable $s_i$ to each constraint $i$ and
adding an extra term $\delta \cdot \sum s_i$ to the objective function, in
which $\delta$ is a very large number. The modification is shown in
Fig. 4.2. Because, in our formulation, the right-hand side (RHS) of each
constraint is 1 and the left-hand side (LHS) is always greater than or equal
to 0, if a constraint is violated, its LHS must be 0. Setting the slack
variable to 1 then makes the constraint valid. Thus, the updated ILP problem
is always feasible. Minimizing the summation of the slack variables is
equivalent to finding the minimum number of conflicting constraints. Any
slack variable that is finally set to 1 marks the corresponding violated
constraint. We can then backtrack to the source of this constraint to locate
the hot spot. The layout is still decomposed, and the decomposition
solution is shown together with the minimum number of detected hot spots.
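The effect of slack insertion can be reproduced on the small four-constraint example of Fig. 4.2. The sketch below uses exhaustive search rather than an ILP solver; the objective weights $a$, $b$, $c$ are set to 1 and the slack weight to 1000, which are assumptions made only for this illustration, as are the function names.

```python
from itertools import product

# Each constraint is (coefficients over x1..x3, constant) and reads
# constant + coefficients . x >= 1, as in the example of Fig. 4.2.
CONSTRAINTS = [
    ((-1, 1, 0), 1),   # 1 - x1 + x2 >= 1
    ((1, 1, 0), 0),    # x1 + x2 >= 1
    ((0, -1, -1), 2),  # 1 - x2 + 1 - x3 >= 1
    ((0, 0, 1), 0),    # x3 >= 1
]

def lhs(constraint, x):
    coeffs, const = constraint
    return const + sum(a * v for a, v in zip(coeffs, x))

def is_feasible():
    """True if some binary assignment satisfies all original constraints."""
    return any(
        all(lhs(c, x) >= 1 for c in CONSTRAINTS)
        for x in product((0, 1), repeat=3)
    )

def min_conflicts(weights=(1, 1, 1), delta=1000):
    """Minimize weights . x + delta * sum(s) over x, with slack on every constraint."""
    best = None
    for x in product((0, 1), repeat=3):
        # For a fixed x, the cheapest slack sets s_i = 1 exactly on violated constraints.
        slack = tuple(int(lhs(c, x) < 1) for c in CONSTRAINTS)
        cost = sum(w * v for w, v in zip(weights, x)) + delta * sum(slack)
        if best is None or cost < best[0]:
            best = (cost, x, slack)
    return best
```

Here `is_feasible()` returns `False`, since the four constraints cannot hold together, while `min_conflicts()` returns a solution in which exactly one slack variable is 1. The index of that slack variable identifies the constraint, and hence the layout location, to report as a hot spot. Several assignments reach the same minimum, so which constraint is reported depends on tie-breaking.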
4.3 Graph-Based Algorithm
So far, we have finished the problem formulation based on ILP. However,
ILP is intrinsically an NP-hard problem, which requires a large amount of
computing resources and lacks computing efficiency. In the full-layout hot
spot detection process, where millions of patterns are involved, a more
efficient method for hot spot detection is needed. In this section, we
focus on a graph-based algorithm that performs hot spot detection with high
efficiency under certain preconditions.
Original ILP:
    min:  a·x1 + b·x2 + c·x3
    s.t.: 1 - x1 + x2 ≥ 1
          x1 + x2 ≥ 1
          1 - x2 + 1 - x3 ≥ 1
          x3 ≥ 1
          xi are binary
ILP with slack variables:
    min:  a·x1 + b·x2 + c·x3 + δ·Σsi
    s.t.: 1 - x1 + x2 + s1 ≥ 1
          x1 + x2 + s2 ≥ 1
          1 - x2 + 1 - x3 + s3 ≥ 1
          x3 + s4 ≥ 1
          xi, si are binary
Figure 4.2: Slack insertion to detect the minimum number of conflicting
constraints.
4.3.1 Precondition of Graph-Based Algorithm
As mentioned above, the SADP decomposition process is very different from
the LELE decomposition process. The major difference is the condition that
the patterns in the layout are single-printable. The term single-printable
means that if a single polygon is not adjacent to any other polygon, it can
be printed well with only a single exposure. This criterion may limit
design flexibility, but for random 2D logic circuits, where overlay is
sensitive, the condition can be easily fulfilled. As shown in Fig. 4.3,
single-printability plays an important role in the layout decomposability
check. When the patterns are all single-printable (the center column), each
feature can be printed with a single core, so we only need to define which
features are in the core region and which are in the spare region. However,
when the patterns are not single-printable (the right column), the 3
parallel features can be manufactured only if a certain amount of overlay is
tolerated, while the 4 parallel features can never be manufactured with the
SADP process. The precondition that all features are single-printable makes
hot spot detection in SADP equivalent to checking the feasibility of pattern
combinations, which saves much of the effort of finding a printing solution
for each single feature.
Figure 4.3: Different decomposability situations of similar patterns.
[Columns: original design; W = MinCoreWidth, S = SidewallWidth;
W < MinCoreWidth, S = SidewallWidth, with one case marked "No Solution".]
4.3.2 Study of Pattern Combination
Once the layout to be decomposed by SADP meets the precondition of single-
printability, the rest of the hot spot detection is to study the space
between patterns. Every core pattern forms a sidewall self-aligned to the
pattern boundary, and the sidewall region always forms a non-feature region.
Thus, a single feature can never be in the core region and the pure spare
region (the non-core and non-sidewall region) at the same time. Deciding
the decomposability of a layout becomes checking whether every pattern can
be assigned to either the core region or the spare region without any
conflict. Using the distance between two adjacent polygons, we study the
feasibility of each two-feature combination.
Fig. 4.4 shows the definition of distance between two features. To test the
validity of a two-feature combination, we study all distances from 0 to
infinity. Fig. 4.5 shows the four critical check points among all the
distances when feature boundaries are defined by sidewalls (overlay is
minimized). The segment in green represents the cases where a no-overlay
decomposition solution can be found; the segment in orange represents the
cases where a decomposition solution can only be found when overlay exists;
the segment in red represents the cases where no decomposition solution can
be found. Fig. 4.6 shows another situation, in which feature boundaries can
be defined by the trim mask. With the precondition that MinCoreSpace is
equal to MinTrimSpace and MinCoreWidth is equal to MinTrimWidth, we find
that only the segments (0, SidewallWidth) and (SidewallWidth, MinCoreSpace)
are distances where no decomposition solution can be found. With all four
region assignment combinations (core-spare, spare-core, core-core and
spare-spare) available, any distance in [MinCoreSpace, +∞) makes the
two-feature combination feasible.

Figure 4.4: The definition of distance between two features. [Distances
labeled d1, d2 and d3.]

Figure 4.5: The relationship between distance and the option of pattern
assignment, when feature boundaries are defined by sidewalls. [Distance
axis from 0 to ∞ with check points at Sidewall, MinCoreSpace, 2 × Sidewall,
and MinBlockSpace + 2 × Overlay.]
Therefore, the first step of hot spot detection is to check whether any
distance between adjacent features falls into (0, SidewallWidth) or
(SidewallWidth, MinCoreSpace). For any distance larger than or equal to
MinCoreSpace, we can simply consider the pattern combination to pass the
hot spot detection. Then, the only problematic combination, in which the
distance is exactly SidewallWidth, can be handled as a two-coloring problem,
which is illustrated in the following subsection.

Figure 4.6: The relationship between distance and the option of pattern
assignment, when feature boundaries can be defined by the trim mask.
[Distance axis from 0 to ∞ with check points at Sidewall and MinBlockSpace.]
4.3.3 2-Color Assignment for Hot Spot Detection
As we have mentioned above and shown in Fig. 4.5 and Fig. 4.6, the distance
SidewallWidth between two features is the only case that requires a special
core/spare assignment for the decomposability check. When two adjacent
features are SidewallWidth apart, they cannot both be assigned to the core
region or both to the spare region, so they conflict with each other in the
region assignment. Because one feature can only be in either the core or
the spare region, we can represent each feature as a node in a conflicting
graph and connect two nodes with an edge when the corresponding features
are SidewallWidth apart. Once the conflicting graph is complete, if it is
2-colorable, we can conclude there is no hot spot in the current layout. If
the conflicting graph is not bipartite, we only need to pick the edge with
the minimum fixing cost as the hot spot. Once we modify the layout to remove
that edge from the conflicting graph, the hot spot is removed from the real
design. Since it takes O(n^2) time to construct the conflicting graph and
O(n) time to check for an odd cycle in it, the graph-based algorithm is
very efficient. Fig. 4.7 shows a demonstration of the conflicting graph
construction and the hot spot detection for a given layout.
Figure 4.7: A non-decomposable layout and its conflicting graph.
(a) Original non-decomposable layout with features a to g, where adjacent
features are SidewallWidth apart; (b) conflicting graph and hot spot.
The overall graph-based algorithm is shown in Algorithm 1.
Algorithm 1 Overall Graph-Based Algorithm
Require: Layout with feature set F = {f_i | i = 1, ..., n}
Ensure: The distance set D_hotspot that makes the layout non-decomposable.
  for all features f_i do
    Insert node N_i into the conflicting graph G
  end for
  for all distances d_i,j between adjacent features f_i and f_j do
    if d_i,j ∈ (0, SidewallWidth) ∪ (SidewallWidth, MinCoreSpace) then
      Insert d_i,j into D_hotspot
    else if d_i,j = SidewallWidth then
      Insert an edge between nodes N_i and N_j
    end if
  end for
  Detect the odd cycle in G
  Insert the critical edge in the odd cycle into D_hotspot
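Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the C++ implementation used in the experiments: the function name `detect_hot_spots`, the dictionary-based input, and the numeric rule values are assumptions, and instead of selecting the edge with the minimum fixing cost it simply reports each edge that closes an odd cycle during a breadth-first 2-coloring.

```python
from collections import deque

def detect_hot_spots(n_features, distances, sidewall_width, min_core_space):
    """distances maps a pair (i, j) of adjacent feature indices to their spacing.

    Returns (pairs with forbidden spacing, conflict edges closing an odd cycle).
    """
    hot_spots = []
    adjacency = {i: [] for i in range(n_features)}
    for (i, j), d in distances.items():
        if 0 < d < sidewall_width or sidewall_width < d < min_core_space:
            hot_spots.append((i, j))      # spacing with no decomposition
        elif d == sidewall_width:
            adjacency[i].append(j)        # core/spare conflict edge
            adjacency[j].append(i)

    # BFS 2-coloring: 0 = core, 1 = spare. An edge joining two nodes of the
    # same color means the conflicting graph is not bipartite.
    color = {}
    odd_edges = []
    for start in adjacency:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u] and u < v:
                    odd_edges.append((u, v))
    return hot_spots, odd_edges

# Features 0, 1, 2 are mutually SidewallWidth apart (an odd cycle);
# features 2 and 3 have a forbidden spacing; 3 and 4 are far apart.
spacing = {(0, 1): 20, (1, 2): 20, (0, 2): 20, (2, 3): 30, (3, 4): 60}
hot, odd = detect_hot_spots(5, spacing, sidewall_width=20, min_core_space=40)
print(hot, odd)
```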
4.4 Experimental Results
4.4.1 Experimental Setup and Validation
To validate our proposed hot spot detection algorithms, we implement them
in C++ on a workstation running the RedHat 9.0 operating system on an Intel
Xeon CPU at 3.00 GHz with 4 GB memory. In our program, we call the GUROBI
4.0 ILP solver [25] to solve the ILP-based formulation. We implement a set
of demo rules for the 28 nm design from our industrial partner, with the
following rules: SidewallWidth, MinCoreWidth, MinCoreSpace,
MinCoreCrnr2Crnr, TrimOverlay, MinTrimWidth, MinTrimSpace, and
MinTrimCrnr2Crnr.
To test the decomposability of current industrial data, we chose the 45 nm
Nangate Open Cell Library [24] and scaled the cells down to fit into our
test frame. Since the Nangate cell library is not designed for 2D SADP,
most of the cells are not decomposable after scaling. Fig. 4.8 shows a hot
spot detection result for the metal 1 layer in the NOR2_X1 gate. Since the
hot spot (the blue feature in Fig. 4.8(a)) is only Sc,min apart from its
two adjacent features, its own region has to be assigned as a core. However,
this setting creates a core competition with other features and makes the
layout non-decomposable. Our ILP-based program reported the conflict and
pointed out the blue feature as the only hot spot in the layout in 5.88
seconds, and our graph-based program reported the conflict in less than 0.1
seconds. After a simple fix that shifts the blue feature to a new location
farther from the neighboring features, the layout becomes decomposable. The
final decomposition result from our ILP-based algorithm is given in
Fig. 4.8(b).
We further apply our hot spot detection program to all the cells in the
Nangate 45 nm library. The decomposability rate of this public standard
cell library is very low. Of over 130 standard cells in the library, only 9
cells are decomposable; each of the remaining cells has 1 to 5 hot spots.
A simple study of the standard cell layouts shows that even with only a few
patterns in one layout, if one pattern is SidewallWidth away from too many
other features (over three features), the layout is more likely to be
non-decomposable. As we know, the power track in one standard cell always
connects with the power track in other standard cells; therefore, to make
the whole layer decomposable, it is better to keep the features in the
standard cell at least MinCoreSpace away from the power track and its
fingers. With this setting, the number of hot spots can be greatly reduced.
Overall, an SADP-friendly standard cell library is sorely needed. By
studying the hot spots in the layout, we can derive some recommended design
rules for SADP-friendly design from empirical data.
Figure 4.8: The hot spot detection for the NOR2_X1 cell and the
decomposition result after fixing. (a) Hot spot detection and its
modification; (b) the corresponding decomposition.
4.5 Conclusions
In this chapter, we have addressed one of the critical issues in the SADP
decomposition process: hot spot detection. We characterize the geometric
mask rules as an ILP problem and build an ILP-based hot spot detection
method without any preconditions. Based on the precondition of
single-printability, we further implement a graph-based method that uses
the 2-coloring problem to solve hot spot detection. Experimental results
validate our method, and decomposition results for the Nangate open cell
library are analyzed. From the study of the hot spots in the current
Nangate library, some recommended design rules are suggested. The high
number of hot spots calls for an SADP-friendly standard cell library in the
near future to meet the requirements of the upcoming 14 nm technology node.
CHAPTER 5
CHARACTERIZATION OF
SAQP-FRIENDLY DESIGN
5.1 Introduction
Developed from SADP for the sub-20 nm technology node, the self-aligned
quadruple patterning (SAQP) technique is expected to be a major solution
for process requirements beyond the 16 nm/14 nm technology node. The SAQP
technique has already shown its capability for a 15 nm half-pitch process
in NAND flash memory manufacturing [26]. However, it is still unknown
whether this technique can be applied to random logic circuits on the
lower-level metal interconnect layers.
Currently, the major challenges for SAQP come from several aspects. First,
process enablement depends on an efficient layout decomposition method for
mask set generation. Although a simple, inefficient decomposition algorithm
can be developed from the existing decomposition solution for SADP in
Chapter 2, without clear rule definitions a complete full-layout
decomposition solution is still not available. Second, the mask rule
definition has to be completed within the design-manufacturing ecosystem,
accounting for cost, yield, regularity and design/manufacturing
friendliness at the same time. Third, it is always difficult to balance the
needs of design and manufacturing. It is trivial for SAQP to generate a
regular 1D pattern, but the best definition of a manufacturing-friendly
design style for SAQP is still unclear.
So far, because clearly defined mask rules for SAQP are still an open
problem, it is impossible to have a complete decomposition algorithm for
the full-chip layout.¹ Instead, based on the existing knowledge of current
193 nm lithography and the process flow of SAQP, we focus on the early
study of SAQP-friendly layout with its implicit feature-region assignment
for future decomposition. We study the feasible feature regions of the SAQP
process and explore the possible combinations of adjacent features. Then,
several simple but important geometry rules are illustrated. Based on the
rules and the combination relations of adjacent features, we introduce a
conflicting graph algorithm for feature-region assignment. Our experimental
results validate the SAQP-friendly layout definition, and basic building
blocks in the low-level metal layer are analyzed.

¹ Indeed, we can directly develop a decomposition algorithm from the SADP
decomposition algorithm in Chapter 2 with pre-assumed mask rules. However,
this work would not help the characterization and early study of
SAQP-friendly layout.
The rest of the chapter is organized as follows. In Section 5.2, we first
introduce the key ideas of the current SAQP process. In Section 5.3, based
on the process analysis, the existing process conditions and the patterning
feasibility are studied. Section 5.4 presents several important geometry
rules for the SAQP-friendly layout definition. Then we introduce a
conflicting graph algorithm for the feature-region assignment of any given
layout in Section 5.5. Section 5.6 shows experimental results to define our
SAQP-friendly layout and analyzes the feasibility of some common building
blocks in the low-level metal layer. Finally, Section 5.7 concludes this
chapter.
5.2 Process Information
5.2.1 SAQP Process Flow
In the previous study of the SAQP process [26], two process schemes of SAQP
are introduced based on different sidewall2 generation processes. However,
if we ignore the technical details, single-APF and double-APF are identical
in the geometric relationship between sidewall1 and sidewall2, and we can
therefore use Fig. 5.1 (the single-APF SAQP process) to represent the whole
process flow. In Step 1, we first generate the core pattern with the
mandrel mask and deposit sidewall1 along the core boundaries. In Step 2,
the core is removed and sidewall2 is deposited along the sidewall1
boundaries. In Step 3, we etch away the sidewall1 region, and the sidewall2
regions help define the dielectric between circuit patterns. The whole
process, with a positive tone process, is also named SAQP SID (sidewall is
dielectric).
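Figure 5.1: The process flow of SAQP. [Cross-section in three steps.
Step 1: cores of width WCore and space SCore with sidewall1 (S1) on both
sides. Step 2: sidewall2 (S2) on both sides of each S1. Step 3: feature
regions between the S2 lines in the order A, B, A, C, A, B, A.]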
5.2.2 Pattern Dimensions
In Fig. 5.1, we name all the non-sidewall2 regions feature regions, which
can be kept after trim to generate the intended circuit features. The
feature region from sidewall1 is called the sidewall1 region, or simply
"A"; the region defined by the core is called the core region, or simply
"B"; the region defined by the space between cores is called the space
region, or simply "C". In this chapter, we define the following critical
process parameters: the sidewall1 width W1, the sidewall2 width W2, the
core width WCore, and the core space SCore. Among those critical
parameters, W1 and W2 are defined by the process and are fixed for the
designer. The geometry variables WCore and SCore are the key parameters
decided by the circuit patterns for the final mask generation.
According to the "Standard CD" condition (Section 5.3.1), for a general
layout where feature width and feature space are equal, the ideal sidewall
widths W1 and W2 are both P/8, where P is the standard mandrel pitch. Thus
the width of A is always W1 = P/8, and the minimum B width WB,min and the
minimum C width WC,min are also P/8. In this way, the minimum core width
has to be WCore,min = 3P/8 and the minimum core space has to be
SCore,min = 5P/8. Note that, as mentioned in the previous work [26], the
real core on the wafer that generates sidewall1 is the initial core size
minus the amount of etch trim E′. Thus, the minimum mandrel pattern width
is WMandrel,min = 3P/8 + 2E′ and the minimum mandrel space is
SMandrel,min = 5P/8 - 2E′. In this way, the minimum feature width
WFeature,min and space SFeature,min are both equal to P/8. We will use this
estimation in the following paragraphs to determine the basic rules of the
features in SAQP. Table 5.1 shows the standard geometry parameters for the
14 nm and 10 nm half-pitch processes.

Table 5.1: Geometry Constraints for Future Technology Nodes
Parameter        14 nm HP    10 nm HP
W1, W2           14 nm       10 nm
WFeature,min     14 nm       10 nm
WCore,min        42 nm       30 nm
SCore,min        70 nm       50 nm
Mandrel Pitch    112 nm      80 nm
As described in the previous section, features can only be located in the
sidewall region (A), the core region (B), or the space region (C) in
Fig. 5.1. The width of each region can be expressed by the following
equations:
\[ W_A = W_1 \tag{5.1} \]
\[ W_B = W_{Core} - 2W_2 \tag{5.2} \]
\[ W_C = S_{Core} - 2W_2 - 2W_1 \tag{5.3} \]
From the above equations, any feature with width larger than W1 cannot
be in sidewall region A.
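As a quick numeric check, the standard-CD relations and Eqs. (5.1) to (5.3) can be evaluated for the two mandrel pitches of Table 5.1. The sketch below is only illustrative; the function name `saqp_geometry` and the dictionary keys are assumptions, and all lengths are in nm.

```python
def saqp_geometry(mandrel_pitch):
    """Standard-CD SAQP dimensions derived from the mandrel pitch P (in nm)."""
    p = mandrel_pitch
    w1 = w2 = p / 8            # sidewall1 and sidewall2 widths
    w_core = 3 * p / 8         # minimum core width
    s_core = 5 * p / 8         # minimum core space
    return {
        "W1": w1, "W2": w2, "WCore": w_core, "SCore": s_core,
        "WA": w1,                              # Eq. (5.1)
        "WB": w_core - 2 * w2,                 # Eq. (5.2)
        "WC": s_core - 2 * w2 - 2 * w1,        # Eq. (5.3)
    }

for pitch in (112, 80):
    print(pitch, saqp_geometry(pitch))
```

For P = 112 nm this gives W1 = W2 = 14 nm, WCore = 42 nm and SCore = 70 nm, and all three feature regions come out 14 nm wide, in agreement with the 14 nm HP column of Table 5.1.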
5.2.3 SAQP Patterning
Fig. 5.2 demonstrates a 2D patterning process by SAQP, in which L-shaped,
U-shaped and parallel line patterns are presented. In this process, after
sidewall2 is generated to define the dielectric, the trim mask is applied
to cut the wires and trim away the unnecessary patterns. In this technology
definition, only two masks (mandrel and trim) are needed: the mandrel mask
defines the core and the trim mask defines the trim.
Figure 5.2: The demonstration of the SAQP process to generate 2D patterns.
(a) Intended layout; (b) core and sidewall1; (c) sidewall1 and sidewall2;
(d) sidewall2 and trim.

Similar to the geometric Boolean function shown in Chapter 2 for the SADP
SID process, the feature generation in the SAQP SID process can be
expressed by the following geometric Boolean expression:
\[ \mathrm{Feature} = \neg\,\mathrm{Sidewall2} \cap \mathrm{Trim} \tag{5.4} \]
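The Boolean expression (5.4) and the region order of Fig. 5.1 can be reproduced on a one-dimensional cross-section. The following sketch rasterizes the cross-section on a unit grid (one cell is P/8 under the standard-CD assumption) with a periodic boundary; the helper names `dilate` and `saqp_regions`, the periodic boundary, and the example trim mask are all assumptions made for illustration.

```python
def dilate(cells, k):
    """Grow every True run by k cells on both sides (periodic boundary)."""
    n = len(cells)
    return [any(cells[(i + d) % n] for d in range(-k, k + 1)) for i in range(n)]

def saqp_regions(w1=1, w2=1, w_core=3, s_core=5, n_cores=2):
    """Return a label string ('-' is sidewall2) and the sidewall2 mask."""
    pitch = w_core + s_core
    n = n_cores * pitch
    core = [(i % pitch) < w_core for i in range(n)]
    # Step 1: sidewall1 is deposited along the core boundaries.
    grown = dilate(core, w1)
    sidewall1 = [g and not c for g, c in zip(grown, core)]
    # Step 2: the core is removed and sidewall2 flanks sidewall1 on both sides.
    grown = dilate(sidewall1, w2)
    sidewall2 = [g and not s for g, s in zip(grown, sidewall1)]
    labels = []
    for i in range(n):
        if sidewall2[i]:
            labels.append("-")
        elif sidewall1[i]:
            labels.append("A")
        elif core[i]:
            labels.append("B")
        else:
            labels.append("C")
    return "".join(labels), sidewall2

labels, sidewall2 = saqp_regions()
print(labels)

# Eq. (5.4): Feature = (not Sidewall2) and Trim. The trim mask here keeps
# everything except cell 5, which removes the first C region.
trim = [True] * len(labels)
trim[5] = False
feature = [(not s2) and t for s2, t in zip(sidewall2, trim)]
print("".join(lab if f else "." for lab, f in zip(labels, feature)))
```

Reading the first printed string from left to right gives the repeating feature order B, A, C, A, which is the A-B-A-C sequence of the process flow shifted by one region.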
In order to have an SAQP patterning solution, we need a complete
decomposition method for a given layout. For an SAQP-friendly design, it is
necessary to generate the mandrel and trim masks for the full-chip layout
in a reasonable time; for an SAQP-unfriendly design, the troublesome hot
spots have to be detected. This complete SAQP decomposition scheme needs to
fulfill three key criteria: a decent set of mask rules for mandrel and
trim, an efficient decomposition algorithm, and an easy definition of
SAQP-friendly layout. Among these criteria, it is still too early to define
the set of mask rules explicitly; an early, inefficient decomposition
algorithm can be developed following the idea of the SAT-based
decomposition algorithm for SADP in Chapter 2; we will mainly discuss the
SAQP-friendly layout in terms of the existing process conditions and the
geometry rules, which are shown in the following sections.
5.3 SAQP Patterning Conditions and Feasibility
5.3.1 Existing Conditions of SAQP Patterning
In order to achieve the task of this chapter, the existing conditions of
the SAQP process must be explored first.
• Limited Mandrel Pitch. Since cores have to be printed by 193 nm
  lithography, the mandrel pitch is limited by the normal optical
  lithography process. Under ideal conditions, the mandrel pitch is 72 nm
  with the immersion system and 104 nm with the dry system. Considering
  the flexibility of feature patterns and a reasonable process window, the
  mandrel pitch should always be considerably larger than these ideal
  numbers.
• Standard CD for One Technology Node. The SAQP process can be built up
  for 14 nm or 10 nm half-pitch (HP) patterning. In this way, the sidewall1
  width W1 always works as the minimum feature width, and the sidewall2
  width W2 works as the minimum feature space. Therefore, the standard
  feature pitch is W1 + W2, and the standard mandrel pitch has to be
  4(W1 + W2).
• Minimum Bound of Mask Width and Space. The 28 nm technology node is the
  last technology node with single patterning for critical layer printing.
  Thus, we can assume the mask rules will be similar to the ones used in
  the 28 nm technology node. Based on the above "Standard CD" condition,
  the required trim space and width will be larger than W1 + W2.
• One Feature in One Region. Similar to SADP, in SAQP, different feature
  regions are separated by sidewall2. Thus one continuous feature is always
  in one region.
Following these existing process conditions, we will explore the
feasibility of SAQP patterning in the following section.
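The ideal mandrel pitches quoted in the "Limited Mandrel Pitch" condition are consistent with the Rayleigh resolution limit. The worked numbers below are a sketch under assumptions that the text does not state: a process factor k1 = 0.25 (the theoretical single-exposure limit) and numerical apertures of 1.35 for the immersion system and 0.93 for the dry system.

```latex
\[
  P_{\min} = 2\,k_1\,\frac{\lambda}{\mathrm{NA}}
           = \frac{\lambda}{2\,\mathrm{NA}} \quad (k_1 = 0.25),
\]
\[
  P_{\min}^{\text{immersion}} = \frac{193\ \text{nm}}{2 \times 1.35} \approx 71.5\ \text{nm},
  \qquad
  P_{\min}^{\text{dry}} = \frac{193\ \text{nm}}{2 \times 0.93} \approx 103.8\ \text{nm}.
\]
```

Rounded up, these values match the 72 nm and 104 nm figures given above.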
5.3.2 Feature Region Adjacency
According to the "One Feature in One Region" condition, unlike the LELE
process, there is no stitch that lets different feature regions generate
one continuous feature. The layout decomposition is therefore equivalent to
exploring the feature region adjacency rules for a feasible or optimal
feature-region assignment. Once the feature-region assignment is
determined, the mandrel and block masks can be generated based on the mask
rules. Therefore, it is extremely important to determine the feature-region
assignment at the early stage of SAQP decomposition.
In the SAQP process, feature regions always follow the same pattern along
the same direction, which is A-B-A-C. However, because the trim mask can be
used to trim away unnecessary patterns, two adjacent features can be
separated in three ways: sidewall-separation, sidewall-trim-separation, and
trim-separation, as shown in Fig. 5.3. Sidewall-separation is the most
commonly seen case, where no feature region is trimmed away.
Trim-separation is the case where one continuous feature region is cut off.
Sidewall-trim-separation is the most flexible separation, where the
feasible regions between the features are trimmed away.
Figure 5.3: Three types of feature separations and their distance
definitions. (a) Sidewall-separation; (b) sidewall-trim-separation;
(c) trim-separation.
Counting the adjacent regions after trim, we obtain all available region
combinations: A-A, B-B, C-C, A-B, A-C, and B-C. Depending on the separation
method, those adjacent features can be generated from one, two or multiple
cores. Note that one single mandrel generates 1 B and 2 A's, and two
adjacent mandrels have 4 A's, 2 B's and 3 C's. Note also that
trim-separation only happens when both features can be in the same feature
region and the distance between them is larger than a trim. Except for the
trim-separation case, Table 5.2 shows the possible adjacency distances in
terms of the SAQP key parameters when both regions are from one or two
adjacent cores. For the cases where the adjacent features are from regions
defined by cores several pitches apart, we can simply add the corresponding
number of mandrel pitches to the numbers in the table.
Table 5.2: Distance between Different Feature Regions
             Single Core                  Double Cores
Adjacency    Feature Space  Trim Space    Feature Space        Trim Space
A-A          WCore          WCore - W2    SCore - 2W1          SCore - 2W1 - W2
B-B          NA             NA            SCore + 2W2          SCore + W2
C-C          NA             NA            WCore + 2W2 + 2W1    WCore + W2 + 2W1
A-B          W2             0             SCore + W2 - W1      SCore - W1
A-C          W2             0             WCore + W1 + W2      WCore + W1
B-C          W1 + 2W2       W1 + W2       NA                   NA
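The entries of Table 5.2 can be evaluated for a concrete node. The sketch below is illustrative only: the function name `table_5_2` and the data layout are assumptions, and the trim space is computed with the observation that every trim-space entry of the table equals the corresponding feature space minus W2.

```python
def table_5_2(w1, w2, w_core, s_core):
    """Return (feature_space, trim_space); each maps an adjacency to a
    (single core, double cores) pair, with None for the NA entries."""
    feature_space = {
        "A-A": (w_core, s_core - 2 * w1),
        "B-B": (None, s_core + 2 * w2),
        "C-C": (None, w_core + 2 * w2 + 2 * w1),
        "A-B": (w2, s_core + w2 - w1),
        "A-C": (w2, w_core + w1 + w2),
        "B-C": (w1 + 2 * w2, None),
    }
    # Observation from the table: trim space = feature space - W2 in every row.
    trim_space = {
        key: tuple(None if d is None else d - w2 for d in pair)
        for key, pair in feature_space.items()
    }
    return feature_space, trim_space

# 14 nm half-pitch values from Table 5.1 (all in nm).
fs, ts = table_5_2(w1=14, w2=14, w_core=42, s_core=70)
for key in fs:
    print(key, "feature space:", fs[key], "trim space:", ts[key])
```

Each trim space can then be compared with the W1 + W2 bound (28 nm at this node) from the "Minimum Bound of Mask Width and Space" condition.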
By comparing the trim space with the "Minimum Bound of Mask Width and
Space" condition, we can check the feasibility of each combination in
Table 5.2 and label them with different colors. Green represents the
allowable combinations; red represents the disallowed combinations, which
violate the "Minimum Bound of Mask Width and Space" condition; yellow
represents undetermined combinations whose feasibility depends on the mask
rules.
Figure 5.4: The feasibility of different distances from 0 to ∞. [Distance
axis from 0 to ∞ marked at S2 (the sidewall2 width W2), MaxInfeasibleBound
and MinFeasibleBound, with the feasible distances highlighted.]
According to the contents of Table 5.2, there exist two distance values,
MinFeasibleBound and MaxInfeasibleBound, such that any distance larger than
MinFeasibleBound makes all possible combinations feasible and any distance
smaller than MaxInfeasibleBound, except W2, makes the feature combinations
infeasible. If we consider the distance between two adjacent features from
0 to infinity, as shown in Fig. 5.4, the red intervals (0, W2) and
(W2, MaxInfeasibleBound] represent the infeasible distances that any
SAQP-friendly layout should avoid. W2 is the distance at which the
adjacency of two features can only be A-B or A-C. Any distance in
[MinFeasibleBound, ∞) makes all the options of A, B or C region assignment
feasible. Note that the priorities of different feature combinations with
a distance in (MaxInfeasibleBound, MinFeasibleBound) are unclear due to the
undetermined mask rules, and the feature region feasibility is also mildly
affected by distances in this range. However, because W2 is always the most
commonly seen distance, for SAQP-friendly characterization at the current
stage, we focus on the distance W2 for the decomposability check. The
adjacency options for distances in (MaxInfeasibleBound, MinFeasibleBound)
are left as an open question, to be solved after the mask rules and pattern
combination preferences become clearer.
5.4 Patterning Rules
In this section, we enumerate five important patterning rules based on
the existing process conditions and the arguments in the previous section.
5.4.1 Non-Single Min-Wire Rule (NonSingleRule)
NonSingleRule requires that any portion of a feature with the minimum
width have another feature at distance W2. As shown in Fig. 5.5, in order
to have a single wire with minimum width W1, the trim pattern has to be
W1 + W2 wide, which violates the "Minimum Bound of Mask Width and Space"
condition.
Figure 5.5: Illustration of NonSingleRule. [Annotation: Wtrim = W1 + W2
= ¼ MinCorePitch, too thin for irregular patterns.]
5.4.2 Sidewall Feature No Branch Rule (SidewallNoBranchRule)
SidewallNoBranchRule requires that any feature in region A have no branch.
This rule is also determined by the "Minimum Bound of Mask Width and Space"
condition. Region A is generated along the original core patterns. In order
to merge sidewall1 regions to make region A wider or branched, the core
space would have to be at most 2W1, as shown in Fig. 5.6, which is
impossible under that condition. Therefore, region A must always be W1 wide
and have no branch.
Figure 5.6: Illustration of SidewallNoBranchRule. [Annotation: two cores
with SCore = W1, too small for a core space.]
5.4.3 Sidewall Feature Aside Rule (SidewallAsideRule)
SidewallAsideRule requires that any feature in region A have other features
in B or C at distance W2 along the whole feature. This rule is developed
from NonSingleRule and SidewallNoBranchRule. Since the sidewall1 region is
always W1 wide, if any feature is assigned to A, there must be features in
B or C at distance W2 along the path of the feature, as shown in Fig. 5.7.
Figure 5.7: Illustration of SidewallAsideRule.
5.4.4 Same Side Rule (SameSideRule)
SameSideRule, illustrated in Fig. 5.8, requires that any feature assigned
to sidewall1 region A have all its adjacent B features on one side and all
its adjacent C features on the other side. This rule comes from a
combination of the existing conditions and the previous rules. Since region
A is a path without any branches, it is easy to define the left and right
sides along the region path. Meanwhile, a continuous region A can only be
generated from one single core (as illustrated in SidewallNoBranchRule). If
B features can be found on both sides of A, the corresponding cores must
violate the "Minimum Bound of Mask Width and Space" condition.
Figure 5.8: Illustration of SameSideRule.
5.4.5 Self Apart Rule (SelfApartRule)
SelfApartRule requires that no portion of a feature lie within a distance
of WCore,min from another portion of the same feature. Because one feature
has to be in one region, according to Table 5.2, the smallest distance
between adjacent portions in the same region is WCore, in the A-A case.
5.5 SAQP-Friendly Layout and Feature-Region
Assignment
Based on the previous rules and existing conditions, we will introduce our
SAQP-friendly layout check and feature-region assignment algorithm for SAQP
decomposition in this section.
5.5.1 Conflicting Graph
To determine whether a layout is SAQP-friendly or not, the most important
step is to check whether we can assign each feature to region A, B or C
without introducing any conflicts with the existing rules. Given that every
feature fulfills SelfApartRule and no distance between two adjacent
features is in the infeasible regions of Fig. 5.4, we only need to check
the feature combinations with distance W2.
As mentioned in the previous section, W2 is the distance that can only
separate A from B/C. Therefore, to construct a conflicting graph, we let
one node represent one feature and connect two nodes with an edge if any
distance between the two features is W2, as shown in Fig. 5.9. The
SAQP-friendly check is then equivalent to assigning the colors A, B and C
to this conflicting graph.
Figure 5.9: A demonstration of conflicting graph construction. (a) Layout
with features 1 to 8; (b) conflicting graph.
5.5.2 Color Assignment
In the conflicting graph, each edge represents a color conflict between A
and B/C. We first group the B and C colors together as the BC color, so the
problem is translated into a 2-coloring problem that assigns the A or BC
color to each node. Note that a feature in A must have no portion wider
than W1, so we can always assign color BC to those nodes whose
corresponding features cannot be A.
Then we need to assign colors B and C to the nodes with color BC. According
to SameSideRule, the features with color B must always be on one side of A,
while the features with color C must always be on the other side. We can
check the nodes with color A one by one and split their adjacent BC nodes
into two sides. If a newly assigned B/C color conflicts with a pre-assigned
B/C color, we report a B-C conflicting error, which means the graph fails
the SAQP-friendly check. The process is illustrated in Fig. 5.10.

Figure 5.10: The process of A-B-C color assignment. With the conflicting
graph in (a), A and BC colors are first assigned in (b); B and C colors are
then assigned to the nodes with BC color in (c).
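The two-stage coloring can be sketched as two passes of the same parity-propagation routine. This is a minimal sketch under stated assumptions, not the implementation evaluated in Section 5.6: the names `parity_color` and `assign_regions` are hypothetical, and the left/right neighbor lists of each feature (needed for SameSideRule) are assumed to be extracted from the layout geometry beforehand and to contain only conflict neighbors.

```python
from collections import deque

def parity_color(nodes, constraints, seeds=None):
    """Assign 0/1 so that value[u] ^ value[v] == diff for each (u, v, diff).

    seeds pre-assigns values to some nodes. Returns None on a conflict.
    """
    seeds = seeds or {}
    adj = {u: [] for u in nodes}
    for u, v, diff in constraints:
        adj[u].append((v, diff))
        adj[v].append((u, diff))
    value = {}
    for start in list(seeds) + list(nodes):   # seeded nodes propagate first
        if start in value:
            continue
        value[start] = seeds.get(start, 0)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, diff in adj[u]:
                expected = value[u] ^ diff
                if v not in value:
                    if v in seeds and seeds[v] != expected:
                        return None
                    value[v] = expected
                    queue.append(v)
                elif value[v] != expected:
                    return None
    return value

def assign_regions(features, conflict_edges, wide, sides):
    """wide: features wider than W1 (they cannot be A).
    sides: feature -> (left neighbors, right neighbors) among conflict neighbors."""
    # Stage 1: A (0) versus BC (1); every conflict edge joins different colors.
    ab = parity_color(features, [(u, v, 1) for u, v in conflict_edges],
                      {f: 1 for f in wide})
    if ab is None:
        return None
    # Stage 2 (SameSideRule): neighbors on one side of an A share a color,
    # neighbors on opposite sides take different colors.
    constraints = []
    for a in (f for f in features if ab[f] == 0):
        left, right = sides.get(a, ((), ()))
        for group in (left, right):
            constraints += [(group[0], v, 0) for v in group[1:]]
        if left and right:
            constraints.append((left[0], right[0], 1))
    bc = parity_color([f for f in features if ab[f] == 1], constraints)
    if bc is None:
        return None
    return {f: "A" if ab[f] == 0 else "BC"[bc[f]] for f in features}

# Five parallel minimum-width wires, each W2 away from the next one.
wires = [0, 1, 2, 3, 4]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
sides = {i: (tuple(j for j in (i - 1,) if j >= 0),
             tuple(j for j in (i + 1,) if j <= 4)) for i in wires}
result = assign_regions(wires, edges, wide=set(), sides=sides)
print(result)
```

For the five parallel wires the sketch recovers the A-B-A-C-A order of the process flow, and a triangle of mutually conflicting features fails already in the first stage.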
Note that if we define a connected graph to be a graph in which every node
is connected to every other node through a path, a complete layout can be
represented by several connected conflicting graphs. Here we define a
connected component set (CCS) to be a set of features connected by
conflicting edges, as shown in Fig. 5.11, where 2 CCSs are presented. In
any CCS, if there is no feature wider than W1, then the colors A and BC are
interchangeable. Also, in any CCS, B and C are always interchangeable. So,
in a CCS with a determined BC color assignment, there are only 2 options of
A-B-C color assignment; in a CCS without a determined BC color assignment,
there are 4 options. Suppose there are p CCSs with a determined BC color
assignment and q CCSs without one in a complete layout; the total number of
feature-region assignment options is then 2^p · 4^q. The final
decomposition and mandrel-trim mask generation can be further developed
from those options.

Figure 5.11: The definition of connected component set (CCS). [Two sets
labeled CCS1 and CCS2.]
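The count of assignment options can be computed by grouping features into CCSs with a union-find structure. This sketch is illustrative: the function name `count_assignment_options` is an assumption, a CCS is treated as having a determined BC assignment when it contains at least one feature wider than W1, and every CCS is assumed to have already passed the coloring check.

```python
def count_assignment_options(features, conflict_edges, wide):
    """Return 2**p * 4**q, where p (q) is the number of CCSs with (without)
    a feature wider than W1."""
    parent = {f: f for f in features}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]   # path halving
            f = parent[f]
        return f

    for u, v in conflict_edges:
        parent[find(u)] = find(v)           # merge the two components

    roots = {find(f) for f in features}
    determined = {find(f) for f in wide}
    p = len(determined)
    q = len(roots) - p
    return 2 ** p * 4 ** q

# Two CCSs: {0, 1, 2} has no wide feature, {3, 4} contains wide feature 4.
options = count_assignment_options(
    features=[0, 1, 2, 3, 4],
    conflict_edges=[(0, 1), (1, 2), (3, 4)],
    wide={4},
)
print(options)
```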
So far, we can define an SAQP-friendly layout as any layout that passes the
pattern rules and the color assignment on the conflicting graph. Building
the conflicting graph takes O(n log(n)) time, where n is the total number
of edges. Checking the coloring feasibility and assigning colors takes O(m)
time, where m is the total number of features.
5.6 Experimental Results and Analysis
The conflicting graph and coloring assignment algorithms are implemented in
C++ on an Ubuntu 10.04 system with an Intel Xeon 2.80 GHz CPU and 4 GB of
memory.
5.6.1 Method Validation
We carried out our experiments on SADP-friendly layouts after proper
scaling. As demonstrated in Fig. 5.12(a), the SADP-friendly layout passes
the A-BC 2-coloring assignment, which matches the conclusion on SADP in
Chapter 4. However, the SADP-friendly layout can still fail the A-B-C
3-coloring assignment and violate pattern rules. The whole process of
SAQP-friendly check and feature region assignment takes 0.04 s to finish,
with 15 violations reported. Fig. 5.12(b) shows the feature assignment for
the example of Fig. 5.2. With 0 failures reported, the A-B-C color
assignment succeeds and the whole process takes 0.01 s.
Figure 5.12: Feature region assignment on (a) an SAQP-unfriendly layout
and (b) an SAQP-friendly layout. The layout in (a) is SADP-friendly, but
this only guarantees no conflict in the A-BC 2-coloring assignment; the
layout still has other types of geometry rule violations and A-B-C coloring
conflicts.
5.6.2 Common Case Study
We have inspected several large layouts, but unfortunately none of the
existing layouts is SAQP-friendly. Instead, we report the common patterns in
the interconnect layer that are SAQP-friendly, and the common patterns that
fail to be SAQP-friendly together with the violations reported, as shown in
Fig. 5.13 and Fig. 5.14.
Fig. 5.13 shows commonly seen patterns that are SAQP-unfriendly, and
Fig. 5.14 shows patterns that are classified as SAQP-friendly. From the
comparison, we can see that the number of allowable patterns is further
reduced in the SAQP process compared to SADP. For irregular patterns, such
as L, U and wide pad, the track assignment always follows a 4-jog rule. In
other words, any wire jog (wrong-direction wire) has to be $4 \times n$
tracks long. Equivalently, if there is a pad, the adjacent 3 tracks are
occupied.
This common case study clearly demonstrates that SAQP has limited
capability in manufacturing the low-level interconnect layers. We will need
to update our routing tools as well as the pin assignment to finally achieve
an SAQP-friendly design.
Figure 5.13: Several examples of SAQP-unfriendly patterns and their
violations. Violation labels in the figure: NonSingleRule Violation,
SelfApartRule Violation, Odd Cycle Violation, Infeasible Dist. Violation,
BC Coloring Confliction, A-A Coloring Confliction, and BC-BC Coloring
Confliction.
5.7 Conclusions
So far, we have finished our study of the definition of SAQP-friendly
layout. By exploring the feasible feature regions and the possible
combinations of adjacent features, we have defined several simple but
important geometry rules. We then introduced a conflicting graph algorithm
to generate the feature region assignment for SAQP decomposition. Our
experimental results validate the SAQP-friendly layout definition, and some
basic building blocks in the low-level metal layer are studied. We expect
that SAQP will be a promising technique and that more related research will
be carried out on rule definition, mandrel/trim mask synthesis, and
design-technology co-optimization.
Figure 5.14: Several examples of SAQP-friendly patterns and their feature
region assignments.
PART III
NEW DESIGN
METHODOLOGY
CHAPTER 6
LAYOUT OPTIMIZATION FOR OPTIMAL
PRINTING
6.1 Introduction
Since 193 nm ArF immersion lithography is still the only available light
source for mass production of ICs at present (32 nm/22 nm technology node),
novel design styles, such as restricted design rules on 1-D cells, are
needed in addition to the novel patterning techniques described in previous
chapters, for process reliability and yield improvement [27-30].
Figure 6.1: An SEM image to show how line-end gaps affect line width
roughness [31]. Clear wavy shapes can be found when there is a gap nearby.
The new requirement for cell regularity calls for new design methodologies.
In a 1-D regular circuit, the line width, spacing, and line-end gap are
the three factors that determine the circuit pattern in one layer. While
line width and spacing are determined by the technology node, we can control
the positions of the line-end gaps, which can greatly affect printability. A
new method for physical design of regular 1-D cells has been introduced
in [32]. In [33], based on inverse lithography technology, line-end gap size
is illustrated together with space/width ratio. Figure 6.1 is an SEM image
from [31] that clearly shows how badly the gaps degrade the line width
roughness in the dense line printing technology.
In this chapter, targeting 1-D cell design, we use simulation data to
analyze the relationship between the line-end gap distribution and
printability, which clearly shows that the gap distribution needs to be
retargeted. Based on the gap distribution preferences, an optimal algorithm
is provided to efficiently extend the line ends and insert dummies, which
significantly improves the gap distribution and helps printability.
Experimental results on two different pitch processes show that significant
improvement in process windows can be obtained.
6.2 Description of the Problem
Dense line printing is a type of lithography technology that can utilize the
advantages of Off-Axis Illumination (OAI) to print highly regular 1-D
patterns with randomly distributed gaps. In this section, we present
simulation results to analyze the relationship between the printability and
the gap distribution.
Gap distribution is usually a factor in dense line printing, and some
types of gap distribution significantly affect process windows, as Fig. 6.2
demonstrates. The simulation condition is an annular light source with
$\sigma = 0.9$, pitch = 140 nm, NA = 0.92 and gap size = 70 nm. Although
both patterns print well at 0 nm defocus, at 100 nm defocus a great
difference can be seen between the stage-like gap distribution pattern and
the regular dense line pattern. This example shows that printability can be
improved just by changing the gap distribution.
Figure 6.3 shows a more complete set of test patterns on how the gap
distribution matters for printability. The labels on the patterns mark the
sampling points for the process window comparison. With the same
illumination condition as in the above example, EPE values within a defocus
range of $-0.15\,\mu\mathrm{m}$ to $0.15\,\mu\mathrm{m}$ are measured. The
comparison results for different purposes are shown in Figures 6.4, 6.5,
and 6.6.
Figure 6.2: An example to show how gap patterns affect printability and
process windows. Panels (a) and (c) are at defocus = 0 nm; panels (b) and
(d) are at defocus = 100 nm.
• Fig. 6.4 compares the patterns with and without a gap to show how the
gap affects the process windows. From the EPE difference at 100 nm
defocus, we can see that the process window is slightly reduced by a
gap.
• Fig. 6.5 compares how the process windows react when different numbers
of gaps are lined up. The EPE curves for all four cases almost overlap,
which means that the process window is not sensitive to the number of
aligned gaps.
• Fig. 6.6 shows the impact on the process window when the aligned gaps
are placed as a stage. It clearly shows that when gaps are nearby but
not aligned, their negative impacts interact with each other and
severely damage the line ends within the gap region.
As the previous analysis shows, printability is greatly impacted by the
gap distribution. To better describe the pattern of gap distribution, two
basic examples are given in Fig. 6.7. In this figure, an effective gap is
defined as a gap that is adjacent to a real wire and therefore influences
the process window of that real wire, as Fig. 6.4 shows. Because the
process window is not sensitive to the number of aligned gaps, as Fig. 6.5
shows, one effective gap is enough to represent the effect when several
gaps are aligned together and adjacent to a real wire. Therefore, the
number of effective gaps directly represents how the real wires are
affected by the gaps. Lining gaps up to form one effective gap and shifting
gaps away from the real wires are efficient ways to reduce the number of
effective gaps and thus benefit the printability.
Figure 6.3: Test patterns. Each test site is labeled (a through g).
As Fig. 6.6 shows, when gaps form a stage-like pattern (in other words,
when a gap is near a line-end), the line-end is greatly impacted by the
gap, which alerts us to avoid this kind of gap distribution. We call an
effective gap a critical gap when it is adjacent to a line-end of a real
wire. In order to maintain the printability of the dense line, critical
gaps should be avoided in the cell design.
6.3 Uniformity-Aware Cell Design Guideline
Consider the local interconnect with finished transistor placement in a 1-D
cell. Local wires are assigned to the tracks in the lower layer, e.g., the
metal 1 layer, and on each layer wires contribute to only one direction of
connection, as a 1-D circuit requires. Starting from any routing algorithm
that routes wires unidirectionally in one layer, wires are initially placed
onto different tracks without considering the gap distribution. Due to the
restriction of the interconnect in a 1-D cell, all wires remain within the
boundary of the cell and all line-ends of the real wires lie exactly on the
grids where the transistor gates and active areas are located, as
Fig. 6.8(b) shows. Therefore, the grid size is the half pitch of the gate.
Figure 6.4: Process windows comparison between sites a and b. (Plot of EPE
in nm against defocus in nm, from -150 to 150; curves: test a, test b.)
Figure 6.5: Process windows comparison among sites b, c, d and e. (Plot of
EPE in nm against defocus in nm, from -150 to 150; curves: test b, test c,
test d, test e.)
Note that dummies are widely used to keep the line pitch uniform.
Each line-end of a real wire or a dummy corresponds to a gap. Note that no
matter how the dummies are inserted, only line-end extension can improve
the gap distribution. Normally a via/contact connects to the line-end of
the real wire, where the printed result should not be degraded from the
intended shape, and extending the line-end helps here. However, because
the extensions of the real wires and the dummies are not on the signal
path, lithography degradation on those parts has little impact on the
signal itself. Therefore, we can consider that the extension part is not a
part of the real wire, that a gap adjacent to these redundant features is
not effective, and thus that their printability problem can be neglected.
Fig. 6.8(c) shows a finished cell design with line-end extension and dummy
insertion that takes the gap distribution into account. We can see that the
gaps are aligned and the number of effective gaps is reduced. In the
following part, based on the initial real wire assignment, an optimal greedy
algorithm for line-end extension and dummy insertion is introduced to
minimize the number of effective gaps.
Figure 6.6: Process windows comparison among sites e, f and g. (Plot of EPE
in nm against defocus in nm, from -150 to 150; curves: test e, test f,
test g.)
Figure 6.7: Example of effective gap and critical gap.
Figure 6.8: An example of an AOI21 gate for the layout improvement
targeting printability of poly/MG and metal 1, shown in panels (a), (b)
and (c).
6.3.1 An Optimal Greedy Algorithm
Consider the original wire assignment with the line ends on grid, as
Fig. 6.9(a) shows. We define a left/right line-end as active when there is
an adjacent real wire on the right/left grid of the line-end, as Fig. 6.10
shows.
In the first round of the algorithm, each left line-end of the real wires is
extended until this line-end is no longer active or cannot be extended any
more. In the second round, all the right line-ends go through the same
process as in the first round. After both sides of the line-ends have been
extended, all the blank area is filled with dummies. Fig. 6.9(b), (c) and
(d) show the three steps of this greedy algorithm. Alg. 2 gives the
pseudocode of the greedy algorithm.
We can prove that this algorithm is guaranteed to obtain the minimum number
of effective gaps. Since checking whether a line-end is active takes
constant time, the run time of this algorithm is O(mn), where m is the
number of tracks and n is the number of grids in one track.
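The three rounds can be sketched in Python on a small character grid. This is a simplified illustration, not the implementation evaluated in this chapter. It rests on several assumptions that are not taken from the text: a line-end is treated as active when the empty grid just beyond it has a real wire on a neighboring track in the same column; left ends grow toward smaller column indices; an extension always leaves at least one empty grid before the next feature on the same track; and dummies fill every empty grid whose horizontal neighbors were both empty before dummy insertion. The symbols `R`, `E`, `D` and `.` (real wire, extension, dummy, empty) and all function names are hypothetical.

```python
EMPTY, REAL, EXT, DUMMY = ".", "R", "E", "D"


def gap_is_effective(grid, track, col):
    """Assumed test: a real wire sits on an adjacent track in the same column."""
    return any(
        0 <= t < len(grid) and grid[t][col] == REAL
        for t in (track - 1, track + 1)
    )


def extend_round(grid, step):
    """Extend every real line-end facing direction step (-1 left, +1 right)."""
    for j, row in enumerate(grid):
        n = len(row)
        for i in range(n):
            facing = i + step
            is_end = row[i] == REAL and not (0 <= facing < n and row[facing] == REAL)
            if not is_end:
                continue
            pos = i
            while True:
                nxt, beyond = pos + step, pos + 2 * step
                if not (0 <= nxt < n) or row[nxt] != EMPTY:
                    break  # blocked by the cell boundary or another feature
                if 0 <= beyond < n and row[beyond] != EMPTY:
                    break  # assumed rule: keep a one-grid gap to the next feature
                if not gap_is_effective(grid, j, nxt):
                    break  # the line-end is no longer active
                row[nxt] = EXT
                pos = nxt


def insert_dummies(grid):
    """Fill empty grids whose horizontal neighbors were empty before this round."""
    for row in grid:
        snapshot = row[:]
        n = len(row)
        for i in range(n):
            neighbors_free = all(
                not (0 <= k < n) or snapshot[k] == EMPTY for k in (i - 1, i + 1)
            )
            if snapshot[i] == EMPTY and neighbors_free:
                row[i] = DUMMY


def greedy_extend(rows):
    grid = [list(r) for r in rows]
    extend_round(grid, -1)   # first round: left ends
    extend_round(grid, +1)   # second round: right ends
    insert_dummies(grid)     # third round: dummy insertion
    return ["".join(r) for r in grid]


def effective_gap_cells(rows):
    """Rough proxy: count empty grids that face a real wire on a neighboring track."""
    return sum(
        cell == EMPTY and gap_is_effective(rows, j, i)
        for j, row in enumerate(rows)
        for i, cell in enumerate(row)
    )


before = ["RRRR....", "..RRRR..", "RR....RR"]
after = greedy_extend(before)
print(after, effective_gap_cells(before), effective_gap_cells(after))
```

On this toy input, the number of empty grids facing a real wire on a neighboring track, used here only as a rough proxy for effective gaps, drops from 10 to 1.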
6.3.2 Extension
In the previous section, the greedy algorithm minimizes the number of
effective gaps. In order to prevent critical gaps from damaging the
line-ends of the real wires, an extension is added to the previous algorithm.
Figure 6.9: Illustration of the greedy algorithm: (a) original wires,
(b) left end extension, (c) right end extension, (d) dummy insertion. The
legend distinguishes real wires, extensions and dummies.
In reality, only the patterns shown in Fig. 6.11(a) and (b) are critical,
and they can be detected locally in constant time. By simply extending the
wire line-end A or B, the critical gap can be removed, as Fig. 6.11(c), (d)
and (e) show.
To deal with the critical gaps, an initialization step is needed to remove
critical gaps at the beginning. In this step, each line-end goes through a
pattern check to see whether its neighborhood matches the case in
Fig. 6.12(b). If the pattern in Fig. 6.12(b) is found, line-end A is
extended. Note that although the case in Fig. 6.12(a) forms critical gaps,
neither line-end A nor line-end B is blocked. In other words, at least one
of them will be extended automatically later, so no initialization step is
needed. Also note that if the pattern in Fig. 6.12(c) is found, the critical
gap cannot be removed, which means the cell should be redesigned.
After the initialization step, in the first round, the left line-ends are
sorted by position from left to right, and each time the leftmost one is
extended. The extension should never enter the forbidden areas shown in
Fig. 6.11 and stops if the line-end is no longer active or cannot be
extended any more. Then, in the second round, the right line-ends are sorted
from right to left, and the rightmost line-end is extended, one by one,
following the same rule. In the end, dummies fill the free space. The
worst-case run time is O(mn log(mn)).
Figure 6.10: (a) and (b) are the two cases of the active line-end A. Note
that in (c), although the line-end A is adjacent to a real wire, it is not
an active line-end.
Figure 6.11: Two types of critical gaps caused by line-end A, shown in (a)
and (b). Both critical gaps would harm the printing of the line-end B. By
fixing the line-end A, as shown in (c), (d) and (e), the critical gap can be
removed.
6.4 Experimental Results
Experiments were carried out on the metal 1 layer of 115 nm pitch (p115)
and 75 nm pitch (p75) processes. Test cases are built from interval-and-gap
structures to test speed and efficiency. Fig. 6.13 and Fig. 6.14 show the
layout after extension. Comparing these two figures, we can see that without
considering the critical gaps, some critical gaps still exist after
extension, as shown in Fig. 6.13, whereas when critical gaps are considered,
all of them are removed, as shown in Fig. 6.14. Table 6.1 shows the scale
and the runtime of 7 different test benches.
Algorithm 2 Greedy algorithm on minimizing effective gaps
let $G_{i,j}$ represent the ith leftmost grid on the jth track
{First round, extend the left ends}
for all left line-ends on the real wires at $G_{i,j}$ do
    while $G_{i+1,j}$ is not occupied and $G_{i,j}$ is active do
        extend the line-end to $G_{i+1,j}$
        mark $G_{i+1,j}$ as occupied
        $i \leftarrow i + 1$ {the line-end now sits on the new grid}
    end while
end for
{Second round, extend the right ends}
for all right line-ends on the real wires at $G_{i,j}$ do
    while $G_{i-1,j}$ is not occupied and $G_{i,j}$ is active do
        extend the line-end to $G_{i-1,j}$
        mark $G_{i-1,j}$ as occupied
        $i \leftarrow i - 1$ {the line-end now sits on the new grid}
    end while
end for
{Third round, insert dummies}
for all empty grids not next to an occupied grid do
    fill in dummy
end for
For the p115 process, an annular light source and 0.92 NA were adopted.
For the p75 process, an annular immersion light source with 1.40 NA was
used. Process windows were compared between the design with only dummy
insertion and the design produced by the method in this chapter.
Fig. 6.15 and Fig. 6.16 show the comparison results for p115 and p75,
respectively. At 50 nm defocus, our method reduces the EPE by 33.8% for the
p75 process and 47.9% for the p115 process. This large improvement in
printability benefits manufacturability and improves the feasibility of
dense line printing for 1-D circuits.
6.5 Conclusions
With the trend of using 1-D cells for sub-40 nm node circuits, a new design
methodology is needed. In this chapter, with consideration of the impact of
the gap distribution, a new set of fast line-end extension and dummy
insertion methods has been introduced. With careful consideration of the
gaps, the manufacturability of cell-level features has been greatly
improved. Owing to the efficiency of this fast algorithm, the work can be
extended to higher interconnect layers, where line edge roughness due to
lithography is also a widespread concern.
This work presents techniques to improve 1-D cell design while considering
the impact of the process. With the help of the algorithms, we can
effectively evaluate the best printability that a 1-D cell can achieve.
Although the dummy fills and wire extensions increase the wire capacitance
and coupling effect, those features are usually bounded by the cell boundary
and other dense cell features, and are relatively small compared to other
parameters in the cell. Moreover, as our experiment shows, with their help
the yield and printability can be greatly improved, and variation can thus
be reduced.
Figure 6.12: Three cases on the initialization. Panel annotations: "Need
not initiation" for (a); "No solution exists, report violation" for (c).
Figure 6.13: The extension and dummy insertion without considering the
critical gaps.
Figure 6.14: The extension and dummy insertion with considering the
critical gaps.
Table 6.1: Different Test Circuits and the Run Time of Our Program
                              Run time (s)
Test Circuit      Wire #      No Crucial    Crucial
INT 100x100       1673        0.09          0.09
INT 100x1000      16658       0.43          0.73
INT 1000x100      16595       0.4           0.52
INT 500x500       41619       0.94          1.44
INT 500x1000      83402       1.88          3.48
INT 1000x1000     166673      3.86          6.84
INT 1000x5000     833550      19.55         82.58
Figure 6.15: Process window comparison on 115 nm pitch process. (Plot of
EPE in nm against defocus in nm; curves: only dummy insertion, dummy
insertion with line-end extension.)
Figure 6.16: Process window comparison on 75 nm pitch process. (Plot of
EPE in nm against defocus in nm; curves: only dummy insertion, dummy
insertion with line-end extension.)
CHAPTER 7
LAYOUT OPTIMIZATION WITH
PERFORMANCE CONSTRAINTS
7.1 Introduction
In Chapter 6, we introduced the gap distribution problem in 1-D design
for printability and showed a large improvement in printability with
line-end extension and dummy insertion at the 45 nm technology node and
beyond. In this design style, we should use the following tip-to-tip gap
distribution rules for the gridded 1-D design:
• Dummy wires should be inserted into large line-end gaps so that the
final gaps have uniform size.
• Line-end gaps should not be placed adjacent to real wires.
• Line-end gaps should not be adjacent to real wires' line-ends.
However, without considering the impact of the modification on circuit
performance and power consumption, the trade-off between printability and
circuit impact is difficult to control. In order to better understand the
extent to which the layout modification can be applied, as well as its
circuit impact, a solution that controls the layout modification with
circuit performance awareness is needed.
In this chapter, focusing on gridded 1-D layout design, we present a novel
line-end extension and dummy insertion strategy that takes circuit
performance into consideration. With a bounded extension limit on each wire,
we develop the line-end extension and dummy insertion strategy. We introduce
two algorithms with complementary properties, and implement a flow for 1-D
layout modification that considers the preferred gap distribution rules.
Simulation data are given to demonstrate how the extension limit
simultaneously impacts printability, power consumption and circuit delay,
clearly illustrating the trade-off between printability and circuit
performance.
The rest of the chapter is organized as follows. In Section 7.2, the new
problem is introduced. In Section 7.3, the problem is first formulated under
the constraint of impact on circuit performance, and two algorithms based on
different criteria are then introduced. Section 7.4 shows the final
experimental results and Section 7.5 concludes the chapter.
7.2 The 1-D Layout Modification Problem
As shown in Chapter 6, by extending line-ends optimally, we can largely
improve process windows, but the extension of the line-ends can also impact
circuit performance, and this impact takes two main forms. As Fig. 7.1
shows, one impact is the increased load capacitance. Compared to the
pre-modification situation, the extra $C_L$ (made up of the parallel
capacitance $C_p$, the fringing capacitance $C_f$, and the coupling
capacitance $C_c$) is shifted from dummies to the real wires. These extra
capacitances increase the load of the interconnect and therefore increase
delay and dynamic power consumption.
The second impact is on wire resistance. As shown in Fig. 7.2, because the
extended wire is not on the signal path, the extra resistance does not
impact the performance, but the extension of the adjacent wires largely
benefits the printability of the current wire; therefore the variation of
the resistance decreases and circuit delay benefits from this modification.
Therefore, we need a method of extension control in order to analyze the
modification's impact on delay and power, and we need to use this method to
trade off between printability and circuit performance. Furthermore, if
there is no limit on the line-end extension, the exact extended wire length
is difficult to control, and the extension part may become very long and
heavily influence the original design.
In order to better control the amount of circuit extension, but still
optimally modify the circuit layout with limited wire end extension, a new
formulation of the layout modification is needed. We can set up an extension
bound for each wire and not allow the extension length of a wire to exceed
the limit. The new problem can therefore be formulated as: for each wire i
with extension limit $B_i$, extend the two ends by $L_{left}$ and
$L_{right}$, so that:
1. no wire will overlap another;
2. $L_{left} + L_{right} \le B_i$;
3. gaps between line-ends will have size either 1 grid unit or larger than
2 grid units to allow dummy insertion;
4. all critical gaps will be eliminated;
5. the number of regular gaps is minimized.
Figure 7.1: Demonstration of how the extension wire will impact the
capacitance in the circuit. (Schematic with IN and OUT terminals and load
capacitances $C_L$.)
Figure 7.2: The demonstration of how the extension wire will impact the
resistance in the circuit. (Schematic comparing $R + \Delta R$ with $R$.)
The above formulation of the regular circuit design modification uses the
extension limit B to control the extension amount. To study the impact on
circuit performance, we can tune the bound B to characterize the change in
circuit performance, which will be shown in Section 7.4.
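To make the first three constraints concrete, the sketch below checks them for one track. It is only an illustration under stated assumptions: wires are given as inclusive grid intervals sorted from left to right, an extension is a pair (left, right) of non-negative grid counts, and two features that touch are reported together with overlaps. The function name and the violation labels are hypothetical, and constraints 4 and 5 are not checked because they need information from adjacent tracks.

```python
def check_track(wires, extensions, limits):
    """Check constraints 1 to 3 on a single track.

    wires      -- list of (start, end) inclusive grid intervals, sorted left to right
    extensions -- list of (left, right) extension lengths in grid units
    limits     -- list of extension limits B_i
    Returns a list of (violation_label, wire_index) pairs; empty means feasible.
    """
    violations = []
    extended = []
    for k, ((start, end), (left, right), bound) in enumerate(
        zip(wires, extensions, limits)
    ):
        if left < 0 or right < 0 or left + right > bound:
            violations.append(("budget", k))  # constraint 2
        extended.append((start - left, end + right))
    for k in range(len(extended) - 1):
        gap = extended[k + 1][0] - extended[k][1] - 1
        if gap < 1:
            violations.append(("touch_or_overlap", k))  # constraint 1
        elif gap == 2:
            violations.append(("double_gap", k))  # constraint 3
    return violations


wires = [(0, 2), (6, 8), (12, 14)]
limits = [2, 2, 2]
ok = check_track(wires, [(0, 2), (0, 0), (0, 0)], limits)
double = check_track(wires, [(0, 1), (0, 0), (0, 0)], limits)
over = check_track(wires, [(0, 3), (0, 0), (0, 0)], limits)
print(ok, double, over)
```

In the toy track, extending the first wire by two grids leaves gaps of 1 and 3 and is feasible; extending it by one grid leaves a gap of 2; extending it by three grids exceeds its limit and closes the gap entirely.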
7.3 Algorithms
In this section, we introduce two algorithms to solve the problem. For
small-scale designs (mostly at the cell level), the shortest path method
(SPM) is introduced to solve the problem optimally. For large-scale designs,
a fast suboptimal approximated approach is introduced.
7.3.1 Exact Method
In this section, we introduce the SPM to exactly solve the layout
modification problem with a limited wire extension budget. For a wire that
has a maximum extension budget of B, the maximum possible number of options
for extending the wire is $B(B-1)$, and for a gap whose original size is G,
the maximum possible number of options for the wire extension on both sides
is $\frac{1}{2}G(G-1)$, as shown in Fig. 7.3. Considering all the potential
candidates for all the wires on one track, the number of options is
$O((B(B-1))^n)$, or $O((\frac{1}{2}G(G-1))^m)$, where n is the number of
wires and m is the number of gaps.
Figure 7.3: All the potential candidates for a wire with extension limit of
2 (Case 1 through Case 6).
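The candidate set for a single wire can be enumerated directly. The sketch below lists every split of the budget into a left and a right extension, including the case of no extension, and ignores blockage by neighboring features; both choices are assumptions of this example. Its count therefore need not coincide with the closed-form expressions quoted above, but for a limit of 2 it yields six candidates, the same number of cases drawn in Fig. 7.3. The function names are hypothetical.

```python
import math


def extension_candidates(limit):
    """All (left, right) extension pairs with left + right <= limit."""
    return [
        (left, right)
        for left in range(limit + 1)
        for right in range(limit + 1 - left)
    ]


def track_candidate_count(limits):
    """Number of joint candidates on one track, one factor per wire."""
    return math.prod(len(extension_candidates(b)) for b in limits)


single = extension_candidates(2)
joint = track_candidate_count([2, 2, 2])
print(single, joint)
```

Three wires with a limit of 2 each already give 216 joint candidates on one track, which illustrates the exponential growth in the number of wires.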
The combination of different candidate solutions on adjacent tracks
generates different gap distributions, with different costs, from which we
need to choose. In order to check the critical gaps and count the regular
gaps, we need to check the adjacent track as well as the current track.
Therefore, as shown in Fig. 7.4, a combination of two candidates from
adjacent tracks has its own critical gap pattern and number of regular
gaps, which are independent of the candidates on any other track. Within
one track, when a double gap (a gap of two units) occurs, we can assign a
very high cost to that candidate as well. By setting different costs on
critical gaps, regular gaps, double gaps and extension, we can experiment
with the tradeoff among those factors. In this way, we can construct a graph
as shown in Fig. 7.5 from the top-most track to the bottom-most track. Any
path in this graph represents one layout modification solution for the
original design, and the cost of the path is the price that we pay for
printability. Therefore, the minimum cost path, i.e., the shortest path, is
the optimal modification under the required circuit performance
consideration. To find the shortest path in the graph, we can use any
regular shortest path solver.
Figure 7.4: Adjacent tracks have different modifications with different
costs: (a) Candidate 1, cost = 1C + 2E, on tracks m and m+1; (b) Candidate
2, cost = 1R + 2E, on tracks m and m+1. E is one unit of extension cost, R
is one unit of regular gap cost, and C is one unit of critical gap cost.
Here, $E \ll R \ll C$.
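Because the graph is layered by track, its shortest path can also be found with a simple dynamic program over the layers, which is one possible stand-in for a general shortest path solver. The sketch below is a generic illustration: the per-candidate cost `node_cost` (for example extension and double-gap cost) and the adjacent-track cost `pair_cost` (for example regular and critical gap cost) are supplied by the caller, and the toy labels and cost values are invented for this example rather than taken from the experiments.

```python
def best_modification(candidates, node_cost, pair_cost):
    """Minimum-cost choice of one candidate per track.

    candidates -- candidates[t] is the list of candidate solutions of track t
    node_cost  -- function(candidate) -> cost incurred within its own track
    pair_cost  -- function(upper, lower) -> cost between adjacent tracks
    Returns (total_cost, list with the chosen candidate of every track).
    """
    best = [node_cost(c) for c in candidates[0]]
    back = []  # back[t - 1][k] = index chosen on track t - 1 for candidate k of track t
    for t in range(1, len(candidates)):
        new_best, choices = [], []
        for cand in candidates[t]:
            costs = [
                best[k] + pair_cost(prev, cand)
                for k, prev in enumerate(candidates[t - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[k_min] + node_cost(cand))
            choices.append(k_min)
        best = new_best
        back.append(choices)
    k = min(range(len(best)), key=best.__getitem__)
    total = best[k]
    path = [k]
    for choices in reversed(back):
        k = choices[k]
        path.append(k)
    path.reverse()
    return total, [candidates[t][k] for t, k in enumerate(path)]


# Hypothetical three-track example with labeled candidates and made-up costs.
tracks = [["a0", "a1"], ["b0", "b1"], ["c0"]]
own = {"a0": 0, "a1": 2, "b0": 1, "b1": 0, "c0": 0}
between = {
    ("a0", "b0"): 5, ("a0", "b1"): 100,
    ("a1", "b0"): 0, ("a1", "b1"): 1,
    ("b0", "c0"): 0, ("b1", "c0"): 4,
}
total, chosen = best_modification(
    tracks, own.__getitem__, lambda u, v: between[(u, v)]
)
print(total, chosen)
```

The dynamic program visits every pair of candidates on adjacent tracks once, so its cost is dominated by the size of the candidate lists, which is where the exponential growth discussed above enters.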
7.3.2 Approximated Method
Although the SPM provides the optimal wire extension in a small area, it is
an exponential algorithm that is impractical to apply to a large area; in
this case, an approximated method is necessary.
When the impact of critical gaps is not considered, the greedy algorithm in
Chapter 6 has been shown to be optimal for the unlimited wire extension
problem. With the requirement to consider the impact on circuit performance,
the greedy algorithm can easily be changed to handle the limit on line-end
extension. Starting from the left-most wire among all the wires, that wire's
left line-end is extended to the point where the corresponding regular gap
is removed within the extension limit, or is not extended at all if no
corresponding regular gap can be removed. Then, we extend the right line-end
to the point where the remaining extension budget is used up or the
corresponding regular gap of the right line-end is removed. In this way, we
can optimally extend the wire-ends within the extension bounds. This greedy
process is called the left-first method.
Figure 7.5: A graph corresponding to the layout with M tracks. The
shortest path between S and T gives the optimal solution of the wire-end
extension. (Each of Track 1 through Track M contributes one layer of
candidate nodes between the source S and the sink T.)
When critical gaps are taken into account, we have to remove all the
critical gaps without creating new ones. Usually, as shown in Fig. 6.11, a
critical gap involves two line-ends. Normally, extending either of the two
wires removes the critical gap. If one of the two ends cannot be extended
due to blockage or the extension limit, the other line-end must be extended,
which we call a must-do case. In our approximated algorithm, we first deal
with all the must-do cases.
After all the must-do cases are solved, we need to handle the case where a
critical gap exists and both line-ends can be extended to remove it. Since
we have two choices to remove one critical gap, and modifying one line-end
may also introduce other critical gaps, the better choice is not clear at
the beginning. We have to pick one option and later check whether this
option meets all the requirements. If that choice proves "bad", meaning that
it fails to remove the critical gap without introducing other irremovable
critical gaps, we should be able to trace back and choose the other one.
Therefore, a stack is needed to record all the critical gaps where a
modification decision has to be made. A critical gap is popped from the
stack once the choice has been shown to be correct. When all critical gaps
have been solved at a certain extension point (all the critical gaps at a
wire end's extension point have been popped), we consider the initial
critical gap solved. Otherwise, we clear the stack, trace back to the
critical gap in the former stack and switch to the other choice. If both
choices fail, the critical gap is unsolvable. Algorithm 3 is the critical
case elimination (CCE), where $W_i$ is the ith wire end, $C_{i,j}$ is the
critical gap formed by $W_i$ and $W_j$, and $B_i$ is the extension bound of
$W_i$.
Algorithm 3 Function CCE($W_i$)
if no critical gap exists at $W_i$ then
    return true
end if
while a critical gap exists at $W_i$ do
    push all the critical gaps at $W_i$ onto the stack
    mark the top entry in the stack
    if $B_i > 0$ then
        $B_i \leftarrow B_i - 1$
        extend $W_i$ by one grid
        if no critical gap exists then
            return true
        end if
    else
        break
    end if
end while
while the stack is not empty do
    pop the stack until an unmarked $C_{i,j}$ is found or the stack is empty
    while CCE($W_j$) returns true do
        pop the stack to $C_{i,j}$
        if $C_{i,j}$ is marked or the stack is empty then
            return true
        end if
    end while
    if $W_i$ has no extension then
        return false
    end if
    retreat $W_i$ and restore $B_i$
end while
7.4 Experimental Results
We implemented the two algorithms of the previous section in C++. Fig. 7.6
shows the final extension results of the two algorithms. The red ellipses
indicate areas of difference between the optimal solution from the shortest
path method and the approximated approach. Compared to the optimal
solution, the approximated approach has 2 more critical gaps and 2 more
regular gaps. The comparison shows that the approximated method already
provides a near-optimal solution in most of the area, but in regions where
printability is critical, the shortest path method still needs to be applied
to fix those critical small regions.
Figure 7.6: The comparison results with two different approaches in this
chapter: (a) shortest path method, (b) approximated method.
Nevertheless, there is a serious limitation on the problem size that the
SPM can handle. Table 7.1 shows the size of the test benches and the runtime
for the SPM, while Table 7.2 shows the size of the test benches and the
runtime for the approximated method. With a Core Duo T7300 2 GHz CPU and
2 GB of memory, using the shortest path method, a layout with
$20 \times 20$ grids takes 18.15 seconds and a layout with $30 \times 30$
grids cannot be solved due to the memory limit. The exponential runtime and
space limit the applicability of the SPM. However, the approximated method
remains feasible over a much larger range of test bench sizes. From the
comparison of the runtime of the two methods, we can say that for cell-level
design and optimization, where the layout is small, SPM is still an
excellent algorithm to optimally redistribute the gaps and optimize the
printability while considering the circuit performance. Moreover, the
shortest path method can also be combined with the approximated approach to
form a hybrid method. We can let the approximated approach handle the
general large wire extension case, and then apply SPM to specially required
regions or to regions that the approximated method has difficulty handling.
Table 7.1: Runtime of Shortest Path Method
Grid Size # of wires Runtime (ms)
10x10 20 0
20x20 85 280
30x30 182 18150
40x40 316 NA
Several real circuit test benches are adopted to analyze the impact on the
printability, delay and power with a lithography-aware characterization flow.
With the normalized extension limit increasing along the x-axis, Fig. 7.7,
Fig. 7.8 and Fig. 7.9 show the variation trends of EPE, delay and dynamic
power, respectively. While the average edge placement error (EPE) value can
be monotonically reduced by up to 20% with increasing extension limit, the
delay only increases monotonically by up to 1%, and power only increases
monotonically by less than 0.1%. In other words, the extension limit serves
as an accurate handle to tune the required EPE against the intended delay
and power. Because delay and power are much less sensitive to the extension
limit, there is much room to trade off printability against circuit
performance. Depending on the requirement of the circuit design, we can use
controlled gap redistribution to benefit printing while having little
impact on the delay and power of the circuit.
7.5 Conclusion
Motivated by the need for research on 1-D cell design, in this chapter we
studied the gap redistribution and line-end extension problem while
considering circuit performance. Based on a set of preferred gap
distribution rules, we introduced an algorithm to optimally solve
small-scale problems. For large-scale layouts, we further introduced an
approximated algorithm to extend the wire-ends effectively. Experimental
results show a promising tradeoff between printability and circuit
performance and power consumption.
Table 7.2: Runtime of Approximated Method
Grid Size # of wires Runtime (s)
10x10 21 0
100x100 1644 0.03
200x200 6695 0.82
300x300 14993 3.1
400x400 26530 7.51
500x500 41631 16.02
600x600 59840 31.04
800x800 106650 88.96
1000x1000 166696 215.5
2000x2000 666351 3936
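As a rough check on how the approximated method scales, the entries of Table 7.2 can be fitted on a log-log scale. The sketch below is not part of the original experiments: the function name `loglog_slope` is hypothetical, and the 10x10 row is skipped because its reported runtime is 0.

```python
import math

# (number of wires, runtime in seconds) copied from Table 7.2;
# the 10x10 row is omitted because its runtime is reported as 0.
TABLE_7_2 = [
    (1644, 0.03), (6695, 0.82), (14993, 3.1), (26530, 7.51),
    (41631, 16.02), (59840, 31.04), (106650, 88.96),
    (166696, 215.5), (666351, 3936.0),
]


def loglog_slope(points):
    """Least-squares slope of log(runtime) against log(number of wires)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(t) for _, t in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    print(f"fitted exponent: {loglog_slope(TABLE_7_2):.2f}")
```

A fitted exponent that stays a small constant suggests polynomial growth in the number of wires, in contrast to the exponential runtime and memory reported for SPM.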
[Plot omitted. x-axis: Allowed extension limit (%), 0 to 50; y-axis: Average EPE (nm), 1.7 to 2.5; series: Average]
Figure 7.7: The variation trend of EPE along with increasing extension
limit.
[Plot omitted. x-axis: Allowed extension limit (%), 0 to 100; y-axis: Normalized Delay Change, 0.97 to 1.02; series: Average]
Figure 7.8: The variation trend of normalized delay along with increasing
extension limit.
[Plot omitted. x-axis: Allowed extension limit (%), 0 to 100; y-axis: Normalized Power Change, 0.99 to 1.01; series: Average]
Figure 7.9: The variation trend of normalized power along with increasing
extension limit.
CHAPTER 8
CUT-MASK OPTIMIZATION
8.1 Introduction
As the technology node continues to shrink to sub-32 nm, the 1-D on-track
design has been widely used in the gate layer, for example in memory arrays
[17]. The design patterns need to be adjusted to fit the restricted design
rules (RDRs) [6, 19, 27, 28, 32], and the fine-grid 1-D on-track structure
will be adopted in the metal layer as well in the near future [34]. The
circuit patterns are always on track and generated by different kinds of
processes (e.g. SADP [21], SAQP [26], interference lithography [30], and
directed self-assembly (DSA) [35]), and then a trim mask is applied to cut
through the continuous lines and generate intersections. The whole cut
process is demonstrated in Fig. 8.1. However, due to its high complexity and
the extra process steps, the cut mask becomes very costly.
[Figure omitted. Panels: (a) Print dense lines; (b) Trim unwanted patterns by etching; (c) Final design. Labels: pitch p, 2×p, cuts Cut1 to Cut4]
Figure 8.1: Illustration of 1-D design. From dense lines (a), we need to trim
away unwanted patterns (b). Then after etching away the patterns covered
by the cut polygons, the final circuit design will be formed.
In this chapter, we will address the mask cost reduction issue for 1-D
design. Usually, the mask cost can be approximated by the number of polygon
edges. Once we reduce the number of polygon edges on the mask, we can reduce
the mask shot number and, in consequence, the mask write time and
manufacturing cost [36-39]. In this chapter, we propose an approach for 1-D
design to reduce cut mask complexity with circuit performance consideration.
By pushing the polygon edges inward on the cut mask to merge with other
edges, we can generate a modified polygon with fewer edges and lower
complexity. We formulate this polygon simplification problem as a
constrained shortest path problem and use dynamic programming to solve it
optimally.
The rest of the chapter is organized as follows. In Section 8.2, we review
the cut mask for the 1-D design process. The process cost is analyzed and
the cost reduction problem is formulated as a polygon simplification
optimization problem in Section 8.3. The problem is transformed into a
constrained shortest path problem and solved optimally by dynamic
programming in Section 8.4. Section 8.5 presents the experimental results
and analyzes the trade-off between the impact on the circuit performance and
mask cost reduction. Finally, a conclusion is given in Section 8.6.
8.2 Overview of Print-and-Cut Process for 1-D Design
Generally, manufacturing a 1-D design layer has two major steps. Step 1 is
dense line generation, which may vary a lot between different processes.
However, as step 1 is identical for any 1-D circuit design with the same
technology and pitch, the optimization for cost reduction mainly happens in
step 2.
In step 2, after dense lines are initially generated, extra patterns need to
be trimmed to form circuit patterns. As shown in Fig. 8.1, after the cut
patterns are printed onto the parallel features, an etching process is
applied, and the features under the exposed region are etched away. Note
that printing the cut patterns requires a conventional printing technology
with about 45 nm-pitch tracks (for the 22 nm process), which means that the
printability of the cut mask is still a challenge. Since the wires' line-end
gaps are randomly distributed, the cut shapes could be either too
complicated to print or too complex and costly to afford. Thus, circuit
patterns should be well designed in order to guarantee the cut mask's
printability.
[Figure omitted. Panels: (a) Intended cut; (b) Real cut, annotated "Significant EPE"; (c) Line-end Error, annotated "Minor Error"]
Figure 8.2: Cut for the dense line has low printing requirement. Although
the normal edge placement error (EPE) measurement is high in (b), the
impact on the final wire is very limited.
Nevertheless, compared to the critical features in a circuit, the printing
requirement on the cut mask is still relatively low. As shown in Fig. 8.2,
unlike the conventional printing process, the cut mask will affect the final
wafer only where the cut is performed. The rest of the printed patterns are
irrelevant. The most important printing criterion is whether the cut is
printed at its desired location.
Based on these properties of the cut mask, the designer can further perform
design optimization, as illustrated in the following section.
8.3 Problem Formulation
As we mentioned in Section 8.1, cost is a big concern for 1-D design. Note that
in 1-D design, as the dense line generation step for one technology node with
the same pitch is almost identical, the different design patterns are generated
only by different cut styles and locations. In a real implementation, what
the designer can control is the circuit wires and the corresponding cut mask
patterns. The more cuts can be merged together, the fewer edges the cut
polygons will have, and therefore the more the cut mask cost will be
reduced. According to the properties of the cut mask in the previous
section, we can slightly extend some line-ends to reduce the edge number of
the intended cut, and therefore reduce the complexity of the final cut mask
and lower the cost. The basic idea is illustrated in Fig. 8.3.
[Figure omitted. Panels (a) to (f)]
Figure 8.3: Cut mask simplification process. (a), (c) and (e) are the
pre-simplification circuit patterns, cut patterns and post-OPC cut mask,
respectively. (b), (d) and (f) are the corresponding post-simplification
versions. By simplification, the number of cut polygon edges is reduced from
38 to 20, and the number of post-OPC polygon edges is reduced from 176 to 96.
Fig. 8.3(a) shows an original design that needs to be manufactured by 1-D
design, Fig. 8.3(c) shows the corresponding cut polygon and Fig. 8.3(e) is
the post-OPC cut mask. As shown in Fig. 8.3(c), in order to avoid trimming
patterns of the original design, we can only push the vertical edges inwards
to reduce the edge number. Each merging of two adjacent vertical edges
reduces the edge number by 2, since the two vertical edges become one and
the horizontal edge between them disappears. After the cut simplification,
the modified circuit patterns are shown in Fig. 8.3(b), the simplified cut
is shown in Fig. 8.3(d) and the corresponding post-OPC mask is shown in
Fig. 8.3(f). Due to the cut simplification, both the cut edge number and the
corresponding post-OPC mask edge number can be greatly reduced.
However, the mask simplification does not reduce the mask cost for free.
The extra wire patterns introduced by the simplified cut (the extra wires
in Fig. 8.3(b)) at the line-ends have a potential impact on the circuit
performance. Different simplifications have tremendously different impacts
on circuit performance. Therefore, we have to find an optimal way to reach
the cost reduction target and minimize the impact on circuit performance
simultaneously.
Usually, it is difficult to directly control the edge number of the post-OPC
mask in the design phase, and the subsequent OPC process will introduce
more wire fragments. However, due to the low printing requirement on the
cut mask as illustrated in Section 8.2, more fragment merging can be applied
to provide a better correlation between the pre-OPC and post-OPC mask
edge numbers. In that case, we can use the intended pre-OPC cut polygons to
trade off between the impact on circuit performance and mask cost. Without
loss of generality, we assume that the tracks are horizontal and the polygon
is oriented counter-clockwise. We can reduce the edge number of the
polygon by pushing vertical edges inwards. As the polygon area of the cut
mask is proportional to the length of covered dense lines, we can transform
the 1-D design cost reduction problem into a Polygon Simplification problem.
Polygon Simplification:
For a polygon set $R = \{r_1, r_2, \ldots, r_n\}$ and a target edge number $k$,
by only pushing edges in $R$ inwards, find a modified polygon set
$R' = \{r'_1, r'_2, \ldots, r'_n\}$ which has exactly $k$ total edges and
minimum area change.
8.4 Polygon Simplification Algorithm
8.4.1 Algorithm
Constrained Shortest Path (CSP) is the following problem: given an integer
k, a directed graph G, a source node S and a target node T, find the
shortest path from S to T that has exactly k edges in G. CSP was initially
introduced in [40] and can be optimally solved using dynamic programming.
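The dynamic programming idea behind CSP can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [40]: the function name `constrained_shortest_path` and the edge-list representation are hypothetical, and the table `best[e][v]` stores the cheapest cost of reaching node `v` from the source with exactly `e` edges.

```python
import math


def constrained_shortest_path(num_nodes, edges, source, target, k):
    """Cheapest source-to-target path that uses exactly k edges.

    edges is a list of (u, v, weight) tuples. On a directed acyclic graph
    (as built from a polygon) every result is a simple path; on a graph
    with cycles the result may revisit nodes.
    Returns (cost, node_list), or (inf, []) if no such path exists.
    """
    inf = math.inf
    # best[e][v]: minimum cost to reach v from source using exactly e edges
    best = [[inf] * num_nodes for _ in range(k + 1)]
    parent = [[None] * num_nodes for _ in range(k + 1)]
    best[0][source] = 0.0
    for e in range(1, k + 1):
        for u, v, w in edges:
            candidate = best[e - 1][u] + w
            if candidate < best[e][v]:
                best[e][v] = candidate
                parent[e][v] = u
    if best[k][target] == inf:
        return inf, []
    # Walk the parent pointers back from the target, one edge per layer.
    path = [target]
    node = target
    for e in range(k, 0, -1):
        node = parent[e][node]
        path.append(node)
    path.reverse()
    return best[k][target], path
```

The table has (k + 1) rows and every row scans each graph edge once, so the running time of this sketch is proportional to k times the number of graph edges.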
First of all, we show how to transform a polygon into a directed graph
for CSP. Fig. 8.4 shows a polygon P, which will be used to demonstrate
how the CSP graph is built in the following section. Firstly, we label all the
vertical edges with numbers, and all the horizontal edges with hi in the same
sequence. Then, all vertical edges are separated into two categories based on
their heading directions, and the horizontal edges that connect two different
types of vertical edges are labeled as starting points (SP), such as S0 and
S1. Therefore, the polygon is partitioned into several sub-edge arrays (SEA)
by these SPs, such as G1 and G2. Note that in one SEA, all vertical edges
are heading in the same direction, and two adjacent SEAs have different
directions. Since edges in different SEAs can never be merged together, each
SEA corresponds to a subgraph, and the subgraphs can be linked together to
form a complete graph.
[Figure omitted. Labels: vertical edges 1 to 9; horizontal edges h0 to h9; starting points S0 and S1; sub-edge arrays G1 and G2]
Figure 8.4: A polygon example. Vertical edges are labeled with numbers,
and horizontal edges are labeled with hi. S0 and S1 are the starting points
for the sub-edge arrays G1 and G2, respectively.
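The partition into sub-edge arrays can be sketched as below. The representation is an assumption made for illustration only: the polygon is given as a counter-clockwise list of (x, y) vertices, each vertical edge is classified by whether it heads up or down, and consecutive edges with the same heading are grouped. The names `vertical_edges` and `split_into_seas` are hypothetical.

```python
def vertical_edges(vertices):
    """Return (x, y_start, y_end) for each vertical edge, in polygon order."""
    n = len(vertices)
    result = []
    for i in range(n):
        (x0, y0), (x1, y1) = vertices[i], vertices[(i + 1) % n]
        if x0 == x1 and y0 != y1:
            result.append((x0, y0, y1))
    return result


def split_into_seas(vertices):
    """Group consecutive vertical edges that head in the same direction.

    Returns a list of (direction, edges) pairs, where direction is +1 for
    edges heading up and -1 for edges heading down.
    """
    edges = vertical_edges(vertices)
    dirs = [1 if y_end > y_start else -1 for _, y_start, y_end in edges]
    n = len(edges)
    # Start at a direction change so that no group wraps around the list.
    # A closed rectilinear polygon always has both headings, so this exists.
    start = next(i for i in range(n) if dirs[i] != dirs[i - 1])
    ordered = list(zip(dirs[start:] + dirs[:start], edges[start:] + edges[:start]))
    seas = []
    for direction, edge in ordered:
        if seas and seas[-1][0] == direction:
            seas[-1][1].append(edge)
        else:
            seas.append((direction, [edge]))
    return seas


if __name__ == "__main__":
    # A staircase-like right side (three upward edges) and a flat left side.
    polygon = [(0, 0), (4, 0), (4, 1), (3, 1), (3, 2), (5, 2), (5, 3), (0, 3)]
    for direction, group in split_into_seas(polygon):
        print(direction, group)
```

In this example the three upward edges on the right form one SEA and the single downward edge on the left forms another, so merges are only ever considered within each group.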
Fig. 8.5 shows the complete CSP graph for the polygon P, which is
constructed from the subgraphs of G1 and G2. To build a subgraph, we insert
one node for each horizontal edge hi, and set the first node to be the source
and the last node to be the target. Each edge E(i, j) represents a modified
polygon vertical edge, which directly connects the horizontal edges hi and hj
with a certain cost W(i, j) given by the area change. Fig. 8.6 shows how to
set up the edge cost W(i, j). Suppose vertical edge k has the original
location xk in the x direction and xmax(i, j) is the innermost position among
all the vertical edges between hi and hj. This xmax(i, j) will be the
location of E(i, j). Let Lk be the length of the kth vertical edge. The
corresponding edge cost W(i, j) can be expressed by the area change in the
following equation:
$$W_{i,j} = \sum_{k=i+1}^{j} \left(x_{\max}(i,j) - x_k\right) L_k \qquad (8.1)$$
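Eq. 8.1 sums, over the vertical edges k = i+1, ..., j, the distance each edge is pushed to the innermost position multiplied by its length. A minimal sketch, assuming an SEA whose inward direction is +x so that the innermost position is the maximum x; the function name `edge_cost` and the 0-based lists are hypothetical (vertical edge k is stored at index k - 1):

```python
def edge_cost(xs, lengths, i, j):
    """Area change W(i, j) for replacing vertical edges i+1..j by one edge.

    xs[k - 1] and lengths[k - 1] hold the x position and the length of
    vertical edge k. Assumes that pushing inward means moving toward
    larger x, so the merged edge sits at the maximum x in the range.
    """
    if not 0 <= i < j <= len(xs):
        raise ValueError("need 0 <= i < j <= number of vertical edges")
    x_inner = max(xs[i:j])
    return sum((x_inner - x) * length
               for x, length in zip(xs[i:j], lengths[i:j]))


if __name__ == "__main__":
    xs = [3, 1, 2]        # x positions of vertical edges 1, 2, 3
    lengths = [2, 1, 1]   # lengths of vertical edges 1, 2, 3
    for i, j in [(0, 1), (0, 2), (0, 3), (1, 3)]:
        print(f"W({i},{j}) = {edge_cost(xs, lengths, i, j)}")
```

An edge E(i, i+1) keeps vertical edge i+1 where it is, so its cost is zero; the cost grows only when several edges are merged onto the innermost one.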
[Figure omitted. Nodes: S, h0 to h5 (subgraph G1), h5 to h9 (subgraph G2), T]
Figure 8.5: The complete CSP graph of polygon P in Fig. 8.4. The graph is
built by connecting the two subgraphs of G1 and G2 to the source and
target. Any path from S to T represents a modified polygon.
Once the CSP graph in Fig. 8.5 is completed, any path from the source to
the target represents a distinct polygon. Fig. 8.7 shows two paths
A and B and the corresponding polygons. In path A, as the path goes from
h0 to h2, a vertical edge directly connects the horizontal edges h0
and h2, and edge 2 is pushed inward to merge with edge 1. In the same way,
path A merges edges 3 and 5 with edge 4, and merges edge 7 with
edge 8. The final polygon has 10 edges and the area change is 19. Path B
shows a better option, where only 8 edges are finally left and the area
change is also smaller than that of path A. For any given CSP graph of a
polygon, we need to find the optimal path in order to generate our desired
cut polygon with k edges. We implement the dynamic programming algorithm
[40] to solve the problem optimally.
Note that it is necessary to impose a bound on how far each edge can be
pushed inward. This bound limits the performance impact of wire extension.
It can also be used to guarantee that we do not get a degenerate polygon.
To handle these bounds, we simply remove the graph edges that correspond to
bound violations.
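The bound handling described above amounts to leaving out every graph edge whose merge would push some vertical edge farther than allowed. A minimal sketch for one SEA, assuming the inward direction is +x and a single common bound for all edges; `build_sea_edges` is a hypothetical name:

```python
def build_sea_edges(xs, lengths, bound):
    """List the graph edges (i, j, cost) of one SEA that respect the bound.

    Node i stands for horizontal edge h_i, and vertical edge k is stored at
    index k - 1. An edge (i, j) merges vertical edges i+1..j at the
    innermost position (assumed to be the maximum x). It is dropped when
    any of those edges would have to move by more than bound.
    """
    n = len(xs)
    edges = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            group_x = xs[i:j]
            x_inner = max(group_x)
            if any(x_inner - x > bound for x in group_x):
                continue  # bound violation: this merge is not allowed
            cost = sum((x_inner - x) * length
                       for x, length in zip(group_x, lengths[i:j]))
            edges.append((i, j, cost))
    return edges


if __name__ == "__main__":
    print(build_sea_edges([3, 1, 2], [2, 1, 1], bound=1))
```

Edges of the form (i, i+1) never move anything and therefore always survive, so a path from the first to the last node exists for any non-negative bound.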
[Figure omitted. An SEA with vertical edges 1 to 5 at positions x1 to x5 with lengths L1 to L5, horizontal edges h0 to h5, and graph edges W01 to W05 leaving node h0. The weights listed in the right box are:]
For W01, W02 and W03: xmax = x1
W01 = (x1-x1)×L1
W02 = (x1-x2)×L2 + (x1-x1)×L1
W03 = (x1-x3)×L3 + (x1-x2)×L2 + (x1-x1)×L1
For W04 and W05: xmax = x4
W04 = (x4-x4)×L4 + (x4-x3)×L3 + (x4-x2)×L2 + (x4-x1)×L1
W05 = (x4-x5)×L5 + (x4-x4)×L4 + (x4-x3)×L3 + (x4-x2)×L2 + (x4-x1)×L1
Figure 8.6: Build subgraph for an SEA. Each edge from node h0 is assigned
a weight as shown in the right box.
8.4.2 Extensions
In real implementations, each polygon can be given an individual intended
edge number and simplified by the method described in the previous
subsection. However, it is also necessary to simplify multiple polygons
simultaneously with one total intended edge number. Our previous method can
easily be adjusted to address this issue.
Fig. 8.8 shows how to build up the CSP graph from the individual polygons'
CSP graphs. Since the polygons do not interfere with each other once the edge
placement bound is set, by linking the CSP graphs together, we can set up a
total intended edge number for all the polygons and run the CSP algorithm
to optimally solve the polygon simplification problem all at once. The
multi-polygon simplification can be used to simplify the whole cut mask and
achieve global optimality.
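One way to read the linking step is to chain the per-polygon graphs in series, identifying the target node of one graph with the source node of the next. That reading is an assumption made for this sketch, and `link_csp_graphs` is a hypothetical name; each graph is taken to have its source at node 0 and its target at its last node.

```python
def link_csp_graphs(graphs):
    """Chain CSP graphs in series into one graph.

    graphs is a list of (num_nodes, edges) pairs, where edges holds
    (u, v, weight) tuples, node 0 is the source and node num_nodes - 1 is
    the target. The target of each graph is merged with the source of the
    next one. Returns (total_nodes, combined_edges); the combined source
    is node 0 and the combined target is node total_nodes - 1.
    """
    combined = []
    offset = 0
    for num_nodes, edges in graphs:
        combined.extend((u + offset, v + offset, w) for u, v, w in edges)
        offset += num_nodes - 1  # shared node: this target is the next source
    return offset + 1, combined


if __name__ == "__main__":
    g1 = (3, [(0, 1, 0), (1, 2, 0), (0, 2, 5)])
    g2 = (2, [(0, 1, 0)])
    print(link_csp_graphs([g1, g2]))
```

Because no edge connects nodes of different polygons except through the shared nodes, every path through the chain uses some edges from each polygon, and the total edge count of the path is the sum of the per-polygon counts, which is what a single total budget needs.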
Finally, we note that holes might exist inside the polygons. These holes
can also be considered as polygons. The problem can be handled in a
way similar (but with some subtle differences) to our strategy for multiple
polygons.
[Figure omitted. The polygon of Fig. 8.4 and the two simplified polygons obtained from:
Path A: S→h0→h2→h5→h6→h8→h9→T, Edge #: 10, ΔArea: 19
Path B: S→h0→h3→h5→h8→h9→T, Edge #: 8, ΔArea: 12]
Figure 8.7: The path comparison in the CSP graph. From the polygon given in
Fig. 8.4 and the CSP graph in Fig. 8.5, two paths A and B and the
corresponding polygons are given, which demonstrates that different paths
generate different polygons and that the paths in the CSP graph need to be
evaluated to find a better simplified cut polygon.
8.5 Experimental Results
In this section, we present experiments and results on 1-D design cost
reduction and point out the cost-design trade-off.
8.5.1 Experiment Setup
Throughout the experiments, we implement the polygon simplification
algorithm in C++ under an Ubuntu 9.10 environment and adopt several
benchmarks generated from a 28 nm cell library, which contains 25 standard
cells that are specially designed for 1-D design. In these benchmarks, POLY,
M1 and M2 layers are used for the experiments, where POLY and M1 are
localized wires within a cell and M2 works as horizontal wires for intra- and
inter-cell interconnect.
Experiments were performed on a 28 nm process. Estimates of process
control parameters are summarized in Table 8.1. The process parameters
shown in [41,42] are used for the 28 nm process. In order to guarantee the
printability of cut polygons and simplify the comparison without worrying
about extra process steps, such as PSM and SRAF, an aggressive immersion
lithography is used for OPC and optical simulation. The optical working
environment is Calibre v2009.2 18.12. For the polygon simplification, by
setting up one total intended cut polygon edge number k for the whole layer,
we can optimize all polygons simultaneously.
[Figure omitted. Panels: (a) Multiple polygons P1, P2, P3; (b) The CSP graph of individual polygon (Si, Gi, Ti for each Pi); (c) The final CSP graph for the multiple polygons, linking S, G1, G2, G3 and T]
Figure 8.8: Building the CSP graph with multiple polygons. (a) shows the
individual polygons and (b) shows the corresponding CSP graph for each
polygon. By linking the CSP graphs together, the complete graph is shown
in (c).
Table 8.1: Process Control Parameter Used in the Experiment
Line width (nm)      Pitch (nm)           Light source   NA
Poly   M1   M2       Poly   M1   M2
28     28   28       62     62   62       Annular        1.4
8.5.2 Polygon Simplification Results on Different Layers
Because it is difficult to control the OPC process in the design phase, we
have to use the edge number of the intended cut polygons k to control the
trade-off between circuit performance and mask cost. For a given polygon
with original edge number n, there exists a lower bound l on the edge number,
and no simplified version of the polygon can have fewer edges than that
bound. Therefore, for a polygon set, the sum of all the polygons' lower edge
number bounds, $k_{min} = \sum_i l_i$, is the minimum edge number, and the
sum of all the polygons' original edge numbers, $k_{max} = \sum_i n_i$, is
the maximum edge number. A feasible solution can be generated only when the
intended polygon edge number $k \in [k_{min}, k_{max}]$. We define
the ratio of the extended wire length to the original total wire length to
be the impact of cut mask modification on circuit performance. Using the
28 nm benchmarks, we can adjust k within the feasible region to check its
relationship with circuit performance and the post-OPC mask edge number.
The results are shown in Fig. 8.9 and Fig. 8.10, whose x-axis is the
normalized mask edge reduction over the whole feasible region. When x = 0,
no reduction is performed, and when x = 1, the intended polygon edge number
k is set to kmin.
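The mapping between the normalized reduction x and the intended edge number k can be sketched as follows. The linear interpolation follows the two end points stated above (x = 0 gives kmax, x = 1 gives kmin); snapping k to a value reachable by whole merges, each of which removes two edges, is an added assumption, and the function name `intended_edge_number` is hypothetical.

```python
def intended_edge_number(original_edges, lower_bounds, x):
    """Intended edge number k for a normalized mask edge reduction x.

    original_edges[i] is n_i and lower_bounds[i] is l_i for polygon i.
    x = 0 returns k_max = sum(n_i), x = 1 returns k_min = sum(l_i).
    Assumption: k moves in steps of 2 because one merge removes 2 edges.
    """
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    k_max = sum(original_edges)
    k_min = sum(lower_bounds)
    merges = round(x * (k_max - k_min) / 2)
    return k_max - 2 * merges


if __name__ == "__main__":
    n = [12, 8]   # illustrative original edge numbers, not benchmark data
    l = [4, 4]    # illustrative lower bounds, not benchmark data
    for x in (0.0, 0.5, 1.0):
        print(x, intended_edge_number(n, l, x))
```

With these illustrative numbers, x = 0.5 corresponds to the midpoint (kmax + kmin)/2 of the feasible region.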
[Plot omitted. x-axis: Normalized Mask Edge Reduction, 0 to 1; y-axis: Ratio of Extended Wire Length, 0 to 0.14; series: Poly, Metal1, Metal2]
Figure 8.9: Cut polygon edge reduction vs. extended wire length.
Fig. 8.9 shows the relationship between k and the impact on circuit
performance. We can see that the impact on circuit performance grows as
more mask edges are removed, and the relationship between the reduced
edges and the impact follows an exponential trend. In order not to affect
circuit performance too much, a moderate target edge number k should be set,
which can guarantee a considerable mask cost reduction and little impact on
circuit performance simultaneously.
[Plot omitted. x-axis: Normalized Mask Edge Reduction, 0 to 1; y-axis: Normalized Post-OPC Mask Edge #, 0.75 to 1; series: Poly, Metal1, Metal2]
Figure 8.10: Cut polygon edge reduction vs. post-OPC mask edge number.
Indeed, the real mask cost depends on the complexity of the post-OPC layout.
With the same design, different OPC and RET strategies may cause a huge
difference in mask cost. Based on our optical environment setting, we run
OPC on the cut polygons and show the relationship between the post-OPC mask
edge number and k in Fig. 8.10. The figure shows that the edge number of
the post-OPC polygons is linearly dependent on k. In some situations,
a drawback occurs (as shown for the M1 layer in Fig. 8.10) once the edge
reduction goes beyond a certain point. The drawback happens when the edges
on the cut polygons are pushed too aggressively, and we can simply avoid
this situation by setting up a proper bound during the polygon
simplification process or by avoiding too small a k.
Cut polygon simplification performs differently on different layers. As Poly
is usually the most regular layer in the layout, there is not much
simplification space left. From Fig. 8.9 and Fig. 8.10, we can see that with
almost no change to the original design, the post-OPC cut mask complexity of
the Poly layer can be reduced by about 2%. M1 is usually the most dense and
random layer. As there is a large amount of simplification left for us to
do, we can easily reduce the post-OPC cut mask complexity by over 20%. But
in the meantime, it will also cause a noticeable impact on circuit
performance, which we should avoid. Hence, a proper trade-off is needed for
the M1 layer. Compared to the Poly and M1 layers, the M2 layer is a local
interconnect layer, where line-ends usually stop at vias and tracks are not
as dense as those on the Poly and M1 layers. This randomness increases the
cut mask complexity, so that a simplification process is also needed.
[Figure omitted. Columns: k = kmax, k = kcenter, k = kmin; panels (a) to (f); legend: Cut, Extended Wire, Original Wire]
Figure 8.11: The comparison of layouts. A small portion of the M1 layer in a
large layout is shown to illustrate how k impacts the circuit. The cases of
k = kmax, k = kcenter and k = kmin are provided. (a), (b) and (c) show the
intended cut polygons and (d), (e) and (f) show the simulated real cut
polygons.
In reality, k should be selected depending on the exact design and cost
target. In the following experiment, we find that when
$k = k_{center} = (k_{max} + k_{min})/2$, for all the Poly, M1 and M2
layers, the impact on circuit performance is small, and the mask complexity
can be reduced considerably; we will use this kcenter as our intended edge
number k in the experiments.
The experimental results are listed in Table 8.2. On the Poly layer, with
only 0.09% impact on circuit performance, the overall complexity of the
intended cut polygons can be reduced by 2.82%. On the M1 layer, we can
reduce the complexity of the intended cut polygons by 20.6%, while losing
2.12% on circuit performance. On the M2 layer, the complexity of the cut
polygons can be reduced by 14.1%, with 0.95% impact on circuit performance.
Based on our optical environment setting, the post-OPC mask complexity is
also shown, and the Poly, M1 and M2 layers have the complexity of the
post-OPC mask reduced by 1.6%, 12.41% and 9.65%, respectively. The
comparison of printed patterns indicates that the original and our modified
designs have the same level of printability. The program runtime is also
provided.
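The per-layer reductions quoted above can be cross-checked against the intended edge numbers in Table 8.2. The sketch below averages the per-benchmark relative reductions; that this is how the summary figures were obtained is an assumption, and the names `INTENDED_EDGES` and `average_reduction_percent` are hypothetical.

```python
# (ORG, OUR) intended edge numbers for benchmarks T1, T2, T3 in Table 8.2.
INTENDED_EDGES = {
    "Poly": [(706, 692), (634, 612), (660, 640)],
    "M1": [(1840, 1490), (1854, 1464), (1820, 1426)],
    "M2": [(912, 782), (896, 766), (766, 662)],
}


def average_reduction_percent(pairs):
    """Mean of the per-benchmark relative edge reductions, in percent."""
    return 100.0 * sum((org - our) / org for org, our in pairs) / len(pairs)


if __name__ == "__main__":
    for layer, pairs in INTENDED_EDGES.items():
        print(f"{layer}: {average_reduction_percent(pairs):.2f}%")
```

Under this averaging, the three layers come out close to the 2.82%, 20.6% and 14.1% figures stated in the text.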
Fig. 8.11 shows one small portion of the intended M1 layer and the
corresponding printing results for k = kmax, k = kcenter and k = kmin.
We can see that all three cut masks can be printed correctly.
k = kcenter introduces only a few wire extensions and leaves long
edges unchanged, while k = kmin pushes all the movable edges to align them
together and thus introduces a large amount of wire extension.
8.6 Conclusion
High complexity of the cut mask leads to skyrocketing manufacturing cost. In
this chapter, we present a mask cost reduction method with circuit
performance consideration for 1-D design. We simplify the polygons on the
cut mask by formulating the problem as a constrained shortest path problem.
Experimental results show that the intended cut polygon can be used to
control the mask cost and trade off with the impact on circuit performance.
Using the 28 nm benchmarks, we can reduce the cut mask cost by 14.1% on the
Metal 2 layer and 20.6% on the Metal 1 layer with only a little impact
on the designed wire length.
Table 8.2: Experimental Results from Three 28 nm Benchmarks
Test           Intended Edge #        Post-OPC Edge #        Wire Length (nm)          Wire-end Error (nm)   Run
bench  Layer   ORG    OUR    DIFF     ORG    OUR    DIFF     ORG      OUR      DIFF    ORG       OUR         Time(s)
T1     Poly    706    692    1.98%    3112   3084   0.90%    318928   319207   0.09%   2.22068   2.2108      0.34
T1     M1      1840   1490   19.02%   5616   4916   12.46%   309132   315580   2.09%   3.05339   3.0202      1.49
T1     M2      912    782    14.25%   3852   3522   8.57%    317099   320076   0.94%   2.65846   2.81761     0.36
T2     Poly    634    612    3.47%    2996   2922   2.47%    326430   326678   0.08%   2.385     2.38304     0.18
T2     M1      1854   1464   21.04%   5524   4820   12.74%   308574   315239   2.16%   3.25035   3.22347     0.89
T2     M2      896    766    14.51%   3824   3454   9.68%    316448   319673   1.02%   3.16093   3.185       0.24
T3     Poly    660    640    3.03%    3128   3080   1.53%    323950   324229   0.09%   2.04034   2.05875     0.33
T3     M1      1820   1426   21.65%   5502   4840   12.03%   307055   313844   2.21%   3.45755   3.4803      1.56
T3     M2      766    662    13.58%   3324   2968   10.71%   316355   319207   0.90%   2.86366   3.05828     0.27
PART IV
NEW ILLUMINATION
SYSTEM
CHAPTER 9
EUV MASK BLANK DEFECTS
MITIGATION I
9.1 Introduction
As current VLSI technology keeps shrinking to sub-20 nm nodes, EUV
lithography has become a major solution for the next generation lithography
process, with finer resolution and reasonable throughput.
Figure 9.1: A typical structure of EUV system [43].
Because of the intrinsic behavior difference between the 13.5 nm light and
the excimer laser, EUV lithography is much more complex than the
conventional optical lithography system. Fig. 9.1 shows a typical
structure of an EUV system [43], which also has the basic lithography
components, such as light source, projection mirrors, mask, and wafer.
Currently, EUV technology faces several challenges before a mass production
solution is finalized. Besides the difficulties in the exotic light source
setup and the tuning of the resist for line edge roughness and sensitivity,
chip fabrication with defective masks remains a huge challenge [44,45].
The main source of the defects on the blank is the buried defects in the
multilayer (ML) reflective structure. Much work has been done to study
the impact of the defects [46-48]. Usually a defect has the most significant
impact when it is adjacent to a feature boundary, and severe printability
issues, such as bridging or breaks, as shown in Fig. 9.2, are introduced
consequently. The first step to mitigate a defect is to remove it from
the blank. Ion beams and electron beams have been introduced for different
types of defects [45]. A recent report has shown that the printable defect
density has been successfully reduced to $0.63\ \mathrm{cm}^{-2}$ [49]. Note
that defect removal requires a precise defect map from the defect inspection
tools. But, according to [50], there are still many problems with defect
location detection, with 10 nm to 30 nm inaccuracy, and the inspection
throughput is still far from the requirement. During the defect removal
process, damage to the ML structure might also happen [47]. At present,
making a defect-free EUV mask blank is still too costly and impractical.
[Figure omitted. Rows: Aerial Image, Mask Cross-section; columns: No Defect, Buried Particle Defect]
Figure 9.2: Mask defect on the simulated aerial image [51].
Because of the absence of defect-free masks, it is necessary to develop
a process to apply EUV lithography with defective masks. One research
direction with absorber shape modification has been introduced in [51-53].
However, this OPC-like defect mitigation method requires extremely precise
knowledge of the defect location, height and width, and tight control of
process variations such as defocus and overlay, which is impractical at
present. [48, 54] show that as long as the phase change is insignificant on
the feature boundary, the impact of the defect can be ignored, suggesting
that increasing the distance from the defect to the feature boundary is an
effective defect mitigation method. Along this line, [55] demonstrates an
idea to shift the pattern on the blank and move the defects to spare regions
where the printing will not be affected. [56] is the first paper to
demonstrate a pattern shifting and rotation process to mitigate defects
without a precise defect model. Without releasing the algorithm, [56]
reported a total runtime of 146 minutes with 64 CPUs, about 6-7 days of
single-CPU runtime overall, which is extremely inefficient.
Furthermore, [56] fails to report the optimal shift and rotation position when
defects cannot be fully avoided, which is needed to guide the removal of the
most destructive defects.
In this chapter, targeting the disadvantages of [56], we propose an efficient
layout relocation process to minimize the defect impact on feature
boundaries. In our formulation, the following important factors are addressed:
• precise defect model in [57] and defect parameters;
• layout patterns and maximum tolerable critical dimension (CD) change
on each boundary;
• inaccuracy of defect map and misalignment error of the absorber.
We first formulate the original pattern relocation problem as a well-known
rectangle overlapping problem. Then, through tiling, the large problem can
be further divided into many small portions, and the runtime can be reduced
to scale linearly with the number of rectangles. The experimental results
validate our proposed method and show the efficiency of our algorithm. With
the optimal relocation information, we will know the exact positions of the
fewest remaining working defects. This information can further guide the
following defect removal process to increase throughput and cost-efficiency.
The experimental results confirm the necessity of the defect removal process
and the large defect control process for defect mitigation.
The rest of the chapter is organized as follows. In Section 9.2, the
background on EUV masks and buried defects is first introduced. Then, with
a precise defect model and consideration of process variation control, we
formulate our problem as a minimum rectangle overlapping problem in
Section 9.3. In Section 9.4, we further analyze the special structure of our
problem and propose an optimal algorithm for a large amount of data.
Section 9.5 shows the experimental results and analysis, and finally
Section 9.6 concludes this chapter.
9.2 Background
In this section, we will briefly introduce the background of EUV mask
preparation and buried defects.
9.2.1 EUV Mask Preparation
As mentioned in the previous section, EUV mask preparation is a key step
for the success of EUV lithography. Because of the 13.5 nm wavelength,
a multilayer (ML) structure is needed for the reflective optics. As shown
in Fig. 9.3, a mask is made up of a mask blank and patterned absorbers.
Usually, an EUV mask is prepared in two steps. The first step is mask
blank fabrication with the ML structure deposition. This is also the step
in which the buried defects are randomly generated. In the second step, mask
absorbers with designed patterns are placed on top of the mask blank with
a 4X reduction factor. Normally, as the layout to print is smaller than the
blank, there is a certain freedom to find the best layout location on the
blank.
9.2.2 Analytical Model of Buried Defects
According to [57], all the buried defects follow a Gaussian profile after the
smoothing process during ML deposition. The defect parameters, height
and full width at half maximum (FWHM), become the major factors that
determine the impact on the feature CD. Provided that the major printability
issue comes from the phase change on the feature boundary, the damage can be
evaluated by the surface height h_d at the feature boundary as:
Figure 9.3: EUV mask: the blank and layout.
$$h_d = H \cdot 2^{-\left(\frac{2R}{\mathrm{FWHM}}\right)^{2}} \tag{9.1}$$
Figure 9.4: The cross-section of the defect region. Note that the defect size
is exaggerated to better show the Gaussian scheme of the surface.
As shown in Fig. 9.4, H is the height of the defect, R is the distance
from the feature boundary to the center of the defect, and FWHM is the
full width at half maximum. With fitting coefficients m_d = 0.191 nm^{-1} and
b_d = -0.094 [57], we can further determine the critical dimension (CD)
change ΔCD from the surface height h_d at the feature boundary by the
following equation:
$$\Delta CD = \frac{\sqrt{I_{NoDefect}} \cdot (m_d \cdot h_d + b_d)}{\mathrm{ImageSlope}} \cdot \omega \tag{9.2}$$
Normally, I_{NoDefect} = 0.3 as a common image threshold, and the image slope
is equal to 0.0471 nm^{-1}.
9.2.3 Pattern Relocation
From eq. 9.1 and eq. 9.2, ΔCD can be reduced by increasing R, the distance
from the feature boundary to the center of the defect. In eq. 9.2, ω is a
tuning factor for different defect locations, which should be set to reflect
the impact from the absorber. According to [58,59], for a defect inside the
boundary, ω_in = 0.5; for a defect outside the boundary, ω_out = 1.
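The defect model of eq. 9.1 and eq. 9.2 can be sketched in a few lines of
Python. This is only an illustrative sketch: the constants are the values
quoted above, while the function and variable names are hypothetical and not
taken from the original implementation.

```python
import math

# Constants quoted in the text (fitting coefficients from [57]); names are illustrative.
M_D = 0.191           # nm^-1
B_D = -0.094          # dimensionless
I_NO_DEFECT = 0.3     # common image threshold
IMAGE_SLOPE = 0.0471  # nm^-1
OMEGA_IN, OMEGA_OUT = 0.5, 1.0  # tuning factor for defects inside / outside a feature


def surface_height(H, fwhm, R):
    """Eq. 9.1: Gaussian surface height at distance R from the defect center."""
    return H * 2.0 ** (-((2.0 * R / fwhm) ** 2))


def cd_change(h_d, omega):
    """Eq. 9.2: CD change caused by surface height h_d at the feature boundary."""
    return math.sqrt(I_NO_DEFECT) * (M_D * h_d + B_D) / IMAGE_SLOPE * omega
```

At R = FWHM/2 the surface height drops to H/2, which matches the definition of
the full width at half maximum, and for the same h_d a covered defect
(ω_in = 0.5) produces half the CD change of an uncovered one.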
The difference between ω_in and ω_out suggests the feasibility of mitigating
defects by moving features to cover defects with a certain margin, such that
the surface height at the boundary h_{d,in} is small enough. [46] provides
experimental results that demonstrate the effectiveness of this defect
covering strategy. Fig. 9.5 demonstrates the pattern relocation. In
Fig. 9.5(a), as a large surface change occurs at the right boundary of A and
the feature B is covering a defect without enough margin, both features
suffer from defects. After shifting both features by dx, in Fig. 9.5(b), A
covers the defect with a large margin and B is moved away from the defect;
thus both features are immune from the destructive defects. In this way, we
can either shift the feature away from the defect or cover the defect with
some boundary margin to improve the printability. [55] shows the printability
improvement after layout relocation and hints at the need for software that
performs efficient and automatic layout relocation in the long term.
Figure 9.5: Defect shift to mitigate the impact of defect. (a) Before the
shift; (b) after shifting by dx.
Note that the inaccuracy of the defect location and the misalignment of the
absorber deposition on the blank are both considerable at present.
As reported by [50], the defect location accuracy is around 10 to 30 nm, and
less than 15 nm of misalignment is also observed. It is necessary to take this
inaccuracy into consideration in the layout relocation problem.
9.3 Problem Formulation
In this section, we will analyze the layout relocation problem and formulate
the whole problem.
9.3.1 Problem Description
To find the best relocation position, three types of movement (shift,
rotation and flip) can be applied. Note that through the rotation and flip
movements, there are 8 different orientations for the layout on the blank, as
shown in Fig. 9.6. For each orientation, we need to find the corresponding
best shift position, and then after comparing all the shift position
candidates, we can finally generate the optimal position.
Figure 9.6: The 8 possible orientations of the layout on the EUV blank:
(a) original, (b) R90, (c) R180, (d) R270, (e) F0, (f) F90, (g) F180,
(h) F270. The feasibility of the orientation depends on the process and
design requirement.
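The 8 orientations can equivalently be viewed as 8 transformed defect maps
seen by an unrotated layout. The sketch below enumerates them for a single
defect position on a square blank. The coordinate convention (origin at the
blank corner, flip as a mirror about the vertical axis applied before a
counter-clockwise rotation) and the function name are assumptions made for
illustration only.

```python
def orientations(point, side):
    """Map a defect position on a square blank of the given side length to its
    position under each of the 8 orientations (R0..R270, F0..F270).

    Assumed convention: 'F' mirrors about the vertical center line first, then
    'k*90' rotates counter-clockwise about the blank center.
    """
    x, y = point
    result = {}
    for flipped in (False, True):
        px, py = (side - x, y) if flipped else (x, y)
        for k in range(4):
            result[("F" if flipped else "R") + str(90 * k)] = (px, py)
            # Rotate 90 degrees counter-clockwise about the blank center.
            px, py = side - py, px
    return result
```

Each transformed map can then be passed to the same shift-only search, and the
best result over the feasible orientations is kept.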
Note that, unlike the shift movement, rotation and flip might be infeasible
due to certain criteria. The rotation movement needs support from the
EUV exposure tools, and requires different probe cards, which might increase
the difficulty of wafer testing. To carry out a flip movement, the via and pin
locations need to be flipped at the same time; consequently, all the printing
layers need a flip movement, which might be a tough requirement. In a
real implementation, we can always first change the layout to a feasible
orientation and then find the best shift location. In this way, the core
question becomes what the best shift location (dx, dy) is in each orientation,
such that the defect impact is minimized. In the following sections, we will
only consider the pattern shift movement.
Based on the analysis in the previous section, the layout relocation problem
on the EUV mask blank can be described as follows:
Layout Relocation Problem:
Given a design layout of dimensions L × W with (X, Y) shift freedom
in the x and y directions, and an EUV mask blank with known defect
distribution map D = {d_i(h_i, FWHM_i), i = 1, ..., N_defect} with
location inaccuracy l_e and misalignment l_m, find a shift location of the
layout (dx, dy), such that the number of impacted feature boundaries is
minimized.
9.3.2 Determination of Shift Freedom
Due to the size difference between layout and blank, the layout usually has
a certain freedom to shift upward or to the right from the original (0, 0)
position to find the best location for defect mitigation. As layout patterns
must never shift out of the blank boundary, the shift location (dx, dy) has to
satisfy dx ∈ [0, X] and dy ∈ [0, Y], where X and Y are the shift freedom
bounds in the x and y directions.
Fig. 9.7 shows the shift freedom (X, Y) of the layout on the blank. Each
polygon in the layout has its own possible existing region (blue regions RA,
RB and RC in Fig. 9.7(b)). Because the impact of a defect is always localized,
as illustrated in the previous section, only the defects that are covered by
or close to the possible existing region have potential impact on the polygon
boundaries, such as defect 3 on feature A, defect 1 on feature B and defect 2
on feature C in Fig. 9.7(b). Here we call a defect that might have an impact
on a feature boundary a working defect, and the corresponding victim feature
boundary the target boundary.
Figure 9.7: Layout shift movement on the blank. Due to the freedom of the
shift location, features A, B and C only need to consider the impact of
defect 3, 1 and 2, respectively.
Note that since the freedom (X, Y) is much smaller than the size of the
layout, to solve the layout relocation problem, we only need to search for the
shift position for all the pairs of target boundaries and their working
defects. Thus, the problem complexity is no longer a function of the layout
size, but of the number of defects and the scale of the freedom.
9.3.3 Prohibited Defect Region
To find the best shift position, we also need to consider the impact between
each defect/boundary pair. As each boundary might have a distinct sensitivity
to the CD change, it is necessary to study the defect impact for each pair of
target boundary and its working defect.
In process variation control, the maximum allowable CD change
ΔCD_max is set for lithography process parameter selection, and it is highly
correlated with the circuit performance, reliability and yield requirements.
For the critical feature boundaries, such as gate areas or parallel line
edges, ΔCD_max can be very strict; for the non-critical regions, such as
line-ends or dummy features, the value of ΔCD_max can be larger. Usually,
ΔCD_max is provided by layout verification, and it must be addressed in the
defect mitigation process.
With a given ΔCD_max, we can further derive the maximum surface height
on the feature boundary. Note that as ω_in is different from ω_out, we have
two maximum surface heights at the feature boundary: h_{d,max,in} for the
defects inside features and h_{d,max,out} for the defects outside features.
Derived from eq. 9.2, h_{d,max,in} and h_{d,max,out} can be expressed as:
$$h_{d,max,in} = \frac{1}{m_d} \cdot \left( \frac{\Delta CD_{max} \cdot \mathrm{ImageSlope}}{\omega_{in} \cdot \sqrt{I_{NoDefect}}} - b_d \right) \tag{9.3}$$
$$h_{d,max,out} = \frac{1}{m_d} \cdot \left( \frac{\Delta CD_{max} \cdot \mathrm{ImageSlope}}{\omega_{out} \cdot \sqrt{I_{NoDefect}}} - b_d \right) \tag{9.4}$$
Therefore, we can further determine R_{ef,in} and R_{ef,out}, the minimum
distance from the in-feature and out-feature defect center to the feature
boundary respectively, as:
$$R_{ef,in} = \frac{1}{2} \cdot \mathrm{FWHM} \cdot \sqrt{\log_2 H - \log_2 h_{d,max,in}} \tag{9.5}$$
$$R_{ef,out} = \frac{1}{2} \cdot \mathrm{FWHM} \cdot \sqrt{\log_2 H - \log_2 h_{d,max,out}} \tag{9.6}$$
However, keeping the defect center a distance R_ef away from the feature
boundary might not be enough. We must also consider the inaccuracy of the
defect location l_e and the misalignment of the absorber l_m. As those
misplacements are random and could happen in any direction, to guarantee a
safe distance, the location inaccuracies are added to the minimum distance
R_ef to generate the more conservative minimum distances d_in and d_out, as:
$$\text{Inside defect:} \quad d_{in} = R_{ef,in} + l_e + l_m \tag{9.7}$$
$$\text{Outside defect:} \quad d_{out} = R_{ef,out} + l_e + l_m \tag{9.8}$$
In this chapter, we call this conservative minimum distance from the center
of the working defect to the target boundary, d_in or d_out, the safe
distance. Once the distance from the center of the defect to the boundary is
less than the safe distance, the boundary is considered damaged; otherwise,
the boundary is well printed. Note that the safe distance is a function of the
properties of the target boundary (e.g. ΔCD_max) and the properties of the
defect (e.g. H and FWHM). Each pair of working defect and target boundary
will have its unique safe distance.
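The chain from ΔCD_max to the safe distance (eq. 9.3 to eq. 9.8) can be
sketched as follows. The constants are the values quoted in Section 9.2.2; the
function names are hypothetical, and returning a zero R_ef when the defect
peak H is already below the tolerable height is an assumption that the text
does not state explicitly.

```python
import math

# Constants quoted in Section 9.2.2; names are illustrative.
M_D, B_D = 0.191, -0.094
I_NO_DEFECT, IMAGE_SLOPE = 0.3, 0.0471


def max_surface_height(cd_max, omega):
    """Eqs. 9.3/9.4: largest tolerable surface height at the boundary."""
    return (cd_max * IMAGE_SLOPE / (omega * math.sqrt(I_NO_DEFECT)) - B_D) / M_D


def safe_distance(H, fwhm, cd_max, omega, l_e, l_m):
    """Eqs. 9.5-9.8: conservative minimum defect-center-to-boundary distance."""
    h_max = max_surface_height(cd_max, omega)
    if H <= h_max:
        # Assumption: a defect whose peak is already tolerable needs no margin
        # beyond the placement uncertainties.
        r_ef = 0.0
    else:
        r_ef = 0.5 * fwhm * math.sqrt(math.log2(H) - math.log2(h_max))
    return r_ef + l_e + l_m
```

With ω_in = 0.5 the tolerable height is larger than with ω_out = 1, so the
inside safe distance d_in is shorter than d_out for the same defect and
boundary.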
As shown in Fig. 9.8(a), the inside and outside safe distances around a
feature boundary for a certain working defect form a rectangle. We call
this rectangle a prohibited rectangle (PR), from which we should try to keep
the working defect away as much as possible. If we generate all the PRs
from all the target boundaries as shown in Fig. 9.8(b), the problem becomes
equivalent to keeping the center of each working defect out of all of its own
PRs by a proper shift movement.
Figure 9.8: Definition of prohibited rectangle (PR).
9.3.4 Minimum Rectangle Overlapping Problem
In order to achieve the overall minimum defect-impact location, the layout
should be shifted so that each defect avoids its own PRs. As demonstrated
in Fig. 9.9(a), for a defect at location (x_d, y_d) and its PR
R(x_0, y_0, x_1, y_1) with length l = x_1 - x_0 and width w = y_1 - y_0, the
prohibited area for (x_0, y_0) relocation (purple region) is
R(x_d - l, y_d - w, x_d, y_d). If we shift the prohibited area to a new
coordinate system with (x_0, y_0) as the origin, we form a new rectangle
R(x_d - x_1, y_d - y_1, x_d - x_0, y_d - y_0), which is named the prohibited
shift rectangle (PSR) (blue rectangle in Fig. 9.9(b)). Any point (x', y') in
the new coordinate system represents a shift option dx = x', dy = y', and the
PSR covers all the shift options that could bring damage to the target
boundary from the working defect. Therefore, all the shift options covered by
a PSR are the ones that we should avoid. The coordinates of a PSR never exceed
the shift freedom bound R(0, 0, X, Y), because we can never shift our layout
out of the blank boundary.
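The PR-to-PSR mapping can be written as a small helper. This is a sketch
under the assumption that rectangles are stored as (low x, low y, high x,
high y) tuples; the function name is hypothetical, and clipping to the shift
freedom range reflects the bound R(0, 0, X, Y) discussed above.

```python
def prohibited_shift_rectangle(defect, pr, freedom):
    """Return the PSR (dx_lo, dy_lo, dx_hi, dy_hi) for one defect/PR pair,
    clipped to the shift freedom range, or None if no allowed shift can bring
    the defect into the PR."""
    xd, yd = defect
    x0, y0, x1, y1 = pr
    X, Y = freedom
    # A shift (dx, dy) moves the PR to (x0 + dx, y0 + dy, x1 + dx, y1 + dy);
    # the defect falls inside it when xd - x1 <= dx <= xd - x0 (same for y).
    lo_x, lo_y = max(xd - x1, 0), max(yd - y1, 0)
    hi_x, hi_y = min(xd - x0, X), min(yd - y0, Y)
    if lo_x > hi_x or lo_y > hi_y:
        return None
    return (lo_x, lo_y, hi_x, hi_y)
```

For example, a defect at (100, 80) and a PR (60, 50, 90, 70) give the PSR
(10, 10, 40, 30): any shift in that range places the PR over the defect.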
Figure 9.9: Definition of prohibited shift rectangle (PSR). (a) Prohibited
area for (x_0, y_0) relocation; (b) the PSR after the coordinate shift.
Note that each pair of a working defect and its target boundary generates
a PSR through the previous steps. If we draw all the PSRs together
in the same coordinate system, as shown in Fig. 9.10, all the PSRs will be
within the R(0, 0, X, Y) range and might overlap with each other. The number
of rectangles overlapping at one point represents the number of damaged
boundaries if that shift position is adopted by the current layout. In order
to find the best location to minimize damaged boundaries, we only need to
find the point within R(0, 0, X, Y) where the minimum number of PSRs
overlap.
Figure 9.10: Overlapping PSR. The number represents the number of
overlapping PSRs in that region.
At this point, the original layout relocation problem has been formulated
as the minimum rectangle overlapping (Min-RO) problem.
Minimum Rectangle Overlapping (Min-RO) Problem:
Given a range R(0, 0, X, Y) and a set of rectangles
{R(x_i, y_i, x'_i, y'_i), i = 1, ..., n}, find the location (dx, dy) where
the minimum number of rectangles overlap.
9.4 Problem Solutions
In this section, we will focus on the solution for the Min-RO problem in the
previous section.
9.4.1 Overall Algorithm
The generalized Min-RO problem is a well-known computational geometry
problem. If we consider the solution of Min-RO as a black box, the overall
flow can be expressed as in Alg. 4.

Algorithm 4 Overall Algorithm
Require: Defect set D = {d_i, i = 1, ..., q}, feature boundary set
  B = {b_j, j = 1, ..., p}, layout size L × W, shift freedom (X, Y) and
  inaccuracy amount l_e + l_m.
Ensure: best relocation (dx, dy) for layout relocation.
 1: for all feature boundary b_j in B do
 2:   for all defect d_i in D do
 3:     if d_i is b_j's working defect then
 4:       Calculate the PR(d_i, b_j) with eq. 9.3 to eq. 9.8.
 5:       Shift PR(d_i, b_j) to generate the PSR(d_i, b_j).
 6:     end if
 7:   end for
 8: end for
 9: Call Min-RO solver to generate the minimum overlapping region of all
    PSR(d_i, b_j).
10: Report the final location (dx, dy).

Apart from the runtime of the Min-RO solver, the preprocessing time is O(pq),
where p is the total boundary number and q is the total defect number. The
total PSR number n equals pq scaled by the ratio of the number of working
defect/target boundary pairs to the number of all defect/boundary pairs; this
ratio never exceeds 1.
9.4.2 Solutions for Min-RO Problem
Although the generalized solution for the Min-RO problem has been well
studied, the special properties of our problem make it worthwhile to
analyze the detailed algorithm in this subsection.
The general Min-RO problem can be solved by the sweeping line algorithm
in [60]. However, because of the large number of input rectangles (usually
over one hundred million), we have to split the overall range R(0, 0, X, Y)
into small tiles and calculate the minimum overlapping value in each tile
to reduce the problem size each time, as shown in Fig. 9.11. This idea is
similar to the divide-and-conquer strategy in [61], and reduces the internal
space requirement from O(n log(n)) down to O(m), where m is less than the
number of rectangles. In this way, we can reduce the problem size
in each tile to a solvable scale and apply the sweeping line algorithm to
solve it.
Unlike the general overlapping rectangle problem, in which rectangles may be
located anywhere on the x-y plane, in our case all n rectangles are
located in the R(0, 0, X, Y) region, where n ≫ X × Y. Thus, instead of
using the sorted x and y coordinates of the rectangles as the sweeping grid,
which takes O(n log n) time, we can directly use the tile grid as the sweeping
grid, and each rectangle only needs L time to be updated. Here L is the
boundary length of the rectangle, which can be considered a constant. In
this way, the runtime can be reduced from O(n log n) to O(XY + n). As the
shift freedom (X, Y) is usually constant and much smaller than the layout
size, we can thus reduce the complexity of the algorithm to linear. Compared
to the straightforward enumeration method of trying all the positions, which
has runtime complexity O(XYn), our method achieves a large runtime saving.

Figure 9.11: Use tile to reduce the calculation complexity.
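One simple way to reach an O(XY + n) bound on an integer grid is a 2D
difference array followed by a prefix sum. The sketch below uses that
technique as a stand-in; it is not necessarily the grid-based sweeping
procedure implemented in this work. The function name is hypothetical, and
rectangles are assumed to be given as half-open integer cell ranges
[x0, x1) × [y0, y1) inside [0, X) × [0, Y).

```python
def min_overlap(X, Y, rects):
    """Return (count, (dx, dy)) for a grid cell covered by the fewest rectangles.

    rects holds (x0, y0, x1, y1) integer tuples with 0 <= x0 < x1 <= X and
    0 <= y0 < y1 <= Y, interpreted as half-open cell ranges.
    """
    diff = [[0] * (Y + 1) for _ in range(X + 1)]
    for x0, y0, x1, y1 in rects:           # O(1) work per rectangle
        diff[x0][y0] += 1
        diff[x1][y0] -= 1
        diff[x0][y1] -= 1
        diff[x1][y1] += 1
    best, best_pos = None, None
    for x in range(X):                     # O(XY) in-place prefix sum
        for y in range(Y):
            if x > 0:
                diff[x][y] += diff[x - 1][y]
            if y > 0:
                diff[x][y] += diff[x][y - 1]
            if x > 0 and y > 0:
                diff[x][y] -= diff[x - 1][y - 1]
            if best is None or diff[x][y] < best:
                best, best_pos = diff[x][y], (x, y)
    return best, best_pos
```

The returned count is the number of PSRs covering the chosen shift position,
that is, the number of boundaries that remain damaged at that position.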
Note that tiling also lends itself naturally to parallel programming. Since
the tiles are independent of each other, each CPU can grab one tile at a time
and find the minimum overlapping location without interfering with the other
CPUs. After all the tiles are processed, we only need to compare
the outputs from every CPU and generate the minimum overlapping regions.
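The tile-per-worker idea can be sketched with Python's standard
`concurrent.futures` module. All names and the half-open integer rectangle
convention are assumptions for illustration; a thread pool is used only to
keep the sketch self-contained, whereas a CPU-bound implementation would more
likely use separate processes or native threads.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_min(tile, rects):
    """Minimum overlap inside one tile: returns (count, (x, y))."""
    tx0, ty0, tx1, ty1 = tile
    w, h = tx1 - tx0, ty1 - ty0
    count = [[0] * h for _ in range(w)]
    for x0, y0, x1, y1 in rects:
        # Only the part of the rectangle clipped to this tile is counted.
        for x in range(max(x0, tx0), min(x1, tx1)):
            for y in range(max(y0, ty0), min(y1, ty1)):
                count[x - tx0][y - ty0] += 1
    return min((count[i][j], (tx0 + i, ty0 + j))
               for i in range(w) for j in range(h))


def tiled_min_overlap(X, Y, rects, tile_size, workers=4):
    """Split [0, X) x [0, Y) into tiles, solve each independently, then merge."""
    tiles = [(x, y, min(x + tile_size, X), min(y + tile_size, Y))
             for x in range(0, X, tile_size)
             for y in range(0, Y, tile_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda t: tile_min(t, rects), tiles))
    # The merge step only compares one (count, position) pair per tile.
    return min(results)
```

Because each tile touches only its own counters, no locking is needed, and the
final comparison is linear in the number of tiles.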
9.5 Experimental Results
9.5.1 Experimental Setting and Validation
To validate our proposed method, we implement our sequential algorithm in
C++ on a workstation with the Ubuntu 10.10 32-bit
operating system and an Intel Core2 Quad 2.33 GHz CPU.
We carry out our experiment with a 16 nm design that is generated
from the scaled Nangate Standard Cell Library [24], with a 4X reduction
factor. The defects are randomly distributed on the blank, each with its own
height and FWHM setting.
Fig. 9.12 shows a test demonstration of our layout relocation process.
In this test, the layout is a simple AOI22 cell [24] with a size of
1150 nm × 1570 nm; the blank is 2 µm × 2 µm with 10 defects on it.
The maximum CD change for the boundaries is set to 1 nm. Compared to
the original position (0, 0), where the cell has 6 edges damaged, the
relocation position (132, 77) leaves only 1 edge damaged. The whole process
takes less than 1 s to complete.
Figure 9.12: Layout demo for experiments. (a) Pre-shift layout on the blank
at (0, 0): 6 edges (red) are affected by defects; (b) post-shift layout on
the blank at (132, 77): 1 edge (red) is affected by defects. Note that the
size of the defect represents the FWHM of each defect, which is not the
impact range of the defect.
9.5.2 Experimental Results
In order to test the performance, we carry out several tests with full-scale
layouts generated from the Nangate Standard Cell Library and programmed
defect maps with random defects, with FWHM from 50 nm to 200 nm and height
from 3.0 nm to 6.0 nm. With all the shift freedom set to 200 µm,
the experimental results are shown in Table 9.1.

Table 9.1: Experimental Results for Different Test Sets

Test     Defect #   Full Runtime (s)   Early Stop Time (s)   Affected Edge #
Test1    30         1333.81            NA                    2
Test2    30         1334.29            203.7                 0
Test3    30         1333.17            446.35                0
Test4    60         1595.38            NA                    17
Test5    60         1594.01            NA                    19
Test6    60         1537.83            NA                    8
Test7    90         1940.27            NA                    47
Test8    90         1921.76            NA                    62
Test9    120        2061.08            NA                    54
Test10   120        2057.32            NA                    87

As shown in Table 9.1, with the fixed shift freedom, the runtime is mainly
determined by the number of defects in the blank area. As the number of
defects increases, the total runtime to detect the minimum affected boundaries
increases linearly, which validates our runtime analysis in the previous
section. Compared to the reported runtime from [56], the single-core CPU time
is reduced from days to minutes. From column 4 in Table 9.1,
we can further observe that to achieve the optimal shift and rotate position,
it is not necessary to go through the whole search. We can implement an
early stop once a position with zero defect impact is detected, which saves a
large amount of runtime.
9.5.3 Experimental Analysis
In this subsection, we carry out further sets of comparisons to
demonstrate how the defect mitigation effect varies with the defect number
and the defect size, respectively.
With the size of the test layouts and blanks fixed, Fig. 9.13 shows the
relationship between the defect number and the defect impact on the
post-relocation layout. This figure shows 5 test sets of randomized blanks
and the average value. Although there is some randomness in each test case,
we can clearly see that the defect impact follows a roughly linear
relationship with the number of defects on the blank. Therefore, it is always
necessary to reduce the number of defects on the blank as much as possible.
Fig. 9.14 shows another experimental result for the relationship between
the defect size and the defect impact on the post-relocation layout, when
we tune the maximum defect size and keep the other experimental conditions
unchanged. With 5 test sets of randomized blanks, the defect impact
increases superlinearly with the maximum defect size. This is because
a large defect tends to affect more feature boundaries, so the number of spare
areas that can fit those defects shrinks non-linearly. This superlinear
relationship demonstrates that during the defect removal process, large
defects should have higher priority to be removed.

Figure 9.13: Defect number vs. defect impact.
[Plot: x-axis, Normalized Defect # per Blank (10 to 100); y-axis, Damaged
Boundary #; series test1 to test5 and Average.]
9.6 Conclusion
In this chapter, we have presented a pattern relocation process that
minimizes the impact of EUV mask defects on the printed feature boundaries.
We formulate the problem as a minimum rectangle overlapping problem with an
effective buried defect model and a process variation control strategy.
Experimental results validate our methods, and the relocation
results of large layouts from the scaled Nangate open cell library are
provided with competitive runtime. The comparison results between the defect
size/number and the defect impact on the boundaries confirm the necessity of
the defect removal process. With the relocation information, we know
the exact location of the working defects that are causing problems. If
interaction between the defect removal and mask preparation processes is
possible, we can then provide the location information to help remove those
working defects with the minimum amount of work, and thus increase
the throughput and cost-efficiency. With the optimal relocation result and
the possible impact of the defects, we will have a better understanding of the
matching between the design layout and the blanks, which is valuable
information for the mask provider.

Figure 9.14: Defect size vs. defect impact.
[Plot: x-axis, Max Defect Size (nm) (30 to 150); y-axis, Damaged Boundary #;
series test6 to test10 and average.]
CHAPTER 10
EUV MASK BLANK DEFECTS
MITIGATION II
10.1 Introduction
In Chapter 9, we introduced our efficient algorithm for defect mitigation with
layout shift and 90° rotation. However, the limitation of the rotation to
only four options (0°, 90°, 180°, and 270°) causes a significant drop in the
success rate when the defect number increases. In order to increase the
success rate, different methods have been proposed from different aspects.
Without improving the individual defect mitigation success rate, [62] proposes
a layout-blank matching method for an overall higher success rate for multiple
layouts and blanks. However, the efficiency of this method largely depends
on the average success rate of one single layout-blank pairing, which might
drop significantly when more defect mitigation restrictions are required. [63]
introduces another approach that modifies the die floorplanning on the blank,
in which dies are no longer tied to each other but moved independently.
Because in this approach the floorplanning for different layers has to be
exactly the same, one mask failure for one layer might cause a total waste of
all masks for all layers, which makes the die floorplanning method far from
practical.
Fortunately, the reticle holder's alignment scheme provides an extra
relocation dimension, small-angle rotation, which extends the exploration from
the original 2D approach to 3D. In this way, the exploration space is
largely expanded and the original 2D solution space becomes just a special
case when the rotation angle is 0. This extension is non-trivial: it requires
high efficiency to support the data preparation in [62] and enough accuracy
to handle different defect sizes and locations on the different features.
In this chapter, we propose an efficient layout small-angle rotation and
shift approach for better EUV defect mitigation. This is the first work to
perform small rotation angle exploration together with the layout shift on the
blank. By continuously exploring the allowable rotation region to set up
the rotation θ, from 0 degrees to θ_max, we obtain a series of new defect
distributions relative to the layout. Therefore, the success rate for a
one-to-one matching between blank and layout will be increased, as
demonstrated in Fig. 10.1. To solve this problem, we first formulate the
original pattern shift and rotation problem as a 3D cube overlapping problem.
Then, through an extended sweeping line algorithm, the large problem is
further mapped into a minimum rectangle overlapping problem. In this way,
compared to the straightforward method of checking every pair of defect and
feature at every possible rotation and shift position, our runtime is reduced
to scale linearly with the size of the solution space. The experimental
results validate our proposed method and show the efficiency of our algorithm.
Figure 10.1: Rotation helps defect mitigation. (a) Without rotation, no
defect-free location can be found; (b) with a little rotation θ, a
defect-free location can be found.
The rest of the chapter is organized as follows. In Section 10.2, the process
of EUV mask preparation is introduced, and the validation of reticle rotation
and the fiducial mark generation flow are discussed. Then, with a defect
model, we formulate our problem as a minimum cube overlapping problem in
Section 10.3. In Section 10.4, we further analyze the special properties of
our problem and propose an optimal algorithm for a large amount of data.
Section 10.5 shows the experimental results and analysis, and finally
Section 10.6 concludes this chapter.
10.2 Background
In this section, we will briefly introduce the background of the EUV mask
preparation process and discuss the potential reticle rotation method and
fiducial mark generation for defect mitigation.
10.2.1 EUV Mask Preparation
As mentioned in the previous section, EUV mask preparation is a key step
for the success of EUV lithography. Because of the 13.5 nm wavelength, a
multilayer (ML) structure is needed for the reflective optics. Ideally, a mask
is made up of a mask blank and patterned absorbers, and mask preparation
requires two steps. In the first step, the ML structure is coated, and blank
defects are also randomly generated [64]. In the second step, mask absorbers
with designed patterns are placed on top of the mask blank with a 4X reduction
factor. Normally, because the blank is always a little larger than the
layout, a certain margin is left between the layout boundary and the blank
boundary, within which the patterned absorber can move on top of the blank.
10.2.2 Validation of Layout Shift and Small Angle Rotation
Mask alignment is a necessary step for the absorber deposition on the blank
and for a scanner to adjust the mask and have the correct pace to move
with the wafer before and during exposure. Typically, this alignment process
is done with a set of marks, called fiducials, on the mask. Together with the
reticle holder's alignment capability and the fiducials, the layout shift and
small angle rotation for defect mitigation can be validated by either of the
following approaches.
• Extra fiducial alignment. Since defects always exist on the blank
with a certain density, a fiducial set is used for defect inspection and
defect map buildup. Therefore, before the defects are identified (size and
location) by the ML inspection tool and a pattern relocation position is
determined, this specialized fiducial alignment mark should already have been
generated [45]. Thus, another set of fiducial alignment marks can be used to
mark the new pattern relocation position. In this way, at least two fiducial
sets are needed. The first set is used to mark the empty blank and provide the
coordinates of the defect map. This set should be generated by the blank
vendor. The second set is used to mark the relocation position and guide the
scanner during the exposure. This set should be manufactured after the
layout-blank matching is finished [62] but before the absorber is deposited.
• Reticle holder alignment. The extra fiducial alignment approach
requires an extra step from the mask vendor for the second fiducial
generation, which is costly and inconvenient. Therefore, reticle holder
alignment can be adopted to handle the defect relocation without the
second fiducial generation. This approach requires the mask patterning
tools and scanners to do the shift and rotation based on the
first fiducial, and the alignment accuracy relies on the reticle holder.
This method requires much higher alignment accuracy and consistency
of the scanning tools.
Fig. 10.2 shows the mask preparation process for both alignment methods.
Although differences exist in the alignment step, the third step (relocation
position generation, which is the focus of our technique) is identical for
both methods.
10.3 Problem Formulation
In this section, we will analyze the layout relocation problem and formulate
the whole problem.
10.3.1 Definition of Solution Space
Depending on the limits of the mask alignment tool and EUV scanner, the
reticle may have a certain freedom in its small angle rotation and shift. With
maximum shift upper bound F_max and maximum rotation upper bound θ_max,
the correlation between the shift and rotation bounds is linear, as
demonstrated in Fig. 10.3. The 2D exploration with only shift in [56] and in
Chapter 9 can be considered a special case when θ = 0. Note that this small
angle rotation does not contradict the 90° rotation; the two can happen
simultaneously.
Figure 10.2: Mask preparation and fiducial generation flow. The
highlighted third step is the key step for both alignment methods and is the
focus of this chapter as well.
[Flow: blank generation; the first fiducial generation; defect inspection;
relocation position (dx, dy, θ) generation for a certain layout (highlighted);
then either the second fiducial generation followed by absorber deposition and
mask preparation (extra fiducial alignment), or absorber deposition with bias
(dx, dy, θ) and mask preparation (reticle holder alignment); finish.]
Without loss of generality, we consider no 90° rotation in the following
part; the real 90°, 180° and 270° rotations can be handled by the identical
process with newly generated defect maps. All the shift and rotation
candidates lie within the shaded region. In the rest of the chapter, we denote
the shift amount in the X-Y directions as (ΔX, ΔY) and the rotation amount
as θ.
As ΔX has no correlation with ΔY but both ΔX and ΔY are bounded by
the maximum shift upper bound, the whole solution space is an octahedron
(denoted as the Bounding Octahedron (BO) in the rest of the chapter) in the
ΔX-ΔY-θ space, as shown in Fig. 10.4. The equivalent problem is to find the
best relocation movement (ΔX, ΔY, θ) in the BO. Note that F_max is usually
about several hundred microns and θ_max is around several milliradians.
Note that because the size of blanks (150 mm × 150 mm) is much larger
than the normal feature size on the mask (10^2 to 10^5 nm^2) and defects are
randomly distributed on the blank, a small rotation of a few hundred
nanoradian could relocate defects tens or hundreds of microns away, which
indeed provides a new defect distribution map and largely increases the
success rate.

Figure 10.3: The relationship between the rotation and shift.

Figure 10.4: One example of bounding octahedron, which defines the
solution space of shift and rotation.
10.3.2 Prohibited Relocation Cube
During the layout relocation process, a defect might impact different
boundaries depending on the shift and rotation amounts. We introduce
the concept of a prohibited rectangle (PR) to represent the relocation
positions that a defect center should avoid in order not to damage a
certain feature boundary. Each pair of a defect and a feature boundary it may
impact defines one prohibited rectangle, as illustrated by the
yellow rectangles in Fig. 10.5.
Figure 10.5: Definition of prohibited rectangles (yellow) introduced by the
defect (red). The number of prohibited rectangles is determined by the
impact region and boundary number; the size of prohibited rectangles is
determined by the defect size. The feature regions (green) other than the
prohibited rectangles are the allowable region for the covering-only
requirement. (Figure labels: layout, impact region, defect with radius rd,
feature, prohibited rectangle, allowable region.)
In order to calculate the optimal defect relocation movement, we should
consider all the prohibited rectangles simultaneously. As shown in Fig. 10.6,
in order to avoid a prohibited rectangle PRi(Li, Bi, Ri, Ti) for a defect at
location (Xd, Yd), some shift and rotation movements should be prohibited.
If we denote such a prohibited relocation movement as (ΔXi, ΔYi, θi), the
prohibited movement satisfies the following inequalities:
\[ L_i \le X_d \cos\theta_i - Y_d \sin\theta_i + \Delta X_i \le R_i \tag{10.1} \]
\[ B_i \le X_d \sin\theta_i + Y_d \cos\theta_i + \Delta Y_i \le T_i \tag{10.2} \]
Figure 10.6: Definition of prohibited relocation movement of the center of a
defect. (Figure labels: prohibited rectangle with corners (Li, Bi) and
(Ri, Ti); defect center (Xd, Yd); rotation θi about (0, 0) giving
Xd' = Xd cos θi - Yd sin θi and Yd' = Xd sin θi + Yd cos θi; shift (ΔXi, ΔYi).)
However, because eq. 10.1 and eq. 10.2 are nonlinear, they must be
transformed before they can be applied easily. When θi is
small, the small angle approximation sin θi ≈ θi and cos θi ≈ 1 becomes valid.
In this way, eq. 10.1 and eq. 10.2 can be rewritten in linear form as eq. 10.3
and eq. 10.4.
\[ L_i \le X_d - Y_d\,\theta_i + \Delta X_i \le R_i \tag{10.3} \]
\[ B_i \le X_d\,\theta_i + Y_d + \Delta Y_i \le T_i \tag{10.4} \]
To guarantee the validity of eq. 10.3 and eq. 10.4, the total error
|1 + θi - cos θi - sin θi| must be within the accuracy requirement Emax.
Therefore, the maximum allowable rotation for linearity is limited by a
rotation bound θlinear, which depends on the relocation accuracy and (Xd, Yd).
Under the current blank specification, Xd and Yd have a maximum value of
75 mm. In order to guarantee a relocation error of less than 1 nm under the
small angle approximation, the maximum rotation bound is
θlinear = 1.6330 × 10^-4 rad.
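The quoted bound can be reproduced with a short calculation. The sketch below is illustrative only: it assumes the error term is scaled by the largest defect coordinate (75 mm, expressed in nm) and keeps only the leading-order term θ^2/2 of 1 + θ - cos θ - sin θ. The function names are hypothetical.

```python
import math

def linear_rotation_bound(e_max_nm: float, r_max_nm: float) -> float:
    """Largest rotation (rad) that keeps the linearization error within e_max_nm.

    Assumes the leading-order error r_max * theta**2 / 2, i.e. the small-angle
    form of r_max * |1 + theta - cos(theta) - sin(theta)|.
    """
    return math.sqrt(2.0 * e_max_nm / r_max_nm)

def linearization_error(theta: float, r_max_nm: float) -> float:
    """Error r_max * |1 + theta - cos(theta) - sin(theta)| in nm."""
    return r_max_nm * abs((1.0 - math.cos(theta)) + (theta - math.sin(theta)))

# 1 nm accuracy, coordinates up to 75 mm = 75e6 nm
theta_linear = linear_rotation_bound(e_max_nm=1.0, r_max_nm=75e6)
print(f"theta_linear = {theta_linear:.4e} rad")
print(f"error at theta_linear = {linearization_error(theta_linear, 75e6):.6f} nm")
```

Evaluating the full error expression at this angle gives a value very close to 1 nm, which suggests the higher-order terms are negligible at this scale.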
Note that eq. 10.3 and eq. 10.4 form a cube in ΔX-ΔY-θ space, as shown in
Fig. 10.7. Every point in this cube represents one relocation movement that
should be avoided in order not to move the center of the defect into its
prohibited rectangle PRi. We name this cube the prohibited relocation cube
(PRC). In every PRCi, any cross-section in the ΔX-ΔY plane is a rectangle with
Widthi = Ri - Li and Lengthi = Ti - Bi. Note that relocation positions are
always bounded by the BO, and the region of each PRC outside the BO is
trimmed away.
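As a minimal sketch of how a PRC cross-section follows from eq. 10.3 and eq. 10.4, the Python snippet below solves the two linear inequalities for (ΔX, ΔY) at a fixed θ. The `Rect` type and the function name are assumptions made for illustration; trimming against the BO is omitted.

```python
from typing import NamedTuple

class Rect(NamedTuple):
    left: float
    bottom: float
    right: float
    top: float

def prc_cross_section(pr: Rect, xd: float, yd: float, theta: float) -> Rect:
    """Prohibited (dX, dY) rectangle at rotation theta, from eq. 10.3 and 10.4."""
    x_rot = xd - yd * theta   # linearized X coordinate of the rotated defect
    y_rot = xd * theta + yd   # linearized Y coordinate of the rotated defect
    return Rect(pr.left - x_rot, pr.bottom - y_rot,
                pr.right - x_rot, pr.top - y_rot)
```

Because the same offset is subtracted from both sides of each interval, the cross-section keeps the width Ri - Li and length Ti - Bi for every θ; only its position slides linearly with θ.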
Figure 10.7: One example of the prohibited relocation cube. (Axes: ΔX, ΔY
and θ; the ΔX-ΔY cross-section is labeled with Width and Length.)
For the scenario in which θmax > θlinear, we can always slice the rotation
range into small pieces to guarantee the linearity of eq. 10.3 and eq. 10.4. In
this way, with the valid PRC definition, for each new piece we only need to
regenerate the new defect location (Xd, Yd) by calling the rotation function
once. Because in reality θmax is no more than about 10 to 20 times larger than
θlinear and the defect number is only around 100, compared to the number of
cubes (around several billion for a full layout defect mitigation), the number
of rotation operations is negligible.
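The slicing step can be sketched as follows. This is an assumption-laden illustration: the rotation range is split into equal pieces whose half-width does not exceed θlinear, and each defect is rotated exactly to the center of a piece before the linear form is applied to the residual angle. Both function names are hypothetical.

```python
import math

def rotation_slices(theta_max: float, theta_linear: float):
    """Split [-theta_max, theta_max] into equal pieces of half-width <= theta_linear.

    Returns a list of (center, half_width) pairs.
    """
    n = max(1, math.ceil(theta_max / theta_linear))
    half = theta_max / n
    return [(-theta_max + (2 * k + 1) * half, half) for k in range(n)]

def rotate(xd: float, yd: float, theta: float):
    """Exact rotation of a defect center about the origin, called once per piece."""
    c, s = math.cos(theta), math.sin(theta)
    return xd * c - yd * s, xd * s + yd * c
```

With θmax about 12 times θlinear, for example, this yields 13 pieces, so each defect needs only 13 exact rotations.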
10.3.3 Minimum Cube Overlapping Problem
As mentioned in the previous subsection, each pair of a defect and a feature
boundary generates a PRC. If we draw all the PRCs together in the BO, the
PRCs may overlap with each other. The number of PRCs overlapping at one point
represents the number of damaged boundaries if that relocation is adopted. In
order to find the best shift and rotation to minimize damaged boundaries, we
only need to find the point within the BO where the minimum number of PRCs
overlap.
The original layout relocation problem can thus be formulated
as the minimum cube overlapping (Min-CO) problem.
Minimum Cube Overlapping (Min-CO) Problem:
Given a bounding octahedron BO and a set of PRCs, find the location
(ΔX, ΔY, θ) in BO where the minimum number of PRCs overlap.
10.3.4 Modification for Defect Covering-Only Requirement
In certain schemes, the buried defects are preferred or required to be covered
by absorbers in order to generate a worry-free defect mitigation solution. For
this requirement, we can slightly modify the formulation above to meet the
covering-only requirement.
Instead of defining the prohibited rectangle that repels defect centers, we
can define an allowable rectangle (AR) that attracts defect centers. As shown
in Fig. 10.5, all the feature regions except prohibited rectangles are the
ideal relocation positions for the center of the defect, so that the defect is
fully covered by the features. By slicing those ideal relocation positions into
rectangles, we can define each of those rectangles as an allowable rectangle.
Similar to the process illustrated in Section 10.3.2, an allowable relocation
cube (ARC) can be defined to represent the relocation movements that are
allowed for the center of the defect. Thus the whole defect covering-only
problem is equivalent to finding the point within the BO where the
maximum number of ARCs overlap. The corresponding formulated
problem is the maximum cube overlapping (Max-CO) problem.
Maximum Cube Overlapping (Max-CO) Problem:
Given a bounding octahedron BO and a set of ARCs, find the location
(ΔX, ΔY, θ) in BO where the maximum number of ARCs overlap.
10.4 Problem Solutions
Due to the similarities between the Min-CO and Max-CO problems, in this
section, we will use the Min-CO problem as an example to illustrate the
problem solution process.
10.4.1 Problem Reduction
As mentioned in the previous section, the freedom of shift and rotation is
usually very limited by the reticle holder; therefore the movable region for
each defect on the layout is also very small. Because the shift freedom
decreases as the rotation angle increases, the final movable region of each
defect center forms a curved region (green regions in Fig. 10.8). The defects
farther from the rotation center have larger movable regions. All the features
outside the movable regions can be ignored. In other words, there is no need
to read those features from the GDS file into memory. To save memory and disk
load time, we can crop the GDS file to those movable regions and later
handle only the features in those small regions.
Figure 10.8: The defect movable region demonstration. The outside
defects have much larger movable regions than the inside ones. Only the
prohibited rectangles covered by the cropped region need to be considered.
The size of the prohibited rectangles is exaggerated for better illustration.
(Figure labels: Mask; A, B, C, D, E; Defect Center; Prohibited Rectangle;
Defect Movable Region; Fmax; θmax.)
10.4.2 Solutions for Min-CO Problem
In 3D space, it is always computationally expensive to solve irregular
geometry overlapping problems. However, because of the special structure of
this problem, we can solve the Min-CO problem efficiently.
The sweeping line algorithm is an efficient algorithm for the rectangle
overlapping problem [60]. We can extend it to the overlapping problem in 3D
space by sweeping the (ΔX, ΔY) plane along the θ axis and finding the minimum
overlapping region in each cross-section. Note that along the θ direction, all
the cross-sections of PRCs are rectangles and the cross-section of the BO is
always a square. Therefore, on each sweeping plane of a certain θ, the problem
is equivalent to detecting the minimum rectangle overlapping in a square
region. Usually, since the number of PRCs n is much larger than Fmax and θmax
on the grid of 1 nm, instead of using the sorted ΔX, ΔY and θ coordinates of
the PRCs as the sweeping grid, which takes O(n log n) time, we can directly
use the coordinate grid in the solution space (the bounding octahedron in
Fig. 10.4) as the sweeping grid. In this way, the runtime can be reduced
to O(Fmax^2 Nθ + α Nθ n). Here, Nθ is the number of grid points on the θ axis
and α < 1, representing that the number of rectangles in each cross-section of
the (ΔX, ΔY) plane is less than n. Since numerically Fmax^2 is comparable
to n and Fmax^2 Nθ is the full solution space of the bounding octahedron,
our proposed algorithm indeed scales linearly with the size of the solution
space. Compared to the straightforward enumeration method of trying all
the positions, which has the runtime complexity O(n Fmax^2 Nθ), our method
achieves a large runtime saving.
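One possible way to realize the grid-based sweep is sketched below in Python. It is not the implementation used in the experiments: it assumes integer grid coordinates, counts overlaps in each θ slice with a 2D difference array (a swapped-in technique chosen for brevity), and models the BO cross-section as a square whose half-side shrinks linearly with |θ|. All names are hypothetical.

```python
def min_overlap_in_slice(rects, half_side):
    """Return (count, dx, dy) for a grid point covered by the fewest rectangles.

    rects: iterable of (left, bottom, right, top), inclusive integer grid
    coordinates; the slice is the square [-half_side, half_side]^2.
    """
    size = 2 * half_side + 1
    diff = [[0] * (size + 1) for _ in range(size + 1)]
    for left, bottom, right, top in rects:
        # Trim the rectangle to the slice and shift to array indices.
        x0 = max(left, -half_side) + half_side
        x1 = min(right, half_side) + half_side
        y0 = max(bottom, -half_side) + half_side
        y1 = min(top, half_side) + half_side
        if x0 > x1 or y0 > y1:
            continue  # entirely outside the bounding square
        diff[x0][y0] += 1
        diff[x1 + 1][y0] -= 1
        diff[x0][y1 + 1] -= 1
        diff[x1 + 1][y1 + 1] += 1
    best = None
    for x in range(size):
        for y in range(size):
            # In-place 2D prefix sum turns the difference array into counts.
            if x > 0:
                diff[x][y] += diff[x - 1][y]
            if y > 0:
                diff[x][y] += diff[x][y - 1]
            if x > 0 and y > 0:
                diff[x][y] -= diff[x - 1][y - 1]
            if best is None or diff[x][y] < best[0]:
                best = (diff[x][y], x - half_side, y - half_side)
    return best

def min_co(slices, f_max, theta_steps):
    """Sweep theta indices -theta_steps..theta_steps (theta_steps > 0).

    slices maps a theta index to the list of PRC cross-sections in that plane.
    Returns (count, dx, dy, theta_index) of the first minimum found.
    """
    best = None
    for k in range(-theta_steps, theta_steps + 1):
        half_side = f_max * (theta_steps - abs(k)) // theta_steps
        count, dx, dy = min_overlap_in_slice(slices.get(k, []), half_side)
        if best is None or count < best[0]:
            best = (count, dx, dy, k)
    return best
```

Each slice costs time proportional to the number of grid points plus the number of rectangles in that slice, which mirrors the two terms of the complexity stated above.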
Note that our Min-CO problem is also a perfect candidate for parallel
programming. Since there is no correlation between cross-sections, each
CPU can grab one cross-section at a time and find the minimum overlapping
location without interfering with the other CPUs. After all the cross-sections
are processed, we only need to compare the outputs from every CPU
and generate the minimum overlapping regions.
If we consider the solution of Min-CO as a black box, the overall flow can
be expressed as in Alg. 5.
Algorithm 5 Overall Algorithm
Require: Defect set D = {di, i = 1 ... p}, feature boundary set B = {bj, j =
1 ... q}, the bounding octahedron BO.
Ensure: Best relocation (ΔX, ΔY, θ) for layout relocation.
1: for all feature boundary bj in B do
2:   for all defect di in D do
3:     Calculate the PRC(di, bj) with eq. 10.3 and eq. 10.4
4:   end for
5: end for
6: Call Min-CO solver to generate the minimum overlapping region of all
   PRC(di, bj) in BO.
7: Report the final location (ΔX, ΔY, θ).
10.5 Experimental Results
In this section, we implement a sequential version of our algorithm in
C++ on a workstation with an Intel Xeon 2.40 GHz CPU.
We carry out our experiment with an industrial, full-size test layout and a
set of randomly generated defect maps. Since in reality the defects in ML are
always randomly distributed, using randomly generated defect maps instead of
real industrial data will not affect the experimental results. In our
experiment, Fmax is set to 200 µm, Nθ is set to 300, and the grid size of the
shift operation is 1 nm. We then compare our results with the
results in Chapter 9, in which the freedom of rotation is not explored. The
comparison results are shown in Table 10.1. The "Runtime" columns show
the total runtime to explore the full solution space. The "Affected
Edge #" columns show the final impact from defects. When "Affected
Edge #" is 0, no feature boundaries are affected by defects, which means that
all the defects are successfully mitigated. The "Early Stop Time" column
reports the runtime at which the exploration of the first rotation θ plane
containing a defect-free location is finished. From Table 10.1, we can see that
by allowing small rotation in the reticle holder, the success rate increases
from 22% to 100% in our experiment. This increase in success rate suggests a
significant decrease in mask preparation cost and a final benefit to the whole
EUV process. When the full solution space needs to be explored, our algorithm
with rotation needs a longer runtime, simply because the solution space for
the "relocation with rotation" problem is much larger and the "relocation
without rotation" problem is only a subproblem with rotation θ = 0. By studying
the "Early Stop Time" column, we find that the defect-free location can usually
be found at an early stage, so full solution space exploration is unnecessary.
Note that the runtime in Table 10.1 is based on our sequential code.
However, as illustrated in the previous section, the algorithm can easily be
switched to a parallel version, which is expected to achieve a large
speedup because there is little correlation among different threads. This
parallel version will be especially useful for full solution space exploration,
not only to find a successful defect mitigation spot, but also to optimize the
mask preparation process. Compared to the results in [56], where only shift
without rotation was explored and 64 CPUs were used, the maximum total
CPU time in our experiment is 69423 seconds, which is much smaller than
the 64 × 146 minute total CPU time. Given a cluster with 64 CPUs and ideal
scaling, the runtime to explore the full solution space can be expected to
drop to roughly 18 minutes (69423 s / 64).
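The CPU-time comparison can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes ideal linear scaling across 64 CPUs, which is an optimistic simplification; the function name is hypothetical.

```python
def cpu_time_summary(max_runtime_s=69423, ref_cpus=64, ref_minutes=146):
    """Compare our sequential CPU time with the 64-CPU run reported in [56]."""
    ref_cpu_s = ref_cpus * ref_minutes * 60            # total CPU seconds in [56]
    ratio = ref_cpu_s / max_runtime_s                  # how much less CPU time we use
    ideal_wall_min = max_runtime_s / ref_cpus / 60     # ideal 64-CPU wall time
    return ref_cpu_s, ratio, ideal_wall_min

ref_cpu_s, ratio, ideal_wall_min = cpu_time_summary()
print(f"reference CPU time: {ref_cpu_s} s")
print(f"CPU-time ratio: {ratio:.1f}x")
print(f"ideal wall time on 64 CPUs: {ideal_wall_min:.1f} min")
```

Under these assumptions the slowest test map uses about one eighth of the reference CPU time, and its full-space exploration would take about 18 minutes of wall time on 64 CPUs.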
10.6 Conclusions
In this chapter, we have introduced a novel algorithm to mitigate defects
by layout shift and rotation on blanks. After studying the fiducial mark
generation, defect modeling and mitigation processes, we transform
the pattern relocation problem into a minimum cube overlapping problem.
An extended sweeping line algorithm is then applied to solve the problem
efficiently. Compared to the straightforward method, we reduce the
runtime complexity from O(n Fmax^2 Nθ) to linear in the solution space
size. The experimental results show a much higher success rate for defect
mitigation with the extra exploration dimension of rotation.
Table 10.1: Experimental Results for Different Test Sets

                     Relocation without Rotation (a)   Relocation with Rotation
Test Map  Defect #   Runtime (s)   Affected Edge #     Runtime (s)   Early Stop Time (s)   Affected Edge #
Map1      80         318           2                   44465         1207                  0
Map2      80         309           4                   43294         711                   0
Map3      80         283           0                   39586         288                   0
Map4      100        360           0                   50390         369                   0
Map5      100        369           6                   51647         2877                  0
Map6      100        330           2                   46169         693                   0
Map7      120        419           4                   58675         3311                  0
Map8      120        496           8                   69423         1636                  0
Map9      120        345           2                   48346         1071                  0

(a) Chapter 9
PART V
NEW SIMULATION
METHOD
CHAPTER 11
ACCELERATING AERIAL IMAGE
SIMULATION WITH GPU
11.1 Introduction
The shrinking size of semiconductor devices has been driving lithography
technology to become more and more sophisticated. Due to the severe optical
diffraction and interference in current nano-lithography, the printing on
the wafer from the hard mask is no longer "what you see is what you get."
Therefore, aerial image simulation, which simulates the two-dimensional light
intensity on top of the wafer for a given mask layout and light source, is
widely used for lithography verification, optical proximity correction and
wafer pattern prediction. It has become the most critical and fundamental step
in design for manufacturability (DFM). Its quality and efficiency directly
impact the yield and time to market of modern VLSI products.
Aerial image simulation involves a huge number of numerical computations.
Two directions have been explored to improve the efficiency of the
computation: the polygon-based approach [65] and the fast Fourier transform
(FFT)-based approach [66]. Although the FFT-based approach is generally
considered better for image simulation and parallelism, [67] shows that with
proper arrangement of the data and pattern density, both approaches have
similar speed for layouts.
Due to the parallel nature of aerial image simulation, parallel algorithms
can effectively leverage today's multi-core/many-core computing architectures
and thereby achieve great speedup. A graphics processing unit (GPU)
is a microprocessor that specializes in processing 2D/3D image data in
parallel. Its single-instruction multiple-data (SIMD) architecture and massive
number of arithmetic logic units (ALUs) make it a perfect computing platform
for parallel aerial image simulation algorithms. Furthermore, the recent
advancement of the Compute Unified Device Architecture (CUDA) greatly eases
programming on GPUs.
Figure 11.1: Rectangle decomposition. The impact of R((x1, y1) - (x2, y2))
on the central pixel can be calculated by looking up the impacts of the four
shaded rectangles together. (Figure content: R((x1, y1) - (x2, y2)) =
R((0, 0) - (x2, y2)) - R((0, 0) - (x2, y1)) - R((0, 0) - (x1, y2)) +
R((0, 0) - (x1, y1)).)
Accelerating the FFT-based approach with a GPU is straightforward and has
been well studied [68,69]. On the other hand, the study of GPU acceleration
of the polygon-based approach is rather limited. In [67], a GPU implementation
of the polygon-based algorithm is presented for comparison with an
FPGA-based implementation. However, that GPU implementation is not tuned
for the GPU computing platform and therefore achieves very limited speedup. In
this chapter, we propose two GPU-based implementations of the polygon-based
approach. We study the performance of the proposed implementations
and compare them through experiments. We also include the GPU implementation
of [67] in our study. In our experiments, the fastest algorithm we propose
achieves up to 60X speedup over the sequential algorithm. The error
of this algorithm is negligible: its maximum absolute error is on the
order of 10^-7 when the light intensity of an image is normalized to between 0
and 1. On the other hand, the GPU implementation in [67] achieves
at most 4X speedup and its error is on the order of 10^-4.
The rest of this chapter is organized as follows: Section 11.2 introduces
some background on aerial image simulation and CUDA GPU programming.
Section 11.3 presents our proposed GPU-based parallel aerial simulation
algorithms. It also introduces the GPU-based algorithm used in [67].
Section 11.4 presents the experimental results and Section 11.5 concludes this
chapter.
11.2 Background
In this section, we briefly introduce some background on aerial image
simulation and CUDA GPU programming.
11.2.1 Aerial Image Simulation
Aerial image simulation is an important problem in normal manufacturability
analysis and litho-aware design-manufacturing co-optimization. Its task is
to compute the light intensity distribution on the wafer when the lighting
and mask information as well as the lithography model are provided.
Typically, a lithography system is composed of a light source, projection
lens, mask and wafer. These components all impact the light intensity of a
particular spot on the wafer. Since the light source and projection lens are
fixed for all the spots (or pixels) on the wafer, these process parameters are
constant for one exposure. The light intensity I(x, y) at pixel (x, y) can be
obtained by
\[ I(x, y) = \sum_i \omega_i \cdot \left| \phi_i(x, y) \otimes f(x, y) \right|^2 \tag{11.1} \]
in which f(x, y) represents the transparency of the mask at (x, y) (0 means
opaque and 1 means transparent), φi(x, y) is the i-th complex convolution
kernel and ωi is the corresponding weight [65]. ⊗ denotes convolution and
|·| takes the modulus of a complex number. Note that f is mask dependent
while φ and ω are mask-independent lithography system parameters.
The goal of aerial image simulation is to compute I(x, y) for all layout
pixels (x, y) using eq. 11.1. Parameters φi(x, y), ωi, and the layout, which is
a list of rectangles representing the transparent areas, are given as input.
Directly computing the convolution is rather time-consuming. In [65], a
polygon-based approach is proposed. The idea is to pre-compute and store
the convolution of certain basic rectangles in a lookup table. Then the
impact of any rectangle on the light intensity at a pixel can be easily
computed through table lookup. Fig. 11.1 illustrates how it works. First, a
lookup table T of the same size as the convolution kernel is constructed. The
value at location (x, y) in T is the impact of a rectangle R((0, 0) - (x, y)),
where (0, 0) is the bottom-left corner and (x, y) is the top-right corner, on
the central pixel (the red dot in Fig. 11.1). Since the convolution operation
is linear, the impact of any rectangle R((x1, y1) - (x2, y2)) on the central
pixel can be obtained by an addition and two subtractions:
\[ \text{impact} = T(x_2, y_2) + T(x_1, y_1) - T(x_2, y_1) - T(x_1, y_2) \]
Figure 11.2: Rectangle extending outside the lookup table is truncated.
In order to compute the impact of a rectangle on any pixel, we need to
compute the coordinates of the four corners of the rectangle relative to the
lookup table, as if the pixel were located at the center of the lookup table.
Then, through four table lookups and three floating point
additions/subtractions, the impact on the pixel can be obtained. This approach
substantially reduces the runtime by avoiding the repetitive computation of
convolutions. Details of this approach can be found in [70].
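The table lookup can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not from the source: the kernel is a real-valued 2D list, the table is built as a 2D prefix sum with one extra row and column of zeros, rectangles use half-open pixel ranges in table coordinates, and the kernel flip of a true convolution is ignored. Function names are hypothetical.

```python
def build_lookup_table(kernel):
    """T[x][y] = sum of kernel[u][v] for 0 <= u < x and 0 <= v < y.

    The table has one more row and column than the kernel so that
    T[0][y] = T[x][0] = 0 (an empty rectangle has no impact).
    """
    w, h = len(kernel), len(kernel[0])
    table = [[0.0] * (h + 1) for _ in range(w + 1)]
    for x in range(w):
        for y in range(h):
            table[x + 1][y + 1] = (kernel[x][y] + table[x][y + 1]
                                   + table[x + 1][y] - table[x][y])
    return table

def rect_impact(table, x1, y1, x2, y2):
    """Impact of rectangle R((x1, y1) - (x2, y2)) using four table lookups."""
    return table[x2][y2] + table[x1][y1] - table[x2][y1] - table[x1][y2]
```

For a 3 × 3 kernel with entries 1 to 9, `rect_impact(table, 1, 1, 3, 3)` returns the sum of the four kernel samples covered by the rectangle, without revisiting the kernel itself.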
One thing to notice is that if part of a rectangle is outside the region of
the lookup table, we need to truncate the rectangle at the boundary of the
lookup table. The reason is that the part that is outside the lookup table
would also sit outside the convolution kernel, which means that this part will
not contribute to the light intensity at the pixel in eq. 11.1. Therefore, as
shown in Fig. 11.2 truncating the rectangle would give us a result identical
to that of the convolution based approach. If the whole rectangle is outside
the region, the impact of the whole rectangle is ignored.
Another thing to notice is that the convolution kernel is complex and
the modulus operation in eq. 11.1 is nonlinear. Therefore, the polygon-
based approach can only be applied to compute the impact of the real or
imaginary component of one convolution kernel. In order to compute the
total impact of K convolution kernels, we need 2K lookup tables of which
K tables correspond to the real component of the K kernels and the other
K tables correspond to the imaginary component of the K kernels. We then
need to apply the lookup table approach to compute the partial impact from
each table, take the square of them and sum them up. Alg. 6 shows the serial
algorithm of the polygon-based approach. It serves as a base of our study.
Algorithm 6 Polygon-based simulation: base algorithm
Require: 2K lookup tables Tk, size WT × WT (square table), and 2K weights ωk;
N rectangles Ri, each represented by the coordinates of its top, bottom, left
and right boundaries.
Ensure: Aerial image: a 2D array I[x][y], size WI × HI
1: for k = 0 to 2K - 1 do
2:   for i = 0 to N - 1 do
3:     for x = 0 to WI - 1 do
4:       for y = 0 to HI - 1 do
5:         xorig = x - WT/2
6:         yorig = y - WT/2
7:         Compute the relative position of Ri to (xorig, yorig), store result in
           rectangle Rrel
8:         if Rrel overlaps the lookup table then
9:           Truncate Rrel at the four sides of the lookup table, store result in
             Rtrunc.
10:          Ipart[k][x][y] = Ipart[k][x][y]
                              + Tk[Rtrunc.right][Rtrunc.top]
                              - Tk[Rtrunc.left][Rtrunc.top]
                              - Tk[Rtrunc.right][Rtrunc.bottom]
                              + Tk[Rtrunc.left][Rtrunc.bottom]
11:        end if
12:      end for
13:    end for
14:  end for
15:  for x = 0 to WI - 1 do
16:    for y = 0 to HI - 1 do
17:      I[x][y] = I[x][y] + ωk · (Ipart[k][x][y])^2
18:    end for
19:  end for
20: end for
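A compact Python rendering in the spirit of Alg. 6 is sketched below. It is a simplified illustration under stated assumptions: each table is a prefix-sum array of size (WT + 1) × (WT + 1), rectangles are half-open pixel ranges, the overlap test and truncation are done together by clamping to the table, and the image and partial images start at zero. The function name and data layout are hypothetical.

```python
def simulate(tables, weights, rects, width, height):
    """Serial polygon-based simulation: sum of weight_k * (partial image_k)^2.

    tables:  list of 2K prefix-sum lookup tables, each (W_T + 1) x (W_T + 1)
    weights: list of 2K weights
    rects:   list of (left, bottom, right, top) in image pixel coordinates
    """
    w_t = len(tables[0]) - 1
    image = [[0.0] * height for _ in range(width)]
    for table, weight in zip(tables, weights):
        part = [[0.0] * height for _ in range(width)]
        for left, bottom, right, top in rects:
            for x in range(width):
                for y in range(height):
                    x_orig, y_orig = x - w_t // 2, y - w_t // 2
                    # Rectangle relative to the table centered on (x, y),
                    # truncated at the four sides of the table by clamping.
                    x0 = min(max(left - x_orig, 0), w_t)
                    x1 = min(max(right - x_orig, 0), w_t)
                    y0 = min(max(bottom - y_orig, 0), w_t)
                    y1 = min(max(top - y_orig, 0), w_t)
                    if x0 < x1 and y0 < y1:  # rectangle overlaps the table
                        part[x][y] += (table[x1][y1] - table[x0][y1]
                                       - table[x1][y0] + table[x0][y0])
        for x in range(width):
            for y in range(height):
                image[x][y] += weight * part[x][y] ** 2
    return image
```

Clamping gives an empty range whenever a rectangle lies entirely outside the table, so such rectangles contribute nothing, matching the truncation rule described above.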
11.2.2 CUDA GPU Programming
A graphics processing unit (GPU) is a microprocessor that specializes in
processing image data. A modern GPU contains hundreds of arithmetic logic
units (ALUs), so it provides a great opportunity for parallelism. Using the
massively parallel computing power of a GPU to accelerate
computation-intensive tasks has found applications in many research areas. The
recent advancement of the Compute Unified Device Architecture (CUDA) initiated
by NVIDIA makes programming on GPUs much easier than before. In this
subsection, we introduce some basics of CUDA GPU programming.
In the CUDA programming model, the CPU works as a host that organizes
the flow of computing, and the GPU works as a computing device (a
co-processor) that performs the computation-intensive tasks. When the CPU
needs the GPU for computations, it first sends data to the memory
on the GPU and then launches a kernel call to the GPU. The kernel call
launches many threads to process the data in parallel. The threads on
the GPU are organized in a 2-level hierarchy: a 1D/2D/3D array of threads
composes a block and a 1D/2D array of blocks composes the grid. CUDA uses
a single-instruction multiple-data (SIMD) model, which means that all the
threads in one kernel call execute the same code. Nevertheless, the data
each thread works on are usually different. Each thread has a unique index
in a block and each block has a unique index in the grid. Such indices are
used to identify the data that a thread should work on.
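The index arithmetic that maps a thread to its data can be emulated in plain Python. The sketch below mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x` for the 2D case; the Python function names are hypothetical helpers and not part of any CUDA API.

```python
def global_pixel(block_idx, thread_idx, block_dim):
    """Map a 2D (block index, thread index) pair to the pixel the thread handles."""
    bx, by = block_idx
    tx, ty = thread_idx
    dx, dy = block_dim
    return bx * dx + tx, by * dy + ty

def grid_dim(width, height, block_dim):
    """Number of blocks per axis needed to cover a width x height image."""
    dx, dy = block_dim
    return (width + dx - 1) // dx, (height + dy - 1) // dy
```

When the image size is not a multiple of the block size, some threads in the last blocks map to pixels outside the image and would have to be masked out with a bounds check.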
To create an efficient program for CUDA, one must know its
memory organization. The host (CPU) can only communicate with the device
(GPU) through global memory. Once the kernel call is launched, all
threads can access the global memory, but the access has a very high
latency. On the other hand, each thread has its own registers, and threads
within a block can access the shared memory of the block, which has much
lower latency. Note that threads of different blocks cannot communicate
through shared memory; they can only communicate through global memory.
There are also constant and texture memories, which also have high
access latency but come with high-speed caches. Usually, we choose
to put data that all the threads need to read frequently (such as constants
and parameters used in the algorithm) in the constant and/or texture
memory for the benefit of fast cache access.
There are many factors that affect the speed of a GPU implementation.
In this chapter, we optimize our implementation mainly considering the
following factors, which we believe are the key factors for CUDA programming
(a detailed discussion of how these factors impact the performance of a CUDA
implementation can be found in [71]):
• Memory usage: The speeds and sizes of the GPU memories are all different.
  How to intelligently use the memories considering their unique
  features is a key issue for CUDA programming.
• Branching condition: Although CUDA allows kernel code to have branches,
  branching may slow down our implementation if not carefully used.
  Therefore, we need to try our best to limit the branches in the kernel
  code. If branching is necessary, we need to carefully design it so that
  it has the least possible impact on the speed.
• Atomic operations: Atomic operations are expensive for parallel programs
  because they have to be performed in series. Excessive use
  of atomic operations may result in intensive memory contention and
  greatly slow down the GPU program. One way to avoid the use of
  atomic operations is to guarantee that each thread works on
  different memory locations.
11.3 GPU-Based Aerial Image Simulation
GPU-based parallel algorithms can be characterized by the following two
decisions:
1. What each threading block computes.
2. What each thread in the block computes.
Different choices for the above greatly affect the performance of the
algorithm. In this section, we propose two GPU-based parallel aerial image
simulation algorithms and study their advantages and disadvantages. We
also discuss the algorithm presented in [67] in this section.
11.3.1 Rectangle Per Block Approach
The first approach we propose is called the rectangle per block (RPB)
approach. It can be described as follows:
1. Each block computes the impact of one rectangle on all pixels of the
image.
2. Each thread in a block is responsible for computing the impact of the
rectangle on one pixel of the image.
Since the contribution from each lookup table needs to be combined (squared
and summed up) in the end (line 17 in Alg. 6), we use another GPU kernel
call to compute the squared sum. So the whole flow contains two kernel
calls: (1) compute the partial images; (2) combine the partial images into the
final image.
In the RPB approach, each threading block loads one rectangle into the
shared memory for fast access. Then each thread in the block computes the
impact of this rectangle on one pixel by table lookup. Notice that the range
a rectangle can impact is limited, so we do not need to perform computation
on all pixels. We rst compute an impact region (which is also stored in
shared memory for fast access) and each thread will compute the impact on
one or more pixels in the impact region. Alg. 7 shows the algorithm of the
GPU kernel call (1). Note that what line 5 does is essentially line 5 through
10 in Alg. 6.
Notice a technique we applied here: we switched the iteration of k and i in
Alg. 6. For each rectangle, we go through all the lookup tables and compute
their impacts on the pixels and store the results separately. By doing this,
we can save the trouble of computing the impact region of a rectangle for
each lookup table.
After the impacts from all lookup tables are computed, we use another
GPU kernel call to combine the impacts of all lookup tables into one (kernel
call (2)). The algorithm is straightforward; each thread combines one pixel
and performs the operation shown in line 17 of Alg. 6.
The advantage of this approach is that each rectangle is loaded only once
from the global memory to the shared memory. This reduces global memory
accesses and thus the runtime. However, since the partial image array Ipart
is too large to fit into the shared memory, we have to store it in the global
memory, meaning that every time we compute the impact on a pixel, we need
to write to the global memory to update Ipart. Furthermore, since threads
from different blocks are writing to the same Ipart, two threads may update
the same pixel at the same time. Therefore, atomic operations need to be
used to prevent race conditions. Such atomic operations are extremely slow,
especially on global memory.
Algorithm 7 RPB kernel call: compute partial image
1: i = blockIdx.x
2: Impact region G = rectangle Ri bloated by WT/2
3: for k = 0 to 2K - 1 do
4:   for each pixel (x, y) the thread needs to compute do
5:     accumulate the impact of the rectangle to Ipart[k][x][y] (atomic operation
       required)
6:   end for
7: end for
Figure 11.3: Partition the image into tiles: one tile per threading block.
11.3.2 Pixel Per Thread Approach
The time-consuming atomic operation can be avoided if we can guarantee
that only one thread is accessing a pixel at any time. In order to achieve
this, we let each thread compute the light intensity of one unique pixel. The
number of threads we use is then the same as the number of pixels in the
image.
Since the number of threads inside a threading block is limited, we cannot
fit the whole image into one threading block. Therefore, we partition the
whole image into small tiles and let each threading block handle one tile (see
Fig. 11.3). The number of pixels in a tile is equal to the number of threads
in one block.
We call this approach the pixel per thread (PPT) approach. It can be
described as follows:
1. Each block computes the impact of all the rectangles on all the pixels
in one tile.
2. Each thread goes through all the rectangles and computes their impact
on one pixel in the tile.
The algorithm of the PPT approach is shown in Alg. 8. Notice the several
techniques we have applied. The first technique is the same as what we did
with the RPB approach: we moved the iteration over k to the innermost loop
so that the computed location of a rectangle relative to the pixel can be
shared among all lookup tables. Furthermore, we do not need two GPU kernel
calls because when the computation of partial images is completed, the same
thread can be used to combine the partial images into one. We do not need
any thread synchronization before combining the image.
Algorithm 8 PPT kernel call
1: compute the pixel location (x, y) that the thread is responsible for
2: while there are unprocessed rectangles do
3:   load one rectangle to shared memory
4:   for each rectangle in shared memory do
5:     if rectangle does not impact pixel (x, y) then
6:       skip the rectangle
7:     else
8:       for k = 0 to 2K - 1 do
9:         accumulate the impact of the rectangle to Ipart[k][x][y] (does not need atomic operation)
10:      end for
11:    end if
12:  end for
13: end while
14: for k = 0 to 2K - 1 do
15:   I[x][y] += (weight of lookup table k) × (Ipart[k][x][y])^2
16: end for
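The loop structure of Alg. 8 can be emulated on the CPU with the sketch below, where each (x, y) iteration plays the role of one GPU thread. Several things here are assumptions made for illustration and are not taken from the text: each lookup table is modeled as a summed-area table of a sampled kernel, a rectangle is a half-open pixel range [x1, x2) x [y1, y2), and the names summed_area, span and ppt_image are hypothetical. Shared-memory staging is omitted.

```python
def summed_area(kernel):
    """Lookup table S with S[i][j] = sum of kernel[a][b] for a < i, b < j."""
    n, m = len(kernel), len(kernel[0])
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            s[i + 1][j + 1] = kernel[i][j] + s[i][j + 1] + s[i + 1][j] - s[i][j]
    return s

def span(p, lo, hi, r):
    """Clamped table index range [a0, a1) that side [lo, hi) covers seen from p."""
    n = 2 * r + 1
    return min(max(p - hi + 1 + r, 0), n), min(max(p - lo + r + 1, 0), n)

def ppt_image(rects, tables, weights, width, height):
    """Each (x, y) iteration stands in for one GPU thread of the PPT kernel."""
    r = (len(tables[0]) - 2) // 2                  # kernel half-width in pixels
    image = [[0.0] * height for _ in range(width)]
    for x in range(width):
        for y in range(height):
            part = [0.0] * len(tables)             # this pixel's partial images
            for x1, y1, x2, y2 in rects:
                a0, a1 = span(x, x1, x2, r)
                b0, b1 = span(y, y1, y2, r)
                if a0 >= a1 or b0 >= b1:
                    continue                       # rectangle cannot reach the pixel
                for k, s in enumerate(tables):     # k innermost: indices are reused
                    part[k] += s[a1][b1] - s[a0][b1] - s[a1][b0] + s[a0][b0]
            # combine the partial images: weighted sum of squares
            image[x][y] = sum(w * p * p for w, p in zip(weights, part))
    return image
```

Since `part` is private to one pixel, the accumulation needs no atomic operation, and the final combination can run in the same loop body without synchronization.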
Another technique we applied is to load the rectangles into the shared
memory in parallel. In line 3, each thread loads one rectangle into the shared
memory. Then each thread will go through all the loaded rectangles to
compute their impact. This significantly cuts down the global memory access.
With this technique, we only need (block number) × (rectangle number) reads
from the global memory. However, if we do not load the rectangles into the
shared memory but let each thread read directly from the global memory
instead, we need (block number) × (thread number in a block) × (rectangle
number) global memory reads.
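As a rough worked example of the two read counts, the sketch below plugs in the large 1 layout from Table 11.1 (2025 × 1525 pixels, 4904 rectangles) with 16 × 16 tiles. The function name is hypothetical, and the count ignores caching and coalescing effects.

```python
import math

def global_reads(width, height, n_rects, tile=16):
    """Rectangle reads from global memory: (with staging, without staging)."""
    blocks = math.ceil(width / tile) * math.ceil(height / tile)
    threads_per_block = tile * tile
    staged = blocks * n_rects                       # one read per block per rectangle
    direct = blocks * threads_per_block * n_rects   # every thread reads every rectangle
    return staged, direct

staged, direct = global_reads(2025, 1525, 4904)
print(staged, direct, direct // staged)
```

Under these assumptions, staging the rectangles in shared memory reduces the number of global reads by a factor equal to the number of threads in a block.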
A third technique we used is to skip rectangles that do not impact the
pixel. Before we go into the k loop that goes through all the lookup tables and
accumulates the impact, we check if the rectangle is actually located outside
the lookup table. If it is, we can skip the iteration completely and save many
computations. Since the lookup table is usually quite small compared to
the whole layout, a majority of the rectangles will be skipped for each pixel.
Furthermore, since the pixels in a tile are located in close proximity (the tile
size is not very large, usually 16 × 16 for a normal 256-thread block), it is very
likely that all the threads in a block will have the same branching condition.
In most cases, the threads will all choose to skip or to continue computation
with one rectangle. Recall our discussion in the previous section, that having
the same branching condition could significantly improve the performance of
a GPU implementation.
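The skip test and the claim about uniform branching can be explored with a small sketch. It assumes that coordinates are in pixel units, that a rectangle impacts a pixel when the pixel lies inside the rectangle bloated by half the table width, and that boundaries are inclusive; the names impacts and uniform_fraction are hypothetical.

```python
def impacts(rect, x, y, half):
    """True when pixel (x, y) lies inside rect bloated by `half` on every side."""
    x1, y1, x2, y2 = rect
    return x1 - half <= x <= x2 + half and y1 - half <= y <= y2 + half

def uniform_fraction(rects, width, height, half, tile=16):
    """Fraction of (tile, rectangle) pairs whose pixels all take the same branch."""
    uniform = total = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for rect in rects:
                votes = {impacts(rect, x, y, half)
                         for x in range(tx, min(tx + tile, width))
                         for y in range(ty, min(ty + tile, height))}
                uniform += len(votes) == 1   # all skip, or all compute
                total += 1
    return uniform / total
```

Only tiles that straddle the edge of a bloated rectangle diverge, so the fraction approaches one as the layout grows relative to the tile size.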
In our most basic version of the PPT algorithm, we put everything in the
global memory. Only the rectangles are loaded into the shared memory for
faster access in later computations. We call this implementation PPT basic
in our experiments. The lookup tables Tk and their weights are heavily accessed
by all the threads. Moreover, the threads perform only read access to them.
Therefore, putting the lookup tables and weights in constant and/or texture
memory provides significant potential runtime saving. Since we have 26
lookup tables (each of size 257 × 257) in our data, constant memory is too
small to fit them in. Therefore, we put Tk in the texture memory. On the
other hand, we put the weights in the constant memory because they are just 26
floating-point numbers, which can easily fit into even the cache of the constant
memory, providing a 100% cache hit rate. This enhanced implementation is
called PPT const tex in our experiments.
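The sizing argument can be checked with a few lines of arithmetic. The sketch assumes 4-byte single-precision entries, which the text does not state, and uses the 64 KB constant memory size of CUDA devices; the function names are hypothetical.

```python
CONSTANT_MEMORY_BYTES = 64 * 1024   # constant memory available on CUDA devices
FLOAT_BYTES = 4                     # assumption: single-precision entries

def table_bytes(n_tables=26, side=257):
    """Storage needed for all lookup tables."""
    return n_tables * side * side * FLOAT_BYTES

def weight_bytes(n_tables=26):
    """Storage needed for one weight per lookup table."""
    return n_tables * FLOAT_BYTES

print(table_bytes(), weight_bytes(), CONSTANT_MEMORY_BYTES)
```

Under these assumptions even a single 257 × 257 table exceeds the constant memory, while all the weights together occupy about a hundred bytes.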
The advantage of the PPT approach is apparent: it avoids the expensive
atomic operation on global memory. However, we need to read each rectangle
multiple times because we partition the layout into tiles and each tile needs
to load all the rectangles. Overall, we believe the advantage outweighs the
disadvantage because updating the impact on each pixel consumes the most
significant portion of the runtime. Improving the speed of pixel updating
by avoiding atomic operation yields a huge speedup. Our experiments also
verify this point.
11.3.3 Cong and Zou's Approach
In [67], Cong and Zou present a GPU-based implementation as a comparison
to their FPGA-based implementation of the polygon-based aerial image
simulation algorithm. We call their approach the CnZ approach. They also
partition the whole image into multiple tiles (their tile size is 40 × 40).
Unlike our PPT approach, each GPU kernel call computes only one tile
in their approach. So in their flow, they go through all tiles and launch one
GPU kernel call for each tile. Their GPU kernel call can be described as:
1. Each block computes the impact of a set of rectangles on all pixels of
the tile (see footnote 1). The number of rectangles assigned to each block
differs by at most one for load balancing.
2. Each thread in a block computes the impact of all the rectangles
assigned to the block on a set of pixels. Again, the number of pixels
assigned to each thread differs by at most one for load balancing.
Each pixel is computed by only one thread to avoid atomic
operations.
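The "differ by at most one" assignment used in both steps can be sketched as a contiguous split. The helper name and the worker count used in the example are hypothetical; [67] may distribute the work differently (for instance, in an interleaved fashion).

```python
def split_evenly(n_items, n_workers):
    """Return (start, end) index ranges whose sizes differ by at most one."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # the first `extra` workers get one more
        ranges.append((start, start + size))
        start += size
    return ranges

# Example: 209 rectangles over 30 blocks, then 1600 tile pixels over 256 threads.
print(split_evenly(209, 30)[:3], split_evenly(1600, 256)[-1])
```

The same helper covers both levels of the CnZ kernel: rectangles over blocks, and the pixels of a 40 × 40 tile over the threads of one block.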
Each block holds a copy of the image of the tile in its shared memory for
fast access. However, since each block computes the impact of only some of
the rectangles, the impacts need to be aggregated in the end. In [67], they
write the image back to CPU memory and use the CPU to combine the images
from different blocks into one. The purpose is to avoid atomic operations:
if they used the GPU to do the job, the blocks would need to access the same
image in the global memory and atomic operations would be needed.
Unfortunately, the algorithm in [67] considers only one lookup table. In
practice, we usually need to compute the impact of more than 10 lookup
tables for better accuracy. For example, the data we use in our experiment
come with 26 lookup tables. In order to enable their algorithm to handle
practical data, we add a wrapper to iterate through all the lookup tables.
The GPU kernel call is shown in Alg. 9 and the CPU ow is shown in Alg. 10.
The major advantage of the CnZ approach over our PPT approach is
that it stores the partial image in the shared memory, which has a much lower
access latency. However, this comes at a price. Since the size of the shared
memory is very limited, they have to cut the image into tiles and launch
a kernel call for each tile. This means that for a large image, many kernel
calls need to be launched, resulting in significant overhead in kernel launch.
Footnote 1: Our description of this algorithm is a little different from the
original algorithm in [67]. In [67], they compute the impact of one rectangle
corner at a time. For consistency with other algorithms in this chapter, we
describe the algorithm as if it computes the impact of one rectangle at a time.
This slight difference does not affect the performance of the algorithm. In our
experiments, we follow their original algorithm and compute the impact
corner by corner.
Algorithm 9 CnZ kernel call
1: for each rectangle that current block j needs to process do
2:   for each pixel the thread is responsible for do
3:     accumulate the impact of the rectangle to the image in shared memory
4:   end for
5: end for
6: for each pixel the thread is responsible for do
7:   dump the image from the shared memory to the global memory (each block will dump a copy)
8: end for
Algorithm 10 CnZ CPU flow
1: for k = 0 to 2K - 1 do
2:   for each tile do
3:     use CnZ GPU kernel call to compute Ij[k] for this tile
4:     copy the dumped image from global memory to CPU memory
5:     for each pixel in the tile do
6:       for j = 0 to block number - 1 do
7:         accumulate the image from block j into the image of lookup table k
8:       end for
9:     end for
10:  end for
11: end for
12: combine image of all 2K lookup tables into one
Moreover, the images in the shared memory of each block are combined at
the CPU side, which is completely serial. This also causes an increase in the
runtime.
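To get a feel for the launch overhead, the number of kernel calls implied by Alg. 10 (one call per tile per lookup table) can be estimated as below. The function name is hypothetical and the image sizes are taken from Table 11.1; this is an estimate of launch counts only, not of runtime.

```python
import math

def cnz_kernel_launches(width, height, n_tables=26, tile=40):
    """Kernel calls in the wrapped CnZ flow: one per tile per lookup table."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return tiles * n_tables

print(cnz_kernel_launches(463, 275))     # small 1
print(cnz_kernel_launches(2025, 1525))   # large 1
```

Under this estimate the launch count grows from a few thousand on the small layouts to tens of thousands on the large ones, whereas the PPT approach needs a single kernel call per image.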
11.4 Experimental Results and Analysis
In this section, we compare the three approaches introduced in the previous
section through experiments. All the experiments are conducted on a
machine with a dual-socket, dual-core 2.4 GHz Opteron processor, 8 GB of
memory and an NVIDIA GeForce GTX280 GPU.
11.4.1 Tile Size Tuning
In our PPT approach, the image is partitioned into tiles. The size of the
tile has a big impact on the performance of the algorithm. For our PPT
approach, each tile corresponds to a block, so the more tiles we have, the
more thread blocks we need to create. Remember that each thread
block needs to load all the rectangles; more thread blocks means more
loading, which consumes time. However, if the size of the tile is too large
and the number of blocks is too small, we may have a problem balancing the
loads on the stream processors in the GPU. Moreover, the size of a block
affects many other factors such as the number of registers used in a block
and thus would affect the total number of threads assigned to a streaming
processor. The impact of these factors is difficult to analyze in theory. We
need to find the best tile size by experiments.
[Plot: run time (s), 0 to 2.5, versus tile size (4, 8, 16, 32, 64, 128, 256) for test_1, test_2, test_3 and test_4.]
Figure 11.4: Plot of tile size vs. runtime for PPT const tex.
To find out which tile size would give us the best speed, we perform a series
of runs with different tile sizes and observe the difference in the runtime. In
Fig. 11.4, we show a plot of tile size vs. runtime with the PPT const tex
approach. We run four tests on data with different scales and increase the
tile size from 2 × 2 = 4 to 16 × 16 = 256. We can see that the runtime
decreases as we increase the tile size. The curve flattens out when the size
of the tile is 256, meaning that the best performance can be achieved there.
Therefore, in the rest of our experiments, we use 256 as the tile size for
PPT const tex.
The CnZ approach also partitions the image into tiles. In [67], a tile size
of 40 × 40 = 1600 is used. In our experiment, we implement their approach
using the same tile size.
Table 11.1: Data Set for Our Experiments
data        #Rect.   layout size           #pixels
small 1     209      1850 nm × 1100 nm     463 × 275
small 2     212      1750 nm × 1100 nm     438 × 275
small 3     217      1650 nm × 1100 nm     413 × 275
small 4     225      1850 nm × 1150 nm     463 × 275
medium 1    1104     4100 nm × 3100 nm     1025 × 775
medium 2    1297     4220 nm × 3100 nm     1055 × 775
medium 3    1214     4100 nm × 3100 nm     1025 × 775
medium 4    1292     4100 nm × 3100 nm     1025 × 775
large 1     4904     8100 nm × 6100 nm     2025 × 1525
large 2     4897     8320 nm × 6100 nm     2080 × 1525
large 3     4892     8100 nm × 6100 nm     2025 × 1525
large 4     4814     8100 nm × 6100 nm     2025 × 1525
11.4.2 Performance Comparison
We also conduct experiments to compare the performance of the three
approaches. The experiment setup is as follows:
• The convolution kernel covers a 1024 nm × 1024 nm area. We sample
the kernel every 4 nm, resulting in lookup tables of size 257 × 257.
We use a total of 13 complex kernels. Since two lookup tables need
to be built for each kernel (one for the real part and another for the
imaginary part), we have 26 lookup tables in total.
• We use three sets of layouts for our experiments: a small set, a medium
set and a large set. Each set contains four layouts. Table 11.1 provides the
information on all the data, including the number of rectangles, the size
of the layout and the number of pixels in the image after we sample
every 4 nm.
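The grid sizes quoted in the setup follow from the 4 nm sampling step, as the sketch below illustrates. The rounding rules are assumptions inferred from the quoted numbers rather than stated in the text: the kernel is assumed to be sampled at both end points, and a partial pixel at the layout edge is assumed to be rounded up. The function names are hypothetical.

```python
import math

def table_side(extent_nm=1024, step_nm=4):
    """Samples across the kernel, counting both end points."""
    return extent_nm // step_nm + 1

def image_side(length_nm, step_nm=4):
    """Pixels across one layout dimension, rounding a partial pixel up."""
    return math.ceil(length_nm / step_nm)

def lookup_table_count(n_complex_kernels=13):
    """One table for the real part and one for the imaginary part."""
    return 2 * n_complex_kernels

print(table_side(), lookup_table_count(), image_side(1850), image_side(1100))
```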
We implement the serial algorithm (Alg. 6) on CPU and use its runtime
and image result as the base of comparison. We compare the runtime as well
as the image result of all the GPU-based implementations against the base
and obtain the speedup and error. The error is measured as the maximum
absolute error among all the pixels of the image. The comparison result is
shown in Table 11.2.
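The two comparison metrics can be written down in a few lines. This is a minimal sketch with hypothetical function names; images are assumed to be nested lists of intensities of equal shape.

```python
def speedup(serial_seconds, gpu_seconds):
    """Speedup of a GPU run against the serial baseline."""
    return serial_seconds / gpu_seconds

def max_abs_error(reference, result):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(a - b)
               for ref_row, res_row in zip(reference, result)
               for a, b in zip(ref_row, res_row))

# Example with the small 1 runtimes of the serial and PPT const tex runs.
print(round(speedup(6.77, 0.14), 1))
```

Because the metric is a maximum rather than an average, a single badly computed pixel is enough to dominate the reported error.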
The second column shows the runtime of the serial implementation.
Following that we have four sections. Each section corresponds to the result
of one GPU-based implementation. There are three columns in each
section: the runtime, the speedup against the serial implementation and the
maximum absolute error of the image result.
Table 11.2: Comparison between the Three Approaches. (Unit of time is second.)
            Serial   RPB                      PPT basic                PPT const tex            CnZ [67]
data        Time     Time   Sp.Up   Error     Time   Sp.Up   Error     Time   Sp.Up   Error     Time   Sp.Up   Error
small 1     6.77     3.31   2.05X   1.93E-6   0.28   24.2X   1.03E-6   0.14   48.4X   3.13E-7   1.69   4.01X   2.80E-5
small 2     6.79     3.41   1.99X   1.54E-6   0.28   24.3X   1.01E-6   0.13   52.2X   3.13E-7   1.67   4.07X   3.32E-5
small 3     6.60     3.20   2.06X   1.28E-6   0.27   24.4X   9.39E-7   0.14   47.1X   2.83E-7   1.57   4.20X   3.27E-5
small 4     7.4      3.69   2.01X   1.62E-6   0.32   23.1X   1.07E-6   0.16   46.3X   2.98E-7   1.96   3.78X   3.21E-5
medium 1    30.3     13.7   2.22X   3.49E-6   1.12   27.1X   1.15E-6   0.46   66.0X   3.73E-7   32.5   0.93X   3.10E-4
medium 2    33.6     16.6   2.03X   3.83E-6   1.31   25.7X   1.36E-6   0.56   60.0X   3.87E-7   40.1   0.84X   4.00E-4
medium 3    36.2     15.6   2.32X   3.33E-6   1.27   28.5X   1.24E-6   0.59   61.4X   3.87E-7   36.1   1.00X   3.14E-4
medium 4    34.9     15.8   2.22X   1.32E-5   1.26   27.7X   1.27E-6   0.60   58.2X   3.58E-7   38.6   0.91X   3.80E-4
large 1     251      108    2.33X   1.94E-6   9.12   27.5X   1.34E-6   3.92   63.9X   3.87E-7   453    0.55X   6.53E-4
large 2     242      115    2.10X   2.10E-6   9.28   26.1X   1.36E-6   4.06   59.6X   4.17E-7   507    0.48X   7.19E-4
large 3     266      116    2.30X   1.97E-6   9.87   26.9X   1.37E-6   4.46   59.6X   4.17E-7   498    0.53X   7.63E-4
large 4     265      114    2.31X   2.01E-6   9.48   27.9X   1.51E-6   4.39   60.3X   4.17E-7   491    0.54X   6.91E-4
From the table, we can see that the RPB approach can only achieve around
2X speedup. The key factor that limits its speedup, we believe, is the
frequent atomic operations on the global memory. The PPT approach, which
avoids this expensive atomic operation, achieves much better speedup. For
example, PPT basic achieves about 25X speedup over the serial
algorithm. Comparing the PPT approach with the RPB approach, PPT has
more global memory reads because each block has to read all the rectangles.
On the other hand, RPB needs to perform atomic adds on the global
memory. From this experiment, we can see that the benefit of avoiding atomic
operations outweighs the overhead of reading the rectangles multiple times.
When we use constant and texture memory for the PPT approach, we see an
extra 2X speedup. This means the caching mechanism of the constant and
texture memory helps us when we repeatedly visit the same weights and
lookup table entries. Note that an extra 2X speedup means that the time
saved on accessing the weights and lookup tables is much more than 50%
because the runtimes of other parts such as loading rectangles and writing
pixels are not affected by the use of constant/texture memory. With a 50X
to 60X speedup over the serial algorithm, the PPT const tex implementation
is the fastest implementation in our experiments.
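The "more than 50%" remark is an Amdahl-style argument, which the following sketch makes explicit. It assumes a simple model in which the runtime splits into a lookup/weight access share and an unaffected remainder; the function name and the model itself are illustrative assumptions, not measurements from the experiments.

```python
def access_time_saved(access_fraction, overall_speedup=2.0):
    """Fraction of lookup/weight access time removed, given its share of runtime.

    If accesses take a share f of the runtime and the whole kernel becomes
    `overall_speedup` times faster, the removed share of total runtime is
    1 - 1/overall_speedup, all of which must come out of f.
    """
    saved_total = 1.0 - 1.0 / overall_speedup
    if access_fraction < saved_total:
        raise ValueError("accesses are too small a share to explain the speedup")
    return saved_total / access_fraction

# If table and weight accesses were 80% of the runtime, a 2X overall
# speedup would require removing 62.5% of that access time.
print(access_time_saved(0.8))
```

In this model the access time itself is cut by exactly 50% only if accesses were the entire runtime; any unaffected work pushes the required saving above 50%.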
The CnZ approach proposed in [67] shows a better speedup than the RPB
approach on the small set. We believe this is because it avoids atomic
operations on the global memory by using shared memory for partial images
and using the CPU to add up the partial images. However, the speedup of the
CnZ approach is quite limited compared with the PPT approach. This is
because PPT does not use atomic operations either. Furthermore, in the
PPT approach, we do not need to use the CPU to add up the partial images
serially. In other words, the PPT approach is completely parallel while the
CnZ approach is a hybrid of parallel and serial computation. We can also see
from the table that the speedup of the CnZ approach degrades as the image
size increases: on the medium and large sets it is no faster than the serial
implementation (0.48X to 1.00X). We think this is because it needs to launch
many kernel calls to cover a bigger area. The overhead of launching so many
kernel calls may compromise the performance of their algorithm.
Notice that the CnZ approach was originally proposed for comparison with
their FPGA-based approach. In [67], they observe that their FPGA-based
approach achieves about an extra 2X speedup over their GPU-based imple-
mentation. We cannot directly compare our GPU implementation with their
FPGA implementation because the algorithms as well as settings are dif-
ferent. However, consider that their experiments are performed on GeForce
8800 GTS, which is several generations older than the GeForce GTX280 we
are using, and yet their FPGA implementation only managed to achieve 2X
speedup. It can be inferred that our GPU implementation is much faster
than their FPGA implementation.
The errors of our proposed approaches are within an acceptable range. Notice
that the lookup tables and weights are normalized such that the resultant
image has light intensity between 0 and 1. The intensity threshold
that defines the boundary of a feature is usually around 0.3. The
maximum absolute error of the RPB approach and the PPT approach is
around 10^-6, which is several orders of magnitude below the threshold. Such
an error can hardly make any difference in practice. The error of the CnZ
approach is on the order of 10^-4, which is much larger than those of the RPB
and PPT approaches.
Finally, we show the layout and the simulation result of small 1 in Fig. 11.5.
11.5 Conclusions
In this chapter, we presented and analyzed several GPU-based implementations
for the polygon-based aerial image simulation algorithm. We compared
the rectangle per block approach, the pixel per thread approach and the
approach proposed by Cong and Zou [67] through experiments. The pixel per
thread approach stands out as the most efficient implementation. With the
help of constant and texture memory, it can achieve 50X to 60X speedup
while the other approaches can hardly achieve more than 5X speedup.
In the future, we plan to further enhance the pixel per thread approach.
We believe that it can be improved in the following aspects:
• Since each block needs to load all the rectangles, every rectangle is
loaded multiple times. Moreover, not all rectangles will actually impact
the pixels in a block. If we can reduce the number of rectangles loaded
by each block while keeping the correctness of the computation, we can
further speed up the algorithm.
• Our image is stored in the global memory, which has larger access
latency. Maybe we can learn from the approach in [67] and use shared
memory to store the image. However, we need to avoid using the CPU to
combine the image.
Figure 11.5: The input layout and output aerial image of small 1.
Furthermore, we plan to implement an FFT-based aerial image simulation
algorithm on GPU [68, 69] and compare its performance with that of the
polygon-based approach.
PART VI
CONCLUSION
CHAPTER 12
CONCLUSIONS AND FUTURE WORK
In this dissertation, we have studied modern design-technology co-optimization
problems in next generation lithography. Topics that have been covered in
our study include: self-aligned double/quadruple patterning, 1-D style de-
sign optimization, EUV defect mitigation and aerial image simulation with
GPU.
First, we focused on the self-aligned double patterning decomposition
problem and made a breakthrough in the SADP design-technology flow for mask
generation and indecomposable layout inspection. In Chapter 2, we proposed
a SAT-based SADP decomposition for general 2D layout. It is the first
decomposition work ever published in the literature. In Chapter 3, we studied
the SADP decomposition problem with overlay minimization requirement
and provided an ILP-based algorithm to find the best decomposition result
to minimize overlay. In Chapter 4, we continued our study on the SADP
indecomposable layout inspection and proposed the first hot spot detection
algorithm to help the SADP-friendly design. As an extension from the SADP
process, in Chapter 5, we also did an early study on the SAQP process to
characterize the SAQP-friendly layout with a graph-based algorithm, which is also
the first work on SAQP-friendly layout characterization.
Second, we studied the 1-D style design optimization problem for better
printability, lower design impact and lower mask cost. In Chapter 6, we
studied the printing performance of the current state-of-the-art 1-D metal with
tip-to-tip gaps. Then, with a certain preference for rearranging the gap
distribution, we extended the line ends to greatly expand the process window
and enhance the printability. In Chapter 7, we further considered the
performance impact of the post-design line-end extension and proposed a
constrained gap redistribution method based on the preference in Chapter 6.
In Chapter 8, for a process that manufactures 1-D circuits by dense line print
and cut techniques, we proposed an effective optimal algorithm for cut mask
simplification.
Then we studied the EUV defect mitigation problem. In Chapter 9, we
solved the EUV defect mitigation by pattern shift and 90° rotation. We
proposed a linear time algorithm to greatly speed up the defect mitigation
process compared to the existing commercial tool. In Chapter 10, we
continued our study on defect mitigation to allow small rotation on the pattern
movement for a better success rate. The experiment shows a much better
success rate compared to those without small angle rotation.
Last but not least, in Chapter 11, we studied the utilization of GPU to
speed up aerial image simulation for higher speed and accuracy. We showed
that our algorithm naturally avoids atomic operations and achieves an
improvement of 2 to 3 orders of magnitude in numerical error control.
To conclude this dissertation, we would like to point out some future re-
search directions and open problems:
Hierarchical SADP decomposition. As needed for mask generation, full
chip SADP decomposition is in great demand. Although our proposed
algorithms in this dissertation and in all other published works have the
capability to handle SADP decomposition to a certain scale, it is still not efficient
enough for a full chip mask generation. Since SADP decomposition does not
support stitch, it is hard to capture the hierarchical structure of the layout,
and reuse of the pre-decomposed small components such as standard cells
or IP could be very restricted (e.g. limited pin locations, limited wires and
adjacency relations). So, how to decompose the full chip layout efficiently in
a hierarchical way is a key question that needs to be answered very soon.
1-D style design with restricted rules has demonstrated its capability to
help extend the life of Moore's law. However, when the pitch becomes too
small, the random cuts become the most difficult patterns to print. With
those randomly distributed 2D features, multiple patterning or E-beam shot
might be potential solutions. However, either way has its own pros and cons.
While multiple patterning requires a few more cut masks with significantly
higher cost, E-beam shot may have low throughput. So, how to balance
the cuts while keeping the manufacturing cost, throughput and yield in an
acceptable range is still an important open question.
In this dissertation, we have shown the great opportunity for GPU to
play a role in future lithography simulation and verification processes.
However, using GPU for aerial image simulation is just the first step; some
more computation-intensive processes (e.g. optical proximity correction (OPC)
and source mask optimization (SMO)) are still calling for help. What is the
best GPU-based algorithm for the full version of OPC or SMO is still an open
question, since the validation of a GPU-based algorithm lies not only in its
speed but also in its robustness and numerical error controllability.
REFERENCES
[1] H. Zhang, Y. Du, M. D. F. Wong, et al., "Effective decomposition
algorithm for self-aligned double patterning lithography," in Proc. SPIE,
vol. 7973, 79730J, 2011.
[2] H. Zhang, Y. Du, M. D. F. Wong, and R. Topaloglu, "Self-aligned double
patterning decomposition for overlay minimization and hot spot
detection," in Proc. Design Automation Conf. (DAC'11), 2011, pp. 71-76.
[3] H. Zhang, Y. Du, M. D. F. Wong, and R. O. Topaloglu, "Hot spot
detection for indecomposable self-aligned double patterning layout," in
Proc. SPIE, vol. 8166, 81663E, 2011.
[4] H. Zhang, Y. Du, M. D. F. Wong, and R. O. Topaloglu, "Characterization
and decomposition of self-aligned quadruple patterning friendly
layout," in Proc. SPIE, vol. 8326, 83260F, 2012.
[5] H. Zhang, M. D. F. Wong, K.-Y. Chao, et al., "Uniformity-aware
standard cell design with accurate shape control," in Proc. SPIE, vol. 7275,
72751G, 2009.
[6] H. Zhang, M. D. F. Wong, and K.-Y. Chao, "On process-aware 1-D
standard cell design," in Proc. Asia and South Pacific Design Automation
Conf. (ASP-DAC'10), 2010, pp. 838-842.
[7] H. Zhang, Y. Du, M. Wong, and K.-Y. Chao, "Lithography-aware layout
modification considering performance impact," in Proc. Int. Sym. on
Quality Electronic Design (ISQED'11), Mar. 2011, pp. 1-5.
[8] H. Zhang, Y. Du, M. Wong, and K.-Y. Chao, "Mask cost reduction with
circuit performance consideration for self-aligned double patterning," in
Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC'11),
Jan. 2011, pp. 787-792.
[9] H. Zhang, Y. Du, M. Wong, and R. Topaloglu, "Efficient pattern
relocation for EUV blank defect mitigation," in Proc. Asia and South Pacific
Design Automation Conf. (ASP-DAC'12), Feb. 2012, pp. 719-724.
[10] H. Zhang, Y. Du, M. Wong, et al., "Layout small-angle rotation and
shift for EUV defect mitigation," in Proc. Int. Conf. on Computer-Aided
Design (ICCAD'12), Nov. 2012.
[11] H. Zhang, T. Yan, M. Wong, and S. Patel, "Accelerating aerial image
simulation with GPU," in Proc. Int. Conf. on Computer-Aided Design
(ICCAD'11), Nov. 2011, pp. 178-184.
[12] Y. Wei and R. L. Brainard, Advanced Processes for 193-nm Immersion
Lithography. SPIE Publications, Feb. 2009.
[13] A. Kahng, C.-H. Park, X. Xu, and H. Yao, "Layout decomposition for
double patterning lithography," in Proc. Int. Conf. on Computer-Aided
Design (ICCAD'08), Nov. 2008, pp. 465-472.
[14] K. Yuan, J.-S. Yang, and D. Pan, "Double patterning layout
decomposition for simultaneous conflict and stitch minimization," IEEE Trans.
Comput.-Aided Design Integr. Circuits Syst., vol. 29, no. 2, pp. 185-196,
Feb. 2010.
[15] Y. Xu and C. Chu, "GREMA: Graph reduction based efficient mask
assignment for double patterning technology," in Proc. Int. Conf. on
Computer-Aided Design (ICCAD'09), Nov. 2009, pp. 601-606.
[16] C.-H. Hsu, Y.-W. Chang, and S. Nassif, "Simultaneous layout migration
and decomposition for double patterning technology," in Proc. Int. Conf.
on Computer-Aided Design (ICCAD'09), Nov. 2009, pp. 595-600.
[17] C. Bencher, "SADP: the best option," in Nanochip Technology Journal,
2007.
[18] K. Monahan, "Enabling double patterning at the 32nm node," in Proc.
Int. Sym. on Semiconductor Manufacturing (ISSM'06), Sep. 2006, pp.
126-129.
[19] M. C. Smayling, C. Bencher, H. D. Chen, et al., "APF pitch-halving for
22nm logic cells using gridded design rules," in Proc. SPIE, vol. 6925,
69251E, 2008.
[20] H. Dai, J. Sweis, C. Bencher, et al., "Implementing self-aligned double
patterning on non-gridded design layouts," in Proc. SPIE, vol. 7275,
72751E, 2009.
[21] S. Sun, C. Bencher, Y. Chen, et al., "Demonstration of 32nm half-pitch
electrical testable NAND flash patterns using self-aligned double
patterning," in Proc. SPIE, vol. 7274, 72740D, 2009.
[22] Y. Ma, J. Sweis, C. Bencher, H. Dai, Y. Chen, J. P. Cain, Y. Deng,
J. Kye, and H. J. Levinson, "Decomposition strategies for self-aligned
double patterning," in Proc. SPIE, vol. 7641, 76410T, 2010.
[23] Y.-S. Chang, J. Sweis, J.-C. Lai, et al., "Full area pattern decomposition
of self-aligned double patterning for 30nm node NAND flash process," in
Proc. SPIE, vol. 7637, 76371N, 2010.
[24] "Nangate open cell library." [Online]. Available:
http://www.si2.org/openeda.si2.org/projects/nangatelib/
[25] "Gurobi optimizer 4.0." [Online]. Available: http://www.gurobi.com/
[26] P. Xu, Y. Chen, Y. Chen, et al., "Sidewall spacer quadruple patterning
for 15nm half-pitch," in Proc. SPIE, vol. 7973, 79731Q, 2011.
[27] V. Kheterpal, V. Rovner, T. G. Hersan, et al., "Design methodology
for IC manufacturability based on regular logic-bricks," in Proc. Design
Automation Conf. (DAC'05), 2005, pp. 353-358.
[28] L. Liebmann, L. Pileggi, J. Hibbeler, et al., "Simplify to survive:
prescriptive layouts ensure profitable scaling to 32nm and beyond," in Proc.
SPIE, vol. 7275, 72750A, 2009.
[29] M. C. Smayling, H. yu Liu, and L. Cai, "Low k1 logic design using
gridded design rules," in Proc. SPIE, vol. 6925, 69250B, 2008.
[30] R. T. Greenway, R. Hendel, K. Jeong, et al., "Interference assisted
lithography for patterning of 1D gridded design," in Proc. SPIE, vol.
7271, 72712U, 2009.
[31] T. El-Moselhy, I. Elfadel, and L. Daniel, "A capacitance solver for
incremental variation-aware extraction," in Proc. Int. Conf. on
Computer-Aided Design (ICCAD'08), Nov. 2008, pp. 662-669.
[32] B. Taylor and L. Pileggi, "Exact combinatorial optimization methods
for physical design of regular logic bricks," in Proc. Design Automation
Conf. (DAC'07), 2007, pp. 344-349.
[33] S. Chang, J. Blatchford, S. Prins, et al., "Exploration of complex metal
2D design rules using inverse lithography," in Proc. SPIE, vol. 7275,
72750D, 2009.
[34] R. Goering, "Guest blog: Making restricted design rules
work," Jan. 2010. [Online]. Available:
http://www.cadence.com/Community/blogs/ii/archive/2010/01/27/guest-blog-making-restricted-design-rules-work.aspx
[35] M. Somervell, R. Gronheid, J. Hooge, et al., "Comparison of directed
self-assembly integrations," in Proc. SPIE, vol. 8325, 83250G, 2012.
[36] P. Gupta, A. B. Kahng, D. Sylvester, and J. Yang, "A cost-driven
lithographic correction methodology based on off-the-shelf sizing tools," in
Proc. Design Automation Conf. (DAC'03), 2003, pp. 16-21.
[37] P. Gupta, A. B. Kahng, D. Sylvester, and J. Yang, "Performance driven
OPC for mask cost reduction," in Proc. Int. Sym. on Physical Design
(ISPD'05), 2005, pp. 270-275.
[38] Y. Zhang, R. Gray, S. Chou, et al., "Mask cost analysis via write time
estimation," in Proc. SPIE, vol. 5756, no. 1, 2005, pp. 313-318.
[39] R. S. Mackay, H. Kamberian, and Y. Zhang, "Methods to reduce
lithography costs with reticle engineering," Microelectronic Engineering,
vol. 83, no. 49, pp. 914-918, 2006.
[40] T.-C. Wang and D. F. Wong, "A graph theoretic technique to speed
up floorplan area optimization," in Proc. Design Automation Conf.
(DAC'92), 1992, pp. 62-68.
[41] M. C. Chiu, B. S.-M. Lin, M. F. Tsai, et al., "Challenges of 29nm
half-pitch NAND flash STI patterning with 193nm dry lithography and
self-aligned double patterning," in Proc. SPIE, vol. 7140, 714021, 2008.
[42] "International technology roadmap for semiconductors lithography."
[Online]. Available: http://www.itrs.net/
[43] F. Melzer and W. Singer, "Lighting system, particularly for use in
extreme ultraviolet (EUV) lithography," U.S. Patent 7196841, 2007.
[44] H. J. Levinson, "Extreme ultraviolet lithography's path to
manufacturing," Journal of Micro/Nanolithography, MEMS and MOEMS, vol. 8,
no. 4, 2009.
[45] V. Bakshi, Ed., EUV Lithography. SPIE Publications, Feb. 2008.
[46] G. M. Kloster, T. Liang, T. R. Younkin, et al., "Printability of extreme
ultraviolet lithography mask pattern defects for 22-40 nm half-pitch
features," in Proc. SPIE, vol. 7636, 76360M, 2010.
[47] Y. Deng, B. L. Fontaine, and A. R. Neureuther, "Performance of
repaired defects and attPSM in EUV multilayer masks," in Proc. SPIE,
vol. 4889, no. 1, 2002, pp. 418-425.
[48] I.-Y. Kang, H.-S. Seo, B.-S. Ahn, et al., "Printability and inspectability
of programmed pit defects on the masks in EUV lithography," in Proc.
SPIE, vol. 7636, 76361B, 2010.
[49] S. Huh, L. Ren, D. Chan, et al., \A study of defects on EUV masks using
blank inspection, patterned mask inspection, and wafer inspection," in
Proc. SPIE, vol. 7636, 76360K, 2010.
[50] Y. D. Chan, A. Rastegar, H. Yun, et al., \EUV mask defect inspec-
tion and defect review strategies for EUV pilot line and high volume
manufacturing," in Proc. SPIE, vol. 7636, 76361D, 2010.
[51] C. H. Clifford, T. T. Chan, A. R. Neureuther, et al., "Compensation
methods using a new model for buried defects in extreme ultraviolet
lithography masks," in Proc. SPIE, vol. 7823, 78230V, 2010.
[52] C. H. Clifford, T. T. Chan, and A. R. Neureuther, "Compensation
methods for buried defects in extreme ultraviolet lithography masks," in Proc.
SPIE, vol. 7636, 763623, 2010.
[53] L. L. Pang, C. Clifford, P. Hu, et al., "Compensation for EUV multilayer
defects within arbitrary layouts by absorber pattern modification," in
Proc. SPIE, vol. 7969, 79691E, 2011.
[54] T. Terasawa, T. Yamane, T. Tanaka, et al., "Actinic phase defect
detection and printability analysis for patterned EUVL mask," in Proc. SPIE,
vol. 7636, 763602, 2010.
[55] P. Y. Yan, "ML defect integrated solution demonstration," in MASK
TWG Meetings. IEUVI, 2010.
[56] J. Burns and M. Abbas, "EUV mask defect mitigation through pattern
placement," in Proc. SPIE, vol. 7823, 782340, 2010.
[57] C. H. Clifford and A. R. Neureuther, "Smoothing based fast model for
images of isolated buried EUV multilayer defects," in Proc. SPIE, vol.
6921, 692119, 2008.
[58] C. H. Clifford and A. R. Neureuther, "Smoothing based model for images
of buried EUV multilayer defects near absorber features," in Proc. SPIE,
vol. 7122, 71221X, 2008.
[59] C. H. Clifford, "Simulation and compensation methods for EUV
lithography masks with buried defects," Ph.D. dissertation, University of
California at Berkeley, 2010.
[60] H. W. Six and D. Wood, "The rectangle intersection problem revisited,"
BIT Numerical Mathematics, vol. 20, pp. 426–433, 1980.
[61] R. H. Güting and W. Schilling, "A practical divide-and-conquer algorithm
for the rectangle intersection problem," Information Sciences, vol. 42,
no. 2, pp. 95–112, 1987.
[62] Y. Du, H. Zhang, M. D. F. Wong, and R. O. Topaloglu, "EUV mask
preparation considering blank defects mitigation," in Proc. SPIE, vol.
8166, 816611, 2011.
[63] Y. Du, H. Zhang, M. D. F. Wong, et al., "Efficient multi-die placement
for blank defect mitigation in EUV lithography," in Proc. SPIE, vol.
8322, 832231, 2012.
[64] E. Spiller, S. L. Baker, P. B. Mirkarimi, et al., "High-performance
Mo-Si multilayer coatings for extreme-ultraviolet lithography by ion-beam
deposition," Appl. Opt., vol. 42, no. 19, pp. 4049–4058, Jul. 2003.
[65] N. Cobb, "Fast optical and process proximity correction algorithms for
integrated circuit manufacturing," Ph.D. dissertation, University of
California at Berkeley, 1998.
[66] I. Uzun, A. Amira, and A. Bouridane, "FPGA implementations of fast
Fourier transforms for real-time signal and image processing," Proc. IEE
Vision, Image and Signal Processing, vol. 152, no. 3, pp. 283–296, June
2005.
[67] J. Cong and Y. Zou, "FPGA-based hardware acceleration of lithographic
aerial image simulation," ACM Trans. Reconfigurable Technol. Syst.,
vol. 2, no. 3, pp. 17:1–17:29, Sep. 2009.
[68] I. Torunoglu, A. Karakas, E. Elsen, et al., "OPC on a single desktop: a
GPU-based OPC and verification tool for fabs and designers," in Proc.
SPIE, vol. 7641, 764114, 2010.
[69] Y.-T. Wang, C.-M. Tsai, and F.-C. Chang, "Lithographic simulations
using graphical processing units," U.S. Patent 20 060 242 618, 2006.
[70] A. K.-K. Wong, Optical Imaging in Projection Microlithography. SPIE
Press, 2005.
[71] D. B. Kirk and W.-M. Hwu, Programming Massively Parallel Processors:
A Hands-on Approach. Morgan Kaufmann, 2010.
